Conference item
Helping hands: an object-aware ego-centric video recognition model
- Abstract:
- We introduce an object-aware decoder for improving the performance of spatio-temporal representations on egocentric videos. The key idea is to enhance object-awareness during training by tasking the model to predict hand positions, object positions, and the semantic label of the objects using paired captions when available. At inference time the model only requires RGB frames as inputs, and is able to track and ground objects (although it has not been trained explicitly for this). We demonstrate the performance of the object-aware representations learnt by our model, by: (i) evaluating it for strong transfer, i.e. through zero-shot testing, on a number of downstream video-text retrieval and classification benchmarks; and (ii) by using the representations learned as input for long-term video understanding tasks (e.g. Episodic Memory in Ego4D). In all cases the performance improves over the state of the art, even compared to networks trained with far larger batch sizes. We also show that by using noisy image-level detection as pseudo-labels in training, the model learns to provide better bounding boxes using video consistency, as well as grounding the words in the associated text descriptions. Overall, we show that the model can act as a drop-in replacement for an ego-centric video model to improve performance through visual-text grounding.
- Publication status:
- Published
- Peer review status:
- Peer reviewed
Access Document
- Publisher copy:
- 10.1109/ICCV51070.2023.01278
Authors
- Publisher:
- IEEE
- Host title:
- Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023
- Pages:
- 13901-13912
- Place of publication:
- Los Alamitos, California
- Publication date:
- 2024-01-15
- Acceptance date:
- 2023-07-14
- Event title:
- International Conference on Computer Vision, 2023
- Event location:
- Paris, France
- Event website:
- https://iccv2023.thecvf.com/
- Event start date:
- 2023-10-02
- Event end date:
- 2023-10-06
- DOI:
- 10.1109/ICCV51070.2023.01278
- EISSN:
- 2380-7504
- ISSN:
- 1550-5499
- EISBN:
- 9798350307184
- ISBN:
- 9798350307191
- Language:
- English
- Keywords:
- Pubs id:
- 1544405
- Local pid:
- pubs:1544405
- Deposit date:
- 2023-10-11
Terms of use
- Copyright holder:
- IEEE
- Copyright date:
- 2023
- Rights statement:
- © 2023 by The Institute of Electrical and Electronics Engineers, Inc. All rights reserved.
- Notes:
- This paper was presented at the IEEE/CVF International Conference on Computer Vision (ICCV), 2-6 October 2023, Paris, France. This is the accepted manuscript version of the article. The final version is available at: 10.1109/ICCV51070.2023.01278