Conference item

End-to-end learning of visual representations from uncurated instructional videos

Abstract:
Annotating videos is cumbersome, expensive and not scalable. Yet, many strong video models still rely on manually annotated data. With the recent introduction of the HowTo100M dataset, narrated videos now offer the possibility of learning video representations without manual supervision. In this work we propose a new learning approach, MIL-NCE, capable of addressing misalignments inherent in narrated videos. With this approach we are able to learn strong video representations from scratch, without the need for any manual annotation. We evaluate our representations on a wide range of four downstream tasks over eight datasets: action recognition (HMDB-51, UCF-101, Kinetics-700), text-to-video retrieval (YouCook2, MSR-VTT), action localization (YouTube-8M Segments, CrossTask) and action segmentation (COIN). Our method outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.
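For readers unfamiliar with the objective, below is a minimal sketch of a MIL-NCE-style loss in PyTorch. It is illustrative only, not the authors' released implementation: the function name mil_nce_loss, the use of in-batch negatives, and the embedding shapes are assumptions; the core idea it shows is the one in the abstract, namely pooling a bag of candidate narrations inside a contrastive softmax so that a clip need not be aligned with a single caption.

    # Illustrative MIL-NCE-style sketch (assumed shapes and naming),
    # not the paper's released code.
    import torch

    def mil_nce_loss(video_emb, text_emb):
        """video_emb: (B, D) clip embeddings.
        text_emb:  (B, K, D) bag of K candidate narrations per clip,
                   e.g. the captions closest in time.
        Narrations from the other clips in the batch act as negatives."""
        B, K, D = text_emb.shape
        # Dot-product similarity of every clip with every narration: (B, B, K)
        sims = (video_emb @ text_emb.reshape(B * K, D).t()).reshape(B, B, K)
        # MIL step: pool the whole bag of K candidate positives inside the
        # softmax instead of committing to one aligned caption.
        pos = torch.logsumexp(sims[torch.arange(B), torch.arange(B)], dim=-1)
        # NCE denominator: all clip-narration pairs for each clip.
        denom = torch.logsumexp(sims.reshape(B, B * K), dim=-1)
        return (denom - pos).mean()

With, say, B=8 clips, K=4 candidate captions and D=512-dimensional embeddings, mil_nce_loss(torch.randn(8, 512, requires_grad=True), torch.randn(8, 4, 512)) returns a scalar that backpropagates into both encoders.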
Publication status:
Published
Peer review status:
Peer reviewed

Publisher copy:
10.1109/CVPR42600.2020.00990

Authors:
Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, Andrew Zisserman

Publisher:
IEEE
Host title:
2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Pages:
9876-9886
Publication date:
2020-08-05
Acceptance date:
2020-02-23
Event title:
2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Event location:
Online
Event website:
https://cvpr2020.thecvf.com/
Event start date:
2020-06-14
Event end date:
2020-06-19
DOI:
10.1109/CVPR42600.2020.00990
EISSN:
2575-7075
ISSN:
1063-6919
EISBN:
9781728171685
ISBN:
9781728171692


Language:
English
Pubs id:
1770544
Local pid:
pubs:1770544
Deposit date:
2024-06-14
