Thesis

Self-supervised video representation learning

Abstract:

Videos are an appealing source of data for training computer vision models. An almost unlimited supply of video exists online, but exhaustive manual annotation is infeasible. The goal of this thesis is to learn strong video representations efficiently via self-supervised learning: an approach that learns from the data itself rather than from human annotations.

The thesis is structured around three themes: (1) self-supervised learning for short-term videos, (2) efficient video representation learning, and (3) self-supervised learning for long-term videos.

For short-term videos lasting only a few seconds, we show that predicting future video content is a strong learning signal at large scale. We further show that strong video representations can be learned by having two complementary modalities, RGB and optical flow, teach each other.

For efficient video representation learning, we show that large-scale pre-trained vision-language models can be adapted effectively via prompt tuning. We also show that dropping image patches can accelerate both fine-tuning on classification tasks and pre-training of video-language models.

For long-term videos lasting longer than a few minutes, we show that temporal alignment networks can be trained from the weak visual-textual correspondence within instructional videos. The resulting networks can automatically clean up such natural videos for effective vision-language training. In addition, we show that movie description models can be trained by leveraging pre-trained vision-language models.

Authors


Division:
MPLS
Department:
Engineering Science
Sub department:
Engineering Science
Research group:
Visual Geometry Group
Oxford college:
Lady Margaret Hall
Role:
Author
ORCID:
0000-0002-1874-9664

Contributors

Institution:
University of Oxford
Division:
MPLS
Department:
Engineering Science
Sub department:
Engineering Science
Research group:
Visual Geometry Group
Oxford college:
Brasenose College
Role:
Supervisor
ORCID:
0000-0002-8945-8573
Institution:
Columbia University
Role:
Examiner
Institution:
University of Oxford
Division:
MPLS
Department:
Engineering Science
Sub department:
Engineering Science
Research group:
Visual Geometry Group
Oxford college:
New College
Role:
Examiner
ORCID:
0000-0003-1374-2858


Funder identifier:
http://dx.doi.org/10.13039/501100000288
Funding agency for:
Zisserman, A
Grant:
RP\R1\191132
Programme:
Royal Society Research Professorship
Funder identifier:
http://dx.doi.org/10.13039/501100000266
Funding agency for:
Zisserman, A
Grant:
EP/M013774/1
EP/T028572/1
Programme:
Seebibyte; VisualAI
Funder identifier:
http://dx.doi.org/10.13039/100017149
Funding agency for:
Han, T
Programme:
Google DeepMind Studentship


Type of award:
DPhil
Level of award:
Doctoral
Awarding institution:
University of Oxford


Language:
English
Pubs id:
2042890
Local pid:
pubs:2042890
Deposit date:
2023-01-15
