Thesis

Learning and interpreting deep representations from multi-modal data

Abstract:

Deep learning has driven ground-breaking progress in a variety of domains, from core machine learning tasks such as image, language, and video understanding to real-world industries such as medicine, autonomous driving, and agriculture. Its success has been powered by training neural networks with manual supervision from large-scale labelled datasets such as ImageNet, enabling them to automatically learn hierarchical data representations. However, obtaining large-scale labelled data is often a time-consuming and expensive process. To address this challenge, we push the limits of self-supervision from multi-modal video data. Video data usually contain multiple freely available modalities, such as images, audio, transcribed speech, and textual captions. These modalities often share redundant semantic information and can therefore serve as pseudo-labels to supervise one another for representation learning, without requiring manual human labels. Freed from the reliance on labelled data, we are able to train these deep representations on very large-scale video data comprising millions of clips collected from the Internet. We demonstrate the scalability benefits of multi-modal self-supervision by establishing a new state of the art in a variety of domains: video action recognition, text-to-video retrieval, text-to-image retrieval, and audio classification. We also introduce further technical innovations in data transformations, model architecture, and loss functions to improve the learning of these deep video representations with multi-modal self-supervision. A secondary contribution of this thesis is a set of new tools that improve the interpretability of deep representations, given that the key features encoded in these representations are notoriously difficult to decipher. For images, we show how perturbation analysis can be used to analyse the intermediate representations of a network. For videos, we propose a novel clustering method based on the Sinkhorn-Knopp algorithm that maps deep video representations to human-interpretable semantic pseudo-labels. The contributions in this thesis are steps towards unlocking both the scalability and the interpretability of deep video representation learning.
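To make the cross-modal supervision idea concrete, the following is a minimal sketch of a symmetric contrastive objective in which two modalities of the same clip act as positives for each other. This is a generic illustration, not the thesis's exact loss: the function name cross_modal_nce, the temperature value, and the choice of PyTorch are our own assumptions.

import torch
import torch.nn.functional as F

def cross_modal_nce(video_emb, text_emb, temperature=0.07):
    # video_emb, text_emb: (batch, dim) L2-normalised embeddings of two
    # modalities of the same batch of clips. Row i of each tensor comes
    # from the same clip, so the diagonal pairs are the positives.
    logits = video_emb @ text_emb.t() / temperature  # pairwise similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Each modality serves as the pseudo-label for the other (symmetric loss).
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2t + loss_t2v)

Similarly, the Sinkhorn-Knopp clustering mentioned in the abstract can be sketched as alternating row and column normalisation that converts model scores into balanced soft pseudo-label assignments, preventing all samples from collapsing into one cluster. Again, this NumPy version is a sketch under our own assumptions (function name, iteration count), not the thesis's implementation.

import numpy as np

def sinkhorn_knopp(scores, n_iters=50, eps=1e-8):
    # scores: (n_samples, n_clusters) non-negative affinities between video
    # embeddings and cluster prototypes (e.g. exponentiated similarities).
    Q = np.asarray(scores, dtype=np.float64)
    Q = Q / Q.sum()
    n, k = Q.shape
    for _ in range(n_iters):
        # Balance the clusters: each column receives total mass 1/k.
        Q /= (Q.sum(axis=0, keepdims=True) + eps)
        Q /= k
        # Balance the samples: each row receives total mass 1/n.
        Q /= (Q.sum(axis=1, keepdims=True) + eps)
        Q /= n
    return Q * n  # each row is now a soft pseudo-label distribution summing to 1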

Authors


Institution:
University of Oxford
Division:
MPLS
Department:
Engineering Science
Sub department:
Engineering Science
Research group:
Visual Geometry Group
Oxford college:
University College
Role:
Author

Contributors

Institution:
University of Oxford
Division:
MPLS
Department:
Engineering Science
Sub department:
Engineering Science
Research group:
Visual Geometry Group
Role:
Supervisor
ORCID:
0000-0003-1374-2858

Division:
MPLS
Department:
Engineering Science
Sub department:
Engineering Science
Research group:
Visual Geometry Group
Role:
Supervisor

Institution:
University of Oxford
Division:
MPLS
Department:
Engineering Science
Sub department:
Engineering Science
Research group:
Visual Geometry Group
Role:
Examiner
ORCID:
0000-0002-8945-8573

Institution:
University of Michigan
Role:
Examiner


Funding

Funding agency for:
Patrick, M
Grant:
EP/L015897/1
Programme:
Centre for Doctoral Training in Autonomous Intelligent Machines and Systems (AIMS)

Funder identifier:
http://dx.doi.org/10.13039/501100000697
Funding agency for:
Patrick, M
Programme:
Rhodes Scholarship


DOI:
Type of award:
DPhil
Level of award:
Doctoral
Awarding institution:
University of Oxford
