Thesis

Learning and interpreting deep representations from multi-modal data

Abstract:

Deep learning has driven ground-breaking progress in a variety of domains, from core machine learning tasks such as image, language, and video understanding to real-world industries such as medicine, autonomous driving, and agriculture. Its success has been powered by training neural networks with manual supervision from large-scale labelled datasets such as ImageNet, enabling them to automatically learn hierarchical data representations. However, obtaining large-scale labelled data is often a time-consuming and expensive process. To address this challenge, we push the limits of self-supervision from multi-modal video data. Video data usually contain multiple freely available modalities, such as images, audio, transcribed speech, and textual captions. These modalities often share redundant semantic information and can therefore serve as pseudo-labels to supervise one another for representation learning, without requiring manual human labels. Freed from the reliance on labelled data, we are able to train these deep representations on very large-scale video data comprising millions of clips collected from the Internet. We demonstrate the scalability benefits of multi-modal self-supervision by establishing a new state of the art in a variety of domains: video action recognition, text-to-video retrieval, text-to-image retrieval, and audio classification. We also introduce further technical innovations in data transformations, model architecture, and loss functions to improve the learning of these deep video representations with multi-modal self-supervision. A secondary contribution of this thesis is a set of new tools that improve the interpretability of deep representations, given that the key features encoded in these representations are notoriously difficult to decipher. For images, we show how perturbation analysis can be used to analyse the intermediate representations of a network. For videos, we propose a novel clustering method based on the Sinkhorn-Knopp algorithm that maps deep video representations to human-interpretable semantic pseudo-labels. The contributions in this thesis are steps towards unlocking both the scalability and the interpretability of deep video representation learning.
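To make the cross-modal supervision idea concrete, the following is a minimal sketch of a symmetric contrastive objective in which two modalities of the same clip act as positives for each other. This is a generic illustration, not the thesis's exact loss: the function name cross_modal_nce, the temperature value, and the choice of PyTorch are our own assumptions.

import torch
import torch.nn.functional as F

def cross_modal_nce(video_emb, text_emb, temperature=0.07):
    # video_emb, text_emb: (batch, dim) L2-normalised embeddings of two
    # modalities of the same batch of clips. Row i of each tensor comes
    # from the same clip, so the diagonal pairs are the positives.
    logits = video_emb @ text_emb.t() / temperature  # pairwise similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Each modality serves as the pseudo-label for the other (symmetric loss).
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2t + loss_t2v)

Similarly, the Sinkhorn-Knopp clustering mentioned in the abstract can be sketched as alternating row and column normalisation that converts model scores into balanced soft pseudo-label assignments, preventing all samples from collapsing into one cluster. Again, this NumPy version is a sketch under our own assumptions (function name, iteration count), not the thesis's implementation.

import numpy as np

def sinkhorn_knopp(scores, n_iters=50, eps=1e-8):
    # scores: (n_samples, n_clusters) non-negative affinities between video
    # embeddings and cluster prototypes (e.g. exponentiated similarities).
    Q = np.asarray(scores, dtype=np.float64)
    Q = Q / Q.sum()
    n, k = Q.shape
    for _ in range(n_iters):
        # Balance the clusters: each column receives total mass 1/k.
        Q /= (Q.sum(axis=0, keepdims=True) + eps)
        Q /= k
        # Balance the samples: each row receives total mass 1/n.
        Q /= (Q.sum(axis=1, keepdims=True) + eps)
        Q /= n
    return Q * n  # each row is now a soft pseudo-label distribution summing to 1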

Authors


Institution:
University of Oxford
Division:
MPLS
Department:
Engineering Science
Sub department:
Engineering Science
Research group:
Visual Geometry Group
Oxford college:
University College
Role:
Author

Contributors

Institution:
University of Oxford
Division:
MPLS
Department:
Engineering Science
Sub department:
Engineering Science
Research group:
Visual Geometry Group
Role:
Supervisor
ORCID:
0000-0003-1374-2858

Division:
MPLS
Department:
Engineering Science
Sub department:
Engineering Science
Research group:
Visual Geometry Group
Role:
Supervisor

Institution:
University of Oxford
Division:
MPLS
Department:
Engineering Science
Sub department:
Engineering Science
Research group:
Visual Geometry Group
Role:
Examiner
ORCID:
0000-0002-8945-8573

Institution:
University of Michigan
Role:
Examiner


Funding

Funding agency for:
Patrick, M
Grant:
EP/L015897/1
Programme:
Centre for Doctoral Training in Autonomous Intelligent Machines and Systems (AIMS)

Funder identifier:
http://dx.doi.org/10.13039/501100000697
Funding agency for:
Patrick, M
Programme:
Rhodes Scholarship


DOI:
Type of award:
DPhil
Level of award:
Doctoral
Awarding institution:
University of Oxford
