Thesis

Learning with multimodal self-supervision

Abstract:

Deep learning has fueled an explosion of applications, yet training deep neural networks usually requires expensive human annotations. In this thesis we explore alternatives that avoid this heavy reliance on manually annotated examples when training deep neural networks. Specifically, we either adapt self-supervised methods to automatically correct freely obtained data labels, or abandon human labels entirely and instead exploit the natural co-occurrence of audio and visual information to learn object representations in videos.


Growing collections of digital data often come with noisy labels that can be exploited to supervise the learning process. Conventional practice is to correct or clean these labels before training recognition models, but this can require infeasible amounts of manual effort. We instead correct the annotation noise automatically, eschewing the need for costly manual annotation. We build on and extend recent breakthroughs with a consistency loss that enables training even without ground truth, and a spatial memory map that provides flexible instance-level registration, leading to greater generalization.
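The consistency loss is only named at a high level here; as a rough illustration, a generic prediction-consistency objective for unlabeled data can be sketched as below. This is a minimal PyTorch sketch under assumptions of my own (the function name, the weak/strong two-view augmentation scheme, and the KL-based penalty), not the exact formulation used in the thesis.

import torch
import torch.nn.functional as F

def consistency_loss(model, x_weak, x_strong):
    # Predictions on the weakly augmented view serve as a soft target;
    # no ground-truth label is needed anywhere in this objective.
    with torch.no_grad():
        target = F.softmax(model(x_weak), dim=-1)
    # Penalise disagreement of the strongly augmented view with that target.
    log_pred = F.log_softmax(model(x_strong), dim=-1)
    return F.kl_div(log_pred, target, reduction="batchmean")

In practice such a term is added to whatever supervised loss is available for the (noisily) labeled examples, so the network is regularised even where the labels cannot be trusted.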


We further explore multimodal sensory streams as a source of self-supervision by exploiting modality redundancy, i.e. the information shared between modalities. Representations are learned by harnessing the different modalities without any human-annotated labels. We demonstrate this with three applications. First, we automatically curate a large-scale audio dataset, VGG-Sound, containing more than 200k videos collected using visual guidance; training on it yields state-of-the-art models for audio recognition. Second, we improve and extend recent sound source localization techniques by introducing a mechanism that automatically mines hard samples and adds them to a contrastive learning formulation. Finally, unlike existing audio-visual synchronization work that targets a single specific domain, we address the synchronization problem in open-world settings by exploring several transformer-based architectures. With these models we achieve state-of-the-art results on challenging speech datasets and show excellent generalization on a general sound dataset.
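To make the hard-sample mining idea concrete, the sketch below shows a generic audio-visual InfoNCE loss with in-batch hard negative mining in PyTorch. The function name, embedding shapes, temperature and the number of mined negatives are assumptions for illustration; the thesis's actual localization formulation is not reproduced here.

import torch
import torch.nn.functional as F

def av_hard_negative_loss(audio_emb, visual_emb, temperature=0.07, k_hard=5):
    # audio_emb, visual_emb: (B, D) embeddings of the audio and visual
    # streams of the same B clips; matching indices are the positive pairs.
    audio_emb = F.normalize(audio_emb, dim=-1)
    visual_emb = F.normalize(visual_emb, dim=-1)
    sim = audio_emb @ visual_emb.t() / temperature   # (B, B) similarity matrix
    pos = sim.diag().unsqueeze(1)                    # matched audio-visual pairs
    # Mask the positives, then mine the k most similar mismatched clips
    # as hard negatives for every audio query (requires B > k_hard).
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    hard_neg, _ = sim.masked_fill(eye, float("-inf")).topk(k_hard, dim=-1)
    # InfoNCE: the positive must score higher than its mined hard negatives.
    logits = torch.cat([pos, hard_neg], dim=1)
    labels = torch.zeros(sim.size(0), dtype=torch.long, device=sim.device)
    return F.cross_entropy(logits, labels)

Restricting the denominator to the hardest mismatched pairs, rather than all in-batch negatives, is what sharpens the contrastive signal that the localization method relies on.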

Authors


Institution: University of Oxford
Division: MPLS
Department: Engineering Science
Oxford college: Oriel College
Role: Author

Contributors

Role: Supervisor
Role: Supervisor
ORCID: 0000-0002-8945-8573


DOI:
Type of award: DPhil
Level of award: Doctoral
Awarding institution: University of Oxford

