Thesis

Learning with multimodal self-supervision

Abstract:

Deep learning has fueled an explosion of applications, yet training deep neural networks usually requires expensive human annotations. In this thesis we explore alternatives that avoid this heavy reliance on manually annotated examples when training deep neural networks. Specifically, we either adapt self-supervised methods to automatically correct freely obtained data labels, or abandon human labels entirely and instead exploit the natural co-occurrence of audio and visual information to learn object representations in videos.


Growing collections of digital data often come with noisy labels that can be exploited to supervise the learning process. Conventional practice is to correct or clean these labels before training recognition models, but this can require infeasible amounts of manual effort. We instead correct the annotation noise automatically, eschewing the need for costly manual annotation. We build on and extend recent breakthroughs with a consistency loss that enables training even without ground truth, and a spatial memory map that provides flexible instance-level registration, leading to greater generalization.
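The consistency loss is only named at a high level here; as a rough illustration, a generic prediction-consistency objective for unlabeled data can be sketched as below. This is a minimal PyTorch sketch under assumptions of my own (the function name, the weak/strong two-view augmentation scheme, and the KL-based penalty), not the exact formulation used in the thesis.

import torch
import torch.nn.functional as F

def consistency_loss(model, x_weak, x_strong):
    # Predictions on the weakly augmented view serve as a soft target;
    # no ground-truth label is needed anywhere in this objective.
    with torch.no_grad():
        target = F.softmax(model(x_weak), dim=-1)
    # Penalise disagreement of the strongly augmented view with that target.
    log_pred = F.log_softmax(model(x_strong), dim=-1)
    return F.kl_div(log_pred, target, reduction="batchmean")

In practice such a term is added to whatever supervised loss is available for the (noisily) labeled examples, so the network is regularised even where the labels cannot be trusted.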


We further explore multimodal sensory streams as a source of self-supervision by exploiting modality redundancy, i.e. the information shared between modalities. Representations are learned by harnessing the different modalities without any human-annotated labels. We demonstrate this with three applications. First, we automatically curate a large-scale audio dataset, VGG-Sound, containing more than 200k videos collected using visual guidance; training on it yields state-of-the-art models for audio recognition. Second, we improve and extend recent sound source localization techniques by introducing a mechanism that automatically mines hard samples and adds them to a contrastive learning formulation. Finally, unlike existing audio-visual synchronization work that targets a single specific domain, we address the synchronization problem in open-world settings by exploring several transformer-based architectures. With these models we achieve state-of-the-art results on challenging speech datasets and show excellent generalization on a general sound dataset.
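To make the hard-sample mining idea concrete, the sketch below shows a generic audio-visual InfoNCE loss with in-batch hard negative mining in PyTorch. The function name, embedding shapes, temperature and the number of mined negatives are assumptions for illustration; the thesis's actual localization formulation is not reproduced here.

import torch
import torch.nn.functional as F

def av_hard_negative_loss(audio_emb, visual_emb, temperature=0.07, k_hard=5):
    # audio_emb, visual_emb: (B, D) embeddings of the audio and visual
    # streams of the same B clips; matching indices are the positive pairs.
    audio_emb = F.normalize(audio_emb, dim=-1)
    visual_emb = F.normalize(visual_emb, dim=-1)
    sim = audio_emb @ visual_emb.t() / temperature   # (B, B) similarity matrix
    pos = sim.diag().unsqueeze(1)                    # matched audio-visual pairs
    # Mask the positives, then mine the k most similar mismatched clips
    # as hard negatives for every audio query (requires B > k_hard).
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    hard_neg, _ = sim.masked_fill(eye, float("-inf")).topk(k_hard, dim=-1)
    # InfoNCE: the positive must score higher than its mined hard negatives.
    logits = torch.cat([pos, hard_neg], dim=1)
    labels = torch.zeros(sim.size(0), dtype=torch.long, device=sim.device)
    return F.cross_entropy(logits, labels)

Restricting the denominator to the hardest mismatched pairs, rather than all in-batch negatives, is what sharpens the contrastive signal that the localization method relies on.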

Authors


Institution: University of Oxford
Division: MPLS
Department: Engineering Science
Oxford college: Oriel College
Role: Author

Contributors

Role: Supervisor
Role: Supervisor
ORCID: 0000-0002-8945-8573


DOI:
Type of award: DPhil
Level of award: Doctoral
Awarding institution: University of Oxford

