
Thesis

End-to-end learning, and audio-visual human-centric video understanding

Abstract:

The field of machine learning has seen tremendous progress in the last decade, largely due to the advent of deep neural networks. When trained on large-scale labelled datasets, these machine learning algorithms can learn powerful semantic representations directly from the input data, end-to-end. End-to-end learning requires the availability of three core components: useful input data, target outputs, and an objective function for measuring how well the model's predictions match the target outputs. In this thesis, we explore and overcome a series of challenges related to assembling these three components at the format and scale required for end-to-end learning.
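For concreteness, the sketch below shows these three components wired together in a single supervised training step; the toy model, dimensions, optimiser, and random data are purely illustrative assumptions and not part of the thesis.

```python
import torch
from torch import nn

# Placeholder model and objective; sizes are arbitrary for illustration.
model = nn.Linear(128, 10)                 # maps input data to predictions
objective = nn.CrossEntropyLoss()          # scores predictions against targets
optimiser = torch.optim.SGD(model.parameters(), lr=0.1)

inputs = torch.randn(32, 128)              # (1) useful input data
targets = torch.randint(0, 10, (32,))      # (2) target outputs

loss = objective(model(inputs), targets)   # (3) objective function
loss.backward()                            # gradients propagate end-to-end
optimiser.step()
```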

The first key idea presented in this thesis is to learn representations by enabling end-to-end learning for tasks where such challenges exist. We first explore whether better representations can be learnt for the image retrieval task by directly optimising the evaluation metric, Average Precision. This is a notoriously challenging task, because such rank-based metrics are non-differentiable. We introduce a simple objective function that optimises a smoothed approximation of Average Precision, termed Smooth-AP, and demonstrate the benefits of training end-to-end over prior approaches. Secondly, we explore whether a representation can be learnt end-to-end for the task of image editing, where target data does not exist at sufficient scale. We propose a self-supervised approach that simulates target data by augmenting off-the-shelf image data, giving remarkable benefits over prior work.
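As a rough illustration of the Smooth-AP idea, the sketch below relaxes the non-differentiable ranking indicator inside Average Precision with a sigmoid for a single query; the function name, temperature value, and tensor layout are assumptions made for this example rather than the exact formulation used in the thesis.

```python
import torch

def smooth_ap_loss(scores: torch.Tensor, labels: torch.Tensor,
                   temperature: float = 0.01) -> torch.Tensor:
    """Sigmoid-smoothed Average Precision loss for one query.

    scores : (N,) similarities between the query and N gallery items.
    labels : (N,) binary relevance of each gallery item to the query.
    """
    pos = labels.bool()
    # diff[j, k] = scores[j] - scores[k]; sigmoid(diff / T) softly counts
    # "item j is ranked above item k", replacing the hard 0/1 indicator.
    diff = scores.unsqueeze(1) - scores.unsqueeze(0)
    soft_above = torch.sigmoid(diff / temperature)
    # Zero the diagonal so an item is not compared with itself.
    soft_above = soft_above * (1 - torch.eye(len(scores), device=scores.device))

    # Soft rank of each positive among all items, and among positives only.
    rank_all = 1 + soft_above[:, pos].sum(dim=0)
    rank_pos = 1 + soft_above[pos][:, pos].sum(dim=0)

    smooth_ap = (rank_pos / rank_all).mean()
    return 1 - smooth_ap  # minimising the loss maximises smoothed AP
```

Minimising this loss encourages every relevant item to be ranked above every irrelevant one, and because the sigmoid is differentiable the gradient can flow back into the embedding network end-to-end.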

The second idea presented in this thesis is focused on how to use the rich multi-modal signals that are essential to human perceptual systems as input data for deep neural networks. More specifically, we explore the use of audio-visual input data for the human-centric video understanding task. Here, we first explore whether highly optimised speaker verification representations can transfer to the domain of movies, where humans intentionally disguise their voices. We do this by collecting an audio-visual dataset of humans speaking in movies. Second, given strong identity-discriminating representations, we present two methods that harness the complementarity and redundancy between multi-modal signals in order to build robust perceptual systems for determining who is present in a scene. These methods comprise an automated pipeline for labelling people in unlabelled video archives, and an approach for clustering people by identity in videos.

Authors


Institution: University of Oxford
Division: MPLS
Department: Engineering Science
Role: Author

Contributors

Institution: University of Oxford
Division: MPLS
Department: Engineering Science
Role: Supervisor


Funders

Funder identifier: https://ror.org/0439y7842
Funding agency for: Brown, A


DOI:
Type of award: DPhil
Level of award: Doctoral
Awarding institution: University of Oxford
