Conference item
Self-supervised learning of audio-visual objects from video
- Abstract:
- Our objective is to transform a video into a set of discrete audio-visual objects using self-supervised learning. To this end, we introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time. We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks: (a) multi-speaker sound source separation, (b) localizing and tracking speakers, (c) correcting misaligned audio-visual data, and (d) active speaker detection. Using our representation, these tasks can be solved entirely by training on unlabeled video, without the aid of object detectors. We also demonstrate the generality of our method by applying it to non-human speakers, including cartoons and puppets. Our model significantly outperforms other self-supervised approaches, and obtains performance competitive with methods that use supervised face detection.
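- Illustration: the abstract describes computing attention between the audio track and spatial locations of the video to localize sound sources. The following is a minimal sketch of that general idea, not the authors' released code; the function name, feature shapes, and cosine-similarity scoring are illustrative assumptions.

```python
import numpy as np

def audio_visual_attention(visual_feats, audio_emb):
    """Illustrative sketch: score each spatial location of a visual
    feature map against a single audio embedding.

    visual_feats: (H, W, D) per-location visual embeddings (assumed shape)
    audio_emb:    (D,) embedding of the audio track (assumed shape)
    returns:      (H, W) attention map, softmax-normalized over space
    """
    # L2-normalize so the dot product is a cosine similarity.
    v = visual_feats / (np.linalg.norm(visual_feats, axis=-1, keepdims=True) + 1e-8)
    a = audio_emb / (np.linalg.norm(audio_emb) + 1e-8)

    scores = v @ a                    # (H, W) similarity per location
    flat = scores.ravel()
    flat = np.exp(flat - flat.max())  # numerically stable softmax over all locations
    return (flat / flat.sum()).reshape(scores.shape)

# Toy usage: the peak of the map is a candidate sound-source location.
H, W, D = 14, 14, 128
rng = np.random.default_rng(0)
vis = rng.normal(size=(H, W, D))
aud = rng.normal(size=D)
att = audio_visual_attention(vis, aud)
y, x = np.unravel_index(att.argmax(), att.shape)
print(f"most audio-correlated location: ({y}, {x})")
```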
- Publication status:
- Published
- Peer review status:
- Peer reviewed
Access Document
- Files:
- Accepted manuscript, 9.5MB
- Publisher copy:
- 10.1007/978-3-030-58523-5_13
Authors
- Triantafyllos Afouras
- Andrew Owens
- Joon Son Chung
- Andrew Zisserman
- Publisher:
- Springer
- Series:
- Lecture Notes in Computer Science
- Series number:
- 12363
- Publication date:
- 2020-12-04
- Acceptance date:
- 2020-07-02
- Event title:
- 16th European Conference on Computer Vision (ECCV 2020)
- Event website:
- https://eccv2020.eu/
- Event start date:
- 2020-08-23
- Event end date:
- 2020-08-28
- DOI:
- 10.1007/978-3-030-58523-5_13
- EISBN:
- 9783030585235
- ISBN:
- 9783030585228
- Language:
- English
- Keywords:
- Pubs id:
- 1131225
- Local pid:
- pubs:1131225
- Deposit date:
- 2020-09-09
Terms of use
- Copyright holder:
- Springer Nature
- Copyright date:
- 2020
- Rights statement:
- © Springer Nature Switzerland AG 2020.
- Notes:
- This paper was presented at the 16th European Conference on Computer Vision (ECCV 2020), August 2020. This is the accepted manuscript version of the paper. The final version is available online from Springer at: https://doi.org/10.1007/978-3-030-58523-5_13