Conference item

Objects that sound

Abstract:

In this paper our objectives are, first, networks that can embed audio and visual inputs into a common space that is suitable for cross-modal retrieval; and second, a network that can localize the object that sounds in an image, given the audio signal. We achieve both these objectives by training from unlabelled video using only audio-visual correspondence (AVC) as the objective function. This is a form of cross-modal self-supervision from video. To this end, we design new network architectu...
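The AVC objective described in the abstract casts training as a binary classification: given a video frame and an audio clip, predict whether they come from the same video. A minimal sketch of that idea, assuming toy linear "subnetworks" and a distance-to-sigmoid correspondence score (the `embed`, `avc_probability`, and `avc_loss` names are illustrative, not the paper's actual API, and the real model uses deep convolutional networks per modality):

```python
import math

def embed(x, weights):
    """Hypothetical one-layer 'subnetwork': project an input vector into the
    shared embedding space (stand-in for a deep conv net per modality)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def avc_probability(vision_emb, audio_emb, scale=1.0, bias=0.0):
    """Map the euclidean distance between the two embeddings to a
    correspondence probability via a sigmoid, so AVC reduces to
    binary classification: small distance -> probability near 0.5+,
    large distance -> probability near 0."""
    dist = math.sqrt(sum((v - a) ** 2 for v, a in zip(vision_emb, audio_emb)))
    return 1.0 / (1.0 + math.exp(scale * dist + bias))

def avc_loss(prob, corresponds):
    """Binary cross-entropy on the correspondence label (True = frame and
    audio were sampled from the same video)."""
    target = 1.0 if corresponds else 0.0
    eps = 1e-12
    return -(target * math.log(prob + eps)
             + (1.0 - target) * math.log(1.0 - prob + eps))

# Toy usage: a matching audio-frame pair (identical toy features) should
# score higher than a mismatched pair drawn from a different video.
w = [[1.0, 0.0], [0.0, 1.0]]  # identity "weights" shared by both modalities
frame, sound_pos, sound_neg = [0.2, 0.4], [0.2, 0.4], [3.0, -2.0]
p_pos = avc_probability(embed(frame, w), embed(sound_pos, w))
p_neg = avc_probability(embed(frame, w), embed(sound_neg, w))
assert p_pos > p_neg
```

Because the score is a function only of the distance between the two embeddings, minimising this loss over corresponding and non-corresponding pairs forces the audio and visual subnetworks into a common space suitable for cross-modal retrieval, without any labels.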

Publication status:
Published
Peer review status:
Peer reviewed
Version:
Accepted Manuscript

Publisher copy:
10.1007/978-3-030-01246-5_27

Authors


Arandjelovic, R
Institution:
University of Oxford
Division:
MPLS Division
Department:
Engineering Science
Oxford college:
Brasenose College
ORCID:
0000-0002-8945-8573
Publisher:
Springer
Volume:
Part 1
Pages:
451-466
Series:
Lecture Notes in Computer Science
Publication date:
2018-10-06
Acceptance date:
2018-07-03
DOI:
10.1007/978-3-030-01246-5_27
Pubs id:
pubs:966740
URN:
uri:d0bf6ced-8ccb-4edf-8ec0-02ad2d06efe8
UUID:
uuid:d0bf6ced-8ccb-4edf-8ec0-02ad2d06efe8
Local pid:
pubs:966740
ISBN:
978-3-030-01245-8
