Conference item

Separating the “chirp” from the “chat”: self-supervised visual grounding of sound and language

Abstract:
We present DenseAV, a novel dual encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visual aligned features solely through watching videos. We show that DenseAV can discover the “meaning” of words and the “location” of sounds without explicit localization supervision. Furthermore, it automatically discovers and distinguishes between these two types of associations without supervision. We show that DenseAV's localization abilities arise from a new multi-head feature aggregation operator that directly compares dense image and audio representations for contrastive learning. In contrast, many other systems that learn “global” audio and video representations cannot localize words and sound. Finally, we contribute two new datasets to improve the evaluation of AV representations through speech and sound prompted semantic segmentation. On these and other datasets we show DenseAV dramatically outperforms the prior art on speech and sound prompted semantic segmentation. DenseAV outperforms the current state-of-the-art, ImageBind, on cross-modal retrieval using fewer than half of the parameters. Project Page: https://aka.ms/denseav
Publication status:
Published
Peer review status:
Peer reviewed

Files:
Publisher copy:
10.1109/cvpr52733.2024.01246

Authors


Institution:
University of Oxford
Division:
MPLS
Department:
Engineering Science
Role:
Author


Funder identifier:
https://ror.org/0439y7842
Grant:
EP/T028572/1


Publisher:
IEEE
Host title:
2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Pages:
13117-13127
Publication date:
2024-09-16
Acceptance date:
2024-06-16
Event title:
IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR 2024)
Event location:
Seattle, Washington, USA
Event website:
https://cvpr.thecvf.com/Conferences/2024
Event start date:
2024-06-17
Event end date:
2024-06-21
DOI:
EISSN:
2575-7075
ISSN:
1063-6919


Language:
English
Keywords:
Pubs id:
2063441
Local pid:
pubs:2063441
Deposit date:
2024-11-19
