Conference item

Audio-visual synchronisation in the wild

Abstract:
In this paper, we consider the problem of audio-visual synchronisation applied to videos "in-the-wild" (i.e. of general classes beyond speech). As a new task, we identify and curate a test set with high audio-visual correlation, namely VGG-Sound Sync. We compare a number of transformer-based architectural variants specifically designed to model audio and visual signals of arbitrary length, while significantly reducing memory requirements during training. We further conduct an in-depth analysis on the curated dataset and define an evaluation metric for open domain audio-visual synchronisation. We apply our method to the standard lip reading speech benchmarks, LRS2 and LRS3, with ablations on various aspects. Finally, we set the first benchmark for general audio-visual synchronisation with over 160 diverse classes in the new VGG-Sound Sync video dataset. In all cases, our proposed model outperforms the previous state-of-the-art by a significant margin.
Publication status:
Published

Publication website:
https://www.robots.ox.ac.uk/~vgg/research/avs/

Publisher:
British Machine Vision Association
Journal:
Proceedings of the 32nd British Machine Vision Conference
Publication date:
2021-12-15
Acceptance date:
2021-10-15
Event title:
British Machine Vision Conference 2021


Language:
English
Pubs id:
1208996
Local pid:
pubs:1208996
Deposit date:
2021-11-11