Conference item
Audio-visual synchronisation in the wild
- Abstract:
- In this paper, we consider the problem of audio-visual synchronisation applied to videos "in-the-wild" (i.e. of general classes beyond speech). As a new task, we identify and curate a test set with high audio-visual correlation, namely VGG-Sound Sync. We compare a number of transformer-based architectural variants specifically designed to model audio and visual signals of arbitrary length, while significantly reducing memory requirements during training. We further conduct an in-depth analysis on the curated dataset and define an evaluation metric for open domain audio-visual synchronisation. We apply our method on standard lip reading speech benchmarks, LRS2 and LRS3, with ablations on various aspects. Finally, we set the first benchmark for general audio-visual synchronisation with over 160 diverse classes in the new VGG-Sound Sync video dataset. In all cases, our proposed model outperforms the previous state-of-the-art by a significant margin.
- Publication status:
- Published
Access Document
- Files:
- Accepted manuscript (pdf, 959.7KB)
- Publication website:
- https://www.robots.ox.ac.uk/~vgg/research/avs/
- Publisher:
- British Machine Vision Association
- Journal:
- Proceedings of the 32nd British Machine Vision Conference
- Publication date:
- 2021-12-15
- Acceptance date:
- 2021-10-15
- Event title:
- British Machine Vision Conference 2021
- Language:
- English
- Pubs id:
- 1208996
- Local pid:
- pubs:1208996
- Deposit date:
- 2021-11-11
- ARK identifier:
Terms of use
- Copyright holder:
- BMVC
- Copyright date:
- 2021
- Rights statement:
- © BMVC 2021.
- Notes:
- This is the accepted manuscript version of the article. The final version is available from British Machine Vision Association at https://www.robots.ox.ac.uk/~vgg/research/avs/