Out of time: automated lip sync in the wild
The goal of this work is to determine the audio-video synchronisation between mouth motion and speech in a video.
We propose a two-stream ConvNet architecture that enables the mapping between the sound and the mouth images to be learnt end-to-end from unlabelled data. The trained network is used to determine the lip-sync error in a video.
We apply the network to two further tasks: active speaker detection and lip reading. On both tasks we set a new state-of-the-art on standard benchmark datasets.
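The abstract describes the approach only at a high level. As a concrete illustration, below is a minimal PyTorch sketch of a two-stream embedding network of the kind described, trained with a contrastive loss on in-sync versus artificially shifted audio-video pairs. All specifics here (input shapes of five stacked grayscale mouth frames and a 13x20 MFCC window, layer widths, the 256-d embedding, and the margin loss) are illustrative assumptions, not the authors' exact configuration.

```python
# Illustrative sketch of a two-stream audio-visual sync network.
# Assumptions (not from the paper text): 5 stacked grayscale mouth frames
# as the visual input, a 13x20 MFCC window as the audio input, 256-d
# embeddings, and a contrastive loss on in-sync vs. shifted pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
        nn.MaxPool2d(2))

class TwoStreamSyncNet(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        # Visual stream: 5 grayscale mouth frames stacked as channels.
        self.video = nn.Sequential(
            conv_block(5, 32), conv_block(32, 64), conv_block(64, 128),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, embed_dim))
        # Audio stream: 1-channel MFCC "image" (13 coefficients x 20 steps).
        self.audio = nn.Sequential(
            conv_block(1, 32), conv_block(32, 64), conv_block(64, 128),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, embed_dim))

    def forward(self, frames, mfcc):
        # frames: (N, 5, 112, 112); mfcc: (N, 1, 13, 20)
        v = F.normalize(self.video(frames), dim=1)
        a = F.normalize(self.audio(mfcc), dim=1)
        return v, a

def contrastive_loss(v, a, same, margin=1.0):
    """same = 1 for in-sync pairs, 0 for deliberately shifted pairs."""
    d = F.pairwise_distance(v, a)
    return (same * d.pow(2) +
            (1 - same) * F.relu(margin - d).pow(2)).mean()

# At test time, the lip-sync offset can be estimated as the audio shift
# that minimises the embedding distance averaged over a clip.
```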
- Peer review status: Peer reviewed
- Host title: Workshop on Multi-view Lip-reading, 13th Asian Conference on Computer Vision (ACCV 2016)
- Copyright: © Springer International Publishing AG 2017