Out of time: automated lip sync in the wild

The goal of this work is to determine the audio-video synchronisation between mouth motion and speech in a video.
We propose a two-stream ConvNet architecture that enables the mapping between the sound and the mouth images to be trained end-to-end from unlabelled data. The trained network is used to determine the lip-sync error in a video.
We apply the network to two further tasks: active speaker detection and lip reading. On both tasks we set a new state-of-the-art on standard benchmark datasets.
- Publication status: Accepted manuscript
- Peer review status: Peer reviewed
- Copyright holder: Springer International Publishing AG
- Copyright date: 2017