Conference item
WhisperX: time-accurate speech transcription of long-form audio
- Abstract:
- Large-scale, weakly-supervised speech recognition models, such as Whisper, have demonstrated impressive results on speech recognition across domains and languages. However, their application to long audio transcription via buffered or sliding window approaches is prone to drifting, hallucination, and repetition, and prohibits batched transcription due to the sequential nature of these approaches. Further, timestamps corresponding to each utterance are prone to inaccuracies, and word-level timestamps are not available out-of-the-box. To overcome these challenges, we present WhisperX, a time-accurate speech recognition system with word-level timestamps utilising voice activity detection and forced phoneme alignment. In doing so, we demonstrate state-of-the-art performance on long-form transcription and word segmentation benchmarks. Additionally, we show that pre-segmenting audio with our proposed VAD Cut & Merge strategy improves transcription quality and enables a twelvefold transcription speedup via batched inference. The code is available open-source.
- Publication status:
- Published
- Peer review status:
- Peer reviewed
Access Document
- Publisher copy:
- 10.21437/Interspeech.2023-78
Authors
- Publisher:
- International Speech Communication Association
- Pages:
- 4489-4493
- Publication date:
- 2023-08-18
- Acceptance date:
- 2023-05-17
- Event title:
- 24th Interspeech Conference 2023
- Event location:
- Dublin, Ireland
- Event website:
- https://www.interspeech2023.org/
- Event start date:
- 2023-08-20
- Event end date:
- 2023-08-24
- DOI:
- Language:
- English
- Keywords:
- Pubs id:
- 1341473
- Local pid:
- pubs:1341473
- Deposit date:
- 2023-05-18
Terms of use
- Copyright date:
- 2023
- Notes:
- This paper will be presented at the 24th Interspeech Conference 2023, 20th - 24th August 2023, Dublin, Ireland. This is the accepted manuscript version of the article. The final version is available online from International Speech Communication Association at: https://doi.org/10.21437/Interspeech.2023-78