Conference item

WhisperX: time-accurate speech transcription of long-form audio

Abstract:: Large-scale, weakly-supervised speech recognition models, such as Whisper, have demonstrated impressive results on speech recognition across domains and languages. However, their application to long audio transcription via buffered or sliding-window approaches is prone to drifting, hallucination and repetition, and prohibits batched transcription because these approaches are sequential. Further, the timestamps corresponding to each utterance are prone to inaccuracies, and word-level timestamps are not available out of the box. To overcome these challenges, we present WhisperX, a time-accurate speech recognition system with word-level timestamps utilising voice activity detection and forced phoneme alignment. In doing so, we demonstrate state-of-the-art performance on long-form transcription and word segmentation benchmarks. Additionally, we show that pre-segmenting audio with our proposed VAD Cut & Merge strategy improves transcription quality and enables a twelvefold transcription speedup via batched inference. The code is available open-source.

Publication status:: Published

Peer review status:: Peer reviewed

Cite this record

APA Style

Bain, M., Huh, J., Han, T., & Zisserman, A. (2023). WhisperX: time-accurate speech transcription of long-form audio. 24th Interspeech Conference 2023, 4489–4493.

MLA Style

Bain, M, et al. “WhisperX: Time-Accurate Speech Transcription of Long-Form Audio.” 24th Interspeech Conference 2023, 2023, pp. 4489–93.

Chicago Style

Bain, M, J Huh, T Han, and A Zisserman. 2023. “WhisperX: Time-Accurate Speech Transcription of Long-Form Audio.” In 24th Interspeech Conference 2023, 4489–93. International Speech Communication Association.

Access Document

Files:: Bain_et_al_2023_WhisperX_time_accurate.pdf

(Accepted manuscript, pdf, 373.4KB)

Publisher copy:: 10.21437/Interspeech.2023-78

Authors

Bain, M

Institution:: University of Oxford
Division:: MPLS
Department:: Engineering Science
Role:: Author

Huh, J

Institution:: University of Oxford
Division:: MPLS
Department:: Engineering Science
Role:: Author

Han, T

Institution:: University of Oxford
Division:: MPLS
Department:: Engineering Science
Role:: Author
ORCID:: 0000-0002-1874-9664

Zisserman, A

Institution:: University of Oxford
Division:: MPLS
Department:: Engineering Science
Oxford college:: Brasenose College
Role:: Author
ORCID:: 0000-0002-8945-8573

Publisher:: International Speech Communication Association
Pages:: 4489-4493
Publication date:: 2023-08-18
Acceptance date:: 2023-05-17
Event title:: 24th Interspeech Conference 2023
Event location:: Dublin, Ireland
Event website:: https://www.interspeech2023.org/
Event start date:: 2023-08-20
Event end date:: 2023-08-24
DOI:: 10.21437/Interspeech.2023-78

Language:: English
Keywords:: FFR
Pubs id:: 1341473
Local pid:: pubs:1341473
Deposit date:: 2023-05-18
ARK identifier:: ark:/29072/ora_fece419295b74db8a0183cf728040194

Terms of use

Copyright date:: 2023
Notes:: This paper was presented at the 24th Interspeech Conference 2023, 20th - 24th August 2023, Dublin, Ireland. This is the accepted manuscript version of the article. The final version is available online from International Speech Communication Association at: https://doi.org/10.21437/Interspeech.2023-78

Licence:: Terms and Conditions of Use for Oxford University Research Archive
