Shot-by-shot: film-grammar-aware training-free audio description generation

Xie, J; Han, T; Bain, M; Nagrani, A; Khandelwal, E; Varol, G; Xie, W; Zisserman, A

AI Collection

Conference item

Abstract:: Our objective is the automatic generation of Audio Descriptions (ADs) for edited video material, such as movies and TV series. To achieve this, we propose a two-stage framework that leverages “shots” as the fundamental units of video understanding. This includes extending temporal context to neighbouring shots and incorporating film grammar devices, such as shot scales and thread structures, to guide AD generation. Our method is compatible with both open-source and proprietary Visual-Language Models (VLMs), integrating expert knowledge from add-on modules without requiring additional training of the VLMs. We achieve state-of-the-art performance among all prior training-free approaches and even surpass fine-tuned methods on several benchmarks. To evaluate the quality of predicted ADs, we introduce a new evaluation measure, an action score, specifically targeted at assessing this important aspect of AD. Additionally, we propose a novel evaluation protocol that treats automatic frameworks as AD generation assistants and asks them to generate multiple candidate ADs for selection.

Publication status:: Accepted

Peer review status:: Peer reviewed


Cite this record

APA Style

Xie, J., Han, T., Bain, M., Nagrani, A., Khandelwal, E., Varol, G., Xie, W., & Zisserman, A. (2025). Shot-by-shot: film-grammar-aware training-free audio description generation. International Conference on Computer Vision (ICCV 2025).

MLA Style

Xie, J, et al. “Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation.” International Conference on Computer Vision (ICCV 2025), 2025.

Chicago Style

Xie, J, T Han, M Bain, A Nagrani, E Khandelwal, G Varol, W Xie, and A Zisserman. 2025. “Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation.” In International Conference on Computer Vision (ICCV 2025). IEEE.

Access Document

Files:: Xie_et_al_2025_Shot-by-shot_film-grammar-aware_training-free.pdf

(Accepted manuscript, pdf, 25.6MB)

Authors

Xie, J

Institution:: University of Oxford
Division:: MPLS
Department:: Engineering Science
Role:: Author

Han, T

Institution:: University of Oxford
Division:: MPLS
Department:: Engineering Science
Role:: Author

Bain, M

Institution:: University of Oxford
Division:: MPLS
Department:: Engineering Science
Role:: Author

Nagrani, A

Institution:: University of Oxford
Division:: MPLS
Department:: Engineering Science
Role:: Author

Khandelwal, E

Role:: Author

Funding

Engineering and Physical Sciences Research Council

Funder identifier:: https://ror.org/0439y7842
Grant:: EP/T028572/1

Bibliographic Details

Publisher:: IEEE
Acceptance date:: 2025-07-23
Event title:: International Conference on Computer Vision (ICCV 2025)
Event location:: Honolulu, Hawai'i, USA
Event website:: https://iccv.thecvf.com/
Event start date:: 2025-10-19
Event end date:: 2025-10-23

Item Description

Language:: English
Pubs id:: 2300178
Local pid:: pubs:2300178
Deposit date:: 2025-10-17
ARK identifier:: ark:/29072/ora_cad2c15a101342b185c1cb7a961e9f7b

Terms of use

Notes:: This paper will be presented at the International Conference on Computer Vision (ICCV 2025), 19th-23rd October 2025, Honolulu, Hawai'i, USA. The author accepted manuscript (AAM) of this paper has been made available under the University of Oxford's Open Access Publications Policy, and a CC BY public copyright licence has been applied.

Licence:: CC Attribution (CC BY)
