Thesis
Video understanding using audio and visual modalities
- Abstract:
This thesis explores the field of audio-visual learning in the context of video understanding, focusing on two main aspects: developing novel methods for effectively integrating audio and visual modalities, and curating high-quality audio-visual datasets automatically. By leveraging the complementary nature of these modalities and addressing the challenge of dataset creation, we enhance machine perception and comprehension of video content. We overcome the limitations of single-modality processing and improve performance in various practical applications. Our work also contributes to the efficient generation of large-scale, diverse datasets crucial for progress in this field.
Our research addresses three key areas: action recognition, character-aware subtitle generation for TV shows, and efficient audio-visual dataset curation. In action recognition, we develop models that effectively integrate audio-visual cues to improve accuracy, particularly by utilising temporal context from the video. For subtitle generation, we propose a multimodal approach that not only transcribes speech accurately but also attributes dialogue to the correct speakers and aligns subtitles precisely with the audio content. In dataset curation, we tackle the challenge of creating large-scale, diverse, and accurately labelled audio-visual datasets, developing efficient methods to accelerate progress in the field.
Throughout this work, we introduce novel architectures and algorithms that effectively combine audio and visual information, and we propose new datasets and automatic creation pipelines that reduce the cost of data collection and human annotation. Our approach is inspired by psychological research on human multisensory integration and aims to mimic human-like processing of audio-visual information.
Access Document
- Files:
- Dissemination version, pdf, 44.5MB
Authors
+ Huh, J
Contributors
+ Zisserman, A
- Institution:
- University of Oxford
- Division:
- MPLS
- Department:
- Engineering Science
- Role:
- Supervisor
- ORCID:
- 0000-0002-8945-8573
+ Henriques, J
- Institution:
- University of Oxford
- Division:
- MPLS
- Department:
- Engineering Science
- Role:
- Examiner
+ Grauman, K
- Institution:
- University of Texas at Austin
- Role:
- Examiner
+ Engineering and Physical Sciences Research Council
- Funder identifier:
- https://ror.org/0439y7842
- Funding agency for:
- Huh, J
- Zisserman, A
- Grant:
- EP/T028572/1
- Programme:
- VisualAI
- DOI:
- Type of award:
- DPhil
- Level of award:
- Doctoral
- Awarding institution:
- University of Oxford
- Language:
- English
- Keywords:
- Subjects:
- Deposit date:
- 2026-05-06
- ARK identifier:
Terms of use
- Copyright holder:
- Jaesung Huh
- Copyright date:
- 2024
- Licence:
- CC Attribution (CC BY)