Video understanding using audio and visual modalities

Huh, J

Thesis

Abstract: This thesis explores the field of audio-visual learning in the context of video understanding, focusing on two main aspects: developing novel methods for effectively integrating audio and visual modalities, and curating high-quality audio-visual datasets automatically. By leveraging the complementary nature of these modalities and addressing the challenge of dataset creation, we enhance machine perception and comprehension of video content. We overcome the limitations of single-modality processing and improve performance in various practical applications. Our work also contributes to the efficient generation of large-scale, diverse datasets crucial for progress in this field.

Our research addresses three key areas: action recognition, character-aware subtitle generation for TV shows, and efficient audio-visual dataset curation. In action recognition, we develop models that effectively integrate audio-visual cues to improve accuracy, particularly by utilising temporal context from the video. For subtitle generation, we propose a multimodal approach that not only transcribes speech accurately but also attributes dialogue to correct speakers and aligns subtitles precisely with audio content. In dataset curation, we tackle the challenge of creating large-scale, diverse, and accurately labelled audio-visual datasets, developing efficient methods to accelerate progress in the field.

Throughout this work, we introduce novel architectures and algorithms that effectively combine audio and visual information, as well as propose new datasets and automatic creation pipelines to reduce the cost of data collection and human annotation. Our approach is inspired by psychological research on human multisensory integration and aims to mimic human-like processing of audio-visual information.

Cite this record

APA Style

Huh, J. (2024). Video understanding using audio and visual modalities [PhD thesis]. University of Oxford.

MLA Style

Huh, J. Video Understanding Using Audio and Visual Modalities. 2024. University of Oxford, PhD thesis.

Chicago Style

Huh, J. 2024. “Video Understanding Using Audio and Visual Modalities.” PhD thesis, University of Oxford.