Thesis
Towards human-centric story understanding in video
- Abstract:
With endless amounts of data being uploaded every day, the potential for swift development of artificial intelligence has never been higher. Videos in particular contain a plethora of information for learning about the world. We can discern actions, interactions, movement patterns, speech, etc. But all too often, research tends to group and classify: a dog is a dog is a dog.
One of the challenges lies in moving beyond conventional class-based visual understanding and into the realm of instances. This thesis concerns itself with both named instances (more specific than traditional classes) and open-world, open-set instances (more general than conventional class frameworks). In it, we discuss methods that address these challenges and could later serve as building blocks for holistic story understanding.
The thesis is structured around two broad themes: (1) identity-agnostic video understanding methods, and (2) personalisation of various video understanding tasks.
We first develop class-agnostic methods that support better tracking, re-identification, retrieval and semantic video processing. Our work demonstrates that localisation and re-identification of a person or an object in a video can be trained jointly, using semantically-initialised embeddings. Furthermore, we show that by designing a task-agnostic video sampler, we can increase the number of frames a large language model can process, allowing us to learn from progressively longer videos.
We then focus on making video-understanding tasks identity-dependent. We first design a method that tackles problems of compound retrieval, jointly reasoning about 'who is doing what and where'. We then generalise this approach to work not only on humans, but on any arbitrary object. We show that large visual-language models can recognise a specific instance (e.g. 'my dog Chia') amongst a large corpus of images. Finally, we recognise that not only visual representations but also speech needs to be personalised. To this end, we present a method that assigns character names to speech segments, even across multiple TV shows. Thus, we demonstrate crucial building blocks necessary for more in-depth story understanding.
Authors
Contributors
- Andrew, Z
- Institution:
- University of Oxford
- Division:
- MPLS
- Department:
- Engineering Science
- Role:
- Supervisor
- ORCID:
- 0000-0002-8945-8573
- DOI:
- Type of award:
- DPhil
- Level of award:
- Doctoral
- Awarding institution:
- University of Oxford
- Language:
- English
- Keywords:
- Subjects:
- Deposit date:
- 2026-04-09
- ARK identifier:
Terms of use
- Copyright holder:
- Bruno Korbar
- Copyright date:
- 2024