Thesis
Sign language understanding using multimodal learning
- Abstract:
Sign languages are visual-spatial languages, representing the natural means of communication for deaf communities. Despite recent advancements in vision and language tasks, automatic sign language understanding remains largely unsolved. A key obstacle to making progress is the scarcity of appropriate training data. In this thesis, we aim to address this challenge.
First, we focus on visual keyword spotting (KWS) – the task of determining whether and when a keyword is spoken in a video – and leverage the fact that signers sometimes simultaneously mouth the word they sign. We initially propose a convolutional KWS architecture inspired by object detection methods, trained on data of talking faces. We then improve the cross-modal interaction between the video and keyword representations by leveraging Transformers. Subsequently, we use the KWS model out-of-domain on signer mouthings as a means to localise signs: we automatically annotate hundreds of thousands of signs in readily available sign language interpreted TV data, by leveraging weakly-aligned subtitles to provide query words.
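To make the cross-modal interaction concrete, the sketch below shows one way video tokens and keyword tokens could be jointly encoded by a Transformer to score, per frame, whether the keyword is being mouthed. This is a minimal illustrative PyTorch sketch, not the thesis architecture: the class name KeywordSpotter, the use of precomputed per-frame video features, the character-level keyword encoding and all dimensions are assumptions.

```python
# Illustrative sketch only: a joint video/keyword Transformer encoder with a
# per-frame localisation head. Assumes precomputed per-frame video features
# and a character-level keyword encoding (assumptions, not thesis details).
import torch
import torch.nn as nn


class KeywordSpotter(nn.Module):  # hypothetical name
    def __init__(self, video_dim=512, d_model=256, vocab_size=30, nhead=4, num_layers=3):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, d_model)      # project video features
        self.char_embed = nn.Embedding(vocab_size, d_model)  # embed keyword characters
        self.type_embed = nn.Embedding(2, d_model)           # mark video vs. keyword tokens
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.frame_head = nn.Linear(d_model, 1)              # per-frame "keyword here" score

    def forward(self, video_feats, keyword_chars):
        # video_feats: (B, T, video_dim); keyword_chars: (B, L) integer ids
        v = self.video_proj(video_feats) + self.type_embed.weight[0]
        k = self.char_embed(keyword_chars) + self.type_embed.weight[1]
        tokens = torch.cat([v, k], dim=1)   # joint sequence -> cross-modal self-attention
        out = self.encoder(tokens)
        return self.frame_head(out[:, : video_feats.size(1)]).squeeze(-1)  # (B, T) logits


model = KeywordSpotter()
logits = model(torch.randn(2, 40, 512), torch.randint(0, 30, (2, 8)))
print(logits.shape)  # torch.Size([2, 40])
```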
Second, to move beyond mouthings, which are sparse, we propose different sign spotting approaches to automatically annotate signs in the continuous interpreted signing: (i) using visual sign language dictionaries in a multiple instance learning framework, (ii) exploiting the attention mechanism of a Transformer trained on a video-to-text sequence prediction task, (iii) pseudo-labelling from a strong sign recognition model, (iv) leveraging in-domain exemplars from previous approaches and sign representation similarities. All four approaches leverage the weakly-aligned subtitles and increase the vocabulary and density of automatic sign annotations. As a result, we obtain a large-scale, diverse, supervised dataset, and facilitate the learning of strong sign representations.
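As a rough illustration of approach (i), the sketch below scores a subtitle-aligned window of continuous signing against isolated dictionary exemplars of a queried word under a multiple-instance assumption: if the word appears in the subtitle, at least one time step of the window should match one of its exemplars. The function name spot_sign, the cosine-similarity scoring and the threshold are illustrative assumptions, not the thesis implementation.

```python
# Illustrative sketch only: dictionary-exemplar sign spotting by embedding
# similarity under a multiple-instance assumption. Threshold and shapes are
# placeholders, not values from the thesis.
import torch
import torch.nn.functional as F


def spot_sign(continuous_feats, dictionary_feats, threshold=0.7):
    """continuous_feats: (T, D) embeddings of a subtitle-aligned signing window.
    dictionary_feats: (E, D) embeddings of isolated dictionary exemplars of one word.
    Returns (best_time_index, best_score) if the word is spotted, else None."""
    cont = F.normalize(continuous_feats, dim=-1)
    dico = F.normalize(dictionary_feats, dim=-1)
    sim = cont @ dico.t()                  # (T, E) cosine similarities
    scores = sim.max(dim=1).values         # best exemplar match at each time step
    best_score, best_t = scores.max(dim=0)
    if best_score.item() >= threshold:
        return best_t.item(), best_score.item()
    return None


print(spot_sign(torch.randn(64, 256), torch.randn(5, 256)))
```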
Third, we explore sign language tasks that entail predicting sequences of signs: fingerspelling and continuous sign language recognition (CSLR). For fingerspelling, we propose a weakly-supervised approach to detect and recognise sequences of letters, with a multiple-hypothesis loss function to learn from noisy supervision. For CSLR, we design a multi-task model that can also perform sign language retrieval, and demonstrate promising results in large-vocabulary settings.
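One simple way to realise a multiple-hypothesis loss is to evaluate a standard sequence loss, for example CTC, against each candidate letter sequence obtained from the noisy supervision and keep only the best-matching hypothesis. The sketch below illustrates this idea; the function name, the choice of CTC and the toy shapes are assumptions rather than the exact loss used in the thesis.

```python
# Illustrative sketch only: a "min over hypotheses" CTC loss for fingerspelling,
# where several noisy candidate letter sequences compete and only the best one
# drives the gradient. Shapes and vocabulary are placeholders.
import torch
import torch.nn.functional as F


def multi_hypothesis_ctc(log_probs, hypotheses, input_len):
    """log_probs: (T, 1, C) per-frame log-probabilities over letters (batch of one clip).
    hypotheses: list of 1-D LongTensors, each a candidate letter sequence.
    Returns the minimum CTC loss over the candidate hypotheses."""
    losses = []
    for target in hypotheses:
        losses.append(F.ctc_loss(
            log_probs,
            target.unsqueeze(0),
            input_lengths=torch.tensor([input_len]),
            target_lengths=torch.tensor([target.numel()]),
            blank=0,
        ))
    return torch.stack(losses).min()


T, C = 50, 27                                    # 26 letters + CTC blank
log_probs = torch.randn(T, 1, C).log_softmax(-1)
hyps = [torch.randint(1, C, (5,)), torch.randint(1, C, (7,))]
print(multi_hypothesis_ctc(log_probs, hyps, T))
```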
Finally, we explore obtaining stronger supervision from weak signals for a more general task, beyond the domain of sign language. Specifically, our focus shifts to verb understanding in video-language models – an important ability for modelling interactions among people, objects and the environment through space and time. For this task, we introduce a verb-focused contrastive framework consisting of two components: (i) leveraging pretrained large language models to create hard negatives for cross-modal contrastive learning; and (ii) enforcing a fine-grained alignment loss.
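The first component can be sketched as a standard cross-modal contrastive (InfoNCE-style) objective whose denominator additionally contains, for each video, an LLM-generated caption in which only the verb has been changed. The code below is an illustrative sketch under that assumption: the embeddings, temperature value and function name are placeholders, and the fine-grained alignment loss is not shown.

```python
# Illustrative sketch only: video-text contrastive loss where each video also
# competes against an LLM-generated, verb-swapped hard-negative caption.
import torch
import torch.nn.functional as F


def contrastive_with_verb_negatives(video_emb, caption_emb, hard_neg_emb, temperature=0.07):
    """video_emb, caption_emb, hard_neg_emb: (B, D) embeddings; hard_neg_emb[i]
    embeds a version of caption i whose verb was changed (e.g. by prompting an LLM)."""
    v = F.normalize(video_emb, dim=-1)
    c = F.normalize(caption_emb, dim=-1)
    n = F.normalize(hard_neg_emb, dim=-1)
    logits = v @ c.t() / temperature                        # (B, B) video-to-caption scores
    hard = (v * n).sum(dim=-1, keepdim=True) / temperature  # (B, 1) score vs. own hard negative
    logits = torch.cat([logits, hard], dim=1)               # hard negative joins the denominator
    targets = torch.arange(v.size(0))                       # positives lie on the diagonal
    return F.cross_entropy(logits, targets)


loss = contrastive_with_verb_negatives(torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512))
print(loss)
```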
Authors
- Momeni, L
Contributors
- Institution:
- University of Oxford
- Division:
- MPLS
- Department:
- Engineering Science
- Role:
- Supervisor
- Funder identifier:
- https://ror.org/024bc3e07
- Funding agency for:
- Momeni, L
- Grant:
- D4D00240-DF00.02
- Programme:
- PhD Fellowship in Machine Perception, Speech Technology & Computer Vision (2022)
- DOI:
- Type of award:
- DPhil
- Level of award:
- Doctoral
- Awarding institution:
- University of Oxford
- Language:
- English
- Keywords:
- Subjects:
- Deposit date:
- 2024-09-05
Terms of use
- Copyright holder:
- Momeni, L
- Copyright date:
- 2024