Thesis

Sign language understanding using multimodal learning

Abstract:

Sign languages are visual-spatial languages, representing the natural means of communication for deaf communities. Despite recent advancements in vision and language tasks, automatic sign language understanding remains largely unsolved. A key obstacle to making progress is the scarcity of appropriate training data. In this thesis, we aim to address this challenge.

First, we focus on visual keyword spotting (KWS) – the task of determining whether and when a keyword is spoken in a video – and leverage the fact that signers sometimes simultaneously mouth the word they sign. We initially propose a convolutional KWS architecture inspired by object detection methods, trained on data of talking faces. We then improve the cross-modal interaction between the video and keyword representations by leveraging Transformers. Subsequently, we use the KWS model out-of-domain on signer mouthings as a means to localize signs: we automatically annotate hundreds of thousands of signs in readily available sign language interpreted TV data, by leveraging weakly-aligned subtitles to provide query words.
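To illustrate the cross-modal interaction described above, the sketch below shows one plausible way to fuse per-frame video features with a keyword embedding in a Transformer encoder and score each frame for the keyword. It is a minimal, assumed design: the module names, dimensions and single keyword token are illustrative, not the thesis architecture.

```python
# Minimal sketch of a cross-modal keyword-spotting head: a Transformer encoder
# attends jointly over per-frame video features and a keyword embedding, then
# predicts a per-frame score that the keyword is being mouthed.
# All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class CrossModalKWS(nn.Module):
    def __init__(self, video_dim=512, keyword_dim=300, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, d_model)       # project visual features
        self.keyword_proj = nn.Linear(keyword_dim, d_model)   # project keyword embedding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.classifier = nn.Linear(d_model, 1)               # per-frame keyword logit

    def forward(self, video_feats, keyword_emb):
        # video_feats: (B, T, video_dim); keyword_emb: (B, keyword_dim)
        v = self.video_proj(video_feats)
        k = self.keyword_proj(keyword_emb).unsqueeze(1)       # (B, 1, d_model) keyword token
        tokens = torch.cat([k, v], dim=1)                     # keyword token + frame tokens
        out = self.encoder(tokens)                            # joint cross-modal attention
        return self.classifier(out[:, 1:]).squeeze(-1)        # (B, T) frame-level logits

model = CrossModalKWS()
scores = model(torch.randn(2, 64, 512), torch.randn(2, 300))  # (2, 64)
```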

Second, to move beyond mouthings which are sparse, we propose different sign spotting approaches to automatically annotate signs in the continuous interpreted signing: (i) using visual sign language dictionaries in a multiple instance learning framework, (ii) exploiting the attention mechanism of a Transformer trained on a video-to-text sequence prediction task, (iii) pseudo-labelling from a strong sign recognition model, (iv) leveraging in-domain exemplars from previous approaches and sign representation similarities. All four approaches leverage the weakly-aligned subtitles and increase the vocabulary and density of automatic sign annotations. As a result, we obtain a large-scale, diverse, supervised dataset, and facilitate the learning of strong sign representations.
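As an illustration of approach (i), the following sketch treats a subtitle-aligned window of continuous signing as a bag of clip embeddings and scores it against a dictionary exemplar of the queried sign, taking the best-matching clip as the bag score in multiple-instance-learning fashion. The embedding functions and shapes are assumptions for illustration only.

```python
# Minimal sketch of sign spotting via multiple instance learning: a weakly-aligned
# subtitle only tells us that a queried sign occurs somewhere in a window of
# continuous signing, so the window is a bag of clip embeddings and the bag score
# is the best match against a dictionary exemplar of that sign.
import torch
import torch.nn.functional as F

def mil_spotting_score(window_embs, dictionary_emb):
    """window_embs: (T, D) embeddings of clips in the subtitle window,
    dictionary_emb: (D,) embedding of the dictionary exemplar for the query sign."""
    sims = F.cosine_similarity(window_embs, dictionary_emb.unsqueeze(0), dim=-1)  # (T,)
    bag_score, best_t = sims.max(dim=0)   # MIL: the bag is positive if its best clip matches
    return bag_score, best_t              # score plus the candidate location of the sign

# Training would push bag_score up for windows whose subtitle contains the query
# word and down for windows whose subtitle does not mention it.
window = F.normalize(torch.randn(32, 256), dim=-1)
exemplar = F.normalize(torch.randn(256), dim=-1)
score, t = mil_spotting_score(window, exemplar)
```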

Third, we explore sign language tasks that entail predicting sequences of signs: fingerspelling and continuous sign language recognition (CSLR). For fingerspelling, we propose a weakly-supervised approach to detect and recognise sequences of letters, with a multiple-hypothesis loss function to learn from noisy supervision. For CSLR, we design a multi-task model capable of also performing sign language retrieval, and demonstrate promising results in large-vocabulary settings.
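The sketch below gives one schematic reading of a multiple-hypothesis loss under noisy supervision: when several candidate letter sequences are plausible for a fingerspelled segment, a sequence loss is computed against each candidate and only the best-matching one supervises the model. CTC is used here purely for illustration; it is not necessarily the loss used in the thesis.

```python
# Minimal sketch of a multiple-hypothesis loss for weak supervision: compute a
# sequence loss against every candidate letter sequence and keep only the minimum,
# so noisy or ambiguous annotations do not penalise the model unfairly.
import torch
import torch.nn.functional as F

def multi_hypothesis_ctc(log_probs, hypotheses, input_len):
    """log_probs: (T, 1, C) frame-level log-probabilities for one clip,
    hypotheses: list of 1D LongTensors, each a candidate letter sequence."""
    losses = []
    for target in hypotheses:
        loss = F.ctc_loss(log_probs, target.unsqueeze(0),
                          input_lengths=torch.tensor([input_len]),
                          target_lengths=torch.tensor([len(target)]))
        losses.append(loss)
    return torch.stack(losses).min()      # only the closest hypothesis supervises the model

T, C = 50, 27                              # 26 letters + CTC blank (index 0)
log_probs = F.log_softmax(torch.randn(T, 1, C), dim=-1)
candidates = [torch.tensor([8, 5, 12, 12, 15]), torch.tensor([8, 5, 12, 16])]
loss = multi_hypothesis_ctc(log_probs, candidates, input_len=T)
```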

Finally, we explore obtaining stronger supervision from weak signals for a more general task, beyond the domain of sign language. Specifically, our focus shifts to verb understanding in video-language models – an important ability for modeling interactions among people, objects and the environment through space and time. For this task, we introduce a verb-focused contrastive framework consisting of two components: (i) leveraging pretrained large language models to create hard negatives for cross-modal contrastive learning; and (ii) enforcing a fine-grained alignment loss.
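The sketch below illustrates component (i) in schematic form: alongside the usual in-batch video-to-caption contrastive loss, each video is also paired with a caption whose verb has been altered (e.g. by prompting a large language model), and the model must score the original caption above this hard negative. This is an assumed, simplified version, not the exact loss of the thesis, and it omits the fine-grained alignment term.

```python
# Minimal sketch of contrastive training with verb-swapped hard negatives:
# standard InfoNCE over the batch, with one extra verb-altered caption per video
# appended as a hard negative column in the logits.
import torch
import torch.nn.functional as F

def verb_contrastive_loss(video_emb, caption_emb, hard_neg_emb, temperature=0.07):
    """video_emb, caption_emb, hard_neg_emb: (B, D), L2-normalised embeddings;
    hard_neg_emb[i] encodes caption i with its verb replaced."""
    batch_logits = video_emb @ caption_emb.t() / temperature                     # (B, B) in-batch pairs
    verb_neg = (video_emb * hard_neg_emb).sum(-1, keepdim=True) / temperature    # (B, 1) hard negatives
    logits = torch.cat([batch_logits, verb_neg], dim=1)                          # (B, B+1)
    labels = torch.arange(video_emb.size(0))                                     # diagonal = true caption
    return F.cross_entropy(logits, labels)

B, D = 8, 256
v, c, n = (F.normalize(torch.randn(B, D), dim=-1) for _ in range(3))
loss = verb_contrastive_loss(v, c, n)
```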

Authors


Momeni, L
Institution:
University of Oxford
Division:
MPLS
Department:
Engineering Science
Role:
Author

Contributors

Institution:
University of Oxford
Division:
MPLS
Department:
Engineering Science
Role:
Supervisor


Funding
Funder identifier:
https://ror.org/024bc3e07
Funding agency for:
Momeni, L
Grant:
D4D00240-DF00.02
Programme:
PhD Fellowship in Machine Perception, Speech Technology & Computer Vision (2022)


DOI:
Type of award:
DPhil
Level of award:
Doctoral
Awarding institution:
University of Oxford
