Thesis

Sign language understanding using multimodal learning

Abstract:

Sign languages are visual-spatial languages, representing the natural means of communication for deaf communities. Despite recent advancements in vision and language tasks, automatic sign language understanding remains largely unsolved. A key obstacle to making progress is the scarcity of appropriate training data. In this thesis, we aim to address this challenge.

First, we focus on visual keyword spotting (KWS) – the task of determining whether and when a keyword is spoken in a video – and leverage the fact that signers sometimes simultaneously mouth the word they sign. We initially propose a convolutional KWS architecture inspired by object detection methods, trained on data of talking faces. We then improve the cross-modal interaction between the video and keyword representations by leveraging Transformers. Subsequently, we use the KWS model out-of-domain on signer mouthings as a means to localize signs: we automatically annotate hundreds of thousands of signs in readily available sign language interpreted TV data, by leveraging weakly-aligned subtitles to provide query words.
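To illustrate the cross-modal interaction described above, the sketch below shows one plausible way to fuse per-frame video features with a keyword embedding in a Transformer encoder and score each frame for the keyword. It is a minimal, assumed design: the module names, dimensions and single keyword token are illustrative, not the thesis architecture.

```python
# Minimal sketch of a cross-modal keyword-spotting head: a Transformer encoder
# attends jointly over per-frame video features and a keyword embedding, then
# predicts a per-frame score that the keyword is being mouthed.
# All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class CrossModalKWS(nn.Module):
    def __init__(self, video_dim=512, keyword_dim=300, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, d_model)       # project visual features
        self.keyword_proj = nn.Linear(keyword_dim, d_model)   # project keyword embedding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.classifier = nn.Linear(d_model, 1)               # per-frame keyword logit

    def forward(self, video_feats, keyword_emb):
        # video_feats: (B, T, video_dim); keyword_emb: (B, keyword_dim)
        v = self.video_proj(video_feats)
        k = self.keyword_proj(keyword_emb).unsqueeze(1)       # (B, 1, d_model) keyword token
        tokens = torch.cat([k, v], dim=1)                     # keyword token + frame tokens
        out = self.encoder(tokens)                            # joint cross-modal attention
        return self.classifier(out[:, 1:]).squeeze(-1)        # (B, T) frame-level logits

model = CrossModalKWS()
scores = model(torch.randn(2, 64, 512), torch.randn(2, 300))  # (2, 64)
```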

Second, to move beyond mouthings which are sparse, we propose different sign spotting approaches to automatically annotate signs in the continuous interpreted signing: (i) using visual sign language dictionaries in a multiple instance learning framework, (ii) exploiting the attention mechanism of a Transformer trained on a video-to-text sequence prediction task, (iii) pseudo-labelling from a strong sign recognition model, (iv) leveraging in-domain exemplars from previous approaches and sign representation similarities. All four approaches leverage the weakly-aligned subtitles and increase the vocabulary and density of automatic sign annotations. As a result, we obtain a large-scale, diverse, supervised dataset, and facilitate the learning of strong sign representations.
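As an illustration of approach (i), the following sketch treats a subtitle-aligned window of continuous signing as a bag of clip embeddings and scores it against a dictionary exemplar of the queried sign, taking the best-matching clip as the bag score in multiple-instance-learning fashion. The embedding functions and shapes are assumptions for illustration only.

```python
# Minimal sketch of sign spotting via multiple instance learning: a weakly-aligned
# subtitle only tells us that a queried sign occurs somewhere in a window of
# continuous signing, so the window is a bag of clip embeddings and the bag score
# is the best match against a dictionary exemplar of that sign.
import torch
import torch.nn.functional as F

def mil_spotting_score(window_embs, dictionary_emb):
    """window_embs: (T, D) embeddings of clips in the subtitle window,
    dictionary_emb: (D,) embedding of the dictionary exemplar for the query sign."""
    sims = F.cosine_similarity(window_embs, dictionary_emb.unsqueeze(0), dim=-1)  # (T,)
    bag_score, best_t = sims.max(dim=0)   # MIL: the bag is positive if its best clip matches
    return bag_score, best_t              # score plus the candidate location of the sign

# Training would push bag_score up for windows whose subtitle contains the query
# word and down for windows whose subtitle does not mention it.
window = F.normalize(torch.randn(32, 256), dim=-1)
exemplar = F.normalize(torch.randn(256), dim=-1)
score, t = mil_spotting_score(window, exemplar)
```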

Third, we explore sign language tasks that entail predicting sequences of signs: fingerspelling and continuous sign language recognition (CSLR). For fingerspelling, we propose a weakly-supervised approach to detect and recognise sequences of letters, with a multiple-hypothesis loss function to learn from noisy supervision. For CSLR, we design a multi-task model capable of also performing sign language retrieval, and demonstrate promising results in large-vocabulary settings.
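The sketch below gives one schematic reading of a multiple-hypothesis loss under noisy supervision: when several candidate letter sequences are plausible for a fingerspelled segment, a sequence loss is computed against each candidate and only the best-matching one supervises the model. CTC is used here purely for illustration; it is not necessarily the loss used in the thesis.

```python
# Minimal sketch of a multiple-hypothesis loss for weak supervision: compute a
# sequence loss against every candidate letter sequence and keep only the minimum,
# so noisy or ambiguous annotations do not penalise the model unfairly.
import torch
import torch.nn.functional as F

def multi_hypothesis_ctc(log_probs, hypotheses, input_len):
    """log_probs: (T, 1, C) frame-level log-probabilities for one clip,
    hypotheses: list of 1D LongTensors, each a candidate letter sequence."""
    losses = []
    for target in hypotheses:
        loss = F.ctc_loss(log_probs, target.unsqueeze(0),
                          input_lengths=torch.tensor([input_len]),
                          target_lengths=torch.tensor([len(target)]))
        losses.append(loss)
    return torch.stack(losses).min()      # only the closest hypothesis supervises the model

T, C = 50, 27                              # 26 letters + CTC blank (index 0)
log_probs = F.log_softmax(torch.randn(T, 1, C), dim=-1)
candidates = [torch.tensor([8, 5, 12, 12, 15]), torch.tensor([8, 5, 12, 16])]
loss = multi_hypothesis_ctc(log_probs, candidates, input_len=T)
```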

Finally, we explore obtaining stronger supervision from weak signals for a more general task, beyond the domain of sign language. Specifically, our focus shifts to verb understanding in video-language models – an important ability for modeling interactions among people, objects and the environment through space and time. For this task, we introduce a verb-focused contrastive framework consisting of two components: (i) leveraging pretrained large language models to create hard negatives for cross-modal contrastive learning; and (ii) enforcing a fine-grained alignment loss.
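The sketch below illustrates component (i) in schematic form: alongside the usual in-batch video-to-caption contrastive loss, each video is also paired with a caption whose verb has been altered (e.g. by prompting a large language model), and the model must score the original caption above this hard negative. This is an assumed, simplified version, not the exact loss of the thesis, and it omits the fine-grained alignment term.

```python
# Minimal sketch of contrastive training with verb-swapped hard negatives:
# standard InfoNCE over the batch, with one extra verb-altered caption per video
# appended as a hard negative column in the logits.
import torch
import torch.nn.functional as F

def verb_contrastive_loss(video_emb, caption_emb, hard_neg_emb, temperature=0.07):
    """video_emb, caption_emb, hard_neg_emb: (B, D), L2-normalised embeddings;
    hard_neg_emb[i] encodes caption i with its verb replaced."""
    batch_logits = video_emb @ caption_emb.t() / temperature                     # (B, B) in-batch pairs
    verb_neg = (video_emb * hard_neg_emb).sum(-1, keepdim=True) / temperature    # (B, 1) hard negatives
    logits = torch.cat([batch_logits, verb_neg], dim=1)                          # (B, B+1)
    labels = torch.arange(video_emb.size(0))                                     # diagonal = true caption
    return F.cross_entropy(logits, labels)

B, D = 8, 256
v, c, n = (F.normalize(torch.randn(B, D), dim=-1) for _ in range(3))
loss = verb_contrastive_loss(v, c, n)
```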

Authors


Momeni, L
Institution:
University of Oxford
Division:
MPLS
Department:
Engineering Science
Role:
Author

Contributors

Institution:
University of Oxford
Division:
MPLS
Department:
Engineering Science
Role:
Supervisor


Funding
Funder identifier:
https://ror.org/024bc3e07
Funding agency for:
Momeni, L
Grant:
D4D00240-DF00.02
Programme:
PhD Fellowship in Machine Perception, Speech Technology & Computer Vision (2022)


DOI:
Type of award:
DPhil
Level of award:
Doctoral
Awarding institution:
University of Oxford
