Conference item

Understanding co-speech gestures in-the-wild

Abstract:
Co-speech gestures play a vital role in non-verbal communication. In this paper, we introduce a new framework for co-speech gesture understanding in the wild. Specifically, we propose three new tasks and benchmarks to evaluate a model’s capability to comprehend gesture-speech-text associations: (i) gesture-based retrieval, (ii) gesture word spotting, and (iii) active speaker detection using gestures. We present a new approach that learns a tri-modal video-gesture-speech-text representation to solve these tasks. By leveraging a combination of a global phrase contrastive loss and a local gesture-word coupling loss, we demonstrate that a strong gesture representation can be learned in a weakly supervised manner from videos in the wild. Our learned representations outperform previous methods, including large vision-language models (VLMs). Further analysis reveals that speech and text modalities capture distinct gesture-related signals, underscoring the advantages of learning a shared tri-modal embedding space.
Publication status:
Published
Peer review status:
Peer reviewed

Access Document

Publisher copy:
10.1109/ICCV51701.2025.00930

Authors

More by this author
Institution:
University of Oxford
Division:
MPLS
Department:
Engineering Science
Role:
Author
ORCID:
0009-0005-2845-5570
More by this author
Institution:
University of Oxford
Division:
MPLS
Department:
Engineering Science
Role:
Author
More by this author
Institution:
University of Oxford
Division:
MPLS
Department:
Engineering Science
Role:
Author
ORCID:
0000-0002-3914-1754
More by this author
Institution:
University of Oxford
Division:
MPLS
Department:
Engineering Science
Oxford college:
Brasenose College
Role:
Author
ORCID:
0000-0002-8945-8573


Funder identifier:
https://ror.org/0439y7842
Grant:
EP/T028572/1


Publisher:
IEEE
Host title:
2025 IEEE/CVF International Conference on Computer Vision (ICCV)
Pages:
9977-9987
Publication date:
2026-04-29
Acceptance date:
2025-07-23
Event title:
International Conference on Computer Vision (ICCV 2025)
Event location:
Honolulu, Hawai'i, USA
Event website:
https://iccv.thecvf.com/
Event start date:
2025-10-19
Event end date:
2025-10-23
EISSN:
2380-7504
ISSN:
1550-5499
EISBN:
9798331587758
ISBN:
9798331587765


Language:
English
Pubs id:
2320675
Local pid:
pubs:2320675
Deposit date:
2025-11-10
