Conference item
Understanding co-speech gestures in-the-wild
- Abstract:
- Co-speech gestures play a vital role in non-verbal communication. In this paper, we introduce a new framework for co-speech gesture understanding in the wild. Specifically, we propose three new tasks and benchmarks to evaluate a model’s capability to comprehend gesture-speech-text associations: (i) gesture-based retrieval, (ii) gesture word spotting, and (iii) active speaker detection using gestures. We present a new approach that learns a tri-modal video-gesture-speech-text representation to solve these tasks. By leveraging a combination of global phrase contrastive loss and local gesture-word coupling loss, we demonstrate that a strong gesture representation can be learned in a weakly supervised manner from videos in the wild. Our learned representations outperform previous methods, including large vision-language models (VLMs). Further analysis reveals that speech and text modalities capture distinct gesture-related signals, underscoring the advantages of learning a shared tri-modal embedding space.
- Publication status:
- Published
- Peer review status:
- Peer reviewed
Access Document
- Files:
- Accepted manuscript (pdf, 4.6MB)
- Publisher copy:
- 10.1109/ICCV51701.2025.00930
- Funder:
- Engineering and Physical Sciences Research Council
- Funder identifier:
- https://ror.org/0439y7842
- Grant:
- EP/T028572/1
- Publisher:
- IEEE
- Host title:
- 2025 IEEE/CVF International Conference on Computer Vision (ICCV)
- Pages:
- 9977-9987
- Publication date:
- 2026-04-29
- Acceptance date:
- 2025-07-23
- Event title:
- International Conference on Computer Vision (ICCV 2025)
- Event location:
- Honolulu, Hawai'i, USA
- Event website:
- https://iccv.thecvf.com/
- Event start date:
- 2025-10-19
- Event end date:
- 2025-10-23
- DOI:
- 10.1109/ICCV51701.2025.00930
- EISSN:
- 2380-7504
- ISSN:
- 1550-5499
- EISBN:
- 9798331587758
- ISBN:
- 9798331587765
- Language:
- English
- Pubs id:
- 2320675
- Local pid:
- pubs:2320675
- Deposit date:
- 2025-11-10
- ARK identifier:
Terms of use
- Copyright holder:
- Hegde et al.
- Copyright date:
- 2026
- Rights statement:
- © 2026 IEEE
- Notes:
- This paper was presented at the International Conference on Computer Vision (ICCV 2025), 19th-23rd October 2025, Honolulu, Hawai'i, USA. The author accepted manuscript (AAM) of this paper has been made available under the University of Oxford's Open Access Publications Policy, and a CC BY public copyright licence has been applied.
- Licence:
- CC Attribution (CC BY)