Journal article

TIER-LOC: visual query-based video clip localization in fetal ultrasound videos with a multi-tier transformer

Abstract:
In this paper, we introduce the Visual Query-based Video Clip Localization (VQ-VCL) task for medical video understanding. Specifically, we aim to retrieve a video clip containing frames similar to a given exemplar frame from a given input video. To solve the task, we propose a novel visual query-based video clip localization model called TIER-LOC. TIER-LOC is designed to improve video clip retrieval, especially in fine-grained videos, by extracting features from different levels, i.e., coarse to fine-grained, referred to as TIERS. The aim is to utilize multi-Tier features to detect subtle differences and adapt to scale or resolution variations, leading to improved video clip retrieval. TIER-LOC has three main components: (1) a Multi-Tier Spatio-Temporal Transformer that fuses spatio-temporal features extracted from multiple Tiers of video frames with features from multiple Tiers of the visual query, enabling better video understanding; (2) a Multi-Tier, Dual Anchor Contrastive Loss to deal with real-world annotation noise, which can be notable at event boundaries and in videos featuring highly similar objects; and (3) a Temporal Uncertainty-Aware Localization Loss designed to reduce the model's sensitivity to imprecise event boundaries. This is achieved by relaxing hard boundary constraints, allowing the model to learn underlying class patterns rather than being influenced by individual noisy samples. To demonstrate the efficacy of TIER-LOC, we evaluate it on two ultrasound video datasets and an open-source egocentric video dataset. First, we develop a sonographer workflow assistive task model to detect standard-frame clips in fetal ultrasound heart sweeps. Second, we assess our model's performance in retrieving standard-frame clips for detecting fetal anomalies in routine ultrasound scans, using the large-scale PULSE dataset. Lastly, we test our model's performance on an open-source computer vision video dataset by creating a VQ-VCL fine-grained video dataset based on the Ego4D dataset. Our model outperforms the best-performing state-of-the-art model by 7%, 4%, and 4% on the three video datasets, respectively.
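To illustrate the boundary-relaxation idea behind the third component, the sketch below shows one common way to soften hard boundary targets: replace a one-hot start/end label with a Gaussian-smoothed target distribution and train with a soft cross-entropy. This is a minimal, hypothetical PyTorch example, not the authors' implementation; the function names, sigma parameter, and toy shapes are assumptions made for illustration.

```python
# Illustrative sketch only: soft boundary targets as one way to relax hard
# event-boundary constraints (not the exact TIER-LOC formulation).
import torch
import torch.nn.functional as F


def soft_boundary_targets(boundary_idx: torch.Tensor, num_frames: int, sigma: float = 2.0) -> torch.Tensor:
    """Turn hard boundary frame indices (B,) into Gaussian-smoothed targets (B, T)."""
    frames = torch.arange(num_frames, dtype=torch.float32)       # (T,)
    diff = frames[None, :] - boundary_idx[:, None].float()       # (B, T) distance to annotated boundary
    targets = torch.exp(-0.5 * (diff / sigma) ** 2)              # Gaussian bump around each annotation
    return targets / targets.sum(dim=1, keepdim=True)            # normalise to a distribution over frames


def soft_localization_loss(logits: torch.Tensor, boundary_idx: torch.Tensor, sigma: float = 2.0) -> torch.Tensor:
    """Soft cross-entropy between per-frame boundary logits (B, T) and smoothed targets."""
    targets = soft_boundary_targets(boundary_idx, logits.size(1), sigma)
    log_probs = F.log_softmax(logits, dim=1)
    return -(targets * log_probs).sum(dim=1).mean()


# Toy usage: 2 clips of 16 frames, annotated boundaries at frames 5 and 11.
logits = torch.randn(2, 16, requires_grad=True)
loss = soft_localization_loss(logits, torch.tensor([5, 11]))
loss.backward()
print(loss.item())
```

With a larger sigma the target distribution spreads over neighbouring frames, so a prediction a frame or two away from a noisy annotation is penalised less than under a hard one-hot target.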
Publication status:
Published
Peer review status:
Peer reviewed

Publisher copy:
10.1016/j.media.2025.103611

Authors:
University of Oxford, MPLS Division, Department of Engineering Science (Author)
University of Oxford, MPLS Division, Department of Engineering Science (Author)
University of Oxford, MPLS Division, Department of Engineering Science (Author)
University of Oxford, MSD Division, Department of Women's & Reproductive Health (Author)


Funder:
Engineering and Physical Sciences Research Council (EPSRC)
Funder identifier:
https://ror.org/0439y7842
Grant:
EP/T028572/1


Publisher:
Elsevier
Journal:
Medical Image Analysis
Volume:
103
Article number:
103611
Publication date:
2025-05-02
Acceptance date:
2025-04-15
DOI:
10.1016/j.media.2025.103611
EISSN:
1361-8423
ISSN:
1361-8415
PMID:
40344944


Language:
English
Keywords:
Pubs id:
2122351
Local pid:
pubs:2122351
Deposit date:
2025-06-24
