Gaze-assisted automatic captioning of fetal ultrasound videos using three-way multi-modal deep neural networks

Alsharid, M; Cai, Y; Sharma, H; Drukker, L; Noble, JA; Papageorghiou, AT

Journal article

Abstract:: In this work, we present a novel gaze-assisted natural language processing (NLP)-based video captioning model to describe routine second-trimester fetal ultrasound scan videos in a vocabulary of spoken sonography. The primary novelty of our multi-modal approach is that the learned video captioning model is built using a combination of ultrasound video, tracked gaze and textual transcriptions from speech recordings. The textual captions that describe the spatio-temporal scan video content are learnt from sonographer speech recordings. The generation of captions is assisted by sonographer gaze-tracking information reflecting their visual attention while performing live-imaging and interpreting a frozen image. To evaluate the effect of adding, or withholding, different forms of gaze on the video model, we compare spatio-temporal deep networks trained using three multi-modal configurations, namely: (1) a gaze-less neural network with only text and video as input, (2) a neural network additionally using real sonographer gaze in the form of attention maps, and (3) a neural network using automatically-predicted gaze in the form of saliency maps instead. We assess algorithm performance through established general text-based metrics (BLEU, ROUGE-L, F1 score), a domain-specific metric (ARS), and metrics that consider the richness and efficiency of the generated captions with respect to the scan video. Results show that the proposed gaze-assisted models can generate richer and more diverse captions for clinical fetal ultrasound scan videos than those without gaze, at the expense of perceived sentence structure. The results also show that the generated captions are similar to sonographer speech in terms of discussing the visual content and the scanning actions performed.
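
For readers unfamiliar with the text-based metrics named in the abstract, the following is a minimal sketch of how a ROUGE-L F-score can be computed between one generated caption and one reference caption. It is an illustration under stated assumptions, not the authors' evaluation code: the function names `lcs_length` and `rouge_l`, the whitespace tokenisation, the `beta` default and the example captions are all hypothetical and do not come from the paper.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Tokens match: extend the subsequence ending before both.
                curr.append(prev[j - 1] + 1)
            else:
                # No match: carry over the best length seen so far.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """ROUGE-L F-score from LCS-based precision and recall.

    Assumption: lower-cased whitespace tokenisation; published
    evaluations may tokenise and weight (beta) differently.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


# Hypothetical captions, invented purely for illustration.
generated = "this is the fetal heart"
spoken = "here is the fetal heart"
score = rouge_l(generated, spoken)
```

With `beta=1.0` the score is the harmonic mean of LCS precision and LCS recall, so a caption that shares a long in-order word sequence with the reference scores highly even when other words differ.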

Publication status:: Published

Peer review status:: Peer reviewed

Cite this record

APA Style

Alsharid, M., Cai, Y., Sharma, H., Drukker, L., Noble, J. A., & Papageorghiou, A. T. (2022). Gaze-assisted automatic captioning of fetal ultrasound videos using three-way multi-modal deep neural networks. Medical Image Analysis, 82.

MLA Style

Alsharid, M., et al. “Gaze-Assisted Automatic Captioning of Fetal Ultrasound Videos Using Three-Way Multi-Modal Deep Neural Networks.” Medical Image Analysis, vol. 82, Elsevier, 2022.

Chicago Style

Alsharid, M, Y Cai, H Sharma, L Drukker, JA Noble, and AT Papageorghiou. 2022. “Gaze-Assisted Automatic Captioning of Fetal Ultrasound Videos Using Three-Way Multi-Modal Deep Neural Networks.” Medical Image Analysis 82.

Access Document

Files:: Alsharid_et_al_2022_Gaze-assisted_automatic_captioning.pdf

(Version of record, pdf, 4.3MB)

Publisher copy:: 10.1016/j.media.2022.102630

Authors

Alsharid, M

Sub department:: DF ENGINEERING SCIENCE; GR KELLOGG COLLEGE
Role:: Author

Cai, Y

Role:: Author

Sharma, H

Role:: Author

Drukker, L

Role:: Author

Noble, JA

Institution:: University of Oxford
Division:: MPLS
Department:: Engineering Science
Oxford college:: St Hilda's College
Role:: Author
ORCID:: 0000-0002-3060-3772

European Commission

Grant:: 694581

Publisher:: Elsevier
Journal:: Medical Image Analysis
Volume:: 82
Article number:: 102630
Publication date:: 2022-09-17
Acceptance date:: 2022-09-13
DOI:: 10.1016/j.media.2022.102630
EISSN:: 1361-8423
ISSN:: 1361-8415
Pmid:: 36223683

Language:: English
Keywords:: fetal ultrasound; video captioning; audio–visual; FFR; gaze tracking; multi-modal
Pubs id:: 1286080
Local pid:: pubs:1286080
Deposit date:: 2022-11-21

Terms of use

Copyright holder:: Alsharid et al.

Licence:: CC Attribution (CC BY)
