Frozen in time: A joint video and image encoder for end-to-end retrieval

Bain, M; Nagrani, A; Varol, G; Zisserman, A

Conference item

Abstract:: Our objective in this work is video-text retrieval, in particular a joint embedding that enables efficient text-to-video retrieval. The challenges in this area include the design of the visual architecture and the nature of the training data, in that the available large-scale video-text training datasets, such as HowTo100M, are noisy and hence competitive performance is achieved only at scale through large amounts of compute. We address both these challenges in this paper. We propose an end-to-end trainable model that is designed to take advantage of both large-scale image and video captioning datasets. Our model is an adaptation and extension of the recent ViT and Timesformer architectures, and consists of attention in both space and time. The model is flexible and can be trained on both image-text and video-text datasets, either independently or in conjunction. It is trained with a curriculum learning schedule that begins by treating images as ‘frozen’ snapshots of video, and then gradually learns to attend to increasing temporal context when trained on video datasets. We also provide a new video-text pretraining dataset, WebVid-2M, comprising over two million videos with weak captions scraped from the internet. Despite training on datasets that are an order of magnitude smaller, we show that this approach yields state-of-the-art results on standard downstream video-retrieval benchmarks including MSR-VTT, MSVD, DiDeMo and LSMDC.
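
As a rough illustration of what retrieval in a joint embedding space involves, the sketch below ranks a few items by cosine similarity to a text query. It is a minimal sketch under stated assumptions, not the authors' implementation: the embeddings are hand-written toy vectors rather than outputs of the encoders described in the abstract, and the names `l2_normalize`, `cosine_similarity` and `rank_videos` are hypothetical helpers introduced only for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def rank_videos(text_embedding, video_embeddings):
    """Return item ids sorted from most to least similar to the text query."""
    scores = {
        item_id: cosine_similarity(text_embedding, emb)
        for item_id, emb in video_embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings (assumed values). In a joint space, a still image can sit
# alongside videos, here imagined as a single-frame "video".
video_embeddings = {
    "video_a": [0.9, 0.1, 0.0],
    "video_b": [0.0, 1.0, 0.2],
    "still_image_c": [0.1, 0.0, 1.0],
}
query_embedding = [1.0, 0.2, 0.0]  # stands in for an encoded text query

ranking = rank_videos(query_embedding, video_embeddings)
print(ranking)
```

Because both modalities live in the same space, the gallery embeddings can be computed once and reused for every query, which is what makes this style of retrieval efficient.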

Publication status:: Published

Peer review status:: Peer reviewed

Cite this record

APA Style

Bain, M., Nagrani, A., Varol, G., & Zisserman, A. (2022). Frozen in time: A joint video and image encoder for end-to-end retrieval. 2021 International Conference on Computer Vision (ICCV 2021), 1708–1718.

MLA Style

Bain, M, et al. “Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval.” 2021 International Conference on Computer Vision (ICCV 2021), 2022, pp. 1708–18.

Chicago Style

Bain, M, A Nagrani, G Varol, and A Zisserman. 2022. “Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval.” In 2021 International Conference on Computer Vision (ICCV 2021), 1708–18. IEEE.

Access Document

Files:: Bain_et_al_2021_Frozen_in_time.pdf

(Accepted manuscript, pdf, 4.0MB)

Publisher copy:: 10.1109/ICCV48922.2021.00175

Authors

Bain, M

Role:: Author

Nagrani, A

Role:: Author

Varol, G

Role:: Author

Zisserman, A

Institution:: University of Oxford
Division:: MPLS
Department:: Engineering Science
Oxford college:: Brasenose College
Role:: Author
ORCID:: 0000-0002-8945-8573

Publisher:: IEEE
Host title:: 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
Pages:: 1708-1718
Publication date:: 2022-02-28
Acceptance date:: 2021-07-23
Event title:: 2021 International Conference on Computer Vision (ICCV 2021)
Event location:: Virtual Event
Event website:: https://iccv2021.thecvf.com/home
Event start date:: 2021-10-11
Event end date:: 2021-10-17
DOI:: 10.1109/ICCV48922.2021.00175
EISSN:: 2380-7504
EISBN:: 978-1-6654-2812-5
ISBN:: 978-1-6654-2813-2

Language:: English
Keywords:: FFR
Pubs id:: 1233022
Local pid:: pubs:1233022
Deposit date:: 2022-01-19
ARK identifier:: ark:/29072/ora_688f8ab5aeb7469691eb8a8b5cb03f9e

Terms of use

Copyright holder:: IEEE
Notes:: This is the accepted manuscript version of the paper. The final version is available online from IEEE at https://doi.org/10.1109/ICCV48922.2021.00175

Licence:: Terms and Conditions of Use for Oxford University Research Archive
