LAVT: Language-Aware Vision Transformer for referring image segmentation

Yang, Z; Wang, J; Tang, Y; Chen, K; Zhao, H; Torr, PHS

AI Collection

Conference item

Abstract:: Referring image segmentation is a fundamental vision-language task that aims to segment an object referred to by a natural language expression from an image. One of the key challenges behind this task is leveraging the referring expression to highlight relevant positions in the image. A paradigm for tackling this problem is to use a powerful vision-language (“cross-modal”) decoder to fuse features independently extracted from a vision encoder and a language encoder. Recent methods have made remarkable advancements in this paradigm by exploiting Transformers as cross-modal decoders, concurrent with the Transformer's overwhelming success in many other vision-language tasks. Adopting a different approach in this work, we show that significantly better cross-modal alignments can be achieved through the early fusion of linguistic and visual features in intermediate layers of a vision Transformer encoder network. By conducting cross-modal feature fusion in the visual feature encoding stage, we can leverage the well-proven correlation modeling power of a Transformer encoder for excavating helpful multi-modal context. This way, accurate segmentation results are readily harvested with a light-weight mask predictor. Without bells and whistles, our method surpasses the previous state-of-the-art methods on RefCOCO, RefCOCO+, and G-Ref by large margins.
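
The early-fusion idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `fuse_early`, the residual weight `gate`, and the use of plain scaled dot-product attention are all assumptions made for illustration. Each visual position attends over the word features, and the resulting language context is added back to the visual feature before it would be passed to the next encoder stage.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_early(visual, words, gate=1.0):
    """Hypothetical early-fusion step (illustrative only).

    visual: list of N position vectors, each of length C
    words:  list of T word vectors, each of length C
    gate:   assumed scalar weight on the language context
    Returns N language-aware vectors of length C.
    """
    scale = 1.0 / math.sqrt(len(words[0]))
    fused = []
    for v in visual:
        # Attention of this visual position over every word.
        weights = softmax([dot(v, w) * scale for w in words])
        # Language context: attention-weighted sum of word vectors.
        ctx = [sum(a * w[c] for a, w in zip(weights, words))
               for c in range(len(v))]
        # Residual combination of visual feature and language context.
        fused.append([vc + gate * cc for vc, cc in zip(v, ctx)])
    return fused


if __name__ == "__main__":
    visual = [[1.0, 0.0], [0.0, 1.0]]   # two image positions
    words = [[1.0, 0.0], [0.0, 1.0]]    # two word embeddings
    for row in fuse_early(visual, words):
        print([round(x, 3) for x in row])
```

In this sketch, setting `gate=0.0` leaves the visual features unchanged, which makes the contribution of the language pathway easy to isolate; a real encoder would use learned projections rather than raw dot products.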

Publication status:: Published

Peer review status:: Peer reviewed

Cite this record

APA Style

Yang, Z., Wang, J., Tang, Y., Chen, K., Zhao, H., & Torr, P. H. S. (2022). LAVT: Language-Aware Vision Transformer for referring image segmentation. Conference on Computer Vision and Pattern Recognition (CVPR 2022), 18134–18144.

MLA Style

Yang, Z, et al. “LAVT: Language-Aware Vision Transformer for Referring Image Segmentation.” Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022, pp. 18134–44.

Chicago Style

Yang, Z, J Wang, Y Tang, K Chen, H Zhao, and PHS Torr. 2022. “LAVT: Language-Aware Vision Transformer for Referring Image Segmentation.” In Conference on Computer Vision and Pattern Recognition (CVPR 2022), 18134–44. IEEE.

Access Document

Files:: Yang_et_al_2022_LAVT_Language_Aware.pdf

(Accepted manuscript, pdf, 3.1MB)

Publisher copy:: 10.1109/CVPR52688.2022.01762

Authors

Yang, Z

Role:: Author

Wang, J

Role:: Author

Tang, Y

Role:: Author

Chen, K

Role:: Author

Zhao, H

Role:: Author

Torr, PHS

Role:: Author

Funding

John Fell Fund

Publisher:: IEEE
Host title:: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Pages:: 18134-18144
Publication date:: 2022-09-27
Acceptance date:: 2022-06-19
Event title:: Conference on Computer Vision and Pattern Recognition (CVPR 2022)
Event location:: New Orleans, Louisiana
Event website:: https://cvpr2022.thecvf.com/
Event start date:: 2022-06-19
Event end date:: 2022-06-24
DOI:: 10.1109/CVPR52688.2022.01762
EISSN:: 2575-7075
ISSN:: 1063-6919
EISBN:: 9781665469463
ISBN:: 9781665469470

Language:: English
Keywords:: FFR; grouping and shape analysis; segmentation; vision + language
Pubs id:: 1272209
Local pid:: pubs:1272209
Deposit date:: 2022-08-01
ARK identifier:: ark:/29072/ora_ec222c942b194f51ad76a7ed236899cf

Terms of use

Copyright holder:: IEEE
Notes:: This is the accepted manuscript version of the paper. The final version is available online from IEEE at: https://doi.org/10.1109/CVPR52688.2022.01762

Licence:: Terms and Conditions of Use for Oxford University Research Archive
