Language matters: a weakly supervised Vision-Language pre-training approach for scene text detection and spotting

Xue, C; Hao, Y; Lu, S; Torr, P; Bai, S

Conference item

Abstract:: Recently, Vision-Language Pre-training (VLP) techniques have greatly benefited various vision-language tasks by jointly learning visual and textual representations, which intuitively should help Optical Character Recognition (OCR) tasks given the rich visual and textual information in scene text images. However, these methods cannot cope well with OCR tasks because of the difficulty of both instance-level text encoding and image-text pair acquisition (i.e., images and the texts captured in them). This paper presents a weakly supervised pre-training method, oCLIP, which acquires effective scene text representations by jointly learning and aligning visual and textual information. Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features, respectively, as well as a visual-textual decoder that models the interaction between textual and visual features for learning effective scene text representations. With the learning of textual features, the pre-trained model can attend to texts in images well, with character awareness. In addition, these designs enable learning from weakly annotated texts (i.e., partial texts in images without text bounding boxes), which greatly mitigates the data annotation constraint. Experiments over the weakly annotated images in ICDAR2019-LSVT show that our pre-trained model improves F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks, respectively. In addition, the proposed method consistently outperforms existing pre-training techniques across multiple public datasets (e.g., +3.2% and +1.3% for Total-Text and CTW1500).
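
The abstract says that visual and textual information are jointly learned and aligned, but does not state the training objective. As an illustration only, the sketch below assumes a CLIP-style symmetric contrastive loss over a batch of matched image and text embeddings. The function names, the temperature value, and the toy vectors are hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against the index of the true match."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Assumed CLIP-style loss: the k-th image and k-th text form the positive pair."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine similarity between every image and every text, scaled by temperature.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]
    # Image-to-text direction: each row should peak on the diagonal.
    i2t = sum(cross_entropy_row(sim[k], k) for k in range(n)) / n
    # Text-to-image direction: each column should peak on the diagonal.
    t2i = sum(cross_entropy_row([sim[r][k] for r in range(n)], k) for k in range(n)) / n
    return 0.5 * (i2t + t2i)


# Toy embeddings: matched pairs give a low loss, swapped pairs a high one.
images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = symmetric_contrastive_loss(images, matched_texts)
loss_swapped = symmetric_contrastive_loss(images, swapped_texts)
```

In this toy setting the loss is near zero when each image embedding points in the same direction as its own text embedding, and large when the pairs are swapped, which is the behaviour an alignment objective of this kind is meant to reward.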

Publication status:: Published

Peer review status:: Peer reviewed

Cite this record

APA Style

Xue, C., Hao, Y., Lu, S., Torr, P., & Bai, S. (2022). Language matters: a weakly supervised Vision-Language pre-training approach for scene text detection and spotting.

MLA Style

Xue, C., et al. Language Matters: a Weakly Supervised Vision-Language Pre-Training Approach for Scene Text Detection and Spotting. Springer, 2022.

Chicago Style

Xue, C, Y Hao, S Lu, P Torr, and S Bai. 2022. “Language Matters: a Weakly Supervised Vision-Language Pre-Training Approach for Scene Text Detection and Spotting.” In . Lecture Notes in Computer Science. Springer.

Access Document

Files:: Xue_et_al_2022_Language_matters_a.pdf

(Accepted manuscript, pdf, 911.5KB)

Publisher copy:: 10.1007/978-3-031-19815-1_17

Authors

Xue, C

Role:: Author

Hao, Y

Role:: Author

Lu, S

Role:: Author

Torr, P

Institution:: University of Oxford
Division:: MPLS
Department:: Engineering Science
Role:: Author

Bai, S

Role:: Author

Publisher:: Springer
Series:: Lecture Notes in Computer Science
Series number:: 13688
Publication date:: 2022-10-20
Acceptance date:: 2022-10-25
Event title:: 2022 Computer Vision and Pattern Recognition (CVPR 2022)
Event location:: New Orleans, Louisiana, USA
Event website:: https://cvpr2022.thecvf.com/
Event start date:: 2022-06-19
Event end date:: 2022-06-24
DOI:: 10.1007/978-3-031-19815-1_17
EISBN:: 9783031198151
ISBN:: 9783031198144

Language:: English
Keywords:: FFR
Pubs id:: 1302159
Local pid:: pubs:1302159
Deposit date:: 2022-11-11

Terms of use

Copyright holder:: Xue et al.
Notes:: This is the accepted manuscript version of the paper. The final version is available online from Springer at: https://doi.org/10.1007/978-3-031-19815-1_17

Licence:: Terms and Conditions of Use for Oxford University Research Archive
