Journal article
Patch-based separable transformer for visual recognition
- Abstract:
- The computational complexity of transformers limits their wide deployment in frameworks for visual recognition. Recent work [9] significantly accelerates network processing by reducing the resolution at the beginning of the network; however, it remains hard to generalize directly to other downstream tasks, e.g. object detection and segmentation, as CNNs can. In this paper, we present a transformer-based architecture that retains both local and global interactions within the network and is transferable to other downstream tasks. The proposed architecture reforms the original full spatial self-attention into pixel-wise local attention and patch-wise global attention. This factorization reduces the computational cost while retaining information at different granularities, which helps generate the multi-scale features required by different tasks. By exploiting the factorized attention, we construct a Separable Transformer (SeT) for visual modeling. Experimental results show that SeT outperforms previous state-of-the-art transformer-based approaches and its CNN counterparts on three major tasks: image classification, object detection and instance segmentation.
- Publication status:
- Published
- Peer review status:
- Peer reviewed
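The cost saving that the abstract attributes to factorized attention can be illustrated with a rough count of query-key pairs. This is only a sketch under stated assumptions, not the SeT design itself: the abstract does not specify window sizes or how patch tokens are formed, so the non-overlapping `patch` x `patch` tiling, the one-token-per-patch global step, and both function names are hypothetical.

```python
def full_attention_pairs(height, width):
    """Query-key pairs when every pixel attends to every other pixel."""
    n = height * width
    return n * n


def factorized_attention_pairs(height, width, patch):
    """Query-key pairs for a hypothetical local + global split.

    Assumption (not taken from the paper): the feature map is tiled into
    non-overlapping patch x patch windows. Pixels attend only within their
    own window (local step), and one token per window attends to every
    window token (global step).
    """
    if height % patch or width % patch:
        raise ValueError("height and width must be divisible by patch")
    num_patches = (height // patch) * (width // patch)
    pixels_per_patch = patch * patch
    local_pairs = num_patches * pixels_per_patch ** 2
    global_pairs = num_patches ** 2
    return local_pairs + global_pairs


if __name__ == "__main__":
    # Illustrative sizes only; they are not values reported in the article.
    h = w = 56
    p = 7
    print("full:", full_attention_pairs(h, w))
    print("factorized:", factorized_attention_pairs(h, w, p))
```

Under these assumed sizes the factorized count is far smaller than the full count, which is the kind of reduction the abstract describes; the actual savings in SeT depend on details given only in the article.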
Access Document
- Publisher copy:
- 10.1109/TPAMI.2022.3231725
Authors
- Publisher:
- IEEE
- Journal:
- IEEE Transactions on Pattern Analysis and Machine Intelligence
- Volume:
- 45
- Issue:
- 7
- Pages:
- 9241 - 9247
- Publication date:
- 2022-12-23
- Acceptance date:
- 2022-12-18
- DOI:
- 10.1109/TPAMI.2022.3231725
- EISSN:
- 1939-3539
- ISSN:
- 0162-8828
- Language:
- English
- Keywords:
- Pubs id:
- 1325768
- Local pid:
- pubs:1325768
- Deposit date:
- 2023-02-10
Terms of use
- Copyright holder:
- IEEE
- Copyright date:
- 2022
- Rights statement:
- © 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
- Notes:
- This is the accepted manuscript version of the article. The final version is available from IEEE at: 10.1109/TPAMI.2022.3231725