Conference item
CounTR: transformer-based generalised visual counting
- Abstract:
- In this paper, we consider the problem of generalised visual object counting, with the goal of developing a computational model for counting the number of objects from arbitrary semantic categories, using an arbitrary number of “exemplars”, i.e. zero-shot or few-shot counting. To this end, we make the following four contributions: (1) We introduce a novel transformer-based architecture for generalised visual object counting, termed Counting TRansformer (CounTR), which explicitly captures the similarity between image patches, or between patches and the given “exemplars”, with an attention mechanism; (2) We adopt a two-stage training regime that first pre-trains the model with self-supervised learning, followed by supervised fine-tuning; (3) We propose a simple, scalable pipeline for synthesising training images containing a large number of instances or instances from different semantic categories, explicitly forcing the model to make use of the given “exemplars”; (4) We conduct thorough ablation studies on the large-scale counting benchmark FSC-147 and demonstrate state-of-the-art performance in both the zero-shot and few-shot settings. Project page: https://verg-avesta.github.io/CounTR_Webpage/.
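The following is a minimal, hypothetical sketch of the cross-attention idea described in contribution (1): encoded image-patch tokens attend to encoded "exemplar" tokens, so that patch/exemplar similarity is expressed directly in the attention weights. This is not the authors' released implementation; all module names, dimensions, and the placement of a downstream density-map head are assumptions, and only the few-shot case (at least one exemplar) is covered.

```python
# Illustrative sketch only (not the CounTR implementation): image-patch
# tokens attend to exemplar tokens via multi-head cross-attention.
import torch
import torch.nn as nn


class ExemplarCrossAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_tokens: torch.Tensor,
                exemplar_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens:    (B, N_patches, dim) encoded image patches
        # exemplar_tokens: (B, K, dim) K >= 1 encoded exemplar crops
        attended, _ = self.attn(query=patch_tokens,
                                key=exemplar_tokens,
                                value=exemplar_tokens)
        # Residual connection + normalisation; a density-map head
        # predicting per-pixel counts would follow in a full model.
        return self.norm(patch_tokens + attended)


if __name__ == "__main__":
    block = ExemplarCrossAttention()
    patches = torch.randn(1, 196, 256)    # e.g. a 14x14 patch grid
    exemplars = torch.randn(1, 3, 256)    # three exemplar "shots"
    print(block(patches, exemplars).shape)  # torch.Size([1, 196, 256])
```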
- Publication status:
- Published
- Peer review status:
- Peer reviewed
Access Document
- Files:
- Version of record (PDF, 4.0MB)
- Publication website:
- https://bmvc2022.mpi-inf.mpg.de/370/
Authors
- Liu, Chang
- Zhong, Yujie
- Zisserman, Andrew
- Xie, Weidi
- Publisher:
- BMVA Press
- Host title:
- Proceedings of the 33rd British Machine Vision Conference (BMVC 2022)
- Article number:
- 370
- Publication date:
- 2022-11-25
- Acceptance date:
- 2022-09-30
- Event title:
- 33rd British Machine Vision Conference (BMVC 2022)
- Event location:
- London, UK
- Event website:
- https://bmvc2022.org/
- Event start date:
- 2022-11-21
- Event end date:
- 2022-11-24
- Language:
- English
- Pubs id:
- 1315255
- Local pid:
- pubs:1315255
- Deposit date:
- 2022-12-15
Terms of use
- Copyright holder:
- Liu et al.
- Copyright date:
- 2022
- Rights statement:
- © 2022. The copyright of this document resides with its authors. It may be distributed unchanged freely in print or electronic forms.