ZeroRel: Multimodal Transformer-Guided Zero-Shot Relationship Retrieval for Generalized Scene Graph Generation

Khan, MJ; Siddiqui, AM; Rasool, M; Ali, H; Ghafoor, U; Khan, J

Journal article

Abstract: Scene Graph Generation (SGG) aims to represent an image’s objects and their pairwise relationships in a structured graph for downstream visual reasoning. However, conventional SGG models struggle with long-tail predicate distributions and closed-world vocabularies, resulting in poor generalization to rare or unseen relationships. We propose a neurosymbolic framework for zero-shot relationship retrieval that addresses these challenges by integrating deep visual features with external commonsense knowledge. Our model first detects objects and refines them via positional overlap and semantic similarity. It then retrieves candidate predicates through two complementary channels: (1) a visual-textual prototype retrieval that aligns subject-object representations with a broad predicate embedding space, and (2) a knowledge-graph-constrained retrieval that ranks relationships using heterogeneous commonsense graphs. A calibration and late-fusion module combines these channels, balancing confidence between head and tail classes. Evaluations on the Visual Genome (VG) and GQA benchmarks under zero-shot and open-vocabulary settings show strong performance in the strict zero-shot setting. On the reported VG split, ZeroRel reaches zR@100 = 37.1%, improving on the strongest prior zero-shot baseline in our comparison table (KnowZRel, 35.7%) while maintaining competitive overall recall and improving mean recall on rare predicates. The model also generalizes to GQA without retraining, demonstrating robust cross-dataset transfer. Ablations on knowledge sources and embedding models show that a heterogeneous Common Sense Knowledge Graph (CSKG) with ComplEx embeddings yields the best performance. These results indicate that combining visual prototype retrieval with structured knowledge retrieval improves coverage of rare and unseen relationships without sacrificing scene-graph quality on frequent predicates.
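The abstract describes a calibration and late-fusion step that combines the two retrieval channels but does not give its exact form. The sketch below is only an illustration of one common way to do this, not the authors' method: each channel's raw predicate scores are calibrated with a temperature-scaled softmax and then mixed with a convex weight. The function names (`softmax`, `late_fusion`, `top_k`), the parameters (`alpha`, `t_visual`, `t_kg`), and the toy scores are all assumptions introduced for this example.

```python
import math


def softmax(scores, temperature=1.0):
    """Temperature-scaled softmax over a list of raw scores (hypothetical calibration)."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(visual_scores, kg_scores, alpha=0.5, t_visual=1.0, t_kg=1.0):
    """Mix calibrated per-predicate probabilities from the two channels.

    alpha weights the visual-textual channel and (1 - alpha) the
    knowledge-graph channel; both values are assumptions, not taken
    from the paper.
    """
    if len(visual_scores) != len(kg_scores):
        raise ValueError("both channels must score the same predicate list")
    p_visual = softmax(visual_scores, t_visual)
    p_kg = softmax(kg_scores, t_kg)
    return [alpha * v + (1.0 - alpha) * k for v, k in zip(p_visual, p_kg)]


def top_k(predicates, fused, k):
    """Return the k predicate names with the highest fused probability."""
    ranked = sorted(zip(predicates, fused), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Toy scores for one subject-object pair (made up for illustration).
predicates = ["on", "riding", "holding"]
visual = [2.0, 0.5, 0.1]   # visual channel favours a frequent predicate
kg = [0.2, 3.0, 0.1]       # knowledge channel favours a rarer predicate
fused = late_fusion(visual, kg, alpha=0.5)
ranking = top_k(predicates, fused, k=2)
```

Because each channel is normalized before mixing, the fused scores still sum to one, and shifting `alpha` toward the knowledge-graph channel is one simple way to give rare predicates more weight against frequent ones.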

Publication status: Published

Peer review status: Peer reviewed

Cite this record

APA Style

Khan, M. J., Siddiqui, A. M., Rasool, M., Ali, H., Ghafoor, U., & Khan, J. (2026). ZeroRel: Multimodal Transformer-Guided Zero-Shot Relationship Retrieval for Generalized Scene Graph Generation. Signal, Image and Video Processing, 20(7).

MLA Style

Khan, MJ, et al. “ZeroRel: Multimodal Transformer-Guided Zero-Shot Relationship Retrieval for Generalized Scene Graph Generation.” Signal, Image and Video Processing, vol. 20, no. 7, 2026.

Chicago Style

Khan, MJ, AM Siddiqui, M Rasool, H Ali, U Ghafoor, and J Khan. 2026. “ZeroRel: Multimodal Transformer-Guided Zero-Shot Relationship Retrieval for Generalized Scene Graph Generation.” Signal, Image and Video Processing 20 (7).