Journal article

ZeroRel: Multimodal Transformer-Guided Zero-Shot Relationship Retrieval for Generalized Scene Graph Generation

Abstract:
Scene Graph Generation (SGG) aims to represent an image’s objects and their pairwise relationships in a structured graph for downstream visual reasoning. However, conventional SGG models struggle with long-tail predicate distributions and closed-world vocabularies, resulting in poor generalization to rare or unseen relationships. We propose a neurosymbolic framework for zero-shot relationship retrieval that addresses these challenges by integrating deep visual features with external commonsense knowledge. Our model first detects objects and refines them via positional overlap and semantic similarity. It then retrieves candidate predicates through two complementary channels: (1) a visual-textual prototype retrieval that aligns subject-object representations with a broad predicate embedding space, and (2) a knowledge-graph-constrained retrieval that ranks relationships using heterogeneous commonsense graphs. A calibration and late-fusion module combines these channels, balancing confidence between head and tail classes. Evaluations on the Visual Genome (VG) and GQA benchmarks under zero-shot and open-vocabulary settings show strong strict zero-shot performance. On the reported VG split, ZeroRel reaches zR@100 = 37.1%, improving on the strongest prior zero-shot baseline in our comparison table (KnowZRel, 35.7%) while maintaining competitive overall recall and improved mean recall on rare predicates. The model also generalizes to GQA without retraining, demonstrating robust cross-dataset transfer. Ablations on knowledge sources and embedding models show that a heterogeneous Common Sense Knowledge Graph (CSKG) with ComplEx embeddings yields the best performance. These results indicate that combining visual prototype retrieval with structured knowledge retrieval improves coverage of rare and unseen relationships without sacrificing scene-graph quality on frequent predicates.
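
The calibration and late-fusion step described in the abstract can be illustrated with a minimal sketch. This record does not give the actual formulation, so everything here is an assumption: the function names (`calibrate`, `fuse_channels`), the use of temperature-scaled softmax for calibration, and the convex weight `alpha` are hypothetical stand-ins rather than the authors' method.

```python
import math


def calibrate(scores, temperature=1.0):
    """Turn raw predicate scores into probabilities via a temperature-scaled
    softmax (an assumed calibration scheme, not taken from the paper)."""
    peak = max(scores.values())
    exps = {p: math.exp((s - peak) / temperature) for p, s in scores.items()}
    total = sum(exps.values())
    return {p: e / total for p, e in exps.items()}


def fuse_channels(visual_scores, kg_scores, alpha=0.5, t_visual=1.0, t_kg=1.0):
    """Late fusion of two retrieval channels for one subject-object pair.

    visual_scores: raw scores from the (hypothetical) prototype channel
    kg_scores:     raw scores from the (hypothetical) knowledge-graph channel
    alpha:         weight on the visual channel; 1 - alpha goes to the KG channel
    Returns (predicate, fused probability) pairs, best first.
    """
    p_visual = calibrate(visual_scores, t_visual)
    p_kg = calibrate(kg_scores, t_kg)
    # A predicate missing from one channel contributes probability 0 there.
    predicates = set(p_visual) | set(p_kg)
    fused = {
        p: alpha * p_visual.get(p, 0.0) + (1.0 - alpha) * p_kg.get(p, 0.0)
        for p in predicates
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy scores for a single (person, horse) pair; the values are illustrative.
    visual = {"on": 2.0, "riding": 1.0, "holding": 0.0}
    kg = {"on": 0.5, "riding": 2.5, "holding": 0.0}
    for predicate, prob in fuse_channels(visual, kg, alpha=0.5):
        print(f"{predicate}: {prob:.3f}")
```

In this toy setup, a larger `alpha` favors whatever the visual channel ranks highly (often frequent predicates), while a smaller `alpha` lets the knowledge channel promote predicates with weak visual evidence. The balance the paper actually uses between head and tail classes is not specified in this record.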
Publication status:
Published
Peer review status:
Peer reviewed

Publisher copy:
10.1007/s11760-026-05439-7

Authors


Publisher:
Springer
Journal:
Signal, Image and Video Processing
Volume:
20
Issue:
7
Article number:
411
Publication date:
2026-06-12
Acceptance date:
2026-05-11
DOI:
10.1007/s11760-026-05439-7
EISSN:
1863-1711
ISSN:
1863-1703


Language:
English
Keywords:
Source identifiers:
4226724
Deposit date:
2026-06-12
ARK identifier:
This ORA record was generated from metadata provided by an external service. It has not been edited by the ORA Team.
