Conference item

Agent-X: evaluating deep multimodal reasoning in vision-centric agentic tasks

Abstract:
Deep reasoning is fundamental for solving complex tasks, especially in vision-centric scenarios that demand sequential, multimodal understanding. However, existing benchmarks typically evaluate agents with fully synthetic, single-turn queries and limited visual modalities, and they lack a framework to assess reasoning quality over multiple steps as required in real-world settings. To address this, we introduce Agent-X, a large-scale benchmark for evaluating vision-centric agents’ multi-step and deep reasoning capabilities in real-world, multimodal settings. Agent-X features 828 agentic tasks with authentic visual contexts, including images, multi-image comparisons, videos, and instructional text. These tasks span six major agentic environments: general visual reasoning, web browsing, security and surveillance, autonomous driving, sports, and math reasoning. Our benchmark requires agents to integrate tool use with explicit, stepwise decision-making in these diverse settings. In addition, we propose a fine-grained, step-level evaluation framework that assesses the correctness and logical coherence of each reasoning step and the effectiveness of tool usage throughout the task. Our results reveal that even the best-performing models, including GPT, Gemini, and Qwen families, struggle to solve multi-step vision tasks, achieving less than 50% full-chain success. These findings highlight key bottlenecks in current LMM reasoning and tool-use capabilities and identify future research directions in vision-centric agentic reasoning models. Our data and code are publicly available.
Publication status:
Accepted
Peer review status:
Reviewed (other)

Publication website:
https://arxiv.org/pdf/2505.24876

Publisher:
Berkeley RDI
Publication date:
2025-08-02
Acceptance date:
2025-08-02
Event title:
Agentic AI Summit 2025
Event location:
Berkeley, CA, USA
Event website:
https://rdi.berkeley.edu/events/agentic-ai-summit
Event start date:
2025-08-02
Event end date:
2025-08-02


Language:
English
Pubs id:
2281898
Local pid:
pubs:2281898
Deposit date:
2025-08-18