Conference item
Agent-X: evaluating deep multimodal reasoning in vision-centric agentic tasks
- Abstract:
- Deep reasoning is fundamental for solving complex tasks, especially in vision-centric scenarios that demand sequential, multimodal understanding. However, existing benchmarks typically evaluate agents with fully synthetic, single-turn queries and limited visual modalities, and they lack a framework to assess reasoning quality over the multiple steps required in real-world settings. To address this, we introduce Agent-X, a large-scale benchmark for evaluating vision-centric agents’ multistep and deep reasoning capabilities in real-world, multimodal settings. Agent-X features 828 agentic tasks with authentic visual contexts, including images, multi-image comparisons, videos, and instructional text. These tasks span six major agentic environments: general visual reasoning, web browsing, security and surveillance, autonomous driving, sports, and math reasoning. Our benchmark requires agents to integrate tool use with explicit, stepwise decision-making in these diverse settings. In addition, we propose a fine-grained, step-level evaluation framework that assesses the correctness and logical coherence of each reasoning step and the effectiveness of tool usage throughout the task. Our results reveal that even the best-performing models, including GPT, Gemini, and Qwen families, struggle to solve multi-step vision tasks, achieving less than 50% full-chain success. These findings highlight key bottlenecks in current LMM reasoning and tool-use capabilities and identify future research directions in vision-centric agentic reasoning models. Our data and code are publicly available.
- Publication status:
- Accepted
- Peer review status:
- Reviewed (other)
Access Document
- Files:
- Accepted manuscript (pdf, 6.3MB)
- Publication website:
- https://arxiv.org/pdf/2505.24876
Authors
- Publisher:
- Berkeley RDI
- Publication date:
- 2025-08-02
- Acceptance date:
- 2025-08-02
- Event title:
- Agentic AI Summit 2025
- Event location:
- Berkeley, CA, USA
- Event website:
- https://rdi.berkeley.edu/events/agentic-ai-summit
- Event start date:
- 2025-08-02
- Event end date:
- 2025-08-02
- Language:
- English
- Pubs id:
- 2281898
- Local pid:
- pubs:2281898
- Deposit date:
- 2025-08-18
- ARK identifier:
Terms of use
- Copyright holder:
- Ashraf et al.
- Copyright date:
- 2025
- Rights statement:
- ©2025 The Authors
- Notes:
- This paper was presented at the Agentic AI Summit 2025, 2 August 2025, Berkeley, CA, USA. The author accepted manuscript (AAM) of this paper has been made available under the University of Oxford's Open Access Publications Policy, and a CC BY public copyright licence has been applied.
- Licence:
- CC Attribution (CC BY)