Agent-X: evaluating deep multimodal reasoning in vision-centric agentic tasks

Ashraf, T; Saqib, A; Gani, H; AlMahri, M; Li, Y; Ahsan, N; Nawaz, U; Lahoud, J; Cholakkal, H; Shah, M; Torr, P; Khan, FS; Anwer, RM; Khan, S

Conference item

Abstract:: Deep reasoning is fundamental for solving complex tasks, especially in vision-centric scenarios that demand sequential, multimodal understanding. However, existing benchmarks typically evaluate agents with fully synthetic, single-turn queries and limited visual modalities, and they lack a framework to assess reasoning quality over multiple steps as required in real-world settings. To address this, we introduce Agent-X, a large-scale benchmark for evaluating vision-centric agents' multi-step and deep reasoning capabilities in real-world, multimodal settings. Agent-X features 828 agentic tasks with authentic visual contexts, including images, multi-image comparisons, videos, and instructional text. These tasks span six major agentic environments: general visual reasoning, web browsing, security and surveillance, autonomous driving, sports, and math reasoning. Our benchmark requires agents to integrate tool use with explicit, stepwise decision-making in these diverse settings. In addition, we propose a fine-grained, step-level evaluation framework that assesses the correctness and logical coherence of each reasoning step and the effectiveness of tool usage throughout the task. Our results reveal that even the best-performing models, including the GPT, Gemini, and Qwen families, struggle to solve multi-step vision tasks, achieving less than 50% full-chain success. These findings highlight key bottlenecks in current LMM reasoning and tool-use capabilities and identify future research directions in vision-centric agentic reasoning models. Our data and code are publicly available.
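
The abstract contrasts step-level assessment with full-chain success. As a rough illustration only, and not the authors' evaluation code or metric definitions, the sketch below assumes each reasoning step has already been judged on two hypothetical boolean criteria (reasoning correctness and tool-use correctness) and shows how a per-step score can stay fairly high while full-chain success, which requires every step to pass, drops sharply. The `Step` schema, the function names, and the example traces are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One judged reasoning step in an agent trace (hypothetical schema)."""
    reasoning_correct: bool  # was the step's reasoning judged correct?
    tool_correct: bool       # was the tool call for this step judged appropriate?


def step_passes(step: Step) -> bool:
    """A step passes only if both its reasoning and its tool use were judged correct."""
    return step.reasoning_correct and step.tool_correct


def step_accuracy(trace: List[Step]) -> float:
    """Fraction of steps in one trace that pass; 0.0 for an empty trace."""
    if not trace:
        return 0.0
    return sum(step_passes(s) for s in trace) / len(trace)


def full_chain_success(trace: List[Step]) -> bool:
    """True only when the trace is non-empty and every step passes."""
    return bool(trace) and all(step_passes(s) for s in trace)


def full_chain_success_rate(traces: List[List[Step]]) -> float:
    """Fraction of traces in which every step passes; 0.0 for no traces."""
    if not traces:
        return 0.0
    return sum(full_chain_success(t) for t in traces) / len(traces)


# Made-up example: two tasks, one solved end to end, one with a single bad tool call.
example_traces = [
    [Step(True, True), Step(True, True)],
    [Step(True, True), Step(True, False), Step(True, True)],
]

if __name__ == "__main__":
    for i, trace in enumerate(example_traces):
        print(f"task {i}: step accuracy = {step_accuracy(trace):.2f}, "
              f"full chain = {full_chain_success(trace)}")
    print(f"full-chain success rate = {full_chain_success_rate(example_traces):.2f}")
```

In this made-up example the second task gets two of its three steps right but still counts as a full-chain failure, which is the kind of gap a step-level framework is meant to expose.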

Publication status:: Accepted

Peer review status:: Reviewed (other)

Cite this record

APA Style

Ashraf, T., Saqib, A., Gani, H., AlMahri, M., Li, Y., Ahsan, N., Nawaz, U., Lahoud, J., Cholakkal, H., Shah, M., Torr, P., Khan, F. S., Anwer, R. M., & Khan, S. (2025). Agent-X: evaluating deep multimodal reasoning in vision-centric agentic tasks. Agentic AI Summit 2025.

MLA Style

Ashraf, T, et al. “Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks.” Agentic AI Summit 2025, 2025.

Chicago Style

Ashraf, T, A Saqib, H Gani, M AlMahri, Y Li, N Ahsan, U Nawaz, et al. 2025. “Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks.” In Agentic AI Summit 2025. Berkeley RDI.

Access Document

Files:: Ashraf_et_al_2025_Agent-X_evaluating_deep.pdf

(Accepted manuscript, pdf, 6.3MB)

Publication website:: https://arxiv.org/pdf/2505.24876

Authors

Ashraf, T

Role:: Author

Saqib, A

Role:: Author

Gani, H

Role:: Author

AlMahri, M

Role:: Author

Li, Y

Role:: Author

Publisher:: Berkeley RDI
Publication date:: 2025-08-02
Acceptance date:: 2025-08-02
Event title:: Agentic AI Summit 2025
Event location:: Berkeley, CA, USA
Event website:: https://rdi.berkeley.edu/events/agentic-ai-summit
Event start date:: 2025-08-02
Event end date:: 2025-08-02

Language:: English
Pubs id:: 2281898
Local pid:: pubs:2281898
Deposit date:: 2025-08-18
ARK identifier:: ark:/29072/ora_8b107049d5314d9da5c9d6405e5f0314

Terms of use

Copyright holder:: Ashraf et al
Notes:: This paper was presented at the Agentic AI Summit 2025, 2 August 2025, Berkeley, CA, USA.
The author accepted manuscript (AAM) of this paper has been made available under the University of Oxford's Open Access Publications Policy, and a CC BY public copyright licence has been applied.

Licence:: CC Attribution (CC BY)
