Revisiting uncertainty estimation and calibration of large language models

Tao, L; Yeh, YF; Dong, M; Huang, T; Yu, J; Torr, P; Xu, C

AI Collection

Conference item

Abstract:: As large language models (LLMs) are increasingly deployed in high-stakes applications, robust uncertainty estimation is essential for ensuring the safe and trustworthy deployment of LLMs. We present the most comprehensive study to date of uncertainty estimation in LLMs, evaluating 80 models spanning open- and closed-source families, dense and Mixture-of-Experts (MoE) architectures, reasoning and nonreasoning modes, quantization variants and parameter scales from 0.6B to 671B. Focusing on three representative black-box single-pass methods, including token probability-based uncertainty (TPU), numerical verbal uncertainty (NVU), and linguistic verbal uncertainty (LVU), we systematically evaluate uncertainty calibration and selective classification using the challenging MMLU-Pro benchmark, which covers both reasoning-intensive and knowledge-based tasks. Our results show that LVU consistently outperforms TPU and NVU, offering stronger calibration and discrimination while being more interpretable. We also find that high accuracy does not imply reliable uncertainty, and that model scale, post-training, reasoning ability and quantization all influence estimation performance. Notably, LLMs exhibit better uncertainty estimates on reasoning tasks than on knowledge-heavy ones, and good calibration does not necessarily translate to effective error ranking. These findings highlight the need for multi-perspective evaluation and position LVU as a practical tool for improving the reliability of LLMs in real-world settings.
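The abstract distinguishes calibration from discrimination (error ranking), but this record does not name the metrics used. As a minimal sketch, assuming expected calibration error (ECE) for calibration and AUROC for discrimination (two common choices, not confirmed by this record), the toy example below shows how confidence scores can be well calibrated on average yet useless for ranking errors. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    conf: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    n = len(conf)
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for c, y in zip(conf, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(acc - avg_conf)
    return ece


def auroc(conf: Sequence[float], correct: Sequence[int]) -> float:
    """Chance that a correct answer gets higher confidence than a wrong one (ties count 0.5)."""
    pos = [c for c, y in zip(conf, correct) if y == 1]
    neg = [c for c, y in zip(conf, correct) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: four answers, the last one wrong.
correct = [1, 1, 1, 0]
conf_a = [0.75, 0.75, 0.75, 0.75]  # always reports the average accuracy
conf_b = [0.9, 0.9, 0.9, 0.6]      # less confident on the wrong answer

ece_a, auroc_a = expected_calibration_error(conf_a, correct), auroc(conf_a, correct)
ece_b, auroc_b = expected_calibration_error(conf_b, correct), auroc(conf_b, correct)
print(f"A: ECE={ece_a:.3f} AUROC={auroc_a:.3f}")
print(f"B: ECE={ece_b:.3f} AUROC={auroc_b:.3f}")
```

In this sketch, scores A have a lower ECE than scores B, but only B separates the wrong answer from the right ones, which is one way to read the abstract's point that calibration and error ranking need to be evaluated separately.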

Publication status:: Accepted

Peer review status:: Peer reviewed

Cite this record

APA Style

Tao, L., Yeh, Y. F., Dong, M., Huang, T., Yu, J., Torr, P., & Xu, C. (2025). Revisiting uncertainty estimation and calibration of large language models. 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Scaling Environments for Agents (SEA).

MLA Style

Tao, L, et al. “Revisiting Uncertainty Estimation and Calibration of Large Language Models.” 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Scaling Environments for Agents (SEA), 2025.

Chicago Style

Tao, L, YF Yeh, M Dong, T Huang, J Yu, P Torr, and C Xu. 2025. “Revisiting Uncertainty Estimation and Calibration of Large Language Models.” In 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Scaling Environments for Agents (SEA). NeurIPS.

Access Document

Files:: Tao_et_al_2025_Revisiting_uncertainty_estimation.pdf

(Version of record, pdf, 537.6KB)

Publication website:: https://openreview.net/forum?id=Q9CreVjHH7

Authors

Tao, L

Role:: Author

Yeh, YF

Role:: Author

Dong, M

Role:: Author

Huang, T

Role:: Author

Yu, J

Institution:: University of Oxford
Division:: MPLS
Department:: Engineering Science
Role:: Author

More authors...

Publisher:: NeurIPS
Host title:: Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Scaling Environments for Agents (SEA).
Publication date:: 2025-09-28
Acceptance date:: 2025-09-28
Event title:: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Scaling Environments for Agents (SEA)
Event location:: San Diego, CA, USA
Event website:: https://neurips.cc/virtual/2025/loc/san-diego/workshop/109540
Event start date:: 2025-12-07
Event end date:: 2025-12-07

Language:: English
Pubs id:: 2364449
Local pid:: pubs:2364449
Deposit date:: 2026-01-28
ARK identifier:: ark:/29072/ora_602ae47f1fbc456aaf2ebd0b12eaad8d

Terms of use

Rights statement:: This paper has been made open access via Creative Commons licensing (https://creativecommons.org/licenses/by/4.0/).
Notes:: This paper was presented at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Scaling Environments for Agents (SEA), 7th December 2025, San Diego, CA, USA.

Licence:: Terms and Conditions of Use for Oxford University Research Archive
