Calibration Is Not Enough: Evaluating Confidence Estimation Under Language Variations

ArXiv CS.CL, 2026-05-29. Authors: Yuxi Xia, Dennis Ulmer, Terra Blevins, Yihong Liu, Hinrich Schütze, Benjamin Roth

Abstract

arXiv:2601.08064v2 (announce type: replace)

Confidence estimation (CE) indicates how reliable the answers of large language models are and impacts user trust and decision-making. Existing evaluations mainly concern the alignment between confidence and correctness, but ignore the variability of language: confidence estimates should remain consistent under semantically equivalent prompts or answer variations, while changing when answer meaning differs, as this may indicate a change in correctness. Therefore, we introduce a novel evaluation framework based on three complementary properties: robustness to prompt perturbations, stability across semantically equivalent answers, and sensitivity to semantically different answers.
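To make the three properties concrete, here is a minimal Python sketch. It is not the paper's metric definitions, which the abstract does not give. It assumes confidences are floats in [0, 1] and scores each property from the mean absolute gap between a reference confidence and the confidences obtained under variation. The function names `mean_abs_gap`, `robustness`, `stability`, and `sensitivity` are hypothetical and chosen only for illustration.

```python
from statistics import mean


def mean_abs_gap(reference, variants):
    """Average absolute difference between a reference confidence
    and a non-empty list of confidences obtained under variation."""
    return mean([abs(reference - v) for v in variants])


def robustness(conf_original_prompt, conf_perturbed_prompts):
    # Assumed scoring: 1.0 means the confidence did not move at all
    # when the prompt was rephrased or otherwise perturbed.
    return 1.0 - mean_abs_gap(conf_original_prompt, conf_perturbed_prompts)


def stability(conf_answer, conf_equivalent_answers):
    # Assumed scoring: 1.0 means semantically equivalent answers
    # received exactly the same confidence as the original answer.
    return 1.0 - mean_abs_gap(conf_answer, conf_equivalent_answers)


def sensitivity(conf_answer, conf_different_answers):
    # Assumed scoring: larger values mean the confidence changed more
    # when the meaning of the answer changed.
    return mean_abs_gap(conf_answer, conf_different_answers)
```

Under these assumptions, a method whose confidence stays at 0.8 for every paraphrased prompt gets a robustness of 1.0. A method that assigns 0.9 to one answer and 0.4 to semantically different answers gets a sensitivity of about 0.5. A well-calibrated method can still score poorly on any of the three, which is the gap the abstract points to.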
