Calibration Is Not Enough: Evaluating Confidence Estimation Under Language Variations

arXiv cs.CL | 2026-05-29 | Authors: Yuxi Xia, Dennis Ulmer, Terra Blevins, Yihong Liu, Hinrich Schütze, Benjamin Roth

Abstract

arXiv:2601.08064v2. Confidence estimation (CE) indicates how reliable the answers of large language models are, and it affects user trust and decision-making. Existing evaluations mainly measure the alignment between confidence and correctness but ignore the variability of language: confidence estimates should remain consistent across semantically equivalent prompts or answer variations, and should change when the meaning of the answer differs, since a change in meaning may indicate a change in correctness. Therefore, we introduce a novel evaluation framework based on three complementary properties: robustness to prompt perturbations, stability across semantically equivalent answers, and sensitivity to semantically different answers.
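The three properties can be illustrated with a minimal sketch. The abstract does not give the paper's actual metrics, so everything below is an assumption made for illustration: the helper `mean_abs_shift`, the use of mean absolute difference as the measure, and the confidence scores themselves are all hypothetical. Under this reading, a well-behaved estimator would show small shifts for reworded prompts and equivalent answers, and a larger shift for answers whose meaning differs.

```python
from statistics import mean


def mean_abs_shift(reference: float, variants: list[float]) -> float:
    """Average absolute change in confidence relative to a reference score.

    Hypothetical helper for illustration; not the metric defined in the paper.
    """
    if not variants:
        raise ValueError("need at least one variant score")
    return mean(abs(v - reference) for v in variants)


# Made-up confidence scores in [0, 1] for a single question (illustrative only).
original = 0.80                            # confidence for the original prompt and answer
paraphrased_prompts = [0.78, 0.82, 0.79]   # same question, reworded prompt
equivalent_answers = [0.81, 0.77]          # same meaning, different wording
different_answers = [0.30, 0.45]           # different meaning

# Robustness and stability: smaller shifts are better.
robustness_gap = mean_abs_shift(original, paraphrased_prompts)
stability_gap = mean_abs_shift(original, equivalent_answers)

# Sensitivity: a larger shift is better, since the answer's meaning changed.
sensitivity_gap = mean_abs_shift(original, different_answers)

print(f"robustness gap:  {robustness_gap:.3f}")
print(f"stability gap:   {stability_gap:.3f}")
print(f"sensitivity gap: {sensitivity_gap:.3f}")
```

Reporting the three numbers separately mirrors the abstract's point that the properties are complementary: an estimator could be well calibrated on average and still shift noticeably under a paraphrase, or barely move when the answer's meaning changes.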
