Asking Is Not Enough: Protocol Sensitivity in LLM Confidence Calibration

arXiv cs.AI | 2026-05-28 | Authors: Hankyeol Kim, Pilsung Kang

Abstract

arXiv:2605.27752v1 (announce type: new)

LLM confidence calibration is often evaluated by comparing two signals: token-probability scores and verbalized confidence. These signals are sometimes treated as direct readouts of model uncertainty, but comparing them depends on measurement choices that are rarely made explicit. In the main analysis, we hold the verbalized-confidence elicitation fixed: a single prompt template, probability scale, and output format. We then vary the measurement axes that define the verbalized-vs-token comparison: which answer string receives the token-probability score, how that score is read from the answer tokens, and under which conditioning context it is measured. We evaluate this design on four QA benchmarks across three open 7-8B base/Instruct model families, with larger Qwen2.5 variants as same-family robustness checks.
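To make the second measurement axis concrete (how a score is read from the answer tokens), here is a minimal Python sketch. It is not taken from the paper: the function name `token_score_readouts`, the four readout rules, and all numbers are illustrative assumptions, chosen only to show that the same per-token log-probabilities can yield quite different scalar scores to set against one fixed verbalized confidence.

```python
import math
from typing import Dict, List


def token_score_readouts(token_logprobs: List[float]) -> Dict[str, float]:
    """Collapse per-token log-probs of one answer string into scalar scores.

    The readout rules below are common conventions, assumed here for
    illustration; the abstract does not say which ones the authors use.
    """
    if not token_logprobs:
        raise ValueError("need at least one answer token")
    total = sum(token_logprobs)
    return {
        # Probability of the whole answer string (product of token probs).
        "joint": math.exp(total),
        # Geometric mean of token probs, which removes the length penalty.
        "length_normalized": math.exp(total / len(token_logprobs)),
        # Probability of the first answer token only.
        "first_token": math.exp(token_logprobs[0]),
        # Probability of the least likely answer token.
        "min_token": math.exp(min(token_logprobs)),
    }


# Hypothetical log-probs for a three-token answer (made-up values).
example_logprobs = [math.log(0.9), math.log(0.5), math.log(0.8)]
scores = token_score_readouts(example_logprobs)

# Hypothetical verbalized confidence, held fixed across readouts.
verbalized = 0.85
gaps = {name: verbalized - score for name, score in scores.items()}

for name in scores:
    print(f"{name:18s} score={scores[name]:.3f} gap={gaps[name]:+.3f}")
```

Under these made-up numbers the verbalized confidence looks overconfident against the joint readout and slightly underconfident against the first-token readout, so the direction of the gap can depend on the readout alone.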
