Reasoning or Fluency? Dissecting Probabilistic Confidence in Best-of-N Selection

ArXiv CS.AI | 2026-06-04 | Authors: Hojin Kim, Jaehyung Kim

Abstract

arXiv:2601.13735v2 (announce type: replace)

Probabilistic confidence metrics are increasingly adopted as proxies for reasoning quality in Best-of-N selection, under the assumption that higher confidence reflects higher reasoning fidelity. In this work, we challenge this assumption by investigating whether these metrics truly capture the inter-step causal dependencies necessary for valid reasoning. We introduce three classes of inter-step causality perturbations that systematically disrupt dependencies between reasoning steps while preserving local fluency. Surprisingly, across diverse model families and reasoning benchmarks, selection accuracy degrades only marginally under these disruptions. Even severe interventions, such as hard attention masks that directly prevent the model from attending to prior reasoning steps, do not substantially reduce selection performance.
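
To make the setup concrete, here is a minimal sketch of confidence-based Best-of-N selection and of a hard attention mask that hides prior reasoning steps. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which confidence metric is used, so mean token log-probability stands in for it, and the names `mean_logprob_confidence`, `best_of_n`, `step_isolating_mask`, and the `segment_ids` convention (0 for the prompt, k for reasoning step k) are hypothetical.

```python
import math
from typing import List, Sequence


def mean_logprob_confidence(token_logprobs: Sequence[float]) -> float:
    """Length-normalized log-likelihood of one candidate.

    Assumption: this stands in for the paper's "probabilistic confidence";
    the abstract does not name the exact metric.
    """
    if not token_logprobs:
        return -math.inf
    return sum(token_logprobs) / len(token_logprobs)


def best_of_n(candidates: Sequence[Sequence[float]]) -> int:
    """Return the index of the candidate with the highest confidence score."""
    scores = [mean_logprob_confidence(c) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


def step_isolating_mask(segment_ids: Sequence[int]) -> List[List[bool]]:
    """Build a boolean attention mask; mask[i][j] is True if token i may see token j.

    Assumed convention: segment 0 is the prompt, segment k >= 1 is reasoning
    step k. Each token sees the prompt and earlier tokens of its own step,
    but no tokens from other reasoning steps.
    """
    n = len(segment_ids)
    mask = []
    for i in range(n):
        row = []
        for j in range(n):
            causal = j <= i  # standard left-to-right constraint
            visible = segment_ids[j] == 0 or segment_ids[j] == segment_ids[i]
            row.append(causal and visible)
        mask.append(row)
    return mask


# Three hypothetical candidates, each given as per-token log-probabilities.
candidates = [[-0.1, -0.2, -0.3], [-0.05, -0.05], [-1.0]]
selected = best_of_n(candidates)

# One prompt token, two tokens of step 1, one token of step 2.
mask = step_isolating_mask([0, 1, 1, 2])
```

Scoring the same candidates with and without such a mask, then comparing which candidate is selected, is one way to probe whether the confidence score depends on earlier steps at all.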
