Who Flips? Self- and Cross-Model Counterarguments Reveal Answer Instability in LLMs

ArXiv CS.CL | 2026-06-16 | NEWS | en | Authors: Nafiseh Nikeghbal, Amir Hossein Kargaran, Shaghayegh Kolli, Jana Diesner

Details

Source site
ArXiv CS.CL
Authors
Nafiseh Nikeghbal, Amir Hossein Kargaran, Shaghayegh Kolli, Jana Diesner
Article type
NEWS
Language
en
Publication date
2026-06-16

Abstract

arXiv:2606.16011v1 (announce type: new)

Standard accuracy benchmarks are designed to test how closely large language models (LLMs) approach correct answers, but are not suitable for testing whether LLMs stick with a correct answer when that answer is challenged by a plausible counter-argument. We introduce a controlled protocol for evaluating answer stability: after a model answers a multiple-choice question correctly, we challenge the model's answer with a coherent argument for an incorrect option and measure whether the model flips. The setup (a) isolates argumentative content from overt social pressure and (b) varies argument length, self-attribution, and cross-model source. Across seven frontier models and 57 MMLU subjects, flip rates range from 17.5% to 97.3%, revealing large differences in stability that are not captured by accuracy metrics alone. We find that self-attribution consistently increases flip rates (mean +7.1pp, up to +18.7pp).
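To make the flip-rate metric concrete, here is a minimal sketch of how such a rate and a self-attribution gap in percentage points might be computed from logged trials. It is not the authors' code. The `Trial` record, its field names, and both helper functions are hypothetical, and the sketch assumes that a flip means any change away from the initially correct answer, which the abstract does not spell out.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Trial:
    """One challenged question (hypothetical record layout)."""
    correct: str           # gold option, e.g. "B"
    initial: str           # model's answer before the challenge
    final: str             # model's answer after the counterargument
    self_attributed: bool  # argument presented as the model's own?


def flip_rate(trials: Iterable[Trial]) -> float:
    """Percentage of initially correct answers abandoned after the challenge.

    Assumption: any final answer other than the correct one counts as a flip.
    """
    # Only questions the model first answered correctly are eligible.
    eligible = [t for t in trials if t.initial == t.correct]
    if not eligible:
        raise ValueError("no initially correct trials to evaluate")
    flips = sum(t.final != t.correct for t in eligible)
    return 100.0 * flips / len(eligible)


def self_attribution_delta_pp(trials: Iterable[Trial]) -> float:
    """Flip rate with self-attribution minus flip rate without, in pp."""
    trials = list(trials)
    with_self = flip_rate(t for t in trials if t.self_attributed)
    without_self = flip_rate(t for t in trials if not t.self_attributed)
    return with_self - without_self
```

Under these assumptions, a positive value from `self_attribution_delta_pp` would correspond to the direction of the effect reported in the abstract. Trials where the model was wrong to begin with are excluded from the denominator, which mirrors the protocol's condition that the challenge follows a correct answer.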
