Biases in the Blind Spot: Detecting What LLMs Fail to Mention

arXiv cs.AI | 2026-06-01 | Authors: Iván Arcuschin, David Chanin, Adrià Garriga-Alonso, Oana-Maria Camburu

Abstract

arXiv:2602.10117v5

Large Language Models (LLMs) often provide chain-of-thought (CoT) reasoning traces that appear plausible, but may hide internal biases. We call these unverbalized biases. Monitoring models via their stated reasoning is therefore unreliable, and existing bias evaluations typically require predefined categories and hand-crafted datasets. In this work, we introduce a fully automated, black-box pipeline for detecting task-specific unverbalized biases. Given a task dataset, the pipeline uses LLM autoraters to generate candidate bias concepts. It then tests each concept on progressively larger input samples by generating positive and negative variations, and applies statistical techniques for multiple testing and early stopping. A concept is flagged as an unverbalized bias if it yields statistically significant performance differences while not being cited as justification in the model's CoTs.
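The testing and flagging stage described above can be illustrated with a minimal sketch. The abstract does not name the statistical techniques used, so this example substitutes an exact two-sided sign test with a Bonferroni correction over every (concept, stage) look; those choices, along with the names `sign_test_p`, `flag_unverbalized`, and `max_cite_rate`, and the record layout, are assumptions made for illustration and not the authors' implementation. Each record pairs the model's outcome on a positive and a negative variation of one input with a flag saying whether the CoT cited the concept.

```python
from math import comb


def sign_test_p(wins: int, losses: int) -> float:
    """Two-sided exact sign test on discordant pairs (ties are dropped)."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def flag_unverbalized(concepts, stage_sizes, alpha=0.05, max_cite_rate=0.1):
    """Return concept names that look like unverbalized biases.

    concepts: dict mapping name -> list of (pos_outcome, neg_outcome, cited)
    stage_sizes: increasing sample sizes, one per testing stage
    """
    # Assumption: Bonferroni over every (concept, stage) look at the data.
    threshold = alpha / (len(concepts) * len(stage_sizes))
    flagged = []
    for name, records in concepts.items():
        for size in stage_sizes:
            sample = records[:size]
            wins = sum(1 for pos, neg, _ in sample if pos > neg)
            losses = sum(1 for pos, neg, _ in sample if pos < neg)
            if sign_test_p(wins, losses) < threshold:
                # Significant effect: flag only if the CoTs rarely cite it.
                cite_rate = sum(1 for _, _, cited in sample if cited) / len(sample)
                if cite_rate <= max_cite_rate:
                    flagged.append(name)
                break  # early stop: no need to test larger samples
    return flagged


# Hypothetical toy data: 40 paired outcomes per candidate concept.
toy_concepts = {
    "hidden_effect": [(1, 0, False)] * 40,         # effect, never cited
    "stated_effect": [(1, 0, True)] * 40,          # effect, always cited
    "no_effect": [(1, 0, False), (0, 1, False)] * 20,
}
toy_flags = flag_unverbalized(toy_concepts, stage_sizes=[20, 40])
```

In this toy setup only `hidden_effect` is flagged: `stated_effect` shifts outcomes but is cited in the reasoning, and `no_effect` never reaches significance at either stage. A real pipeline would also need the autorater steps that propose concepts and generate the variations, which this sketch leaves out.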
