Spurious Correlation Learning in Preference Optimization: Mechanisms, Consequences, and Mitigation via Tie Training 文章

ArXiv CS.AI2026-06-01NEWSen作者: Christian Moya, Alex Semendinger, Guang Lin, Elliott Thornley

摘要

arXiv:2605.11134v2 Announce Type: replace-cross Abstract: Preference learning methods like Direct Preference Optimization (DPO) are known to induce reliance on spurious correlations, leading to sycophancy and length bias in today's language models and potentially severe goal misgeneralization in future systems. In this work, we provide a unified theoretical analysis of this phenomenon, characterizing the mechanisms of spurious learning, its consequences on deployment, and a provable mitigation strategy. Focusing on log-linear policies, we show that standard preference-learning objectives induce reliance on spurious features at the population level through two channels: mean spurious bias and causal-spurious correlation leakage. We then show that this reliance creates an irreducible vulnerability to distribution shift: more data from the same training distribution fails to reduce the model's dependence on spurious features.

相关公司

暂无数据

相关人物

暂无数据

相关产品

暂无数据