Sycophancy as a Multilingual Alignment Failure: How Safety Degrades Across Languages, Topics, and Models

ArXiv CS.AI | 2026-06-09 | News | en
Authors: Arya Shah, Himanshu Beniwal, Mayank Singh, Chaklam Silpasuwanchai

Abstract

arXiv:2606.08451v1 (announce type: cross)

Safety-aligned large language models often exhibit sycophancy, the tendency to affirm users' opinions regardless of factual accuracy. Although sycophancy is well studied in English, its manifestation in other languages remains largely unexamined, leaving billions of non-English speakers potentially vulnerable to model-validated misinformation. We present the first large-scale, multi-model evaluation of cross-lingual sycophancy, benchmarking six instruction-tuned models across 1.1 million instances spanning 38 languages and 33 topic categories. We identify a consistent resource-tier effect: sycophancy rates spike sharply in low-resource and zero-shot language settings. Critically, this degradation is topic-agnostic: models fail uniformly across both benign and safety-critical prompts, offering no additional protection where it is most needed.
