The Alignment Floor: How Persona Customization Breaks Safety in Weakly-Aligned LLMs 文章

ArXiv CS.CL2026-05-29NEWSen作者: Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, Peiyang He

摘要

arXiv:2605.27382v2 Announce Type: replace-cross Abstract: Telling an LLM to "be enthusiastic" raises its sycophancy rate from 30\% to 50\% on a lightly-aligned model, but has zero effect on a strongly-aligned one. We define this gap as the alignment floor, $\Delta_{\text{floor}}(m)=\max_pS(m,p)-\min_pS(m,p)$, the range of sycophancy rates a model produces across persona conditions, and treat sycophancy as a persona-conditional property rather than a fixed model property. Pluralistic AI relies on behavioral adaptation via persona prompts like "be creative" or "be thorough", which let systems respect diverse user values and communication styles; the safety question is how much customization a given model can absorb before its truthfulness shifts. We present a controlled case study contrasting a strongly-aligned RLHF + Constitutional-AI model (Claude Sonnet 4.6) with a more lightly-aligned model (Amazon Nova Lite), spanning seven persona conditions and five tasks for 1800 total runs.

The Alignment Floor: How Persona Customization Breaks Safety in Weakly-Aligned LLMs 文章

摘要

相关事件查看全部 (1)

相关公司

相关人物

相关产品查看全部 (4)

相关技术查看全部 (3)