Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases 文章

ArXiv CS.CL2026-06-01NEWSen作者: Dongyoon Hahm, Dylan Hadfield-Menell, Kimin Lee

摘要

arXiv:2605.27355v2 Announce Type: replace-cross Abstract: Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In this work, we introduce alignment tampering, a potential vulnerability where the LLM undergoing alignment influences the preference dataset, causing RLHF to amplify undesired behaviors. This arises from core limitations of RLHF: (1) preference datasets are constructed from the LLM's own outputs, allowing it to influence them, and (2) pairwise comparisons only indicate which response is better, not why. These limitations can be exploited to cause alignment tampering. For example, if an LLM generates biased responses with higher quality, annotators will prefer them based on quality. However, preference labels do not distinguish quality from bias, and the reward model inherits this limitation.

Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases 文章

摘要

相关事件查看全部 (1)

相关公司查看全部 (2)

相关人物

相关产品查看全部 (7)

相关技术查看全部 (21)