Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases (Event)

PRODUCT_LAUNCH · 2026-05-27 · Impact: MEDIUM

arXiv:2605.27355v1 Announce Type: cross

Abstract: Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In this work, we introduce alignment tampering, a potential vulnerability where the LLM undergoing alignment influences the preference dataset, causing RLHF to amplify undesired behaviors. This arises fr

Related companies

NFL (NONPROFIT)
ISO (NONPROFIT)
Arise (COMPANY)
arXiv (NONPROFIT)
ISES (NONPROFIT)
HuMA (NONPROFIT)
Tam (COMPANY)
EARN (NONPROFIT)
ACT (NONPROFIT)