Reinforcement Learning Amplifies Emergent Misalignment from Harmless Rewards
Related Technologies
language model, ODETER, Structure-Induced Guided Neutralization, StanSTR, Supervised Fine-Tuning, SFT, SAMPRTLPLAOWLOPE, Non-Line-of-Sight, NPONATMITLIMIT, In-context Sparse Attention, ForFFIDEL, Cognitive Tree Exploration, CircUit-Level backdoor Threat model, Calibrated Entropy Score, COM, Air Traffic Control, Action Units, ARGANNAES