Large Language Models Hack Rewards, and Society

arXiv cs.CL | 2026-06-04 | Authors: Wei Liu, Xinyi Mou, Hanqi Yan, Zhongyu Wei, Yulan He

Abstract

arXiv:2606.04075v1 (announce type: cross)

Reinforcement learning (RL) has become a dominant post-training paradigm, enabling large language models (LLMs) to learn from rewards. We observe that societal regulations are structurally similar to reward functions: they define measurable outcomes, thresholds, and exceptions, while often leaving institutional intent only partially specified. We hypothesise that the RL training process may exploit these gaps. We therefore ask whether models' well-known tendency to hack reward functions during RL can scale into a more consequential failure mode, which we call societal hacking: discovering loopholes in the rules society runs on. To study this phenomenon, we introduce SocioHack, a sandbox of 72 societal environments, and find that within these environments reward hacking emerges naturally and leads to regulatory loophole discovery.
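The analogy between a regulation and a reward function can be made concrete with a toy sketch. The example below is not taken from SocioHack or from the authors' code; the fishing rule, the numbers, and every name in it are hypothetical and chosen only for illustration. It encodes a rule with a measurable outcome (catch per trip), a threshold (a per-trip limit), and an exception (research trips), and it assumes an unstated intent (a seasonal cap) that the written rule never checks.

```python
from dataclasses import dataclass

# Hypothetical numbers, chosen only for illustration.
PER_TRIP_LIMIT_KG = 100.0       # threshold written into the rule
INTENDED_SEASON_CAP_KG = 100.0  # assumed intent, never stated in the rule


@dataclass(frozen=True)
class Trip:
    catch_kg: float
    is_research: bool = False  # exception: research trips are exempt


def rule_violations(trips):
    """Count trips that break the letter of the written rule."""
    return sum(
        1 for t in trips
        if not t.is_research and t.catch_kg > PER_TRIP_LIMIT_KG
    )


def proxy_reward(trips):
    """Reward as the written rule sees it: total catch minus a fixed penalty per violation."""
    total = sum(t.catch_kg for t in trips)
    return total - 1000.0 * rule_violations(trips)


def meets_intent(trips):
    """Check the assumed intent: the seasonal total stays under the cap."""
    return sum(t.catch_kg for t in trips) <= INTENDED_SEASON_CAP_KG


candidates = {
    "honest": [Trip(100.0)],
    "split": [Trip(100.0), Trip(100.0), Trip(100.0)],  # many small trips
    "relabelled": [Trip(300.0, is_research=True)],     # leans on the exception
    "blatant": [Trip(300.0)],                          # breaks the written rule
}

# A pure reward maximiser picks whichever plan scores highest on the proxy.
best = max(candidates, key=lambda name: proxy_reward(candidates[name]))

for name, trips in candidates.items():
    print(name, proxy_reward(trips), rule_violations(trips), meets_intent(trips))
print("reward-maximising plan:", best)
```

In this sketch the plans that split the catch or lean on the exception score higher on the proxy than the honest plan while recording zero violations, and only the honest plan satisfies the assumed intent. That gap between the letter of a rule and its purpose is the kind of loophole the abstract describes.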

Related events (2)

Large Language Models Hack Rewards, and Society
2026-06-04 | REGULATION | Impact: MEDIUM
Large Language Models Hack Rewards, and Society
2026-06-04 | PRODUCT_LAUNCH | Impact: MEDIUM
