Alignment Makes Language Models Normative, Not Descriptive

arXiv cs.CL, 2026-05-27
Authors: Eilam Shapira, Moshe Tennenholtz, Roi Reichart

Abstract

arXiv:2603.17218v2

Post-training alignment optimizes language models to match human preference signals, but this objective is not equivalent to modeling observed human behavior. We compare 120 base-aligned model pairs on more than 10,000 real human decisions in multi-round strategic games: bargaining, persuasion, negotiation, and repeated matrix games. In these settings, base models outperform their aligned counterparts in predicting human choices by nearly 10:1, robustly across model families, prompt formulations, and game configurations. This pattern reverses, however, in settings where human behavior is more likely to follow normative predictions: aligned models dominate on one-shot textbook games across all 12 types tested and on non-strategic lottery choices, and even within the multi-round games themselves at round one, before interaction history develops.
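One plausible reading of the "nearly 10:1" figure is a ratio of pairwise wins: for each base-aligned pair, whichever model predicts human choices more accurately is counted as the winner. The abstract does not spell out the exact metric, so this reading is an assumption. The Python sketch below illustrates such a tally; the function name `win_ratio` and the accuracy values are hypothetical and are not results from the paper.

```python
def win_ratio(pairs):
    """Tally wins over (base_accuracy, aligned_accuracy) pairs.

    Returns (base_wins, aligned_wins); exact ties count for neither side.
    """
    base_wins = sum(1 for base_acc, aligned_acc in pairs if base_acc > aligned_acc)
    aligned_wins = sum(1 for base_acc, aligned_acc in pairs if aligned_acc > base_acc)
    return base_wins, aligned_wins


# Made-up accuracies for illustration only; NOT data from the paper.
toy_pairs = [(0.71, 0.64), (0.68, 0.66), (0.70, 0.73), (0.65, 0.65)]

base_wins, aligned_wins = win_ratio(toy_pairs)
print(f"base wins: {base_wins}, aligned wins: {aligned_wins}")
```

Under this reading, a 10:1 result would mean that the base model wins roughly ten pairs for every pair won by its aligned counterpart.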