Models That Know How Evaluations Are Designed Score Safer

arXiv cs.CL, 2026-05-28. Authors: Katharina Deckenbach, Haritz Puerto, Jonas Geiping, Sahar Abdelnabi

Abstract

arXiv:2605.28591v1. The validity of AI safety evaluations depends on models behaving consistently across controlled and deployment settings. Prior work has identified test-time contextual cues, such as hypothetical scenarios, as a source of verbalized evaluation awareness and a subsequent behavioral shift. In this paper, we investigate a potential explanation for this phenomenon: evaluation meta-knowledge, defined as parametric knowledge about the structural traits that characterize evaluations. By analogy with dataset contamination, where exposure to a benchmark raises performance through memorization, we hypothesize that models trained on texts describing evaluation practices, for instance scientific articles or social media posts about AI benchmarking, may implicitly learn to recognize and respond to evaluation-like contexts.
