Verifying Meta-Awareness via Predictive Rewards in Reasoning Models

ArXiv CS.AI | 2026-06-02 | Authors: Yoonjeon Kim, Doohyuk Jang, Eunho Yang

Abstract

arXiv:2510.03259v2. Recent research on reasoning models explores the meta-awareness of language models, including their ability to determine optimal thinking duration, recognize knowledge boundaries, and structure concept-level thinking. While current large reasoning models depend solely on answer-based verification, we show that adding meta-awareness objectives leads to significant performance gains over models without such meta-knowledge. MAPR (Meta-Awareness via Predictive Reward) uses a self-generated task of predicting rollout statistics (specifically length, pass rate, and concepts used), so that the predictions can be verified against the actual statistics. Furthermore, by leveraging this self-predictive capability, the model can regulate its reasoning behavior by i) filtering out trivial or unsolvable prompts, ii) reducing lengthy generations that tend to be incorrect, and iii) generating hints relevant to the problem.
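The idea of checking predicted rollout statistics against measured ones can be illustrated with a minimal sketch. The abstract does not specify how MAPR computes its reward, so everything below is an assumption made for illustration: the `Rollout` and `MetaPrediction` containers, the three scoring rules (relative length error, absolute pass-rate error, Jaccard overlap of concepts), their equal weighting, and the filtering thresholds in `keep_prompt` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Rollout:
    """One sampled reasoning trace (hypothetical container)."""
    num_tokens: int
    is_correct: bool
    concepts: Set[str]


@dataclass
class MetaPrediction:
    """The model's own forecast of its rollout statistics (hypothetical)."""
    length: float
    pass_rate: float
    concepts: Set[str]


def predictive_reward(pred: MetaPrediction, rollouts: List[Rollout]) -> float:
    """Score a meta-prediction against measured statistics; result lies in [0, 1]."""
    if not rollouts:
        raise ValueError("need at least one rollout")

    # Actual statistics, measured over the sampled rollouts.
    mean_len = sum(r.num_tokens for r in rollouts) / len(rollouts)
    pass_rate = sum(r.is_correct for r in rollouts) / len(rollouts)
    used = set().union(*(r.concepts for r in rollouts))

    # Assumed scoring rules, not taken from the paper.
    len_score = max(0.0, 1.0 - abs(pred.length - mean_len) / max(mean_len, 1.0))
    pass_score = max(0.0, 1.0 - abs(pred.pass_rate - pass_rate))
    union = pred.concepts | used
    concept_score = len(pred.concepts & used) / len(union) if union else 1.0

    # Equal weighting of the three components is also an assumption.
    return (len_score + pass_score + concept_score) / 3.0


def keep_prompt(predicted_pass_rate: float, low: float = 0.1, high: float = 0.9) -> bool:
    """Drop prompts predicted to be trivial or unsolvable (thresholds are assumed)."""
    return low <= predicted_pass_rate <= high
```

In this sketch a prediction that matches the measured mean length, pass rate, and concept set exactly receives the maximum reward, and `keep_prompt` mirrors the filtering use in item i) by discarding prompts whose predicted pass rate is near 0 or 1.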
