Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders 文章

ArXiv CS.CL2026-05-27NEWSen作者: Yi Jing, Zao Dai, Jinwu Hu, Zijun Yao, Lei Hou, Juanzi Li, Xiaozhi Wang

摘要

arXiv:2605.27354v1 Announce Type: cross Abstract: Model internals encode rich information about how a large language model (LLM) processes its training data; however, post-training data engineering largely relies on external signals and ignores rich intrinsic signals lying in model internals. We propose SAERL, a data engineering framework for LLM reinforcement learning (RL). It models three intrinsic data properties: diversity, difficulty, and quality, using model internals extracted with Sparse Autoencoder (SAE), an advanced mechanistic interpretability tool. Each property grounds a concrete data engineering operation: SAE-space clustering with moderate batch mixing for batch diversity control, a difficulty proxy for easy-to-hard curriculum ordering, and a quality probe for data filtering. SAERL improves average accuracy by 3.00% over vanilla GRPO and reaches target accuracy with 20% fewer training steps on Qwen2.5-Math-1.

Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders 文章

摘要

相关事件查看全部 (1)

相关公司查看全部 (2)

相关人物

相关产品查看全部 (14)

相关技术查看全部 (29)