Teacher-Guided Policy Optimization for On-Policy Reasoning Distillation under Large Policy Divergence
arXiv cs.AI, 2026-05-29. Authors: Xinyu Liu, Kechen Jiao, Chunyang Xiao, Runsong Zhao, Junhao Ruan, Bei Li, Jiahao Liu, Qifan Wang, Xin Chen, Jingang Wang, Chenglong Wang, Tong Xiao, JingBo Zhu