Physics-Guided Policy Optimization with Self-Distillation 文章

ArXiv CS.AI2026-06-03NEWSen作者: Ke Wang, Yuning Wu, Haoran Liu, Chaoqun Jia, Devin Chen, Kai Wei

详细信息

来源站点: ArXiv CS.AI
作者: Ke Wang, Yuning Wu, Haoran Liu, Chaoqun Jia, Devin Chen, Kai Wei
文章类型: NEWS
语言: en
发布日期: 2026-06-03

摘要

arXiv:2606.03620v1 Announce Type: cross Abstract: Self-distilled policy optimization (SDPO) has become a popular paradigm for LLM post-training, where a model learns from its own predictions conditioned on privileged information. SDPO, however, is sensitive to how much each update step should be trusted: corrections from a self-teacher can be highly informative on some batches and misleading on others, and applying them uniformly with a fixed step size can destabilize training. Drawing inspiration from viscous-fluid dynamics and formalizing the analogy at the SDE level, we propose Physics-Guided Policy Optimization (PGPO), which introduces an information-modulated step-size multiplier derived from a mutual-information estimate between the student's predictions and the feedback-conditioned teacher. We show that this modulation preserves the order-1 weak-approximation guarantees of vanilla SGD, and incurs negligible overhead per iteration.

Physics-Guided Policy Optimization with Self-Distillation 文章

详细信息

摘要

相关事件

相关公司

相关人物

相关产品查看全部 (1)

相关技术查看全部 (3)