Representation over Routing: Diagnosing Temporal Routing Pathologies in Multi-Timescale PPO

ArXiv CS.AI | 2026-06-02 | Author: Jing Sun

Abstract

arXiv:2604.13517v4

Temporal credit assignment in reinforcement learning is often approached by introducing value estimates at multiple discount factors. A natural next step is to let the actor dynamically route among these temporal heads, using either differentiable attention or heuristic uncertainty weights. This paper argues that such routing can create a numerical shortcut rather than a reliable temporal abstraction. We study this issue in a controlled PPO setting on LunarLander-v2, using the environment as a visual sandbox for diagnosing failure modes. First, we formalize Surrogate Objective Hacking: a differentiable softmax router exposed to the PPO surrogate receives a direct gradient toward advantage heads that are numerically favorable for the current update, even when this routing change does not correspond to improved physical control.
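The gradient path described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: it assumes the router mixes per-head advantages as a softmax-weighted sum, and the function names, the three advantage values, the learning rate, and the step count are all hypothetical. The probability ratio is held fixed at 1.0, so the policy's behavior never changes, yet gradient ascent on the clipped surrogate still moves the router toward the head with the largest advantage.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def router_logit_gradient(logits, head_advantages, ratio=1.0, clip_eps=0.2):
    """Gradient of the clipped PPO surrogate w.r.t. the router logits.

    Assumption (not from the source): mixed advantage A = sum_k w_k * A_k,
    with w = softmax(logits). Then dA/dz_j = w_j * (A_j - A).
    """
    weights = softmax(logits)
    mixed = sum(w * a for w, a in zip(weights, head_advantages))
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # The surrogate is min(ratio * A, clipped_ratio * A); whichever branch is
    # active, A is multiplied by a positive coefficient.
    coeff = ratio if ratio * mixed <= clipped_ratio * mixed else clipped_ratio
    return [coeff * w * (a - mixed) for w, a in zip(weights, head_advantages)]


def run_router_ascent(head_advantages, steps=50, lr=1.0):
    """Gradient ascent on the router logits only; the policy is untouched."""
    logits = [0.0] * len(head_advantages)
    for _ in range(steps):
        grad = router_logit_gradient(logits, head_advantages)
        logits = [z + lr * g for z, g in zip(logits, grad)]
    return softmax(logits)


if __name__ == "__main__":
    # Hypothetical advantages from three discount-factor heads for one sample.
    advantages = [-0.5, 0.1, 2.0]
    print(run_router_ascent(advantages))
```

Under these assumptions the logit gradient for head j is proportional to w_j (A_j - A), so any head whose advantage exceeds the current mixture is pushed up regardless of whether the resulting routing reflects better control. That is the numerical shortcut the abstract names.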
