Identifiable Token Correspondence for World Models

arXiv cs.CV | 2026-05-27 | Authors: Youngin Kim, Ray Sun, Inho Kim, Bumsoo Park, Hyun Oh Song

Abstract

arXiv:2605.16457v3

Token-based transformer world models have shown strong performance in visual reinforcement learning, but often suffer from temporal inconsistency in long-horizon rollouts, including object duplication, disappearance, and transmutation. A key reason is that most existing approaches treat next-frame prediction purely as a token generation problem, without considering the persistence of tokens across time. We introduce Identifiable Token Correspondence (ITC), a decoding step for token-based transformer world models that formulates next-frame prediction as a structured assignment problem with latent token correspondence variables: each next-frame token is explained either by copying a token from the previous frame or by generating a new one. ITC leaves the transformer architecture and training procedure unchanged and can be added on top of existing backbones.
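The copy-or-generate idea can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify how scores are computed or which constraints the assignment obeys, so the sketch assumes each next-frame position is decided independently, and the names `decode_next_frame`, `copy_scores`, `gen_scores`, and `gen_tokens` are hypothetical. The real method presumably solves the assignment jointly.

```python
from typing import List, Optional, Tuple

NEW = None  # hypothetical marker: the token is generated, not copied


def decode_next_frame(
    prev_tokens: List[int],
    copy_scores: List[List[float]],
    gen_scores: List[float],
    gen_tokens: List[int],
) -> Tuple[List[int], List[Optional[int]]]:
    """Toy copy-or-generate decoding (illustrative assumption only).

    copy_scores[i][j]: score for explaining next-frame position i by
        copying previous-frame token j.
    gen_scores[i]: score for explaining position i by a new token.
    gen_tokens[i]: the token the backbone would generate at position i.
    Returns the decoded tokens and, per position, the index of the
    copied source token or NEW.
    """
    next_tokens: List[int] = []
    correspondence: List[Optional[int]] = []
    for i, row in enumerate(copy_scores):
        # Best previous-frame source for position i.
        best_j = max(range(len(prev_tokens)), key=lambda j: row[j])
        if row[best_j] >= gen_scores[i]:
            # Copying explains this position better: the token persists.
            next_tokens.append(prev_tokens[best_j])
            correspondence.append(best_j)
        else:
            # No good source: fall back to the generated token.
            next_tokens.append(gen_tokens[i])
            correspondence.append(NEW)
    return next_tokens, correspondence


# Made-up scores for a 3-token frame, purely for illustration.
prev_tokens = [11, 42, 7]
copy_scores = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.1],
    [0.0, 0.1, 0.8],
]
gen_scores = [0.2, 0.6, 0.5]
gen_tokens = [5, 99, 3]

next_tokens, correspondence = decode_next_frame(
    prev_tokens, copy_scores, gen_scores, gen_tokens
)
print(next_tokens)     # [11, 99, 7]
print(correspondence)  # [0, None, 2]
```

In this toy run, positions 0 and 2 are copied from the previous frame and position 1 is generated. The correspondence list makes persistence explicit, which is what pure token generation does not track. Because the sketch only post-processes scores, it leaves the backbone untouched, in line with the abstract's statement that ITC is a decoding step.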