Structure over Pixels: Learning Variable-Length Visual Programs

arXiv cs.CV | 2026-05-28 | Authors: Piotr Wyrwiński, Kacper Dobek, Krzysztof Krawiec

Abstract

arXiv:2605.27696v1

Discrete visual tokenizers translate images into ordered sequences of codes, providing a natural representation for structural description of scenes. Yet existing adaptive tokenizers either require post-hoc search or select among a discrete set of pre-trained rates, rather than learning a continuous per-image sequence length coupled to the model and scene, and they typically train against pixel reconstruction, emphasizing texture rather than structure. We propose STROP, a discrete visual tokenizer architecture that forms structural scene representations and simultaneously learns how long an image's visual program should be. Using a four-phase curriculum supervised by local rate-distortion probes against frozen DINOv3 features, STROP optimizes a dedicated length head that estimates the active prefix length in a single forward pass.
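To make the idea of an "active prefix" concrete, here is a minimal sketch of how a predicted length could truncate an ordered code sequence. It is not STROP's implementation: the abstract does not say how the length head's output is parameterized, so the sketch assumes it is a fraction in [0, 1] of the maximum sequence length. The function names `active_prefix_length` and `truncate_program` and the example codes are hypothetical.

```python
import math
from typing import List, Sequence


def active_prefix_length(length_fraction: float, max_len: int, min_len: int = 1) -> int:
    """Map an assumed length-head output in [0, 1] to an integer prefix length.

    Assumption: the head predicts a fraction of the maximum sequence length.
    The result is clamped to [min_len, max_len].
    """
    frac = min(max(length_fraction, 0.0), 1.0)  # clamp out-of-range predictions
    return max(min_len, min(max_len, math.ceil(frac * max_len)))


def truncate_program(codes: Sequence[int], length_fraction: float) -> List[int]:
    """Keep only the leading codes selected by the predicted length."""
    k = active_prefix_length(length_fraction, len(codes))
    return list(codes[:k])


# Hypothetical 8-token "visual program" for one image.
codes = [17, 4, 230, 8, 91, 5, 12, 64]
short = truncate_program(codes, 0.25)  # a simple scene keeps a short prefix
full = truncate_program(codes, 1.0)    # a complex scene keeps every token
```

Because the codes are ordered, truncation keeps a prefix rather than an arbitrary subset, which is what lets one predicted number decide the sequence length without a post-hoc search.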
