MergeTok: Unified Continuous and Discrete Visual Tokenization via Token Merging

ArXiv CS.CV, 2026-06-01. Authors: Luyuan Zhang, Siyuan Li, Zedong Wang, Qingsong Xie, Cheng Tan, Anna Wang, Yanhao Zhang, Chen Chen, Haonan Lu, Haoqian Wang

Abstract

arXiv:2605.30904v1

Most visual tokenizers for image generation fall into two families with complementary limitations: continuous VAEs offer high-fidelity reconstruction but suffer from dense, entangled latents that are poorly suited for semantic control, whereas discrete VQ-based models enable autoregressive generation yet struggle with gradient sparsity, unstable training, and codebook collapse. In this work, we introduce MergeTok, a unified tokenizer that jointly optimizes continuous (VAE) and discrete (VQ) tokenizers within an encoder-decoder architecture, leveraging token merging techniques as a semantic bridge. By clustering similar tokens during encoding, MergeTok establishes a structural prior that provides dual supervision signals: (i) it imposes merged-token semantic alignment in the VAE branch, regularizing its latent space toward disentangled, semantic-aware representations;