ViewMask-1-to-3: Multi-View Consistent Image Generation via Multimodal Discrete Diffusion Models

ArXiv CS.CV | 2026-06-04 | NEWS | en | Authors: Ruishu Zhu, Zhihao Huang, Jiacheng Sun, Ping Luo, Hongyuan Zhang, Xuelong Li

Details

Source site: ArXiv CS.CV
Authors: Ruishu Zhu, Zhihao Huang, Jiacheng Sun, Ping Luo, Hongyuan Zhang, Xuelong Li
Article type: NEWS
Language: en
Publication date: 2026-06-04

Abstract

arXiv:2512.14099v3 (announce type: replace). Motivated by the success of discrete diffusion in language-vision modeling, we explore its potential for multi-view generation, a task dominated by continuous approaches. We introduce ViewMask-1-to-3, which formulates multi-view generation as a discrete sequence modeling problem in which each viewpoint is represented as visual tokens from MAGVIT-v2. Using discrete diffusion via masked token prediction, our approach generates multiple views progressively through iterative token unmasking, unifying language and vision in a shared token space. Importantly, simple random masking combined with self-attention naturally encourages cross-view consistency without specialized architectures or 3D geometric priors. Our method outperforms the baseline on the GSO and 3D-FUTURE benchmarks, ranking first on average across standard image metrics and achieving a 10.6% higher IoU than continuous diffusion models on 3D-FUTURE.
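To make "iterative token unmasking" concrete, here is a minimal toy sketch of the general idea in Python. It is not the authors' implementation. The abstract does not specify the unmasking schedule or how positions are selected, so the cosine schedule, the confidence-based selection, the `MASK` sentinel, and the random `toy_predictor` (a stand-in for the trained transformer) are all illustrative assumptions.

```python
import math
import random

MASK = -1  # hypothetical sentinel marking a still-masked token position


def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for a trained model (assumption): returns one
    (predicted_token, confidence) pair per position. A real model would
    attend jointly over the conditioning tokens and all target-view tokens."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def iterative_unmask(cond_tokens, num_target, vocab_size=16, steps=4, seed=0):
    """Fill `num_target` masked tokens over `steps` rounds.

    cond_tokens: already-known tokens (e.g. the input view); never modified.
    """
    rng = random.Random(seed)
    tokens = list(cond_tokens) + [MASK] * num_target
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # Assumed cosine schedule: how many target tokens stay masked
        # after this step (reaches 0 at the final step).
        keep_masked = int(num_target * math.cos(math.pi / 2 * step / steps))
        num_reveal = max(1, len(masked) - keep_masked)
        preds = toy_predictor(tokens, vocab_size, rng)
        # Assumed selection rule: reveal the most confident masked positions.
        ranked = sorted(masked, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[:num_reveal]:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    out = iterative_unmask(cond_tokens=[3, 5, 7], num_target=12)
    print(out)
```

Because every position sits in one sequence, a self-attention model standing in for `toy_predictor` would see the tokens of all views at each step, which is the mechanism the abstract credits for cross-view consistency.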
