Let ViT Speak: Generative Language-Image Pre-training · Event

PRODUCT_LAUNCH · 2026-06-10 · Impact: MEDIUM

arXiv:2605.00809v2. Announce Type: replace.

Abstract: In this paper, we present Generative Language-Image Pre-training (GenLIP), a minimalist generative pretraining framework for Vision Transformers (ViTs) designed for multimodal large language models (MLLMs). To better align vision encoders with the autoregressive nature of LLMs, GenLIP trains a ViT to predict language tokens directly from visual tokens usi
