Text-Only Data Synthesis for Vision Language Model Training (Event)

PRODUCT_LAUNCH | 2026-05-28 | Impact: MEDIUM

arXiv:2503.22655v2 | Announce Type: replace-cross

Abstract: Training vision-language models (VLMs) typically requires large-scale, high-quality image-text pairs, but collecting or synthesizing such data is costly. In contrast, text data is abundant and inexpensive, prompting the question: can high-quality multimodal training data be synthesized purely from text? To tackle this, we propose a cross-integrated three-stage multimodal data sy