"In\^{t}elegi Rom\^ane\c{s}te?'' A Recipe for Romanian Vision-Language Models 文章

arXiv cs.CL | 2026-06-01 | Authors: Mihai Masala, Marius Leordeanu, Mihai Dascalu, Traian Rebedea

Abstract

arXiv:2605.31401v1 (announce type: new)

Vision-Language Models (VLMs) largely follow the text-only LLM trajectory, excelling on English benchmarks but sharply degrading on low-resource languages, where neither large-scale image-text corpora nor culturally grounded evaluations exist. We present a systematic study of building a language-specific VLM for Romanian, covering the full pipeline from data construction to architectural choices. We translate established English VLM training and evaluation corpora into Romanian, applying machine translation to textual annotations and to in-image text, preserving visual grounding while adapting the textual content. Using this data, we train and ablate a series of VLMs to isolate the contribution of (i) vision backbones of varying scale and pretraining, (ii) language backbones from multilingual to Romanian-adapted LLMs, and (iii) OCR-style image-text data.
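The three-factor ablation described in the abstract can be pictured as a small configuration grid. The sketch below is only an illustration of that idea: the backbone labels, the `ablation_grid` and `vary_one` helpers, and the choice of two levels per factor are all hypothetical placeholders, since the abstract does not name specific models or say how many variants were trained.

```python
from itertools import product

# Hypothetical placeholder labels; the abstract names no specific models.
VISION_BACKBONES = ["vision-small", "vision-large"]
LANGUAGE_BACKBONES = ["multilingual-llm", "romanian-adapted-llm"]
OCR_DATA = [False, True]  # train without / with OCR-style image-text data


def ablation_grid(vision, language, ocr):
    """Return one config dict per combination of the three factors."""
    return [
        {"vision": v, "language": l, "ocr_data": o}
        for v, l, o in product(vision, language, ocr)
    ]


def vary_one(grid, baseline, factor):
    """Return configs that match `baseline` on every factor except `factor`.

    The baseline itself is included in the result.
    """
    others = [key for key in baseline if key != factor]
    return [cfg for cfg in grid if all(cfg[k] == baseline[k] for k in others)]


grid = ablation_grid(VISION_BACKBONES, LANGUAGE_BACKBONES, OCR_DATA)
baseline = grid[0]

if __name__ == "__main__":
    for factor in ("vision", "language", "ocr_data"):
        print(factor, vary_one(grid, baseline, factor))
```

Comparing only the runs returned by `vary_one` for a given factor is one simple way to attribute a score difference to that factor alone, which is the sense of "isolate the contribution" assumed here.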

"In\^{t}elegi Rom\^ane\c{s}te?'' A Recipe for Romanian Vision-Language Models 文章
