Comparative Evaluation of Machine Translation Systems on Images with Text

arXiv cs.CL, 2026-05-29. Authors: Blai Puchol, Sergio Gómez González, Miguel Domingo, Francisco Casacuberta

Abstract

arXiv:2605.29476v1

This work presents a comparative evaluation of machine translation systems applied to images containing textual information, a task that lies at the intersection of computer vision and natural language processing. The study compares three main paradigms: modular pipelines that separate text detection, recognition, and translation; multi-modal large language models (MLLMs) capable of processing images and text jointly; and an end-to-end model, Translatotron-V, which directly generates translated images. The modular systems employ state-of-the-art OCR (docTR) combined with multilingual LLMs such as Llama and EuroLLM, while the evaluated MLLMs include different configurations of Gemini 2.5. Experiments were conducted on parallel multilingual datasets covering multiple language pairs, with evaluation based on BLEU, chrF, and TER metrics.
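
As a rough illustration of how a character-level metric such as chrF compares a system translation against a reference, the following is a minimal sketch in pure Python. It is not the authors' evaluation code and it will not reproduce the scores of standard tooling exactly: the function names `char_ngrams` and `simple_chrf` are hypothetical, whitespace is simply stripped, and the defaults (n-gram orders 1 to 6, beta = 2) are assumptions chosen to mirror common chrF settings.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing whitespace (a simplification)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF-style score in [0, 1]; hypothetical helper, not an official implementation."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        hyp_total = sum(hyp.values())
        ref_total = sum(ref.values())
        if hyp_total == 0 or ref_total == 0:
            continue  # this n-gram order is longer than one of the strings
        # Clipped overlap: each n-gram counts at most as often as it occurs in both.
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / hyp_total)
        recalls.append(overlap / ref_total)
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta combines averaged precision and recall; beta > 1 weights recall more.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


if __name__ == "__main__":
    print(simple_chrf("hola mundo", "hola mundo"))
    print(simple_chrf("hola mondo", "hola mundo"))
```

Under these assumptions, an exact match scores 1.0, strings with no shared characters score 0.0, and a near miss such as a single wrong character falls in between. For reported results, an established implementation of BLEU, chrF, and TER should be preferred over a sketch like this one.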