Multimodal Large Language Model-Enabled Video Translation: A Role-Oriented Survey

ArXiv CS.CV · 2026-06-02 · Authors: Bingzheng Qu, Kehai Chen, Xuefeng Bai, Min Zhang

Abstract

arXiv:2604.11283v2

Recent progress in multimodal large language models (MLLMs) is reshaping video translation from a cascaded pipeline of automatic speech recognition, machine translation, text-to-speech, and lip synchronization into a unified multimodal reasoning and generation problem. High-quality video translation requires not only semantic fidelity, but also temporal alignment, speaker consistency, and emotional expressiveness across visual, acoustic, and linguistic streams. This survey provides a focused review of MLLM-enabled video translation through a role-oriented taxonomy. We organize MLLM-enabled and MLLM-relevant studies into three functional roles: Semantic Reasoner, which grounds translation in video understanding, temporal reasoning, and multimodal fusion; Expressive Performer, which supports controllable and context-aware speech generation;