Unveil: Unified Visual-Textual Integration and Distillation for Multi-modal Document Retrieval 文章

ArXiv CS.CV2026-05-26NEWSen作者: Hao Sun, Yingyan Hou, Jiayan Guo, Bo Wang, Chunyu Yang, Jinsong Ni, Yan Zhang

摘要

arXiv:2605.24530v1 Announce Type: cross Abstract: Document retrieval in real-world scenarios faces significant challenges due to diverse document formats and modalities. Traditional text-based approaches rely on tailored parsing techniques that disregard layout information and are prone to errors, while recent parsing-free visual methods often struggle to capture fine-grained textual semantics in text-rich scenarios. To address these limitations, we propose \textbf{Unveil}, a novel visual-textual embedding framework that effectively integrates textual and visual features for robust document representation. Through knowledge distillation, we transfer the semantic understanding capabilities from the visual-textual embedding model to a purely visual model, enabling efficient parsing-free retrieval while preserving semantic fidelity.

相关公司

暂无数据

相关人物

暂无数据

相关技术

暂无数据