From PDF to RAG-Ready: Evaluating Document Conversion Frameworks for Domain-Specific Question Answering

ArXiv CS.AI, 2026-05-27
Authors: José Guilherme Marques dos Santos, Ricardo Yang, Rui Humberto Pereira, Alexandre Sousa, Brígida Mónica Faria, Henrique Lopes Cardoso, José Duarte, José Luís Reis, Luís Paulo Reis, Pedro Pimenta, José Paulo Marques dos Santos

Abstract

arXiv:2604.04948v2

Retrieval-Augmented Generation (RAG) systems depend critically on the quality of document preprocessing, yet no prior study has evaluated PDF processing frameworks by their impact on downstream question-answering accuracy. We address this gap through a systematic comparison of four open-source PDF-to-Markdown conversion frameworks (Docling, MinerU, Marker, and DeepSeek OCR) across 21 pipeline configurations that vary the conversion tool, cleaning transformations, splitting strategy, and metadata enrichment. Evaluation used a 50-question benchmark over a corpus of 36 Portuguese administrative documents (1706 pages, ~492K words), with LLM-as-judge scoring over 50 independent runs per configuration. Statistical significance was assessed via Wilcoxon signed-rank tests, with Cohen's d as the effect size. Two baselines bounded the results: naïve PDFLoader (86.2%) and manually curated Markdown (91.3%).
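The statistical comparison described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code. It assumes that two configurations are compared on scores paired by run, that Cohen's d is computed on the paired differences (the abstract does not say which variant was used), and that the Wilcoxon p-value comes from the normal approximation without a tie correction. The function names and the sample scores are hypothetical.

```python
import math
from statistics import mean, stdev


def cohens_d_paired(a, b):
    """Cohen's d for paired samples: mean difference / SD of the differences.

    Assumption: the paired variant is used; the abstract does not specify this.
    """
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / stdev(diffs)


def wilcoxon_signed_rank(a, b):
    """Two-sided Wilcoxon signed-rank test via the normal approximation.

    Returns (W, p), where W is the smaller of the positive and negative
    rank sums. Zero differences are dropped and tied absolute differences
    receive average ranks. No tie correction is applied to the variance.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of equal absolute differences starting at i.
        j = i
        while j + 1 < n and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, ordered) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, ordered) if d < 0)
    w = min(w_plus, w_minus)

    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w, p


# Hypothetical per-run counts of correct answers (out of 50 questions);
# these numbers are invented for illustration and are not from the paper.
config_a = [45, 46, 44, 47, 45]
config_b = [43, 44, 43, 44, 41]

w_stat, p_value = wilcoxon_signed_rank(config_a, config_b)
effect = cohens_d_paired(config_a, config_b)
print(f"W={w_stat}, p={p_value:.4f}, d={effect:.2f}")
```

With 50 runs per configuration, as in the study, the normal approximation is usually reasonable. For small samples or many ties, an exact test from a statistics library would be the safer choice.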