From PDF to RAG-Ready: Evaluating Document Conversion Frameworks for Domain-Specific Question Answering
Abstract
arXiv:2604.04948v2
Retrieval-Augmented Generation (RAG) systems depend critically on the quality of document preprocessing, yet no prior study has evaluated PDF processing frameworks by their impact on downstream question-answering accuracy. We address this gap through a systematic comparison of four open-source PDF-to-Markdown conversion frameworks (Docling, MinerU, Marker, and DeepSeek OCR) across 21 pipeline configurations that vary the conversion tool, cleaning transformations, splitting strategy, and metadata enrichment. Evaluation was performed using a 50-question benchmark over a corpus of 36 Portuguese administrative documents (1706 pages, ~492K words), with LLM-as-judge scoring over 50 independent runs per configuration. Statistical significance was assessed via Wilcoxon signed-rank tests with Cohen's d effect sizes. Two baselines bounded the results: naive PDFLoader (86.2%) and manually curated Markdown (91.3%).
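As a rough illustration of the statistical comparison named in the abstract, the sketch below computes a Wilcoxon signed-rank statistic and Cohen's d for two paired lists of per-run accuracy scores. It is not the authors' code: the function names `wilcoxon_signed_rank` and `cohens_d` are hypothetical, the p-value uses a plain normal approximation without tie or continuity correction, and the pooled-standard-deviation form of Cohen's d is an assumption, since the abstract does not say which variant was used.

```python
import math
import statistics


def cohens_d(a, b):
    """Cohen's d using a pooled standard deviation (assumes equal-sized samples)."""
    pooled_sd = math.sqrt((statistics.variance(a) + statistics.variance(b)) / 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd


def wilcoxon_signed_rank(a, b):
    """Return (W, two-sided p) for paired samples a and b.

    W is the smaller of the positive and negative rank sums. The p-value
    comes from a normal approximation with no tie or continuity correction,
    so it is only a rough guide for small samples.
    """
    # Zero differences carry no sign information and are dropped.
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, giving tied values their average rank.
    ordered = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[ordered[j + 1]]) == abs(diffs[ordered[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[ordered[k]] = avg_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)

    # Normal approximation to the null distribution of W.
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))
    return w, p
```

In a setup like the one described, `a` and `b` would each hold the 50 per-run accuracy scores of two pipeline configurations, paired by run. For real analyses a maintained implementation such as `scipy.stats.wilcoxon` is the safer choice, since it handles exact small-sample distributions and corrections that this sketch omits.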