How Do Document Parsers Break? Auditing Structural Vulnerability in Document Intelligence

ArXiv CS.CL | 2026-05-27 | Authors: Yue Chen, Yihao Wang, Ziyi Tang, Yongsen Zheng, Keze Wang

Abstract

arXiv:2605.19309v2 (announce type: replace)

Document Layout Analysis (DLA) pipelines provide structured page representations for retrieval-augmented generation, long-document question answering, and other document intelligence systems, yet their robustness evaluation remains largely area-centric. We identify this as Footprint Bias and propose ProSA, a lightweight output-level auditing framework that decouples controlled probing, policy-driven targeting, and structure-aware diagnosis. ProSA combines Block-level Structural Loss Rate (B-SLR), granularity-aware exposure descriptors, and pathway attribution to analyze where structural identity is lost, at what exposure granularity failures emerge, and how failures propagate. Across MinerU and PP-StructureV3 on 1,000 pages, affected area only weakly tracks perturbation-induced OCR instability (R^2=0.384/0.110), whereas B-SLR aligns much more closely with it (R^2=0.727/0.916).
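The R^2 figures quoted above describe how much of the variance in one quantity a predictor accounts for. The abstract does not say which regression the authors used, so the following is only a minimal sketch under the assumption of a simple least-squares line with an intercept, where R^2 equals the squared Pearson correlation. The function name `r_squared` is hypothetical and is not part of ProSA.

```python
from statistics import mean


def r_squared(x, y):
    """R^2 of a simple least-squares line y ~ a + b*x.

    Assumption: one predictor plus an intercept, in which case R^2 is
    the squared Pearson correlation between x and y. This is an
    illustrative helper, not the paper's evaluation code.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mx, my = mean(x), mean(y)
    # Sums of squared deviations and the cross-deviation term.
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary for R^2 to be defined")
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return (sxy * sxy) / (sxx * syy)
```

Applied once with per-page affected area as `x` and once with a per-page structural loss score as `x`, against the same OCR instability values as `y`, this would produce the kind of comparison the abstract reports. Any inputs used to try it out are placeholders, not the paper's data.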