Digging Up Citations: FOSSIL, a Dataset and Workflow for Reference Extraction in Law and the Humanities

ArXiv cs.CL, 2026-06-02. Authors: Luca Foppiano, Christian Boulanger

Abstract

arXiv:2606.01109v1 (cross-listed)

Citation extraction tools are designed for the structured end-of-document bibliographies of the natural sciences, but law and humanities scholarship cites references primarily in footnotes, where bibliographic data is interleaved with commentary and cross-references and varies widely across languages and styles. To address the scarcity of suitable gold-standard resources, we present FOSSIL (Footnote-based Open-access SSH Scientific Instance Labels), an openly licensed multilingual dataset of 96 annotated scholarly articles containing over 7,600 footnote-embedded references, together with PDF-TEI Editor (a collaborative web annotation tool), a documented seven-annotator workflow, and a Grobid specialization for footnote-based citations. In end-to-end evaluation, the specialized pipeline nearly doubles extraction quality over default Grobid (micro-F1 from 0.36 to 0.
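For readers unfamiliar with the metric named in the abstract, the following is a minimal sketch of how a micro-averaged F1 score is commonly computed: true positives, false positives, and false negatives are pooled across all documents before precision and recall are taken. The abstract does not say how predicted references are matched against gold ones, so the function name `micro_f1` and the per-document counts below are hypothetical illustrations, not the authors' evaluation code or data.

```python
from typing import Iterable, Tuple


def micro_f1(per_doc_counts: Iterable[Tuple[int, int, int]]) -> float:
    """Micro-averaged F1 from per-document (tp, fp, fn) counts.

    Counts are summed over all documents first, so documents with many
    references weigh more than documents with few.
    """
    counts = list(per_doc_counts)
    tp = sum(c[0] for c in counts)  # correctly extracted references
    fp = sum(c[1] for c in counts)  # extracted but not in the gold standard
    fn = sum(c[2] for c in counts)  # in the gold standard but missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for two documents (not taken from the paper).
example = [(3, 1, 2), (1, 1, 0)]
score = micro_f1(example)
```

Pooling counts this way differs from macro-averaging, which would compute F1 per document and then average, giving every document equal weight regardless of how many references it contains.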
