Source-Grounded Data Generation for Text-to-JSON Learning

ArXiv CS.CL | 2026-06-19 | NEWS | en | Authors: Sunghee Ahn, Guijin Son, Youngjae Yu

Details

Source site
ArXiv CS.CL
Authors
Sunghee Ahn, Guijin Son, Youngjae Yu
Article type
NEWS
Language
en
Publication date
2026-06-19

Abstract

arXiv:2606.20072v1 (announce type: new). From financial filings to clinical records, legacy industries rely heavily on long, unstructured documents to store high-value information. Reliably extracting this information into structured, machine-readable representations is a key prerequisite for making the contents accessible to automated systems. JSON is a natural target for such structured extraction, yet constructing reliable and scalable text-to-JSON training data remains challenging. To address this gap, we propose STAGE (Spreadsheet-grounded Text-to-JSON Artifact GEneration), a source-grounded data generation pipeline that constructs reports and JSON schemas, using LLMs for scalable synthesis while validating ground-truth values against the underlying spreadsheet. Evaluations on STAGE-Eval, our source-grounded benchmark with an 851-example test set, show that STAGE produces stronger training data than existing approaches. This improves Qwen3-4B exact match from 31.37% to 74.
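The abstract does not say how values are validated against the spreadsheet or how exact match is scored, so the following is only a minimal sketch of what such checks could look like, not the authors' implementation. The function names (`leaf_values`, `ungrounded_values`, `exact_match`), the string normalization, and the toy spreadsheet rows are all hypothetical choices made for illustration.

```python
import json


def leaf_values(obj):
    """Yield every scalar leaf in a nested JSON-like structure."""
    if isinstance(obj, dict):
        for value in obj.values():
            yield from leaf_values(value)
    elif isinstance(obj, list):
        for value in obj:
            yield from leaf_values(value)
    else:
        yield obj


def normalize(value):
    """Compare values as trimmed, lower-cased strings (a simplifying assumption)."""
    return str(value).strip().lower()


def ungrounded_values(target, rows):
    """Return the leaf values of `target` that match no cell in the source rows."""
    cells = {normalize(cell) for row in rows for cell in row}
    return [v for v in leaf_values(target) if normalize(v) not in cells]


def exact_match(prediction_text, gold):
    """True only if the prediction parses as JSON and equals the gold object."""
    try:
        return json.loads(prediction_text) == gold
    except json.JSONDecodeError:
        return False


# Toy spreadsheet (hypothetical data): a header row and one record.
rows = [["company", "revenue", "year"], ["Acme", 1200, 2024]]
gold = {"company": "Acme", "financials": {"revenue": 1200, "year": 2024}}
```

Under these assumptions, a candidate target would be kept as training data only when `ungrounded_values(target, rows)` comes back empty, and a model output would count as correct only when `exact_match` holds for the whole object. Comparing parsed objects rather than raw strings makes the score insensitive to key order and whitespace, which is one reasonable reading of "exact match" but may differ from the paper's definition.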
