SPECTRA: Synthetic IR Test Collections with Relevance Oracles and Controlled Distractor Diagnostics

Details

Source site: ArXiv CS.AI
Author: Eric Liang
Article type: NEWS
Language: en
Publication date: 2026-06-01

Abstract

arXiv:2605.31575v1 (announce type: cross)

Scalable information retrieval testing needs corpora that are large enough to stress index construction, ranking latency, query routing, and evaluation tooling, yet human-judged test collections remain expensive and may be unavailable when documents are private or still under design. This paper introduces SPECTRA, a reproducible framework for generating synthetic text corpora and retrieval test collections through a separation of latent topical structure, surface text realization, metadata controls, query intent generation, and deterministic relevance oracles. The framework is intended as a diagnostic complement to Cranfield-style and TREC-style evaluation, not as a replacement for human assessment. A single-process Python prototype generated corpora up to 60,000 documents and 9.61 million tokens while preserving controllable long-tail vocabulary growth and producing graded relevance labels for 96 queries.
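
To make the idea of separating latent topical structure from surface text concrete, here is a minimal toy sketch in Python. It is not the SPECTRA implementation: the vocabulary, the names `TOPIC_VOCAB`, `Doc`, `make_corpus`, `oracle`, and `qrels`, the 75% mixing ratio, and the 0/1/2 grading scale are all assumptions made for illustration. The only point it tries to show is that when each document carries its hidden topics, graded relevance can be computed deterministically from that structure instead of being judged by hand.

```python
import random
from dataclasses import dataclass

# Hypothetical toy vocabulary per latent topic (illustrative, not from the paper).
TOPIC_VOCAB = {
    "network": ["packet", "route", "socket", "latency"],
    "ranking": ["score", "query", "index", "rank"],
    "storage": ["disk", "block", "cache", "volume"],
}


@dataclass(frozen=True)
class Doc:
    doc_id: str
    topics: tuple  # latent topic labels, primary topic first
    text: str      # surface realization sampled from those topics


def make_corpus(n_docs, seed=0, doc_len=12):
    """Generate documents and record their latent topics next to the text."""
    rng = random.Random(seed)  # fixed seed makes the corpus reproducible
    names = sorted(TOPIC_VOCAB)
    docs = []
    for i in range(n_docs):
        primary, secondary = rng.sample(names, 2)
        words = []
        for _ in range(doc_len):
            # Assumed mix: about 75% of tokens come from the primary topic.
            topic = primary if rng.random() < 0.75 else secondary
            words.append(rng.choice(TOPIC_VOCAB[topic]))
        docs.append(Doc(f"d{i}", (primary, secondary), " ".join(words)))
    return docs


def oracle(query_topic, doc):
    """Deterministic graded relevance derived only from latent structure."""
    if doc.topics[0] == query_topic:
        return 2  # query topic is the document's primary topic
    if query_topic in doc.topics:
        return 1  # query topic appears only as the secondary topic
    return 0      # query topic is absent from the document


def qrels(query_topic, docs):
    """Build a relevance-judgment mapping for one query intent."""
    return {d.doc_id: oracle(query_topic, d) for d in docs}
```

In this toy setup, documents graded 1 share some vocabulary with the query topic without being primarily about it, which loosely resembles the kind of controlled distractor the title refers to. How SPECTRA actually defines distractors and relevance grades is not stated in the abstract.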
