Evaluating AI-based Scientific Knowledge Synthesis with Epidemiological Systematic Reviews

ArXiv CS.AI | 2026-06-08 | NEWS | en

Authors: Shreyansh Padarha, Ryan Othniel Kearns, Tristan Naidoo, Lingyi Yang, Łukasz Borchmann, Piotr Błaszczyk, Christian Morgenstern, Ruth McCabe, Sangeeta Bhatia, Philip H. Torr, Jakob Foerster, Scott A. Hale, Thomas Rawson, Anne Cori, Elizaveta Semenova, Adam Mahdi

Abstract

arXiv:2603.22327v2 (announce type: replace-cross)

Systematic literature reviews (SLRs) are a demanding and high-stakes form of scientific knowledge synthesis that remains underspecified as an evaluation setting for large language models (LLMs). We introduce AgentSLR, a large-scale evaluation harness comprising an SLR automation workflow and an expert-annotated dataset covering 16,248 articles, designed to test LLM capabilities across the stages of SLRs in epidemiology. Reference annotations were derived from peer-reviewed studies on WHO priority pathogens and produced by domain experts. The harness evaluates each review stage as a separate unit with dedicated metrics, enabling targeted failure analysis. We evaluated five frontier reasoning models and found that no single model dominated across all tasks, showing sub-task specialisation often hidden by aggregate benchmarks. Structured data extraction is a major bottleneck, with no model exceeding an average field-level F1 of 0.
