Counterfactual Evaluation Reveals Hidden Capability Profiles in Clinical LLMs and Agents

arXiv cs.CL | 2026-06-01 | Author: Matt Turk

Abstract

arXiv:2605.30590v1 (announce type: cross)

Two clinical AI systems can score nearly identically on coverage-based rubrics yet behave radically differently when their patient inputs change: one updates its recommendations to match the new clinical signal, while the other produces the same output regardless. We introduce the Causal Sensitivity Score (CSS), a pre-registered interventional metric that mutates oncology tumor-board cases along five clinically meaningful dimensions (biomarker flips, prior-treatment failures, biomarker removals, surgery-status changes, and stage perturbations). For each mutation, CSS scores on a {0, 0.5, 1.0} scale whether the model updates its recommendations in the pre-registered correct direction.
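To make the scoring scheme concrete, the following is a minimal Python sketch of how per-mutation scores on the {0, 0.5, 1.0} scale might be rolled up by dimension. It is not the paper's implementation. The abstract does not say how scores are aggregated or what 0.5 denotes, so the per-dimension mean, the macro-average in `css_overall`, and the reading of 0.5 as a partial update are all assumptions. The names `MutationResult`, `css_by_dimension`, and `css_overall` are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean

# The five mutation dimensions named in the abstract (identifier spelling is ours).
DIMENSIONS = (
    "biomarker_flip",
    "prior_treatment_failure",
    "biomarker_removal",
    "surgery_status_change",
    "stage_perturbation",
)

# Scale from the abstract. Assumption: 0.0 = no update or wrong direction,
# 0.5 = partial update, 1.0 = update in the pre-registered correct direction.
ALLOWED_SCORES = (0.0, 0.5, 1.0)


@dataclass(frozen=True)
class MutationResult:
    """One scored counterfactual: a mutated case and the model's graded response."""
    case_id: str
    dimension: str
    score: float


def css_by_dimension(results):
    """Mean score per dimension (assumed aggregation); unscored dimensions are omitted."""
    buckets = {d: [] for d in DIMENSIONS}
    for r in results:
        if r.dimension not in buckets:
            raise ValueError(f"unknown dimension: {r.dimension!r}")
        if r.score not in ALLOWED_SCORES:
            raise ValueError(f"score must be one of {ALLOWED_SCORES}, got {r.score!r}")
        buckets[r.dimension].append(r.score)
    return {d: mean(scores) for d, scores in buckets.items() if scores}


def css_overall(results):
    """Macro-average across dimensions (assumed), so each dimension counts equally."""
    per_dim = css_by_dimension(results)
    if not per_dim:
        raise ValueError("no results to aggregate")
    return mean(list(per_dim.values()))


# Illustrative, made-up scores for a single hypothetical model.
example = [
    MutationResult("case-01", "biomarker_flip", 1.0),
    MutationResult("case-02", "biomarker_flip", 0.5),
    MutationResult("case-01", "stage_perturbation", 0.0),
]
profile = css_by_dimension(example)
overall = css_overall(example)
```

Keeping the per-dimension breakdown alongside the overall number matters for the argument in the abstract: two systems with similar aggregate rubric scores can still differ sharply on individual dimensions, and a single averaged figure would hide that.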
