摘要
arXiv:2605.30590v1 Announce Type: cross Abstract: Two clinical AI systems can score nearly identically on coverage-based rubrics yet behave radically differently when their patient inputs change: one updates its recommendations to match the new clinical signal, while the other produces the same output regardless. We introduce the Causal Sensitivity Score (CSS), a pre-registered interventional metric that mutates oncology tumor-board cases along five clinically meaningful dimensions - biomarker flips, prior-treatment failures, biomarker removals, surgery-status changes, and stage perturbations - and scores whether each model updates its recommendations in the pre-registered correct direction using a {0, 0.5, 1.0} scale.
相关事件查看全部 (1)
Counterfactual Evaluation Reveals Hidden Capability Profiles in Clinical LLMs and Agents
2026-06-01PRODUCT_LAUNCH影响: MEDIUM
相关公司
暂无数据
相关人物
暂无数据
相关产品
暂无数据