Stress-testing medical large language models reveals latent safety pathology beyond benchmark accuracy

Event

PRODUCT_LAUNCH | 2026-06-09 | Impact: MEDIUM

arXiv:2606.07929v1 | Announce Type: new

Abstract: Large language models (LLMs) are entering clinical practice based on benchmark accuracy that may fail to detect safety-relevant failure modes. Here we present AI-MASLD, a stress-audit framework that adapts the logic of metabolic stress testing from hepatology to the evaluation of clinical LLMs. Using 240 clinical cases across six narrative pertur