Auditing Stealth Sycophancy in Mental-Health Dialogue: Structured Clinical-State Diagnostics and Clean Matched Benchmarks 文章

ArXiv CS.CL2026-05-26NEWSen作者: Tianze Han, Beining Xu, Hanbo Zhang, Yongming Lu

摘要

arXiv:2605.03472v2 Announce Type: replace Abstract: Mental-health dialogue models are increasingly evaluated by AI-based evaluators, yet these evaluators often treat surface empathy, supportiveness, or fluency as evidence of safety. In this paper, we study a hidden failure mode that we call implicit sycophancy: a response may appear empathetic while implicitly reinforcing catastrophizing, avoidance, hopeless prediction, or CBT-style labeling. To examine this problem, we introduce a diagnostic benchmark for implicit-sycophancy detection, built from three representative mental-health dialogue sources covering everyday peer support, counseling-style emotional support, and crisis-oriented interaction, and further construct a leakage-audited clean single-response matched benchmark with 500 contexts and 1,500 matched response windows.