Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs

ArXiv CS.AI | 2026-05-26 | Authors: Dylan Feng, Pragya Srivastava, Anca Dragan, Cassidy Laidlaw

Abstract

arXiv:2605.21602v2. Many safety and alignment failures of large language models (LLMs) occur due to out-of-distribution (OOD) situations: unusual prompt or response patterns that are unforeseen by model developers. We systematically study whether LLM monitoring pipelines can detect these OOD alignment failures by introducing a benchmark called Misalignment Out Of Distribution (MOOD). It is difficult to find failures that are truly OOD for off-the-shelf models trained on vast safety datasets. We sidestep this by including a restricted training set in MOOD that we use to train our own monitors, as well as seven test sets with diverse alignment failures that are outside the training distribution. Using MOOD, we find that guard models (safety classifiers) often fail to generalize OOD. To fix this, we propose combining guard models with OOD detectors.
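The abstract does not say how the guard model and the OOD detector are combined, so the following is only an illustrative sketch of one simple possibility, not the authors' method. It assumes a guard model that outputs a probability of unsafe content, a hypothetical feature vector for each input (for example, an embedding), and per-dimension training statistics. The OOD score here is a mean squared z-score, and the monitor flags an input if either signal crosses its threshold. All names and thresholds (`ood_score`, `combined_monitor`, `guard_threshold`, `ood_threshold`) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Decision:
    flagged: bool
    reason: str  # "guard", "ood", or "pass"


def ood_score(
    features: Sequence[float],
    train_mean: Sequence[float],
    train_std: Sequence[float],
) -> float:
    """Mean squared z-score of the features under training statistics.

    Hypothetical stand-in for an OOD detector: larger values mean the
    input lies further from the training distribution.
    """
    squared_z = [
        ((f - m) / s) ** 2
        for f, m, s in zip(features, train_mean, train_std)
    ]
    return sum(squared_z) / len(squared_z)


def combined_monitor(
    guard_prob: float,
    features: Sequence[float],
    train_mean: Sequence[float],
    train_std: Sequence[float],
    guard_threshold: float = 0.5,  # assumed value, not from the paper
    ood_threshold: float = 9.0,    # assumed value, not from the paper
) -> Decision:
    """Flag an input if the guard model OR the OOD detector fires."""
    if guard_prob >= guard_threshold:
        return Decision(True, "guard")
    if ood_score(features, train_mean, train_std) >= ood_threshold:
        return Decision(True, "ood")
    return Decision(False, "pass")
```

Under this OR rule, an input that the guard model scores as safe can still be escalated when it looks unlike the training data, which is the failure case the abstract describes. Other combinations, such as a weighted sum of the two scores, would fit the same description equally well.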