DDOR: Delta Debugging for Explainable Overrefusal Testing and Repair 事件
PRODUCT_LAUNCH2026-06-03影响: MEDIUM
DDOR: Delta Debugging for Explainable Overrefusal Testing and Repair arXiv:2606.03601v1 Announce Type: cross Abstract: While safety alignment and guardrails help large language models (LLMs) avoid harmful outputs, they can also induce overrefusal, i.e., unwarranted rejection of benign queries that merely appear risky. We present DDOR (Delta Debugging for OverRefusal), a fully automated and explainable framework for overrefusal testing and repair in a black-box setting, where only model inputs a
DDOR: Delta Debugging for Explainable Overrefusal Testing and Repair · 相关报道
相关报道
DDOR: Delta Debugging for Explainable Overrefusal Testing and Repair
ArXiv CS.AI2026-06-03