Inverting the Shield: Systematically Generating Safety Tests from Policy Specifications

arXiv cs.AI | 2026-05-26 | Authors: Xiaoyue Lu, Xianglin Yang, Haijun Liu, Jiahao Liu, Kuntai Cai, Yan Xiao, Jin Song Dong

Abstract

arXiv:2605.24883v1 (new submission). The widespread integration of Large Language Models (LLMs) necessitates rigorous and systematic safety evaluation. Existing paradigms either rely on constructed benchmarks that assess safety from predefined perspectives, or employ dynamic red-teaming to probe potential vulnerabilities. While effective, these approaches depend heavily on expert domain knowledge, offer limited systematic guarantees, and are vulnerable to rapid obsolescence. To address these limitations, we introduce POLARIS, a novel framework that brings the rigor of specification-based software testing to AI safety. POLARIS first compiles unstructured natural-language policies into First-Order Logic (FOL) representations, establishing a traceable link between high-level rules and concrete test cases. This formalization enables the construction of a Semantic Policy Graph, in which complex policy violation scenarios are encoded as traversable paths.
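
The idea of encoding violation scenarios as traversable paths can be illustrated with a minimal sketch. This is not the POLARIS implementation: the graph contents, the predicate-style labels, and the `enumerate_paths` helper are all hypothetical, chosen only to show how each path from a request node to a violation node could be read as one candidate test scenario.

```python
from typing import Dict, List

# Hypothetical toy graph (an assumption, not taken from the paper):
# keys are FOL-style predicate labels, values are the labels reachable next.
POLICY_GRAPH: Dict[str, List[str]] = {
    "Request(user, x)": ["Concerns(x, weapon)", "Concerns(x, medication)"],
    "Concerns(x, weapon)": ["Provides(model, instructions)"],
    "Concerns(x, medication)": ["Provides(model, dosage_advice)"],
    "Provides(model, instructions)": ["Violation"],
    "Provides(model, dosage_advice)": ["Violation"],
    "Violation": [],
}


def enumerate_paths(graph: Dict[str, List[str]], start: str, goal: str) -> List[List[str]]:
    """Return every simple path from start to goal using depth-first search."""
    paths: List[List[str]] = []

    def dfs(node: str, path: List[str]) -> None:
        if node == goal:
            paths.append(path)
            return
        for nxt in graph.get(node, []):
            if nxt not in path:  # skip nodes already on the path to avoid cycles
                dfs(nxt, path + [nxt])

    dfs(start, [start])
    return paths


# Each enumerated path is one candidate violation scenario to turn into a test.
violation_paths = enumerate_paths(POLICY_GRAPH, "Request(user, x)", "Violation")
for p in violation_paths:
    print(" -> ".join(p))
```

In this toy setting, every path stays traceable to the rule labels it passes through, which mirrors the link between high-level rules and concrete test cases that the abstract describes.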
