Efficient LLM Moderation with Multi-Layer Latent Prototypes 文章

ArXiv CS.CL2026-06-02NEWSen作者: Maciej Chrab\k{a}szcz, Filip Szatkowski, Bartosz W\'ojcik, Jan Dubi\'nski, Tomasz Trzci\'nski, Sebastian Cygert

摘要

arXiv:2502.16174v4 Announce Type: replace-cross Abstract: Although modern LLMs are aligned with human values during post-training, robust moderation remains essential to prevent harmful outputs at deployment time. Existing approaches suffer from performance-efficiency trade-offs and are difficult to customize to user-specific requirements. Motivated by this gap, we introduce Multi-Layer Prototype Moderator (MLPM), a lightweight and highly customizable input moderation tool. We propose leveraging prototypes of intermediate representations across multiple layers to improve moderation quality while maintaining high efficiency. By design, our method adds negligible overhead to the generation pipeline and can be seamlessly applied to any model. MLPM achieves state-of-the-art performance on diverse moderation benchmarks and demonstrates strong scalability across model families of various sizes.

相关公司

暂无数据

相关人物

暂无数据

相关技术

暂无数据