Safety-Oriented Routing Analysis of Mixtral MoE Under Benign and Harmful Prompts

arXiv cs.AI | 2026-05-26 | Author: Md Nurul Absar Siddiky

Abstract

arXiv:2605.24270v1

Sparse mixture-of-experts (MoE) language models activate only a small subset of parameters for each token, making router behavior a central part of model computation. This paper studies routing behavior of Mixtral 8x7B-Instruct under benign and harmful prompts using two complementary signals: activation-based routing scores derived from expert selection frequencies and gradient-based scores derived from router-gate sensitivities. We analyze expert- and layer-level routing behavior and conduct expert-suppression interventions. The results show that activation-based expert usage is broad and long-tailed, whereas gradient-based importance is concentrated. At expert level, benign and harmful prompt groups remain close under both signals with modest separation. At layer level, activation-based routing is most selective around layers 8-15, while gradient-based importance is concentrated in final layers.
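To make the activation-based signal concrete, here is a minimal sketch of how expert selection frequencies could be computed from router logits and compared between two prompt groups. It is an illustration under stated assumptions, not the paper's implementation. The function names, the toy logits, and the 4-expert setup are hypothetical, and total variation distance is used here only as one plausible way to quantify separation between groups. The one detail taken from the model itself is top-2 routing, which Mixtral 8x7B uses over 8 experts per layer.

```python
from collections import Counter


def expert_selection_frequencies(router_logits, num_experts, top_k=2):
    """Return the fraction of routing slots assigned to each expert.

    router_logits: list of per-token lists, each of length num_experts
                   (hypothetical input format, one layer at a time).
    """
    counts = Counter()
    for logits in router_logits:
        # Indices of the top_k largest logits for this token.
        chosen = sorted(range(num_experts), key=lambda e: logits[e], reverse=True)[:top_k]
        counts.update(chosen)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * num_experts
    return [counts[e] / total for e in range(num_experts)]


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Toy router logits for one layer with 4 experts (made-up numbers).
benign_logits = [
    [2.0, 1.0, 0.1, -1.0],
    [1.5, 0.2, 1.9, 0.0],
    [0.3, 2.2, 2.1, -0.5],
]
harmful_logits = [
    [0.1, 0.2, 3.0, 2.5],
    [2.0, -1.0, 0.5, 1.0],
]

benign_freq = expert_selection_frequencies(benign_logits, num_experts=4)
harmful_freq = expert_selection_frequencies(harmful_logits, num_experts=4)
separation = total_variation(benign_freq, harmful_freq)

print("benign :", benign_freq)
print("harmful:", harmful_freq)
print("total variation distance:", separation)
```

On a real model the logits would come from each layer's router gate, and the comparison would be repeated per layer to look for the kind of layer-level selectivity the abstract describes.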
