Tying the Loop -- Tied Expert Layers in Mixture-of-Experts Language Models

ArXiv CS.CL | 2026-06-16 | NEWS | en | Author: Martin Jaggi

Details

Source site: ArXiv CS.CL
Author: Martin Jaggi
Article type: NEWS
Language: en
Publication date: 2026-06-16

Abstract

arXiv:2606.16825v1 (announce type: new)

Mixture-of-Experts (MoE) architectures scale Large Language Models (LLMs) efficiently by activating only a small fraction of their experts per token, yet the full parameter count, dominated by the expert parameters, must still be held in memory during both training and inference. To address this, we introduce Expert Tying, an architectural modification that shares expert parameters across consecutive transformer layers while preserving independent, layer-wise routing and attention. We evaluate this approach across common, state-of-the-art architectures, including OLMoE, Qwen3, and DeepSeek-style MoEs. Our pretraining experiments demonstrate that tying experts can reduce the memory footprint by almost 2x with virtually no degradation in perplexity or downstream quality.
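To make the memory argument concrete, the following is a minimal parameter-counting sketch, not the authors' implementation. It assumes experts are tied in groups of two consecutive layers (the abstract says only "consecutive layers"), that each expert is a bias-free gated feed-forward block with three projections, and that attention has four square projections per layer. The names `MoEConfig`, `tie_group`, `expert_bank_index`, and `param_counts`, as well as all sizes, are hypothetical and chosen only for illustration. Embeddings and normalization parameters are left out.

```python
from dataclasses import dataclass


@dataclass
class MoEConfig:
    # All values below are illustrative assumptions, not taken from the paper.
    n_layers: int = 16
    d_model: int = 2048
    d_ff: int = 1024       # hidden width of each expert
    n_experts: int = 64
    tie_group: int = 1     # consecutive layers sharing one expert bank (1 = untied)


def expert_bank_index(layer: int, cfg: MoEConfig) -> int:
    """Map a layer to the expert bank it reads from; consecutive layers share a bank."""
    return layer // cfg.tie_group


def param_counts(cfg: MoEConfig) -> dict:
    # Assumed expert shape: up, gate, and down projections, no biases.
    per_expert = 3 * cfg.d_model * cfg.d_ff
    # Only distinct banks are stored, however many layers point at them.
    n_banks = len({expert_bank_index(layer, cfg) for layer in range(cfg.n_layers)})
    experts = n_banks * cfg.n_experts * per_expert
    # Routers and attention stay independent per layer, as the abstract describes.
    routers = cfg.n_layers * cfg.d_model * cfg.n_experts
    attention = cfg.n_layers * 4 * cfg.d_model * cfg.d_model
    return {
        "experts": experts,
        "routers": routers,
        "attention": attention,
        "total": experts + routers + attention,
    }


untied = param_counts(MoEConfig(tie_group=1))
tied = param_counts(MoEConfig(tie_group=2))
ratio = untied["total"] / tied["total"]
print(f"untied: {untied['total']:,}  tied: {tied['total']:,}  ratio: {ratio:.2f}x")
```

Under these assumptions the expert parameters halve exactly, while the per-layer routers and attention are unchanged, so the overall ratio lands a little below 2. That is consistent with the "almost 2x" figure in the abstract, though the actual configurations used in the paper may differ.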