Tying the Loop -- Tied Expert Layers in Mixture-of-Experts Language Models

ArXiv CS.CL, 2026-06-16, NEWS, en. Author: Martin Jaggi

Details

Source site: ArXiv CS.CL
Author: Martin Jaggi
Article type: NEWS
Language: en
Publication date: 2026-06-16

Abstract

arXiv:2606.16825v1. Announce type: new.

Mixture-of-Experts (MoE) architectures efficiently scale Large Language Models (LLMs) by activating only a small fraction of their experts per token, yet the full parameter count, which is dominated by the expert parameters, must still be held in memory during both training and inference. To address this, we introduce Expert Tying, an architectural modification that shares expert parameters across consecutive transformer layers while preserving independent, layer-wise routing and attention. We evaluate this approach across common, state-of-the-art architectures, including OLMoE, Qwen3, and DeepSeek-style MoEs. Our pretraining experiments demonstrate that tying experts can reduce the memory footprint by almost 2x with virtually no degradation in perplexity or downstream quality.
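The memory argument in the abstract can be illustrated with a rough parameter count. The sketch below is not taken from the paper: the `MoEConfig` class, every dimension in it, the gated-FFN expert shape, and the choice of tying experts across groups of two consecutive layers are all assumptions made for illustration. It only shows why sharing the expert bank, while keeping attention and routers per layer, brings the total close to (but not quite) half.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEConfig:
    """Hypothetical configuration; none of these values come from the paper."""
    n_layers: int = 16
    d_model: int = 2048
    d_expert: int = 1024   # hidden width of one expert FFN
    n_experts: int = 64    # experts per MoE layer
    tie_group: int = 2     # consecutive layers sharing one expert bank (assumed)


def expert_bank_params(cfg: MoEConfig) -> int:
    # Assumes a gated (SwiGLU-style) FFN: three d_model x d_expert matrices per expert.
    return cfg.n_experts * 3 * cfg.d_model * cfg.d_expert


def per_layer_untied_params(cfg: MoEConfig) -> int:
    # Parameters that stay independent in every layer: attention (Q, K, V, O
    # projections) and the router. Biases, norms, and embeddings are ignored.
    attention = 4 * cfg.d_model * cfg.d_model
    router = cfg.d_model * cfg.n_experts
    return attention + router


def total_params(cfg: MoEConfig, tied: bool) -> int:
    # With tying, each group of `tie_group` consecutive layers stores one bank.
    group = cfg.tie_group if tied else 1
    n_banks = -(-cfg.n_layers // group)  # ceiling division
    return cfg.n_layers * per_layer_untied_params(cfg) + n_banks * expert_bank_params(cfg)


cfg = MoEConfig()
untied = total_params(cfg, tied=False)
tied = total_params(cfg, tied=True)
print(f"untied: {untied / 1e9:.2f}B params")
print(f"tied:   {tied / 1e9:.2f}B params")
print(f"reduction: {untied / tied:.2f}x")
```

Under these assumed dimensions the reduction falls slightly short of 2x because the attention and router weights are not shared; the larger the share of parameters held by the experts, the closer the ratio gets to the tie group size.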
