A Dual-Path Architecture for Scaling Compute and Capacity in LLMs

ArXiv CS.CL | 2026-05-29 | Authors: Markus Frey, Behzad Shomali, Joachim Koehler, Mehdi Ali

Abstract

arXiv:2605.30202v1. Announce type: new.

Looped transformers apply a shared block multiple times and have emerged as a parameter-efficient route to scaling compute in language models. However, at fixed FLOPs a looped model has strictly less capacity than a baseline transformer. We propose a novel dual-path block that can flexibly scale both compute (the number of sequential operations applied to a hidden state) and capacity (the parameters available at a single step). The block exposes both axes as parallel pathways within a single layer: a deep sublayer re-applied K times with shared parameters, and a wide sublayer with an enlarged feed-forward network applied once. Independent per-token gates combine the two pathways and allow detailed per-token routing analyses. Across two FLOP budgets, our dual-path model outperforms iso-FLOP-matched models on language modeling and downstream evaluations, while using fewer parameters than the baseline at matched FLOPs.
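To make the block structure concrete, here is a minimal single-token sketch in plain Python. It is not the authors' implementation. Everything beyond the abstract is an assumption made for illustration: both sublayers are reduced to bias-free ReLU feed-forward updates (attention and normalization are omitted), each gate is a sigmoid of a linear projection of the input, and the gated path outputs are added residually. The names `DualPathBlock`, `d_deep`, `d_wide`, and `forward_token` are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small Gaussian-initialised matrix as a list of rows."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def ffn_update(h, W_in, W_out):
    """Bias-free two-layer ReLU feed-forward update (residual added by caller)."""
    hidden = [max(0.0, v) for v in matvec(W_in, h)]
    return matvec(W_out, hidden)


class DualPathBlock:
    """Toy dual-path block (assumed form, see the hedges in the lead-in)."""

    def __init__(self, d_model, d_deep, d_wide, K, seed=0):
        rng = random.Random(seed)
        self.K = K  # number of times the shared deep sublayer is re-applied
        # Deep path: one small sublayer whose weights are reused K times.
        self.deep_in = rand_matrix(d_deep, d_model, rng)
        self.deep_out = rand_matrix(d_model, d_deep, rng)
        # Wide path: enlarged feed-forward network, applied once.
        self.wide_in = rand_matrix(d_wide, d_model, rng)
        self.wide_out = rand_matrix(d_model, d_wide, rng)
        # Independent per-token gates (assumed: sigmoid of a linear projection).
        self.gate_deep = [rng.gauss(0.0, 0.1) for _ in range(d_model)]
        self.gate_wide = [rng.gauss(0.0, 0.1) for _ in range(d_model)]

    def forward_token(self, h):
        # Deep path: K sequential residual updates with shared parameters.
        z = list(h)
        for _ in range(self.K):
            update = ffn_update(z, self.deep_in, self.deep_out)
            z = [a + b for a, b in zip(z, update)]
        deep_delta = [a - b for a, b in zip(z, h)]
        # Wide path: a single application of the larger network.
        wide_delta = ffn_update(h, self.wide_in, self.wide_out)
        # Gates are computed independently, so they need not sum to one.
        g_deep = sigmoid(sum(w * v for w, v in zip(self.gate_deep, h)))
        g_wide = sigmoid(sum(w * v for w, v in zip(self.gate_wide, h)))
        out = [x + g_deep * d + g_wide * w
               for x, d, w in zip(h, deep_delta, wide_delta)]
        return out, (g_deep, g_wide)

    def num_params(self):
        mats = (self.deep_in, self.deep_out, self.wide_in, self.wide_out)
        n_matrix = sum(len(m) * len(m[0]) for m in mats)
        return n_matrix + len(self.gate_deep) + len(self.gate_wide)
```

Calling `DualPathBlock(d_model=8, d_deep=16, d_wide=64, K=3).forward_token([0.1] * 8)` returns the updated hidden state together with the two gate values, which is the kind of per-token signal a routing analysis would inspect. In this sketch, raising `K` adds sequential compute without changing `num_params()`, while raising `d_wide` adds parameters that are used only once per token.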
