Dynamic Short Convolutions Improve Transformers

arXiv cs.CL, 2026-06-03
Authors: Oliver Sieberling, Bharat Runwal, Rameswar Panda, Yoon Kim

Abstract

arXiv:2606.03825v1 (announce type: cross)

Transformers have become the dominant architecture for large language models, largely due to the scalability and flexibility of attention, feed-forward layers, residual connections, and normalization. This paper introduces dynamic short convolutions as an additional neural network primitive for improving Transformers. Unlike static short convolutions, dynamic convolutions use input-dependent filters, which preserves the locality bias of convolution while increasing expressivity. Motivating experiments show that applying dynamic short convolutions to key, query, and value representations improves performance on challenging associative recall tasks compared with static convolutional variants. Across language-modeling experiments ranging from 150M to 2B parameters, dynamic convolutions consistently outperform standard Transformers and Transformers augmented with static short convolutions. Fitting scaling laws indicates a 1.
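
The contrast between static and input-dependent filters can be illustrated with a small sketch. The abstract does not say how the dynamic filters are generated, so everything specific here is an assumption made for illustration: the filters are depthwise (one per channel), the convolution is causal with zero padding on the left, and the filter at each position is a linear projection of the current token. The names static_short_conv, dynamic_short_conv, and proj are hypothetical and are not taken from the paper.

```python
def static_short_conv(x, w):
    """Causal depthwise convolution with one fixed filter for every position.

    x: list of T tokens, each a list of d floats.
    w: list of K taps, each a list of d floats; w[k][c] weights x[t - k][c].
    """
    T, d, K = len(x), len(x[0]), len(w)
    y = [[0.0] * d for _ in range(T)]
    for t in range(T):
        # Only taps that reach back to a real position contribute (left zero padding).
        for k in range(min(K, t + 1)):
            for c in range(d):
                y[t][c] += w[k][c] * x[t - k][c]
    return y


def dynamic_short_conv(x, proj):
    """Causal depthwise convolution whose filter is recomputed at each position.

    proj: K x d x d weights of a hypothetical filter generator (an assumption,
    not the paper's method). The filter used at position t is
        w_t[k][c] = sum_j proj[k][c][j] * x[t][j]
    so it depends on the token at position t.
    """
    T, d, K = len(x), len(x[0]), len(proj)
    y = [[0.0] * d for _ in range(T)]
    for t in range(T):
        # Generate this position's filter from the current token.
        w_t = [
            [sum(proj[k][c][j] * x[t][j] for j in range(d)) for c in range(d)]
            for k in range(K)
        ]
        # Apply it to the same short causal window as the static version.
        for k in range(min(K, t + 1)):
            for c in range(d):
                y[t][c] += w_t[k][c] * x[t - k][c]
    return y


if __name__ == "__main__":
    x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]           # T = 3 tokens, d = 2 channels
    w = [[1.0, 1.0], [0.5, 0.5]]                         # K = 2 fixed taps
    proj = [[[1.0, 0.0], [0.0, 1.0]],                    # tap 0: filter equals the token
            [[0.0, 1.0], [1.0, 0.0]]]                    # tap 1: filter swaps the channels
    print(static_short_conv(x, w))
    print(dynamic_short_conv(x, proj))
```

Both functions look at the same window of K recent positions, which is the locality bias the abstract refers to. The only difference is that the dynamic version rebuilds its taps from the input at every step, so in this sketch its output is no longer linear in x.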
