Depth-Attention: Cross-Layer Value Mixing for Language Models

ArXiv CS.CL | 2026-06-04 | NEWS | en | Authors: Boyi Zeng, Yiqin Hao, Zitong Wang, Shixiang Song, He Li, Feichen Song, Yifan Liu, Ziwei He, Xinbing Wang, Zhouhan Lin

Details

Source: ArXiv CS.CL
Authors: Boyi Zeng, Yiqin Hao, Zitong Wang, Shixiang Song, He Li, Feichen Song, Yifan Liu, Ziwei He, Xinbing Wang, Zhouhan Lin
Article type: NEWS
Language: en
Published: 2026-06-04

Original text

Abstract

arXiv:2606.05014v1. Announce type: new. Abstract: Self-attention selects information freely across the sequence, but across depth, Transformers merely add each layer's output to the residual stream, so later layers cannot selectively reuse earlier-layer representations. Recent cross-layer methods improve this flow but operate on hidden states outside attention, adding state beyond the key-value cache at inference, a cost that becomes increasingly salient as modern LLMs compress the cache with grouped-query and multi-head latent attention. We introduce Depth-Attention, which performs this selection inside the attention module itself: before a layer attends over the sequence, its query attends over the keys of earlier layers at the same token position and mixes their values into the value that self-attention then reads.
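
To make the mixing step in the abstract concrete, here is a minimal sketch for one attention head at one token position. It is an illustration under stated assumptions, not the authors' implementation: the function name `depth_mix_value`, the 1/sqrt(d) score scaling, the softmax normalization, and the choice to pass the current layer's own key and value alongside those of earlier layers are all hypothetical, since the abstract does not specify them.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def depth_mix_value(query, layer_keys, layer_values):
    """Mix per-layer values at a single token position.

    query        : the current layer's query vector (length d)
    layer_keys   : one key vector per layer, same token position
    layer_values : one value vector per layer, same token position

    Returns the mixed value and the depth-wise attention weights.
    The 1/sqrt(d) scaling and softmax are assumptions of this sketch.
    """
    d = len(query)
    # Score the query against each layer's key (scaled dot product).
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in layer_keys
    ]
    weights = softmax(scores)
    # Weighted sum of the per-layer values, one output dimension at a time.
    dim_v = len(layer_values[0])
    mixed = [
        sum(w * value[i] for w, value in zip(weights, layer_values))
        for i in range(dim_v)
    ]
    return mixed, weights


# Toy example: three layers, two-dimensional keys and values.
query = [1.0, 0.0]
layer_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
layer_values = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
mixed, weights = depth_mix_value(query, layer_keys, layer_values)
print(weights, mixed)
```

In this toy setup the first and third layers have keys aligned with the query, so they receive equal and larger weights than the second layer. In a full model, the mixed value would take the place of the layer's ordinary value vector before sequence-wise self-attention reads it, as the abstract describes.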
