Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet · Event

PRODUCT_LAUNCH · 2026-05-29 · Impact: MEDIUM

arXiv:2605.29358v1 · Announce Type: new · Abstract: We demonstrate that sparse autoencoders can extract interpretable features from Claude 3 Sonnet, a production-scale language model, addressing the open question of whether dictionary learning methods scale beyond small transformers. We trained sparse autoencoders with up to 34 million features on the model's middle layer residual stream, using scaling laws to guide hyp
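
For readers unfamiliar with the technique named in the abstract, the following is a minimal sketch of a sparse autoencoder forward pass and training loss. It assumes a generic setup (ReLU encoder, linear decoder, squared reconstruction error plus an L1 sparsity penalty) and is not taken from the paper; the function names, the toy weights and the `l1_coeff` value are all illustrative assumptions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode an activation vector into sparse features, then reconstruct it.

    w_enc has shape (n_features, d_model); w_dec has shape (d_model, n_features).
    """
    pre_activations = [a + b for a, b in zip(matvec(w_enc, x), b_enc)]
    features = [max(0.0, p) for p in pre_activations]  # ReLU keeps features non-negative
    reconstruction = [a + b for a, b in zip(matvec(w_dec, features), b_dec)]
    return features, reconstruction


def sae_loss(x, features, reconstruction, l1_coeff):
    """Squared reconstruction error plus an L1 penalty that encourages sparsity."""
    reconstruction_error = sum((a - b) ** 2 for a, b in zip(x, reconstruction))
    sparsity_penalty = sum(abs(f) for f in features)
    return reconstruction_error + l1_coeff * sparsity_penalty


# Toy example (hypothetical weights): a 2-dimensional "residual stream"
# vector encoded into 4 features, one per signed axis direction.
x = [1.0, -2.0]
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
b_dec = [0.0, 0.0]

features, reconstruction = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
loss = sae_loss(x, features, reconstruction, l1_coeff=0.1)
```

In this toy case only two of the four features are active and the input is reconstructed exactly, so the loss consists solely of the sparsity term. A real dictionary would be learned by gradient descent over many activation vectors and would have far more features than input dimensions.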

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet · Related coverage