Building Better Activation Oracles 文章

ArXiv CS.AI2026-06-03NEWSen作者: Jan Bauer, Celeste De Schamphelaere, Adam Karvonen, Niclas Luick, Neel Nanda

摘要

arXiv:2606.02609v1 Announce Type: cross Abstract: Activation Oracles (AOs) are promising methods for interpreting residual stream activations. However, current AOs face important issues, such as hallucinations and vagueness. Additionally, text-inversion confounds make them hard to evaluate. To this end, we improve the Activation Oracle (AO) training regime in four ways: training on on-policy rollouts, improving the conversational dataset, feeding more layers and an improvement to the injection formula. The capability improvements are marginal, but quality of life improvements are quite substantial. In addition, we open source the first comprehensive evaluation suite for AO quality, which we call AObench. Overall, we hope that our work sets a foundation that helps improve AOs and other models in the paradigm of scalable, end-to-end interpretability.

相关事件查看全部 (2)

Building Better Activation Oracles
2026-06-03OPEN_SOURCE影响: MEDIUM
Building Better Activation Oracles
2026-06-03PRODUCT_LAUNCH影响: MEDIUM

相关公司

暂无数据

相关人物

暂无数据