LLM Self-Recognition: Steering and Retrieving Activation Signatures

arXiv cs.AI, 2026-06-06. Authors: Thibaud Ardoin, Jonas Schäfer, Gerhard Wunder

Abstract

arXiv:2606.06315v1

Recent advances in interpretability suggest that large language models (LLMs) implicitly encode signals in their generated text that enable self-recognition of their outputs. We demonstrate that this capability is reliable, even in low-entropy scenarios, and that it can be amplified through targeted intervention. By steering the internal residual stream during generation with a random sparse vector, we create a detectable fingerprint that enables attribution of a given text to a specific LLM. This signal is recoverable from the activations of an LLM used as a detector, achieving over 98% accuracy across multiple detection settings while preserving the quality of generated text. As AI-generated content proliferates, this approach offers a practical alternative to traditional detectors by leveraging the model's natural representation structure for attribution rather than embedding a signal externally.
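The steering idea in the abstract can be illustrated with a toy numerical sketch. This is not the authors' implementation: the abstract gives no layer, sparsity level, or steering strength, so `dim`, `n_active`, and `alpha` below are assumed values, and the helper names are hypothetical. Random Gaussian arrays stand in for residual-stream activations, and the signature is read back by projecting the same activations onto the vector. In the paper, by contrast, the signal is recovered from the activations of a detector LLM that reads the generated text, a step this sketch does not model.

```python
import numpy as np

def make_sparse_vector(dim, n_active, rng):
    """Random sparse unit vector: n_active nonzero coordinates with random signs."""
    v = np.zeros(dim)
    idx = rng.choice(dim, size=n_active, replace=False)
    v[idx] = rng.choice([-1.0, 1.0], size=n_active)
    return v / np.linalg.norm(v)

def steer(hidden, v, alpha):
    """Add alpha * v to the hidden state at every token position."""
    return hidden + alpha * v

def signature_score(hidden, v):
    """Mean projection of the hidden states onto the signature vector."""
    return float((hidden @ v).mean())

rng = np.random.default_rng(0)
dim, n_tokens = 512, 64          # assumed sizes, for illustration only
alpha = 4.0                      # assumed steering strength

v = make_sparse_vector(dim, n_active=16, rng=rng)

# Gaussian noise as a stand-in for unsteered residual-stream activations.
plain = rng.standard_normal((n_tokens, dim))
steered = steer(plain, v, alpha)

plain_score = signature_score(plain, v)
steered_score = signature_score(steered, v)

# A simple threshold halfway between the two expected scores (0 and alpha).
threshold = alpha / 2
is_flagged = steered_score > threshold
```

Because `v` has unit norm, steering shifts the mean projection by exactly `alpha`, while the unsteered projection averages close to zero. This separation is what makes a threshold test workable in the toy setting. Whether a comparable margin survives the round trip through generated text and a detector model is the empirical question the paper addresses.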
