From Leaky Thoughts to Private Reasoning: Controlling What LRMs Say to Themselves

arXiv cs.CL, 2026-06-01. Authors: Haritz Puerto, Haonan Li, Xudong Han, Timothy Baldwin, Iryna Gurevych

Abstract

arXiv:2602.24210v2

Large reasoning models (LRMs) produce reasoning traces (RTs) that often contain sensitive information. These leaky thoughts are difficult to control and frequently violate explicit privacy directives. Because RTs can be exposed through prompt injection attacks, such leaks pose a direct privacy risk to the user. We approach this as a controllability problem: since privacy directives are themselves instructions, improving instruction-following (IF) within the RT provides a direct path to reducing privacy leaks. To this end, we introduce an SFT dataset that teaches models to follow general instructions throughout their reasoning process, and we propose Staged Decoding, a simple decoding strategy that decouples RT generation from answer generation by using a separate LoRA adapter for each, so that IF can be maximized for each component independently. We evaluate our approach on six models from two families (1.7B-14B parameters), across two IF benchmarks and two privacy benchmarks.
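
The abstract describes Staged Decoding only at a high level, so the following is a minimal sketch of the control flow it implies, not the authors' implementation. It assumes the RT is delimited by `<think>` and `</think>` tags and models each LoRA adapter as a plain callable from context to continuation; `staged_decode`, `StagedOutput`, and the two toy adapters are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable

# Assumption: the reasoning trace is wrapped in these tags. The abstract
# does not specify the delimiter format.
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"

# Hypothetical stand-in for "base model + one LoRA adapter":
# takes the text so far and returns a continuation.
Generator = Callable[[str], str]


@dataclass
class StagedOutput:
    reasoning: str
    answer: str


def staged_decode(
    prompt: str,
    generate_reasoning: Generator,
    generate_answer: Generator,
) -> StagedOutput:
    """Generate the RT with one generator and the answer with another."""
    # Stage 1: the reasoning generator continues from an open think tag.
    rt_context = f"{prompt}\n{THINK_OPEN}\n"
    raw = generate_reasoning(rt_context)
    # Keep only the text before the closing tag, in case stage 1 ran past it.
    reasoning = raw.split(THINK_CLOSE, 1)[0].strip()

    # Stage 2: the answer generator sees the prompt plus the finished RT.
    answer_context = f"{rt_context}{reasoning}\n{THINK_CLOSE}\n"
    answer = generate_answer(answer_context).strip()
    return StagedOutput(reasoning=reasoning, answer=answer)


# Toy generators so the sketch runs without any model weights.
def toy_reasoning_adapter(context: str) -> str:
    return "The user asks for a sum; 2 + 3 = 5.</think> stray tokens"


def toy_answer_adapter(context: str) -> str:
    # The answer stage must receive a closed reasoning trace.
    assert THINK_CLOSE in context
    return " 5 "


result = staged_decode("What is 2 + 3?", toy_reasoning_adapter, toy_answer_adapter)
print(result.reasoning)
print(result.answer)
```

In a real setup the two callables would wrap the same base model with a different adapter active for each stage; passing them in as functions keeps the staging logic independent of any particular adapter library.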
