Beyond External Monitors: Enhancing Transparency of Large Language Models for Easier Monitoring 文章

ArXiv CS.CV2026-05-28NEWSen作者: Guanxu Chen, Jing Shao, Tao Luo, Lijie Hu, Qihao Lin, Dongrui Liu

摘要

arXiv:2502.05242v3 Announce Type: replace-cross Abstract: Large language models (LLMs) are becoming increasingly capable, but the mechanisms of their thinking and decision-making processes remain unclear. Chain-of-thoughts (CoTs) have been commonly utilized to externalize LLMs' thinking, but this strategy fails to accurately reflect LLMs' thinking process. Techniques based on LLMs' hidden representations provide an inner perspective to improve the monitorability of their latent thinking. However, previous methods only try to develop external modules instead of making LLMs themselves easier to monitor. In this paper, we propose a novel method, TELLME, improving the transparency of LLMs and helping monitors identify unsuitable and sensitive behaviors. Furthermore, we showcase the effectiveness of TELLME on detoxification tasks, where LLMs achieve consistent improvement among multimodal test sets, distinct architectures, and varying parameter scales.

Beyond External Monitors: Enhancing Transparency of Large Language Models for Easier Monitoring 文章

摘要

相关事件查看全部 (1)

相关公司

相关人物

相关产品

相关技术查看全部 (4)