Dual Mechanisms of Value Expression: Intrinsic vs. Prompted Values in Large Language Models

ArXiv CS.CL | 2026-06-01 | NEWS | en | Authors: Jongwook Han, Jongwon Lim, Injin Kong, Yohan Jo

Details

Source site: ArXiv CS.CL
Authors: Jongwook Han, Jongwon Lim, Injin Kong, Yohan Jo
Article type: NEWS
Language: en
Publication date: 2026-06-01

Abstract

arXiv:2509.24319v4 (announce type: replace)

Large language models can express values in two main ways: (1) intrinsic expression, reflecting the model's inherent values learned during training, and (2) prompted expression, elicited by explicit prompts. Given their widespread use in value alignment, it is paramount to clearly understand their underlying mechanisms, particularly whether they mostly overlap (as one might expect) or rely on distinct mechanisms. We analyze this largely understudied problem at the mechanistic level using two approaches: (1) value vectors, feature directions representing value mechanisms extracted from the residual stream, and (2) value neurons, MLP neurons that contribute to value vectors. We demonstrate that intrinsic and prompted value mechanisms partly share common components crucial for inducing value expression, generalizing across languages and reconstructing theoretical inter-value correlations in the model's internal representations.
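To make the two objects named in the abstract more concrete, the following is a minimal sketch on synthetic data. It is not the authors' procedure: it assumes a value vector can be approximated by a difference of mean residual-stream activations between value-expressing and neutral inputs, and that a neuron's contribution can be scored by the cosine similarity between its MLP output weights and that direction. The function names `value_vector` and `top_value_neurons`, the array shapes, and the planted neuron are all hypothetical.

```python
import numpy as np

def value_vector(acts_with_value, acts_without_value):
    """Assumed estimator: unit-normalized difference of mean activations."""
    diff = acts_with_value.mean(axis=0) - acts_without_value.mean(axis=0)
    return diff / np.linalg.norm(diff)

def top_value_neurons(w_out, direction, k):
    """Assumed scoring: cosine similarity of each neuron's output weights
    (rows of w_out) with the direction; returns top-k indices and all scores."""
    norms = np.linalg.norm(w_out, axis=1)
    scores = (w_out @ direction) / norms
    return np.argsort(-scores)[:k], scores

# Synthetic stand-ins for residual-stream activations (no real model involved).
rng = np.random.default_rng(0)
d_model, n_neurons, n_samples = 32, 64, 200

true_dir = rng.normal(size=d_model)
true_dir /= np.linalg.norm(true_dir)

acts_neutral = rng.normal(size=(n_samples, d_model))
acts_value = rng.normal(size=(n_samples, d_model)) + 3.0 * true_dir

v = value_vector(acts_value, acts_neutral)

# Hypothetical MLP output weights, with neuron 7 planted along the direction.
w_out = rng.normal(size=(n_neurons, d_model))
w_out[7] = 5.0 * true_dir

top, scores = top_value_neurons(w_out, v, k=3)
print("recovered/true cosine:", float(v @ true_dir))
print("top neurons:", top.tolist())
```

With real models, the activations would come from a chosen layer's residual stream and `w_out` from that layer's MLP down-projection. Under the assumptions above, comparing directions estimated from intrinsic and from prompted settings is one way to probe how much the two mechanisms overlap.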
