Cultural Value Alignment Via Latent Activation Steering in Large Language Models 文章

ArXiv CS.CL2026-05-27NEWSen作者: Trung Duc Anh Dang, Sarah Masud

摘要

arXiv:2605.26365v1 Announce Type: new Abstract: Large Language Models (LLMs) often exhibit homogenized cultural perspectives. While the World Values Survey (WVS) provides a gold standard for mapping human values, traditional direct prompting of LLMs on WVS often fails to access the model's latent cultural depth, leading to safety-aligned refusals or neutral responses. Here, we propose a generalizable framework for cultural evaluation and intervention that transitions from abstract queries to scenario-based behavioral probing. By extracting implicit token probabilities across 300 situational dilemmas, we bypass surface-level alignment to map the latent coordinates of LLMs cultural value. We further introduce activation steering to shift these internal alignments during the forward pass without retraining.