Steering Language Models Before They Speak: Logit-Level Interventions

ArXiv CS.CL, 2026-05-29. Authors: Hyeseon An, Shinwoo Park, Hyundong Jin, Yo-Sub Han

Abstract

arXiv:2601.10960v2

Controllable generation requires language models to realize output characteristics such as reading level, politeness, and toxicity. Existing steering methods are often indirect, require access to internal activations, or depend on auxiliary trained models. We propose SWAI, a training-free inference-time method that addresses these limitations by steering directly in logit space using corpus-derived token statistics. SWAI computes z-normalized one-vs-rest log-odds scores from labeled corpora and biases high-scoring tokens only within the model's top-K candidate set, allowing control to favor target-characteristic tokens while preserving contextually plausible choices. Across readability, politeness, and toxicity control, SWAI consistently improves over prompt-based and prior logit-level baselines without modifying model parameters, accessing internal layers, or training an auxiliary model.
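The two steps named in the abstract (corpus-level scoring, then top-K-restricted biasing) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the abstract does not give the exact formulas, so the additive smoothing, the score threshold, the additive bias `alpha * z`, and the function names `log_odds_z_scores` and `steer_logits` are all assumptions made for illustration.

```python
import math
from collections import Counter


def log_odds_z_scores(target_tokens, rest_tokens, smoothing=1.0):
    """Z-normalized one-vs-rest log-odds per token.

    Assumed formula (not taken from the paper): smoothed token probabilities
    in the target corpus vs. all other corpora, difference of logits,
    then standardized across the vocabulary.
    """
    vocab = sorted(set(target_tokens) | set(rest_tokens))
    if len(vocab) < 2:
        raise ValueError("need at least two distinct tokens")
    t_counts, r_counts = Counter(target_tokens), Counter(rest_tokens)
    t_total = len(target_tokens) + smoothing * len(vocab)
    r_total = len(rest_tokens) + smoothing * len(vocab)

    raw = {}
    for tok in vocab:
        p_t = (t_counts[tok] + smoothing) / t_total
        p_r = (r_counts[tok] + smoothing) / r_total
        raw[tok] = math.log(p_t / (1 - p_t)) - math.log(p_r / (1 - p_r))

    mean = sum(raw.values()) / len(raw)
    var = sum((v - mean) ** 2 for v in raw.values()) / len(raw)
    std = math.sqrt(var) or 1.0  # avoid division by zero if all scores match
    return {tok: (v - mean) / std for tok, v in raw.items()}


def steer_logits(logits, scores, top_k=3, threshold=1.0, alpha=2.0):
    """Boost high-scoring tokens, but only inside the model's top-K set.

    `threshold` and `alpha` are hypothetical knobs: a token is boosted by
    alpha * z when its z-score exceeds the threshold and it is already
    among the K highest-logit candidates.
    """
    top = sorted(logits, key=logits.get, reverse=True)[:top_k]
    steered = dict(logits)
    for tok in top:
        z = scores.get(tok, 0.0)
        if z > threshold:
            steered[tok] += alpha * z
    return steered


if __name__ == "__main__":
    # Toy corpora standing in for labeled "polite" vs. "everything else".
    polite = "please kindly thank you please".split()
    other = "shut up now you now".split()
    z = log_odds_z_scores(polite, other)

    # Toy next-token logits; "kindly" is far outside the top-3.
    logits = {"please": 1.0, "now": 1.2, "you": 0.5, "kindly": -5.0}
    print(steer_logits(logits, z, top_k=3, threshold=0.5))
```

In this toy run, "please" is boosted because it scores high and is already a plausible candidate, while "kindly" is left untouched despite its positive score because it falls outside the top-K set. That restriction is what the abstract describes as preserving contextually plausible choices.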
