Low-Resource Safety Failures Are Action Failures, Not Representation Failures

arXiv cs.CL | 2026-06-02 | Authors: Rashad Aziz, Ikhlasul Akmal Hanif, Fajri Koto

Abstract

arXiv:2606.01196v1. Safety alignment learned in high-resource languages transfers poorly to low-resource languages: models refuse harmful prompts in English but fail to refuse the same prompts translated into Swahili or Burmese. Adaptive steering methods such as AdaSteer and CAST inherit this failure cross-lingually. We diagnose where transfer breaks down. Across Qwen2.5-7B, Gemma-2-9B, and Llama-3.1-8B on 23 languages, the harmfulness direction extracted from high-resource activations linearly separates harmful from harmless low-resource prompts nearly as well as it separates high-resource ones. The relevant representation is present. Yet the refusal rate on harmful prompts drops from 87.9% to 43.9%: the model fails to convert the representation into refusal. What fails to transfer is the calibration of the safety decision, not the underlying representation.
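The distinction between a representation failure and a calibration failure can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' method: the activations are synthetic Gaussians rather than real model hidden states, the harmfulness direction is taken as a difference of class means, the low-resource language is modelled as a uniform shift along that direction, and the "refusal decision" is a fixed threshold fitted on high-resource data. Under those assumptions, separability stays high while the refusal rate on harmful prompts falls.

```python
import numpy as np


def harmfulness_direction(harmful, harmless):
    """Unit-norm difference of class means (one common way to extract a direction)."""
    d = harmful.mean(axis=0) - harmless.mean(axis=0)
    return d / np.linalg.norm(d)


def auroc(scores_pos, scores_neg):
    """Probability that a random positive outscores a random negative (ties count half)."""
    diff = scores_pos[:, None] - scores_neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())


def make_activations(rng, n, dim, axis, class_offset, language_shift=0.0):
    """Synthetic stand-in for hidden states: unit Gaussian noise plus offsets along `axis`."""
    return rng.normal(size=(n, dim)) + (class_offset + language_shift) * axis


rng = np.random.default_rng(0)
dim, n = 64, 500
axis = np.zeros(dim)
axis[0] = 1.0  # hypothetical "true" harmfulness axis

# High-resource prompts: classes sit symmetrically around zero.
hi_harmful = make_activations(rng, n, dim, axis, +3.0)
hi_harmless = make_activations(rng, n, dim, axis, -3.0)

# Low-resource prompts (assumption): same class gap, whole distribution shifted.
lo_harmful = make_activations(rng, n, dim, axis, +3.0, language_shift=-3.0)
lo_harmless = make_activations(rng, n, dim, axis, -3.0, language_shift=-3.0)

# Direction and decision threshold are both fitted on high-resource data only.
direction = harmfulness_direction(hi_harmful, hi_harmless)
threshold = 0.5 * ((hi_harmful @ direction).mean() + (hi_harmless @ direction).mean())

# Representation check: does the direction still separate the classes?
hi_auroc = auroc(hi_harmful @ direction, hi_harmless @ direction)
lo_auroc = auroc(lo_harmful @ direction, lo_harmless @ direction)

# Action check: how often does the fixed threshold trigger on harmful prompts?
hi_refusal = float(((hi_harmful @ direction) > threshold).mean())
lo_refusal = float(((lo_harmful @ direction) > threshold).mean())

print(f"AUROC    high-resource={hi_auroc:.3f}  low-resource={lo_auroc:.3f}")
print(f"Refusal  high-resource={hi_refusal:.3f}  low-resource={lo_refusal:.3f}")
```

In this toy setting the AUROC is a threshold-free measure of whether the representation is present, while the refusal rate depends on where the decision boundary sits. Re-fitting only the threshold on low-resource data would restore refusal here, which is the sense in which the failure is one of calibration; whether real models behave this simply is not something the sketch can show.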
