CroCo: Cross-Lingual Contrastive Preference Tuning on Self-Generations

arXiv cs.CL | 2026-05-27 | Authors: Mike Zhang, Ali Basirat, Desmond Elliott

Abstract

arXiv:2605.26293v1. Announce type: new.

Prior work establishes that controlled contrastiveness between self-generated responses from large language models, set via reward scores, improves downstream preference tuning in English. We extend this method to multiple languages and evaluate two models across a total of 14 high- and low-resource languages on a diverse set of tasks. Our central finding is that cross-lingual contrastive preference tuning on self-generations (CroCo) transfers without language-specific preference annotation. A reward model trained on English preferences (atop a multilingual base) produces useful within-language rankings across most languages, and pairing in either a monolingual or a multilingual setting improves over each model in the majority of setups while avoiding the catastrophic forgetting seen with supervised fine-tuning. We observe that the gains require on-policy data.
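As an illustration of what reward-controlled pairing of self-generations could look like, here is a minimal Python sketch. It is not the authors' implementation: the `Candidate` type, the `build_pair` and `pair_within_language` helpers, and the `target_gap` parameter are all hypothetical names, and the selection rule (top-scored response as "chosen", and as "rejected" the response whose reward gap is closest to a target) is only one plausible reading of "controlled contrastiveness". Reward scores are assumed to come from an external reward model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Candidate:
    """One self-generated response (hypothetical container)."""
    prompt_id: str
    lang: str
    text: str
    reward: float  # assumed to be assigned by a reward model


def build_pair(candidates, target_gap):
    """Return (chosen, rejected), or None if fewer than two candidates.

    Assumed rule: chosen is the highest-reward response; rejected is the
    response whose reward gap to chosen is closest to target_gap.
    """
    if len(candidates) < 2:
        return None
    ranked = sorted(candidates, key=lambda c: c.reward, reverse=True)
    chosen = ranked[0]
    rejected = min(
        ranked[1:],
        key=lambda c: abs((chosen.reward - c.reward) - target_gap),
    )
    return chosen, rejected


def pair_within_language(candidates, target_gap):
    """Group by (prompt, language) so rankings stay within one language."""
    groups = defaultdict(list)
    for c in candidates:
        groups[(c.prompt_id, c.lang)].append(c)
    pairs = {}
    for key, group in groups.items():
        pair = build_pair(group, target_gap)
        if pair is not None:
            pairs[key] = pair
    return pairs


if __name__ == "__main__":
    pool = [
        Candidate("p1", "da", "svar A", 2.0),
        Candidate("p1", "da", "svar B", 1.5),
        Candidate("p1", "da", "svar C", 1.0),
        Candidate("p1", "da", "svar D", 0.0),
    ]
    chosen, rejected = pair_within_language(pool, target_gap=1.0)[("p1", "da")]
    print(chosen.text, "vs", rejected.text)
```

Under these assumptions, a larger `target_gap` yields more strongly contrasting pairs, and the resulting (chosen, rejected) tuples could feed any standard preference-tuning objective.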