Recovering Diversity Without Losing Alignment: A DPO Recipe for Post-Trained LLMs

ArXiv CS.CL | 2026-05-29 | Authors: Vinay Samuel, Yapei Chang, Mohit Iyyer

Abstract

arXiv:2605.30021v1 (new submission)

Many open-ended instructions have multiple valid answers that users can benefit from seeing, but post-training often narrows an LLM's output space toward a small set of canonical responses. We introduce REDIPO, an offline DPO data-construction pipeline for recovering distinct valid answer modes while preserving the alignment benefits of the instruct model. For each prompt, REDIPO samples responses from both base and instruct models, rewrites base-model responses with the instruct model, filters candidates for safety and instruction-following quality, and builds preference pairs that favor marginally diverse responses among candidates with similar instruction-following reward. Across Qwen3-4B, OLMo-3-7B, and LLaMA-3.1-8B, REDIPO improves NoveltyBench distinct_k by 134%, 33%, and 44% relative to the instruct checkpoints, while DivPO changes diversity by 0%, -6%, and -4% on the same models.
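The final step of the pipeline, building preference pairs that favor more diverse responses among candidates with similar reward, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not specify the diversity measure, the reward model, or any thresholds, so everything named here is an assumption made for illustration: the `Candidate` type, the token-set Jaccard distance used as a stand-in for diversity, the mean-distance definition of marginal diversity, and the `reward_tol` and `min_gap` parameters. The sketch also assumes candidates have already been sampled, rewritten, and filtered upstream.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Candidate:
    text: str
    reward: float  # assumed: instruction-following score from some external scorer


def jaccard_distance(a: str, b: str) -> float:
    """Token-set Jaccard distance (an assumed stand-in for the diversity measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def marginal_diversity(idx: int, pool: list[Candidate]) -> float:
    """Mean distance from pool[idx] to every other candidate (hypothetical definition)."""
    others = [c for j, c in enumerate(pool) if j != idx]
    if not others:
        return 0.0
    return sum(jaccard_distance(pool[idx].text, o.text) for o in others) / len(others)


def build_pairs(
    pool: list[Candidate],
    reward_tol: float = 0.1,  # hypothetical: max reward gap to count as "similar"
    min_gap: float = 0.05,    # hypothetical: min diversity gap to emit a pair
) -> list[tuple[str, str]]:
    """Return (chosen, rejected) text pairs for one prompt.

    Two candidates are paired only if their rewards are within reward_tol;
    the one with higher marginal diversity becomes the chosen response.
    """
    div = [marginal_diversity(i, pool) for i in range(len(pool))]
    pairs = []
    for i, j in combinations(range(len(pool)), 2):
        if abs(pool[i].reward - pool[j].reward) > reward_tol:
            continue  # rewards differ too much; diversity should not decide this pair
        if abs(div[i] - div[j]) < min_gap:
            continue  # diversity is nearly equal; no clear preference signal
        chosen, rejected = (i, j) if div[i] > div[j] else (j, i)
        pairs.append((pool[chosen].text, pool[rejected].text))
    return pairs
```

Gating on reward similarity before comparing diversity is what keeps such a scheme from rewarding responses that are merely different but worse. The resulting (chosen, rejected) pairs would then be passed, together with the prompt, to an offline DPO trainer.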
