Quality-constrained Entropy Maximization Policy Optimization for LLM Diversity

ArXiv CS.CL, 2026-05-28
Authors: Haihui Pan, Yuzhong Hong, Kaichen Zhang, Shaoke Lv, Junwei Bao, Hongfei Jiang, Yang Song

Abstract

arXiv:2602.15894v2

In many large language model (LLM) alignment applications, users expect not only high-quality outputs but also substantial diversity. However, existing methods often face a fundamental trade-off between these objectives: approaches that improve output quality tend to reduce diversity, while methods that increase diversity often do so at the expense of quality. In this work, we propose Quality-constrained Entropy Maximization Policy Optimization (QEMPO), a novel framework that enhances the diversity of LLM outputs while explicitly preserving output quality. QEMPO is grounded in a strong theoretical foundation: we derive a closed-form analytical solution that provably maximizes entropy, a principled measure of diversity, subject to a quality constraint, with guarantees on optimality under the defined objective. Leveraging this solution, QEMPO naturally supports both online and offline training settings.
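The abstract does not spell out the objective or its closed-form solution, so the following is only a generic textbook illustration of entropy maximization under a quality constraint, not QEMPO itself. It assumes a small discrete set of candidate outputs with hypothetical quality scores and a hypothetical floor on expected quality. Under those assumptions, the maximizer is known to have the exponential (Gibbs) form p_i ∝ exp(λ · score_i), and the sketch finds λ by bisection. All function names and the numbers in the usage note are illustrative.

```python
import math

def gibbs(scores, lam):
    """Distribution with p_i proportional to exp(lam * score_i)."""
    m = max(lam * s for s in scores)  # subtract the max for numerical stability
    weights = [math.exp(lam * s - m) for s in scores]
    z = sum(weights)
    return [w / z for w in weights]

def expected(scores, probs):
    """Expected quality score under the distribution."""
    return sum(s * p for s, p in zip(scores, probs))

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def max_entropy_with_quality_floor(scores, floor, lam_hi=1e3, iters=100):
    """Maximize entropy subject to expected score >= floor (illustrative only).

    lam_hi is an assumed upper bound on the multiplier; a floor very close
    to max(scores) may need a larger value.
    """
    if floor > max(scores):
        raise ValueError("quality floor is unattainable")
    uniform = gibbs(scores, 0.0)
    # The uniform distribution has maximal entropy; use it if it is feasible.
    if expected(scores, uniform) >= floor:
        return uniform
    # Expected score is non-decreasing in lam, so bisect for the smallest
    # multiplier that meets the floor.
    lo, hi = 0.0, lam_hi
    for _ in range(iters):
        mid = (lo + hi) / 2
        if expected(scores, gibbs(scores, mid)) < floor:
            lo = mid
        else:
            hi = mid
    return gibbs(scores, hi)
```

For example, with hypothetical scores `[0.2, 0.5, 0.9]` and a floor of `0.7`, the returned distribution meets the floor while keeping nonzero probability on every candidate; with a floor of `0.4` the uniform distribution is already feasible and is returned unchanged.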
