Bilevel Optimization of Synthetic Trajectories for Multi-Turn LLM Fine-Tuning

ArXiv CS.AI, 2026-05-26
Authors: Shresth Verma, Mauricio Tec, Cheol Woo Kim, Kai Wang, Milind Tambe

Abstract

arXiv:2605.24743v1 (announce type: cross)

While LLMs excel at single-turn generation, they struggle with long-horizon, multi-turn interactions. Offline reinforcement learning (RL) offers a scalable approach, yet its performance hinges on the availability and quality of multi-turn trajectory data. A common remedy is to augment training with synthetic trajectories generated by LLMs or simulators, but synthetic data is highly heterogeneous in quality, and naively treating all trajectories as equally informative can degrade performance. We propose BOOST, a bilevel optimization framework where the inner level trains the LLM on reweighted data and the outer level trains a lightweight reweighting head on held-out real validation tasks, assigning continuous trajectory-level weights without requiring an external judge.
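To make the inner/outer split concrete, here is a minimal toy sketch of bilevel data reweighting in plain Python. It is not the BOOST implementation: the abstract does not say how the outer gradient is computed, so this sketch assumes a one-step unrolled hypergradient. A scalar linear model `theta` stands in for the LLM, a free logit per synthetic example (passed through a sigmoid) stands in for the reweighting head, and the names `bilevel_reweight`, `SYNTHETIC`, and `VALIDATION` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bilevel_reweight(synthetic, validation, inner_lr=0.1, outer_lr=5.0, steps=200):
    """Toy bilevel loop (illustrative assumption, not the paper's algorithm).

    Inner level: one gradient step on the weighted squared error of y = theta * x.
    Outer level: move each weight logit along the one-step unrolled gradient
    of the validation loss with respect to that example's weight.
    """
    theta = 0.0
    logits = [0.0] * len(synthetic)  # one logit per synthetic example
    n = len(synthetic)
    for _ in range(steps):
        weights = [sigmoid(z) for z in logits]
        # Per-example gradient of 0.5 * (theta * x - y)^2 with respect to theta.
        grads = [(theta * x - y) * x for x, y in synthetic]
        # Inner step: train the model on the reweighted synthetic data.
        theta_next = theta - inner_lr * sum(w * g for w, g in zip(weights, grads)) / n
        # Gradient of the held-out validation loss at the updated model.
        val_grad = sum((theta_next * x - y) * x for x, y in validation) / len(validation)
        # Outer step: chain rule through theta_next = theta - inner_lr * mean(w * g).
        for i in range(n):
            dval_dw = val_grad * (-inner_lr * grads[i] / n)
            logits[i] -= outer_lr * dval_dw * weights[i] * (1.0 - weights[i])
        theta = theta_next
    return theta, [sigmoid(z) for z in logits]


# Hypothetical data: the first three synthetic examples follow y = 2x (useful),
# the last three follow y = -x (misleading). Validation data follows y = 2x.
SYNTHETIC = [(1.0, 2.0), (1.5, 3.0), (2.0, 4.0), (1.0, -1.0), (1.5, -1.5), (2.0, -2.0)]
VALIDATION = [(1.0, 2.0), (2.0, 4.0)]

theta, weights = bilevel_reweight(SYNTHETIC, VALIDATION)
```

In this toy setup, examples whose gradients agree with the validation gradient have their weights pushed up and conflicting ones pushed down, so no external judge is needed to label quality. A real system would replace the scalar model with the fine-tuned LLM and the per-example logits with a learned head over trajectory features.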
