GLASS: GRPO-Trained LoRA for Acoustic Style Steering in Zero-Shot Text-to-Speech 文章

ArXiv CS.CL2026-06-05NEWSen作者: Jaehoon Kang, Yejin Lee, Kyuhong Shim

摘要

arXiv:2606.05889v1 Announce Type: cross Abstract: We propose GLASS, a framework for composable acoustic style control in zero-shot autoregressive text-to-speech (TTS) that learns controls from post-generation rewards rather than style labels. In zero-shot TTS, a speaker prompt often entangles speaker identity with prosodic attributes such as speaking rate and pitch, making it difficult to change style without changing the prompt itself. GLASS instead treats each acoustic attribute as a reward-defined control direction. For each control axis, GLASS freezes the TTS backbone and trains one lightweight LoRA adapter with Group Relative Policy Optimization (GRPO), using speech-token length and mean F0 as style rewards and WER as an intelligibility anchor. Because each control is represented as a LoRA weight update, independently trained adapters can be swapped, interpolated, and composed through linear LoRA arithmetic without retraining the backbone.

GLASS: GRPO-Trained LoRA for Acoustic Style Steering in Zero-Shot Text-to-Speech 文章

摘要

相关事件查看全部 (1)

相关公司

相关人物

相关产品查看全部 (1)

相关技术查看全部 (6)