Unlocking Fine-Grained and Within-Utterance Speaking Style Control in Prompt-Based Text-to-Speech Models (article)

ArXiv CS.CL | 2026-05-28 | NEWS | en | Authors: Jaehoon Kang, Yejin Lee, Yoonji Park, Kyuhong Shim

Details

Source site: ArXiv CS.CL
Authors: Jaehoon Kang, Yejin Lee, Yoonji Park, Kyuhong Shim
Article type: NEWS
Language: en
Publication date: 2026-05-28

Abstract

arXiv:2605.27376v1 (announce type: new)

While prompt-based text-to-speech (TTS) models enable natural language-driven speaking style control, they often provide limited fine-grained control and apply a single global style across an utterance. This restricts practical use cases that require continuous style attribute interpolation across utterances and time-varying style transitions within a single utterance. In this paper, we propose novel techniques to achieve both capabilities in existing prompt-based TTS models. For inter-utterance style interpolation, we compute direction vectors between contrastive style prompts in the embedding space and perform simple interpolation, enabling smooth transitions between style characteristics. For intra-utterance style transition, we first identify a strong attention bias toward early tokens in autoregressive TTS decoders, causing the initial audio realization to dominate subsequent generation.
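The inter-utterance interpolation described in the abstract can be illustrated with a minimal sketch. The abstract states only that a direction vector is computed between contrastive style prompts in the embedding space and that simple interpolation is applied. Everything else here is an assumption made for illustration: the function names, the toy three-dimensional embeddings, and the choice of a linear sweep between the two endpoints. No real prompt encoder or TTS model is involved, and this should not be read as the authors' implementation.

```python
from typing import List, Sequence

Vector = List[float]


def direction_vector(emb_a: Sequence[float], emb_b: Sequence[float]) -> Vector:
    """Direction from style-prompt embedding A to its contrastive partner B."""
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same dimensionality")
    return [b - a for a, b in zip(emb_a, emb_b)]


def shift_style(base: Sequence[float], direction: Sequence[float], alpha: float) -> Vector:
    """Move a base embedding along a style direction by strength alpha."""
    if len(base) != len(direction):
        raise ValueError("base and direction must have the same dimensionality")
    return [x + alpha * d for x, d in zip(base, direction)]


def interpolation_sweep(emb_a: Sequence[float], emb_b: Sequence[float], steps: int) -> List[Vector]:
    """Evenly spaced embeddings from A (alpha = 0) to B (alpha = 1)."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    direction = direction_vector(emb_a, emb_b)
    return [shift_style(emb_a, direction, i / (steps - 1)) for i in range(steps)]


# Hypothetical embeddings standing in for the encoder output of two
# contrastive prompts, e.g. "speaks slowly" vs. "speaks quickly".
slow_prompt_emb = [0.0, 1.0, 2.0]
fast_prompt_emb = [1.0, 3.0, 2.0]

sweep = interpolation_sweep(slow_prompt_emb, fast_prompt_emb, steps=5)
for alpha_index, emb in enumerate(sweep):
    print(alpha_index, emb)
```

In a real system, each embedding in the sweep would be passed to the TTS model in place of the original prompt embedding, one utterance per interpolation step. How the embeddings are extracted and injected depends on the specific model and is not covered by this sketch.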
