---
license: apache-2.0
---

# OV-InstructTTS: Towards Open-Vocabulary Instruct Text-to-Speech

<p align="center">
        🖥️ <a href="https://y-ren16.github.io/OV-InstructTTS">Demo</a>&nbsp;&nbsp;|&nbsp;&nbsp;🤗 <a href="https://huggingface.co/datasets/y-ren16/OVSpeech">Datasets</a>&nbsp;&nbsp;|&nbsp;&nbsp;⚙️ <a href="https://github.com/y-ren16/OV-InstructTTS" class="link-button">Code</a>&nbsp;&nbsp;|&nbsp;&nbsp;🤗 <a href="https://huggingface.co/y-ren16/OV-InstructTTS">Checkpoints</a>
        <!-- |&nbsp;&nbsp;📑 <a href="https://arxiv.org/pdf/2510.00000">Paper</a> -->
</p>

# Introduction


Instruct Text-to-Speech (InstructTTS) leverages natural language descriptions as style prompts to guide speech synthesis. However, existing InstructTTS methods rely mainly on direct combinations of audio-related labels or their rephrasings, making it difficult to handle flexible, high-level instructions. Such rigid control falls short for users such as content creators who wish to steer generation with descriptive instructions. To address these constraints, we introduce OV-InstructTTS, a new paradigm for open-vocabulary InstructTTS. We propose a comprehensive solution comprising a newly curated dataset, OV-Speech, and a novel reasoning-driven framework. The OV-Speech dataset pairs speech with open-vocabulary instructions, each augmented with a reasoning process that connects high-level instructions to acoustic features. The reasoning-driven framework infers emotional, acoustic, and paralinguistic information from open-vocabulary instructions before synthesizing speech. Evaluations show that this reasoning-driven approach significantly improves instruction-following fidelity and speech expressiveness. We believe this work can inspire the next generation of user-friendly InstructTTS systems with stronger generalization and real-world applicability.