Add metadata (pipeline tag, library name) and improve model card content

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +60 -5
README.md CHANGED
@@ -1,25 +1,34 @@
  ---
  license: mit
  ---
  <div align="center">

  <img src="./pics/logo.png" alt="UltraVoice Logo" width="200">

  # UltraVoice: Scaling Fine-Grained Style-Controlled Speech Conversations for Spoken Dialogue Models

- [![arXiv](https://img.shields.io/badge/arXiv-Preprint-b31b1b.svg)](https://arxiv.org/abs/2510.22588)
  [![Project Page](https://img.shields.io/badge/Project-Page-green)](https://bigai-nlco.github.io/UltraVoice)
  [![GitHub](https://img.shields.io/badge/GitHub-Code-black)](https://github.com/bigai-nlco/UltraVoice)
  [![HuggingFace Dataset](https://img.shields.io/badge/πŸ€—%20HuggingFace-Dataset-yellow)](https://huggingface.co/datasets/tutu0604/UltraVoice)
  [![License](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

  </div>

  ---

  ## πŸ“ Abstract

- > Spoken dialogue models currently lack the ability for fine-grained speech style control, a critical capability for human-like interaction that is often overlooked in favor of purely functional capabilities like reasoning and question answering. To address this limitation, we introduce **UltraVoice**, the first large-scale speech dialogue dataset engineered for multiple fine-grained speech style control. Encompassing over 830 hours of speech dialogues, UltraVoice provides instructions across six key speech stylistic dimensions: emotion, speed, volume, accent, language, and composite styles. Fine-tuning leading models such as SLAM-Omni and VocalNet on UltraVoice significantly enhances their fine-grained speech stylistic controllability without degrading core conversational abilities. Specifically, our fine-tuned models achieve improvements of 29.12-42.33% in Mean Opinion Score (MOS) and 14.61-40.09 percentage points in Instruction Following Rate (IFR) on multi-dimensional control tasks. Moreover, on the URO-Bench benchmark, our fine-tuned models demonstrate substantial gains in core understanding, reasoning, and conversational abilities, with average improvements of +10.84% on the Basic setting and +7.87% on the Pro setting. Furthermore, the dataset's utility extends to training controllable Text-to-Speech (TTS) models, underscoring its high quality and broad applicability for expressive speech synthesis.

  ## 🎯 Overview

@@ -29,6 +38,31 @@ license: mit

  **Overview of the UltraVoice Dataset Construction and Stylistic Coverage.** The figure illustrates the complete pipeline and capabilities of UltraVoice: (1) The upper left section presents our four-step construction process: text corpus curation, style injection & response generation, stylized speech synthesis, and quality control & filtering. (2) The ring chart on the right visualizes the dataset's hierarchical control structure, with six main control dimensions in the inner ring (Emotion, Speed, Volume, Accent, Language, Composite) and their finer-grained sub-dimensions in the outer ring. (3) The lower panel showcases representative examples from each speech style dimension, demonstrating UltraVoice's rich stylistic coverage and multi-dimensional controllability, including emotion (e.g., angry, happy), speed (e.g., fast, slow), volume (e.g., high, low), language (e.g., Chinese, Japanese, Korean), accent (e.g., AU, CA, GB, IN, SG, ZA), and composite styles that combine multiple control attributes.

  ---

  ## πŸ€– Available Models
@@ -76,8 +110,30 @@ To verify that fine-tuning on UltraVoice enhances rather than compromises genera

  **Evaluation of our SFT models (upper part) and existing strong baselines (lower part) on URO-Bench (EN).** Und.: Understanding. Conv.: Oral Conversation. Our results confirm that fine-tuning spoken dialogue models on UltraVoice enhances, rather than compromises, general conversational skills. All models showed substantial gains across Understanding, Reasoning, and Oral Conversation, with average improvements of **+10.84%** on the Basic setting and **+7.87%** on the Pro setting. Notably, the VocalNet-7B SFT model achieves state-of-the-art performance, outperforming strong baselines like Qwen2.5-Omni-7B and GLM4-Voice-9B, highlighting practical value beyond style control.

  ---

  ## πŸ“„ License

  These models are licensed under the **MIT License**. See the [LICENSE](https://github.com/bigai-nlco/UltraVoice/blob/main/LICENSE) file for details.
@@ -131,7 +187,7 @@ For questions, issues, or feedback:
  - **Dataset**: [UltraVoice on HuggingFace](https://huggingface.co/datasets/tutu0604/UltraVoice)
  - **Training&Inference Code**: [SLAM-LLM](https://github.com/X-LANCE/SLAM-LLM) | [VocalNet](https://github.com/SJTU-OmniAgent/VocalNet)
  - **Project Page**: [UltraVoice Official Website](https://bigai-nlco.github.io/UltraVoice)
- - **Paper**: [arXiv:2510.22588](https://arxiv.org/abs/2510.22588)

  ---

@@ -141,5 +197,4 @@ For questions, issues, or feedback:

  **🎀 Try our models and share your feedback!**

- </div>
-

  ---
  license: mit
+ pipeline_tag: text-to-speech
+ library_name: transformers
  ---
+
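For reference, the complete front matter after this change reads as follows (a YAML config fragment reconstructed from the diff; no fields beyond those shown are assumed):

```yaml
---
license: mit
pipeline_tag: text-to-speech
library_name: transformers
---
```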
  <div align="center">

  <img src="./pics/logo.png" alt="UltraVoice Logo" width="200">

  # UltraVoice: Scaling Fine-Grained Style-Controlled Speech Conversations for Spoken Dialogue Models

+ [![Paper](https://img.shields.io/badge/Paper-HuggingFace-blue)](https://huggingface.co/papers/2510.22588)
  [![Project Page](https://img.shields.io/badge/Project-Page-green)](https://bigai-nlco.github.io/UltraVoice)
  [![GitHub](https://img.shields.io/badge/GitHub-Code-black)](https://github.com/bigai-nlco/UltraVoice)
  [![HuggingFace Dataset](https://img.shields.io/badge/πŸ€—%20HuggingFace-Dataset-yellow)](https://huggingface.co/datasets/tutu0604/UltraVoice)
+ [![HuggingFace Model](https://img.shields.io/badge/πŸ€—%20HuggingFace-Models-blue)](https://huggingface.co/tutu0604/UltraVoice-SFT)
  [![License](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

  </div>

  ---

+ ## πŸ“’ News
+ - **[2025-10]** We have released the `UltraVoice` dataset and model checkpoints.
+
  ## πŸ“ Abstract

+ Spoken dialogue models currently lack fine-grained speech style control, a critical capability for human-like interaction that is often overlooked in favor of purely functional capabilities such as reasoning and question answering. To address this limitation, we introduce UltraVoice, the first large-scale speech dialogue dataset engineered for fine-grained control over multiple speech styles. Encompassing over 830 hours of speech dialogues, UltraVoice provides instructions across six key speech stylistic dimensions: emotion, speed, volume, accent, language, and composite styles. Fine-tuning leading models such as SLAM-Omni and VocalNet on UltraVoice significantly enhances their fine-grained speech stylistic controllability without degrading core conversational abilities. Specifically, our fine-tuned models achieve improvements of 29.12-42.33% in Mean Opinion Score (MOS) and 14.61-40.09 percentage points in Instruction Following Rate (IFR) on the multi-dimensional control tasks defined in UltraVoice. Moreover, on the URO-Bench benchmark, our fine-tuned models demonstrate substantial gains in core understanding, reasoning, and conversational abilities, with average improvements of +10.84% on the Basic setting and +7.87% on the Pro setting. Furthermore, the dataset's utility extends to training controllable Text-to-Speech (TTS) models, underscoring its high quality and broad applicability for expressive speech synthesis. The complete dataset and model checkpoints are publicly released.
+
+ ---

  ## 🎯 Overview

  **Overview of the UltraVoice Dataset Construction and Stylistic Coverage.** The figure illustrates the complete pipeline and capabilities of UltraVoice: (1) The upper left section presents our four-step construction process: text corpus curation, style injection & response generation, stylized speech synthesis, and quality control & filtering. (2) The ring chart on the right visualizes the dataset's hierarchical control structure, with six main control dimensions in the inner ring (Emotion, Speed, Volume, Accent, Language, Composite) and their finer-grained sub-dimensions in the outer ring. (3) The lower panel showcases representative examples from each speech style dimension, demonstrating UltraVoice's rich stylistic coverage and multi-dimensional controllability, including emotion (e.g., angry, happy), speed (e.g., fast, slow), volume (e.g., high, low), language (e.g., Chinese, Japanese, Korean), accent (e.g., AU, CA, GB, IN, SG, ZA), and composite styles that combine multiple control attributes.

+ ## πŸ“Š Dataset
+ The **UltraVoice** dataset contains **100,770** high-quality spoken dialogue samples, totaling **832.92 hours** of audio.
+ Of these, **84,832** samples are explicitly conditioned on six major fine-grained speech style dimensions:
+ - Emotion (Neutral, Happy, Sad, Angry, Surprised, Fearful, Disgusted)
+ - Volume (Low, Normal, High)
+ - Speed (Slow, Normal, Fast)
+ - Accent (AU, CA, GB, IN, SG, ZA)
+ - Language (Chinese, Japanese, Korean)
+ - Composite (multi-style combinations)
+
+ The remaining **15,938** pairs are general English QA samples that preserve balance and generalization.
+ On average, the dataset achieves a **mean CER of 5.93%** and a **UTMOS of 4.00**, indicating high-quality, natural speech with stable stylistic control.
+
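As a quick arithmetic check on the composition figures above (the sample counts and total duration come from this card; the average clip length is derived here, not stated in the source):

```python
# Sanity-check the dataset composition figures quoted in this card.
style_conditioned = 84_832   # samples with explicit style instructions
general_qa = 15_938          # general English QA samples
total = style_conditioned + general_qa
assert total == 100_770      # matches the stated dataset size

total_hours = 832.92
avg_clip_seconds = total_hours * 3600 / total  # derived, not stated in the card
print(f"{total} samples, ~{avg_clip_seconds:.1f}s of audio per sample on average")
```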
+ <div align="center">
+ <img src="pics/dataset_overview.png" alt="UltraVoice Dataset Statistics" width="80%">
+ </div>
+
+ **Detailed statistics across all control dimensions.** #Cnt. denotes the number of samples, Dur. the total duration in hours, CER the average character error rate, and UTMOS the average naturalness score. The dataset encompasses 100,770 samples totaling 832.92 hours, with fine-grained control over emotion (7 categories), volume (3 levels), speed (3 levels), accent (6 regions: AU, CA, GB, IN, SG, ZA), language (Chinese, Korean, Japanese), and composite styles.
+
+ <div align="center">
+ <img src="pics/dataset_static.png" alt="UltraVoice Dataset Visualizations" width="80%">
+ </div>
+
+ **Statistical visualizations of the six fine-grained speech style control dimensions in UltraVoice.** The visualization methods are tailored to each dimension's characteristics: (a) **Emotion** and (d) **Accent** use t-SNE plots demonstrating clear class separability for categorical attributes; (b) **Speed** and (e) **Volume** employ box plots showing precise control over acoustic properties with distinct distributions; (c) **Language** and (f) **Composite** leverage word clouds highlighting lexical diversity and expressive richness across multilingual and multi-dimensional control scenarios.
+
  ---

  ## πŸ€– Available Models


  **Evaluation of our SFT models (upper part) and existing strong baselines (lower part) on URO-Bench (EN).** Und.: Understanding. Conv.: Oral Conversation. Our results confirm that fine-tuning spoken dialogue models on UltraVoice enhances, rather than compromises, general conversational skills. All models showed substantial gains across Understanding, Reasoning, and Oral Conversation, with average improvements of **+10.84%** on the Basic setting and **+7.87%** on the Pro setting. Notably, the VocalNet-7B SFT model achieves state-of-the-art performance, outperforming strong baselines like Qwen2.5-Omni-7B and GLM4-Voice-9B, highlighting practical value beyond style control.

+ ### 3. Validation of Data Quality via Controllable Text-to-Speech
+
+ To further validate the quality and utility of our dataset, we repurposed UltraVoice into a controllable TTS dataset and fine-tuned a pre-trained **EmoVoice-0.5B** model, creating **UltraVoice-0.5B-SFT**.
+
+ <div align="center">
+ <img src="pics/exp4.png" alt="TTS Performance on Emotion" width="90%">
+ </div>
+
+ **Performance of our UltraVoice-0.5B-SFT model on emotional TTS tasks.** The evaluation is conducted on both an out-of-domain test set (EmoVoice-DB, top) and an in-domain test set (UltraVoice, bottom). Bold and underlined values denote the best and second-best results, respectively. Our fine-tuned TTS model demonstrates strong multi-dimensional style control. On the out-of-domain EmoVoice-DB test set, it achieves competitive performance against strong baselines such as PromptTTS, CosyVoice, and EmoVoice. Crucially, on in-domain UltraVoice data, it substantially reduces the Word Error Rate (WER) from 19.82 to **3.97**, while achieving high emotion similarity (**0.95**) and naturalness (**4.46** UTMOS).
+
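The in-domain WER improvement quoted above amounts to roughly an 80% relative reduction (derived from the two figures in the caption; the percentage itself is not stated in the source):

```python
# Relative WER reduction on the in-domain UltraVoice test set,
# computed from the baseline and fine-tuned figures in the caption.
wer_baseline = 19.82   # pre-trained EmoVoice-0.5B
wer_sft = 3.97         # UltraVoice-0.5B-SFT
relative_reduction = (wer_baseline - wer_sft) / wer_baseline
print(f"Relative WER reduction: {relative_reduction:.1%}")  # β†’ 80.0%
```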
+ <div align="center">
+ <img src="pics/exp5.png" alt="TTS Performance on Multiple Dimensions" width="90%">
+ </div>
+
+ **MOS and IFR results of UltraVoice-0.5B-SFT across five style dimensions.** The model consistently improves both MOS and IFR scores across all tested dimensions (emotion, accent, speed, volume, and composite styles) compared to the pre-trained baseline. These results confirm that our instruction-style data effectively enhances controllable synthesis across a diverse range of styles, demonstrating the high quality and broad applicability of UltraVoice for expressive speech synthesis tasks.
+
  ---

+ ## 🧩 Model Checkpoints
+
+ All fine-tuned models and the dataset are released on πŸ€— HuggingFace:
+ - [πŸ€— UltraVoice Dataset](https://huggingface.co/datasets/tutu0604/UltraVoice)
+ - [πŸ€— UltraVoice-SFT Models](https://huggingface.co/tutu0604/UltraVoice-SFT)
+
  ## πŸ“„ License

  These models are licensed under the **MIT License**. See the [LICENSE](https://github.com/bigai-nlco/UltraVoice/blob/main/LICENSE) file for details.
 
  - **Dataset**: [UltraVoice on HuggingFace](https://huggingface.co/datasets/tutu0604/UltraVoice)
  - **Training&Inference Code**: [SLAM-LLM](https://github.com/X-LANCE/SLAM-LLM) | [VocalNet](https://github.com/SJTU-OmniAgent/VocalNet)
  - **Project Page**: [UltraVoice Official Website](https://bigai-nlco.github.io/UltraVoice)
+ - **Paper**: [Hugging Face Papers: 2510.22588](https://huggingface.co/papers/2510.22588)

  ---

 
 

  **🎀 Try our models and share your feedback!**

+ </div>