@@ -1,25 +1,34 @@
 ---
 license: mit
 ---
 <div align="center">
 <img src="./pics/logo.png" alt="UltraVoice Logo" width="200">
 # UltraVoice: Scaling Fine-Grained Style-Controlled Speech Conversations for Spoken Dialogue Models
-[![arXiv](https://img.shields.io/badge/arXiv-Preprint-b31b1b.svg)](https://arxiv.org/abs/2510.22588)
 [![Project Page](https://img.shields.io/badge/Project-Page-green)](https://bigai-nlco.github.io/UltraVoice)
 [![GitHub](https://img.shields.io/badge/GitHub-Code-black)](https://github.com/bigai-nlco/UltraVoice)
 [![HuggingFace Dataset](https://img.shields.io/badge/🤗%20HuggingFace-Dataset-yellow)](https://huggingface.co/datasets/tutu0604/UltraVoice)
 [![License](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 </div>
 ---
 ## 📝 Abstract
-> Spoken dialogue models currently lack the ability for fine-grained speech style control, a critical capability for human-like interaction that is often overlooked in favor of purely functional capabilities like reasoning and question answering. To address this limitation, we introduce **UltraVoice**, the first large-scale speech dialogue dataset engineered for multiple fine-grained speech style control. Encompassing over 830 hours of speech dialogues, UltraVoice provides instructions across six key speech stylistic dimensions: emotion, speed, volume, accent, language, and composite styles. Fine-tuning leading models such as SLAM-Omni and VocalNet on UltraVoice significantly enhances their fine-grained speech stylistic controllability without degrading core conversational abilities. Specifically, our fine-tuned models achieve improvements of 29.12-42.33% in Mean Opinion Score (MOS) and 14.61-40.09 percentage points in Instruction Following Rate (IFR) on multi-dimensional control tasks. Moreover, on the URO-Bench benchmark, our fine-tuned models demonstrate substantial gains in core understanding, reasoning, and conversational abilities, with average improvements of +10.84% on the Basic setting and +7.87% on the Pro setting. Furthermore, the dataset's utility extends to training controllable Text-to-Speech (TTS) models, underscoring its high quality and broad applicability for expressive speech synthesis.
 ## 🎯 Overview
@@ -29,6 +38,31 @@ license: mit
 **Overview of the UltraVoice Dataset Construction and Stylistic Coverage.** The figure illustrates the complete pipeline and capabilities of UltraVoice: (1) The upper left section presents our four-step construction process: text corpus curation, style injection & response generation, stylized speech synthesis, and quality control & filtering. (2) The ring chart on the right visualizes the dataset's hierarchical control structure, with six main control dimensions in the inner ring (Emotion, Speed, Volume, Accent, Language, Composite) and their finer-grained sub-dimensions in the outer ring. (3) The lower panel showcases representative examples from each speech style dimension, demonstrating UltraVoice's rich stylistic coverage and multi-dimensional controllability, including emotion (e.g., angry, happy), speed (e.g., fast, slow), volume (e.g., high, low), language (e.g., Chinese, Japanese, Korean), accent (e.g., AU, CA, GB, IN, SG, ZA), and composite styles that combine multiple control attributes.
 ---
 ## 🤖 Available Models
@@ -76,8 +110,30 @@ To verify that fine-tuning on UltraVoice enhances rather than compromises genera
 **Evaluation of our SFT models (upper part) and existing strong baselines (lower part) on URO-Bench (EN).** Und.: Understanding. Conv.: Oral Conversation. Our results confirm that fine-tuning spoken dialogue models on UltraVoice enhances, rather than compromises, general conversational skills. All models showed substantial gains across Understanding, Reasoning, and Oral Conversation, with average improvements of **+10.84%** on the Basic setting and **+7.87%** on the Pro setting. Notably, the VocalNet-7B SFT model achieves state-of-the-art performance, outperforming strong baselines like Qwen2.5-Omni-7B and GLM4-Voice-9B, highlighting practical value beyond style control.
 ---
 ## 📄 License
 These models are licensed under the **MIT License**. See the [LICENSE](https://github.com/bigai-nlco/UltraVoice/blob/main/LICENSE) file for details.
@@ -131,7 +187,7 @@ For questions, issues, or feedback:
 - **Dataset**: [UltraVoice on HuggingFace](https://huggingface.co/datasets/tutu0604/UltraVoice)
 - **Training & Inference Code**: [SLAM-LLM](https://github.com/X-LANCE/SLAM-LLM) | [VocalNet](https://github.com/SJTU-OmniAgent/VocalNet)
 - **Project Page**: [UltraVoice Official Website](https://bigai-nlco.github.io/UltraVoice)
-- **Paper**: [arXiv:2510.22588](https://arxiv.org/abs/2510.22588)
 ---
@@ -141,5 +197,4 @@ For questions, issues, or feedback:
 **🎤 Try our models and share your feedback!**
-</div>

 ---
 license: mit
+pipeline_tag: text-to-speech
+library_name: transformers
 ---
 <div align="center">
 <img src="./pics/logo.png" alt="UltraVoice Logo" width="200">
 # UltraVoice: Scaling Fine-Grained Style-Controlled Speech Conversations for Spoken Dialogue Models
+[![Paper](https://img.shields.io/badge/Paper-HuggingFace-blue)](https://huggingface.co/papers/2510.22588)
 [![Project Page](https://img.shields.io/badge/Project-Page-green)](https://bigai-nlco.github.io/UltraVoice)
 [![GitHub](https://img.shields.io/badge/GitHub-Code-black)](https://github.com/bigai-nlco/UltraVoice)
 [![HuggingFace Dataset](https://img.shields.io/badge/🤗%20HuggingFace-Dataset-yellow)](https://huggingface.co/datasets/tutu0604/UltraVoice)
+[![HuggingFace Model](https://img.shields.io/badge/🤗%20HuggingFace-Models-blue)](https://huggingface.co/tutu0604/UltraVoice-SFT)
 [![License](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 </div>
 ---
+## 📢 News
+- **[2025-10]** We have released the `UltraVoice` dataset and model checkpoints.
 ## 📝 Abstract
+Spoken dialogue models currently lack the ability for fine-grained speech style control, a critical capability for human-like interaction that is often overlooked in favor of purely functional capabilities like reasoning and question answering. To address this limitation, we introduce UltraVoice, the first large-scale speech dialogue dataset engineered for multiple fine-grained speech style control. Encompassing over 830 hours of speech dialogues, UltraVoice provides instructions across six key speech stylistic dimensions: emotion, speed, volume, accent, language, and composite styles. Fine-tuning leading models such as SLAM-Omni and VocalNet on UltraVoice significantly enhances their fine-grained speech stylistic controllability without degrading core conversational abilities. Specifically, our fine-tuned models achieve improvements of 29.12-42.33% in Mean Opinion Score (MOS) and 14.61-40.09 percentage points in Instruction Following Rate (IFR) on the multi-dimensional control tasks designed in UltraVoice. Moreover, on the URO-Bench benchmark, our fine-tuned models demonstrate substantial gains in core understanding, reasoning, and conversational abilities, with average improvements of +10.84% on the Basic setting and +7.87% on the Pro setting. Furthermore, the dataset's utility extends to training controllable Text-to-Speech (TTS) models, underscoring its high quality and broad applicability for expressive speech synthesis. The complete dataset and model checkpoints are available on Hugging Face.
+---
 ## 🎯 Overview
 **Overview of the UltraVoice Dataset Construction and Stylistic Coverage.** The figure illustrates the complete pipeline and capabilities of UltraVoice: (1) The upper left section presents our four-step construction process: text corpus curation, style injection & response generation, stylized speech synthesis, and quality control & filtering. (2) The ring chart on the right visualizes the dataset's hierarchical control structure, with six main control dimensions in the inner ring (Emotion, Speed, Volume, Accent, Language, Composite) and their finer-grained sub-dimensions in the outer ring. (3) The lower panel showcases representative examples from each speech style dimension, demonstrating UltraVoice's rich stylistic coverage and multi-dimensional controllability, including emotion (e.g., angry, happy), speed (e.g., fast, slow), volume (e.g., high, low), language (e.g., Chinese, Japanese, Korean), accent (e.g., AU, CA, GB, IN, SG, ZA), and composite styles that combine multiple control attributes.
+## 📊 Dataset
+The **UltraVoice** dataset contains **100,770** high-quality spoken dialogue samples, totaling **832.92 hours** of audio.
+Among them, **84,832** are explicitly conditioned on six major fine-grained speech style dimensions:
+- Emotion (Neutral, Happy, Sad, Angry, Surprised, Fearful, Disgusted)
+- Volume (Low, Normal, High)
+- Speed (Slow, Normal, Fast)
+- Accent (AU, CA, GB, IN, SG, ZA)
+- Language (Chinese, Japanese, Korean)
+- Composite (multi-style combinations)
+The remaining **15,938** samples are general English QA pairs, included for balance and generalization.
+Across the dataset, the **mean character error rate (CER) is 5.93%** and the **mean UTMOS is 4.00**, indicating intelligible, natural-sounding speech and stable stylistic control.
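The counts above can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures quoted in this section; the variable and function names are illustrative, and the per-sample average is a derived estimate rather than a number reported by the authors.

```python
# Figures quoted in the Dataset section of this README.
TOTAL_SAMPLES = 100_770
STYLE_SAMPLES = 84_832        # samples conditioned on a speech style
GENERAL_QA_SAMPLES = 15_938   # general English QA samples
TOTAL_HOURS = 832.92


def dataset_summary() -> dict:
    """Derive simple consistency checks from the quoted figures."""
    return {
        # The two subsets should account for every sample.
        "counts_add_up": STYLE_SAMPLES + GENERAL_QA_SAMPLES == TOTAL_SAMPLES,
        # Fraction of samples that carry a style instruction.
        "style_share": STYLE_SAMPLES / TOTAL_SAMPLES,
        # Derived estimate: mean audio duration per sample, in seconds.
        "avg_seconds_per_sample": TOTAL_HOURS * 3600 / TOTAL_SAMPLES,
    }


summary = dataset_summary()
print(summary["counts_add_up"])
print(f"{summary['style_share']:.1%}")
print(f"{summary['avg_seconds_per_sample']:.1f} s")
```

With these inputs the two subsets sum to the stated total, roughly 84% of samples are style-conditioned, and the average works out to a little under 30 seconds of audio per sample.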
+<div align="center">
+  <img src="pics/dataset_overview.png" alt="UltraVoice Dataset Statistics" width="80%">
+</div>
+**Detailed statistics across all control dimensions.** #Cnt. denotes the number of samples, Dur. the total duration in hours, CER the average character error rate, and UTMOS the average naturalness score. The dataset encompasses 100,770 samples totaling 832.92 hours, with fine-grained control over emotion (7 categories), volume (3 levels), speed (3 levels), accent (6 regions: AU, CA, GB, IN, SG, ZA), language (Chinese, Korean, Japanese), and composite styles.
+<div align="center">
+  <img src="pics/dataset_static.png" alt="UltraVoice Dataset Visualizations" width="80%">
+</div>
+**Statistical visualizations of the six fine-grained speech style control dimensions in UltraVoice.** The visualization methods are tailored to each dimension's characteristics: (a) **Emotion** and (d) **Accent** use t-SNE plots demonstrating clear class separability for categorical attributes; (b) **Speed** and (e) **Volume** employ box plots showing precise control over acoustic properties with distinct distributions; (c) **Language** and (f) **Composite** leverage word clouds highlighting lexical diversity and expressive richness across multilingual and multi-dimensional control scenarios.
 ---
 ## 🤖 Available Models
 **Evaluation of our SFT models (upper part) and existing strong baselines (lower part) on URO-Bench (EN).** Und.: Understanding. Conv.: Oral Conversation. Our results confirm that fine-tuning spoken dialogue models on UltraVoice enhances, rather than compromises, general conversational skills. All models showed substantial gains across Understanding, Reasoning, and Oral Conversation, with average improvements of **+10.84%** on the Basic setting and **+7.87%** on the Pro setting. Notably, the VocalNet-7B SFT model achieves state-of-the-art performance, outperforming strong baselines like Qwen2.5-Omni-7B and GLM4-Voice-9B, highlighting practical value beyond style control.
+### 3. Validation of Data Quality via Controllable Text-to-Speech
+To further validate the quality and utility of our dataset, we repurposed UltraVoice into a controllable TTS dataset and fine-tuned a pre-trained **EmoVoice-0.5B** model, creating **UltraVoice-0.5B-SFT**.
+<div align="center">
+  <img src="pics/exp4.png" alt="TTS Performance on Emotion" width="90%">
+</div>
+**Performance of our UltraVoice-0.5B-SFT model on emotional TTS tasks.** The evaluation is conducted on both an out-of-domain test set (EmoVoice-DB, top) and an in-domain test set (UltraVoice, bottom). Bold and underlined values denote the best and second-best results, respectively. Our fine-tuned TTS model demonstrates strong multi-dimensional style control. On the out-of-domain EmoVoice-DB test set, our model achieves competitive performance against strong baselines such as PromptTTS, CosyVoice, and EmoVoice. Crucially, on our in-domain UltraVoice data, it substantially reduces the Word Error Rate (WER) to **3.97** from 19.82, achieving high emotion similarity (**0.95**) and naturalness (**4.46** UTMOS).
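This README reports some gains as relative percentages (MOS) and others as percentage points (IFR), so it can help to make the distinction explicit. The sketch below applies both views to the in-domain WER figures quoted above (19.82 before fine-tuning, 3.97 after). The helper names are hypothetical, and it assumes both values are on the same scale.

```python
def absolute_change(before: float, after: float) -> float:
    """Change expressed in the metric's own units (points)."""
    return after - before


def relative_change(before: float, after: float) -> float:
    """Change expressed as a fraction of the starting value."""
    if before == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    return (after - before) / before


# In-domain WER values quoted in the caption above.
wer_before, wer_after = 19.82, 3.97
print(f"absolute: {absolute_change(wer_before, wer_after):+.2f}")
print(f"relative: {relative_change(wer_before, wer_after):+.1%}")
```

For these values the absolute drop is 15.85 points, which corresponds to a relative reduction of about 80%.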
+<div align="center">
+  <img src="pics/exp5.png" alt="TTS Performance on Multiple Dimensions" width="90%">
+</div>
+**MOS and IFR results of UltraVoice-0.5B-SFT across five style dimensions.** The model consistently improves both MOS and IFR scores across all tested dimensions (emotion, accent, speed, volume, and composite styles) compared to the pre-trained baseline. These results confirm that our instruction-style data effectively enhances controllable synthesis across a diverse range of styles, demonstrating the high quality and broad applicability of UltraVoice for expressive speech synthesis tasks.
 ---
+## 🧩 Model Checkpoints
+All fine-tuned models and datasets are released on 🤗HuggingFace:
+- [🤗 UltraVoice Dataset](https://huggingface.co/datasets/tutu0604/UltraVoice)
+- [🤗 UltraVoice-SFT Models](https://huggingface.co/tutu0604/UltraVoice-SFT)
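One possible way to fetch the released artifacts is with the `huggingface_hub` client, sketched below. The repository IDs come from the links above. The helper `download_ultravoice` and its arguments are assumptions for illustration, since the file layout of either repository is not described here. The dataset is large (over 830 hours of audio), so restricting the download with `allow_patterns` may be worthwhile.

```python
from typing import Optional, Sequence

# Repository IDs taken from the links above, paired with their Hub repo type.
REPOS = {
    "dataset": ("tutu0604/UltraVoice", "dataset"),
    "models": ("tutu0604/UltraVoice-SFT", "model"),
}


def download_ultravoice(
    which: str,
    local_dir: Optional[str] = None,
    allow_patterns: Optional[Sequence[str]] = None,
) -> str:
    """Download one of the UltraVoice repositories and return its local path.

    `which` is "dataset" or "models" (hypothetical keys defined in REPOS).
    """
    if which not in REPOS:
        raise KeyError(f"unknown target {which!r}; expected one of {sorted(REPOS)}")
    # Imported lazily so the module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    repo_id, repo_type = REPOS[which]
    return snapshot_download(
        repo_id=repo_id,
        repo_type=repo_type,
        local_dir=local_dir,
        allow_patterns=allow_patterns,
    )
```

Calling `download_ultravoice("models", allow_patterns=["*.md"])` would, under these assumptions, fetch only the Markdown files as a quick smoke test before committing to a full download.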
 ## 📄 License
 These models are licensed under the **MIT License**. See the [LICENSE](https://github.com/bigai-nlco/UltraVoice/blob/main/LICENSE) file for details.
 - **Dataset**: [UltraVoice on HuggingFace](https://huggingface.co/datasets/tutu0604/UltraVoice)
 - **Training & Inference Code**: [SLAM-LLM](https://github.com/X-LANCE/SLAM-LLM) | [VocalNet](https://github.com/SJTU-OmniAgent/VocalNet)
 - **Project Page**: [UltraVoice Official Website](https://bigai-nlco.github.io/UltraVoice)
+- **Paper**: [Hugging Face Papers:2510.22588](https://huggingface.co/papers/2510.22588)
 ---
 **🎤 Try our models and share your feedback!**
+</div>