Improve model card for DMOSpeech 2: Add pipeline tag and detailed usage
This PR updates the model card for DMOSpeech 2 by adding metadata and usage information.
Key updates include:
* Adding the `pipeline_tag: text-to-speech` to ensure the model is properly categorized and discoverable on the Hugging Face Hub (e.g., at [https://huggingface.co/models?pipeline_tag=text-to-speech](https://huggingface.co/models?pipeline_tag=text-to-speech)).
* Including the full paper abstract to provide a detailed overview of the model's capabilities and contributions.
* Providing clear links to the official paper on Hugging Face, the project page, and the GitHub repository.
* Adding inference instructions that cover environment setup and checkpoint download, and pointing to `demo.ipynb` in the official GitHub repository for sample usage rather than including a code snippet that might be incorrect.
This update makes the DMOSpeech 2 model card clearer and more useful for the community.
@@ -1,3 +1,47 @@
----
-license: mit
-
+---
+license: mit
+pipeline_tag: text-to-speech
+---
+
+# DMOSpeech 2: Reinforcement Learning for Duration Prediction in Metric-Optimized Speech Synthesis
+
+Diffusion-based text-to-speech (TTS) systems have made remarkable progress in zero-shot speech synthesis, yet optimizing all components for perceptual metrics remains challenging. Prior work with DMOSpeech demonstrated direct metric optimization for speech generation components, but duration prediction remained unoptimized. This paper presents DMOSpeech 2, which extends metric optimization to the duration predictor through a reinforcement learning approach. The proposed system implements a novel duration policy framework using group relative preference optimization (GRPO) with speaker similarity and word error rate as reward signals. By optimizing this previously unoptimized component, DMOSpeech 2 creates a more complete metric-optimized synthesis pipeline. Additionally, this paper introduces teacher-guided sampling, a hybrid approach leveraging a teacher model for initial denoising steps before transitioning to the student model, significantly improving output diversity while maintaining efficiency. Comprehensive evaluations demonstrate superior performance across all metrics compared to previous systems, while reducing sampling steps by half without quality degradation. These advances represent a significant step toward speech synthesis systems with metric optimization across multiple components.
+
+This model was presented in the paper [DMOSpeech 2: Reinforcement Learning for Duration Prediction in Metric-Optimized Speech Synthesis](https://huggingface.co/papers/2507.14988).
+
+For more details, visit the [project page](https://dmospeech2.github.io/) or the [GitHub repository](https://github.com/yl4579/DMOSpeech2).
+
+## Inference
+
+To use the DMOSpeech 2 model, follow the steps below, adapted from the official GitHub repository.
+
+### Pre-requisites
+
+1. **Clone the repository:**
+```bash
+git clone https://github.com/yl4579/DMOSpeech2.git
+cd DMOSpeech2
+```
+2. **Set up environment and install packages:**
+```bash
+conda create -n dmo2 python=3.10
+conda activate dmo2
+pip install -r requirements.txt
+```
+
+### Download Checkpoints
+
+Download the pre-trained model checkpoints from Hugging Face to a `ckpts` folder in your cloned repository:
+
+```bash
+mkdir ckpts
+cd ckpts
+wget https://huggingface.co/yl4579/DMOSpeech2/resolve/main/model_85000.pt
+wget https://huggingface.co/yl4579/DMOSpeech2/resolve/main/model_1500.pt
+```
+
+### Run Inference
+
+You can run the inference and explore various synthesis schemes using the provided `demo.ipynb` notebook in the GitHub repository.
+
+Refer to `src/demo.ipynb` in the [DMOSpeech 2 GitHub repository](https://github.com/yl4579/DMOSpeech2/blob/main/src/demo.ipynb) for detailed code examples and usage.
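The checkpoint download step in the added card can be sanity-checked before opening the notebook. The following is a minimal sketch, not part of the PR or the DMOSpeech2 repository: the helper name `check_ckpts` is hypothetical, and the only assumptions are the `ckpts` folder and the two filenames taken from the `wget` commands above.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the DMOSpeech2 repo): report whether the two
# checkpoint files named in the wget commands exist and are non-empty.
check_ckpts() {
  local dir="${1:-ckpts}"   # checkpoint folder; defaults to ./ckpts
  local missing=0
  local f
  for f in model_85000.pt model_1500.pt; do
    if [ -s "$dir/$f" ]; then
      echo "ok: $dir/$f"
    else
      echo "missing or empty: $dir/$f"
      missing=1
    fi
  done
  # Exit status 0 only when both files are present and non-empty.
  return "$missing"
}
```

Running `check_ckpts ckpts` from the repository root prints one line per file. A non-zero exit status suggests that at least one download should be repeated. The check only tests presence and size, so it does not detect a corrupted or partial file.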