GLM-TTS is a high-quality text-to-speech (TTS) synthesis system based on large language models.

By introducing a **Multi-Reward Reinforcement Learning** framework, GLM-TTS significantly improves the expressiveness of generated speech, achieving more natural emotional control compared to traditional TTS systems.

### Key Features

* **Zero-shot Voice Cloning:** Clone any speaker's voice with just 3-10 seconds of prompt audio.
* **RL-enhanced Emotion Control:** Utilizes a multi-reward reinforcement learning framework (GRPO) to optimize prosody and emotion.
* **Streaming Inference:** Supports real-time audio generation suitable for interactive applications.
* **Bilingual Support:** Optimized for Chinese and English mixed text.
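
The streaming mode above can be pictured with a toy sketch. This is not the GLM-TTS API; every name and number here is hypothetical, and it only illustrates the idea of emitting audio in fixed-size chunks as tokens are generated instead of waiting for the full utterance:

```python
def synthesize_stream(text, chunk_tokens=25, tokens_per_char=3):
    """Toy chunked generation: yield speech-token chunks as soon as
    they become available, rather than after the whole utterance is
    decoded. Illustrative only; real token rates depend on the codec."""
    total = len(text) * tokens_per_char  # pretend token budget for the text
    for start in range(0, total, chunk_tokens):
        # In a real system each chunk would be decoded to an audio frame here.
        yield list(range(start, min(start + chunk_tokens, total)))

chunks = list(synthesize_stream("Hello, GLM-TTS!"))
```

A consumer can start playback as soon as the first chunk arrives, which is what makes the mode suitable for interactive applications.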

## System Architecture

GLM-TTS follows a two-stage design:
### Reinforcement Learning Alignment
To tackle flat emotional expression, GLM-TTS uses a **Group Relative Policy Optimization (GRPO)** algorithm with multiple reward functions (Similarity, CER, Emotion, Laughter) to align the LLM's generation strategy.
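
As a rough sketch of how a multi-reward objective can be combined with group-relative normalization (the reward names mirror those listed above, but the weights, values, and signs here are invented for illustration; the actual GLM-TTS reward definitions are not shown in this excerpt):

```python
import statistics

# Hypothetical weights for the reward signals named in the README.
WEIGHTS = {"similarity": 1.0, "cer": 1.0, "emotion": 0.5, "laughter": 0.25}

def scalar_reward(rewards):
    """Weighted sum of individual reward signals for one sampled
    utterance. CER is an error rate, so it enters with a minus sign."""
    return (WEIGHTS["similarity"] * rewards["similarity"]
            - WEIGHTS["cer"] * rewards["cer"]
            + WEIGHTS["emotion"] * rewards["emotion"]
            + WEIGHTS["laughter"] * rewards["laughter"])

def group_relative_advantages(group):
    """GRPO-style advantages: each rollout's scalar reward is
    normalized against the mean/std of its own sampling group,
    so no separate value network is needed."""
    scores = [scalar_reward(g) for g in group]
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores) or 1.0  # guard against zero std
    return [(s - mu) / sigma for s in scores]
```

Rollouts scoring above their group mean get positive advantages and are reinforced; those below are suppressed.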

## Evaluation Results

On the `seed-tts-eval` benchmark, **GLM-TTS_RL** achieves the lowest Character Error Rate (CER) while maintaining high speaker similarity.
| **GLM-TTS (Base)** | 1.03 | 76.1 | Yes |
| **GLM-TTS_RL (Ours)** | **0.89** | 76.4 | Yes |
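
For reference, CER is the character-level edit distance between the transcript of the generated audio and the reference text, divided by the reference length. A minimal self-contained implementation of that metric (not the evaluation code used by `seed-tts-eval`):

```python
def character_error_rate(reference, hypothesis):
    """CER = (insertions + deletions + substitutions) / len(reference),
    computed with a standard one-row Levenshtein DP."""
    ref, hyp = list(reference), list(hypothesis)
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(dp[j] + 1,        # deletion
                      dp[j - 1] + 1,    # insertion
                      prev + (r != h))  # substitution (free on match)
            prev, dp[j] = dp[j], cur
    return dp[len(hyp)] / max(len(ref), 1)
```

A CER of 0.89 in the table above thus means fewer than one character error per hundred reference characters.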

## Quick Start

### Installation

```bash
python glmtts_inference.py \
    ...

bash glmtts_inference.sh
```

## Acknowledgments & Citation

We thank the following open-source projects for their support: