---
language:
- zh
- en
tags:
- llm
- tts
- zero-shot
- voice-cloning
- reinforcement-learning
- flow-matching
license: mit
pipeline_tag: text-to-speech
---

# GLM-TTS: Controllable & Emotion-Expressive Zero-shot TTS

<div align="center">
<img src="https://raw.githubusercontent.com/zai-org/GLM-V/refs/heads/main/assets/images/logo.svg" width="50%"/>
</div>

<p align="center">
<a href="https://github.com/zai-org/GLM-TTS" target="_blank">💻 GitHub Repository</a> |
<a href="https://huggingface.co/spaces/zai-org/GLM-TTS" target="_blank">🤗 Online Demo</a> |
<a href="https://audio.z.ai/" target="_blank">🛠️ Audio.Z.AI</a>
</p>

## Model Introduction

GLM-TTS is a high-quality text-to-speech (TTS) synthesis system based on large language models, supporting zero-shot voice cloning and streaming inference. The system adopts a two-stage architecture combining an LLM for speech token generation and a Flow Matching model for waveform synthesis.

By introducing a **Multi-Reward Reinforcement Learning** framework, GLM-TTS significantly improves the expressiveness of generated speech, achieving more natural emotional control than traditional TTS systems.

### Key Features

* **Zero-shot Voice Cloning:** Clone any speaker's voice with just 3-10 seconds of prompt audio.
* **RL-enhanced Emotion Control:** Utilizes a multi-reward reinforcement learning framework (GRPO) to optimize prosody and emotion.
* **High-quality Synthesis:** Generates speech comparable to commercial systems with reduced Character Error Rate (CER).
* **Phoneme-level Control:** Supports "Hybrid Phoneme + Text" input for precise pronunciation control (e.g., polyphones).
* **Streaming Inference:** Supports real-time audio generation suitable for interactive applications.
* **Bilingual Support:** Optimized for Chinese and English mixed text.

## System Architecture

GLM-TTS follows a two-stage design:

1. **Stage 1 (LLM):** A Llama-based model converts input text into speech token sequences.
2. **Stage 2 (Flow Matching):** A Flow Matching model converts token sequences into high-quality mel-spectrograms, which are then turned into waveforms by a vocoder.

<div align="center">
<img src="https://raw.githubusercontent.com/zai-org/GLM-V/refs/heads/main/assets/images/architecture.png" width="60%" alt="GLM-TTS Architecture">
</div>
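The two stages compose into a simple text → tokens → mel → waveform pipeline. The sketch below is purely structural: every function is a stand-in for a neural model, and the token vocabulary size, mel dimension, and hop length are illustrative defaults rather than GLM-TTS's actual configuration.

```python
# Structural sketch of the two-stage pipeline. All bodies are toy
# stand-ins for the real networks (LLM, Flow Matching model, vocoder).

def stage1_text_to_tokens(text: str) -> list[int]:
    # Stand-in for the Llama-based LLM: text -> discrete speech tokens
    # (here, fake tokens derived from character codes).
    return [ord(c) % 512 for c in text]

def stage2_tokens_to_mel(tokens: list[int], n_mels: int = 80) -> list[list[float]]:
    # Stand-in for the Flow Matching model: tokens -> mel frames.
    return [[t / 512.0] * n_mels for t in tokens]

def vocoder_mel_to_wav(mel: list[list[float]], hop: int = 256) -> list[float]:
    # Stand-in for the vocoder: each mel frame -> `hop` waveform samples.
    return [frame[0] for frame in mel for _ in range(hop)]

def synthesize(text: str) -> list[float]:
    return vocoder_mel_to_wav(stage2_tokens_to_mel(stage1_text_to_tokens(text)))
```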

### Reinforcement Learning Alignment

To tackle flat emotional expression, GLM-TTS uses a **Group Relative Policy Optimization (GRPO)** algorithm with multiple reward functions (Similarity, CER, Emotion, Laughter) to align the LLM's generation strategy.
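The group-relative step at the heart of GRPO can be sketched in a few lines: several rollouts of the same prompt are each scored with a combined multi-reward, and advantages are obtained by normalizing rewards within the group (no learned value function). The reward signatures and weights below are illustrative assumptions, not GLM-TTS's actual reward models.

```python
from statistics import mean, pstdev

def combined_reward(similarity, cer, emotion, laughter,
                    weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted multi-reward. CER is an error rate (lower is better),
    so it enters with a negative sign. Weights are illustrative."""
    w_sim, w_cer, w_emo, w_laugh = weights
    return w_sim * similarity - w_cer * cer + w_emo * emotion + w_laugh * laughter

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each rollout's reward against
    the mean/std of its own group: A_i = (r_i - mean) / (std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts of the same prompt, each scored by the reward functions.
rewards = [combined_reward(0.75, 0.02, 0.4, 0.0),
           combined_reward(0.80, 0.10, 0.6, 0.1),
           combined_reward(0.70, 0.01, 0.2, 0.0),
           combined_reward(0.78, 0.05, 0.7, 0.2)]
advantages = group_relative_advantages(rewards)
```

Rollouts scoring above their group's mean get positive advantages and are reinforced; those below are suppressed, which is what pushes the policy away from flat, low-reward prosody.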

## Evaluation Results

Evaluated on `seed-tts-eval`. **GLM-TTS_RL** achieves the lowest Character Error Rate (CER) while maintaining high speaker similarity.

| Model | CER ↓ | SIM ↑ | Open-source |
| :--- | :---: | :---: | :---: |
| Seed-TTS | 1.12 | **79.6** | ❌ No |
| CosyVoice2 | 1.38 | 75.7 | ✅ Yes |
| F5-TTS | 1.53 | 76.0 | ✅ Yes |
| **GLM-TTS (Base)** | 1.03 | 76.1 | ✅ Yes |
| **GLM-TTS_RL (Ours)** | **0.89** | 76.4 | ✅ Yes |
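For reference, CER in tables like this is typically computed as the character-level Levenshtein edit distance between the ASR transcript of the generated audio and the reference text, divided by the reference length. A generic implementation (not the seed-tts-eval scoring script) looks like:

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance via the standard
    two-row dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent, matching the scale of the
    table above (e.g. 0.89 means 0.89%)."""
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)
```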

## Quick Start

### Installation

```bash
git clone https://github.com/zai-org/GLM-TTS.git
cd GLM-TTS
pip install -r requirements.txt
```

#### Command Line Inference

```bash
python glmtts_inference.py \
    --data=example_zh \
    --exp_name=_test \
    --use_cache
# Add --phoneme to enable phoneme-level pronunciation control.
```

#### Shell Script Inference

```bash
bash glmtts_inference.sh
```

## Acknowledgments & Citation

We thank the following open-source projects for their support:

- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) - frontend processing framework and high-quality vocoder
- [Llama](https://github.com/meta-llama/llama) - base language model architecture
- [Vocos](https://github.com/charactr-platform/vocos) - high-quality vocoder
- [GRPO-Zero](https://github.com/policy-gradient/GRPO-Zero) - inspiration for the reinforcement learning implementation

If you use GLM-TTS in your research, please cite:

```bibtex
@misc{glmtts2025,
  title={GLM-TTS: Controllable & Emotion-Expressive Zero-shot TTS with Multi-Reward Reinforcement Learning},
  author={CogAudio Group Members},
  year={2025},
  publisher={Zhipu AI Inc}
}
```