---
language:
- zh
- en
tags:
- llm
- tts
- zero-shot
- voice-cloning
- reinforcement-learning
- flow-matching
license: mit
pipeline_tag: text-to-speech
---

# GLM-TTS: Controllable & Emotion-Expressive Zero-shot TTS

<div align="center">
<img src="https://raw.githubusercontent.com/zai-org/GLM-V/refs/heads/main/assets/images/logo.svg" width="50%"/>
</div>

<p align="center">
<a href="https://github.com/zai-org/GLM-TTS" target="_blank">💻 GitHub Repository</a>
&nbsp;&nbsp;|&nbsp;&nbsp;
<a href="https://huggingface.co/spaces/zai-org/GLM-TTS" target="_blank">🤗 Online Demo</a>
&nbsp;&nbsp;|&nbsp;&nbsp;
<a href="https://audio.z.ai/" target="_blank">🛠️ Audio.Z.AI</a>
</p>

## 📖 Model Introduction

GLM-TTS is a high-quality text-to-speech (TTS) system built on large language models, supporting zero-shot voice cloning and streaming inference. It adopts a two-stage architecture: an LLM generates speech tokens, and a Flow Matching model synthesizes the waveform from them.

By introducing a **Multi-Reward Reinforcement Learning** framework, GLM-TTS significantly improves the expressiveness of generated speech, achieving more natural emotional control than traditional TTS systems.

### Key Features

* **Zero-shot Voice Cloning:** Clone any speaker's voice with just 3-10 seconds of prompt audio.
* **RL-enhanced Emotion Control:** Utilizes a multi-reward reinforcement learning framework (GRPO) to optimize prosody and emotion.
* **High-quality Synthesis:** Generates speech comparable to commercial systems with reduced Character Error Rate (CER).
* **Phoneme-level Control:** Supports "Hybrid Phoneme + Text" input for precise pronunciation control (e.g., polyphones).
* **Streaming Inference:** Supports real-time audio generation suitable for interactive applications.
* **Bilingual Support:** Optimized for mixed Chinese and English text.

## System Architecture

GLM-TTS follows a two-stage design:

1. **Stage 1 (LLM):** A Llama-based model converts input text into speech token sequences.
2. **Stage 2 (Flow Matching):** A Flow model converts token sequences into high-quality mel-spectrograms, which are then turned into waveforms by a vocoder.

<div align="center">
<img src="https://raw.githubusercontent.com/zai-org/GLM-V/refs/heads/main/assets/images/architecture.png" width="60%" alt="GLM-TTS Architecture">
</div>

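The two-stage flow can be sketched as a minimal stub pipeline. Every function name, token value, and frame size below is an illustrative placeholder, not the actual GLM-TTS API; the sketch only shows how data moves between the stages.

```python
# Schematic of a two-stage LLM + flow-matching TTS pipeline (stubs only).

def llm_generate_speech_tokens(text: str, prompt_tokens: list[int]) -> list[int]:
    """Stage 1 (stub): an autoregressive LLM, conditioned on the voice-cloning
    prompt tokens, maps text to a discrete speech-token sequence."""
    # Fake deterministic tokens; a real model would sample from the LLM.
    return [hash((text, i)) % 1024 for i in range(len(text))]

def flow_matching_to_mel(tokens: list[int], n_mels: int = 80) -> list[list[float]]:
    """Stage 2 (stub): a flow-matching model decodes tokens into mel frames."""
    return [[0.0] * n_mels for _ in tokens]  # one zeroed mel frame per token

def vocoder_to_waveform(mel: list[list[float]], hop: int = 256) -> list[float]:
    """Stub vocoder: mel frames -> waveform samples (hop samples per frame)."""
    return [0.0] * (len(mel) * hop)

def synthesize(text: str, prompt_tokens: list[int]) -> list[float]:
    tokens = llm_generate_speech_tokens(text, prompt_tokens)
    mel = flow_matching_to_mel(tokens)
    return vocoder_to_waveform(mel)
```
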
### Reinforcement Learning Alignment

To tackle flat emotional expression, GLM-TTS uses the **Group Relative Policy Optimization (GRPO)** algorithm with multiple reward functions (Similarity, CER, Emotion, Laughter) to align the LLM's generation strategy.

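At the core of GRPO is a group-relative advantage: several candidate generations are sampled per prompt, each is scored by the combined rewards, and scores are normalized within the group. A minimal sketch, assuming a simple weighted sum over the four rewards named above; the actual weights and reward models used by GLM-TTS are not specified here.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each sampled generation, normalized within its group:
    (reward - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # avoid division by zero for uniform groups
    return [(r - mu) / sigma for r in rewards]

def combined_reward(sim: float, cer: float, emotion: float, laughter: float) -> float:
    # Placeholder weighted sum over the four rewards; weights are illustrative.
    return 1.0 * sim - 1.0 * cer + 0.5 * emotion + 0.5 * laughter
```

Generations scoring above their group's mean get positive advantage and are reinforced; the group baseline removes the need for a separate value network.
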
## Evaluation Results

Evaluated on `seed-tts-eval`. **GLM-TTS_RL** achieves the lowest Character Error Rate (CER) while maintaining high speaker similarity.

| Model | CER ↓ | SIM ↑ | Open-source |
| :--- | :---: | :---: | :---: |
| Seed-TTS | 1.12 | **79.6** | 🔒 No |
| CosyVoice2 | 1.38 | 75.7 | 👍 Yes |
| F5-TTS | 1.53 | 76.0 | 👍 Yes |
| **GLM-TTS (Base)** | 1.03 | 76.1 | 👍 Yes |
| **GLM-TTS_RL (Ours)** | **0.89** | 76.4 | 👍 Yes |

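For reference, CER as reported in such tables is conventionally the character-level edit distance between the ASR transcript of the synthesized audio and the reference text, divided by the reference length. A minimal pure-Python version, assuming this conventional definition; the exact `seed-tts-eval` scoring script may differ in text normalization.

```python
def character_error_rate(ref: str, hyp: str) -> float:
    """Levenshtein distance between reference and hypothesis strings,
    normalized by reference length."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))  # distances from ref[:0] to each hyp prefix
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            sub = prev[j - 1] + (ref[i - 1] != hyp[j - 1])  # substitution/match
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, sub)  # deletion/insertion
        prev = cur
    return prev[n] / max(m, 1)
```
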
## Quick Start

### Installation

```bash
git clone https://github.com/zai-org/GLM-TTS.git
cd GLM-TTS
pip install -r requirements.txt
```

#### Command Line Inference

```bash
python glmtts_inference.py \
    --data=example_zh \
    --exp_name=_test \
    --use_cache
    # Add --phoneme to enable phoneme capabilities.
```

#### Shell Script Inference

```bash
bash glmtts_inference.sh
```

## Acknowledgments & Citation

We thank the following open-source projects for their support:

- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) - frontend processing framework and high-quality vocoder
- [Llama](https://github.com/meta-llama/llama) - base language model architecture
- [Vocos](https://github.com/charactr-platform/vocos) - high-quality vocoder
- [GRPO-Zero](https://github.com/policy-gradient/GRPO-Zero) - inspiration for the reinforcement learning implementation

If you use GLM-TTS in your research, please cite:

```bibtex
@misc{glmtts2025,
  title={GLM-TTS: Controllable & Emotion-Expressive Zero-shot TTS with Multi-Reward Reinforcement Learning},
  author={CogAudio Group Members},
  year={2025},
  publisher={Zhipu AI Inc}
}
```