File size: 4,222 Bytes
0af6a30
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
891ebe8
0af6a30
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
891ebe8
0af6a30
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
---
language:
- zh
- en
tags:
- llm
- tts
- zero-shot
- voice-cloning
- reinforcement-learning
- flow-matching
license: mit
pipeline_tag: text-to-speech
---

# GLM-TTS: Controllable & Emotion-Expressive Zero-shot TTS

<div align="center">
<img src=https://raw.githubusercontent.com/zai-org/GLM-TTS/refs/heads/main/assets/images/logo.svg  width="50%"/>
</div>

<p align="center">
    <a href="https://github.com/zai-org/GLM-TTS" target="_blank">πŸ’» GitHub Repository</a>
    &nbsp;&nbsp;|&nbsp;&nbsp;
    <a href="https://audio.z.ai/" target="_blank">πŸ› οΈ Audio.Z.AI</a>
</p>

## πŸ“– Model Introduction

GLM-TTS is a high-quality text-to-speech (TTS) synthesis system based on large language models, supporting zero-shot voice cloning and streaming inference. The system adopts a two-stage architecture combining an LLM for speech token generation and a Flow Matching model for waveform synthesis.

By introducing a **Multi-Reward Reinforcement Learning** framework, GLM-TTS significantly improves the expressiveness of generated speech, achieving more natural emotional control compared to traditional TTS systems.

### Key Features

* **Zero-shot Voice Cloning:** Clone any speaker's voice with just 3-10 seconds of prompt audio.
* **RL-enhanced Emotion Control:** Utilizes a multi-reward reinforcement learning framework (GRPO) to optimize prosody and emotion.
* **High-quality Synthesis:** Generates speech comparable to commercial systems with reduced Character Error Rate (CER).
* **Phoneme-level Control:** Supports "Hybrid Phoneme + Text" input for precise pronunciation control (e.g., polyphones).
* **Streaming Inference:** Supports real-time audio generation suitable for interactive applications.
* **Bilingual Support:** Optimized for Chinese and English mixed text.

## System Architecture

GLM-TTS follows a two-stage design:

1.  **Stage 1 (LLM):** A Llama-based model converts input text into speech token sequences.
2.  **Stage 2 (Flow Matching):** A Flow model converts token sequences into high-quality mel-spectrograms, which are then turned into waveforms by a vocoder.

<div align="center">
  <img src="https://raw.githubusercontent.com/zai-org/GLM-TTS/refs/heads/main/assets/images/architecture.png" width="60%" alt="GLM-TTS Architecture">
</div>

### Reinforcement Learning Alignment
To tackle flat emotional expression, GLM-TTS uses a **Group Relative Policy Optimization (GRPO)** algorithm with multiple reward functions (Similarity, CER, Emotion, Laughter) to align the LLM's generation strategy.

## Evaluation Results

Evaluated on `seed-tts-eval`. **GLM-TTS_RL** achieves the lowest Character Error Rate (CER) while maintaining high speaker similarity.

| Model | CER ↓ | SIM ↑ | Open-source |
| :--- | :---: | :---: | :---: |
| Seed-TTS | 1.12 | **79.6** | πŸ”’ No |
| CosyVoice2 | 1.38 | 75.7 | πŸ‘ Yes |
| F5-TTS | 1.53 | 76.0 | πŸ‘ Yes |
| **GLM-TTS (Base)** | 1.03 | 76.1 | πŸ‘ Yes |
| **GLM-TTS_RL (Ours)** | **0.89** | 76.4 | πŸ‘ Yes |

## Quick Start

### Installation

```bash
git clone [https://github.com/zai-org/GLM-TTS.git](https://github.com/zai-org/GLM-TTS.git)
cd GLM-TTS
pip install -r requirements.txt
```

#### Command Line Inference

```bash
python glmtts_inference.py \
    --data=example_zh \
    --exp_name=_test \
    --use_cache \
    # --phoneme # Add this flag to enable phoneme capabilities.
```

#### Shell Script Inference

```bash
bash glmtts_inference.sh
```

## Acknowledgments & Citation

We thank the following open-source projects for their support:

- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) - Providing frontend processing framework and high-quality vocoder
- [Llama](https://github.com/meta-llama/llama) - Providing basic language model architecture
- [Vocos](https://github.com/charactr-platform/vocos) - Providing high-quality vocoder
- [GRPO-Zero](https://github.com/policy-gradient/GRPO-Zero) - Reinforcement learning algorithm implementation inspiration

If you use GLM-TTS in your research, please cite:

```bibtex
@misc{glmtts2025,
  title={GLM-TTS: Controllable & Emotion-Expressive Zero-shot TTS with Multi-Reward Reinforcement Learning},
  author={CogAudio Group Members},
  year={2025},
  publisher={Zhipu AI Inc}
}