---
library_name: transformers
tags:
- text-to-speech
- automatic-speech-recognition
- voice-conversion
- speech
- audio
pipeline_tag: text-to-speech
language:
- en
- zh
license: apache-2.0
homepage: https://autoark.github.io/GPA/
repository: https://github.com/AutoArk/GPA
---
<div align="center">
<img src="figures/GPA_intro.png" width="80%" alt="GPA Logo"/>
# GPA: One Model for Speech Recognition, Text-to-Speech, and Voice Conversion
[GitHub](https://github.com/AutoArk/GPA)
</div>
> **TL;DR** GPA incorporates three speech tasks into a single model, and this repo includes code for training, fine-tuning, and efficient deployment of GPA.
## 📖 Abstract
**GPA** stands for **General Purpose Audio**.
In academia, a student’s GPA (Grade Point Average) serves as a unified metric that reflects performance across diverse subjects—ranging from Calculus and Philosophy to Gym class.
Similarly, our GPA model unifies the three major pillars of audio tasks—Text-to-Speech (TTS), Automatic Speech Recognition (ASR), and Voice Conversion (VC)—into a single autoregressive transformer.
* Our open-source release supports multiple frameworks and provides **production-ready code suitable for cloud deployment.**
* We include concise **inference examples** (see the sketch below) and **training pipelines** for research purposes.
* The released 0.3B model is also well suited to **edge devices**; edge deployment support will be released.
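As a taste of the unified interface, here is a minimal inference sketch using the transformers API. The repo id, the `<|tts|>` task tag, and the token-to-waveform decoding step are illustrative assumptions, not the official usage; consult the repository for the exact API.

```python
# Hypothetical quick-start sketch. The repo id, prompt format, and audio
# decoding step are assumptions for illustration, not the official API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AutoArk/GPA-0.3B-preview"  # assumed Hub repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).eval()

# TTS: the backbone autoregressively emits acoustic tokens for the target text.
inputs = tokenizer("<|tts|>Hello from GPA.", return_tensors="pt")  # assumed task tag
with torch.no_grad():
    audio_tokens = model.generate(**inputs, max_new_tokens=1024)
# Turning audio tokens back into a waveform requires the released codec/vocoder;
# see https://github.com/AutoArk/GPA for the end-to-end pipeline.
```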
## 🔍 Model Overview
<div align="center">
<img src="figures/GPA.png" width="80%" alt="GPA Model Architecture"/>
<br>
<div style="text-align: justify; width: 100%; margin: 10px auto; text-indent: 2em;">
<strong>Figure 1: Architecture of the proposed GPA framework.</strong> The model utilizes a shared Large Language Model (LLM) backbone to unify three core audio tasks: Understanding (ASR), Generation (TTS), and Editing (Voice Conversion). Depending on the task, the model processes different combinations of inputs (Source Audio, Target Text, or Reference Audio) via Semantic and Acoustic modules to generate the corresponding text or audio output.
</div>
</div>
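To make Figure 1's task conditioning concrete, the sketch below summarizes how each task combines the same three optional inputs. The field names are ours, chosen for illustration rather than taken from the model's API.

```python
# Illustrative only: field names are ours, not the model's actual API.
# Each task is a different combination of the same three optional inputs
# fed to the shared LLM backbone.
from dataclasses import dataclass
from typing import Optional

@dataclass
class GPARequest:
    source_audio: Optional[str] = None     # speech to transcribe or convert
    target_text: Optional[str] = None      # text to synthesize
    reference_audio: Optional[str] = None  # voice/timbre reference

asr = GPARequest(source_audio="utt.wav")                            # audio -> text
tts = GPARequest(target_text="Hi!", reference_audio="ref.wav")      # text -> audio
vc = GPARequest(source_audio="utt.wav", reference_audio="ref.wav")  # audio -> audio
```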
## ⚡ Model Performance
The following results are obtained by benchmarking services instantiated via [the official deployment scripts](#-deployment), reflecting end-to-end performance in realistic serving scenarios rather than offline inference.
Among currently available open-source systems, **our model is one of the few that natively supports both concurrent and streaming inference, while achieving performance comparable to the first tier of existing approaches.**
> **💡Note**
>
> * **TTFC**: Time To First Chunk (TTS)
> * **TTFT**: Time To First Token (ASR)
> * **RTF**: Real-Time Factor (synthesis time / audio duration; lower is better)
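For reference, these metrics can be computed from wall-clock timestamps as in the sketch below (our own illustration, not the repo's benchmarking script).

```python
# Illustration of the metric definitions above, not the repo's benchmark code.
import statistics

def ttfc_ms(t_request: float, t_first_chunk: float) -> float:
    """Time To First Chunk: request sent -> first audio chunk received."""
    return (t_first_chunk - t_request) * 1000.0

def rtf(synthesis_s: float, audio_dur_s: float) -> float:
    """Real-Time Factor: synthesis time / audio duration (< 1 is faster than real time)."""
    return synthesis_s / audio_dur_s

# Percentiles over many requests, as reported in the tables below:
samples_ms = [258.8, 385.0, 544.6]  # example TTFC samples
p50 = statistics.median(samples_ms)
p99 = statistics.quantiles(samples_ms, n=100)[98]
```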
### TTS Streaming Benchmark (Latency & Throughput)
<div align="center">
<table>
<thead>
<tr>
<th>Concurrency</th>
<th>Avg TTFC (ms)</th>
<th>P50 TTFC (ms)</th>
<th>P99 TTFC (ms)</th>
<th>Avg RTF</th>
<th>P50 RTF</th>
<th>P99 RTF</th>
<th>Audio Dur (s)</th>
</tr>
</thead>
<tbody>
<tr><td>1</td><td>258.8</td><td>258.8</td><td>258.8</td><td>0.197</td><td>0.197</td><td>0.197</td><td>6.44</td></tr>
<tr><td>5</td><td>385.0</td><td>394.7</td><td>396.2</td><td>0.218</td><td>0.217</td><td>0.248</td><td>6.76</td></tr>
<tr><td>10</td><td>544.6</td><td>564.2</td><td>566.7</td><td>0.282</td><td>0.301</td><td>0.313</td><td>6.49</td></tr>
<tr><td>20</td><td>977.8</td><td>977.9</td><td>982.9</td><td>0.470</td><td>0.490</td><td>0.538</td><td>7.19</td></tr>
<tr><td>40</td><td>1797.0</td><td>1736.4</td><td>2564.5</td><td>0.421</td><td>0.400</td><td>0.587</td><td>6.33</td></tr>
<tr><td>80</td><td>3786.4</td><td>4054.4</td><td>5415.8</td><td>0.763</td><td>0.763</td><td>1.096</td><td>6.32</td></tr>
<tr><td>160</td><td>9847.9</td><td>10239.9</td><td>14350.3</td><td>1.718</td><td>1.740</td><td>2.577</td><td>6.44</td></tr>
</tbody>
</table>
<p><strong>Table 1. TTS Streaming Latency (TTFC), RTF, and Audio Duration vs. Concurrency</strong></p>
</div>
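Reproducing numbers like these requires a concurrent streaming client. Below is a minimal asyncio sketch that measures TTFC under a given concurrency level; the endpoint URL and request schema are placeholders and should be adapted to the service started by the official deployment scripts.

```python
# Minimal concurrent TTFC probe. The endpoint URL and JSON schema are
# placeholders; adapt them to the service started by the deployment scripts.
import asyncio
import statistics
import time

import aiohttp

URL = "http://localhost:8000/tts/stream"  # placeholder endpoint

async def one_request(session: aiohttp.ClientSession, text: str) -> float:
    t0 = time.perf_counter()
    async with session.post(URL, json={"text": text}) as resp:
        async for _chunk in resp.content.iter_chunked(4096):
            return (time.perf_counter() - t0) * 1000.0  # first chunk -> TTFC (ms)
    return float("nan")  # stream ended without audio

async def bench(concurrency: int) -> None:
    async with aiohttp.ClientSession() as session:
        ttfcs = await asyncio.gather(
            *(one_request(session, "Hello from GPA.") for _ in range(concurrency))
        )
    print(f"c={concurrency}  avg={statistics.mean(ttfcs):.1f} ms  "
          f"p50={statistics.median(ttfcs):.1f} ms")

asyncio.run(bench(10))
```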
### ASR Streaming Benchmark
<div align="center">
<table>
<thead>
<tr>
<th>Concurrency</th>
<th>Avg TTFT (ms)</th>
<th>P50 TTFT (ms)</th>
<th>P99 TTFT (ms)</th>
<th>Avg Total (ms)</th>
</tr>
</thead>
<tbody>
<tr><td>1</td><td>157.5</td><td>157.5</td><td>157.5</td><td>190.9</td></tr>
<tr><td>5</td><td>394.1</td><td>393.7</td><td>395.9</td><td>400.0</td></tr>
<tr><td>10</td><td>589.6</td><td>721.3</td><td>723.3</td><td>598.1</td></tr>
<tr><td>20</td><td>1316.3</td><td>1495.6</td><td>1500.4</td><td>1317.8</td></tr>
<tr><td>40</td><td>2690.9</td><td>2678.3</td><td>2861.4</td><td>2693.7</td></tr>
<tr><td>80</td><td>3833.4</td><td>3961.3</td><td>4027.0</td><td>3845.1</td></tr>
<tr><td>160</td><td>5037.0</td><td>5689.3</td><td>6676.0</td><td>5044.0</td></tr>
</tbody>
</table>
<p><strong>Table 2. ASR Streaming Latency (TTFT) vs. Concurrency</strong></p>
</div>
## 📊 Evaluation Metric Results
### TTS Evaluation Table
| Model | Open-Source | Model Size | test-zh CER (%) ↓ | test-zh Sim (%) ↑ | test-en WER (%) ↓ | test-en Sim (%) ↑ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| **Multi-Stage or NAR Methods** | | | | | | |
| Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 |
| Seed-TTS | ❌ | - | 1.12 | **79.6** | 2.25 | **76.2** |
| MiniMax-Speech | ❌ | - | 0.83 | 78.3 | 1.65 | 69.2 |
| F5-TTS | ✅ | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 |
| CosyVoice2 | ✅ | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 |
| FireRedTTS2 | ✅ | 1.5B | 1.14 | 73.2 | 1.95 | 66.5 |
| Index-TTS2 | ✅ | 1.5B | 1.03 | 76.5 | 2.23 | 70.6 |
| VibeVoice-1.5B | ✅ | 1.5B | 1.16 | 74.4 | 3.04 | 68.9 |
| VibeVoice-Realtime | ✅ | 0.5B | - | - | 2.05 | 63.3 |
| HiggsAudio-v2 | ✅ | 3B | 1.50 | 74.0 | 2.44 | 67.7 |
| VoxCPM | ✅ | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 |
| GLM-TTS | ✅ | 1.5B | 1.03 | 76.1 | - | - |
| GLM-TTS RL | ✅ | 1.5B | 0.89 | 76.4 | - | - |
| Fun-CosyVoice3-0.5B-2512 | ✅ | 0.5B | 1.21 | 78.0 | 2.24 | 71.8 |
| Fun-CosyVoice3-0.5B-2512_RL | ✅ | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 |
| **One-Stage AR Methods** | | | | | | |
| Spark TTS | ✅ | 0.5B | 1.20 | 66.0 | 1.98 | 57.3 |
| GPA-0.3B-preview | ✅ | 0.3B | **0.95** | 65.9 | **1.51** | 56.5 |
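The CER/WER columns can be reproduced with standard tooling such as jiwer, as sketched below. The Sim columns are typically cosine similarity between speaker embeddings, which requires a separate speaker encoder and is omitted here.

```python
# Sketch of the intelligibility metrics (pip install jiwer). Speaker
# similarity ("Sim") is omitted: it is typically the cosine similarity
# between speaker embeddings of the reference and the synthesized audio.
import jiwer

ref = "the quick brown fox"
hyp = "the quick browne fox"

print(f"WER: {jiwer.wer(ref, hyp):.3f}")  # word error rate (test-en)
print(f"CER: {jiwer.cer(ref, hyp):.3f}")  # character error rate (test-zh)
```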
### ASR Evaluation Table
**Note:** ASR results on LibriSpeech and AISHELL-1. WER (%) is reported for LibriSpeech, and CER (%) is reported for AISHELL-1.
| Model | Model Size | LibriSpeech test-clean | AISHELL-1 |
| :--- | :---: | :---: | :---: |
| **Models with < 0.5B parameters** | | | |
| Whisper-S | 0.24B | 3.13 | - |
| GPA-0.3B-preview | 0.3B | 8.88 | 4.50 |
| **Models with > 0.5B parameters** | | | |
| Fun-ASR-nano | 0.8B | 1.76 | 1.80 |
| FireRed-ASR | 1.1B | 1.84 | 0.54 |
| GLM-ASR-nano | 1.5B | 2.00 | 1.81 |
| GLM-ASR-nano* | 1.5B | 2.17 | 2.17 |
| Whisper-L | 1.55B | 1.82 | 4.72 |
| Kimi-Audio | - | 1.32 | 0.71 |
| Step-Audio2 | - | 1.17 | 0.63 |
| Seed-ASR | - | 1.58 | 0.68 |
| Seed-ASR* | - | 2.80 | 1.63 |
| Fun-ASR | 7.7B | 1.51 | 1.22 |
## 🙏 Acknowledgements
Our code borrows heavily from the following excellent projects:
- [Spark-TTS](https://github.com/SparkAudio/Spark-TTS)
- [GLM-4-Voice](https://github.com/zai-org/GLM-4-Voice/tree/main/speech_tokenizer)
- [Emilia](https://github.com/open-mmlab/Amphion/tree/main/preprocessors/Emilia)
- [FlashTTS](https://github.com/HuiResearch/FlashTTS/tree/master/flashtts)
- [Qwen](https://github.com/QwenLM/Qwen)
## 🔗 Citation
If you find GPA useful for your research or projects, please cite us:
```bibtex
@misc{cai2026unifyingspeechrecognitionsynthesis,
title={Unifying Speech Recognition, Synthesis and Conversion with Autoregressive Transformers},
author={Runyuan Cai and Yu Lin and Yiming Wang and Chunlin Fu and Xiaodong Zeng},
year={2026},
eprint={2601.10770},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2601.10770},
}
``` |