---

library_name: transformers
tags:
- text-to-speech
- automatic-speech-recognition
- voice-conversion
- speech
- audio
pipeline_tag: text-to-speech
language:
- en
- zh
license: apache-2.0
homepage: https://autoark.github.io/GPA/
repository: https://github.com/AutoArk/GPA
---

<div align="center">
  <img src="figures/GPA_intro.png" width="80%" alt="GPA Logo"/>

# GPA: One Model for Speech Recognition, Text-to-Speech, and Voice Conversion

[![GitHub](https://img.shields.io/badge/GitHub-AutoArk%2FGPA-blue?logo=github)](https://github.com/AutoArk/GPA)

</div>

> **TL;DR** GPA unifies three speech tasks in a single model. This repo includes code for training, fine-tuning, and efficient deployment of GPA.

## 📖 Abstract

**GPA** stands for **General Purpose Audio**. 

In academia, a student’s GPA (Grade Point Average) serves as a unified metric that reflects performance across diverse subjects—ranging from Calculus and Philosophy to Gym class.

Similarly, our GPA model unifies the three major pillars of audio tasks, namely Text-to-Speech (TTS), Automatic Speech Recognition (ASR), and Voice Conversion (VC), into a single auto-regressive transformer.
*   The open-source release supports multiple frameworks and provides **production-ready code suitable for cloud deployment.**
*   We include concise **inference examples** and **training pipelines** for research purposes.
*   The released 0.3B model is also well suited for **edge devices**; edge deployment support will be released later.

## 🔍 Model Overview

<div align="center">
  <img src="figures/GPA.png" width="80%" alt="GPA Model Architecture"/>
  <br>
  <div style="text-align: justify; width: 100%; margin: 10px auto; text-indent: 2em;">
    <strong>Figure 1: Architecture of the proposed GPA framework.</strong> The model utilizes a shared Large Language Model (LLM) backbone to unify three core audio tasks: Understanding (ASR), Generation (TTS), and Editing (Voice Conversion). Depending on the task, the model processes different combinations of inputs (Source Audio, Target Text, or Reference Audio) via Semantic and Acoustic modules to generate the corresponding text or audio output.

  </div>

</div>



## ⚡ Model Performance

The following results are obtained by benchmarking services instantiated via [the official deployment scripts](#-deployment), reflecting end-to-end performance in realistic serving scenarios rather than offline inference.

Among currently available open-source systems, **our model is one of the few that natively supports both concurrent and streaming inference, while achieving performance comparable to the first tier of existing approaches.**

> **💡 Note**
>
> * **TTFC**: Time To First Chunk (TTS)
> * **TTFT**: Time To First Token (ASR)
> * **RTF**: Real-Time Factor (synthesis time / audio duration; lower is better)
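The summary statistics reported in the benchmark tables (average/percentile TTFC, RTF) can be reproduced from raw per-request timings. A minimal sketch, assuming illustrative record fields (`ttfc_ms`, `synth_s`, `audio_s`) rather than the actual schema of the official benchmark scripts:

```python
# Sketch: computing TTFC percentiles and RTF from per-request benchmark
# records. Field names are illustrative assumptions, not the official schema.
from statistics import mean

def percentile(values, p):
    """Nearest-rank percentile (p in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

def summarize(records):
    """records: dicts with ttfc_ms, synth_s (synthesis wall time), audio_s."""
    ttfc = [r["ttfc_ms"] for r in records]
    # RTF = synthesis wall time / generated audio duration (lower is better).
    rtf = [r["synth_s"] / r["audio_s"] for r in records]
    return {
        "avg_ttfc_ms": mean(ttfc),
        "p50_ttfc_ms": percentile(ttfc, 50),
        "p99_ttfc_ms": percentile(ttfc, 99),
        "avg_rtf": mean(rtf),
    }

# Three made-up requests, just to exercise the summary:
requests = [
    {"ttfc_ms": 250.0, "synth_s": 1.3, "audio_s": 6.5},
    {"ttfc_ms": 260.0, "synth_s": 1.2, "audio_s": 6.0},
    {"ttfc_ms": 900.0, "synth_s": 3.0, "audio_s": 6.2},
]
stats = summarize(requests)
```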

### TTS Streaming Benchmark (Latency & Throughput)

<div align="center">
  <table>
    <thead>

      <tr>

        <th>Concurrency</th>

        <th>Avg TTFC (ms)</th>

        <th>P50 TTFC (ms)</th>

        <th>P99 TTFC (ms)</th>

        <th>Avg RTF</th>

        <th>P50 RTF</th>

        <th>P99 RTF</th>

        <th>Audio Dur (s)</th>

      </tr>

    </thead>

    <tbody>

      <tr><td>1</td><td>258.8</td><td>258.8</td><td>258.8</td><td>0.197</td><td>0.197</td><td>0.197</td><td>6.44</td></tr>

      <tr><td>5</td><td>385.0</td><td>394.7</td><td>396.2</td><td>0.218</td><td>0.217</td><td>0.248</td><td>6.76</td></tr>

      <tr><td>10</td><td>544.6</td><td>564.2</td><td>566.7</td><td>0.282</td><td>0.301</td><td>0.313</td><td>6.49</td></tr>

      <tr><td>20</td><td>977.8</td><td>977.9</td><td>982.9</td><td>0.470</td><td>0.490</td><td>0.538</td><td>7.19</td></tr>

      <tr><td>40</td><td>1797.0</td><td>1736.4</td><td>2564.5</td><td>0.421</td><td>0.400</td><td>0.587</td><td>6.33</td></tr>

      <tr><td>80</td><td>3786.4</td><td>4054.4</td><td>5415.8</td><td>0.763</td><td>0.763</td><td>1.096</td><td>6.32</td></tr>

      <tr><td>160</td><td>9847.9</td><td>10239.9</td><td>14350.3</td><td>1.718</td><td>1.740</td><td>2.577</td><td>6.44</td></tr>

    </tbody>

  </table>

  <p><strong>Table 2. TTS Streaming Latency (TTFC), RTF, and Audio Duration vs Concurrency</strong></p>

</div>
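One way to read Table 2: since RTF is measured per stream, a server running C concurrent streams at average RTF r emits roughly C / r seconds of audio per wall-clock second. A back-of-the-envelope sketch under that assumption (illustrative arithmetic, not an official throughput metric):

```python
# Rough aggregate-throughput estimate from per-stream RTF: each stream
# produces 1 / RTF audio-seconds per wall-clock second, so C concurrent
# streams produce about C / RTF. Illustrative only.
def aggregate_audio_rate(concurrency: int, avg_rtf: float) -> float:
    return concurrency / avg_rtf

# Reading two rows of Table 2:
rate_c20 = aggregate_audio_rate(20, 0.470)    # concurrency 20
rate_c160 = aggregate_audio_rate(160, 1.718)  # concurrency 160
```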


### ASR Streaming Benchmark

<div align="center">
  <table>
    <thead>

      <tr>

        <th>Concurrency</th>

        <th>Avg TTFT (ms)</th>

        <th>P50 TTFT (ms)</th>

        <th>P99 TTFT (ms)</th>

        <th>Avg Total (ms)</th>

      </tr>

    </thead>

    <tbody>

      <tr><td>1</td><td>157.5</td><td>157.5</td><td>157.5</td><td>190.9</td></tr>

      <tr><td>5</td><td>394.1</td><td>393.7</td><td>395.9</td><td>400.0</td></tr>

      <tr><td>10</td><td>589.6</td><td>721.3</td><td>723.3</td><td>598.1</td></tr>

      <tr><td>20</td><td>1316.3</td><td>1495.6</td><td>1500.4</td><td>1317.8</td></tr>

      <tr><td>40</td><td>2690.9</td><td>2678.3</td><td>2861.4</td><td>2693.7</td></tr>

      <tr><td>80</td><td>3833.4</td><td>3961.3</td><td>4027.0</td><td>3845.1</td></tr>

      <tr><td>160</td><td>5037.0</td><td>5689.3</td><td>6676.0</td><td>5044.0</td></tr>

    </tbody>

  </table>

  <p><strong>Table 3. ASR Streaming Latency vs Concurrency</strong></p>

</div>


## 📊 Evaluation Metric Results

### TTS Evaluation Table

| Model | Open-Source | Model Size | test-zh CER (%) ↓ | test-zh Sim (%) ↑ | test-en WER (%) ↓ | test-en Sim (%) ↑ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| **Multi-Stage or NAR Methods** | | | | | | |
| Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 |
| Seed-TTS | ❌ | - | 1.12 | **79.6** | 2.25 | **76.2** |
| MiniMax-Speech | ❌ | - | 0.83 | 78.3 | 1.65 | 69.2 |
| F5-TTS | ✅ | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 |
| CosyVoice2 | ✅ | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 |
| FireRedTTS2 | ✅ | 1.5B | 1.14 | 73.2 | 1.95 | 66.5 |
| Index-TTS2 | ✅ | 1.5B | 1.03 | 76.5 | 2.23 | 70.6 |
| VibeVoice-1.5B | ✅ | 1.5B | 1.16 | 74.4 | 3.04 | 68.9 |
| VibeVoice-Realtime | ✅ | 0.5B | - | - | 2.05 | 63.3 |
| HiggsAudio-v2 | ✅ | 3B | 1.50 | 74.0 | 2.44 | 67.7 |
| VoxCPM | ✅ | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 |
| GLM-TTS | ✅ | 1.5B | 1.03 | 76.1 | - | - |
| GLM-TTS RL | ✅ | 1.5B | 0.89 | 76.4 | - | - |
| Fun-CosyVoice3-0.5B-2512 | ✅ | 0.5B | 1.21 | 78.0 | 2.24 | 71.8 |
| Fun-CosyVoice3-0.5B-2512_RL | ✅ | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 |
| **One-Stage AR Methods** | | | | | | |
| Spark TTS | ✅ | 0.5B | 1.20 | 66.0 | 1.98 | 57.3 |
| GPA-0.3B-preview | ✅ | 0.3B | **0.95** | 65.9 | **1.51** | 56.5 |



### ASR Evaluation Table



**Note:** ASR results on Librispeech and Aishell-1. WER (%) is reported for Librispeech, and CER (%) is reported for Aishell-1.



| Model | Model Size | Librispeech test-clean | Aishell-1 |
| :--- | :---: | :---: | :---: |
| **Models with < 0.5B parameters** | | | |
| Whisper-S | 0.24B | 3.13 | - |
| GPA-0.3B-preview | 0.3B | 8.88 | 4.50 |
| **Models with > 0.5B parameters** | | | |
| Fun-ASR-nano | 0.8B | 1.76 | 1.80 |
| FireRed-ASR | 1.1B | 1.84 | 0.54 |
| GLM-ASR-nano | 1.5B | 2.00 | 1.81 |
| GLM-ASR-nano* | 1.5B | 2.17 | 2.17 |
| Whisper-L | 1.55B | 1.82 | 4.72 |
| Kimi-Audio | - | 1.32 | 0.71 |
| Step-Audio2 | - | 1.17 | 0.63 |
| Seed-ASR | - | 1.58 | 0.68 |
| Seed-ASR* | - | 2.80 | 1.63 |
| Fun-ASR | 7.7B | 1.51 | 1.22 |
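For reference, the WER/CER numbers above follow the standard definition: Levenshtein edit distance between reference and hypothesis transcripts, normalized by reference length, over words for WER (Librispeech) and over characters for CER (Aishell-1). A minimal sketch of that definition (not the official scoring scripts, which also apply text normalization):

```python
# Sketch of standard WER/CER: Levenshtein edit distance between reference
# and hypothesis, normalized by reference length.
def edit_distance(ref, hyp):
    """Levenshtein distance over two token sequences (two-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def wer(ref_text, hyp_text):
    """Word error rate over whitespace-split tokens."""
    ref, hyp = ref_text.split(), hyp_text.split()
    return edit_distance(ref, hyp) / len(ref)

def cer(ref_text, hyp_text):
    """Character error rate, ignoring spaces."""
    ref = ref_text.replace(" ", "")
    hyp = hyp_text.replace(" ", "")
    return edit_distance(list(ref), list(hyp)) / len(ref)
```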



## 🙏 Acknowledgements



We borrowed a lot of code from the following excellent projects:



- [Spark-TTS](https://github.com/SparkAudio/Spark-TTS)
- [GLM-4-Voice](https://github.com/zai-org/GLM-4-Voice/tree/main/speech_tokenizer)
- [Emilia](https://github.com/open-mmlab/Amphion/tree/main/preprocessors/Emilia)
- [FlashTTS](https://github.com/HuiResearch/FlashTTS/tree/master/flashtts)
- [Qwen](https://github.com/QwenLM/Qwen)



## 🔗 Citation



If you find GPA useful for your research or projects, please cite us:



```bibtex
@misc{cai2026unifyingspeechrecognitionsynthesis,
      title={Unifying Speech Recognition, Synthesis and Conversion with Autoregressive Transformers},
      author={Runyuan Cai and Yu Lin and Yiming Wang and Chunlin Fu and Xiaodong Zeng},
      year={2026},
      eprint={2601.10770},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2601.10770},
}
```