---
library_name: transformers
tags:
- text-to-speech
- automatic-speech-recognition
- voice-conversion
- speech
- audio
pipeline_tag: text-to-speech
language:
- en
- zh
license: apache-2.0
base_model:
- Qwen/Qwen3-0.6B
homepage: https://autoark.github.io/GPA/
repository: https://github.com/AutoArk/GPA
---

<div align="center">
  <img src="figures/GPA.png" width="80%" alt="GPA Logo"/>

# GPA v1.5: One Model for Speech Recognition, Text-to-Speech, and Voice Conversion

[![ArXiv](https://img.shields.io/badge/ArXiv-2601.10770-b31b1b?logo=arxiv)](https://arxiv.org/abs/2601.10770)
[![GitHub](https://img.shields.io/badge/GitHub-AutoArk%2FGPA-blue?logo=github)](https://github.com/AutoArk/GPA)
[![Demo](https://img.shields.io/badge/Demo-GitHub%20Pages-blue?logo=github)](https://autoark.github.io/GPA/)
[![ONNX Runtime Assets](https://img.shields.io/badge/ONNX%20Runtime-GPA--v1.5--onnx--runtime-yellow)](https://huggingface.co/AutoArk-AI/GPA-v1.5-onnx-runtime)

</div>

> **TL;DR** This is the main Hugging Face checkpoint repo for **GPA v1.5**. Use it for native PyTorch / Hugging Face inference and fine-tuning. Runtime-optimized ONNX assets are published separately at [AutoArk-AI/GPA-v1.5-onnx-runtime](https://huggingface.co/AutoArk-AI/GPA-v1.5-onnx-runtime).

## What Is GPA v1.5?

**GPA** stands for **General Purpose Audio**.

GPA v1.5 is a unified autoregressive audio-language model for speech understanding and generation. The v1.5 release currently covers:

- **ASR**: automatic speech recognition.
- **TTS**: text-to-speech with reference voice conditioning.
- **Training / fine-tuning**: native Hugging Face `Trainer` workflow.
- **Deployment path**: ONNX runtime assets and service code for local CLI, FastAPI, and browser UI testing.

Voice conversion support in the native v1.5 path is on the roadmap.

<div align="center">
  <img src="figures/GPA_v1.5.jpeg" width="86%" alt="GPA v1.5 unified speech model overview"/>
  <br>
  <sub>GPA unifies speech understanding and generation in a single autoregressive audio-language model.</sub>
</div>

## Hugging Face and GitHub Mapping

This Hugging Face repo stores the large checkpoint assets. The code, examples, and docs live in the GitHub repo:

| Goal | GitHub Entry Point | Hugging Face Assets |
| :--- | :--- | :--- |
| Native PyTorch / Hugging Face inference | [`GPA_1.5/docs/infer.md`](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/docs/infer.md), [`GPA_1.5/infer.py`](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/infer.py) | This repo: `AutoArk-AI/GPA-v1.5` |
| Fine-tuning / continued training | [`GPA_1.5/docs/train.md`](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/docs/train.md), [`GPA_1.5/train.py`](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/train.py) | This repo: `AutoArk-AI/GPA-v1.5` |
| ONNX CLI / FastAPI / browser UI runtime | [`GPA_1.5/onnx_runtime/README.md`](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/onnx_runtime/README.md) | [`AutoArk-AI/GPA-v1.5-onnx-runtime`](https://huggingface.co/AutoArk-AI/GPA-v1.5-onnx-runtime) |

## Recommended Local Layout

To minimize configuration, keep the code checkout (`GPA-v1.5/`) and the checkpoint directory (`GPA-v1.5-HF/`) side by side:

```text
GPA-v1.5/
GPA-v1.5-HF/
  GPA-v1.5/
    spark_tokenizer_model/
  GPA-v1.5-onnx-runtime/
```

What each path is used for:

- `GPA-v1.5-HF/GPA-v1.5`: checkpoint for native PyTorch training and inference.
- `GPA-v1.5-HF/GPA-v1.5/spark_tokenizer_model`: Spark tokenizer assets used by native TTS.
- `GPA-v1.5-HF/GPA-v1.5-onnx-runtime`: ONNX CLI / service / browser UI asset bundle.

With this layout, the native inference, training, and ONNX smoke tests can run without editing source paths.
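Before running anything, it can help to confirm that the directories listed above are actually in place. The following is a minimal sketch using only the standard library; the helper name `missing_dirs` is hypothetical and not part of the GPA codebase, and it checks only that the directories exist, not their contents.

```python
from pathlib import Path

# Relative paths taken from the layout shown above.
EXPECTED_DIRS = (
    "GPA-v1.5",
    "GPA-v1.5-HF/GPA-v1.5",
    "GPA-v1.5-HF/GPA-v1.5/spark_tokenizer_model",
    "GPA-v1.5-HF/GPA-v1.5-onnx-runtime",
)


def missing_dirs(root="."):
    """Return the expected directories that do not exist under `root`.

    Hypothetical helper for illustration; not part of the GPA repo.
    """
    base = Path(root)
    return [rel for rel in EXPECTED_DIRS if not (base / rel).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs()
    if missing:
        print("Missing directories:")
        for rel in missing:
            print("  " + rel)
    else:
        print("Layout looks complete.")
```

Run it from the parent directory that holds both `GPA-v1.5/` and `GPA-v1.5-HF/`.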

## Download

```bash
git clone https://github.com/AutoArk/GPA.git GPA-v1.5
mkdir -p GPA-v1.5-HF

huggingface-cli download AutoArk-AI/GPA-v1.5 \
  --local-dir GPA-v1.5-HF/GPA-v1.5

huggingface-cli download AutoArk-AI/GPA-v1.5-onnx-runtime \
  --local-dir GPA-v1.5-HF/GPA-v1.5-onnx-runtime
```
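If you prefer to fetch the checkpoints from Python rather than the CLI, the two `huggingface-cli download` commands above can be approximated with `snapshot_download` from the `huggingface_hub` package. This is a sketch under that assumption; the helper names `download_plan` and `download_all` are hypothetical, and the GitHub clone step is not covered.

```python
from pathlib import Path

# Repo ids and target directory names mirror the CLI commands above.
CHECKPOINT_REPOS = {
    "AutoArk-AI/GPA-v1.5": "GPA-v1.5",
    "AutoArk-AI/GPA-v1.5-onnx-runtime": "GPA-v1.5-onnx-runtime",
}


def download_plan(hf_root="GPA-v1.5-HF"):
    """Map each Hugging Face repo id to its local target directory."""
    return {repo: Path(hf_root) / name for repo, name in CHECKPOINT_REPOS.items()}


def download_all(hf_root="GPA-v1.5-HF"):
    """Download both repos; requires `pip install huggingface_hub`."""
    # Imported lazily so that download_plan() works without the package.
    from huggingface_hub import snapshot_download

    for repo_id, local_dir in download_plan(hf_root).items():
        snapshot_download(repo_id=repo_id, local_dir=str(local_dir))


if __name__ == "__main__":
    download_all()
```

Keeping the plan separate from the download call makes it easy to print or adjust the target paths before any network traffic happens.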

## Where To Start

- **Fine-tuning / continued training**: [GPA_1.5/docs/train.md](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/docs/train.md)
- **Native PyTorch inference**: [GPA_1.5/docs/infer.md](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/docs/infer.md)
- **ONNX runtime deployment**: [GPA_1.5/onnx_runtime/README.md](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/onnx_runtime/README.md)

## GPA v1.5 Release Overview

| | GPA v1.5 |
| :--- | :--- |
| Checkpoint | Open-sourced on Hugging Face |
| Native inference | Direct PyTorch / Hugging Face execution for ASR and TTS |
| Native training | Fine-tuning and continued training with Hugging Face `Trainer` |
| ONNX runtime | CLI inference, FastAPI service, browser UI, voice registration, and runtime validation |
| Planned | Voice conversion support in the native v1.5 path |

## Evaluation Results

### TTS Evaluation

| Model | Open-Source | Model Size | test-zh CER (%) ↓ | test-zh Sim (%) ↑ | test-en WER (%) ↓ | test-en Sim (%) ↑ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 |
| Seed-TTS | No | - | 1.12 | **79.6** | 2.25 | **76.2** |
| MiniMax-Speech | No | - | 0.83 | 78.3 | 1.65 | 69.2 |
| F5-TTS | Yes | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 |
| CosyVoice2 | Yes | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 |
| FireRedTTS2 | Yes | 1.5B | 1.14 | 73.2 | 1.95 | 66.5 |
| Index-TTS2 | Yes | 1.5B | 1.03 | 76.5 | 2.23 | 70.6 |
| VibeVoice-1.5B | Yes | 1.5B | 1.16 | 74.4 | 3.04 | 68.9 |
| VoxCPM | Yes | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 |
| Fun-CosyVoice3-0.5B-2512_RL | Yes | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 |
| Spark TTS | Yes | 0.5B | 1.20 | 66.0 | 1.98 | 57.3 |
| **GPA-v1.5** | **Yes** | **0.6B** | **1.03** | **70.2** | **1.43** | **63.5** |

### ASR Evaluation

WER (%) is reported for LibriSpeech. CER (%) is reported for AISHELL-1.

| Model | Model Size | LibriSpeech test-clean | LibriSpeech test-other | AISHELL-1 | test_Meeting | test_Net |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| Whisper-S | 0.24B | 3.43 | 7.63 | - | - | - |
| **GPA-v1.5** | **0.6B** | **2.78** | **5.02** | **2.83** | **7.40** | **6.49** |
| Fun-ASR-nano | 0.8B | 1.76 | 4.33 | 1.80 | 6.60 | 6.01 |
| FireRed-ASR | 1.1B | 1.84 | 4.52 | 0.54 | 4.95 | 4.94 |
| GLM-ASR-nano | 1.5B | 2.00 | 4.19 | 1.81 | 6.73 | - |
| Whisper-L | 1.55B | 1.86 | 3.43 | 4.72 | 18.39 | 11.89 |
| Kimi-Audio | - | 1.32 | 2.63 | 0.71 | 6.24 | 6.45 |
| Step-Audio2 | - | 1.17 | 2.42 | 0.63 | 4.75 | 4.67 |
| Seed-ASR | - | 1.58 | 2.84 | 0.68 | 5.69 | 4.66 |
| Fun-ASR | 7.7B | 1.51 | 3.03 | 1.22 | 6.17 | 5.46 |
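
For readers unfamiliar with the metrics, WER and CER are both edit-distance error rates: the number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length. The sketch below illustrates the definition only. The text normalization and scoring toolkit used to produce the tables above are not stated here, so the whitespace tokenization and space stripping in this example are assumptions, and its output should not be expected to reproduce the reported numbers.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance divided by reference length, as a percentage."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(ref, hyp):
    """Word error rate; assumes whitespace-separated words."""
    return error_rate(ref.split(), hyp.split())


def cer(ref, hyp):
    """Character error rate; assumes spaces are ignored."""
    return error_rate(list(ref.replace(" ", "")), list(hyp.replace(" ", "")))


if __name__ == "__main__":
    print(wer("the cat sat", "the cat sit"))
    print(cer("你好世界", "你好世间"))
```

Because the denominator is the reference length, either metric can exceed 100% when the hypothesis contains many insertions.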

## License

This model is released under the Apache 2.0 license.

## Citation

If you find GPA useful for your research or projects, please cite us:

```bibtex
@misc{cai2026unifyingspeechrecognitionsynthesis,
      title={Unifying Speech Recognition, Synthesis and Conversion with Autoregressive Transformers},
      author={Runyuan Cai and Yu Lin and Yiming Wang and Chunlin Fu and Xiaodong Zeng},
      year={2026},
      eprint={2601.10770},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2601.10770},
}
```