---
library_name: transformers
tags:
- text-to-speech
- automatic-speech-recognition
- voice-conversion
- speech
- audio
pipeline_tag: text-to-speech
language:
- en
- zh
license: apache-2.0
base_model:
- Qwen/Qwen3-0.6B
homepage: https://autoark.github.io/GPA/
repository: https://github.com/AutoArk/GPA
---
<div align="center">
<img src="figures/GPA.png" width="80%" alt="GPA Logo"/>
# GPA v1.5: One Model for Speech Recognition, Text-to-Speech, and Voice Conversion
[Paper (arXiv)](https://arxiv.org/abs/2601.10770)
[GitHub](https://github.com/AutoArk/GPA)
[Project Page](https://autoark.github.io/GPA/)
[ONNX Runtime Assets](https://huggingface.co/AutoArk-AI/GPA-v1.5-onnx-runtime)
</div>
> **TL;DR** This is the main Hugging Face checkpoint repo for **GPA v1.5**. Use it for native PyTorch / Hugging Face inference and fine-tuning. Runtime-optimized ONNX assets are published separately at [AutoArk-AI/GPA-v1.5-onnx-runtime](https://huggingface.co/AutoArk-AI/GPA-v1.5-onnx-runtime).
## What Is GPA v1.5?
**GPA** stands for **General Purpose Audio**.
GPA v1.5 is a unified autoregressive audio-language model for speech understanding and generation. It currently supports:
- **ASR**: automatic speech recognition.
- **TTS**: text-to-speech with reference voice conditioning.
- **Training / fine-tuning**: native Hugging Face `Trainer` workflow.
- **Deployment path**: ONNX runtime assets and service code for local CLI, FastAPI, and browser UI testing.
Voice conversion support in the native v1.5 path is on the roadmap.
<div align="center">
<img src="figures/GPA_v1.5.jpeg" width="86%" alt="GPA v1.5 unified speech model overview"/>
<br>
<sub>GPA unifies speech understanding and generation in a single autoregressive audio-language model.</sub>
</div>
## Hugging Face and GitHub Mapping
This Hugging Face repo stores the large checkpoint assets. The code, examples, and docs live in the GitHub repo:
| Goal | GitHub Entry Point | Hugging Face Assets |
| :--- | :--- | :--- |
| Native PyTorch / Hugging Face inference | [`GPA_1.5/docs/infer.md`](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/docs/infer.md), [`GPA_1.5/infer.py`](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/infer.py) | This repo: `AutoArk-AI/GPA-v1.5` |
| Fine-tuning / continued training | [`GPA_1.5/docs/train.md`](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/docs/train.md), [`GPA_1.5/train.py`](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/train.py) | This repo: `AutoArk-AI/GPA-v1.5` |
| ONNX CLI / FastAPI / browser UI runtime | [`GPA_1.5/onnx_runtime/README.md`](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/onnx_runtime/README.md) | [`AutoArk-AI/GPA-v1.5-onnx-runtime`](https://huggingface.co/AutoArk-AI/GPA-v1.5-onnx-runtime) |
## Recommended Local Layout
To minimize configuration, keep the code checkout (`GPA-v1.5/`) and the downloaded checkpoint repos (`GPA-v1.5-HF/`) side by side:
```text
GPA-v1.5/
GPA-v1.5-HF/
  GPA-v1.5/
    spark_tokenizer_model/
  GPA-v1.5-onnx-runtime/
```
What each path is used for:
- `GPA-v1.5-HF/GPA-v1.5`: native PyTorch train / inference checkpoint.
- `GPA-v1.5-HF/GPA-v1.5/spark_tokenizer_model`: Spark tokenizer assets used by native TTS.
- `GPA-v1.5-HF/GPA-v1.5-onnx-runtime`: ONNX CLI / service / browser UI asset bundle.
With this layout, the native inference, training, and ONNX smoke tests can run without editing source paths.
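Before running the smoke tests, it can help to confirm that the directories are where the layout above expects them. The following is a minimal sketch, not part of the GPA repo: the helper name `missing_layout_dirs` and the `EXPECTED_DIRS` tuple are hypothetical, and the paths are copied from the layout described above.

```python
from pathlib import Path

# Paths copied from the recommended layout, relative to a common parent directory.
EXPECTED_DIRS = (
    "GPA-v1.5",                                    # GitHub code checkout
    "GPA-v1.5-HF/GPA-v1.5",                        # native PyTorch checkpoint
    "GPA-v1.5-HF/GPA-v1.5/spark_tokenizer_model",  # Spark tokenizer assets for native TTS
    "GPA-v1.5-HF/GPA-v1.5-onnx-runtime",           # ONNX runtime asset bundle
)


def missing_layout_dirs(parent: Path) -> list[str]:
    """Return the expected directories that do not exist under `parent`."""
    return [rel for rel in EXPECTED_DIRS if not (parent / rel).is_dir()]


if __name__ == "__main__":
    missing = missing_layout_dirs(Path.cwd())
    if missing:
        print("Missing directories:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("Layout looks complete.")
```

Run it from the directory that contains both `GPA-v1.5/` and `GPA-v1.5-HF/`. The check only verifies that the directories exist, not that the downloads inside them are complete.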
## Download
```bash
# Clone the code repo into GPA-v1.5/
git clone https://github.com/AutoArk/GPA.git GPA-v1.5

# Download the native checkpoint and the ONNX runtime bundle into GPA-v1.5-HF/
mkdir -p GPA-v1.5-HF
huggingface-cli download AutoArk-AI/GPA-v1.5 \
  --local-dir GPA-v1.5-HF/GPA-v1.5
huggingface-cli download AutoArk-AI/GPA-v1.5-onnx-runtime \
  --local-dir GPA-v1.5-HF/GPA-v1.5-onnx-runtime
```
```
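If you prefer to script the checkpoint downloads from Python, the same two repos can be fetched with `snapshot_download` from the `huggingface_hub` package. This is a sketch under assumptions: the `DOWNLOADS` mapping and the `download_all` helper are hypothetical names, and the repo IDs and target directories are copied from the CLI commands above.

```python
# Repo IDs and target directories copied from the CLI commands above.
DOWNLOADS = {
    "AutoArk-AI/GPA-v1.5": "GPA-v1.5-HF/GPA-v1.5",
    "AutoArk-AI/GPA-v1.5-onnx-runtime": "GPA-v1.5-HF/GPA-v1.5-onnx-runtime",
}


def download_all(downloads: dict[str, str] = DOWNLOADS) -> list[str]:
    """Download each repo snapshot into its local directory and return the local paths."""
    # Imported lazily so this module can be inspected without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return [
        snapshot_download(repo_id=repo_id, local_dir=local_dir)
        for repo_id, local_dir in downloads.items()
    ]
```

Calling `download_all()` from the parent directory should produce the same `GPA-v1.5-HF/` tree as the CLI commands. The GitHub code checkout still has to be cloned separately.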
## Where To Start
- **Fine-tuning / continued training**: [GPA_1.5/docs/train.md](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/docs/train.md)
- **Native PyTorch inference**: [GPA_1.5/docs/infer.md](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/docs/infer.md)
- **ONNX runtime deployment**: [GPA_1.5/onnx_runtime/README.md](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/onnx_runtime/README.md)
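For a quick check that the checkpoint loads at all, the sketch below loads the weights with `transformers`. It assumes the checkpoint can be loaded through `AutoModelForCausalLM` with `trust_remote_code=True`, since the repository ships custom modeling code. The helper name `load_gpa` is hypothetical. This only loads the model. The actual ASR and TTS entry points, including audio tokenization, are in `GPA_1.5/infer.py` and are not reproduced here.

```python
REPO_ID = "AutoArk-AI/GPA-v1.5"


def load_gpa(model_path: str = REPO_ID):
    """Load the GPA v1.5 checkpoint from a Hub repo ID or a local directory."""
    # Imported lazily so this module can be inspected without transformers installed.
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        trust_remote_code=True,  # assumption: the repo's custom modeling code is required
        torch_dtype="auto",      # recent transformers releases also accept dtype="auto"
    )
    return model.eval()
```

Pass a local path such as `GPA-v1.5-HF/GPA-v1.5` to `load_gpa` to avoid downloading the checkpoint again. Because `trust_remote_code=True` executes code from the repository, review that code first.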
## GPA v1.5 Release Overview
| Area | GPA v1.5 |
| :--- | :--- |
| Checkpoint | Open-sourced on Hugging Face |
| Native inference | Direct PyTorch / Hugging Face execution for ASR and TTS |
| Native training | Fine-tuning and continued training with Hugging Face `Trainer` |
| ONNX runtime | CLI inference, FastAPI service, browser UI, voice registration, and runtime validation |
| Planned | Voice conversion support in the native v1.5 path |
## Evaluation Metric Results
### TTS Evaluation
| Model | Open-Source | Model Size | test-zh CER (%) ↓ | test-zh Sim (%) ↑ | test-en WER (%) ↓ | test-en Sim (%) ↑ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 |
| Seed-TTS | No | - | 1.12 | **79.6** | 2.25 | **76.2** |
| MiniMax-Speech | No | - | 0.83 | 78.3 | 1.65 | 69.2 |
| F5-TTS | Yes | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 |
| CosyVoice2 | Yes | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 |
| FireRedTTS2 | Yes | 1.5B | 1.14 | 73.2 | 1.95 | 66.5 |
| Index-TTS2 | Yes | 1.5B | 1.03 | 76.5 | 2.23 | 70.6 |
| VibeVoice-1.5B | Yes | 1.5B | 1.16 | 74.4 | 3.04 | 68.9 |
| VoxCPM | Yes | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 |
| Fun-CosyVoice3-0.5B-2512_RL | Yes | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 |
| Spark TTS | Yes | 0.5B | 1.20 | 66.0 | 1.98 | 57.3 |
| **GPA-v1.5** | **Yes** | **0.6B** | **1.03** | **70.2** | **1.43** | **63.5** |
### ASR Evaluation
WER (%) is reported for LibriSpeech and CER (%) for AISHELL-1. Lower is better for both.
| Model | Model Size | LibriSpeech test-clean | LibriSpeech test-other | AISHELL-1 | test_Meeting | test_Net |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| Whisper-S | 0.24B | 3.43 | 7.63 | - | - | - |
| **GPA-v1.5** | **0.6B** | **2.78** | **5.02** | **2.83** | **7.40** | **6.49** |
| Fun-ASR-nano | 0.8B | 1.76 | 4.33 | 1.80 | 6.60 | 6.01 |
| FireRed-ASR | 1.1B | 1.84 | 4.52 | 0.54 | 4.95 | 4.94 |
| GLM-ASR-nano | 1.5B | 2.00 | 4.19 | 1.81 | 6.73 | - |
| Whisper-L | 1.55B | 1.86 | 3.43 | 4.72 | 18.39 | 11.89 |
| Kimi-Audio | - | 1.32 | 2.63 | 0.71 | 6.24 | 6.45 |
| Step-Audio2 | - | 1.17 | 2.42 | 0.63 | 4.75 | 4.67 |
| Seed-ASR | - | 1.58 | 2.84 | 0.68 | 5.69 | 4.66 |
| Fun-ASR | 7.7B | 1.51 | 3.03 | 1.22 | 6.17 | 5.46 |
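For readers unfamiliar with the error rates in the tables above, WER and CER are both the edit distance between reference and hypothesis, divided by the reference length, counted in words or characters respectively. The sketch below illustrates that definition only. It is not the evaluation script behind the reported numbers, which would typically also normalize text and aggregate errors over a whole test set. The function names are hypothetical.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference items to match an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent, splitting on whitespace."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent, ignoring spaces."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return 100.0 * edit_distance(ref_chars, hyp_chars) / len(ref_chars)
```

For example, `wer("the cat sat on the mat", "the cat sat on mat")` counts one deleted word out of six reference words. Because insertions count as errors, both rates can exceed 100%.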
## License
This model is released under the Apache 2.0 license.
## Citation
If you find GPA useful for your research or projects, please cite us:
```bibtex
@misc{cai2026unifyingspeechrecognitionsynthesis,
title={Unifying Speech Recognition, Synthesis and Conversion with Autoregressive Transformers},
author={Runyuan Cai and Yu Lin and Yiming Wang and Chunlin Fu and Xiaodong Zeng},
year={2026},
eprint={2601.10770},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2601.10770},
}
```