---
library_name: transformers
tags:
- text-to-speech
- automatic-speech-recognition
- voice-conversion
- speech
- audio
pipeline_tag: text-to-speech
language:
- en
- zh
license: apache-2.0
base_model:
- Qwen/Qwen3-0.6B
homepage: https://autoark.github.io/GPA/
repository: https://github.com/AutoArk/GPA
---
<div align="center">

<img src="figures/GPA.png" width="80%" alt="GPA Logo"/>

# GPA v1.5: One Model for Speech Recognition, Text-to-Speech, and Voice Conversion

[![ArXiv](https://img.shields.io/badge/ArXiv-2601.10770-b31b1b?logo=arxiv)](https://arxiv.org/abs/2601.10770)
[![GitHub](https://img.shields.io/badge/GitHub-AutoArk%2FGPA-blue?logo=github)](https://github.com/AutoArk/GPA)
[![Demo](https://img.shields.io/badge/Demo-GitHub%20Pages-blue?logo=github)](https://autoark.github.io/GPA/)
[![ONNX Runtime Assets](https://img.shields.io/badge/ONNX%20Runtime-GPA--v1.5--onnx--runtime-yellow)](https://huggingface.co/AutoArk-AI/GPA-v1.5-onnx-runtime)

</div>

> **TL;DR** This is the main Hugging Face checkpoint repo for **GPA v1.5**. Use it for native PyTorch / Hugging Face inference and fine-tuning. Runtime-optimized ONNX assets are published separately at [AutoArk-AI/GPA-v1.5-onnx-runtime](https://huggingface.co/AutoArk-AI/GPA-v1.5-onnx-runtime).
## What Is GPA v1.5?
**GPA** stands for **General Purpose Audio**.
GPA v1.5 is a unified autoregressive audio-language model for speech understanding and generation. It currently supports:
- **ASR**: automatic speech recognition.
- **TTS**: text-to-speech with reference voice conditioning.
- **Training / fine-tuning**: native Hugging Face `Trainer` workflow.
- **Deployment path**: ONNX runtime assets and service code for local CLI, FastAPI, and browser UI testing.
Voice conversion is not yet available in the native v1.5 path; it is on the roadmap.
<div align="center">
<img src="figures/GPA_v1.5.jpeg" width="86%" alt="GPA v1.5 unified speech model overview"/>
<br>
<sub>GPA unifies speech understanding and generation in a single autoregressive audio-language model.</sub>
</div>
## Hugging Face and GitHub Mapping
This Hugging Face repo stores the large checkpoint assets. The code, examples, and docs live in the GitHub repo:
| Goal | GitHub Entry Point | Hugging Face Assets |
| :--- | :--- | :--- |
| Native PyTorch / Hugging Face inference | [`GPA_1.5/docs/infer.md`](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/docs/infer.md), [`GPA_1.5/infer.py`](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/infer.py) | This repo: `AutoArk-AI/GPA-v1.5` |
| Fine-tuning / continued training | [`GPA_1.5/docs/train.md`](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/docs/train.md), [`GPA_1.5/train.py`](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/train.py) | This repo: `AutoArk-AI/GPA-v1.5` |
| ONNX CLI / FastAPI / browser UI runtime | [`GPA_1.5/onnx_runtime/README.md`](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/onnx_runtime/README.md) | [`AutoArk-AI/GPA-v1.5-onnx-runtime`](https://huggingface.co/AutoArk-AI/GPA-v1.5-onnx-runtime) |
## Recommended Local Layout
To minimize configuration, place the GitHub checkout (`GPA-v1.5/`) next to a `GPA-v1.5-HF/` directory that holds both checkpoint repos:
```text
GPA-v1.5/
GPA-v1.5-HF/
  GPA-v1.5/
    spark_tokenizer_model/
  GPA-v1.5-onnx-runtime/
```
What each path is used for:
- `GPA-v1.5-HF/GPA-v1.5`: checkpoint for native PyTorch training and inference.
- `GPA-v1.5-HF/GPA-v1.5/spark_tokenizer_model`: Spark tokenizer assets used by native TTS.
- `GPA-v1.5-HF/GPA-v1.5-onnx-runtime`: ONNX CLI / service / browser UI asset bundle.
With this layout, the native inference, training, and ONNX smoke tests can run without editing source paths.
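Before running anything, the layout can be sanity-checked with a few lines of Python. This is a minimal sketch and not part of the GPA codebase: the helper name `missing_paths` is hypothetical, and the relative paths are copied from the list above.

```python
from pathlib import Path

# Directories expected under the common parent, taken from the layout above.
EXPECTED_DIRS = [
    "GPA-v1.5",                                    # GitHub checkout
    "GPA-v1.5-HF/GPA-v1.5",                        # native checkpoint
    "GPA-v1.5-HF/GPA-v1.5/spark_tokenizer_model",  # Spark tokenizer assets
    "GPA-v1.5-HF/GPA-v1.5-onnx-runtime",           # ONNX asset bundle
]


def missing_paths(root="."):
    """Return the expected directories that do not exist under `root`."""
    base = Path(root)
    return [rel for rel in EXPECTED_DIRS if not (base / rel).is_dir()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing directories:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("Layout looks complete.")
```

Run it from the directory that contains both `GPA-v1.5/` and `GPA-v1.5-HF/`. An empty result only confirms that the directories exist, not that the downloads inside them are complete.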
## Download
```bash
git clone https://github.com/AutoArk/GPA.git GPA-v1.5
mkdir -p GPA-v1.5-HF

huggingface-cli download AutoArk-AI/GPA-v1.5 \
  --local-dir GPA-v1.5-HF/GPA-v1.5

huggingface-cli download AutoArk-AI/GPA-v1.5-onnx-runtime \
  --local-dir GPA-v1.5-HF/GPA-v1.5-onnx-runtime
```
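If you prefer to script the checkpoint downloads from Python, the same two repos can be fetched with `snapshot_download` from the `huggingface_hub` package. The sketch below assumes that package is installed; the `REPOS` mapping and the `download_all` helper are illustrative names, not part of the GPA codebase, and the GitHub checkout still has to be cloned separately.

```python
from pathlib import Path

# Repo IDs and target directories mirror the CLI commands above.
REPOS = {
    "AutoArk-AI/GPA-v1.5": "GPA-v1.5-HF/GPA-v1.5",
    "AutoArk-AI/GPA-v1.5-onnx-runtime": "GPA-v1.5-HF/GPA-v1.5-onnx-runtime",
}


def download_all(root="."):
    """Download both checkpoint repos under `root` and return the local paths."""
    # Imported lazily so this module loads even without huggingface_hub.
    from huggingface_hub import snapshot_download

    paths = []
    for repo_id, rel_dir in REPOS.items():
        target = Path(root) / rel_dir
        target.mkdir(parents=True, exist_ok=True)
        paths.append(snapshot_download(repo_id=repo_id, local_dir=str(target)))
    return paths
```

Calling `download_all()` from the directory that contains the `GPA-v1.5/` checkout should produce the same `GPA-v1.5-HF/` tree as the shell commands.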
## Where To Start
- **Fine-tuning / continued training**: [GPA_1.5/docs/train.md](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/docs/train.md)
- **Native PyTorch inference**: [GPA_1.5/docs/infer.md](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/docs/infer.md)
- **ONNX runtime deployment**: [GPA_1.5/onnx_runtime/README.md](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/onnx_runtime/README.md)
## GPA v1.5 Release Overview
| | GPA v1.5 |
| :--- | :--- |
| Checkpoint | Open-sourced on Hugging Face |
| Native inference | Direct PyTorch / Hugging Face execution for ASR and TTS |
| Native training | Fine-tuning and continued training with Hugging Face `Trainer` |
| ONNX runtime | CLI inference, FastAPI service, browser UI, voice registration, and runtime validation |
| Planned | Voice conversion support in the native v1.5 path |
## Evaluation Results
### TTS Evaluation
| Model | Open-Source | Model Size | test-zh CER (%) ↓ | test-zh Sim (%) ↑ | test-en WER (%) ↓ | test-en Sim (%) ↑ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 |
| Seed-TTS | No | - | 1.12 | **79.6** | 2.25 | **76.2** |
| MiniMax-Speech | No | - | 0.83 | 78.3 | 1.65 | 69.2 |
| F5-TTS | Yes | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 |
| CosyVoice2 | Yes | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 |
| FireRedTTS2 | Yes | 1.5B | 1.14 | 73.2 | 1.95 | 66.5 |
| Index-TTS2 | Yes | 1.5B | 1.03 | 76.5 | 2.23 | 70.6 |
| VibeVoice-1.5B | Yes | 1.5B | 1.16 | 74.4 | 3.04 | 68.9 |
| VoxCPM | Yes | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 |
| Fun-CosyVoice3-0.5B-2512_RL | Yes | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 |
| Spark TTS | Yes | 0.5B | 1.20 | 66.0 | 1.98 | 57.3 |
| **GPA-v1.5** | **Yes** | **0.6B** | **1.03** | **70.2** | **1.43** | **63.5** |
### ASR Evaluation
WER (%) is reported for LibriSpeech and CER (%) for AISHELL-1. Lower is better for both.
| Model | Model Size | LibriSpeech test-clean | LibriSpeech test-other | AISHELL-1 | test_Meeting | test_Net |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| Whisper-S | 0.24B | 3.43 | 7.63 | - | - | - |
| **GPA-v1.5** | **0.6B** | **2.78** | **5.02** | **2.83** | **7.40** | **6.49** |
| Fun-ASR-nano | 0.8B | 1.76 | 4.33 | 1.80 | 6.60 | 6.01 |
| FireRed-ASR | 1.1B | 1.84 | 4.52 | 0.54 | 4.95 | 4.94 |
| GLM-ASR-nano | 1.5B | 2.00 | 4.19 | 1.81 | 6.73 | - |
| Whisper-L | 1.55B | 1.86 | 3.43 | 4.72 | 18.39 | 11.89 |
| Kimi-Audio | - | 1.32 | 2.63 | 0.71 | 6.24 | 6.45 |
| Step-Audio2 | - | 1.17 | 2.42 | 0.63 | 4.75 | 4.67 |
| Seed-ASR | - | 1.58 | 2.84 | 0.68 | 5.69 | 4.66 |
| Fun-ASR | 7.7B | 1.51 | 3.03 | 1.22 | 6.17 | 5.46 |
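
For readers unfamiliar with the metrics, WER and CER are both edit-distance error rates: the number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length, counted over words for WER and over characters for CER. The sketch below illustrates that standard definition only. It is not the evaluation script behind the numbers above, it skips the text normalization that real benchmarks apply, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference token missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra token in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent, using whitespace tokenization."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent.

    Assumption: whitespace is dropped before comparison, a common
    convention for Chinese text.
    """
    ref_chars = [c for c in reference if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    return 100.0 * edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the): 2 edits over 6 words.
    print(f"WER: {wer('the cat sat on the mat', 'the cat sit on mat'):.2f}%")
    # One substituted character out of 6.
    print(f"CER: {cer('今天天气很好', '今天天汽很好'):.2f}%")
```

Because the denominator is the reference length, a hypothesis with many insertions can score above 100%.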
## License
This model is released under the Apache 2.0 license.
## Citation
If you find GPA useful for your research or projects, please cite us:
```bibtex
@misc{cai2026unifyingspeechrecognitionsynthesis,
  title={Unifying Speech Recognition, Synthesis and Conversion with Autoregressive Transformers},
  author={Runyuan Cai and Yu Lin and Yiming Wang and Chunlin Fu and Xiaodong Zeng},
  year={2026},
  eprint={2601.10770},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2601.10770},
}
```