	---
	library_name: transformers
	tags:
	- text-to-speech
	- automatic-speech-recognition
	- voice-conversion
	- speech
	- audio
	pipeline_tag: text-to-speech
	language:
	- en
	- zh
	license: apache-2.0
	base_model:
	- Qwen/Qwen3-0.6B
	homepage: https://autoark.github.io/GPA/
	repository: https://github.com/AutoArk/GPA
	---

	<div align="center">
	<img src="figures/GPA.png" width="80%" alt="GPA Logo"/>

	# GPA v1.5: One Model for Speech Recognition, Text-to-Speech, and Voice Conversion

	[![ArXiv](https://img.shields.io/badge/ArXiv-2601.10770-b31b1b?logo=arxiv)](https://arxiv.org/abs/2601.10770)
	[![GitHub](https://img.shields.io/badge/GitHub-AutoArk%2FGPA-blue?logo=github)](https://github.com/AutoArk/GPA)
	[![Demo](https://img.shields.io/badge/Demo-GitHub%20Pages-blue?logo=github)](https://autoark.github.io/GPA/)
	[![ONNX Runtime Assets](https://img.shields.io/badge/ONNX%20Runtime-GPA--v1.5--onnx--runtime-yellow)](https://huggingface.co/AutoArk-AI/GPA-v1.5-onnx-runtime)

	</div>

	> TL;DR: This is the main Hugging Face checkpoint repo for GPA v1.5. Use it for native PyTorch / Hugging Face inference and fine-tuning. Runtime-optimized ONNX assets are published separately at [AutoArk-AI/GPA-v1.5-onnx-runtime](https://huggingface.co/AutoArk-AI/GPA-v1.5-onnx-runtime).

	## What Is GPA v1.5?

	GPA stands for General Purpose Audio.

	GPA v1.5 is a unified autoregressive audio-language model for speech understanding and generation. It currently supports:

	- ASR: automatic speech recognition.
	- TTS: text-to-speech with reference voice conditioning.
	- Training / fine-tuning: native Hugging Face `Trainer` workflow.
	- Deployment path: ONNX runtime assets and service code for local CLI, FastAPI, and browser UI testing.

	Voice conversion support in the native v1.5 path is on the roadmap.

	<div align="center">
	<img src="figures/GPA_v1.5.jpeg" width="86%" alt="GPA v1.5 unified speech model overview"/>
	<br>
	<sub>GPA unifies speech understanding and generation in a single autoregressive audio-language model.</sub>
	</div>

	## Hugging Face and GitHub Mapping

	This Hugging Face repo stores the large checkpoint assets. The code, examples, and docs live in the GitHub repo:

	| Goal | GitHub Entry Point | Hugging Face Assets |
	| :--- | :--- | :--- |
	| Native PyTorch / Hugging Face inference | [`GPA_1.5/docs/infer.md`](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/docs/infer.md), [`GPA_1.5/infer.py`](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/infer.py) | This repo: `AutoArk-AI/GPA-v1.5` |
	| Fine-tuning / continued training | [`GPA_1.5/docs/train.md`](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/docs/train.md), [`GPA_1.5/train.py`](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/train.py) | This repo: `AutoArk-AI/GPA-v1.5` |
	| ONNX CLI / FastAPI / browser UI runtime | [`GPA_1.5/onnx_runtime/README.md`](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/onnx_runtime/README.md) | [`AutoArk-AI/GPA-v1.5-onnx-runtime`](https://huggingface.co/AutoArk-AI/GPA-v1.5-onnx-runtime) |

	## Recommended Local Layout

	To minimize configuration, place the GitHub checkout (`GPA-v1.5/`) next to a `GPA-v1.5-HF/` directory that holds both Hugging Face repos:

	```text
	GPA-v1.5/
	GPA-v1.5-HF/
	  GPA-v1.5/
	    spark_tokenizer_model/
	  GPA-v1.5-onnx-runtime/
	```

	What each path is used for:

	- `GPA-v1.5-HF/GPA-v1.5`: native PyTorch train / inference checkpoint.
	- `GPA-v1.5-HF/GPA-v1.5/spark_tokenizer_model`: Spark tokenizer assets used by native TTS.
	- `GPA-v1.5-HF/GPA-v1.5-onnx-runtime`: ONNX CLI / service / browser UI asset bundle.

	With this layout, the native inference, training, and ONNX smoke tests can run without editing source paths.
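The layout can be sanity-checked before running anything. The snippet below is a minimal sketch and is not part of the GPA repo: the helper name `missing_layout_paths` is hypothetical, and it only checks that the directories exist, not that their contents are complete. The directory names are taken from the list above.

```python
from pathlib import Path

# Directories the recommended layout expects, relative to a common parent.
# The names come from this README; the check itself is illustrative only.
EXPECTED_DIRS = (
    "GPA-v1.5",                                    # GitHub checkout (code, docs)
    "GPA-v1.5-HF/GPA-v1.5",                        # native PyTorch checkpoint
    "GPA-v1.5-HF/GPA-v1.5/spark_tokenizer_model",  # Spark tokenizer assets
    "GPA-v1.5-HF/GPA-v1.5-onnx-runtime",           # ONNX runtime asset bundle
)


def missing_layout_paths(parent: Path) -> list[str]:
    """Return the expected directories that do not exist under `parent`."""
    return [rel for rel in EXPECTED_DIRS if not (parent / rel).is_dir()]


if __name__ == "__main__":
    missing = missing_layout_paths(Path.cwd())
    if missing:
        print("Missing directories:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("Layout looks complete.")
```

Run it from the parent directory that contains both `GPA-v1.5/` and `GPA-v1.5-HF/`.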

	## Download

	```bash
	git clone https://github.com/AutoArk/GPA.git GPA-v1.5
	mkdir -p GPA-v1.5-HF

	huggingface-cli download AutoArk-AI/GPA-v1.5 \
	  --local-dir GPA-v1.5-HF/GPA-v1.5

	huggingface-cli download AutoArk-AI/GPA-v1.5-onnx-runtime \
	  --local-dir GPA-v1.5-HF/GPA-v1.5-onnx-runtime
	```
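If the setup is scripted in Python, the two `huggingface-cli download` commands can be mirrored with `snapshot_download` from the `huggingface_hub` package. This is a sketch under assumptions, not an official GPA script: the helper names `download_plan` and `download_all` are hypothetical, and the download step needs `huggingface_hub` installed and network access. The repo IDs and target directories are the ones used in the commands above.

```python
from pathlib import Path

# Repo IDs taken from the shell commands in this README.
REPOS = ("AutoArk-AI/GPA-v1.5", "AutoArk-AI/GPA-v1.5-onnx-runtime")


def download_plan(hf_root: Path) -> dict[str, Path]:
    """Map each Hugging Face repo ID to its local target directory.

    The target is `hf_root/<repo name>`, matching the --local-dir values
    used with huggingface-cli.
    """
    return {repo_id: hf_root / repo_id.split("/", 1)[1] for repo_id in REPOS}


def download_all(hf_root: Path) -> None:
    """Download both repos (hypothetical helper; requires network access)."""
    # Imported lazily so the planning helper works without the package.
    from huggingface_hub import snapshot_download

    for repo_id, local_dir in download_plan(hf_root).items():
        snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

Calling `download_all(Path("GPA-v1.5-HF"))` from the parent directory should produce the same `GPA-v1.5-HF/` contents as the shell commands. The GitHub checkout still has to be cloned separately.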

	## Where To Start

	- Fine-tuning / continued training: [GPA_1.5/docs/train.md](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/docs/train.md)
	- Native PyTorch inference: [GPA_1.5/docs/infer.md](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/docs/infer.md)
	- ONNX runtime deployment: [GPA_1.5/onnx_runtime/README.md](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/onnx_runtime/README.md)

	## GPA v1.5 Release Overview

	| Area | GPA v1.5 |
	| :--- | :--- |
	| Checkpoint | Open-sourced on Hugging Face |
	| Native inference | Direct PyTorch / Hugging Face execution for ASR and TTS |
	| Native training | Fine-tuning and continued training with Hugging Face `Trainer` |
	| ONNX runtime | CLI inference, FastAPI service, browser UI, voice registration, and runtime validation |
	| Planned | Voice conversion support in the native v1.5 path |

	## Evaluation Results

	### TTS Evaluation

	| Model | Open-Source | Model Size | test-zh CER (%) ↓ | test-zh Sim (%) ↑ | test-en WER (%) ↓ | test-en Sim (%) ↑ |
	| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
	| Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 |
	| Seed-TTS | No | - | 1.12 | 79.6 | 2.25 | 76.2 |
	| MiniMax-Speech | No | - | 0.83 | 78.3 | 1.65 | 69.2 |
	| F5-TTS | Yes | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 |
	| CosyVoice2 | Yes | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 |
	| FireRedTTS2 | Yes | 1.5B | 1.14 | 73.2 | 1.95 | 66.5 |
	| Index-TTS2 | Yes | 1.5B | 1.03 | 76.5 | 2.23 | 70.6 |
	| VibeVoice-1.5B | Yes | 1.5B | 1.16 | 74.4 | 3.04 | 68.9 |
	| VoxCPM | Yes | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 |
	| Fun-CosyVoice3-0.5B-2512_RL | Yes | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 |
	| Spark TTS | Yes | 0.5B | 1.20 | 66.0 | 1.98 | 57.3 |
	| GPA-v1.5 | Yes | 0.6B | 1.03 | 70.2 | 1.43 | 63.5 |

	### ASR Evaluation

	WER (%) is reported for LibriSpeech and CER (%) for AISHELL-1; lower is better for both.

	| Model | Model Size | LibriSpeech test-clean | LibriSpeech test-other | AISHELL-1 | test_Meeting | test_Net |
	| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
	| Whisper-S | 0.24B | 3.43 | 7.63 | - | - | - |
	| GPA-v1.5 | 0.6B | 2.78 | 5.02 | 2.83 | 7.40 | 6.49 |
	| Fun-ASR-nano | 0.8B | 1.76 | 4.33 | 1.80 | 6.60 | 6.01 |
	| FireRed-ASR | 1.1B | 1.84 | 4.52 | 0.54 | 4.95 | 4.94 |
	| GLM-ASR-nano | 1.5B | 2.00 | 4.19 | 1.81 | 6.73 | - |
	| Whisper-L | 1.55B | 1.86 | 3.43 | 4.72 | 18.39 | 11.89 |
	| Kimi-Audio | - | 1.32 | 2.63 | 0.71 | 6.24 | 6.45 |
	| Step-Audio2 | - | 1.17 | 2.42 | 0.63 | 4.75 | 4.67 |
	| Seed-ASR | - | 1.58 | 2.84 | 0.68 | 5.69 | 4.66 |
	| Fun-ASR | 7.7B | 1.51 | 3.03 | 1.22 | 6.17 | 5.46 |
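
For readers unfamiliar with the metrics in the tables above: WER and CER are both edit-distance error rates, that is, the number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length, counted over words (WER) or characters (CER). The sketch below only illustrates the definition. It is not the evaluation script behind the reported numbers, and it skips text normalization such as casing and punctuation handling, which real evaluations typically apply.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # hypothesis prefix is empty: i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent; assumes a non-empty reference."""
    ref = reference.split()
    return 100.0 * edit_distance(ref, hypothesis.split()) / len(ref)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent, ignoring spaces; non-empty reference."""
    ref = list(reference.replace(" ", ""))
    hyp = list(hypothesis.replace(" ", ""))
    return 100.0 * edit_distance(ref, hyp) / len(ref)
```

For example, `wer("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion against six reference words, giving about 33.3.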

	## License

	This model is released under the Apache 2.0 license.

	## Citation

	If you find GPA useful for your research or projects, please cite us:

	```bibtex
	@misc{cai2026unifyingspeechrecognitionsynthesis,
	  title={Unifying Speech Recognition, Synthesis and Conversion with Autoregressive Transformers},
	  author={Runyuan Cai and Yu Lin and Yiming Wang and Chunlin Fu and Xiaodong Zeng},
	  year={2026},
	  eprint={2601.10770},
	  archivePrefix={arXiv},
	  primaryClass={cs.SD},
	  url={https://arxiv.org/abs/2601.10770},
	}
	```