---
library_name: transformers
tags:
- text-to-speech
- automatic-speech-recognition
- voice-conversion
- speech
- audio
pipeline_tag: text-to-speech
language:
- en
- zh
license: apache-2.0
base_model:
- Qwen/Qwen3-0.6B
homepage: https://autoark.github.io/GPA/
repository: https://github.com/AutoArk/GPA
---

<div align="center">
<img src="figures/GPA.png" width="80%" alt="GPA Logo"/>

# GPA v1.5: One Model for Speech Recognition, Text-to-Speech, and Voice Conversion

[Paper](https://arxiv.org/abs/2601.10770)
[GitHub](https://github.com/AutoArk/GPA)
[Project page](https://autoark.github.io/GPA/)
[ONNX runtime assets](https://huggingface.co/AutoArk-AI/GPA-v1.5-onnx-runtime)

</div>

> **TL;DR** This is the main Hugging Face checkpoint repo for **GPA v1.5**. Use it for native PyTorch / Hugging Face inference and fine-tuning. Runtime-optimized ONNX assets are published separately at [AutoArk-AI/GPA-v1.5-onnx-runtime](https://huggingface.co/AutoArk-AI/GPA-v1.5-onnx-runtime).

## What Is GPA v1.5?

**GPA** stands for **General Purpose Audio**.

GPA v1.5 is a unified autoregressive audio-language model for speech understanding and generation. This release covers:

- **ASR**: automatic speech recognition.
- **TTS**: text-to-speech with reference voice conditioning.
- **Training / fine-tuning**: native Hugging Face `Trainer` workflow.
- **Deployment path**: ONNX runtime assets and service code for local CLI, FastAPI, and browser UI testing.

Voice conversion is not yet available in the native v1.5 path; it is on the roadmap.

<div align="center">
<img src="figures/GPA_v1.5.jpeg" width="86%" alt="GPA v1.5 unified speech model overview"/>
<br>
<sub>GPA unifies speech understanding and generation in a single autoregressive audio-language model.</sub>
</div>

## Hugging Face and GitHub Mapping

This Hugging Face repo stores the large checkpoint assets. The code, examples, and docs live in the GitHub repo:

| Goal | GitHub Entry Point | Hugging Face Assets |
| :--- | :--- | :--- |
| Native PyTorch / Hugging Face inference | [`GPA_1.5/docs/infer.md`](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/docs/infer.md), [`GPA_1.5/infer.py`](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/infer.py) | This repo: `AutoArk-AI/GPA-v1.5` |
| Fine-tuning / continued training | [`GPA_1.5/docs/train.md`](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/docs/train.md), [`GPA_1.5/train.py`](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/train.py) | This repo: `AutoArk-AI/GPA-v1.5` |
| ONNX CLI / FastAPI / browser UI runtime | [`GPA_1.5/onnx_runtime/README.md`](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/onnx_runtime/README.md) | [`AutoArk-AI/GPA-v1.5-onnx-runtime`](https://huggingface.co/AutoArk-AI/GPA-v1.5-onnx-runtime) |

## Recommended Local Layout

For the least configuration, keep the code checkout and the checkpoint repos side by side:

```text
GPA-v1.5/
GPA-v1.5-HF/
  GPA-v1.5/
    spark_tokenizer_model/
  GPA-v1.5-onnx-runtime/
```

What each path is used for:

- `GPA-v1.5`: clone of the GitHub code repo.
- `GPA-v1.5-HF/GPA-v1.5`: native PyTorch train / inference checkpoint.
- `GPA-v1.5-HF/GPA-v1.5/spark_tokenizer_model`: Spark tokenizer assets used by native TTS.
- `GPA-v1.5-HF/GPA-v1.5-onnx-runtime`: ONNX CLI / service / browser UI asset bundle.

With this layout, the native inference, training, and ONNX smoke tests can run without editing source paths.

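
Before running the smoke tests, it can help to confirm that the directories are where the layout expects them. The following is a minimal sketch, not part of the GPA repo: `EXPECTED_DIRS` and `missing_dirs` are hypothetical names, and the check only looks for the directories listed above, not for the files inside them.

```python
from pathlib import Path

# Directories from the recommended layout, relative to the workspace root.
EXPECTED_DIRS = [
    "GPA-v1.5",
    "GPA-v1.5-HF/GPA-v1.5",
    "GPA-v1.5-HF/GPA-v1.5/spark_tokenizer_model",
    "GPA-v1.5-HF/GPA-v1.5-onnx-runtime",
]


def missing_dirs(workspace="."):
    """Return the expected directories that do not exist under `workspace`."""
    base = Path(workspace)
    return [rel for rel in EXPECTED_DIRS if not (base / rel).is_dir()]
```

Calling `missing_dirs()` from the workspace root returns an empty list when the layout is complete, and otherwise names the directories that still need to be cloned or downloaded.
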
## Download

```bash
git clone https://github.com/AutoArk/GPA.git GPA-v1.5
mkdir -p GPA-v1.5-HF
huggingface-cli download AutoArk-AI/GPA-v1.5 \
  --local-dir GPA-v1.5-HF/GPA-v1.5
huggingface-cli download AutoArk-AI/GPA-v1.5-onnx-runtime \
  --local-dir GPA-v1.5-HF/GPA-v1.5-onnx-runtime
```

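
The two checkpoint downloads can also be scripted from Python. This is a minimal sketch, assuming the `huggingface_hub` package is installed; `REPOS`, `download_plan`, and `download_all` are hypothetical helper names rather than anything shipped with GPA, and the GitHub clone is still done with `git`.

```python
from pathlib import Path

# Hub repo id -> directory name under the checkpoint root,
# mirroring the two huggingface-cli commands above.
REPOS = {
    "AutoArk-AI/GPA-v1.5": "GPA-v1.5",
    "AutoArk-AI/GPA-v1.5-onnx-runtime": "GPA-v1.5-onnx-runtime",
}


def download_plan(root="GPA-v1.5-HF"):
    """Map each Hub repo id to its local target directory."""
    return {repo_id: Path(root) / name for repo_id, name in REPOS.items()}


def download_all(root="GPA-v1.5-HF"):
    """Download both repos; requires network access and huggingface_hub."""
    from huggingface_hub import snapshot_download  # imported lazily

    for repo_id, target in download_plan(root).items():
        target.mkdir(parents=True, exist_ok=True)
        snapshot_download(repo_id=repo_id, local_dir=target)
```

Running `download_all()` from the workspace root should leave the same directory structure as the shell commands.
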
## Where To Start

- **Fine-tuning / continued training**: [GPA_1.5/docs/train.md](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/docs/train.md)
- **Native PyTorch inference**: [GPA_1.5/docs/infer.md](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/docs/infer.md)
- **ONNX runtime deployment**: [GPA_1.5/onnx_runtime/README.md](https://github.com/AutoArk/GPA/blob/main/GPA_1.5/onnx_runtime/README.md)

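
For a quick check that the weights load, the checkpoint can be opened directly with Transformers. The sketch below assumes the model loads through `AutoModelForCausalLM` with `trust_remote_code=True`; `resolve_checkpoint` and `load_gpa` are hypothetical helpers, and the inference guide linked above remains the reference for actually running ASR or TTS.

```python
from pathlib import Path

HUB_ID = "AutoArk-AI/GPA-v1.5"
# Local checkpoint location from the recommended layout.
LOCAL_CHECKPOINT = Path("GPA-v1.5-HF") / "GPA-v1.5"


def resolve_checkpoint(root="."):
    """Prefer the local checkpoint directory; fall back to the Hub repo id."""
    candidate = Path(root) / LOCAL_CHECKPOINT
    return str(candidate) if candidate.is_dir() else HUB_ID


def load_gpa(root="."):
    """Load the model; requires `transformers` and PyTorch to be installed."""
    from transformers import AutoModelForCausalLM  # imported lazily

    return AutoModelForCausalLM.from_pretrained(
        resolve_checkpoint(root),
        trust_remote_code=True,  # assumption: the repo ships custom model code
        dtype="auto",  # older transformers releases name this torch_dtype
    )
```

Because `trust_remote_code=True` executes Python code from the repository, it is worth reviewing that code before loading the model in a sensitive environment.
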
## GPA v1.5 Release Overview

| | GPA v1.5 |
| :--- | :--- |
| Checkpoint | Open-sourced on Hugging Face |
| Native inference | Direct PyTorch / Hugging Face execution for ASR and TTS |
| Native training | Fine-tuning and continued training with Hugging Face `Trainer` |
| ONNX runtime | CLI inference, FastAPI service, browser UI, voice registration, and runtime validation |
| Planned | Voice conversion support in the native v1.5 path |

## Evaluation Metric Results

### TTS Evaluation

| Model | Open-Source | Model Size | test-zh CER (%) ↓ | test-zh Sim (%) ↑ | test-en WER (%) ↓ | test-en Sim (%) ↑ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 |
| Seed-TTS | No | - | 1.12 | **79.6** | 2.25 | **76.2** |
| MiniMax-Speech | No | - | 0.83 | 78.3 | 1.65 | 69.2 |
| F5-TTS | Yes | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 |
| CosyVoice2 | Yes | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 |
| FireRedTTS2 | Yes | 1.5B | 1.14 | 73.2 | 1.95 | 66.5 |
| Index-TTS2 | Yes | 1.5B | 1.03 | 76.5 | 2.23 | 70.6 |
| VibeVoice-1.5B | Yes | 1.5B | 1.16 | 74.4 | 3.04 | 68.9 |
| VoxCPM | Yes | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 |
| Fun-CosyVoice3-0.5B-2512_RL | Yes | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 |
| Spark TTS | Yes | 0.5B | 1.20 | 66.0 | 1.98 | 57.3 |
| **GPA-v1.5** | **Yes** | **0.6B** | **1.03** | **70.2** | **1.43** | **63.5** |

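
This card does not define how the Sim columns are computed. In TTS benchmarks, speaker similarity is commonly reported as the cosine similarity between a speaker embedding of the reference audio and one of the synthesized audio, scaled to a percentage. The sketch below illustrates only that arithmetic, under that assumption; the embedding model is not specified here, and `cosine_similarity` and `similarity_percent` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_percent(ref_embedding, gen_embedding):
    """Express the cosine similarity on a 0-100 style scale."""
    return 100.0 * cosine_similarity(ref_embedding, gen_embedding)
```

Under this reading, a higher value means the synthesized voice is closer to the reference speaker, which matches the ↑ direction in the table.
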
### ASR Evaluation

WER (%) is reported for LibriSpeech. CER (%) is reported for AISHELL-1.

| Model | Model Size | LibriSpeech test-clean | LibriSpeech test-other | AISHELL-1 | test_Meeting | test_Net |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| Whisper-S | 0.24B | 3.43 | 7.63 | - | - | - |
| **GPA-v1.5** | **0.6B** | **2.78** | **5.02** | **2.83** | **7.40** | **6.49** |
| Fun-ASR-nano | 0.8B | 1.76 | 4.33 | 1.80 | 6.60 | 6.01 |
| FireRed-ASR | 1.1B | 1.84 | 4.52 | 0.54 | 4.95 | 4.94 |
| GLM-ASR-nano | 1.5B | 2.00 | 4.19 | 1.81 | 6.73 | - |
| Whisper-L | 1.55B | 1.86 | 3.43 | 4.72 | 18.39 | 11.89 |
| Kimi-Audio | - | 1.32 | 2.63 | 0.71 | 6.24 | 6.45 |
| Step-Audio2 | - | 1.17 | 2.42 | 0.63 | 4.75 | 4.67 |
| Seed-ASR | - | 1.58 | 2.84 | 0.68 | 5.69 | 4.66 |
| Fun-ASR | 7.7B | 1.51 | 3.03 | 1.22 | 6.17 | 5.46 |

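
For readers unfamiliar with the metrics, WER and CER are both edit-distance ratios: the number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length in words or characters. The sketch below shows the per-utterance calculation only. It is not the evaluation script behind the table, and it skips the text normalization and corpus-level aggregation that benchmark scoring usually applies.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent, splitting on whitespace."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent, ignoring spaces."""
    ref_chars = list(reference.replace(" ", ""))
    hyp_chars = list(hypothesis.replace(" ", ""))
    return 100.0 * edit_distance(ref_chars, hyp_chars) / len(ref_chars)
```

For example, `wer("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion against six reference words, which gives about 33.3%.
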
## License

This model is released under the Apache 2.0 license.

## Citation

If you find GPA useful for your research or projects, please cite us:

```bibtex
@misc{cai2026unifyingspeechrecognitionsynthesis,
  title={Unifying Speech Recognition, Synthesis and Conversion with Autoregressive Transformers},
  author={Runyuan Cai and Yu Lin and Yiming Wang and Chunlin Fu and Xiaodong Zeng},
  year={2026},
  eprint={2601.10770},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2601.10770},
}
```