🎙️ X-ASR-zh-en

Chinese-English offline-streaming unified ASR model artifacts for low-latency deployment.

Huazhong University of Science and Technology

Participating Institutions

📄 X-ASR-zh-en Technical Report: Coming Soon

🔍 Model Card Scope | 📦 Repository Contents | 📊 Evaluation | ⬇️ Download | 🚀 Deployment

🔍 Model Card Scope

🧩 X-ASR Series

X-ASR is a series of automatic speech recognition models built with the icefall framework. The series focuses on streaming ASR and low-latency deployment while also supporting offline recognition. The project roadmap, source organization, issue tracking, and bilingual documentation are maintained on the GitHub project page (https://github.com/Gilgamesh-J/X-ASR).

🤖 X-ASR-zh-en

X-ASR-zh-en is trained on approximately 1 million hours of open-source and collected speech data. It is an offline-streaming unified transducer ASR model built on the Zipformer architecture, supporting both offline decoding and true streaming decoding. It provides four streaming chunk sizes (160 ms, 480 ms, 960 ms, and 1920 ms), outputs punctuation and casing, and can be deployed with sherpa-onnx.

Figure: Zipformer architecture (`figure/zipformer.png`)

✨ Artifact Page Notes

This repository is the model artifact page for X-ASR-zh-en.

What this artifact page provides	What the GitHub project provides
Downloadable model artifacts	Project-level overview
ONNX encoder / decoder / joiner files	Bilingual README and release notes
`sherpa-onnx` deployment entry point	Source layout and issue tracking
Model-card metadata, tags, license, and metrics	Development history and contribution workflow

📦 Repository Contents

Path	Purpose
`deployment/`	Deployment-ready `sherpa-onnx` runtime files and examples
`deployment/models/`	Exported streaming ONNX model variants
`deployment/infer_and_client/`	WebSocket server, inference wrapper, and test client
`figure/`	Architecture figure and demo preview media
`demo/`	Demo video asset
`applications/vibe-xasr/`	Vibe XASR desktop application package, manifest, and download notes
`streaming_exp/`	Averaged/pretrained checkpoint artifact for research reference

Directory Layout

.
|-- README.md
|-- config.json
|-- demo/
|   `-- demo.mov
|-- applications/
|   `-- vibe-xasr/
|       |-- README.md
|       |-- download_manifest.json
|       `-- VibeXASR-1.1.2-macos-universal.dmg
|-- deployment/
|   |-- infer_and_client/
|   |   |-- sherpa_streaming_client.py
|   |   |-- sherpa_streaming_infer.py
|   |   `-- sherpa_streaming_server.py
|   `-- models/
|       |-- chunk-160ms-model/
|       |   |-- encoder-160ms.onnx
|       |   |-- decoder-160ms.onnx
|       |   |-- joiner-160ms.onnx
|       |   `-- tokens.txt
|       |-- chunk-480ms-model/
|       |-- chunk-960ms-model/
|       `-- chunk-1920ms-model/
|-- figure/
|   |-- zipformer.png
|   |-- demo-preview.png
|   `-- institutions/
`-- streaming_exp/
    `-- pretrained.pt

🧩 Model Variants

Each streaming variant contains a matched encoder, decoder, joiner, and tokens.txt. Do not mix files across model folders.

Directory	Encoder	Decoder	Joiner	Intended chunk
`deployment/models/chunk-160ms-model`	`encoder-160ms.onnx`	`decoder-160ms.onnx`	`joiner-160ms.onnx`	160 ms
`deployment/models/chunk-480ms-model`	`encoder-480ms.onnx`	`decoder-480ms.onnx`	`joiner-480ms.onnx`	480 ms
`deployment/models/chunk-960ms-model`	`encoder-960ms.onnx`	`decoder-960ms.onnx`	`joiner-960ms.onnx`	960 ms
`deployment/models/chunk-1920ms-model`	`encoder-1920ms.onnx`	`decoder-1920ms.onnx`	`joiner-1920ms.onnx`	1920 ms
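
Because the three ONNX files and `tokens.txt` must come from the same folder, it can help to derive all four paths from a single chunk size instead of typing them by hand. The following is a minimal sketch of that idea. It only encodes the naming scheme from the table above; the helper name `variant_paths` and its `check` flag are illustrative and not part of the repository.

```python
from pathlib import Path

# Chunk sizes listed in the model variants table.
SUPPORTED_CHUNKS_MS = (160, 480, 960, 1920)


def variant_paths(chunk_ms: int, root: str = "deployment/models", check: bool = False) -> dict:
    """Return the matched encoder/decoder/joiner/tokens paths for one variant.

    Hypothetical helper: it mirrors the folder and file naming shown in the table.
    """
    if chunk_ms not in SUPPORTED_CHUNKS_MS:
        raise ValueError(f"unsupported chunk size: {chunk_ms} ms")
    folder = Path(root) / f"chunk-{chunk_ms}ms-model"
    paths = {
        "encoder": folder / f"encoder-{chunk_ms}ms.onnx",
        "decoder": folder / f"decoder-{chunk_ms}ms.onnx",
        "joiner": folder / f"joiner-{chunk_ms}ms.onnx",
        "tokens": folder / "tokens.txt",
    }
    if check:
        # Optionally verify that every file is present on disk.
        missing = [str(p) for p in paths.values() if not p.is_file()]
        if missing:
            raise FileNotFoundError(f"missing files: {missing}")
    return paths


if __name__ == "__main__":
    for name, path in variant_paths(480).items():
        print(f"{name}: {path}")
```

Since every path is built from one `chunk_ms` value, files from different model folders cannot be combined by accident.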

⭐ Highlights

Category	Description
Framework	icefall / k2
Architecture	Zipformer transducer
Runtime	sherpa-onnx
Languages	Chinese and English
Training scale	Approximately 1 million hours of open-source and collected speech data
Recognition modes	Offline decoding and true streaming decoding
Streaming chunks	160 ms, 480 ms, 960 ms, 1920 ms
Text output	Supports punctuation and casing

📊 Evaluation

The following results are for the current X-ASR-zh-en release. Values are WER/CER percentages; lower is better. All results are reported with greedy search.

Mode	Chunk size	LibriSpeech clean	LibriSpeech other	GigaSpeech	WenetSpeech net	WenetSpeech meeting
Streaming	160 ms	3.91	10.17	10.97	9.45	12.04
Streaming	480 ms	3.14	7.57	9.77	7.38	9.31
Streaming	960 ms	3.12	7.22	9.62	6.96	8.84
Streaming	1920 ms	2.84	6.47	9.46	6.42	8.03
Offline	-	2.69	5.76	9.23	5.96	7.20

Note: Bold numbers indicate the best result among the listed modes for each benchmark column.
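
For readers unfamiliar with the metric: WER and CER are conventionally the Levenshtein edit distance between reference and hypothesis divided by the reference length, counted over words for English and over characters for Chinese. The sketch below shows that standard definition only. The text normalization and scoring setup behind the table above are not described in this model card, so the function names and the whitespace tokenization here are assumptions.

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref: str, hyp: str, unit: str = "word") -> float:
    """Error rate in percent: WER for unit="word", CER for unit="char"."""
    if unit == "word":
        ref_tokens, hyp_tokens = ref.split(), hyp.split()
    else:
        # Character-level scoring: drop spaces, then compare character by character.
        ref_tokens = list(ref.replace(" ", ""))
        hyp_tokens = list(hyp.replace(" ", ""))
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


if __name__ == "__main__":
    # One inserted word against a 3-word reference.
    print(error_rate("the cat sat", "the cat sat down"))
    # One substituted character against a 6-character reference.
    print(error_rate("今天天气很好", "今天天气真好", unit="char"))
```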

Public Benchmark Model Comparison

The following table compares representative ASR models on the same public benchmark columns. Ranks are computed by AVG across the five listed columns; lower is better. Parameter sizes are shown when provided by the source sheet.

Rank	Model	Params	LibriSpeech clean	LibriSpeech other	GigaSpeech	WenetSpeech net	WenetSpeech meeting	AVG
1	Qwen3-ASR	1.7B	1.65	3.45	8.56	5.29	5.46	4.882
2	Qwen3-ASR	0.6B	2.18	4.54	8.94	5.97	6.88	5.702
3	X-ASR-zh-en (offline)	0.16B	2.56	5.56	9.17	5.83	7.06	6.036
4	SenseVoice-small	234M	3.16	7.21	11.24	5.73	6.47	6.762
5	VibeVoice-ASR	9B	2.18	5.65	9.49	14.45	17.19	9.792
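
The AVG column and the ranking can be reproduced from the five benchmark columns. The snippet below is a small sketch that assumes AVG is the unweighted arithmetic mean rounded to three decimals, which matches the values in the table; the function name `rank_by_average` is illustrative.

```python
# Rows copied from the comparison table above:
# (model, params, [LibriSpeech clean, LibriSpeech other, GigaSpeech,
#                  WenetSpeech net, WenetSpeech meeting])
ROWS = [
    ("Qwen3-ASR", "1.7B", [1.65, 3.45, 8.56, 5.29, 5.46]),
    ("Qwen3-ASR", "0.6B", [2.18, 4.54, 8.94, 5.97, 6.88]),
    ("X-ASR-zh-en (offline)", "0.16B", [2.56, 5.56, 9.17, 5.83, 7.06]),
    ("SenseVoice-small", "234M", [3.16, 7.21, 11.24, 5.73, 6.47]),
    ("VibeVoice-ASR", "9B", [2.18, 5.65, 9.49, 14.45, 17.19]),
]


def rank_by_average(rows):
    """Return (rank, model, params, avg) tuples sorted by ascending average error."""
    scored = [
        (round(sum(scores) / len(scores), 3), model, params)
        for model, params, scores in rows
    ]
    scored.sort(key=lambda item: item[0])  # lower average error ranks first
    return [
        (rank, model, params, avg)
        for rank, (avg, model, params) in enumerate(scored, start=1)
    ]


if __name__ == "__main__":
    for rank, model, params, avg in rank_by_average(ROWS):
        print(f"{rank}\t{model}\t{params}\t{avg:.3f}")
```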

GigaSpeechBench Vertical Domain Evaluation

The following results report GigaSpeechBench vertical-domain performance for the current X-ASR-zh-en release. Values are WER/CER percentages; lower is better. Domain abbreviations follow the GigaSpeechBench vertical-domain labels.

Chinese (CH)

Mode	Chunk size	ARG	AIT	ART	BIO	ECM	ENG	ENT	FIN	HUM	LAW	MED	MIL
Streaming	160 ms	9.88	6.76	4.39	7.32	4.13	3.58	8.45	3.23	10.42	6.58	4.25	2.55
Streaming	480 ms	8.67	6.17	3.60	6.22	3.78	3.04	7.04	2.78	9.43	5.84	3.76	2.11
Streaming	960 ms	8.00	5.69	3.44	6.10	3.69	2.88	6.71	2.72	9.07	5.58	3.69	2.11
Streaming	1920 ms	7.24	5.58	3.27	5.82	3.48	2.74	6.55	2.57	8.59	4.97	3.53	1.94
Offline	-	6.56	4.54	2.77	5.04	2.99	2.32	6.02	1.94	7.64	4.20	2.90	1.68

English (EN)

Mode	Chunk size	ARG	AIT	ART	BIO	ECM	ENG	ENT	FIN	HUM	LAW	MED	MIL
Streaming	160 ms	5.29	8.57	8.55	7.31	4.33	5.01	16.25	5.58	7.36	13.39	6.03	6.20
Streaming	480 ms	4.62	8.40	7.73	6.12	4.19	4.65	14.50	5.21	6.79	11.51	5.59	6.02
Streaming	960 ms	4.58	8.35	7.45	6.00	4.13	4.44	13.99	5.12	6.58	10.86	5.52	6.04
Streaming	1920 ms	4.33	8.32	6.90	5.89	4.00	4.37	13.61	4.98	6.39	10.52	5.45	5.78
Offline	-	4.09	8.28	6.73	5.48	4.12	4.30	12.30	4.94	6.17	10.41	5.35	5.61

🎧 Demo

A sherpa-onnx based online demo is available here:

https://stream-asr.sjtuxlance.com/

Demo video: `demo/demo.mov`

⬇️ Download

GitHub

Use GitHub when you want the full project repository, bilingual documentation, training references, deployment examples, and issue-tracking context.

git lfs install
git clone https://github.com/Gilgamesh-J/X-ASR.git
cd X-ASR
git lfs pull

Hugging Face

Use Hugging Face when you want the model artifact page and standard HF Hub download tooling.

hf download GilgameshWind/X-ASR-zh-en \
  --local-dir ./X-ASR-zh-en
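
If only one streaming variant is needed, the `huggingface_hub` Python library can restrict the download with glob patterns. This is a sketch that assumes `huggingface_hub` is installed and that the Hub repository follows the directory layout shown earlier; the helper names `variant_patterns` and `download_variant` are illustrative.

```python
def variant_patterns(chunk_ms: int) -> list:
    """Glob patterns selecting one model variant folder (layout taken from this card)."""
    return [f"deployment/models/chunk-{chunk_ms}ms-model/*"]


def download_variant(chunk_ms: int, local_dir: str = "./X-ASR-zh-en") -> str:
    """Download a single variant and return the local directory path."""
    # Imported lazily so the pattern helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="GilgameshWind/X-ASR-zh-en",
        local_dir=local_dir,
        allow_patterns=variant_patterns(chunk_ms),
    )


# Example (requires network access):
#   download_variant(160)
```

Restricting the patterns avoids fetching the desktop application package and the other three variants when they are not needed.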

You can also clone the Hugging Face repository with Git LFS:

git lfs install
git clone https://huggingface.co/GilgameshWind/X-ASR-zh-en
cd X-ASR-zh-en
git lfs pull

ModelScope

Use ModelScope when you prefer to clone from the ModelScope mirror with Git LFS.

git lfs install
git clone https://www.modelscope.ai/Gilgamesh-J/X-ASR-zh-en.git
cd X-ASR-zh-en
git lfs pull

🚀 Deployment

The recommended runtime is sherpa-onnx. The shortest path is to use the deployment package in this repository.

cd deployment
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements.txt

Start a CPU streaming server with the 160 ms model:

python infer_and_client/sherpa_streaming_server.py \
  --host 0.0.0.0 \
  --port 8766 \
  --tokens models/chunk-160ms-model/tokens.txt \
  --encoder models/chunk-160ms-model/encoder-160ms.onnx \
  --decoder models/chunk-160ms-model/decoder-160ms.onnx \
  --joiner models/chunk-160ms-model/joiner-160ms.onnx \
  --provider cpu \
  --sample-rate 16000 \
  --feature-dim 80 \
  --num-threads 1 \
  --decoding-method greedy_search \
  --model-type zipformer2 \
  --enable-endpoint-detection 0 \
  --text-format none
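
To serve a different variant, only the four model paths should need to change. As a rough sketch, the command above can be assembled programmatically. The flag names and values are copied from that command; the function name `server_argv` is hypothetical, and the script is assumed to be launched from the `deployment/` directory.

```python
import shlex
import sys


def server_argv(chunk_ms: int, host: str = "0.0.0.0", port: int = 8766) -> list:
    """Build the server command line for one chunk size (flags copied from the command above)."""
    model_dir = f"models/chunk-{chunk_ms}ms-model"
    return [
        sys.executable, "infer_and_client/sherpa_streaming_server.py",
        "--host", host,
        "--port", str(port),
        "--tokens", f"{model_dir}/tokens.txt",
        "--encoder", f"{model_dir}/encoder-{chunk_ms}ms.onnx",
        "--decoder", f"{model_dir}/decoder-{chunk_ms}ms.onnx",
        "--joiner", f"{model_dir}/joiner-{chunk_ms}ms.onnx",
        "--provider", "cpu",
        "--sample-rate", "16000",
        "--feature-dim", "80",
        "--num-threads", "1",
        "--decoding-method", "greedy_search",
        "--model-type", "zipformer2",
        "--enable-endpoint-detection", "0",
        "--text-format", "none",
    ]


if __name__ == "__main__":
    # Print the command for the 480 ms variant; to launch it from deployment/,
    # pass the list to subprocess.run(..., check=True) instead.
    print(shlex.join(server_argv(480)))
```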

Test it with a WAV file:

python infer_and_client/sherpa_streaming_client.py \
  --server-uri ws://127.0.0.1:8766 \
  --wav /path/to/test.wav \
  --chunk-ms 100 \
  --simulate-realtime 1
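
The `--chunk-ms 100` and `--simulate-realtime 1` options suggest that the client sends audio in 100 ms pieces paced at playback speed. The client implementation and its WebSocket message format are not shown in this card, so the sketch below illustrates only the chunking arithmetic at the 16 kHz sample rate used above (100 ms corresponds to 1600 samples). `iter_chunks` and `stream_realtime` are hypothetical names, and `send` stands in for whatever transport the real client uses.

```python
import time
from typing import Callable, Iterator, Sequence


def iter_chunks(samples: Sequence, sample_rate: int = 16000, chunk_ms: int = 100) -> Iterator[Sequence]:
    """Yield consecutive slices of `samples`, each covering `chunk_ms` of audio."""
    chunk_len = sample_rate * chunk_ms // 1000
    if chunk_len <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for start in range(0, len(samples), chunk_len):
        yield samples[start:start + chunk_len]  # the last chunk may be shorter


def stream_realtime(samples: Sequence, send: Callable, sample_rate: int = 16000,
                    chunk_ms: int = 100, realtime: bool = True) -> int:
    """Pass each chunk to `send`, optionally sleeping to mimic live capture."""
    sent = 0
    for chunk in iter_chunks(samples, sample_rate, chunk_ms):
        send(chunk)
        sent += 1
        if realtime:
            # Wait for as long as the chunk would take to play back.
            time.sleep(len(chunk) / sample_rate)
    return sent


if __name__ == "__main__":
    one_second = [0.0] * 16000  # one second of silence at 16 kHz
    count = stream_realtime(one_second, send=lambda chunk: None, realtime=False)
    print(count)
```

Note that the client chunk size is independent of the model's streaming chunk size: the 160 ms model can be fed 100 ms pieces, with buffering handled on the server side.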

For complete runtime options, see deployment/README.md.

⚠️ Intended Use and Limitations

This release is intended for Chinese-English ASR research, evaluation, demos, and deployment experiments.
The current release focuses on streaming and offline-streaming unified recognition.
Production latency depends on hardware, concurrency, audio chunking, endpointing, and server configuration.
The technical report with training details, evaluation protocol, ablations, and additional analysis is coming soon.

📄 Citation

The X-ASR-zh-en technical report is coming soon. Please cite the report once it is released. For now, refer to this model card and the GitHub project page.

📜 License

This model is released under the Apache-2.0 License.

🙏 Acknowledgements

This model is trained with icefall and deployed with sherpa-onnx.

icefall: https://github.com/k2-fsa/icefall
sherpa-onnx: https://github.com/k2-fsa/sherpa-onnx
