X-ASR-zh-en
Chinese-English offline-streaming unified ASR model artifacts for low-latency deployment.
Participating Institutions
GitHub Project | Hugging Face Space | Online Demo | Deployment Guide
X-ASR-zh-en Technical Report: Coming Soon
Model Card Scope | Repository Contents | Evaluation | Download | Deployment
Model Card Scope
X-ASR Series
X-ASR is a series of automatic speech recognition models built with the icefall framework. The series focuses on streaming ASR and low-latency deployment, while also supporting offline recognition. The broader project roadmap, source organization, issue tracking, and bilingual documentation are maintained on the GitHub project page.
X-ASR-zh-en
X-ASR-zh-en is trained on approximately 1 million hours of open-source and collected speech data. It is designed as an offline-streaming unified transducer ASR model with the Zipformer architecture, supporting both offline decoding and true streaming decoding. The model provides multiple streaming chunk sizes: 160 ms, 480 ms, 960 ms, and 1920 ms, supports punctuation and casing, and can be deployed with sherpa-onnx.
HF-Specific Notes
This Hugging Face repository is the model artifact page for X-ASR-zh-en.
| What this page provides | What the GitHub page provides |
|---|---|
| Downloadable model artifacts | Project-level overview |
| ONNX encoder / decoder / joiner files | Bilingual README and release notes |
| sherpa-onnx deployment entry point | Source layout and issue tracking |
| Model-card metadata, tags, license, and metrics | Development history and contribution workflow |
Repository Contents
| Path | Purpose |
|---|---|
| deployment/ | Deployment-ready sherpa-onnx runtime files and examples |
| deployment/models/ | Exported streaming ONNX model variants |
| deployment/infer_and_client/ | WebSocket server, inference wrapper, and test client |
| figure/ | Architecture figure and demo preview media |
| demo/ | Demo video asset |
| zipformer/ | Training/export reference files for the Zipformer-based setup |
Model Variants
Each streaming variant contains a matched encoder, decoder, joiner, and tokens.txt. Do not mix files across model folders.
| Directory | Encoder | Decoder | Joiner | Intended chunk |
|---|---|---|---|---|
| deployment/models/chunk-160ms-model | encoder-160ms.onnx | decoder-160ms.onnx | joiner-160ms.onnx | 160 ms |
| deployment/models/chunk-480ms-model | encoder-480ms.onnx | decoder-480ms.onnx | joiner-480ms.onnx | 480 ms |
| deployment/models/chunk-960ms-model | encoder-960ms.onnx | decoder-960ms.onnx | joiner-960ms.onnx | 960 ms |
| deployment/models/chunk-1920ms-model | encoder-1920ms.onnx | decoder-1920ms.onnx | joiner-1920ms.onnx | 1920 ms |
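
Because each folder is a matched set, a quick pre-flight check can catch an incomplete or mixed download before a server is started. The following is a minimal sketch that assumes the naming pattern shown in the table; the helper names `expected_files` and `check_variant` are hypothetical and not part of this repository.

```python
from pathlib import Path

CHUNKS_MS = (160, 480, 960, 1920)  # chunk sizes listed in the table above


def expected_files(chunk_ms: int) -> list[str]:
    """File names a variant folder should contain, per the table's naming pattern."""
    if chunk_ms not in CHUNKS_MS:
        raise ValueError(f"unsupported chunk size: {chunk_ms} ms")
    parts = ("encoder", "decoder", "joiner")
    return [f"{p}-{chunk_ms}ms.onnx" for p in parts] + ["tokens.txt"]


def check_variant(models_root: str, chunk_ms: int) -> list[str]:
    """Return the expected files that are missing from the variant folder."""
    folder = Path(models_root) / f"chunk-{chunk_ms}ms-model"
    return [name for name in expected_files(chunk_ms) if not (folder / name).is_file()]


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    for ms in CHUNKS_MS:
        missing = check_variant("deployment/models", ms)
        print(f"{ms} ms:", "ok" if not missing else f"missing {missing}")
```

An empty list from `check_variant` means all four files for that chunk size are present in the same folder.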
Highlights
| Category | Description |
|---|---|
| Framework | icefall / k2 |
| Architecture | Zipformer transducer |
| Runtime | sherpa-onnx |
| Languages | Chinese and English |
| Training scale | Approximately 1 million hours of open-source and collected speech data |
| Recognition modes | Offline decoding and true streaming decoding |
| Streaming chunks | 160 ms, 480 ms, 960 ms, 1920 ms |
| Text output | Supports punctuation and casing |
Evaluation
The following results are for the current X-ASR-zh-en release. Values are WER/CER percentages; lower is better. All results are reported with greedy search.
| Mode | Chunk size | LibriSpeech clean | LibriSpeech other | GigaSpeech | WenetSpeech net | WenetSpeech meeting |
|---|---|---|---|---|---|---|
| Streaming | 160 ms | 3.91 | 10.17 | 10.97 | 9.45 | 12.04 |
| Streaming | 480 ms | 3.14 | 7.57 | 9.77 | 7.38 | 9.31 |
| Streaming | 960 ms | 3.12 | 7.22 | 9.62 | 6.96 | 8.84 |
| Streaming | 1920 ms | 2.84 | 6.47 | 9.46 | 6.42 | 8.03 |
| Offline | - | **2.69** | **5.76** | **9.23** | **5.96** | **7.20** |
Note: Bold numbers indicate the best result among the listed modes for each benchmark column.
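
To make the latency/accuracy trade-off easier to read, the numbers above can be converted into relative error increases over the offline result. The following is a small sketch that uses only the values from the table; the dictionary layout and the function name `relative_increase` are illustrative choices.

```python
# WER/CER (%) copied from the table above, in column order:
# LibriSpeech clean, LibriSpeech other, GigaSpeech, WenetSpeech net, WenetSpeech meeting
RESULTS = {
    "160 ms": [3.91, 10.17, 10.97, 9.45, 12.04],
    "480 ms": [3.14, 7.57, 9.77, 7.38, 9.31],
    "960 ms": [3.12, 7.22, 9.62, 6.96, 8.84],
    "1920 ms": [2.84, 6.47, 9.46, 6.42, 8.03],
}
OFFLINE = [2.69, 5.76, 9.23, 5.96, 7.20]


def relative_increase(streaming, offline):
    """Relative error increase over offline, in percent, per benchmark column."""
    return [round(100.0 * (s - o) / o, 1) for s, o in zip(streaming, offline)]


if __name__ == "__main__":
    for chunk, scores in RESULTS.items():
        print(chunk, relative_increase(scores, OFFLINE))
```

On LibriSpeech clean, for instance, the 160 ms variant is roughly 45% worse than offline in relative terms, while the 1920 ms variant is about 6% worse.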
Public Benchmark Model Comparison
The following table compares representative ASR models on the same public benchmark columns. Ranks are computed by AVG across the five listed columns; lower is better. Parameter sizes are shown when provided by the source sheet.
| Rank | Model | Params | LibriSpeech clean | LibriSpeech other | GigaSpeech | WenetSpeech net | WenetSpeech meeting | AVG |
|---|---|---|---|---|---|---|---|---|
| 1 | Qwen3-ASR | 1.7B | 1.65 | 3.45 | 8.56 | 5.29 | 5.46 | 4.882 |
| 2 | Qwen3-ASR | 0.6B | 2.18 | 4.54 | 8.94 | 5.97 | 6.88 | 5.702 |
| 3 | X-ASR-zh-en (offline) | 0.16B | 2.56 | 5.56 | 9.17 | 5.83 | 7.06 | 6.036 |
| 4 | SenseVoice-small | 234M | 3.16 | 7.21 | 11.24 | 5.73 | 6.47 | 6.762 |
| 5 | VibeVoice-ASR | 9B | 2.18 | 5.65 | 9.49 | 14.45 | 17.19 | 9.792 |
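
The AVG column and the ranking can be reproduced from the five benchmark columns. The sketch below assumes AVG is the unweighted mean of those columns, which matches the values in the table; the function names are illustrative.

```python
# Scores (%) copied from the comparison table, in column order:
# LibriSpeech clean, LibriSpeech other, GigaSpeech, WenetSpeech net, WenetSpeech meeting
SCORES = {
    "Qwen3-ASR (1.7B)": [1.65, 3.45, 8.56, 5.29, 5.46],
    "Qwen3-ASR (0.6B)": [2.18, 4.54, 8.94, 5.97, 6.88],
    "X-ASR-zh-en (offline)": [2.56, 5.56, 9.17, 5.83, 7.06],
    "SenseVoice-small": [3.16, 7.21, 11.24, 5.73, 6.47],
    "VibeVoice-ASR": [2.18, 5.65, 9.49, 14.45, 17.19],
}


def average(scores):
    """Unweighted mean over the five benchmark columns, rounded to 3 decimals."""
    return round(sum(scores) / len(scores), 3)


def rank(models):
    """Sort model names by ascending average (lower is better)."""
    return sorted(models, key=lambda name: average(models[name]))


if __name__ == "__main__":
    for position, name in enumerate(rank(SCORES), start=1):
        print(position, name, average(SCORES[name]))
```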
GigaSpeechBench Vertical Domain Evaluation
The following results report GigaSpeechBench vertical-domain performance for the current X-ASR-zh-en release. Values are WER/CER percentages; lower is better. Domain abbreviations follow the GigaSpeechBench vertical-domain labels.
CH
| Mode | Chunk size | ARG | AIT | ART | BIO | ECM | ENG | ENT | FIN | HUM | LAW | MED | MIL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Streaming | 160 ms | 9.88 | 6.76 | 4.39 | 7.32 | 4.13 | 3.58 | 8.45 | 3.23 | 10.42 | 6.58 | 4.25 | 2.55 |
| Streaming | 480 ms | 8.67 | 6.17 | 3.60 | 6.22 | 3.78 | 3.04 | 7.04 | 2.78 | 9.43 | 5.84 | 3.76 | 2.11 |
| Streaming | 960 ms | 8.00 | 5.69 | 3.44 | 6.10 | 3.69 | 2.88 | 6.71 | 2.72 | 9.07 | 5.58 | 3.69 | 2.11 |
| Streaming | 1920 ms | 7.24 | 5.58 | 3.27 | 5.82 | 3.48 | 2.74 | 6.55 | 2.57 | 8.59 | 4.97 | 3.53 | 1.94 |
| Offline | - | 6.56 | 4.54 | 2.77 | 5.04 | 2.99 | 2.32 | 6.02 | 1.94 | 7.64 | 4.20 | 2.90 | 1.68 |
EN
| Mode | Chunk size | ARG | AIT | ART | BIO | ECM | ENG | ENT | FIN | HUM | LAW | MED | MIL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Streaming | 160 ms | 5.29 | 8.57 | 8.55 | 7.31 | 4.33 | 5.01 | 16.25 | 5.58 | 7.36 | 13.39 | 6.03 | 6.20 |
| Streaming | 480 ms | 4.62 | 8.40 | 7.73 | 6.12 | 4.19 | 4.65 | 14.50 | 5.21 | 6.79 | 11.51 | 5.59 | 6.02 |
| Streaming | 960 ms | 4.58 | 8.35 | 7.45 | 6.00 | 4.13 | 4.44 | 13.99 | 5.12 | 6.58 | 10.86 | 5.52 | 6.04 |
| Streaming | 1920 ms | 4.33 | 8.32 | 6.90 | 5.89 | 4.00 | 4.37 | 13.61 | 4.98 | 6.39 | 10.52 | 5.45 | 5.78 |
| Offline | - | 4.09 | 8.28 | 6.73 | 5.48 | 4.12 | 4.30 | 12.30 | 4.94 | 6.17 | 10.41 | 5.35 | 5.61 |
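Demo
A sherpa-onnx-based online demo is available through the Online Demo link at the top of this page.
A demo video is provided in the demo/ directory of this repository.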
Download
Download with HF CLI

    hf download GilgameshWind/X-ASR-zh-en \
      --local-dir ./x-asr-zh-en

Clone with Git LFS

    git lfs install
    git clone https://huggingface.co/GilgameshWind/X-ASR-zh-en
    cd X-ASR-zh-en
    git lfs pull

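
When only one streaming variant is needed, a partial download avoids fetching every ONNX file. The following is a sketch using `snapshot_download` from the `huggingface_hub` Python package with `allow_patterns`; the pattern list is an assumption based on the paths named in this model card, and the helper names are hypothetical.

```python
def variant_patterns(chunk_ms: int) -> list[str]:
    """Glob patterns for one streaming variant plus the deployment scripts."""
    return [
        f"deployment/models/chunk-{chunk_ms}ms-model/*",
        "deployment/infer_and_client/*",
        "deployment/requirements.txt",
        "deployment/README.md",
    ]


def download_variant(chunk_ms: int, local_dir: str = "./x-asr-zh-en") -> str:
    """Fetch only the files matched by variant_patterns(); returns the local path."""
    # Imported lazily so variant_patterns() stays usable without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="GilgameshWind/X-ASR-zh-en",
        allow_patterns=variant_patterns(chunk_ms),
        local_dir=local_dir,
    )


if __name__ == "__main__":
    print(download_variant(160))
```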
Deployment
The recommended runtime is sherpa-onnx. The shortest path is to use the deployment package in this repository.

    cd deployment
    python -m venv .venv
    source .venv/bin/activate
    python -m pip install --upgrade pip
    python -m pip install -r requirements.txt

Start a CPU streaming server with the 160 ms model:

    python infer_and_client/sherpa_streaming_server.py \
      --host 0.0.0.0 \
      --port 8766 \
      --tokens models/chunk-160ms-model/tokens.txt \
      --encoder models/chunk-160ms-model/encoder-160ms.onnx \
      --decoder models/chunk-160ms-model/decoder-160ms.onnx \
      --joiner models/chunk-160ms-model/joiner-160ms.onnx \
      --provider cpu \
      --sample-rate 16000 \
      --feature-dim 80 \
      --num-threads 1 \
      --decoding-method greedy_search \
      --model-type zipformer2 \
      --enable-endpoint-detection 0 \
      --text-format none

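
For experiments that do not need a WebSocket server, the same four files can in principle be loaded in-process through the sherpa-onnx Python API. The sketch below mirrors the flags above with `OnlineRecognizer.from_transducer`. It has not been checked against this repository's server script, the helper names are hypothetical, and it assumes a mono 16-bit WAV file and a sherpa-onnx version that detects the model type on its own. For brevity it feeds the whole file at once rather than chunk by chunk.

```python
import wave
from pathlib import Path


def recognizer_kwargs(model_dir: str, chunk_ms: int) -> dict:
    """Keyword arguments mirroring the server flags shown above."""
    d = Path(model_dir)
    return {
        "tokens": str(d / "tokens.txt"),
        "encoder": str(d / f"encoder-{chunk_ms}ms.onnx"),
        "decoder": str(d / f"decoder-{chunk_ms}ms.onnx"),
        "joiner": str(d / f"joiner-{chunk_ms}ms.onnx"),
        "num_threads": 1,
        "sample_rate": 16000,
        "feature_dim": 80,
        "decoding_method": "greedy_search",
        "provider": "cpu",
    }


def transcribe(wav_path: str, model_dir: str, chunk_ms: int) -> str:
    """Decode one WAV file in-process and return the recognized text."""
    # Lazy imports: both packages are only needed when actually decoding.
    import numpy as np
    import sherpa_onnx

    recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
        **recognizer_kwargs(model_dir, chunk_ms)
    )
    with wave.open(wav_path, "rb") as f:
        assert f.getnchannels() == 1 and f.getsampwidth() == 2, "expects mono 16-bit PCM"
        sample_rate = f.getframerate()
        pcm = np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)
    samples = pcm.astype(np.float32) / 32768.0  # scale to [-1, 1)

    stream = recognizer.create_stream()
    stream.accept_waveform(sample_rate, samples)
    stream.input_finished()  # signal that no more audio will arrive
    while recognizer.is_ready(stream):
        recognizer.decode_stream(stream)
    return recognizer.get_result(stream)
```

A call such as `transcribe("/path/to/test.wav", "models/chunk-160ms-model", 160)` would be run from the deployment directory.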
Test it with a WAV file:

    python infer_and_client/sherpa_streaming_client.py \
      --server-uri ws://127.0.0.1:8766 \
      --wav /path/to/test.wav \
      --chunk-ms 100 \
      --simulate-realtime 1

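
With `--chunk-ms 100` and 16 kHz audio, each chunk holds 1600 samples, and `--simulate-realtime 1` presumably paces the chunks at wall-clock speed. The sketch below illustrates only that chunking arithmetic. It is not the repository's client, it does not implement the server's WebSocket protocol, and the function names are hypothetical.

```python
import time
from typing import Callable, Iterator


def samples_per_chunk(sample_rate: int, chunk_ms: int) -> int:
    """Number of samples in one chunk, e.g. 16000 Hz * 100 ms -> 1600."""
    return sample_rate * chunk_ms // 1000


def iter_chunks(pcm: bytes, sample_rate: int, chunk_ms: int,
                sample_width: int = 2) -> Iterator[bytes]:
    """Yield consecutive chunks of raw PCM bytes; the last one may be shorter."""
    step = samples_per_chunk(sample_rate, chunk_ms) * sample_width
    for start in range(0, len(pcm), step):
        yield pcm[start:start + step]


def stream_realtime(pcm: bytes, send: Callable[[bytes], None],
                    sample_rate: int = 16000, chunk_ms: int = 100) -> None:
    """Call send(chunk) for each chunk, sleeping chunk_ms between sends."""
    for chunk in iter_chunks(pcm, sample_rate, chunk_ms):
        send(chunk)
        time.sleep(chunk_ms / 1000.0)
```

Here `send` stands in for whatever transport is used; passing `print` or a list's `append` is enough to inspect the chunk sizes.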
For complete runtime options, see deployment/README.md.
Intended Use and Limitations
- This release is intended for Chinese-English ASR research, evaluation, demos, and deployment experiments.
- The current release focuses on streaming and offline-streaming unified recognition.
- Production latency depends on hardware, concurrency, audio chunking, endpointing, and server configuration.
- The technical report with training details, evaluation protocol, ablations, and additional analysis is coming soon.
Citation
The X-ASR-zh-en technical report is coming soon. Please cite the report once it is released. For now, refer to this model card and the GitHub project page.
License
This model is released under the Apache-2.0 License.
Acknowledgements
This model is trained with icefall and deployed with sherpa-onnx.
- icefall: https://github.com/k2-fsa/icefall
- sherpa-onnx: https://github.com/k2-fsa/sherpa-onnx