---
library_name: transformers
tags:
- automatic-speech-recognition
- speech
- audio
- transformers
- pytorch
- safetensors
- vllm
- ark-asr
pipeline_tag: automatic-speech-recognition
language:
- zh
- en
- de
- ja
- fr
- ko
- es
- pl
- it
- ro
- hu
- cs
- nl
- fi
- hr
- sk
- sl
- et
- lt
license: apache-2.0
repository: https://github.com/AutoArk/open-audio-opd
---
<div align="center">

# ARK-ASR-3B: State-of-the-Art Multilingual ASR

[GitHub](https://github.com/AutoArk/open-audio-opd)
[arXiv](https://arxiv.org/abs/2605.28139)
[License: Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)

</div>
> **TL;DR** ARK-ASR-3B is a multilingual automatic speech recognition model. It achieves current state-of-the-art results on the Hugging Face Open ASR Leaderboard English short-form benchmark, with an average WER of **5.04%** and RTFx of **490.98** across AMI, Earnings22, GigaSpeech, LibriSpeech (clean and other), SPGISpeech, and VoxPopuli. The accompanying training, inference, and evaluation code is available at [AutoArk/open-audio-opd](https://github.com/AutoArk/open-audio-opd).

## Abstract

ARK-ASR-3B is a 3B-scale, audio-capable autoregressive Transformer model for automatic speech recognition.
It combines a Whisper-style audio encoder, an MLP adapter, and a Qwen decoder, and is loaded through custom `arkasr` remote code.
ARK-ASR currently supports ASR in 19 languages.

## Supported Languages

Chinese, English, German, Japanese, French, Korean, Spanish, Polish, Italian, Romanian, Hungarian, Czech, Dutch, Finnish, Croatian, Slovak, Slovene, Estonian, and Lithuanian.
## Model Overview

<div align="center">
<img src="figures/ark_asr_architecture.png" width="95%" alt="ARK-ASR architecture"/>
<br>
<p><strong>Figure 1: ARK-ASR architecture.</strong> Audio is encoded by a Whisper-style encoder with RoPE, merged through an MLP adapter, and injected into a Qwen decoder by replacing audio placeholder token embeddings before transcript generation.</p>
</div>

- **Model size:** 3B-scale decoder LLM with a dedicated Whisper-style audio encoder and MLP adapter
- **Task:** automatic speech recognition
- **Architecture:** audio-capable autoregressive Transformer model with custom `arkasr` remote code
- **Checkpoint format:** `safetensors`
- **Sampling rate:** 16 kHz
- **Recommended inference code:** [`scripts/infer/ark_asr_transformers.py`](https://github.com/AutoArk/open-audio-opd/blob/main/scripts/infer/ark_asr_transformers.py)
- **vLLM serving:** [`scripts/vllm/ark_asr_vllm`](https://github.com/AutoArk/open-audio-opd/tree/master/scripts/vllm/ark_asr_vllm)

Load the model with `trust_remote_code=True`. The official inference script handles the processor, tokenizer, audio prompt format, generation cleanup, and ASR token filtering.
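
As a rough illustration of the injection step described in Figure 1, the sketch below swaps audio placeholder positions in a token-embedding sequence for adapter outputs. It uses plain Python lists instead of tensors. `AUDIO_PLACEHOLDER_ID`, `inject_audio_embeddings`, and the toy vectors are hypothetical and are not taken from the `arkasr` remote code.

```python
from typing import List, Sequence

# Hypothetical id: the real placeholder token is defined by the model's tokenizer.
AUDIO_PLACEHOLDER_ID = -1


def inject_audio_embeddings(
    token_ids: Sequence[int],
    token_embeddings: Sequence[List[float]],
    audio_embeddings: Sequence[List[float]],
) -> List[List[float]]:
    """Return a copy of token_embeddings with placeholder rows replaced by audio rows."""
    n_slots = sum(1 for token_id in token_ids if token_id == AUDIO_PLACEHOLDER_ID)
    if n_slots != len(audio_embeddings):
        raise ValueError(
            f"{n_slots} placeholder slots but {len(audio_embeddings)} audio embeddings"
        )
    audio_iter = iter(audio_embeddings)
    merged = []
    for token_id, embedding in zip(token_ids, token_embeddings):
        if token_id == AUDIO_PLACEHOLDER_ID:
            # Placeholder position: take the next audio embedding, in order.
            merged.append(list(next(audio_iter)))
        else:
            # Ordinary text token: keep its embedding unchanged.
            merged.append(list(embedding))
    return merged


# Toy sequence: text token, two audio placeholders, text token.
token_ids = [5, AUDIO_PLACEHOLDER_ID, AUDIO_PLACEHOLDER_ID, 9]
token_embeddings = [[0.1, 0.1], [0.0, 0.0], [0.0, 0.0], [0.9, 0.9]]
audio_embeddings = [[1.0, 2.0], [3.0, 4.0]]
merged = inject_audio_embeddings(token_ids, token_embeddings, audio_embeddings)
print(merged)
```

The decoder then attends over the merged sequence as if every position were a regular token embedding. The slot-count check reflects the assumption that the prompt contains exactly one placeholder per audio embedding.
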
## Performance

The following results are from the Hugging Face [Open ASR Leaderboard](https://huggingface.co/datasets/hf-audio/open-asr-leaderboard). Lower WER is better. ARK-ASR-3B reaches the current state of the art on this English short-form benchmark. In the table, LS denotes LibriSpeech.

### English WER

| Model | AMI | Earnings22 | GigaSpeech | LS Clean | LS Other | SPGISpeech | VoxPopuli | Avg |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| ARK-ASR-3B | **8.79%** | **8.23%** | **6.98%** | **1.03%** | **2.35%** | **2.46%** | **5.47%** | **5.04%** |
| ARK-ASR-0.6B | 10.02% | 9.77% | 8.00% | 1.53% | 3.51% | 2.63% | 6.31% | 5.97% |
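
The Avg column is consistent with an unweighted mean over the seven per-dataset columns. The short check below recomputes it from the values in the table. `wer_table` and `macro_average` are illustrative names, not part of any released code.

```python
# Per-dataset WER values (in percent) copied from the English WER table.
wer_table = {
    "ARK-ASR-3B": [8.79, 8.23, 6.98, 1.03, 2.35, 2.46, 5.47],
    "ARK-ASR-0.6B": [10.02, 9.77, 8.00, 1.53, 3.51, 2.63, 6.31],
}


def macro_average(values):
    """Unweighted mean: every dataset counts equally, regardless of its size."""
    return sum(values) / len(values)


averages = {name: round(macro_average(values), 2) for name, values in wer_table.items()}
print(averages)
```
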
### Chinese CER

| Model | AISHELL-1 | WenetSpeech test-meeting | WenetSpeech test-net |
| --- | ---: | ---: | ---: |
| ARK-ASR-3B | **1.80%** | **4.97%** | **4.58%** |
| ARK-ASR-0.6B | 2.02% | 5.92% | 4.96% |
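
For readers new to the metrics in these tables, here is a minimal sketch of WER and CER as edit distance divided by reference length, over words and characters respectively. It applies no text normalization, which real evaluation pipelines typically do before scoring, so it will not reproduce leaderboard numbers. The function names and sample sentences are illustrative.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate for whitespace-delimited text."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring spaces."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, list(hypothesis.replace(" ", ""))) / len(ref_chars)


# One substitution (sat -> sit) and one deletion (the) against six reference words.
english_wer = wer("the cat sat on the mat", "the cat sit on mat")
# One substituted character against six reference characters.
chinese_cer = cer("今天天气很好", "今天天汽很好")
print(f"WER={english_wer:.3f} CER={chinese_cer:.3f}")
```
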
## Inference

Run ASR inference with Hugging Face Transformers:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

model_path = "AutoArk-AI/ARK-ASR-3B"
audio_path = "assets/libai.wav"
device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.bfloat16 if device == "cuda" else torch.float32

# The custom `arkasr` code ships with the checkpoint, so remote code must be trusted.
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch_dtype,
    attn_implementation="sdpa",
).to(device)
model.eval()


def build_bad_words_ids(tokenizer):
    """Block every special or `<...>`-style added token except EOS during generation."""
    eos_ids = tokenizer.eos_token_id
    keep_ids = {eos_ids} if isinstance(eos_ids, int) else set(eos_ids or [])
    bad_ids = set(tokenizer.all_special_ids) - keep_ids
    bad_ids.update(
        token_id
        for token, token_id in tokenizer.get_added_vocab().items()
        if token.startswith("<") and token.endswith(">") and token_id not in keep_ids
    )
    return [[token_id] for token_id in sorted(bad_ids)]


# One user turn holding the audio clip and the transcription instruction.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "path": audio_path},
            {"type": "text", "text": "Please transcribe this audio."},
        ],
    }
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
    sampling_rate=16000,
    audio_padding="longest",
    text_kwargs={"padding": "longest"},
    audio_max_length=30 * 16000,
)
inputs = inputs.to(device)
# Match the audio features to the model dtype (bfloat16 on GPU).
if "audios" in inputs:
    inputs["audios"] = inputs["audios"].to(dtype=torch_dtype)

bad_words_ids = build_bad_words_ids(tokenizer)

with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        do_sample=False,
        max_new_tokens=256,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
        bad_words_ids=bad_words_ids,
    )

# Drop the prompt tokens and decode only the newly generated transcript.
decoded_outputs = tokenizer.batch_decode(
    outputs[:, inputs.input_ids.shape[1] :],
    skip_special_tokens=True,
)
print(decoded_outputs)
```
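
The example above passes `sampling_rate=16000` and `audio_max_length=30 * 16000`, that is, 480,000 samples or 30 seconds at 16 kHz. Before running inference it can help to check a file against those two values. The sketch below does this for WAV files with the standard-library `wave` module. `inspect_wav` and `write_silence` are hypothetical helpers, and how the processor treats other sample rates or longer clips is not covered here.

```python
import os
import tempfile
import wave

TARGET_SAMPLE_RATE = 16_000  # mirrors sampling_rate=16000 in the example above
MAX_SECONDS = 30             # mirrors audio_max_length=30 * 16000


def inspect_wav(path):
    """Report sample rate and duration from a WAV header, plus two simple flags."""
    with wave.open(path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        duration_s = wav_file.getnframes() / sample_rate
    return {
        "sample_rate": sample_rate,
        "duration_s": duration_s,
        "needs_resampling": sample_rate != TARGET_SAMPLE_RATE,
        "longer_than_max": duration_s > MAX_SECONDS,
    }


def write_silence(path, sample_rate, seconds):
    """Hypothetical helper: write mono 16-bit silence so the demo is self-contained."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * int(sample_rate * seconds))


# Demo on a temporary 2-second, 8 kHz file.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.wav")
    write_silence(demo_path, sample_rate=8_000, seconds=2)
    report = inspect_wav(demo_path)
print(report)
```
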
For batch JSONL inference, use the open-source inference code:

```bash
git clone https://github.com/AutoArk/open-audio-opd
cd open-audio-opd
pip install -e .
```

The input JSONL should contain one ASR sample per line:

```json
{"audio":"/path/to/audio.wav","text":"","task":"asr","begin_time":-1,"end_time":-1}
```
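
A manifest in this shape can be generated with a few lines of standard-library Python. The sketch below copies the field names and default values from the sample line above. `build_manifest_line`, `write_manifest`, and the audio paths are illustrative.

```python
import json
import os
import tempfile


def build_manifest_line(audio_path):
    """Serialize one ASR sample using the fields from the sample line."""
    record = {
        "audio": audio_path,
        "text": "",
        "task": "asr",
        "begin_time": -1,
        "end_time": -1,
    }
    return json.dumps(record, ensure_ascii=False, separators=(",", ":"))


def write_manifest(audio_paths, output_path):
    """Write one JSON object per line and return the number of samples written."""
    with open(output_path, "w", encoding="utf-8") as handle:
        for audio_path in audio_paths:
            handle.write(build_manifest_line(audio_path) + "\n")
    return len(audio_paths)


# Demo: write a two-sample manifest to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    manifest_path = os.path.join(tmp_dir, "input.jsonl")
    num_written = write_manifest(["/path/to/a.wav", "/path/to/b.wav"], manifest_path)
    with open(manifest_path, encoding="utf-8") as handle:
        manifest_lines = handle.read().splitlines()
print(manifest_lines[0])
```
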
Then run batch inference:

```bash
python scripts/infer/ark_asr_transformers.py \
  --input /path/to/input.jsonl \
  --output runs/infer/predictions.jsonl \
  --model_path AutoArk-AI/ARK-ASR-3B \
  --processor_path AutoArk-AI/ARK-ASR-3B \
  --batch_size 40 \
  --dtype bfloat16 \
  --attn_impl sdpa
```

The output JSONL preserves input metadata and adds:

- `pred_text`: cleaned prediction text for downstream evaluation
- `pred_text_raw`: raw decoded generation before cleanup
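
The predictions file can be read back line by line. The sketch below is a minimal reader that assumes the `audio` field from the input is carried through, as the note on preserved metadata suggests. The two sample records are made up for illustration and are not real model output.

```python
import json


def iter_predictions(lines):
    """Yield (audio, pred_text, pred_text_raw) for each non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield record.get("audio"), record["pred_text"], record["pred_text_raw"]


# Made-up records in the documented shape; in practice, iterate over
# open("runs/infer/predictions.jsonl", encoding="utf-8") instead.
sample_lines = [
    '{"audio": "/path/to/a.wav", "task": "asr", "pred_text": "hello world", "pred_text_raw": " hello world "}',
    "",
    '{"audio": "/path/to/b.wav", "task": "asr", "pred_text": "good morning", "pred_text_raw": "good morning"}',
]
predictions = list(iter_predictions(sample_lines))
for audio, pred_text, _ in predictions:
    print(f"{audio}\t{pred_text}")
```
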
## vLLM Online Serving

ARK-ASR can also be deployed as a vLLM-backed online ASR service with the adapter in [`scripts/vllm/ark_asr_vllm`](https://github.com/AutoArk/open-audio-opd/tree/master/scripts/vllm/ark_asr_vllm). The service exposes both a compact `/asr` endpoint and an OpenAI-style `/v1/audio/transcriptions` endpoint.

Clone and install the serving code:

```bash
git clone https://github.com/AutoArk/open-audio-opd
cd open-audio-opd
pip install -e ".[vllm]"
```

Start the service:

```bash
MODEL=AutoArk-AI/ARK-ASR-3B \
GPU=0 \
PORT=8025 \
scripts/vllm/deploy_ark_asr_vllm_service.sh start
```

Check the service:

```bash
scripts/vllm/deploy_ark_asr_vllm_service.sh status
curl -sS http://127.0.0.1:8025/health
curl -sS http://127.0.0.1:8025/token-mask
```
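
The `/health` check can also be scripted, for example to wait for the service before sending requests. The sketch below uses only the standard library and treats any HTTP 200 response as healthy. That criterion is an assumption, since the response body of `/health` is not documented here. `service_url` and `is_healthy` are illustrative names.

```python
import http.client
import urllib.request


def service_url(path, host="127.0.0.1", port=8025):
    """Build a URL for the local service; the defaults match the commands above."""
    return f"http://{host}:{port}{path}"


def is_healthy(url, timeout=2.0):
    """Return True if the endpoint answers with HTTP 200 (assumed health criterion)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts, and HTTP error statuses all count as unhealthy.
        return False


if __name__ == "__main__":
    print(is_healthy(service_url("/health")))
```
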
Run one transcription request:

```bash
curl -sS -X POST http://127.0.0.1:8025/asr \
  -F file=@/path/to/audio.wav \
  -F max_new_tokens=256
```

OpenAI-style transcription endpoint:

```bash
curl -sS -X POST http://127.0.0.1:8025/v1/audio/transcriptions \
  -F file=@/path/to/audio.wav \
  -F model=ark-asr
```

Stop the service:

```bash
scripts/vllm/deploy_ark_asr_vllm_service.sh stop
```

The vLLM adapter registers the custom `arkasr` model, loads the local processor/tokenizer with `trust_remote_code=True`, applies generation-time token masking for non-ASR control tokens, and keeps `<|im_end|>` as the stop token. Service logs and PID files are written under `runs/vllm/`.
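
Generation-time token masking generally means setting the logits of blocked token ids to negative infinity before the next token is chosen, so those tokens can never be selected. The toy sketch below shows the idea with greedy decoding over a four-token vocabulary. It is not the adapter's implementation, and the ids, logits, and function names are hypothetical.

```python
import math


def mask_logits(logits, blocked_ids):
    """Return a copy of logits with blocked token ids set to -inf."""
    return [-math.inf if index in blocked_ids else value for index, value in enumerate(logits)]


def greedy_pick(logits):
    """Index of the highest logit, as in greedy decoding."""
    return max(range(len(logits)), key=logits.__getitem__)


# Toy vocabulary: id 1 stands in for a control token that should never be emitted,
# id 2 stands in for the stop token, which must stay unblocked.
logits = [0.2, 3.1, 1.5, 2.0]
blocked_ids = {1}

unmasked_choice = greedy_pick(logits)
masked_choice = greedy_pick(mask_logits(logits, blocked_ids))
print(unmasked_choice, masked_choice)
```

Keeping the stop token out of the blocked set matters: if it were masked, generation could only end by hitting the token limit.
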
## Evaluation

The reported leaderboard numbers were computed with the Hugging Face [`open_asr_leaderboard`](https://github.com/huggingface/open_asr_leaderboard) evaluation code.

For local J/WER evaluation, the repository also includes this entrypoint:

```bash
python scripts/eval/eval_jwer_ark_asr_transformers.py \
  --input /path/to/test.jsonl \
  --output runs/eval/result.jsonl \
  --model_path AutoArk-AI/ARK-ASR-3B \
  --processor_path AutoArk-AI/ARK-ASR-3B \
  --batch_size 40 \
  --dtype bfloat16 \
  --attn_impl sdpa
```

No evaluation audio or dataset files are bundled with this model repository.
## Acknowledgements

The training code is based on [THUNLP/OPD](https://github.com/thunlp/OPD/) and [verl](https://github.com/volcengine/verl). The OPD recipe uses a stronger ASR teacher to score online student rollouts.
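
To make "a teacher scores student rollouts" concrete, the sketch below shows one common formulation of on-policy distillation: for each token the student sampled, the score is the teacher's log-probability of that token minus the student's. This is an assumption for illustration only. The exact objective used for ARK-ASR is not stated here, and the log-probabilities are made-up numbers.

```python
def per_token_scores(student_logprobs, teacher_logprobs):
    """Teacher minus student log-probability for each token the student sampled.

    Positive scores mark tokens the teacher finds more likely than the student does.
    """
    if len(student_logprobs) != len(teacher_logprobs):
        raise ValueError("both sequences must cover the same sampled tokens")
    return [teacher - student for student, teacher in zip(student_logprobs, teacher_logprobs)]


# Made-up log-probabilities for a three-token student rollout.
student_logprobs = [-0.1, -2.0, -0.5]
teacher_logprobs = [-0.2, -0.3, -0.5]

scores = per_token_scores(student_logprobs, teacher_logprobs)
mean_score = sum(scores) / len(scores)
print(scores, mean_score)
```

Averaged over sampled tokens, the negated score is a sample-based estimate of the reverse KL divergence from student to teacher, which is why this signal pulls the student toward the teacher on the student's own outputs.
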
## Citation

If you find ARK-ASR or open-audio-opd useful, please cite:

```bibtex
@misc{lin2026dataefficientopd,
  title={Data-Efficient On-Policy Distillation for Automatic Speech Recognition},
  author={Lin, Yu and Wang, Yiming and Cai, Runyuan and Zeng, Xiaodong},
  year={2026},
  eprint={2605.28139},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2605.28139}
}
```