---
library_name: transformers
tags:
- automatic-speech-recognition
- speech
- audio
- transformers
- pytorch
- safetensors
- ark-asr
pipeline_tag: automatic-speech-recognition
language:
- zh
- en
- de
- ja
- fr
- ko
license: apache-2.0
repository: https://github.com/AutoArk/open-audio-opd
---
<div align="center">
# ARK-ASR-0.6B: Efficient Multilingual ASR with Online Policy Distillation
[GitHub](https://github.com/AutoArk/open-audio-opd)
[License: Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)
</div>
> **TL;DR** ARK-ASR-0.6B is a 0.6B-parameter automatic speech recognition model trained with teacher-data adaptation and on-policy distillation. The accompanying training, inference, and evaluation code is available at [AutoArk/open-audio-opd](https://github.com/AutoArk/open-audio-opd).
## Abstract
ARK-ASR is an ASR student model optimized with the **teacher-data adaptation + online policy distillation (TD + OPD)** recipe from `open-audio-opd`.
Instead of relying only on static supervised transcripts, OPD lets the student generate transcripts online and trains it against token-level teacher scores computed on the student's own generated transcripts. This checkpoint corresponds to the `Ark-Base+TD+OPD (0.6B)` model reported in the open-audio-opd results.
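To make the idea concrete, here is a minimal, framework-free sketch of a token-level distillation signal computed on a student-generated sequence. It assumes a per-token reverse KL divergence, KL(student || teacher), which is a common choice for this kind of distillation; this card does not state the exact objective used in `open-audio-opd`, and the function names and toy distributions are hypothetical.

```python
import math

def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) for one token position over a shared vocabulary."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
        if p > 0.0
    )

def rollout_distillation_loss(student_steps, teacher_steps):
    """Average per-token reverse KL over a sequence sampled from the student.

    Each argument holds one probability distribution per generated token.
    The teacher distributions are assumed to be computed on the same
    student-generated prefix (hypothetical setup, not the repository's API).
    """
    if len(student_steps) != len(teacher_steps):
        raise ValueError("student and teacher must score the same tokens")
    per_token = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)

# Toy example: a 3-token rollout over a 3-entry vocabulary.
student = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1], [0.1, 0.1, 0.8]]
teacher = [[0.7, 0.2, 0.1], [0.2, 0.7, 0.1], [0.1, 0.1, 0.8]]
loss = rollout_distillation_loss(student, teacher)
print(f"mean token-level reverse KL: {loss:.4f}")
```

Only the second position contributes to the loss here, because that is the only step where the student and teacher distributions differ.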
ARK-ASR currently supports Chinese, English, German, Japanese, French, and Korean ASR.
## Model Overview
<div align="center">
<img src="figures/ark_asr_architecture.png" width="95%" alt="ARK-ASR architecture"/>
<br>
<p><strong>Figure 1: ARK-ASR architecture.</strong> Audio is encoded by a Whisper-style encoder with RoPE, merged through an MLP adapter, and injected into a Qwen2 decoder by replacing audio placeholder token embeddings before transcript generation.</p>
</div>
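
The injection step in Figure 1 can be illustrated with a small, framework-free sketch. It assumes the prompt contains one placeholder token per audio feature vector produced by the adapter; `AUDIO_TOKEN_ID`, the function name, and the toy vectors are hypothetical and are not taken from the `arkasr` remote code.

```python
AUDIO_TOKEN_ID = 9999  # hypothetical placeholder id, not the model's real one

def inject_audio_embeddings(input_ids, text_embeddings, audio_features):
    """Replace the embedding at every audio placeholder with an audio feature.

    input_ids: token ids of the prompt.
    text_embeddings: one embedding vector per token id (same length).
    audio_features: adapter outputs, one vector per placeholder, in order.
    """
    positions = [i for i, tok in enumerate(input_ids) if tok == AUDIO_TOKEN_ID]
    if len(positions) != len(audio_features):
        raise ValueError("placeholder count must match the number of audio features")
    merged = [list(vec) for vec in text_embeddings]  # copy, keep the input intact
    for pos, feature in zip(positions, audio_features):
        merged[pos] = list(feature)
    return merged

input_ids = [1, AUDIO_TOKEN_ID, AUDIO_TOKEN_ID, 42]
text_embeddings = [[0.1, 0.1], [0.0, 0.0], [0.0, 0.0], [0.4, 0.4]]
audio_features = [[1.0, 2.0], [3.0, 4.0]]
merged = inject_audio_embeddings(input_ids, text_embeddings, audio_features)
print(merged)
```

The decoder then consumes the merged sequence like any other embedding sequence, which is why the text positions are left untouched.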
- **Model size:** 0.6B parameters
- **Task:** automatic speech recognition
- **Architecture:** audio-capable autoregressive Transformers model with custom `arkasr` remote code
- **Checkpoint format:** `safetensors`
- **Sampling rate:** 16 kHz
- **Recommended inference code:** [`scripts/infer/ark_asr_transformers.py`](https://github.com/AutoArk/open-audio-opd/blob/main/scripts/infer/ark_asr_transformers.py)
The model should be loaded with `trust_remote_code=True`. The official inference script handles the processor, tokenizer, audio prompt format, generation cleanup, and ASR token filtering.
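Since the model expects 16 kHz audio, it may help to check files before inference. The sketch below uses only the Python standard library and works for uncompressed PCM WAV files; the helper names and the temporary demo file are illustrative, and whether the official processor resamples automatically is not stated in this card.

```python
import os
import tempfile
import wave

EXPECTED_SAMPLE_RATE = 16_000  # 16 kHz, as listed in the model overview

def wav_sample_rate(path):
    """Return the sample rate (Hz) of a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()

def is_16khz(path):
    """True if the WAV file at `path` is sampled at 16 kHz."""
    return wav_sample_rate(path) == EXPECTED_SAMPLE_RATE

# Demo: write one second of 16-bit mono silence at 16 kHz, then check it.
demo_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)  # 2 bytes = 16-bit samples
    wav_file.setframerate(EXPECTED_SAMPLE_RATE)
    wav_file.writeframes(b"\x00\x00" * EXPECTED_SAMPLE_RATE)

print(demo_path, "is 16 kHz:", is_16khz(demo_path))
```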
## Performance
The following results are from the `open-audio-opd` evaluation. Lower CER/WER is better. Bold numbers mark the best result within the 0.6B group.
| Model | aishell-1 (CER) | Wenet-meeting (CER) | Wenet-net (CER) | Libri-clean (WER) | Libri-other (WER) |
| --- | ---: | ---: | ---: | ---: | ---: |
| *0.6B models* | | | | | |
| Ark-Base (0.6B) | 3.48% | 10.22% | 7.74% | 3.75% | 7.17% |
| Ark-Base+OPD (0.6B) | 3.00% | 7.18% | 6.13% | 2.88% | 5.50% |
| **Ark-Base+TD+OPD (0.6B)** | **1.95%** | 5.92% | **5.39%** | **2.45%** | **4.56%** |
| Qwen3-ASR-0.6B | 2.07% | **5.57%** | 5.45% | 2.81% | 5.05% |
| *Larger reference model* | | | | | |
| Qwen3-ASR-1.7B | 1.50% | 4.69% | 4.55% | 2.20% | 4.05% |
`Ark-Base` is the 0.6B supervised ASR checkpoint trained on 100k hours of ASR audio. `TD` denotes teacher-data adaptation using 2,000 hours of teacher-generated ASR data. `OPD` denotes on-policy distillation with a Qwen-ASR teacher.
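For readers unfamiliar with the metrics, WER and CER are both the Levenshtein edit distance between hypothesis and reference divided by the reference length, counted over words and characters respectively. The sketch below is a plain-Python illustration; it is not the repository's evaluation script and applies no text normalization, so its numbers would not necessarily match the table above.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate; assumes a non-empty, whitespace-tokenized reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate with spaces ignored; assumes a non-empty reference."""
    ref_chars = list(reference.replace(" ", ""))
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)

print(wer("the cat sat on the mat", "the cat sat on mat"))
print(cer("今天天气很好", "今天天汽很好"))
```

Both toy pairs contain a single error against a six-unit reference, so each call returns one sixth.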
## Inference
Run ASR inference with Hugging Face Transformers:
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

model_path = "AutoArk-AI/ARK-ASR-0.6B"
audio_path = "assets/libai.wav"

# Use half precision on GPU and full precision on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if device == "cuda" else torch.float32

# The custom `arkasr` code requires trust_remote_code=True.
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch_dtype,
    attn_implementation="sdpa",
).to(device)

# One user turn containing the audio file and the transcription instruction.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "path": audio_path},
            {"type": "text", "text": "Please transcribe this audio."},
        ],
    }
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
)
inputs = inputs.to(device)
# Cast the audio features to the same dtype as the model weights.
if "audios" in inputs:
    inputs["audios"] = inputs["audios"].to(dtype=torch_dtype)

# Block every special token except EOS so they cannot appear in the transcript.
bad_words_ids = [
    [token_id]
    for token_id in tokenizer.all_special_ids
    if token_id != tokenizer.eos_token_id
]

outputs = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=256,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
    bad_words_ids=bad_words_ids,
)

# Decode only the newly generated tokens, skipping the prompt.
decoded_outputs = tokenizer.batch_decode(
    outputs[:, inputs.input_ids.shape[1] :],
    skip_special_tokens=True,
)
print(decoded_outputs)
```
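
Two details of the example above are worth spelling out: `bad_words_ids` blocks every special token except end-of-sequence during generation, and the final slice drops the prompt tokens so that only newly generated tokens are decoded. The toy sketch below mirrors both steps with plain lists; all token ids are made up for illustration.

```python
# Hypothetical ids standing in for tokenizer.all_special_ids and eos_token_id.
all_special_ids = [0, 1, 2, 3]
eos_token_id = 2

# Same construction as in the inference example: one single-token entry per
# special token, leaving EOS allowed so generation can still terminate.
bad_words_ids = [[token_id] for token_id in all_special_ids if token_id != eos_token_id]

# For a decoder-only model, generate() returns prompt + continuation
# for each sequence in the batch.
prompt_length = 4
outputs = [
    [0, 11, 12, 13, 101, 102, 2],
    [0, 21, 22, 23, 201, 2, 2],
]

# Equivalent of outputs[:, prompt_length:] for nested Python lists.
new_tokens = [sequence[prompt_length:] for sequence in outputs]
print(bad_words_ids)
print(new_tokens)
```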
For batch JSONL inference, use the open-source inference code:
```bash
git clone https://github.com/AutoArk/open-audio-opd
cd open-audio-opd
pip install -e .
```
The input JSONL should contain one ASR sample per line:
```json
{"audio":"/path/to/audio.wav","text":"","task":"asr","begin_time":-1,"end_time":-1}
```
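
A manifest in this format could be generated with a short script such as the sketch below. The helper name and the audio paths are hypothetical; the field names and default values are copied from the example line above, and this card does not define the meaning of `-1` for `begin_time` and `end_time`.

```python
import json
import os
import tempfile

def write_asr_manifest(audio_paths, manifest_path):
    """Write one JSON object per line, using the fields from the example above."""
    with open(manifest_path, "w", encoding="utf-8") as manifest:
        for audio_path in audio_paths:
            sample = {
                "audio": audio_path,
                "text": "",        # left empty, as in the example line
                "task": "asr",
                "begin_time": -1,  # value copied from the example line
                "end_time": -1,    # value copied from the example line
            }
            manifest.write(json.dumps(sample, ensure_ascii=False) + "\n")

# Demo with placeholder paths written to a temporary directory.
manifest_path = os.path.join(tempfile.mkdtemp(), "input.jsonl")
write_asr_manifest(["/path/to/a.wav", "/path/to/b.wav"], manifest_path)
with open(manifest_path, encoding="utf-8") as manifest:
    print(manifest.read())
```

The resulting file can then be passed to the inference script through `--input`: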
```bash
python scripts/infer/ark_asr_transformers.py \
  --input /path/to/input.jsonl \
  --output runs/infer/predictions.jsonl \
  --model_path AutoArk-AI/ARK-ASR-0.6B \
  --processor_path AutoArk-AI/ARK-ASR-0.6B \
  --batch_size 40 \
  --dtype float16 \
  --attn_impl sdpa
```
The output JSONL preserves input metadata and adds:
- `pred_text`: cleaned prediction text for downstream evaluation
- `pred_text_raw`: raw decoded generation before cleanup
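
The predictions file can be consumed with a few lines of standard-library Python. This is a sketch: the helper name is hypothetical, and the demo record is hand-written to match the fields described above rather than produced by the model.

```python
import json
import os
import tempfile

def load_predictions(path):
    """Yield (audio, pred_text, pred_text_raw) from an output JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            yield record.get("audio"), record["pred_text"], record["pred_text_raw"]

# Demo with a hand-written record; real files come from the inference script.
demo_path = os.path.join(tempfile.mkdtemp(), "predictions.jsonl")
with open(demo_path, "w", encoding="utf-8") as handle:
    handle.write(json.dumps({
        "audio": "/path/to/audio.wav", "text": "", "task": "asr",
        "begin_time": -1, "end_time": -1,
        "pred_text": "hello world", "pred_text_raw": " hello world",
    }) + "\n")

predictions = list(load_predictions(demo_path))
print(predictions)
```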
## Evaluation
The repository also includes a J/WER evaluation entrypoint:
```bash
python scripts/eval/eval_jwer_ark_asr_transformers.py \
  --input /path/to/test.jsonl \
  --output runs/eval/result.jsonl \
  --model_path AutoArk-AI/ARK-ASR-0.6B \
  --processor_path AutoArk-AI/ARK-ASR-0.6B \
  --batch_size 40 \
  --dtype float16 \
  --attn_impl sdpa
```
No evaluation audio or dataset files are bundled with this model repository.
## Acknowledgements
The training code is based on [THUNLP/OPD](https://github.com/thunlp/OPD/) and [verl](https://github.com/volcengine/verl). The OPD recipe uses a stronger ASR teacher to score online student rollouts.
## Citation
If you find ARK-ASR or open-audio-opd useful, please cite or link the project repository:
```bibtex
@misc{open_audio_opd_ark_asr,
title = {open-audio-opd: Industrial ASR Online Policy Distillation Training Code},
author = {AutoArk AI},
year = {2026},
howpublished = {\url{https://github.com/AutoArk/open-audio-opd}}
}
```