---
library_name: transformers
tags:
- automatic-speech-recognition
- speech
- audio
- transformers
- pytorch
- safetensors
- ark-asr
pipeline_tag: automatic-speech-recognition
language:
- zh
- en
- de
- ja
- fr
- ko
license: apache-2.0
repository: https://github.com/AutoArk/open-audio-opd
---

<div align="center">

# ARK-ASR-0.6B: Efficient Multilingual ASR with Online Policy Distillation

[GitHub](https://github.com/AutoArk/open-audio-opd)
[License: Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)

</div>

> **TL;DR** ARK-ASR-0.6B is a 0.6B-parameter automatic speech recognition model trained with teacher-data adaptation and on-policy distillation. The accompanying training, inference, and evaluation code is available at [AutoArk/open-audio-opd](https://github.com/AutoArk/open-audio-opd).

## Abstract

ARK-ASR is an ASR student model optimized with the **teacher-data adaptation + online policy distillation (TD + OPD)** recipe from `open-audio-opd`.

Instead of relying only on static supervised transcripts, OPD lets the student generate transcripts online and trains it on token-level teacher scores computed over the student's own generations. This checkpoint corresponds to the `Ark-Base+TD+OPD (0.6B)` model reported in the open-audio-opd results.

ARK-ASR currently supports Chinese, English, German, Japanese, French, and Korean ASR.

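To make the OPD idea above concrete, the sketch below computes a per-token reverse KL divergence between a student and a teacher distribution at positions the student itself generated. This is one common way to turn token-level teacher scores into a training signal. The source does not state the exact loss used in `open-audio-opd`, so the loss form, the function names, and the toy probabilities are all assumptions.

```python
import math

def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) for one next-token distribution."""
    return sum(
        s * (math.log(s + eps) - math.log(t + eps))
        for s, t in zip(student_probs, teacher_probs)
        if s > 0.0
    )

def opd_loss(student_steps, teacher_steps):
    """Mean per-token reverse KL over one student-generated transcript."""
    per_token = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)

# Toy rollout: two generated tokens over a three-entry vocabulary (made-up numbers).
student_steps = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]]
teacher_steps = [[0.9, 0.05, 0.05], [0.5, 0.4, 0.1]]

loss = opd_loss(student_steps, teacher_steps)
print(f"toy OPD loss: {loss:.4f}")
```

The loss is zero only where the student already matches the teacher, so the gradient concentrates on positions where the student's own rollout diverges from what the teacher would have predicted.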
## Model Overview

<div align="center">
<img src="figures/ark_asr_architecture.png" width="95%" alt="ARK-ASR architecture"/>
<br>
<p><strong>Figure 1: ARK-ASR architecture.</strong> Audio is encoded by a Whisper-style encoder with RoPE, merged through an MLP adapter, and injected into a Qwen2 decoder by replacing audio placeholder token embeddings before transcript generation.</p>
</div>

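The injection step in Figure 1 can be illustrated with plain Python lists. This is a minimal sketch of the general "replace placeholder embeddings" technique, not the actual `arkasr` implementation: the placeholder id, the function name, and the two-dimensional embeddings are hypothetical.

```python
AUDIO_PLACEHOLDER_ID = -1  # hypothetical id; the real one comes from the model's tokenizer

def inject_audio_embeddings(token_ids, text_embeddings, audio_embeddings):
    """Replace the embedding at each audio placeholder with the next audio frame."""
    n_slots = sum(1 for tok in token_ids if tok == AUDIO_PLACEHOLDER_ID)
    if n_slots != len(audio_embeddings):
        raise ValueError(f"{n_slots} placeholders but {len(audio_embeddings)} audio frames")
    audio_iter = iter(audio_embeddings)
    return [
        next(audio_iter) if tok == AUDIO_PLACEHOLDER_ID else emb
        for tok, emb in zip(token_ids, text_embeddings)
    ]

# Prompt with two audio slots between two text tokens.
token_ids = [11, AUDIO_PLACEHOLDER_ID, AUDIO_PLACEHOLDER_ID, 42]
text_embeddings = [[0.1, 0.1], [0.0, 0.0], [0.0, 0.0], [0.4, 0.4]]
audio_embeddings = [[1.0, 2.0], [3.0, 4.0]]  # adapter outputs, one per placeholder

merged = inject_audio_embeddings(token_ids, text_embeddings, audio_embeddings)
print(merged)
```

The decoder then sees one sequence of embeddings in which the audio frames sit exactly where the prompt reserved space for them.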
- **Model size:** 0.6B parameters
- **Task:** automatic speech recognition
- **Architecture:** audio-capable autoregressive Transformers model with custom `arkasr` remote code
- **Checkpoint format:** `safetensors`
- **Sampling rate:** 16 kHz
- **Recommended inference code:** [`scripts/infer/ark_asr_transformers.py`](https://github.com/AutoArk/open-audio-opd/blob/main/scripts/infer/ark_asr_transformers.py)

The model should be loaded with `trust_remote_code=True`. The official inference script handles the processor, tokenizer, audio prompt format, generation cleanup, and ASR token filtering.

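Because the model expects 16 kHz audio, it can help to check a file's sample rate before inference. The sketch below uses only the standard-library `wave` module. Whether the official processor resamples on its own is not stated here, so treat this as an optional pre-flight check; the helper names and the demo file are assumptions.

```python
import struct
import tempfile
import wave
from pathlib import Path

EXPECTED_SAMPLE_RATE = 16_000  # the sampling rate listed above

def wav_sample_rate(path):
    """Return the sample rate stored in a PCM WAV header."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getframerate()

def needs_resampling(path, expected=EXPECTED_SAMPLE_RATE):
    """True when the file's sample rate differs from the expected rate."""
    return wav_sample_rate(path) != expected

# Demo: write 0.1 s of 8 kHz mono silence, then detect the mismatch.
demo_path = Path(tempfile.mkdtemp()) / "demo_8k.wav"
with wave.open(str(demo_path), "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)  # 16-bit PCM
    wav_file.setframerate(8_000)
    wav_file.writeframes(struct.pack("<800h", *([0] * 800)))

print(needs_resampling(demo_path))  # True
```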
## Performance

The following results are from the `open-audio-opd` evaluation. Lower CER/WER is better. Bold numbers mark the best result within the 0.6B group.

| Model | aishell-1 (CER) | Wenet-meeting (CER) | Wenet-net (CER) | Libri-clean (WER) | Libri-other (WER) |
| --- | ---: | ---: | ---: | ---: | ---: |
| *0.6B models* | | | | | |
| Ark-Base (0.6B) | 3.48% | 10.22% | 7.74% | 3.75% | 7.17% |
| Ark-Base+OPD (0.6B) | 3.00% | 7.18% | 6.13% | 2.88% | 5.50% |
| **Ark-Base+TD+OPD (0.6B)** | **1.95%** | 5.92% | **5.39%** | **2.45%** | **4.56%** |
| Qwen3-ASR-0.6B | 2.07% | **5.57%** | 5.45% | 2.81% | 5.05% |
| *Larger reference model* | | | | | |
| Qwen3-ASR-1.7B | 1.50% | 4.69% | 4.55% | 2.20% | 4.05% |

`Ark-Base` is the 0.6B supervised ASR checkpoint trained on 100k hours of ASR audio. `TD` denotes teacher-data adaptation using 2,000 hours of teacher-generated ASR data. `OPD` denotes on-policy distillation with a Qwen-ASR teacher.

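For readers unfamiliar with the metrics in the table, the sketch below shows how WER and CER are conventionally defined: edit distance between reference and hypothesis, divided by reference length, over words or characters respectively. It is a plain-Python illustration under that standard definition. The text normalization applied in the `open-audio-opd` evaluation is not described here, so none is applied, and the example sentences are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate for one utterance (reference must be non-empty)."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate for one utterance, ignoring spaces."""
    ref_chars = list(reference.replace(" ", ""))
    return edit_distance(ref_chars, list(hypothesis.replace(" ", ""))) / len(ref_chars)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deleted word out of six
print(cer("今天天气很好", "今天天汽很好"))  # one substituted character out of six
```

Benchmark figures are usually computed at corpus level (total edits divided by total reference length) rather than by averaging per-utterance rates.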
## Inference

Run ASR inference with Hugging Face Transformers:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

model_path = "AutoArk-AI/ARK-ASR-0.6B"
audio_path = "assets/libai.wav"

# Use half precision on GPU and full precision on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if device == "cuda" else torch.float32

# The custom `arkasr` code requires trust_remote_code=True.
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch_dtype,
    attn_implementation="sdpa",
).to(device)

# One user turn containing the audio file and the transcription instruction.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "path": audio_path},
            {"type": "text", "text": "Please transcribe this audio."},
        ],
    }
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
)
inputs = inputs.to(device)
# Cast the audio features to the model dtype.
if "audios" in inputs:
    inputs["audios"] = inputs["audios"].to(dtype=torch_dtype)

# Block every special token except EOS so that generation can still terminate.
bad_words_ids = [
    [token_id]
    for token_id in tokenizer.all_special_ids
    if token_id != tokenizer.eos_token_id
]

# Greedy decoding.
outputs = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=256,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
    bad_words_ids=bad_words_ids,
)

# Decode only the newly generated tokens, not the prompt.
decoded_outputs = tokenizer.batch_decode(
    outputs[:, inputs.input_ids.shape[1] :],
    skip_special_tokens=True,
)
print(decoded_outputs)
```

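Two steps in the snippet above are easy to misread: building `bad_words_ids` and slicing the prompt off the generated sequence. The sketch below reproduces both with plain Python and a stand-in tokenizer object. The token ids are made up, and the real values come from the loaded tokenizer.

```python
from types import SimpleNamespace

# Stand-in tokenizer with made-up ids; the real values come from AutoTokenizer.
toy_tokenizer = SimpleNamespace(all_special_ids=[0, 1, 2, 3], eos_token_id=2)

# Every special token except EOS becomes a single-token "bad word".
bad_words_ids = [
    [token_id]
    for token_id in toy_tokenizer.all_special_ids
    if token_id != toy_tokenizer.eos_token_id
]
print(bad_words_ids)  # [[0], [1], [3]]

# generate() returns prompt + continuation; keep only the continuation.
prompt_len = 3
generated = [[101, 102, 103, 7, 8, 2]]
new_tokens = [row[prompt_len:] for row in generated]
print(new_tokens)  # [[7, 8, 2]]
```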
For batch JSONL inference, use the open-source inference code:

```bash
git clone https://github.com/AutoArk/open-audio-opd
cd open-audio-opd
pip install -e .
```

The input JSONL should contain one ASR sample per line:

```json
{"audio":"/path/to/audio.wav","text":"","task":"asr","begin_time":-1,"end_time":-1}
```

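An input file in this format can be generated from a list of audio paths. The following is a small sketch that copies the field names and default values from the example row above. The helper names and the audio paths are hypothetical, and the default values are reused as shown without interpreting them.

```python
import json
import tempfile
from pathlib import Path

def make_asr_record(audio_path):
    """Build one input row using the fields from the example above."""
    return {
        "audio": str(audio_path),
        "text": "",
        "task": "asr",
        "begin_time": -1,
        "end_time": -1,
    }

def write_manifest(audio_paths, manifest_path):
    """Write one JSON object per line."""
    with open(manifest_path, "w", encoding="utf-8") as f:
        for audio_path in audio_paths:
            f.write(json.dumps(make_asr_record(audio_path), ensure_ascii=False) + "\n")

# Hypothetical audio paths, for illustration only.
manifest_path = Path(tempfile.mkdtemp()) / "input.jsonl"
write_manifest(["/path/to/a.wav", "/path/to/b.wav"], manifest_path)

lines = manifest_path.read_text(encoding="utf-8").splitlines()
print(len(lines))  # 2
```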
Then run the inference script:

```bash
python scripts/infer/ark_asr_transformers.py \
  --input /path/to/input.jsonl \
  --output runs/infer/predictions.jsonl \
  --model_path AutoArk-AI/ARK-ASR-0.6B \
  --processor_path AutoArk-AI/ARK-ASR-0.6B \
  --batch_size 40 \
  --dtype float16 \
  --attn_impl sdpa
```

The output JSONL preserves input metadata and adds:

- `pred_text`: cleaned prediction text for downstream evaluation
- `pred_text_raw`: raw decoded generation before cleanup

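The prediction file can be consumed line by line. The sketch below reads the two added fields together with the preserved `audio` field and flags empty predictions. The sample rows, including the raw text with a trailing special token, are made up for illustration and do not come from a real run.

```python
import io
import json

def read_predictions(stream):
    """Yield (audio, pred_text, pred_text_raw) from a predictions JSONL stream."""
    for line in stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        yield row["audio"], row["pred_text"], row["pred_text_raw"]

# Made-up rows standing in for runs/infer/predictions.jsonl.
sample = io.StringIO(
    '{"audio": "/path/to/a.wav", "task": "asr", "pred_text": "hello world", '
    '"pred_text_raw": "hello world<|endoftext|>"}\n'
    "\n"
    '{"audio": "/path/to/b.wav", "task": "asr", "pred_text": "", "pred_text_raw": ""}\n'
)

rows = list(read_predictions(sample))
empty = [audio for audio, pred, _ in rows if not pred]
print(len(rows), empty)  # 2 ['/path/to/b.wav']
```

For a real run, replace the `io.StringIO` object with `open("runs/infer/predictions.jsonl", encoding="utf-8")`.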
## Evaluation

The repository also includes a J/WER evaluation entrypoint:

```bash
python scripts/eval/eval_jwer_ark_asr_transformers.py \
  --input /path/to/test.jsonl \
  --output runs/eval/result.jsonl \
  --model_path AutoArk-AI/ARK-ASR-0.6B \
  --processor_path AutoArk-AI/ARK-ASR-0.6B \
  --batch_size 40 \
  --dtype float16 \
  --attn_impl sdpa
```

No evaluation audio or dataset files are bundled with this model repository.

## Acknowledgements

The training code is based on [THUNLP/OPD](https://github.com/thunlp/OPD/) and [verl](https://github.com/volcengine/verl). The OPD recipe uses a stronger ASR teacher to score online student rollouts.

## Citation

If you find ARK-ASR or open-audio-opd useful, please cite or link the project repository:

```bibtex
@misc{open_audio_opd_ark_asr,
  title = {open-audio-opd: Industrial ASR Online Policy Distillation Training Code},
  author = {AutoArk AI},
  year = {2026},
  howpublished = {\url{https://github.com/AutoArk/open-audio-opd}}
}
```