---
library_name: transformers
tags:
- automatic-speech-recognition
- speech
- audio
- transformers
- pytorch
- safetensors
- ark-asr
pipeline_tag: automatic-speech-recognition
language:
- zh
- en
- de
- ja
- fr
- ko
license: apache-2.0
repository: https://github.com/AutoArk/open-audio-opd
---
# ARK-ASR-0.6B: Efficient Multilingual ASR with Online Policy Distillation

[![GitHub](https://img.shields.io/badge/GitHub-AutoArk%2Fopen--audio--opd-blue?logo=github)](https://github.com/AutoArk/open-audio-opd)
[![License](https://img.shields.io/badge/License-Apache--2.0-green)](https://www.apache.org/licenses/LICENSE-2.0)

> **TL;DR** ARK-ASR-0.6B is a 0.6B-parameter automatic speech recognition model trained with teacher-data adaptation and on-policy distillation. The accompanying training, inference, and evaluation code is available at [AutoArk/open-audio-opd](https://github.com/AutoArk/open-audio-opd).

## Abstract

ARK-ASR is an audio ASR student model optimized with the **teacher-data adaptation + online policy distillation (TD + OPD)** recipe from `open-audio-opd`. Instead of relying only on static supervised transcripts, OPD lets the student generate transcripts online and trains it against token-level teacher scores on the student's own generated behavior.

This checkpoint corresponds to the `Ark-Base+TD+OPD (0.6B)` model reported in the open-audio-opd results. ARK-ASR currently supports Chinese, English, German, Japanese, French, and Korean ASR.

## Model Overview

Figure 1: ARK-ASR architecture. Audio is encoded by a Whisper-style encoder with RoPE, merged through an MLP adapter, and injected into a Qwen2 decoder by replacing audio placeholder token embeddings before transcript generation.
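
The embedding-replacement step named in the caption can be illustrated with a small PyTorch sketch. This is not the model's remote code: the function `inject_audio`, the placeholder id `AUDIO_TOKEN_ID`, and the tensor sizes are hypothetical and chosen only to show the idea, which is that every audio placeholder position in the decoder input receives one adapter output vector, in order.

```python
import torch

# Hypothetical values for illustration only; the real placeholder id and
# hidden size come from the checkpoint's tokenizer and config.
AUDIO_TOKEN_ID = 9
HIDDEN = 4


def inject_audio(input_ids, text_embeds, audio_embeds, audio_token_id=AUDIO_TOKEN_ID):
    """Overwrite placeholder-token embeddings with audio features, in order.

    input_ids:    (batch, seq) token ids containing audio placeholders
    text_embeds:  (batch, seq, hidden) embeddings looked up from input_ids
    audio_embeds: (num_placeholders, hidden) adapter outputs
    """
    mask = input_ids == audio_token_id
    n_slots = int(mask.sum())
    if n_slots != audio_embeds.shape[0]:
        raise ValueError(
            f"{n_slots} placeholder tokens but {audio_embeds.shape[0]} audio vectors"
        )
    merged = text_embeds.clone()
    # Boolean-mask assignment fills placeholder positions in row-major order.
    merged[mask] = audio_embeds.to(merged.dtype)
    return merged


input_ids = torch.tensor([[1, 9, 9, 2]])
text_embeds = torch.zeros(1, 4, HIDDEN)
audio_embeds = torch.ones(2, HIDDEN)
merged = inject_audio(input_ids, text_embeds, audio_embeds)
print(merged[0, :, 0].tolist())  # [0.0, 1.0, 1.0, 0.0]
```

The count check matters in practice: if the prompt reserves a different number of placeholder tokens than the adapter produces, the mismatch should fail loudly rather than silently misalign audio and text positions.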

- **Model size:** 0.6B parameters
- **Task:** automatic speech recognition
- **Architecture:** audio-capable autoregressive Transformers model with custom `arkasr` remote code
- **Checkpoint format:** `safetensors`
- **Sampling rate:** 16 kHz
- **Recommended inference code:** [`scripts/infer/ark_asr_transformers.py`](https://github.com/AutoArk/open-audio-opd/blob/main/scripts/infer/ark_asr_transformers.py)

The model should be loaded with `trust_remote_code=True`. The official inference script handles the processor, tokenizer, audio prompt format, generation cleanup, and ASR token filtering.

## Performance

The following results are from the `open-audio-opd` evaluation. Lower CER/WER is better. Bold numbers mark the best result within the 0.6B group.

| Model | aishell-1 (CER) | Wenet-meeting (CER) | Wenet-net (CER) | Libri-clean (WER) | Libri-other (WER) |
| --- | ---: | ---: | ---: | ---: | ---: |
| *0.6B models* | | | | | |
| Ark-Base (0.6B) | 3.48% | 10.22% | 7.74% | 3.75% | 7.17% |
| Ark-Base+OPD (0.6B) | 3.00% | 7.18% | 6.13% | 2.88% | 5.50% |
| **Ark-Base+TD+OPD (0.6B)** | **1.95%** | 5.92% | **5.39%** | **2.45%** | **4.56%** |
| Qwen3-ASR-0.6B | 2.07% | **5.57%** | 5.45% | 2.81% | 5.05% |
| *Larger reference model* | | | | | |
| Qwen3-ASR-1.7B | 1.50% | 4.69% | 4.55% | 2.20% | 4.05% |

`Ark-Base` is the 0.6B supervised ASR checkpoint trained on 100k hours of ASR audio. `TD` denotes teacher-data adaptation using 2,000 hours of teacher-generated ASR data. `OPD` denotes on-policy distillation with a Qwen-ASR teacher.

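
For readers unfamiliar with the metrics in the table, CER and WER are both edit-distance rates: the number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length in characters or words. The sketch below is a minimal pure-Python illustration, not the repository's evaluation script; the helper names are hypothetical, and it assumes no text normalization (casing, punctuation, number formatting), which a real evaluation would likely apply and which can change the reported numbers.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate, ignoring spaces."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


print(wer("a b c d", "a x c d"))   # 0.25
print(cer("今天天气", "今天天汽"))  # 0.25
```

CER is the usual choice for Chinese test sets such as the AISHELL and WenetSpeech columns, where word boundaries are not marked by spaces, while WER is used for the LibriSpeech columns.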
## Inference

Run ASR inference with Hugging Face Transformers:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

model_path = "AutoArk-AI/ARK-ASR-0.6B"
audio_path = "assets/libai.wav"

# Use half precision on GPU, full precision on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if device == "cuda" else torch.float32

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch_dtype,
    attn_implementation="sdpa",
).to(device)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "path": audio_path},
            {"type": "text", "text": "Please transcribe this audio."},
        ],
    }
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
)
inputs = inputs.to(device)
# Match the audio features to the model dtype.
if "audios" in inputs:
    inputs["audios"] = inputs["audios"].to(dtype=torch_dtype)

# Block every special token except EOS during generation.
bad_words_ids = [
    [token_id]
    for token_id in tokenizer.all_special_ids
    if token_id != tokenizer.eos_token_id
]

outputs = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=256,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
    bad_words_ids=bad_words_ids,
)

# Decode only the newly generated tokens, skipping the prompt.
decoded_outputs = tokenizer.batch_decode(
    outputs[:, inputs.input_ids.shape[1] :],
    skip_special_tokens=True,
)
print(decoded_outputs)
```

For batch JSONL inference, use the open-source inference code:

```bash
git clone https://github.com/AutoArk/open-audio-opd
cd open-audio-opd
pip install -e .
```

The input JSONL should contain one ASR sample per line:

```json
{"audio":"/path/to/audio.wav","text":"","task":"asr","begin_time":-1,"end_time":-1}
```

Then run the inference script:

```bash
python scripts/infer/ark_asr_transformers.py \
  --input /path/to/input.jsonl \
  --output runs/infer/predictions.jsonl \
  --model_path AutoArk-AI/ARK-ASR-0.6B \
  --processor_path AutoArk-AI/ARK-ASR-0.6B \
  --batch_size 40 \
  --dtype float16 \
  --attn_impl sdpa
```

The output JSONL preserves input metadata and adds:

- `pred_text`: cleaned prediction text for downstream evaluation
- `pred_text_raw`: raw decoded generation before cleanup

## Evaluation

The repository also includes a J/WER evaluation entrypoint:

```bash
python scripts/eval/eval_jwer_ark_asr_transformers.py \
  --input /path/to/test.jsonl \
  --output runs/eval/result.jsonl \
  --model_path AutoArk-AI/ARK-ASR-0.6B \
  --processor_path AutoArk-AI/ARK-ASR-0.6B \
  --batch_size 40 \
  --dtype float16 \
  --attn_impl sdpa
```

No evaluation audio or dataset files are bundled with this model repository.

## Acknowledgements

The training code is based on [THUNLP/OPD](https://github.com/thunlp/OPD/) and [verl](https://github.com/volcengine/verl). The OPD recipe uses a stronger ASR teacher to score online student rollouts.

## Citation

If you find ARK-ASR or open-audio-opd useful, please cite or link the project repository:

```bibtex
@misc{open_audio_opd_ark_asr,
  title = {open-audio-opd: Industrial ASR Online Policy Distillation Training Code},
  author = {AutoArk AI},
  year = {2026},
  howpublished = {\url{https://github.com/AutoArk/open-audio-opd}}
}
```