Qwen3 ASR Micro (zh-TW / en, distilled, edge / Jetson Nano)

(562 M parameters; formerly "Qwen3-ASR-0.3B"; a Micro-tier distillation of Qwen3-ASR-0.6B)

This is a new, smaller model, not a quantization of Qwen3-ASR-0.6B. It has a different architecture (a 14-layer decoder vs the 0.6B's 28) and was trained separately by knowledge distillation on a new corpus. The q8_0 file is simply this model's GGUF serialization for edge inference, in the same way any model can be saved to GGUF; it is not a re-quantized 0.6B. (HF may auto-tag GGUF repos as "quantized"; the base_model_relation here marks it as a trained derivative.)

Qwen3 ASR Micro is a compact bilingual (Traditional Chinese + English, code-switch) speech-to-text model distilled from Qwen/Qwen3-ASR-0.6B. It reuses the teacher's 18-layer audio encoder (copied, frozen) and full 151 705-token vocabulary, and pairs them with a new 14-layer Qwen3 decoder (pruned drop-middle from the teacher's 28 layers, then trained). It was distilled on ~197 h of Taiwan Mandarin (YouTube, teacher-labeled + s2twp) plus 30 h of English (LibriSpeech). It outputs native Traditional Chinese with punctuation.

This checkpoint is code-switch-optimized. It runs faster than real time on a Jetson Nano gen1 (Tegra X1, Maxwell sm_53, CUDA 10.2): RTF ~0.8 vs the 0.6B's 1.3.

Relationship to Qwen3-ASR-0.6B

Type: new distilled model (≈ a "finetune"/student in HF's model tree), not a quantization
Shared with teacher: 18-layer audio encoder (frozen), tokenizer, full 151 705 vocab
Changed: decoder 28 → 14 layers (drop-middle, then KD-trained); new ~197 h training corpus
Params: 562 M total (vs 938 M); genuinely fewer weights, not the same weights re-quantized

Results: head-to-head vs the teacher (our measurement)

This is the code-switch-optimized checkpoint (stacked KD: CE + logit-KL + hidden-state KD, then code-switch upweighting). All rows below are our own runs of both models with one identical protocol (greedy, fp16, no language hint) so the comparison is fair. Absolute numbers may differ from Qwen's official figures because we can't reproduce their exact text normaliser.

| metric (set) | this 0.3B-CS (562M) | teacher 0.6B (938M) |
|---|---|---|
| zh-CER, CommonVoice-zh, 500 clips (strip punct, CJK+alnum, t2s) | 0.0610 | 0.0445 |
| CS-MER, Taiwan code-switch, 60 clips (mixed CJK-char/en-word)¹ | 0.0765 | ref² |
| en-WER, CommonVoice-en, 200 clips | ~0.080 | 0.0233 |
| total params | 562 M | 938 M |
| GGUF q8_0 size | 735 MB | 961 MB |
| Jetson Nano gen1 RTF (q8_0, 13.5 s clip) | 0.79 (real-time) | 1.3 |
| on-device decode | 6.7 s (14 layers) | 13.8 s (28 layers) |

¹ Before code-switch upweighting this checkpoint's CS-MER was 0.128; upweighting in-corpus CS clips to 35 % cut it to 0.0765 (−40 %) at no pure-zh cost (0.0615 → 0.0610) but raised pure-en WER from 0.048 to ~0.080, a deliberate code-switch/English trade.
² CS-MER references are teacher-generated, so the teacher is the ceiling by construction (not a fair teacher number).
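One way to reach a 35 % code-switch share is weighted sampling, sketched below. This is an assumption about the mechanism, not the method confirmed by the training scripts; the function names and the clip counts are hypothetical and chosen only for illustration.

```python
def cs_sampling_weight(n_cs: int, n_other: int, target_share: float = 0.35) -> float:
    """Per-clip weight for code-switch clips (all other clips keep weight 1.0)
    so that code-switch clips make up `target_share` of the sampled clips."""
    if not 0.0 < target_share < 1.0:
        raise ValueError("target_share must be strictly between 0 and 1")
    if n_cs <= 0 or n_other <= 0:
        raise ValueError("both groups must be non-empty")
    # Solve w * n_cs / (w * n_cs + n_other) = target_share for w.
    return target_share * n_other / ((1.0 - target_share) * n_cs)


def sampled_share(n_cs: int, n_other: int, weight: float) -> float:
    """Expected share of code-switch clips under weighted sampling."""
    return weight * n_cs / (weight * n_cs + n_other)


# Hypothetical counts, for illustration only (not the real corpus statistics).
w = cs_sampling_weight(n_cs=2_000, n_other=38_000)
print(f"weight per CS clip: {w:.2f}, resulting share: {sampled_share(2_000, 38_000, w):.2f}")
```

The resulting weights could be passed to any weighted sampler; the rarer the code-switch clips are in the raw corpus, the larger the per-clip weight becomes.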

Honest gap: on clean read speech (CommonVoice-zh) the 0.3B is ~37 % behind the 0.6B teacher in relative terms (0.0610 vs 0.0445). This is the cost of 40 % fewer params and ~197 h of training data vs the teacher's vastly larger corpus. The cleaner official sets (Fleurs-zh, AISHELL-2, WenetSpeech) weren't runnable from our box (download-blocked / gated). The 0.3B's wins are size, real-time operation on the Nano, and code-switch, not raw zh accuracy.
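The two percentages quoted above follow directly from the table values. A minimal check, using a hypothetical helper name:

```python
def relative_change(new: float, ref: float) -> float:
    """Signed change of `new` relative to `ref` (0.10 means 10 % higher)."""
    return (new - ref) / ref


cer_gap = relative_change(0.0610, 0.0445)  # student vs teacher zh-CER
param_cut = relative_change(562, 938)      # student vs teacher params, in millions
print(f"zh-CER gap: {cer_gap:+.1%}, params: {param_cut:+.1%}")
```

The CER gap comes out at roughly 37 % and the parameter reduction at roughly 40 %, matching the rounded figures in the text.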

The CER/WER figures come from different test sets and must not be compared across benchmarks. The deployed streaming X-ASR transducer is still lower on pure zh-CER (~0.068); this model's value is being a bilingual LLM-ASR with code-switch, native Traditional output, 40 % smaller, and real-time on the Nano.
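For readers unfamiliar with the mixed error rate in the code-switch row, the sketch below counts each CJK character and each English word as one token and divides the edit distance by the reference length. It is an illustration only: the exact normaliser used for the table is not shown in this card, t2s conversion is omitted, and the function names are hypothetical.

```python
import re
from typing import List

# One token per CJK ideograph (basic block only); one token per run of ASCII
# letters/digits, allowing inner apostrophes. Punctuation matches nothing and
# is therefore dropped.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*")


def mixed_tokens(text: str) -> List[str]:
    """Split text into CJK characters and lower-cased English words."""
    return [t.lower() for t in _TOKEN.findall(text)]


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance over token lists (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def mixed_error_rate(ref: str, hyp: str) -> float:
    """Token edit distance divided by the number of reference tokens."""
    ref_tokens = mixed_tokens(ref)
    if not ref_tokens:
        raise ValueError("reference has no scorable tokens")
    return edit_distance(ref_tokens, mixed_tokens(hyp)) / len(ref_tokens)


print(mixed_tokens("我們用 Python 寫 code。"))
print(mixed_error_rate("我們用 Python 寫 code", "我們用 python 寫扣"))
```

In the example, the reference has six tokens and the hypothesis gets one of them wrong, so the rate is one sixth.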

Files

  • qwen3-asr-micro-q8_0.gguf: GGUF (q8_0) for rapidspeech.cpp / qwen3-asr.cpp (Nano / llama.cpp-style engines)
  • model.safetensors, config.json, tokenizer: PyTorch (transformers) checkpoint
  • scripts/: the full reproduction pipeline (see below)

Usage

GGUF (edge, recommended on Nano): block_count=14 is carried in the GGUF metadata, so the existing Qwen3-ASR engine builds the right decoder without a rebuild:

rs-asr-offline -m qwen3-asr-micro-q8_0.gguf -w clip_16k.wav        # rapidspeech.cpp
qwen3-asr-cli  -m qwen3-asr-micro-q8_0.gguf -f clip_16k.wav -t 4   # qwen3-asr.cpp (Nano, RTF ~0.8)

Engines: vieenrose/qwen3-asr.cpp (standalone) · vieenrose/RapidSpeech.cpp (integrated).

PyTorch (transformers from source; Qwen3-ASR is only in main):

pip install git+https://github.com/huggingface/transformers.git

import torch, soundfile as sf
from transformers import Qwen3ASRForConditionalGeneration, AutoProcessor

# Load the processor and the fp16 model onto the GPU.
proc  = AutoProcessor.from_pretrained("Luigi/Qwen3-ASR-Micro")
model = Qwen3ASRForConditionalGeneration.from_pretrained("Luigi/Qwen3-ASR-Micro", dtype=torch.float16).to("cuda").eval()

# Read the clip as float32; the processor call below assumes 16 kHz audio.
a, sr = sf.read("clip_16k.wav", dtype="float32")
assert sr == 16000, f"expected 16 kHz audio, got {sr} Hz"

# Audio-only user turn with an empty text part.
conv = [{"role":"user","content":[{"type":"audio","audio":a},{"type":"text","text":""}]}]
inp = proc.apply_chat_template(conv, add_generation_prompt=True, tokenize=True, return_dict=True,
                               return_tensors="pt", sampling_rate=16000)
# Move tensors to the GPU; cast floating-point features to fp16 to match the model.
inp = {k:(v.to("cuda").half() if v.is_floating_point() else v.to("cuda")) for k,v in inp.items()}
out = model.generate(**inp, max_new_tokens=96, do_sample=False)  # greedy decoding
# Decode only the newly generated tokens (skip the prompt).
print(proc.batch_decode(out[:, inp["input_ids"].shape[1]:], skip_special_tokens=True)[0])

Full reproduction process

All scripts are in scripts/ here. The training data is NOT published (the zh-TW audio is YouTube content, subject to copyright and platform ToS); the pipeline below reconstructs an equivalent corpus from your own sources.

0. Environment

  • pip install git+https://github.com/huggingface/transformers.git (qwen3_asr modeling is only in main, not in any release ≤ 5.12).
  • Convert the teacher to the transformers HF layout (the published 0.6B is Qwen's original thinker.* format):
    python transformers/models/qwen3_asr/convert_qwen3_asr_to_hf.py \
        --model_id Qwen/Qwen3-ASR-0.6B --dst_dir qwen3-asr-hf --model_type asr
    

1. Data: zh-TW (teacher-labeled, the hard part)

The published 0.6B is the teacher; it labels raw audio, which (after the s2twp step below) also gives native-Traditional targets.

  • Re-download the source Taiwan-Mandarin audio from YouTube (scripts/scale_corpus.py, IDs in data/youtube_ids.txt): yt-dlp (itag-140) → 16 kHz mono → fixed 16 s windows.
  • Teacher-label each window with Qwen3-ASR-0.6B (greedy), then apply OpenCC s2twp → Traditional Chinese targets (the teacher's raw script is inconsistent; s2twp is a deterministic post-step).
  • 135 videos → 145 h, 36 105 segments for the initial corpus (later scaled to ~197 h for this checkpoint). Neither the audio nor the labels are published (copyright); supply your own Taiwan-Mandarin source list to scale_corpus.py to reconstruct an equivalent corpus.
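The fixed 16 s windowing step can be sketched as follows. This is not the code from scripts/scale_corpus.py; the function name and the min_tail_s parameter (how short a trailing remainder may be before it is dropped) are assumptions made for illustration.

```python
from typing import List, Tuple


def fixed_windows(n_samples: int, sample_rate: int = 16_000,
                  window_s: float = 16.0, min_tail_s: float = 1.0) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of consecutive fixed-length windows.

    The last window may be shorter; it is kept only if it lasts at least
    `min_tail_s` seconds (an assumed policy, not taken from the scripts).
    """
    win = int(round(window_s * sample_rate))
    min_tail = int(round(min_tail_s * sample_rate))
    spans = []
    for start in range(0, n_samples, win):
        end = min(start + win, n_samples)
        length = end - start
        if length == win or length >= min_tail:
            spans.append((start, end))
    return spans


# A 40 s clip at 16 kHz yields two full windows plus an 8 s tail.
print(fixed_windows(40 * 16_000))
```

Each span can then be used to slice the decoded mono waveform before it is passed to the teacher for labeling.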

2. Data: English balance (anti-forgetting)

  • scripts/english_prep3.py: LibriSpeech train-clean-100 via HF datasets streaming with Audio(decode=False) + soundfile (bypasses torchcodec) → 16 kHz wav + gold transcripts, ~30 h.
  • A 30 % English mix during training prevents catastrophic forgetting of English.
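A 30 % English mix can be implemented by choosing the source pool per training example. The sketch below assumes that approach; the function name and the toy pools are hypothetical, and the real training script may mix data differently.

```python
import random
from typing import Sequence, TypeVar

T = TypeVar("T")
EN_PROB = 0.30  # probability of drawing an English example


def draw_example(zh_pool: Sequence[T], en_pool: Sequence[T],
                 rng: random.Random, en_prob: float = EN_PROB) -> T:
    """Pick the English pool with probability `en_prob`, else the zh-TW pool,
    then draw one example uniformly from the chosen pool."""
    pool = en_pool if rng.random() < en_prob else zh_pool
    return rng.choice(pool)


rng = random.Random(0)  # fixed seed so the illustration is repeatable
draws = [draw_example(["zh_a", "zh_b"], ["en_a"], rng) for _ in range(10_000)]
en_share = sum(d.startswith("en") for d in draws) / len(draws)
print(f"English share over {len(draws)} draws: {en_share:.3f}")
```

Because the choice is made per example, the English share is independent of how many hours each pool contains.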

3. Student construction

  • Start from the teacher; prune the 28-layer Qwen3 decoder to 14 layers, "drop-middle" (keep first 7 + last 7; naive first-N or last-N truncation collapses the model). Re-index self_attn.layer_idx.
  • Freeze the audio encoder + projector + embeddings; train only the 14 decoder layers + final norm (220 M trainable). The encoder is one-shot and already good; the decoder is what we shrink.
  • Note: a fresh-pruned decoder has CER ≈ 1.0, so recovery requires distillation (it is not a "prune + heal").
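The drop-middle selection and the re-indexing it requires can be sketched in plain Python. The function names are hypothetical; applying the indices to the actual decoder modules is left out.

```python
from typing import Dict, List


def drop_middle_indices(n_layers: int = 28, keep_head: int = 7, keep_tail: int = 7) -> List[int]:
    """Teacher layer indices kept by drop-middle pruning: first `keep_head` + last `keep_tail`."""
    if keep_head < 0 or keep_tail < 0 or keep_head + keep_tail > n_layers:
        raise ValueError("invalid head/tail sizes for this layer count")
    return list(range(keep_head)) + list(range(n_layers - keep_tail, n_layers))


def reindex_map(kept: List[int]) -> Dict[int, int]:
    """Map each kept teacher layer index to its new, contiguous student index."""
    return {old: new for new, old in enumerate(kept)}


kept = drop_middle_indices()
print(kept)                   # teacher layers that survive
print(reindex_map(kept)[21])  # new index of teacher layer 21
```

With the defaults, teacher layers 0 to 6 and 21 to 27 survive, and teacher layer 21 becomes student layer 7; each kept layer's self_attn.layer_idx would be set to its new index.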

4. KD training (scripts/train_student_full.py)

  • Loss: cross-entropy on the teacher's Traditional labels (sequence-level KD). Label masking: build the prompt (add_generation_prompt=True) then manually append tokenizer(text)+eos and mask the prompt with -100 (the chat template does not append assistant text).
  • Optimizer: AdamW lr 2e-5, 100-step linear warmup, grad-clip 0.5, nan-guard (skip non-finite loss), grad-accum 4, EN_PROB 0.30, 8000 steps (~23 min on a single GB10). Save best-by-zh-CER checkpoint.
  • Curve (v1, CE-only): zh-CER 0.997 → 0.201 (1k) → 0.0865 (200-clip eval).
  • Refinement (this checkpoint), scripts/train_03b_kd.py + scripts/train_03b_hskd.py: warm-start from the v1 student, add logit-KL (KL between student and teacher softened logits, T=2), then hidden-state KD (normalised MSE of student decoder layer l vs teacher layer 2l). Loss: 0.4·CE + 0.4·KL + 0.2·HS, cosine LR 1e-5 → 1e-6. This took zh-CER 0.0865 → 0.0813 → 0.0722; scaling the zh corpus 145 h → ~197 h then reached 0.0702 (teacher 0.065). Both teacher and student run in fp32 (mixed fp16/fp32 causes conv dtype errors); evaluate only on the 200-clip set (the 30-clip train-eval is noisy).
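The label masking and the stacked loss described above can be illustrated with plain Python lists. This is a sketch, not the code in the training scripts: all function names are hypothetical, the KL direction (teacher to student) and the absence of a T² scaling factor are assumptions, and the hidden-state term is passed in as a precomputed number.

```python
import math
from typing import List, Sequence, Tuple

IGNORE_INDEX = -100  # label value that cross-entropy implementations conventionally skip


def build_labels(prompt_ids: Sequence[int], target_ids: Sequence[int],
                 eos_id: int) -> Tuple[List[int], List[int]]:
    """Append target tokens + EOS to the prompt and mask the prompt in the labels."""
    input_ids = list(prompt_ids) + list(target_ids) + [eos_id]
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(target_ids) + [eos_id]
    return input_ids, labels


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def softened_kl(teacher_logits: Sequence[float], student_logits: Sequence[float],
                temperature: float = 2.0) -> float:
    """KL(teacher || student) on softened distributions (assumed direction, no T^2 factor)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def stacked_loss(ce: float, kl: float, hs: float,
                 w_ce: float = 0.4, w_kl: float = 0.4, w_hs: float = 0.2) -> float:
    """Weighted sum 0.4*CE + 0.4*KL + 0.2*HS, as stated in the text."""
    return w_ce * ce + w_kl * kl + w_hs * hs


ids, labels = build_labels(prompt_ids=[1, 2, 3], target_ids=[10, 11], eos_id=99)
print(ids, labels)
print(stacked_loss(ce=1.2, kl=softened_kl([2.0, 0.0], [0.0, 2.0]), hs=0.5))
```

In a tensor implementation the same masking lets the cross-entropy ignore prompt positions, so only the transcript tokens and the EOS contribute to the loss.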

5. Export to GGUF (scripts/remap_student_to_thinker.py + convert_qwen3_asr_to_gguf.py)

  • Remap the transformers checkpoint (model.*) back to Qwen thinker.* (inverse of the HF converter) and duplicate the tied lm_head. Fix config.json layer_types to length 14.
  • convert_qwen3_asr_to_gguf.py --hf-dir … --output … --quant q8_0. Full 151 705 vocab kept.
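The layer_types fix can be sketched as a recursive trim over the parsed config. Everything here is an assumption for illustration: the function name, the nesting of the example config, and the premise that all decoder layers share one type so that simple truncation is safe. It is not the code in scripts/remap_student_to_thinker.py.

```python
import json
from typing import Any


def truncate_layer_types(node: Any, n_layers: int) -> int:
    """Trim every `layer_types` list longer than `n_layers`, at any nesting depth.

    Returns the number of lists that were shortened.
    """
    changed = 0
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "layer_types" and isinstance(value, list) and len(value) > n_layers:
                node[key] = value[:n_layers]  # replacing an existing key is safe while iterating
                changed += 1
            else:
                changed += truncate_layer_types(value, n_layers)
    elif isinstance(node, list):
        for item in node:
            changed += truncate_layer_types(item, n_layers)
    return changed


# Hypothetical config fragment; the real config.json layout may differ.
cfg = {"thinker_config": {"text_config": {"num_hidden_layers": 14,
                                          "layer_types": ["full_attention"] * 28}}}
changed = truncate_layer_types(cfg, n_layers=14)
print(changed, len(cfg["thinker_config"]["text_config"]["layer_types"]))
print(json.dumps(cfg)[:60])
```

If the layer types were not uniform, the list would instead need to be indexed with the kept teacher layers rather than truncated.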

6. Deploy on Jetson Nano gen1

  • The engine (cuFFT batched mel + sm_53 CUDA-10.2 build) reads qwen3.block_count=14 and builds the 14-layer decoder automatically, with no rebuild. Measured RTF 0.79 (q8_0, 13.5 s clip).
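Real-time factor is processing time divided by audio duration, so values below 1.0 mean faster than real time. The sketch below applies that definition to the RTF values from the results table; the function names are hypothetical.

```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """RTF = wall-clock processing time / audio duration."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return processing_s / audio_s


def implied_processing_s(rtf: float, audio_s: float) -> float:
    """Wall-clock time implied by a given RTF for a clip of `audio_s` seconds."""
    return rtf * audio_s


clip_s = 13.5  # benchmark clip length from the results table
for name, rtf in (("Micro", 0.79), ("0.6B teacher", 1.3)):
    print(f"{name}: ~{implied_processing_s(rtf, clip_s):.1f} s for a {clip_s} s clip")
```

For the 13.5 s clip, an RTF of 0.79 corresponds to under 11 s of total processing, which is consistent with the 6.7 s decode time reported in the table plus encoder time.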

Training hyper-parameters

teacher: Qwen/Qwen3-ASR-0.6B
student: 14-layer decoder (drop-middle of 28), encoder frozen, full vocab
data: ~197 h zh-TW (YouTube, teacher+s2twp; ~145 h initially) + 30 h en (LibriSpeech)
objective: CE + logit-KL + hidden-state KD (stacked)
lr / sched / steps: 2e-5 / linear-warmup-100 / 8000
batch: 1 × grad-accum 4, 30 % English
precision: fp32 train, q8_0 GGUF

Limitations

  • Small-student gap: zh-CER 0.0610 vs teacher 0.0445 on CommonVoice-zh (~37 % relative). This is a real gap (40 % fewer params, ~197 h training vs the teacher's far larger corpus).
  • Code-switch/English trade: this checkpoint is tuned for code-switch (CS-MER 0.128 → 0.0765); pure-English regressed (en-WER 0.048 → 0.080). If you need max pure-English, the pre-CS checkpoint is in the git history.
  • Couldn't benchmark on Fleurs-zh / AISHELL-2 / WenetSpeech (download-blocked / gated from our box); numbers above are CommonVoice-zh + an internal Taiwan code-switch set.
  • English is only ~30 h (anti-forgetting, not parity); scale it for stronger English.
  • Trained on broadcast/news/lecture Taiwan YouTube; conversational/attendant domains under-represented.
  • zh-TW labels via teacher + s2twp, not human gold.

Ethics / data

The training corpus is not published. The zh-TW audio is YouTube content; neither the audio nor its derived labels are redistributed (copyright / platform ToS). Only the reproduction scripts are shared, so others can build an equivalent corpus from their own sources. English is public LibriSpeech (referenced, not re-hosted).

License

Apache-2.0 (inherits the Qwen3-ASR-0.6B base license).
