Qwen3 ASR Micro (zh-TW / en, distilled, edge / Jetson Nano)

(562 M — formerly “Qwen3-ASR-0.3B”; a Micro-tier distillation of Qwen3-ASR-0.6B)

This is a new, smaller model — not a quantization of Qwen3-ASR-0.6B. It has a different architecture (a 14-layer decoder vs the 0.6B's 28) and was separately trained by knowledge distillation on a new corpus. The q8_0 file is just this model's GGUF serialization for edge inference, the same way any model can be saved to GGUF — it is not a re-quantized 0.6B. (HF may auto-tag GGUF repos as "quantized"; the base_model_relation here marks it as a trained derivative.)

Qwen3 ASR Micro is a compact bilingual speech-to-text model (Traditional Chinese + English, including code-switching) distilled from Qwen/Qwen3-ASR-0.6B. It reuses the teacher's 18-layer audio encoder (copied and frozen) and its full 151 705-token vocabulary, and pairs them with a new 14-layer Qwen3 decoder, obtained by drop-middle pruning of the teacher's 28 layers and then trained by distillation. The training data is ~197 h of Taiwan Mandarin (YouTube audio, teacher-labeled and converted with OpenCC s2twp) plus 30 h of English (LibriSpeech). The model outputs native Traditional Chinese with punctuation.

This checkpoint is code-switch-optimized. It runs faster than real-time on a Jetson Nano gen1 (Tegra X1, Maxwell sm_53, CUDA 10.2) — RTF ~0.8 vs the 0.6B's 1.3.

Relationship to Qwen3-ASR-0.6B


Type	new distilled model (≈ a "finetune"/student in HF's model tree), not a quantization
Shared with teacher	18-layer audio encoder (frozen), tokenizer, full 151 705 vocab
Changed	decoder 28 → 14 layers (drop-middle, then KD-trained); new ~197 h training corpus
Params	562 M total (vs 938 M) — genuinely fewer weights, not the same weights re-quantized

Results — head-to-head vs the teacher (our measurement)

This is the code-switch-optimized checkpoint (stacked KD: CE + logit-KL + hidden-state KD, then code-switch upweighting). All rows below are our own runs of both models with one identical protocol (greedy, fp16, no language hint) so the comparison is fair — absolute numbers may differ from Qwen's official figures because we can't reproduce their exact text normaliser.

metric (set)	this 0.3B-CS (562M)	teacher 0.6B (938M)
zh-CER — CommonVoice-zh, 500 clips (strip punct, CJK+alnum, t2s)	0.0610	0.0445
CS-MER — Taiwan code-switch, 60 clips (mixed CJK-char/en-word)¹	0.0765	ref²
en-WER — CommonVoice-en, 200 clips	~0.080	0.0233
total params	562 M	938 M
GGUF q8_0 size	735 MB	961 MB
Jetson Nano gen1 RTF (q8_0, 13.5 s clip)	0.79 (real-time)	1.3
on-device decode	6.7 s (14 layers)	13.8 s (28 layers)

¹ Before code-switch upweighting, this checkpoint's CS-MER was 0.128. Upweighting in-corpus CS clips to 35 % cut it to 0.0765 (−40 %) at no pure-zh cost (0.0615 → 0.0610) but raised pure-en WER from 0.048 to ~0.080: a deliberate code-switch/English trade.
² CS-MER references are teacher-generated, so the teacher is the ceiling by construction (not a fair teacher number).
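The CS-MER row counts errors over mixed units: one per CJK character and one per English word. A minimal sketch of such a metric follows, assuming a simple normaliser (lower-casing, punctuation dropped) and plain Levenshtein distance. The function names are illustrative, the t2s step is omitted, and the actual evaluation script may differ.

```python
import re

# Assumed normaliser: one token per CJK ideograph, one token per run of
# ASCII letters/digits/apostrophes; punctuation and whitespace are dropped.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[a-z0-9']+")

def mixed_tokens(text: str) -> list[str]:
    """Split text into CJK characters and lower-cased English words."""
    return _TOKEN.findall(text.lower())

def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token lists (single rolling row)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]

def mixed_error_rate(ref: str, hyp: str) -> float:
    """Edit distance over mixed tokens, divided by the reference length."""
    ref_tok, hyp_tok = mixed_tokens(ref), mixed_tokens(hyp)
    if not ref_tok:
        raise ValueError("empty reference")
    return edit_distance(ref_tok, hyp_tok) / len(ref_tok)
```

For a corpus-level figure, the usual practice is to sum edit distances and reference lengths over all clips before dividing, rather than averaging per-clip rates.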

Honest gap: on clean read-speech (CommonVoice-zh) the 0.3B is ~37 % relatively behind the 0.6B teacher (0.061 vs 0.044) — the cost of 40 % fewer params and ~197 h of training data vs the teacher's vastly larger corpus. The cleaner official sets (Fleurs-zh, AISHELL-2, WenetSpeech) weren't runnable from our box (download-blocked / gated). The 0.3B's wins are size, real-time-on-Nano, and code-switch — not raw zh accuracy.
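The two percentages quoted above follow directly from the table values. Worked out:

```latex
\[
\frac{0.0610 - 0.0445}{0.0445} \approx 0.371
\quad (\text{relative zh-CER gap}),
\qquad
\frac{938 - 562}{938} \approx 0.401
\quad (\text{fraction of parameters removed}).
\]
```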

CER and WER figures measured on different test sets are not comparable with one another. The deployed streaming X-ASR transducer is still lower on pure zh-CER (~0.068); this model's value lies in being a bilingual LLM-ASR with code-switching, native Traditional output, 40 % fewer parameters than the teacher, and real-time operation on the Nano.

Files

qwen3-asr-micro-q8_0.gguf — GGUF (q8_0) for rapidspeech.cpp / qwen3-asr.cpp (Nano / llama.cpp-style engines)
model.safetensors, config.json, tokenizer — PyTorch (transformers) checkpoint
scripts/ — the full reproduction pipeline (see below)

Usage

GGUF (edge, recommended on Nano): block_count=14 is stored in the GGUF metadata, so the existing Qwen3-ASR engine builds the right decoder without a rebuild:

rs-asr-offline -m qwen3-asr-micro-q8_0.gguf -w clip_16k.wav        # rapidspeech.cpp
qwen3-asr-cli  -m qwen3-asr-micro-q8_0.gguf -f clip_16k.wav -t 4   # qwen3-asr.cpp (Nano, RTF ~0.8)

Engines: vieenrose/qwen3-asr.cpp (standalone) · vieenrose/RapidSpeech.cpp (integrated).
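To check what an engine will see, the GGUF metadata can be read without any engine at all. The following is a sketch using only the standard library, assuming the public GGUF header layout (version 2 or later, little-endian). The key name qwen3.block_count is the one quoted in the deployment step; the function names are illustrative and this is not the engine's own loader.

```python
import struct
from typing import Any, BinaryIO

# GGUF metadata value type ids mapped to struct formats (scalars only).
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9

def _read(f: BinaryIO, fmt: str) -> Any:
    size = struct.calcsize(fmt)
    data = f.read(size)
    if len(data) != size:
        raise EOFError("truncated GGUF header")
    return struct.unpack(fmt, data)[0]

def _read_string(f: BinaryIO) -> str:
    # Strings are a uint64 byte length followed by UTF-8 bytes.
    length = _read(f, "<Q")
    return f.read(length).decode("utf-8", errors="replace")

def _read_value(f: BinaryIO, vtype: int) -> Any:
    if vtype in _SCALARS:
        return _read(f, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_string(f)
    if vtype == _ARRAY:
        # Arrays carry an element type id and a uint64 element count.
        item_type = _read(f, "<I")
        count = _read(f, "<Q")
        return [_read_value(f, item_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")

def read_gguf_metadata(path: str) -> dict[str, Any]:
    """Return the key/value metadata stored in a GGUF file header."""
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        _version = _read(f, "<I")
        _tensor_count = _read(f, "<Q")
        kv_count = _read(f, "<Q")
        meta = {}
        for _ in range(kv_count):
            key = _read_string(f)
            meta[key] = _read_value(f, _read(f, "<I"))
        return meta
```

Calling `read_gguf_metadata("qwen3-asr-micro-q8_0.gguf").get("qwen3.block_count")` should then return the decoder depth, provided the key is named as stated. The full vocabulary is stored as a metadata array, so the call takes a moment in pure Python.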

PyTorch (transformers from source — Qwen3-ASR is only in main):

pip install git+https://github.com/huggingface/transformers.git

import torch
import soundfile as sf
from transformers import Qwen3ASRForConditionalGeneration, AutoProcessor

proc  = AutoProcessor.from_pretrained("Luigi/Qwen3-ASR-Micro")
model = Qwen3ASRForConditionalGeneration.from_pretrained("Luigi/Qwen3-ASR-Micro", dtype=torch.float16).to("cuda").eval()

# Input must be 16 kHz mono; resample beforehand if the check below fails.
a, sr = sf.read("clip_16k.wav")
assert sr == 16000, f"expected 16 kHz audio, got {sr} Hz"

# One user turn: the audio plus an empty text prompt (no language hint).
conv = [{"role": "user", "content": [{"type": "audio", "audio": a}, {"type": "text", "text": ""}]}]
inp = proc.apply_chat_template(conv, add_generation_prompt=True, tokenize=True, return_dict=True,
                               return_tensors="pt", sampling_rate=16000)
# Move tensors to the GPU, casting only floating-point ones to fp16.
inp = {k: (v.to("cuda").half() if v.is_floating_point() else v.to("cuda")) for k, v in inp.items()}
# Greedy decoding; slice off the prompt tokens before converting to text.
out = model.generate(**inp, max_new_tokens=96, do_sample=False)
print(proc.batch_decode(out[:, inp["input_ids"].shape[1]:], skip_special_tokens=True)[0])

Full reproduction process

All scripts are in scripts/ here. The training data is NOT published (the zh-TW audio is YouTube content — copyright / platform ToS); the pipeline below reconstructs an equivalent corpus from your own sources.

0. Environment

pip install git+https://github.com/huggingface/transformers.git (the qwen3_asr modeling code is only in main, not in any release up to 5.12).

Convert the teacher to the transformers HF layout (the published 0.6B is Qwen's original thinker.* format):

python transformers/models/qwen3_asr/convert_qwen3_asr_to_hf.py \
    --model_id Qwen/Qwen3-ASR-0.6B --dst_dir qwen3-asr-hf --model_type asr

1. Data — zh-TW (teacher-labeled, the hard part)

The published 0.6B is the teacher; it labels raw audio, which also gives native-Traditional targets.

Re-download the source Taiwan-Mandarin audio from YouTube (scripts/scale_corpus.py, IDs in data/youtube_ids.txt): yt-dlp (itag-140) → 16 kHz mono → fixed 16 s windows.
Teacher-label each window with Qwen3-ASR-0.6B (greedy), then apply OpenCC s2twp → Traditional Chinese targets (the teacher's raw script is inconsistent; s2twp is a deterministic post-step).
135 videos → 145 h, 36 105 segments (the initial corpus; it was later scaled to ~197 h, see step 4). Neither the audio nor the labels are published (copyright); supply your own Taiwan-Mandarin source list to scale_corpus.py to reconstruct an equivalent corpus.
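The fixed-window step above can be sketched as a computation of sample boundaries. This assumes non-overlapping windows and a short-tail cutoff; the source does not say how scale_corpus.py treats the final remainder, so `min_tail_seconds` and the function name are hypothetical.

```python
SAMPLE_RATE = 16_000     # Hz, from the pipeline description
WINDOW_SECONDS = 16      # fixed window length, from the pipeline description

def window_bounds(num_samples: int, min_tail_seconds: float = 1.0) -> list[tuple[int, int]]:
    """Return (start, end) sample indices of consecutive non-overlapping windows.

    Assumption: a trailing remainder shorter than `min_tail_seconds` is dropped.
    """
    win = SAMPLE_RATE * WINDOW_SECONDS
    min_tail = int(min_tail_seconds * SAMPLE_RATE)
    bounds = []
    for start in range(0, num_samples, win):
        end = min(start + win, num_samples)
        if end - start >= min_tail:
            bounds.append((start, end))
    return bounds
```

Each `(start, end)` pair can then be used to slice the decoded 16 kHz mono waveform before it is passed to the teacher for labeling.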

2. Data — English balance (anti-forgetting)

scripts/english_prep3.py: LibriSpeech train-clean-100 via HF datasets streaming with Audio(decode=False) + soundfile (bypasses torchcodec) → 16 kHz wav + gold transcripts, ~30 h.
A 30 % English mix during training prevents catastrophic forgetting of English.
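One simple way to realise such a mix is to draw each training example from the English pool with probability 0.30 and from the zh-TW pool otherwise. The sketch below assumes per-example sampling; the constant is named EN_PROB after the setting listed in the training step, and the function name is illustrative.

```python
import random

EN_PROB = 0.30  # probability of drawing an English example

def sample_example(zh_items, en_items, rng: random.Random):
    """Pick one training example: English with probability EN_PROB, else zh-TW."""
    pool = en_items if rng.random() < EN_PROB else zh_items
    return rng.choice(pool)
```

Passing an explicit `random.Random(seed)` keeps the mix reproducible across runs.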

3. Student construction

Start from the teacher; prune the 28-layer Qwen3 decoder to 14 layers, "drop-middle" (keep first 7 + last 7 — naive first-N or last-N truncation collapses the model). Re-index self_attn.layer_idx.
Freeze the audio encoder + projector + embeddings; train only the 14 decoder layers + final norm (220 M trainable). The encoder is one-shot and already good; the decoder is what we shrink.
Note: a fresh-pruned decoder has CER ≈ 1.0 — recovery requires distillation (it is not a "prune + heal").
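The drop-middle selection described above reduces to choosing layer indices. A minimal sketch follows (the function name is illustrative; applying the result to the actual decoder module is left out):

```python
def drop_middle_indices(num_layers: int = 28, keep: int = 14) -> list[int]:
    """Indices of the layers kept: the first keep/2 and the last keep/2."""
    if keep % 2 or keep > num_layers:
        raise ValueError("keep must be even and not exceed num_layers")
    half = keep // 2
    return list(range(half)) + list(range(num_layers - half, num_layers))

kept = drop_middle_indices()
# Old layer index -> new contiguous index; this is what re-indexing
# self_attn.layer_idx amounts to after pruning.
reindex = {old: new for new, old in enumerate(kept)}
```

With the defaults, teacher layers 0 to 6 and 21 to 27 are kept, and teacher layer 21 becomes student layer 7.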

4. KD training (scripts/train_student_full.py)

Loss: cross-entropy on the teacher's Traditional labels (sequence-level KD). Label masking: build the prompt (add_generation_prompt=True) then manually append tokenizer(text)+eos and mask the prompt with -100 (the chat template does not append assistant text).
Optimizer: AdamW lr 2e-5, 100-step linear warmup, grad-clip 0.5, nan-guard (skip non-finite loss), grad-accum 4, EN_PROB 0.30, 8000 steps (~23 min on a single GB10). Save best-by-zh-CER checkpoint.
Curve (v1, CE-only): zh-CER 0.997 → 0.201 (1k) → 0.0865 (200-clip eval).
Refinement (this checkpoint), scripts/train_03b_kd.py + scripts/train_03b_hskd.py: warm-start the v1 student and add logit-KL (KL of student vs teacher softened logits, T=2) then hidden-state KD (normalised MSE of student decoder layer l vs teacher layer 2l). Loss 0.4·CE + 0.4·KL + 0.2·HS, cosine LR 1e-5→1e-6. This took zh-CER 0.0865 → 0.0813 → 0.0722, then scaling the zh corpus 145 h → ~197 h reached 0.0702 (teacher 0.065). Both teacher+student in fp32 (mixed fp16/fp32 causes conv dtype errors); eval only on the 200-clip set (the 30-clip train-eval is noisy).
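The stacked objective can be illustrated per token with plain Python. This is a sketch under stated assumptions: the KL direction (teacher to student), the T² scaling, and the reading of "normalised MSE" as MSE between L2-normalised vectors are conventional choices, not confirmed details of train_03b_kd.py or train_03b_hskd.py. Only the weights 0.4 / 0.4 / 0.2 and T = 2 come from the text.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(student_logits, target_index):
    """CE against the teacher-generated label token."""
    return -math.log(softmax(student_logits)[target_index])

def softened_kl(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2 (assumed)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

def normalised_mse(student_hidden, teacher_hidden):
    """MSE between L2-normalised hidden vectors (assumed interpretation).

    Per the text, student decoder layer l is paired with teacher layer 2l.
    """
    def unit(v):
        n = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / n for x in v]
    s, t = unit(student_hidden), unit(teacher_hidden)
    return sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)

def stacked_kd_loss(ce, kl, hs, w_ce=0.4, w_kl=0.4, w_hs=0.2):
    """Weighted sum 0.4*CE + 0.4*KL + 0.2*HS."""
    return w_ce * ce + w_kl * kl + w_hs * hs
```

In a real training loop these terms would be computed on batched tensors over the unmasked label positions; the scalar version here only shows how the three pieces combine.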

5. Export to GGUF (scripts/remap_student_to_thinker.py + convert_qwen3_asr_to_gguf.py)

Remap the transformers checkpoint (model.*) back to Qwen's thinker.* layout (the inverse of the HF converter) and duplicate the tied lm_head. Fix config.json layer_types to length 14.
convert_qwen3_asr_to_gguf.py --hf-dir … --output … --quant q8_0. Full 151 705 vocab kept.
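The remap and config fix can be pictured with two small helpers. This is a generic illustration only: the real key mapping in remap_student_to_thinker.py is likely more detailed than a single prefix swap, the key names used in the tests are hypothetical, and simple truncation of layer_types is an assumption that holds when all entries are alike.

```python
def rename_prefix(state_dict: dict, old: str, new: str) -> dict:
    """Return a copy whose keys starting with `old` start with `new` instead."""
    return {(new + k[len(old):] if k.startswith(old) else k): v
            for k, v in state_dict.items()}

def truncate_layer_types(config, num_layers: int = 14):
    """Recursively shorten every `layer_types` list in a nested config.

    Searching recursively avoids assuming where the decoder config is nested.
    """
    if isinstance(config, dict):
        return {k: (v[:num_layers] if k == "layer_types" and isinstance(v, list)
                    else truncate_layer_types(v, num_layers))
                for k, v in config.items()}
    if isinstance(config, list):
        return [truncate_layer_types(v, num_layers) for v in config]
    return config
```

Both helpers return new objects rather than mutating their inputs, which makes it easy to diff the result against the original before writing files.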

6. Deploy on Jetson Nano gen1

The engine (cuFFT batched mel + sm_53 CUDA-10.2 build) reads qwen3.block_count=14 and builds the 14-layer decoder automatically, with no rebuild. Measured RTF 0.79 (q8_0, 13.5 s clip), matching the results table.
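As a sanity check on the on-device numbers, the real-time factor is processing time divided by audio duration. The sketch below uses the figures reported in the results table (RTF and decode time for a 13.5 s clip) and assumes that "on-device decode" refers to decoder time only, so the remainder is mel extraction, encoder, and overhead.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock processing time / audio duration (< 1.0 is faster than real time)."""
    return processing_seconds / audio_seconds

CLIP_SECONDS = 13.5
# (RTF, decode seconds) as reported in the results table.
reported = {"student": (0.79, 6.7), "teacher": (1.3, 13.8)}

for name, (rtf, decode_s) in reported.items():
    total_s = rtf * CLIP_SECONDS   # implied end-to-end time
    other_s = total_s - decode_s   # implied non-decoder time (assumption, see above)
    print(f"{name}: total {total_s:.2f} s, decode {decode_s:.2f} s, other {other_s:.2f} s")
```

Under that assumption the non-decoder time comes out close to 4 s for both models, which would be consistent with the two sharing the same audio encoder.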

Training hyper-parameters


teacher	Qwen/Qwen3-ASR-0.6B
student	14-layer decoder (drop-middle of 28), encoder frozen, full vocab
data	~197 h zh-TW (YouTube, teacher+s2twp; scaled up from 145 h) + 30 h en (LibriSpeech)
objective	CE + logit-KL + hidden-state KD (stacked)
lr / sched / steps	2e-5 / linear-warmup-100 / 8000
batch	1 × grad-accum 4, 30 % English
precision	fp32 train, q8_0 GGUF
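The learning-rate row can be read as a linear ramp over the first 100 steps up to 2e-5. A small sketch, assuming the rate stays constant after warmup (the table does not state the post-warmup shape for the v1 run; the function name is illustrative):

```python
def warmup_lr(step: int, peak_lr: float = 2e-5, warmup_steps: int = 100) -> float:
    """Linear warmup to peak_lr over warmup_steps, then constant (assumed)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr

# Effective batch size from the table: micro-batch 1 with 4 accumulation steps.
EFFECTIVE_BATCH = 1 * 4
```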

Limitations

Small-student gap: zh-CER 0.0610 vs teacher 0.0445 on CommonVoice-zh (~37 % relative) — a real gap (40 % fewer params, ~197 h training vs the teacher's far larger corpus).
Code-switch/English trade: this checkpoint is tuned for code-switch (CS-MER 0.128→0.0765); pure-English regressed (en-WER 0.048→~0.080). If you need max pure-English, the pre-CS checkpoint is in the git history.
Couldn't benchmark on Fleurs-zh / AISHELL-2 / WenetSpeech (download-blocked / gated from our box); numbers above are CommonVoice-zh + an internal Taiwan code-switch set.
English is only ~30 h (anti-forgetting, not parity); scale it for stronger English.
Trained on broadcast/news/lecture Taiwan YouTube; conversational/attendant domains under-represented.
zh-TW labels via teacher + s2twp, not human gold.

Ethics / data

The training corpus is not published. The zh-TW audio is YouTube content; neither the audio nor its derived labels are redistributed (copyright / platform ToS). Only the reproduction scripts are shared, so others can build an equivalent corpus from their own sources. English is public LibriSpeech (referenced, not re-hosted).

License

Apache-2.0 (inherits the Qwen3-ASR-0.6B base license).
