Qwen3 ASR Micro (zh-TW / en, distilled, edge / Jetson Nano)
(562 M, formerly "Qwen3-ASR-0.3B"; a Micro-tier distillation of Qwen3-ASR-0.6B)
This is a new, smaller model, not a quantization of Qwen3-ASR-0.6B. It has a different architecture (a 14-layer decoder vs the 0.6B's 28) and was separately trained by knowledge distillation on a new corpus. The q8_0 file is just this model's GGUF serialization for edge inference, the same way any model can be saved to GGUF; it is not a re-quantized 0.6B. (HF may auto-tag GGUF repos as "quantized"; the base_model_relation here marks it as a trained derivative.)
Qwen3 ASR Micro is a compact bilingual (Traditional-Chinese + English, code-switch) speech-to-text model
distilled from Qwen/Qwen3-ASR-0.6B. It reuses the teacher's
18-layer audio encoder (copied, frozen) and full 151 705-token vocabulary, pairs them with a new
14-layer Qwen3 decoder (pruned drop-middle from the teacher's 28 and then trained), and was distilled on
~197 h Taiwan-Mandarin (YouTube, teacher-labeled + s2twp) + 30 h English (LibriSpeech). Outputs native
Traditional Chinese + punctuation.
This checkpoint is code-switch-optimized. It runs faster than real-time on a Jetson Nano gen1 (Tegra X1, Maxwell sm_53, CUDA 10.2): RTF ~0.8 vs the 0.6B's 1.3.
Relationship to Qwen3-ASR-0.6B
| Aspect | Detail |
|---|---|
| Type | new distilled model (≈ a "finetune"/student in HF's model tree), not a quantization |
| Shared with teacher | 18-layer audio encoder (frozen), tokenizer, full 151 705 vocab |
| Changed | decoder 28 → 14 layers (drop-middle, then KD-trained); new ~197 h training corpus |
| Params | 562 M total (vs 938 M): genuinely fewer weights, not the same weights re-quantized |
Results: head-to-head vs the teacher (our measurement)
This is the code-switch-optimized checkpoint (stacked KD: CE + logit-KL + hidden-state KD, then code-switch upweighting). All rows below are our own runs of both models with one identical protocol (greedy, fp16, no language hint) so the comparison is fair. Absolute numbers may differ from Qwen's official figures because we can't reproduce their exact text normaliser.
| metric (set) | this 0.3B-CS (562M) | teacher 0.6B (938M) |
|---|---|---|
| zh-CER: CommonVoice-zh, 500 clips (strip punct, CJK+alnum, t2s) | 0.0610 | 0.0445 |
| CS-MER: Taiwan code-switch, 60 clips (mixed CJK-char/en-word)¹ | 0.0765 | ref² |
| en-WER: CommonVoice-en, 200 clips | ~0.080 | 0.0233 |
| total params | 562 M | 938 M |
| GGUF q8_0 size | 735 MB | 961 MB |
| Jetson Nano gen1 RTF (q8_0, 13.5 s clip) | 0.79 (real-time) | 1.3 |
| on-device decode | 6.7 s (14 layers) | 13.8 s (28 layers) |
¹ Before code-switch upweighting this checkpoint's CS-MER was 0.128; upweighting in-corpus CS clips to 35 % cut it to 0.0765 (−40 %) at no pure-zh cost (0.0615 → 0.0610) but raised pure-en WER 0.048 → ~0.080, a deliberate code-switch/English trade.
² CS-MER references are teacher-generated, so the teacher is the ceiling by construction (not a fair teacher number).
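The CS-MER row counts errors over a mixed token stream: one token per CJK character and one per English word. The sketch below shows one way such a metric can be computed. It is not the normaliser used for the numbers above: the regular expression, the lowercasing, and the function names are assumptions made for illustration, and the t2s conversion step is left out.

```python
import re

# Assumption: one token per CJK ideograph (basic block only), one token per
# run of Latin letters/digits. Punctuation and whitespace are dropped.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9]+(?:'[A-Za-z]+)?")


def mixed_tokens(text: str) -> list[str]:
    """Split text into CJK characters and lowercased English words."""
    return [tok.lower() for tok in _TOKEN.findall(text)]


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over token lists (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def mixed_error_rate(ref_text: str, hyp_text: str) -> float:
    """Edit distance over mixed tokens, divided by the reference length."""
    ref, hyp = mixed_tokens(ref_text), mixed_tokens(hyp_text)
    if not ref:
        raise ValueError("reference has no scorable tokens")
    return edit_distance(ref, hyp) / len(ref)
```

For example, scoring the hypothesis "我想 update 這個 apps" against the reference "我想要 update 這個 app。" gives one deletion and one substitution over seven reference tokens.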
Honest gap: on clean read-speech (CommonVoice-zh) the 0.3B is ~37 % relatively behind the 0.6B teacher (0.061 vs 0.044), the cost of 40 % fewer params and ~197 h of training data vs the teacher's vastly larger corpus. The cleaner official sets (Fleurs-zh, AISHELL-2, WenetSpeech) weren't runnable from our box (download-blocked / gated). The 0.3B's wins are size, real-time-on-Nano, and code-switch, not raw zh accuracy.
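The relative figures quoted in this section follow directly from the table values. A small check, using only numbers stated above (the helper name `relative_change` is ours):

```python
def relative_change(new: float, old: float) -> float:
    """Signed relative change of `new` with respect to `old`."""
    return (new - old) / old


# zh-CER, student vs teacher on CommonVoice-zh
cer_gap = relative_change(0.0610, 0.0445)
# total parameters in millions, student vs teacher
param_cut = relative_change(562, 938)
# CS-MER after vs before code-switch upweighting
cs_gain = relative_change(0.0765, 0.128)

print(f"zh-CER gap: {cer_gap:+.1%}")
print(f"params:     {param_cut:+.1%}")
print(f"CS-MER:     {cs_gain:+.1%}")
```

The three values round to roughly +37 %, −40 %, and −40 %, matching the prose.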
CER/WER figures measured on different sets are not comparable and must not be compared across benchmarks. The deployed streaming X-ASR transducer is still lower on pure zh-CER (~0.068); this model's value is being a bilingual LLM-ASR with code-switch, native Traditional output, 40 % smaller, and real-time on the Nano.
Files
- qwen3-asr-micro-q8_0.gguf: GGUF (q8_0) for rapidspeech.cpp / qwen3-asr.cpp (Nano / llama.cpp-style engines)
- model.safetensors, config.json, tokenizer: PyTorch (transformers) checkpoint
- scripts/: the full reproduction pipeline (see below)
Usage
GGUF (edge, recommended on Nano): the block_count=14 flows through the GGUF, so the existing Qwen3-ASR
engine builds the right decoder with no rebuild:
rs-asr-offline -m qwen3-asr-micro-q8_0.gguf -w clip_16k.wav # rapidspeech.cpp
qwen3-asr-cli -m qwen3-asr-micro-q8_0.gguf -f clip_16k.wav -t 4 # qwen3-asr.cpp (Nano, RTF ~0.8)
Engines: vieenrose/qwen3-asr.cpp (standalone) · vieenrose/RapidSpeech.cpp (integrated).
PyTorch (transformers from source; Qwen3-ASR is only in main):
pip install git+https://github.com/huggingface/transformers.git
import torch, soundfile as sf
from transformers import Qwen3ASRForConditionalGeneration, AutoProcessor
proc = AutoProcessor.from_pretrained("Luigi/Qwen3-ASR-Micro")
model = Qwen3ASRForConditionalGeneration.from_pretrained("Luigi/Qwen3-ASR-Micro", dtype=torch.float16).to("cuda").eval()
a, sr = sf.read("clip_16k.wav")
conv = [{"role":"user","content":[{"type":"audio","audio":a},{"type":"text","text":""}]}]
inp = proc.apply_chat_template(conv, add_generation_prompt=True, tokenize=True, return_dict=True,
return_tensors="pt", sampling_rate=16000)
inp = {k:(v.to("cuda").half() if v.is_floating_point() else v.to("cuda")) for k,v in inp.items()}
out = model.generate(**inp, max_new_tokens=96, do_sample=False)
print(proc.batch_decode(out[:, inp["input_ids"].shape[1]:], skip_special_tokens=True)[0])
Full reproduction process
All scripts are in scripts/ here. The training data is NOT published (the zh-TW audio is YouTube content: copyright / platform ToS); the pipeline below reconstructs an equivalent corpus from your own sources.
0. Environment
- pip install git+https://github.com/huggingface/transformers.git (qwen3_asr modeling is only in main, not any release ≤ 5.12).
- Convert the teacher to the transformers HF layout (the published 0.6B is Qwen's original thinker.* format):
python transformers/models/qwen3_asr/convert_qwen3_asr_to_hf.py \
  --model_id Qwen/Qwen3-ASR-0.6B --dst_dir qwen3-asr-hf --model_type asr
1. Data: zh-TW (teacher-labeled, the hard part)
The published 0.6B is the teacher; it labels raw audio, which also gives native-Traditional targets.
- Re-download the source Taiwan-Mandarin audio from YouTube (scripts/scale_corpus.py, IDs in data/youtube_ids.txt): yt-dlp (itag-140) → 16 kHz mono → fixed 16 s windows.
- Teacher-label each window with Qwen3-ASR-0.6B (greedy), then apply OpenCC s2twp → Traditional Chinese targets (the teacher's raw script is inconsistent; s2twp is a deterministic post-step).
- 135 videos → **145 h**, 36 105 segments. Neither the audio nor the labels are published (copyright); supply your own Taiwan-Mandarin source list to scale_corpus.py to reconstruct an equivalent corpus.
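The fixed-window step can be sketched as below. This is not the code in scripts/scale_corpus.py: the function name, the non-overlapping layout, and the rule of dropping a tail shorter than one second are all assumptions made for illustration. Downloading and resampling are not shown.

```python
def fixed_windows(num_samples: int,
                  sample_rate: int = 16_000,
                  window_s: float = 16.0,
                  min_tail_s: float = 1.0) -> list[tuple[int, int]]:
    """Return (start, end) sample spans of consecutive fixed-length windows.

    Assumption: windows do not overlap, and a final partial window is kept
    only if it is at least `min_tail_s` seconds long.
    """
    window = int(window_s * sample_rate)
    min_tail = int(min_tail_s * sample_rate)
    spans = []
    for start in range(0, num_samples, window):
        end = min(start + window, num_samples)
        if end - start >= min_tail:
            spans.append((start, end))
    return spans
```

Each span can then be sliced from the 16 kHz mono waveform and passed to the teacher for labeling.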
2. Data: English balance (anti-forgetting)
- scripts/english_prep3.py: LibriSpeech train-clean-100 via HF datasets streaming with Audio(decode=False) + soundfile (bypasses torchcodec) → 16 kHz wav + gold transcripts, ~30 h.
- A 30 % English mix during training guards against catastrophic forgetting of English.
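One simple way to realise a 30 % English mix is to pick the source of each training example with a fixed probability. The sketch below assumes per-example Bernoulli sampling. The training script may mix differently, and `mixed_stream` and its arguments are hypothetical names.

```python
import random
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def mixed_stream(zh_items: Sequence[T],
                 en_items: Sequence[T],
                 en_prob: float = 0.30,
                 seed: int = 0) -> Iterator[T]:
    """Yield training items endlessly, drawing from the English pool with
    probability `en_prob` and from the zh-TW pool otherwise."""
    rng = random.Random(seed)  # local RNG so the mix is reproducible
    while True:
        pool = en_items if rng.random() < en_prob else zh_items
        yield rng.choice(pool)
```

Over many draws the English share converges to `en_prob`, independent of the relative sizes of the two pools.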
3. Student construction
- Start from the teacher; prune the 28-layer Qwen3 decoder to 14 layers, "drop-middle" (keep first 7 + last 7; naive first-N or last-N truncation collapses the model). Re-index self_attn.layer_idx.
- Freeze the audio encoder + projector + embeddings; train only the 14 decoder layers + final norm (220 M trainable). The encoder is one-shot and already good; the decoder is what we shrink.
- Note: a fresh-pruned decoder has CER ≈ 1.0, so recovery requires distillation (it is not a "prune + heal").
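The drop-middle selection reduces to choosing which teacher layer indices survive and how they are renumbered. A minimal sketch in plain Python (the function name is ours; applying it to a real checkpoint would mean rebuilding the decoder's layer list from `kept` and writing the new index into each layer's self_attn.layer_idx, which is not shown here):

```python
def drop_middle_indices(num_layers: int, keep: int) -> list[int]:
    """Indices of the layers kept when the middle of the stack is removed:
    the first keep/2 layers followed by the last keep/2 layers."""
    if keep > num_layers or keep % 2 != 0:
        raise ValueError("keep must be even and no larger than num_layers")
    half = keep // 2
    return list(range(half)) + list(range(num_layers - half, num_layers))


# 28-layer teacher decoder -> 14-layer student decoder
kept = drop_middle_indices(28, 14)

# Map each surviving teacher index to its new position in the student,
# i.e. the value a re-indexed layer_idx would take.
reindex = {old: new for new, old in enumerate(kept)}
```

Here `kept` is layers 0 to 6 followed by 21 to 27, and teacher layer 21 becomes student layer 7.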
4. KD training (scripts/train_student_full.py)
- Loss: cross-entropy on the teacher's Traditional labels (sequence-level KD). Label masking: build the prompt (add_generation_prompt=True) then manually append tokenizer(text) + eos and mask the prompt with -100 (the chat template does not append assistant text).
- Optimizer: AdamW lr 2e-5, 100-step linear warmup, grad-clip 0.5, nan-guard (skip non-finite loss), grad-accum 4, EN_PROB 0.30, 8000 steps (~23 min on a single GB10). Save best-by-zh-CER checkpoint.
- Curve (v1, CE-only): zh-CER 0.997 → 0.201 (1k) → 0.0865 (200-clip eval).
- Refinement (this checkpoint), scripts/train_03b_kd.py + scripts/train_03b_hskd.py: warm-start the v1 student and add logit-KL (KL of student vs teacher softened logits, T=2) then hidden-state KD (normalised MSE of student decoder layer l vs teacher layer 2l). Loss 0.4·CE + 0.4·KL + 0.2·HS, cosine LR 1e-5 → 1e-6. This took zh-CER 0.0865 → 0.0813 → 0.0722, then scaling the zh corpus 145 h → ~197 h reached 0.0702 (teacher 0.065). Both teacher+student in fp32 (mixed fp16/fp32 causes conv dtype errors); eval only on the 200-clip set (the 30-clip train-eval is noisy).
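Two pieces of the recipe above are easy to get wrong: the label masking and the softened logit-KL term. The sketch below illustrates both in plain Python on a single token position. It is not the training script. The KL direction (teacher as the target distribution) and the T² scaling are common distillation conventions assumed here, not details stated in this card, and all function names are hypothetical.

```python
import math

IGNORE_INDEX = -100  # label value the cross-entropy loss skips


def build_labels(prompt_ids, target_ids, eos_id):
    """Append target + eos to the prompt and mask the prompt positions."""
    input_ids = list(prompt_ids) + list(target_ids) + [eos_id]
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(target_ids) + [eos_id]
    return input_ids, labels


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by `temperature`."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def softened_kl(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 (assumed convention) for one token position."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


def stacked_loss(ce, kl, hs, weights=(0.4, 0.4, 0.2)):
    """Weighted sum 0.4*CE + 0.4*KL + 0.2*HS as quoted in the recipe."""
    return weights[0] * ce + weights[1] * kl + weights[2] * hs
```

With a three-token prompt and a two-token target, only the last three label positions (target plus eos) contribute to the cross-entropy. The KL term is zero when student and teacher logits agree.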
5. Export to GGUF (scripts/remap_student_to_thinker.py + convert_qwen3_asr_to_gguf.py)
- Remap the transformers checkpoint (model.*) back to Qwen thinker.* (inverse of the HF converter), dup the tied lm_head. Fix config.json layer_types to length 14.
- convert_qwen3_asr_to_gguf.py --hf-dir … --output … --quant q8_0. Full 151 705 vocab kept.
6. Deploy on Jetson Nano gen1
- The engine (cuFFT batched mel + sm_53 CUDA-10.2 build) reads qwen3.block_count=14 and builds the 14-layer decoder automatically, with no rebuild. Measured RTF 0.79 (q8_0, 13.5 s clip).
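Real-time factor (RTF) is processing time divided by audio duration, so values below 1 mean the engine keeps up with the audio. A minimal sketch using the 13.5 s clip and the RTF values reported in the results table (0.79 for this model, 1.3 for the teacher); the helper names are ours:

```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """RTF = wall-clock processing time / audio duration."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return processing_s / audio_s


def processing_time(rtf: float, audio_s: float) -> float:
    """Wall-clock time implied by a given RTF for a clip of `audio_s` seconds."""
    return rtf * audio_s


clip_s = 13.5
student_s = processing_time(0.79, clip_s)  # this model on the Nano
teacher_s = processing_time(1.3, clip_s)   # 0.6B teacher on the Nano

print(f"student: {student_s:.1f} s, real-time: {real_time_factor(student_s, clip_s) < 1}")
print(f"teacher: {teacher_s:.1f} s, real-time: {real_time_factor(teacher_s, clip_s) < 1}")
```

These totals cover the whole pipeline, which is why they are larger than the decode-only times listed in the results table.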
Training hyper-parameters
| Setting | Value |
|---|---|
| teacher | Qwen/Qwen3-ASR-0.6B |
| student | 14-layer decoder (drop-middle of 28), encoder frozen, full vocab |
| data | ~197 h zh-TW (scaled up from the initial 145 h; YouTube, teacher+s2twp) + 30 h en (LibriSpeech) |
| objective | CE + logit-KL + hidden-state KD (stacked) |
| lr / sched / steps | 2e-5 / linear-warmup-100 / 8000 |
| batch | 1 × grad-accum 4, 30 % English |
| precision | fp32 train, q8_0 GGUF |
Limitations
- Small-student gap: zh-CER 0.0610 vs teacher 0.0445 on CommonVoice-zh (~37 % relative), a real gap (40 % fewer params, ~197 h training vs the teacher's far larger corpus).
- Code-switch/English trade: this checkpoint is tuned for code-switch (CS-MER 0.128 → 0.0765); pure-English regressed (en-WER 0.048 → 0.080). If you need max pure-English, the pre-CS checkpoint is in the git history.
- Couldn't benchmark on Fleurs-zh / AISHELL-2 / WenetSpeech (download-blocked / gated from our box); numbers above are CommonVoice-zh + an internal Taiwan code-switch set.
- English is only ~30 h (anti-forgetting, not parity); scale it for stronger English.
- Trained on broadcast/news/lecture Taiwan YouTube; conversational/attendant domains under-represented.
- zh-TW labels via teacher + s2twp, not human gold.
Ethics / data
The training corpus is not published. The zh-TW audio is YouTube content; neither the audio nor its derived labels are redistributed (copyright / platform ToS). Only the reproduction scripts are shared, so others can build an equivalent corpus from their own sources. English is public LibriSpeech (referenced, not re-hosted).
License
Apache-2.0 (inherits the Qwen3-ASR-0.6B base license).