Qwen3-ASR-1.7B-Swift
A Swift-friendly redistribution of mlx-community/Qwen3-ASR-1.7B-8bit for use with swift-transformers and Yooz Engine.
The canonical mlx-community/Qwen3-ASR-1.7B-8bit checkpoint ships vocab.json + merges.txt + tokenizer_config.json but no tokenizer.json. swift-transformers's AutoTokenizer.from(modelFolder:) requires tokenizer.json to be present on disk. This repo is the same checkpoint with tokenizer.json regenerated next to the existing artifacts so Swift consumers get a one-line load.
Lineage
Qwen/Qwen3-ASR-1.7B (upstream, FP16)
|
v
mlx-community/Qwen3-ASR-1.7B-8bit (mlx-audio 0.3.1 quantization)
|
v + tokenizer.json regenerated from vocab.json + merges.txt
YoozLabs/Qwen3-ASR-1.7B-Swift (this repo)
Weights (model.safetensors, model.safetensors.index.json), config (config.json, generation_config.json, preprocessor_config.json, chat_template.json), and the slow-tokenizer inputs (vocab.json, merges.txt, tokenizer_config.json) are byte-for-byte identical to mlx-community/Qwen3-ASR-1.7B-8bit. The only addition is tokenizer.json, plus the scripts/regen_tokenizer.py reproduction script and a MANIFEST.txt of SHA-256 digests for every artifact.
Validation
End-to-end parity with the Python mlx-audio reference, measured during Yooz Engine epic #46 phase 4 (PR yooz-labs/yooz-engine#64):
- Numerical parity: 9.6e-7 max absolute delta on decoder logits vs the `mlx-audio` Python reference, end-to-end on a 5 s clip.
- Word error rate parity: 0 absolute WER delta vs the Python reference on the yooz-benchmark EN / AR / FA subsets.
- Tokenizer canary: `"Hello"` encodes to `[9707]` (matches `Qwen3ASRTokenizerPrep.canaryExpectedTokens` in the engine). The regen script cross-checks 5 multilingual canary strings against `transformers.AutoTokenizer` before writing `tokenizer.json`.
Eval
| Subset | Metric | This checkpoint | Notes |
|---|---|---|---|
| LibriSpeech-style English | WER | 6.3% | Parakeet TDT on the same set is 6.9% |
| Arabic (yooz-benchmark) | WER | 6.7% | Auto-LID free; no language hint required |
| Persian (yooz-benchmark) | WER | 28.3% | Auto-LID identical to hinted-language path |
| Hebrew (yooz-benchmark) | WER | 82.8% | Effectively unsupported, see Limitations |
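For readers who want to sanity-check WER figures like those in the table, the metric can be sketched as word-level edit distance divided by reference length. This is a minimal illustration, not the scoring code used for yooz-benchmark: the function name `word_error_rate` is hypothetical, and it assumes plain whitespace tokenization with no text normalization, which may differ from the benchmark's actual pipeline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative strings only, not benchmark data: one substitution plus one
# deletion against a six-word reference.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because WER is normalized by reference length, it can exceed 100% when the hypothesis contains many insertions.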
Latency (M-series Apple Silicon)
| Phase | Time | Notes |
|---|---|---|
| Cold start (model load) | ~1.1 s | One-shot per process |
| Warm transcription, 5 s clip | 0.32 s | After model is resident |
| Resident memory | ~2.5 GB | 8-bit quantized weights + KV cache |
All numbers are from the Yooz Engine qwen3_asr_preview backend on M-series Apple Silicon (M2 Pro / M3 Max class). Detailed methodology lives in PR yooz-labs/yooz-engine#64.
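As a rough way to read the warm-transcription row, the real-time factor (processing time divided by audio duration) can be derived from the two figures in the table. The helper name `real_time_factor` is hypothetical, and the result is simple arithmetic on the reported numbers rather than an additional measurement.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time per second of audio; values below 1.0 are faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Figures taken from the latency table: 0.32 s warm transcription of a 5 s clip.
rtf = real_time_factor(0.32, 5.0)
print(f"RTF ~ {rtf:.3f}, about {1 / rtf:.1f}x faster than real time")
```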
Limitations
- Streaming is buffer-then-finalize, not chunk-incremental. The audio tower uses non-causal block attention (`_create_block_attention_mask`), so partials only finalize when an utterance boundary is detected. True low-latency streaming is out of scope for this checkpoint and is tracked separately in the engine.
- Hebrew is unsupported. 82.8% WER on the yooz-benchmark Hebrew subset indicates the model effectively does not transcribe Hebrew. Do not deploy it for Hebrew users.
- Persian is preview-only. 28.3% WER is competitive with the best open multilingual ASR models we have measured but is not yet at parity with the per-language fine-tunes the engine uses for FA defaults. We are using it as a fallback, not a default.
- English is not the default in Yooz Engine. Parakeet TDT remains the default English backend in `YoozEngine.app` because it is faster and roughly equivalent on accuracy. This checkpoint shines on multilingual / code-switched audio where Parakeet does not run.
- Deterministic decoding only. No sampling parameters are exposed; the engine consumes greedy decode output.
Files
| File | Size | Purpose |
|---|---|---|
| `model.safetensors` | 2.46 GB | 8-bit MLX weights (audio tower + Qwen3 decoder) |
| `model.safetensors.index.json` | 79 KB | Weight map |
| `tokenizer.json` | 11.4 MB | Regenerated fast tokenizer (the addition over upstream) |
| `tokenizer_config.json` | 12 KB | Special tokens, chat template, model_max_length |
| `vocab.json` | 2.78 MB | Qwen2 byte-level BPE vocabulary |
| `merges.txt` | 1.67 MB | Qwen2 BPE merges |
| `config.json` | 7.2 KB | Qwen3ASR model config |
| `generation_config.json` | 142 B | Decoding defaults |
| `preprocessor_config.json` | 330 B | 128-bin log-mel preprocessor settings |
| `chat_template.json` | 1.16 KB | Jinja chat template |
| `scripts/regen_tokenizer.py` | reproduction | Rebuilds tokenizer.json from vocab + merges |
| `MANIFEST.txt` | | SHA-256 of each artifact above |
SHA-256 digests are pinned in MANIFEST.txt. Verify after download with shasum -a 256 -c MANIFEST.txt.
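On systems without `shasum`, the same check can be approximated in Python. This is a minimal sketch that assumes MANIFEST.txt uses the usual `shasum` layout (one `<hex digest>  <relative path>` entry per line), which is implied by the `-c` command above but not otherwise documented here. The function names `sha256_of` and `verify_manifest` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so multi-gigabyte weights do not load into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path):
    """Return the manifest entries that are missing or whose digest does not match."""
    manifest = Path(manifest_path)
    root = manifest.parent  # assumption: paths are relative to the manifest's folder
    mismatches = []
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*").strip()  # shasum marks binary-mode entries with '*'
        target = root / name
        if not target.is_file() or sha256_of(target) != expected.lower():
            mismatches.append(name)
    return mismatches
```

Calling `verify_manifest("Qwen3-ASR-1.7B-Swift/MANIFEST.txt")` on a local download (the folder name is only an example) should return an empty list when every artifact matches.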
Use with Yooz Engine (Swift)
import YoozEngineClient
let client = YoozEngineClient()
let result = try await client.stt.transcribe(
    audioURL: URL(fileURLWithPath: "audio.wav"),
    backend: .qwen3ASRPreview,
    languageHint: nil // auto-LID; pass "ar" / "fa" / "zh" / "en" to force
)
print(result.text)
YoozEngineClient auto-launches YoozEngine.app, which downloads this repo to its model cache on first run and uses swift-transformers AutoTokenizer.from(modelFolder:) to load tokenizer.json directly. No manual regeneration is required for engine consumers.
Use with mlx-audio (Python)
tokenizer.json is additive. The original mlx-audio Python entry points work unchanged:
from mlx_audio.stt.utils import load_model
from mlx_audio.stt.generate import generate_transcription
model = load_model("YoozLabs/Qwen3-ASR-1.7B-Swift")
out = generate_transcription(
    model=model,
    audio_path="audio.wav",
    output_path="audio.txt",
    format="txt",
    verbose=True,
)
print(out.text)
Reproducing tokenizer.json
hf download mlx-community/Qwen3-ASR-1.7B-8bit --local-dir ./qwen3-asr
uv run --with tokenizers --with transformers \
    python scripts/regen_tokenizer.py --model-dir ./qwen3-asr
shasum -a 256 ./qwen3-asr/tokenizer.json
# expected: 20b91623123c0f04e694141e5e385a7c44e57b7594157c1e3e38a90d19954c0d
The regen script builds a tokenizers.Tokenizer whose BPE model is loaded directly from vocab.json + merges.txt, configures the canonical Qwen2 byte-level pre-tokenizer / decoder / post-processor, and attaches every special token from tokenizer_config.json's added_tokens_decoder. It then cross-checks 5 multilingual canary strings against transformers.AutoTokenizer before writing the file.
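The canary cross-check described above can be pictured as comparing two encoders on a fixed list of strings and refusing to write the file if any token IDs differ. The sketch below is not the contents of scripts/regen_tokenizer.py: `find_canary_mismatches` is a hypothetical helper, the encoders are passed in as plain callables, and the sample strings are placeholders rather than the script's actual canaries. In a real run the two callables would presumably wrap the freshly built `tokenizers.Tokenizer` and `transformers.AutoTokenizer`.

```python
from typing import Callable, Iterable, List, Tuple

Encoder = Callable[[str], List[int]]


def find_canary_mismatches(
    fast_encode: Encoder,
    reference_encode: Encoder,
    canaries: Iterable[str],
) -> List[Tuple[str, List[int], List[int]]]:
    """Return (text, fast_ids, reference_ids) for every canary whose token IDs differ."""
    mismatches = []
    for text in canaries:
        fast_ids = list(fast_encode(text))
        reference_ids = list(reference_encode(text))
        if fast_ids != reference_ids:
            mismatches.append((text, fast_ids, reference_ids))
    return mismatches


# Stand-in encoders for illustration only: UTF-8 byte values, not real BPE IDs.
def toy_reference(text: str) -> List[int]:
    return list(text.encode("utf-8"))


def toy_broken(text: str) -> List[int]:
    return list(text.lower().encode("utf-8"))


placeholder_canaries = ["Hello", "سلام", "你好"]
print(find_canary_mismatches(toy_reference, toy_reference, placeholder_canaries))
print(len(find_canary_mismatches(toy_broken, toy_reference, placeholder_canaries)))
```

Returning the mismatching ID lists, rather than a bare boolean, makes it easier to see whether a failure comes from the pre-tokenizer, the merges, or a missing special token.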
License
Apache 2.0, inherited from Qwen/Qwen3-ASR-1.7B and mlx-community/Qwen3-ASR-1.7B-8bit. No relicensing.
Citation
The model itself is the work of the Tongyi Qwen team:
@misc{qwen3asr2025,
title = {Qwen3-ASR},
author = {Tongyi Qwen Team and Alibaba Group},
year = {2025},
url = {https://huggingface.co/Qwen/Qwen3-ASR-1.7B}
}
The 8-bit MLX quantization is the work of the mlx-community and Blaizzy/mlx-audio maintainers.
This repository's only contribution is the regenerated tokenizer.json and the reproduction script for Swift compatibility. If you use the checkpoint via Yooz Engine, please also cite the engine release: https://github.com/yooz-labs/yooz-engine.
Contact
- Issues with this checkpoint or the regen script: file at https://github.com/yooz-labs/yooz-engine/issues.
- General questions: dev@yooz.info.
- Upstream model issues: report against `Qwen/Qwen3-ASR-1.7B`, not here.