Qwen3-ASR-1.7B-Swift

A Swift-friendly redistribution of mlx-community/Qwen3-ASR-1.7B-8bit for use with swift-transformers and Yooz Engine.

The canonical mlx-community/Qwen3-ASR-1.7B-8bit checkpoint ships vocab.json, merges.txt, and tokenizer_config.json but no tokenizer.json. The AutoTokenizer.from(modelFolder:) loader in swift-transformers requires tokenizer.json to be present on disk. This repo is the same checkpoint with tokenizer.json regenerated next to the existing artifacts, so Swift consumers get a one-line load.

Lineage

Qwen/Qwen3-ASR-1.7B          (upstream, FP16)
        |
        v
mlx-community/Qwen3-ASR-1.7B-8bit   (mlx-audio 0.3.1 quantization)
        |
        v  + tokenizer.json regenerated from vocab.json + merges.txt
YoozLabs/Qwen3-ASR-1.7B-Swift      (this repo)

Weights (model.safetensors, model.safetensors.index.json), config (config.json, generation_config.json, preprocessor_config.json, chat_template.json), and the slow-tokenizer inputs (vocab.json, merges.txt, tokenizer_config.json) are byte-for-byte identical to mlx-community/Qwen3-ASR-1.7B-8bit. The additions are tokenizer.json, the scripts/regen_tokenizer.py reproduction script, and a MANIFEST.txt of SHA-256 digests for every artifact.

Validation

End-to-end parity with the Python mlx-audio reference, measured during Yooz Engine epic #46 phase 4 (PR yooz-labs/yooz-engine#64):

  • Numerical parity: 9.6e-7 max absolute delta on decoder logits vs mlx-audio Python reference, end-to-end on a 5 s clip.
  • Word error rate parity: 0 absolute WER delta vs the Python reference on the yooz-benchmark EN / AR / FA subsets.
  • Tokenizer canary: "Hello" encodes to [9707] (matches Qwen3ASRTokenizerPrep.canaryExpectedTokens in the engine). The regen script cross-checks 5 multilingual canary strings against transformers.AutoTokenizer before writing tokenizer.json.
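The numerical-parity figure above is a maximum absolute delta. As a minimal sketch of that metric (not the engine's actual test harness, which lives in the PR cited above), the helper below compares two flat sequences of logits element by element. The function name and the sample values are illustrative assumptions, not measured data.

```python
from typing import Iterable


def max_abs_delta(reference: Iterable[float], candidate: Iterable[float]) -> float:
    """Return the largest element-wise absolute difference between two sequences."""
    ref = list(reference)
    cand = list(candidate)
    if len(ref) != len(cand):
        raise ValueError(f"length mismatch: {len(ref)} vs {len(cand)}")
    # default=0.0 covers the degenerate case of two empty sequences.
    return max((abs(r - c) for r, c in zip(ref, cand)), default=0.0)


# Illustrative values only; these are NOT logits from either implementation.
ref_logits = [0.125, -1.5, 3.0]
swift_logits = [0.125, -1.5, 3.0 + 2**-20]
delta = max_abs_delta(ref_logits, swift_logits)
```

A parity check would then assert that the delta stays under a chosen tolerance; the tolerance itself is a project decision and is not stated in this card.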

Eval

| Subset | Metric | This checkpoint | Notes |
| --- | --- | --- | --- |
| LibriSpeech-style English | WER | 6.3% | Parakeet TDT on the same set is 6.9% |
| Arabic (yooz-benchmark) | WER | 6.7% | Auto-LID free; no language hint required |
| Persian (yooz-benchmark) | WER | 28.3% | Auto-LID identical to hinted-language path |
| Hebrew (yooz-benchmark) | WER | 82.8% | Effectively unsupported, see Limitations |
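For readers unfamiliar with the metric, WER is the word-level edit distance between the reference transcript and the hypothesis, divided by the number of reference words. The sketch below implements that textbook definition; it is not the yooz-benchmark scorer, and it assumes plain whitespace tokenization because the text normalization used for the numbers above (casing, punctuation, diacritics) is not stated here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER is not capped at 100%; a hypothesis much longer than the reference can score above 1.0.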

Latency (M-series Apple Silicon)

| Measure | Value | Notes |
| --- | --- | --- |
| Cold start (model load) | ~1.1 s | One-shot per process |
| Warm transcription, 5 s clip | 0.32 s | After model is resident |
| Resident memory | ~2.5 GB | 8-bit quantized weights + KV cache |

All numbers are from the Yooz Engine qwen3_asr_preview backend on M-series Apple Silicon (M2 Pro / M3 Max class). Detailed methodology lives in PR yooz-labs/yooz-engine#64.

Limitations

  • Streaming is buffer-then-finalize, not chunk-incremental. The audio tower uses non-causal block attention (_create_block_attention_mask), so partials only finalize when an utterance boundary is detected. True low-latency streaming is out of scope for this checkpoint and is tracked separately in the engine.
  • Hebrew is unsupported. 82.8% WER on the yooz-benchmark Hebrew subset indicates the model effectively does not transcribe Hebrew. Do not deploy it for Hebrew users.
  • Persian is preview-only. 28.3% WER is competitive with the best open multilingual ASR models we have measured but is not yet at parity with the per-language fine-tunes the engine uses for FA defaults. We are using it as a fallback, not a default.
  • English is not the default in Yooz Engine. Parakeet TDT remains the default English backend in YoozEngine.app because it is faster and roughly equivalent on accuracy. This checkpoint shines on multilingual / code-switched audio where Parakeet does not run.
  • Deterministic decoding only. No sampling parameters are exposed; the engine consumes greedy decode output.
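To make the buffer-then-finalize behavior in the first limitation concrete, here is a minimal sketch of the control flow: audio chunks accumulate until a boundary is detected, and only then is the whole buffer transcribed. The is_boundary and transcribe callables are hypothetical stand-ins; the engine's actual boundary detection and backend call are not described in this card.

```python
from typing import Callable, Iterable, Iterator, List


def buffer_then_finalize(
    chunks: Iterable[List[float]],
    is_boundary: Callable[[List[float]], bool],
    transcribe: Callable[[List[float]], str],
) -> Iterator[str]:
    """Yield one transcript per utterance, never per chunk."""
    buffer: List[float] = []
    for chunk in chunks:
        if is_boundary(chunk):
            # Boundary reached: flush the accumulated utterance, if any.
            if buffer:
                yield transcribe(buffer)
                buffer = []
        else:
            buffer.extend(chunk)
    # Flush whatever is left when the stream ends without a final boundary.
    if buffer:
        yield transcribe(buffer)


# Toy stand-ins (assumptions): an all-zero chunk counts as a boundary, and
# "transcription" just reports how many samples were buffered.
demo_chunks = [[0.1, 0.2], [0.3], [0.0], [0.4]]
demo_output = list(
    buffer_then_finalize(
        demo_chunks,
        is_boundary=lambda chunk: all(sample == 0.0 for sample in chunk),
        transcribe=lambda samples: f"{len(samples)} samples",
    )
)
```

The point of the sketch is that nothing is emitted between boundaries, so latency scales with utterance length rather than chunk size.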

Files

| File | Size | Purpose |
| --- | --- | --- |
| model.safetensors | 2.46 GB | 8-bit MLX weights (audio tower + Qwen3 decoder) |
| model.safetensors.index.json | 79 KB | Weight map |
| tokenizer.json | 11.4 MB | Regenerated fast tokenizer (the addition over upstream) |
| tokenizer_config.json | 12 KB | Special tokens, chat template, model_max_length |
| vocab.json | 2.78 MB | Qwen2 byte-level BPE vocabulary |
| merges.txt | 1.67 MB | Qwen2 BPE merges |
| config.json | 7.2 KB | Qwen3ASR model config |
| generation_config.json | 142 B | Decoding defaults |
| preprocessor_config.json | 330 B | 128-bin log-mel preprocessor settings |
| chat_template.json | 1.16 KB | Jinja chat template |
| scripts/regen_tokenizer.py | - | Reproduction script; rebuilds tokenizer.json from vocab + merges |
| MANIFEST.txt | - | SHA-256 of each artifact above |

SHA-256 digests are pinned in MANIFEST.txt. Verify after download with shasum -a 256 -c MANIFEST.txt.
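On systems without shasum, the same check can be approximated in Python. This is a sketch under one assumption: that MANIFEST.txt uses the standard shasum line format (a hex digest, whitespace, then a path relative to the manifest), which is what the -c flag above implies. The function names are illustrative and not part of this repo.

```python
import hashlib
from pathlib import Path
from typing import Dict


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size blocks so multi-GB weights are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_manifest(manifest: Path) -> Dict[str, bool]:
    """Map each file listed in the manifest to True (digest matches) or False."""
    results: Dict[str, bool] = {}
    root = manifest.parent
    for raw_line in manifest.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line:
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # shasum prefixes the name with '*' in binary mode
        target = root / name
        # A missing file counts as a failed check rather than raising.
        results[name] = target.is_file() and sha256_of(target) == expected.lower()
    return results
```

Calling verify_manifest(Path("MANIFEST.txt")) from the download directory returns a per-file result; any False entry means the file is missing or differs from the pinned digest.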

Use with Yooz Engine (Swift)

import Foundation  // for URL
import YoozEngineClient

// Uses top-level await: run from main.swift or another async context.
let client = YoozEngineClient()
let result = try await client.stt.transcribe(
    audioURL: URL(fileURLWithPath: "audio.wav"),
    backend: .qwen3ASRPreview,
    languageHint: nil  // auto-LID; pass "ar" / "fa" / "zh" / "en" to force a language
)
print(result.text)

YoozEngineClient auto-launches YoozEngine.app, which downloads this repo to its model cache on first run and uses swift-transformers AutoTokenizer.from(modelFolder:) to load tokenizer.json directly. No manual regeneration is required for engine consumers.

Use with mlx-audio (Python)

tokenizer.json is additive. The original mlx-audio Python entry points work unchanged:

from mlx_audio.stt.utils import load_model
from mlx_audio.stt.generate import generate_transcription

model = load_model("YoozLabs/Qwen3-ASR-1.7B-Swift")
out = generate_transcription(
    model=model,
    audio_path="audio.wav",
    output_path="audio.txt",
    format="txt",
    verbose=True,
)
print(out.text)

Reproducing tokenizer.json

hf download mlx-community/Qwen3-ASR-1.7B-8bit --local-dir ./qwen3-asr
uv run --with tokenizers --with transformers \
  python scripts/regen_tokenizer.py --model-dir ./qwen3-asr
shasum -a 256 ./qwen3-asr/tokenizer.json
# expected: 20b91623123c0f04e694141e5e385a7c44e57b7594157c1e3e38a90d19954c0d

The regen script builds a tokenizers.Tokenizer whose BPE model is loaded directly from vocab.json + merges.txt, configures the canonical Qwen2 byte-level pre-tokenizer / decoder / post-processor, and attaches every special token from tokenizer_config.json's added_tokens_decoder. It then cross-checks 5 multilingual canary strings against transformers.AutoTokenizer before writing the file.
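Two of the steps described above can be sketched without any tokenizer library: reading the special tokens out of added_tokens_decoder, and cross-checking canary strings between two encoders. This is not the contents of scripts/regen_tokenizer.py. The function names, the sample config entry, and the toy encoders are all assumptions for illustration; only the "Hello" to [9707] pair comes from this card.

```python
import json
from typing import Callable, Dict, List, Sequence, Tuple

Encoder = Callable[[str], List[int]]


def added_tokens_by_id(tokenizer_config: dict) -> List[Tuple[int, str, bool]]:
    """Return (token_id, content, is_special) triples sorted by token id."""
    decoder = tokenizer_config.get("added_tokens_decoder", {})
    return sorted(
        (int(token_id), entry["content"], bool(entry.get("special", False)))
        for token_id, entry in decoder.items()
    )


def canary_mismatches(
    candidate: Encoder, reference: Encoder, canaries: Sequence[str]
) -> Dict[str, Tuple[List[int], List[int]]]:
    """Return {text: (candidate_ids, reference_ids)} for every canary that disagrees."""
    mismatches: Dict[str, Tuple[List[int], List[int]]] = {}
    for text in canaries:
        got, want = candidate(text), reference(text)
        if got != want:
            mismatches[text] = (got, want)
    return mismatches


# Hypothetical config fragment in the usual tokenizer_config.json shape.
sample_config = json.loads(
    '{"added_tokens_decoder": {"7": {"content": "<|example_b|>", "special": true},'
    ' "3": {"content": "<|example_a|>", "special": false}}}'
)
sample_tokens = added_tokens_by_id(sample_config)

# Toy encoders standing in for the regenerated and reference tokenizers.
reference_encoder: Encoder = lambda text: {"Hello": [9707]}.get(text, [])
broken_encoder: Encoder = lambda text: [0]
agreeing = canary_mismatches(reference_encoder, reference_encoder, ["Hello"])
disagreeing = canary_mismatches(broken_encoder, reference_encoder, ["Hello"])
```

In a real run, the two encoders would wrap the freshly built tokenizer and transformers.AutoTokenizer, and a non-empty mismatch dictionary would abort before tokenizer.json is written.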

License

Apache 2.0, inherited from Qwen/Qwen3-ASR-1.7B and mlx-community/Qwen3-ASR-1.7B-8bit. No relicensing.

Citation

The model itself is the work of the Tongyi Qwen team:

@misc{qwen3asr2025,
  title  = {Qwen3-ASR},
  author = {Tongyi Qwen Team and Alibaba Group},
  year   = {2025},
  url    = {https://huggingface.co/Qwen/Qwen3-ASR-1.7B}
}

The 8-bit MLX quantization is the work of the mlx-community and Blaizzy/mlx-audio maintainers.

This repository's only contribution is the regenerated tokenizer.json and the reproduction script for Swift compatibility. If you use the checkpoint via Yooz Engine, please also cite the engine release: https://github.com/yooz-labs/yooz-engine.

Contact
