Qwen3-ASR-1.7B-Swift

Special-case redistribution. Use this repo only when you need swift-transformers compatibility (a tokenizer.json on disk), e.g. as a Yooz Engine STT backend. Python / mlx-audio users can load the upstream mlx-community/Qwen3-ASR-1.7B-8bit directly. Part of the Yooz Working Models collection of task-specific Yooz artifacts.

A Swift-friendly redistribution of mlx-community/Qwen3-ASR-1.7B-8bit for use with swift-transformers and Yooz Engine.

The canonical mlx-community/Qwen3-ASR-1.7B-8bit checkpoint ships vocab.json + merges.txt + tokenizer_config.json but no tokenizer.json. The AutoTokenizer.from(modelFolder:) loader in swift-transformers requires tokenizer.json to be present on disk. This repo is the same checkpoint with tokenizer.json regenerated next to the existing artifacts, so Swift consumers get a one-line load.
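
To see which of the two layouts a local checkpoint folder has, it is enough to list the tokenizer files it contains. The sketch below is illustrative only: the helper name tokenizer_layout is made up for this example and is not part of swift-transformers, mlx-audio, or Yooz Engine.

```python
from pathlib import Path

# Slow-tokenizer inputs that the upstream MLX checkpoint ships.
SLOW_FILES = ("vocab.json", "merges.txt", "tokenizer_config.json")
# Fast-tokenizer file that swift-transformers needs on disk.
FAST_FILE = "tokenizer.json"


def tokenizer_layout(model_dir: str) -> dict:
    """Report which tokenizer files exist in a local model folder (hypothetical helper)."""
    root = Path(model_dir)
    return {
        "has_fast": (root / FAST_FILE).is_file(),
        "missing_slow": [name for name in SLOW_FILES if not (root / name).is_file()],
    }
```

For a download of the upstream checkpoint this should report has_fast as False; for this repo it should report True with nothing missing.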

Lineage

Qwen/Qwen3-ASR-1.7B          (upstream, FP16)
        |
        v
mlx-community/Qwen3-ASR-1.7B-8bit   (mlx-audio 0.3.1 quantization)
        |
        v  + tokenizer.json regenerated from vocab.json + merges.txt
YoozLabs/Qwen3-ASR-1.7B-Swift      (this repo)

Weights (model.safetensors, model.safetensors.index.json), config (config.json, generation_config.json, preprocessor_config.json, chat_template.json), and the slow-tokenizer inputs (vocab.json, merges.txt, tokenizer_config.json) are byte-for-byte identical to mlx-community/Qwen3-ASR-1.7B-8bit. The only addition is tokenizer.json, plus the scripts/regen_tokenizer.py reproduction script and a MANIFEST.txt of SHA-256 digests for every artifact.

Validation

End-to-end parity with the Python mlx-audio reference, measured during Yooz Engine epic #46 phase 4 (PR yooz-labs/yooz-engine#64):

Numerical parity: 9.6e-7 max absolute delta on decoder logits vs mlx-audio Python reference, end-to-end on a 5 s clip.
Word error rate parity: 0 absolute WER delta vs the Python reference on the yooz-benchmark EN / AR / FA subsets.
Tokenizer canary: "Hello" encodes to [9707] (matches Qwen3ASRTokenizerPrep.canaryExpectedTokens in the engine). The regen script cross-checks 5 multilingual canary strings against transformers.AutoTokenizer before writing tokenizer.json.
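
For readers unfamiliar with the numerical-parity metric, the following is a minimal sketch of a max-absolute-delta check on two logit vectors. It is not the engine's test harness; the function names and the 1e-6 default tolerance are assumptions chosen for illustration (the measured value above is 9.6e-7).

```python
def max_abs_delta(reference, candidate):
    """Largest element-wise absolute difference between two equal-length logit vectors."""
    if len(reference) != len(candidate):
        raise ValueError("logit vectors must have the same length")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def within_parity(reference, candidate, tolerance=1e-6):
    """True when the vectors agree to within `tolerance` (an assumed threshold)."""
    return max_abs_delta(reference, candidate) <= tolerance
```

In practice the two inputs would be the flattened decoder logits from the Swift and Python paths for the same audio clip.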

Eval

Subset	Metric	This checkpoint	Notes
LibriSpeech-style English	WER	6.3%	Parakeet TDT on the same set is 6.9%
Arabic (yooz-benchmark)	WER	6.7%	Auto-LID; no language hint required
Persian (yooz-benchmark)	WER	28.3%	Auto-LID identical to hinted-language path
Hebrew (yooz-benchmark)	WER	82.8%	Effectively unsupported, see Limitations
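
As a reminder of what the WER column measures, here is a minimal sketch of word error rate computed as word-level edit distance divided by the reference length. It splits on whitespace and applies no text normalization; the normalization used by yooz-benchmark is not described here, so scores from this sketch are not guaranteed to match the table.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] is the edit distance between the ref prefix processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.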

Latency (M-series Apple Silicon)

Measurement	Value	Notes
Cold start (model load)	~1.1 s	One-shot per process
Warm transcription, 5 s clip	0.32 s	After model is resident
Resident memory	~2.5 GB	8-bit quantized weights + KV cache

All numbers are from the Yooz Engine qwen3_asr_preview backend on M-series Apple Silicon (M2 Pro / M3 Max class). Detailed methodology lives in PR yooz-labs/yooz-engine#64.
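
The cold/warm split above can be reproduced on other hardware with a small timing harness. This is a generic sketch, not the methodology from the PR: load and transcribe are caller-supplied callables standing in for whatever backend is being measured, and the warm-up and repeat counts are arbitrary defaults.

```python
import time


def time_cold_and_warm(load, transcribe, warmup_runs=1, timed_runs=5):
    """Time model load once, then average warm transcription latency.

    `load` returns a model object; `transcribe` takes that model and runs one clip.
    Both are hypothetical stand-ins supplied by the caller.
    """
    start = time.perf_counter()
    model = load()
    cold_seconds = time.perf_counter() - start

    for _ in range(warmup_runs):  # untimed runs so caches are populated
        transcribe(model)

    start = time.perf_counter()
    for _ in range(timed_runs):
        transcribe(model)
    warm_seconds = (time.perf_counter() - start) / timed_runs
    return cold_seconds, warm_seconds
```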

Limitations

Streaming is buffer-then-finalize, not chunk-incremental. The audio tower uses non-causal block attention (_create_block_attention_mask), so partials finalize only when an utterance boundary is detected. True low-latency streaming is out of scope for this checkpoint and is tracked separately in the engine.
Hebrew is unsupported. 82.8% WER on the yooz-benchmark Hebrew subset indicates the model effectively does not transcribe Hebrew. Do not deploy it for Hebrew users.
Persian is preview-only. 28.3% WER is competitive with the best open multilingual ASR models we have measured but is not yet at parity with the per-language fine-tunes the engine uses for FA defaults. We are using it as a fallback, not a default.
English is not the default in Yooz Engine. Parakeet TDT remains the default English backend in YoozEngine.app because it is faster and roughly equivalent on accuracy. This checkpoint shines on multilingual / code-switched audio where Parakeet does not run.
Deterministic decoding only. No sampling parameters are exposed; the engine consumes greedy decode output.
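
The buffer-then-finalize behaviour described in the first limitation can be pictured with the sketch below. It is a simplified illustration, not the engine's code: the class name, the injected transcribe callable, and the amplitude-threshold silence detector are all assumptions, and how the engine actually detects utterance boundaries is not covered here.

```python
class BufferThenFinalize:
    """Accumulate audio chunks and transcribe only at an utterance boundary."""

    def __init__(self, transcribe, is_boundary):
        self._transcribe = transcribe    # callable: list of samples -> str
        self._is_boundary = is_boundary  # callable: chunk -> bool
        self._buffer = []

    def push(self, chunk):
        """Feed one chunk; return a final transcript at a boundary, else None."""
        if self._is_boundary(chunk):
            if not self._buffer:
                return None  # boundary with nothing buffered: nothing to emit
            text = self._transcribe(self._buffer)
            self._buffer = []
            return text
        self._buffer.extend(chunk)
        return None


def is_silence(chunk, threshold=0.01):
    """Assumed boundary detector: every sample in the chunk is near zero."""
    return all(abs(sample) < threshold for sample in chunk)
```

The caller sees no text while speech is still arriving; a transcript appears only once a boundary chunk is pushed, which is why latency scales with utterance length rather than chunk size.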

Files

File	Size	Purpose
`model.safetensors`	2.46 GB	8-bit MLX weights (audio tower + Qwen3 decoder)
`model.safetensors.index.json`	79 KB	Weight map
`tokenizer.json`	11.4 MB	Regenerated fast tokenizer (the addition over upstream)
`tokenizer_config.json`	12 KB	Special tokens, chat template, model_max_length
`vocab.json`	2.78 MB	Qwen2 byte-level BPE vocabulary
`merges.txt`	1.67 MB	Qwen2 BPE merges
`config.json`	7.2 KB	`Qwen3ASR` model config
`generation_config.json`	142 B	Decoding defaults
`preprocessor_config.json`	330 B	128-bin log-mel preprocessor settings
`chat_template.json`	1.16 KB	Jinja chat template
`scripts/regen_tokenizer.py`	n/a	Reproduction script; rebuilds `tokenizer.json` from vocab + merges
`MANIFEST.txt`	n/a	SHA-256 of each artifact above

SHA-256 digests are pinned in MANIFEST.txt. Verify after download with shasum -a 256 -c MANIFEST.txt.
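
Where shasum is not available, the same check can be done from Python. The sketch below assumes MANIFEST.txt uses the line format that shasum -c reads (a hex digest, whitespace, then a path relative to the manifest); the function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so multi-GB weights need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_manifest(manifest_path: str) -> list:
    """Return the relative paths whose digest is wrong or whose file is missing."""
    manifest = Path(manifest_path)
    failures = []
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # shasum marks binary-mode entries with '*'
        target = manifest.parent / name
        if not target.is_file() or sha256_of(target) != expected.lower():
            failures.append(name)
    return failures
```

An empty list from verify_manifest("MANIFEST.txt") means every listed artifact matched its pinned digest.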

Use with Yooz Engine (Swift)

import YoozEngineClient

let client = YoozEngineClient()
let result = try await client.stt.transcribe(
    audioURL: URL(fileURLWithPath: "audio.wav"),
    backend: .qwen3ASRPreview,
    languageHint: nil  // auto-LID; pass "ar" / "fa" / "zh" / "en" to force
)
print(result.text)

YoozEngineClient auto-launches YoozEngine.app, which downloads this repo to its model cache on first run and uses swift-transformers AutoTokenizer.from(modelFolder:) to load tokenizer.json directly. No manual regeneration is required for engine consumers.

Use with mlx-audio (Python)

tokenizer.json is additive. The original mlx-audio Python entry points work unchanged:

from mlx_audio.stt.utils import load_model
from mlx_audio.stt.generate import generate_transcription

model = load_model("YoozLabs/Qwen3-ASR-1.7B-Swift")
out = generate_transcription(
    model=model,
    audio_path="audio.wav",
    output_path="audio.txt",
    format="txt",
    verbose=True,
)
print(out.text)

Reproducing `tokenizer.json`

hf download mlx-community/Qwen3-ASR-1.7B-8bit --local-dir ./qwen3-asr
uv run --with tokenizers --with transformers \
  python scripts/regen_tokenizer.py --model-dir ./qwen3-asr
shasum -a 256 ./qwen3-asr/tokenizer.json
# expected: 20b91623123c0f04e694141e5e385a7c44e57b7594157c1e3e38a90d19954c0d

The regen script builds a tokenizers.Tokenizer whose BPE model is loaded directly from vocab.json + merges.txt, configures the canonical Qwen2 byte-level pre-tokenizer / decoder / post-processor, and attaches every special token from tokenizer_config.json's added_tokens_decoder. It then cross-checks 5 multilingual canary strings against transformers.AutoTokenizer before writing the file.
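
The cross-check-before-write step can be sketched independently of any tokenizer library. The code below is not taken from scripts/regen_tokenizer.py: the function names are hypothetical, and the two encoders are passed in as plain callables (for example, the encode method of the rebuilt tokenizer and of the transformers reference).

```python
def canary_mismatches(candidate_encode, reference_encode, canaries):
    """Return the canary strings on which the two encoders disagree."""
    return [
        text for text in canaries
        if candidate_encode(text) != reference_encode(text)
    ]


def write_if_verified(candidate_encode, reference_encode, canaries, write):
    """Call `write()` only when every canary encodes identically; raise otherwise."""
    mismatches = canary_mismatches(candidate_encode, reference_encode, canaries)
    if mismatches:
        raise ValueError(f"tokenizer mismatch on canaries: {mismatches!r}")
    write()
```

Gating the write on the comparison means a tokenizer.json that disagrees with the reference is never left on disk.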

License

Apache 2.0, inherited from Qwen/Qwen3-ASR-1.7B and mlx-community/Qwen3-ASR-1.7B-8bit. No relicensing.

Citation

The model itself is the work of the Tongyi Qwen team:

@misc{qwen3asr2025,
  title  = {Qwen3-ASR},
  author = {Tongyi Qwen Team and Alibaba Group},
  year   = {2025},
  url    = {https://huggingface.co/Qwen/Qwen3-ASR-1.7B}
}

The 8-bit MLX quantization is the work of the mlx-community and Blaizzy/mlx-audio maintainers.

This repository's only contribution is the regenerated tokenizer.json and the reproduction script for Swift compatibility. If you use the checkpoint via Yooz Engine, please also cite the engine release: https://github.com/yooz-labs/yooz-engine.

Contact

Issues with this checkpoint or the regen script: file at https://github.com/yooz-labs/yooz-engine/issues.
General questions: dev@yooz.info.
Upstream model issues: report against Qwen/Qwen3-ASR-1.7B, not here.
