# Caspi-1.7B CoreML
CoreML conversion of OzLabs/Caspi-1.7B for on-device Hebrew speech recognition on Apple Silicon (macOS/iOS).
Caspi is a Hebrew-optimized fine-tune of Qwen/Qwen3-ASR-1.7B, achieving ~5% WER on Hebrew benchmarks.
## Model Details
| Property | Value |
|---|---|
| Base model | OzLabs/Caspi-1.7B (fine-tuned from Qwen/Qwen3-ASR-1.7B) |
| Architecture | Qwen3-ASR (audio encoder + LLM decoder) |
| Parameters | ~2B total (1.7B decoder + audio encoder) |
| Quantization | Int8 (linear quantization via coremltools) |
| Primary language | Hebrew |
| License | CC-BY-NC-4.0 (inherited from OzLabs/Caspi-1.7B) |
## Files

```
qwen3_asr_audio_encoder_v2.mlmodelc/   # Audio encoder (606 MB)
qwen3_asr_decoder_stateful.mlmodelc/   # Fused decoder + LM head with KV-cache (1.6 GB)
qwen3_asr_embeddings.bin               # Token embeddings (151936 x 2048, float16, 594 MB)
vocab.json                             # Vocabulary (151,643 entries)
```
## Architecture
| Component | Details |
|---|---|
| Audio encoder | 24 layers, d_model=1024, 16 heads, output_dim=2048 |
| Text decoder | 28 layers, hidden_size=2048, 16 query heads, 8 KV heads, head_dim=128 |
| Vocabulary | 151,936 tokens (byte-level BPE; embedding table is padded beyond the 151,643-entry vocab.json) |
| Decoding | Autoregressive with stateful KV-cache (max 512 tokens) |
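The grouped-query attention layout above keeps the stateful KV-cache small: with 8 KV heads of dim 128 across 28 layers and a 512-token window, a float16 cache works out to about 56 MiB (the float16 cache dtype is an assumption; the arithmetic follows directly from the table):

```python
layers, kv_heads, head_dim = 28, 8, 128  # from the architecture table
max_tokens = 512                         # stateful KV-cache window
bytes_fp16 = 2                           # assumed cache dtype

# keys + values (factor of 2), per layer, per cached token
kv_cache_bytes = layers * 2 * kv_heads * head_dim * max_tokens * bytes_fp16
print(f"{kv_cache_bytes / 2**20:.0f} MiB")  # 56 MiB
```

With 16 query heads sharing those 8 KV heads, the cache is half the size a standard multi-head layout would need.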
## Performance (Apple Silicon)

Tested on an M5 Pro with 48 GB unified memory:
| Metric | Value |
|---|---|
| RTFx (release build) | ~2.15x (faster than real-time) |
| Peak memory | ~6.3 GB |
| Model size on disk | ~2.8 GB |
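RTFx here is audio duration divided by wall-clock processing time, so at ~2.15x a one-minute clip transcribes in under 30 seconds:

```python
rtfx = 2.15             # measured RTFx from the table above
audio_seconds = 60.0    # one-minute clip

processing_seconds = audio_seconds / rtfx
print(f"{processing_seconds:.1f} s")  # 27.9 s
```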
## Usage with FluidAudio

These models are designed for use with FluidAudio's `Qwen3AsrManager`:

```swift
import FluidAudio

let manager = Qwen3AsrManager()
try await manager.loadModels(from: modelDirectory)

// Resample the input file to the sample rate the model expects
let samples = try AudioConverter().resampleAudioFile(audioURL)
let text = try await manager.transcribe(audioSamples: samples, language: "he")
print(text)
```
> **Note:** FluidAudio's `Qwen3AsrConfig` must be updated for the 1.7B dimensions (`hidden_size=2048`, etc.). See the `caspi-1.7b` branch of `alandotcom/FluidAudio` for the required config changes.
## Usage with Hex

A fork of Hex (a macOS dictation app) with Caspi support is available on the `caspi-hebrew` branch of `alandotcom/Hex`.

To use it:

- Download the model files to `~/Library/Application Support/FluidAudio/Models/caspi-1.7b-coreml/`
- Clone and build the Hex fork
- Select "Caspi 1.7B (Hebrew)" in Settings and set the language to Hebrew
## Conversion

Conversion scripts are available at `alandotcom/caspi-hebrew-asr`, forked from `FluidInference/mobius` with the dimensions updated for the 1.7B architecture.

To reproduce:

```bash
git clone https://github.com/alandotcom/caspi-hebrew-asr.git
cd caspi-hebrew-asr/conversion
uv sync
uv run python convert-qwen3-asr.py      # full f32 conversion
uv run python convert_decoder_fused.py  # fused stateful decoder
uv run python extract_embeddings.py     # embeddings + vocab
uv run python quantize_model.py input.mlpackage output.mlpackage --dtype int8  # quantize
```
The conversion pipeline:

- Audio encoder: traced with coremltools, FP16 precision
- Decoder: fused stateful decoder with the LM head baked in, float32 compute precision (required to avoid float16 overflow in RMSNorm)
- Embeddings: extracted as a raw float16 binary
- Decoder weights: post-training int8 linear quantization
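The float32 requirement is easy to reproduce in isolation: RMSNorm sums squared activations across the hidden dimension, and float16 saturates past 65504, so moderately large hidden states push the reduction to infinity. A small numpy demonstration (the 2048-wide vector mirrors the decoder's `hidden_size`; the activation magnitudes are illustrative, not taken from the model):

```python
import numpy as np

HIDDEN = 2048  # decoder hidden_size

x = np.full(HIDDEN, 20.0, dtype=np.float16)  # illustrative magnitudes

# RMSNorm needs sum(x^2); in fp16 that sum (2048 * 400 = 819200)
# exceeds the fp16 max of 65504 and overflows to inf.
sum_sq_fp16 = np.square(x).sum(dtype=np.float16)

# The same reduction in fp32 is fine, which is why the decoder
# is compiled with float32 compute precision.
sum_sq_fp32 = np.square(x.astype(np.float32)).sum()

print(sum_sq_fp16)  # inf
print(sum_sq_fp32)  # 819200.0
```

Note the overflow lives in the activations, not the weights, which is why int8 weight quantization and float32 compute precision coexist without conflict.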
## License
This model inherits the CC-BY-NC-4.0 license from OzLabs/Caspi-1.7B.
The base model Qwen/Qwen3-ASR-1.7B is Apache-2.0 licensed.
The conversion scripts are from FluidInference/mobius (Apache-2.0).
## Citations

### Caspi

```bibtex
@misc{caspi_hebrew_asr,
  title={Caspi-1.7B: Hebrew ASR fine-tuned from Qwen3-ASR-1.7B},
  author={Oz Labs},
  year={2026},
  howpublished={Hugging Face model card}
}
```

### Qwen3-ASR

```bibtex
@article{qwen3asr,
  title={Qwen3-ASR Technical Report},
  author={Qwen Team},
  journal={arXiv preprint arXiv:2601.21337},
  year={2025}
}
```