ASR Model Compression
Compressed ASR models for on-device speech recognition on Apple Silicon. CoreML and MLX variants of Cohere Transcribe, optimized for PressType.
A highly compressed CoreML conversion of CohereLabs/cohere-transcribe-03-2026 with INT4 quantization and a fused mel+encoder pipeline. Designed for on-device deployment on macOS/iOS with Apple Neural Engine and GPU acceleration.
| Metric | Value |
|---|---|
| Size | 1.1 GB (vs 8.2 GB FP16 CoreML, about 7.5x smaller) |
| Composite WER (25-sample fast) | 4.29% |
| RTFx | 7.8x real-time |
| Compute units | CPU + GPU (required; CPU-only breaks INT4 numerics) |
| Min macOS | 14.0+ (explicit KV-cache mode) |
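
As a quick check on the figures above, the sketch below recomputes the compression ratio and estimates how long a full 30-second window would take to transcribe. It assumes RTFx means audio duration divided by processing time and that the reported 7.8x holds for that clip length; the function name `estimatedProcessingSeconds` is illustrative only.

```swift
import Foundation

// Figures copied from the metrics table.
let fp16SizeGB = 8.2
let int4SizeGB = 1.1
let rtfx = 7.8

// Compression ratio: original size divided by compressed size.
let compressionRatio = fp16SizeGB / int4SizeGB

// Assumption: RTFx = audio duration / processing time.
func estimatedProcessingSeconds(audioSeconds: Double, rtfx: Double) -> Double {
    return audioSeconds / rtfx
}

let fullWindow = estimatedProcessingSeconds(audioSeconds: 30, rtfx: rtfx)
print(String(format: "compression: %.1fx, 30 s clip: about %.1f s",
             compressionRatio, fullWindow))
```

Actual latency depends on the device and on how many tokens the decoder emits, so treat the estimate as a rough guide.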
Three CoreML .mlpackage files form the complete ASR pipeline; a metadata file and the tokenizer ship alongside them:
| File | Size | Description |
|---|---|---|
| cohere_fused_mel_encoder_c4_mixed_int4.mlpackage | 963 MB | Fused mel spectrogram + FastConformer encoder (INT4) |
| cohere_decoder_prefill_cached_c4_mixed.mlpackage | 81 MB | Decoder prefill with explicit KV-cache (INT4 linear, INT8 embeddings) |
| cohere_decoder_decode_cached_c4_mixed.mlpackage | 74 MB | Autoregressive decoder with explicit KV-cache (INT4 linear, INT8 embeddings) |
| cohere_transcribe_coreml_metadata.json | 1.3 KB | Pipeline config (dimensions, prompt IDs, compression policy) |
| tokenizer.model | 481 KB | SentencePiece tokenizer |
pre_encode_* layers are kept at INT8 to protect the mel frontend (STFT/filterbank).

| Property | Value |
|---|---|
| Decoder mode | explicit_kv_cache (macOS 14+) |
| Mel frontend | Fused into encoder graph |
| Max audio | 480,000 samples (30 seconds @ 16kHz) |
| Encoder output | 438 frames |
| Max sequence | 512 tokens |
| Decoder layers | 8 |
| Attention heads | 8 |
| Head dim | 128 |
| Metadata version | 2 |
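
The fixed dimensions in the table imply some input preparation and memory budgeting on the caller's side. The sketch below collects them in a hypothetical `PipelineDims` struct, derives the maximum clip length and the element count of a self-attention KV cache, and shows one way to fit a clip to the fixed input length. Zero-padding, the cache layout (keys and values for every layer, head and position) and all names here are assumptions, not details confirmed by the model card.

```swift
import Foundation

// Constants copied from the properties table.
struct PipelineDims {
    let maxAudioSamples = 480_000
    let sampleRate = 16_000
    let maxSequence = 512
    let decoderLayers = 8
    let attentionHeads = 8
    let headDim = 128

    // Longest clip the encoder accepts, in seconds.
    var maxAudioSeconds: Double {
        Double(maxAudioSamples) / Double(sampleRate)
    }

    // Assumed layout: keys and values for every layer, head and position.
    var selfAttentionCacheElements: Int {
        2 * decoderLayers * attentionHeads * headDim * maxSequence
    }
}

// Zero-pad short clips and truncate long ones to a fixed length.
func padOrTruncate(_ samples: [Float], to length: Int) -> [Float] {
    if samples.count >= length {
        return Array(samples.prefix(length))
    }
    return samples + [Float](repeating: 0, count: length - samples.count)
}

let dims = PipelineDims()
let oneSecond = [Float](repeating: 0.1, count: 16_000)
let padded = padOrTruncate(oneSecond, to: dims.maxAudioSamples)
print(dims.maxAudioSeconds, padded.count, dims.selfAttentionCacheElements)
```

Audio longer than 30 seconds would need to be split into windows before encoding; truncation here simply keeps the first window.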
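import CoreML
// INT4 weights require CPU + GPU; CPU-only execution breaks the numerics.
let config = MLModelConfiguration()
config.computeUnits = .cpuAndGPU
// Load models (each URL must point to a compiled .mlmodelc bundle)
let encoder = try MLModel(contentsOf: encoderURL, configuration: config)
let decoderPrefill = try MLModel(contentsOf: prefillURL, configuration: config)
let decoderDecode = try MLModel(contentsOf: decodeURL, configuration: config)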
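
Once the models are loaded, the prefill and decode models are typically driven by a loop like the one sketched here: prefill consumes the prompt once, then the decode model is called one token at a time until an end-of-sequence token or the 512-token limit. The model card does not list the CoreML input and output feature names, so the two decoders are represented by hypothetical closures (`Prefill`, `DecodeStep`) that return next-token logits. Greedy selection is an assumption as well.

```swift
// Hypothetical stand-ins for the two decoder models. Each returns logits for
// the next token; KV-cache handling is assumed to live inside the closure.
typealias Prefill = ([Int]) -> [Float]
typealias DecodeStep = (_ token: Int, _ position: Int) -> [Float]

// Index of the largest logit (greedy choice).
func argmax(_ logits: [Float]) -> Int {
    var best = 0
    for (i, value) in logits.enumerated() where value > logits[best] {
        best = i
    }
    return best
}

func greedyDecode(promptIDs: [Int], eosID: Int, maxSequence: Int,
                  prefill: Prefill, step: DecodeStep) -> [Int] {
    var generated: [Int] = []
    // Prefill processes the whole prompt in one call.
    var next = argmax(prefill(promptIDs))
    var position = promptIDs.count
    while next != eosID && position < maxSequence {
        generated.append(next)
        // Each decode call consumes one token at the given position.
        next = argmax(step(next, position))
        position += 1
    }
    return generated
}

// Mock decoders over a 5-token vocabulary: each step predicts token + 1.
func oneHot(_ index: Int, size: Int) -> [Float] {
    var logits = [Float](repeating: 0, count: size)
    logits[index] = 1
    return logits
}

let tokens = greedyDecode(
    promptIDs: [0, 0], eosID: 4, maxSequence: 512,
    prefill: { _ in oneHot(1, size: 5) },
    step: { token, _ in oneHot(token + 1, size: 5) })
print(tokens)  // [1, 2, 3]
```

In a real integration the closures would wrap `decoderPrefill` and `decoderDecode` predictions, and the generated IDs would be passed to the SentencePiece tokenizer for detokenization.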
See cohere_transcribe_coreml_metadata.json for full pipeline configuration including prompt IDs, dimensions, and compression policy.
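
A minimal sketch of reading that metadata file without assuming its schema: it parses the JSON into a dictionary and lists whatever top-level keys are present. The sample JSON and its key names are placeholders for illustration, not the real file's contents; `loadMetadata` and `MetadataError` are hypothetical names.

```swift
import Foundation

struct MetadataError: Error {}

// Parse the metadata without assuming any particular key names.
func loadMetadata(from data: Data) throws -> [String: Any] {
    let object = try JSONSerialization.jsonObject(with: data)
    guard let dict = object as? [String: Any] else {
        throw MetadataError()
    }
    return dict
}

// Stand-in JSON; these keys are hypothetical, not the real schema.
let sample = Data(#"{"metadata_version": 2, "max_sequence": 512}"#.utf8)
let metadata = try loadMetadata(from: sample)
for key in metadata.keys.sorted() {
    print(key, "=", metadata[key]!)
}
```

For the shipped file, replace `sample` with `try Data(contentsOf: metadataURL)`, where `metadataURL` points at cohere_transcribe_coreml_metadata.json.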
GPL-3.0; see LICENSE.
The base model (CohereLabs/cohere-transcribe-03-2026) is Apache 2.0.