Cohere Transcribe 03-2026 - CoreML INT4 Fused

A highly compressed CoreML conversion of CohereLabs/cohere-transcribe-03-2026 with INT4 quantization and a fused mel+encoder pipeline. Designed for on-device deployment on macOS/iOS with GPU and Apple Neural Engine acceleration.

Key Metrics

| Metric | Value |
|---|---|
| Size | 1.1 GB (vs. 8.2 GB FP16 CoreML; 7.5x smaller) |
| Composite WER (25-sample fast eval) | 4.29% |
| RTFx | 7.8x real-time |
| Compute units | CPU + GPU (required; CPU-only breaks INT4 numerics) |
| Min macOS | 14.0+ (explicit KV-cache mode) |

Architecture

Three CoreML .mlpackage files, plus metadata and tokenizer assets, form the complete ASR pipeline:

| File | Size | Description |
|---|---|---|
| cohere_fused_mel_encoder_c4_mixed_int4.mlpackage | 963 MB | Fused mel spectrogram + FastConformer encoder (INT4) |
| cohere_decoder_prefill_cached_c4_mixed.mlpackage | 81 MB | Decoder prefill with explicit KV-cache (INT4 linear, INT8 embeddings) |
| cohere_decoder_decode_cached_c4_mixed.mlpackage | 74 MB | Autoregressive decoder step with explicit KV-cache (INT4 linear, INT8 embeddings) |
| cohere_transcribe_coreml_metadata.json | 1.3 KB | Pipeline config (dimensions, prompt IDs, compression policy) |
| tokenizer.model | 481 KB | SentencePiece tokenizer |
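The fused mel+encoder stage consumes raw 16 kHz audio directly, so the host app only needs to hand it a fixed-length sample buffer. A minimal sketch of preparing that buffer; the `[1, 480000]` shape and feature layout are assumptions, so check the model's input description for the real ones:

```swift
import CoreML

// Pad or truncate a 16 kHz mono buffer to the fixed 480,000-sample
// (30-second) window the fused mel+encoder expects. The tail is
// explicitly zero-filled; MLMultiArray contents are not guaranteed
// to be zero-initialized.
func makeAudioInput(samples: [Float]) throws -> MLMultiArray {
    let maxSamples = 480_000  // 30 s @ 16 kHz
    let array = try MLMultiArray(shape: [1, NSNumber(value: maxSamples)],
                                 dataType: .float32)
    for i in 0..<maxSamples {
        array[i] = NSNumber(value: i < samples.count ? samples[i] : 0)
    }
    return array
}
```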

Compression Policy

Encoder

  • Default: INT4 per-channel linear symmetric
  • Softened: pre_encode_* layers → INT8 (protects the mel-frontend STFT/filterbank)
  • Weight threshold: 500K params (smaller tensors kept FP16)
  • Skipped: Non-linear conv constants (48 tensors, ~46M elements)

Decoder

  • Default: INT4 per-channel linear symmetric
  • Softened: Token embedding + position embedding → INT8
  • Weight threshold: 16K params
  • Skipped: Bias vectors below threshold
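Taken together, the policy amounts to a per-tensor decision rule. The function below is purely illustrative (it is not part of the shipped conversion tooling), restating the encoder rules above under the assumption that the name prefix is checked before the size threshold:

```swift
// Illustrative restatement of the encoder compression policy:
// mel-frontend layers are softened to INT8, tensors under the
// 500K-parameter threshold stay FP16, everything else goes INT4
// (per-channel linear symmetric).
enum WeightFormat { case int4, int8, fp16 }

func encoderFormat(layerName: String, paramCount: Int) -> WeightFormat {
    if layerName.hasPrefix("pre_encode_") { return .int8 }  // protect STFT/filterbank
    if paramCount < 500_000 { return .fp16 }                // below weight threshold
    return .int4
}
```

The decoder follows the same shape of rule with a 16K-parameter threshold and the embeddings softened to INT8 instead.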

Pipeline Details

| Property | Value |
|---|---|
| Decoder mode | explicit_kv_cache (macOS 14+) |
| Mel frontend | Fused into the encoder graph |
| Max audio | 480,000 samples (30 s @ 16 kHz) |
| Encoder output | 438 frames |
| Max sequence | 512 tokens |
| Decoder layers | 8 |
| Attention heads | 8 |
| Head dim | 128 |
| Metadata version | 2 |
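With explicit KV-cache, decoding is a two-stage loop: one prefill call populates the cache from the prompt, then repeated single-token decode calls read and extend it. A rough greedy-decode sketch; every feature name here ("logits", "input_ids", the cache tensors) is an assumption to be matched against each model's `modelDescription` and the metadata JSON:

```swift
import CoreML

// Two-stage greedy decode with explicit KV-cache. Feature names are
// placeholders for whatever the converted models actually expose.
func greedyDecode(prefill: MLModel, decode: MLModel,
                  prefillInputs: MLFeatureProvider,
                  eosID: Int, maxTokens: Int = 512) throws -> [Int] {
    var tokens: [Int] = []

    // 1) Prefill: run the prompt once, producing last-step logits plus
    //    the initial KV-cache tensors (per layer: 8 heads x head dim 128).
    var state = try prefill.prediction(from: prefillInputs)

    for _ in 0..<maxTokens {
        // 2) Argmax over the last-step logits.
        guard let logits = state.featureValue(for: "logits")?.multiArrayValue
        else { break }
        var best = 0
        for i in 1..<logits.count where
            logits[i].floatValue > logits[best].floatValue { best = i }
        if best == eosID { break }
        tokens.append(best)

        // 3) Decode step: feed the new token (and the cache tensors,
        //    elided here) back into the single-step decoder.
        let tok = try MLMultiArray(shape: [1, 1], dataType: .int32)
        tok[0] = NSNumber(value: best)
        let stepInputs = try MLDictionaryFeatureProvider(dictionary: [
            "input_ids": MLFeatureValue(multiArray: tok)
            // ... plus each KV-cache tensor carried over from `state`
        ])
        state = try decode.prediction(from: stepInputs)
    }
    return tokens
}
```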

Usage (Swift)

```swift
import CoreML

// INT4 weights require GPU execution; CPU-only compute breaks the
// quantized numerics, so set the compute units before loading.
let config = MLModelConfiguration()
config.computeUnits = .cpuAndGPU  // Required for INT4

// Load the three pipeline stages. MLModel(contentsOf:) expects a
// compiled model; compile each .mlpackage once with
// MLModel.compileModel(at:) if you ship the packages directly.
let encoder = try MLModel(contentsOf: encoderURL, configuration: config)
let decoderPrefill = try MLModel(contentsOf: prefillURL, configuration: config)
let decoderDecode = try MLModel(contentsOf: decodeURL, configuration: config)
```

See cohere_transcribe_coreml_metadata.json for full pipeline configuration including prompt IDs, dimensions, and compression policy.
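The metadata file can be decoded once at startup to drive the pipeline. A hedged sketch of doing so with Codable; the struct's field names are guesses for illustration only and must be aligned with the actual JSON keys:

```swift
import Foundation

// Hypothetical shape of cohere_transcribe_coreml_metadata.json.
// Key names below are illustrative, not authoritative.
struct PipelineMetadata: Codable {
    let metadataVersion: Int     // 2
    let maxAudioSamples: Int     // 480,000
    let encoderFrames: Int       // 438
    let maxSequenceTokens: Int   // 512
    let promptIDs: [Int]         // decoder prompt token ids
}

func loadMetadata(from url: URL) throws -> PipelineMetadata {
    let decoder = JSONDecoder()
    decoder.keyDecodingStrategy = .convertFromSnakeCase
    return try decoder.decode(PipelineMetadata.self,
                              from: Data(contentsOf: url))
}
```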

License

GPL-3.0; see LICENSE.

The base model (CohereLabs/cohere-transcribe-03-2026) is Apache 2.0.
