Cohere Transcribe 03-2026 - CoreML INT4 Fused

A highly compressed CoreML conversion of CohereLabs/cohere-transcribe-03-2026 with INT4 quantization and a fused mel+encoder pipeline. Designed for on-device deployment on macOS/iOS with GPU and Apple Neural Engine acceleration.

Key Metrics

| Metric | Value |
|---|---|
| Size | 1.1 GB (vs. 8.2 GB FP16 CoreML; 7.5x smaller) |
| Composite WER (25-sample fast eval) | 4.29% |
| RTFx | 7.8x real-time |
| Compute units | CPU + GPU (required; CPU-only breaks INT4 numerics) |
| Min macOS | 14.0+ (explicit KV-cache mode) |

Architecture

Three CoreML .mlpackage files, plus metadata and tokenizer assets, form the complete ASR pipeline:

| File | Size | Description |
|---|---|---|
| cohere_fused_mel_encoder_c4_mixed_int4.mlpackage | 963 MB | Fused mel spectrogram + FastConformer encoder (INT4) |
| cohere_decoder_prefill_cached_c4_mixed.mlpackage | 81 MB | Decoder prefill with explicit KV-cache (INT4 linear, INT8 embeddings) |
| cohere_decoder_decode_cached_c4_mixed.mlpackage | 74 MB | Autoregressive decoder step with explicit KV-cache (INT4 linear, INT8 embeddings) |
| cohere_transcribe_coreml_metadata.json | 1.3 KB | Pipeline config (dimensions, prompt IDs, compression policy) |
| tokenizer.model | 481 KB | SentencePiece tokenizer |
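The fused mel+encoder stage consumes raw 16 kHz audio directly, so the host app only needs to hand it a fixed-length sample buffer. A minimal sketch of preparing that buffer; the `[1, 480000]` shape and feature layout are assumptions, so check the model's input description for the real ones:

```swift
import CoreML

// Pad or truncate a 16 kHz mono buffer to the fixed 480,000-sample
// (30-second) window the fused mel+encoder expects. The tail is
// explicitly zero-filled; MLMultiArray contents are not guaranteed
// to be zero-initialized.
func makeAudioInput(samples: [Float]) throws -> MLMultiArray {
    let maxSamples = 480_000  // 30 s @ 16 kHz
    let array = try MLMultiArray(shape: [1, NSNumber(value: maxSamples)],
                                 dataType: .float32)
    for i in 0..<maxSamples {
        array[i] = NSNumber(value: i < samples.count ? samples[i] : 0)
    }
    return array
}
```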

Compression Policy

Encoder

  • Default: INT4 per-channel linear symmetric
  • Softened: pre_encode_* layers → INT8 (protects the mel-frontend STFT/filterbank)
  • Weight threshold: 500K params (smaller tensors kept FP16)
  • Skipped: Non-linear conv constants (48 tensors, ~46M elements)

Decoder

  • Default: INT4 per-channel linear symmetric
  • Softened: Token embedding + position embedding → INT8
  • Weight threshold: 16K params
  • Skipped: Bias vectors below threshold
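Taken together, the policy amounts to a per-tensor decision rule. The function below is purely illustrative (it is not part of the shipped conversion tooling), restating the encoder rules above under the assumption that the name prefix is checked before the size threshold:

```swift
// Illustrative restatement of the encoder compression policy:
// mel-frontend layers are softened to INT8, tensors under the
// 500K-parameter threshold stay FP16, everything else goes INT4
// (per-channel linear symmetric).
enum WeightFormat { case int4, int8, fp16 }

func encoderFormat(layerName: String, paramCount: Int) -> WeightFormat {
    if layerName.hasPrefix("pre_encode_") { return .int8 }  // protect STFT/filterbank
    if paramCount < 500_000 { return .fp16 }                // below weight threshold
    return .int4
}
```

The decoder follows the same shape of rule with a 16K-parameter threshold and the embeddings softened to INT8 instead.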

Pipeline Details

| Property | Value |
|---|---|
| Decoder mode | explicit_kv_cache (macOS 14+) |
| Mel frontend | Fused into the encoder graph |
| Max audio | 480,000 samples (30 s @ 16 kHz) |
| Encoder output | 438 frames |
| Max sequence | 512 tokens |
| Decoder layers | 8 |
| Attention heads | 8 |
| Head dim | 128 |
| Metadata version | 2 |
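With explicit KV-cache, decoding is a two-stage loop: one prefill call populates the cache from the prompt, then repeated single-token decode calls read and extend it. A rough greedy-decode sketch; every feature name here ("logits", "input_ids", the cache tensors) is an assumption to be matched against each model's `modelDescription` and the metadata JSON:

```swift
import CoreML

// Two-stage greedy decode with explicit KV-cache. Feature names are
// placeholders for whatever the converted models actually expose.
func greedyDecode(prefill: MLModel, decode: MLModel,
                  prefillInputs: MLFeatureProvider,
                  eosID: Int, maxTokens: Int = 512) throws -> [Int] {
    var tokens: [Int] = []

    // 1) Prefill: run the prompt once, producing last-step logits plus
    //    the initial KV-cache tensors (per layer: 8 heads x head dim 128).
    var state = try prefill.prediction(from: prefillInputs)

    for _ in 0..<maxTokens {
        // 2) Argmax over the last-step logits.
        guard let logits = state.featureValue(for: "logits")?.multiArrayValue
        else { break }
        var best = 0
        for i in 1..<logits.count where
            logits[i].floatValue > logits[best].floatValue { best = i }
        if best == eosID { break }
        tokens.append(best)

        // 3) Decode step: feed the new token (and the cache tensors,
        //    elided here) back into the single-step decoder.
        let tok = try MLMultiArray(shape: [1, 1], dataType: .int32)
        tok[0] = NSNumber(value: best)
        let stepInputs = try MLDictionaryFeatureProvider(dictionary: [
            "input_ids": MLFeatureValue(multiArray: tok)
            // ... plus each KV-cache tensor carried over from `state`
        ])
        state = try decode.prediction(from: stepInputs)
    }
    return tokens
}
```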

Usage (Swift)

```swift
import CoreML

// INT4 weights require GPU execution; CPU-only compute breaks the
// quantized numerics, so set the compute units before loading.
let config = MLModelConfiguration()
config.computeUnits = .cpuAndGPU  // Required for INT4

// Load the three pipeline stages. MLModel(contentsOf:) expects a
// compiled model; compile each .mlpackage once with
// MLModel.compileModel(at:) if you ship the packages directly.
let encoder = try MLModel(contentsOf: encoderURL, configuration: config)
let decoderPrefill = try MLModel(contentsOf: prefillURL, configuration: config)
let decoderDecode = try MLModel(contentsOf: decodeURL, configuration: config)
```

See cohere_transcribe_coreml_metadata.json for full pipeline configuration including prompt IDs, dimensions, and compression policy.
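The metadata file can be decoded once at startup to drive the pipeline. A hedged sketch of doing so with Codable; the struct's field names are guesses for illustration only and must be aligned with the actual JSON keys:

```swift
import Foundation

// Hypothetical shape of cohere_transcribe_coreml_metadata.json.
// Key names below are illustrative, not authoritative.
struct PipelineMetadata: Codable {
    let metadataVersion: Int     // 2
    let maxAudioSamples: Int     // 480,000
    let encoderFrames: Int       // 438
    let maxSequenceTokens: Int   // 512
    let promptIDs: [Int]         // decoder prompt token ids
}

func loadMetadata(from url: URL) throws -> PipelineMetadata {
    let decoder = JSONDecoder()
    decoder.keyDecodingStrategy = .convertFromSnakeCase
    return try decoder.decode(PipelineMetadata.self,
                              from: Data(contentsOf: url))
}
```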

License

GPL-3.0; see LICENSE.

The base model (CohereLabs/cohere-transcribe-03-2026) is Apache 2.0.
