Cohere Transcribe - CoreML INT8

Corrected CoreML release of CohereLabs/cohere-transcribe-03-2026 for Apple Silicon inference.

This release uses:

  • a length-aware encoder with inputs mel and length
  • decoder prefill and decode models with explicit encoder_mask inputs
  • verified token IDs: pad=2, eos=3, bos=4

The important March 28 fix is the encoder contract. The earlier static encoder baked padding behavior into traced inference and could cause repetition or hallucination on shorter utterances. The corrected encoder keeps the fixed mel shape but accepts the real mel length as a second input.
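As a rough illustration of that contract, the sketch below pads a shorter mel spectrogram to the fixed shape and returns the real frame count as the second input. It uses plain Python lists to stay dependency-free. The helper name `pad_mel` and the zero pad value are assumptions for illustration, not part of this release.

```python
MEL_BINS = 128      # from the encoder input shape [1, 128, 3500]
MAX_FRAMES = 3500   # fixed number of mel frames the encoder accepts


def pad_mel(mel, pad_value=0.0):
    """Pad a [128][T] mel (list of rows) to [1][128][3500] and return (mel, length).

    pad_value=0.0 is an assumption; use whatever padding the mel frontend expects.
    """
    if len(mel) != MEL_BINS:
        raise ValueError(f"expected {MEL_BINS} mel bins, got {len(mel)}")
    n_frames = len(mel[0])
    if any(len(row) != n_frames for row in mel):
        raise ValueError("all mel rows must have the same number of frames")
    if n_frames > MAX_FRAMES:
        raise ValueError("audio too long for one pass; chunk it upstream")
    # Right-pad every row to the fixed width and add the batch dimension.
    padded = [[list(row) + [pad_value] * (MAX_FRAMES - n_frames) for row in mel]]
    length = [n_frames]  # shape [1], int32 when handed to the model
    return padded, length
```

The padded tensor and `length` would then be passed together as the `mel` and `length` inputs, so the encoder can tell real frames from padding.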

Contents

Artifact                               Precision     Notes
cohere_encoder_int8.mlpackage          INT8 weights  Length-aware encoder (mel, length)
cohere_decoder_prefill_int8.mlpackage  INT8 weights  Prompt prefill with encoder_mask
cohere_decoder_decode_int8.mlpackage   INT8 weights  Single-token decode with encoder_mask
tokenizer.model                        n/a           SentencePiece tokenizer
cohere_mel_filterbank.bin              n/a           Mel frontend weights
cohere_mel_window.bin                  n/a           Mel frontend window

Inputs / Outputs

Encoder

Input           Shape           Type
mel             [1, 128, 3500]  float32
length          [1]             int32

Output          Shape           Type
encoder_hidden  [1, 438, 1024]  float16

Decoder Prefill

Input           Shape           Type
encoder_hidden  [1, 438, 1024]  float32
input_ids       [1, 10]         int32
encoder_mask    [1, 438]        float32

Output          Shape           Type
logits          [1, 10, 16384]  float32
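The encoder_mask input has one entry per encoder frame. A minimal sketch of deriving it from the real mel length follows, under two loud assumptions that this card does not state: that the encoder subsamples time by a factor of 8 (inferred only from ceil(3500 / 8) = 438), and that the mask uses 1.0 for real frames and 0.0 for padding. Check the model's input description before relying on either.

```python
import math

MAX_MEL_FRAMES = 3500
ENCODER_FRAMES = 438
SUBSAMPLING = 8  # ASSUMPTION: inferred from ceil(3500 / 8) == 438, not documented


def encoder_mask(mel_length):
    """Return a [1][438] mask: 1.0 for frames assumed real, 0.0 for padding."""
    if not 0 <= mel_length <= MAX_MEL_FRAMES:
        raise ValueError("mel_length must be between 0 and 3500")
    valid = min(ENCODER_FRAMES, math.ceil(mel_length / SUBSAMPLING))
    # ASSUMPTION: multiplicative mask (1.0 = attend); an export could instead
    # expect an additive mask with large negative values on padded frames.
    return [[1.0] * valid + [0.0] * (ENCODER_FRAMES - valid)]
```

The same mask would be reused unchanged for every decode step of the same utterance.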

Decoder Decode

Input              Shape           Type
input_ids          [1, 1]          int32
cache_update_mask  [1, 512]        float32
cache_valid_mask   [1, 512]        float32
encoder_mask       [1, 438]        float32

Output             Shape           Type
logits             [1, 1, 16384]   float32
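This card does not spell out the semantics of the two cache masks. One common convention, shown here purely as an assumption, is that cache_update_mask is one-hot at the slot being written in the current step, and cache_valid_mask marks every slot that holds a real token up to and including that slot. The helper name `cache_masks` is hypothetical.

```python
CACHE_LEN = 512  # from the mask shape [1, 512]


def cache_masks(position):
    """Build ([1][512] update mask, [1][512] valid mask) for one decode step.

    ASSUMPTION: update is one-hot at `position`; valid covers slots 0..position.
    """
    if not 0 <= position < CACHE_LEN:
        raise ValueError("position must be within the 512-slot cache")
    update = [[1.0 if i == position else 0.0 for i in range(CACHE_LEN)]]
    valid = [[1.0 if i <= position else 0.0 for i in range(CACHE_LEN)]]
    return update, valid
```

Under this convention, and assuming the 10 prompt tokens occupy slots 0 to 9 after prefill, the first decode step would use `cache_masks(10)`.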

Prompt

English transcription with punctuation uses the following 10 prompt token IDs, which match the prefill input_ids shape of [1, 10]:

[13764, 7, 4, 16, 62, 62, 5, 9, 11, 13]

Relevant tokenizer IDs:

  • pad_token_id = 2
  • eos_token_id = 3
  • bos_token_id = 4
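To show how the prompt and the eos ID fit together, here is a minimal greedy decoding sketch. The callables `prefill_fn` and `decode_fn` are hypothetical wrappers that a caller would write around the prefill and decode models; they are not part of this release. The token budget assumes the 512-slot cache also holds the prompt, which is likewise an assumption.

```python
PROMPT_IDS = [13764, 7, 4, 16, 62, 62, 5, 9, 11, 13]
EOS_TOKEN_ID = 3
CACHE_LEN = 512  # ASSUMPTION: the cache holds prompt plus generated tokens


def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def greedy_decode(prefill_fn, decode_fn, max_new_tokens=CACHE_LEN - len(PROMPT_IDS)):
    """Greedy loop over hypothetical model wrappers.

    prefill_fn(ids)             -> list of logits rows, one per prompt position
    decode_fn(token, position)  -> one logits row for the next token
    """
    logits = prefill_fn(PROMPT_IDS)
    token = argmax(logits[-1])      # first generated token comes from the last prompt slot
    position = len(PROMPT_IDS)      # next cache slot to fill
    generated = []
    while token != EOS_TOKEN_ID and len(generated) < max_new_tokens:
        generated.append(token)
        token = argmax(decode_fn(token, position))
        position += 1
    return generated                # eos is not included in the result
```

The returned IDs would then be detokenized with the SentencePiece model in tokenizer.model.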

Validation

The corrected encoder was validated in the Apple CoreML runtime by running it on the same padded mel input with different length values:

  • same-length repeat max diff: 0.0
  • full-vs-short length max diff: 3.01220703125

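The zero difference on repeated runs shows the encoder is deterministic, and the nonzero difference for a shorter length confirms that the published encoder uses the length input rather than ignoring it.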
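The metric in this check is a maximum absolute difference over all output elements. A small dependency-free sketch of that metric is below; the helper names are illustrative, and how the two encoder outputs are obtained (for example, two predictions with different length values) is left to the caller.

```python
def flatten(nested):
    """Yield every scalar from an arbitrarily nested list."""
    for item in nested:
        if isinstance(item, (list, tuple)):
            yield from flatten(item)
        else:
            yield item


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two same-shaped outputs."""
    flat_a, flat_b = list(flatten(a)), list(flatten(b))
    if len(flat_a) != len(flat_b):
        raise ValueError("outputs must have the same number of elements")
    return max((abs(x - y) for x, y in zip(flat_a, flat_b)), default=0.0)
```

A result of 0.0 for two runs with the same length, and a clearly nonzero result when only length changes, would reproduce the pattern reported above.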

Notes

  • The encoder still uses a fixed mel tensor shape of [1, 128, 3500]; length tells the encoder how many frames are real.
  • Longer audio should still be chunked upstream.
  • No timestamps or speaker diarization are included.
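For the upstream chunking mentioned in these notes, one simple approach is to split the mel frames into consecutive windows of at most 3500 frames and run each window separately with its own length. The sketch below only computes the frame ranges. Non-overlapping fixed-size windows are an assumption made for brevity; a real pipeline might prefer to cut at silences or overlap the windows.

```python
MAX_FRAMES = 3500  # fixed encoder window, from the mel shape [1, 128, 3500]


def chunk_ranges(total_frames, chunk=MAX_FRAMES):
    """Return (start, end) frame ranges covering total_frames, each at most `chunk` long."""
    if total_frames < 0 or chunk <= 0:
        raise ValueError("total_frames must be >= 0 and chunk must be > 0")
    return [(start, min(start + chunk, total_frames))
            for start in range(0, total_frames, chunk)]
```

For each range, `end - start` is the value to pass as length for that chunk.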

License

Apache 2.0 (same as the base model)
