# Cohere Transcribe - ONNX INT4

This is an INT4 weight-only quantized ONNX conversion of CohereLabs/cohere-transcribe-03-2026.

This conversion follows the excellent work by Tristan Ripke, adapting his Whisper-style tensor contract and "baked-in" feature extraction to a more aggressive INT4 quantization for even smaller binary sizes and lower memory footprint.

## What's Included

INT4 weight-only quantized ONNX model (approx. 1.9 GB total):

| File | Size | Description |
|---|---|---|
| `cohere-encoder.int4.onnx` | 6 MB | Encoder graph |
| `cohere-encoder.int4.onnx.data` | 1.8 GB | Encoder weights (INT4) |
| `cohere-decoder.int4.onnx` | 364 KB | Decoder graph |
| `cohere-decoder.int4.onnx.data` | 137 MB | Decoder weights (INT4) |
| `tokens.txt` | 219 KB | 16,384-entry vocabulary |
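Before running the Quick Start below, it can help to sanity-check that all five files from the table are present in the working directory. A trivial sketch (the directory argument is an assumption; adjust to wherever you downloaded the files):

```python
from pathlib import Path

# Filenames taken from the table above.
REQUIRED_FILES = [
    "cohere-encoder.int4.onnx",
    "cohere-encoder.int4.onnx.data",
    "cohere-decoder.int4.onnx",
    "cohere-decoder.int4.onnx.data",
    "tokens.txt",
]

def missing_files(model_dir="."):
    """Return the subset of REQUIRED_FILES not found in model_dir."""
    root = Path(model_dir)
    return [f for f in REQUIRED_FILES if not (root / f).exists()]

if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All model files present.")
```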

## Comparison with INT8

| Metric | INT8 (Tristan) | INT4 (this repo) |
|---|---|---|
| Total size | 2.75 GB | 1.94 GB |
| Encoder weights | 2.6 GB | 1.8 GB |
| Wall-clock time (5.4 s clip, CPU) | 44.7 s | 31.9 s |
| Speedup | (baseline) | ~1.4x faster |

Benchmarks were run on a CPU-only Ubuntu VPS (8 GB RAM, no GPU) using .NET with Microsoft.ML.OnnxRuntime 1.24.4. Audio: `voxpopuli_test_en_demo.wav` (5.4 s, 16 kHz mono). Both models produce identical transcripts.

A `benchmark_cs.sh` script is included to reproduce these numbers: it builds a temporary .NET 8 project, runs both models, and prints the comparison.
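If you would rather time a run by hand from Python, a minimal wall-clock helper is enough (the `transcribe` function in the usage comment is a placeholder for a full encoder + decoder pass over the clip, not something shipped in this repo):

```python
import time

def time_call(fn, *args, repeats=3, **kwargs):
    """Run fn `repeats` times; return (best wall-clock seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        t0 = time.perf_counter()
        result = fn(*args, **kwargs)
        best = min(best, time.perf_counter() - t0)
    return best, result

# Usage sketch (placeholder function name):
#   elapsed, text = time_call(transcribe, "voxpopuli_test_en_demo.wav")
#   print(f"{elapsed:.1f}s: {text}")
```

Taking the best of several repeats reduces noise from OS scheduling, which matters on a shared VPS like the one used for the table above.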

## Quick Start (Python)

```shell
pip install onnxruntime numpy soundfile librosa
```

```python
import onnxruntime as ort
import numpy as np
import librosa

# Load audio (must be 16 kHz mono)
audio, sr = librosa.load("your_audio.wav", sr=16000, mono=True)

# Load models
enc = ort.InferenceSession("cohere-encoder.int4.onnx")
dec = ort.InferenceSession("cohere-decoder.int4.onnx")

# Load the vocabulary: each line of tokens.txt is "<token> <id>"
tokens = {}
with open("tokens.txt", "r", encoding="utf-8") as f:
    for line in f:
        parts = line.strip().rsplit(" ", 1)
        if len(parts) == 2:
            tokens[int(parts[1])] = parts[0]
token_to_id = {v: k for k, v in tokens.items()}

# Build prompt
prompt_ids = [token_to_id[t] for t in [
    "<|startofcontext|>", "<|startoftranscript|>", "<|emo:undefined|>",
    "<|en|>", "<|en|>", "<|pnc|>", "<|noitn|>", "<|notimestamp|>", "<|nodiarize|>"
]]

# Run encoder: also pre-computes cross-attention K/V for all decoder layers
cross_k, cross_v = enc.run(None, {"audio": audio.reshape(1, -1).astype(np.float32)})

# Run decoder (autoregressive greedy decoding)
N_LAYERS, HEADS, HEAD_DIM, MAX_CTX = 8, 8, 128, 1024
self_k = np.zeros((N_LAYERS, 1, HEADS, MAX_CTX, HEAD_DIM), dtype=np.float32)
self_v = np.zeros((N_LAYERS, 1, HEADS, MAX_CTX, HEAD_DIM), dtype=np.float32)
eos_id = token_to_id["<|endoftext|>"]

generated = list(prompt_ids)
current = np.array([prompt_ids], dtype=np.int64)
offset = np.array(0, dtype=np.int64)

for _ in range(256):
    logits, self_k, self_v = dec.run(None, {
        "tokens": current, "in_n_layer_self_k_cache": self_k,
        "in_n_layer_self_v_cache": self_v, "n_layer_cross_k": cross_k,
        "n_layer_cross_v": cross_v, "offset": offset,
    })
    next_id = int(np.argmax(logits[0, -1, :]))  # greedy: take the top logit
    if next_id == eos_id:
        break
    generated.append(next_id)
    # Advance the KV-cache position by the number of tokens just processed
    offset = np.array(int(offset) + current.shape[1], dtype=np.int64)
    current = np.array([[next_id]], dtype=np.int64)

# Decode to text: drop special tokens, map SentencePiece "\u2581" to spaces
text = "".join(
    tokens.get(t, "").replace("\u2581", " ")
    for t in generated[len(prompt_ids):]
    if not tokens.get(t, "").startswith("<|")
).strip()
print(text)
```

## How This Was Made

We followed the architectural adaptations pioneered by Tristan Ripke:

  1. **Feature extraction baked in**: the encoder takes raw 16 kHz audio; the STFT is implemented via Conv1d DFT filters.
  2. **Cross-attention K/V pre-computed**: the encoder pre-computes the Key/Value projections for all 8 decoder layers.
  3. **INT4 quantization**: `MatMulNBitsQuantizer` from `onnxruntime.quantization` performs weight-only 4-bit quantization of the MatMul operators, significantly reducing the external-data footprint while maintaining acceptable transcription accuracy.
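Point 1 can be illustrated in plain NumPy: a Conv1d whose filter rows are the real and imaginary DFT basis vectors, applied with stride equal to the hop length, computes exactly a rectangular-window STFT, so no `fft` op is needed in the graph. A minimal sketch (the window length, hop, and rectangular window here are illustrative assumptions, not the model's actual feature-extraction parameters):

```python
import numpy as np

def dft_filterbank(n_fft):
    """Real and imaginary DFT basis rows, one row per frequency bin."""
    k = np.arange(n_fft // 2 + 1)[:, None]   # frequency bin index
    n = np.arange(n_fft)[None, :]            # sample index within the window
    angle = -2.0 * np.pi * k * n / n_fft
    return np.cos(angle), np.sin(angle)      # (n_bins, n_fft) each

def stft_via_conv(audio, n_fft=400, hop=160):
    """Strided matmul with DFT filters, i.e. what Conv1d(stride=hop) computes."""
    cos_f, sin_f = dft_filterbank(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack(
        [audio[i * hop : i * hop + n_fft] for i in range(n_frames)]
    )                                        # (n_frames, n_fft)
    real = frames @ cos_f.T                  # (n_frames, n_bins)
    imag = frames @ sin_f.T
    return np.sqrt(real**2 + imag**2)        # magnitude spectrogram
```

The first frame of this output matches `np.abs(np.fft.rfft(audio[:n_fft]))`; baking the same filters into Conv1d weights is what lets the encoder accept raw audio directly.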

## Attribution

Original model: CohereLabs/cohere-transcribe-03-2026. ONNX adaptation design: Tristan Ripke. INT4 Quantization: community contribution.

License: Apache 2.0

