# Cohere Transcribe - ONNX INT4
This is an INT4 weight-only quantized ONNX conversion of CohereLabs/cohere-transcribe-03-2026.
This conversion follows the excellent work by Tristan Ripke, adapting his Whisper-style tensor contract and "baked-in" feature extraction to a more aggressive INT4 quantization for even smaller binary sizes and lower memory footprint.
## What's Included
INT4 weight-only quantized ONNX model (approx. 1.9 GB total):
| File | Size | Description |
|---|---|---|
| `cohere-encoder.int4.onnx` | 6 MB | Encoder graph |
| `cohere-encoder.int4.onnx.data` | 1.8 GB | Encoder weights (INT4) |
| `cohere-decoder.int4.onnx` | 364 KB | Decoder graph |
| `cohere-decoder.int4.onnx.data` | 137 MB | Decoder weights (INT4) |
| `tokens.txt` | 219 KB | 16,384-entry vocabulary |
## Comparison with INT8
| Metric | INT8 (Tristan) | INT4 (This repo) |
|---|---|---|
| Total Size | 2.75 GB | 1.94 GB |
| Encoder Weights | 2.6 GB | 1.8 GB |
| Wall-clock time (5.4s clip, CPU) | 44.7s | 31.9s |
| Speedup | – (baseline) | ~1.4x faster |
Benchmarks run on a CPU-only Ubuntu VPS (8 GB RAM, no GPU) using .NET with `Microsoft.ML.OnnxRuntime` 1.24.4.
Audio: `voxpopuli_test_en_demo.wav` (5.4 s, 16 kHz mono).
Both models produce identical output.
A `benchmark_cs.sh` script is included to reproduce these numbers: it builds a temporary .NET 8 project, runs both models, and prints the comparison.
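As a quick sanity check, the speedup and size figures in the table follow directly from the raw numbers:

```python
# Figures taken from the comparison table above.
int8_seconds, int4_seconds = 44.7, 31.9
int8_gb, int4_gb = 2.75, 1.94

speedup = int8_seconds / int4_seconds        # wall-clock ratio
size_reduction = 1 - int4_gb / int8_gb       # fraction of bytes saved

print(f"speedup: {speedup:.2f}x")            # ~1.40x
print(f"size reduction: {size_reduction:.0%}")  # ~29%
```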
## Quick Start (Python)
```shell
pip install onnxruntime numpy soundfile librosa
```

```python
import onnxruntime as ort
import numpy as np
import librosa

# Load audio (must be 16kHz mono)
audio, sr = librosa.load("your_audio.wav", sr=16000, mono=True)

# Load models
enc = ort.InferenceSession("cohere-encoder.int4.onnx")
dec = ort.InferenceSession("cohere-decoder.int4.onnx")

# Load tokens
tokens = {}
with open("tokens.txt", "r", encoding="utf-8") as f:
    for line in f:
        parts = line.strip().rsplit(" ", 1)
        if len(parts) == 2:
            tokens[int(parts[1])] = parts[0]
token_to_id = {v: k for k, v in tokens.items()}

# Build prompt
prompt_ids = [token_to_id[t] for t in [
    "<|startofcontext|>", "<|startoftranscript|>", "<|emo:undefined|>",
    "<|en|>", "<|en|>", "<|pnc|>", "<|noitn|>", "<|notimestamp|>", "<|nodiarize|>",
]]

# Run encoder
cross_k, cross_v = enc.run(None, {"audio": audio.reshape(1, -1).astype(np.float32)})

# Run decoder (autoregressive greedy decoding)
N_LAYERS, HEADS, HEAD_DIM, MAX_CTX = 8, 8, 128, 1024
self_k = np.zeros((N_LAYERS, 1, HEADS, MAX_CTX, HEAD_DIM), dtype=np.float32)
self_v = np.zeros((N_LAYERS, 1, HEADS, MAX_CTX, HEAD_DIM), dtype=np.float32)
eos_id = token_to_id["<|endoftext|>"]
generated = list(prompt_ids)
current = np.array([prompt_ids], dtype=np.int64)
offset = np.array(0, dtype=np.int64)
for _ in range(256):
    logits, self_k, self_v = dec.run(None, {
        "tokens": current,
        "in_n_layer_self_k_cache": self_k,
        "in_n_layer_self_v_cache": self_v,
        "n_layer_cross_k": cross_k,
        "n_layer_cross_v": cross_v,
        "offset": offset,
    })
    next_id = int(np.argmax(logits[0, -1, :]))
    if next_id == eos_id:
        break
    generated.append(next_id)
    offset = np.array(int(offset) + current.shape[1], dtype=np.int64)
    current = np.array([[next_id]], dtype=np.int64)

# Decode to text
text = "".join(
    tokens.get(t, "").replace("\u2581", " ")
    for t in generated[len(prompt_ids):]
    if not tokens.get(t, "").startswith("<|")
).strip()
print(text)
```
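The final detokenization step can be checked in isolation. A minimal sketch with made-up token pieces (the strings below are illustrative, not drawn from the real vocabulary):

```python
def detok(pieces):
    """Drop special tokens like <|en|> and map the SentencePiece
    U+2581 word-boundary marker to a space, as in the snippet above."""
    return "".join(
        p.replace("\u2581", " ")
        for p in pieces
        if not p.startswith("<|")
    ).strip()

print(detok(["<|en|>", "\u2581Hello", ",", "\u2581world"]))  # -> Hello, world
```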
## How This Was Made
We followed the architectural adaptations pioneered by Tristan Ripke:
- Feature extraction baked in -- The encoder takes raw 16kHz audio. STFT is implemented via Conv1d DFT filters.
- Cross-attention K/V pre-computed -- The encoder pre-computes Key/Value projections for all 8 decoder layers.
- INT4 quantization -- We used `MatMulNBitsQuantizer` from `onnxruntime.quantization` to perform weight-only 4-bit quantization on the `MatMul` operators, significantly reducing the `external_data` footprint while maintaining acceptable transcription accuracy.
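The "STFT via Conv1d DFT filters" trick can be illustrated outside ONNX: a DFT is a linear map, so each frequency bin's real and imaginary parts are dot products of the frame with fixed cosine/sine filters, which is exactly what a Conv1d with stride equal to the hop length computes. A minimal NumPy sketch (frame and hop sizes here are illustrative, not the model's actual parameters):

```python
import numpy as np

n_fft, hop = 8, 4
t = np.arange(n_fft)
k = np.arange(n_fft // 2 + 1)[:, None]  # one-sided frequency bins
# Fixed "convolution filters": rows of the DFT basis.
cos_f = np.cos(2 * np.pi * k * t / n_fft)
sin_f = -np.sin(2 * np.pi * k * t / n_fft)

x = np.random.default_rng(0).standard_normal(32).astype(np.float32)
# Strided framing: what Conv1d's sliding window does implicitly.
frames = np.stack([x[i:i + n_fft] for i in range(0, len(x) - n_fft + 1, hop)])

# "Conv1d" = matrix product of each frame with the fixed filters.
real, imag = frames @ cos_f.T, frames @ sin_f.T
mag_conv = np.sqrt(real**2 + imag**2)

# Reference STFT magnitudes via FFT agree with the filter-bank result.
mag_fft = np.abs(np.fft.rfft(frames, axis=-1))
assert np.allclose(mag_conv, mag_fft, atol=1e-4)
```

Baking this into the graph means the ONNX model accepts raw waveforms and needs no external feature-extraction code at inference time.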
## Attribution
Original model: CohereLabs/cohere-transcribe-03-2026. ONNX adaptation design: Tristan Ripke. INT4 Quantization: community contribution.
License: Apache 2.0