CodeRankEmbed-onnx-int8

INT8 quantized ONNX version of nomic-ai/CodeRankEmbed for code search and embedding.

Quantization

Dynamic INT8 quantization with reduce_range=True for cross-platform correctness:

from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input='CodeRankEmbed_fp32.onnx',
    model_output='CodeRankEmbed_int8.onnx',
    weight_type=QuantType.QInt8,
    per_channel=True,
    reduce_range=True,   # clamp weights to [-64, 63] for AVX2 kernel safety
)

Why reduce_range=True

ORT's CPU INT8 MatMul kernels take two different paths on x86 (arm64 uses its own kernels):

CPU                                            Path                           Full-range INT8 weights
Intel Cascade Lake+ / Ice Lake+ (VNNI)         VPDPBUSD                       ✓ correct
AMD Zen 4+ (VNNI / Genoa+)                     VPDPBUSD                       ✓ correct
Apple Silicon (arm64 NEON + AMX)               separate arm64 kernels         ✓ correct
Intel pre-2019 / AMD Zen 3 Milan (AVX2 only)   pmaddubsw + phaddsw + paddd    ✗ int16 accumulator overflows → degenerate output

reduce_range=True clamps weights to [-64, 63] (7-bit signed range), giving the AVX2 int16 intermediate enough headroom to avoid overflow. VNNI and arm64 paths are unaffected (they handle full-range INT8 natively).
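
A rough back-of-the-envelope illustration of the headroom argument (the 255/127 operands below are simply the worst-case values for one unsigned-activation × signed-weight pair; exact kernel behavior depends on the ORT build):

INT16_MAX = 32767  # saturation limit of the AVX2 int16 intermediate

# pmaddubsw multiplies pairs of (uint8 activation, int8 weight) values and adds
# each adjacent pair of products into a single int16 lane, saturating on overflow.
def worst_case_pair_sum(activation_max, weight_max):
    return 2 * activation_max * weight_max

print(worst_case_pair_sum(255, 127))  # 64770 > 32767  -> full-range weights can saturate
print(worst_case_pair_sum(255, 63))   # 32130 <= 32767 -> reduce_range keeps headroom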

Known issue with earlier quantization

A previous version of this model was quantized without reduce_range=True. It worked correctly on VNNI-capable CPUs and Apple Silicon, but produced degenerate embeddings (all texts mapping to near-identical vectors) on AMD Zen 3 EPYC and similar pre-VNNI x86 hosts β€” verified on RunPod RTX 5090 pods with EPYC 7543. This version fixes that. See commit history.

Performance

  • Size: 139 MB (FP32 source: 548 MB) β€” ~75% reduction
  • Output dim: 768
  • Expected cosine vs FP32: ≥ 0.96 on production inputs (see the fidelity sketch after this list)
  • Inference speedup (VNNI CPUs): ~2Γ— vs FP32
  • Inference speedup (pre-VNNI CPUs): ~1.5Γ— vs FP32 (smaller win, but correct)
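
A minimal sketch of how the cosine-vs-FP32 figure can be checked. It assumes the FP32 export from the quantization step (CodeRankEmbed_fp32.onnx) is still on disk, and that the sentence embedding is the second graph output (verify with session.get_outputs(), see Usage below):

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mrsladoje/CodeRankEmbed-onnx-int8")
fp32 = ort.InferenceSession("CodeRankEmbed_fp32.onnx")
int8 = ort.InferenceSession("onnx/model.onnx")

texts = ["def quicksort(arr): ...", "SELECT name FROM users WHERE id = ?"]
inputs = dict(tokenizer(texts, padding=True, truncation=True,
                        max_length=512, return_tensors="np"))

def l2norm(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

e_fp32 = l2norm(fp32.run(None, inputs)[1])
e_int8 = l2norm(int8.run(None, inputs)[1])
print((e_fp32 * e_int8).sum(axis=1))  # per-text cosine; expect >= 0.96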

Validation (Mac M3 Max, ORT 1.24.3, 4-text probe)

T0 "how to parse json in python"        T3 "parse json data python"       cos=0.7749  (similar)
T0 "how to parse json in python"        T2 "sql inner join three tables"  cos=0.1123  (dissimilar)
Semantic separation                                                        0.6626  (β‰₯ 0.15 healthy)
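
A minimal sketch of that probe, using the three probe texts quoted above (the fourth probe text is not reproduced here); output index and input handling follow the Usage section below:

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mrsladoje/CodeRankEmbed-onnx-int8")
session = ort.InferenceSession("onnx/model.onnx")

texts = [
    "how to parse json in python",  # T0
    "sql inner join three tables",  # T2
    "parse json data python",       # T3
]
inputs = dict(tokenizer(texts, padding=True, truncation=True,
                        max_length=512, return_tensors="np"))
emb = session.run(None, inputs)[1]
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

similar    = float(emb[0] @ emb[2])   # T0 vs T3
dissimilar = float(emb[0] @ emb[1])   # T0 vs T2
print(similar, dissimilar, similar - dissimilar)  # separation >= 0.15 is healthy

On a pre-VNNI host running the old v1 weights, both cosines collapse toward 1.0 and the separation toward 0, which is the degenerate-output symptom described above.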

Usage

import onnxruntime as ort
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mrsladoje/CodeRankEmbed-onnx-int8")
model_path = hf_hub_download("mrsladoje/CodeRankEmbed-onnx-int8", "onnx/model.onnx")
session = ort.InferenceSession(model_path)

inputs = tokenizer(
    "your code or query here",
    padding=True, truncation=True, max_length=512, return_tensors="np"
)
outputs = session.run(None, dict(inputs))
# The sentence_embedding is typically the second output; it is 768-dim and L2-normalized.
embedding = outputs[1]
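
If the output order differs in your export or ORT build, list the graph outputs and pick the sentence embedding by name:

# Confirm which graph output is the sentence embedding (names/order can vary by export).
for out in session.get_outputs():
    print(out.name, out.shape)

Because the embeddings are L2-normalized, cosine similarity between two of them is just a dot product. Also check the base nomic-ai/CodeRankEmbed card for any prescribed query prefix before embedding raw retrieval queries; this card does not restate it.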

Files

  • onnx/model.onnx β€” INT8 quantized model (139 MB)
  • tokenizer.json, vocab.txt, config.json, special_tokens_map.json, tokenizer_config.json β€” from the base nomic-ai/CodeRankEmbed distribution

SHA256 (v2 β€” with reduce_range=True)

4eae31d09b1843103a1ebd5e2b2e24b5a5cad441a33906b35b12b1e2ed91d1db

Pin this in your downloader to guarantee you got the corrected weights and not a stale cached copy of v1.
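
For example, a quick check after download (model_path as obtained in the Usage section below; the helper is only an illustration):

import hashlib

EXPECTED_SHA256 = "4eae31d09b1843103a1ebd5e2b2e24b5a5cad441a33906b35b12b1e2ed91d1db"

def file_sha256(path, chunk_size=1 << 20):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

assert file_sha256(model_path) == EXPECTED_SHA256, "stale or corrupted onnx/model.onnx; re-download"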

Base model

nomic-ai/CodeRankEmbed (137M params), based on Snowflake/snowflake-arctic-embed-m-long. ONNX conversion derived from jalipalo/CodeRankEmbed-onnx.

License

MIT (inherited from base model).
