# CodeRankEmbed-onnx-int8

INT8-quantized ONNX version of nomic-ai/CodeRankEmbed for code search and embedding.
## Quantization

Dynamic INT8 quantization with `reduce_range=True` for cross-platform correctness:
```python
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input='CodeRankEmbed_fp32.onnx',
    model_output='CodeRankEmbed_int8.onnx',
    weight_type=QuantType.QInt8,
    per_channel=True,
    reduce_range=True,  # clamp weights to [-64, 63] for AVX2 kernel safety
)
```
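A quick sanity check of the resulting artifact, as a minimal sketch (file names follow the snippet above; `onnx` is the standard ONNX Python package):

```python
import os
import onnx

quantized_path = 'CodeRankEmbed_int8.onnx'
onnx.checker.check_model(onnx.load(quantized_path))             # structural validity
print(f"size: {os.path.getsize(quantized_path) / 1e6:.0f} MB")  # ~139 MB vs 548 MB FP32
```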
### Why `reduce_range=True`

ORT's CPU INT8 MatMul kernels have two paths on x86:
| CPU | Path | Full-range INT8 weights |
|---|---|---|
| Intel Cascade Lake+ / Ice Lake+ (VNNI) | VPDPBUSD | ✅ correct |
| AMD Zen 4+ (VNNI / Genoa+) | VPDPBUSD | ✅ correct |
| Apple Silicon (arm64 NEON + AMX) | separate arm64 kernels | ✅ correct |
| Intel pre-2019 / AMD Zen 3 Milan (AVX2 only) | pmaddubsw + phaddsw + paddd | ❌ int16 accumulator overflows → degenerate output |
`reduce_range=True` clamps weights to [-64, 63] (7-bit signed range), giving the AVX2 int16 intermediate enough headroom to avoid overflow. VNNI and arm64 paths are unaffected, since they handle full-range INT8 natively.
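A back-of-the-envelope check of the failure mode, with illustrative numbers: `pmaddubsw` multiplies pairs of uint8 activations by int8 weights and sums each adjacent pair of products into a single int16 lane, so the worst case is two saturated products landing in one lane.

```python
import numpy as np

INT16_MAX = 32767

acts = np.array([255, 255])      # saturated uint8 activations
w_full = np.array([127, 127])    # full-range int8 weights
w_clamped = np.array([63, 63])   # reduce_range=True: weights clamped to [-64, 63]

# pmaddubsw sums each adjacent pair of products into one int16 lane:
print(int((acts * w_full).sum()))     # 64770 -> exceeds INT16_MAX, the lane cannot hold it
print(int((acts * w_clamped).sum()))  # 32130 -> fits in int16, no overflow
```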
### Known issue with earlier quantization

A previous version of this model was quantized without `reduce_range=True`. It worked correctly on VNNI-capable CPUs and Apple Silicon, but produced degenerate embeddings (all texts mapping to near-identical vectors) on AMD Zen 3 EPYC and similar pre-VNNI x86 hosts, verified on RunPod RTX 5090 pods with EPYC 7543. This version fixes that; see the commit history.
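If you need to check whether a given host falls on the risky AVX2-only path, one option is to look for a VNNI CPU flag. A minimal sketch assuming the third-party `py-cpuinfo` package; flag names vary by OS and CPU, so treat the set below as illustrative:

```python
import cpuinfo  # pip install py-cpuinfo

flags = set(cpuinfo.get_cpu_info().get("flags", []))
has_vnni = bool(flags & {"avx512_vnni", "avx512vnni", "avx_vnni", "avxvnni"})
print("VNNI available:", has_vnni)
# False on e.g. AMD Zen 3 (EPYC Milan): ORT falls back to the AVX2 pmaddubsw path,
# which is where the un-clamped v1 weights produced degenerate embeddings.
```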
## Performance

- Size: 139 MB (FP32 source: 548 MB), a ~75% reduction
- Output dim: 768
- Expected cosine similarity vs FP32: ≥ 0.96 on production inputs
- Inference speedup (VNNI CPUs): ~2× vs FP32
- Inference speedup (pre-VNNI CPUs): ~1.5× vs FP32 (smaller win, but correct output)
## Validation (Mac M3 Max, ORT 1.24.3, 4-text probe)

| Pair | Cosine | Expected |
|---|---|---|
| "how to parse json in python" vs "parse json data python" | 0.7749 | similar |
| "how to parse json in python" vs "sql inner join three tables" | 0.1123 | dissimilar |

Semantic separation: 0.6626 (≥ 0.15 is healthy). A reproduction sketch follows the Usage snippet below.
## Usage
```python
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mrsladoje/CodeRankEmbed-onnx-int8")
session = ort.InferenceSession("onnx/model.onnx")  # local path to the downloaded model file

inputs = tokenizer(
    "your code or query here",
    padding=True, truncation=True, max_length=512, return_tensors="np",
)
outputs = session.run(None, dict(inputs))
# sentence_embedding is typically the second output; it's 768-dim and L2-normalized
```
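To reproduce the separation probe from the Validation section, a minimal sketch building on the snippet above (the output index follows the comment above; adjust it if your session orders outputs differently):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    enc = tokenizer(text, padding=True, truncation=True, max_length=512, return_tensors="np")
    return session.run(None, dict(enc))[1][0]  # sentence embedding, assumed to be the second output

a = embed("how to parse json in python")
b = embed("parse json data python")
c = embed("sql inner join three tables")

cos = lambda x, y: float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
print(cos(a, b), cos(a, c))           # ~0.77 (similar) vs ~0.11 (dissimilar) on the reference run
print(cos(a, b) - cos(a, c) >= 0.15)  # semantic separation check
```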
## Files

- `onnx/model.onnx`: INT8 quantized model (139 MB)
- `tokenizer.json`, `vocab.txt`, `config.json`, `special_tokens_map.json`, `tokenizer_config.json`: from the base nomic-ai/CodeRankEmbed distribution
## SHA256 (v2, with reduce_range=True)

`4eae31d09b1843103a1ebd5e2b2e24b5a5cad441a33906b35b12b1e2ed91d1db`

Pin this hash in your downloader to guarantee you get the corrected weights and not a stale cached copy of v1.
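One way to pin it, as a sketch assuming the model file was downloaded to `onnx/model.onnx` locally:

```python
import hashlib

EXPECTED = "4eae31d09b1843103a1ebd5e2b2e24b5a5cad441a33906b35b12b1e2ed91d1db"

with open("onnx/model.onnx", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
assert digest == EXPECTED, f"unexpected weights (got {digest}); possibly a stale v1 cache"
```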
## Base model

nomic-ai/CodeRankEmbed (137M params), based on Snowflake/snowflake-arctic-embed-m-long. ONNX conversion derived from jalipalo/CodeRankEmbed-onnx.
## License

MIT (inherited from the base model).