CodeRankEmbed-onnx-int8

INT8 quantized ONNX version of nomic-ai/CodeRankEmbed for code search and embedding.

Quantization

Dynamic INT8 quantization with reduce_range=True for cross-platform correctness:

from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input='CodeRankEmbed_fp32.onnx',
    model_output='CodeRankEmbed_int8.onnx',
    weight_type=QuantType.QInt8,
    per_channel=True,
    reduce_range=True,   # clamp weights to [-64, 63] for AVX2 kernel safety
)

Why reduce_range=True

ORT's CPU INT8 MatMul kernels take two different paths on x86 (arm64 uses its own kernels):

CPU                                            Path                           Full-range INT8 weights
Intel Cascade Lake+ / Ice Lake+ (VNNI)         VPDPBUSD                       ✓ correct
AMD Zen 4+ (VNNI / Genoa+)                     VPDPBUSD                       ✓ correct
Apple Silicon (arm64 NEON + AMX)               separate arm64 kernels         ✓ correct
Intel pre-2019 / AMD Zen 3 Milan (AVX2 only)   pmaddubsw + phaddsw + paddd    ✗ int16 accumulator overflows → degenerate output

reduce_range=True clamps weights to [-64, 63] (7-bit signed range), giving the AVX2 int16 intermediate enough headroom to avoid overflow. VNNI and arm64 paths are unaffected (they handle full-range INT8 natively).
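
A rough back-of-the-envelope illustration of the headroom argument (the 255/127 operands below are simply the worst-case values for one unsigned-activation × signed-weight pair; exact kernel behavior depends on the ORT build):

INT16_MAX = 32767  # saturation limit of the AVX2 int16 intermediate

# pmaddubsw multiplies pairs of (uint8 activation, int8 weight) values and adds
# each adjacent pair of products into a single int16 lane, saturating on overflow.
def worst_case_pair_sum(activation_max, weight_max):
    return 2 * activation_max * weight_max

print(worst_case_pair_sum(255, 127))  # 64770 > 32767  -> full-range weights can saturate
print(worst_case_pair_sum(255, 63))   # 32130 <= 32767 -> reduce_range keeps headroom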

Known issue with earlier quantization

A previous version of this model was quantized without reduce_range=True. It worked correctly on VNNI-capable CPUs and Apple Silicon, but produced degenerate embeddings (all texts mapping to near-identical vectors) on AMD Zen 3 EPYC and similar pre-VNNI x86 hosts β€” verified on RunPod RTX 5090 pods with EPYC 7543. This version fixes that. See commit history.

Performance

  • Size: 139 MB (FP32 source: 548 MB) β€” ~75% reduction
  • Output dim: 768
  • Expected cosine vs FP32: ≥ 0.96 on production inputs (see the fidelity sketch after this list)
  • Inference speedup (VNNI CPUs): ~2Γ— vs FP32
  • Inference speedup (pre-VNNI CPUs): ~1.5Γ— vs FP32 (smaller win, but correct)
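
A minimal sketch of how the cosine-vs-FP32 figure can be checked. It assumes the FP32 export from the quantization step (CodeRankEmbed_fp32.onnx) is still on disk, and that the sentence embedding is the second graph output (verify with session.get_outputs(), see Usage below):

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mrsladoje/CodeRankEmbed-onnx-int8")
fp32 = ort.InferenceSession("CodeRankEmbed_fp32.onnx")
int8 = ort.InferenceSession("onnx/model.onnx")

texts = ["def quicksort(arr): ...", "SELECT name FROM users WHERE id = ?"]
inputs = dict(tokenizer(texts, padding=True, truncation=True,
                        max_length=512, return_tensors="np"))

def l2norm(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

e_fp32 = l2norm(fp32.run(None, inputs)[1])
e_int8 = l2norm(int8.run(None, inputs)[1])
print((e_fp32 * e_int8).sum(axis=1))  # per-text cosine; expect >= 0.96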

Validation (Mac M3 Max, ORT 1.24.3, 4-text probe)

T0 "how to parse json in python"        T3 "parse json data python"       cos=0.7749  (similar)
T0 "how to parse json in python"        T2 "sql inner join three tables"  cos=0.1123  (dissimilar)
Semantic separation                                                        0.6626  (β‰₯ 0.15 healthy)
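
A minimal sketch of that probe, using the three probe texts quoted above (the fourth probe text is not reproduced here); output index and input handling follow the Usage section below:

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mrsladoje/CodeRankEmbed-onnx-int8")
session = ort.InferenceSession("onnx/model.onnx")

texts = [
    "how to parse json in python",  # T0
    "sql inner join three tables",  # T2
    "parse json data python",       # T3
]
inputs = dict(tokenizer(texts, padding=True, truncation=True,
                        max_length=512, return_tensors="np"))
emb = session.run(None, inputs)[1]
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

similar    = float(emb[0] @ emb[2])   # T0 vs T3
dissimilar = float(emb[0] @ emb[1])   # T0 vs T2
print(similar, dissimilar, similar - dissimilar)  # separation >= 0.15 is healthy

On a pre-VNNI host running the old v1 weights, both cosines collapse toward 1.0 and the separation toward 0, which is the degenerate-output symptom described above.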

Usage

import onnxruntime as ort
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mrsladoje/CodeRankEmbed-onnx-int8")
model_path = hf_hub_download("mrsladoje/CodeRankEmbed-onnx-int8", "onnx/model.onnx")
session = ort.InferenceSession(model_path)

inputs = tokenizer(
    "your code or query here",
    padding=True, truncation=True, max_length=512, return_tensors="np"
)
outputs = session.run(None, dict(inputs))
# The sentence_embedding is typically the second output; it is 768-dim and L2-normalized.
embedding = outputs[1]
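
If the output order differs in your export or ORT build, list the graph outputs and pick the sentence embedding by name:

# Confirm which graph output is the sentence embedding (names/order can vary by export).
for out in session.get_outputs():
    print(out.name, out.shape)

Because the embeddings are L2-normalized, cosine similarity between two of them is just a dot product. Also check the base nomic-ai/CodeRankEmbed card for any prescribed query prefix before embedding raw retrieval queries; this card does not restate it.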

Files

  • onnx/model.onnx β€” INT8 quantized model (139 MB)
  • tokenizer.json, vocab.txt, config.json, special_tokens_map.json, tokenizer_config.json β€” from the base nomic-ai/CodeRankEmbed distribution

SHA256 (v2 β€” with reduce_range=True)

4eae31d09b1843103a1ebd5e2b2e24b5a5cad441a33906b35b12b1e2ed91d1db

Pin this in your downloader to guarantee you got the corrected weights and not a stale cached copy of v1.
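
For example, a quick check after download (model_path as obtained in the Usage section below; the helper is only an illustration):

import hashlib

EXPECTED_SHA256 = "4eae31d09b1843103a1ebd5e2b2e24b5a5cad441a33906b35b12b1e2ed91d1db"

def file_sha256(path, chunk_size=1 << 20):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

assert file_sha256(model_path) == EXPECTED_SHA256, "stale or corrupted onnx/model.onnx; re-download"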

Base model

nomic-ai/CodeRankEmbed (137M params), based on Snowflake/snowflake-arctic-embed-m-long. ONNX conversion derived from jalipalo/CodeRankEmbed-onnx.

License

MIT (inherited from base model).
