# CodeRankEmbed — Dynamic INT8 Quantized (ONNX)

A dynamically quantized INT8 version of nomic-ai/CodeRankEmbed, converted to ONNX by jalipalo and quantized for fast CPU inference.

## What is this?

CodeRankEmbed is a 137M-parameter embedding model trained specifically for code search and retrieval. This repository provides a dynamic INT8 weight-quantized version that is significantly smaller and faster, with negligible quality loss:

|                         | FP32 (original) | INT8 (this model) |
|-------------------------|-----------------|-------------------|
| File size               | 522 MB          | 132 MB (−75%)     |
| CPU inference           | 1.00×           | ~2.09× faster     |
| Min cosine vs FP32      | 1.000           | 0.961             |
| Calibration data needed | —               | None              |

Quantization was done with ONNX Runtime's `quantize_dynamic` (weights only, QInt8, `per_channel=True`). Activations stay in FP32 between operators and are quantized on the fly for the integer matmuls — the approach recommended for transformer/embedding models in the ONNX Runtime documentation.

## Usage

### With @huggingface/transformers (JavaScript / Node.js)

```js
import { pipeline } from "@huggingface/transformers";

const extractor = await pipeline(
  "feature-extraction",
  "mrsladoje/CodeRankEmbed-onnx-int8",
  { dtype: "q8" } // selects onnx/model_quantized.onnx
);

const output = await extractor("def hello(): return 42", {
  pooling: "mean",
  normalize: true,
});
console.log(output.data); // Float32Array of 768 dimensions
```

### With optimum (Python)

```python
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer

model = ORTModelForFeatureExtraction.from_pretrained(
    "mrsladoje/CodeRankEmbed-onnx-int8",
    file_name="onnx/model_quantized.onnx",
)
tokenizer = AutoTokenizer.from_pretrained("mrsladoje/CodeRankEmbed-onnx-int8")

inputs = tokenizer("def hello(): return 42", return_tensors="pt")
outputs = model(**inputs)
embeddings = outputs.last_hidden_state.mean(dim=1)
```
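Note that `last_hidden_state.mean(dim=1)` averages over every position, including padding. That is harmless for a single unpadded input like the one above, but it skews embeddings for padded batches. A mask-aware mean, sketched here using only `torch`, avoids that:

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over real tokens only, ignoring padding."""
    mask = attention_mask.unsqueeze(-1).type_as(last_hidden_state)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                  # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)                        # (batch, 1)
    return summed / counts

# With the optimum outputs above:
# embeddings = mean_pool(outputs.last_hidden_state, inputs["attention_mask"])
```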

### With onnxruntime directly (Python)

```python
import onnxruntime as ort
import numpy as np
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mrsladoje/CodeRankEmbed-onnx-int8")
tokenizer.enable_padding(length=128, pad_id=0)
tokenizer.enable_truncation(max_length=128)

session = ort.InferenceSession("onnx/model_quantized.onnx")

encoded = tokenizer.encode("def hello(): return 42")
input_ids = np.array([encoded.ids], dtype=np.int64)
attention_mask = np.array([encoded.attention_mask], dtype=np.int64)

outputs = session.run(None, {"input_ids": input_ids, "attention_mask": attention_mask})
embedding = outputs[1]  # sentence_embedding output, shape (1, 768)
```

## Quantization Details

| Parameter   | Value                                                       |
|-------------|-------------------------------------------------------------|
| Method      | `quantize_dynamic` (ONNX Runtime)                           |
| Weight type | QInt8 (signed 8-bit integer)                                |
| Scope       | Weights only — activations quantized dynamically at runtime |
| Per-channel | Yes                                                         |
| Calibration | None required                                               |
| ORT version | 1.21.x                                                      |

Why dynamic over static? Static INT8 quantization requires calibration data to pre-compute activation ranges. For transformer embedding models, activation distributions vary widely with input content and sequence length, making static calibration brittle (we validated this — static QDQ produced cosine similarities as low as 0.09–0.26 with MinMax calibration). Dynamic quantization sidesteps this entirely: weights are quantized offline and activations are quantized at runtime, giving robust quality across all inputs.

## Quality Validation

Validated on 10 code snippets across Python, JavaScript, Go, Java, Rust, TypeScript, and SQL:

| Model           | Size     | Speedup | Min cosine vs FP32 | Quality   |
|-----------------|----------|---------|--------------------|-----------|
| FP32 (baseline) | 522.3 MB | 1.00×   | —                  | baseline  |
| Dynamic INT8    | 132.2 MB | 2.09×   | 0.9610             | excellent |

A cosine similarity ≥ 0.96 means the INT8 embeddings point in essentially the same direction as FP32. For retrieval tasks — especially with a reranker in the pipeline — this difference is undetectable in practice.
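The validation metric itself is a few lines of NumPy. The sketch below uses synthetic vectors as stand-ins; in a real check, the two vectors would come from running the FP32 and INT8 models on the same code snippet:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for an (fp32, int8) embedding pair; the added noise mimics
# small quantization error without claiming to model it exactly.
rng = np.random.default_rng(0)
fp32_vec = rng.standard_normal(768)
int8_vec = fp32_vec + 0.01 * rng.standard_normal(768)

print(round(cosine(fp32_vec, fp32_vec), 4))  # identical vectors score 1.0
```

The reported "min cosine" is simply the smallest such score over all 10 validation snippets.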

The ~2× CPU speedup is real compute acceleration (not just faster file loading), coming from ONNX Runtime's MatMulIntegerToFloat fused kernels operating on INT8 weights. VNNI-capable CPUs (Intel 10th gen+, AMD Zen4+) may see even larger gains.

## Attribution

All work in this repository respects and complies with the MIT license of the original model.
