# CodeRankEmbed – Dynamic INT8 Quantized (ONNX)
A dynamically quantized INT8 version of nomic-ai/CodeRankEmbed, converted to ONNX by jalipalo and quantized for fast CPU inference.
## What is this?
CodeRankEmbed is a 137M-parameter embedding model trained specifically for code search and retrieval. This repository provides a dynamic INT8 weight-quantized version that is significantly smaller and faster with negligible quality loss:
| | FP32 (original) | INT8 (this model) |
|---|---|---|
| File size | 522 MB | 132 MB (−75%) |
| CPU inference | 1.00× | ~2.09× faster |
| Min cosine vs FP32 | 1.000 | 0.961 |
| Calibration data needed | – | None |
Quantization was done with ONNX Runtime's `quantize_dynamic` (weights only, QInt8, `per_channel=True`). Weights are quantized offline; activations are quantized dynamically at runtime. This is the approach the ONNX Runtime documentation recommends for transformer/embedding models.
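For intuition, per-channel weight quantization amounts to something like the following numpy sketch (illustrative only; the real kernels and storage format live inside ONNX Runtime):

```python
import numpy as np

# Illustrative sketch (not ONNX Runtime internals): per-channel symmetric
# INT8 weight quantization, the scheme quantize_dynamic applies to weights offline.
rng = np.random.default_rng(0)
w = rng.normal(size=(768, 768)).astype(np.float32)  # stand-in for one MatMul weight

# One scale per output channel: map each row's max |w| onto the int8 range.
scales = np.abs(w).max(axis=1, keepdims=True) / 127.0
w_int8 = np.clip(np.round(w / scales), -127, 127).astype(np.int8)  # what gets stored
w_dequant = w_int8.astype(np.float32) * scales  # what the kernel effectively computes with

print(w_int8.nbytes / w.nbytes)                     # 0.25 (4x smaller)
print(np.abs(w - w_dequant).max() <= scales.max())  # True: error bounded by one step
```

Because every row gets its own scale, channels with small weights keep more precision than a single per-tensor scale would allow.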
## Usage
### With @huggingface/transformers (JavaScript / Node.js)
```js
import { pipeline } from "@huggingface/transformers";

const extractor = await pipeline(
  "feature-extraction",
  "mrsladoje/CodeRankEmbed-onnx-int8",
  { dtype: "q8" } // loads onnx/model_quantized.onnx automatically
);

const output = await extractor("def hello(): return 42", {
  pooling: "mean",
  normalize: true,
});
console.log(output.data); // Float32Array of 768 dimensions
```
### With optimum (Python)
```python
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer

model = ORTModelForFeatureExtraction.from_pretrained(
    "mrsladoje/CodeRankEmbed-onnx-int8",
    file_name="onnx/model_quantized.onnx",
)
tokenizer = AutoTokenizer.from_pretrained("mrsladoje/CodeRankEmbed-onnx-int8")

inputs = tokenizer("def hello(): return 42", return_tensors="pt")
outputs = model(**inputs)
embeddings = outputs.last_hidden_state.mean(dim=1)
```
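Note that `.mean(dim=1)` averages over every position, including padding, which is fine for a single unpadded input but skews batched, padded inputs. A mask-aware mean pooling can be sketched in numpy like this (shapes and names are illustrative):

```python
import numpy as np

# Sketch: attention-mask-aware mean pooling, so padded positions do not
# dilute the sentence embedding. Shapes here are illustrative, not model output.
def masked_mean_pool(last_hidden_state: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    # last_hidden_state: (batch, seq_len, hidden), attention_mask: (batch, seq_len)
    mask = attention_mask[..., None].astype(last_hidden_state.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(axis=1)                   # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                    # avoid divide-by-zero
    return summed / counts

hidden = np.arange(2 * 4 * 3, dtype=np.float32).reshape(2, 4, 3)
mask = np.array([[1, 1, 0, 0], [1, 1, 1, 1]], dtype=np.int64)
print(masked_mean_pool(hidden, mask).shape)  # (2, 3)
```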
### With onnxruntime directly (Python)
```python
import onnxruntime as ort
import numpy as np
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mrsladoje/CodeRankEmbed-onnx-int8")
tokenizer.enable_padding(length=128, pad_id=0)
tokenizer.enable_truncation(max_length=128)

session = ort.InferenceSession("onnx/model_quantized.onnx")

encoded = tokenizer.encode("def hello(): return 42")
input_ids = np.array([encoded.ids], dtype=np.int64)
attention_mask = np.array([encoded.attention_mask], dtype=np.int64)

outputs = session.run(None, {"input_ids": input_ids, "attention_mask": attention_mask})
embedding = outputs[1]  # sentence_embedding output, shape (1, 768)
```
## Quantization Details
| Parameter | Value |
|---|---|
| Method | quantize_dynamic (ONNX Runtime) |
| Weight type | QInt8 (signed 8-bit integer) |
| Scope | Weights only; activations quantized dynamically at runtime |
| Per-channel | Yes |
| Calibration | None required |
| ORT version | 1.21.x |
**Why dynamic over static?** Static INT8 quantization requires calibration data to pre-compute activation ranges. For transformer embedding models, activation distributions vary widely with input content and sequence length, making static calibration brittle (we validated this: static QDQ produced cosine similarities as low as 0.09–0.26 with MinMax calibration). Dynamic quantization sidesteps this entirely: weights are quantized offline and activations are quantized at runtime, giving robust quality across all inputs.
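To make the distinction concrete, here is a numpy sketch (illustrative, not ONNX Runtime's actual kernel) of dynamic activation quantization: the scale is derived from the tensor observed at inference time, so inputs with very different ranges each get an appropriate scale, which is exactly what a fixed calibration range cannot provide.

```python
import numpy as np

# Sketch of what "dynamic" means: the activation's quantization range is
# computed from the tensor actually seen at runtime, not from calibration data.
def dynamic_quantize_activation(x: np.ndarray):
    # Asymmetric UInt8 quantization over the observed range (illustrative).
    lo, hi = float(x.min()), float(x.max())
    scale = max(hi - lo, 1e-12) / 255.0
    zero_point = int(round(-lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

# Two activations with very different ranges each get their own scale.
narrow_act = np.random.default_rng(1).normal(0, 1, 100).astype(np.float32)
wide_act = np.random.default_rng(2).normal(0, 20, 100).astype(np.float32)
_, s1, _ = dynamic_quantize_activation(narrow_act)
_, s2, _ = dynamic_quantize_activation(wide_act)
print(s2 > s1)  # True: the wider distribution gets a proportionally larger scale
```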
## Quality Validation
Validated on 10 code snippets across Python, JavaScript, Go, Java, Rust, TypeScript, and SQL:
| Model | Size | Speedup | Min cosine vs FP32 | Quality |
|---|---|---|---|---|
| FP32 (baseline) | 522.3 MB | 1.00× | – | baseline |
| Dynamic INT8 | 132.2 MB | 2.09× | 0.9610 | excellent |
A cosine similarity ≥ 0.96 means the INT8 embeddings point in essentially the same direction as FP32. For retrieval tasks, especially with a reranker in the pipeline, this difference is undetectable in practice.
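For reference, the metric in question is plain cosine similarity. A small sketch with made-up vectors (the 0.05 noise scale is arbitrary, standing in for quantization error):

```python
import numpy as np

# Sketch of the validation metric: cosine similarity between an FP32
# embedding and its INT8 counterpart (both vectors here are synthetic).
def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
fp32_emb = rng.normal(size=768).astype(np.float32)
int8_emb = fp32_emb + rng.normal(scale=0.05, size=768).astype(np.float32)  # small noise

print(cosine(fp32_emb, int8_emb))  # close to 1.0 for small perturbations
```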
The ~2× CPU speedup is real compute acceleration (not just faster file loading), coming from ONNX Runtime's fused `MatMulIntegerToFloat` kernels operating on INT8 weights. VNNI-capable CPUs (Intel 10th gen+, AMD Zen 4+) may see even larger gains.
## Attribution
- Original model: nomic-ai/CodeRankEmbed (MIT License)
- ONNX conversion: jalipalo/CodeRankEmbed-onnx (MIT License, inherited)
- INT8 quantization: this repository (MIT License)
All work in this repository respects and complies with the MIT license of the original model.