mrsladoje committed
Commit e74f446 · verified · 1 Parent(s): 95c2a74

docs: update README with reduce_range=True rationale and Zen 3 fix

Files changed (1)
  1. README.md +115 -124
README.md CHANGED
@@ -1,124 +1,115 @@
1
- ---
2
- license: mit
3
- base_model: nomic-ai/CodeRankEmbed
4
- base_model_relation: quantized
5
- tags:
6
- - code
7
- - embeddings
8
- - onnx
9
- - int8
10
- - quantized
11
- language:
12
- - code
13
- pipeline_tag: feature-extraction
14
- ---
15
-
16
- # CodeRankEmbed: Dynamic INT8 Quantized (ONNX)
17
-
18
- A dynamically quantized INT8 version of [nomic-ai/CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed), converted to ONNX by [jalipalo](https://huggingface.co/jalipalo/CodeRankEmbed-onnx) and quantized for fast CPU inference.
19
-
20
- ## What is this?
21
-
22
- CodeRankEmbed is a 137M-parameter embedding model trained specifically for code search and retrieval. This repository provides a **dynamic INT8 weight-quantized** version that is significantly smaller and faster with negligible quality loss:
23
-
24
- | | FP32 (original) | INT8 (this model) |
25
- |---|---|---|
26
- | **File size** | 522 MB | 132 MB (−75%) |
27
- | **CPU inference** | 1.00× | ~2.09× faster |
28
- | **Min cosine vs FP32** | 1.000 | 0.961 |
29
- | **Calibration data needed** | n/a | None |
30
-
31
- Quantization was done with ONNX Runtime's `quantize_dynamic` (weights only, `QInt8`, `per_channel=True`). Activations remain in FP32 at runtime, the recommended approach for transformer/embedding models per the [ONNX Runtime documentation](https://onnxruntime.ai/docs/performance/model-optimizations/quantization.html).
32
-
33
- ## Usage
34
-
35
- ### With `@huggingface/transformers` (JavaScript / Node.js)
36
-
37
- ```js
38
- import { pipeline } from "@huggingface/transformers";
39
-
40
- const extractor = await pipeline(
41
- "feature-extraction",
42
- "mrsladoje/CodeRankEmbed-onnx-int8",
43
- { dtype: "q8" } // in v3 of @huggingface/transformers, dtype "q8" selects onnx/model_quantized.onnx
44
- );
45
-
46
- const output = await extractor("def hello(): return 42", {
47
- pooling: "mean",
48
- normalize: true,
49
- });
50
- console.log(output.data); // Float32Array of 768 dimensions
51
- ```
52
-
53
- ### With `optimum` (Python)
54
-
55
- ```python
56
- from optimum.onnxruntime import ORTModelForFeatureExtraction
57
- from transformers import AutoTokenizer
58
-
59
- model = ORTModelForFeatureExtraction.from_pretrained(
60
- "mrsladoje/CodeRankEmbed-onnx-int8",
61
- file_name="onnx/model_quantized.onnx",
62
- )
63
- tokenizer = AutoTokenizer.from_pretrained("mrsladoje/CodeRankEmbed-onnx-int8")
64
-
65
- inputs = tokenizer("def hello(): return 42", return_tensors="pt")
66
- outputs = model(**inputs)
67
- embeddings = outputs.last_hidden_state.mean(dim=1)
68
- ```
69
-
70
- ### With `onnxruntime` directly (Python)
71
-
72
- ```python
73
- import onnxruntime as ort
74
- import numpy as np
75
- from tokenizers import Tokenizer
76
-
77
- tokenizer = Tokenizer.from_pretrained("mrsladoje/CodeRankEmbed-onnx-int8")
78
- tokenizer.enable_padding(length=128, pad_id=0)
79
- tokenizer.enable_truncation(max_length=128)
80
-
81
- session = ort.InferenceSession("onnx/model_quantized.onnx")
82
-
83
- encoded = tokenizer.encode("def hello(): return 42")
84
- input_ids = np.array([encoded.ids], dtype=np.int64)
85
- attention_mask = np.array([encoded.attention_mask], dtype=np.int64)
86
-
87
- outputs = session.run(None, {"input_ids": input_ids, "attention_mask": attention_mask})
88
- embedding = outputs[1] # sentence_embedding output, shape (1, 768)
89
- ```
90
-
91
- ## Quantization Details
92
-
93
- | Parameter | Value |
94
- |---|---|
95
- | Method | `quantize_dynamic` (ONNX Runtime) |
96
- | Weight type | `QInt8` (signed 8-bit integer) |
97
- | Scope | Weights only; activations quantized dynamically at runtime |
98
- | Per-channel | Yes |
99
- | Calibration | None required |
100
- | ORT version | 1.21.x |
101
-
102
- **Why dynamic over static?** Static INT8 quantization requires calibration data to pre-compute activation ranges. For transformer embedding models, activation distributions vary widely with input content and sequence length, making static calibration brittle (we validated this: static QDQ produced cosine similarities as low as 0.09–0.26 with MinMax calibration). Dynamic quantization sidesteps this entirely: weights are quantized offline and activations are quantized at runtime, giving robust quality across all inputs.
103
-
104
- ## Quality Validation
105
-
106
- Validated on 10 code snippets across Python, JavaScript, Go, Java, Rust, TypeScript, and SQL:
107
-
108
- ```
109
- Model Size Speedup Min cosine vs FP32 Quality
110
- FP32 (baseline) 522.3 MB 1.00× - baseline
111
- Dynamic INT8 132.2 MB 2.09× 0.9610 excellent
112
- ```
113
-
114
- A cosine similarity ≥ 0.96 means the INT8 embeddings point in essentially the same direction as FP32. For retrieval tasks, especially with a reranker in the pipeline, this difference is undetectable in practice.
115
-
116
- The ~2× CPU speedup is real compute acceleration (not just faster file loading), coming from ONNX Runtime's `MatMulIntegerToFloat` fused kernels operating on INT8 weights. VNNI-capable CPUs (Intel 10th gen+, AMD Zen4+) may see even larger gains.
117
-
118
- ## Attribution
119
-
120
- - **Original model:** [nomic-ai/CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed), MIT License
121
- - **ONNX conversion:** [jalipalo/CodeRankEmbed-onnx](https://huggingface.co/jalipalo/CodeRankEmbed-onnx), MIT License (inherited)
122
- - **INT8 quantization:** this repository, MIT License
123
-
124
- All work in this repository respects and complies with the MIT license of the original model.
 
1
+ ---
2
+ library_name: transformers
3
+ tags:
4
+ - onnx
5
+ - quantized
6
+ - int8
7
+ - code-search
8
+ - embedding
9
+ - nomic-bert
10
+ base_model: nomic-ai/CodeRankEmbed
11
+ license: mit
12
+ ---
13
+
14
+ # CodeRankEmbed-onnx-int8
15
+
16
+ INT8 quantized ONNX version of [nomic-ai/CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed)
17
+ for code search and embedding.
18
+
19
+ ## Quantization
20
+
21
+ Dynamic INT8 quantization with **`reduce_range=True`** for cross-platform
22
+ correctness:
23
+
24
+ ```python
25
+ from onnxruntime.quantization import quantize_dynamic, QuantType
26
+
27
+ quantize_dynamic(
28
+     model_input='CodeRankEmbed_fp32.onnx',
29
+     model_output='CodeRankEmbed_int8.onnx',
30
+     weight_type=QuantType.QInt8,
31
+     per_channel=True,
32
+     reduce_range=True,  # clamp weights to [-64, 63] for AVX2 kernel safety
33
+ )
34
+ ```
35
+
36
+ ### Why `reduce_range=True`
37
+
38
+ ORT's CPU INT8 MatMul kernels have two paths on x86:
39
+
40
+ | CPU | Path | Full-range INT8 weights |
41
+ |---|---|---|
42
+ | Intel Cascade Lake+ / Ice Lake+ (VNNI) | `VPDPBUSD` | ✓ correct |
43
+ | AMD Zen 4+ (VNNI / Genoa+) | `VPDPBUSD` | ✓ correct |
44
+ | Apple Silicon (arm64 NEON + AMX) | separate arm64 kernels | ✓ correct |
45
+ | **Intel pre-2019 / AMD Zen 3 Milan (AVX2 only)** | `pmaddubsw + phaddsw + paddd` | **✗ int16 accumulator overflows → degenerate output** |
46
+
47
+ `reduce_range=True` clamps weights to `[-64, 63]` (7-bit signed range), giving
48
+ the AVX2 `int16` intermediate enough headroom to avoid overflow. VNNI and arm64
49
+ paths are unaffected (they handle full-range INT8 natively).
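
The headroom argument above can be checked with a little arithmetic. The sketch below models only one int16 lane of `pmaddubsw` (two unsigned 8-bit activations times two signed 8-bit weights, summed with signed saturation). The function name `pmaddubsw_pair` is illustrative, and this is not ONNX Runtime's kernel code, just the worst-case bound.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def pmaddubsw_pair(a0, w0, a1, w1):
    """Model one int16 lane of pmaddubsw: a0*w0 + a1*w1 with signed saturation.

    a0, a1 are unsigned 8-bit activations (0..255); w0, w1 are signed 8-bit
    weights (-128..127). Returns (exact_sum, saturated_sum).
    """
    exact = a0 * w0 + a1 * w1
    saturated = max(INT16_MIN, min(INT16_MAX, exact))
    return exact, saturated


# Worst case with full-range weights: the exact sum does not fit in int16,
# so the instruction saturates and the dot product silently loses information.
full_exact, full_sat = pmaddubsw_pair(255, 127, 255, 127)

# Worst cases with weights limited to [-64, 63]: both extremes fit in int16.
pos_exact, pos_sat = pmaddubsw_pair(255, 63, 255, 63)
neg_exact, neg_sat = pmaddubsw_pair(255, -64, 255, -64)

print(full_exact, full_sat)  # 64770 32767
print(pos_exact, pos_sat)    # 32130 32130
print(neg_exact, neg_sat)    # -32640 -32640
```

With 7-bit weights the largest pairwise sum is 2 × 255 × 64 = 32640, which stays inside the int16 range, so the intermediate never saturates.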
50
+
51
+ ### Known issue with earlier quantization
52
+
53
+ A previous version of this model was quantized **without** `reduce_range=True`.
54
+ It worked correctly on VNNI-capable CPUs and Apple Silicon, but produced
55
+ degenerate embeddings (all texts mapping to near-identical vectors) on
56
+ **AMD Zen 3 EPYC** and similar pre-VNNI x86 hosts (verified on RunPod
57
+ RTX 5090 pods with EPYC 7543). This version fixes that; see the commit history.
58
+
59
+ ## Performance
60
+
61
+ - **Size**: 139 MB (FP32 source: 548 MB), a **~75% reduction**
62
+ - **Output dim**: 768
63
+ - **Expected cosine vs FP32**: ≥ 0.96 on production inputs
64
+ - **Inference speedup (VNNI CPUs)**: ~2× vs FP32
65
+ - **Inference speedup (pre-VNNI CPUs)**: ~1.5× vs FP32 (smaller win, but correct)
66
+
67
+ ### Validation (Mac M3 Max, ORT 1.24.3, 4-text probe)
68
+
69
+ ```
70
+ T0 "how to parse json in python" T3 "parse json data python" cos=0.7749 (similar)
71
+ T0 "how to parse json in python" T2 "sql inner join three tables" cos=0.1123 (dissimilar)
72
+ Semantic separation 0.6626 (≥ 0.15 healthy)
73
+ ```
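
The separation figure in the probe is the cosine of the similar pair minus the cosine of the dissimilar pair (0.7749 − 0.1123 = 0.6626). A minimal sketch of that check follows, using toy 3-dimensional vectors rather than real model outputs. The helper names are illustrative, and the 0.15 threshold is the one quoted in the probe.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    u = np.asarray(u, dtype=np.float64)
    v = np.asarray(v, dtype=np.float64)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def semantic_separation(anchor, similar, dissimilar):
    """cos(anchor, similar) - cos(anchor, dissimilar)."""
    return cosine(anchor, similar) - cosine(anchor, dissimilar)


def looks_degenerate(anchor, similar, dissimilar, min_separation=0.15):
    """True when the embeddings fail to separate related from unrelated text."""
    return semantic_separation(anchor, similar, dissimilar) < min_separation


# Toy stand-ins for embeddings (NOT real model outputs).
healthy = looks_degenerate([1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0])

# Near-identical vectors for every text, as seen with the overflowing kernel.
collapsed = looks_degenerate([1.0, 1.0, 1.0], [1.0, 1.0, 1.01], [1.0, 1.01, 1.0])

print(healthy, collapsed)
```

Running the same check on embeddings from the target host is a cheap way to detect the collapse described under the known issue.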
74
+
75
+ ## Usage
76
+
77
+ ```python
78
+ import onnxruntime as ort
79
+ from transformers import AutoTokenizer
80
+
81
+ tokenizer = AutoTokenizer.from_pretrained("mrsladoje/CodeRankEmbed-onnx-int8")
82
+ session = ort.InferenceSession("onnx/model.onnx")  # local path to the file listed under Files
83
+
84
+ inputs = tokenizer(
85
+     "your code or query here",
86
+     padding=True, truncation=True, max_length=512, return_tensors="np"
87
+ )
88
+ outputs = session.run(None, dict(inputs))
89
+ # sentence_embedding is typically the second output; it's 768-dim L2-normalized
90
+ ```
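
A slightly more defensive variant of the snippet above is sketched here. ONNX Runtime rejects feed names that the graph does not declare, so the sketch keeps only the inputs the session reports, and it selects the embedding by output name instead of by position. The output names `last_hidden_state` and `sentence_embedding`, the helper `run_embedding`, and the `StandInSession` class are assumptions for illustration. The stand-in exists only so the sketch runs without the model file; a real `onnxruntime.InferenceSession` exposes the same three methods.

```python
from types import SimpleNamespace

import numpy as np


def run_embedding(session, encoded, output_name="sentence_embedding"):
    """Run a session with only the feeds it declares; return one named output."""
    wanted = {i.name for i in session.get_inputs()}
    feeds = {k: v for k, v in encoded.items() if k in wanted}
    names = [o.name for o in session.get_outputs()]
    if output_name not in names:
        raise KeyError(f"{output_name!r} not among outputs {names}")
    outputs = session.run(None, feeds)
    return outputs[names.index(output_name)]


class StandInSession:
    """Hypothetical stand-in mimicking the InferenceSession methods used above."""

    def get_inputs(self):
        return [SimpleNamespace(name="input_ids"),
                SimpleNamespace(name="attention_mask")]

    def get_outputs(self):
        return [SimpleNamespace(name="last_hidden_state"),
                SimpleNamespace(name="sentence_embedding")]

    def run(self, output_names, feeds):
        batch, seq = feeds["input_ids"].shape
        return [np.zeros((batch, seq, 768), dtype=np.float32),
                np.ones((batch, 768), dtype=np.float32)]


# A tokenizer may emit keys the graph does not accept (here: token_type_ids).
encoded = {
    "input_ids": np.array([[101, 2023, 102]], dtype=np.int64),
    "attention_mask": np.ones((1, 3), dtype=np.int64),
    "token_type_ids": np.zeros((1, 3), dtype=np.int64),
}
embedding = run_embedding(StandInSession(), encoded)
print(embedding.shape)
```

With the real model, `session` would be the `ort.InferenceSession` and `encoded` would be `dict(inputs)` from the tokenizer call.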
91
+
92
+ ## Files
93
+
94
+ - `onnx/model.onnx`: INT8 quantized model (139 MB)
95
+ - `tokenizer.json`, `vocab.txt`, `config.json`, `special_tokens_map.json`, `tokenizer_config.json`
96
+   (from the base nomic-ai/CodeRankEmbed distribution)
97
+
98
+ ## SHA256 (v2, with `reduce_range=True`)
99
+
100
+ ```
101
+ 4eae31d09b1843103a1ebd5e2b2e24b5a5cad441a33906b35b12b1e2ed91d1db
102
+ ```
103
+
104
+ Pin this in your downloader to guarantee you got the corrected weights and not
105
+ a stale cached copy of v1.
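
One way to pin the digest is sketched below with Python's standard `hashlib`. It assumes the hash above covers the downloaded `onnx/model.onnx` file; the helper names `sha256_of_file` and `verify_model` are illustrative.

```python
import hashlib

# Digest published in this README for the v2 weights.
EXPECTED_SHA256 = "4eae31d09b1843103a1ebd5e2b2e24b5a5cad441a33906b35b12b1e2ed91d1db"


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file in chunks so large models are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model(path, expected=EXPECTED_SHA256):
    """Return the digest if it matches `expected`, otherwise raise ValueError."""
    actual = sha256_of_file(path)
    if actual != expected:
        raise ValueError(f"SHA256 mismatch for {path}: got {actual}")
    return actual
```

Calling `verify_model("onnx/model.onnx")` right after download, before creating the inference session, makes a stale v1 copy fail loudly instead of producing degenerate embeddings.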
106
+
107
+ ## Base model
108
+
109
+ [nomic-ai/CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed) (137M params),
110
+ based on [Snowflake/snowflake-arctic-embed-m-long](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-long).
111
+ ONNX conversion derived from [jalipalo/CodeRankEmbed-onnx](https://huggingface.co/jalipalo/CodeRankEmbed-onnx).
112
+
113
+ ## License
114
+
115
+ MIT (inherited from base model).