cstr committed · Commit b78d236 · verified · 1 Parent(s): 9f03a15

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +87 -37

README.md CHANGED
@@ -25,17 +25,61 @@ ONNX export of [zeroentropy/zerank-1-small](https://huggingface.co/zeroentropy/z
  | `model_int8.onnx` + `model_int8.onnx_data` | INT8 | ~2.5 GB | Weight-only INT8 (per-tensor symmetric) |
  | `model_int4_full.onnx` | INT4 | ~1.3 GB | MatMulNBits INT4, block_size=32 |

- The INT8 model uses a custom streaming quantizer (never loads the full 6.4 GB FP32 model into RAM). The INT4 model uses ORT's `MatMulNBitsQuantizer`.

- ## Export details

- The `ZeRankScorer` wrapper bakes Yes-token logit extraction into the graph:

- 1. Runs the Qwen3 transformer body → `[batch, seq, hidden]`
- 2. Gathers at the last real-token position (using `attention_mask.sum - 1`)
- 3. Applies `lm_head` and slices the Yes-token (id=`9454`) → `[batch, 1]`

- Output: `logits [batch, 1]` raw Yes-token logit, higher = more relevant. Compatible with fastembed's standard reranker interface.

  ## Usage with fastembed-rs
 
@@ -46,44 +90,50 @@ let mut reranker = TextRerank::try_new(
  RerankInitOptions::new(RerankerModel::ZerankSmallInt8)
  ).unwrap();

  let results = reranker.rerank(
- "what is a panda?",
- vec!["A panda is a bear...", "The sky is blue..."],
  true,
- Some(1), // batch_size=1
  ).unwrap();
  ```

- > **Note:** Use `batch_size=Some(1)` — the causal attention mask was traced with a static batch dimension.

- ## Usage with ONNX Runtime (Python)

- ```python
- import onnxruntime as ort
- import numpy as np
- from transformers import AutoTokenizer
-
- sess = ort.InferenceSession("model_int8.onnx", providers=["CPUExecutionProvider"])
- tok = AutoTokenizer.from_pretrained("cstr/zerank-1-small-ONNX")
-
- query, doc = "what is a panda?", "A panda is a large black-and-white bear."
- enc = tok(query, doc, return_tensors="np", truncation=True, max_length=512)
- logit = sess.run(["logits"], {
-     "input_ids": enc["input_ids"].astype(np.int64),
-     "attention_mask": enc["attention_mask"].astype(np.int64),
- })[0]
- score = float(logit[0][0])
- print(f"Score: {score:.3f}")
- ```

- ## Original model

- See [zeroentropy/zerank-1-small](https://huggingface.co/zeroentropy/zerank-1-small) for full model details, evaluations, and license (Apache-2.0).

- | Task | cohere-rerank-v3.5 | Salesforce/Llama-rank-v1 | **zerank-1-small** | zerank-1 |
- |------|--------------------|--------------------------|----------------|----------|
- | Code | 0.724 | 0.694 | **0.730** | 0.754 |
- | Finance | 0.824 | 0.828 | **0.861** | 0.894 |
- | Legal | 0.804 | 0.767 | **0.817** | 0.821 |
- | Medical | 0.750 | 0.719 | **0.773** | 0.796 |
- | STEM | 0.510 | 0.595 | **0.680** | 0.694 |
 
  | `model_int8.onnx` + `model_int8.onnx_data` | INT8 | ~2.5 GB | Weight-only INT8 (per-tensor symmetric) |
  | `model_int4_full.onnx` | INT4 | ~1.3 GB | MatMulNBits INT4, block_size=32 |

+ Conversion scripts: `export_zerank.py` (FP16 export), `stream_int8.py` (INT8 quantization).
+
+ ## ⚠️ Important: chat template required
+
+ This model is a Qwen3-based causal LM that scores (query, document) relevance by extracting the **"Yes" token logit** at the last position. It requires a specific prompt format — plain pair tokenization produces meaningless scores.
+
+ **Always format inputs as:**
+ ```
+ <|im_start|>user
+ Query: {query}
+ Document: {document}
+ Relevant:<|im_end|>
+ <|im_start|>assistant
+ ```

+ ## Usage with ONNX Runtime (Python)
+
+ ```python
+ import onnxruntime as ort
+ import numpy as np
+ from transformers import AutoTokenizer
+
+ MODEL_PATH = "model_int8.onnx"  # or model.onnx, model_int4_full.onnx
+ TEMPLATE = "<|im_start|>user\nQuery: {query}\nDocument: {doc}\nRelevant:<|im_end|>\n<|im_start|>assistant\n"
+
+ sess = ort.InferenceSession(MODEL_PATH, providers=["CPUExecutionProvider"])
+ tok = AutoTokenizer.from_pretrained("cstr/zerank-1-small-ONNX")
+
+ def rerank(query: str, documents: list[str]) -> list[float]:
+     scores = []
+     for doc in documents:
+         text = TEMPLATE.format(query=query, doc=doc)
+         enc = tok(text, return_tensors="np", truncation=True, max_length=512)
+         logit = sess.run(["logits"], {
+             "input_ids": enc["input_ids"].astype(np.int64),
+             "attention_mask": enc["attention_mask"].astype(np.int64),
+         })[0]
+         scores.append(float(logit[0, 0]))
+     return scores
+
+ query = "What is a panda?"
+ docs = [
+     "The giant panda is a bear species endemic to China.",
+     "The sky is blue and the grass is green.",
+     "Pandas are mammals in the family Ursidae.",
+ ]
+ scores = rerank(query, docs)
+ for s, d in sorted(zip(scores, docs), reverse=True):
+     print(f"[{s:.1f}] {d}")
+ # [6.8] The giant panda is a bear species endemic to China.
+ # [2.1] Pandas are mammals in the family Ursidae.
+ # [-5.8] The sky is blue and the grass is green.
+ ```

+ > **Note:** Current export uses `batch_size=1` (causal mask is static). Process documents one at a time as shown above.

  ## Usage with fastembed-rs
 
  RerankInitOptions::new(RerankerModel::ZerankSmallInt8)
  ).unwrap();

+ // batch_size=1: chat template is applied automatically per document
  let results = reranker.rerank(
+     "What is a panda?",
+     vec![
+         "The giant panda is a bear species endemic to China.",
+         "The sky is blue.",
+         "Pandas are mammals in the family Ursidae.",
+     ],
      true,
+     Some(1),
  ).unwrap();
+
+ for r in &results {
+     println!("[{:.3}] {}", r.score, r.document.as_ref().unwrap());
+ }
  ```

+ ## Export details
+
+ `export_zerank.py` wraps Qwen3ForCausalLM in a `ZeRankScorer` that:
+
+ 1. Runs the transformer body → `hidden [batch, seq, hidden]`
+ 2. Gathers the hidden state at the last real-token position (`attention_mask.sum - 1`)
+ 3. Applies `lm_head`, slices the **"Yes" token** (id `9454`) → `[batch, 1]`
+
+ Output: `logits [batch, 1]` — raw Yes-token logit (higher = more relevant). FP16 weights, opset 18.
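The gather-and-slice logic of steps 2–3 can be sketched in NumPy (a minimal illustration with toy shapes and random stand-in weights, not the exported graph itself; `yes_logit`, `hidden`, and `lm_head_w` are names invented for this sketch):

```python
import numpy as np

YES_TOKEN_ID = 9454  # "Yes" token id from the export

def yes_logit(hidden, attention_mask, lm_head_w):
    """hidden: [batch, seq, h], attention_mask: [batch, seq], lm_head_w: [vocab, h]."""
    # Step 2: index of the last real (non-padding) token in each sequence
    last = attention_mask.sum(axis=1) - 1                    # [batch]
    gathered = hidden[np.arange(hidden.shape[0]), last]      # [batch, h]
    # Step 3: project to vocab logits, keep only the "Yes" column
    logits = gathered @ lm_head_w.T                          # [batch, vocab]
    return logits[:, YES_TOKEN_ID : YES_TOKEN_ID + 1]        # [batch, 1]

# Toy example: batch=1, seq=4 (last position is padding), h=8, vocab=9500
rng = np.random.default_rng(0)
hidden = rng.standard_normal((1, 4, 8)).astype(np.float32)
mask = np.array([[1, 1, 1, 0]], dtype=np.int64)
w = rng.standard_normal((9500, 8)).astype(np.float32)
print(yes_logit(hidden, mask, w).shape)  # (1, 1)
```

Gathering at `attention_mask.sum - 1` rather than `seq - 1` is what makes the export robust to right-padded batches.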
+
+ `stream_int8.py` performs fully streaming weight-only INT8 quantization:
+
+ - Never loads the full 6.4 GB FP32 model into RAM (peak ~1.5 GB)
+ - Symmetric per-tensor quantization: `scale = max(|w|) / 127`
+ - Adds `DequantizeLinear → MatMul` nodes for all MatMul B-weights
+ - Non-MatMul tensors (embeddings, LayerNorm) kept as FP32
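The per-tensor scheme in the bullets above amounts to the following round trip (an illustrative NumPy sketch only, not the actual `stream_int8.py` code):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Weight-only symmetric per-tensor INT8: scale = max(|w|) / 127."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, np.float32(scale)

def dequantize(q: np.ndarray, scale: np.float32) -> np.ndarray:
    # Mirrors the DequantizeLinear node inserted before each MatMul
    return q.astype(np.float32) * scale

w = np.random.default_rng(1).standard_normal((256, 64)).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()
print(f"max abs error: {err:.4f} (<= scale/2 = {scale / 2:.4f})")
```

Because the scale is shared across the whole tensor, round-trip error is bounded by `scale / 2`; the largest-magnitude weight sets the scale for everything else, which is why this scheme only targets MatMul B-weights and leaves sensitive tensors in FP32.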
+
+ ## Benchmarks (from original model card)
+
+ NDCG@10 with `text-embedding-3-small` as initial retriever (Top 100 candidates):
+
+ | Task | Embedding only | cohere-rerank-v3.5 | Llama-rank-v1 | **zerank-1-small** | zerank-1 |
+ |------|---------------|-------------------|--------------|----------------|----------|
+ | Code | 0.678 | 0.724 | 0.694 | **0.730** | 0.754 |
+ | Finance | 0.839 | 0.824 | 0.828 | **0.861** | 0.894 |
+ | Legal | 0.703 | 0.804 | 0.767 | **0.817** | 0.821 |
+ | Medical | 0.619 | 0.750 | 0.719 | **0.773** | 0.796 |
+ | STEM | 0.401 | 0.510 | 0.595 | **0.680** | 0.694 |
+ | Conversational | 0.250 | 0.571 | 0.484 | **0.556** | 0.596 |
+
+ See [zeroentropy/zerank-1-small](https://huggingface.co/zeroentropy/zerank-1-small) for full details and Apache-2.0 license.