cstr committed on
Commit ba9c38b · verified · 1 Parent(s): 73e1c32

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +38 -17

README.md CHANGED
@@ -25,18 +25,31 @@ ONNX export of [zeroentropy/zerank-1-small](https://huggingface.co/zeroentropy/z
  | `model_int8.onnx` + `model_int8.onnx_data` | INT8 | ~2.5 GB | Weight-only INT8 (per-tensor symmetric) |
  | `model_int4_full.onnx` | INT4 | ~1.3 GB | MatMulNBits INT4, block_size=32 |

- Conversion scripts: `export_zerank.py` (FP16 export), `stream_int8.py` (INT8 quantization).

  ## ⚠️ Important: chat template required

  This model is a Qwen3-based causal LM that scores (query, document) relevance by extracting the **"Yes" token logit** at the last position. It requires a specific prompt format — plain pair tokenization produces meaningless scores.

- **Always format inputs as:**
  ```
  <|im_start|>user
- Query: {query}
- Document: {document}
- Relevant:<|im_end|>
  <|im_start|>assistant
  ```

@@ -48,16 +61,23 @@ import numpy as np
  from transformers import AutoTokenizer

  MODEL_PATH = "model_int8.onnx"  # or model.onnx, model_int4_full.onnx
- TEMPLATE = "<|im_start|>user\nQuery: {query}\nDocument: {doc}\nRelevant:<|im_end|>\n<|im_start|>assistant\n"

  sess = ort.InferenceSession(MODEL_PATH, providers=["CPUExecutionProvider"])
  tok = AutoTokenizer.from_pretrained("cstr/zerank-1-small-ONNX")

  def rerank(query: str, documents: list[str]) -> list[float]:
      scores = []
      for doc in documents:
-         text = TEMPLATE.format(query=query, doc=doc)
-         enc = tok(text, return_tensors="np", truncation=True, max_length=512)
          logit = sess.run(["logits"], {
              "input_ids": enc["input_ids"].astype(np.int64),
              "attention_mask": enc["attention_mask"].astype(np.int64),
@@ -74,12 +94,12 @@ docs = [
  scores = rerank(query, docs)
  for s, d in sorted(zip(scores, docs), reverse=True):
      print(f"[{s:.3f}] {d}")
- # [6.8] The giant panda is a bear species endemic to China.
- # [2.1] Pandas are mammals in the family Ursidae.
  # [-5.8] The sky is blue and the grass is green.
  ```

- > **Note:** Current export uses `batch_size=1` (causal mask is static). Process documents one at a time as shown above.

  ## Usage with fastembed-rs
 
@@ -90,7 +110,7 @@ let mut reranker = TextRerank::try_new(
      RerankInitOptions::new(RerankerModel::ZerankSmallInt8)
  ).unwrap();

- // batch_size=1: chat template is applied automatically per document
  let results = reranker.rerank(
      "What is a panda?",
      vec![
@@ -99,7 +119,7 @@ let results = reranker.rerank(
          "Pandas are mammals in the family Ursidae.",
      ],
      true,
-     Some(1),
  ).unwrap();

  for r in &results {
@@ -109,11 +129,12 @@ for r in &results {

  ## Export details

- `export_zerank.py` wraps Qwen3ForCausalLM in a `ZeRankScorer` that:

- 1. Runs the transformer body → `hidden [batch, seq, hidden]`
- 2. Gathers the hidden state at the last real-token position (`attention_mask.sum - 1`)
- 3. Applies `lm_head`, slices the **"Yes" token** (id `9454`) → `[batch, 1]`

  Output: `logits [batch, 1]` — raw Yes-token logit (higher = more relevant). FP16 weights, opset 18.
 
 
  | `model_int8.onnx` + `model_int8.onnx_data` | INT8 | ~2.5 GB | Weight-only INT8 (per-tensor symmetric) |
  | `model_int4_full.onnx` | INT4 | ~1.3 GB | MatMulNBits INT4, block_size=32 |

+ Conversion scripts: `export_zerank_v2.py` (FP16 export with dynamic batch), `stream_int8.py` (INT8 quantization).

  ## ⚠️ Important: chat template required

  This model is a Qwen3-based causal LM that scores (query, document) relevance by extracting the **"Yes" token logit** at the last position. It requires a specific prompt format — plain pair tokenization produces meaningless scores.

+ **Always format inputs using the Qwen3 chat template with `system=query`, `user=document`:**
+
+ ```python
+ # using the tokenizer directly (matches training format exactly):
+ messages = [
+     {"role": "system", "content": query},
+     {"role": "user", "content": document},
+ ]
+ text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
  ```
+
+ This produces the following fixed string (equivalent, usable without a tokenizer):
+ ```
+ <|im_start|>system
+ {query}
+ <|im_end|>
  <|im_start|>user
+ {document}
+ <|im_end|>
  <|im_start|>assistant
  ```

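Where a tokenizer is unavailable, the fixed string can be assembled by hand. A minimal sketch, assuming the line layout shown in the template block (the helper name `format_pair_plain` is hypothetical; verify the exact whitespace against `tok.apply_chat_template` before relying on it):

```python
def format_pair_plain(query: str, document: str) -> str:
    # Hypothetical helper: mirrors the fixed template above, line by line.
    # Check the output against tok.apply_chat_template before relying on it.
    return (
        "<|im_start|>system\n"
        f"{query}\n"
        "<|im_end|>\n"
        "<|im_start|>user\n"
        f"{document}\n"
        "<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

text = format_pair_plain("What is a panda?",
                         "The giant panda is a bear species endemic to China.")
print(text)
```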
 
 
  from transformers import AutoTokenizer

  MODEL_PATH = "model_int8.onnx"  # or model.onnx, model_int4_full.onnx
+ MAX_LENGTH = 512

  sess = ort.InferenceSession(MODEL_PATH, providers=["CPUExecutionProvider"])
  tok = AutoTokenizer.from_pretrained("cstr/zerank-1-small-ONNX")

+ def format_pair(query: str, doc: str) -> str:
+     messages = [
+         {"role": "system", "content": query},
+         {"role": "user", "content": doc},
+     ]
+     return tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+
  def rerank(query: str, documents: list[str]) -> list[float]:
      scores = []
      for doc in documents:
+         text = format_pair(query, doc)
+         enc = tok(text, return_tensors="np", truncation=True, max_length=MAX_LENGTH)
          logit = sess.run(["logits"], {
              "input_ids": enc["input_ids"].astype(np.int64),
              "attention_mask": enc["attention_mask"].astype(np.int64),
 
  scores = rerank(query, docs)
  for s, d in sorted(zip(scores, docs), reverse=True):
      print(f"[{s:.3f}] {d}")
+ # [+6.8] The giant panda is a bear species endemic to China.
+ # [+2.1] Pandas are mammals in the family Ursidae.
  # [-5.8] The sky is blue and the grass is green.
  ```

+ > **Batch inference:** The v2 export (`model.onnx`) supports `batch_size > 1` via a dynamic causal+padding mask. Pad a batch with the tokenizer and pass the full batch at once for higher throughput.
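Batched inference requires padding all pairs to one length. A minimal numpy-only sketch of right-padding (the token ids are made up for illustration; in practice `tok(texts, padding=True, return_tensors="np")` builds these arrays in one call):

```python
import numpy as np

# Hypothetical token id sequences of different lengths (illustration only).
sequences = [[151644, 872, 198], [151644, 872, 198, 9454, 151645]]
pad_id = 0
max_len = max(len(s) for s in sequences)

input_ids = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
attention_mask = np.zeros((len(sequences), max_len), dtype=np.int64)
for i, seq in enumerate(sequences):
    input_ids[i, : len(seq)] = seq      # right-pad: real tokens first
    attention_mask[i, : len(seq)] = 1   # 1 = real token, 0 = padding

# The padded arrays can then be fed to the session as one batch:
# sess.run(["logits"], {"input_ids": input_ids, "attention_mask": attention_mask})
print(input_ids.shape, attention_mask.sum(axis=1).tolist())
```

Right padding matters here: the export locates the scoring position as `attention_mask.sum - 1`, which assumes real tokens precede the padding.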

  ## Usage with fastembed-rs
 
 
      RerankInitOptions::new(RerankerModel::ZerankSmallInt8)
  ).unwrap();

+ // The chat template is applied automatically; batch_size > 1 is supported.
  let results = reranker.rerank(
      "What is a panda?",
      vec![
 
          "Pandas are mammals in the family Ursidae.",
      ],
      true,
+     Some(32),
  ).unwrap();

  for r in &results {
 
  ## Export details

+ `export_zerank_v2.py` wraps Qwen3ForCausalLM in a `ZeRankScorerV2` that:

+ 1. Builds a 4D causal+padding attention mask explicitly from `input_ids.shape[0]` — this makes the batch dimension dynamic in the ONNX graph (enabling `batch_size > 1`).
+ 2. Runs the transformer body → `hidden [batch, seq, hidden]`
+ 3. Gathers the hidden state at the last real-token position (`attention_mask.sum - 1`)
+ 4. Applies `lm_head`, slices the **"Yes" token** (id `9454`) → `[batch, 1]`

  Output: `logits [batch, 1]` — raw Yes-token logit (higher = more relevant). FP16 weights, opset 18.
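The gather-and-slice steps can be sketched with numpy. Random arrays stand in for the real hidden states and `lm_head`, and the sizes are toy values (the real "Yes" token id is 9454; a scaled-down stand-in is used here):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq, hidden, vocab = 2, 6, 8, 16   # toy sizes; the real model is far larger
yes_id = 14                               # stand-in; the real "Yes" token id is 9454

hidden_states = rng.standard_normal((batch, seq, hidden))   # transformer output
lm_head = rng.standard_normal((hidden, vocab))              # vocab projection
attention_mask = np.array([[1, 1, 1, 0, 0, 0],
                           [1, 1, 1, 1, 1, 0]])

# Gather the hidden state at the last real-token position (attention_mask.sum - 1).
last_pos = attention_mask.sum(axis=1) - 1                   # [batch]
last_hidden = hidden_states[np.arange(batch), last_pos]     # [batch, hidden]

# Project with lm_head and slice out the single "Yes" logit.
logits = last_hidden @ lm_head                              # [batch, vocab]
yes_logit = logits[:, yes_id : yes_id + 1]                  # [batch, 1]
print(yes_logit.shape)
```

The keepdims-style slice (`yes_id : yes_id + 1` rather than plain indexing) is what yields the documented `[batch, 1]` output shape.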