---
license: apache-2.0
language:
- en
base_model:
- zeroentropy/zerank-1-small
pipeline_tag: text-ranking
tags:
- reranking
- onnx
- quantized
- fastembed
library_name: fastembed
---
# zerank-1-small — ONNX Export
ONNX export of [zeroentropy/zerank-1-small](https://huggingface.co/zeroentropy/zerank-1-small), a 1.7B-parameter Qwen3-based reranker, provided at three precision levels for CPU inference.
## Files
| File | Format | Size | Description |
|------|--------|------|-------------|
| `model.onnx` + `model.onnx_data` | FP16 | ~3.2 GB | Full precision |
| `model_int8.onnx` + `model_int8.onnx_data` | INT8 | ~2.5 GB | Weight-only INT8 (per-tensor symmetric) |
| `model_int4_full.onnx` | INT4 | ~1.3 GB | MatMulNBits INT4, block_size=32 |
Conversion scripts: `export_zerank_v2.py` (FP16 export with dynamic batch), `stream_int8.py` (INT8 quantization).
## ⚠️ Important: chat template required
This model is a Qwen3-based causal LM that scores (query, document) relevance by extracting the **"Yes" token logit** at the last position. It requires a specific prompt format — plain pair tokenization produces meaningless scores.
**Always format inputs using the Qwen3 chat template with `system=query`, `user=document`:**
```python
# using the tokenizer directly (matches training format exactly):
messages = [
{"role": "system", "content": query},
{"role": "user", "content": document},
]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```
The template renders to the following fixed string, which can also be built by hand without a tokenizer:
```
<|im_start|>system
{query}
<|im_end|>
<|im_start|>user
{document}
<|im_end|>
<|im_start|>assistant
```
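Where no tokenizer is available, the prompt can be assembled by hand. A minimal sketch that mirrors the template block above verbatim (the helper name is illustrative; verify the exact whitespace against `apply_chat_template` output for your tokenizer version):

```python
def format_pair_manual(query: str, document: str) -> str:
    # Mirrors the fixed Qwen3 chat template shown above; each segment is
    # newline-separated and the string ends with the assistant header.
    return (
        f"<|im_start|>system\n{query}\n<|im_end|>\n"
        f"<|im_start|>user\n{document}\n<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
```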
## Usage with ONNX Runtime (Python)
```python
import onnxruntime as ort
import numpy as np
from transformers import AutoTokenizer
MODEL_PATH = "model_int8.onnx" # or model.onnx, model_int4_full.onnx
MAX_LENGTH = 512
sess = ort.InferenceSession(MODEL_PATH, providers=["CPUExecutionProvider"])
tok = AutoTokenizer.from_pretrained("cstr/zerank-1-small-ONNX")
def format_pair(query: str, doc: str) -> str:
messages = [
{"role": "system", "content": query},
{"role": "user", "content": doc},
]
return tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
def rerank(query: str, documents: list[str]) -> list[float]:
scores = []
for doc in documents:
text = format_pair(query, doc)
enc = tok(text, return_tensors="np", truncation=True, max_length=MAX_LENGTH)
logit = sess.run(["logits"], {
"input_ids": enc["input_ids"].astype(np.int64),
"attention_mask": enc["attention_mask"].astype(np.int64),
})[0]
scores.append(float(logit[0, 0]))
return scores
query = "What is a panda?"
docs = [
"The giant panda is a bear species endemic to China.",
"The sky is blue and the grass is green.",
"Pandas are mammals in the family Ursidae.",
]
scores = rerank(query, docs)
for s, d in sorted(zip(scores, docs), reverse=True):
    print(f"[{s:+.1f}] {d}")
# [+6.8] The giant panda is a bear species endemic to China.
# [+2.1] Pandas are mammals in the family Ursidae.
# [-5.8] The sky is blue and the grass is green.
```
> **Batch inference:** The v2 export (`model.onnx`) supports `batch_size > 1` via a dynamic causal+padding mask. Pad a batch with the tokenizer and pass the full batch at once for higher throughput.
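A batched variant of the per-document loop above might look like this (a sketch only, assuming the v2 export's dynamic batch dimension and right-side padding so the model's internal last-token gather stays valid; not tested against a live session):

```python
import numpy as np

def rerank_batch(sess, tok, query: str, documents: list[str],
                 max_length: int = 512) -> list[float]:
    # Format every (query, document) pair with the chat template, then pad
    # the whole batch to one length and score it in a single session call.
    texts = [
        tok.apply_chat_template(
            [{"role": "system", "content": query},
             {"role": "user", "content": doc}],
            tokenize=False, add_generation_prompt=True)
        for doc in documents
    ]
    tok.padding_side = "right"  # keep real tokens at positions 0..len-1
    enc = tok(texts, return_tensors="np", padding=True,
              truncation=True, max_length=max_length)
    logits = sess.run(["logits"], {
        "input_ids": enc["input_ids"].astype(np.int64),
        "attention_mask": enc["attention_mask"].astype(np.int64),
    })[0]                        # [batch, 1] Yes-token logits
    return logits[:, 0].astype(float).tolist()
```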
## Usage with fastembed-rs
```rust
use fastembed::{RerankInitOptions, RerankerModel, TextRerank};
let mut reranker = TextRerank::try_new(
RerankInitOptions::new(RerankerModel::ZerankSmallInt8)
).unwrap();
// The chat template is applied automatically; batch_size > 1 is supported.
let results = reranker.rerank(
"What is a panda?",
vec![
"The giant panda is a bear species endemic to China.",
"The sky is blue.",
"Pandas are mammals in the family Ursidae.",
],
true,
Some(32),
).unwrap();
for r in &results {
println!("[{:.3}] {}", r.score, r.document.as_ref().unwrap());
}
```
## Export details
`export_zerank_v2.py` wraps Qwen3ForCausalLM in a `ZeRankScorerV2` that:
1. Builds a 4D causal+padding attention mask explicitly from `input_ids.shape[0]` — this makes the batch dimension dynamic in the ONNX graph (enabling `batch_size > 1`).
2. Runs the transformer body → `hidden [batch, seq, hidden]`
3. Gathers the hidden state at the last real-token position (`attention_mask.sum - 1`)
4. Applies `lm_head`, slices the **"Yes" token** (id `9454`) → `[batch, 1]`
Output: `logits [batch, 1]` — raw Yes-token logit (higher = more relevant). FP16 weights, opset 18.
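Steps 3–4 can be illustrated in NumPy (a sketch of the gather-and-slice logic, not the export code itself; the `[vocab, hidden]` layout of `lm_head_w` is an assumption):

```python
import numpy as np

def yes_logit(hidden: np.ndarray, attention_mask: np.ndarray,
              lm_head_w: np.ndarray, yes_id: int = 9454) -> np.ndarray:
    # hidden: [batch, seq, d]; attention_mask: [batch, seq]; lm_head_w: [vocab, d]
    last = attention_mask.sum(axis=1) - 1              # last real-token index per row
    pooled = hidden[np.arange(hidden.shape[0]), last]  # [batch, d]
    return pooled @ lm_head_w[yes_id]                  # [batch] raw Yes-token logits
```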
`stream_int8.py` performs fully streaming weight-only INT8 quantization:
- Never loads the full 6.4 GB FP32 model into RAM (peak ~1.5 GB)
- Symmetric per-tensor quantization: `scale = max(|w|) / 127`
- Adds `DequantizeLinear → MatMul` nodes for all MatMul B-weights
- Non-MatMul tensors (embeddings, LayerNorm) kept as FP32
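The per-tensor scheme amounts to only a few lines. A NumPy sketch of the quantize/dequantize round-trip (illustrative, not the actual `stream_int8.py` code):

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    # Symmetric per-tensor: one scale for the whole tensor, zero point 0.
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    # Mirrors the DequantizeLinear node inserted before each MatMul.
    return q.astype(np.float32) * scale
```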
## Benchmarks (from original model card)
NDCG@10 with `text-embedding-3-small` as initial retriever (Top 100 candidates):
| Task | Embedding only | cohere-rerank-v3.5 | Llama-rank-v1 | **zerank-1-small** | zerank-1 |
|------|---------------|-------------------|--------------|----------------|----------|
| Code | 0.678 | 0.724 | 0.694 | **0.730** | 0.754 |
| Finance | 0.839 | 0.824 | 0.828 | **0.861** | 0.894 |
| Legal | 0.703 | 0.804 | 0.767 | **0.817** | 0.821 |
| Medical | 0.619 | 0.750 | 0.719 | **0.773** | 0.796 |
| STEM | 0.401 | 0.510 | 0.595 | **0.680** | 0.694 |
| Conversational | 0.250 | 0.571 | 0.484 | **0.556** | 0.596 |
See [zeroentropy/zerank-1-small](https://huggingface.co/zeroentropy/zerank-1-small) for full details and Apache-2.0 license.