---
license: apache-2.0
language:
- en
base_model:
- zeroentropy/zerank-1-small
pipeline_tag: text-ranking
tags:
- reranking
- onnx
- quantized
- fastembed
library_name: fastembed
---

# zerank-1-small — ONNX Export

ONNX export of [zeroentropy/zerank-1-small](https://huggingface.co/zeroentropy/zerank-1-small), a 1.7B Qwen3-based reranker. Includes three quantization levels for CPU inference.

## Files

| File | Format | Size | Description |
|------|--------|------|-------------|
| `model.onnx` + `model.onnx_data` | FP16 | ~3.2 GB | Full precision |
| `model_int8.onnx` + `model_int8.onnx_data` | INT8 | ~2.5 GB | Weight-only INT8 (per-tensor symmetric) |
| `model_int4_full.onnx` | INT4 | ~1.3 GB | MatMulNBits INT4, block_size=32 |

Conversion scripts: `export_zerank_v2.py` (FP16 export with dynamic batch), `stream_int8.py` (INT8 quantization).

## ⚠️ Important: chat template required

This model is a Qwen3-based causal LM that scores (query, document) relevance by extracting the **"Yes" token logit** at the last position. It requires a specific prompt format — plain pair tokenization produces meaningless scores.

**Always format inputs using the Qwen3 chat template with `system=query`, `user=document`:**

```python
# using the tokenizer directly (matches training format exactly):
messages = [
    {"role": "system", "content": query},
    {"role": "user",   "content": document},
]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```

This produces the following fixed string (equivalent, usable without a tokenizer):
```
<|im_start|>system
{query}
<|im_end|>
<|im_start|>user
{document}
<|im_end|>
<|im_start|>assistant
```
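
If you want to avoid a tokenizer dependency when building prompts, the fixed string above can be assembled by hand. A minimal sketch (`format_pair_raw` is a hypothetical helper; verify its output matches `apply_chat_template` token-for-token before relying on it):

```python
def format_pair_raw(query: str, document: str) -> str:
    # Mirrors the fixed chat-template string shown above.
    return (
        f"<|im_start|>system\n{query}\n<|im_end|>\n"
        f"<|im_start|>user\n{document}\n<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )
```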

## Usage with ONNX Runtime (Python)

```python
import onnxruntime as ort
import numpy as np
from transformers import AutoTokenizer

MODEL_PATH = "model_int8.onnx"   # or model.onnx, model_int4_full.onnx
MAX_LENGTH = 512

sess = ort.InferenceSession(MODEL_PATH, providers=["CPUExecutionProvider"])
tok  = AutoTokenizer.from_pretrained("cstr/zerank-1-small-ONNX")

def format_pair(query: str, doc: str) -> str:
    messages = [
        {"role": "system", "content": query},
        {"role": "user",   "content": doc},
    ]
    return tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

def rerank(query: str, documents: list[str]) -> list[float]:
    scores = []
    for doc in documents:
        text = format_pair(query, doc)
        enc  = tok(text, return_tensors="np", truncation=True, max_length=MAX_LENGTH)
        logit = sess.run(["logits"], {
            "input_ids":      enc["input_ids"].astype(np.int64),
            "attention_mask": enc["attention_mask"].astype(np.int64),
        })[0]
        scores.append(float(logit[0, 0]))
    return scores

query = "What is a panda?"
docs  = [
    "The giant panda is a bear species endemic to China.",
    "The sky is blue and the grass is green.",
    "Pandas are mammals in the family Ursidae.",
]
scores = rerank(query, docs)
for s, d in sorted(zip(scores, docs), reverse=True):
    print(f"[{s:+.1f}] {d}")
# Example output (exact values vary by quantization level):
# [+6.8] The giant panda is a bear species endemic to China.
# [+2.1] Pandas are mammals in the family Ursidae.
# [-5.8] The sky is blue and the grass is green.
```

> **Batch inference:** The v2 export (`model.onnx`) supports `batch_size > 1` via a dynamic causal+padding mask. Pad a batch with the tokenizer and pass the full batch at once for higher throughput.
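
A batched variant of the `rerank()` helper above might look like this (a sketch; `rerank_batch` is a hypothetical name, and it assumes the `sess` and `tok` objects from the previous snippet plus the dynamic-batch `model.onnx`):

```python
import numpy as np

def rerank_batch(sess, tok, query: str, documents: list[str],
                 max_length: int = 512) -> list[float]:
    # Format every (query, doc) pair with the chat template, pad to a
    # common length, and score the whole batch in one session run.
    texts = [
        tok.apply_chat_template(
            [{"role": "system", "content": query},
             {"role": "user", "content": doc}],
            tokenize=False, add_generation_prompt=True)
        for doc in documents
    ]
    enc = tok(texts, return_tensors="np", padding=True,
              truncation=True, max_length=max_length)
    logits = sess.run(["logits"], {
        "input_ids":      enc["input_ids"].astype(np.int64),
        "attention_mask": enc["attention_mask"].astype(np.int64),
    })[0]                                   # shape [batch, 1]
    return logits[:, 0].astype(float).tolist()
```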

## Usage with fastembed-rs

```rust
use fastembed::{RerankInitOptions, RerankerModel, TextRerank};

let mut reranker = TextRerank::try_new(
    RerankInitOptions::new(RerankerModel::ZerankSmallInt8)
).unwrap();

// The chat template is applied automatically; batch_size > 1 is supported.
let results = reranker.rerank(
    "What is a panda?",
    vec![
        "The giant panda is a bear species endemic to China.",
        "The sky is blue.",
        "Pandas are mammals in the family Ursidae.",
    ],
    true,
    Some(32),
).unwrap();

for r in &results {
    println!("[{:.3}] {}", r.score, r.document.as_ref().unwrap());
}
```

## Export details

`export_zerank_v2.py` wraps Qwen3ForCausalLM in a `ZeRankScorerV2` that:

1. Builds a 4D causal+padding attention mask explicitly from `input_ids.shape[0]` — this makes the batch dimension dynamic in the ONNX graph (enabling `batch_size > 1`).
2. Runs the transformer body → `hidden [batch, seq, hidden]`
3. Gathers the hidden state at the last real-token position (`attention_mask.sum - 1`)
4. Applies `lm_head`, slices the **"Yes" token** (id `9454`) → `[batch, 1]`

Output: `logits [batch, 1]` — raw Yes-token logit (higher = more relevant). FP16 weights, opset 18.
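
Steps 3–4 can be sketched in NumPy (illustrative only; the exported graph does this in FP16 inside ONNX, and `yes_logit` is a hypothetical name):

```python
import numpy as np

def yes_logit(hidden, attention_mask, lm_head_w, yes_token_id=9454):
    # hidden: [batch, seq, hidden]; lm_head_w: [vocab, hidden]
    last = attention_mask.sum(axis=1) - 1               # index of last real token per row
    pooled = hidden[np.arange(hidden.shape[0]), last]   # [batch, hidden]
    logits = pooled @ lm_head_w.T                       # [batch, vocab]
    return logits[:, yes_token_id:yes_token_id + 1]     # [batch, 1] Yes-token logit
```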

`stream_int8.py` performs fully streaming weight-only INT8 quantization:
- Never loads the full 6.4 GB FP32 model into RAM (peak ~1.5 GB)
- Symmetric per-tensor quantization: `scale = max(|w|) / 127`
- Adds `DequantizeLinear → MatMul` nodes for all MatMul B-weights
- Non-MatMul tensors (embeddings, LayerNorm) kept as FP32
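
The quantize/dequantize pair can be sketched as follows (a minimal illustration of the per-tensor scheme; assumes the weight tensor is not all-zero):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    # Symmetric per-tensor: scale = max(|w|) / 127, zero point fixed at 0.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, float(scale)

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    # What the inserted DequantizeLinear node computes before each MatMul.
    return q.astype(np.float32) * scale
```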

## Benchmarks (from original model card)

NDCG@10 with `text-embedding-3-small` as initial retriever (Top 100 candidates):

| Task | Embedding only | cohere-rerank-v3.5 | Llama-rank-v1 | **zerank-1-small** | zerank-1 |
|------|---------------|-------------------|--------------|----------------|----------|
| Code | 0.678 | 0.724 | 0.694 | **0.730** | 0.754 |
| Finance | 0.839 | 0.824 | 0.828 | **0.861** | 0.894 |
| Legal | 0.703 | 0.804 | 0.767 | **0.817** | 0.821 |
| Medical | 0.619 | 0.750 | 0.719 | **0.773** | 0.796 |
| STEM | 0.401 | 0.510 | 0.595 | **0.680** | 0.694 |
| Conversational | 0.250 | 0.571 | 0.484 | **0.556** | 0.596 |

See [zeroentropy/zerank-1-small](https://huggingface.co/zeroentropy/zerank-1-small) for full details and Apache-2.0 license.