---
license: apache-2.0
language:
- en
base_model:
- zeroentropy/zerank-1-small
pipeline_tag: text-ranking
tags:
- reranking
- onnx
- quantized
- fastembed
library_name: fastembed
---

# zerank-1-small — ONNX Export

ONNX export of [zeroentropy/zerank-1-small](https://huggingface.co/zeroentropy/zerank-1-small), a 1.7B-parameter Qwen3-based reranker. Three precision levels are included for CPU inference.

## Files

| File | Format | Size | Description |
|------|--------|------|-------------|
| `model.onnx` + `model.onnx_data` | FP16 | ~3.2 GB | Unquantized baseline |
| `model_int8.onnx` + `model_int8.onnx_data` | INT8 | ~2.5 GB | Weight-only INT8 (per-tensor symmetric) |
| `model_int4_full.onnx` | INT4 | ~1.3 GB | MatMulNBits INT4, block_size=32 |

Conversion scripts: `export_zerank_v2.py` (FP16 export with dynamic batch), `stream_int8.py` (INT8 quantization).

## ⚠️ Important: chat template required

This model is a Qwen3-based causal LM that scores (query, document) relevance by extracting the **"Yes" token logit** at the last position. It requires a specific prompt format — plain pair tokenization produces meaningless scores.

**Always format inputs using the Qwen3 chat template with `system=query`, `user=document`:**
```python
# Using the tokenizer directly (matches training format exactly):
messages = [
    {"role": "system", "content": query},
    {"role": "user", "content": document},
]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```

This produces the following fixed string (equivalent, usable without a tokenizer; note that `<|im_end|>` directly follows the content on the same line):
```
<|im_start|>system
{query}<|im_end|>
<|im_start|>user
{document}<|im_end|>
<|im_start|>assistant
```
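If you need to build prompts outside of `transformers`, the fixed string above can be assembled by hand. A minimal sketch; `format_pair_manual` is a hypothetical helper, and the exact whitespace should be verified against the output of `tok.apply_chat_template` for your tokenizer version:

```python
# Build the reranker prompt by hand, following the fixed layout above.
# Verify whitespace against apply_chat_template before relying on this.
def format_pair_manual(query: str, document: str) -> str:
    return (
        "<|im_start|>system\n"
        f"{query}<|im_end|>\n"
        "<|im_start|>user\n"
        f"{document}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

prompt = format_pair_manual("What is a panda?", "The giant panda is a bear species.")
```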

## Usage with ONNX Runtime (Python)

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

MODEL_PATH = "model_int8.onnx"  # or model.onnx, model_int4_full.onnx
MAX_LENGTH = 512

sess = ort.InferenceSession(MODEL_PATH, providers=["CPUExecutionProvider"])
tok = AutoTokenizer.from_pretrained("cstr/zerank-1-small-ONNX")

def format_pair(query: str, doc: str) -> str:
    messages = [
        {"role": "system", "content": query},
        {"role": "user", "content": doc},
    ]
    return tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

def rerank(query: str, documents: list[str]) -> list[float]:
    scores = []
    for doc in documents:
        text = format_pair(query, doc)
        enc = tok(text, return_tensors="np", truncation=True, max_length=MAX_LENGTH)
        logit = sess.run(["logits"], {
            "input_ids": enc["input_ids"].astype(np.int64),
            "attention_mask": enc["attention_mask"].astype(np.int64),
        })[0]
        scores.append(float(logit[0, 0]))
    return scores

query = "What is a panda?"
docs = [
    "The giant panda is a bear species endemic to China.",
    "The sky is blue and the grass is green.",
    "Pandas are mammals in the family Ursidae.",
]
scores = rerank(query, docs)
for s, d in sorted(zip(scores, docs), reverse=True):
    print(f"[{s:+.1f}] {d}")
# [+6.8] The giant panda is a bear species endemic to China.
# [+2.1] Pandas are mammals in the family Ursidae.
# [-5.8] The sky is blue and the grass is green.
```

> **Batch inference:** The v2 export (`model.onnx`) supports `batch_size > 1` via a dynamic causal+padding mask. Pad a batch with the tokenizer and pass the full batch at once for higher throughput.
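A batched variant of the single-document loop above can look like the following sketch. It assumes the v2 export (`model.onnx`) with a dynamic batch dimension; the session, tokenizer, and formatting function are passed in explicitly so the function itself is standalone:

```python
import numpy as np

# Batched scoring sketch. `sess`, `tok`, and `format_pair` correspond to the
# objects built in the single-document example; they are parameters here.
def rerank_batch(sess, tok, format_pair, query, documents, max_length=512):
    texts = [format_pair(query, d) for d in documents]
    # Right-padding keeps the last real token at position attention_mask.sum - 1,
    # which is where the export gathers the "Yes" logit.
    enc = tok(texts, return_tensors="np", padding=True,
              truncation=True, max_length=max_length)
    logits = sess.run(["logits"], {
        "input_ids": enc["input_ids"].astype(np.int64),
        "attention_mask": enc["attention_mask"].astype(np.int64),
    })[0]  # [batch, 1]
    return [float(x) for x in logits[:, 0]]
```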

## Usage with fastembed-rs

```rust
use fastembed::{RerankInitOptions, RerankerModel, TextRerank};

let mut reranker = TextRerank::try_new(
    RerankInitOptions::new(RerankerModel::ZerankSmallInt8)
).unwrap();

// The chat template is applied automatically; batch_size > 1 is supported.
let results = reranker.rerank(
    "What is a panda?",
    vec![
        "The giant panda is a bear species endemic to China.",
        "The sky is blue.",
        "Pandas are mammals in the family Ursidae.",
    ],
    true,
    Some(32),
).unwrap();

for r in &results {
    println!("[{:.3}] {}", r.score, r.document.as_ref().unwrap());
}
```

## Export details

`export_zerank_v2.py` wraps Qwen3ForCausalLM in a `ZeRankScorerV2` that:

1. Builds a 4D causal+padding attention mask explicitly from `input_ids.shape[0]` — this makes the batch dimension dynamic in the ONNX graph (enabling `batch_size > 1`)
2. Runs the transformer body → `hidden [batch, seq, hidden]`
3. Gathers the hidden state at the last real-token position (`attention_mask.sum - 1`)
4. Applies `lm_head`, slices the **"Yes" token** (id `9454`) → `[batch, 1]`

Output: `logits [batch, 1]` — raw Yes-token logit (higher = more relevant). FP16 weights, opset 18.
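Steps 3 and 4 can be illustrated with toy shapes (the dimensions and `yes_id` below are made up for the sketch; the real export uses the Qwen3 hidden size and vocabulary, with Yes-token id 9454):

```python
import numpy as np

# Gather-and-slice head of the scorer, on toy shapes.
rng = np.random.default_rng(0)
batch, seq, hidden, vocab = 2, 5, 8, 16
yes_id = 9                                                  # stand-in for id 9454
hidden_states = rng.standard_normal((batch, seq, hidden))   # transformer output
attention_mask = np.array([[1, 1, 1, 0, 0],
                           [1, 1, 1, 1, 1]])
last = attention_mask.sum(axis=1) - 1                       # last real-token index
pooled = hidden_states[np.arange(batch), last]              # [batch, hidden]
lm_head = rng.standard_normal((hidden, vocab))
logits = pooled @ lm_head                                   # [batch, vocab]
scores = logits[:, yes_id:yes_id + 1]                       # [batch, 1]
```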

`stream_int8.py` performs fully streaming weight-only INT8 quantization:
- Never loads the full 6.4 GB FP32 model into RAM (peak usage ~1.5 GB)
- Symmetric per-tensor quantization: `scale = max(|w|) / 127`
- Adds `DequantizeLinear → MatMul` nodes for all MatMul B-weights
- Keeps non-MatMul tensors (embeddings, LayerNorm) in FP32
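The quantization scheme itself is simple; a numpy sketch of the per-tensor symmetric math (illustrative only — the actual script writes ONNX initializers and `DequantizeLinear` nodes rather than operating on arrays like this):

```python
import numpy as np

# Symmetric per-tensor INT8: one scale per weight tensor, zero-point = 0.
def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal((64, 64)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = np.abs(w - w_hat).max()   # rounding error is bounded by scale / 2
```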

## Benchmarks (from original model card)

NDCG@10 with `text-embedding-3-small` as initial retriever (top-100 candidates):

| Task | Embedding only | cohere-rerank-v3.5 | Llama-rank-v1 | **zerank-1-small** | zerank-1 |
|------|----------------|--------------------|---------------|--------------------|----------|
| Code | 0.678 | 0.724 | 0.694 | **0.730** | 0.754 |
| Finance | 0.839 | 0.824 | 0.828 | **0.861** | 0.894 |
| Legal | 0.703 | 0.804 | 0.767 | **0.817** | 0.821 |
| Medical | 0.619 | 0.750 | 0.719 | **0.773** | 0.796 |
| STEM | 0.401 | 0.510 | 0.595 | **0.680** | 0.694 |
| Conversational | 0.250 | 0.571 | 0.484 | **0.556** | 0.596 |

See [zeroentropy/zerank-1-small](https://huggingface.co/zeroentropy/zerank-1-small) for full details and the Apache-2.0 license.