---
license: apache-2.0
language:
- en
base_model:
- zeroentropy/zerank-1-small
pipeline_tag: text-ranking
tags:
- reranking
- onnx
- quantized
- fastembed
library_name: fastembed
---

# zerank-1-small — ONNX Export

ONNX export of [zeroentropy/zerank-1-small](https://huggingface.co/zeroentropy/zerank-1-small), a 1.7B-parameter Qwen3-based reranker. Three quantization levels are included for CPU inference.

## Files

| File | Format | Size | Description |
|------|--------|------|-------------|
| `model.onnx` + `model.onnx_data` | FP16 | ~3.2 GB | Full precision |
| `model_int8.onnx` + `model_int8.onnx_data` | INT8 | ~2.5 GB | Weight-only INT8 (per-tensor symmetric) |
| `model_int4_full.onnx` | INT4 | ~1.3 GB | MatMulNBits INT4, block_size=32 |

The INT8 model uses a custom streaming quantizer that never loads the full 6.4 GB FP32 model into RAM at once. The INT4 model uses ORT's `MatMulNBitsQuantizer`.
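
Per-tensor symmetric weight-only quantization can be sketched in a few lines of NumPy. This illustrates the scheme only, not the actual streaming quantizer used for the export:

```python
import numpy as np

def quantize_symmetric_per_tensor(w):
    """Weight-only INT8: one scale for the whole tensor, zero-point fixed at 0."""
    scale = float(np.abs(w).max()) / 127.0        # largest magnitude maps to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # What the runtime does before (or fused into) each MatMul
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize_symmetric_per_tensor(w)
max_err = float(np.abs(dequantize(q, scale) - w).max())
assert max_err <= scale / 2 + 1e-6                # rounding error is at most half a step
```

"Symmetric" means the zero-point is 0, so only the scale needs to be stored per tensor; "per-tensor" (as opposed to per-channel or per-block) is the coarsest granularity, which keeps metadata tiny at some cost in accuracy.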

## Export details

The `ZeRankScorer` wrapper bakes Yes-token logit extraction into the graph:

1. Runs the Qwen3 transformer body → `[batch, seq, hidden]`
2. Gathers at the last real-token position (using `attention_mask.sum - 1`)
3. Applies `lm_head` and slices the Yes-token (id=`9454`) → `[batch, 1]`

Output: `logits [batch, 1]` — raw Yes-token logit, higher = more relevant. Compatible with fastembed's standard reranker interface.
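
The three steps above can be sketched in NumPy, with random weights standing in for the transformer body and `lm_head` (shapes and the Yes-token id follow the description; `vocab` and `hidden` sizes here are arbitrary):

```python
import numpy as np

batch, seq, hidden, vocab = 2, 8, 16, 10000
YES_TOKEN_ID = 9454

# 1. Transformer body output (random stand-in): [batch, seq, hidden]
hidden_states = np.random.randn(batch, seq, hidden).astype(np.float32)

# Mask is 1 for real tokens, 0 for right-padding
attention_mask = np.array([[1] * 8, [1] * 5 + [0] * 3], dtype=np.int64)

# 2. Gather the hidden state at the last *real* token of each sequence
last_pos = attention_mask.sum(axis=1) - 1                     # [batch]
pooled = hidden_states[np.arange(batch), last_pos]            # [batch, hidden]

# 3. Apply lm_head and slice out the Yes-token logit
lm_head = np.random.randn(hidden, vocab).astype(np.float32)
logits = (pooled @ lm_head)[:, YES_TOKEN_ID:YES_TOKEN_ID + 1] # [batch, 1]
assert logits.shape == (batch, 1)
```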

## Usage with fastembed-rs

```rust
use fastembed::{RerankInitOptions, RerankerModel, TextRerank};

let mut reranker = TextRerank::try_new(
    RerankInitOptions::new(RerankerModel::ZerankSmallInt8)
).unwrap();

let results = reranker.rerank(
    "what is a panda?",
    vec!["A panda is a bear...", "The sky is blue..."],
    true,
    Some(1), // batch_size=1
).unwrap();
```

> **Note:** Use `batch_size=Some(1)` — the causal attention mask was traced with a static batch dimension.

## Usage with ONNX Runtime (Python)

```python
import onnxruntime as ort
import numpy as np
from transformers import AutoTokenizer

sess = ort.InferenceSession("model_int8.onnx", providers=["CPUExecutionProvider"])
tok = AutoTokenizer.from_pretrained("cstr/zerank-1-small-ONNX")

query, doc = "what is a panda?", "A panda is a large black-and-white bear."
enc = tok(query, doc, return_tensors="np", truncation=True, max_length=512)
logit = sess.run(["logits"], {
    "input_ids": enc["input_ids"].astype(np.int64),
    "attention_mask": enc["attention_mask"].astype(np.int64),
})[0]
score = float(logit[0][0])
print(f"Score: {score:.3f}")
```
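
The model outputs a raw logit, which is all that is needed for ranking. If a bounded score is wanted for thresholding, a sigmoid is a common optional post-processing step; it is monotonic, so ranking order is unchanged (this is not part of the exported graph):

```python
import numpy as np

def relevance_score(logit):
    """Map a raw Yes-token logit to (0, 1) via the sigmoid."""
    return float(1.0 / (1.0 + np.exp(-logit)))

print(relevance_score(2.3))   # ~0.91, likely relevant
print(relevance_score(-1.8))  # ~0.14, likely irrelevant
```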

## Original model

See [zeroentropy/zerank-1-small](https://huggingface.co/zeroentropy/zerank-1-small) for full model details, evaluations, and license (Apache-2.0).

| Task | cohere-rerank-v3.5 | Salesforce/Llama-rank-v1 | **zerank-1-small** | zerank-1 |
|------|--------------------|--------------------------|--------------------|----------|
| Code | 0.724 | 0.694 | **0.730** | 0.754 |
| Finance | 0.824 | 0.828 | **0.861** | 0.894 |
| Legal | 0.804 | 0.767 | **0.817** | 0.821 |
| Medical | 0.750 | 0.719 | **0.773** | 0.796 |
| STEM | 0.510 | 0.595 | **0.680** | 0.694 |