---
license: apache-2.0
language:
- en
base_model:
- zeroentropy/zerank-1-small
pipeline_tag: text-ranking
tags:
- reranking
- onnx
- quantized
- fastembed
library_name: fastembed
---

# zerank-1-small — ONNX Export

ONNX export of [zeroentropy/zerank-1-small](https://huggingface.co/zeroentropy/zerank-1-small), a 1.7B-parameter Qwen3-based reranker. Three quantization levels are included for CPU inference.

## Files

| File | Format | Size | Description |
|------|--------|------|-------------|
| `model.onnx` + `model.onnx_data` | FP16 | ~3.2 GB | Full precision |
| `model_int8.onnx` + `model_int8.onnx_data` | INT8 | ~2.5 GB | Weight-only INT8 (per-tensor symmetric) |
| `model_int4_full.onnx` | INT4 | ~1.3 GB | MatMulNBits INT4, block_size=32 |

The INT8 model uses a custom streaming quantizer that never loads the full 6.4 GB FP32 model into RAM at once. The INT4 model uses ORT's `MatMulNBitsQuantizer`.
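
Per-tensor symmetric weight-only quantization can be sketched in a few lines of NumPy. This illustrates the scheme only, not the actual streaming quantizer used for the export:

```python
import numpy as np

def quantize_symmetric_per_tensor(w):
    """Weight-only INT8: one scale for the whole tensor, zero-point fixed at 0."""
    scale = float(np.abs(w).max()) / 127.0        # largest magnitude maps to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # What the runtime does before (or fused into) each MatMul
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize_symmetric_per_tensor(w)
max_err = float(np.abs(dequantize(q, scale) - w).max())
assert max_err <= scale / 2 + 1e-6                # rounding error is at most half a step
```

"Symmetric" means the zero-point is 0, so only the scale needs to be stored per tensor; "per-tensor" (as opposed to per-channel or per-block) is the coarsest granularity, which keeps metadata tiny at some cost in accuracy.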

## Export details

The `ZeRankScorer` wrapper bakes Yes-token logit extraction into the graph:

1. Runs the Qwen3 transformer body → `[batch, seq, hidden]`
2. Gathers at the last real-token position (using `attention_mask.sum - 1`)
3. Applies `lm_head` and slices the Yes-token (id=`9454`) → `[batch, 1]`

Output: `logits [batch, 1]` — raw Yes-token logit, higher = more relevant. Compatible with fastembed's standard reranker interface.
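
The three steps above can be sketched in NumPy, with random weights standing in for the transformer body and `lm_head` (shapes and the Yes-token id follow the description; `vocab` and `hidden` sizes here are arbitrary):

```python
import numpy as np

batch, seq, hidden, vocab = 2, 8, 16, 10000
YES_TOKEN_ID = 9454

# 1. Transformer body output (random stand-in): [batch, seq, hidden]
hidden_states = np.random.randn(batch, seq, hidden).astype(np.float32)

# Mask is 1 for real tokens, 0 for right-padding
attention_mask = np.array([[1] * 8, [1] * 5 + [0] * 3], dtype=np.int64)

# 2. Gather the hidden state at the last *real* token of each sequence
last_pos = attention_mask.sum(axis=1) - 1                     # [batch]
pooled = hidden_states[np.arange(batch), last_pos]            # [batch, hidden]

# 3. Apply lm_head and slice out the Yes-token logit
lm_head = np.random.randn(hidden, vocab).astype(np.float32)
logits = (pooled @ lm_head)[:, YES_TOKEN_ID:YES_TOKEN_ID + 1] # [batch, 1]
assert logits.shape == (batch, 1)
```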

## Usage with fastembed-rs

```rust
use fastembed::{RerankInitOptions, RerankerModel, TextRerank};

let mut reranker = TextRerank::try_new(
    RerankInitOptions::new(RerankerModel::ZerankSmallInt8)
).unwrap();

let results = reranker.rerank(
    "what is a panda?",
    vec!["A panda is a bear...", "The sky is blue..."],
    true,
    Some(1), // batch_size=1
).unwrap();
```

> **Note:** Use `batch_size=Some(1)` — the causal attention mask was traced with a static batch dimension.

## Usage with ONNX Runtime (Python)

```python
import onnxruntime as ort
import numpy as np
from transformers import AutoTokenizer

sess = ort.InferenceSession("model_int8.onnx", providers=["CPUExecutionProvider"])
tok = AutoTokenizer.from_pretrained("cstr/zerank-1-small-ONNX")

query, doc = "what is a panda?", "A panda is a large black-and-white bear."
enc = tok(query, doc, return_tensors="np", truncation=True, max_length=512)
logit = sess.run(["logits"], {
    "input_ids": enc["input_ids"].astype(np.int64),
    "attention_mask": enc["attention_mask"].astype(np.int64),
})[0]
score = float(logit[0][0])
print(f"Score: {score:.3f}")
```
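
The model outputs a raw logit, which is all that is needed for ranking. If a bounded score is wanted for thresholding, a sigmoid is a common optional post-processing step; it is monotonic, so ranking order is unchanged (this is not part of the exported graph):

```python
import numpy as np

def relevance_score(logit):
    """Map a raw Yes-token logit to (0, 1) via the sigmoid."""
    return float(1.0 / (1.0 + np.exp(-logit)))

print(relevance_score(2.3))   # ~0.91, likely relevant
print(relevance_score(-1.8))  # ~0.14, likely irrelevant
```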

## Original model

See [zeroentropy/zerank-1-small](https://huggingface.co/zeroentropy/zerank-1-small) for full model details, evaluations, and license (Apache-2.0).

| Task | cohere-rerank-v3.5 | Salesforce/Llama-rank-v1 | **zerank-1-small** | zerank-1 |
|------|--------------------|--------------------------|--------------------|----------|
| Code | 0.724 | 0.694 | **0.730** | 0.754 |
| Finance | 0.824 | 0.828 | **0.861** | 0.894 |
| Legal | 0.804 | 0.767 | **0.817** | 0.821 |
| Medical | 0.750 | 0.719 | **0.773** | 0.796 |
| STEM | 0.510 | 0.595 | **0.680** | 0.694 |