cstr committed · Commit b78d236 · verified · 1 Parent(s): 9f03a15

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +87 -37

README.md CHANGED
@@ -25,17 +25,61 @@ ONNX export of [zeroentropy/zerank-1-small](https://huggingface.co/zeroentropy/z
  | `model_int8.onnx` + `model_int8.onnx_data` | INT8 | ~2.5 GB | Weight-only INT8 (per-tensor symmetric) |
  | `model_int4_full.onnx` | INT4 | ~1.3 GB | MatMulNBits INT4, block_size=32 |

- The INT8 model uses a custom streaming quantizer (never loads the full 6.4 GB FP32 model into RAM). The INT4 model uses ORT's `MatMulNBitsQuantizer`.

- ## Export details

- The `ZeRankScorer` wrapper bakes Yes-token logit extraction into the graph:

- 1. Runs the Qwen3 transformer body → `[batch, seq, hidden]`
- 2. Gathers at the last real-token position (using `attention_mask.sum - 1`)
- 3. Applies `lm_head` and slices the Yes-token (id=`9454`) → `[batch, 1]`

- Output: `logits [batch, 1]` raw Yes-token logit, higher = more relevant. Compatible with fastembed's standard reranker interface.

  ## Usage with fastembed-rs
 
@@ -46,44 +90,50 @@ let mut reranker = TextRerank::try_new(
  RerankInitOptions::new(RerankerModel::ZerankSmallInt8)
  ).unwrap();

  let results = reranker.rerank(
- "what is a panda?",
- vec!["A panda is a bear...", "The sky is blue..."],
  true,
- Some(1), // batch_size=1
  ).unwrap();
  ```

- > **Note:** Use `batch_size=Some(1)` — the causal attention mask was traced with a static batch dimension.

- ## Usage with ONNX Runtime (Python)

- ```python
- import onnxruntime as ort
- import numpy as np
- from transformers import AutoTokenizer
-
- sess = ort.InferenceSession("model_int8.onnx", providers=["CPUExecutionProvider"])
- tok = AutoTokenizer.from_pretrained("cstr/zerank-1-small-ONNX")
-
- query, doc = "what is a panda?", "A panda is a large black-and-white bear."
- enc = tok(query, doc, return_tensors="np", truncation=True, max_length=512)
- logit = sess.run(["logits"], {
-     "input_ids": enc["input_ids"].astype(np.int64),
-     "attention_mask": enc["attention_mask"].astype(np.int64),
- })[0]
- score = float(logit[0][0])
- print(f"Score: {score:.3f}")
- ```

- ## Original model

- See [zeroentropy/zerank-1-small](https://huggingface.co/zeroentropy/zerank-1-small) for full model details, evaluations, and license (Apache-2.0).

- | Task | cohere-rerank-v3.5 | Salesforce/Llama-rank-v1 | **zerank-1-small** | zerank-1 |
- |------|--------------------|--------------------------|----------------|----------|
- | Code | 0.724 | 0.694 | **0.730** | 0.754 |
- | Finance | 0.824 | 0.828 | **0.861** | 0.894 |
- | Legal | 0.804 | 0.767 | **0.817** | 0.821 |
- | Medical | 0.750 | 0.719 | **0.773** | 0.796 |
- | STEM | 0.510 | 0.595 | **0.680** | 0.694 |
 
  | `model_int8.onnx` + `model_int8.onnx_data` | INT8 | ~2.5 GB | Weight-only INT8 (per-tensor symmetric) |
  | `model_int4_full.onnx` | INT4 | ~1.3 GB | MatMulNBits INT4, block_size=32 |

+ Conversion scripts: `export_zerank.py` (FP16 export), `stream_int8.py` (INT8 quantization).
+
+ ## ⚠️ Important: chat template required
+
+ This model is a Qwen3-based causal LM that scores (query, document) relevance by extracting the **"Yes" token logit** at the last position. It requires a specific prompt format — plain pair tokenization produces meaningless scores.
+
+ **Always format inputs as:**
+ ```
+ <|im_start|>user
+ Query: {query}
+ Document: {document}
+ Relevant:<|im_end|>
+ <|im_start|>assistant
+ ```

+ ## Usage with ONNX Runtime (Python)
+
+ ```python
+ import onnxruntime as ort
+ import numpy as np
+ from transformers import AutoTokenizer
+
+ MODEL_PATH = "model_int8.onnx"  # or model.onnx, model_int4_full.onnx
+ TEMPLATE = "<|im_start|>user\nQuery: {query}\nDocument: {doc}\nRelevant:<|im_end|>\n<|im_start|>assistant\n"
+
+ sess = ort.InferenceSession(MODEL_PATH, providers=["CPUExecutionProvider"])
+ tok = AutoTokenizer.from_pretrained("cstr/zerank-1-small-ONNX")
+
+ def rerank(query: str, documents: list[str]) -> list[float]:
+     scores = []
+     for doc in documents:
+         text = TEMPLATE.format(query=query, doc=doc)
+         enc = tok(text, return_tensors="np", truncation=True, max_length=512)
+         logit = sess.run(["logits"], {
+             "input_ids": enc["input_ids"].astype(np.int64),
+             "attention_mask": enc["attention_mask"].astype(np.int64),
+         })[0]
+         scores.append(float(logit[0, 0]))
+     return scores
+
+ query = "What is a panda?"
+ docs = [
+     "The giant panda is a bear species endemic to China.",
+     "The sky is blue and the grass is green.",
+     "Pandas are mammals in the family Ursidae.",
+ ]
+ scores = rerank(query, docs)
+ for s, d in sorted(zip(scores, docs), reverse=True):
+     print(f"[{s:.1f}] {d}")
+ # [6.8] The giant panda is a bear species endemic to China.
+ # [2.1] Pandas are mammals in the family Ursidae.
+ # [-5.8] The sky is blue and the grass is green.
+ ```

+ > **Note:** Current export uses `batch_size=1` (causal mask is static). Process documents one at a time as shown above.

  ## Usage with fastembed-rs
 
  RerankInitOptions::new(RerankerModel::ZerankSmallInt8)
  ).unwrap();

+ // batch_size=1: chat template is applied automatically per document
  let results = reranker.rerank(
+     "What is a panda?",
+     vec![
+         "The giant panda is a bear species endemic to China.",
+         "The sky is blue.",
+         "Pandas are mammals in the family Ursidae.",
+     ],
      true,
+     Some(1),
  ).unwrap();
+
+ for r in &results {
+     println!("[{:.3}] {}", r.score, r.document.as_ref().unwrap());
+ }
  ```

+ ## Export details
+
+ `export_zerank.py` wraps Qwen3ForCausalLM in a `ZeRankScorer` that:
+
+ 1. Runs the transformer body → `hidden [batch, seq, hidden]`
+ 2. Gathers the hidden state at the last real-token position (`attention_mask.sum - 1`)
+ 3. Applies `lm_head`, slices the **"Yes" token** (id `9454`) → `[batch, 1]`
+
+ Output: `logits [batch, 1]` — raw Yes-token logit (higher = more relevant). FP16 weights, opset 18.
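The gather-and-slice logic of steps 2–3 can be sketched in NumPy (a minimal illustration with toy shapes and random stand-in weights, not the exported graph itself; `yes_logit`, `hidden`, and `lm_head_w` are names invented for this sketch):

```python
import numpy as np

YES_TOKEN_ID = 9454  # "Yes" token id from the export

def yes_logit(hidden, attention_mask, lm_head_w):
    """hidden: [batch, seq, h], attention_mask: [batch, seq], lm_head_w: [vocab, h]."""
    # Step 2: index of the last real (non-padding) token in each sequence
    last = attention_mask.sum(axis=1) - 1                    # [batch]
    gathered = hidden[np.arange(hidden.shape[0]), last]      # [batch, h]
    # Step 3: project to vocab logits, keep only the "Yes" column
    logits = gathered @ lm_head_w.T                          # [batch, vocab]
    return logits[:, YES_TOKEN_ID : YES_TOKEN_ID + 1]        # [batch, 1]

# Toy example: batch=1, seq=4 (last position is padding), h=8, vocab=9500
rng = np.random.default_rng(0)
hidden = rng.standard_normal((1, 4, 8)).astype(np.float32)
mask = np.array([[1, 1, 1, 0]], dtype=np.int64)
w = rng.standard_normal((9500, 8)).astype(np.float32)
print(yes_logit(hidden, mask, w).shape)  # (1, 1)
```

Gathering at `attention_mask.sum - 1` rather than `seq - 1` is what makes the export robust to right-padded batches.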
+
+ `stream_int8.py` performs fully streaming weight-only INT8 quantization:
+
+ - Never loads the full 6.4 GB FP32 model into RAM (peak ~1.5 GB)
+ - Symmetric per-tensor quantization: `scale = max(|w|) / 127`
+ - Adds `DequantizeLinear → MatMul` nodes for all MatMul B-weights
+ - Non-MatMul tensors (embeddings, LayerNorm) kept as FP32
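The per-tensor scheme in the bullets above amounts to the following round trip (an illustrative NumPy sketch only, not the actual `stream_int8.py` code):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Weight-only symmetric per-tensor INT8: scale = max(|w|) / 127."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, np.float32(scale)

def dequantize(q: np.ndarray, scale: np.float32) -> np.ndarray:
    # Mirrors the DequantizeLinear node inserted before each MatMul
    return q.astype(np.float32) * scale

w = np.random.default_rng(1).standard_normal((256, 64)).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()
print(f"max abs error: {err:.4f} (<= scale/2 = {scale / 2:.4f})")
```

Because the scale is shared across the whole tensor, round-trip error is bounded by `scale / 2`; the largest-magnitude weight sets the scale for everything else, which is why this scheme only targets MatMul B-weights and leaves sensitive tensors in FP32.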
+
+ ## Benchmarks (from original model card)
+
+ NDCG@10 with `text-embedding-3-small` as initial retriever (Top 100 candidates):
+
+ | Task | Embedding only | cohere-rerank-v3.5 | Llama-rank-v1 | **zerank-1-small** | zerank-1 |
+ |------|---------------|-------------------|--------------|----------------|----------|
+ | Code | 0.678 | 0.724 | 0.694 | **0.730** | 0.754 |
+ | Finance | 0.839 | 0.824 | 0.828 | **0.861** | 0.894 |
+ | Legal | 0.703 | 0.804 | 0.767 | **0.817** | 0.821 |
+ | Medical | 0.619 | 0.750 | 0.719 | **0.773** | 0.796 |
+ | STEM | 0.401 | 0.510 | 0.595 | **0.680** | 0.694 |
+ | Conversational | 0.250 | 0.571 | 0.484 | **0.556** | 0.596 |
+
+ See [zeroentropy/zerank-1-small](https://huggingface.co/zeroentropy/zerank-1-small) for full details and Apache-2.0 license.