Update readme.md with benchmarks
README.md
> **Other sizes:** [0.6B](https://huggingface.co/Voodisss/Qwen3-Reranker-0.6B-GGUF-llama_cpp) · [4B (this)](https://huggingface.co/Voodisss/Qwen3-Reranker-4B-GGUF-llama_cpp) · [8B](https://huggingface.co/Voodisss/Qwen3-Reranker-8B-GGUF-llama_cpp)
## Quantization quality comparison (Qwen3-Reranker-4B)

Benchmarked on [MTEB AskUbuntuDupQuestions](https://huggingface.co/datasets/mteb/AskUbuntuDupQuestions) (361 queries) via llama-server's `/v1/rerank` endpoint on an RTX 3090. All quants were produced from the same F16 source with `llama-quantize`.

| Quant  | Size    | NDCG@10 | MAP@10 | MRR@10 | Δ NDCG@10 |
| ------ | ------- | ------- | ------ | ------ | --------- |
| F16    | 7.50 GB | 0.7003  | 0.5530 | 0.7711 | baseline  |
| Q8_0   | 3.99 GB | 0.6985  | 0.5514 | 0.7670 | -0.3%     |
| Q6_K   | 3.08 GB | 0.7016  | 0.5548 | 0.7722 | +0.2%     |
| Q5_K_M | 2.69 GB | 0.7009  | 0.5517 | 0.7699 | +0.1%     |
| Q5_0   | 2.63 GB | 0.6995  | 0.5532 | 0.7676 | -0.1%     |
| Q4_K_M | 2.33 GB | 0.7058  | 0.5596 | 0.7746 | +0.8%     |
| Q4_0   | 2.21 GB | 0.6930  | 0.5426 | 0.7623 | -1.1%     |
| Q3_K_M | 1.93 GB | 0.7040  | 0.5555 | 0.7828 | +0.5%     |
| Q2_K   | 1.55 GB | 0.6691  | 0.5079 | 0.7401 | **-4.5%** |

**Takeaway:** Every quant from Q8_0 down to Q3_K_M stays within roughly ±1% of F16, so pick based on your VRAM budget. Q4_K_M (2.33 GB) is the sweet spot: 3.2x smaller than F16 with no measurable quality loss. **Avoid Q2_K**: it's the only quant with real degradation.
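The Δ column is just each quant's NDCG@10 relative to the F16 baseline. A quick sanity check of the arithmetic (scores copied from the table; last-digit rounding may differ slightly from the published column):

```python
# NDCG@10 scores from the table above; F16 is the baseline.
F16_NDCG = 0.7003

ndcg = {
    "Q8_0": 0.6985,
    "Q6_K": 0.7016,
    "Q4_K_M": 0.7058,
    "Q2_K": 0.6691,
}

def delta_pct(score, baseline=F16_NDCG):
    """Relative change versus the baseline, in percent, one decimal."""
    return round(100.0 * (score - baseline) / baseline, 1)

for name, score in ndcg.items():
    print(f"{name}: {delta_pct(score):+.1f}%")
# Q8_0: -0.3%, Q6_K: +0.2%, Q4_K_M: +0.8%, Q2_K: -4.5%
```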

## Does it work?

Yes. Most community GGUFs of Qwen3-Reranker produce garbage scores (`4.5e-23`) because they're missing reranker-specific tensors. See [llama.cpp #16407](https://github.com/ggml-org/llama.cpp/issues/16407). This one works:
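A minimal client sketch for checking scores yourself. The port and the `--reranking` server flag are assumptions here; check your llama-server build's `--help` for the exact invocation (e.g. `llama-server -m Qwen3-Reranker-4B-Q4_K_M.gguf --reranking --port 8080`):

```python
import json
import urllib.request

RERANK_URL = "http://localhost:8080/v1/rerank"  # assumed port

def rerank(query, documents, url=RERANK_URL):
    """POST query + documents to the rerank endpoint, return parsed JSON."""
    body = json.dumps({"query": query, "documents": documents}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def top_matches(response, top_n=None):
    """Sort the response's `results` list by relevance_score, best first."""
    ranked = sorted(
        response["results"], key=lambda r: r["relevance_score"], reverse=True
    )
    return ranked[:top_n] if top_n is not None else ranked

# Usage (requires a running server):
#   resp = rerank("how to handle duplicate questions",
#                 ["Flag and merge the newer question.", "Reboot the machine."])
#   print(top_matches(resp, top_n=1))
```

A healthy GGUF returns well-separated scores for relevant versus irrelevant documents; the broken community quants return degenerate values like `4.5e-23` for every document.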