---
base_model: Qwen/Qwen3-Reranker-4B
library_name: gguf
license: apache-2.0
pipeline_tag: text-ranking
tags:
  - reranker
  - gguf
  - llama.cpp
  - qwen3
  - text-ranking
---

# Qwen3-Reranker-4B — GGUF (llama.cpp)

A working GGUF conversion of Qwen/Qwen3-Reranker-4B for llama.cpp, converted 2025-03-09 with the official `convert_hf_to_gguf.py`.

Other sizes: 0.6B · 4B (this) · 8B

## Quantization quality comparison (Qwen3-Reranker-4B)

Benchmarked on MTEB AskUbuntuDupQuestions (361 queries) via the llama-server `/v1/rerank` endpoint on an RTX 3090. All quants were produced from the same F16 source using `llama-quantize`.

| Quant  | Size    | NDCG@10 | MAP@10 | MRR@10 | Δ NDCG@10 |
|--------|---------|---------|--------|--------|-----------|
| F16    | 7.50 GB | 0.7003  | 0.5530 | 0.7711 | baseline  |
| Q8_0   | 3.99 GB | 0.6985  | 0.5514 | 0.7670 | -0.3%     |
| Q6_K   | 3.08 GB | 0.7016  | 0.5548 | 0.7722 | +0.2%     |
| Q5_K_M | 2.69 GB | 0.7009  | 0.5517 | 0.7699 | +0.1%     |
| Q5_0   | 2.63 GB | 0.6995  | 0.5532 | 0.7676 | -0.1%     |
| Q4_K_M | 2.33 GB | 0.7058  | 0.5596 | 0.7746 | +0.8%     |
| Q4_0   | 2.21 GB | 0.6930  | 0.5426 | 0.7623 | -1.1%     |
| Q3_K_M | 1.93 GB | 0.7040  | 0.5555 | 0.7828 | +0.5%     |
| Q2_K   | 1.55 GB | 0.6691  | 0.5079 | 0.7401 | -4.5%     |

Takeaway: every quant from Q8_0 down to Q3_K_M stays within roughly ±1% of F16 (Q4_0, at -1.1%, is the only one that slips just past), so pick based on your VRAM budget. Q4_K_M (2.33 GB) is the sweet spot: 3.2x smaller than F16 with no quality loss on this benchmark. Avoid Q2_K — it's the only quant with real degradation (-4.5% NDCG@10).
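The Δ column is just the relative change in NDCG@10 against the F16 baseline; a quick sanity check of two rows from the table:

```python
def delta_pct(quant_ndcg: float, baseline_ndcg: float) -> float:
    """Relative NDCG@10 change vs. the F16 baseline, in percent."""
    return (quant_ndcg - baseline_ndcg) / baseline_ndcg * 100

f16 = 0.7003
print(f"Q4_K_M: {delta_pct(0.7058, f16):+.1f}%")  # matches the +0.8% row
print(f"Q2_K:   {delta_pct(0.6691, f16):+.1f}%")  # matches the -4.5% row
```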

## Does it work?

Yes. Most community GGUFs of Qwen3-Reranker produce garbage scores (on the order of 4.5e-23) because they're missing the reranker-specific tensors. See llama.cpp issue #16407. This one works:

```
Doc 0 (relevant):   relevance_score = 0.999966
Doc 1 (irrelevant): relevance_score = 0.000069
```

## Quick start

```shell
llama-server -m Qwen3-Reranker-4B-f16.gguf --reranking --pooling rank --embedding --port 8081

curl http://localhost:8081/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "query": "employment termination notice period",
    "documents": [
      "The Labour Code requires 30 calendar days written notice.",
      "Corporate tax rates for small enterprises."
    ]
  }'
```

Use `/v1/rerank`, not `/v1/embeddings`. The embeddings endpoint returns zeros for reranker models.
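The response pairs each document index with a score; a minimal client-side sketch of sorting documents by score, using an illustrative response body (shaped like the `results` list of `{index, relevance_score}` objects llama-server returns) instead of a live server:

```python
import json

# Illustrative /v1/rerank response; a real one comes from the curl call above.
raw = '''{
  "results": [
    {"index": 0, "relevance_score": 0.999966},
    {"index": 1, "relevance_score": 0.000069}
  ]
}'''

documents = [
    "The Labour Code requires 30 calendar days written notice.",
    "Corporate tax rates for small enterprises.",
]

def rank_documents(response_json: str, docs: list[str]) -> list[tuple[str, float]]:
    """Pair each document with its relevance score and sort best-first."""
    results = json.loads(response_json)["results"]
    scored = [(docs[r["index"]], r["relevance_score"]) for r in results]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

for doc, score in rank_documents(raw, documents):
    print(f"{score:.6f}  {doc}")
```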

## What's different about this GGUF?

The official `convert_hf_to_gguf.py` detects Qwen3-Reranker and does things naive converters skip:

- Extracts `cls.output.weight` (the yes/no classifier) from `lm_head`
- Sets `pooling_type = RANK` metadata
- Bakes in the rerank chat template
- Sets `classifier.output_labels = ["yes", "no"]`

Without these, llama-server has nothing to compute scores from.
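As a checklist, the items above can be sketched as a tiny presence test. The exact GGUF key spellings below (arch-prefixed `pooling_type`, the chat-template key, the classifier-labels key) are assumptions, not verified against the spec — compare them with a metadata dump of your own file:

```python
# Reranker-specific pieces from the list above. Key spellings are assumed;
# verify against an actual metadata dump (e.g. gguf-py's reader).
REQUIRED_TENSORS = {"cls.output.weight"}
REQUIRED_KEYS = {
    "qwen3.pooling_type",              # should encode RANK pooling
    "tokenizer.chat_template",         # the baked-in rerank template
    "qwen3.classifier.output_labels",  # should be ["yes", "no"]
}

def looks_like_working_reranker(tensor_names: set[str], metadata_keys: set[str]) -> bool:
    """True only if every reranker-specific tensor and metadata key is present."""
    return REQUIRED_TENSORS <= tensor_names and REQUIRED_KEYS <= metadata_keys
```

A GGUF that fails this kind of check is the likely source of the 4.5e-23 garbage scores described above.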

## Known broken GGUFs

## models.ini example

```ini
[Qwen3-Reranker-4B-f16]
model = /path/to/Qwen3-Reranker-4B-f16.gguf
reranking = true
pooling = rank
embedding = true
ctx-size = 32768
```

For a full multi-model setup guide (embedding + reranking + chat on one server), see the llama-server Qwen3 guide.

## Convert it yourself

```shell
pip install huggingface_hub gguf torch safetensors sentencepiece
python -c "from huggingface_hub import snapshot_download; snapshot_download('Qwen/Qwen3-Reranker-4B', local_dir='Qwen3-Reranker-4B-src')"
python convert_hf_to_gguf.py --outtype f16 --outfile Qwen3-Reranker-4B-f16.gguf Qwen3-Reranker-4B-src/
```
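To reproduce the quantized variants from the table, run `llama-quantize` (built alongside llama-server in llama.cpp) on the F16 output; the output filename here is just a convention:

```shell
# Produce the Q4_K_M variant benchmarked above from the F16 conversion.
./llama-quantize Qwen3-Reranker-4B-f16.gguf Qwen3-Reranker-4B-Q4_K_M.gguf Q4_K_M
```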

## License

Apache 2.0 — same as the original model.