---
base_model: Qwen/Qwen3-Reranker-4B
library_name: gguf
license: apache-2.0
pipeline_tag: text-ranking
tags:
  - reranker
  - gguf
  - llama.cpp
  - qwen3
  - text-ranking
---

# Qwen3-Reranker-4B — GGUF (llama.cpp)

A working GGUF conversion of Qwen/Qwen3-Reranker-4B for llama.cpp, converted 2025-03-09 with the official `convert_hf_to_gguf.py`.

Other sizes: 0.6B · 4B (this) · 8B

## Quantization quality comparison (Qwen3-Reranker-4B)

Benchmarked on MTEB AskUbuntuDupQuestions (361 queries) via the llama-server `/v1/rerank` endpoint on an RTX 3090. All quants were produced from the same F16 source using `llama-quantize`.

| Quant  | Size    | NDCG@10 | MAP@10 | MRR@10 | Δ NDCG@10 |
|--------|---------|---------|--------|--------|-----------|
| F16    | 7.50 GB | 0.7003  | 0.5530 | 0.7711 | baseline  |
| Q8_0   | 3.99 GB | 0.6985  | 0.5514 | 0.7670 | -0.3%     |
| Q6_K   | 3.08 GB | 0.7016  | 0.5548 | 0.7722 | +0.2%     |
| Q5_K_M | 2.69 GB | 0.7009  | 0.5517 | 0.7699 | +0.1%     |
| Q5_0   | 2.63 GB | 0.6995  | 0.5532 | 0.7676 | -0.1%     |
| Q4_K_M | 2.33 GB | 0.7058  | 0.5596 | 0.7746 | +0.8%     |
| Q4_0   | 2.21 GB | 0.6930  | 0.5426 | 0.7623 | -1.1%     |
| Q3_K_M | 1.93 GB | 0.7040  | 0.5555 | 0.7828 | +0.5%     |
| Q2_K   | 1.55 GB | 0.6691  | 0.5079 | 0.7401 | -4.5%     |

Takeaway: every quant from Q8_0 down to Q3_K_M stays within roughly ±1% of F16 (Q4_0, at -1.1%, is the only one that slips just past), so pick based on your VRAM budget. Q4_K_M (2.33 GB) is the sweet spot: 3.2x smaller than F16 with no quality loss on this benchmark. Avoid Q2_K — it's the only quant with real degradation (-4.5% NDCG@10).
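The Δ column is just the relative change in NDCG@10 against the F16 baseline; a quick sanity check of two rows from the table:

```python
def delta_pct(quant_ndcg: float, baseline_ndcg: float) -> float:
    """Relative NDCG@10 change vs. the F16 baseline, in percent."""
    return (quant_ndcg - baseline_ndcg) / baseline_ndcg * 100

f16 = 0.7003
print(f"Q4_K_M: {delta_pct(0.7058, f16):+.1f}%")  # matches the +0.8% row
print(f"Q2_K:   {delta_pct(0.6691, f16):+.1f}%")  # matches the -4.5% row
```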

## Does it work?

Yes. Most community GGUFs of Qwen3-Reranker produce garbage scores (on the order of 4.5e-23) because they're missing the reranker-specific tensors. See llama.cpp issue #16407. This one works:

```
Doc 0 (relevant):   relevance_score = 0.999966
Doc 1 (irrelevant): relevance_score = 0.000069
```

## Quick start

```shell
llama-server -m Qwen3-Reranker-4B-f16.gguf --reranking --pooling rank --embedding --port 8081

curl http://localhost:8081/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "query": "employment termination notice period",
    "documents": [
      "The Labour Code requires 30 calendar days written notice.",
      "Corporate tax rates for small enterprises."
    ]
  }'
```

Use `/v1/rerank`, not `/v1/embeddings`. The embeddings endpoint returns zeros for reranker models.
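The response pairs each document index with a score; a minimal client-side sketch of sorting documents by score, using an illustrative response body (shaped like the `results` list of `{index, relevance_score}` objects llama-server returns) instead of a live server:

```python
import json

# Illustrative /v1/rerank response; a real one comes from the curl call above.
raw = '''{
  "results": [
    {"index": 0, "relevance_score": 0.999966},
    {"index": 1, "relevance_score": 0.000069}
  ]
}'''

documents = [
    "The Labour Code requires 30 calendar days written notice.",
    "Corporate tax rates for small enterprises.",
]

def rank_documents(response_json: str, docs: list[str]) -> list[tuple[str, float]]:
    """Pair each document with its relevance score and sort best-first."""
    results = json.loads(response_json)["results"]
    scored = [(docs[r["index"]], r["relevance_score"]) for r in results]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

for doc, score in rank_documents(raw, documents):
    print(f"{score:.6f}  {doc}")
```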

## What's different about this GGUF?

The official `convert_hf_to_gguf.py` detects Qwen3-Reranker and does things naive converters skip:

- Extracts `cls.output.weight` (the yes/no classifier) from `lm_head`
- Sets `pooling_type = RANK` metadata
- Bakes in the rerank chat template
- Sets `classifier.output_labels = ["yes", "no"]`

Without these, llama-server has nothing to compute scores from.
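As a checklist, the items above can be sketched as a tiny presence test. The exact GGUF key spellings below (arch-prefixed `pooling_type`, the chat-template key, the classifier-labels key) are assumptions, not verified against the spec — compare them with a metadata dump of your own file:

```python
# Reranker-specific pieces from the list above. Key spellings are assumed;
# verify against an actual metadata dump (e.g. gguf-py's reader).
REQUIRED_TENSORS = {"cls.output.weight"}
REQUIRED_KEYS = {
    "qwen3.pooling_type",              # should encode RANK pooling
    "tokenizer.chat_template",         # the baked-in rerank template
    "qwen3.classifier.output_labels",  # should be ["yes", "no"]
}

def looks_like_working_reranker(tensor_names: set[str], metadata_keys: set[str]) -> bool:
    """True only if every reranker-specific tensor and metadata key is present."""
    return REQUIRED_TENSORS <= tensor_names and REQUIRED_KEYS <= metadata_keys
```

A GGUF that fails this kind of check is the likely source of the 4.5e-23 garbage scores described above.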

## Known broken GGUFs

## models.ini example

```ini
[Qwen3-Reranker-4B-f16]
model = /path/to/Qwen3-Reranker-4B-f16.gguf
reranking = true
pooling = rank
embedding = true
ctx-size = 32768
```

For a full multi-model setup guide (embedding + reranking + chat on one server), see the llama-server Qwen3 guide.

## Convert it yourself

```shell
pip install huggingface_hub gguf torch safetensors sentencepiece
python -c "from huggingface_hub import snapshot_download; snapshot_download('Qwen/Qwen3-Reranker-4B', local_dir='Qwen3-Reranker-4B-src')"
python convert_hf_to_gguf.py --outtype f16 --outfile Qwen3-Reranker-4B-f16.gguf Qwen3-Reranker-4B-src/
```
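To reproduce the quantized variants from the table, run `llama-quantize` (built alongside llama-server in llama.cpp) on the F16 output; the output filename here is just a convention:

```shell
# Produce the Q4_K_M variant benchmarked above from the F16 conversion.
./llama-quantize Qwen3-Reranker-4B-f16.gguf Qwen3-Reranker-4B-Q4_K_M.gguf Q4_K_M
```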

## License

Apache 2.0 — same as the original model.