# Qwen3-Reranker-4B – GGUF (llama.cpp)

Working GGUF of Qwen/Qwen3-Reranker-4B for llama.cpp. Converted 2025-03-09 with the official `convert_hf_to_gguf.py`.
## Quantization quality comparison

Benchmarked on MTEB AskUbuntuDupQuestions (361 queries) via `llama-server` `/v1/rerank` on an RTX 3090. All quants were produced from the same F16 source with `llama-quantize`.
| Quant | Size | NDCG@10 | MAP@10 | MRR@10 | Δ NDCG@10 |
|---|---|---|---|---|---|
| F16 | 7.50 GB | 0.7003 | 0.5530 | 0.7711 | baseline |
| Q8_0 | 3.99 GB | 0.6985 | 0.5514 | 0.7670 | -0.3% |
| Q6_K | 3.08 GB | 0.7016 | 0.5548 | 0.7722 | +0.2% |
| Q5_K_M | 2.69 GB | 0.7009 | 0.5517 | 0.7699 | +0.1% |
| Q5_0 | 2.63 GB | 0.6995 | 0.5532 | 0.7676 | -0.1% |
| Q4_K_M | 2.33 GB | 0.7058 | 0.5596 | 0.7746 | +0.8% |
| Q4_0 | 2.21 GB | 0.6930 | 0.5426 | 0.7623 | -1.1% |
| Q3_K_M | 1.93 GB | 0.7040 | 0.5555 | 0.7828 | +0.5% |
| Q2_K | 1.55 GB | 0.6691 | 0.5079 | 0.7401 | -4.5% |
**Takeaway:** every quant from Q8_0 down to Q3_K_M scores within ±1% of F16, so pick based on your VRAM budget. Q4_K_M (2.33 GB) is the sweet spot: 3.2x smaller than F16 with no measurable quality loss. Avoid Q2_K; it is the only quant with real degradation.
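The Δ column is simply the relative change versus the F16 baseline; a quick sketch with values taken from the table above:

```python
# Derive the Δ NDCG@10 column: relative change versus the F16 baseline.
baseline = 0.7003  # F16 NDCG@10 from the table

ndcg = {"Q8_0": 0.6985, "Q4_K_M": 0.7058, "Q2_K": 0.6691}
for quant, score in ndcg.items():
    delta_pct = (score - baseline) / baseline * 100
    print(f"{quant}: {delta_pct:+.1f}%")  # Q8_0 -0.3%, Q4_K_M +0.8%, Q2_K -4.5%
```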
## Does it work?

Yes. Most community GGUFs of Qwen3-Reranker produce garbage scores (a constant 4.5e-23) because they are missing the reranker-specific tensors; see llama.cpp issue #16407. This one works:

```
Doc 0 (relevant):   relevance_score = 0.999966
Doc 1 (irrelevant): relevance_score = 0.000069
```
## Quick start

```shell
llama-server -m Qwen3-Reranker-4B-f16.gguf \
    --reranking --pooling rank --embedding --port 8081

curl http://localhost:8081/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "query": "employment termination notice period",
    "documents": [
      "The Labour Code requires 30 calendar days written notice.",
      "Corporate tax rates for small enterprises."
    ]
  }'
```
Use `/v1/rerank`, not `/v1/embeddings`. The embeddings endpoint returns zeros for reranker models.
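If you would rather call the endpoint from code, here is a minimal stdlib-only Python client sketch; it assumes the `llama-server` instance from the quick start is listening on localhost:8081.

```python
import json
import urllib.request

# Same request as the curl example above, sent from Python's standard library.
payload = {
    "query": "employment termination notice period",
    "documents": [
        "The Labour Code requires 30 calendar days written notice.",
        "Corporate tax rates for small enterprises.",
    ],
}

def rerank(payload, url="http://localhost:8081/v1/rerank"):
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Each entry in the response's "results" list carries an "index" and a
# "relevance_score"; sort descending by score to rank the documents:
#   ranked = sorted(rerank(payload)["results"],
#                   key=lambda r: r["relevance_score"], reverse=True)
```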
## What's different about this GGUF?

The official `convert_hf_to_gguf.py` detects Qwen3-Reranker and does things naive converters skip:

- Extracts `cls.output.weight` (the yes/no classifier) from `lm_head`
- Sets the `pooling_type = RANK` metadata
- Bakes in the rerank chat template
- Sets `classifier.output_labels = ["yes", "no"]`
Without these, `llama-server` has nothing to compute scores from.
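To see why the yes/no classifier matters, here is an illustrative sketch (not llama.cpp internals): with `classifier.output_labels = ["yes", "no"]`, a two-label head emits two logits per query-document pair, and the relevance score behaves like the softmax probability of "yes".

```python
import math

def relevance(logit_yes: float, logit_no: float) -> float:
    """Softmax probability of the "yes" label over two classifier logits."""
    m = max(logit_yes, logit_no)          # subtract max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)

# A large yes-vs-no margin saturates the score toward 1.0 or 0.0, matching
# the near-1 / near-0 scores shown in the "Does it work?" section.
```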
## Known broken GGUFs

- DevQuasar/Qwen.Qwen3-Reranker-4B-GGUF – confirmed broken with llama.cpp
## models.ini example

```ini
[Qwen3-Reranker-4B-f16]
model = /path/to/Qwen3-Reranker-4B-f16.gguf
reranking = true
pooling = rank
embedding = true
ctx-size = 32768
```
For a full multi-model setup guide (embedding + reranking + chat on one server), see the llama-server Qwen3 guide.
## Convert it yourself

```shell
pip install huggingface_hub gguf torch safetensors sentencepiece

python -c "from huggingface_hub import snapshot_download; snapshot_download('Qwen/Qwen3-Reranker-4B', local_dir='Qwen3-Reranker-4B-src')"

python convert_hf_to_gguf.py --outtype f16 \
    --outfile Qwen3-Reranker-4B-f16.gguf Qwen3-Reranker-4B-src/
```
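To reproduce the quantized variants, run `llama-quantize` (it ships with llama.cpp) on the F16 output; the loop and output filenames below are just a suggested convention matching the table above.

```shell
# Quantize the F16 GGUF into each variant from the benchmark table.
for QUANT in Q8_0 Q6_K Q5_K_M Q5_0 Q4_K_M Q4_0 Q3_K_M Q2_K; do
  llama-quantize Qwen3-Reranker-4B-f16.gguf \
      "Qwen3-Reranker-4B-${QUANT}.gguf" "${QUANT}"
done
```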
## License

Apache 2.0, same as the original model.