Qwen3-Reranker-4B · GGUF (llama.cpp)

Working GGUF of Qwen/Qwen3-Reranker-4B for llama.cpp. Converted 2025-03-09 with the official convert_hf_to_gguf.py.

Other sizes: 0.6B · 4B (this) · 8B

Quantization quality comparison (Qwen3-Reranker-4B)

Benchmarked on MTEB AskUbuntuDupQuestions (361 queries) via the llama-server /v1/rerank endpoint on an RTX 3090. All quants were produced from the same F16 source with llama-quantize.

| Quant  | Size    | NDCG@10 | MAP@10 | MRR@10 | Δ NDCG@10 |
|--------|---------|---------|--------|--------|-----------|
| F16    | 7.50 GB | 0.7003  | 0.5530 | 0.7711 | baseline  |
| Q8_0   | 3.99 GB | 0.6985  | 0.5514 | 0.7670 | -0.3%     |
| Q6_K   | 3.08 GB | 0.7016  | 0.5548 | 0.7722 | +0.2%     |
| Q5_K_M | 2.69 GB | 0.7009  | 0.5517 | 0.7699 | +0.1%     |
| Q5_0   | 2.63 GB | 0.6995  | 0.5532 | 0.7676 | -0.1%     |
| Q4_K_M | 2.33 GB | 0.7058  | 0.5596 | 0.7746 | +0.8%     |
| Q4_0   | 2.21 GB | 0.6930  | 0.5426 | 0.7623 | -1.1%     |
| Q3_K_M | 1.93 GB | 0.7040  | 0.5555 | 0.7828 | +0.5%     |
| Q2_K   | 1.55 GB | 0.6691  | 0.5079 | 0.7401 | -4.5%     |

Takeaway: every quant from Q8_0 down to Q3_K_M is within ±1% of F16, so pick based on your VRAM budget. Q4_K_M (2.33 GB) is the sweet spot: 3.2x smaller than F16 with no measurable quality loss. Avoid Q2_K; it's the only quant with real degradation.
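The Δ column follows directly from the NDCG@10 scores. A quick sanity check of the arithmetic (note that the table appears to round Q4_0 slightly differently):

```python
# Recompute the relative NDCG@10 deltas from the table above.
ndcg = {
    "F16": 0.7003, "Q8_0": 0.6985, "Q6_K": 0.7016, "Q5_K_M": 0.7009,
    "Q5_0": 0.6995, "Q4_K_M": 0.7058, "Q4_0": 0.6930, "Q3_K_M": 0.7040,
    "Q2_K": 0.6691,
}
baseline = ndcg["F16"]
delta = {q: round(100 * (s - baseline) / baseline, 1) for q, s in ndcg.items()}
print(delta["Q4_K_M"], delta["Q2_K"])  # 0.8 -4.5
```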

Does it work?

Yes. Most community GGUFs of Qwen3-Reranker produce garbage scores (4.5e-23) because they're missing reranker-specific tensors. See llama.cpp #16407. This one works:

```
Doc 0 (relevant):   relevance_score = 0.999966
Doc 1 (irrelevant): relevance_score = 0.000069
```

Quick start

```shell
llama-server -m Qwen3-Reranker-4B-f16.gguf \
  --reranking --pooling rank --embedding --port 8081

curl http://localhost:8081/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "query": "employment termination notice period",
    "documents": [
      "The Labour Code requires 30 calendar days written notice.",
      "Corporate tax rates for small enterprises."
    ]
  }'
```

Use /v1/rerank, not /v1/embeddings. The embeddings endpoint returns zeros for reranker models.
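On the client side, the response can be turned back into a ranked document list. A minimal helper sketch, assuming the Cohere-style response shape llama-server uses for /v1/rerank (a `results` list of `{index, relevance_score}` objects; verify against your server version):

```python
def sort_by_relevance(documents, rerank_response):
    """Return (score, document) pairs, best first, from a /v1/rerank response."""
    ranked = sorted(rerank_response["results"],
                    key=lambda r: r["relevance_score"], reverse=True)
    return [(r["relevance_score"], documents[r["index"]]) for r in ranked]

# Example using the scores shown above:
docs = [
    "The Labour Code requires 30 calendar days written notice.",
    "Corporate tax rates for small enterprises.",
]
response = {"results": [
    {"index": 0, "relevance_score": 0.999966},
    {"index": 1, "relevance_score": 0.000069},
]}
best_score, best_doc = sort_by_relevance(docs, response)[0]
print(best_doc)  # the Labour Code document ranks first
```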

What's different about this GGUF?

The official convert_hf_to_gguf.py detects Qwen3-Reranker and does things naive converters skip:

  • Extracts cls.output.weight (the yes/no classifier) from lm_head
  • Sets pooling_type = RANK metadata
  • Bakes in the rerank chat template
  • Sets classifier.output_labels = ["yes", "no"]

Without these, llama-server has nothing to compute scores from.
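You can check an existing GGUF for the pieces listed above by inspecting its tensor names and metadata keys (readable with the gguf Python package's GGUFReader). A heuristic sketch; the `.pooling_type` key naming follows the usual `{arch}.pooling_type` GGUF convention and is worth verifying against your file:

```python
def looks_like_working_reranker(tensor_names, metadata_keys):
    """Heuristic: does this GGUF carry the reranker-specific pieces?

    The two lists can be obtained with gguf.GGUFReader, e.g.:
        reader = gguf.GGUFReader("model.gguf")
        tensor_names = [t.name for t in reader.tensors]
        metadata_keys = list(reader.fields)
    """
    has_classifier = "cls.output.weight" in tensor_names  # the yes/no head
    has_pooling = any(k.endswith(".pooling_type") for k in metadata_keys)
    return has_classifier and has_pooling

# A GGUF missing cls.output.weight will produce garbage scores:
print(looks_like_working_reranker(["token_embd.weight"], ["qwen3.pooling_type"]))  # False
```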

Known broken GGUFs

models.ini example

```ini
[Qwen3-Reranker-4B-f16]
model = /path/to/Qwen3-Reranker-4B-f16.gguf
reranking = true
pooling = rank
embedding = true
ctx-size = 32768
```

For a full multi-model setup guide (embedding + reranking + chat on one server), see the llama-server Qwen3 guide.

Convert it yourself

```shell
pip install huggingface_hub gguf torch safetensors sentencepiece
python -c "from huggingface_hub import snapshot_download; snapshot_download('Qwen/Qwen3-Reranker-4B', local_dir='Qwen3-Reranker-4B-src')"
python convert_hf_to_gguf.py --outtype f16 --outfile Qwen3-Reranker-4B-f16.gguf Qwen3-Reranker-4B-src/
```
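From the F16 file you can then produce the smaller quants yourself. A sketch, assuming a llama.cpp build with llama-quantize on your PATH:

```shell
# Quantize the F16 GGUF to Q4_K_M (the sweet spot in the table above)
llama-quantize Qwen3-Reranker-4B-f16.gguf Qwen3-Reranker-4B-Q4_K_M.gguf Q4_K_M
```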

License

Apache 2.0, same as the original model.
