Qwen3-Reranker-8B GGUF (llama.cpp)

A working GGUF conversion of Qwen/Qwen3-Reranker-8B for llama.cpp. Converted 2025-03-09 with the official convert_hf_to_gguf.py.

Other sizes: 0.6B · 4B · 8B (this)

Available files

File                         Quant  Size      Description
Qwen3-Reranker-8B-F16.gguf   F16    14.10 GB  Full precision, no quality loss
Qwen3-Reranker-8B-Q8_0.gguf  Q8_0   7.49 GB   8-bit quantized, half the size

Does it work?

Yes. Most community GGUFs of Qwen3-Reranker produce garbage scores (e.g. 4.5e-23 for every document) because they are missing reranker-specific tensors and metadata. See llama.cpp #16407. This one works:

Doc 0 (relevant):   relevance_score = 0.99XX
Doc 1 (irrelevant): relevance_score = 0.00XX
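Under the hood, a Qwen3-Reranker score is the probability mass on the "yes" label given the classifier's yes/no logits. A minimal sketch of that two-way softmax (an illustration of the scoring rule, not the llama.cpp implementation itself):

```python
import math

def relevance_from_logits(yes_logit: float, no_logit: float) -> float:
    """Two-way softmax: probability of "yes" given the yes/no classifier
    logits. This is (roughly) how the reranker head turns two logits into
    a relevance_score in (0, 1)."""
    m = max(yes_logit, no_logit)        # subtract the max for numerical stability
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)

# A strongly "yes" pair scores near 1, a strongly "no" pair near 0:
print(relevance_from_logits(8.0, -2.0))   # ~0.99995
print(relevance_from_logits(-6.0, 3.0))   # ~0.00012
```

This is why a broken GGUF without the classifier tensor can only emit degenerate values: there are no yes/no logits to normalize.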

Quick start

llama-server -m Qwen3-Reranker-8B-f16.gguf --reranking --pooling rank --embedding --port 8081
curl http://localhost:8081/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "query": "employment termination notice period",
    "documents": [
      "The Labour Code requires 30 calendar days written notice.",
      "Corporate tax rates for small enterprises."
    ]
  }'

Use /v1/rerank, not /v1/embeddings. The embeddings endpoint returns zeros for reranker models.
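The same call from Python, using only the standard library. This is a sketch that assumes the Jina-style response shape `{"results": [{"index": i, "relevance_score": s}, ...]}` that llama-server returns from `/v1/rerank`; the URL matches the server invocation above.

```python
import json
import urllib.request

RERANK_URL = "http://localhost:8081/v1/rerank"  # matches the llama-server command above

def rerank(query: str, documents: list[str], url: str = RERANK_URL) -> dict:
    """POST a rerank request to llama-server and return the parsed JSON."""
    payload = json.dumps({"query": query, "documents": documents}).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def rank_documents(documents: list[str], response: dict) -> list[tuple[str, float]]:
    """Pair each document with its relevance_score, best first.
    Assumes the Jina-style shape {"results": [{"index": i, "relevance_score": s}]}."""
    scored = [(documents[r["index"]], r["relevance_score"]) for r in response["results"]]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With the working GGUF, the relevant document should land on top with a score near 1, mirroring the output shown above.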

What's different about this GGUF?

The official convert_hf_to_gguf.py detects the Qwen3-Reranker architecture and performs steps that naive converters skip:

  • Extracts cls.output.weight (the yes/no classifier) from lm_head
  • Sets pooling_type = RANK metadata
  • Bakes in the rerank chat template
  • Sets classifier.output_labels = ["yes", "no"]

Without these, llama-server has nothing to compute scores from.
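You can sanity-check a converted file for these pieces yourself. A sketch, assuming gguf-py's convention of prefixing metadata keys with the architecture name (the exact key names here are assumptions; verify them against your file with `gguf.GGUFReader` or gguf-dump):

```python
# Sanity-check that a converted GGUF carries the reranker-specific pieces
# listed above. Key names are assumptions based on gguf-py's arch-prefixed
# metadata convention; check them against your converter's actual output.

def missing_reranker_pieces(field_keys: set[str], tensor_names: set[str],
                            arch: str = "qwen3") -> list[str]:
    """Return the required reranker pieces that are absent from the file."""
    required_fields = [
        f"{arch}.pooling_type",              # should encode RANK pooling
        f"{arch}.classifier.output_labels",  # ["yes", "no"]
        "tokenizer.chat_template",           # the baked-in rerank template
    ]
    missing = [k for k in required_fields if k not in field_keys]
    if "cls.output.weight" not in tensor_names:
        missing.append("cls.output.weight")  # the yes/no classifier tensor
    return missing

# In practice, populate the two sets from gguf.GGUFReader:
#   reader = gguf.GGUFReader("Qwen3-Reranker-8B-f16.gguf")
#   fields = set(reader.fields)
#   tensors = {t.name for t in reader.tensors}
#   print(missing_reranker_pieces(fields, tensors))  # [] for a good file
```

An empty list means the file has everything llama-server needs to compute rank scores.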


models.ini example

[Qwen3-Reranker-8B-f16]
model = /path/to/Qwen3-Reranker-8B-f16.gguf
reranking = true
pooling = rank
embedding = true
ctx-size = 32768

For a full multi-model setup guide (embedding + reranking + chat on one server), see the llama-server Qwen3 guide.

Convert it yourself

pip install huggingface_hub gguf torch safetensors sentencepiece
python -c "from huggingface_hub import snapshot_download; snapshot_download('Qwen/Qwen3-Reranker-8B', local_dir='Qwen3-Reranker-8B-src')"
python convert_hf_to_gguf.py --outtype f16 --outfile Qwen3-Reranker-8B-f16.gguf Qwen3-Reranker-8B-src/

License

Apache 2.0, same as the original model.

Model details

Architecture: qwen3, 8B parameters. Base model: Qwen/Qwen3-8B-Base.