---
license: apache-2.0
base_model: Qwen/Qwen3-Reranker-0.6B
base_model_relation: quantized
tags:
  - gguf
  - quantized
  - llama.cpp
  - text-ranking
model_type: qwen3
quantized_by: Jonathan Middleton
revision: 602838d
---

# Qwen3-Reranker-0.6B-GGUF

🚨 **Required llama.cpp build:** https://github.com/ngxson/llama.cpp/tree/xsn/qwen3_embd_rerank

This unmerged fix branch is mandatory for running Qwen3 reranking models. Other GGUF quantizations of the 0.6B reranker on Hugging Face typically fail in mainline llama.cpp because they were not produced with this build. This quantization was produced with it and works.
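A minimal usage sketch with the fix-branch build. The flag and endpoint names below follow mainline `llama-server`'s reranking support and are assumptions about this branch; verify them with `llama-server --help` on your build.

```shell
# Start a reranking server on one of the quantized files (flag name per mainline llama-server).
./llama-server -m Qwen3-Reranker-0.6B-Q4_K_M.gguf --reranking --port 8080 &

# Score candidate documents against a query; the response contains one
# relevance score per document, keyed by its index in the "documents" array.
curl -s http://localhost:8080/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
        "query": "What is the capital of France?",
        "documents": [
          "Paris is the capital and largest city of France.",
          "Mount Everest is the highest mountain on Earth."
        ]
      }'
```

The first document should receive a markedly higher relevance score than the second.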

## Purpose

Multilingual text-reranking model in GGUF format for efficient CPU/GPU inference with llama.cpp-compatible back-ends. Parameters: ≈ 0.6 B.

**Note:** The token embedding matrix and the output tensor are left at FP16 across all quantizations.

## Files

| Filename | Quant | Size (bytes / MiB) | Est. quality Δ vs FP16 |
|---|---|---|---|
| Qwen3-Reranker-0.6B-F16.gguf | FP16 | 1,197,634,048 B (1142.2 MiB) | 0 (reference) |
| Qwen3-Reranker-0.6B-Q4_K_M.gguf | Q4_K_M | 396,476,032 B (378.1 MiB) | TBD |
| Qwen3-Reranker-0.6B-Q5_K_M.gguf | Q5_K_M | 444,186,496 B (423.6 MiB) | TBD |
| Qwen3-Reranker-0.6B-Q6_K.gguf | Q6_K | 494,878,880 B (472.0 MiB) | TBD |
| Qwen3-Reranker-0.6B-Q8_0.gguf | Q8_0 | 639,153,088 B (609.5 MiB) | TBD |
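The MiB figures can be reproduced from the byte counts (1 MiB = 1,048,576 bytes). A quick check using the Q4_K_M row:

```shell
# Reproduce the MiB column from the byte count (Q4_K_M row of the table).
bytes=396476032
awk -v b="$bytes" 'BEGIN { printf "%.1f MiB\n", b / (1024 * 1024) }'
# → 378.1 MiB
```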

## Upstream Source

- Repo: Qwen/Qwen3-Reranker-0.6B
- Commit: f16fc5d (2025-06-09)
- License: Apache-2.0

## Conversion & Quantization

```bash
# Convert safetensors → GGUF at FP16 (outtype made explicit so the output
# matches the F16 filename regardless of the source dtype)
python convert_hf_to_gguf.py --outtype f16 ~/models/local/Qwen3-Reranker-0.6B

# Quantize variants, keeping token embeddings at F16 and leaving the output
# tensor unquantized ($EMB_OPT is intentionally unquoted so it splits into flags)
EMB_OPT="--token-embedding-type F16 --leave-output-tensor"
for QT in Q4_K_M Q5_K_M Q6_K Q8_0; do
  llama-quantize $EMB_OPT Qwen3-Reranker-0.6B-F16.gguf "Qwen3-Reranker-0.6B-${QT}.gguf" "$QT"
done
```
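As a post-quantization sanity check, each output should land at a predictable fraction of the FP16 reference; the byte counts below are taken from the Files table above. The floor on the ratio is set by the token embeddings and output tensor staying at FP16.

```shell
# Size of each quant relative to the FP16 reference (bytes from the Files table).
fp16=1197634048
for entry in Q4_K_M:396476032 Q5_K_M:444186496 Q6_K:494878880 Q8_0:639153088; do
  qt=${entry%%:*}
  bytes=${entry#*:}
  awk -v b="$bytes" -v f="$fp16" -v q="$qt" \
      'BEGIN { printf "%-6s %5.1f%% of FP16\n", q, 100 * b / f }'
done
```

Q4_K_M comes out at about 33 % of the FP16 file and Q8_0 at about 53 %; a quantized file far outside these ranges suggests the embedding/output-tensor flags were not applied.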