---
license: apache-2.0
base_model: Qwen/Qwen3-Reranker-0.6B
base_model_relation: quantized
tags:
  - gguf
  - quantized
  - llama.cpp
  - text-ranking
model_type: qwen3
quantized_by: Jonathan Middleton
revision: 602838d   # Aug 19 2025
---

# Qwen3-Reranker-0.6B-GGUF

**🚨 REQUIRED `llama.cpp` build:** https://github.com/ngxson/llama.cpp/tree/xsn/qwen3_embd_rerank  
**This unmerged fix branch is mandatory** for running Qwen3 reranking models. Other GGUF quantizations of the 0.6B reranker on the Hugging Face Hub typically fail under mainline `llama.cpp` because they were not produced with this build. **This quantization was produced with the above build and works.**
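
For reference, a minimal build sketch for that branch (standard `llama.cpp` CMake flow; add back-end flags such as `-DGGML_CUDA=ON` as needed for your hardware):

```bash
# Fetch the fork and check out the unmerged rerank fix branch
git clone https://github.com/ngxson/llama.cpp
cd llama.cpp
git checkout xsn/qwen3_embd_rerank

# Standard llama.cpp CMake build
cmake -B build
cmake --build build --config Release -j
```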

## Purpose
A multilingual **text-reranking** model in **GGUF** format for efficient CPU/GPU inference with *llama.cpp*-compatible back-ends.  
Parameters: ≈ **0.6 B**.

**Note:** Token embedding matrix and output tensors are **left at FP16** across all quantizations.
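
A minimal usage sketch, assuming the fork exposes the same `--reranking` flag and `/v1/rerank` endpoint as mainline `llama-server` (not verified against this branch; file and port are illustrative):

```bash
# Serve a quant with the reranking endpoint enabled
./build/bin/llama-server -m Qwen3-Reranker-0.6B-Q8_0.gguf --reranking --port 8080

# Score candidate documents against a query
curl http://localhost:8080/v1/rerank -H "Content-Type: application/json" -d '{
  "query": "What is the capital of France?",
  "documents": ["Paris is the capital of France.", "Berlin is the capital of Germany."]
}'
```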

## Files
| Filename                                   | Quant   | Size (bytes / MiB)                 | Est. quality Δ vs FP16 |
|--------------------------------------------|---------|------------------------------------|------------------------|
| `Qwen3-Reranker-0.6B-F16.gguf`             | FP16    | 1,197,634,048 B (1142.2 MiB)       | 0 (reference)          |
| `Qwen3-Reranker-0.6B-Q4_K_M.gguf`          | Q4_K_M  |   396,476,032 B (378.1 MiB)        | TBD                    |
| `Qwen3-Reranker-0.6B-Q5_K_M.gguf`          | Q5_K_M  |   444,186,496 B (423.6 MiB)        | TBD                    |
| `Qwen3-Reranker-0.6B-Q6_K.gguf`            | Q6_K    |   494,878,880 B (472.0 MiB)        | TBD                    |
| `Qwen3-Reranker-0.6B-Q8_0.gguf`            | Q8_0    |   639,153,088 B (609.5 MiB)        | TBD                    |
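
To pull a single file, something like the following works with `huggingface-cli` (`<user>` below is a placeholder; substitute this model card's actual Hub namespace):

```bash
# Download one quant from the Hub into the current directory
huggingface-cli download <user>/Qwen3-Reranker-0.6B-GGUF \
  Qwen3-Reranker-0.6B-Q8_0.gguf --local-dir .
```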

## Upstream Source
* **Repo:** `Qwen/Qwen3-Reranker-0.6B`  
* **Commit:** `f16fc5d` (2025-06-09)  
* **License:** Apache-2.0

## Conversion & Quantization
```bash
# Convert safetensors → GGUF (FP16)
python convert_hf_to_gguf.py ~/models/local/Qwen3-Reranker-0.6B \
  --outtype f16 --outfile Qwen3-Reranker-0.6B-F16.gguf

# Quantize variants, keeping token embeddings and the output tensor at FP16
EMB_OPT="--token-embedding-type F16 --leave-output-tensor"
for QT in Q4_K_M Q5_K_M Q6_K Q8_0; do
  # $EMB_OPT is deliberately unquoted so it word-splits into separate flags
  llama-quantize $EMB_OPT Qwen3-Reranker-0.6B-F16.gguf Qwen3-Reranker-0.6B-${QT}.gguf $QT
done
```
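
As a sanity check, per-tensor types can be inspected with the `gguf` Python package's dump tool (a hedged sketch; assumes `pip install gguf` provides the `gguf-dump` script). The token embedding (and the output tensor, if present as a separate tensor) should report F16:

```bash
pip install gguf
# Dump tensor info and confirm the embedding/output tensors stayed at F16
gguf-dump Qwen3-Reranker-0.6B-Q4_K_M.gguf | grep -Ei "token_embd|output"
```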