|
|
--- |
|
|
license: apache-2.0 |
|
|
base_model: Qwen/Qwen3-Reranker-0.6B |
|
|
base_model_relation: quantized |
|
|
tags: |
|
|
- gguf |
|
|
- quantized |
|
|
- llama.cpp |
|
|
- text-ranking |
|
|
model_type: qwen3 |
|
|
quantized_by: Jonathan Middleton |
|
|
revision: 602838d |
|
|
--- |
|
|
|
|
|
# Qwen3-Reranker-0.6B-GGUF |
|
|
|
|
|
**🚨 REQUIRED llama.cpp build:** https://github.com/ngxson/llama.cpp/tree/xsn/qwen3_embd_rerank
|
|
**This unmerged fix branch is mandatory:** mainline `llama.cpp` does not yet support Qwen3 reranking models, which is why other HF GGUF quantizations of the 0.6B reranker typically fail there. **This quantization was produced with the above build and works with it.**
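Once the branch is built, its `llama-server` can serve the model for reranking. The sketch below shows the request shape used by `llama-server`'s `/v1/rerank` endpoint for reranker models in mainline; whether the fix branch exposes the identical route is an assumption — check the branch README.

```python
import json

# Hypothetical payload for llama-server's /v1/rerank endpoint
# (assumption: the fix branch exposes the same route as mainline's
# reranker support; verify against the branch documentation).
payload = {
    "model": "Qwen3-Reranker-0.6B-Q4_K_M.gguf",
    "query": "What is the capital of France?",
    "documents": [
        "Paris is the capital and largest city of France.",
        "Berlin is the capital of Germany.",
    ],
}

# Equivalent request, with the server started as e.g.
#   llama-server -m Qwen3-Reranker-0.6B-Q4_K_M.gguf --reranking
# would be: curl http://localhost:8080/v1/rerank -d '<payload JSON>'
print(json.dumps(payload, indent=2))
```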
|
|
|
|
|
## Purpose |
|
|
Multilingual **text-reranking** model in **GGUF** for efficient CPU/GPU inference with *llama.cpp*-compatible back-ends. |
|
|
Parameters ≈ **0.6 B**.
|
|
|
|
|
**Note:** Token embedding matrix and output tensors are **left at FP16** across all quantizations. |
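The upstream model scores each query–document pair as a yes/no judgment wrapped in a chat template. The template below is transcribed from the upstream `Qwen/Qwen3-Reranker-0.6B` model card; treat the exact strings as an assumption and verify them against that card (the fix-branch tooling may also apply this template internally).

```python
# Qwen3-Reranker prompt format, as described in the upstream model card
# (assumption: verify the exact strings against Qwen/Qwen3-Reranker-0.6B).
PREFIX = (
    "<|im_start|>system\nJudge whether the Document meets the requirements "
    "based on the Query and the Instruct provided. Note that the answer can "
    'only be "yes" or "no".<|im_end|>\n<|im_start|>user\n'
)
SUFFIX = "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"

def format_pair(instruction: str, query: str, doc: str) -> str:
    """Build the full prompt for one (query, document) pair."""
    body = f"<Instruct>: {instruction}\n<Query>: {query}\n<Document>: {doc}"
    return PREFIX + body + SUFFIX

prompt = format_pair(
    "Given a web search query, retrieve relevant passages that answer the query",
    "What is the capital of France?",
    "Paris is the capital of France.",
)
```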
|
|
|
|
|
## Files |
|
|
| Filename | Quant | Size (bytes / MiB) | Est. quality Ξ vs FP16 | |
|
|
|--------------------------------------------|---------|------------------------------------|------------------------| |
|
|
| `Qwen3-Reranker-0.6B-F16.gguf` | FP16 | 1,197,634,048 B (1142.2 MiB) | 0 (reference) | |
|
|
| `Qwen3-Reranker-0.6B-Q4_K_M.gguf` | Q4_K_M | 396,476,032 B (378.1 MiB) | TBD | |
|
|
| `Qwen3-Reranker-0.6B-Q5_K_M.gguf` | Q5_K_M | 444,186,496 B (423.6 MiB) | TBD | |
|
|
| `Qwen3-Reranker-0.6B-Q6_K.gguf` | Q6_K | 494,878,880 B (472.0 MiB) | TBD | |
|
|
| `Qwen3-Reranker-0.6B-Q8_0.gguf` | Q8_0 | 639,153,088 B (609.5 MiB) | TBD | |
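The MiB column is simply bytes divided by 2²⁰, rounded to one decimal place — a quick check of the table:

```python
# Verify the MiB column of the table above: MiB = bytes / 2**20.
sizes_bytes = {
    "F16": 1_197_634_048,
    "Q4_K_M": 396_476_032,
    "Q5_K_M": 444_186_496,
    "Q6_K": 494_878_880,
    "Q8_0": 639_153_088,
}
sizes_mib = {q: round(b / 2**20, 1) for q, b in sizes_bytes.items()}
```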
|
|
|
|
|
## Upstream Source |
|
|
* **Repo:** `Qwen/Qwen3-Reranker-0.6B` |
|
|
* **Commit:** `f16fc5d` (2025-06-09) |
|
|
* **License:** Apache-2.0 |
|
|
|
|
|
## Conversion & Quantization |
|
|
```bash |
|
|
# Convert safetensors → GGUF (FP16)
|
|
python convert_hf_to_gguf.py ~/models/local/Qwen3-Reranker-0.6B |
|
|
|
|
|
# Quantize variants |
|
|
EMB_OPT="--token-embedding-type F16 --leave-output-tensor" |
|
|
for QT in Q4_K_M Q5_K_M Q6_K Q8_0; do |
|
|
llama-quantize $EMB_OPT Qwen3-Reranker-0.6B-F16.gguf Qwen3-Reranker-0.6B-${QT}.gguf $QT |
|
|
done |
```
|
|
|
|
|
|
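A quick way to sanity-check the quantization outputs: every GGUF file begins with the 4-byte ASCII magic `GGUF`, followed by a little-endian `uint32` format version. A minimal sketch (the file path in the example is illustrative):

```python
import struct

def gguf_version(path: str) -> int:
    """Return the GGUF format version, or raise if the magic is wrong."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file: magic={magic!r}")
        (version,) = struct.unpack("<I", f.read(4))
        return version

# Example: gguf_version("Qwen3-Reranker-0.6B-Q4_K_M.gguf")
```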