Working GGUF for llama.cpp (native Windows/Linux, no WSL needed)
#9 by Voodisss
Hi - most community GGUF conversions of Qwen3-Reranker are broken with llama.cpp (the cls.output.weight tensor is missing, so they produce scores like 4.5e-23 instead of real relevance scores). See llama.cpp#16407 for details.
I've converted all three sizes (0.6B, 4B, 8B) using the official convert_hf_to_gguf.py and verified they work:
Collection: https://huggingface.co/collections/Voodisss/qwen3-reranker-gguf-for-llamacpp
8B: https://huggingface.co/Voodisss/Qwen3-Reranker-8B-GGUF-llama_cpp
Works natively on Windows and Linux with llama-server.exe or llama-cli - no WSL, no vLLM, no Docker containers that refuse to release RAM. Just:
llama-server -m Qwen3-Reranker-8B-f16.gguf --reranking --pooling rank --embedding
Then call /v1/rerank and get real scores.
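For example, a minimal Python client sketch for that endpoint (assumes llama-server is listening on its default port 8080; `rerank` and `parse_rerank_response` are just illustrative names I picked, not part of llama.cpp):

```python
import json
from urllib import request

# llama-server's default address; adjust host/port to match your setup.
RERANK_URL = "http://localhost:8080/v1/rerank"

def parse_rerank_response(body):
    """Return (document index, relevance score) pairs, best score first."""
    return sorted(
        ((r["index"], r["relevance_score"]) for r in body["results"]),
        key=lambda pair: pair[1],
        reverse=True,
    )

def rerank(query, documents, url=RERANK_URL):
    """POST a query and candidate documents to a running llama-server."""
    payload = json.dumps({"query": query, "documents": documents}).encode("utf-8")
    req = request.Request(url, data=payload,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return parse_rerank_response(json.load(resp))

# Usage (with the server from the command above running):
# rerank("what is a GGUF file?", ["GGUF is a model file format.", "Cats sleep a lot."])
```

With a correctly converted GGUF you should see plausible relevance scores here rather than the near-zero values the broken conversions return.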