Working GGUF for llama.cpp (native Windows/Linux, no WSL needed)

#9
by Voodisss

Hi - most community GGUF conversions of Qwen3-Reranker are broken with llama.cpp: the cls.output.weight tensor is missing, so they produce scores like 4.5e-23 instead of real relevance scores. See llama.cpp#16407 for details.

I've converted all three sizes (0.6B, 4B, 8B) using the official convert_hf_to_gguf.py and verified they work:

Collection: https://huggingface.co/collections/Voodisss/qwen3-reranker-gguf-for-llamacpp
8B: https://huggingface.co/Voodisss/Qwen3-Reranker-8B-GGUF-llama_cpp

Works natively on Windows and Linux with llama-server.exe or llama-cli - no WSL, no vLLM, no Docker containers that refuse to release RAM. Just:

llama-server -m Qwen3-Reranker-8B-f16.gguf --reranking --pooling rank --embedding

Then call /v1/rerank and get real scores.
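For reference, a minimal Python sketch of calling the endpoint and ranking the results. This assumes the server above is listening on http://localhost:8080 and that the response follows llama-server's rerank format (a "results" list of objects with "index" and "relevance_score"); adjust the URL to your setup.

```python
import json
import urllib.request

def rerank(query, documents, base_url="http://localhost:8080"):
    """POST a query and candidate documents to llama-server's /v1/rerank
    endpoint and return the parsed JSON response. Assumes the server was
    started with --reranking as shown above."""
    payload = json.dumps({"query": query, "documents": documents}).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/rerank",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def top_documents(documents, response):
    """Pair each returned relevance_score with its source document,
    highest-scoring document first."""
    ranked = sorted(response["results"],
                    key=lambda r: r["relevance_score"], reverse=True)
    return [(documents[r["index"]], r["relevance_score"]) for r in ranked]
```

With a working GGUF you should see ordinary relevance scores here, not the degenerate 4.5e-23 values the broken conversions return.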
