The Qwen3-Reranker models are now supported in llama-cpp-python; this repository provides a test GGUF file.

llama-cpp-python: https://github.com/JamePeng/llama-cpp-python

Code example:

```python
import llama_cpp
from llama_cpp.llama_embedding import LlamaEmbedding

# Initialize a reranking model.
# Note: use forward slashes (or a raw string) in the path; a literal
# "path\to\..." would be mangled by Python escape sequences.
ranker = LlamaEmbedding(
    model_path="path/to/Qwen3-Reranker-0.6B-Q8_0.gguf",
    pooling_type=llama_cpp.LLAMA_POOLING_TYPE_RANK,  # crucial for rerankers!
    n_gpu_layers=-1,  # offload all layers to the GPU
    n_ctx=0           # 0 = use the model's native context length
)

query = "What causes rain?"
docs = [
    "Clouds are made of water droplets...",             # relevant
    "To bake a cake you need flour...",                 # irrelevant
    "Rain is liquid water in the form of droplets...",  # highly relevant
]

# Calculate relevance scores.
# Inputs like "[BOS] query [SEP] doc [EOS]" are constructed automatically.
scores = ranker.rank(query, docs)

# Result: a list of floats, one per document (higher means more relevant)
print(scores)
# e.g., [0.0011407170677557588, 5.614783731289208e-05, 0.7173627614974976]
# -> the 3rd document is the best match
```
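Since `rank()` returns one score per document in the same order as the input list, a typical follow-up is to pair each document with its score and sort by relevance. A minimal, self-contained sketch using the example scores shown above (no model required to run it):

```python
# Example documents and the relevance scores from the run above.
docs = [
    "Clouds are made of water droplets...",
    "To bake a cake you need flour...",
    "Rain is liquid water in the form of droplets...",
]
scores = [0.0011407170677557588, 5.614783731289208e-05, 0.7173627614974976]

# Sort documents by score, highest relevance first.
ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
for doc, score in ranked:
    print(f"{score:.6f}  {doc}")
# The "Rain is liquid water..." document comes first.
```

In a retrieval pipeline you would typically keep only the top-k entries of `ranked` before passing them downstream.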
Model details: GGUF format, 0.6B parameters, qwen3 architecture.
