Qwen3-Reranker-8B GGUF — Quantized by BatiAI
GGUF quantizations of Qwen/Qwen3-Reranker-8B, the largest model in the Qwen3 reranker family, for maximum ranking precision. Part of BatiAI's on-device RAG stack for BatiFlow.
What is a reranker?
RAG pipeline: embedding (coarse retrieve) → reranker (precise scoring) → LLM (answer).
A reranker takes a (query, candidate_document) pair and returns a relevance score. It is the "second pass" after vector search: it turns "probably relevant" candidates into an ordered top-K that the LLM can use confidently.
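As a rough illustration of that second pass, the sketch below scores each candidate against the query and keeps the top K. The `rerank_top_k` helper and the toy `overlap_score` scorer are hypothetical names introduced here for illustration; in a real pipeline the scorer would be a call into the reranker model rather than word overlap.

```python
from typing import Callable, List, Tuple


def rerank_top_k(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every (query, candidate) pair and return the k best, highest first."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    # list.sort() is stable, so tied candidates keep their retrieval order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def overlap_score(query: str, doc: str) -> float:
    """Toy stand-in for a reranker: fraction of query words found in the document."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0


if __name__ == "__main__":
    docs = [
        "Paris is the capital of France",
        "RAG combines retrieval with generation",
        "What RAG is and how retrieval works",
    ]
    for doc, score in rerank_top_k("what is rag", docs, overlap_score, k=2):
        print(f"{score:.2f}  {doc}")
```

Passing the scorer in as a function keeps the ordering logic independent of how scores are produced, so the toy scorer can be swapped for a model-backed one without touching the rest.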
When to pick 8B over 0.6B / 4B?
| Use case | Pick |
|---|---|
| Desktop workstation / plenty of RAM | 8B — best ranking accuracy, clearest margin on adversarial/ambiguous negatives |
| Typical laptop / 32 GB Mac | 4B — close to 8B quality at half the size |
| Edge / small Mac / batch rerank at scale | 0.6B — 13× smaller than 8B, still hits 100 % pairwise accuracy on our test |
All three are from the same Qwen3-Reranker family at different sizes; 8B is the quality ceiling.
Quick Start (llama.cpp)
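```shell
# Terminal 1: start the server in reranking mode
./llama-server -m Qwen3-Reranker-8B-Q8_0.gguf \
  --rerank --pooling rank -c 4096 \
  --host 127.0.0.1 --port 8090

# Terminal 2: score two documents against a query
curl http://127.0.0.1:8090/rerank -d '{
  "query": "What is RAG?",
  "documents": ["RAG ...", "Paris ..."]
}'
```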
Note: Ollama doesn't have a native reranker endpoint yet, so this GGUF is intended for direct llama.cpp integration or tools like LangChain / LlamaIndex.
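For callers that prefer Python over curl, the same request can be sketched with the standard library alone. This assumes the Quick Start server is listening on 127.0.0.1:8090 and that the response body has the shape `{"results": [{"index": ..., "relevance_score": ...}]}`; that shape is an assumption about the llama.cpp server and should be checked against the build in use. The helper names `build_payload`, `parse_results`, and `rerank` are illustrative, not part of any library.

```python
import json
import urllib.request
from typing import Any, Dict, List, Tuple

# Assumption: the llama-server instance from the Quick Start is listening here.
RERANK_URL = "http://127.0.0.1:8090/rerank"


def build_payload(query: str, documents: List[str]) -> bytes:
    """Encode the same JSON body that the curl example sends."""
    return json.dumps({"query": query, "documents": documents}).encode("utf-8")


def parse_results(
    response: Dict[str, Any], documents: List[str]
) -> List[Tuple[str, float]]:
    """Pair each document with its score and sort best first.

    Assumes each entry in response["results"] carries "index" and
    "relevance_score" keys.
    """
    ranked = [
        (documents[item["index"]], float(item["relevance_score"]))
        for item in response["results"]
    ]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


def rerank(
    query: str, documents: List[str], url: str = RERANK_URL
) -> List[Tuple[str, float]]:
    """POST to the rerank endpoint and return (document, score) pairs."""
    request = urllib.request.Request(
        url,
        data=build_payload(query, documents),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as resp:
        return parse_results(json.load(resp), documents)
```

With the server running, `rerank("What is RAG?", ["RAG ...", "Paris ..."])` should return the two documents ordered by score. Parsing is kept separate from the HTTP call so it can be exercised without a live server.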
Available Quantizations
| File | Quant | Size | Recommended |
|---|---|---|---|
| Qwen3-Reranker-8B-Q6_K.gguf | Q6_K | 5.8 GB | balanced (recommended default) |
| Qwen3-Reranker-8B-Q8_0.gguf | Q8_0 | 7.5 GB | near-lossless |
Quality Verification (measured)
We ran 40 (query, positive, negative) triples (20 English, 20 Korean) from a hard test set with topically close negatives:
| Quant | Accuracy | Margin (pos-neg) |
|---|---|---|
| Q6_K | 100 % | 0.819 |
| Q8_0 | 100 % | 0.825 |
Pearson correlation between Q6_K and Q8_0 scores: r = 0.9986, so quantization drift is essentially zero.
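The card does not include the evaluation script, so the following is only a sketch of how such numbers could be computed. It assumes accuracy is the share of triples where the positive outscores the negative, margin is the mean of positive minus negative, and drift is the Pearson correlation between the two quantizations' scores on the same inputs. The scores in the demo are made-up placeholders, not the measured values.

```python
import math
from typing import Sequence, Tuple

ScorePair = Tuple[float, float]  # (positive_score, negative_score)


def pairwise_accuracy(pairs: Sequence[ScorePair]) -> float:
    """Share of triples where the positive document outscores the negative."""
    return sum(1 for pos, neg in pairs if pos > neg) / len(pairs)


def mean_margin(pairs: Sequence[ScorePair]) -> float:
    """Average of (positive score - negative score) over all triples."""
    return sum(pos - neg for pos, neg in pairs) / len(pairs)


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Placeholder scores for three triples under two quantizations.
    q6_pairs = [(0.95, 0.10), (0.88, 0.15), (0.91, 0.05)]
    q8_pairs = [(0.96, 0.09), (0.89, 0.14), (0.92, 0.06)]

    print("accuracy:", pairwise_accuracy(q6_pairs))
    print("margin:", round(mean_margin(q6_pairs), 3))

    # Flatten to one score list per quantization before correlating.
    q6_scores = [s for pair in q6_pairs for s in pair]
    q8_scores = [s for pair in q8_pairs for s in pair]
    print("drift r:", round(pearson_r(q6_scores, q8_scores), 4))
```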
8B vs smaller variants (same testset, same script):
| Model | Hard margin | Drift (Q6↔Q8) |
|---|---|---|
| 0.6B | 0.723–0.751 | r = 0.996 |
| 4B | 0.650–0.672 | r = 0.998 |
| 8B | 0.819–0.825 | r = 0.999 |
The 8B's larger margin on adversarial negatives is its key differentiator — the score separation between "right answer" and "close-but-wrong" is visibly wider, which helps in high-stakes retrieval where you can't afford the top-1 to be wrong.
Why Qwen3-Reranker?
- SOTA among open rerankers — top of MTEB reranking benchmarks
- Multilingual — en / ko / ja / zh
- Apache 2.0 — commercial-friendly
Why BatiAI?
- Quantized directly from Alibaba's BF16 safetensors
- BatiAI-signed: `general.author: BatiAI`, `general.url: https://flow.bati.ai`
- Part of a full on-device RAG stack
Technical Details
- Original Model: Qwen/Qwen3-Reranker-8B
- Architecture: Qwen3 Causal LM (cross-encoder scorer)
- Parameters: 8 B
- Context: 32 K
- License: Apache 2.0
- Quantized with: llama.cpp build bafae2765
BatiAI's RAG Stack
| Role | Model | HF |
|---|---|---|
| Reranker (0.6 B) | Qwen3-Reranker-0.6B | batiai/Qwen3-Reranker-0.6B-GGUF |
| Reranker (4 B) | Qwen3-Reranker-4B | batiai/Qwen3-Reranker-4B-GGUF |
| Reranker (8 B) | Qwen3-Reranker-8B | this repo |
| VL Embedding (2 B) | Qwen3-VL-Embedding-2B | batiai/Qwen3-VL-Embedding-2B-GGUF |
| Chat LLM (35 B-A3B) | Qwen3.6-35B-A3B | batiai/Qwen3.6-35B-A3B-GGUF |
License
Mirrors upstream Qwen Apache 2.0. Commercial use permitted.