Working GGUF for llama.cpp (native Windows/Linux, no WSL needed)
Hi! Most community GGUF conversions of Qwen3-Reranker are broken with llama.cpp (missing the cls.output.weight tensor, producing scores like 4.5e-23 instead of real relevance scores). See llama.cpp#16407 for details.
I've converted all three sizes (0.6B, 4B, 8B) using the official convert_hf_to_gguf.py and verified they work:
Collection: https://huggingface.co/collections/Voodisss/qwen3-reranker-gguf-for-llamacpp
0.6B: https://huggingface.co/Voodisss/Qwen3-Reranker-0.6B-GGUF-llama_cpp
Works natively on Windows and Linux with llama-server.exe or llama-cli: no WSL, no vLLM, no Docker containers that refuse to release RAM. Just:
llama-server -m Qwen3-Reranker-0.6B-f16.gguf --reranking --pooling rank --embedding
Then call /v1/rerank and get real scores.
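For quick testing from Python, here is a minimal sketch of the round trip. It assumes the Cohere-style response shape that llama-server emits, i.e. {"results": [{"index": ..., "relevance_score": ...}]}, and the port used in the examples below; adjust if your server differs.

```python
import json
import urllib.request

def rerank(query, documents, url="http://localhost:8081/v1/rerank"):
    """POST to llama-server's /v1/rerank endpoint and return the parsed JSON."""
    payload = json.dumps({"query": query, "documents": documents}).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def order_by_score(response, documents):
    """Pair each document with its relevance_score and sort best-first.

    Assumes the response shape {"results": [{"index": 0,
    "relevance_score": 0.98}, ...]} returned by llama-server.
    """
    ranked = sorted(
        response["results"], key=lambda r: r["relevance_score"], reverse=True
    )
    return [(documents[r["index"]], r["relevance_score"]) for r in ranked]
```

With a working GGUF the scores come back as sensible values instead of the near-zero garbage from the broken conversions.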
How do I provide a system prompt and instructions when using the /v1/rerank endpoint in llama.cpp?
Go into any of the Qwen models I uploaded and you will find a URL to a GitHub gist containing all the context your LLM agent needs to use the models with llama.cpp. It shows the structure of the API request.
curl http://localhost:8081/v1/rerank \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3-Reranker-4B-f16",
"query": "employment termination notice period",
"documents": [
"The Labour Code requires 30 calendar days written notice.",
"Corporate tax rates for small enterprises."
]
}'
The rerank API only has these arguments, but the official Qwen model card suggests using instructions and a system prompt for better accuracy:
Instruction Aware notes whether the embedding or reranking model supports customizing the input instruction according to different tasks.
Our evaluation indicates that, for most downstream tasks, using instructions (instruct) typically yields an improvement of 1% to 5% compared to not using them. Therefore, we recommend that developers create tailored instructions specific to their tasks and scenarios. In multilingual contexts, we also advise users to write their instructions in English, as most instructions utilized during the model training process were originally written in English.
Can you please help me out on how to provide this system prompt or these instructions via the reranker endpoint?
@gpuman as far as I'm aware, you introduce the instructions like this:
curl http://localhost:8081/v1/rerank \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3-Reranker-0.6B-f16",
"query": "Instruct: Rank the documents based on their relevance to the legal requirements for employment termination.\nQuery: employment termination notice period",
"documents": [
"The Labour Code requires 30 calendar days written notice.",
"Corporate tax rates for small enterprises."
]
}'
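If you are scripting this, the same trick can be wrapped in a small helper that prefixes the query field the way the curl example above does. The function name and signature here are illustrative, not part of any API:

```python
import json

def build_rerank_payload(query, documents, instruction=None,
                         model="Qwen3-Reranker-0.6B-f16"):
    """Build the JSON body for /v1/rerank, optionally prepending an
    'Instruct:' line to the query as in the curl example above."""
    if instruction is not None:
        query = f"Instruct: {instruction}\nQuery: {query}"
    return json.dumps({"model": model, "query": query, "documents": documents})
```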
Thanks, scores are much better now. I am using ggml's Q8_0 quantized GGUF build and it's working fine; I also tried your f16 build.
Within 24 hours I will add Q8_0 quants of all those reranker model sizes.
Thanks for the feedback - I will also experiment with using instructions.
@gpuman upon further research, I've determined that the format you're using works but is not quite correct. Here's why:
The rerank chat template baked into the GGUF already contains a hardcoded instruction:
<Instruct>: Given a web search query, retrieve relevant passages that answer the query
<Query>: {query}
<Document>: {document}
The server does a simple string replace: {query} becomes whatever you pass in the "query" field, and {document} becomes each document. There is no separate instruction field in the /v1/rerank API.
Correct basic usage (no custom instruction needed for most cases):
curl http://localhost:8081/v1/rerank \
-H "Content-Type: application/json" \
-d '{
"query": "employment termination notice period",
"documents": [
"The Labour Code requires 30 calendar days written notice.",
"Corporate tax rates for small enterprises."
]
}'
The default instruction (Given a web search query, retrieve relevant passages that answer the query) is applied automatically by the template. Qwen's own evaluation shows custom instructions only improve scores by 1-5%, so the default is fine for most retrieval tasks.
If you want a custom instruction, you can embed it in the query field as a workaround, but you need to close the <Query> context and inject a new <Instruct> line to override the default:
curl http://localhost:8081/v1/rerank \
-H "Content-Type: application/json" \
-d '{
"query": "employment termination notice period\n<Instruct>: Rank documents by relevance to legal requirements for employment termination",
"documents": [...]
}'
This is a hack: the resulting prompt has the default instruction AND your custom one. It works in practice but isn't clean. The llama.cpp /v1/rerank API simply doesn't have an instruction parameter. For full control over the instruction, you'd need to use the Transformers Python code from Qwen's official README, which formats the prompt manually.
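For reference, the prompt-formatting step of that manual route is simple; a pure-Python sketch following the <Instruct>/<Query>/<Document> layout from Qwen's model card is below. The actual scoring (via the model's "yes"/"no" token logits) still needs the full Transformers code from Qwen's README.

```python
def format_instruction(instruction, query, doc):
    """Build one rerank prompt with a custom (or default) instruction,
    following the <Instruct>/<Query>/<Document> layout from Qwen's model card."""
    if instruction is None:
        instruction = ("Given a web search query, retrieve relevant passages "
                       "that answer the query")
    return f"<Instruct>: {instruction}\n<Query>: {query}\n<Document>: {doc}"
```

Formatting it yourself this way means exactly one instruction ends up in the prompt, unlike the query-field workaround above.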
The rerank chat template baked into the GGUF already contains a hardcoded instruction:
How is this chat template baked in? Does it happen when creating the GGUF file, or in llama.cpp?