# Qwen3-Reranker-0.6B-Q8
This repository contains an 8-bit quantized Safetensors version of the Qwen3-Reranker-0.6B model.
The reranker is a cross-encoder relevance model: given a query and a document, it judges whether the document is relevant and answers with "yes" or "no".
This quantized version is optimized for:
- Lower memory usage
- CPU-friendly inference
- Faster loading
- Easier deployment in retrieval pipelines
## 🔍 Model Overview
| Property | Details |
|---|---|
| Base Model | Qwen/Qwen3-Reranker-0.6B |
| Architecture | Cross-Encoder Reranker |
| Quantization | 8-bit (bitsandbytes) |
| Inference Device | CPU or GPU |
| Task | Document relevance classification |
| Output | Relevance probability ("yes" vs. "no") |
This model takes a (Query, Document) pair and returns the probability that the document answers the query.
## Intended Use
This model is intended for RAG pipelines and information retrieval tasks (see the pipeline sketch after this list), such as:
- Ranking retrieved passages
- Filtering irrelevant search results
- Enhancing LLM-based RAG quality
- Improving semantic search precision
- Document scoring for QA systems
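As a sketch of where the reranker fits, the snippet below shows a two-stage retrieve-then-rerank pipeline. The `retrieve`, `rerank`, and `llm` callables are hypothetical placeholders (a concrete `rerank` is built in the Transformers section below); only the control flow is meant literally.

```python
def rag_answer(query, corpus, retrieve, rerank, llm, k=20, top_k=3):
    # Stage 1: cheap, high-recall retrieval (e.g., BM25 or a bi-encoder).
    candidates = retrieve(query, corpus, k=k)        # hypothetical helper
    # Stage 2: precise cross-encoder scoring of each (query, document) pair.
    ranked = rerank(query, candidates, top_k=top_k)  # hypothetical helper
    # Stage 3: generate an answer from only the highest-scoring context.
    context = "\n\n".join(doc for _, doc in ranked)
    return llm(f"Context:\n{context}\n\nQuestion: {query}")
```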
## Transformers Usage

```python
# Requires: transformers>=4.51.0, accelerate, bitsandbytes
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "ManiKumarAdapala/Qwen3-Reranker-0.6B-Q8_0-Safetensors"
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side='left')
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # halves memory; use torch.float32 if fp16 is slow on your CPU
    device_map={"": "cpu"},     # set to "cuda" for GPU inference
).eval()
# On CUDA, flash_attention_2 gives better speed and memory efficiency:
# model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, attn_implementation="flash_attention_2").cuda().eval()
token_false_id = tokenizer.convert_tokens_to_ids("no")
token_true_id = tokenizer.convert_tokens_to_ids("yes")
max_length = 128  # total token budget per (query, document) pair; larger values cost more compute
prefix = "<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be \"yes\" or \"no\".<|im_end|>\n<|im_start|>user\n"
suffix = "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"
prefix_tokens = tokenizer.encode(prefix, add_special_tokens=False)
suffix_tokens = tokenizer.encode(suffix, add_special_tokens=False)
def format_instruction(instruction, query, doc):
    if instruction is None:
        instruction = 'Given a web search query, retrieve relevant passages that answer the query'
    return "<Instruct>: {instruction}\n<Query>: {query}\n<Document>: {doc}".format(
        instruction=instruction, query=query, doc=doc
    )
def process_inputs(pairs):
    # Tokenize without padding, reserving room for the prefix and suffix tokens
    inputs = tokenizer(
        pairs, padding=False, truncation='longest_first',
        return_attention_mask=False,
        max_length=max_length - len(prefix_tokens) - len(suffix_tokens),
    )
    # Wrap each example in the chat-template prefix and suffix
    for i, ele in enumerate(inputs['input_ids']):
        inputs['input_ids'][i] = prefix_tokens + ele + suffix_tokens
    inputs = tokenizer.pad(inputs, padding=True, return_tensors="pt", max_length=max_length)
    for key in inputs:
        inputs[key] = inputs[key].to(model.device)
    return inputs
@torch.no_grad()  # inference only: skip gradient tracking to save memory
def compute_logits(inputs, **kwargs):
    # Logits at the last position decide between the "yes" and "no" tokens
    batch_scores = model(**inputs).logits[:, -1, :]
    true_vector = batch_scores[:, token_true_id]
    false_vector = batch_scores[:, token_false_id]
    batch_scores = torch.stack([false_vector, true_vector], dim=1)
    batch_scores = torch.nn.functional.log_softmax(batch_scores, dim=1)
    # Probability assigned to "yes" is the relevance score in [0, 1]
    scores = batch_scores[:, 1].exp().tolist()
    return scores
task = 'Given a web search query, retrieve relevant passages that answer the query'
query = "what is photosynthesis ?"
documents = [
"The French Revolution began in 1789...",
"Some plants are carnivorous and digest insects.",
"Photosynthesis is the process by which plants convert light into chemical energy.",
]
pairs = [format_instruction(task, query, doc) for doc in documents]
# Tokenize the input texts
inputs = process_inputs(pairs)
scores = compute_logits(inputs)
print("scores: ", scores)
# output : scores: [0.0002269744873046875, 0.00042247772216796875, 0.994140625]
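# Scores are probabilities in [0, 1], so they also work as a relevance filter;
# 0.5 below is an assumed cutoff, tune it on your own data:
relevant_docs = [doc for doc, s in zip(documents, scores) if s > 0.5]
print(relevant_docs)  # keeps only the photosynthesis passage in this example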
# Sort documents by score, highest first
sorted_docs = [doc for _, doc in sorted(zip(scores, documents), reverse=True)]
print(sorted_docs)

# Keep the top 3 documents after reranking
top_3_docs = sorted_docs[:3]
print(top_3_docs)
```
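For use in a pipeline, the steps above can be wrapped in a single helper. This is a sketch that reuses the functions defined in the example; the `rerank` name and its `top_k` parameter are assumptions of this card, not part of the upstream API:

```python
def rerank(query, documents, task=None, top_k=5):
    """Score documents against a query; return (score, doc) pairs, best first."""
    pairs = [format_instruction(task, query, doc) for doc in documents]
    scores = compute_logits(process_inputs(pairs))
    ranked = sorted(zip(scores, documents), key=lambda x: x[0], reverse=True)
    return ranked[:top_k]

for score, doc in rerank(query, documents, task=task, top_k=2):
    print(f"{score:.4f}  {doc}")
```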
## Citation

```bibtex
@article{qwen3embedding,
  title={Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models},
  author={Zhang, Yanzhao and Li, Mingxin and Long, Dingkun and Zhang, Xin and Lin, Huan and Yang, Baosong and Xie, Pengjun and Yang, An and Liu, Dayiheng and Lin, Junyang and Huang, Fei and Zhou, Jingren},
  journal={arXiv preprint arXiv:2506.05176},
  year={2025}
}
```
## Disclaimer
I am not the creator or original owner of the Qwen/Qwen3 models. This repository provides a quantized version strictly for compatibility and deployment. All rights to the underlying models remain with the original authors. This repository adheres to the same license and usage terms as the upstream (base) model. Please review the original license for details on permissions and limitations.