Qwen3-Reranker-0.6B-Q8

This repository contains an 8-bit quantized Safetensors version of the Qwen3-Reranker-0.6B model.
The reranker is a cross-encoder binary-relevance model designed to judge whether a document is relevant to a query (output: yes/no).

This quantized version is optimized for:

  • Lower memory usage
  • CPU-friendly inference
  • Faster loading
  • Easier deployment in retrieval pipelines

🔍 Model Overview

| Property | Details |
|---|---|
| Base Model | Qwen/Qwen3-Reranker-0.6B |
| Architecture | Cross-Encoder Reranker |
| Quantization | 8-bit (bitsandbytes) |
| Inference Device | CPU or GPU |
| Task | Document relevance classification |
| Output | Yes / No probability |

This model takes a (Query, Document) pair and returns the probability that the document answers the query.
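If you would rather quantize the FP16 base model on the fly with bitsandbytes instead of using this pre-quantized checkpoint, the standard Transformers recipe looks like this (a sketch only; 8-bit bitsandbytes quantization requires a CUDA device):

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Sketch: on-the-fly 8-bit quantization of the upstream base model (CUDA only)
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-Reranker-0.6B",
    quantization_config=bnb_config,
    device_map="auto",
).eval()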


馃 Intended Use

This model is intended for retrieval-augmented generation (RAG) pipelines and information retrieval tasks such as the following (a minimal pipeline sketch appears after the list):

  • Ranking retrieved passages
  • Filtering irrelevant search results
  • Enhancing LLM-based RAG quality
  • Improving semantic search precision
  • Document scoring for QA systems
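A minimal sketch of where the reranker sits in such a pipeline. Here retrieve and score are hypothetical stand-ins: retrieve is any high-recall first-stage retriever (BM25, dense embeddings), and score wraps the Transformers recipe below, returning one P("yes") per document:

def rerank(query, candidates, score, top_k=5):
    # Score every candidate with the cross-encoder, then keep the best top_k
    probs = score(query, candidates)
    ranked = sorted(zip(probs, candidates), key=lambda p: p[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]

# candidates = retrieve(query, k=50)           # cheap, high-recall first stage
# top_docs = rerank(query, candidates, score)  # precise cross-encoder second stage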

Transformers Usage

# Requires: transformers>=4.51.0, accelerate, bitsandbytes

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "ManiKumarAdapala/Qwen3-Reranker-0.6B-Q8_0-Safetensors"
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side='left')

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision; use torch.float32 on CPUs without fast fp16
    device_map={"": "cpu"},     # use {"": "cuda"} to run on a GPU
).eval()


# We recommend enabling flash_attention_2 for better acceleration and memory savings on CUDA devices.
# model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, attn_implementation="flash_attention_2").cuda().eval()

# Single-token ids for "no" and "yes"; relevance is read from these two logits
token_false_id = tokenizer.convert_tokens_to_ids("no")
token_true_id = tokenizer.convert_tokens_to_ids("yes")
max_length = 128  # longer inputs cost more compute; raise this if your documents get truncated

# Qwen3 chat-template wrapper: a system prompt that constrains the answer to "yes"/"no",
# plus an empty <think> block so the model answers immediately
prefix = "<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be \"yes\" or \"no\".<|im_end|>\n<|im_start|>user\n"
suffix = "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"

prefix_tokens = tokenizer.encode(prefix, add_special_tokens=False)
suffix_tokens = tokenizer.encode(suffix, add_special_tokens=False)

def format_instruction(instruction, query, doc):
    if instruction is None:
        instruction = 'Given a web search query, retrieve relevant passages that answer the query'
    return f"<Instruct>: {instruction}\n<Query>: {query}\n<Document>: {doc}"

# Tokenize each pair, wrap it with the chat-template prefix/suffix tokens, then left-pad into one batch
def process_inputs(pairs):
    inputs = tokenizer(
        pairs, padding=False, truncation='longest_first',
        return_attention_mask=False, max_length=max_length - len(prefix_tokens) - len(suffix_tokens)
    )
    for i, ele in enumerate(inputs['input_ids']):
        inputs['input_ids'][i] = prefix_tokens + ele + suffix_tokens
    inputs = tokenizer.pad(inputs, padding=True, return_tensors="pt", max_length=max_length)
    for key in inputs:
        inputs[key] = inputs[key].to(model.device)
    return inputs


# Take the last-position logits for "no"/"yes", softmax over the pair, and return P("yes")
@torch.no_grad()  # inference only, so skip gradient tracking
def compute_logits(inputs, **kwargs):
    batch_scores = model(**inputs).logits[:, -1, :]
    true_vector = batch_scores[:, token_true_id]
    false_vector = batch_scores[:, token_false_id]
    batch_scores = torch.stack([false_vector, true_vector], dim=1)
    batch_scores = torch.nn.functional.log_softmax(batch_scores, dim=1)
    scores = batch_scores[:, 1].exp().tolist()
    return scores
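
# Note: for large candidate sets, call process_inputs/compute_logits over
# mini-batches of pairs to keep peak memory bounded (batch size is workload-dependent).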


task = 'Given a web search query, retrieve relevant passages that answer the query'

query = "what is photosynthesis?"

documents = [
    "The French Revolution began in 1789...",
    "Some plants are carnivorous and digest insects.",
    "Photosynthesis is the process by which plants convert light into chemical energy.",
]


pairs = [format_instruction(task, query, doc) for doc in documents]

# Tokenize the input texts
inputs = process_inputs(pairs)
scores = compute_logits(inputs)

print("scores: ", scores)
# output : scores:  [0.0002269744873046875, 0.00042247772216796875, 0.994140625]

# Sort documents by descending relevance score
sorted_docs = [doc for _, doc in sorted(zip(scores, documents), key=lambda p: p[0], reverse=True)]
print(sorted_docs)

# Keep the top 3 documents from the ranking
top_3_docs = sorted_docs[:3]
print(top_3_docs)
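
To filter rather than merely reorder, you can drop passages below a probability cutoff. A minimal sketch; the 0.5 threshold is an arbitrary assumption and should be tuned on your own data:

THRESHOLD = 0.5  # assumption: tune this cutoff on a validation set
relevant_docs = [doc for doc, s in zip(documents, scores) if s >= THRESHOLD]
print(relevant_docs)  # with the example scores above, only the photosynthesis passage remains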

Citation

@article{qwen3embedding,
  title={Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models},
  author={Zhang, Yanzhao and Li, Mingxin and Long, Dingkun and Zhang, Xin and Lin, Huan and Yang, Baosong and Xie, Pengjun and Yang, An and Liu, Dayiheng and Lin, Junyang and Huang, Fei and Zhou, Jingren},
  journal={arXiv preprint arXiv:2506.05176},
  year={2025}
}

Disclaimer:
I am not the creator or original owner of the Qwen/Qwen3 models. This repository provides a quantized version strictly for compatibility and deployment. All rights to the underlying models remain with the original authors. This repository adheres to the same license and usage terms as the upstream (base) model. Please review the original license for details on permissions and limitations.
