# Qwen3-Reranker-0.6B-Q8
This repository contains an 8-bit quantized Safetensors version of the Qwen3-Reranker-0.6B model.
The reranker is a cross-encoder relevance model: given a query and a document, it judges whether the document is relevant and answers with "yes" or "no".
This quantized version is optimized for:
- Lower memory usage
- CPU-friendly inference
- Faster loading
- Easier deployment in retrieval pipelines
## 🔍 Model Overview
| Property | Details |
|---|---|
| Base Model | Qwen/Qwen3-Reranker-0.6B |
| Architecture | Cross-Encoder Reranker |
| Quantization | 8-bit (bitsandbytes) |
| Inference Device | CPU or GPU |
| Task | Document relevance classification |
| Output | Relevance probability ("yes" vs. "no") |
This model takes a (Query, Document) pair and returns the probability that the document answers the query.
## Intended Use
This model is intended for RAG pipelines and information retrieval tasks (see the pipeline sketch after this list), such as:
- Ranking retrieved passages
- Filtering irrelevant search results
- Enhancing LLM-based RAG quality
- Improving semantic search precision
- Document scoring for QA systems
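As a sketch of where the reranker fits, the snippet below shows a two-stage retrieve-then-rerank pipeline. The `retrieve`, `rerank`, and `llm` callables are hypothetical placeholders (a concrete `rerank` is built in the Transformers section below); only the control flow is meant literally.

```python
def rag_answer(query, corpus, retrieve, rerank, llm, k=20, top_k=3):
    # Stage 1: cheap, high-recall retrieval (e.g., BM25 or a bi-encoder).
    candidates = retrieve(query, corpus, k=k)        # hypothetical helper
    # Stage 2: precise cross-encoder scoring of each (query, document) pair.
    ranked = rerank(query, candidates, top_k=top_k)  # hypothetical helper
    # Stage 3: generate an answer from only the highest-scoring context.
    context = "\n\n".join(doc for _, doc in ranked)
    return llm(f"Context:\n{context}\n\nQuestion: {query}")
```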
## Transformers Usage

```python
# Requires: transformers>=4.51.0, accelerate, bitsandbytes
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "ManiKumarAdapala/Qwen3-Reranker-0.6B-Q8_0-Safetensors"
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side='left')
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # halves memory; use torch.float32 if fp16 is slow on your CPU
    device_map={"": "cpu"},     # set to "cuda" for GPU inference
).eval()
# On CUDA, flash_attention_2 gives better speed and memory efficiency:
# model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, attn_implementation="flash_attention_2").cuda().eval()
token_false_id = tokenizer.convert_tokens_to_ids("no")
token_true_id = tokenizer.convert_tokens_to_ids("yes")
max_length = 128  # total token budget per (query, document) pair; larger values cost more compute
prefix = "<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be \"yes\" or \"no\".<|im_end|>\n<|im_start|>user\n"
suffix = "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"
prefix_tokens = tokenizer.encode(prefix, add_special_tokens=False)
suffix_tokens = tokenizer.encode(suffix, add_special_tokens=False)
def format_instruction(instruction, query, doc):
    if instruction is None:
        instruction = 'Given a web search query, retrieve relevant passages that answer the query'
    return "<Instruct>: {instruction}\n<Query>: {query}\n<Document>: {doc}".format(
        instruction=instruction, query=query, doc=doc
    )
def process_inputs(pairs):
    # Tokenize without padding, reserving room for the prefix and suffix tokens
    inputs = tokenizer(
        pairs, padding=False, truncation='longest_first',
        return_attention_mask=False,
        max_length=max_length - len(prefix_tokens) - len(suffix_tokens),
    )
    # Wrap each example in the chat-template prefix and suffix
    for i, ele in enumerate(inputs['input_ids']):
        inputs['input_ids'][i] = prefix_tokens + ele + suffix_tokens
    inputs = tokenizer.pad(inputs, padding=True, return_tensors="pt", max_length=max_length)
    for key in inputs:
        inputs[key] = inputs[key].to(model.device)
    return inputs
@torch.no_grad()  # inference only: skip gradient tracking to save memory
def compute_logits(inputs, **kwargs):
    # Logits at the last position decide between the "yes" and "no" tokens
    batch_scores = model(**inputs).logits[:, -1, :]
    true_vector = batch_scores[:, token_true_id]
    false_vector = batch_scores[:, token_false_id]
    batch_scores = torch.stack([false_vector, true_vector], dim=1)
    batch_scores = torch.nn.functional.log_softmax(batch_scores, dim=1)
    # Probability assigned to "yes" is the relevance score in [0, 1]
    scores = batch_scores[:, 1].exp().tolist()
    return scores
task = 'Given a web search query, retrieve relevant passages that answer the query'
query = "what is photosynthesis ?"
documents = [
"The French Revolution began in 1789...",
"Some plants are carnivorous and digest insects.",
"Photosynthesis is the process by which plants convert light into chemical energy.",
]
pairs = [format_instruction(task, query, doc) for doc in documents]
# Tokenize the input texts
inputs = process_inputs(pairs)
scores = compute_logits(inputs)
print("scores: ", scores)
# output : scores: [0.0002269744873046875, 0.00042247772216796875, 0.994140625]
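# Scores are probabilities in [0, 1], so they also work as a relevance filter;
# 0.5 below is an assumed cutoff, tune it on your own data:
relevant_docs = [doc for doc, s in zip(documents, scores) if s > 0.5]
print(relevant_docs)  # keeps only the photosynthesis passage in this example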
# Sort documents by score, highest first
sorted_docs = [doc for _, doc in sorted(zip(scores, documents), reverse=True)]
print(sorted_docs)

# Keep the top 3 documents after reranking
top_3_docs = sorted_docs[:3]
print(top_3_docs)
```
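For use in a pipeline, the steps above can be wrapped in a single helper. This is a sketch that reuses the functions defined in the example; the `rerank` name and its `top_k` parameter are assumptions of this card, not part of the upstream API:

```python
def rerank(query, documents, task=None, top_k=5):
    """Score documents against a query; return (score, doc) pairs, best first."""
    pairs = [format_instruction(task, query, doc) for doc in documents]
    scores = compute_logits(process_inputs(pairs))
    ranked = sorted(zip(scores, documents), key=lambda x: x[0], reverse=True)
    return ranked[:top_k]

for score, doc in rerank(query, documents, task=task, top_k=2):
    print(f"{score:.4f}  {doc}")
```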
## Citation

```bibtex
@article{qwen3embedding,
  title={Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models},
  author={Zhang, Yanzhao and Li, Mingxin and Long, Dingkun and Zhang, Xin and Lin, Huan and Yang, Baosong and Xie, Pengjun and Yang, An and Liu, Dayiheng and Lin, Junyang and Huang, Fei and Zhou, Jingren},
  journal={arXiv preprint arXiv:2506.05176},
  year={2025}
}
```
## Disclaimer
I am not the creator or original owner of the Qwen/Qwen3 models. This repository provides a quantized version strictly for compatibility and deployment. All rights to the underlying models remain with the original authors. This repository adheres to the same license and usage terms as the upstream (base) model. Please review the original license for details on permissions and limitations.