How to use mlx-community/Qwen3-Reranker-0.6B-4bit with MLX:
# Download the model from the Hub
pip install "huggingface_hub[hf_xet]"
huggingface-cli download --local-dir Qwen3-Reranker-0.6B-4bit mlx-community/Qwen3-Reranker-0.6B-4bit
How to use mlx-community/Qwen3-Reranker-0.6B-4bit with sentence-transformers:
from sentence_transformers import CrossEncoder

model = CrossEncoder("mlx-community/Qwen3-Reranker-0.6B-4bit")

query = "Which planet is known as the Red Planet?"
passages = [
    "Venus is often called Earth's twin because of its similar size and proximity.",
    "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
    "Jupiter, the largest planet in our solar system, has a prominent red spot.",
    "Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
]

scores = model.predict([(query, passage) for passage in passages])
print(scores)
mlx-community/Qwen3-Reranker-0.6B-4bit
This model was converted to MLX format from Qwen/Qwen3-Reranker-0.6B using mlx-lm 0.31.3.
- Quantization: affine 4-bit, group_size=64 (~4.5 bits/weight)
- On-disk size: ~331 MB
- Task: text reranking (cross-encoder, yes/no relevance scoring)
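The ~4.5 bits/weight figure in the quantization line can be reproduced with a little arithmetic. The sketch below assumes each group of 64 weights stores one 16-bit scale and one 16-bit bias alongside the 4-bit values; that storage layout is an assumption, not something stated on this card, and the helper name `effective_bits_per_weight` is purely illustrative.

```python
def effective_bits_per_weight(bits: int, group_size: int,
                              overhead_bits_per_group: int = 32) -> float:
    """Average storage cost per weight for group-wise affine quantization.

    overhead_bits_per_group is ASSUMED to be 32: one 16-bit scale plus
    one 16-bit bias shared by every weight in the group.
    """
    return bits + overhead_bits_per_group / group_size


# 4-bit values with group_size=64: 4 + 32/64
print(effective_bits_per_weight(4, 64))  # prints 4.5
```

Under the same assumption, a smaller group size raises the per-weight overhead: `group_size=32` would give 5.0 bits/weight.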
Scoring recipe
Qwen3-Reranker is a causal LM used as a reranker: the relevance score of a
(query, document) pair is softmax([logit("no"), logit("yes")])[1] at the
last position of the prompt below.
import mlx.core as mx
from mlx_lm import load

model, tok = load("mlx-community/Qwen3-Reranker-0.6B-4bit")
# Use the underlying Hugging Face tokenizer if mlx-lm wrapped it.
hf = getattr(tok, "_tokenizer", tok)

INSTRUCT = "Given a web search query, retrieve relevant passages that answer the query"
PREFIX = ('<|im_start|>system\nJudge whether the Document meets the requirements '
          'based on the Query and the Instruct provided. Note that the answer can '
          'only be "yes" or "no".<|im_end|>\n<|im_start|>user\n')
SUFFIX = "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"

# Token ids for the two allowed answers, plus the fixed prompt prefix and suffix.
true_id, false_id = hf.convert_tokens_to_ids("yes"), hf.convert_tokens_to_ids("no")
pre, suf = hf.encode(PREFIX, add_special_tokens=False), hf.encode(SUFFIX, add_special_tokens=False)


def rerank_score(query, doc):
    content = f"<Instruct>: {INSTRUCT}\n<Query>: {query}\n<Document>: {doc}"
    ids = pre + hf.encode(content, add_special_tokens=False) + suf
    # Logits at the last prompt position.
    logits = model(mx.array([ids]))[:, -1, :]
    # Two-way softmax over ("no", "yes"); index 1 is P("yes").
    pair = mx.stack([logits[0, false_id], logits[0, true_id]])
    return float(mx.exp((pair - mx.logsumexp(pair))[1]))


print(rerank_score("What is the capital of China?", "The capital of China is Beijing."))
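The recipe above scores one (query, document) pair at a time; reranking is then a matter of scoring every candidate and sorting. The following is a minimal, model-free sketch of that step. `yes_probability`, `rank_documents`, and `toy_score` are hypothetical names introduced here, and `toy_score` uses made-up logits based on word overlap so that the example runs without downloading the model.

```python
import math
from typing import Callable, List, Tuple


def yes_probability(no_logit: float, yes_logit: float) -> float:
    """Two-way softmax over ("no", "yes"), returning the "yes" share."""
    m = max(no_logit, yes_logit)  # subtract the max for numerical stability
    e_no, e_yes = math.exp(no_logit - m), math.exp(yes_logit - m)
    return e_yes / (e_no + e_yes)


def rank_documents(query: str, docs: List[str],
                   score_fn: Callable[[str, str], float]) -> List[Tuple[str, float]]:
    """Score every document against the query and sort by descending score."""
    scored = [(doc, score_fn(query, doc)) for doc in docs]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def toy_score(query: str, doc: str) -> float:
    """Stand-in scorer (NOT the model): fake logits from shared lowercase words."""
    shared = set(query.lower().split()) & set(doc.lower().split())
    return yes_probability(no_logit=1.0, yes_logit=float(len(shared)))


ranked = rank_documents(
    "capital of China",
    ["Pandas eat bamboo.", "The capital of China is Beijing."],
    toy_score,
)
for doc, score in ranked:
    print(f"{score:.3f}  {doc}")
```

With the model loaded, passing `rerank_score` from the recipe as `score_fn` in place of `toy_score` should rank real passages the same way.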
Model size: 93.1M params
Tensor type: BF16 · U32