n24q02m/Qwen3-Reranker-0.6B-ONNX

ONNX-optimized version of Qwen/Qwen3-Reranker-0.6B for use with qwen3-embed and fastembed (PR #605).

Available Variants

Variant  File                        Size    Description
INT8     onnx/model_quantized.onnx   573 MB  Dynamic INT8 quantization (default)
Q4F16    onnx/model_q4f16.onnx       517 MB  INT4 weights + FP16 activations

Usage

qwen3-embed

pip install qwen3-embed

from qwen3_embed import TextCrossEncoder

# INT8 (default)
reranker = TextCrossEncoder("n24q02m/Qwen3-Reranker-0.6B-ONNX")
scores = list(reranker.rerank("What is AI?", ["AI is...", "Pizza is..."]))

# Custom instruction
scores = list(reranker.rerank(
    "What is AI?",
    ["doc1", "doc2"],
    instruction="Judge document relevance for code search.",
))

# Q4F16 (smaller, slightly less accurate)
reranker_q4 = TextCrossEncoder("n24q02m/Qwen3-Reranker-0.6B-ONNX-Q4F16")
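The calls above wrap rerank(...) in list(...), which suggests it yields one score per document in input order. Assuming that holds, pairing scores with documents and sorting gives a ranked list. In this minimal sketch, rank_documents is a hypothetical helper (not part of qwen3-embed or fastembed) and the scores are hard-coded stand-ins so the snippet runs without downloading the model:

```python
from typing import Iterable


def rank_documents(
    documents: list[str], scores: Iterable[float]
) -> list[tuple[str, float]]:
    """Pair each document with its score and sort best-first.

    Hypothetical helper: assumes `scores` holds one value per document,
    in the same order as `documents`.
    """
    scores = list(scores)
    if len(scores) != len(documents):
        raise ValueError("expected one score per document")
    return sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)


# Stand-in scores; in practice pass reranker.rerank(query, documents) instead.
documents = ["AI is...", "Pizza is..."]
ranked = rank_documents(documents, [0.91, 0.02])
for doc, score in ranked:
    print(f"{score:.2f}  {doc}")
```

Sorting outside the reranker keeps the original document order available, which helps when scores need to be joined back to IDs or metadata.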

fastembed

pip install fastembed

from fastembed import TextCrossEncoder

# INT8 (default)
reranker = TextCrossEncoder("Qwen/Qwen3-Reranker-0.6B")
scores = list(reranker.rerank("What is AI?", ["AI is...", "Pizza is..."]))

# Q4F16
reranker_q4 = TextCrossEncoder("Qwen/Qwen3-Reranker-0.6B-Q4F16")

Note: fastembed support requires PR #605. Alternatively, install from the fork:

pip install git+https://github.com/n24q02m/fastembed.git@feat/qwen3-support

Conversion Details

  • Source: Qwen/Qwen3-Reranker-0.6B
  • ONNX opset: 21
  • INT8: onnxruntime.quantization.quantize_dynamic (QInt8)
  • Q4F16: MatMulNBitsQuantizer (block_size=128, symmetric) + FP16 cast
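The INT8 step listed above could look roughly like the following. This is a sketch under assumptions, not the exact script used for this repository: the FP32 input path and the quantize_int8 wrapper are hypothetical, and only quantize_dynamic and QuantType.QInt8 come from the details above.

```python
from pathlib import Path


def quantize_int8(src: str, dst: str) -> Path:
    """Apply dynamic INT8 weight quantization to an ONNX model file.

    Hypothetical wrapper; the conversion script itself is not shown here.
    """
    src_path, dst_path = Path(src), Path(dst)
    if not src_path.exists():
        raise FileNotFoundError(f"ONNX model not found: {src_path}")

    # Imported here so the existence check above works even when
    # onnxruntime is not installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(
        model_input=str(src_path),
        model_output=str(dst_path),
        weight_type=QuantType.QInt8,  # signed 8-bit weights, per the list above
    )
    return dst_path


# Example call (input path is an assumption; the output name matches the INT8 variant):
# quantize_int8("onnx/model.onnx", "onnx/model_quantized.onnx")
```

Dynamic quantization needs no calibration data, because activation scales are computed at inference time.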
