bge-reranker-v2-m3 (ONNX, dynamic axes)

Self-converted ONNX export of BAAI/bge-reranker-v2-m3, hosted by Newtech Studio for use with text-embeddings-inference (TEI).

Why this exists

BAAI's upstream bge-reranker-v2-m3 ships no ONNX files, so TEI falls back to its Candle backend, which caps cross-encoder reranking at batch=4 and reports the cap only as a log warning:

WARN: Backend does not support a batch size > 4
WARN: forcing `max_batch_requests=4`

The community export at onnx-community/bge-reranker-v2-m3-ONNX provides ONNX files, but the conversion targets transformers.js (browser) and bakes in a static batch=8 axis. That is better than 4, but a real RAG query with 30+ candidate documents still gets sliced into at least 4 sub-batches.

This export uses optimum-cli's default dynamic batch + sequence axes, which lets TEI's ORT backend honor --max-client-batch-size=32. A 30-pair rerank now runs as a single batch in ~600-800 ms (CPU) instead of 4 internal sub-batches summing ~1 s.
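The sub-batch counts quoted above follow from a ceiling division of the number of query/document pairs by the effective batch limit. A minimal sketch of that arithmetic, assuming the backend simply splits a request into consecutive chunks (the function name `sub_batches` is illustrative, not part of TEI):

```python
import math


def sub_batches(num_pairs: int, batch_cap: int) -> int:
    """Number of forward passes needed when at most batch_cap pairs fit in one batch."""
    if num_pairs < 0 or batch_cap <= 0:
        raise ValueError("num_pairs must be >= 0 and batch_cap must be > 0")
    return math.ceil(num_pairs / batch_cap)


# The three configurations discussed in this section, for a 30-pair rerank.
for label, cap in [
    ("Candle fallback (cap 4)", 4),
    ("static batch=8 export", 8),
    ("dynamic axes, client batch size 32", 32),
]:
    print(f"{label}: {sub_batches(30, cap)} sub-batch(es)")
```

For 30 pairs this gives 8, 4, and 1 sub-batches respectively.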

Same weights as upstream — the export only changes the graph's shape declarations.

Precision

fp32 (~2.3 GB external data). TEI's ORT backend on CPU rejects fp16:

ERROR: Could not start ORT backend: Dtype float16 is not supported
for `ort`, only float32.

GPU TEI deployments could use an fp16 build; CPU stays on fp32.

Reproduction

pip install -U "optimum[exporters,onnxruntime]" transformers onnx
optimum-cli export onnx \
  --model BAAI/bge-reranker-v2-m3 \
  --task text-classification \
  --opset 17 \
  ./out
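One way to confirm that the export really has dynamic axes is to inspect the input shapes stored in the graph. The sketch below is an illustration under assumptions: it expects the export to be written to `./out/model.onnx`, the helper names are hypothetical, and the exact symbolic axis names (for example `batch_size`) depend on the exporter version.

```python
from typing import Dict, List, Union

Dim = Union[int, str]


def describe_inputs(model) -> Dict[str, List[Dim]]:
    """Map each graph input to its shape; symbolic (dynamic) dims appear as strings."""
    shapes: Dict[str, List[Dim]] = {}
    for inp in model.graph.input:
        dims: List[Dim] = []
        for d in inp.type.tensor_type.shape.dim:
            # An ONNX dimension is either symbolic (dim_param) or fixed (dim_value).
            dims.append(d.dim_param if d.dim_param else d.dim_value)
        shapes[inp.name] = dims
    return shapes


def has_dynamic_batch(model) -> bool:
    """True if axis 0 of every graph input is symbolic rather than a fixed size."""
    shapes = describe_inputs(model)
    return bool(shapes) and all(
        dims and isinstance(dims[0], str) for dims in shapes.values()
    )


def check_export(path: str = "./out/model.onnx") -> bool:
    """Load only the graph (not the external weights) and report its input shapes."""
    import onnx  # installed by the pip command above

    # load_external_data=False skips the ~2.3 GB weight file; shapes live in the graph.
    model = onnx.load(path, load_external_data=False)
    for name, dims in describe_inputs(model).items():
        print(name, dims)
    return has_dynamic_batch(model)
```

Calling `check_export()` after the export should print string-valued dimensions for each input if the axes are dynamic; a fixed integer in the first position would indicate a static batch axis.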

TEI usage

command:
  - --model-id=newtechstudio/bge-reranker-v2-m3-onnx
  - --max-client-batch-size=32        # honored, no longer capped at 4 or 8
  - --max-batch-tokens=4096
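With the server running, a rerank request can be sent to TEI's `/rerank` endpoint. The following is a minimal client sketch using only the standard library. The base URL `http://localhost:8080` and the helper names are assumptions for illustration; the response is assumed to be a JSON list of objects with `index` and `score` fields.

```python
import json
import urllib.request
from typing import Dict, List, Tuple

TEI_URL = "http://localhost:8080"  # assumption: adjust to your deployment


def build_rerank_request(
    query: str, texts: List[str], base_url: str = TEI_URL
) -> urllib.request.Request:
    """Build a POST request carrying one query and its candidate texts."""
    body = json.dumps({"query": query, "texts": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/rerank",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def rank_texts(texts: List[str], results: List[Dict]) -> List[Tuple[float, str]]:
    """Pair each returned score with its original text, highest score first."""
    ranked = [(r["score"], texts[r["index"]]) for r in results]
    return sorted(ranked, key=lambda pair: pair[0], reverse=True)


def rerank(
    query: str, texts: List[str], base_url: str = TEI_URL, timeout: float = 30.0
) -> List[Tuple[float, str]]:
    """Send all candidates in one request so the server can batch them together."""
    req = build_rerank_request(query, texts, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        results = json.load(resp)
    return rank_texts(texts, results)
```

Keeping the number of texts per request at or below the configured --max-client-batch-size (32 here) lets a typical 30-candidate query go through as a single call.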

License

Inherits the upstream license (Apache-2.0).
