bge-reranker-v2-m3 (ONNX, dynamic axes)
Self-converted ONNX export of BAAI/bge-reranker-v2-m3, hosted by Newtech Studio for use with text-embeddings-inference (TEI).
Why this exists
BAAI's upstream bge-reranker-v2-m3 ships no ONNX files, so TEI falls
back to its Candle backend, which caps cross-encoder reranking at a
batch size of 4 and reports this only in its log:
WARN: Backend does not support a batch size > 4
WARN: forcing `max_batch_requests=4`
The community export at
onnx-community/bge-reranker-v2-m3-ONNX
provides ONNX files, but the conversion targets transformers.js (browser)
and bakes in a static batch=8 axis. That is better than 4, but a real RAG
query with 30+ candidate documents still gets sliced into at least 4 sub-batches.
This export uses optimum-cli's default dynamic batch + sequence axes,
which lets TEI's ORT backend honor --max-client-batch-size=32. A
30-pair rerank now runs as a single batch in ~600-800 ms (CPU) instead
of 4 internal sub-batches summing ~1 s.
The weights are identical to upstream; the export only changes the graph's shape declarations.
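The batch counts quoted above follow from simple ceiling division. The sketch below assumes the backend slices a request into consecutive sub-batches no larger than its cap; the function name `sub_batches` is hypothetical and not part of TEI.

```python
import math


def sub_batches(num_pairs: int, max_batch: int) -> int:
    """Number of forward passes needed when the backend caps the batch size."""
    if num_pairs < 0 or max_batch <= 0:
        raise ValueError("num_pairs must be >= 0 and max_batch must be > 0")
    return math.ceil(num_pairs / max_batch)


pairs = 30  # one rerank request with 30 query/document pairs
for label, cap in [
    ("Candle backend cap", 4),
    ("static batch=8 export", 8),
    ("dynamic axes with --max-client-batch-size=32", 32),
]:
    print(f"{label}: {sub_batches(pairs, cap)} sub-batch(es)")
```

Only the last configuration handles the 30-pair request in a single pass.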
Precision
fp32 (~2.3 GB external data). TEI's ORT backend on CPU rejects fp16:
ERROR: Could not start ORT backend: Dtype float16 is not supported
for `ort`, only float32.
GPU TEI deployments could use an fp16 build; CPU stays on fp32.
Reproduction
pip install -U "optimum[exporters,onnxruntime]" transformers onnx
optimum-cli export onnx \
  --model BAAI/bge-reranker-v2-m3 \
  --task text-classification \
  --opset 17 \
  ./out
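After exporting, it may be worth confirming that the batch and sequence axes really are symbolic. The following is a minimal sketch using the `onnx` Python package. It assumes the export wrote its graph to `./out/model.onnx`; the helper names `input_axes` and `has_dynamic_batch_and_sequence` are hypothetical.

```python
def input_axes(model_path):
    """Map each graph input to its axes: a str for a symbolic axis, an int for a fixed one."""
    import onnx  # imported lazily so the pure helper below works without onnx installed

    # Shape declarations live in the graph itself, so the external weight
    # data does not need to be loaded just to inspect them.
    model = onnx.load(model_path, load_external_data=False)
    axes = {}
    for graph_input in model.graph.input:
        dims = graph_input.type.tensor_type.shape.dim
        axes[graph_input.name] = [
            d.dim_param if d.HasField("dim_param") else d.dim_value for d in dims
        ]
    return axes


def has_dynamic_batch_and_sequence(axes):
    """True when the first two axes of every input are symbolic names, not fixed integers."""
    return bool(axes) and all(
        len(dims) >= 2 and all(isinstance(d, str) for d in dims[:2])
        for dims in axes.values()
    )


# Example (path is an assumption about where the export landed):
# print(has_dynamic_batch_and_sequence(input_axes("./out/model.onnx")))
```

A fixed integer such as 8 in the first position would indicate a static batch axis like the one described for the community export.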
TEI usage
command:
- --model-id=newtechstudio/bge-reranker-v2-m3-onnx
- --max-client-batch-size=32 # honored, no longer capped at 4 or 8
- --max-batch-tokens=4096
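A client can then send all candidates in one request. The sketch below assumes a TEI instance reachable at a base URL you supply (for example `http://localhost:8080`, which is an assumption about your deployment) and uses TEI's `/rerank` route with `query` and `texts` fields. The helper names and the client-side size check are illustrative, not part of TEI.

```python
import json
import urllib.request


def build_rerank_payload(query, texts, max_client_batch_size=32):
    """Build the JSON body for one rerank request, refusing oversized batches early."""
    texts = list(texts)
    if len(texts) > max_client_batch_size:
        raise ValueError(
            f"{len(texts)} texts exceed max_client_batch_size={max_client_batch_size}"
        )
    return {"query": query, "texts": texts}


def rerank(base_url, query, texts, max_client_batch_size=32):
    """POST one rerank request and return the decoded JSON response."""
    body = json.dumps(
        build_rerank_payload(query, texts, max_client_batch_size)
    ).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/rerank",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


# Example (requires a running server; the URL is an assumption):
# results = rerank("http://localhost:8080", "what is a reranker?", candidate_texts)
```

Keeping `max_client_batch_size` in the client equal to the server flag makes an oversized candidate list fail locally instead of at the server.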
License
Inherits the upstream license (Apache-2.0).