---
license: mit
language:
- multilingual
library_name: text-embeddings-inference
base_model: BAAI/bge-m3
tags:
- onnx
- feature-extraction
- embeddings
- sentence-transformers
---
# bge-m3 (ONNX, dynamic axes)
Self-converted ONNX export of [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3),
hosted by [Newtech Studio](https://huggingface.co/newtechstudio) for use with
[text-embeddings-inference (TEI)](https://github.com/huggingface/text-embeddings-inference).
## Why this exists
BAAI's upstream `bge-m3` repo includes ONNX files, but their export bakes a
**static batch dimension** into the graph. TEI's ORT backend then logs:
```
WARN: Backend does not support a batch size > 8
WARN: forcing `max_batch_requests=8`
```
This capped every server we ran at batch=8, regardless of what we
configured via `--max-client-batch-size`. Under heavy indexing load, TEI
chopped a single client batch of 64+ chunks into 8+ internal sub-batches,
throttling throughput well below what the hardware could deliver.
This export uses `optimum-cli`'s **default dynamic batch and sequence axes**,
which lets ORT honor whatever batch size TEI's CLI flags allow. On a
CPU-only TEI deployment with `--max-client-batch-size=128`, the bulk lane
goes from 1/16 effective utilization (8 of 128 items per backend pass) to
full single-batch throughput.
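The batch figures above follow from simple ceiling division. A minimal sketch, where `sub_batches` is a hypothetical helper name used only for illustration:

```python
import math

def sub_batches(client_batch: int, backend_cap: int) -> int:
    """Number of backend passes needed to process one client batch."""
    return math.ceil(client_batch / backend_cap)

# Static export: the backend cap is forced to 8.
print(sub_batches(64, 8))     # 8 passes for a 64-chunk client batch
print(sub_batches(128, 8))    # 16 passes, so each pass covers 1/16 of the batch

# Dynamic export: the cap follows --max-client-batch-size=128.
print(sub_batches(128, 128))  # 1 pass
```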
Same weights as upstream: the export only changes the graph's shape
declarations and the file format, not the math.
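One way to confirm which kind of export you have is to read the input shapes from the graph. The sketch below assumes the `onnx` Python package is installed. The helper names (`read_input_dims`, `has_dynamic_batch`) are hypothetical, and the symbolic axis names in the comments are illustrative and may differ in your export.

```python
from typing import Dict, List, Union

Dim = Union[int, str, None]

def read_input_dims(model_path: str) -> Dict[str, List[Dim]]:
    """Return {input_name: [dim, ...]}; symbolic dims come back as strings."""
    import onnx  # imported lazily so has_dynamic_batch works without it

    # Only shape metadata is needed, so skip loading the external weights.
    model = onnx.load(model_path, load_external_data=False)
    dims: Dict[str, List[Dim]] = {}
    for inp in model.graph.input:
        shape: List[Dim] = []
        for d in inp.type.tensor_type.shape.dim:
            if d.HasField("dim_param"):
                shape.append(d.dim_param)  # symbolic axis, e.g. "batch_size"
            elif d.HasField("dim_value"):
                shape.append(d.dim_value)  # fixed integer axis
            else:
                shape.append(None)         # unspecified axis
        dims[inp.name] = shape
    return dims

def has_dynamic_batch(dims: Dict[str, List[Dim]]) -> bool:
    """True if every input's first axis is something other than a fixed integer."""
    return bool(dims) and all(
        len(shape) > 0 and not isinstance(shape[0], int)
        for shape in dims.values()
    )
```

Calling `has_dynamic_batch(read_input_dims("out/model.onnx"))` should return `True` for a dynamic-axes export and `False` when the batch dimension is a fixed integer.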
## Precision
**fp32** (~2.3 GB external data). We tried fp16 (1.1 GB) but TEI's ORT
backend on CPU explicitly rejects float16:
```
ERROR: Could not start ORT backend: Dtype float16 is not supported
for `ort`, only float32.
```
If you're running TEI on a CUDA / TensorRT backend, an fp16 build would
work and halve the disk and memory footprint; on CPU, stick with fp32.
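The two sizes quoted above are consistent with plain bytes-per-parameter arithmetic. A rough sketch, assuming a parameter count of about 568 million (an approximate figure that this card does not state) and ignoring graph overhead:

```python
# Assumed approximate parameter count; not taken from this model card.
PARAMS = 568_000_000

def weights_gb(params: int, bytes_per_param: int) -> float:
    """Approximate weight size in decimal gigabytes."""
    return params * bytes_per_param / 1e9

print(f"fp32: {weights_gb(PARAMS, 4):.2f} GB")  # 4 bytes per parameter
print(f"fp16: {weights_gb(PARAMS, 2):.2f} GB")  # 2 bytes per parameter
```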
## Reproduction
```bash
pip install -U "optimum[exporters,onnxruntime]" transformers onnx
optimum-cli export onnx \
  --model BAAI/bge-m3 \
  --task feature-extraction \
  --opset 17 \
  ./out
```
Output: `out/model.onnx` (graph) and `out/model.onnx_data` (weights, stored
as external data), plus `config.json`, `tokenizer.json`,
`tokenizer_config.json`, and `special_tokens_map.json`. This layout matches
what TEI expects.
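Before uploading, it can help to check that the export directory contains every file listed above. A minimal standard-library sketch, where `missing_files` is a hypothetical helper and `./out` is the output path from the command above:

```python
from pathlib import Path
from typing import List

# File names taken from the export output listed above.
EXPECTED_FILES = [
    "model.onnx",
    "model.onnx_data",
    "config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
]

def missing_files(export_dir: str) -> List[str]:
    """Return the expected file names that are absent from export_dir."""
    root = Path(export_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]

if __name__ == "__main__":
    missing = missing_files("./out")
    print("export looks complete" if not missing else f"missing: {missing}")
```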
## TEI usage
```yaml
command:
- --model-id=newtechstudio/bge-m3-onnx
- --max-client-batch-size=128 # honored, no longer capped
- --max-batch-tokens=16384
```
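On the client side, requests can then be sized up to the configured limit. The sketch below posts to TEI's `/embed` endpoint using only the standard library. The host and port in `TEI_URL` are assumptions about your deployment, and `batched` and `embed` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Iterator, List

TEI_URL = "http://localhost:8080/embed"  # assumed host and port
MAX_CLIENT_BATCH = 128                   # matches --max-client-batch-size above

def batched(texts: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` texts."""
    for start in range(0, len(texts), size):
        yield texts[start:start + size]

def embed(texts: List[str], url: str = TEI_URL) -> List[List[float]]:
    """Embed texts, sending one request per client batch."""
    vectors: List[List[float]] = []
    for batch in batched(texts, MAX_CLIENT_BATCH):
        body = json.dumps({"inputs": batch}).encode("utf-8")
        req = urllib.request.Request(
            url, data=body, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(req) as resp:
            vectors.extend(json.loads(resp.read()))
    return vectors
```

With the settings above, 300 chunks would go out as three requests of 128, 128, and 44 inputs.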
## License
Inherits the [upstream license](https://huggingface.co/BAAI/bge-m3) (MIT).