---
license: mit
language:
  - multilingual
library_name: text-embeddings-inference
base_model: BAAI/bge-m3
tags:
  - onnx
  - feature-extraction
  - embeddings
  - sentence-transformers
---

# bge-m3 (ONNX, dynamic axes)

Self-converted ONNX export of [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3),
hosted by [Newtech Studio](https://huggingface.co/newtechstudio) for use with
[text-embeddings-inference (TEI)](https://github.com/huggingface/text-embeddings-inference).

## Why this exists

BAAI's upstream `bge-m3` repo includes ONNX files, but their export bakes a
**static batch dimension** into the graph. TEI's ORT backend then logs:

```
WARN: Backend does not support a batch size > 8
WARN: forcing `max_batch_requests=8`
```

…which capped every server we ran at a batch size of 8, regardless of the
value we passed to `--max-client-batch-size`. The warnings are easy to miss.
Under heavy indexing load, TEI split a single client batch of 64 or more
chunks into 8 or more internal sub-batches, holding throughput well below
what the hardware could deliver.
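The arithmetic behind those numbers can be sketched in a few lines. This is a simplified model, not TEI's actual scheduler: it assumes a client batch is split evenly at the backend cap and ignores the token budget. The function names are made up for illustration.

```python
import math


def sub_batches(client_batch: int, backend_cap: int) -> int:
    """Backend calls needed to process one client batch under a hard cap."""
    return math.ceil(client_batch / backend_cap)


def usable_fraction(configured: int, backend_cap: int) -> float:
    """Share of the configured batch size a single backend call can use."""
    return min(configured, backend_cap) / configured


print(sub_batches(64, 8))       # 8
print(sub_batches(128, 8))      # 16
print(usable_fraction(128, 8))  # 0.0625, i.e. 1/16
```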

This export uses `optimum-cli`'s **default dynamic batch + sequence axes**,
which lets ORT honor whatever batch size TEI's CLI flags allow. On a
CPU-only TEI deployment with `--max-client-batch-size=128`, the bulk lane
goes from using 8 of the 128 configured slots per backend call (1/16) to
processing a full client batch in a single call.

Same weights as upstream — the export only changes the graph's shape
declarations and the file format, not the math.
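One way to confirm that an export has dynamic axes is to read the input shapes from the graph without loading the weights. The sketch below assumes the `onnx` package is installed and that the model path is passed on the command line. The helper names are hypothetical, and the symbolic axis names you see depend on the exporter.

```python
import sys


def axis_labels(dims):
    """Return one label per axis: symbolic name, fixed size, or '?' if unset."""
    labels = []
    for d in dims:
        # ONNX stores a symbolic (dynamic) axis in dim_param
        # and a fixed-size axis in dim_value.
        if d.dim_param:
            labels.append(d.dim_param)
        elif d.dim_value > 0:
            labels.append(d.dim_value)
        else:
            labels.append("?")
    return labels


def is_dynamic(dims, axis=0):
    """True when the given axis is not pinned to a fixed integer size."""
    return not isinstance(axis_labels(dims)[axis], int)


def report(path):
    import onnx  # imported lazily so the helpers above work without it

    # load_external_data=False reads only the graph, not the ~2.3 GB of weights.
    model = onnx.load(path, load_external_data=False)
    for inp in model.graph.input:
        dims = inp.type.tensor_type.shape.dim
        print(inp.name, axis_labels(dims), "dynamic batch:", is_dynamic(dims))


if __name__ == "__main__" and len(sys.argv) > 1:
    report(sys.argv[1])
```

For a dynamic export, the first axis of each input should print as a name rather than an integer.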

## Precision

**fp32** (~2.3 GB external data). We tried fp16 (1.1 GB) but TEI's ORT
backend on CPU explicitly rejects float16:

```
ERROR: Could not start ORT backend: Dtype float16 is not supported
for `ort`, only float32.
```

If you're running TEI on a GPU backend that accepts float16 (CUDA /
TensorRT), an fp16 build should work and roughly halve the disk and memory
footprint. On CPU, stick with fp32.

## Reproduction

```bash
pip install -U "optimum[exporters,onnxruntime]" transformers onnx
optimum-cli export onnx \
  --model BAAI/bge-m3 \
  --task feature-extraction \
  --opset 17 \
  ./out
```

Output: `out/model.onnx` (graph) and `out/model.onnx_data` (external weights),
plus `config.json`, `tokenizer.json`, `tokenizer_config.json`, and
`special_tokens_map.json`. This layout matches what TEI expects.
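Before uploading or mounting the export, a quick check that every file listed above is present can save a failed server start. This is a minimal sketch: the file list is copied from the output described above, and `missing_files` is a hypothetical helper, not part of `optimum` or TEI.

```python
from pathlib import Path

# File names taken from the export output described above.
EXPECTED = [
    "model.onnx",
    "model.onnx_data",
    "config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
]


def missing_files(out_dir):
    """Return the expected file names that are absent from out_dir."""
    root = Path(out_dir)
    return [name for name in EXPECTED if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files("./out")
    print("missing:", missing if missing else "none")
```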

## TEI usage

```yaml
command:
  - --model-id=newtechstudio/bge-m3-onnx
  - --max-client-batch-size=128       # honored, no longer capped
  - --max-batch-tokens=16384
```
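On the client side, an indexing job can send chunks in batches sized to match `--max-client-batch-size`. The sketch below posts to TEI's `/embed` endpoint using only the standard library. The URL `http://localhost:8080/embed` is an assumption about your deployment, and `chunked`, `embed`, and `embed_all` are illustrative names.

```python
import json
from urllib import request


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed(texts, url="http://localhost:8080/embed"):
    """POST one batch of texts to TEI and return a list of embedding vectors."""
    body = json.dumps({"inputs": texts}).encode("utf-8")
    req = request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())


def embed_all(texts, batch_size=128, url="http://localhost:8080/embed"):
    """Embed any number of texts, one request per batch of `batch_size`."""
    vectors = []
    for batch in chunked(texts, batch_size):
        vectors.extend(embed(batch, url))
    return vectors
```

Keep `batch_size` at or below the server's `--max-client-batch-size`, since TEI rejects larger requests.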

## License

Inherits the [upstream license](https://huggingface.co/BAAI/bge-m3) (MIT).