	---
	license: mit
	language:
	- multilingual
	library_name: text-embeddings-inference
	base_model: BAAI/bge-m3
	tags:
	- onnx
	- feature-extraction
	- embeddings
	- sentence-transformers
	---

	# bge-m3 (ONNX, dynamic axes)

	Self-converted ONNX export of [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3),
	hosted by [Newtech Studio](https://huggingface.co/newtechstudio) for use with
	[text-embeddings-inference (TEI)](https://github.com/huggingface/text-embeddings-inference).

	## Why this exists

	BAAI's upstream `bge-m3` repo includes ONNX files, but their export bakes a
	static batch dimension into the graph. TEI's ORT backend then logs:

	```
	WARN: Backend does not support a batch size > 8
	WARN: forcing `max_batch_requests=8`
	```

	…effectively capping every server we ran at a batch size of 8, regardless
	of what we configured via `--max-client-batch-size`. Under heavy indexing
	load, TEI split each client batch of 64 or more chunks into at least 8
	internal sub-batches, holding throughput well below what the hardware
	could deliver.

	This export uses the dynamic batch and sequence axes that `optimum-cli`
	produces by default, so ORT accepts whatever batch size TEI's CLI flags
	allow. On a CPU-only TEI deployment with `--max-client-batch-size=128`, a
	full client batch now runs as one batch of 128 instead of 16 sub-batches
	of 8, so the bulk lane goes from 1/16 effective utilization to full
	single-batch throughput.
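
The sub-batch counts quoted in this section follow from ceiling division. The sketch below is illustrative only: it assumes the backend cap is the sole limit on internal batch size (in practice `--max-batch-tokens` also applies), and `sub_batches` is a hypothetical helper, not part of TEI.

```python
import math


def sub_batches(client_batch: int, backend_cap: int) -> int:
    """Number of internal batches needed to serve one client batch."""
    return math.ceil(client_batch / backend_cap)


for client_batch in (64, 128):
    # With the static export the backend cap is 8.
    capped = sub_batches(client_batch, backend_cap=8)
    # Simplification: with dynamic axes the whole client batch fits in one pass.
    dynamic = sub_batches(client_batch, backend_cap=client_batch)
    print(f"{client_batch} inputs: {capped} sub-batches capped at 8, {dynamic} with dynamic axes")
```

For 128 inputs this gives 16 sub-batches under the cap, which is where the 1/16 figure comes from.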

	The weights are the same as upstream: the export changes only the graph's
	shape declarations and the file format, not the math.
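
One way to check whether an ONNX file declares dynamic axes is to inspect its graph inputs with the `onnx` Python package. This is a minimal sketch, assuming `onnx` is installed and the model lives at a path such as `out/model.onnx`; `input_dims` and `dynamic_axes` are hypothetical helper names. A symbolic (dynamic) axis shows up as a string, a static one as an integer.

```python
def dynamic_axes(dims):
    """Return the positions of symbolic (dynamic) axes in a list of dims.

    Each dim is either an int (static size) or a str (symbolic name).
    """
    return [i for i, d in enumerate(dims) if isinstance(d, str)]


def input_dims(model_path):
    """Read every graph input's dims from an ONNX file without loading weights."""
    import onnx  # imported lazily so dynamic_axes() works without onnx installed

    model = onnx.load(model_path, load_external_data=False)
    result = {}
    for inp in model.graph.input:
        dims = []
        for d in inp.type.tensor_type.shape.dim:
            if d.HasField("dim_param"):
                dims.append(d.dim_param)  # symbolic name: dynamic axis
            elif d.HasField("dim_value"):
                dims.append(d.dim_value)  # fixed size: static axis
            else:
                dims.append("?")  # unspecified dim, treated as dynamic
        result[inp.name] = dims
    return result


# Example (path is an assumption):
#   for name, dims in input_dims("out/model.onnx").items():
#       print(name, dims, "dynamic axes:", dynamic_axes(dims))
```

If axis 0 of an input is missing from the `dynamic_axes` result, the batch dimension is fixed in the graph.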

	## Precision

	This export is fp32 (~2.3 GB of external data). We tried fp16 (1.1 GB),
	but TEI's ORT backend on CPU explicitly rejects float16:

	```
	ERROR: Could not start ORT backend: Dtype float16 is not supported
	for `ort`, only float32.
	```

	If you're running TEI on a CUDA / TensorRT backend, an fp16 build should
	work in principle and would halve the disk and memory footprint; on CPU,
	stick with fp32.
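
The two sizes are consistent with simple bytes-per-parameter arithmetic. The sketch below assumes a parameter count of roughly 568 million for `bge-m3`; that figure is an approximation brought in for illustration and is not stated in this card. It also ignores the small overhead of the graph itself.

```python
# Assumption: approximate parameter count, not taken from this model card.
APPROX_PARAMS = 568_000_000


def weights_gb(num_params: int, bytes_per_param: int) -> float:
    """Raw weight size in decimal gigabytes, ignoring graph overhead."""
    return num_params * bytes_per_param / 1e9


print(f"fp32: {weights_gb(APPROX_PARAMS, 4):.1f} GB")  # 4 bytes per weight
print(f"fp16: {weights_gb(APPROX_PARAMS, 2):.1f} GB")  # 2 bytes per weight
```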

	## Reproduction

	```bash
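	# Install the exporter toolchain.
	pip install -U "optimum[exporters,onnxruntime]" transformers onnx
	# Export with optimum-cli's default dynamic batch and sequence axes.
	optimum-cli export onnx \
	    --model BAAI/bge-m3 \
	    --task feature-extraction \
	    --opset 17 \
	    ./out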
	```

	Output: `out/model.onnx` (graph) + `out/model.onnx_data` (weights, external)
	plus `config.json`, `tokenizer.json`, `tokenizer_config.json`,
	`special_tokens_map.json`. Layout matches what TEI expects.
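
A quick way to confirm an export directory is complete before uploading or serving it is to check for the files listed above. This is a minimal sketch; `missing_files` is a hypothetical helper, and the `./out` path mirrors the export command.

```python
from pathlib import Path

# File names taken from the output listing in this section.
EXPECTED_FILES = (
    "model.onnx",
    "model.onnx_data",
    "config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
)


def missing_files(export_dir):
    """Return the expected export files that are absent from export_dir."""
    root = Path(export_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Example: an empty list means the layout is complete.
#   print(missing_files("./out"))
```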

	## TEI usage

	```yaml
	command:
	- --model-id=newtechstudio/bge-m3-onnx
	- --max-client-batch-size=128 # honored, no longer capped
	- --max-batch-tokens=16384
	```

	## License

	Inherits the [upstream license](https://huggingface.co/BAAI/bge-m3) (MIT).