---
license: mit
language:
- multilingual
library_name: text-embeddings-inference
base_model: BAAI/bge-m3
tags:
- onnx
- feature-extraction
- embeddings
- sentence-transformers
---

# bge-m3 (ONNX, dynamic axes)

Self-converted ONNX export of [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3), hosted by [Newtech Studio](https://huggingface.co/newtechstudio) for use with [text-embeddings-inference (TEI)](https://github.com/huggingface/text-embeddings-inference).

## Why this exists

BAAI's upstream `bge-m3` repo includes ONNX files, but their export bakes a **static batch dimension** into the graph. TEI's ORT backend then logs:

```
WARN: Backend does not support a batch size > 8
WARN: forcing `max_batch_requests=8`
```

This silently capped every server we ran at batch=8, regardless of what we configured via `--max-client-batch-size`. Under heavy indexing load this caused TEI to chop a single client batch of 64+ chunks into 8+ internal sub-batches, throttling throughput well below what the hardware could deliver.

This export uses `optimum-cli`'s **default dynamic batch + sequence axes**, which lets ORT honor whatever batch size TEI's CLI flags allow. On a CPU-only TEI deployment with `--max-client-batch-size=128`, the bulk lane goes from 1/16 effective utilization (128 inputs split into 16 sub-batches of 8) to full single-batch throughput.

Same weights as upstream: the export only changes the graph's shape declarations and the file format, not the math.

## Precision

**fp32** (~2.3 GB external data). We tried fp16 (1.1 GB), but TEI's ORT backend on CPU explicitly rejects float16:

```
ERROR: Could not start ORT backend: Dtype float16 is not supported for `ort`, only float32.
```

If you're running TEI on a CUDA / TensorRT backend, an fp16 build would work and halve the disk + memory footprint; on CPU, stick with fp32.

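
## Checking the batch axis

The static-versus-dynamic distinction described under "Why this exists" can be inspected directly in the ONNX graph. The following is a minimal sketch, assuming the `onnx` Python package is installed; `input_shapes` and `static_batch_inputs` are hypothetical helper names introduced here for illustration, not part of TEI or `optimum`. It reads only the graph file and skips the external weight data.

```python
from typing import Dict, List, Union

Dim = Union[int, str]


def input_shapes(model_path: str) -> Dict[str, List[Dim]]:
    """Return the declared shape of every graph input (hypothetical helper)."""
    # Imported lazily so static_batch_inputs() stays usable without onnx.
    import onnx

    # load_external_data=False avoids reading the multi-GB weight file.
    model = onnx.load(model_path, load_external_data=False)
    shapes: Dict[str, List[Dim]] = {}
    for inp in model.graph.input:
        dims: List[Dim] = []
        for d in inp.type.tensor_type.shape.dim:
            kind = d.WhichOneof("value")
            if kind == "dim_value":
                dims.append(d.dim_value)   # fixed size, e.g. 8
            elif kind == "dim_param":
                dims.append(d.dim_param)   # symbolic name, i.e. a dynamic axis
            else:
                dims.append("?")           # axis declared with no size at all
        shapes[inp.name] = dims
    return shapes


def static_batch_inputs(shapes: Dict[str, List[Dim]]) -> List[str]:
    """Names of inputs whose first (batch) axis is a fixed integer."""
    return [
        name
        for name, dims in shapes.items()
        if dims and isinstance(dims[0], int)
    ]


# Example (path taken from the Reproduction section):
#   shapes = input_shapes("out/model.onnx")
#   print(shapes)
#   print(static_batch_inputs(shapes))
```

For an export with dynamic axes, the second call should return an empty list; any input name it does return has an integer baked into its batch dimension.
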
## Reproduction

```bash
pip install -U "optimum[exporters,onnxruntime]" transformers onnx

optimum-cli export onnx \
  --model BAAI/bge-m3 \
  --task feature-extraction \
  --opset 17 \
  ./out
```

Output: `out/model.onnx` (graph) and `out/model.onnx_data` (weights, external), plus `config.json`, `tokenizer.json`, `tokenizer_config.json`, and `special_tokens_map.json`. The layout matches what TEI expects.

## TEI usage

```yaml
command:
  - --model-id=newtechstudio/bge-m3-onnx
  - --max-client-batch-size=128   # honored, no longer capped
  - --max-batch-tokens=16384
```

## License

Inherits the [upstream license](https://huggingface.co/BAAI/bge-m3) (MIT).