---
license: mit
language:
- multilingual
library_name: text-embeddings-inference
base_model: BAAI/bge-m3
tags:
- onnx
- feature-extraction
- embeddings
- sentence-transformers
---

# bge-m3 (ONNX, dynamic axes)

Self-converted ONNX export of [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3),
hosted by [Newtech Studio](https://huggingface.co/newtechstudio) for use with
[text-embeddings-inference (TEI)](https://github.com/huggingface/text-embeddings-inference).

## Why this exists

BAAI's upstream `bge-m3` repo includes ONNX files, but their export bakes a
**static batch dimension** into the graph. TEI's ORT backend then logs:

```
WARN: Backend does not support a batch size > 8
WARN: forcing `max_batch_requests=8`
```

…silently capping every server we ran at batch=8 regardless of what we
configured via `--max-client-batch-size`. Under heavy indexing load this
caused TEI to chop a single client batch of 64+ chunks into 8+ internal
sub-batches, throttling throughput well below what the hardware could deliver.
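
To make the arithmetic concrete, here is a minimal sketch of how a capped backend splits a client batch. The function names are ours, chosen for illustration, and the ceiling-division model is a simplification of whatever TEI's scheduler actually does; only the cap of 8 comes from the log above.

```python
import math


def sub_batches(client_batch: int, backend_cap: int) -> int:
    """Backend calls needed when each call holds at most backend_cap items."""
    if client_batch < 0 or backend_cap <= 0:
        raise ValueError("client_batch must be >= 0 and backend_cap must be > 0")
    return math.ceil(client_batch / backend_cap)


def cap_fraction(configured: int, backend_cap: int) -> float:
    """Fraction of the configured batch size that one backend call can use."""
    return min(backend_cap, configured) / configured


print(sub_batches(64, 8))    # 8
print(sub_batches(128, 8))   # 16
print(cap_fraction(128, 8))  # 0.0625, i.e. 1/16
```

With a configured batch of 128 and a cap of 8, each backend call carries one sixteenth of what the configuration asks for.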

This export uses `optimum-cli`'s **default dynamic batch + sequence axes**,
which lets ORT honor whatever batch size TEI's CLI flags allow. On a
CPU-only TEI deployment with `--max-client-batch-size=128` the bulk lane
goes from 1/16 effective utilization (8 of 128 requests per backend call)
to full single-batch throughput.

Same weights as upstream: the export only changes the graph's shape
declarations and the file format, not the math.
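
One way to confirm that an export really has dynamic axes is to read the input shapes from the graph. The sketch below is an assumption-laden illustration, not part of the export pipeline: the helper names are ours, the example shapes are hypothetical, and `batch_size` / `sequence_length` are placeholder axis names. It relies on the `onnx` Python package, where a symbolic axis is stored as `dim_param` and a fixed one as `dim_value`.

```python
from typing import Dict, List, Union

Dim = Union[int, str]  # int = fixed size, str = symbolic (dynamic) axis name


def dynamic_axes(shape: List[Dim]) -> List[int]:
    """Return the positions of symbolic (dynamic) axes in a shape."""
    return [i for i, d in enumerate(shape) if isinstance(d, str)]


def input_shapes(onnx_path: str) -> Dict[str, List[Dim]]:
    """Read graph input shapes without loading the external weight file."""
    import onnx  # imported lazily so dynamic_axes() works without the package

    model = onnx.load(onnx_path, load_external_data=False)
    shapes: Dict[str, List[Dim]] = {}
    for inp in model.graph.input:
        dims: List[Dim] = []
        for d in inp.type.tensor_type.shape.dim:
            if d.WhichOneof("value") == "dim_value":
                dims.append(d.dim_value)          # fixed size baked into the graph
            else:
                dims.append(d.dim_param or "?")   # symbolic or unspecified axis
        shapes[inp.name] = dims
    return shapes


# Hypothetical shapes, for illustration only:
print(dynamic_axes(["batch_size", "sequence_length"]))  # [0, 1]
print(dynamic_axes([8, "sequence_length"]))             # [1]
```

Running `input_shapes("out/model.onnx")` on a local export and passing each shape through `dynamic_axes` should report axis 0 as dynamic if the batch dimension was not baked in.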

## Precision

**fp32** (~2.3 GB external data). We tried fp16 (1.1 GB) but TEI's ORT
backend on CPU explicitly rejects float16:

```
ERROR: Could not start ORT backend: Dtype float16 is not supported
for `ort`, only float32.
```

If you're running TEI on a CUDA / TensorRT backend, an fp16 build should
work and halve the disk + memory footprint; on CPU stick with fp32.
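
The two sizes quoted above are consistent with a back-of-envelope estimate. The sketch below assumes roughly 568 million parameters for bge-m3; that figure is an approximation that does not come from this README, and the estimate ignores graph and tokenizer overhead.

```python
PARAMS = 568_000_000  # assumed approximate parameter count, not from this README

BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weights_gb(params: int, dtype: str) -> float:
    """Approximate weight size in decimal gigabytes for a given dtype."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("float32", "float16"):
    print(f"{dtype}: ~{weights_gb(PARAMS, dtype):.1f} GB")
# float32: ~2.3 GB
# float16: ~1.1 GB
```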

## Reproduction

```bash
pip install -U "optimum[exporters,onnxruntime]" transformers onnx
optimum-cli export onnx \
  --model BAAI/bge-m3 \
  --task feature-extraction \
  --opset 17 \
  ./out
```

Output: `out/model.onnx` (graph) + `out/model.onnx_data` (weights, external)
plus `config.json`, `tokenizer.json`, `tokenizer_config.json`,
`special_tokens_map.json`. Layout matches what TEI expects.
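
Before uploading, it can help to check that the export directory contains every file listed above. This is a minimal sketch using only the standard library; `missing_files` is a hypothetical helper name, and the file list is copied from the output description, not from TEI's source.

```python
from pathlib import Path
from typing import List

# Files named in the output description above.
EXPECTED_FILES = [
    "model.onnx",
    "model.onnx_data",
    "config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
]


def missing_files(out_dir: str) -> List[str]:
    """Return the expected files that are absent from out_dir."""
    root = Path(out_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files("./out")
    print("export looks complete" if not missing else f"missing: {missing}")
```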

## TEI usage

```yaml
command:
  - --model-id=newtechstudio/bge-m3-onnx
  - --max-client-batch-size=128   # honored, no longer capped
  - --max-batch-tokens=16384
```
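
On the client side, a request to TEI's `/embed` endpoint should stay within `--max-client-batch-size`. The following is a rough sketch of a client that chunks its input accordingly. The URL is an assumed local address, the helper names are ours, and there is no retry or error handling; adjust all of these to your deployment.

```python
import json
import urllib.request
from typing import Iterator, List

TEI_URL = "http://localhost:8080/embed"  # assumed address, not from this README
MAX_CLIENT_BATCH = 128                   # matches --max-client-batch-size above


def batches(texts: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of texts with at most `size` items each."""
    for start in range(0, len(texts), size):
        yield texts[start:start + size]


def embed(texts: List[str], url: str = TEI_URL) -> List[List[float]]:
    """POST each batch to /embed and collect one vector per input text."""
    vectors: List[List[float]] = []
    for batch in batches(texts, MAX_CLIENT_BATCH):
        body = json.dumps({"inputs": batch}).encode("utf-8")
        req = urllib.request.Request(
            url, data=body, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(req) as resp:
            vectors.extend(json.loads(resp.read()))
    return vectors
```

Calling `embed(chunks)` with 300 chunks would issue three requests of 128, 128, and 44 texts. Note that `--max-batch-tokens` still limits how many tokens the server processes per backend batch.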

## License

Inherits the [upstream license](https://huggingface.co/BAAI/bge-m3) (MIT).