---
license: mit
base_model: BAAI/bge-m3
tags:
- onnx
- bge-m3
- feature-extraction
- sentence-embeddings
- sparse-embeddings
- colbert
- retrieval
language:
- multilingual
pipeline_tag: feature-extraction
inference: false
---

# bge-m3-3head (ONNX: dense + learned-sparse + ColBERT)

A self-exported ONNX of [`BAAI/bge-m3`](https://huggingface.co/BAAI/bge-m3) that emits **all three** BGE-M3 representations from **one forward pass**, with **dynamic batch and sequence axes**:

| Output | Shape | Notes |
|---|---|---|
| `dense` | `[batch, 1024]` | CLS hidden state, **raw** (not L2-normalised) |
| `sparse` | `[batch, seq]` | `relu(sparse_linear(h))`, per-token scalar, **raw** |
| `colbert` | `[batch, seq, 1024]` | `colbert_linear(h)`, **raw** (not normalised/masked) |

Inputs: `input_ids [batch, seq]` (int64), `attention_mask [batch, seq]` (int64). Opset 17.

All heads are emitted **raw on purpose**: L2-normalisation, the lexical token-weight aggregation, and ColBERT masking are left to the serving layer, so the `normalize` flag and the exact lexical-weight contract stay in application code rather than being frozen into the graph.

## Why this exists

`text-embeddings-inference` (TEI) cannot serve BGE-M3 learned-sparse: its only sparse path is SPLADE pooling, which requires a `ForMaskedLM` model and produces SPLADE, a different head with different semantics. BGE-M3's sparse output comes from its own trained `sparse_linear` head. This artifact lets a single lightweight onnxruntime server (no torch) serve dense + sparse + ColBERT, replacing a dense-only TEI lane without adding infrastructure (the XLM-RoBERTa encoder weights dominate the footprint in either engine).

## Files

The model is split across two files: BGE-M3 in fp32 (~2.1 GB) exceeds protobuf's 2 GB single-file limit, so the weights are stored as external data. **Keep them adjacent**; onnxruntime resolves the sidecar by the relative name recorded in the graph. The tokenizer ships alongside them.

- `model.onnx`: graph (~210 KB)
- `model.onnx.data`: weights (~2.1 GB)
- `tokenizer.json`: the BGE-M3 XLM-RoBERTa fast tokenizer (vocab 250002)

## Serving contract (lexical sparse)

The serving layer reproduces FlagEmbedding's `_process_token_weights`: drop `{cls, eos, pad, unk}` and non-positive weights, then take the **max weight per unique token-id**. Emit `indices` (raw 0-based token-ids, no duplicates) and parallel `values` (post-ReLU, positive, **not** L2-normalised). `sparse_dim` is the tokenizer vocab cardinality (**250002**); read it from the tokenizer at runtime rather than hardcoding it.

## Usage (onnxruntime)

```python
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")
# Pad to the longest text in the batch (XLM-RoBERTa's <pad> id is 1);
# this matters as soon as a batch holds texts of different lengths.
tok.enable_padding(pad_id=1, pad_token="<pad>")
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

enc = tok.encode_batch(["quarterly management review minutes"])
ids = np.array([e.ids for e in enc], dtype=np.int64)
mask = np.array([e.attention_mask for e in enc], dtype=np.int64)

# This call fetches all three heads. Request only the heads you need; the
# shared backbone makes a dense-only call cheap (ColBERT projection is pruned).
dense, sparse, colbert = sess.run(
    ["dense", "sparse", "colbert"],
    {"input_ids": ids, "attention_mask": mask},
)
```

## License

MIT, inherited from [`BAAI/bge-m3`](https://huggingface.co/BAAI/bge-m3). The weights are unchanged BGE-M3 weights re-serialised to ONNX; please cite BGE-M3 (Chen et al., 2024).
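
## Post-processing sketch (NumPy)

Because the heads are emitted raw, the serving layer has to apply the post-processing described under "Serving contract". The following is a minimal sketch of that step and not the reference implementation. The helper names (`l2_normalize`, `lexical_weights`, `colbert_vectors`) are hypothetical, and the special-token ids are assumed to be the usual XLM-RoBERTa ones, so read them from the tokenizer in real code. Sorting `indices` and dropping the CLS position from the ColBERT output are also assumptions made for illustration; check them against FlagEmbedding before relying on them.

```python
import numpy as np

# Assumed XLM-RoBERTa special-token ids (<s>=0, <pad>=1, </s>=2, <unk>=3).
# Hypothetical constant: prefer reading these from the tokenizer.
SPECIAL_IDS = frozenset({0, 1, 2, 3})


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit L2 norm (eps guards all-zero rows)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def lexical_weights(input_ids, sparse, special_ids=SPECIAL_IDS):
    """Turn one row of the `sparse` head into parallel (indices, values).

    Drops special tokens and non-positive weights, then keeps the maximum
    weight per unique token id. Values are left un-normalised.
    """
    best = {}
    for tid, w in zip(input_ids.tolist(), sparse.tolist()):
        if tid in special_ids or w <= 0.0:
            continue
        if w > best.get(tid, 0.0):
            best[tid] = w
    indices = sorted(best)  # sorted order is an illustrative choice
    values = [best[i] for i in indices]
    return indices, values


def colbert_vectors(colbert, attention_mask, normalize=True):
    """Keep real-token rows of one `colbert` item, optionally L2-normalised.

    Assumption: position 0 (CLS) is dropped and padded rows are removed.
    """
    keep = attention_mask[1:] == 1
    vecs = colbert[1:][keep]
    return l2_normalize(vecs) if normalize else vecs
```

With the arrays from the Usage section, `lexical_weights(ids[0], sparse[0])` would give the lexical payload for the first text, `l2_normalize(dense)` the normalised dense vectors, and `colbert_vectors(colbert[0], mask[0])` the per-token multi-vector representation.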