| --- |
| license: mit |
| base_model: BAAI/bge-m3 |
| tags: |
| - onnx |
| - bge-m3 |
| - feature-extraction |
| - sentence-embeddings |
| - sparse-embeddings |
| - colbert |
| - retrieval |
| language: |
| - multilingual |
| pipeline_tag: feature-extraction |
| inference: false |
| --- |
| |
| # bge-m3-3head (ONNX: dense + learned-sparse + ColBERT) |
|
|
| A self-exported ONNX of [`BAAI/bge-m3`](https://huggingface.co/BAAI/bge-m3) |
| that emits **all three** BGE-M3 representations from **one forward pass**, |
| with **dynamic batch and sequence axes**: |
|
|
| | Output | Shape | Notes | |
| |---|---|---| |
| | `dense` | `[batch, 1024]` | CLS hidden state, **raw** (not L2-normalised) | |
| | `sparse` | `[batch, seq]` | `relu(sparse_linear(h))`, per-token scalar, **raw** | |
| | `colbert` | `[batch, seq, 1024]` | `colbert_linear(h)`, **raw** (not normalised/masked) | |
|
|
| Inputs: `input_ids [batch, seq]` (int64), `attention_mask [batch, seq]` |
| (int64). Opset 17. |
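
Because both axes are dynamic, inputs of different lengths can share one batch once they are right-padded. The following is a minimal sketch of that step, not part of the exported artifact: `pad_batch` is a hypothetical helper, and `pad_id=1` assumes the XLM-RoBERTa `<pad>` id (safer to read it from the tokenizer, e.g. `tok.token_to_id("<pad>")`).

```python
import numpy as np

def pad_batch(token_id_lists, pad_id=1):
    """Right-pad variable-length id lists into int64 input_ids / attention_mask.

    pad_id=1 is an assumption (XLM-RoBERTa <pad>); prefer reading it from the tokenizer.
    """
    batch = len(token_id_lists)
    seq = max(len(ids) for ids in token_id_lists)
    input_ids = np.full((batch, seq), pad_id, dtype=np.int64)
    attention_mask = np.zeros((batch, seq), dtype=np.int64)
    for row, ids in enumerate(token_id_lists):
        input_ids[row, :len(ids)] = ids      # real tokens on the left
        attention_mask[row, :len(ids)] = 1   # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Two already-tokenised inputs of different lengths (ids are illustrative only).
ids, mask = pad_batch([[0, 11, 12, 2], [0, 21, 2]])
```

The two arrays can be fed directly as `input_ids` and `attention_mask`. If the tokenizer already has padding enabled, this helper is unnecessary.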
|
|
| All heads are emitted **raw on purpose**: L2-normalisation, the lexical |
| token-weight aggregation, and ColBERT masking are left to the serving |
| layer so the `normalize` flag and the exact lexical-weight contract stay |
| in application code, not frozen into the graph. |
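
As an illustration of what the serving layer might do with the raw `dense` and `colbert` outputs, here is a minimal NumPy sketch. The function names are hypothetical. Dropping position 0 (CLS) from the ColBERT vectors is an assumption modelled on FlagEmbedding's behaviour, and right padding is assumed; adjust both to your own contract.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit L2 norm."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)

def postprocess_dense(dense, normalize=True):
    """[batch, 1024] raw CLS states -> optionally normalised embeddings."""
    return l2_normalize(dense) if normalize else dense

def postprocess_colbert(colbert, attention_mask, normalize=True):
    """[batch, seq, 1024] raw vectors -> one [tokens, 1024] array per input."""
    out = []
    for vecs, mask in zip(colbert, attention_mask):
        n = int(mask.sum())   # number of real tokens (right padding assumed)
        vecs = vecs[1:n]      # assumption: skip CLS at position 0, drop padding
        out.append(l2_normalize(vecs) if normalize else vecs)
    return out

# Random stand-ins for model outputs, only to show the shapes involved.
rng = np.random.default_rng(0)
dense = rng.normal(size=(2, 1024)).astype(np.float32)
colbert = rng.normal(size=(2, 5, 1024)).astype(np.float32)
mask = np.array([[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]], dtype=np.int64)

dense_n = postprocess_dense(dense)
colbert_n = postprocess_colbert(colbert, mask)
```

Keeping `normalize` as a plain argument is the point of emitting raw heads: the same graph can serve both normalised and unnormalised callers.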
|
|
| ## Why this exists |
|
|
| `text-embeddings-inference` (TEI) cannot serve BGE-M3 learned-sparse: its |
| only sparse path is SPLADE pooling, which requires a `ForMaskedLM` model |
| and produces SPLADE, a different head with different semantics. BGE-M3's |
| sparse output comes from its own trained `sparse_linear` head. This artifact lets a |
| single lightweight onnxruntime server (no torch) serve dense + sparse + |
| ColBERT, replacing a dense-only TEI lane without growing infra (the |
| XLM-RoBERTa encoder weights dominate either engine). |
|
|
| ## Files |
|
|
| Three files: two for the model, plus the tokenizer. BGE-M3 fp32 (~2.1 GB) |
| exceeds protobuf's 2 GB single-file limit, so the weights are stored as |
| external data. **Keep `model.onnx` and `model.onnx.data` adjacent**; |
| onnxruntime resolves the sidecar by the relative name in the graph. |
|
|
| - `model.onnx`: graph (~210 KB) |
| - `model.onnx.data`: weights (~2.1 GB) |
| - `tokenizer.json`: the BGE-M3 XLM-RoBERTa fast tokenizer (vocab 250002) |
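
Since a missing sidecar only surfaces when the session is created, a server may want to check the directory up front. A minimal sketch using only the standard library follows; `missing_files` is a hypothetical helper, and the file names are the ones listed above.

```python
from pathlib import Path

# File names as listed in this model card.
REQUIRED = ("model.onnx", "model.onnx.data", "tokenizer.json")

def missing_files(model_dir):
    """Return the required file names that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED if not (root / name).is_file()]

# Example: report anything missing from the current directory.
missing = missing_files(".")
if missing:
    print("missing model files:", ", ".join(missing))
```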
|
|
| ## Serving contract (lexical sparse) |
|
|
| The serving layer reproduces FlagEmbedding's `_process_token_weights`: |
| drop `{cls, eos, pad, unk}` and non-positive weights, take the **max |
| weight per unique token-id**. Emit `indices` (raw 0-based token-ids, no |
| duplicates) and parallel `values` (post-ReLU, positive, **not** |
| L2-normalised). `sparse_dim` is the tokenizer vocab cardinality (**250002**); |
| read it from the tokenizer at load time rather than hardcoding it. |
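
The aggregation described above can be sketched in plain Python as follows. This is an illustrative reading of the contract, not FlagEmbedding's code: `lexical_weights` is a hypothetical name, the special ids `{0, 1, 2, 3}` assume the XLM-RoBERTa `<s>`, `<pad>`, `</s>`, `<unk>` ids (better read via `tok.token_to_id(...)`), and sorting `indices` is a convenience rather than part of the contract.

```python
def lexical_weights(token_ids, token_weights, special_ids):
    """Collapse per-token weights into (indices, values) for one sequence.

    Drops special tokens and non-positive weights, then keeps the maximum
    weight seen for each unique token id.
    """
    best = {}
    for tid, w in zip(token_ids, token_weights):
        tid, w = int(tid), float(w)
        if tid in special_ids or w <= 0.0:
            continue
        if w > best.get(tid, 0.0):
            best[tid] = w
    indices = sorted(best)                 # sorted only for determinism
    values = [best[i] for i in indices]    # parallel to indices, all > 0
    return indices, values

# Assumption: XLM-RoBERTa ids for <s>, <pad>, </s>, <unk>.
SPECIAL_IDS = {0, 1, 2, 3}

# One row of `input_ids` and the matching row of `sparse` (values illustrative).
indices, values = lexical_weights(
    [0, 57, 912, 57, 3, 2, 1],
    [0.9, 0.20, 0.75, 0.35, 0.5, 0.4, 0.0],
    SPECIAL_IDS,
)
# indices == [57, 912], values == [0.35, 0.75]
```

Token 57 appears twice and keeps its larger weight; the special tokens are dropped even though some of them carry positive weights.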
|
|
| ## Usage (onnxruntime) |
|
|
| ```python |
| import numpy as np, onnxruntime as ort |
| from tokenizers import Tokenizer |
| |
| tok = Tokenizer.from_file("tokenizer.json") |
| sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"]) |
| enc = tok.encode_batch(["quarterly management review minutes"]) |
| ids = np.array([e.ids for e in enc], dtype=np.int64) |
| mask = np.array([e.attention_mask for e in enc], dtype=np.int64) |
| |
| # Request only the heads you need; the shared backbone makes a dense-only |
| # call cheap (ColBERT projection is pruned). This call asks for all three; |
| # pass e.g. ["dense"] as the first argument to fetch a single head. |
| dense, sparse, colbert = sess.run( |
|     ["dense", "sparse", "colbert"], |
|     {"input_ids": ids, "attention_mask": mask}, |
| ) |
| ``` |
|
|
| ## License |
|
|
| MIT, inherited from [`BAAI/bge-m3`](https://huggingface.co/BAAI/bge-m3). |
| Weights are unchanged BGE-M3 weights re-serialised to ONNX; please cite |
| BGE-M3 (Chen et al., 2024). |
|
|