---
license: mit
base_model: BAAI/bge-m3
tags:
- onnx
- bge-m3
- feature-extraction
- sentence-embeddings
- sparse-embeddings
- colbert
- retrieval
language:
- multilingual
pipeline_tag: feature-extraction
inference: false
---
# bge-m3-3head (ONNX: dense + learned-sparse + ColBERT)
A self-exported ONNX of [`BAAI/bge-m3`](https://huggingface.co/BAAI/bge-m3)
that emits **all three** BGE-M3 representations from **one forward pass**,
with **dynamic batch and sequence axes**:
| Output | Shape | Notes |
|---|---|---|
| `dense` | `[batch, 1024]` | CLS hidden state, **raw** (not L2-normalised) |
| `sparse` | `[batch, seq]` | `relu(sparse_linear(h))`, per-token scalar, **raw** |
| `colbert` | `[batch, seq, 1024]` | `colbert_linear(h)`, **raw** (not normalised/masked) |
Inputs: `input_ids [batch, seq]` (int64), `attention_mask [batch, seq]`
(int64). Opset 17.
All heads are emitted **raw on purpose** — L2-normalisation, the lexical
token-weight aggregation, and ColBERT masking are left to the serving
layer so the `normalize` flag and the exact lexical-weight contract stay
in application code, not frozen into the graph.
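
As a minimal sketch of what the serving layer might do with the raw `dense` and `colbert` outputs, assuming NumPy post-processing: the function names and the `normalize` parameter below are illustrative, not part of this artifact. Whether the CLS position should also be dropped from the ColBERT vectors is not stated here and is worth checking against the reference implementation.

```python
import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """L2-normalise along the last axis; eps keeps all-zero rows at zero."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def postprocess_dense(dense: np.ndarray, normalize: bool = True) -> np.ndarray:
    """dense: [batch, 1024] raw CLS vectors from the graph."""
    return l2_normalize(dense) if normalize else dense


def postprocess_colbert(
    colbert: np.ndarray, attention_mask: np.ndarray, normalize: bool = True
) -> np.ndarray:
    """colbert: [batch, seq, 1024] raw; attention_mask: [batch, seq] of 0/1.

    Padded positions are zeroed first, then the surviving token vectors
    are optionally L2-normalised.
    """
    masked = colbert * attention_mask[:, :, None].astype(colbert.dtype)
    return l2_normalize(masked) if normalize else masked
```

Keeping `normalize` as a plain argument mirrors the intent above: the flag lives in application code and can change without re-exporting the graph.
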
## Why this exists
`text-embeddings-inference` (TEI) cannot serve BGE-M3 learned-sparse: its
only sparse path is SPLADE pooling, which requires a `ForMaskedLM` model
and produces SPLADE — a different head with different semantics. BGE-M3's
sparse is its own trained `sparse_linear` head. This artifact lets a
single lightweight onnxruntime server (no torch) serve dense + sparse +
ColBERT, replacing a dense-only TEI lane without growing infra (the
XLM-RoBERTa encoder weights dominate either engine).
## Files
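Three files. BGE-M3 in fp32 (~2.1 GB) exceeds protobuf's 2 GB single-file
limit, so the model is split into a graph file and an external-data weights
file, shipped alongside the tokenizer. **Keep `model.onnx` and
`model.onnx.data` adjacent**; onnxruntime resolves the sidecar by the
relative name recorded in the graph.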
- `model.onnx` — graph (~210 KB)
- `model.onnx.data` — weights (~2.1 GB)
- `tokenizer.json` — the BGE-M3 XLM-RoBERTa fast tokenizer (vocab 250002)
## Serving contract (lexical sparse)
The serving layer reproduces FlagEmbedding's `_process_token_weights`:
drop `{cls, eos, pad, unk}` and non-positive weights, take the **max
weight per unique token-id**. Emit `indices` (raw 0-based token-ids, no
duplicates) and parallel `values` (post-ReLU, positive, **not**
L2-normalised); `sparse_dim` = tokenizer vocab cardinality (**250002**),
which should be read from the tokenizer at load time rather than hardcoded.
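
The contract above can be sketched in plain Python as follows. This is an illustration, not FlagEmbedding's code: `lexical_weights` and `ASSUMED_SPECIAL_IDS` are hypothetical names, the special ids are assumed to follow the usual XLM-RoBERTa convention (`<s>`=0, `<pad>`=1, `</s>`=2, `<unk>`=3) and should be read from the tokenizer in real use, and returning the indices sorted is a choice of this sketch.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

# Assumption: conventional XLM-RoBERTa special ids; read them from the
# tokenizer in a real serving layer instead of trusting these constants.
ASSUMED_SPECIAL_IDS: FrozenSet[int] = frozenset({0, 1, 2, 3})


def lexical_weights(
    token_ids: Iterable[int],
    weights: Iterable[float],
    special_ids: FrozenSet[int] = ASSUMED_SPECIAL_IDS,
) -> Tuple[List[int], List[float]]:
    """Collapse per-token weights for one sequence into (indices, values).

    Special tokens and non-positive weights are dropped, and each
    remaining token id keeps the maximum weight seen for it.
    """
    best: Dict[int, float] = {}
    for tid, w in zip(token_ids, weights):
        tid, w = int(tid), float(w)
        if tid in special_ids or w <= 0.0:
            continue
        if w > best.get(tid, 0.0):
            best[tid] = w
    indices = sorted(best)  # sorted only for deterministic output
    values = [best[i] for i in indices]
    return indices, values
```

With the model outputs, a call would look like `lexical_weights(ids[0], sparse[0])` for the first sequence in the batch. For `sparse_dim`, one option is `Tokenizer.get_vocab_size()` from the `tokenizers` library, assuming the shipped `tokenizer.json` is the intended source of truth.
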
## Usage (onnxruntime)
```python
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")
# model.onnx.data must sit next to model.onnx so the external weights resolve.
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

# A single sentence needs no padding; pad to a common length before
# stacking a multi-sentence batch.
enc = tok.encode_batch(["quarterly management review minutes"])
ids = np.array([e.ids for e in enc], dtype=np.int64)
mask = np.array([e.attention_mask for e in enc], dtype=np.int64)

# All three heads are fetched here. Request only the heads you need by
# passing a subset of names (e.g. ["dense"]); the shared backbone makes a
# dense-only call cheap (ColBERT projection is pruned).
dense, sparse, colbert = sess.run(
    ["dense", "sparse", "colbert"],
    {"input_ids": ids, "attention_mask": mask},
)
```
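
The usage example encodes a single sentence, so `np.array` can stack the ids directly. For several sentences of different lengths, the rows need padding to a common length first. A minimal sketch, assuming the conventional XLM-RoBERTa `<pad>` id of 1 (`pad_batch` and `ASSUMED_PAD_ID` are hypothetical names; verify the pad id against `tokenizer.json`):

```python
import numpy as np

ASSUMED_PAD_ID = 1  # assumption: XLM-RoBERTa's conventional <pad> id


def pad_batch(id_lists, pad_id=ASSUMED_PAD_ID):
    """Right-pad variable-length id lists into int64 arrays.

    Returns (input_ids, attention_mask), both shaped [batch, max_len],
    with mask 1 on real tokens and 0 on padding.
    """
    max_len = max(len(ids) for ids in id_lists)
    input_ids = np.full((len(id_lists), max_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(id_lists), max_len), dtype=np.int64)
    for row, ids in enumerate(id_lists):
        input_ids[row, : len(ids)] = ids
        attention_mask[row, : len(ids)] = 1
    return input_ids, attention_mask
```

In the example above this would be called as `ids, mask = pad_batch([e.ids for e in enc])`. The `tokenizers` library can also pad at encode time via `Tokenizer.enable_padding`, which may be preferable if the padding settings are already configured there.
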
## License
MIT, inherited from [`BAAI/bge-m3`](https://huggingface.co/BAAI/bge-m3).
Weights are unchanged BGE-M3 weights re-serialised to ONNX; please cite
BGE-M3 (Chen et al., 2024).