---
license: mit
base_model: BAAI/bge-m3
tags:
  - onnx
  - bge-m3
  - feature-extraction
  - sentence-embeddings
  - sparse-embeddings
  - colbert
  - retrieval
language:
  - multilingual
pipeline_tag: feature-extraction
inference: false
---

# bge-m3-3head (ONNX: dense + learned-sparse + ColBERT)

A self-exported ONNX of [`BAAI/bge-m3`](https://huggingface.co/BAAI/bge-m3)
that emits **all three** BGE-M3 representations from **one forward pass**,
with **dynamic batch and sequence axes**:

| Output | Shape | Notes |
|---|---|---|
| `dense` | `[batch, 1024]` | CLS hidden state, **raw** (not L2-normalised) |
| `sparse` | `[batch, seq]` | `relu(sparse_linear(h))`, per-token scalar, **raw** |
| `colbert` | `[batch, seq, 1024]` | `colbert_linear(h)`, **raw** (not normalised/masked) |

Inputs: `input_ids [batch, seq]` (int64), `attention_mask [batch, seq]`
(int64). Opset 17.

All heads are emitted **raw on purpose** — L2-normalisation, the lexical
token-weight aggregation, and ColBERT masking are left to the serving
layer so the `normalize` flag and the exact lexical-weight contract stay
in application code, not frozen into the graph.
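
As a rough illustration of what that serving-layer post-processing might look like for the dense and ColBERT heads, here is a minimal NumPy sketch. The function names and the `normalize` parameter are hypothetical, not part of this artifact. The sketch only zeroes padded positions; whether to also drop the CLS position from the ColBERT output is a serving-layer decision not made here.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit L2 norm; eps guards zero vectors."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def postprocess_dense(dense, normalize=True):
    """dense: [batch, 1024] raw CLS vectors from the graph."""
    return l2_normalize(dense) if normalize else dense

def postprocess_colbert(colbert, attention_mask, normalize=True):
    """colbert: [batch, seq, 1024] raw; attention_mask: [batch, seq] of 0/1."""
    mask = attention_mask[..., None].astype(colbert.dtype)  # [batch, seq, 1]
    vecs = colbert * mask  # padded positions become all-zero vectors
    return l2_normalize(vecs) if normalize else vecs
```

Padded positions stay at zero after normalisation because the zero vector is divided by `eps` rather than by its own norm.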

## Why this exists

`text-embeddings-inference` (TEI) cannot serve BGE-M3 learned-sparse: its
only sparse path is SPLADE pooling, which requires a `ForMaskedLM` model
and produces SPLADE — a different head with different semantics. BGE-M3's
sparse is its own trained `sparse_linear` head. This artifact lets a
single lightweight onnxruntime server (no torch) serve dense + sparse +
ColBERT, replacing a dense-only TEI lane without growing infra (the
XLM-RoBERTa encoder weights dominate either engine).

## Files

The model itself is split across two files: BGE-M3 in fp32 (~2.1 GB)
exceeds protobuf's 2 GB single-file limit, so the weights are stored as
external data. **Keep the two adjacent**; onnxruntime resolves the sidecar
by the relative name recorded in the graph. The tokenizer ships alongside
them.

- `model.onnx` — graph (~210 KB)
- `model.onnx.data` — weights (~2.1 GB)
- `tokenizer.json` — the BGE-M3 XLM-RoBERTa fast tokenizer (vocab 250002)

## Serving contract (lexical sparse)

The serving layer reproduces FlagEmbedding's `_process_token_weights`:
drop `{cls, eos, pad, unk}` tokens and non-positive weights, then take the
**max weight per unique token-id**. It emits `indices` (raw 0-based
token-ids, no duplicates) and a parallel `values` array (post-ReLU,
positive, **not** L2-normalised). `sparse_dim` is the tokenizer's
vocabulary size (**250002**); read it from the tokenizer at load time
rather than hardcoding it.
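
A minimal sketch of that aggregation for a single sequence, in plain Python. The function name is hypothetical, and sorting `indices` in ascending order is an assumption of this sketch rather than part of the stated contract. The special-token ids follow the XLM-RoBERTa convention (`<s>`=0, `<pad>`=1, `</s>`=2, `<unk>`=3) and are hardcoded only for illustration; a real server would look them up from the tokenizer.

```python
def lexical_weights(token_ids, weights, special_ids):
    """Collapse one sequence's per-token weights into (indices, values).

    token_ids:   token ids for one sequence (any iterable of ints)
    weights:     the matching row of the raw `sparse` output
    special_ids: ids to drop (cls, eos, pad, unk)
    """
    best = {}
    for tid, w in zip(token_ids, weights):
        tid, w = int(tid), float(w)
        if tid in special_ids or w <= 0.0:
            continue  # drop special tokens and non-positive weights
        if w > best.get(tid, 0.0):
            best[tid] = w  # keep the max weight per unique token id
    indices = sorted(best)  # ordering is an assumption of this sketch
    values = [best[i] for i in indices]
    return indices, values

# Illustrative ids and weights, not real model output.
SPECIAL_IDS = {0, 1, 2, 3}  # assumed XLM-RoBERTa special-token ids
ids = [0, 901, 17, 901, 42, 2, 1]
w = [0.9, 0.2, 0.0, 0.5, 0.7, 0.3, 0.1]
indices, values = lexical_weights(ids, w, SPECIAL_IDS)
# indices == [42, 901], values == [0.7, 0.5]
```

Token 901 appears twice and keeps its larger weight, token 17 is dropped for a zero weight, and the special tokens are dropped regardless of weight.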

## Usage (onnxruntime)

```python
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
enc = tok.encode_batch(["quarterly management review minutes"])
ids  = np.array([e.ids for e in enc], dtype=np.int64)
mask = np.array([e.attention_mask for e in enc], dtype=np.int64)

# All three heads are fetched here. To request only the heads you need,
# shorten the output list, e.g. ["dense"] for a dense-only call (intended
# to be cheap, since the heads share one backbone).
dense, sparse, colbert = sess.run(
    ["dense", "sparse", "colbert"],
    {"input_ids": ids, "attention_mask": mask},
)
```
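
The example above encodes a single text, so the id lists stack into a rectangular array directly. With several texts of different lengths, the ids need right-padding before they can form `[batch, seq]` arrays. The helper below is a hypothetical sketch of that step; `pad_id=1` assumes the XLM-RoBERTa `<pad>` id and is better read from the tokenizer. The `tokenizers` library can also pad for you via `Tokenizer.enable_padding`.

```python
import numpy as np

def pad_batch(id_lists, pad_id=1):
    """Right-pad variable-length id lists into int64 input_ids and attention_mask."""
    max_len = max(len(ids) for ids in id_lists)
    input_ids = np.full((len(id_lists), max_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(id_lists), max_len), dtype=np.int64)
    for row, ids in enumerate(id_lists):
        input_ids[row, :len(ids)] = ids      # real tokens on the left
        attention_mask[row, :len(ids)] = 1   # 1 marks real tokens, 0 marks padding
    return input_ids, attention_mask
```

The resulting `attention_mask` is the same array a serving layer would use later to ignore padded positions in the `sparse` and `colbert` outputs.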

## License

MIT, inherited from [`BAAI/bge-m3`](https://huggingface.co/BAAI/bge-m3).
Weights are unchanged BGE-M3 weights re-serialised to ONNX; please cite
BGE-M3 (Chen et al., 2024).