bge-m3-3head (ONNX: dense + learned-sparse + ColBERT)
A self-exported ONNX of BAAI/bge-m3
that emits all three BGE-M3 representations from one forward pass,
with dynamic batch and sequence axes:
| Output | Shape | Notes |
|---|---|---|
| dense | [batch, 1024] | CLS hidden state, raw (not L2-normalised) |
| sparse | [batch, seq] | relu(sparse_linear(h)), per-token scalar, raw |
| colbert | [batch, seq, 1024] | colbert_linear(h), raw (not normalised/masked) |
Inputs: input_ids [batch, seq] (int64), attention_mask [batch, seq]
(int64). Opset 17.
All heads are emitted raw on purpose: L2-normalisation, the lexical
token-weight aggregation, and ColBERT masking are left to the serving
layer so the normalize flag and the exact lexical-weight contract stay
in application code, not frozen into the graph.
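The dense and ColBERT post-processing left to the serving layer can be sketched in NumPy as below. This is a minimal illustration, not the author's serving code: the function names, the `eps` guard, and the `drop_cls` option are assumptions made here. The `drop_cls` option exists because FlagEmbedding's reference ColBERT path appears to skip the first (CLS) position; check that behaviour against the version you are matching.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit L2 norm; eps avoids division by zero."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def postprocess_dense(dense, normalize=True):
    """dense: [batch, 1024] raw CLS hidden states from the graph."""
    return l2_normalize(dense) if normalize else dense


def postprocess_colbert(colbert, attention_mask, normalize=True, drop_cls=False):
    """colbert: [batch, seq, 1024] raw; attention_mask: [batch, seq] of 0/1.

    Zeroes the vectors at padded positions, then optionally L2-normalises
    each remaining token vector. Zeroed vectors stay zero after normalising.
    """
    if drop_cls:
        # Assumption: mirror a reference behaviour that skips position 0.
        colbert, attention_mask = colbert[:, 1:], attention_mask[:, 1:]
    masked = colbert * attention_mask[..., None].astype(colbert.dtype)
    return l2_normalize(masked) if normalize else masked
```

Keeping `normalize` as a runtime argument is what lets the flag live in application code rather than in the exported graph.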
Why this exists
text-embeddings-inference (TEI) cannot serve BGE-M3 learned-sparse: its
only sparse path is SPLADE pooling, which requires a ForMaskedLM model
and produces SPLADE, a different head with different semantics. BGE-M3's
sparse is its own trained sparse_linear head. This artifact lets a
single lightweight onnxruntime server (no torch) serve dense + sparse +
ColBERT, replacing a dense-only TEI lane without growing infra (the
XLM-RoBERTa encoder weights dominate either engine).
Files
Two files: BGE-M3 fp32 (~2.1 GB) exceeds protobuf's 2 GB single-file limit, so the weights are stored as external data. Keep them adjacent; onnxruntime resolves the sidecar by the relative name in the graph.

- model.onnx: graph (~210 KB)
- model.onnx.data: weights (~2.1 GB)
- tokenizer.json: the BGE-M3 XLM-RoBERTa fast tokenizer (vocab 250002)
Serving contract (lexical sparse)
The serving layer reproduces FlagEmbedding's _process_token_weights:
drop {cls, eos, pad, unk} and non-positive weights, take the max
weight per unique token-id. Emit indices (raw 0-based token-ids, no
duplicates) and parallel values (post-ReLU, positive, not
L2-normalised); sparse_dim = tokenizer vocab cardinality (250002),
which should be read from the tokenizer at load time, not hardcoded.
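The aggregation described above can be sketched for a single sequence as follows. This is an illustrative reading of the contract, not a copy of FlagEmbedding's code: the name `lexical_weights` is hypothetical, and returning indices in ascending order is a choice made here for deterministic output, not something the contract requires.

```python
def lexical_weights(token_ids, weights, special_ids):
    """Collapse one sequence's per-token weights into (indices, values).

    token_ids, weights: parallel 1-D sequences for one input, i.e. one row
        of input_ids and the matching row of the `sparse` output.
    special_ids: set of token-ids to drop (cls, eos, pad, unk).
    """
    best = {}
    for tid, w in zip(token_ids, weights):
        tid, w = int(tid), float(w)
        # Drop special tokens and non-positive weights.
        if tid in special_ids or w <= 0.0:
            continue
        # Keep the maximum weight seen for each unique token-id.
        if w > best.get(tid, 0.0):
            best[tid] = w
    indices = sorted(best)  # ascending order: an assumption, see above
    values = [best[i] for i in indices]
    return indices, values
```

The special ids are best looked up from the tokenizer, for example with `tok.token_to_id("<s>")`, rather than hardcoded. For XLM-RoBERTa vocabularies they are commonly `{0, 1, 2, 3}` for `<s>`, `<pad>`, `</s>` and `<unk>`, but treat that as an assumption to verify against `tokenizer.json`.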
Usage (onnxruntime)
```python
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

# A single input, so every row already has the same length.
enc = tok.encode_batch(["quarterly management review minutes"])
ids = np.array([e.ids for e in enc], dtype=np.int64)
mask = np.array([e.attention_mask for e in enc], dtype=np.int64)

# Request only the heads you need; the shared backbone makes a dense-only
# call cheap (ColBERT projection is pruned).
dense, sparse, colbert = sess.run(
    ["dense", "sparse", "colbert"],
    {"input_ids": ids, "attention_mask": mask},
)
```
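The snippet above encodes one string, so the id rows are already the same length. With several inputs of different lengths the rows must be padded before they can form a `[batch, seq]` int64 array. A minimal sketch of that step follows; `pad_batch` is a hypothetical helper, and the pad id is passed in because it should be read from the tokenizer (it is commonly 1 for XLM-RoBERTa's `<pad>`, which is an assumption to verify).

```python
import numpy as np


def pad_batch(batch_ids, pad_id):
    """Right-pad variable-length id lists into a rectangular int64 batch.

    Returns (input_ids, attention_mask), both shaped [batch, max_len];
    the mask is 1 for real tokens and 0 for padding.
    """
    max_len = max(len(ids) for ids in batch_ids)
    input_ids = np.full((len(batch_ids), max_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(batch_ids), max_len), dtype=np.int64)
    for row, ids in enumerate(batch_ids):
        input_ids[row, : len(ids)] = ids
        attention_mask[row, : len(ids)] = 1
    return input_ids, attention_mask
```

It would be called as `ids, mask = pad_batch([e.ids for e in enc], pad_id=tok.token_to_id("<pad>"))`. The `tokenizers` library can also pad at encode time via `Tokenizer.enable_padding`, which may be simpler if the serving layer owns the tokenizer configuration.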
License
MIT, inherited from BAAI/bge-m3.
Weights are unchanged BGE-M3 weights re-serialised to ONNX; please cite
BGE-M3 (Chen et al., 2024).