license: mit
base_model: BAAI/bge-m3
tags:
  - onnx
  - bge-m3
  - feature-extraction
  - sentence-embeddings
  - sparse-embeddings
  - colbert
  - retrieval
language:
  - multilingual
pipeline_tag: feature-extraction
inference: false

bge-m3-3head (ONNX: dense + learned-sparse + ColBERT)

A self-exported ONNX of BAAI/bge-m3 that emits all three BGE-M3 representations from one forward pass, with dynamic batch and sequence axes:

| Output | Shape | Notes |
|---|---|---|
| dense | [batch, 1024] | CLS hidden state, raw (not L2-normalised) |
| sparse | [batch, seq] | relu(sparse_linear(h)), per-token scalar, raw |
| colbert | [batch, seq, 1024] | colbert_linear(h), raw (not normalised/masked) |

Inputs: input_ids [batch, seq] (int64), attention_mask [batch, seq] (int64). Opset 17.

All heads are emitted raw on purpose — L2-normalisation, the lexical token-weight aggregation, and ColBERT masking are left to the serving layer so the normalize flag and the exact lexical-weight contract stay in application code, not frozen into the graph.
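As an illustration of what that serving-layer step might look like for the dense and ColBERT heads, here is a minimal NumPy sketch. The function names and the `normalize` and `drop_cls` parameters are hypothetical, not part of this artifact. Skipping the first (CLS) position for ColBERT mirrors what the upstream FlagEmbedding code appears to do, and is left off by default here as an assumption.

```python
import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """L2-normalise along the last axis; eps keeps all-zero rows at zero."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def postprocess_dense(dense: np.ndarray, normalize: bool = True) -> np.ndarray:
    """dense: [batch, 1024] raw CLS vectors as emitted by the graph."""
    return l2_normalize(dense) if normalize else dense


def postprocess_colbert(
    colbert: np.ndarray,
    attention_mask: np.ndarray,
    normalize: bool = True,
    drop_cls: bool = False,  # assumption: optional, mirrors upstream behaviour
) -> np.ndarray:
    """colbert: [batch, seq, 1024]; attention_mask: [batch, seq].

    Zeroes the vectors at padded positions, then optionally normalises
    each remaining token vector.
    """
    mask = attention_mask[..., None].astype(colbert.dtype)
    vecs = colbert * mask
    if drop_cls:
        vecs = vecs[:, 1:]
    if normalize:
        vecs = l2_normalize(vecs)
    return vecs
```

Keeping `normalize` as a runtime argument is the point of emitting raw heads: the same graph can serve callers that want cosine-ready vectors and callers that want the raw values.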

Why this exists

text-embeddings-inference (TEI) cannot serve BGE-M3's learned-sparse output: its only sparse path is SPLADE pooling, which requires a ForMaskedLM model and produces SPLADE vectors, a different head with different semantics. BGE-M3's sparse output comes from its own trained sparse_linear head. This artifact lets a single lightweight onnxruntime server (no torch) serve dense + sparse + ColBERT, replacing a dense-only TEI lane without adding infrastructure, since the XLM-RoBERTa encoder weights dominate the footprint in either engine.

Files

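Three files. The ONNX model itself is split in two because BGE-M3 fp32 (~2.1 GB) exceeds protobuf's 2 GB single-file limit, so the weights are stored as external data. Keep model.onnx and model.onnx.data in the same directory; onnxruntime resolves the sidecar by the relative name recorded in the graph.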

  • model.onnx — graph (~210 KB)
  • model.onnx.data — weights (~2.1 GB)
  • tokenizer.json — the BGE-M3 XLM-RoBERTa fast tokenizer (vocab 250002)
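A missing sidecar only shows up when the session is created, so a preflight check before loading can help. This is a minimal sketch that assumes the three file names listed above; `missing_artifacts` is a hypothetical helper, not part of this repository or of onnxruntime.

```python
from pathlib import Path

# File names as listed in this README.
REQUIRED = ("model.onnx", "model.onnx.data", "tokenizer.json")


def missing_artifacts(model_dir: str) -> list[str]:
    """Return the required file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED if not (root / name).is_file()]
```

A server could call `missing_artifacts(model_dir)` at startup and refuse to start if the returned list is non-empty.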

Serving contract (lexical sparse)

The serving layer reproduces FlagEmbedding's _process_token_weights: drop the special tokens {cls, eos, pad, unk} and any non-positive weights, then keep the maximum weight per unique token-id. It emits indices (raw 0-based token-ids, no duplicates) and a parallel values array (post-ReLU, positive, not L2-normalised). sparse_dim is the tokenizer's vocabulary size (250002); read it from the tokenizer at load time rather than hardcoding it.
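The contract above can be sketched for a single sequence as follows. This is an illustrative version, not the serving layer's actual code. `lexical_weights` and its parameters are hypothetical names, the sorted output order is an assumption the contract does not state, and the caller is expected to pass the special-token ids read from the tokenizer (for XLM-RoBERTa these are normally 0, 1, 2 and 3).

```python
import numpy as np


def lexical_weights(
    token_ids: np.ndarray,   # [seq] int64, one row of input_ids
    weights: np.ndarray,     # [seq] float, the matching row of the sparse head
    special_ids: set[int],   # ids of cls, eos, pad, unk from the tokenizer
) -> tuple[np.ndarray, np.ndarray]:
    """Aggregate per-token weights into (indices, values)."""
    best: dict[int, float] = {}
    for tid, w in zip(token_ids.tolist(), weights.tolist()):
        # Drop special tokens and non-positive weights.
        if tid in special_ids or w <= 0:
            continue
        # Keep the maximum weight seen for each unique token-id.
        if w > best.get(tid, 0.0):
            best[tid] = w
    # Sorting is an assumption made here for deterministic output.
    indices = np.array(sorted(best), dtype=np.int64)
    values = np.array([best[i] for i in indices.tolist()], dtype=np.float32)
    return indices, values
```

For a batch, call it once per row, e.g. `lexical_weights(ids[b], sparse[b], special_ids)`. Padded positions carry the pad id, so they are removed by the special-token filter. With the `tokenizers` library, `tok.get_vocab_size()` is one way to obtain sparse_dim at load time.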

Usage (onnxruntime)

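```python
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")
# Pad to the longest sequence so mixed-length batches form a rectangular array.
# Assumption: pad id 1 / "<pad>" are the XLM-RoBERTa defaults; check with
# tok.token_to_id("<pad>").
tok.enable_padding(pad_id=1, pad_token="<pad>")
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
enc = tok.encode_batch(["quarterly management review minutes"])
ids = np.array([e.ids for e in enc], dtype=np.int64)
mask = np.array([e.attention_mask for e in enc], dtype=np.int64)

# Request only the heads you need by listing their output names; ["dense"]
# returns just the dense head. All heads share one backbone pass.
# This example requests all three.
dense, sparse, colbert = sess.run(
    ["dense", "sparse", "colbert"],
    {"input_ids": ids, "attention_mask": mask},
)
```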

License

MIT, inherited from BAAI/bge-m3. Weights are unchanged BGE-M3 weights re-serialised to ONNX; please cite BGE-M3 (Chen et al., 2024).