	---
	license: mit
	base_model: BAAI/bge-m3
	tags:
	- onnx
	- bge-m3
	- feature-extraction
	- sentence-embeddings
	- sparse-embeddings
	- colbert
	- retrieval
	language:
	- multilingual
	pipeline_tag: feature-extraction
	inference: false
	---

	# bge-m3-3head (ONNX: dense + learned-sparse + ColBERT)

	A self-exported ONNX of [`BAAI/bge-m3`](https://huggingface.co/BAAI/bge-m3)
	that emits all three BGE-M3 representations from one forward pass,
	with dynamic batch and sequence axes:

	| Output | Shape | Notes |
	|---|---|---|
	| `dense` | `[batch, 1024]` | CLS hidden state, raw (not L2-normalised) |
	| `sparse` | `[batch, seq]` | `relu(sparse_linear(h))`, per-token scalar, raw |
	| `colbert` | `[batch, seq, 1024]` | `colbert_linear(h)`, raw (not normalised/masked) |

	Inputs: `input_ids [batch, seq]` (int64), `attention_mask [batch, seq]`
	(int64). Opset 17.

	All heads are emitted raw on purpose — L2-normalisation, the lexical
	token-weight aggregation, and ColBERT masking are left to the serving
	layer so the `normalize` flag and the exact lexical-weight contract stay
	in application code, not frozen into the graph.
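The post-processing left to the serving layer can be sketched as below. This is a minimal illustration, not the serving code itself: the function name `postprocess` and the `eps` guard are hypothetical, and it assumes NumPy arrays with the shapes from the table above. It keeps every sequence position; if exact parity with FlagEmbedding matters, check whether the reference implementation also drops the first (CLS) position from the ColBERT vectors.

```python
import numpy as np


def postprocess(dense, colbert, attention_mask, normalize=True, eps=1e-12):
    """Apply the steps the graph leaves out to the raw dense and ColBERT heads.

    dense:          [batch, hidden] raw CLS vectors
    colbert:        [batch, seq, hidden] raw per-token vectors
    attention_mask: [batch, seq], 1 for real tokens and 0 for padding
    """
    dense = np.asarray(dense, dtype=np.float32)
    colbert = np.asarray(colbert, dtype=np.float32)

    # Zero the ColBERT vectors at padded positions: [batch, seq] -> [batch, seq, 1].
    mask = np.asarray(attention_mask, dtype=np.float32)[:, :, None]
    colbert = colbert * mask

    if normalize:
        # L2-normalise along the hidden axis; eps keeps all-zero (masked) rows finite.
        dense = dense / np.maximum(
            np.linalg.norm(dense, axis=-1, keepdims=True), eps
        )
        colbert = colbert / np.maximum(
            np.linalg.norm(colbert, axis=-1, keepdims=True), eps
        )
    return dense, colbert
```

Keeping `normalize` as a runtime argument is the point of emitting raw heads: the same graph output can serve callers that want unit vectors and callers that want the raw values.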

	## Why this exists

	`text-embeddings-inference` (TEI) cannot serve BGE-M3 learned-sparse: its
	only sparse path is SPLADE pooling, which requires a `ForMaskedLM` model
	and produces SPLADE — a different head with different semantics. BGE-M3's
	sparse is its own trained `sparse_linear` head. This artifact lets a
	single lightweight onnxruntime server (no torch) serve dense + sparse +
	ColBERT, replacing a dense-only TEI lane without growing infra (the
	XLM-RoBERTa encoder weights dominate either engine).

	## Files

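	Three files. The model itself is split in two because BGE-M3 in fp32
	(~2.1 GB) exceeds protobuf's 2 GB single-file limit, so the weights are
	stored as external data. Keep `model.onnx` and `model.onnx.data` in the
	same directory; onnxruntime resolves the sidecar by the relative name
	recorded in the graph.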

	- `model.onnx` — graph (~210 KB)
	- `model.onnx.data` — weights (~2.1 GB)
	- `tokenizer.json` — the BGE-M3 XLM-RoBERTa fast tokenizer (vocab 250002)

	## Serving contract (lexical sparse)

	The serving layer reproduces FlagEmbedding's `_process_token_weights`:
	drop `{cls, eos, pad, unk}` and non-positive weights, take the **max
	weight per unique token-id**. Emit `indices` (raw 0-based token-ids, no
	duplicates) and parallel `values` (post-ReLU, positive, not
	L2-normalised); `sparse_dim` = tokenizer vocab cardinality (250002),
	which should be read from the tokenizer at runtime, not hardcoded.
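The aggregation described above can be sketched for a single sequence as follows. This is an illustrative reading of the contract, not FlagEmbedding's code: the function name `lexical_weights` is hypothetical, the default `special_ids` assume the usual XLM-RoBERTa ids (`<s>`=0, `<pad>`=1, `</s>`=2, `<unk>`=3) and should be read from the tokenizer in real code, and sorting the indices is an extra choice made here for deterministic output.

```python
def lexical_weights(token_ids, weights, special_ids=(0, 1, 2, 3)):
    """Collapse per-token sparse weights into (indices, values) for one sequence.

    token_ids: one row of input_ids
    weights:   the matching row of the raw `sparse` output
    special_ids: ids to drop (assumed cls/pad/eos/unk ids, see lead-in)
    """
    best = {}
    for tid, w in zip(token_ids, weights):
        tid, w = int(tid), float(w)
        # Drop special tokens and non-positive weights.
        if tid in special_ids or w <= 0.0:
            continue
        # Keep the maximum weight seen for each unique token id.
        if w > best.get(tid, 0.0):
            best[tid] = w
    indices = sorted(best)                # unique 0-based token ids
    values = [best[i] for i in indices]   # parallel positive weights
    return indices, values
```

For a batch, call it once per row of `input_ids` and `sparse`; padded positions are dropped because the pad id is in `special_ids`.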

	## Usage (onnxruntime)

	```python
	import numpy as np
	import onnxruntime as ort
	from tokenizers import Tokenizer

	tok = Tokenizer.from_file("tokenizer.json")
	# Pad to the longest text so batches of more than one input form rectangular
	# arrays (XLM-RoBERTa's <pad> token has id 1).
	tok.enable_padding(pad_id=1, pad_token="<pad>")
	# model.onnx.data must sit next to model.onnx; onnxruntime loads it implicitly.
	sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
	enc = tok.encode_batch(["quarterly management review minutes"])
	ids = np.array([e.ids for e in enc], dtype=np.int64)
	mask = np.array([e.attention_mask for e in enc], dtype=np.int64)

	# Request only the heads you need; the shared backbone makes a dense-only
	# call cheap (ColBERT projection is pruned). This example fetches all three.
	dense, sparse, colbert = sess.run(
	    ["dense", "sparse", "colbert"],
	    {"input_ids": ids, "attention_mask": mask},
	)
	```

	## License

	MIT, inherited from [`BAAI/bge-m3`](https://huggingface.co/BAAI/bge-m3).
	Weights are unchanged BGE-M3 weights re-serialised to ONNX; please cite
	BGE-M3 (Chen et al., 2024).