bge-m3 – OpenVINO IR (FP16)

This is a redistribution. For the model's intended use, training data, evaluation results, limitations, and citation, please see the upstream card: BAAI/bge-m3.

OpenVINO IR conversion of BAAI/bge-m3, weights compressed to FP16. Drop-in for optimum-intel and OpenArc.

This export covers the dense embedding head only (the multi-vector and sparse heads from the original bge-m3 release are not included).

Files

  • openvino_model.{xml,bin} – XLM-RoBERTa encoder, FP16 weights (~1.05 GB)
  • openvino_tokenizer.{xml,bin} / openvino_detokenizer.{xml,bin} – OpenVINO Tokenizers IR
  • 1_Pooling/config.json – sentence-transformers pooling metadata (pooling_mode_cls_token: true; see the sketch after this list)
  • Standard HF tokenizer files: tokenizer.json, tokenizer_config.json, special_tokens_map.json, sentencepiece.bpe.model
  • LICENSE – MIT, inherited from the upstream model.
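For reference, the shipped pooling config follows the standard sentence-transformers Pooling format. A minimal sketch of what such a file looks like for this model's 1024-dim CLS setup (verify against the actual 1_Pooling/config.json in this repo):

{
  "word_embedding_dimension": 1024,
  "pooling_mode_cls_token": true,
  "pooling_mode_mean_tokens": false,
  "pooling_mode_max_tokens": false,
  "pooling_mode_mean_sqrt_len_tokens": false
}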

Architecture

Base model            XLM-RoBERTa large
Hidden size           1024
Layers                24
Max sequence length   8192 (max_position_embeddings = 8194; XLM-R reserves 2 positions)
Vocabulary            250,002
Pooling               CLS (declared via 1_Pooling/config.json)

Usage with optimum-intel

from optimum.intel import OVModelForFeatureExtraction
from transformers import AutoTokenizer
import torch.nn.functional as F

# Compiles the IR on the target device at load; use device="CPU" if no Intel GPU is present.
model = OVModelForFeatureExtraction.from_pretrained("kread/bge-m3-fp16-ov", device="GPU")
tok = AutoTokenizer.from_pretrained("kread/bge-m3-fp16-ov")

inputs = tok("What is the capital of France?", return_tensors="pt",
             padding=True, truncation=True, max_length=512)
embedding = F.normalize(model(**inputs).last_hidden_state[:, 0], p=2, dim=1)  # CLS pool
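Because the rows are unit-normed, cosine similarity between two texts reduces to a dot product. A quick sanity check reusing the model and tokenizer above (the sentence pair is an arbitrary example):

texts = ["What is the capital of France?", "Paris is the capital of France."]
batch = tok(texts, return_tensors="pt", padding=True, truncation=True, max_length=512)
embs = F.normalize(model(**batch).last_hidden_state[:, 0], p=2, dim=1)
print((embs[0] @ embs[1]).item())  # cosine similarity; expect a high value for this pair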

Usage with OpenArc

Requires OpenArc with metadata-driven pooling dispatch (see the change "feat(embed): dispatch pooling mode from sentence-transformers config"). The shipped 1_Pooling/config.json triggers automatic CLS pooling on load.

openarc add bge-m3 \
  --model-path /path/to/bge-m3-fp16-ov \
  --model-type emb \
  --engine optimum \
  --device GPU

openarc serve
# POST /v1/embeddings  {"model": "bge-m3", "input": "..."}
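Assuming the endpoint speaks the standard OpenAI-style embeddings request/response shape (the request body above suggests it does), a minimal client sketch; the host and port are placeholders, adjust to your openarc serve configuration:

import requests

resp = requests.post(
    "http://localhost:8000/v1/embeddings",  # assumed host/port
    json={"model": "bge-m3", "input": "What is the capital of France?"},
)
print(len(resp.json()["data"][0]["embedding"]))  # 1024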

Conversion

Weight compression is baked into the initial save; this avoids the bus-error trap of overwriting a memory-mapped IR file mid-process:

from optimum.intel import OVModelForFeatureExtraction
import openvino as ov

# Export to IR in memory, then write the weights once, already compressed to FP16.
m = OVModelForFeatureExtraction.from_pretrained("BAAI/bge-m3", export=True)
ov.save_model(m.model, "openvino_model.xml", compress_to_fp16=True)
m.save_pretrained("./out")  # tokenizer + sentence-transformers metadata

openvino_tokenizer.{xml,bin} and openvino_detokenizer.{xml,bin} were generated via openvino_tokenizers.convert_tokenizer(..., with_detokenizer=True).
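For completeness, a minimal sketch of that conversion step (output paths are illustrative):

from transformers import AutoTokenizer
import openvino as ov
from openvino_tokenizers import convert_tokenizer

hf_tok = AutoTokenizer.from_pretrained("BAAI/bge-m3")
ov_tok, ov_detok = convert_tokenizer(hf_tok, with_detokenizer=True)
ov.save_model(ov_tok, "openvino_tokenizer.xml")
ov.save_model(ov_detok, "openvino_detokenizer.xml")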

Numerical equivalence

Single-query smoke test against the source PyTorch FP32 weights:

Metric                   Value
Embedding dim            1024
cos(ov_fp16, pt_fp32)    > 0.999
Output unit-normed       ✓
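A minimal reproduction of that smoke test, loading the upstream FP32 weights as the reference:

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
from optimum.intel import OVModelForFeatureExtraction

tok = AutoTokenizer.from_pretrained("BAAI/bge-m3")
ref_model = AutoModel.from_pretrained("BAAI/bge-m3")  # PyTorch FP32 reference
ov_model = OVModelForFeatureExtraction.from_pretrained("kread/bge-m3-fp16-ov")

x = tok("What is the capital of France?", return_tensors="pt")
with torch.no_grad():
    ref = F.normalize(ref_model(**x).last_hidden_state[:, 0], p=2, dim=1)
    test = F.normalize(ov_model(**x).last_hidden_state[:, 0], p=2, dim=1)
print(F.cosine_similarity(ref, test).item())  # expect > 0.999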

License

MIT, inherited from BAAI/bge-m3. See LICENSE in this repo.

Citation

From the upstream model card:

@misc{bge-m3,
      title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
      author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
      year={2024},
      eprint={2402.03216},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}