bge-m3 – OpenVINO IR (FP16)

This is a redistribution. For the model's intended use, training data, evaluation results, limitations, and citation, please see the upstream card: BAAI/bge-m3.

OpenVINO IR conversion of BAAI/bge-m3, weights compressed to FP16. Drop-in for optimum-intel and OpenArc.

This export covers the dense embedding head only (the multi-vector and sparse heads from the original bge-m3 release are not included).

Files

  • openvino_model.{xml,bin} – XLM-RoBERTa encoder, FP16 weights (~1.05 GB)
  • openvino_tokenizer.{xml,bin} / openvino_detokenizer.{xml,bin} – OpenVINO Tokenizers IR
  • 1_Pooling/config.json – sentence-transformers pooling metadata (pooling_mode_cls_token: true; see the sketch after this list)
  • Standard HF tokenizer files: tokenizer.json, tokenizer_config.json, special_tokens_map.json, sentencepiece.bpe.model
  • LICENSE – MIT, inherited from the upstream model.
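For reference, the shipped pooling config follows the standard sentence-transformers Pooling format. A minimal sketch of what such a file looks like for this model's 1024-dim CLS setup (verify against the actual 1_Pooling/config.json in this repo):

{
  "word_embedding_dimension": 1024,
  "pooling_mode_cls_token": true,
  "pooling_mode_mean_tokens": false,
  "pooling_mode_max_tokens": false,
  "pooling_mode_mean_sqrt_len_tokens": false
}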

Architecture

Base model            XLM-RoBERTa large
Hidden size           1024
Layers                24
Max sequence length   8192 (max_position_embeddings = 8194; XLM-R reserves 2 positions)
Vocabulary            250,002
Pooling               CLS (declared via 1_Pooling/config.json)

Usage with optimum-intel

from optimum.intel import OVModelForFeatureExtraction
from transformers import AutoTokenizer
import torch.nn.functional as F

# Compiles the IR on the target device at load; use device="CPU" if no Intel GPU is present.
model = OVModelForFeatureExtraction.from_pretrained("kread/bge-m3-fp16-ov", device="GPU")
tok = AutoTokenizer.from_pretrained("kread/bge-m3-fp16-ov")

inputs = tok("What is the capital of France?", return_tensors="pt",
             padding=True, truncation=True, max_length=512)
embedding = F.normalize(model(**inputs).last_hidden_state[:, 0], p=2, dim=1)  # CLS pool
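Because the rows are unit-normed, cosine similarity between two texts reduces to a dot product. A quick sanity check reusing the model and tokenizer above (the sentence pair is an arbitrary example):

texts = ["What is the capital of France?", "Paris is the capital of France."]
batch = tok(texts, return_tensors="pt", padding=True, truncation=True, max_length=512)
embs = F.normalize(model(**batch).last_hidden_state[:, 0], p=2, dim=1)
print((embs[0] @ embs[1]).item())  # cosine similarity; expect a high value for this pair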

Usage with OpenArc

Requires OpenArc with metadata-driven pooling dispatch (see the change "feat(embed): dispatch pooling mode from sentence-transformers config"). The shipped 1_Pooling/config.json triggers automatic CLS pooling on load.

openarc add bge-m3 \
  --model-path /path/to/bge-m3-fp16-ov \
  --model-type emb \
  --engine optimum \
  --device GPU

openarc serve
# POST /v1/embeddings  {"model": "bge-m3", "input": "..."}
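Assuming the endpoint speaks the standard OpenAI-style embeddings request/response shape (the request body above suggests it does), a minimal client sketch; the host and port are placeholders, adjust to your openarc serve configuration:

import requests

resp = requests.post(
    "http://localhost:8000/v1/embeddings",  # assumed host/port
    json={"model": "bge-m3", "input": "What is the capital of France?"},
)
print(len(resp.json()["data"][0]["embedding"]))  # 1024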

Conversion

Weight compression is baked into the initial save; this avoids the bus-error trap of overwriting a memory-mapped IR file mid-process:

from optimum.intel import OVModelForFeatureExtraction
import openvino as ov

# Export to IR in memory, then write the weights once, already compressed to FP16.
m = OVModelForFeatureExtraction.from_pretrained("BAAI/bge-m3", export=True)
ov.save_model(m.model, "openvino_model.xml", compress_to_fp16=True)
m.save_pretrained("./out")  # tokenizer + sentence-transformers metadata

openvino_tokenizer.{xml,bin} and openvino_detokenizer.{xml,bin} were generated via openvino_tokenizers.convert_tokenizer(..., with_detokenizer=True).
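For completeness, a minimal sketch of that conversion step (output paths are illustrative):

from transformers import AutoTokenizer
import openvino as ov
from openvino_tokenizers import convert_tokenizer

hf_tok = AutoTokenizer.from_pretrained("BAAI/bge-m3")
ov_tok, ov_detok = convert_tokenizer(hf_tok, with_detokenizer=True)
ov.save_model(ov_tok, "openvino_tokenizer.xml")
ov.save_model(ov_detok, "openvino_detokenizer.xml")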

Numerical equivalence

Single-query smoke test against the source PyTorch FP32 weights:

Metric                   Value
Embedding dim            1024
cos(ov_fp16, pt_fp32)    > 0.999
Output unit-normed       ✓
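A minimal reproduction of that smoke test, loading the upstream FP32 weights as the reference:

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
from optimum.intel import OVModelForFeatureExtraction

tok = AutoTokenizer.from_pretrained("BAAI/bge-m3")
ref_model = AutoModel.from_pretrained("BAAI/bge-m3")  # PyTorch FP32 reference
ov_model = OVModelForFeatureExtraction.from_pretrained("kread/bge-m3-fp16-ov")

x = tok("What is the capital of France?", return_tensors="pt")
with torch.no_grad():
    ref = F.normalize(ref_model(**x).last_hidden_state[:, 0], p=2, dim=1)
    test = F.normalize(ov_model(**x).last_hidden_state[:, 0], p=2, dim=1)
print(F.cosine_similarity(ref, test).item())  # expect > 0.999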

License

MIT, inherited from BAAI/bge-m3. See LICENSE in this repo.

Citation

From the upstream model card:

@misc{bge-m3,
      title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
      author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
      year={2024},
      eprint={2402.03216},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}