MedEmbed Large v0.1 ONNX
This repository contains an ONNX export of abhinand/MedEmbed-large-v0.1, a Sentence-Transformers model for biomedical and clinical embeddings.
The model is intended for medical semantic search, retrieval, clustering, sentence similarity, and vector search workflows.
The exported ONNX graph returns token-level embeddings as last_hidden_state. To reproduce the original Sentence-Transformers sentence embeddings, consumers must apply CLS pooling followed by L2 normalization.
Repository contents
- model.onnx: original ONNX export from the Sentence-Transformers Transformer module.
- model_optimized.onnx: ONNX Runtime optimized version for CPU inference.
- config.json: Transformer configuration.
- tokenizer.json: tokenizer file.
- tokenizer_config.json: tokenizer configuration.
- special_tokens_map.json: special tokens configuration.
- vocab.txt: BERT vocabulary.
- modules.json: Sentence-Transformers module definition.
- sentence_bert_config.json: Sentence-Transformers configuration.
- config_sentence_transformers.json: Sentence-Transformers metadata.
- 1_Pooling/config.json: pooling configuration used by the original Sentence-Transformers model.
- README.md: this documentation.
Embedding configuration
This model uses:
Embedding dimension: 1024
Pooling: CLS
Normalization: L2
Recommended vector distance: Cosine
For vector databases such as Qdrant, use:
Vector size: 1024
Distance: Cosine
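Because the final embeddings are L2-normalized, cosine similarity between two of them reduces to a plain dot product. The sketch below illustrates this with random stand-in vectors rather than real model output; the helper name l2_normalize and the array sizes are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for model output: 1 query and 3 documents, 1024 dimensions each.
query = rng.normal(size=(1, 1024)).astype(np.float32)
docs = rng.normal(size=(3, 1024)).astype(np.float32)


def l2_normalize(x):
    """Scale every row to unit L2 norm (hypothetical helper)."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


query_n = l2_normalize(query)
docs_n = l2_normalize(docs)

# Full cosine similarity computed from the raw vectors.
cosine = (query @ docs.T) / (
    np.linalg.norm(query, axis=1, keepdims=True) * np.linalg.norm(docs, axis=1)
)

# Plain dot product of the normalized vectors.
dot = query_n @ docs_n.T

# Document indices ordered from most to least similar to the query.
ranking = np.argsort(-dot[0])

print(np.allclose(cosine, dot, atol=1e-5))
print(ranking)
```

A vector store configured with the Cosine distance should therefore rank results the same way as a dot product over these normalized vectors.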
Original Sentence-Transformers structure
The original model structure was inspected before export:
SentenceTransformer(
  (0): Transformer({'transformer_task': 'feature-extraction', 'modality_config': {'text': {'method': 'forward', 'method_output_name': 'last_hidden_state'}}, 'module_output_name': 'token_embeddings', 'architecture': 'BertModel'})
  (1): Pooling({'embedding_dimension': 1024, 'pooling_mode': 'cls', 'include_prompt': True})
  (2): Normalize({})
)
The important points are:
Pooling mode: CLS
Final embedding dimension: 1024
Normalize module: enabled
Therefore, consumers must not use mean pooling for this model. The final sentence embedding is obtained from the CLS token representation and then L2-normalized.
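To make the difference concrete, the following sketch applies both pooling strategies to a synthetic tensor shaped like last_hidden_state. The random values and the attention mask are assumptions for illustration; only the CLS branch corresponds to what this model expects.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for last_hidden_state: (batch=2, sequence=5, hidden=1024).
token_embeddings = rng.normal(size=(2, 5, 1024)).astype(np.float32)

# The second row has two padding positions at the end.
attention_mask = np.array([[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]], dtype=np.int64)

# CLS pooling: take the first token of every sequence.
cls_pooled = token_embeddings[:, 0, :]

# Masked mean pooling, shown only for contrast with CLS pooling.
mask = attention_mask[:, :, None].astype(np.float32)
mean_pooled = (token_embeddings * mask).sum(axis=1) / mask.sum(axis=1)

print(cls_pooled.shape, mean_pooled.shape)
print(np.allclose(cls_pooled, mean_pooled))
```

Both results have the same shape, so a wrong pooling choice does not raise an error; it silently produces vectors that differ from the original model's output.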
How this ONNX model was created
The model was exported on macOS from:
abhinand/MedEmbed-large-v0.1
The export was done manually by loading the SentenceTransformer model, extracting its first module, and exporting the underlying Transformer model:
model = SentenceTransformer(
    "abhinand/MedEmbed-large-v0.1",
    device="cpu",
)
transformer = model[0].auto_model
tokenizer = model[0].tokenizer
The model was exported on CPU. This is intentional. Although Apple Silicon MPS was available locally, ONNX export from MPS can produce mixed-device export errors. CPU export is the safer and more portable path.
The resulting ONNX model takes the following inputs:
input_ids
attention_mask
token_type_ids
And returns:
last_hidden_state
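One way to keep the tokenizer output aligned with these graph inputs is to build the feed dictionary from the input names the graph declares. The sketch below is a minimal illustration: build_feed is a hypothetical helper, the token ids are made up, and the cast to int64 assumes the graph was traced from PyTorch long tensors, as in the export described here.

```python
import numpy as np


def build_feed(encoded, input_names):
    """Keep only the tokenizer outputs that the ONNX graph declares as inputs.

    Hypothetical helper: `encoded` is any mapping from name to array-like.
    """
    missing = [name for name in input_names if name not in encoded]
    if missing:
        raise KeyError(f"tokenizer output lacks graph inputs: {missing}")
    # Assumption: the exported graph expects int64 tensors for all inputs.
    return {name: np.asarray(encoded[name], dtype=np.int64) for name in input_names}


# With a real onnxruntime session the names would come from:
#   input_names = [i.name for i in session.get_inputs()]
input_names = ["input_ids", "attention_mask", "token_type_ids"]

# Made-up token ids standing in for real tokenizer output.
encoded = {
    "input_ids": [[101, 5776, 102]],
    "attention_mask": [[1, 1, 1]],
    "token_type_ids": [[0, 0, 0]],
}

feed = build_feed(encoded, input_names)
print({name: (value.shape, value.dtype) for name, value in feed.items()})
```

Reading the names from the session rather than hard-coding them means a missing input fails early with a clear message instead of inside the runtime.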
Environment used
The export was performed in a local Python virtual environment on macOS:
cd /Users/filipelopes/Desktop/Development/convert-onnx
python3 -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -U torch transformers sentence-transformers "optimum[onnxruntime]" onnx onnxruntime onnxscript huggingface_hub
The environment used during this conversion included:
torch: 2.12.0
transformers: 4.57.6
sentence-transformers: 5.5.1
onnx: 1.21.0
onnxruntime: 1.26.0
mps available: True
MPS was available, but ONNX export was performed on CPU.
Export script
The ONNX export used the following logic:
from pathlib import Path
import torch
from sentence_transformers import SentenceTransformer
MODEL_ID = "abhinand/MedEmbed-large-v0.1"
OUT_DIR = Path("./MedEmbed-large-v0.1-onnx")
OUT_DIR.mkdir(parents=True, exist_ok=True)
model = SentenceTransformer(MODEL_ID, device="cpu")
model.eval()
transformer = model[0].auto_model
tokenizer = model[0].tokenizer
transformer.to("cpu")
transformer.eval()
model.save(str(OUT_DIR))
tokenizer.save_pretrained(OUT_DIR)
transformer.config.save_pretrained(OUT_DIR)
dummy = tokenizer(
    ["Patient has chronic kidney disease."],
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt",
)
dummy = {k: v.to("cpu") for k, v in dummy.items()}
input_names = ["input_ids", "attention_mask"]
args = (dummy["input_ids"], dummy["attention_mask"])
has_token_type_ids = "token_type_ids" in dummy
if has_token_type_ids:
    input_names.append("token_type_ids")
    args = (
        dummy["input_ids"],
        dummy["attention_mask"],
        dummy["token_type_ids"],
    )
class TransformerWrapper(torch.nn.Module):
    def __init__(self, transformer, has_token_type_ids: bool):
        super().__init__()
        self.transformer = transformer
        self.has_token_type_ids = has_token_type_ids
    def forward(self, input_ids, attention_mask, token_type_ids=None):
        if self.has_token_type_ids:
            outputs = self.transformer(
                input_ids=input_ids,
                attention_mask=attention_mask,
                token_type_ids=token_type_ids,
                return_dict=True,
            )
        else:
            outputs = self.transformer(
                input_ids=input_ids,
                attention_mask=attention_mask,
                return_dict=True,
            )
        return outputs.last_hidden_state
wrapper = TransformerWrapper(
    transformer=transformer,
    has_token_type_ids=has_token_type_ids,
)
wrapper.to("cpu")
wrapper.eval()
dynamic_axes = {
    "input_ids": {0: "batch", 1: "sequence"},
    "attention_mask": {0: "batch", 1: "sequence"},
    "last_hidden_state": {0: "batch", 1: "sequence"},
}
if has_token_type_ids:
    dynamic_axes["token_type_ids"] = {0: "batch", 1: "sequence"}
with torch.no_grad():
    torch.onnx.export(
        wrapper,
        args=args,
        f=str(OUT_DIR / "model.onnx"),
        input_names=input_names,
        output_names=["last_hidden_state"],
        dynamic_axes=dynamic_axes,
        opset_version=17,
        do_constant_folding=True,
        dynamo=False,
    )
print("Exported to:", OUT_DIR / "model.onnx")
print("Input names:", input_names)
print("Embedding dimension:", model.get_sentence_embedding_dimension())
print("Pooling: CLS")
print("Normalize: True")
ONNX Runtime optimization
Before optimization, the Transformer configuration was inspected:
model_type: bert
hidden_size: 1024
num_attention_heads: 16
The model was then optimized using ONNX Runtime's BERT optimizer:
python -m onnxruntime.transformers.optimizer \
  --input ./MedEmbed-large-v0.1-onnx/model.onnx \
  --output ./MedEmbed-large-v0.1-onnx/model_optimized.onnx \
  --model_type bert \
  --num_heads 16 \
  --hidden_size 1024 \
  --opt_level 2
The optimized model is intended for CPU inference. It should generally provide equal or better latency than the original ONNX graph, while preserving equivalent embeddings.
This step is a graph optimization (for example, operator fusion), not INT8 quantization. The weights are not quantized, so no meaningful semantic degradation is expected.
Recommended usage:
Use model_optimized.onnx by default for CPU inference.
Keep model.onnx as the reference exported ONNX graph.
Validation
The exported ONNX model was validated against the original Sentence-Transformers model.
Validation result:
ST shape: (4, 1024)
ONNX shape: (4, 1024)
Cosine ST vs ONNX per row:
[0.99999994 1. 1.0000001 1. ]
Mean cosine: 1.0
Max abs diff: 2.4028122425079346e-07
This indicates that the ONNX output, after CLS pooling and L2 normalization, matches the original Sentence-Transformers output to within float32 rounding error for the tested examples (maximum absolute difference of about 2.4e-07).
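The metrics reported above (per-row cosine, mean cosine, and maximum absolute difference) can be computed along the following lines. This is a sketch, not the script used for the validation: compare_embeddings is a hypothetical helper, and the two arrays are synthetic stand-ins for the Sentence-Transformers and ONNX embeddings.

```python
import numpy as np


def compare_embeddings(reference, candidate):
    """Return per-row cosine, mean cosine, and max absolute difference."""
    reference = np.asarray(reference, dtype=np.float32)
    candidate = np.asarray(candidate, dtype=np.float32)
    if reference.shape != candidate.shape:
        raise ValueError(f"shape mismatch: {reference.shape} vs {candidate.shape}")
    dots = np.sum(reference * candidate, axis=1)
    norms = np.linalg.norm(reference, axis=1) * np.linalg.norm(candidate, axis=1)
    cosine = dots / norms
    return {
        "cosine_per_row": cosine,
        "mean_cosine": float(cosine.mean()),
        "max_abs_diff": float(np.abs(reference - candidate).max()),
    }


# Synthetic stand-ins: a unit-norm reference and a slightly perturbed copy.
rng = np.random.default_rng(0)
reference = rng.normal(size=(4, 1024)).astype(np.float32)
reference /= np.linalg.norm(reference, axis=1, keepdims=True)
candidate = reference + np.float32(1e-7)

report = compare_embeddings(reference, candidate)
print(report["cosine_per_row"])
print(report["mean_cosine"], report["max_abs_diff"])
```

The same comparison could be used to check model_optimized.onnx against model.onnx after running both on identical inputs.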
Usage with ONNX Runtime
Install dependencies:
pip install -U transformers onnxruntime huggingface_hub numpy
Run inference:
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer
repo_id = "filipelopesmedbr/MedEmbed-large-v0.1-onnx"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
onnx_path = hf_hub_download(
    repo_id=repo_id,
    filename="model_optimized.onnx",
)
texts = [
    "Patient has chronic kidney disease.",
    "The patient was diagnosed with renal failure.",
]
encoded = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="np",
)
session = ort.InferenceSession(
    onnx_path,
    providers=["CPUExecutionProvider"],
)
inputs = {
    "input_ids": encoded["input_ids"],
    "attention_mask": encoded["attention_mask"],
}
if "token_type_ids" in encoded:
    inputs["token_type_ids"] = encoded["token_type_ids"]
outputs = session.run(None, inputs)
# MedEmbed-large-v0.1 uses CLS pooling.
embeddings = outputs[0][:, 0, :]
# Match the SentenceTransformers Normalize module.
embeddings = embeddings / np.linalg.norm(
    embeddings,
    axis=1,
    keepdims=True,
)
print(embeddings.shape)
print(embeddings[0][:10])
Expected output shape:
(2, 1024)
Pooling
The ONNX graph returns token embeddings. Sentence embeddings must be generated with CLS pooling:
embeddings = token_embeddings[:, 0, :]
Then apply L2 normalization:
embeddings = embeddings / np.linalg.norm(
    embeddings,
    axis=1,
    keepdims=True,
)
Do not use mean pooling for this model unless you intentionally want behavior different from the original Sentence-Transformers model.
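The two steps can be wrapped in one helper. The sketch below also clamps the norm to a small epsilon, mirroring the behavior of torch.nn.functional.normalize (which the Sentence-Transformers Normalize module relies on), so that an all-zero vector does not produce NaN values. The function name cls_pool_and_normalize and the tiny input tensor are assumptions for illustration.

```python
import numpy as np


def cls_pool_and_normalize(last_hidden_state, eps=1e-12):
    """CLS pooling followed by L2 normalization with a divide-by-zero guard."""
    cls = last_hidden_state[:, 0, :]
    norms = np.linalg.norm(cls, axis=1, keepdims=True)
    # Clamp the norm so a zero vector stays zero instead of becoming NaN.
    return cls / np.maximum(norms, eps)


# Tiny hand-made tensor: (batch=2, sequence=3, hidden=4).
hidden = np.zeros((2, 3, 4), dtype=np.float32)
hidden[0, 0] = [3.0, 0.0, 4.0, 0.0]  # CLS vector with L2 norm 5
# Row 1 keeps an all-zero CLS vector to exercise the guard.

pooled = cls_pool_and_normalize(hidden)
print(pooled)
```

With real model output the guard should never trigger; it only makes the helper safe against degenerate inputs.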
Upload to Hugging Face
The repository can be uploaded using the Hugging Face CLI:
hf auth login
hf upload \
  filipelopesmedbr/MedEmbed-large-v0.1-onnx \
  ./MedEmbed-large-v0.1-onnx \
  . \
  --repo-type model \
  --commit-message "Add ONNX export"
To upload only the optimized ONNX model:
hf upload \
  filipelopesmedbr/MedEmbed-large-v0.1-onnx \
  ./MedEmbed-large-v0.1-onnx/model_optimized.onnx \
  model_optimized.onnx \
  --repo-type model \
  --commit-message "Add optimized ONNX model"
To upload this README:
hf upload \
  filipelopesmedbr/MedEmbed-large-v0.1-onnx \
  ./MedEmbed-large-v0.1-onnx/README.md \
  README.md \
  --repo-type model \
  --commit-message "Add README documenting ONNX export process"
Notes
- model_optimized.onnx is recommended for CPU inference.
- model.onnx is kept as the reference ONNX export.
- This repository does not require the original PyTorch weights for ONNX inference.
- The model was exported from the Sentence-Transformers Transformer component and validated against SentenceTransformer.encode().
- Apple Silicon MPS can be useful for PyTorch inference, but the ONNX export itself should be done on CPU.
- Consumers must apply CLS pooling and L2 normalization outside the ONNX graph.
- For Qdrant or similar vector databases, configure vector size 1024 and distance Cosine.
Original model
Original model:
abhinand/MedEmbed-large-v0.1
This repository is a converted and validated ONNX inference artifact derived from that model.