MedEmbed Large v0.1 ONNX

This repository contains an ONNX export of abhinand/MedEmbed-large-v0.1, a Sentence-Transformers model for biomedical and clinical embeddings.

The model is intended for medical semantic search, retrieval, clustering, sentence similarity, and vector search workflows.

The exported ONNX graph returns token-level embeddings as last_hidden_state. To reproduce the original Sentence-Transformers sentence embeddings, consumers must apply CLS pooling followed by L2 normalization.

Repository contents

  • model.onnx: original ONNX export from the Sentence-Transformers Transformer module.
  • model_optimized.onnx: ONNX Runtime optimized version for CPU inference.
  • config.json: Transformer configuration.
  • tokenizer.json: tokenizer file.
  • tokenizer_config.json: tokenizer configuration.
  • special_tokens_map.json: special tokens configuration.
  • vocab.txt: BERT vocabulary.
  • modules.json: Sentence-Transformers module definition.
  • sentence_bert_config.json: Sentence-Transformers configuration.
  • config_sentence_transformers.json: Sentence-Transformers metadata.
  • 1_Pooling/config.json: pooling configuration used by the original Sentence-Transformers model.
  • README.md: this documentation.

Embedding configuration

This model uses:

Embedding dimension: 1024
Pooling: CLS
Normalization: L2
Recommended vector distance: Cosine

For vector databases such as Qdrant, use:

Vector size: 1024
Distance: Cosine
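
As a rough illustration, the settings above correspond to a collection-creation request body like the one built below. This sketch assumes Qdrant's REST endpoint `PUT /collections/{name}`; the collection name `medembed_docs` is a hypothetical placeholder, and no request is actually sent.

```python
import json

# Hypothetical collection name, used only for illustration.
COLLECTION_NAME = "medembed_docs"

# Vector parameters matching this model: 1024 dimensions, cosine distance.
create_collection_body = {
    "vectors": {
        "size": 1024,
        "distance": "Cosine",
    }
}

# Body for: PUT /collections/medembed_docs (assumed Qdrant REST endpoint).
payload = json.dumps(create_collection_body, indent=2)
print(f"PUT /collections/{COLLECTION_NAME}")
print(payload)
```

Any vector store that accepts a dimension and a distance metric can be configured the same way.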

Original Sentence-Transformers structure

The original model structure was inspected before export:

SentenceTransformer(
  (0): Transformer({'transformer_task': 'feature-extraction', 'modality_config': {'text': {'method': 'forward', 'method_output_name': 'last_hidden_state'}}, 'module_output_name': 'token_embeddings', 'architecture': 'BertModel'})
  (1): Pooling({'embedding_dimension': 1024, 'pooling_mode': 'cls', 'include_prompt': True})
  (2): Normalize({})
)

The important points are:

Pooling mode: CLS
Final embedding dimension: 1024
Normalize module: enabled

Therefore, consumers must not use mean pooling for this model. The final sentence embedding is obtained from the CLS token representation and then L2-normalized.

How this ONNX model was created

The model was exported on macOS from:

abhinand/MedEmbed-large-v0.1

The export was done manually by loading the SentenceTransformer model, extracting its first module, and exporting the underlying Transformer model:

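from sentence_transformers import SentenceTransformer

# Load on CPU so every tensor lives on one device during export.
model = SentenceTransformer(
    "abhinand/MedEmbed-large-v0.1",
    device="cpu",
)

# Module 0 is the Transformer wrapper; auto_model is the underlying Hugging Face model.
transformer = model[0].auto_model
tokenizer = model[0].tokenizer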

The export was deliberately run on CPU. Apple Silicon MPS was available locally, but exporting from MPS can fail with mixed-device errors, so CPU is the safer and more portable path.

The resulting ONNX model takes the following inputs:

input_ids
attention_mask
token_type_ids

And returns:

last_hidden_state
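
Because some tokenizers do not emit token_type_ids, a consumer may prefer to build the feed dictionary from the input names the graph declares rather than hard-coding them. The sketch below is one way to do that. `build_feed` is a hypothetical helper, and in real use `graph_input_names` would come from `[i.name for i in session.get_inputs()]` on an ONNX Runtime `InferenceSession`.

```python
def build_feed(encoded, graph_input_names):
    """Select only the tokenizer outputs that the ONNX graph declares as inputs.

    Raises KeyError if the graph expects an input the tokenizer did not produce.
    """
    missing = [name for name in graph_input_names if name not in encoded]
    if missing:
        raise KeyError(f"Tokenizer output is missing graph inputs: {missing}")
    return {name: encoded[name] for name in graph_input_names}


# Toy stand-ins for tokenizer output (real values would be int64 NumPy arrays).
encoded = {
    "input_ids": [[101, 5776, 102]],
    "attention_mask": [[1, 1, 1]],
    "token_type_ids": [[0, 0, 0]],
    "extra_field": "ignored",
}
graph_input_names = ["input_ids", "attention_mask", "token_type_ids"]

feed = build_feed(encoded, graph_input_names)
print(sorted(feed))
```

Failing early on a missing input gives a clearer error than the one ONNX Runtime raises at `session.run` time.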

Environment used

The export was performed in a local Python virtual environment on macOS:

cd /Users/filipelopes/Desktop/Development/convert-onnx
python3 -m venv .venv
source .venv/bin/activate

pip install -U pip
pip install -U torch transformers sentence-transformers "optimum[onnxruntime]" onnx onnxruntime onnxscript huggingface_hub

The environment used during this conversion included:

torch: 2.12.0
transformers: 4.57.6
sentence-transformers: 5.5.1
onnx: 1.21.0
onnxruntime: 1.26.0
mps available: True

MPS was available, but ONNX export was performed on CPU.
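
A version list like the one above could be collected with a short script such as the following. This is a sketch, not the command the author ran. `installed_versions` is a hypothetical helper, and packages that are missing are reported as "not installed" instead of raising.

```python
from importlib.metadata import PackageNotFoundError, version

PACKAGES = ["torch", "transformers", "sentence-transformers", "onnx", "onnxruntime"]


def installed_versions(packages):
    """Return {package: version string}, or "not installed" when missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = "not installed"
    return found


versions = installed_versions(PACKAGES)
for name, ver in versions.items():
    print(f"{name}: {ver}")

# MPS availability can only be queried when a recent PyTorch is importable.
try:
    import torch

    print("mps available:", torch.backends.mps.is_available())
except (ImportError, AttributeError):
    print("mps available: unknown (torch missing or too old)")
```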

Export script

The ONNX export used the following logic:

from pathlib import Path

import torch
from sentence_transformers import SentenceTransformer

MODEL_ID = "abhinand/MedEmbed-large-v0.1"
OUT_DIR = Path("./MedEmbed-large-v0.1-onnx")

OUT_DIR.mkdir(parents=True, exist_ok=True)

model = SentenceTransformer(MODEL_ID, device="cpu")
model.eval()

transformer = model[0].auto_model
tokenizer = model[0].tokenizer

transformer.to("cpu")
transformer.eval()

model.save(str(OUT_DIR))
tokenizer.save_pretrained(OUT_DIR)
transformer.config.save_pretrained(OUT_DIR)

dummy = tokenizer(
    ["Patient has chronic kidney disease."],
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt",
)

dummy = {k: v.to("cpu") for k, v in dummy.items()}

# BERT tokenizers usually emit token_type_ids; export it as an input only when present.
input_names = ["input_ids", "attention_mask"]
has_token_type_ids = "token_type_ids" in dummy
if has_token_type_ids:
    input_names.append("token_type_ids")

# Positional export arguments, in the same order as input_names.
args = tuple(dummy[name] for name in input_names)


class TransformerWrapper(torch.nn.Module):
    def __init__(self, transformer, has_token_type_ids: bool):
        super().__init__()
        self.transformer = transformer
        self.has_token_type_ids = has_token_type_ids

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        if self.has_token_type_ids:
            outputs = self.transformer(
                input_ids=input_ids,
                attention_mask=attention_mask,
                token_type_ids=token_type_ids,
                return_dict=True,
            )
        else:
            outputs = self.transformer(
                input_ids=input_ids,
                attention_mask=attention_mask,
                return_dict=True,
            )

        return outputs.last_hidden_state


wrapper = TransformerWrapper(
    transformer=transformer,
    has_token_type_ids=has_token_type_ids,
)

wrapper.to("cpu")
wrapper.eval()

dynamic_axes = {
    "input_ids": {0: "batch", 1: "sequence"},
    "attention_mask": {0: "batch", 1: "sequence"},
    "last_hidden_state": {0: "batch", 1: "sequence"},
}

if has_token_type_ids:
    dynamic_axes["token_type_ids"] = {0: "batch", 1: "sequence"}

with torch.no_grad():
    torch.onnx.export(
        wrapper,
        args=args,
        f=str(OUT_DIR / "model.onnx"),
        input_names=input_names,
        output_names=["last_hidden_state"],
        dynamic_axes=dynamic_axes,
        opset_version=17,
        do_constant_folding=True,
        dynamo=False,
    )

print("Exported to:", OUT_DIR / "model.onnx")
print("Input names:", input_names)
print("Embedding dimension:", model.get_sentence_embedding_dimension())
print("Pooling: CLS")
print("Normalize: True")

ONNX Runtime optimization

Before optimization, the Transformer configuration was inspected:

model_type: bert
hidden_size: 1024
num_attention_heads: 16

The model was then optimized with ONNX Runtime's transformer optimizer, using the BERT model type:

python -m onnxruntime.transformers.optimizer \
  --input ./MedEmbed-large-v0.1-onnx/model.onnx \
  --output ./MedEmbed-large-v0.1-onnx/model_optimized.onnx \
  --model_type bert \
  --num_heads 16 \
  --hidden_size 1024 \
  --opt_level 2
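
The command above can also be expressed through the Python API. The sketch below assumes the `optimize_model` function in `onnxruntime.transformers.optimizer` with the keyword arguments shown. `optimize_bert_graph` is a hypothetical wrapper, and the paths mirror the command line above.

```python
from pathlib import Path


def optimize_bert_graph(input_path, output_path, num_heads=16, hidden_size=1024):
    """Run ONNX Runtime's transformer optimizer and save the optimized graph."""
    # Imported lazily so the helper can be defined without onnxruntime installed.
    from onnxruntime.transformers import optimizer

    optimized = optimizer.optimize_model(
        str(input_path),
        model_type="bert",
        num_heads=num_heads,
        hidden_size=hidden_size,
        opt_level=2,
    )
    optimized.save_model_to_file(str(output_path))
    return Path(output_path)


model_dir = Path("./MedEmbed-large-v0.1-onnx")
source = model_dir / "model.onnx"
target = model_dir / "model_optimized.onnx"

# Only run when the exported graph is actually present.
if source.exists():
    print("Saved:", optimize_bert_graph(source, target))
else:
    print("Skipping optimization, file not found:", source)
```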

The optimized model is intended for CPU inference. It is expected to match or improve on the latency of the original ONNX graph while producing equivalent embeddings.

This step is a graph-level optimization (for example, operator fusion), not INT8 quantization, so no meaningful semantic degradation is expected.

Recommended usage:

Use model_optimized.onnx by default for CPU inference.
Keep model.onnx as the reference exported ONNX graph.

Validation

The exported ONNX model was validated against the original Sentence-Transformers model.

Validation result:

ST shape: (4, 1024)
ONNX shape: (4, 1024)
Cosine ST vs ONNX per row:
[0.99999994 1.         1.0000001  1.        ]
Mean cosine: 1.0
Max abs diff: 2.4028122425079346e-07

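For the four tested examples, the ONNX output after CLS pooling and L2 normalization matches the original Sentence-Transformers output to within float32 rounding error: the largest element-wise difference is about 2.4e-07. Per-row cosine values that land slightly above or below 1 are rounding artifacts of the same kind.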
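
The metrics reported above (per-row cosine similarity and maximum absolute difference) can be computed as sketched below. This is not the author's validation script. `compare_embeddings` is a hypothetical helper, and the two small arrays are made-up stand-ins for the Sentence-Transformers and ONNX outputs.

```python
import numpy as np


def compare_embeddings(reference, candidate):
    """Return (per-row cosine similarity, max absolute difference)."""
    reference = np.asarray(reference, dtype=np.float32)
    candidate = np.asarray(candidate, dtype=np.float32)
    dots = np.sum(reference * candidate, axis=1)
    norms = np.linalg.norm(reference, axis=1) * np.linalg.norm(candidate, axis=1)
    cosine = dots / norms
    max_abs_diff = float(np.max(np.abs(reference - candidate)))
    return cosine, max_abs_diff


# Made-up unit vectors standing in for real (N, 1024) embedding matrices.
st_embeddings = np.array([[1.0, 0.0], [0.6, 0.8]], dtype=np.float32)
onnx_embeddings = st_embeddings + np.float32(1e-6)

cosine, max_abs_diff = compare_embeddings(st_embeddings, onnx_embeddings)
print("Cosine per row:", cosine)
print("Mean cosine:", float(cosine.mean()))
print("Max abs diff:", max_abs_diff)
```

In a real check, `st_embeddings` would come from `SentenceTransformer.encode()` and `onnx_embeddings` from the pooled and normalized ONNX output for the same texts.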

Usage with ONNX Runtime

Install dependencies:

pip install -U transformers onnxruntime huggingface_hub numpy

Run inference:

import numpy as np
import onnxruntime as ort

from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

repo_id = "filipelopesmedbr/MedEmbed-large-v0.1-onnx"

tokenizer = AutoTokenizer.from_pretrained(repo_id)

onnx_path = hf_hub_download(
    repo_id=repo_id,
    filename="model_optimized.onnx",
)

texts = [
    "Patient has chronic kidney disease.",
    "The patient was diagnosed with renal failure.",
]

encoded = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="np",
)

session = ort.InferenceSession(
    onnx_path,
    providers=["CPUExecutionProvider"],
)

inputs = {
    "input_ids": encoded["input_ids"],
    "attention_mask": encoded["attention_mask"],
}

if "token_type_ids" in encoded:
    inputs["token_type_ids"] = encoded["token_type_ids"]

outputs = session.run(None, inputs)

# MedEmbed-large-v0.1 uses CLS pooling.
embeddings = outputs[0][:, 0, :]

# Match the SentenceTransformers Normalize module.
embeddings = embeddings / np.linalg.norm(
    embeddings,
    axis=1,
    keepdims=True,
)

print(embeddings.shape)
print(embeddings[0][:10])

Expected output shape:

(2, 1024)
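
Because the embeddings are L2-normalized, cosine similarity between two texts reduces to a dot product. The sketch below uses made-up 3-dimensional unit vectors in place of the real (2, 1024) array so that it runs without the model.

```python
import numpy as np

# Made-up unit vectors standing in for the normalized model embeddings.
embeddings = np.array(
    [
        [1.0, 0.0, 0.0],
        [0.6, 0.8, 0.0],
    ],
    dtype=np.float32,
)

# For unit-length rows, the matrix of dot products is the cosine similarity matrix.
similarity = embeddings @ embeddings.T

print(similarity.shape)
print(float(similarity[0, 1]))
```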

Pooling

The ONNX graph returns token embeddings. Sentence embeddings must be generated with CLS pooling:

embeddings = token_embeddings[:, 0, :]

Then apply L2 normalization:

embeddings = embeddings / np.linalg.norm(
    embeddings,
    axis=1,
    keepdims=True,
)

Do not use mean pooling for this model unless you intentionally want behavior different from the original Sentence-Transformers model.
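
The two steps can be wrapped in one helper. The sketch below is a convenience and not part of this repository. `cls_pool_and_normalize` is a hypothetical name, and the small `eps` floor is an addition made here only to avoid division by zero for an all-zero vector.

```python
import numpy as np


def cls_pool_and_normalize(token_embeddings, eps=1e-12):
    """Take the first-token (CLS) vector per sequence and scale it to unit L2 norm."""
    cls = token_embeddings[:, 0, :]
    norms = np.linalg.norm(cls, axis=1, keepdims=True)
    return cls / np.maximum(norms, eps)


# Toy (batch=2, sequence=3, hidden=2) array in place of last_hidden_state.
token_embeddings = np.array(
    [
        [[3.0, 4.0], [9.0, 9.0], [9.0, 9.0]],
        [[0.0, 2.0], [9.0, 9.0], [9.0, 9.0]],
    ],
    dtype=np.float32,
)

sentence_embeddings = cls_pool_and_normalize(token_embeddings)
print(sentence_embeddings)
```

Only the first token of each sequence contributes to the result, so the padding and remaining token vectors are ignored.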

Upload to Hugging Face

The repository can be uploaded using the Hugging Face CLI:

hf auth login

hf upload \
  filipelopesmedbr/MedEmbed-large-v0.1-onnx \
  ./MedEmbed-large-v0.1-onnx \
  . \
  --repo-type model \
  --commit-message "Add ONNX export"

To upload only the optimized ONNX model:

hf upload \
  filipelopesmedbr/MedEmbed-large-v0.1-onnx \
  ./MedEmbed-large-v0.1-onnx/model_optimized.onnx \
  model_optimized.onnx \
  --repo-type model \
  --commit-message "Add optimized ONNX model"

To upload this README:

hf upload \
  filipelopesmedbr/MedEmbed-large-v0.1-onnx \
  ./MedEmbed-large-v0.1-onnx/README.md \
  README.md \
  --repo-type model \
  --commit-message "Add README documenting ONNX export process"

Notes

  • model_optimized.onnx is recommended for CPU inference.
  • model.onnx is kept as the reference ONNX export.
  • This repository does not require the original PyTorch weights for ONNX inference.
  • The model was exported from the Sentence-Transformers Transformer component and validated against SentenceTransformer.encode().
  • Apple Silicon MPS can be useful for PyTorch inference, but the ONNX export itself should be done on CPU.
  • Consumers must apply CLS pooling and L2 normalization outside the ONNX graph.
  • For Qdrant or similar vector databases, configure vector size 1024 and distance Cosine.

Original model

Source checkpoint on the Hugging Face Hub:

abhinand/MedEmbed-large-v0.1

This repository is a converted and validated ONNX inference artifact derived from that model.
