medclinical

A semantic retrieval encoder for RAG applications, built on top of the thomas-sounack/BioClinical-ModernBERT-large encoder.

Trained with a contrastive learning approach on a mixed dataset of medical texts and QA pairs.

max sequence length

It is 1024 tokens. The encoder was trained on sequences with a maximum length of 1024 tokens (including 2 special tokens).

The thomas-sounack/BioClinical-ModernBERT-large encoder supports sequences up to 8192 tokens, but I consider this too long for typical medical texts, which often jump from one topic to another.
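Given the 1024-token limit, longer documents have to be split before encoding. A minimal sketch of window-based chunking (the chunk_token_ids helper, its overlap size, and the example values are illustrative, not part of this model's API):

```python
def chunk_token_ids(token_ids, max_len=1024, n_special=2, overlap=64):
    """Split token ids into overlapping windows of at most max_len - n_special
    tokens, leaving room for the 2 special tokens the tokenizer adds."""
    window = max_len - n_special
    step = window - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break
    return chunks

# toy example: a 2500-token document becomes three windows
chunks = chunk_token_ids(list(range(2500)), max_len=1024, n_special=2, overlap=64)
print([len(c) for c in chunks])  # → [1022, 1022, 584]
```

Each chunk is then tokenized and encoded separately; the overlap reduces the chance of splitting a relevant passage across two windows.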

downloading

If you prefer to work with local copies, first download both models with the hf command or manually.

# https://huggingface.co/thomas-sounack/BioClinical-ModernBERT-large
$ mkdir -p "./work/base"
$ hf download "thomas-sounack/BioClinical-ModernBERT-large" --local-dir "./work/base/BIOCLINICAL_LARGE"
# https://huggingface.co/mazurkin/medclinical
$ mkdir -p "./work/export/trial-release"
$ hf download "mazurkin/medclinical" --local-dir "./work/export/trial-release"

usage

load the base encoder

from the local folder (first download thomas-sounack/BioClinical-ModernBERT-large to the ./work/base/BIOCLINICAL_LARGE folder):

import transformers

base = transformers.AutoModel.from_pretrained(
    './work/base/BIOCLINICAL_LARGE',
    trust_remote_code=True,
    local_files_only=True,
)

from HuggingFace directly:

base = transformers.AutoModel.from_pretrained(
    'thomas-sounack/BioClinical-ModernBERT-large',
    trust_remote_code=True,
)

load the tokenizer

from the local folder (first download thomas-sounack/BioClinical-ModernBERT-large to the ./work/base/BIOCLINICAL_LARGE folder):

tokenizer = transformers.AutoTokenizer.from_pretrained(
    './work/base/BIOCLINICAL_LARGE',
    local_files_only=True,
)

from HuggingFace directly:

tokenizer = transformers.AutoTokenizer.from_pretrained(
    'thomas-sounack/BioClinical-ModernBERT-large',
)

load the semantic encoder

from the local folder (first download 'mazurkin/medclinical' to './work/export/trial-release', as in the downloading section above):

model = transformers.AutoModel.from_pretrained(
    './work/export/trial-release',
    trust_remote_code=True,
    local_files_only=True,
    base_encoder=base,
)

from HuggingFace directly:

model = transformers.AutoModel.from_pretrained(
    'mazurkin/medclinical',
    trust_remote_code=True,
    base_encoder=base,
)

tokenize

texts = [
    'Type 2 diabetes mellitus with insulin resistance and metabolic syndrome.',
    'Started on metformin 500mg twice daily for glycemic control in adult-onset diabetes.',
    'The stock market closed higher today with tech shares leading the gains.',
]

encoded = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=1024, # not 8192 as 'thomas-sounack/BioClinical-ModernBERT-large' reports
    return_tensors='pt',
)

compute the embeddings

import torch

with torch.inference_mode():
    outputs = model(
        input_ids=encoded['input_ids'],
        attention_mask=encoded['attention_mask'],
        return_dict=True,
    )

access the embeddings

embeddings: torch.Tensor = outputs.norm_embeddings

cross-similarity

similarity_matrix: torch.Tensor = embeddings @ embeddings.T
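Because the embeddings are already L2-normalized, the dot product above is the cosine similarity, so documents can be ranked against a query directly from it. A minimal sketch with toy unit vectors standing in for outputs.norm_embeddings (the rank_by_similarity helper is illustrative, not part of this model's API):

```python
def rank_by_similarity(query_vec, doc_vecs):
    """Return document indices sorted by dot-product similarity, descending.
    Assumes all vectors are L2-normalized, so the dot product equals
    the cosine similarity."""
    scores = [sum(q * d for q, d in zip(query_vec, doc)) for doc in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)

query = [1.0, 0.0]
docs = [[0.0, 1.0], [0.6, 0.8], [1.0, 0.0]]
print(rank_by_similarity(query, docs))  # → [2, 1, 0]
```

In a RAG pipeline this ranking selects the top-k passages to feed to the generator; with real embeddings the same computation is a single matrix product, as in the cross-similarity snippet above.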