medclinical
A semantic retrieval encoder built on top of the thomas-sounack/BioClinical-ModernBERT-large BERT encoder for RAG applications.
Trained with a contrastive learning approach on a mixed dataset of medical texts and QA pairs.
max sequence length
It is 1024 tokens. The encoder was trained on sequences with a maximum length of 1024 tokens (including 2 special tokens).
The thomas-sounack/BioClinical-ModernBERT-large encoder supports up to 8192 tokens, but I think that is longer
than typical medical texts warrant, since they often jump from one topic to another.
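Because the effective limit is 1024 tokens, longer documents should be split into windows before encoding. Below is a minimal sketch of overlapping windowing over a token id sequence; the helper name, the overlap value, and the 1022 budget (1024 minus the 2 special tokens the tokenizer adds per window) are illustrative assumptions, not part of this model's API:

```python
def split_into_windows(token_ids, max_length=1022, overlap=128):
    """Split a token id sequence into overlapping windows.

    max_length defaults to 1022 to leave room for the 2 special
    tokens the tokenizer adds to each encoded window.
    """
    if len(token_ids) <= max_length:
        return [token_ids]
    windows = []
    step = max_length - overlap
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break
    return windows
```

Each window can then be tokenized and encoded separately; the per-window embeddings can be averaged or kept as independent retrieval units.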
downloading
If you prefer to work with local copies, first download both models with the hf command or manually.
# https://huggingface.co/thomas-sounack/BioClinical-ModernBERT-large
$ mkdir -p "./work/base"
$ hf download "thomas-sounack/BioClinical-ModernBERT-large" --local-dir "./work/base/BIOCLINICAL_LARGE"
# https://huggingface.co/mazurkin/medclinical
$ mkdir -p "./work/export/trial-release"
$ hf download "mazurkin/medclinical" --local-dir "./work/export/trial-release"
usage
load the base encoder
from the local folder (first download thomas-sounack/BioClinical-ModernBERT-large to the ./work/base/BIOCLINICAL_LARGE folder):
base = transformers.AutoModel.from_pretrained(
'./work/base/BIOCLINICAL_LARGE',
trust_remote_code=True,
local_files_only=True,
)
from Hugging Face directly:
base = transformers.AutoModel.from_pretrained(
'thomas-sounack/BioClinical-ModernBERT-large',
trust_remote_code=True,
)
load the tokenizer
from the local folder (first download thomas-sounack/BioClinical-ModernBERT-large to the ./work/base/BIOCLINICAL_LARGE folder):
tokenizer = transformers.AutoTokenizer.from_pretrained(
'./work/base/BIOCLINICAL_LARGE',
local_files_only=True,
)
from Hugging Face directly:
tokenizer = transformers.AutoTokenizer.from_pretrained(
'thomas-sounack/BioClinical-ModernBERT-large',
)
load the semantic encoder
from the local folder (first download 'mazurkin/medclinical' to './work/export/trial-release'):
model = transformers.AutoModel.from_pretrained(
'./work/export/trial-release',
trust_remote_code=True,
local_files_only=True,
base_encoder=base,
)
from Hugging Face directly:
model = transformers.AutoModel.from_pretrained(
'mazurkin/medclinical',
trust_remote_code=True,
base_encoder=base,
)
tokenize
texts = [
'Type 2 diabetes mellitus with insulin resistance and metabolic syndrome.',
'Started on metformin 500mg twice daily for glycemic control in adult-onset diabetes.',
'The stock market closed higher today with tech shares leading the gains.',
]
encoded = tokenizer(
texts,
padding=True,
truncation=True,
max_length=1024, # not 8192 as 'thomas-sounack/BioClinical-ModernBERT-large' reports
return_tensors='pt',
)
compute the embeddings
with torch.inference_mode():
outputs = model(
input_ids=encoded['input_ids'],
attention_mask=encoded['attention_mask'],
return_dict=True,
)
access the embeddings
embeddings: torch.Tensor = outputs.norm_embeddings
cross-similarity
similarity_matrix: torch.Tensor = embeddings @ embeddings.T
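Since norm_embeddings are L2-normalized, the dot product above is the cosine similarity, so ranking documents against a query reduces to a matrix product plus a sort. A minimal sketch using random-free stand-in unit vectors in place of real model embeddings (the rank_documents helper is illustrative, not part of this model):

```python
import numpy as np

def rank_documents(query_embedding, document_embeddings):
    """Return document indices sorted by cosine similarity, descending.

    Both inputs are assumed L2-normalized, so the dot product
    is the cosine similarity.
    """
    scores = document_embeddings @ query_embedding
    return np.argsort(-scores), scores

# stand-in embeddings: one query and three documents on the unit circle
query = np.array([1.0, 0.0])
docs = np.array([
    [0.6, 0.8],   # moderately similar
    [1.0, 0.0],   # same direction as the query
    [-1.0, 0.0],  # opposite direction
])
order, scores = rank_documents(query, docs)
```

With real embeddings, `query` would be one row of `outputs.norm_embeddings` and `docs` the rest; the highest-scoring indices are the retrieval hits.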