Model Details
Model Description
Existing embedding models are predominantly trained on publicly available datasets and often fall short in healthcare settings containing domain-specific terminology, abbreviations, and nuanced clinical language.
The miracle-english model addresses this gap. It is a domain-specific embedding model fine-tuned on translated real-world German clinical documents to enhance context retrieval when integrated into Retrieval Augmented Generation (RAG) systems for healthcare applications. To protect patient privacy, all training procedures and evaluations were conducted on pseudonymized documents.
- Model Type: Sentence Transformer
- Base model: intfloat/multilingual-e5-large
- Language: English
- Domain: Healthcare / Medical Information Retrieval
- Output Dimensionality: 1024 dimensions
Usage
Direct Usage (Sentence Transformers)
First, install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference:
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("SHIPAI-IKIM/miracle-english")
# Run inference
sentences = [
'query: What is the main diagnosis of the patient?',
'passage: The patient was admitted to the emergency department and had most recently been treated for NSCLC.',
'passage: Histology: Preoperative indication: The patient has a histologically confirmed basal cell carcinoma (BCC) at the location mentioned above, so there is now an indication for tumor excision.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
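In a RAG pipeline, embeddings like the ones above are used to rank candidate passages against the query (note the `query: ` / `passage: ` prefixes required by the underlying E5 model). The retrieval step can be sketched with plain NumPy; the toy vectors below stand in for real `model.encode(...)` output:

```python
import numpy as np

def rank_passages(query_emb, passage_embs):
    """Rank passages by cosine similarity to the query, highest first."""
    q = query_emb / np.linalg.norm(query_emb)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    scores = p @ q                # cosine similarity per passage
    order = np.argsort(-scores)   # indices sorted by descending score
    return order, scores

# Toy stand-ins for real embedding vectors
query_emb = np.array([1.0, 0.0, 0.0])
passage_embs = np.array([
    [0.9, 0.1, 0.0],  # close to the query
    [0.0, 1.0, 0.0],  # unrelated
])
order, scores = rank_passages(query_emb, passage_embs)
print(order[0])  # 0: the first passage is the best match
```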
Training Details
Training Dataset
The model was fine-tuned on a carefully curated dataset comprising translated clinical notes documented at the University Hospital Essen between 2018 and 2023.
- Document Types: The corpus included 400,000 clinical documents spanning four categories: radiology reports, discharge letters, pathology reports, and surgical operation notes.
- Synthetic Data Generation: The dataset was segmented into chunks, and the SauerkrautLM-SOLAR-Instruct Large Language Model was tasked with generating medically relevant questions alongside the correct answers contained in the chunks.
- Scale: The training data consisted of approximately 11 million synthetically generated question-answer pairs.
- Pseudonymization: Protected Health Information (PHI) in the documents was identified and replaced with surrogates utilizing a dedicated de-identification pipeline.
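The generation procedure above can be sketched as follows. This is a minimal illustration, not the actual pipeline: `generate_qa` is a hypothetical stand-in for prompting SauerkrautLM-SOLAR-Instruct, and the chunk size is illustrative.

```python
def chunk_document(text, max_words=200):
    """Split a document into fixed-size word chunks (size is illustrative)."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def generate_qa(chunk):
    """Hypothetical stand-in for an LLM call that produces a medically
    relevant question whose answer is contained in the chunk."""
    return {"query": "query: What does this report describe?",
            "passage": "passage: " + chunk}

document = " ".join(["word"] * 450)  # toy 450-word document
pairs = [generate_qa(c) for c in chunk_document(document)]
print(len(pairs))  # 3 chunks -> 3 question-answer pairs
```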
Training Hyperparameters
- Optimizer: AdamW
- Learning Rate: 2e-5
- Batch Size: 1024
- Loss Function: CachedMultipleNegativesRankingLoss with a mini-batch size of 32
- Epochs: 1 (limited to a single epoch to prevent overfitting to specific linguistic patterns generated by the LLM)
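For intuition, the core of multiple-negatives ranking loss treats every other passage in the batch as a negative: it is a softmax cross-entropy over query-passage similarity scores, with the matching passage on the diagonal. A simplified NumPy sketch (omitting the embedding caching that gives CachedMultipleNegativesRankingLoss its memory efficiency; the scale of 20 mirrors a common default):

```python
import numpy as np

def mnr_loss(query_embs, passage_embs, scale=20.0):
    """In-batch negatives ranking loss: for query i, passage i is the
    positive and all other passages in the batch act as negatives."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    scores = scale * (q @ p.T)  # (batch, batch) scaled cosine similarities
    # Softmax cross-entropy with the correct passage on the diagonal
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
embs = rng.normal(size=(4, 8))
loss_good = mnr_loss(embs, embs)                        # matched pairs
loss_bad = mnr_loss(embs, np.roll(embs, 1, axis=0))     # mismatched pairs
print(loss_good < loss_bad)  # True: matched pairs yield a lower loss
```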
Limitations and Risks
- Template Overfitting: Clinical documents from a single institution often share rigid structural templates, creating a risk that the model learns to associate relevance with institutional artifacts rather than purely semantic content.
- Document Diversity: The dataset is limited to four types of clinical documents and could be expanded to include a greater variety of medical texts.
- Synthetic Data Noise: The LLM used for data generation is susceptible to hallucinations, and a manual sample audit revealed that 18.0% of generated pairs contained hallucinations and 6.0% contained factual errors. This introduces potential noise into the training dataset.
- Clinical Verification: Any application of this model in clinical practice must include robust downstream filtering and expert human verification to identify and intercept potential retrieval errors.
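One simple downstream safeguard is a similarity threshold: discard retrieved passages whose score falls below it, so weak matches never reach the generation step. This is a sketch, not a substitute for expert verification; the threshold value is illustrative and would need tuning on held-out clinical data.

```python
def filter_retrievals(scored_passages, threshold=0.85):
    """Keep only passages whose retrieval score clears the threshold.
    An empty result signals the caller to abstain rather than answer."""
    return [(p, s) for p, s in scored_passages if s >= threshold]

hits = [("passage A", 0.91), ("passage B", 0.62), ("passage C", 0.88)]
kept = filter_retrievals(hits)
print([p for p, _ in kept])  # ['passage A', 'passage C']
```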
Citation
If you use this model, please cite the following publication:
Arzideh K., Schäfer H., Idrissi-Yaghir A. et al. "Improving Retrieval Augmented Generation for Health Care by Fine-Tuning Clinical Embedding Models: Development and Evaluation Study". Journal of Medical Internet Research, 28, e82997, doi: https://doi.org/10.2196/82997