Model Details
Model Description
Existing embedding models are predominantly trained on publicly available datasets and often fall short in healthcare settings containing domain-specific terminology, abbreviations, and nuanced clinical language.
The miracle-english model addresses this gap. It is a domain-specific embedding model fine-tuned on translated real-world German clinical documents to enhance context retrieval when integrated into Retrieval Augmented Generation (RAG) systems for healthcare applications. To protect patient privacy, all training procedures and evaluations were conducted on pseudonymized documents.
- Model Type: Sentence Transformer
- Base model: intfloat/multilingual-e5-large
- Language: English
- Domain: Healthcare / Medical Information Retrieval
- Output Dimensionality: 1024 dimensions
Usage
Direct Usage (Sentence Transformers)
First, install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference:
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("SHIPAI-IKIM/miracle-english")
# Run inference
sentences = [
'query: What is the main diagnosis of the patient?',
'passage: The patient was admitted to the emergency department and had most recently been treated for NSCLC.',
'passage: Histology: Preoperative indication: The patient has a histologically confirmed basal cell carcinoma (BCC) at the location mentioned above, so there is now an indication for tumor excision.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
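In a RAG pipeline, embeddings like the ones above are used to rank candidate passages against the query (note the `query: ` / `passage: ` prefixes required by the underlying E5 model). The retrieval step can be sketched with plain NumPy; the toy vectors below stand in for real `model.encode(...)` output:

```python
import numpy as np

def rank_passages(query_emb, passage_embs):
    """Rank passages by cosine similarity to the query, highest first."""
    q = query_emb / np.linalg.norm(query_emb)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    scores = p @ q                # cosine similarity per passage
    order = np.argsort(-scores)   # indices sorted by descending score
    return order, scores

# Toy stand-ins for real embedding vectors
query_emb = np.array([1.0, 0.0, 0.0])
passage_embs = np.array([
    [0.9, 0.1, 0.0],  # close to the query
    [0.0, 1.0, 0.0],  # unrelated
])
order, scores = rank_passages(query_emb, passage_embs)
print(order[0])  # 0: the first passage is the best match
```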
Training Details
Training Dataset
The model was fine-tuned on a carefully curated dataset comprising translated clinical notes documented at the University Hospital Essen between 2018 and 2023.
- Document Types: The corpus included 400,000 clinical documents spanning four categories: radiology reports, discharge letters, pathology reports, and surgical operation notes.
- Synthetic Data Generation: The dataset was segmented into chunks, and the SauerkrautLM-SOLAR-Instruct Large Language Model was tasked with generating medically relevant questions alongside the correct answers contained in the chunks.
- Scale: The training data consisted of approximately 11 million synthetically generated question-answer pairs.
- Pseudonymization: Protected Health Information (PHI) in the documents was identified and replaced with surrogates utilizing a dedicated de-identification pipeline.
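The generation procedure above can be sketched as follows. This is a minimal illustration, not the actual pipeline: `generate_qa` is a hypothetical stand-in for prompting SauerkrautLM-SOLAR-Instruct, and the chunk size is illustrative.

```python
def chunk_document(text, max_words=200):
    """Split a document into fixed-size word chunks (size is illustrative)."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def generate_qa(chunk):
    """Hypothetical stand-in for an LLM call that produces a medically
    relevant question whose answer is contained in the chunk."""
    return {"query": "query: What does this report describe?",
            "passage": "passage: " + chunk}

document = " ".join(["word"] * 450)  # toy 450-word document
pairs = [generate_qa(c) for c in chunk_document(document)]
print(len(pairs))  # 3 chunks -> 3 question-answer pairs
```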
Training Hyperparameters
- Optimizer: AdamW
- Learning Rate: 2e-5
- Batch Size: 1024
- Loss Function: CachedMultipleNegativesRankingLoss with a mini-batch size of 32
- Epochs: 1 (limited to a single epoch to prevent overfitting to specific linguistic patterns generated by the LLM)
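For intuition, the core of multiple-negatives ranking loss treats every other passage in the batch as a negative: it is a softmax cross-entropy over query-passage similarity scores, with the matching passage on the diagonal. A simplified NumPy sketch (omitting the embedding caching that gives CachedMultipleNegativesRankingLoss its memory efficiency; the scale of 20 mirrors a common default):

```python
import numpy as np

def mnr_loss(query_embs, passage_embs, scale=20.0):
    """In-batch negatives ranking loss: for query i, passage i is the
    positive and all other passages in the batch act as negatives."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    scores = scale * (q @ p.T)  # (batch, batch) scaled cosine similarities
    # Softmax cross-entropy with the correct passage on the diagonal
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
embs = rng.normal(size=(4, 8))
loss_good = mnr_loss(embs, embs)                        # matched pairs
loss_bad = mnr_loss(embs, np.roll(embs, 1, axis=0))     # mismatched pairs
print(loss_good < loss_bad)  # True: matched pairs yield a lower loss
```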
Limitations and Risks
- Template Overfitting: Clinical documents from a single institution often share rigid structural templates, creating a risk that the model learns to associate relevance with institutional artifacts rather than purely semantic content.
- Document Diversity: The dataset is limited to four types of clinical documents and could be expanded to include a greater variety of medical texts.
- Synthetic Data Noise: The LLM used for data generation is susceptible to hallucinations, and a manual sample audit revealed that 18.0% of generated pairs contained hallucinations and 6.0% contained factual errors. This introduces potential noise into the training dataset.
- Clinical Verification: Any application of this model in clinical practice must include robust downstream filtering and expert human verification to identify and intercept potential retrieval errors.
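One simple downstream safeguard is a similarity threshold: discard retrieved passages whose score falls below it, so weak matches never reach the generation step. This is a sketch, not a substitute for expert verification; the threshold value is illustrative and would need tuning on held-out clinical data.

```python
def filter_retrievals(scored_passages, threshold=0.85):
    """Keep only passages whose retrieval score clears the threshold.
    An empty result signals the caller to abstain rather than answer."""
    return [(p, s) for p, s in scored_passages if s >= threshold]

hits = [("passage A", 0.91), ("passage B", 0.62), ("passage C", 0.88)]
kept = filter_retrievals(hits)
print([p for p, _ in kept])  # ['passage A', 'passage C']
```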
Citation
If you use this model, please cite the following publication:
Arzideh K., Schäfer H., Idrissi-Yaghir A. et al. "Improving Retrieval Augmented Generation for Health Care by Fine-Tuning Clinical Embedding Models: Development and Evaluation Study". Journal of Medical Internet Research, 28, e82997, doi: https://doi.org/10.2196/82997