SentenceTransformer based on sentence-transformers/all-MiniLM-L6-v2
This is a sentence-transformers model finetuned from sentence-transformers/all-MiniLM-L6-v2. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: sentence-transformers/all-MiniLM-L6-v2
- Maximum Sequence Length: 256 tokens
- Output Dimensionality: 384 dimensions
- Similarity Function: Cosine Similarity
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
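The pipeline above can be illustrated in plain NumPy: mean pooling averages the transformer's token embeddings over non-padding positions, and the final Normalize() module scales the result to unit length, which is why cosine similarity later reduces to a dot product. A minimal sketch with random stand-in token embeddings (not the model's real weights):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "token embeddings" for one sentence: 5 tokens x 384 dims,
# standing in for the BertModel output (illustrative, not real weights).
token_embeddings = rng.normal(size=(5, 384))
attention_mask = np.array([1, 1, 1, 1, 0])  # last token is padding

# (1) Pooling: mean over non-padding tokens (pooling_mode_mean_tokens=True)
mask = attention_mask[:, None]
sentence_embedding = (token_embeddings * mask).sum(axis=0) / mask.sum()

# (2) Normalize: scale the sentence embedding to unit L2 norm
sentence_embedding /= np.linalg.norm(sentence_embedding)

# After Normalize(), the embedding has norm 1, so cosine similarity
# between two embeddings is just their dot product.
print(round(float(np.linalg.norm(sentence_embedding)), 6))  # 1.0
```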
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("Gaykar/all-MiniLM-L6-medical-rag")
# Run inference
queries = [
    "What is (are) multiple sulfatase deficiency ?",
]
documents = [
    'Multiple sulfatase deficiency is a condition that mainly affects the brain, skin, and skeleton. Because the signs and symptoms of multiple sulfatase deficiency vary widely, researchers have split the condition into three types: neonatal, late-infantile, and juvenile. The neonatal type is the most severe form, with signs and symptoms appearing soon after birth. Affected individuals have deterioration of tissue in the nervous system (leukodystrophy), which can contribute to movement problems, seizures, developmental delay, and slow growth. They also have dry, scaly skin (ichthyosis) and excess hair growth (hypertrichosis). Skeletal abnormalities can include abnormal side-to-side curvature of the spine (scoliosis), joint stiffness, and dysostosis multiplex, which refers to a specific pattern of skeletal abnormalities seen on x-ray. Individuals with the neonatal type typically have facial features that can be described as "coarse." Affected individuals may also have hearing loss, heart malformations, and an enlarged liver and spleen (hepatosplenomegaly). Many of the signs and symptoms of neonatal multiple sulfatase deficiency worsen over time. The late-infantile type is the most common form of multiple sulfatase deficiency. It is characterized by normal cognitive development in early childhood followed by a progressive loss of mental abilities and movement (psychomotor regression) due to leukodystrophy or other brain abnormalities. Individuals with this form of the condition do not have as many features as those with the neonatal type, but they often have ichthyosis, skeletal abnormalities, and coarse facial features. The juvenile type is the rarest form of multiple sulfatase deficiency. Signs and symptoms of the juvenile type appear in mid- to late childhood. Affected individuals have normal early cognitive development but then experience psychomotor regression; however, the regression in the juvenile type usually occurs at a slower rate than in the late-infantile type. Ichthyosis is also common in the juvenile type of multiple sulfatase deficiency. Life expectancy is shortened in individuals with all types of multiple sulfatase deficiency. Typically, affected individuals survive only a few years after the signs and symptoms of the condition appear, but life expectancy varies depending on the severity of the condition and how quickly the neurological problems worsen.',
    'There is no cure for OPCA. The disorder is slowly progressive with death usually occurring approximately 20 years after onset.',
    'Spinal cord infarction is a stroke either within the spinal cord or the arteries that supply it. It is caused by arteriosclerosis or a thickening or closing of the major arteries to the spinal cord. Frequently spinal cord infarction is caused by a specific form of arteriosclerosis called atheromatosis, in which a deposit or accumulation of lipid-containing matter forms within the arteries. Symptoms, which generally appear within minutes or a few hours of the infarction, may include intermittent sharp or burning back pain, aching pain down through the legs, weakness in the legs, paralysis, loss of deep tendon reflexes, loss of pain and temperature sensation, and incontinence.',
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# [1, 384] [3, 384]
# Get the similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[0.7917, 0.0896, 0.0186]])
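In a retrieval setting, picking the best chunk for a single query is an argmax over its similarity row. Using the scores printed above:

```python
import numpy as np

# Similarity scores printed above: one query vs. three documents
similarities = np.array([[0.7917, 0.0896, 0.0186]])

best = int(similarities[0].argmax())
print(best)  # 0 -> the multiple sulfatase deficiency passage
```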
Evaluation
Metrics
Information Retrieval
- Dataset: retriever_evaluator
- Evaluated with: InformationRetrievalEvaluator
| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.823 |
| cosine_accuracy@3 | 0.9022 |
| cosine_accuracy@5 | 0.9171 |
| cosine_accuracy@10 | 0.9542 |
| cosine_precision@1 | 0.823 |
| cosine_precision@3 | 0.3007 |
| cosine_precision@5 | 0.1834 |
| cosine_precision@10 | 0.0954 |
| cosine_recall@1 | 0.823 |
| cosine_recall@3 | 0.9022 |
| cosine_recall@5 | 0.9171 |
| cosine_recall@10 | 0.9542 |
| cosine_ndcg@10 | 0.889 |
| cosine_mrr@10 | 0.8682 |
| cosine_map@100 | 0.8708 |
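These metrics can be reproduced from ranked retrieval results. A minimal sketch of accuracy@k and MRR@10 for the common case of a single relevant document per query (which matches how this dataset pairs each question with one answer); the function names and toy ranks are illustrative, not the real evaluation run:

```python
def accuracy_at_k(ranks, k):
    """Fraction of queries whose relevant document appears in the top k.

    `ranks` holds the 1-based rank of the single relevant document
    for each query.
    """
    return sum(r <= k for r in ranks) / len(ranks)

def mrr_at_10(ranks):
    """Mean reciprocal rank, counting only hits within the top 10."""
    return sum(1.0 / r if r <= 10 else 0.0 for r in ranks) / len(ranks)

ranks = [1, 1, 3, 2, 12]  # hypothetical ranks for five queries
print(accuracy_at_k(ranks, 1))  # 0.4
print(accuracy_at_k(ranks, 3))  # 0.8
print(mrr_at_10(ranks))
```

With one relevant document per query, accuracy@k and recall@k coincide, which is why those rows in the table above are identical.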
The difference between the base model and the fine-tuned model comes down to how widely scores are separated. If the base model gives a 0.80 to the correct answer and a 0.75 to a wrong one, the retriever can be thrown off by a small amount of noise. In the fine-tuned model, if the correct answer stays at 0.80 while the wrong ones drop to 0.20, there is a large discriminative gap. This ensures:
- Robustness: even if a "negative" chunk shares keywords with the question, the model has learned that it is not semantically related to that specific question.
- Cleaner RAG: the LLM receives exactly the right context, without "distractor" chunks that could encourage hallucinations.
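The gap can be measured directly as the margin between the positive score and the hardest negative. A small sketch using the illustrative numbers above (not measured values):

```python
def discriminative_gap(pos_score, neg_scores):
    """Margin between the correct chunk and the closest distractor."""
    return pos_score - max(neg_scores)

# Illustrative scores from the paragraph above
base_gap = discriminative_gap(0.80, [0.75])   # base model
tuned_gap = discriminative_gap(0.80, [0.20])  # fine-tuned model
print(round(base_gap, 2), round(tuned_gap, 2))  # 0.05 0.6
```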
Training Details
Training Dataset
📊 Dataset Creation Pipeline
This dataset was created purely for academic and learning purposes to demonstrate skills in data collection, preprocessing, and LLM-based data generation within the medical NLP domain. ⚠️ No proprietary or copyrighted text is redistributed in raw form.
The overall pipeline consists of two stages:
1️⃣ Data Collection (Source Material)
Medical information related to brain tumors and human health was gathered from openly accessible educational and public medical resources. These sources were used only as intermediate context to generate derived question–answer pairs.
Sources Used
Public medical websites
- Example: Mayo Clinic
- Used only to understand structure and terminology (no verbatim content stored)
Educational textbooks (open-access PDFs)
- NCERT Biology (Human Health and Disease)
- NIOS Senior Secondary Biology
Open-source QA dataset
- MedQuAD – Medical Question Answer Dataset
- https://huggingface.co/datasets/keivalya/MedQuad-MedicalQnADataset
- ~6000 samples (already structured as question–answer pairs)
📌 Important Note
- No textbook or website content is stored, shared, or redistributed in original form.
- All source material was used only as temporary input to generate transformed outputs.
2️⃣ Data Formatting & Generation
To convert unstructured medical text into structured data, an LLM-assisted pipeline was implemented using LangChain and the Groq API.
Workflow
1. Extracted medical text chunks from websites and PDFs
2. Passed the extracted text as context to an LLM
3. Prompted the LLM to generate high-quality question–answer pairs
4. Discarded:
   - Non-medical content
   - Questions without valid answers
   - Low-information or irrelevant text
5. Stored only the generated Q&A pairs in JSON format
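The discard step can be sketched as a small post-processing filter over the LLM's JSON output. The field names follow the prompt's output format; the helper name and sample string are illustrative:

```python
import json

def filter_qa_pairs(raw_llm_output: str) -> list[dict]:
    """Keep only well-formed Q&A pairs; drop items with missing answers."""
    try:
        pairs = json.loads(raw_llm_output)
    except json.JSONDecodeError:
        return []  # malformed LLM output is discarded entirely
    kept = []
    for item in pairs:
        question = (item.get("question") or "").strip()
        answer = (item.get("answer") or "").strip()
        if question and answer:  # discard questions without valid answers
            kept.append({"question": question, "answer": answer})
    return kept

raw = ('[{"question": "What is a glioma?", '
       '"answer": "A tumor arising from glial cells."}, '
       '{"question": "Q with no answer", "answer": ""}]')
print(len(filter_qa_pairs(raw)))  # 1
```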
Prompt Design
from langchain.prompts import PromptTemplate

prompt = PromptTemplate(
    input_variables=["page_content"],
    # Literal braces in the JSON example below are doubled ({{ }}) so that
    # PromptTemplate does not treat them as template variables.
    template="""
You are a medical AI assistant.
Given the following medical text, generate high-quality question and answer pairs.
Ignore any non-medical information.
IMP: Ignore if the data only contains questions without answers. Do not generate questions from such data.
Rules:
- Use ONLY the provided content
- Ignore sentences that do not contain meaningful medical information
- Do NOT hallucinate
- If no useful information exists, return an empty list
- Output MUST be valid JSON only
Output format:
[
  {{
    "question": "...",
    "answer": "..."
  }}
]
Medical Text:
{page_content}
""",
)
📦 Final Dataset Format
The final dataset contains only synthesized question–answer pairs, structured as:
{
  "question": "What is a pituitary tumor?",
  "answer": "A pituitary tumor is an abnormal growth in the pituitary gland that can affect hormone production."
}
- No raw source text
- No copyrighted paragraphs
- Fully transformed content
🎓 Intended Use
This project is intended to:
- Demonstrate data extraction and preprocessing skills
- Showcase LLM-assisted dataset generation
- Support academic research and experimentation
- Enable model fine-tuning and evaluation
❌ Not intended for:
- Commercial redistribution
- Reproducing copyrighted material
- Clinical or diagnostic use
⚖️ Ethical & Legal Considerations
- All source materials are either open-access or used under educational fair use
- The dataset contains only derived, non-verbatim content
- This repository does not claim ownership over original source materials
- If any content is found to violate usage policies, it will be removed immediately
Unnamed Dataset
- Size: 6,460 training samples
- Columns: question and answer
- Approximate statistics based on the first 1000 samples:

|  | question | answer |
|---|---|---|
| type | string | string |
| details | min: 6 tokens<br>mean: 14.62 tokens<br>max: 43 tokens | min: 3 tokens<br>mean: 156.11 tokens<br>max: 256 tokens |

- Samples:

| question | answer |
|---|---|
| What type of brain tumors are children likely to have? | Primary brain tumors |
| What is (are) Non 24 hour sleep wake disorder ? | Non 24 hour sleep wake disorder refers to a steady pattern of one- to two-hour delays in sleep onset and wake times in people with normal living conditions. This occurs because the period of the person's sleep-wake cycle is longer than 24 hours. The condition most commonly affects people who are blind, due to an impaired sense of light-dark cycles. Non 24 hour sleep wake disorder can also affect sighted people. The cause of the disorder in these cases is incompletely understood, but studies suggest melatonin levels play a role. |
| Name two common symptoms of diphtheria. | Slight fever and sore throat, and the development of a tough membrane in the throat. |

- Loss: MultipleNegativesRankingLoss with these parameters: { "scale": 20.0, "similarity_fct": "cos_sim", "gather_across_devices": false }
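MultipleNegativesRankingLoss treats each (question, answer) pair in a batch as a positive and every other answer in the same batch as a negative: cosine similarities are scaled and fed through a cross-entropy over the batch. A NumPy sketch of the core computation (not the library implementation; toy embeddings only):

```python
import numpy as np

def multiple_negatives_ranking_loss(q, p, scale=20.0):
    """In-batch-negatives loss sketch: row i of `p` is the positive for
    query i; every other row of `p` serves as a negative."""
    # Cosine similarity matrix (similarity_fct = "cos_sim"), scaled
    qn = q / np.linalg.norm(q, axis=1, keepdims=True)
    pn = p / np.linalg.norm(p, axis=1, keepdims=True)
    scores = qn @ pn.T * scale                    # shape (batch, batch)
    # Cross entropy with labels [0, 1, ..., batch-1] (the diagonal)
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))                    # toy query embeddings
positives = queries + 0.1 * rng.normal(size=(4, 8))  # matching answers
mismatched = rng.normal(size=(4, 8))                 # unrelated "answers"

loss_matched = multiple_negatives_ranking_loss(queries, positives)
loss_mismatched = multiple_negatives_ranking_loss(queries, mismatched)
print(loss_matched < loss_mismatched)
```

Matched pairs put the large similarities on the diagonal, so their loss is far lower than for mismatched pairs, which is exactly the pressure that widens the discriminative gap discussed above.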
Training Hyperparameters
Non-Default Hyperparameters
- eval_strategy: steps
- per_device_train_batch_size: 32
- per_device_eval_batch_size: 32
- gradient_accumulation_steps: 2
- learning_rate: 5e-06
- weight_decay: 0.01
- num_train_epochs: 6
- warmup_ratio: 0.1
- load_best_model_at_end: True
- batch_sampler: no_duplicates
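These settings imply the schedule seen in the training logs below. A quick sanity check of the approximate step counts, using the dataset size from the Training Dataset section (the exact count can shift slightly because of the no_duplicates batch sampler):

```python
import math

num_samples = 6460                 # training samples
per_device_train_batch_size = 32
gradient_accumulation_steps = 2
num_train_epochs = 6
warmup_ratio = 0.1

effective_batch = per_device_train_batch_size * gradient_accumulation_steps
steps_per_epoch = math.ceil(num_samples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)  # roughly 10% of training

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
# 64 101 606 61
```

This matches the logs: about 101 optimizer steps per epoch, ending just past step 600 at epoch 5.94.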
All Hyperparameters
Click to expand
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 32
- per_device_eval_batch_size: 32
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 2
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-06
- weight_decay: 0.01
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 6
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: None
- warmup_ratio: 0.1
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- bf16: False
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: True
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- parallelism_config: None
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch_fused
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- project: huggingface
- trackio_space_id: trackio
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- hub_revision: None
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: no
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- liger_kernel_config: None
- eval_use_gather_object: False
- average_tokens_across_devices: True
- prompts: None
- batch_sampler: no_duplicates
- multi_dataset_batch_sampler: proportional
- router_mapping: {}
- learning_rate_mapping: {}
Training Logs
| Epoch | Step | Training Loss | retriever_evaluator_cosine_ndcg@10 |
|---|---|---|---|
| 0.0990 | 10 | 0.3503 | - |
| 0.1980 | 20 | 0.3043 | - |
| 0.2970 | 30 | 0.2295 | - |
| 0.3960 | 40 | 0.2332 | - |
| 0.4950 | 50 | 0.2588 | - |
| 0.5941 | 60 | 0.195 | - |
| 0.6931 | 70 | 0.1892 | - |
| 0.7921 | 80 | 0.2441 | - |
| 0.8911 | 90 | 0.1436 | - |
| 0.9901 | 100 | 0.121 | 0.8738 |
| 1.0891 | 110 | 0.1649 | - |
| 1.1881 | 120 | 0.137 | - |
| 1.2871 | 130 | 0.1231 | - |
| 1.3861 | 140 | 0.1652 | - |
| 1.4851 | 150 | 0.1249 | - |
| 1.5842 | 160 | 0.1618 | - |
| 1.6832 | 170 | 0.1747 | - |
| 1.7822 | 180 | 0.094 | - |
| 1.8812 | 190 | 0.1044 | - |
| 1.9802 | 200 | 0.0933 | 0.8820 |
| 2.0792 | 210 | 0.1261 | - |
| 2.1782 | 220 | 0.0988 | - |
| 2.2772 | 230 | 0.1055 | - |
| 2.3762 | 240 | 0.1023 | - |
| 2.4752 | 250 | 0.1258 | - |
| 2.5743 | 260 | 0.1259 | - |
| 2.6733 | 270 | 0.1253 | - |
| 2.7723 | 280 | 0.1362 | - |
| 2.8713 | 290 | 0.0931 | - |
| 2.9703 | 300 | 0.1152 | 0.8870 |
| 3.0693 | 310 | 0.0933 | - |
| 3.1683 | 320 | 0.0917 | - |
| 3.2673 | 330 | 0.1061 | - |
| 3.3663 | 340 | 0.0903 | - |
| 3.4653 | 350 | 0.0944 | - |
| 3.5644 | 360 | 0.0927 | - |
| 3.6634 | 370 | 0.0863 | - |
| 3.7624 | 380 | 0.1132 | - |
| 3.8614 | 390 | 0.1027 | - |
| 3.9604 | 400 | 0.0818 | 0.8876 |
| 4.0594 | 410 | 0.099 | - |
| 4.1584 | 420 | 0.1009 | - |
| 4.2574 | 430 | 0.1029 | - |
| 4.3564 | 440 | 0.1262 | - |
| 4.4554 | 450 | 0.0946 | - |
| 4.5545 | 460 | 0.0878 | - |
| 4.6535 | 470 | 0.0931 | - |
| 4.7525 | 480 | 0.0999 | - |
| 4.8515 | 490 | 0.0856 | - |
| **4.9505** | **500** | **0.0793** | **0.8907** |
| 5.0495 | 510 | 0.1057 | - |
| 5.1485 | 520 | 0.094 | - |
| 5.2475 | 530 | 0.1111 | - |
| 5.3465 | 540 | 0.0854 | - |
| 5.4455 | 550 | 0.1063 | - |
| 5.5446 | 560 | 0.1043 | - |
| 5.6436 | 570 | 0.0942 | - |
| 5.7426 | 580 | 0.0852 | - |
| 5.8416 | 590 | 0.0752 | - |
| 5.9406 | 600 | 0.0883 | 0.8890 |
- The bold row denotes the saved checkpoint.
Framework Versions
- Python: 3.12.12
- Sentence Transformers: 5.2.2
- Transformers: 4.57.6
- PyTorch: 2.9.0+cu126
- Accelerate: 1.12.0
- Datasets: 4.0.0
- Tokenizers: 0.22.2
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
MultipleNegativesRankingLoss
@misc{henderson2017efficient,
    title = {Efficient Natural Language Response Suggestion for Smart Reply},
    author = {Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year = {2017},
    eprint = {1705.00652},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}