SentenceTransformer based on sentence-transformers/all-MiniLM-L6-v2

This is a sentence-transformers model fine-tuned from sentence-transformers/all-MiniLM-L6-v2. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: sentence-transformers/all-MiniLM-L6-v2
  • Maximum Sequence Length: 256 tokens
  • Output Dimensionality: 384 dimensions
  • Similarity Function: Cosine Similarity

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("Gaykar/all-MiniLM-L6-medical-rag")
# Run inference
queries = [
    "What is (are) multiple sulfatase deficiency ?",
]
documents = [
    'Multiple sulfatase deficiency is a condition that mainly affects the brain, skin, and skeleton. Because the signs and symptoms of multiple sulfatase deficiency vary widely, researchers have split the condition into three types: neonatal, late-infantile, and juvenile.  The neonatal type is the most severe form, with signs and symptoms appearing soon after birth. Affected individuals have deterioration of tissue in the nervous system (leukodystrophy), which can contribute to movement problems, seizures, developmental delay, and slow growth. They also have dry, scaly skin (ichthyosis) and excess hair growth (hypertrichosis). Skeletal abnormalities can include abnormal side-to-side curvature of the spine (scoliosis), joint stiffness, and dysostosis multiplex, which refers to a specific pattern of skeletal abnormalities seen on x-ray. Individuals with the neonatal type typically have facial features that can be described as "coarse." Affected individuals may also have hearing loss, heart malformations, and an enlarged liver and spleen (hepatosplenomegaly). Many of the signs and symptoms of neonatal multiple sulfatase deficiency worsen over time.  The late-infantile type is the most common form of multiple sulfatase deficiency. It is characterized by normal cognitive development in early childhood followed by a progressive loss of mental abilities and movement (psychomotor regression) due to leukodystrophy or other brain abnormalities. Individuals with this form of the condition do not have as many features as those with the neonatal type, but they often have ichthyosis, skeletal abnormalities, and coarse facial features.  The juvenile type is the rarest form of multiple sulfatase deficiency. Signs and symptoms of the juvenile type appear in mid- to late childhood. Affected individuals have normal early cognitive development but then experience psychomotor regression; however, the regression in the juvenile type usually occurs at a slower rate than in the late-infantile type. Ichthyosis is also common in the juvenile type of multiple sulfatase deficiency.  Life expectancy is shortened in individuals with all types of multiple sulfatase deficiency. Typically, affected individuals survive only a few years after the signs and symptoms of the condition appear, but life expectancy varies depending on the severity of the condition and how quickly the neurological problems worsen.',
    'There is no cure for OPCA. The disorder is slowly progressive with death usually occurring approximately 20 years after onset.',
    'Spinal cord infarction is a stroke either within the spinal cord or the arteries that supply it. It is caused by arteriosclerosis or a thickening or closing of the major arteries to the spinal cord. Frequently spinal cord infarction is caused by a specific form of arteriosclerosis called atheromatosis, in which a deposit or accumulation of lipid-containing matter forms within the arteries. Symptoms, which generally appear within minutes or a few hours of the infarction, may include intermittent sharp or burning back pain, aching pain down through the legs, weakness in the legs, paralysis, loss of deep tendon reflexes, loss of pain and temperature sensation, and incontinence.',
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# (1, 384) (3, 384)

# Get the similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[0.7917, 0.0896, 0.0186]])
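The similarity matrix above can be turned directly into a retrieval ranking. A minimal sketch, using the scores printed by `model.similarity()` as hypothetical inputs; in a full RAG pipeline the top-k documents would then be passed to the LLM as context:

```python
import numpy as np

def rank_documents(sim_row, k=2):
    """Return (index, score) pairs for the k documents most similar to a query."""
    order = np.argsort(-sim_row)[:k]  # sort descending by similarity
    return [(int(i), round(float(sim_row[i]), 4)) for i in order]

# Query vs. the three documents, as printed above
scores = np.array([0.7917, 0.0896, 0.0186])
print(rank_documents(scores))  # [(0, 0.7917), (1, 0.0896)]
```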

Evaluation

Metrics

Information Retrieval

Metric               Value
cosine_accuracy@1    0.823
cosine_accuracy@3    0.9022
cosine_accuracy@5    0.9171
cosine_accuracy@10   0.9542
cosine_precision@1   0.823
cosine_precision@3   0.3007
cosine_precision@5   0.1834
cosine_precision@10  0.0954
cosine_recall@1      0.823
cosine_recall@3      0.9022
cosine_recall@5      0.9171
cosine_recall@10     0.9542
cosine_ndcg@10       0.889
cosine_mrr@10        0.8682
cosine_map@100       0.8708

The image below compares similarity scores from the base model and the fine-tuned model:

[Image: similarity-score comparison between the base and fine-tuned models]

If the base model gives a correct answer a score of 0.80 and a wrong one 0.75, the retriever can easily be thrown off by a small amount of noise. After fine-tuning, if the correct answer stays at 0.80 while the wrong ones drop to 0.20, you have created a large discriminative gap. This ensures:

Robustness: Even if a "negative" answer shares similar keywords, the model now knows they aren't semantically related to that specific question.

Cleaner RAG: Your LLM receives exactly the right context without "distractor" chunks that could cause hallucinations.
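This margin can be made concrete as the gap between the correct document's score and the best-scoring distractor. A small illustration (the numbers mirror the 0.80 / 0.75 / 0.20 example above; the function name is ours, not part of any library):

```python
def discriminative_gap(scores, positive_idx):
    """Margin between the positive document's score and the best negative."""
    negatives = [s for i, s in enumerate(scores) if i != positive_idx]
    return round(scores[positive_idx] - max(negatives), 4)

print(discriminative_gap([0.80, 0.75, 0.40], positive_idx=0))  # 0.05 (base model: fragile)
print(discriminative_gap([0.80, 0.20, 0.15], positive_idx=0))  # 0.6 (fine-tuned: robust)
```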

Training Details

Training Dataset

📊 Dataset Creation Pipeline

This dataset was created purely for academic and learning purposes to demonstrate skills in data collection, preprocessing, and LLM-based data generation within the medical NLP domain. ⚠️ No proprietary or copyrighted text is redistributed in raw form.

The overall pipeline consists of two stages:


1️⃣ Data Collection (Source Material)

Medical information related to brain tumors and human health was gathered from openly accessible educational and public medical resources. These sources were used only as intermediate context to generate derived question–answer pairs.

📌 Important Note

  • No textbook or website content is stored, shared, or redistributed in original form.
  • All source material was used only as temporary input to generate transformed outputs.

2️⃣ Data Formatting & Generation

To convert unstructured medical text into structured data, an LLM-assisted pipeline was implemented using LangChain and the Groq API.

Workflow

  1. Extracted medical text chunks from websites and PDFs

  2. Passed extracted text as context to an LLM

  3. Prompted the LLM to generate high-quality question–answer pairs

  4. Discarded:

    • Non-medical content
    • Questions without valid answers
    • Low-information or irrelevant text
  5. Stored only the generated Q&A pairs in JSON format
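The filtering in steps 4–5 can be sketched as a small post-processing function: parse the LLM's JSON output and keep only well-formed, non-empty question–answer pairs. The field names follow the prompt's output format; the function itself is an illustrative assumption, not the exact pipeline code:

```python
import json

def filter_qa_pairs(raw_llm_output: str):
    """Keep only valid {question, answer} pairs from the LLM's JSON output."""
    try:
        pairs = json.loads(raw_llm_output)
    except json.JSONDecodeError:
        return []  # discard malformed output entirely
    if not isinstance(pairs, list):
        return []
    kept = []
    for item in pairs:
        q = str(item.get("question", "")).strip()
        a = str(item.get("answer", "")).strip()
        if q and a:  # drop questions without valid answers
            kept.append({"question": q, "answer": a})
    return kept

raw = '[{"question": "What is a pituitary tumor?", "answer": "An abnormal growth."}, {"question": "Orphan question?", "answer": ""}]'
print(filter_qa_pairs(raw))  # only the first pair survives
```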


Prompt Design

from langchain.prompts import PromptTemplate

prompt = PromptTemplate(
    input_variables=["page_content"],
    template="""
You are a medical AI assistant.

Given the following medical text, generate high-quality question and answer pairs.
Ignore any non-medical information.
IMPORTANT: If the data only contains questions without answers, ignore it. Do not generate questions from such data.

Rules:
- Use ONLY the provided content
- Ignore sentences that do not contain meaningful medical information
- Do NOT hallucinate
- If no useful information exists, return an empty list
- Output MUST be valid JSON only

Output format:
[
  {{
    "question": "...",
    "answer": "..."
  }}
]

Medical Text:
{page_content}
""",
)
# The literal braces in the JSON example are doubled ({{ }}) so that
# PromptTemplate treats only {page_content} as a template variable.

📦 Final Dataset Format

The final dataset contains only synthesized question–answer pairs, structured as:

{
  "question": "What is a pituitary tumor?",
  "answer": "A pituitary tumor is an abnormal growth in the pituitary gland that can affect hormone production."
}
  • No raw source text
  • No copyrighted paragraphs
  • Fully transformed content

🎓 Intended Use

This project is intended to:

  • Demonstrate data extraction and preprocessing skills
  • Showcase LLM-assisted dataset generation
  • Support academic research and experimentation
  • Enable model fine-tuning and evaluation

❌ Not intended for:

  • Commercial redistribution
  • Reproducing copyrighted material
  • Clinical or diagnostic use

⚖️ Ethical & Legal Considerations

  • All source materials are either open-access or used under educational fair use
  • The dataset contains only derived, non-verbatim content
  • This repository does not claim ownership over original source materials
  • If any content is found to violate usage policies, it will be removed immediately



Unnamed Dataset

  • Size: 6,460 training samples
  • Columns: question and answer
  • Approximate statistics based on the first 1000 samples:

    Column    Type    Min tokens  Mean tokens  Max tokens
    question  string  6           14.62        43
    answer    string  3           156.11       256

  • Samples:

    Question: What type of brain tumors are children likely to have?
    Answer: Primary brain tumors

    Question: What is (are) Non 24 hour sleep wake disorder ?
    Answer: Non 24 hour sleep wake disorder refers to a steady pattern of one- to two-hour delays in sleep onset and wake times in people with normal living conditions. This occurs because the period of the person's sleep-wake cycle is longer than 24 hours. The condition most commonly affects people who are blind, due to an impaired sense of light-dark cycles. Non 24 hour sleep wake disorder can also affect sighted people. The cause of the disorder in these cases is incompletely understood, but studies suggest melatonin levels play a role.

    Question: Name two common symptoms of diphtheria.
    Answer: Slight fever and sore throat, and the development of a tough membrane in the throat.
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim",
        "gather_across_devices": false
    }
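MultipleNegativesRankingLoss treats every other answer in the batch as an in-batch negative and applies a softmax cross-entropy over scaled cosine similarities (the `scale: 20.0` above). A self-contained numerical sketch of that computation, using random embeddings rather than the model's real outputs:

```python
import numpy as np

def mnrl(query_emb, doc_emb, scale=20.0):
    """Illustrative MultipleNegativesRankingLoss: cross-entropy over scaled
    cosine similarities, where document i is the positive for query i."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    logits = scale * (q @ d.T)                     # scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))     # correct doc is the diagonal

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
# Perfectly aligned pairs give near-zero loss; unrelated pairs score much worse.
print(mnrl(q, q), mnrl(q, rng.normal(size=(4, 8))))
```

The high scale sharpens the softmax, so even modest cosine margins between the positive and the in-batch negatives translate into a near-zero loss.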
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • gradient_accumulation_steps: 2
  • learning_rate: 5e-06
  • weight_decay: 0.01
  • num_train_epochs: 6
  • warmup_ratio: 0.1
  • load_best_model_at_end: True
  • batch_sampler: no_duplicates
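The non-default settings above correspond roughly to the following training-arguments fragment. The `output_dir` is a hypothetical path; everything else is taken from the list above:

```python
from sentence_transformers.training_args import (
    SentenceTransformerTrainingArguments,
    BatchSamplers,
)

args = SentenceTransformerTrainingArguments(
    output_dir="all-MiniLM-L6-medical-rag",     # hypothetical output path
    eval_strategy="steps",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    gradient_accumulation_steps=2,              # effective batch size: 64
    learning_rate=5e-6,
    weight_decay=0.01,
    num_train_epochs=6,
    warmup_ratio=0.1,
    load_best_model_at_end=True,
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # avoids duplicate in-batch negatives
)
```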

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 2
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-06
  • weight_decay: 0.01
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 6
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: None
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch_fused
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • project: huggingface
  • trackio_space_id: trackio
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • hub_revision: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: no
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • liger_kernel_config: None
  • eval_use_gather_object: False
  • average_tokens_across_devices: True
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {}
  • learning_rate_mapping: {}

Training Logs

Epoch Step Training Loss retriever_evaluator_cosine_ndcg@10
0.0990 10 0.3503 -
0.1980 20 0.3043 -
0.2970 30 0.2295 -
0.3960 40 0.2332 -
0.4950 50 0.2588 -
0.5941 60 0.195 -
0.6931 70 0.1892 -
0.7921 80 0.2441 -
0.8911 90 0.1436 -
0.9901 100 0.121 0.8738
1.0891 110 0.1649 -
1.1881 120 0.137 -
1.2871 130 0.1231 -
1.3861 140 0.1652 -
1.4851 150 0.1249 -
1.5842 160 0.1618 -
1.6832 170 0.1747 -
1.7822 180 0.094 -
1.8812 190 0.1044 -
1.9802 200 0.0933 0.8820
2.0792 210 0.1261 -
2.1782 220 0.0988 -
2.2772 230 0.1055 -
2.3762 240 0.1023 -
2.4752 250 0.1258 -
2.5743 260 0.1259 -
2.6733 270 0.1253 -
2.7723 280 0.1362 -
2.8713 290 0.0931 -
2.9703 300 0.1152 0.8870
3.0693 310 0.0933 -
3.1683 320 0.0917 -
3.2673 330 0.1061 -
3.3663 340 0.0903 -
3.4653 350 0.0944 -
3.5644 360 0.0927 -
3.6634 370 0.0863 -
3.7624 380 0.1132 -
3.8614 390 0.1027 -
3.9604 400 0.0818 0.8876
4.0594 410 0.099 -
4.1584 420 0.1009 -
4.2574 430 0.1029 -
4.3564 440 0.1262 -
4.4554 450 0.0946 -
4.5545 460 0.0878 -
4.6535 470 0.0931 -
4.7525 480 0.0999 -
4.8515 490 0.0856 -
4.9505 500 0.0793 0.8907
5.0495 510 0.1057 -
5.1485 520 0.094 -
5.2475 530 0.1111 -
5.3465 540 0.0854 -
5.4455 550 0.1063 -
5.5446 560 0.1043 -
5.6436 570 0.0942 -
5.7426 580 0.0852 -
5.8416 590 0.0752 -
5.9406 600 0.0883 0.8890
  • The saved checkpoint is the step-500 row (epoch 4.9505), which achieved the best retriever_evaluator_cosine_ndcg@10 of 0.8907.

Framework Versions

  • Python: 3.12.12
  • Sentence Transformers: 5.2.2
  • Transformers: 4.57.6
  • PyTorch: 2.9.0+cu126
  • Accelerate: 1.12.0
  • Datasets: 4.0.0
  • Tokenizers: 0.22.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}