SentenceTransformer based on sentence-transformers/all-MiniLM-L6-v2

This is a sentence-transformers model fine-tuned from sentence-transformers/all-MiniLM-L6-v2. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: sentence-transformers/all-MiniLM-L6-v2
  • Maximum Sequence Length: 256 tokens
  • Output Dimensionality: 384 dimensions
  • Similarity Function: Cosine Similarity

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("Gaykar/all-MiniLM-L6-medical-rag")
# Run inference
queries = [
    "What is (are) multiple sulfatase deficiency ?",
]
documents = [
    'Multiple sulfatase deficiency is a condition that mainly affects the brain, skin, and skeleton. Because the signs and symptoms of multiple sulfatase deficiency vary widely, researchers have split the condition into three types: neonatal, late-infantile, and juvenile.  The neonatal type is the most severe form, with signs and symptoms appearing soon after birth. Affected individuals have deterioration of tissue in the nervous system (leukodystrophy), which can contribute to movement problems, seizures, developmental delay, and slow growth. They also have dry, scaly skin (ichthyosis) and excess hair growth (hypertrichosis). Skeletal abnormalities can include abnormal side-to-side curvature of the spine (scoliosis), joint stiffness, and dysostosis multiplex, which refers to a specific pattern of skeletal abnormalities seen on x-ray. Individuals with the neonatal type typically have facial features that can be described as "coarse." Affected individuals may also have hearing loss, heart malformations, and an enlarged liver and spleen (hepatosplenomegaly). Many of the signs and symptoms of neonatal multiple sulfatase deficiency worsen over time.  The late-infantile type is the most common form of multiple sulfatase deficiency. It is characterized by normal cognitive development in early childhood followed by a progressive loss of mental abilities and movement (psychomotor regression) due to leukodystrophy or other brain abnormalities. Individuals with this form of the condition do not have as many features as those with the neonatal type, but they often have ichthyosis, skeletal abnormalities, and coarse facial features.  The juvenile type is the rarest form of multiple sulfatase deficiency. Signs and symptoms of the juvenile type appear in mid- to late childhood. Affected individuals have normal early cognitive development but then experience psychomotor regression; however, the regression in the juvenile type usually occurs at a slower rate than in the late-infantile type. Ichthyosis is also common in the juvenile type of multiple sulfatase deficiency.  Life expectancy is shortened in individuals with all types of multiple sulfatase deficiency. Typically, affected individuals survive only a few years after the signs and symptoms of the condition appear, but life expectancy varies depending on the severity of the condition and how quickly the neurological problems worsen.',
    'There is no cure for OPCA. The disorder is slowly progressive with death usually occurring approximately 20 years after onset.',
    'Spinal cord infarction is a stroke either within the spinal cord or the arteries that supply it. It is caused by arteriosclerosis or a thickening or closing of the major arteries to the spinal cord. Frequently spinal cord infarction is caused by a specific form of arteriosclerosis called atheromatosis, in which a deposit or accumulation of lipid-containing matter forms within the arteries. Symptoms, which generally appear within minutes or a few hours of the infarction, may include intermittent sharp or burning back pain, aching pain down through the legs, weakness in the legs, paralysis, loss of deep tendon reflexes, loss of pain and temperature sensation, and incontinence.',
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# (1, 384) (3, 384)

# Get the similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[0.7917, 0.0896, 0.0186]])
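The similarity matrix above can be turned directly into a retrieval ranking. A minimal sketch, using the scores printed by `model.similarity()` as hypothetical inputs; in a full RAG pipeline the top-k documents would then be passed to the LLM as context:

```python
import numpy as np

def rank_documents(sim_row, k=2):
    """Return (index, score) pairs for the k documents most similar to a query."""
    order = np.argsort(-sim_row)[:k]  # sort descending by similarity
    return [(int(i), round(float(sim_row[i]), 4)) for i in order]

# Query vs. the three documents, as printed above
scores = np.array([0.7917, 0.0896, 0.0186])
print(rank_documents(scores))  # [(0, 0.7917), (1, 0.0896)]
```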

Evaluation

Metrics

Information Retrieval

Metric               Value
cosine_accuracy@1    0.823
cosine_accuracy@3    0.9022
cosine_accuracy@5    0.9171
cosine_accuracy@10   0.9542
cosine_precision@1   0.823
cosine_precision@3   0.3007
cosine_precision@5   0.1834
cosine_precision@10  0.0954
cosine_recall@1      0.823
cosine_recall@3      0.9022
cosine_recall@5      0.9171
cosine_recall@10     0.9542
cosine_ndcg@10       0.889
cosine_mrr@10        0.8682
cosine_map@100       0.8708

The image below compares similarity scores from the base model and the fine-tuned model:

[Image: similarity-score comparison between the base and fine-tuned models]

If the base model gives a correct answer a score of 0.80 and a wrong one 0.75, the retriever can easily be thrown off by a small amount of noise. After fine-tuning, if the correct answer stays at 0.80 while the wrong ones drop to 0.20, you have created a large discriminative gap. This ensures:

Robustness: Even if a "negative" answer shares similar keywords, the model now knows they aren't semantically related to that specific question.

Cleaner RAG: Your LLM receives exactly the right context without "distractor" chunks that could cause hallucinations.
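This margin can be made concrete as the gap between the correct document's score and the best-scoring distractor. A small illustration (the numbers mirror the 0.80 / 0.75 / 0.20 example above; the function name is ours, not part of any library):

```python
def discriminative_gap(scores, positive_idx):
    """Margin between the positive document's score and the best negative."""
    negatives = [s for i, s in enumerate(scores) if i != positive_idx]
    return round(scores[positive_idx] - max(negatives), 4)

print(discriminative_gap([0.80, 0.75, 0.40], positive_idx=0))  # 0.05 (base model: fragile)
print(discriminative_gap([0.80, 0.20, 0.15], positive_idx=0))  # 0.6 (fine-tuned: robust)
```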

Training Details

Training Dataset

📊 Dataset Creation Pipeline

This dataset was created purely for academic and learning purposes to demonstrate skills in data collection, preprocessing, and LLM-based data generation within the medical NLP domain. ⚠️ No proprietary or copyrighted text is redistributed in raw form.

The overall pipeline consists of two stages:


1️⃣ Data Collection (Source Material)

Medical information related to brain tumors and human health was gathered from openly accessible educational and public medical resources. These sources were used only as intermediate context to generate derived question–answer pairs.

📌 Important Note

  • No textbook or website content is stored, shared, or redistributed in original form.
  • All source material was used only as temporary input to generate transformed outputs.

2️⃣ Data Formatting & Generation

To convert unstructured medical text into structured data, an LLM-assisted pipeline was implemented using LangChain and the Groq API.

Workflow

  1. Extracted medical text chunks from websites and PDFs

  2. Passed extracted text as context to an LLM

  3. Prompted the LLM to generate high-quality question–answer pairs

  4. Discarded:

    • Non-medical content
    • Questions without valid answers
    • Low-information or irrelevant text
  5. Stored only the generated Q&A pairs in JSON format
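The filtering in steps 4–5 can be sketched as a small post-processing function: parse the LLM's JSON output and keep only well-formed, non-empty question–answer pairs. The field names follow the prompt's output format; the function itself is an illustrative assumption, not the exact pipeline code:

```python
import json

def filter_qa_pairs(raw_llm_output: str):
    """Keep only valid {question, answer} pairs from the LLM's JSON output."""
    try:
        pairs = json.loads(raw_llm_output)
    except json.JSONDecodeError:
        return []  # discard malformed output entirely
    if not isinstance(pairs, list):
        return []
    kept = []
    for item in pairs:
        q = str(item.get("question", "")).strip()
        a = str(item.get("answer", "")).strip()
        if q and a:  # drop questions without valid answers
            kept.append({"question": q, "answer": a})
    return kept

raw = '[{"question": "What is a pituitary tumor?", "answer": "An abnormal growth."}, {"question": "Orphan question?", "answer": ""}]'
print(filter_qa_pairs(raw))  # only the first pair survives
```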


Prompt Design

from langchain.prompts import PromptTemplate

prompt = PromptTemplate(
    input_variables=["page_content"],
    template="""
You are a medical AI assistant.

Given the following medical text, generate high-quality question and answer pairs.
Ignore any non-medical information.
IMPORTANT: If the data only contains questions without answers, ignore it. Do not generate questions from such data.

Rules:
- Use ONLY the provided content
- Ignore sentences that do not contain meaningful medical information
- Do NOT hallucinate
- If no useful information exists, return an empty list
- Output MUST be valid JSON only

Output format:
[
  {{
    "question": "...",
    "answer": "..."
  }}
]

Medical Text:
{page_content}
""",
)
# The literal braces in the JSON example are doubled ({{ }}) so that
# PromptTemplate treats only {page_content} as a template variable.

📦 Final Dataset Format

The final dataset contains only synthesized question–answer pairs, structured as:

{
  "question": "What is a pituitary tumor?",
  "answer": "A pituitary tumor is an abnormal growth in the pituitary gland that can affect hormone production."
}
  • No raw source text
  • No copyrighted paragraphs
  • Fully transformed content

🎓 Intended Use

This project is intended to:

  • Demonstrate data extraction and preprocessing skills
  • Showcase LLM-assisted dataset generation
  • Support academic research and experimentation
  • Enable model fine-tuning and evaluation

❌ Not intended for:

  • Commercial redistribution
  • Reproducing copyrighted material
  • Clinical or diagnostic use

⚖️ Ethical & Legal Considerations

  • All source materials are either open-access or used under educational fair use
  • The dataset contains only derived, non-verbatim content
  • This repository does not claim ownership over original source materials
  • If any content is found to violate usage policies, it will be removed immediately



Unnamed Dataset

  • Size: 6,460 training samples
  • Columns: question and answer
  • Approximate statistics based on the first 1000 samples:

    Column    Type    Min tokens  Mean tokens  Max tokens
    question  string  6           14.62        43
    answer    string  3           156.11       256

  • Samples:

    Question: What type of brain tumors are children likely to have?
    Answer: Primary brain tumors

    Question: What is (are) Non 24 hour sleep wake disorder ?
    Answer: Non 24 hour sleep wake disorder refers to a steady pattern of one- to two-hour delays in sleep onset and wake times in people with normal living conditions. This occurs because the period of the person's sleep-wake cycle is longer than 24 hours. The condition most commonly affects people who are blind, due to an impaired sense of light-dark cycles. Non 24 hour sleep wake disorder can also affect sighted people. The cause of the disorder in these cases is incompletely understood, but studies suggest melatonin levels play a role.

    Question: Name two common symptoms of diphtheria.
    Answer: Slight fever and sore throat, and the development of a tough membrane in the throat.
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim",
        "gather_across_devices": false
    }
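MultipleNegativesRankingLoss treats every other answer in the batch as an in-batch negative and applies a softmax cross-entropy over scaled cosine similarities (the `scale: 20.0` above). A self-contained numerical sketch of that computation, using random embeddings rather than the model's real outputs:

```python
import numpy as np

def mnrl(query_emb, doc_emb, scale=20.0):
    """Illustrative MultipleNegativesRankingLoss: cross-entropy over scaled
    cosine similarities, where document i is the positive for query i."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    logits = scale * (q @ d.T)                     # scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))     # correct doc is the diagonal

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
# Perfectly aligned pairs give near-zero loss; unrelated pairs score much worse.
print(mnrl(q, q), mnrl(q, rng.normal(size=(4, 8))))
```

The high scale sharpens the softmax, so even modest cosine margins between the positive and the in-batch negatives translate into a near-zero loss.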
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • gradient_accumulation_steps: 2
  • learning_rate: 5e-06
  • weight_decay: 0.01
  • num_train_epochs: 6
  • warmup_ratio: 0.1
  • load_best_model_at_end: True
  • batch_sampler: no_duplicates
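The non-default settings above correspond roughly to the following training-arguments fragment. The `output_dir` is a hypothetical path; everything else is taken from the list above:

```python
from sentence_transformers.training_args import (
    SentenceTransformerTrainingArguments,
    BatchSamplers,
)

args = SentenceTransformerTrainingArguments(
    output_dir="all-MiniLM-L6-medical-rag",     # hypothetical output path
    eval_strategy="steps",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    gradient_accumulation_steps=2,              # effective batch size: 64
    learning_rate=5e-6,
    weight_decay=0.01,
    num_train_epochs=6,
    warmup_ratio=0.1,
    load_best_model_at_end=True,
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # avoids duplicate in-batch negatives
)
```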

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 32
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 2
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-06
  • weight_decay: 0.01
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 6
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: None
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch_fused
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • project: huggingface
  • trackio_space_id: trackio
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • hub_revision: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: no
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • liger_kernel_config: None
  • eval_use_gather_object: False
  • average_tokens_across_devices: True
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {}
  • learning_rate_mapping: {}

Training Logs

Epoch Step Training Loss retriever_evaluator_cosine_ndcg@10
0.0990 10 0.3503 -
0.1980 20 0.3043 -
0.2970 30 0.2295 -
0.3960 40 0.2332 -
0.4950 50 0.2588 -
0.5941 60 0.195 -
0.6931 70 0.1892 -
0.7921 80 0.2441 -
0.8911 90 0.1436 -
0.9901 100 0.121 0.8738
1.0891 110 0.1649 -
1.1881 120 0.137 -
1.2871 130 0.1231 -
1.3861 140 0.1652 -
1.4851 150 0.1249 -
1.5842 160 0.1618 -
1.6832 170 0.1747 -
1.7822 180 0.094 -
1.8812 190 0.1044 -
1.9802 200 0.0933 0.8820
2.0792 210 0.1261 -
2.1782 220 0.0988 -
2.2772 230 0.1055 -
2.3762 240 0.1023 -
2.4752 250 0.1258 -
2.5743 260 0.1259 -
2.6733 270 0.1253 -
2.7723 280 0.1362 -
2.8713 290 0.0931 -
2.9703 300 0.1152 0.8870
3.0693 310 0.0933 -
3.1683 320 0.0917 -
3.2673 330 0.1061 -
3.3663 340 0.0903 -
3.4653 350 0.0944 -
3.5644 360 0.0927 -
3.6634 370 0.0863 -
3.7624 380 0.1132 -
3.8614 390 0.1027 -
3.9604 400 0.0818 0.8876
4.0594 410 0.099 -
4.1584 420 0.1009 -
4.2574 430 0.1029 -
4.3564 440 0.1262 -
4.4554 450 0.0946 -
4.5545 460 0.0878 -
4.6535 470 0.0931 -
4.7525 480 0.0999 -
4.8515 490 0.0856 -
4.9505 500 0.0793 0.8907
5.0495 510 0.1057 -
5.1485 520 0.094 -
5.2475 530 0.1111 -
5.3465 540 0.0854 -
5.4455 550 0.1063 -
5.5446 560 0.1043 -
5.6436 570 0.0942 -
5.7426 580 0.0852 -
5.8416 590 0.0752 -
5.9406 600 0.0883 0.8890
  • The saved checkpoint is the step-500 row (epoch 4.9505), which achieved the best retriever_evaluator_cosine_ndcg@10 of 0.8907.

Framework Versions

  • Python: 3.12.12
  • Sentence Transformers: 5.2.2
  • Transformers: 4.57.6
  • PyTorch: 2.9.0+cu126
  • Accelerate: 1.12.0
  • Datasets: 4.0.0
  • Tokenizers: 0.22.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}