MedClear V2: Medical Text Simplification

MedClear translates doctor-speak into human-speak. It is a fine-tuned FLAN-T5-base (248M parameters) that simplifies clinical notes, medical terminology, and discharge summaries into plain language patients can understand.

Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("DTanzillo/medclear-v2-base")
model = AutoModelForSeq2SeqLM.from_pretrained("DTanzillo/medclear-v2-base")

text = "simplify: Patient underwent laparoscopic cholecystectomy for acute cholecystitis. EBL minimal. POD1: afebrile, tolerating PO diet."
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(**inputs, max_new_tokens=256, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Training Data

23,157 examples across multiple granularity levels:

Level           Examples      %
Terms              4,989   21.5%
Phrases            6,660   28.8%
Sentences          8,000   34.5%
Flashcards         2,689   11.6%
Paragraphs           574    2.5%
RAG-augmented        245    1.1%

Key insight: about half of the training data (50.3%) is term- or phrase-level. The model learns vocabulary mappings first, then composes them into simplified text.
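As an illustration, the multi-granularity idea above can be sketched as building text-to-text pairs at increasing sizes, all sharing the "simplify:" prefix shown in the usage example. The term/phrase pairs and the `make_example` helper below are hypothetical, not drawn from the released dataset.

```python
# Sketch of assembling multi-granularity training pairs.
# The terms and phrases here are illustrative, not from the actual dataset.
def make_example(source: str, target: str) -> dict:
    """Format one pair in the FLAN-T5 text-to-text style used by MedClear."""
    return {"input": f"simplify: {source}", "target": target}

term_pairs = [
    ("afebrile", "no fever"),
    ("cholecystectomy", "gallbladder removal"),
]
phrase_pairs = [
    ("tolerating PO diet", "eating and drinking normally by mouth"),
]

dataset = [make_example(s, t) for s, t in term_pairs + phrase_pairs]
print(dataset[0])
# {'input': 'simplify: afebrile', 'target': 'no fever'}
```

Training on short pairs first gives the model reliable vocabulary substitutions before it has to restructure whole sentences.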

Results

Metric        Raw FLAN-T5   MedClear
ROUGE-1 F1           0.13       0.36
ROUGE-2 F1           0.05       0.13
ROUGE-L F1           0.10       0.22
Eval loss              --      1.712
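For reference, ROUGE-1 F1 (the unigram-overlap score reported above) can be computed in a few lines of pure Python. This is a minimal sketch of the metric itself, not the exact evaluation harness behind the numbers in the table.

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((pred & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("no fever after surgery", "the patient had no fever"), 3))
# → 0.444
```

In practice a tokenizer-aware library implementation (e.g. the `rouge_score` package) is preferable; the sketch just shows what the score measures.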

Limitations

  • Can hallucinate on complex multi-fact clinical notes
  • Best used with RAG pipeline (MedlinePlus) for verification
  • Not a substitute for professional medical advice
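One way the RAG verification mentioned above could work is to flag jargon that survives simplification without a trusted plain-language definition. The glossary entries and the `unverified_terms` helper below are hypothetical, standing in for a real MedlinePlus lookup.

```python
# Hypothetical verification pass: flag medical terms left in the output
# that a trusted glossary cannot confirm. The dict stands in for a real
# MedlinePlus retrieval step.
GLOSSARY = {
    "cholecystitis": "inflammation of the gallbladder",
    "afebrile": "no fever",
}

def unverified_terms(simplified: str, jargon: set[str]) -> list[str]:
    """Return jargon words in the output that the glossary can't confirm."""
    words = {w.strip(".,").lower() for w in simplified.split()}
    return sorted(w for w in words & jargon if w not in GLOSSARY)

out = "Patient had laparoscopic surgery for cholecystitis."
print(unverified_terms(out, {"cholecystitis", "laparoscopic"}))
# → ['laparoscopic']
```

Any flagged term can then be routed back through retrieval or escalated for human review rather than shown to a patient unverified.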

Demo

Try the live demo: MedClear on HuggingFace Spaces

Duke University Hackathon 2026
