license: mit
language:
  - en
tags:
  - pytorch
  - transformer
  - gpt
  - medical-ai
  - clinical-nlp
  - biomedical-nlp
  - language-model
  - healthcare
pipeline_tag: text-generation

🩺 MediGPT: A Domain-Specific Clinical Small Language Model

MediGPT is a decoder-only GPT-style Transformer trained entirely from scratch on medical and biomedical text corpora.

Rather than fine-tuning a large pretrained model, MediGPT was built as a research-oriented educational project to explore domain-specific language modeling, Transformer architectures, and clinical text generation.


🚀 Key Features

  • Built entirely from scratch using PyTorch
  • Decoder-only GPT architecture
  • Trained on medical and biomedical datasets
  • GPT-2 BPE tokenization via tiktoken
  • End-to-end implementation including:
    • Data preprocessing
    • Corpus analysis
    • Transformer implementation
    • Training pipeline
    • Evaluation metrics
    • Text generation



🏗️ Model Architecture

| Component | Value |
|---|---|
| Architecture | GPT Decoder |
| Layers | 8 |
| Attention Heads | 8 |
| Hidden Size | 512 |
| Context Length | 512 |
| Dropout | 0.1 |
| Tokenizer | GPT-2 (tiktoken) |
| Framework | PyTorch |

Approximate Parameter Count: ~90M
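
As a rough sanity check on model size, the sketch below counts parameters for a standard GPT-2-style decoder built from the values listed above. The layout is an assumption, not something this README states: learned position embeddings, a 4x feed-forward width, biases on all linear layers, and the 50,257-token GPT-2 vocabulary. The function name `estimate_gpt_params` is hypothetical.

```python
def estimate_gpt_params(n_layer, d_model, context_len, vocab_size,
                        mlp_ratio=4, tie_embeddings=True):
    """Count parameters of a GPT-2-style decoder (assumed layout, see lead-in)."""
    token_emb = vocab_size * d_model
    pos_emb = context_len * d_model          # learned positions (assumption)
    layer_norm = 2 * d_model                 # scale + shift
    # Fused QKV projection plus output projection, both with biases.
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    d_ff = mlp_ratio * d_model
    # Two-layer feed-forward network with biases.
    mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    block = 2 * layer_norm + attn + mlp
    total = token_emb + pos_emb + n_layer * block + layer_norm  # + final LayerNorm
    if not tie_embeddings:
        total += vocab_size * d_model        # separate output head, no bias
    return total


tied = estimate_gpt_params(8, 512, 512, 50257)
untied = estimate_gpt_params(8, 512, 512, 50257, tie_embeddings=False)
print(f"tied: {tied / 1e6:.1f}M, untied: {untied / 1e6:.1f}M")
```

Under these assumptions the estimate is about 51M parameters with tied input/output embeddings and about 77M without tying. A different layout, such as a wider feed-forward layer, would change the total, so the figure reported above may reflect choices not listed in the table.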


📚 Training Data

MediGPT was trained on a combination of medical datasets:

MedQuAD

Medical question-answer pairs covering diseases, symptoms, diagnostics, treatments, and healthcare information.

PubMedQA

Biomedical question-answer data derived from scientific literature and PubMed abstracts.

The resulting corpus exposes the model to:

  • Clinical terminology
  • Biomedical vocabulary
  • Medical QA patterns
  • Research-style writing
  • Healthcare discourse

🎯 Training Objective

The model was trained using autoregressive next-token prediction.

Given a sequence such as:

Symptoms of diabetes include ...

the model learns to assign high probability to the token that actually follows.
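
The objective can be illustrated without any framework. The sketch below shows how a token stream is cut into input/target windows, where each target is the input shifted one position to the right, and how the cross-entropy loss for a single position is computed from raw logits. The helper names, the non-overlapping windowing, and the token ids are illustrative assumptions, not the project's actual training code.

```python
import math


def make_training_pairs(token_ids, context_len):
    """Cut a token stream into (input, target) windows; targets are inputs shifted by one."""
    pairs = []
    # Stop early enough that every target window is full length.
    for start in range(0, len(token_ids) - context_len, context_len):
        x = token_ids[start : start + context_len]
        y = token_ids[start + 1 : start + context_len + 1]
        pairs.append((x, y))
    return pairs


def next_token_loss(logits, target_id):
    """Cross-entropy at one position: -log softmax(logits)[target_id]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target_id]


tokens = [10, 11, 12, 13, 14, 15, 16]  # made-up token ids
pairs = make_training_pairs(tokens, context_len=3)
for x, y in pairs:
    print(x, "->", y)

# With uniform logits over 4 tokens the loss is log(4), whatever the target.
uniform_loss = next_token_loss([0.0, 0.0, 0.0, 0.0], target_id=2)
```

Training minimizes the average of this loss over every position in every window, which is the same as maximizing the probability the model assigns to the observed next tokens.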


📈 Evaluation

Evaluation included:

  • Validation Loss
  • Perplexity
  • Corpus Statistics
  • Attention Analysis
  • Qualitative Text Generation

Results indicate that the model has learned:

  • Clinical vocabulary
  • Biomedical terminology
  • Medical writing style
  • Research-oriented sentence structures
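
Validation loss and perplexity, both listed above, are directly related: perplexity is conventionally the exponential of the mean per-token cross-entropy, assuming the loss is measured in nats. A minimal sketch, with a hypothetical `perplexity` helper and made-up loss values rather than this model's results:

```python
import math


def perplexity(token_losses):
    """Return exp(mean per-token cross-entropy); losses are assumed to be in nats."""
    if not token_losses:
        raise ValueError("need at least one loss value")
    return math.exp(sum(token_losses) / len(token_losses))


val_losses = [2.0, 3.0, 2.5, 2.5]  # made-up values, not MediGPT results
ppl = perplexity(val_losses)
print(f"mean loss = {sum(val_losses) / len(val_losses):.2f}, perplexity = {ppl:.2f}")
```

A perplexity of k can be read as the model being, on average, as uncertain as a uniform choice among k tokens.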



🔬 Example Prompt

Input:

Symptoms of diabetes include

Generated continuation:

Symptoms of diabetes include increased thirst, frequent urination, fatigue, blurred vision and other complications associated with impaired glucose regulation.
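
A continuation like the one above is produced one token at a time: the model scores every vocabulary entry, one token is chosen, and it is appended to the context before the next step. The sketch below shows that loop with temperature sampling. The decoding strategy MediGPT actually uses is not stated here, so the temperature option, the function names, and the toy stand-in model are all assumptions for illustration.

```python
import math
import random


def sample_next(logits, temperature=1.0, rng=random):
    """Pick a token id from softmax(logits / temperature); temperature 0 means greedy."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [v / temperature for v in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(v - m) for v in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(logits_fn, prompt_ids, max_new_tokens, context_len=512, temperature=1.0):
    """Autoregressive decoding: append one chosen token per step."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        window = ids[-context_len:]  # the model only sees the last context_len tokens
        ids.append(sample_next(logits_fn(window), temperature))
    return ids


def toy_logits(window):
    """Hypothetical stand-in for the trained model: favours (last id + 1) mod 5."""
    favoured = (window[-1] + 1) % 5
    return [5.0 if i == favoured else 0.0 for i in range(5)]


out = generate(toy_logits, [0], max_new_tokens=4, temperature=0)
print(out)
```

With a real model, `logits_fn` would run a forward pass and return the logits for the last position, and the resulting ids would be decoded back to text with the tokenizer.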


⚠️ Limitations

  • Small-scale model compared to modern foundation models
  • Limited medical reasoning capability
  • May generate inaccurate information
  • Not suitable for clinical deployment
  • Intended primarily for research and educational purposes

🩺 Medical Disclaimer

This model is not a medical device and must not be used for:

  • Medical diagnosis
  • Treatment recommendations
  • Clinical decision making
  • Emergency healthcare situations

All outputs should be verified by qualified healthcare professionals.


👨‍💻 Author

Rishabh Shenoy

Developed as a research-oriented educational project exploring domain-specific language model development and biomedical NLP.


📜 Citation

If you use this work, please cite the repository and model page.