---
license: mit
language:
- en
tags:
- pytorch
- transformer
- gpt
- medical-ai
- clinical-nlp
- biomedical-nlp
- language-model
- healthcare
pipeline_tag: text-generation
---

# 🩺 MediGPT: A Domain-Specific Clinical Small Language Model

MediGPT is a decoder-only GPT-style Transformer trained entirely from scratch on medical and biomedical text corpora.

Rather than fine-tuning an existing large pretrained model, MediGPT was built and trained from scratch as a research-oriented educational project exploring domain-specific language modeling, Transformer architectures, and clinical text generation.

---

## 🚀 Key Features

- Built entirely from scratch using PyTorch
- Decoder-only GPT architecture
- Trained on medical and biomedical datasets
- GPT-2 BPE tokenization via tiktoken
- End-to-end implementation including:
  - Data preprocessing
  - Corpus analysis
  - Transformer implementation
  - Training pipeline
  - Evaluation metrics
  - Text generation

![image](https://cdn-uploads.huggingface.co/production/uploads/6a0cd36102135372849ae9d0/keZO89KH-n1_3dmUGn7sW.png)

![image](https://cdn-uploads.huggingface.co/production/uploads/6a0cd36102135372849ae9d0/5Nidpu-PX09pmQhBZ1MvC.png)

![image](https://cdn-uploads.huggingface.co/production/uploads/6a0cd36102135372849ae9d0/zdBON7JuqWSykfisJyr7u.png)

---

## 🏗️ Model Architecture

| Component | Value |
|------------|---------|
| Architecture | GPT Decoder |
| Layers | 8 |
| Attention Heads | 8 |
| Hidden Size | 512 |
| Context Length | 512 |
| Dropout | 0.1 |
| Tokenizer | GPT-2 (tiktoken) |
| Framework | PyTorch |

Approximate parameter count: 90M
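
As a rough cross-check on the table, the sketch below counts parameters for a GPT-2-style layout. Everything beyond the table values is an assumption and not confirmed by this card: a vocabulary of 50,257 tokens (the GPT-2 BPE vocabulary), learned positional embeddings, a 4x MLP expansion, biases on all linear layers, and two LayerNorms per block. `GPTConfig` and `estimate_params` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class GPTConfig:
    """Hyperparameters from the architecture table; vocab_size is an assumption."""
    n_layer: int = 8
    n_head: int = 8
    d_model: int = 512
    context_len: int = 512
    vocab_size: int = 50257  # GPT-2 BPE vocabulary size (assumed unpadded)


def estimate_params(cfg: GPTConfig, tie_embeddings: bool) -> int:
    """Count parameters for an assumed GPT-2-style layout."""
    d = cfg.d_model
    token_emb = cfg.vocab_size * d
    pos_emb = cfg.context_len * d
    attn = (3 * d * d + 3 * d) + (d * d + d)      # fused QKV + output projection, with biases
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)   # 4x expansion MLP, with biases
    norms = 2 * (2 * d)                           # two LayerNorms per block (scale + shift)
    block = attn + mlp + norms
    final_norm = 2 * d
    # An untied output head adds a second vocab_size x d_model matrix.
    lm_head = 0 if tie_embeddings else cfg.vocab_size * d
    return token_emb + pos_emb + cfg.n_layer * block + final_norm + lm_head


cfg = GPTConfig()
tied = estimate_params(cfg, tie_embeddings=True)     # 51,213,824
untied = estimate_params(cfg, tie_embeddings=False)  # 76,945,408
```

Under these assumptions the estimate is about 51M parameters with tied embeddings and about 77M with an untied output head. The reported figure therefore depends on implementation details that the table does not list, and the actual count is best read from the checkpoint itself.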

---

## 📚 Training Data

MediGPT was trained on a combination of two medical question-answering datasets:

### MedQuAD
Medical question-answer pairs covering diseases, symptoms, diagnostics, treatments, and healthcare information.

### PubMedQA
Biomedical question-answer data derived from scientific literature and PubMed abstracts.

The resulting corpus exposes the model to:

- Clinical terminology
- Biomedical vocabulary
- Medical QA patterns
- Research-style writing
- Healthcare discourse

---

## 🎯 Training Objective

The model was trained using autoregressive next-token prediction.

Given a sequence such as:

Symptoms of diabetes include ...

the model learns to assign high probability to the token that actually follows, conditioned on all preceding tokens.
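
The objective can be made concrete with a minimal pure-Python sketch. It assumes the usual setup for this kind of training, which the card does not spell out: targets are the input tokens shifted by one position, and the per-token loss is cross-entropy. The helper names `make_example` and `cross_entropy` and the token IDs are hypothetical, and the IDs are not real tiktoken output.

```python
import math


def make_example(token_ids, start, block_size):
    """Slice one training window; targets are the inputs shifted left by one."""
    chunk = token_ids[start : start + block_size + 1]
    return chunk[:-1], chunk[1:]


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


# Toy token IDs standing in for a BPE-encoded sentence.
ids = [11, 42, 7, 99, 3, 58]
inputs, targets = make_example(ids, start=0, block_size=4)
# inputs  -> [11, 42, 7, 99]
# targets -> [42, 7, 99, 3]

# A uniform prediction over a 4-token toy vocabulary costs log(4) per token.
loss = cross_entropy([0.0, 0.0, 0.0, 0.0], target=2)
```

In a real run, `block_size` would be at most the context length of 512, and the loss would be averaged over every position in the batch.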

---

## 📈 Evaluation

Evaluation included:

- Validation Loss
- Perplexity
- Corpus Statistics
- Attention Analysis
- Qualitative Text Generation
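
Perplexity and validation loss are two views of the same quantity: perplexity is the exponential of the mean per-token cross-entropy, provided the loss is measured in natural-log units. A minimal sketch follows. The function name and the loss values are hypothetical and are not measurements from MediGPT.

```python
import math


def perplexity(token_losses):
    """exp of the mean per-token cross-entropy (natural-log units)."""
    if not token_losses:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_losses) / len(token_losses))


# Hypothetical per-token losses, for illustration only.
example_losses = [2.0, 3.0, 4.0]
ppl = perplexity(example_losses)  # exp(3.0)
```

For reference, a model that guesses uniformly over a vocabulary of size V has perplexity V, so a trained model should score well below the vocabulary size.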

These results indicate that the model has learned:

- Clinical vocabulary
- Biomedical terminology
- Medical writing style
- Research-oriented sentence structures

![image](https://cdn-uploads.huggingface.co/production/uploads/6a0cd36102135372849ae9d0/EG7lUJC8oZx9j2NSptqog.png)

![image](https://cdn-uploads.huggingface.co/production/uploads/6a0cd36102135372849ae9d0/-cgycKs9EJfTkxRLVTOFc.png)

![image](https://cdn-uploads.huggingface.co/production/uploads/6a0cd36102135372849ae9d0/vzU8wdXwwTcOHP9QTmxfe.png)

![image](https://cdn-uploads.huggingface.co/production/uploads/6a0cd36102135372849ae9d0/iVArLtoPt9yv2LybIjJj-.png)

---

## 🔬 Example Prompt

Input:

Symptoms of diabetes include

Generated continuation:

Symptoms of diabetes include increased thirst, frequent urination, fatigue, blurred vision and other complications associated with impaired glucose regulation.
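
A continuation like the one above is produced one token at a time, with each new token fed back in as input. The card does not state which decoding strategy was used, so the sketch below shows temperature sampling as one common option. `logits_fn` is a hypothetical stand-in for the model's forward pass, and `toy_logits_fn` is a toy replacement so the loop runs without a checkpoint.

```python
import math
import random


def sample_next(logits, temperature, rng):
    """Sample a token index from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in scaled]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def generate(logits_fn, prompt_ids, max_new_tokens, context_len=512,
             temperature=1.0, seed=0):
    """Append tokens one at a time, feeding each output back as input."""
    rng = random.Random(seed)
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        window = ids[-context_len:]   # crop to the model's context length
        logits = logits_fn(window)    # scores for the next token only
        ids.append(sample_next(logits, temperature, rng))
    return ids


def toy_logits_fn(window, vocab_size=5):
    """Hypothetical stand-in for the model: strongly favors (last + 1) % vocab_size."""
    favored = (window[-1] + 1) % vocab_size
    return [10.0 if i == favored else -10.0 for i in range(vocab_size)]


out = generate(toy_logits_fn, prompt_ids=[0], max_new_tokens=4, temperature=0.1)
```

With the real model, the prompt would first be encoded to token IDs with the GPT-2 tokenizer and the generated IDs decoded back to text. Lower temperatures make the output more deterministic, and higher ones make it more varied.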

---

## ⚠️ Limitations

- Small-scale model compared to modern foundation models
- Limited medical reasoning capability
- May generate inaccurate information
- Not suitable for clinical deployment
- Intended primarily for research and educational purposes

---

## 🩺 Medical Disclaimer

This model is not a medical device and must not be used for:

- Medical diagnosis
- Treatment recommendations
- Clinical decision making
- Emergency healthcare situations

All outputs should be verified by qualified healthcare professionals.

---

## 👨‍💻 Author

Rishabh Shenoy

Developed as a research-oriented educational project exploring domain-specific language model development and biomedical NLP.

---

## 📜 Citation

If you use this work, please cite the repository and model page.