---
license: mit
language:
- en
tags:
- pytorch
- transformer
- gpt
- medical-ai
- clinical-nlp
- biomedical-nlp
- language-model
- healthcare
pipeline_tag: text-generation
---
# 🩺 MediGPT: A Domain-Specific Clinical Small Language Model
MediGPT is a decoder-only GPT-style Transformer trained entirely from scratch on medical and biomedical text corpora.
Rather than fine-tuning a large pretrained model, MediGPT was built as a research-oriented educational project to explore domain-specific language modeling, Transformer architectures, and clinical text generation.
---
## 🚀 Key Features
- Built entirely from scratch using PyTorch
- Decoder-only GPT architecture
- Trained on medical and biomedical datasets
- GPT-2 BPE tokenization via tiktoken
- End-to-end implementation including:
  - Data preprocessing
  - Corpus analysis
  - Transformer implementation
  - Training pipeline
  - Evaluation metrics
  - Text generation
![image](https://cdn-uploads.huggingface.co/production/uploads/6a0cd36102135372849ae9d0/keZO89KH-n1_3dmUGn7sW.png)
![image](https://cdn-uploads.huggingface.co/production/uploads/6a0cd36102135372849ae9d0/5Nidpu-PX09pmQhBZ1MvC.png)
![image](https://cdn-uploads.huggingface.co/production/uploads/6a0cd36102135372849ae9d0/zdBON7JuqWSykfisJyr7u.png)
---
## 🏗️ Model Architecture
| Component | Value |
|------------|---------|
| Architecture | GPT Decoder |
| Layers | 8 |
| Attention Heads | 8 |
| Hidden Size | 512 |
| Context Length | 512 |
| Dropout | 0.1 |
| Tokenizer | GPT-2 (tiktoken) |
| Framework | PyTorch |
Approximate Parameter Count: ~90M
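
To relate the table to a parameter count, the following is a rough estimate in plain Python. It is a sketch under stated assumptions, not the project's actual code: the `GPTConfig` and `count_parameters` names are hypothetical, the vocabulary size is taken to be the 50,257 entries of the GPT-2 BPE encoding, and the blocks are assumed to follow the GPT-2 layout (4x MLP expansion, bias terms, two LayerNorms per block).

```python
from dataclasses import dataclass


@dataclass
class GPTConfig:
    # Values taken from the architecture table.
    n_layer: int = 8
    n_head: int = 8
    n_embd: int = 512
    block_size: int = 512
    # Assumption: vocabulary size of the GPT-2 BPE encoding.
    vocab_size: int = 50257
    # Assumption: GPT-2-style 4x expansion in the MLP.
    mlp_ratio: int = 4
    # Assumption: whether the output head shares the token embedding matrix.
    tie_weights: bool = True


def count_parameters(cfg: GPTConfig) -> int:
    """Estimate trainable parameters for a GPT-2-style decoder."""
    d = cfg.n_embd
    hidden = cfg.mlp_ratio * d
    embeddings = cfg.vocab_size * d + cfg.block_size * d  # token + position
    layer_norm = 2 * d                                    # weight + bias
    attention = (d * 3 * d + 3 * d) + (d * d + d)         # QKV + output proj
    mlp = (d * hidden + hidden) + (hidden * d + d)        # up + down proj
    block = 2 * layer_norm + attention + mlp
    total = embeddings + cfg.n_layer * block + layer_norm  # + final LayerNorm
    if not cfg.tie_weights:
        total += cfg.vocab_size * d                        # separate LM head
    return total


if __name__ == "__main__":
    for tied in (True, False):
        n = count_parameters(GPTConfig(tie_weights=tied))
        print(f"tie_weights={tied}: {n / 1e6:.1f}M parameters")
```

Under these assumptions the estimate comes to about 51M parameters with a tied output head and about 77M with a separate one. The ~90M figure reported above may reflect implementation details not listed in the table, such as a wider MLP.
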
---
## 📚 Training Data
MediGPT was trained on a combination of medical datasets:
### MedQuAD
Medical question-answer pairs covering diseases, symptoms, diagnostics, treatments, and healthcare information.
### PubMedQA
Biomedical question-answer data derived from scientific literature and PubMed abstracts.
The resulting corpus exposes the model to:
- Clinical terminology
- Biomedical vocabulary
- Medical QA patterns
- Research-style writing
- Healthcare discourse
---
## 🎯 Training Objective
The model was trained using autoregressive next-token prediction.
Given a sequence such as `Symptoms of diabetes include ...`, the model learns to predict the most likely next token.
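
The objective can be illustrated with a small, dependency-free sketch. It shows how a token stream is cut into input and target windows that are shifted by one position, and how the loss for a single prediction is computed. The function names and the toy token IDs are illustrative assumptions; the project's own training loop is presumably written in PyTorch and is not reproduced here.

```python
import math


def make_training_pairs(token_ids, block_size):
    """Slice a token stream into (input, target) windows shifted by one."""
    pairs = []
    for start in range(len(token_ids) - block_size):
        x = token_ids[start : start + block_size]
        y = token_ids[start + 1 : start + block_size + 1]
        pairs.append((x, y))
    return pairs


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]


if __name__ == "__main__":
    toy_ids = [10, 11, 12, 13, 14]  # stand-ins for real BPE token IDs
    for x, y in make_training_pairs(toy_ids, block_size=3):
        print(x, "->", y)
    # A model with uniform logits over 4 tokens has loss log(4).
    print(cross_entropy([0.0, 0.0, 0.0, 0.0], target=2))
```

Each target window is the input window moved one token to the right, so every position in the context contributes one next-token prediction to the loss.
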
---
## 📈 Evaluation
Evaluation included:
- Validation Loss
- Perplexity
- Corpus Statistics
- Attention Analysis
- Qualitative Text Generation
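
Validation loss and perplexity are closely related: perplexity is conventionally the exponential of the mean per-token cross-entropy (in nats). The sketch below shows that relationship in plain Python. The function names are hypothetical, and it assumes the validation loop reports a mean loss and a token count per batch, which may differ from how this project computes it.

```python
import math


def perplexity(mean_loss):
    """Perplexity from a mean per-token cross-entropy measured in nats."""
    return math.exp(mean_loss)


def perplexity_from_batches(batches):
    """Combine (mean_loss, n_tokens) pairs, weighting each batch by its size."""
    total_tokens = sum(n for _, n in batches)
    if total_tokens == 0:
        raise ValueError("no tokens to evaluate")
    total_loss = sum(loss * n for loss, n in batches)
    return perplexity(total_loss / total_tokens)


if __name__ == "__main__":
    # Illustrative numbers only, not results from MediGPT.
    batches = [(math.log(4.0), 100), (math.log(4.0), 50)]
    print(perplexity_from_batches(batches))
```

Weighting by token count matters when the last validation batch is shorter than the others; a plain average of per-batch losses would give it too much influence.
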
Results indicate successful learning of:
- Clinical vocabulary
- Biomedical terminology
- Medical writing style
- Research-oriented sentence structures
![image](https://cdn-uploads.huggingface.co/production/uploads/6a0cd36102135372849ae9d0/EG7lUJC8oZx9j2NSptqog.png)
![image](https://cdn-uploads.huggingface.co/production/uploads/6a0cd36102135372849ae9d0/-cgycKs9EJfTkxRLVTOFc.png)
![image](https://cdn-uploads.huggingface.co/production/uploads/6a0cd36102135372849ae9d0/vzU8wdXwwTcOHP9QTmxfe.png)
![image](https://cdn-uploads.huggingface.co/production/uploads/6a0cd36102135372849ae9d0/iVArLtoPt9yv2LybIjJj-.png)
---
## 🔬 Example Prompt
Input:
Symptoms of diabetes include
Generated continuation:
Symptoms of diabetes include increased thirst, frequent urination, fatigue, blurred vision and other complications associated with impaired glucose regulation.
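
A continuation like the one above is produced by repeatedly predicting one token and appending it to the context. The following is a minimal sketch of such a loop in plain Python, not the project's generation code. `next_token_logits` is a hypothetical callable standing in for the trained model, `toy_model` is a stub used only to make the example runnable, and the temperature and top-k options are common choices that this README does not confirm.

```python
import math
import random


def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    """Pick a token ID from logits; temperature <= 0 means greedy decoding."""
    if temperature <= 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    candidates = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k is not None:
        candidates = candidates[:top_k]  # top_k is expected to be positive
    scaled = [logits[i] / temperature for i in candidates]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # unnormalized softmax
    return rng.choices(candidates, weights=weights, k=1)[0]


def generate(next_token_logits, prompt_ids, max_new_tokens, block_size=512, **sampling):
    """Autoregressively extend prompt_ids, cropping context to block_size."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        context = ids[-block_size:]  # the model only sees the last 512 tokens
        logits = next_token_logits(context)
        ids.append(sample_next(logits, **sampling))
    return ids


def toy_model(context, vocab_size=5):
    """Stub model: strongly prefers the token after the last one, modulo 5."""
    nxt = (context[-1] + 1) % vocab_size
    return [5.0 if i == nxt else 0.0 for i in range(vocab_size)]


if __name__ == "__main__":
    print(generate(toy_model, [0], max_new_tokens=5, temperature=0))
```

With a real model, the prompt would first be encoded to token IDs with the GPT-2 tokenizer and the returned IDs decoded back to text.
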
---
## ⚠️ Limitations
- Small-scale model compared to modern foundation models
- Limited medical reasoning capability
- May generate inaccurate information
- Not suitable for clinical deployment
- Intended primarily for research and educational purposes
---
## 🩺 Medical Disclaimer
This model is not a medical device and must not be used for:
- Medical diagnosis
- Treatment recommendations
- Clinical decision making
- Emergency healthcare situations
All outputs should be verified by qualified healthcare professionals.
---
## 👨‍💻 Author
Rishabh Shenoy
Developed as a research-oriented educational project exploring domain-specific language model development and biomedical NLP.
---
## 📜 Citation
If you use this work, please cite the repository and model page.