---
license: mit
language:
- en
tags:
- pytorch
- transformer
- gpt
- medical-ai
- clinical-nlp
- biomedical-nlp
- language-model
- healthcare
pipeline_tag: text-generation
---

# 🩺 MediGPT: A Domain-Specific Clinical Small Language Model

MediGPT is a decoder-only GPT-style Transformer trained entirely from scratch on medical and biomedical text corpora. Rather than fine-tuning a large pretrained model, it was developed as a research-oriented educational project to explore domain-specific language modeling, transformer architectures, and clinical text generation.

---

## 🚀 Key Features

- Built entirely from scratch using PyTorch
- Decoder-only GPT architecture
- Trained on medical and biomedical datasets
- GPT-2 BPE tokenization via tiktoken
- End-to-end implementation including:
  - Data preprocessing
  - Corpus analysis
  - Transformer implementation
  - Training pipeline
  - Evaluation metrics
  - Text generation

![image](https://cdn-uploads.huggingface.co/production/uploads/6a0cd36102135372849ae9d0/keZO89KH-n1_3dmUGn7sW.png)

![image](https://cdn-uploads.huggingface.co/production/uploads/6a0cd36102135372849ae9d0/5Nidpu-PX09pmQhBZ1MvC.png)

![image](https://cdn-uploads.huggingface.co/production/uploads/6a0cd36102135372849ae9d0/zdBON7JuqWSykfisJyr7u.png)

---

## 🏗️ Model Architecture

| Component | Value |
|-----------|-------|
| Architecture | GPT Decoder |
| Layers | 8 |
| Attention Heads | 8 |
| Hidden Size | 512 |
| Context Length | 512 tokens |
| Dropout | 0.1 |
| Tokenizer | GPT-2 (tiktoken) |
| Framework | PyTorch |

Parameter count: ~90M

---

## 📚 Training Data

MediGPT was trained on a combination of two medical datasets:

### MedQuAD

Medical question-answer pairs covering diseases, symptoms, diagnostics, treatments, and healthcare information.

### PubMedQA

Biomedical question-answer data derived from scientific literature and PubMed abstracts.

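
To make the path from question-answer data to training examples concrete, here is a minimal sketch of how records like those in MedQuAD or PubMedQA could be flattened into text and cut into next-token training pairs. Everything specific in it is an assumption rather than MediGPT's actual preprocessing: the `Question:`/`Answer:` template, the `format_record` and `make_training_pairs` helpers, the sample record, and the toy whitespace tokenizer, which stands in for GPT-2 BPE only so the example runs without dependencies.

```python
from __future__ import annotations

# Illustrative only: not MediGPT's actual preprocessing code.


def format_record(question: str, answer: str) -> str:
    """Flatten one QA record into a single training string (assumed template)."""
    return f"Question: {question.strip()}\nAnswer: {answer.strip()}"


class ToyTokenizer:
    """Whitespace tokenizer standing in for GPT-2 BPE (hypothetical helper)."""

    def __init__(self) -> None:
        self.vocab: dict[str, int] = {}

    def encode(self, text: str) -> list[int]:
        # Assign a new integer ID the first time each word is seen.
        return [self.vocab.setdefault(word, len(self.vocab)) for word in text.split()]


def make_training_pairs(
    token_ids: list[int], context_length: int
) -> list[tuple[list[int], list[int]]]:
    """Cut a token stream into (input, target) windows for next-token prediction.

    The target is the input shifted left by one position, so position i of
    the target holds the token that follows position i of the input.
    """
    pairs = []
    # Step by context_length so inputs do not overlap; a trailing window that
    # is too short to supply a full shifted target is dropped.
    for start in range(0, len(token_ids) - context_length, context_length):
        window = token_ids[start : start + context_length + 1]
        pairs.append((window[:-1], window[1:]))
    return pairs


# Hypothetical record, written for this example (not taken from either dataset).
record = format_record(
    "What are the symptoms of diabetes?",
    "Increased thirst, frequent urination, and fatigue.",
)
ids = ToyTokenizer().encode(record)  # 14 distinct words, so IDs 0..13
pairs = make_training_pairs(ids, context_length=4)
print(pairs[0])  # ([0, 1, 2, 3], [1, 2, 3, 4])
```

In a real pipeline, the toy tokenizer would presumably be replaced by `tiktoken.get_encoding("gpt2")`, and `context_length` would match the 512-token context of the model.
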

The resulting corpus exposes the model to:

- Clinical terminology
- Biomedical vocabulary
- Medical QA patterns
- Research-style writing
- Healthcare discourse

---

## 🎯 Training Objective

The model was trained using autoregressive next-token prediction. Given a sequence such as:

> Symptoms of diabetes include ...

the model learns to predict the most likely next token.

---

## 📈 Evaluation

Evaluation included:

- Validation Loss
- Perplexity
- Corpus Statistics
- Attention Analysis
- Qualitative Text Generation

Results indicate successful learning of:

- Clinical vocabulary
- Biomedical terminology
- Medical writing style
- Research-oriented sentence structures

![image](https://cdn-uploads.huggingface.co/production/uploads/6a0cd36102135372849ae9d0/EG7lUJC8oZx9j2NSptqog.png)

![image](https://cdn-uploads.huggingface.co/production/uploads/6a0cd36102135372849ae9d0/-cgycKs9EJfTkxRLVTOFc.png)

![image](https://cdn-uploads.huggingface.co/production/uploads/6a0cd36102135372849ae9d0/vzU8wdXwwTcOHP9QTmxfe.png)

![image](https://cdn-uploads.huggingface.co/production/uploads/6a0cd36102135372849ae9d0/iVArLtoPt9yv2LybIjJj-.png)

---

## 🔬 Example Prompt

Input:

> Symptoms of diabetes include

Generated continuation:

> Symptoms of diabetes include increased thirst, frequent urination, fatigue, blurred vision and other complications associated with impaired glucose regulation.

---

## ⚠️ Limitations

- Small-scale model compared to modern foundation models
- Limited medical reasoning capability
- May generate inaccurate information
- Not suitable for clinical deployment
- Intended primarily for research and educational purposes

---

## 🩺 Medical Disclaimer

This model is not a medical device and must not be used for:

- Medical diagnosis
- Treatment recommendations
- Clinical decision making
- Emergency healthcare situations

All outputs should be verified by qualified healthcare professionals.


---

## 👨‍💻 Author

Rishabh Shenoy

Developed as a research-oriented educational project exploring domain-specific language model development and biomedical NLP.

---

## 📜 Citation

If you use this work, please cite the repository and model page.