| --- |
| license: mit |
| language: |
| - en |
| tags: |
| - pytorch |
| - transformer |
| - gpt |
| - medical-ai |
| - clinical-nlp |
| - biomedical-nlp |
| - language-model |
| - healthcare |
| pipeline_tag: text-generation |
| --- |
| |
| # 🩺 MediGPT: A Domain-Specific Clinical Small Language Model |
|
|
| MediGPT is a decoder-only GPT-style Transformer trained entirely from scratch on medical and biomedical text corpora. |
|
|
| Unlike models that rely on large-scale general-purpose pretraining followed by fine-tuning, MediGPT is trained only on domain-specific text. It was developed as a research-oriented educational project to explore domain-specific language modeling, Transformer architectures, and clinical text generation. |
|
|
| --- |
|
|
| ## Key Features |
|
|
| - Built entirely from scratch using PyTorch |
| - Decoder-only GPT architecture |
| - Trained on medical and biomedical datasets |
| - GPT-2 BPE tokenization via tiktoken |
| - End-to-end implementation including: |
|   - Data preprocessing |
|   - Corpus analysis |
|   - Transformer implementation |
|   - Training pipeline |
|   - Evaluation metrics |
|   - Text generation |
|
|
| --- |
|
|
| ## 🏗️ Model Architecture |
|
|
| | Component | Value | |
| |------------|---------| |
| | Architecture | GPT Decoder | |
| | Layers | 8 | |
| | Attention Heads | 8 | |
| | Hidden Size | 512 | |
| | Context Length | 512 tokens | |
| | Dropout | 0.1 | |
| | Tokenizer | GPT-2 (tiktoken) | |
| | Framework | PyTorch | |
|
|
| Parameter count: ~90M |
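
As a rough cross-check of the table above, the sketch below estimates the parameter count from the listed hyperparameters. It assumes a standard GPT-2-style block (4x feed-forward expansion, biases on linear layers, learned positional embeddings) and the 50,257-token GPT-2 vocabulary. None of these details are confirmed by this model card, and `GPTConfig` and `estimate_params` are illustrative names rather than part of the MediGPT code.

```python
from dataclasses import dataclass


@dataclass
class GPTConfig:
    # Values taken from the architecture table.
    n_layer: int = 8
    n_head: int = 8
    d_model: int = 512
    block_size: int = 512
    dropout: float = 0.1
    # Assumptions, not stated in the model card.
    vocab_size: int = 50257  # GPT-2 BPE vocabulary size
    ffn_mult: int = 4        # feed-forward expansion factor


def estimate_params(cfg: GPTConfig, tie_weights: bool = True) -> int:
    """Count weights and biases of a GPT-2-style decoder."""
    d = cfg.d_model
    tok_emb = cfg.vocab_size * d
    pos_emb = cfg.block_size * d
    # Fused QKV projection plus attention output projection.
    attn = (d * 3 * d + 3 * d) + (d * d + d)
    # Two-layer feed-forward network.
    hidden = cfg.ffn_mult * d
    mlp = (d * hidden + hidden) + (hidden * d + d)
    # Two LayerNorms per block, each with a scale and a shift vector.
    norms = 2 * 2 * d
    block = attn + mlp + norms
    final_norm = 2 * d
    # An untied output head adds a second vocab_size x d_model matrix.
    lm_head = 0 if tie_weights else cfg.vocab_size * d
    return tok_emb + pos_emb + cfg.n_layer * block + final_norm + lm_head


cfg = GPTConfig()
print(f"tied head:   {estimate_params(cfg, True) / 1e6:.1f}M")
print(f"untied head: {estimate_params(cfg, False) / 1e6:.1f}M")
```

Under these assumptions the estimate is about 51M parameters with a tied output head and about 77M with an untied one, so the ~90M figure presumably reflects implementation details that the table does not list.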
|
|
| --- |
|
|
| ## Training Data |
|
|
| MediGPT was trained on a combination of two medical question-answering datasets: |
|
|
| ### MedQuAD |
| Medical question-answer pairs covering diseases, symptoms, diagnostics, treatments, and healthcare information. |
|
|
| ### PubMedQA |
| Biomedical question-answer data derived from scientific literature and PubMed abstracts. |
|
|
| The resulting corpus exposes the model to: |
|
|
| - Clinical terminology |
| - Biomedical vocabulary |
| - Medical QA patterns |
| - Research-style writing |
| - Healthcare discourse |
|
|
| --- |
|
|
| ## 🎯 Training Objective |
|
|
| The model was trained using autoregressive next-token prediction. |
|
|
| Given a sequence such as: |
|
|
| Symptoms of diabetes include ... |
|
|
| the model learns to assign high probability to the token that actually comes next at each position in the sequence. |
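
To make the objective concrete, the sketch below builds shifted input/target pairs from a token sequence and computes the mean cross-entropy for a set of logits. It uses only the Python standard library and toy numbers. The token IDs, logits, and function names are illustrative assumptions, not MediGPT code. A PyTorch training loop would typically compute the same quantity with `torch.nn.functional.cross_entropy`.

```python
import math


def shift_for_next_token(token_ids):
    """Split a sequence into inputs and targets offset by one position."""
    return token_ids[:-1], token_ids[1:]


def cross_entropy(logits_per_position, targets):
    """Mean negative log-likelihood of the target tokens, in nats."""
    total = 0.0
    for logits, target in zip(logits_per_position, targets):
        # Numerically stable log-sum-exp over the vocabulary.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[target]
    return total / len(targets)


# Toy example with a 4-token vocabulary (the IDs are made up).
tokens = [2, 0, 3, 1]
inputs, targets = shift_for_next_token(tokens)

# One row of logits per input position; uniform logits give a loss of ln(4).
uniform_logits = [[0.0] * 4 for _ in inputs]
loss = cross_entropy(uniform_logits, targets)
print(inputs, targets, round(loss, 4))
```

A model that has learned nothing sits at a loss of ln(vocabulary size). Training lowers the loss by moving probability mass toward the tokens that actually follow.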
|
|
| --- |
|
|
| ## Evaluation |
|
|
| Evaluation included: |
|
|
| - Validation Loss |
| - Perplexity |
| - Corpus Statistics |
| - Attention Analysis |
| - Qualitative Text Generation |
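
Validation loss and perplexity are directly related: perplexity is the exponential of the mean per-token cross-entropy, measured in nats. A minimal sketch of that conversion follows. The loss values are placeholders chosen for illustration, not MediGPT results.

```python
import math


def perplexity(token_losses):
    """Perplexity = exp(mean per-token cross-entropy, measured in nats)."""
    mean_loss = sum(token_losses) / len(token_losses)
    return math.exp(mean_loss)


# Placeholder per-token losses, not measured values.
example_losses = [2.0, 3.0, 4.0]
print(f"perplexity: {perplexity(example_losses):.2f}")
```

When validation batches contain different numbers of tokens, averaging the loss over tokens rather than over batches keeps the perplexity consistent.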
|
|
| Results indicate successful learning of: |
|
|
| - Clinical vocabulary |
| - Biomedical terminology |
| - Medical writing style |
| - Research-oriented sentence structures |
|
|
| --- |
|
|
| ## 💬 Example Prompt |
|
|
| Input: |
|
|
| Symptoms of diabetes include |
|
|
| Generated continuation: |
|
|
| Symptoms of diabetes include increased thirst, frequent urination, fatigue, blurred vision and other complications associated with impaired glucose regulation. |
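
A continuation like the one above is produced one token at a time. The sketch below shows a greedy version of that loop, with the context cropped to the 512-token window from the architecture table. The decoding strategy used for the example is not stated in this model card, so greedy decoding is an assumption, and `next_token_logits` is a hypothetical stand-in for a forward pass of the model.

```python
from typing import Callable, List, Sequence

CONTEXT_LENGTH = 512  # from the architecture table


def generate_greedy(
    next_token_logits: Callable[[Sequence[int]], Sequence[float]],
    prompt_ids: List[int],
    max_new_tokens: int,
    context_length: int = CONTEXT_LENGTH,
) -> List[int]:
    """Append the highest-scoring token, one step at a time."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        context = ids[-context_length:]      # keep only the most recent tokens
        logits = next_token_logits(context)  # hypothetical model call
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
    return ids


# Toy stand-in model over a 5-token vocabulary: it always favors (last_id + 1) % 5.
def toy_model(context: Sequence[int]) -> List[float]:
    logits = [0.0] * 5
    logits[(context[-1] + 1) % 5] = 1.0
    return logits


print(generate_greedy(toy_model, prompt_ids=[3], max_new_tokens=4))
```

Sampling with a temperature or a top-k cutoff would replace only the `max(...)` line. The cropping step is what keeps long generations inside the model's context window.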
|
|
| --- |
|
|
| ## ⚠️ Limitations |
|
|
| - Small-scale model compared to modern foundation models |
| - Limited medical reasoning capability |
| - May generate inaccurate information |
| - Not suitable for clinical deployment |
| - Intended primarily for research and educational purposes |
|
|
| --- |
|
|
| ## 🩺 Medical Disclaimer |
|
|
| This model is not a medical device and must not be used for: |
|
|
| - Medical diagnosis |
| - Treatment recommendations |
| - Clinical decision making |
| - Emergency healthcare situations |
|
|
| All outputs should be verified by qualified healthcare professionals. |
|
|
| --- |
|
|
| ## 👨‍💻 Author |
|
|
| Rishabh Shenoy |
|
|
| Developed as a research-oriented educational project exploring domain-specific language model development and biomedical NLP. |
|
|
| --- |
|
|
| ## Citation |
|
|
| If you use this work, please cite the repository and model page. |
|
|