	---
	license: mit
	language:
	- en
	tags:
	- pytorch
	- transformer
	- gpt
	- medical-ai
	- clinical-nlp
	- biomedical-nlp
	- language-model
	- healthcare
	pipeline_tag: text-generation
	---

	# 🩺 MediGPT: A Domain-Specific Clinical Small Language Model

	MediGPT is a decoder-only GPT-style Transformer trained entirely from scratch on medical and biomedical text corpora.

	It does not start from a large pretrained checkpoint that is then fine-tuned. Instead, it was developed as a research-oriented educational project to explore domain-specific language modeling, transformer architectures, and clinical text generation.

	---

	## 🚀 Key Features

	- Built entirely from scratch using PyTorch
	- Decoder-only GPT architecture
	- Trained on medical and biomedical datasets
	- GPT-2 BPE tokenization via tiktoken
	- End-to-end implementation including:
	  - Data preprocessing
	  - Corpus analysis
	  - Transformer implementation
	  - Training pipeline
	  - Evaluation metrics
	  - Text generation

	![image](https://cdn-uploads.huggingface.co/production/uploads/6a0cd36102135372849ae9d0/keZO89KH-n1_3dmUGn7sW.png)

	![image](https://cdn-uploads.huggingface.co/production/uploads/6a0cd36102135372849ae9d0/5Nidpu-PX09pmQhBZ1MvC.png)

	![image](https://cdn-uploads.huggingface.co/production/uploads/6a0cd36102135372849ae9d0/zdBON7JuqWSykfisJyr7u.png)

	---

	## 🏗️ Model Architecture

	| Component | Value |
	|-----------|-------|
	| Architecture | GPT Decoder |
	| Layers | 8 |
	| Attention Heads | 8 |
	| Hidden Size | 512 |
	| Context Length | 512 |
	| Dropout | 0.1 |
	| Tokenizer | GPT-2 (tiktoken) |
	| Framework | PyTorch |

	Approximate Parameter Count: ~90M
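
As a rough cross-check of the table, the sketch below counts parameters for a standard GPT-2-style layout. Several inputs are assumptions that this card does not state, and they are labeled in the code: an unpadded 50,257-token GPT-2 vocabulary, learned positional embeddings, a 4x feed-forward expansion, and bias terms on every linear layer. `estimate_gpt_params` is a hypothetical helper, not part of the MediGPT codebase.

```python
def estimate_gpt_params(
    n_layers: int = 8,            # from the architecture table
    d_model: int = 512,           # hidden size, from the table
    context_length: int = 512,    # from the table
    vocab_size: int = 50257,      # GPT-2 BPE vocabulary size (assumed unpadded)
    mlp_ratio: int = 4,           # assumption: standard 4x feed-forward expansion
    tied_head: bool = True,       # assumption: output head shares the embedding matrix
) -> int:
    """Estimate the parameter count of a GPT-2-style decoder-only Transformer."""
    token_emb = vocab_size * d_model
    pos_emb = context_length * d_model  # assumption: learned positional embeddings

    # Attention: fused QKV projection plus output projection, both with biases.
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)

    # Feed-forward: two linear layers with biases.
    d_ff = mlp_ratio * d_model
    mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)

    # Two LayerNorms per block, each with a scale and a shift vector.
    layer_norms = 2 * (2 * d_model)

    block = attn + mlp + layer_norms
    final_ln = 2 * d_model
    head = 0 if tied_head else vocab_size * d_model

    return token_emb + pos_emb + n_layers * block + final_ln + head


if __name__ == "__main__":
    print(f"tied head:   {estimate_gpt_params(tied_head=True):,}")
    print(f"untied head: {estimate_gpt_params(tied_head=False):,}")
```

Under these assumptions the estimate is about 51M parameters with a tied output head and about 77M with an untied one, so the exact total depends on implementation details such as weight tying and feed-forward width. The number of attention heads does not change the count, because the heads split the same projection matrices.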

	---

	## 📚 Training Data

	MediGPT was trained on a combination of medical datasets:

	### MedQuAD
	Medical question-answer pairs covering diseases, symptoms, diagnostics, treatments, and healthcare information.

	### PubMedQA
	Biomedical question-answer data derived from scientific literature and PubMed abstracts.

	The resulting corpus exposes the model to:

	- Clinical terminology
	- Biomedical vocabulary
	- Medical QA patterns
	- Research-style writing
	- Healthcare discourse

	---

	## 🎯 Training Objective

	The model was trained using autoregressive next-token prediction.

	Given a sequence such as:

	Symptoms of diabetes include ...

	the model learns to predict the most likely next token.
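
The objective above can be sketched without any framework: the targets are the input tokens shifted left by one position, and the loss is the mean cross-entropy between the predicted distribution at each position and the token that actually follows. The helper names and the toy token IDs below are illustrative assumptions, not taken from the MediGPT training code. In PyTorch, the same quantity is what `torch.nn.functional.cross_entropy` returns when given flattened logits and targets.

```python
import math
from typing import List, Sequence, Tuple


def make_training_pair(
    token_ids: Sequence[int], context_length: int
) -> Tuple[List[int], List[int]]:
    """Split a token window into inputs and next-token targets (shifted by one)."""
    window = list(token_ids[: context_length + 1])
    if len(window) < 2:
        raise ValueError("need at least two tokens to form an input/target pair")
    return window[:-1], window[1:]


def next_token_loss(
    logits_per_position: Sequence[Sequence[float]], targets: Sequence[int]
) -> float:
    """Mean cross-entropy (in nats) of the target token at each position."""
    total = 0.0
    for logits, target in zip(logits_per_position, targets):
        # Numerically stable log-sum-exp for the softmax normalizer.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[target]
    return total / len(targets)


if __name__ == "__main__":
    # Toy token IDs standing in for a tokenized medical sentence.
    inputs, targets = make_training_pair([10, 11, 12, 13], context_length=3)
    print(inputs, targets)

    # An untrained model with uniform logits over a 4-token vocabulary
    # scores log(4) nats per token.
    uniform = [[0.0] * 4 for _ in range(3)]
    print(next_token_loss(uniform, [1, 2, 3]))
```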

	---

	## 📈 Evaluation

	Evaluation included:

	- Validation Loss
	- Perplexity
	- Corpus Statistics
	- Attention Analysis
	- Qualitative Text Generation

	These results suggest the model has learned:

	- Clinical vocabulary
	- Biomedical terminology
	- Medical writing style
	- Research-oriented sentence structures
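
Perplexity, listed among the metrics above, is conventionally the exponential of the mean per-token negative log-likelihood measured in nats. A minimal sketch follows; the `perplexity` helper and the sample values are illustrative and are not MediGPT results.

```python
import math
from typing import Sequence


def perplexity(token_nlls: Sequence[float]) -> float:
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    if not token_nlls:
        raise ValueError("token_nlls must not be empty")
    return math.exp(sum(token_nlls) / len(token_nlls))


if __name__ == "__main__":
    # A model that is uniformly unsure among 4 tokens at every step
    # has a per-token loss of log(4) nats.
    print(perplexity([math.log(4)] * 8))
```

Lower perplexity means the model assigns higher probability to the held-out text. Because the value depends on the tokenizer, it is only comparable between models that share the same vocabulary.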

	![image](https://cdn-uploads.huggingface.co/production/uploads/6a0cd36102135372849ae9d0/EG7lUJC8oZx9j2NSptqog.png)

	![image](https://cdn-uploads.huggingface.co/production/uploads/6a0cd36102135372849ae9d0/-cgycKs9EJfTkxRLVTOFc.png)


	![image](https://cdn-uploads.huggingface.co/production/uploads/6a0cd36102135372849ae9d0/vzU8wdXwwTcOHP9QTmxfe.png)

	![image](https://cdn-uploads.huggingface.co/production/uploads/6a0cd36102135372849ae9d0/iVArLtoPt9yv2LybIjJj-.png)

	---

	## 🔬 Example Prompt

	Input:

	Symptoms of diabetes include

	Generated continuation:

	Symptoms of diabetes include increased thirst, frequent urination, fatigue, blurred vision and other complications associated with impaired glucose regulation.
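
This card does not say which decoding strategy produced the continuation above. The sketch below shows one common approach, temperature sampling with optional top-k filtering, applied token by token. `next_logits` is a hypothetical stand-in for the model's forward pass (token IDs in, one logit per vocabulary entry out), and all function names and defaults are assumptions made for illustration.

```python
import math
import random
from typing import Callable, List, Optional, Sequence


def sample_next(
    logits: Sequence[float],
    temperature: float,
    top_k: Optional[int],
    rng: random.Random,
) -> int:
    """Pick the next token ID from logits. temperature == 0 means greedy."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # Rank candidates by logit and keep the top_k best (top_k must be >= 1).
    candidates = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k is not None:
        candidates = candidates[:top_k]
    # Softmax over the surviving candidates, scaled by temperature.
    scaled = [logits[i] / temperature for i in candidates]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(candidates, weights=weights, k=1)[0]


def generate(
    next_logits: Callable[[List[int]], Sequence[float]],
    prompt_ids: Sequence[int],
    max_new_tokens: int,
    context_length: int = 512,   # matches the context length in the table
    temperature: float = 1.0,
    top_k: Optional[int] = None,
    seed: Optional[int] = None,
) -> List[int]:
    """Autoregressively extend prompt_ids by max_new_tokens tokens."""
    rng = random.Random(seed)
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        # The model only sees the most recent context_length tokens.
        logits = next_logits(ids[-context_length:])
        ids.append(sample_next(logits, temperature, top_k, rng))
    return ids
```

With a real model, `prompt_ids` would come from encoding the prompt with the GPT-2 tokenizer, and the returned IDs would be decoded back to text.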

	---

	## ⚠️ Limitations

	- Small-scale model compared to modern foundation models
	- Limited medical reasoning capability
	- May generate inaccurate information
	- Not suitable for clinical deployment
	- Intended primarily for research and educational purposes

	---

	## 🩺 Medical Disclaimer

	This model is not a medical device and must not be used for:

	- Medical diagnosis
	- Treatment recommendations
	- Clinical decision making
	- Emergency healthcare situations

	All outputs should be verified by qualified healthcare professionals.

	---

	## 👨‍💻 Author

	Rishabh Shenoy

	Developed as a research-oriented educational project exploring domain-specific language model development and biomedical NLP.

	---

	## 📜 Citation

	If you use this work, please cite the repository and model page.