|
|
--- |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- japhba/pubmed_simple |
|
|
language: |
|
|
- en |
|
|
tags: |
|
|
- research |
|
|
- llm-pretraining |
|
|
- transformer |
|
|
- gqa |
|
|
- rope |
|
|
- swiglu |
|
|
- rmsnorm |
|
|
- medical-text |
|
|
--- |
|
|
|
|
|
# MedAssistGPT – Pretraining Checkpoints (303M & 401M)
|
|
|
|
|
**Experimental medical-domain LLM pretraining project.** |
|
|
**Research-only. Not for clinical, diagnostic, or production use.**
|
|
|
|
|
--- |
|
|
|
|
|
## Overview
|
|
|
|
|
This repository contains **multiple pretraining checkpoints** of the **MedAssistGPT architecture**, released in **two parameter scales**: |
|
|
|
|
|
- **MedAssistGPT-303M** |
|
|
- **MedAssistGPT-401M** |
|
|
|
|
|
Both variants: |
|
|
- share the **same architecture design** |
|
|
- use the **same tokenizer** |
|
|
- are trained on the **same dataset** |
|
|
- differ only in **model width / attention configuration** and **training progress** |
|
|
|
|
|
The purpose of this repository is to document **architecture choices, data pipelines, and large-scale training behavior**, rather than to present a fully converged or production-ready medical language model. |
|
|
|
|
|
--- |
|
|
|
|
|
## Architecture (Shared Design)
|
|
|
|
|
All models are **decoder-only Transformers** implemented from scratch in PyTorch. |
|
|
|
|
|
### Core components |
|
|
- **RoPE (Rotary Positional Embeddings)** |
|
|
- **Grouped Query Attention (GQA)** |
|
|
- **SwiGLU feed-forward layers** |
|
|
- **RMSNorm (pre-norm)** |
|
|
- **Weight tying** (token embeddings shared with the LM head)
|
|
- **Dropout:** 0.0 (pretraining configuration) |
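To make these components concrete, here is a minimal PyTorch sketch of the RMSNorm and SwiGLU pieces as they are commonly defined; the module names and the hidden width `d_ff` are illustrative assumptions, not taken from this repository's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Pre-norm RMSNorm: rescale by the root-mean-square of the activations."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """SwiGLU feed-forward: SiLU-gated up-projection followed by a down-projection."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```

Weight tying then amounts to assigning the token-embedding matrix to the LM head (`lm_head.weight = tok_emb.weight`), so no separate output projection is learned.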
|
|
|
|
|
### Tokenization |
|
|
- **Tokenizer:** `tiktoken` `p50k_base` |
|
|
- **Vocabulary size:** ≈ 50,281
|
|
- **Context length:** 1,024 tokens |
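Since the tokenizer is a stock `tiktoken` encoding, it can be reproduced directly; the snippet below is a small sketch (any special-token handling used during training is not documented here).

```python
import tiktoken

# p50k_base is the stock encoding named above; no custom merges are assumed.
enc = tiktoken.get_encoding("p50k_base")
print(enc.n_vocab)  # 50281

ids = enc.encode("A patient was admitted with severe headache.")
print(len(ids), enc.decode(ids))
```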
|
|
|
|
|
--- |
|
|
|
|
|
## Model Variants
|
|
|
|
|
| Variant | Parameters | d_model | Heads | GQA (KV heads) | Blocks | |
|
|
|------|-----------|--------|-------|---------------|--------| |
|
|
| **303M** | ~303M | 1024 | 16 | 4 | 24 | |
|
|
| **401M** | ~401M | 1024 | 32 | 4 | 24 | |
|
|
|
|
|
> Both variants use the **same architectural template**; the 401M model increases attention width while preserving GQA. |
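Concretely, the GQA column means each key/value head is shared by a group of query heads (16 → 4 gives groups of 4 for the 303M; 32 → 4 gives groups of 8 for the 401M). A minimal sketch of that grouping, with shapes and names as illustrative assumptions rather than this codebase's API:

```python
import torch
import torch.nn.functional as F

def gqa_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Grouped Query Attention: each KV head serves a group of query heads.

    q: (batch, n_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim)
    """
    group = q.shape[1] // k.shape[1]       # e.g. 16 // 4 = 4 query heads per KV head
    k = k.repeat_interleave(group, dim=1)  # replicate KV heads across their groups
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

q = torch.randn(1, 16, 8, 64)  # 16 query heads, as in the 303M config
k = torch.randn(1, 4, 8, 64)   # 4 KV heads
v = torch.randn(1, 4, 8, 64)
print(gqa_attention(q, k, v).shape)  # torch.Size([1, 16, 8, 64])
```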
|
|
|
|
|
--- |
|
|
|
|
|
## Data
|
|
|
|
|
| Item | Value | |
|
|
|----|----| |
|
|
| Dataset | `japhba/pubmed_simple` | |
|
|
| Text field | `abstract` | |
|
|
| Domain | Biomedical / medical research | |
|
|
| Cleaning | Minimal (raw abstracts) | |
|
|
| Sequence length | 1,024 | |
|
|
| Sliding window stride | 512 | |
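With a 1,024-token window and a 512-token stride, consecutive training sequences overlap by half a window. A minimal sketch of that chunking; the function name and the handling of short documents are assumptions:

```python
def sliding_windows(token_ids: list[int], seq_len: int = 1024, stride: int = 512):
    """Yield overlapping fixed-length training sequences from one tokenized abstract."""
    if len(token_ids) < seq_len:
        return  # short abstracts would need padding or packing, not shown here
    for start in range(0, len(token_ids) - seq_len + 1, stride):
        yield token_ids[start:start + seq_len]
```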
|
|
|
|
|
--- |
|
|
|
|
|
## Training Setup (Common)
|
|
|
|
|
| Item | Value | |
|
|
|----|----| |
|
|
| Objective | Causal language modeling (next-token prediction) | |
|
|
| Optimizer | AdamW | |
|
|
| Betas | (0.9, 0.95) | |
|
|
| Precision | bf16 | |
|
|
| Gradient accumulation | Enabled | |
|
|
| Gradient clipping | 1.0 | |
|
|
| Effective batch size | 128 | |
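A hedged sketch of how these settings typically fit together in one update; the micro-batch size, learning rate, and loss computation are assumptions, and only the effective batch of 128 comes from the table above:

```python
import torch
import torch.nn.functional as F

def train(model, loader, lr=3e-4, micro_batch=8, effective_batch=128):
    """Causal-LM loop sketch: bf16 autocast, gradient accumulation, clipping at 1.0."""
    accum_steps = effective_batch // micro_batch  # 128 / 8 = 16 micro-steps per update
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, betas=(0.9, 0.95))
    for step, (input_ids, targets) in enumerate(loader):
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            logits = model(input_ids)  # (micro_batch, seq, vocab)
            loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        (loss / accum_steps).backward()  # accumulate scaled gradients
        if (step + 1) % accum_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
```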
|
|
|
|
|
--- |
|
|
|
|
|
## Checkpoints
|
|
|
|
|
The `checkpoints/` directory contains **multiple snapshots of the same model variants at different training stages**. |
|
|
|
|
|
Examples: |
|
|
- `checkpoint_step_25000.pt` (303M): ~2.5B tokens seen
|
|
- Additional checkpoints may exist for the 401M variant |
|
|
|
|
|
> **Important:**
|
|
> All released checkpoints are **early-stage pretraining snapshots**. |
|
|
> At ~2.5B tokens (~8× tokens per parameter for the 303M, well below the ~20 tokens per parameter often cited as compute-optimal), the models are **undertrained** and should **not** be treated as finished base models.
|
|
|
|
|
They are provided to: |
|
|
- study training dynamics, |
|
|
- resume or extend pretraining, |
|
|
- experiment with fine-tuning, |
|
|
- inspect architectural behavior at scale. |
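To resume from or inspect a raw snapshot directly, something like the following should work; the checkpoint's key names are assumptions, so inspect the file before relying on them:

```python
import torch

ckpt = torch.load("checkpoints/checkpoint_step_25000.pt", map_location="cpu")
print(ckpt.keys())  # check the actual layout first

# Hypothetical keys; `model` must be built to match the 303M configuration.
model.load_state_dict(ckpt["model"])
start_step = ckpt.get("step", 0)
```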
|
|
|
|
|
--- |
|
|
|
|
|
## Training Status
|
|
|
|
|
- Training and validation loss were **still improving** at the time of the last checkpoints. |
|
|
- Training runs were **interrupted due to infrastructure preemption** and were not resumed. |
|
|
- No claims are made about benchmark or downstream task performance. |
|
|
|
|
|
--- |
|
|
|
|
|
## Loading the Model
|
|
|
|
|
```python |
|
|
from transformers import AutoTokenizer, AutoModelForCausalLM |
|
|
|
|
|
repo_id = "kunjcr2/MedAssistGPT-303M" # or 401M repo |
|
|
|
|
|
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True) |
|
|
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True) |
|
|
|
|
|
prompt = "A patient was admitted with severe headache. Initial assessment revealed" |
|
|
inputs = tokenizer(prompt, return_tensors="pt") |
|
|
|
|
|
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,  # sampling must be enabled for temperature to take effect
    temperature=0.7,
)
|
|
print(tokenizer.decode(outputs[0], skip_special_tokens=True)) |
|
|
```
|
|
|
|
|
--- |
|
|
|
|
|
## Intended Use
|
|
|
|
|
This repository is intended for: |
|
|
|
|
|
* architecture exploration, |
|
|
* large-scale pretraining experiments, |
|
|
* medical-domain language modeling research, |
|
|
* educational purposes. |
|
|
|
|
|
**Not intended for clinical or production medical use.**
|
|
|
|
|
--- |
|
|
|
|
|
## Possible Next Steps (Not Included)
|
|
|
|
|
* Continued pretraining with larger token budgets |
|
|
* Supervised fine-tuning (SFT) on medical QA datasets |
|
|
* Evaluation on biomedical NLP benchmarks |
|
|
|
|
|
--- |
|
|
|
|
|
## License
|
|
|
|
|
Apache 2.0 |
|
|
|
|
|
|