---
license: apache-2.0
datasets:
- japhba/pubmed_simple
language:
- en
tags:
- research
- llm-pretraining
- transformer
- gqa
- rope
- swiglu
- rmsnorm
- medical-text
---
# MedAssistGPT: Pretraining Checkpoints (303M & 401M)

Experimental medical-domain LLM pretraining project.

⚠️ Research only. Not for clinical, diagnostic, or production use.
## Overview
This repository contains multiple pretraining checkpoints of the MedAssistGPT architecture, released in two parameter scales:
- MedAssistGPT-303M
- MedAssistGPT-401M
Both variants:
- share the same architecture design
- use the same tokenizer
- are trained on the same dataset
- differ only in attention configuration (number of query heads) and training progress
The purpose of this repository is to document architecture choices, data pipelines, and large-scale training behavior, rather than to present a fully converged or production-ready medical language model.
## Architecture (Shared Design)
All models are decoder-only Transformers implemented from scratch in PyTorch.
### Core components
- RoPE (Rotary Positional Embeddings)
- Grouped Query Attention (GQA)
- SwiGLU feed-forward layers
- RMSNorm (pre-norm)
- Weight tying (token embeddings shared with the LM head)
- Dropout: 0.0 (pretraining configuration)
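
For orientation, here is a minimal sketch of two of these components, RMSNorm and SwiGLU, in plain PyTorch. Module and parameter names are illustrative and may not match the repository's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Pre-norm RMSNorm: rescale activations by their root mean square."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class SwiGLU(nn.Module):
    """SwiGLU feed-forward: SiLU-gated up-projection, then projection back to d_model."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```
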
### Tokenization

- Tokenizer: tiktoken `p50k_base`
- Vocabulary size: ≈ 50,281
- Context length: 1,024 tokens
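
The same encoding can be loaded directly with the `tiktoken` package to confirm the vocabulary size (a small sketch; assumes `tiktoken` is installed):

```python
import tiktoken

# p50k_base is the encoding named above; n_vocab reports 50,281 entries.
enc = tiktoken.get_encoding("p50k_base")
print(enc.n_vocab)  # 50281

ids = enc.encode("Aspirin reduces the risk of myocardial infarction.")
print(len(ids), ids[:5])
```
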
## Model Variants
| Variant | Parameters | d_model | Heads | GQA (KV heads) | Blocks |
|---|---|---|---|---|---|
| 303M | ~303M | 1024 | 16 | 4 | 24 |
| 401M | ~401M | 1024 | 32 | 4 | 24 |
Both variants use the same architectural template; the 401M model widens attention (32 query heads vs. 16) while keeping the same 4-KV-head GQA layout.
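
In the GQA layout, each of the 4 KV heads is shared by a group of query heads (4 per group for the 303M variant, 8 for the 401M). The sketch below expands the shared K/V heads to match the query heads before attention; the tiny sizes and the head dimension of 64 are illustrative assumptions, not values taken from the repository's code:

```python
import torch
import torch.nn.functional as F

batch, seq, head_dim = 2, 16, 64       # tiny sizes; head_dim of 64 is an assumption
n_q_heads, n_kv_heads = 16, 4          # 303M layout; the 401M variant uses 32 query heads
group = n_q_heads // n_kv_heads        # query heads served by each KV head

q = torch.randn(batch, n_q_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)
v = torch.randn(batch, n_kv_heads, seq, head_dim)

# Expand each shared KV head across its group of query heads so shapes match.
k = k.repeat_interleave(group, dim=1)  # -> (batch, n_q_heads, seq, head_dim)
v = v.repeat_interleave(group, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 16, 16, 64])
```
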
## Data
| Item | Value |
|---|---|
| Dataset | japhba/pubmed_simple |
| Text field | abstract |
| Domain | Biomedical / medical research |
| Cleaning | Minimal (raw abstracts) |
| Sequence length | 1,024 |
| Sliding window stride | 512 |
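
A rough sketch of the chunking implied by this table: abstracts are tokenized and split into 1,024-token windows with a 512-token stride. The dataset name and text field come from the table; the split name and helper function are illustrative assumptions:

```python
import tiktoken
from datasets import load_dataset

enc = tiktoken.get_encoding("p50k_base")
SEQ_LEN, STRIDE = 1024, 512

# Split name is an assumption; dataset and text field are as listed above.
ds = load_dataset("japhba/pubmed_simple", split="train")

def sliding_windows(token_ids, seq_len=SEQ_LEN, stride=STRIDE):
    """Yield overlapping windows of at most seq_len tokens, advancing by stride."""
    for start in range(0, max(len(token_ids) - seq_len, 0) + 1, stride):
        yield token_ids[start:start + seq_len]

chunks = list(sliding_windows(enc.encode(ds[0]["abstract"])))
print(len(chunks), len(chunks[0]))
```
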
## Training Setup (Common)
| Item | Value |
|---|---|
| Objective | Causal language modeling (next-token prediction) |
| Optimizer | AdamW |
| Betas | (0.9, 0.95) |
| Precision | bf16 |
| Gradient accumulation | Enabled |
| Gradient clipping | 1.0 |
| Effective batch size | 128 |
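
Taken together, these settings correspond to a fairly standard bf16 training loop with gradient accumulation and clipping. The sketch below uses a tiny stand-in model and random token batches just to make the step structure concrete; the learning rate and micro-batch split are illustrative, not the project's actual values:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny stand-in LM and random batches so the sketch runs end to end;
# the real run uses MedAssistGPT and the PubMed windows described above.
VOCAB, D_MODEL, SEQ_LEN = 50_281, 64, 32
model = nn.Sequential(nn.Embedding(VOCAB, D_MODEL), nn.Linear(D_MODEL, VOCAB))

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95))  # lr is illustrative
ACCUM_STEPS = 8  # e.g. 8 micro-batches of 16 sequences -> effective batch size 128

optimizer.zero_grad(set_to_none=True)
for micro_step in range(ACCUM_STEPS):
    tokens = torch.randint(0, VOCAB, (16, SEQ_LEN))      # placeholder batch
    inputs, targets = tokens[:, :-1], tokens[:, 1:]      # causal LM: predict the next token
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):  # bf16 (use "cuda" on a GPU)
        logits = model(inputs)
        loss = F.cross_entropy(logits.reshape(-1, VOCAB), targets.reshape(-1)) / ACCUM_STEPS
    loss.backward()                                      # gradients accumulate across micro-batches

torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping at 1.0
optimizer.step()                                         # one optimizer step per effective batch
optimizer.zero_grad(set_to_none=True)
```
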
## Checkpoints
The `checkpoints/` directory contains multiple snapshots of the model variants at different training stages.

Examples:

- `checkpoint_step_25000.pt` (303M): ~2.5B tokens seen
- Additional checkpoints may exist for the 401M variant
⚠️ Important:
All released checkpoints are early-stage pretraining snapshots.
At ~2.5B tokens (about 8 tokens per parameter for the 303M model), the models are undertrained and should not be treated as finished base models.
They are provided to:
- study training dynamics,
- resume or extend pretraining,
- experiment with fine-tuning,
- inspect architectural behavior at scale.
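
For the resume/extend-pretraining use case, a snapshot can be inspected with plain `torch.load`. The key names below are assumptions about what the `.pt` files contain; adjust them to the actual checkpoint layout:

```python
import torch

# File name comes from the example above; the key layout is an assumption.
ckpt = torch.load("checkpoints/checkpoint_step_25000.pt", map_location="cpu")

# Snapshots may be a bare state_dict or a dict bundling optimizer/step metadata.
state_dict = ckpt.get("model_state_dict", ckpt) if isinstance(ckpt, dict) else ckpt

print(len(state_dict), "entries; first key:", next(iter(state_dict)))
# model.load_state_dict(state_dict)  # with a MedAssistGPT module built from the matching config
```
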
## Training Status
- Training and validation loss were still improving at the time of the last checkpoints.
- Training runs were interrupted due to infrastructure preemption and were not resumed.
- No claims are made about benchmark or downstream task performance.
## Loading the Model
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "kunjcr2/MedAssistGPT-303M"  # or the 401M repo

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

prompt = "A patient was admitted with severe headache. Initial assessment revealed"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,   # enable sampling so temperature takes effect
    temperature=0.7,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Intended Use
This repository is intended for:
- architecture exploration,
- large-scale pretraining experiments,
- medical-domain language modeling research,
- educational purposes.
**Not intended for clinical or production medical use.**
## Possible Next Steps (Not Included)
- Continued pretraining with larger token budgets
- Supervised fine-tuning (SFT) on medical QA datasets
- Evaluation on biomedical NLP benchmarks
## License
Apache 2.0