Update README.md
README.md (CHANGED)
Removed in this update (excerpt):

```diff
@@ -5,156 +5,170 @@ datasets:
 tags:
-- medical

-| **Train/Val split** | 95 / 5 |
-| **Samples used** | 100k abstracts |
-| **Seq length / stride** | 1,024 / 1,024 |
-| **Cleaning** | `use_clean=False` (raw abstracts) |

-| **Learning rate** | 3 × 10⁻⁴ (linear + 100-step warmup) |
-| **Weight decay** | 0.1 |
-| **Batch size** | 32 (× 4 grad acc → 128 effective) |
-| **Grad clip** | 1.0 |
-| **Total steps** | 100k |
-| **Eval** | every 500 steps × 100 iters |
-| **Checkpoint save** | every 1k steps |
-| **Seed** | 7,979,797 |
-| **Gradient checkpointing** | ✅ Enabled |
-| **WandB** | `kunjcr2-dreamable/MedAssist-GPT-Pretraining` (`medassist-401M-test`) |
-| **HF repo** | `kunjcr2/MedAssist-GPT-401M` |

-ids = torch.tensor([enc.encode(
-    "A patient was admitted with severe headache. Initial assessment revealed"
-)], dtype=torch.long)
-for _ in range(100):
-    logits = model(ids)[:, -1, :]
-    next_id = torch.multinomial(torch.softmax(logits / 0.6, dim=-1), 1)
-    ids = torch.cat([ids, next_id], dim=1)
-print(enc.decode(ids[0].tolist()))

-* fine-tuning for medical text understanding.
-* **Reinforcement Learning (PPO) for alignment**
```

The updated README follows.
language:
- en
tags:
- research
- llm-pretraining
- transformer
- gqa
- rope
- swiglu
- rmsnorm
- medical-text
---

# 🧠 MedAssistGPT – Pretraining Checkpoints (303M & 401M)

**Experimental medical-domain LLM pretraining project.**
⚠️ **Research-only. Not for clinical, diagnostic, or production use.**

---

## Overview

This repository contains **multiple pretraining checkpoints** of the **MedAssistGPT architecture**, released in **two parameter scales**:

- **MedAssistGPT-303M**
- **MedAssistGPT-401M**

Both variants:
- share the **same architecture design**
- use the **same tokenizer**
- are trained on the **same dataset**
- differ only in **model width / attention configuration** and **training progress**

The purpose of this repository is to document **architecture choices, data pipelines, and large-scale training behavior**, rather than to present a fully converged or production-ready medical language model.

---

## 🧩 Architecture (Shared Design)

All models are **decoder-only Transformers** implemented from scratch in PyTorch.

### Core components
- **RoPE (Rotary Positional Embeddings)**
- **Grouped Query Attention (GQA)**
- **SwiGLU feed-forward layers**
- **RMSNorm (pre-norm)**
- **Weight tying** (token embeddings ↔ LM head)
- **Dropout:** 0.0 (pretraining configuration)
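
For orientation, here is a minimal PyTorch sketch of the pre-norm RMSNorm + SwiGLU pattern listed above. Class names and the feed-forward hidden width are illustrative placeholders, not the repository's actual modules:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """RMS layer norm: scale by 1/rms(x); no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SwiGLU(nn.Module):
    """SwiGLU feed-forward: down(silu(gate(x)) * up(x)), all bias-free."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

x = torch.randn(2, 8, 1024)
y = SwiGLU(1024, 4 * 1024)(RMSNorm(1024)(x))  # pre-norm, then feed-forward
print(y.shape)  # torch.Size([2, 8, 1024])
```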

### Tokenization
- **Tokenizer:** `tiktoken` `p50k_base`
- **Vocabulary size:** ≈ 50,281
- **Context length:** 1,024 tokens
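
Since this is the stock `tiktoken` encoding, the vocabulary size above can be checked directly:

```python
import tiktoken

enc = tiktoken.get_encoding("p50k_base")
print(enc.n_vocab)  # 50281

ids = enc.encode("A patient was admitted with severe headache.")
print(len(ids), enc.decode(ids))
```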

---

## Model Variants

| Variant | Parameters | d_model | Heads | GQA (KV heads) | Blocks |
|---------|------------|---------|-------|----------------|--------|
| **303M** | ~303M | 1024 | 16 | 4 | 24 |
| **401M** | ~401M | 1024 | 32 | 4 | 24 |

> Both variants use the **same architectural template**; the 401M model increases attention width while preserving GQA.
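
As a shape-level illustration of the GQA column (using the 303M numbers; this is not the repository's attention implementation):

```python
import torch
import torch.nn.functional as F

B, T = 2, 16
n_heads, n_kv_heads, head_dim = 16, 4, 1024 // 16  # 303M row: 16 query heads, 4 KV heads

q = torch.randn(B, n_heads, T, head_dim)
k = torch.randn(B, n_kv_heads, T, head_dim)
v = torch.randn(B, n_kv_heads, T, head_dim)

# Each group of n_heads // n_kv_heads query heads shares one KV head.
k = k.repeat_interleave(n_heads // n_kv_heads, dim=1)  # -> (B, 16, T, head_dim)
v = v.repeat_interleave(n_heads // n_kv_heads, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 16, 16, 64])
```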

---

## Data

| Item | Value |
|------|-------|
| Dataset | `japhba/pubmed_simple` |
| Text field | `abstract` |
| Domain | Biomedical / medical research |
| Cleaning | Minimal (raw abstracts) |
| Sequence length | 1,024 |
| Sliding window stride | 512 |
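
A minimal sketch of the windowing this table describes, 1,024-token windows with a 512-token stride over raw abstracts (the `train` split name and the demo slice are assumptions):

```python
import tiktoken
from datasets import load_dataset

enc = tiktoken.get_encoding("p50k_base")
ds = load_dataset("japhba/pubmed_simple", split="train")  # split name assumed

seq_len, stride = 1024, 512
chunks = []
for record in ds.select(range(100)):  # small demo slice
    ids = enc.encode(record["abstract"])
    # One window per stride; abstracts shorter than seq_len yield a single short chunk.
    for start in range(0, max(len(ids) - seq_len, 0) + 1, stride):
        chunks.append(ids[start:start + seq_len])

print(f"{len(chunks)} chunks from 100 abstracts")
```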

---

## ⚙️ Training Setup (Common)

| Item | Value |
|------|-------|
| Objective | Causal language modeling (next-token prediction) |
| Optimizer | AdamW |
| Betas | (0.9, 0.95) |
| Precision | bf16 |
| Gradient accumulation | Enabled |
| Gradient clipping | 1.0 |
| Effective batch size | 128 |
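
The table maps onto a standard PyTorch loop along these lines. This is a hedged sketch: the toy model, dataloader, and accumulation count are stand-ins, since the per-device batch size is not restated here:

```python
import torch
import torch.nn.functional as F

# Toy stand-ins so the sketch runs; the real model and dataloader live in the training code.
vocab = 50281
model = torch.nn.Sequential(torch.nn.Embedding(vocab, 64), torch.nn.Linear(64, vocab))
loader = [(torch.randint(vocab, (2, 128)), torch.randint(vocab, (2, 128))) for _ in range(8)]

optimizer = torch.optim.AdamW(model.parameters(), betas=(0.9, 0.95))
accum_steps = 4  # assumed: effective batch 128 = per-device batch x accum_steps
device_type = "cuda" if torch.cuda.is_available() else "cpu"

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    with torch.autocast(device_type=device_type, dtype=torch.bfloat16):  # bf16 as in the table
        logits = model(x)  # (batch, seq, vocab)
        loss = F.cross_entropy(logits.flatten(0, 1), y.flatten())
    (loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping at 1.0
        optimizer.step()
        optimizer.zero_grad()
```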

---

## 📦 Checkpoints

The `checkpoints/` directory contains **multiple snapshots of the same model variants at different training stages**.

Examples:
- `checkpoint_step_25000.pt` (303M) – ~2.5B tokens seen
- Additional checkpoints may exist for the 401M variant

> ⚠️ **Important:**
> All released checkpoints are **early-stage pretraining snapshots**.
> At ~2.5B tokens (~8× tokens/parameter for 303M), the models are **undertrained** and should **not** be treated as finished base models.

They are provided to:
- study training dynamics,
- resume or extend pretraining,
- experiment with fine-tuning,
- inspect architectural behavior at scale.
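
Since these are raw training snapshots rather than packaged models, a `.pt` file is typically opened with plain `torch.load`. The snapshot layout is not documented here, so inspect the object before assuming any key names:

```python
import torch

# Path from the example above; contents may be a bare state_dict or a nested dict.
ckpt = torch.load("checkpoints/checkpoint_step_25000.pt", map_location="cpu")

if isinstance(ckpt, dict):
    print(list(ckpt.keys()))  # look for e.g. "model" / "optimizer" (key names are guesses)
```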

---

## Training Status

- Training and validation loss were **still improving** at the time of the last checkpoints.
- Training runs were **interrupted due to infrastructure preemption** and were not resumed.
- No claims are made about benchmark or downstream task performance.

---

## Loading the Model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "kunjcr2/MedAssistGPT-303M"  # or the 401M repo

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

prompt = "A patient was admitted with severe headache. Initial assessment revealed"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,   # required for temperature to take effect
    temperature=0.7,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## 🧪 Intended Use

This repository is intended for:

* architecture exploration,
* large-scale pretraining experiments,
* medical-domain language modeling research,
* educational purposes.

🚫 **Not intended for clinical or production medical use.**

---

## 🔮 Possible Next Steps (Not Included)

* Continued pretraining with larger token budgets
* Supervised fine-tuning (SFT) on medical QA datasets
* Evaluation on biomedical NLP benchmarks

---

## 🪪 License

Apache 2.0