kunjcr2/MedAssistGPT

llm-pretraining

kunjcr2 committed on Nov 10, 2025

Commit 36d5852 · verified · 1 Parent(s): 4d5e222

Update README.md

Files changed (1)

README.md +74 -1

@@ -4,4 +4,77 @@ datasets:
 - Hack90/europe_pmc_articles_part_2
 language:
 - en
----
+tags:
+- v0_pretrain_medassist
+---
+# MedAssist-GPT
+Tiny medical-domain LLM pretraining project.
+**NOT for clinical use.**
+## TL;DR
+* **Arch:** Transformer with **RoPE** + **GQA**, **SwiGLU** MLP, **RMSNorm**, causal LM head (tied embeddings).
+* **Tokenizer:** `tiktoken` **p50k_base** (vocab ≈ 50,281).
+* **Context:** 1,024 tokens (default).
+* **Size (default config):** ~125M params (d_model=512, n_heads=16, layers=16, d_ff=2048).
+* **Training data:** about 2.2B tokens of medical-domain text.
+## Data (example)
+* Source: `Hack90/europe_pmc_articles_part_2` (`full_text`).
+* XML → plain text via `clean()`; sliding windows (`max_length=1024`, `stride=1024`).
+## Training (script)
+* AdamW + OneCycleLR, bf16 AMP, grad accumulation, checkpoints, optional HF upload, wandb logging.
+## Loss
+![train_loss](https://cdn-uploads.huggingface.co/production/uploads/67c358189919777813863c48/bQGVqgx4GoqXZTcMh8KhM.png)
+![val_loss](https://cdn-uploads.huggingface.co/production/uploads/67c358189919777813863c48/jhNnS_Wvhj4-fzNoO2dRN.png)
+## Try it (minimal)
+```python
+# pip install torch tiktoken huggingface_hub safetensors
+import torch, tiktoken
+from safetensors.torch import load_file
+from huggingface_hub import hf_hub_download
+REPO_ID = "kunjcr2/MedAssistGPT"   # change if needed
+WEIGHTS = hf_hub_download(REPO_ID, "model.safetensors")
+state = load_file(WEIGHTS, device="cpu")
+# Import your MedAssistGPT class from the script/notebook
+from MedAssistGPT import MedAssistGPT, MODEL_CONFIG  # ensure paths match your repo
+model = MedAssistGPT(MODEL_CONFIG)
+model.load_state_dict(state, strict=True)  # returns a key report, not the model
+model.eval()
+enc = tiktoken.get_encoding("p50k_base")
+ids = torch.tensor([enc.encode("To live a good life")], dtype=torch.long)
+with torch.no_grad():
+    for _ in range(100):
+        logits = model(ids)[:, -1, :]
+        next_id = torch.multinomial(torch.softmax(logits/0.7, dim=-1), 1)
+        ids = torch.cat([ids, next_id], dim=1)
+        if next_id.item() == enc.eot_token: break
+print(enc.decode(ids[0].tolist()))
+```
+## Intended use & limitations
+Research, experimentation, and use as a base model for downstream fine-tuning.
+Do **NOT** use for medical decisions.
+## Files
+* `model.safetensors` (weights)
+* `config.json`, `tokenizer_config.json`
+* Script/notebook defining `MedAssistGPT` class
+## License
+Apache-2.0
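
The Data section of the README mentions sliding windows with `max_length=1024` and `stride=1024` but does not show how they are built. Below is a minimal sketch of one common way to do it, assuming the cleaned text has already been tokenized into a flat list of ids. The function name `sliding_windows` is hypothetical and is not taken from the repository; the actual training script may differ. With `stride` equal to `max_length`, consecutive windows do not overlap.

```python
def sliding_windows(token_ids, max_length=1024, stride=1024):
    """Split a flat list of token ids into (inputs, targets) pairs.

    Hypothetical helper for illustration; not the repository's code.
    Targets are the inputs shifted one position to the right, which is
    the usual setup for next-token prediction.
    """
    windows = []
    # Stop early enough that every window has one extra token for its target.
    for start in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[start : start + max_length]
        targets = token_ids[start + 1 : start + max_length + 1]
        windows.append((inputs, targets))
    return windows


# Toy example with small numbers so the windows are easy to read.
toy_ids = list(range(10))
toy_windows = sliding_windows(toy_ids, max_length=4, stride=4)
for inputs, targets in toy_windows:
    print(inputs, "->", targets)
```

Choosing a stride smaller than `max_length` would produce overlapping windows and more training samples from the same text, at the cost of repeated tokens.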