### Model Sources

- **Repository:** [gpjt/ddp-base-model-from-scratch](https://github.com/gpjt/ddp-base-model-from-scratch)
- **Blog post:** [Writing an LLM from scratch, part 32b -- Interventions: gradient clipping](https://www.gilesthomas.com/2026/02/llm-from-scratch-32b-interventions-gradient-clipping)

## How to Get Started with the Model

## Training Details

- **Machine type:** 8x A100, with 40GiB per GPU
- **Tokens:** 3,260,190,720 (Chinchilla-optimal 20x parameters, rounded up to the nearest batch)
- **Dataset:** [gpjt/fineweb-gpt2-tokens](https://huggingface.co/datasets/gpjt/fineweb-gpt2-tokens)
- **Micro-batch size:** 12
- **Global batch size:** 96
- **Dropout:** 0.1
- **Gradient clipping:** at 3.5
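The batch-size and clipping settings above fit together as follows: 12 examples per GPU across the 8 A100s gives the global batch of 96, and clipping "at 3.5" is clip-by-global-norm, the operation `torch.nn.utils.clip_grad_norm_(model.parameters(), 3.5)` performs in a PyTorch loop. A minimal sketch of that operation on plain lists (illustrative only, not code from the repository):

```python
import math

# Illustrative sketch only -- not code from the repository.
# Clip-by-global-norm at 3.5: if the combined L2 norm of all gradients
# exceeds max_norm, every gradient is scaled down by the same factor so
# the combined norm equals max_norm; otherwise gradients pass through.
def clip_grad_norm(grads, max_norm=3.5):
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [[g * scale for g in vec] for vec in grads]
    return grads, total_norm

# Global batch size follows from the hardware: 12 per GPU x 8 GPUs = 96.
micro_batch_size, num_gpus = 12, 8
print(micro_batch_size * num_gpus)  # -> 96

# A gradient vector with norm 5.0 is rescaled to norm 3.5.
clipped, norm = clip_grad_norm([[3.0, 4.0]])
print(norm)  # -> 5.0
```

Clipping the *global* norm (rather than per-tensor) preserves the direction of the overall gradient while capping its magnitude, which is why a single threshold like 3.5 suffices for the whole model.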