gpjt/1xrtx3090-baseline

gpjt committed on Apr 14

Commit da711bf (verified) · Parent: cf75bf5

Update README.md

Files changed (1)

README.md +3 -3

@@ -40,7 +40,7 @@ LLM, please do feel free to play with it!
 ### Model Sources
 - **Repository:** [gpjt/ddp-base-model-from-scratch](https://github.com/gpjt/ddp-base-model-from-scratch)
-- **Blog post:** [Writing an LLM from scratch, part 29 -- using DistributedDataParallel to train a base model from scratch in the cloud](https://www.gilesthomas.com/2026/01/llm-from-scratch-29-ddp-training-a-base-model-in-the-cloud)
+- **Blog post:** [Writing an LLM from scratch, part 32k -- Interventions: training a better model locally with gradient accumulation](https://staging.gilesthomas.com/2026/04/llm-from-scratch-32k-interventions-training-our-best-model-locally-gradient-accumulation)
 ## How to Get Started with the Model
@@ -78,11 +78,11 @@ number of tokens.  It's [both dumb and ignorant](https://www.gilesthomas.com/202
 ## Training Details
-- **Machine type:** TODO
+- **Machine type:** Local machine with an RTX 3090
 - **Tokens:**  3,260,190,720 (Chinchilla-optimal of 20x parameters) rounded up to the nearest batch.
 - **Dataset:** [gpjt/fineweb-gpt2-tokens](https://huggingface.co/datasets/gpjt/fineweb-gpt2-tokens)
 - **Micro-batch size:** 6
-- **Global batch size:** TODO
+- **Global batch size:** 96 (using 12 gradient accumulation steps)
 - **Dropout:** 0.1
 - **Gradient clipping:** None
 - **Learning rate:** 0.0004
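
The new global batch size line can be sanity-checked with the usual relation for gradient accumulation: global batch = micro-batch size × accumulation steps × number of GPUs. The sketch below assumes a single GPU (world size 1), as the machine type line suggests. The function names are hypothetical and are not taken from the linked repository. Under that assumption, a micro-batch of 6 with 12 accumulation steps gives 72 sequences per optimizer step, while a global batch of 96 would correspond to 16 steps, so the two figures in the README may be worth double-checking against the training config.

```python
def global_batch_size(micro_batch_size: int, grad_accum_steps: int, world_size: int = 1) -> int:
    """Sequences contributing to one optimizer step (hypothetical helper)."""
    return micro_batch_size * grad_accum_steps * world_size


def accum_steps_for(target_global_batch: int, micro_batch_size: int, world_size: int = 1) -> int:
    """Accumulation steps needed to reach a target global batch (hypothetical helper)."""
    per_pass = micro_batch_size * world_size  # sequences per forward/backward pass
    if target_global_batch % per_pass != 0:
        raise ValueError("target global batch is not a multiple of micro-batch size x world size")
    return target_global_batch // per_pass


# Values from the Training Details list; world_size=1 is an assumption for a single RTX 3090.
print(global_batch_size(micro_batch_size=6, grad_accum_steps=12))      # 72
print(accum_steps_for(target_global_batch=96, micro_batch_size=6))     # 16
```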