Update to checkpoint 20000 (best quality, loss 0.758)

Files changed (3)

README.md CHANGED

@@ -48,9 +48,8 @@ A 53-million parameter GPT model trained from scratch on FineWebEdu educational
 - **Framework:** Apple MLX (training), PyTorch (export)
 - **Dataset:** FineWebEdu - 10M tokens of educational web content
 - **Training Hardware:** Apple M2 Pro (16GB unified memory)
-- **Checkpoint:** 35000 iterations
-- **Training Method:** Base pretraining (20K iters) + Knowledge Distillation (15K iters)
-- **Teacher Model:** GPT-OSS-20B (via Groq API)
+- **Checkpoint:** 20000 iterations
+- **Training Method:** Base pretraining from scratch
 ### Architecture Highlights
@@ -72,13 +71,10 @@ Pre-LN provides better training stability and is used in modern transformers (GP
 - **Dataset:** FineWebEdu (diverse educational web content)
 - **Training Tokens:** 10M
-- **Base Training:** 20,000 iterations (loss 0.758)
-- **Knowledge Distillation:** 15,000 additional iterations with GPT-OSS-20B as teacher
-- **Total Iterations:** 35,000
+- **Total Iterations:** 20,000
 - **Batch Size:** 12
-- **Learning Rate:** 3e-4 with cosine decay (base), 3e-5 (distillation)
-- **Final Training Loss:** 3.46
-- **Distillation Method:** 50% hard loss (ground truth) + 50% soft loss (teacher)
+- **Learning Rate:** 3e-4 with cosine decay
+- **Final Training Loss:** 0.7583
 ### Performance Benchmarks
@@ -95,7 +91,7 @@ Measured on Apple M2 Pro (16GB unified memory):
 | **Generation Latency** | ~0.59s per 100 tokens |
 | **Activation Memory** | 843 MB (batch=4, seq=512) |
-> **Note:** Benchmarks measured at checkpoint 20000. This release (checkpoint 35000) includes additional knowledge distillation training.
+> **Note:** All benchmarks measured at checkpoint 20000 (this release).
 ## Usage
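
The README's learning-rate entry (3e-4 with cosine decay, 20,000 total iterations) can be pictured with a minimal sketch. The commit does not state a warmup length or a floor learning rate, so `warmup_iters=0` and `min_lr=0.0` are placeholder assumptions, and `cosine_lr` is a hypothetical helper, not the repository's training code. Note that `training_metadata.json` in this same commit records a learning rate of 0.0006; the peak value below follows the README.

```python
import math

def cosine_lr(it, max_lr=3e-4, total_iters=20_000, warmup_iters=0, min_lr=0.0):
    """Learning rate at iteration `it` under optional linear warmup + cosine decay.

    warmup_iters and min_lr are assumed values; the commit does not state them.
    """
    # Optional linear warmup from near zero up to max_lr.
    if warmup_iters > 0 and it < warmup_iters:
        return max_lr * (it + 1) / warmup_iters
    # Past the end of the schedule, hold at the floor.
    if it >= total_iters:
        return min_lr
    # Cosine decay from max_lr down to min_lr over the remaining iterations.
    progress = (it - warmup_iters) / (total_iters - warmup_iters)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

if __name__ == "__main__":
    for it in (0, 5_000, 10_000, 15_000, 20_000):
        print(it, cosine_lr(it))
```

With these defaults the rate starts at the peak, passes through half the peak at the midpoint, and reaches the floor at iteration 20,000.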

pytorch_model.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:6d21ad8e8646491e7510cb58cc5d542ca21db6a8e174f5399fba9546662cf317
+oid sha256:46767573997bf47cf4f151837ff4ca7288b44a09332b2c41a0666fddf3c74cd2
 size 143190611
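
Because `pytorch_model.bin` is stored through Git LFS, the diff only shows the pointer file: the byte size is unchanged while the SHA-256 `oid` differs. As a rough sketch, a downloaded copy could be checked against the new pointer as follows. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the local file path is an assumption.

```python
import hashlib
from pathlib import Path

def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into (sha256_hex, size_in_bytes)."""
    # Each pointer line is "<key> <value>", e.g. "size 143190611".
    fields = dict(line.strip().split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    if algo != "sha256":
        raise ValueError(f"unsupported oid algorithm: {algo}")
    return digest, int(fields["size"])

def matches_pointer(path, pointer_text, chunk_size=1 << 20):
    """Return True if the file at `path` has the size and SHA-256 named by the pointer."""
    digest, size = parse_lfs_pointer(pointer_text)
    path = Path(path)
    # Cheap check first: a size mismatch rules the file out without hashing.
    if path.stat().st_size != size:
        return False
    h = hashlib.sha256()
    with path.open("rb") as f:
        # Hash in chunks so a large checkpoint is never fully loaded into memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest() == digest

POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:46767573997bf47cf4f151837ff4ca7288b44a09332b2c41a0666fddf3c74cd2
size 143190611
"""

if __name__ == "__main__":
    # Assumes the weights were downloaded next to this script.
    local = Path("pytorch_model.bin")
    if local.exists():
        print("matches pointer:", matches_pointer(local, POINTER))
```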

training_metadata.json ADDED

+{
+  "model_name": "nanoGPT-MLX-53M-FineWebEdu",
+  "framework": "MLX",
+  "architecture": "Pre-LN Transformer (GPT-2 style)",
+  "training": {
+    "dataset": "FineWebEdu-10M",
+    "iterations": 20000,
+    "final_loss": 0.7583,
+    "optimizer": "AdamW",
+    "learning_rate": 0.0006,
+    "batch_size": 16,
+    "context_length": 512
+  },
+  "model_config": {
+    "vocab_size": 50257,
+    "d_model": 384,
+    "n_layers": 8,
+    "n_heads": 8,
+    "d_ff": 1536,
+    "dropout": 0.1
+  },
+  "parameters": "52.99M",
+  "converted_from": "MLX checkpoint_20000.npz",
+  "conversion_date": "2025-11-14"
+}
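
As a sanity check on the `"parameters": "52.99M"` field, the count can be estimated from `model_config` and `context_length`. The layer layout below is an assumption, not something this commit states: learned positional embeddings, bias terms on every linear layer, two LayerNorms per block plus a final one, and the token embedding and output head counted as separate matrices. `count_params` is a hypothetical helper.

```python
def count_params(vocab_size, d_model, n_layers, d_ff, context_length):
    """Estimate the parameter count of a GPT-2 style Pre-LN transformer.

    Assumed layout (not confirmed by the commit): learned positional
    embeddings, biases on all linear layers, and an output head counted
    separately from the token embedding.
    """
    tok_emb = vocab_size * d_model
    pos_emb = context_length * d_model
    layer_norm = 2 * d_model  # scale + shift
    # Attention: fused QKV projection plus output projection, each with bias.
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # Feed-forward: up-projection and down-projection, each with bias.
    mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    block = attn + mlp + 2 * layer_norm
    lm_head = vocab_size * d_model  # no bias assumed
    return tok_emb + pos_emb + n_layers * block + layer_norm + lm_head

total = count_params(vocab_size=50257, d_model=384, n_layers=8, d_ff=1536, context_length=512)
print(f"{total:,} parameters ({total / 1e6:.2f}M)")  # 52,990,464 parameters (52.99M)
```

Under these assumptions the estimate rounds to the reported 52.99M. That agreement is suggestive, but it does not confirm the actual layer layout or whether the embedding weights are shared in the exported checkpoint.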