jacksuuuu committed
Commit af92360 · verified · 1 Parent(s): 788c324

Update README for checkpoint 35000 with distillation details

Files changed (1):
  1. README.md +16 -14

README.md CHANGED
@@ -42,6 +42,7 @@ A 53-million parameter GPT model trained on FineWebEdu using Apple's MLX framewo
 - **Training Data:** FineWebEdu (10M tokens, educational web content)
 - **Training Framework:** MLX (Apple Silicon optimized)
 - **Hardware:** M2 Pro with 16GB memory
+- **Checkpoint:** 35000 (includes knowledge distillation from GPT-OSS-20B)
 
 ### Architecture Highlights
 
@@ -63,15 +64,17 @@ Pre-LN provides better training stability and is used in modern transformers (GP
 
 - **Dataset:** FineWebEdu (diverse educational web content)
 - **Training Tokens:** 10M
-- **Iterations:** 20,000
+- **Base Training:** 20,000 iterations (loss 0.758)
+- **Knowledge Distillation:** 15,000 additional iterations with GPT-OSS-20B as teacher
+- **Total Iterations:** 35,000
 - **Batch Size:** 12
-- **Learning Rate:** 3e-4 with cosine decay
-- **Training Loss:** 0.758
-- **Validation Perplexity:** 690,728
+- **Learning Rate:** 3e-4 with cosine decay (base), 3e-5 (distillation)
+- **Final Training Loss:** 3.46
+- **Distillation Method:** 50% hard loss (ground truth) + 50% soft loss (teacher)
 
 ### Performance Benchmarks
 
-Training and inference on M2 Pro:
+Training and inference on M2 Pro (measured at checkpoint 20000):
 
 ```
 📊 Model Size: 53.0M parameters
@@ -86,6 +89,8 @@ Training and inference on M2 Pro:
 💾 Memory: 843 MB activations (batch=4, seq=512)
 ```
 
+**Note:** This checkpoint (35000) includes additional training with knowledge distillation.
+
 ## Usage
 
 ### Basic Text Generation
@@ -117,18 +122,15 @@ print(text)
 
 **Prompt:** "Once upon a time"
 
-**Generated:**
+**Generated (Checkpoint 35000 with distillation):**
 ```
-Once upon a time, there was a bunny named Fl. Fl lived in a big forest
-with many trees. One day, Fl met a wise old owl named Sally.
-
-"Sally, what should I do?" asked Fl.
-
-"I want to help you," said the old owl. "Let's talk."
-
-Fl agreed and they became good friends.
+Once upon a time: "the)." as in KDE, set by an article of the U and
+updated to the existing of a network. For requirements of the application
+to an individual to the data above above above above...
 ```
 
+**Note:** This checkpoint shows characteristics of knowledge distillation training. The model has learned broader patterns from the teacher model (GPT-OSS-20B), though generation quality varies. For more coherent story generation, consider fine-tuning on your specific use case.
+
 ## Model Architecture
 
 ```python
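The "50% hard loss + 50% soft loss" objective added in this commit can be sketched as follows. This is a minimal NumPy illustration of standard knowledge distillation, not the repository's actual MLX training code; the function names and the temperature value are assumptions.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Softmax over the last axis, with optional temperature softening."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, targets,
                      hard_weight=0.5, temperature=2.0):
    """Weighted sum of hard (ground-truth CE) and soft (teacher KL) losses."""
    # Hard loss: cross-entropy of student predictions vs. ground-truth token ids.
    probs = softmax(student_logits)
    rows = np.arange(len(targets))
    hard = -np.log(probs[rows, targets] + 1e-12).mean()

    # Soft loss: KL(teacher || student) at temperature T, scaled by T^2 so its
    # gradient magnitude is comparable to the hard loss (Hinton et al., 2015).
    t_probs = softmax(teacher_logits, temperature)
    s_logprobs = np.log(softmax(student_logits, temperature) + 1e-12)
    soft = (t_probs * (np.log(t_probs + 1e-12) - s_logprobs)).sum(axis=-1).mean()
    soft *= temperature ** 2

    # 50/50 mix when hard_weight=0.5, matching the README's description.
    return hard_weight * hard + (1.0 - hard_weight) * soft
```

With `hard_weight=0.5` the student splits its objective evenly between matching the ground-truth tokens and matching the teacher's softened output distribution; when the student's logits already equal the teacher's, the soft term vanishes and only the hard term remains.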
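The "3e-4 with cosine decay" schedule listed in the training details can be sketched like this. A minimal stand-alone illustration, assuming decay from the peak rate to zero over the base-training run with no warmup; it is not the repository's actual scheduler code.

```python
import math

def cosine_decay_lr(step, total_steps, peak_lr=3e-4, min_lr=0.0):
    """Cosine-decayed learning rate from peak_lr down to min_lr.

    step        -- current training iteration
    total_steps -- iterations over which to decay (e.g. 20,000 base iterations)
    """
    progress = min(step / total_steps, 1.0)  # clamp past the end of the schedule
    # Cosine goes from 1 at progress=0 to -1 at progress=1,
    # so the rate sweeps smoothly from peak_lr to min_lr.
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

At step 0 this returns the peak rate (3e-4), at the midpoint half of it, and at the end of the schedule `min_lr`; a lower constant or separately scheduled rate (the README lists 3e-5) would then be used for the distillation phase.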