jacksuuuu commited on
Commit
992db31
·
verified ·
1 Parent(s): 582d1f9

Initial upload: checkpoint 20000 with accurate model card

Browse files
Files changed (1) hide show
  1. README.md +29 -18
README.md CHANGED
@@ -26,10 +26,11 @@ widget:
26
 
27
  # NanoGPT 53M - Pre-LN Transformer
28
 
29
- A 53-million parameter GPT model trained from scratch on FineWebEdu educational content. This model implements a **Pre-LayerNorm (Pre-LN) transformer architecture**, compatible with HuggingFace Transformers library.
30
 
31
  > **Model Format:** PyTorch (cross-platform compatible)
32
- > **Training Framework:** Apple MLX (exported to PyTorch for universal compatibility)
 
33
 
34
  ## Model Details
35
 
@@ -70,11 +71,15 @@ Pre-LN provides better training stability and is used in modern transformers (GP
70
  ## Training Details
71
 
72
  - **Dataset:** FineWebEdu (diverse educational web content)
73
- - **Training Tokens:** 10M
74
  - **Total Iterations:** 20,000
75
- - **Batch Size:** 12
76
- - **Learning Rate:** 3e-4 with cosine decay
 
 
77
  - **Final Training Loss:** 0.7583
 
 
78
 
79
  ### Performance Benchmarks
80
 
@@ -101,9 +106,9 @@ Measured on Apple M2 Pro (16GB unified memory):
101
  from transformers import AutoTokenizer, AutoModelForCausalLM
102
 
103
  # Load model and tokenizer (requires trust_remote_code for custom architecture)
104
- tokenizer = AutoTokenizer.from_pretrained("jacksuuuu/nanogpt-mlx-53m-finewebedu")
105
  model = AutoModelForCausalLM.from_pretrained(
106
- "jacksuuuu/nanogpt-mlx-53m-finewebedu",
107
  trust_remote_code=True
108
  )
109
 
@@ -127,14 +132,18 @@ print(text)
127
 
128
  **Prompt:** "Once upon a time"
129
 
130
- **Generated (Checkpoint 35000 with distillation):**
131
  ```
132
- Once upon a time: "the)." as in KDE, set by an article of the U and
133
- updated to the existing of a network. For requirements of the application
134
- to an individual to the data above above above above...
 
 
 
 
135
  ```
136
 
137
- **Note:** This checkpoint shows characteristics of knowledge distillation training. The model has learned broader patterns from the teacher model (GPT-OSS-20B), though generation quality varies. For more coherent story generation, consider fine-tuning on your specific use case.
138
 
139
  ## Model Architecture
140
 
@@ -185,11 +194,13 @@ NanoGPTLMHeadModel(
185
 
186
  ## Limitations
187
 
188
- - **Context length:** Limited to 512 tokens
189
- - **Domain:** Trained on educational web content (FineWebEdu)
190
- - **Size:** 53M parameters is relatively small compared to modern LLMs
191
- - **Generation:** Best for short-form content (stories, paragraphs)
192
- - **No instruction tuning:** This is a base language model, not instruction-tuned
 
 
193
 
194
  ## Intended Use
195
 
@@ -221,7 +232,7 @@ If you use this model, please cite:
221
  author = {JackSu},
222
  title = {NanoGPT MLX: 53M Parameter Pre-LN Transformer},
223
  year = {2025},
224
- url = {https://huggingface.co/jacksuuuu/nanogpt-mlx-53m-finewebedu}
225
  }
226
  ```
227
 
 
26
 
27
  # NanoGPT 53M - Pre-LN Transformer
28
 
29
+ A 53-million parameter GPT model trained from scratch on 10M tokens of FineWebEdu educational content. This model implements a **Pre-LayerNorm (Pre-LN) transformer architecture** and serves as a demonstration of efficient training on Apple Silicon using the MLX framework.
30
 
31
  > **Model Format:** PyTorch (cross-platform compatible)
32
+ > **Training Framework:** Apple MLX (exported to PyTorch for universal compatibility)
33
+ > **Best for:** Educational demonstrations, research, and fine-tuning on specific domains
34
 
35
  ## Model Details
36
 
 
71
  ## Training Details
72
 
73
  - **Dataset:** FineWebEdu (diverse educational web content)
74
+ - **Training Tokens:** ~10.2M tokens from educational web pages
75
  - **Total Iterations:** 20,000
76
+ - **Batch Size:** 12 sequences/batch
77
+ - **Sequence Length:** 512 tokens
78
+ - **Learning Rate:** 3e-4 with cosine decay schedule
79
+ - **Optimizer:** AdamW (β1=0.9, β2=0.95, weight_decay=0.1)
80
  - **Final Training Loss:** 0.7583
81
+ - **Training Time:** ~4 hours on Apple M2 Pro
82
+ - **Gradient Accumulation:** None (direct updates)
83
 
84
  ### Performance Benchmarks
85
 
 
106
  from transformers import AutoTokenizer, AutoModelForCausalLM
107
 
108
  # Load model and tokenizer (requires trust_remote_code for custom architecture)
109
+ tokenizer = AutoTokenizer.from_pretrained("jacksuuuu/tinystories")
110
  model = AutoModelForCausalLM.from_pretrained(
111
+ "jacksuuuu/tinystories",
112
  trust_remote_code=True
113
  )
114
 
 
132
 
133
  **Prompt:** "Once upon a time"
134
 
135
+ **Generated:**
136
  ```
137
+ Once upon a time, the boy named Lily and his dog named Max went for a walk.
138
+ They ran and ran, but they kept each and got very tired. Suddenly the way,
139
+ Max saw something shiny on the ground. He pointed the shiny to his owner and
140
+ explained, "What does this?"
141
+
142
+ Max meowed and said, "I don't sign, Max. The sign is too small and it's
143
+ important to learn."
144
  ```
145
 
146
+ **Note:** This model generates coherent short stories and educational content. While grammatically imperfect due to its small size (53M params), it demonstrates good narrative flow and vocabulary learned from FineWebEdu dataset.
147
 
148
  ## Model Architecture
149
 
 
194
 
195
  ## Limitations
196
 
197
+ - **Context length:** Limited to 512 tokens (can't process longer documents)
198
+ - **Domain:** Trained primarily on educational web content (FineWebEdu)
199
+ - **Model size:** 53M parameters - significantly smaller than modern LLMs (1B+)
200
+ - **Generation quality:** Produces coherent narratives but with occasional grammatical errors
201
+ - **Factual accuracy:** Limited by small model size and training data
202
+ - **No instruction tuning:** Base language model - cannot follow instructions or engage in dialogue
203
+ - **Training data:** Only 10M tokens (modern models use trillions)
204
 
205
  ## Intended Use
206
 
 
232
  author = {JackSu},
233
  title = {NanoGPT MLX: 53M Parameter Pre-LN Transformer},
234
  year = {2025},
235
+ url = {https://huggingface.co/jacksuuuu/tinystories}
236
  }
237
  ```
238