Initial upload: checkpoint 20000 with accurate model card

README.md
# NanoGPT 53M - Pre-LN Transformer

A 53-million parameter GPT model trained from scratch on 10M tokens of FineWebEdu educational content. This model implements a **Pre-LayerNorm (Pre-LN) transformer architecture** and serves as a demonstration of efficient training on Apple Silicon using the MLX framework.

> **Model Format:** PyTorch (cross-platform compatible)
> **Training Framework:** Apple MLX (exported to PyTorch for universal compatibility)
> **Best for:** Educational demonstrations, research, and fine-tuning on specific domains

## Model Details
## Training Details

- **Dataset:** FineWebEdu (diverse educational web content)
- **Training Tokens:** ~10.2M tokens from educational web pages
- **Total Iterations:** 20,000
- **Batch Size:** 12 sequences/batch
- **Sequence Length:** 512 tokens
- **Learning Rate:** 3e-4 with cosine decay schedule
- **Optimizer:** AdamW (β1=0.9, β2=0.95, weight_decay=0.1)
- **Final Training Loss:** 0.7583
- **Training Time:** ~4 hours on Apple M2 Pro
- **Gradient Accumulation:** None (direct updates)
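
The hyperparameters above pin the schedule down only loosely ("3e-4 with cosine decay"). A minimal sketch of what such a schedule looks like over the 20,000 iterations; the warmup length and floor learning rate here are illustrative assumptions, not values from this card:

```python
import math

def cosine_lr(step, max_steps=20_000, peak_lr=3e-4, min_lr=3e-5, warmup=200):
    """Cosine-decay learning-rate schedule with linear warmup.

    peak_lr and max_steps come from the card above; min_lr and warmup
    are illustrative assumptions (the card does not specify them).
    """
    if step < warmup:
        # Linear warmup from ~0 up to the peak learning rate.
        return peak_lr * (step + 1) / warmup
    # Fraction of post-warmup training completed, from 0 to 1.
    progress = (step - warmup) / (max_steps - warmup)
    # Cosine decay from peak_lr down to min_lr.
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

For reference, if the reported loss is the usual per-token cross-entropy in nats, the final value of 0.7583 corresponds to a training perplexity of e^0.7583 ≈ 2.13.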

### Performance Benchmarks
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer (requires trust_remote_code for the custom architecture)
tokenizer = AutoTokenizer.from_pretrained("jacksuuuu/tinystories")
model = AutoModelForCausalLM.from_pretrained(
    "jacksuuuu/tinystories",
    trust_remote_code=True,
)
```
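
The card does not state which decoding settings produced the sample that follows; small GPTs like this one are typically sampled with a temperature and top-k cutoff. A stdlib-only sketch of one top-k sampling step — `top_k_sample` is a hypothetical helper for illustration, and the `k` and `temperature` defaults are assumptions, not the repo's settings:

```python
import math
import random

def top_k_sample(logits, k=50, temperature=0.8, rng=random.Random(0)):
    """Draw one token id from the k highest-scoring logits.

    logits: list of raw scores, one per vocabulary token.
    """
    # Keep only the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights at the given temperature (unnormalized is fine for sampling).
    weights = [math.exp(logits[i] / temperature) for i in top]
    # Roulette-wheel selection proportional to the weights.
    r = rng.random() * sum(weights)
    for token_id, w in zip(top, weights):
        r -= w
        if r <= 0:
            return token_id
    return top[-1]
```

With k=1 this reduces to greedy decoding and the output becomes deterministic; the repetitions and small grammatical slips in the sample below are typical of sampling from a model this size.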
**Prompt:** "Once upon a time"

**Generated:**
```
Once upon a time, the boy named Lily and his dog named Max went for a walk.
They ran and ran, but they kept each and got very tired. Suddenly the way,
Max saw something shiny on the ground. He pointed the shiny to his owner and
explained, "What does this?"

Max meowed and said, "I don't sign, Max. The sign is too small and it's
important to learn."
```

**Note:** This model generates coherent short stories and educational content. While grammatically imperfect due to its small size (53M params), it demonstrates good narrative flow and vocabulary learned from the FineWebEdu dataset.
## Model Architecture

## Limitations

- **Context length:** Limited to 512 tokens (can't process longer documents)
- **Domain:** Trained primarily on educational web content (FineWebEdu)
- **Model size:** 53M parameters, significantly smaller than modern LLMs (1B+)
- **Generation quality:** Produces coherent narratives but with occasional grammatical errors
- **Factual accuracy:** Limited by small model size and training data
- **No instruction tuning:** Base language model; cannot follow instructions or engage in dialogue
- **Training data:** Only ~10M tokens (modern models use trillions)

## Intended Use

If you use this model, please cite:

```bibtex
author = {JackSu},
title = {NanoGPT MLX: 53M Parameter Pre-LN Transformer},
year = {2025},
url = {https://huggingface.co/jacksuuuu/tinystories}
}
```