jacksuuuu committed
Commit c11d047 · verified · 1 Parent(s): c48a2fd

Update model card: professional format, remove MLX version reference

Files changed (1)
  1. README.md +46 -40
README.md CHANGED
@@ -4,45 +4,53 @@ language:
  license: mit
  tags:
  - text-generation
- - mlx
  - gpt
  - pre-ln
  datasets:
  - HuggingFaceFW/fineweb-edu
  metrics:
  - perplexity
- model-index:
- - name: nanogpt-mlx-53m-finewebedu
-   results:
-   - task:
-       type: text-generation
-       name: Text Generation
-     dataset:
-       name: FineWebEdu
-       type: HuggingFaceFW/fineweb-edu
-     metrics:
-     - type: perplexity
-       value: 690728
-       name: Validation Perplexity
-     - type: loss
-       value: 0.758
-       name: Training Loss
  ---

- # NanoGPT MLX 53M (FineWebEdu)

- A 53-million parameter GPT model trained on FineWebEdu using Apple's MLX framework. This model features a **Pre-LayerNorm (Pre-LN) transformer architecture** optimized for Apple Silicon.

  ## Model Details

- - **Parameters:** 53M (52,990,464 total)
- - **Architecture:** Pre-LN Transformer (8 layers, 384d model, 8 attention heads)
  - **Context Length:** 512 tokens
- - **Vocabulary:** 50,257 tokens (GPT-2 tokenizer)
- - **Training Data:** FineWebEdu (10M tokens, educational web content)
- - **Training Framework:** MLX (Apple Silicon optimized)
- - **Hardware:** M2 Pro with 16GB memory
- - **Checkpoint:** 35000 (includes knowledge distillation from GPT-OSS-20B)

  ### Architecture Highlights
@@ -74,22 +82,20 @@ Pre-LN provides better training stability and is used in modern transformers (GP

  ### Performance Benchmarks

- Training and inference on M2 Pro (measured at checkpoint 20000):
-
- ```
- 📊 Model Size: 53.0M parameters
-    202.1 MB (fp32), 101.1 MB (fp16)
-
- Training: 27,355 tokens/sec (forward pass)
-    13.36 batches/sec (batch=4, seq=512)
-
- 🎯 Inference: 169.9 tokens/sec
-    ~0.59s per 100 tokens
-
- 💾 Memory: 843 MB activations (batch=4, seq=512)
- ```

- **Note:** This checkpoint (35000) includes additional training with knowledge distillation.

  ## Usage
  license: mit
  tags:
  - text-generation
+ - pytorch
  - gpt
+ - transformers
  - pre-ln
+ - causal-lm
  datasets:
  - HuggingFaceFW/fineweb-edu
+ library_name: transformers
+ pipeline_tag: text-generation
  metrics:
  - perplexity
+ widget:
+ - text: "Once upon a time"
+   example_title: "Story Beginning"
+ - text: "The capital of France is"
+   example_title: "Factual Question"
+ - text: "In the field of machine learning,"
+   example_title: "Technical Topic"
  ---
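The new `library_name: transformers` and `pipeline_tag: text-generation` entries mean the Hub widget serves the prompts above through a standard Transformers text-generation pipeline. A minimal sketch of the equivalent local call; the repository id below is a placeholder, not stated in this commit:

```python
from transformers import pipeline


def generate(prompt: str, model_id: str = "your-username/nanogpt-53m") -> str:
    """Run a prompt the way the Hub inference widget does.

    model_id is a placeholder -- substitute the actual repository id.
    """
    generator = pipeline("text-generation", model=model_id)
    result = generator(prompt, max_new_tokens=50)
    return result[0]["generated_text"]
```

Any of the three widget prompts ("Once upon a time", "The capital of France is", "In the field of machine learning,") can be passed to `generate` as-is.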

+ # NanoGPT 53M - Pre-LN Transformer

+ A 53-million parameter GPT model trained from scratch on FineWebEdu educational content. It implements a **Pre-LayerNorm (Pre-LN) transformer architecture** and is compatible with the Hugging Face Transformers library.
+
+ > **Model Format:** PyTorch (cross-platform compatible)
+ > **Training Framework:** Apple MLX (exported to PyTorch for universal compatibility)

  ## Model Details

+ ### Architecture
+ - **Model Type:** GPT (decoder-only transformer)
+ - **Parameters:** 53M (52,990,464 total; 43M unique with weight tying)
+ - **Architecture Pattern:** Pre-LayerNorm (Pre-LN)
+ - **Layers:** 8 transformer blocks
+ - **Hidden Size:** 384
+ - **Attention Heads:** 8
+ - **Feedforward Dimension:** 1536
  - **Context Length:** 512 tokens
+ - **Vocabulary Size:** 50,257 (GPT-2 tokenizer)
+
+ ### Training
+ - **Framework:** Apple MLX (training), PyTorch (export)
+ - **Dataset:** FineWebEdu (10M tokens of educational web content)
+ - **Training Hardware:** Apple M2 Pro (16GB unified memory)
+ - **Checkpoint:** 35,000 iterations
+ - **Training Method:** base pretraining (20K iters) + knowledge distillation (15K iters)
+ - **Teacher Model:** GPT-OSS-20B (via Groq API)
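The Pre-LN pattern named above applies LayerNorm *before* each sublayer, with the residual additions left unnormalized. A NumPy sketch of one block at the card's dimensions (hidden 384, 8 heads, FFN 1536); the weights are random stand-ins, and the ReLU MLP and omitted learned LayerNorm gains are simplifications, not details taken from this model:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, F = 384, 8, 1536  # hidden size, attention heads, FFN dim (from the card)

def layer_norm(x, eps=1e-5):
    # Normalize over the feature dimension (learned scale/shift omitted).
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# Random weights stand in for trained parameters.
Wq, Wk, Wv, Wo = (rng.normal(0, 0.02, (D, D)) for _ in range(4))
W1, W2 = rng.normal(0, 0.02, (D, F)), rng.normal(0, 0.02, (F, D))

def pre_ln_block(x):
    # Sublayer 1: LayerNorm -> multi-head self-attention -> residual add.
    T = x.shape[0]
    h = layer_norm(x)
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    # Split into H heads of size D // H: (T, D) -> (H, T, D // H)
    q, k, v = (a.reshape(T, H, D // H).transpose(1, 0, 2) for a in (q, k, v))
    att = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(D // H))
    ctx = (att @ v).transpose(1, 0, 2).reshape(T, D)
    x = x + ctx @ Wo
    # Sublayer 2: LayerNorm -> MLP -> residual add.
    h = layer_norm(x)
    x = x + np.maximum(h @ W1, 0) @ W2
    return x

out = pre_ln_block(rng.normal(size=(16, D)))
print(out.shape)  # (16, 384)
```

The contrast with Post-LN is only where `layer_norm` sits: Post-LN normalizes *after* each residual add, which is the detail Pre-LN changes for training stability.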

  ### Architecture Highlights

  ### Performance Benchmarks

+ Measured on Apple M2 Pro (16GB unified memory):

+ | Metric | Value |
+ |--------|-------|
+ | **Model Size** | 53.0M parameters |
+ | **Memory (fp32)** | 202.1 MB |
+ | **Memory (fp16)** | 101.1 MB |
+ | **Training Throughput** | 27,355 tokens/sec |
+ | **Batch Processing** | 13.36 batches/sec (batch=4, seq=512) |
+ | **Inference Speed** | 169.9 tokens/sec |
+ | **Generation Latency** | ~0.59s per 100 tokens |
+ | **Activation Memory** | 843 MB (batch=4, seq=512) |
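The table's figures are mutually consistent, which is a quick sanity check worth making explicit: batches/sec × batch size × sequence length reproduces the token throughput, and parameter count × 4 bytes reproduces the fp32 size in MiB:

```python
# Throughput: 13.36 batches/sec at batch=4, seq=512
tokens_per_sec = 13.36 * 4 * 512
print(round(tokens_per_sec))   # 27361, vs. the reported 27,355 tokens/sec

# Size: 52,990,464 parameters at 4 bytes each, expressed in MiB
fp32_mib = 52_990_464 * 4 / 2**20
print(round(fp32_mib, 1))      # 202.1, matching the fp32 row
```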

+ > **Note:** Benchmarks measured at checkpoint 20000. This release (checkpoint 35000) includes additional knowledge distillation training.
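The knowledge-distillation stage mentioned in the note (and in Training above) is, in the usual formulation, a blend of hard-label cross-entropy and KL divergence to the teacher's temperature-softened distribution. A NumPy sketch; the temperature `T` and mixing weight `alpha` are illustrative, since the card does not state the hyperparameters used:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """alpha * CE(student, hard labels) + (1 - alpha) * T^2 * KL(teacher || student).

    T and alpha are illustrative defaults, not values from the card.
    """
    p_s = softmax(student_logits)
    ce = -np.log(p_s[np.arange(len(targets)), targets]).mean()
    ps_T = softmax(student_logits, T)   # student, softened by temperature T
    pt_T = softmax(teacher_logits, T)   # teacher, softened by temperature T
    kl = (pt_T * (np.log(pt_T) - np.log(ps_T))).sum(-1).mean() * T * T
    return alpha * ce + (1 - alpha) * kl

rng = np.random.default_rng(0)
student = rng.normal(size=(4, 50257))  # batch of student logits over GPT-2 vocab
teacher = rng.normal(size=(4, 50257))  # matching teacher logits
targets = rng.integers(0, 50257, size=4)
loss = distill_loss(student, teacher, targets)
print(float(loss))
```

In this setup the teacher logits would come from GPT-OSS-20B responses obtained via the Groq API, as the Training section states.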

  ## Usage