---
license: mit
datasets:
- roneneldan/TinyStories
language:
- en
library_name: pytorch
tags:
- text-generation-inference
- gemma3
metrics:
- perplexity
pipeline_tag: text-generation
---

# Gemma3 270M - TinyStories - PyTorch From-Scratch Implementation

A PyTorch implementation of Google DeepMind's Gemma3 270M model built entirely from scratch, featuring a compact transformer architecture.

## Model Overview

This is a from-scratch implementation of the Gemma3 270M architecture that demonstrates modern transformer techniques, including sliding-window attention, RoPE positional encoding, and mixed-precision training. The model keeps the core architectural principles of the official Gemma3 270M while making practical choices for training efficiency.
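
As an illustration of one of these techniques, here is a minimal, self-contained sketch of rotary positional encoding (RoPE). The function name and shapes are illustrative only and are not taken from this repository's code:

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary positional embeddings to x of shape (seq_len, head_dim)."""
    seq_len, head_dim = x.shape
    half = head_dim // 2
    # Per-dimension rotation frequencies, decaying geometrically across dims
    inv_freq = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()
    # Rotate each (x1, x2) pair by its position-dependent angle
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

At position 0 the rotation angle is zero, so the input passes through unchanged; later positions apply norm-preserving rotations whose relative angles encode distance in the attention dot products.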

## Training Data

### Dataset
- **Source**: TinyStories dataset (~600M tokens)
- **Tokenizer**: GPT-2 tokenizer, chosen for faster data processing than the Gemma3 270M tokenizer
- **Format**: Memory-mapped binary files for efficient loading

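
A hedged sketch of how memory-mapped token binaries are typically sampled into training batches; the file layout, `uint16` dtype, and function name are assumptions, not this repository's exact code:

```python
import numpy as np
import torch

def get_batch(path: str, batch_size: int, block_size: int):
    # np.memmap pages data in lazily, so a ~600M-token file never fully loads into RAM
    data = np.memmap(path, dtype=np.uint16, mode="r")
    starts = torch.randint(len(data) - block_size, (batch_size,)).tolist()
    x = torch.stack([torch.from_numpy(data[i : i + block_size].astype(np.int64)) for i in starts])
    # Targets are the same windows shifted one token to the right
    y = torch.stack([torch.from_numpy(data[i + 1 : i + 1 + block_size].astype(np.int64)) for i in starts])
    return x, y
```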
### Model Details

- The base model was trained solely on the TinyStories dataset for 10 hours on an A6000 GPU.
- Task: text-generation
- Language: en
- Dataset: https://huggingface.co/datasets/roneneldan/TinyStories

## Training Procedure

### Training Hyperparameters

- **learning_rate:** 1e-4
- **max_iters:** 150000
- **warmup_steps:** 1000
- **min_lr:** 5e-4
- **eval_iters:** 500
- **batch_size:** 32
- **block_size:** 128
- **gradient_accumulation_steps:** 32
- **device:** cuda
- **dtype:** bfloat16
- **ptdtype:** float32

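
A hedged sketch of how one optimizer step with gradient accumulation and bfloat16 autocast could look with the hyperparameters above (`gradient_accumulation_steps` 32, `dtype` bfloat16); `model`, `optimizer`, and `get_batch` are placeholders, not the repository's actual objects:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, get_batch, accum_steps=32, device="cpu"):
    optimizer.zero_grad(set_to_none=True)
    running = 0.0
    for _ in range(accum_steps):
        x, y = get_batch()
        # bfloat16 autocast is only enabled on CUDA; the CPU path runs in float32
        with torch.autocast(device_type=device, dtype=torch.bfloat16, enabled=(device == "cuda")):
            logits = model(x)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
        # Divide so the summed gradients equal the mean over all micro-batches
        (loss / accum_steps).backward()
        running += loss.item()
    optimizer.step()
    return running / accum_steps
```

Accumulating over 32 micro-batches of 32 sequences gives an effective batch of 1024 sequences per optimizer step without the memory cost of a single large batch.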
## Evaluation Results

Detailed training analysis and model evaluation can be found in [`results/results_interpertation.md`](results/results_interpertation.md), which includes:

- **📊 Loss Analysis**: Training and validation loss curves showing smooth convergence without overfitting
- **📝 Qualitative Evaluation**: Story generation examples demonstrating coherent narrative abilities
- **📈 Training Dynamics**: Gradient norm analysis and learning rate schedule evaluation
- **🎯 Model Performance**: Final perplexity metrics and generation quality assessment

**Key Results:**
- Final train loss: 1.8 (perplexity ~6.0)
- Final validation loss: 2.0 (perplexity ~7.4)
- Good generalization, with no overfitting observed
- Coherent story generation with proper grammar and age-appropriate content

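
The perplexity figures follow directly from the losses, since perplexity is the exponential of the cross-entropy loss (in nats):

```python
import math

def perplexity(loss: float) -> float:
    # perplexity = e^loss for cross-entropy measured in nats
    return math.exp(loss)
```

For example, a train loss of 1.8 gives e^1.8 ≈ 6.05, and a validation loss of 2.0 gives e^2.0 ≈ 7.39, matching the rounded figures above.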
## Usage

**Code Snippet**
```python
# Import necessary libraries
import torch
import tiktoken

from architecture import model_config, Gemma3Model

# Tokenizer
enc = tiktoken.get_encoding("gpt2")

# Load the model
model_config["dtype"] = torch.bfloat16
model = Gemma3Model(model_config)  # re-create the model with the same config
device = "cuda" if torch.cuda.is_available() else "cpu"
best_model_params_path = "best_model_params.pt"
model.load_state_dict(torch.load(best_model_params_path, map_location=torch.device(device)))  # load the best checkpoint weights

# Inference
sentence = "Dad was telling the kids an adventure tale about a pirate ship"
context = torch.tensor(enc.encode_ordinary(sentence)).unsqueeze(dim=0)
y = model.generate(context, 200)
print(enc.decode(y.squeeze().tolist()))
```

**Result**

```text
Dad was telling the kids an adventure tale about a pirate ship coming to the shore.

Suddenly, Dad showed John many pictures and showed him what to do. She chose a film for them to watch.
John was excited. He had never seen one before and was intrigued.

When they arrived, Dad handed John bookshelf safely. "What have you got, John?", asked Dad. John eagerly answered back to Dad. Dad explained that the businessman was a dinosaur that had been guarded by the sea.

John thought about this for a reason and knew he was too happy with this movie. He said to Dad, "Life is a really fun experience".
His Dad nodded and said, "Yes, you can accept anything special. It was a very comfortable motorcycle."Once upon a time, there was a nice friendly little boy named John. Every day he would have endless their conversation and encouragement. He was so full of joy and excitement taking action.

Today, John was playing in the backyard when
```
111
+ ## Limitations and Biases
112
+
113
+ - This model is only intended for understanding the architecture of a transformer based model from scratch and get the intuition
114
+ - Inference is super slow as KV cache is absent
115
+ - TinyStories is synthetic data generated by GPT-3.5/4
116
+ - May have inherited biases or patterns from the generating model
117
+ - Limited diversity compared to real human-written content
118
+ - Repetitive narrative structures typical of children's literature
119
+ - 270M parameters is relatively small by modern standards
120
+ - Limited reasoning capabilities compared to larger models
121
+
122
+
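
The inference-speed limitation is structural: without a KV cache, every new token re-runs the model over the entire prefix. A schematic greedy-decoding loop (hypothetical, not the repository's actual `generate`) makes the repeated full forward pass explicit:

```python
import torch

@torch.no_grad()
def generate(model, context: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
    """Greedy decoding without a KV cache; context has shape (batch, seq_len)."""
    for _ in range(max_new_tokens):
        logits = model(context)  # O(seq_len) attention work redone at every step
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        context = torch.cat([context, next_id], dim=1)
    return context
```

With a KV cache, each step would instead feed only the newest token and reuse the cached keys and values for the prefix.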
## Training Infrastructure

For a complete guide covering the entire process, from data tokenization to inference, please refer to the [GitHub repository](https://github.com/di37/gemma3-270M-tinystories-pytorch).

## Last Update

2025-09-06

## Citation

```bibtex
@misc{gemma3-270m-pytorch,
  title={Gemma3 270M - TinyStories - PyTorch From-Scratch Implementation},
  author={Doula Isham Rashik Hasan},
  year={2025},
  howpublished={\url{https://github.com/di37/gemma3-270M-tinystories-pytorch}},
  note={Implementation of Google DeepMind's Gemma3 270M from scratch, pre-trained on TinyStories}
}
```