justjuu committed on
Commit 073e95f · verified · 1 Parent(s): be75174

Upload folder using huggingface_hub

Files changed (6)
  1. README.md +215 -0
  2. config.json +11 -0
  3. model.safetensors +3 -0
  4. samples.txt +30 -0
  5. special_tokens_map.json +3 -0
  6. tokenizer_config.json +6 -0
README.md ADDED
@@ -0,0 +1,215 @@
---
license: mit
tags:
- pytorch
- causal-lm
- gpt
- small-language-model
- decoder-only
language:
- en
pipeline_tag: text-generation
---

# Pico-GPT

A small GPT-style decoder-only language model (~49.2M parameters) trained from scratch on OpenWebText.

## Model Details

| Property | Value |
|----------|--------|
| **Architecture** | Decoder-only Transformer with Pre-LayerNorm |
| **Parameters** | ~49,218,816 |
| **Layers** | 6 |
| **Hidden Size** | 384 |
| **FFN Hidden Size** | 1536 |
| **Attention Heads** | 6 |
| **Head Dimension** | 64 |
| **Context Length** | 128 tokens |
| **Vocabulary** | 50257 (GPT-2) |
| **Flash Attention** | ✅ Enabled |
| **Dropout** | 0.1 |
| **Bias** | Disabled |

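The parameter count in the table can be reproduced from the config values with a few lines of arithmetic. This is one breakdown consistent with the table's figure (it assumes untied embeddings, bias disabled everywhere, and weight-only LayerNorms; the authoritative layout lives in `pico_gpt/model.py`):

```python
# One parameter breakdown that reproduces the table's ~49,218,816 figure,
# assuming untied embeddings, no bias terms, and weight-only LayerNorms.
vocab_size, n_embd, n_layer, ffn_dim = 50257, 384, 6, 1536

tok_emb = vocab_size * n_embd        # token embedding matrix
lm_head = vocab_size * n_embd        # untied output projection (see Limitations)
attn = 4 * n_embd * n_embd           # Q, K, V, and output projections
ffn = 2 * n_embd * ffn_dim           # up- and down-projection
norms = 2 * n_embd                   # two LayerNorm weights per block

total = tok_emb + lm_head + n_layer * (attn + ffn + norms)
print(total)  # → 49218816
```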
## Training Objective

The model was trained using **causal language modeling (next-token prediction)**. The loss function is cross-entropy over the vocabulary.

For a given sequence of tokens `x_1, x_2, ..., x_n`, the model is trained to predict `x_{i+1}` given `x_1, ..., x_i`.

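As a sketch of this objective, the shifted cross-entropy loss can be written in plain PyTorch. The tensors here are illustrative stand-ins, not the repo's training loop:

```python
import torch
import torch.nn.functional as F

# Toy next-token prediction loss: position i is trained to predict token i+1,
# so logits and targets are shifted against each other by one position.
torch.manual_seed(0)
vocab_size, seq_len = 50257, 8
tokens = torch.randint(0, vocab_size, (1, seq_len))
logits = torch.randn(1, seq_len, vocab_size)  # stand-in for model output

shift_logits = logits[:, :-1, :]   # predictions for positions 1..n-1
shift_targets = tokens[:, 1:]      # the tokens those positions should predict
loss = F.cross_entropy(
    shift_logits.reshape(-1, vocab_size),
    shift_targets.reshape(-1),
)
print(f"loss = {loss.item():.2f}")  # near ln(50257) ≈ 10.8 for random logits
```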
## Dataset

### Source
- **Dataset:** OpenWebText
- **Hugging Face:** `Skylion007/openwebtext`
- **Mode:** Streaming preprocessing
- **License:** Same as OpenAI's GPT-2 dataset

### Preprocessing Pipeline
- **Tokenizer:** GPT-2 (tiktoken)
- **Tokenization:** Streaming, incremental
- **EOS Token:** Appended after each document
- **Text Cleaning:** Minimal (strip whitespace, skip empty strings)
- **Sharding:** Binary shards (uint16), 5M tokens per shard
- **Train/Val Split:** Deterministic split by token count
- **Memory Mapping:** Enabled for efficient loading

### Dataset Statistics
- **Total Tokens Collected:** 1B tokens
- **Training Tokens:** 950M tokens
- **Validation Tokens:** 50M tokens
- **Training Shards:** ~190 files (train_000.bin to train_189.bin)
- **Validation Shard:** val.bin
- **Data Type:** uint16 (supports memory mapping)

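A minimal sketch of the uint16-on-disk layout described above: token ids are written as raw binary and read back with a memory map. The filename and tiny token list are illustrative, not the repo's actual loader (real shards hold 5M tokens). uint16 works here because every GPT-2 id (0–50256) fits below 2**16:

```python
import numpy as np

# Write a tiny illustrative shard of GPT-2 token ids as raw uint16 ...
tokens = np.array([50256, 464, 2003, 286, 9552], dtype=np.uint16)
tokens.tofile("shard_example.bin")  # hypothetical filename

# ... then memory-map it back without loading the whole file into RAM.
data = np.memmap("shard_example.bin", dtype=np.uint16, mode="r")
print(len(data), data[:3].tolist())  # → 5 [50256, 464, 2003]
```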
## Training Configuration

### Hyperparameters
| Parameter | Value |
|-----------|--------|
| **Optimizer** | AdamW |
| **Learning Rate** | 3e-4 |
| **Weight Decay** | 0.1 |
| **Betas** | (0.9, 0.95) |
| **Max Steps** | N/A |
| **Batch Size** | 64 |
| **Context Window** | 128 |
| **Gradient Clipping** | 1.0 |
| **Checkpoint Interval** | N/A |
| **Log Interval** | N/A |

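The table's optimizer settings translate to a straightforward AdamW call. The model below is a stand-in module; the actual training script may differ in details such as excluding norms and embeddings from weight decay:

```python
import torch

model = torch.nn.Linear(384, 384, bias=False)  # stand-in for the GPT model

# Hyperparameters taken directly from the table above.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    weight_decay=0.1,
    betas=(0.9, 0.95),
)

# Gradient clipping at 1.0 is applied between backward() and step().
loss = model(torch.randn(4, 384)).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
```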
### Training Results
| Metric | Value |
|--------|--------|
| **Final Training Loss** | N/A |
| **Training Time** | N/A |
| **Hardware** | NVIDIA A100 (20GB) or equivalent |

## Model Files

| File | Description |
|-------|-------------|
| `model.safetensors` | Model weights in safetensors format (secure, fast loading) |
| `config.json` | Model architecture configuration |
| `training_config.json` | Training hyperparameters and results |
| `training_log.csv` | Training metrics over time (step, loss, elapsed_time) |
| `samples.txt` | Sample generations from the trained model |
| `tokenizer_config.json` | Tokenizer configuration |
| `special_tokens_map.json` | Special tokens mapping |

## Usage

### Loading with safetensors:

```python
import torch
from safetensors.torch import load_file
import json

# Load config
with open("config.json", "r") as f:
    config = json.load(f)

# Load weights
state_dict = load_file("model.safetensors")

# Create model (requires custom model class from pico_gpt/model.py)
from pico_gpt.model import GPT
from pico_gpt.config import ModelConfig

config.pop("model_type", None)  # metadata key; likely not a ModelConfig field
model = GPT(ModelConfig(**config))
model.load_state_dict(state_dict)
model.eval()
```

### Text Generation:

```python
import torch
import tiktoken

# Load tokenizer
enc = tiktoken.get_encoding("gpt2")

# Prepare prompt (model is the Pico-GPT instance loaded above)
context_length = 128  # model context window, see config.json
prompt = "The future of artificial intelligence is"
tokens = enc.encode(prompt)
tokens = tokens[-context_length:]  # Truncate to context length if needed
idx = torch.tensor([tokens], dtype=torch.long)

# Generate
with torch.no_grad():
    generated = model.generate(
        idx,
        max_new_tokens=100,
        temperature=0.8,
        eos_token_id=enc.eot_token,
    )

# Decode result
generated_text = enc.decode(generated[0].tolist())
print(generated_text)
```

### Loading Checkpoint:

```python
import torch

# Load checkpoint (the path is a placeholder for an actual step number)
checkpoint = torch.load("checkpoint_step_<N>.pt", map_location="cpu")
model_state = checkpoint["model_state_dict"]
config = checkpoint["config"]

# Load training config if needed
training_config = checkpoint.get("training_config", {})

# Use with custom GPT class
from pico_gpt.model import GPT
from pico_gpt.config import ModelConfig

# If the checkpoint stores config as a plain dict, rebuild the dataclass first
model = GPT(ModelConfig(**config) if isinstance(config, dict) else config)
model.load_state_dict(model_state)
```

## Limitations

- **Small Model Size:** ~49.2M parameters limit reasoning capability
- **Short Context:** 128-token context window limits long-range dependencies
- **Single Dataset:** Trained only on web text (OpenWebText subset)
- **No Instruction Tuning:** Not aligned for chat/instruction following
- **Potential Biases:** May contain biases present in the training data
- **No Weight Tying:** Embedding and output layers have separate parameters

## Future Work

- [ ] Convert to native Hugging Face GPT-2 architecture
- [ ] Increase model size and context length
- [ ] Add instruction tuning / alignment
- [ ] Evaluation on downstream benchmarks (perplexity, etc.)
- [ ] Fine-tune for specific tasks
- [ ] Implement more sampling strategies (top-k, top-p)
- [ ] Add support for streaming inference

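The top-k / top-p item above is a small change to the sampling step. A hedged sketch of top-k filtering over one step's logits (the function name and tensors are illustrative; this is not code from `pico_gpt`):

```python
import torch

def sample_top_k(logits: torch.Tensor, k: int, temperature: float = 0.8) -> int:
    """Keep only the k most likely tokens, renormalize, and sample one id."""
    logits = logits / temperature
    topk_vals, topk_idx = torch.topk(logits, k)   # filter to the k best logits
    probs = torch.softmax(topk_vals, dim=-1)      # renormalize over survivors
    choice = torch.multinomial(probs, num_samples=1)
    return int(topk_idx[choice])

torch.manual_seed(0)
logits = torch.randn(50257)  # stand-in for one step of model output
token_id = sample_top_k(logits, k=50)
print(0 <= token_id < 50257)  # → True
```

Top-p (nucleus) sampling is analogous: sort the probabilities, keep the smallest prefix whose cumulative mass exceeds p, and sample from that set.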
## Citation

```bibtex
@misc{pico-gpt,
  title={Pico-GPT: A Small Language Model from Scratch},
  author={Your Name},
  year={2026},
  howpublished={\url{https://huggingface.co/YOUR_USERNAME/pico-gpt}},
}
```

## Acknowledgments

- This project uses the **GPT-2 tokenizer** from OpenAI's `tiktoken` library
- Dataset: **OpenWebText** by Skylion007
- Architecture inspired by **GPT**, **GPT-2**, and **nanoGPT**

---

*For training details, see `training_config.json` and `training_log.csv`.*
*Model files use the safetensors format for safe and efficient loading.*
config.json ADDED
@@ -0,0 +1,11 @@
{
  "model_type": "custom_gpt",
  "vocab_size": 50257,
  "n_layer": 6,
  "n_head": 6,
  "n_embd": 384,
  "context_length": 128,
  "dropout": 0.1,
  "bias": false,
  "ffn_dim": 1536
}
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d440a4046ce1be803b5e04d4bc9bd6863d58f1258ae85e29746dde63c62e7cea
size 197098232
samples.txt ADDED
@@ -0,0 +1,30 @@
Generated Samples
==================================================

### Sample 1
**Prompt:** The future of artificial intelligence is
**Generated:** easily forgotten. The ruling will be crucial to the moment of our presidency, when you have a tradition of replaced patenting a visually impaired former boss with a super-apartie.

This is the right moment for the day. The site

### Sample 2
**Prompt:** Once upon a time
**Generated:** , the world is moving upward, and the planet is on thepak--the sun of summer, the howl of January.

About the same time, the biggest change is in the way of the lightning Kingdom: The last row of our

### Sample 3
**Prompt:** The best way to learn programming is
**Generated:** to make your school in our favorite languages.

The such anStrength is that you can learn languages and teach English. You need to have huge knowledge of what they use in your learning skills or their own methods.

The next step to

### Sample 4
**Prompt:** In the field of machine learning
**Generated:** , the term “political philosophy” often associated with the great post- AND State’s distinction between two offices of government and foreign policy. The word “unexist” is used to refer to the laws of nature, in

### Sample 5
**Prompt:** One of the most important concepts is
**Generated:** that you can tell the truth about the stuff that would happen to you of how much you played. If you were at a time where you had a good conversation with your husband and who did you do? If you were a kid, you’
special_tokens_map.json ADDED
@@ -0,0 +1,3 @@
{
  "eos_token": "<|endoftext|>"
}
tokenizer_config.json ADDED
@@ -0,0 +1,6 @@
{
  "tokenizer_class": "GPT2Tokenizer",
  "eos_token": "<|endoftext|>",
  "model_max_length": 128,
  "tokenizer_type": "gpt2"
}