saumilyajj committed · verified · Commit a5986f2 · 1 Parent(s): 47df44d

Update README.md

Files changed (1): README.md (+204 −204)
README.md CHANGED
# GPT from Scratch: A PyTorch Implementation

A comprehensive implementation of GPT-style transformer models built from scratch using PyTorch. This project demonstrates the core concepts of transformer architecture, attention mechanisms, and language modeling through hands-on experimentation.

## 🚀 Project Overview

This repository contains:

- **Two GPT implementations** with increasing complexity (GPTv1 and GPTv2)
- **Parallel data processing pipeline** for the OpenWebText dataset
- **Character-level tokenization** system
- **Training persistence and checkpointing**
- **Complete experimentation workflow**

## 📁 Project Structure

```
├── src/
│   ├── data-extraction.py           # Full dataset processing (OpenWebText)
│   └── data-extraction-2.py         # Sampled dataset processing (1% for quick iteration)
├── notebooks/
│   ├── GPTv1.ipynb                  # Basic GPT transformer implementation
│   ├── GPTv2.ipynb                  # Enhanced GPT with training persistence
│   └── ...                          # Additional experimental notebooks
├── artifacts/
│   ├── vocab.txt                    # Character vocabulary
│   ├── training_data.json           # Training metrics and history
│   ├── model-01.pkl                 # Saved model checkpoint
│   ├── output_train.txt             # Processed training data
│   └── output_val.txt               # Processed validation data
├── data/
│   └── MNIST/                       # Standard datasets
├── docs/
│   └── .github/
│       └── copilot-instructions.md  # AI agent guidelines
├── gradio_app.py                    # Interactive web interface for text generation
├── requirements.txt                 # Project dependencies
└── LICENSE                          # MIT License
```

## 🛠️ Installation

1. **Clone the repository:**

```bash
git clone https://huggingface.co/saumilyajj/GTP-on-Reddit
cd GTP-on-Reddit
```

2. **Install dependencies:**

```bash
pip install -r requirements.txt
```

3. **Download sample data:**

```bash
# Place your OpenWebText .xz files in the 'openwebtext' directory
# Or use the provided wizard-of-oz.txt for quick testing
```

## 🏃‍♂️ Quick Start

### 1. Data Processing

For quick experimentation (1% sample):

```bash
python src/data-extraction-2.py
```

For full dataset processing:

```bash
python src/data-extraction.py
```

### 2. Model Training

Open and run the Jupyter notebooks:

**GPTv1 (Basic Implementation):**

- Open `notebooks/GPTv1.ipynb`
- Focuses on core transformer concepts
- Uses `wizard-of-oz.txt` for training

**GPTv2 (Advanced Implementation):**

- Open `notebooks/GPTv2.ipynb`
- Includes training persistence and better monitoring
- Uses processed OpenWebText data
- Memory-mapped file handling for large datasets
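The memory-mapped file handling can be sketched as follows. This is a minimal illustration of the idea rather than the notebook's exact code; the function name and chunk-size arithmetic are assumptions:

```python
import mmap
import random

def get_random_chunk(path, block_size, batch_size):
    """Memory-map the text file and read one random span per call,
    so a large OpenWebText split never has to fit in RAM."""
    span = block_size * batch_size
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        start = random.randint(0, len(mm) - span - 1)
        mm.seek(start)
        data = mm.read(span)
    # Drop bytes that fall inside a multi-byte UTF-8 character at the edges
    return data.decode("utf-8", errors="ignore")
```

The decoded chunk would then be encoded with the character vocabulary and reshaped into `(batch_size, block_size)` training tensors.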

### 3. Interactive Web Interface

Launch the Gradio web interface for real-time text generation:

```bash
python gradio_app.py
```

**Features:**

- 🎯 Real-time text generation with your trained model
- 🌡️ Temperature control for creativity adjustment
- 🎲 Seed control for reproducible results
- 📊 Model information and architecture details
- 💡 Pre-built example prompts to get started

Access the interface at `http://localhost:7860` in your browser.
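Under the hood, temperature and seed controls typically reduce to logit scaling and RNG seeding. A minimal sketch of that mechanism (the function name is illustrative, not the app's actual API):

```python
import torch

def sample_next_token(logits, temperature=1.0, seed=None):
    """Scale logits by 1/temperature, then sample from the softmax.
    Lower temperature -> sharper distribution -> more conservative text."""
    if seed is not None:
        torch.manual_seed(seed)  # makes the draw reproducible
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```

With the same seed and temperature, repeated calls pick the same token, which is what makes the interface's "reproducible results" option work.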

## 🏗️ Architecture Details

### Data Pipeline

- **Parallel Processing**: Uses `ProcessPoolExecutor` for efficient `.xz` file handling
- **Train/Validation Split**: 90/10 split with optional sampling
- **Character-Level Tokenization**: Direct character-to-integer mapping
- **Windows Compatibility**: Includes `freeze_support()` for multiprocessing
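The character-to-integer mapping fits in a few lines. In this sketch the sample string stands in for the real corpus, and the `chars` list is presumably what `artifacts/vocab.txt` stores:

```python
# Character-level tokenization: every distinct character gets an integer id.
text = "The Wizard of Oz"  # stand-in for the real training corpus

chars = sorted(set(text))                     # the vocabulary
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> int
itos = {i: ch for ch, i in stoi.items()}      # int -> char

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)
```

Round-tripping is lossless for any string drawn from the vocabulary, which is why the same `vocab.txt` must be used at training and generation time.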

### Model Architecture

- **Multi-Head Attention**: Custom implementation with proper masking
- **Feed-Forward Networks**: Standard transformer FFN with dropout
- **Positional Embeddings**: Learned position encodings
- **Layer Normalization**: Applied throughout the network
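A single causal attention head in the spirit of this implementation might look like the sketch below; the class name and layer sizes are illustrative, and the notebooks' exact code may differ:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One causal self-attention head: each position attends only to
    itself and earlier positions via a lower-triangular mask."""
    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, _ = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))        # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))  # no peeking ahead
        wei = F.softmax(wei, dim=-1)
        return wei @ v                                                # (B, T, head_size)
```

Multi-head attention runs `n_head` of these in parallel and concatenates their outputs before a final projection.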

### Key Hyperparameters

```python
block_size = 8        # Context window size
batch_size = 128      # Training batch size
n_embd = 384          # Embedding dimension
n_head = 16           # Number of attention heads (32 in some versions)
n_layer = 16          # Number of transformer layers (32 in some versions)
dropout = 0.2         # Dropout rate
learning_rate = 3e-4  # Learning rate
```

## 📊 Training Features

- **Progress Tracking**: `tqdm` integration for real-time monitoring
- **Training Persistence**: JSON-based training history (GPTv2)
- **Model Checkpointing**: Pickle serialization for easy loading
- **Evaluation Loops**: Separate training/validation evaluation
- **Device Agnostic**: Automatic CUDA/CPU detection
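Pickle-based checkpointing can be as simple as the sketch below. The helper names are illustrative; the default path matches the `model-01.pkl` artifact, and unpickling requires the model's class definition to be importable:

```python
import pickle

def save_checkpoint(model, path="artifacts/model-01.pkl"):
    """Serialize the whole model object to disk."""
    with open(path, "wb") as f:
        pickle.dump(model, f)

def load_checkpoint(path="artifacts/model-01.pkl"):
    """Restore a saved model; the defining class must be importable."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

For larger projects, saving `model.state_dict()` with `torch.save` is the more robust convention, but whole-object pickling keeps the notebooks simple.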

## 🔧 Usage Examples

### Training a Model

```python
# Pick a device and configure hyperparameters
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Initialize the model
model = GPTLanguageModel(vocab_size)
model = model.to(device)

# Train with monitoring
for step in range(max_iters):
    # ... training loop with loss tracking ...
    # ... evaluation and checkpointing every eval_iters steps ...
```

### Generating Text

```python
# Generate text from the trained model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
generated = decode(model.generate(context, max_new_tokens=500)[0].tolist())
print(generated)
```

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- Inspired by Andrej Karpathy's "Let's build GPT" series
- Based on the "Attention Is All You Need" paper
- Uses the OpenWebText dataset for training
- Built with the PyTorch framework

## 📚 Learning Resources

- [Attention Is All You Need Paper](https://arxiv.org/abs/1706.03762)
- [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
- [Andrej Karpathy's GPT Tutorial](https://www.youtube.com/watch?v=kCc8FmEb1nY)

---

**Happy Learning! 🎓**