Krishnakanth1993 committed
Commit 0ede4e9
1 Parent(s): 71a75cd

Initial commit

Files changed (5)
  1. .gitignore +64 -0
  2. README.md +239 -0
  3. app.py +163 -0
  4. inference.py +101 -0
  5. model.py +216 -0
.gitignore ADDED
@@ -0,0 +1,64 @@
+ # Python
+ __pycache__/
+ *.py[cod]
+ *$py.class
+ *.so
+ .Python
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+
+ # Virtual environments
+ venv/
+ env/
+ ENV/
+ .venv
+
+ # IDE
+ .vscode/
+ .idea/
+ *.swp
+ *.swo
+ *~
+
+ # Model checkpoints
+ *.pth
+ *.pt
+ *.ckpt
+ *.bin
+ *.safetensors
+
+ # Data files
+ input.txt
+ *.txt
+ *.csv
+ *.json
+
+ # Jupyter Notebook
+ .ipynb_checkpoints/
+ *.ipynb_checkpoints/
+
+ # OS
+ .DS_Store
+ Thumbs.db
+
+ # Logs
+ *.log
+ logs/
+
+ # Hugging Face cache
+ .cache/
+ hf_cache/
+
README.md CHANGED
@@ -11,3 +11,242 @@ license: mit
  ---
 
  Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Sentence Completion with GPT
+
+ A Gradio web application for sentence completion using a custom GPT model architecture. This app can use either a trained model checkpoint or pretrained GPT-2 weights.
+
+ ## Features
+
+ - **Sentence Completion**: Generate text completions for any given prompt
+ - **Customizable Generation**: Control generation parameters (temperature, top-k, max tokens)
+ - **Model Flexibility**: Supports both saved trained models and pretrained GPT-2
+ - **Easy Deployment**: Ready for deployment on Hugging Face Spaces
+
+ ## Model Architecture
+
+ This app uses a custom GPT implementation based on the GPT-2 architecture:
+ - **Parameters**: ~124M (for the gpt2 base model)
+ - **Vocab Size**: 50,257 tokens
+ - **Block Size**: 1024 tokens (max sequence length)
+ - **Architecture**: 12 layers, 12 attention heads, 768 embedding dimension
+
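The ~124M figure can be sanity-checked directly from these hyperparameters. A quick back-of-the-envelope count, assuming the token-embedding and output-head weights are tied (as they are in `model.py`):

```python
# GPT-2 base hyperparameters (matching GPTConfig in model.py)
vocab_size, block_size, n_layer, n_embd = 50257, 1024, 12, 768

wte = vocab_size * n_embd  # token embedding table (tied with lm_head, so counted once)
wpe = block_size * n_embd  # position embedding table
# per transformer block: qkv projection, attention output projection,
# two MLP linears (4x expansion), and two LayerNorms (weight + bias each)
attn = n_embd * 3 * n_embd + 3 * n_embd + n_embd * n_embd + n_embd
mlp = n_embd * 4 * n_embd + 4 * n_embd + 4 * n_embd * n_embd + n_embd
ln = 2 * 2 * n_embd
blocks = n_layer * (attn + mlp + ln)
ln_f = 2 * n_embd  # final LayerNorm

total = wte + wpe + blocks + ln_f
print(f"{total:,}")  # 124,439,808
```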
+ ## Environment Setup
+
+ ### Prerequisites
+
+ - Python 3.8 or higher
+ - pip (Python package manager)
+ - (Optional) CUDA-enabled GPU for faster inference
+
+ ### Step 1: Clone or Download the Repository
+
+ ```bash
+ git clone <repository-url>
+ cd first_llm_124
+ ```
+
+ Or download and extract the project files to a directory.
+
+ ### Step 2: Create a Virtual Environment (Recommended)
+
+ Using a virtual environment helps avoid conflicts with other projects:
+
+ **On Windows:**
+ ```bash
+ python -m venv venv
+ venv\Scripts\activate
+ ```
+
+ **On macOS/Linux:**
+ ```bash
+ python3 -m venv venv
+ source venv/bin/activate
+ ```
+
+ ### Step 3: Install Dependencies
+
+ Install all required packages from the requirements file:
+
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ Or install packages individually (quote the specifiers so the shell does not treat `>` as a redirect):
+ ```bash
+ pip install "gradio>=4.0.0"
+ pip install "torch>=2.0.0"
+ pip install "transformers>=4.30.0"
+ pip install "tiktoken>=0.5.0"
+ pip install "huggingface_hub>=0.34.0"
+ ```
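For reference, a `requirements.txt` matching the pins above would look like this (the file itself is referenced by the app but is not part of this commit):

```text
gradio>=4.0.0
torch>=2.0.0
transformers>=4.30.0
tiktoken>=0.5.0
huggingface_hub>=0.34.0
```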
+
+ ### Step 4: Verify Installation
+
+ Verify that all packages are installed correctly:
+
+ ```bash
+ python -c "import torch; import gradio; import transformers; import tiktoken; print('All packages installed successfully!')"
+ ```
+
+ ### Step 5: Prepare Model Directory (Optional)
+
+ If you have a trained model, create a `model` directory and place your checkpoint there:
+
+ ```bash
+ mkdir model
+ # Place your model.pth file in the model/ directory
+ ```
+
+ ## Installation
+
+ 1. Follow the [Environment Setup](#environment-setup) steps above
+ 2. Ensure all dependencies are installed
+ 3. (Optional) Place your trained model checkpoint in the `model/` directory
+
+ ## Usage
+
+ ### Running Locally
+
+ ```bash
+ python app.py
+ ```
+
+ The app will start a local server. Open the provided URL in your browser.
+
+ ### Model Loading
+
+ The app automatically tries to load models in this order:
+ 1. A saved checkpoint file (it checks `./model/model.pth`, `model.pt`, `checkpoint.pth`, `checkpoint.pt`, and `gpt_model.pth`)
+ 2. Pretrained GPT-2 from Hugging Face (fallback)
+
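The lookup is a simple first-match scan over those candidate paths; a self-contained sketch mirroring the logic in `app.py`:

```python
import os

# Candidate checkpoint locations, checked in order (the same list app.py uses)
CHECKPOINT_PATHS = [
    './model/model.pth',
    'model.pt',
    'checkpoint.pth',
    'checkpoint.pt',
    'gpt_model.pth',
]


def find_checkpoint(paths=CHECKPOINT_PATHS):
    """Return the first existing checkpoint path, or None to fall back to pretrained GPT-2."""
    for path in paths:
        if os.path.exists(path):
            return path
    return None
```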
+ ### Saving a Trained Model
+
+ If you have a trained model, you can save it using:
+
+ ```python
+ import torch
+ import os
+
+ # Create the model directory if it doesn't exist
+ os.makedirs('model', exist_ok=True)
+
+ # After training your model, save the checkpoint
+ checkpoint = {
+     'model_state_dict': model.state_dict(),
+     'config': {
+         'block_size': model.config.block_size,
+         'vocab_size': model.config.vocab_size,
+         'n_layer': model.config.n_layer,
+         'n_head': model.config.n_head,
+         'n_embd': model.config.n_embd,
+     }
+ }
+ torch.save(checkpoint, './model/model.pth')
+ print("Model saved successfully to ./model/model.pth!")
+ ```
+
+ ### Loading a Saved Model
+
+ Place your saved model checkpoint (`.pth` or `.pt` file) in the `model/` directory. The app will automatically detect and load it from `./model/model.pth`.
+
+ ## Parameters
+
+ - **Max Tokens**: Maximum number of tokens to generate (10-200)
+ - **Top-K**: Sample from the top K most likely tokens (1-100). Lower values make the output more focused.
+ - **Temperature**: Controls the randomness of the output (0.1-2.0). Lower values make the output more deterministic, higher values more creative.
+
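The two sampling knobs interact in a single generation step: temperature rescales the logits, then top-k discards everything outside the k most likely tokens before sampling. A dependency-free sketch of that step (the real implementation in `inference.py` does the same thing with `torch.topk` and `torch.multinomial`):

```python
import math
import random


def sample_next_token(logits, top_k=50, temperature=1.0):
    """Pick one token index from raw logits using temperature + top-k sampling."""
    scaled = [l / temperature for l in logits]
    # keep only the top_k highest logits; everything else gets probability 0
    cutoff = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
    masked = [l if l >= cutoff else float('-inf') for l in scaled]
    # softmax over the surviving logits (exp(-inf) == 0.0)
    m = max(masked)
    weights = [math.exp(l - m) for l in masked]
    return random.choices(range(len(weights)), weights=weights, k=1)[0]


# with top_k=1 the choice is deterministic: always the argmax
print(sample_next_token([1.0, 5.0, 2.0], top_k=1))  # 1
```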
+ ## Project Structure
+
+ ```
+ .
+ ├── app.py             # Gradio interface (main entry point)
+ ├── model.py           # GPT model architecture
+ ├── inference.py       # Model loading and text generation utilities
+ ├── requirements.txt   # Python dependencies
+ ├── README.md          # This file
+ ├── llm_trainer.ipynb  # Jupyter notebook for training
+ ├── input.txt          # Training data (optional)
+ ├── model/             # (Optional) Directory for saved model checkpoints
+ │   └── model.pth      # Saved model checkpoint
+ └── venv/              # Virtual environment (created during setup)
+ ```
+
+ ## Deployment to Hugging Face Spaces
+
+ 1. Create a new Space on [Hugging Face Spaces](https://huggingface.co/spaces)
+ 2. Upload all files from this project (except `venv/` and `__pycache__/`)
+ 3. Set the Space SDK to **Gradio**
+ 4. Add your model checkpoint file in the `model/` directory (if using a trained model)
+ 5. The Space will automatically install dependencies and launch the app
+
+ ### For Hugging Face Spaces
+
+ The app will automatically:
+ - Use the GPU if one is available, otherwise the CPU
+ - Load pretrained GPT-2 if no checkpoint is found
+ - Handle model loading errors gracefully
+
+ ## Model Training
+
+ To train your own model, use the `llm_trainer.ipynb` notebook. After training, save the model:
+
+ ```python
+ import torch
+ import os
+
+ # Create the model directory if it doesn't exist
+ os.makedirs('model', exist_ok=True)
+
+ # Save the model checkpoint
+ checkpoint = {
+     'model_state_dict': model.state_dict(),
+     'config': {
+         'block_size': model.config.block_size,
+         'vocab_size': model.config.vocab_size,
+         'n_layer': model.config.n_layer,
+         'n_head': model.config.n_head,
+         'n_embd': model.config.n_embd,
+     }
+ }
+ torch.save(checkpoint, './model/model.pth')
+ print("Model saved successfully!")
+ ```
+
+ Then place `model.pth` in the `model/` directory for automatic loading.
+
+ ## Troubleshooting
+
+ ### Common Issues
+
+ 1. **Import Errors**:
+    - Ensure all dependencies are installed: `pip install -r requirements.txt`
+    - Make sure your virtual environment is activated
+
+ 2. **Model Not Found**:
+    - Check that the model checkpoint is in the correct directory: `./model/model.pth`
+    - Verify the file exists: `ls model/model.pth` (Linux/macOS) or `dir model\model.pth` (Windows)
+
+ 3. **CUDA Out of Memory**:
+    - The app will automatically fall back to CPU if GPU memory is insufficient
+    - Reduce the Max Tokens parameter in the interface
+
+ 4. **Module Not Found**:
+    - Reinstall dependencies: `pip install -r requirements.txt --upgrade`
+    - Check the Python version: `python --version` (should be 3.8+)
+
+ 5. **Port Already in Use**:
+    - Change the port in `app.py`: `demo.launch(server_port=7861)`
+    - Or stop the process using the port
+
+ ## License
+
+ This project uses the GPT-2 architecture and can load pretrained GPT-2 weights from Hugging Face, which are subject to OpenAI's GPT-2 license.
+
+ ## Notes
+
+ - The model uses tiktoken's 'gpt2' encoding
+ - Generation uses top-k sampling with temperature control
+ - Maximum sequence length is 1024 tokens
+
app.py ADDED
@@ -0,0 +1,163 @@
+ """
+ Gradio App for Sentence Completion
+ Main entry point for Hugging Face Spaces
+ """
+
+ import os
+
+ import gradio as gr
+ import torch
+ from inference import load_model, generate_text, get_device
+
+
+ # Global model and device, initialized at startup
+ model = None
+ device = None
+
+
+ def initialize_model(model_path=None, pretrained_model='gpt2'):
+     """Initialize the model on startup"""
+     global model, device
+     try:
+         model, device = load_model(model_path=model_path, pretrained_model=pretrained_model)
+         return f"Model loaded successfully on device: {device}"
+     except Exception as e:
+         return f"Error loading model: {str(e)}"
+
+
+ def complete_sentence(prompt, max_tokens, top_k, temperature):
+     """Generate sentence completion based on prompt"""
+     global model, device
+
+     if model is None:
+         return "Error: Model not loaded. Please restart the app."
+
+     if not prompt.strip():
+         return "Please enter a prompt to complete."
+
+     try:
+         # Ensure device is current
+         if device != get_device():
+             device = get_device()
+             model = model.to(device)
+
+         # Generate completion
+         generated_text = generate_text(
+             prompt=prompt,
+             model=model,
+             max_tokens=max_tokens,
+             top_k=top_k,
+             temperature=temperature,
+             device=device
+         )
+
+         return generated_text
+     except Exception as e:
+         return f"Error generating text: {str(e)}"
+
+
+ def create_interface():
+     """Create and return the Gradio interface"""
+
+     # Initialize the model on startup:
+     # try to load from common checkpoint paths, fall back to pretrained GPT-2
+     checkpoint_paths = [
+         './model/model.pth',
+         'model.pt',
+         'checkpoint.pth',
+         'checkpoint.pt',
+         'gpt_model.pth',
+     ]
+
+     model_path = None
+     for path in checkpoint_paths:
+         if os.path.exists(path):
+             model_path = path
+             break
+
+     status = initialize_model(model_path=model_path, pretrained_model='gpt2')
+     print(status)
+
+     # Create the Gradio interface
+     with gr.Blocks(title="Sentence Completion with GPT") as demo:
+         gr.Markdown(
+             """
+             # Sentence Completion with GPT
+
+             Enter a prompt and the model will complete the sentence for you.
+             Adjust the parameters to control the generation behavior.
+             """
+         )
+
+         with gr.Row():
+             with gr.Column(scale=2):
+                 prompt_input = gr.Textbox(
+                     label="Prompt",
+                     placeholder="Enter your prompt here...",
+                     lines=3,
+                     value="The future of artificial intelligence is"
+                 )
+
+                 with gr.Row():
+                     max_tokens_slider = gr.Slider(
+                         minimum=10,
+                         maximum=200,
+                         value=50,
+                         step=10,
+                         label="Max Tokens"
+                     )
+
+                     top_k_slider = gr.Slider(
+                         minimum=1,
+                         maximum=100,
+                         value=50,
+                         step=1,
+                         label="Top-K"
+                     )
+
+                     temperature_slider = gr.Slider(
+                         minimum=0.1,
+                         maximum=2.0,
+                         value=1.0,
+                         step=0.1,
+                         label="Temperature"
+                     )
+
+                 generate_btn = gr.Button("Generate", variant="primary")
+
+             with gr.Column(scale=2):
+                 output_text = gr.Textbox(
+                     label="Generated Text",
+                     lines=10,
+                     interactive=False
+                 )
+
+         gr.Markdown(
+             """
+             ### Parameters:
+             - **Max Tokens**: Maximum number of tokens to generate
+             - **Top-K**: Sample from top K most likely tokens (lower = more focused)
+             - **Temperature**: Controls randomness (lower = more deterministic, higher = more creative)
+             """
+         )
+
+         # Generate on button click
+         generate_btn.click(
+             fn=complete_sentence,
+             inputs=[prompt_input, max_tokens_slider, top_k_slider, temperature_slider],
+             outputs=output_text
+         )
+
+         # Also generate on Enter key press
+         prompt_input.submit(
+             fn=complete_sentence,
+             inputs=[prompt_input, max_tokens_slider, top_k_slider, temperature_slider],
+             outputs=output_text
+         )
+
+     return demo
+
+
+ if __name__ == "__main__":
+     demo = create_interface()
+     demo.launch(share=False)
+
inference.py ADDED
@@ -0,0 +1,101 @@
+ """
+ Inference and Model Loading Utilities
+ """
+
+ import os
+ import torch
+ from torch.nn import functional as F
+ import tiktoken
+ from model import GPT, GPTConfig
+
+
+ def get_device():
+     """Auto-detect and return the best available device"""
+     if torch.cuda.is_available():
+         return 'cuda'
+     elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
+         return 'mps'
+     else:
+         return 'cpu'
+
+
+ def load_model(model_path=None, pretrained_model='gpt2', device=None):
+     """
+     Load a model with priority: saved checkpoint > pretrained model.
+
+     Args:
+         model_path: Path to a saved model checkpoint (.pth or .pt file)
+         pretrained_model: Hugging Face model name to fall back to ('gpt2', 'gpt2-medium', etc.)
+         device: Device to load the model on (auto-detected if None)
+
+     Returns:
+         Tuple of (loaded model, device)
+     """
+     if device is None:
+         device = get_device()
+
+     # Try to load a saved checkpoint first
+     if model_path and os.path.exists(model_path):
+         try:
+             print(f"Loading saved model from {model_path}...")
+             model = GPT.load_checkpoint(model_path, device=device)
+             return model, device
+         except Exception as e:
+             print(f"Failed to load saved model: {e}")
+             print(f"Falling back to pretrained model: {pretrained_model}")
+
+     # Fall back to the pretrained model
+     print(f"Loading pretrained model: {pretrained_model}...")
+     try:
+         model = GPT.from_pretrained(pretrained_model)
+         model.to(device)
+         return model, device
+     except Exception as e:
+         print(f"Failed to load pretrained model: {e}")
+         # Last resort: create an untrained model with the default config
+         print("Creating model with default config...")
+         config = GPTConfig()
+         model = GPT(config)
+         model.to(device)
+         return model, device
+
+
+ def generate_text(prompt, model, max_tokens=50, top_k=50, temperature=1.0, device='cpu'):
+     """
+     Generate a text completion for a given prompt using the GPT model.
+
+     Args:
+         prompt: Input text prompt
+         model: GPT model instance
+         max_tokens: Maximum number of tokens to generate
+         top_k: Top-k sampling parameter (None for no top-k filtering)
+         temperature: Temperature for sampling (higher = more random)
+         device: Device to run inference on
+
+     Returns:
+         Generated text string (including the original prompt)
+     """
+     enc = tiktoken.get_encoding("gpt2")
+     model.eval()
+
+     with torch.no_grad():
+         # Tokenize the prompt
+         input_ids = enc.encode(prompt)
+         x = torch.tensor(input_ids, dtype=torch.long, device=device).unsqueeze(0)
+
+         for _ in range(max_tokens):
+             # Crop the context to the model's block size if needed
+             x_cond = x if x.size(1) <= model.config.block_size else x[:, -model.config.block_size:]
+             logits, _ = model(x_cond)
+             logits = logits[:, -1, :] / temperature
+
+             # Keep only the top-k logits; mask the rest to -inf before softmax
+             if top_k is not None:
+                 topk = torch.topk(logits, top_k, dim=-1)
+                 mask = logits < topk.values[:, -1].unsqueeze(-1)
+                 logits = logits.masked_fill(mask, -float("inf"))
+
+             probs = F.softmax(logits, dim=-1)
+             next_token = torch.multinomial(probs, num_samples=1)
+             x = torch.cat((x, next_token), dim=1)
+
+     generated_ids = x[0].tolist()
+     return enc.decode(generated_ids)
+
model.py ADDED
@@ -0,0 +1,216 @@
+ """
+ GPT Model Architecture
+ Extracted from llm_trainer.ipynb
+ """
+
+ import math
+ from dataclasses import dataclass
+ import torch
+ import torch.nn as nn
+ from torch.nn import functional as F
+
+
+ class CausalSelfAttention(nn.Module):
+
+     def __init__(self, config):
+         super().__init__()
+         assert config.n_embd % config.n_head == 0
+         # key, query, value projections for all heads, but in a batch
+         self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
+         # output projection
+         self.c_proj = nn.Linear(config.n_embd, config.n_embd)
+         self.c_proj.NANOGPT_SCALE_INIT = 1
+         # regularization
+         self.n_head = config.n_head
+         self.n_embd = config.n_embd
+         self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size)).view(1, 1, config.block_size, config.block_size))
+
+     def forward(self, x):
+         B, T, C = x.size()  # batch size, sequence length, embedding dimensionality (n_embd)
+         # calculate query, key, values for all heads in batch and move head forward to be the batch dim
+         # nh is "number of heads", hs is "head size", and C (number of channels) = nh * hs
+         # e.g. in GPT-2 (124M), n_head=12, hs=64, so nh*hs=C=768 channels in the Transformer
+         qkv = self.c_attn(x)
+         q, k, v = qkv.split(self.n_embd, dim=2)
+         k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, nh, T, hs)
+         q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, nh, T, hs)
+         v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, nh, T, hs)
+
+         att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
+         att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float('-inf'))
+         att = F.softmax(att, dim=-1)
+         y = att @ v  # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
+
+         y = y.transpose(1, 2).contiguous().view(B, T, C)  # re-assemble all head outputs side by side
+         # output projection
+         y = self.c_proj(y)
+         return y
+
+
+ class MLP(nn.Module):
+
+     def __init__(self, config):
+         super().__init__()
+         self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
+         self.gelu = nn.GELU(approximate='tanh')
+         self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)
+         self.c_proj.NANOGPT_SCALE_INIT = 1
+
+     def forward(self, x):
+         x = self.c_fc(x)
+         x = self.gelu(x)
+         x = self.c_proj(x)
+         return x
+
+
+ class Block(nn.Module):
+
+     def __init__(self, config):
+         super().__init__()
+         self.ln_1 = nn.LayerNorm(config.n_embd)
+         self.attn = CausalSelfAttention(config)
+         self.ln_2 = nn.LayerNorm(config.n_embd)
+         self.mlp = MLP(config)
+
+     def forward(self, x):
+         x = x + self.attn(self.ln_1(x))
+         x = x + self.mlp(self.ln_2(x))
+         return x
+
+
+ @dataclass
+ class GPTConfig:
+     block_size: int = 1024  # max sequence length
+     vocab_size: int = 50257  # number of tokens: 50,000 BPE merges + 256 byte tokens + 1 <|endoftext|> token
+     n_layer: int = 12  # number of layers
+     n_head: int = 12  # number of heads
+     n_embd: int = 768  # embedding dimension
+
+
+ class GPT(nn.Module):
+
+     def __init__(self, config):
+         super().__init__()
+         self.config = config
+
+         self.transformer = nn.ModuleDict(dict(
+             wte = nn.Embedding(config.vocab_size, config.n_embd),
+             wpe = nn.Embedding(config.block_size, config.n_embd),
+             h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
+             ln_f = nn.LayerNorm(config.n_embd),
+         ))
+         self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
+
+         # weight sharing between token embedding and output head
+         self.transformer.wte.weight = self.lm_head.weight
+
+         # weight initialization
+         self.apply(self._init_weights)
+
+     def _init_weights(self, module):
+         if isinstance(module, nn.Linear):
+             std = 0.02
+             if hasattr(module, 'NANOGPT_SCALE_INIT'):
+                 std *= (2 * self.config.n_layer) ** -0.5
+             torch.nn.init.normal_(module.weight, mean=0.0, std=std)
+             if module.bias is not None:
+                 torch.nn.init.zeros_(module.bias)
+         elif isinstance(module, nn.Embedding):
+             torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
+
+     def forward(self, idx, targets=None):
+         # idx is of shape (B, T)
+         B, T = idx.size()
+         assert T <= self.config.block_size, f"Cannot forward sequence of length {T}, block size is only {self.config.block_size}"
+         # forward the token and position embeddings
+         pos = torch.arange(0, T, dtype=torch.long, device=idx.device)  # shape (T)
+         pos_emb = self.transformer.wpe(pos)  # position embeddings of shape (T, n_embd)
+         tok_emb = self.transformer.wte(idx)  # token embeddings of shape (B, T, n_embd)
+         x = tok_emb + pos_emb
+         # forward the blocks of the transformer
+         for block in self.transformer.h:
+             x = block(x)
+         # forward the final layernorm and the classifier
+         x = self.transformer.ln_f(x)
+         logits = self.lm_head(x)  # (B, T, vocab_size)
+         loss = None
+         if targets is not None:
+             loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
+         return logits, loss
+
+     @classmethod
+     def from_pretrained(cls, model_type):
+         """Loads pretrained GPT-2 model weights from Hugging Face"""
+         assert model_type in {'gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'}
+         from transformers import GPT2LMHeadModel
+         print("loading weights from pretrained gpt: %s" % model_type)
+
+         # n_layer, n_head and n_embd are determined from model_type
+         config_args = {
+             'gpt2': dict(n_layer=12, n_head=12, n_embd=768),  # 124M params
+             'gpt2-medium': dict(n_layer=24, n_head=16, n_embd=1024),  # 350M params
+             'gpt2-large': dict(n_layer=36, n_head=20, n_embd=1280),  # 774M params
+             'gpt2-xl': dict(n_layer=48, n_head=25, n_embd=1600),  # 1558M params
+         }[model_type]
+         config_args['vocab_size'] = 50257  # always 50257 for GPT model checkpoints
+         config_args['block_size'] = 1024  # always 1024 for GPT model checkpoints
+         # create a from-scratch initialized minGPT model
+         config = GPTConfig(**config_args)
+         model = GPT(config)
+         sd = model.state_dict()
+         sd_keys = sd.keys()
+         sd_keys = [k for k in sd_keys if not k.endswith('.attn.bias')]  # discard this mask / buffer, not a param
+
+         # init a huggingface/transformers model
+         model_hf = GPT2LMHeadModel.from_pretrained(model_type)
+         sd_hf = model_hf.state_dict()
+
+         # copy while ensuring all of the parameters are aligned and match in names and shapes
+         sd_keys_hf = sd_hf.keys()
+         sd_keys_hf = [k for k in sd_keys_hf if not k.endswith('.attn.masked_bias')]  # ignore these, just a buffer
+         sd_keys_hf = [k for k in sd_keys_hf if not k.endswith('.attn.bias')]  # same, just the mask (buffer)
+         transposed = ['attn.c_attn.weight', 'attn.c_proj.weight', 'mlp.c_fc.weight', 'mlp.c_proj.weight']
+         # basically the openai checkpoints use a "Conv1D" module, but we only want to use a vanilla Linear
+         # this means that we have to transpose these weights when we import them
+         assert len(sd_keys_hf) == len(sd_keys), f"mismatched keys: {len(sd_keys_hf)} != {len(sd_keys)}"
+         for k in sd_keys_hf:
+             if any(k.endswith(w) for w in transposed):
+                 # special treatment for the Conv1D weights we need to transpose
+                 assert sd_hf[k].shape[::-1] == sd[k].shape
+                 with torch.no_grad():
+                     sd[k].copy_(sd_hf[k].t())
+             else:
+                 # vanilla copy over the other parameters
+                 assert sd_hf[k].shape == sd[k].shape
+                 with torch.no_grad():
+                     sd[k].copy_(sd_hf[k])
+
+         return model
+
+     def save_checkpoint(self, filepath):
+         """Save model checkpoint with config"""
+         checkpoint = {
+             'model_state_dict': self.state_dict(),
+             'config': {
+                 'block_size': self.config.block_size,
+                 'vocab_size': self.config.vocab_size,
+                 'n_layer': self.config.n_layer,
+                 'n_head': self.config.n_head,
+                 'n_embd': self.config.n_embd,
+             }
+         }
+         torch.save(checkpoint, filepath)
+         print(f"Model saved to {filepath}")
+
+     @classmethod
+     def load_checkpoint(cls, filepath, device='cpu'):
+         """Load model from checkpoint file"""
+         checkpoint = torch.load(filepath, map_location=device)
+         config_dict = checkpoint['config']
+         config = GPTConfig(**config_dict)
+         model = cls(config)
+         model.load_state_dict(checkpoint['model_state_dict'])
+         model.to(device)
+         print(f"Model loaded from {filepath}")
+         return model
+