# Ursa_Minor_Smashed

A GPT-2 model (124M parameters) trained from scratch on the FineWeb-edu dataset. It reproduces the GPT-2 small architecture with a modern training recipe, including Flash Attention, mixed-precision training, and improved optimization strategies.

## Model Details

### Model Description
- **Developed by:** Kaileh57
- **Model type:** GPT-2 (Decoder-only Transformer)
- **Language(s):** English
- **License:** MIT
- **Finetuned from model:** None (trained from scratch)

### Model Architecture

| Parameter | Value |
|-----------|-------|
| Parameters | 124M |
| Layers | 12 |
| Hidden Size | 768 |
| Attention Heads | 12 |
| Context Length | 1024 |
| Vocabulary Size | 50,304 |
| Activation Function | GELU (tanh approximation) |
| Position Embeddings | Learned |
| Layer Norm | Pre-normalization |
| Attention Type | Multi-head with Flash Attention |
| Weight Tying | Token embeddings tied with output projection |
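
For orientation, the table above maps onto a small configuration object. The sketch below uses nanoGPT-style field names (illustrative, not necessarily the repo's actual class) and shows where the 124M figure comes from:

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024    # context length
    vocab_size: int = 50304   # GPT-2's 50,257 padded up to a multiple of 128
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768

cfg = GPTConfig()
# Rough parameter count; weight tying makes the output projection free:
embeddings = cfg.vocab_size * cfg.n_embd + cfg.block_size * cfg.n_embd
per_layer = 12 * cfg.n_embd ** 2   # attention (4*d^2) + MLP (8*d^2), ignoring biases/LayerNorm
total = embeddings + cfg.n_layer * per_layer
print(f"~{total / 1e6:.0f}M parameters")  # ~124M parameters
```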

### Training Details

- **Dataset:** FineWeb-edu (10B token sample)
- **Training Regime:** Mixed precision (bfloat16)
- **Optimizer:** AdamW (β₁=0.9, β₂=0.95, ε=1e-8)
- **Learning Rate:** 1.5e-3 with cosine decay
- **Batch Size:** 524,288 tokens
- **Training Steps:** 19,073 (1 epoch)
- **Warmup Steps:** 715
- **Weight Decay:** 0.1
- **Gradient Clipping:** 1.0
- **Hardware:** NVIDIA RTX A6000 (48GB)
- **Training Time:** ~2 days
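
The 524,288-token batch (2^19 tokens) does not fit in a single forward pass, so it is assembled with gradient accumulation. A sketch of the arithmetic, where the micro-batch size `B = 32` is an assumption:

```python
total_batch_tokens = 524_288          # 2**19 tokens per optimizer step
B, T = 32, 1024                       # assumed micro-batch size; T is the context length
grad_accum_steps = total_batch_tokens // (B * T)
print(grad_accum_steps)               # 16 forward/backward passes per optimizer step

# 19,073 steps * 524,288 tokens/step is ~10B tokens, i.e. one pass over the sample
tokens_seen = 19_073 * total_batch_tokens
```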

### Performance

| Benchmark | Score |
|-----------|-------|
| HellaSwag | 32.4% |
| Final Loss | 2.85 |
| Perplexity | ~17.3 |

*Note: Scores are from single epoch training. Multi-epoch training reaches ~35% on HellaSwag.*
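
The perplexity figure is just the exponential of the validation cross-entropy loss:

```python
import math

final_loss = 2.85                 # validation cross-entropy (nats/token)
perplexity = math.exp(final_loss)
print(f"{perplexity:.1f}")        # 17.3
```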

## Uses

### Direct Use
This model can be used for:
- Text generation
- Research on efficient training methods
- Educational purposes to understand GPT architectures
- Fine-tuning for specific downstream tasks

### Out-of-Scope Use
- Production applications requiring high reliability
- Generation of factual information (model may hallucinate)
- Any use case requiring larger context than 1024 tokens

## Quick Start

### Installation

#### CPU Installation (Default)
```bash
# Clone the repository
git clone https://github.com/Kaileh57/Ursa_Minor_Smashed.git
cd Ursa_Minor_Smashed

# Set up the CPU environment
pip install -r requirements.txt
# or use the setup script: ./setup.sh
```

#### CUDA Installation (For GPU Acceleration)
If you have a CUDA-capable GPU and want to use GPU acceleration:

**Linux/macOS:**
```bash
# Use the automated CUDA setup script
chmod +x setup-cuda.sh
./setup-cuda.sh
```

**Windows:**
```batch
REM Use the Windows CUDA setup script
setup-cuda.bat
```

**Manual CUDA Installation:**
```bash
# Create a separate CUDA environment
python -m venv venv-cuda
source venv-cuda/bin/activate  # On Windows: venv-cuda\Scripts\activate

# Install CUDA requirements
pip install -r requirements-cuda.txt
```

**CUDA Requirements:**
- NVIDIA GPU with CUDA Compute Capability 3.5 or higher
- NVIDIA drivers installed
- CUDA 11.8 or 12.1 toolkit (optional, PyTorch includes CUDA runtime)
- At least 4GB GPU memory recommended

### Basic Usage

#### Command Line Interface

**CUDA Version (GPU):**
```bash
# Basic text generation
python inference_cuda.py --prompt "Hello, I'm a language model" --max-tokens 50

# Creative writing
python inference_cuda.py --prompt "Once upon a time" --max-tokens 200 --temperature 0.9

# More focused output
python inference_cuda.py --prompt "The key to machine learning is" --max-tokens 100 --temperature 0.7 --top-k 50
```

**CPU Version:**
```bash
# Basic text generation
python inference_cpu.py --prompt "Hello, I'm a language model" --max-tokens 50

# Creative writing
python inference_cpu.py --prompt "Once upon a time" --max-tokens 120 --temperature 0.9

# More focused output
python inference_cpu.py --prompt "The key to machine learning is" --max-tokens 80 --temperature 0.7 --top-k 30
```

#### Python Interface

**CUDA Version:**
```python
from inference_cuda import generate_direct, load_model_direct

# Load the model once (requires CUDA)
model = load_model_direct("model_optimized.pt")

# Generate text with CUDA optimizations
result = generate_direct(
    model,
    "Hello, I'm a language model",
    max_new_tokens=100,  # a larger token budget is affordable on GPU
    temperature=0.8,
    top_k=50,            # higher top_k for better quality
)
print(result)
```

**CPU Version:**
```python
from inference_cpu import generate_direct, load_model_direct

# Load the model once (CPU only)
model = load_model_direct("model_optimized.pt")

# Generate text with CPU optimizations
result = generate_direct(
    model,
    "Hello, I'm a language model",
    max_new_tokens=80,   # lower token budget for CPU efficiency
    temperature=0.8,
    top_k=30,            # lower top_k for CPU efficiency
)
print(result)
```

#### Chat Interface

**CUDA Version:**
```bash
# Start CUDA-optimized chat
python chat_cuda.py
```

**CPU Version:**
```bash
# Start CPU-optimized chat
python chat_cpu.py
```

## Training Procedure

The model was trained using a modern GPT training recipe including:
- Flash Attention for efficient attention computation
- Mixed precision training with bfloat16
- Gradient accumulation to achieve large batch sizes
- TF32 for faster matrix multiplications
- Optimized vocabulary size (50,304) for better GPU utilization
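
The vocabulary padding in the last point is simple rounding: GPT-2's true vocabulary of 50,257 becomes 50,304, a multiple of 128, which gives friendlier GPU kernel shapes. A sketch:

```python
def pad_vocab(vocab_size: int, multiple: int = 128) -> int:
    """Round the tokenizer's vocab size up to the nearest multiple of `multiple`."""
    return ((vocab_size + multiple - 1) // multiple) * multiple

print(pad_vocab(50257))  # 50304; the extra logits are simply never used as targets
```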

### Training Hyperparameters
- **Learning rate schedule:** 715-step linear warmup, then cosine decay from 1.5e-3 to 1.5e-4
- **Gradient accumulation steps:** Computed at runtime from the 524,288-token batch and the per-device micro-batch size
- **Mixed precision:** bfloat16 with PyTorch autocast
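
The warmup-plus-cosine schedule can be sketched as follows (a minimal reimplementation in the nanoGPT style, using the hyperparameters listed above; the function name is illustrative):

```python
import math

max_lr, min_lr = 1.5e-3, 1.5e-4
warmup_steps, max_steps = 715, 19_073

def get_lr(step: int) -> float:
    """Linear warmup to max_lr, then cosine decay to min_lr over the remaining steps."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)
```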

## Evaluation

### Testing Data
Evaluated on:
- FineWeb-edu validation set
- HellaSwag benchmark (10,042 examples)

### Metrics
- Cross-entropy loss on validation set
- Accuracy on HellaSwag commonsense reasoning

## Technical Specifications

### Compute Infrastructure
- **Hardware:** 1x NVIDIA RTX A6000 (48GB VRAM)
- **Software:** PyTorch 2.0+, CUDA 12.1, Flash Attention 2

### Model Initialization
- Weights initialized with σ=0.02 (scaled by 1/√(2×n_layers) for residual projections)
- Embeddings initialized with σ=0.02
- Biases initialized to zero
- LayerNorm weights initialized to 1.0, biases to 0.0
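
A quick check of the residual-projection scaling: each of the 12 blocks writes to the residual stream twice (attention and MLP), so the projection std is shrunk by 1/√(2×12) to keep the stream's variance roughly constant at depth:

```python
import math

base_std = 0.02
n_layer = 12
# 2 * n_layer additions to the residual stream -> scale std by 1/sqrt(2 * n_layer)
resid_std = base_std / math.sqrt(2 * n_layer)
print(f"{resid_std:.5f}")  # 0.00408
```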



## Citation

If you use this model, please cite:

```bibtex
@misc{ursa-minor-smashed,
  author = {Kaileh57},
  title = {Ursa Minor Smashed: Efficient GPT-2 Training},
  year = {2024},
  url = {https://github.com/Kaileh57/Ursa_Minor_Smashed}
}
```



## Available Tools

### Core Scripts
- **`inference_cuda.py`** - CUDA-optimized inference script
- **`inference_cpu.py`** - CPU-optimized inference script
- **`chat_cuda.py`** - CUDA-optimized chat interface
- **`chat_cpu.py`** - CPU-optimized chat interface
- **`benchmark_cuda.py`** - CUDA performance benchmarking tool
- **`benchmark_cpu.py`** - CPU performance benchmarking tool
- **`convert_to_gguf.py`** - Convert to GGUF format for llama.cpp

### Examples
- **`examples/basic_usage_cuda.py`** - CUDA usage examples
- **`examples/basic_usage_cpu.py`** - CPU usage examples



### Parameters

#### Generation Parameters
- **`temperature`** (0.1-1.0): Controls randomness (lower = more focused)
- **`top_k`** (1-100): Limit to top-k most likely tokens
- **`top_p`** (0.1-1.0): Nucleus sampling threshold
- **`repetition_penalty`** (1.0-2.0): Reduce repetitive output
- **`max_tokens`**: Maximum tokens to generate

#### Recommended Settings
- **Creative writing**: temp=0.8-0.9, top_p=0.9, top_k=50
- **Factual content**: temp=0.3-0.5, top_p=0.8, top_k=20
- **Code generation**: temp=0.2-0.4, top_p=0.7, top_k=10
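
To see how these parameters interact at sampling time, here is a torch-free sketch of temperature plus top-k sampling (the repo's inference scripts apply equivalent logic to the model's logit tensor; `sample_token` is illustrative):

```python
import math
import random

def sample_token(logits, temperature=0.8, top_k=50):
    """Temperature-scale the logits, keep the top-k, sample from the renormalized distribution."""
    scaled = [v / temperature for v in logits]            # lower temperature sharpens the distribution
    kept = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    peak = max(scaled[i] for i in kept)
    weights = [math.exp(scaled[i] - peak) for i in kept]  # numerically stable softmax over kept tokens
    return random.choices(kept, weights=weights, k=1)[0]
```

With `top_k=1` this reduces to greedy decoding; `top_p` (nucleus) sampling instead keeps the smallest set of highest-probability tokens whose probabilities sum to `top_p`.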



## Inference Performance

**CUDA (GPU)**:
- **Inference Speed**: ~50-150+ tokens/sec (depends on GPU)
- **Memory Usage**: ~2-4GB VRAM
- **Features**: CUDA autocast, torch.compile optimization
- **Latency**: ~10-20ms per token

**CPU**:
- **Inference Speed**: ~15-25 tokens/sec
- **Memory Usage**: ~2-3GB RAM
- **Features**: Multi-threading, CPU-optimized parameters
- **Latency**: ~40-65ms per token

**General**:
- **Context Length**: 1024 tokens maximum
- **Model Size**: 124M parameters



## Acknowledgments
- Based on Andrej Karpathy's nanoGPT implementation
- Trained on HuggingFace's FineWeb-edu dataset
- Uses OpenAI's GPT-2 tokenizer