# LaughLM

A high-performance **decoder-only transformer training system** built with **JAX + Flax** and optimized for **TPU training**.

LaughLM is designed as a **research-friendly yet production-capable framework** for experimenting with modern transformer architectures while maintaining high training throughput.

The system emphasizes:

- clean modular architecture
- hardware-efficient training
- reproducible experiments
- flexible configuration
- large-scale dataset streaming
- high MFU optimization on TPUs

---

# Features

- **Decoder-only GPT architecture**
- **JAX + Flax implementation**
- **TPU-optimized mixed precision training**
- **Flexible architecture selection**
- **Pre-tokenized memory-mapped datasets**
- **Multiple attention variants**
- **Multiple FFN architectures**
- **Weight tying support**
- **Orbax checkpointing**
- **Optax optimizers**
- **Config-driven experiments**

Supported architecture features:

- MHA / MQA / GQA attention
- RoPE positional encoding
- SwiGLU / GEGLU / GELU MLP
- RMSNorm / LayerNorm (RMSNorm sketched below)
- configurable residual scaling
- multiple LR schedulers
- masked weight decay
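
For reference, RMSNorm (one of the normalization options above) reduces to a few lines. This is a generic sketch, not the project's module in `LaughLM/model/layers/normalization.py`:

```python
# Generic RMSNorm in JAX; the project's version lives in
# LaughLM/model/layers/normalization.py and may differ in details.
import jax.numpy as jnp

def rms_norm(x, weight, eps=1e-6):
    # Scale by the inverse root-mean-square over the feature axis, then apply a learned gain.
    rms = jnp.sqrt(jnp.mean(jnp.square(x), axis=-1, keepdims=True) + eps)
    return x / rms * weight
```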

---

# Project Structure
```text
.
├── configs
│   ├── gpu_test.yaml
│   └── test.yaml
├── LaughLM
│   ├── config
│   │   ├── loader.py
│   │   ├── schema.py
│   │   └── validation.py
│   ├── data
│   │   ├── domain_sampler.py
│   │   ├── memmap_loader.py
│   │   ├── shard_writer.py
│   │   ├── tokenizer.py
│   │   └── tokenizer_train.py
│   ├── model
│   │   ├── gpt.py
│   │   ├── layers
│   │   │   ├── attention.py
│   │   │   ├── mlp.py
│   │   │   ├── normalization.py
│   │   │   ├── positional.py
│   │   │   └── residual.py
│   │   ├── parameter_utils.py
│   │   └── transformer_block.py
│   ├── training
│   │   ├── checkpoint.py
│   │   ├── logger.py
│   │   ├── loss.py
│   │   ├── optimizer.py
│   │   ├── scheduler.py
│   │   ├── trainer.py
│   │   ├── train_state.py
│   │   └── train_step.py
│   └── utils
│       └── rng.py
├── LICENSE
├── log.txt
├── pyproject.toml
├── README.md
├── requirements.txt
└── scripts
    ├── build_shard.py
    └── train_gpu_test.py
```

---

# Installation

Clone the repository:

```bash

git clone https://github.com/your-org/LaughLM.git

cd LaughLM

```

Create environment:
```bash

python -m venv venv

source venv/bin/activate

```
Install dependencies:
```bash

pip install -r requirements.txt

```

For TPU environments, install JAX with TPU support:

```bash

pip install --upgrade "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html

```

---

# Configuration

Experiments are fully defined via YAML configs.

Example: `configs/test.yaml`

Configuration sections include:

- model architecture
- optimizer
- scheduler
- runtime parameters
- dataset sources
- tokenizer settings
- hardware configuration

Example snippet:
```yaml

model:

  d_model: 768

  num_layers: 12

  num_heads: 12

  vocab_size: 32000

  max_seq_len: 2048

```
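
As a rough illustration of how such a config maps to plain Python (the project's real loading and validation live in `LaughLM/config/loader.py` and `LaughLM/config/validation.py`; this sketch just uses PyYAML directly):

```python
# Minimal sketch: read the experiment YAML with PyYAML.
# The project's actual loader/validation is in LaughLM/config/loader.py
# and LaughLM/config/validation.py; field names mirror the snippet above.
import yaml

with open("configs/test.yaml") as f:
    cfg = yaml.safe_load(f)

model = cfg["model"]
assert model["d_model"] % model["num_heads"] == 0, "d_model must be divisible by num_heads"
print(f"{model['num_layers']} layers, d_model={model['d_model']}, vocab={model['vocab_size']}")
```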

---

# Dataset Pipeline

LaughLM uses a pre-tokenized dataset pipeline for maximum throughput.

Training datasets are converted into binary token shards.

Advantages:

- high throughput
- minimal CPU overhead
- memory-mapped streaming (see the sketch after this list)
- scalable to large datasets
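
As a sketch of what memory-mapped streaming means in practice (illustrative only; the project's loader is `LaughLM/data/memmap_loader.py`, whose interface may differ), a flat `uint16` shard can be windowed with NumPy like this:

```python
# Illustrative only: random fixed-length windows from a flat uint16 token shard.
# The project's actual loader lives in LaughLM/data/memmap_loader.py.
import numpy as np

SEQ_LEN = 2048
tokens = np.memmap("dataset_shard.bin", dtype=np.uint16, mode="r")

def sample_batch(rng, batch_size):
    # +1 so that inputs and next-token targets come from a single slice.
    starts = rng.integers(0, len(tokens) - SEQ_LEN - 1, size=batch_size)
    windows = np.stack([tokens[s : s + SEQ_LEN + 1] for s in starts]).astype(np.int32)
    return windows[:, :-1], windows[:, 1:]  # (inputs, targets)

inputs, targets = sample_batch(np.random.default_rng(0), batch_size=8)
```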



---

# Step 1 — Train Tokenizer

Train a tokenizer using streaming datasets.
```bash

python -m LaughLM.data.tokenizer_train

```
Output: `tokenizer.json`
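
Assuming the tokenizer is trained with the Hugging Face `tokenizers` library (a guess based on the `tokenizer.json` output; `LaughLM/data/tokenizer_train.py` is the authoritative source), it can be sanity-checked like this:

```python
# Assumes tokenizer.json is a Hugging Face `tokenizers` file, which is the usual
# meaning of this filename; adjust if the project uses a different backend.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")
ids = tok.encode("Why did the transformer cross the road?").ids
print(ids)
print(tok.decode(ids))
```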


---

# Step 2 — Build Token Shards

Convert raw text into token shards.
```bash

python scripts/build_shard.py

```
Output: `dataset_shard.bin`

Shards contain a flat `uint16` token stream.
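
The format is simple enough to reproduce with NumPy. The sketch below only illustrates the layout (the real writer is `scripts/build_shard.py` / `LaughLM/data/shard_writer.py`, and `corpus.txt` is a hypothetical stand-in for your raw text source):

```python
# Layout sketch only: append token ids as raw uint16 to a shard file.
# The real pipeline is scripts/build_shard.py + LaughLM/data/shard_writer.py.
import numpy as np
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")

with open("dataset_shard.bin", "wb") as out, open("corpus.txt", encoding="utf-8") as src:
    for line in src:
        ids = tok.encode(line).ids
        np.asarray(ids, dtype=np.uint16).tofile(out)
```

Note that `uint16` caps the vocabulary at 65,536 ids, which comfortably covers the 32,000-token vocab in the example config.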





---



# Step 3 — Training



Run training:

```bash

python scripts/train_gpu_test.py

```

Training automatically handles:

- optimizer
- scheduler
- logging
- checkpointing
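
For orientation, the core of a jitted next-token train step in JAX/Flax/Optax generally looks like the sketch below (illustrative; the project's actual implementation is in `LaughLM/training/train_step.py` and `LaughLM/training/loss.py`):

```python
# Illustrative JAX/Flax/Optax train step, not the project's exact code
# (see LaughLM/training/train_step.py for the real one).
import jax
import optax

optimizer = optax.adamw(learning_rate=3e-4, weight_decay=0.1)

def make_train_step(apply_fn):
    @jax.jit
    def train_step(params, opt_state, inputs, targets):
        def loss_fn(p):
            logits = apply_fn({"params": p}, inputs)  # [batch, seq, vocab]
            losses = optax.softmax_cross_entropy_with_integer_labels(logits, targets)
            return losses.mean()

        loss, grads = jax.value_and_grad(loss_fn)(params)
        updates, new_opt_state = optimizer.update(grads, opt_state, params)
        new_params = optax.apply_updates(params, updates)
        return new_params, new_opt_state, loss

    return train_step
```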





Example output:

```text
STEP   PROGRESS │ LOSS   PPL │ LR │ TOK/S │ MFU
```





---



# Checkpointing

Checkpoints are saved using Orbax.

Default directory: `checkpoints/`

Training resumes automatically if checkpoints exist.
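
A minimal Orbax save/restore looks roughly like this (exact API details vary between Orbax versions; the project's wrapper is `LaughLM/training/checkpoint.py`, and the state dict here is a stand-in for the real train state pytree):

```python
# Minimal Orbax sketch; LaughLM/training/checkpoint.py is the authoritative wrapper.
import os
import numpy as np
import orbax.checkpoint as ocp

# Stand-in for the real train state (params, optimizer state, step counter).
train_state = {"step": np.int32(0), "params": {"w": np.zeros((4, 4), np.float32)}}

ckpt_dir = os.path.abspath("checkpoints/step_0")  # absolute path avoids ambiguity
checkpointer = ocp.PyTreeCheckpointer()
checkpointer.save(ckpt_dir, train_state)
restored = checkpointer.restore(ckpt_dir)
```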





---



# Benchmarking Performance

Benchmark raw training throughput:

```bash

python scripts/benchmark_train_step.py

```

This measures:

- compile time
- step time
- tokens/sec
- MFU

Example output:

```text
Compile time: 18.2s
Step time: 0.048s
Tokens/sec: 430000
```
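
MFU is reported as achieved model FLOPs divided by the accelerator's peak FLOPs. The snippet below uses the common estimate of roughly 6 FLOPs per parameter per trained token for decoder-only models; the numbers are illustrative, and this is not necessarily the exact formula used by the benchmark script:

```python
# Common MFU estimate for decoder-only transformers: ~6 FLOPs per parameter per token.
# Illustrative numbers; not necessarily the exact formula in the benchmark script.
def mfu(num_params: float, tokens_per_sec: float, peak_flops_per_sec: float) -> float:
    achieved = 6.0 * num_params * tokens_per_sec
    return achieved / peak_flops_per_sec

# e.g. a 124M-parameter model at 130k tokens/s on a ~197 TFLOP/s (bf16) TPU v5e chip
print(f"{mfu(124e6, 130_000, 197e12):.1%}")  # ~49%
```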





---



# Monitoring

The training logger displays:

- loss
- perplexity
- gradient norm
- tokens/sec
- MFU
- ETA

Example:

```text
STEP  PROGRESS │ LOSS │ LR │ TOK/S │ MFU │ ETA
```





---



# Optimization Roadmap

LaughLM is designed to progressively reach high TPU utilization.

Target MFU: 50–60% on TPU v5e

Optimization phases:

| Phase              | Goal                      |
|--------------------|---------------------------|
| Baseline           | establish benchmark       |
| Data pipeline      | remove input bottlenecks  |
| Graph optimization | eliminate Python overhead |
| Kernel fusion      | maximize MXU utilization  |
| Flash attention    | reduce memory traffic     |

---



# Development Workflow

Recommended workflow:

1. Create branch
2. Implement change
3. Run benchmark
4. Compare tokens/sec
5. Merge if improvement



Example:

```bash

git checkout -b optimize_attention
```



---



# Contributing

Pull requests should include:

- clear description
- performance impact
- benchmark results







---



# License

MIT License





---



# Acknowledgements

LaughLM builds on ideas from:

- GPT
- LLaMA
- PaLM
- DeepSeek
- MiniCPM

and the JAX / Flax ecosystem.





---



# Future Work

Planned improvements:

- Flash Attention
- Activation checkpointing
- MoE layers
- PJIT sharding
- Distributed training