# LaughLM

A high-performance **decoder-only transformer training system** built with **JAX + Flax** and optimized for **TPU training**.

LaughLM is designed as a **research-friendly yet production-capable framework** for experimenting with modern transformer architectures while maintaining high training throughput.

The system emphasizes:

- clean modular architecture
- hardware-efficient training
- reproducible experiments
- flexible configuration
- large-scale dataset streaming
- high MFU optimization on TPUs

---

# Features

- **Decoder-only GPT architecture**
- **JAX + Flax implementation**
- **TPU-optimized mixed precision training**
- **Flexible architecture selection**
- **Pre-tokenized memory-mapped datasets**
- **Multiple attention variants**
- **Multiple FFN architectures**
- **Weight tying support**
- **Orbax checkpointing**
- **Optax optimizers**
- **Config-driven experiments**

Supported architecture features:

- MHA / MQA / GQA attention
- RoPE positional encoding
- SwiGLU / GEGLU / GELU MLP
- RMSNorm / LayerNorm
- configurable residual scaling
- multiple LR schedulers
- masked weight decay

---

# Project Structure

```text
.
├── configs
│   ├── gpu_test.yaml
│   └── test.yaml
├── LaughLM
│   ├── config
│   │   ├── loader.py
│   │   ├── schema.py
│   │   └── validation.py
│   ├── data
│   │   ├── domain_sampler.py
│   │   ├── memmap_loader.py
│   │   ├── shard_writer.py
│   │   ├── tokenizer.py
│   │   └── tokenizer_train.py
│   ├── model
│   │   ├── gpt.py
│   │   ├── layers
│   │   │   ├── attention.py
│   │   │   ├── mlp.py
│   │   │   ├── normalization.py
│   │   │   ├── positional.py
│   │   │   └── residual.py
│   │   ├── parameter_utils.py
│   │   └── transformer_block.py
│   ├── training
│   │   ├── checkpoint.py
│   │   ├── logger.py
│   │   ├── loss.py
│   │   ├── optimizer.py
│   │   ├── scheduler.py
│   │   ├── trainer.py
│   │   ├── train_state.py
│   │   └── train_step.py
│   └── utils
│       └── rng.py
├── LICENSE
├── log.txt
├── pyproject.toml
├── README.md
├── requirements.txt
└── scripts
    ├── build_shard.py
    └── train_gpu_test.py
```

---

# Installation

Clone the repository:

```bash
git clone https://github.com/your-org/LaughLM.git
cd LaughLM
```

Create an environment:

```bash
python -m venv venv
source venv/bin/activate
```

Install dependencies:

```bash
pip install -r requirements.txt
```

For TPU environments, install the TPU build of JAX:

```bash
pip install --upgrade "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
```

---

# Configuration

Experiments are fully defined via YAML configs. Example: `configs/test.yaml`.

Configuration sections include:

- model architecture
- optimizer
- scheduler
- runtime parameters
- dataset sources
- tokenizer settings
- hardware configuration

Example snippet:

```yaml
model:
  d_model: 768
  num_layers: 12
  num_heads: 12
  vocab_size: 32000
  max_seq_len: 2048
```

---

# Dataset Pipeline

LaughLM uses a pre-tokenized dataset pipeline for maximum throughput. Training datasets are converted into binary token shards.

Advantages:

- high throughput
- minimal CPU overhead
- memory-mapped streaming
- scalable to large datasets

---

# Step 1 — Train Tokenizer

Train a tokenizer using streaming datasets:

```bash
python -m LaughLM.data.tokenizer_train
```

Output: `tokenizer.json`
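The internals of `LaughLM/data/tokenizer_train.py` aren't shown here. As a rough, hypothetical sketch of streaming tokenizer training — assuming the Hugging Face `tokenizers` library, an illustrative `corpus.txt`, and placeholder special tokens, none of which are confirmed by the repo — it might look like:

```python
# Hypothetical sketch of streaming BPE tokenizer training; not LaughLM's
# actual implementation.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32000,                            # matches configs/test.yaml
    special_tokens=["<unk>", "<bos>", "<eos>"],  # assumed special tokens
)

def line_iter(path):
    """Yield the corpus line by line so it is streamed, not held in RAM."""
    with open(path, encoding="utf-8") as f:
        yield from f

tokenizer.train_from_iterator(line_iter("corpus.txt"), trainer=trainer)
tokenizer.save("tokenizer.json")
```

Training from an iterator keeps memory usage flat even for corpora far larger than RAM, which is what makes the streaming-dataset approach above practical.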
---

# Step 2 — Build Token Shards

Convert raw text into token shards:

```bash
python scripts/build_shard.py
```

Output: `dataset_shard.bin`

Shards contain a `uint16` token stream.

---

# Step 3 — Training

Run training:

```bash
python scripts/train_gpu_test.py
```

Training automatically handles:

- optimizer
- scheduler
- logging
- checkpointing

Example output:

```text
STEP PROGRESS │ LOSS PPL │ LR │ TOK/S │ MFU
```

---

# Checkpointing

Checkpoints are saved using Orbax.

Default directory: `checkpoints/`

Training resumes automatically if a checkpoint exists.

---

# Benchmarking Performance

Benchmark raw training throughput:

```bash
python scripts/benchmark_train_step.py
```

This measures:

- compile time
- step time
- tokens/sec
- MFU

Example output:

```text
Compile time: 18.2s
Step time: 0.048s
Tokens/sec: 430000
```

---

# Monitoring

The training logger displays:

- loss
- perplexity
- gradient norm
- tokens/sec
- MFU
- ETA

Example:

```text
STEP PROGRESS │ LOSS │ LR │ TOK/S │ MFU │ ETA
```

---

# Optimization Roadmap

LaughLM is designed to progressively reach high TPU utilization.

Target: 50–60% MFU on TPU v5e.

Optimization phases:

| Phase | Goal |
| --- | --- |
| Baseline | establish benchmark |
| Data pipeline | remove input bottlenecks |
| Graph optimization | eliminate Python overhead |
| Kernel fusion | maximize MXU utilization |
| Flash attention | reduce memory traffic |

---

# Development Workflow

Recommended workflow:

1. Create a branch
2. Implement the change
3. Run the benchmark
4. Compare tokens/sec
5. Merge if throughput improves

Example:

```bash
git checkout -b optimize_attention
```

---

# Contributing

Pull requests should include:

- a clear description
- performance impact
- benchmark results

---

# License

MIT License

---

# Acknowledgements

LaughLM builds on ideas from:

- GPT
- LLaMA
- PaLM
- DeepSeek
- MiniCPM

and the JAX / Flax ecosystem.

---

# Future Work

Planned improvements:

- Flash Attention
- Activation checkpointing
- MoE layers
- PJIT sharding (see the sketch after this list)
- Distributed training
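On the "PJIT sharding" item: in current JAX, `pjit` has been merged into `jax.jit`, which propagates input shardings through compiled computations. As a taste of the direction — a minimal, self-contained sketch, not LaughLM code — this shards an array across a one-dimensional device mesh:

```python
# Minimal sketch of jit-based sharding in current JAX (pjit is now jax.jit).
# Runs on whatever devices are visible, including a single CPU.
import jax
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Build a 1-D mesh over all visible devices, with one named axis.
devices = np.array(jax.devices())
mesh = Mesh(devices, axis_names=("data",))

# Shard the leading (batch) dimension across the "data" axis; it must
# divide evenly by the device count.
x = np.ones((8, 1024), dtype=np.float32)
x_sharded = jax.device_put(x, NamedSharding(mesh, P("data", None)))

@jax.jit
def scale(x):
    return x * 2.0

# jit propagates the input sharding, so the computation runs per-shard.
y = scale(x_sharded)
print(y.sharding)
```

Within one process this covers all local devices; multi-host (pod-scale) training would additionally need `jax.distributed.initialize` and process-spanning meshes, which is presumably what the "distributed training" item refers to.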