---
license: mit
tags:
  - sorting
  - mechanistic-interpretability
  - transformers
  - toy-model
---

# SortGPT Checkpoints

Checkpoints for small decoder-only transformers trained on the **integer sorting task**.

## Task

The input is a sequence of `k` integers drawn from `{0, ..., N-1}` followed by a SEP token; the model must then output the same integers in sorted order:

```
[unsorted_tokens | SEP | sorted_tokens]
```

The full sequence length is `2*k + 1` (`k` unsorted tokens, SEP, `k` sorted tokens). The SEP token index is `N`, so `vocab_size = N + 1`.
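
As a minimal sketch of the layout described above, the following builds one example sequence. The function name `make_example` is hypothetical, and sampling distinct integers within a sequence is an assumption about how the data was generated; only the token layout and the SEP index come from this README.

```python
import random

def make_example(k, N, rng):
    """Return one sequence [unsorted | SEP | sorted] as a list of token ids."""
    sep = N  # SEP takes the last vocabulary index, so vocab_size = N + 1
    # Assumption: integers within a sequence are distinct (sampled without replacement).
    unsorted_tokens = rng.sample(range(N), k)
    return unsorted_tokens + [sep] + sorted(unsorted_tokens)

seq = make_example(k=16, N=128, rng=random.Random(0))
print(len(seq))  # 33, i.e. 2*k + 1
```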

## Grid

| Parameter    | Values                |
|--------------|-----------------------|
| `k` (length) | 16, 32               |
| `N` (vocab)  | 128, 256, 512, 1024  |
| Seeds        | 1, 2, 3, 4, 5        |
| `n_embd`     | 64                    |
| `n_layers`   | 2                     |
| `n_heads`    | 1                     |
| `init_std`   | 0.01                  |
| `lr`         | 0.03                  |
| `max_iters`  | 100,000               |

8 configs × 5 seeds = **40 runs**, each with 20 checkpoints (every 5,000 steps).
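
The run and checkpoint counts can be reproduced by enumerating the grid. This is only an illustrative sketch: the variable names are hypothetical, and the exact checkpoint steps (5,000 through 100,000) are an assumption inferred from "every 5,000 steps" and the 20-checkpoint count.

```python
from itertools import product

LENGTHS = [16, 32]
VOCABS = [128, 256, 512, 1024]
SEEDS = [1, 2, 3, 4, 5]

# One run per (k, N, seed) combination.
runs = list(product(LENGTHS, VOCABS, SEEDS))
# Assumption: checkpoints are saved at steps 5,000, 10,000, ..., 100,000.
ckpt_iters = list(range(5_000, 100_000 + 1, 5_000))

print(len(runs), len(ckpt_iters), len(runs) * len(ckpt_iters))  # 40 20 800
```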

## Architecture

Small GPT-2-style decoder-only transformer:
- Token embeddings (no positional embeddings — `without_pos=True`)
- 2 pre-norm transformer blocks, each with causal self-attention + MLP
- Final LayerNorm + LM head
- Weight tying between the token embedding and the LM head

## File Structure

```
checkpoints/
  k{16,32}_N{128,256,512,1024}/
    seed{1,2,3,4,5}/
      std0p01_iseed{S}__ckpt{iter}.pt
model.py   # Model definition + loading utilities
```
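
A small helper can assemble a checkpoint path from the pattern above. This is a sketch, not part of `model.py`: the name `checkpoint_path` is hypothetical, and it assumes `{S}` in the file name equals the seed directory number, which matches the example path in the loading section.

```python
from pathlib import PurePosixPath

def checkpoint_path(k, N, seed, it, root="checkpoints"):
    """Build the relative path to one checkpoint file."""
    # Assumption: {S} in the file name equals the seed directory number.
    name = f"std0p01_iseed{seed}__ckpt{it}.pt"
    return str(PurePosixPath(root) / f"k{k}_N{N}" / f"seed{seed}" / name)

print(checkpoint_path(32, 512, 1, 100_000))
# checkpoints/k32_N512/seed1/std0p01_iseed1__ckpt100000.pt
```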

## Loading a Checkpoint

```python
# Copy model.py to your project, then:
from model import load_model_from_checkpoint

model = load_model_from_checkpoint("checkpoints/k32_N512/seed1/std0p01_iseed1__ckpt100000.pt")
```

Each `.pt` file is a dict with keys:
- `model_config`: dict of `GPTConfig` fields
- `model_state_dict`: PyTorch state dict
- `checkpoint_iter`, `init_seed`, `init_std`, `l1_init_scale`
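
To inspect a checkpoint without building the model, the raw dict can be read directly. The sketch below uses hypothetical helper names (`summarize_checkpoint`, `load_raw`) and relies only on the keys listed above; `torch.load` is imported lazily so the key check can be used without PyTorch installed.

```python
EXPECTED_KEYS = {
    "model_config", "model_state_dict",
    "checkpoint_iter", "init_seed", "init_std", "l1_init_scale",
}

def summarize_checkpoint(ckpt):
    """Check that the documented keys are present and return the metadata fields."""
    missing = EXPECTED_KEYS - set(ckpt)
    if missing:
        raise KeyError(f"checkpoint is missing keys: {sorted(missing)}")
    # Leave out the (large) state dict; keep config and scalar metadata.
    return {key: ckpt[key] for key in sorted(EXPECTED_KEYS) if key != "model_state_dict"}

def load_raw(path):
    """Load the checkpoint dict onto the CPU."""
    import torch  # imported lazily so summarize_checkpoint works without PyTorch
    return torch.load(path, map_location="cpu")
```

For example, `summarize_checkpoint(load_raw("checkpoints/k32_N512/seed1/std0p01_iseed1__ckpt100000.pt"))` would return the config and metadata for the final checkpoint of that run.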

## Training Details

- Optimizer: AdamW (betas=0.9, 0.95)
- LR schedule: cosine decay with linear warmup
- Batch size: 128
- Data: randomly sampled sorting problems (no duplicates)
- `data_seed`: 1337 (shared across all runs)
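
For reference, a cosine schedule with linear warmup typically has the shape sketched below. This is not the training code: `lr_at` is a hypothetical name, and `warmup_iters` and `min_lr` are placeholder values that this README does not specify. Only the peak `lr` of 0.03 and `max_iters` of 100,000 come from the grid above.

```python
import math

def lr_at(step, base_lr=0.03, max_iters=100_000, warmup_iters=1_000, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr at max_iters."""
    # warmup_iters and min_lr are hypothetical; the README does not state them.
    if step < warmup_iters:
        return base_lr * (step + 1) / warmup_iters
    # Fraction of the decay phase completed, clamped to 1.0 past max_iters.
    progress = min(1.0, (step - warmup_iters) / (max_iters - warmup_iters))
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```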