# PAWN (Playstyle-Agnostic World-model Network for Chess)
A causal transformer trained on random chess games, designed as a testbed for finetuning and augmentation methods at small scales. Apache 2.0.
## Repository Structure
```
pawn/
├── engine/ # Rust chess engine with PyO3 bindings (via shakmaty)
├── pawn/ # Core Python package
│ ├── config.py # CLMConfig (small/base/large), TrainingConfig
│ ├── model.py # PAWNCLM transformer (RMSNorm, SwiGLU, RoPE, factored embeddings)
│ ├── data.py # On-the-fly random game data pipeline
│ ├── lichess_data.py # Lichess PGN data pipeline + legal mask computation
│ ├── trainer.py # Pretraining loop
│ ├── gpu.py # GPU auto-detection (compile/AMP/SDPA backend)
│ ├── logging.py # MetricsLogger (JSONL output)
│ ├── checkpoint.py # Atomic save/load, .complete sentinel, HF push
│ ├── adapters/ # Bottleneck, LoRA, FiLM, sparse, hybrid
│ ├── eval_suite/ # Probes, generation tests, diagnostics, lichess eval
│ └── dashboard/ # Solara training dashboard (metrics, charts, runner)
├── scripts/ # Training and evaluation entry points
├── tests/ # Unit tests
├── deploy/ # Runpod deployment scripts
└── docs/ # Architecture, training, adapter docs
```
## Building
This is a uv workspace. The root project is the `pawn` Python package; `engine/` is the sole workspace member.
```bash
# Build the Rust chess engine (required before anything else)
cd engine && uv run --with maturin maturin develop --release && cd ..
# Install Python deps (dev tools like pytest, seaborn, solara are in base dependencies):
uv sync --extra rocm # AMD (ROCm 7.1)
uv sync --extra cu128 # NVIDIA (CUDA 12.8)
# Run tests
uv run pytest tests/
# Pretrain from scratch (local dev)
uv run python scripts/train.py --variant base --local-checkpoints
```
The only extras are GPU backends (`rocm` or `cu128`). Everything else (pytest, solara, optuna, seaborn, etc.) is in base dependencies. PyTorch lives in the extras because uv can't resolve CPU/CUDA/ROCm from a single lockfile — always specify `--extra rocm` or `--extra cu128`.
**GPU requirement**: `configure_gpu()` (called by every training and eval script) raises `RuntimeError` if no CUDA/ROCm GPU is detected. This prevents accidentally running GPU workloads on CPU, which is almost always a mistake. The environment variable `PAWN_ALLOW_CPU=1` overrides this check as a last resort for the rare case where CPU execution is genuinely intended (e.g. a lightweight backfill script). Unit tests do not call `configure_gpu()` and run fine on CPU without the override.
## Engine (`engine/`)
**Single source of truth** for all chess logic. All game simulation, move generation, legality checks, tokenization, PGN parsing, and board state extraction happen in Rust. No Python chess libraries.
- Uses rayon for parallel game generation (~43K games/sec, 150M+/hr)
- PyO3 bindings expose `chess_engine` module to Python
- Key functions: `generate_random_games()`, `parse_pgn_file()`, `compute_legal_token_masks_sparse()`, `extract_board_states()`, `export_move_vocabulary()`, `compute_accuracy_ceiling()`
## Model
### Architecture
- Decoder-only transformer, next-token prediction over 4,278 tokens
- Token vocabulary: 1 PAD + 4,096 grid (64x64 src/dst) + 176 promotions + 5 outcomes
- Factored embeddings: `src_embed[s] + dst_embed[d] + promo_embed[p]`
- Sequence format: `[outcome] [ply_1] ... [ply_N] [PAD] ... [PAD]` (256 tokens)
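The vocabulary arithmetic above can be checked with a short sketch. The source gives the 176 promotion tokens without saying how they are enumerated; the count below assumes one token per (source square, destination square, promotion piece) triple for both colors, which happens to yield exactly 176. The function name and square numbering are illustrative, not taken from the engine.

```python
# Sketch: reproduce the 4,278-token vocabulary size from first principles.
# Assumption (not confirmed by the source): promotion tokens are one per
# (src square, dst square, promotion piece) for both colors.

PROMO_PIECES = ("q", "r", "b", "n")


def promotion_square_pairs() -> list[tuple[int, int]]:
    """All (src, dst) pairs a promoting pawn can play, squares numbered 0..63."""
    pairs = []
    # (src_rank, dst_rank), 0-indexed: white 7th -> 8th, black 2nd -> 1st.
    for src_rank, dst_rank in ((6, 7), (1, 0)):
        for src_file in range(8):
            # Straight push plus the two diagonal captures, if on the board.
            for dst_file in (src_file - 1, src_file, src_file + 1):
                if 0 <= dst_file < 8:
                    pairs.append((src_rank * 8 + src_file, dst_rank * 8 + dst_file))
    return pairs


N_PAD = 1
N_GRID = 64 * 64
N_PROMO = len(promotion_square_pairs()) * len(PROMO_PIECES)
N_OUTCOME = 5
VOCAB_SIZE = N_PAD + N_GRID + N_PROMO + N_OUTCOME

print(N_PROMO, VOCAB_SIZE)  # 176 4278
```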
### Variants
- `CLMConfig.small()`: d=256, 8 layers, 4 heads, ~9.5M params
- `CLMConfig.base()`: d=512, 8 layers, 8 heads, ~35.8M params (default)
- `CLMConfig.large()`: d=640, 10 layers, 8 heads, ~68.4M params
- `CLMConfig.toy()`: d=64, 2 layers, for tests only
## Training
All training scripts require one of `--hf-repo REPO_ID` or `--local-checkpoints` (mutually exclusive). Use `--local-checkpoints` for local dev; use `--hf-repo` for any run where you need durable checkpoints.
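One way to enforce this either/or requirement is an `argparse` mutually exclusive group. This is a minimal sketch of the pattern, not the parser the scripts actually use; `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="checkpoint storage sketch")
    # required=True: exactly one of the two flags must be given.
    storage = parser.add_mutually_exclusive_group(required=True)
    storage.add_argument("--hf-repo", metavar="REPO_ID",
                         help="push checkpoints to this HuggingFace repo")
    storage.add_argument("--local-checkpoints", action="store_true",
                         help="keep checkpoints on local disk only")
    return parser


args = build_parser().parse_args(["--local-checkpoints"])
print(args.hf_repo, args.local_checkpoints)  # None True
```

Passing neither flag, or both, makes `argparse` exit with a usage error before any training code runs.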
### Pretraining
```bash
# Single model
uv run python scripts/train.py --variant base --local-checkpoints
# All three variants simultaneously (shared data batches, sequential GPU)
uv run python scripts/train_all.py --local-checkpoints
# Resume from checkpoint
uv run python scripts/train.py --variant base --resume checkpoints/step_00050000 --local-checkpoints
```
**`scripts/train.py`** key args:
- `--variant {small|base|large|toy}` — model size (default: base)
- `--resume PATH` — resume from checkpoint directory
- `--total-steps N` — training steps (default: 100,000)
- `--batch-size N` — batch size (default: 256)
- `--discard-ply-limit` — only train on naturally-ended games (no ply-limit truncation)
- Architecture overrides: `--d-model`, `--n-layers`, `--n-heads`, `--d-ff`, `--lr`, `--weight-decay`, `--warmup-steps`
**`scripts/train_all.py`** additional args:
- `--shm-checkpoints` — write checkpoints to `/dev/shm` (requires `--hf-repo`, volatile)
- `--run-evals` — auto-run probes + diagnostics after training completes
- `--publish-results` — push eval results to HF
- `--patience N` — per-model early stopping patience (eval intervals without improvement)
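Patience-based early stopping, as controlled by `--patience`, can be sketched as follows. The class name and the choice of validation loss as the tracked metric are assumptions for illustration; the scripts may track a different metric.

```python
class PatienceTracker:
    """Stop after `patience` consecutive evals without a new best value."""

    def __init__(self, patience: int):
        self.patience = patience
        self.best = float("inf")
        self.bad_evals = 0

    def update(self, val_loss: float) -> bool:
        """Record one eval result; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience
```

In `train_all.py` the patience is per model, so a sketch like this would need one tracker instance per variant.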
### Adapter Training
All adapter scripts require `--checkpoint PATH` (pretrained weights) and `--pgn PATH` (Lichess PGN file). They freeze the backbone and train only adapter parameters.
```bash
# Example: train a LoRA adapter on Lichess 1800-1900 games
uv run python scripts/train_lora.py \
--checkpoint thomas-schweich/pawn-base \
--pgn thomas-schweich/pawn-lichess-full --elo-min 1800 --elo-max 1900 \
--lora-rank 4 --lr 3e-4 --local-checkpoints
```
| Script | Adapter | Key args | Typical params |
|--------|---------|----------|----------------|
| `train_bottleneck.py` | Houlsby MLP | `--bottleneck-dim 8` | ~131K |
| `train_lora.py` | Low-rank attention | `--lora-rank 4 --lora-targets qkvo` | ~65K |
| `train_film.py` | Channel-wise affine | `--no-output-film` | ~17K |
| `train_sparse.py` | Binary mask | `--density 0.01 --sparse-targets qkvo` | ~503K-2.7M |
| `train_hybrid.py` | LoRA + FiLM | `--lora-rank 4 --film-lr 1e-3` | ~65K |
| `train_tiny.py` | None (from scratch) | `--d-model 84 --n-layers 2` | ~524K |
Common adapter args: `--epochs 50`, `--batch-size 64`, `--lr 3e-4`, `--patience 10`, `--val-every 1`, `--max-games 12000`, `--min-ply 10`
### Common CLI Patterns
- `--sdpa-math` — force MATH SDPA backend (required for ROCm + torch.compile)
- `--no-compile` — disable torch.compile
- `--no-amp` — disable mixed precision
- `--num-workers N` — DataLoader workers (default: 8 for adapters, 4 for pretraining)
- `--device {cuda|cpu}` — device selection
- `--wandb` — enable Weights & Biases logging
## Evaluation & Metrics
### Linear Probes
```bash
uv run python scripts/eval_probes.py --log-dir logs --device cuda
```
Trains linear probes on frozen hidden states to measure internal representations (piece type, check status, castling rights, material count, game phase, etc.). Args: `--n-games 4096`, `--n-val-games 1024`, `--n-epochs 20`, `--run RUN_NAME` (specific run).
### Move Prediction Accuracy
```bash
uv run python scripts/eval_accuracy.py \
--checkpoint thomas-schweich/pawn-base \
--pgn thomas-schweich/pawn-lichess-full --elo-min 1800 --elo-max 1900 \
--adapter-checkpoint logs/run_*/checkpoints/best
```
MAIA-compatible evaluation with per-phase and per-ply accuracy. Args: `--min-eval-ply 10`, `--max-games 50000`, `--per-ply`.
### Theoretical Accuracy Ceilings
```bash
uv run python scripts/compute_theoretical_ceiling.py
```
Computes theoretical accuracy ceilings for random games via Monte Carlo rollouts: unconditional (E[1/N_legal]), naive-conditioned (1-ply filter), and MC-conditioned (Bayes-optimal with outcome knowledge). Reports a bias bracket (naive vs split-half corrected estimates) and bootstrap 95% CIs clustered by game. CPU-intensive.
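The unconditional ceiling is the simplest of the three: under uniformly random play, no predictor can expect better than 1/N top-1 accuracy at a position with N legal moves, so the ceiling is the mean of 1/N over positions. A minimal sketch with made-up legal-move counts (the real script obtains them from the Rust engine; `unconditional_ceiling` is a hypothetical helper):

```python
from fractions import Fraction


def unconditional_ceiling(legal_move_counts: list[int]) -> Fraction:
    """Mean of 1/N over positions, i.e. E[1/N_legal] for uniform random play."""
    if not legal_move_counts:
        raise ValueError("need at least one position")
    total = sum(Fraction(1, n) for n in legal_move_counts)
    return total / len(legal_move_counts)


# Illustrative counts only, not measured data.
counts = [20, 20, 40, 10]
print(float(unconditional_ceiling(counts)))
```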
### Export to HuggingFace
```bash
uv run python scripts/export_hf_repo.py --run-dir logs/run_YYYYMMDD_HHMMSS
```
Converts a training run to HuggingFace repo format (safetensors + metrics). Finds best checkpoint by val loss.
## Checkpoints
Pre-trained weights are hosted on HuggingFace and loaded directly by repo ID:
- `thomas-schweich/pawn-small` — 9.5M params, `CLMConfig.small()`
- `thomas-schweich/pawn-base` — 35.8M params, `CLMConfig.base()`
- `thomas-schweich/pawn-large` — 68.4M params, `CLMConfig.large()`
All scripts accept HF repo IDs for `--checkpoint` (e.g. `--checkpoint thomas-schweich/pawn-base`). Weights are downloaded and cached automatically via `huggingface_hub`.
### Checkpoint Format (safetensors)
Checkpoints are directories, not single files:
```
step_00065000/
├── model.safetensors # model weights
├── optimizer.safetensors # flattened optimizer state
├── training_state.json # step, scheduler, scaler, RNG (base64)
├── config.json # model + training config
└── .complete # SHA-256 hashes of all files (integrity sentinel)
```
Central module: `pawn/checkpoint.py`. All save/load goes through this module.
Legacy `.pt` files are still loadable (backward compatible).
### Checkpoint Storage Modes
All training scripts require one of:
- `--hf-repo REPO_ID` — push checkpoints to a HuggingFace branch as they're written (durable)
- `--local-checkpoints` — save locally only (for development without an HF account)
HF mode creates a `run/{run_id}` branch. HF pushes happen in background threads (one per model slot) so training is not blocked by uploads. Squash-merge into main when satisfied.
Optional: `--shm-checkpoints` writes checkpoints to `/dev/shm` (RAM-backed filesystem, instant writes). Requires `--hf-repo` since `/dev/shm` is volatile. Old checkpoints are cleaned up after successful HF push, keeping only the latest and the best (by val loss) for post-training evals.
### Data Integrity
**Every checkpoint write is atomic**: files are written to a `.tmp` directory, then renamed.
The `.complete` sentinel contains SHA-256 hashes of every file in the checkpoint.
**Hashes are always verified on load — no exceptions.**
- `IncompleteCheckpointError` — raised when `.complete` sentinel is missing
- `CheckpointIntegrityError` — raised when any hash mismatches
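The write-then-rename and verify-on-load behavior can be sketched with the standard library alone. This is not the code in `pawn/checkpoint.py`: the function names, the JSON layout of the sentinel, and the two exception classes defined here are stand-ins for illustration.

```python
import hashlib
import json
import os
from pathlib import Path

SENTINEL = ".complete"


class IncompleteCheckpointError(Exception):
    """Stand-in for the error raised when the sentinel is missing."""


class CheckpointIntegrityError(Exception):
    """Stand-in for the error raised when a file hash does not match."""


def _sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def save_atomic(final_dir: Path, files: dict[str, bytes]) -> None:
    """Write all files plus the sentinel into `<dir>.tmp`, then rename."""
    tmp_dir = final_dir.with_name(final_dir.name + ".tmp")
    tmp_dir.mkdir(parents=True)
    hashes = {}
    for name, payload in files.items():
        (tmp_dir / name).write_bytes(payload)
        hashes[name] = _sha256(tmp_dir / name)
    # The sentinel is written last, so its presence implies every file landed.
    (tmp_dir / SENTINEL).write_text(json.dumps(hashes))
    os.rename(tmp_dir, final_dir)  # atomic on POSIX within one filesystem


def verify(ckpt_dir: Path) -> None:
    """Raise unless the sentinel exists and every recorded hash matches."""
    sentinel = ckpt_dir / SENTINEL
    if not sentinel.exists():
        raise IncompleteCheckpointError(str(ckpt_dir))
    for name, expected in json.loads(sentinel.read_text()).items():
        if _sha256(ckpt_dir / name) != expected:
            raise CheckpointIntegrityError(name)
```

A reader that only ever opens the final directory name never sees a half-written checkpoint, because the directory appears under that name in a single rename.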
**Never use `kill -9` on training processes.** SIGTERM is handled gracefully: a flag is set,
the training loop checks it between steps, saves a checkpoint, pushes to HF, and exits cleanly.
**Never rsync checkpoint files from running pods.** Checkpoints are pushed to HuggingFace
from the trainer. Load via HF repo ID (e.g. `--checkpoint thomas-schweich/pawn-base`).
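The SIGTERM flow described above (set a flag, check it between steps, save, exit) follows a common pattern. A minimal sketch, assuming hypothetical `run_step` and `save_checkpoint` callables rather than the trainer's real internals:

```python
import signal


class StopFlag:
    """Set by the SIGTERM handler, polled by the training loop."""

    def __init__(self) -> None:
        self.requested = False

    def handle(self, signum, frame) -> None:
        # Only record the request; the loop decides when it is safe to act.
        self.requested = True

    def install(self) -> None:
        signal.signal(signal.SIGTERM, self.handle)


def train(total_steps: int, stop: StopFlag, run_step, save_checkpoint) -> int:
    """Run steps until done or a stop is requested; return steps completed."""
    step = 0
    while step < total_steps:
        run_step(step)
        step += 1
        if stop.requested:
            save_checkpoint(step)  # the real trainer also pushes to HF here
            break
    return step
```

SIGKILL cannot be caught by any handler, which is why `kill -9` bypasses this path and can leave a run without its final checkpoint.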
## RunPod Operations
### Docker Image
A single Docker image (`thomasschweich/pawn:latest`) is **automatically built and pushed to Docker Hub by CI** on every merge to main. No manual builds needed.
The image is based on `runpod/pytorch` (CUDA + SSH + Jupyter) with all Python deps pre-installed. Code lives at `/opt/pawn` on pods. SSH in and run experiments directly.
To build locally (rarely needed):
```bash
docker build --platform linux/amd64 \
--build-arg GIT_HASH=$(git rev-parse HEAD) \
-t thomasschweich/pawn:latest .
```
### Pod Lifecycle
Use `deploy/pod.sh` for all pod management. Requires `runpodctl` (`wget -qO- cli.runpod.net | sudo bash`).
```bash
# Create a pod
bash deploy/pod.sh create myexp --gpu h100
# SSH into it
bash deploy/pod.sh ssh myexp
# Launch training
bash deploy/pod.sh launch myexp scripts/train_all.py --hf-repo thomas-schweich/pawn-{variant}
# Stop (preserves volume, stops billing)
bash deploy/pod.sh stop myexp
# Delete (destroys everything)
bash deploy/pod.sh delete myexp
```
GPU shortcuts: `a5000`, `a40`, `a6000`, `4090`, `5090`, `l40s`, `h100`. Pod configs are cached in `~/.config/pawn/pods/<name>.env`.
### GPU Selection
Benchmarks from pretraining 3 models concurrently (`train_all.py`, batch=256):
| GPU | VRAM | $/hr | Step time | 100K cost | Notes |
|-----|------|------|-----------|-----------|-------|
| B200 | 192GB | $4.99 | 0.28s | ~$39 | Fastest |
| H200 SXM | 80GB | $3.59 | 0.34s | ~$34 | Best wall-clock/cost balance |
| RTX PRO 6000 | 48GB | $1.89 | 0.62s | ~$33 | Cheapest viable |
| A100 PCIe | 80GB | $1.39 | 0.79s | ~$30 | Cheapest overall |
| L40S | 48GB | $0.86 | 1.37s | ~$33 | Slow but cheap |
| RTX 5090/4090/3090 | 24-32GB | — | OOM | — | Insufficient VRAM for 3 models |
Total cost for 100K steps varies little across viable GPUs ($30-39), so choose a GPU by how long you are willing to wait rather than by hourly price. Single-model training fits on 24GB GPUs.
### Required Pod Configuration
- **Always attach a network volume.** Checkpoints write to disk during atomic rename and HF push. Ephemeral container disk is lost on pod termination.
- **Set `HF_TOKEN` as a pod environment variable** for automatic HuggingFace authentication. The entrypoint persists it to `~/.cache/huggingface/token`.
- `PAWN_MODEL=thomas-schweich/pawn-base` — auto-pull a checkpoint on startup (runner target).
- `PAWN_CMD` — training command to execute (alternative to Docker CMD args).
### Pod Safety
- Stop pods with `runpodctl pod stop` or `bash deploy/pod.sh stop` — sends SIGTERM, trainer saves and pushes before exiting.
- **Never `runpodctl pod delete` while training is running** — data loss risk.
- **Never `kill -9` training processes** — use SIGTERM (plain `kill`), which triggers graceful shutdown.
- **Never rsync checkpoint files from running pods** — load via HF repo ID instead.
### HuggingFace Bucket Backups
Use HF buckets (`hf://buckets/...`) to back up experiment data from pods. Buckets are not datasets or repos — they use the `sync_bucket` API, not `upload_file` with `repo_type`.
**Key constraint: upload bandwidth from pods is ~1.8 MB/s.** Checkpoints are ~430MB each (143MB model + 287MB optimizer). A full training run with 5K-step checkpoint intervals produces ~17GB per trial. Sync selectively.
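The sizes above translate into upload times as follows. This is worked arithmetic only; the 200K-step run length is an assumption chosen because it reproduces the ~17GB figure at 5K-step intervals.

```python
MODEL_MB = 143
OPTIMIZER_MB = 287
UPLOAD_MB_PER_S = 1.8
TOTAL_STEPS = 200_000      # assumption: run length that yields ~17GB per trial
CHECKPOINT_EVERY = 5_000

checkpoint_mb = MODEL_MB + OPTIMIZER_MB            # size of one checkpoint
n_checkpoints = TOTAL_STEPS // CHECKPOINT_EVERY    # checkpoints per trial
run_gb = checkpoint_mb * n_checkpoints / 1000      # total per trial, in GB


def upload_seconds(size_mb: float) -> float:
    """Transfer time at the pod's observed upload bandwidth."""
    return size_mb / UPLOAD_MB_PER_S


print(f"one checkpoint: {upload_seconds(checkpoint_mb) / 60:.1f} min")
print(f"full trial:     {upload_seconds(run_gb * 1000) / 3600:.1f} h")
```

A single checkpoint takes about four minutes to upload, while every checkpoint from one trial takes well over two hours, which is the reason to sync metrics continuously and checkpoints selectively.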
**During training — sync metrics only (instant):**
```python
import os
from huggingface_hub import HfApi

# Assumes HF_TOKEN is exported in the environment.
api = HfApi(token=os.environ["HF_TOKEN"])
api.sync_bucket(
    source="/workspace/logs",
    dest="hf://buckets/OWNER/BUCKET/logs",
    exclude=["*/checkpoints/*"],
)
```
**For individual files (lab notes, transcripts) — stage in a temp dir:**
```python
import tempfile, shutil, os
# `api` is the HfApi client created in the previous snippet.
with tempfile.TemporaryDirectory() as td:
    shutil.copy("/workspace/runs/lab-notes.md", os.path.join(td, "lab-notes.md"))
    api.sync_bucket(source=td, dest="hf://buckets/OWNER/BUCKET")
```
`upload_file(..., repo_type="bucket")` does **not** work — buckets are not a valid repo type for that API. Always use `sync_bucket`.
**After training — sync only best/final checkpoints:**
```python
api.sync_bucket(
    source="/workspace/logs/trial_XXXX/run_.../checkpoints/step_BEST",
    dest="hf://buckets/OWNER/BUCKET/checkpoints/trial_XXXX/best",
)
```
**HF_TOKEN**: Stored in `/opt/pawn/.env` on pods. Source it or `export` before calling the API.
## Monitoring Training Progress
### Key Principle: Write Scripts to Disk for Pre-Approval
When setting up recurring monitoring, **always write the monitoring script to a file first** so the user can review and pre-approve it. This avoids repeated permission prompts when `/loop` fires.
**Pattern:**
1. Write a bash script to disk (e.g., `scripts/check_my_run.sh`)
2. User reviews and approves the script
3. Schedule with `/loop 15m bash scripts/check_my_run.sh`
**Example monitoring script:**
```bash
#!/usr/bin/env bash
# scripts/check_my_run.sh — monitor a specific training run
set -euo pipefail
bash /home/tas/pawn/scripts/monitor_training.sh <POD_ID>
```
Or for local-only monitoring:
```bash
#!/usr/bin/env bash
set -euo pipefail
bash /home/tas/pawn/scripts/check_progress.sh --sync
```
### Available Monitoring Tools
| Tool | What it does |
|------|-------------|
| `scripts/monitor_training.sh [POD_ID]` | SSH to pod, sync metrics via rsync, show per-variant step/loss/acc/ETA, check HF checkpoint branches |
| `scripts/check_progress.sh [LOG_DIR]` | Show progress from local `logs/` directory |
| `python -m pawn.dashboard --log-dir logs` | Solara web dashboard with interactive charts |
### Dashboard
```bash
python -m pawn.dashboard --log-dir logs
```
Reads `metrics.jsonl` files, no dependency on training packages. Auto-detects run type from config fields. Shows loss curves, accuracy, LR schedules, GPU utilization, patience clocks, and adapter-specific diagnostics. Requires restart for code changes (no hot reload).
## Logs
Training metrics in `logs/` (gitignored). Each run gets a timestamped directory with `metrics.jsonl` and a random slug (e.g., `run_20260325_140000_zesty-osprey/`).
`MetricsLogger` (`pawn/logging.py`) writes one JSON object per line. Every record includes timestamp, step, elapsed time, and memory stats. Config records include hostname, git hash, git tag, and run slug.
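Because each line is an independent JSON object, the log can be read without any PAWN code. A minimal sketch; the helper names are hypothetical, and any key other than `step` (for example `val_loss`) is an assumption about the record schema.

```python
import json
from pathlib import Path


def read_metrics(path: Path) -> list[dict]:
    """Parse a JSONL file: one JSON object per non-empty line."""
    records = []
    with path.open() as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def latest(records: list[dict], key: str):
    """Return the most recent record that contains `key`, or None."""
    for record in reversed(records):
        if key in record:
            return record
    return None
```

Usage would look like `latest(read_metrics(Path("logs/<run>/metrics.jsonl")), "val_loss")`, with the run directory filled in.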
## Hyperparameter Sweeps
Optuna integration via `pawn/sweep.py` and `scripts/sweep.py`:
```bash
uv run python scripts/sweep.py \
--adapter lora --n-trials 30 --n-jobs 2 --n-gpus 2 \
--total-steps 20000 --pruner hyperband \
--checkpoint thomas-schweich/pawn-base --pgn thomas-schweich/pawn-lichess-full \
--local-checkpoints
```
Supports all adapter types + architecture search. GPU affinity assigns `CUDA_VISIBLE_DEVICES = trial.number % n_gpus`. SQLite-backed study persistence. Pruner options: `hyperband`, `median`, `none`.
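The GPU affinity rule is plain round-robin on the trial number. A minimal sketch of how a subprocess environment could be built from it; `gpu_env_for_trial` is a hypothetical helper, not a function in `pawn/sweep.py`.

```python
import os


def gpu_env_for_trial(trial_number: int, n_gpus: int) -> dict[str, str]:
    """Copy of the environment with one GPU pinned, round-robin by trial number."""
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(trial_number % n_gpus)
    return env


for trial_number in range(5):
    env = gpu_env_for_trial(trial_number, n_gpus=2)
    print(trial_number, env["CUDA_VISIBLE_DEVICES"])
```

With `--n-jobs` equal to `--n-gpus`, consecutive trial numbers land on different devices, though a pruned trial can leave two live trials sharing one GPU until the next assignment.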
## Key Patterns & Gotchas
- **DataLoader workers must use `multiprocessing_context='spawn'`** — the Rust engine uses rayon, and fork after rayon init causes deadlocks.
- **`SDPA_BACKEND` must be set before `torch.compile()`** — compiled code captures the backend at trace time. `apply_gpu_config()` handles this.
- **ROCm works**: The only known ROCm issue is a stride mismatch in flash attention backward when combined with `torch.compile` + AMP. The workaround is `--sdpa-math` (use the MATH SDPA backend instead of flash), which `configure_gpu()` applies automatically on AMD GPUs. Everything else — training, eval, adapters, data loading — works identically on ROCm and CUDA. **Do not assume bugs are ROCm-specific.** Every other time something has failed on AMD it turned out to be a bug in our code (wrong torch version installed, stale lockfile, missing dependency, etc.), not a ROCm issue.
- **Sparse logit projection**: `forward_hidden()` returns `(B,T,d_model)`, then only loss-masked positions project through `lm_head` — avoids full `(B,T,V)` materialization.
- **Legal mask via Rust**: `LegalMaskBuilder` replays games in Rust, returns sparse indices (~2 MB) scattered into a pre-allocated GPU buffer (vs ~70 MB dense).
- **GPU auto-detection**: `pawn.gpu.configure_gpu()` selects compile/AMP/SDPA settings. `apply_gpu_config()` applies them. NVIDIA uses flash attention + compile; AMD uses MATH SDPA + compile. Both paths are tested and production-validated.
- **Factored embeddings**: each move token decomposes into `src_embed[s] + dst_embed[d] + promo_embed[p]`, reducing embedding parameters by ~32x.
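The ~32x figure for factored embeddings can be reproduced with a quick parameter count. The table sizes are assumptions: 64 rows each for source and destination squares, and 5 rows for the promotion state (none plus four pieces); the source does not state them.

```python
D_MODEL = 512          # CLMConfig.base()
VOCAB_SIZE = 4278

# Assumed table sizes: 64 source squares, 64 destination squares,
# and 5 promotion states (none, q, r, b, n). The last one is a guess.
N_SRC, N_DST, N_PROMO_STATES = 64, 64, 5

flat_params = VOCAB_SIZE * D_MODEL
factored_params = (N_SRC + N_DST + N_PROMO_STATES) * D_MODEL
reduction = flat_params / factored_params

print(f"{flat_params:,} -> {factored_params:,} ({reduction:.1f}x)")
```

The ratio is independent of `d_model`, since both counts scale with it linearly.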
---
# Current task
# Pretraining Ablations: mate-boost, no-outcome, discard-ply-limit
You are running three pretraining ablation experiments on PAWN-Base. Each ablation modifies one aspect of the random game generation to understand its effect on the learned world model.
You are on a RunPod with 3x RTX 6000 Ada (48GB VRAM each, $0.77/hr each = $2.31/hr total). Use the **pawn-lab MCP server** and the **manage-pod skill** workflow.
**As your first action**, set up the check-in crons per the manage-pod skill, then proceed.
## The Three Ablations
| Ablation | Config | Hypothesis |
|----------|--------|------------|
| **mate-boost** | `mate_boost: 1.0` | Always taking mate-in-1 produces shorter, more decisive games (~134 avg ply vs ~238). More checkmate patterns, fewer aimless endgames. |
| **no-outcome** | `no_outcome_token: true` | Stripping the outcome token forces the model to infer game result from moves alone. Tests whether outcome conditioning helps or hurts. |
| **discard-ply-limit** | `discard_ply_limit: true` | Only naturally-ended games (no truncation at 255 plies). All games have meaningful endings. |
There is no separate baseline run. The existing PAWN-Base checkpoint (`thomas-schweich/pawn-base`), trained for 100K steps, is the baseline reference.
## Unified Config
All training uses the unified `scripts/train.py` with Pydantic `RunConfig`. Call `lab_schema` to see all available fields. Configs are JSON dicts passed to `lab_launch`.
**Base config shared by all ablations:**
```json
{
"run_type": "pretrain",
"variant": "base",
"total_steps": 200000,
"batch_size": 256,
"lr": 3e-4,
"local_checkpoints": true,
"amp_dtype": "bfloat16",
"eval_interval": 5000,
"num_workers": 4
}
```
Each ablation adds its flag on top. Example for mate-boost:
```json
{
"run_type": "pretrain",
"variant": "base",
"mate_boost": 1.0,
"total_steps": 200000,
"batch_size": 256,
"lr": 3e-4,
"local_checkpoints": true,
"amp_dtype": "bfloat16",
"eval_interval": 5000,
"num_workers": 4
}
```
## Procedure
### Phase 1: Learning Rate Exploration (~1 hour)
For each of the 3 ablations, launch 3 trials with different learning rates using `pause_after_steps`:
```
lab_launch(config={
"run_type": "pretrain",
"variant": "base",
"mate_boost": 1.0, # ← ablation flag
"total_steps": 200000,
"pause_after_steps": 5000, # ← pause for comparison
"batch_size": 256,
"lr": 1e-4, # ← vary this: 1e-4, 3e-4, 1e-3
"local_checkpoints": true,
"amp_dtype": "bfloat16",
"no_compile": true, # ← skip compile for short exploration
"eval_interval": 2500,
"num_workers": 4
})
```
That's 9 trials total (3 LRs × 3 ablations). With 3 GPUs, run 3 trials concurrently (one per GPU). Run each ablation's 3 LRs sequentially on its GPU, or interleave — the lab runner assigns GPUs automatically. Each pauses at 5K steps. Use `no_compile: true` to skip the 15-30 min compile overhead for these short runs.
After all 9 pause, call `lab_results` to compare val_loss at 5K steps. Pick the best LR for each ablation.
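The nine Phase 1 configs differ only in the ablation flag and the learning rate, so they can be generated rather than typed by hand. A minimal sketch that builds the dicts; it does not call `lab_launch`, and `phase1_configs` is a hypothetical helper.

```python
import itertools
import json

BASE_CONFIG = {
    "run_type": "pretrain",
    "variant": "base",
    "total_steps": 200000,
    "pause_after_steps": 5000,
    "batch_size": 256,
    "local_checkpoints": True,
    "amp_dtype": "bfloat16",
    "no_compile": True,
    "eval_interval": 2500,
    "num_workers": 4,
}

ABLATION_FLAGS = {
    "mate-boost": {"mate_boost": 1.0},
    "no-outcome": {"no_outcome_token": True},
    "discard-ply-limit": {"discard_ply_limit": True},
}

LEARNING_RATES = [1e-4, 3e-4, 1e-3]


def phase1_configs() -> list[tuple[str, dict]]:
    """One (label, config) pair per ablation x learning rate."""
    trials = []
    for (name, flags), lr in itertools.product(ABLATION_FLAGS.items(), LEARNING_RATES):
        trials.append((f"{name}@lr={lr:g}", {**BASE_CONFIG, **flags, "lr": lr}))
    return trials


for label, config in phase1_configs():
    print(label, json.dumps(config))
```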
### Phase 2: Full Training (~22 hours)
Resume the best LR for each ablation to 200K steps with torch.compile enabled:
```
lab_resume(trial_id=BEST_MATE_BOOST) # no pause → runs to total_steps
lab_resume(trial_id=BEST_NO_OUTCOME)
lab_resume(trial_id=BEST_DISCARD_PLY)
```
`lab_resume` clears `pause_after_steps` by default, so the resumed trials run to completion at `total_steps=200000`. The resumed trials pick up from the 5K-step checkpoint with optimizer state intact.
**Important:** Do NOT pass `no_compile: true` for the resumed runs. Let torch.compile run — the 15-30 min compile overhead amortizes over 195K remaining steps.
With 1 model per GPU (no sharing), expect ~0.4s/step → ~22 hours for 195K steps. All 3 run in parallel. Total cost: ~$51.
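The time and cost estimate follows directly from the step time and the pod price. Worked out explicitly (all inputs are the figures quoted in this document):

```python
STEP_TIME_S = 0.4
REMAINING_STEPS = 200_000 - 5_000   # resume from the 5K-step checkpoint
N_GPUS = 3
COST_PER_GPU_HOUR = 0.77

hours = REMAINING_STEPS * STEP_TIME_S / 3600
pod_cost = hours * N_GPUS * COST_PER_GPU_HOUR

print(f"{hours:.1f} h wall-clock, ${pod_cost:.2f} for the pod")
```

The three runs share the wall-clock time because they run in parallel, so the pod is billed for roughly 22 hours once; rounding up to 22 h at $2.31/hr gives the ~$51 quoted above.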
### Phase 3: Evaluation
After all 3 runs converge, compare:
- Final val_loss and accuracy vs baseline (PAWN-Base at 100K steps)
- Learning curves (loss vs step for all 3, plotted together)
- Per-ply accuracy via `scripts/eval_accuracy.py` if time permits
## Monitoring
Use the manage-pod skill check-in pattern:
- **Mini check-in (every 5 min):** `lab_events` — launch next trial if GPU idle
- **Full check-in (hourly):** `lab_status` + `lab_log` for each running trial + update lab notes
During Phase 1 (9 short trials), use 5-min check-ins — trials finish fast.
During Phase 2 (3 long runs), switch to hourly after confirming stability.
### What to watch for
- **NaN loss:** Use `lab_log` to confirm, then kill the trial immediately and relaunch with the learning rate divided by 3.
- **discard-ply-limit throughput:** This ablation discards ~60% of generated games. If step times are much slower than the other two, reduce `num_workers` to free CPU for data generation.
- **mate-boost game lengths:** Check `lab_log` — mate-boost games are ~134 plies vs ~238 normal. Each batch processes more games, which may cause higher memory usage. If OOM, reduce batch_size to 128.
- **GPU assignment:** The lab runner assigns one trial per GPU via CUDA_VISIBLE_DEVICES. If a trial fails and a GPU sits idle, launch a replacement promptly.
## Cost Budget
| Phase | Trials | Time | Cost |
|-------|--------|------|------|
| LR exploration | 9 × 5K steps | ~1h | ~$3 |
| Full training | 3 × 195K steps (parallel) | ~22h | ~$51 |
| **Total** | | **~23h** | **~$53** |
Set cost tracking: `lab_set_cost(cost_per_hour=2.31)`
## Lab Notes
Write experiment state to `runs/lab-notes.md` (survives context compaction via PostCompact hook). Include:
- Phase 1 results table (LR × ablation → val_loss at 5K steps)
- Which LR was selected for each ablation and why
- Phase 2 progress (step, loss, ETA per ablation)
- Any anomalies or adjustments made
## Current State (as of 2026-04-03 19:30 UTC)
### Phase 2: Full Training (RUNNING)
Three ablation runs training to 200K steps, torch.compile ON, resumed from 5K:
| Trial | GPU | Ablation | LR | Step (~19:00) | train_loss | acc |
|-------|-----|----------|-----|---------------|-----------|------|
| 24 | 0 | mate-boost | 1e-3 | ~120K | 3.19 | 6.8% |
| 25 | 1 | no-outcome | 3e-4 (no warmup) | ~116K | 3.09 | 7.0% |
| 26 | 2 | discard-ply-limit | 1e-3 | ~119K | 3.18 | 7.5% |
- ETA: ~2026-04-04 04:30 UTC (~9h remaining)
- Hourly cron monitors via `lab_events` + `lab_log`
- Baseline ref: PAWN-Base val_loss ~3.06, acc ~6.9% at 100K steps
### Concurrent: Accuracy Ceiling Computation (RUNNING)
Running `scripts/compute_theoretical_ceiling.py` on spare CPUs (RAYON_NUM_THREADS=30) while GPUs train:
```bash
RAYON_NUM_THREADS=30 PAWN_ALLOW_CPU=1 /opt/pawn/.venv/bin/python \
scripts/compute_theoretical_ceiling.py \
--n-games 5000 --rollouts 1024 --sample-rate 0.075 \
--model-accuracy 0.069 --output /workspace/data/theoretical_ceiling_1024.json
```
- 1024 rollouts/move (8x previous), expected to narrow bias bracket from 0.66pp to ~0.23pp
- Processes in 10 batches of 500 games, printing intermediate estimates
- ETA: ~3-4 hours (finishes well before training)
- Output: `/workspace/data/theoretical_ceiling_1024.json`
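The expected bracket width is consistent with Monte Carlo error shrinking as the inverse square root of the rollout count. That scaling is an assumption about how the ~0.23pp estimate was derived; the source does not state it.

```python
import math


def scaled_bracket(previous_pp: float, rollout_multiplier: float) -> float:
    """Bracket width after multiplying rollouts, assuming 1/sqrt(N) scaling."""
    return previous_pp / math.sqrt(rollout_multiplier)


print(f"{scaled_bracket(0.66, 8):.2f} pp")  # 0.23 pp
```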
### Bug Fixes Applied This Session
- **`pause_after_steps`**: Was missing from `trainer.py` (only in adapter training). Added to `TrainingConfig` and `trainer.py` train loop. Wired through `scripts/train.py`.
- **`lab_resume` checkpoint discovery**: Was looking for `checkpoints/best` or `checkpoints/final` only. Pretraining uses `step_XXXXXXXX` naming. Added fallback to pick highest `step_*` dir.
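The checkpoint discovery fallback can be sketched as below. This is an illustration of the described behavior, not the actual `lab_resume` code; the function name and the preference order between `best` and `final` are assumptions.

```python
from pathlib import Path
from typing import Optional


def find_resume_checkpoint(checkpoints_dir: Path) -> Optional[Path]:
    """Prefer `best`, then `final`, else the highest-numbered `step_*` dir."""
    for name in ("best", "final"):
        candidate = checkpoints_dir / name
        if candidate.is_dir():
            return candidate
    step_dirs = []
    for child in checkpoints_dir.glob("step_*"):
        suffix = child.name[len("step_"):]
        # Ignore stray files and names such as `step_00005000.tmp`.
        if child.is_dir() and suffix.isdigit():
            step_dirs.append((int(suffix), child))
    if not step_dirs:
        return None
    return max(step_dirs, key=lambda pair: pair[0])[1]
```

Comparing the parsed integer rather than the directory name keeps the result correct even if step directories were written with different zero-padding widths.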
### HF Bucket Backup
All experiment data syncs to `hf://buckets/thomas-schweich/pretraining-ablations`:
- Metrics/configs synced every ~4 hours (excluding checkpoints)
- 100K-step checkpoints for all 3 ablations uploaded
- Lab notes + chat transcript synced periodically
- HF_TOKEN in `/opt/pawn/.env`
- See "HuggingFace Bucket Backups" section above for API patterns
### Lab Notes
Detailed log at `/workspace/runs/lab-notes.md` (symlinked from `/opt/pawn/runs/lab-notes.md`). Survives context compaction via PostCompact hook. Read this first when resuming.
## Post-Training Procedure
When each training run completes (in any order):
1. **Upload best checkpoint** to HF bucket in the background (see bucket sync patterns above)
2. **Run linear probes** on the best checkpoint in the background:
```bash
PAWN_ALLOW_CPU=1 /opt/pawn/.venv/bin/python scripts/eval_probes.py \
--checkpoint /workspace/logs/trial_XXXX/run_.../checkpoints/step_BEST \
--log-dir /workspace/eval_probes/trial_XXXX --device cuda:N
```
Use the GPU that just freed up. Run in background so other completions aren't blocked.
3. **After the last probe run is kicked off**, draft the final report comparing all 3 ablations + baseline:
- val_loss and accuracy curves
- Probe results (piece type, check, castling, material, game phase)
- Accuracy ceiling comparison
- Key findings and surprises
4. **Update the report** with probe results as they finish
5. **Upload everything** to HF bucket: final checkpoints, probe results, report, lab notes
## Success Criteria
1. All three ablations train to 200K steps without divergence
2. Clear comparison of val_loss and accuracy across ablations
3. At least one ablation shows meaningful difference from baseline (better or worse — both are informative)