| # PAWN (Playstyle-Agnostic World-model Network for Chess) | |
| A causal transformer trained on random chess games, designed as a testbed for finetuning and augmentation methods at small scales. Apache 2.0. | |
| ## Repository Structure | |
| ``` | |
| pawn/ | |
| ├── engine/ # Rust chess engine with PyO3 bindings (via shakmaty) | |
| ├── pawn/ # Core Python package | |
| │ ├── config.py # CLMConfig (small/base/large), TrainingConfig | |
| │ ├── model.py # PAWNCLM transformer (RMSNorm, SwiGLU, RoPE, factored embeddings) | |
| │ ├── data.py # On-the-fly random game data pipeline | |
| │ ├── lichess_data.py # Lichess PGN data pipeline + legal mask computation | |
| │ ├── trainer.py # Pretraining loop | |
| │ ├── gpu.py # GPU auto-detection (compile/AMP/SDPA backend) | |
| │ ├── logging.py # MetricsLogger (JSONL output) | |
| │ ├── checkpoint.py # Atomic save/load, .complete sentinel, HF push | |
| │ ├── adapters/ # Bottleneck, LoRA, FiLM, sparse, hybrid | |
| │ ├── eval_suite/ # Probes, generation tests, diagnostics, lichess eval | |
| │ └── dashboard/ # Solara training dashboard (metrics, charts, runner) | |
| ├── scripts/ # Training and evaluation entry points | |
| ├── tests/ # Unit tests | |
| ├── deploy/ # Runpod deployment scripts | |
| └── docs/ # Architecture, training, adapter docs | |
| ``` | |
| ## Building | |
| This is a uv workspace. The root project is the `pawn` Python package; `engine/` is the sole workspace member. | |
| ```bash | |
| # Build the Rust chess engine (required before anything else) | |
| cd engine && uv run --with maturin maturin develop --release && cd .. | |
| # Install Python deps (dev tools like pytest, seaborn, solara are in base dependencies): | |
| uv sync --extra rocm # AMD (ROCm 7.1) | |
| uv sync --extra cu128 # NVIDIA (CUDA 12.8) | |
| # Run tests | |
| uv run pytest tests/ | |
| # Pretrain from scratch (local dev) | |
| uv run python scripts/train.py --variant base --local-checkpoints | |
| ``` | |
| The only extras are GPU backends (`rocm` or `cu128`). Everything else (pytest, solara, optuna, seaborn, etc.) is in base dependencies. PyTorch lives in the extras because uv can't resolve CPU/CUDA/ROCm from a single lockfile — always specify `--extra rocm` or `--extra cu128`. | |
| **GPU requirement**: `configure_gpu()` (called by every training and eval script) raises `RuntimeError` if no CUDA/ROCm GPU is detected. This prevents accidentally running GPU workloads on CPU, which is almost always a mistake. The environment variable `PAWN_ALLOW_CPU=1` overrides this check as a last resort for the rare case where CPU execution is genuinely intended (e.g. a lightweight backfill script). Unit tests do not call `configure_gpu()` and run fine on CPU without the override. | |
| ## Engine (`engine/`) | |
| **Single source of truth** for all chess logic. All game simulation, move generation, legality checks, tokenization, PGN parsing, and board state extraction happen in Rust. No Python chess libraries. | |
| - Uses rayon for parallel game generation (~43K games/sec, 150M+/hr) | |
| - PyO3 bindings expose `chess_engine` module to Python | |
| - Key functions: `generate_random_games()`, `parse_pgn_file()`, `compute_legal_token_masks_sparse()`, `extract_board_states()`, `export_move_vocabulary()`, `compute_accuracy_ceiling()` | |
| ## Model | |
| ### Architecture | |
| - Decoder-only transformer, next-token prediction over 4,278 tokens | |
| - Token vocabulary: 1 PAD + 4,096 grid (64x64 src/dst) + 176 promotions + 5 outcomes | |
| - Factored embeddings: `src_embed[s] + dst_embed[d] + promo_embed[p]` | |
| - Sequence format: `[outcome] [ply_1] ... [ply_N] [PAD] ... [PAD]` (256 tokens) | |
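The vocabulary arithmetic above can be checked in a few lines. The sketch below also shows one way a grid token could be split back into source and destination squares. The ID layout it uses (PAD at 0, grid tokens immediately after, `src * 64 + dst` ordering) is an assumption made for illustration and is not confirmed by this document; the authoritative mapping comes from `export_move_vocabulary()`.

```python
# Vocabulary sizes as stated in the architecture list.
N_PAD = 1
N_GRID = 64 * 64        # every (src, dst) square pair
N_PROMO = 176           # promotion moves
N_OUTCOME = 5           # game-outcome tokens

VOCAB_SIZE = N_PAD + N_GRID + N_PROMO + N_OUTCOME

# ASSUMPTION: hypothetical ID layout, PAD = 0, then grid tokens in src-major order.
GRID_OFFSET = N_PAD


def encode_grid_token(src: int, dst: int) -> int:
    """Map a (src, dst) square pair to a token ID under the assumed layout."""
    if not (0 <= src < 64 and 0 <= dst < 64):
        raise ValueError("square indices must be in [0, 64)")
    return GRID_OFFSET + src * 64 + dst


def decode_grid_token(token_id: int) -> tuple[int, int]:
    """Split a grid token into (src, dst) square indices under the assumed layout."""
    idx = token_id - GRID_OFFSET
    if not 0 <= idx < N_GRID:
        raise ValueError(f"token {token_id} is not a grid token")
    return divmod(idx, 64)
```

The factored embedding lookup needs exactly this kind of decomposition: one index per component table rather than one row per token.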
| ### Variants | |
| - `CLMConfig.small()`: d=256, 8 layers, 4 heads, ~9.5M params | |
| - `CLMConfig.base()`: d=512, 8 layers, 8 heads, ~35.8M params (default) | |
| - `CLMConfig.large()`: d=640, 10 layers, 8 heads, ~68.4M params | |
| - `CLMConfig.toy()`: d=64, 2 layers, for tests only | |
| ## Training | |
| All training scripts require one of `--hf-repo REPO_ID` or `--local-checkpoints` (mutually exclusive). Use `--local-checkpoints` for local dev; use `--hf-repo` for any run where you need durable checkpoints. | |
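A required, mutually exclusive flag pair like this maps directly onto `argparse`. The following is a minimal sketch of that pattern, not the actual parser in `scripts/train.py`; the helper names `build_parser` and `parse_storage` are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="checkpoint storage flags (sketch)")
    # Exactly one of the two flags must be given.
    storage = parser.add_mutually_exclusive_group(required=True)
    storage.add_argument("--hf-repo", metavar="REPO_ID",
                         help="push checkpoints to HuggingFace (durable)")
    storage.add_argument("--local-checkpoints", action="store_true",
                         help="save locally only")
    return parser


def parse_storage(argv):
    """Return ('hf', repo_id) or ('local', None) for the given argument list."""
    args = build_parser().parse_args(argv)
    if args.hf_repo is not None:
        return ("hf", args.hf_repo)
    return ("local", None)
```

With this shape, passing both flags or neither makes `argparse` exit with a usage error before any training code runs.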
| ### Pretraining | |
| ```bash | |
| # Single model | |
| uv run python scripts/train.py --variant base --local-checkpoints | |
| # All three variants simultaneously (shared data batches, sequential GPU) | |
| uv run python scripts/train_all.py --local-checkpoints | |
| # Resume from checkpoint | |
| uv run python scripts/train.py --variant base --resume checkpoints/step_00050000 --local-checkpoints | |
| ``` | |
| **`scripts/train.py`** key args: | |
| - `--variant {small|base|large|toy}` — model size (default: base) | |
| - `--resume PATH` — resume from checkpoint directory | |
| - `--total-steps N` — training steps (default: 100,000) | |
| - `--batch-size N` — batch size (default: 256) | |
| - `--discard-ply-limit` — only train on naturally-ended games (no ply-limit truncation) | |
| - Architecture overrides: `--d-model`, `--n-layers`, `--n-heads`, `--d-ff`, `--lr`, `--weight-decay`, `--warmup-steps` | |
| **`scripts/train_all.py`** additional args: | |
| - `--shm-checkpoints` — write checkpoints to `/dev/shm` (requires `--hf-repo`, volatile) | |
| - `--run-evals` — auto-run probes + diagnostics after training completes | |
| - `--publish-results` — push eval results to HF | |
| - `--patience N` — per-model early stopping patience (eval intervals without improvement) | |
| ### Adapter Training | |
| All adapter scripts require `--checkpoint PATH` (pretrained weights) and `--pgn PATH` (Lichess PGN file). They freeze the backbone and train only adapter parameters. | |
| ```bash | |
| # Example: train a LoRA adapter on Lichess 1800-1900 games | |
| uv run python scripts/train_lora.py \ | |
| --checkpoint thomas-schweich/pawn-base \ | |
| --pgn thomas-schweich/pawn-lichess-full --elo-min 1800 --elo-max 1900 \ | |
| --lora-rank 4 --lr 3e-4 --local-checkpoints | |
| ``` | |
| | Script | Adapter | Key args | Typical params | | |
| |--------|---------|----------|----------------| | |
| | `train_bottleneck.py` | Houlsby MLP | `--bottleneck-dim 8` | ~131K | | |
| | `train_lora.py` | Low-rank attention | `--lora-rank 4 --lora-targets qkvo` | ~65K | | |
| | `train_film.py` | Channel-wise affine | `--no-output-film` | ~17K | | |
| | `train_sparse.py` | Binary mask | `--density 0.01 --sparse-targets qkvo` | ~503K-2.7M | | |
| | `train_hybrid.py` | LoRA + FiLM | `--lora-rank 4 --film-lr 1e-3` | ~65K | | |
| | `train_tiny.py` | None (from scratch) | `--d-model 84 --n-layers 2` | ~524K | | |
| Common adapter args: `--epochs 50`, `--batch-size 64`, `--lr 3e-4`, `--patience 10`, `--val-every 1`, `--max-games 12000`, `--min-ply 10` | |
| ### Common CLI Patterns | |
| - `--sdpa-math` — force MATH SDPA backend (required for ROCm + torch.compile) | |
| - `--no-compile` — disable torch.compile | |
| - `--no-amp` — disable mixed precision | |
| - `--num-workers N` — DataLoader workers (default: 8 for adapters, 4 for pretraining) | |
| - `--device {cuda|cpu}` — device selection | |
| - `--wandb` — enable Weights & Biases logging | |
| ## Evaluation & Metrics | |
| ### Linear Probes | |
| ```bash | |
| uv run python scripts/eval_probes.py --log-dir logs --device cuda | |
| ``` | |
| Trains linear probes on frozen hidden states to measure internal representations (piece type, check status, castling rights, material count, game phase, etc.). Args: `--n-games 4096`, `--n-val-games 1024`, `--n-epochs 20`, `--run RUN_NAME` (specific run). | |
| ### Move Prediction Accuracy | |
| ```bash | |
| uv run python scripts/eval_accuracy.py \ | |
| --checkpoint thomas-schweich/pawn-base \ | |
| --pgn thomas-schweich/pawn-lichess-full --elo-min 1800 --elo-max 1900 \ | |
| --adapter-checkpoint logs/run_*/checkpoints/best | |
| ``` | |
| MAIA-compatible evaluation with per-phase and per-ply accuracy. Args: `--min-eval-ply 10`, `--max-games 50000`, `--per-ply`. | |
| ### Theoretical Accuracy Ceilings | |
| ```bash | |
| uv run python scripts/compute_theoretical_ceiling.py | |
| ``` | |
| Computes theoretical accuracy ceilings for random games via Monte Carlo rollouts: unconditional (E[1/N_legal]), naive-conditioned (1-ply filter), and MC-conditioned (Bayes-optimal with outcome knowledge). Reports a bias bracket (naive vs split-half corrected estimates) and bootstrap 95% CIs clustered by game. CPU-intensive. | |
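The unconditional ceiling is the simplest of the three: against a uniformly random mover, the best top-1 guess at a position with N legal moves is right with probability 1/N, so the ceiling is the mean of 1/N over positions. A minimal sketch of that one quantity follows; the function name and the input list are illustrative, and the real script obtains legal-move counts from the Rust engine.

```python
from fractions import Fraction


def unconditional_ceiling(legal_move_counts):
    """Mean of 1/N over positions: best top-1 accuracy against a uniform random mover."""
    counts = list(legal_move_counts)
    if not counts:
        raise ValueError("need at least one position")
    if any(n < 1 for n in counts):
        raise ValueError("every position must have at least one legal move")
    # Exact rational arithmetic avoids accumulating float error over many positions.
    return float(sum(Fraction(1, n) for n in counts) / len(counts))
```

The naive-conditioned and MC-conditioned ceilings additionally use the outcome token and need rollouts, which this sketch does not attempt.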
| ### Export to HuggingFace | |
| ```bash | |
| uv run python scripts/export_hf_repo.py --run-dir logs/run_YYYYMMDD_HHMMSS | |
| ``` | |
| Converts a training run to HuggingFace repo format (safetensors + metrics). Finds best checkpoint by val loss. | |
| ## Checkpoints | |
| Pre-trained weights are hosted on HuggingFace and loaded directly by repo ID: | |
| - `thomas-schweich/pawn-small` — 9.5M params, `CLMConfig.small()` | |
| - `thomas-schweich/pawn-base` — 35.8M params, `CLMConfig.base()` | |
| - `thomas-schweich/pawn-large` — 68.4M params, `CLMConfig.large()` | |
| All scripts accept HF repo IDs for `--checkpoint` (e.g. `--checkpoint thomas-schweich/pawn-base`). Weights are downloaded and cached automatically via `huggingface_hub`. | |
| ### Checkpoint Format (safetensors) | |
| Checkpoints are directories, not single files: | |
| ``` | |
| step_00065000/ | |
| ├── model.safetensors # model weights | |
| ├── optimizer.safetensors # flattened optimizer state | |
| ├── training_state.json # step, scheduler, scaler, RNG (base64) | |
| ├── config.json # model + training config | |
| └── .complete # SHA-256 hashes of all files (integrity sentinel) | |
| ``` | |
| Central module: `pawn/checkpoint.py`. All save/load goes through this module. | |
| Legacy `.pt` files are still loadable (backward compatible). | |
| ### Checkpoint Storage Modes | |
| All training scripts require one of: | |
| - `--hf-repo REPO_ID` — push checkpoints to a HuggingFace branch as they're written (durable) | |
| - `--local-checkpoints` — save locally only (for development without an HF account) | |
| HF mode creates a `run/{run_id}` branch. HF pushes happen in background threads (one per model slot) so training is not blocked by uploads. Squash-merge into main when satisfied. | |
| Optional: `--shm-checkpoints` writes checkpoints to `/dev/shm` (RAM-backed filesystem, instant writes). Requires `--hf-repo` since `/dev/shm` is volatile. Old checkpoints are cleaned up after successful HF push, keeping only the latest and the best (by val loss) for post-training evals. | |
| ### Data Integrity | |
| **Every checkpoint write is atomic**: files are written to a `.tmp` directory, then renamed. | |
| The `.complete` sentinel contains SHA-256 hashes of every file in the checkpoint. | |
| **Hashes are always verified on load — no exceptions.** | |
| - `IncompleteCheckpointError` — raised when `.complete` sentinel is missing | |
| - `CheckpointIntegrityError` — raised when any hash mismatches | |
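The write-then-rename and verify-on-load pattern can be sketched with the standard library alone. This is an illustration of the technique, not the code in `pawn/checkpoint.py`: the JSON sentinel format, the `save_atomic` and `verify` names, and the locally defined exception classes are assumptions.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


class IncompleteCheckpointError(Exception):
    """Stand-in for the error raised when .complete is missing."""


class CheckpointIntegrityError(Exception):
    """Stand-in for the error raised when a file hash mismatches."""


def _sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def save_atomic(final_dir: Path, files: dict[str, bytes]) -> None:
    """Write files into '<final_dir>.tmp', add a .complete sentinel, then rename."""
    tmp_dir = final_dir.with_name(final_dir.name + ".tmp")
    tmp_dir.mkdir(parents=True)
    hashes = {}
    for name, payload in files.items():
        (tmp_dir / name).write_bytes(payload)
        hashes[name] = _sha256(tmp_dir / name)
    # ASSUMPTION: sentinel stored as a JSON map of filename -> SHA-256 hex digest.
    (tmp_dir / ".complete").write_text(json.dumps(hashes))
    os.rename(tmp_dir, final_dir)  # atomic when both paths are on one filesystem


def verify(ckpt_dir: Path) -> None:
    """Raise if the sentinel is missing or any recorded hash does not match."""
    sentinel = ckpt_dir / ".complete"
    if not sentinel.exists():
        raise IncompleteCheckpointError(str(ckpt_dir))
    for name, expected in json.loads(sentinel.read_text()).items():
        if _sha256(ckpt_dir / name) != expected:
            raise CheckpointIntegrityError(name)


# Usage: round-trip in a scratch directory.
root = Path(tempfile.mkdtemp())
ckpt = root / "step_00000001"
save_atomic(ckpt, {"model.safetensors": b"weights"})
verify(ckpt)  # passes silently
```

Because the sentinel is written inside the temporary directory before the rename, a reader never sees a final directory that lacks `.complete`.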
| **Never use `kill -9` on training processes.** SIGTERM is handled gracefully: a flag is set, | |
| the training loop checks it between steps, saves a checkpoint, pushes to HF, and exits cleanly. | |
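The flag-and-poll shutdown described here looks roughly like the sketch below. It illustrates the pattern only; `ShutdownFlag`, `install_handler`, and `train` are hypothetical names, and the real trainer also pushes to HF before exiting.

```python
import signal


class ShutdownFlag:
    """Set by the SIGTERM handler; polled by the training loop between steps."""

    def __init__(self):
        self.requested = False

    def __call__(self, signum, frame):
        # Keep the handler trivial: record the request, do no I/O here.
        self.requested = True


def install_handler(flag: ShutdownFlag) -> None:
    """Register the flag as the SIGTERM handler (call from the main thread)."""
    signal.signal(signal.SIGTERM, flag)


def train(total_steps, flag, step_fn, save_fn):
    """Run up to total_steps; on a shutdown request, checkpoint and stop early."""
    for step in range(1, total_steps + 1):
        step_fn(step)
        if flag.requested:
            save_fn(step)   # save a consistent checkpoint at a step boundary
            return step
    save_fn(total_steps)
    return total_steps
```

`kill -9` sends SIGKILL, which cannot be caught, so the loop never gets the chance to reach `save_fn`; that is why plain `kill` (SIGTERM) is the only safe way to stop a run.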
| **Never rsync checkpoint files from running pods.** Checkpoints are pushed to HuggingFace | |
| from the trainer. Load via HF repo ID (e.g. `--checkpoint thomas-schweich/pawn-base`). | |
| ## RunPod Operations | |
| ### Docker Image | |
| A single Docker image (`thomasschweich/pawn:latest`) is **automatically built and pushed to Docker Hub by CI** on every merge to main. No manual builds needed. | |
| The image is based on `runpod/pytorch` (CUDA + SSH + Jupyter) with all Python deps pre-installed. Code lives at `/opt/pawn` on pods. SSH in and run experiments directly. | |
| To build locally (rarely needed): | |
| ```bash | |
| docker build --platform linux/amd64 \ | |
| --build-arg GIT_HASH=$(git rev-parse HEAD) \ | |
| -t thomasschweich/pawn:latest . | |
| ``` | |
| ### Pod Lifecycle | |
| Use `deploy/pod.sh` for all pod management. Requires `runpodctl` (`wget -qO- cli.runpod.net | sudo bash`). | |
| ```bash | |
| # Create a pod | |
| bash deploy/pod.sh create myexp --gpu h100 | |
| # SSH into it | |
| bash deploy/pod.sh ssh myexp | |
| # Launch training | |
| bash deploy/pod.sh launch myexp scripts/train_all.py --hf-repo thomas-schweich/pawn-{variant} | |
| # Stop (preserves volume, stops billing) | |
| bash deploy/pod.sh stop myexp | |
| # Delete (destroys everything) | |
| bash deploy/pod.sh delete myexp | |
| ``` | |
| GPU shortcuts: `a5000`, `a40`, `a6000`, `4090`, `5090`, `l40s`, `h100`. Pod configs are cached in `~/.config/pawn/pods/<name>.env`. | |
| ### GPU Selection | |
| Benchmarks from pretraining 3 models concurrently (`train_all.py`, batch=256): | |
| | GPU | VRAM | $/hr | Step time | 100K cost | Notes | | |
| |-----|------|------|-----------|-----------|-------| | |
| | B200 | 192GB | $4.99 | 0.28s | ~$39 | Fastest | | |
| | H200 SXM | 80GB | $3.59 | 0.34s | ~$34 | Best wall-clock/cost balance | | |
| | RTX PRO 6000 | 48GB | $1.89 | 0.62s | ~$33 | Cheapest viable | | |
| | A100 PCIe | 80GB | $1.39 | 0.79s | ~$30 | Cheapest overall | | |
| | L40S | 48GB | $0.86 | 1.37s | ~$33 | Slow but cheap | | |
| | RTX 5090/4090/3090 | 24-32GB | — | OOM | — | Insufficient VRAM for 3 models | | |
| Total cost is nearly constant ($30-39) across viable GPUs, so the real trade-off is wall-clock time against hourly rate rather than total spend. Single-model training fits on 24GB GPUs. | |
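The "100K cost" column follows directly from step time and hourly rate. A short sketch that recomputes it from the table's own numbers (the helper name `run_cost` is illustrative):

```python
def run_cost(step_time_s: float, dollars_per_hour: float, steps: int = 100_000) -> float:
    """Dollar cost of `steps` training steps at a fixed step time."""
    hours = step_time_s * steps / 3600
    return hours * dollars_per_hour


# (step time in seconds, $/hr), copied from the benchmark table.
GPUS = {
    "B200": (0.28, 4.99),
    "H200 SXM": (0.34, 3.59),
    "RTX PRO 6000": (0.62, 1.89),
    "A100 PCIe": (0.79, 1.39),
    "L40S": (1.37, 0.86),
}

COSTS = {name: run_cost(t, rate) for name, (t, rate) in GPUS.items()}
HOURS = {name: t * 100_000 / 3600 for name, (t, _) in GPUS.items()}
```

`HOURS` makes the actual trade-off visible: the spread in wall-clock time across these GPUs is far larger than the spread in dollars.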
| ### Required Pod Configuration | |
| - **Always attach a network volume.** Checkpoints are written to disk before the atomic rename and the HF push, and ephemeral container disk is lost on pod termination. | |
| - **Set `HF_TOKEN` as a pod environment variable** for automatic HuggingFace authentication. The entrypoint persists it to `~/.cache/huggingface/token`. | |
| - `PAWN_MODEL=thomas-schweich/pawn-base` — auto-pull a checkpoint on startup (runner target). | |
| - `PAWN_CMD` — training command to execute (alternative to Docker CMD args). | |
| ### Pod Safety | |
| - Stop pods with `runpodctl pod stop` or `bash deploy/pod.sh stop` — sends SIGTERM, trainer saves and pushes before exiting. | |
| - **Never `runpodctl pod delete` while training is running** — data loss risk. | |
| - **Never `kill -9` training processes** — use SIGTERM (plain `kill`), which triggers graceful shutdown. | |
| - **Never rsync checkpoint files from running pods** — load via HF repo ID instead. | |
| ### HuggingFace Bucket Backups | |
| Use HF buckets (`hf://buckets/...`) to back up experiment data from pods. Buckets are not datasets or repos — they use the `sync_bucket` API, not `upload_file` with `repo_type`. | |
| **Key constraint: upload bandwidth from pods is ~1.8 MB/s.** Checkpoints are ~430MB each (143MB model + 287MB optimizer). A full training run with 5K-step checkpoint intervals produces ~17GB per trial. Sync selectively. | |
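To see why selective syncing matters, the figures above can be turned into upload times. This is back-of-the-envelope arithmetic only; the 200K-step run length used to get the checkpoint count is an assumption taken from the ablation config later in this document.

```python
def upload_minutes(size_mb: float, bandwidth_mb_s: float = 1.8) -> float:
    """Minutes needed to upload size_mb at a sustained bandwidth."""
    return size_mb / bandwidth_mb_s / 60


CHECKPOINT_MB = 143 + 287             # model + optimizer
# ASSUMPTION: a 200K-step run checkpointed every 5K steps.
N_CHECKPOINTS = 200_000 // 5_000

per_checkpoint_min = upload_minutes(CHECKPOINT_MB)
full_run_hours = upload_minutes(CHECKPOINT_MB * N_CHECKPOINTS) / 60
total_gb = CHECKPOINT_MB * N_CHECKPOINTS / 1000
```

A single checkpoint is a matter of minutes, while every checkpoint from one trial is a matter of hours, which is the case for uploading only the best or final one.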
| **During training — sync metrics only (instant):** | |
| ```python | |
| from huggingface_hub import HfApi | |
| api = HfApi(token=HF_TOKEN) | |
| api.sync_bucket( | |
| source="/workspace/logs", | |
| dest="hf://buckets/OWNER/BUCKET/logs", | |
| exclude=["*/checkpoints/*"], | |
| ) | |
| ``` | |
| **For individual files (lab notes, transcripts) — stage in a temp dir:** | |
| ```python | |
| import tempfile, shutil, os | |
| with tempfile.TemporaryDirectory() as td: | |
| shutil.copy("/workspace/runs/lab-notes.md", os.path.join(td, "lab-notes.md")) | |
| api.sync_bucket(source=td, dest="hf://buckets/OWNER/BUCKET") | |
| ``` | |
| `upload_file(..., repo_type="bucket")` does **not** work — buckets are not a valid repo type for that API. Always use `sync_bucket`. | |
| **After training — sync only best/final checkpoints:** | |
| ```python | |
| api.sync_bucket( | |
| source="/workspace/logs/trial_XXXX/run_.../checkpoints/step_BEST", | |
| dest="hf://buckets/OWNER/BUCKET/checkpoints/trial_XXXX/best", | |
| ) | |
| ``` | |
| **HF_TOKEN**: Stored in `/opt/pawn/.env` on pods. Source it or `export` before calling the API. | |
| ## Monitoring Training Progress | |
| ### Key Principle: Write Scripts to Disk for Pre-Approval | |
| When setting up recurring monitoring, **always write the monitoring script to a file first** so the user can review and pre-approve it. This avoids repeated permission prompts when `/loop` fires. | |
| **Pattern:** | |
| 1. Write a bash script to disk (e.g., `scripts/check_my_run.sh`) | |
| 2. User reviews and approves the script | |
| 3. Schedule with `/loop 15m bash scripts/check_my_run.sh` | |
| **Example monitoring script:** | |
| ```bash | |
| #!/usr/bin/env bash | |
| # scripts/check_my_run.sh — monitor a specific training run | |
| set -euo pipefail | |
| bash /home/tas/pawn/scripts/monitor_training.sh <POD_ID> | |
| ``` | |
| Or for local-only monitoring: | |
| ```bash | |
| #!/usr/bin/env bash | |
| set -euo pipefail | |
| bash /home/tas/pawn/scripts/check_progress.sh --sync | |
| ``` | |
| ### Available Monitoring Tools | |
| | Tool | What it does | | |
| |------|-------------| | |
| | `scripts/monitor_training.sh [POD_ID]` | SSH to pod, sync metrics via rsync, show per-variant step/loss/acc/ETA, check HF checkpoint branches | | |
| | `scripts/check_progress.sh [LOG_DIR]` | Show progress from local `logs/` directory | | |
| | `python -m pawn.dashboard --log-dir logs` | Solara web dashboard with interactive charts | | |
| ### Dashboard | |
| ```bash | |
| python -m pawn.dashboard --log-dir logs | |
| ``` | |
| Reads `metrics.jsonl` files and has no dependency on training packages. Auto-detects run type from config fields. Shows loss curves, accuracy, LR schedules, GPU utilization, patience clocks, and adapter-specific diagnostics. Requires restart for code changes (no hot reload). | |
| ## Logs | |
| Training metrics are written to `logs/` (gitignored). Each run gets its own directory, named with a timestamp and a random slug (e.g., `run_20260325_140000_zesty-osprey/`), containing `metrics.jsonl`. | |
| `MetricsLogger` (`pawn/logging.py`) writes one JSON object per line. Every record includes timestamp, step, elapsed time, and memory stats. Config records include hostname, git hash, git tag, and run slug. | |
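Because the format is one JSON object per line, a metrics file can be read with the standard library and no training packages. A minimal sketch follows; the record fields in the sample (`type`, `slug`, `train_loss`) are made up for illustration and may not match what `MetricsLogger` actually emits.

```python
import io
import json


def read_metrics(stream):
    """Yield one dict per non-empty line of a metrics.jsonl stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def latest_with_key(records, key):
    """Return the last record that contains `key`, or None if there is none."""
    found = None
    for rec in records:
        if key in rec:
            found = rec
    return found


# Made-up sample records; field names are illustrative only.
sample = io.StringIO(
    '{"type": "config", "slug": "zesty-osprey"}\n'
    '{"type": "train", "step": 100, "train_loss": 4.2}\n'
    '\n'
    '{"type": "train", "step": 200, "train_loss": 3.9}\n'
)
last = latest_with_key(read_metrics(sample), "train_loss")
```

For a real run, replace the `StringIO` with `open(path)` on the run's `metrics.jsonl`.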
| ## Hyperparameter Sweeps | |
| Optuna integration via `pawn/sweep.py` and `scripts/sweep.py`: | |
| ```bash | |
| uv run python scripts/sweep.py \ | |
| --adapter lora --n-trials 30 --n-jobs 2 --n-gpus 2 \ | |
| --total-steps 20000 --pruner hyperband \ | |
| --checkpoint thomas-schweich/pawn-base --pgn thomas-schweich/pawn-lichess-full \ | |
| --local-checkpoints | |
| ``` | |
| Supports all adapter types + architecture search. GPU affinity assigns `CUDA_VISIBLE_DEVICES = trial.number % n_gpus`. SQLite-backed study persistence. Pruner options: `hyperband`, `median`, `none`. | |
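The GPU affinity rule is plain round-robin over trial numbers. A sketch of how a sweep runner might build the environment for each trial's subprocess follows; the function name is hypothetical and this is not the code in `pawn/sweep.py`.

```python
import os


def gpu_env_for_trial(trial_number: int, n_gpus: int, base_env=None) -> dict:
    """Return a copy of the environment pinned to one GPU by round-robin."""
    if n_gpus < 1:
        raise ValueError("n_gpus must be >= 1")
    env = dict(os.environ if base_env is None else base_env)
    # The child process sees only this device, exposed to it as cuda:0.
    env["CUDA_VISIBLE_DEVICES"] = str(trial_number % n_gpus)
    return env
```

The returned dict can be passed as `env=` to `subprocess.Popen`. With `--n-jobs` equal to `--n-gpus`, consecutive trial numbers land on different devices.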
| ## Key Patterns & Gotchas | |
| - **DataLoader workers must use `multiprocessing_context='spawn'`** — the Rust engine uses rayon, and fork after rayon init causes deadlocks. | |
| - **`SDPA_BACKEND` must be set before `torch.compile()`** — compiled code captures the backend at trace time. `apply_gpu_config()` handles this. | |
| - **ROCm works**: The only known ROCm issue is a stride mismatch in flash attention backward when combined with `torch.compile` + AMP. The workaround is `--sdpa-math` (use the MATH SDPA backend instead of flash), which `configure_gpu()` applies automatically on AMD GPUs. Everything else — training, eval, adapters, data loading — works identically on ROCm and CUDA. **Do not assume bugs are ROCm-specific.** Every other time something has failed on AMD it turned out to be a bug in our code (wrong torch version installed, stale lockfile, missing dependency, etc.), not a ROCm issue. | |
| - **Sparse logit projection**: `forward_hidden()` returns `(B,T,d_model)`, then only loss-masked positions project through `lm_head` — avoids full `(B,T,V)` materialization. | |
| - **Legal mask via Rust**: `LegalMaskBuilder` replays games in Rust, returns sparse indices (~2 MB) scattered into a pre-allocated GPU buffer (vs ~70 MB dense). | |
| - **GPU auto-detection**: `pawn.gpu.configure_gpu()` selects compile/AMP/SDPA settings. `apply_gpu_config()` applies them. NVIDIA uses flash attention + compile; AMD uses MATH SDPA + compile. Both paths are tested and production-validated. | |
| - **Factored embeddings**: each move token decomposes into `src_embed[s] + dst_embed[d] + promo_embed[p]`, reducing embedding parameters by ~32x. | |
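The ~32x figure in the last bullet can be reproduced with a parameter count. The sketch below compares a full per-token table against three small component tables. The promotion table size of 5 (no promotion plus four piece types) is an assumption, and the count ignores however PAD and outcome tokens are embedded.

```python
def embedding_params(d_model: int, vocab: int = 4278,
                     n_squares: int = 64, n_promo: int = 5):
    """Return (full_table_params, factored_params) for one embedding layer."""
    full = vocab * d_model
    # src table + dst table + promo table, each with d_model columns.
    factored = (n_squares + n_squares + n_promo) * d_model
    return full, factored


full, factored = embedding_params(d_model=512)   # d=512 is the base variant
ratio = full / factored
```

The ratio does not depend on `d_model`, since both counts scale linearly with it.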
| --- | |
| # Current task | |
| # Pretraining Ablations: mate-boost, no-outcome, discard-ply-limit | |
| You are running three pretraining ablation experiments on PAWN-Base. Each ablation modifies one aspect of the random game generation to understand its effect on the learned world model. | |
| You are on a RunPod with 3x RTX 6000 Ada (48GB VRAM each, $0.77/hr each = $2.31/hr total). Use the **pawn-lab MCP server** and the **manage-pod skill** workflow. | |
| **As your first action**, set up the check-in crons per the manage-pod skill, then proceed. | |
| ## The Three Ablations | |
| | Ablation | Config | Hypothesis | | |
| |----------|--------|------------| | |
| | **mate-boost** | `mate_boost: 1.0` | Always taking mate-in-1 produces shorter, more decisive games (~134 avg ply vs ~238). More checkmate patterns, fewer aimless endgames. | | |
| | **no-outcome** | `no_outcome_token: true` | Stripping the outcome token forces the model to infer game result from moves alone. Tests whether outcome conditioning helps or hurts. | | |
| | **discard-ply-limit** | `discard_ply_limit: true` | Only naturally-ended games (no truncation at 255 plies). All games have meaningful endings. | | |
| There is no separate baseline run: the existing PAWN-Base checkpoint (`thomas-schweich/pawn-base`), trained for 100K steps, is the baseline reference. | |
| ## Unified Config | |
| All training uses the unified `scripts/train.py` with Pydantic `RunConfig`. Call `lab_schema` to see all available fields. Configs are JSON dicts passed to `lab_launch`. | |
| **Base config shared by all ablations:** | |
| ```json | |
| { | |
| "run_type": "pretrain", | |
| "variant": "base", | |
| "total_steps": 200000, | |
| "batch_size": 256, | |
| "lr": 3e-4, | |
| "local_checkpoints": true, | |
| "amp_dtype": "bfloat16", | |
| "eval_interval": 5000, | |
| "num_workers": 4 | |
| } | |
| ``` | |
| Each ablation adds its flag on top. Example for mate-boost: | |
| ```json | |
| { | |
| "run_type": "pretrain", | |
| "variant": "base", | |
| "mate_boost": 1.0, | |
| "total_steps": 200000, | |
| "batch_size": 256, | |
| "lr": 3e-4, | |
| "local_checkpoints": true, | |
| "amp_dtype": "bfloat16", | |
| "eval_interval": 5000, | |
| "num_workers": 4 | |
| } | |
| ``` | |
| ## Procedure | |
| ### Phase 1: Learning Rate Exploration (~1 hour) | |
| For each of the 3 ablations, launch 3 trials with different learning rates using `pause_after_steps`: | |
| ``` | |
| lab_launch(config={ | |
| "run_type": "pretrain", | |
| "variant": "base", | |
| "mate_boost": 1.0, # ← ablation flag | |
| "total_steps": 200000, | |
| "pause_after_steps": 5000, # ← pause for comparison | |
| "batch_size": 256, | |
| "lr": 1e-4, # ← vary this: 1e-4, 3e-4, 1e-3 | |
| "local_checkpoints": true, | |
| "amp_dtype": "bfloat16", | |
| "no_compile": true, # ← skip compile for short exploration | |
| "eval_interval": 2500, | |
| "num_workers": 4 | |
| }) | |
| ``` | |
| That's 9 trials total (3 LRs × 3 ablations). With 3 GPUs, run 3 trials concurrently (one per GPU). Run each ablation's 3 LRs sequentially on its GPU, or interleave — the lab runner assigns GPUs automatically. Each pauses at 5K steps. Use `no_compile: true` to skip the 15-30 min compile overhead for these short runs. | |
| After all 9 pause, call `lab_results` to compare val_loss at 5K steps. Pick the best LR for each ablation. | |
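The selection step is a per-ablation argmin over val_loss. A sketch, assuming the results have already been collected into a list of dicts with `ablation`, `lr`, and `val_loss` keys; that shape is hypothetical and is not the documented return format of `lab_results`.

```python
import math


def best_lr_per_ablation(results):
    """Pick the LR with the lowest val_loss for each ablation, skipping NaN runs."""
    best = {}
    for r in results:
        loss = r["val_loss"]
        if loss is None or math.isnan(loss):
            continue  # diverged trials never win
        current = best.get(r["ablation"])
        if current is None or loss < current["val_loss"]:
            best[r["ablation"]] = r
    return {name: r["lr"] for name, r in best.items()}
```

An ablation whose three trials all diverged is simply absent from the result, which is a signal to rerun it at a lower LR rather than to pick one.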
| ### Phase 2: Full Training (~22 hours) | |
| Resume the best LR for each ablation to 200K steps with torch.compile enabled: | |
| ``` | |
| lab_resume(trial_id=BEST_MATE_BOOST) # no pause → runs to total_steps | |
| lab_resume(trial_id=BEST_NO_OUTCOME) | |
| lab_resume(trial_id=BEST_DISCARD_PLY) | |
| ``` | |
| `lab_resume` clears `pause_after_steps` by default, so the resumed trials run to completion at `total_steps=200000`. The resumed trials pick up from the 5K-step checkpoint with optimizer state intact. | |
| **Important:** Do NOT pass `no_compile: true` for the resumed runs. Let torch.compile run — the 15-30 min compile overhead amortizes over 195K remaining steps. | |
| With 1 model per GPU (no sharing), expect ~0.4s/step → ~22 hours for 195K steps. All 3 run in parallel. Total cost: ~$51. | |
| ### Phase 3: Evaluation | |
| After all 3 runs converge, compare: | |
| - Final val_loss and accuracy vs baseline (PAWN-Base at 100K steps) | |
| - Learning curves (loss vs step for all 3, plotted together) | |
| - Per-ply accuracy via `scripts/eval_accuracy.py` if time permits | |
| ## Monitoring | |
| Use the manage-pod skill check-in pattern: | |
| - **Mini check-in (every 5 min):** `lab_events` — launch next trial if GPU idle | |
| - **Full check-in (hourly):** `lab_status` + `lab_log` for each running trial + update lab notes | |
| During Phase 1 (9 short trials), use 5-min check-ins — trials finish fast. | |
| During Phase 2 (3 long runs), switch to hourly after confirming stability. | |
| ### What to watch for | |
| - **NaN loss:** Kill immediately. Try lr/3. Use `lab_log` to check. | |
| - **discard-ply-limit throughput:** This ablation discards ~60% of generated games. If step times are much slower than the other two, reduce `num_workers` to free CPU for data generation. | |
| - **mate-boost game lengths:** Check `lab_log` — mate-boost games are ~134 plies vs ~238 normal. Each batch processes more games, which may cause higher memory usage. If OOM, reduce batch_size to 128. | |
| - **GPU assignment:** The lab runner assigns one trial per GPU via CUDA_VISIBLE_DEVICES. If a trial fails and a GPU sits idle, launch a replacement promptly. | |
| ## Cost Budget | |
| | Phase | Trials | Time | Cost | | |
| |-------|--------|------|------| | |
| | LR exploration | 9 × 5K steps | ~1h | ~$3 | | |
| | Full training | 3 × 195K steps (parallel) | ~22h | ~$51 | | |
| | **Total** | | **~23h** | **~$53** | | |
| Set cost tracking: `lab_set_cost(cost_per_hour=2.31)` | |
| ## Lab Notes | |
| Write experiment state to `runs/lab-notes.md` (survives context compaction via PostCompact hook). Include: | |
| - Phase 1 results table (LR × ablation → val_loss at 5K steps) | |
| - Which LR was selected for each ablation and why | |
| - Phase 2 progress (step, loss, ETA per ablation) | |
| - Any anomalies or adjustments made | |
| ## Current State (as of 2026-04-03 19:30 UTC) | |
| ### Phase 2: Full Training (RUNNING) | |
| Three ablation runs training to 200K steps, torch.compile ON, resumed from 5K: | |
| | Trial | GPU | Ablation | LR | Step (~19:00) | train_loss | acc | | |
| |-------|-----|----------|-----|---------------|-----------|------| | |
| | 24 | 0 | mate-boost | 1e-3 | ~120K | 3.19 | 6.8% | | |
| | 25 | 1 | no-outcome | 3e-4 (no warmup) | ~116K | 3.09 | 7.0% | | |
| | 26 | 2 | discard-ply-limit | 1e-3 | ~119K | 3.18 | 7.5% | | |
| - ETA: ~2026-04-04 04:30 UTC (~9h remaining) | |
| - Hourly cron monitors via `lab_events` + `lab_log` | |
| - Baseline ref: PAWN-Base val_loss ~3.06, acc ~6.9% at 100K steps | |
| ### Concurrent: Accuracy Ceiling Computation (RUNNING) | |
| Running `scripts/compute_theoretical_ceiling.py` on spare CPUs (RAYON_NUM_THREADS=30) while GPUs train: | |
| ```bash | |
| RAYON_NUM_THREADS=30 PAWN_ALLOW_CPU=1 /opt/pawn/.venv/bin/python \ | |
| scripts/compute_theoretical_ceiling.py \ | |
| --n-games 5000 --rollouts 1024 --sample-rate 0.075 \ | |
| --model-accuracy 0.069 --output /workspace/data/theoretical_ceiling_1024.json | |
| ``` | |
| - 1024 rollouts/move (8x previous), expected to narrow bias bracket from 0.66pp to ~0.23pp | |
| - Processes in 10 batches of 500 games, printing intermediate estimates | |
| - ETA: ~3-4 hours (finishes well before training) | |
| - Output: `/workspace/data/theoretical_ceiling_1024.json` | |
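The expected narrowing from 0.66pp to ~0.23pp is consistent with the bracket width shrinking as one over the square root of the rollout count. That scaling law is an inference from the two numbers quoted above, not something this document states, and the previous rollout count of 128 is derived from "8x previous".

```python
import math


def scaled_bracket(width_pp: float, old_rollouts: int, new_rollouts: int) -> float:
    """Scale a Monte Carlo error width assuming it shrinks as 1/sqrt(rollouts)."""
    return width_pp * math.sqrt(old_rollouts / new_rollouts)


# ASSUMPTION: previous run used 1024 / 8 = 128 rollouts per move.
expected_pp = scaled_bracket(0.66, old_rollouts=128, new_rollouts=1024)
```

If the measured bracket at 1024 rollouts is much wider than this, the residual bias is probably not dominated by rollout noise.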
| ### Bug Fixes Applied This Session | |
| - **`pause_after_steps`**: Was missing from `trainer.py` (only in adapter training). Added to `TrainingConfig` and `trainer.py` train loop. Wired through `scripts/train.py`. | |
| - **`lab_resume` checkpoint discovery**: Was looking for `checkpoints/best` or `checkpoints/final` only. Pretraining uses `step_XXXXXXXX` naming. Added fallback to pick highest `step_*` dir. | |
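The checkpoint discovery fallback can be sketched as follows. This is an illustration rather than the patched `lab_resume` code: the function name is hypothetical, and trying `best` before `final` is an assumed preference order.

```python
import re
import tempfile
from pathlib import Path

_STEP_RE = re.compile(r"^step_(\d+)$")


def find_resume_checkpoint(ckpt_root: Path):
    """Prefer 'best', then 'final', then the highest-numbered step_* directory."""
    for name in ("best", "final"):
        if (ckpt_root / name).is_dir():
            return ckpt_root / name
    steps = []
    for child in ckpt_root.iterdir():
        match = _STEP_RE.match(child.name)
        if match and child.is_dir():
            steps.append((int(match.group(1)), child))
    if not steps:
        return None
    return max(steps, key=lambda item: item[0])[1]


# Usage with a scratch layout that mimics a pretraining run.
root = Path(tempfile.mkdtemp())
for name in ("step_00005000", "step_00010000", "step_00010000.tmp"):
    (root / name).mkdir()
found = find_resume_checkpoint(root)
```

Anchoring the pattern at both ends means an in-progress `.tmp` directory is never picked up as a resume target.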
| ### HF Bucket Backup | |
| All experiment data syncs to `hf://buckets/thomas-schweich/pretraining-ablations`: | |
| - Metrics/configs synced every ~4 hours (excluding checkpoints) | |
| - 100K-step checkpoints for all 3 ablations uploaded | |
| - Lab notes + chat transcript synced periodically | |
| - HF_TOKEN in `/opt/pawn/.env` | |
| - See "HuggingFace Bucket Backups" section above for API patterns | |
| ### Lab Notes | |
| Detailed log at `/workspace/runs/lab-notes.md` (symlinked from `/opt/pawn/runs/lab-notes.md`). Survives context compaction via PostCompact hook. Read this first when resuming. | |
| ## Post-Training Procedure | |
| When each training run completes (in any order): | |
| 1. **Upload best checkpoint** to HF bucket in the background (see bucket sync patterns above) | |
| 2. **Run linear probes** on the best checkpoint in the background: | |
| ```bash | |
| PAWN_ALLOW_CPU=1 /opt/pawn/.venv/bin/python scripts/eval_probes.py \ | |
| --checkpoint /workspace/logs/trial_XXXX/run_.../checkpoints/step_BEST \ | |
| --log-dir /workspace/eval_probes/trial_XXXX --device cuda:N | |
| ``` | |
| Use the GPU that just freed up. Run in background so other completions aren't blocked. | |
| 3. **After the last probe run is kicked off**, draft the final report comparing all 3 ablations + baseline: | |
| - val_loss and accuracy curves | |
| - Probe results (piece type, check, castling, material, game phase) | |
| - Accuracy ceiling comparison | |
| - Key findings and surprises | |
| 4. **Update the report** with probe results as they finish | |
| 5. **Upload everything** to HF bucket: final checkpoints, probe results, report, lab notes | |
| ## Success Criteria | |
| 1. All three ablations train to 200K steps without divergence | |
| 2. Clear comparison of val_loss and accuracy across ablations | |
| 3. At least one ablation shows meaningful difference from baseline (better or worse — both are informative) | |
| -- | |