Safetensors migration, checkpoint integrity, and multi-model training. (#1)
- .dockerignore +1 -0
- .gitignore +9 -2
- CLAUDE.md +78 -7
- Dockerfile +28 -8
- README.md +10 -4
- checkpoints/pawn-base +1 -1
- checkpoints/pawn-large +1 -1
- checkpoints/pawn-small +1 -1
- deploy/entrypoint-run.sh +19 -0
- deploy/sync.sh +21 -50
- engine/python/chess_engine/__init__.py +2 -0
- engine/src/batch.rs +190 -22
- engine/src/board.rs +1 -1
- engine/src/lib.rs +41 -0
- engine/src/pgn.rs +0 -6
- engine/src/vocab.rs +59 -8
- pawn/checkpoint.py +647 -0
- pawn/config.py +1 -1
- pawn/data.py +47 -28
- pawn/eval_suite/diagnostics.py +1 -1
- pawn/eval_suite/lichess.py +7 -8
- pawn/eval_suite/probes.py +4 -10
- pawn/eval_suite/worker.py +5 -5
- pawn/lichess_data.py +5 -22
- pawn/model.py +1 -1
- pawn/trainer.py +52 -37
- pyproject.toml +1 -0
- scripts/check_progress.sh +14 -34
- scripts/eval_accuracy.py +34 -24
- scripts/eval_probes.py +9 -12
- scripts/export_hf_repo.py +283 -0
- scripts/profile_step.py +7 -6
- scripts/train.py +7 -1
- scripts/train_all.py +400 -0
- scripts/train_bottleneck.py +73 -35
- scripts/train_film.py +59 -20
- scripts/train_hybrid.py +59 -20
- scripts/train_lora.py +59 -20
- scripts/train_sparse.py +59 -20
- scripts/train_tiny.py +45 -12
- tests/test_checkpoint.py +359 -0
- tests/test_clm_format.py +224 -0
- tests/test_graceful_shutdown.py +130 -0
- uv.lock +25 -334
.dockerignore
CHANGED

@@ -10,6 +10,7 @@ data/
 checkpoints/
 logs/
 deploy/
+!deploy/entrypoint-run.sh
 *.so
 CLAUDE.md
 docs/
.gitignore
CHANGED

@@ -1,5 +1,4 @@
-#
-checkpoints/
+# Local checkpoints and logs
 logs/
 
 # Training data
@@ -7,6 +6,7 @@ data/
 
 # Deployment artifacts
 deploy/pawn-deploy/
+export/
 
 # Python
 __pycache__/
@@ -36,8 +36,15 @@ Thumbs.db
 # uv (track root lockfile only)
 engine/uv.lock
 
+# Local working files
+local/
+
 # Misc
 runpodctl.tar.gz
 .claude/
+<<<<<<< HEAD
+.playwright-mcp
+=======
 .playwright-mcp/
 local/
+>>>>>>> c965110657b87bef82768cf9960b3f5486c54867
CLAUDE.md
CHANGED

@@ -41,7 +41,7 @@ uv sync --extra dev  # + pytest, ipykernel
 uv run pytest tests/
 
 # Pretrain from scratch
-uv run python scripts/train.py --variant base
+uv run python scripts/train.py --variant base --local-checkpoints
 ```
 
 ## Engine (`engine/`)
@@ -86,18 +86,24 @@ All adapters freeze the backbone and initialize to identity (zero-init or gamma=
 ## Scripts (`scripts/`)
 
 - `train.py` -- Pretrain from scratch (`--variant small|base|large|toy`)
+- `train_all.py` -- Train small/base/large simultaneously on shared data batches
 - `train_bottleneck.py`, `train_film.py`, `train_lora.py`, `train_sparse.py`, `train_hybrid.py` -- Adapter behavioral cloning on Lichess PGN
 - `train_tiny.py` -- Standalone tiny transformer baseline (no frozen backbone)
 - `eval_accuracy.py` -- MAIA-compatible evaluation (per-phase, per-ply accuracy)
+- `eval_probes.py` -- Run linear probes on all checkpoints
+- `export_hf_repo.py` -- Convert training run to HuggingFace repo format (safetensors + metrics)
+
+All training scripts require `--hf-repo REPO` or `--local-checkpoints`.
 
 ## Deploy (`deploy/`)
 
-
-- `
-- `
-- `
+Docker-based deployment to Runpod GPU VMs:
+- `Dockerfile` -- Multi-target build: `interactive` (SSH+Jupyter, default) and `runner` (auto-stop)
+- `entrypoint-run.sh` -- Runner entrypoint, pulls from HF via `PAWN_MODEL` env var
+- `sync.sh` -- Pull latest checkpoints/metrics from HuggingFace submodules
+- `pod.sh` -- Pod lifecycle (create/start/stop/delete/ssh)
 
-
+Code lives at `/opt/pawn` on pods (outside the `/workspace` volume mount).
 
 ## Dashboard (`pawn/dashboard/`)
 
@@ -117,8 +123,73 @@ Auto-detects run type from config fields (`run_type`, `formulation`, `pgn_file`)
 
 ## Checkpoints
 
-
+Pre-trained weights are HuggingFace git submodules under `checkpoints/`:
+- `checkpoints/pawn-small` — 9.5M params, `CLMConfig.small()`
+- `checkpoints/pawn-base` — 35.8M params, `CLMConfig.base()`
+- `checkpoints/pawn-large` — 68.4M params, `CLMConfig.large()`
+
+Pull with: `git submodule update --init --remote checkpoints/pawn-base`
+
+### Checkpoint Format (safetensors)
+
+Checkpoints are directories, not single files:
+```
+step_00065000/
+├── model.safetensors      # model weights
+├── optimizer.safetensors  # flattened optimizer state
+├── training_state.json    # step, scheduler, scaler, RNG (base64)
+├── config.json            # model + training config
+└── .complete              # SHA-256 hashes of all files (integrity sentinel)
+```
+
+Central module: `pawn/checkpoint.py`. All save/load goes through this module.
+Legacy `.pt` files are still loadable (backward compatible).
+
+### Checkpoint Storage Modes
+
+All training scripts require one of:
+- `--hf-repo REPO_ID` — push checkpoints to a HuggingFace branch as they're written (durable)
+- `--local-checkpoints` — save locally only (for development without an HF account)
+
+HF mode creates a `run/{run_id}` branch. Squash-merge into main when satisfied.
+
+### Data Integrity
+
+**Every checkpoint write is atomic**: files are written to a `.tmp` directory, then renamed.
+The `.complete` sentinel contains SHA-256 hashes of every file in the checkpoint.
+**Hashes are always verified on load — no exceptions.**
+
+- `IncompleteCheckpointError` — raised when `.complete` sentinel is missing
+- `CheckpointIntegrityError` — raised when any hash mismatches
+
+**Never use `kill -9` on training processes.** SIGTERM is handled gracefully: a flag is set,
+the training loop checks it between steps, saves a checkpoint, pushes to HF, and exits cleanly.
+
+**Never rsync checkpoint files from running pods.** Checkpoints are pushed to HuggingFace
+from the trainer. Pull via `deploy/sync.sh` (submodule update).
 
 ## Logs
 
 Training metrics in `logs/` (gitignored). Each run gets a timestamped directory with `metrics.jsonl`.
+
+## Runpod Pod Management
+
+### Setup
+
+- Docker image: multi-target build in `Dockerfile`
+  - `interactive` (default) — SSH + Jupyter, stays alive
+  - `runner` — executes command then exits (pod auto-stops)
+- Build: `docker build --target runner --build-arg GIT_HASH=$(git rev-parse HEAD) ...`
+
+### Required Configuration
+
+- **Always attach a network volume.** Checkpoints write to disk during atomic rename and HF push. Ephemeral container disk is lost on pod termination.
+- **Set `HF_TOKEN` as a pod environment variable** for automatic HuggingFace authentication.
+- Set `PAWN_MODEL=thomas-schweich/pawn-base` env var in the runner to auto-pull a checkpoint on startup.
+
+### Lifecycle
+
+- Create: `runpodctl pod create --name pawn-exp --gpu-id "NVIDIA RTX A5000" --image thomasschweich/pawn:<tag> --volume-in-gb 75 --ports "8888/http,22/tcp"`
+- Stop: `runpodctl pod stop <ID>` — sends SIGTERM → trainer saves and pushes before exiting
+- **Never `runpodctl pod delete` while training is running** — data loss risk
+- Monitor: pull HF submodule (`deploy/sync.sh`) and read `metrics.jsonl`
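The verification contract described in the Data Integrity notes above (a `.complete` sentinel of SHA-256 hashes, always checked on load) can be sketched in a few lines. This is not `pawn/checkpoint.py` itself, which does not appear in this diff; it assumes the sentinel is a JSON map of filename to hex digest, and it stands in for the named exception types with builtins:

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 (checkpoint files can be large)."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_checkpoint(ckpt_dir: str) -> None:
    """Raise if the .complete sentinel is missing or any hash mismatches."""
    ckpt = Path(ckpt_dir)
    sentinel = ckpt / ".complete"
    if not sentinel.exists():
        # stands in for IncompleteCheckpointError
        raise FileNotFoundError(f"incomplete checkpoint: {ckpt}")
    expected = json.loads(sentinel.read_text())  # assumed: {filename: hex digest}
    for name, digest in expected.items():
        if sha256_of(ckpt / name) != digest:
            # stands in for CheckpointIntegrityError
            raise ValueError(f"hash mismatch for {name} in {ckpt}")
```

Because the sentinel is written last inside the atomic rename, its mere presence already implies the directory was fully written; the hash check additionally catches bit-level corruption.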
Dockerfile
CHANGED

@@ -3,17 +3,27 @@
 # Extends the official Runpod PyTorch template — SSH and JupyterLab
 # start automatically via the base image's /start.sh entrypoint.
 #
+# Build targets:
+#   interactive (default) — SSH + Jupyter, stays alive
+#   runner — runs a command then exits (pod auto-stops)
+#
 # Build:
 #   docker build --platform linux/amd64 \
 #     --build-arg GIT_HASH=$(git rev-parse HEAD) \
 #     --build-arg GIT_TAG=$(git tag --points-at HEAD) \
+#     [--target runner] \
 #     -t pawn:<tag> .
 #
-#
-#
-#
-#
-#
+# Run (interactive):
+#   docker run --gpus all pawn:<tag>
+#
+# IMPORTANT: Always attach a Runpod network volume. Checkpoints use
+# atomic directory writes (tmp + rename) that require persistent disk.
+# Set HF_TOKEN as a pod env var for HuggingFace checkpoint push.
+#
+# Run (auto-stop):
+#   docker run --gpus all -e PAWN_MODEL=thomas-schweich/pawn-base \
+#     pawn:<tag>-runner python scripts/train.py --variant base
 
 # ── Builder ──────────────────────────────────────────────────────────
 FROM runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404 AS builder
@@ -43,14 +53,14 @@ RUN cd engine && \
     uv run --no-project --with maturin maturin build --release && \
     cd ..
 
-# ── Runtime ─────────────────────────────
-FROM runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404
+# ── Runtime base (shared by all targets) ─────────────────────────────
+FROM runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404 AS runtime-base
 
 ENV PYTHONUNBUFFERED=1 \
     PYTHONPATH=/opt/pawn
 
 # Direct deps only (torch + numpy already in base image)
-RUN pip install --no-cache-dir psutil tqdm wandb
+RUN pip install --no-cache-dir psutil safetensors tqdm wandb huggingface-hub
 
 COPY --from=builder /workspace/pawn/engine/target/wheels/*.whl /tmp/
 RUN pip install --no-cache-dir /tmp/*.whl && rm -rf /tmp/*.whl
@@ -72,3 +82,13 @@ RUN echo "export PYTHONPATH=/opt/pawn" >> /etc/environment && \
     echo "export PAWN_GIT_HASH=${GIT_HASH}" >> /etc/environment && \
     echo "export PAWN_GIT_TAG=${GIT_TAG}" >> /etc/environment && \
     cat /etc/environment >> /root/.bashrc
+
+# ── Runner — executes command then exits (pod auto-stops) ────────────
+FROM runtime-base AS runner
+COPY deploy/entrypoint-run.sh /entrypoint-run.sh
+RUN chmod +x /entrypoint-run.sh
+ENTRYPOINT ["/entrypoint-run.sh"]
+
+# ── Interactive (default) — SSH + Jupyter, stays alive ───────────────
+FROM runtime-base AS interactive
+# Inherits /start.sh entrypoint from Runpod base image
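The "atomic directory writes (tmp + rename)" requirement called out in the Dockerfile comments follows a standard pattern: write everything to a temp directory on the same filesystem, fsync, then rename. A minimal sketch with a hypothetical `atomic_save` helper, not the project's actual implementation in `pawn/checkpoint.py`:

```python
import os
import tempfile


def atomic_save(ckpt_root: str, step: int, files: dict) -> str:
    """Write all files to a temp dir on the SAME filesystem, then rename.

    os.rename is atomic only within one filesystem, which is why the pod
    needs a persistent network volume mounted at the checkpoint root
    rather than ephemeral container disk.

    files: {filename: bytes}
    """
    final = os.path.join(ckpt_root, f"step_{step:08d}")
    tmp = tempfile.mkdtemp(dir=ckpt_root, prefix=".tmp-")
    for name, data in files.items():
        with open(os.path.join(tmp, name), "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes hit disk before the rename
    os.rename(tmp, final)  # atomic: readers see nothing or everything
    return final
```

A reader that lists `ckpt_root` either sees the fully populated `step_*` directory or does not see it at all; half-written checkpoints only ever exist under a `.tmp-*` name.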
README.md
CHANGED

@@ -32,7 +32,9 @@ Notably, the vocabulary includes impossible moves like `a1a1` and `b1a5`. PAWN n
 
 Conceptually, each token is best thought of as a move in UCI notation--they are effectively coordinates. They do not include any information on the type of piece, side to play, or any direct geometric or board state information other than the factored nature of the embeddings (see the architecture section below for details).
 
 For example, `e2e4` is the token that represents the king's pawn opening, but only when it's the first ply in the sequence (moving a rook from e2 to e4 in the late game would use the same token). The model learns to track which type of piece is on each square at any given moment entirely of its own accord.
+
+For that matter, it isn't told what piece types exist, what movement patterns they follow, or indeed the concept of a piece. All of that 'understanding' comes purely from observation and can be isolated via [linear probes](https://arxiv.org/abs/1610.01644) (Alain & Bengio, 2016).
 
 ## Quickstart
 
@@ -50,13 +52,17 @@ uv sync --extra cu128  # NVIDIA GPU (or --extra rocm for AMD)
 uv sync --extra dev --extra eval --extra dashboard
 
 # Train an adapter on a pre-trained checkpoint
+git submodule update --init checkpoints/pawn-base
 uv run python scripts/train_bottleneck.py \
-    --checkpoint checkpoints/pawn-base
+    --checkpoint checkpoints/pawn-base \
     --pgn data/lichess_1800_1900.pgn \
-    --bottleneck-dim 32 --lr 1e-4
+    --bottleneck-dim 32 --lr 1e-4 --local-checkpoints
 
 # Or pretrain a PAWN variant from scratch (generates random games on-the-fly; no dataset required)
-uv run python scripts/train.py --variant base
+uv run python scripts/train.py --variant base --local-checkpoints
+
+# Or train all three variants simultaneously on shared data
+uv run python scripts/train_all.py --local-checkpoints
 
 # Launch the real-time monitoring dashboard (optional dashboard dependency must be installed)
 uv run python -m pawn.dashboard --log-dir logs --port 8765
checkpoints/pawn-base
CHANGED

@@ -1 +1 @@
-Subproject commit
+Subproject commit 45aa6e347ca9516662874238adaeef4d30fc6df8

checkpoints/pawn-large
CHANGED

@@ -1 +1 @@
-Subproject commit
+Subproject commit 1d3f1deee3411e86d69a814520553fcf78f96c5f

checkpoints/pawn-small
CHANGED

@@ -1 +1 @@
-Subproject commit
+Subproject commit 929fee43ce16e02cf12a255cf39cc3f26e0913e1
deploy/entrypoint-run.sh
ADDED

@@ -0,0 +1,19 @@
+#!/bin/bash
+# Entrypoint for auto-stop runner target.
+# Optionally pulls a checkpoint from HuggingFace before running the command.
+set -e
+
+cd /opt/pawn
+export PYTHONPATH=/opt/pawn
+
+# Pull checkpoint from HuggingFace if PAWN_MODEL is set
+if [ -n "$PAWN_MODEL" ]; then
+  echo "Pulling model: $PAWN_MODEL"
+  python3 -c "
+from huggingface_hub import snapshot_download
+snapshot_download('$PAWN_MODEL', local_dir='checkpoints/model')
+print('Model downloaded to checkpoints/model/')
+"
+fi
+
+exec "$@"
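Because the entrypoint ends with `exec "$@"`, the training process replaces the shell and receives SIGTERM from `runpodctl pod stop` directly. A minimal sketch of the graceful-shutdown pattern the trainer is described as using (the handler only sets a flag; the loop checks it between steps, so no checkpoint write is ever interrupted). The class and function names here are illustrative, not the project's:

```python
import signal


class GracefulShutdown:
    """SIGTERM handler that defers shutdown to a safe point in the loop."""

    def __init__(self):
        self.requested = False
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame):
        # Only set a flag; never save or exit from inside a signal handler.
        self.requested = True


def train(n_steps, shutdown, save_fn):
    """Run up to n_steps, checkpointing and returning early on SIGTERM."""
    for step in range(n_steps):
        # ... forward/backward/optimizer step would go here ...
        if shutdown.requested:
            save_fn(step)  # final checkpoint (and HF push) before exit
            return step
    save_fn(n_steps)
    return n_steps
```

This is also why `kill -9` is forbidden: SIGKILL cannot be handled, so the flag is never set and the final checkpoint is never written.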
deploy/sync.sh
CHANGED

@@ -1,65 +1,36 @@
 #!/usr/bin/env bash
-#
-# Usage: bash deploy/sync.sh [
+# Pull latest checkpoints and metrics from HuggingFace submodules.
+# Usage: bash deploy/sync.sh [submodule-name]
 #
-# With no args,
-# With a
+# With no args, pulls all checkpoint submodules.
+# With a name, pulls only that submodule.
 set -euo pipefail
 
 REPO="$(cd "$(dirname "$0")/.." && pwd)"
-POD_DIR="$HOME/.config/pawn/pods"
 
-
-  local
-  local
-
-  echo "
-
-  echo "
-  rsync -avz --progress --no-owner --no-group \
-    -e "ssh $ssh_opts" \
-    "root@$host:$remote_root/logs/" "$REPO/logs/" || {
-    echo "  Failed to sync logs from $name"
-    return 1
-  }
-
-  echo "--- Checkpoints (best.pt only) ---"
-  rsync -avz --progress --no-owner --no-group \
-    --include='*/' --include='best.pt' --include='*.pt' --exclude='*' \
-    -e "ssh $ssh_opts" \
-    "root@$host:$remote_root/logs/" "$REPO/logs/" || {
-    echo "  Failed to sync checkpoints from $name"
-    return 1
-  }
-
-  echo "=== $name sync complete ==="
+pull_submodule() {
+  local sub="$1"
+  local name="$(basename "$sub")"
+  echo "=== Pulling $name ==="
+  git -C "$sub" fetch origin 2>/dev/null || { echo "  Failed to fetch $name"; return 1; }
+  git -C "$sub" pull origin main 2>/dev/null || { echo "  Failed to pull $name (main)"; return 1; }
+  echo "=== $name up to date ==="
   echo ""
 }
 
-if [ ! -d "$POD_DIR" ] || [ -z "$(ls "$POD_DIR"/*.env 2>/dev/null)" ]; then
-  echo "No pods configured. Add .env files to $POD_DIR/"
-  exit 1
-fi
-
 if [ $# -ge 1 ]; then
+  sub="$REPO/checkpoints/$1"
+  if [ ! -d "$sub/.git" ]; then
+    echo "Submodule '$1' not found. Available:"
+    for s in "$REPO"/checkpoints/pawn-*/; do
+      [ -d "$s/.git" ] && echo "  $(basename "$s")"
+    done
     exit 1
   fi
-
-  sync_pod "$1" "$POD_HOST" "$POD_PORT" "${POD_REMOTE_ROOT:-/opt/pawn}"
+  pull_submodule "$sub"
 else
-
-    unset POD_HOST POD_PORT POD_REMOTE_ROOT
-    source "$pod_file"
-    sync_pod "$pod_name" "$POD_HOST" "$POD_PORT" "${POD_REMOTE_ROOT:-/opt/pawn}" || true
+  for sub in "$REPO"/checkpoints/pawn-*/; do
+    [ -d "$sub/.git" ] || continue
+    pull_submodule "$sub" || true
   done
 fi
-
-echo "Local logs:"
-du -sh "$REPO/logs/"
engine/python/chess_engine/__init__.py
CHANGED

@@ -8,6 +8,7 @@ from chess_engine._engine import (
     # Core game generation
     generate_training_batch,
     generate_random_games,
+    generate_clm_batch,
     generate_checkmate_games,
     generate_checkmate_training_batch,
     # Diagnostic sets
@@ -37,6 +38,7 @@ from chess_engine._engine import (
 __all__ = [
     "generate_training_batch",
     "generate_random_games",
+    "generate_clm_batch",
     "generate_checkmate_games",
     "generate_checkmate_training_batch",
     "generate_diagnostic_sets",
engine/src/batch.rs
CHANGED
|
@@ -49,23 +49,18 @@ pub fn generate_training_batch(batch_size: usize, max_ply: usize, seed: u64) ->
|
|
| 49 |
game_lengths.push(record.game_length as i16);
|
| 50 |
termination_codes.push(record.termination.as_u8());
|
| 51 |
|
| 52 |
-
// Copy move_ids
|
| 53 |
for t in 0..length {
|
| 54 |
move_ids[b * max_ply + t] = record.move_ids[t] as i16;
|
| 55 |
}
|
| 56 |
-
// Place EOG token at position game_length
|
| 57 |
-
if length < max_ply {
|
| 58 |
-
move_ids[b * max_ply + length] = vocab::EOG_TOKEN as i16;
|
| 59 |
-
}
|
| 60 |
|
| 61 |
-
// Copy legal move grids
|
| 62 |
for t in 0..length {
|
| 63 |
let grid_offset = (b * max_ply + t) * 64;
|
| 64 |
debug_assert_eq!(record.legal_grids[t].len(), 64);
|
| 65 |
legal_move_grid[grid_offset..grid_offset + 64]
|
| 66 |
.copy_from_slice(&record.legal_grids[t]);
|
| 67 |
}
|
| 68 |
-
// EOG position: all zeros (already initialized)
|
| 69 |
|
| 70 |
// Copy promotion masks (contiguous layout: [[bool; 4]; 44] = [bool; 176])
|
| 71 |
for t in 0..length {
|
|
@@ -108,9 +103,6 @@ pub fn generate_random_games(n_games: usize, max_ply: usize, seed: u64) -> GameB
|
|
| 108 |
for t in 0..(*length as usize) {
|
| 109 |
move_ids[b * max_ply + t] = moves[t] as i16;
|
| 110 |
}
|
| 111 |
-
if (*length as usize) < max_ply {
|
| 112 |
-
move_ids[b * max_ply + *length as usize] = vocab::EOG_TOKEN as i16;
|
| 113 |
-
}
|
| 114 |
}
|
| 115 |
|
| 116 |
GameBatch {
|
|
@@ -152,9 +144,6 @@ pub fn generate_checkmate_training_batch(
|
|
| 152 |
for t in 0..(ex.game_length as usize).min(max_ply) {
|
| 153 |
move_ids[b * max_ply + t] = ex.move_ids[t] as i16;
|
| 154 |
}
|
| 155 |
-
if (ex.game_length as usize) < max_ply {
|
| 156 |
-
move_ids[b * max_ply + ex.game_length as usize] = vocab::EOG_TOKEN as i16;
|
| 157 |
-
}
|
| 158 |
checkmate_targets[b * 64..(b + 1) * 64].copy_from_slice(&ex.checkmate_grid);
|
| 159 |
legal_grids[b * 64..(b + 1) * 64].copy_from_slice(&ex.legal_grid);
|
| 160 |
}
|
|
@@ -206,9 +195,6 @@ pub fn generate_completed_games(n_games: usize, max_ply: usize, seed: u64) -> Ga
|
|
| 206 |
for t in 0..(*length as usize) {
|
| 207 |
move_ids[b * max_ply + t] = moves[t] as i16;
|
| 208 |
}
|
| 209 |
-
if (*length as usize) < max_ply {
|
| 210 |
-
move_ids[b * max_ply + *length as usize] = vocab::EOG_TOKEN as i16;
|
| 211 |
-
}
|
| 212 |
}
|
| 213 |
|
| 214 |
GameBatch {
|
|
@@ -281,9 +267,6 @@ pub fn generate_checkmate_games(
|
|
| 281 |
for t in 0..(*length as usize) {
|
| 282 |
move_ids[b * max_ply + t] = moves[t] as i16;
|
| 283 |
}
|
| 284 |
-
if (*length as usize) < max_ply {
|
| 285 |
-
move_ids[b * max_ply + *length as usize] = vocab::EOG_TOKEN as i16;
|
| 286 |
-
}
|
| 287 |
}
|
| 288 |
|
| 289 |
(GameBatch {
|
|
@@ -295,6 +278,97 @@ pub fn generate_checkmate_games(
|
|
| 295 |
}, total_generated)
|
| 296 |
}
|
| 297 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 298 |
#[cfg(test)]
|
| 299 |
mod tests {
|
| 300 |
use super::*;
|
|
@@ -321,15 +395,23 @@ mod tests {
|
|
| 321 |
}
|
| 322 |
|
| 323 |
#[test]
|
| 324 |
-
fn
|
| 325 |
let batch = generate_training_batch(2, 256, 42);
|
| 326 |
for b in 0..2 {
|
| 327 |
let len = batch.game_lengths[b] as usize;
|
| 328 |
if len < 256 {
|
| 329 |
assert_eq!(
|
| 330 |
batch.move_ids[b * 256 + len],
|
| 331 |
-
vocab::
|
| 332 |
-
"
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 333 |
);
|
| 334 |
}
|
| 335 |
}
|
|
@@ -343,4 +425,90 @@ mod tests {
|
|
| 343 |
assert_eq!(b1.game_lengths, b2.game_lengths);
|
| 344 |
assert_eq!(b1.legal_move_grid, b2.legal_move_grid);
|
| 345 |
}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 346 |
}
|
|
|
|
| 49 |
game_lengths.push(record.game_length as i16);
|
| 50 |
termination_codes.push(record.termination.as_u8());
|
| 51 |
|
| 52 |
+
// Copy move_ids (remaining positions are already 0 = PAD)
|
| 53 |
for t in 0..length {
|
| 54 |
move_ids[b * max_ply + t] = record.move_ids[t] as i16;
|
| 55 |
}
|
|
|
|
|
|
|
|
|
|
|
|
|
| 56 |
|
| 57 |
+
// Copy legal move grids (positions beyond game_length are already 0)
|
| 58 |
for t in 0..length {
|
| 59 |
let grid_offset = (b * max_ply + t) * 64;
|
| 60 |
debug_assert_eq!(record.legal_grids[t].len(), 64);
|
| 61 |
legal_move_grid[grid_offset..grid_offset + 64]
|
| 62 |
.copy_from_slice(&record.legal_grids[t]);
|
| 63 |
}
|
|
|
|
| 64 |
|
| 65 |
// Copy promotion masks (contiguous layout: [[bool; 4]; 44] = [bool; 176])
|
| 66 |
for t in 0..length {
|
|
|
|
| 103 |
for t in 0..(*length as usize) {
|
| 104 |
move_ids[b * max_ply + t] = moves[t] as i16;
|
| 105 |
}
|
|
|
|
|
|
|
|
|
|
| 106 |
}
|
| 107 |
|
| 108 |
GameBatch {
|
|
|
|
| 144 |
for t in 0..(ex.game_length as usize).min(max_ply) {
|
| 145 |
move_ids[b * max_ply + t] = ex.move_ids[t] as i16;
|
| 146 |
}
|
|
|
|
|
|
|
|
|
|
| 147 |
checkmate_targets[b * 64..(b + 1) * 64].copy_from_slice(&ex.checkmate_grid);
|
| 148 |
legal_grids[b * 64..(b + 1) * 64].copy_from_slice(&ex.legal_grid);
|
| 149 |
}
|
|
|
|
| 195 |
for t in 0..(*length as usize) {
|
| 196 |
move_ids[b * max_ply + t] = moves[t] as i16;
|
| 197 |
}
|
|
|
|
|
|
|
|
|
|
| 198 |
}
|
| 199 |
|
| 200 |
GameBatch {
|
|
|
|
| 267 |
for t in 0..(*length as usize) {
|
| 268 |
move_ids[b * max_ply + t] = moves[t] as i16;
|
| 269 |
}
|
|
|
|
|
|
|
|
|
|
| 270 |
}
|
| 271 |
|
| 272 |
(GameBatch {
|
|
|
|
| 278 |
}, total_generated)
|
| 279 |
}
|
| 280 |
|
| 281 |
+
/// Output of CLM (Causal Language Model) batch generation.
|
| 282 |
+
///
|
| 283 |
+
/// Contains ready-to-train tensors in the format:
|
| 284 |
+
/// input_ids = [outcome, ply_1, ply_2, ..., ply_N, PAD, ..., PAD]
|
| 285 |
+
/// targets = [ply_1, ply_2, ply_3, ..., PAD, PAD, ..., PAD]
|
| 286 |
+
/// loss_mask = [true, true, true, ..., true, false, ..., false]
|
| 287 |
+
///
|
| 288 |
+
/// Also includes raw move_ids and game_lengths for replay operations
|
| 289 |
+
/// (legal mask computation, board state extraction, validation).
|
| 290 |
+
pub struct CLMBatch {
|
| 291 |
+
pub input_ids: Vec<i16>, // [batch_size * seq_len]
|
| 292 |
+
pub targets: Vec<i16>, // [batch_size * seq_len]
|
| 293 |
+
pub loss_mask: Vec<bool>, // [batch_size * seq_len]
|
| 294 |
+
pub move_ids: Vec<i16>, // [batch_size * max_ply] raw for replay
|
| 295 |
+
pub game_lengths: Vec<i16>, // [batch_size]
|
| 296 |
+
pub termination_codes: Vec<u8>, // [batch_size]
|
| 297 |
+
pub batch_size: usize,
|
| 298 |
+
pub seq_len: usize,
|
| 299 |
+
pub max_ply: usize,
|
| 300 |
+
}
|
| 301 |
+
|
| 302 |
+
/// Generate a CLM training batch: random games packed into model-ready format.
|
| 303 |
+
///
|
| 304 |
+
/// `seq_len` is the total sequence length (256). Games are generated with up to
|
| 305 |
+
/// `seq_len - 1` plies, leaving position 0 for the outcome token.
|
| 306 |
+
pub fn generate_clm_batch(
|
| 307 |
+
batch_size: usize,
|
| 308 |
+
seq_len: usize,
|
| 309 |
+
seed: u64,
|
| 310 |
+
discard_ply_limit: bool,
|
| 311 |
+
) -> CLMBatch {
|
| 312 |
+
let max_ply = seq_len - 1;
|
| 313 |
+
|
| 314 |
+
let game_batch = if discard_ply_limit {
|
| 315 |
+
generate_completed_games(batch_size, max_ply, seed)
|
| 316 |
+
} else {
|
| 317 |
+
generate_random_games(batch_size, max_ply, seed)
|
| 318 |
+
};
|
| 319 |
+
|
| 320 |
+
let mut input_ids = vec![0i16; batch_size * seq_len];
|
| 321 |
+
let mut targets = vec![0i16; batch_size * seq_len];
|
| 322 |
+
let mut loss_mask = vec![false; batch_size * seq_len];
|
| 323 |
+
|
| 324 |
+
for b in 0..batch_size {
|
| 325 |
+
let gl = game_batch.game_lengths[b] as usize;
|
| 326 |
+
let term = match game_batch.termination_codes[b] {
|
| 327 |
+
0 => Termination::Checkmate,
|
| 328 |
+
1 => Termination::Stalemate,
|
| 329 |
+
2 => Termination::SeventyFiveMoveRule,
|
| 330 |
+
3 => Termination::FivefoldRepetition,
|
| 331 |
+
4 => Termination::InsufficientMaterial,
|
| 332 |
+
_ => Termination::PlyLimit,
|
| 333 |
+
};
|
| 334 |
+
let outcome = vocab::termination_to_outcome(term, game_batch.game_lengths[b] as u16);
|
| 335 |
+
|
| 336 |
+
let row = b * seq_len;
|
| 337 |
+
|
| 338 |
+
// Position 0: outcome token
|
| 339 |
+
input_ids[row] = outcome as i16;
|
| 340 |
+
|
| 341 |
+
// Positions 1..=gl: move tokens
|
| 342 |
+
for t in 0..gl {
|
| 343 |
+
input_ids[row + 1 + t] = game_batch.move_ids[b * max_ply + t];
|
| 344 |
+
}
|
| 345 |
+
// Remaining positions are already 0 (PAD)
|
| 346 |
+
|
| 347 |
+
// Targets: input_ids shifted left by 1
|
| 348 |
+
for t in 0..(seq_len - 1) {
|
| 349 |
+
targets[row + t] = input_ids[row + t + 1];
|
| 350 |
+
}
|
| 351 |
+
// targets[row + seq_len - 1] is already 0
|
| 352 |
+
|
| 353 |
+
// Loss mask: positions 0..=gl are true
|
| 354 |
+
for t in 0..=gl {
|
| 355 |
+
loss_mask[row + t] = true;
|
| 356 |
+
}
|
| 357 |
+
}
|
| 358 |
+
|
| 359 |
+
CLMBatch {
|
| 360 |
+
input_ids,
|
| 361 |
+
targets,
|
| 362 |
+
loss_mask,
|
| 363 |
+
move_ids: game_batch.move_ids,
|
| 364 |
+
game_lengths: game_batch.game_lengths,
|
| 365 |
+
termination_codes: game_batch.termination_codes,
|
| 366 |
+
batch_size,
|
| 367 |
+
seq_len,
|
| 368 |
+
max_ply,
|
| 369 |
+
}
|
| 370 |
+
}
|
| 371 |
+
|
| 372 |
#[cfg(test)]
|
| 373 |
mod tests {
|
| 374 |
use super::*;
|
| 395 |
}
|
| 396 |
|
| 397 |
#[test]
|
| 398 |
+
fn test_pad_after_game_end() {
|
| 399 |
let batch = generate_training_batch(2, 256, 42);
|
| 400 |
for b in 0..2 {
|
| 401 |
let len = batch.game_lengths[b] as usize;
|
| 402 |
if len < 256 {
|
| 403 |
assert_eq!(
|
| 404 |
batch.move_ids[b * 256 + len],
|
| 405 |
+
vocab::PAD_TOKEN as i16,
|
| 406 |
+
"Position game_length should be PAD (0)"
|
| 407 |
+
);
|
| 408 |
+
}
|
| 409 |
+
// All positions after game_length should also be PAD
|
| 410 |
+
for t in len..256 {
|
| 411 |
+
assert_eq!(
|
| 412 |
+
batch.move_ids[b * 256 + t],
|
| 413 |
+
0,
|
| 414 |
+
"Position {} (after game_length={}) should be PAD", t, len
|
| 415 |
);
|
| 416 |
}
|
| 417 |
}
|
| 425 |
assert_eq!(b1.game_lengths, b2.game_lengths);
|
| 426 |
assert_eq!(b1.legal_move_grid, b2.legal_move_grid);
|
| 427 |
}
|
| 428 |
+
|
| 429 |
+
#[test]
|
| 430 |
+
fn test_clm_batch_format() {
|
| 431 |
+
let seq_len = 256;
|
| 432 |
+
let batch = generate_clm_batch(8, seq_len, 42, false);
|
| 433 |
+
assert_eq!(batch.input_ids.len(), 8 * seq_len);
|
| 434 |
+
assert_eq!(batch.targets.len(), 8 * seq_len);
|
| 435 |
+
assert_eq!(batch.loss_mask.len(), 8 * seq_len);
|
| 436 |
+
assert_eq!(batch.move_ids.len(), 8 * (seq_len - 1));
|
| 437 |
+
assert_eq!(batch.game_lengths.len(), 8);
|
| 438 |
+
|
| 439 |
+
for b in 0..8 {
|
| 440 |
+
let gl = batch.game_lengths[b] as usize;
|
| 441 |
+
let row = b * seq_len;
|
| 442 |
+
|
| 443 |
+
// Position 0: outcome token (4273-4277)
|
| 444 |
+
let outcome = batch.input_ids[row];
|
| 445 |
+
assert!(outcome >= vocab::OUTCOME_BASE as i16 && outcome <= vocab::PLY_LIMIT as i16,
|
| 446 |
+
"Position 0 should be outcome token, got {}", outcome);
|
| 447 |
+
|
| 448 |
+
// Positions 1..=gl: move tokens (1-4272)
|
| 449 |
+
for t in 1..=gl {
|
| 450 |
+
let tok = batch.input_ids[row + t];
|
| 451 |
+
assert!(tok >= 1 && tok <= 4272,
|
| 452 |
+
"Position {} should be move token, got {}", t, tok);
|
| 453 |
+
}
|
| 454 |
+
|
| 455 |
+
// Positions gl+1..seq_len: PAD (0)
|
| 456 |
+
for t in (gl + 1)..seq_len {
|
| 457 |
+
assert_eq!(batch.input_ids[row + t], 0,
|
| 458 |
+
"Position {} should be PAD, got {}", t, batch.input_ids[row + t]);
|
| 459 |
+
}
|
| 460 |
+
|
| 461 |
+
// Targets: shifted left by 1
|
| 462 |
+
for t in 0..(seq_len - 1) {
|
| 463 |
+
assert_eq!(batch.targets[row + t], batch.input_ids[row + t + 1],
|
| 464 |
+
"targets[{}] should equal input_ids[{}]", t, t + 1);
|
| 465 |
+
}
|
| 466 |
+
assert_eq!(batch.targets[row + seq_len - 1], 0, "Last target should be PAD");
|
| 467 |
+
|
| 468 |
+
// Target at position gl is PAD (end of game)
|
| 469 |
+
assert_eq!(batch.targets[row + gl], 0, "Target at game_length should be PAD");
|
| 470 |
+
|
| 471 |
+
// Loss mask: true for 0..=gl, false after
|
| 472 |
+
for t in 0..=gl {
|
| 473 |
+
assert!(batch.loss_mask[row + t],
|
| 474 |
+
"loss_mask[{}] should be true (gl={})", t, gl);
|
| 475 |
+
}
|
| 476 |
+
for t in (gl + 1)..seq_len {
|
| 477 |
+
assert!(!batch.loss_mask[row + t],
|
| 478 |
+
"loss_mask[{}] should be false (gl={})", t, gl);
|
| 479 |
+
}
|
| 480 |
+
}
|
| 481 |
+
}
|
| 482 |
+
|
| 483 |
+
#[test]
|
| 484 |
+
fn test_clm_batch_deterministic() {
|
| 485 |
+
let b1 = generate_clm_batch(4, 256, 99, false);
|
| 486 |
+
let b2 = generate_clm_batch(4, 256, 99, false);
|
| 487 |
+
assert_eq!(b1.input_ids, b2.input_ids);
|
| 488 |
+
assert_eq!(b1.targets, b2.targets);
|
| 489 |
+
assert_eq!(b1.loss_mask, b2.loss_mask);
|
| 490 |
+
assert_eq!(b1.game_lengths, b2.game_lengths);
|
| 491 |
+
}
|
| 492 |
+
|
| 493 |
+
#[test]
|
| 494 |
+
fn test_clm_batch_outcome_correctness() {
|
| 495 |
+
let batch = generate_clm_batch(32, 256, 42, false);
|
| 496 |
+
for b in 0..32 {
|
| 497 |
+
let gl = batch.game_lengths[b] as usize;
|
| 498 |
+
let tc = batch.termination_codes[b];
|
| 499 |
+
let expected = vocab::termination_to_outcome(
|
| 500 |
+
match tc {
|
| 501 |
+
0 => Termination::Checkmate,
|
| 502 |
+
1 => Termination::Stalemate,
|
| 503 |
+
2 => Termination::SeventyFiveMoveRule,
|
| 504 |
+
3 => Termination::FivefoldRepetition,
|
| 505 |
+
4 => Termination::InsufficientMaterial,
|
| 506 |
+
_ => Termination::PlyLimit,
|
| 507 |
+
},
|
| 508 |
+
gl as u16,
|
| 509 |
+
);
|
| 510 |
+
assert_eq!(batch.input_ids[b * 256] as u16, expected,
|
| 511 |
+
"Game {} outcome mismatch: tc={}, gl={}", b, tc, gl);
|
| 512 |
+
}
|
| 513 |
+
}
|
| 514 |
}
|
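The packing rule the tests above verify (outcome token at position 0, targets shifted left by one, loss mask through game end) can be sketched in plain Python; `pack_clm_row` is a hypothetical helper for illustration, not part of the crate:

```python
def pack_clm_row(outcome: int, moves: list[int], seq_len: int):
    """Pack one game into (input_ids, targets, loss_mask) per the CLM layout:
    input_ids = [outcome, ply_1..ply_N, PAD...], targets = input_ids shifted
    left by one, loss_mask true for positions 0..=game_length."""
    gl = len(moves)
    assert gl <= seq_len - 1, "game must fit after the outcome token"
    input_ids = [outcome] + moves + [0] * (seq_len - 1 - gl)
    targets = input_ids[1:] + [0]  # last target is always PAD
    loss_mask = [t <= gl for t in range(seq_len)]
    return input_ids, targets, loss_mask

ids, tgt, mask = pack_clm_row(4273, [12, 345, 678], seq_len=8)
# ids  == [4273, 12, 345, 678, 0, 0, 0, 0]
# tgt  == [12, 345, 678, 0, 0, 0, 0, 0]
# mask == [True, True, True, True, False, False, False, False]
```

Note that `targets[game_length]` is PAD: the model learns to predict "game over" after the last real move, matching the Rust assertions.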
engine/src/board.rs
CHANGED
|
@@ -74,7 +74,7 @@ pub fn move_to_token(m: &Move) -> u16 {
|
|
| 74 |
/// Convert our token index to a shakmaty Move, given the current position.
|
| 75 |
/// Finds the legal move matching the token's (src, dst, promo) decomposition.
|
| 76 |
pub fn token_to_move(pos: &Chess, token: u16) -> Option<Move> {
|
| 77 |
-
// Validate the token is decomposable (not PAD/EOG)
|
| 78 |
vocab::decompose_token(token)?;
|
| 79 |
let legal = pos.legal_moves();
|
| 80 |
|
| 74 |
/// Convert our token index to a shakmaty Move, given the current position.
|
| 75 |
/// Finds the legal move matching the token's (src, dst, promo) decomposition.
|
| 76 |
pub fn token_to_move(pos: &Chess, token: u16) -> Option<Move> {
|
| 77 |
+
// Validate the token is decomposable (not PAD/outcome)
|
| 78 |
vocab::decompose_token(token)?;
|
| 79 |
let legal = pos.legal_moves();
|
| 80 |
|
engine/src/lib.rs
CHANGED
|
@@ -169,6 +169,46 @@ fn generate_random_games<'py>(
|
|
| 169 |
Ok((move_ids, game_lengths, termination_codes))
|
| 170 |
}
|
| 171 |
|
|
| 172 |
/// Compute legal move masks by replaying games. Spec §7.4.
|
| 173 |
#[pyfunction]
|
| 174 |
fn compute_legal_move_masks<'py>(
|
|
@@ -868,6 +908,7 @@ fn _engine(m: &Bound<'_, PyModule>) -> PyResult<()> {
|
|
| 868 |
m.add_function(wrap_pyfunction!(export_move_vocabulary, m)?)?;
|
| 869 |
m.add_function(wrap_pyfunction!(generate_training_batch, m)?)?;
|
| 870 |
m.add_function(wrap_pyfunction!(generate_random_games, m)?)?;
|
| 871 |
m.add_function(wrap_pyfunction!(generate_checkmate_games, m)?)?;
|
| 872 |
m.add_function(wrap_pyfunction!(generate_checkmate_training_batch, m)?)?;
|
| 873 |
m.add_function(wrap_pyfunction!(compute_legal_move_masks, m)?)?;
|
| 169 |
Ok((move_ids, game_lengths, termination_codes))
|
| 170 |
}
|
| 171 |
|
| 172 |
+
/// Generate a CLM training batch with model-ready tensors.
|
| 173 |
+
///
|
| 174 |
+
/// Returns (input_ids, targets, loss_mask, move_ids, game_lengths, term_codes).
|
| 175 |
+
/// input_ids = [outcome, ply_1, ..., ply_N, PAD, ...] (seq_len per row).
|
| 176 |
+
/// move_ids are the raw moves (seq_len-1 per row) for replay operations.
|
| 177 |
+
#[pyfunction]
|
| 178 |
+
#[pyo3(signature = (batch_size, seq_len=256, seed=42, discard_ply_limit=false))]
|
| 179 |
+
fn generate_clm_batch<'py>(
|
| 180 |
+
py: Python<'py>,
|
| 181 |
+
batch_size: usize,
|
| 182 |
+
seq_len: usize,
|
| 183 |
+
seed: u64,
|
| 184 |
+
discard_ply_limit: bool,
|
| 185 |
+
) -> PyResult<(
|
| 186 |
+
Bound<'py, PyArray2<i16>>, // input_ids (B, seq_len)
|
| 187 |
+
Bound<'py, PyArray2<i16>>, // targets (B, seq_len)
|
| 188 |
+
Bound<'py, PyArray2<bool>>, // loss_mask (B, seq_len)
|
| 189 |
+
Bound<'py, PyArray2<i16>>, // move_ids (B, seq_len-1)
|
| 190 |
+
Bound<'py, PyArray1<i16>>, // game_lengths (B,)
|
| 191 |
+
Bound<'py, PyArray1<u8>>, // termination_codes (B,)
|
| 192 |
+
)> {
|
| 193 |
+
let result = py.allow_threads(|| {
|
| 194 |
+
batch::generate_clm_batch(batch_size, seq_len, seed, discard_ply_limit)
|
| 195 |
+
});
|
| 196 |
+
|
| 197 |
+
let max_ply = seq_len - 1;
|
| 198 |
+
let input_ids = numpy::PyArray::from_vec(py, result.input_ids)
|
| 199 |
+
.reshape([batch_size, seq_len])?;
|
| 200 |
+
let targets = numpy::PyArray::from_vec(py, result.targets)
|
| 201 |
+
.reshape([batch_size, seq_len])?;
|
| 202 |
+
let loss_mask = numpy::PyArray::from_vec(py, result.loss_mask)
|
| 203 |
+
.reshape([batch_size, seq_len])?;
|
| 204 |
+
let move_ids = numpy::PyArray::from_vec(py, result.move_ids)
|
| 205 |
+
.reshape([batch_size, max_ply])?;
|
| 206 |
+
let game_lengths = numpy::PyArray::from_vec(py, result.game_lengths);
|
| 207 |
+
let termination_codes = numpy::PyArray::from_vec(py, result.termination_codes);
|
| 208 |
+
|
| 209 |
+
Ok((input_ids, targets, loss_mask, move_ids, game_lengths, termination_codes))
|
| 210 |
+
}
|
| 211 |
+
|
| 212 |
/// Compute legal move masks by replaying games. Spec §7.4.
|
| 213 |
#[pyfunction]
|
| 214 |
fn compute_legal_move_masks<'py>(
|
| 908 |
m.add_function(wrap_pyfunction!(export_move_vocabulary, m)?)?;
|
| 909 |
m.add_function(wrap_pyfunction!(generate_training_batch, m)?)?;
|
| 910 |
m.add_function(wrap_pyfunction!(generate_random_games, m)?)?;
|
| 911 |
+
m.add_function(wrap_pyfunction!(generate_clm_batch, m)?)?;
|
| 912 |
m.add_function(wrap_pyfunction!(generate_checkmate_games, m)?)?;
|
| 913 |
m.add_function(wrap_pyfunction!(generate_checkmate_training_batch, m)?)?;
|
| 914 |
m.add_function(wrap_pyfunction!(compute_legal_move_masks, m)?)?;
|
engine/src/pgn.rs
CHANGED
|
@@ -62,9 +62,6 @@ pub fn batch_san_to_tokens(
|
|
| 62 |
for (t, &tok) in tokens.iter().enumerate() {
|
| 63 |
flat[gi * max_ply + t] = tok as i16;
|
| 64 |
}
|
| 65 |
-
if n_valid < max_ply {
|
| 66 |
-
flat[gi * max_ply + n_valid] = crate::vocab::EOG_TOKEN as i16;
|
| 67 |
-
}
|
| 68 |
lengths.push(n_valid as i16);
|
| 69 |
}
|
| 70 |
|
|
@@ -195,9 +192,6 @@ pub fn pgn_file_to_tokens(
|
|
| 195 |
for (t, &tok) in tokens.iter().enumerate() {
|
| 196 |
flat[gi * max_ply + t] = tok as i16;
|
| 197 |
}
|
| 198 |
-
if *n_valid < max_ply {
|
| 199 |
-
flat[gi * max_ply + n_valid] = crate::vocab::EOG_TOKEN as i16;
|
| 200 |
-
}
|
| 201 |
lengths.push(*n_valid as i16);
|
| 202 |
}
|
| 203 |
|
|
|
|
| 62 |
for (t, &tok) in tokens.iter().enumerate() {
|
| 63 |
flat[gi * max_ply + t] = tok as i16;
|
| 64 |
}
|
|
| 65 |
lengths.push(n_valid as i16);
|
| 66 |
}
|
| 67 |
|
|
|
|
| 192 |
for (t, &tok) in tokens.iter().enumerate() {
|
| 193 |
flat[gi * max_ply + t] = tok as i16;
|
| 194 |
}
|
|
| 195 |
lengths.push(*n_valid as i16);
|
| 196 |
}
|
| 197 |
|
engine/src/vocab.rs
CHANGED
|
@@ -1,10 +1,10 @@
|
|
| 1 |
//! Move vocabulary: the single source of truth for token ↔ UCI string mapping.
|
| 2 |
//!
|
| 3 |
-
//! Token layout (4,274 total):
|
| 4 |
//! 0 = padding
|
| 5 |
//! 1..=4096 = base grid (64×64 src×dst pairs)
|
| 6 |
//! 4097..=4272 = promotions (44 eligible pairs × 4 piece types)
|
| 7 |
-
//! 4273 = end-of-game token
|
| 8 |
//!
|
| 9 |
//! Square indexing: file-major within rank.
|
| 10 |
//! a1=0, b1=1, ..., h1=7, a2=8, ..., h8=63
|
|
@@ -12,9 +12,10 @@
|
|
| 12 |
|
| 13 |
use std::collections::HashMap;
|
| 14 |
|
| 15 |
-
|
|
|
|
|
|
|
| 16 |
pub const PAD_TOKEN: u16 = 0;
|
| 17 |
-
pub const EOG_TOKEN: u16 = 4273;
|
| 18 |
pub const BASE_GRID_START: u16 = 1;
|
| 19 |
pub const BASE_GRID_END: u16 = 4096; // inclusive
|
| 20 |
pub const PROMO_START: u16 = 4097;
|
|
@@ -22,6 +23,14 @@ pub const PROMO_END: u16 = 4272; // inclusive
|
|
| 22 |
pub const NUM_PROMO_PAIRS: usize = 44;
|
| 23 |
pub const NUM_PROMO_TYPES: usize = 4;
|
| 24 |
|
|
| 25 |
/// Square names in our index order.
|
| 26 |
pub const SQUARE_NAMES: [&str; 64] = [
|
| 27 |
"a1", "b1", "c1", "d1", "e1", "f1", "g1", "h1",
|
|
@@ -131,11 +140,29 @@ pub fn promo_token(src: u8, dst: u8, promo_type: u8) -> Option<u16> {
|
|
| 131 |
Some(PROMO_START + (pair_idx as u16) * 4 + (promo_type as u16))
|
| 132 |
}
|
| 133 |
|
|
| 134 |
/// Decompose a token into (src_square, dst_square, promo_type).
|
| 135 |
/// promo_type: 0=none, 1=q, 2=r, 3=b, 4=n (matches embedding index).
|
| 136 |
-
/// Returns None for PAD and EOG tokens.
|
| 137 |
pub fn decompose_token(token: u16) -> Option<(u8, u8, u8)> {
|
| 138 |
-
if token == PAD_TOKEN || token == EOG_TOKEN {
|
| 139 |
return None;
|
| 140 |
}
|
| 141 |
if token >= BASE_GRID_START && token <= BASE_GRID_END {
|
|
@@ -285,8 +312,32 @@ mod tests {
|
|
| 285 |
}
|
| 286 |
|
| 287 |
#[test]
|
| 288 |
-
fn test_pad_eog_decompose() {
|
| 289 |
assert!(decompose_token(PAD_TOKEN).is_none());
|
| 290 |
-
|
|
| 291 |
}
|
| 292 |
}
|
|
|
|
| 1 |
//! Move vocabulary: the single source of truth for token ↔ UCI string mapping.
|
| 2 |
//!
|
| 3 |
+
//! Token layout (4,278 total):
|
| 4 |
//! 0 = padding
|
| 5 |
//! 1..=4096 = base grid (64×64 src×dst pairs)
|
| 6 |
//! 4097..=4272 = promotions (44 eligible pairs × 4 piece types)
|
| 7 |
+
//! 4273..=4277 = outcome tokens (game result)
|
| 8 |
//!
|
| 9 |
//! Square indexing: file-major within rank.
|
| 10 |
//! a1=0, b1=1, ..., h1=7, a2=8, ..., h8=63
|
|
|
|
| 12 |
|
| 13 |
use std::collections::HashMap;
|
| 14 |
|
| 15 |
+
use crate::types::Termination;
|
| 16 |
+
|
| 17 |
+
pub const VOCAB_SIZE: usize = 4278;
|
| 18 |
pub const PAD_TOKEN: u16 = 0;
|
| 19 |
pub const BASE_GRID_START: u16 = 1;
|
| 20 |
pub const BASE_GRID_END: u16 = 4096; // inclusive
|
| 21 |
pub const PROMO_START: u16 = 4097;
|
| 23 |
pub const NUM_PROMO_PAIRS: usize = 44;
|
| 24 |
pub const NUM_PROMO_TYPES: usize = 4;
|
| 25 |
|
| 26 |
+
// Outcome tokens — must match pawn/config.py
|
| 27 |
+
pub const OUTCOME_BASE: u16 = 4273;
|
| 28 |
+
pub const WHITE_CHECKMATES: u16 = 4273;
|
| 29 |
+
pub const BLACK_CHECKMATES: u16 = 4274;
|
| 30 |
+
pub const STALEMATE: u16 = 4275;
|
| 31 |
+
pub const DRAW_BY_RULE: u16 = 4276;
|
| 32 |
+
pub const PLY_LIMIT: u16 = 4277;
|
| 33 |
+
|
| 34 |
/// Square names in our index order.
|
| 35 |
pub const SQUARE_NAMES: [&str; 64] = [
|
| 36 |
"a1", "b1", "c1", "d1", "e1", "f1", "g1", "h1",
|
| 140 |
Some(PROMO_START + (pair_idx as u16) * 4 + (promo_type as u16))
|
| 141 |
}
|
| 142 |
|
| 143 |
+
/// Map a game termination reason to the corresponding outcome token.
|
| 144 |
+
///
|
| 145 |
+
/// For checkmate, the winner is determined by game length parity:
|
| 146 |
+
/// - Odd game_length (white made the last move) → WHITE_CHECKMATES
|
| 147 |
+
/// - Even game_length (black made the last move) → BLACK_CHECKMATES
|
| 148 |
+
pub fn termination_to_outcome(term: Termination, game_length: u16) -> u16 {
|
| 149 |
+
match term {
|
| 150 |
+
Termination::Checkmate => {
|
| 151 |
+
if game_length % 2 == 1 { WHITE_CHECKMATES } else { BLACK_CHECKMATES }
|
| 152 |
+
}
|
| 153 |
+
Termination::Stalemate => STALEMATE,
|
| 154 |
+
Termination::SeventyFiveMoveRule
|
| 155 |
+
| Termination::FivefoldRepetition
|
| 156 |
+
| Termination::InsufficientMaterial => DRAW_BY_RULE,
|
| 157 |
+
Termination::PlyLimit => PLY_LIMIT,
|
| 158 |
+
}
|
| 159 |
+
}
|
| 160 |
+
|
| 161 |
/// Decompose a token into (src_square, dst_square, promo_type).
|
| 162 |
/// promo_type: 0=none, 1=q, 2=r, 3=b, 4=n (matches embedding index).
|
| 163 |
+
/// Returns None for PAD and outcome tokens.
|
| 164 |
pub fn decompose_token(token: u16) -> Option<(u8, u8, u8)> {
|
| 165 |
+
if token == PAD_TOKEN || token >= OUTCOME_BASE {
|
| 166 |
return None;
|
| 167 |
}
|
| 168 |
if token >= BASE_GRID_START && token <= BASE_GRID_END {
|
| 312 |
}
|
| 313 |
|
| 314 |
#[test]
|
| 315 |
+
fn test_pad_outcome_decompose() {
|
| 316 |
assert!(decompose_token(PAD_TOKEN).is_none());
|
| 317 |
+
// All 5 outcome tokens should return None
|
| 318 |
+
for token in OUTCOME_BASE..=PLY_LIMIT {
|
| 319 |
+
assert!(decompose_token(token).is_none(),
|
| 320 |
+
"outcome token {} should not decompose", token);
|
| 321 |
+
}
|
| 322 |
+
}
|
| 323 |
+
|
| 324 |
+
#[test]
|
| 325 |
+
fn test_termination_to_outcome() {
|
| 326 |
+
use crate::types::Termination;
|
| 327 |
+
|
| 328 |
+
// Checkmate with odd game_length = white wins
|
| 329 |
+
assert_eq!(termination_to_outcome(Termination::Checkmate, 11), WHITE_CHECKMATES);
|
| 330 |
+
assert_eq!(termination_to_outcome(Termination::Checkmate, 1), WHITE_CHECKMATES);
|
| 331 |
+
|
| 332 |
+
// Checkmate with even game_length = black wins
|
| 333 |
+
assert_eq!(termination_to_outcome(Termination::Checkmate, 12), BLACK_CHECKMATES);
|
| 334 |
+
assert_eq!(termination_to_outcome(Termination::Checkmate, 2), BLACK_CHECKMATES);
|
| 335 |
+
|
| 336 |
+
// Other terminations
|
| 337 |
+
assert_eq!(termination_to_outcome(Termination::Stalemate, 50), STALEMATE);
|
| 338 |
+
assert_eq!(termination_to_outcome(Termination::SeventyFiveMoveRule, 100), DRAW_BY_RULE);
|
| 339 |
+
assert_eq!(termination_to_outcome(Termination::FivefoldRepetition, 80), DRAW_BY_RULE);
|
| 340 |
+
assert_eq!(termination_to_outcome(Termination::InsufficientMaterial, 60), DRAW_BY_RULE);
|
| 341 |
+
assert_eq!(termination_to_outcome(Termination::PlyLimit, 255), PLY_LIMIT);
|
| 342 |
}
|
| 343 |
}
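The parity rule tested above transliterates directly to Python. This is a sketch for illustration, with string termination names standing in for the Rust `Termination` enum (token values copied from the constants this commit adds):

```python
WHITE_CHECKMATES, BLACK_CHECKMATES = 4273, 4274
STALEMATE, DRAW_BY_RULE, PLY_LIMIT = 4275, 4276, 4277

def termination_to_outcome(term: str, game_length: int) -> int:
    """Checkmate winner is decided by ply parity: ply 1 is white's move,
    so an odd game_length means white delivered the mating move."""
    if term == "checkmate":
        return WHITE_CHECKMATES if game_length % 2 == 1 else BLACK_CHECKMATES
    if term == "stalemate":
        return STALEMATE
    if term in ("seventy_five_move_rule", "fivefold_repetition",
                "insufficient_material"):
        return DRAW_BY_RULE
    return PLY_LIMIT
```

Keeping this mapping identical on both sides matters because vocab.rs notes the outcome constants "must match pawn/config.py"; any drift would silently mislabel game outcomes in training data.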
|
pawn/checkpoint.py
ADDED
|
@@ -0,0 +1,647 @@
|
| 1 |
+
"""Checkpoint save/load using safetensors + JSON.
|
| 2 |
+
|
| 3 |
+
Replaces monolithic torch.save() .pt files with directory-based checkpoints:
|
| 4 |
+
- model.safetensors / adapter.safetensors — tensor data
|
| 5 |
+
- optimizer.safetensors — flattened optimizer state tensors
|
| 6 |
+
- training_state.json — scalars, scheduler, scaler, RNG, optimizer metadata
|
| 7 |
+
- config.json — model and training configuration
|
| 8 |
+
- .complete — SHA-256 hashes of all files (integrity sentinel)
|
| 9 |
+
|
| 10 |
+
Writes are atomic: files are written to a .tmp directory, then renamed.
|
| 11 |
+
Loads always verify the .complete sentinel and SHA-256 hashes.
|
| 12 |
+
|
| 13 |
+
Backward compatible: all load functions transparently handle legacy .pt files.
|
| 14 |
+
"""
|
| 15 |
+
|
| 16 |
+
from __future__ import annotations
|
| 17 |
+
|
| 18 |
+
import base64
|
| 19 |
+
import hashlib
|
| 20 |
+
import json
|
| 21 |
+
import os
|
| 22 |
+
import shutil
|
| 23 |
+
from pathlib import Path
|
| 24 |
+
|
| 25 |
+
import torch
|
| 26 |
+
import torch.nn as nn
|
| 27 |
+
from safetensors.torch import save_file, load_file
|
| 28 |
+
|
| 29 |
+
CHECKPOINT_FORMAT_VERSION = 1
|
| 30 |
+
LEGACY_EXTENSIONS = {".pt", ".pth"}
|
| 31 |
+
|
| 32 |
+
|
| 33 |
+
# ---------------------------------------------------------------------------
|
| 34 |
+
# Exceptions
|
| 35 |
+
# ---------------------------------------------------------------------------
|
| 36 |
+
|
| 37 |
+
class IncompleteCheckpointError(Exception):
|
| 38 |
+
"""Raised when a checkpoint directory is missing its .complete sentinel."""
|
| 39 |
+
pass
|
| 40 |
+
|
| 41 |
+
|
| 42 |
+
class CheckpointIntegrityError(Exception):
|
| 43 |
+
"""Raised when a checkpoint file's SHA-256 hash doesn't match .complete."""
|
| 44 |
+
pass
|
| 45 |
+
|
| 46 |
+
|
| 47 |
+
# ---------------------------------------------------------------------------
|
| 48 |
+
# SHA-256 helpers
|
| 49 |
+
# ---------------------------------------------------------------------------
|
| 50 |
+
|
| 51 |
+
def _sha256_file(path: Path) -> str:
|
| 52 |
+
"""Compute SHA-256 hex digest of a file."""
|
| 53 |
+
h = hashlib.sha256()
|
| 54 |
+
with open(path, "rb") as f:
|
| 55 |
+
for chunk in iter(lambda: f.read(1 << 20), b""): # 1MB chunks
|
| 56 |
+
h.update(chunk)
|
| 57 |
+
return h.hexdigest()
|
| 58 |
+
|
| 59 |
+
|
| 60 |
+
def _write_complete_sentinel(directory: Path) -> None:
|
| 61 |
+
"""Write .complete sentinel with SHA-256 hashes of all checkpoint files."""
|
| 62 |
+
hashes = {}
|
| 63 |
+
for f in sorted(directory.iterdir()):
|
| 64 |
+
if f.name == ".complete" or f.is_dir():
|
| 65 |
+
continue
|
| 66 |
+
hashes[f.name] = _sha256_file(f)
|
| 67 |
+
|
| 68 |
+
sentinel = {"format_version": CHECKPOINT_FORMAT_VERSION, "files": hashes}
|
| 69 |
+
with open(directory / ".complete", "w") as f:
|
| 70 |
+
json.dump(sentinel, f, indent=2)
|
| 71 |
+
|
| 72 |
+
|
| 73 |
+
def _verify_complete_sentinel(directory: Path) -> None:
|
| 74 |
+
"""Verify .complete sentinel exists and all hashes match.
|
| 75 |
+
|
| 76 |
+
Raises IncompleteCheckpointError if sentinel is missing.
|
| 77 |
+
Raises CheckpointIntegrityError if any hash mismatches.
|
| 78 |
+
"""
|
| 79 |
+
sentinel_path = directory / ".complete"
|
| 80 |
+
if not sentinel_path.exists():
|
| 81 |
+
raise IncompleteCheckpointError(
|
| 82 |
+
f"Checkpoint {directory} is missing .complete sentinel — "
|
| 83 |
+
f"likely a partial write from a crashed or interrupted save."
|
| 84 |
+
)
|
| 85 |
+
|
| 86 |
+
with open(sentinel_path) as f:
|
| 87 |
+
sentinel = json.load(f)
|
| 88 |
+
|
| 89 |
+
for filename, expected_hash in sentinel["files"].items():
|
| 90 |
+
filepath = directory / filename
|
| 91 |
+
if not filepath.exists():
|
| 92 |
+
raise CheckpointIntegrityError(
|
| 93 |
+
f"File {filename} listed in .complete but missing from {directory}"
|
| 94 |
+
)
|
| 95 |
+
actual_hash = _sha256_file(filepath)
|
| 96 |
+
if actual_hash != expected_hash:
|
| 97 |
+
raise CheckpointIntegrityError(
|
| 98 |
+
f"SHA-256 mismatch for {filename} in {directory}: "
|
| 99 |
+
f"expected {expected_hash[:16]}..., got {actual_hash[:16]}..."
|
| 100 |
+
)
|
| 101 |
+
|
| 102 |
+
|
| 103 |
+
# ---------------------------------------------------------------------------
|
| 104 |
+
# Atomic directory write
|
| 105 |
+
# ---------------------------------------------------------------------------
|
| 106 |
+
|
| 107 |
+
def _atomic_directory_write(target: Path):
|
| 108 |
+
"""Context manager for atomic directory writes.
|
| 109 |
+
|
| 110 |
+
Usage:
|
| 111 |
+
with _atomic_directory_write(Path("step_00001")) as tmp:
|
| 112 |
+
save_file(tensors, tmp / "model.safetensors")
|
| 113 |
+
...
|
| 114 |
+
# Directory is now at step_00001/ with .complete sentinel
|
| 115 |
+
"""
|
| 116 |
+
class _AtomicDir:
|
| 117 |
+
def __init__(self, target: Path):
|
| 118 |
+
self.target = target
|
| 119 |
+
self.tmp = target.parent / f"{target.name}.tmp"
|
| 120 |
+
|
| 121 |
+
def __enter__(self) -> Path:
|
| 122 |
+
# Clean up any leftover .tmp from a previous crash
|
| 123 |
+
if self.tmp.exists():
|
| 124 |
+
shutil.rmtree(self.tmp)
|
| 125 |
+
self.tmp.mkdir(parents=True, exist_ok=True)
|
| 126 |
+
return self.tmp
|
| 127 |
+
|
| 128 |
+
def __exit__(self, exc_type, exc_val, exc_tb):
|
| 129 |
+
if exc_type is not None:
|
| 130 |
+
# Error during write — clean up temp dir
|
| 131 |
+
if self.tmp.exists():
|
| 132 |
+
shutil.rmtree(self.tmp)
|
| 133 |
+
return False
|
| 134 |
+
|
| 135 |
+
# Write .complete sentinel with hashes
|
| 136 |
+
_write_complete_sentinel(self.tmp)
|
| 137 |
+
|
| 138 |
+
# Atomic rename (same filesystem)
|
| 139 |
+
if self.target.exists():
|
| 140 |
+
shutil.rmtree(self.target)
|
| 141 |
+
os.rename(self.tmp, self.target)
|
| 142 |
+
return False
|
| 143 |
+
|
| 144 |
+
return _AtomicDir(target)
|
| 145 |
+
|
| 146 |
+
|
| 147 |
+
# ---------------------------------------------------------------------------
|
| 148 |
+
# JSON helpers
|
| 149 |
+
# ---------------------------------------------------------------------------
|
| 150 |
+
|
| 151 |
+
def _json_default(obj):
|
| 152 |
+
"""JSON serializer for types not natively supported."""
|
| 153 |
+
if isinstance(obj, torch.Tensor):
|
| 154 |
+
return obj.item() if obj.numel() == 1 else obj.tolist()
|
| 155 |
+
if hasattr(obj, "item"): # numpy scalar
|
| 156 |
+
return obj.item()
|
| 157 |
+
if isinstance(obj, Path):
|
| 158 |
+
return str(obj)
|
| 159 |
+
return str(obj)
|
| 160 |
+
|
| 161 |
+
|
| 162 |
+
# ---------------------------------------------------------------------------
|
| 163 |
+
# RNG state serialization
|
| 164 |
+
# ---------------------------------------------------------------------------
|
| 165 |
+
|
| 166 |
+
def _rng_to_json(
|
| 167 |
+
torch_rng: torch.Tensor | None, cuda_rng: torch.Tensor | None
|
| 168 |
+
) -> dict:
|
| 169 |
+
data = {}
|
| 170 |
+
if torch_rng is not None:
|
| 171 |
+
data["torch_rng_state"] = base64.b64encode(
|
| 172 |
+
torch_rng.numpy().tobytes()
|
| 173 |
+
).decode("ascii")
|
| 174 |
+
if cuda_rng is not None:
|
| 175 |
+
data["cuda_rng_state"] = base64.b64encode(
|
| 176 |
+
cuda_rng.numpy().tobytes()
|
| 177 |
+
).decode("ascii")
|
| 178 |
+
return data
|
| 179 |
+
|
| 180 |
+
|
| 181 |
+
def _json_to_rng(data: dict) -> tuple[torch.Tensor | None, torch.Tensor | None]:
|
| 182 |
+
torch_rng = None
|
| 183 |
+
if "torch_rng_state" in data:
|
| 184 |
+
raw = base64.b64decode(data["torch_rng_state"])
|
| 185 |
+
torch_rng = torch.frombuffer(bytearray(raw), dtype=torch.uint8)
|
| 186 |
+
cuda_rng = None
|
| 187 |
+
if "cuda_rng_state" in data:
|
| 188 |
+
raw = base64.b64decode(data["cuda_rng_state"])
|
| 189 |
+
cuda_rng = torch.frombuffer(bytearray(raw), dtype=torch.uint8)
|
| 190 |
+
return torch_rng, cuda_rng
|
| 191 |
+
|
| 192 |
+
|
| 193 |
+
+# ---------------------------------------------------------------------------
+# Optimizer state flattening for safetensors
+# ---------------------------------------------------------------------------
+
+def _flatten_optimizer_state(
+    opt_state_dict: dict,
+) -> tuple[dict[str, torch.Tensor], dict]:
+    """Flatten optimizer state into safetensors-compatible tensors + JSON metadata."""
+    tensors: dict[str, torch.Tensor] = {}
+    scalars: dict[str, float | int] = {}
+
+    for param_id, param_state in opt_state_dict["state"].items():
+        for key, val in param_state.items():
+            flat_key = f"state.{param_id}.{key}"
+            if isinstance(val, torch.Tensor):
+                t = val.cpu().contiguous()
+                if t.ndim == 0:
+                    t = t.unsqueeze(0)
+                tensors[flat_key] = t
+            else:
+                scalars[flat_key] = val
+
+    meta = {
+        "param_groups": opt_state_dict["param_groups"],
+        "scalars": scalars if scalars else None,
+    }
+    return tensors, meta
+
+
+def _unflatten_optimizer_state(
+    tensors: dict[str, torch.Tensor],
+    meta: dict,
+    device: str = "cpu",
+) -> dict:
+    """Reconstruct optimizer state_dict from flattened tensors + metadata."""
+    state: dict[int, dict[str, torch.Tensor | float | int]] = {}
+    scalars = meta.get("scalars") or {}
+
+    for flat_key, val in tensors.items():
+        parts = flat_key.split(".", 2)
+        if len(parts) != 3 or parts[0] != "state":
+            continue
+        param_id = int(parts[1])
+        key = parts[2]
+        if param_id not in state:
+            state[param_id] = {}
+        t = val.to(device)
+        if t.shape == (1,) and key == "step":
+            t = t.squeeze(0)
+        state[param_id][key] = t
+
+    for flat_key, val in scalars.items():
+        parts = flat_key.split(".", 2)
+        if len(parts) != 3:
+            continue
+        param_id = int(parts[1])
+        key = parts[2]
+        if param_id not in state:
+            state[param_id] = {}
+        state[param_id][key] = val
+
+    return {
+        "state": state,
+        "param_groups": meta["param_groups"],
+    }
+
+
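The flatten/unflatten pair maps PyTorch's nested optimizer state onto flat `state.{param_id}.{key}` names, which is what safetensors requires (it stores a flat string-to-tensor map). A dependency-free sketch of the same key scheme using plain dicts (illustrative only; the real functions also handle tensors, scalars, and the 0-dim `step` case):

```python
def flatten(state: dict) -> dict:
    # Nested {param_id: {key: value}} -> flat {"state.{id}.{key}": value}
    return {
        f"state.{pid}.{key}": val
        for pid, pstate in state.items()
        for key, val in pstate.items()
    }

def unflatten(flat: dict) -> dict:
    # Inverse: split on the first two dots only, so state keys may contain dots.
    nested: dict = {}
    for flat_key, val in flat.items():
        _, pid, key = flat_key.split(".", 2)
        nested.setdefault(int(pid), {})[key] = val
    return nested

state = {0: {"exp_avg": [0.1], "step": 3}, 1: {"exp_avg": [0.2]}}
assert unflatten(flatten(state)) == state
```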
+# ---------------------------------------------------------------------------
+# Legacy checkpoint detection
+# ---------------------------------------------------------------------------
+
+def is_legacy_checkpoint(path: str | Path) -> bool:
+    """True if path is a single .pt/.pth file (legacy format)."""
+    p = Path(path)
+    return p.is_file() and p.suffix in LEGACY_EXTENSIONS
+
+
+def _load_legacy_pt(path: str | Path, device: str = "cpu") -> dict:
+    return torch.load(str(path), map_location=device, weights_only=False)
+
+
+# ---------------------------------------------------------------------------
+# Pretrain checkpoint save/load
+# ---------------------------------------------------------------------------
+
+def save_pretrain_checkpoint(
+    path: str | Path,
+    model: nn.Module,
+    optimizer: torch.optim.Optimizer,
+    scheduler,
+    scaler,
+    global_step: int,
+    model_config: dict,
+    training_config: dict,
+) -> None:
+    """Save a pretraining checkpoint atomically.
+
+    Writes to {path}.tmp/, then renames to {path}/ with .complete sentinel.
+    """
+    path = Path(path)
+
+    with _atomic_directory_write(path) as tmp:
+        # 1. Model weights
+        state_dict = {k: v.cpu().contiguous() for k, v in model.state_dict().items()}
+        save_file(state_dict, tmp / "model.safetensors")
+
+        # 2. Optimizer tensors
+        opt_tensors, opt_meta = _flatten_optimizer_state(optimizer.state_dict())
+        if opt_tensors:
+            save_file(opt_tensors, tmp / "optimizer.safetensors")
+
+        # 3. Training state (JSON)
+        training_state = {
+            "format_version": CHECKPOINT_FORMAT_VERSION,
+            "global_step": global_step,
+            "scheduler_state_dict": scheduler.state_dict() if scheduler is not None else None,
+            "scaler_state_dict": scaler.state_dict() if scaler is not None else None,
+            "optimizer_meta": opt_meta,
+            **_rng_to_json(
+                torch.get_rng_state(),
+                torch.cuda.get_rng_state() if torch.cuda.is_available() else None,
+            ),
+        }
+        with open(tmp / "training_state.json", "w") as f:
+            json.dump(training_state, f, indent=2, default=_json_default)
+
+        # 4. Config
+        config = {
+            "format_version": CHECKPOINT_FORMAT_VERSION,
+            "checkpoint_type": "pretrain",
+            "model_config": model_config,
+            "training_config": training_config,
+        }
+        with open(tmp / "config.json", "w") as f:
+            json.dump(config, f, indent=2, default=_json_default)
+
+
+def load_pretrain_checkpoint(
+    path: str | Path,
+    model: nn.Module,
+    optimizer: torch.optim.Optimizer | None = None,
+    scheduler=None,
+    scaler=None,
+    device: str = "cpu",
+) -> dict:
+    """Load a pretraining checkpoint with integrity verification.
+
+    Handles both legacy .pt files and new directory format.
+    New format: verifies .complete sentinel and SHA-256 hashes.
+    """
+    path = Path(path)
+
+    if is_legacy_checkpoint(path):
+        ckpt = _load_legacy_pt(path, device)
+        model.load_state_dict(ckpt["model_state_dict"])
+        if optimizer and "optimizer_state_dict" in ckpt:
+            optimizer.load_state_dict(ckpt["optimizer_state_dict"])
+        if scheduler and "scheduler_state_dict" in ckpt:
+            scheduler.load_state_dict(ckpt["scheduler_state_dict"])
+        if scaler and "scaler_state_dict" in ckpt:
+            scaler.load_state_dict(ckpt["scaler_state_dict"])
+        if ckpt.get("torch_rng_state") is not None:
+            torch.set_rng_state(ckpt["torch_rng_state"].cpu().byte())
+        if ckpt.get("cuda_rng_state") is not None and torch.cuda.is_available():
+            torch.cuda.set_rng_state(ckpt["cuda_rng_state"].cpu().byte())
+        return {
+            "global_step": ckpt.get("global_step", 0),
+            "model_config": ckpt.get("model_config"),
+            "training_config": ckpt.get("training_config"),
+        }
+
+    # New directory format — verify integrity first
+    _verify_complete_sentinel(path)
+
+    weights = load_file(path / "model.safetensors", device=device)
+    model.load_state_dict(weights)
+
+    with open(path / "training_state.json") as f:
+        ts = json.load(f)
+
+    if optimizer and (path / "optimizer.safetensors").exists():
+        opt_tensors = load_file(path / "optimizer.safetensors", device=device)
+        opt_state = _unflatten_optimizer_state(opt_tensors, ts["optimizer_meta"], device)
+        optimizer.load_state_dict(opt_state)
+
+    if scheduler and "scheduler_state_dict" in ts:
+        scheduler.load_state_dict(ts["scheduler_state_dict"])
+
+    if scaler and "scaler_state_dict" in ts:
+        scaler.load_state_dict(ts["scaler_state_dict"])
+
+    torch_rng, cuda_rng = _json_to_rng(ts)
+    if torch_rng is not None:
+        torch.set_rng_state(torch_rng.cpu().byte())
+    if cuda_rng is not None and torch.cuda.is_available():
+        torch.cuda.set_rng_state(cuda_rng.cpu().byte())
+
+    with open(path / "config.json") as f:
+        config = json.load(f)
+
+    return {
+        "global_step": ts.get("global_step", 0),
+        "model_config": config.get("model_config"),
+        "training_config": config.get("training_config"),
+    }
+
+
+# ---------------------------------------------------------------------------
+# Adapter checkpoint save/load
+# ---------------------------------------------------------------------------
+
+def save_adapter_checkpoint(
+    path: str | Path,
+    adapter_state_dict: dict[str, torch.Tensor],
+    config: dict,
+    epoch: int,
+    step: int,
+    val_metrics: dict,
+    optimizer: torch.optim.Optimizer | None = None,
+    scheduler=None,
+    scaler=None,
+    extra: dict | None = None,
+) -> None:
+    """Save an adapter checkpoint atomically."""
+    path = Path(path)
+
+    with _atomic_directory_write(path) as tmp:
+        # 1. Adapter weights
+        tensors = {k: v.cpu().contiguous() for k, v in adapter_state_dict.items()}
+        save_file(tensors, tmp / "adapter.safetensors")
+
+        # 2. Optimizer tensors (if provided)
+        opt_meta = None
+        if optimizer is not None:
+            opt_tensors, opt_meta = _flatten_optimizer_state(optimizer.state_dict())
+            if opt_tensors:
+                save_file(opt_tensors, tmp / "optimizer.safetensors")
+
+        # 3. Training state
+        training_state: dict = {
+            "format_version": CHECKPOINT_FORMAT_VERSION,
+            "epoch": epoch,
+            "step": step,
+            "val_metrics": val_metrics,
+        }
+        if opt_meta is not None:
+            training_state["optimizer_meta"] = opt_meta
+        if scheduler is not None:
+            training_state["scheduler_state_dict"] = scheduler.state_dict()
+        if scaler is not None:
+            training_state["scaler_state_dict"] = scaler.state_dict()
+        if extra:
+            training_state.update(extra)
+
+        with open(tmp / "training_state.json", "w") as f:
+            json.dump(training_state, f, indent=2, default=_json_default)
+
+        # 4. Config
+        with open(tmp / "config.json", "w") as f:
+            json.dump(config, f, indent=2, default=_json_default)
+
+
+def load_adapter_checkpoint(
+    path: str | Path,
+    device: str = "cpu",
+) -> dict:
+    """Load an adapter checkpoint with integrity verification."""
+    path = Path(path)
+
+    if is_legacy_checkpoint(path):
+        ckpt = _load_legacy_pt(path, device)
+        adapter_key = None
+        for key in ("lora_state_dict", "bottleneck_state_dict", "film_state_dict",
+                    "sparse_state_dict", "adapter_state_dict", "model_state_dict"):
+            if key in ckpt:
+                adapter_key = key
+                break
+        return {
+            "adapter_state_dict": ckpt.get(adapter_key, {}),
+            "config": ckpt.get("config", {}),
+            "epoch": ckpt.get("epoch", 0),
+            "step": ckpt.get("step", 0),
+            "val_metrics": {
+                "loss": ckpt.get("val_loss"),
+                "top1_accuracy": ckpt.get("val_top1"),
+            },
+            "optimizer_state_dict": ckpt.get("optimizer_state_dict"),
+            "scheduler_state_dict": ckpt.get("scheduler_state_dict"),
+            "scaler_state_dict": ckpt.get("scaler_state_dict"),
+            "best_val_loss": ckpt.get("best_val_loss"),
+            "patience_counter": ckpt.get("patience_counter"),
+        }
+
+    # New directory format — verify integrity first
+    _verify_complete_sentinel(path)
+
+    adapter_weights = load_file(path / "adapter.safetensors", device=device)
+
+    with open(path / "config.json") as f:
+        config = json.load(f)
+
+    ts = {}
+    ts_path = path / "training_state.json"
+    if ts_path.exists():
+        with open(ts_path) as f:
+            ts = json.load(f)
+
+    result: dict = {
+        "adapter_state_dict": adapter_weights,
+        "config": config,
+        "epoch": ts.get("epoch", 0),
+        "step": ts.get("step", 0),
+        "val_metrics": ts.get("val_metrics", {}),
+        "best_val_loss": ts.get("best_val_loss"),
+        "patience_counter": ts.get("patience_counter"),
+    }
+
+    if (path / "optimizer.safetensors").exists() and "optimizer_meta" in ts:
+        opt_tensors = load_file(path / "optimizer.safetensors", device=device)
+        result["optimizer_state_dict"] = _unflatten_optimizer_state(
+            opt_tensors, ts["optimizer_meta"], device
+        )
+
+    if "scheduler_state_dict" in ts:
+        result["scheduler_state_dict"] = ts["scheduler_state_dict"]
+
+    if "scaler_state_dict" in ts:
+        result["scaler_state_dict"] = ts["scaler_state_dict"]
+
+    return result
+
+
+# ---------------------------------------------------------------------------
+# Backbone-only loading (inference)
+# ---------------------------------------------------------------------------
+
+def load_backbone_weights(
+    path: str | Path,
+    device: str = "cpu",
+) -> tuple[dict[str, torch.Tensor], dict | None]:
+    """Load model weights and config for inference with integrity verification.
+
+    Works with:
+    - Legacy .pt files (extracts model_state_dict + model_config)
+    - New checkpoint directories (reads model.safetensors + config.json, verifies .complete)
+    - Bare model.safetensors files (no .complete check — used for HF downloads)
+
+    Returns (state_dict, model_config_dict_or_None).
+    """
+    path = Path(path)
+
+    if is_legacy_checkpoint(path):
+        ckpt = _load_legacy_pt(path, device)
+        return ckpt["model_state_dict"], ckpt.get("model_config")
+
+    # Directory with model.safetensors
+    if path.is_dir():
+        sf_path = path / "model.safetensors"
+        if not sf_path.exists():
+            raise FileNotFoundError(f"No model.safetensors in {path}")
+        # Verify integrity if .complete exists (new format checkpoints)
+        if (path / ".complete").exists():
+            _verify_complete_sentinel(path)
+        weights = load_file(sf_path, device=device)
+        config = None
+        config_path = path / "config.json"
+        if config_path.exists():
+            with open(config_path) as f:
+                config = json.load(f).get("model_config")
+        return weights, config
+
+    # Bare safetensors file
+    if path.suffix == ".safetensors":
+        weights = load_file(path, device=device)
+        config = None
+        config_path = path.parent / "config.json"
+        if config_path.exists():
+            with open(config_path) as f:
+                config = json.load(f).get("model_config")
+        return weights, config
+
+    raise ValueError(f"Unrecognized checkpoint format: {path}")
+
+
+# ---------------------------------------------------------------------------
+# HuggingFace push
+# ---------------------------------------------------------------------------
+
+def push_checkpoint_to_hf(
+    checkpoint_path: str | Path,
+    repo_id: str,
+    branch: str,
+    metrics_path: str | Path | None = None,
+    step: int = 0,
+) -> None:
+    """Push a complete checkpoint directory to a HuggingFace repo branch.
+
+    Uploads checkpoint files to checkpoints/step_NNNN/ on the branch.
+    Optionally uploads metrics.jsonl (truncated to current step) to the root.
+
+    Requires HF_TOKEN environment variable or prior `huggingface_hub.login()`.
+    """
+    from huggingface_hub import HfApi
+
+    checkpoint_path = Path(checkpoint_path)
+    api = HfApi()
+
+    # Ensure branch exists
+    try:
+        api.create_branch(repo_id, repo_type="model", branch=branch, exist_ok=True)
+    except Exception:
+        pass  # Branch may already exist
+
+    # Upload checkpoint directory
+    api.upload_folder(
+        folder_path=str(checkpoint_path),
+        path_in_repo=f"checkpoints/{checkpoint_path.name}",
+        repo_id=repo_id,
+        repo_type="model",
+        revision=branch,
+        commit_message=f"Checkpoint step {step}",
+    )
+
+    # Upload truncated metrics.jsonl to repo root
+    if metrics_path is not None:
+        metrics_path = Path(metrics_path)
+        if metrics_path.exists():
+            import tempfile
+            # Truncate metrics to current step
+            truncated_lines = []
+            with open(metrics_path) as f:
+                for line in f:
+                    truncated_lines.append(line)
+                    try:
+                        record = json.loads(line)
+                        if record.get("type") in ("train", "val") and record.get("step", 0) >= step:
+                            break
+                    except json.JSONDecodeError:
+                        continue
+
+            with tempfile.NamedTemporaryFile(mode="w", suffix=".jsonl", delete=False) as tmp:
+                tmp.writelines(truncated_lines)
+                tmp_path = tmp.name
+
+            try:
+                api.upload_file(
+                    path_or_fileobj=tmp_path,
+                    path_in_repo="metrics.jsonl",
+                    repo_id=repo_id,
+                    repo_type="model",
+                    revision=branch,
+                    commit_message=f"Metrics through step {step}",
+                )
+            finally:
+                os.unlink(tmp_path)
pawn/config.py
CHANGED
@@ -3,7 +3,7 @@
 from dataclasses import dataclass
 
 
-# Outcome token IDs
+# Outcome token IDs — must match engine/src/vocab.rs
 PAD_TOKEN = 0
 OUTCOME_TOKEN_BASE = 4273
 WHITE_CHECKMATES = 4273
pawn/data.py
CHANGED
@@ -23,7 +23,6 @@ from pawn.config import (
 _positions_cache: dict[tuple[str, int], torch.Tensor] = {}
 
 
-
 def _map_termination_to_outcome(
     term_codes: np.ndarray, game_lengths: np.ndarray
 ) -> torch.Tensor:
@@ -51,46 +50,45 @@ def _map_termination_to_outcome(
     return outcomes
 
 
-def _to_clm_batch(
+def pack_clm_sequences(
     move_ids: np.ndarray,
     game_lengths: np.ndarray,
-    term_codes: np.ndarray,
+    outcome_tokens: torch.Tensor,
     seq_len: int,
 ) -> dict[str, torch.Tensor]:
-    """Convert Rust engine output to CLM training tensors.
+    """Pack move arrays into CLM training tensors.
 
     Constructs input_ids = [outcome, move_1, ..., move_N, PAD, ...]
     and targets shifted left by 1.
 
     Args:
-        move_ids: (B, engine_max_ply) raw move token IDs
+        move_ids: (B, max_ply) raw move token IDs
         game_lengths: (B,) actual game lengths
-        term_codes: (B,) termination codes
+        outcome_tokens: (B,) pre-computed outcome token IDs (4273-4277)
         seq_len: total CLM sequence length (256)
     """
     B = len(game_lengths)
     n_move_slots = seq_len - 1  # 255 slots for moves (position 0 = outcome)
-    engine_max_ply = move_ids.shape[1]
+    max_ply = move_ids.shape[1]
 
     game_lengths_t = torch.from_numpy(game_lengths).long()
-    move_ids_t = torch.from_numpy(move_ids).long()  # (B, engine_max_ply)
-    outcome_tokens = _map_termination_to_outcome(term_codes, game_lengths)
+    move_ids_t = torch.from_numpy(move_ids).long()  # (B, max_ply)
 
     # Build input_ids: [outcome, move_0, ..., move_{N-1}, PAD, ...]
     input_ids = torch.zeros(B, seq_len, dtype=torch.long)
     input_ids[:, 0] = outcome_tokens
 
     # Mask out any non-move tokens from engine output
-    cache_key = ("engine", engine_max_ply)
+    cache_key = ("engine", max_ply)
     engine_positions = _positions_cache.get(cache_key)
     if engine_positions is None:
-        engine_positions = torch.arange(engine_max_ply).unsqueeze(0)
+        engine_positions = torch.arange(max_ply).unsqueeze(0)
         _positions_cache[cache_key] = engine_positions
     move_mask = engine_positions < game_lengths_t.unsqueeze(1)
     clean_moves = move_ids_t * move_mask
 
     # Place moves at positions 1..n_move_slots
-    n_to_copy = min(engine_max_ply, n_move_slots)
+    n_to_copy = min(max_ply, n_move_slots)
     input_ids[:, 1 : n_to_copy + 1] = clean_moves[:, :n_to_copy]
 
     # Cap game_lengths to n_move_slots (handles edge case where engine
@@ -116,6 +114,21 @@ def _to_clm_batch(
     }
 
 
+def _to_clm_batch(
+    move_ids: np.ndarray,
+    game_lengths: np.ndarray,
+    term_codes: np.ndarray,
+    seq_len: int,
+) -> dict[str, torch.Tensor]:
+    """Convert Rust engine output to CLM training tensors.
+
+    Convenience wrapper: computes outcome tokens from termination codes,
+    then delegates to pack_clm_sequences.
+    """
+    outcome_tokens = _map_termination_to_outcome(term_codes, game_lengths)
+    return pack_clm_sequences(move_ids, game_lengths, outcome_tokens, seq_len)
+
+
 class CLMDataset(torch.utils.data.IterableDataset):
     """Generates CLM training data on-the-fly via the Rust engine.
@@ -156,16 +169,18 @@ class CLMDataset(torch.utils.data.IterableDataset):
             t.start()
 
         step = self._start_step
-        # Engine generates games of up to max_ply-1 moves, leaving 1 slot
-        # for the outcome token in the seq_len=max_ply CLM sequence.
-        engine_max_ply = self.max_ply - 1
         while True:
             seed = self.base_seed + step * num_workers + worker_id
-            move_ids, game_lengths, term_codes = engine.generate_random_games(
-                self.batch_size, engine_max_ply, seed,
-                discard_ply_limit=self.discard_ply_limit,
-            )
-            yield _to_clm_batch(move_ids, game_lengths, term_codes, self.max_ply)
+            input_ids, targets, loss_mask, _move_ids, _gl, _tc = \
+                engine.generate_clm_batch(
+                    self.batch_size, self.max_ply, seed,
+                    discard_ply_limit=self.discard_ply_limit,
+                )
+            yield {
+                "input_ids": torch.from_numpy(input_ids).long(),
+                "targets": torch.from_numpy(targets).long(),
+                "loss_mask": torch.from_numpy(loss_mask),
+            }
             step += 1
@@ -178,17 +193,21 @@ def create_validation_set(
     Also computes legal move masks for legal move rate evaluation.
 
     Args:
         max_ply: total CLM sequence length (256).
     """
-    move_ids, game_lengths, term_codes = engine.generate_random_games(
-        n_games, max_ply - 1, seed,
-        discard_ply_limit=discard_ply_limit,
-    )
-    batch = _to_clm_batch(move_ids, game_lengths, term_codes, max_ply)
+    input_ids, targets, loss_mask, move_ids, game_lengths, _tc = \
+        engine.generate_clm_batch(
+            n_games, max_ply, seed, discard_ply_limit=discard_ply_limit,
+        )
+    batch = {
+        "input_ids": torch.from_numpy(input_ids).long(),
+        "targets": torch.from_numpy(targets).long(),
+        "loss_mask": torch.from_numpy(loss_mask),
+    }
 
     # Compute legal move masks for evaluating legal move rate
     legal_grid, _legal_promo = engine.compute_legal_move_masks(move_ids, game_lengths)
     batch["legal_grid"] = torch.from_numpy(legal_grid).long()
     batch["game_lengths"] = torch.from_numpy(game_lengths).long()
 
     return batch
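`pack_clm_sequences` lays each game out as `[outcome, move_1, ..., move_N, PAD, ...]` with targets shifted left by one and a loss mask covering only positions that predict real moves. A torch-free sketch of that layout for a single game (token values invented for illustration; the real code is vectorized over a batch):

```python
PAD = 0

def pack_one(moves: list[int], outcome: int, seq_len: int):
    # input_ids: outcome token first, then moves, PAD-filled to seq_len.
    input_ids = ([outcome] + moves + [PAD] * seq_len)[:seq_len]
    # targets: input_ids shifted left by 1 (position i predicts token i+1).
    targets = input_ids[1:] + [PAD]
    # loss_mask: positions 0..len(moves)-1 each have a real move as target.
    loss_mask = [i < len(moves) for i in range(seq_len)]
    return input_ids, targets, loss_mask

ids, tgt, mask = pack_one([11, 22, 33], outcome=4273, seq_len=8)
assert ids == [4273, 11, 22, 33, 0, 0, 0, 0]
assert tgt == [11, 22, 33, 0, 0, 0, 0, 0]
assert mask == [True, True, True, False, False, False, False, False]
```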
pawn/eval_suite/diagnostics.py
CHANGED
@@ -8,7 +8,7 @@ import torch.nn.functional as F
 import chess_engine as engine
 
 from pawn.config import PAD_TOKEN
-from pawn.data import _to_clm_batch
+from pawn.data import pack_clm_sequences, _map_termination_to_outcome
 
 
 # ---------------------------------------------------------------------------
pawn/eval_suite/lichess.py
CHANGED
@@ -9,8 +9,8 @@ import torch.nn as nn
 
 import chess_engine as engine
 
-from pawn.config import PAD_TOKEN, WHITE_CHECKMATES
-from pawn.data import _to_clm_batch
+from pawn.config import PAD_TOKEN, WHITE_CHECKMATES, PLY_LIMIT
+from pawn.data import pack_clm_sequences, _map_termination_to_outcome
 
 
 # ---------------------------------------------------------------------------
@@ -139,12 +139,11 @@ def evaluate_on_lichess(
     game_lengths = band_data["game_lengths"]
     n = len(move_ids)
 
-    # We don't have true termination codes for Lichess games parsed via PGN
-    # For loss/accuracy evaluation, the outcome token choice doesn't matter
-    # much since we evaluate on move prediction, not outcome prediction.
-    dummy_term_codes = np.zeros(n, dtype=np.uint8)
+    # We don't have true termination codes for Lichess games parsed via PGN.
+    # Use PLY_LIMIT as a dummy outcome token for all games — for
+    # loss/accuracy evaluation the outcome token choice doesn't matter
+    # much since we evaluate on move prediction, not outcome prediction.
+    dummy_outcomes = torch.full((n,), PLY_LIMIT, dtype=torch.long)
 
     engine_max_ply = max_seq_len - 1
     # Pad/truncate move_ids to engine_max_ply
@@ -154,7 +153,7 @@ def evaluate_on_lichess(
         padded[i, :gl] = move_ids[i, :gl]
     game_lengths_capped = np.minimum(game_lengths, engine_max_ply).astype(np.int16)
 
-    batch = _to_clm_batch(padded, game_lengths_capped, dummy_term_codes, max_seq_len)
+    batch = pack_clm_sequences(padded, game_lengths_capped, dummy_outcomes, max_seq_len)
     input_ids = batch["input_ids"]
     targets = batch["targets"]
     loss_mask = batch["loss_mask"]
pawn/eval_suite/probes.py
CHANGED
@@ -15,7 +15,6 @@ import chess_engine as engine
 
 from pawn.config import CLMConfig
 from pawn.model import PAWNCLM
-from pawn.data import _to_clm_batch
 
 
 # ---------------------------------------------------------------------------
@@ -77,21 +76,16 @@ def extract_probe_data(
 
     Returns dict with all arrays needed for all probes.
     """
-    engine_max_ply = max_ply - 1
-
-    move_ids_np, game_lengths_np, term_codes_np = engine.generate_random_games(
-        n_games, engine_max_ply, seed
-    )
+    input_ids, targets, loss_mask, move_ids_np, game_lengths_np, _tc = \
+        engine.generate_clm_batch(n_games, max_ply, seed)
 
     boards_np, side_np, castling_np, ep_np, check_np, halfmove_np = (
         engine.extract_board_states(move_ids_np, game_lengths_np)
     )
 
-    batch = _to_clm_batch(move_ids_np, game_lengths_np, term_codes_np, max_ply)
-
     result = {
-        "input_ids": batch["input_ids"],
-        "loss_mask": batch["loss_mask"],
+        "input_ids": torch.from_numpy(input_ids).long(),
+        "loss_mask": torch.from_numpy(loss_mask),
         "boards": torch.from_numpy(boards_np.copy()).long(),
         "side_to_move": torch.from_numpy(side_np.copy()).float(),
         "castling_rights": torch.from_numpy(castling_np.copy()),
pawn/eval_suite/worker.py
CHANGED
@@ -62,15 +62,15 @@ def run_in_worker(fn: Callable[..., Any], *args: Any, timeout: float | None = None
 
 def _load_model(checkpoint_path: str, device: str) -> PAWNCLM:
     """Load and freeze a PAWNCLM checkpoint. Runs inside worker processes."""
-    import torch
+    from pawn.checkpoint import load_backbone_weights
     from pawn.config import CLMConfig
     from pawn.model import PAWNCLM
 
-    ckpt = torch.load(checkpoint_path, map_location=device)
-    cfg = CLMConfig(**ckpt["model_config"])
+    state_dict, model_config = load_backbone_weights(checkpoint_path, device)
+    cfg = CLMConfig(**model_config) if model_config else CLMConfig()
     model = PAWNCLM(cfg).to(device)
-    model.load_state_dict(ckpt["model_state_dict"])
-    del ckpt
+    model.load_state_dict(state_dict)
+    del state_dict
     gc.collect()
     model.eval()
     for p in model.parameters():
pawn/lichess_data.py
CHANGED

@@ -219,32 +219,15 @@ def prepare_lichess_dataset(

     seq_len = max_ply + 1  # outcome token + max_ply move slots

-    …
-    input_ids[:, 0] = outcome_tokens
-
-    gl_t = torch.from_numpy(game_lengths).long()
-    mid_t = torch.from_numpy(move_ids).long()
-
-    ply_range = torch.arange(max_ply).unsqueeze(0)
-    move_mask = ply_range < gl_t.unsqueeze(1)
-    input_ids[:, 1:] = mid_t * move_mask
-
-    # Targets: shifted left by 1
-    targets = torch.zeros(N, seq_len, dtype=torch.long)
-    targets[:, :-1] = input_ids[:, 1:]
-
-    # Loss mask: positions 0..game_length-1 (each has a valid move target)
-    # Position gl would target PAD, which we don't want to train on.
-    seq_positions = torch.arange(seq_len).unsqueeze(0)
-    loss_mask = seq_positions < gl_t.unsqueeze(1)
+    from pawn.data import pack_clm_sequences
+    batch = pack_clm_sequences(move_ids, game_lengths, outcome_tokens, seq_len)

     return {
         "move_ids": move_ids,
         "game_lengths": game_lengths,
-        "input_ids": input_ids,
-        "targets": targets,
-        "loss_mask": loss_mask,
+        "input_ids": batch["input_ids"],
+        "targets": batch["targets"],
+        "loss_mask": batch["loss_mask"],
         "outcome_tokens": outcome_tokens,
         "n_games": N,
     }
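The packing semantics that move into `pawn.data.pack_clm_sequences` are fully visible in the removed lines of `prepare_lichess_dataset`: the outcome token occupies position 0, moves past `game_length` are zeroed to PAD, targets are the input shifted left by one, and loss is applied only to positions that have a valid move target. A pure-Python sketch of those semantics (the real helper is vectorized with torch; token values here are illustrative):

```python
PAD = 0  # PAD token id

def pack_clm_sequences(move_ids, game_lengths, outcome_tokens, seq_len):
    """Sketch of the packing factored into pawn.data: outcome token first,
    moves zeroed past game length, targets shifted left by one, loss mask
    on positions 0..game_length-1 (position gl would target PAD)."""
    out = {"input_ids": [], "targets": [], "loss_mask": []}
    for moves, gl, outcome in zip(move_ids, game_lengths, outcome_tokens):
        seq = ([outcome] + [m if i < gl else PAD for i, m in enumerate(moves)])[:seq_len]
        out["input_ids"].append(seq)
        out["targets"].append(seq[1:] + [PAD])                  # shifted left by 1
        out["loss_mask"].append([p < gl for p in range(seq_len)])
    return out

batch = pack_clm_sequences(
    move_ids=[[10, 11, 12, 13]],  # move token ids, values illustrative
    game_lengths=[2],
    outcome_tokens=[4273],        # one of the outcome tokens; id illustrative
    seq_len=5,                    # max_ply + 1
)
print(batch["input_ids"][0])  # [4273, 10, 11, 0, 0]
```

With `game_length=2`, only positions 0 and 1 carry loss: position 0 predicts the first move from the outcome token, position 1 the second move, and position 2 would target PAD, which is masked out.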
pawn/model.py
CHANGED

@@ -37,7 +37,7 @@ def _build_decomposition_table() -> torch.Tensor:

     for token_idx, uci_str in vocab["token_to_move"].items():
         if token_idx >= OUTCOME_TOKEN_BASE:
-            continue  # …
+            continue  # Outcome tokens use standalone embeddings
         src_name = uci_str[:2]
         dst_name = uci_str[2:4]
         promo_suffix = uci_str[4:] if len(uci_str) > 4 else ""
pawn/trainer.py
CHANGED

@@ -207,17 +207,27 @@ def _make_run_dir(base_log_dir: str) -> str:


 class CLMTrainer:
-    def __init__(…
+    def __init__(
+        self,
+        train_cfg: TrainingConfig,
+        model_cfg: CLMConfig,
+        hf_repo: str | None = None,
+    ):
         self.cfg = train_cfg
         self.model_cfg = model_cfg
         self.device = train_cfg.device
         self.global_step = 0
+        self.hf_repo = hf_repo
+        self.hf_branch: str | None = None

         self.run_dir = _make_run_dir(train_cfg.log_dir)
         self.cfg.checkpoint_dir = os.path.join(self.run_dir, "checkpoints")
         self._jsonl_path = os.path.join(self.run_dir, "metrics.jsonl")
         self._jsonl_file = None

+        if self.hf_repo:
+            self.hf_branch = f"run/{os.path.basename(self.run_dir)}"
+
         self._model = PAWNCLM(model_cfg).to(self.device)
         self.model = self._model
         param_count = sum(p.numel() for p in self._model.parameters())

@@ -449,12 +459,13 @@ class CLMTrainer:

             prefetch_factor=1 if num_workers > 0 else None,
         )

+        _shutdown_requested = False
+        _shutdown_signal = None
+
         def _graceful_exit(signum, frame):
-            …
-            self._jsonl_file.close()
-            sys.exit(128 + signum)
+            nonlocal _shutdown_requested, _shutdown_signal
+            _shutdown_requested = True
+            _shutdown_signal = signum

         old_term = signal.signal(signal.SIGTERM, _graceful_exit)
         old_int = signal.signal(signal.SIGINT, _graceful_exit)

@@ -547,6 +558,12 @@ class CLMTrainer:

                 self.save_checkpoint()
                 break

+            if _shutdown_requested:
+                print(f"\nShutdown requested (signal {_shutdown_signal}), "
+                      f"saving checkpoint at step {self.global_step}...")
+                self.save_checkpoint()
+                break
+
             step_start = time.time()

         signal.signal(signal.SIGTERM, old_term)

@@ -557,49 +574,47 @@ class CLMTrainer:

         self._jsonl_file = None

     def save_checkpoint(self, path: str | None = None):
+        from pawn.checkpoint import save_pretrain_checkpoint
+
         if path is None:
             path = os.path.join(
-                self.cfg.checkpoint_dir, f"step_{self.global_step:08d}…
+                self.cfg.checkpoint_dir, f"step_{self.global_step:08d}"
             )

-        dirname = os.path.dirname(path)
-        if dirname:
-            os.makedirs(dirname, exist_ok=True)
-
         model: PAWNCLM = self._eager_model()

-        …
-            {
-                "global_step": self.global_step,
-                "model_state_dict": model.state_dict(),
-                "optimizer_state_dict": self.optimizer.state_dict(),
-                "scheduler_state_dict": self.scheduler.state_dict(),
-                "scaler_state_dict": self.scaler.state_dict(),
-                "model_config": self.model_cfg.__dict__,
-                "training_config": self.cfg.__dict__,
-                "torch_rng_state": torch.get_rng_state(),
-                "cuda_rng_state": (
-                    torch.cuda.get_rng_state() if torch.cuda.is_available() else None
-                ),
-            },
+        save_pretrain_checkpoint(
             path,
+            model,
+            self.optimizer,
+            self.scheduler,
+            self.scaler,
+            self.global_step,
+            self.model_cfg.__dict__,
+            self.cfg.__dict__,
         )
         print(f"Checkpoint saved: {path}")

+        if self.hf_repo and self.hf_branch:
+            from pawn.checkpoint import push_checkpoint_to_hf
+            try:
+                push_checkpoint_to_hf(
+                    path, self.hf_repo, self.hf_branch,
+                    metrics_path=self._jsonl_path,
+                    step=self.global_step,
+                )
+                print(f"Pushed to HF: {self.hf_repo}@{self.hf_branch}")
+            except Exception as e:
+                print(f"WARNING: HF push failed: {e}")
+
     def load_checkpoint(self, path: str):
-        …
-        self.global_step = ckpt["global_step"]
+        from pawn.checkpoint import load_pretrain_checkpoint

         model: PAWNCLM = self._eager_model()

-        …
-        if ckpt.get("torch_rng_state") is not None:
-            torch.set_rng_state(ckpt["torch_rng_state"].cpu().byte())
-        if ckpt.get("cuda_rng_state") is not None and torch.cuda.is_available():
-            torch.cuda.set_rng_state(ckpt["cuda_rng_state"].cpu().byte())
-
+        meta = load_pretrain_checkpoint(
+            path, model, self.optimizer, self.scheduler, self.scaler,
+            device=self.device,
+        )
+        self.global_step = meta["global_step"]
         print(f"Resumed from step {self.global_step}")
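The signal-handling change is the heart of the graceful-shutdown feature: the old handler closed the metrics file and called `sys.exit` immediately, potentially mid-step; the new one only records the request, and the training loop saves a checkpoint at the next step boundary. A runnable sketch of that deferred-flag pattern (the `Trainer` class, field names, and the simulated `kill` are illustrative, not the real CLMTrainer):

```python
import signal

class Trainer:
    """Minimal stand-in showing the deferred-shutdown pattern only."""

    def __init__(self):
        self.global_step = 0
        self.saved_at_step = None  # stands in for save_checkpoint()

    def train(self, n_steps, interrupt_at=3):
        shutdown = {"requested": False, "signum": None}

        def _graceful_exit(signum, frame):
            # Record the request only; no I/O or sys.exit inside the handler.
            shutdown["requested"] = True
            shutdown["signum"] = signum

        old_term = signal.signal(signal.SIGTERM, _graceful_exit)
        old_int = signal.signal(signal.SIGINT, _graceful_exit)
        try:
            for step in range(1, n_steps + 1):
                self.global_step = step  # one "optimizer step"
                if step == interrupt_at:
                    signal.raise_signal(signal.SIGTERM)  # simulate `kill <pid>`
                if shutdown["requested"]:
                    # Safe point between steps: checkpoint, then leave the loop.
                    self.saved_at_step = self.global_step
                    break
        finally:
            signal.signal(signal.SIGTERM, old_term)
            signal.signal(signal.SIGINT, old_int)
        return shutdown["signum"]
```

Because the handler runs between bytecodes of the main thread, the `if shutdown["requested"]` check right after the step observes the flag, so the "checkpoint" lands exactly at the interrupted step rather than mid-update.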
pyproject.toml
CHANGED

@@ -8,6 +8,7 @@ dependencies = [

     "chess-engine",
     "numpy~=2.2.0",
     "psutil>=5.9.0",
+    "safetensors>=0.4.0",
     "tqdm~=4.67.0",
     "wandb~=0.25.0",
 ]
scripts/check_progress.sh
CHANGED

@@ -1,33 +1,34 @@

 #!/usr/bin/env bash
-# Check training progress
-# Usage: check_progress.sh [--sync] […
+# Check training progress from HuggingFace submodules and local logs.
+# Usage: check_progress.sh [--sync] [LOG_DIR]
 set -euo pipefail

 SYNC=false
-AUTO_STOP=false
 LOG_DIR=""

 for arg in "$@"; do
   case "$arg" in
     --sync) SYNC=true ;;
-    --auto-stop) AUTO_STOP=true ;;
     *) LOG_DIR="$arg" ;;
   esac
 done
 LOG_DIR="${LOG_DIR:-logs}"

 REPO="$(cd "$(dirname "$0")/.." && pwd)"
-POD_DIR="$HOME/.config/pawn/pods"

-# Sync from …
-if $SYNC …
+# Sync submodules from HuggingFace
+if $SYNC; then
   bash "$REPO/deploy/sync.sh" 2>/dev/null || true
 fi

-# Show progress
-…
+# Show progress from all metrics.jsonl files (local logs + submodules)
+N=5
+{
+  find "$LOG_DIR" -name metrics.jsonl -printf '%T@ %p\n' 2>/dev/null
+  find "$REPO/checkpoints" -name metrics.jsonl -printf '%T@ %p\n' 2>/dev/null
+} | sort -rn | head -n "$N" | while read -r _ path; do
   run_name="$(basename "$(dirname "$path")")"
+
   python3 -c "
 import json, sys
 records = [json.loads(l) for l in open('$path')]

@@ -51,29 +52,8 @@ print(f'{\"$run_name\":<28} {variant} discard_ply={str(discard):<5} step {ste

 done

 # Check local training process
-if pgrep -f 'train.py…
-  echo "Local…
+if pgrep -f 'train.py' > /dev/null 2>&1; then
+  echo "Local training: RUNNING"
 else
-  echo "…
-fi
-
-# Auto-stop finished pods
-if $AUTO_STOP && [ -d "$POD_DIR" ]; then
-  for env_file in "$POD_DIR"/*.env; do
-    [ -f "$env_file" ] || continue
-    pod_name="$(basename "${env_file%.env}")"
-    unset POD_ID POD_HOST POD_PORT POD_GPU 2>/dev/null || true
-    source "$env_file"
-
-    # Check if process is alive on pod
-    alive=$(ssh -o ConnectTimeout=5 -p "$POD_PORT" "root@$POD_HOST" \
-      "pgrep -f 'train.py' > /dev/null 2>&1 && echo yes || echo no" 2>/dev/null || echo "unreachable")
-
-    if [ "$alive" = "no" ]; then
-      echo ">>> $pod_name: training finished. Final sync + stopping..."
-      bash "$REPO/deploy/sync.sh" "$pod_name" 2>/dev/null || true
-      runpodctl pod stop "$POD_ID" 2>/dev/null
-      echo ">>> $pod_name ($POD_ID) STOPPED"
-    fi
-  done
+  echo "Local training: NOT RUNNING"
 fi
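The new listing orders metrics files newest-first with `find -printf '%T@ %p\n'` (epoch mtime plus path) piped through `sort -rn`. A throwaway reproduction of that pipeline, assuming GNU findutils (the `-printf` action is a GNU extension); directory names here are illustrative:

```shell
#!/usr/bin/env bash
set -euo pipefail
demo="$(mktemp -d)"
mkdir -p "$demo/run_a" "$demo/run_b"
echo '{}' > "$demo/run_a/metrics.jsonl"
sleep 1   # ensure distinct mtimes even on 1s-resolution filesystems
echo '{}' > "$demo/run_b/metrics.jsonl"
# %T@ = epoch mtime; sort -rn puts the newest metrics.jsonl first
find "$demo" -name metrics.jsonl -printf '%T@ %p\n' | sort -rn | head -n 1
```

The first line printed is the `run_b` file, since it was written last; `head -n "$N"` in the real script keeps the N most recent runs.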
scripts/eval_accuracy.py
CHANGED

@@ -65,37 +65,47 @@ def parse_args():

     return p.parse_args()


-def _detect_adapter_type(…
-    """Auto-detect adapter type from …
-    …
+def _detect_adapter_type(config: dict) -> str:
+    """Auto-detect adapter type from config dict.
+
+    The config dict comes from load_adapter_checkpoint()["config"], which
+    contains training args for both legacy .pt files and new-format directories.
+    """
+    if "checkpoint_type" in config:
+        return config["checkpoint_type"]
+    if "bottleneck_dim" in config:
+        return "bottleneck"
+    if "lora_rank" in config and config.get("use_film") is not None:
         return "hybrid"
-    if "…
+    if "lora_rank" in config:
         return "lora"
-    if "…
-        return "film"
-    if "sparse_state_dict" in ckpt:
+    if "density" in config:
         return "sparse"
-    if "…
-        return "…
+    if "no_output_film" in config:
+        return "film"
+
+    raise ValueError("Cannot detect adapter type from config keys: "
+                     + ", ".join(config.keys()))


 def load_model(checkpoint_path: str, adapter_path: str, device: str):
     """Load backbone + adapter, auto-detecting adapter type."""
+    from pawn.checkpoint import load_backbone_weights, load_adapter_checkpoint
+
     # Backbone
-    …
-    cfg = CLMConfig(**…
+    state_dict, model_config = load_backbone_weights(checkpoint_path, device)
+    cfg = CLMConfig(**model_config) if model_config else CLMConfig()
     backbone = PAWNCLM(cfg).to(device)
-    backbone.load_state_dict(…
-    del …
+    backbone.load_state_dict(state_dict)
+    del state_dict
     gc.collect()
     backbone.eval()

     # Adapter
-    …
-    adapter_config = …
+    adapter_data = load_adapter_checkpoint(adapter_path, device)
+    adapter_weights = adapter_data["adapter_state_dict"]
+    adapter_config = adapter_data.get("config", {})
+    adapter_type = _detect_adapter_type(adapter_config)

     if adapter_type == "lora":
         from pawn.adapters.lora import LoRACLM

@@ -108,15 +118,15 @@ def load_model(checkpoint_path: str, adapter_path: str, device: str):

             layers=tuple(int(x) for x in adapter_config["lora_layers"].split(","))
                 if adapter_config.get("lora_layers") else None,
         ).to(device)
-        model.load_lora_state_dict(…
+        model.load_lora_state_dict(adapter_weights)

     elif adapter_type == "film":
         from pawn.adapters.film import FiLMCLM
         has_output = not adapter_config.get("no_output_film", False)
-        if any(k.startswith("output_film.") for k in …
+        if any(k.startswith("output_film.") for k in adapter_weights):
             has_output = True
         model = FiLMCLM(backbone, use_output_film=has_output).to(device)
-        model.load_film_state_dict(…
+        model.load_film_state_dict(adapter_weights)

     elif adapter_type == "hybrid":
         from pawn.adapters.hybrid import HybridCLM

@@ -133,7 +143,7 @@ def load_model(checkpoint_path: str, adapter_path: str, device: str):

             use_output_film=adapter_config.get("output_film", False),
             film_layers=tuple(int(x) for x in film_layers.split(",")) if film_layers else None,
         ).to(device)
-        model.load_adapter_state_dict(…
+        model.load_adapter_state_dict(adapter_weights)

     elif adapter_type == "sparse":
         from pawn.adapters.sparse import SparseCLM

@@ -148,7 +158,7 @@ def load_model(checkpoint_path: str, adapter_path: str, device: str):

             layers=tuple(int(x) for x in sparse_layers.split(",")) if sparse_layers else None,
             seed=adapter_config.get("sparse_seed", 42),
         ).to(device)
-        model.load_sparse_state_dict(…
+        model.load_sparse_state_dict(adapter_weights)

     elif adapter_type == "bottleneck":
         from pawn.adapters.bottleneck import BottleneckCLM

@@ -160,7 +170,7 @@ def load_model(checkpoint_path: str, adapter_path: str, device: str):

             adapt_ffn=adapter_config.get("adapt_ffn", True),
             layers=tuple(int(x) for x in adapter_layers_str.split(",")) if adapter_layers_str else None,
         ).to(device)
-        model.load_adapter_state_dict(…
+        model.load_adapter_state_dict(adapter_weights)

     model.eval()
     return model, adapter_type
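The new `_detect_adapter_type` is a key-based cascade over the adapter's saved training args, and the ordering is load-bearing: a hybrid config also carries `lora_rank`, so the more specific keys are tested first, and an explicit `checkpoint_type` (written by new-format checkpoints) short-circuits the heuristics entirely. The cascade, reproduced standalone with example configs (config contents illustrative):

```python
def detect_adapter_type(config: dict) -> str:
    """Detection cascade mirroring _detect_adapter_type in scripts/eval_accuracy.py.

    Most-specific keys first: new-format "checkpoint_type" wins outright,
    then bottleneck/hybrid before plain LoRA, then sparse and FiLM markers.
    """
    if "checkpoint_type" in config:
        return config["checkpoint_type"]
    if "bottleneck_dim" in config:
        return "bottleneck"
    if "lora_rank" in config and config.get("use_film") is not None:
        return "hybrid"
    if "lora_rank" in config:
        return "lora"
    if "density" in config:
        return "sparse"
    if "no_output_film" in config:
        return "film"
    raise ValueError("Cannot detect adapter type from config keys: " + ", ".join(config))

print(detect_adapter_type({"lora_rank": 8}))                    # lora
print(detect_adapter_type({"lora_rank": 8, "use_film": True}))  # hybrid
```

Swapping the `lora_rank` checks would misclassify every hybrid checkpoint as plain LoRA, which is why the `use_film` sentinel is tested before the bare-key branch.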
scripts/eval_probes.py
CHANGED

@@ -11,20 +11,17 @@ import torch

 from pawn.config import CLMConfig
 from pawn.model import PAWNCLM
 from pawn.eval_suite.probes import extract_probe_data, train_all_probes
-from pawn.gpu import configure_gpu


 def load_model_from_checkpoint(checkpoint_path: str, device: str) -> PAWNCLM:
-    …
-    if …
-        cfg = CLMConfig(**…
+    from pawn.checkpoint import load_backbone_weights
+    state_dict, model_config = load_backbone_weights(checkpoint_path, device)
+    if model_config:
+        cfg = CLMConfig(**model_config)
     else:
-        state = …
-        d_model = …
-        n_layers = max(int(k.split(".")[1]) for k in …
-        n_heads = …
-        # Infer config from known variants
+        # Fallback: infer from state dict shapes
+        d_model = state_dict["embed.src_embed.weight"].shape[1]
+        n_layers = max(int(k.split(".")[1]) for k in state_dict if k.startswith("layers.")) + 1
         if d_model == 256 and n_layers == 8:
             cfg = CLMConfig.small()
         elif d_model == 512 and n_layers == 8:

@@ -33,9 +30,8 @@ def load_model_from_checkpoint(checkpoint_path: str, device: str) -> PAWNCLM:

             cfg = CLMConfig.large()
         else:
             cfg = CLMConfig(d_model=d_model, n_layers=n_layers)
-
     model = PAWNCLM(cfg).to(device)
-    model.load_state_dict(…
+    model.load_state_dict(state_dict)
     model.eval()
     return model

@@ -52,6 +48,7 @@ def main():

     device = args.device or ("cuda" if torch.cuda.is_available() else "cpu")
     if device == "cuda":
+        from pawn.gpu import configure_gpu
         gpu_cfg = configure_gpu()
         import pawn.model as model_module
         model_module.SDPA_BACKEND = gpu_cfg.get("sdpa_backend")
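The fallback branch reads the architecture straight out of the weight names and shapes: `d_model` is the second dimension of the source-square embedding, and `n_layers` is one more than the highest `layers.N.` index. The arithmetic can be exercised on a shape-only stand-in for a state dict (key names follow the diff; the shapes and the `64` first dimension are illustrative, so no torch is needed):

```python
# Shape-only mock of a checkpoint state dict: name -> tensor shape.
state_shapes = {
    "embed.src_embed.weight": (64, 256),  # second dim is d_model
    **{f"layers.{i}.attn.wq.weight": (256, 256) for i in range(8)},
}

# Same inference as the fallback in load_model_from_checkpoint:
d_model = state_shapes["embed.src_embed.weight"][1]
n_layers = max(int(k.split(".")[1]) for k in state_shapes if k.startswith("layers.")) + 1
print(d_model, n_layers)  # 256 8
```

With `d_model=256` and `n_layers=8` this would select `CLMConfig.small()`; anything off the known grid falls through to a bare `CLMConfig(d_model=..., n_layers=...)`.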
scripts/export_hf_repo.py
ADDED

@@ -0,0 +1,283 @@

#!/usr/bin/env python3
"""Export a training run to HuggingFace repo format.

Converts .pt checkpoints to safetensors and structures files for HF upload:
- Root: best checkpoint (model.safetensors, config.json, metrics.jsonl, README.md)
- checkpoints/step_NNNN/: other checkpoints with truncated metrics

Usage:
    python scripts/export_hf_repo.py \
        --run-dir logs/run_20260322_182707 \
        --output-dir export/pawn-base \
        --repo-name pawn-base \
        --github-url https://github.com/thomas-schweich/PAWN
"""

from __future__ import annotations

import argparse
import json
import shutil
from pathlib import Path

import torch
from safetensors.torch import save_file


def find_best_step(metrics_path: Path) -> int | None:
    """Find the step with lowest val loss from metrics.jsonl."""
    best_loss = float("inf")
    best_step = None
    with open(metrics_path) as f:
        for line in f:
            record = json.loads(line)
            if record.get("type") != "val":
                continue
            loss = record.get("val/loss", float("inf"))
            step = record.get("step")
            if loss < best_loss and step is not None:
                best_loss = loss
                best_step = step
    return best_step


def truncate_metrics(metrics_path: Path, up_to_step: int) -> list[str]:
    """Return metrics lines up to and including the given step."""
    lines = []
    with open(metrics_path) as f:
        for line in f:
            lines.append(line)
            record = json.loads(line)
            if record.get("type") in ("train", "val") and record.get("step", 0) > up_to_step:
                break
    return lines


def convert_pt_to_safetensors(pt_path: Path, output_dir: Path):
    """Convert a .pt checkpoint to safetensors + JSON directory format."""
    from pawn.checkpoint import _flatten_optimizer_state, _rng_to_json, _json_default

    output_dir.mkdir(parents=True, exist_ok=True)

    ckpt = torch.load(str(pt_path), map_location="cpu", weights_only=False)

    # Model weights -> safetensors
    state_dict = ckpt["model_state_dict"]
    tensors = {k: v.cpu().contiguous() for k, v in state_dict.items()}
    save_file(tensors, output_dir / "model.safetensors")

    # Config
    config = {
        "format_version": 1,
        "checkpoint_type": "pretrain",
        "model_config": ckpt.get("model_config", {}),
        "training_config": ckpt.get("training_config", {}),
    }
    with open(output_dir / "config.json", "w") as f:
        json.dump(config, f, indent=2, default=str)

    # Optimizer -> safetensors (if present); default opt_meta so the
    # training-state dump below works for weights-only checkpoints too.
    opt_meta = {}
    if "optimizer_state_dict" in ckpt:
        opt_tensors, opt_meta = _flatten_optimizer_state(ckpt["optimizer_state_dict"])
        if opt_tensors:
            save_file(opt_tensors, output_dir / "optimizer.safetensors")

    # Training state
    training_state = {
        "format_version": 1,
        "global_step": ckpt.get("global_step", 0),
        "scheduler_state_dict": ckpt.get("scheduler_state_dict"),
        "scaler_state_dict": ckpt.get("scaler_state_dict"),
        "optimizer_meta": opt_meta,
    }
    rng_state = {}
    if ckpt.get("torch_rng_state") is not None:
        rng_state.update(_rng_to_json(ckpt["torch_rng_state"], ckpt.get("cuda_rng_state")))
    training_state.update(rng_state)

    with open(output_dir / "training_state.json", "w") as f:
        json.dump(training_state, f, indent=2, default=_json_default)


def generate_readme(
    repo_name: str, model_config: dict, training_config: dict,
    best_step: int, val_loss: float, val_acc: float,
    github_url: str, extra_desc: str = "",
) -> str:
    """Generate a HuggingFace model card README."""
    d_model = model_config.get("d_model", "?")
    n_layers = model_config.get("n_layers", "?")
    n_heads = model_config.get("n_heads", "?")
    discard = training_config.get("discard_ply_limit", False)

    # Infer variant name
    variant = "base"
    if d_model == 256:
        variant = "small"
    elif d_model == 640:
        variant = "large"

    params = {"small": "9.5M", "base": "35.8M", "large": "68.4M"}.get(variant, "?")

    return f"""---
license: apache-2.0
library_name: pytorch
tags:
- chess
- transformer
- causal-lm
- world-model
datasets:
- random-self-play
model-index:
- name: {repo_name}
  results:
  - task:
      type: next-move-prediction
    metrics:
    - name: Val Loss
      type: loss
      value: {val_loss}
    - name: Val Accuracy
      type: accuracy
      value: {val_acc}
---

# {repo_name.upper()}

A causal transformer trained on random chess games, designed as a testbed for finetuning and augmentation methods at small scales.
{extra_desc}

## Model Details

| | |
|---|---|
| **Parameters** | {params} |
| **Architecture** | Decoder-only transformer (RMSNorm, SwiGLU, RoPE) |
| **d_model** | {d_model} |
| **Layers** | {n_layers} |
| **Heads** | {n_heads} |
| **Vocabulary** | 4,278 tokens (4,096 grid + 176 promotions + 5 outcomes + 1 PAD) |
| **Sequence length** | 256 |
| **Best val loss** | {val_loss:.4f} (step {best_step:,}) |
| **Best val accuracy** | {val_acc:.1%} |

## Usage

```python
import torch
from safetensors.torch import load_file
from pawn.config import CLMConfig
from pawn.model import PAWNCLM

cfg = CLMConfig.{variant}()
model = PAWNCLM(cfg)
model.load_state_dict(load_file("model.safetensors"))
model.eval()
```

## Training

Trained from scratch on random self-play games generated by a Rust chess engine (shakmaty).
See the [PAWN repository]({github_url}) for training code, data pipeline, and evaluation suite.

## License

Apache 2.0
"""


def main():
    parser = argparse.ArgumentParser(description="Export training run to HF repo format")
    parser.add_argument("--run-dir", required=True, help="Training run directory")
    parser.add_argument("--output-dir", required=True, help="Output directory for HF repo")
    parser.add_argument("--repo-name", required=True, help="Repository name for README")
    parser.add_argument("--github-url", default="https://github.com/thomas-schweich/PAWN")
    parser.add_argument("--best-only", action="store_true", help="Only export best checkpoint")
    parser.add_argument("--extra-desc", default="", help="Extra description for README")
    args = parser.parse_args()

    run_dir = Path(args.run_dir)
    output_dir = Path(args.output_dir)
    metrics_path = run_dir / "metrics.jsonl"

    if not metrics_path.exists():
        print(f"ERROR: {metrics_path} not found")
        return
+
|
| 207 |
+
# Find best step
|
| 208 |
+
best_step = find_best_step(metrics_path)
|
| 209 |
+
if best_step is None:
|
| 210 |
+
print("ERROR: No val records found in metrics.jsonl")
|
| 211 |
+
return
|
| 212 |
+
print(f"Best val step: {best_step}")
|
| 213 |
+
|
| 214 |
+
# Find best val metrics
|
| 215 |
+
best_val_loss, best_val_acc = float("inf"), 0.0
|
| 216 |
+
with open(metrics_path) as f:
|
| 217 |
+
for line in f:
|
| 218 |
+
r = json.loads(line)
|
| 219 |
+
if r.get("type") == "val" and r.get("step") == best_step:
|
| 220 |
+
best_val_loss = r.get("val/loss", float("inf"))
|
| 221 |
+
best_val_acc = r.get("val/accuracy", 0.0)
|
| 222 |
+
break
|
| 223 |
+
|
| 224 |
+
# Find all .pt checkpoints
|
| 225 |
+
ckpt_dir = run_dir / "checkpoints"
|
| 226 |
+
checkpoints = sorted(ckpt_dir.glob("step_*.pt")) if ckpt_dir.exists() else []
|
| 227 |
+
if not checkpoints:
|
| 228 |
+
print("ERROR: No checkpoints found")
|
| 229 |
+
return
|
| 230 |
+
|
| 231 |
+
# Find nearest checkpoint to best step
|
| 232 |
+
best_ckpt = min(checkpoints, key=lambda p: abs(
|
| 233 |
+
int(p.stem.replace("step_", "")) - best_step
|
| 234 |
+
))
|
| 235 |
+
print(f"Best checkpoint: {best_ckpt}")
|
| 236 |
+
|
| 237 |
+
# Read config from checkpoint
|
| 238 |
+
ckpt_data = torch.load(str(best_ckpt), map_location="cpu", weights_only=False)
|
| 239 |
+
model_config = ckpt_data.get("model_config", {})
|
| 240 |
+
training_config = ckpt_data.get("training_config", {})
|
| 241 |
+
del ckpt_data
|
| 242 |
+
|
| 243 |
+
# Export best checkpoint to root
|
| 244 |
+
output_dir.mkdir(parents=True, exist_ok=True)
|
| 245 |
+
print(f"\nExporting best checkpoint to {output_dir}/")
|
| 246 |
+
convert_pt_to_safetensors(best_ckpt, output_dir)
|
| 247 |
+
|
| 248 |
+
# Copy full metrics.jsonl
|
| 249 |
+
shutil.copy2(metrics_path, output_dir / "metrics.jsonl")
|
| 250 |
+
|
| 251 |
+
# Generate README
|
| 252 |
+
readme = generate_readme(
|
| 253 |
+
args.repo_name, model_config, training_config,
|
| 254 |
+
best_step, best_val_loss, best_val_acc,
|
| 255 |
+
args.github_url, args.extra_desc,
|
| 256 |
+
)
|
| 257 |
+
with open(output_dir / "README.md", "w") as f:
|
| 258 |
+
f.write(readme)
|
| 259 |
+
|
| 260 |
+
# Export other checkpoints
|
| 261 |
+
if not args.best_only:
|
| 262 |
+
for ckpt in checkpoints:
|
| 263 |
+
if ckpt == best_ckpt:
|
| 264 |
+
continue
|
| 265 |
+
step_name = ckpt.stem # e.g. "step_00005000"
|
| 266 |
+
step_num = int(step_name.replace("step_", ""))
|
| 267 |
+
step_dir = output_dir / "checkpoints" / step_name
|
| 268 |
+
print(f" Exporting {step_name}...")
|
| 269 |
+
convert_pt_to_safetensors(ckpt, step_dir)
|
| 270 |
+
|
| 271 |
+
# Truncated metrics
|
| 272 |
+
truncated = truncate_metrics(metrics_path, step_num)
|
| 273 |
+
with open(step_dir / "metrics.jsonl", "w") as f:
|
| 274 |
+
f.writelines(truncated)
|
| 275 |
+
|
| 276 |
+
print(f"\nExport complete: {output_dir}")
|
| 277 |
+
print("  Best: model.safetensors, config.json, metrics.jsonl, README.md")
|
| 278 |
+
if not args.best_only:
|
| 279 |
+
print(f" Checkpoints: {len(checkpoints) - 1} in checkpoints/")
|
| 280 |
+
|
| 281 |
+
|
| 282 |
+
if __name__ == "__main__":
|
| 283 |
+
main()
|
scripts/profile_step.py
CHANGED
|
@@ -18,17 +18,18 @@ import torch
|
|
| 18 |
# ── PAWN imports ─────────────────────────────────────────────────────────────
|
| 19 |
from pawn.config import CLMConfig
|
| 20 |
from pawn.model import PAWNCLM
|
| 21 |
-
from pawn.data import _to_clm_batch
|
| 22 |
import chess_engine as engine
|
| 23 |
|
| 24 |
|
| 25 |
def generate_clm_batch(batch_size: int, device: str):
|
| 26 |
"""Generate a CLM batch and move to device."""
|
| 27 |
-
|
| 28 |
-
batch_size,
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
|
|
|
|
|
|
|
| 32 |
|
| 33 |
|
| 34 |
|
|
|
|
| 18 |
# ── PAWN imports ─────────────────────────────────────────────────────────────
|
| 19 |
from pawn.config import CLMConfig
|
| 20 |
from pawn.model import PAWNCLM
|
|
|
|
| 21 |
import chess_engine as engine
|
| 22 |
|
| 23 |
|
| 24 |
def generate_clm_batch(batch_size: int, device: str):
|
| 25 |
"""Generate a CLM batch and move to device."""
|
| 26 |
+
input_ids, targets, loss_mask, _mid, _gl, _tc = \
|
| 27 |
+
engine.generate_clm_batch(batch_size, 256, seed=42)
|
| 28 |
+
return {
|
| 29 |
+
"input_ids": torch.from_numpy(input_ids).long().to(device),
|
| 30 |
+
"targets": torch.from_numpy(targets).long().to(device),
|
| 31 |
+
"loss_mask": torch.from_numpy(loss_mask).to(device),
|
| 32 |
+
}
|
| 33 |
|
| 34 |
|
| 35 |
|
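The profiling script above wraps batch generation for timing. When timing GPU work like this, CUDA kernels launch asynchronously, so the clock must only be read after a synchronize. A generic helper of the kind such a script relies on (`time_fn` is a hypothetical name, not part of the repo):

```python
import time

import torch


def time_fn(fn, warmup: int = 2, iters: int = 10) -> float:
    """Average seconds per call of fn(), synchronizing CUDA around the timed region."""
    for _ in range(warmup):
        fn()  # warm up caches, the allocator, and any lazy compilation
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for queued kernels before stopping the clock
    return (time.perf_counter() - start) / iters
```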
scripts/train.py
CHANGED
|
@@ -30,6 +30,12 @@ def parse_args():
|
|
| 30 |
parser.add_argument("--log-dir", type=str, default=None, help="Override log directory")
|
| 31 |
parser.add_argument("--discard-ply-limit", action="store_true",
|
| 32 |
help="Only train on games that ended naturally (no ply limit truncation)")
|
|
|
|
| 33 |
return parser.parse_args()
|
| 34 |
|
| 35 |
|
|
@@ -73,7 +79,7 @@ def main():
|
|
| 73 |
print(f"Model config: {model_cfg}")
|
| 74 |
print(f"Training config: {train_cfg}")
|
| 75 |
|
| 76 |
-
trainer = CLMTrainer(train_cfg, model_cfg)
|
| 77 |
|
| 78 |
if args.resume:
|
| 79 |
trainer.load_checkpoint(args.resume)
|
|
|
|
| 30 |
parser.add_argument("--log-dir", type=str, default=None, help="Override log directory")
|
| 31 |
parser.add_argument("--discard-ply-limit", action="store_true",
|
| 32 |
help="Only train on games that ended naturally (no ply limit truncation)")
|
| 33 |
+
|
| 34 |
+
ckpt_group = parser.add_mutually_exclusive_group(required=True)
|
| 35 |
+
ckpt_group.add_argument("--hf-repo", type=str, default=None,
|
| 36 |
+
help="Push checkpoints to this HuggingFace repo (requires HF_TOKEN)")
|
| 37 |
+
ckpt_group.add_argument("--local-checkpoints", action="store_true",
|
| 38 |
+
help="Save checkpoints locally only (no HuggingFace push)")
|
| 39 |
return parser.parse_args()
|
| 40 |
|
| 41 |
|
|
|
|
| 79 |
print(f"Model config: {model_cfg}")
|
| 80 |
print(f"Training config: {train_cfg}")
|
| 81 |
|
| 82 |
+
trainer = CLMTrainer(train_cfg, model_cfg, hf_repo=args.hf_repo)
|
| 83 |
|
| 84 |
if args.resume:
|
| 85 |
trainer.load_checkpoint(args.resume)
|
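The new `--hf-repo` / `--local-checkpoints` pair uses `add_mutually_exclusive_group(required=True)`, so argparse itself enforces that exactly one checkpoint destination is chosen. The pattern in isolation:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Standalone illustration of the required mutually exclusive group."""
    parser = argparse.ArgumentParser()
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--hf-repo", type=str, default=None,
                       help="Push checkpoints to this HuggingFace repo")
    group.add_argument("--local-checkpoints", action="store_true",
                       help="Save checkpoints locally only")
    return parser
```

Passing both flags, or neither, makes `parse_args` exit with an error before any training starts.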
scripts/train_all.py
ADDED
|
@@ -0,0 +1,400 @@
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""Train small, base, and large PAWN models simultaneously on shared data.
|
| 3 |
+
|
| 4 |
+
All three models see the exact same batches in the same order, eliminating
|
| 5 |
+
data generation overhead and ensuring comparable training conditions.
|
| 6 |
+
|
| 7 |
+
Usage:
|
| 8 |
+
uv run python scripts/train_all.py --local-checkpoints
|
| 9 |
+
uv run python scripts/train_all.py --hf-repo thomas-schweich/pawn  # "-{variant}" suffix appended per model
|
| 10 |
+
"""
|
| 11 |
+
|
| 12 |
+
from __future__ import annotations
|
| 13 |
+
|
| 14 |
+
import argparse
|
| 15 |
+
import json
|
| 16 |
+
import math
|
| 17 |
+
import os
|
| 18 |
+
import signal
|
| 19 |
+
import sys
|
| 20 |
+
import time
|
| 21 |
+
|
| 22 |
+
import torch
|
| 23 |
+
import torch.multiprocessing as mp
|
| 24 |
+
from torch.utils.data import DataLoader
|
| 25 |
+
|
| 26 |
+
from pawn.config import CLMConfig, TrainingConfig
|
| 27 |
+
from pawn.model import PAWNCLM, clm_loss
|
| 28 |
+
from pawn.data import CLMDataset, create_validation_set
|
| 29 |
+
from pawn.gpu import configure_gpu, apply_gpu_config
|
| 30 |
+
from pawn.checkpoint import save_pretrain_checkpoint, push_checkpoint_to_hf
|
| 31 |
+
|
| 32 |
+
|
| 33 |
+
# ---------------------------------------------------------------------------
|
| 34 |
+
# Per-model state
|
| 35 |
+
# ---------------------------------------------------------------------------
|
| 36 |
+
|
| 37 |
+
class ModelSlot:
|
| 38 |
+
"""Holds everything needed to train and checkpoint one model variant."""
|
| 39 |
+
|
| 40 |
+
def __init__(
|
| 41 |
+
self,
|
| 42 |
+
name: str,
|
| 43 |
+
model_cfg: CLMConfig,
|
| 44 |
+
train_cfg: TrainingConfig,
|
| 45 |
+
device: str,
|
| 46 |
+
hf_repo: str | None,
|
| 47 |
+
):
|
| 48 |
+
self.name = name
|
| 49 |
+
self.model_cfg = model_cfg
|
| 50 |
+
self.train_cfg = train_cfg
|
| 51 |
+
self.device = device
|
| 52 |
+
self.hf_repo = hf_repo
|
| 53 |
+
|
| 54 |
+
self.model = PAWNCLM(model_cfg).to(device)
|
| 55 |
+
param_count = sum(p.numel() for p in self.model.parameters())
|
| 56 |
+
print(f" {name}: {param_count:,} params ({model_cfg.d_model}d/{model_cfg.n_layers}L)")
|
| 57 |
+
|
| 58 |
+
self.optimizer = torch.optim.AdamW(
|
| 59 |
+
self.model.parameters(),
|
| 60 |
+
lr=train_cfg.lr,
|
| 61 |
+
weight_decay=train_cfg.weight_decay,
|
| 62 |
+
)
|
| 63 |
+
|
| 64 |
+
from pawn.trainer import CosineWithWarmup
|
| 65 |
+
self.scheduler = CosineWithWarmup(
|
| 66 |
+
self.optimizer,
|
| 67 |
+
warmup_steps=train_cfg.warmup_steps,
|
| 68 |
+
total_steps=train_cfg.total_steps,
|
| 69 |
+
)
|
| 70 |
+
|
| 71 |
+
self.scaler = torch.amp.GradScaler(device, enabled=train_cfg.use_amp)
|
| 72 |
+
|
| 73 |
+
# Run directory
|
| 74 |
+
self.run_dir = _make_run_dir(train_cfg.log_dir, name)
|
| 75 |
+
self.checkpoint_dir = os.path.join(self.run_dir, "checkpoints")
|
| 76 |
+
os.makedirs(self.checkpoint_dir, exist_ok=True)
|
| 77 |
+
|
| 78 |
+
self.jsonl_path = os.path.join(self.run_dir, "metrics.jsonl")
|
| 79 |
+
self._jsonl_file = None  # text file handle for JSONL metrics, opened lazily in _log_jsonl
|
| 80 |
+
|
| 81 |
+
self.hf_branch = f"run/{os.path.basename(self.run_dir)}" if hf_repo else None
|
| 82 |
+
self.global_step = 0
|
| 83 |
+
|
| 84 |
+
# Write config
|
| 85 |
+
import subprocess
|
| 86 |
+
try:
|
| 87 |
+
git_hash = subprocess.check_output(
|
| 88 |
+
["git", "rev-parse", "HEAD"], stderr=subprocess.DEVNULL, text=True
|
| 89 |
+
).strip()
|
| 90 |
+
except Exception:
|
| 91 |
+
git_hash = os.environ.get("PAWN_GIT_HASH")
|
| 92 |
+
try:
|
| 93 |
+
git_tag = subprocess.check_output(
|
| 94 |
+
["git", "tag", "--points-at", "HEAD"], stderr=subprocess.DEVNULL, text=True
|
| 95 |
+
).strip() or None
|
| 96 |
+
except Exception:
|
| 97 |
+
git_tag = os.environ.get("PAWN_GIT_TAG")
|
| 98 |
+
|
| 99 |
+
config_data = {
|
| 100 |
+
"model": model_cfg.__dict__,
|
| 101 |
+
"training": train_cfg.__dict__,
|
| 102 |
+
"param_count": param_count,
|
| 103 |
+
"formulation": "clm",
|
| 104 |
+
"multi_model": True,
|
| 105 |
+
"variant": name,
|
| 106 |
+
"git_hash": git_hash,
|
| 107 |
+
"git_tag": git_tag,
|
| 108 |
+
}
|
| 109 |
+
with open(os.path.join(self.run_dir, "config.json"), "w") as f:
|
| 110 |
+
json.dump(config_data, f, indent=2, default=str)
|
| 111 |
+
self._log_jsonl({"type": "config", **config_data})
|
| 112 |
+
|
| 113 |
+
def train_step(self, batch: dict[str, torch.Tensor]) -> dict[str, float]:
|
| 114 |
+
self.model.train()
|
| 115 |
+
input_ids = batch["input_ids"].to(self.device)
|
| 116 |
+
targets = batch["targets"].to(self.device)
|
| 117 |
+
loss_mask = batch["loss_mask"].to(self.device)
|
| 118 |
+
|
| 119 |
+
with torch.amp.autocast(self.device, enabled=self.train_cfg.use_amp):
|
| 120 |
+
loss, metrics = self.model.forward_train(input_ids, loss_mask, targets)
|
| 121 |
+
|
| 122 |
+
self.scaler.scale(loss).backward()
|
| 123 |
+
return metrics
|
| 124 |
+
|
| 125 |
+
def optimizer_step(self) -> float:
|
| 126 |
+
self.scaler.unscale_(self.optimizer)
|
| 127 |
+
grad_norm = torch.nn.utils.clip_grad_norm_(
|
| 128 |
+
self.model.parameters(), self.train_cfg.max_grad_norm
|
| 129 |
+
).item()
|
| 130 |
+
self.scaler.step(self.optimizer)
|
| 131 |
+
self.scaler.update()
|
| 132 |
+
self.optimizer.zero_grad(set_to_none=True)
|
| 133 |
+
self.scheduler.step()
|
| 134 |
+
return grad_norm
|
| 135 |
+
|
| 136 |
+
def save_checkpoint(self):
|
| 137 |
+
path = os.path.join(self.checkpoint_dir, f"step_{self.global_step:08d}")
|
| 138 |
+
save_pretrain_checkpoint(
|
| 139 |
+
path, self.model, self.optimizer, self.scheduler, self.scaler,
|
| 140 |
+
self.global_step, self.model_cfg.__dict__, self.train_cfg.__dict__,
|
| 141 |
+
)
|
| 142 |
+
print(f" [{self.name}] Checkpoint saved: {path}")
|
| 143 |
+
|
| 144 |
+
if self.hf_repo and self.hf_branch:
|
| 145 |
+
try:
|
| 146 |
+
push_checkpoint_to_hf(
|
| 147 |
+
path, self.hf_repo, self.hf_branch,
|
| 148 |
+
metrics_path=self.jsonl_path, step=self.global_step,
|
| 149 |
+
)
|
| 150 |
+
print(f" [{self.name}] Pushed to HF: {self.hf_repo}@{self.hf_branch}")
|
| 151 |
+
except Exception as e:
|
| 152 |
+
print(f" [{self.name}] WARNING: HF push failed: {e}")
|
| 153 |
+
|
| 154 |
+
@torch.no_grad()
|
| 155 |
+
def evaluate(self, val_data: dict[str, torch.Tensor]) -> dict[str, float]:
|
| 156 |
+
self.model.eval()
|
| 157 |
+
n = val_data["input_ids"].shape[0]
|
| 158 |
+
batch_size = self.train_cfg.batch_size
|
| 159 |
+
total_metrics: dict[str, float] = {}
|
| 160 |
+
n_batches = 0
|
| 161 |
+
|
| 162 |
+
for start in range(0, n, batch_size):
|
| 163 |
+
end = min(start + batch_size, n)
|
| 164 |
+
input_ids = val_data["input_ids"][start:end].to(self.device)
|
| 165 |
+
targets = val_data["targets"][start:end].to(self.device)
|
| 166 |
+
loss_mask = val_data["loss_mask"][start:end].to(self.device)
|
| 167 |
+
|
| 168 |
+
with torch.amp.autocast(self.device, enabled=self.train_cfg.use_amp):
|
| 169 |
+
logits, _ = self.model(input_ids, loss_mask)
|
| 170 |
+
_, metrics = clm_loss(logits, targets, loss_mask)
|
| 171 |
+
|
| 172 |
+
for k, v in metrics.items():
|
| 173 |
+
total_metrics[k] = total_metrics.get(k, 0.0) + v
|
| 174 |
+
n_batches += 1
|
| 175 |
+
|
| 176 |
+
return {f"val/{k}": v / n_batches for k, v in total_metrics.items()}
|
| 177 |
+
|
| 178 |
+
def _log_jsonl(self, record: dict):
|
| 179 |
+
if self._jsonl_file is None:
|
| 180 |
+
self._jsonl_file = open(self.jsonl_path, "a")
|
| 181 |
+
self._jsonl_file.write(json.dumps(record, default=str) + "\n")
|
| 182 |
+
self._jsonl_file.flush()
|
| 183 |
+
|
| 184 |
+
def close(self):
|
| 185 |
+
if self._jsonl_file:
|
| 186 |
+
self._jsonl_file.close()
|
| 187 |
+
self._jsonl_file = None
|
| 188 |
+
|
| 189 |
+
|
| 190 |
+
def _make_run_dir(log_dir: str, variant: str) -> str:
|
| 191 |
+
timestamp = time.strftime("%Y%m%d_%H%M%S")
|
| 192 |
+
run_dir = os.path.join(log_dir, f"run_{timestamp}_{variant}")
|
| 193 |
+
os.makedirs(run_dir, exist_ok=True)
|
| 194 |
+
return run_dir
|
| 195 |
+
|
| 196 |
+
|
| 197 |
+
# ---------------------------------------------------------------------------
|
| 198 |
+
# CLI
|
| 199 |
+
# ---------------------------------------------------------------------------
|
| 200 |
+
|
| 201 |
+
def parse_args():
|
| 202 |
+
p = argparse.ArgumentParser(description="Train small/base/large PAWN models simultaneously")
|
| 203 |
+
p.add_argument("--device", type=str, default=None, help="Device (cuda/cpu)")
|
| 204 |
+
p.add_argument("--total-steps", type=int, default=100_000, help="Total training steps")
|
| 205 |
+
p.add_argument("--batch-size", type=int, default=256, help="Batch size (shared across models)")
|
| 206 |
+
p.add_argument("--num-workers", type=int, default=4, help="DataLoader workers")
|
| 207 |
+
p.add_argument("--log-dir", type=str, default="logs", help="Log directory")
|
| 208 |
+
p.add_argument("--log-interval", type=int, default=10)
|
| 209 |
+
p.add_argument("--eval-interval", type=int, default=500)
|
| 210 |
+
p.add_argument("--checkpoint-interval", type=int, default=5000)
|
| 211 |
+
p.add_argument("--discard-ply-limit", action="store_true")
|
| 212 |
+
p.add_argument("--wandb", action="store_true")
|
| 213 |
+
|
| 214 |
+
ckpt_group = p.add_mutually_exclusive_group(required=True)
|
| 215 |
+
ckpt_group.add_argument("--hf-repo", type=str, default=None,
|
| 216 |
+
help="HF repo prefix (appends -{variant}). E.g. thomas-schweich/pawn")
|
| 217 |
+
ckpt_group.add_argument("--local-checkpoints", action="store_true")
|
| 218 |
+
return p.parse_args()
|
| 219 |
+
|
| 220 |
+
|
| 221 |
+
def main():
|
| 222 |
+
args = parse_args()
|
| 223 |
+
|
| 224 |
+
device = args.device or ("cuda" if torch.cuda.is_available() else "cpu")
|
| 225 |
+
if device == "cuda":
|
| 226 |
+
gpu_cfg = configure_gpu()
|
| 227 |
+
import pawn.model as model_module
|
| 228 |
+
if gpu_cfg.get("sdpa_backend"):
|
| 229 |
+
model_module.SDPA_BACKEND = gpu_cfg["sdpa_backend"]
|
| 230 |
+
|
| 231 |
+
# Build per-variant configs (shared training hyperparams, different model sizes)
|
| 232 |
+
variants = {
|
| 233 |
+
"small": CLMConfig.small(),
|
| 234 |
+
"base": CLMConfig.base(),
|
| 235 |
+
"large": CLMConfig.large(),
|
| 236 |
+
}
|
| 237 |
+
|
| 238 |
+
print("=== Multi-Model Training ===")
|
| 239 |
+
print(f"Device: {device}")
|
| 240 |
+
print(f"Batch size: {args.batch_size}")
|
| 241 |
+
print(f"Total steps: {args.total_steps}")
|
| 242 |
+
print()
|
| 243 |
+
|
| 244 |
+
slots: list[ModelSlot] = []
|
| 245 |
+
for name, model_cfg in variants.items():
|
| 246 |
+
train_cfg = TrainingConfig()
|
| 247 |
+
train_cfg.total_steps = args.total_steps
|
| 248 |
+
train_cfg.batch_size = args.batch_size
|
| 249 |
+
train_cfg.num_workers = args.num_workers
|
| 250 |
+
train_cfg.device = device
|
| 251 |
+
train_cfg.log_dir = args.log_dir
|
| 252 |
+
train_cfg.log_interval = args.log_interval
|
| 253 |
+
train_cfg.eval_interval = args.eval_interval
|
| 254 |
+
train_cfg.checkpoint_interval = args.checkpoint_interval
|
| 255 |
+
train_cfg.discard_ply_limit = args.discard_ply_limit
|
| 256 |
+
train_cfg.use_wandb = args.wandb
|
| 257 |
+
|
| 258 |
+
hf_repo = f"{args.hf_repo}-{name}" if args.hf_repo else None
|
| 259 |
+
slots.append(ModelSlot(name, model_cfg, train_cfg, device, hf_repo))
|
| 260 |
+
|
| 261 |
+
# Shared dataset and validation set
|
| 262 |
+
max_ply = 256
|
| 263 |
+
dataset = CLMDataset(
|
| 264 |
+
args.batch_size, max_ply, base_seed=42,
|
| 265 |
+
discard_ply_limit=args.discard_ply_limit,
|
| 266 |
+
)
|
| 267 |
+
|
| 268 |
+
print("\nGenerating shared validation set...")
|
| 269 |
+
val_data = create_validation_set(512, max_ply, seed=(2**63) - 1,
|
| 270 |
+
discard_ply_limit=args.discard_ply_limit)
|
| 271 |
+
|
| 272 |
+
# Compile models
|
| 273 |
+
if device != "cpu":
|
| 274 |
+
for slot in slots:
|
| 275 |
+
try:
|
| 276 |
+
slot.model = torch.compile(slot.model, mode="default")
|
| 277 |
+
print(f" [{slot.name}] torch.compile enabled")
|
| 278 |
+
except Exception:
|
| 279 |
+
print(f" [{slot.name}] torch.compile not available")
|
| 280 |
+
|
| 281 |
+
loader = DataLoader(
|
| 282 |
+
dataset,
|
| 283 |
+
batch_size=None,
|
| 284 |
+
num_workers=args.num_workers,
|
| 285 |
+
pin_memory=(device != "cpu"),
|
| 286 |
+
persistent_workers=(args.num_workers > 0),
|
| 287 |
+
prefetch_factor=1 if args.num_workers > 0 else None,
|
| 288 |
+
)
|
| 289 |
+
|
| 290 |
+
# Signal handling
|
| 291 |
+
_shutdown_requested = False
|
| 292 |
+
_shutdown_signal = None
|
| 293 |
+
|
| 294 |
+
def _graceful_exit(signum, frame):
|
| 295 |
+
nonlocal _shutdown_requested, _shutdown_signal
|
| 296 |
+
_shutdown_requested = True
|
| 297 |
+
_shutdown_signal = signum
|
| 298 |
+
|
| 299 |
+
signal.signal(signal.SIGTERM, _graceful_exit)
|
| 300 |
+
signal.signal(signal.SIGINT, _graceful_exit)
|
| 301 |
+
|
| 302 |
+
# Training loop
|
| 303 |
+
global_step = 0
|
| 304 |
+
step_start = time.time()
|
| 305 |
+
|
| 306 |
+
print("\nStarting training from step 0", flush=True)
|
| 307 |
+
for slot in slots:
|
| 308 |
+
print(f" [{slot.name}] JSONL: {slot.jsonl_path}", flush=True)
|
| 309 |
+
print()
|
| 310 |
+
|
| 311 |
+
for batch in loader:
|
| 312 |
+
# Forward + backward for each model on the same batch
|
| 313 |
+
all_metrics: dict[str, dict[str, float]] = {}
|
| 314 |
+
for slot in slots:
|
| 315 |
+
metrics = slot.train_step(batch)
|
| 316 |
+
all_metrics[slot.name] = metrics
|
| 317 |
+
|
| 318 |
+
# Optimizer step for each model
|
| 319 |
+
all_grad_norms: dict[str, float] = {}
|
| 320 |
+
for slot in slots:
|
| 321 |
+
gn = slot.optimizer_step()
|
| 322 |
+
all_grad_norms[slot.name] = gn
|
| 323 |
+
|
| 324 |
+
global_step += 1
|
| 325 |
+
for slot in slots:
|
| 326 |
+
slot.global_step = global_step
|
| 327 |
+
|
| 328 |
+
step_time = time.time() - step_start
|
| 329 |
+
games_per_sec = args.batch_size / step_time
|
| 330 |
+
|
| 331 |
+
# Logging
|
| 332 |
+
if global_step % args.log_interval == 0:
|
| 333 |
+
print(f"step {global_step:>7d} | {games_per_sec:.0f} g/s | {step_time:.2f}s", flush=True)
|
| 334 |
+
for slot in slots:
|
| 335 |
+
m = all_metrics[slot.name]
|
| 336 |
+
gn = all_grad_norms[slot.name]
|
| 337 |
+
lr = slot.scheduler.get_lr()
|
| 338 |
+
print(f" {slot.name:>5s}: loss {m['loss']:.4f} | acc {m['accuracy']:.3f} | "
|
| 339 |
+
f"lr {lr:.2e} | gn {gn:.2f}", flush=True)
|
| 340 |
+
|
| 341 |
+
record = {
|
| 342 |
+
"type": "train",
|
| 343 |
+
"step": global_step,
|
| 344 |
+
"timestamp": time.time(),
|
| 345 |
+
"lr": lr,
|
| 346 |
+
"grad_norm": gn,
|
| 347 |
+
"step_time": step_time,
|
| 348 |
+
"games_per_sec": games_per_sec,
|
| 349 |
+
**{f"train/{k}": v for k, v in m.items()},
|
| 350 |
+
}
|
| 351 |
+
slot._log_jsonl(record)
|
| 352 |
+
|
| 353 |
+
# Eval
|
| 354 |
+
if global_step % args.eval_interval == 0:
|
| 355 |
+
for slot in slots:
|
| 356 |
+
val_metrics = slot.evaluate(val_data)
|
| 357 |
+
print(f" {slot.name:>5s} val: loss {val_metrics['val/loss']:.4f} | "
|
| 358 |
+
f"acc {val_metrics['val/accuracy']:.3f}", flush=True)
|
| 359 |
+
slot._log_jsonl({
|
| 360 |
+
"type": "val",
|
| 361 |
+
"step": global_step,
|
| 362 |
+
"timestamp": time.time(),
|
| 363 |
+
**val_metrics,
|
| 364 |
+
})
|
| 365 |
+
|
| 366 |
+
# Checkpoint
|
| 367 |
+
if global_step % args.checkpoint_interval == 0:
|
| 368 |
+
for slot in slots:
|
| 369 |
+
slot.save_checkpoint()
|
| 370 |
+
|
| 371 |
+
# Done?
|
| 372 |
+
if global_step >= args.total_steps:
|
| 373 |
+
print(f"\nTraining complete at step {global_step}")
|
| 374 |
+
for slot in slots:
|
| 375 |
+
slot.save_checkpoint()
|
| 376 |
+
break
|
| 377 |
+
|
| 378 |
+
# Graceful shutdown
|
| 379 |
+
if _shutdown_requested:
|
| 380 |
+
print(f"\nShutdown requested (signal {_shutdown_signal}), "
|
| 381 |
+
f"saving checkpoints at step {global_step}...")
|
| 382 |
+
for slot in slots:
|
| 383 |
+
slot.save_checkpoint()
|
| 384 |
+
break
|
| 385 |
+
|
| 386 |
+
step_start = time.time()
|
| 387 |
+
|
| 388 |
+
# Cleanup
|
| 389 |
+
for slot in slots:
|
| 390 |
+
slot.close()
|
| 391 |
+
|
| 392 |
+
print("\nAll done.")
|
| 393 |
+
|
| 394 |
+
|
| 395 |
+
if __name__ == "__main__":
|
| 396 |
+
try:
|
| 397 |
+
mp.set_start_method("forkserver", force=True)
|
| 398 |
+
except ValueError:
|
| 399 |
+
mp.set_start_method("spawn", force=True)
|
| 400 |
+
main()
|
scripts/train_bottleneck.py
CHANGED
|
@@ -16,6 +16,7 @@ from __future__ import annotations
|
|
| 16 |
import argparse
|
| 17 |
import gc
|
| 18 |
import math
|
|
|
|
| 19 |
import time
|
| 20 |
from pathlib import Path
|
| 21 |
|
|
@@ -90,15 +91,22 @@ def parse_args():
|
|
| 90 |
p.add_argument("--resume", type=str, default=None,
|
| 91 |
help="Path to checkpoint to resume from (best.pt or final.pt)")
|
| 92 |
|
|
|
| 93 |
return p.parse_args()
|
| 94 |
|
| 95 |
|
| 96 |
def load_backbone(checkpoint_path: str, device: str) -> PAWNCLM:
|
| 97 |
-
|
| 98 |
-
|
|
|
|
| 99 |
model = PAWNCLM(cfg).to(device)
|
| 100 |
-
model.load_state_dict(
|
| 101 |
-
del
|
| 102 |
gc.collect()
|
| 103 |
model.eval()
|
| 104 |
return model
|
|
@@ -197,6 +205,10 @@ def main():
|
|
| 197 |
ckpt_dir = out_dir / "checkpoints"
|
| 198 |
ckpt_dir.mkdir(exist_ok=True)
|
| 199 |
|
|
|
| 200 |
print(f"Device: {device}")
|
| 201 |
print(f"Output: {out_dir}")
|
| 202 |
|
|
@@ -312,17 +324,18 @@ def main():
|
|
| 312 |
|
| 313 |
if args.resume:
|
| 314 |
print(f"\nResuming from: {args.resume}")
|
| 315 |
-
|
| 316 |
-
|
| 317 |
-
|
|
|
|
| 318 |
optimizer.load_state_dict(ckpt["optimizer_state_dict"])
|
| 319 |
-
if "scheduler_state_dict"
|
| 320 |
scheduler.load_state_dict(ckpt["scheduler_state_dict"])
|
| 321 |
if scaler and ckpt.get("scaler_state_dict"):
|
| 322 |
scaler.load_state_dict(ckpt["scaler_state_dict"])
|
| 323 |
start_epoch = ckpt["epoch"] + 1
|
| 324 |
global_step = ckpt["step"]
|
| 325 |
-
best_val_loss = ckpt.get("best_val_loss", ckpt.get("
|
| 326 |
patience_counter = ckpt.get("patience_counter", 0)
|
| 327 |
print(f" Resumed at epoch {start_epoch}, step {global_step}, "
|
| 328 |
f"best_val_loss={best_val_loss:.4f}")
|
|
@@ -358,6 +371,13 @@ def main():
|
|
| 358 |
val_metrics = evaluate(model, val_loader, mask_builder, device, use_amp=use_amp,
|
| 359 |
precomputed_indices=val_legal_indices) if args.resume else baseline
|
| 360 |
|
|
|
| 361 |
print(f"\nTraining for up to {args.epochs} epochs ({total_steps} steps)")
|
| 362 |
print(f" Warmup: {warmup_steps} steps, LR: {args.lr}")
|
| 363 |
|
|
@@ -434,38 +454,56 @@ def main():
|
|
| 434 |
if val_metrics["loss"] < best_val_loss:
|
| 435 |
best_val_loss = val_metrics["loss"]
|
| 436 |
patience_counter = 0
|
| 437 |
-
|
| 438 |
-
|
| 439 |
-
"
|
| 440 |
-
|
| 441 |
-
|
| 442 |
-
|
| 443 |
-
|
| 444 |
-
|
| 445 |
-
|
| 446 |
-
|
| 447 |
-
|
| 448 |
-
"
|
| 449 |
-
|
|
|
|
| 450 |
else:
|
| 451 |
patience_counter += 1
|
| 452 |
if patience_counter >= args.patience:
|
| 453 |
print(f"\n Early stopping at epoch {epoch} (patience={args.patience})")
|
| 454 |
break
|
| 455 |
|
| 456 |
-
|
| 457 |
-
|
| 458 |
-
|
| 459 |
-
|
| 460 |
-
|
| 461 |
-
|
| 462 |
-
"
|
| 463 |
-
|
| 464 |
-
|
| 465 |
-
|
| 466 |
-
|
| 467 |
-
|
| 468 |
-
|
|
|
|
|
| 469 |
|
| 470 |
logger.close()
|
| 471 |
print(f"\nDone. Best val_loss={best_val_loss:.4f}")
|
|
|
|
| 16 |
import argparse
|
| 17 | import gc
| 18 | import math
| 19 | + import signal
| 20 | import time
| 21 | from pathlib import Path
| 22 |

| 91 | p.add_argument("--resume", type=str, default=None,
| 92 |                help="Path to checkpoint to resume from (best.pt or final.pt)")
| 93 |
| 94 | + ckpt_group = p.add_mutually_exclusive_group(required=True)
| 95 | + ckpt_group.add_argument("--hf-repo", type=str, default=None,
| 96 | +                         help="Push checkpoints to this HuggingFace repo (requires HF_TOKEN)")
| 97 | + ckpt_group.add_argument("--local-checkpoints", action="store_true",
| 98 | +                         help="Save checkpoints locally only")
| 99 | +
| 100 | return p.parse_args()
| 101 |
| 102 |
| 103 | def load_backbone(checkpoint_path: str, device: str) -> PAWNCLM:
| 104 | +     from pawn.checkpoint import load_backbone_weights
| 105 | +     state_dict, model_config = load_backbone_weights(checkpoint_path, device)
| 106 | +     cfg = CLMConfig(**model_config) if model_config else CLMConfig()
| 107 |       model = PAWNCLM(cfg).to(device)
| 108 | +     model.load_state_dict(state_dict)
| 109 | +     del state_dict
| 110 |       gc.collect()
| 111 |       model.eval()
| 112 |       return model

| 205 | ckpt_dir = out_dir / "checkpoints"
| 206 | ckpt_dir.mkdir(exist_ok=True)
| 207 |
| 208 | + hf_branch = None
| 209 | + if args.hf_repo:
| 210 | +     hf_branch = f"run/{out_dir.name}"
| 211 | +
| 212 | print(f"Device: {device}")
| 213 | print(f"Output: {out_dir}")
| 214 |

| 324 |
| 325 | if args.resume:
| 326 |     print(f"\nResuming from: {args.resume}")
| 327 | +     from pawn.checkpoint import load_adapter_checkpoint
| 328 | +     ckpt = load_adapter_checkpoint(args.resume, device=device)
| 329 | +     model.load_adapter_state_dict(ckpt["adapter_state_dict"])
| 330 | +     if ckpt.get("optimizer_state_dict"):
| 331 |           optimizer.load_state_dict(ckpt["optimizer_state_dict"])
| 332 | +     if ckpt.get("scheduler_state_dict"):
| 333 |           scheduler.load_state_dict(ckpt["scheduler_state_dict"])
| 334 |       if scaler and ckpt.get("scaler_state_dict"):
| 335 |           scaler.load_state_dict(ckpt["scaler_state_dict"])
| 336 |       start_epoch = ckpt["epoch"] + 1
| 337 |       global_step = ckpt["step"]
| 338 | +     best_val_loss = ckpt.get("best_val_loss", ckpt.get("val_metrics", {}).get("loss", float("inf")))
| 339 |       patience_counter = ckpt.get("patience_counter", 0)
| 340 |       print(f"  Resumed at epoch {start_epoch}, step {global_step}, "
| 341 |             f"best_val_loss={best_val_loss:.4f}")

| 371 | val_metrics = evaluate(model, val_loader, mask_builder, device, use_amp=use_amp,
| 372 |                        precomputed_indices=val_legal_indices) if args.resume else baseline
| 373 |
| 374 | + _shutdown_requested = False
| 375 | + def _graceful_exit(signum, frame):
| 376 | +     nonlocal _shutdown_requested
| 377 | +     _shutdown_requested = True
| 378 | + signal.signal(signal.SIGTERM, _graceful_exit)
| 379 | + signal.signal(signal.SIGINT, _graceful_exit)
| 380 | +
| 381 | print(f"\nTraining for up to {args.epochs} epochs ({total_steps} steps)")
| 382 | print(f"  Warmup: {warmup_steps} steps, LR: {args.lr}")
| 383 |

| 454 | if val_metrics["loss"] < best_val_loss:
| 455 |     best_val_loss = val_metrics["loss"]
| 456 |     patience_counter = 0
| 457 | +     from pawn.checkpoint import save_adapter_checkpoint
| 458 | +     save_adapter_checkpoint(
| 459 | +         ckpt_dir / "best",
| 460 | +         model.adapter_state_dict(),
| 461 | +         config=vars(args),
| 462 | +         epoch=epoch,
| 463 | +         step=global_step,
| 464 | +         val_metrics=val_metrics,
| 465 | +         optimizer=optimizer,
| 466 | +         scheduler=scheduler,
| 467 | +         scaler=scaler,
| 468 | +         extra={"best_val_loss": best_val_loss, "patience_counter": patience_counter},
| 469 | +     )
| 470 | +     if args.hf_repo and hf_branch:
| 471 | +         from pawn.checkpoint import push_checkpoint_to_hf
| 472 | +         try:
| 473 | +             push_checkpoint_to_hf(ckpt_dir / "best", args.hf_repo, hf_branch, step=global_step)
| 474 | +             print(f"Pushed to HF: {args.hf_repo}@{hf_branch}")
| 475 | +         except Exception as e:
| 476 | +             print(f"WARNING: HF push failed: {e}")
| 477 | else:
| 478 |     patience_counter += 1
| 479 |     if patience_counter >= args.patience:
| 480 |         print(f"\n  Early stopping at epoch {epoch} (patience={args.patience})")
| 481 |         break
| 482 |
| 483 | + if _shutdown_requested:
| 484 | +     print("Shutdown requested, saving checkpoint...")
| 485 | +     break
| 486 | +
| 487 | + from pawn.checkpoint import save_adapter_checkpoint
| 488 | + save_adapter_checkpoint(
| 489 | +     ckpt_dir / "final",
| 490 | +     model.adapter_state_dict(),
| 491 | +     config=vars(args),
| 492 | +     epoch=epoch,
| 493 | +     step=global_step,
| 494 | +     val_metrics=val_metrics,
| 495 | +     optimizer=optimizer,
| 496 | +     scheduler=scheduler,
| 497 | +     scaler=scaler,
| 498 | +     extra={"best_val_loss": best_val_loss, "patience_counter": patience_counter},
| 499 | + )
| 500 | + if args.hf_repo and hf_branch:
| 501 | +     from pawn.checkpoint import push_checkpoint_to_hf
| 502 | +     try:
| 503 | +         push_checkpoint_to_hf(ckpt_dir / "final", args.hf_repo, hf_branch, step=global_step)
| 504 | +         print(f"Pushed to HF: {args.hf_repo}@{hf_branch}")
| 505 | +     except Exception as e:
| 506 | +         print(f"WARNING: HF push failed: {e}")
| 507 |
| 508 | logger.close()
| 509 | print(f"\nDone. Best val_loss={best_val_loss:.4f}")
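The graceful-shutdown change above is repeated in every training script: a SIGTERM/SIGINT handler only flips a flag, and the epoch loop checks the flag and breaks so the final checkpoint is still written. A minimal self-contained sketch of that pattern (the training loop is simulated here; the scripts use a nested function with `nonlocal`, while this sketch uses a dict so it works at module scope):

```python
import signal

def install_graceful_shutdown():
    """Install SIGTERM/SIGINT handlers that only set a flag, so the
    training loop can finish the current step and save a checkpoint."""
    state = {"requested": False}

    def _graceful_exit(signum, frame):
        state["requested"] = True

    signal.signal(signal.SIGTERM, _graceful_exit)
    signal.signal(signal.SIGINT, _graceful_exit)
    return state

state = install_graceful_shutdown()
completed = 0
for epoch in range(5):
    completed += 1                           # stand-in for one epoch of training
    if epoch == 2:
        signal.raise_signal(signal.SIGTERM)  # simulate `kill <pid>` mid-run
    if state["requested"]:
        print("Shutdown requested, saving checkpoint...")
        break
print(completed)  # 3
```

Because the handler does nothing but set the flag, it is async-signal-safe with respect to the training state: the checkpoint save happens in the loop body, never inside the handler.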
scripts/train_film.py
CHANGED
|
@@ -17,6 +17,7 @@ import argparse
| 17 | import gc
| 18 | import json
| 19 | import math
| 20 | + import signal
| 21 | import time
| 22 | from pathlib import Path
| 23 |

@@ -80,15 +81,22 @@ def parse_args()
| 81 | p.add_argument("--sdpa-math", action="store_true",
| 82 |                help="Use MATH SDPA backend (workaround for ROCm flash attn + compile)")
| 83 |
| 84 | + ckpt_group = p.add_mutually_exclusive_group(required=True)
| 85 | + ckpt_group.add_argument("--hf-repo", type=str, default=None,
| 86 | +                         help="Push checkpoints to this HuggingFace repo (requires HF_TOKEN)")
| 87 | + ckpt_group.add_argument("--local-checkpoints", action="store_true",
| 88 | +                         help="Save checkpoints locally only")
| 89 | +
| 90 | return p.parse_args()
| 91 |
| 92 |
| 93 | def load_backbone(checkpoint_path: str, device: str) -> PAWNCLM:
| 94 | +     from pawn.checkpoint import load_backbone_weights
| 95 | +     state_dict, model_config = load_backbone_weights(checkpoint_path, device)
| 96 | +     cfg = CLMConfig(**model_config) if model_config else CLMConfig()
| 97 |       model = PAWNCLM(cfg).to(device)
| 98 | +     model.load_state_dict(state_dict)
| 99 | +     del state_dict
| 100 |      gc.collect()
| 101 |      model.eval()
| 102 |      return model

@@ -180,6 +188,10 @@ def main()
| 188 | ckpt_dir = out_dir / "checkpoints"
| 189 | ckpt_dir.mkdir(exist_ok=True)
| 190 |
| 191 | + hf_branch = None
| 192 | + if args.hf_repo:
| 193 | +     hf_branch = f"run/{out_dir.name}"
| 194 | +
| 195 | device = args.device
| 196 | print(f"Device: {device}")
| 197 | print(f"Output: {out_dir}")

@@ -284,6 +296,13 @@ def main()
| 296 | global_step = 0
| 297 | val_metrics = baseline  # for first log if val_every > 1
| 298 |
| 299 | + _shutdown_requested = False
| 300 | + def _graceful_exit(signum, frame):
| 301 | +     nonlocal _shutdown_requested
| 302 | +     _shutdown_requested = True
| 303 | + signal.signal(signal.SIGTERM, _graceful_exit)
| 304 | + signal.signal(signal.SIGINT, _graceful_exit)
| 305 | +
| 306 | print(f"\nTraining for up to {args.epochs} epochs ({total_steps} steps)")
| 307 | print(f"  Warmup: {warmup_steps} steps, LR: {args.lr}")
| 308 | if args.val_every > 1:

@@ -371,29 +390,49 @@ def main()
| 390 | if val_metrics["loss"] < best_val_loss:
| 391 |     best_val_loss = val_metrics["loss"]
| 392 |     patience_counter = 0
| 393 | +     from pawn.checkpoint import save_adapter_checkpoint
| 394 | +     save_adapter_checkpoint(
| 395 | +         ckpt_dir / "best",
| 396 | +         model.film_state_dict(),
| 397 | +         config=vars(args),
| 398 | +         epoch=epoch,
| 399 | +         step=global_step,
| 400 | +         val_metrics=val_metrics,
| 401 | +     )
| 402 | +     if args.hf_repo and hf_branch:
| 403 | +         from pawn.checkpoint import push_checkpoint_to_hf
| 404 | +         try:
| 405 | +             push_checkpoint_to_hf(ckpt_dir / "best", args.hf_repo, hf_branch, step=global_step)
| 406 | +             print(f"Pushed to HF: {args.hf_repo}@{hf_branch}")
| 407 | +         except Exception as e:
| 408 | +             print(f"WARNING: HF push failed: {e}")
| 409 | else:
| 410 |     patience_counter += 1
| 411 |     if patience_counter >= args.patience:
| 412 |         print(f"\n  Early stopping at epoch {epoch} (patience={args.patience})")
| 413 |         break
| 414 |
| 415 | + if _shutdown_requested:
| 416 | +     print("Shutdown requested, saving checkpoint...")
| 417 | +     break
| 418 | +
| 419 | # Save final checkpoint
| 420 | + from pawn.checkpoint import save_adapter_checkpoint
| 421 | + save_adapter_checkpoint(
| 422 | +     ckpt_dir / "final",
| 423 | +     model.film_state_dict(),
| 424 | +     config=vars(args),
| 425 | +     epoch=epoch,
| 426 | +     step=global_step,
| 427 | +     val_metrics=val_metrics,
| 428 | + )
| 429 | + if args.hf_repo and hf_branch:
| 430 | +     from pawn.checkpoint import push_checkpoint_to_hf
| 431 | +     try:
| 432 | +         push_checkpoint_to_hf(ckpt_dir / "final", args.hf_repo, hf_branch, step=global_step)
| 433 | +         print(f"Pushed to HF: {args.hf_repo}@{hf_branch}")
| 434 | +     except Exception as e:
| 435 | +         print(f"WARNING: HF push failed: {e}")
| 436 |
| 437 | print(f"\nDone. Best val_loss={best_val_loss:.4f}")
| 438 | print(f"Checkpoints saved to {out_dir}")
scripts/train_hybrid.py
CHANGED
|
@@ -18,6 +18,7 @@ import argparse
| 18 | import gc
| 19 | import json
| 20 | import math
| 21 | + import signal
| 22 | import time
| 23 | from pathlib import Path
| 24 |

@@ -91,15 +92,22 @@ def parse_args()
| 92 | p.add_argument("--sdpa-math", action="store_true",
| 93 |                help="Use MATH SDPA backend (workaround for ROCm flash attn + compile)")
| 94 |
| 95 | + ckpt_group = p.add_mutually_exclusive_group(required=True)
| 96 | + ckpt_group.add_argument("--hf-repo", type=str, default=None,
| 97 | +                         help="Push checkpoints to this HuggingFace repo (requires HF_TOKEN)")
| 98 | + ckpt_group.add_argument("--local-checkpoints", action="store_true",
| 99 | +                         help="Save checkpoints locally only")
| 100 | +
| 101 | return p.parse_args()
| 102 |
| 103 |
| 104 | def load_backbone(checkpoint_path: str, device: str) -> PAWNCLM:
| 105 | +     from pawn.checkpoint import load_backbone_weights
| 106 | +     state_dict, model_config = load_backbone_weights(checkpoint_path, device)
| 107 | +     cfg = CLMConfig(**model_config) if model_config else CLMConfig()
| 108 |       model = PAWNCLM(cfg).to(device)
| 109 | +     model.load_state_dict(state_dict)
| 110 | +     del state_dict
| 111 |       gc.collect()
| 112 |       model.eval()
| 113 |       return model

@@ -182,6 +190,10 @@ def main()
| 190 | ckpt_dir = out_dir / "checkpoints"
| 191 | ckpt_dir.mkdir(exist_ok=True)
| 192 |
| 193 | + hf_branch = None
| 194 | + if args.hf_repo:
| 195 | +     hf_branch = f"run/{out_dir.name}"
| 196 | +
| 197 | device = args.device
| 198 | print(f"Device: {device}")
| 199 | print(f"Output: {out_dir}")

@@ -314,6 +326,13 @@ def main()
| 326 | global_step = 0
| 327 | val_metrics = baseline
| 328 |
| 329 | + _shutdown_requested = False
| 330 | + def _graceful_exit(signum, frame):
| 331 | +     nonlocal _shutdown_requested
| 332 | +     _shutdown_requested = True
| 333 | + signal.signal(signal.SIGTERM, _graceful_exit)
| 334 | + signal.signal(signal.SIGINT, _graceful_exit)
| 335 | +
| 336 | print(f"\nTraining for up to {args.epochs} epochs ({total_steps} steps)")
| 337 |
| 338 | for epoch in range(args.epochs):

@@ -394,28 +413,48 @@ def main()
| 413 | if val_metrics["loss"] < best_val_loss:
| 414 |     best_val_loss = val_metrics["loss"]
| 415 |     patience_counter = 0
| 416 | +     from pawn.checkpoint import save_adapter_checkpoint
| 417 | +     save_adapter_checkpoint(
| 418 | +         ckpt_dir / "best",
| 419 | +         model.adapter_state_dict(),
| 420 | +         config=vars(args),
| 421 | +         epoch=epoch,
| 422 | +         step=global_step,
| 423 | +         val_metrics=val_metrics,
| 424 | +     )
| 425 | +     if args.hf_repo and hf_branch:
| 426 | +         from pawn.checkpoint import push_checkpoint_to_hf
| 427 | +         try:
| 428 | +             push_checkpoint_to_hf(ckpt_dir / "best", args.hf_repo, hf_branch, step=global_step)
| 429 | +             print(f"Pushed to HF: {args.hf_repo}@{hf_branch}")
| 430 | +         except Exception as e:
| 431 | +             print(f"WARNING: HF push failed: {e}")
| 432 | else:
| 433 |     patience_counter += 1
| 434 |     if patience_counter >= args.patience:
| 435 |         print(f"\n  Early stopping at epoch {epoch} (patience={args.patience})")
| 436 |         break
| 437 |
| 438 | + if _shutdown_requested:
| 439 | +     print("Shutdown requested, saving checkpoint...")
| 440 | +     break
| 441 | +
| 442 | + from pawn.checkpoint import save_adapter_checkpoint
| 443 | + save_adapter_checkpoint(
| 444 | +     ckpt_dir / "final",
| 445 | +     model.adapter_state_dict(),
| 446 | +     config=vars(args),
| 447 | +     epoch=epoch,
| 448 | +     step=global_step,
| 449 | +     val_metrics=val_metrics,
| 450 | + )
| 451 | + if args.hf_repo and hf_branch:
| 452 | +     from pawn.checkpoint import push_checkpoint_to_hf
| 453 | +     try:
| 454 | +         push_checkpoint_to_hf(ckpt_dir / "final", args.hf_repo, hf_branch, step=global_step)
| 455 | +         print(f"Pushed to HF: {args.hf_repo}@{hf_branch}")
| 456 | +     except Exception as e:
| 457 | +         print(f"WARNING: HF push failed: {e}")
| 458 |
| 459 | print(f"\nDone. Best val_loss={best_val_loss:.4f}")
| 460 | print(f"Checkpoints saved to {out_dir}")
|
scripts/train_lora.py
CHANGED
|
@@ -17,6 +17,7 @@ import argparse
| 17 | import gc
| 18 | import json
| 19 | import math
| 20 | + import signal
| 21 | import time
| 22 | from pathlib import Path
| 23 |

@@ -92,15 +93,22 @@ def parse_args()
| 93 | p.add_argument("--sdpa-math", action="store_true",
| 94 |                help="Use MATH SDPA backend (workaround for ROCm flash attn + compile)")
| 95 |
| 96 | + ckpt_group = p.add_mutually_exclusive_group(required=True)
| 97 | + ckpt_group.add_argument("--hf-repo", type=str, default=None,
| 98 | +                         help="Push checkpoints to this HuggingFace repo (requires HF_TOKEN)")
| 99 | + ckpt_group.add_argument("--local-checkpoints", action="store_true",
| 100 | +                         help="Save checkpoints locally only")
| 101 | +
| 102 | return p.parse_args()
| 103 |
| 104 |
| 105 | def load_backbone(checkpoint_path: str, device: str) -> PAWNCLM:
| 106 | +     from pawn.checkpoint import load_backbone_weights
| 107 | +     state_dict, model_config = load_backbone_weights(checkpoint_path, device)
| 108 | +     cfg = CLMConfig(**model_config) if model_config else CLMConfig()
| 109 |       model = PAWNCLM(cfg).to(device)
| 110 | +     model.load_state_dict(state_dict)
| 111 | +     del state_dict
| 112 |       gc.collect()
| 113 |       model.eval()
| 114 |       return model

@@ -190,6 +198,10 @@ def main()
| 198 | ckpt_dir = out_dir / "checkpoints"
| 199 | ckpt_dir.mkdir(exist_ok=True)
| 200 |
| 201 | + hf_branch = None
| 202 | + if args.hf_repo:
| 203 | +     hf_branch = f"run/{out_dir.name}"
| 204 | +
| 205 | device = args.device
| 206 | print(f"Device: {device}")
| 207 | print(f"Output: {out_dir}")

@@ -309,6 +321,13 @@ def main()
| 321 | global_step = 0
| 322 | val_metrics = baseline
| 323 |
| 324 | + _shutdown_requested = False
| 325 | + def _graceful_exit(signum, frame):
| 326 | +     nonlocal _shutdown_requested
| 327 | +     _shutdown_requested = True
| 328 | + signal.signal(signal.SIGTERM, _graceful_exit)
| 329 | + signal.signal(signal.SIGINT, _graceful_exit)
| 330 | +
| 331 | print(f"\nTraining for up to {args.epochs} epochs ({total_steps} steps)")
| 332 | print(f"  Warmup: {warmup_steps} steps, LR: {args.lr}")
| 333 | if args.val_every > 1:

@@ -395,29 +414,49 @@ def main()
| 414 | if val_metrics["loss"] < best_val_loss:
| 415 |     best_val_loss = val_metrics["loss"]
| 416 |     patience_counter = 0
| 417 | +     from pawn.checkpoint import save_adapter_checkpoint
| 418 | +     save_adapter_checkpoint(
| 419 | +         ckpt_dir / "best",
| 420 | +         model.lora_state_dict(),
| 421 | +         config=vars(args),
| 422 | +         epoch=epoch,
| 423 | +         step=global_step,
| 424 | +         val_metrics=val_metrics,
| 425 | +     )
| 426 | +     if args.hf_repo and hf_branch:
| 427 | +         from pawn.checkpoint import push_checkpoint_to_hf
| 428 | +         try:
| 429 | +             push_checkpoint_to_hf(ckpt_dir / "best", args.hf_repo, hf_branch, step=global_step)
| 430 | +             print(f"Pushed to HF: {args.hf_repo}@{hf_branch}")
| 431 | +         except Exception as e:
| 432 | +             print(f"WARNING: HF push failed: {e}")
| 433 | else:
| 434 |     patience_counter += 1
| 435 |     if patience_counter >= args.patience:
| 436 |         print(f"\n  Early stopping at epoch {epoch} (patience={args.patience})")
| 437 |         break
| 438 |
| 439 | + if _shutdown_requested:
| 440 | +     print("Shutdown requested, saving checkpoint...")
| 441 | +     break
| 442 | +
| 443 | # Save final checkpoint
| 444 | + from pawn.checkpoint import save_adapter_checkpoint
| 445 | + save_adapter_checkpoint(
| 446 | +     ckpt_dir / "final",
| 447 | +     model.lora_state_dict(),
| 448 | +     config=vars(args),
| 449 | +     epoch=epoch,
| 450 | +     step=global_step,
| 451 | +     val_metrics=val_metrics,
| 452 | + )
| 453 | + if args.hf_repo and hf_branch:
| 454 | +     from pawn.checkpoint import push_checkpoint_to_hf
| 455 | +     try:
| 456 | +         push_checkpoint_to_hf(ckpt_dir / "final", args.hf_repo, hf_branch, step=global_step)
| 457 | +         print(f"Pushed to HF: {args.hf_repo}@{hf_branch}")
| 458 | +     except Exception as e:
| 459 | +         print(f"WARNING: HF push failed: {e}")
| 460 |
| 461 | print(f"\nDone. Best val_loss={best_val_loss:.4f}")
| 462 | print(f"Checkpoints saved to {out_dir}")
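Note that the HF push is deliberately non-fatal in all of these scripts: a failed upload only prints a warning, so a long run is never killed by a network hiccup. A sketch of that wrapper with a stand-in push function (`fake_push` is hypothetical; the real call is `pawn.checkpoint.push_checkpoint_to_hf`):

```python
def push_with_warning(push_fn, ckpt_path, repo, branch, *, step):
    """Best-effort push: errors are downgraded to a warning so training
    continues, mirroring the try/except around push_checkpoint_to_hf."""
    try:
        push_fn(ckpt_path, repo, branch, step=step)
        print(f"Pushed to HF: {repo}@{branch}")
        return True
    except Exception as e:
        print(f"WARNING: HF push failed: {e}")
        return False

# Hypothetical stand-in for pawn.checkpoint.push_checkpoint_to_hf.
def fake_push(path, repo, branch, *, step):
    if repo == "bad/repo":
        raise RuntimeError("401 Unauthorized")

ok = push_with_warning(fake_push, "ckpt/best", "org/pawn-lora", "run/demo", step=10)
bad = push_with_warning(fake_push, "ckpt/best", "bad/repo", "run/demo", step=10)
print(ok, bad)  # True False
```

The broad `except Exception` is intentional here: the checkpoint is already safely on disk before the push is attempted, so any upload failure is recoverable by re-pushing later.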
|
scripts/train_sparse.py
CHANGED
|
@@ -17,6 +17,7 @@ import argparse
| 17 | import gc
| 18 | import json
| 19 | import math
| 20 | + import signal
| 21 | import time
| 22 | from pathlib import Path
| 23 |

@@ -83,15 +84,22 @@ def parse_args()
| 84 | p.add_argument("--sdpa-math", action="store_true",
| 85 |                help="Use MATH SDPA backend (workaround for ROCm flash attn + compile)")
| 86 |
| 87 | + ckpt_group = p.add_mutually_exclusive_group(required=True)
| 88 | + ckpt_group.add_argument("--hf-repo", type=str, default=None,
| 89 | +                         help="Push checkpoints to this HuggingFace repo (requires HF_TOKEN)")
| 90 | + ckpt_group.add_argument("--local-checkpoints", action="store_true",
| 91 | +                         help="Save checkpoints locally only")
| 92 | +
| 93 | return p.parse_args()
| 94 |
| 95 |
| 96 | def load_backbone(checkpoint_path: str, device: str) -> PAWNCLM:
| 97 | +     from pawn.checkpoint import load_backbone_weights
| 98 | +     state_dict, model_config = load_backbone_weights(checkpoint_path, device)
| 99 | +     cfg = CLMConfig(**model_config) if model_config else CLMConfig()
| 100 |      model = PAWNCLM(cfg).to(device)
| 101 | +    model.load_state_dict(state_dict)
| 102 | +    del state_dict
| 103 |      gc.collect()
| 104 |      model.eval()
| 105 |      return model

@@ -182,6 +190,10 @@ def main()
| 190 | ckpt_dir = out_dir / "checkpoints"
| 191 | ckpt_dir.mkdir(exist_ok=True)
| 192 |
| 193 | + hf_branch = None
| 194 | + if args.hf_repo:
| 195 | +     hf_branch = f"run/{out_dir.name}"
| 196 | +
| 197 | device = args.device
| 198 | print(f"Device: {device}")
| 199 | print(f"Output: {out_dir}")

@@ -299,6 +311,13 @@ def main()
| 311 | global_step = 0
| 312 | val_metrics = baseline
| 313 |
| 314 | + _shutdown_requested = False
| 315 | + def _graceful_exit(signum, frame):
| 316 | +     nonlocal _shutdown_requested
| 317 | +     _shutdown_requested = True
| 318 | + signal.signal(signal.SIGTERM, _graceful_exit)
| 319 | + signal.signal(signal.SIGINT, _graceful_exit)
| 320 | +
| 321 | print(f"\nTraining for up to {args.epochs} epochs ({total_steps} steps)")
| 322 | print(f"  Warmup: {warmup_steps} steps, LR: {args.lr}")
| 323 |

@@ -379,28 +398,48 @@ def main()
| 398 | if val_metrics["loss"] < best_val_loss:
| 399 |     best_val_loss = val_metrics["loss"]
| 400 |     patience_counter = 0
| 401 | +     from pawn.checkpoint import save_adapter_checkpoint
| 402 | +     save_adapter_checkpoint(
| 403 | +         ckpt_dir / "best",
| 404 | +         model.sparse_state_dict(),
| 405 | +         config=vars(args),
| 406 | +         epoch=epoch,
| 407 | +         step=global_step,
| 408 | +         val_metrics=val_metrics,
| 409 | +     )
| 410 | +     if args.hf_repo and hf_branch:
| 411 | +         from pawn.checkpoint import push_checkpoint_to_hf
| 412 | +         try:
| 413 | +             push_checkpoint_to_hf(ckpt_dir / "best", args.hf_repo, hf_branch, step=global_step)
| 414 | +             print(f"Pushed to HF: {args.hf_repo}@{hf_branch}")
| 415 | +         except Exception as e:
| 416 | +             print(f"WARNING: HF push failed: {e}")
| 417 | else:
| 418 |     patience_counter += 1
| 419 |     if patience_counter >= args.patience:
| 420 |         print(f"\n  Early stopping at epoch {epoch} (patience={args.patience})")
| 421 |         break
| 422 |
| 423 | + if _shutdown_requested:
| 424 | +     print("Shutdown requested, saving checkpoint...")
| 425 | +     break
| 426 | +
| 427 | + from pawn.checkpoint import save_adapter_checkpoint
| 428 | + save_adapter_checkpoint(
| 429 | +     ckpt_dir / "final",
| 430 | +     model.sparse_state_dict(),
| 431 | +     config=vars(args),
| 432 | +     epoch=epoch,
| 433 | +     step=global_step,
| 434 | +     val_metrics=val_metrics,
| 435 | + )
| 436 | + if args.hf_repo and hf_branch:
| 437 | +     from pawn.checkpoint import push_checkpoint_to_hf
| 438 | +     try:
| 439 | +         push_checkpoint_to_hf(ckpt_dir / "final", args.hf_repo, hf_branch, step=global_step)
| 440 | +         print(f"Pushed to HF: {args.hf_repo}@{hf_branch}")
| 441 | +     except Exception as e:
| 442 | +         print(f"WARNING: HF push failed: {e}")
| 443 |
| 444 | print(f"\nDone. Best val_loss={best_val_loss:.4f}")
| 445 | print(f"Checkpoints saved to {out_dir}")
|
scripts/train_tiny.py
CHANGED
|
@@ -15,6 +15,7 @@ from __future__ import annotations
| 15 |
| 16 | import argparse
| 17 | import math
| 18 | + import signal
| 19 | import time
| 20 | from pathlib import Path
| 21 |

@@ -228,6 +229,12 @@ def parse_args()
| 229 | p.add_argument("--no-compile", action="store_true")
| 230 | p.add_argument("--num-workers", type=int, default=3)
| 231 |
| 232 | + ckpt_group = p.add_mutually_exclusive_group(required=True)
| 233 | + ckpt_group.add_argument("--hf-repo", type=str, default=None,
| 234 | +                         help="Push checkpoints to this HuggingFace repo (requires HF_TOKEN)")
| 235 | + ckpt_group.add_argument("--local-checkpoints", action="store_true",
| 236 | +                         help="Save checkpoints locally only")
| 237 | +
| 238 | return p.parse_args()
| 239 |
| 240 |

@@ -255,6 +262,10 @@ def main()
| 262 | ckpt_dir = out_dir / "checkpoints"
| 263 | ckpt_dir.mkdir(exist_ok=True)
| 264 |
| 265 | + hf_branch = None
| 266 | + if args.hf_repo:
| 267 | +     hf_branch = f"run/{out_dir.name}"
| 268 | +
| 269 | # Build model
| 270 | model = TinyChessLM(
| 271 |     vocab_size=vocab_size,

@@ -358,6 +369,13 @@ def main()
| 369 | global_step = 0
| 370 | val_metrics = baseline
| 371 |
| 372 | + _shutdown_requested = False
| 373 | + def _graceful_exit(signum, frame):
| 374 | +     nonlocal _shutdown_requested
| 375 | +     _shutdown_requested = True
| 376 | + signal.signal(signal.SIGTERM, _graceful_exit)
| 377 | + signal.signal(signal.SIGINT, _graceful_exit)
| 378 | +
| 379 | print(f"\nTraining for up to {args.epochs} epochs ({total_steps} steps)")
| 380 | print(f"  Warmup: {warmup_steps} steps, LR: {args.lr}")
| 381 |

@@ -428,24 +446,39 @@ def main()
| 446 | if val_metrics["loss"] < best_val_loss:
| 447 |     best_val_loss = val_metrics["loss"]
| 448 |     patience_counter = 0
| 449 | +     from pawn.checkpoint import save_pretrain_checkpoint
| 450 | +     save_pretrain_checkpoint(
| 451 | +         ckpt_dir / "best",
| 452 | +         model=model,
| 453 | +         optimizer=optimizer,
| 454 | +         scheduler=scheduler,
| 455 | +         scaler=scaler,
| 456 | +         global_step=global_step,
| 457 | +         model_config={
| 458 | +             "d_model": args.d_model,
| 459 | +             "n_layers": args.n_layers,
| 460 | +             "n_heads": args.n_heads,
| 461 | +             "d_ff": args.d_ff,
| 462 | +         },
| 463 | +         training_config=vars(args),
| 464 | +     )
| 465 | +     if args.hf_repo and hf_branch:
| 466 | +         from pawn.checkpoint import push_checkpoint_to_hf
| 467 | +         try:
| 468 | +             push_checkpoint_to_hf(ckpt_dir / "best", args.hf_repo, hf_branch, step=global_step)
| 469 | +             print(f"Pushed to HF: {args.hf_repo}@{hf_branch}")
| 470 | +         except Exception as e:
| 471 | +             print(f"WARNING: HF push failed: {e}")
| 472 | else:
| 473 |     patience_counter += 1
| 474 |     if patience_counter >= args.patience:
| 475 |         print(f"\n  Early stopping at epoch {epoch} (patience={args.patience})")
| 476 |         break
| 477 |
| 478 | + if _shutdown_requested:
| 479 | +     print("Shutdown requested, saving checkpoint...")
| 480 | +     break
| 481 | +
| 482 | logger.close()
| 483 | print(f"\nDone. Best val_loss={best_val_loss:.4f}")
| 484 | print(f"Checkpoints saved to {out_dir}")
|
tests/test_checkpoint.py
ADDED
@@ -0,0 +1,359 @@
+"""Tests for pawn.checkpoint safetensors save/load."""
+
+import json
+import tempfile
+from pathlib import Path
+
+import torch
+import torch.nn as nn
+
+from pawn.config import CLMConfig
+from pawn.model import PAWNCLM
+import pytest
+
+from pawn.checkpoint import (
+    save_pretrain_checkpoint,
+    load_pretrain_checkpoint,
+    save_adapter_checkpoint,
+    load_adapter_checkpoint,
+    load_backbone_weights,
+    is_legacy_checkpoint,
+    IncompleteCheckpointError,
+    CheckpointIntegrityError,
+    _flatten_optimizer_state,
+    _unflatten_optimizer_state,
+    _rng_to_json,
+    _json_to_rng,
+    _verify_complete_sentinel,
+)
+
+
+def _make_toy_model(device="cpu"):
+    cfg = CLMConfig.toy()
+    model = PAWNCLM(cfg).to(device)
+    return model, cfg
+
+
+def _make_optimizer(model):
+    return torch.optim.AdamW(model.parameters(), lr=1e-3)
+
+
+class FakeScheduler:
+    """Minimal scheduler stand-in for testing."""
+    def __init__(self):
+        self._step = 0
+
+    def state_dict(self):
+        return {"step": self._step}
+
+    def load_state_dict(self, d):
+        self._step = d["step"]
+
+
+class FakeScaler:
+    """Minimal scaler stand-in for testing."""
+    def __init__(self):
+        self._scale = 65536.0
+
+    def state_dict(self):
+        return {"scale": self._scale, "_growth_tracker": 0}
+
+    def load_state_dict(self, d):
+        self._scale = d["scale"]
+
+
+def test_pretrain_checkpoint_roundtrip():
+    """Save and load a pretrain checkpoint, verify model weights match."""
+    model1, cfg = _make_toy_model()
+    opt1 = _make_optimizer(model1)
+    sched1 = FakeScheduler()
+    sched1._step = 42
+    scaler1 = FakeScaler()
+
+    # Run a fake step to populate optimizer state
+    x = torch.randint(0, cfg.vocab_size, (2, cfg.max_seq_len))
+    mask = torch.ones(2, cfg.max_seq_len, dtype=torch.bool)
+    targets = torch.randint(0, cfg.vocab_size, (2, cfg.max_seq_len))
+    loss, _ = model1.forward_train(x, mask, targets)
+    loss.backward()
+    opt1.step()
+
+    with tempfile.TemporaryDirectory() as tmpdir:
+        ckpt_path = Path(tmpdir) / "step_00000001"
+        save_pretrain_checkpoint(
+            ckpt_path, model1, opt1, sched1, scaler1,
+            global_step=1,
+            model_config=cfg.__dict__,
+            training_config={"lr": 1e-3},
+        )
+
+        # Verify files exist
+        assert (ckpt_path / "model.safetensors").exists()
+        assert (ckpt_path / "optimizer.safetensors").exists()
+        assert (ckpt_path / "training_state.json").exists()
+        assert (ckpt_path / "config.json").exists()
+
+        # Load into fresh model
+        model2, _ = _make_toy_model()
+        opt2 = _make_optimizer(model2)
+        sched2 = FakeScheduler()
+        scaler2 = FakeScaler()
+
+        meta = load_pretrain_checkpoint(
+            ckpt_path, model2, opt2, sched2, scaler2
+        )
+
+        assert meta["global_step"] == 1
+        assert meta["model_config"]["d_model"] == cfg.d_model
+        assert sched2._step == 42
+
+        # Verify model weights match
+        for k in model1.state_dict():
+            torch.testing.assert_close(
+                model1.state_dict()[k],
+                model2.state_dict()[k],
+            )
+
+
+def test_optimizer_flatten_unflatten():
+    """Verify optimizer state survives flatten/unflatten."""
+    model, cfg = _make_toy_model()
+    opt = _make_optimizer(model)
+
+    # Populate optimizer state
+    x = torch.randint(0, cfg.vocab_size, (2, cfg.max_seq_len))
+    mask = torch.ones(2, cfg.max_seq_len, dtype=torch.bool)
+    targets = torch.randint(0, cfg.vocab_size, (2, cfg.max_seq_len))
+    loss, _ = model.forward_train(x, mask, targets)
+    loss.backward()
+    opt.step()
+
+    orig_state = opt.state_dict()
+    tensors, meta = _flatten_optimizer_state(orig_state)
+
+    # All tensors should be named "state.{id}.{key}"
+    for k in tensors:
+        assert k.startswith("state."), f"Unexpected key: {k}"
+
+    restored = _unflatten_optimizer_state(tensors, meta)
+
+    # Verify param_groups match
+    assert len(restored["param_groups"]) == len(orig_state["param_groups"])
+
+    # Verify state tensors match
+    for param_id in orig_state["state"]:
+        for key in ("exp_avg", "exp_avg_sq"):
+            torch.testing.assert_close(
+                orig_state["state"][param_id][key],
+                restored["state"][param_id][key],
+            )
+
+
+def test_rng_roundtrip():
+    """Verify RNG state survives JSON serialization."""
+    torch_rng = torch.get_rng_state()
+    cuda_rng = None
+
+    data = _rng_to_json(torch_rng, cuda_rng)
+    assert "torch_rng_state" in data
+    assert "cuda_rng_state" not in data
+
+    restored_torch, restored_cuda = _json_to_rng(data)
+    assert restored_cuda is None
+    assert torch.equal(torch_rng, restored_torch)
+
+
+def test_adapter_checkpoint_roundtrip():
+    """Save and load an adapter checkpoint."""
+    adapter_weights = {
+        "down.weight": torch.randn(8, 64),
+        "up.weight": torch.zeros(64, 8),
+    }
+
+    with tempfile.TemporaryDirectory() as tmpdir:
+        ckpt_path = Path(tmpdir) / "best"
+        save_adapter_checkpoint(
+            ckpt_path,
+            adapter_weights,
+            config={"bottleneck_dim": 8, "checkpoint_type": "bottleneck"},
+            epoch=5,
+            step=1000,
+            val_metrics={"loss": 3.14, "top1_accuracy": 0.07},
+            extra={"best_val_loss": 3.10, "patience_counter": 2},
+        )
+
+        assert (ckpt_path / "adapter.safetensors").exists()
+        assert (ckpt_path / "config.json").exists()
+        assert (ckpt_path / "training_state.json").exists()
+        assert not (ckpt_path / "optimizer.safetensors").exists()  # no optimizer passed
+
+        loaded = load_adapter_checkpoint(ckpt_path)
+        assert loaded["epoch"] == 5
+        assert loaded["step"] == 1000
+        assert loaded["val_metrics"]["loss"] == 3.14
+        assert loaded["best_val_loss"] == 3.10
+
+        for k in adapter_weights:
+            torch.testing.assert_close(adapter_weights[k], loaded["adapter_state_dict"][k])
+
+
+def test_load_backbone_weights_new_format():
+    """load_backbone_weights works with new directory format."""
+    model1, cfg = _make_toy_model()
+
+    with tempfile.TemporaryDirectory() as tmpdir:
+        ckpt_path = Path(tmpdir) / "step_00000001"
+        opt = _make_optimizer(model1)
+        save_pretrain_checkpoint(
+            ckpt_path, model1, opt, FakeScheduler(), FakeScaler(),
+            global_step=1, model_config=cfg.__dict__, training_config={},
+        )
+
+        state_dict, model_config = load_backbone_weights(ckpt_path)
+        assert model_config["d_model"] == cfg.d_model
+        for k in model1.state_dict():
+            torch.testing.assert_close(model1.state_dict()[k], state_dict[k])
+
+
+def test_load_backbone_weights_legacy():
+    """load_backbone_weights works with legacy .pt files."""
+    model1, cfg = _make_toy_model()
+
+    with tempfile.TemporaryDirectory() as tmpdir:
+        pt_path = Path(tmpdir) / "model.pt"
+        torch.save({
+            "model_state_dict": model1.state_dict(),
+            "model_config": cfg.__dict__,
+        }, pt_path)
+
+        state_dict, model_config = load_backbone_weights(pt_path)
+        assert model_config["d_model"] == cfg.d_model
+        for k in model1.state_dict():
+            torch.testing.assert_close(model1.state_dict()[k], state_dict[k])
+
+
+def test_is_legacy_checkpoint():
+    """Detect legacy vs new format."""
+    with tempfile.TemporaryDirectory() as tmpdir:
+        pt = Path(tmpdir) / "step.pt"
+        pt.touch()
+        assert is_legacy_checkpoint(pt)
+
+        d = Path(tmpdir) / "step_00001"
+        d.mkdir()
+        assert not is_legacy_checkpoint(d)
+
+
+def test_config_json_contents():
+    """Verify config.json has expected structure."""
+    model, cfg = _make_toy_model()
+    opt = _make_optimizer(model)
+
+    with tempfile.TemporaryDirectory() as tmpdir:
+        ckpt_path = Path(tmpdir) / "step_00000001"
+        save_pretrain_checkpoint(
+            ckpt_path, model, opt, FakeScheduler(), FakeScaler(),
+            global_step=1, model_config=cfg.__dict__,
+            training_config={"lr": 1e-3, "batch_size": 32},
+        )
+
+        with open(ckpt_path / "config.json") as f:
+            config = json.load(f)
+
+        assert config["format_version"] == 1
+        assert config["checkpoint_type"] == "pretrain"
+        assert config["model_config"]["d_model"] == cfg.d_model
+        assert config["training_config"]["lr"] == 1e-3
+
+
+def test_complete_sentinel_written():
+    """Verify .complete sentinel is created with SHA-256 hashes."""
+    model, cfg = _make_toy_model()
+    opt = _make_optimizer(model)
+
+    with tempfile.TemporaryDirectory() as tmpdir:
+        ckpt_path = Path(tmpdir) / "step_00000001"
+        save_pretrain_checkpoint(
+            ckpt_path, model, opt, FakeScheduler(), FakeScaler(),
+            global_step=1, model_config=cfg.__dict__, training_config={},
+        )
+
+        sentinel_path = ckpt_path / ".complete"
+        assert sentinel_path.exists()
+
+        with open(sentinel_path) as f:
+            sentinel = json.load(f)
+
+        assert "files" in sentinel
+        assert "model.safetensors" in sentinel["files"]
+        assert "config.json" in sentinel["files"]
+        assert "training_state.json" in sentinel["files"]
+        # Each hash should be a 64-char hex string
+        for name, h in sentinel["files"].items():
+            assert len(h) == 64, f"Bad hash length for {name}: {len(h)}"
+
+
+def test_no_tmp_directory_after_save():
+    """Verify .tmp directory is cleaned up after successful save."""
+    model, cfg = _make_toy_model()
+    opt = _make_optimizer(model)
+
+    with tempfile.TemporaryDirectory() as tmpdir:
+        ckpt_path = Path(tmpdir) / "step_00000001"
+        save_pretrain_checkpoint(
+            ckpt_path, model, opt, FakeScheduler(), FakeScaler(),
+            global_step=1, model_config=cfg.__dict__, training_config={},
+        )
+
+        tmp_path = Path(tmpdir) / "step_00000001.tmp"
+        assert not tmp_path.exists(), ".tmp directory should be removed after rename"
+
+
+def test_incomplete_checkpoint_raises():
+    """Loading a checkpoint without .complete raises IncompleteCheckpointError."""
+    with tempfile.TemporaryDirectory() as tmpdir:
+        ckpt_path = Path(tmpdir) / "step_bad"
+        ckpt_path.mkdir()
+        # Write some files but no .complete
+        (ckpt_path / "model.safetensors").touch()
+        (ckpt_path / "config.json").write_text("{}")
+
+        with pytest.raises(IncompleteCheckpointError):
+            _verify_complete_sentinel(ckpt_path)
+
+
+def test_corrupted_file_raises():
+    """Loading a checkpoint with a tampered file raises CheckpointIntegrityError."""
+    model, cfg = _make_toy_model()
+    opt = _make_optimizer(model)
+
+    with tempfile.TemporaryDirectory() as tmpdir:
+        ckpt_path = Path(tmpdir) / "step_00000001"
+        save_pretrain_checkpoint(
+            ckpt_path, model, opt, FakeScheduler(), FakeScaler(),
+            global_step=1, model_config=cfg.__dict__, training_config={},
+        )
+
+        # Corrupt a file after save
+        config_path = ckpt_path / "config.json"
+        config_path.write_text("CORRUPTED DATA")
+
+        with pytest.raises(CheckpointIntegrityError):
+            _verify_complete_sentinel(ckpt_path)
+
+
+def test_adapter_checkpoint_has_sentinel():
+    """Adapter checkpoints also get .complete sentinel."""
+    adapter_weights = {"down.weight": torch.randn(8, 64)}
+
+    with tempfile.TemporaryDirectory() as tmpdir:
+        ckpt_path = Path(tmpdir) / "best"
+        save_adapter_checkpoint(
+            ckpt_path, adapter_weights,
+            config={"checkpoint_type": "bottleneck"},
+            epoch=1, step=100, val_metrics={"loss": 3.0},
+        )
+
+        assert (ckpt_path / ".complete").exists()
+        # Should not raise
+        _verify_complete_sentinel(ckpt_path)
tests/test_clm_format.py
ADDED
@@ -0,0 +1,224 @@
+"""Tests for CLM sequence format, off-by-one boundaries, and engine/Python consistency."""
+
+import numpy as np
+import torch
+import pytest
+
+import chess_engine as engine
+
+from pawn.config import (
+    PAD_TOKEN,
+    OUTCOME_TOKEN_BASE,
+    WHITE_CHECKMATES,
+    BLACK_CHECKMATES,
+    STALEMATE,
+    DRAW_BY_RULE,
+    PLY_LIMIT,
+)
+from pawn.data import (
+    _map_termination_to_outcome,
+    pack_clm_sequences,
+    _to_clm_batch,
+)
+
+
+# ---------------------------------------------------------------------------
+# Vocabulary consistency
+# ---------------------------------------------------------------------------
+
+
+class TestVocabConsistency:
+    def test_vocab_size(self):
+        """Engine move vocab + PAD + outcomes = model vocab_size."""
+        vocab = engine.export_move_vocabulary()
+        n_moves = len(vocab["token_to_move"])
+        # 4096 grid + 176 promotions = 4272 move tokens
+        assert n_moves == 4272
+        # 1 PAD + 4272 moves + 5 outcomes = 4278
+        from pawn.config import CLMConfig
+        assert CLMConfig().vocab_size == 1 + n_moves + 5
+
+    def test_outcome_tokens_not_in_move_vocab(self):
+        """Outcome tokens (4273-4277) must not appear in the move vocabulary."""
+        vocab = engine.export_move_vocabulary()
+        for token_id in range(OUTCOME_TOKEN_BASE, PLY_LIMIT + 1):
+            assert token_id not in vocab["token_to_move"], \
+                f"Outcome token {token_id} should not be in move vocab"
+
+    def test_no_eog_in_raw_move_ids(self):
+        """Raw move_ids from generate_random_games should not contain
+        tokens >= OUTCOME_BASE (no EOG or outcome tokens in move data)."""
+        move_ids, game_lengths, _tc = engine.generate_random_games(64, 255, seed=42)
+        for b in range(64):
+            gl = int(game_lengths[b])
+            # Moves should be in range 1-4272
+            for t in range(gl):
+                tok = int(move_ids[b, t])
+                assert 1 <= tok <= 4272, \
+                    f"Game {b}, ply {t}: expected move token, got {tok}"
+            # Position game_length should be PAD (0), not EOG
+            if gl < 255:
+                assert move_ids[b, gl] == PAD_TOKEN, \
+                    f"Game {b}: position {gl} should be PAD, got {move_ids[b, gl]}"
+
+
+# ---------------------------------------------------------------------------
+# CLM batch format (Rust engine)
+# ---------------------------------------------------------------------------
+
+
+class TestRustCLMBatch:
+    @pytest.fixture
+    def clm_batch(self):
+        return engine.generate_clm_batch(32, 256, seed=42)
+
+    def test_shapes(self, clm_batch):
+        input_ids, targets, loss_mask, move_ids, game_lengths, term_codes = clm_batch
+        assert input_ids.shape == (32, 256)
+        assert targets.shape == (32, 256)
+        assert loss_mask.shape == (32, 256)
+        assert move_ids.shape == (32, 255)
+        assert game_lengths.shape == (32,)
+        assert term_codes.shape == (32,)
+
+    def test_position_zero_is_outcome(self, clm_batch):
+        input_ids, *_ = clm_batch
+        for b in range(input_ids.shape[0]):
+            tok = int(input_ids[b, 0])
+            assert OUTCOME_TOKEN_BASE <= tok <= PLY_LIMIT, \
+                f"Game {b}: position 0 should be outcome token, got {tok}"
+
+    def test_moves_in_valid_range(self, clm_batch):
+        input_ids, _, _, _, game_lengths, _ = clm_batch
+        for b in range(input_ids.shape[0]):
+            gl = int(game_lengths[b])
+            for t in range(1, gl + 1):
+                tok = int(input_ids[b, t])
+                assert 1 <= tok <= 4272, \
+                    f"Game {b}, position {t}: expected move token, got {tok}"
+
+    def test_padding_is_zero(self, clm_batch):
+        input_ids, _, _, _, game_lengths, _ = clm_batch
+        for b in range(input_ids.shape[0]):
+            gl = int(game_lengths[b])
+            for t in range(gl + 1, 256):
+                assert input_ids[b, t] == 0, \
+                    f"Game {b}, position {t}: expected PAD, got {input_ids[b, t]}"
+
+    def test_target_shift_correct(self, clm_batch):
+        input_ids, targets, *_ = clm_batch
+        B, T = input_ids.shape
+        for b in range(B):
+            for t in range(T - 1):
+                assert targets[b, t] == input_ids[b, t + 1], \
+                    f"Game {b}: targets[{t}]={targets[b, t]} != input_ids[{t+1}]={input_ids[b, t + 1]}"
+            assert targets[b, T - 1] == 0, "Last target should be PAD"
+
+    def test_target_at_game_end_is_pad(self, clm_batch):
+        _, targets, _, _, game_lengths, _ = clm_batch
+        for b in range(targets.shape[0]):
+            gl = int(game_lengths[b])
+            assert targets[b, gl] == 0, \
+                f"Game {b}: target at game_length={gl} should be PAD, got {targets[b, gl]}"
+
+    def test_loss_mask_boundary(self, clm_batch):
+        _, _, loss_mask, _, game_lengths, _ = clm_batch
+        for b in range(loss_mask.shape[0]):
+            gl = int(game_lengths[b])
+            # True for positions 0..=gl
+            for t in range(gl + 1):
+                assert loss_mask[b, t], \
+                    f"Game {b}: loss_mask[{t}] should be True (gl={gl})"
+            # False after gl
+            for t in range(gl + 1, 256):
+                assert not loss_mask[b, t], \
+                    f"Game {b}: loss_mask[{t}] should be False (gl={gl})"
+
+    def test_raw_move_ids_replayable(self, clm_batch):
+        """Raw move_ids from generate_clm_batch should work with replay functions."""
+        _, _, _, move_ids, game_lengths, _ = clm_batch
+        # validate_games should confirm all games are legal
+        is_valid, first_illegal = engine.validate_games(move_ids, game_lengths)
+        assert all(is_valid), "All generated games should be valid"
+        # compute_legal_move_masks should not error
+        grid, promo = engine.compute_legal_move_masks(move_ids, game_lengths)
+        assert grid.shape[0] == 32
+
+
+# ---------------------------------------------------------------------------
+# Rust CLM matches Python pack_clm_sequences
+# ---------------------------------------------------------------------------
+
+
+class TestRustPythonConsistency:
+    def test_rust_clm_matches_python_pack(self):
+        """Rust generate_clm_batch should produce identical output to
+        Python _to_clm_batch with the same seed."""
+        seq_len = 256
+        seed = 42
+        B = 16
+
+        # Rust path
+        r_input_ids, r_targets, r_loss_mask, r_move_ids, r_gl, r_tc = \
+            engine.generate_clm_batch(B, seq_len, seed)
+
+        # Python path: pack the same raw games with _to_clm_batch
+        py_batch = _to_clm_batch(r_move_ids, r_gl, r_tc, seq_len)
+
+        # Compare
+        r_input_ids_t = torch.from_numpy(r_input_ids).long()
+        r_targets_t = torch.from_numpy(r_targets).long()
+        r_loss_mask_t = torch.from_numpy(r_loss_mask)
+
+        assert torch.equal(r_input_ids_t, py_batch["input_ids"]), \
+            "input_ids mismatch between Rust and Python"
+        assert torch.equal(r_targets_t, py_batch["targets"]), \
+            "targets mismatch between Rust and Python"
+        assert torch.equal(r_loss_mask_t, py_batch["loss_mask"]), \
+            "loss_mask mismatch between Rust and Python"
+
+
+# ---------------------------------------------------------------------------
+# Boundary / edge cases
+# ---------------------------------------------------------------------------
+
+
+class TestBoundaryCases:
+    def test_boundary_max_length_game(self):
+        """Test with games that may fill all 255 move slots.
+
+        We generate many games and check that any near-max-length game
+        is correctly formatted (no off-by-one at the boundary).
+        """
+        input_ids, targets, loss_mask, _, game_lengths, _ = \
+            engine.generate_clm_batch(256, 256, seed=123)
+
+        for b in range(256):
+            gl = int(game_lengths[b])
+            # Regardless of length, format invariants hold
+            assert OUTCOME_TOKEN_BASE <= input_ids[b, 0] <= PLY_LIMIT
+            if gl + 1 < 256:
+                assert input_ids[b, gl + 1] == 0
+            assert loss_mask[b, gl] == True
+            if gl + 1 < 256:
+                assert loss_mask[b, gl + 1] == False
+
+    def test_discard_ply_limit(self):
+        """generate_clm_batch with discard_ply_limit should only have
+        naturally-terminated games (no PLY_LIMIT outcome)."""
+        input_ids, _, _, _, _, term_codes = \
+            engine.generate_clm_batch(32, 256, seed=42, discard_ply_limit=True)
+
+        for b in range(32):
+            assert term_codes[b] != 5, \
+                f"Game {b} has PLY_LIMIT termination but discard_ply_limit=True"
+            assert input_ids[b, 0] != PLY_LIMIT, \
+                f"Game {b} has PLY_LIMIT outcome token but discard_ply_limit=True"
+
+    def test_determinism(self):
+        """Same seed should produce identical results."""
+        r1 = engine.generate_clm_batch(8, 256, seed=99)
+        r2 = engine.generate_clm_batch(8, 256, seed=99)
+        assert np.array_equal(r1[0], r2[0]), "input_ids should be deterministic"
+        assert np.array_equal(r1[1], r2[1]), "targets should be deterministic"
+        assert np.array_equal(r1[2], r2[2]), "loss_mask should be deterministic"
tests/test_graceful_shutdown.py
ADDED
@@ -0,0 +1,130 @@
+"""Regression test: training processes exit gracefully on SIGTERM.
+
+Spawns a toy training run as a subprocess, sends SIGTERM, and verifies
+that the process exits cleanly with valid checkpoints.
+"""
+
+import json
+import os
+import signal
+import subprocess
+import sys
+import tempfile
+import time
+from pathlib import Path
+
+import pytest
+
+from pawn.checkpoint import load_backbone_weights, _verify_complete_sentinel
+
+
+@pytest.fixture
+def train_tmpdir():
+    with tempfile.TemporaryDirectory() as d:
+        yield Path(d)
+
+
+def _wait_for_training_start(proc: subprocess.Popen, timeout: float = 30) -> None:
+    """Wait until the training process prints 'Starting training'."""
+    deadline = time.time() + timeout
+    assert proc.stdout is not None
+    while time.time() < deadline:
+        line = proc.stdout.readline()
+        if not line:
+            time.sleep(0.1)
+            continue
+        text = line.decode("utf-8", errors="replace")
+        if "Starting training" in text:
+            return
+    raise TimeoutError("Training process did not start within timeout")
+
+
+def test_sigterm_produces_valid_checkpoint(train_tmpdir):
+    """Spawn a toy training run, send SIGTERM, verify checkpoint is valid."""
+    ckpt_dir = train_tmpdir / "checkpoints"
+    log_dir = train_tmpdir / "logs"
+
+    proc = subprocess.Popen(
+        [
+            sys.executable, "scripts/train.py",
+            "--toy",
+            "--local-checkpoints",
+            "--total-steps", "5000",
+            "--device", "cpu",
+            "--num-workers", "0",
+            "--checkpoint-dir", str(ckpt_dir),
+            "--log-dir", str(log_dir),
+        ],
+        stdout=subprocess.PIPE,
+        stderr=subprocess.STDOUT,
+        env={**os.environ, "PYTHONUNBUFFERED": "1"},
+    )
+
+    try:
+        _wait_for_training_start(proc)
+
+        # Let it train for a few checkpoints
+        time.sleep(5)
+
+        # Send SIGTERM
+        proc.send_signal(signal.SIGTERM)
+
+        # Wait for graceful exit
+        returncode = proc.wait(timeout=30)
+    except Exception:
+        proc.kill()
+        proc.wait()
+        raise
+
+    # Process should exit cleanly (0 from break, not 128+signal from sys.exit)
+    assert returncode == 0, f"Training exited with code {returncode}"
+
+    # Find checkpoints (trainer saves under {log_dir}/run_*/checkpoints/)
+    checkpoints = sorted(log_dir.glob("run_*/checkpoints/step_*/"))
+    assert len(checkpoints) > 0, "No checkpoints were saved"
+
+    # Every checkpoint must have .complete sentinel and pass integrity check
+    for ckpt in checkpoints:
+        assert (ckpt / ".complete").exists(), f"Missing .complete in {ckpt.name}"
+        _verify_complete_sentinel(ckpt)  # raises on failure
+
+        # Verify we can actually load the weights
+        weights, config = load_backbone_weights(ckpt)
+        assert config is not None, f"No config in {ckpt.name}"
+        assert len(weights) > 0, f"Empty weights in {ckpt.name}"
+
+
+def test_sigterm_does_not_leave_tmp_dirs(train_tmpdir):
+    """SIGTERM should not leave .tmp directories from interrupted saves."""
+    ckpt_dir = train_tmpdir / "checkpoints"
+    log_dir = train_tmpdir / "logs"
+
+    proc = subprocess.Popen(
+        [
+            sys.executable, "scripts/train.py",
+            "--toy",
+            "--local-checkpoints",
+            "--total-steps", "5000",
+            "--device", "cpu",
+            "--num-workers", "0",
+            "--checkpoint-dir", str(ckpt_dir),
+            "--log-dir", str(log_dir),
+        ],
+        stdout=subprocess.PIPE,
+        stderr=subprocess.STDOUT,
+        env={**os.environ, "PYTHONUNBUFFERED": "1"},
|
| 116 |
+
)
|
| 117 |
+
|
| 118 |
+
try:
|
| 119 |
+
_wait_for_training_start(proc)
|
| 120 |
+
time.sleep(5)
|
| 121 |
+
proc.send_signal(signal.SIGTERM)
|
| 122 |
+
proc.wait(timeout=30)
|
| 123 |
+
except Exception:
|
| 124 |
+
proc.kill()
|
| 125 |
+
proc.wait()
|
| 126 |
+
raise
|
| 127 |
+
|
| 128 |
+
# No .tmp directories should exist anywhere under log_dir
|
| 129 |
+
tmp_dirs = list(log_dir.glob("**/*.tmp"))
|
| 130 |
+
assert len(tmp_dirs) == 0, f"Leftover .tmp directories: {tmp_dirs}"
|
uv.lock CHANGED
@@ -20,24 +20,6 @@ members = [
     "pawn",
 ]
 
-[[package]]
-name = "aiofiles"
-version = "24.1.0"
-source = { registry = "https://pypi.org/simple" }
-sdist = { url = "https://files.pythonhosted.org/packages/0b/03/a88171e277e8caa88a4c77808c20ebb04ba74cc4681bf1e9416c862de237/aiofiles-24.1.0.tar.gz", hash = "sha256:22a075c9e5a3810f0c2e48f3008c94d68c65d763b9b03857924c99e57355166c", size = 30247, upload-time = "2024-06-24T11:02:03.584Z" }
-wheels = [
-    { url = "https://files.pythonhosted.org/packages/a5/45/30bb92d442636f570cb5651bc661f52b610e2eec3f891a5dc3a4c3667db0/aiofiles-24.1.0-py3-none-any.whl", hash = "sha256:b4ec55f4195e3eb5d7abd1bf7e061763e864dd4954231fb8539a0ef8bb8260e5", size = 15896, upload-time = "2024-06-24T11:02:01.529Z" },
-]
-
-[[package]]
-name = "annotated-doc"
-version = "0.0.4"
-source = { registry = "https://pypi.org/simple" }
-sdist = { url = "https://files.pythonhosted.org/packages/57/ba/046ceea27344560984e26a590f90bc7f4a75b06701f653222458922b558c/annotated_doc-0.0.4.tar.gz", hash = "sha256:fbcda96e87e9c92ad167c2e53839e57503ecfda18804ea28102353485033faa4", size = 7288, upload-time = "2025-11-10T22:07:42.062Z" }
-wheels = [
-    { url = "https://files.pythonhosted.org/packages/1e/d3/26bf1008eb3d2daa8ef4cacc7f3bfdc11818d111f7e2d0201bc6e3b49d45/annotated_doc-0.0.4-py3-none-any.whl", hash = "sha256:571ac1dc6991c450b25a9c2d84a3705e2ae7a53467b5d111c24fa8baabbed320", size = 5303, upload-time = "2025-11-10T22:07:40.673Z" },
-]
-
 [[package]]
 name = "annotated-types"
 version = "0.7.0"
@@ -93,44 +75,6 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/64/b4/17d4b0b2a2dc85a6df63d1157e028ed19f90d4cd97c36717afef2bc2f395/attrs-26.1.0-py3-none-any.whl", hash = "sha256:c647aa4a12dfbad9333ca4e71fe62ddc36f4e63b2d260a37a8b83d2f043ac309", size = 67548, upload-time = "2026-03-19T14:22:23.645Z" },
 ]
 
-[[package]]
-name = "brotli"
-version = "1.2.0"
-source = { registry = "https://pypi.org/simple" }
-sdist = { url = "https://files.pythonhosted.org/packages/f7/16/c92ca344d646e71a43b8bb353f0a6490d7f6e06210f8554c8f874e454285/brotli-1.2.0.tar.gz", hash = "sha256:e310f77e41941c13340a95976fe66a8a95b01e783d430eeaf7a2f87e0a57dd0a", size = 7388632, upload-time = "2025-11-05T18:39:42.86Z" }
-wheels = [
-    { url = "https://files.pythonhosted.org/packages/64/10/a090475284fc4a71aed40a96f32e44a7fe5bda39687353dd977720b211b6/brotli-1.2.0-cp310-cp310-macosx_10_9_universal2.whl", hash = "sha256:3b90b767916ac44e93a8e28ce6adf8d551e43affb512f2377c732d486ac6514e", size = 863089, upload-time = "2025-11-05T18:38:01.181Z" },
-    { url = "https://files.pythonhosted.org/packages/03/41/17416630e46c07ac21e378c3464815dd2e120b441e641bc516ac32cc51d2/brotli-1.2.0-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:6be67c19e0b0c56365c6a76e393b932fb0e78b3b56b711d180dd7013cb1fd984", size = 445442, upload-time = "2025-11-05T18:38:02.434Z" },
-    { url = "https://files.pythonhosted.org/packages/24/31/90cc06584deb5d4fcafc0985e37741fc6b9717926a78674bbb3ce018957e/brotli-1.2.0-cp310-cp310-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:0bbd5b5ccd157ae7913750476d48099aaf507a79841c0d04a9db4415b14842de", size = 1532658, upload-time = "2025-11-05T18:38:03.588Z" },
-    { url = "https://files.pythonhosted.org/packages/62/17/33bf0c83bcbc96756dfd712201d87342732fad70bb3472c27e833a44a4f9/brotli-1.2.0-cp310-cp310-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:3f3c908bcc404c90c77d5a073e55271a0a498f4e0756e48127c35d91cf155947", size = 1631241, upload-time = "2025-11-05T18:38:04.582Z" },
-    { url = "https://files.pythonhosted.org/packages/48/10/f47854a1917b62efe29bc98ac18e5d4f71df03f629184575b862ef2e743b/brotli-1.2.0-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:1b557b29782a643420e08d75aea889462a4a8796e9a6cf5621ab05a3f7da8ef2", size = 1424307, upload-time = "2025-11-05T18:38:05.587Z" },
-    { url = "https://files.pythonhosted.org/packages/e4/b7/f88eb461719259c17483484ea8456925ee057897f8e64487d76e24e5e38d/brotli-1.2.0-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:81da1b229b1889f25adadc929aeb9dbc4e922bd18561b65b08dd9343cfccca84", size = 1488208, upload-time = "2025-11-05T18:38:06.613Z" },
-    { url = "https://files.pythonhosted.org/packages/26/59/41bbcb983a0c48b0b8004203e74706c6b6e99a04f3c7ca6f4f41f364db50/brotli-1.2.0-cp310-cp310-musllinux_1_2_ppc64le.whl", hash = "sha256:ff09cd8c5eec3b9d02d2408db41be150d8891c5566addce57513bf546e3d6c6d", size = 1597574, upload-time = "2025-11-05T18:38:07.838Z" },
-    { url = "https://files.pythonhosted.org/packages/8e/e6/8c89c3bdabbe802febb4c5c6ca224a395e97913b5df0dff11b54f23c1788/brotli-1.2.0-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:a1778532b978d2536e79c05dac2d8cd857f6c55cd0c95ace5b03740824e0e2f1", size = 1492109, upload-time = "2025-11-05T18:38:08.816Z" },
-    { url = "https://files.pythonhosted.org/packages/ed/9a/4b19d4310b2dbd545c0c33f176b0528fa68c3cd0754e34b2f2bcf56548ae/brotli-1.2.0-cp310-cp310-win32.whl", hash = "sha256:b232029d100d393ae3c603c8ffd7e3fe6f798c5e28ddca5feabb8e8fdb732997", size = 334461, upload-time = "2025-11-05T18:38:10.729Z" },
-    { url = "https://files.pythonhosted.org/packages/ac/39/70981d9f47705e3c2b95c0847dfa3e7a37aa3b7c6030aedc4873081ed005/brotli-1.2.0-cp310-cp310-win_amd64.whl", hash = "sha256:ef87b8ab2704da227e83a246356a2b179ef826f550f794b2c52cddb4efbd0196", size = 369035, upload-time = "2025-11-05T18:38:11.827Z" },
-    { url = "https://files.pythonhosted.org/packages/7a/ef/f285668811a9e1ddb47a18cb0b437d5fc2760d537a2fe8a57875ad6f8448/brotli-1.2.0-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:15b33fe93cedc4caaff8a0bd1eb7e3dab1c61bb22a0bf5bdfdfd97cd7da79744", size = 863110, upload-time = "2025-11-05T18:38:12.978Z" },
-    { url = "https://files.pythonhosted.org/packages/50/62/a3b77593587010c789a9d6eaa527c79e0848b7b860402cc64bc0bc28a86c/brotli-1.2.0-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:898be2be399c221d2671d29eed26b6b2713a02c2119168ed914e7d00ceadb56f", size = 445438, upload-time = "2025-11-05T18:38:14.208Z" },
-    { url = "https://files.pythonhosted.org/packages/cd/e1/7fadd47f40ce5549dc44493877db40292277db373da5053aff181656e16e/brotli-1.2.0-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:350c8348f0e76fff0a0fd6c26755d2653863279d086d3aa2c290a6a7251135dd", size = 1534420, upload-time = "2025-11-05T18:38:15.111Z" },
-    { url = "https://files.pythonhosted.org/packages/12/8b/1ed2f64054a5a008a4ccd2f271dbba7a5fb1a3067a99f5ceadedd4c1d5a7/brotli-1.2.0-cp311-cp311-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:2e1ad3fda65ae0d93fec742a128d72e145c9c7a99ee2fcd667785d99eb25a7fe", size = 1632619, upload-time = "2025-11-05T18:38:16.094Z" },
-    { url = "https://files.pythonhosted.org/packages/89/5a/7071a621eb2d052d64efd5da2ef55ecdac7c3b0c6e4f9d519e9c66d987ef/brotli-1.2.0-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:40d918bce2b427a0c4ba189df7a006ac0c7277c180aee4617d99e9ccaaf59e6a", size = 1426014, upload-time = "2025-11-05T18:38:17.177Z" },
-    { url = "https://files.pythonhosted.org/packages/26/6d/0971a8ea435af5156acaaccec1a505f981c9c80227633851f2810abd252a/brotli-1.2.0-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:2a7f1d03727130fc875448b65b127a9ec5d06d19d0148e7554384229706f9d1b", size = 1489661, upload-time = "2025-11-05T18:38:18.41Z" },
-    { url = "https://files.pythonhosted.org/packages/f3/75/c1baca8b4ec6c96a03ef8230fab2a785e35297632f402ebb1e78a1e39116/brotli-1.2.0-cp311-cp311-musllinux_1_2_ppc64le.whl", hash = "sha256:9c79f57faa25d97900bfb119480806d783fba83cd09ee0b33c17623935b05fa3", size = 1599150, upload-time = "2025-11-05T18:38:19.792Z" },
-    { url = "https://files.pythonhosted.org/packages/0d/1a/23fcfee1c324fd48a63d7ebf4bac3a4115bdb1b00e600f80f727d850b1ae/brotli-1.2.0-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:844a8ceb8483fefafc412f85c14f2aae2fb69567bf2a0de53cdb88b73e7c43ae", size = 1493505, upload-time = "2025-11-05T18:38:20.913Z" },
-    { url = "https://files.pythonhosted.org/packages/36/e5/12904bbd36afeef53d45a84881a4810ae8810ad7e328a971ebbfd760a0b3/brotli-1.2.0-cp311-cp311-win32.whl", hash = "sha256:aa47441fa3026543513139cb8926a92a8e305ee9c71a6209ef7a97d91640ea03", size = 334451, upload-time = "2025-11-05T18:38:21.94Z" },
-    { url = "https://files.pythonhosted.org/packages/02/8b/ecb5761b989629a4758c394b9301607a5880de61ee2ee5fe104b87149ebc/brotli-1.2.0-cp311-cp311-win_amd64.whl", hash = "sha256:022426c9e99fd65d9475dce5c195526f04bb8be8907607e27e747893f6ee3e24", size = 369035, upload-time = "2025-11-05T18:38:22.941Z" },
-    { url = "https://files.pythonhosted.org/packages/11/ee/b0a11ab2315c69bb9b45a2aaed022499c9c24a205c3a49c3513b541a7967/brotli-1.2.0-cp312-cp312-macosx_10_13_universal2.whl", hash = "sha256:35d382625778834a7f3061b15423919aa03e4f5da34ac8e02c074e4b75ab4f84", size = 861543, upload-time = "2025-11-05T18:38:24.183Z" },
-    { url = "https://files.pythonhosted.org/packages/e1/2f/29c1459513cd35828e25531ebfcbf3e92a5e49f560b1777a9af7203eb46e/brotli-1.2.0-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:7a61c06b334bd99bc5ae84f1eeb36bfe01400264b3c352f968c6e30a10f9d08b", size = 444288, upload-time = "2025-11-05T18:38:25.139Z" },
-    { url = "https://files.pythonhosted.org/packages/3d/6f/feba03130d5fceadfa3a1bb102cb14650798c848b1df2a808356f939bb16/brotli-1.2.0-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:acec55bb7c90f1dfc476126f9711a8e81c9af7fb617409a9ee2953115343f08d", size = 1528071, upload-time = "2025-11-05T18:38:26.081Z" },
-    { url = "https://files.pythonhosted.org/packages/2b/38/f3abb554eee089bd15471057ba85f47e53a44a462cfce265d9bf7088eb09/brotli-1.2.0-cp312-cp312-manylinux2014_ppc64le.manylinux_2_17_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:260d3692396e1895c5034f204f0db022c056f9e2ac841593a4cf9426e2a3faca", size = 1626913, upload-time = "2025-11-05T18:38:27.284Z" },
-    { url = "https://files.pythonhosted.org/packages/03/a7/03aa61fbc3c5cbf99b44d158665f9b0dd3d8059be16c460208d9e385c837/brotli-1.2.0-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:072e7624b1fc4d601036ab3f4f27942ef772887e876beff0301d261210bca97f", size = 1419762, upload-time = "2025-11-05T18:38:28.295Z" },
-    { url = "https://files.pythonhosted.org/packages/21/1b/0374a89ee27d152a5069c356c96b93afd1b94eae83f1e004b57eb6ce2f10/brotli-1.2.0-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:adedc4a67e15327dfdd04884873c6d5a01d3e3b6f61406f99b1ed4865a2f6d28", size = 1484494, upload-time = "2025-11-05T18:38:29.29Z" },
-    { url = "https://files.pythonhosted.org/packages/cf/57/69d4fe84a67aef4f524dcd075c6eee868d7850e85bf01d778a857d8dbe0a/brotli-1.2.0-cp312-cp312-musllinux_1_2_ppc64le.whl", hash = "sha256:7a47ce5c2288702e09dc22a44d0ee6152f2c7eda97b3c8482d826a1f3cfc7da7", size = 1593302, upload-time = "2025-11-05T18:38:30.639Z" },
-    { url = "https://files.pythonhosted.org/packages/d5/3b/39e13ce78a8e9a621c5df3aeb5fd181fcc8caba8c48a194cd629771f6828/brotli-1.2.0-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:af43b8711a8264bb4e7d6d9a6d004c3a2019c04c01127a868709ec29962b6036", size = 1487913, upload-time = "2025-11-05T18:38:31.618Z" },
-    { url = "https://files.pythonhosted.org/packages/62/28/4d00cb9bd76a6357a66fcd54b4b6d70288385584063f4b07884c1e7286ac/brotli-1.2.0-cp312-cp312-win32.whl", hash = "sha256:e99befa0b48f3cd293dafeacdd0d191804d105d279e0b387a32054c1180f3161", size = 334362, upload-time = "2025-11-05T18:38:32.939Z" },
-    { url = "https://files.pythonhosted.org/packages/1c/4e/bc1dcac9498859d5e353c9b153627a3752868a9d5f05ce8dedd81a2354ab/brotli-1.2.0-cp312-cp312-win_amd64.whl", hash = "sha256:b35c13ce241abdd44cb8ca70683f20c0c079728a36a996297adb5334adfc1c44", size = 369115, upload-time = "2025-11-05T18:38:33.765Z" },
-]
-
 [[package]]
 name = "cachetools"
 version = "7.0.5"
@@ -462,22 +406,6 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/c1/ea/53f2148663b321f21b5a606bd5f191517cf40b7072c0497d3c92c4a13b1e/executing-2.2.1-py2.py3-none-any.whl", hash = "sha256:760643d3452b4d777d295bb167ccc74c64a81df23fb5e08eff250c425a4b2017", size = 28317, upload-time = "2025-09-01T09:48:08.5Z" },
 ]
 
-[[package]]
-name = "fastapi"
-version = "0.135.1"
-source = { registry = "https://pypi.org/simple" }
-dependencies = [
-    { name = "annotated-doc", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "pydantic", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "starlette", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "typing-extensions", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "typing-inspection", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-]
-sdist = { url = "https://files.pythonhosted.org/packages/e7/7b/f8e0211e9380f7195ba3f3d40c292594fd81ba8ec4629e3854c353aaca45/fastapi-0.135.1.tar.gz", hash = "sha256:d04115b508d936d254cea545b7312ecaa58a7b3a0f84952535b4c9afae7668cd", size = 394962, upload-time = "2026-03-01T18:18:29.369Z" }
-wheels = [
-    { url = "https://files.pythonhosted.org/packages/e4/72/42e900510195b23a56bde950d26a51f8b723846bfcaa0286e90287f0422b/fastapi-0.135.1-py3-none-any.whl", hash = "sha256:46e2fc5745924b7c840f71ddd277382af29ce1cdb7d5eab5bf697e3fb9999c9e", size = 116999, upload-time = "2026-03-01T18:18:30.831Z" },
-]
-
 [[package]]
 name = "fastjsonschema"
 version = "2.21.2"
@@ -487,15 +415,6 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/cb/a8/20d0723294217e47de6d9e2e40fd4a9d2f7c4b6ef974babd482a59743694/fastjsonschema-2.21.2-py3-none-any.whl", hash = "sha256:1c797122d0a86c5cace2e54bf4e819c36223b552017172f32c5c024a6b77e463", size = 24024, upload-time = "2025-08-14T18:49:34.776Z" },
 ]
 
-[[package]]
-name = "ffmpy"
-version = "1.0.0"
-source = { registry = "https://pypi.org/simple" }
-sdist = { url = "https://files.pythonhosted.org/packages/7d/d2/1c4c582d71bcc65c76fa69fab85de6257d50fdf6fd4a2317c53917e9a581/ffmpy-1.0.0.tar.gz", hash = "sha256:b12932e95435c8820f1cd041024402765f821971e4bae753b327fc02a6e12f8b", size = 5101, upload-time = "2025-11-11T06:24:23.856Z" }
-wheels = [
-    { url = "https://files.pythonhosted.org/packages/55/56/dd3669eccebb6d8ac81e624542ebd53fe6f08e1b8f2f8d50aeb7e3b83f99/ffmpy-1.0.0-py3-none-any.whl", hash = "sha256:5640e5f0fd03fb6236d0e119b16ccf6522db1c826fdf35dcb87087b60fd7504f", size = 5614, upload-time = "2025-11-11T06:24:22.818Z" },
-]
-
 [[package]]
 name = "filelock"
 version = "3.25.2"
@@ -571,71 +490,6 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/6a/09/e21df6aef1e1ffc0c816f0522ddc3f6dcded766c3261813131c78a704470/gitpython-3.1.46-py3-none-any.whl", hash = "sha256:79812ed143d9d25b6d176a10bb511de0f9c67b1fa641d82097b0ab90398a2058", size = 208620, upload-time = "2026-01-01T15:37:30.574Z" },
 ]
 
-[[package]]
-name = "gradio"
-version = "6.9.0"
-source = { registry = "https://pypi.org/simple" }
-dependencies = [
-    { name = "aiofiles", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "anyio", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "brotli", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "fastapi", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "ffmpy", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "gradio-client", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "groovy", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "httpx", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "huggingface-hub", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "jinja2", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "markupsafe", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "numpy", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "orjson", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "packaging", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "pandas", version = "2.3.3", source = { registry = "https://pypi.org/simple" }, marker = "(python_full_version < '3.11' and sys_platform == 'linux') or (python_full_version >= '3.11' and extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm') or (sys_platform != 'linux' and extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "pandas", version = "3.0.1", source = { registry = "https://pypi.org/simple" }, marker = "(python_full_version >= '3.11' and sys_platform == 'linux') or (python_full_version < '3.11' and extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm') or (sys_platform != 'linux' and extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "pillow", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "pydantic", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "pydub", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "python-multipart", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "pytz", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "pyyaml", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "safehttpx", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "semantic-version", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "starlette", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "tomlkit", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "typer", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "typing-extensions", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "uvicorn", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-]
-sdist = { url = "https://files.pythonhosted.org/packages/bd/83/29bdbf94b212512e3c775482d390f5b699a72d71a2c431dea367a6e45a37/gradio-6.9.0.tar.gz", hash = "sha256:593e60e33233f3586452ebfa9f741817c5ae849a98cc70945f3ccb8dc895eb22", size = 57904480, upload-time = "2026-03-06T17:44:26.025Z" }
-wheels = [
-    { url = "https://files.pythonhosted.org/packages/b3/8b/dc357ab966544e4dc898a2fee326d755c5f54da82af71a1a802e3476e78e/gradio-6.9.0-py3-none-any.whl", hash = "sha256:c173dd330c9247002a42222c85d76c0ecee65437eff808084e360862e7bbd24f", size = 42940853, upload-time = "2026-03-06T17:44:22.009Z" },
-]
-
-[[package]]
-name = "gradio-client"
-version = "2.3.0"
-source = { registry = "https://pypi.org/simple" }
-dependencies = [
-    { name = "fsspec", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "httpx", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "huggingface-hub", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "packaging", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "typing-extensions", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-]
-sdist = { url = "https://files.pythonhosted.org/packages/97/d2/de2037f5eff13a5145cdf6982fd34c9735f0806e8a2ee5d4bfe9a7d25a54/gradio_client-2.3.0.tar.gz", hash = "sha256:1c700dc60e65bae4386ba7cf3732b9f9d5bcf5fb8eb451df3944fe092d7d9a29", size = 57552, upload-time = "2026-03-06T17:44:38.247Z" }
-wheels = [
-    { url = "https://files.pythonhosted.org/packages/99/6a/41752781399811afbf8ac858f63c20eff354ed35169daa39604aefced4e8/gradio_client-2.3.0-py3-none-any.whl", hash = "sha256:9ec51a927888fc188e123a0ac5ad341d9265b325539a399554d1fc2604942e74", size = 58531, upload-time = "2026-03-06T17:44:36.961Z" },
-]
-
-[[package]]
-name = "groovy"
-version = "0.1.2"
-source = { registry = "https://pypi.org/simple" }
-sdist = { url = "https://files.pythonhosted.org/packages/52/36/bbdede67400277bef33d3ec0e6a31750da972c469f75966b4930c753218f/groovy-0.1.2.tar.gz", hash = "sha256:25c1dc09b3f9d7e292458aa762c6beb96ea037071bf5e917fc81fb78d2231083", size = 17325, upload-time = "2025-02-28T20:24:56.068Z" }
-wheels = [
-    { url = "https://files.pythonhosted.org/packages/28/27/3d6dcadc8a3214d8522c1e7f6a19554e33659be44546d44a2f7572ac7d2a/groovy-0.1.2-py3-none-any.whl", hash = "sha256:7f7975bab18c729a257a8b1ae9dcd70b7cafb1720481beae47719af57c35fa64", size = 14090, upload-time = "2025-02-28T20:24:55.152Z" },
-]
-
 [[package]]
 name = "h11"
 version = "0.16.0"
@@ -645,70 +499,6 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/04/4b/29cac41a4d98d144bf5f6d33995617b185d14b22401f75ca86f384e87ff1/h11-0.16.0-py3-none-any.whl", hash = "sha256:63cf8bbe7522de3bf65932fda1d9c2772064ffb3dae62d55932da54b31cb6c86", size = 37515, upload-time = "2025-04-24T03:35:24.344Z" },
 ]
 
-[[package]]
-name = "hf-xet"
-version = "1.4.2"
-source = { registry = "https://pypi.org/simple" }
-sdist = { url = "https://files.pythonhosted.org/packages/09/08/23c84a26716382c89151b5b447b4beb19e3345f3a93d3b73009a71a57ad3/hf_xet-1.4.2.tar.gz", hash = "sha256:b7457b6b482d9e0743bd116363239b1fa904a5e65deede350fbc0c4ea67c71ea", size = 672357, upload-time = "2026-03-13T06:58:51.077Z" }
-wheels = [
-    { url = "https://files.pythonhosted.org/packages/b4/86/b40b83a2ff03ef05c4478d2672b1fc2b9683ff870e2b25f4f3af240f2e7b/hf_xet-1.4.2-cp37-abi3-macosx_10_12_x86_64.whl", hash = "sha256:71f02d6e4cdd07f344f6844845d78518cc7186bd2bc52d37c3b73dc26a3b0bc5", size = 3800339, upload-time = "2026-03-13T06:58:36.245Z" },
-    { url = "https://files.pythonhosted.org/packages/64/2e/af4475c32b4378b0e92a587adb1aa3ec53e3450fd3e5fe0372a874531c00/hf_xet-1.4.2-cp37-abi3-macosx_11_0_arm64.whl", hash = "sha256:e9b38d876e94d4bdcf650778d6ebbaa791dd28de08db9736c43faff06ede1b5a", size = 3559664, upload-time = "2026-03-13T06:58:34.787Z" },
-    { url = "https://files.pythonhosted.org/packages/3c/4c/781267da3188db679e601de18112021a5cb16506fe86b246e22c5401a9c4/hf_xet-1.4.2-cp37-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:77e8c180b7ef12d8a96739a4e1e558847002afe9ea63b6f6358b2271a8bdda1c", size = 4217422, upload-time = "2026-03-13T06:58:27.472Z" },
-    { url = "https://files.pythonhosted.org/packages/68/47/d6cf4a39ecf6c7705f887a46f6ef5c8455b44ad9eb0d391aa7e8a2ff7fea/hf_xet-1.4.2-cp37-abi3-manylinux_2_28_aarch64.whl", hash = "sha256:c3b3c6a882016b94b6c210957502ff7877802d0dbda8ad142c8595db8b944271", size = 3992847, upload-time = "2026-03-13T06:58:25.989Z" },
-    { url = "https://files.pythonhosted.org/packages/2d/ef/e80815061abff54697239803948abc665c6b1d237102c174f4f7a9a5ffc5/hf_xet-1.4.2-cp37-abi3-musllinux_1_2_aarch64.whl", hash = "sha256:9d9a634cc929cfbaf2e1a50c0e532ae8c78fa98618426769480c58501e8c8ac2", size = 4193843, upload-time = "2026-03-13T06:58:44.59Z" },
-    { url = "https://files.pythonhosted.org/packages/54/75/07f6aa680575d9646c4167db6407c41340cbe2357f5654c4e72a1b01ca14/hf_xet-1.4.2-cp37-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:6b0932eb8b10317ea78b7da6bab172b17be03bbcd7809383d8d5abd6a2233e04", size = 4432751, upload-time = "2026-03-13T06:58:46.533Z" },
-    { url = "https://files.pythonhosted.org/packages/cd/71/193eabd7e7d4b903c4aa983a215509c6114915a5a237525ec562baddb868/hf_xet-1.4.2-cp37-abi3-win_amd64.whl", hash = "sha256:ad185719fb2e8ac26f88c8100562dbf9dbdcc3d9d2add00faa94b5f106aea53f", size = 3671149, upload-time = "2026-03-13T06:58:57.07Z" },
-    { url = "https://files.pythonhosted.org/packages/b4/7e/ccf239da366b37ba7f0b36095450efae4a64980bdc7ec2f51354205fdf39/hf_xet-1.4.2-cp37-abi3-win_arm64.whl", hash = "sha256:32c012286b581f783653e718c1862aea5b9eb140631685bb0c5e7012c8719a87", size = 3533426, upload-time = "2026-03-13T06:58:55.46Z" },
-]
-
-[[package]]
-name = "httpcore"
-version = "1.0.9"
-source = { registry = "https://pypi.org/simple" }
-dependencies = [
-    { name = "certifi", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "h11", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-]
-sdist = { url = "https://files.pythonhosted.org/packages/06/94/82699a10bca87a5556c9c59b5963f2d039dbd239f25bc2a63907a05a14cb/httpcore-1.0.9.tar.gz", hash = "sha256:6e34463af53fd2ab5d807f399a9b45ea31c3dfa2276f15a2c3f00afff6e176e8", size = 85484, upload-time = "2025-04-24T22:06:22.219Z" }
-wheels = [
-    { url = "https://files.pythonhosted.org/packages/7e/f5/f66802a942d491edb555dd61e3a9961140fd64c90bce1eafd741609d334d/httpcore-1.0.9-py3-none-any.whl", hash = "sha256:2d400746a40668fc9dec9810239072b40b4484b640a8c38fd654a024c7a1bf55", size = 78784, upload-time = "2025-04-24T22:06:20.566Z" },
-]
-
-[[package]]
-name = "httpx"
-version = "0.28.1"
-source = { registry = "https://pypi.org/simple" }
-dependencies = [
-    { name = "anyio", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "certifi", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "httpcore", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "idna", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-]
-sdist = { url = "https://files.pythonhosted.org/packages/b1/df/48c586a5fe32a0f01324ee087459e112ebb7224f646c0b5023f5e79e9956/httpx-0.28.1.tar.gz", hash = "sha256:75e98c5f16b0f35b567856f597f06ff2270a374470a5c2392242528e3e3e42fc", size = 141406, upload-time = "2024-12-06T15:37:23.222Z" }
-wheels = [
-    { url = "https://files.pythonhosted.org/packages/2a/39/e50c7c3a983047577ee07d2a9e53faf5a69493943ec3f6a384bdc792deb2/httpx-0.28.1-py3-none-any.whl", hash = "sha256:d909fcccc110f8c7faf814ca82a9a4d816bc5a6dbfea25d6591d6985b8ba59ad", size = 73517, upload-time = "2024-12-06T15:37:21.509Z" },
-]
-
-[[package]]
-name = "huggingface-hub"
-version = "1.7.2"
-source = { registry = "https://pypi.org/simple" }
-dependencies = [
-    { name = "filelock", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "fsspec", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "hf-xet", marker = "(platform_machine != 'AMD64' and platform_machine != 'aarch64' and platform_machine != 'amd64' and platform_machine != 'arm64' and platform_machine != 'x86_64' and extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm') or (platform_machine == 'AMD64' and sys_platform == 'linux') or (platform_machine == 'aarch64' and sys_platform == 'linux') or (platform_machine == 'amd64' and sys_platform == 'linux') or (platform_machine == 'arm64' and sys_platform == 'linux') or (platform_machine == 'x86_64' and sys_platform == 'linux') or (sys_platform != 'linux' and extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "httpx", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "packaging", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "pyyaml", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "tqdm", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
|
| 704 |
-
{ name = "typer", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
|
| 705 |
-
{ name = "typing-extensions", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
|
| 706 |
-
]
|
| 707 |
-
sdist = { url = "https://files.pythonhosted.org/packages/19/15/eafc1c57bf0f8afffb243dcd4c0cceb785e956acc17bba4d9bf2ae21fc9c/huggingface_hub-1.7.2.tar.gz", hash = "sha256:7f7e294e9bbb822e025bdb2ada025fa4344d978175a7f78e824d86e35f7ab43b", size = 724684, upload-time = "2026-03-20T10:36:08.767Z" }
|
| 708 |
-
wheels = [
|
| 709 |
-
{ url = "https://files.pythonhosted.org/packages/08/de/3ad061a05f74728927ded48c90b73521b9a9328c85d841bdefb30e01fb85/huggingface_hub-1.7.2-py3-none-any.whl", hash = "sha256:288f33a0a17b2a73a1359e2a5fd28d1becb2c121748c6173ab8643fb342c850e", size = 618036, upload-time = "2026-03-20T10:36:06.824Z" },
|
| 710 |
-
]
|
| 711 |
-
|
| 712 |
[[package]]
|
| 713 |
name = "humanize"
|
| 714 |
version = "4.15.0"
|
|
@@ -1439,57 +1229,6 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/9f/99/4c9c0c329bf9fc125008c3b54c7c94c0023518d06fc025ae36431375e1fe/nvidia_nvtx_cu12-12.8.90-py3-none-win_amd64.whl", hash = "sha256:619c8304aedc69f02ea82dd244541a83c3d9d40993381b3b590f1adaed3db41e", size = 56492, upload-time = "2025-03-07T01:52:24.69Z" },
 ]
 
-[[package]]
-name = "orjson"
-version = "3.11.7"
-source = { registry = "https://pypi.org/simple" }
-sdist = { url = "https://files.pythonhosted.org/packages/53/45/b268004f745ede84e5798b48ee12b05129d19235d0e15267aa57dcdb400b/orjson-3.11.7.tar.gz", hash = "sha256:9b1a67243945819ce55d24a30b59d6a168e86220452d2c96f4d1f093e71c0c49", size = 6144992, upload-time = "2026-02-02T15:38:49.29Z" }
-wheels = [
-    { url = "https://files.pythonhosted.org/packages/de/1a/a373746fa6d0e116dd9e54371a7b54622c44d12296d5d0f3ad5e3ff33490/orjson-3.11.7-cp310-cp310-macosx_10_15_x86_64.macosx_11_0_arm64.macosx_10_15_universal2.whl", hash = "sha256:a02c833f38f36546ba65a452127633afce4cf0dd7296b753d3bb54e55e5c0174", size = 229140, upload-time = "2026-02-02T15:37:06.082Z" },
-    { url = "https://files.pythonhosted.org/packages/52/a2/fa129e749d500f9b183e8a3446a193818a25f60261e9ce143ad61e975208/orjson-3.11.7-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:b63c6e6738d7c3470ad01601e23376aa511e50e1f3931395b9f9c722406d1a67", size = 128670, upload-time = "2026-02-02T15:37:08.002Z" },
-    { url = "https://files.pythonhosted.org/packages/08/93/1e82011cd1e0bd051ef9d35bed1aa7fb4ea1f0a055dc2c841b46b43a9ebd/orjson-3.11.7-cp310-cp310-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:043d3006b7d32c7e233b8cfb1f01c651013ea079e08dcef7189a29abd8befe11", size = 123832, upload-time = "2026-02-02T15:37:09.191Z" },
-    { url = "https://files.pythonhosted.org/packages/fe/d8/a26b431ef962c7d55736674dddade876822f3e33223c1f47a36879350d04/orjson-3.11.7-cp310-cp310-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:57036b27ac8a25d81112eb0cc9835cd4833c5b16e1467816adc0015f59e870dc", size = 129171, upload-time = "2026-02-02T15:37:11.112Z" },
-    { url = "https://files.pythonhosted.org/packages/a7/19/f47819b84a580f490da260c3ee9ade214cf4cf78ac9ce8c1c758f80fdfc9/orjson-3.11.7-cp310-cp310-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:733ae23ada68b804b222c44affed76b39e30806d38660bf1eb200520d259cc16", size = 141967, upload-time = "2026-02-02T15:37:12.282Z" },
-    { url = "https://files.pythonhosted.org/packages/5b/cd/37ece39a0777ba077fdcdbe4cccae3be8ed00290c14bf8afdc548befc260/orjson-3.11.7-cp310-cp310-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:5fdfad2093bdd08245f2e204d977facd5f871c88c4a71230d5bcbd0e43bf6222", size = 130991, upload-time = "2026-02-02T15:37:13.465Z" },
-    { url = "https://files.pythonhosted.org/packages/8f/ed/f2b5d66aa9b6b5c02ff5f120efc7b38c7c4962b21e6be0f00fd99a5c348e/orjson-3.11.7-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:cededd6738e1c153530793998e31c05086582b08315db48ab66649768f326baa", size = 133674, upload-time = "2026-02-02T15:37:14.694Z" },
-    { url = "https://files.pythonhosted.org/packages/c4/6e/baa83e68d1aa09fa8c3e5b2c087d01d0a0bd45256de719ed7bc22c07052d/orjson-3.11.7-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:14f440c7268c8f8633d1b3d443a434bd70cb15686117ea6beff8fdc8f5917a1e", size = 138722, upload-time = "2026-02-02T15:37:16.501Z" },
-    { url = "https://files.pythonhosted.org/packages/0c/47/7f8ef4963b772cd56999b535e553f7eb5cd27e9dd6c049baee6f18bfa05d/orjson-3.11.7-cp310-cp310-musllinux_1_2_armv7l.whl", hash = "sha256:3a2479753bbb95b0ebcf7969f562cdb9668e6d12416a35b0dda79febf89cdea2", size = 409056, upload-time = "2026-02-02T15:37:17.895Z" },
-    { url = "https://files.pythonhosted.org/packages/38/eb/2df104dd2244b3618f25325a656f85cc3277f74bbd91224752410a78f3c7/orjson-3.11.7-cp310-cp310-musllinux_1_2_i686.whl", hash = "sha256:71924496986275a737f38e3f22b4e0878882b3f7a310d2ff4dc96e812789120c", size = 144196, upload-time = "2026-02-02T15:37:19.349Z" },
-    { url = "https://files.pythonhosted.org/packages/b6/2a/ee41de0aa3a6686598661eae2b4ebdff1340c65bfb17fcff8b87138aab21/orjson-3.11.7-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:b4a9eefdc70bf8bf9857f0290f973dec534ac84c35cd6a7f4083be43e7170a8f", size = 134979, upload-time = "2026-02-02T15:37:20.906Z" },
-    { url = "https://files.pythonhosted.org/packages/4c/fa/92fc5d3d402b87a8b28277a9ed35386218a6a5287c7fe5ee9b9f02c53fb2/orjson-3.11.7-cp310-cp310-win32.whl", hash = "sha256:ae9e0b37a834cef7ce8f99de6498f8fad4a2c0bf6bfc3d02abd8ed56aa15b2de", size = 127968, upload-time = "2026-02-02T15:37:23.178Z" },
-    { url = "https://files.pythonhosted.org/packages/07/29/a576bf36d73d60df06904d3844a9df08e25d59eba64363aaf8ec2f9bff41/orjson-3.11.7-cp310-cp310-win_amd64.whl", hash = "sha256:d772afdb22555f0c58cfc741bdae44180122b3616faa1ecadb595cd526e4c993", size = 125128, upload-time = "2026-02-02T15:37:24.329Z" },
-    { url = "https://files.pythonhosted.org/packages/37/02/da6cb01fc6087048d7f61522c327edf4250f1683a58a839fdcc435746dd5/orjson-3.11.7-cp311-cp311-macosx_10_15_x86_64.macosx_11_0_arm64.macosx_10_15_universal2.whl", hash = "sha256:9487abc2c2086e7c8eb9a211d2ce8855bae0e92586279d0d27b341d5ad76c85c", size = 228664, upload-time = "2026-02-02T15:37:25.542Z" },
-    { url = "https://files.pythonhosted.org/packages/c1/c2/5885e7a5881dba9a9af51bc564e8967225a642b3e03d089289a35054e749/orjson-3.11.7-cp311-cp311-macosx_15_0_arm64.whl", hash = "sha256:79cacb0b52f6004caf92405a7e1f11e6e2de8bdf9019e4f76b44ba045125cd6b", size = 125344, upload-time = "2026-02-02T15:37:26.92Z" },
-    { url = "https://files.pythonhosted.org/packages/a4/1d/4e7688de0a92d1caf600dfd5fb70b4c5bfff51dfa61ac555072ef2d0d32a/orjson-3.11.7-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:c2e85fe4698b6a56d5e2ebf7ae87544d668eb6bde1ad1226c13f44663f20ec9e", size = 128404, upload-time = "2026-02-02T15:37:28.108Z" },
-    { url = "https://files.pythonhosted.org/packages/2f/b2/ec04b74ae03a125db7bd69cffd014b227b7f341e3261bf75b5eb88a1aa92/orjson-3.11.7-cp311-cp311-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:b8d14b71c0b12963fe8a62aac87119f1afdf4cb88a400f61ca5ae581449efcb5", size = 123677, upload-time = "2026-02-02T15:37:30.287Z" },
-    { url = "https://files.pythonhosted.org/packages/4c/69/f95bdf960605f08f827f6e3291fe243d8aa9c5c9ff017a8d7232209184c3/orjson-3.11.7-cp311-cp311-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:91c81ef070c8f3220054115e1ef468b1c9ce8497b4e526cb9f68ab4dc0a7ac62", size = 128950, upload-time = "2026-02-02T15:37:31.595Z" },
-    { url = "https://files.pythonhosted.org/packages/a4/1b/de59c57bae1d148ef298852abd31909ac3089cff370dfd4cd84cc99cbc42/orjson-3.11.7-cp311-cp311-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:411ebaf34d735e25e358a6d9e7978954a9c9d58cfb47bc6683cdc3964cd2f910", size = 141756, upload-time = "2026-02-02T15:37:32.985Z" },
-    { url = "https://files.pythonhosted.org/packages/ee/9e/9decc59f4499f695f65c650f6cfa6cd4c37a3fbe8fa235a0a3614cb54386/orjson-3.11.7-cp311-cp311-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:a16bcd08ab0bcdfc7e8801d9c4a9cc17e58418e4d48ddc6ded4e9e4b1a94062b", size = 130812, upload-time = "2026-02-02T15:37:34.204Z" },
-    { url = "https://files.pythonhosted.org/packages/28/e6/59f932bcabd1eac44e334fe8e3281a92eacfcb450586e1f4bde0423728d8/orjson-3.11.7-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:9c0b51672e466fd7e56230ffbae7f1639e18d0ce023351fb75da21b71bc2c960", size = 133444, upload-time = "2026-02-02T15:37:35.446Z" },
-    { url = "https://files.pythonhosted.org/packages/f1/36/b0f05c0eaa7ca30bc965e37e6a2956b0d67adb87a9872942d3568da846ae/orjson-3.11.7-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:136dcd6a2e796dfd9ffca9fc027d778567b0b7c9968d092842d3c323cef88aa8", size = 138609, upload-time = "2026-02-02T15:37:36.657Z" },
-    { url = "https://files.pythonhosted.org/packages/b8/03/58ec7d302b8d86944c60c7b4b82975d5161fcce4c9bc8c6cb1d6741b6115/orjson-3.11.7-cp311-cp311-musllinux_1_2_armv7l.whl", hash = "sha256:7ba61079379b0ae29e117db13bda5f28d939766e410d321ec1624afc6a0b0504", size = 408918, upload-time = "2026-02-02T15:37:38.076Z" },
-    { url = "https://files.pythonhosted.org/packages/06/3a/868d65ef9a8b99be723bd510de491349618abd9f62c826cf206d962db295/orjson-3.11.7-cp311-cp311-musllinux_1_2_i686.whl", hash = "sha256:0527a4510c300e3b406591b0ba69b5dc50031895b0a93743526a3fc45f59d26e", size = 143998, upload-time = "2026-02-02T15:37:39.706Z" },
-    { url = "https://files.pythonhosted.org/packages/5b/c7/1e18e1c83afe3349f4f6dc9e14910f0ae5f82eac756d1412ea4018938535/orjson-3.11.7-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:a709e881723c9b18acddcfb8ba357322491ad553e277cf467e1e7e20e2d90561", size = 134802, upload-time = "2026-02-02T15:37:41.002Z" },
-    { url = "https://files.pythonhosted.org/packages/d4/0b/ccb7ee1a65b37e8eeb8b267dc953561d72370e85185e459616d4345bab34/orjson-3.11.7-cp311-cp311-win32.whl", hash = "sha256:c43b8b5bab288b6b90dac410cca7e986a4fa747a2e8f94615aea407da706980d", size = 127828, upload-time = "2026-02-02T15:37:42.241Z" },
-    { url = "https://files.pythonhosted.org/packages/af/9e/55c776dffda3f381e0f07d010a4f5f3902bf48eaba1bb7684d301acd4924/orjson-3.11.7-cp311-cp311-win_amd64.whl", hash = "sha256:6543001328aa857187f905308a028935864aefe9968af3848401b6fe80dbb471", size = 124941, upload-time = "2026-02-02T15:37:43.444Z" },
-    { url = "https://files.pythonhosted.org/packages/aa/8e/424a620fa7d263b880162505fb107ef5e0afaa765b5b06a88312ac291560/orjson-3.11.7-cp311-cp311-win_arm64.whl", hash = "sha256:1ee5cc7160a821dfe14f130bc8e63e7611051f964b463d9e2a3a573204446a4d", size = 126245, upload-time = "2026-02-02T15:37:45.18Z" },
-    { url = "https://files.pythonhosted.org/packages/80/bf/76f4f1665f6983385938f0e2a5d7efa12a58171b8456c252f3bae8a4cf75/orjson-3.11.7-cp312-cp312-macosx_10_15_x86_64.macosx_11_0_arm64.macosx_10_15_universal2.whl", hash = "sha256:bd03ea7606833655048dab1a00734a2875e3e86c276e1d772b2a02556f0d895f", size = 228545, upload-time = "2026-02-02T15:37:46.376Z" },
-    { url = "https://files.pythonhosted.org/packages/79/53/6c72c002cb13b5a978a068add59b25a8bdf2800ac1c9c8ecdb26d6d97064/orjson-3.11.7-cp312-cp312-macosx_15_0_arm64.whl", hash = "sha256:89e440ebc74ce8ab5c7bc4ce6757b4a6b1041becb127df818f6997b5c71aa60b", size = 125224, upload-time = "2026-02-02T15:37:47.697Z" },
-    { url = "https://files.pythonhosted.org/packages/2c/83/10e48852865e5dd151bdfe652c06f7da484578ed02c5fca938e3632cb0b8/orjson-3.11.7-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:5ede977b5fe5ac91b1dffc0a517ca4542d2ec8a6a4ff7b2652d94f640796342a", size = 128154, upload-time = "2026-02-02T15:37:48.954Z" },
-    { url = "https://files.pythonhosted.org/packages/6e/52/a66e22a2b9abaa374b4a081d410edab6d1e30024707b87eab7c734afe28d/orjson-3.11.7-cp312-cp312-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:b7b1dae39230a393df353827c855a5f176271c23434cfd2db74e0e424e693e10", size = 123548, upload-time = "2026-02-02T15:37:50.187Z" },
-    { url = "https://files.pythonhosted.org/packages/de/38/605d371417021359f4910c496f764c48ceb8997605f8c25bf1dfe58c0ebe/orjson-3.11.7-cp312-cp312-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:ed46f17096e28fb28d2975834836a639af7278aa87c84f68ab08fbe5b8bd75fa", size = 129000, upload-time = "2026-02-02T15:37:51.426Z" },
-    { url = "https://files.pythonhosted.org/packages/44/98/af32e842b0ffd2335c89714d48ca4e3917b42f5d6ee5537832e069a4b3ac/orjson-3.11.7-cp312-cp312-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:3726be79e36e526e3d9c1aceaadbfb4a04ee80a72ab47b3f3c17fefb9812e7b8", size = 141686, upload-time = "2026-02-02T15:37:52.607Z" },
-    { url = "https://files.pythonhosted.org/packages/96/0b/fc793858dfa54be6feee940c1463370ece34b3c39c1ca0aa3845f5ba9892/orjson-3.11.7-cp312-cp312-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:0724e265bc548af1dedebd9cb3d24b4e1c1e685a343be43e87ba922a5c5fff2f", size = 130812, upload-time = "2026-02-02T15:37:53.944Z" },
-    { url = "https://files.pythonhosted.org/packages/dc/91/98a52415059db3f374757d0b7f0f16e3b5cd5976c90d1c2b56acaea039e6/orjson-3.11.7-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:e7745312efa9e11c17fbd3cb3097262d079da26930ae9ae7ba28fb738367cbad", size = 133440, upload-time = "2026-02-02T15:37:55.615Z" },
-    { url = "https://files.pythonhosted.org/packages/dc/b6/cb540117bda61791f46381f8c26c8f93e802892830a6055748d3bb1925ab/orjson-3.11.7-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:f904c24bdeabd4298f7a977ef14ca2a022ca921ed670b92ecd16ab6f3d01f867", size = 138386, upload-time = "2026-02-02T15:37:56.814Z" },
-    { url = "https://files.pythonhosted.org/packages/63/1a/50a3201c334a7f17c231eee5f841342190723794e3b06293f26e7cf87d31/orjson-3.11.7-cp312-cp312-musllinux_1_2_armv7l.whl", hash = "sha256:b9fc4d0f81f394689e0814617aadc4f2ea0e8025f38c226cbf22d3b5ddbf025d", size = 408853, upload-time = "2026-02-02T15:37:58.291Z" },
-    { url = "https://files.pythonhosted.org/packages/87/cd/8de1c67d0be44fdc22701e5989c0d015a2adf391498ad42c4dc589cd3013/orjson-3.11.7-cp312-cp312-musllinux_1_2_i686.whl", hash = "sha256:849e38203e5be40b776ed2718e587faf204d184fc9a008ae441f9442320c0cab", size = 144130, upload-time = "2026-02-02T15:38:00.163Z" },
-    { url = "https://files.pythonhosted.org/packages/0f/fe/d605d700c35dd55f51710d159fc54516a280923cd1b7e47508982fbb387d/orjson-3.11.7-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:4682d1db3bcebd2b64757e0ddf9e87ae5f00d29d16c5cdf3a62f561d08cc3dd2", size = 134818, upload-time = "2026-02-02T15:38:01.507Z" },
-    { url = "https://files.pythonhosted.org/packages/e4/e4/15ecc67edb3ddb3e2f46ae04475f2d294e8b60c1825fbe28a428b93b3fbd/orjson-3.11.7-cp312-cp312-win32.whl", hash = "sha256:f4f7c956b5215d949a1f65334cf9d7612dde38f20a95f2315deef167def91a6f", size = 127923, upload-time = "2026-02-02T15:38:02.75Z" },
-    { url = "https://files.pythonhosted.org/packages/34/70/2e0855361f76198a3965273048c8e50a9695d88cd75811a5b46444895845/orjson-3.11.7-cp312-cp312-win_amd64.whl", hash = "sha256:bf742e149121dc5648ba0a08ea0871e87b660467ef168a3a5e53bc1fbd64bb74", size = 125007, upload-time = "2026-02-02T15:38:04.032Z" },
-    { url = "https://files.pythonhosted.org/packages/68/40/c2051bd19fc467610fed469dc29e43ac65891571138f476834ca192bc290/orjson-3.11.7-cp312-cp312-win_arm64.whl", hash = "sha256:26c3b9132f783b7d7903bf1efb095fed8d4a3a85ec0d334ee8beff3d7a4749d5", size = 126089, upload-time = "2026-02-02T15:38:05.297Z" },
-]
-
 [[package]]
 name = "packaging"
 version = "26.0"
@@ -1586,6 +1325,7 @@ dependencies = [
     { name = "chess-engine", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
     { name = "numpy", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
     { name = "psutil", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
     { name = "tqdm", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
     { name = "wandb", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
 ]
@@ -1601,9 +1341,6 @@ dashboard = [
     { name = "plotly", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
     { name = "solara", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
 ]
-demo = [
-    { name = "gradio", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-]
 dev = [
     { name = "ipykernel", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
     { name = "pytest", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
@@ -1623,7 +1360,6 @@ rocm = [
 requires-dist = [
     { name = "anywidget", marker = "extra == 'dashboard'", specifier = ">=0.9.21" },
     { name = "chess-engine", editable = "engine" },
-    { name = "gradio", marker = "extra == 'demo'", specifier = ">=5.0.0" },
     { name = "ipykernel", marker = "extra == 'dev'", specifier = ">=7.2.0" },
     { name = "matplotlib", marker = "extra == 'eval'", specifier = ">=3.10.8" },
     { name = "numpy", specifier = "~=2.2.0" },
@@ -1633,6 +1369,7 @@ requires-dist = [
     { name = "psutil", specifier = ">=5.9.0" },
     { name = "pyarrow", marker = "extra == 'eval'", specifier = ">=23.0.1" },
     { name = "pytest", marker = "extra == 'dev'", specifier = "~=9.0.0" },
     { name = "seaborn", marker = "extra == 'eval'", specifier = ">=0.13.2" },
     { name = "solara", marker = "extra == 'dashboard'", specifier = ">=1.0.0" },
     { name = "torch", marker = "extra == 'cu128'", specifier = "~=2.10.0", index = "https://download.pytorch.org/whl/cu128", conflict = { package = "pawn", extra = "cu128" } },
@@ -1641,7 +1378,7 @@ requires-dist = [
     { name = "triton-rocm", marker = "extra == 'rocm'", specifier = ">=3.6.0", index = "https://download.pytorch.org/whl/rocm7.1", conflict = { package = "pawn", extra = "rocm" } },
     { name = "wandb", specifier = "~=0.25.0" },
 ]
-provides-extras = ["rocm", "cu128", "eval", "dashboard", "
 
 [[package]]
 name = "pexpect"
@@ -1976,15 +1713,6 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/36/c7/cfc8e811f061c841d7990b0201912c3556bfeb99cdcb7ed24adc8d6f8704/pydantic_core-2.41.5-pp311-pypy311_pp73-win_amd64.whl", hash = "sha256:56121965f7a4dc965bff783d70b907ddf3d57f6eba29b6d2e5dabfaf07799c51", size = 2145302, upload-time = "2025-11-04T13:43:46.64Z" },
 ]
 
-[[package]]
-name = "pydub"
-version = "0.25.1"
-source = { registry = "https://pypi.org/simple" }
-sdist = { url = "https://files.pythonhosted.org/packages/fe/9a/e6bca0eed82db26562c73b5076539a4a08d3cffd19c3cc5913a3e61145fd/pydub-0.25.1.tar.gz", hash = "sha256:980a33ce9949cab2a569606b65674d748ecbca4f0796887fd6f46173a7b0d30f", size = 38326, upload-time = "2021-03-10T02:09:54.659Z" }
-wheels = [
-    { url = "https://files.pythonhosted.org/packages/a6/53/d78dc063216e62fc55f6b2eebb447f6a4b0a59f55c8406376f76bf959b08/pydub-0.25.1-py2.py3-none-any.whl", hash = "sha256:65617e33033874b59d87db603aa1ed450633288aefead953b30bded59cb599a6", size = 32327, upload-time = "2021-03-10T02:09:53.503Z" },
-]
-
 [[package]]
 name = "pygments"
 version = "2.19.2"
@@ -2045,15 +1773,6 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/ec/57/56b9bcc3c9c6a792fcbaf139543cee77261f3651ca9da0c93f5c1221264b/python_dateutil-2.9.0.post0-py2.py3-none-any.whl", hash = "sha256:a8b2bc7bffae282281c8140a97d3aa9c14da0b136dfe83f850eea9a5f7470427", size = 229892, upload-time = "2024-03-01T18:36:18.57Z" },
 ]
 
-[[package]]
-name = "python-multipart"
-version = "0.0.22"
-source = { registry = "https://pypi.org/simple" }
-sdist = { url = "https://files.pythonhosted.org/packages/94/01/979e98d542a70714b0cb2b6728ed0b7c46792b695e3eaec3e20711271ca3/python_multipart-0.0.22.tar.gz", hash = "sha256:7340bef99a7e0032613f56dc36027b959fd3b30a787ed62d310e951f7c3a3a58", size = 37612, upload-time = "2026-01-25T10:15:56.219Z" }
-wheels = [
-    { url = "https://files.pythonhosted.org/packages/1b/d0/397f9626e711ff749a95d96b7af99b9c566a9bb5129b8e4c10fc4d100304/python_multipart-0.0.22-py3-none-any.whl", hash = "sha256:2b2cd894c83d21bf49d702499531c7bafd057d730c201782048f7945d82de155", size = 24579, upload-time = "2026-01-25T10:15:54.811Z" },
-]
-
 [[package]]
 name = "pytz"
 version = "2026.1.post1"
@@ -2284,15 +2003,29 @@ wheels = [
 ]
 
 [[package]]
-name = "
-version = "0.
 source = { registry = "https://pypi.org/simple" }
-    { url = "https://files.pythonhosted.org/packages/
 ]
 
 [[package]]
@@ -2310,15 +2043,6 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/83/11/00d3c3dfc25ad54e731d91449895a79e4bf2384dc3ac01809010ba88f6d5/seaborn-0.13.2-py3-none-any.whl", hash = "sha256:636f8336facf092165e27924f223d3c62ca560b1f2bb5dff7ab7fad265361987", size = 294914, upload-time = "2024-01-25T13:21:49.598Z" },
 ]
 
-[[package]]
-name = "semantic-version"
-version = "2.10.0"
-source = { registry = "https://pypi.org/simple" }
-sdist = { url = "https://files.pythonhosted.org/packages/7d/31/f2289ce78b9b473d582568c234e104d2a342fd658cc288a7553d83bb8595/semantic_version-2.10.0.tar.gz", hash = "sha256:bdabb6d336998cbb378d4b9db3a4b56a1e3235701dc05ea2690d9a997ed5041c", size = 52289, upload-time = "2022-05-26T13:35:23.454Z" }
-wheels = [
-    { url = "https://files.pythonhosted.org/packages/6a/23/8146aad7d88f4fcb3a6218f41a60f6c2d4e3a72de72da1825dc7c8f7877c/semantic_version-2.10.0-py2.py3-none-any.whl", hash = "sha256:de78a3b8e0feda74cabc54aab2da702113e33ac9d9eb9d2389bcf1f58b7d9177", size = 15552, upload-time = "2022-05-26T13:35:21.206Z" },
-]
-
 [[package]]
 name = "sentry-sdk"
 version = "2.55.0"
@@ -2341,15 +2065,6 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/9d/76/f789f7a86709c6b087c5a2f52f911838cad707cc613162401badc665acfe/setuptools-82.0.1-py3-none-any.whl", hash = "sha256:a59e362652f08dcd477c78bb6e7bd9d80a7995bc73ce773050228a348ce2e5bb", size = 1006223, upload-time = "2026-03-09T12:47:15.026Z" },
 ]
 
-[[package]]
-name = "shellingham"
-version = "1.5.4"
-source = { registry = "https://pypi.org/simple" }
-sdist = { url = "https://files.pythonhosted.org/packages/58/15/8b3609fd3830ef7b27b655beb4b4e9c62313a4e8da8c676e142cc210d58e/shellingham-1.5.4.tar.gz", hash = "sha256:8dbca0739d487e5bd35ab3ca4b36e11c4078f3a234bfce294b0a0291363404de", size = 10310, upload-time = "2023-10-24T04:13:40.426Z" }
-wheels = [
-    { url = "https://files.pythonhosted.org/packages/e0/f9/0595336914c5619e5f28a1fb793285925a8cd4b432c9da0a987836c7f822/shellingham-1.5.4-py2.py3-none-any.whl", hash = "sha256:7ecfff8f2fd72616f7481040475a65b2bf8af90a56c89140852d1120324e8686", size = 9755, upload-time = "2023-10-24T04:13:38.866Z" },
-]
-
 [[package]]
 name = "six"
 version = "1.17.0"
@@ -2504,15 +2219,6 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/23/d1/136eb2cb77520a31e1f64cbae9d33ec6df0d78bdf4160398e86eec8a8754/tomli-2.4.0-py3-none-any.whl", hash = "sha256:1f776e7d669ebceb01dee46484485f43a4048746235e683bcdffacdf1fb4785a", size = 14477, upload-time = "2026-01-11T11:22:37.446Z" },
 ]
 
-[[package]]
-name = "tomlkit"
-version = "0.13.3"
-source = { registry = "https://pypi.org/simple" }
-sdist = { url = "https://files.pythonhosted.org/packages/cc/18/0bbf3884e9eaa38819ebe46a7bd25dcd56b67434402b66a58c4b8e552575/tomlkit-0.13.3.tar.gz", hash = "sha256:430cf247ee57df2b94ee3fbe588e71d362a941ebb545dec29b53961d61add2a1", size = 185207, upload-time = "2025-06-05T07:13:44.947Z" }
-wheels = [
-    { url = "https://files.pythonhosted.org/packages/bd/75/8539d011f6be8e29f339c42e633aae3cb73bffa95dd0f9adec09b9c58e85/tomlkit-0.13.3-py3-none-any.whl", hash = "sha256:c89c649d79ee40629a9fda55f8ace8c6a1b42deb912b2a8fd8d942ddadb606b0", size = 38901, upload-time = "2025-06-05T07:13:43.546Z" },
-]
-
 [[package]]
 name = "torch"
 version = "2.10.0+cu128"
@@ -2645,21 +2351,6 @@ wheels = [
     { url = "https://download.pytorch.org/whl/triton_rocm-3.6.0-cp312-cp312-linux_x86_64.whl" },
 ]
 
-[[package]]
-name = "typer"
-version = "0.24.1"
-source = { registry = "https://pypi.org/simple" }
-dependencies = [
-    { name = "annotated-doc", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "click", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "rich", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-    { name = "shellingham", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
-]
-sdist = { url = "https://files.pythonhosted.org/packages/f5/24/cb09efec5cc954f7f9b930bf8279447d24618bb6758d4f6adf2574c41780/typer-0.24.1.tar.gz", hash = "sha256:e39b4732d65fbdcde189ae76cf7cd48aeae72919dea1fdfc16593be016256b45", size = 118613, upload-time = "2026-02-21T16:54:40.609Z" }
-wheels = [
-    { url = "https://files.pythonhosted.org/packages/4a/91/48db081e7a63bb37284f9fbcefda7c44c277b18b0e13fbc36ea2335b71e6/typer-0.24.1-py3-none-any.whl", hash = "sha256:112c1f0ce578bfb4cab9ffdabc68f031416ebcc216536611ba21f04e9aa84c9e", size = 56085, upload-time = "2026-02-21T16:54:41.616Z" },
-]
-
 [[package]]
 name = "typing-extensions"
 version = "4.15.0"
|
|
| 20 |
"pawn",
|
| 21 |
]
|
| 22 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 23 |
[[package]]
|
| 24 |
name = "annotated-types"
|
| 25 |
version = "0.7.0"
|
|
|
|
| 75 |
{ url = "https://files.pythonhosted.org/packages/64/b4/17d4b0b2a2dc85a6df63d1157e028ed19f90d4cd97c36717afef2bc2f395/attrs-26.1.0-py3-none-any.whl", hash = "sha256:c647aa4a12dfbad9333ca4e71fe62ddc36f4e63b2d260a37a8b83d2f043ac309", size = 67548, upload-time = "2026-03-19T14:22:23.645Z" },
|
| 76 |
]
|
| 77 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 78 |
[[package]]
|
| 79 |
name = "cachetools"
|
| 80 |
version = "7.0.5"
|
|
|
|
| 406 |
{ url = "https://files.pythonhosted.org/packages/c1/ea/53f2148663b321f21b5a606bd5f191517cf40b7072c0497d3c92c4a13b1e/executing-2.2.1-py2.py3-none-any.whl", hash = "sha256:760643d3452b4d777d295bb167ccc74c64a81df23fb5e08eff250c425a4b2017", size = 28317, upload-time = "2025-09-01T09:48:08.5Z" },
|
| 407 |
]
|
| 408 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 409 |
[[package]]
|
| 410 |
name = "fastjsonschema"
|
| 411 |
version = "2.21.2"
|
|
|
|
| 415 |
{ url = "https://files.pythonhosted.org/packages/cb/a8/20d0723294217e47de6d9e2e40fd4a9d2f7c4b6ef974babd482a59743694/fastjsonschema-2.21.2-py3-none-any.whl", hash = "sha256:1c797122d0a86c5cace2e54bf4e819c36223b552017172f32c5c024a6b77e463", size = 24024, upload-time = "2025-08-14T18:49:34.776Z" },
|
| 416 |
]
|
| 417 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
 418 | [[package]]
 419 | name = "filelock"
 420 | version = "3.25.2"
 490 | { url = "https://files.pythonhosted.org/packages/6a/09/e21df6aef1e1ffc0c816f0522ddc3f6dcded766c3261813131c78a704470/gitpython-3.1.46-py3-none-any.whl", hash = "sha256:79812ed143d9d25b6d176a10bb511de0f9c67b1fa641d82097b0ab90398a2058", size = 208620, upload-time = "2026-01-01T15:37:30.574Z" },
 491 | ]
 492 |
 493 | [[package]]
 494 | name = "h11"
 495 | version = "0.16.0"
 499 | { url = "https://files.pythonhosted.org/packages/04/4b/29cac41a4d98d144bf5f6d33995617b185d14b22401f75ca86f384e87ff1/h11-0.16.0-py3-none-any.whl", hash = "sha256:63cf8bbe7522de3bf65932fda1d9c2772064ffb3dae62d55932da54b31cb6c86", size = 37515, upload-time = "2025-04-24T03:35:24.344Z" },
 500 | ]
 501 |
 502 | [[package]]
 503 | name = "humanize"
 504 | version = "4.15.0"
1229 | { url = "https://files.pythonhosted.org/packages/9f/99/4c9c0c329bf9fc125008c3b54c7c94c0023518d06fc025ae36431375e1fe/nvidia_nvtx_cu12-12.8.90-py3-none-win_amd64.whl", hash = "sha256:619c8304aedc69f02ea82dd244541a83c3d9d40993381b3b590f1adaed3db41e", size = 56492, upload-time = "2025-03-07T01:52:24.69Z" },
1230 | ]
1231 |
1232 | [[package]]
1233 | name = "packaging"
1234 | version = "26.0"
1325 | { name = "chess-engine", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
1326 | { name = "numpy", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
1327 | { name = "psutil", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
1328 | + { name = "safetensors", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
1329 | { name = "tqdm", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
1330 | { name = "wandb", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
1331 | ]
1341 | { name = "plotly", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
1342 | { name = "solara", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
1343 | ]
1344 | dev = [
1345 | { name = "ipykernel", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
1346 | { name = "pytest", marker = "sys_platform == 'linux' or (extra == 'extra-4-pawn-cu128' and extra == 'extra-4-pawn-rocm')" },
1360 | requires-dist = [
1361 | { name = "anywidget", marker = "extra == 'dashboard'", specifier = ">=0.9.21" },
1362 | { name = "chess-engine", editable = "engine" },
1363 | { name = "ipykernel", marker = "extra == 'dev'", specifier = ">=7.2.0" },
1364 | { name = "matplotlib", marker = "extra == 'eval'", specifier = ">=3.10.8" },
1365 | { name = "numpy", specifier = "~=2.2.0" },
1369 | { name = "psutil", specifier = ">=5.9.0" },
1370 | { name = "pyarrow", marker = "extra == 'eval'", specifier = ">=23.0.1" },
1371 | { name = "pytest", marker = "extra == 'dev'", specifier = "~=9.0.0" },
1372 | + { name = "safetensors", specifier = ">=0.4.0" },
1373 | { name = "seaborn", marker = "extra == 'eval'", specifier = ">=0.13.2" },
1374 | { name = "solara", marker = "extra == 'dashboard'", specifier = ">=1.0.0" },
1375 | { name = "torch", marker = "extra == 'cu128'", specifier = "~=2.10.0", index = "https://download.pytorch.org/whl/cu128", conflict = { package = "pawn", extra = "cu128" } },
1378 | { name = "triton-rocm", marker = "extra == 'rocm'", specifier = ">=3.6.0", index = "https://download.pytorch.org/whl/rocm7.1", conflict = { package = "pawn", extra = "rocm" } },
1379 | { name = "wandb", specifier = "~=0.25.0" },
1380 | ]
1381 | + provides-extras = ["rocm", "cu128", "eval", "dashboard", "dev"]
1382 |
1383 | [[package]]
1384 | name = "pexpect"
1713 | { url = "https://files.pythonhosted.org/packages/36/c7/cfc8e811f061c841d7990b0201912c3556bfeb99cdcb7ed24adc8d6f8704/pydantic_core-2.41.5-pp311-pypy311_pp73-win_amd64.whl", hash = "sha256:56121965f7a4dc965bff783d70b907ddf3d57f6eba29b6d2e5dabfaf07799c51", size = 2145302, upload-time = "2025-11-04T13:43:46.64Z" },
1714 | ]
1715 |
1716 | [[package]]
1717 | name = "pygments"
1718 | version = "2.19.2"
1773 | { url = "https://files.pythonhosted.org/packages/ec/57/56b9bcc3c9c6a792fcbaf139543cee77261f3651ca9da0c93f5c1221264b/python_dateutil-2.9.0.post0-py2.py3-none-any.whl", hash = "sha256:a8b2bc7bffae282281c8140a97d3aa9c14da0b136dfe83f850eea9a5f7470427", size = 229892, upload-time = "2024-03-01T18:36:18.57Z" },
1774 | ]
1775 |
1776 | [[package]]
1777 | name = "pytz"
1778 | version = "2026.1.post1"
2003 | ]
2004 |
2005 | [[package]]
2006 | + name = "safetensors"
2007 | + version = "0.7.0"
2008 | source = { registry = "https://pypi.org/simple" }
2009 | + sdist = { url = "https://files.pythonhosted.org/packages/29/9c/6e74567782559a63bd040a236edca26fd71bc7ba88de2ef35d75df3bca5e/safetensors-0.7.0.tar.gz", hash = "sha256:07663963b67e8bd9f0b8ad15bb9163606cd27cc5a1b96235a50d8369803b96b0", size = 200878, upload-time = "2025-11-19T15:18:43.199Z" }
2010 | + wheels = [
2011 | + { url = "https://files.pythonhosted.org/packages/fa/47/aef6c06649039accf914afef490268e1067ed82be62bcfa5b7e886ad15e8/safetensors-0.7.0-cp38-abi3-macosx_10_12_x86_64.whl", hash = "sha256:c82f4d474cf725255d9e6acf17252991c3c8aac038d6ef363a4bf8be2f6db517", size = 467781, upload-time = "2025-11-19T15:18:35.84Z" },
2012 | + { url = "https://files.pythonhosted.org/packages/e8/00/374c0c068e30cd31f1e1b46b4b5738168ec79e7689ca82ee93ddfea05109/safetensors-0.7.0-cp38-abi3-macosx_11_0_arm64.whl", hash = "sha256:94fd4858284736bb67a897a41608b5b0c2496c9bdb3bf2af1fa3409127f20d57", size = 447058, upload-time = "2025-11-19T15:18:34.416Z" },
2013 | + { url = "https://files.pythonhosted.org/packages/f1/06/578ffed52c2296f93d7fd2d844cabfa92be51a587c38c8afbb8ae449ca89/safetensors-0.7.0-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:e07d91d0c92a31200f25351f4acb2bc6aff7f48094e13ebb1d0fb995b54b6542", size = 491748, upload-time = "2025-11-19T15:18:09.79Z" },
2014 | + { url = "https://files.pythonhosted.org/packages/ae/33/1debbbb70e4791dde185edb9413d1fe01619255abb64b300157d7f15dddd/safetensors-0.7.0-cp38-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:8469155f4cb518bafb4acf4865e8bb9d6804110d2d9bdcaa78564b9fd841e104", size = 503881, upload-time = "2025-11-19T15:18:16.145Z" },
2015 | + { url = "https://files.pythonhosted.org/packages/8e/1c/40c2ca924d60792c3be509833df711b553c60effbd91da6f5284a83f7122/safetensors-0.7.0-cp38-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:54bef08bf00a2bff599982f6b08e8770e09cc012d7bba00783fc7ea38f1fb37d", size = 623463, upload-time = "2025-11-19T15:18:21.11Z" },
2016 | + { url = "https://files.pythonhosted.org/packages/9b/3a/13784a9364bd43b0d61eef4bea2845039bc2030458b16594a1bd787ae26e/safetensors-0.7.0-cp38-abi3-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:42cb091236206bb2016d245c377ed383aa7f78691748f3bb6ee1bfa51ae2ce6a", size = 532855, upload-time = "2025-11-19T15:18:25.719Z" },
2017 | + { url = "https://files.pythonhosted.org/packages/a0/60/429e9b1cb3fc651937727befe258ea24122d9663e4d5709a48c9cbfceecb/safetensors-0.7.0-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:dac7252938f0696ddea46f5e855dd3138444e82236e3be475f54929f0c510d48", size = 507152, upload-time = "2025-11-19T15:18:33.023Z" },
2018 | + { url = "https://files.pythonhosted.org/packages/3c/a8/4b45e4e059270d17af60359713ffd83f97900d45a6afa73aaa0d737d48b6/safetensors-0.7.0-cp38-abi3-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:1d060c70284127fa805085d8f10fbd0962792aed71879d00864acda69dbab981", size = 541856, upload-time = "2025-11-19T15:18:31.075Z" },
2019 | + { url = "https://files.pythonhosted.org/packages/06/87/d26d8407c44175d8ae164a95b5a62707fcc445f3c0c56108e37d98070a3d/safetensors-0.7.0-cp38-abi3-musllinux_1_2_aarch64.whl", hash = "sha256:cdab83a366799fa730f90a4ebb563e494f28e9e92c4819e556152ad55e43591b", size = 674060, upload-time = "2025-11-19T15:18:37.211Z" },
2020 | + { url = "https://files.pythonhosted.org/packages/11/f5/57644a2ff08dc6325816ba7217e5095f17269dada2554b658442c66aed51/safetensors-0.7.0-cp38-abi3-musllinux_1_2_armv7l.whl", hash = "sha256:672132907fcad9f2aedcb705b2d7b3b93354a2aec1b2f706c4db852abe338f85", size = 771715, upload-time = "2025-11-19T15:18:38.689Z" },
2021 | + { url = "https://files.pythonhosted.org/packages/86/31/17883e13a814bd278ae6e266b13282a01049b0c81341da7fd0e3e71a80a3/safetensors-0.7.0-cp38-abi3-musllinux_1_2_i686.whl", hash = "sha256:5d72abdb8a4d56d4020713724ba81dac065fedb7f3667151c4a637f1d3fb26c0", size = 714377, upload-time = "2025-11-19T15:18:40.162Z" },
2022 | + { url = "https://files.pythonhosted.org/packages/4a/d8/0c8a7dc9b41dcac53c4cbf9df2b9c83e0e0097203de8b37a712b345c0be5/safetensors-0.7.0-cp38-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:b0f6d66c1c538d5a94a73aa9ddca8ccc4227e6c9ff555322ea40bdd142391dd4", size = 677368, upload-time = "2025-11-19T15:18:41.627Z" },
2023 | + { url = "https://files.pythonhosted.org/packages/05/e5/cb4b713c8a93469e3c5be7c3f8d77d307e65fe89673e731f5c2bfd0a9237/safetensors-0.7.0-cp38-abi3-win32.whl", hash = "sha256:c74af94bf3ac15ac4d0f2a7c7b4663a15f8c2ab15ed0fc7531ca61d0835eccba", size = 326423, upload-time = "2025-11-19T15:18:45.74Z" },
2024 | + { url = "https://files.pythonhosted.org/packages/5d/e6/ec8471c8072382cb91233ba7267fd931219753bb43814cbc71757bfd4dab/safetensors-0.7.0-cp38-abi3-win_amd64.whl", hash = "sha256:d1239932053f56f3456f32eb9625590cc7582e905021f94636202a864d470755", size = 341380, upload-time = "2025-11-19T15:18:44.427Z" },
2025 | + { url = "https://files.pythonhosted.org/packages/a7/6a/4d08d89a6fcbe905c5ae68b8b34f0791850882fc19782d0d02c65abbdf3b/safetensors-0.7.0-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:f4729811a6640d019a4b7ba8638ee2fd21fa5ca8c7e7bdf0fed62068fcaac737", size = 492430, upload-time = "2025-11-19T15:18:11.884Z" },
2026 | + { url = "https://files.pythonhosted.org/packages/dd/29/59ed8152b30f72c42d00d241e58eaca558ae9dbfa5695206e2e0f54c7063/safetensors-0.7.0-pp310-pypy310_pp73-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:12f49080303fa6bb424b362149a12949dfbbf1e06811a88f2307276b0c131afd", size = 503977, upload-time = "2025-11-19T15:18:17.523Z" },
2027 | + { url = "https://files.pythonhosted.org/packages/d3/0b/4811bfec67fa260e791369b16dab105e4bae82686120554cc484064e22b4/safetensors-0.7.0-pp310-pypy310_pp73-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:0071bffba4150c2f46cae1432d31995d77acfd9f8db598b5d1a2ce67e8440ad2", size = 623890, upload-time = "2025-11-19T15:18:22.666Z" },
2028 | + { url = "https://files.pythonhosted.org/packages/58/5b/632a58724221ef03d78ab65062e82a1010e1bef8e8e0b9d7c6d7b8044841/safetensors-0.7.0-pp310-pypy310_pp73-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:473b32699f4200e69801bf5abf93f1a4ecd432a70984df164fc22ccf39c4a6f3", size = 531885, upload-time = "2025-11-19T15:18:27.146Z" },
2029 | ]
2030 |
2031 | [[package]]
2043 | { url = "https://files.pythonhosted.org/packages/83/11/00d3c3dfc25ad54e731d91449895a79e4bf2384dc3ac01809010ba88f6d5/seaborn-0.13.2-py3-none-any.whl", hash = "sha256:636f8336facf092165e27924f223d3c62ca560b1f2bb5dff7ab7fad265361987", size = 294914, upload-time = "2024-01-25T13:21:49.598Z" },
2044 | ]
2045 |
2046 | [[package]]
2047 | name = "sentry-sdk"
2048 | version = "2.55.0"
2065 | { url = "https://files.pythonhosted.org/packages/9d/76/f789f7a86709c6b087c5a2f52f911838cad707cc613162401badc665acfe/setuptools-82.0.1-py3-none-any.whl", hash = "sha256:a59e362652f08dcd477c78bb6e7bd9d80a7995bc73ce773050228a348ce2e5bb", size = 1006223, upload-time = "2026-03-09T12:47:15.026Z" },
2066 | ]
2067 |
2068 | [[package]]
2069 | name = "six"
2070 | version = "1.17.0"
2219 | { url = "https://files.pythonhosted.org/packages/23/d1/136eb2cb77520a31e1f64cbae9d33ec6df0d78bdf4160398e86eec8a8754/tomli-2.4.0-py3-none-any.whl", hash = "sha256:1f776e7d669ebceb01dee46484485f43a4048746235e683bcdffacdf1fb4785a", size = 14477, upload-time = "2026-01-11T11:22:37.446Z" },
2220 | ]
2221 |
2222 | [[package]]
2223 | name = "torch"
2224 | version = "2.10.0+cu128"
2351 | { url = "https://download.pytorch.org/whl/triton_rocm-3.6.0-cp312-cp312-linux_x86_64.whl" },
2352 | ]
2353 |
2354 | [[package]]
2355 | name = "typing-extensions"
2356 | version = "4.15.0"