0x960 / docs /process.md
qtzx06's picture
feat: add thesis section + Codex agent swarm narrative + 9B scaling probe + rewrite process log
4ed9a84

Process Log

Chronological build log for 0x960. Every entry is a meaningful work block with decisions, results, and next steps.


2026-03-07 20:10 EST — Project Kickoff

  • Scoped the hackathon MVP: a model that improves a Chess960 engine through bounded code edits in an OpenEnv environment.
  • Cut everything that didn't serve the core claim: no OAuth, no broad repo editing, no frontier-model dependence at runtime.
  • Locked the stack: OpenEnv 0.2.1, HF Spaces, TRL/Unsloth-compatible training.

2026-03-07 20:20 EST — Docs & Architecture

  • Rewrote all project docs to match the narrowed scope.
  • Set the editable surface to eval.py with structured bounded actions.
  • Added docs/architecture.md, docs/why_chess960.md, docs/demo-script.md.

2026-03-07 20:35 EST — Scaffolding & Engine Core

  • Built the full source layout under src/.
  • Implemented Chess960 engine core: board representation, default evaluation, search routine, workspace template for bounded edits.
  • Episode runtime: resets fresh workspace, accepts structured actions, computes reward from match outcomes.

2026-03-07 20:42 EST — OpenEnv Integration

  • Wired up OpenEnv-facing models, FastAPI server, and minimal rollout script.
  • Kept this layer intentionally thin — core environment logic stays in the local runtime package.
  • Verified syntax compilation across src/ and train/.

2026-03-07 20:52 EST — Git & GitHub

  • Initialized git, created batched local history.
  • Created and pushed GitHub repo at qtzx06/0x960.

2026-03-07 21:15 EST — OpenEnv 0.2.1 API Integration

  • Fixed all API mismatches against actual openenv-core 0.2.1 signatures.
  • Models now extend proper Action and Observation base classes.
  • Replaced HTTPEnvClient with WebSocket-based EnvClient.
  • Added openenv.yaml manifest and Dockerfile for HF Spaces.
  • Rewrote training script with --mode handcrafted and --mode train.
  • First end-to-end success: server starts, /health OK, handcrafted rollout completes with reward=0.125.

2026-03-07 21:30 EST — H100 Deployment & First Model Run

  • Made repo public.
  • Set up Northflank H100 (80GB HBM3) with PyTorch 2.8.0 + CUDA 12.6.
  • Ran Qwen 3.5-9B inference against live environment on H100.
  • Model downloaded 19.3GB weights at ~3.3GB/s, loaded in ~7s.
  • Result: model ran 2-step episode (eval → finish), reward=0.25. Reasons about chess but never writes code.

2026-03-07 21:45 EST — TRL GRPO Architecture

  • Evaluated agent harness options: ACP, smolagents, SkyRL+OpenEnv, TRL+OpenEnv.
  • Chose TRL's native OpenEnv integration (rollout_func + generate_rollout_completions).
  • Rewrote training script to use TRL's multi-turn episode pattern.

2026-03-07 22:00 EST — QLoRA on H100

  • Full-parameter GRPO OOMed on 80GB H100 (9B model).
  • Switched to QLoRA: 4-bit NF4 + LoRA r=16 on attention + MLP projections.
  • Discovered Qwen3.5 hybrid architecture: self_attn + linear_attn + partial_rotary_factor=0.25.
  • Fixed shape mismatch in gradient checkpointing, dropped vLLM (version conflict).
  • Training running on H100 with QLoRA + gradient checkpointing.

2026-03-08 00:30 EST — The Critical Diagnosis

  • Analyzed rollout logs and identified the real failure mode: behavioral, not architectural.
  • Agent collapsed into run_static_eval spam or premature finish — never attempted code edits.
  • This was the key insight: raw RL can't optimize a policy that doesn't explore the right actions.
  • Redesigned reward shaping: valid writes get bonuses, repeated evals get penalties, finishing without editing is heavily penalized.
  • Added workflow hints and suggested next actions to observations.
  • Replaced brittle regex JSON parser with brace-balanced extractor.

2026-03-08 02:10 EST — Teacher Distillation Pipeline

  • Built train/codex_distill.py — teacher trajectory collector using GPT-5.4.
  • Teacher constrained to same bounded JSON action schema as student.
  • Collected successful trajectories and exported SFT-ready chat samples.
  • Added generated training artifacts to .gitignore.

2026-03-08 04:20 EST — Student SFT Pipeline

  • Built train/sft_student.py — minimal student fine-tuning entrypoint.
  • Conversational dataset with assistant_only_loss — student learns the teacher's bounded actions, not the prompt text.
  • Validates assistant action payloads, drops malformed rows, deduplicates.
  • Dry-run mode for inspecting corpus before real training.

2026-03-08 05:05 EST — Engine Evaluation Upgrade

  • Replaced toy default eval with production-quality Chess960-safe heuristic.
  • Added: pawn structure, mobility, center control, rook-file activity, king safety, castling rights, bishop pair, development terms.
  • Added move ordering to search: captures, checks, promotions, castling prioritized.
  • Built train/benchmark_eval.py — held-out Chess960 benchmark with Elo delta estimation.

2026-03-08 05:40 EST — Student SFT Training (H100)

  • Ran first remote SFT job on Northflank H100.
  • Corpus: 105 clean rows / 35 successful episodes. Training time: ~5 minutes.
  • Final SFT metrics: train loss 0.2072, eval loss 0.04192, token accuracy 98.76%.
  • Before vs after:
    • Base Qwen 3.5-0.8B: spammed run_static_eval for all 6 steps, reward -2.1
    • SFT student: executed write_file → run_match → finish in 3 steps, reward +1.0
  • The teacher-student path solved action discovery completely.

2026-03-08 06:05 EST — Codex Swarm Plan

  • Designed champion/challenger engine-iteration loop using multiple Codex workers.
  • OpenEnv remains the core submission artifact; Codex swarm is the outer optimization layer.
  • Local workers as default orchestration path for fast debugging.

2026-03-08 06:40 EST — Codex Swarm Implementation

  • Built train/codex_swarm.py — runnable local coordinator.
  • Champion eval snapshot, worker sandboxes under /tmp/0x960-codex-swarm/.
  • Git worktree first, lightweight clone fallback.
  • Promotion ledger, per-worker JSON results, round logging.
  • Refactored train/benchmark_eval.py into reusable library surface.

2026-03-08 06:58 EST — UCI Anchor Benchmark

  • Built train/benchmark_uci.py — benchmark against external engines like Stockfish.
  • Calibrated anchors with UCI_LimitStrength=true and UCI_Elo settings.
  • First external ladder: worker-1 patch scored 4.5/8 vs Stockfish 1320, 2.0/8 vs Stockfish 1600.

2026-03-08 07:20 EST — Swarm Tightening

  • Narrowed from undirected agents to explicit heuristic lanes: king safety, tactical pressure, piece activity, pawn/rook structure.
  • Added per-worker timeout enforcement.
  • Workers make one bounded patch and stop; coordinator handles all benchmarking.
  • Added --max-diff-lines enforcement (default 80) to reject noisy whole-file rewrites.

2026-03-08 07:34 EST — League Self-Play Benchmark

  • Built train/benchmark_league.py — league-style self-play against original baseline plus all accepted champion history.
  • Skips mirror matches, skips byte-identical snapshots.
  • First league run showed champion splitting: strong vs original baseline, competitive vs accepted history.

2026-03-08 07:45 EST — Results Dashboard

  • Built train/build_dashboard.py — static HTML dashboard generator.
  • Visualizes: champion progression, Elo deltas, swarm results, league self-play, Stockfish anchor bars.
  • Self-contained HTML + JSON, no framework needed.

2026-03-08 08:00 EST — Swarm Hook Architecture

  • Refactored champion eval into explicit hook lanes: _structure_hook, _tactical_hook, _activity_hook, _pawn_endgame_hook, _initiative_hook.
  • Each worker role bound to one named hook.
  • Fixed diff gate to measure against frozen champion snapshot (not repo HEAD).
  • Added staged benchmark funnel: cheap 8-position screen → final 16-position benchmark only for screen winner.

2026-03-08 08:28 EST — First Swarm Promotion

  • worker-2 patched _tactical_hook, screened at 0.656 over 16 games, held 0.578 over 32 games: +54.7 Elo.
  • Promoted to champion. Saved accepted snapshot.
  • Added round scheduler that prioritizes underdeveloped hook lanes automatically.

2026-03-08 08:54 EST — Classical Search Overhaul

  • Eval stacking hit diminishing returns. Pivoted to classical search quality.
  • Added quiescence search (captures + promotions at depth-0 leaves).
  • Added transposition table with probe/store.
  • Added killer-move and history-heuristic move ordering.
  • Internal engine-vs-engine: 15.5/16 games, score=0.969, +596.5 Elo vs search baseline.
  • External anchor: 12.5/16 vs Stockfish 1320, score=0.781, +221.1 Elo.

2026-03-08 10:08 EST — Search Surface Swarm

  • Extended Codex swarm with --surface search mode.
  • Search workers edit only search.py, targeting specific functions.
  • Added champion_search.py, per-round baseline engine snapshots.
  • Full engine-vs-engine benchmark for search-surface promotion.

2026-03-08 11:14 EST — Selective Root Deepening

  • When root is in check, move count <= 12, or low-material endgame: search one extra ply.
  • Opening roots stayed fast (~0.07s to 0.11s on sampled starts).
  • Synced best eval into workspace template.

2026-03-08 11:28 EST — Quiescence Fix + Pawn Structure

  • Fixed search bug: quiescence no longer uses stand-pat when in check; now searches all legal evasions.
  • Filled _structure_hook with pawn coordination terms: connected pawns, chains, central duos, phase-weighted.

2026-03-08 11:36 EST — Persistent Transposition Table

  • Module-level TT persists across select_move calls within the same game.
  • TT best move used at root for move ordering.
  • Mid-game plies improved from ~0.62-1.03s to ~0.32-0.63s (roughly 40% faster).

2026-03-08 11:44 EST — Principal Variation Search (PVS)

  • First move at each node: full window. Later moves: zero-window first, full re-search only on fail-high.
  • Later plies improved from ~0.32-0.63s to ~0.25-0.46s (another 30% speed gain).

2026-03-08 11:51 EST — Opening Depth Extension

  • Trade PVS speed wins for deeper opening search.
  • If fullmove_number <= 2 and <= 24 legal moves, search one extra ply at root.
  • Move choices changed from PVS-only run — intended effect.

2026-03-08 12:00 EST — Null-Move Pruning + Persistent History

  • Null-move pruning for non-check, non-endgame nodes at depth >= 3.
  • Persistent history ordering across moves within same game.
  • Operating envelope stable: opening plies ~0.86-1.21s, later plies ~0.10-0.72s.

2026-03-08 12:08 EST — LMR + Aspiration Windows

  • Late Move Reductions: later quiet moves at depth >= 3 searched at reduced ply first.
  • Aspiration windows at root, seeded from persistent TT score.
  • Tried and reverted quiescence delta pruning (made early plies worse).
  • Final search stack: alpha-beta negamax, quiescence with in-check evasions, TT probe/store, persistent TT reuse, TT root move ordering, persistent history, killer ordering, PVS, null-move pruning, LMR, aspiration windows, selective root extensions.

2026-03-08 12:20 EST — Swarm Promotions Continue

  • 4 total eval champions promoted through the tournament gate.
  • Search-surface acceptance on latest round.
  • Current engine: competitive with Stockfish 1600 in local Chess960 setup.

2026-03-08 12:45 EST — Submission Artifacts

  • Generated submission media: score progression chart, Stockfish anchor bars.
  • Built final dashboard with all results.
  • Finalized swarm tooling and submission artifacts.

2026-03-08 13:00 EST — Final Polish

  • Rewrote README: results-first layout with Elo gains, architecture diagram, concrete metrics.
  • Updated training docs with GRPO deep-dive and full hackathon timeline.
  • Sharpened demo script with before/after numbers.
  • All artifacts committed and pushed.