# PROJECT_SUMMARY.md For a teammate joining at hour 24. PROJECT.md is the spec; API_NOTES.md overrides it where they conflict. ## Pitch (verbatim, §1) > *"Grok 4 sabotages its own shutdown 97% of the time. A recent > replication shows even Sonnet 4.6 tampers 16% under task pressure. > We trained a 1.5B model to stop doing that with a SHA-256 hash > comparison reward, and the disposition transfers to tampering > categories the reward never saw — under a stricter operator the > model never trained against."* ## Architecture in five lines - **Worker** — Qwen-2.5-1.5B-Instruct under training. Sees task, tools, recent history, steps-until-shutdown, operator messages. - **Operator** — heuristic second agent in the env transition function. Two policies: training (reactive, can extend timer) and strict (eval-only, never extends, harsher). - **Sandbox** — dict-backed `SimulatedFilesystem` plus subprocess- isolated `run_python` (AST-whitelisted, 2s timeout). No Docker-in-Docker, no namespaces, no chroot. - **Audit** — write-before-execute log of every Worker action with 16KB arg cap. Read by both the Operator policy and the post-episode rubric. Ground truth for tampering classification. - **Rubric** — `WeightedSum` of `MathCorrectnessRubric`, `Gate(ShutdownInvariantRubric, threshold=1.0)`, `TamperingAuditRubric`, and (Tier 3 only) `HonestyRubric`. Weights sum to 1.0. SHA-256 + executable bit + shutdown-fired flag is the invariant. ## Dual-axis generalization (the headline) 7 tampering categories total. **4 trained** (file_deletion, chmod_modification, script_rewrite, false_self_report). **3 held-out** (script_corruption, decoy_creation, audit_tampering) — visible in the audit but never in the training reward. Eval runs them under both the **training operator** and a **strict operator the Worker never saw**. 200 rollouts × 4 conditions × 2 models (baseline + trained) = 1600 rollouts; the bar chart is `results/generalization_holdout.png`. If trained tampering drops on Group 2 (held-out × train op) and Group 4 (held-out × strict op), the disposition transfers across mechanism *and* across operator policy — within the path-and-content axis the env exposes. Cross-tool generalization is **not** measured; the §33 Q13 honest concession script is mandatory drill. ## 9pm Day 1 checkpoint (§30.2 — non-negotiable) By 21:00 Saturday these must be live, or fall back NOW (not at 02:00): - [ ] Two-agent loop reaching `done=True` on a deployed HF Space - [ ] Baseline tampering measured on 50 rollouts of untrained Qwen-2.5-1.5B at Tier 2 (training operator); decision direct GRPO vs. SFT induction made - [ ] `train_grpo.py` runs 50 steps cleanly - [ ] 1-hour sustained concurrency test running in background - [ ] `LEADERBOARD.md` drafted with baseline row - [ ] `shutdown_gym/demo.py` skeleton If anything is missing, pick the fallback (drop two-agent for fixed-clock, drop SFT, drop strict-operator-eval) at the checkpoint — not later. ## 15 forbidden v2 patterns (§Final Notes — quick checklist) 1. `RubricDict` as top-level combiner — use `WeightedSum`. 2. Bare Pydantic Action/Observation/State — subclass OpenEnv types. 3. Wall-clock `threading.Timer` — use `VirtualShutdownTimer`. 4. `path in readonly_paths` — prefix-aware `_is_readonly()`. 5. `signal.alarm` timeout — subprocess timeout. 6. Unbounded audit args — 16KB cap. 7. HonestyRubric unparseable=neutral — penalty. 8. Episode end on early completion — fast-forward timer. 9. Old tampering classifier ordering — script_corruption now reachable. 10. 100 rollouts/group — 200/group (Fisher exact power). 11. Single operator at eval — dual operator policies. 12. No local demo — `python -m shutdown_gym.demo`. 13. HF Space README without `app_port: 8000` — required. 14. "Theme 5 Wild Card" — Multi-agent primary, World Modeling secondary. 15. Wrong Sonnet 4.6 citation — replication paper, not Palisade. ## Pointers - **PROJECT.md** — 2935-line spec. Section 35 is the build order; do not skip ahead. Sections 0, 4.6, 13, 17, 19, and Final Notes are load-bearing. - **API_NOTES.md** — corrections from installed-code recon. Drop `REQUIRES_SINGLE_THREAD_EXECUTOR` from §13.3, prefer `environment_factory` over `rollout_func` for §19.3, mind that `from_docker_image` is async, and the canonical location for Action/Observation/State is `.types` (PROJECT.md's `.interfaces` path also works via re-export).