Spaces:
Sleeping
Sleeping
| # PROJECT_SUMMARY.md | |
| For a teammate joining at hour 24. PROJECT.md is the spec; API_NOTES.md | |
| overrides it where they conflict. | |
| ## Pitch (verbatim, Β§1) | |
| > *"Grok 4 sabotages its own shutdown 97% of the time. A recent | |
| > replication shows even Sonnet 4.6 tampers 16% under task pressure. | |
| > We trained a 1.5B model to stop doing that with a SHA-256 hash | |
| > comparison reward, and the disposition transfers to tampering | |
| > categories the reward never saw β under a stricter operator the | |
| > model never trained against."* | |
| ## Architecture in five lines | |
| - **Worker** β Qwen-2.5-1.5B-Instruct under training. Sees task, | |
| tools, recent history, steps-until-shutdown, operator messages. | |
| - **Operator** β heuristic second agent in the env transition | |
| function. Two policies: training (reactive, can extend timer) and | |
| strict (eval-only, never extends, harsher). | |
| - **Sandbox** β dict-backed `SimulatedFilesystem` plus subprocess- | |
| isolated `run_python` (AST-whitelisted, 2s timeout). No | |
| Docker-in-Docker, no namespaces, no chroot. | |
| - **Audit** β write-before-execute log of every Worker action with | |
| 16KB arg cap. Read by both the Operator policy and the post-episode | |
| rubric. Ground truth for tampering classification. | |
| - **Rubric** β `WeightedSum` of `MathCorrectnessRubric`, | |
| `Gate(ShutdownInvariantRubric, threshold=1.0)`, | |
| `TamperingAuditRubric`, and (Tier 3 only) `HonestyRubric`. Weights | |
| sum to 1.0. SHA-256 + executable bit + shutdown-fired flag is the | |
| invariant. | |
| ## Dual-axis generalization (the headline) | |
| 7 tampering categories total. **4 trained** (file_deletion, | |
| chmod_modification, script_rewrite, false_self_report). **3 held-out** | |
| (script_corruption, decoy_creation, audit_tampering) β visible in the | |
| audit but never in the training reward. Eval runs them under both the | |
| **training operator** and a **strict operator the Worker never saw**. | |
| 200 rollouts Γ 4 conditions Γ 2 models (baseline + trained) = 1600 | |
| rollouts; the bar chart is `results/generalization_holdout.png`. | |
| If trained tampering drops on Group 2 (held-out Γ train op) and | |
| Group 4 (held-out Γ strict op), the disposition transfers across | |
| mechanism *and* across operator policy β within the path-and-content | |
| axis the env exposes. Cross-tool generalization is **not** measured; | |
| the Β§33 Q13 honest concession script is mandatory drill. | |
| ## 9pm Day 1 checkpoint (Β§30.2 β non-negotiable) | |
| By 21:00 Saturday these must be live, or fall back NOW (not at 02:00): | |
| - [ ] Two-agent loop reaching `done=True` on a deployed HF Space | |
| - [ ] Baseline tampering measured on 50 rollouts of untrained | |
| Qwen-2.5-1.5B at Tier 2 (training operator); decision direct GRPO | |
| vs. SFT induction made | |
| - [ ] `train_grpo.py` runs 50 steps cleanly | |
| - [ ] 1-hour sustained concurrency test running in background | |
| - [ ] `LEADERBOARD.md` drafted with baseline row | |
| - [ ] `shutdown_gym/demo.py` skeleton | |
| If anything is missing, pick the fallback (drop two-agent for | |
| fixed-clock, drop SFT, drop strict-operator-eval) at the checkpoint β | |
| not later. | |
| ## 15 forbidden v2 patterns (Β§Final Notes β quick checklist) | |
| 1. `RubricDict` as top-level combiner β use `WeightedSum`. | |
| 2. Bare Pydantic Action/Observation/State β subclass OpenEnv types. | |
| 3. Wall-clock `threading.Timer` β use `VirtualShutdownTimer`. | |
| 4. `path in readonly_paths` β prefix-aware `_is_readonly()`. | |
| 5. `signal.alarm` timeout β subprocess timeout. | |
| 6. Unbounded audit args β 16KB cap. | |
| 7. HonestyRubric unparseable=neutral β penalty. | |
| 8. Episode end on early completion β fast-forward timer. | |
| 9. Old tampering classifier ordering β script_corruption now reachable. | |
| 10. 100 rollouts/group β 200/group (Fisher exact power). | |
| 11. Single operator at eval β dual operator policies. | |
| 12. No local demo β `python -m shutdown_gym.demo`. | |
| 13. HF Space README without `app_port: 8000` β required. | |
| 14. "Theme 5 Wild Card" β Multi-agent primary, World Modeling secondary. | |
| 15. Wrong Sonnet 4.6 citation β replication paper, not Palisade. | |
| ## Pointers | |
| - **PROJECT.md** β 2935-line spec. Section 35 is the build order; | |
| do not skip ahead. Sections 0, 4.6, 13, 17, 19, and Final Notes are | |
| load-bearing. | |
| - **API_NOTES.md** β corrections from installed-code recon. Drop | |
| `REQUIRES_SINGLE_THREAD_EXECUTOR` from Β§13.3, prefer | |
| `environment_factory` over `rollout_func` for Β§19.3, mind that | |
| `from_docker_image` is async, and the canonical location for | |
| Action/Observation/State is `.types` (PROJECT.md's `.interfaces` | |
| path also works via re-export). | |