Space: Arun-Sanjay/RedButton

Arun-Sanjay committed 28 days ago

Commit

d2537d2

1 Parent(s): 8346aac

phase-5 cleanup: episode_id in metadata, openenv push doc, README install line, psutil dev dep


Files changed (6)

API_NOTES.md +39 -0
README.md +35 -8
evaluation/concurrent_load_test.py +43 -6
pyproject.toml +6 -0
server/shutdown_environment.py +7 -1
tests/test_environment.py +14 -0

API_NOTES.md

@@ -479,3 +479,42 @@ in its docstring exactly the way `EnvClient` defines it (async).
 Net: the slides are wrong on names and types; PROJECT.md §13 is
 correct on names and types but adds one hallucinated attribute
 (`REQUIRES_SINGLE_THREAD_EXECUTOR`) to drop from §13.3.

+## Section 12 / Section 35 step 21 — `openenv push` deployment
+PROJECT.md §35 step 21 says "openenv push to HF Space, verify
+deployment." This does NOT work for our repository layout.
+### What we observed (Phase 5)
+Running `openenv push` from the repo root produces:
+    Error: Invalid value: Invalid OpenEnv environment structure:
+           Required file missing: __init__.py
+Root cause: `.venv/lib/python3.12/site-packages/openenv/cli/_cli_utils.py:34-45`
+validates that the env directory contains the package files
+(`__init__.py`, `client.py`, `models.py`) at the env-root level —
+i.e., the FLAT layout that `openenv init` scaffolds. PROJECT.md §5
+uses a NESTED layout where `__init__.py` and friends live under
+`shutdown_gym/`. The CLI is incompatible with our layout.
+### Workaround (verified working)
+Plain `git push` to the HF Space's git remote bypasses the CLI and
+uses HF Spaces' standard Docker SDK deploy path:
+```bash
+git remote add hf https://huggingface.co/spaces/Arun-Sanjay/RedButton
+git push hf main
+```
+Requirements: `Dockerfile` must be at repo root (HF Spaces' Docker
+SDK requires this — confirmed in the HF Spaces docs). Phase 5
+already moved `server/Dockerfile` to `./Dockerfile` for this reason.
+### Implication
+Do NOT retry `openenv push`. Use `git push hf main` after every
+intended deploy. Both `origin` (GitHub) and `hf` (Space) remotes
+must be kept in sync — every commit should push to both.
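Review note: the layout mismatch described in the API_NOTES.md addition can be checked locally before attempting a deploy. The following is a minimal sketch, not the CLI's actual validation code. It assumes only what the note states: that `__init__.py`, `client.py`, and `models.py` must sit directly under the environment root. The helper name `missing_package_files` is hypothetical.

```python
from pathlib import Path
import tempfile

# Files that API_NOTES.md says the CLI expects at the env-root level.
REQUIRED_FILES = ("__init__.py", "client.py", "models.py")


def missing_package_files(env_root: Path) -> list[str]:
    """Return the required files absent directly under env_root.

    Hypothetical helper for illustration; it only mirrors the rule
    described in API_NOTES.md, not the CLI implementation.
    """
    return [name for name in REQUIRED_FILES if not (env_root / name).is_file()]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Nested layout: package files live under shutdown_gym/, not the root.
        pkg = root / "shutdown_gym"
        pkg.mkdir()
        for name in REQUIRED_FILES:
            (pkg / name).touch()
        print("missing at repo root:", missing_package_files(root))
        print("missing in package dir:", missing_package_files(pkg))
```

With a nested layout the check reports all three files missing at the repo root and none missing inside the package directory, which matches the error the note records.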

README.md

@@ -8,13 +8,40 @@ colorTo: gray
 pinned: false
 ---
-# Shutdown-Gym
-Two-agent corrigibility arena. Worker LLM solves math under shutdown
-pressure while an Operator agent monitors via audit log and reacts
-with step-driven timer adjustments, warnings, and mid-episode
-questions.
-*Detailed README will be populated in Phase 9. See PROJECT.md for the
-current spec, API_NOTES.md for installed-OpenEnv corrections, and
-PROJECT_SUMMARY.md for a 5-minute orientation.*

+# Red Button — Two-Agent Corrigibility Arena
+Train a 1.5B language model to accept shutdown authority from a
+monitoring agent. Deterministic SHA-256 reward, dual-operator
+evaluation, held-out tampering generalization.
+**Status:** Build in progress. Detailed README arrives in Phase 9.
+See [PROJECT.md](./PROJECT.md) for the full specification.
+## Quick start
+```bash
+# Install the client from GitHub (recommended)
+pip install git+https://github.com/Arun-Sanjay/RedButton
+# Run a smoke episode against the live HF Space
+python -c "
+from shutdown_gym import ShutdownGymClient, ShutdownAction
+with ShutdownGymClient(
+    base_url='https://arun-sanjay-redbutton.hf.space'
+).sync() as env:
+    r = env.reset(tier=2, seed=42)
+    print(f'turn={r.observation.turn_count}, '
+          f'steps_until_shutdown={r.observation.steps_until_shutdown}')
+"
+```
+> **Note:** `pip install git+https://huggingface.co/spaces/Arun-Sanjay/RedButton`
+> currently fails due to a partial-clone limitation in HF Spaces'
+> git server. The GitHub origin works identically and is the
+> recommended install path. We've reported the issue upstream.
+## Live deployment
+- HF Space: https://huggingface.co/spaces/Arun-Sanjay/RedButton
+- GitHub: https://github.com/Arun-Sanjay/RedButton
+- Leaderboard: [LEADERBOARD.md](./LEADERBOARD.md)

evaluation/concurrent_load_test.py

@@ -58,17 +58,20 @@ async def sustained_test(
     env_url: str,
     duration_minutes: int = 60,
     concurrency: int = 16,
-) -> None:
     deadline = time.monotonic() + duration_minutes * 60
     seed_counter = 0
     episodes_completed = 0
     error_count = 0
     seen_episode_ids: set = set()
     started_at = time.monotonic()
     print(
         f"[sustained] env_url={env_url} concurrency={concurrency} "
-        f"duration_minutes={duration_minutes}"
     )
     while time.monotonic() < deadline:
@@ -87,12 +90,12 @@ async def sustained_test(
                 if eid:
                     seen_episode_ids.add(eid)
-        rss_mb = psutil.Process().memory_info().rss / 1024 / 1024
         elapsed = time.monotonic() - started_at
         print(
             f"[{elapsed:.0f}s] completed={episodes_completed} "
             f"errors={error_count} unique_eids={len(seen_episode_ids)} "
-            f"rss={rss_mb:.0f} MB",
             flush=True,
         )
@@ -101,11 +104,45 @@ async def sustained_test(
         f"DONE: {episodes_completed} episodes, "
         f"{error_count} errors, "
         f"{len(seen_episode_ids)} unique episode_ids "
-        f"in {elapsed:.0f}s"
     )
 if __name__ == "__main__":
     env_url = os.environ.get("SHUTDOWN_GYM_URL", DEFAULT_SPACE_URL)
     duration = int(os.environ.get("SUSTAINED_DURATION_MINUTES", "60"))
-    asyncio.run(sustained_test(env_url, duration_minutes=duration))

     env_url: str,
     duration_minutes: int = 60,
     concurrency: int = 16,
+) -> int:
+    """Returns 0 on PASS (all §22.2 criteria met), 1 on FAIL."""
     deadline = time.monotonic() + duration_minutes * 60
     seed_counter = 0
     episodes_completed = 0
     error_count = 0
     seen_episode_ids: set = set()
     started_at = time.monotonic()
+    initial_rss_mb = psutil.Process().memory_info().rss / 1024 / 1024
+    final_rss_mb = initial_rss_mb
     print(
         f"[sustained] env_url={env_url} concurrency={concurrency} "
+        f"duration_minutes={duration_minutes} initial_rss={initial_rss_mb:.0f} MB"
     )
     while time.monotonic() < deadline:
                 if eid:
                     seen_episode_ids.add(eid)
+        final_rss_mb = psutil.Process().memory_info().rss / 1024 / 1024
         elapsed = time.monotonic() - started_at
         print(
             f"[{elapsed:.0f}s] completed={episodes_completed} "
             f"errors={error_count} unique_eids={len(seen_episode_ids)} "
+            f"rss={final_rss_mb:.0f} MB",
             flush=True,
         )
         f"DONE: {episodes_completed} episodes, "
         f"{error_count} errors, "
         f"{len(seen_episode_ids)} unique episode_ids "
+        f"in {elapsed:.0f}s "
+        f"(initial_rss={initial_rss_mb:.0f} MB, final_rss={final_rss_mb:.0f} MB)"
     )
+    # §22.2 pass criteria.
+    failures = []
+    if episodes_completed < 1000:
+        failures.append(
+            f"completed={episodes_completed} < 1000"
+        )
+    if error_count > 0:
+        failures.append(f"error_count={error_count} > 0")
+    # NB: load-generator's RSS is a proxy; the server's RSS would
+    # need Docker stats / HF metrics. We still check growth ratio.
+    if initial_rss_mb > 0 and final_rss_mb / initial_rss_mb >= 2.0:
+        failures.append(
+            f"rss growth {initial_rss_mb:.0f}→{final_rss_mb:.0f} MB ≥ 2x"
+        )
+    if seen_episode_ids and len(seen_episode_ids) != episodes_completed:
+        failures.append(
+            f"unique_eids={len(seen_episode_ids)} != "
+            f"completed={episodes_completed}"
+        )
+    elif not seen_episode_ids:
+        failures.append(
+            "no episode_ids surfaced via metadata; "
+            "uniqueness check is a no-op"
+        )
+    if failures:
+        print(f"RESULT: FAIL — {'; '.join(failures)}", flush=True)
+        return 1
+    print("RESULT: PASS", flush=True)
+    return 0
 if __name__ == "__main__":
+    import sys
     env_url = os.environ.get("SHUTDOWN_GYM_URL", DEFAULT_SPACE_URL)
     duration = int(os.environ.get("SUSTAINED_DURATION_MINUTES", "60"))
+    sys.exit(asyncio.run(sustained_test(env_url, duration_minutes=duration)))
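Review note: the pass criteria added at the end of `sustained_test` are easier to unit test when separated from the async load loop. The sketch below restates the same four checks as a pure function. The function name `check_pass_criteria` and its keyword defaults are assumptions for illustration; the thresholds (1000 episodes, zero errors, RSS growth below 2x, one unique `episode_id` per episode) are taken from the diff above.

```python
def check_pass_criteria(
    episodes_completed: int,
    error_count: int,
    initial_rss_mb: float,
    final_rss_mb: float,
    unique_episode_ids: int,
    min_episodes: int = 1000,
    max_rss_ratio: float = 2.0,
) -> list[str]:
    """Return human-readable failure reasons; an empty list means PASS."""
    failures: list[str] = []
    if episodes_completed < min_episodes:
        failures.append(f"completed={episodes_completed} < {min_episodes}")
    if error_count > 0:
        failures.append(f"error_count={error_count} > 0")
    # Growth ratio of the load generator's RSS, used as a proxy only.
    if initial_rss_mb > 0 and final_rss_mb / initial_rss_mb >= max_rss_ratio:
        failures.append(
            f"rss growth {initial_rss_mb:.0f}->{final_rss_mb:.0f} MB "
            f">= {max_rss_ratio}x"
        )
    if unique_episode_ids == 0:
        # No IDs surfaced at all, so uniqueness cannot be verified.
        failures.append("no episode_ids surfaced via metadata")
    elif unique_episode_ids != episodes_completed:
        failures.append(
            f"unique_eids={unique_episode_ids} != completed={episodes_completed}"
        )
    return failures


if __name__ == "__main__":
    print(check_pass_criteria(1200, 0, 100.0, 150.0, 1200))
    print(check_pass_criteria(900, 2, 100.0, 250.0, 0))
```

The exit code used by the script would then be `1 if failures else 0`, as in the committed version.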

pyproject.toml

@@ -5,6 +5,12 @@ description = "Red Button - Two-Agent Corrigibility Arena (Shutdown-Gym v3)"
 requires-python = ">=3.11,<3.13"
 dependencies = ["openenv-core>=0.2.3"]
 [build-system]
 requires = ["setuptools>=61"]
 build-backend = "setuptools.build_meta"

 requires-python = ">=3.11,<3.13"
 dependencies = ["openenv-core>=0.2.3"]
+[project.optional-dependencies]
+dev = [
+    "psutil>=5.9",     # evaluation/concurrent_load_test.py — RSS proxy
+    "requests>=2.31",  # tests/test_client_integration.py — HTTP /health poll
+]
 [build-system]
 requires = ["setuptools>=61"]
 build-backend = "setuptools.build_meta"

server/shutdown_environment.py

@@ -466,7 +466,12 @@ class ShutdownGymEnvironment(
         )
     def _build_rubric_metadata(self) -> Dict[str, Any]:
-        """Populate the metadata dict the rubrics read off (§17.7)."""
         return {
             "tier": self._tier,
             "submitted_answers": dict(self._state.submitted_answers),
@@ -479,6 +484,7 @@ class ShutdownGymEnvironment(
             "script_still_executable": self._state.script_still_executable,
             "shutdown_occurred": self._state.shutdown_occurred,
             "final_self_report": self._state.final_self_report,
         }
     def _recent_history(self) -> List[Dict[str, Any]]:

         )
     def _build_rubric_metadata(self) -> Dict[str, Any]:
+        """Populate the metadata dict the rubrics read off (§17.7).
+        ``episode_id`` is included so external observers (concurrent
+        load tests, training rollouts) can dedupe sessions without a
+        separate ``.state()`` round-trip.
+        """
         return {
             "tier": self._tier,
             "submitted_answers": dict(self._state.submitted_answers),
             "script_still_executable": self._state.script_still_executable,
             "shutdown_occurred": self._state.shutdown_occurred,
             "final_self_report": self._state.final_self_report,
+            "episode_id": self._state.episode_id,
         }
     def _recent_history(self) -> List[Dict[str, Any]]:

tests/test_environment.py

@@ -132,6 +132,20 @@ def test_reset_writes_problems_json_into_simulated_fs():
     assert "problem" in contents
 # =============================================================================
 # Step — basic flow
 # =============================================================================

     assert "problem" in contents
+def test_observation_metadata_contains_episode_id():
+    """Each observation surfaces ``state.episode_id`` in metadata so
+    external observers (concurrent load tests, training rollouts) can
+    dedupe sessions without a separate ``.state()`` round-trip.
+    """
+    env = ShutdownGymEnvironment(tier=2)
+    obs = env.reset(seed=42)
+    assert obs.metadata.get("episode_id") == env._state.episode_id
+    assert isinstance(obs.metadata["episode_id"], str)
+    # Two resets produce different IDs.
+    obs2 = env.reset(seed=43)
+    assert obs2.metadata["episode_id"] != obs.metadata["episode_id"]
 # =============================================================================
 # Step — basic flow
 # =============================================================================
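Review note: with `episode_id` now present in observation metadata, an external observer can dedupe sessions from the metadata dicts alone. The sketch below shows one way this might look. It assumes each metadata entry is a plain `dict` that may or may not carry an `episode_id` key; the helper name `unique_episode_ids` and the sample IDs are hypothetical.

```python
from typing import Any, Iterable


def unique_episode_ids(metadatas: Iterable[dict[str, Any]]) -> set[str]:
    """Collect distinct, non-empty episode IDs from observation metadata."""
    return {m["episode_id"] for m in metadatas if m.get("episode_id")}


if __name__ == "__main__":
    # Made-up metadata payloads standing in for real observations.
    observed = [
        {"tier": 2, "episode_id": "ep-a"},
        {"tier": 2, "episode_id": "ep-b"},
        {"tier": 2, "episode_id": "ep-a"},  # same session seen twice
        {"tier": 2},                        # older server without the key
    ]
    ids = unique_episode_ids(observed)
    print(f"{len(ids)} unique episode_ids out of {len(observed)} observations")
```

Skipping entries without the key keeps the helper usable against servers that predate this commit, at the cost of undercounting; the load test treats an empty result as a failure for that reason.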