ml-intern

Guillaume Salou committed on Apr 28

Commit

1d50c78

1 Parent(s): 21aa9ca

fix(sandbox): retry cleanup on transient failures + add backstop sweeper (#163)

* fix(sandbox): retry delete on transient failures + add backstop sweeper

The agent creates a sandbox Space per session (template duplicated from
burtenshaw/sandbox into the user's account). A scan on 2026-04-27 found
2,310 orphan sandbox-* Spaces — every sandbox ever created was still
around. Two failure modes were causing the leak:

1. The cleanup in `_run_session.finally` doesn't fire if the pod is
killed (OOM, deploy rollout, pre-emption) or if the WebSocket drops
without a /shutdown.
2. When `sandbox.delete` itself failed (HF API 5xx, transient network
blip, rate-limit), the previous code logged a warning and moved on.
No retry, no recovery — the Space lived forever.

This PR addresses both:

- `_cleanup_sandbox` now makes up to 3 attempts with exponential backoff
(1s, then 2s, between attempts). Closes the transient-failure path.

- New `scripts/sweep_orphan_sandboxes.py` is a standalone sweeper that
finds sandbox-* forks of burtenshaw/sandbox older than N days and
deletes them. Designed to run as a daily cron with a write-scoped
HF_ADMIN_TOKEN. Defaults to dry-run; --apply must be explicit.
--max-deletes caps damage if misconfigured.
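
The retry behavior in the first bullet can be sketched in isolation. This is a minimal illustration, not the code from this PR: `retry_with_backoff`, `fake_sleep`, and `flaky_delete` are hypothetical names, and the injected `sleep` parameter exists only so the demo runs without waiting.

```python
import asyncio
from typing import Awaitable, Callable, List


async def retry_with_backoff(
    op: Callable[[], Awaitable[None]],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> bool:
    """Run ``op`` up to ``attempts`` times; True on success, False if all fail.

    Waits base_delay * 2**n between attempts (1s, then 2s, for the defaults)
    and does not sleep after the final failed attempt.
    """
    for attempt in range(attempts):
        try:
            await op()
            return True
        except Exception:
            if attempt < attempts - 1:
                await sleep(base_delay * 2 ** attempt)
    return False


# Demo with a fake sleep so the example runs instantly.
delays: List[float] = []
calls = {"count": 0}


async def fake_sleep(seconds: float) -> None:
    delays.append(seconds)


async def flaky_delete() -> None:
    # Hypothetical operation: fails twice, then succeeds on the third call.
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated transient failure")


succeeded = asyncio.run(retry_with_backoff(flaky_delete, sleep=fake_sleep))
```

Three attempts produce only two waits, which is why the commit message lists two delays.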

Together these close the leak end-to-end: hot-path retry catches most
in-session failures, the cron is the safety net for everything else.
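
The sweeper's safety defaults (dry-run unless `--apply` is passed, plus a delete cap) could be wired up roughly as follows. This is a sketch using only `argparse`; `build_parser` is a hypothetical helper, and the default of 200 is taken from the script added in this commit.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sweep orphan sandbox Spaces (sketch)")
    # store_true defaults to False, so a bare invocation is a dry run.
    parser.add_argument(
        "--apply",
        action="store_true",
        help="Actually delete. Without this flag, dry-run only.",
    )
    parser.add_argument(
        "--max-deletes",
        type=int,
        default=200,
        help="Hard cap on deletions per run.",
    )
    return parser


default_args = build_parser().parse_args([])
explicit_args = build_parser().parse_args(["--apply", "--max-deletes", "50"])
```

Making the destructive mode opt-in means a misconfigured cron entry that drops the flag only prints candidates.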

* feat(sandbox): auto-clean user's stale orphans before creating new sandbox

Self-healing approach to the orphan sandbox leak: at session start, before
creating a new sandbox in the user's account, sweep their pre-existing
sandbox-* Spaces that haven't been modified in the last hour.

Why this matters: even with the retry on _cleanup_sandbox (previous commit),
the cleanup only fires when a session terminates cleanly. Pod kills, OOM,
deploy rollouts, WebSocket drops all skip the finally block, leaving
orphans. On the next session for the same user, this sweep catches them.

Why 1h staleness:
- A sandbox modified in the last hour might still be tied to a live
session in another tab/replica — deleting it would break that session.
- Anything older has no realistic chance of being active given typical
ml-intern session lengths.
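
The staleness rule can be illustrated on its own. This is a sketch under stated assumptions: `parse_last_modified` and `is_stale` are hypothetical helpers, timestamps are assumed to be timezone-aware or ISO-8601 strings, and an unknown timestamp is treated as stale because the diff in this commit does not skip such Spaces.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Union

STALE_AFTER = timedelta(hours=1)  # the 1h window described above


def parse_last_modified(value: Union[str, datetime, None]) -> Optional[datetime]:
    """Normalize an ISO-8601 string (possibly ending in 'Z') to an aware datetime."""
    if isinstance(value, datetime):
        return value
    if isinstance(value, str):
        try:
            return datetime.fromisoformat(value.replace("Z", "+00:00"))
        except ValueError:
            return None
    return None


def is_stale(value: Union[str, datetime, None], now: datetime) -> bool:
    last_mod = parse_last_modified(value)
    if last_mod is None:
        # Mirrors the diff: a Space with no usable timestamp is not skipped.
        return True
    return last_mod <= now - STALE_AFTER


now = datetime(2026, 4, 28, 12, 0, tzinfo=timezone.utc)
recent = is_stale("2026-04-28T11:30:00Z", now)      # modified 30 minutes ago
old = is_stale("2026-04-28T10:00:00.000Z", now)     # modified 2 hours ago
unknown = is_stale(None, now)
```

A more conservative variant would return False for unknown timestamps, at the cost of never cleaning up Spaces whose metadata is missing.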

Why naming-pattern filter (sandbox-<8hex>):
- That's exactly what Sandbox.create produces. Won't touch user-renamed
lookalikes or Spaces created by other tools.
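
A quick check of what the naming filter accepts and rejects, as a sketch. The regex is the one added in this commit; `is_agent_sandbox` and the Space IDs are made-up illustrations.

```python
import re

# Same pattern as _SANDBOX_NAME_RE in the diff: "sandbox-" plus 8 lowercase hex chars.
SANDBOX_NAME_RE = re.compile(r"^sandbox-[a-f0-9]{8}$")


def is_agent_sandbox(space_id: str) -> bool:
    """Match on the repo name only, ignoring the ``<owner>/`` prefix."""
    name = space_id.rsplit("/", 1)[-1]
    return bool(SANDBOX_NAME_RE.match(name))


results = {
    "alice/sandbox-1a2b3c4d": is_agent_sandbox("alice/sandbox-1a2b3c4d"),
    "alice/sandbox-1A2B3C4D": is_agent_sandbox("alice/sandbox-1A2B3C4D"),    # uppercase hex
    "alice/sandbox-demo": is_agent_sandbox("alice/sandbox-demo"),            # renamed by user
    "alice/my-sandbox-1a2b3c4d": is_agent_sandbox("alice/my-sandbox-1a2b3c4d"),
    "alice/sandbox-1a2b3c4d5": is_agent_sandbox("alice/sandbox-1a2b3c4d5"),  # 9 chars
}
```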

Why best-effort (try/except wraps the call):
- The agent must not fail to start a new session because the sweep API
call had a transient blip.
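
The best-effort pattern can be sketched as follows: run the blocking sweep in a worker thread and turn any exception into a log line. `best_effort_sweep` and `failing_sweep` are hypothetical names; `asyncio.to_thread` requires Python 3.9+.

```python
import asyncio
from typing import Callable, List


async def best_effort_sweep(
    blocking_sweep: Callable[[], int],
    log: Callable[[str], None],
) -> int:
    """Run a blocking sweep off the event loop; never raise to the caller."""
    try:
        return await asyncio.to_thread(blocking_sweep)
    except Exception as e:
        log(f"orphan sandbox sweep failed (non-fatal): {e}")
        return 0


messages: List[str] = []


def failing_sweep() -> int:
    # Stand-in for a sweep whose API call hits a transient network error.
    raise ConnectionError("transient blip")


deleted_count = asyncio.run(best_effort_sweep(failing_sweep, messages.append))
```

The caller continues to sandbox creation regardless of the outcome, which is the property the commit message asks for.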

Together with the retry on _cleanup_sandbox, this closes the leak end-to-end
without requiring a separate cron or admin token: each new session makes a
best-effort pass that deletes that user's stale sandboxes before creating a
new one.

The sweep script in scripts/sweep_orphan_sandboxes.py remains in the repo
as opt-in retroactive tooling but is no longer the primary defense.

* fix(sandbox): address PR bot feedback (P0 + 2 P1)

Three issues raised by the review bot on PR #163:

1. P0 — is_sandbox_fork was filtering on duplicated_from, but the HF REST
API does not surface that field on SpaceInfo (verified empirically:
/api/spaces/{id} returns duplicatedFrom: null even for confirmed forks
of burtenshaw/sandbox; SpaceCardData.duplicated_from is also None).
The origin lives in MongoDB but isn't exposed. Drop the broken filter
and rely on the naming pattern alone — Sandbox.create() is the sole
producer of <owner>/sandbox-<8 lowercase hex>, and the dry-run default
is the user-facing safety net for false positives.

2. P1 — Sandbox.delete() didn't reset _owns_space after a successful
delete. If _cleanup_sandbox is called twice (e.g. delete_session +
_run_session.finally both fire), the second call retries 3× on a 404
and emits a spurious ERROR log. Set _owns_space = False post-delete
so the second call early-returns cleanly.

3. P1 — sweeper's max_deletes break exited the entire scan, so matched
underreported the true orphan size on capped runs. Replace break with
continue + a skipped_capped counter; sweep_end now reports both the
accurate matched count and a capped flag, giving operators a real
denominator for multi-pass cleanups.
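
The effect of replacing `break` with `continue` in item 3 can be seen in a reduced version of the loop. This is a sketch, not the sweeper itself: `sweep` is a hypothetical function, every candidate is assumed to already match the filters, and `delete` is any callable.

```python
from typing import Callable, Dict, Iterable, List, Union


def sweep(
    candidates: Iterable[str],
    delete: Callable[[str], None],
    max_deletes: int,
) -> Dict[str, Union[int, bool]]:
    matched = deleted = skipped_capped = 0
    for space_id in candidates:
        matched += 1
        if deleted >= max_deletes:
            # Cap reached: stop deleting but keep counting, so ``matched``
            # still reflects the full orphan population.
            skipped_capped += 1
            continue
        delete(space_id)
        deleted += 1
    return {
        "matched": matched,
        "deleted": deleted,
        "skipped_capped": skipped_capped,
        "capped": skipped_capped > 0,
    }


removed: List[str] = []
summary = sweep(
    [f"user/sandbox-{i:08x}" for i in range(5)],
    removed.append,
    max_deletes=2,
)
```

With a `break`, `matched` would have stopped at 3 here; with `continue`, an operator can see that 3 orphans remain for the next pass.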

Files changed (4)

agent/tools/sandbox_client.py +4 -0
agent/tools/sandbox_tool.py +85 -0
backend/session_manager.py +21 -5
scripts/sweep_orphan_sandboxes.py +206 -0

agent/tools/sandbox_client.py CHANGED

@@ -724,6 +724,10 @@ class Sandbox:
             )
         print(f"Deleting sandbox: {self.space_id}...")
         self._hf_api.delete_repo(self.space_id, repo_type="space")
+        # Clear ownership so a second cleanup call (e.g. delete_session +
+        # _run_session.finally both fire) early-returns instead of retrying
+        # a 404 delete and emitting a spurious ERROR log.
+        self._owns_space = False
         self._client.close()
         print("Deleted.")
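
Why clearing the flag makes a second cleanup call harmless can be shown with a stand-in object. This is a sketch: `FakeSandbox` and `cleanup` are hypothetical, and the fake only counts calls where the real client would call `delete_repo`.

```python
class FakeSandbox:
    """Hypothetical stand-in for Sandbox; counts deletes instead of calling the Hub."""

    def __init__(self, space_id: str) -> None:
        self.space_id = space_id
        self._owns_space = True
        self.delete_calls = 0

    def delete(self) -> None:
        self.delete_calls += 1    # where the real client would call delete_repo
        self._owns_space = False  # the change in this hunk


def cleanup(sandbox) -> bool:
    """Same guard as _cleanup_sandbox; returns True if a delete was attempted."""
    if not (sandbox and getattr(sandbox, "_owns_space", False)):
        return False
    sandbox.delete()
    return True


sb = FakeSandbox("user/sandbox-0a1b2c3d")
first = cleanup(sb)   # e.g. triggered by delete_session
second = cleanup(sb)  # e.g. triggered by _run_session.finally
```

Without the flag reset, the second call would reach `delete` again and fail against a Space that no longer exists.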

agent/tools/sandbox_tool.py CHANGED

@@ -12,7 +12,10 @@ a cpu-basic sandbox is auto-created (no approval needed).
 from __future__ import annotations
 import asyncio
+import logging
+import re
 import threading
+from datetime import datetime, timedelta, timezone
 from typing import Any
 from huggingface_hub import HfApi, SpaceHardware
@@ -21,6 +24,18 @@ from agent.core.session import Event
 from agent.tools.sandbox_client import Sandbox
 from agent.tools.trackio_seed import ensure_trackio_dashboard
+logger = logging.getLogger(__name__)
+# Match the exact suffix pattern Sandbox.create produces: "sandbox-<8 hex>".
+# Used to identify orphan sandboxes from prior sessions safely (won't match
+# user-renamed lookalikes).
+_SANDBOX_NAME_RE = re.compile(r"^sandbox-[a-f0-9]{8}$")
+# How stale a sandbox must be before we treat it as definitely orphan.
+# Anything more recent could be tied to a still-live session in another tab,
+# so we leave it alone.
+_ORPHAN_STALE_AFTER = timedelta(hours=1)
 def _looks_like_path(script: str) -> bool:
     """Return True if the script string looks like a file path (not inline code)."""
@@ -88,6 +103,59 @@ async def _seed_trackio_dashboard_safe(session: Any, space_id: str) -> None:
 # ── Tool name mapping (short agent names → Sandbox client names) ──────
+def _cleanup_user_orphan_sandboxes(
+    api: HfApi,
+    owner: str,
+    log: Any,
+) -> int:
+    """Delete stale ``sandbox-<8hex>`` Spaces in ``owner``'s account.
+    "Stale" = not modified in the last hour. The naming pattern + staleness
+    filter together make this safe:
+    * Naming: only matches ``sandbox-<exactly 8 lowercase hex>``, the
+      pattern Sandbox.create produces. Won't touch user-renamed Spaces.
+    * Staleness: anything modified in the last hour might still be tied
+      to a live session in another tab/replica, so we leave it alone.
+    Runs blocking — call via ``asyncio.to_thread``. Best-effort: failures
+    are logged but never raised, so a flaky HF API never blocks creation.
+    """
+    cutoff = datetime.now(timezone.utc) - _ORPHAN_STALE_AFTER
+    deleted = 0
+    try:
+        spaces = list(api.list_spaces(author=owner, limit=200))
+    except Exception as e:
+        log(f"orphan sweep: list_spaces failed: {e}")
+        return 0
+    for space in spaces:
+        space_name = space.id.rsplit("/", 1)[-1]
+        if not _SANDBOX_NAME_RE.match(space_name):
+            continue
+        last_mod = getattr(space, "lastModified", None) or getattr(space, "last_modified", None)
+        if isinstance(last_mod, str):
+            try:
+                last_mod = datetime.fromisoformat(last_mod.replace("Z", "+00:00"))
+            except ValueError:
+                last_mod = None
+        if last_mod and last_mod > cutoff:
+            # Recent — could be a concurrent live session. Skip.
+            continue
+        try:
+            api.delete_repo(repo_id=space.id, repo_type="space")
+            deleted += 1
+            log(f"orphan sweep: deleted {space.id}")
+        except Exception as e:
+            log(f"orphan sweep: failed to delete {space.id}: {e}")
+    if deleted:
+        log(f"orphan sweep: cleaned up {deleted} stale sandbox(es) before create")
+    return deleted
 async def _ensure_sandbox(
     session: Any,
     hardware: str = "cpu-basic",
@@ -135,6 +203,23 @@ async def _ensure_sandbox(
             Event(event_type="tool_log", data={"tool": "sandbox", "log": msg}),
         )
+    # Before we create a new sandbox, sweep this user's stale sandboxes from
+    # prior sessions. ``_cleanup_sandbox`` in session_manager fires only on
+    # clean session exit; pod kills, WebSocket drops, etc. leave orphans
+    # behind, and they accumulate on every new session forever (observed
+    # 2310 leaked across the Hub on 2026-04-27). Doing the cleanup here at
+    # session start = self-healing, no separate cron needed.
+    #
+    # The 1h staleness filter is the safety: a sandbox modified in the last
+    # hour might still be tied to a live session in another tab, so we skip.
+    # Anything older has no realistic chance of being active given typical
+    # session lengths.
+    try:
+        await asyncio.to_thread(_cleanup_user_orphan_sandboxes, api, owner, _log)
+    except Exception as e:
+        # Cleanup is best-effort — never block sandbox_create on it.
+        _log(f"orphan sandbox sweep failed (non-fatal): {e}")
     # Bridge asyncio cancel event to a threading.Event for the blocking create call.
     # We poll session._cancelled from the main loop in a background task and set
     # a threading.Event that Sandbox.create checks during its polling loops.

backend/session_manager.py CHANGED

@@ -301,17 +301,33 @@ class SessionManager:
     @staticmethod
     async def _cleanup_sandbox(session: Session) -> None:
-        """Delete the sandbox Space if one was created for this session."""
+        """Delete the sandbox Space if one was created for this session.
+        Retries on transient failures (HF API 5xx, rate-limit, network blips)
+        with exponential backoff. A single missed delete = a permanently
+        orphaned Space, so the cost of an extra retry beats the alternative.
+        """
         sandbox = getattr(session, "sandbox", None)
-        if sandbox and getattr(sandbox, "_owns_space", False):
-            space_id = getattr(sandbox, "space_id", None)
+        if not (sandbox and getattr(sandbox, "_owns_space", False)):
+            return
+        space_id = getattr(sandbox, "space_id", None)
+        last_err: Exception | None = None
+        for attempt in range(3):
             try:
-                logger.info(f"Deleting sandbox {space_id}...")
+                logger.info(f"Deleting sandbox {space_id} (attempt {attempt + 1}/3)...")
                 await asyncio.to_thread(sandbox.delete)
                 from agent.core import telemetry
                 await telemetry.record_sandbox_destroy(session, sandbox)
+                return
             except Exception as e:
-                logger.warning(f"Failed to delete sandbox {space_id}: {e}")
+                last_err = e
+                if attempt < 2:
+                    await asyncio.sleep(2 ** attempt)
+        logger.error(
+            f"Failed to delete sandbox {space_id} after 3 attempts: {last_err}. "
+            f"Orphan — sweep script will pick it up."
+        )
     async def _run_session(
         self,

scripts/sweep_orphan_sandboxes.py ADDED

@@ -0,0 +1,206 @@
+#!/usr/bin/env python3
+"""Backstop sweeper for orphan ml-intern sandbox Spaces.
+================================================================================
+ Why this script exists
+================================================================================
+The agent creates a sandbox Space per session (template duplicated from
+``burtenshaw/sandbox`` into the user's account, named ``<owner>/sandbox-<8hex>``).
+``backend.session_manager.SessionManager._cleanup_sandbox`` deletes it at end of
+session. In practice the cleanup misses some sandboxes:
+- pod killed / OOM / pre-emption / deploy rollouts → ``finally`` block skipped
+- WebSocket dropped without ``/shutdown`` from the client
+- HF API transient failure on ``delete_repo`` (we retry now, but not infinitely)
+The result observed 2026-04-27 was 2,310 orphan ``sandbox-*`` Spaces — every
+sandbox ever created was still around. This script is the backstop: list every
+``sandbox-*`` Space matching the agent's naming pattern that hasn't been touched
+in N days and delete it.
+================================================================================
+ Identification rules
+================================================================================
+A Space is considered an orphan ml-intern sandbox iff ALL hold:
+1. Repo type = ``space``
+2. Name matches ``<owner>/sandbox-[a-f0-9]{8}$`` (the agent's naming convention)
+3. ``lastModified`` older than ``--max-age-days`` (default 7)
+The origin repo (``burtenshaw/sandbox``) is NOT checked: the HF API does not
+expose it on ``SpaceInfo``, so the naming pattern is the only identity filter.
+We DO NOT use the ``runtime.stage`` (sleeping/running) as a filter — a sandbox
+that has been sleeping for 7 days is just as orphan as a deleted one but uses
+no compute. The cleanup is about repo/storage hygiene, not about waking
+something up to kill it.
+================================================================================
+ Safety
+================================================================================
+- ``--dry-run`` (default) prints what would be deleted, deletes nothing.
+- ``--apply`` actually calls ``HfApi.delete_repo``.
+- Hard cap ``--max-deletes`` (default 200) so a misconfigured run can't nuke
+  thousands at once.
+- Requires a token with admin rights via ``HF_ADMIN_TOKEN`` env var (the only
+  way to delete a Space owned by another user).
+- Logs every action to stdout in JSON Lines for downstream auditing.
+================================================================================
+ Cron suggestion
+================================================================================
+GitHub Actions, daily at 04:00 UTC:
+    schedule:
+      - cron: "0 4 * * *"
+    env:
+      HF_ADMIN_TOKEN: ${{ secrets.HF_ADMIN_TOKEN }}
+    steps:
+      - run: python scripts/sweep_orphan_sandboxes.py --apply --max-age-days 7
+"""
+import argparse
+import json
+import os
+import re
+import sys
+import time
+from datetime import datetime, timedelta, timezone
+from huggingface_hub import HfApi
+from huggingface_hub.utils import HfHubHTTPError
+SANDBOX_NAME_RE = re.compile(r"^[^/]+/sandbox-[a-f0-9]{8}$")
+TEMPLATE_REPO = "burtenshaw/sandbox"
+def log(record: dict) -> None:
+    """JSON Lines log so downstream tooling can grep / parse."""
+    record["ts"] = datetime.now(timezone.utc).isoformat()
+    print(json.dumps(record), flush=True)
+def is_sandbox_fork(space) -> bool:
+    """Filter: matches the ml-intern sandbox naming pattern.
+    NOTE: We initially tried filtering on ``duplicated_from == burtenshaw/sandbox``
+    too, for extra safety. That doesn't work — the HF REST API does not expose
+    ``duplicated_from`` on ``SpaceInfo`` (verified against ``huggingface-hub``
+    1.11+ and direct ``GET /api/spaces/{id}``: the field is None). The origin
+    repo lives in MongoDB but isn't surfaced. So we rely on the naming pattern
+    alone, which is specific enough: ``Sandbox.create()`` is the sole producer
+    of ``<owner>/sandbox-<8 lowercase hex>``, and that pattern is unlikely to
+    collide with user-created Spaces in practice. The ``--dry-run`` default
+    is the user-facing safety net for the rare false-positive.
+    """
+    return bool(SANDBOX_NAME_RE.match(space.id))
+def main() -> int:
+    parser = argparse.ArgumentParser(description=__doc__.split("\n\n")[0])
+    parser.add_argument(
+        "--max-age-days",
+        type=int,
+        default=7,
+        help="Delete sandboxes whose lastModified is older than this many days (default: 7)",
+    )
+    parser.add_argument(
+        "--max-deletes",
+        type=int,
+        default=200,
+        help="Hard cap on deletions per run, safety guard (default: 200)",
+    )
+    parser.add_argument(
+        "--apply",
+        action="store_true",
+        help="Actually delete. Without this flag, dry-run only.",
+    )
+    parser.add_argument(
+        "--limit",
+        type=int,
+        default=10000,
+        help="Max number of candidate Spaces to scan via list_spaces (default: 10000)",
+    )
+    args = parser.parse_args()
+    token = os.environ.get("HF_ADMIN_TOKEN")
+    if not token:
+        log({"level": "error", "msg": "HF_ADMIN_TOKEN env var not set"})
+        return 1
+    api = HfApi(token=token)
+    cutoff = datetime.now(timezone.utc) - timedelta(days=args.max_age_days)
+    log({"level": "info", "msg": "sweep_start", "cutoff": cutoff.isoformat(),
+         "max_deletes": args.max_deletes, "apply": args.apply})
+    # ``list_spaces`` doesn't filter by name pattern — we scan and filter
+    # client-side. ``search="sandbox"`` narrows the network payload.
+    candidates = api.list_spaces(
+        search="sandbox", full=True, limit=args.limit
+    )
+    scanned = 0
+    matched = 0
+    deleted = 0
+    failed = 0
+    skipped_too_recent = 0
+    skipped_capped = 0
+    for space in candidates:
+        scanned += 1
+        if not is_sandbox_fork(space):
+            continue
+        matched += 1
+        last_mod = getattr(space, "lastModified", None) or getattr(space, "last_modified", None)
+        if isinstance(last_mod, str):
+            last_mod = datetime.fromisoformat(last_mod.replace("Z", "+00:00"))
+        if last_mod and last_mod > cutoff:
+            skipped_too_recent += 1
+            continue
+        log({"level": "info", "msg": "candidate", "space_id": space.id,
+             "last_modified": last_mod.isoformat() if last_mod else None})
+        if not args.apply:
+            continue
+        # When we hit the deletion cap, keep scanning so the final ``matched``
+        # count reflects the *true* orphan size — not just what was scanned
+        # before we stopped deleting. Operators planning multi-pass cleanups
+        # need an accurate denominator to know when they're done.
+        if deleted >= args.max_deletes:
+            skipped_capped += 1
+            continue
+        try:
+            api.delete_repo(repo_id=space.id, repo_type="space", token=token)
+            deleted += 1
+            log({"level": "info", "msg": "deleted", "space_id": space.id})
+            # Light throttle to avoid hitting HF API rate limits.
+            time.sleep(0.2)
+        except HfHubHTTPError as e:
+            failed += 1
+            log({"level": "error", "msg": "delete_failed", "space_id": space.id,
+                 "status": e.response.status_code, "error": str(e)[:200]})
+        except Exception as e:
+            failed += 1
+            log({"level": "error", "msg": "delete_failed", "space_id": space.id,
+                 "error": str(e)[:200]})
+    log({"level": "info", "msg": "sweep_end",
+         "scanned": scanned, "matched": matched,
+         "skipped_too_recent": skipped_too_recent,
+         "skipped_capped": skipped_capped,
+         "deleted": deleted, "failed": failed,
+         "capped": skipped_capped > 0,
+         "apply": args.apply})
+    return 0 if failed == 0 else 2
+if __name__ == "__main__":
+    sys.exit(main())
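
Because the sweeper writes one JSON object per line, its output can be audited with a few lines of code. The sketch below is illustrative only: `summarize` is a hypothetical helper, and the sample records are made up to follow the `msg` values used in the script above.

```python
import json
from collections import Counter
from typing import Iterable, List, Tuple


def summarize(lines: Iterable[str]) -> Tuple[Counter, List[str]]:
    """Count records per ``msg`` and collect the Space IDs whose delete failed."""
    counts: Counter = Counter()
    failed_ids: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        counts[record.get("msg")] += 1
        if record.get("msg") == "delete_failed":
            failed_ids.append(record.get("space_id"))
    return counts, failed_ids


# Made-up sample output in the sweeper's JSON Lines shape.
sample = [
    '{"level": "info", "msg": "sweep_start", "apply": true}',
    '{"level": "info", "msg": "deleted", "space_id": "user/sandbox-00000001"}',
    '{"level": "error", "msg": "delete_failed", "space_id": "user/sandbox-00000002", "status": 503}',
    '{"level": "info", "msg": "sweep_end", "deleted": 1, "failed": 1}',
]
counts, failed_ids = summarize(sample)
```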