Guillaume Salou commited on
Commit
1d50c78
·
unverified ·
1 Parent(s): 21aa9ca

fix(sandbox): retry cleanup on transient failures + add backstop sweeper (#163)

Browse files

* fix(sandbox): retry delete on transient failures + add backstop sweeper

The agent creates a sandbox Space per session (template duplicated from
burtenshaw/sandbox into the user's account). A scan on 2026-04-27 found
2,310 orphan sandbox-* Spaces — every sandbox ever created was still
around. Two failure modes were causing the leak:

1. The cleanup in `_run_session.finally` doesn't fire if the pod is
killed (OOM, deploy rollout, pre-emption) or if the WebSocket drops
without a /shutdown.
2. When `sandbox.delete` itself failed (HF API 5xx, transient network
blip, rate-limit), the previous code logged a warning and moved on.
No retry, no recovery — the Space lived forever.

This PR addresses both:

- `_cleanup_sandbox` now retries 3× with exponential backoff (1s, 2s).
Closes the transient-failure path.

- New `scripts/sweep_orphan_sandboxes.py` is a standalone sweeper that
finds sandbox-* forks of burtenshaw/sandbox older than N days and
deletes them. Designed to run as a daily cron with a write-scoped
HF_ADMIN_TOKEN. Defaults to dry-run; --apply must be explicit.
--max-deletes caps damage if misconfigured.

Together these close the leak end-to-end: hot-path retry catches most
in-session failures, the cron is the safety net for everything else.

* feat(sandbox): auto-clean user's stale orphans before creating new sandbox

Self-healing approach to the orphan sandbox leak: at session start, before
creating a new sandbox in the user's account, sweep their pre-existing
sandbox-* Spaces that haven't been modified in the last hour.

Why this matters: even with the retry on _cleanup_sandbox (previous commit),
the cleanup only fires when a session terminates cleanly. Pod kills, OOM,
deploy rollouts, WebSocket drops all skip the finally block, leaving
orphans. On the next session for the same user, this sweep catches them.

Why 1h staleness:
- A sandbox modified in the last hour might still be tied to a live
session in another tab/replica — deleting it would break that session.
- Anything older has no realistic chance of being active given typical
ml-intern session lengths.

Why naming-pattern filter (sandbox-<8hex>):
- That's exactly what Sandbox.create produces. Won't touch user-renamed
lookalikes or Spaces created by other tools.

Why best-effort (try/except wraps the call):
- The agent must not fail to start a new session because the sweep API
call had a transient blip.

Together with the retry on _cleanup_sandbox, this closes the leak end-to-end
without requiring a separate cron or admin token: each new session
guarantees a clean state for that user.

The sweep script in scripts/sweep_orphan_sandboxes.py remains in the repo
as opt-in retroactive tooling but is no longer the primary defense.

* fix(sandbox): address PR bot feedback (P0 + 2 P1)

Three issues raised by the review bot on PR #163:

1. P0 — is_sandbox_fork was filtering on duplicated_from, but the HF REST
API does not surface that field on SpaceInfo (verified empirically:
/api/spaces/{id} returns duplicatedFrom: null even for confirmed forks
of burtenshaw/sandbox; SpaceCardData.duplicated_from is also None).
The origin lives in MongoDB but isn't exposed. Drop the broken filter
and rely on the naming pattern alone — Sandbox.create() is the sole
producer of <owner>/sandbox-<8 lowercase hex>, and the dry-run default
is the user-facing safety net for false positives.

2. P1 — Sandbox.delete() didn't reset _owns_space after a successful
delete. If _cleanup_sandbox is called twice (e.g. delete_session +
_run_session.finally both fire), the second call retries 3× on a 404
and emits a spurious ERROR log. Set _owns_space = False post-delete
so the second call early-returns cleanly.

3. P1 — sweeper's max_deletes break exited the entire scan, so matched
underreported the true orphan size on capped runs. Replace break with
continue + a skipped_capped counter; sweep_end now reports both the
accurate matched count and a capped flag, giving operators a real
denominator for multi-pass cleanups.

agent/tools/sandbox_client.py CHANGED
@@ -724,6 +724,10 @@ class Sandbox:
724
  )
725
  print(f"Deleting sandbox: {self.space_id}...")
726
  self._hf_api.delete_repo(self.space_id, repo_type="space")
 
 
 
 
727
  self._client.close()
728
  print("Deleted.")
729
 
 
724
  )
725
  print(f"Deleting sandbox: {self.space_id}...")
726
  self._hf_api.delete_repo(self.space_id, repo_type="space")
727
+ # Clear ownership so a second cleanup call (e.g. delete_session +
728
+ # _run_session.finally both fire) early-returns instead of retrying
729
+ # a 404 delete and emitting a spurious ERROR log.
730
+ self._owns_space = False
731
  self._client.close()
732
  print("Deleted.")
733
 
agent/tools/sandbox_tool.py CHANGED
@@ -12,7 +12,10 @@ a cpu-basic sandbox is auto-created (no approval needed).
12
  from __future__ import annotations
13
 
14
  import asyncio
 
 
15
  import threading
 
16
  from typing import Any
17
 
18
  from huggingface_hub import HfApi, SpaceHardware
@@ -21,6 +24,18 @@ from agent.core.session import Event
21
  from agent.tools.sandbox_client import Sandbox
22
  from agent.tools.trackio_seed import ensure_trackio_dashboard
23
 
 
 
 
 
 
 
 
 
 
 
 
 
24
 
25
  def _looks_like_path(script: str) -> bool:
26
  """Return True if the script string looks like a file path (not inline code)."""
@@ -88,6 +103,59 @@ async def _seed_trackio_dashboard_safe(session: Any, space_id: str) -> None:
88
  # ── Tool name mapping (short agent names → Sandbox client names) ──────
89
 
90
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
91
  async def _ensure_sandbox(
92
  session: Any,
93
  hardware: str = "cpu-basic",
@@ -135,6 +203,23 @@ async def _ensure_sandbox(
135
  Event(event_type="tool_log", data={"tool": "sandbox", "log": msg}),
136
  )
137
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
138
  # Bridge asyncio cancel event to a threading.Event for the blocking create call.
139
  # We poll session._cancelled from the main loop in a background task and set
140
  # a threading.Event that Sandbox.create checks during its polling loops.
 
12
  from __future__ import annotations
13
 
14
  import asyncio
15
+ import logging
16
+ import re
17
  import threading
18
+ from datetime import datetime, timedelta, timezone
19
  from typing import Any
20
 
21
  from huggingface_hub import HfApi, SpaceHardware
 
24
  from agent.tools.sandbox_client import Sandbox
25
  from agent.tools.trackio_seed import ensure_trackio_dashboard
26
 
27
+ logger = logging.getLogger(__name__)
28
+
29
+ # Match the exact suffix pattern Sandbox.create produces: "sandbox-<8 hex>".
30
+ # Used to identify orphan sandboxes from prior sessions safely (won't match
31
+ # user-renamed lookalikes).
32
+ _SANDBOX_NAME_RE = re.compile(r"^sandbox-[a-f0-9]{8}$")
33
+
34
+ # How stale a sandbox must be before we treat it as definitely orphan.
35
+ # Anything more recent could be tied to a still-live session in another tab,
36
+ # so we leave it alone.
37
+ _ORPHAN_STALE_AFTER = timedelta(hours=1)
38
+
39
 
40
  def _looks_like_path(script: str) -> bool:
41
  """Return True if the script string looks like a file path (not inline code)."""
 
103
  # ── Tool name mapping (short agent names → Sandbox client names) ──────
104
 
105
 
106
+ def _cleanup_user_orphan_sandboxes(
107
+ api: HfApi,
108
+ owner: str,
109
+ log: Any,
110
+ ) -> int:
111
+ """Delete stale ``sandbox-<8hex>`` Spaces in ``owner``'s account.
112
+
113
+ "Stale" = not modified in the last hour. The naming pattern + staleness
114
+ filter together make this safe:
115
+
116
+ * Naming: only matches ``sandbox-<exactly 8 lowercase hex>``, the
117
+ pattern Sandbox.create produces. Won't touch user-renamed Spaces.
118
+ * Staleness: anything modified in the last hour might still be tied
119
+ to a live session in another tab/replica, so we leave it alone.
120
+
121
+ Runs blocking — call via ``asyncio.to_thread``. Best-effort: failures
122
+ are logged but never raised, so a flaky HF API never blocks creation.
123
+ """
124
+ cutoff = datetime.now(timezone.utc) - _ORPHAN_STALE_AFTER
125
+ deleted = 0
126
+ try:
127
+ spaces = list(api.list_spaces(author=owner, limit=200))
128
+ except Exception as e:
129
+ log(f"orphan sweep: list_spaces failed: {e}")
130
+ return 0
131
+
132
+ for space in spaces:
133
+ space_name = space.id.rsplit("/", 1)[-1]
134
+ if not _SANDBOX_NAME_RE.match(space_name):
135
+ continue
136
+
137
+ last_mod = getattr(space, "lastModified", None) or getattr(space, "last_modified", None)
138
+ if isinstance(last_mod, str):
139
+ try:
140
+ last_mod = datetime.fromisoformat(last_mod.replace("Z", "+00:00"))
141
+ except ValueError:
142
+ last_mod = None
143
+ if last_mod and last_mod > cutoff:
144
+ # Recent — could be a concurrent live session. Skip.
145
+ continue
146
+
147
+ try:
148
+ api.delete_repo(repo_id=space.id, repo_type="space")
149
+ deleted += 1
150
+ log(f"orphan sweep: deleted {space.id}")
151
+ except Exception as e:
152
+ log(f"orphan sweep: failed to delete {space.id}: {e}")
153
+
154
+ if deleted:
155
+ log(f"orphan sweep: cleaned up {deleted} stale sandbox(es) before create")
156
+ return deleted
157
+
158
+
159
  async def _ensure_sandbox(
160
  session: Any,
161
  hardware: str = "cpu-basic",
 
203
  Event(event_type="tool_log", data={"tool": "sandbox", "log": msg}),
204
  )
205
 
206
+ # Before we create a new sandbox, sweep this user's stale sandboxes from
207
+ # prior sessions. ``_cleanup_sandbox`` in session_manager fires only on
208
+ # clean session exit; pod kills, WebSocket drops, etc. leave orphans
209
+ # behind, and they accumulate on every new session forever (observed
210
+ # 2310 leaked across the Hub on 2026-04-27). Doing the cleanup here at
211
+ # session start = self-healing, no separate cron needed.
212
+ #
213
+ # The 1h staleness filter is the safety: a sandbox modified in the last
214
+ # hour might still be tied to a live session in another tab, so we skip.
215
+ # Anything older has no realistic chance of being active given typical
216
+ # session lengths.
217
+ try:
218
+ await asyncio.to_thread(_cleanup_user_orphan_sandboxes, api, owner, _log)
219
+ except Exception as e:
220
+ # Cleanup is best-effort — never block sandbox_create on it.
221
+ _log(f"orphan sandbox sweep failed (non-fatal): {e}")
222
+
223
  # Bridge asyncio cancel event to a threading.Event for the blocking create call.
224
  # We poll session._cancelled from the main loop in a background task and set
225
  # a threading.Event that Sandbox.create checks during its polling loops.
backend/session_manager.py CHANGED
@@ -301,17 +301,33 @@ class SessionManager:
301
 
302
  @staticmethod
303
  async def _cleanup_sandbox(session: Session) -> None:
304
- """Delete the sandbox Space if one was created for this session."""
 
 
 
 
 
305
  sandbox = getattr(session, "sandbox", None)
306
- if sandbox and getattr(sandbox, "_owns_space", False):
307
- space_id = getattr(sandbox, "space_id", None)
 
 
 
 
308
  try:
309
- logger.info(f"Deleting sandbox {space_id}...")
310
  await asyncio.to_thread(sandbox.delete)
311
  from agent.core import telemetry
312
  await telemetry.record_sandbox_destroy(session, sandbox)
 
313
  except Exception as e:
314
- logger.warning(f"Failed to delete sandbox {space_id}: {e}")
 
 
 
 
 
 
315
 
316
  async def _run_session(
317
  self,
 
301
 
302
  @staticmethod
303
  async def _cleanup_sandbox(session: Session) -> None:
304
+ """Delete the sandbox Space if one was created for this session.
305
+
306
+ Retries on transient failures (HF API 5xx, rate-limit, network blips)
307
+ with exponential backoff. A single missed delete = a permanently
308
+ orphaned Space, so the cost of an extra retry beats the alternative.
309
+ """
310
  sandbox = getattr(session, "sandbox", None)
311
+ if not (sandbox and getattr(sandbox, "_owns_space", False)):
312
+ return
313
+
314
+ space_id = getattr(sandbox, "space_id", None)
315
+ last_err: Exception | None = None
316
+ for attempt in range(3):
317
  try:
318
+ logger.info(f"Deleting sandbox {space_id} (attempt {attempt + 1}/3)...")
319
  await asyncio.to_thread(sandbox.delete)
320
  from agent.core import telemetry
321
  await telemetry.record_sandbox_destroy(session, sandbox)
322
+ return
323
  except Exception as e:
324
+ last_err = e
325
+ if attempt < 2:
326
+ await asyncio.sleep(2 ** attempt)
327
+ logger.error(
328
+ f"Failed to delete sandbox {space_id} after 3 attempts: {last_err}. "
329
+ f"Orphan — sweep script will pick it up."
330
+ )
331
 
332
  async def _run_session(
333
  self,
scripts/sweep_orphan_sandboxes.py ADDED
@@ -0,0 +1,206 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """Backstop sweeper for orphan ml-intern sandbox Spaces.
3
+
4
+ ================================================================================
5
+ Why this script exists
6
+ ================================================================================
7
+
8
+ The agent creates a sandbox Space per session (template duplicated from
9
+ ``burtenshaw/sandbox`` into the user's account, named ``<owner>/sandbox-<8hex>``).
10
+ ``backend.session_manager.SessionManager._cleanup_sandbox`` deletes it at end of
11
+ session. In practice the cleanup misses some sandboxes:
12
+
13
+ - pod killed / OOM / pre-emption / deploy rollouts → ``finally`` block skipped
14
+ - WebSocket dropped without ``/shutdown`` from the client
15
+ - HF API transient failure on ``delete_repo`` (we retry now, but not infinitely)
16
+
17
+ The result observed 2026-04-27 was 2,310 orphan ``sandbox-*`` Spaces — every
18
+ sandbox ever created was still around. This script is the backstop: list every
19
+ ``sandbox-*`` fork of ``burtenshaw/sandbox`` that hasn't been touched in N days
20
+ and delete it.
21
+
22
+ ================================================================================
23
+ Identification rules
24
+ ================================================================================
25
+
26
+ A Space is considered an orphan ml-intern sandbox iff ALL hold:
27
+
28
+ 1. Repo type = ``space``
29
+ 2. Name matches ``<owner>/sandbox-[a-f0-9]{8}$`` (the agent's naming convention)
30
+ 3. ``originRepo`` points at ``burtenshaw/sandbox`` (so we don't touch
31
+ user-renamed lookalikes)
32
+ 4. ``lastModified`` older than ``--max-age-days`` (default 7)
33
+
34
+ We DO NOT use the ``runtime.stage`` (sleeping/running) as a filter — a sandbox
35
+ that has been sleeping for 7 days is just as orphan as a deleted one but uses
36
+ no compute. The cleanup is about repo/storage hygiene, not about waking
37
+ something up to kill it.
38
+
39
+ ================================================================================
40
+ Safety
41
+ ================================================================================
42
+
43
+ - ``--dry-run`` (default) prints what would be deleted, deletes nothing.
44
+ - ``--apply`` actually calls ``HfApi.delete_repo``.
45
+ - Hard cap ``--max-deletes`` (default 200) so a misconfigured run can't nuke
46
+ thousands at once.
47
+ - Requires a token with admin rights via ``HF_ADMIN_TOKEN`` env var (the only
48
+ way to delete a Space owned by another user).
49
+ - Logs every action to stdout in JSON Lines for downstream auditing.
50
+
51
+ ================================================================================
52
+ Cron suggestion
53
+ ================================================================================
54
+
55
+ GitHub Actions, daily at 04:00 UTC:
56
+
57
+ schedule:
58
+ - cron: "0 4 * * *"
59
+ env:
60
+ HF_ADMIN_TOKEN: ${{ secrets.HF_ADMIN_TOKEN }}
61
+ steps:
62
+ - run: python scripts/sweep_orphan_sandboxes.py --apply --max-age-days 7
63
+ """
64
+
65
+ import argparse
66
+ import json
67
+ import os
68
+ import re
69
+ import sys
70
+ import time
71
+ from datetime import datetime, timedelta, timezone
72
+
73
+ from huggingface_hub import HfApi
74
+ from huggingface_hub.utils import HfHubHTTPError
75
+
76
+ SANDBOX_NAME_RE = re.compile(r"^[^/]+/sandbox-[a-f0-9]{8}$")
77
+ TEMPLATE_REPO = "burtenshaw/sandbox"
78
+
79
+
80
+ def log(record: dict) -> None:
81
+ """JSON Lines log so downstream tooling can grep / parse."""
82
+ record["ts"] = datetime.now(timezone.utc).isoformat()
83
+ print(json.dumps(record), flush=True)
84
+
85
+
86
+ def is_sandbox_fork(space) -> bool:
87
+ """Filter: matches the ml-intern sandbox naming pattern.
88
+
89
+ NOTE: We initially tried filtering on ``duplicated_from == burtenshaw/sandbox``
90
+ too, for extra safety. That doesn't work — the HF REST API does not expose
91
+ ``duplicated_from`` on ``SpaceInfo`` (verified against ``huggingface-hub``
92
+ 1.11+ and direct ``GET /api/spaces/{id}``: the field is None). The origin
93
+ repo lives in MongoDB but isn't surfaced. So we rely on the naming pattern
94
+ alone, which is specific enough: ``Sandbox.create()`` is the sole producer
95
+ of ``<owner>/sandbox-<8 lowercase hex>``, and that pattern is unlikely to
96
+ collide with user-created Spaces in practice. The ``--dry-run`` default
97
+ is the user-facing safety net for the rare false-positive.
98
+ """
99
+ return bool(SANDBOX_NAME_RE.match(space.id))
100
+
101
+
102
+ def main() -> int:
103
+ parser = argparse.ArgumentParser(description=__doc__.split("\n\n")[0])
104
+ parser.add_argument(
105
+ "--max-age-days",
106
+ type=int,
107
+ default=7,
108
+ help="Delete sandboxes whose lastModified is older than this many days (default: 7)",
109
+ )
110
+ parser.add_argument(
111
+ "--max-deletes",
112
+ type=int,
113
+ default=200,
114
+ help="Hard cap on deletions per run, safety guard (default: 200)",
115
+ )
116
+ parser.add_argument(
117
+ "--apply",
118
+ action="store_true",
119
+ help="Actually delete. Without this flag, dry-run only.",
120
+ )
121
+ parser.add_argument(
122
+ "--limit",
123
+ type=int,
124
+ default=10000,
125
+ help="Max number of candidate Spaces to scan via list_spaces (default: 10000)",
126
+ )
127
+ args = parser.parse_args()
128
+
129
+ token = os.environ.get("HF_ADMIN_TOKEN")
130
+ if not token:
131
+ log({"level": "error", "msg": "HF_ADMIN_TOKEN env var not set"})
132
+ return 1
133
+
134
+ api = HfApi(token=token)
135
+ cutoff = datetime.now(timezone.utc) - timedelta(days=args.max_age_days)
136
+ log({"level": "info", "msg": "sweep_start", "cutoff": cutoff.isoformat(),
137
+ "max_deletes": args.max_deletes, "apply": args.apply})
138
+
139
+ # ``list_spaces`` doesn't filter by name pattern — we scan and filter
140
+ # client-side. ``search="sandbox"`` narrows the network payload.
141
+ candidates = api.list_spaces(
142
+ search="sandbox", full=True, limit=args.limit
143
+ )
144
+
145
+ scanned = 0
146
+ matched = 0
147
+ deleted = 0
148
+ failed = 0
149
+ skipped_too_recent = 0
150
+ skipped_capped = 0
151
+
152
+ for space in candidates:
153
+ scanned += 1
154
+ if not is_sandbox_fork(space):
155
+ continue
156
+ matched += 1
157
+
158
+ last_mod = getattr(space, "lastModified", None) or getattr(space, "last_modified", None)
159
+ if isinstance(last_mod, str):
160
+ last_mod = datetime.fromisoformat(last_mod.replace("Z", "+00:00"))
161
+ if last_mod and last_mod > cutoff:
162
+ skipped_too_recent += 1
163
+ continue
164
+
165
+ log({"level": "info", "msg": "candidate", "space_id": space.id,
166
+ "last_modified": last_mod.isoformat() if last_mod else None})
167
+
168
+ if not args.apply:
169
+ continue
170
+
171
+ # When we hit the deletion cap, keep scanning so the final ``matched``
172
+ # count reflects the *true* orphan size — not just what was scanned
173
+ # before we stopped deleting. Operators planning multi-pass cleanups
174
+ # need an accurate denominator to know when they're done.
175
+ if deleted >= args.max_deletes:
176
+ skipped_capped += 1
177
+ continue
178
+
179
+ try:
180
+ api.delete_repo(repo_id=space.id, repo_type="space", token=token)
181
+ deleted += 1
182
+ log({"level": "info", "msg": "deleted", "space_id": space.id})
183
+ # Light throttle to avoid hitting HF API rate limits.
184
+ time.sleep(0.2)
185
+ except HfHubHTTPError as e:
186
+ failed += 1
187
+ log({"level": "error", "msg": "delete_failed", "space_id": space.id,
188
+ "status": e.response.status_code, "error": str(e)[:200]})
189
+ except Exception as e:
190
+ failed += 1
191
+ log({"level": "error", "msg": "delete_failed", "space_id": space.id,
192
+ "error": str(e)[:200]})
193
+
194
+ log({"level": "info", "msg": "sweep_end",
195
+ "scanned": scanned, "matched": matched,
196
+ "skipped_too_recent": skipped_too_recent,
197
+ "skipped_capped": skipped_capped,
198
+ "deleted": deleted, "failed": failed,
199
+ "capped": skipped_capped > 0,
200
+ "apply": args.apply})
201
+
202
+ return 0 if failed == 0 else 2
203
+
204
+
205
+ if __name__ == "__main__":
206
+ sys.exit(main())