Spaces: HuggingAI4Engineering / cadgenbench-leaderboard
Michael Rabinovich committed 9 days ago

Commit 0a2a2c6

1 Parent(s): 3237736

submit: stuck-pending sweep on boot (chunk 6)

Step 6 (E) chunk 6. A pending row whose worker died (Space
restart for a deploy, OOM, crash) has no one to flip it; without
this sweep it stays "pending" in the leaderboard forever (we
hit this twice during chunk-3..5 churn already).

On submit.py module import, spawn a daemon thread that:
1. Downloads results.jsonl (no lock; read-only).
2. Iterates rows; for each pending row whose `submitted_at`
   is older than 30 min, calls _flip_row_to_failed(sid,
   "evaluation interrupted by Space restart").
3. Logs the list of stuck ids so a deployer can see what was
   swept.

Threshold is 30 min: well above the real eval ceiling on
cpu-upgrade (~5 min), so a genuinely-still-running submission is
safe. Hub-fetch failure at boot is non-fatal (logged warning,
sweep skipped; next boot retries). Per-row flip failures are
caught + logged + skipped (the rest of the sweep continues).
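
As a rough illustration of the threshold check described above, the sketch below parses a `submitted_at` string and compares it against a cutoff 30 minutes before a reference time. The helper name `is_stuck` and the sample timestamps are assumptions made up for this example; the format string and threshold mirror the constants added in this commit.

```python
from datetime import datetime, timedelta, timezone

# Mirrors the constants added in this commit.
STUCK_PENDING_THRESHOLD_SECONDS = 30 * 60
SUBMITTED_AT_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def is_stuck(submitted_at: str, now: datetime) -> bool:
    """Hypothetical helper: True if the timestamp is older than the threshold."""
    ts = datetime.strptime(submitted_at, SUBMITTED_AT_FORMAT).replace(
        tzinfo=timezone.utc
    )
    cutoff = now - timedelta(seconds=STUCK_PENDING_THRESHOLD_SECONDS)
    return ts < cutoff


# Sample reference time (made up for the example).
now = datetime(2026, 5, 28, 8, 0, 15, tzinfo=timezone.utc)
print(is_stuck("2026-05-28T07:13:15Z", now))  # 47 min old -> True
print(is_stuck("2026-05-28T07:55:15Z", now))  # 5 min old  -> False
```

Because the comparison is strict (`ts < cutoff`), a row submitted exactly 30 minutes before the reference time is not treated as stuck.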

Opt-out via CADGENBENCH_DISABLE_BOOT_SWEEP=1 for test imports
that don't want the Hub round-trip.
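
A minimal sketch of how the opt-out behaves, assuming the gate is the exact string comparison shown in the diff (`os.getenv(BOOT_SWEEP_ENV) == "1"`). The function `boot_sweep_enabled` is a hypothetical stand-in written for this example, not part of submit.py.

```python
import os
from collections.abc import Mapping

BOOT_SWEEP_ENV = "CADGENBENCH_DISABLE_BOOT_SWEEP"


def boot_sweep_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Hypothetical mirror of the gate: only the exact value "1" disables."""
    return env.get(BOOT_SWEEP_ENV) != "1"


# In a test, the variable has to be set before submit.py is first imported,
# because the sweep is started as an import side effect, e.g.:
#   os.environ[BOOT_SWEEP_ENV] = "1"
#   import submit
print(boot_sweep_enabled({BOOT_SWEEP_ENV: "1"}))     # -> False
print(boot_sweep_enabled({BOOT_SWEEP_ENV: "true"}))  # -> True
print(boot_sweep_enabled({}))                        # -> True
```

Note that values such as "true" or "yes" do not disable the sweep under this comparison; only "1" does.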

Two test fixtures left over from chunk-3 race-with-rebuild
should flip on the next boot:
- dedup-verify-dan_..._20260528-071315 (~47 min old)
- dedup-verify-dan_..._20260528-071356 (~47 min old)

Closes the last bug-fix gap before chunk 7 (end-to-end smoke).

Files changed (1)

submit.py +109 -9

@@ -1,13 +1,17 @@
 """Submit-tab handler for the CADGenBench leaderboard Space.
-Step 6 (E) chunks 2 + 3 + 4: cheap-sync validation pipeline + pending-row
-write + zip upload + background-thread eval. The handler validates
-the upload, uploads the zip to ``submissions/<id>.zip``, appends a
-``status: pending`` row to ``results.jsonl`` (under a process-wide
-lock), spawns a daemon thread to run ``cadgenbench evaluate`` +
-``cadgenbench report single``, and returns immediately. The worker
-uploads ``reports/<id>.{html,json}`` and flips the row
-``pending -> completed`` (or ``failed`` with a ``failure_reason``).
+Step 6 (E) chunks 2 + 3 + 4 + 6: cheap-sync validation + pending-row
+write + zip upload + background-thread eval + boot-time stuck-pending
+sweep. The handler validates the upload, uploads the zip to
+``submissions/<id>.zip``, appends a ``status: pending`` row to
+``results.jsonl`` (under a process-wide lock), spawns a daemon thread
+to run ``cadgenbench evaluate`` + ``cadgenbench report single``, and
+returns immediately. The worker uploads ``reports/<id>.{html,json}``
+and flips the row ``pending -> completed`` (or ``failed`` with a
+``failure_reason``). At module import a one-shot daemon sweep flips
+any ``pending`` row whose ``submitted_at`` is older than 30 min to
+``failed`` with a "Space restart" reason, so rows stranded by a deploy
+/ OOM / crash don't sit pending forever.
 Validation gates, in order:
@@ -68,7 +72,7 @@ import sys
 import tempfile
 import threading
 import zipfile
-from datetime import datetime, timezone
+from datetime import datetime, timedelta, timezone
 from pathlib import Path
 from typing import Any
@@ -100,6 +104,10 @@ EVAL_TIMEOUT_SECONDS = 15 * 60
 REPORT_TIMEOUT_SECONDS = 2 * 60
 EVAL_WORKER_COUNT = "8"
 SHA256_BLOCK_SIZE = 64 * 1024
+STUCK_PENDING_THRESHOLD_SECONDS = 30 * 60
+SUBMITTED_AT_FORMAT = "%Y-%m-%dT%H:%M:%SZ"
+STUCK_PENDING_REASON = "evaluation interrupted by Space restart"
+BOOT_SWEEP_ENV = "CADGENBENCH_DISABLE_BOOT_SWEEP"
 # One HfApi client per process. HF_TOKEN is picked up from the env at
 # construction time and reused for every call.
@@ -726,3 +734,95 @@ def _flip_row_to_failed(submission_id: str, reason: str) -> None:
         submission_id,
         {"status": "failed", "failure_reason": reason},
     )
+# ---------------------------------------------------------------------------
+# Boot-time stuck-pending sweep
+# ---------------------------------------------------------------------------
+def _sweep_stuck_pending() -> None:
+    """Flip pending rows older than the threshold to failed.
+    A ``pending`` row whose worker died (Space restart, OOM, crash)
+    has no one to flip it; without this sweep it stays pending in
+    the leaderboard forever. The check is "submitted_at older than
+    30 min" - well above the real eval ceiling (~5 min on
+    cpu-upgrade), so any genuinely-still-running submission is safe.
+    Runs once per process at module-import time inside a daemon
+    thread so app boot doesn't block on the Hub read.
+    """
+    try:
+        body = _download_results_jsonl()
+    except Exception as e:  # noqa: BLE001 - Hub API surface is broad
+        logger.warning(
+            "Stuck-pending sweep skipped, Hub fetch failed (%s: %s)",
+            type(e).__name__, e,
+        )
+        return
+    cutoff = datetime.now(timezone.utc) - timedelta(
+        seconds=STUCK_PENDING_THRESHOLD_SECONDS
+    )
+    stuck_ids: list[str] = []
+    for line in body.splitlines():
+        if not line.strip():
+            continue
+        try:
+            row = json.loads(line)
+        except json.JSONDecodeError:
+            continue
+        if row.get("status") != "pending":
+            continue
+        sid = row.get("submission_id")
+        ts_str = row.get("submitted_at")
+        if not sid or not ts_str:
+            continue
+        try:
+            ts = datetime.strptime(ts_str, SUBMITTED_AT_FORMAT).replace(
+                tzinfo=timezone.utc
+            )
+        except ValueError:
+            logger.warning(
+                "Skipping unparseable submitted_at %r on row %s",
+                ts_str, sid,
+            )
+            continue
+        if ts < cutoff:
+            stuck_ids.append(sid)
+    if not stuck_ids:
+        logger.info("Stuck-pending sweep: nothing stale")
+        return
+    logger.warning(
+        "Stuck-pending sweep: flipping %d row(s) to failed: %s",
+        len(stuck_ids), stuck_ids,
+    )
+    for sid in stuck_ids:
+        try:
+            _flip_row_to_failed(sid, STUCK_PENDING_REASON)
+        except Exception as e:  # noqa: BLE001 - log + carry on per-row
+            logger.exception(
+                "Stuck-pending flip failed for %s (%s: %s)",
+                sid, type(e).__name__, e,
+            )
+def _start_boot_sweep() -> None:
+    """Spawn the sweep on a daemon thread at module import.
+    Setting ``CADGENBENCH_DISABLE_BOOT_SWEEP=1`` opts out (useful
+    for unit-test imports that don't want the Hub round-trip).
+    """
+    if os.getenv(BOOT_SWEEP_ENV) == "1":
+        logger.info("Stuck-pending sweep disabled via %s", BOOT_SWEEP_ENV)
+        return
+    threading.Thread(
+        target=_sweep_stuck_pending,
+        name="cgb-boot-sweep",
+        daemon=True,
+    ).start()
+_start_boot_sweep()
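
The row-selection part of the sweep can be exercised without the Hub if it is factored into a pure function. The sketch below is one way to do that, not code from this commit: `find_stuck_ids` and the sample rows are hypothetical, and the `isinstance` guard against non-object JSON lines is an addition of this sketch that the committed loop does not have.

```python
import json
from datetime import datetime, timedelta, timezone

SUBMITTED_AT_FORMAT = "%Y-%m-%dT%H:%M:%SZ"
THRESHOLD = timedelta(minutes=30)


def find_stuck_ids(body: str, now: datetime) -> list[str]:
    """Hypothetical pure version of the selection loop (no Hub access)."""
    cutoff = now - THRESHOLD
    stuck: list[str] = []
    for line in body.splitlines():
        if not line.strip():
            continue
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed line: skip, as the sweep does
        if not isinstance(row, dict) or row.get("status") != "pending":
            continue  # the isinstance guard is an addition in this sketch
        sid, ts_str = row.get("submission_id"), row.get("submitted_at")
        if not sid or not ts_str:
            continue
        try:
            ts = datetime.strptime(ts_str, SUBMITTED_AT_FORMAT)
        except (TypeError, ValueError):
            continue  # unparseable timestamp: skip the row
        if ts.replace(tzinfo=timezone.utc) < cutoff:
            stuck.append(sid)
    return stuck


# Made-up sample data: one stale row, one fresh row, one completed row,
# one malformed line, and one blank line.
now = datetime(2026, 5, 28, 8, 0, 0, tzinfo=timezone.utc)
body = "\n".join([
    json.dumps({"submission_id": "old", "status": "pending",
                "submitted_at": "2026-05-28T07:00:00Z"}),
    json.dumps({"submission_id": "fresh", "status": "pending",
                "submitted_at": "2026-05-28T07:58:00Z"}),
    json.dumps({"submission_id": "done", "status": "completed",
                "submitted_at": "2026-05-28T06:00:00Z"}),
    "not json",
    "",
])
print(find_stuck_ids(body, now))  # -> ['old']
```

Passing `now` in as a parameter keeps the function deterministic, so a test can pin the reference time instead of depending on the wall clock.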