HuggingAI4Engineering/cadgenbench-leaderboard
Michael Rabinovich committed 9 days ago

Commit 855bb8a (1 parent: 5aee3e5)

submit: background-thread eval, flip row pending -> completed / failed

Step 6 (E) chunk 4. On a successful submit, handle_submit now spawns
a daemon thread that runs `cadgenbench evaluate` then `cadgenbench
report single` over the unpacked submission, uploads
reports/<id>.{html,json} to the submissions dataset, reads
run_summary.json, and flips the row pending -> completed under the
existing _HUB_LOCK. Any worker-side exception flips the row to
failed with a short failure_reason (<=200 chars; full traceback
goes to Space logs). Tempdir cleanup always runs in finally.

Pipeline per submission, in the worker:
1. cadgenbench evaluate <run_dir> --workers 8
   (subprocess via `python -m cadgenbench.cli`; the eval CLI
   already fans out across fixtures with ProcessPoolExecutor and
   self-throttles to n_fixtures when smaller).
2. cadgenbench report single <run_dir> -o <tmp>/<id>.html
3. Build reports JSON: run_summary.json + each fixture's
   result.json bundled together (mirrors README's description
   of reports/<id>.json as the "machine-readable mirror").
4. Upload reports/<id>.html and reports/<id>.json.
5. Under _HUB_LOCK: merge aggregate_score, validity_rate,
   score_by_task_type, per_task_scores, per_fixture_scores from
   run_summary.json into the existing pending row; set status
   to "completed", refresh cadgenbench_data_revision.
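
Step 5's merge can be sketched as a pure function. This is a minimal sketch, assuming the pending row and the parsed run_summary.json are plain dicts; `merge_summary_into_row` and the sample values are hypothetical, and the data revision is passed in as an argument here rather than resolved from the Hub.

```python
from typing import Any

# Score fields copied from run_summary.json into the row (names as listed in step 5).
SCORE_FIELDS = (
    "aggregate_score",
    "validity_rate",
    "score_by_task_type",
    "per_task_scores",
    "per_fixture_scores",
)


def merge_summary_into_row(
    row: dict[str, Any], summary: dict[str, Any], data_revision: str
) -> dict[str, Any]:
    """Return a copy of *row* flipped to completed with the scores merged in."""
    merged = dict(row)
    for field in SCORE_FIELDS:
        # Keys missing from the summary become None, so the row schema stays stable.
        merged[field] = summary.get(field)
    merged["status"] = "completed"
    merged["cadgenbench_data_revision"] = data_revision
    return merged


# Illustrative values only; not real benchmark output.
pending = {"submission_id": "demo-1", "status": "pending", "aggregate_score": None}
summary = {"aggregate_score": 0.5, "validity_rate": 1.0}
done = merge_summary_into_row(pending, summary, "abc123")
```

Returning a copy keeps the sketch easy to test; the commit itself mutates the row in place inside the locked read-modify-write cycle.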

Subprocess timeouts: 15 min for eval (the eval ceiling is ~5 min on
cpu-upgrade, so this is generous), 2 min for report. subprocess.run
is called with check=False so we surface the exit code + stderr tail
in the RuntimeError, not the default CalledProcessError chain. env is
inherited so HF_TOKEN / CADGENBENCH_DATA_REPO /
CADGENBENCH_DATA_GT_REPO reach the cadgenbench subprocess.
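
The check=False pattern described above looks roughly like the following. It is a minimal sketch, not the commit's code: `run_or_raise` and the failing child command are hypothetical stand-ins, and `env` is left at its default, which makes the child inherit the parent's environment.

```python
import subprocess
import sys


def run_or_raise(cmd: list[str], timeout_s: float, label: str) -> None:
    """Run *cmd*; raise RuntimeError carrying the exit code and an output tail."""
    proc = subprocess.run(
        cmd,
        capture_output=True,
        text=True,
        timeout=timeout_s,  # raises subprocess.TimeoutExpired when exceeded
        check=False,  # inspect returncode ourselves instead of CalledProcessError
    )
    if proc.returncode != 0:
        # Keep only the last 500 characters so the message stays short.
        tail = (proc.stderr or proc.stdout or "")[-500:].strip()
        raise RuntimeError(f"{label} exited {proc.returncode}: {tail}")


# Stand-in child process that writes to stderr and exits with status 3.
failing = [sys.executable, "-c", "import sys; sys.stderr.write('boom'); sys.exit(3)"]
try:
    run_or_raise(failing, timeout_s=30, label="demo")
    message = ""
except RuntimeError as e:
    message = str(e)
```

A timeout is not converted here: subprocess.TimeoutExpired propagates to the caller, which in the worker is the broad exception handler that flips the row to failed.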

Tempdir lifecycle change: handle_submit was using a `with
TemporaryDirectory()` that auto-cleaned at function exit, which
would have nuked the unpacked submission before the worker had
a chance to read it. Switched to manual mkdtemp + try/finally;
ownership transfers to the worker on successful spawn (worker's
own finally does shutil.rmtree). On any pre-spawn rejection
(validation or Hub-write) the handler still cleans up.
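
The ownership hand-off can be illustrated with a small stand-alone sketch. The names `handle` and `worker`, the marker file, and the threading.Event used to wait for the worker are all hypothetical; only the mkdtemp + try/finally + "set tmp to None after spawn" shape is taken from the description above.

```python
import shutil
import tempfile
import threading
from pathlib import Path
from typing import Optional


def worker(tmp: Path, done: threading.Event) -> None:
    """Stand-in worker: it owns *tmp* and removes it in its own finally."""
    try:
        (tmp / "run" / "marker.txt").read_text(encoding="utf-8")
    finally:
        shutil.rmtree(tmp, ignore_errors=True)
        done.set()


def handle(reject: bool, done: threading.Event) -> Path:
    """Create a tempdir; clean it up on rejection, else hand it to a worker."""
    created = Path(tempfile.mkdtemp(prefix="demo-submit-"))
    tmp: Optional[Path] = created
    try:
        run_dir = created / "run"
        run_dir.mkdir()
        (run_dir / "marker.txt").write_text("ok", encoding="utf-8")
        if reject:
            return created  # the finally below removes the directory
        threading.Thread(target=worker, args=(created, done), daemon=True).start()
        tmp = None  # ownership transferred; skip cleanup below
    finally:
        if tmp is not None:
            shutil.rmtree(tmp, ignore_errors=True)
    return created


rejected_dir = handle(reject=True, done=threading.Event())
finished = threading.Event()
accepted_dir = handle(reject=False, done=finished)
finished.wait(timeout=10)
```

Exactly one party removes the directory on every path: the handler on rejection, the worker after a successful spawn.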

Refactor: _append_pending_row (chunk 3) and the new _update_row
(chunk 4) both do a lock-acquire + download + mutate + upload
of results.jsonl. Extracted _hub_rmw_results, which takes a
mutate(rows) callable: for append, mutate is rows.append(row);
for update, mutate finds the row by id and calls
r.update(updates). Same lock, same RMW cycle.

Failure modes covered:
- eval subprocess non-zero / timeout -> row "failed" + reason
- report subprocess non-zero / missing out -> row "failed" + reason
- report JSON build (missing run_summary) -> row "failed" + reason
- Hub upload of reports/<id>.{html,json} -> row "failed" + reason
- Final row-flip itself fails -> row stays "pending";
  chunk 6's stuck-pending sweep catches it on next Space boot.
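
The mapping from failures to row state can be sketched as below. This is an illustration under stated assumptions: `run_with_status`, the `flip` callback, and the two broken stand-ins are hypothetical; the 200-character cap and the "row stays pending if the flip itself fails" behaviour come from the list above.

```python
from typing import Callable, Optional

FAILURE_REASON_MAX_CHARS = 200  # cap stated in the commit message

Flip = Callable[[str, Optional[str]], None]


def run_with_status(pipeline: Callable[[], None], flip: Flip) -> str:
    """Run *pipeline* and report the resulting row status through *flip*."""
    try:
        pipeline()
        flip("completed", None)
        return "completed"
    except Exception as e:  # broad on purpose: every failure maps to row state
        reason = f"{type(e).__name__}: {e}"[:FAILURE_REASON_MAX_CHARS]
        try:
            flip("failed", reason)
            return "failed"
        except Exception:
            # The row was never touched, so it is still pending; a later
            # sweep has to pick it up.
            return "pending"


rows: dict[str, tuple[str, Optional[str]]] = {}


def record(status: str, reason: Optional[str]) -> None:
    rows["demo"] = (status, reason)


def broken_pipeline() -> None:
    raise RuntimeError("x" * 500)


def broken_flip(status: str, reason: Optional[str]) -> None:
    raise OSError("hub unreachable")


outcome_failed = run_with_status(broken_pipeline, record)
outcome_stuck = run_with_status(broken_pipeline, broken_flip)
```

Note that a failure while flipping to completed takes the same path as a pipeline failure: the handler then tries to flip the row to failed.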

"Queued." UI message updated to include the typical eval wall
clock ("2-5 minutes on this Space's cpu-upgrade tier") so the
user knows roughly when to come back. The pending row's
"evaluating..." cell tag from the earlier polish commit holds
the in-table progress signal until chunk 5 adds the every=10
auto-refresh.

Files changed (1)

submit.py +301 -64


@@ -1,11 +1,13 @@
 """Submit-tab handler for the CADGenBench leaderboard Space.
-Step 6 (E) chunks 2 + 3: cheap-sync validation pipeline + pending-row
-write + zip upload. The handler validates the upload, uploads the
-zip to ``submissions/<id>.zip``, appends a ``status: pending`` row to
-``results.jsonl`` (under a process-wide lock), and returns
-immediately. No eval and no worker yet, the row stays pending
-forever until later chunks add the background thread.
 Validation gates, in order:
@@ -31,16 +33,37 @@ Hub-write ordering (after validation passes):
 2. Build pending row (metadata + null scores + ``submission_blob_url``).
 3. Acquire ``_HUB_LOCK``; download current ``results.jsonl`` (or
    start empty); append the pending row; re-upload.
 If step 1 fails the user sees a clean rejection. If step 3 fails the
 zip is left orphaned in ``submissions/`` and the user sees a clean
 rejection; an orphan-zip sweep is a future-chunk concern.
 """
 from __future__ import annotations
 import json
 import logging
 import re
 import tempfile
 import threading
 import zipfile
@@ -69,7 +92,12 @@ REQUIRED_META_KEYS: tuple[str, ...] = (
 SUBMISSION_ID_SLUG_MAX = 40
 RESULTS_FILENAME = "results.jsonl"
 SUBMISSIONS_DIR = "submissions"
 DATA_REV_SHORT_LEN = 12
 # One HfApi client per process. HF_TOKEN is picked up from the env at
 # construction time and reused for every call.
@@ -109,33 +137,48 @@ def handle_submit(zip_file) -> str:
         return form_err
     zip_path = Path(zip_file.name)
-    with tempfile.TemporaryDirectory(prefix="cadgenbench-validate-") as tmp:
-        unpacked = Path(tmp) / "unpacked"
-        unpacked.mkdir()
         try:
-            _extract_zip(zip_path, unpacked)
-            meta = _load_and_validate_meta(unpacked)
-            fixture_names = _validate_fixture_set(unpacked)
-            _validate_steps_parseable(unpacked, fixture_names)
         except _ValidationError as e:
             return f"**Submission rejected.** {e}"
-    submission_id = _mint_submission_id(
-        meta["submitter_name"], meta["submission_name"]
-    )
-    try:
-        blob_url = _upload_submission_zip(submission_id, zip_path)
-        row = _build_pending_row(submission_id, meta, fixture_names, blob_url)
-        _append_pending_row(row)
-    except _HubWriteError as e:
-        return f"**Submission rejected.** {e}"
     return (
-        f"**Queued.** Submission `{submission_id}` has been accepted and a "
-        f"`pending` row added to the leaderboard (submitter: "
-        f"`{meta['submitter_name']}`, system: `{meta['submission_name']}`, "
-        f"{len(fixture_names)} fixtures). Evaluation will populate the "
-        f"score columns once the worker lands in a later chunk."
     )
@@ -358,47 +401,79 @@ def _build_pending_row(
 def _append_pending_row(row: dict[str, Any]) -> None:
-    """Append a pending row to ``results.jsonl`` on the Hub under the lock.
-    Read-modify-write: download the current file (or treat as empty if
-    it doesn't exist yet), append one line, re-upload. The lock is
-    held only for the duration of this cycle (~1-2s), not for any
-    background eval; concurrent submitters serialise here, not on the
-    eval pipeline.
     """
-    with _HUB_LOCK:
-        try:
-            existing = _download_results_jsonl()
-        except Exception as e:  # noqa: BLE001 - Hub API surface is broad
-            logger.exception("Failed to download results.jsonl for append")
-            raise _HubWriteError(
-                f"Server-side error reading the submissions table "
-                f"({type(e).__name__}: {e}). Please try again later."
-            ) from e
-        line = json.dumps(row, ensure_ascii=False)
-        new_body = existing + line + "\n" if existing else line + "\n"
-        try:
-            _HF_API.upload_file(
-                path_or_fileobj=new_body.encode("utf-8"),
-                path_in_repo=RESULTS_FILENAME,
-                repo_id=HF_SUBMISSIONS_REPO,
-                repo_type="dataset",
-                commit_message=(
-                    f"add pending row for {row['submission_id']}"
-                ),
-            )
-        except Exception as e:  # noqa: BLE001 - Hub API surface is broad
-            logger.exception(
-                "Failed to upload appended results.jsonl for %s",
-                row["submission_id"],
-            )
-            raise _HubWriteError(
-                f"Server-side error writing the submissions table "
-                f"({type(e).__name__}: {e}). The submission zip was "
-                f"uploaded but the row was not; please try again later."
-            ) from e
 def _download_results_jsonl() -> str:
@@ -436,3 +511,165 @@ def _resolve_data_revision() -> str:
         )
         _DATA_REVISION = "unknown"
     return _DATA_REVISION

 """Submit-tab handler for the CADGenBench leaderboard Space.
+Step 6 (E) chunks 2 + 3 + 4: cheap-sync validation pipeline + pending-row
+write + zip upload + background-thread eval. The handler validates
+the upload, uploads the zip to ``submissions/<id>.zip``, appends a
+``status: pending`` row to ``results.jsonl`` (under a process-wide
+lock), spawns a daemon thread to run ``cadgenbench evaluate`` +
+``cadgenbench report single``, and returns immediately. The worker
+uploads ``reports/<id>.{html,json}`` and flips the row
+``pending -> completed`` (or ``failed`` with a ``failure_reason``).
 Validation gates, in order:
 2. Build pending row (metadata + null scores + ``submission_blob_url``).
 3. Acquire ``_HUB_LOCK``; download current ``results.jsonl`` (or
    start empty); append the pending row; re-upload.
+4. Spawn worker thread (daemon, named after submission_id). The
+   worker owns the tempdir's lifecycle past this point.
 If step 1 fails the user sees a clean rejection. If step 3 fails the
 zip is left orphaned in ``submissions/`` and the user sees a clean
 rejection; an orphan-zip sweep is a future-chunk concern.
+Background worker, per submission:
+1. ``cadgenbench evaluate <run_dir>`` (subprocess; runs
+   per-fixture eval in parallel via the CLI's ProcessPoolExecutor;
+   writes ``run_summary.json`` at the run-dir root).
+2. ``cadgenbench report single <run_dir> -o <report.html>``
+   (subprocess; self-contained HTML with embedded renders).
+3. Upload ``reports/<id>.html`` + ``reports/<id>.json``. The JSON
+   bundles ``run_summary.json`` + every per-fixture ``result.json``.
+4. Read ``run_summary.json``; under ``_HUB_LOCK`` flip the row's
+   ``status`` to ``"completed"`` and merge the score fields.
+5. On any worker-side exception, flip the row to ``"failed"`` with
+   a short ``failure_reason``. Tempdir cleanup runs in ``finally``
+   either way.
 """
 from __future__ import annotations
 import json
 import logging
+import os
 import re
+import shutil
+import subprocess
+import sys
 import tempfile
 import threading
 import zipfile
 SUBMISSION_ID_SLUG_MAX = 40
 RESULTS_FILENAME = "results.jsonl"
 SUBMISSIONS_DIR = "submissions"
+REPORTS_DIR = "reports"
 DATA_REV_SHORT_LEN = 12
+FAILURE_REASON_MAX_CHARS = 200
+EVAL_TIMEOUT_SECONDS = 15 * 60
+REPORT_TIMEOUT_SECONDS = 2 * 60
+EVAL_WORKER_COUNT = "8"
 # One HfApi client per process. HF_TOKEN is picked up from the env at
 # construction time and reused for every call.
         return form_err
     zip_path = Path(zip_file.name)
+    # Manual tempdir lifecycle: cleaned up here on any rejection, but
+    # ownership passes to the worker on a successful spawn (the worker
+    # cleans up in its own finally). TemporaryDirectory's context
+    # manager doesn't fit because the dir has to outlive this function.
+    tmp = Path(tempfile.mkdtemp(prefix="cadgenbench-submit-"))
+    run_dir = tmp / "run"
+    run_dir.mkdir()
+    try:
         try:
+            _extract_zip(zip_path, run_dir)
+            meta = _load_and_validate_meta(run_dir)
+            fixture_names = _validate_fixture_set(run_dir)
+            _validate_steps_parseable(run_dir, fixture_names)
         except _ValidationError as e:
             return f"**Submission rejected.** {e}"
+        submission_id = _mint_submission_id(
+            meta["submitter_name"], meta["submission_name"]
+        )
+        try:
+            blob_url = _upload_submission_zip(submission_id, zip_path)
+            row = _build_pending_row(
+                submission_id, meta, fixture_names, blob_url
+            )
+            _append_pending_row(row)
+        except _HubWriteError as e:
+            return f"**Submission rejected.** {e}"
+        _spawn_worker(submission_id, tmp, run_dir)
+        tmp = None  # ownership transferred; skip cleanup below
+    finally:
+        if tmp is not None:
+            shutil.rmtree(tmp, ignore_errors=True)
     return (
+        f"**Queued.** Submission `{submission_id}` has been accepted "
+        f"(submitter: `{meta['submitter_name']}`, system: "
+        f"`{meta['submission_name']}`, {len(fixture_names)} fixtures). "
+        f"Evaluation typically takes 2-5 minutes on this Space's "
+        f"`cpu-upgrade` tier; the row flips to `completed` with score "
+        f"columns populated when the worker finishes."
     )
 def _append_pending_row(row: dict[str, Any]) -> None:
+    """Append a pending row to ``results.jsonl`` on the Hub under the lock."""
+    submission_id = row["submission_id"]
+    def mutate(rows: list[dict[str, Any]]) -> None:
+        rows.append(row)
+    try:
+        _hub_rmw_results(
+            mutate, commit_message=f"add pending row for {submission_id}"
+        )
+    except Exception as e:  # noqa: BLE001 - Hub API surface is broad
+        logger.exception(
+            "Failed RMW of results.jsonl while appending pending row for %s",
+            submission_id,
+        )
+        raise _HubWriteError(
+            f"Server-side error writing the submissions table "
+            f"({type(e).__name__}: {e}). The submission zip was uploaded "
+            f"but the row was not; please try again later."
+        ) from e
+def _update_row(submission_id: str, updates: dict[str, Any]) -> None:
+    """Find the row for *submission_id* and merge *updates* into it.
+    Raises ``LookupError`` if no row with that id exists (worker fired
+    before the pending row was committed, which shouldn't happen, but
+    surfaces clearly if it ever does).
     """
+    def mutate(rows: list[dict[str, Any]]) -> None:
+        for r in rows:
+            if r.get("submission_id") == submission_id:
+                r.update(updates)
+                return
+        raise LookupError(
+            f"No row with submission_id={submission_id!r} in results.jsonl."
+        )
+    _hub_rmw_results(
+        mutate,
+        commit_message=(
+            f"flip row for {submission_id} -> {updates.get('status', '?')}"
+        ),
+    )
+def _hub_rmw_results(
+    mutate, *, commit_message: str,
+) -> None:
+    """Lock + download + mutate + upload of ``results.jsonl``.
+    The lock is held only for the read-modify-write cycle (~1-2s),
+    never for eval time. Concurrent submitters serialise here, not
+    in the eval pipeline. Treats a missing file as the empty list.
+    """
+    with _HUB_LOCK:
+        existing = _download_results_jsonl()
+        rows: list[dict[str, Any]] = [
+            json.loads(line) for line in existing.splitlines() if line.strip()
+        ]
+        mutate(rows)
+        new_body = (
+            "\n".join(json.dumps(r, ensure_ascii=False) for r in rows) + "\n"
+            if rows
+            else ""
+        )
+        _HF_API.upload_file(
+            path_or_fileobj=new_body.encode("utf-8"),
+            path_in_repo=RESULTS_FILENAME,
+            repo_id=HF_SUBMISSIONS_REPO,
+            repo_type="dataset",
+            commit_message=commit_message,
+        )
 def _download_results_jsonl() -> str:
         )
         _DATA_REVISION = "unknown"
     return _DATA_REVISION
+# ---------------------------------------------------------------------------
+# Background worker (eval + report + row flip)
+# ---------------------------------------------------------------------------
+def _spawn_worker(submission_id: str, tmp: Path, run_dir: Path) -> None:
+    """Start the eval worker thread. Fire-and-forget; daemon=True so a
+    Space restart doesn't block on in-flight workers (chunk 6's
+    boot-time sweep flips any rows their workers didn't finish to
+    failed).
+    """
+    t = threading.Thread(
+        target=_run_worker,
+        args=(submission_id, tmp, run_dir),
+        name=f"cgb-worker-{submission_id}",
+        daemon=True,
+    )
+    t.start()
+def _run_worker(submission_id: str, tmp: Path, run_dir: Path) -> None:
+    """Top-level worker entry: run eval, build + upload reports, flip row.
+    Any exception in the pipeline flips the row to ``failed`` with a
+    short ``failure_reason`` (full traceback goes to the Space's
+    runtime logs). The tempdir is always cleaned up.
+    """
+    try:
+        try:
+            _run_eval(run_dir)
+            report_html = tmp / f"{submission_id}.html"
+            _run_report(run_dir, report_html)
+            report_json = _build_report_json(run_dir)
+            _upload_reports(submission_id, report_html, report_json)
+            summary = json.loads(
+                (run_dir / "run_summary.json").read_text(encoding="utf-8")
+            )
+            _flip_row_to_completed(submission_id, summary)
+            logger.info("Worker completed for %s", submission_id)
+        except Exception as e:  # noqa: BLE001 - broad on purpose; we map to row state
+            logger.exception("Worker failed for %s", submission_id)
+            reason = f"{type(e).__name__}: {str(e)}"[:FAILURE_REASON_MAX_CHARS]
+            try:
+                _flip_row_to_failed(submission_id, reason)
+            except Exception:
+                # If even the row-flip fails, the row stays pending.
+                # Chunk 6's stuck-pending sweep will catch it on the
+                # next Space boot.
+                logger.exception(
+                    "Failed to flip row to failed for %s; row stays pending",
+                    submission_id,
+                )
+    finally:
+        shutil.rmtree(tmp, ignore_errors=True)
+def _run_eval(run_dir: Path) -> None:
+    """Invoke ``cadgenbench evaluate`` over the run_dir; raise on non-zero."""
+    cmd = [
+        sys.executable, "-m", "cadgenbench.cli", "evaluate", str(run_dir),
+        "--workers", EVAL_WORKER_COUNT,
+    ]
+    logger.info("Running eval: %s", " ".join(cmd))
+    proc = subprocess.run(
+        cmd,
+        capture_output=True,
+        text=True,
+        timeout=EVAL_TIMEOUT_SECONDS,
+        env=os.environ.copy(),
+        check=False,
+    )
+    if proc.returncode != 0:
+        # Surface a short tail of stderr; full output is in Space logs above.
+        tail = (proc.stderr or proc.stdout or "")[-500:].strip()
+        raise RuntimeError(
+            f"cadgenbench evaluate exited {proc.returncode}: {tail}"
+        )
+def _run_report(run_dir: Path, html_out: Path) -> None:
+    """Invoke ``cadgenbench report single`` for the run_dir; raise on non-zero."""
+    cmd = [
+        sys.executable, "-m", "cadgenbench.cli", "report", "single",
+        str(run_dir), "-o", str(html_out),
+    ]
+    logger.info("Running report: %s", " ".join(cmd))
+    proc = subprocess.run(
+        cmd,
+        capture_output=True,
+        text=True,
+        timeout=REPORT_TIMEOUT_SECONDS,
+        env=os.environ.copy(),
+        check=False,
+    )
+    if proc.returncode != 0 or not html_out.is_file():
+        tail = (proc.stderr or proc.stdout or "")[-500:].strip()
+        raise RuntimeError(
+            f"cadgenbench report single exited {proc.returncode}: {tail}"
+        )
+def _build_report_json(run_dir: Path) -> dict[str, Any]:
+    """Bundle ``run_summary.json`` + every per-fixture ``result.json``."""
+    summary_path = run_dir / "run_summary.json"
+    if not summary_path.is_file():
+        raise RuntimeError(
+            f"run_summary.json not produced under {run_dir} (eval issue?)"
+        )
+    summary = json.loads(summary_path.read_text(encoding="utf-8"))
+    per_fixture: dict[str, dict[str, Any]] = {}
+    for fixture_dir in sorted(d for d in run_dir.iterdir() if d.is_dir()):
+        rp = fixture_dir / "result.json"
+        if rp.is_file():
+            per_fixture[fixture_dir.name] = json.loads(
+                rp.read_text(encoding="utf-8")
+            )
+    return {"run_summary": summary, "per_fixture_results": per_fixture}
+def _upload_reports(
+    submission_id: str, html_path: Path, report_json: dict[str, Any],
+) -> None:
+    """Upload ``reports/<id>.html`` and ``reports/<id>.json`` to the Hub."""
+    _HF_API.upload_file(
+        path_or_fileobj=str(html_path),
+        path_in_repo=f"{REPORTS_DIR}/{submission_id}.html",
+        repo_id=HF_SUBMISSIONS_REPO,
+        repo_type="dataset",
+        commit_message=f"add HTML report for {submission_id}",
+    )
+    _HF_API.upload_file(
+        path_or_fileobj=json.dumps(report_json, ensure_ascii=False, indent=2).encode("utf-8"),
+        path_in_repo=f"{REPORTS_DIR}/{submission_id}.json",
+        repo_id=HF_SUBMISSIONS_REPO,
+        repo_type="dataset",
+        commit_message=f"add JSON report for {submission_id}",
+    )
+def _flip_row_to_completed(submission_id: str, summary: dict[str, Any]) -> None:
+    """Merge ``run_summary.json`` fields into the pending row."""
+    updates: dict[str, Any] = {
+        "status": "completed",
+        "failure_reason": None,
+        "cadgenbench_data_revision": _resolve_data_revision(),
+        "aggregate_score": summary.get("aggregate_score"),
+        "validity_rate": summary.get("validity_rate"),
+        "score_by_task_type": summary.get("score_by_task_type"),
+        "per_task_scores": summary.get("per_task_scores"),
+        "per_fixture_scores": summary.get("per_fixture_scores"),
+    }
+    _update_row(submission_id, updates)
+def _flip_row_to_failed(submission_id: str, reason: str) -> None:
+    """Mark the row as ``failed`` with a short reason; scores stay null."""
+    _update_row(
+        submission_id,
+        {"status": "failed", "failure_reason": reason},
+    )