submit: background-thread eval, flip row pending -> completed / failed

Step 6 (E) chunk 4. On a successful submit, handle_submit now spawns
a daemon thread that runs `cadgenbench evaluate` then `cadgenbench
report single` over the unpacked submission, uploads
reports/<id>.{html,json} to the submissions dataset, reads
run_summary.json, and flips the row pending -> completed under the
existing _HUB_LOCK. Any worker-side exception flips the row to
failed with a short failure_reason (<=200 chars; full traceback
goes to Space logs). Tempdir cleanup always runs in finally.

Pipeline per submission, in the worker:
  1. cadgenbench evaluate <run_dir> --workers 8
     (subprocess via `python -m cadgenbench.cli`; the eval CLI
     already fans out across fixtures with ProcessPoolExecutor and
     self-throttles to n_fixtures when that is smaller than --workers).
  2. cadgenbench report single <run_dir> -o <tmp>/<id>.html
  3. Build reports JSON: run_summary.json + each fixture's
     result.json bundled together (mirrors README's description
     of reports/<id>.json as the "machine-readable mirror").
  4. Upload reports/<id>.html and reports/<id>.json.
  5. Under _HUB_LOCK: merge aggregate_score, validity_rate,
     score_by_task_type, per_task_scores, per_fixture_scores from
     run_summary.json into the existing pending row; set status
     to "completed", refresh cadgenbench_data_revision.
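
A minimal sketch of step 3, assuming the run directory holds run_summary.json at its root and one subdirectory per fixture with a result.json inside. The fixture names and score values are made up for illustration; the two top-level keys follow the ones used by this commit's _build_report_json.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def bundle_reports(run_dir: Path) -> dict[str, Any]:
    """Bundle run_summary.json with every per-fixture result.json."""
    summary = json.loads((run_dir / "run_summary.json").read_text(encoding="utf-8"))
    per_fixture: dict[str, Any] = {}
    # Sorted so the bundle is deterministic regardless of directory order.
    for fixture_dir in sorted(p for p in run_dir.iterdir() if p.is_dir()):
        result_path = fixture_dir / "result.json"
        if result_path.is_file():
            per_fixture[fixture_dir.name] = json.loads(
                result_path.read_text(encoding="utf-8")
            )
    return {"run_summary": summary, "per_fixture_results": per_fixture}


# Hypothetical run directory with two fixtures, built in a throwaway tempdir.
with tempfile.TemporaryDirectory() as td:
    run_dir = Path(td)
    (run_dir / "run_summary.json").write_text(
        json.dumps({"aggregate_score": 0.75}), encoding="utf-8"
    )
    for name, score in [("fixture_b", 0.5), ("fixture_a", 1.0)]:
        (run_dir / name).mkdir()
        (run_dir / name / "result.json").write_text(
            json.dumps({"score": score}), encoding="utf-8"
        )
    bundle = bundle_reports(run_dir)
```

Fixtures without a result.json are skipped rather than treated as errors, which matches the behaviour shown in the diff.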

Subprocess timeouts: 15 min for eval (generous; the eval ceiling is
~5 min on cpu-upgrade), 2 min for report. subprocess.run is called
with check=False so we surface the exit code + stderr tail in the
RuntimeError, not the default CalledProcessError chain. env is
inherited so HF_TOKEN / CADGENBENCH_DATA_REPO /
CADGENBENCH_DATA_GT_REPO reach the cadgenbench subprocess.
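
A minimal sketch of the check=False pattern described above. run_checked is a hypothetical helper name, and the failing child is a stand-in for the real CLI; env is left at its default so the child inherits the parent's environment.

```python
import subprocess
import sys


def run_checked(cmd: list[str], timeout_s: float, tail_chars: int = 500) -> str:
    """Run cmd; on non-zero exit raise RuntimeError carrying the output tail."""
    proc = subprocess.run(
        cmd,
        capture_output=True,
        text=True,
        timeout=timeout_s,  # raises subprocess.TimeoutExpired when exceeded
        check=False,        # build our own error instead of CalledProcessError
    )
    if proc.returncode != 0:
        # Prefer stderr, fall back to stdout, and keep only the last few chars.
        tail = (proc.stderr or proc.stdout or "")[-tail_chars:].strip()
        raise RuntimeError(f"{cmd[0]} exited {proc.returncode}: {tail}")
    return proc.stdout


# Hypothetical stand-in for the real CLI: a child that fails with exit code 3.
failing = [sys.executable, "-c", "import sys; sys.stderr.write('boom'); sys.exit(3)"]
try:
    run_checked(failing, timeout_s=30)
    message = ""
except RuntimeError as e:
    message = str(e)
```

A timeout is not converted here; subprocess.TimeoutExpired propagates to the caller, which is where a broad except can map it to a row state.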

Tempdir lifecycle change: handle_submit was using a `with
TemporaryDirectory()` that auto-cleaned at function exit, which
would have nuked the unpacked submission before the worker had
a chance to read it. Switched to manual mkdtemp + try/finally;
ownership transfers to the worker on successful spawn (worker's
own finally does shutil.rmtree). On any pre-spawn rejection
(validation or Hub-write) the handler still cleans up.
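
A minimal sketch of the ownership handoff, assuming a trivial worker. handle and worker are hypothetical names; the real handler does validation and Hub writes where this sketch only checks a flag.

```python
import shutil
import tempfile
import threading
from pathlib import Path
from typing import Optional


def worker(tmp: Path) -> None:
    """Hypothetical worker: uses the directory, then always removes it."""
    try:
        (tmp / "marker.txt").write_text("evaluated", encoding="utf-8")
    finally:
        shutil.rmtree(tmp, ignore_errors=True)


def handle(accept: bool) -> tuple[Path, Optional[threading.Thread]]:
    """Create a tempdir; hand it to a worker thread only if accepted."""
    created = Path(tempfile.mkdtemp(prefix="sketch-"))
    tmp: Optional[Path] = created
    thread: Optional[threading.Thread] = None
    try:
        if not accept:
            return created, None  # rejection: the finally below cleans up
        thread = threading.Thread(target=worker, args=(created,), daemon=True)
        thread.start()
        tmp = None  # ownership transferred; skip cleanup below
    finally:
        if tmp is not None:
            shutil.rmtree(tmp, ignore_errors=True)
    return created, thread
```

Setting tmp to None only after start() returns means a failure to start the thread still leaves cleanup with the handler.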

Refactor: _append_pending_row (chunk 3) and the new _update_row
(chunk 4) both do a lock-acquire + download + mutate + upload
of results.jsonl. Extracted _hub_rmw_results, which takes a
mutate(rows) callable; append's mutate is rows.append(row), and
update's mutate finds the row by id and calls r.update(updates).
Same lock, same RMW cycle.

Failure modes covered:
  - eval subprocess non-zero / timeout -> row "failed" + reason
  - report subprocess non-zero / missing output file -> row "failed" + reason
  - report JSON build (missing run_summary) -> row "failed" + reason
  - Hub upload of reports/<id>.{html,json} -> row "failed" + reason
  - Final row-flip itself fails -> row stays "pending";
    chunk 6's stuck-pending sweep catches it on next Space boot.
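
A minimal sketch of how these failure modes collapse into row states, assuming the row is a plain dict and the write is an injected callable. run_pipeline and write_row are hypothetical names; the 200-character cap is the one stated in this commit.

```python
from typing import Any, Callable

FAILURE_REASON_MAX_CHARS = 200  # cap on failure_reason, as described above

Row = dict[str, Any]


def run_pipeline(
    row: Row,
    pipeline: Callable[[], Row],
    write_row: Callable[[Row, Row], None],
) -> None:
    """Run pipeline; record completed or failed on the row via write_row."""
    try:
        scores = pipeline()
        write_row(row, {"status": "completed", "failure_reason": None, **scores})
    except Exception as e:  # broad on purpose: every failure maps to a row state
        reason = f"{type(e).__name__}: {e}"[:FAILURE_REASON_MAX_CHARS]
        try:
            write_row(row, {"status": "failed", "failure_reason": reason})
        except Exception:
            # The flip itself failed: the row stays "pending" and is left
            # for a later sweep to pick up.
            pass


def ok_write(row: Row, updates: Row) -> None:
    row.update(updates)


def broken_write(row: Row, updates: Row) -> None:
    raise OSError("hub down")


def bad_pipeline() -> Row:
    raise RuntimeError("x" * 500)
```

The nested try keeps a failed write from escaping the worker, so the only visible symptom of a double failure is a row that never leaves "pending".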

"Queued." UI message updated to include the typical eval wall
clock ("2-5 minutes on this Space's cpu-upgrade tier") so the
user knows roughly when to come back. The pending row's
"evaluating..." cell tag from the earlier polish commit holds
the in-table progress signal until chunk 5 adds the every=10
auto-refresh.

@@ -1,11 +1,13 @@
 """Submit-tab handler for the CADGenBench leaderboard Space.

-Step 6 (E) chunks 2 + 3: cheap-sync validation pipeline + pending-row
-write + zip upload. The handler validates
-zip to ``submissions/<id>.zip``, appends a
-``results.jsonl`` (under a process-wide

 Validation gates, in order:

@@ -31,16 +33,37 @@ Hub-write ordering (after validation passes):
   2. Build pending row (metadata + null scores + ``submission_blob_url``).
   3. Acquire ``_HUB_LOCK``; download current ``results.jsonl`` (or
      start empty); append the pending row; re-upload.

 If step 1 fails the user sees a clean rejection. If step 3 fails the
 zip is left orphaned in ``submissions/`` and the user sees a clean
 rejection; an orphan-zip sweep is a future-chunk concern.
 """
 from __future__ import annotations

 import json
 import logging
 import re
 import tempfile
 import threading
 import zipfile
@@ -69,7 +92,12 @@ REQUIRED_META_KEYS: tuple[str, ...] = (
 SUBMISSION_ID_SLUG_MAX = 40
 RESULTS_FILENAME = "results.jsonl"
 SUBMISSIONS_DIR = "submissions"
 DATA_REV_SHORT_LEN = 12

 # One HfApi client per process. HF_TOKEN is picked up from the env at
 # construction time and reused for every call.
@@ -109,33 +137,48 @@ def handle_submit(zip_file) -> str:
         return form_err

     zip_path = Path(zip_file.name)
         try:
-            _extract_zip(zip_path,
-            meta = _load_and_validate_meta(
-            fixture_names = _validate_fixture_set(
-            _validate_steps_parseable(
         except _ValidationError as e:
             return f"**Submission rejected.** {e}"

     return (
-        f"**Queued.** Submission `{submission_id}` has been accepted
-        f"`
-        f"`{meta['
-        f"
-        f"
     )

@@ -358,47 +401,79 @@ def _build_pending_row(


 def _append_pending_row(row: dict[str, Any]) -> None:
-    """Append a pending row to ``results.jsonl`` on the Hub under the lock.

     """
-        ) from e



 def _download_results_jsonl() -> str:
@@ -436,3 +511,165 @@ def _resolve_data_revision() -> str:
         )
         _DATA_REVISION = "unknown"
     return _DATA_REVISION

@@ -1,11 +1,13 @@
 """Submit-tab handler for the CADGenBench leaderboard Space.

+Step 6 (E) chunks 2 + 3 + 4: cheap-sync validation pipeline + pending-row
+write + zip upload + background-thread eval. The handler validates
+the upload, uploads the zip to ``submissions/<id>.zip``, appends a
+``status: pending`` row to ``results.jsonl`` (under a process-wide
+lock), spawns a daemon thread to run ``cadgenbench evaluate`` +
+``cadgenbench report single``, and returns immediately. The worker
+uploads ``reports/<id>.{html,json}`` and flips the row
+``pending -> completed`` (or ``failed`` with a ``failure_reason``).

 Validation gates, in order:

@@ -31,16 +33,37 @@ Hub-write ordering (after validation passes):
   2. Build pending row (metadata + null scores + ``submission_blob_url``).
   3. Acquire ``_HUB_LOCK``; download current ``results.jsonl`` (or
      start empty); append the pending row; re-upload.
+  4. Spawn worker thread (daemon, named after submission_id). The
+     worker owns the tempdir's lifecycle past this point.

 If step 1 fails the user sees a clean rejection. If step 3 fails the
 zip is left orphaned in ``submissions/`` and the user sees a clean
 rejection; an orphan-zip sweep is a future-chunk concern.
+
+Background worker, per submission:
+
+  1. ``cadgenbench evaluate <run_dir>`` (subprocess; runs
+     per-fixture eval in parallel via the CLI's ProcessPoolExecutor;
+     writes ``run_summary.json`` at the run-dir root).
+  2. ``cadgenbench report single <run_dir> -o <report.html>``
+     (subprocess; self-contained HTML with embedded renders).
+  3. Upload ``reports/<id>.html`` + ``reports/<id>.json``. The JSON
+     bundles ``run_summary.json`` + every per-fixture ``result.json``.
+  4. Read ``run_summary.json``; under ``_HUB_LOCK`` flip the row's
+     ``status`` to ``"completed"`` and merge the score fields.
+  5. On any worker-side exception, flip the row to ``"failed"`` with
+     a short ``failure_reason``. Tempdir cleanup runs in ``finally``
+     either way.
 """
 from __future__ import annotations

 import json
 import logging
+import os
 import re
+import shutil
+import subprocess
+import sys
 import tempfile
 import threading
 import zipfile

@@ -69,7 +92,12 @@ REQUIRED_META_KEYS: tuple[str, ...] = (
 SUBMISSION_ID_SLUG_MAX = 40
 RESULTS_FILENAME = "results.jsonl"
 SUBMISSIONS_DIR = "submissions"
+REPORTS_DIR = "reports"
 DATA_REV_SHORT_LEN = 12
+FAILURE_REASON_MAX_CHARS = 200
+EVAL_TIMEOUT_SECONDS = 15 * 60
+REPORT_TIMEOUT_SECONDS = 2 * 60
+EVAL_WORKER_COUNT = "8"

 # One HfApi client per process. HF_TOKEN is picked up from the env at
 # construction time and reused for every call.

@@ -109,33 +137,48 @@ def handle_submit(zip_file) -> str:
         return form_err

     zip_path = Path(zip_file.name)
+
+    # Manual tempdir lifecycle: cleaned up here on any rejection, but
+    # ownership passes to the worker on a successful spawn (the worker
+    # cleans up in its own finally). TemporaryDirectory's context
+    # manager doesn't fit because the dir has to outlive this function.
+    tmp = Path(tempfile.mkdtemp(prefix="cadgenbench-submit-"))
+    run_dir = tmp / "run"
+    run_dir.mkdir()
+    try:
         try:
+            _extract_zip(zip_path, run_dir)
+            meta = _load_and_validate_meta(run_dir)
+            fixture_names = _validate_fixture_set(run_dir)
+            _validate_steps_parseable(run_dir, fixture_names)
         except _ValidationError as e:
             return f"**Submission rejected.** {e}"

+        submission_id = _mint_submission_id(
+            meta["submitter_name"], meta["submission_name"]
+        )
+        try:
+            blob_url = _upload_submission_zip(submission_id, zip_path)
+            row = _build_pending_row(
+                submission_id, meta, fixture_names, blob_url
+            )
+            _append_pending_row(row)
+        except _HubWriteError as e:
+            return f"**Submission rejected.** {e}"
+
+        _spawn_worker(submission_id, tmp, run_dir)
+        tmp = None  # ownership transferred; skip cleanup below
+    finally:
+        if tmp is not None:
+            shutil.rmtree(tmp, ignore_errors=True)

     return (
+        f"**Queued.** Submission `{submission_id}` has been accepted "
+        f"(submitter: `{meta['submitter_name']}`, system: "
+        f"`{meta['submission_name']}`, {len(fixture_names)} fixtures). "
+        f"Evaluation typically takes 2-5 minutes on this Space's "
+        f"`cpu-upgrade` tier; the row flips to `completed` with score "
+        f"columns populated when the worker finishes."
     )

@@ -358,47 +401,79 @@ def _build_pending_row(


 def _append_pending_row(row: dict[str, Any]) -> None:
+    """Append a pending row to ``results.jsonl`` on the Hub under the lock."""
+    submission_id = row["submission_id"]
+
+    def mutate(rows: list[dict[str, Any]]) -> None:
+        rows.append(row)
+
+    try:
+        _hub_rmw_results(
+            mutate, commit_message=f"add pending row for {submission_id}"
+        )
+    except Exception as e:  # noqa: BLE001 - Hub API surface is broad
+        logger.exception(
+            "Failed RMW of results.jsonl while appending pending row for %s",
+            submission_id,
+        )
+        raise _HubWriteError(
+            f"Server-side error writing the submissions table "
+            f"({type(e).__name__}: {e}). The submission zip was uploaded "
+            f"but the row was not; please try again later."
+        ) from e

+
+def _update_row(submission_id: str, updates: dict[str, Any]) -> None:
+    """Find the row for *submission_id* and merge *updates* into it.
+
+    Raises ``LookupError`` if no row with that id exists (worker fired
+    before the pending row was committed, which shouldn't happen, but
+    surfaces clearly if it ever does).
     """
+    def mutate(rows: list[dict[str, Any]]) -> None:
+        for r in rows:
+            if r.get("submission_id") == submission_id:
+                r.update(updates)
+                return
+        raise LookupError(
+            f"No row with submission_id={submission_id!r} in results.jsonl."
+        )

+    _hub_rmw_results(
+        mutate,
+        commit_message=(
+            f"flip row for {submission_id} -> {updates.get('status', '?')}"
+        ),
+    )

+
+def _hub_rmw_results(
+    mutate, *, commit_message: str,
+) -> None:
+    """Lock + download + mutate + upload of ``results.jsonl``.
+
+    The lock is held only for the read-modify-write cycle (~1-2s),
+    never for eval time. Concurrent submitters serialise here, not
+    in the eval pipeline. Treats a missing file as the empty list.
+    """
+    with _HUB_LOCK:
+        existing = _download_results_jsonl()
+        rows: list[dict[str, Any]] = [
+            json.loads(line) for line in existing.splitlines() if line.strip()
+        ]
+        mutate(rows)
+        new_body = (
+            "\n".join(json.dumps(r, ensure_ascii=False) for r in rows) + "\n"
+            if rows
+            else ""
+        )
+        _HF_API.upload_file(
+            path_or_fileobj=new_body.encode("utf-8"),
+            path_in_repo=RESULTS_FILENAME,
+            repo_id=HF_SUBMISSIONS_REPO,
+            repo_type="dataset",
+            commit_message=commit_message,
+        )


 def _download_results_jsonl() -> str:

@@ -436,3 +511,165 @@ def _resolve_data_revision() -> str:
         )
         _DATA_REVISION = "unknown"
     return _DATA_REVISION
+
+
+# ---------------------------------------------------------------------------
+# Background worker (eval + report + row flip)
+# ---------------------------------------------------------------------------
+
+
+def _spawn_worker(submission_id: str, tmp: Path, run_dir: Path) -> None:
+    """Start the eval worker thread. Fire-and-forget; daemon=True so a
+    Space restart doesn't block on in-flight workers (chunk 6's
+    boot-time sweep flips any rows their workers didn't finish to
+    failed).
+    """
+    t = threading.Thread(
+        target=_run_worker,
+        args=(submission_id, tmp, run_dir),
+        name=f"cgb-worker-{submission_id}",
+        daemon=True,
+    )
+    t.start()
+
+
+def _run_worker(submission_id: str, tmp: Path, run_dir: Path) -> None:
+    """Top-level worker entry: run eval, build + upload reports, flip row.
+
+    Any exception in the pipeline flips the row to ``failed`` with a
+    short ``failure_reason`` (full traceback goes to the Space's
+    runtime logs). The tempdir is always cleaned up.
+    """
+    try:
+        try:
+            _run_eval(run_dir)
+            report_html = tmp / f"{submission_id}.html"
+            _run_report(run_dir, report_html)
+            report_json = _build_report_json(run_dir)
+            _upload_reports(submission_id, report_html, report_json)
+            summary = json.loads(
+                (run_dir / "run_summary.json").read_text(encoding="utf-8")
+            )
+            _flip_row_to_completed(submission_id, summary)
+            logger.info("Worker completed for %s", submission_id)
+        except Exception as e:  # noqa: BLE001 - broad on purpose; we map to row state
+            logger.exception("Worker failed for %s", submission_id)
+            reason = f"{type(e).__name__}: {str(e)}"[:FAILURE_REASON_MAX_CHARS]
+            try:
+                _flip_row_to_failed(submission_id, reason)
+            except Exception:
+                # If even the row-flip fails, the row stays pending.
+                # Chunk 6's stuck-pending sweep will catch it on the
+                # next Space boot.
+                logger.exception(
+                    "Failed to flip row to failed for %s; row stays pending",
+                    submission_id,
+                )
+    finally:
+        shutil.rmtree(tmp, ignore_errors=True)
+
+
+def _run_eval(run_dir: Path) -> None:
+    """Invoke ``cadgenbench evaluate`` over the run_dir; raise on non-zero."""
+    cmd = [
+        sys.executable, "-m", "cadgenbench.cli", "evaluate", str(run_dir),
+        "--workers", EVAL_WORKER_COUNT,
+    ]
+    logger.info("Running eval: %s", " ".join(cmd))
+    proc = subprocess.run(
+        cmd,
+        capture_output=True,
+        text=True,
+        timeout=EVAL_TIMEOUT_SECONDS,
+        env=os.environ.copy(),
+        check=False,
+    )
+    if proc.returncode != 0:
+        # Surface a short tail of stderr; full output is in Space logs above.
+        tail = (proc.stderr or proc.stdout or "")[-500:].strip()
+        raise RuntimeError(
+            f"cadgenbench evaluate exited {proc.returncode}: {tail}"
+        )
+
+
+def _run_report(run_dir: Path, html_out: Path) -> None:
+    """Invoke ``cadgenbench report single`` for the run_dir; raise on non-zero."""
+    cmd = [
+        sys.executable, "-m", "cadgenbench.cli", "report", "single",
+        str(run_dir), "-o", str(html_out),
+    ]
+    logger.info("Running report: %s", " ".join(cmd))
+    proc = subprocess.run(
+        cmd,
+        capture_output=True,
+        text=True,
+        timeout=REPORT_TIMEOUT_SECONDS,
+        env=os.environ.copy(),
+        check=False,
+    )
+    if proc.returncode != 0 or not html_out.is_file():
+        tail = (proc.stderr or proc.stdout or "")[-500:].strip()
+        raise RuntimeError(
+            f"cadgenbench report single exited {proc.returncode}: {tail}"
+        )
+
+
+def _build_report_json(run_dir: Path) -> dict[str, Any]:
+    """Bundle ``run_summary.json`` + every per-fixture ``result.json``."""
+    summary_path = run_dir / "run_summary.json"
+    if not summary_path.is_file():
+        raise RuntimeError(
+            f"run_summary.json not produced under {run_dir} (eval issue?)"
+        )
+    summary = json.loads(summary_path.read_text(encoding="utf-8"))
+    per_fixture: dict[str, dict[str, Any]] = {}
+    for fixture_dir in sorted(d for d in run_dir.iterdir() if d.is_dir()):
+        rp = fixture_dir / "result.json"
+        if rp.is_file():
+            per_fixture[fixture_dir.name] = json.loads(
+                rp.read_text(encoding="utf-8")
+            )
+    return {"run_summary": summary, "per_fixture_results": per_fixture}
+
+
+def _upload_reports(
+    submission_id: str, html_path: Path, report_json: dict[str, Any],
+) -> None:
+    """Upload ``reports/<id>.html`` and ``reports/<id>.json`` to the Hub."""
+    _HF_API.upload_file(
+        path_or_fileobj=str(html_path),
+        path_in_repo=f"{REPORTS_DIR}/{submission_id}.html",
+        repo_id=HF_SUBMISSIONS_REPO,
+        repo_type="dataset",
+        commit_message=f"add HTML report for {submission_id}",
+    )
+    _HF_API.upload_file(
+        path_or_fileobj=json.dumps(report_json, ensure_ascii=False, indent=2).encode("utf-8"),
+        path_in_repo=f"{REPORTS_DIR}/{submission_id}.json",
+        repo_id=HF_SUBMISSIONS_REPO,
+        repo_type="dataset",
+        commit_message=f"add JSON report for {submission_id}",
+    )
+
+
+def _flip_row_to_completed(submission_id: str, summary: dict[str, Any]) -> None:
+    """Merge ``run_summary.json`` fields into the pending row."""
+    updates: dict[str, Any] = {
+        "status": "completed",
+        "failure_reason": None,
+        "cadgenbench_data_revision": _resolve_data_revision(),
+        "aggregate_score": summary.get("aggregate_score"),
+        "validity_rate": summary.get("validity_rate"),
+        "score_by_task_type": summary.get("score_by_task_type"),
+        "per_task_scores": summary.get("per_task_scores"),
+        "per_fixture_scores": summary.get("per_fixture_scores"),
+    }
+    _update_row(submission_id, updates)
+
+
+def _flip_row_to_failed(submission_id: str, reason: str) -> None:
+    """Mark the row as ``failed`` with a short reason; scores stay null."""
+    _update_row(
+        submission_id,
+        {"status": "failed", "failure_reason": reason},
+    )