Space: HuggingAI4Engineering/CADGenBench
Michael Rabinovich committed on May 29

Commit 6e3ab50 (1 parent: 37e45d8)

submit: serialize cadgenbench evaluate to dodge cpu-upgrade contention

EVAL_WORKER_COUNT goes from "8" to "1". Removes the debug
/debug/render-bench route added in the previous commit.

Why: on the Space's cpu-upgrade tier today, a single headless-
Chromium render takes ~7s; five in parallel collapse the host and
each individual render blows past its 120s subprocess timeout.
Same renderer, same fixtures, same Playwright version run on a
laptop in ~5s per render with no slowdown under 5-way parallelism.
The slow path is HF's shared host, not our renderer.
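
For readers unfamiliar with the failure mode above, here is a minimal sketch of how a per-render subprocess timeout typically behaves in Python. It is not the cadgenbench renderer: `run_with_timeout` and `RENDER_TIMEOUT_SECONDS` are hypothetical names, and the command passed in is a stand-in for whatever actually launches headless Chromium.

```python
import subprocess
import sys
import time

# Assumption: mirrors the 120s per-render ceiling quoted in the commit message.
RENDER_TIMEOUT_SECONDS = 120


def run_with_timeout(cmd, timeout=RENDER_TIMEOUT_SECONDS):
    """Run cmd as a subprocess and report whether it finished within timeout."""
    t0 = time.perf_counter()
    error = None
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        if proc.returncode != 0:
            error = f"exit code {proc.returncode}"
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left running.
        error = f"timed out after {timeout}s"
    elapsed = round(time.perf_counter() - t0, 2)
    return {"ok": error is None, "seconds": elapsed, "error": error}


if __name__ == "__main__":
    # Stand-in "render" that sleeps longer than its (shortened) budget.
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_timeout(slow, timeout=0.5))
```

Under contention the child does not fail outright; it just runs slowly enough to hit the ceiling, which is why every parallel render can time out at once.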

Sequential eval (one fixture at a time, one render at a time) gives
each Chromium the whole box. Five fixtures finish in ~30-60s wall
time, far under the 15-minute outer EVAL_TIMEOUT_SECONDS budget.
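
The effect of the worker count can be illustrated with a small, self-contained sketch. This is an assumption-laden stand-in, not the cadgenbench evaluator (which is not shown in this commit): `run_fixtures` and `ConcurrencyProbe` are hypothetical, and a thread pool stands in for whatever worker mechanism the real evaluator uses. Only `EVAL_WORKER_COUNT` and its string type come from submit.py.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

EVAL_WORKER_COUNT = "1"  # kept as a string, as in submit.py


def run_fixtures(fixtures, render, worker_count=EVAL_WORKER_COUNT):
    """Call render(fixture) for each fixture with at most worker_count in flight."""
    with ThreadPoolExecutor(max_workers=int(worker_count)) as pool:
        # pool.map preserves input order, so zip pairs each fixture with its result.
        return dict(zip(fixtures, pool.map(render, fixtures)))


class ConcurrencyProbe:
    """Stand-in renderer that records the peak number of overlapping calls."""

    def __init__(self, seconds=0.01):
        self.seconds = seconds
        self.active = 0
        self.peak = 0
        self._lock = threading.Lock()

    def __call__(self, fixture):
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        time.sleep(self.seconds)  # pretend to render
        with self._lock:
            self.active -= 1
        return f"rendered:{fixture}"


if __name__ == "__main__":
    probe = ConcurrencyProbe()
    print(run_fixtures(["a", "b", "c"], probe))
    print("peak concurrent renders:", probe.peak)
```

With the worker count at "1" the probe never sees more than one render in flight, which is the property this commit relies on.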

Tradeoff: total eval wall time grows linearly with fixture count.
For today's five fixtures that's a few seconds slower than the
healthy-host parallel case, and far better than a failed eval.
For the 100-fixture launch this becomes ~10 minutes per
submission, uncomfortably close to the 15-minute
EVAL_TIMEOUT_SECONDS budget, which is the limit of this approach:
a real fix at that scale needs a non-Chromium renderer (PyVista/VTK
with off-screen OpenGL) and/or moving eval off-Space to HF Jobs.
Tracked as a follow-up.
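
The linear scaling can be checked with back-of-the-envelope arithmetic. The sketch below takes the message's own figures at face value (five fixtures in ~30-60s, so roughly 6-12s per fixture) and the 15-minute `EVAL_TIMEOUT_SECONDS` from submit.py; the helper names are hypothetical.

```python
import math

EVAL_TIMEOUT_SECONDS = 15 * 60  # outer budget, from submit.py


def sequential_wall_seconds(fixture_count, seconds_per_fixture):
    """Wall time when fixtures run strictly one after another."""
    return fixture_count * seconds_per_fixture


def max_fixtures_within_budget(seconds_per_fixture, budget=EVAL_TIMEOUT_SECONDS):
    """Largest fixture count that still fits in the budget."""
    return math.floor(budget / seconds_per_fixture)


if __name__ == "__main__":
    # Assumption: 6s and 12s per fixture bracket the ~30-60s seen for five fixtures.
    for per_fixture in (6, 12):
        total = sequential_wall_seconds(100, per_fixture)
        print(f"{per_fixture}s/fixture: 100 fixtures take {total / 60:.0f} min, "
              f"budget fits {max_fixtures_within_budget(per_fixture)} fixtures")
```

At the optimistic end, 100 fixtures take about 10 minutes, matching the estimate above. At the pessimistic end they would take about 20 minutes and exceed the budget, which is consistent with treating this change as a stopgap.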

Also revert /debug/render-bench. It served its purpose: confirming
that a single render is fast and that contention is what kills the
pipeline.

Files changed (2)

app.py +0 -47
submit.py +8 -1

app.py CHANGED

@@ -421,53 +421,6 @@ app.add_api_route(
     serve_report,
     methods=["GET"],
 )
-def debug_render_bench() -> dict:
-    """One-shot render-timing probe.
-    Sequentially times ``render_step`` on each cadgenbench-data input
-    STEP. Reads-only; no side effects. Used to compare per-render
-    cost on the Space's container vs. a local reference, when an
-    eval has started timing out at the 120s render_step ceiling but
-    nothing on the render path changed in our code.
-    Run via:
-      curl -H "Authorization: Bearer $HF_TOKEN" \
-        https://<space>.hf.space/debug/render-bench
-    """
-    import time
-    from cadgenbench.common.paths import data_inputs_dir
-    from cadgenbench.common.viewer import render_step
-    base = Path(data_inputs_dir())
-    results: dict = {}
-    for fixture in sorted(p for p in base.iterdir() if p.is_dir()):
-        step = fixture / "input.step"
-        if not step.exists():
-            results[fixture.name] = {"error": "no input.step"}
-            continue
-        t0 = time.perf_counter()
-        try:
-            imgs = render_step(str(step), timeout=180)
-            dt = time.perf_counter() - t0
-            results[fixture.name] = {
-                "ok": True, "seconds": round(dt, 2), "views": len(imgs),
-            }
-        except Exception as e:  # noqa: BLE001 - report whatever fails
-            dt = time.perf_counter() - t0
-            results[fixture.name] = {
-                "ok": False, "seconds": round(dt, 2),
-                "error": f"{type(e).__name__}: {str(e)[:300]}",
-            }
-    return results
-app.add_api_route(
-    "/debug/render-bench",
-    debug_render_bench,
-    methods=["GET"],
-)
 app = gr.mount_gradio_app(app, blocks, path="/")


submit.py CHANGED

@@ -103,7 +103,14 @@ DATA_REV_SHORT_LEN = 12
 FAILURE_REASON_MAX_CHARS = 200
 EVAL_TIMEOUT_SECONDS = 15 * 60
 REPORT_TIMEOUT_SECONDS = 2 * 60
-EVAL_WORKER_COUNT = "8"
+# Per-fixture eval workers. Was "8" (one Python worker per fixture,
+# each spawning its own headless-Chromium render subprocess in
+# parallel). Concurrent rendering on the Space's cpu-upgrade tier
+# oversubscribes the host: 5 simultaneous Chromiums turn 7s renders
+# into 120s+ timeouts. Sequential ("1") gives each render the box
+# to itself; 5 fixtures finish in ~30-60s wall time. Tracked as a
+# follow-up to move off Chromium-based rendering for scale.
+EVAL_WORKER_COUNT = "1"
 SHA256_BLOCK_SIZE = 64 * 1024
 STUCK_PENDING_THRESHOLD_SECONDS = 30 * 60
 SUBMITTED_AT_FORMAT = "%Y-%m-%dT%H:%M:%SZ"