RomeroLab-Duke/BioDesignBench-Leaderboard

Jasonkim8652 committed on Apr 15

Commit 3e1b7c7 (verified) · 1 parent: 7fd8751

Phase B (live): Boltz-2 via Modal sidecar instead of ZeroGPU


- Add modal_boltz_app.py: A10G companion app deployed to Modal, exposes
POST /predict with FastAPI; runs `boltz predict` on demand and returns
pLDDT/pTM/ipTM/i_pAE per item. Image: torch 2.10 + boltz 2.2.1 +
cuequivariance 0.9 + fastapi[standard]. Auto-stops after 5min idle.
- Rewrite eval_boltz.py as an HTTP client of the Modal endpoint.
Reads MODAL_BOLTZ_URL and MODAL_BOLTZ_TOKEN from Space secrets;
graceful fallback when unset.
- requirements.txt: drop torch/boltz/spaces (no longer needed in the
HF Space image -- prediction runs on Modal).
- README: describe the Modal sidecar architecture and deployment.
- Smoke-tested end to end with ubiquitin: pLDDT 93.89, pTM 0.9194.

Files changed (4)

README.md +29 -15
eval_boltz.py +110 -180
modal_boltz_app.py +270 -0
requirements.txt +4 -9

README.md CHANGED

@@ -46,26 +46,40 @@ Submission processing runs in 4 admin-controlled phases:
 | Phase | Step | Status | Notes |
 |---|---|---|---|
 | **A** | Dispatch tasks → CPU scoring | live | HTTP POST to submitter endpoint, validate, score 5/6 components |
-| **B** | Boltz-2 structure verification | code-ready | Needs ZeroGPU hardware + uncommented `torch`/`boltz` deps |
 | **C** | LLM judge panel (28-pt hybrid) | live | 3-judge PoLL with self-exclusion, requires API key secrets |
 | **D** | Finalize + publish to leaderboard | live | Aggregates hybrid scores, writes back to submissions dataset |
-### Phase B activation checklist
-To wire up Boltz-2 verification on this Space:
-1. **Switch hardware** in HF Space settings → Hardware → `zero-a10g`
-   (requires HF Pro / Enterprise).
-2. **Edit `requirements.txt`** and uncomment the two lines:
-   ```
-   torch>=2.2
-   boltz>=0.4
-   ```
-3. **Verify secrets** are set: `HF_TOKEN` (private dataset),
-   `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, `GOOGLE_API_KEY`,
-   `DEEPSEEK_API_KEY`.
-4. Restart the Space. The first build will pull ~2GB of CUDA wheels.
-On `cpu-basic` hardware the Phase B predictors return a structured
 failure dict with `success=False` and an actionable error message
 instead of crashing the dispatcher.

 | Phase | Step | Status | Notes |
 |---|---|---|---|
 | **A** | Dispatch tasks → CPU scoring | live | HTTP POST to submitter endpoint, validate, score 5/6 components |
+| **B** | Boltz-2 structure verification | live (Modal) | Modal-hosted A10G companion app provisions GPU on demand |
 | **C** | LLM judge panel (28-pt hybrid) | live | 3-judge PoLL with self-exclusion, requires API key secrets |
 | **D** | Finalize + publish to leaderboard | live | Aggregates hybrid scores, writes back to submissions dataset |
+### Phase B architecture (Modal companion app)
+The HF Space runs on `cpu-basic` and cannot host Boltz directly, so
+Phase B uses a Modal-deployed sidecar (`modal_boltz_app.py`) that:
+- pre-builds an image with `boltz==2.2.1`, `torch==2.10`, NVIDIA
+  cuequivariance kernels, and FastAPI;
+- exposes a single web endpoint at
+  `https://<workspace>--bdb-boltz-predict.modal.run`;
+- spins up an A10G on demand, runs `boltz predict` (via the same CLI
+  the dev pipeline uses), and returns confidence metrics;
+- auto-stops after 5 minutes idle so the lab is only billed for active
+  inference time (~$0.06 per task at A10G rates).
+The HF Space is just an HTTP client (`eval_boltz.py`); design sequences
+are POSTed to the Modal endpoint with a shared token in the JSON body. To
+deploy the sidecar (one time):
+```bash
+cd biodesignbench-leaderboard
+modal deploy modal_boltz_app.py
+```
+Then set these HF Space secrets:
+```
+MODAL_BOLTZ_URL    https://<workspace>--bdb-boltz-predict.modal.run
+MODAL_BOLTZ_TOKEN  matches the modal secret `bdb-boltz-shared` TOKEN
+```
+If `MODAL_BOLTZ_URL` is unset, Phase B predictors return a structured
 failure dict with `success=False` and an actionable error message
 instead of crashing the dispatcher.
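
The request contract described in the README section above can be sketched as follows. This is a minimal illustration, not the Space's actual client: `build_payload`, `not_configured`, and `plan_request` are hypothetical helper names, while the body shape (`token`, `items`, and per-item `name`/`kind`/`sequences`) and the failure-dict keys mirror `eval_boltz.py` in this commit. No network call is made.

```python
from typing import Any


def build_payload(sequences: list[str], token: str) -> dict[str, Any]:
    """Assemble the JSON body for the Modal endpoint (monomer items only)."""
    items = [
        {"name": f"mono_{i}", "kind": "monomer", "sequences": [seq]}
        for i, seq in enumerate(sequences)
    ]
    return {"token": token, "items": items}


def not_configured(names: list[str]) -> dict[str, dict[str, Any]]:
    """Structured failure entries used when MODAL_BOLTZ_URL is unset."""
    error = "Modal Boltz endpoint not configured."
    return {
        name: {"pLDDT": 0.0, "pTM": 0.0, "success": False, "error": error}
        for name in names
    }


def plan_request(sequences: list[str], env: dict[str, str]) -> dict[str, Any]:
    """Return either the request to send or the fallback results."""
    url = env.get("MODAL_BOLTZ_URL", "").strip()
    token = env.get("MODAL_BOLTZ_TOKEN", "").strip()
    payload = build_payload(sequences, token)
    if not url:
        # No endpoint configured: fail softly instead of raising.
        names = [item["name"] for item in payload["items"]]
        return {"send": False, "results": not_configured(names)}
    return {"send": True, "url": url, "json": payload}
```

Calling `plan_request(seqs, dict(os.environ))` on a Space without the secrets yields the fallback branch, so the dispatcher receives `success=False` entries rather than an exception.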

eval_boltz.py CHANGED

@@ -1,221 +1,168 @@
-"""Boltz structure prediction for post-assessment scoring.
-Uses @spaces.GPU decorator for ZeroGPU on HuggingFace Spaces.
-Two prediction modes:
-  - Monomer: Non-binding tasks -> pLDDT, pTM
-  - Complex: Binding tasks (binder + target) -> ipTM, i_pAE
-Batch chunking respects ZeroGPU time limits (~180-240s per burst).
-Phase B activation checklist (must all be true to actually run Boltz):
-  1. HF Space hardware switched to a GPU tier (zero-a10g recommended).
-  2. requirements.txt has `torch` and `boltz` uncommented.
-  3. HF_TOKEN secret set on the Space (for the private hidden-tasks dataset).
-On a cpu-basic Space the predictors return a structured failure dict
-with `success=False` and an actionable error message rather than
-crashing the dispatcher.
 """
 from __future__ import annotations
 import logging
-import time
 from typing import Any
 logger = logging.getLogger(__name__)
-# Chunking limits for ZeroGPU (free tier: ~300s max per burst)
-MONOMER_CHUNK_SIZE = 5    # ~30-60s per monomer
-COMPLEX_CHUNK_SIZE = 2    # ~60-120s per complex
-MAX_GPU_TIME = 240         # safety margin under 300s ZeroGPU limit
-# ---------------------------------------------------------------------------
-#  Boltz prediction (GPU-accelerated)
-# ---------------------------------------------------------------------------
-_BOLTZ_NOT_INSTALLED = (
-    "Boltz / torch not available on this Space. To enable Phase B, "
-    "switch the Space hardware to ZeroGPU (zero-a10g) and uncomment the "
-    "torch + boltz lines in requirements.txt."
 )
-def _predict_monomer(sequence: str) -> dict[str, float]:
-    """Predict structure of a single protein sequence using Boltz.
-    Returns:
-        Dict with: pLDDT, pTM (or a structured failure dict).
-    """
-    try:
-        import torch  # noqa: F401
-        from boltz import Boltz
-    except ImportError:
-        logger.warning(_BOLTZ_NOT_INSTALLED)
-        return {
-            "pLDDT": 0.0, "pTM": 0.0,
-            "success": False, "error": _BOLTZ_NOT_INSTALLED,
-        }
-    try:
-        model = Boltz.from_pretrained("boltz2")
-        result = model.predict(sequence)
-        plddt = float(result.confidence.plddt.mean())
-        ptm = float(result.confidence.ptm)
-        return {
-            "pLDDT": round(plddt, 2),
-            "pTM": round(ptm, 4),
-            "success": True,
-        }
-    except Exception as e:
-        logger.error(f"Boltz monomer prediction failed: {e}")
-        return {"pLDDT": 0.0, "pTM": 0.0, "success": False, "error": str(e)}
-def _predict_complex(
-    binder_seq: str,
-    target_seq: str,
-) -> dict[str, float]:
-    """Predict complex structure and binding metrics using Boltz.
-    Returns:
-        Dict with: ipTM, i_pAE, pLDDT, pTM (or a structured failure dict).
     """
     try:
-        import torch  # noqa: F401
-        from boltz import Boltz
     except ImportError:
-        logger.warning(_BOLTZ_NOT_INSTALLED)
         return {
-            "pLDDT": 0.0, "pTM": 0.0, "ipTM": 0.0, "i_pAE": 0.0,
-            "success": False, "error": _BOLTZ_NOT_INSTALLED,
         }
-    try:
-        model = Boltz.from_pretrained("boltz2")
-        result = model.predict([binder_seq, target_seq])
-        plddt = float(result.confidence.plddt.mean())
-        ptm = float(result.confidence.ptm)
-        iptm = float(result.confidence.iptm) if hasattr(result.confidence, "iptm") else 0.0
-        ipae = float(result.confidence.ipae) if hasattr(result.confidence, "ipae") else 0.0
-        return {
-            "pLDDT": round(plddt, 2),
-            "pTM": round(ptm, 4),
-            "ipTM": round(iptm, 4),
-            "i_pAE": round(ipae, 2),
-            "success": True,
-        }
     except Exception as e:
-        logger.error(f"Boltz complex prediction failed: {e}")
         return {
-            "pLDDT": 0.0, "pTM": 0.0, "ipTM": 0.0, "i_pAE": 0.0,
-            "success": False, "error": str(e),
         }
-# ---------------------------------------------------------------------------
-#  GPU-decorated entry points (for HF Spaces with ZeroGPU)
-# ---------------------------------------------------------------------------
-try:
-    import spaces
-    @spaces.GPU(duration=MAX_GPU_TIME)
-    def predict_monomer_batch(sequences: list[str]) -> list[dict[str, float]]:
-        """Predict structures for a batch of monomer sequences.
-        Decorated with @spaces.GPU for ZeroGPU allocation.
-        Args:
-            sequences: List of amino acid sequences (max MONOMER_CHUNK_SIZE).
-        Returns:
-            List of prediction result dicts with pLDDT, pTM.
-        """
-        results = []
-        for seq in sequences[:MONOMER_CHUNK_SIZE]:
-            results.append(_predict_monomer(seq))
-        return results
-    @spaces.GPU(duration=MAX_GPU_TIME)
-    def predict_complex_batch(
-        pairs: list[tuple[str, str]],
-    ) -> list[dict[str, float]]:
-        """Predict structures for a batch of binder-target pairs.
-        Args:
-            pairs: List of (binder_seq, target_seq) tuples.
-        Returns:
-            List of prediction result dicts with ipTM, i_pAE, pLDDT, pTM.
-        """
-        results = []
-        for binder, target in pairs[:COMPLEX_CHUNK_SIZE]:
-            results.append(_predict_complex(binder, target))
-        return results
-except ImportError:
-    # Not running on HF Spaces -- provide un-decorated versions
-    def predict_monomer_batch(sequences: list[str]) -> list[dict[str, float]]:
-        return [_predict_monomer(seq) for seq in sequences[:MONOMER_CHUNK_SIZE]]
-    def predict_complex_batch(
-        pairs: list[tuple[str, str]],
-    ) -> list[dict[str, float]]:
-        return [_predict_complex(b, t) for b, t in pairs[:COMPLEX_CHUNK_SIZE]]
-# ---------------------------------------------------------------------------
-#  High-level assessment API
-# ---------------------------------------------------------------------------
 def run_boltz_posteval(
     per_task_results: dict[str, dict[str, Any]],
     progress_callback=None,
 ) -> dict[str, dict[str, Any]]:
-    """Run Boltz post-assessment on all tasks that need it.
-    For each task:
-      - Non-binding: pick best design -> monomer prediction
-      - Binding: pick best design + target sequence -> complex prediction
       - Merge Boltz metrics into existing results
-      - Re-score quality component
-    Args:
-        per_task_results: Dict of task_id -> dispatch result (from dispatcher).
-        progress_callback: Optional callback(task_id, i, total, metrics).
-    Returns:
-        Updated per_task_results with Boltz metrics and final quality scores.
     """
-    from eval_scorer import _is_binding_task, score_quality
-    # Separate tasks into monomer and complex batches
-    monomer_tasks = []
-    complex_tasks = []
     for task_id, result in per_task_results.items():
         if not result.get("success") or not result.get("quality_pending"):
             continue
         sequences = result.get("sequences", [])
         if not sequences:
             continue
-        best_seq = sequences[0]  # Use first design for Boltz
         if _is_binding_task(task_id):
-            # Need target sequence from ground truth
-            target_seq = result.get("ground_truth_thresholds", {}).get("target_sequence")
             if target_seq:
                 complex_tasks.append((task_id, best_seq, target_seq))
             else:
-                # Fall back to monomer if no target
                 monomer_tasks.append((task_id, best_seq))
         else:
             monomer_tasks.append((task_id, best_seq))
@@ -223,32 +170,24 @@ def run_boltz_posteval(
     total = len(monomer_tasks) + len(complex_tasks)
     done = 0
-    # Process monomer tasks in chunks
     for chunk_start in range(0, len(monomer_tasks), MONOMER_CHUNK_SIZE):
         chunk = monomer_tasks[chunk_start:chunk_start + MONOMER_CHUNK_SIZE]
         seqs = [seq for _, seq in chunk]
         boltz_results = predict_monomer_batch(seqs)
         for (task_id, _), metrics in zip(chunk, boltz_results):
             if metrics.get("success"):
                 _merge_boltz_metrics(per_task_results[task_id], metrics)
             done += 1
             if progress_callback:
                 progress_callback(task_id, done, total, metrics)
-    # Process complex tasks in chunks
     for chunk_start in range(0, len(complex_tasks), COMPLEX_CHUNK_SIZE):
         chunk = complex_tasks[chunk_start:chunk_start + COMPLEX_CHUNK_SIZE]
         pairs = [(binder, target) for _, binder, target in chunk]
         boltz_results = predict_complex_batch(pairs)
         for (task_id, _, _), metrics in zip(chunk, boltz_results):
             if metrics.get("success"):
                 _merge_boltz_metrics(per_task_results[task_id], metrics)
             done += 1
             if progress_callback:
                 progress_callback(task_id, done, total, metrics)
@@ -258,21 +197,16 @@ def run_boltz_posteval(
 def _merge_boltz_metrics(
     task_result: dict[str, Any],
-    boltz_metrics: dict[str, float],
 ) -> None:
-    """Merge Boltz prediction metrics into a task result and re-score quality.
-    Modifies task_result in-place.
-    """
     from eval_scorer import apply_design_gate, score_quality
-    # Merge Boltz metrics with any agent-reported metrics
     merged_metrics = task_result.get("agent_metrics", {}).copy()
     for key in ("pLDDT", "pTM", "ipTM", "i_pAE"):
         if key in boltz_metrics and boltz_metrics[key] > 0:
             merged_metrics[key] = boltz_metrics[key]
-    # Re-score quality with Boltz metrics
     quality_result = score_quality(
         agent_metrics=merged_metrics,
         thresholds=task_result.get("ground_truth_thresholds", {}),
@@ -281,15 +215,11 @@ def _merge_boltz_metrics(
         oracle_sequences=task_result.get("oracle_sequences"),
     )
-    # Update scores
     task_result["boltz_metrics"] = boltz_metrics
     task_result["quality_pending"] = False
     if "cpu_scores" in task_result:
         task_result["cpu_scores"]["quality"] = quality_result["score"]
-    # Compute final gated score
-    if "cpu_scores" in task_result:
         component_scores = dict(task_result["cpu_scores"])
         gated = apply_design_gate(component_scores, task_result.get("num_designs", 0))
         task_result["final_scores"] = gated

+"""Boltz-2 structure verification client (Phase B).
+The HF Space leaderboard runs on cpu-basic, so it cannot host Boltz
+directly. This module is a thin HTTP client that POSTs design sequences
+to a Modal-deployed companion app (`modal_boltz_app.py`), which
+provisions an A10G on demand, runs `boltz predict`, and returns
+confidence metrics.
+Two prediction modes (selected automatically by `run_boltz_posteval`):
+  - Monomer (non-binding tasks)   -> pLDDT, pTM
+  - Complex (binding tasks)       -> pLDDT, pTM, ipTM, i_pAE
+Required HF Space secrets (set out-of-band via the leaderboard admin):
+  MODAL_BOLTZ_URL    https://<workspace>--bdb-boltz-predict.modal.run
+  MODAL_BOLTZ_TOKEN  shared token (sent in the JSON body) matching the Modal secret TOKEN
+If `MODAL_BOLTZ_URL` is unset the predictors return a structured
+failure dict with `success=False` and an actionable error message
+rather than crashing the dispatcher.
 """
 from __future__ import annotations
 import logging
+import os
 from typing import Any
 logger = logging.getLogger(__name__)
+# Batch sizes large enough to amortize Modal cold-start, small enough
+# to stay under the 1700s `boltz predict` subprocess timeout.
+MONOMER_CHUNK_SIZE = 20
+COMPLEX_CHUNK_SIZE = 10
+HTTP_TIMEOUT_SEC = 1700
+_NOT_CONFIGURED = (
+    "Modal Boltz endpoint not configured. Set MODAL_BOLTZ_URL (and "
+    "MODAL_BOLTZ_TOKEN) on the HF Space, or deploy the companion app "
+    "with `modal deploy modal_boltz_app.py`."
 )
+def _modal_url() -> str | None:
+    return os.environ.get("MODAL_BOLTZ_URL", "").strip() or None
+def _modal_token() -> str:
+    return os.environ.get("MODAL_BOLTZ_TOKEN", "").strip()
+def _failure(error: str, complex_keys: bool = False) -> dict[str, Any]:
+    out = {"pLDDT": 0.0, "pTM": 0.0, "success": False, "error": error}
+    if complex_keys:
+        out.update({"ipTM": 0.0, "i_pAE": 0.0})
+    return out
+def _post_predictions(items: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
+    """POST a list of prediction items to the Modal endpoint.
+    Returns a dict mapping each item's `name` to a metric dict, with
+    structured failure entries on error.
     """
+    url = _modal_url()
+    if not url:
+        return {item["name"]: _failure(_NOT_CONFIGURED) for item in items}
     try:
+        import httpx
     except ImportError:
         return {
+            item["name"]: _failure("httpx not installed in leaderboard image")
+            for item in items
         }
+    headers = {"Content-Type": "application/json"}
+    payload = {"token": _modal_token(), "items": items}
+    try:
+        resp = httpx.post(
+            url, json=payload, headers=headers, timeout=HTTP_TIMEOUT_SEC,
+        )
     except Exception as e:
+        return {item["name"]: _failure(f"Modal POST failed: {e}") for item in items}
+    if resp.status_code != 200:
         return {
+            item["name"]: _failure(f"Modal HTTP {resp.status_code}: {resp.text[:200]}")
+            for item in items
         }
+    try:
+        body = resp.json()
+    except Exception as e:
+        return {item["name"]: _failure(f"Modal returned non-JSON: {e}") for item in items}
+    if "error" in body:
+        msg = body["error"]
+        return {item["name"]: _failure(f"Modal: {msg}") for item in items}
+    results = body.get("results", {})
+    out: dict[str, dict[str, Any]] = {}
+    for item in items:
+        name = item["name"]
+        out[name] = results.get(name) or _failure(
+            "Modal returned no result for this item"
+        )
+    return out
+def predict_monomer_batch(sequences: list[str]) -> list[dict[str, float]]:
+    """Predict structures for a batch of monomer sequences."""
+    items = [
+        {"name": f"mono_{i}", "kind": "monomer", "sequences": [seq]}
+        for i, seq in enumerate(sequences[:MONOMER_CHUNK_SIZE])
+    ]
+    by_name = _post_predictions(items)
+    return [by_name[item["name"]] for item in items]
+def predict_complex_batch(
+    pairs: list[tuple[str, str]],
+) -> list[dict[str, float]]:
+    """Predict structures for a batch of (binder, target) pairs."""
+    items = [
+        {"name": f"cmplx_{i}", "kind": "complex", "sequences": [b, t]}
+        for i, (b, t) in enumerate(pairs[:COMPLEX_CHUNK_SIZE])
+    ]
+    by_name = _post_predictions(items)
+    return [by_name[item["name"]] for item in items]
 def run_boltz_posteval(
     per_task_results: dict[str, dict[str, Any]],
     progress_callback=None,
 ) -> dict[str, dict[str, Any]]:
+    """Run Boltz post-assessment on every task that needs it.
+    For each successful task:
+      - Non-binding: pick the first design -> monomer prediction
+      - Binding: pick the first design + target sequence -> complex prediction
       - Merge Boltz metrics into existing results
+      - Re-score the quality component
     """
+    from eval_scorer import _is_binding_task
+    monomer_tasks: list[tuple[str, str]] = []
+    complex_tasks: list[tuple[str, str, str]] = []
     for task_id, result in per_task_results.items():
         if not result.get("success") or not result.get("quality_pending"):
             continue
         sequences = result.get("sequences", [])
         if not sequences:
             continue
+        best_seq = sequences[0]
         if _is_binding_task(task_id):
+            target_seq = (
+                result.get("ground_truth_thresholds", {}).get("target_sequence")
+            )
             if target_seq:
                 complex_tasks.append((task_id, best_seq, target_seq))
             else:
                 monomer_tasks.append((task_id, best_seq))
         else:
             monomer_tasks.append((task_id, best_seq))
     total = len(monomer_tasks) + len(complex_tasks)
     done = 0
     for chunk_start in range(0, len(monomer_tasks), MONOMER_CHUNK_SIZE):
         chunk = monomer_tasks[chunk_start:chunk_start + MONOMER_CHUNK_SIZE]
         seqs = [seq for _, seq in chunk]
         boltz_results = predict_monomer_batch(seqs)
         for (task_id, _), metrics in zip(chunk, boltz_results):
             if metrics.get("success"):
                 _merge_boltz_metrics(per_task_results[task_id], metrics)
             done += 1
             if progress_callback:
                 progress_callback(task_id, done, total, metrics)
     for chunk_start in range(0, len(complex_tasks), COMPLEX_CHUNK_SIZE):
         chunk = complex_tasks[chunk_start:chunk_start + COMPLEX_CHUNK_SIZE]
         pairs = [(binder, target) for _, binder, target in chunk]
         boltz_results = predict_complex_batch(pairs)
         for (task_id, _, _), metrics in zip(chunk, boltz_results):
             if metrics.get("success"):
                 _merge_boltz_metrics(per_task_results[task_id], metrics)
             done += 1
             if progress_callback:
                 progress_callback(task_id, done, total, metrics)
 def _merge_boltz_metrics(
     task_result: dict[str, Any],
+    boltz_metrics: dict[str, Any],
 ) -> None:
+    """Merge Boltz prediction metrics into a task result and re-score quality."""
     from eval_scorer import apply_design_gate, score_quality
     merged_metrics = task_result.get("agent_metrics", {}).copy()
     for key in ("pLDDT", "pTM", "ipTM", "i_pAE"):
         if key in boltz_metrics and boltz_metrics[key] > 0:
             merged_metrics[key] = boltz_metrics[key]
     quality_result = score_quality(
         agent_metrics=merged_metrics,
         thresholds=task_result.get("ground_truth_thresholds", {}),
         oracle_sequences=task_result.get("oracle_sequences"),
     )
     task_result["boltz_metrics"] = boltz_metrics
     task_result["quality_pending"] = False
     if "cpu_scores" in task_result:
         task_result["cpu_scores"]["quality"] = quality_result["score"]
         component_scores = dict(task_result["cpu_scores"])
         gated = apply_design_gate(component_scores, task_result.get("num_designs", 0))
         task_result["final_scores"] = gated
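
The chunked dispatch in `run_boltz_posteval` can be illustrated in isolation. The sketch below assumes a stand-in predictor (`fake_predict`, hypothetical) in place of the real HTTP call, and `run_in_chunks` is an illustrative name rather than a function in this commit. It shows the two properties the loop relies on: one batch call per chunk, and results returned in input order so they can be zipped back onto task IDs.

```python
from typing import Callable


def run_in_chunks(
    tasks: list[tuple[str, str]],
    chunk_size: int,
    predict_batch: Callable[[list[str]], list[dict]],
) -> dict[str, dict]:
    """Send (task_id, sequence) pairs in fixed-size chunks; key results by task ID."""
    results: dict[str, dict] = {}
    for start in range(0, len(tasks), chunk_size):
        chunk = tasks[start:start + chunk_size]
        seqs = [seq for _, seq in chunk]
        # One batch call per chunk; results are assumed to come back in input order.
        for (task_id, _), metrics in zip(chunk, predict_batch(seqs)):
            results[task_id] = metrics
    return results


def fake_predict(seqs: list[str]) -> list[dict]:
    """Stand-in predictor (hypothetical): no GPU, just reports sequence length."""
    return [{"success": True, "length": len(s)} for s in seqs]


tasks = [(f"task_{i}", "M" * (i + 1)) for i in range(5)]
print(run_in_chunks(tasks, chunk_size=2, predict_batch=fake_predict))
```

With five tasks and a chunk size of two, the predictor is invoked three times. The same arithmetic applies to the real constants: larger chunks mean fewer cold starts but longer individual requests.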

modal_boltz_app.py ADDED

@@ -0,0 +1,270 @@

+"""Modal app: Boltz-2 structure prediction for BioDesignBench Phase B.
+This is the GPU-side companion to `eval_boltz.py`. The HF Space leaderboard
+runs on cpu-basic, so it cannot host Boltz directly; instead it POSTs design
+sequences to this Modal app, which spins up an A10G on demand, runs
+`boltz predict`, and returns confidence metrics.
+Setup (one-time, on a machine with `pip install modal`):
+    modal token new                       # if you don't have a token yet
+    cd biodesignbench-leaderboard
+    modal deploy modal_boltz_app.py
+After deploy Modal prints a URL like
+    https://<workspace>--bdb-boltz-predict.modal.run
+Add that URL plus a shared secret to the HF Space secrets:
+    MODAL_BOLTZ_URL  = https://<workspace>--bdb-boltz-predict.modal.run
+    MODAL_BOLTZ_TOKEN = <random 32-byte hex>
+Cost: A10G is billed per-second, and the container auto-stops after
+`scaledown_window` seconds of idle time (300 here). With one submission
+per month and ~76 tasks * ~30s = ~38min GPU per submission, expected
+spend is well within Modal's free tier.
+"""
+from __future__ import annotations
+import os
+import modal
+APP_NAME = "bdb-boltz"
+ENDPOINT_LABEL = "bdb-boltz-predict"
+app = modal.App(APP_NAME)
+# Persistent volume for Boltz-2 model weights (~6GB, downloaded on first call)
+weights_volume = modal.Volume.from_name(
+    "bdb-boltz-weights", create_if_missing=True
+)
+# Boltz GPU image. Boltz-2 is published on PyPI as `boltz` and pulls a
+# CUDA-12 torch wheel automatically.
+gpu_image = (
+    modal.Image.from_registry(
+        "nvidia/cuda:12.4.1-cudnn-runtime-ubuntu22.04",
+        add_python="3.11",
+    )
+    .apt_install("git", "wget", "build-essential")
+    # Boltz-2 (>=2.2) uses NVIDIA cuequivariance for the triangular-multiply
+    # kernel and requires CUDA 12.5+. torch is pinned below so that its
+    # bundled CUDA libraries satisfy cuequivariance's
+    # nvidia-cublas-cu12>=12.5 constraint.
+    .pip_install(
+        # Match dev's known-working stack: torch 2.10 ships nvidia-cublas-cu12
+        # 12.8, which satisfies cuequivariance's >=12.5 requirement.
+        "torch==2.10.0",
+        "boltz==2.2.1",
+        "cuequivariance==0.9.0",
+        "cuequivariance-torch==0.9.0",
+        "cuequivariance-ops-cu12==0.9.0",
+        "cuequivariance-ops-torch-cu12==0.9.0",
+        "fastapi[standard]",
+        "pyyaml",
+        "numpy",
+    )
+    .env(
+        {
+            "BOLTZ_CACHE": "/weights",
+            "TORCH_HOME": "/weights/torch",
+            "HF_HOME": "/weights/hf",
+        }
+    )
+)
+# ---------------------------------------------------------------------------
+#  Internal: write YAMLs, run boltz predict, parse outputs
+# ---------------------------------------------------------------------------
+def _write_yaml(item: dict) -> str:
+    """Render one prediction item to a Boltz YAML string.
+    item shape:
+        {"name": "task_001",
+         "kind": "monomer" | "complex",
+         "sequences": ["MKKL...", ...]}    # 1 for monomer, 2 for complex
+    """
+    seqs = item.get("sequences") or []
+    chain_ids = ["A", "B", "C", "D", "E"]
+    lines = ["sequences:"]
+    for i, seq in enumerate(seqs):
+        cid = chain_ids[i] if i < len(chain_ids) else f"X{i}"
+        lines.append("  - protein:")
+        lines.append(f"      id: {cid}")
+        lines.append(f"      sequence: {seq}")
+    return "\n".join(lines) + "\n"
+def _parse_confidence(pred_dir) -> dict:
+    """Parse a Boltz prediction directory into a flat metric dict."""
+    import json
+    from pathlib import Path
+    import numpy as np
+    out = {
+        "pLDDT": 0.0, "pTM": 0.0, "ipTM": 0.0, "i_pAE": 0.0,
+        "success": False,
+    }
+    pred_dir = Path(pred_dir)
+    conf_files = list(pred_dir.rglob("confidence*.json"))
+    if conf_files:
+        try:
+            with open(conf_files[0]) as f:
+                c = json.load(f)
+            out["pLDDT"] = round(float(c.get("complex_plddt", 0.0)) * 100, 2)
+            out["pTM"] = round(float(c.get("ptm", 0.0)), 4)
+            out["ipTM"] = round(float(c.get("iptm", 0.0)), 4)
+            out["i_pAE"] = round(float(c.get("complex_ipae", 0.0)), 2)
+            out["success"] = True
+        except Exception:
+            pass
+    if not out["success"]:
+        # Fall back to per-residue plddt npz if confidence.json is missing
+        plddt_files = list(pred_dir.rglob("plddt*.npz"))
+        if plddt_files:
+            try:
+                arr = np.load(plddt_files[0])["plddt"]
+                out["pLDDT"] = round(float(arr.mean()) * 100, 2)
+                out["success"] = True
+            except Exception:
+                pass
+    return out
+# ---------------------------------------------------------------------------
+#  GPU entry point — single web endpoint handling both monomer and complex
+# ---------------------------------------------------------------------------
+@app.function(
+    image=gpu_image,
+    gpu="A10G",
+    volumes={"/weights": weights_volume},
+    timeout=1800,
+    scaledown_window=300,
+    secrets=[modal.Secret.from_name("bdb-boltz-shared", required_keys=["TOKEN"])],
+)
+@modal.fastapi_endpoint(method="POST", label=ENDPOINT_LABEL)
+def predict(payload: dict) -> dict:
+    """Run Boltz-2 on a list of prediction items.
+    Body shape:
+        {"token": "<shared secret>",
+         "items": [{"name": "...", "kind": "monomer"|"complex",
+                    "sequences": [...]}, ...]}
+    The list is assembled into a single ``boltz predict`` invocation so
+    the model loads only once per call (amortizes ~30s cold start).
+    Returns a dict mapping each item's `name` to a metric dict:
+        {"pLDDT", "pTM", "ipTM", "i_pAE", "success"}
+    """
+    import shutil
+    import subprocess
+    import tempfile
+    from pathlib import Path
+    expected_token = os.environ.get("TOKEN", "")
+    if expected_token and (payload.get("token") or "") != expected_token:
+        return {"error": "Unauthorized -- bad MODAL_BOLTZ_TOKEN"}
+    items = payload.get("items") or []
+    if not items:
+        return {"results": {}}
+    work = Path(tempfile.mkdtemp(prefix="bdb_boltz_"))
+    in_dir = work / "inputs"
+    out_dir = work / "out"
+    in_dir.mkdir()
+    out_dir.mkdir()
+    name_to_yaml: dict[str, str] = {}
+    for i, item in enumerate(items):
+        name = str(item.get("name") or f"item_{i:04d}")
+        safe = "".join(c if c.isalnum() else "_" for c in name)[:60]
+        yaml_name = f"{i:04d}_{safe}"
+        (in_dir / f"{yaml_name}.yaml").write_text(_write_yaml(item))
+        name_to_yaml[name] = yaml_name
+    cmd = [
+        "boltz", "predict",
+        str(in_dir),
+        "--out_dir", str(out_dir),
+        "--cache", "/weights/boltz_cache",
+        "--diffusion_samples", "1",
+        "--output_format", "pdb",
+        "--use_msa_server",
+    ]
+    proc = subprocess.run(
+        cmd, capture_output=True, text=True, timeout=1700, cwd=str(work),
+    )
+    # Persist downloaded model weights to the shared volume
+    try:
+        weights_volume.commit()
+    except Exception:
+        pass
+    if proc.returncode != 0:
+        shutil.rmtree(str(work), ignore_errors=True)
+        return {
+            "error": "boltz predict failed",
+            "stderr": proc.stderr[-2000:],
+            "stdout": proc.stdout[-2000:],
+        }
+    # boltz writes outputs to out/boltz_results_inputs/predictions/<name>/
+    predictions_root = None
+    for p in out_dir.rglob("predictions"):
+        if p.is_dir():
+            predictions_root = p
+            break
+    results: dict[str, dict] = {}
+    if predictions_root is not None:
+        for name, yaml_name in name_to_yaml.items():
+            pred_dirs = [
+                d for d in predictions_root.iterdir()
+                if d.is_dir() and (d.name.startswith(yaml_name) or d.name == yaml_name)
+            ]
+            if pred_dirs:
+                results[name] = _parse_confidence(pred_dirs[0])
+            else:
+                results[name] = {
+                    "pLDDT": 0.0, "pTM": 0.0, "ipTM": 0.0, "i_pAE": 0.0,
+                    "success": False, "error": "prediction missing",
+                }
+    shutil.rmtree(str(work), ignore_errors=True)
+    return {"results": results}
+# ---------------------------------------------------------------------------
+#  CLI smoke test:  modal run modal_boltz_app.py
+# ---------------------------------------------------------------------------
+@app.local_entrypoint()
+def main():
+    """Quick sanity check — a short ubiquitin-like sequence."""
+    import json
+    items = [
+        {
+            "name": "monomer_demo",
+            "kind": "monomer",
+            "sequences": [
+                "MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG"
+            ],
+        },
+    ]
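+    # predict() takes a single payload dict. The token must match the Modal
+    # secret; reading it from a local MODAL_BOLTZ_TOKEN env var is an assumption.
+    out = predict.remote(
+        {"token": os.environ.get("MODAL_BOLTZ_TOKEN", ""), "items": items}
+    )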
+    print(json.dumps(out, indent=2))
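
To make the sidecar's input and output handling concrete, here is a small standalone sketch of the two pure-Python steps around `boltz predict`: rendering the input YAML and flattening the confidence JSON. `render_yaml` and `read_confidence` are illustrative names that restate the logic of `_write_yaml` and `_parse_confidence` above; the JSON keys (`complex_plddt`, `ptm`) are the ones read in this commit, and the numeric values written to the temporary file are made-up inputs, not real Boltz output.

```python
import json
import tempfile
from pathlib import Path


def render_yaml(sequences: list[str]) -> str:
    """Render protein chains in the YAML layout produced by `_write_yaml`."""
    chain_ids = ["A", "B", "C", "D", "E"]
    lines = ["sequences:"]
    for i, seq in enumerate(sequences):
        cid = chain_ids[i] if i < len(chain_ids) else f"X{i}"
        lines += ["  - protein:", f"      id: {cid}", f"      sequence: {seq}"]
    return "\n".join(lines) + "\n"


def read_confidence(path: Path) -> dict:
    """Flatten a confidence JSON, rescaling pLDDT from a 0-1 to a 0-100 scale."""
    c = json.loads(path.read_text())
    return {
        "pLDDT": round(float(c.get("complex_plddt", 0.0)) * 100, 2),
        "pTM": round(float(c.get("ptm", 0.0)), 4),
    }


# Two chains (binder + target) become chains A and B in one YAML document.
print(render_yaml(["MKT", "GSH"]), end="")

with tempfile.TemporaryDirectory() as tmp:
    conf = Path(tmp) / "confidence_demo.json"
    # Illustrative input values only.
    conf.write_text(json.dumps({"complex_plddt": 0.9389, "ptm": 0.9194}))
    print(read_confidence(conf))
```

The rescaling step matters downstream: `_merge_boltz_metrics` compares these values against thresholds, so pLDDT is reported on a 0-100 scale while pTM stays in 0-1.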

requirements.txt CHANGED

@@ -10,12 +10,7 @@ anthropic>=0.75
 openai>=1.40
 google-genai>=0.3
-# Phase B (Boltz post-eval). The `spaces` shim is safe on any hardware
-# tier; the `@spaces.GPU(...)` decorator is a no-op on cpu-basic and
-# provisions ZeroGPU on zero-a10g. Boltz-1 + torch require an actual
-# CUDA build, so they are gated: uncomment ONLY after switching the
-# Space hardware to a GPU tier (zero-a10g recommended) — otherwise pip
-# will pull ~2GB of CUDA wheels onto a CPU image and the build fails.
-spaces>=0.30
-# torch>=2.2          # ZeroGPU only — uncomment after hardware flip
-# boltz>=0.4          # ZeroGPU only — uncomment after hardware flip

 openai>=1.40
 google-genai>=0.3
+# Phase B uses a Modal-hosted Boltz sidecar (modal_boltz_app.py), so
+# torch / boltz are NOT installed in the Space image; the Space only
+# acts as an HTTP client of the Modal endpoint. See
+# biodesignbench-leaderboard/README.md for deployment notes.