Commit 3e1b7c7 by Jasonkim8652 (verified) · 1 parent: 7fd8751

Phase B (live): Boltz-2 via Modal sidecar instead of ZeroGPU

- Add modal_boltz_app.py: A10G companion app deployed to Modal, exposes
POST /predict with FastAPI; runs `boltz predict` on demand and returns
pLDDT/pTM/ipTM/i_pAE per item. Image: torch 2.10 + boltz 2.2.1 +
cuequivariance 0.9 + fastapi[standard]. Auto-stops after 5min idle.
- Rewrite eval_boltz.py as an HTTP client of the Modal endpoint.
Reads MODAL_BOLTZ_URL and MODAL_BOLTZ_TOKEN from Space secrets;
graceful fallback when unset.
- requirements.txt: drop torch/boltz/spaces (no longer needed in the
HF Space image -- prediction runs on Modal).
- README: describe the Modal sidecar architecture and deployment.
- Smoke-tested end to end with ubiquitin: pLDDT 93.89, pTM 0.9194.
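
The request and response shapes implied by these changes can be sketched as follows. This is an illustration under stated assumptions, not code from the commit: the helper names `build_payload` and `split_response` are hypothetical, while the field names (`token`, `items`, `name`, `kind`, `sequences`, `results`, `error`) follow `eval_boltz.py` and `modal_boltz_app.py`.

```python
from __future__ import annotations

from typing import Any


def build_payload(
    token: str,
    monomers: list[str],
    pairs: list[tuple[str, str]],
) -> dict[str, Any]:
    """Assemble the JSON body POSTed to the sidecar (hypothetical helper)."""
    items: list[dict[str, Any]] = [
        {"name": f"mono_{i}", "kind": "monomer", "sequences": [seq]}
        for i, seq in enumerate(monomers)
    ]
    items += [
        {"name": f"cmplx_{i}", "kind": "complex", "sequences": [binder, target]}
        for i, (binder, target) in enumerate(pairs)
    ]
    return {"token": token, "items": items}


def split_response(body: dict[str, Any]) -> tuple[dict[str, Any], str | None]:
    """Return (results, error) from a sidecar response body (hypothetical helper)."""
    if "error" in body:
        return {}, str(body["error"])
    return body.get("results", {}), None


# Placeholder sequences and metric values, for illustration only.
payload = build_payload("example-token", ["MKT"], [("AAA", "CCC")])
results, error = split_response(
    {"results": {"mono_0": {"pLDDT": 90.0, "success": True}}}
)
```

Because a whole-request failure arrives as a top-level `error` key rather than per-item entries, a caller has to check for it before reading `results`.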

Files changed (4)
  1. README.md +29 -15
  2. eval_boltz.py +110 -180
  3. modal_boltz_app.py +270 -0
  4. requirements.txt +4 -9
README.md CHANGED
@@ -46,26 +46,40 @@ Submission processing runs in 4 admin-controlled phases:
46
  | Phase | Step | Status | Notes |
47
  |---|---|---|---|
48
  | **A** | Dispatch tasks → CPU scoring | live | HTTP POST to submitter endpoint, validate, score 5/6 components |
49
- | **B** | Boltz-2 structure verification | code-ready | Needs ZeroGPU hardware + uncommented `torch`/`boltz` deps |
50
  | **C** | LLM judge panel (28-pt hybrid) | live | 3-judge PoLL with self-exclusion, requires API key secrets |
51
  | **D** | Finalize + publish to leaderboard | live | Aggregates hybrid scores, writes back to submissions dataset |
52
 
53
- ### Phase B activation checklist
54
 
55
- To wire up Boltz-2 verification on this Space:
 
56
 
57
- 1. **Switch hardware** in HF Space settings → Hardware → `zero-a10g`
58
- (requires HF Pro / Enterprise).
59
- 2. **Edit `requirements.txt`** and uncomment the two lines:
60
- ```
61
- torch>=2.2
62
- boltz>=0.4
63
- ```
64
- 3. **Verify secrets** are set: `HF_TOKEN` (private dataset),
65
- `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, `GOOGLE_API_KEY`,
66
- `DEEPSEEK_API_KEY`.
67
- 4. Restart the Space. The first build will pull ~2GB of CUDA wheels.
68
 
69
- On `cpu-basic` hardware the Phase B predictors return a structured
70
  failure dict with `success=False` and an actionable error message
71
  instead of crashing the dispatcher.
 
46
  | Phase | Step | Status | Notes |
47
  |---|---|---|---|
48
  | **A** | Dispatch tasks → CPU scoring | live | HTTP POST to submitter endpoint, validate, score 5/6 components |
49
+ | **B** | Boltz-2 structure verification | live (Modal) | Modal-hosted A10G companion app provisions GPU on demand |
50
  | **C** | LLM judge panel (28-pt hybrid) | live | 3-judge PoLL with self-exclusion, requires API key secrets |
51
  | **D** | Finalize + publish to leaderboard | live | Aggregates hybrid scores, writes back to submissions dataset |
52
 
53
+ ### Phase B architecture (Modal companion app)
54
 
55
+ The HF Space runs on `cpu-basic` and cannot host Boltz directly, so
56
+ Phase B uses a Modal-deployed sidecar (`modal_boltz_app.py`) that:
57
 
58
+ - pre-builds an image with `boltz==2.2.1`, `torch==2.10`, NVIDIA
59
+ cuequivariance kernels, and FastAPI;
60
+ - exposes a single web endpoint at
61
+ `https://<workspace>--bdb-boltz-predict.modal.run`;
62
+ - spins up an A10G on demand, runs `boltz predict` (via the same CLI
63
+ the dev pipeline uses), and returns confidence metrics;
64
+ - auto-stops after 5 minutes idle so the lab is only billed for active
65
+ inference time (~$0.06 per task at A10G rates).
 
 
 
66
 
67
+ The HF Space is just an HTTP client (`eval_boltz.py`); design sequences
68
+ are POSTed to the Modal endpoint with a shared token in the JSON body. To
69
+ deploy the sidecar (one time):
70
+
71
+ ```bash
72
+ cd biodesignbench-leaderboard
73
+ modal deploy modal_boltz_app.py
74
+ ```
75
+
76
+ Then set these HF Space secrets:
77
+
78
+ ```
79
+ MODAL_BOLTZ_URL https://<workspace>--bdb-boltz-predict.modal.run
80
+ MODAL_BOLTZ_TOKEN matches the modal secret `bdb-boltz-shared` TOKEN
81
+ ```
82
+
83
+ If `MODAL_BOLTZ_URL` is unset, Phase B predictors return a structured
84
  failure dict with `success=False` and an actionable error message
85
  instead of crashing the dispatcher.
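
How a caller might consume these failure dicts can be sketched as below. The `summarize_phase_b` helper is hypothetical and not part of this commit; it assumes only the `success` and `error` keys described above.

```python
from __future__ import annotations

from typing import Any


def summarize_phase_b(results: dict[str, dict[str, Any]]) -> dict[str, Any]:
    """Count verified tasks and collect distinct error messages (hypothetical helper)."""
    verified = [task for task, r in results.items() if r.get("success")]
    errors = sorted(
        {
            str(r.get("error", "unknown error"))
            for r in results.values()
            if not r.get("success")
        }
    )
    return {
        "verified": len(verified),
        "failed": len(results) - len(verified),
        "errors": errors,
    }


# Made-up task ids and metric values, for illustration only.
example = {
    "task_a": {"pLDDT": 91.2, "pTM": 0.88, "success": True},
    "task_b": {
        "pLDDT": 0.0,
        "pTM": 0.0,
        "success": False,
        "error": "Modal Boltz endpoint not configured.",
    },
}
summary = summarize_phase_b(example)
```

Deduplicating the messages keeps an unset `MODAL_BOLTZ_URL` from producing one identical error per task in an admin view.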
eval_boltz.py CHANGED
@@ -1,221 +1,168 @@
1
- """Boltz structure prediction for post-assessment scoring.
2
 
3
- Uses @spaces.GPU decorator for ZeroGPU on HuggingFace Spaces.
 
 
 
 
4
 
5
- Two prediction modes:
6
- - Monomer: Non-binding tasks -> pLDDT, pTM
7
- - Complex: Binding tasks (binder + target) -> ipTM, i_pAE
8
 
9
- Batch chunking respects ZeroGPU time limits (~180-240s per burst).
 
 
10
 
11
- Phase B activation checklist (must all be true to actually run Boltz):
12
- 1. HF Space hardware switched to a GPU tier (zero-a10g recommended).
13
- 2. requirements.txt has `torch` and `boltz` uncommented.
14
- 3. HF_TOKEN secret set on the Space (for the private hidden-tasks dataset).
15
- On a cpu-basic Space the predictors return a structured failure dict
16
- with `success=False` and an actionable error message rather than
17
- crashing the dispatcher.
18
  """
19
 
20
  from __future__ import annotations
21
 
22
  import logging
23
- import time
24
  from typing import Any
25
 
26
  logger = logging.getLogger(__name__)
27
 
28
- # Chunking limits for ZeroGPU (free tier: ~300s max per burst)
29
- MONOMER_CHUNK_SIZE = 5 # ~30-60s per monomer
30
- COMPLEX_CHUNK_SIZE = 2 # ~60-120s per complex
31
- MAX_GPU_TIME = 240 # safety margin under 300s ZeroGPU limit
 
32
 
33
 
34
- # ---------------------------------------------------------------------------
35
- # Boltz prediction (GPU-accelerated)
36
- # ---------------------------------------------------------------------------
37
-
38
-
39
- _BOLTZ_NOT_INSTALLED = (
40
- "Boltz / torch not available on this Space. To enable Phase B, "
41
- "switch the Space hardware to ZeroGPU (zero-a10g) and uncomment the "
42
- "torch + boltz lines in requirements.txt."
43
  )
44
 
45
 
46
- def _predict_monomer(sequence: str) -> dict[str, float]:
47
- """Predict structure of a single protein sequence using Boltz.
48
 
49
- Returns:
50
- Dict with: pLDDT, pTM (or a structured failure dict).
51
- """
52
- try:
53
- import torch # noqa: F401
54
- from boltz import Boltz
55
- except ImportError:
56
- logger.warning(_BOLTZ_NOT_INSTALLED)
57
- return {
58
- "pLDDT": 0.0, "pTM": 0.0,
59
- "success": False, "error": _BOLTZ_NOT_INSTALLED,
60
- }
61
- try:
62
- model = Boltz.from_pretrained("boltz2")
63
- result = model.predict(sequence)
64
 
65
- plddt = float(result.confidence.plddt.mean())
66
- ptm = float(result.confidence.ptm)
67
 
68
- return {
69
- "pLDDT": round(plddt, 2),
70
- "pTM": round(ptm, 4),
71
- "success": True,
72
- }
73
- except Exception as e:
74
- logger.error(f"Boltz monomer prediction failed: {e}")
75
- return {"pLDDT": 0.0, "pTM": 0.0, "success": False, "error": str(e)}
76
 
77
 
78
- def _predict_complex(
79
- binder_seq: str,
80
- target_seq: str,
81
- ) -> dict[str, float]:
82
- """Predict complex structure and binding metrics using Boltz.
83
 
84
- Returns:
85
- Dict with: ipTM, i_pAE, pLDDT, pTM (or a structured failure dict).
86
  """
 
 
 
 
87
  try:
88
- import torch # noqa: F401
89
- from boltz import Boltz
90
  except ImportError:
91
- logger.warning(_BOLTZ_NOT_INSTALLED)
92
  return {
93
- "pLDDT": 0.0, "pTM": 0.0, "ipTM": 0.0, "i_pAE": 0.0,
94
- "success": False, "error": _BOLTZ_NOT_INSTALLED,
95
  }
96
- try:
97
- model = Boltz.from_pretrained("boltz2")
98
- result = model.predict([binder_seq, target_seq])
99
 
100
- plddt = float(result.confidence.plddt.mean())
101
- ptm = float(result.confidence.ptm)
102
- iptm = float(result.confidence.iptm) if hasattr(result.confidence, "iptm") else 0.0
103
- ipae = float(result.confidence.ipae) if hasattr(result.confidence, "ipae") else 0.0
104
 
105
- return {
106
- "pLDDT": round(plddt, 2),
107
- "pTM": round(ptm, 4),
108
- "ipTM": round(iptm, 4),
109
- "i_pAE": round(ipae, 2),
110
- "success": True,
111
- }
112
  except Exception as e:
113
- logger.error(f"Boltz complex prediction failed: {e}")
 
 
114
  return {
115
- "pLDDT": 0.0, "pTM": 0.0, "ipTM": 0.0, "i_pAE": 0.0,
116
- "success": False, "error": str(e),
117
  }
118
 
119
-
120
- # ---------------------------------------------------------------------------
121
- # GPU-decorated entry points (for HF Spaces with ZeroGPU)
122
- # ---------------------------------------------------------------------------
123
-
124
- try:
125
- import spaces
126
-
127
- @spaces.GPU(duration=MAX_GPU_TIME)
128
- def predict_monomer_batch(sequences: list[str]) -> list[dict[str, float]]:
129
- """Predict structures for a batch of monomer sequences.
130
-
131
- Decorated with @spaces.GPU for ZeroGPU allocation.
132
-
133
- Args:
134
- sequences: List of amino acid sequences (max MONOMER_CHUNK_SIZE).
135
-
136
- Returns:
137
- List of prediction result dicts with pLDDT, pTM.
138
- """
139
- results = []
140
- for seq in sequences[:MONOMER_CHUNK_SIZE]:
141
- results.append(_predict_monomer(seq))
142
- return results
143
-
144
- @spaces.GPU(duration=MAX_GPU_TIME)
145
- def predict_complex_batch(
146
- pairs: list[tuple[str, str]],
147
- ) -> list[dict[str, float]]:
148
- """Predict structures for a batch of binder-target pairs.
149
-
150
- Args:
151
- pairs: List of (binder_seq, target_seq) tuples.
152
-
153
- Returns:
154
- List of prediction result dicts with ipTM, i_pAE, pLDDT, pTM.
155
- """
156
- results = []
157
- for binder, target in pairs[:COMPLEX_CHUNK_SIZE]:
158
- results.append(_predict_complex(binder, target))
159
- return results
160
-
161
- except ImportError:
162
- # Not running on HF Spaces -- provide un-decorated versions
163
- def predict_monomer_batch(sequences: list[str]) -> list[dict[str, float]]:
164
- return [_predict_monomer(seq) for seq in sequences[:MONOMER_CHUNK_SIZE]]
165
-
166
- def predict_complex_batch(
167
- pairs: list[tuple[str, str]],
168
- ) -> list[dict[str, float]]:
169
- return [_predict_complex(b, t) for b, t in pairs[:COMPLEX_CHUNK_SIZE]]
170
-
171
-
172
- # ---------------------------------------------------------------------------
173
- # High-level assessment API
174
- # ---------------------------------------------------------------------------
175
 
176
 
177
  def run_boltz_posteval(
178
  per_task_results: dict[str, dict[str, Any]],
179
  progress_callback=None,
180
  ) -> dict[str, dict[str, Any]]:
181
- """Run Boltz post-assessment on all tasks that need it.
182
 
183
- For each task:
184
- - Non-binding: pick best design -> monomer prediction
185
- - Binding: pick best design + target sequence -> complex prediction
186
  - Merge Boltz metrics into existing results
187
- - Re-score quality component
188
-
189
- Args:
190
- per_task_results: Dict of task_id -> dispatch result (from dispatcher).
191
- progress_callback: Optional callback(task_id, i, total, metrics).
192
-
193
- Returns:
194
- Updated per_task_results with Boltz metrics and final quality scores.
195
  """
196
- from eval_scorer import _is_binding_task, score_quality
197
 
198
- # Separate tasks into monomer and complex batches
199
- monomer_tasks = []
200
- complex_tasks = []
201
 
202
  for task_id, result in per_task_results.items():
203
  if not result.get("success") or not result.get("quality_pending"):
204
  continue
205
-
206
  sequences = result.get("sequences", [])
207
  if not sequences:
208
  continue
209
-
210
- best_seq = sequences[0] # Use first design for Boltz
211
 
212
  if _is_binding_task(task_id):
213
- # Need target sequence from ground truth
214
- target_seq = result.get("ground_truth_thresholds", {}).get("target_sequence")
 
215
  if target_seq:
216
  complex_tasks.append((task_id, best_seq, target_seq))
217
  else:
218
- # Fall back to monomer if no target
219
  monomer_tasks.append((task_id, best_seq))
220
  else:
221
  monomer_tasks.append((task_id, best_seq))
@@ -223,32 +170,24 @@ def run_boltz_posteval(
223
  total = len(monomer_tasks) + len(complex_tasks)
224
  done = 0
225
 
226
- # Process monomer tasks in chunks
227
  for chunk_start in range(0, len(monomer_tasks), MONOMER_CHUNK_SIZE):
228
  chunk = monomer_tasks[chunk_start:chunk_start + MONOMER_CHUNK_SIZE]
229
  seqs = [seq for _, seq in chunk]
230
-
231
  boltz_results = predict_monomer_batch(seqs)
232
-
233
  for (task_id, _), metrics in zip(chunk, boltz_results):
234
  if metrics.get("success"):
235
  _merge_boltz_metrics(per_task_results[task_id], metrics)
236
-
237
  done += 1
238
  if progress_callback:
239
  progress_callback(task_id, done, total, metrics)
240
 
241
- # Process complex tasks in chunks
242
  for chunk_start in range(0, len(complex_tasks), COMPLEX_CHUNK_SIZE):
243
  chunk = complex_tasks[chunk_start:chunk_start + COMPLEX_CHUNK_SIZE]
244
  pairs = [(binder, target) for _, binder, target in chunk]
245
-
246
  boltz_results = predict_complex_batch(pairs)
247
-
248
  for (task_id, _, _), metrics in zip(chunk, boltz_results):
249
  if metrics.get("success"):
250
  _merge_boltz_metrics(per_task_results[task_id], metrics)
251
-
252
  done += 1
253
  if progress_callback:
254
  progress_callback(task_id, done, total, metrics)
@@ -258,21 +197,16 @@ def run_boltz_posteval(
258
 
259
  def _merge_boltz_metrics(
260
  task_result: dict[str, Any],
261
- boltz_metrics: dict[str, float],
262
  ) -> None:
263
- """Merge Boltz prediction metrics into a task result and re-score quality.
264
-
265
- Modifies task_result in-place.
266
- """
267
  from eval_scorer import apply_design_gate, score_quality
268
 
269
- # Merge Boltz metrics with any agent-reported metrics
270
  merged_metrics = task_result.get("agent_metrics", {}).copy()
271
  for key in ("pLDDT", "pTM", "ipTM", "i_pAE"):
272
  if key in boltz_metrics and boltz_metrics[key] > 0:
273
  merged_metrics[key] = boltz_metrics[key]
274
 
275
- # Re-score quality with Boltz metrics
276
  quality_result = score_quality(
277
  agent_metrics=merged_metrics,
278
  thresholds=task_result.get("ground_truth_thresholds", {}),
@@ -281,15 +215,11 @@ def _merge_boltz_metrics(
281
  oracle_sequences=task_result.get("oracle_sequences"),
282
  )
283
 
284
- # Update scores
285
  task_result["boltz_metrics"] = boltz_metrics
286
  task_result["quality_pending"] = False
287
 
288
  if "cpu_scores" in task_result:
289
  task_result["cpu_scores"]["quality"] = quality_result["score"]
290
-
291
- # Compute final gated score
292
- if "cpu_scores" in task_result:
293
  component_scores = dict(task_result["cpu_scores"])
294
  gated = apply_design_gate(component_scores, task_result.get("num_designs", 0))
295
  task_result["final_scores"] = gated
 
1
+ """Boltz-2 structure verification client (Phase B).
2
 
3
+ The HF Space leaderboard runs on cpu-basic, so it cannot host Boltz
4
+ directly. This module is a thin HTTP client that POSTs design sequences
5
+ to a Modal-deployed companion app (`modal_boltz_app.py`), which
6
+ provisions an A10G on demand, runs `boltz predict`, and returns
7
+ confidence metrics.
8
 
9
+ Two prediction modes (selected automatically by `run_boltz_posteval`):
10
+ - Monomer (non-binding tasks) -> pLDDT, pTM
11
+ - Complex (binding tasks) -> pLDDT, pTM, ipTM, i_pAE
12
 
13
+ Required HF Space secrets (set out-of-band via the leaderboard admin):
14
+ MODAL_BOLTZ_URL https://<workspace>--bdb-boltz-predict.modal.run
15
+ MODAL_BOLTZ_TOKEN shared bearer token matching the modal secret TOKEN
16
 
17
+ If `MODAL_BOLTZ_URL` is unset the predictors return a structured
18
+ failure dict with `success=False` and an actionable error message
19
+ rather than crashing the dispatcher.
 
 
 
 
20
  """
21
 
22
  from __future__ import annotations
23
 
24
  import logging
25
+ import os
26
  from typing import Any
27
 
28
  logger = logging.getLogger(__name__)
29
 
30
+ # Batch sizes large enough to amortize Modal cold-start, small enough
31
+ # to stay under the 1700s subprocess timeout (Modal function timeout: 1800s).
32
+ MONOMER_CHUNK_SIZE = 20
33
+ COMPLEX_CHUNK_SIZE = 10
34
+ HTTP_TIMEOUT_SEC = 1700
35
 
36
 
37
+ _NOT_CONFIGURED = (
38
+ "Modal Boltz endpoint not configured. Set MODAL_BOLTZ_URL (and "
39
+ "MODAL_BOLTZ_TOKEN) on the HF Space, or deploy the companion app "
40
+ "with `modal deploy modal_boltz_app.py`."
 
 
 
 
 
41
  )
42
 
43
 
44
+ def _modal_url() -> str | None:
45
+ return os.environ.get("MODAL_BOLTZ_URL", "").strip() or None
46
 
47
 
48
+ def _modal_token() -> str:
49
+ return os.environ.get("MODAL_BOLTZ_TOKEN", "").strip()
50
 
51
+
52
+ def _failure(error: str, complex_keys: bool = False) -> dict[str, Any]:
53
+ out = {"pLDDT": 0.0, "pTM": 0.0, "success": False, "error": error}
54
+ if complex_keys:
55
+ out.update({"ipTM": 0.0, "i_pAE": 0.0})
56
+ return out
 
 
57
 
58
 
59
+ def _post_predictions(items: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
60
+ """POST a list of prediction items to the Modal endpoint.
 
 
 
61
 
62
+ Returns a dict mapping each item's `name` to a metric dict, with
63
+ structured failure entries on error.
64
  """
65
+ url = _modal_url()
66
+ if not url:
67
+ return {item["name"]: _failure(_NOT_CONFIGURED) for item in items}
68
+
69
  try:
70
+ import httpx
 
71
  except ImportError:
 
72
  return {
73
+ item["name"]: _failure("httpx not installed in leaderboard image")
74
+ for item in items
75
  }
 
 
 
76
 
77
+ headers = {"Content-Type": "application/json"}
78
+ payload = {"token": _modal_token(), "items": items}
 
 
79
 
80
+ try:
81
+ resp = httpx.post(
82
+ url, json=payload, headers=headers, timeout=HTTP_TIMEOUT_SEC,
83
+ )
 
 
 
84
  except Exception as e:
85
+ return {item["name"]: _failure(f"Modal POST failed: {e}") for item in items}
86
+
87
+ if resp.status_code != 200:
88
  return {
89
+ item["name"]: _failure(f"Modal HTTP {resp.status_code}: {resp.text[:200]}")
90
+ for item in items
91
  }
92
 
93
+ try:
94
+ body = resp.json()
95
+ except Exception as e:
96
+ return {item["name"]: _failure(f"Modal returned non-JSON: {e}") for item in items}
97
+
98
+ if "error" in body:
99
+ msg = body["error"]
100
+ return {item["name"]: _failure(f"Modal: {msg}") for item in items}
101
+
102
+ results = body.get("results", {})
103
+ out: dict[str, dict[str, Any]] = {}
104
+ for item in items:
105
+ name = item["name"]
106
+ out[name] = results.get(name) or _failure(
107
+ "Modal returned no result for this item"
108
+ )
109
+ return out
110
+
111
+
112
+ def predict_monomer_batch(sequences: list[str]) -> list[dict[str, float]]:
113
+ """Predict structures for a batch of monomer sequences."""
114
+ items = [
115
+ {"name": f"mono_{i}", "kind": "monomer", "sequences": [seq]}
116
+ for i, seq in enumerate(sequences[:MONOMER_CHUNK_SIZE])
117
+ ]
118
+ by_name = _post_predictions(items)
119
+ return [by_name[item["name"]] for item in items]
120
+
121
+
122
+ def predict_complex_batch(
123
+ pairs: list[tuple[str, str]],
124
+ ) -> list[dict[str, float]]:
125
+ """Predict structures for a batch of (binder, target) pairs."""
126
+ items = [
127
+ {"name": f"cmplx_{i}", "kind": "complex", "sequences": [b, t]}
128
+ for i, (b, t) in enumerate(pairs[:COMPLEX_CHUNK_SIZE])
129
+ ]
130
+ by_name = _post_predictions(items)
131
+ return [by_name[item["name"]] for item in items]
132
 
133
 
134
  def run_boltz_posteval(
135
  per_task_results: dict[str, dict[str, Any]],
136
  progress_callback=None,
137
  ) -> dict[str, dict[str, Any]]:
138
+ """Run Boltz post-assessment on every task that needs it.
139
 
140
+ For each successful task:
141
+ - Non-binding: pick the first design -> monomer prediction
142
+ - Binding: pick the first design + target sequence -> complex prediction
143
  - Merge Boltz metrics into existing results
144
+ - Re-score the quality component
 
 
 
 
 
 
 
145
  """
146
+ from eval_scorer import _is_binding_task
147
 
148
+ monomer_tasks: list[tuple[str, str]] = []
149
+ complex_tasks: list[tuple[str, str, str]] = []
 
150
 
151
  for task_id, result in per_task_results.items():
152
  if not result.get("success") or not result.get("quality_pending"):
153
  continue
 
154
  sequences = result.get("sequences", [])
155
  if not sequences:
156
  continue
157
+ best_seq = sequences[0]
 
158
 
159
  if _is_binding_task(task_id):
160
+ target_seq = (
161
+ result.get("ground_truth_thresholds", {}).get("target_sequence")
162
+ )
163
  if target_seq:
164
  complex_tasks.append((task_id, best_seq, target_seq))
165
  else:
 
166
  monomer_tasks.append((task_id, best_seq))
167
  else:
168
  monomer_tasks.append((task_id, best_seq))
 
170
  total = len(monomer_tasks) + len(complex_tasks)
171
  done = 0
172
 
 
173
  for chunk_start in range(0, len(monomer_tasks), MONOMER_CHUNK_SIZE):
174
  chunk = monomer_tasks[chunk_start:chunk_start + MONOMER_CHUNK_SIZE]
175
  seqs = [seq for _, seq in chunk]
 
176
  boltz_results = predict_monomer_batch(seqs)
 
177
  for (task_id, _), metrics in zip(chunk, boltz_results):
178
  if metrics.get("success"):
179
  _merge_boltz_metrics(per_task_results[task_id], metrics)
 
180
  done += 1
181
  if progress_callback:
182
  progress_callback(task_id, done, total, metrics)
183
 
 
184
  for chunk_start in range(0, len(complex_tasks), COMPLEX_CHUNK_SIZE):
185
  chunk = complex_tasks[chunk_start:chunk_start + COMPLEX_CHUNK_SIZE]
186
  pairs = [(binder, target) for _, binder, target in chunk]
 
187
  boltz_results = predict_complex_batch(pairs)
 
188
  for (task_id, _, _), metrics in zip(chunk, boltz_results):
189
  if metrics.get("success"):
190
  _merge_boltz_metrics(per_task_results[task_id], metrics)
 
191
  done += 1
192
  if progress_callback:
193
  progress_callback(task_id, done, total, metrics)
 
197
 
198
  def _merge_boltz_metrics(
199
  task_result: dict[str, Any],
200
+ boltz_metrics: dict[str, Any],
201
  ) -> None:
202
+ """Merge Boltz prediction metrics into a task result and re-score quality."""
 
 
 
203
  from eval_scorer import apply_design_gate, score_quality
204
 
 
205
  merged_metrics = task_result.get("agent_metrics", {}).copy()
206
  for key in ("pLDDT", "pTM", "ipTM", "i_pAE"):
207
  if key in boltz_metrics and boltz_metrics[key] > 0:
208
  merged_metrics[key] = boltz_metrics[key]
209
 
 
210
  quality_result = score_quality(
211
  agent_metrics=merged_metrics,
212
  thresholds=task_result.get("ground_truth_thresholds", {}),
 
215
  oracle_sequences=task_result.get("oracle_sequences"),
216
  )
217
 
 
218
  task_result["boltz_metrics"] = boltz_metrics
219
  task_result["quality_pending"] = False
220
 
221
  if "cpu_scores" in task_result:
222
  task_result["cpu_scores"]["quality"] = quality_result["score"]
 
 
 
223
  component_scores = dict(task_result["cpu_scores"])
224
  gated = apply_design_gate(component_scores, task_result.get("num_designs", 0))
225
  task_result["final_scores"] = gated
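
The merge rule in `_merge_boltz_metrics` can be isolated as a small sketch. It assumes nothing beyond the four metric keys shown above; `merge_metrics` is a hypothetical standalone helper that leaves out the `eval_scorer` re-scoring step.

```python
from __future__ import annotations

from typing import Any

BOLTZ_KEYS = ("pLDDT", "pTM", "ipTM", "i_pAE")


def merge_metrics(
    agent_metrics: dict[str, Any],
    boltz_metrics: dict[str, Any],
) -> dict[str, Any]:
    """Overlay positive Boltz metrics on agent-reported ones (hypothetical helper)."""
    merged = dict(agent_metrics)  # copy so the caller's dict is not mutated
    for key in BOLTZ_KEYS:
        value = boltz_metrics.get(key)
        # Zero doubles as "not predicted" (e.g. ipTM for a monomer), so skip it.
        if isinstance(value, (int, float)) and value > 0:
            merged[key] = value
    return merged


# Illustrative values only, not measured results.
merged = merge_metrics(
    {"pLDDT": 80.0, "ipTM": 0.6, "seq_identity": 0.4},
    {"pLDDT": 92.5, "pTM": 0.9, "ipTM": 0.0, "i_pAE": 0.0, "success": True},
)
```

One consequence of the `> 0` test is that a genuine zero from Boltz cannot be told apart from a missing value, so the agent-reported figure is kept in that case.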
modal_boltz_app.py ADDED
@@ -0,0 +1,270 @@
1
+ """Modal app: Boltz-2 structure prediction for BioDesignBench Phase B.
2
+
3
+ This is the GPU-side companion to `eval_boltz.py`. The HF Space leaderboard
4
+ runs on cpu-basic, so it cannot host Boltz directly; instead it POSTs design
5
+ sequences to this Modal app, which spins up an A10G on demand, runs
6
+ `boltz predict`, and returns confidence metrics.
7
+
8
+ Setup (one-time, on a machine with `pip install modal`):
9
+
10
+ modal token new # if you don't have a token yet
11
+ cd biodesignbench-leaderboard
12
+ modal deploy modal_boltz_app.py
13
+
14
+ After deploy Modal prints a URL like
15
+ https://<workspace>--bdb-boltz-predict.modal.run
16
+
17
+ Add that URL plus a shared secret to the HF Space secrets:
18
+ MODAL_BOLTZ_URL = https://<workspace>--bdb-boltz-predict.modal.run
19
+ MODAL_BOLTZ_TOKEN = <random 32-byte hex>
20
+
21
+ Cost: A10G is billed per second, and the container auto-stops after
22
+ `scaledown_window` seconds (300 below). With one submission per month and
23
+ ~76 tasks * ~30s = ~38min GPU per submission, expected spend is
24
+ well within Modal's free tier.
25
+ """
26
+
27
+ from __future__ import annotations
28
+
29
+ import os
30
+
31
+ import modal
32
+
33
+ APP_NAME = "bdb-boltz"
34
+ ENDPOINT_LABEL = "bdb-boltz-predict"
35
+
36
+ app = modal.App(APP_NAME)
37
+
38
+ # Persistent volume for Boltz-2 model weights (~6GB, downloaded on first call)
39
+ weights_volume = modal.Volume.from_name(
40
+ "bdb-boltz-weights", create_if_missing=True
41
+ )
42
+
43
+ # Boltz GPU image. Boltz-2 is published on PyPI as `boltz` and pulls a
44
+ # CUDA-12 torch wheel automatically.
45
+ gpu_image = (
46
+ modal.Image.from_registry(
47
+ "nvidia/cuda:12.4.1-cudnn-runtime-ubuntu22.04",
48
+ add_python="3.11",
49
+ )
50
+ .apt_install("git", "wget", "build-essential")
51
+ # Boltz-2 (>=2.2) uses NVIDIA cuequivariance for the triangular-multiply
52
+ # kernel and requires CUDA 12.5+. torch is pinned below to a version that
53
+ # satisfies cuequivariance's nvidia-cublas-cu12>=12.5 constraint.
54
+ .pip_install(
55
+ # Match dev's known-working stack: torch 2.10 ships nvidia-cublas-cu12
56
+ # 12.8 which satisfies cuequivariance>=12.5 requirement.
57
+ "torch==2.10.0",
58
+ "boltz==2.2.1",
59
+ "cuequivariance==0.9.0",
60
+ "cuequivariance-torch==0.9.0",
61
+ "cuequivariance-ops-cu12==0.9.0",
62
+ "cuequivariance-ops-torch-cu12==0.9.0",
63
+ "fastapi[standard]",
64
+ "pyyaml",
65
+ "numpy",
66
+ )
67
+ .env(
68
+ {
69
+ "BOLTZ_CACHE": "/weights",
70
+ "TORCH_HOME": "/weights/torch",
71
+ "HF_HOME": "/weights/hf",
72
+ }
73
+ )
74
+ )
75
+
76
+
77
+ # ---------------------------------------------------------------------------
78
+ # Internal: write YAMLs, run boltz predict, parse outputs
79
+ # ---------------------------------------------------------------------------
80
+
81
+
82
+ def _write_yaml(item: dict) -> str:
83
+ """Render one prediction item to a Boltz YAML string.
84
+
85
+ item shape:
86
+ {"name": "task_001",
87
+ "kind": "monomer" | "complex",
88
+ "sequences": ["MKKL...", ...]} # 1 for monomer, 2 for complex
89
+ """
90
+ seqs = item.get("sequences") or []
91
+ chain_ids = ["A", "B", "C", "D", "E"]
92
+ lines = ["sequences:"]
93
+ for i, seq in enumerate(seqs):
94
+ cid = chain_ids[i] if i < len(chain_ids) else f"X{i}"
95
+ lines.append(" - protein:")
96
+ lines.append(f" id: {cid}")
97
+ lines.append(f" sequence: {seq}")
98
+ return "\n".join(lines) + "\n"
99
+
100
+
101
+ def _parse_confidence(pred_dir) -> dict:
102
+ """Parse a Boltz prediction directory into a flat metric dict."""
103
+ import json
104
+ from pathlib import Path
105
+
106
+ import numpy as np
107
+
108
+ out = {
109
+ "pLDDT": 0.0, "pTM": 0.0, "ipTM": 0.0, "i_pAE": 0.0,
110
+ "success": False,
111
+ }
112
+ pred_dir = Path(pred_dir)
113
+
114
+ conf_files = list(pred_dir.rglob("confidence*.json"))
115
+ if conf_files:
116
+ try:
117
+ with open(conf_files[0]) as f:
118
+ c = json.load(f)
119
+ out["pLDDT"] = round(float(c.get("complex_plddt", 0.0)) * 100, 2)
120
+ out["pTM"] = round(float(c.get("ptm", 0.0)), 4)
121
+ out["ipTM"] = round(float(c.get("iptm", 0.0)), 4)
122
+ out["i_pAE"] = round(float(c.get("complex_ipae", 0.0)), 2)
123
+ out["success"] = True
124
+ except Exception:
125
+ pass
126
+
127
+ if not out["success"]:
128
+ # Fall back to per-residue plddt npz if confidence.json is missing
129
+ plddt_files = list(pred_dir.rglob("plddt*.npz"))
130
+ if plddt_files:
131
+ try:
132
+ arr = np.load(plddt_files[0])["plddt"]
133
+ out["pLDDT"] = round(float(arr.mean()) * 100, 2)
134
+ out["success"] = True
135
+ except Exception:
136
+ pass
137
+
138
+ return out
139
+
140
+
141
+ # ---------------------------------------------------------------------------
142
+ # GPU entry point — single web endpoint handling both monomer and complex
143
+ # ---------------------------------------------------------------------------
144
+
145
+
146
+ @app.function(
147
+ image=gpu_image,
148
+ gpu="A10G",
149
+ volumes={"/weights": weights_volume},
150
+ timeout=1800,
151
+ scaledown_window=300,
152
+ secrets=[modal.Secret.from_name("bdb-boltz-shared", required_keys=["TOKEN"])],
153
+ )
154
+ @modal.fastapi_endpoint(method="POST", label=ENDPOINT_LABEL)
155
+ def predict(payload: dict) -> dict:
156
+ """Run Boltz-2 on a list of prediction items.
157
+
158
+ Body shape:
159
+ {"token": "<shared secret>",
160
+ "items": [{"name": "...", "kind": "monomer"|"complex",
161
+ "sequences": [...]}, ...]}
162
+
163
+ The list is assembled into a single ``boltz predict`` invocation so
164
+ the model loads only once per call (amortizes ~30s cold start).
165
+
166
+ Returns a dict mapping each item's `name` to a metric dict:
167
+ {"pLDDT", "pTM", "ipTM", "i_pAE", "success"}
168
+ """
169
+ import shutil
170
+ import subprocess
171
+ import tempfile
172
+ from pathlib import Path
173
+
174
+ expected_token = os.environ.get("TOKEN", "")
175
+ if expected_token and (payload.get("token") or "") != expected_token:
176
+ return {"error": "Unauthorized -- bad MODAL_BOLTZ_TOKEN"}
177
+
178
+ items = payload.get("items") or []
179
+ if not items:
180
+ return {"results": {}}
181
+
182
+ work = Path(tempfile.mkdtemp(prefix="bdb_boltz_"))
183
+ in_dir = work / "inputs"
184
+ out_dir = work / "out"
185
+ in_dir.mkdir()
186
+ out_dir.mkdir()
187
+
188
+ name_to_yaml: dict[str, str] = {}
189
+ for i, item in enumerate(items):
190
+ name = str(item.get("name") or f"item_{i:04d}")
191
+ safe = "".join(c if c.isalnum() else "_" for c in name)[:60]
192
+ yaml_name = f"{i:04d}_{safe}"
193
+ (in_dir / f"{yaml_name}.yaml").write_text(_write_yaml(item))
194
+ name_to_yaml[name] = yaml_name
195
+
196
+ cmd = [
197
+ "boltz", "predict",
198
+ str(in_dir),
199
+ "--out_dir", str(out_dir),
200
+ "--cache", "/weights/boltz_cache",
201
+ "--diffusion_samples", "1",
202
+ "--output_format", "pdb",
203
+ "--use_msa_server",
204
+ ]
205
+
206
+ proc = subprocess.run(
207
+ cmd, capture_output=True, text=True, timeout=1700, cwd=str(work),
208
+ )
209
+
210
+ # Persist downloaded model weights to the shared volume
211
+ try:
212
+ weights_volume.commit()
213
+ except Exception:
214
+ pass
215
+
216
+ if proc.returncode != 0:
217
+ shutil.rmtree(str(work), ignore_errors=True)
218
+ return {
219
+ "error": "boltz predict failed",
220
+ "stderr": proc.stderr[-2000:],
221
+ "stdout": proc.stdout[-2000:],
222
+ }
223
+
224
+ # boltz writes outputs to out/boltz_results_inputs/predictions/<name>/
225
+ predictions_root = None
226
+ for p in out_dir.rglob("predictions"):
227
+ if p.is_dir():
228
+ predictions_root = p
229
+ break
230
+
231
+ results: dict[str, dict] = {}
232
+ if predictions_root is not None:
233
+ for name, yaml_name in name_to_yaml.items():
234
+ pred_dirs = [
235
+ d for d in predictions_root.iterdir()
236
+ if d.is_dir() and (d.name.startswith(yaml_name) or d.name == yaml_name)
237
+ ]
238
+ if pred_dirs:
239
+ results[name] = _parse_confidence(pred_dirs[0])
240
+ else:
241
+ results[name] = {
242
+ "pLDDT": 0.0, "pTM": 0.0, "ipTM": 0.0, "i_pAE": 0.0,
243
+ "success": False, "error": "prediction missing",
244
+ }
245
+
246
+ shutil.rmtree(str(work), ignore_errors=True)
247
+ return {"results": results}
248
+
249
+
250
+ # ---------------------------------------------------------------------------
251
+ # CLI smoke test: modal run modal_boltz_app.py
252
+ # ---------------------------------------------------------------------------
253
+
254
+
255
+ @app.local_entrypoint()
256
+ def main():
257
+ """Quick sanity check — a short ubiquitin-like sequence."""
258
+ import json
259
+
260
+ items = [
261
+ {
262
+ "name": "monomer_demo",
263
+ "kind": "monomer",
264
+ "sequences": [
265
+ "MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG"
266
+ ],
267
+ },
268
+ ]
269
+ out = predict.remote({"token": os.environ.get("MODAL_BOLTZ_TOKEN", ""), "items": items})
270
+ print(json.dumps(out, indent=2))
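
The mapping that `_parse_confidence` applies to a Boltz confidence file can be sketched in isolation. The key names (`complex_plddt`, `ptm`, `iptm`, `complex_ipae`) are taken from the code above; `flatten_confidence` is a hypothetical helper, and the input values are illustrative rather than real Boltz output.

```python
from __future__ import annotations

from typing import Any


def flatten_confidence(conf: dict[str, Any]) -> dict[str, Any]:
    """Map a Boltz confidence dict to the flat metric dict (hypothetical helper)."""
    return {
        # Boltz reports pLDDT on a 0-1 scale; the leaderboard uses 0-100.
        "pLDDT": round(float(conf.get("complex_plddt", 0.0)) * 100, 2),
        "pTM": round(float(conf.get("ptm", 0.0)), 4),
        "ipTM": round(float(conf.get("iptm", 0.0)), 4),
        "i_pAE": round(float(conf.get("complex_ipae", 0.0)), 2),
        "success": True,
    }


# Illustrative input, not real Boltz output.
metrics = flatten_confidence({"complex_plddt": 0.875, "ptm": 0.8125, "iptm": 0.75})
```

Missing keys default to `0.0`, which the client-side merge step then treats as "not predicted".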
requirements.txt CHANGED
@@ -10,12 +10,7 @@ anthropic>=0.75
10
  openai>=1.40
11
  google-genai>=0.3
12
 
13
- # Phase B (Boltz post-eval). The `spaces` shim is safe on any hardware
14
- # tier; the `@spaces.GPU(...)` decorator is a no-op on cpu-basic and
15
- # provisions ZeroGPU on zero-a10g. Boltz-1 + torch require an actual
16
- # CUDA build, so they are gated: uncomment ONLY after switching the
17
- # Space hardware to a GPU tier (zero-a10g recommended) — otherwise pip
18
- # will pull ~2GB of CUDA wheels onto a CPU image and the build fails.
19
- spaces>=0.30
20
- # torch>=2.2 # ZeroGPU only — uncomment after hardware flip
21
- # boltz>=0.4 # ZeroGPU only — uncomment after hardware flip
 
10
  openai>=1.40
11
  google-genai>=0.3
12
 
13
+ # Phase B uses a Modal-hosted Boltz sidecar (modal_boltz_app.py), so
14
+ # torch / boltz are NOT installed in the Space image; the Space only
15
+ # acts as an HTTP client of the Modal endpoint. See
16
+ # biodesignbench-leaderboard/README.md for deployment notes.