---
title: BioDesignBench Leaderboard
emoji: 🧬
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: "5.50.0"
app_file: app.py
pinned: false
license: mit
---

# BioDesignBench Leaderboard

Evaluating LLM Agents on Protein Design via MCP Tools.

**Romero Lab, Duke University**

## Overview

BioDesignBench evaluates LLM agents as orchestrators of multi-step *stochastic*
protein-design pipelines. This leaderboard tracks agent performance across
**76 design tasks** spanning a **2 × 5 design matrix** (de novo design vs.
redesign × five molecular families: antibody, binder, enzyme, scaffold, and
fluorescent protein; **9 of the 10 cells are occupied**), scored on a
100-point hybrid rubric: **72 algorithmic points** (Boltz-2 verification +
sequence/feasibility metrics) plus **28 LLM-judge points** (3-judge panel
with self-exclusion).

The six rubric components are Approach, Orchestration, Quality, Feasibility,
Novelty, and Diversity. See the *About* tab for the full methodology and the
*Depth Gap* tab for evaluation-depth interventions.

## Features

- **Overall Leaderboard** – Mixed-ranking table with human baselines and LLM agents
- **Taxonomy Heatmap** – Per-cell scores across the 9 occupied cells of the 2 × 5 design matrix
- **Component Analysis** – Radar and bar charts comparing the 6 scoring components
- **Guidance Effect** – Paired comparison of the same LLM in unguided (atomic tools) vs guided (composite workflows) mode
- **Depth Gap** – Forced-depth and low-diversity intervention results
- **About** – Methodology, submission guide, and citation info

## Bringing your own MCP tools

BioDesignBench is "bring your own LLM, optionally bring your own tools."
The **Custom MCP** submission mode lets you evaluate any 17-tool
implementation against the same 76 tasks, agent harness, and scoring
rubric used by the paper's reference runs. Our
[`protein-design-mcp`](https://github.com/jasonkim8652/protein-design-mcp)
is just one reference implementation.

### The contract

Your MCP server is a public HTTPS endpoint that accepts POST requests:

```
POST https://your-mcp.example.com/
Authorization: Bearer <optional shared token>
Content-Type: application/json

{
  "name": "predict_structure",
  "arguments": {"sequence": "MKKL..."}
}
```

The response must be a JSON object. Report errors in a top-level
`error` field rather than via HTTP status codes so the agent loop can
see the reason.
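
For illustration, a `predict_structure` reply might take either of these
shapes. The success fields shown here are hypothetical; only the
JSON-object rule and the top-level `error` convention are part of the
contract:

```
# success: any JSON object the agent loop can read
{"plddt": 87.3, "structure_path": "/outputs/design_0.pdb"}

# failure: top-level "error" field, still returned as HTTP 200
{"error": "sequence contains an invalid residue"}
```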

### Tool schemas

The 17 reference tool names plus full JSON Schema for each argument
set live in [`mcp_tool_schemas.json`](./mcp_tool_schemas.json). Your
implementation must accept the same tool names and argument shapes;
the leaderboard's agent loop picks tools by name from this list.
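
As a sanity check before deploying, you can validate a sample call
against the published schemas. The sketch below assumes
`mcp_tool_schemas.json` maps each tool name to a JSON Schema for its
arguments; adjust the lookup if the actual file nests things
differently:

```python
# Sketch: pre-deployment argument validation against the published schemas.
# Assumes mcp_tool_schemas.json maps tool name -> JSON Schema for "arguments".
import json

from jsonschema import ValidationError, validate  # pip install jsonschema

with open("mcp_tool_schemas.json") as f:
    schemas = json.load(f)

call = {"name": "predict_structure", "arguments": {"sequence": "MKKL"}}

try:
    validate(instance=call["arguments"], schema=schemas[call["name"]])
    print("arguments match the reference schema")
except KeyError:
    print(f"unknown tool: {call['name']}")
except ValidationError as e:
    print(f"schema mismatch: {e.message}")
```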

### Reference implementation

See [`jasonkim8652/protein-design-mcp`](https://github.com/jasonkim8652/protein-design-mcp)
for the lab's reference implementation (also on PyPI:
`pip install protein-design-mcp`). The repo includes a Dockerfile and
a Modal deploy template (`deploy/modal_app.py`) you can fork, swap in
your own tool handlers, and redeploy.

### Hosting options

Any public HTTPS endpoint works; we only check that the contract is
satisfied:

| Option | Pros | Cons |
|---|---|---|
| **Modal** (serverless GPU) | Cheap pay-per-use, auto-scale | Modal account needed |
| **AWS / GCP / Azure VM** | Full control, reuse existing cloud | 24/7 billing or manual shutdown |
| **Runpod / Lambda Labs / Vast** | Cheap GPU rentals | Manual spin-up per submission |
| **Kubernetes / HPC** | Reuse on-prem GPU | Ops overhead |
| **ngrok + local GPU** | $0, fastest iteration | URL is ephemeral |

The lab's reference MCP is hosted on Modal for serverless
pay-per-use cost control; your submission does not have to match.

### Minimal stub

A ~150-line FastAPI template you can fork is at
[`example_mcp_server.py`](./example_mcp_server.py). Replace the
`handle_*` stubs with your implementations, deploy, and paste the URL
into the submission form's **Advanced: Custom MCP** section.
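
The core of such a server is a single dispatch route. This is a sketch
of the pattern implied by the contract above, not the actual
`example_mcp_server.py`; the handler and env-var names (`MCP_TOKEN`)
are placeholders:

```python
# Minimal sketch of the dispatch pattern (not the shipped example_mcp_server.py).
# Replace the stub handlers with real tool logic; errors go in the "error" field.
import os

from fastapi import FastAPI, Header, Request

app = FastAPI()


def handle_predict_structure(arguments: dict) -> dict:
    # Stub: a real implementation would run a structure predictor here.
    return {"error": "predict_structure not implemented yet"}


HANDLERS = {
    "predict_structure": handle_predict_structure,
    # ... one entry per tool in mcp_tool_schemas.json
}


@app.post("/")
async def dispatch(request: Request, authorization: str = Header(default="")):
    token = os.environ.get("MCP_TOKEN")  # optional shared bearer token
    if token and authorization != f"Bearer {token}":
        return {"error": "invalid bearer token"}
    body = await request.json()
    handler = HANDLERS.get(body.get("name", ""))
    if handler is None:
        return {"error": f"unknown tool: {body.get('name')}"}
    try:
        return handler(body.get("arguments", {}))
    except Exception as e:  # keep failures visible to the agent loop
        return {"error": str(e)}
```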


## Backend pipeline phases

Submission processing runs in 4 admin-controlled phases:

| Phase | Step | Status | Notes |
|---|---|---|---|
| **A** | Dispatch tasks β†’ CPU scoring | live | HTTP POST to submitter endpoint, validate, score 5/6 components |
| **B** | Boltz-2 structure verification | live (Modal) | Modal-hosted A10G companion app provisions GPU on demand |
| **C** | LLM judge panel (28-pt hybrid) | live | 3-judge PoLL with self-exclusion, requires API key secrets |
| **D** | Finalize + publish to leaderboard | live | Aggregates hybrid scores, writes back to submissions dataset |

### Phase B architecture (Modal companion app)

The HF Space runs on `cpu-basic` and cannot host Boltz directly, so
Phase B uses a Modal-deployed sidecar (`modal_boltz_app.py`) that
(see the sketch after this list):

- pre-builds an image with `boltz==2.2.1`, `torch==2.10`, NVIDIA
  cuequivariance kernels, and FastAPI;
- exposes a single web endpoint at
  `https://<workspace>--bdb-boltz-predict.modal.run`;
- spins up an A10G on demand, runs `boltz predict` (via the same CLI
  the dev pipeline uses), and returns confidence metrics;
- auto-stops after 5 minutes idle so the lab is only billed for active
  inference time (~$0.06 per task at A10G rates).
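
A rough sketch of that shape, assuming Modal's `web_endpoint` API and
the `bdb-boltz-shared` secret described below; the real
`modal_boltz_app.py` will differ in its input format and confidence
parsing:

```python
# Illustrative sketch only; not the shipped modal_boltz_app.py.
import os
import subprocess

import modal
from fastapi import Header

image = modal.Image.debian_slim().pip_install("boltz==2.2.1", "fastapi[standard]")
app = modal.App("bdb-boltz")


@app.function(
    image=image,
    gpu="A10G",
    secrets=[modal.Secret.from_name("bdb-boltz-shared")],  # provides TOKEN
    container_idle_timeout=300,  # auto-stop after 5 minutes idle
)
@modal.web_endpoint(method="POST")
def predict(payload: dict, authorization: str = Header(default="")):
    if authorization != f"Bearer {os.environ['TOKEN']}":
        return {"success": False, "error": "invalid bearer token"}
    # Write the sequence to a Boltz input and shell out to the CLI.
    # (Real input format and confidence parsing elided.)
    with open("/tmp/input.fasta", "w") as f:
        f.write(f">design\n{payload['sequence']}\n")
    result = subprocess.run(
        ["boltz", "predict", "/tmp/input.fasta", "--out_dir", "/tmp/out"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return {"success": False, "error": result.stderr[-2000:]}
    return {"success": True, "confidence": "parse /tmp/out here"}
```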

The HF Space is just an HTTP client (`eval_boltz.py`); design sequences
are POSTed to the Modal endpoint with a shared bearer token. To
deploy the sidecar (one time):

```bash
cd biodesignbench-leaderboard
modal deploy modal_boltz_app.py
```

Then set these HF Space secrets:

```
MODAL_BOLTZ_URL    https://<workspace>--bdb-boltz-predict.modal.run
MODAL_BOLTZ_TOKEN  must match the TOKEN value in the Modal secret `bdb-boltz-shared`
```

If `MODAL_BOLTZ_URL` is unset, Phase B predictors return a structured
failure dict with `success=False` and an actionable error message
instead of crashing the dispatcher.
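
For reference, the client-side behavior described above reduces to
roughly this pattern (a sketch, not the actual `eval_boltz.py`):

```python
# Sketch of the HF Space client behavior described above (not eval_boltz.py).
import os

import requests


def predict_boltz(sequence: str) -> dict:
    url = os.environ.get("MODAL_BOLTZ_URL")
    if not url:
        # Structured failure instead of crashing the dispatcher.
        return {
            "success": False,
            "error": "MODAL_BOLTZ_URL is unset; deploy modal_boltz_app.py "
                     "and set the MODAL_BOLTZ_URL / MODAL_BOLTZ_TOKEN secrets.",
        }
    resp = requests.post(
        url,
        json={"sequence": sequence},
        headers={"Authorization": f"Bearer {os.environ.get('MODAL_BOLTZ_TOKEN', '')}"},
        timeout=600,  # cold start plus inference can take minutes
    )
    return resp.json()
```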