---
title: BioDesignBench Leaderboard
emoji: 🧬
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: "5.50.0"
app_file: app.py
pinned: false
license: mit
---
# BioDesignBench Leaderboard
Evaluating LLM Agents on Protein Design via MCP Tools.
**Romero Lab, Duke University**
## Overview
BioDesignBench evaluates LLM agents as orchestrators of multi-step *stochastic*
protein-design pipelines. This leaderboard tracks agent performance across
**76 design tasks** spanning a **2 × 5 design matrix** (de novo design vs
redesign × five molecular families: antibody, binder, enzyme, scaffold,
fluorescent protein; **9 of the 10 cells occupied**), scored on a 100-point hybrid rubric:
**72 algorithmic points** (Boltz-2 verification + sequence/feasibility metrics)
plus **28 LLM-judge points** (3-judge panel with self-exclusion).
The six rubric components are Approach, Orchestration, Quality, Feasibility,
Novelty, and Diversity. See the *About* tab for the full methodology and the
*Depth Gap* tab for evaluation-depth interventions.
## Features
- **Overall Leaderboard** – Mixed-ranking table with human baselines and LLM agents
- **Taxonomy Heatmap** – Per-cell scores across the 9 occupied cells of the 2 × 5 design matrix
- **Component Analysis** – Radar and bar charts comparing the 6 scoring components
- **Guidance Effect** – Paired comparison of the same LLM in unguided (atomic tools) vs guided (composite workflows) mode
- **Depth Gap** – Forced-depth and low-diversity intervention results
- **About** – Methodology, submission guide, and citation info
## Bringing your own MCP tools
BioDesignBench is "bring your own LLM, optionally bring your own tools."
The **Custom MCP** submission mode lets you evaluate any 17-tool
implementation against the identical 76 tasks, the identical agent
harness, and the identical scoring rubric used by the paper's reference
runs. Our [`protein-design-mcp`](https://github.com/jasonkim8652/protein-design-mcp)
is just one reference implementation.
### The contract
Your MCP server is a public HTTPS endpoint that accepts POST requests:
```
POST https://your-mcp.example.com/
Authorization: Bearer <optional shared token>
Content-Type: application/json

{
  "name": "predict_structure",
  "arguments": {"sequence": "MKKL..."}
}
```
The response must be a JSON object. Report errors in a top-level
`error` field rather than via HTTP status codes so the agent loop can
see the reason.
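As a quick smoke test, here is a minimal sketch of a client call against that contract (using `requests`; the URL and token are placeholders):

```python
import requests

MCP_URL = "https://your-mcp.example.com/"  # your deployed endpoint
TOKEN = "change-me"                        # optional shared bearer token

# Call one tool by name, exactly as the leaderboard's agent loop does.
resp = requests.post(
    MCP_URL,
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"name": "predict_structure", "arguments": {"sequence": "MKKL..."}},
    timeout=600,
)
result = resp.json()

# Errors travel in the response body, not in HTTP status codes,
# so the agent can read the reason and adapt.
if "error" in result:
    print("tool failed:", result["error"])
else:
    print("tool succeeded:", result)
```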
### Tool schemas
The 17 reference tool names plus full JSON Schema for each argument
set live in [`mcp_tool_schemas.json`](./mcp_tool_schemas.json). Your
implementation must accept the same tool names and argument shapes –
the leaderboard's agent loop picks tools by name from this list.
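One way to sanity-check your server before submitting is to validate calls against the published schemas. A sketch, assuming the `jsonschema` package and that the file maps each tool name to the JSON Schema for its arguments:

```python
import json

import jsonschema  # pip install jsonschema

# Assumed layout: {tool_name: <JSON Schema for that tool's arguments>, ...}
with open("mcp_tool_schemas.json") as f:
    schemas = json.load(f)

def validate_call(name: str, arguments: dict) -> None:
    """Raise if a tool call would not match the reference contract."""
    if name not in schemas:
        raise KeyError(f"unknown tool: {name}")
    jsonschema.validate(instance=arguments, schema=schemas[name])

validate_call("predict_structure", {"sequence": "MKKL..."})
```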
### Reference implementation
See [`jasonkim8652/protein-design-mcp`](https://github.com/jasonkim8652/protein-design-mcp)
for the lab's reference implementation (also on PyPI:
`pip install protein-design-mcp`). The repo includes a Dockerfile and
a Modal deploy template (`deploy/modal_app.py`) you can fork, swap in
your own tool handlers, and redeploy.
### Hosting options
Any public HTTPS endpoint works – we only check that the contract is
satisfied:
| Option | Pros | Cons |
|---|---|---|
| **Modal** (serverless GPU) | Cheap pay-per-use, auto-scale | Modal account needed |
| **AWS / GCP / Azure VM** | Full control, reuse existing cloud | 24/7 billing or manual shutdown |
| **Runpod / Lambda Labs / Vast** | Cheap GPU rentals | Manual spin-up per submission |
| **Kubernetes / HPC** | Reuse on-prem GPU | Ops overhead |
| **ngrok + local GPU** | $0, fastest iteration | URL is ephemeral |
The lab's reference MCP is hosted on Modal for serverless
pay-per-use cost control; your submission does not have to match.
### Minimal stub
A ~150-line FastAPI template you can fork is at
[`example_mcp_server.py`](./example_mcp_server.py). Replace the
`handle_*` stubs with your implementations, deploy, and paste the URL
into the submission form's **Advanced: Custom MCP** section.
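For orientation, a stripped-down sketch of the same shape (FastAPI; the handler bodies and names here are illustrative, the error convention follows the contract above):

```python
from fastapi import FastAPI, Request

app = FastAPI()

def handle_predict_structure(arguments: dict) -> dict:
    # Replace with a real structure-prediction call.
    return {"note": "stub"}

HANDLERS = {
    "predict_structure": handle_predict_structure,
    # ... one handler per tool in mcp_tool_schemas.json
}

@app.post("/")
async def call_tool(request: Request) -> dict:
    payload = await request.json()
    handler = HANDLERS.get(payload.get("name"))
    if handler is None:
        # Per the contract: report errors in the body, not via HTTP status.
        return {"error": f"unknown tool: {payload.get('name')}"}
    try:
        return handler(payload.get("arguments", {}))
    except Exception as exc:
        return {"error": str(exc)}
```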
## Backend pipeline phases
Submission processing runs in 4 admin-controlled phases:
| Phase | Step | Status | Notes |
|---|---|---|---|
| **A** | Dispatch tasks → CPU scoring | live | HTTP POST to submitter endpoint, validate, score 5/6 components |
| **B** | Boltz-2 structure verification | live (Modal) | Modal-hosted A10G companion app provisions GPU on demand |
| **C** | LLM judge panel (28-pt hybrid) | live | 3-judge PoLL with self-exclusion, requires API key secrets |
| **D** | Finalize + publish to leaderboard | live | Aggregates hybrid scores, writes back to submissions dataset |
### Phase B architecture (Modal companion app)
The HF Space runs on `cpu-basic` and cannot host Boltz directly, so
Phase B uses a Modal-deployed sidecar (`modal_boltz_app.py`) that:
- pre-builds an image with `boltz==2.2.1`, `torch==2.10`, NVIDIA
cuequivariance kernels, and FastAPI;
- exposes a single web endpoint at
`https://<workspace>--bdb-boltz-predict.modal.run`;
- spins up an A10G on demand, runs `boltz predict` (via the same CLI
the dev pipeline uses), and returns confidence metrics;
- auto-stops after 5 minutes idle so the lab is only billed for active
inference time (~$0.06 per task at A10G rates).
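In outline, such a sidecar fits in a few lines of Modal code. This is a sketch, not the lab's actual `modal_boltz_app.py`; the decorator options, input formatting, and handler body are assumptions against a recent Modal SDK:

```python
import subprocess

import modal

# Pre-built image so GPU cold starts skip the pip install.
image = modal.Image.debian_slim().pip_install("boltz==2.2.1", "fastapi[standard]")
app = modal.App("bdb-boltz", image=image)

@app.function(gpu="A10G", scaledown_window=300, timeout=1800)  # stop after 5 min idle
@modal.fastapi_endpoint(method="POST")
def predict(body: dict) -> dict:
    # Bearer-token check and full Boltz input handling omitted for brevity.
    fasta = "/tmp/design.fasta"
    with open(fasta, "w") as f:
        f.write(f">A|protein\n{body['sequence']}\n")
    # Same CLI the dev pipeline uses; the real app parses confidence metrics
    # out of the output directory before returning.
    subprocess.run(["boltz", "predict", fasta, "--out_dir", "/tmp/out"], check=True)
    return {"success": True}
```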
The HF Space is just an HTTP client (`eval_boltz.py`); design sequences
are POSTed to the Modal endpoint with a shared bearer token. To
deploy the sidecar (one time):
```bash
cd biodesignbench-leaderboard
modal deploy modal_boltz_app.py
```
Then set these HF Space secrets:
```
MODAL_BOLTZ_URL    https://<workspace>--bdb-boltz-predict.modal.run
MODAL_BOLTZ_TOKEN  <same value as TOKEN in the Modal secret `bdb-boltz-shared`>
```
If `MODAL_BOLTZ_URL` is unset, Phase B predictors return a structured
failure dict with `success=False` and an actionable error message
instead of crashing the dispatcher.
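The fallback looks roughly like this (field names beyond `success` are illustrative):

```python
# Returned by Phase B predictors when MODAL_BOLTZ_URL is unset (illustrative):
{
    "success": False,
    "error": "MODAL_BOLTZ_URL secret is not set; deploy modal_boltz_app.py "
             "and add MODAL_BOLTZ_URL / MODAL_BOLTZ_TOKEN to the Space secrets.",
}
```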