---
title: BioDesignBench Leaderboard
emoji: "\U0001F9EC"
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: "5.50.0"
app_file: app.py
pinned: false
license: mit
---
# BioDesignBench Leaderboard
Evaluating LLM Agents on Protein Design via MCP Tools.
**Romero Lab, Duke University**
## Overview
BioDesignBench evaluates LLM agents as orchestrators of multi-step *stochastic*
protein-design pipelines. This leaderboard tracks agent performance across
**76 design tasks** spanning a **2 × 5 design matrix** (de novo design vs.
redesign × five molecular families: antibody, binder, enzyme, scaffold,
fluorescent protein), of which **9 cells are occupied**. Tasks are scored on a
100-point hybrid rubric: **72 algorithmic points** (Boltz-2 verification +
sequence/feasibility metrics) plus **28 LLM-judge points** (3-judge panel with
self-exclusion).
The six rubric components are Approach, Orchestration, Quality, Feasibility,
Novelty, and Diversity. See the *About* tab for the full methodology and the
*Depth Gap* tab for evaluation-depth interventions.
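As a quick illustration of the hybrid arithmetic, assuming the judge panel is
averaged (the component scores below are made up; only the 72/28 split comes
from the rubric):
```python
# Hypothetical scores for illustration only, not real leaderboard numbers.
algorithmic = 58.5            # out of 72: Boltz-2 verification + sequence/feasibility metrics
judges = [21.0, 24.5, 23.0]   # 3-judge panel, each out of 28 (self-excluded judge already dropped)
hybrid = algorithmic + sum(judges) / len(judges)
print(round(hybrid, 2))       # 81.33 out of 100
```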
## Features
- **Overall Leaderboard** – Mixed-ranking table with human baselines and LLM agents
- **Taxonomy Heatmap** – Per-cell scores across the 9 occupied cells of the 2 × 5 design matrix
- **Component Analysis** – Radar and bar charts comparing the 6 scoring components
- **Guidance Effect** – Paired comparison of the same LLM in unguided (atomic tools) vs guided (composite workflows) mode
- **Depth Gap** – Forced-depth and low-diversity intervention results
- **About** – Methodology, submission guide, and citation info
## Bringing your own MCP tools
BioDesignBench is "bring your own LLM, optionally bring your own tools."
The **Custom MCP** submission mode lets you evaluate any implementation of
the 17 tools against the identical 76 tasks, the identical agent
harness, and the identical scoring rubric used by the paper's reference
runs. Our [`protein-design-mcp`](https://github.com/jasonkim8652/protein-design-mcp)
is just one reference implementation.
### The contract
Your MCP server is a public HTTPS endpoint that accepts POST requests:
```
POST https://your-mcp.example.com/
Authorization: Bearer <optional shared token>
Content-Type: application/json
{
"name": "predict_structure",
"arguments": {"sequence": "MKKL..."}
}
```
The response must be a JSON object. Report errors in a top-level
`error` field rather than via HTTP status codes so the agent loop can
see the reason.
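A minimal client-side sketch of that contract (the URL and token are the
placeholders from the example above):
```python
import requests

MCP_URL = "https://your-mcp.example.com/"  # placeholder endpoint

def call_tool(name: str, arguments: dict) -> dict:
    resp = requests.post(
        MCP_URL,
        json={"name": name, "arguments": arguments},
        headers={"Authorization": "Bearer <optional shared token>"},
        timeout=300,  # structure tools can be slow
    )
    result = resp.json()
    if "error" in result:
        # Per the contract, failures arrive in the body rather than as HTTP
        # status codes, so the agent loop can read the reason and adapt.
        print(f"{name} failed: {result['error']}")
    return result

call_tool("predict_structure", {"sequence": "MKKL..."})
```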
### Tool schemas
The 17 reference tool names, plus the full JSON Schema for each tool's
arguments, live in [`mcp_tool_schemas.json`](./mcp_tool_schemas.json). Your
implementation must accept the same tool names and argument shapes;
the leaderboard's agent loop picks tools by name from this list.
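To sanity-check your server before submitting, one option is to validate
argument payloads against those schemas locally. A sketch, assuming the file
maps tool names to JSON Schemas (adjust to the actual layout of
`mcp_tool_schemas.json`):
```python
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

# Assumed layout: {"predict_structure": {...JSON Schema...}, ...}
with open("mcp_tool_schemas.json") as f:
    schemas = json.load(f)

try:
    validate(instance={"sequence": "MKKL..."}, schema=schemas["predict_structure"])
except ValidationError as e:
    print("argument shape mismatch:", e.message)
```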
### Reference implementation
See [`jasonkim8652/protein-design-mcp`](https://github.com/jasonkim8652/protein-design-mcp)
for the lab's reference implementation (also on PyPI:
`pip install protein-design-mcp`). The repo includes a Dockerfile and
a Modal deploy template (`deploy/modal_app.py`) that you can fork,
swap your own tool handlers into, and redeploy.
### Hosting options
Any public HTTPS endpoint works; we only check that the contract is
satisfied:
| Option | Pros | Cons |
|---|---|---|
| **Modal** (serverless GPU) | Cheap pay-per-use, auto-scale | Modal account needed |
| **AWS / GCP / Azure VM** | Full control, reuse existing cloud | 24/7 billing or manual shutdown |
| **Runpod / Lambda Labs / Vast** | Cheap GPU rentals | Manual spin-up per submission |
| **Kubernetes / HPC** | Reuse on-prem GPU | Ops overhead |
| **ngrok + local GPU** | $0, fastest iteration | URL is ephemeral |
The lab's reference MCP is hosted on Modal for serverless
pay-per-use cost control; your submission does not have to match.
### Minimal stub
A ~150-line FastAPI template you can fork is at
[`example_mcp_server.py`](./example_mcp_server.py). Replace the
`handle_*` stubs with your implementations, deploy, and paste the URL
into the submission form's **Advanced: Custom MCP** section.
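If you would rather start from scratch, the dispatch logic is small. A minimal
sketch of the same shape (handler names and routing are illustrative, not a
copy of `example_mcp_server.py`):
```python
from fastapi import FastAPI, Request

app = FastAPI()

def handle_predict_structure(arguments: dict) -> dict:
    return {"error": "not implemented"}  # swap in your real tool logic

HANDLERS = {
    "predict_structure": handle_predict_structure,
    # ... one entry per tool in mcp_tool_schemas.json ...
}

@app.post("/")
async def dispatch(request: Request) -> dict:
    body = await request.json()
    handler = HANDLERS.get(body.get("name", ""))
    if handler is None:
        # Unknown tools are reported in-body, per the contract.
        return {"error": f"unknown tool: {body.get('name')}"}
    return handler(body.get("arguments", {}))
```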
## Backend pipeline phases
Submission processing runs in 4 admin-controlled phases:
| Phase | Step | Status | Notes |
|---|---|---|---|
| **A** | Dispatch tasks → CPU scoring | live | HTTP POST to submitter endpoint, validate, score 5 of the 6 components |
| **B** | Boltz-2 structure verification | live (Modal) | Modal-hosted A10G companion app provisions GPU on demand |
| **C** | LLM judge panel (28-pt hybrid) | live | 3-judge PoLL with self-exclusion, requires API key secrets |
| **D** | Finalize + publish to leaderboard | live | Aggregates hybrid scores, writes back to submissions dataset |
### Phase B architecture (Modal companion app)
The HF Space runs on `cpu-basic` and cannot host Boltz directly, so
Phase B uses a Modal-deployed sidecar (`modal_boltz_app.py`) that:
- pre-builds an image with `boltz==2.2.1`, `torch==2.10`, NVIDIA
cuequivariance kernels, and FastAPI;
- exposes a single web endpoint at
`https://<workspace>--bdb-boltz-predict.modal.run`;
- spins up an A10G on demand, runs `boltz predict` (via the same CLI
the dev pipeline uses), and returns confidence metrics;
- auto-stops after 5 minutes idle so the lab is only billed for active
inference time (~$0.06 per task at A10G rates).
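In Modal terms, the sidecar boils down to something like the sketch below.
The app name, image recipe, and handler body are assumptions rather than the
actual `modal_boltz_app.py`, and Modal's decorator and parameter names can
differ across SDK versions:
```python
import modal

# Real image also pins torch and the NVIDIA cuequivariance kernels.
image = modal.Image.debian_slim().pip_install("boltz==2.2.1", "fastapi[standard]")
app = modal.App("bdb-boltz", image=image)

@app.function(gpu="A10G", container_idle_timeout=300)  # auto-stop after 5 min idle
@modal.web_endpoint(method="POST")
def predict(item: dict) -> dict:
    # A real handler would check the shared bearer token, write
    # item["sequence"] to a Boltz input file, shell out to `boltz predict`,
    # and parse the confidence JSON it writes.
    return {"success": True, "confidence_metrics": {}}
```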
The HF Space is just an HTTP client (`eval_boltz.py`); design sequences
are POSTed to the Modal endpoint with a shared bearer token. To
deploy the sidecar (one time):
```bash
cd biodesignbench-leaderboard
modal deploy modal_boltz_app.py
```
Then set these HF Space secrets:
```
MODAL_BOLTZ_URL     https://<workspace>--bdb-boltz-predict.modal.run
MODAL_BOLTZ_TOKEN   <the TOKEN value from the Modal secret `bdb-boltz-shared`>
```
If `MODAL_BOLTZ_URL` is unset, Phase B predictors return a structured
failure dict with `success=False` and an actionable error message
instead of crashing the dispatcher.
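On the Space side, the client logic reduces to roughly the following sketch of
the behavior described above (payload field names are assumptions):
```python
import os
import requests

def boltz_predict(sequence: str) -> dict:
    url = os.environ.get("MODAL_BOLTZ_URL")
    if not url:
        # Documented fallback: a structured failure instead of a crash.
        return {"success": False,
                "error": "MODAL_BOLTZ_URL is unset; set the HF Space secret to enable Phase B."}
    resp = requests.post(
        url,
        json={"sequence": sequence},
        headers={"Authorization": f"Bearer {os.environ['MODAL_BOLTZ_TOKEN']}"},
        timeout=600,  # cold start + inference can take minutes
    )
    return resp.json()
```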