---
title: BioDesignBench Leaderboard
emoji: 🧬
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.50.0
app_file: app.py
pinned: false
license: mit
---
# BioDesignBench Leaderboard

*Evaluating LLM Agents on Protein Design via MCP Tools.*

Romero Lab, Duke University
## Overview

BioDesignBench evaluates LLM agents as orchestrators of multi-step, stochastic protein-design pipelines. This leaderboard tracks agent performance across 76 design tasks spanning a 2 × 5 design matrix (de novo design vs. redesign × five molecular families: antibody, binder, enzyme, scaffold, and fluorescent protein; 9 of the 10 cells are occupied), scored on a 100-point hybrid rubric: 72 algorithmic points (Boltz-2 verification + sequence/feasibility metrics) plus 28 LLM-judge points (a 3-judge panel with self-exclusion).
The six rubric components are Approach, Orchestration, Quality, Feasibility, Novelty, and Diversity. See the About tab for the full methodology and the Depth Gap tab for evaluation-depth interventions.
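For intuition, a toy sketch of how the hybrid total combines the two parts. The self-exclusion rule (drop any judge that matches the submitted model) and the clamping below are an illustrative reading of the description above, not the actual scoring code:

```python
def hybrid_score(algorithmic: float, judge_scores: dict, submitted_model: str) -> float:
    """Toy illustration: 72 algorithmic points + 28 judge points with self-exclusion."""
    # Exclude any judge that shares the submitted model (self-exclusion),
    # then average the remaining judges' 0-28 scores.
    eligible = [s for judge, s in judge_scores.items() if judge != submitted_model]
    judge_points = sum(eligible) / len(eligible)
    return min(algorithmic, 72.0) + min(judge_points, 28.0)

# e.g. 58.3 algorithmic points + mean(21, 24) judge points = 80.8 / 100
print(hybrid_score(58.3, {"gpt": 21.0, "claude": 24.0, "gemini": 19.0}, "gemini"))
```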
## Features

- Overall Leaderboard – mixed-ranking table with human baselines and LLM agents
- Taxonomy Heatmap – per-cell scores across the 9 occupied cells of the 2 × 5 design matrix
- Component Analysis – radar and bar charts comparing the 6 scoring components
- Guidance Effect – paired comparison of the same LLM in unguided (atomic tools) vs. guided (composite workflows) mode
- Depth Gap – forced-depth and low-diversity intervention results
- About – methodology, submission guide, and citation info
## Bringing your own MCP tools

BioDesignBench is "bring your own LLM, optionally bring your own tools."
The Custom MCP submission mode lets you evaluate any 17-tool implementation
against the same 76 tasks, the same agent harness, and the same scoring
rubric used by the paper's reference runs. Our `protein-design-mcp` server
is just one reference implementation.
### The contract

Your MCP server is a public HTTPS endpoint that accepts POST requests:

```http
POST https://your-mcp.example.com/
Authorization: Bearer <optional shared token>
Content-Type: application/json

{
  "name": "predict_structure",
  "arguments": {"sequence": "MKKL..."}
}
```

The response must be a JSON object. Report errors in a top-level `error`
field rather than via HTTP status codes so the agent loop can see the reason.
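For illustration, a minimal client-side sketch of this contract. The endpoint URL and token below are placeholders, and `requests` is assumed to be installed:

```python
import requests

MCP_URL = "https://your-mcp.example.com/"  # placeholder: your deployed endpoint
TOKEN = "change-me"                        # optional shared bearer token

def call_tool(name: str, arguments: dict) -> dict:
    """POST one tool call and surface tool failures via the top-level `error` field."""
    resp = requests.post(
        MCP_URL,
        json={"name": name, "arguments": arguments},
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=600,
    )
    result = resp.json()
    if "error" in result:
        # The agent loop sees the reason instead of a bare HTTP failure.
        print(f"tool {name} failed: {result['error']}")
    return result

print(call_tool("predict_structure", {"sequence": "MKKL..."}))
```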
### Tool schemas

The 17 reference tool names, plus the full JSON Schema for each argument
set, live in `mcp_tool_schemas.json`. Your implementation must accept the
same tool names and argument shapes; the leaderboard's agent loop picks
tools by name from this list.
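As a local sanity check before deploying, you might validate your handlers' inputs against the published schemas. This sketch assumes `mcp_tool_schemas.json` is an object keyed by tool name (adjust to the file's actual layout) and that the `jsonschema` package is installed:

```python
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

with open("mcp_tool_schemas.json") as fh:
    schemas = json.load(fh)  # assumed layout: {tool_name: argument JSON Schema, ...}

def check_arguments(tool_name: str, arguments: dict) -> bool:
    """Return True if `arguments` satisfies the published schema for `tool_name`."""
    try:
        validate(instance=arguments, schema=schemas[tool_name])
        return True
    except (KeyError, ValidationError) as exc:
        print(f"{tool_name}: {exc}")
        return False

check_arguments("predict_structure", {"sequence": "MKKL..."})
```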
### Reference implementation

See jasonkim8652/protein-design-mcp for the lab's reference implementation
(also on PyPI: `pip install protein-design-mcp`). The repo includes a
Dockerfile and a Modal deploy template (`deploy/modal_app.py`) that you can
fork, swap in your own tool handlers, and redeploy.
### Hosting options

Any public HTTPS endpoint works; we only check that the contract is satisfied:
| Option | Pros | Cons |
|---|---|---|
| Modal (serverless GPU) | Cheap pay-per-use, auto-scale | Modal account needed |
| AWS / GCP / Azure VM | Full control, reuse existing cloud | 24/7 billing or manual shutdown |
| Runpod / Lambda Labs / Vast | Cheap GPU rentals | Manual spin-up per submission |
| Kubernetes / HPC | Reuse on-prem GPU | Ops overhead |
| ngrok + local GPU | $0, fastest iteration | URL is ephemeral |
The lab's reference MCP is hosted on Modal for serverless pay-per-use cost control; your submission does not have to match.
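Before pointing the submission form at your endpoint, a quick reachability pass over all 17 tools can save a failed run. This sketch reuses the schema file as the source of tool names (same layout assumption as above); most tools will return an `error` for empty dummy arguments, which is fine for a smoke test:

```python
import json
import requests

MCP_URL = "https://your-mcp.example.com/"  # placeholder: your deployed endpoint

with open("mcp_tool_schemas.json") as fh:
    tool_names = list(json.load(fh))  # assumed: object keyed by tool name

for name in tool_names:
    resp = requests.post(MCP_URL, json={"name": name, "arguments": {}}, timeout=120)
    body = resp.json()  # contract: the response is always a JSON object
    status = "ok" if "error" not in body else f"error: {body['error']}"
    print(f"{name}: {status}")
```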
### Minimal stub

A ~150-line FastAPI template you can fork is at `example_mcp_server.py`.
Replace the `handle_*` stubs with your implementations, deploy, and paste
the URL into the submission form's "Advanced: Custom MCP" section.
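For orientation, a stripped-down sketch of the shape such a stub takes; the handler body and return fields here are placeholders, not the template's actual contents:

```python
from fastapi import FastAPI, Request

app = FastAPI()

def handle_predict_structure(arguments: dict) -> dict:
    # Placeholder: replace with a call to your own structure predictor.
    return {"plddt": 0.0, "note": "stub"}

HANDLERS = {
    "predict_structure": handle_predict_structure,
    # ... register the remaining tools listed in mcp_tool_schemas.json ...
}

@app.post("/")
async def call_tool(request: Request) -> dict:
    payload = await request.json()
    handler = HANDLERS.get(payload.get("name"))
    if handler is None:
        # Per the contract: errors go in a top-level `error` field, not HTTP status codes.
        return {"error": f"unknown tool: {payload.get('name')}"}
    try:
        return handler(payload.get("arguments", {}))
    except Exception as exc:
        return {"error": str(exc)}
```

Serve it with `uvicorn <module>:app --port 8000` behind any of the hosting options above, then submit the public URL.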
## Backend pipeline phases

Submission processing runs in four admin-controlled phases:
| Phase | Step | Status | Notes |
|---|---|---|---|
| A | Dispatch tasks + CPU scoring | live | HTTP POST to submitter endpoint, validate, score 5/6 components |
| B | Boltz-2 structure verification | live (Modal) | Modal-hosted A10G companion app provisions GPU on demand |
| C | LLM judge panel (28-pt hybrid) | live | 3-judge PoLL with self-exclusion, requires API key secrets |
| D | Finalize + publish to leaderboard | live | Aggregates hybrid scores, writes back to submissions dataset |
### Phase B architecture (Modal companion app)

The HF Space runs on `cpu-basic` and cannot host Boltz directly, so
Phase B uses a Modal-deployed sidecar (`modal_boltz_app.py`) that:

- pre-builds an image with `boltz==2.2.1`, `torch==2.10`, NVIDIA cuequivariance kernels, and FastAPI;
- exposes a single web endpoint at `https://<workspace>--bdb-boltz-predict.modal.run`;
- spins up an A10G on demand, runs `boltz predict` (via the same CLI the dev pipeline uses), and returns confidence metrics;
- auto-stops after 5 minutes idle so the lab is only billed for active inference time (~$0.06 per task at A10G rates).
The HF Space is just an HTTP client (`eval_boltz.py`); design sequences are
POSTed to the Modal endpoint with a shared bearer token. To deploy the
sidecar (one time):

```bash
cd biodesignbench-leaderboard
modal deploy modal_boltz_app.py
```
Then set these HF Space secrets:

| Secret | Value |
|---|---|
| `MODAL_BOLTZ_URL` | `https://<workspace>--bdb-boltz-predict.modal.run` |
| `MODAL_BOLTZ_TOKEN` | matches the `TOKEN` in the Modal secret `bdb-boltz-shared` |
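These can be set in the Space's Settings UI; if you prefer to script it, `huggingface_hub` exposes `add_space_secret` (the Space repo id and token value below are placeholders):

```python
from huggingface_hub import HfApi

api = HfApi()  # assumes you are logged in via `huggingface-cli login` or HF_TOKEN is set
repo_id = "your-org/biodesignbench-leaderboard"  # placeholder Space repo id

api.add_space_secret(repo_id, "MODAL_BOLTZ_URL",
                     "https://<workspace>--bdb-boltz-predict.modal.run")
api.add_space_secret(repo_id, "MODAL_BOLTZ_TOKEN", "<shared token>")
```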
If `MODAL_BOLTZ_URL` is unset, Phase B predictors return a structured
failure dict with `success=False` and an actionable error message
instead of crashing the dispatcher.
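A schematic of that client-side behavior; the request payload and response fields are illustrative, since only the bearer-token POST and the `success=False` fallback are described above:

```python
import os
import requests

def predict_structure(sequence: str) -> dict:
    """Illustrative Phase B client: POST one design sequence to the Modal sidecar."""
    url = os.environ.get("MODAL_BOLTZ_URL")
    if not url:
        # Mirror the documented fallback: structured failure, no dispatcher crash.
        return {"success": False,
                "error": "MODAL_BOLTZ_URL is not set; add it to the Space secrets."}
    resp = requests.post(
        url,
        json={"sequence": sequence},  # illustrative payload shape
        headers={"Authorization": f"Bearer {os.environ.get('MODAL_BOLTZ_TOKEN', '')}"},
        timeout=900,
    )
    body = resp.json()
    body.setdefault("success", "error" not in body)
    return body
```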