---
title: BioDesignBench Leaderboard
emoji: 🧬
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: "5.50.0"
app_file: app.py
pinned: false
license: mit
---

# BioDesignBench Leaderboard

Evaluating LLM Agents on Protein Design via MCP Tools.

**Romero Lab, Duke University**

## Overview

BioDesignBench evaluates LLM agents as orchestrators of multi-step *stochastic*
protein-design pipelines. This leaderboard tracks agent performance across
**76 design tasks** spanning a **2 × 5 design matrix** (de novo design vs.
redesign × five molecular families: antibody, binder, enzyme, scaffold, and
fluorescent protein; **9 of the 10 cells are occupied**), scored on a
100-point hybrid rubric: **72 algorithmic points** (Boltz-2 verification +
sequence/feasibility metrics) plus **28 LLM-judge points** (3-judge panel
with self-exclusion).

The six rubric components are Approach, Orchestration, Quality, Feasibility,
Novelty, and Diversity. See the *About* tab for the full methodology and the
*Depth Gap* tab for evaluation-depth interventions.

## Features

- **Overall Leaderboard** – Mixed-ranking table with human baselines and LLM agents
- **Taxonomy Heatmap** – Per-cell scores across the 9 occupied cells of the 2 × 5 design matrix
- **Component Analysis** – Radar and bar charts comparing the 6 scoring components
- **Guidance Effect** – Paired comparison of the same LLM in unguided (atomic tools) vs. guided (composite workflows) mode
- **Depth Gap** – Forced-depth and low-diversity intervention results
- **About** – Methodology, submission guide, and citation info

## Bringing your own MCP tools

BioDesignBench is "bring your own LLM, optionally bring your own tools."
The **Custom MCP** submission mode lets you evaluate any 17-tool
implementation against the identical 76 tasks, agent harness, and scoring
rubric used by the paper's reference runs. Our
[`protein-design-mcp`](https://github.com/RomeroLab/protein-design-mcp)
is just one reference implementation.

### The contract

Your MCP server is a public HTTPS endpoint that accepts POST requests:

```
POST https://your-mcp.example.com/
Authorization: Bearer <optional shared token>
Content-Type: application/json

{
  "name": "predict_structure",
  "arguments": {"sequence": "MKKL..."}
}
```

The response must be a JSON object. Report errors in a top-level
`error` field rather than via HTTP status codes, so the agent loop can
see the reason.
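
Before submitting, you can smoke-test an endpoint against this contract
with a few lines of Python. A minimal sketch, where the URL and token are
placeholders for your own deployment:

```python
# Quick self-test of a custom MCP endpoint against the contract above.
# The URL and token are placeholders for your own deployment.
import requests

MCP_URL = "https://your-mcp.example.com/"

resp = requests.post(
    MCP_URL,
    json={"name": "predict_structure", "arguments": {"sequence": "MKKL..."}},
    headers={"Authorization": "Bearer <optional shared token>"},
    timeout=300,
)
result = resp.json()  # the contract: always a JSON object
if "error" in result:
    # Errors arrive in the body, not as HTTP status codes, so the agent
    # loop can read the reason and adapt.
    print("tool error:", result["error"])
else:
    print("tool output keys:", sorted(result))
```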

### Tool schemas

The 17 reference tool names, plus the full JSON Schema for each argument
set, live in [`mcp_tool_schemas.json`](./mcp_tool_schemas.json). Your
implementation must accept the same tool names and argument shapes;
the leaderboard's agent loop picks tools by name from this list.
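
To sanity-check your handlers against those schemas locally, something like
the sketch below works, assuming the file maps each tool name to a JSON
Schema for that tool's arguments (verify the exact layout in the file itself):

```python
# Validate an argument payload against the published schemas.
# Assumes mcp_tool_schemas.json maps tool name -> JSON Schema for that
# tool's arguments; check the file for its actual layout.
import json

from jsonschema import validate  # pip install jsonschema

with open("mcp_tool_schemas.json") as f:
    schemas = json.load(f)

args = {"sequence": "MKKL..."}
validate(instance=args, schema=schemas["predict_structure"])  # raises on mismatch
```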

### Reference implementation

See [`jasonkim8652/protein-design-mcp`](https://github.com/jasonkim8652/protein-design-mcp)
for the lab's reference implementation (also on PyPI:
`pip install protein-design-mcp`). The repo includes a Dockerfile and
a Modal deploy template (`deploy/modal_app.py`) that you can fork, swap in
your own tool handlers, and redeploy.

### Hosting options

Any public HTTPS endpoint works; we only check that the contract is
satisfied:

| Option | Pros | Cons |
|---|---|---|
| **Modal** (serverless GPU) | Cheap pay-per-use, auto-scale | Modal account needed |
| **AWS / GCP / Azure VM** | Full control, reuse existing cloud | 24/7 billing or manual shutdown |
| **Runpod / Lambda Labs / Vast** | Cheap GPU rentals | Manual spin-up per submission |
| **Kubernetes / HPC** | Reuse on-prem GPU | Ops overhead |
| **ngrok + local GPU** | $0, fastest iteration | URL is ephemeral |

The lab's reference MCP is hosted on Modal for serverless,
pay-per-use cost control; your submission does not have to match.

### Minimal stub

A ~150-line FastAPI template you can fork is at
[`example_mcp_server.py`](./example_mcp_server.py). Replace the
`handle_*` stubs with your implementations, deploy, and paste the URL
into the submission form's **Advanced: Custom MCP** section.
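
The template boils down to a single dispatch route. A condensed sketch of
that pattern (illustrative, not the actual file; handler names mirror the
tool names in `mcp_tool_schemas.json`):

```python
# Condensed sketch of the dispatch pattern in example_mcp_server.py;
# the real template is authoritative.
from fastapi import FastAPI, Request

app = FastAPI()


def handle_predict_structure(arguments: dict) -> dict:
    # Replace with a real structure-prediction call.
    return {"error": "predict_structure is not implemented yet"}


HANDLERS = {
    "predict_structure": handle_predict_structure,
    # ... one entry per tool in mcp_tool_schemas.json
}


@app.post("/")
async def call_tool(request: Request) -> dict:
    body = await request.json()
    handler = HANDLERS.get(body.get("name"))
    if handler is None:
        # Per the contract: errors go in the JSON body, not HTTP status codes.
        return {"error": f"unknown tool: {body.get('name')}"}
    return handler(body.get("arguments", {}))
```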

## Backend pipeline phases

Submission processing runs in 4 admin-controlled phases:

| Phase | Step | Status | Notes |
|---|---|---|---|
| **A** | Dispatch tasks → CPU scoring | live | HTTP POST to submitter endpoint, validate, score 5/6 components |
| **B** | Boltz-2 structure verification | live (Modal) | Modal-hosted A10G companion app provisions GPU on demand |
| **C** | LLM judge panel (28-pt hybrid) | live | 3-judge PoLL with self-exclusion; requires API key secrets |
| **D** | Finalize + publish to leaderboard | live | Aggregates hybrid scores, writes back to submissions dataset |

### Phase B architecture (Modal companion app)

The HF Space runs on `cpu-basic` and cannot host Boltz directly, so
Phase B uses a Modal-deployed sidecar (`modal_boltz_app.py`, sketched
below) that:

- pre-builds an image with `boltz==2.2.1`, `torch==2.10`, NVIDIA
  cuequivariance kernels, and FastAPI;
- exposes a single web endpoint at
  `https://<workspace>--bdb-boltz-predict.modal.run`;
- spins up an A10G on demand, runs `boltz predict` (via the same CLI
  the dev pipeline uses), and returns confidence metrics;
- auto-stops after 5 minutes idle so the lab is only billed for active
  inference time (~$0.06 per task at A10G rates).
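
For orientation, the sidecar's shape looks roughly like the following. This
is a hypothetical reconstruction, not `modal_boltz_app.py` itself: the
decorator and parameter names (`fastapi_endpoint`, `scaledown_window`) follow
recent Modal SDK releases and may differ in yours, and the confidence-file
location is an assumption about Boltz's output layout.

```python
# Hypothetical sketch of the sidecar's shape; modal_boltz_app.py is the
# source of truth. The real app also checks the shared bearer token.
import json
import subprocess
import tempfile
from pathlib import Path

import modal

image = modal.Image.debian_slim().pip_install("boltz==2.2.1", "fastapi[standard]")
app = modal.App("bdb-boltz", image=image)


@app.function(gpu="A10G", timeout=600, scaledown_window=300)  # stop after 5 min idle
@modal.fastapi_endpoint(method="POST")
def predict(payload: dict) -> dict:
    """Run `boltz predict` on one sequence and return confidence metrics."""
    workdir = Path(tempfile.mkdtemp())
    fasta = workdir / "query.fasta"
    fasta.write_text(f">query\n{payload['sequence']}\n")
    proc = subprocess.run(
        ["boltz", "predict", str(fasta), "--out_dir", str(workdir)],
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        return {"success": False, "error": proc.stderr[-2000:]}
    # Where Boltz writes its confidence JSON is an assumption here.
    conf = next(workdir.rglob("confidence_*.json"), None)
    return {"success": True, "metrics": json.loads(conf.read_text()) if conf else {}}
```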

The HF Space is just an HTTP client (`eval_boltz.py`); design sequences
are POSTed to the Modal endpoint with a shared bearer token. To
deploy the sidecar (a one-time step):

```bash
cd biodesignbench-leaderboard
modal deploy modal_boltz_app.py
```

Then set these HF Space secrets:

```
MODAL_BOLTZ_URL    https://<workspace>--bdb-boltz-predict.modal.run
MODAL_BOLTZ_TOKEN  the TOKEN value from the Modal secret `bdb-boltz-shared`
```

If `MODAL_BOLTZ_URL` is unset, Phase B predictors return a structured
failure dict with `success=False` and an actionable error message
instead of crashing the dispatcher.
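
On the Space side, that guard amounts to a few lines. A hypothetical sketch
of the `eval_boltz.py` client pattern (names and shapes are illustrative,
not the Space's actual code):

```python
# Hypothetical sketch of the eval_boltz.py client pattern.
import os

import requests


def boltz_verify(sequence: str) -> dict:
    url = os.environ.get("MODAL_BOLTZ_URL")
    if not url:
        # Fail soft with a structured dict rather than crashing the dispatcher.
        return {
            "success": False,
            "error": "MODAL_BOLTZ_URL is unset; run `modal deploy "
            "modal_boltz_app.py` and set the MODAL_BOLTZ_URL / "
            "MODAL_BOLTZ_TOKEN Space secrets.",
        }
    resp = requests.post(
        url,
        json={"sequence": sequence},
        headers={"Authorization": f"Bearer {os.environ.get('MODAL_BOLTZ_TOKEN', '')}"},
        timeout=600,
    )
    return resp.json()
```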