---
title: BioDesignBench Leaderboard
emoji: "\U0001F9EC"
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: "5.50.0"
app_file: app.py
pinned: false
license: mit
---

# BioDesignBench Leaderboard

Evaluating LLM Agents on Protein Design via MCP Tools.

**Romero Lab, Duke University**

## Overview

BioDesignBench evaluates LLM agents as orchestrators of multi-step *stochastic* protein-design pipelines. This leaderboard tracks agent performance across **76 design tasks** spanning a **2 × 5 design matrix** (de novo design vs. redesign × five molecular families: antibody, binder, enzyme, scaffold, fluorescent protein; **9 occupied cells**), scored on a 100-point hybrid rubric: **72 algorithmic points** (Boltz-2 verification + sequence/feasibility metrics) plus **28 LLM-judge points** (3-judge panel with self-exclusion). The six rubric components are Approach, Orchestration, Quality, Feasibility, Novelty, and Diversity. See the *About* tab for the full methodology and the *Depth Gap* tab for evaluation-depth interventions.

## Features

- **Overall Leaderboard** — Mixed-ranking table with human baselines and LLM agents
- **Taxonomy Heatmap** — Per-cell scores across the 9 occupied cells of the 2 × 5 design matrix
- **Component Analysis** — Radar and bar charts comparing the 6 scoring components
- **Guidance Effect** — Paired comparison of the same LLM in unguided (atomic tools) vs. guided (composite workflows) mode
- **Depth Gap** — Forced-depth and low-diversity intervention results
- **About** — Methodology, submission guide, and citation info

## Bringing your own MCP tools

BioDesignBench is "bring your own LLM, optionally bring your own tools." The **Custom MCP** submission mode lets you evaluate any 17-tool implementation against the identical 76 tasks, the identical agent harness, and the identical scoring rubric used by the paper's reference runs. Our [`protein-design-mcp`](https://github.com/jasonkim8652/protein-design-mcp) is just one reference implementation.

### The contract

Your MCP server is a public HTTPS endpoint that accepts POST requests:

```
POST https://your-mcp.example.com/
Authorization: Bearer <token>
Content-Type: application/json

{
  "name": "predict_structure",
  "arguments": {"sequence": "MKKL..."}
}
```

The response must be a JSON object. Report errors in a top-level `error` field rather than via HTTP status codes, so the agent loop can see the reason.

### Tool schemas

The 17 reference tool names, plus the full JSON Schema for each argument set, live in [`mcp_tool_schemas.json`](./mcp_tool_schemas.json). Your implementation must accept the same tool names and argument shapes — the leaderboard's agent loop picks tools by name from this list.

### Reference implementation

See [`jasonkim8652/protein-design-mcp`](https://github.com/jasonkim8652/protein-design-mcp) for the lab's reference implementation (also on PyPI: `pip install protein-design-mcp`). The repo includes a Dockerfile and a Modal deploy template (`deploy/modal_app.py`) that you can fork, swap your own tool handlers into, and redeploy.
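For concreteness, here is a minimal sketch of an endpoint that satisfies the contract above. It is illustrative only: the tool handler, its result fields, and the omitted bearer-token check are placeholder assumptions, not the reference implementation (see the fork-ready `example_mcp_server.py` template below).

```python
# Minimal sketch of the MCP contract (hypothetical handlers, not the reference server).
# Auth (bearer-token) checking is omitted for brevity.
from fastapi import FastAPI, Request

app = FastAPI()

def handle_predict_structure(arguments: dict) -> dict:
    sequence = arguments["sequence"]
    # ... run your structure predictor here ...
    return {"plddt": 0.0, "pdb": "..."}  # placeholder result shape

# Tool names must match those in mcp_tool_schemas.json.
TOOL_HANDLERS = {"predict_structure": handle_predict_structure}

@app.post("/")
async def call_tool(request: Request) -> dict:
    payload = await request.json()
    name = payload.get("name")
    arguments = payload.get("arguments", {})
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        # Errors go in a top-level `error` field rather than an HTTP status code,
        # so the agent loop can read the reason.
        return {"error": f"unknown tool: {name}"}
    try:
        return handler(arguments)
    except Exception as exc:
        return {"error": str(exc)}
```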
### Hosting options

Any public HTTPS endpoint works — we only check that the contract is satisfied:

| Option | Pros | Cons |
|---|---|---|
| **Modal** (serverless GPU) | Cheap pay-per-use, auto-scale | Modal account needed |
| **AWS / GCP / Azure VM** | Full control, reuse existing cloud | 24/7 billing or manual shutdown |
| **Runpod / Lambda Labs / Vast** | Cheap GPU rentals | Manual spin-up per submission |
| **Kubernetes / HPC** | Reuse on-prem GPU | Ops overhead |
| **ngrok + local GPU** | $0, fastest iteration | URL is ephemeral |

The lab's reference MCP is hosted on Modal for serverless pay-per-use cost control; your submission does not have to match.

### Minimal stub

A ~150-line FastAPI template you can fork is at [`example_mcp_server.py`](./example_mcp_server.py). Replace the `handle_*` stubs with your implementations, deploy, and paste the URL into the submission form's **Advanced: Custom MCP** section.

## Backend pipeline phases

Submission processing runs in 4 admin-controlled phases:

| Phase | Step | Status | Notes |
|---|---|---|---|
| **A** | Dispatch tasks → CPU scoring | live | HTTP POST to submitter endpoint; validate; score 5 of the 6 components |
| **B** | Boltz-2 structure verification | live (Modal) | Modal-hosted A10G companion app provisions a GPU on demand |
| **C** | LLM judge panel (28-pt hybrid) | live | 3-judge PoLL with self-exclusion; requires API key secrets |
| **D** | Finalize + publish to leaderboard | live | Aggregates hybrid scores; writes back to the submissions dataset |

### Phase B architecture (Modal companion app)

The HF Space runs on `cpu-basic` and cannot host Boltz directly, so Phase B uses a Modal-deployed sidecar (`modal_boltz_app.py`) that:

- pre-builds an image with `boltz==2.2.1`, `torch==2.10`, NVIDIA cuequivariance kernels, and FastAPI;
- exposes a single web endpoint at `https://<workspace>--bdb-boltz-predict.modal.run`;
- spins up an A10G on demand, runs `boltz predict` (via the same CLI the dev pipeline uses), and returns confidence metrics;
- auto-stops after 5 minutes idle, so the lab is only billed for active inference time (~$0.06 per task at A10G rates).

The HF Space is just an HTTP client (`eval_boltz.py`): design sequences are POSTed to the Modal endpoint with a shared bearer token.

To deploy the sidecar (one time):

```bash
cd biodesignbench-leaderboard
modal deploy modal_boltz_app.py
```

Then set these HF Space secrets:

```
MODAL_BOLTZ_URL     https://<workspace>--bdb-boltz-predict.modal.run
MODAL_BOLTZ_TOKEN   must match the `TOKEN` value in the Modal secret `bdb-boltz-shared`
```

If `MODAL_BOLTZ_URL` is unset, Phase B predictors return a structured failure dict with `success=False` and an actionable error message instead of crashing the dispatcher.
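As a rough illustration of that client-side behavior, the sketch below shows how such a predictor might be structured. The function name and the payload/response fields are assumptions for illustration, not the actual `eval_boltz.py` interface.

```python
# Sketch of the Phase B client behavior described above.
# Names and fields are illustrative assumptions, not the real eval_boltz.py API.
import os
import requests

def predict_with_boltz(sequence: str, timeout: int = 600) -> dict:
    url = os.environ.get("MODAL_BOLTZ_URL")
    token = os.environ.get("MODAL_BOLTZ_TOKEN", "")
    if not url:
        # Structured failure instead of an exception, so the dispatcher keeps running.
        return {
            "success": False,
            "error": "MODAL_BOLTZ_URL is not set; deploy modal_boltz_app.py and add the Space secret.",
        }
    resp = requests.post(
        url,
        json={"sequence": sequence},
        headers={"Authorization": f"Bearer {token}"},
        timeout=timeout,
    )
    data = resp.json()  # confidence metrics returned by the Modal sidecar
    return {"success": True, **data}
```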