---
title: BioDesignBench Leaderboard
emoji: 🧬
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.50.0
app_file: app.py
pinned: false
license: mit
---

BioDesignBench Leaderboard

Evaluating LLM Agents on Protein Design via MCP Tools.

Romero Lab, Duke University

Overview

BioDesignBench evaluates LLM agents as orchestrators of multi-step, stochastic protein-design pipelines. This leaderboard tracks agent performance across 76 design tasks spanning a 2 × 5 design matrix (de novo design vs. redesign × five molecular families: antibody, binder, enzyme, scaffold, fluorescent protein; 9 of the 10 cells are occupied), scored on a 100-point hybrid rubric: 72 algorithmic points (Boltz-2 verification plus sequence/feasibility metrics) and 28 LLM-judge points (3-judge panel with self-exclusion).

The six rubric components are Approach, Orchestration, Quality, Feasibility, Novelty, and Diversity. See the About tab for the full methodology and the Depth Gap tab for evaluation-depth interventions.
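
For illustration only, here is a minimal sketch of how a hybrid total could be assembled under these numbers, assuming "self-exclusion" means a judge never scores a submission from its own model family; the judge names, field names, and aggregation below are illustrative, not the leaderboard's actual implementation.

```python
# Illustrative only: assumes self-exclusion drops the judge that shares a model
# family with the submitting agent, and that the remaining judges are averaged.
def hybrid_score(algorithmic_points: float, judge_points: dict, agent_family: str) -> float:
    eligible = [pts for family, pts in judge_points.items() if family != agent_family]
    judge_avg = sum(eligible) / len(eligible)   # averaged on the 28-point scale
    return algorithmic_points + judge_avg       # 72 algorithmic + 28 judge = 100

# Hypothetical judge families and scores:
print(hybrid_score(60.5, {"judge_a": 24.0, "judge_b": 26.0, "judge_c": 25.0}, "judge_b"))
# -> 85.0
```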

Features

  • Overall Leaderboard – Mixed-ranking table with human baselines and LLM agents
  • Taxonomy Heatmap – Per-cell scores across the 9 occupied cells of the 2 × 5 design matrix
  • Component Analysis – Radar and bar charts comparing the 6 scoring components
  • Guidance Effect – Paired comparison of the same LLM in unguided (atomic tools) vs. guided (composite workflows) mode
  • Depth Gap – Forced-depth and low-diversity intervention results
  • About – Methodology, submission guide, and citation info

Bringing your own MCP tools

BioDesignBench is "bring your own LLM, optionally bring your own tools." The Custom MCP submission mode lets you evaluate any 17-tool implementation against the same 76 tasks, the same agent harness, and the same scoring rubric used by the paper's reference runs. Our protein-design-mcp is just one reference implementation.

The contract

Your MCP server is a public HTTPS endpoint that accepts POST requests:

```
POST https://your-mcp.example.com/
Authorization: Bearer <optional shared token>
Content-Type: application/json

{
  "name": "predict_structure",
  "arguments": {"sequence": "MKKL..."}
}
```

The response must be a JSON object. Report errors in a top-level error field rather than via HTTP status codes so the agent loop can see the reason.
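
As a sketch of the client side of this contract (the endpoint URL and token are placeholders, and the error handling is illustrative), a single tool call amounts to:

```python
import requests

MCP_URL = "https://your-mcp.example.com/"                      # your endpoint
HEADERS = {"Authorization": "Bearer <optional shared token>"}

def call_tool(name: str, arguments: dict) -> dict:
    """POST a single tool call and return the parsed JSON body."""
    resp = requests.post(MCP_URL, json={"name": name, "arguments": arguments},
                         headers=HEADERS, timeout=300)
    result = resp.json()
    if "error" in result:
        # Errors arrive in the body, not as HTTP status codes, so the agent
        # loop can read the reason and adjust its next step.
        print(f"{name} failed: {result['error']}")
    return result

call_tool("predict_structure", {"sequence": "MKKL..."})
```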

Tool schemas

The 17 reference tool names plus the full JSON Schema for each argument set live in mcp_tool_schemas.json. Your implementation must accept the same tool names and argument shapes; the leaderboard's agent loop picks tools by name from this list.
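
A quick self-check before submitting, assuming mcp_tool_schemas.json maps each tool name to a JSON Schema for its arguments (adjust the lookup if the file is structured differently), might look like:

```python
import json
from jsonschema import ValidationError, validate

# Assumption: the file maps tool name -> JSON Schema for that tool's arguments.
with open("mcp_tool_schemas.json") as f:
    schemas = json.load(f)

def check_arguments(tool_name: str, arguments: dict) -> None:
    """Raise if the arguments do not satisfy the published schema for tool_name."""
    try:
        validate(instance=arguments, schema=schemas[tool_name])
    except ValidationError as err:
        raise ValueError(f"{tool_name}: arguments do not match the schema: {err.message}")

check_arguments("predict_structure", {"sequence": "MKKL..."})
```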

Reference implementation

See jasonkim8652/protein-design-mcp for the lab's reference implementation (also on PyPI: pip install protein-design-mcp). The repo includes a Dockerfile and a Modal deploy template (deploy/modal_app.py); fork it, swap in your own tool handlers, and redeploy.

Hosting options

Any public HTTPS endpoint works; we only check that the contract is satisfied:

| Option | Pros | Cons |
| --- | --- | --- |
| Modal (serverless GPU) | Cheap pay-per-use, auto-scale | Modal account needed |
| AWS / GCP / Azure VM | Full control, reuse existing cloud | 24/7 billing or manual shutdown |
| Runpod / Lambda Labs / Vast | Cheap GPU rentals | Manual spin-up per submission |
| Kubernetes / HPC | Reuse on-prem GPU | Ops overhead |
| ngrok + local GPU | $0, fastest iteration | URL is ephemeral |

The lab's reference MCP is hosted on Modal for serverless pay-per-use cost control; your submission does not have to match.

Minimal stub

A ~150-line FastAPI template you can fork is at example_mcp_server.py. Replace the handle_* stubs with your implementations, deploy, and paste the URL into the submission form's Advanced: Custom MCP section.
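
In spirit, the stub boils down to a dispatch-by-name FastAPI app like the sketch below (handler names and return fields are illustrative; the real template in example_mcp_server.py is more complete):

```python
from fastapi import FastAPI, Request

app = FastAPI()

def handle_predict_structure(arguments: dict) -> dict:
    # Replace with a real structure predictor; this stub only echoes its input.
    return {"note": "stub result", "sequence": arguments.get("sequence")}

# One handler per tool name listed in mcp_tool_schemas.json.
HANDLERS = {
    "predict_structure": handle_predict_structure,
    # ... remaining 16 tools ...
}

@app.post("/")
async def dispatch(request: Request) -> dict:
    payload = await request.json()
    name, arguments = payload.get("name"), payload.get("arguments", {})
    handler = HANDLERS.get(name)
    if handler is None:
        return {"error": f"unknown tool: {name}"}   # errors go in the body
    try:
        return handler(arguments)
    except Exception as exc:
        return {"error": str(exc)}
```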

Backend pipeline phases

Submission processing runs in 4 admin-controlled phases:

| Phase | Step | Status | Notes |
| --- | --- | --- | --- |
| A | Dispatch tasks → CPU scoring | live | HTTP POST to submitter endpoint, validate, score 5/6 components |
| B | Boltz-2 structure verification | live (Modal) | Modal-hosted A10G companion app provisions GPU on demand |
| C | LLM judge panel (28-pt hybrid) | live | 3-judge PoLL with self-exclusion, requires API key secrets |
| D | Finalize + publish to leaderboard | live | Aggregates hybrid scores, writes back to submissions dataset |

Phase B architecture (Modal companion app)

The HF Space runs on cpu-basic and cannot host Boltz directly, so Phase B uses a Modal-deployed sidecar (modal_boltz_app.py) that:

  • pre-builds an image with boltz==2.2.1, torch==2.10, NVIDIA cuequivariance kernels, and FastAPI;
  • exposes a single web endpoint at https://<workspace>--bdb-boltz-predict.modal.run;
  • spins up an A10G on demand, runs boltz predict (via the same CLI the dev pipeline uses), and returns confidence metrics;
  • auto-stops after 5 minutes idle so the lab is only billed for active inference time (~$0.06 per task at A10G rates).
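
A condensed sketch of such a sidecar, using Modal's Python API as of recent releases (decorator and parameter names vary between Modal versions, and the real modal_boltz_app.py also handles auth and the actual boltz CLI call):

```python
import modal

image = modal.Image.debian_slim().pip_install("boltz==2.2.1", "fastapi[standard]")
app = modal.App("bdb-boltz", image=image)

@app.function(gpu="A10G", scaledown_window=300)   # auto-stop after 5 minutes idle
@modal.fastapi_endpoint(method="POST")
def predict(payload: dict) -> dict:
    # Placeholder body: the deployed app runs `boltz predict` on the submitted
    # sequence and returns the resulting confidence metrics.
    sequence = payload.get("sequence", "")
    return {"success": True, "sequence_length": len(sequence), "confidence": None}
```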

The HF Space is just an HTTP client (eval_boltz.py); design sequences are POSTed to the Modal endpoint with a shared bearer token. To deploy the sidecar (one time):

```bash
cd biodesignbench-leaderboard
modal deploy modal_boltz_app.py
```

Then set these HF Space secrets:

```
MODAL_BOLTZ_URL    https://<workspace>--bdb-boltz-predict.modal.run
MODAL_BOLTZ_TOKEN  matches the TOKEN field of the Modal secret bdb-boltz-shared
```

If MODAL_BOLTZ_URL is unset, Phase B predictors return a structured failure dict with success=False and an actionable error message instead of crashing the dispatcher.
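
A sketch of that client-side behavior (field names beyond success and error are illustrative):

```python
import os
import requests

def predict_structure(sequence: str) -> dict:
    """POST a design sequence to the Modal sidecar, or fail gracefully."""
    url = os.environ.get("MODAL_BOLTZ_URL")
    if not url:
        # Structured failure instead of an exception, so the dispatcher keeps running.
        return {"success": False,
                "error": "MODAL_BOLTZ_URL is not set; deploy modal_boltz_app.py "
                         "and add the secret in the Space settings."}
    resp = requests.post(
        url,
        json={"sequence": sequence},
        headers={"Authorization": f"Bearer {os.environ.get('MODAL_BOLTZ_TOKEN', '')}"},
        timeout=600,
    )
    return resp.json()
```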