Spaces: RomeroLab-Duke/BioDesignBench-Leaderboard

Jasonkim8652 committed on Mar 10

Commit c59de83 · verified · 1 parent: b34cf54

update leaderboard with rescored results and fair diversity formula

Files changed (4)

README.md +79 -13
app.py +3 -3
eval_scorer.py +6 -5
leaderboard_data.json +401 -250

README.md CHANGED

@@ -12,22 +12,88 @@ license: mit
 # BioDesignBench Leaderboard
-Evaluating LLM Agents on Protein Design via MCP Tools.
 **Romero Lab, Duke University**
-## Overview
-BioDesignBench is the first comprehensive benchmark for evaluating LLM agents on
-protein design tasks via MCP (Model Context Protocol) tool use. This leaderboard
-tracks agent performance across 76 design tasks spanning 17 taxonomy cells
-(5 DesignTaskTypes x 6 BiologicalContexts), scored on a 100-point rubric with
-6 components: Approach, Orchestration, Quality, Feasibility, Novelty, Diversity.
-## Features
-- **Overall Leaderboard** — Mixed-ranking table with baselines and LLM agents
-- **Taxonomy Breakdown** — Heatmap of per-cell scores across 17 taxonomy cells
-- **Component Analysis** — Radar and bar charts comparing 6 scoring components
-- **Benchmark vs User Mode** — Paired comparison of the same LLM in two modes
-- **About** — Methodology, submission guide, and citation info

 # BioDesignBench Leaderboard
+Interactive leaderboard for **BioDesignBench**, a benchmark evaluating LLM agents on protein design tasks via MCP (Model Context Protocol) tool use.
 **Romero Lab, Duke University**
+## What the leaderboard shows
+- **Overall Leaderboard** -- Mixed-ranking table with human baselines and LLM agents, filterable by mode (benchmark/user), MCP tool type (reference/custom), and entry type.
+- **Taxonomy Breakdown** -- Heatmap of per-cell scores across 17 taxonomy cells (5 task types x 5 biological contexts) with average-per-type bar chart.
+- **Component Analysis** -- Radar and grouped bar charts comparing the 6 scoring components (Approach, Orchestration, Quality, Feasibility, Novelty, Diversity) between any two agents.
+- **Benchmark vs User Mode** -- Paired comparison showing how the same LLM performs with minimal prompting (benchmark) vs rich guidance (user mode).
+- **Submit** -- Form to submit your own protein design agent for evaluation.
+- **About** -- Methodology, scoring rubric, submission guide, and citation.
+## Run locally
+```bash
+pip install -r requirements.txt
+python app.py
+```
+The app launches a Gradio server at `http://localhost:7860`.
+## HuggingFace Space deployment
+This directory is structured as a self-contained HF Space. To deploy:
+1. Create a new Space on HuggingFace (`sdk: gradio`).
+2. Push the contents of this directory to the Space repo.
+3. Set the `BDB_ADMIN_PASSWORD` secret in the Space settings for admin panel access.
+4. Optionally set `HF_TOKEN` for submission queue access (private dataset).
+The Space will automatically build and serve the leaderboard.
+## How to update results
+Add new entries to `leaderboard_data.json` following the existing schema:
+```json
+{
+  "agent_name": "Your Agent",
+  "agent_id": "your-agent-user",
+  "mode": "user",
+  "mcp_custom": false,
+  "submission_type": "llm",
+  "organization": "Your Org",
+  "overall_score": 42.0,
+  "component_scores": {
+    "approach": 10.0,
+    "orchestration": 8.0,
+    "quality": 14.0,
+    "feasibility": 6.0,
+    "novelty": 2.0,
+    "diversity": 2.0
+  },
+  "taxonomy_scores": {
+    "de_novo_binder": {"ab": 45, "enz": 40, "sig": 43},
+    "sequence_optimization": {"ab": 50, "enz": 42, "sig": 38, "str": 44, "flu": 52},
+    "de_novo_backbone": {"str": 28},
+    "complex_engineering": {"enz": 40, "sig": 44, "str": 46},
+    "conformational_design": {"enz": 38, "sig": 42, "str": 40, "flu": 44}
+  },
+  "tasks_completed": 76,
+  "tasks_total": 76,
+  "tasks_with_zero": 4,
+  "avg_latency_sec": 50.0,
+  "submission_date": "2026-03-15"
+}
+```
+Update the `last_updated` field at the top of the JSON file after adding entries.
+## File overview
+| File | Description |
+|------|-------------|
+| `app.py` | Main Gradio application with 7 tabs |
+| `leaderboard_data.json` | Current benchmark results |
+| `mcp_tool_schemas.json` | 17 reference MCP tool schemas |
+| `eval_scorer.py` | Self-contained 100-point scoring rubric |
+| `eval_queue.py` | Submission queue (HuggingFace Datasets) |
+| `eval_dispatcher.py` | HTTP task dispatcher for benchmarking |
+| `eval_boltz.py` | Boltz structure prediction post-eval |
+| `eval_tasks.py` | Hidden task loader from HF Dataset |
+| `example_server.py` | Reference FastAPI server for submitters |
+| `requirements.txt` | Python dependencies |
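
The update procedure in the new README (append an entry to `leaderboard_data.json`, then bump `last_updated`) could be scripted roughly as follows. This is a minimal sketch and not part of the commit: the helper name `add_entry` and the duplicate-`agent_id` check are assumptions, and only the `entries`, `agent_id`, and `last_updated` field names come from the schema in the README.

```python
import json
from datetime import date
from pathlib import Path


def add_entry(path, entry, today=None):
    """Append `entry` to the leaderboard file and refresh `last_updated`."""
    path = Path(path)
    data = json.loads(path.read_text(encoding="utf-8"))

    # Assumption: agent_id is meant to be unique across entries.
    existing_ids = {e.get("agent_id") for e in data["entries"]}
    if entry["agent_id"] in existing_ids:
        raise ValueError(f"duplicate agent_id: {entry['agent_id']}")

    data["entries"].append(entry)
    # ISO date (YYYY-MM-DD), matching the format used in the file.
    data["last_updated"] = (today or date.today()).isoformat()
    path.write_text(json.dumps(data, indent=2) + "\n", encoding="utf-8")
    return data
```

Writing the date and the entry in one step avoids the easy mistake of adding results while leaving a stale `last_updated`.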

app.py CHANGED

@@ -20,7 +20,7 @@ from pathlib import Path
 import gradio as gr
 import plotly.graph_objects as go
-ADMIN_PASSWORD = os.environ.get("BDB_ADMIN_PASSWORD", "biodesignbench2026")
 # ═══════════════════════════════════════════════════════════════════
@@ -28,8 +28,8 @@ ADMIN_PASSWORD = os.environ.get("BDB_ADMIN_PASSWORD", "biodesignbench2026")
 # ═══════════════════════════════════════════════════════════════════
 PAPER_URL = "#"
-GITHUB_URL = "#"
-HF_URL = "#"
 # ═══════════════════════════════════════════════════════════════════

 import gradio as gr
 import plotly.graph_objects as go
+ADMIN_PASSWORD = os.environ.get("BDB_ADMIN_PASSWORD", "")
 # ═══════════════════════════════════════════════════════════════════
 # ═══════════════════════════════════════════════════════════════════
 PAPER_URL = "#"
+GITHUB_URL = "https://github.com/biodesignbench/biodesignbench"
+HF_URL = "https://huggingface.co/spaces/biodesignbench/leaderboard"
 # ═══════════════════════════════════════════════════════════════════
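
With the fallback changed to an empty string, it may be worth confirming that an unset `BDB_ADMIN_PASSWORD` disables the admin panel rather than accepting an empty password. The password check in `app.py` is not shown in this diff, so the following is only a sketch of a fail-closed comparison. `is_admin` is a hypothetical helper name, not a function from the repository.

```python
import hmac
import os

ADMIN_PASSWORD = os.environ.get("BDB_ADMIN_PASSWORD", "")


def is_admin(candidate, expected=ADMIN_PASSWORD):
    """Return True only if a password is configured and `candidate` matches it."""
    if not expected:
        # No secret configured: refuse everything, including an empty input.
        return False
    # compare_digest performs a constant-time comparison.
    return hmac.compare_digest(candidate.encode("utf-8"), expected.encode("utf-8"))
```

A plain `candidate == ADMIN_PASSWORD` check would accept an empty submission whenever the secret is missing, which is the case this guard is meant to rule out.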

eval_scorer.py CHANGED

@@ -1368,14 +1368,15 @@ def score_diversity(
         return {"score": 0, "max": max_points, "num_designs": 0, "pairwise_diversity": 0.0, "entropy": 0.0}
     num = len(designs)
-    count_fraction = min(num / max_designs, 1.0) if max_designs > 0 else 1.0
     diversity = mean_pairwise_diversity(designs)
     entropy = sequence_entropy(designs)
-    count_score = count_fraction * max_points * 0.4
-    diversity_score = diversity * max_points * 0.4
-    entropy_score = entropy * max_points * 0.2
-    total = int(round(count_score + diversity_score + entropy_score))
     return {
         "score": min(total, max_points), "max": max_points,

         return {"score": 0, "max": max_points, "num_designs": 0, "pairwise_diversity": 0.0, "entropy": 0.0}
     num = len(designs)
     diversity = mean_pairwise_diversity(designs)
     entropy = sequence_entropy(designs)
+    # Score based purely on sequence diversity (not design count).
+    # Tasks don't specify how many designs to produce, so counting
+    # would unfairly penalise agents that submit fewer designs.
+    diversity_score = diversity * max_points * 0.65
+    entropy_score = entropy * max_points * 0.35
+    total = int(round(diversity_score + entropy_score))
     return {
         "score": min(total, max_points), "max": max_points,

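The effect of dropping the count term can be seen with a small worked comparison of the two weightings. This is a sketch under stated assumptions: it takes the pairwise diversity and entropy values as inputs in the range 0 to 1 instead of calling `mean_pairwise_diversity` and `sequence_entropy` (whose implementations are not part of this diff), it treats an empty design list as the zero-score case, and the function names are illustrative.

```python
def diversity_points_old(num, max_designs, diversity, entropy, max_points):
    """Pre-commit weighting: 40% design count, 40% pairwise diversity, 20% entropy."""
    if num == 0:
        return 0
    count_fraction = min(num / max_designs, 1.0) if max_designs > 0 else 1.0
    total = int(round(
        count_fraction * max_points * 0.4
        + diversity * max_points * 0.4
        + entropy * max_points * 0.2
    ))
    return min(total, max_points)


def diversity_points_new(num, diversity, entropy, max_points):
    """Post-commit weighting: 65% pairwise diversity, 35% entropy, no count term."""
    if num == 0:
        return 0
    total = int(round(diversity * max_points * 0.65 + entropy * max_points * 0.35))
    return min(total, max_points)


# Same diversity statistics, submitted as 2 designs vs 10 designs.
for num in (2, 10):
    old = diversity_points_old(num, 10, 0.5, 0.4, 10)
    new = diversity_points_new(num, 0.5, 0.4, 10)
    print(num, old, new)
```

With these illustrative inputs the old formula gives 4 points for two designs and 7 for ten, while the new formula gives 5 in both cases, so the score no longer depends on how many designs were submitted.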
leaderboard_data.json CHANGED

@@ -1,34 +1,53 @@
 {
-  "last_updated": "2026-03-03",
   "entries": [
     {
-      "agent_name": "Human Oracle",
-      "agent_id": "human-oracle",
       "mode": null,
       "mcp_custom": false,
-      "submission_type": "human_oracle",
       "organization": "Ground Truth",
-      "overall_score": 85.0,
       "component_scores": {
-        "approach": 17.5,
-        "orchestration": 13.5,
-        "quality": 30.0,
-        "feasibility": 13.8,
-        "novelty": 3.5,
-        "diversity": 6.7
       },
       "taxonomy_scores": {
-        "de_novo_binder": {"ab": 88, "enz": 82, "sig": 86},
-        "sequence_optimization": {"ab": 90, "enz": 85, "sig": 80, "str": 87, "flu": 92},
-        "de_novo_backbone": {"str": 75},
-        "complex_engineering": {"enz": 80, "sig": 85, "str": 88},
-        "conformational_design": {"enz": 78, "sig": 82, "str": 80, "flu": 85}
       },
       "tasks_completed": 76,
       "tasks_total": 76,
       "tasks_with_zero": 0,
       "avg_latency_sec": null,
-      "submission_date": "2026-03-01"
     },
     {
       "agent_name": "Human Expert",
@@ -36,231 +55,335 @@
       "mode": null,
       "mcp_custom": false,
       "submission_type": "human_expert",
-      "organization": "Manual (Jason)",
-      "overall_score": 62.0,
-      "component_scores": {
-        "approach": 14.0,
-        "orchestration": 11.0,
-        "quality": 20.5,
-        "feasibility": 10.5,
-        "novelty": 2.5,
-        "diversity": 3.5
-      },
-      "taxonomy_scores": {
-        "de_novo_binder": {"ab": 65, "enz": 58, "sig": 63},
-        "sequence_optimization": {"ab": 70, "enz": 62, "sig": 55, "str": 64, "flu": 72},
-        "de_novo_backbone": {"str": 50},
-        "complex_engineering": {"enz": 58, "sig": 62, "str": 66},
-        "conformational_design": {"enz": 55, "sig": 60, "str": 58, "flu": 62}
-      },
-      "tasks_completed": 76,
-      "tasks_total": 76,
-      "tasks_with_zero": 2,
-      "avg_latency_sec": null,
-      "submission_date": "2026-03-01"
-    },
-    {
-      "agent_name": "Hardcoded Pipeline",
-      "agent_id": "hardcoded-pipeline",
-      "mode": null,
-      "mcp_custom": false,
-      "submission_type": "hardcoded",
-      "organization": "Deterministic",
-      "overall_score": 41.5,
       "component_scores": {
-        "approach": 10.0,
-        "orchestration": 9.5,
-        "quality": 12.0,
-        "feasibility": 6.5,
-        "novelty": 1.5,
-        "diversity": 2.0
       },
       "taxonomy_scores": {
-        "de_novo_binder": {"ab": 42, "enz": 38, "sig": 44},
-        "sequence_optimization": {"ab": 48, "enz": 40, "sig": 35, "str": 42, "flu": 50},
-        "de_novo_backbone": {"str": 30},
-        "complex_engineering": {"enz": 38, "sig": 42, "str": 45},
-        "conformational_design": {"enz": 35, "sig": 40, "str": 38, "flu": 42}
       },
       "tasks_completed": 76,
       "tasks_total": 76,
-      "tasks_with_zero": 5,
       "avg_latency_sec": null,
-      "submission_date": "2026-03-01"
     },
     {
-      "agent_name": "Claude-4.5",
-      "agent_id": "claude45-user",
       "mode": "user",
       "mcp_custom": false,
       "submission_type": "llm",
-      "organization": "Anthropic",
-      "overall_score": 35.0,
       "component_scores": {
-        "approach": 8.5,
-        "orchestration": 7.0,
-        "quality": 10.5,
-        "feasibility": 5.5,
-        "novelty": 1.5,
-        "diversity": 2.0
       },
       "taxonomy_scores": {
-        "de_novo_binder": {"ab": 38, "enz": 32, "sig": 36},
-        "sequence_optimization": {"ab": 42, "enz": 35, "sig": 30, "str": 36, "flu": 44},
-        "de_novo_backbone": {"str": 22},
-        "complex_engineering": {"enz": 32, "sig": 36, "str": 38},
-        "conformational_design": {"enz": 30, "sig": 34, "str": 32, "flu": 36}
       },
       "tasks_completed": 76,
       "tasks_total": 76,
-      "tasks_with_zero": 6,
-      "avg_latency_sec": 52.3,
-      "submission_date": "2026-03-01"
     },
     {
-      "agent_name": "GPT-5",
-      "agent_id": "gpt5-user",
-      "mode": "user",
       "mcp_custom": false,
-      "submission_type": "llm",
-      "organization": "OpenAI",
-      "overall_score": 33.0,
       "component_scores": {
-        "approach": 8.0,
-        "orchestration": 6.5,
-        "quality": 10.0,
-        "feasibility": 5.0,
-        "novelty": 1.5,
         "diversity": 2.0
       },
       "taxonomy_scores": {
-        "de_novo_binder": {"ab": 35, "enz": 30, "sig": 34},
-        "sequence_optimization": {"ab": 40, "enz": 33, "sig": 28, "str": 34, "flu": 42},
-        "de_novo_backbone": {"str": 20},
-        "complex_engineering": {"enz": 30, "sig": 34, "str": 36},
-        "conformational_design": {"enz": 28, "sig": 32, "str": 30, "flu": 34}
       },
       "tasks_completed": 76,
       "tasks_total": 76,
-      "tasks_with_zero": 8,
-      "avg_latency_sec": 45.2,
-      "submission_date": "2026-03-01"
     },
     {
-      "agent_name": "Deepseek-v3.2",
-      "agent_id": "deepseek32-user",
-      "mode": "user",
       "mcp_custom": false,
       "submission_type": "llm",
-      "organization": "Deepseek",
-      "overall_score": 30.0,
       "component_scores": {
-        "approach": 7.2,
-        "orchestration": 6.0,
-        "quality": 9.0,
-        "feasibility": 4.5,
-        "novelty": 1.3,
-        "diversity": 2.0
       },
       "taxonomy_scores": {
-        "de_novo_binder": {"ab": 32, "enz": 28, "sig": 31},
-        "sequence_optimization": {"ab": 36, "enz": 30, "sig": 25, "str": 31, "flu": 38},
-        "de_novo_backbone": {"str": 18},
-        "complex_engineering": {"enz": 28, "sig": 31, "str": 33},
-        "conformational_design": {"enz": 25, "sig": 29, "str": 28, "flu": 31}
       },
       "tasks_completed": 76,
       "tasks_total": 76,
-      "tasks_with_zero": 10,
-      "avg_latency_sec": 38.7,
-      "submission_date": "2026-03-02"
     },
     {
-      "agent_name": "Gemini-2.5-Pro",
-      "agent_id": "gemini25-user",
       "mode": "user",
       "mcp_custom": false,
       "submission_type": "llm",
-      "organization": "Google",
-      "overall_score": 28.0,
       "component_scores": {
-        "approach": 6.5,
-        "orchestration": 5.5,
-        "quality": 8.5,
-        "feasibility": 4.5,
-        "novelty": 1.2,
-        "diversity": 1.8
       },
       "taxonomy_scores": {
-        "de_novo_binder": {"ab": 30, "enz": 25, "sig": 29},
-        "sequence_optimization": {"ab": 34, "enz": 28, "sig": 22, "str": 29, "flu": 36},
-        "de_novo_backbone": {"str": 16},
-        "complex_engineering": {"enz": 25, "sig": 28, "str": 30},
-        "conformational_design": {"enz": 22, "sig": 27, "str": 25, "flu": 29}
       },
       "tasks_completed": 76,
       "tasks_total": 76,
-      "tasks_with_zero": 12,
-      "avg_latency_sec": 55.1,
-      "submission_date": "2026-03-02"
     },
     {
-      "agent_name": "QWEN-3.5",
-      "agent_id": "qwen35-user",
       "mode": "user",
       "mcp_custom": false,
       "submission_type": "llm",
-      "organization": "Alibaba",
-      "overall_score": 26.0,
       "component_scores": {
-        "approach": 6.0,
-        "orchestration": 5.0,
-        "quality": 8.0,
-        "feasibility": 4.0,
-        "novelty": 1.2,
-        "diversity": 1.8
       },
       "taxonomy_scores": {
-        "de_novo_binder": {"ab": 28, "enz": 23, "sig": 27},
-        "sequence_optimization": {"ab": 32, "enz": 26, "sig": 20, "str": 27, "flu": 34},
-        "de_novo_backbone": {"str": 14},
-        "complex_engineering": {"enz": 23, "sig": 26, "str": 28},
-        "conformational_design": {"enz": 20, "sig": 25, "str": 23, "flu": 27}
       },
       "tasks_completed": 76,
       "tasks_total": 76,
-      "tasks_with_zero": 14,
-      "avg_latency_sec": 41.8,
-      "submission_date": "2026-03-02"
     },
     {
-      "agent_name": "Claude-4.5",
-      "agent_id": "claude45-benchmark",
       "mode": "benchmark",
       "mcp_custom": false,
       "submission_type": "llm",
       "organization": "Anthropic",
-      "overall_score": 20.0,
       "component_scores": {
-        "approach": 5.5,
-        "orchestration": 3.5,
-        "quality": 6.0,
-        "feasibility": 3.0,
-        "novelty": 1.0,
-        "diversity": 1.0
       },
       "taxonomy_scores": {
-        "de_novo_binder": {"ab": 22, "enz": 18, "sig": 21},
-        "sequence_optimization": {"ab": 25, "enz": 20, "sig": 16, "str": 21, "flu": 28},
-        "de_novo_backbone": {"str": 12},
-        "complex_engineering": {"enz": 18, "sig": 20, "str": 22},
-        "conformational_design": {"enz": 16, "sig": 19, "str": 18, "flu": 20}
       },
       "tasks_completed": 76,
       "tasks_total": 76,
-      "tasks_with_zero": 14,
-      "avg_latency_sec": 48.5,
-      "submission_date": "2026-03-01"
     },
     {
       "agent_name": "GPT-5",
@@ -269,114 +392,142 @@
       "mcp_custom": false,
       "submission_type": "llm",
       "organization": "OpenAI",
-      "overall_score": 18.5,
       "component_scores": {
         "approach": 5.2,
-        "orchestration": 3.1,
-        "quality": 5.8,
-        "feasibility": 2.5,
-        "novelty": 0.9,
-        "diversity": 1.0
-      },
-      "taxonomy_scores": {
-        "de_novo_binder": {"ab": 20, "enz": 16, "sig": 19},
-        "sequence_optimization": {"ab": 23, "enz": 18, "sig": 14, "str": 19, "flu": 26},
-        "de_novo_backbone": {"str": 10},
-        "complex_engineering": {"enz": 16, "sig": 18, "str": 20},
-        "conformational_design": {"enz": 14, "sig": 17, "str": 16, "flu": 18}
-      },
-      "tasks_completed": 76,
-      "tasks_total": 76,
-      "tasks_with_zero": 16,
-      "avg_latency_sec": 42.0,
-      "submission_date": "2026-03-01"
-    },
-    {
-      "agent_name": "Deepseek-v3.2",
-      "agent_id": "deepseek32-benchmark",
-      "mode": "benchmark",
-      "mcp_custom": false,
-      "submission_type": "llm",
-      "organization": "Deepseek",
-      "overall_score": 16.0,
-      "component_scores": {
-        "approach": 4.5,
-        "orchestration": 2.8,
-        "quality": 5.0,
-        "feasibility": 2.2,
-        "novelty": 0.7,
-        "diversity": 0.8
       },
       "taxonomy_scores": {
-        "de_novo_binder": {"ab": 18, "enz": 14, "sig": 17},
-        "sequence_optimization": {"ab": 20, "enz": 16, "sig": 12, "str": 17, "flu": 22},
-        "de_novo_backbone": {"str": 8},
-        "complex_engineering": {"enz": 14, "sig": 16, "str": 18},
-        "conformational_design": {"enz": 12, "sig": 15, "str": 14, "flu": 16}
       },
       "tasks_completed": 76,
       "tasks_total": 76,
-      "tasks_with_zero": 18,
-      "avg_latency_sec": 35.2,
-      "submission_date": "2026-03-02"
     },
     {
-      "agent_name": "Gemini-2.5-Pro",
-      "agent_id": "gemini25-benchmark",
-      "mode": "benchmark",
       "mcp_custom": false,
       "submission_type": "llm",
       "organization": "Google",
-      "overall_score": 15.0,
       "component_scores": {
-        "approach": 4.2,
-        "orchestration": 2.5,
-        "quality": 4.5,
-        "feasibility": 2.0,
-        "novelty": 0.8,
-        "diversity": 1.0
       },
       "taxonomy_scores": {
-        "de_novo_binder": {"ab": 16, "enz": 12, "sig": 16},
-        "sequence_optimization": {"ab": 18, "enz": 15, "sig": 10, "str": 16, "flu": 20},
-        "de_novo_backbone": {"str": 8},
-        "complex_engineering": {"enz": 12, "sig": 15, "str": 16},
-        "conformational_design": {"enz": 10, "sig": 14, "str": 12, "flu": 15}
       },
       "tasks_completed": 76,
       "tasks_total": 76,
-      "tasks_with_zero": 20,
-      "avg_latency_sec": 50.3,
-      "submission_date": "2026-03-02"
     },
     {
-      "agent_name": "QWEN-3.5",
-      "agent_id": "qwen35-benchmark",
       "mode": "benchmark",
       "mcp_custom": false,
       "submission_type": "llm",
-      "organization": "Alibaba",
-      "overall_score": 14.0,
       "component_scores": {
-        "approach": 3.8,
-        "orchestration": 2.2,
-        "quality": 4.2,
-        "feasibility": 2.0,
-        "novelty": 0.8,
-        "diversity": 1.0
       },
       "taxonomy_scores": {
-        "de_novo_binder": {"ab": 15, "enz": 11, "sig": 14},
-        "sequence_optimization": {"ab": 17, "enz": 14, "sig": 10, "str": 15, "flu": 18},
-        "de_novo_backbone": {"str": 7},
-        "complex_engineering": {"enz": 11, "sig": 14, "str": 15},
-        "conformational_design": {"enz": 10, "sig": 13, "str": 11, "flu": 14}
       },
       "tasks_completed": 76,
       "tasks_total": 76,
-      "tasks_with_zero": 22,
-      "avg_latency_sec": 39.5,
-      "submission_date": "2026-03-02"
     }
   ]
-}

 {
+  "last_updated": "2026-03-10",
   "entries": [
     {
+      "agent_name": "Oracle",
+      "agent_id": "oracle",
       "mode": null,
       "mcp_custom": false,
+      "submission_type": "oracle",
       "organization": "Ground Truth",
+      "overall_score": 87.3,
       "component_scores": {
+        "approach": 20.0,
+        "orchestration": 15.0,
+        "quality": 22.3,
+        "feasibility": 15.0,
+        "novelty": 5.0,
+        "diversity": 10.0
       },
       "taxonomy_scores": {
+        "de_novo_binder": {
+          "ab": 74.0,
+          "bnd": 82.0,
+          "scf": 92.0
+        },
+        "conformational_design": {
+          "enz": 92.0,
+          "fp": 96.0,
+          "scf": 81.0
+        },
+        "complex_engineering": {
+          "enz": 75.0,
+          "bnd": 84.0,
+          "scf": 78.0
+        },
+        "de_novo_backbone": {
+          "scf": 98.0
+        },
+        "sequence_optimization": {
+          "enz": 99.0,
+          "fp": 97.0,
+          "ab": 98.0,
+          "scf": 98.0
+        }
       },
       "tasks_completed": 76,
       "tasks_total": 76,
       "tasks_with_zero": 0,
       "avg_latency_sec": null,
+      "submission_date": "2026-03-10"
     },
     {
       "agent_name": "Human Expert",
       "mode": null,
       "mcp_custom": false,
       "submission_type": "human_expert",
+      "organization": "Romero Lab",
+      "overall_score": 62.4,
       "component_scores": {
+        "approach": 19.0,
+        "orchestration": 9.9,
+        "quality": 12.9,
+        "feasibility": 13.6,
+        "novelty": 4.5,
+        "diversity": 2.6
       },
       "taxonomy_scores": {
+        "de_novo_binder": {
+          "ab": 57.0,
+          "bnd": 71.0,
+          "scf": 70.0
+        },
+        "conformational_design": {
+          "enz": 68.0,
+          "fp": 59.0,
+          "scf": 50.0
+        },
+        "complex_engineering": {
+          "enz": 40.0,
+          "bnd": 76.0,
+          "scf": 67.0
+        },
+        "de_novo_backbone": {
+          "scf": 84.0
+        },
+        "sequence_optimization": {
+          "enz": 48.0,
+          "fp": 51.0,
+          "ab": 65.0,
+          "scf": 54.0
+        }
       },
       "tasks_completed": 76,
       "tasks_total": 76,
+      "tasks_with_zero": 0,
       "avg_latency_sec": null,
+      "submission_date": "2026-03-10"
     },
     {
+      "agent_name": "DeepSeek V3",
+      "agent_id": "deepseek-v3-user",
       "mode": "user",
       "mcp_custom": false,
       "submission_type": "llm",
+      "organization": "DeepSeek",
+      "overall_score": 58.4,
       "component_scores": {
+        "approach": 12.8,
+        "orchestration": 10.0,
+        "quality": 15.6,
+        "feasibility": 12.2,
+        "novelty": 4.3,
+        "diversity": 3.4
       },
       "taxonomy_scores": {
+        "de_novo_binder": {
+          "ab": 55.0,
+          "bnd": 63.0,
+          "scf": 56.0
+        },
+        "conformational_design": {
+          "enz": 48.0,
+          "fp": 56.0,
+          "scf": 54.0
+        },
+        "complex_engineering": {
+          "enz": 56.0,
+          "bnd": 66.0,
+          "scf": 60.0
+        },
+        "de_novo_backbone": {
+          "scf": 37.0
+        },
+        "sequence_optimization": {
+          "enz": 61.0,
+          "fp": 66.0,
+          "ab": 83.0,
+          "scf": 62.0
+        }
       },
       "tasks_completed": 76,
       "tasks_total": 76,
+      "tasks_with_zero": 1,
+      "avg_latency_sec": null,
+      "submission_date": "2026-03-10"
     },
     {
+      "agent_name": "Hardcoded Pipeline",
+      "agent_id": "hardcoded-pipeline",
+      "mode": null,
       "mcp_custom": false,
+      "submission_type": "hardcoded",
+      "organization": "Deterministic",
+      "overall_score": 52.4,
       "component_scores": {
+        "approach": 12.1,
+        "orchestration": 9.9,
+        "quality": 14.8,
+        "feasibility": 9.7,
+        "novelty": 3.8,
         "diversity": 2.0
       },
       "taxonomy_scores": {
+        "de_novo_binder": {
+          "ab": 45.0,
+          "bnd": 56.0,
+          "scf": 67.0
+        },
+        "conformational_design": {
+          "enz": 38.0,
+          "fp": 27.0,
+          "scf": 35.0
+        },
+        "complex_engineering": {
+          "enz": 57.0,
+          "bnd": 64.0,
+          "scf": 64.0
+        },
+        "de_novo_backbone": {
+          "scf": 11.0
+        },
+        "sequence_optimization": {
+          "enz": 70.0,
+          "fp": 67.0,
+          "ab": 57.0,
+          "scf": 75.0
+        }
       },
       "tasks_completed": 76,
       "tasks_total": 76,
+      "tasks_with_zero": 5,
+      "avg_latency_sec": null,
+      "submission_date": "2026-03-10"
     },
     {
+      "agent_name": "DeepSeek V3",
+      "agent_id": "deepseek-v3-benchmark",
+      "mode": "benchmark",
       "mcp_custom": false,
       "submission_type": "llm",
+      "organization": "DeepSeek",
+      "overall_score": 50.5,
       "component_scores": {
+        "approach": 7.1,
+        "orchestration": 7.2,
+        "quality": 16.1,
+        "feasibility": 13.2,
+        "novelty": 4.1,
+        "diversity": 3.0
       },
       "taxonomy_scores": {
+        "de_novo_binder": {
+          "ab": 46.0,
+          "bnd": 53.0,
+          "scf": 47.0
+        },
+        "conformational_design": {
+          "enz": 44.0,
+          "fp": 62.0,
+          "scf": 38.0
+        },
+        "complex_engineering": {
+          "enz": 33.0,
+          "bnd": 56.0,
+          "scf": 52.0
+        },
+        "de_novo_backbone": {
+          "scf": 54.0
+        },
+        "sequence_optimization": {
+          "enz": 55.0,
+          "fp": 41.0,
+          "ab": 69.0,
+          "scf": 72.0
+        }
       },
       "tasks_completed": 76,
       "tasks_total": 76,
+      "tasks_with_zero": 2,
+      "avg_latency_sec": null,
+      "submission_date": "2026-03-10"
     },
     {
+      "agent_name": "GPT-5",
+      "agent_id": "gpt5-user",
       "mode": "user",
       "mcp_custom": false,
       "submission_type": "llm",
+      "organization": "OpenAI",
+      "overall_score": 49.2,
       "component_scores": {
+        "approach": 7.9,
+        "orchestration": 7.6,
+        "quality": 15.3,
+        "feasibility": 11.1,
+        "novelty": 4.1,
+        "diversity": 3.1
       },
       "taxonomy_scores": {
+        "de_novo_binder": {
+          "ab": 43.0,
+          "bnd": 55.0,
+          "scf": 54.0
+        },
+        "conformational_design": {
+          "enz": 32.0,
+          "fp": 40.0,
+          "scf": 39.0
+        },
+        "complex_engineering": {
+          "enz": 43.0,
+          "bnd": 57.0,
+          "scf": 53.0
+        },
+        "de_novo_backbone": {
+          "scf": 45.0
+        },
+        "sequence_optimization": {
+          "enz": 48.0,
+          "fp": 52.0,
+          "ab": 71.0,
+          "scf": 62.0
+        }
       },
       "tasks_completed": 76,
       "tasks_total": 76,
+      "tasks_with_zero": 3,
+      "avg_latency_sec": null,
+      "submission_date": "2026-03-10"
     },
     {
+      "agent_name": "Claude Sonnet 4.5",
+      "agent_id": "sonnet-4.5-user",
       "mode": "user",
       "mcp_custom": false,
       "submission_type": "llm",
+      "organization": "Anthropic",
+      "overall_score": 47.9,
       "component_scores": {
+        "approach": 8.6,
+        "orchestration": 7.8,
+        "quality": 15.0,
+        "feasibility": 10.9,
+        "novelty": 3.4,
+        "diversity": 2.2
       },
       "taxonomy_scores": {
+        "de_novo_binder": {
+          "ab": 42.0,
+          "bnd": 53.0,
+          "scf": 38.0
+        },
+        "conformational_design": {
+          "enz": 42.0,
+          "fp": 47.0,
+          "scf": 35.0
+        },
+        "complex_engineering": {
+          "enz": 48.0,
+          "bnd": 66.0,
+          "scf": 53.0
+        },
+        "de_novo_backbone": {
+          "scf": 33.0
+        },
+        "sequence_optimization": {
+          "enz": 48.0,
+          "fp": 60.0,
+          "ab": 67.0,
+          "scf": 18.0
+        }
       },
       "tasks_completed": 76,
       "tasks_total": 76,
+      "tasks_with_zero": 6,
+      "avg_latency_sec": null,
+      "submission_date": "2026-03-10"
     },
     {
+      "agent_name": "Claude Sonnet 4.5",
+      "agent_id": "sonnet-4.5-benchmark",
       "mode": "benchmark",
       "mcp_custom": false,
       "submission_type": "llm",
       "organization": "Anthropic",
+      "overall_score": 42.3,
       "component_scores": {
+        "approach": 6.0,
+        "orchestration": 6.2,
+        "quality": 13.8,
+        "feasibility": 11.4,
+        "novelty": 3.2,
+        "diversity": 1.7
       },
       "taxonomy_scores": {
+        "de_novo_binder": {
+          "ab": 32.0,
+          "bnd": 44.0,
+          "scf": 36.0
+        },
+        "conformational_design": {
+          "enz": 17.0,
+          "fp": 56.0,
+          "scf": 41.0
+        },
+        "complex_engineering": {
+          "enz": 44.0,
+          "bnd": 55.0,
+          "scf": 37.0
+        },
+        "de_novo_backbone": {
+          "scf": 44.0
+        },
+        "sequence_optimization": {
+          "enz": 40.0,
+          "fp": 51.0,
+          "ab": 58.0,
+          "scf": 20.0
+        }
       },
       "tasks_completed": 76,
       "tasks_total": 76,
+      "tasks_with_zero": 9,
+      "avg_latency_sec": null,
+      "submission_date": "2026-03-10"
     },
     {
       "agent_name": "GPT-5",
       "mcp_custom": false,
       "submission_type": "llm",
       "organization": "OpenAI",
+      "overall_score": 41.0,
       "component_scores": {
         "approach": 5.2,
+        "orchestration": 4.9,
+        "quality": 15.0,
+        "feasibility": 11.5,
+        "novelty": 3.5,
+        "diversity": 0.9
       },
       "taxonomy_scores": {
+        "de_novo_binder": {
+          "ab": 32.0,
+          "bnd": 41.0,
+          "scf": 45.0
+        },
+        "conformational_design": {
+          "enz": 22.0,
+          "fp": 55.0,
+          "scf": 40.0
+        },
+        "complex_engineering": {
+          "enz": 3.0,
+          "bnd": 49.0,
+          "scf": 26.0
+        },
+        "de_novo_backbone": {
+          "scf": 45.0
+        },
+        "sequence_optimization": {
+          "enz": 44.0,
+          "fp": 52.0,
+          "ab": 52.0,
+          "scf": 49.0
+        }
       },
       "tasks_completed": 76,
       "tasks_total": 76,
+      "tasks_with_zero": 5,
+      "avg_latency_sec": null,
+      "submission_date": "2026-03-10"
     },
     {
+      "agent_name": "Gemini 2.5 Pro",
+      "agent_id": "gemini-2.5-pro-user",
+      "mode": "user",
       "mcp_custom": false,
       "submission_type": "llm",
       "organization": "Google",
+      "overall_score": 26.2,
       "component_scores": {
+        "approach": 0.0,
+        "orchestration": 0.0,
+        "quality": 10.3,
+        "feasibility": 10.9,
+        "novelty": 3.5,
+        "diversity": 1.5
       },
       "taxonomy_scores": {
+        "de_novo_binder": {
+          "ab": 22.0,
+          "bnd": 36.0,
+          "scf": 28.0
+        },
+        "conformational_design": {
+          "enz": 8.0,
+          "fp": 9.0,
+          "scf": 10.0
+        },
+        "complex_engineering": {
+          "enz": 12.0,
+          "bnd": 35.0,
+          "scf": 22.0
+        },
+        "de_novo_backbone": {
+          "scf": 21.0
+        },
+        "sequence_optimization": {
+          "enz": 33.0,
+          "fp": 36.0,
+          "ab": 53.0,
+          "scf": 22.0
+        }
       },
       "tasks_completed": 76,
       "tasks_total": 76,
+      "tasks_with_zero": 15,
+      "avg_latency_sec": null,
+      "submission_date": "2026-03-10"
     },
     {
+      "agent_name": "Gemini 2.5 Pro",
+      "agent_id": "gemini-2.5-pro-benchmark",
       "mode": "benchmark",
       "mcp_custom": false,
       "submission_type": "llm",
+      "organization": "Google",
+      "overall_score": 25.8,
       "component_scores": {
+        "approach": 0.0,
+        "orchestration": 0.0,
+        "quality": 10.1,
+        "feasibility": 10.7,
+        "novelty": 3.4,
+        "diversity": 1.6
       },
       "taxonomy_scores": {
+        "de_novo_binder": {
+          "ab": 28.0,
+          "bnd": 35.0,
+          "scf": 20.0
+        },
+        "conformational_design": {
+          "enz": 16.0,
+          "fp": 22.0,
+          "scf": 6.0
+        },
+        "complex_engineering": {
+          "enz": 0.0,
+          "bnd": 32.0,
+          "scf": 27.0
+        },
+        "de_novo_backbone": {
+          "scf": 21.0
+        },
+        "sequence_optimization": {
+          "enz": 30.0,
+          "fp": 33.0,
+          "ab": 52.0,
+          "scf": 15.0
+        }
       },
       "tasks_completed": 76,
       "tasks_total": 76,
+      "tasks_with_zero": 17,
+      "avg_latency_sec": null,
+      "submission_date": "2026-03-10"
     }
   ]
+}
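
Because every entry was rescored in this commit, a quick consistency check between `overall_score` and the six `component_scores` can catch transcription mistakes. The sketch below is not part of the repository: `component_gap`, `check_entries`, and the 0.2-point tolerance are assumptions. The tolerance was chosen because the new entries agree with their component sums to within about 0.2 points, presumably from rounding each component to one decimal.

```python
def component_gap(entry):
    """Return overall_score minus the sum of the component scores, to 1 decimal."""
    total = sum(entry["component_scores"].values())
    return round(entry["overall_score"] - total, 1)


def check_entries(entries, tolerance=0.2):
    """Return identifiers of entries whose components do not add up."""
    return [
        e.get("agent_id", e.get("agent_name"))
        for e in entries
        if abs(component_gap(e)) > tolerance
    ]
```

Running `check_entries(data["entries"])` on the loaded JSON should return an empty list when the file is internally consistent.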