Jasonkim8652 committed · Commit c59de83 · verified · 1 parent: b34cf54

update leaderboard with rescored results and fair diversity formula

Files changed (4)
  1. README.md +79 -13
  2. app.py +3 -3
  3. eval_scorer.py +6 -5
  4. leaderboard_data.json +401 -250
README.md CHANGED
@@ -12,22 +12,88 @@ license: mit
12
 
13
  # BioDesignBench Leaderboard
14
 
15
- Evaluating LLM Agents on Protein Design via MCP Tools.
16
 
17
  **Romero Lab, Duke University**
18
 
19
- ## Overview
20
 
21
- BioDesignBench is the first comprehensive benchmark for evaluating LLM agents on
22
- protein design tasks via MCP (Model Context Protocol) tool use. This leaderboard
23
- tracks agent performance across 76 design tasks spanning 17 taxonomy cells
24
- (5 DesignTaskTypes x 6 BiologicalContexts), scored on a 100-point rubric with
25
- 6 components: Approach, Orchestration, Quality, Feasibility, Novelty, Diversity.
 
26
 
27
- ## Features
28
 
29
- - **Overall Leaderboard** — Mixed-ranking table with baselines and LLM agents
30
- - **Taxonomy Breakdown** — Heatmap of per-cell scores across 17 taxonomy cells
31
- - **Component Analysis** — Radar and bar charts comparing 6 scoring components
32
- - **Benchmark vs User Mode** — Paired comparison of the same LLM in two modes
33
- - **About** — Methodology, submission guide, and citation info
 
12
 
13
  # BioDesignBench Leaderboard
14
 
15
+ Interactive leaderboard for **BioDesignBench**, a benchmark evaluating LLM agents on protein design tasks via MCP (Model Context Protocol) tool use.
16
 
17
  **Romero Lab, Duke University**
18
 
19
+ ## What the leaderboard shows
20
 
21
+ - **Overall Leaderboard** -- Mixed-ranking table with human baselines and LLM agents, filterable by mode (benchmark/user), MCP tool type (reference/custom), and entry type.
22
+ - **Taxonomy Breakdown** -- Heatmap of per-cell scores across 17 taxonomy cells (5 task types x 5 biological contexts) with average-per-type bar chart.
23
+ - **Component Analysis** -- Radar and grouped bar charts comparing the 6 scoring components (Approach, Orchestration, Quality, Feasibility, Novelty, Diversity) between any two agents.
24
+ - **Benchmark vs User Mode** -- Paired comparison showing how the same LLM performs with minimal prompting (benchmark) vs rich guidance (user mode).
25
+ - **Submit** -- Form to submit your own protein design agent for evaluation.
26
+ - **About** -- Methodology, scoring rubric, submission guide, and citation.
27
 
28
+ ## Run locally
29
 
30
+ ```bash
31
+ pip install -r requirements.txt
32
+ python app.py
33
+ ```
34
+
35
+ The app launches a Gradio server at `http://localhost:7860`.
36
+
37
+ ## HuggingFace Space deployment
38
+
39
+ This directory is structured as a self-contained HF Space. To deploy:
40
+
41
+ 1. Create a new Space on HuggingFace (`sdk: gradio`).
42
+ 2. Push the contents of this directory to the Space repo.
43
+ 3. Set the `BDB_ADMIN_PASSWORD` secret in the Space settings for admin panel access.
44
+ 4. Optionally set `HF_TOKEN` for submission queue access (private dataset).
45
+
46
+ The Space will automatically build and serve the leaderboard.
47
+
48
+ ## How to update results
49
+
50
+ Add new entries to `leaderboard_data.json` following the existing schema:
51
+
52
+ ```json
53
+ {
54
+ "agent_name": "Your Agent",
55
+ "agent_id": "your-agent-user",
56
+ "mode": "user",
57
+ "mcp_custom": false,
58
+ "submission_type": "llm",
59
+ "organization": "Your Org",
60
+ "overall_score": 42.0,
61
+ "component_scores": {
62
+ "approach": 10.0,
63
+ "orchestration": 8.0,
64
+ "quality": 14.0,
65
+ "feasibility": 6.0,
66
+ "novelty": 2.0,
67
+ "diversity": 2.0
68
+ },
69
+ "taxonomy_scores": {
70
+ "de_novo_binder": {"ab": 45, "enz": 40, "sig": 43},
71
+ "sequence_optimization": {"ab": 50, "enz": 42, "sig": 38, "str": 44, "flu": 52},
72
+ "de_novo_backbone": {"str": 28},
73
+ "complex_engineering": {"enz": 40, "sig": 44, "str": 46},
74
+ "conformational_design": {"enz": 38, "sig": 42, "str": 40, "flu": 44}
75
+ },
76
+ "tasks_completed": 76,
77
+ "tasks_total": 76,
78
+ "tasks_with_zero": 4,
79
+ "avg_latency_sec": 50.0,
80
+ "submission_date": "2026-03-15"
81
+ }
82
+ ```
83
+
84
+ Update the `last_updated` field at the top of the JSON file after adding entries.
85
+
86
+ ## File overview
87
+
88
+ | File | Description |
89
+ |------|-------------|
90
+ | `app.py` | Main Gradio application with 7 tabs |
91
+ | `leaderboard_data.json` | Current benchmark results |
92
+ | `mcp_tool_schemas.json` | 17 reference MCP tool schemas |
93
+ | `eval_scorer.py` | Self-contained 100-point scoring rubric |
94
+ | `eval_queue.py` | Submission queue (HuggingFace Datasets) |
95
+ | `eval_dispatcher.py` | HTTP task dispatcher for benchmarking |
96
+ | `eval_boltz.py` | Boltz structure prediction post-eval |
97
+ | `eval_tasks.py` | Hidden task loader from HF Dataset |
98
+ | `example_server.py` | Reference FastAPI server for submitters |
99
+ | `requirements.txt` | Python dependencies |
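
Note on the `leaderboard_data.json` schema documented in the README above: before adding an entry by hand, a quick consistency check can catch typos. The sketch below is not part of this commit. `check_entry`, `REQUIRED_KEYS`, and the 0.15 tolerance are hypothetical names and values chosen for illustration. It assumes `overall_score` is meant to equal the sum of the six component scores, which holds to within rounding for the entries in this commit but is not stated explicitly in the README. The sample entry is a trimmed copy of the README example.

```python
# Hypothetical pre-commit check for one leaderboard entry (not part of the repo).
COMPONENTS = ("approach", "orchestration", "quality",
              "feasibility", "novelty", "diversity")
REQUIRED_KEYS = ("agent_name", "agent_id", "mode", "mcp_custom",
                 "submission_type", "organization", "overall_score",
                 "component_scores", "taxonomy_scores", "tasks_completed",
                 "tasks_total", "tasks_with_zero", "avg_latency_sec",
                 "submission_date")


def check_entry(entry, tolerance=0.15):
    """Return a list of problems found in one entry (empty list if none)."""
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in entry]
    if problems:
        return problems
    scores = entry["component_scores"]
    missing = [c for c in COMPONENTS if c not in scores]
    if missing:
        return [f"missing component: {c}" for c in missing]
    # Assumption: the overall score is the sum of the six components.
    total = sum(scores[c] for c in COMPONENTS)
    if abs(total - entry["overall_score"]) > tolerance:
        problems.append(
            f"components sum to {total:.1f}, overall_score is {entry['overall_score']}")
    if entry["tasks_completed"] > entry["tasks_total"]:
        problems.append("tasks_completed exceeds tasks_total")
    return problems


# Trimmed copy of the README example entry.
example = {
    "agent_name": "Your Agent", "agent_id": "your-agent-user", "mode": "user",
    "mcp_custom": False, "submission_type": "llm", "organization": "Your Org",
    "overall_score": 42.0,
    "component_scores": {"approach": 10.0, "orchestration": 8.0, "quality": 14.0,
                         "feasibility": 6.0, "novelty": 2.0, "diversity": 2.0},
    "taxonomy_scores": {"de_novo_backbone": {"str": 28}},
    "tasks_completed": 76, "tasks_total": 76, "tasks_with_zero": 4,
    "avg_latency_sec": 50.0, "submission_date": "2026-03-15",
}

print(check_entry(example))
```

An empty list means the entry passed these checks. The tolerance absorbs one-decimal rounding in the published component scores.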
app.py CHANGED
@@ -20,7 +20,7 @@ from pathlib import Path
20
  import gradio as gr
21
  import plotly.graph_objects as go
22
 
23
- ADMIN_PASSWORD = os.environ.get("BDB_ADMIN_PASSWORD", "biodesignbench2026")
24
 
25
 
26
  # ═══════════════════════════════════════════════════════════════════
@@ -28,8 +28,8 @@ ADMIN_PASSWORD = os.environ.get("BDB_ADMIN_PASSWORD", "biodesignbench2026")
28
  # ═══════════════════════════════════════════════════════════════════
29
 
30
  PAPER_URL = "#"
31
- GITHUB_URL = "#"
32
- HF_URL = "#"
33
 
34
 
35
  # ═══════════════════════════════════════════════════════════════════
 
20
  import gradio as gr
21
  import plotly.graph_objects as go
22
 
23
+ ADMIN_PASSWORD = os.environ.get("BDB_ADMIN_PASSWORD", "")
24
 
25
 
26
  # ═══════════════════════════════════════════════════════════════════
 
28
  # ═══════════════════════════════════════════════════════════════════
29
 
30
  PAPER_URL = "#"
31
+ GITHUB_URL = "https://github.com/biodesignbench/biodesignbench"
32
+ HF_URL = "https://huggingface.co/spaces/biodesignbench/leaderboard"
33
 
34
 
35
  # ═══════════════════════════════════════════════════════════════════
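
Note on the `ADMIN_PASSWORD` change: with the default now an empty string, a Space that does not set the `BDB_ADMIN_PASSWORD` secret ends up with an empty configured password. How `app.py` compares the password is not shown in this diff, so the following is only a sketch of a fail-closed check. `admin_login_ok` is a hypothetical helper, not a function from the repository.

```python
import hmac
import os


def admin_login_ok(attempt, configured=None):
    """Return True only if a password is configured and the attempt matches it."""
    if configured is None:
        configured = os.environ.get("BDB_ADMIN_PASSWORD", "")
    if not configured:
        # No password configured: deny every login instead of accepting "".
        return False
    # Constant-time comparison avoids leaking match length through timing.
    return hmac.compare_digest(attempt.encode("utf-8"), configured.encode("utf-8"))


print(admin_login_ok("", configured=""))          # empty secret, empty attempt
print(admin_login_ok("s3cret", configured="s3cret"))
```

The point of the explicit empty check is that an unset secret should disable the admin panel, not open it to an empty password.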
eval_scorer.py CHANGED
@@ -1368,14 +1368,15 @@ def score_diversity(
1368
  return {"score": 0, "max": max_points, "num_designs": 0, "pairwise_diversity": 0.0, "entropy": 0.0}
1369
 
1370
  num = len(designs)
1371
- count_fraction = min(num / max_designs, 1.0) if max_designs > 0 else 1.0
1372
  diversity = mean_pairwise_diversity(designs)
1373
  entropy = sequence_entropy(designs)
1374
 
1375
- count_score = count_fraction * max_points * 0.4
1376
- diversity_score = diversity * max_points * 0.4
1377
- entropy_score = entropy * max_points * 0.2
1378
- total = int(round(count_score + diversity_score + entropy_score))
1379
 
1380
  return {
1381
  "score": min(total, max_points), "max": max_points,
 
1368
  return {"score": 0, "max": max_points, "num_designs": 0, "pairwise_diversity": 0.0, "entropy": 0.0}
1369
 
1370
  num = len(designs)
 
1371
  diversity = mean_pairwise_diversity(designs)
1372
  entropy = sequence_entropy(designs)
1373
 
1374
+ # Score based purely on sequence diversity (not design count).
1375
+ # Tasks don't specify how many designs to produce, so counting
1376
+ # would unfairly penalise agents that submit fewer designs.
1377
+ diversity_score = diversity * max_points * 0.65
1378
+ entropy_score = entropy * max_points * 0.35
1379
+ total = int(round(diversity_score + entropy_score))
1380
 
1381
  return {
1382
  "score": min(total, max_points), "max": max_points,
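
To make the effect of the formula change concrete, the sketch below reimplements only the arithmetic visible in this hunk. The function names are hypothetical, and it assumes `diversity` and `entropy` are already normalised to the range 0 to 1 (the helpers `mean_pairwise_diversity` and `sequence_entropy` are not shown in the diff). `max_points=10` and `max_designs=10` are illustrative values.

```python
def old_diversity_points(num, max_designs, diversity, entropy, max_points):
    """Pre-commit weighting: 40% design count, 40% pairwise diversity, 20% entropy."""
    count_fraction = min(num / max_designs, 1.0) if max_designs > 0 else 1.0
    total = int(round(count_fraction * max_points * 0.4
                      + diversity * max_points * 0.4
                      + entropy * max_points * 0.2))
    return min(total, max_points)


def new_diversity_points(diversity, entropy, max_points):
    """Post-commit weighting: 65% pairwise diversity, 35% entropy, no count term."""
    total = int(round(diversity * max_points * 0.65
                      + entropy * max_points * 0.35))
    return min(total, max_points)


# Same diversity and entropy, but only 2 designs submitted out of a nominal 10.
few = old_diversity_points(2, 10, diversity=0.8, entropy=0.6, max_points=10)
many = old_diversity_points(10, 10, diversity=0.8, entropy=0.6, max_points=10)
new = new_diversity_points(diversity=0.8, entropy=0.6, max_points=10)
print(few, many, new)  # prints "5 8 7"
```

Under the old weighting, the two-design submission loses three points relative to the ten-design one despite identical sequence diversity. Under the new weighting, both receive the same score.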
leaderboard_data.json CHANGED
@@ -1,34 +1,53 @@
1
  {
2
- "last_updated": "2026-03-03",
3
  "entries": [
4
  {
5
- "agent_name": "Human Oracle",
6
- "agent_id": "human-oracle",
7
  "mode": null,
8
  "mcp_custom": false,
9
- "submission_type": "human_oracle",
10
  "organization": "Ground Truth",
11
- "overall_score": 85.0,
12
  "component_scores": {
13
- "approach": 17.5,
14
- "orchestration": 13.5,
15
- "quality": 30.0,
16
- "feasibility": 13.8,
17
- "novelty": 3.5,
18
- "diversity": 6.7
19
  },
20
  "taxonomy_scores": {
21
- "de_novo_binder": {"ab": 88, "enz": 82, "sig": 86},
22
- "sequence_optimization": {"ab": 90, "enz": 85, "sig": 80, "str": 87, "flu": 92},
23
- "de_novo_backbone": {"str": 75},
24
- "complex_engineering": {"enz": 80, "sig": 85, "str": 88},
25
- "conformational_design": {"enz": 78, "sig": 82, "str": 80, "flu": 85}
26
  },
27
  "tasks_completed": 76,
28
  "tasks_total": 76,
29
  "tasks_with_zero": 0,
30
  "avg_latency_sec": null,
31
- "submission_date": "2026-03-01"
32
  },
33
  {
34
  "agent_name": "Human Expert",
@@ -36,231 +55,335 @@
36
  "mode": null,
37
  "mcp_custom": false,
38
  "submission_type": "human_expert",
39
- "organization": "Manual (Jason)",
40
- "overall_score": 62.0,
41
- "component_scores": {
42
- "approach": 14.0,
43
- "orchestration": 11.0,
44
- "quality": 20.5,
45
- "feasibility": 10.5,
46
- "novelty": 2.5,
47
- "diversity": 3.5
48
- },
49
- "taxonomy_scores": {
50
- "de_novo_binder": {"ab": 65, "enz": 58, "sig": 63},
51
- "sequence_optimization": {"ab": 70, "enz": 62, "sig": 55, "str": 64, "flu": 72},
52
- "de_novo_backbone": {"str": 50},
53
- "complex_engineering": {"enz": 58, "sig": 62, "str": 66},
54
- "conformational_design": {"enz": 55, "sig": 60, "str": 58, "flu": 62}
55
- },
56
- "tasks_completed": 76,
57
- "tasks_total": 76,
58
- "tasks_with_zero": 2,
59
- "avg_latency_sec": null,
60
- "submission_date": "2026-03-01"
61
- },
62
- {
63
- "agent_name": "Hardcoded Pipeline",
64
- "agent_id": "hardcoded-pipeline",
65
- "mode": null,
66
- "mcp_custom": false,
67
- "submission_type": "hardcoded",
68
- "organization": "Deterministic",
69
- "overall_score": 41.5,
70
  "component_scores": {
71
- "approach": 10.0,
72
- "orchestration": 9.5,
73
- "quality": 12.0,
74
- "feasibility": 6.5,
75
- "novelty": 1.5,
76
- "diversity": 2.0
77
  },
78
  "taxonomy_scores": {
79
- "de_novo_binder": {"ab": 42, "enz": 38, "sig": 44},
80
- "sequence_optimization": {"ab": 48, "enz": 40, "sig": 35, "str": 42, "flu": 50},
81
- "de_novo_backbone": {"str": 30},
82
- "complex_engineering": {"enz": 38, "sig": 42, "str": 45},
83
- "conformational_design": {"enz": 35, "sig": 40, "str": 38, "flu": 42}
84
  },
85
  "tasks_completed": 76,
86
  "tasks_total": 76,
87
- "tasks_with_zero": 5,
88
  "avg_latency_sec": null,
89
- "submission_date": "2026-03-01"
90
  },
91
  {
92
- "agent_name": "Claude-4.5",
93
- "agent_id": "claude45-user",
94
  "mode": "user",
95
  "mcp_custom": false,
96
  "submission_type": "llm",
97
- "organization": "Anthropic",
98
- "overall_score": 35.0,
99
  "component_scores": {
100
- "approach": 8.5,
101
- "orchestration": 7.0,
102
- "quality": 10.5,
103
- "feasibility": 5.5,
104
- "novelty": 1.5,
105
- "diversity": 2.0
106
  },
107
  "taxonomy_scores": {
108
- "de_novo_binder": {"ab": 38, "enz": 32, "sig": 36},
109
- "sequence_optimization": {"ab": 42, "enz": 35, "sig": 30, "str": 36, "flu": 44},
110
- "de_novo_backbone": {"str": 22},
111
- "complex_engineering": {"enz": 32, "sig": 36, "str": 38},
112
- "conformational_design": {"enz": 30, "sig": 34, "str": 32, "flu": 36}
113
  },
114
  "tasks_completed": 76,
115
  "tasks_total": 76,
116
- "tasks_with_zero": 6,
117
- "avg_latency_sec": 52.3,
118
- "submission_date": "2026-03-01"
119
  },
120
  {
121
- "agent_name": "GPT-5",
122
- "agent_id": "gpt5-user",
123
- "mode": "user",
124
  "mcp_custom": false,
125
- "submission_type": "llm",
126
- "organization": "OpenAI",
127
- "overall_score": 33.0,
128
  "component_scores": {
129
- "approach": 8.0,
130
- "orchestration": 6.5,
131
- "quality": 10.0,
132
- "feasibility": 5.0,
133
- "novelty": 1.5,
134
  "diversity": 2.0
135
  },
136
  "taxonomy_scores": {
137
- "de_novo_binder": {"ab": 35, "enz": 30, "sig": 34},
138
- "sequence_optimization": {"ab": 40, "enz": 33, "sig": 28, "str": 34, "flu": 42},
139
- "de_novo_backbone": {"str": 20},
140
- "complex_engineering": {"enz": 30, "sig": 34, "str": 36},
141
- "conformational_design": {"enz": 28, "sig": 32, "str": 30, "flu": 34}
142
  },
143
  "tasks_completed": 76,
144
  "tasks_total": 76,
145
- "tasks_with_zero": 8,
146
- "avg_latency_sec": 45.2,
147
- "submission_date": "2026-03-01"
148
  },
149
  {
150
- "agent_name": "Deepseek-v3.2",
151
- "agent_id": "deepseek32-user",
152
- "mode": "user",
153
  "mcp_custom": false,
154
  "submission_type": "llm",
155
- "organization": "Deepseek",
156
- "overall_score": 30.0,
157
  "component_scores": {
158
- "approach": 7.2,
159
- "orchestration": 6.0,
160
- "quality": 9.0,
161
- "feasibility": 4.5,
162
- "novelty": 1.3,
163
- "diversity": 2.0
164
  },
165
  "taxonomy_scores": {
166
- "de_novo_binder": {"ab": 32, "enz": 28, "sig": 31},
167
- "sequence_optimization": {"ab": 36, "enz": 30, "sig": 25, "str": 31, "flu": 38},
168
- "de_novo_backbone": {"str": 18},
169
- "complex_engineering": {"enz": 28, "sig": 31, "str": 33},
170
- "conformational_design": {"enz": 25, "sig": 29, "str": 28, "flu": 31}
171
  },
172
  "tasks_completed": 76,
173
  "tasks_total": 76,
174
- "tasks_with_zero": 10,
175
- "avg_latency_sec": 38.7,
176
- "submission_date": "2026-03-02"
177
  },
178
  {
179
- "agent_name": "Gemini-2.5-Pro",
180
- "agent_id": "gemini25-user",
181
  "mode": "user",
182
  "mcp_custom": false,
183
  "submission_type": "llm",
184
- "organization": "Google",
185
- "overall_score": 28.0,
186
  "component_scores": {
187
- "approach": 6.5,
188
- "orchestration": 5.5,
189
- "quality": 8.5,
190
- "feasibility": 4.5,
191
- "novelty": 1.2,
192
- "diversity": 1.8
193
  },
194
  "taxonomy_scores": {
195
- "de_novo_binder": {"ab": 30, "enz": 25, "sig": 29},
196
- "sequence_optimization": {"ab": 34, "enz": 28, "sig": 22, "str": 29, "flu": 36},
197
- "de_novo_backbone": {"str": 16},
198
- "complex_engineering": {"enz": 25, "sig": 28, "str": 30},
199
- "conformational_design": {"enz": 22, "sig": 27, "str": 25, "flu": 29}
200
  },
201
  "tasks_completed": 76,
202
  "tasks_total": 76,
203
- "tasks_with_zero": 12,
204
- "avg_latency_sec": 55.1,
205
- "submission_date": "2026-03-02"
206
  },
207
  {
208
- "agent_name": "QWEN-3.5",
209
- "agent_id": "qwen35-user",
210
  "mode": "user",
211
  "mcp_custom": false,
212
  "submission_type": "llm",
213
- "organization": "Alibaba",
214
- "overall_score": 26.0,
215
  "component_scores": {
216
- "approach": 6.0,
217
- "orchestration": 5.0,
218
- "quality": 8.0,
219
- "feasibility": 4.0,
220
- "novelty": 1.2,
221
- "diversity": 1.8
222
  },
223
  "taxonomy_scores": {
224
- "de_novo_binder": {"ab": 28, "enz": 23, "sig": 27},
225
- "sequence_optimization": {"ab": 32, "enz": 26, "sig": 20, "str": 27, "flu": 34},
226
- "de_novo_backbone": {"str": 14},
227
- "complex_engineering": {"enz": 23, "sig": 26, "str": 28},
228
- "conformational_design": {"enz": 20, "sig": 25, "str": 23, "flu": 27}
229
  },
230
  "tasks_completed": 76,
231
  "tasks_total": 76,
232
- "tasks_with_zero": 14,
233
- "avg_latency_sec": 41.8,
234
- "submission_date": "2026-03-02"
235
  },
236
  {
237
- "agent_name": "Claude-4.5",
238
- "agent_id": "claude45-benchmark",
239
  "mode": "benchmark",
240
  "mcp_custom": false,
241
  "submission_type": "llm",
242
  "organization": "Anthropic",
243
- "overall_score": 20.0,
244
  "component_scores": {
245
- "approach": 5.5,
246
- "orchestration": 3.5,
247
- "quality": 6.0,
248
- "feasibility": 3.0,
249
- "novelty": 1.0,
250
- "diversity": 1.0
251
  },
252
  "taxonomy_scores": {
253
- "de_novo_binder": {"ab": 22, "enz": 18, "sig": 21},
254
- "sequence_optimization": {"ab": 25, "enz": 20, "sig": 16, "str": 21, "flu": 28},
255
- "de_novo_backbone": {"str": 12},
256
- "complex_engineering": {"enz": 18, "sig": 20, "str": 22},
257
- "conformational_design": {"enz": 16, "sig": 19, "str": 18, "flu": 20}
258
  },
259
  "tasks_completed": 76,
260
  "tasks_total": 76,
261
- "tasks_with_zero": 14,
262
- "avg_latency_sec": 48.5,
263
- "submission_date": "2026-03-01"
264
  },
265
  {
266
  "agent_name": "GPT-5",
@@ -269,114 +392,142 @@
269
  "mcp_custom": false,
270
  "submission_type": "llm",
271
  "organization": "OpenAI",
272
- "overall_score": 18.5,
273
  "component_scores": {
274
  "approach": 5.2,
275
- "orchestration": 3.1,
276
- "quality": 5.8,
277
- "feasibility": 2.5,
278
- "novelty": 0.9,
279
- "diversity": 1.0
280
- },
281
- "taxonomy_scores": {
282
- "de_novo_binder": {"ab": 20, "enz": 16, "sig": 19},
283
- "sequence_optimization": {"ab": 23, "enz": 18, "sig": 14, "str": 19, "flu": 26},
284
- "de_novo_backbone": {"str": 10},
285
- "complex_engineering": {"enz": 16, "sig": 18, "str": 20},
286
- "conformational_design": {"enz": 14, "sig": 17, "str": 16, "flu": 18}
287
- },
288
- "tasks_completed": 76,
289
- "tasks_total": 76,
290
- "tasks_with_zero": 16,
291
- "avg_latency_sec": 42.0,
292
- "submission_date": "2026-03-01"
293
- },
294
- {
295
- "agent_name": "Deepseek-v3.2",
296
- "agent_id": "deepseek32-benchmark",
297
- "mode": "benchmark",
298
- "mcp_custom": false,
299
- "submission_type": "llm",
300
- "organization": "Deepseek",
301
- "overall_score": 16.0,
302
- "component_scores": {
303
- "approach": 4.5,
304
- "orchestration": 2.8,
305
- "quality": 5.0,
306
- "feasibility": 2.2,
307
- "novelty": 0.7,
308
- "diversity": 0.8
309
  },
310
  "taxonomy_scores": {
311
- "de_novo_binder": {"ab": 18, "enz": 14, "sig": 17},
312
- "sequence_optimization": {"ab": 20, "enz": 16, "sig": 12, "str": 17, "flu": 22},
313
- "de_novo_backbone": {"str": 8},
314
- "complex_engineering": {"enz": 14, "sig": 16, "str": 18},
315
- "conformational_design": {"enz": 12, "sig": 15, "str": 14, "flu": 16}
316
  },
317
  "tasks_completed": 76,
318
  "tasks_total": 76,
319
- "tasks_with_zero": 18,
320
- "avg_latency_sec": 35.2,
321
- "submission_date": "2026-03-02"
322
  },
323
  {
324
- "agent_name": "Gemini-2.5-Pro",
325
- "agent_id": "gemini25-benchmark",
326
- "mode": "benchmark",
327
  "mcp_custom": false,
328
  "submission_type": "llm",
329
  "organization": "Google",
330
- "overall_score": 15.0,
331
  "component_scores": {
332
- "approach": 4.2,
333
- "orchestration": 2.5,
334
- "quality": 4.5,
335
- "feasibility": 2.0,
336
- "novelty": 0.8,
337
- "diversity": 1.0
338
  },
339
  "taxonomy_scores": {
340
- "de_novo_binder": {"ab": 16, "enz": 12, "sig": 16},
341
- "sequence_optimization": {"ab": 18, "enz": 15, "sig": 10, "str": 16, "flu": 20},
342
- "de_novo_backbone": {"str": 8},
343
- "complex_engineering": {"enz": 12, "sig": 15, "str": 16},
344
- "conformational_design": {"enz": 10, "sig": 14, "str": 12, "flu": 15}
345
  },
346
  "tasks_completed": 76,
347
  "tasks_total": 76,
348
- "tasks_with_zero": 20,
349
- "avg_latency_sec": 50.3,
350
- "submission_date": "2026-03-02"
351
  },
352
  {
353
- "agent_name": "QWEN-3.5",
354
- "agent_id": "qwen35-benchmark",
355
  "mode": "benchmark",
356
  "mcp_custom": false,
357
  "submission_type": "llm",
358
- "organization": "Alibaba",
359
- "overall_score": 14.0,
360
  "component_scores": {
361
- "approach": 3.8,
362
- "orchestration": 2.2,
363
- "quality": 4.2,
364
- "feasibility": 2.0,
365
- "novelty": 0.8,
366
- "diversity": 1.0
367
  },
368
  "taxonomy_scores": {
369
- "de_novo_binder": {"ab": 15, "enz": 11, "sig": 14},
370
- "sequence_optimization": {"ab": 17, "enz": 14, "sig": 10, "str": 15, "flu": 18},
371
- "de_novo_backbone": {"str": 7},
372
- "complex_engineering": {"enz": 11, "sig": 14, "str": 15},
373
- "conformational_design": {"enz": 10, "sig": 13, "str": 11, "flu": 14}
374
  },
375
  "tasks_completed": 76,
376
  "tasks_total": 76,
377
- "tasks_with_zero": 22,
378
- "avg_latency_sec": 39.5,
379
- "submission_date": "2026-03-02"
380
  }
381
  ]
382
- }
 
1
  {
2
+ "last_updated": "2026-03-10",
3
  "entries": [
4
  {
5
+ "agent_name": "Oracle",
6
+ "agent_id": "oracle",
7
  "mode": null,
8
  "mcp_custom": false,
9
+ "submission_type": "oracle",
10
  "organization": "Ground Truth",
11
+ "overall_score": 87.3,
12
  "component_scores": {
13
+ "approach": 20.0,
14
+ "orchestration": 15.0,
15
+ "quality": 22.3,
16
+ "feasibility": 15.0,
17
+ "novelty": 5.0,
18
+ "diversity": 10.0
19
  },
20
  "taxonomy_scores": {
21
+ "de_novo_binder": {
22
+ "ab": 74.0,
23
+ "bnd": 82.0,
24
+ "scf": 92.0
25
+ },
26
+ "conformational_design": {
27
+ "enz": 92.0,
28
+ "fp": 96.0,
29
+ "scf": 81.0
30
+ },
31
+ "complex_engineering": {
32
+ "enz": 75.0,
33
+ "bnd": 84.0,
34
+ "scf": 78.0
35
+ },
36
+ "de_novo_backbone": {
37
+ "scf": 98.0
38
+ },
39
+ "sequence_optimization": {
40
+ "enz": 99.0,
41
+ "fp": 97.0,
42
+ "ab": 98.0,
43
+ "scf": 98.0
44
+ }
45
  },
46
  "tasks_completed": 76,
47
  "tasks_total": 76,
48
  "tasks_with_zero": 0,
49
  "avg_latency_sec": null,
50
+ "submission_date": "2026-03-10"
51
  },
52
  {
53
  "agent_name": "Human Expert",
 
55
  "mode": null,
56
  "mcp_custom": false,
57
  "submission_type": "human_expert",
58
+ "organization": "Romero Lab",
59
+ "overall_score": 62.4,
60
  "component_scores": {
61
+ "approach": 19.0,
62
+ "orchestration": 9.9,
63
+ "quality": 12.9,
64
+ "feasibility": 13.6,
65
+ "novelty": 4.5,
66
+ "diversity": 2.6
67
  },
68
  "taxonomy_scores": {
69
+ "de_novo_binder": {
70
+ "ab": 57.0,
71
+ "bnd": 71.0,
72
+ "scf": 70.0
73
+ },
74
+ "conformational_design": {
75
+ "enz": 68.0,
76
+ "fp": 59.0,
77
+ "scf": 50.0
78
+ },
79
+ "complex_engineering": {
80
+ "enz": 40.0,
81
+ "bnd": 76.0,
82
+ "scf": 67.0
83
+ },
84
+ "de_novo_backbone": {
85
+ "scf": 84.0
86
+ },
87
+ "sequence_optimization": {
88
+ "enz": 48.0,
89
+ "fp": 51.0,
90
+ "ab": 65.0,
91
+ "scf": 54.0
92
+ }
93
  },
94
  "tasks_completed": 76,
95
  "tasks_total": 76,
96
+ "tasks_with_zero": 0,
97
  "avg_latency_sec": null,
98
+ "submission_date": "2026-03-10"
99
  },
100
  {
101
+ "agent_name": "DeepSeek V3",
102
+ "agent_id": "deepseek-v3-user",
103
  "mode": "user",
104
  "mcp_custom": false,
105
  "submission_type": "llm",
106
+ "organization": "DeepSeek",
107
+ "overall_score": 58.4,
108
  "component_scores": {
109
+ "approach": 12.8,
110
+ "orchestration": 10.0,
111
+ "quality": 15.6,
112
+ "feasibility": 12.2,
113
+ "novelty": 4.3,
114
+ "diversity": 3.4
115
  },
116
  "taxonomy_scores": {
117
+ "de_novo_binder": {
118
+ "ab": 55.0,
119
+ "bnd": 63.0,
120
+ "scf": 56.0
121
+ },
122
+ "conformational_design": {
123
+ "enz": 48.0,
124
+ "fp": 56.0,
125
+ "scf": 54.0
126
+ },
127
+ "complex_engineering": {
128
+ "enz": 56.0,
129
+ "bnd": 66.0,
130
+ "scf": 60.0
131
+ },
132
+ "de_novo_backbone": {
133
+ "scf": 37.0
134
+ },
135
+ "sequence_optimization": {
136
+ "enz": 61.0,
137
+ "fp": 66.0,
138
+ "ab": 83.0,
139
+ "scf": 62.0
140
+ }
141
  },
142
  "tasks_completed": 76,
143
  "tasks_total": 76,
144
+ "tasks_with_zero": 1,
145
+ "avg_latency_sec": null,
146
+ "submission_date": "2026-03-10"
147
  },
148
  {
149
+ "agent_name": "Hardcoded Pipeline",
150
+ "agent_id": "hardcoded-pipeline",
151
+ "mode": null,
152
  "mcp_custom": false,
153
+ "submission_type": "hardcoded",
154
+ "organization": "Deterministic",
155
+ "overall_score": 52.4,
156
  "component_scores": {
157
+ "approach": 12.1,
158
+ "orchestration": 9.9,
159
+ "quality": 14.8,
160
+ "feasibility": 9.7,
161
+ "novelty": 3.8,
162
  "diversity": 2.0
163
  },
164
  "taxonomy_scores": {
165
+ "de_novo_binder": {
166
+ "ab": 45.0,
167
+ "bnd": 56.0,
168
+ "scf": 67.0
169
+ },
170
+ "conformational_design": {
171
+ "enz": 38.0,
172
+ "fp": 27.0,
173
+ "scf": 35.0
174
+ },
175
+ "complex_engineering": {
176
+ "enz": 57.0,
177
+ "bnd": 64.0,
178
+ "scf": 64.0
179
+ },
180
+ "de_novo_backbone": {
181
+ "scf": 11.0
182
+ },
183
+ "sequence_optimization": {
184
+ "enz": 70.0,
185
+ "fp": 67.0,
186
+ "ab": 57.0,
187
+ "scf": 75.0
188
+ }
189
  },
190
  "tasks_completed": 76,
191
  "tasks_total": 76,
192
+ "tasks_with_zero": 5,
193
+ "avg_latency_sec": null,
194
+ "submission_date": "2026-03-10"
195
  },
196
  {
197
+ "agent_name": "DeepSeek V3",
198
+ "agent_id": "deepseek-v3-benchmark",
199
+ "mode": "benchmark",
200
  "mcp_custom": false,
201
  "submission_type": "llm",
202
+ "organization": "DeepSeek",
203
+ "overall_score": 50.5,
204
  "component_scores": {
205
+ "approach": 7.1,
206
+ "orchestration": 7.2,
207
+ "quality": 16.1,
208
+ "feasibility": 13.2,
209
+ "novelty": 4.1,
210
+ "diversity": 3.0
211
  },
212
  "taxonomy_scores": {
213
+ "de_novo_binder": {
214
+ "ab": 46.0,
215
+ "bnd": 53.0,
216
+ "scf": 47.0
217
+ },
218
+ "conformational_design": {
219
+ "enz": 44.0,
220
+ "fp": 62.0,
221
+ "scf": 38.0
222
+ },
223
+ "complex_engineering": {
224
+ "enz": 33.0,
225
+ "bnd": 56.0,
226
+ "scf": 52.0
227
+ },
228
+ "de_novo_backbone": {
229
+ "scf": 54.0
230
+ },
231
+ "sequence_optimization": {
232
+ "enz": 55.0,
233
+ "fp": 41.0,
234
+ "ab": 69.0,
235
+ "scf": 72.0
236
+ }
237
  },
238
  "tasks_completed": 76,
239
  "tasks_total": 76,
240
+ "tasks_with_zero": 2,
241
+ "avg_latency_sec": null,
242
+ "submission_date": "2026-03-10"
243
  },
244
  {
245
+ "agent_name": "GPT-5",
246
+ "agent_id": "gpt5-user",
247
  "mode": "user",
248
  "mcp_custom": false,
249
  "submission_type": "llm",
250
+ "organization": "OpenAI",
251
+ "overall_score": 49.2,
252
  "component_scores": {
253
+ "approach": 7.9,
254
+ "orchestration": 7.6,
255
+ "quality": 15.3,
256
+ "feasibility": 11.1,
257
+ "novelty": 4.1,
258
+ "diversity": 3.1
259
  },
260
  "taxonomy_scores": {
261
+ "de_novo_binder": {
262
+ "ab": 43.0,
263
+ "bnd": 55.0,
264
+ "scf": 54.0
265
+ },
266
+ "conformational_design": {
267
+ "enz": 32.0,
268
+ "fp": 40.0,
269
+ "scf": 39.0
270
+ },
271
+ "complex_engineering": {
272
+ "enz": 43.0,
273
+ "bnd": 57.0,
274
+ "scf": 53.0
275
+ },
276
+ "de_novo_backbone": {
277
+ "scf": 45.0
278
+ },
279
+ "sequence_optimization": {
280
+ "enz": 48.0,
281
+ "fp": 52.0,
282
+ "ab": 71.0,
283
+ "scf": 62.0
284
+ }
285
  },
286
  "tasks_completed": 76,
287
  "tasks_total": 76,
288
+ "tasks_with_zero": 3,
289
+ "avg_latency_sec": null,
290
+ "submission_date": "2026-03-10"
291
  },
292
  {
293
+ "agent_name": "Claude Sonnet 4.5",
294
+ "agent_id": "sonnet-4.5-user",
295
  "mode": "user",
296
  "mcp_custom": false,
297
  "submission_type": "llm",
298
+ "organization": "Anthropic",
299
+ "overall_score": 47.9,
300
  "component_scores": {
301
+ "approach": 8.6,
302
+ "orchestration": 7.8,
303
+ "quality": 15.0,
304
+ "feasibility": 10.9,
305
+ "novelty": 3.4,
306
+ "diversity": 2.2
307
  },
308
  "taxonomy_scores": {
309
+ "de_novo_binder": {
310
+ "ab": 42.0,
311
+ "bnd": 53.0,
312
+ "scf": 38.0
313
+ },
314
+ "conformational_design": {
315
+ "enz": 42.0,
316
+ "fp": 47.0,
317
+ "scf": 35.0
318
+ },
319
+ "complex_engineering": {
320
+ "enz": 48.0,
321
+ "bnd": 66.0,
322
+ "scf": 53.0
323
+ },
324
+ "de_novo_backbone": {
325
+ "scf": 33.0
326
+ },
327
+ "sequence_optimization": {
328
+ "enz": 48.0,
329
+ "fp": 60.0,
330
+ "ab": 67.0,
331
+ "scf": 18.0
332
+ }
333
  },
334
  "tasks_completed": 76,
335
  "tasks_total": 76,
336
+ "tasks_with_zero": 6,
337
+ "avg_latency_sec": null,
338
+ "submission_date": "2026-03-10"
339
  },
340
  {
341
+ "agent_name": "Claude Sonnet 4.5",
342
+ "agent_id": "sonnet-4.5-benchmark",
343
  "mode": "benchmark",
344
  "mcp_custom": false,
345
  "submission_type": "llm",
346
  "organization": "Anthropic",
347
+ "overall_score": 42.3,
348
  "component_scores": {
349
+ "approach": 6.0,
350
+ "orchestration": 6.2,
351
+ "quality": 13.8,
352
+ "feasibility": 11.4,
353
+ "novelty": 3.2,
354
+ "diversity": 1.7
355
  },
356
  "taxonomy_scores": {
357
+ "de_novo_binder": {
358
+ "ab": 32.0,
359
+ "bnd": 44.0,
360
+ "scf": 36.0
361
+ },
362
+ "conformational_design": {
363
+ "enz": 17.0,
364
+ "fp": 56.0,
365
+ "scf": 41.0
366
+ },
367
+ "complex_engineering": {
368
+ "enz": 44.0,
369
+ "bnd": 55.0,
370
+ "scf": 37.0
371
+ },
372
+ "de_novo_backbone": {
373
+ "scf": 44.0
374
+ },
375
+ "sequence_optimization": {
376
+ "enz": 40.0,
377
+ "fp": 51.0,
378
+ "ab": 58.0,
379
+ "scf": 20.0
380
+ }
381
  },
382
  "tasks_completed": 76,
383
  "tasks_total": 76,
384
+ "tasks_with_zero": 9,
385
+ "avg_latency_sec": null,
386
+ "submission_date": "2026-03-10"
387
  },
388
  {
389
  "agent_name": "GPT-5",
 
392
  "mcp_custom": false,
393
  "submission_type": "llm",
394
  "organization": "OpenAI",
395
+ "overall_score": 41.0,
396
  "component_scores": {
397
  "approach": 5.2,
398
+ "orchestration": 4.9,
399
+ "quality": 15.0,
400
+ "feasibility": 11.5,
401
+ "novelty": 3.5,
402
+ "diversity": 0.9
403
  },
404
  "taxonomy_scores": {
405
+ "de_novo_binder": {
406
+ "ab": 32.0,
407
+ "bnd": 41.0,
408
+ "scf": 45.0
409
+ },
410
+ "conformational_design": {
411
+ "enz": 22.0,
412
+ "fp": 55.0,
413
+ "scf": 40.0
414
+ },
415
+ "complex_engineering": {
416
+ "enz": 3.0,
417
+ "bnd": 49.0,
418
+ "scf": 26.0
419
+ },
420
+ "de_novo_backbone": {
421
+ "scf": 45.0
422
+ },
423
+ "sequence_optimization": {
424
+ "enz": 44.0,
425
+ "fp": 52.0,
426
+ "ab": 52.0,
427
+ "scf": 49.0
428
+ }
429
  },
430
  "tasks_completed": 76,
431
  "tasks_total": 76,
432
+ "tasks_with_zero": 5,
433
+ "avg_latency_sec": null,
434
+ "submission_date": "2026-03-10"
435
  },
436
  {
437
+ "agent_name": "Gemini 2.5 Pro",
438
+ "agent_id": "gemini-2.5-pro-user",
439
+ "mode": "user",
440
  "mcp_custom": false,
441
  "submission_type": "llm",
442
  "organization": "Google",
443
+ "overall_score": 26.2,
444
  "component_scores": {
445
+ "approach": 0.0,
446
+ "orchestration": 0.0,
447
+ "quality": 10.3,
448
+ "feasibility": 10.9,
449
+ "novelty": 3.5,
450
+ "diversity": 1.5
451
  },
452
  "taxonomy_scores": {
453
+ "de_novo_binder": {
454
+ "ab": 22.0,
455
+ "bnd": 36.0,
456
+ "scf": 28.0
457
+ },
458
+ "conformational_design": {
459
+ "enz": 8.0,
460
+ "fp": 9.0,
461
+ "scf": 10.0
462
+ },
463
+ "complex_engineering": {
464
+ "enz": 12.0,
465
+ "bnd": 35.0,
466
+ "scf": 22.0
467
+ },
468
+ "de_novo_backbone": {
469
+ "scf": 21.0
470
+ },
471
+ "sequence_optimization": {
472
+ "enz": 33.0,
473
+ "fp": 36.0,
474
+ "ab": 53.0,
475
+ "scf": 22.0
476
+ }
477
  },
478
  "tasks_completed": 76,
479
  "tasks_total": 76,
480
+ "tasks_with_zero": 15,
481
+ "avg_latency_sec": null,
482
+ "submission_date": "2026-03-10"
483
  },
484
  {
485
+ "agent_name": "Gemini 2.5 Pro",
486
+ "agent_id": "gemini-2.5-pro-benchmark",
487
  "mode": "benchmark",
488
  "mcp_custom": false,
489
  "submission_type": "llm",
490
+ "organization": "Google",
491
+ "overall_score": 25.8,
492
  "component_scores": {
493
+ "approach": 0.0,
494
+ "orchestration": 0.0,
495
+ "quality": 10.1,
496
+ "feasibility": 10.7,
497
+ "novelty": 3.4,
498
+ "diversity": 1.6
499
  },
500
  "taxonomy_scores": {
501
+ "de_novo_binder": {
502
+ "ab": 28.0,
503
+ "bnd": 35.0,
504
+ "scf": 20.0
505
+ },
506
+ "conformational_design": {
507
+ "enz": 16.0,
508
+ "fp": 22.0,
509
+ "scf": 6.0
510
+ },
511
+ "complex_engineering": {
512
+ "enz": 0.0,
513
+ "bnd": 32.0,
514
+ "scf": 27.0
515
+ },
516
+ "de_novo_backbone": {
517
+ "scf": 21.0
518
+ },
519
+ "sequence_optimization": {
520
+ "enz": 30.0,
521
+ "fp": 33.0,
522
+ "ab": 52.0,
523
+ "scf": 15.0
524
+ }
525
  },
526
  "tasks_completed": 76,
527
  "tasks_total": 76,
528
+ "tasks_with_zero": 17,
529
+ "avg_latency_sec": null,
530
+ "submission_date": "2026-03-10"
531
  }
532
  ]
533
+ }
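
The Benchmark vs User Mode tab described in the README pairs the two runs of each model. Below is a minimal sketch of that pairing, assuming entries are matched on `agent_name`. The app's actual pairing logic is not shown in this diff, and `mode_gaps` is a hypothetical helper. The sample scores are copied from the new `leaderboard_data.json` above, reduced to the three fields the sketch needs.

```python
from collections import defaultdict


def mode_gaps(entries):
    """Map agent_name -> (user score - benchmark score), rounded to 1 decimal."""
    by_agent = defaultdict(dict)
    for entry in entries:
        # Baselines such as Oracle have mode null (None) and are skipped.
        if entry.get("mode") in ("user", "benchmark"):
            by_agent[entry["agent_name"]][entry["mode"]] = entry["overall_score"]
    return {
        name: round(scores["user"] - scores["benchmark"], 1)
        for name, scores in by_agent.items()
        if "user" in scores and "benchmark" in scores
    }


sample_entries = [
    {"agent_name": "Oracle", "mode": None, "overall_score": 87.3},
    {"agent_name": "DeepSeek V3", "mode": "user", "overall_score": 58.4},
    {"agent_name": "DeepSeek V3", "mode": "benchmark", "overall_score": 50.5},
    {"agent_name": "Claude Sonnet 4.5", "mode": "user", "overall_score": 47.9},
    {"agent_name": "Claude Sonnet 4.5", "mode": "benchmark", "overall_score": 42.3},
    {"agent_name": "Gemini 2.5 Pro", "mode": "user", "overall_score": 26.2},
    {"agent_name": "Gemini 2.5 Pro", "mode": "benchmark", "overall_score": 25.8},
]

for name, gap in mode_gaps(sample_entries).items():
    print(f"{name}: {gap:+.1f}")
```

On these rescored results the gap between user and benchmark mode ranges from 0.4 points (Gemini 2.5 Pro) to 7.9 points (DeepSeek V3).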