Jasonkim8652 committed on
Commit 8e08ed6 · verified · 1 Parent(s): af5defe

Phase A: integrate LLM judge panel for hybrid scoring

- Port biodesignbench/eval/llm_judge package as leaderboard/llm_judge
- New eval_judge.py orchestrates per-task panel runs with self-exclusion
- aggregate_scores: prefer hybrid_total/hybrid_scores when present
- Admin pipeline: insert 'Phase C: Run LLM Judge' button between Boltz and Finalize
- requirements: add anthropic, openai, google-genai
- Boltz handler now persists boltz-augmented per-task results
- Sync README.md taxonomy stats (9 cells, 2x5 matrix, hybrid 72+28)
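
The `aggregate_scores` preference named in the commit message can be pictured with a minimal sketch. The helper `pick_task_scores` is hypothetical and not part of the repository; only the keys `hybrid_scores`, `hybrid_total`, `component_scores`, and `total_score` come from the diff.

```python
from __future__ import annotations

from typing import Any


def pick_task_scores(result: dict[str, Any]) -> tuple[dict[str, float], float, str]:
    """Return (component scores, total, scoring mode) for one task result.

    Hypothetical helper: it mirrors the preference order described in the
    commit message, not the repository's actual function.
    """
    if "hybrid_scores" in result and "hybrid_total" in result:
        return result["hybrid_scores"], result["hybrid_total"], "hybrid"
    # Fall back to algo-only scores when the judge phase has not run.
    return result.get("component_scores", {}), result.get("total_score", 0.0), "algo"


# Illustrative task results, not real benchmark data.
judged = {
    "hybrid_scores": {"approach": 15.0},
    "hybrid_total": 15.0,
    "component_scores": {"approach": 16},
    "total_score": 16,
}
unjudged = {"component_scores": {"approach": 16}, "total_score": 16}

print(pick_task_scores(judged)[2])
print(pick_task_scores(unjudged)[2])
```

A task that carries both hybrid keys is reported as `hybrid`; anything else falls back to the algo-only fields.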

README.md CHANGED
@@ -12,88 +12,29 @@ license: mit
12
 
13
  # BioDesignBench Leaderboard
14
 
15
- Interactive leaderboard for **BioDesignBench**, a benchmark evaluating LLM agents on protein design tasks via MCP (Model Context Protocol) tool use.
16
 
17
  **Romero Lab, Duke University**
18
 
19
- ## What the leaderboard shows
20
 
21
- - **Overall Leaderboard** -- Mixed-ranking table with human baselines and LLM agents, filterable by mode (benchmark/user), MCP tool type (reference/custom), and entry type.
22
- - **Taxonomy Breakdown** -- Heatmap of per-cell scores across 17 taxonomy cells (5 task types x 5 biological contexts) with average-per-type bar chart.
23
- - **Component Analysis** -- Radar and grouped bar charts comparing the 6 scoring components (Approach, Orchestration, Quality, Feasibility, Novelty, Diversity) between any two agents.
24
- - **Benchmark vs User Mode** -- Paired comparison showing how the same LLM performs with minimal prompting (benchmark) vs rich guidance (user mode).
25
- - **Submit** -- Form to submit your own protein design agent for evaluation.
26
- - **About** -- Methodology, scoring rubric, submission guide, and citation.
 
27
 
28
- ## Run locally
 
 
29
 
30
- ```bash
31
- pip install -r requirements.txt
32
- python app.py
33
- ```
34
 
35
- The app launches a Gradio server at `http://localhost:7860`.
36
-
37
- ## HuggingFace Space deployment
38
-
39
- This directory is structured as a self-contained HF Space. To deploy:
40
-
41
- 1. Create a new Space on HuggingFace (`sdk: gradio`).
42
- 2. Push the contents of this directory to the Space repo.
43
- 3. Set the `BDB_ADMIN_PASSWORD` secret in the Space settings for admin panel access.
44
- 4. Optionally set `HF_TOKEN` for submission queue access (private dataset).
45
-
46
- The Space will automatically build and serve the leaderboard.
47
-
48
- ## How to update results
49
-
50
- Add new entries to `leaderboard_data.json` following the existing schema:
51
-
52
- ```json
53
- {
54
- "agent_name": "Your Agent",
55
- "agent_id": "your-agent-user",
56
- "mode": "user",
57
- "mcp_custom": false,
58
- "submission_type": "llm",
59
- "organization": "Your Org",
60
- "overall_score": 42.0,
61
- "component_scores": {
62
- "approach": 10.0,
63
- "orchestration": 8.0,
64
- "quality": 14.0,
65
- "feasibility": 6.0,
66
- "novelty": 2.0,
67
- "diversity": 2.0
68
- },
69
- "taxonomy_scores": {
70
- "de_novo_binder": {"ab": 45, "enz": 40, "sig": 43},
71
- "sequence_optimization": {"ab": 50, "enz": 42, "sig": 38, "str": 44, "flu": 52},
72
- "de_novo_backbone": {"str": 28},
73
- "complex_engineering": {"enz": 40, "sig": 44, "str": 46},
74
- "conformational_design": {"enz": 38, "sig": 42, "str": 40, "flu": 44}
75
- },
76
- "tasks_completed": 76,
77
- "tasks_total": 76,
78
- "tasks_with_zero": 4,
79
- "avg_latency_sec": 50.0,
80
- "submission_date": "2026-03-15"
81
- }
82
- ```
83
-
84
- Update the `last_updated` field at the top of the JSON file after adding entries.
85
-
86
- ## File overview
87
-
88
- | File | Description |
89
- |------|-------------|
90
- | `app.py` | Main Gradio application with 7 tabs |
91
- | `leaderboard_data.json` | Current benchmark results |
92
- | `mcp_tool_schemas.json` | 17 reference MCP tool schemas |
93
- | `eval_scorer.py` | Self-contained 100-point scoring rubric |
94
- | `eval_queue.py` | Submission queue (HuggingFace Datasets) |
95
- | `eval_dispatcher.py` | HTTP task dispatcher for benchmarking |
96
- | `eval_boltz.py` | Boltz structure prediction post-eval |
97
- | `eval_tasks.py` | Hidden task loader from HF Dataset |
98
- | `example_server.py` | Reference FastAPI server for submitters |
99
- | `requirements.txt` | Python dependencies |
 
12
 
13
  # BioDesignBench Leaderboard
14
 
15
+ Evaluating LLM Agents on Protein Design via MCP Tools.
16
 
17
  **Romero Lab, Duke University**
18
 
19
+ ## Overview
20
 
21
+ BioDesignBench evaluates LLM agents as orchestrators of multi-step *stochastic*
22
+ protein-design pipelines. This leaderboard tracks agent performance across
23
+ **76 design tasks** spanning a **2 × 5 design matrix** (de novo design vs
24
+ redesign × five molecular families: antibody, binder, enzyme, scaffold,
25
+ fluorescent protein; **9 occupied cells**), scored on a 100-point hybrid rubric:
26
+ **72 algorithmic points** (Boltz-2 verification + sequence/feasibility metrics)
27
+ plus **28 LLM-judge points** (3-judge panel with self-exclusion).
28
 
29
+ The six rubric components are Approach, Orchestration, Quality, Feasibility,
30
+ Novelty, and Diversity. See the *About* tab for the full methodology and the
31
+ *Depth Gap* tab for evaluation-depth interventions.
32
 
33
+ ## Features
 
 
 
34
 
35
+ - **Overall Leaderboard** — Mixed-ranking table with human baselines and LLM agents
36
+ - **Taxonomy Heatmap** — Per-cell scores across the 9 occupied cells of the 2 × 5 design matrix
37
+ - **Component Analysis** — Radar and bar charts comparing the 6 scoring components
38
+ - **Guidance Effect** — Paired comparison of the same LLM in unguided (atomic tools) vs guided (composite workflows) mode
39
+ - **Depth Gap** — Forced-depth and low-diversity intervention results
40
+ - **About** — Methodology, submission guide, and citation info
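
The 72 + 28 split quoted in the overview can be checked against the per-component weights that this commit adds in `llm_judge/aggregation.py`. The snippet below copies that table as shown in the diff; it is a sanity check, not part of the repository.

```python
# Per-component algo/LLM point split, copied from llm_judge/aggregation.py
# as shown in this commit.
WEIGHT_SPLIT = {
    "approach": {"algo": 10, "llm": 10},
    "orchestration": {"algo": 7, "llm": 8},
    "quality": {"algo": 35, "llm": 0},
    "feasibility": {"algo": 10, "llm": 5},
    "novelty": {"algo": 3, "llm": 2},
    "diversity": {"algo": 7, "llm": 3},
}

algo_points = sum(part["algo"] for part in WEIGHT_SPLIT.values())
llm_points = sum(part["llm"] for part in WEIGHT_SPLIT.values())

print(algo_points, llm_points, algo_points + llm_points)  # 72 28 100
```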
app.py CHANGED
@@ -1523,14 +1523,21 @@ def create_app() -> gr.Blocks:
1523
  label="Submission ID", scale=2,
1524
  )
1525
  boltz_btn = gr.Button(
1526
- "Phase B: Run Boltz", scale=1,
 
 
 
 
 
 
 
1527
  )
1528
  with gr.Row():
1529
  final_id = gr.Textbox(
1530
  label="Submission ID", scale=2,
1531
  )
1532
  final_btn = gr.Button(
1533
- "Phase C: Finalize & Publish", scale=1,
1534
  )
1535
  pipeline_out = gr.HTML()
1536
 
@@ -1681,6 +1688,9 @@ def create_app() -> gr.Blocks:
1681
  "No task results to process.</div>"
1682
  )
1683
  run_boltz_posteval(per_task)
 
 
 
1684
  return (
1685
  '<div style="color:#38a169">'
1686
  "Boltz post-assessment complete.</div>"
@@ -1688,6 +1698,51 @@ def create_app() -> gr.Blocks:
1688
  except Exception as e:
1689
  return f'<div style="color:#e53e3e">{e}</div>'
1690
 
1691
  def _run_finalize(sid):
1692
  try:
1693
  from eval_queue import (
@@ -1712,10 +1767,12 @@ def create_app() -> gr.Blocks:
1712
  component_scores=agg["component_scores"],
1713
  taxonomy_scores=agg["taxonomy_scores"],
1714
  )
 
1715
  return (
1716
  f'<div style="color:#38a169">'
1717
  f'Finalized! Score: '
1718
- f'{agg["overall_score"]:.1f}</div>'
 
1719
  )
1720
  except Exception as e:
1721
  return f'<div style="color:#e53e3e">{e}</div>'
@@ -1726,6 +1783,9 @@ def create_app() -> gr.Blocks:
1726
  boltz_btn.click(
1727
  _run_boltz, [boltz_id], pipeline_out,
1728
  )
 
 
 
1729
  final_btn.click(
1730
  _run_finalize, [final_id], pipeline_out,
1731
  )
 
1523
  label="Submission ID", scale=2,
1524
  )
1525
  boltz_btn = gr.Button(
1526
+ "Phase B: Run Boltz (GPU)", scale=1,
1527
+ )
1528
+ with gr.Row():
1529
+ judge_id = gr.Textbox(
1530
+ label="Submission ID", scale=2,
1531
+ )
1532
+ judge_btn = gr.Button(
1533
+ "Phase C: Run LLM Judge", scale=1,
1534
  )
1535
  with gr.Row():
1536
  final_id = gr.Textbox(
1537
  label="Submission ID", scale=2,
1538
  )
1539
  final_btn = gr.Button(
1540
+ "Phase D: Finalize & Publish", scale=1,
1541
  )
1542
  pipeline_out = gr.HTML()
1543
 
 
1688
  "No task results to process.</div>"
1689
  )
1690
  run_boltz_posteval(per_task)
1691
+ from eval_queue import save_task_result
1692
+ for tid, tres in per_task.items():
1693
+ save_task_result(sid.strip(), tid, tres)
1694
  return (
1695
  '<div style="color:#38a169">'
1696
  "Boltz post-assessment complete.</div>"
 
1698
  except Exception as e:
1699
  return f'<div style="color:#e53e3e">{e}</div>'
1700
 
1701
+ def _run_judge(sid):
1702
+ try:
1703
+ import eval_judge as ej
1704
+ from eval_queue import (
1705
+ get_submission, save_task_result, update_status,
1706
+ )
1707
+
1708
+ sub = get_submission(sid.strip())
1709
+ if sub is None:
1710
+ return ('<div style="color:#e53e3e">'
1711
+ 'Not found</div>')
1712
+ per_task = json.loads(
1713
+ sub.get("per_task_results", "{}")
1714
+ )
1715
+ if not per_task:
1716
+ return ('<div style="color:#e53e3e">'
1717
+ "No task results to process.</div>")
1718
+
1719
+ update_status(sid.strip(), "scoring")
1720
+ ej.run_judge_panel(
1721
+ per_task,
1722
+ agent_id=sub.get("agent_name", "unknown"),
1723
+ dry_run=False,
1724
+ )
1725
+ for tid, tres in per_task.items():
1726
+ save_task_result(sid.strip(), tid, tres)
1727
+
1728
+ n_done = sum(
1729
+ 1 for r in per_task.values()
1730
+ if r.get("hybrid_total") is not None
1731
+ )
1732
+ return (
1733
+ f'<div style="color:#38a169">'
1734
+ f"LLM judge complete on {n_done} tasks."
1735
+ "</div>"
1736
+ )
1737
+ except Exception as e:
1738
+ import traceback
1739
+ return (
1740
+ f'<div style="color:#e53e3e">'
1741
+ f'<strong>Judge error:</strong> {e}<br>'
1742
+ f'<pre style="font-size:0.7rem">'
1743
+ f'{traceback.format_exc()[:600]}</pre></div>'
1744
+ )
1745
+
1746
  def _run_finalize(sid):
1747
  try:
1748
  from eval_queue import (
 
1767
  component_scores=agg["component_scores"],
1768
  taxonomy_scores=agg["taxonomy_scores"],
1769
  )
1770
+ mode_label = agg.get("scoring_mode", "algo")
1771
  return (
1772
  f'<div style="color:#38a169">'
1773
  f'Finalized! Score: '
1774
+ f'{agg["overall_score"]:.1f} '
1775
+ f'(scoring={mode_label})</div>'
1776
  )
1777
  except Exception as e:
1778
  return f'<div style="color:#e53e3e">{e}</div>'
 
1783
  boltz_btn.click(
1784
  _run_boltz, [boltz_id], pipeline_out,
1785
  )
1786
+ judge_btn.click(
1787
+ _run_judge, [judge_id], pipeline_out,
1788
+ )
1789
  final_btn.click(
1790
  _run_finalize, [final_id], pipeline_out,
1791
  )
eval_judge.py ADDED
@@ -0,0 +1,148 @@
1
+ """LLM Judge orchestration for the leaderboard backend.
2
+
3
+ Runs the cross-model judge panel on each successfully scored task and
4
+ merges the resulting LLM points into the algorithmic component scores
5
+ to produce hybrid totals (28 LLM points + 72 algorithmic points = 100).
6
+
7
+ The judge panel uses 3 judges from different model families with
8
+ self-exclusion (PoLL, Verga et al. 2024). Individual judge calls are
9
+ synchronous; we process tasks sequentially to keep the API spend
10
+ predictable. Provider keys are read from environment variables that
11
+ must be configured as HuggingFace Space secrets:
12
+
13
+ ANTHROPIC_API_KEY
14
+ OPENAI_API_KEY
15
+ GOOGLE_API_KEY
16
+ DEEPSEEK_API_KEY
17
+ """
18
+
19
+ from __future__ import annotations
20
+
21
+ import logging
22
+ from typing import Any
23
+
24
+ from llm_judge import (
25
+ LLMJudgePanel,
26
+ detect_agent_family,
27
+ merge_algo_and_judge_scores,
28
+ split_algo_score,
29
+ )
30
+
31
+ logger = logging.getLogger(__name__)
32
+
33
+
34
+ def _build_algo_dict(task_result: dict[str, Any]) -> dict[str, float]:
35
+ """Pull per-component algo scores from a task result.
36
+
37
+ Prefers 'cpu_scores' (post-Boltz) and falls back to 'final_scores'
38
+ when only the latter is present; otherwise returns all-zero scores.
39
+ """
40
+ if "cpu_scores" in task_result:
41
+ return dict(task_result["cpu_scores"])
42
+ if "final_scores" in task_result:
43
+ return dict(task_result["final_scores"])
44
+ return {
45
+ "approach": 0,
46
+ "orchestration": 0,
47
+ "quality": 0,
48
+ "feasibility": 0,
49
+ "novelty": 0,
50
+ "diversity": 0,
51
+ }
52
+
53
+
54
+ def run_judge_panel(
55
+ per_task_results: dict[str, dict[str, Any]],
56
+ agent_id: str,
57
+ dry_run: bool = False,
58
+ progress_callback=None,
59
+ ) -> dict[str, dict[str, Any]]:
60
+ """Run the LLM judge panel over every successful task in a submission.
61
+
62
+ For each task with a non-empty design output:
63
+ 1. Look up the original task prompt (used to give the panel context).
64
+ 2. Build a 3-judge panel that excludes the agent's own model family.
65
+ 3. Run all judges synchronously and aggregate.
66
+ 4. Compute the hybrid component scores by:
67
+ - splitting each algo score into its algo-portion (split_algo_score)
68
+ - adding the matching judge LLM-portion (merge_algo_and_judge_scores)
69
+ 5. Store both raw judge results and final hybrid scores on the task.
70
+
71
+ The function modifies per_task_results in place and also returns it.
72
+
73
+ Args:
74
+ per_task_results: Dict mapping task_id → task result (from the
75
+ dispatcher + boltz post-eval pipeline).
76
+ agent_id: Agent identifier for self-exclusion (e.g., 'gpt5-tools').
77
+ dry_run: If True, judges return midpoint scores without API calls.
78
+ progress_callback: Optional callable(task_id, i, total).
79
+
80
+ Returns:
81
+ The same dict, now augmented with 'judge_scores' and 'hybrid_scores'
82
+ per task and 'hybrid_total' on each successful entry.
83
+ """
84
+ from eval_tasks import get_task
85
+
86
+ family = detect_agent_family(agent_id)
87
+ panel = LLMJudgePanel(agent_model_family=family, dry_run=dry_run)
88
+ logger.info(
89
+ f"LLM judge panel for agent '{agent_id}' (family={family}): "
90
+ f"{len(panel.judges)} judges, dry_run={dry_run}"
91
+ )
92
+
93
+ eligible = [
94
+ tid for tid, r in per_task_results.items()
95
+ if r.get("success") and r.get("sequences")
96
+ ]
97
+ total = len(eligible)
98
+
99
+ for i, task_id in enumerate(eligible):
100
+ result = per_task_results[task_id]
101
+
102
+ # Pull task prompt for judge context. If the dataset is not
103
+ # reachable (e.g., dev mode without HF_TOKEN) we still run with
104
+ # a placeholder description rather than aborting the whole run.
105
+ task_data = get_task(task_id) or {}
106
+ task_description = task_data.get("prompt_md") or f"BioDesignBench task {task_id}"
107
+
108
+ algo_metrics = result.get("agent_metrics", {})
109
+ if "boltz_metrics" in result:
110
+ algo_metrics = {**algo_metrics, **result["boltz_metrics"]}
111
+
112
+ try:
113
+ judge_result = panel.evaluate_sync(
114
+ task_description=task_description,
115
+ tool_call_log=result.get("run_log", []),
116
+ designed_sequences=result.get("sequences", []),
117
+ algorithmic_metrics=algo_metrics,
118
+ )
119
+ except Exception as e:
120
+ logger.error(f"Judge panel failed on {task_id}: {e}")
121
+ judge_result = None
122
+
123
+ # Build algo-portion dict (split each component down to its algo max)
124
+ algo_full = _build_algo_dict(result)
125
+ rubric_max = {
126
+ "approach": 20, "orchestration": 15, "quality": 35,
127
+ "feasibility": 15, "novelty": 5, "diversity": 10,
128
+ }
129
+ algo_split = {
130
+ comp: split_algo_score(comp, score, rubric_max[comp])
131
+ for comp, score in algo_full.items()
132
+ }
133
+
134
+ hybrid = merge_algo_and_judge_scores(algo_split, judge_result)
135
+ hybrid_total = sum(hybrid.values())
136
+
137
+ result["judge_scores"] = judge_result
138
+ result["hybrid_scores"] = hybrid
139
+ result["hybrid_total"] = round(hybrid_total, 2)
140
+
141
+ if progress_callback:
142
+ progress_callback(task_id, i + 1, total)
143
+
144
+ logger.info(
145
+ f"[{i+1}/{total}] {task_id}: hybrid={hybrid_total:.1f}"
146
+ )
147
+
148
+ return per_task_results
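
Step 4 of the `run_judge_panel` docstring (scale each algorithmic score to its algo budget, add the judge points, cap at the rubric max) can be worked through on one task. This is a condensed sketch under stated assumptions: `hybrid_component` is a hypothetical helper that folds `split_algo_score` and `merge_algo_and_judge_scores` into one function, judge scores are assumed to be already aggregated per component, and all numbers are illustrative.

```python
from __future__ import annotations

# Algo / LLM point budgets per component, as listed in this commit.
WEIGHT_SPLIT = {
    "approach": {"algo": 10, "llm": 10},
    "orchestration": {"algo": 7, "llm": 8},
    "quality": {"algo": 35, "llm": 0},
    "feasibility": {"algo": 10, "llm": 5},
    "novelty": {"algo": 3, "llm": 2},
    "diversity": {"algo": 7, "llm": 3},
}
RUBRIC_MAX = {c: s["algo"] + s["llm"] for c, s in WEIGHT_SPLIT.items()}


def hybrid_component(component: str, algo_score: float, judge_score: float) -> float:
    """Scale the algo score to its budget, add judge points, cap at rubric max."""
    split = WEIGHT_SPLIT[component]
    rubric_max = RUBRIC_MAX[component]
    if split["llm"] == 0:
        algo_part = algo_score  # quality keeps its full algorithmic score
    else:
        algo_part = round(algo_score / rubric_max * split["algo"], 2)
    return min(algo_part + judge_score, rubric_max)


# Illustrative numbers only, not real benchmark results.
algo = {"approach": 16, "orchestration": 9, "quality": 21,
        "feasibility": 12, "novelty": 5, "diversity": 6}
judge = {"approach": 7, "orchestration": 6, "quality": 0,
         "feasibility": 4, "novelty": 1, "diversity": 2}

hybrid = {c: hybrid_component(c, algo[c], judge[c]) for c in algo}
print(hybrid)
print(round(sum(hybrid.values()), 2))
```

For example, an approach score of 16 out of 20 scales to 8.0 of the 10 algo points, and 7 judge points bring the component to 15.0.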
eval_scorer.py CHANGED
@@ -1368,15 +1368,14 @@ def score_diversity(
1368
  return {"score": 0, "max": max_points, "num_designs": 0, "pairwise_diversity": 0.0, "entropy": 0.0}
1369
 
1370
  num = len(designs)
 
1371
  diversity = mean_pairwise_diversity(designs)
1372
  entropy = sequence_entropy(designs)
1373
 
1374
- # Score based purely on sequence diversity (not design count).
1375
- # Tasks don't specify how many designs to produce, so counting
1376
- # would unfairly penalise agents that submit fewer designs.
1377
- diversity_score = diversity * max_points * 0.65
1378
- entropy_score = entropy * max_points * 0.35
1379
- total = int(round(diversity_score + entropy_score))
1380
 
1381
  return {
1382
  "score": min(total, max_points), "max": max_points,
@@ -1584,12 +1583,11 @@ def aggregate_scores(
1584
  ) -> dict[str, Any]:
1585
  """Aggregate per-task scores into an overall submission result.
1586
 
1587
- Args:
1588
- per_task_scores: Dict mapping task_id → score_submission_task() result.
1589
-
1590
- Returns:
1591
- Dict with: overall_score, component_scores (averaged), taxonomy_scores,
1592
- tasks_completed, tasks_with_zero.
1593
  """
1594
  if not per_task_scores:
1595
  return {
@@ -1604,16 +1602,24 @@ def aggregate_scores(
1604
  totals = {c: 0.0 for c in DEFAULT_DESIGN_RUBRIC}
1605
  n = len(per_task_scores)
1606
  tasks_with_zero = 0
 
1607
 
1608
  # Taxonomy breakdown
1609
  taxonomy_scores: dict[str, dict[str, list[float]]] = {}
1610
 
1611
  for task_id, result in per_task_scores.items():
1612
- total_score = result["total_score"]
 
 
 
 
 
 
 
1613
  if total_score == 0:
1614
  tasks_with_zero += 1
1615
 
1616
- for comp, val in result["component_scores"].items():
1617
  totals[comp] += val
1618
 
1619
  # Taxonomy mapping
@@ -1641,4 +1647,5 @@ def aggregate_scores(
1641
  "tasks_completed": n,
1642
  "tasks_total": n,
1643
  "tasks_with_zero": tasks_with_zero,
 
1644
  }
 
1368
  return {"score": 0, "max": max_points, "num_designs": 0, "pairwise_diversity": 0.0, "entropy": 0.0}
1369
 
1370
  num = len(designs)
1371
+ count_fraction = min(num / max_designs, 1.0) if max_designs > 0 else 1.0
1372
  diversity = mean_pairwise_diversity(designs)
1373
  entropy = sequence_entropy(designs)
1374
 
1375
+ count_score = count_fraction * max_points * 0.4
1376
+ diversity_score = diversity * max_points * 0.4
1377
+ entropy_score = entropy * max_points * 0.2
1378
+ total = int(round(count_score + diversity_score + entropy_score))
 
 
1379
 
1380
  return {
1381
  "score": min(total, max_points), "max": max_points,
 
1583
  ) -> dict[str, Any]:
1584
  """Aggregate per-task scores into an overall submission result.
1585
 
1586
+ If `eval_judge.run_judge_panel()` has been run beforehand, each task
1587
+ will carry `hybrid_scores` and `hybrid_total`; in that case we use
1588
+ the hybrid (algo + LLM judge, capped at rubric max) as the canonical
1589
+ score. Otherwise we fall back to the algo-only `component_scores` /
1590
+ `total_score` produced by the dispatcher + Boltz pipeline.
 
1591
  """
1592
  if not per_task_scores:
1593
  return {
 
1602
  totals = {c: 0.0 for c in DEFAULT_DESIGN_RUBRIC}
1603
  n = len(per_task_scores)
1604
  tasks_with_zero = 0
1605
+ used_hybrid = False
1606
 
1607
  # Taxonomy breakdown
1608
  taxonomy_scores: dict[str, dict[str, list[float]]] = {}
1609
 
1610
  for task_id, result in per_task_scores.items():
1611
+ if "hybrid_scores" in result and "hybrid_total" in result:
1612
+ comp_scores = result["hybrid_scores"]
1613
+ total_score = result["hybrid_total"]
1614
+ used_hybrid = True
1615
+ else:
1616
+ comp_scores = result.get("component_scores", {})
1617
+ total_score = result.get("total_score", 0.0)
1618
+
1619
  if total_score == 0:
1620
  tasks_with_zero += 1
1621
 
1622
+ for comp, val in comp_scores.items():
1623
  totals[comp] += val
1624
 
1625
  # Taxonomy mapping
 
1647
  "tasks_completed": n,
1648
  "tasks_total": n,
1649
  "tasks_with_zero": tasks_with_zero,
1650
+ "scoring_mode": "hybrid" if used_hybrid else "algo",
1651
  }
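
The revised `score_diversity` weighting in this file (40% design count, 40% pairwise diversity, 20% entropy) can be worked through with a small sketch. It assumes that `mean_pairwise_diversity` and `sequence_entropy` return values in [0, 1], which the diff does not show. `diversity_points` is a hypothetical stand-in, and the default `max_points=10` is taken from the diversity rubric maximum listed elsewhere in this commit.

```python
def diversity_points(
    num_designs: int,
    max_designs: int,
    diversity: float,
    entropy: float,
    max_points: int = 10,
) -> int:
    """Combine design count, pairwise diversity, and entropy into points.

    `diversity` and `entropy` are assumed to be normalised to [0, 1].
    """
    count_fraction = min(num_designs / max_designs, 1.0) if max_designs > 0 else 1.0
    total = int(round(
        count_fraction * max_points * 0.4
        + diversity * max_points * 0.4
        + entropy * max_points * 0.2
    ))
    return min(total, max_points)


# Half the requested designs, moderate diversity and entropy.
print(diversity_points(5, 10, diversity=0.6, entropy=0.5))
# Full count with maximal diversity and entropy reaches the cap.
print(diversity_points(10, 10, diversity=1.0, entropy=1.0))
```

Under this weighting an agent that submits fewer designs than `max_designs` loses part of the count term even when its sequences are highly diverse, which is the behaviour the removed comment had argued against.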
llm_judge/__init__.py ADDED
@@ -0,0 +1,50 @@
1
+ """LLM-as-a-Judge scoring for BioDesignBench Tier 2 evaluation.
2
+
3
+ Provides cross-model LLM judge panels that evaluate subjective dimensions
4
+ (approach, orchestration, feasibility, novelty, diversity) while quality
5
+ metrics remain 100% algorithmic.
6
+
7
+ Usage:
8
+ from llm_judge import LLMJudgePanel
9
+
10
+ panel = LLMJudgePanel(agent_model_family="anthropic", dry_run=True)
11
+ result = panel.evaluate_sync(
12
+ task_description="Design a binder for IL-6R",
13
+ tool_call_log=[...],
14
+ designed_sequences=["MKVL..."],
15
+ algorithmic_metrics={"pLDDT": 82.5},
16
+ )
17
+ """
18
+
19
+ from llm_judge.aggregation import (
20
+ WEIGHT_SPLIT,
21
+ aggregate_judge_scores,
22
+ merge_algo_and_judge_scores,
23
+ split_algo_score,
24
+ )
25
+ from llm_judge.judge import LLMJudge, parse_judge_response
26
+ from llm_judge.panel import (
27
+ LLMJudgePanel,
28
+ detect_agent_family,
29
+ get_judge_models,
30
+ )
31
+ from llm_judge.rubrics import (
32
+ JUDGE_DIMENSIONS,
33
+ JUDGE_SYSTEM_PROMPT,
34
+ build_judge_prompt,
35
+ )
36
+
37
+ __all__ = [
38
+ "LLMJudge",
39
+ "LLMJudgePanel",
40
+ "JUDGE_DIMENSIONS",
41
+ "JUDGE_SYSTEM_PROMPT",
42
+ "WEIGHT_SPLIT",
43
+ "aggregate_judge_scores",
44
+ "build_judge_prompt",
45
+ "detect_agent_family",
46
+ "get_judge_models",
47
+ "merge_algo_and_judge_scores",
48
+ "parse_judge_response",
49
+ "split_algo_score",
50
+ ]
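
The panel module itself falls outside this excerpt, so the following is only a guess at how self-exclusion could work, assuming one judge per provider family named in the `eval_judge.py` docstring. `JUDGE_FAMILIES` and `pick_panel` are hypothetical names; the real `get_judge_models` and `LLMJudgePanel` may behave differently.

```python
from __future__ import annotations

# Provider families taken from the API keys listed in eval_judge.py.
JUDGE_FAMILIES = ["anthropic", "openai", "google", "deepseek"]


def pick_panel(agent_family: str, size: int = 3) -> list[str]:
    """Choose up to `size` judge families, skipping the agent's own family.

    Hypothetical sketch: with four families and one excluded, exactly
    three remain, which matches the 3-judge panel described in the commit.
    """
    candidates = [fam for fam in JUDGE_FAMILIES if fam != agent_family]
    return candidates[:size]


print(pick_panel("anthropic"))
print(pick_panel("unknown"))
```

When the agent's family is not recognised, nothing is excluded and this sketch simply takes the first three families in list order.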
llm_judge/aggregation.py ADDED
@@ -0,0 +1,200 @@
1
+ """Score aggregation and merging for LLM judge panel.
2
+
3
+ Implements:
4
+ - Weighted averaging with outlier downweighting
5
+ - Algo + LLM score merging with rubric cap enforcement
6
+ - Weight split configuration (72/28 algo-LLM)
7
+ """
8
+
9
+ from __future__ import annotations
10
+
11
+ import statistics
12
+ from typing import Any
13
+
14
+ from llm_judge.rubrics import JUDGE_DIMENSIONS
15
+
16
+
17
+ # ---------------------------------------------------------------------------
18
+ # Weight split: algo + LLM portions per component (must sum to rubric max)
19
+ # ---------------------------------------------------------------------------
20
+
21
+ WEIGHT_SPLIT: dict[str, dict[str, int]] = {
22
+ "approach": {"algo": 10, "llm": 10}, # 20 total
23
+ "orchestration": {"algo": 7, "llm": 8}, # 15 total
24
+ "quality": {"algo": 35, "llm": 0}, # 35 total (no LLM)
25
+ "feasibility": {"algo": 10, "llm": 5}, # 15 total
26
+ "novelty": {"algo": 3, "llm": 2}, # 5 total
27
+ "diversity": {"algo": 7, "llm": 3}, # 10 total
28
+ }
29
+
30
+ # Mapping from LLM judge dimension → rubric component
31
+ _JUDGE_DIM_TO_COMPONENT: dict[str, str] = {
32
+ "approach_strategy": "approach",
33
+ "orchestration_reasoning": "orchestration",
34
+ "bio_feasibility": "feasibility",
35
+ "novelty_quality": "novelty",
36
+ "diversity_quality": "diversity",
37
+ }
38
+
39
+ # Rubric max per component
40
+ _RUBRIC_MAX: dict[str, int] = {
41
+ "approach": 20,
42
+ "orchestration": 15,
43
+ "quality": 35,
44
+ "feasibility": 15,
45
+ "novelty": 5,
46
+ "diversity": 10,
47
+ }
48
+
49
+
50
+ def aggregate_judge_scores(
51
+ judge_results: list[dict[str, dict[str, Any]]],
52
+ ) -> dict[str, dict[str, Any]]:
53
+ """Aggregate scores from multiple judges with outlier downweighting.
54
+
55
+ For each dimension:
56
+ 1. Collect raw scores from all judges
57
+ 2. Compute median
58
+ 3. Downweight outliers (>2 points from median) by 0.5x
59
+ 4. Compute weighted average
60
+
61
+ Args:
62
+ judge_results: List of per-judge result dicts.
63
+ Each maps dimension_name → {reasoning, score}.
64
+
65
+ Returns:
66
+ Aggregated dict mapping dimension_name → {score, reasoning, raw_scores}.
67
+
68
+ Raises:
69
+ ValueError: If judge_results is empty.
70
+ """
71
+ if not judge_results:
72
+ raise ValueError("aggregate_judge_scores requires at least one judge result")
73
+
74
+ if len(judge_results) == 1:
75
+ # Single judge: pass through directly
76
+ result = {}
77
+ for dim in JUDGE_DIMENSIONS:
78
+ entry = judge_results[0].get(dim, {"score": 0, "reasoning": ""})
79
+ result[dim] = {
80
+ "score": float(entry["score"]),
81
+ "reasoning": entry.get("reasoning", ""),
82
+ "raw_scores": [entry["score"]],
83
+ }
84
+ return result
85
+
86
+ aggregated = {}
87
+ for dim, info in JUDGE_DIMENSIONS.items():
88
+ raw_scores = []
89
+ reasonings = []
90
+ for jr in judge_results:
91
+ entry = jr.get(dim, {"score": info["max_score"] // 2, "reasoning": ""})
92
+ raw_scores.append(float(entry["score"]))
93
+ reasonings.append(entry.get("reasoning", ""))
94
+
95
+ # Outlier detection: downweight scores >2 points from median
96
+ med = statistics.median(raw_scores)
97
+ weights = []
98
+ for s in raw_scores:
99
+ if abs(s - med) > 2.0:
100
+ weights.append(0.5)
101
+ else:
102
+ weights.append(1.0)
103
+
104
+ # Weighted average
105
+ weighted_sum = sum(s * w for s, w in zip(raw_scores, weights))
106
+ weight_total = sum(weights)
107
+ avg = weighted_sum / weight_total if weight_total > 0 else 0
108
+
109
+ # Clamp to valid range
110
+ avg = max(0, min(avg, info["max_score"]))
111
+
112
+ aggregated[dim] = {
113
+ "score": round(avg, 1),
114
+ "reasoning": " | ".join(
115
+ f"[Judge {i+1}] {r}" for i, r in enumerate(reasonings) if r
116
+ ),
117
+ "raw_scores": raw_scores,
118
+ }
119
+
120
+ return aggregated
121
+
122
+
123
+ def split_algo_score(
124
+ component: str,
125
+ original_score: float,
126
+ original_max: int,
127
+ ) -> float:
128
+ """Scale an algorithmic score to its algo-only portion.
129
+
130
+ For the hybrid system, algorithmic scores are computed against the
131
+ original rubric max (e.g., approach out of 20), then scaled down
132
+ to the algo-only portion (e.g., 10 out of 20).
133
+
134
+ Quality is special: it keeps its full 35 points (no LLM portion).
135
+
136
+ Args:
137
+ component: Rubric component name.
138
+ original_score: Score computed against original max.
139
+ original_max: Original rubric max for this component.
140
+
141
+ Returns:
142
+ Scaled score for the algo-only portion.
143
+ """
144
+ split = WEIGHT_SPLIT.get(component)
145
+ if split is None:
146
+ return original_score
147
+
148
+ algo_max = split["algo"]
149
+
150
+ if split["llm"] == 0:
151
+ # No LLM portion — return original score unchanged
152
+ return original_score
153
+
154
+ # Scale: (original_score / original_max) * algo_max
155
+ if original_max == 0:
156
+ return 0.0
157
+ ratio = original_score / original_max
158
+ return round(ratio * algo_max, 2)
159
+
160
+
161
+ def merge_algo_and_judge_scores(
162
+ algo_scores: dict[str, float | int],
163
+ judge_scores: dict[str, dict[str, Any]] | None,
164
+ ) -> dict[str, float]:
165
+ """Merge algorithmic and LLM judge scores into final component scores.
166
+
167
+ Args:
168
+ algo_scores: Dict mapping component → algo-portion score.
169
+ These should already be split via split_algo_score().
170
+ judge_scores: Aggregated judge scores (from aggregate_judge_scores).
171
+ None if LLM judge is disabled.
172
+
173
+ Returns:
174
+ Dict mapping component → final merged score (capped at rubric max).
175
+ """
176
+ if judge_scores is None:
177
+ return dict(algo_scores)
178
+
179
+ merged = {}
180
+ for component, algo_score in algo_scores.items():
181
+ rubric_max = _RUBRIC_MAX.get(component, 100)
182
+
183
+ # Find matching judge dimension
184
+ judge_dim = None
185
+ for jd, comp in _JUDGE_DIM_TO_COMPONENT.items():
186
+ if comp == component:
187
+ judge_dim = jd
188
+ break
189
+
190
+ if judge_dim and judge_dim in judge_scores:
191
+ llm_score = judge_scores[judge_dim].get("score", 0)
192
+ if isinstance(llm_score, dict):
193
+ llm_score = llm_score.get("score", 0)
194
+ total = algo_score + llm_score
195
+ else:
196
+ total = algo_score
197
+
198
+ merged[component] = min(total, rubric_max)
199
+
200
+ return merged
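
The outlier handling in `aggregate_judge_scores` can be isolated into a small worked example. `downweighted_mean` is a hypothetical helper that restates the median-based weighting for a single dimension; the 2-point tolerance and 0.5 weight come from the code above, and the scores are illustrative.

```python
import statistics


def downweighted_mean(scores: list[float], max_score: float, tolerance: float = 2.0) -> float:
    """Average judge scores, halving the weight of those far from the median.

    `scores` must be non-empty, since statistics.median raises on an empty list.
    """
    med = statistics.median(scores)
    weights = [0.5 if abs(s - med) > tolerance else 1.0 for s in scores]
    avg = sum(s * w for s, w in zip(scores, weights)) / sum(weights)
    # Clamp to the valid range for this dimension, then round as the panel does.
    return round(max(0.0, min(avg, max_score)), 1)


# Two judges roughly agree; the third is 4 points below the median of 7.
print(downweighted_mean([8.0, 7.0, 3.0], max_score=10))
```

With these inputs the dissenting score counts half, so the result is (8 + 7 + 1.5) / 2.5 = 6.6 instead of the plain mean of 6.0.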
llm_judge/judge.py ADDED
@@ -0,0 +1,217 @@
1
+ """Single LLM judge: wraps one API call to evaluate a design attempt.
2
+
3
+ Supports Anthropic, OpenAI, Google, and DeepSeek providers.
4
+ In dry_run mode, returns deterministic midpoint scores without API calls.
5
+ """
6
+
7
+ from __future__ import annotations
8
+
9
+ import json
10
+ import re
11
+ from typing import Any
12
+
13
+ from llm_judge.rubrics import (
14
+ JUDGE_DIMENSIONS,
15
+ JUDGE_SYSTEM_PROMPT,
16
+ build_judge_prompt,
17
+ )
18
+
19
+
20
+ def _midpoint_scores() -> dict[str, dict[str, Any]]:
21
+ """Return deterministic midpoint scores for dry-run mode."""
22
+ result = {}
23
+ for dim, info in JUDGE_DIMENSIONS.items():
24
+ mid = info["max_score"] // 2
25
+ # Odd maxima (5, 3) round down under floor division (5 -> 2, 3 -> 1),
26
+ # so the dry-run midpoint never exceeds half of max_score; no
27
+ # adjustment is applied.
28
+ result[dim] = {
29
+ "reasoning": f"[Dry run] Midpoint score for {dim}.",
30
+ "score": mid,
31
+ }
32
+ return result
33
+
34
+
35
+ def parse_judge_response(raw_text: str) -> dict[str, dict[str, Any]]:
36
+ """Parse LLM judge response into structured scores.
37
+
38
+ Handles:
39
+ - Direct JSON response
40
+ - JSON inside markdown code blocks
41
+ - Out-of-range score clamping
42
+ - Invalid JSON fallback to midpoint scores
43
+
44
+ Args:
45
+ raw_text: Raw LLM response text.
46
+
47
+ Returns:
48
+ Dict mapping dimension names to {reasoning, score}.
49
+ """
50
+ # Try to extract JSON from markdown code block
51
+ json_match = re.search(r"```(?:json)?\s*\n?(.*?)\n?\s*```", raw_text, re.DOTALL)
52
+ json_str = json_match.group(1) if json_match else raw_text
53
+
54
+ try:
55
+ data = json.loads(json_str)
56
+ except json.JSONDecodeError:
57
+ # Try finding any JSON object in the text
58
+ brace_match = re.search(r"\{.*\}", raw_text, re.DOTALL)
59
+ if brace_match:
60
+ try:
61
+ data = json.loads(brace_match.group())
62
+ except json.JSONDecodeError:
63
+ return _midpoint_scores()
64
+ else:
65
+ return _midpoint_scores()
66
+
67
+ # Validate and clamp scores
68
+ result = {}
69
+ for dim, info in JUDGE_DIMENSIONS.items():
70
+ if dim in data and isinstance(data[dim], dict):
71
+ score = data[dim].get("score", info["max_score"] // 2)
72
+ if isinstance(score, (int, float)):
73
+ score = max(0, min(score, info["max_score"]))
74
+ else:
75
+ score = info["max_score"] // 2
76
+ reasoning = data[dim].get("reasoning", "")
77
+ result[dim] = {"reasoning": str(reasoning), "score": score}
78
+ else:
79
+ # Missing dimension — use midpoint
80
+ result[dim] = {
81
+ "reasoning": f"[Fallback] Dimension {dim} missing from judge response.",
82
+ "score": info["max_score"] // 2,
83
+ }
84
+
85
+ return result
86
+
87
+
88
+ class LLMJudge:
89
+ """Single LLM judge that evaluates a protein design attempt.
90
+
91
+ Args:
92
+ provider: API provider ('anthropic', 'openai', 'google', 'deepseek').
93
+ model: Model identifier string.
94
+ dry_run: If True, return deterministic scores without API calls.
95
+ api_key: Optional API key override.
96
+ """
97
+
98
+ def __init__(
99
+ self,
100
+ provider: str,
101
+ model: str,
102
+ dry_run: bool = False,
103
+ api_key: str | None = None,
104
+ ):
105
+ self.provider = provider
106
+ self.model = model
107
+ self.dry_run = dry_run
108
+ self.api_key = api_key
109
+ self.api_calls = 0
110
+ self._client = None
111
+
112
+ def _get_client(self):
113
+ """Lazy-initialize the API client."""
114
+ if self._client is not None:
115
+ return self._client
116
+
117
+ import os
118
+
119
+ if self.provider == "anthropic":
120
+ import anthropic
121
+
122
+ key = self.api_key or os.environ.get("ANTHROPIC_API_KEY")
123
+ self._client = anthropic.Anthropic(api_key=key)
124
+ elif self.provider == "openai":
125
+ from openai import OpenAI
126
+
127
+ key = self.api_key or os.environ.get("OPENAI_API_KEY")
128
+ self._client = OpenAI(api_key=key)
129
+ elif self.provider == "google":
130
+ from google import genai
131
+
132
+ key = self.api_key or os.environ.get("GOOGLE_API_KEY")
133
+ self._client = genai.Client(api_key=key)
134
+ elif self.provider == "deepseek":
135
+ from openai import OpenAI
136
+
137
+ key = self.api_key or os.environ.get("DEEPSEEK_API_KEY")
138
+ self._client = OpenAI(
139
+ api_key=key, base_url="https://api.deepseek.com"
140
+ )
141
+ else:
142
+ raise ValueError(f"Unknown provider: {self.provider}")
143
+
144
+ return self._client
145
+
146
+ def _call_api(self, system: str, user: str) -> str:
147
+ """Make a single API call and return raw text response."""
148
+ client = self._get_client()
149
+ self.api_calls += 1
150
+
151
+ if self.provider == "anthropic":
152
+ response = client.messages.create(
153
+ model=self.model,
154
+ max_tokens=4096,
155
+ system=system,
156
+ messages=[{"role": "user", "content": user}],
157
+ )
158
+ return response.content[0].text
159
+
160
+ elif self.provider in ("openai", "deepseek"):
161
+ # GPT-5 and o3/o4 reasoning models take max_completion_tokens; other models take max_tokens
162
+ token_param = (
163
+ "max_completion_tokens" if "gpt-5" in self.model or "o3" in self.model or "o4" in self.model
164
+ else "max_tokens"
165
+ )
166
+ response = client.chat.completions.create(
167
+ model=self.model,
168
+ **{token_param: 4096},
169
+ messages=[
170
+ {"role": "system", "content": system},
171
+ {"role": "user", "content": user},
172
+ ],
173
+ )
174
+ return response.choices[0].message.content or ""
175
+
176
+ elif self.provider == "google":
177
+ response = client.models.generate_content(
178
+ model=self.model,
179
+ contents=f"{system}\n\n{user}",
180
+ )
181
+ return response.text or ""
182
+
183
+ raise ValueError(f"Unsupported provider: {self.provider}")
184
+
185
+ def evaluate_sync(
186
+ self,
187
+ task_description: str,
188
+ tool_call_log: list[dict[str, Any]],
189
+ designed_sequences: list[str],
190
+ algorithmic_metrics: dict[str, Any],
191
+ reference_pipeline: list[str] | None = None,
192
+ ) -> dict[str, dict[str, Any]]:
193
+ """Evaluate a design attempt synchronously.
194
+
195
+ Args:
196
+ task_description: Original task prompt.
197
+ tool_call_log: Agent's tool call sequence.
198
+ designed_sequences: Designed protein sequences.
199
+ algorithmic_metrics: Computed biophysical metrics.
200
+ reference_pipeline: Expected expert pipeline.
201
+
202
+ Returns:
203
+ Dict mapping dimension names to {reasoning, score}.
204
+ """
205
+ if self.dry_run:
206
+ return _midpoint_scores()
207
+
208
+ prompt = build_judge_prompt(
209
+ task_description=task_description,
210
+ tool_call_log=tool_call_log,
211
+ designed_sequences=designed_sequences,
212
+ algorithmic_metrics=algorithmic_metrics,
213
+ reference_pipeline=reference_pipeline,
214
+ )
215
+
216
+ raw_response = self._call_api(JUDGE_SYSTEM_PROMPT, prompt)
217
+ return parse_judge_response(raw_response)
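The parsing path in `parse_judge_response` (fenced JSON first, then clamping, then midpoint fallback) can be illustrated with a reduced sketch. Everything below is hypothetical scaffolding: `DEMO_MAX_SCORES` stands in for `JUDGE_DIMENSIONS`, `demo_parse` is not part of the package, and the brace-matching retry from the original is omitted for brevity.

```python
import json
import re

# Hypothetical two-dimension rubric standing in for JUDGE_DIMENSIONS.
DEMO_MAX_SCORES = {"approach_strategy": 10, "novelty_quality": 2}

# Built at runtime so this example contains no literal code fence.
FENCE = "`" * 3
FENCED_JSON = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def demo_parse(raw_text: str) -> dict:
    """Extract per-dimension scores, clamping and falling back to midpoints."""
    match = FENCED_JSON.search(raw_text)
    candidate = match.group(1) if match else raw_text
    try:
        data = json.loads(candidate)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}  # e.g. the judge returned a bare list or string

    scores = {}
    for dim, max_score in DEMO_MAX_SCORES.items():
        entry = data.get(dim)
        score = entry.get("score") if isinstance(entry, dict) else None
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(score, (int, float)) and not isinstance(score, bool):
            scores[dim] = max(0, min(score, max_score))  # clamp to rubric range
        else:
            scores[dim] = max_score // 2  # midpoint fallback
    return scores


reply = (
    f"Here is my evaluation:\n{FENCE}json\n"
    '{"approach_strategy": {"reasoning": "ok", "score": 14}}\n'
    f"{FENCE}"
)
print(demo_parse(reply))
print(demo_parse("not json at all"))
```

In this sketch an out-of-range score of 14 is clamped to 10, and a missing or unparseable dimension falls back to the integer midpoint, which mirrors the behavior described in the docstring above.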
llm_judge/panel.py ADDED
@@ -0,0 +1,162 @@
1
+ """LLM Judge Panel: manages cross-model evaluation with self-exclusion.
2
+
3
+ Following PoLL (Verga et al., 2024): 3 judges from different model families,
4
+ excluding the generating model. Human baselines and agents from an unrecognized family get all 4 judges.
5
+ """
6
+
7
+ from __future__ import annotations
8
+
9
+ from typing import Any
10
+
11
+ from llm_judge.aggregation import aggregate_judge_scores
12
+ from llm_judge.judge import LLMJudge
13
+
14
+
15
+ # ---------------------------------------------------------------------------
16
+ # Available judge models (one per family)
17
+ # ---------------------------------------------------------------------------
18
+
19
+ JUDGE_MODELS: list[dict[str, str]] = [
20
+ {
21
+ "family": "anthropic",
22
+ "provider": "anthropic",
23
+ "model": "claude-sonnet-4-20250514",
24
+ },
25
+ {
26
+ "family": "openai",
27
+ "provider": "openai",
28
+ "model": "gpt-5.2",
29
+ },
30
+ {
31
+ "family": "google",
32
+ "provider": "google",
33
+ "model": "gemini-2.5-pro",
34
+ },
35
+ {
36
+ "family": "deepseek",
37
+ "provider": "deepseek",
38
+ "model": "deepseek-chat",
39
+ },
40
+ ]
41
+
42
+
43
+ # ---------------------------------------------------------------------------
44
+ # Agent ID → model family mapping
45
+ # ---------------------------------------------------------------------------
46
+
47
+ _AGENT_FAMILY_PREFIXES: dict[str, str] = {
48
+ "claude": "anthropic",
49
+ "gpt": "openai",
50
+ "gemini": "google",
51
+ "deepseek": "deepseek",
52
+ "human": "human",
53
+ }
54
+
55
+
56
+ def detect_agent_family(agent_id: str) -> str:
57
+ """Map an agent ID to its model family.
58
+
59
+ Args:
60
+ agent_id: Agent identifier (e.g., 'claude-code', 'gpt5-tools-benchmark').
61
+
62
+ Returns:
63
+ Family string: 'anthropic', 'openai', 'google', 'deepseek', 'human',
64
+ or 'unknown'.
65
+ """
66
+ agent_lower = agent_id.lower()
67
+ for prefix, family in _AGENT_FAMILY_PREFIXES.items():
68
+ if agent_lower.startswith(prefix):
69
+ return family
70
+ return "unknown"
71
+
72
+
73
+ def get_judge_models(agent_model_family: str) -> list[dict[str, str]]:
74
+ """Select judge models for a given agent, excluding self.
75
+
76
+ Args:
77
+ agent_model_family: Family of the agent being evaluated
78
+ ('anthropic', 'openai', 'google', 'deepseek', 'human', 'unknown').
79
+
80
+ Returns:
81
+ List of judge model dicts (3 for agents from a known family, 4 for human baselines or 'unknown').
82
+ """
83
+ if agent_model_family == "human":
84
+ return list(JUDGE_MODELS) # All 4 judges
85
+
86
+ return [j for j in JUDGE_MODELS if j["family"] != agent_model_family]
87
+
88
+
89
+ class LLMJudgePanel:
90
+ """Cross-model judge panel for protein design evaluation.
91
+
92
+ Manages the judges outside the agent's own model family (normally 3) and
93
+ aggregates their scores.
94
+
95
+ Args:
96
+ agent_model_family: Model family to exclude ('anthropic', etc).
97
+ dry_run: If True, all judges return deterministic midpoint scores.
98
+ """
99
+
100
+ def __init__(
101
+ self,
102
+ agent_model_family: str,
103
+ dry_run: bool = False,
104
+ ):
105
+ self.agent_model_family = agent_model_family
106
+ self.dry_run = dry_run
107
+ self.judge_configs = get_judge_models(agent_model_family)
108
+ self.judges = [
109
+ LLMJudge(
110
+ provider=cfg["provider"],
111
+ model=cfg["model"],
112
+ dry_run=dry_run,
113
+ )
114
+ for cfg in self.judge_configs
115
+ ]
116
+
117
+ def evaluate_sync(
118
+ self,
119
+ task_description: str,
120
+ tool_call_log: list[dict[str, Any]],
121
+ designed_sequences: list[str],
122
+ algorithmic_metrics: dict[str, Any],
123
+ reference_pipeline: list[str] | None = None,
124
+ ) -> dict[str, Any]:
125
+ """Evaluate a design with all judges and aggregate.
126
+
127
+ Args:
128
+ task_description: Original task prompt.
129
+ tool_call_log: Agent's tool call sequence.
130
+ designed_sequences: Designed protein sequences.
131
+ algorithmic_metrics: Computed biophysical metrics.
132
+ reference_pipeline: Expected expert pipeline.
133
+
134
+ Returns:
135
+ Dict with aggregated scores, judge count, and individual results.
136
+ """
137
+ individual_results = []
138
+
139
+ for judge in self.judges:
140
+ result = judge.evaluate_sync(
141
+ task_description=task_description,
142
+ tool_call_log=tool_call_log,
143
+ designed_sequences=designed_sequences,
144
+ algorithmic_metrics=algorithmic_metrics,
145
+ reference_pipeline=reference_pipeline,
146
+ )
147
+ individual_results.append(result)
148
+
149
+ aggregated = aggregate_judge_scores(individual_results)
150
+
151
+ return {
152
+ **aggregated,
153
+ "judge_count": len(self.judges),
154
+ "individual_judges": [
155
+ {
156
+ "model": cfg["model"],
157
+ "family": cfg["family"],
158
+ "scores": result,
159
+ }
160
+ for cfg, result in zip(self.judge_configs, individual_results)
161
+ ],
162
+ }
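The self-exclusion rule in `detect_agent_family` and `get_judge_models` can be traced with a small standalone sketch. The names `DEMO_FAMILIES`, `DEMO_PREFIXES`, `demo_family`, and `demo_panel` are hypothetical stand-ins, and the agent IDs `human-expert` and `llama-agent` are made up for illustration.

```python
# Hypothetical stand-ins for JUDGE_MODELS and _AGENT_FAMILY_PREFIXES.
DEMO_FAMILIES = ["anthropic", "openai", "google", "deepseek"]
DEMO_PREFIXES = {
    "claude": "anthropic",
    "gpt": "openai",
    "gemini": "google",
    "deepseek": "deepseek",
    "human": "human",
}


def demo_family(agent_id: str) -> str:
    """Map an agent ID to a family by case-insensitive prefix match."""
    lowered = agent_id.lower()
    for prefix, family in DEMO_PREFIXES.items():
        if lowered.startswith(prefix):
            return family
    return "unknown"


def demo_panel(agent_id: str) -> list:
    """Return the judge families left after excluding the agent's own."""
    family = demo_family(agent_id)
    return [f for f in DEMO_FAMILIES if f != family]


for agent in ("claude-code", "human-expert", "llama-agent"):
    print(agent, demo_panel(agent))
```

Note that an ID with no recognized prefix maps to `unknown`, which matches no judge family, so such an agent is scored by all four judges just like a human baseline.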
llm_judge/plan_eval.py ADDED
@@ -0,0 +1,141 @@
1
+ """LLM-based plan evaluation: judge whether agent's reasoning trace
2
+ demonstrates understanding of each pipeline step.
3
+
4
+ Replaces keyword matching with LLM assessment of 4 pipeline steps:
5
+ backbone_generation, sequence_design, structure_prediction, scoring_validation
6
+
7
+ Each step is scored 0 or 1 per judge, then aggregated across the 3-4 judges by majority vote.
8
+ """
9
+ from __future__ import annotations
10
+
11
+ import json
12
+ import re
13
+ from typing import Any
14
+
15
+ from llm_judge.judge import LLMJudge
16
+
17
+ PLAN_EVAL_SYSTEM = """You are an expert protein design evaluator. Your task is to assess whether an AI agent's reasoning trace demonstrates awareness and planning of specific protein design pipeline steps.
18
+
19
+ You have deep knowledge of:
20
+ - RFdiffusion for backbone generation
21
+ - ProteinMPNN for inverse folding / sequence design
22
+ - AlphaFold2, ESMFold, Boltz for structure prediction
23
+ - Rosetta for energy scoring and validation
24
+
25
+ Be strict: the agent must show genuine understanding or intent to use a step, not just mention a keyword in passing. Look for evidence that the agent planned to perform the step as part of its design strategy."""
26
+
27
+ PLAN_EVAL_PROMPT_TEMPLATE = """## Task
28
+ {task_description}
29
+
30
+ ## Agent's Reasoning Trace
31
+ {reasoning_trace}
32
+
33
+ ## Pipeline Steps to Evaluate
34
+
35
+ For each step below, determine whether the agent's reasoning trace shows that the agent **planned or intended** to perform this step. Score 1 if the agent demonstrates clear awareness and intent, 0 if not.
36
+
37
+ 1. **backbone_generation**: Did the agent plan to generate a de novo protein backbone/scaffold? (e.g., using RFdiffusion, backbone diffusion, scaffold generation, de novo structure design)
38
+
39
+ 2. **sequence_design**: Did the agent plan to design/optimize amino acid sequences for the structure? (e.g., using ProteinMPNN, inverse folding, sequence optimization, fixed-backbone design)
40
+
41
+ 3. **structure_prediction**: Did the agent plan to predict/validate the 3D structure of designed sequences? (e.g., using AlphaFold2, ESMFold, Boltz, checking pLDDT/pTM, fold confidence)
42
+
43
+ 4. **scoring_validation**: Did the agent plan to score the design's energy/stability? (e.g., using Rosetta, energy minimization, interface analysis, ddG calculation, binding energy)
44
+
45
+ ## Response Format
46
+ Return a JSON object with exactly this structure:
47
+ ```json
48
+ {{
49
+ "backbone_generation": {{"planned": 0 or 1, "evidence": "brief quote or reason"}},
50
+ "sequence_design": {{"planned": 0 or 1, "evidence": "brief quote or reason"}},
51
+ "structure_prediction": {{"planned": 0 or 1, "evidence": "brief quote or reason"}},
52
+ "scoring_validation": {{"planned": 0 or 1, "evidence": "brief quote or reason"}}
53
+ }}
54
+ ```
55
+ """
56
+
57
+ STEPS = ["backbone_generation", "sequence_design", "structure_prediction", "scoring_validation"]
58
+
59
+
60
+ def parse_plan_response(raw_text: str) -> dict[str, int]:
61
+ """Parse LLM response into per-step binary scores."""
62
+ # Try JSON extraction
63
+ json_match = re.search(r"```(?:json)?\s*\n?(.*?)\n?\s*```", raw_text, re.DOTALL)
64
+ json_str = json_match.group(1) if json_match else raw_text
65
+
66
+ try:
67
+ data = json.loads(json_str)
68
+ except json.JSONDecodeError:
69
+ brace_match = re.search(r"\{.*\}", raw_text, re.DOTALL)
70
+ if brace_match:
71
+ try:
72
+ data = json.loads(brace_match.group())
73
+ except json.JSONDecodeError:
74
+ return {s: 0 for s in STEPS}
75
+ else:
76
+ return {s: 0 for s in STEPS}
77
+
78
+ result = {}
79
+ for step in STEPS:
80
+ if step in data and isinstance(data[step], dict):
81
+ val = data[step].get("planned", 0)
82
+ result[step] = 1 if val == 1 or val is True else 0
83
+ else:
84
+ result[step] = 0
85
+ return result
86
+
87
+
88
+ def evaluate_plan_single(
89
+ judge: LLMJudge,
90
+ task_description: str,
91
+ reasoning_trace: str,
92
+ ) -> dict[str, int]:
93
+ """Evaluate plan with a single judge."""
94
+ if not reasoning_trace or len(reasoning_trace.strip()) < 10:
95
+ return {s: 0 for s in STEPS}
96
+
97
+ if judge.dry_run:
98
+ return {s: 0 for s in STEPS}
99
+
100
+ # Cap trace length
101
+ trace = reasoning_trace[:4000]
102
+ prompt = PLAN_EVAL_PROMPT_TEMPLATE.format(
103
+ task_description=task_description[:1000],
104
+ reasoning_trace=trace,
105
+ )
106
+
107
+ raw = judge._call_api(PLAN_EVAL_SYSTEM, prompt)
108
+ return parse_plan_response(raw)
109
+
110
+
111
+ def evaluate_plan_panel(
112
+ judges: list[LLMJudge],
113
+ task_description: str,
114
+ reasoning_trace: str,
115
+ ) -> dict[str, dict[str, Any]]:
116
+ """Evaluate plan with multiple judges, aggregate via majority vote.
117
+
118
+ Returns dict mapping step → {planned: 0/1, votes: [per-judge], n_judges: int}.
119
+ """
120
+ if not reasoning_trace or len(reasoning_trace.strip()) < 10:
121
+ return {
122
+ s: {"planned": 0, "votes": [0] * len(judges), "n_judges": len(judges)}
123
+ for s in STEPS
124
+ }
125
+
126
+ all_votes: dict[str, list[int]] = {s: [] for s in STEPS}
127
+ for judge in judges:
128
+ result = evaluate_plan_single(judge, task_description, reasoning_trace)
129
+ for step in STEPS:
130
+ all_votes[step].append(result.get(step, 0))
131
+
132
+ aggregated = {}
133
+ for step in STEPS:
134
+ votes = all_votes[step]
135
+ planned = 1 if sum(votes) > len(votes) / 2 else 0 # strict majority; a tie counts as not planned
136
+ aggregated[step] = {
137
+ "planned": planned,
138
+ "votes": votes,
139
+ "n_judges": len(judges),
140
+ }
141
+ return aggregated
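The vote aggregation in `evaluate_plan_panel` is a strict majority, which matters when four judges are used (human baselines) and the vote can tie. A minimal sketch, where `demo_majority` is a hypothetical helper that reuses the same comparison:

```python
def demo_majority(votes: list) -> int:
    """Return 1 only when strictly more than half of the votes are 1."""
    return 1 if sum(votes) > len(votes) / 2 else 0


print(demo_majority([1, 1, 0]))     # three judges, two in favor
print(demo_majority([1, 1, 0, 0]))  # four judges, tied
```

With three judges a 2-1 split counts as planned, while a 2-2 split among four judges resolves to not planned.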
llm_judge/rubrics.py ADDED
@@ -0,0 +1,173 @@
1
+ """Structured rubric prompts for LLM judge evaluation.
2
+
3
+ Each judge evaluates 5 dimensions with explicit score-level descriptors
4
+ following the Prometheus (ICLR 2024) rubric-based approach.
5
+ """
6
+
7
+ from __future__ import annotations
8
+
9
+ import json
10
+ from typing import Any
11
+
12
+ # ---------------------------------------------------------------------------
13
+ # Judge dimensions with max scores matching the LLM portion of the split
14
+ # ---------------------------------------------------------------------------
15
+
16
+ JUDGE_DIMENSIONS: dict[str, dict[str, Any]] = {
17
+ "approach_strategy": {
18
+ "max_score": 10,
19
+ "description": "Strategic quality of tool/methodology selection",
20
+ },
21
+ "orchestration_reasoning": {
22
+ "max_score": 8,
23
+ "description": "Pipeline logic, error handling, and adaptive reasoning",
24
+ },
25
+ "bio_feasibility": {
26
+ "max_score": 5,
27
+ "description": "Biological plausibility beyond sequence-level checks",
28
+ },
29
+ "novelty_quality": {
30
+ "max_score": 2,
31
+ "description": "Meaningful innovation vs accidental variation",
32
+ },
33
+ "diversity_quality": {
34
+ "max_score": 3,
35
+ "description": "Functional diversity of design strategies",
36
+ },
37
+ }
38
+
39
+
40
+ JUDGE_SYSTEM_PROMPT = (
41
+ "You are an expert protein design evaluator with deep knowledge of "
42
+ "computational protein engineering, including backbone generation "
43
+ "(RFdiffusion, Chroma), sequence design (ProteinMPNN, LigandMPNN), "
44
+ "structure prediction (AlphaFold2, ESMFold, Boltz), and interface "
45
+ "analysis (PyRosetta, FoldX). You evaluate AI agent protein design "
46
+ "attempts against a structured rubric. Score each dimension "
47
+ "independently. Provide reasoning BEFORE your score. Be critical "
48
+ "but fair — a score of 5/10 means average, not bad."
49
+ )
50
+
51
+
52
+ _RUBRIC_TEXT = """\
53
+ ### Approach Strategy (0-10 pts)
54
+ - 9-10: Selects optimal tools for this specific target; demonstrates deep \
55
+ understanding of design strategy (e.g., chooses RFdiffusion hotspot \
56
+ conditioning for epitope-specific binder, not generic backbone generation)
57
+ - 7-8: Appropriate tool selection with minor suboptimalities
58
+ - 5-6: Reasonable tools but misses key steps or uses generic strategy
59
+ - 3-4: Partially appropriate; missing critical tools for this task type
60
+ - 0-2: Inappropriate or random tool selection
61
+
62
+ ### Orchestration Reasoning (0-8 pts)
63
+ - 7-8: Logical pipeline with error handling, iterative refinement based on \
64
+ intermediate results, clear adaptive reasoning
65
+ - 5-6: Correct ordering with some validation but limited adaptation
66
+ - 3-4: Basic pipeline but missing intermediate checks or illogical ordering
67
+ - 0-2: No clear pipeline logic; tools called without reasoning
68
+
69
+ ### Biological Feasibility (0-5 pts)
70
+ - 4-5: Designs are biologically plausible — CDR loops appropriate for \
71
+ target, active site geometry consistent, no obvious steric clashes
72
+ - 2-3: Generally plausible with minor concerns
73
+ - 0-1: Biologically implausible designs (e.g., all-alanine core, \
74
+ impossible disulfide patterns)
75
+
76
+ ### Novelty Quality (0-2 pts)
77
+ - 2: Novel design represents meaningful innovation (new fold, creative \
78
+ binding mode) not just random mutations
79
+ - 1: Some novelty but appears accidental rather than designed
80
+ - 0: No meaningful novelty; trivially similar to reference or random
81
+
82
+ ### Diversity Quality (0-3 pts)
83
+ - 3: Multiple designs explore different binding modes/conformations/\
84
+ strategies — functionally diverse, not just sequence variants
85
+ - 1-2: Some diversity but designs are minor variants of each other
86
+ - 0: No meaningful diversity; essentially one design repeated
87
+ """
88
+
89
+
90
+ def build_judge_prompt(
91
+ task_description: str,
92
+ tool_call_log: list[dict[str, Any]],
93
+ designed_sequences: list[str],
94
+ algorithmic_metrics: dict[str, Any],
95
+ reference_pipeline: list[str] | None = None,
96
+ ) -> str:
97
+ """Build the user prompt for LLM judge evaluation.
98
+
99
+ Args:
100
+ task_description: The original design task prompt.
101
+ tool_call_log: Sequence of tool calls with args.
102
+ designed_sequences: FASTA-format designed sequences.
103
+ algorithmic_metrics: Computed metrics (pLDDT, ipTM, etc).
104
+ reference_pipeline: Expected expert pipeline for this task type.
105
+
106
+ Returns:
107
+ Formatted prompt string for the judge LLM.
108
+ """
109
+ sections = []
110
+
111
+ # Task description
112
+ sections.append(f"## Task Description\n{task_description}")
113
+
114
+ # Reference pipeline (for approach/orchestration context)
115
+ if reference_pipeline:
116
+ pipeline_str = " → ".join(reference_pipeline)
117
+ sections.append(
118
+ f"## Reference Pipeline (Expert-Validated)\n{pipeline_str}"
119
+ )
120
+
121
+ # Tool call log
122
+ if tool_call_log:
123
+ log_lines = []
124
+ for i, entry in enumerate(tool_call_log, 1):
125
+ tool = entry.get("tool", "unknown")
126
+ args = entry.get("args_summary", {})
127
+ args_str = json.dumps(args, default=str) if args else "{}"
128
+ log_lines.append(f"{i}. {tool}({args_str})")
129
+ sections.append(
130
+ "## Agent's Tool Call Log\n" + "\n".join(log_lines)
131
+ )
132
+ else:
133
+ sections.append("## Agent's Tool Call Log\nNo tool calls recorded.")
134
+
135
+ # Designed sequences
136
+ if designed_sequences:
137
+ seq_lines = []
138
+ for i, seq in enumerate(designed_sequences[:10], 1): # Cap at 10
139
+ display = seq[:80] + "..." if len(seq) > 80 else seq
140
+ seq_lines.append(f">design_{i} (len={len(seq)})\n{display}")
141
+ sections.append(
142
+ f"## Designed Sequences ({len(designed_sequences)} total)\n"
143
+ + "\n".join(seq_lines)
144
+ )
145
+ else:
146
+ sections.append("## Designed Sequences\nNo sequences produced.")
147
+
148
+ # Algorithmic metrics (read-only context)
149
+ if algorithmic_metrics:
150
+ metrics_str = json.dumps(algorithmic_metrics, indent=2, default=str)
151
+ sections.append(
152
+ f"## Algorithmic Metrics (Read-Only Context)\n```json\n{metrics_str}\n```"
153
+ )
154
+
155
+ # Scoring rubric
156
+ sections.append(f"## Scoring Rubric\n{_RUBRIC_TEXT}")
157
+
158
+ # Output format instruction
159
+ output_format = {
160
+ dim: {"reasoning": "...", "score": f"0-{info['max_score']}"}
161
+ for dim, info in JUDGE_DIMENSIONS.items()
162
+ }
163
+ sections.append(
164
+ "## Required Output Format\n"
165
+ "Evaluate each dimension. For each:\n"
166
+ "1. Cite specific evidence from the agent's work\n"
167
+ "2. Reason about quality relative to the rubric\n"
168
+ "3. Assign a score\n\n"
169
+ "Respond in JSON format:\n"
170
+ f"```json\n{json.dumps(output_format, indent=2)}\n```"
171
+ )
172
+
173
+ return "\n\n".join(sections)
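The five maxima in `JUDGE_DIMENSIONS` add up to 28, which appears to correspond to the LLM share of the "hybrid 72+28" split named in the commit message. The sketch below checks that arithmetic. It assumes the LLM portion is a plain sum of clamped per-dimension scores, which is only a guess because the aggregation module is not shown here. `DEMO_MAX_SCORES` and `demo_llm_subtotal` are hypothetical names, and treating a missing dimension as 0 is a choice made for this sketch only.

```python
# Maxima copied from JUDGE_DIMENSIONS above.
DEMO_MAX_SCORES = {
    "approach_strategy": 10,
    "orchestration_reasoning": 8,
    "bio_feasibility": 5,
    "novelty_quality": 2,
    "diversity_quality": 3,
}


def demo_llm_subtotal(scores: dict) -> float:
    """Sum per-dimension scores after clamping each to its rubric range.

    Assumption: a plain sum; the real aggregation may differ.
    """
    return sum(
        max(0, min(scores.get(dim, 0), cap))
        for dim, cap in DEMO_MAX_SCORES.items()
    )


print(sum(DEMO_MAX_SCORES.values()))

# Integer midpoints, as used by the fallback and dry-run paths.
midpoints = {dim: cap // 2 for dim, cap in DEMO_MAX_SCORES.items()}
print(demo_llm_subtotal(midpoints))
```

Under that assumption, a dry run or a fully unparseable judge reply would contribute 13 of the 28 available points, since the integer midpoints are 5, 4, 2, 1, and 1.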
requirements.txt CHANGED
@@ -4,3 +4,8 @@ plotly
4
  httpx>=0.25
5
  huggingface_hub>=0.20
6
  datasets>=2.16
7
+
8
+ # LLM judge panel (Phase A)
9
+ anthropic>=0.75
10
+ openai>=1.40
11
+ google-genai>=0.3