Phase A: integrate LLM judge panel for hybrid scoring
- Port biodesignbench/eval/llm_judge package as leaderboard/llm_judge
- New eval_judge.py orchestrates per-task panel runs with self-exclusion
- aggregate_scores: prefer hybrid_total/hybrid_scores when present
- Admin pipeline: insert 'Phase C: Run LLM Judge' button between Boltz and Finalize
- requirements: add anthropic, openai, google-genai
- Boltz handler now persists boltz-augmented per-task results
- Sync README.md taxonomy stats (9 cells, 2x5 matrix, hybrid 72+28)
- README.md +19 -78
- app.py +63 -3
- eval_judge.py +148 -0
- eval_scorer.py +21 -14
- llm_judge/__init__.py +50 -0
- llm_judge/aggregation.py +200 -0
- llm_judge/judge.py +217 -0
- llm_judge/panel.py +162 -0
- llm_judge/plan_eval.py +141 -0
- llm_judge/rubrics.py +173 -0
- requirements.txt +5 -0
README.md
CHANGED
|
@@ -12,88 +12,29 @@ license: mit
|
|
| 12 |
|
| 13 |
# BioDesignBench Leaderboard
|
| 14 |
|
| 15 |
-
|
| 16 |
|
| 17 |
**Romero Lab, Duke University**
|
| 18 |
|
| 19 |
-
##
|
| 20 |
|
| 21 |
-
|
| 22 |
-
-
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
|
|
|
| 27 |
|
| 28 |
-
|
|
|
|
|
|
|
| 29 |
|
| 30 |
-
|
| 31 |
-
pip install -r requirements.txt
|
| 32 |
-
python app.py
|
| 33 |
-
```
|
| 34 |
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
1. Create a new Space on HuggingFace (`sdk: gradio`).
|
| 42 |
-
2. Push the contents of this directory to the Space repo.
|
| 43 |
-
3. Set the `BDB_ADMIN_PASSWORD` secret in the Space settings for admin panel access.
|
| 44 |
-
4. Optionally set `HF_TOKEN` for submission queue access (private dataset).
|
| 45 |
-
|
| 46 |
-
The Space will automatically build and serve the leaderboard.
|
| 47 |
-
|
| 48 |
-
## How to update results
|
| 49 |
-
|
| 50 |
-
Add new entries to `leaderboard_data.json` following the existing schema:
|
| 51 |
-
|
| 52 |
-
```json
|
| 53 |
-
{
|
| 54 |
-
"agent_name": "Your Agent",
|
| 55 |
-
"agent_id": "your-agent-user",
|
| 56 |
-
"mode": "user",
|
| 57 |
-
"mcp_custom": false,
|
| 58 |
-
"submission_type": "llm",
|
| 59 |
-
"organization": "Your Org",
|
| 60 |
-
"overall_score": 42.0,
|
| 61 |
-
"component_scores": {
|
| 62 |
-
"approach": 10.0,
|
| 63 |
-
"orchestration": 8.0,
|
| 64 |
-
"quality": 14.0,
|
| 65 |
-
"feasibility": 6.0,
|
| 66 |
-
"novelty": 2.0,
|
| 67 |
-
"diversity": 2.0
|
| 68 |
-
},
|
| 69 |
-
"taxonomy_scores": {
|
| 70 |
-
"de_novo_binder": {"ab": 45, "enz": 40, "sig": 43},
|
| 71 |
-
"sequence_optimization": {"ab": 50, "enz": 42, "sig": 38, "str": 44, "flu": 52},
|
| 72 |
-
"de_novo_backbone": {"str": 28},
|
| 73 |
-
"complex_engineering": {"enz": 40, "sig": 44, "str": 46},
|
| 74 |
-
"conformational_design": {"enz": 38, "sig": 42, "str": 40, "flu": 44}
|
| 75 |
-
},
|
| 76 |
-
"tasks_completed": 76,
|
| 77 |
-
"tasks_total": 76,
|
| 78 |
-
"tasks_with_zero": 4,
|
| 79 |
-
"avg_latency_sec": 50.0,
|
| 80 |
-
"submission_date": "2026-03-15"
|
| 81 |
-
}
|
| 82 |
-
```
|
| 83 |
-
|
| 84 |
-
Update the `last_updated` field at the top of the JSON file after adding entries.
|
| 85 |
-
|
| 86 |
-
## File overview
|
| 87 |
-
|
| 88 |
-
| File | Description |
|
| 89 |
-
|------|-------------|
|
| 90 |
-
| `app.py` | Main Gradio application with 7 tabs |
|
| 91 |
-
| `leaderboard_data.json` | Current benchmark results |
|
| 92 |
-
| `mcp_tool_schemas.json` | 17 reference MCP tool schemas |
|
| 93 |
-
| `eval_scorer.py` | Self-contained 100-point scoring rubric |
|
| 94 |
-
| `eval_queue.py` | Submission queue (HuggingFace Datasets) |
|
| 95 |
-
| `eval_dispatcher.py` | HTTP task dispatcher for benchmarking |
|
| 96 |
-
| `eval_boltz.py` | Boltz structure prediction post-eval |
|
| 97 |
-
| `eval_tasks.py` | Hidden task loader from HF Dataset |
|
| 98 |
-
| `example_server.py` | Reference FastAPI server for submitters |
|
| 99 |
-
| `requirements.txt` | Python dependencies |
|
|
|
|
| 12 |
|
| 13 |
# BioDesignBench Leaderboard
|
| 14 |
|
| 15 |
+
Evaluating LLM Agents on Protein Design via MCP Tools.
|
| 16 |
|
| 17 |
**Romero Lab, Duke University**
|
| 18 |
|
| 19 |
+
## Overview
|
| 20 |
|
| 21 |
+
BioDesignBench evaluates LLM agents as orchestrators of multi-step *stochastic*
|
| 22 |
+
protein-design pipelines. This leaderboard tracks agent performance across
|
| 23 |
+
**76 design tasks** spanning a **2 × 5 design matrix** (de novo design vs
|
| 24 |
+
redesign × five molecular families: antibody, binder, enzyme, scaffold,
|
| 25 |
+
fluorescent protein, **9 occupied cells**), scored on a 100-point hybrid rubric:
|
| 26 |
+
**72 algorithmic points** (Boltz-2 verification + sequence/feasibility metrics)
|
| 27 |
+
plus **28 LLM-judge points** (3-judge panel with self-exclusion).
|
| 28 |
|
| 29 |
+
The six rubric components are Approach, Orchestration, Quality, Feasibility,
|
| 30 |
+
Novelty, and Diversity. See the *About* tab for the full methodology and the
|
| 31 |
+
*Depth Gap* tab for evaluation-depth interventions.
|
| 32 |
|
| 33 |
+
## Features
|
| 34 |
|
| 35 |
+
- **Overall Leaderboard** — Mixed-ranking table with human baselines and LLM agents
|
| 36 |
+
- **Taxonomy Heatmap** — Per-cell scores across the 9 occupied cells of the 2 × 5 design matrix
|
| 37 |
+
- **Component Analysis** — Radar and bar charts comparing the 6 scoring components
|
| 38 |
+
- **Guidance Effect** — Paired comparison of the same LLM in unguided (atomic tools) vs guided (composite workflows) mode
|
| 39 |
+
- **Depth Gap** — Forced-depth and low-diversity intervention results
|
| 40 |
+
- **About** — Methodology, submission guide, and citation info
|
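The 72 + 28 split quoted in the README overview can be checked against the per-component table this commit adds in `llm_judge/aggregation.py`. A minimal sketch, with the `WEIGHT_SPLIT` values copied from that file; the helper name `split_totals` is illustrative and does not exist in the repository.

```python
# Values copied from WEIGHT_SPLIT in llm_judge/aggregation.py (this commit).
WEIGHT_SPLIT = {
    "approach": {"algo": 10, "llm": 10},
    "orchestration": {"algo": 7, "llm": 8},
    "quality": {"algo": 35, "llm": 0},
    "feasibility": {"algo": 10, "llm": 5},
    "novelty": {"algo": 3, "llm": 2},
    "diversity": {"algo": 7, "llm": 3},
}


def split_totals(split):
    """Return (algo_points, llm_points) summed over all rubric components.

    Hypothetical helper, used only for this sanity check.
    """
    algo = sum(part["algo"] for part in split.values())
    llm = sum(part["llm"] for part in split.values())
    return algo, llm


algo_points, llm_points = split_totals(WEIGHT_SPLIT)
print(algo_points, llm_points, algo_points + llm_points)
```

Quality is the only component with no LLM share, which matches the package docstring's statement that quality metrics remain fully algorithmic.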
app.py
CHANGED
|
@@ -1523,14 +1523,21 @@ def create_app() -> gr.Blocks:
|
|
| 1523 |
label="Submission ID", scale=2,
|
| 1524 |
)
|
| 1525 |
boltz_btn = gr.Button(
|
| 1526 |
-
"Phase B: Run Boltz", scale=1,
|
| 1527 |
)
|
| 1528 |
with gr.Row():
|
| 1529 |
final_id = gr.Textbox(
|
| 1530 |
label="Submission ID", scale=2,
|
| 1531 |
)
|
| 1532 |
final_btn = gr.Button(
|
| 1533 |
-
"Phase
|
| 1534 |
)
|
| 1535 |
pipeline_out = gr.HTML()
|
| 1536 |
|
|
@@ -1681,6 +1688,9 @@ def create_app() -> gr.Blocks:
|
|
| 1681 |
"No task results to process.</div>"
|
| 1682 |
)
|
| 1683 |
run_boltz_posteval(per_task)
|
| 1684 |
return (
|
| 1685 |
'<div style="color:#38a169">'
|
| 1686 |
"Boltz post-assessment complete.</div>"
|
|
@@ -1688,6 +1698,51 @@ def create_app() -> gr.Blocks:
|
|
| 1688 |
except Exception as e:
|
| 1689 |
return f'<div style="color:#e53e3e">{e}</div>'
|
| 1690 |
|
| 1691 |
def _run_finalize(sid):
|
| 1692 |
try:
|
| 1693 |
from eval_queue import (
|
|
@@ -1712,10 +1767,12 @@ def create_app() -> gr.Blocks:
|
|
| 1712 |
component_scores=agg["component_scores"],
|
| 1713 |
taxonomy_scores=agg["taxonomy_scores"],
|
| 1714 |
)
|
|
|
|
| 1715 |
return (
|
| 1716 |
f'<div style="color:#38a169">'
|
| 1717 |
f'Finalized! Score: '
|
| 1718 |
-
f'{agg["overall_score"]:.1f}
|
|
|
|
| 1719 |
)
|
| 1720 |
except Exception as e:
|
| 1721 |
return f'<div style="color:#e53e3e">{e}</div>'
|
|
@@ -1726,6 +1783,9 @@ def create_app() -> gr.Blocks:
|
|
| 1726 |
boltz_btn.click(
|
| 1727 |
_run_boltz, [boltz_id], pipeline_out,
|
| 1728 |
)
|
|
|
|
|
|
|
|
|
|
| 1729 |
final_btn.click(
|
| 1730 |
_run_finalize, [final_id], pipeline_out,
|
| 1731 |
)
|
|
|
|
| 1523 |
label="Submission ID", scale=2,
|
| 1524 |
)
|
| 1525 |
boltz_btn = gr.Button(
|
| 1526 |
+
"Phase B: Run Boltz (GPU)", scale=1,
|
| 1527 |
+
)
|
| 1528 |
+
with gr.Row():
|
| 1529 |
+
judge_id = gr.Textbox(
|
| 1530 |
+
label="Submission ID", scale=2,
|
| 1531 |
+
)
|
| 1532 |
+
judge_btn = gr.Button(
|
| 1533 |
+
"Phase C: Run LLM Judge", scale=1,
|
| 1534 |
)
|
| 1535 |
with gr.Row():
|
| 1536 |
final_id = gr.Textbox(
|
| 1537 |
label="Submission ID", scale=2,
|
| 1538 |
)
|
| 1539 |
final_btn = gr.Button(
|
| 1540 |
+
"Phase D: Finalize & Publish", scale=1,
|
| 1541 |
)
|
| 1542 |
pipeline_out = gr.HTML()
|
| 1543 |
|
|
|
|
| 1688 |
"No task results to process.</div>"
|
| 1689 |
)
|
| 1690 |
run_boltz_posteval(per_task)
|
| 1691 |
+
from eval_queue import save_task_result
|
| 1692 |
+
for tid, tres in per_task.items():
|
| 1693 |
+
save_task_result(sid.strip(), tid, tres)
|
| 1694 |
return (
|
| 1695 |
'<div style="color:#38a169">'
|
| 1696 |
"Boltz post-assessment complete.</div>"
|
|
|
|
| 1698 |
except Exception as e:
|
| 1699 |
return f'<div style="color:#e53e3e">{e}</div>'
|
| 1700 |
|
| 1701 |
+
def _run_judge(sid):
|
| 1702 |
+
try:
|
| 1703 |
+
import eval_judge as ej
|
| 1704 |
+
from eval_queue import (
|
| 1705 |
+
get_submission, save_task_result, update_status,
|
| 1706 |
+
)
|
| 1707 |
+
|
| 1708 |
+
sub = get_submission(sid.strip())
|
| 1709 |
+
if sub is None:
|
| 1710 |
+
return ('<div style="color:#e53e3e">'
|
| 1711 |
+
'Not found</div>')
|
| 1712 |
+
per_task = json.loads(
|
| 1713 |
+
sub.get("per_task_results", "{}")
|
| 1714 |
+
)
|
| 1715 |
+
if not per_task:
|
| 1716 |
+
return ('<div style="color:#e53e3e">'
|
| 1717 |
+
"No task results to process.</div>")
|
| 1718 |
+
|
| 1719 |
+
update_status(sid.strip(), "scoring")
|
| 1720 |
+
ej.run_judge_panel(
|
| 1721 |
+
per_task,
|
| 1722 |
+
agent_id=sub.get("agent_name", "unknown"),
|
| 1723 |
+
dry_run=False,
|
| 1724 |
+
)
|
| 1725 |
+
for tid, tres in per_task.items():
|
| 1726 |
+
save_task_result(sid.strip(), tid, tres)
|
| 1727 |
+
|
| 1728 |
+
n_done = sum(
|
| 1729 |
+
1 for r in per_task.values()
|
| 1730 |
+
if r.get("hybrid_total") is not None
|
| 1731 |
+
)
|
| 1732 |
+
return (
|
| 1733 |
+
f'<div style="color:#38a169">'
|
| 1734 |
+
f"LLM judge complete on {n_done} tasks."
|
| 1735 |
+
"</div>"
|
| 1736 |
+
)
|
| 1737 |
+
except Exception as e:
|
| 1738 |
+
import traceback
|
| 1739 |
+
return (
|
| 1740 |
+
f'<div style="color:#e53e3e">'
|
| 1741 |
+
f'<strong>Judge error:</strong> {e}<br>'
|
| 1742 |
+
f'<pre style="font-size:0.7rem">'
|
| 1743 |
+
f'{traceback.format_exc()[:600]}</pre></div>'
|
| 1744 |
+
)
|
| 1745 |
+
|
| 1746 |
def _run_finalize(sid):
|
| 1747 |
try:
|
| 1748 |
from eval_queue import (
|
|
|
|
| 1767 |
component_scores=agg["component_scores"],
|
| 1768 |
taxonomy_scores=agg["taxonomy_scores"],
|
| 1769 |
)
|
| 1770 |
+
mode_label = agg.get("scoring_mode", "algo")
|
| 1771 |
return (
|
| 1772 |
f'<div style="color:#38a169">'
|
| 1773 |
f'Finalized! Score: '
|
| 1774 |
+
f'{agg["overall_score"]:.1f} '
|
| 1775 |
+
f'(scoring={mode_label})</div>'
|
| 1776 |
)
|
| 1777 |
except Exception as e:
|
| 1778 |
return f'<div style="color:#e53e3e">{e}</div>'
|
|
|
|
| 1783 |
boltz_btn.click(
|
| 1784 |
_run_boltz, [boltz_id], pipeline_out,
|
| 1785 |
)
|
| 1786 |
+
judge_btn.click(
|
| 1787 |
+
_run_judge, [judge_id], pipeline_out,
|
| 1788 |
+
)
|
| 1789 |
final_btn.click(
|
| 1790 |
_run_finalize, [final_id], pipeline_out,
|
| 1791 |
)
|
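The `_run_judge` handler in this diff follows a load, judge, persist, report sequence. The sketch below replays that flow with in-memory stand-ins so it can run outside the Space. `run_judge_handler`, `store`, `saved`, and `fake_panel` are hypothetical names; the real handler imports `eval_queue` and `eval_judge` directly. One deliberate difference, labeled in the code: the error text is passed through `html.escape`, which the handler in the diff does not do.

```python
import html
import json


def run_judge_handler(sid, get_submission, save_task_result, run_panel):
    """Hypothetical, dependency-injected variant of the _run_judge handler."""
    try:
        sid = sid.strip()
        sub = get_submission(sid)
        if sub is None:
            return '<div style="color:#e53e3e">Not found</div>'
        per_task = json.loads(sub.get("per_task_results", "{}"))
        if not per_task:
            return '<div style="color:#e53e3e">No task results to process.</div>'

        # The panel mutates per_task in place, as run_judge_panel does.
        run_panel(per_task, agent_id=sub.get("agent_name", "unknown"))
        for tid, tres in per_task.items():
            save_task_result(sid, tid, tres)

        n_done = sum(
            1 for r in per_task.values() if r.get("hybrid_total") is not None
        )
        return (
            '<div style="color:#38a169">'
            f"LLM judge complete on {n_done} tasks.</div>"
        )
    except Exception as e:
        # Assumption: escaping is added in this sketch; the diff interpolates {e} as-is.
        return (
            '<div style="color:#e53e3e">'
            f"Judge error: {html.escape(str(e))}</div>"
        )


# In-memory stand-ins for eval_queue and eval_judge (illustrative only).
store = {
    "sub-1": {
        "agent_name": "demo-agent",
        "per_task_results": json.dumps(
            {"t1": {"success": True}, "t2": {"success": False}}
        ),
    }
}
saved = {}


def fake_panel(per_task, agent_id):
    """Pretend judge: marks successful tasks with a fixed hybrid total."""
    for result in per_task.values():
        if result.get("success"):
            result["hybrid_total"] = 50.0


message = run_judge_handler(
    " sub-1 ",
    store.get,
    lambda s, t, r: saved.__setitem__((s, t), r),
    fake_panel,
)
print(message)
```

Injecting the queue and panel functions is only a convenience for the sketch; it makes the counting and persistence steps testable without API keys or a HuggingFace dataset.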
eval_judge.py
ADDED
|
@@ -0,0 +1,148 @@
|
| 1 |
+
"""LLM Judge orchestration for the leaderboard backend.
|
| 2 |
+
|
| 3 |
+
Runs the cross-model judge panel on each successfully scored task and
|
| 4 |
+
merges the resulting LLM points into the algorithmic component scores
|
| 5 |
+
to produce hybrid totals (28 LLM points + 72 algorithmic points = 100).
|
| 6 |
+
|
| 7 |
+
The judge panel uses 3 judges from different model families with
|
| 8 |
+
self-exclusion (PoLL, Verga et al. 2024). Individual judge calls are
|
| 9 |
+
synchronous; we process tasks sequentially to keep the API spend
|
| 10 |
+
predictable. Provider keys are read from environment variables that
|
| 11 |
+
must be configured as HuggingFace Space secrets:
|
| 12 |
+
|
| 13 |
+
ANTHROPIC_API_KEY
|
| 14 |
+
OPENAI_API_KEY
|
| 15 |
+
GOOGLE_API_KEY
|
| 16 |
+
DEEPSEEK_API_KEY
|
| 17 |
+
"""
|
| 18 |
+
|
| 19 |
+
from __future__ import annotations
|
| 20 |
+
|
| 21 |
+
import logging
|
| 22 |
+
from typing import Any
|
| 23 |
+
|
| 24 |
+
from llm_judge import (
|
| 25 |
+
LLMJudgePanel,
|
| 26 |
+
detect_agent_family,
|
| 27 |
+
merge_algo_and_judge_scores,
|
| 28 |
+
split_algo_score,
|
| 29 |
+
)
|
| 30 |
+
|
| 31 |
+
logger = logging.getLogger(__name__)
|
| 32 |
+
|
| 33 |
+
|
| 34 |
+
def _build_algo_dict(task_result: dict[str, Any]) -> dict[str, float]:
|
| 35 |
+
"""Pull per-component algo scores from a task result.
|
| 36 |
+
|
| 37 |
+
Prefers 'cpu_scores' (post-Boltz) but falls back to 'final_scores'
|
| 38 |
+
when 'cpu_scores' is absent; returns all zeros if neither key exists.
|
| 39 |
+
"""
|
| 40 |
+
if "cpu_scores" in task_result:
|
| 41 |
+
return dict(task_result["cpu_scores"])
|
| 42 |
+
if "final_scores" in task_result:
|
| 43 |
+
return dict(task_result["final_scores"])
|
| 44 |
+
return {
|
| 45 |
+
"approach": 0,
|
| 46 |
+
"orchestration": 0,
|
| 47 |
+
"quality": 0,
|
| 48 |
+
"feasibility": 0,
|
| 49 |
+
"novelty": 0,
|
| 50 |
+
"diversity": 0,
|
| 51 |
+
}
|
| 52 |
+
|
| 53 |
+
|
| 54 |
+
def run_judge_panel(
|
| 55 |
+
per_task_results: dict[str, dict[str, Any]],
|
| 56 |
+
agent_id: str,
|
| 57 |
+
dry_run: bool = False,
|
| 58 |
+
progress_callback=None,
|
| 59 |
+
) -> dict[str, dict[str, Any]]:
|
| 60 |
+
"""Run the LLM judge panel over every successful task in a submission.
|
| 61 |
+
|
| 62 |
+
For each task with a non-empty design output:
|
| 63 |
+
1. Look up the original task prompt (used to give the panel context).
|
| 64 |
+
2. Build a 3-judge panel that excludes the agent's own model family.
|
| 65 |
+
3. Run all judges synchronously and aggregate.
|
| 66 |
+
4. Compute the hybrid component scores by:
|
| 67 |
+
- splitting each algo score into its algo-portion (split_algo_score)
|
| 68 |
+
- adding the matching judge LLM-portion (merge_algo_and_judge_scores)
|
| 69 |
+
5. Store both raw judge results and final hybrid scores on the task.
|
| 70 |
+
|
| 71 |
+
The function modifies per_task_results in place and also returns it.
|
| 72 |
+
|
| 73 |
+
Args:
|
| 74 |
+
per_task_results: Dict mapping task_id → task result (from the
|
| 75 |
+
dispatcher + boltz post-eval pipeline).
|
| 76 |
+
agent_id: Agent identifier for self-exclusion (e.g., 'gpt5-tools').
|
| 77 |
+
dry_run: If True, judges return midpoint scores without API calls.
|
| 78 |
+
progress_callback: Optional callable(task_id, i, total).
|
| 79 |
+
|
| 80 |
+
Returns:
|
| 81 |
+
The same dict, now augmented with 'judge_scores' and 'hybrid_scores'
|
| 82 |
+
per task and 'hybrid_total' on each successful entry.
|
| 83 |
+
"""
|
| 84 |
+
from eval_tasks import get_task
|
| 85 |
+
|
| 86 |
+
family = detect_agent_family(agent_id)
|
| 87 |
+
panel = LLMJudgePanel(agent_model_family=family, dry_run=dry_run)
|
| 88 |
+
logger.info(
|
| 89 |
+
f"LLM judge panel for agent '{agent_id}' (family={family}): "
|
| 90 |
+
f"{len(panel.judges)} judges, dry_run={dry_run}"
|
| 91 |
+
)
|
| 92 |
+
|
| 93 |
+
eligible = [
|
| 94 |
+
tid for tid, r in per_task_results.items()
|
| 95 |
+
if r.get("success") and r.get("sequences")
|
| 96 |
+
]
|
| 97 |
+
total = len(eligible)
|
| 98 |
+
|
| 99 |
+
for i, task_id in enumerate(eligible):
|
| 100 |
+
result = per_task_results[task_id]
|
| 101 |
+
|
| 102 |
+
# Pull task prompt for judge context. If the dataset is not
|
| 103 |
+
# reachable (e.g., dev mode without HF_TOKEN) we still run with
|
| 104 |
+
# a placeholder description rather than aborting the whole run.
|
| 105 |
+
task_data = get_task(task_id) or {}
|
| 106 |
+
task_description = task_data.get("prompt_md") or f"BioDesignBench task {task_id}"
|
| 107 |
+
|
| 108 |
+
algo_metrics = result.get("agent_metrics", {})
|
| 109 |
+
if "boltz_metrics" in result:
|
| 110 |
+
algo_metrics = {**algo_metrics, **result["boltz_metrics"]}
|
| 111 |
+
|
| 112 |
+
try:
|
| 113 |
+
judge_result = panel.evaluate_sync(
|
| 114 |
+
task_description=task_description,
|
| 115 |
+
tool_call_log=result.get("run_log", []),
|
| 116 |
+
designed_sequences=result.get("sequences", []),
|
| 117 |
+
algorithmic_metrics=algo_metrics,
|
| 118 |
+
)
|
| 119 |
+
except Exception as e:
|
| 120 |
+
logger.error(f"Judge panel failed on {task_id}: {e}")
|
| 121 |
+
judge_result = None
|
| 122 |
+
|
| 123 |
+
# Build algo-portion dict (split each component down to its algo max)
|
| 124 |
+
algo_full = _build_algo_dict(result)
|
| 125 |
+
rubric_max = {
|
| 126 |
+
"approach": 20, "orchestration": 15, "quality": 35,
|
| 127 |
+
"feasibility": 15, "novelty": 5, "diversity": 10,
|
| 128 |
+
}
|
| 129 |
+
algo_split = {
|
| 130 |
+
comp: split_algo_score(comp, score, rubric_max[comp])
|
| 131 |
+
for comp, score in algo_full.items()
|
| 132 |
+
}
|
| 133 |
+
|
| 134 |
+
hybrid = merge_algo_and_judge_scores(algo_split, judge_result)
|
| 135 |
+
hybrid_total = sum(hybrid.values())
|
| 136 |
+
|
| 137 |
+
result["judge_scores"] = judge_result
|
| 138 |
+
result["hybrid_scores"] = hybrid
|
| 139 |
+
result["hybrid_total"] = round(hybrid_total, 2)
|
| 140 |
+
|
| 141 |
+
if progress_callback:
|
| 142 |
+
progress_callback(task_id, i + 1, total)
|
| 143 |
+
|
| 144 |
+
logger.info(
|
| 145 |
+
f"[{i+1}/{total}] {task_id}: hybrid={hybrid_total:.1f}"
|
| 146 |
+
)
|
| 147 |
+
|
| 148 |
+
return per_task_results
|
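Step 4 of `run_judge_panel` (scale each algorithmic score to its algo share, then add the judge's points and cap at the rubric maximum) can be traced on one task. The sketch below condenses `split_algo_score` and `merge_algo_and_judge_scores` into a single hypothetical function, `hybrid_scores`. It assumes the aggregated judge scores are already expressed on the LLM-share scale (for example 0 to 10 for approach); the diff does not show `rubrics.py`, so that scale is an assumption. The input numbers are made up.

```python
# Shares copied from WEIGHT_SPLIT in llm_judge/aggregation.py.
WEIGHT_SPLIT = {
    "approach": {"algo": 10, "llm": 10},
    "orchestration": {"algo": 7, "llm": 8},
    "quality": {"algo": 35, "llm": 0},
    "feasibility": {"algo": 10, "llm": 5},
    "novelty": {"algo": 3, "llm": 2},
    "diversity": {"algo": 7, "llm": 3},
}
RUBRIC_MAX = {c: p["algo"] + p["llm"] for c, p in WEIGHT_SPLIT.items()}

# Component -> judge dimension, inverted from _JUDGE_DIM_TO_COMPONENT.
JUDGE_DIM = {
    "approach": "approach_strategy",
    "orchestration": "orchestration_reasoning",
    "feasibility": "bio_feasibility",
    "novelty": "novelty_quality",
    "diversity": "diversity_quality",
}


def hybrid_scores(algo_full, judge):
    """Hypothetical condensation of split_algo_score + merge_algo_and_judge_scores."""
    merged = {}
    for comp, score in algo_full.items():
        split = WEIGHT_SPLIT[comp]
        if split["llm"] == 0:
            algo_part = score  # quality keeps its full algorithmic score
        else:
            algo_part = round(score / RUBRIC_MAX[comp] * split["algo"], 2)
        llm_part = judge.get(JUDGE_DIM.get(comp), {}).get("score", 0)
        merged[comp] = min(algo_part + llm_part, RUBRIC_MAX[comp])
    return merged


# Made-up inputs for one task.
algo_full = {"approach": 16, "orchestration": 12, "quality": 30,
             "feasibility": 9, "novelty": 5, "diversity": 10}
judge = {
    "approach_strategy": {"score": 7.5},
    "orchestration_reasoning": {"score": 6.0},
    "bio_feasibility": {"score": 4.0},
    "novelty_quality": {"score": 1.0},
    "diversity_quality": {"score": 2.5},
}

hybrid = hybrid_scores(algo_full, judge)
hybrid_total = round(sum(hybrid.values()), 2)
print(hybrid, hybrid_total)
```

For approach, 16 of 20 algorithmic points scale to 8.0 of 10, and the judge's 7.5 brings the component to 15.5 of 20. Note that when the panel fails and `judge_result` is `None`, the code in the diff still stores the scaled algorithmic scores as `hybrid_scores`, so such a task can reach at most its algo share.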
eval_scorer.py
CHANGED
|
@@ -1368,15 +1368,14 @@ def score_diversity(
|
|
| 1368 |
return {"score": 0, "max": max_points, "num_designs": 0, "pairwise_diversity": 0.0, "entropy": 0.0}
|
| 1369 |
|
| 1370 |
num = len(designs)
|
|
|
|
| 1371 |
diversity = mean_pairwise_diversity(designs)
|
| 1372 |
entropy = sequence_entropy(designs)
|
| 1373 |
|
| 1374 |
-
|
| 1375 |
-
|
| 1376 |
-
|
| 1377 |
-
|
| 1378 |
-
entropy_score = entropy * max_points * 0.35
|
| 1379 |
-
total = int(round(diversity_score + entropy_score))
|
| 1380 |
|
| 1381 |
return {
|
| 1382 |
"score": min(total, max_points), "max": max_points,
|
|
@@ -1584,12 +1583,11 @@ def aggregate_scores(
|
|
| 1584 |
) -> dict[str, Any]:
|
| 1585 |
"""Aggregate per-task scores into an overall submission result.
|
| 1586 |
|
| 1587 |
-
|
| 1588 |
-
|
| 1589 |
-
|
| 1590 |
-
|
| 1591 |
-
|
| 1592 |
-
tasks_completed, tasks_with_zero.
|
| 1593 |
"""
|
| 1594 |
if not per_task_scores:
|
| 1595 |
return {
|
|
@@ -1604,16 +1602,24 @@ def aggregate_scores(
|
|
| 1604 |
totals = {c: 0.0 for c in DEFAULT_DESIGN_RUBRIC}
|
| 1605 |
n = len(per_task_scores)
|
| 1606 |
tasks_with_zero = 0
|
|
|
|
| 1607 |
|
| 1608 |
# Taxonomy breakdown
|
| 1609 |
taxonomy_scores: dict[str, dict[str, list[float]]] = {}
|
| 1610 |
|
| 1611 |
for task_id, result in per_task_scores.items():
|
| 1612 |
-
|
| 1613 |
if total_score == 0:
|
| 1614 |
tasks_with_zero += 1
|
| 1615 |
|
| 1616 |
-
for comp, val in
|
| 1617 |
totals[comp] += val
|
| 1618 |
|
| 1619 |
# Taxonomy mapping
|
|
@@ -1641,4 +1647,5 @@ def aggregate_scores(
|
|
| 1641 |
"tasks_completed": n,
|
| 1642 |
"tasks_total": n,
|
| 1643 |
"tasks_with_zero": tasks_with_zero,
|
|
|
|
| 1644 |
}
|
|
|
|
| 1368 |
return {"score": 0, "max": max_points, "num_designs": 0, "pairwise_diversity": 0.0, "entropy": 0.0}
|
| 1369 |
|
| 1370 |
num = len(designs)
|
| 1371 |
+
count_fraction = min(num / max_designs, 1.0) if max_designs > 0 else 1.0
|
| 1372 |
diversity = mean_pairwise_diversity(designs)
|
| 1373 |
entropy = sequence_entropy(designs)
|
| 1374 |
|
| 1375 |
+
count_score = count_fraction * max_points * 0.4
|
| 1376 |
+
diversity_score = diversity * max_points * 0.4
|
| 1377 |
+
entropy_score = entropy * max_points * 0.2
|
| 1378 |
+
total = int(round(count_score + diversity_score + entropy_score))
|
|
|
|
|
|
|
| 1379 |
|
| 1380 |
return {
|
| 1381 |
"score": min(total, max_points), "max": max_points,
|
|
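The reworked `score_diversity` weights three terms at 0.4 (design count), 0.4 (pairwise diversity), and 0.2 (entropy). A worked sketch of that arithmetic follows; `diversity_points` is a hypothetical stand-alone function, and it assumes `diversity` and `entropy` are already normalized to the range 0 to 1, which the hunk implies but does not show.

```python
def diversity_points(num, max_designs, diversity, entropy, max_points=10):
    """Hypothetical restatement of the weighted sum in score_diversity.

    Assumes diversity and entropy are normalized to [0, 1].
    """
    count_fraction = min(num / max_designs, 1.0) if max_designs > 0 else 1.0
    count_score = count_fraction * max_points * 0.4
    diversity_score = diversity * max_points * 0.4
    entropy_score = entropy * max_points * 0.2
    total = int(round(count_score + diversity_score + entropy_score))
    return min(total, max_points)


# 5 of 10 requested designs, moderate diversity, fairly high entropy:
# 2.0 (count) + 2.0 (diversity) + 1.6 (entropy) = 5.6, rounded to 6.
half_filled = diversity_points(num=5, max_designs=10, diversity=0.5, entropy=0.8)

# A full, maximally diverse set reaches the component maximum.
saturated = diversity_points(num=10, max_designs=10, diversity=1.0, entropy=1.0)
print(half_filled, saturated)
```

Because the count term carries 40% of the component, an agent that returns a single excellent design cannot score above roughly 60% of the diversity points under this formula.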
|
|
| 1583 |
) -> dict[str, Any]:
|
| 1584 |
"""Aggregate per-task scores into an overall submission result.
|
| 1585 |
|
| 1586 |
+
If `eval_judge.run_judge_panel()` has been run beforehand, each task
|
| 1587 |
+
will carry `hybrid_scores` and `hybrid_total`; in that case we use
|
| 1588 |
+
the hybrid (algo + LLM judge, capped at rubric max) as the canonical
|
| 1589 |
+
score. Otherwise we fall back to the algo-only `component_scores` /
|
| 1590 |
+
`total_score` produced by the dispatcher + Boltz pipeline.
|
|
|
|
| 1591 |
"""
|
| 1592 |
if not per_task_scores:
|
| 1593 |
return {
|
|
|
|
| 1602 |
totals = {c: 0.0 for c in DEFAULT_DESIGN_RUBRIC}
|
| 1603 |
n = len(per_task_scores)
|
| 1604 |
tasks_with_zero = 0
|
| 1605 |
+
used_hybrid = False
|
| 1606 |
|
| 1607 |
# Taxonomy breakdown
|
| 1608 |
taxonomy_scores: dict[str, dict[str, list[float]]] = {}
|
| 1609 |
|
| 1610 |
for task_id, result in per_task_scores.items():
|
| 1611 |
+
if "hybrid_scores" in result and "hybrid_total" in result:
|
| 1612 |
+
comp_scores = result["hybrid_scores"]
|
| 1613 |
+
total_score = result["hybrid_total"]
|
| 1614 |
+
used_hybrid = True
|
| 1615 |
+
else:
|
| 1616 |
+
comp_scores = result.get("component_scores", {})
|
| 1617 |
+
total_score = result.get("total_score", 0.0)
|
| 1618 |
+
|
| 1619 |
if total_score == 0:
|
| 1620 |
tasks_with_zero += 1
|
| 1621 |
|
| 1622 |
+
for comp, val in comp_scores.items():
|
| 1623 |
totals[comp] += val
|
| 1624 |
|
| 1625 |
# Taxonomy mapping
|
|
|
|
| 1647 |
"tasks_completed": n,
|
| 1648 |
"tasks_total": n,
|
| 1649 |
"tasks_with_zero": tasks_with_zero,
|
| 1650 |
+
"scoring_mode": "hybrid" if used_hybrid else "algo",
|
| 1651 |
}
|
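The selection rule added to `aggregate_scores` is small enough to isolate. In the sketch below, `pick_scores` and `scoring_mode` are hypothetical helpers that mirror the `if "hybrid_scores" in result and "hybrid_total" in result` branch and the `used_hybrid` flag from the hunk above.

```python
def pick_scores(result):
    """Return (component_scores, total, is_hybrid) for one task result."""
    if "hybrid_scores" in result and "hybrid_total" in result:
        return result["hybrid_scores"], result["hybrid_total"], True
    return result.get("component_scores", {}), result.get("total_score", 0.0), False


def scoring_mode(per_task_scores):
    """'hybrid' as soon as any task carries hybrid fields, else 'algo'."""
    used_hybrid = any(pick_scores(r)[2] for r in per_task_scores.values())
    return "hybrid" if used_hybrid else "algo"


# Made-up submission where only one of two tasks was judged.
mixed = {
    "t1": {"hybrid_scores": {"approach": 15.5}, "hybrid_total": 15.5},
    "t2": {"component_scores": {"approach": 12}, "total_score": 12},
}
print(scoring_mode(mixed))
```

As written in the diff, the flag is set per submission rather than per task, so a submission in which only some tasks were judged is still labeled `hybrid` while its unjudged tasks contribute unscaled algorithmic scores.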
llm_judge/__init__.py
ADDED
|
@@ -0,0 +1,50 @@
| 1 |
+
"""LLM-as-a-Judge scoring for BioDesignBench Tier 2 evaluation.
|
| 2 |
+
|
| 3 |
+
Provides cross-model LLM judge panels that evaluate subjective dimensions
|
| 4 |
+
(approach, orchestration, feasibility, novelty, diversity) while quality
|
| 5 |
+
metrics remain 100% algorithmic.
|
| 6 |
+
|
| 7 |
+
Usage:
|
| 8 |
+
from llm_judge import LLMJudgePanel
|
| 9 |
+
|
| 10 |
+
panel = LLMJudgePanel(agent_model_family="anthropic", dry_run=True)
|
| 11 |
+
result = panel.evaluate_sync(
|
| 12 |
+
task_description="Design a binder for IL-6R",
|
| 13 |
+
tool_call_log=[...],
|
| 14 |
+
designed_sequences=["MKVL..."],
|
| 15 |
+
algorithmic_metrics={"pLDDT": 82.5},
|
| 16 |
+
)
|
| 17 |
+
"""
|
| 18 |
+
|
| 19 |
+
from llm_judge.aggregation import (
|
| 20 |
+
WEIGHT_SPLIT,
|
| 21 |
+
aggregate_judge_scores,
|
| 22 |
+
merge_algo_and_judge_scores,
|
| 23 |
+
split_algo_score,
|
| 24 |
+
)
|
| 25 |
+
from llm_judge.judge import LLMJudge, parse_judge_response
|
| 26 |
+
from llm_judge.panel import (
|
| 27 |
+
LLMJudgePanel,
|
| 28 |
+
detect_agent_family,
|
| 29 |
+
get_judge_models,
|
| 30 |
+
)
|
| 31 |
+
from llm_judge.rubrics import (
|
| 32 |
+
JUDGE_DIMENSIONS,
|
| 33 |
+
JUDGE_SYSTEM_PROMPT,
|
| 34 |
+
build_judge_prompt,
|
| 35 |
+
)
|
| 36 |
+
|
| 37 |
+
__all__ = [
|
| 38 |
+
"LLMJudge",
|
| 39 |
+
"LLMJudgePanel",
|
| 40 |
+
"JUDGE_DIMENSIONS",
|
| 41 |
+
"JUDGE_SYSTEM_PROMPT",
|
| 42 |
+
"WEIGHT_SPLIT",
|
| 43 |
+
"aggregate_judge_scores",
|
| 44 |
+
"build_judge_prompt",
|
| 45 |
+
"detect_agent_family",
|
| 46 |
+
"get_judge_models",
|
| 47 |
+
"merge_algo_and_judge_scores",
|
| 48 |
+
"parse_judge_response",
|
| 49 |
+
"split_algo_score",
|
| 50 |
+
]
|
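The package exports `detect_agent_family` and `get_judge_models`, but `panel.py` is not visible in this excerpt, so their behavior cannot be quoted here. The sketch below is only one plausible shape for self-exclusion: everything in it (`FAMILIES`, `KEYWORDS`, `guess_family`, `pick_judges`, and the keyword matching itself) is hypothetical. The four family names are taken from the provider keys listed in the `eval_judge.py` docstring.

```python
# Hypothetical family list, derived from the API keys named in eval_judge.py.
FAMILIES = ("anthropic", "openai", "google", "deepseek")

# Hypothetical keyword table; the real detection logic is not shown in the diff.
KEYWORDS = {
    "anthropic": ("claude", "anthropic"),
    "openai": ("gpt", "openai"),
    "google": ("gemini", "google"),
    "deepseek": ("deepseek",),
}


def guess_family(agent_id):
    """Guess the model family from an agent identifier, or None if unknown."""
    lowered = agent_id.lower()
    for family, words in KEYWORDS.items():
        if any(word in lowered for word in words):
            return family
    return None


def pick_judges(agent_family, panel_size=3):
    """Choose judge families, skipping the agent's own family."""
    candidates = [f for f in FAMILIES if f != agent_family]
    return candidates[:panel_size]


family = guess_family("gpt5-tools")  # example id from the run_judge_panel docstring
panel = pick_judges(family)
print(family, panel)
```

With four candidate families and a three-judge panel, excluding the agent's own family leaves exactly three judges, which may be why a fourth provider key is listed.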
llm_judge/aggregation.py
ADDED
|
@@ -0,0 +1,200 @@
| 1 |
+
"""Score aggregation and merging for LLM judge panel.
|
| 2 |
+
|
| 3 |
+
Implements:
|
| 4 |
+
- Weighted averaging with outlier downweighting
|
| 5 |
+
- Algo + LLM score merging with rubric cap enforcement
|
| 6 |
+
- Weight split configuration (72/28 algo-LLM)
|
| 7 |
+
"""
|
| 8 |
+
|
| 9 |
+
from __future__ import annotations
|
| 10 |
+
|
| 11 |
+
import statistics
|
| 12 |
+
from typing import Any
|
| 13 |
+
|
| 14 |
+
from llm_judge.rubrics import JUDGE_DIMENSIONS
|
| 15 |
+
|
| 16 |
+
|
| 17 |
+
# ---------------------------------------------------------------------------
|
| 18 |
+
# Weight split: algo + LLM portions per component (must sum to rubric max)
|
| 19 |
+
# ---------------------------------------------------------------------------
|
| 20 |
+
|
| 21 |
+
WEIGHT_SPLIT: dict[str, dict[str, int]] = {
|
| 22 |
+
"approach": {"algo": 10, "llm": 10}, # 20 total
|
| 23 |
+
"orchestration": {"algo": 7, "llm": 8}, # 15 total
|
| 24 |
+
"quality": {"algo": 35, "llm": 0}, # 35 total (no LLM)
|
| 25 |
+
"feasibility": {"algo": 10, "llm": 5}, # 15 total
|
| 26 |
+
"novelty": {"algo": 3, "llm": 2}, # 5 total
|
| 27 |
+
"diversity": {"algo": 7, "llm": 3}, # 10 total
|
| 28 |
+
}
|
| 29 |
+
|
| 30 |
+
# Mapping from LLM judge dimension → rubric component
|
| 31 |
+
_JUDGE_DIM_TO_COMPONENT: dict[str, str] = {
|
| 32 |
+
"approach_strategy": "approach",
|
| 33 |
+
"orchestration_reasoning": "orchestration",
|
| 34 |
+
"bio_feasibility": "feasibility",
|
| 35 |
+
"novelty_quality": "novelty",
|
| 36 |
+
"diversity_quality": "diversity",
|
| 37 |
+
}
|
| 38 |
+
|
| 39 |
+
# Rubric max per component
|
| 40 |
+
_RUBRIC_MAX: dict[str, int] = {
|
| 41 |
+
"approach": 20,
|
| 42 |
+
"orchestration": 15,
|
| 43 |
+
"quality": 35,
|
| 44 |
+
"feasibility": 15,
|
| 45 |
+
"novelty": 5,
|
| 46 |
+
"diversity": 10,
|
| 47 |
+
}
|
| 48 |
+
|
| 49 |
+
|
| 50 |
+
def aggregate_judge_scores(
|
| 51 |
+
judge_results: list[dict[str, dict[str, Any]]],
|
| 52 |
+
) -> dict[str, dict[str, Any]]:
|
| 53 |
+
"""Aggregate scores from multiple judges with outlier downweighting.
|
| 54 |
+
|
| 55 |
+
For each dimension:
|
| 56 |
+
1. Collect raw scores from all judges
|
| 57 |
+
2. Compute median
|
| 58 |
+
3. Downweight outliers (>2 points from median) by 0.5x
|
| 59 |
+
4. Compute weighted average
|
| 60 |
+
|
| 61 |
+
Args:
|
| 62 |
+
judge_results: List of per-judge result dicts.
|
| 63 |
+
Each maps dimension_name → {reasoning, score}.
|
| 64 |
+
|
| 65 |
+
Returns:
|
| 66 |
+
Aggregated dict mapping dimension_name → {score, reasoning, raw_scores}.
|
| 67 |
+
|
| 68 |
+
Raises:
|
| 69 |
+
ValueError: If judge_results is empty.
|
| 70 |
+
"""
|
| 71 |
+
if not judge_results:
|
| 72 |
+
raise ValueError("aggregate_judge_scores requires at least one judge result")
|
| 73 |
+
|
| 74 |
+
if len(judge_results) == 1:
|
| 75 |
+
# Single judge: pass through directly
|
| 76 |
+
result = {}
|
| 77 |
+
for dim in JUDGE_DIMENSIONS:
|
| 78 |
+
entry = judge_results[0].get(dim, {"score": 0, "reasoning": ""})
|
| 79 |
+
result[dim] = {
|
| 80 |
+
"score": float(entry["score"]),
|
| 81 |
+
"reasoning": entry["reasoning"],
|
| 82 |
+
"raw_scores": [entry["score"]],
|
| 83 |
+
}
|
| 84 |
+
return result
|
| 85 |
+
|
| 86 |
+
aggregated = {}
|
| 87 |
+
for dim, info in JUDGE_DIMENSIONS.items():
|
| 88 |
+
raw_scores = []
|
| 89 |
+
reasonings = []
|
| 90 |
+
for jr in judge_results:
|
| 91 |
+
entry = jr.get(dim, {"score": info["max_score"] // 2, "reasoning": ""})
|
| 92 |
+
raw_scores.append(float(entry["score"]))
|
| 93 |
+
reasonings.append(entry.get("reasoning", ""))
|
| 94 |
+
|
| 95 |
+
# Outlier detection: downweight scores >2 points from median
|
| 96 |
+
med = statistics.median(raw_scores)
|
| 97 |
+
weights = []
|
| 98 |
+
for s in raw_scores:
|
| 99 |
+
if abs(s - med) > 2.0:
|
| 100 |
+
weights.append(0.5)
|
| 101 |
+
else:
|
| 102 |
+
weights.append(1.0)
|
| 103 |
+
|
| 104 |
+
# Weighted average
|
| 105 |
+
weighted_sum = sum(s * w for s, w in zip(raw_scores, weights))
|
| 106 |
+
weight_total = sum(weights)
|
| 107 |
+
avg = weighted_sum / weight_total if weight_total > 0 else 0
|
| 108 |
+
|
| 109 |
+
# Clamp to valid range
|
| 110 |
+
avg = max(0, min(avg, info["max_score"]))
|
| 111 |
+
|
| 112 |
+
aggregated[dim] = {
|
| 113 |
+
"score": round(avg, 1),
|
| 114 |
+
"reasoning": " | ".join(
|
| 115 |
+
f"[Judge {i+1}] {r}" for i, r in enumerate(reasonings) if r
|
| 116 |
+
),
|
| 117 |
+
"raw_scores": raw_scores,
|
| 118 |
+
}
|
| 119 |
+
|
| 120 |
+
return aggregated
|
| 121 |
+
|
| 122 |
+
|
| 123 |
+
def split_algo_score(
|
| 124 |
+
component: str,
|
| 125 |
+
original_score: float,
|
| 126 |
+
original_max: int,
|
| 127 |
+
) -> float:
|
| 128 |
+
"""Scale an algorithmic score to its algo-only portion.
|
| 129 |
+
|
| 130 |
+
For the hybrid system, algorithmic scores are computed against the
|
| 131 |
+
original rubric max (e.g., approach out of 20), then scaled down
|
| 132 |
+
to the algo-only portion (e.g., 10 out of 20).
|
| 133 |
+
|
| 134 |
+
Quality is special: it keeps its full 35 points (no LLM portion).
|
| 135 |
+
|
| 136 |
+
Args:
|
| 137 |
+
component: Rubric component name.
|
| 138 |
+
original_score: Score computed against original max.
|
| 139 |
+
original_max: Original rubric max for this component.
|
| 140 |
+
|
| 141 |
+
Returns:
|
| 142 |
+
Scaled score for the algo-only portion.
|
| 143 |
+
"""
|
| 144 |
+
split = WEIGHT_SPLIT.get(component)
|
| 145 |
+
if split is None:
|
| 146 |
+
return original_score
|
| 147 |
+
|
| 148 |
+
algo_max = split["algo"]
|
| 149 |
+
|
| 150 |
+
if split["llm"] == 0:
|
| 151 |
+
# No LLM portion — return original score unchanged
|
| 152 |
+
return original_score
|
| 153 |
+
|
| 154 |
+
# Scale: (original_score / original_max) * algo_max
|
| 155 |
+
if original_max == 0:
|
| 156 |
+
return 0.0
|
| 157 |
+
ratio = original_score / original_max
|
| 158 |
+
return round(ratio * algo_max, 2)
|
| 159 |
+
|
| 160 |
+
|
| 161 |
+
def merge_algo_and_judge_scores(
|
| 162 |
+
algo_scores: dict[str, float | int],
|
| 163 |
+
judge_scores: dict[str, dict[str, Any]] | None,
|
| 164 |
+
) -> dict[str, float]:
|
| 165 |
+
"""Merge algorithmic and LLM judge scores into final component scores.
|
| 166 |
+
|
| 167 |
+
Args:
|
| 168 |
+
algo_scores: Dict mapping component → algo-portion score.
|
| 169 |
+
These should already be split via split_algo_score().
|
| 170 |
+
judge_scores: Aggregated judge scores (from aggregate_judge_scores).
|
| 171 |
+
None if LLM judge is disabled.
|
| 172 |
+
|
| 173 |
+
Returns:
|
| 174 |
+
Dict mapping component → final merged score (capped at rubric max).
|
| 175 |
+
"""
|
| 176 |
+
if judge_scores is None:
|
| 177 |
+
return dict(algo_scores)
|
| 178 |
+
|
| 179 |
+
merged = {}
|
| 180 |
+
for component, algo_score in algo_scores.items():
|
| 181 |
+
rubric_max = _RUBRIC_MAX.get(component, 100)
|
| 182 |
+
|
| 183 |
+
# Find matching judge dimension
|
| 184 |
+
judge_dim = None
|
| 185 |
+
for jd, comp in _JUDGE_DIM_TO_COMPONENT.items():
|
| 186 |
+
if comp == component:
|
| 187 |
+
judge_dim = jd
|
| 188 |
+
break
|
| 189 |
+
|
| 190 |
+
if judge_dim and judge_dim in judge_scores:
|
| 191 |
+
llm_score = judge_scores[judge_dim].get("score", 0)
|
| 192 |
+
if isinstance(llm_score, dict):
|
| 193 |
+
llm_score = llm_score.get("score", 0)
|
| 194 |
+
total = algo_score + llm_score
|
| 195 |
+
else:
|
| 196 |
+
total = algo_score
|
| 197 |
+
|
| 198 |
+
merged[component] = min(total, rubric_max)
|
| 199 |
+
|
| 200 |
+
return merged
|
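As a worked illustration of the arithmetic in `split_algo_score` and `merge_algo_and_judge_scores`, the sketch below rescales an algorithmic score to its algo-only portion and then adds a judge score, capped at the rubric max. The `WEIGHT_SPLIT` entries are assumptions reconstructed from the docstring (approach: 10 of 20 points algorithmic; quality: all 35 points algorithmic); the real table is defined elsewhere in `aggregation.py` and may differ. The sample scores are hypothetical.

```python
# Hypothetical split table: only the two components the docstring mentions.
WEIGHT_SPLIT = {
    "approach": {"algo": 10, "llm": 10},
    "quality": {"algo": 35, "llm": 0},
}


def split_algo_score(component, original_score, original_max):
    """Rescale a score from the original rubric max to the algo-only max."""
    split = WEIGHT_SPLIT.get(component)
    if split is None or split["llm"] == 0:
        return original_score  # no LLM portion: keep the score unchanged
    if original_max == 0:
        return 0.0
    return round(original_score / original_max * split["algo"], 2)


# 15/20 on the original approach rubric becomes 7.5/10 on the algo-only portion.
approach_algo = split_algo_score("approach", 15, 20)
# Quality has no LLM portion, so 28/35 passes through unchanged.
quality_algo = split_algo_score("quality", 28, 35)

# Merge: algo portion plus the aggregated judge score, capped at the rubric max.
judge_approach = 6.3  # hypothetical aggregated judge score out of 10
approach_total = min(approach_algo + judge_approach, 20)
print(approach_algo, quality_algo, approach_total)
```

Because the judge dimension's max (10) equals the LLM share of the split, the merged total cannot exceed the original rubric max of 20 unless a score was out of range, which is what the final `min` guards against.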
llm_judge/judge.py
ADDED
|
@@ -0,0 +1,217 @@
|
| 1 |
+
"""Single LLM judge: wraps one API call to evaluate a design attempt.
|
| 2 |
+
|
| 3 |
+
Supports Anthropic, OpenAI, Google, and DeepSeek providers.
|
| 4 |
+
In dry_run mode, returns deterministic midpoint scores without API calls.
|
| 5 |
+
"""
|
| 6 |
+
|
| 7 |
+
from __future__ import annotations
|
| 8 |
+
|
| 9 |
+
import json
|
| 10 |
+
import re
|
| 11 |
+
from typing import Any
|
| 12 |
+
|
| 13 |
+
from llm_judge.rubrics import (
|
| 14 |
+
JUDGE_DIMENSIONS,
|
| 15 |
+
JUDGE_SYSTEM_PROMPT,
|
| 16 |
+
build_judge_prompt,
|
| 17 |
+
)
|
| 18 |
+
|
| 19 |
+
|
| 20 |
+
def _midpoint_scores() -> dict[str, dict[str, Any]]:
|
| 21 |
+
"""Return deterministic midpoint scores for dry-run mode."""
|
| 22 |
+
result = {}
|
| 23 |
+
for dim, info in JUDGE_DIMENSIONS.items():
|
| 24 |
+
mid = info["max_score"] // 2
|
| 25 |
+
if info["max_score"] % 2 == 1 and mid * 2 < info["max_score"]:
|
| 26 |
+
# Odd max (5, 3): floor division rounds the midpoint down (5 -> 2, 3 -> 1); no adjustment is made
|
| 27 |
+
pass
|
| 28 |
+
result[dim] = {
|
| 29 |
+
"reasoning": f"[Dry run] Midpoint score for {dim}.",
|
| 30 |
+
"score": mid,
|
| 31 |
+
}
|
| 32 |
+
return result
|
| 33 |
+
|
| 34 |
+
|
| 35 |
+
def parse_judge_response(raw_text: str) -> dict[str, dict[str, Any]]:
|
| 36 |
+
"""Parse LLM judge response into structured scores.
|
| 37 |
+
|
| 38 |
+
Handles:
|
| 39 |
+
- Direct JSON response
|
| 40 |
+
- JSON inside markdown code blocks
|
| 41 |
+
- Out-of-range score clamping
|
| 42 |
+
- Invalid JSON fallback to midpoint scores
|
| 43 |
+
|
| 44 |
+
Args:
|
| 45 |
+
raw_text: Raw LLM response text.
|
| 46 |
+
|
| 47 |
+
Returns:
|
| 48 |
+
Dict mapping dimension names to {reasoning, score}.
|
| 49 |
+
"""
|
| 50 |
+
# Try to extract JSON from markdown code block
|
| 51 |
+
json_match = re.search(r"```(?:json)?\s*\n?(.*?)\n?\s*```", raw_text, re.DOTALL)
|
| 52 |
+
json_str = json_match.group(1) if json_match else raw_text
|
| 53 |
+
|
| 54 |
+
try:
|
| 55 |
+
data = json.loads(json_str)
|
| 56 |
+
except json.JSONDecodeError:
|
| 57 |
+
# Try finding any JSON object in the text
|
| 58 |
+
brace_match = re.search(r"\{.*\}", raw_text, re.DOTALL)
|
| 59 |
+
if brace_match:
|
| 60 |
+
try:
|
| 61 |
+
data = json.loads(brace_match.group())
|
| 62 |
+
except json.JSONDecodeError:
|
| 63 |
+
return _midpoint_scores()
|
| 64 |
+
else:
|
| 65 |
+
return _midpoint_scores()
|
| 66 |
+
|
| 67 |
+
# Validate and clamp scores
|
| 68 |
+
result = {}
|
| 69 |
+
for dim, info in JUDGE_DIMENSIONS.items():
|
| 70 |
+
if isinstance(data, dict) and dim in data and isinstance(data[dim], dict):
|
| 71 |
+
score = data[dim].get("score", info["max_score"] // 2)
|
| 72 |
+
if isinstance(score, (int, float)):
|
| 73 |
+
score = max(0, min(score, info["max_score"]))
|
| 74 |
+
else:
|
| 75 |
+
score = info["max_score"] // 2
|
| 76 |
+
reasoning = data[dim].get("reasoning", "")
|
| 77 |
+
result[dim] = {"reasoning": str(reasoning), "score": score}
|
| 78 |
+
else:
|
| 79 |
+
# Missing dimension — use midpoint
|
| 80 |
+
result[dim] = {
|
| 81 |
+
"reasoning": f"[Fallback] Dimension {dim} missing from judge response.",
|
| 82 |
+
"score": info["max_score"] // 2,
|
| 83 |
+
}
|
| 84 |
+
|
| 85 |
+
return result
|
| 86 |
+
|
| 87 |
+
|
| 88 |
+
class LLMJudge:
|
| 89 |
+
"""Single LLM judge that evaluates a protein design attempt.
|
| 90 |
+
|
| 91 |
+
Args:
|
| 92 |
+
provider: API provider ('anthropic', 'openai', 'google', 'deepseek').
|
| 93 |
+
model: Model identifier string.
|
| 94 |
+
dry_run: If True, return deterministic scores without API calls.
|
| 95 |
+
api_key: Optional API key override.
|
| 96 |
+
"""
|
| 97 |
+
|
| 98 |
+
def __init__(
|
| 99 |
+
self,
|
| 100 |
+
provider: str,
|
| 101 |
+
model: str,
|
| 102 |
+
dry_run: bool = False,
|
| 103 |
+
api_key: str | None = None,
|
| 104 |
+
):
|
| 105 |
+
self.provider = provider
|
| 106 |
+
self.model = model
|
| 107 |
+
self.dry_run = dry_run
|
| 108 |
+
self.api_key = api_key
|
| 109 |
+
self.api_calls = 0
|
| 110 |
+
self._client = None
|
| 111 |
+
|
| 112 |
+
def _get_client(self):
|
| 113 |
+
"""Lazy-initialize the API client."""
|
| 114 |
+
if self._client is not None:
|
| 115 |
+
return self._client
|
| 116 |
+
|
| 117 |
+
import os
|
| 118 |
+
|
| 119 |
+
if self.provider == "anthropic":
|
| 120 |
+
import anthropic
|
| 121 |
+
|
| 122 |
+
key = self.api_key or os.environ.get("ANTHROPIC_API_KEY")
|
| 123 |
+
self._client = anthropic.Anthropic(api_key=key)
|
| 124 |
+
elif self.provider == "openai":
|
| 125 |
+
from openai import OpenAI
|
| 126 |
+
|
| 127 |
+
key = self.api_key or os.environ.get("OPENAI_API_KEY")
|
| 128 |
+
self._client = OpenAI(api_key=key)
|
| 129 |
+
elif self.provider == "google":
|
| 130 |
+
from google import genai
|
| 131 |
+
|
| 132 |
+
key = self.api_key or os.environ.get("GOOGLE_API_KEY")
|
| 133 |
+
self._client = genai.Client(api_key=key)
|
| 134 |
+
elif self.provider == "deepseek":
|
| 135 |
+
from openai import OpenAI
|
| 136 |
+
|
| 137 |
+
key = self.api_key or os.environ.get("DEEPSEEK_API_KEY")
|
| 138 |
+
self._client = OpenAI(
|
| 139 |
+
api_key=key, base_url="https://api.deepseek.com"
|
| 140 |
+
)
|
| 141 |
+
else:
|
| 142 |
+
raise ValueError(f"Unknown provider: {self.provider}")
|
| 143 |
+
|
| 144 |
+
return self._client
|
| 145 |
+
|
| 146 |
+
def _call_api(self, system: str, user: str) -> str:
|
| 147 |
+
"""Make a single API call and return raw text response."""
|
| 148 |
+
client = self._get_client()
|
| 149 |
+
self.api_calls += 1
|
| 150 |
+
|
| 151 |
+
if self.provider == "anthropic":
|
| 152 |
+
response = client.messages.create(
|
| 153 |
+
model=self.model,
|
| 154 |
+
max_tokens=4096,
|
| 155 |
+
system=system,
|
| 156 |
+
messages=[{"role": "user", "content": user}],
|
| 157 |
+
)
|
| 158 |
+
return response.content[0].text
|
| 159 |
+
|
| 160 |
+
elif self.provider in ("openai", "deepseek"):
|
| 161 |
+
# GPT-5 and o3/o4 models take max_completion_tokens; other models use max_tokens
|
| 162 |
+
token_param = (
|
| 163 |
+
"max_completion_tokens" if "gpt-5" in self.model or "o3" in self.model or "o4" in self.model
|
| 164 |
+
else "max_tokens"
|
| 165 |
+
)
|
| 166 |
+
response = client.chat.completions.create(
|
| 167 |
+
model=self.model,
|
| 168 |
+
**{token_param: 4096},
|
| 169 |
+
messages=[
|
| 170 |
+
{"role": "system", "content": system},
|
| 171 |
+
{"role": "user", "content": user},
|
| 172 |
+
],
|
| 173 |
+
)
|
| 174 |
+
return response.choices[0].message.content
|
| 175 |
+
|
| 176 |
+
elif self.provider == "google":
|
| 177 |
+
response = client.models.generate_content(
|
| 178 |
+
model=self.model,
|
| 179 |
+
contents=f"{system}\n\n{user}",
|
| 180 |
+
)
|
| 181 |
+
return response.text
|
| 182 |
+
|
| 183 |
+
raise ValueError(f"Unsupported provider: {self.provider}")
|
| 184 |
+
|
| 185 |
+
def evaluate_sync(
|
| 186 |
+
self,
|
| 187 |
+
task_description: str,
|
| 188 |
+
tool_call_log: list[dict[str, Any]],
|
| 189 |
+
designed_sequences: list[str],
|
| 190 |
+
algorithmic_metrics: dict[str, Any],
|
| 191 |
+
reference_pipeline: list[str] | None = None,
|
| 192 |
+
) -> dict[str, dict[str, Any]]:
|
| 193 |
+
"""Evaluate a design attempt synchronously.
|
| 194 |
+
|
| 195 |
+
Args:
|
| 196 |
+
task_description: Original task prompt.
|
| 197 |
+
tool_call_log: Agent's tool call sequence.
|
| 198 |
+
designed_sequences: Designed protein sequences.
|
| 199 |
+
algorithmic_metrics: Computed biophysical metrics.
|
| 200 |
+
reference_pipeline: Expected expert pipeline.
|
| 201 |
+
|
| 202 |
+
Returns:
|
| 203 |
+
Dict mapping dimension names to {reasoning, score}.
|
| 204 |
+
"""
|
| 205 |
+
if self.dry_run:
|
| 206 |
+
return _midpoint_scores()
|
| 207 |
+
|
| 208 |
+
prompt = build_judge_prompt(
|
| 209 |
+
task_description=task_description,
|
| 210 |
+
tool_call_log=tool_call_log,
|
| 211 |
+
designed_sequences=designed_sequences,
|
| 212 |
+
algorithmic_metrics=algorithmic_metrics,
|
| 213 |
+
reference_pipeline=reference_pipeline,
|
| 214 |
+
)
|
| 215 |
+
|
| 216 |
+
raw_response = self._call_api(JUDGE_SYSTEM_PROMPT, prompt)
|
| 217 |
+
return parse_judge_response(raw_response)
|
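The parsing path in `parse_judge_response` (extract JSON from a fenced reply, clamp out-of-range scores, fall back to the midpoint) can be sketched in isolation as below. This is a simplified illustration, not the module's code: `extract_scores` and `MAX_SCORES` are hypothetical names, only two dimensions are included, and the second-chance search for a bare `{...}` span is omitted.

```python
from __future__ import annotations

import json
import re

FENCE = "`" * 3  # built programmatically to avoid a literal fence inside this example

# Two max scores copied from JUDGE_DIMENSIONS in rubrics.py (subset for brevity).
MAX_SCORES = {"approach_strategy": 10, "novelty_quality": 2}


def extract_scores(raw_text: str) -> dict[str, float]:
    """Pull per-dimension scores out of a judge reply, clamping to [0, max]."""
    pattern = FENCE + r"(?:json)?\s*\n?(.*?)\n?\s*" + FENCE
    match = re.search(pattern, raw_text, re.DOTALL)
    payload = match.group(1) if match else raw_text
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        data = {}  # simplified: every dimension falls back to its midpoint
    if not isinstance(data, dict):
        data = {}

    scores = {}
    for dim, max_score in MAX_SCORES.items():
        entry = data.get(dim)
        score = entry.get("score") if isinstance(entry, dict) else None
        if not isinstance(score, (int, float)):
            score = max_score // 2  # midpoint fallback for missing or invalid scores
        scores[dim] = max(0, min(score, max_score))
    return scores


reply = (
    "Here is my evaluation:\n"
    + FENCE + "json\n"
    + '{"approach_strategy": {"reasoning": "solid plan", "score": 14}}\n'
    + FENCE
)
scores = extract_scores(reply)
print(scores)
```

The out-of-range 14 is clamped to the dimension max of 10, and the missing `novelty_quality` dimension receives the floor-division midpoint.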
llm_judge/panel.py
ADDED
|
@@ -0,0 +1,162 @@
|
| 1 |
+
"""LLM Judge Panel: manages cross-model evaluation with self-exclusion.
|
| 2 |
+
|
| 3 |
+
Following PoLL (Verga et al., 2024): 3 judges from different model families,
|
| 4 |
+
excluding the generating model. Human baselines get all 4 judges.
|
| 5 |
+
"""
|
| 6 |
+
|
| 7 |
+
from __future__ import annotations
|
| 8 |
+
|
| 9 |
+
from typing import Any
|
| 10 |
+
|
| 11 |
+
from llm_judge.aggregation import aggregate_judge_scores
|
| 12 |
+
from llm_judge.judge import LLMJudge
|
| 13 |
+
|
| 14 |
+
|
| 15 |
+
# ---------------------------------------------------------------------------
|
| 16 |
+
# Available judge models (one per family)
|
| 17 |
+
# ---------------------------------------------------------------------------
|
| 18 |
+
|
| 19 |
+
JUDGE_MODELS: list[dict[str, str]] = [
|
| 20 |
+
{
|
| 21 |
+
"family": "anthropic",
|
| 22 |
+
"provider": "anthropic",
|
| 23 |
+
"model": "claude-sonnet-4-20250514",
|
| 24 |
+
},
|
| 25 |
+
{
|
| 26 |
+
"family": "openai",
|
| 27 |
+
"provider": "openai",
|
| 28 |
+
"model": "gpt-5.2",
|
| 29 |
+
},
|
| 30 |
+
{
|
| 31 |
+
"family": "google",
|
| 32 |
+
"provider": "google",
|
| 33 |
+
"model": "gemini-2.5-pro",
|
| 34 |
+
},
|
| 35 |
+
{
|
| 36 |
+
"family": "deepseek",
|
| 37 |
+
"provider": "deepseek",
|
| 38 |
+
"model": "deepseek-chat",
|
| 39 |
+
},
|
| 40 |
+
]
|
| 41 |
+
|
| 42 |
+
|
| 43 |
+
# ---------------------------------------------------------------------------
|
| 44 |
+
# Agent ID → model family mapping
|
| 45 |
+
# ---------------------------------------------------------------------------
|
| 46 |
+
|
| 47 |
+
_AGENT_FAMILY_PREFIXES: dict[str, str] = {
|
| 48 |
+
"claude": "anthropic",
|
| 49 |
+
"gpt": "openai",
|
| 50 |
+
"gemini": "google",
|
| 51 |
+
"deepseek": "deepseek",
|
| 52 |
+
"human": "human",
|
| 53 |
+
}
|
| 54 |
+
|
| 55 |
+
|
| 56 |
+
def detect_agent_family(agent_id: str) -> str:
|
| 57 |
+
"""Map an agent ID to its model family.
|
| 58 |
+
|
| 59 |
+
Args:
|
| 60 |
+
agent_id: Agent identifier (e.g., 'claude-code', 'gpt5-tools-benchmark').
|
| 61 |
+
|
| 62 |
+
Returns:
|
| 63 |
+
Family string: 'anthropic', 'openai', 'google', 'deepseek', 'human',
|
| 64 |
+
or 'unknown'.
|
| 65 |
+
"""
|
| 66 |
+
agent_lower = agent_id.lower()
|
| 67 |
+
for prefix, family in _AGENT_FAMILY_PREFIXES.items():
|
| 68 |
+
if agent_lower.startswith(prefix):
|
| 69 |
+
return family
|
| 70 |
+
return "unknown"
|
| 71 |
+
|
| 72 |
+
|
| 73 |
+
def get_judge_models(agent_model_family: str) -> list[dict[str, str]]:
|
| 74 |
+
"""Select judge models for a given agent, excluding self.
|
| 75 |
+
|
| 76 |
+
Args:
|
| 77 |
+
agent_model_family: Family of the agent being evaluated
|
| 78 |
+
('anthropic', 'openai', 'google', 'deepseek', 'human', 'unknown').
|
| 79 |
+
|
| 80 |
+
Returns:
|
| 81 |
+
List of judge model dicts (3 for agents from a known family; all 4 for human baselines and unknown families).
|
| 82 |
+
"""
|
| 83 |
+
if agent_model_family == "human":
|
| 84 |
+
return list(JUDGE_MODELS) # All 4 judges
|
| 85 |
+
|
| 86 |
+
return [j for j in JUDGE_MODELS if j["family"] != agent_model_family]
|
| 87 |
+
|
| 88 |
+
|
| 89 |
+
class LLMJudgePanel:
|
| 90 |
+
"""Cross-model judge panel for protein design evaluation.
|
| 91 |
+
|
| 92 |
+
Manages 3 judges (excluding the agent's own model family; all 4 for human or unknown-family agents) and
|
| 93 |
+
aggregates their scores.
|
| 94 |
+
|
| 95 |
+
Args:
|
| 96 |
+
agent_model_family: Model family to exclude ('anthropic', etc).
|
| 97 |
+
dry_run: If True, all judges return deterministic midpoint scores.
|
| 98 |
+
"""
|
| 99 |
+
|
| 100 |
+
def __init__(
|
| 101 |
+
self,
|
| 102 |
+
agent_model_family: str,
|
| 103 |
+
dry_run: bool = False,
|
| 104 |
+
):
|
| 105 |
+
self.agent_model_family = agent_model_family
|
| 106 |
+
self.dry_run = dry_run
|
| 107 |
+
self.judge_configs = get_judge_models(agent_model_family)
|
| 108 |
+
self.judges = [
|
| 109 |
+
LLMJudge(
|
| 110 |
+
provider=cfg["provider"],
|
| 111 |
+
model=cfg["model"],
|
| 112 |
+
dry_run=dry_run,
|
| 113 |
+
)
|
| 114 |
+
for cfg in self.judge_configs
|
| 115 |
+
]
|
| 116 |
+
|
| 117 |
+
def evaluate_sync(
|
| 118 |
+
self,
|
| 119 |
+
task_description: str,
|
| 120 |
+
tool_call_log: list[dict[str, Any]],
|
| 121 |
+
designed_sequences: list[str],
|
| 122 |
+
algorithmic_metrics: dict[str, Any],
|
| 123 |
+
reference_pipeline: list[str] | None = None,
|
| 124 |
+
) -> dict[str, Any]:
|
| 125 |
+
"""Evaluate a design with all judges and aggregate.
|
| 126 |
+
|
| 127 |
+
Args:
|
| 128 |
+
task_description: Original task prompt.
|
| 129 |
+
tool_call_log: Agent's tool call sequence.
|
| 130 |
+
designed_sequences: Designed protein sequences.
|
| 131 |
+
algorithmic_metrics: Computed biophysical metrics.
|
| 132 |
+
reference_pipeline: Expected expert pipeline.
|
| 133 |
+
|
| 134 |
+
Returns:
|
| 135 |
+
Dict with aggregated scores, judge count, and individual results.
|
| 136 |
+
"""
|
| 137 |
+
individual_results = []
|
| 138 |
+
|
| 139 |
+
for judge in self.judges:
|
| 140 |
+
result = judge.evaluate_sync(
|
| 141 |
+
task_description=task_description,
|
| 142 |
+
tool_call_log=tool_call_log,
|
| 143 |
+
designed_sequences=designed_sequences,
|
| 144 |
+
algorithmic_metrics=algorithmic_metrics,
|
| 145 |
+
reference_pipeline=reference_pipeline,
|
| 146 |
+
)
|
| 147 |
+
individual_results.append(result)
|
| 148 |
+
|
| 149 |
+
aggregated = aggregate_judge_scores(individual_results)
|
| 150 |
+
|
| 151 |
+
return {
|
| 152 |
+
**aggregated,
|
| 153 |
+
"judge_count": len(self.judges),
|
| 154 |
+
"individual_judges": [
|
| 155 |
+
{
|
| 156 |
+
"model": cfg["model"],
|
| 157 |
+
"family": cfg["family"],
|
| 158 |
+
"scores": result,
|
| 159 |
+
}
|
| 160 |
+
for cfg, result in zip(self.judge_configs, individual_results)
|
| 161 |
+
],
|
| 162 |
+
}
|
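The self-exclusion rule implemented by `detect_agent_family` and `get_judge_models` can be exercised with a small standalone sketch. The family list and prefix map mirror `JUDGE_MODELS` and `_AGENT_FAMILY_PREFIXES` above; the function names `detect_family` and `select_judges` and the sample agent IDs are illustrative only.

```python
JUDGE_FAMILIES = ["anthropic", "openai", "google", "deepseek"]

PREFIXES = {
    "claude": "anthropic",
    "gpt": "openai",
    "gemini": "google",
    "deepseek": "deepseek",
    "human": "human",
}


def detect_family(agent_id):
    """Map an agent ID to a family by case-insensitive prefix match."""
    agent_lower = agent_id.lower()
    for prefix, family in PREFIXES.items():
        if agent_lower.startswith(prefix):
            return family
    return "unknown"


def select_judges(family):
    """Return every judge family except the agent's own."""
    if family == "human":
        return list(JUDGE_FAMILIES)
    return [f for f in JUDGE_FAMILIES if f != family]


claude_panel = select_judges(detect_family("claude-code"))
human_panel = select_judges(detect_family("Human-expert-1"))
unknown_panel = select_judges(detect_family("my-custom-agent"))
print(claude_panel, len(human_panel), len(unknown_panel))
```

Note that an agent whose ID matches no prefix maps to `unknown`, which excludes nothing, so it is scored by all four judges just like a human baseline.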
llm_judge/plan_eval.py
ADDED
|
@@ -0,0 +1,141 @@
|
| 1 |
+
"""LLM-based plan evaluation: judge whether agent's reasoning trace
|
| 2 |
+
demonstrates understanding of each pipeline step.
|
| 3 |
+
|
| 4 |
+
Replaces keyword matching with LLM assessment of 4 pipeline steps:
|
| 5 |
+
backbone_generation, sequence_design, structure_prediction, scoring_validation
|
| 6 |
+
|
| 7 |
+
Each step scored as 0 or 1 per judge, aggregated across 3-4 judges via strict majority vote (a tie counts as 0).
|
| 8 |
+
"""
|
| 9 |
+
from __future__ import annotations
|
| 10 |
+
|
| 11 |
+
import json
|
| 12 |
+
import re
|
| 13 |
+
from typing import Any
|
| 14 |
+
|
| 15 |
+
from llm_judge.judge import LLMJudge
|
| 16 |
+
|
| 17 |
+
PLAN_EVAL_SYSTEM = """You are an expert protein design evaluator. Your task is to assess whether an AI agent's reasoning trace demonstrates awareness and planning of specific protein design pipeline steps.
|
| 18 |
+
|
| 19 |
+
You have deep knowledge of:
|
| 20 |
+
- RFdiffusion for backbone generation
|
| 21 |
+
- ProteinMPNN for inverse folding / sequence design
|
| 22 |
+
- AlphaFold2, ESMFold, Boltz for structure prediction
|
| 23 |
+
- Rosetta for energy scoring and validation
|
| 24 |
+
|
| 25 |
+
Be strict: the agent must show genuine understanding or intent to use a step, not just mention a keyword in passing. Look for evidence that the agent planned to perform the step as part of its design strategy."""
|
| 26 |
+
|
| 27 |
+
PLAN_EVAL_PROMPT_TEMPLATE = """## Task
|
| 28 |
+
{task_description}
|
| 29 |
+
|
| 30 |
+
## Agent's Reasoning Trace
|
| 31 |
+
{reasoning_trace}
|
| 32 |
+
|
| 33 |
+
## Pipeline Steps to Evaluate
|
| 34 |
+
|
| 35 |
+
For each step below, determine whether the agent's reasoning trace shows that the agent **planned or intended** to perform this step. Score 1 if the agent demonstrates clear awareness and intent, 0 if not.
|
| 36 |
+
|
| 37 |
+
1. **backbone_generation**: Did the agent plan to generate a de novo protein backbone/scaffold? (e.g., using RFdiffusion, backbone diffusion, scaffold generation, de novo structure design)
|
| 38 |
+
|
| 39 |
+
2. **sequence_design**: Did the agent plan to design/optimize amino acid sequences for the structure? (e.g., using ProteinMPNN, inverse folding, sequence optimization, fixed-backbone design)
|
| 40 |
+
|
| 41 |
+
3. **structure_prediction**: Did the agent plan to predict/validate the 3D structure of designed sequences? (e.g., using AlphaFold2, ESMFold, Boltz, checking pLDDT/pTM, fold confidence)
|
| 42 |
+
|
| 43 |
+
4. **scoring_validation**: Did the agent plan to score the design's energy/stability? (e.g., using Rosetta, energy minimization, interface analysis, ddG calculation, binding energy)
|
| 44 |
+
|
| 45 |
+
## Response Format
|
| 46 |
+
Return a JSON object with exactly this structure:
|
| 47 |
+
```json
|
| 48 |
+
{{
|
| 49 |
+
"backbone_generation": {{"planned": 0 or 1, "evidence": "brief quote or reason"}},
|
| 50 |
+
"sequence_design": {{"planned": 0 or 1, "evidence": "brief quote or reason"}},
|
| 51 |
+
"structure_prediction": {{"planned": 0 or 1, "evidence": "brief quote or reason"}},
|
| 52 |
+
"scoring_validation": {{"planned": 0 or 1, "evidence": "brief quote or reason"}}
|
| 53 |
+
}}
|
| 54 |
+
```
|
| 55 |
+
"""
|
| 56 |
+
|
| 57 |
+
STEPS = ["backbone_generation", "sequence_design", "structure_prediction", "scoring_validation"]
|
| 58 |
+
|
| 59 |
+
|
| 60 |
+
def parse_plan_response(raw_text: str) -> dict[str, int]:
|
| 61 |
+
"""Parse LLM response into per-step binary scores."""
|
| 62 |
+
# Try JSON extraction
|
| 63 |
+
json_match = re.search(r"```(?:json)?\s*\n?(.*?)\n?\s*```", raw_text, re.DOTALL)
|
| 64 |
+
json_str = json_match.group(1) if json_match else raw_text
|
| 65 |
+
|
| 66 |
+
try:
|
| 67 |
+
data = json.loads(json_str)
|
| 68 |
+
except json.JSONDecodeError:
|
| 69 |
+
brace_match = re.search(r"\{.*\}", raw_text, re.DOTALL)
|
| 70 |
+
if brace_match:
|
| 71 |
+
try:
|
| 72 |
+
data = json.loads(brace_match.group())
|
| 73 |
+
except json.JSONDecodeError:
|
| 74 |
+
return {s: 0 for s in STEPS}
|
| 75 |
+
else:
|
| 76 |
+
return {s: 0 for s in STEPS}
|
| 77 |
+
|
| 78 |
+
result = {}
|
| 79 |
+
for step in STEPS:
|
| 80 |
+
if isinstance(data, dict) and step in data and isinstance(data[step], dict):
|
| 81 |
+
val = data[step].get("planned", 0)
|
| 82 |
+
result[step] = 1 if val == 1 or val is True else 0
|
| 83 |
+
else:
|
| 84 |
+
result[step] = 0
|
| 85 |
+
return result
|
| 86 |
+
|
| 87 |
+
|
| 88 |
+
def evaluate_plan_single(
|
| 89 |
+
judge: LLMJudge,
|
| 90 |
+
task_description: str,
|
| 91 |
+
reasoning_trace: str,
|
| 92 |
+
) -> dict[str, int]:
|
| 93 |
+
"""Evaluate plan with a single judge."""
|
| 94 |
+
if not reasoning_trace or len(reasoning_trace.strip()) < 10:
|
| 95 |
+
return {s: 0 for s in STEPS}
|
| 96 |
+
|
| 97 |
+
if judge.dry_run:
|
| 98 |
+
return {s: 0 for s in STEPS}
|
| 99 |
+
|
| 100 |
+
# Cap trace length
|
| 101 |
+
trace = reasoning_trace[:4000]
|
| 102 |
+
prompt = PLAN_EVAL_PROMPT_TEMPLATE.format(
|
| 103 |
+
task_description=task_description[:1000],
|
| 104 |
+
reasoning_trace=trace,
|
| 105 |
+
)
|
| 106 |
+
|
| 107 |
+
raw = judge._call_api(PLAN_EVAL_SYSTEM, prompt)
|
| 108 |
+
return parse_plan_response(raw)
|
| 109 |
+
|
| 110 |
+
|
| 111 |
+
def evaluate_plan_panel(
|
| 112 |
+
judges: list[LLMJudge],
|
| 113 |
+
task_description: str,
|
| 114 |
+
reasoning_trace: str,
|
| 115 |
+
) -> dict[str, dict[str, Any]]:
|
| 116 |
+
"""Evaluate plan with multiple judges, aggregate via majority vote.
|
| 117 |
+
|
| 118 |
+
Returns dict mapping step → {planned: 0/1, votes: [per-judge], n_judges: int}.
|
| 119 |
+
"""
|
| 120 |
+
if not reasoning_trace or len(reasoning_trace.strip()) < 10:
|
| 121 |
+
return {
|
| 122 |
+
s: {"planned": 0, "votes": [0] * len(judges), "n_judges": len(judges)}
|
| 123 |
+
for s in STEPS
|
| 124 |
+
}
|
| 125 |
+
|
| 126 |
+
all_votes: dict[str, list[int]] = {s: [] for s in STEPS}
|
| 127 |
+
for judge in judges:
|
| 128 |
+
result = evaluate_plan_single(judge, task_description, reasoning_trace)
|
| 129 |
+
for step in STEPS:
|
| 130 |
+
all_votes[step].append(result.get(step, 0))
|
| 131 |
+
|
| 132 |
+
aggregated = {}
|
| 133 |
+
for step in STEPS:
|
| 134 |
+
votes = all_votes[step]
|
| 135 |
+
planned = 1 if sum(votes) > len(votes) / 2 else 0  # strict majority; ties (e.g., 2-2) count as 0
|
| 136 |
+
aggregated[step] = {
|
| 137 |
+
"planned": planned,
|
| 138 |
+
"votes": votes,
|
| 139 |
+
"n_judges": len(judges),
|
| 140 |
+
}
|
| 141 |
+
return aggregated
|
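The aggregation rule in `evaluate_plan_panel` is a strict majority over binary votes. A minimal sketch of that rule, with hypothetical vote lists, shows how a three-judge panel and a four-judge panel behave; `majority` is an illustrative helper name, not a function in the module.

```python
def majority(votes):
    """Return 1 only when more than half of the binary votes are 1."""
    return 1 if sum(votes) > len(votes) / 2 else 0


three_judges = majority([1, 1, 0])      # 2 of 3 judges agree the step was planned
four_judge_tie = majority([1, 1, 0, 0])  # 2 of 4 is not a strict majority
no_judges = majority([])                 # an empty panel yields 0
print(three_judges, four_judge_tie, no_judges)
```

With four judges (human baselines), a step therefore needs at least three positive votes to count as planned, whereas a three-judge panel needs two.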
llm_judge/rubrics.py
ADDED
|
@@ -0,0 +1,173 @@
|
| 1 |
+
"""Structured rubric prompts for LLM judge evaluation.
|
| 2 |
+
|
| 3 |
+
Each judge evaluates 5 dimensions with explicit score-level descriptors
|
| 4 |
+
following the Prometheus (ICLR 2024) rubric-based approach.
|
| 5 |
+
"""
|
| 6 |
+
|
| 7 |
+
from __future__ import annotations
|
| 8 |
+
|
| 9 |
+
import json
|
| 10 |
+
from typing import Any
|
| 11 |
+
|
| 12 |
+
# ---------------------------------------------------------------------------
|
| 13 |
+
# Judge dimensions with max scores matching the LLM portion of the split
|
| 14 |
+
# ---------------------------------------------------------------------------
|
| 15 |
+
|
| 16 |
+
JUDGE_DIMENSIONS: dict[str, dict[str, Any]] = {
|
| 17 |
+
"approach_strategy": {
|
| 18 |
+
"max_score": 10,
|
| 19 |
+
"description": "Strategic quality of tool/methodology selection",
|
| 20 |
+
},
|
| 21 |
+
"orchestration_reasoning": {
|
| 22 |
+
"max_score": 8,
|
| 23 |
+
"description": "Pipeline logic, error handling, and adaptive reasoning",
|
| 24 |
+
},
|
| 25 |
+
"bio_feasibility": {
|
| 26 |
+
"max_score": 5,
|
| 27 |
+
"description": "Biological plausibility beyond sequence-level checks",
|
| 28 |
+
},
|
| 29 |
+
"novelty_quality": {
|
| 30 |
+
"max_score": 2,
|
| 31 |
+
"description": "Meaningful innovation vs accidental variation",
|
| 32 |
+
},
|
| 33 |
+
"diversity_quality": {
|
| 34 |
+
"max_score": 3,
|
| 35 |
+
"description": "Functional diversity of design strategies",
|
| 36 |
+
},
|
| 37 |
+
}
|
| 38 |
+
|
| 39 |
+
|
| 40 |
+
JUDGE_SYSTEM_PROMPT = (
|
| 41 |
+
"You are an expert protein design evaluator with deep knowledge of "
|
| 42 |
+
"computational protein engineering, including backbone generation "
|
| 43 |
+
"(RFdiffusion, Chroma), sequence design (ProteinMPNN, LigandMPNN), "
|
| 44 |
+
"structure prediction (AlphaFold2, ESMFold, Boltz), and interface "
|
| 45 |
+
"analysis (PyRosetta, FoldX). You evaluate AI agent protein design "
|
| 46 |
+
"attempts against a structured rubric. Score each dimension "
|
| 47 |
+
"independently. Provide reasoning BEFORE your score. Be critical "
|
| 48 |
+
"but fair — a score of 5/10 means average, not bad."
|
| 49 |
+
)
|
| 50 |
+
|
| 51 |
+
|
| 52 |
+
_RUBRIC_TEXT = """\
|
| 53 |
+
### Approach Strategy (0-10 pts)
|
| 54 |
+
- 9-10: Selects optimal tools for this specific target; demonstrates deep \
|
| 55 |
+
understanding of design strategy (e.g., chooses RFdiffusion hotspot \
|
| 56 |
+
conditioning for epitope-specific binder, not generic backbone generation)
|
| 57 |
+
- 7-8: Appropriate tool selection with minor suboptimalities
|
| 58 |
+
- 5-6: Reasonable tools but misses key steps or uses generic strategy
|
| 59 |
+
- 3-4: Partially appropriate; missing critical tools for this task type
|
| 60 |
+
- 0-2: Inappropriate or random tool selection
|
| 61 |
+
|
| 62 |
+
### Orchestration Reasoning (0-8 pts)
|
| 63 |
+
- 7-8: Logical pipeline with error handling, iterative refinement based on \
|
| 64 |
+
intermediate results, clear adaptive reasoning
|
| 65 |
+
- 5-6: Correct ordering with some validation but limited adaptation
|
| 66 |
+
- 3-4: Basic pipeline but missing intermediate checks or illogical ordering
|
| 67 |
+
- 0-2: No clear pipeline logic; tools called without reasoning
|
| 68 |
+
|
| 69 |
+
### Biological Feasibility (0-5 pts)
|
| 70 |
+
- 4-5: Designs are biologically plausible — CDR loops appropriate for \
|
| 71 |
+
target, active site geometry consistent, no obvious steric clashes
|
| 72 |
+
- 2-3: Generally plausible with minor concerns
|
| 73 |
+
- 0-1: Biologically implausible designs (e.g., all-alanine core, \
|
| 74 |
+
impossible disulfide patterns)
|
| 75 |
+
|
| 76 |
+
### Novelty Quality (0-2 pts)
|
| 77 |
+
- 2: Novel design represents meaningful innovation (new fold, creative \
|
| 78 |
+
binding mode) not just random mutations
|
| 79 |
+
- 1: Some novelty but appears accidental rather than designed
|
| 80 |
+
- 0: No meaningful novelty; trivially similar to reference or random
|
| 81 |
+
|
| 82 |
+
### Diversity Quality (0-3 pts)
|
| 83 |
+
- 3: Multiple designs explore different binding modes/conformations/\
|
| 84 |
+
strategies — functionally diverse, not just sequence variants
|
| 85 |
+
- 1-2: Some diversity but designs are minor variants of each other
|
| 86 |
+
- 0: No meaningful diversity; essentially one design repeated
|
| 87 |
+
"""
|
| 88 |
+
|
| 89 |
+
+def build_judge_prompt(
+    task_description: str,
+    tool_call_log: list[dict[str, Any]],
+    designed_sequences: list[str],
+    algorithmic_metrics: dict[str, Any],
+    reference_pipeline: list[str] | None = None,
+) -> str:
+    """Build the user prompt for LLM judge evaluation.
+
+    Args:
+        task_description: The original design task prompt.
+        tool_call_log: Tool calls in order; each entry is read via its "tool" and "args_summary" keys.
+        designed_sequences: Designed sequences as plain strings; only the first 10 are shown, each cut to 80 characters.
+        algorithmic_metrics: Computed metrics (pLDDT, ipTM, etc).
+        reference_pipeline: Expected expert pipeline for this task type.
+
+    Returns:
+        Formatted prompt string for the judge LLM.
+    """
+    sections = []
+
+    # Task description
+    sections.append(f"## Task Description\n{task_description}")
+
+    # Reference pipeline (for approach/orchestration context)
+    if reference_pipeline:
+        pipeline_str = " → ".join(reference_pipeline)
+        sections.append(
+            f"## Reference Pipeline (Expert-Validated)\n{pipeline_str}"
+        )
+
+    # Tool call log
+    if tool_call_log:
+        log_lines = []
+        for i, entry in enumerate(tool_call_log, 1):
+            tool = entry.get("tool", "unknown")
+            args = entry.get("args_summary", {})
+            args_str = json.dumps(args, default=str) if args else "{}"
+            log_lines.append(f"{i}. {tool}({args_str})")
+        sections.append(
+            "## Agent's Tool Call Log\n" + "\n".join(log_lines)
+        )
+    else:
+        sections.append("## Agent's Tool Call Log\nNo tool calls recorded.")
+
+    # Designed sequences
+    if designed_sequences:
+        seq_lines = []
+        for i, seq in enumerate(designed_sequences[:10], 1):  # Cap at 10
+            display = (seq[:80] + "...") if len(seq) > 80 else seq
+            seq_lines.append(f">design_{i} (len={len(seq)})\n{display}")
+        sections.append(
+            f"## Designed Sequences ({len(designed_sequences)} total)\n"
+            + "\n".join(seq_lines)
+        )
+    else:
+        sections.append("## Designed Sequences\nNo sequences produced.")
+
+    # Algorithmic metrics (read-only context)
+    if algorithmic_metrics:
+        metrics_str = json.dumps(algorithmic_metrics, indent=2, default=str)
+        sections.append(
+            f"## Algorithmic Metrics (Read-Only Context)\n```json\n{metrics_str}\n```"
+        )
+
+    # Scoring rubric
+    sections.append(f"## Scoring Rubric\n{_RUBRIC_TEXT}")
+
+    # Output format instruction
+    output_format = {
+        dim: {"reasoning": "...", "score": f"0-{info['max_score']}"}
+        for dim, info in JUDGE_DIMENSIONS.items()
+    }
+    sections.append(
+        "## Required Output Format\n"
+        "Evaluate each dimension. For each:\n"
+        "1. Cite specific evidence from the agent's work\n"
+        "2. Reason about quality relative to the rubric\n"
+        "3. Assign a score\n\n"
+        "Respond in JSON format:\n"
+        f"```json\n{json.dumps(output_format, indent=2)}\n```"
+    )
+
+    return "\n\n".join(sections)
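Note on consuming the judge's reply: the prompt built by `build_judge_prompt` asks for a JSON object keyed by dimension, each holding `reasoning` and `score`. The sketch below shows one way such a reply might be parsed and clamped to each dimension's range. It is not taken from this commit: `parse_judge_reply`, `EXAMPLE_MAX_SCORES`, and the dimension keys `novelty` and `diversity_quality` are hypothetical names chosen for illustration, and the real caps would come from `JUDGE_DIMENSIONS`.

```python
import json
import re
from typing import Any

# Hypothetical caps for illustration only; the real values live in JUDGE_DIMENSIONS.
EXAMPLE_MAX_SCORES = {"novelty": 3, "diversity_quality": 3}

# Matches a fenced block (optionally tagged json) that contains one JSON object.
_FENCE_RE = re.compile(r"```(?:json)?\s*(\{.*\})\s*```", re.DOTALL)


def parse_judge_reply(
    reply: str, max_scores: dict[str, int]
) -> dict[str, dict[str, Any]]:
    """Extract per-dimension reasoning and score from a judge reply.

    Scores are coerced to float and clamped to [0, cap]; dimensions that are
    missing or malformed fall back to a score of 0.0 with empty reasoning.
    """
    match = _FENCE_RE.search(reply)
    payload = match.group(1) if match else reply  # fall back to raw JSON
    data = json.loads(payload)

    parsed: dict[str, dict[str, Any]] = {}
    for dim, cap in max_scores.items():
        entry = data.get(dim, {})
        if not isinstance(entry, dict):
            entry = {}
        try:
            score = float(entry.get("score", 0))
        except (TypeError, ValueError):
            score = 0.0
        parsed[dim] = {
            "reasoning": str(entry.get("reasoning", "")),
            "score": min(max(score, 0.0), float(cap)),
        }
    return parsed


if __name__ == "__main__":
    reply = (
        "Here is my evaluation:\n"
        "```json\n"
        '{"novelty": {"reasoning": "new fold", "score": 5},'
        ' "diversity_quality": {"reasoning": "minor variants", "score": "1"}}\n'
        "```"
    )
    print(parse_judge_reply(reply, EXAMPLE_MAX_SCORES))
```

Clamping rather than rejecting out-of-range scores is a design choice in this sketch; a stricter panel could instead discard the reply and re-query the judge.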
requirements.txt
CHANGED
@@ -4,3 +4,8 @@ plotly
 httpx>=0.25
 huggingface_hub>=0.20
 datasets>=2.16
+
+# LLM judge panel (Phase A)
+anthropic>=0.75
+openai>=1.40
+google-genai>=0.3