Space: RomeroLab-Duke/BioDesignBench-Leaderboard

Jasonkim8652 committed on Apr 15

Commit 8e08ed6 (verified) · 1 parent: af5defe

Phase A: integrate LLM judge panel for hybrid scoring


- Port biodesignbench/eval/llm_judge package as leaderboard/llm_judge
- New eval_judge.py orchestrates per-task panel runs with self-exclusion
- aggregate_scores: prefer hybrid_total/hybrid_scores when present
- Admin pipeline: insert 'Phase C: Run LLM Judge' button between Boltz and Finalize
- requirements: add anthropic, openai, google-genai
- Boltz handler now persists boltz-augmented per-task results
- Sync README.md taxonomy stats (9 cells, 2x5 matrix, hybrid 72+28)

Files changed (11)

README.md +19 -78
app.py +63 -3
eval_judge.py +148 -0
eval_scorer.py +21 -14
llm_judge/__init__.py +50 -0
llm_judge/aggregation.py +200 -0
llm_judge/judge.py +217 -0
llm_judge/panel.py +162 -0
llm_judge/plan_eval.py +141 -0
llm_judge/rubrics.py +173 -0
requirements.txt +5 -0

README.md CHANGED

@@ -12,88 +12,29 @@ license: mit
 # BioDesignBench Leaderboard
-Interactive leaderboard for **BioDesignBench**, a benchmark evaluating LLM agents on protein design tasks via MCP (Model Context Protocol) tool use.
 **Romero Lab, Duke University**
-## What the leaderboard shows
-- **Overall Leaderboard** -- Mixed-ranking table with human baselines and LLM agents, filterable by mode (benchmark/user), MCP tool type (reference/custom), and entry type.
-- **Taxonomy Breakdown** -- Heatmap of per-cell scores across 17 taxonomy cells (5 task types x 5 biological contexts) with average-per-type bar chart.
-- **Component Analysis** -- Radar and grouped bar charts comparing the 6 scoring components (Approach, Orchestration, Quality, Feasibility, Novelty, Diversity) between any two agents.
-- **Benchmark vs User Mode** -- Paired comparison showing how the same LLM performs with minimal prompting (benchmark) vs rich guidance (user mode).
-- **Submit** -- Form to submit your own protein design agent for evaluation.
-- **About** -- Methodology, scoring rubric, submission guide, and citation.
-## Run locally
-```bash
-pip install -r requirements.txt
-python app.py
-```
-The app launches a Gradio server at `http://localhost:7860`.
-## HuggingFace Space deployment
-This directory is structured as a self-contained HF Space. To deploy:
-1. Create a new Space on HuggingFace (`sdk: gradio`).
-2. Push the contents of this directory to the Space repo.
-3. Set the `BDB_ADMIN_PASSWORD` secret in the Space settings for admin panel access.
-4. Optionally set `HF_TOKEN` for submission queue access (private dataset).
-The Space will automatically build and serve the leaderboard.
-## How to update results
-Add new entries to `leaderboard_data.json` following the existing schema:
-```json
-{
-  "agent_name": "Your Agent",
-  "agent_id": "your-agent-user",
-  "mode": "user",
-  "mcp_custom": false,
-  "submission_type": "llm",
-  "organization": "Your Org",
-  "overall_score": 42.0,
-  "component_scores": {
-    "approach": 10.0,
-    "orchestration": 8.0,
-    "quality": 14.0,
-    "feasibility": 6.0,
-    "novelty": 2.0,
-    "diversity": 2.0
-  },
-  "taxonomy_scores": {
-    "de_novo_binder": {"ab": 45, "enz": 40, "sig": 43},
-    "sequence_optimization": {"ab": 50, "enz": 42, "sig": 38, "str": 44, "flu": 52},
-    "de_novo_backbone": {"str": 28},
-    "complex_engineering": {"enz": 40, "sig": 44, "str": 46},
-    "conformational_design": {"enz": 38, "sig": 42, "str": 40, "flu": 44}
-  },
-  "tasks_completed": 76,
-  "tasks_total": 76,
-  "tasks_with_zero": 4,
-  "avg_latency_sec": 50.0,
-  "submission_date": "2026-03-15"
-}
-```
-Update the `last_updated` field at the top of the JSON file after adding entries.
-## File overview
-| File | Description |
-|------|-------------|
-| `app.py` | Main Gradio application with 7 tabs |
-| `leaderboard_data.json` | Current benchmark results |
-| `mcp_tool_schemas.json` | 17 reference MCP tool schemas |
-| `eval_scorer.py` | Self-contained 100-point scoring rubric |
-| `eval_queue.py` | Submission queue (HuggingFace Datasets) |
-| `eval_dispatcher.py` | HTTP task dispatcher for benchmarking |
-| `eval_boltz.py` | Boltz structure prediction post-eval |
-| `eval_tasks.py` | Hidden task loader from HF Dataset |
-| `example_server.py` | Reference FastAPI server for submitters |
-| `requirements.txt` | Python dependencies |

 # BioDesignBench Leaderboard
+Evaluating LLM Agents on Protein Design via MCP Tools.
 **Romero Lab, Duke University**
+## Overview
+BioDesignBench evaluates LLM agents as orchestrators of multi-step *stochastic*
+protein-design pipelines. This leaderboard tracks agent performance across
+**76 design tasks** spanning a **2 × 5 design matrix** (de novo design vs.
+redesign × five molecular families: antibody, binder, enzyme, scaffold, and
+fluorescent protein; **9 occupied cells**), scored on a 100-point hybrid rubric:
+**72 algorithmic points** (Boltz-2 verification + sequence/feasibility metrics)
+plus **28 LLM-judge points** (3-judge panel with self-exclusion).
+The six rubric components are Approach, Orchestration, Quality, Feasibility,
+Novelty, and Diversity. See the *About* tab for the full methodology and the
+*Depth Gap* tab for evaluation-depth interventions.
+## Features
+- **Overall Leaderboard** — Mixed-ranking table with human baselines and LLM agents
+- **Taxonomy Heatmap** — Per-cell scores across the 9 occupied cells of the 2 × 5 design matrix
+- **Component Analysis** — Radar and bar charts comparing the 6 scoring components
+- **Guidance Effect** — Paired comparison of the same LLM in unguided (atomic tools) vs guided (composite workflows) mode
+- **Depth Gap** — Forced-depth and low-diversity intervention results
+- **About** — Methodology, submission guide, and citation info
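
The 72 + 28 split quoted in the new README can be checked against the per-component table this commit adds in `llm_judge/aggregation.py`. The sketch below copies the `WEIGHT_SPLIT` values from that file and sums them; it is only a consistency check, not part of the commit.

```python
# Values copied from WEIGHT_SPLIT in llm_judge/aggregation.py (this commit).
WEIGHT_SPLIT = {
    "approach":      {"algo": 10, "llm": 10},
    "orchestration": {"algo": 7,  "llm": 8},
    "quality":       {"algo": 35, "llm": 0},
    "feasibility":   {"algo": 10, "llm": 5},
    "novelty":       {"algo": 3,  "llm": 2},
    "diversity":     {"algo": 7,  "llm": 3},
}

# Sum the algorithmic and LLM-judge portions across all six components.
algo_points = sum(part["algo"] for part in WEIGHT_SPLIT.values())
llm_points = sum(part["llm"] for part in WEIGHT_SPLIT.values())

print(algo_points, llm_points, algo_points + llm_points)  # 72 28 100
```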

app.py CHANGED

@@ -1523,14 +1523,21 @@ def create_app() -> gr.Blocks:
                                 label="Submission ID", scale=2,
                             )
                             boltz_btn = gr.Button(
-                                "Phase B: Run Boltz", scale=1,
                             )
                         with gr.Row():
                             final_id = gr.Textbox(
                                 label="Submission ID", scale=2,
                             )
                             final_btn = gr.Button(
-                                "Phase C: Finalize & Publish", scale=1,
                             )
                         pipeline_out = gr.HTML()
@@ -1681,6 +1688,9 @@ def create_app() -> gr.Blocks:
                                     "No task results to process.</div>"
                                 )
                             run_boltz_posteval(per_task)
                             return (
                                 '<div style="color:#38a169">'
                                 "Boltz post-assessment complete.</div>"
@@ -1688,6 +1698,51 @@ def create_app() -> gr.Blocks:
                         except Exception as e:
                             return f'<div style="color:#e53e3e">{e}</div>'
                     def _run_finalize(sid):
                         try:
                             from eval_queue import (
@@ -1712,10 +1767,12 @@ def create_app() -> gr.Blocks:
                                 component_scores=agg["component_scores"],
                                 taxonomy_scores=agg["taxonomy_scores"],
                             )
                             return (
                                 f'<div style="color:#38a169">'
                                 f'Finalized! Score: '
-                                f'{agg["overall_score"]:.1f}</div>'
                             )
                         except Exception as e:
                             return f'<div style="color:#e53e3e">{e}</div>'
@@ -1726,6 +1783,9 @@ def create_app() -> gr.Blocks:
                     boltz_btn.click(
                         _run_boltz, [boltz_id], pipeline_out,
                     )
                     final_btn.click(
                         _run_finalize, [final_id], pipeline_out,
                     )

                                 label="Submission ID", scale=2,
                             )
                             boltz_btn = gr.Button(
+                                "Phase B: Run Boltz (GPU)", scale=1,
+                            )
+                        with gr.Row():
+                            judge_id = gr.Textbox(
+                                label="Submission ID", scale=2,
+                            )
+                            judge_btn = gr.Button(
+                                "Phase C: Run LLM Judge", scale=1,
                             )
                         with gr.Row():
                             final_id = gr.Textbox(
                                 label="Submission ID", scale=2,
                             )
                             final_btn = gr.Button(
+                                "Phase D: Finalize & Publish", scale=1,
                             )
                         pipeline_out = gr.HTML()
                                     "No task results to process.</div>"
                                 )
                             run_boltz_posteval(per_task)
+                            from eval_queue import save_task_result
+                            for tid, tres in per_task.items():
+                                save_task_result(sid.strip(), tid, tres)
                             return (
                                 '<div style="color:#38a169">'
                                 "Boltz post-assessment complete.</div>"
                         except Exception as e:
                             return f'<div style="color:#e53e3e">{e}</div>'
+                    def _run_judge(sid):
+                        try:
+                            import eval_judge as ej
+                            from eval_queue import (
+                                get_submission, save_task_result, update_status,
+                            )
+                            sub = get_submission(sid.strip())
+                            if sub is None:
+                                return ('<div style="color:#e53e3e">'
+                                        'Not found</div>')
+                            per_task = json.loads(
+                                sub.get("per_task_results", "{}")
+                            )
+                            if not per_task:
+                                return ('<div style="color:#e53e3e">'
+                                        "No task results to process.</div>")
+                            update_status(sid.strip(), "scoring")
+                            ej.run_judge_panel(
+                                per_task,
+                                agent_id=sub.get("agent_name", "unknown"),
+                                dry_run=False,
+                            )
+                            for tid, tres in per_task.items():
+                                save_task_result(sid.strip(), tid, tres)
+                            n_done = sum(
+                                1 for r in per_task.values()
+                                if r.get("hybrid_total") is not None
+                            )
+                            return (
+                                f'<div style="color:#38a169">'
+                                f"LLM judge complete on {n_done} tasks."
+                                "</div>"
+                            )
+                        except Exception as e:
+                            import traceback
+                            return (
+                                f'<div style="color:#e53e3e">'
+                                f'<strong>Judge error:</strong> {e}<br>'
+                                f'<pre style="font-size:0.7rem">'
+                                f'{traceback.format_exc()[:600]}</pre></div>'
+                            )
                     def _run_finalize(sid):
                         try:
                             from eval_queue import (
                                 component_scores=agg["component_scores"],
                                 taxonomy_scores=agg["taxonomy_scores"],
                             )
+                            mode_label = agg.get("scoring_mode", "algo")
                             return (
                                 f'<div style="color:#38a169">'
                                 f'Finalized! Score: '
+                                f'{agg["overall_score"]:.1f} '
+                                f'(scoring={mode_label})</div>'
                             )
                         except Exception as e:
                             return f'<div style="color:#e53e3e">{e}</div>'
                     boltz_btn.click(
                         _run_boltz, [boltz_id], pipeline_out,
                     )
+                    judge_btn.click(
+                        _run_judge, [judge_id], pipeline_out,
+                    )
                     final_btn.click(
                         _run_finalize, [final_id], pipeline_out,
                     )
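
The "LLM judge complete on N tasks" message counts tasks that carry a `hybrid_total`, while `eval_judge.run_judge_panel` only judges tasks that succeeded and produced sequences. A minimal sketch of how those two conditions interact, using hypothetical task results shaped like the dicts read in this diff:

```python
# Hypothetical per-task results; only the keys read by this commit are shown.
per_task = {
    "task_a": {"success": True, "sequences": ["MKVL"], "hybrid_total": 61.5},
    "task_b": {"success": True, "sequences": []},         # no design output
    "task_c": {"success": False, "sequences": ["MKT"]},   # failed run
}

# Same filter as eval_judge.run_judge_panel: successful and non-empty.
eligible = [
    tid for tid, r in per_task.items()
    if r.get("success") and r.get("sequences")
]

# Same count as _run_judge: tasks that ended up with a hybrid_total.
n_done = sum(1 for r in per_task.values() if r.get("hybrid_total") is not None)

print(eligible, n_done)
```

Under these assumptions the reported count can be smaller than the number of tasks in the submission, since skipped tasks never receive a `hybrid_total`.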

eval_judge.py ADDED

	@@ -0,0 +1,148 @@

+"""LLM Judge orchestration for the leaderboard backend.
+Runs the cross-model judge panel on each successfully scored task and
+merges the resulting LLM points into the algorithmic component scores
+to produce hybrid totals (28 LLM points + 72 algorithmic points = 100).
+The judge panel uses 3 judges from different model families with
+self-exclusion (PoLL, Verga et al. 2024). Individual judge calls are
+synchronous; we process tasks sequentially to keep the API spend
+predictable. Provider keys are read from environment variables that
+must be configured as HuggingFace Space secrets:
+    ANTHROPIC_API_KEY
+    OPENAI_API_KEY
+    GOOGLE_API_KEY
+    DEEPSEEK_API_KEY
+"""
+from __future__ import annotations
+import logging
+from typing import Any
+from llm_judge import (
+    LLMJudgePanel,
+    detect_agent_family,
+    merge_algo_and_judge_scores,
+    split_algo_score,
+)
+logger = logging.getLogger(__name__)
+def _build_algo_dict(task_result: dict[str, Any]) -> dict[str, float]:
+    """Pull per-component algo scores from a task result.
+    Prefers 'cpu_scores' (post-Boltz); falls back to 'final_scores' when
+    only that key is present, and to all-zero scores when neither is.
+    """
+    if "cpu_scores" in task_result:
+        return dict(task_result["cpu_scores"])
+    if "final_scores" in task_result:
+        return dict(task_result["final_scores"])
+    return {
+        "approach": 0,
+        "orchestration": 0,
+        "quality": 0,
+        "feasibility": 0,
+        "novelty": 0,
+        "diversity": 0,
+    }
+def run_judge_panel(
+    per_task_results: dict[str, dict[str, Any]],
+    agent_id: str,
+    dry_run: bool = False,
+    progress_callback=None,
+) -> dict[str, dict[str, Any]]:
+    """Run the LLM judge panel over every successful task in a submission.
+    For each task with a non-empty design output:
+      1. Look up the original task prompt (used to give the panel context).
+      2. Build a 3-judge panel that excludes the agent's own model family.
+      3. Run all judges synchronously and aggregate.
+      4. Compute the hybrid component scores by:
+           - splitting each algo score into its algo-portion (split_algo_score)
+           - adding the matching judge LLM-portion (merge_algo_and_judge_scores)
+      5. Store both raw judge results and final hybrid scores on the task.
+    The function modifies per_task_results in place and also returns it.
+    Args:
+        per_task_results: Dict mapping task_id → task result (from the
+            dispatcher + boltz post-eval pipeline).
+        agent_id: Agent identifier for self-exclusion (e.g., 'gpt5-tools').
+        dry_run: If True, judges return midpoint scores without API calls.
+        progress_callback: Optional callable(task_id, i, total).
+    Returns:
+        The same dict, now augmented with 'judge_scores' and 'hybrid_scores'
+        per task and 'hybrid_total' on each successful entry.
+    """
+    from eval_tasks import get_task
+    family = detect_agent_family(agent_id)
+    panel = LLMJudgePanel(agent_model_family=family, dry_run=dry_run)
+    logger.info(
+        f"LLM judge panel for agent '{agent_id}' (family={family}): "
+        f"{len(panel.judges)} judges, dry_run={dry_run}"
+    )
+    eligible = [
+        tid for tid, r in per_task_results.items()
+        if r.get("success") and r.get("sequences")
+    ]
+    total = len(eligible)
+    for i, task_id in enumerate(eligible):
+        result = per_task_results[task_id]
+        # Pull task prompt for judge context. If the dataset is not
+        # reachable (e.g., dev mode without HF_TOKEN) we still run with
+        # a placeholder description rather than aborting the whole run.
+        task_data = get_task(task_id) or {}
+        task_description = task_data.get("prompt_md") or f"BioDesignBench task {task_id}"
+        algo_metrics = result.get("agent_metrics", {})
+        if "boltz_metrics" in result:
+            algo_metrics = {**algo_metrics, **result["boltz_metrics"]}
+        try:
+            judge_result = panel.evaluate_sync(
+                task_description=task_description,
+                tool_call_log=result.get("run_log", []),
+                designed_sequences=result.get("sequences", []),
+                algorithmic_metrics=algo_metrics,
+            )
+        except Exception as e:
+            logger.error(f"Judge panel failed on {task_id}: {e}")
+            judge_result = None
+        # Build algo-portion dict (split each component down to its algo max)
+        algo_full = _build_algo_dict(result)
+        rubric_max = {
+            "approach": 20, "orchestration": 15, "quality": 35,
+            "feasibility": 15, "novelty": 5, "diversity": 10,
+        }
+        algo_split = {
+            comp: split_algo_score(comp, score, rubric_max[comp])
+            for comp, score in algo_full.items()
+        }
+        hybrid = merge_algo_and_judge_scores(algo_split, judge_result)
+        hybrid_total = sum(hybrid.values())
+        result["judge_scores"] = judge_result
+        result["hybrid_scores"] = hybrid
+        result["hybrid_total"] = round(hybrid_total, 2)
+        if progress_callback:
+            progress_callback(task_id, i + 1, total)
+        logger.info(
+            f"[{i+1}/{total}] {task_id}: hybrid={hybrid_total:.1f}"
+        )
+    return per_task_results
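
Steps 4 and 5 of `run_judge_panel` can be traced by hand for a single task. The sketch below is a simplified stand-in for `split_algo_score` and `merge_algo_and_judge_scores` (the helper names `split_algo` and `merge` and all input scores are hypothetical); the weights and dimension mapping are copied from `llm_judge/aggregation.py`.

```python
# Weights and mapping copied from llm_judge/aggregation.py (this commit).
WEIGHT_SPLIT = {
    "approach": {"algo": 10, "llm": 10}, "orchestration": {"algo": 7, "llm": 8},
    "quality": {"algo": 35, "llm": 0}, "feasibility": {"algo": 10, "llm": 5},
    "novelty": {"algo": 3, "llm": 2}, "diversity": {"algo": 7, "llm": 3},
}
RUBRIC_MAX = {c: p["algo"] + p["llm"] for c, p in WEIGHT_SPLIT.items()}
DIM_TO_COMPONENT = {
    "approach_strategy": "approach", "orchestration_reasoning": "orchestration",
    "bio_feasibility": "feasibility", "novelty_quality": "novelty",
    "diversity_quality": "diversity",
}

def split_algo(component, score):
    """Scale a full-rubric algo score down to its algo-only portion."""
    part = WEIGHT_SPLIT[component]
    if part["llm"] == 0:          # quality keeps all 35 points
        return score
    return round(score / RUBRIC_MAX[component] * part["algo"], 2)

def merge(algo_split, judge):
    """Add judge points per component, capped at the rubric max."""
    merged = dict(algo_split)
    if judge is None:             # panel failed: algo portion only
        return merged
    for dim, comp in DIM_TO_COMPONENT.items():
        if dim in judge and comp in merged:
            merged[comp] = min(merged[comp] + judge[dim]["score"], RUBRIC_MAX[comp])
    return merged

# Hypothetical inputs: algo scores on the full rubric, aggregated judge scores.
algo_full = {"approach": 16, "orchestration": 12, "quality": 28,
             "feasibility": 9, "novelty": 2, "diversity": 5}
judge = {"approach_strategy": {"score": 7}, "orchestration_reasoning": {"score": 6},
         "bio_feasibility": {"score": 4}, "novelty_quality": {"score": 1},
         "diversity_quality": {"score": 2}}

algo_split = {c: split_algo(c, s) for c, s in algo_full.items()}
hybrid_total = round(sum(merge(algo_split, judge).values()), 2)
algo_only_total = round(sum(merge(algo_split, None).values()), 2)
print(hybrid_total, algo_only_total)
```

One consequence worth noting: when the panel raises and `judge_result` is `None`, the task still receives `hybrid_scores` and `hybrid_total`, but those contain only the scaled algorithmic portion, so the task cannot exceed 72 points.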

eval_scorer.py CHANGED

@@ -1368,15 +1368,14 @@ def score_diversity(
         return {"score": 0, "max": max_points, "num_designs": 0, "pairwise_diversity": 0.0, "entropy": 0.0}
     num = len(designs)
     diversity = mean_pairwise_diversity(designs)
     entropy = sequence_entropy(designs)
-    # Score based purely on sequence diversity (not design count).
-    # Tasks don't specify how many designs to produce, so counting
-    # would unfairly penalise agents that submit fewer designs.
-    diversity_score = diversity * max_points * 0.65
-    entropy_score = entropy * max_points * 0.35
-    total = int(round(diversity_score + entropy_score))
     return {
         "score": min(total, max_points), "max": max_points,
@@ -1584,12 +1583,11 @@ def aggregate_scores(
 ) -> dict[str, Any]:
     """Aggregate per-task scores into an overall submission result.
-    Args:
-        per_task_scores: Dict mapping task_id → score_submission_task() result.
-    Returns:
-        Dict with: overall_score, component_scores (averaged), taxonomy_scores,
-        tasks_completed, tasks_with_zero.
     """
     if not per_task_scores:
         return {
@@ -1604,16 +1602,24 @@ def aggregate_scores(
     totals = {c: 0.0 for c in DEFAULT_DESIGN_RUBRIC}
     n = len(per_task_scores)
     tasks_with_zero = 0
     # Taxonomy breakdown
     taxonomy_scores: dict[str, dict[str, list[float]]] = {}
     for task_id, result in per_task_scores.items():
-        total_score = result["total_score"]
         if total_score == 0:
             tasks_with_zero += 1
-        for comp, val in result["component_scores"].items():
             totals[comp] += val
         # Taxonomy mapping
@@ -1641,4 +1647,5 @@ def aggregate_scores(
         "tasks_completed": n,
         "tasks_total": n,
         "tasks_with_zero": tasks_with_zero,
     }

         return {"score": 0, "max": max_points, "num_designs": 0, "pairwise_diversity": 0.0, "entropy": 0.0}
     num = len(designs)
+    count_fraction = min(num / max_designs, 1.0) if max_designs > 0 else 1.0
     diversity = mean_pairwise_diversity(designs)
     entropy = sequence_entropy(designs)
+    count_score = count_fraction * max_points * 0.4
+    diversity_score = diversity * max_points * 0.4
+    entropy_score = entropy * max_points * 0.2
+    total = int(round(count_score + diversity_score + entropy_score))
     return {
         "score": min(total, max_points), "max": max_points,
 ) -> dict[str, Any]:
     """Aggregate per-task scores into an overall submission result.
+    If `eval_judge.run_judge_panel()` has been run beforehand each task
+    will carry `hybrid_scores` and `hybrid_total`; in that case we use
+    the hybrid (algo + LLM judge, capped at rubric max) as the canonical
+    score. Otherwise we fall back to the algo-only `component_scores` /
+    `total_score` produced by the dispatcher + Boltz pipeline.
     """
     if not per_task_scores:
         return {
     totals = {c: 0.0 for c in DEFAULT_DESIGN_RUBRIC}
     n = len(per_task_scores)
     tasks_with_zero = 0
+    used_hybrid = False
     # Taxonomy breakdown
     taxonomy_scores: dict[str, dict[str, list[float]]] = {}
     for task_id, result in per_task_scores.items():
+        if "hybrid_scores" in result and "hybrid_total" in result:
+            comp_scores = result["hybrid_scores"]
+            total_score = result["hybrid_total"]
+            used_hybrid = True
+        else:
+            comp_scores = result.get("component_scores", {})
+            total_score = result.get("total_score", 0.0)
         if total_score == 0:
             tasks_with_zero += 1
+        for comp, val in comp_scores.items():
             totals[comp] += val
         # Taxonomy mapping
         "tasks_completed": n,
         "tasks_total": n,
         "tasks_with_zero": tasks_with_zero,
+        "scoring_mode": "hybrid" if used_hybrid else "algo",
     }
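
The hybrid-versus-algo selection added to `aggregate_scores` can be sketched in isolation. `pick_scores` is a hypothetical helper that mirrors the new branch; the task results are made up and carry only two components for brevity.

```python
def pick_scores(result):
    """Return (component_scores, total, is_hybrid) for one task result."""
    if "hybrid_scores" in result and "hybrid_total" in result:
        return result["hybrid_scores"], result["hybrid_total"], True
    return result.get("component_scores", {}), result.get("total_score", 0.0), False

# Hypothetical submission: one judged task, one that never reached the judge.
per_task = {
    "judged": {
        "hybrid_scores": {"approach": 15.0, "quality": 28.0},
        "hybrid_total": 43.0,
        "component_scores": {"approach": 16.0, "quality": 28.0},
        "total_score": 44.0,
    },
    "unjudged": {
        "component_scores": {"approach": 12.0, "quality": 20.0},
        "total_score": 32.0,
    },
}

totals = {"approach": 0.0, "quality": 0.0}
used_hybrid = False
for result in per_task.values():
    comps, total, is_hybrid = pick_scores(result)
    used_hybrid = used_hybrid or is_hybrid
    for comp, val in comps.items():
        totals[comp] += val

scoring_mode = "hybrid" if used_hybrid else "algo"
print(totals, scoring_mode)
```

As written in the diff, a single judged task is enough to label the whole submission `hybrid`, even if other tasks fell back to algo-only scores.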

llm_judge/__init__.py ADDED

	@@ -0,0 +1,50 @@

+"""LLM-as-a-Judge scoring for BioDesignBench Tier 2 evaluation.
+Provides cross-model LLM judge panels that evaluate subjective dimensions
+(approach, orchestration, feasibility, novelty, diversity) while quality
+metrics remain 100% algorithmic.
+Usage:
+    from llm_judge import LLMJudgePanel
+    panel = LLMJudgePanel(agent_model_family="anthropic", dry_run=True)
+    result = panel.evaluate_sync(
+        task_description="Design a binder for IL-6R",
+        tool_call_log=[...],
+        designed_sequences=["MKVL..."],
+        algorithmic_metrics={"pLDDT": 82.5},
+    )
+"""
+from llm_judge.aggregation import (
+    WEIGHT_SPLIT,
+    aggregate_judge_scores,
+    merge_algo_and_judge_scores,
+    split_algo_score,
+)
+from llm_judge.judge import LLMJudge, parse_judge_response
+from llm_judge.panel import (
+    LLMJudgePanel,
+    detect_agent_family,
+    get_judge_models,
+)
+from llm_judge.rubrics import (
+    JUDGE_DIMENSIONS,
+    JUDGE_SYSTEM_PROMPT,
+    build_judge_prompt,
+)
+__all__ = [
+    "LLMJudge",
+    "LLMJudgePanel",
+    "JUDGE_DIMENSIONS",
+    "JUDGE_SYSTEM_PROMPT",
+    "WEIGHT_SPLIT",
+    "aggregate_judge_scores",
+    "build_judge_prompt",
+    "detect_agent_family",
+    "get_judge_models",
+    "merge_algo_and_judge_scores",
+    "parse_judge_response",
+    "split_algo_score",
+]

llm_judge/aggregation.py ADDED

	@@ -0,0 +1,200 @@

+"""Score aggregation and merging for LLM judge panel.
+Implements:
+- Weighted averaging with outlier downweighting
+- Algo + LLM score merging with rubric cap enforcement
+- Weight split configuration (72/28 algo-LLM)
+"""
+from __future__ import annotations
+import statistics
+from typing import Any
+from llm_judge.rubrics import JUDGE_DIMENSIONS
+# ---------------------------------------------------------------------------
+# Weight split: algo + LLM portions per component (must sum to rubric max)
+# ---------------------------------------------------------------------------
+WEIGHT_SPLIT: dict[str, dict[str, int]] = {
+    "approach":      {"algo": 10, "llm": 10},   # 20 total
+    "orchestration": {"algo": 7,  "llm": 8},    # 15 total
+    "quality":       {"algo": 35, "llm": 0},    # 35 total (no LLM)
+    "feasibility":   {"algo": 10, "llm": 5},    # 15 total
+    "novelty":       {"algo": 3,  "llm": 2},    # 5 total
+    "diversity":     {"algo": 7,  "llm": 3},    # 10 total
+}
+# Mapping from LLM judge dimension → rubric component
+_JUDGE_DIM_TO_COMPONENT: dict[str, str] = {
+    "approach_strategy": "approach",
+    "orchestration_reasoning": "orchestration",
+    "bio_feasibility": "feasibility",
+    "novelty_quality": "novelty",
+    "diversity_quality": "diversity",
+}
+# Rubric max per component
+_RUBRIC_MAX: dict[str, int] = {
+    "approach": 20,
+    "orchestration": 15,
+    "quality": 35,
+    "feasibility": 15,
+    "novelty": 5,
+    "diversity": 10,
+}
+def aggregate_judge_scores(
+    judge_results: list[dict[str, dict[str, Any]]],
+) -> dict[str, dict[str, Any]]:
+    """Aggregate scores from multiple judges with outlier downweighting.
+    For each dimension:
+    1. Collect raw scores from all judges
+    2. Compute median
+    3. Downweight outliers (>2 points from median) by 0.5x
+    4. Compute weighted average
+    Args:
+        judge_results: List of per-judge result dicts.
+            Each maps dimension_name → {reasoning, score}.
+    Returns:
+        Aggregated dict mapping dimension_name → {score, reasoning, raw_scores}.
+    Raises:
+        ValueError: If judge_results is empty.
+    """
+    if not judge_results:
+        raise ValueError("aggregate_judge_scores requires at least one judge result")
+    if len(judge_results) == 1:
+        # Single judge: pass through directly
+        result = {}
+        for dim in JUDGE_DIMENSIONS:
+            entry = judge_results[0].get(dim, {"score": 0, "reasoning": ""})
+            result[dim] = {
+                "score": float(entry["score"]),
+                "reasoning": entry["reasoning"],
+                "raw_scores": [entry["score"]],
+            }
+        return result
+    aggregated = {}
+    for dim, info in JUDGE_DIMENSIONS.items():
+        raw_scores = []
+        reasonings = []
+        for jr in judge_results:
+            entry = jr.get(dim, {"score": info["max_score"] // 2, "reasoning": ""})
+            raw_scores.append(float(entry["score"]))
+            reasonings.append(entry.get("reasoning", ""))
+        # Outlier detection: downweight scores >2 points from median
+        med = statistics.median(raw_scores)
+        weights = []
+        for s in raw_scores:
+            if abs(s - med) > 2.0:
+                weights.append(0.5)
+            else:
+                weights.append(1.0)
+        # Weighted average
+        weighted_sum = sum(s * w for s, w in zip(raw_scores, weights))
+        weight_total = sum(weights)
+        avg = weighted_sum / weight_total if weight_total > 0 else 0
+        # Clamp to valid range
+        avg = max(0, min(avg, info["max_score"]))
+        aggregated[dim] = {
+            "score": round(avg, 1),
+            "reasoning": " | ".join(
+                f"[Judge {i+1}] {r}" for i, r in enumerate(reasonings) if r
+            ),
+            "raw_scores": raw_scores,
+        }
+    return aggregated
+def split_algo_score(
+    component: str,
+    original_score: float,
+    original_max: int,
+) -> float:
+    """Scale an algorithmic score to its algo-only portion.
+    For the hybrid system, algorithmic scores are computed against the
+    original rubric max (e.g., approach out of 20), then scaled down
+    to the algo-only portion (e.g., 10 out of 20).
+    Quality is special: it keeps its full 35 points (no LLM portion).
+    Args:
+        component: Rubric component name.
+        original_score: Score computed against original max.
+        original_max: Original rubric max for this component.
+    Returns:
+        Scaled score for the algo-only portion.
+    """
+    split = WEIGHT_SPLIT.get(component)
+    if split is None:
+        return original_score
+    algo_max = split["algo"]
+    if split["llm"] == 0:
+        # No LLM portion — return original score unchanged
+        return original_score
+    # Scale: (original_score / original_max) * algo_max
+    if original_max == 0:
+        return 0.0
+    ratio = original_score / original_max
+    return round(ratio * algo_max, 2)
+def merge_algo_and_judge_scores(
+    algo_scores: dict[str, float | int],
+    judge_scores: dict[str, dict[str, Any]] | None,
+) -> dict[str, float]:
+    """Merge algorithmic and LLM judge scores into final component scores.
+    Args:
+        algo_scores: Dict mapping component → algo-portion score.
+            These should already be split via split_algo_score().
+        judge_scores: Aggregated judge scores (from aggregate_judge_scores).
+            None if LLM judge is disabled.
+    Returns:
+        Dict mapping component → final merged score (capped at rubric max).
+    """
+    if judge_scores is None:
+        return dict(algo_scores)
+    merged = {}
+    for component, algo_score in algo_scores.items():
+        rubric_max = _RUBRIC_MAX.get(component, 100)
+        # Find matching judge dimension
+        judge_dim = None
+        for jd, comp in _JUDGE_DIM_TO_COMPONENT.items():
+            if comp == component:
+                judge_dim = jd
+                break
+        if judge_dim and judge_dim in judge_scores:
+            llm_score = judge_scores[judge_dim].get("score", 0)
+            if isinstance(llm_score, dict):
+                llm_score = llm_score.get("score", 0)
+            total = algo_score + llm_score
+        else:
+            total = algo_score
+        merged[component] = min(total, rubric_max)
+    return merged
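
The outlier downweighting in `aggregate_judge_scores` is easiest to see on one dimension. The sketch below reproduces the median, weight, and clamp steps for a single list of raw scores; `aggregate_dimension` is a hypothetical helper and the scores and `max_score` are illustrative.

```python
import statistics

def aggregate_dimension(raw_scores, max_score):
    """Weighted mean where scores more than 2 points from the median count half."""
    med = statistics.median(raw_scores)
    weights = [0.5 if abs(s - med) > 2.0 else 1.0 for s in raw_scores]
    avg = sum(s * w for s, w in zip(raw_scores, weights)) / sum(weights)
    return round(max(0, min(avg, max_score)), 1)

# Three judges; the third disagrees strongly with the other two.
raw = [8.0, 7.0, 3.0]
downweighted = aggregate_dimension(raw, max_score=10)
plain_mean = round(sum(raw) / len(raw), 1)

print(downweighted, plain_mean)  # 6.6 6.0
```

Here the median is 7, so the score of 3 gets weight 0.5 and the result is (8 + 7 + 1.5) / 2.5 = 6.6 rather than the plain mean of 6.0.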

llm_judge/judge.py ADDED

	@@ -0,0 +1,217 @@

+"""Single LLM judge: wraps one API call to evaluate a design attempt.
+Supports Anthropic, OpenAI, Google, and DeepSeek providers.
+In dry_run mode, returns deterministic midpoint scores without API calls.
+"""
+from __future__ import annotations
+import json
+import re
+from typing import Any
+from llm_judge.rubrics import (
+    JUDGE_DIMENSIONS,
+    JUDGE_SYSTEM_PROMPT,
+    build_judge_prompt,
+)
+def _midpoint_scores() -> dict[str, dict[str, Any]]:
+    """Return deterministic midpoint scores for dry-run mode."""
+    result = {}
+    for dim, info in JUDGE_DIMENSIONS.items():
+        mid = info["max_score"] // 2
+        # Floor division: for an odd max (5, 3) the midpoint rounds down
+        # (5 -> 2, 3 -> 1), i.e. slightly below 50%.
+        result[dim] = {
+            "reasoning": f"[Dry run] Midpoint score for {dim}.",
+            "score": mid,
+        }
+    return result
+def parse_judge_response(raw_text: str) -> dict[str, dict[str, Any]]:
+    """Parse LLM judge response into structured scores.
+    Handles:
+    - Direct JSON response
+    - JSON inside markdown code blocks
+    - Out-of-range score clamping
+    - Invalid JSON fallback to midpoint scores
+    Args:
+        raw_text: Raw LLM response text.
+    Returns:
+        Dict mapping dimension names to {reasoning, score}.
+    """
+    # Try to extract JSON from markdown code block
+    json_match = re.search(r"```(?:json)?\s*\n?(.*?)\n?\s*```", raw_text, re.DOTALL)
+    json_str = json_match.group(1) if json_match else raw_text
+    try:
+        data = json.loads(json_str)
+    except json.JSONDecodeError:
+        # Try finding any JSON object in the text
+        brace_match = re.search(r"\{.*\}", raw_text, re.DOTALL)
+        if brace_match:
+            try:
+                data = json.loads(brace_match.group())
+            except json.JSONDecodeError:
+                return _midpoint_scores()
+        else:
+            return _midpoint_scores()
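+    # Valid but non-object JSON (list, number, string) cannot be scored
+    if not isinstance(data, dict):
+        return _midpoint_scores()
+    # Validate and clamp scores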
+    result = {}
+    for dim, info in JUDGE_DIMENSIONS.items():
+        if dim in data and isinstance(data[dim], dict):
+            score = data[dim].get("score", info["max_score"] // 2)
+            if isinstance(score, (int, float)):
+                score = max(0, min(score, info["max_score"]))
+            else:
+                score = info["max_score"] // 2
+            reasoning = data[dim].get("reasoning", "")
+            result[dim] = {"reasoning": str(reasoning), "score": score}
+        else:
+            # Missing dimension — use midpoint
+            result[dim] = {
+                "reasoning": f"[Fallback] Dimension {dim} missing from judge response.",
+                "score": info["max_score"] // 2,
+            }
+    return result
+class LLMJudge:
+    """Single LLM judge that evaluates a protein design attempt.
+    Args:
+        provider: API provider ('anthropic', 'openai', 'google', 'deepseek').
+        model: Model identifier string.
+        dry_run: If True, return deterministic scores without API calls.
+        api_key: Optional API key override.
+    """
+    def __init__(
+        self,
+        provider: str,
+        model: str,
+        dry_run: bool = False,
+        api_key: str | None = None,
+    ):
+        self.provider = provider
+        self.model = model
+        self.dry_run = dry_run
+        self.api_key = api_key
+        self.api_calls = 0
+        self._client = None
+    def _get_client(self):
+        """Lazy-initialize the API client."""
+        if self._client is not None:
+            return self._client
+        import os
+        if self.provider == "anthropic":
+            import anthropic
+            key = self.api_key or os.environ.get("ANTHROPIC_API_KEY")
+            self._client = anthropic.Anthropic(api_key=key)
+        elif self.provider == "openai":
+            from openai import OpenAI
+            key = self.api_key or os.environ.get("OPENAI_API_KEY")
+            self._client = OpenAI(api_key=key)
+        elif self.provider == "google":
+            from google import genai
+            key = self.api_key or os.environ.get("GOOGLE_API_KEY")
+            self._client = genai.Client(api_key=key)
+        elif self.provider == "deepseek":
+            from openai import OpenAI
+            key = self.api_key or os.environ.get("DEEPSEEK_API_KEY")
+            self._client = OpenAI(
+                api_key=key, base_url="https://api.deepseek.com"
+            )
+        else:
+            raise ValueError(f"Unknown provider: {self.provider}")
+        return self._client
+    def _call_api(self, system: str, user: str) -> str:
+        """Make a single API call and return raw text response."""
+        client = self._get_client()
+        self.api_calls += 1
+        if self.provider == "anthropic":
+            response = client.messages.create(
+                model=self.model,
+                max_tokens=4096,
+                system=system,
+                messages=[{"role": "user", "content": user}],
+            )
+            return response.content[0].text
+        elif self.provider in ("openai", "deepseek"):
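+            # GPT-5 and o-series (o3, o4) models take max_completion_tokens;
+            # other chat models, including DeepSeek, take max_tokens.
+            uses_completion_tokens = any(
+                tag in self.model for tag in ("gpt-5", "o3", "o4")
+            )
+            token_param = (
+                "max_completion_tokens" if uses_completion_tokens else "max_tokens"
+            )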
+            response = client.chat.completions.create(
+                model=self.model,
+                **{token_param: 4096},
+                messages=[
+                    {"role": "system", "content": system},
+                    {"role": "user", "content": user},
+                ],
+            )
+            return response.choices[0].message.content
+        elif self.provider == "google":
+            response = client.models.generate_content(
+                model=self.model,
+                contents=f"{system}\n\n{user}",
+            )
+            return response.text
+        raise ValueError(f"Unsupported provider: {self.provider}")
+    def evaluate_sync(
+        self,
+        task_description: str,
+        tool_call_log: list[dict[str, Any]],
+        designed_sequences: list[str],
+        algorithmic_metrics: dict[str, Any],
+        reference_pipeline: list[str] | None = None,
+    ) -> dict[str, dict[str, Any]]:
+        """Evaluate a design attempt synchronously.
+        Args:
+            task_description: Original task prompt.
+            tool_call_log: Agent's tool call sequence.
+            designed_sequences: Designed protein sequences.
+            algorithmic_metrics: Computed biophysical metrics.
+            reference_pipeline: Expected expert pipeline.
+        Returns:
+            Dict mapping dimension names to {reasoning, score}.
+        """
+        if self.dry_run:
+            return _midpoint_scores()
+        prompt = build_judge_prompt(
+            task_description=task_description,
+            tool_call_log=tool_call_log,
+            designed_sequences=designed_sequences,
+            algorithmic_metrics=algorithmic_metrics,
+            reference_pipeline=reference_pipeline,
+        )
+        raw_response = self._call_api(JUDGE_SYSTEM_PROMPT, prompt)
+        return parse_judge_response(raw_response)
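The parsing rules in `parse_judge_response` (fenced-JSON extraction, clamping, midpoint fallback) can be sketched on a toy input. This is a simplified stand-in, not the module itself: `TOY_DIMENSIONS` and `parse_reply` are hypothetical names, only two of the five dimensions are included, and the bare-brace retry is omitted.

```python
import json
import re

# Two of the five maxima from llm_judge/rubrics.py, kept small for illustration.
TOY_DIMENSIONS = {"approach_strategy": 10, "novelty_quality": 2}


def parse_reply(raw_text: str) -> dict[str, float]:
    """Hypothetical stand-in for parse_judge_response, returning scores only."""
    # Prefer JSON inside a markdown code fence; otherwise try the whole reply.
    fenced = re.search(r"```(?:json)?\s*\n?(.*?)\n?\s*```", raw_text, re.DOTALL)
    candidate = fenced.group(1) if fenced else raw_text
    try:
        data = json.loads(candidate)
    except json.JSONDecodeError:
        data = {}  # simplified: the real parser also retries on a bare {...} span
    if not isinstance(data, dict):
        data = {}
    scores = {}
    for dim, max_score in TOY_DIMENSIONS.items():
        entry = data.get(dim)
        if isinstance(entry, dict) and isinstance(entry.get("score"), (int, float)):
            scores[dim] = max(0, min(entry["score"], max_score))  # clamp to range
        else:
            scores[dim] = max_score // 2  # midpoint fallback
    return scores


reply = (
    "Evaluation:\n```json\n"
    '{"approach_strategy": {"reasoning": "ok", "score": 14}}\n'
    "```"
)
print(parse_reply(reply))  # {'approach_strategy': 10, 'novelty_quality': 1}
```

The out-of-range 14 is clamped to the dimension maximum, and the missing dimension falls back to its midpoint, so one malformed reply never aborts a panel run.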

llm_judge/panel.py ADDED Viewed

	@@ -0,0 +1,162 @@

+"""LLM Judge Panel: manages cross-model evaluation with self-exclusion.
+Following PoLL (Verga et al., 2024): judges come from different model
+families, and the family that generated the design is excluded (3 judges).
+Human baselines and agents of unrecognized family get all 4 judges.
+"""
+from __future__ import annotations
+from typing import Any
+from llm_judge.aggregation import aggregate_judge_scores
+from llm_judge.judge import LLMJudge
+# ---------------------------------------------------------------------------
+# Available judge models (one per family)
+# ---------------------------------------------------------------------------
+JUDGE_MODELS: list[dict[str, str]] = [
+    {
+        "family": "anthropic",
+        "provider": "anthropic",
+        "model": "claude-sonnet-4-20250514",
+    },
+    {
+        "family": "openai",
+        "provider": "openai",
+        "model": "gpt-5.2",
+    },
+    {
+        "family": "google",
+        "provider": "google",
+        "model": "gemini-2.5-pro",
+    },
+    {
+        "family": "deepseek",
+        "provider": "deepseek",
+        "model": "deepseek-chat",
+    },
+]
+# ---------------------------------------------------------------------------
+# Agent ID → model family mapping
+# ---------------------------------------------------------------------------
+_AGENT_FAMILY_PREFIXES: dict[str, str] = {
+    "claude": "anthropic",
+    "gpt": "openai",
+    "gemini": "google",
+    "deepseek": "deepseek",
+    "human": "human",
+}
+def detect_agent_family(agent_id: str) -> str:
+    """Map an agent ID to its model family.
+    Args:
+        agent_id: Agent identifier (e.g., 'claude-code', 'gpt5-tools-benchmark').
+    Returns:
+        Family string: 'anthropic', 'openai', 'google', 'deepseek', 'human',
+        or 'unknown'.
+    """
+    agent_lower = agent_id.lower()
+    for prefix, family in _AGENT_FAMILY_PREFIXES.items():
+        if agent_lower.startswith(prefix):
+            return family
+    return "unknown"
+def get_judge_models(agent_model_family: str) -> list[dict[str, str]]:
+    """Select judge models for a given agent, excluding self.
+    Args:
+        agent_model_family: Family of the agent being evaluated
+            ('anthropic', 'openai', 'google', 'deepseek', 'human', 'unknown').
+    Returns:
+        List of judge model dicts: 3 when the family matches a judge family,
+        all 4 for human baselines and for 'unknown' (nothing to exclude).
+    """
+    if agent_model_family == "human":
+        return list(JUDGE_MODELS)  # All 4 judges
+    return [j for j in JUDGE_MODELS if j["family"] != agent_model_family]
+class LLMJudgePanel:
+    """Cross-model judge panel for protein design evaluation.
+    Builds one judge per model family, excluding the agent's own family
+    (3 judges, or all 4 for 'human' and 'unknown'), and aggregates their scores.
+    Args:
+        agent_model_family: Model family to exclude ('anthropic', etc).
+        dry_run: If True, all judges return deterministic midpoint scores.
+    """
+    def __init__(
+        self,
+        agent_model_family: str,
+        dry_run: bool = False,
+    ):
+        self.agent_model_family = agent_model_family
+        self.dry_run = dry_run
+        self.judge_configs = get_judge_models(agent_model_family)
+        self.judges = [
+            LLMJudge(
+                provider=cfg["provider"],
+                model=cfg["model"],
+                dry_run=dry_run,
+            )
+            for cfg in self.judge_configs
+        ]
+    def evaluate_sync(
+        self,
+        task_description: str,
+        tool_call_log: list[dict[str, Any]],
+        designed_sequences: list[str],
+        algorithmic_metrics: dict[str, Any],
+        reference_pipeline: list[str] | None = None,
+    ) -> dict[str, Any]:
+        """Evaluate a design with all judges and aggregate.
+        Args:
+            task_description: Original task prompt.
+            tool_call_log: Agent's tool call sequence.
+            designed_sequences: Designed protein sequences.
+            algorithmic_metrics: Computed biophysical metrics.
+            reference_pipeline: Expected expert pipeline.
+        Returns:
+            Dict with aggregated scores, judge count, and individual results.
+        """
+        individual_results = []
+        for judge in self.judges:
+            result = judge.evaluate_sync(
+                task_description=task_description,
+                tool_call_log=tool_call_log,
+                designed_sequences=designed_sequences,
+                algorithmic_metrics=algorithmic_metrics,
+                reference_pipeline=reference_pipeline,
+            )
+            individual_results.append(result)
+        aggregated = aggregate_judge_scores(individual_results)
+        return {
+            **aggregated,
+            "judge_count": len(self.judges),
+            "individual_judges": [
+                {
+                    "model": cfg["model"],
+                    "family": cfg["family"],
+                    "scores": result,
+                }
+                for cfg, result in zip(self.judge_configs, individual_results)
+            ],
+        }
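The self-exclusion rule in `panel.py` can be illustrated with families alone. The sketch below copies the prefix table and family list from the diff; the helper names `family_of` and `judges_for` and the agent IDs `human-expert` and `my-custom-agent` are assumptions made up for the example.

```python
JUDGE_FAMILIES = ["anthropic", "openai", "google", "deepseek"]

PREFIX_TO_FAMILY = {
    "claude": "anthropic",
    "gpt": "openai",
    "gemini": "google",
    "deepseek": "deepseek",
    "human": "human",
}


def family_of(agent_id: str) -> str:
    """Map an agent ID to a family by prefix, as detect_agent_family does."""
    lowered = agent_id.lower()
    for prefix, family in PREFIX_TO_FAMILY.items():
        if lowered.startswith(prefix):
            return family
    return "unknown"


def judges_for(agent_id: str) -> list[str]:
    """Return judge families, dropping the agent's own family if it is one."""
    own = family_of(agent_id)
    return [family for family in JUDGE_FAMILIES if family != own]


# "human-expert" and "my-custom-agent" are hypothetical IDs.
for agent in ("claude-code", "gpt5-tools-benchmark", "human-expert", "my-custom-agent"):
    print(f"{agent}: {judges_for(agent)}")
```

Only agents whose family matches a judge family lose a judge. Human baselines and unrecognized agents keep all four, so panel sizes differ across leaderboard entries.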

llm_judge/plan_eval.py ADDED Viewed

	@@ -0,0 +1,141 @@

+"""LLM-based plan evaluation: judge whether agent's reasoning trace
+demonstrates understanding of each pipeline step.
+Replaces keyword matching with LLM assessment of 4 pipeline steps:
+  backbone_generation, sequence_design, structure_prediction, scoring_validation
+Each step is scored 0 or 1 per judge and aggregated across 3-4 judges by strict
+majority vote; a 2-2 tie among 4 judges counts as not planned (0).
+"""
+from __future__ import annotations
+import json
+import re
+from typing import Any
+from llm_judge.judge import LLMJudge
+PLAN_EVAL_SYSTEM = """You are an expert protein design evaluator. Your task is to assess whether an AI agent's reasoning trace demonstrates awareness and planning of specific protein design pipeline steps.
+You have deep knowledge of:
+- RFdiffusion for backbone generation
+- ProteinMPNN for inverse folding / sequence design
+- AlphaFold2, ESMFold, Boltz for structure prediction
+- Rosetta for energy scoring and validation
+Be strict: the agent must show genuine understanding or intent to use a step, not just mention a keyword in passing. Look for evidence that the agent planned to perform the step as part of its design strategy."""
+PLAN_EVAL_PROMPT_TEMPLATE = """## Task
+{task_description}
+## Agent's Reasoning Trace
+{reasoning_trace}
+## Pipeline Steps to Evaluate
+For each step below, determine whether the agent's reasoning trace shows that the agent **planned or intended** to perform this step. Score 1 if the agent demonstrates clear awareness and intent, 0 if not.
+1. **backbone_generation**: Did the agent plan to generate a de novo protein backbone/scaffold? (e.g., using RFdiffusion, backbone diffusion, scaffold generation, de novo structure design)
+2. **sequence_design**: Did the agent plan to design/optimize amino acid sequences for the structure? (e.g., using ProteinMPNN, inverse folding, sequence optimization, fixed-backbone design)
+3. **structure_prediction**: Did the agent plan to predict/validate the 3D structure of designed sequences? (e.g., using AlphaFold2, ESMFold, Boltz, checking pLDDT/pTM, fold confidence)
+4. **scoring_validation**: Did the agent plan to score the design's energy/stability? (e.g., using Rosetta, energy minimization, interface analysis, ddG calculation, binding energy)
+## Response Format
+Return a JSON object with exactly this structure:
+```json
+{{
+    "backbone_generation": {{"planned": 0 or 1, "evidence": "brief quote or reason"}},
+    "sequence_design": {{"planned": 0 or 1, "evidence": "brief quote or reason"}},
+    "structure_prediction": {{"planned": 0 or 1, "evidence": "brief quote or reason"}},
+    "scoring_validation": {{"planned": 0 or 1, "evidence": "brief quote or reason"}}
+}}
+```
+"""
+STEPS = ["backbone_generation", "sequence_design", "structure_prediction", "scoring_validation"]
+def parse_plan_response(raw_text: str) -> dict[str, int]:
+    """Parse LLM response into per-step binary scores."""
+    # Try JSON extraction
+    json_match = re.search(r"```(?:json)?\s*\n?(.*?)\n?\s*```", raw_text, re.DOTALL)
+    json_str = json_match.group(1) if json_match else raw_text
+    try:
+        data = json.loads(json_str)
+    except json.JSONDecodeError:
+        brace_match = re.search(r"\{.*\}", raw_text, re.DOTALL)
+        if brace_match:
+            try:
+                data = json.loads(brace_match.group())
+            except json.JSONDecodeError:
+                return {s: 0 for s in STEPS}
+        else:
+            return {s: 0 for s in STEPS}
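+    # Valid but non-object JSON (list, number, string) counts as no plan
+    if not isinstance(data, dict):
+        return {s: 0 for s in STEPS}
+    result = {}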
+    for step in STEPS:
+        if step in data and isinstance(data[step], dict):
+            val = data[step].get("planned", 0)
+            result[step] = 1 if val == 1 else 0  # True == 1, so JSON true also counts
+        else:
+            result[step] = 0
+    return result
+def evaluate_plan_single(
+    judge: LLMJudge,
+    task_description: str,
+    reasoning_trace: str,
+) -> dict[str, int]:
+    """Evaluate plan with a single judge."""
+    if not reasoning_trace or len(reasoning_trace.strip()) < 10:
+        return {s: 0 for s in STEPS}
+    if judge.dry_run:
+        return {s: 0 for s in STEPS}
+    # Cap trace length
+    trace = reasoning_trace[:4000]
+    prompt = PLAN_EVAL_PROMPT_TEMPLATE.format(
+        task_description=task_description[:1000],
+        reasoning_trace=trace,
+    )
+    raw = judge._call_api(PLAN_EVAL_SYSTEM, prompt)
+    return parse_plan_response(raw)
+def evaluate_plan_panel(
+    judges: list[LLMJudge],
+    task_description: str,
+    reasoning_trace: str,
+) -> dict[str, dict[str, Any]]:
+    """Evaluate plan with multiple judges, aggregate via majority vote.
+    Returns dict mapping step → {planned: 0/1, votes: [per-judge], n_judges: int}.
+    """
+    if not reasoning_trace or len(reasoning_trace.strip()) < 10:
+        return {
+            s: {"planned": 0, "votes": [0] * len(judges), "n_judges": len(judges)}
+            for s in STEPS
+        }
+    all_votes: dict[str, list[int]] = {s: [] for s in STEPS}
+    for judge in judges:
+        result = evaluate_plan_single(judge, task_description, reasoning_trace)
+        for step in STEPS:
+            all_votes[step].append(result.get(step, 0))
+    aggregated = {}
+    for step in STEPS:
+        votes = all_votes[step]
+        planned = 1 if sum(votes) > len(votes) / 2 else 0  # strict majority; a tie scores 0
+        aggregated[step] = {
+            "planned": planned,
+            "votes": votes,
+            "n_judges": len(judges),
+        }
+    return aggregated
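The vote rule in `evaluate_plan_panel` is a strict majority, which matters once a fourth judge is present. A minimal sketch with the same comparison is shown below; the function name `majority` is an assumption for illustration.

```python
def majority(votes: list[int]) -> int:
    """Return 1 only when more than half of the 0/1 votes are 1."""
    return 1 if sum(votes) > len(votes) / 2 else 0


print(majority([1, 1, 0]))     # 3 judges, 2 in favor
print(majority([1, 0, 0]))     # 3 judges, 1 in favor
print(majority([1, 1, 0, 0]))  # 4 judges, tied
print(majority([1, 1, 1, 0]))  # 4 judges, 3 in favor
```

With three judges a tie cannot occur. With four (human baselines), a 2-2 split resolves to 0, so a step needs three of four votes to count as planned.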

llm_judge/rubrics.py ADDED Viewed

	@@ -0,0 +1,173 @@

+"""Structured rubric prompts for LLM judge evaluation.
+Each judge evaluates 5 dimensions with explicit score-level descriptors
+following the Prometheus (ICLR 2024) rubric-based approach.
+"""
+from __future__ import annotations
+import json
+from typing import Any
+# ---------------------------------------------------------------------------
+# Judge dimensions with max scores matching the LLM portion of the split
+# ---------------------------------------------------------------------------
+JUDGE_DIMENSIONS: dict[str, dict[str, Any]] = {
+    "approach_strategy": {
+        "max_score": 10,
+        "description": "Strategic quality of tool/methodology selection",
+    },
+    "orchestration_reasoning": {
+        "max_score": 8,
+        "description": "Pipeline logic, error handling, and adaptive reasoning",
+    },
+    "bio_feasibility": {
+        "max_score": 5,
+        "description": "Biological plausibility beyond sequence-level checks",
+    },
+    "novelty_quality": {
+        "max_score": 2,
+        "description": "Meaningful innovation vs accidental variation",
+    },
+    "diversity_quality": {
+        "max_score": 3,
+        "description": "Functional diversity of design strategies",
+    },
+}
+JUDGE_SYSTEM_PROMPT = (
+    "You are an expert protein design evaluator with deep knowledge of "
+    "computational protein engineering, including backbone generation "
+    "(RFdiffusion, Chroma), sequence design (ProteinMPNN, LigandMPNN), "
+    "structure prediction (AlphaFold2, ESMFold, Boltz), and interface "
+    "analysis (PyRosetta, FoldX). You evaluate AI agent protein design "
+    "attempts against a structured rubric. Score each dimension "
+    "independently. Provide reasoning BEFORE your score. Be critical "
+    "but fair — a score of 5/10 means average, not bad."
+)
+_RUBRIC_TEXT = """\
+### Approach Strategy (0-10 pts)
+- 9-10: Selects optimal tools for this specific target; demonstrates deep \
+understanding of design strategy (e.g., chooses RFdiffusion hotspot \
+conditioning for epitope-specific binder, not generic backbone generation)
+- 7-8:  Appropriate tool selection with minor suboptimalities
+- 5-6:  Reasonable tools but misses key steps or uses generic strategy
+- 3-4:  Partially appropriate; missing critical tools for this task type
+- 0-2:  Inappropriate or random tool selection
+### Orchestration Reasoning (0-8 pts)
+- 7-8: Logical pipeline with error handling, iterative refinement based on \
+intermediate results, clear adaptive reasoning
+- 5-6: Correct ordering with some validation but limited adaptation
+- 3-4: Basic pipeline but missing intermediate checks or illogical ordering
+- 0-2: No clear pipeline logic; tools called without reasoning
+### Biological Feasibility (0-5 pts)
+- 4-5: Designs are biologically plausible — CDR loops appropriate for \
+target, active site geometry consistent, no obvious steric clashes
+- 2-3: Generally plausible with minor concerns
+- 0-1: Biologically implausible designs (e.g., all-alanine core, \
+impossible disulfide patterns)
+### Novelty Quality (0-2 pts)
+- 2: Novel design represents meaningful innovation (new fold, creative \
+binding mode) not just random mutations
+- 1: Some novelty but appears accidental rather than designed
+- 0: No meaningful novelty; trivially similar to reference or random
+### Diversity Quality (0-3 pts)
+- 3: Multiple designs explore different binding modes/conformations/\
+strategies — functionally diverse, not just sequence variants
+- 1-2: Some diversity but designs are minor variants of each other
+- 0: No meaningful diversity; essentially one design repeated
+"""
+def build_judge_prompt(
+    task_description: str,
+    tool_call_log: list[dict[str, Any]],
+    designed_sequences: list[str],
+    algorithmic_metrics: dict[str, Any],
+    reference_pipeline: list[str] | None = None,
+) -> str:
+    """Build the user prompt for LLM judge evaluation.
+    Args:
+        task_description: The original design task prompt.
+        tool_call_log: Sequence of tool calls with args.
+        designed_sequences: Raw designed sequence strings; the first 10 are
+            rendered as FASTA-style records, each truncated to 80 residues.
+        algorithmic_metrics: Computed metrics (pLDDT, ipTM, etc).
+        reference_pipeline: Expected expert pipeline for this task type.
+    Returns:
+        Formatted prompt string for the judge LLM.
+    """
+    sections = []
+    # Task description
+    sections.append(f"## Task Description\n{task_description}")
+    # Reference pipeline (for approach/orchestration context)
+    if reference_pipeline:
+        pipeline_str = " → ".join(reference_pipeline)
+        sections.append(
+            f"## Reference Pipeline (Expert-Validated)\n{pipeline_str}"
+        )
+    # Tool call log
+    if tool_call_log:
+        log_lines = []
+        for i, entry in enumerate(tool_call_log, 1):
+            tool = entry.get("tool", "unknown")
+            args = entry.get("args_summary", {})
+            args_str = json.dumps(args, default=str) if args else "{}"
+            log_lines.append(f"{i}. {tool}({args_str})")
+        sections.append(
+            "## Agent's Tool Call Log\n" + "\n".join(log_lines)
+        )
+    else:
+        sections.append("## Agent's Tool Call Log\nNo tool calls recorded.")
+    # Designed sequences
+    if designed_sequences:
+        seq_lines = []
+        for i, seq in enumerate(designed_sequences[:10], 1):  # Cap at 10
+            display = seq[:80] + "..." if len(seq) > 80 else seq
+            seq_lines.append(f">design_{i} (len={len(seq)})\n{display}")
+        sections.append(
+            f"## Designed Sequences ({len(designed_sequences)} total)\n"
+            + "\n".join(seq_lines)
+        )
+    else:
+        sections.append("## Designed Sequences\nNo sequences produced.")
+    # Algorithmic metrics (read-only context)
+    if algorithmic_metrics:
+        metrics_str = json.dumps(algorithmic_metrics, indent=2, default=str)
+        sections.append(
+            f"## Algorithmic Metrics (Read-Only Context)\n```json\n{metrics_str}\n```"
+        )
+    # Scoring rubric
+    sections.append(f"## Scoring Rubric\n{_RUBRIC_TEXT}")
+    # Output format instruction
+    output_format = {
+        dim: {"reasoning": "...", "score": f"0-{info['max_score']}"}
+        for dim, info in JUDGE_DIMENSIONS.items()
+    }
+    sections.append(
+        "## Required Output Format\n"
+        "Evaluate each dimension. For each:\n"
+        "1. Cite specific evidence from the agent's work\n"
+        "2. Reason about quality relative to the rubric\n"
+        "3. Assign a score\n\n"
+        "Respond in JSON format:\n"
+        f"```json\n{json.dumps(output_format, indent=2)}\n```"
+    )
+    return "\n\n".join(sections)
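As a quick arithmetic check on `JUDGE_DIMENSIONS`, the sketch below sums the per-dimension maxima and computes the floor-division midpoints that dry-run mode returns. The maxima are copied from the diff; the variable names are assumptions. The total appears to correspond to the LLM share of the "hybrid 72+28" split named in the commit message.

```python
# Maxima copied from JUDGE_DIMENSIONS in llm_judge/rubrics.py.
MAX_SCORES = {
    "approach_strategy": 10,
    "orchestration_reasoning": 8,
    "bio_feasibility": 5,
    "novelty_quality": 2,
    "diversity_quality": 3,
}

# Total points an LLM judge can award across all dimensions.
llm_total = sum(MAX_SCORES.values())

# Dry-run midpoints use floor division, so odd maxima round down.
midpoints = {dim: max_score // 2 for dim, max_score in MAX_SCORES.items()}
dry_run_total = sum(midpoints.values())

print(llm_total)      # 28
print(midpoints)
print(dry_run_total)  # 13
```

Because odd maxima round down, a dry-run judge awards 13 of 28 points, slightly under half.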

requirements.txt CHANGED Viewed

@@ -4,3 +4,8 @@ plotly
 httpx>=0.25
 huggingface_hub>=0.20
 datasets>=2.16

+# LLM judge panel (Phase A)
+anthropic>=0.75
+openai>=1.40
+google-genai>=0.3