Spaces: remdms/mediastorm

remdms Claude Opus 4.6 committed on Mar 31

Commit

b145b51

1 Parent(s): 72296b0

docs: implementation plan for eval CLI

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Files changed (1)

docs/superpowers/plans/2026-03-31-eval-cli.md +564 -0

docs/superpowers/plans/2026-03-31-eval-cli.md ADDED

	@@ -0,0 +1,564 @@

+# Eval CLI Implementation Plan
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+**Goal:** Add `cli.py eval` command that runs retrieval evaluation, saves results as JSON, and shows diffs against previous runs.
+**Architecture:** CLI command calls existing `eval_retrieval.run_eval()`, structures results into a JSON file in `data/eval_runs/`, then compares against the most recent previous run to show regressions/improvements. Two new modules: `runner.py` (orchestration + storage) and `display.py` (console formatting).
+**Tech Stack:** Python, Click (existing CLI), JSON files for storage
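The run file layout described above can be pictured with a small sketch. The key names follow the `runner.py` code in Task 1; every number is a placeholder for illustration, not a real evaluation result, and the `people` category is just one example entry.

```python
import json

# Placeholder numbers for illustration only; they are not real eval results.
example_run = {
    "timestamp": "2026-03-31T12:00:00",
    "aggregates": {
        "semantic_p1": 0.0, "semantic_r5": 0.0, "semantic_mrr": 0.0,
        "semantic_ndcg5": 0.0, "filter_p1": 0.0, "filter_r5": 0.0,
        "edge_pass_rate": 0.0,
    },
    "categories": {
        # Scored categories carry averaged metrics plus a query count.
        "people": {"p1": 0.0, "r5": 0.0, "mrr": 0.0, "ndcg5": 0.0, "count": 0},
        # The edge category only tracks how many queries were correctly rejected.
        "edge_no_match": {"passed": 0, "total": 0},
    },
    "queries": [],  # one dict per evaluated query
}

serialized = json.dumps(example_run, indent=2, ensure_ascii=False)
print(serialized)
```

Keeping the four top-level keys stable matters because `display.py` reads `aggregates`, `categories`, and `queries` directly when comparing two runs.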
+---
+### Task 1: Runner — run eval and save JSON
+**Files:**
+- Create: `src/mediastorm/eval/__init__.py`
+- Create: `src/mediastorm/eval/runner.py`
+- [ ] **Step 1: Create empty `__init__.py`**
+```python
+# src/mediastorm/eval/__init__.py
+```
+- [ ] **Step 2: Write `runner.py`**
+```python
+"""Eval runner — orchestrates evaluation and persists results."""
+import json
+from datetime import datetime
+from pathlib import Path
+EVAL_RUNS_DIR = Path("./data/eval_runs")
+_SEMANTIC_CATS = {"geographic", "thematic", "people"}
+_FILTER_CATS = {"temporal", "genre", "awards"}
+def _build_run_data(eval_result: dict) -> dict:
+    """Structure raw eval_retrieval output into storable run data."""
+    details = eval_result["details"]
+    ts = datetime.now()
+    # Per-category aggregates
+    categories: dict[str, dict] = {}
+    for row in details:
+        cat = row["category"]
+        if cat not in categories:
+            categories[cat] = {"rows": []}
+        categories[cat]["rows"].append(row)
+    cat_summary = {}
+    for cat, data in categories.items():
+        rows = data["rows"]
+        if cat == "edge_no_match":
+            passed = sum(1 for r in rows if r["success"])
+            cat_summary[cat] = {"passed": passed, "total": len(rows)}
+        else:
+            cat_summary[cat] = {
+                "p1": _avg(rows, "precision_at_1"),
+                "r5": _avg(rows, "recall_at_5"),
+                "mrr": _avg(rows, "mrr"),
+                "ndcg5": _avg(rows, "ndcg_at_5"),
+                "count": len(rows),
+            }
+    # Query-level details
+    queries = []
+    for row in details:
+        if row["category"] == "edge_no_match":
+            queries.append({
+                "query": row["query"],
+                "category": row["category"],
+                "success": row["success"],
+                "num_returned": row["num_returned"],
+                "duration": row["duration"],
+            })
+        else:
+            queries.append({
+                "query": row["query"],
+                "category": row["category"],
+                "p1": row["precision_at_1"],
+                "r5": row["recall_at_5"],
+                "mrr": row["mrr"],
+                "ndcg5": row["ndcg_at_5"],
+                "retrieved_ids": row["retrieved"],
+                "expected_ids": list(row.get("expected", [])),  # empty until eval_retrieval returns "expected"
+                "missed": list(set(row.get("expected", set())) - set(row["retrieved"])) if row.get("expected") else [],
+                "duration": row["duration"],
+            })
+    # eval_retrieval does not yet return expected IDs in its details rows, so
+    # expected_ids and missed stay empty until Task 5 adds them at the source.
+    return {
+        "timestamp": ts.isoformat(timespec="seconds"),
+        "aggregates": {
+            "semantic_p1": eval_result["semantic_precision_at_1"],
+            "semantic_r5": eval_result["semantic_recall_at_5"],
+            "semantic_mrr": eval_result["semantic_mrr"],
+            "semantic_ndcg5": eval_result["semantic_ndcg_at_5"],
+            "filter_p1": eval_result["filter_precision_at_1"],
+            "filter_r5": eval_result["filter_recall_at_5"],
+            "edge_pass_rate": eval_result["edge_pass_rate"],
+        },
+        "categories": cat_summary,
+        "queries": queries,
+    }
+def _avg(rows: list[dict], key: str) -> float:
+    vals = [r[key] for r in rows if key in r]
+    return sum(vals) / len(vals) if vals else 0.0
+def save_run(run_data: dict, runs_dir: Path = EVAL_RUNS_DIR) -> Path:
+    """Save run data as timestamped JSON. Returns the file path."""
+    runs_dir.mkdir(parents=True, exist_ok=True)
+    ts = datetime.fromisoformat(run_data["timestamp"])
+    filename = ts.strftime("%Y-%m-%d_%H-%M-%S") + ".json"
+    path = runs_dir / filename
+    path.write_text(json.dumps(run_data, indent=2, ensure_ascii=False), encoding="utf-8")
+    return path
+def load_previous_run(runs_dir: Path = EVAL_RUNS_DIR) -> dict | None:
+    """Load the most recent run JSON, or None if no runs exist."""
+    if not runs_dir.exists():
+        return None
+    files = sorted(runs_dir.glob("*.json"))
+    if not files:
+        return None
+    return json.loads(files[-1].read_text(encoding="utf-8"))
+def load_all_runs(runs_dir: Path = EVAL_RUNS_DIR) -> list[dict]:
+    """Load all run JSONs sorted chronologically."""
+    if not runs_dir.exists():
+        return []
+    files = sorted(runs_dir.glob("*.json"))
+    return [json.loads(f.read_text(encoding="utf-8")) for f in files]
+```
+- [ ] **Step 3: Verify module imports**
+Run: `source .venv/bin/activate && python -c "from mediastorm.eval.runner import save_run, load_previous_run, load_all_runs, _build_run_data; print('OK')"`
+Expected: `OK`
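Beyond the import check, a round trip through a temporary directory can confirm that the newest file is the one loaded back. The sketch below inlines simplified stand-ins for `save_run` and `load_previous_run` so that it runs without the package installed; the stand-in bodies, the `round_trip_check` helper, and the sample timestamps are assumptions for illustration.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Optional


def save_run(run_data: dict, runs_dir: Path) -> Path:
    """Simplified stand-in for runner.save_run (same filename scheme)."""
    runs_dir.mkdir(parents=True, exist_ok=True)
    ts = datetime.fromisoformat(run_data["timestamp"])
    path = runs_dir / (ts.strftime("%Y-%m-%d_%H-%M-%S") + ".json")
    path.write_text(json.dumps(run_data, indent=2), encoding="utf-8")
    return path


def load_previous_run(runs_dir: Path) -> Optional[dict]:
    """Simplified stand-in for runner.load_previous_run."""
    files = sorted(runs_dir.glob("*.json")) if runs_dir.exists() else []
    return json.loads(files[-1].read_text(encoding="utf-8")) if files else None


def round_trip_check() -> Optional[dict]:
    """Save two runs out of order and return the one loaded as 'previous'."""
    with tempfile.TemporaryDirectory() as tmp:
        runs_dir = Path(tmp) / "eval_runs"
        newer = {"timestamp": "2026-03-31T12:00:00", "aggregates": {"semantic_r5": 0.9}}
        older = {"timestamp": "2026-03-30T09:30:00", "aggregates": {"semantic_r5": 0.8}}
        save_run(newer, runs_dir)
        save_run(older, runs_dir)
        return load_previous_run(runs_dir)


latest = round_trip_check()
print(latest["timestamp"])
```

Because the filenames start with the date and time in fixed-width form, a plain lexicographic sort is also a chronological sort, which is what lets the loader pick `files[-1]`.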
+- [ ] **Step 4: Commit**
+```bash
+git add src/mediastorm/eval/__init__.py src/mediastorm/eval/runner.py
+git commit -m "feat(eval): add runner module for eval orchestration and JSON storage"
+```
+---
+### Task 2: Display — console formatting
+**Files:**
+- Create: `src/mediastorm/eval/display.py`
+- [ ] **Step 1: Write `display.py`**
+```python
+"""Console display for eval runs — scores, diffs, history."""
+def print_scores(run_data: dict) -> None:
+    """Print aggregate scores table (same layout as eval_retrieval.py)."""
+    agg = run_data["aggregates"]
+    cats = run_data["categories"]
+    print()
+    print("=" * 60)
+    print("MediaStorm RAG — Retrieval Evaluation")
+    print("=" * 60)
+    print()
+    print("CORE SEMANTIC SEARCH (people, thematic, geographic)")
+    print("-" * 60)
+    print(f"  Precision@1:    {agg['semantic_p1']:.2f}  (target ≥ 0.85)")
+    print(f"  Recall@5:       {agg['semantic_r5']:.2f}  (target ≥ 0.90)")
+    print(f"  MRR:            {agg['semantic_mrr']:.2f}")
+    print(f"  NDCG@5:         {agg['semantic_ndcg5']:.2f}")
+    print()
+    print("FILTER QUERIES (temporal, genre, awards)")
+    print("-" * 60)
+    print(f"  Precision@1:    {agg['filter_p1']:.2f}")
+    print(f"  Recall@5:       {agg['filter_r5']:.2f}")
+    edge = cats.get("edge_no_match", {})
+    passed = edge.get("passed", 0)
+    total = edge.get("total", 0)
+    print()
+    print("EDGE CASES")
+    print("-" * 60)
+    print(f"  Correctly rejected: {passed}/{total}")
+    # Per-category breakdown
+    print()
+    print("PER-CATEGORY BREAKDOWN")
+    print("-" * 60)
+    _SEM = {"geographic", "thematic", "people"}
+    for cat, data in cats.items():
+        if cat == "edge_no_match":
+            print(f"  {cat:20s}  {data['passed']}/{data['total']} rejected")
+        else:
+            label = "semantic" if cat in _SEM else "filter"
+            print(f"  {cat:20s}  P@1={data['p1']:.2f}  R@5={data['r5']:.2f}  ({data['count']} queries) [{label}]")
+    print("=" * 60)
+def print_verbose(run_data: dict) -> None:
+    """Print per-query details before the aggregate scores."""
+    print()
+    for i, q in enumerate(run_data["queries"]):
+        if q["category"] == "edge_no_match":
+            status = "PASS" if q["success"] else "FAIL"
+            print(f"  [{status}] Q{i+1}: {q['query']}")
+            if not q["success"]:
+                print(f"         Returned {q['num_returned']} results (expected 0)")
+        else:
+            status = "PASS" if q["r5"] > 0 else "MISS"
+            print(f"  [{status}] Q{i+1}: {q['query']}")
+            print(f"         P@1={q['p1']:.0f}  R@5={q['r5']:.2f}  MRR={q['mrr']:.2f}  NDCG@5={q['ndcg5']:.2f}  ({q['duration']:.1f}s)")
+            if q.get("missed"):
+                print(f"         Missed: {q['missed']}")
+def print_diff(current: dict, previous: dict) -> None:
+    """Print comparison between two runs — deltas, regressions, improvements."""
+    prev_ts = previous["timestamp"][:16].replace("T", " ")
+    print()
+    print(f"COMPARISON vs {prev_ts}")
+    print("-" * 60)
+    cur_agg = current["aggregates"]
+    prev_agg = previous["aggregates"]
+    _diff_line("semantic P@1", prev_agg["semantic_p1"], cur_agg["semantic_p1"])
+    _diff_line("semantic R@5", prev_agg["semantic_r5"], cur_agg["semantic_r5"])
+    _diff_line("filter P@1", prev_agg["filter_p1"], cur_agg["filter_p1"])
+    _diff_line("filter R@5", prev_agg["filter_r5"], cur_agg["filter_r5"])
+    # Edge cases
+    prev_edge = previous["categories"].get("edge_no_match", {})
+    cur_edge = current["categories"].get("edge_no_match", {})
+    prev_e = prev_edge.get("passed", 0)
+    cur_e = cur_edge.get("passed", 0)
+    total_e = cur_edge.get("total", 0)
+    prev_total_e = prev_edge.get("total", 0)  # the previous run may have had a different number of edge queries
+    delta_e = cur_e - prev_e
+    arrow = " ▲" if delta_e > 0 else " ▼" if delta_e < 0 else ""
+    print(f"  edge rejected:  {prev_e}/{prev_total_e} → {cur_e}/{total_e}  ({delta_e:+d}){arrow}")
+    # Per-query regressions and improvements
+    prev_queries = {q["query"]: q for q in previous["queries"]}
+    regressions = []
+    improvements = []
+    for q in current["queries"]:
+        if q["category"] == "edge_no_match":
+            continue
+        prev_q = prev_queries.get(q["query"])
+        if not prev_q or prev_q["category"] == "edge_no_match":
+            continue
+        delta = q["r5"] - prev_q["r5"]
+        if delta < -0.01:
+            regressions.append((q["query"], prev_q["r5"], q["r5"], delta))
+        elif delta > 0.01:
+            improvements.append((q["query"], prev_q["r5"], q["r5"], delta))
+    if regressions:
+        print()
+        print(f"REGRESSIONS ({len(regressions)}):")
+        for query, old, new, delta in regressions:
+            print(f'  "{query}"')
+            print(f"       R@5: {old:.2f} → {new:.2f} ({delta:+.2f}) ▼")
+    if improvements:
+        print()
+        print(f"IMPROVEMENTS ({len(improvements)}):")
+        for query, old, new, delta in improvements:
+            print(f'  "{query}"')
+            print(f"       R@5: {old:.2f} → {new:.2f} ({delta:+.2f}) ▲")
+    if not regressions and not improvements:
+        print()
+        print("  No per-query changes.")
+    print()
+def _diff_line(label: str, old: float, new: float) -> None:
+    delta = new - old
+    if abs(delta) < 0.005:
+        arrow = "(=)"
+    elif delta > 0:
+        arrow = f"(+{delta:.2f}) ▲"
+    else:
+        arrow = f"({delta:.2f}) ▼ REGRESSION"
+    print(f"  {label:16s} {old:.2f} → {new:.2f}  {arrow}")
+def print_history(runs: list[dict]) -> None:
+    """Print one-liner per run with trend indicators."""
+    if not runs:
+        print("No eval runs found.")
+        return
+    print()
+    print(f"EVAL HISTORY ({len(runs)} runs)")
+    print("-" * 60)
+    prev = None
+    for run in runs:
+        # Trim to minute precision for display, e.g. "2026-03-31 12:00"
+        display_ts = run["timestamp"][:16].replace("T", " ")
+        agg = run["aggregates"]
+        edge = run["categories"].get("edge_no_match", {})
+        edge_str = f"{edge.get('passed', 0)}/{edge.get('total', 0)}"
+        line = f"  {display_ts}  sem_R@5={agg['semantic_r5']:.2f}  filt_R@5={agg['filter_r5']:.2f}  edge={edge_str}"
+        if prev:
+            prev_agg = prev["aggregates"]
+            sem_delta = agg["semantic_r5"] - prev_agg["semantic_r5"]
+            filt_delta = agg["filter_r5"] - prev_agg["filter_r5"]
+            # Check regressions first so a gain in one metric cannot hide a drop in the other.
+            if sem_delta < -0.01 or filt_delta < -0.01:
+                line += "  ▼"
+            elif sem_delta > 0.01 and filt_delta > 0.01:
+                line += "  ▲ both"
+            elif sem_delta > 0.01:
+                line += f"  ▲ sem +{sem_delta:.2f}"
+            elif filt_delta > 0.01:
+                line += f"  ▲ filt +{filt_delta:.2f}"
+        print(line)
+        prev = run
+    print()
+```
+- [ ] **Step 2: Verify module imports**
+Run: `source .venv/bin/activate && python -c "from mediastorm.eval.display import print_scores, print_diff, print_history, print_verbose; print('OK')"`
+Expected: `OK`
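The aggregate comparison treats changes smaller than 0.005 as noise. A standalone sketch of that rule follows; it returns the string instead of printing it so the behavior is easy to check. `format_delta` is a hypothetical name used here for illustration and is not part of `display.py`.

```python
def format_delta(old: float, new: float, eps: float = 0.005) -> str:
    """Classify a score change using the same thresholds as _diff_line."""
    delta = new - old
    if abs(delta) < eps:
        return "(=)"  # change too small to report
    if delta > 0:
        return f"(+{delta:.2f}) ▲"
    return f"({delta:.2f}) ▼ REGRESSION"


print(format_delta(0.80, 0.802))
print(format_delta(0.80, 0.85))
print(format_delta(0.90, 0.85))
```

The per-query comparison in `print_diff` uses a wider tolerance of 0.01, so a query can move slightly without being listed as a regression or an improvement.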
+- [ ] **Step 3: Commit**
+```bash
+git add src/mediastorm/eval/display.py
+git commit -m "feat(eval): add display module for console formatting and diffs"
+```
+---
+### Task 3: CLI command — wire it all together
+**Files:**
+- Modify: `cli.py`
+- [ ] **Step 1: Add eval command to `cli.py`**
+Add after the `audit` command (before `if __name__`):
+```python
+@cli.command(name="eval")
+@click.option("--verbose", "-v", is_flag=True, help="Show per-query details.")
+@click.option("--history", is_flag=True, help="Show history of past runs.")
+def eval_cmd(verbose: bool, history: bool):
+    """Run retrieval evaluation and compare to previous run."""
+    from mediastorm.eval.runner import (
+        _build_run_data, save_run, load_previous_run, load_all_runs,
+    )
+    from mediastorm.eval.display import (
+        print_scores, print_verbose, print_diff, print_history,
+    )
+    if history:
+        runs = load_all_runs()
+        print_history(runs)
+        return
+    # Load previous run BEFORE running eval (so the new run isn't compared to itself)
+    previous = load_previous_run()
+    # Run evaluation
+    import asyncio  # local import in case cli.py does not already import it at module level
+    from eval_retrieval import run_eval
+    click.echo("Running retrieval evaluation...")
+    eval_result = asyncio.run(run_eval(verbose=False))
+    # Build and save run data
+    run_data = _build_run_data(eval_result)
+    path = save_run(run_data)
+    click.echo(f"Results saved to {path}")
+    # Display
+    if verbose:
+        print_verbose(run_data)
+    print_scores(run_data)
+    if previous:
+        print_diff(run_data, previous)
+    else:
+        click.echo("\nFirst run — no comparison available.")
+```
+- [ ] **Step 2: Test the command runs**
+Run: `source .venv/bin/activate && python cli.py eval`
+Expected: scores table printed, JSON saved to `data/eval_runs/`, "First run — no comparison available."
+- [ ] **Step 3: Test verbose mode**
+Run: `source .venv/bin/activate && python cli.py eval --verbose`
+Expected: per-query details printed before scores, plus comparison vs first run
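The comparison printed in this step pairs queries by their text and flags R@5 changes beyond a tolerance of 0.01. A simplified sketch of that matching is shown below; `classify_changes` is a hypothetical helper and the sample queries and scores are made up for illustration.

```python
def classify_changes(current: list, previous: list, tol: float = 0.01) -> tuple:
    """Split queries into regressions and improvements by their R@5 delta."""
    prev_by_query = {q["query"]: q for q in previous}
    regressions, improvements = [], []
    for q in current:
        prev_q = prev_by_query.get(q["query"])
        if prev_q is None:
            continue  # query is new in this run, nothing to compare against
        delta = q["r5"] - prev_q["r5"]
        if delta < -tol:
            regressions.append(q["query"])
        elif delta > tol:
            improvements.append(q["query"])
    return regressions, improvements


previous_queries = [
    {"query": "a", "r5": 1.0},
    {"query": "b", "r5": 0.5},
    {"query": "c", "r5": 0.8},
]
current_queries = [
    {"query": "a", "r5": 0.6},   # dropped
    {"query": "b", "r5": 1.0},   # improved
    {"query": "c", "r5": 0.8},   # unchanged
    {"query": "d", "r5": 0.2},   # new query, skipped
]
regs, imps = classify_changes(current_queries, previous_queries)
print(regs, imps)
```

Matching on the query string means that rewording a query in the ground truth makes it look like a new query, so it silently drops out of the comparison for one run.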
+- [ ] **Step 4: Test history**
+Run: `source .venv/bin/activate && python cli.py eval --history`
+Expected: two runs listed with trend indicator
+- [ ] **Step 5: Commit**
+```bash
+git add cli.py
+git commit -m "feat(eval): add cli.py eval command with diff and history"
+```
+---
+### Task 4: Suppress eval_retrieval.py stdout
+**Files:**
+- Modify: `eval_retrieval.py:284-291` (the print block at the start of `run_eval`)
+- Modify: `cli.py`
+The current `run_eval()` prints directly to stdout. Because `display.py` now handles all formatting, those prints must be suppressed when `run_eval()` is called from the CLI. Add a `quiet` parameter for this.
+- [ ] **Step 1: Add `quiet` parameter to `run_eval()`**
+In `eval_retrieval.py`, change the signature and wrap all `print()` calls:
+```python
+async def run_eval(verbose: bool = False, quiet: bool = False) -> dict:
+```
+Then wrap every `print(...)` in `run_eval` with `if not quiet:`. This affects:
+- Lines 284-287 (header)
+- Lines 310-314 (edge verbose)
+- Lines 335-340 (scored verbose)
+- Lines 359-393 (aggregates and breakdown)
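One way to avoid scattering `if not quiet:` across many lines is to route output through a single local helper. This is a minimal sketch of that alternative, not the layout of the real function: `run_eval_sketch`, `emit`, and `captured_output` are hypothetical names, and the returned dict is placeholder data.

```python
import asyncio
import contextlib
import io


async def run_eval_sketch(verbose: bool = False, quiet: bool = False) -> dict:
    """Toy stand-in for run_eval that shows the quiet-flag pattern."""

    def emit(*args: object) -> None:
        # Single choke point: nothing is printed when quiet is set.
        if not quiet:
            print(*args)

    emit("=" * 60)
    emit("Retrieval Evaluation")
    if verbose:
        emit("  per-query details would go here")
    return {"details": []}  # placeholder result


def captured_output(**kwargs: bool) -> str:
    """Run the sketch and return whatever it wrote to stdout."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        asyncio.run(run_eval_sketch(**kwargs))
    return buf.getvalue()


quiet_text = captured_output(quiet=True)
loud_text = captured_output(quiet=False)
print(repr(quiet_text))
print(len(loud_text.splitlines()))
```

Since `quiet` defaults to `False`, existing callers such as the standalone script keep their current output.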
+- [ ] **Step 2: Update CLI to use `quiet=True`**
+In `cli.py`, change the `run_eval` call:
+```python
+    eval_result = asyncio.run(run_eval(verbose=False, quiet=True))
+```
+- [ ] **Step 3: Verify standalone script still works**
+Run: `source .venv/bin/activate && python eval_retrieval.py --verbose`
+Expected: same output as before (quiet defaults to False)
+- [ ] **Step 4: Verify CLI suppresses duplicate output**
+Run: `source .venv/bin/activate && python cli.py eval`
+Expected: only the display module's output, no duplicate headers
+- [ ] **Step 5: Commit**
+```bash
+git add eval_retrieval.py cli.py
+git commit -m "feat(eval): add quiet mode to eval_retrieval to avoid duplicate output"
+```
+---
+### Task 5: Enrich run data with expected IDs from ground truth
+**Files:**
+- Modify: `eval_retrieval.py:298-342` (the per-query loop)
+Currently `run_eval()` does not include `expected` IDs in the returned details. The verbose output needs them to show missed UIDs.
+- [ ] **Step 1: Add expected IDs to each result row**
+In `eval_retrieval.py`, inside the per-query loop, add `expected` to each row dict and, for scored queries, `missed` as well. For non-edge queries (around lines 316-330):
+```python
+            row = {
+                "query": query,
+                "category": category,
+                "precision_at_1": p1,
+                "recall_at_5": r5,
+                "mrr": m,
+                "ndcg_at_5": n5,
+                "retrieved": retrieved_ids,
+                "expected": list(expected),
+                "missed": list(expected - set(retrieved_ids)),
+                "duration": duration,
+            }
+```
+For edge queries (around lines 302-308):
+```python
+            row = {
+                "query": query,
+                "category": category,
+                "success": success,
+                "num_returned": len(retrieval.stories),
+                "expected": [],
+                "duration": duration,
+            }
+```
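The `missed` field is the set difference between the expected and the retrieved IDs. A small sketch with made-up IDs is shown below; `missed_ids` is a hypothetical helper, and the sorting is an addition here (not in the plan) that keeps the stored JSON stable between runs, assuming the IDs are all of one comparable type such as strings.

```python
from typing import Iterable


def missed_ids(expected: Iterable[str], retrieved: Iterable[str]) -> list:
    """Return expected IDs that were not retrieved, in sorted order."""
    return sorted(set(expected) - set(retrieved))


print(missed_ids({"s1", "s2", "s3"}, ["s3", "s9", "s1"]))
print(missed_ids(set(), ["s3"]))
```

Iterating a plain `set` gives no guaranteed order, so without sorting two otherwise identical runs could serialize their `missed` lists differently.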
+- [ ] **Step 2: Update `_build_run_data` in `runner.py` to use the enriched data**
+In `runner.py`, simplify the query building since expected/missed now come from eval_retrieval:
+```python
+    for row in details:
+        if row["category"] == "edge_no_match":
+            queries.append({
+                "query": row["query"],
+                "category": row["category"],
+                "success": row["success"],
+                "num_returned": row["num_returned"],
+                "duration": row["duration"],
+            })
+        else:
+            queries.append({
+                "query": row["query"],
+                "category": row["category"],
+                "p1": row["precision_at_1"],
+                "r5": row["recall_at_5"],
+                "mrr": row["mrr"],
+                "ndcg5": row["ndcg_at_5"],
+                "retrieved_ids": row["retrieved"],
+                "expected_ids": row["expected"],
+                "missed": row["missed"],
+                "duration": row["duration"],
+            })
+```
+- [ ] **Step 3: Verify**
+Run: `source .venv/bin/activate && python cli.py eval --verbose`
+Expected: missed UIDs shown in verbose output
+- [ ] **Step 4: Commit**
+```bash
+git add eval_retrieval.py src/mediastorm/eval/runner.py
+git commit -m "feat(eval): include expected IDs and missed UIDs in eval results"
+```