Spaces:

NinjainPJs
/

ninja-code-guard

Sleeping

App Files Files Community

ninja-code-guard / docs /WEEK9_POLISH_AND_EVALUATION.md

NinjainPJs

initial - commit

4b445f6 3 months ago

preview code

raw

history blame contribute delete

24.6 kB

Week 9: Evaluation Harness & Project Polish — Detailed Documentation

Goal: Build an evaluation harness that measures review quality against ground truth, compute precision/recall/F1, track latency percentiles, and polish the README for public release. Status: Complete — Evaluation framework operational, README finalized Date: 2026-03-20 Key Metric: Ground truth matching with 3-line tolerance, precision/recall/F1 per test case Deliverables: Evaluation harness, test dataset, production-quality README

What We Built

Week 9 adds two critical capabilities that transform Ninja Code Guard from a "works on my machine" prototype into a project ready for production evaluation and public presentation.

Evaluation Harness — A framework that runs the full review pipeline against test PRs with known issues (ground truth) and measures precision, recall, F1, and latency. This answers the question every interviewer asks: "How do you know your system actually works?"
README Polish — A comprehensive README.md that serves as the project's public face, covering architecture, setup, usage, and test results.

                    ┌──────────────────────────────────┐
                    │     Evaluation Harness            │
                    │     tests/eval/                   │
                    │                                   │
                    │  ┌────────────────────────────┐   │
                    │  │  Dataset (JSON files)       │   │
                    │  │  sql_injection_basic.json   │   │
                    │  │  n_plus_one_query.json      │   │  Each file contains:
                    │  │  hardcoded_secret.json      │   │  - PR diff
                    │  │  ...                        │   │  - File contents
                    │  └──────────┬─────────────────┘   │  - Expected findings
                    │             │                     │    (ground truth)
                    │             ▼                     │
                    │  ┌────────────────────────────┐   │
                    │  │  run_eval.py                │   │
                    │  │  For each test case:        │   │
                    │  │    1. Run 3 agents parallel  │   │
                    │  │    2. Synthesize findings    │   │
                    │  │    3. Match vs ground truth  │   │
                    │  │    4. Compute TP/FP/FN       │   │
                    │  └──────────┬─────────────────┘   │
                    │             │                     │
                    │             ▼                     │
                    │  ┌────────────────────────────┐   │
                    │  │  metrics.py                 │   │
                    │  │  Per-PR: P, R, F1, latency  │   │
                    │  │  Aggregate: avg P/R/F1      │   │
                    │  │  Latency: p50, p95          │   │
                    │  └────────────────────────────┘   │
                    └──────────────────────────────────┘

Concept: Why Evaluation Matters

The Problem: "It Seems to Work" Is Not Enough

Without systematic evaluation, we're relying on anecdotal evidence: "I ran it on PR #4 and it found the SQL injection." But this tells us nothing about:

Precision — Of the issues it flagged, how many are real? (Are there false positives?)
Recall — Of the real issues, how many did it find? (Are there false negatives?)
Consistency — Does it work on different code patterns, or just the ones we tested?
Latency — How long does a review take? Is it fast enough for a real workflow?

The evaluation harness answers all of these with reproducible, quantitative metrics.

The Three Core Metrics

                    All items in test PR
              ┌────────────────────────────────────┐
              │                                    │
              │    Ground Truth         Detected   │
              │    (expected)           (actual)    │
              │   ┌──────────┐     ┌──────────┐    │
              │   │          │     │          │    │
              │   │   FN     │ TP  │   FP     │    │
              │   │ (missed) │     │ (false   │    │
              │   │          │     │  alarm)  │    │
              │   └──────────┘     └──────────┘    │
              │                                    │
              └────────────────────────────────────┘

  Precision = TP / (TP + FP)  →  "Of what we flagged, how much is real?"
  Recall    = TP / (TP + FN)  →  "Of what's real, how much did we find?"
  F1        = 2*P*R / (P+R)   →  "Harmonic mean — balance of both"

Why F1 and not just accuracy? Accuracy (TP + TN) / total is misleading for imbalanced problems. A PR with 100 lines and 1 vulnerability: a system that says "everything is fine" has 99% accuracy but 0% recall. F1 balances precision and recall, penalizing systems that sacrifice one for the other.

Interview talking point: "We measure precision, recall, and F1 rather than accuracy because code review is an imbalanced classification problem — most lines are fine, only a few have issues. A system that flags nothing has 0% recall but near-100% precision. A system that flags everything has 100% recall but near-0% precision. F1 forces us to balance both."

Step-by-Step Implementation Log

Step 1: Design the Evaluation Dataset Format

What we did: Defined a JSON schema for test cases with known vulnerabilities.

{
  "pr_id": "sql_injection_basic",
  "diff": "diff --git a/app.py b/app.py\n--- /dev/null\n+++ b/app.py\n@@ -0,0 +1,10 @@\n+import sqlite3\n+\n+def get_user(user_id):\n+    conn = sqlite3.connect('users.db')\n+    query = f\"SELECT * FROM users WHERE id = {user_id}\"\n+    return conn.execute(query).fetchone()\n+\n+def safe_get_user(user_id):\n+    conn = sqlite3.connect('users.db')\n+    return conn.execute('SELECT * FROM users WHERE id = ?', (user_id,)).fetchone()\n",
  "file_contents": {
    "app.py": "import sqlite3\n\ndef get_user(user_id):\n    conn = sqlite3.connect('users.db')\n    query = f\"SELECT * FROM users WHERE id = {user_id}\"\n    return conn.execute(query).fetchone()\n\ndef safe_get_user(user_id):\n    conn = sqlite3.connect('users.db')\n    return conn.execute('SELECT * FROM users WHERE id = ?', (user_id,)).fetchone()\n"
  },
  "expected_findings": [
    {
      "file_path": "app.py",
      "line_start": 5,
      "category": "sql_injection"
    }
  ]
}

Each test case contains four fields:

Field	Purpose
`pr_id`	Unique identifier for this test case (used in logging)
`diff`	The PR diff in unified diff format (what GitHub sends)
`file_contents`	Full file source code (used by agents for analysis)
`expected_findings`	Ground truth: known issues with file, line, and category

Design decisions:

Self-contained JSON: Each test case includes both the diff and full file contents. This means the evaluation can run without GitHub API access — no network dependencies, fully reproducible.
Minimal ground truth fields: Expected findings only specify file_path, line_start, and category. We don't specify severity, title, or description because those are subjective — different agents might reasonably assign different severities to the same issue.
Positive and negative examples in the same file: The sql_injection_basic test includes both a vulnerable function (get_user with f-string interpolation) and a safe function (safe_get_user with parameterized query). The system should flag line 5 but NOT flag line 10. This tests both recall (did it find the bug?) and precision (did it avoid flagging the safe code?).

Interview talking point: "Each evaluation test case is a self-contained JSON file with a PR diff, full file contents, and ground truth findings. The ground truth specifies file, line, and category — but not severity or description, because those are subjective. This design lets us test detection accuracy without penalizing agents for reasonable differences in how they describe the same issue."

Step 2: Build the Metrics Module (tests/eval/metrics.py)

What we did: Created dataclasses for per-PR and aggregate evaluation results.

EvalResult — Per-PR Metrics

@dataclass
class EvalResult:
    """Result of evaluating one PR against ground truth."""

    pr_id: str
    true_positives: int = 0
    false_positives: int = 0
    false_negatives: int = 0
    latency_ms: int = 0

    @property
    def precision(self) -> float:
        total = self.true_positives + self.false_positives
        return self.true_positives / total if total > 0 else 1.0

    @property
    def recall(self) -> float:
        total = self.true_positives + self.false_negatives
        return self.true_positives / total if total > 0 else 1.0

    @property
    def f1(self) -> float:
        p, r = self.precision, self.recall
        return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

Edge case handling:

If there are no detections at all (TP=0, FP=0): precision defaults to 1.0 (nothing was flagged, so nothing was wrong — vacuously true)
If there are no expected findings (TP=0, FN=0): recall defaults to 1.0 (nothing was expected, so nothing was missed)
If precision + recall = 0: F1 defaults to 0.0 (avoid division by zero)

Why precision defaults to 1.0 when TP + FP = 0? This is the convention for "nothing flagged" — since no false positives were produced, precision is perfect. This matters for clean test cases (PRs with no issues) where the correct behavior is to flag nothing.

EvalSummary — Aggregate Metrics

@dataclass
class EvalSummary:
    """Aggregate metrics across all evaluated PRs."""

    results: list[EvalResult] = field(default_factory=list)

    @property
    def avg_precision(self) -> float:
        if not self.results:
            return 0.0
        return sum(r.precision for r in self.results) / len(self.results)

    @property
    def avg_recall(self) -> float:
        if not self.results:
            return 0.0
        return sum(r.recall for r in self.results) / len(self.results)

    @property
    def avg_f1(self) -> float:
        if not self.results:
            return 0.0
        return sum(r.f1 for r in self.results) / len(self.results)

    @property
    def latency_p50(self) -> int:
        if not self.results:
            return 0
        latencies = sorted(r.latency_ms for r in self.results)
        return latencies[len(latencies) // 2]

    @property
    def latency_p95(self) -> int:
        if not self.results:
            return 0
        latencies = sorted(r.latency_ms for r in self.results)
        idx = int(len(latencies) * 0.95)
        return latencies[min(idx, len(latencies) - 1)]

    def summary(self) -> str:
        return (
            f"Evaluation Summary ({len(self.results)} PRs)\n"
            f"  Precision: {self.avg_precision:.1%}\n"
            f"  Recall:    {self.avg_recall:.1%}\n"
            f"  F1 Score:  {self.avg_f1:.1%}\n"
            f"  Latency:   p50={self.latency_p50}ms, p95={self.latency_p95}ms\n"
        )

Latency percentiles explained:

p50 (median): The typical case. 50% of reviews complete faster than this.
p95: The worst-case (within reason). 95% of reviews complete faster than this. The remaining 5% are outliers (cold starts, network issues).

Why p50/p95 and not average latency? Averages are misleading for latency because outliers skew them heavily. If 9 reviews take 1 second and 1 review takes 30 seconds (cold start), the average is 3.9 seconds — but the typical experience is 1 second. p50 shows the typical case; p95 shows the tail.

Interview talking point: "We track p50 and p95 latency rather than mean because latency distributions are typically long-tailed. A single cold start can double the mean without affecting the experience for 95% of users. p50 tells us 'what does a typical review feel like?' and p95 tells us 'what's the worst experience we should plan for?'"

Step 3: Build the Evaluation Runner (tests/eval/run_eval.py)

What we did: Created the main evaluation script that runs the full pipeline on each test case and compares results against ground truth.

async def evaluate_single_pr(test_case: dict) -> EvalResult:
    """
    Run the pipeline on one test PR and compare against ground truth.

    A finding is considered a true positive if it matches an expected
    finding on the same file_path and within 3 lines of the expected line.
    """
    from app.agents.security_agent import SecurityAgent
    from app.agents.performance_agent import PerformanceAgent
    from app.agents.style_agent import StyleAgent
    from app.agents.synthesizer import synthesize
    from app.github.client import PRData

    pr_data = PRData(
        repo_full_name="eval/test",
        pr_number=0,
        commit_sha="eval",
        title=test_case.get("pr_id", "eval"),
        diff=test_case["diff"],
        changed_files=[],
        file_contents=test_case.get("file_contents", {}),
    )

    start = time.time()

    # Run all agents (same as production pipeline)
    security = SecurityAgent()
    performance = PerformanceAgent()
    style = StyleAgent()

    sec_findings, perf_findings, style_findings = await asyncio.gather(
        security.review(pr_data),
        performance.review(pr_data),
        style.review(pr_data),
    )

    review = synthesize(sec_findings, perf_findings, style_findings)
    elapsed_ms = int((time.time() - start) * 1000)

Key design decisions:

Same pipeline as production: The evaluation runs the exact same code path — same agents, same synthesizer, same deduplication. This ensures we're measuring the real system, not a simplified version.
Lazy imports: Agent classes are imported inside the function, not at module level. This prevents import errors when running the evaluation harness in environments where not all dependencies are installed.

Step 4: Implement Ground Truth Matching

The matching algorithm:

    # Compare against ground truth
    expected = test_case.get("expected_findings", [])
    actual = review.findings

    matched_expected = set()
    matched_actual = set()

    for i, exp in enumerate(expected):
        for j, act in enumerate(actual):
            if j in matched_actual:
                continue
            # Match: same file, within 3 lines, same category
            if (
                act.file_path == exp["file_path"]
                and abs(act.line_start - exp["line_start"]) <= 3
                and act.category == exp.get("category", act.category)
            ):
                matched_expected.add(i)
                matched_actual.add(j)
                break

    tp = len(matched_expected)
    fp = len(actual) - len(matched_actual)
    fn = len(expected) - len(matched_expected)

The 3-line tolerance: A finding is considered a true positive if it matches an expected finding with:

Same file path — exact string match
Within 3 lines — abs(actual_line - expected_line) <= 3
Same category — if the ground truth specifies a category, it must match

Why 3-line tolerance instead of exact line match? LLMs sometimes report the line where the vulnerability is used (line 6: conn.execute(query)) rather than where it's defined (line 5: query = f"SELECT..."). Both are correct — they just point to different parts of the same vulnerability. The 3-line tolerance allows for this variation without penalizing the system.

Why not 0-line tolerance? Too strict — minor differences in how the LLM interprets line numbers would cause false negatives in the evaluation, even when the system correctly identified the issue.

Why not 10-line tolerance? Too loose — a finding 10 lines away might be a completely different issue. The 3-line window is calibrated to allow reasonable variation while still requiring the finding to be "in the right neighborhood."

Bipartite matching: Each expected finding can match at most one actual finding, and vice versa. The matched_actual set prevents double-counting. This is a greedy (not optimal) matching — for a small number of findings per PR, the greedy approach is equivalent to optimal in practice.

Interview talking point: "We use a 3-line tolerance for ground truth matching because LLMs may point to slightly different lines for the same vulnerability — the definition vs. the usage. This is calibrated to allow reasonable variation without being so loose that different issues get matched together. It's similar to how NLP evaluation uses token-level F1 with partial overlap."

Step 5: Build the Evaluation Runner Loop

async def run_evaluation():
    """Run evaluation on all test cases in the dataset directory."""
    dataset_dir = Path(__file__).parent / "dataset"

    if not dataset_dir.exists() or not list(dataset_dir.glob("*.json")):
        print("No evaluation dataset found.")
        print("Create JSON files in tests/eval/dataset/")
        return

    summary = EvalSummary()

    for test_file in sorted(dataset_dir.glob("*.json")):
        print(f"Evaluating: {test_file.name}...")
        test_case = json.loads(test_file.read_text())
        result = await evaluate_single_pr(test_case)
        summary.results.append(result)
        print(f"  P={result.precision:.0%} R={result.recall:.0%} "
              f"F1={result.f1:.0%} ({result.latency_ms}ms)")

    print("\n" + summary.summary())


if __name__ == "__main__":
    asyncio.run(run_evaluation())

Usage:

python -m tests.eval.run_eval

Example output:

Evaluating: sql_injection_basic.json...
  P=100% R=100% F1=100% (4200ms)

Evaluation Summary (1 PRs)
  Precision: 100.0%
  Recall:    100.0%
  F1 Score:  100.0%
  Latency:   p50=4200ms, p95=4200ms

Sorted glob ensures deterministic ordering: Test cases run in alphabetical order, making the evaluation reproducible. Adding a new test case doesn't change the order of existing ones.

Step 6: Polish the README

What we did: Wrote a comprehensive README.md that serves as the project's public face.

README structure:

Section	Content	Why
Title + tagline	"Multi-agent code review system..."	First impression — what it does in one sentence
How It Works	ASCII flowchart	Visual architecture overview
What Each Agent Does	Table with focus, tools, examples	Quick reference for each agent's capabilities
Tech Stack	Table: layer, technology, why	Justifies every technology choice
Quick Start	Setup commands + env vars	Get running in 2 minutes
Architecture	4 layers + design patterns	Technical depth for senior reviewers
Test Results	PR #4 output	Concrete evidence that it works
Running Tests	`pytest` command	How to verify locally
Project Structure	Directory tree	Codebase navigation
Documentation	Links to weekly docs	Deep-dive references

Design principles for the README:

Lead with the value proposition: The first sentence explains WHAT the system does and WHY it matters — "reviews PRs the way a senior engineering team would."
Show, don't tell: The ASCII flowchart conveys the architecture faster than paragraphs of text. The test results section shows real output, not theoretical claims.
Quick Start in under 30 seconds of reading: Clone, install, configure, run — four commands. Environment variables listed explicitly so developers don't have to hunt.
Architecture section names the patterns: "Template Method," "Structured Output," "Fail-Open Cache," "Background Tasks," "Parallel Execution." These are interview keywords that demonstrate systems design knowledge.
Links to deep dives: Each weekly doc is linked for readers who want implementation details beyond the README overview.

Interview talking point: "The README is structured for three audiences: managers who read the first two sections and move on, developers who read Quick Start and Architecture, and interviewers who want to see design patterns and test results. Each section is self-contained — you don't need to read the whole thing to get value."

Architecture Patterns Used

Pattern	Where	Why
Ground Truth Evaluation	`run_eval.py`	Objective quality measurement against known-correct answers
Fuzzy Matching	3-line tolerance	Handles legitimate variation in LLM line number reporting
Greedy Bipartite Matching	TP/FP/FN computation	Each expected finding matches at most one actual finding
Percentile-based Latency	p50/p95 in `metrics.py`	Robust to outliers, standard industry practice
Self-contained Test Fixtures	JSON dataset files	Reproducible evaluation without external dependencies
Dataclass with Properties	`EvalResult`, `EvalSummary`	Computed metrics derived from raw counts, always consistent

Files Created / Modified in Week 9

File	Purpose
`tests/eval/metrics.py`	EvalResult + EvalSummary dataclasses with P/R/F1/latency
`tests/eval/run_eval.py`	Evaluation harness runner
`tests/eval/dataset/sql_injection_basic.json`	Test case: SQL injection with ground truth
`README.md`	Comprehensive project documentation for public release

Interview Talking Points Summary

"How do you know your system works?" "We built an evaluation harness that runs the full pipeline against test PRs with known vulnerabilities and measures precision, recall, and F1. Each test case is a self-contained JSON file with a diff, file contents, and ground truth findings. The harness uses 3-line tolerance for matching because LLMs may point to slightly different lines for the same issue."
"Why precision AND recall? Why not just one?" "A system that flags nothing has perfect precision but zero recall. A system that flags everything has perfect recall but near-zero precision. We need both: precision measures trust (developers stop reading if there are too many false positives), and recall measures safety (missing a real vulnerability is worse than a false alarm)."
"What's the 3-line tolerance about?" "LLMs may report the line where a vulnerability is defined versus the line where it's used. Both are correct — they reference the same underlying issue. The 3-line window allows for this variation without being so loose that different issues get matched together. It's similar to how NLP evaluation uses partial overlap metrics."
"How would you expand the evaluation?" "Add more test cases covering different vulnerability types (XSS, SSRF, auth bypass), different languages (the current dataset is Python), and edge cases (false positive traps — code that looks vulnerable but isn't). We could also add severity correctness as a metric: did the system assign the right severity level?"
"Why track p50 and p95 latency?" "Average latency is misleading because cold starts skew it. p50 tells us the typical user experience, p95 tells us the worst case we should plan for. In production, we'd set SLOs against these: 'p50 under 10 seconds, p95 under 30 seconds.'"

Documentation written 2026-03-20 as part of Week 9 completion.