Spaces:
Sleeping
Sleeping
File size: 24,634 Bytes
4b445f6 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 | # Week 9: Evaluation Harness & Project Polish β Detailed Documentation
> **Goal:** Build an evaluation harness that measures review quality against ground truth, compute precision/recall/F1, track latency percentiles, and polish the README for public release.
> **Status:** Complete β Evaluation framework operational, README finalized
> **Date:** 2026-03-20
> **Key Metric:** Ground truth matching with 3-line tolerance, precision/recall/F1 per test case
> **Deliverables:** Evaluation harness, test dataset, production-quality README
---
## What We Built
Week 9 adds two critical capabilities that transform Ninja Code Guard from a "works on my
machine" prototype into a project ready for production evaluation and public presentation.
1. **Evaluation Harness** β A framework that runs the full review pipeline against test PRs
with known issues (ground truth) and measures precision, recall, F1, and latency. This
answers the question every interviewer asks: "How do you know your system actually works?"
2. **README Polish** β A comprehensive README.md that serves as the project's public face,
covering architecture, setup, usage, and test results.
```
ββββββββββββββββββββββββββββββββββββ
β Evaluation Harness β
β tests/eval/ β
β β
β ββββββββββββββββββββββββββββββ β
β β Dataset (JSON files) β β
β β sql_injection_basic.json β β
β β n_plus_one_query.json β β Each file contains:
β β hardcoded_secret.json β β - PR diff
β β ... β β - File contents
β ββββββββββββ¬ββββββββββββββββββ β - Expected findings
β β β (ground truth)
β βΌ β
β ββββββββββββββββββββββββββββββ β
β β run_eval.py β β
β β For each test case: β β
β β 1. Run 3 agents parallel β β
β β 2. Synthesize findings β β
β β 3. Match vs ground truth β β
β β 4. Compute TP/FP/FN β β
β ββββββββββββ¬ββββββββββββββββββ β
β β β
β βΌ β
β ββββββββββββββββββββββββββββββ β
β β metrics.py β β
β β Per-PR: P, R, F1, latency β β
β β Aggregate: avg P/R/F1 β β
β β Latency: p50, p95 β β
β ββββββββββββββββββββββββββββββ β
ββββββββββββββββββββββββββββββββββββ
```
---
## Concept: Why Evaluation Matters
### The Problem: "It Seems to Work" Is Not Enough
Without systematic evaluation, we're relying on anecdotal evidence: "I ran it on PR #4
and it found the SQL injection." But this tells us nothing about:
- **Precision** β Of the issues it flagged, how many are real? (Are there false positives?)
- **Recall** β Of the real issues, how many did it find? (Are there false negatives?)
- **Consistency** β Does it work on different code patterns, or just the ones we tested?
- **Latency** β How long does a review take? Is it fast enough for a real workflow?
The evaluation harness answers all of these with reproducible, quantitative metrics.
### The Three Core Metrics
```
All items in test PR
ββββββββββββββββββββββββββββββββββββββ
β β
β Ground Truth Detected β
β (expected) (actual) β
β ββββββββββββ ββββββββββββ β
β β β β β β
β β FN β TP β FP β β
β β (missed) β β (false β β
β β β β alarm) β β
β ββββββββββββ ββββββββββββ β
β β
ββββββββββββββββββββββββββββββββββββββ
Precision = TP / (TP + FP) β "Of what we flagged, how much is real?"
Recall = TP / (TP + FN) β "Of what's real, how much did we find?"
F1 = 2*P*R / (P+R) β "Harmonic mean β balance of both"
```
**Why F1 and not just accuracy?**
Accuracy (TP + TN) / total is misleading for imbalanced problems. A PR with 100 lines and
1 vulnerability: a system that says "everything is fine" has 99% accuracy but 0% recall.
F1 balances precision and recall, penalizing systems that sacrifice one for the other.
**Interview talking point:** "We measure precision, recall, and F1 rather than accuracy
because code review is an imbalanced classification problem β most lines are fine, only a
few have issues. A system that flags nothing has 0% recall but near-100% precision. A system
that flags everything has 100% recall but near-0% precision. F1 forces us to balance both."
---
## Step-by-Step Implementation Log
### Step 1: Design the Evaluation Dataset Format
**What we did:** Defined a JSON schema for test cases with known vulnerabilities.
```json
{
"pr_id": "sql_injection_basic",
"diff": "diff --git a/app.py b/app.py\n--- /dev/null\n+++ b/app.py\n@@ -0,0 +1,10 @@\n+import sqlite3\n+\n+def get_user(user_id):\n+ conn = sqlite3.connect('users.db')\n+ query = f\"SELECT * FROM users WHERE id = {user_id}\"\n+ return conn.execute(query).fetchone()\n+\n+def safe_get_user(user_id):\n+ conn = sqlite3.connect('users.db')\n+ return conn.execute('SELECT * FROM users WHERE id = ?', (user_id,)).fetchone()\n",
"file_contents": {
"app.py": "import sqlite3\n\ndef get_user(user_id):\n conn = sqlite3.connect('users.db')\n query = f\"SELECT * FROM users WHERE id = {user_id}\"\n return conn.execute(query).fetchone()\n\ndef safe_get_user(user_id):\n conn = sqlite3.connect('users.db')\n return conn.execute('SELECT * FROM users WHERE id = ?', (user_id,)).fetchone()\n"
},
"expected_findings": [
{
"file_path": "app.py",
"line_start": 5,
"category": "sql_injection"
}
]
}
```
**Each test case contains four fields:**
| Field | Purpose |
|-------|---------|
| `pr_id` | Unique identifier for this test case (used in logging) |
| `diff` | The PR diff in unified diff format (what GitHub sends) |
| `file_contents` | Full file source code (used by agents for analysis) |
| `expected_findings` | Ground truth: known issues with file, line, and category |
**Design decisions:**
1. **Self-contained JSON:** Each test case includes both the diff and full file contents.
This means the evaluation can run without GitHub API access β no network dependencies,
fully reproducible.
2. **Minimal ground truth fields:** Expected findings only specify `file_path`,
`line_start`, and `category`. We don't specify severity, title, or description because
those are subjective β different agents might reasonably assign different severities
to the same issue.
3. **Positive and negative examples in the same file:** The `sql_injection_basic` test
includes both a vulnerable function (`get_user` with f-string interpolation) and a safe
function (`safe_get_user` with parameterized query). The system should flag line 5
but NOT flag line 10. This tests both recall (did it find the bug?) and precision
(did it avoid flagging the safe code?).
**Interview talking point:** "Each evaluation test case is a self-contained JSON file with
a PR diff, full file contents, and ground truth findings. The ground truth specifies file,
line, and category β but not severity or description, because those are subjective. This
design lets us test detection accuracy without penalizing agents for reasonable
differences in how they describe the same issue."
### Step 2: Build the Metrics Module (tests/eval/metrics.py)
**What we did:** Created dataclasses for per-PR and aggregate evaluation results.
#### EvalResult β Per-PR Metrics
```python
@dataclass
class EvalResult:
"""Result of evaluating one PR against ground truth."""
pr_id: str
true_positives: int = 0
false_positives: int = 0
false_negatives: int = 0
latency_ms: int = 0
@property
def precision(self) -> float:
total = self.true_positives + self.false_positives
return self.true_positives / total if total > 0 else 1.0
@property
def recall(self) -> float:
total = self.true_positives + self.false_negatives
return self.true_positives / total if total > 0 else 1.0
@property
def f1(self) -> float:
p, r = self.precision, self.recall
return 2 * p * r / (p + r) if (p + r) > 0 else 0.0
```
**Edge case handling:**
- If there are no detections at all (TP=0, FP=0): precision defaults to 1.0 (nothing
was flagged, so nothing was wrong β vacuously true)
- If there are no expected findings (TP=0, FN=0): recall defaults to 1.0 (nothing was
expected, so nothing was missed)
- If precision + recall = 0: F1 defaults to 0.0 (avoid division by zero)
**Why precision defaults to 1.0 when TP + FP = 0?**
This is the convention for "nothing flagged" β since no false positives were produced,
precision is perfect. This matters for clean test cases (PRs with no issues) where the
correct behavior is to flag nothing.
#### EvalSummary β Aggregate Metrics
```python
@dataclass
class EvalSummary:
"""Aggregate metrics across all evaluated PRs."""
results: list[EvalResult] = field(default_factory=list)
@property
def avg_precision(self) -> float:
if not self.results:
return 0.0
return sum(r.precision for r in self.results) / len(self.results)
@property
def avg_recall(self) -> float:
if not self.results:
return 0.0
return sum(r.recall for r in self.results) / len(self.results)
@property
def avg_f1(self) -> float:
if not self.results:
return 0.0
return sum(r.f1 for r in self.results) / len(self.results)
@property
def latency_p50(self) -> int:
if not self.results:
return 0
latencies = sorted(r.latency_ms for r in self.results)
return latencies[len(latencies) // 2]
@property
def latency_p95(self) -> int:
if not self.results:
return 0
latencies = sorted(r.latency_ms for r in self.results)
idx = int(len(latencies) * 0.95)
return latencies[min(idx, len(latencies) - 1)]
def summary(self) -> str:
return (
f"Evaluation Summary ({len(self.results)} PRs)\n"
f" Precision: {self.avg_precision:.1%}\n"
f" Recall: {self.avg_recall:.1%}\n"
f" F1 Score: {self.avg_f1:.1%}\n"
f" Latency: p50={self.latency_p50}ms, p95={self.latency_p95}ms\n"
)
```
**Latency percentiles explained:**
- **p50 (median):** The typical case. 50% of reviews complete faster than this.
- **p95:** The worst-case (within reason). 95% of reviews complete faster than this.
The remaining 5% are outliers (cold starts, network issues).
**Why p50/p95 and not average latency?**
Averages are misleading for latency because outliers skew them heavily. If 9 reviews take
1 second and 1 review takes 30 seconds (cold start), the average is 3.9 seconds β but the
typical experience is 1 second. p50 shows the typical case; p95 shows the tail.
**Interview talking point:** "We track p50 and p95 latency rather than mean because latency
distributions are typically long-tailed. A single cold start can double the mean without
affecting the experience for 95% of users. p50 tells us 'what does a typical review feel
like?' and p95 tells us 'what's the worst experience we should plan for?'"
### Step 3: Build the Evaluation Runner (tests/eval/run_eval.py)
**What we did:** Created the main evaluation script that runs the full pipeline on each
test case and compares results against ground truth.
```python
async def evaluate_single_pr(test_case: dict) -> EvalResult:
"""
Run the pipeline on one test PR and compare against ground truth.
A finding is considered a true positive if it matches an expected
finding on the same file_path and within 3 lines of the expected line.
"""
from app.agents.security_agent import SecurityAgent
from app.agents.performance_agent import PerformanceAgent
from app.agents.style_agent import StyleAgent
from app.agents.synthesizer import synthesize
from app.github.client import PRData
pr_data = PRData(
repo_full_name="eval/test",
pr_number=0,
commit_sha="eval",
title=test_case.get("pr_id", "eval"),
diff=test_case["diff"],
changed_files=[],
file_contents=test_case.get("file_contents", {}),
)
start = time.time()
# Run all agents (same as production pipeline)
security = SecurityAgent()
performance = PerformanceAgent()
style = StyleAgent()
sec_findings, perf_findings, style_findings = await asyncio.gather(
security.review(pr_data),
performance.review(pr_data),
style.review(pr_data),
)
review = synthesize(sec_findings, perf_findings, style_findings)
elapsed_ms = int((time.time() - start) * 1000)
```
**Key design decisions:**
1. **Same pipeline as production:** The evaluation runs the exact same code path β same
agents, same synthesizer, same deduplication. This ensures we're measuring the real
system, not a simplified version.
2. **Lazy imports:** Agent classes are imported inside the function, not at module level.
This prevents import errors when running the evaluation harness in environments where
not all dependencies are installed.
### Step 4: Implement Ground Truth Matching
**The matching algorithm:**
```python
# Compare against ground truth
expected = test_case.get("expected_findings", [])
actual = review.findings
matched_expected = set()
matched_actual = set()
for i, exp in enumerate(expected):
for j, act in enumerate(actual):
if j in matched_actual:
continue
# Match: same file, within 3 lines, same category
if (
act.file_path == exp["file_path"]
and abs(act.line_start - exp["line_start"]) <= 3
and act.category == exp.get("category", act.category)
):
matched_expected.add(i)
matched_actual.add(j)
break
tp = len(matched_expected)
fp = len(actual) - len(matched_actual)
fn = len(expected) - len(matched_expected)
```
**The 3-line tolerance:**
A finding is considered a true positive if it matches an expected finding with:
1. **Same file path** β exact string match
2. **Within 3 lines** β `abs(actual_line - expected_line) <= 3`
3. **Same category** β if the ground truth specifies a category, it must match
**Why 3-line tolerance instead of exact line match?**
LLMs sometimes report the line where the vulnerability is used (line 6: `conn.execute(query)`)
rather than where it's defined (line 5: `query = f"SELECT..."`). Both are correct β they
just point to different parts of the same vulnerability. The 3-line tolerance allows for
this variation without penalizing the system.
**Why not 0-line tolerance?** Too strict β minor differences in how the LLM interprets
line numbers would cause false negatives in the evaluation, even when the system correctly
identified the issue.
**Why not 10-line tolerance?** Too loose β a finding 10 lines away might be a completely
different issue. The 3-line window is calibrated to allow reasonable variation while still
requiring the finding to be "in the right neighborhood."
**Bipartite matching:** Each expected finding can match at most one actual finding, and
vice versa. The `matched_actual` set prevents double-counting. This is a greedy (not
optimal) matching β for a small number of findings per PR, the greedy approach is
equivalent to optimal in practice.
**Interview talking point:** "We use a 3-line tolerance for ground truth matching because
LLMs may point to slightly different lines for the same vulnerability β the definition vs.
the usage. This is calibrated to allow reasonable variation without being so loose that
different issues get matched together. It's similar to how NLP evaluation uses token-level
F1 with partial overlap."
### Step 5: Build the Evaluation Runner Loop
```python
async def run_evaluation():
"""Run evaluation on all test cases in the dataset directory."""
dataset_dir = Path(__file__).parent / "dataset"
if not dataset_dir.exists() or not list(dataset_dir.glob("*.json")):
print("No evaluation dataset found.")
print("Create JSON files in tests/eval/dataset/")
return
summary = EvalSummary()
for test_file in sorted(dataset_dir.glob("*.json")):
print(f"Evaluating: {test_file.name}...")
test_case = json.loads(test_file.read_text())
result = await evaluate_single_pr(test_case)
summary.results.append(result)
print(f" P={result.precision:.0%} R={result.recall:.0%} "
f"F1={result.f1:.0%} ({result.latency_ms}ms)")
print("\n" + summary.summary())
if __name__ == "__main__":
asyncio.run(run_evaluation())
```
**Usage:**
```bash
python -m tests.eval.run_eval
```
**Example output:**
```
Evaluating: sql_injection_basic.json...
P=100% R=100% F1=100% (4200ms)
Evaluation Summary (1 PRs)
Precision: 100.0%
Recall: 100.0%
F1 Score: 100.0%
Latency: p50=4200ms, p95=4200ms
```
**Sorted glob ensures deterministic ordering:** Test cases run in alphabetical order,
making the evaluation reproducible. Adding a new test case doesn't change the order
of existing ones.
### Step 6: Polish the README
**What we did:** Wrote a comprehensive README.md that serves as the project's public face.
**README structure:**
| Section | Content | Why |
|---------|---------|-----|
| Title + tagline | "Multi-agent code review system..." | First impression β what it does in one sentence |
| How It Works | ASCII flowchart | Visual architecture overview |
| What Each Agent Does | Table with focus, tools, examples | Quick reference for each agent's capabilities |
| Tech Stack | Table: layer, technology, why | Justifies every technology choice |
| Quick Start | Setup commands + env vars | Get running in 2 minutes |
| Architecture | 4 layers + design patterns | Technical depth for senior reviewers |
| Test Results | PR #4 output | Concrete evidence that it works |
| Running Tests | `pytest` command | How to verify locally |
| Project Structure | Directory tree | Codebase navigation |
| Documentation | Links to weekly docs | Deep-dive references |
**Design principles for the README:**
1. **Lead with the value proposition:** The first sentence explains WHAT the system does
and WHY it matters β "reviews PRs the way a senior engineering team would."
2. **Show, don't tell:** The ASCII flowchart conveys the architecture faster than
paragraphs of text. The test results section shows real output, not theoretical claims.
3. **Quick Start in under 30 seconds of reading:** Clone, install, configure, run β four
commands. Environment variables listed explicitly so developers don't have to hunt.
4. **Architecture section names the patterns:** "Template Method," "Structured Output,"
"Fail-Open Cache," "Background Tasks," "Parallel Execution." These are interview
keywords that demonstrate systems design knowledge.
5. **Links to deep dives:** Each weekly doc is linked for readers who want implementation
details beyond the README overview.
**Interview talking point:** "The README is structured for three audiences: managers who
read the first two sections and move on, developers who read Quick Start and Architecture,
and interviewers who want to see design patterns and test results. Each section is
self-contained β you don't need to read the whole thing to get value."
---
## Architecture Patterns Used
| Pattern | Where | Why |
|---------|-------|-----|
| **Ground Truth Evaluation** | `run_eval.py` | Objective quality measurement against known-correct answers |
| **Fuzzy Matching** | 3-line tolerance | Handles legitimate variation in LLM line number reporting |
| **Greedy Bipartite Matching** | TP/FP/FN computation | Each expected finding matches at most one actual finding |
| **Percentile-based Latency** | p50/p95 in `metrics.py` | Robust to outliers, standard industry practice |
| **Self-contained Test Fixtures** | JSON dataset files | Reproducible evaluation without external dependencies |
| **Dataclass with Properties** | `EvalResult`, `EvalSummary` | Computed metrics derived from raw counts, always consistent |
---
## Files Created / Modified in Week 9
| File | Purpose |
|------|---------|
| `tests/eval/metrics.py` | EvalResult + EvalSummary dataclasses with P/R/F1/latency |
| `tests/eval/run_eval.py` | Evaluation harness runner |
| `tests/eval/dataset/sql_injection_basic.json` | Test case: SQL injection with ground truth |
| `README.md` | Comprehensive project documentation for public release |
---
## Interview Talking Points Summary
1. **"How do you know your system works?"**
"We built an evaluation harness that runs the full pipeline against test PRs with known
vulnerabilities and measures precision, recall, and F1. Each test case is a self-contained
JSON file with a diff, file contents, and ground truth findings. The harness uses 3-line
tolerance for matching because LLMs may point to slightly different lines for the same
issue."
2. **"Why precision AND recall? Why not just one?"**
"A system that flags nothing has perfect precision but zero recall. A system that flags
everything has perfect recall but near-zero precision. We need both: precision measures
trust (developers stop reading if there are too many false positives), and recall
measures safety (missing a real vulnerability is worse than a false alarm)."
3. **"What's the 3-line tolerance about?"**
"LLMs may report the line where a vulnerability is defined versus the line where it's
used. Both are correct β they reference the same underlying issue. The 3-line window
allows for this variation without being so loose that different issues get matched
together. It's similar to how NLP evaluation uses partial overlap metrics."
4. **"How would you expand the evaluation?"**
"Add more test cases covering different vulnerability types (XSS, SSRF, auth bypass),
different languages (the current dataset is Python), and edge cases (false positive
traps β code that looks vulnerable but isn't). We could also add severity correctness
as a metric: did the system assign the right severity level?"
5. **"Why track p50 and p95 latency?"**
"Average latency is misleading because cold starts skew it. p50 tells us the typical
user experience, p95 tells us the worst case we should plan for. In production, we'd
set SLOs against these: 'p50 under 10 seconds, p95 under 30 seconds.'"
---
*Documentation written 2026-03-20 as part of Week 9 completion.*
|