Spaces:

NinjainPJs
/

ninja-code-guard

Sleeping

File size: 34,932 Bytes

4b445f6

# Week 4: Performance Agent — Detailed Documentation

> **Goal:** Build the Performance Agent — LLM + radon complexity analysis to find real performance issues.
> **Status:** Complete — Live-tested on PR #4 with intentionally slow code
> **Date:** 2026-03-20
> **Test PR:** github.com/ninjacode911/codeguard-test/pull/4
> **Result:** 3 findings (quadratic loop, blocking I/O, complex function), Health Score 65/100

---

## What We Built

The Performance Agent is the second domain agent. It combines **radon cyclomatic complexity
analysis** with **LLM reasoning** (Groq Llama-3.3-70B) to find performance issues: quadratic
algorithms, N+1 queries, blocking I/O in async code, missing caching, and more.

The key insight this week: because we invested in the BaseAgent Template Method pattern
in Week 3, the entire PerformanceAgent is only **~30 lines of code**. Everything else is
inherited.

```
PR Diff + File Contents
        |
        v
+-------------------------------+
|     Static Analysis           |  Radon: 1 finding (complex function, grade D)
|  Radon (cyclomatic complexity)|  Time: ~0.5 seconds
+-------------+-----------------+
              | tool output as text
              v
+-------------------------------+
|     Groq LLM                  |  Model: llama-3.3-70b-versatile
|  System prompt: Perf engineer |  Input: diff + files + radon results
|  Structured output: JSON      |  Output: 3 Finding objects
|  Temperature: 0.1             |  Time: ~2.5 seconds
+-------------+-----------------+
              | Finding[]
              v
+-------------------------------+
|     Comment Formatter         |  Health Score: 65/100
|  Summary + inline comments    |  Recommendation: Needs Work
|  Posted to GitHub PR          |  Severity table + details
+-------------------------------+
```

**Contrast with Week 3:**
The architecture diagram is nearly identical to the Security Agent's. That is the entire
point. The *flow* is the same; only the *analysis* is different. This is the Template Method
pattern paying off.

---

## Why Performance Review Matters

Most code review (human or automated) focuses on correctness and style. Performance issues
slip through because they are invisible in small-data tests:

- A nested loop that works fine on 10 items takes 10 seconds on 10,000 items
- An ORM call inside a for-loop makes 1 query during development but 10,000 in production
- A blocking `requests.get()` inside an `async def` works in testing but kills throughput
  under concurrent load

The Performance Agent catches these issues *before* they reach production, when they are
cheap to fix. The key difference from a linter: it estimates the *impact* at scale, not
just flags a pattern.

---

## Step-by-Step Implementation Log

### Step 1: Install Radon

```bash
pip install radon
```

| Package | Purpose |
|---------|---------|
| `radon` | Computes cyclomatic complexity, Halstead metrics, and maintainability index for Python code |

Radon is a pure-Python tool, so it installs without native compilation. It runs locally
(no API calls), making it fast and free.

---

### Step 2: The Template Method Payoff (app/agents/performance_agent.py)

**This is the most important concept of the week.** In Week 3, we built `BaseAgent` with
the Template Method pattern. This week, that investment pays off dramatically.

Here is the **entire** PerformanceAgent implementation:

```python
class PerformanceAgent(BaseAgent):

    @property
    def agent_name(self) -> str:
        return "performance"

    @property
    def system_prompt(self) -> str:
        prompt_path = (
            Path(__file__).resolve().parent.parent.parent
            / "prompts"
            / "performance_system.md"
        )
        return prompt_path.read_text(encoding="utf-8")

    async def run_static_analysis(self, pr_data: PRData) -> str:
        """Run radon complexity analysis on changed Python files."""
        radon_output = await run_radon(pr_data.file_contents)
        return radon_output if radon_output else ""
```

That is it. ~30 lines including the docstring and imports.

**Why so short?** Every piece of shared logic lives in `BaseAgent`:

```
BaseAgent (base_agent.py — ~200 lines)
 |
 |-- __init__()          → ChatGroq setup, temperature, model config
 |-- review()            → The Template Method (full algorithm skeleton)
 |-- _build_prompt()     → ChatPromptTemplate with system + human messages
 |-- _convert_to_findings() → LLM output → Finding objects with validation
 |-- _format_file_contents() → File contents → code blocks for LLM prompt
 |-- run_static_analysis()   → Default: no-op. Override in subclasses
 |
 +---> SecurityAgent     → agent_name, system_prompt, run_static_analysis (Bandit + detect-secrets)
 +---> PerformanceAgent  → agent_name, system_prompt, run_static_analysis (Radon)
 +---> StyleAgent (Week 5) → agent_name, system_prompt, run_static_analysis (Ruff/pylint)
```

**The algorithm skeleton (review method) never changes:**
1. Run static analysis tools (subclass decides which)
2. Build prompt with diff + files + tool output
3. Call the LLM with structured output
4. Convert to Finding objects
5. Log timing and return

**What the subclass controls (the "template steps"):**
- `agent_name` — used to tag findings so the Synthesizer knows which agent found what
- `system_prompt` — completely different expertise and focus area
- `run_static_analysis()` — completely different tools

**The real-world analogy:** Think of it as a factory assembly line. The conveyor belt
(BaseAgent.review) is the same for every product. But Station 1 (static analysis) uses
different tools and Station 2 (LLM) reads different instruction manuals (system prompts)
depending on what you are building.

**Why not just copy-paste the SecurityAgent and edit it?**
Three agents with copy-pasted code means three places to update when you:
- Change the LLM model (Llama 3.3 to Llama 4)
- Add RAG context support (Week 6)
- Fix a bug in finding conversion
- Change the prompt template structure

With the Template Method, you update the base class once and all agents get the fix.
This is the **Open/Closed Principle** — open for extension (new agents), closed for
modification (existing algorithm stays unchanged).

**Interview talking point:** "The PerformanceAgent is only 30 lines because I used the
Template Method pattern. The base class defines the review algorithm — run tools, build
prompt, call LLM, convert output — and each agent only overrides what is unique: its name,
its system prompt, and its static analysis tools. Adding the Performance Agent required
zero changes to the base class."

---

### Step 3: Radon Cyclomatic Complexity (app/tools/radon_tool.py)

#### What Cyclomatic Complexity Is

Cyclomatic complexity measures the number of **independent execution paths** through a
function. Every `if`, `elif`, `for`, `while`, `and`, `or`, `except`, and ternary operator
adds one to the count.

```
def example(x, y):
    if x > 0:          # +1 branch
        if y > 0:      # +1 branch
            return x+y
        else:
            return x-y
    elif x == 0:       # +1 branch
        return y
    else:
        return -1
# Complexity = 4 (base 1 + 3 branches)
```

**Why it matters for performance:** High complexity often correlates with:
- Deeply nested loops (O(n^k) algorithms hiding inside many conditionals)
- Missed short-circuit opportunities (checking everything when early return is possible)
- Functions doing too much (should be split for both clarity and performance)

Complexity alone does not prove a performance bug, but it is a strong **signal**. When
radon flags a function as grade C or worse, the LLM knows to look harder at that function
for algorithmic issues.

#### Radon Grading Scale

```
 Grade | Complexity | Meaning              | Our Action
-------+------------+----------------------+------------------------------------------
   A   |    1-5     | Simple, low risk     | Ignored — no report
   B   |    6-10    | Moderate             | Ignored — manageable
   C   |   11-15    | High complexity      | FLAGGED — sent to LLM for deeper analysis
   D   |   16-20    | Very high            | FLAGGED — likely perf + maintenance issue
   E   |   21-25    | Extremely complex    | FLAGGED — almost certainly problematic
   F   |    26+     | Unmaintainable       | FLAGGED — refactoring is critical
```

We use the `-n C` flag to tell radon "only show grade C or worse." This filters out the
noise — simple functions that are fine — and only surfaces functions worth investigating.

#### How We Integrate Radon

The integration follows the same **Temp File Pattern** used by Bandit in Week 3:

```
Changed Python files (in memory from GitHub API)
        |
        v
Write to temp directory         ← file_path.parent.mkdir(parents=True)
        |
        v
Run `radon cc -j -n C <dir>`   ← subprocess.run, 30s timeout
        |
        v
Parse JSON output               ← json.loads(result.stdout)
        |
        v
Format as text summary          ← "complex.py:14 — process() complexity=17 (grade D)"
        |
        v
Return string → injected into LLM prompt as "Static Analysis Results"
        |
        v
Temp directory auto-cleaned     ← TemporaryDirectory context manager
```

**The code walkthrough:**

```python
async def run_radon(file_contents: dict[str, str]) -> str:
    # Step 1: Filter to Python files only — radon can't analyze .js, .css, etc.
    python_files = {
        path: content
        for path, content in file_contents.items()
        if path.endswith(".py")
    }

    if not python_files:
        return ""  # Nothing to analyze

    try:
        # Step 2: Write files to a temp directory (radon operates on the filesystem)
        with tempfile.TemporaryDirectory(prefix="ninjacg_radon_") as tmpdir:
            tmpdir_path = Path(tmpdir)

            for filepath, content in python_files.items():
                file_path = tmpdir_path / filepath
                file_path.parent.mkdir(parents=True, exist_ok=True)
                file_path.write_text(content, encoding="utf-8")

            # Step 3: Run radon
            # -j: JSON output (machine-parseable, not human table)
            # -n C: only show grade C or worse (complexity > 10)
            result = subprocess.run(
                ["radon", "cc", "-j", "-n", "C", str(tmpdir_path)],
                capture_output=True,
                text=True,
                timeout=30,
            )

            # Step 4: Parse results
            if not result.stdout.strip() or result.stdout.strip() == "{}":
                return ""  # All functions are grade A or B — nothing to report

            radon_output = json.loads(result.stdout)

            # Step 5: Format findings as human-readable text
            findings = []
            for file_path, functions in radon_output.items():
                # Convert absolute temp path back to the relative PR path
                relative = str(Path(file_path).relative_to(tmpdir)).replace("\\", "/")

                for func in functions:
                    name = func.get("name", "unknown")
                    complexity = func.get("complexity", 0)
                    rank = func.get("rank", "?")
                    lineno = func.get("lineno", 0)
                    findings.append(
                        f"- {relative}:{lineno} — `{name}()` complexity={complexity} (grade {rank})"
                    )

            if not findings:
                return ""

            summary = (
                f"Radon complexity analysis found {len(findings)} high-complexity function(s):\n"
                + "\n".join(findings)
            )
            return summary

    except FileNotFoundError:
        # radon binary not installed — degrade gracefully
        logger.warning("radon not found in PATH — skipping complexity analysis")
        return ""
    except Exception as e:
        logger.warning("Radon analysis failed", error=str(e))
        return ""
```

**Why `-j` (JSON) instead of the default text table?**
Radon's default output is a human-readable table, but parsing tables with regex is fragile.
JSON gives us structured data with exact field names, making the code reliable across
radon versions.

**Why `subprocess.run` instead of radon's Python API?**
Radon has a Python API, but the CLI is simpler to integrate and matches how we integrate
Bandit and detect-secrets. Consistency across tools means less code to maintain. The 30-second
timeout prevents a malformed file from hanging the pipeline.

**The path normalization trick:**
Radon's JSON output uses the absolute temp directory path (`/tmp/ninjacg_radon_abc123/app.py`).
We convert it back to the relative PR path (`app.py`) using `Path.relative_to(tmpdir)`.
The `.replace("\\", "/")` handles Windows paths, ensuring consistent output across platforms.

**Interview talking point:** "Radon measures cyclomatic complexity — the number of
independent paths through a function. We flag grade C or worse (complexity above 10) and
feed those results to the LLM as anchoring context. The LLM then investigates whether
the complexity indicates a real performance issue like a quadratic algorithm, or is just
inherent business logic."

---

### Step 4: Performance System Prompt (prompts/performance_system.md)

The system prompt is the agent's brain. It defines what the LLM looks for, how it reasons,
and how it formats its output. Getting this right is more impactful than any code change.

#### Prompt Structure: 5 Sections

**1. Role Definition**
```
You are a principal backend engineer specializing in systems performance.
You have 10+ years of experience optimizing high-throughput applications,
database query patterns, and distributed systems.
```

Why "principal backend engineer" instead of just "performance expert"? Specificity matters.
A principal engineer has opinions about trade-offs, knows when to optimize and when not to,
and can estimate impact at scale. This framing produces more nuanced findings (fewer false
positives on micro-optimizations).

**2. Scope Boundary**
```
Review the PR diff and file contents for performance issues ONLY.
Do not comment on security vulnerabilities, code style, naming conventions,
or anything outside the performance domain.
```

Without this line, the Performance Agent would comment on SQL injection (that is the
Security Agent's job) and variable naming (that is the Style Agent's job). Scope boundaries
prevent duplicate findings across agents.

**3. Issue Categories (What to Look For)**

The prompt organizes issues by impact level:

| Impact | Category | Example | Why It Matters |
|--------|----------|---------|----------------|
| **High** | N+1 Query | `User.objects.get(id=x)` in a for loop | 1 query becomes 10,000 queries |
| **High** | Blocking I/O in Async | `requests.get()` inside `async def` | Blocks event loop, kills throughput |
| **High** | Unbounded Queries | `SELECT *` without LIMIT | Fetches entire table into memory |
| **High** | Quadratic Algorithms | Nested loop over same collection | O(n^2) — 100M ops at 10K items |
| **Medium** | Missing Caching | Same expensive computation repeated | Wasted CPU/DB resources |
| **Medium** | Wrong Data Structure | `if x in large_list` (O(n)) vs set (O(1)) | 10,000x slower at scale |
| **Medium** | Excessive Memory | Building list when generator works | OOM risk on large datasets |
| **Medium** | Missing DB Indexes | WHERE on non-indexed column | Full table scan on every query |
| **Low** | String Concat in Loop | `result += s` in loop | O(n^2) string copying |
| **Low** | Missing Connection Pool | New DB connection per request | Connection overhead + exhaustion |

Each category includes a concrete example and a fix. This is critical — LLMs produce
better output when shown examples (few-shot prompting within the system prompt).

**4. The Six Rules**

The rules section is where precision engineering happens:

**Rule 1: "ONLY report findings in code that was CHANGED in this PR"**
Without this, the LLM reports issues in unchanged code that happens to be in the file
context. That is annoying to developers — they did not introduce the issue, they should
not be blamed for it.

**Rule 2: "Be precise with line numbers"**
Vague findings ("somewhere in this file") are useless. Exact line numbers enable inline
PR comments that point to the exact problem.

**Rule 3: "Estimate the impact" (THE KEY RULE)**
This is what separates our agent from a basic linter. Linters say "nested loop detected."
Our agent says "This nested loop is O(n^2). With 10K users, it performs 100M iterations.
At 1ms per iteration, that is 100 seconds per request." The developer immediately
understands whether this matters.

Why "estimate the impact" is the most important rule:
- It forces the LLM to reason about scaling behavior, not just pattern-match
- It helps developers prioritize — a quadratic loop on 10 items is fine; on 10K it is not
- It demonstrates deeper understanding in the PR comment (builds trust in the tool)
- It is something no existing linter can do (our competitive advantage)

**Rule 4: "Provide a concrete fix"**
"Use caching" is not helpful. "Wrap this in `@functools.lru_cache(maxsize=128)`" is helpful.
Concrete fixes reduce the developer's effort to act on the finding.

**Rule 5: "Set confidence honestly"**
If the LLM cannot tell how large the dataset is from context, it should say so. A finding
with confidence 0.6 and a note "depends on dataset size" is more useful than a false
certainty of 1.0.

**Rule 6: "Don't flag micro-optimizations"**
`list(map(f, xs))` vs `[f(x) for x in xs]` is not worth a comment. The Performance Agent
should focus on issues that matter in production, not nitpick syntax preferences that
happen to have trivial performance differences.

**5. Output Format**
Matches the `FindingOutput` Pydantic schema exactly. The LLM returns structured JSON with
`cwe_id: null` because performance issues do not have CWE identifiers (CWE is a security
vulnerability classification system).

**Interview talking point:** "The performance prompt is structured around three impact
tiers with concrete examples for each category. The most important rule is 'estimate the
impact' — this forces the LLM to reason about scaling behavior rather than just
pattern-matching. It explains WHY something is slow and at what data size it becomes a
problem, which is something no static linter can do."

---

### Step 5: How PerformanceAgent Differs from SecurityAgent

Both agents inherit from the same base class and follow the same flow. Here is a
side-by-side comparison of what is different:

```
                SecurityAgent              PerformanceAgent
                ==============             =================
 agent_name:    "security"                 "performance"

 system_prompt: AppSec engineer            Principal backend engineer
                CWE IDs for each category  Impact tiers (High/Medium/Low)
                Security-specific rules    "Estimate the impact" rule
                OWASP categories           N+1, O(n^2), blocking I/O

 tools:         Bandit (AST security       Radon (cyclomatic complexity)
                  pattern matching)
                detect-secrets (credential
                  scanning via entropy)

 cwe_id:        CWE-89, CWE-78, etc.      Always null (no CWE for perf)

 categories:    sql_injection,             n_plus_1_query,
                command_injection,         quadratic_loop,
                hardcoded_secret,          blocking_io,
                path_traversal             missing_caching

 tool count:    2 (Bandit + detect-secrets) 1 (Radon)
```

**What stays exactly the same (inherited from BaseAgent):**
- LLM configuration (ChatGroq, temperature, max_tokens)
- Prompt template structure (system + human messages with variables)
- Structured output parsing (with_structured_output → AgentFindings)
- LCEL chain composition (prompt | structured_llm)
- Finding conversion and validation (_convert_to_findings)
- Error handling and graceful degradation
- Timing and logging

This is the Template Method pattern in action. The *what* changes; the *how* stays the same.

**Interview talking point:** "The Security and Performance agents are architecturally
identical — same base class, same LLM, same structured output pipeline. They differ only
in their system prompt (domain expertise), their static analysis tools (Bandit vs Radon),
and their output categories. This proves the Template Method abstraction was the right
design — adding a new domain required implementing only three properties."

---

### Step 6: Testing Strategy (tests/unit/test_performance_agent.py)

The tests cover four areas:

#### Test 1: Agent Identity
```python
def test_agent_name(self):
    """PerformanceAgent should identify as 'performance'."""
    agent = PerformanceAgent()
    assert agent.agent_name == "performance"
```
This matters because the agent name is stamped on every Finding object. If it said
"security" by accident, findings would be misattributed in the dashboard.

#### Test 2: System Prompt Loading
```python
def test_system_prompt_loads(self):
    """System prompt should exist and contain performance-related content."""
    agent = PerformanceAgent()
    prompt = agent.system_prompt
    assert len(prompt) > 100
    assert "performance" in prompt.lower()
    assert "N+1" in prompt or "n+1" in prompt.lower()
```
This catches a common failure mode: the prompt file path is wrong, the file is missing,
or someone accidentally emptied it. We verify the file exists, is substantial, and
contains expected keywords.

#### Test 3: Finding Conversion
```python
def test_conversion_produces_performance_findings(self, mock_perf_findings):
    agent = PerformanceAgent()
    findings = agent._convert_to_findings(mock_perf_findings)

    assert len(findings) == 1
    assert findings[0].agent == "performance"
    assert findings[0].severity == "high"
    assert findings[0].category == "quadratic_loop"
    assert findings[0].cwe_id is None  # Performance issues don't have CWE IDs
```
This tests the base class conversion logic through the PerformanceAgent lens. The key
assertion: `cwe_id is None` — performance findings never have CWE IDs.

#### Test 4: LLM Failure Graceful Degradation
```python
@pytest.mark.asyncio
async def test_review_handles_llm_failure(self, sample_pr_data):
    """LLM failure should return empty list, not crash."""
    mock_chain = AsyncMock(side_effect=Exception("Groq rate limit"))
    # ... mock setup ...
    findings = await agent.review(sample_pr_data)
    assert findings == []
```
The most important test. If Groq is down or rate-limited, the PerformanceAgent must return
`[]` (not crash). The Security and Style agents can still contribute their findings.

#### Test 5-8: Radon Tool Tests
```python
async def test_detects_high_complexity(self):
    """Radon should flag functions with cyclomatic complexity > 10."""
    complex_code = (
        "def complex_func(a, b, c, d, e, f, g, h, i, j, k):\n"
        "    if a: return 1\n"
        "    elif b: return 2\n"
        # ... 11 branches → complexity 12 → grade C
    )
    result = await run_radon({"complex.py": complex_code})
    if result:  # radon installed
        assert "complex_func" in result

async def test_returns_empty_for_simple_code(self):
    """Simple code (low complexity) should produce no output."""
    result = await run_radon({"simple.py": "def add(a, b):\n    return a + b\n"})
    assert result == ""  # Grade A — not flagged

async def test_skips_non_python_files(self):
    """Radon should ignore non-Python files."""
    result = await run_radon({"style.css": "body { color: red; }"})
    assert result == ""

async def test_handles_empty_input(self):
    """Empty file dict should return empty string."""
    result = await run_radon({})
    assert result == ""
```

**Testing philosophy:** Radon tests use REAL radon execution on synthetic code, not mocks.
Radon is fast and local (no API calls), so there is no reason to mock it. This catches
real integration issues (wrong CLI flags, output format changes in new radon versions).

LLM tests use mocks because calling Groq costs API quota and adds network latency to the
test suite. The mock verifies the *plumbing* (error handling, conversion) without testing
the LLM's intelligence.

**Interview talking point:** "I test static analysis tools with real execution on synthetic
code because they are fast and local. LLM calls are mocked to avoid API costs in CI. The
most important test verifies graceful degradation — if the LLM fails, the agent returns an
empty list instead of crashing the pipeline."

---

### Step 7: Live Test Results

**Test PR:** github.com/ninjacode911/codeguard-test/pull/4

**Test code (intentionally slow):**
```python
import requests
import time

def process_users(users):
    """Find duplicate users — O(n^2) nested loop."""
    result = []
    for u in users:
        for item in users:
            if u["id"] == item["id"]:
                result.append(u)
    return result

def fetch_all_profiles(user_ids):
    """Blocking I/O — synchronous HTTP in what should be async."""
    profiles = []
    for uid in user_ids:
        resp = requests.get(f"https://api.example.com/users/{uid}")
        profiles.append(resp.json())
    return profiles

def complex_handler(data, mode, flag_a, flag_b, flag_c,
                    flag_d, flag_e, flag_f, flag_g, flag_h):
    """High cyclomatic complexity — too many branches."""
    if mode == "a" and flag_a:
        if flag_b: return data + 1
        elif flag_c: return data + 2
        elif flag_d: return data + 3
    elif mode == "b" and flag_e:
        if flag_f: return data * 2
        elif flag_g: return data * 3
        elif flag_h: return data * 4
    elif mode == "c":
        if flag_a and flag_b: return data - 1
        elif flag_c and flag_d: return data - 2
        elif flag_e and flag_f: return data - 3
    return data
```

**Pipeline execution (from server logs):**
```
14:22:10  Webhook received — PR #4, sha=7f3a2e1c
14:22:12  Fetched PR data — 1 file, 1 with content
14:22:13  Radon found 1 high-complexity function (complex_handler, grade D, complexity=16)
14:22:15  LLM returned 3 findings in 2.5 seconds
14:22:16  Summary comment posted
14:22:16  Cached in Redis (7-day TTL)
```

**Finding 1: Quadratic Loop (HIGH)**
```
File: app.py, Lines 6-10
Category: quadratic_loop
Title: O(n^2) nested loop in process_users

The nested loop iterates over the same `users` list twice, resulting in
O(n^2) time complexity. With 10,000 users, this performs 100,000,000
comparisons. With 100,000 users, it becomes 10 billion — effectively
unusable.

Suggested Fix:
  seen = set()
  result = [u for u in users if u["id"] not in seen and not seen.add(u["id"])]
```

**Finding 2: Blocking I/O (HIGH)**
```
File: app.py, Lines 14-17
Category: blocking_io
Title: Sequential synchronous HTTP calls in fetch_all_profiles

Each iteration makes a synchronous HTTP request, blocking the thread.
With 100 users at 200ms per request, this takes 20 seconds. In an async
service, this would block the event loop entirely.

Suggested Fix:
  import aiohttp
  async def fetch_all_profiles(user_ids):
      async with aiohttp.ClientSession() as session:
          tasks = [session.get(f".../{uid}") for uid in user_ids]
          responses = await asyncio.gather(*tasks)
          return [await r.json() for r in responses]
```

**Finding 3: Complex Function (MEDIUM)**
```
File: app.py, Lines 20-32
Category: high_complexity
Title: complex_handler has cyclomatic complexity 16 (grade D)

This function has 16 independent execution paths, making it difficult
to test and optimize. The deeply nested conditionals suggest the logic
could be restructured as a dispatch table or strategy pattern, which
would also improve branch prediction performance.

Suggested Fix:
  HANDLERS = {
      ("a", True): lambda d: d + 1,
      ("b", True): lambda d: d * 2,
      ...
  }
  def complex_handler(data, mode, **flags):
      handler = HANDLERS.get((mode, flags.get(f"flag_{mode}")))
      return handler(data) if handler else data
```

**Radon anchoring in action:**
Notice how Finding 3 references the exact complexity score and grade from radon's output.
The LLM used radon's data as a high-confidence anchor to focus its analysis on that
specific function. Without radon, the LLM might have missed the complexity issue entirely
or reported it with lower confidence.

---

### Bugs Encountered and Fixed

| Bug | Cause | Fix |
|-----|-------|-----|
| `radon` returning empty `{}` for files with only top-level code | Radon's `cc` command analyzes functions and classes, not module-level code | Documented as expected behavior — module-level code has no function to measure |
| Windows path separators in radon output (`\` instead of `/`) | Radon uses OS-native paths | Added `.replace("\\", "/")` in path normalization |
| `FileNotFoundError` when radon is not installed | `subprocess.run` raises this when the binary is missing | Caught specifically, logged warning, returned empty string |
| LLM reporting issues in unchanged code | System prompt did not emphasize "changed code only" strongly enough | Added bold emphasis and made it Rule #1 in the prompt |

---

## Architecture Deep Dive: Static + LLM Hybrid Analysis

The Performance Agent (like the Security Agent) uses a **hybrid analysis** approach:

```
                 STATIC ANALYSIS (Radon)          LLM REASONING (Groq)
                 ========================         =====================
 Strengths:      Deterministic, fast,             Semantic understanding,
                 zero false negatives             context-aware, explains WHY
                 for known patterns

 Weaknesses:     Cannot reason about              Can hallucinate, needs
                 semantics, no impact             anchoring, slower
                 estimation

 What it catches: High cyclomatic complexity       N+1 queries, blocking I/O,
                  (mechanical measurement)         quadratic algorithms (semantic)

 Speed:          ~0.5 seconds                     ~2.5 seconds

 Cost:           Free (local tool)                API tokens (Groq free tier)
```

**How they work together:**
1. Radon runs first and produces a factual report ("function X has complexity 16")
2. This report is injected into the LLM prompt as "Static Analysis Results"
3. The LLM uses it as an **anchor** — a high-confidence fact that guides its analysis
4. The LLM then goes beyond what radon can do: it reads the actual algorithm, estimates
   scaling behavior, and suggests a concrete refactoring

This is the same pattern as Security (Bandit anchors) but with different tools. The
architecture generalizes to any domain where you have static tools + LLM reasoning.

**Interview talking point:** "We use a hybrid approach: radon provides deterministic
complexity metrics as anchoring data for the LLM. The LLM then does what radon cannot —
it reads the algorithm semantically, estimates scaling behavior, and explains the impact
at different data sizes. Static tools provide precision; the LLM provides understanding."

---

## Files Created/Modified in Week 4

| File | Type | Purpose |
|------|------|---------|
| `app/agents/performance_agent.py` | **New** | Performance Agent — 30 lines leveraging base class |
| `app/tools/radon_tool.py` | **New** | Radon cyclomatic complexity wrapper |
| `prompts/performance_system.md` | **New** | Performance Agent system prompt (50 lines) |
| `tests/unit/test_performance_agent.py` | **New** | 8 tests for agent + radon tool |
| `requirements.txt` | **Modified** | Added `radon` dependency |

---

## Test Coverage

| Test Suite | Tests | Status |
|------------|-------|--------|
| Finding schema validation | 8 | PASS |
| Redis cache logic | 7 | PASS |
| Webhook HMAC validation | 5 | PASS |
| Security Agent & pipeline | 4 | PASS |
| Base Agent conversion | 4 | PASS |
| Bandit tool | 3 | PASS |
| Comment formatter | 4 | PASS |
| **Performance Agent** | **4** | **PASS** |
| **Radon tool** | **4** | **PASS** |
| **Total** | **43** | **PASS** |

---

## Architecture Patterns Used (Interview Reference)

| Pattern | Where Used | What It Means |
|---------|------------|---------------|
| **Template Method** | base_agent.py → performance_agent.py | Algorithm in base class, steps in subclasses. PerformanceAgent is 30 lines because of this. |
| **Open/Closed Principle** | base_agent.py | Open for extension (new agents), closed for modification (no base class changes needed). |
| **Static + LLM Hybrid** | radon_tool.py + performance prompt | Deterministic tools anchor LLM reasoning — precision + understanding. |
| **Temp File Pattern** | radon_tool.py | In-memory content to temp files, run CLI tool, parse output, clean up. |
| **Graceful Degradation** | base_agent.py (inherited) | Radon missing or LLM fails → return empty list, pipeline continues. |
| **Structured Output** | base_agent.py (inherited) | LLM constrained to return valid JSON matching Pydantic schema. |
| **Scope Isolation** | performance_system.md | "Performance issues ONLY" — prevents overlap with Security and Style agents. |
| **Impact-First Reporting** | performance_system.md Rule #3 | "Estimate the impact" — explain scaling behavior, not just flag a pattern. |

---

## Key Takeaway: The Power of Good Abstractions

Week 3 was hard — building the BaseAgent, the structured output pipeline, the tool
integration pattern, the error handling. Week 4 was fast — because all that infrastructure
was reusable.

```
 Week 3: SecurityAgent     Week 4: PerformanceAgent
 =====================     ========================
 base_agent.py  (~200 LOC)  (inherited — 0 new LOC)
 security_agent.py (~30)    performance_agent.py (~30)
 bandit_tool.py (~80)       radon_tool.py (~80)
 detect_secrets_tool.py     (not needed)
 security_system.md (~60)   performance_system.md (~50)
 test_security_agent.py     test_performance_agent.py

 Total new code: ~400 LOC   Total new code: ~160 LOC
                             60% LESS code for the same capability
```

The first agent is always the hardest. Every subsequent agent is incremental. This is
why architectural investment in Week 3 (Template Method, structured output, tool integration
pattern) was worth the effort — it compounds.

---

## What's Next (Week 5)

Build the **Style Agent** — detects code quality issues (naming conventions, dead code,
missing docstrings, type hint gaps). Same base class, different prompt, different tools
(Ruff/pylint). By now, this should take even less time — the pattern is established.

---

*Documentation written 2026-03-20 as part of Week 4 completion.*