# CodeProbe: Complete Project Plan & Progress Tracker
> **Multi-Agent Code Review System**
> Author: Ninjacode911 | Started: March 2026 | Target: 10 Weeks
---
## Table of Contents
1. [Project Overview](#1-project-overview)
2. [Architecture Deep Dive](#2-architecture-deep-dive)
3. [Complete Tech Stack](#3-complete-tech-stack)
4. [Directory Structure](#4-directory-structure)
5. [Week-by-Week Implementation Plan](#5-week-by-week-implementation-plan)
6. [Non-Coding Tasks](#6-non-coding-tasks)
7. [GPU / WSL Tasks](#7-gpu--wsl-tasks)
8. [Data Models & Schemas](#8-data-models--schemas)
9. [API Endpoints](#9-api-endpoints)
10. [Agent Prompt Design](#10-agent-prompt-design)
11. [Evaluation Plan](#11-evaluation-plan)
12. [Deployment Checklist](#12-deployment-checklist)
13. [Progress Tracker](#13-progress-tracker)
---
## 1. Project Overview
**What:** A multi-agent PR review system that reviews GitHub pull requests using 4 specialized LangChain agents (Security, Performance, Style, Synthesizer), posts inline GitHub comments, and tracks code health via a Next.js dashboard.
**Why:** AI-generated code (41% of GitHub commits) introduces 1.7x more issues. Existing tools use single-pass LLM calls. Sentinel AI uses domain-specialized agents with debate/consensus, RAG context, and static analysis tools.
**Core Thesis:** Separate security, performance, and style review into specialized agents, each with distinct prompts, tools, and context, then merge via a Synthesizer into a coherent, ranked, deduplicated review.
**Key Differentiators:**
- Multi-agent specialization (3 domain + 1 synthesizer)
- Debate & consensus protocol (agents challenge each other before synthesis)
- Repo-aware RAG context (ChromaDB indexes full repo, not just diff)
- $0/month architecture (all free tiers)
- Structured severity scoring (Critical/High/Medium/Low with CWE IDs)
- Auto-fix suggestions (corrected code snippets inline)
---
## 2. Architecture Deep Dive
### 2.1 Four Layers
```
┌─────────────────────────────────────────────────────────┐
│                      GITHUB LAYER                       │
│         Webhooks · PR Events · Inline Comments          │
└────────────────────────┬────────────────────────────────┘
                         │ pull_request webhook
┌────────────────────────▼────────────────────────────────┐
│         ORCHESTRATION LAYER (FastAPI on Render)         │
│  Webhook receiver · HMAC validation · Redis cache       │
│  Agent dispatcher · GitHub API client                   │
└────────────────────────┬────────────────────────────────┘
                         │ asyncio.gather()
┌────────────────────────▼────────────────────────────────┐
│          AGENT LAYER (LangChain ReAct Agents)           │
│  ┌──────────┐   ┌──────────────┐   ┌─────────┐          │
│  │ Security │   │ Performance  │   │  Style  │ PARALLEL │
│  │  Agent   │   │    Agent     │   │  Agent  │          │
│  └────┬─────┘   └──────┬───────┘   └────┬────┘          │
│       └────────────────┼────────────────┘               │
│                        ▼                                │
│               ┌──────────────────┐                      │
│               │   Synthesizer    │ SEQUENTIAL           │
│               │      Agent       │                      │
│               └──────────────────┘                      │
└────────────────────────┬────────────────────────────────┘
                         │
┌────────────────────────▼────────────────────────────────┐
│                     KNOWLEDGE LAYER                     │
│  ChromaDB (vector store) · Upstash Redis (cache)        │
│  Neon Postgres (history) · sentence-transformers        │
└─────────────────────────────────────────────────────────┘
```
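
The `asyncio.gather()` edge between the orchestration and agent layers can be sketched as follows. This is a minimal illustration rather than the project's implementation: `run_agent` and `dispatch_agents` are hypothetical names, and the agent body is a stub standing in for the real LLM and tool calls.

```python
import asyncio


async def run_agent(name: str, diff: str) -> list:
    """Hypothetical stub for one domain agent (no LLM or tools are called)."""
    await asyncio.sleep(0)  # yield to the event loop, as a network call would
    return [{"agent": name, "title": f"{name} review of a {len(diff)}-character diff"}]


async def dispatch_agents(diff: str) -> dict:
    """Run the three domain agents concurrently and collect findings per agent."""
    names = ("security", "performance", "style")
    # return_exceptions=True returns a failure as a value instead of raising,
    # so one failing agent does not discard the results of the other two.
    results = await asyncio.gather(
        *(run_agent(name, diff) for name in names), return_exceptions=True
    )
    return {
        name: [] if isinstance(result, BaseException) else result
        for name, result in zip(names, results)
    }


if __name__ == "__main__":
    print(asyncio.run(dispatch_agents("+ print('hello')")))
```

The Synthesizer would then run once on the combined dictionary, which corresponds to the PARALLEL and SEQUENTIAL labels in the diagram.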
### 2.2 Data Flow (11 Steps)
1. GitHub fires `pull_request` webhook → Render FastAPI endpoint
2. FastAPI validates HMAC-SHA256 signature (GitHub App secret)
3. Check Upstash Redis: commit SHA already reviewed? → return cached
4. Fetch via GitHub API: PR diff, changed files, full contents, commit history
5. Build repo context: embed chunks with sentence-transformers → upsert ChromaDB
6. Dispatch 3 parallel agents: `asyncio.gather(security, performance, style)`
7. Each agent: system prompt + RAG context → Groq API → static tools → typed findings
8. Synthesizer: deduplicate + resolve conflicts + Health Score + executive summary
9. GitHub API: post inline comment per finding + PR summary comment
10. Write review to Neon Postgres + set Redis cache (TTL: 7 days)
11. Next.js dashboard fetches from Neon and updates Health Score chart
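
Step 2 can be sketched with the standard library alone. GitHub sends the signature in the `X-Hub-Signature-256` header as `sha256=` followed by the hex HMAC of the raw request body; the function and parameter names below are illustrative assumptions, not the contents of `app/github/webhook.py`.

```python
import hashlib
import hmac
from typing import Optional


def verify_github_signature(
    secret: str, body: bytes, signature_header: Optional[str]
) -> bool:
    """Return True if the header matches the HMAC-SHA256 of the raw body."""
    if not signature_header or not signature_header.startswith("sha256="):
        return False
    digest = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    expected = "sha256=" + digest
    # compare_digest takes constant time, which avoids leaking how many
    # leading characters of the signature were correct.
    return hmac.compare_digest(
        expected.encode("utf-8"), signature_header.encode("utf-8")
    )
```

The check must run on the raw bytes of the request, before any JSON parsing, because re-serializing the payload can change the bytes and invalidate the signature.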
### 2.3 Context Loading (5 Layers per Agent)
1. Raw PR diff (changed lines, file paths, additions/deletions)
2. Relevant file sections from full repo (ChromaDB semantic search on diff)
3. Recent commit history for changed files (pattern detection)
4. Repo configuration (language, framework, linter rules, test coverage)
5. Domain-specific knowledge base (OWASP Top 10, DDIA patterns, style guides)
---
## 3. Complete Tech Stack
### 3.1 LLM & AI
| Tool | Free Tier | Purpose |
|------|-----------|---------|
| **Groq API** (Llama-3.1-70B) | 14,400 req/day, 500 tok/sec | Primary LLM for all agents |
| **Gemini 1.5 Flash** | 1M tokens/day | Fallback when Groq exhausted |
| **LangChain** | OSS | Agent orchestration, LCEL, ReAct framework |
| **sentence-transformers** | Local (GPU) | Embeddings for ChromaDB; runs on RTX 5070 via WSL |
### 3.2 Backend & APIs
| Tool | Free Tier | Purpose |
|------|-----------|---------|
| **FastAPI** | OSS | Webhook receiver, agent dispatcher, REST API |
| **Render.com** | Free web service | Hosts backend (30s cold start after 15min idle) |
| **GitHub Apps API** | Free | Webhooks, PR comments, file fetching |
| **Upstash Redis** | 10K req/day | Cache PR analysis by commit SHA |
| **Neon.tech** | Free Postgres 512MB | Review history, Health Score trends |
### 3.3 Knowledge & Static Analysis
| Tool | Free Tier | Purpose |
|------|-----------|---------|
| **ChromaDB** | OSS, in-memory/persisted | Vector store for RAG context retrieval |
| **Semgrep OSS** | Free, 3K+ rules | SAST rules for Security Agent |
| **Bandit** | Free | Python AST security analysis |
| **detect-secrets** | Free | Credential/API key scanning |
| **radon** | Free | Cyclomatic complexity & maintainability index |
| **pylint/ESLint/Ruff** | Free | Linting for Style Agent |
### 3.4 Frontend & Deployment
| Tool | Free Tier | Purpose |
|------|-----------|---------|
| **Vercel** | Free hobby tier | Hosts Next.js dashboard |
| **Next.js** | OSS | Dashboard UI |
| **Recharts** | OSS | Health Score trend charts, pie charts |
| **GitHub Actions** | 2K min/month | CI/CD for Sentinel AI itself |
---
## 4. Directory Structure
```
sentinel-ai/
├── app/
│   ├── __init__.py
│   ├── main.py  # FastAPI app, webhook endpoint, lifespan
│   ├── config.py  # Settings via pydantic-settings (env vars)
│   ├── agents/
│   │   ├── __init__.py
│   │   ├── base_agent.py  # Shared agent interface / base class
│   │   ├── security_agent.py  # Security ReAct agent
│   │   ├── performance_agent.py  # Performance ReAct agent
│   │   ├── style_agent.py  # Style & Maintainability agent
│   │   └── synthesizer.py  # Synthesizer + Health Score + dedup
│   ├── tools/
│   │   ├── __init__.py
│   │   ├── semgrep_tool.py  # LangChain tool wrapper for Semgrep
│   │   ├── bandit_tool.py  # LangChain tool wrapper for Bandit
│   │   ├── detect_secrets_tool.py  # Credential scanner tool
│   │   ├── radon_tool.py  # Complexity metrics tool
│   │   ├── ast_analyzer.py  # Python AST analysis (N+1, patterns)
│   │   └── linter_tool.py  # Ruff/ESLint/pylint subprocess tool
│   ├── context/
│   │   ├── __init__.py
│   │   ├── embedder.py  # sentence-transformers embedding pipeline
│   │   ├── indexer.py  # ChromaDB repo indexer (upsert chunks)
│   │   └── retriever.py  # RAG retriever (query ChromaDB for context)
│   ├── github/
│   │   ├── __init__.py
│   │   ├── webhook.py  # Webhook validation (HMAC-SHA256)
│   │   ├── client.py  # GitHub API client (fetch diff, post comments)
│   │   └── comment_formatter.py  # Format findings as GitHub Markdown comments
│   ├── models/
│   │   ├── __init__.py
│   │   ├── findings.py  # Finding, PRReview Pydantic schemas
│   │   └── webhook_payloads.py  # GitHub webhook event schemas
│   ├── db/
│   │   ├── __init__.py
│   │   ├── postgres.py  # Neon Postgres connection + queries
│   │   └── redis_cache.py  # Upstash Redis cache logic
│   └── services/
│       ├── __init__.py
│       ├── orchestrator.py  # Main orchestration: dispatch agents, synthesize
│       └── health_score.py  # Health Score calculation formula
├── dashboard/  # Next.js app (deployed to Vercel)
│   ├── package.json
│   ├── next.config.js
│   ├── tsconfig.json
│   ├── app/
│   │   ├── layout.tsx
│   │   ├── page.tsx  # / (Repository Overview)
│   │   ├── repos/
│   │   │   └── [owner]/
│   │   │       └── [repo]/
│   │   │           ├── page.tsx  # Repo Detail (trends, charts)
│   │   │           └── prs/
│   │   │               └── [number]/
│   │   │                   └── page.tsx  # PR Review Detail
│   │   └── api/
│   │       ├── repos/
│   │       │   └── route.ts  # API proxy to FastAPI backend
│   │       └── health/
│   │           └── route.ts
│   ├── components/
│   │   ├── HealthScoreRing.tsx  # Circular gauge 0-100
│   │   ├── FindingsTable.tsx  # Sortable, filterable findings
│   │   ├── TrendChart.tsx  # Recharts LineChart
│   │   ├── AgentBreakdown.tsx  # 3-column agent summary cards
│   │   ├── SeverityBadge.tsx  # Color-coded severity pill
│   │   └── Navbar.tsx
│   └── lib/
│       ├── api.ts  # Fetch wrapper for backend API
│       └── types.ts  # TypeScript types matching backend schemas
├── tests/
│   ├── __init__.py
│   ├── conftest.py  # Shared fixtures
│   ├── unit/
│   │   ├── test_findings_schema.py
│   │   ├── test_synthesizer_dedup.py
│   │   ├── test_webhook_validation.py
│   │   ├── test_redis_cache.py
│   │   └── test_health_score.py
│   ├── integration/
│   │   ├── test_full_pipeline.py
│   │   └── test_github_posting.py
│   └── eval/
│       ├── dataset/  # 20-PR benchmark dataset (JSON fixtures)
│       ├── run_eval.py  # Evaluation harness
│       └── metrics.py  # Precision, recall, latency tracking
├── prompts/
│   ├── security_system.md  # Security Agent system prompt
│   ├── performance_system.md  # Performance Agent system prompt
│   ├── style_system.md  # Style Agent system prompt
│   └── synthesizer_system.md  # Synthesizer system prompt
├── knowledge/
│   ├── owasp_top10_2025.md  # OWASP cheat sheet for Security RAG
│   ├── ddia_patterns.md  # DDIA patterns for Performance RAG
│   └── style_guides/  # Language style guides for Style RAG
├── .env.example  # Template for env vars (no secrets)
├── .gitignore
├── requirements.txt  # Python dependencies
├── requirements-dev.txt  # Dev/test dependencies
├── render.yaml  # Render deployment config
├── sentinel.yml.example  # Per-repo config template
├── Dockerfile  # For Render deployment
├── pyproject.toml  # Project metadata + tool configs
└── README.md  # Installation, usage, architecture docs
```
---
## 5. Week-by-Week Implementation Plan
### WEEK 1: Foundation & Setup
**Goal:** Project skeleton running locally, all external services provisioned.
| # | Task | Type | Status |
|---|------|------|--------|
| 1.1 | Initialize git repo, create directory structure | Code | [ ] |
| 1.2 | Set up Python virtual environment + requirements.txt | Code | [ ] |
| 1.3 | Register GitHub App (github.com/settings/apps/new) | Config | [ ] |
| 1.4 | Provision Neon.tech Postgres database + create `pr_reviews` table | Config | [ ] |
| 1.5 | Provision Upstash Redis instance | Config | [ ] |
| 1.6 | Get Groq API key (console.groq.com) | Config | [ ] |
| 1.7 | Get Gemini API key (aistudio.google.com) | Config | [ ] |
| 1.8 | Create FastAPI skeleton (`app/main.py`) with health endpoint | Code | [ ] |
| 1.9 | Create `app/config.py` with pydantic-settings (all env vars) | Code | [ ] |
| 1.10 | Create Pydantic models (`Finding`, `PRReview` schemas) | Code | [ ] |
| 1.11 | Set up .env.example, .gitignore, pyproject.toml | Code | [ ] |
| 1.12 | Deploy FastAPI skeleton to Render (verify /health works) | Deploy | [ ] |
| 1.13 | Write unit tests for Finding schema validation | Test | [ ] |
| 1.14 | Set up GitHub Actions CI (lint + test on push) | CI/CD | [ ] |
### WEEK 2: GitHub Integration
**Goal:** Receive webhooks, validate signatures, fetch PR data, post dummy comment.
| # | Task | Type | Status |
|---|------|------|--------|
| 2.1 | Implement HMAC-SHA256 webhook validation (`app/github/webhook.py`) | Code | [ ] |
| 2.2 | Implement GitHub API client: fetch PR diff (`app/github/client.py`) | Code | [ ] |
| 2.3 | Implement GitHub API client: fetch file contents | Code | [ ] |
| 2.4 | Implement GitHub API client: fetch commit history | Code | [ ] |
| 2.5 | Implement GitHub API client: post inline review comments | Code | [ ] |
| 2.6 | Implement GitHub API client: post PR summary comment | Code | [ ] |
| 2.7 | Create webhook endpoint (`POST /webhook/github`) in main.py | Code | [ ] |
| 2.8 | Implement comment formatter (`app/github/comment_formatter.py`) | Code | [ ] |
| 2.9 | Set up ngrok for local webhook testing | Config | [ ] |
| 2.10 | End-to-end test: open PR on test repo → dummy comment posted | Test | [ ] |
| 2.11 | Implement Redis cache check (skip if commit SHA already reviewed) | Code | [ ] |
| 2.12 | Write unit tests for HMAC validation (valid + invalid signatures) | Test | [ ] |
| 2.13 | Write unit tests for Redis cache hit/miss logic | Test | [ ] |
### WEEK 3: Security Agent v1
**Goal:** Security Agent analyzes diffs, returns structured findings with CWE IDs.
| # | Task | Type | Status |
|---|------|------|--------|
| 3.1 | Install & configure Semgrep OSS with security rulesets | Config | [ ] |
| 3.2 | Create Semgrep LangChain tool (`app/tools/semgrep_tool.py`) | Code | [ ] |
| 3.3 | Install & configure Bandit for Python AST security analysis | Config | [ ] |
| 3.4 | Create Bandit LangChain tool (`app/tools/bandit_tool.py`) | Code | [ ] |
| 3.5 | Install & configure detect-secrets | Config | [ ] |
| 3.6 | Create detect-secrets LangChain tool (`app/tools/detect_secrets_tool.py`) | Code | [ ] |
| 3.7 | Write Security Agent system prompt (`prompts/security_system.md`) | Prompt | [ ] |
| 3.8 | Prepare OWASP Top 10 (2025) knowledge base (`knowledge/owasp_top10_2025.md`) | Data | [ ] |
| 3.9 | Implement Security Agent ReAct loop (`app/agents/security_agent.py`) | Code | [ ] |
| 3.10 | Implement base agent interface (`app/agents/base_agent.py`) | Code | [ ] |
| 3.11 | Set up Groq LLM client via LangChain (`ChatGroq`) | Code | [ ] |
| 3.12 | Implement structured output parsing (JSON → Finding objects) | Code | [ ] |
| 3.13 | Create 10 synthetic security-vulnerable PRs for testing | Data | [ ] |
| 3.14 | Evaluate Security Agent on synthetic dataset; measure precision/recall | Eval | [ ] |
| 3.15 | Iterate on system prompt based on eval results | Prompt | [ ] |
### WEEK 4: Performance Agent v1
**Goal:** Performance Agent detects N+1 queries, complexity issues, returns findings.
| # | Task | Type | Status |
|---|------|------|--------|
| 4.1 | Create Python AST analyzer tool (`app/tools/ast_analyzer.py`) | Code | [ ] |
| 4.2 | Implement N+1 query pattern detector (Django/SQLAlchemy ORM patterns) | Code | [ ] |
| 4.3 | Create radon complexity tool (`app/tools/radon_tool.py`) | Code | [ ] |
| 4.4 | Write Performance Agent system prompt (`prompts/performance_system.md`) | Prompt | [ ] |
| 4.5 | Prepare DDIA patterns knowledge base (`knowledge/ddia_patterns.md`) | Data | [ ] |
| 4.6 | Implement Performance Agent ReAct loop (`app/agents/performance_agent.py`) | Code | [ ] |
| 4.7 | Fetch 10 Django PRs with known performance issues for testing | Data | [ ] |
| 4.8 | Evaluate Performance Agent on Django PR dataset | Eval | [ ] |
| 4.9 | Iterate on system prompt based on eval results | Prompt | [ ] |
### WEEK 5: Style Agent v1
**Goal:** Style Agent checks naming, complexity, dead code, test coverage gaps.
| # | Task | Type | Status |
|---|------|------|--------|
| 5.1 | Create linter tool wrapper for Ruff/ESLint/pylint (`app/tools/linter_tool.py`) | Code | [ ] |
| 5.2 | Implement dead code detector (unused imports, unreachable branches) | Code | [ ] |
| 5.3 | Write Style Agent system prompt (`prompts/style_system.md`) | Prompt | [ ] |
| 5.4 | Prepare language style guides knowledge base (`knowledge/style_guides/`) | Data | [ ] |
| 5.5 | Implement Style Agent ReAct loop (`app/agents/style_agent.py`) | Code | [ ] |
| 5.6 | Fetch 10 Exercism PRs with style/refactoring issues | Data | [ ] |
| 5.7 | Evaluate Style Agent on Exercism dataset | Eval | [ ] |
| 5.8 | Iterate on system prompt based on eval results | Prompt | [ ] |
### WEEK 6: ChromaDB + RAG Context
**Goal:** Full RAG pipeline: embed repo, retrieve context, inject into agents.
| # | Task | Type | Status |
|---|------|------|--------|
| 6.1 | Set up sentence-transformers embedding pipeline (`app/context/embedder.py`) | Code | [ ] |
| 6.2 | **Run embedding model on RTX 5070 via WSL** and benchmark speed | GPU | [ ] |
| 6.3 | Implement ChromaDB repo indexer (`app/context/indexer.py`): chunk files, upsert | Code | [ ] |
| 6.4 | Implement RAG retriever (`app/context/retriever.py`): query by diff content | Code | [ ] |
| 6.5 | Integrate RAG context into Security Agent | Code | [ ] |
| 6.6 | Integrate RAG context into Performance Agent | Code | [ ] |
| 6.7 | Integrate RAG context into Style Agent | Code | [ ] |
| 6.8 | Evaluate: does cross-file RAG context improve recall vs. diff-only? | Eval | [ ] |
| 6.9 | Optimize chunk size and retrieval top-k for quality vs. latency | Code | [ ] |
| 6.10 | Limit repo index to 500 most recently changed files (Render memory constraint) | Code | [ ] |
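
The chunking step in task 6.3 could look like the sketch below. It assumes line-based windows with overlap so that each chunk keeps the line range needed for inline comments; `chunk_lines` is a hypothetical helper, and the defaults of 40 lines with 10 lines of overlap are placeholders that task 6.9 would tune.

```python
def chunk_lines(text: str, chunk_size: int = 40, overlap: int = 10) -> list:
    """Split text into overlapping line windows with 1-based line ranges."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    lines = text.splitlines()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(lines), step):
        window = lines[start:start + chunk_size]
        chunks.append({
            "start_line": start + 1,
            "end_line": start + len(window),
            "text": "\n".join(window),
        })
        if start + chunk_size >= len(lines):
            break  # this window already reaches the end of the file
    return chunks
```

Each returned dictionary would be embedded and upserted into ChromaDB with `start_line` and `end_line` as metadata, so retrieved context can be traced back to its location in the file.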
### WEEK 7: Synthesizer Agent
**Goal:** Deduplication, conflict resolution, Health Score, executive summary, full pipeline.
| # | Task | Type | Status |
|---|------|------|--------|
| 7.1 | Write Synthesizer system prompt (`prompts/synthesizer_system.md`) | Prompt | [ ] |
| 7.2 | Implement deduplication logic (cosine similarity on findings via ChromaDB) | Code | [ ] |
| 7.3 | Implement severity conflict resolution (Security > Performance > Style precedence) | Code | [ ] |
| 7.4 | Implement composite re-ranking: severity × exploitability × fix_complexity | Code | [ ] |
| 7.5 | Implement PR Health Score formula (0-100) (`app/services/health_score.py`) | Code | [ ] |
| 7.6 | Implement executive summary generation (3-5 sentences) | Code | [ ] |
| 7.7 | Implement auto-block logic (Critical findings → block merge recommendation) | Code | [ ] |
| 7.8 | Implement Synthesizer Agent (`app/agents/synthesizer.py`) | Code | [ ] |
| 7.9 | Build main orchestrator (`app/services/orchestrator.py`) that ties everything together | Code | [ ] |
| 7.10 | Implement Gemini Flash fallback when Groq quota exhausted | Code | [ ] |
| 7.11 | Full end-to-end pipeline test: PR → agents → synthesizer → GitHub comments | Test | [ ] |
| 7.12 | Write unit tests for Health Score formula | Test | [ ] |
| 7.13 | Write unit tests for deduplication with synthetic conflicting findings | Test | [ ] |
| 7.14 | Implement Neon Postgres write (store review record) | Code | [ ] |
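
Tasks 7.2 and 7.3 can be illustrated together. The sketch below is a simplification under stated assumptions: it uses bag-of-words cosine similarity from the standard library as a stand-in for the embedding similarity that ChromaDB would provide, the 0.8 threshold is a placeholder, and findings are plain dictionaries rather than `Finding` models.

```python
import math
import re
from collections import Counter

# Lower number wins when two agents report the same issue (task 7.3).
AGENT_PRECEDENCE = {"security": 0, "performance": 1, "style": 2}


def cosine(a: str, b: str) -> float:
    """Cosine similarity between word-count vectors of two strings."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[token] * vb[token] for token in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def deduplicate(findings: list, threshold: float = 0.8) -> list:
    """Keep one finding per near-duplicate group, preferring higher precedence."""
    kept = []
    # Visiting findings in precedence order means the first copy kept is
    # always the one from the highest-precedence agent.
    for finding in sorted(findings, key=lambda f: AGENT_PRECEDENCE[f["agent"]]):
        is_duplicate = any(
            other["file_path"] == finding["file_path"]
            and cosine(other["description"], finding["description"]) >= threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(finding)
    return kept
```

Comparing only findings in the same file keeps two unrelated issues with similar wording in different files from being merged.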
### WEEK 8: Next.js Dashboard
**Goal:** Dashboard on Vercel showing review history, Health Scores, charts.
| # | Task | Type | Status |
|---|------|------|--------|
| 8.1 | Initialize Next.js app in `dashboard/` with TypeScript | Code | [ ] |
| 8.2 | Deploy to Vercel (connect GitHub repo) | Deploy | [ ] |
| 8.3 | Create TypeScript types matching backend schemas (`lib/types.ts`) | Code | [ ] |
| 8.4 | Create API fetch wrapper (`lib/api.ts`) that calls the FastAPI backend | Code | [ ] |
| 8.5 | Build `HealthScoreRing` component (circular gauge, animated) | Code | [ ] |
| 8.6 | Build `SeverityBadge` component (color-coded pills) | Code | [ ] |
| 8.7 | Build `TrendChart` component (Recharts LineChart, 30-day trend) | Code | [ ] |
| 8.8 | Build `FindingsTable` component (sortable, filterable) | Code | [ ] |
| 8.9 | Build `AgentBreakdown` component (3-column cards) | Code | [ ] |
| 8.10 | Build `/` page: Repository Overview (connected repos, avg scores) | Code | [ ] |
| 8.11 | Build `/repos/[owner]/[repo]` page: Repo Detail (charts, PR list) | Code | [ ] |
| 8.12 | Build `/repos/[owner]/[repo]/prs/[number]` page: PR Review Detail | Code | [ ] |
| 8.13 | Add FastAPI CORS middleware for Vercel domain | Code | [ ] |
| 8.14 | Implement REST API endpoints on FastAPI side for dashboard | Code | [ ] |
### WEEK 9: Polish & Evaluation
**Goal:** Full benchmark, prompt tuning, latency optimization, documentation.
| # | Task | Type | Status |
|---|------|------|--------|
| 9.1 | Curate full 20-PR benchmark dataset (Django, Next.js, synthetic, Exercism) | Data | [ ] |
| 9.2 | Build evaluation harness (`tests/eval/run_eval.py`) | Code | [ ] |
| 9.3 | Run full benchmark: measure precision, recall, latency per agent | Eval | [ ] |
| 9.4 | Tune agent prompts to reduce false positives (target: <30% FP rate) | Prompt | [ ] |
| 9.5 | Implement confidence threshold: findings <0.6 shown as 'Suggestions' | Code | [ ] |
| 9.6 | Latency optimization: measure p50/p95/p99 per PR size bucket | Eval | [ ] |
| 9.7 | Optimize Groq API calls (reduce token usage, cache prompts) | Code | [ ] |
| 9.8 | Write comprehensive README.md | Docs | [ ] |
| 9.9 | Write installation guide in README | Docs | [ ] |
| 9.10 | Add GitHub Actions pre-warm cron (ping /health every 10min) | CI/CD | [ ] |
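
Task 9.5 amounts to partitioning findings by confidence. A minimal sketch, assuming findings are dictionaries with a `confidence` key and that a score of exactly 0.6 still counts as a full finding (the plan only says that findings below 0.6 become suggestions); `split_by_confidence` is a hypothetical name:

```python
SUGGESTION_THRESHOLD = 0.6  # findings below this are shown as 'Suggestions'


def split_by_confidence(findings: list, threshold: float = SUGGESTION_THRESHOLD):
    """Return (confirmed, suggestions), preserving the input order in each list."""
    confirmed = [f for f in findings if f["confidence"] >= threshold]
    suggestions = [f for f in findings if f["confidence"] < threshold]
    return confirmed, suggestions
```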
### WEEK 10: Launch & Promotion
**Goal:** Live on GitHub Marketplace, installed on public repos, launch posts published.
| # | Task | Type | Status |
|---|------|------|--------|
| 10.1 | Install Sentinel AI on 3 public open-source repos | Launch | [ ] |
| 10.2 | Record demo video (screen recording: PR opened → comments posted) | Content | [ ] |
| 10.3 | Write Dev.to / HackerNews launch post | Content | [ ] |
| 10.4 | Write LinkedIn demo post | Content | [ ] |
| 10.5 | Submit to GitHub Marketplace (needs privacy policy, logo, description) | Launch | [ ] |
| 10.6 | Create sentinel.yml.example per-repo config template | Code | [ ] |
| 10.7 | Monitor first 48 hours and fix any production bugs | Ops | [ ] |
---
## 6. Non-Coding Tasks
These tasks involve no project code, but the project depends on them:
### 6.1 External Service Provisioning
| Service | Action | URL | Notes |
|---------|--------|-----|-------|
| **GitHub App** | Register new app | github.com/settings/apps/new | Need: App ID, Private Key (.pem), Webhook Secret |
| **Groq** | Get API key | console.groq.com | Free: 14,400 req/day |
| **Google AI Studio** | Get Gemini key | aistudio.google.com | Free: 1M tokens/day |
| **Neon.tech** | Create Postgres DB | console.neon.tech | Free: 512MB, create `pr_reviews` table |
| **Upstash** | Create Redis instance | console.upstash.com | Free: 10K req/day |
| **Render** | Create web service | dashboard.render.com | Free tier, connect GitHub repo |
| **Vercel** | Create project | vercel.com/new | Free hobby tier, connect dashboard/ |
| **ngrok** | Install for local testing | ngrok.com | Free: 1 tunnel |
### 6.2 GitHub App Configuration
**Permissions required:**
- Pull requests: Read & Write
- Contents: Read
- Metadata: Read
- Commit statuses: Write (optional)
**Webhook events to subscribe:**
- `pull_request` (opened, synchronize, reopened, ready_for_review)
- `pull_request_review_comment` (for @sentinel-ai re-review)
### 6.3 Data Curation Tasks
| Dataset | Source | Count | Purpose |
|---------|--------|-------|---------|
| Synthetic security PRs | Hand-crafted | 10 PRs | SQL injection, XSS, IDOR, hardcoded secrets |
| Django security PRs | github.com/django/django | 5 PRs | Real-world Python security fixes |
| Next.js performance PRs | github.com/vercel/next.js | 5 PRs | JS/TS performance changes |
| Exercism style PRs | github.com/exercism | 5 PRs | Naming, complexity, documentation issues |
| Mixed benchmark set | All above | 20 PRs | Full evaluation benchmark |
### 6.4 Knowledge Base Curation
| Document | Source | For Agent |
|----------|--------|-----------|
| OWASP Top 10 (2025) | owasp.org | Security Agent RAG |
| DDIA performance patterns | "Designing Data-Intensive Applications" | Performance Agent RAG |
| Python style guide (PEP 8) | python.org | Style Agent RAG |
| JavaScript style guide | Various (Airbnb, Google) | Style Agent RAG |
| TypeScript best practices | typescript-eslint.io | Style Agent RAG |
---
## 7. GPU / WSL Tasks
Your **RTX 5070** with WSL will be used for:
### 7.1 sentence-transformers Embedding (Required)
**No training needed.** These are pre-trained models used for embedding generation.
```
Model: all-MiniLM-L6-v2 (or all-mpnet-base-v2 for higher quality)
Task: Embed code chunks for ChromaDB indexing
Where: Runs locally during repo indexing (can also run on Render CPU, slower)
GPU benefit: ~10-50x faster embedding generation vs CPU
```
**Setup steps:**
1. Ensure CUDA toolkit installed in WSL (`nvidia-smi` should show RTX 5070)
2. `pip install sentence-transformers torch` (with CUDA support)
3. Benchmark: embed 1000 code chunks, measure time GPU vs CPU
4. Decision: if embedding is fast enough on CPU, skip GPU for deployment simplicity
### 7.2 Local LLM Testing (Optional, Recommended)
Running a local LLM for testing avoids burning Groq API quota during development:
```
Model: Llama-3.1-8B-Instruct (via Ollama or vLLM)
Task: Test agent prompts locally before hitting Groq
GPU benefit: Full inference locally, no API calls, no quota burn
```
**Setup steps:**
1. Install Ollama in WSL: `curl -fsSL https://ollama.com/install.sh | sh`
2. Pull model: `ollama pull llama3.1:8b`
3. Use for prompt iteration, then switch to Groq (70B) for production quality
### 7.3 What You Do NOT Need to Train
| Item | Reason |
|------|--------|
| LLM (Llama-3.1-70B) | Used via Groq API; inference only, no fine-tuning |
| sentence-transformers | Pre-trained model, no fine-tuning needed for code embeddings |
| Semgrep/Bandit/radon | Rule-based tools, no ML training |
| Agent prompts | Iterative prompt engineering, not model training |
**Bottom line:** This project is an **inference and orchestration** project, not a training project. Your GPU is used for fast local embeddings and optional local LLM testing; no model training is required.
---
## 8. Data Models & Schemas
### 8.1 Finding (per agent output)
```python
from typing import Literal, Optional

from pydantic import BaseModel


class Finding(BaseModel):
    agent: Literal['security', 'performance', 'style']
    file_path: str                  # e.g. 'src/auth/login.py'
    line_start: int
    line_end: int
    severity: Literal['critical', 'high', 'medium', 'low']
    category: str                   # e.g. 'sql_injection', 'n+1_query', 'naming'
    title: str                      # Short one-liner
    description: str                # Full explanation
    suggested_fix: str              # Corrected code snippet
    cwe_id: Optional[str] = None    # For security findings (e.g. 'CWE-89')
    confidence: float               # 0.0 to 1.0
```
### 8.2 SynthesizedReview (Synthesizer output)
```python
from typing import List, Literal

from pydantic import BaseModel

# Finding is the model defined in section 8.1.


class SynthesizedReview(BaseModel):
    health_score: int               # 0-100
    executive_summary: str          # 3-5 sentences
    recommendation: Literal['approve', 'request_changes', 'block']
    findings: List[Finding]         # Deduplicated, re-ranked
    critical_count: int
    high_count: int
    medium_count: int
    low_count: int
    duration_ms: int
```
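
The severity counts and the `recommendation` field can both be derived from the findings list. The sketch below uses plain dictionaries instead of the Pydantic models so that it runs without dependencies. Only the rule that Critical findings lead to `block` comes from the plan (task 7.7); mapping High findings to `request_changes` is an assumption, and `summarize` is a hypothetical helper.

```python
from collections import Counter

SEVERITIES = ("critical", "high", "medium", "low")


def summarize(findings: list) -> dict:
    """Count findings per severity and derive a merge recommendation."""
    counts = Counter(f["severity"] for f in findings)
    if counts["critical"] > 0:
        recommendation = "block"  # Critical findings block the merge (task 7.7)
    elif counts["high"] > 0:
        recommendation = "request_changes"  # assumption, not stated in the plan
    else:
        recommendation = "approve"
    summary = {f"{severity}_count": counts[severity] for severity in SEVERITIES}
    summary["recommendation"] = recommendation
    return summary
```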
### 8.3 PR Review Record (Neon Postgres)
```sql
CREATE TABLE pr_reviews (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    repo_full_name TEXT NOT NULL,
    pr_number INT NOT NULL,
    commit_sha TEXT NOT NULL,
    health_score INT NOT NULL,
    critical_count INT DEFAULT 0,
    high_count INT DEFAULT 0,
    medium_count INT DEFAULT 0,
    low_count INT DEFAULT 0,
    summary TEXT,
    findings JSONB NOT NULL,
    duration_ms INT,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_pr_reviews_repo ON pr_reviews(repo_full_name);
CREATE INDEX idx_pr_reviews_sha ON pr_reviews(commit_sha);
```
---
## 9. API Endpoints
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/webhook/github` | POST | Receive GitHub webhook, validate HMAC, enqueue analysis |
| `/api/repos/{owner}/{repo}/reviews` | GET | Paginated PR review list + Health Score trend |
| `/api/repos/{owner}/{repo}/reviews/{pr_number}` | GET | Full findings for specific PR |
| `/api/repos/{owner}/{repo}/stats` | GET | Aggregate stats: avg score, top categories, 30-day trend |
| `/api/repos/{owner}/{repo}/reanalyze/{pr_number}` | POST | Re-trigger analysis (bypass cache) |
| `/health` | GET | Health check: agent status, Groq quota remaining |
---
## 10. Agent Prompt Design
Each agent prompt must include:
1. **Role definition**: who the agent is (e.g., "senior AppSec engineer")
2. **Scope boundaries**: what to look for and what to ignore
3. **Output schema**: exact JSON structure expected
4. **Severity guidelines**: when to use Critical vs. High vs. Medium vs. Low
5. **Confidence scoring**: how to self-assess confidence (0.0-1.0)
6. **Examples**: 2-3 few-shot examples of good findings
7. **Anti-patterns**: common false positives to avoid
Prompts are stored in `prompts/` as Markdown files and loaded at agent initialization.
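
Loading the prompts at initialization could be as simple as the sketch below. The file names follow the `prompts/` listing in section 4; `load_prompts` and `AGENT_NAMES` are hypothetical names introduced for illustration.

```python
from pathlib import Path

AGENT_NAMES = ("security", "performance", "style", "synthesizer")


def load_prompts(prompts_dir) -> dict:
    """Read `<agent>_system.md` for every agent and return {agent: prompt text}."""
    prompts = {}
    for name in AGENT_NAMES:
        path = Path(prompts_dir) / f"{name}_system.md"
        # read_text raises FileNotFoundError for a missing prompt, so a
        # misconfigured deployment fails at startup instead of mid-review.
        prompts[name] = path.read_text(encoding="utf-8")
    return prompts
```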
---
## 11. Evaluation Plan
### 11.1 Metrics
| Metric | Target | Formula |
|--------|--------|---------|
| Security precision | >70% | true_positives / (true_positives + false_positives) |
| Performance recall | >60% | true_positives / (true_positives + false_negatives) |
| Deduplication rate | >15% | duplicates_removed / total_findings |
| e2e latency (p95) | <20s | Time from webhook to first comment posted |
| Groq quota usage | <10K/day | Total API calls per day |
| System uptime | >95% | (total_time - downtime) / total_time |
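
The precision and recall formulas in the table, plus the p95 latency, can be computed as in the sketch below. It is a sketch of what `metrics.py` might contain, not its actual contents: the function names are assumptions, and the percentile uses the nearest-rank method, which is one of several common definitions.

```python
import math


def precision_recall_f1(tp: int, fp: int, fn: int):
    """Return (precision, recall, F1); a ratio with a zero denominator is 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def percentile(values: list, pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

For example, 7 true positives with 3 false positives gives a precision of 0.7, which sits exactly on the security target rather than above it.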
### 11.2 Evaluation Harness
Located in `tests/eval/`:
- `dataset/`: 20 PRs as JSON fixtures (diff, expected findings, ground truth labels)
- `run_eval.py`: runs each PR through the full pipeline and compares output vs ground truth
- `metrics.py`: computes precision, recall, F1, latency percentiles
- Results logged to console + optionally to LangSmith (free developer tier)
---
## 12. Deployment Checklist
### Render (FastAPI Backend)
- [ ] `render.yaml` configured with build + start commands
- [ ] Environment variables set in Render dashboard
- [ ] Health check endpoint (`/health`) configured
- [ ] Auto-deploy from `main` branch enabled
### Vercel (Next.js Dashboard)
- [ ] Connected to GitHub repo `dashboard/` directory
- [ ] Environment variable: `NEXT_PUBLIC_API_URL` pointing to Render backend
- [ ] Custom domain (optional)
### GitHub App
- [ ] App registered with correct permissions
- [ ] Webhook URL set to Render endpoint (`/webhook/github`)
- [ ] Private key (.pem) downloaded and stored securely
- [ ] App installed on test repo for development
### GitHub Actions
- [ ] CI workflow: lint (ruff) + test (pytest) on push/PR
- [ ] Pre-warm cron: ping /health every 10 minutes during working hours
---
## 13. Progress Tracker
### Overall Status
| Week | Milestone | Status | Notes |
|------|-----------|--------|-------|
| 1 | Foundation & Setup | COMPLETE | All services provisioned, project scaffolded |
| 2 | GitHub Integration | COMPLETE | E2E tested: webhook → fetch → comment on PR #1 |
| 3 | Security Agent v1 | COMPLETE | Bandit + Llama-3.3-70B, live-tested on PR #3, 4 findings |
| 4 | Performance Agent v1 | COMPLETE | Radon complexity + Llama-3.3-70B, 3 findings on PR #4 |
| 5 | Style Agent v1 | COMPLETE | Ruff linter + Llama-3.3-70B, 6 findings on PR #4 |
| 6 | ChromaDB + RAG Context | COMPLETE | sentence-transformers + ChromaDB, integrated into all agents |
| 7 | Synthesizer Agent | COMPLETE | Dedup, conflict resolution, Health Score formula, exec summary |
| 8 | Next.js Dashboard | COMPLETE | Next.js + Tailwind + Recharts, mock data, all pages |
| 9 | Polish & Evaluation | COMPLETE | Eval harness, metrics, README, DB persistence |
| 10 | Launch & Promotion | COMPLETE | Render config, Vercel ready, API endpoints for dashboard |
### Key Decisions Log
| Date | Decision | Rationale |
|------|----------|-----------|
| 2026-03-19 | Project plan created | Starting from scratch, PDF spec as source of truth |
| 2026-03-19 | Project renamed to "Ninja Code Guard" | User's personal branding choice |
| 2026-03-19 | GitHub App: "Ninja's Code Guard" (ID: 3133457) | Registered and tested with live PR |
| 2026-03-19 | Test repo: ninjacode911/codeguard-test | Used for e2e webhook testing |
| 2026-03-19 | Fail-open pattern for Redis cache | Missing a review is worse than duplicating |
| 2026-03-19 | Background tasks for webhook processing | GitHub's 10s timeout requires async processing |
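
The fail-open decision for the Redis cache can be sketched without a Redis client by passing the lookup in as a function. Everything named here is an assumption for illustration: the `review:` key format, `cache_key`, and `get_cached_review`. The 7-day TTL comes from step 10 of the data flow.

```python
from typing import Callable, Optional

CACHE_TTL_SECONDS = 7 * 24 * 60 * 60  # 7 days, as in data flow step 10


def cache_key(repo_full_name: str, commit_sha: str) -> str:
    """Build the cache key for one reviewed commit (hypothetical key format)."""
    return f"review:{repo_full_name}:{commit_sha}"


def get_cached_review(
    fetch: Callable[[str], Optional[str]], repo_full_name: str, commit_sha: str
) -> Optional[str]:
    """Return the cached review, or None on a miss or on any cache failure."""
    try:
        return fetch(cache_key(repo_full_name, commit_sha))
    except Exception:
        # Fail open: a cache outage is treated as a miss, so the PR is
        # reviewed again rather than skipped.
        return None
```

In production `fetch` would be the Redis client's get method; in tests it can be `dict.get` or a function that raises, which makes both the hit and the outage paths easy to cover.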
---
*Last updated: 2026-03-19*