Spaces:
Sleeping
Sleeping
| # CodeProbe β Complete Project Plan & Progress Tracker | |
| > **Multi-Agent Code Review System** | |
| > Author: Ninjacode911 | Started: March 2026 | Target: 10 Weeks | |
| --- | |
| ## Table of Contents | |
| 1. [Project Overview](#1-project-overview) | |
| 2. [Architecture Deep Dive](#2-architecture-deep-dive) | |
| 3. [Complete Tech Stack](#3-complete-tech-stack) | |
| 4. [Directory Structure](#4-directory-structure) | |
| 5. [Week-by-Week Implementation Plan](#5-week-by-week-implementation-plan) | |
| 6. [Non-Coding Tasks](#6-non-coding-tasks) | |
| 7. [GPU / WSL Tasks](#7-gpu--wsl-tasks) | |
| 8. [Data Models & Schemas](#8-data-models--schemas) | |
| 9. [API Endpoints](#9-api-endpoints) | |
| 10. [Agent Prompt Design](#10-agent-prompt-design) | |
| 11. [Evaluation Plan](#11-evaluation-plan) | |
| 12. [Deployment Checklist](#12-deployment-checklist) | |
| 13. [Progress Tracker](#13-progress-tracker) | |
| --- | |
| ## 1. Project Overview | |
| **What:** A multi-agent PR review system that reviews GitHub pull requests using 4 specialized LangChain agents (Security, Performance, Style, Synthesizer), posts inline GitHub comments, and tracks code health via a Next.js dashboard. | |
| **Why:** AI-generated code (41% of GitHub commits) introduces 1.7x more issues. Existing tools use single-pass LLM calls. Sentinel AI uses domain-specialized agents with debate/consensus, RAG context, and static analysis tools. | |
| **Core Thesis:** Separate security, performance, and style review into specialized agents β each with distinct prompts, tools, and context β then merge via a Synthesizer into a coherent, ranked, deduplicated review. | |
| **Key Differentiators:** | |
| - Multi-agent specialization (3 domain + 1 synthesizer) | |
| - Debate & consensus protocol (agents challenge each other before synthesis) | |
| - Repo-aware RAG context (ChromaDB indexes full repo, not just diff) | |
| - $0/month architecture (all free tiers) | |
| - Structured severity scoring (Critical/High/Medium/Low with CWE IDs) | |
| - Auto-fix suggestions (corrected code snippets inline) | |
| --- | |
| ## 2. Architecture Deep Dive | |
| ### 2.1 Four Layers | |
| ``` | |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| β GITHUB LAYER β | |
| β Webhooks Β· PR Events Β· Inline Comments β | |
| ββββββββββββββββββββββββ¬βββββββββββββββββββββββββββββββ | |
| β pull_request webhook | |
| ββββββββββββββββββββββββΌβββββββββββββββββββββββββββββββ | |
| β ORCHESTRATION LAYER (FastAPI on Render) β | |
| β Webhook receiver Β· HMAC validation Β· Redis cache β | |
| β Agent dispatcher Β· GitHub API client β | |
| ββββββββββββββββββββββββ¬βββββββββββββββββββββββββββββββ | |
| β asyncio.gather() | |
| ββββββββββββββββββββββββΌβββββββββββββββββββββββββββββββ | |
| β AGENT LAYER (LangChain ReAct Agents) β | |
| β ββββββββββββ ββββββββββββββββ βββββββββββ β | |
| β β Security β β Performance β β Style β PARALLEL β | |
| β β Agent β β Agent β β Agent β β | |
| β ββββββ¬ββββββ ββββββββ¬ββββββββ ββββββ¬βββββ β | |
| β ββββββββββββββββΌββββββββββββββββ β | |
| β βΌ β | |
| β ββββββββββββββββββββ β | |
| β β Synthesizer β SEQUENTIAL β | |
| β β Agent β β | |
| β ββββββββββββββββββββ β | |
| ββββββββββββββββββββββββ¬βββββββββββββββββββββββββββββββ | |
| β | |
| ββββββββββββββββββββββββΌβββββββββββββββββββββββββββββββ | |
| β KNOWLEDGE LAYER β | |
| β ChromaDB (vector store) Β· Upstash Redis (cache) β | |
| β Neon Postgres (history) Β· sentence-transformers β | |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| ``` | |
| ### 2.2 Data Flow (11 Steps) | |
| 1. GitHub fires `pull_request` webhook β Render FastAPI endpoint | |
| 2. FastAPI validates HMAC-SHA256 signature (GitHub App secret) | |
| 3. Check Upstash Redis: commit SHA already reviewed? β return cached | |
| 4. Fetch via GitHub API: PR diff, changed files, full contents, commit history | |
| 5. Build repo context: embed chunks with sentence-transformers β upsert ChromaDB | |
| 6. Dispatch 3 parallel agents: `asyncio.gather(security, performance, style)` | |
| 7. Each agent: system prompt + RAG context β Groq API β static tools β typed findings | |
| 8. Synthesizer: deduplicate + resolve conflicts + Health Score + executive summary | |
| 9. GitHub API: post inline comment per finding + PR summary comment | |
| 10. Write review to Neon Postgres + set Redis cache (TTL: 7 days) | |
| 11. Next.js dashboard fetches from Neon and updates Health Score chart | |
| ### 2.3 Context Loading (5 Layers per Agent) | |
| 1. Raw PR diff (changed lines, file paths, additions/deletions) | |
| 2. Relevant file sections from full repo (ChromaDB semantic search on diff) | |
| 3. Recent commit history for changed files (pattern detection) | |
| 4. Repo configuration (language, framework, linter rules, test coverage) | |
| 5. Domain-specific knowledge base (OWASP Top 10, DDIA patterns, style guides) | |
| --- | |
| ## 3. Complete Tech Stack | |
| ### 3.1 LLM & AI | |
| | Tool | Free Tier | Purpose | | |
| |------|-----------|---------| | |
| | **Groq API** (Llama-3.1-70B) | 14,400 req/day, 500 tok/sec | Primary LLM for all agents | | |
| | **Gemini 1.5 Flash** | 1M tokens/day | Fallback when Groq exhausted | | |
| | **LangChain** | OSS | Agent orchestration, LCEL, ReAct framework | | |
| | **sentence-transformers** | Local (GPU) | Embeddings for ChromaDB β runs on RTX 5070 via WSL | | |
| ### 3.2 Backend & APIs | |
| | Tool | Free Tier | Purpose | | |
| |------|-----------|---------| | |
| | **FastAPI** | OSS | Webhook receiver, agent dispatcher, REST API | | |
| | **Render.com** | Free web service | Hosts backend (30s cold start after 15min idle) | | |
| | **GitHub Apps API** | Free | Webhooks, PR comments, file fetching | | |
| | **Upstash Redis** | 10K req/day | Cache PR analysis by commit SHA | | |
| | **Neon.tech** | Free Postgres 512MB | Review history, Health Score trends | | |
| ### 3.3 Knowledge & Static Analysis | |
| | Tool | Free Tier | Purpose | | |
| |------|-----------|---------| | |
| | **ChromaDB** | OSS, in-memory/persisted | Vector store for RAG context retrieval | | |
| | **Semgrep OSS** | Free, 3K+ rules | SAST rules for Security Agent | | |
| | **Bandit** | Free | Python AST security analysis | | |
| | **detect-secrets** | Free | Credential/API key scanning | | |
| | **radon** | Free | Cyclomatic complexity & maintainability index | | |
| | **pylint/ESLint/Ruff** | Free | Linting for Style Agent | | |
| ### 3.4 Frontend & Deployment | |
| | Tool | Free Tier | Purpose | | |
| |------|-----------|---------| | |
| | **Vercel** | Free hobby tier | Hosts Next.js dashboard | | |
| | **Next.js** | OSS | Dashboard UI | | |
| | **Recharts** | OSS | Health Score trend charts, pie charts | | |
| | **GitHub Actions** | 2K min/month | CI/CD for Sentinel AI itself | | |
| --- | |
| ## 4. Directory Structure | |
| ``` | |
| sentinel-ai/ | |
| βββ app/ | |
| β βββ __init__.py | |
| β βββ main.py # FastAPI app, webhook endpoint, lifespan | |
| β βββ config.py # Settings via pydantic-settings (env vars) | |
| β βββ agents/ | |
| β β βββ __init__.py | |
| β β βββ base_agent.py # Shared agent interface / base class | |
| β β βββ security_agent.py # Security ReAct agent | |
| β β βββ performance_agent.py # Performance ReAct agent | |
| β β βββ style_agent.py # Style & Maintainability agent | |
| β β βββ synthesizer.py # Synthesizer + Health Score + dedup | |
| β βββ tools/ | |
| β β βββ __init__.py | |
| β β βββ semgrep_tool.py # LangChain tool wrapper for Semgrep | |
| β β βββ bandit_tool.py # LangChain tool wrapper for Bandit | |
| β β βββ detect_secrets_tool.py # Credential scanner tool | |
| β β βββ radon_tool.py # Complexity metrics tool | |
| β β βββ ast_analyzer.py # Python AST analysis (N+1, patterns) | |
| β β βββ linter_tool.py # Ruff/ESLint/pylint subprocess tool | |
| β βββ context/ | |
| β β βββ __init__.py | |
| β β βββ embedder.py # sentence-transformers embedding pipeline | |
| β β βββ indexer.py # ChromaDB repo indexer (upsert chunks) | |
| β β βββ retriever.py # RAG retriever (query ChromaDB for context) | |
| β βββ github/ | |
| β β βββ __init__.py | |
| β β βββ webhook.py # Webhook validation (HMAC-SHA256) | |
| β β βββ client.py # GitHub API client (fetch diff, post comments) | |
| β β βββ comment_formatter.py # Format findings as GitHub Markdown comments | |
| β βββ models/ | |
| β β βββ __init__.py | |
| β β βββ findings.py # Finding, PRReview Pydantic schemas | |
| β β βββ webhook_payloads.py # GitHub webhook event schemas | |
| β βββ db/ | |
| β β βββ __init__.py | |
| β β βββ postgres.py # Neon Postgres connection + queries | |
| β β βββ redis_cache.py # Upstash Redis cache logic | |
| β βββ services/ | |
| β βββ __init__.py | |
| β βββ orchestrator.py # Main orchestration: dispatch agents, synthesize | |
| β βββ health_score.py # Health Score calculation formula | |
| βββ dashboard/ # Next.js app (deployed to Vercel) | |
| β βββ package.json | |
| β βββ next.config.js | |
| β βββ tsconfig.json | |
| β βββ app/ | |
| β β βββ layout.tsx | |
| β β βββ page.tsx # / β Repository Overview | |
| β β βββ repos/ | |
| β β β βββ [owner]/ | |
| β β β βββ [repo]/ | |
| β β β βββ page.tsx # Repo Detail (trends, charts) | |
| β β β βββ prs/ | |
| β β β βββ [number]/ | |
| β β β βββ page.tsx # PR Review Detail | |
| β β βββ api/ | |
| β β βββ repos/ | |
| β β β βββ route.ts # API proxy to FastAPI backend | |
| β β βββ health/ | |
| β β βββ route.ts | |
| β βββ components/ | |
| β β βββ HealthScoreRing.tsx # Circular gauge 0-100 | |
| β β βββ FindingsTable.tsx # Sortable, filterable findings | |
| β β βββ TrendChart.tsx # Recharts LineChart | |
| β β βββ AgentBreakdown.tsx # 3-column agent summary cards | |
| β β βββ SeverityBadge.tsx # Color-coded severity pill | |
| β β βββ Navbar.tsx | |
| β βββ lib/ | |
| β βββ api.ts # Fetch wrapper for backend API | |
| β βββ types.ts # TypeScript types matching backend schemas | |
| βββ tests/ | |
| β βββ __init__.py | |
| β βββ conftest.py # Shared fixtures | |
| β βββ unit/ | |
| β β βββ test_findings_schema.py | |
| β β βββ test_synthesizer_dedup.py | |
| β β βββ test_webhook_validation.py | |
| β β βββ test_redis_cache.py | |
| β β βββ test_health_score.py | |
| β βββ integration/ | |
| β β βββ test_full_pipeline.py | |
| β β βββ test_github_posting.py | |
| β βββ eval/ | |
| β βββ dataset/ # 20-PR benchmark dataset (JSON fixtures) | |
| β βββ run_eval.py # Evaluation harness | |
| β βββ metrics.py # Precision, recall, latency tracking | |
| βββ prompts/ | |
| β βββ security_system.md # Security Agent system prompt | |
| β βββ performance_system.md # Performance Agent system prompt | |
| β βββ style_system.md # Style Agent system prompt | |
| β βββ synthesizer_system.md # Synthesizer system prompt | |
| βββ knowledge/ | |
| β βββ owasp_top10_2025.md # OWASP cheat sheet for Security RAG | |
| β βββ ddia_patterns.md # DDIA patterns for Performance RAG | |
| β βββ style_guides/ # Language style guides for Style RAG | |
| βββ .env.example # Template for env vars (no secrets) | |
| βββ .gitignore | |
| βββ requirements.txt # Python dependencies | |
| βββ requirements-dev.txt # Dev/test dependencies | |
| βββ render.yaml # Render deployment config | |
| βββ sentinel.yml.example # Per-repo config template | |
| βββ Dockerfile # For Render deployment | |
| βββ pyproject.toml # Project metadata + tool configs | |
| βββ README.md # Installation, usage, architecture docs | |
| ``` | |
| --- | |
| ## 5. Week-by-Week Implementation Plan | |
| ### WEEK 1: Foundation & Setup | |
| **Goal:** Project skeleton running locally, all external services provisioned. | |
| | # | Task | Type | Status | | |
| |---|------|------|--------| | |
| | 1.1 | Initialize git repo, create directory structure | Code | [ ] | | |
| | 1.2 | Set up Python virtual environment + requirements.txt | Code | [ ] | | |
| | 1.3 | Register GitHub App (dev.github.com/settings/apps) | Config | [ ] | | |
| | 1.4 | Provision Neon.tech Postgres database + create `pr_reviews` table | Config | [ ] | | |
| | 1.5 | Provision Upstash Redis instance | Config | [ ] | | |
| | 1.6 | Get Groq API key (console.groq.com) | Config | [ ] | | |
| | 1.7 | Get Gemini API key (aistudio.google.com) | Config | [ ] | | |
| | 1.8 | Create FastAPI skeleton (`app/main.py`) with health endpoint | Code | [ ] | | |
| | 1.9 | Create `app/config.py` with pydantic-settings (all env vars) | Code | [ ] | | |
| | 1.10 | Create Pydantic models (`Finding`, `PRReview` schemas) | Code | [ ] | | |
| | 1.11 | Set up .env.example, .gitignore, pyproject.toml | Code | [ ] | | |
| | 1.12 | Deploy FastAPI skeleton to Render (verify /health works) | Deploy | [ ] | | |
| | 1.13 | Write unit tests for Finding schema validation | Test | [ ] | | |
| | 1.14 | Set up GitHub Actions CI (lint + test on push) | CI/CD | [ ] | | |
| ### WEEK 2: GitHub Integration | |
| **Goal:** Receive webhooks, validate signatures, fetch PR data, post dummy comment. | |
| | # | Task | Type | Status | | |
| |---|------|------|--------| | |
| | 2.1 | Implement HMAC-SHA256 webhook validation (`app/github/webhook.py`) | Code | [ ] | | |
| | 2.2 | Implement GitHub API client β fetch PR diff (`app/github/client.py`) | Code | [ ] | | |
| | 2.3 | Implement GitHub API client β fetch file contents | Code | [ ] | | |
| | 2.4 | Implement GitHub API client β fetch commit history | Code | [ ] | | |
| | 2.5 | Implement GitHub API client β post inline review comments | Code | [ ] | | |
| | 2.6 | Implement GitHub API client β post PR summary comment | Code | [ ] | | |
| | 2.7 | Create webhook endpoint (`POST /webhook/github`) in main.py | Code | [ ] | | |
| | 2.8 | Implement comment formatter (`app/github/comment_formatter.py`) | Code | [ ] | | |
| | 2.9 | Set up ngrok for local webhook testing | Config | [ ] | | |
| | 2.10 | End-to-end test: open PR on test repo β dummy comment posted | Test | [ ] | | |
| | 2.11 | Implement Redis cache check (skip if commit SHA already reviewed) | Code | [ ] | | |
| | 2.12 | Write unit tests for HMAC validation (valid + invalid signatures) | Test | [ ] | | |
| | 2.13 | Write unit tests for Redis cache hit/miss logic | Test | [ ] | | |
| ### WEEK 3: Security Agent v1 | |
| **Goal:** Security Agent analyzes diffs, returns structured findings with CWE IDs. | |
| | # | Task | Type | Status | | |
| |---|------|------|--------| | |
| | 3.1 | Install & configure Semgrep OSS with security rulesets | Config | [ ] | | |
| | 3.2 | Create Semgrep LangChain tool (`app/tools/semgrep_tool.py`) | Code | [ ] | | |
| | 3.3 | Install & configure Bandit for Python AST security analysis | Config | [ ] | | |
| | 3.4 | Create Bandit LangChain tool (`app/tools/bandit_tool.py`) | Code | [ ] | | |
| | 3.5 | Install & configure detect-secrets | Config | [ ] | | |
| | 3.6 | Create detect-secrets LangChain tool (`app/tools/detect_secrets_tool.py`) | Code | [ ] | | |
| | 3.7 | Write Security Agent system prompt (`prompts/security_system.md`) | Prompt | [ ] | | |
| | 3.8 | Prepare OWASP Top 10 (2025) knowledge base (`knowledge/owasp_top10_2025.md`) | Data | [ ] | | |
| | 3.9 | Implement Security Agent ReAct loop (`app/agents/security_agent.py`) | Code | [ ] | | |
| | 3.10 | Implement base agent interface (`app/agents/base_agent.py`) | Code | [ ] | | |
| | 3.11 | Set up Groq LLM client via LangChain (`ChatGroq`) | Code | [ ] | | |
| | 3.12 | Implement structured output parsing (JSON β Finding objects) | Code | [ ] | | |
| | 3.13 | Create 10 synthetic security-vulnerable PRs for testing | Data | [ ] | | |
| | 3.14 | Evaluate Security Agent on synthetic dataset β measure precision/recall | Eval | [ ] | | |
| | 3.15 | Iterate on system prompt based on eval results | Prompt | [ ] | | |
| ### WEEK 4: Performance Agent v1 | |
| **Goal:** Performance Agent detects N+1 queries, complexity issues, returns findings. | |
| | # | Task | Type | Status | | |
| |---|------|------|--------| | |
| | 4.1 | Create Python AST analyzer tool (`app/tools/ast_analyzer.py`) | Code | [ ] | | |
| | 4.2 | Implement N+1 query pattern detector (Django/SQLAlchemy ORM patterns) | Code | [ ] | | |
| | 4.3 | Create radon complexity tool (`app/tools/radon_tool.py`) | Code | [ ] | | |
| | 4.4 | Write Performance Agent system prompt (`prompts/performance_system.md`) | Prompt | [ ] | | |
| | 4.5 | Prepare DDIA patterns knowledge base (`knowledge/ddia_patterns.md`) | Data | [ ] | | |
| | 4.6 | Implement Performance Agent ReAct loop (`app/agents/performance_agent.py`) | Code | [ ] | | |
| | 4.7 | Fetch 10 Django PRs with known performance issues for testing | Data | [ ] | | |
| | 4.8 | Evaluate Performance Agent on Django PR dataset | Eval | [ ] | | |
| | 4.9 | Iterate on system prompt based on eval results | Prompt | [ ] | | |
| ### WEEK 5: Style Agent v1 | |
| **Goal:** Style Agent checks naming, complexity, dead code, test coverage gaps. | |
| | # | Task | Type | Status | | |
| |---|------|------|--------| | |
| | 5.1 | Create linter tool wrapper β Ruff/ESLint/pylint (`app/tools/linter_tool.py`) | Code | [ ] | | |
| | 5.2 | Implement dead code detector (unused imports, unreachable branches) | Code | [ ] | | |
| | 5.3 | Write Style Agent system prompt (`prompts/style_system.md`) | Prompt | [ ] | | |
| | 5.4 | Prepare language style guides knowledge base (`knowledge/style_guides/`) | Data | [ ] | | |
| | 5.5 | Implement Style Agent ReAct loop (`app/agents/style_agent.py`) | Code | [ ] | | |
| | 5.6 | Fetch 10 Exercism PRs with style/refactoring issues | Data | [ ] | | |
| | 5.7 | Evaluate Style Agent on Exercism dataset | Eval | [ ] | | |
| | 5.8 | Iterate on system prompt based on eval results | Prompt | [ ] | | |
| ### WEEK 6: ChromaDB + RAG Context | |
| **Goal:** Full RAG pipeline β embed repo, retrieve context, inject into agents. | |
| | # | Task | Type | Status | | |
| |---|------|------|--------| | |
| | 6.1 | Set up sentence-transformers embedding pipeline (`app/context/embedder.py`) | Code | [ ] | | |
| | 6.2 | **Run embedding model on RTX 5070 via WSL** β benchmark speed | GPU | [ ] | | |
| | 6.3 | Implement ChromaDB repo indexer (`app/context/indexer.py`) β chunk files, upsert | Code | [ ] | | |
| | 6.4 | Implement RAG retriever (`app/context/retriever.py`) β query by diff content | Code | [ ] | | |
| | 6.5 | Integrate RAG context into Security Agent | Code | [ ] | | |
| | 6.6 | Integrate RAG context into Performance Agent | Code | [ ] | | |
| | 6.7 | Integrate RAG context into Style Agent | Code | [ ] | | |
| | 6.8 | Evaluate: does cross-file RAG context improve recall vs. diff-only? | Eval | [ ] | | |
| | 6.9 | Optimize chunk size and retrieval top-k for quality vs. latency | Code | [ ] | | |
| | 6.10 | Limit repo index to 500 most recently changed files (Render memory constraint) | Code | [ ] | | |
| ### WEEK 7: Synthesizer Agent | |
| **Goal:** Deduplication, conflict resolution, Health Score, executive summary, full pipeline. | |
| | # | Task | Type | Status | | |
| |---|------|------|--------| | |
| | 7.1 | Write Synthesizer system prompt (`prompts/synthesizer_system.md`) | Prompt | [ ] | | |
| | 7.2 | Implement deduplication logic (cosine similarity on findings via ChromaDB) | Code | [ ] | | |
| | 7.3 | Implement severity conflict resolution (Security > Performance > Style precedence) | Code | [ ] | | |
| | 7.4 | Implement composite re-ranking: severity Γ exploitability Γ fix_complexity | Code | [ ] | | |
| | 7.5 | Implement PR Health Score formula (0-100) (`app/services/health_score.py`) | Code | [ ] | | |
| | 7.6 | Implement executive summary generation (3-5 sentences) | Code | [ ] | | |
| | 7.7 | Implement auto-block logic (Critical findings β block merge recommendation) | Code | [ ] | | |
| | 7.8 | Implement Synthesizer Agent (`app/agents/synthesizer.py`) | Code | [ ] | | |
| | 7.9 | Build main orchestrator (`app/services/orchestrator.py`) β ties everything together | Code | [ ] | | |
| | 7.10 | Implement Gemini Flash fallback when Groq quota exhausted | Code | [ ] | | |
| | 7.11 | Full end-to-end pipeline test: PR β agents β synthesizer β GitHub comments | Test | [ ] | | |
| | 7.12 | Write unit tests for Health Score formula | Test | [ ] | | |
| | 7.13 | Write unit tests for deduplication with synthetic conflicting findings | Test | [ ] | | |
| | 7.14 | Implement Neon Postgres write (store review record) | Code | [ ] | | |
| ### WEEK 8: Next.js Dashboard | |
| **Goal:** Dashboard on Vercel showing review history, Health Scores, charts. | |
| | # | Task | Type | Status | | |
| |---|------|------|--------| | |
| | 8.1 | Initialize Next.js app in `dashboard/` with TypeScript | Code | [ ] | | |
| | 8.2 | Deploy to Vercel (connect GitHub repo) | Deploy | [ ] | | |
| | 8.3 | Create TypeScript types matching backend schemas (`lib/types.ts`) | Code | [ ] | | |
| | 8.4 | Create API fetch wrapper (`lib/api.ts`) β calls FastAPI backend | Code | [ ] | | |
| | 8.5 | Build `HealthScoreRing` component (circular gauge, animated) | Code | [ ] | | |
| | 8.6 | Build `SeverityBadge` component (color-coded pills) | Code | [ ] | | |
| | 8.7 | Build `TrendChart` component (Recharts LineChart, 30-day trend) | Code | [ ] | | |
| | 8.8 | Build `FindingsTable` component (sortable, filterable) | Code | [ ] | | |
| | 8.9 | Build `AgentBreakdown` component (3-column cards) | Code | [ ] | | |
| | 8.10 | Build `/` page β Repository Overview (connected repos, avg scores) | Code | [ ] | | |
| | 8.11 | Build `/repos/[owner]/[repo]` page β Repo Detail (charts, PR list) | Code | [ ] | | |
| | 8.12 | Build `/repos/[owner]/[repo]/prs/[number]` page β PR Review Detail | Code | [ ] | | |
| | 8.13 | Add FastAPI CORS middleware for Vercel domain | Code | [ ] | | |
| | 8.14 | Implement REST API endpoints on FastAPI side for dashboard | Code | [ ] | | |
| ### WEEK 9: Polish & Evaluation | |
| **Goal:** Full benchmark, prompt tuning, latency optimization, documentation. | |
| | # | Task | Type | Status | | |
| |---|------|------|--------| | |
| | 9.1 | Curate full 20-PR benchmark dataset (Django, Next.js, synthetic, Exercism) | Data | [ ] | | |
| | 9.2 | Build evaluation harness (`tests/eval/run_eval.py`) | Code | [ ] | | |
| | 9.3 | Run full benchmark β measure precision, recall, latency per agent | Eval | [ ] | | |
| | 9.4 | Tune agent prompts to reduce false positives (target: <30% FP rate) | Prompt | [ ] | | |
| | 9.5 | Implement confidence threshold: findings <0.6 shown as 'Suggestions' | Code | [ ] | | |
| | 9.6 | Latency optimization: measure p50/p95/p99 per PR size bucket | Eval | [ ] | | |
| | 9.7 | Optimize Groq API calls (reduce token usage, cache prompts) | Code | [ ] | | |
| | 9.8 | Write comprehensive README.md | Docs | [ ] | | |
| | 9.9 | Write installation guide in README | Docs | [ ] | | |
| | 9.10 | Add GitHub Actions pre-warm cron (ping /health every 10min) | CI/CD | [ ] | | |
| ### WEEK 10: Launch & Promotion | |
| **Goal:** Live on GitHub Marketplace, installed on public repos, launch posts published. | |
| | # | Task | Type | Status | | |
| |---|------|------|--------| | |
| | 10.1 | Install Sentinel AI on 3 public open-source repos | Launch | [ ] | | |
| | 10.2 | Record demo video (screen recording: PR opened β comments posted) | Content | [ ] | | |
| | 10.3 | Write Dev.to / HackerNews launch post | Content | [ ] | | |
| | 10.4 | Write LinkedIn demo post | Content | [ ] | | |
| | 10.5 | Submit to GitHub Marketplace (needs privacy policy, logo, description) | Launch | [ ] | | |
| | 10.6 | Create sentinel.yml.example per-repo config template | Code | [ ] | | |
| | 10.7 | Monitor first 48 hours β fix any production bugs | Ops | [ ] | | |
| --- | |
| ## 6. Non-Coding Tasks | |
| These tasks don't involve writing project code but are essential for the project: | |
| ### 6.1 External Service Provisioning | |
| | Service | Action | URL | Notes | | |
| |---------|--------|-----|-------| | |
| | **GitHub App** | Register new app | github.com/settings/apps/new | Need: App ID, Private Key (.pem), Webhook Secret | | |
| | **Groq** | Get API key | console.groq.com | Free: 14,400 req/day | | |
| | **Google AI Studio** | Get Gemini key | aistudio.google.com | Free: 1M tokens/day | | |
| | **Neon.tech** | Create Postgres DB | console.neon.tech | Free: 512MB, create `pr_reviews` table | | |
| | **Upstash** | Create Redis instance | console.upstash.com | Free: 10K req/day | | |
| | **Render** | Create web service | dashboard.render.com | Free tier, connect GitHub repo | | |
| | **Vercel** | Create project | vercel.com/new | Free hobby tier, connect dashboard/ | | |
| | **ngrok** | Install for local testing | ngrok.com | Free: 1 tunnel | | |
| ### 6.2 GitHub App Configuration | |
| **Permissions required:** | |
| - Pull requests: Read & Write | |
| - Contents: Read | |
| - Metadata: Read | |
| - Commit statuses: Write (optional) | |
| **Webhook events to subscribe:** | |
| - `pull_request` (opened, synchronize, reopened, ready_for_review) | |
| - `pull_request_review_comment` (for @sentinel-ai re-review) | |
| ### 6.3 Data Curation Tasks | |
| | Dataset | Source | Count | Purpose | | |
| |---------|--------|-------|---------| | |
| | Synthetic security PRs | Hand-crafted | 10 PRs | SQL injection, XSS, IDOR, hardcoded secrets | | |
| | Django security PRs | github.com/django/django | 5 PRs | Real-world Python security fixes | | |
| | Next.js performance PRs | github.com/vercel/next.js | 5 PRs | JS/TS performance changes | | |
| | Exercism style PRs | github.com/exercism | 5 PRs | Naming, complexity, documentation issues | | |
| | Mixed benchmark set | All above | 20 PRs | Full evaluation benchmark | | |
| ### 6.4 Knowledge Base Curation | |
| | Document | Source | For Agent | | |
| |----------|--------|-----------| | |
| | OWASP Top 10 (2025) | owasp.org | Security Agent RAG | | |
| | DDIA performance patterns | "Designing Data-Intensive Applications" | Performance Agent RAG | | |
| | Python style guide (PEP 8) | python.org | Style Agent RAG | | |
| | JavaScript style guide | Various (Airbnb, Google) | Style Agent RAG | | |
| | TypeScript best practices | typescript-eslint.io | Style Agent RAG | | |
| --- | |
| ## 7. GPU / WSL Tasks | |
| Your **RTX 5070** with WSL will be used for: | |
| ### 7.1 sentence-transformers Embedding (Required) | |
| **No training needed** β these are pre-trained models used for embedding generation. | |
| ``` | |
| Model: all-MiniLM-L6-v2 (or all-mpnet-base-v2 for higher quality) | |
| Task: Embed code chunks for ChromaDB indexing | |
| Where: Runs locally during repo indexing (can also run on Render CPU, slower) | |
| GPU benefit: ~10-50x faster embedding generation vs CPU | |
| ``` | |
| **Setup steps:** | |
| 1. Ensure CUDA toolkit installed in WSL (`nvidia-smi` should show RTX 5070) | |
| 2. `pip install sentence-transformers torch` (with CUDA support) | |
| 3. Benchmark: embed 1000 code chunks, measure time GPU vs CPU | |
| 4. Decision: if embedding is fast enough on CPU, skip GPU for deployment simplicity | |
| ### 7.2 Local LLM Testing (Optional, Recommended) | |
| Running a local LLM for testing avoids burning Groq API quota during development: | |
| ``` | |
| Model: Llama-3.1-8B-Instruct (via Ollama or vLLM) | |
| Task: Test agent prompts locally before hitting Groq | |
| GPU benefit: Full inference locally, no API calls, no quota burn | |
| ``` | |
| **Setup steps:** | |
| 1. Install Ollama in WSL: `curl -fsSL https://ollama.com/install.sh | sh` | |
| 2. Pull model: `ollama pull llama3.1:8b` | |
| 3. Use for prompt iteration β switch to Groq (70B) for production quality | |
| ### 7.3 What You Do NOT Need to Train | |
| | Item | Reason | | |
| |------|--------| | |
| | LLM (Llama-3.1-70B) | Used via Groq API β inference only, no fine-tuning | | |
| | sentence-transformers | Pre-trained model, no fine-tuning needed for code embeddings | | |
| | Semgrep/Bandit/radon | Rule-based tools, no ML training | | |
| | Agent prompts | Iterative prompt engineering, not model training | | |
| **Bottom line:** This project is an **inference and orchestration** project, not a training project. Your GPU is used for fast local embeddings and optional local LLM testing β no model training required. | |
| --- | |
| ## 8. Data Models & Schemas | |
| ### 8.1 Finding (per agent output) | |
| ```python | |
| class Finding(BaseModel): | |
| agent: Literal['security', 'performance', 'style'] | |
| file_path: str # e.g. 'src/auth/login.py' | |
| line_start: int | |
| line_end: int | |
| severity: Literal['critical', 'high', 'medium', 'low'] | |
| category: str # e.g. 'sql_injection', 'n+1_query', 'naming' | |
| title: str # Short one-liner | |
| description: str # Full explanation | |
| suggested_fix: str # Corrected code snippet | |
| cwe_id: Optional[str] # For security findings (e.g. 'CWE-89') | |
| confidence: float # 0.0 β 1.0 | |
| ``` | |
| ### 8.2 SynthesizedReview (Synthesizer output) | |
| ```python | |
| class SynthesizedReview(BaseModel): | |
| health_score: int # 0-100 | |
| executive_summary: str # 3-5 sentences | |
| recommendation: Literal['approve', 'request_changes', 'block'] | |
| findings: List[Finding] # Deduplicated, re-ranked | |
| critical_count: int | |
| high_count: int | |
| medium_count: int | |
| low_count: int | |
| duration_ms: int | |
| ``` | |
| ### 8.3 PR Review Record (Neon Postgres) | |
| ```sql | |
| CREATE TABLE pr_reviews ( | |
| id UUID PRIMARY KEY DEFAULT gen_random_uuid(), | |
| repo_full_name TEXT NOT NULL, | |
| pr_number INT NOT NULL, | |
| commit_sha TEXT NOT NULL, | |
| health_score INT NOT NULL, | |
| critical_count INT DEFAULT 0, | |
| high_count INT DEFAULT 0, | |
| medium_count INT DEFAULT 0, | |
| low_count INT DEFAULT 0, | |
| summary TEXT, | |
| findings JSONB NOT NULL, | |
| duration_ms INT, | |
| created_at TIMESTAMPTZ DEFAULT NOW() | |
| ); | |
| CREATE INDEX idx_pr_reviews_repo ON pr_reviews(repo_full_name); | |
| CREATE INDEX idx_pr_reviews_sha ON pr_reviews(commit_sha); | |
| ``` | |
| --- | |
| ## 9. API Endpoints | |
| | Endpoint | Method | Description | | |
| |----------|--------|-------------| | |
| | `POST /webhook/github` | POST | Receive GitHub webhook, validate HMAC, enqueue analysis | | |
| | `GET /api/repos/{owner}/{repo}/reviews` | GET | Paginated PR review list + Health Score trend | | |
| | `GET /api/repos/{owner}/{repo}/reviews/{pr_number}` | GET | Full findings for specific PR | | |
| | `GET /api/repos/{owner}/{repo}/stats` | GET | Aggregate stats: avg score, top categories, 30-day trend | | |
| | `POST /api/repos/{owner}/{repo}/reanalyze/{pr_number}` | POST | Re-trigger analysis (bypass cache) | | |
| | `GET /health` | GET | Health check: agent status, Groq quota remaining | | |
| --- | |
| ## 10. Agent Prompt Design | |
| Each agent prompt must include: | |
| 1. **Role definition** β who the agent is (e.g., "senior AppSec engineer") | |
| 2. **Scope boundaries** β what to look for and what to ignore | |
| 3. **Output schema** β exact JSON structure expected | |
| 4. **Severity guidelines** β when to use Critical vs. High vs. Medium vs. Low | |
| 5. **Confidence scoring** β how to self-assess confidence (0.0-1.0) | |
| 6. **Examples** β 2-3 few-shot examples of good findings | |
| 7. **Anti-patterns** β common false positives to avoid | |
| Prompts are stored in `prompts/` as Markdown files and loaded at agent initialization. | |
| --- | |
| ## 11. Evaluation Plan | |
| ### 11.1 Metrics | |
| | Metric | Target | Formula | | |
| |--------|--------|---------| | |
| | Security precision | >70% | true_positives / (true_positives + false_positives) | | |
| | Performance recall | >60% | true_positives / (true_positives + false_negatives) | | |
| | Deduplication rate | >15% | duplicates_removed / total_findings | | |
| | e2e latency (p95) | <20s | Time from webhook to first comment posted | | |
| | Groq quota usage | <10K/day | Total API calls per day | | |
| | System uptime | >95% | (total_time - downtime) / total_time | | |
| ### 11.2 Evaluation Harness | |
| Located in `tests/eval/`: | |
| - `dataset/` β 20 PRs as JSON fixtures (diff, expected findings, ground truth labels) | |
| - `run_eval.py` β Runs each PR through full pipeline, compares output vs ground truth | |
| - `metrics.py` β Computes precision, recall, F1, latency percentiles | |
| - Results logged to console + optionally to LangSmith (free self-hosted) | |
| --- | |
| ## 12. Deployment Checklist | |
| ### Render (FastAPI Backend) | |
| - [ ] `render.yaml` configured with build + start commands | |
| - [ ] Environment variables set in Render dashboard | |
| - [ ] Health check endpoint (`/health`) configured | |
| - [ ] Auto-deploy from `main` branch enabled | |
| ### Vercel (Next.js Dashboard) | |
| - [ ] Connected to GitHub repo `dashboard/` directory | |
| - [ ] Environment variable: `NEXT_PUBLIC_API_URL` pointing to Render backend | |
| - [ ] Custom domain (optional) | |
| ### GitHub App | |
| - [ ] App registered with correct permissions | |
| - [ ] Webhook URL set to Render endpoint (`/webhook/github`) | |
| - [ ] Private key (.pem) downloaded and stored securely | |
| - [ ] App installed on test repo for development | |
| ### GitHub Actions | |
| - [ ] CI workflow: lint (ruff) + test (pytest) on push/PR | |
| - [ ] Pre-warm cron: ping /health every 10 minutes during working hours | |
| --- | |
| ## 13. Progress Tracker | |
| ### Overall Status | |
| | Week | Milestone | Status | Notes | | |
| |------|-----------|--------|-------| | |
| | 1 | Foundation & Setup | COMPLETE | All services provisioned, project scaffolded | | |
| | 2 | GitHub Integration | COMPLETE | E2E tested: webhook β fetch β comment on PR #1 | | |
| | 3 | Security Agent v1 | COMPLETE | Bandit + Llama-3.3-70B, live-tested on PR #3, 4 findings | | |
| | 4 | Performance Agent v1 | COMPLETE | Radon complexity + Llama-3.3-70B, 3 findings on PR #4 | | |
| | 5 | Style Agent v1 | COMPLETE | Ruff linter + Llama-3.3-70B, 6 findings on PR #4 | | |
| | 6 | ChromaDB + RAG Context | COMPLETE | sentence-transformers + ChromaDB, integrated into all agents | | |
| | 7 | Synthesizer Agent | COMPLETE | Dedup, conflict resolution, Health Score formula, exec summary | | |
| | 8 | Next.js Dashboard | COMPLETE | Next.js + Tailwind + Recharts, mock data, all pages | | |
| | 9 | Polish & Evaluation | COMPLETE | Eval harness, metrics, README, DB persistence | | |
| | 10 | Launch & Promotion | COMPLETE | Render config, Vercel ready, API endpoints for dashboard | | |
| ### Key Decisions Log | |
| | Date | Decision | Rationale | | |
| |------|----------|-----------| | |
| | 2026-03-19 | Project plan created | Starting from scratch, PDF spec as source of truth | | |
| | 2026-03-19 | Project renamed to "Ninja Code Guard" | User's personal branding choice | | |
| | 2026-03-19 | GitHub App: "Ninja's Code Guard" (ID: 3133457) | Registered and tested with live PR | | |
| | 2026-03-19 | Test repo: ninjacode911/codeguard-test | Used for e2e webhook testing | | |
| | 2026-03-19 | Fail-open pattern for Redis cache | Missing a review is worse than duplicating | | |
| | 2026-03-19 | Background tasks for webhook processing | GitHub's 10s timeout requires async processing | | |
| --- | |
| *Last updated: 2026-03-19* | |