
CodeProbe: Complete Project Plan & Progress Tracker

Multi-Agent Code Review System
Author: Ninjacode911 | Started: March 2026 | Target: 10 Weeks


Table of Contents

  1. Project Overview
  2. Architecture Deep Dive
  3. Complete Tech Stack
  4. Directory Structure
  5. Week-by-Week Implementation Plan
  6. Non-Coding Tasks
  7. GPU / WSL Tasks
  8. Data Models & Schemas
  9. API Endpoints
  10. Agent Prompt Design
  11. Evaluation Plan
  12. Deployment Checklist
  13. Progress Tracker

1. Project Overview

What: A multi-agent PR review system that reviews GitHub pull requests using 4 specialized LangChain agents (Security, Performance, Style, Synthesizer), posts inline GitHub comments, and tracks code health via a Next.js dashboard.

Why: AI-generated code (41% of GitHub commits) introduces 1.7x more issues. Existing tools use single-pass LLM calls. Sentinel AI uses domain-specialized agents with debate/consensus, RAG context, and static analysis tools.

Core Thesis: Separate security, performance, and style review into specialized agents, each with distinct prompts, tools, and context, then merge their output via a Synthesizer into a coherent, ranked, deduplicated review.

Key Differentiators:

  • Multi-agent specialization (3 domain + 1 synthesizer)
  • Debate & consensus protocol (agents challenge each other before synthesis)
  • Repo-aware RAG context (ChromaDB indexes full repo, not just diff)
  • $0/month architecture (all free tiers)
  • Structured severity scoring (Critical/High/Medium/Low with CWE IDs)
  • Auto-fix suggestions (corrected code snippets inline)

2. Architecture Deep Dive

2.1 Four Layers

┌──────────────────────────────────────────────────────┐
│  GITHUB LAYER                                        │
│  Webhooks · PR Events · Inline Comments              │
└──────────────────────────┬───────────────────────────┘
                           │ pull_request webhook
┌──────────────────────────▼───────────────────────────┐
│  ORCHESTRATION LAYER (FastAPI on Render)             │
│  Webhook receiver · HMAC validation · Redis cache    │
│  Agent dispatcher · GitHub API client                │
└──────────────────────────┬───────────────────────────┘
                           │ asyncio.gather()
┌──────────────────────────▼───────────────────────────┐
│  AGENT LAYER (LangChain ReAct Agents)                │
│  ┌──────────┐ ┌──────────────┐ ┌─────────┐           │
│  │ Security │ │ Performance  │ │  Style  │ PARALLEL  │
│  │  Agent   │ │    Agent     │ │  Agent  │           │
│  └────┬─────┘ └──────┬───────┘ └────┬────┘           │
│       └──────────────┼──────────────┘                │
│                      ▼                               │
│            ┌──────────────────┐                      │
│            │  Synthesizer     │  SEQUENTIAL          │
│            │  Agent           │                      │
│            └──────────────────┘                      │
└──────────────────────────┬───────────────────────────┘
                           │
┌──────────────────────────▼───────────────────────────┐
│  KNOWLEDGE LAYER                                     │
│  ChromaDB (vector store) · Upstash Redis (cache)     │
│  Neon Postgres (history) · sentence-transformers     │
└──────────────────────────────────────────────────────┘

2.2 Data Flow (11 Steps)

  1. GitHub fires pull_request webhook → Render FastAPI endpoint
  2. FastAPI validates HMAC-SHA256 signature (GitHub App secret)
  3. Check Upstash Redis: commit SHA already reviewed? → return cached result
  4. Fetch via GitHub API: PR diff, changed files, full contents, commit history
  5. Build repo context: embed chunks with sentence-transformers → upsert ChromaDB
  6. Dispatch 3 parallel agents: asyncio.gather(security, performance, style)
  7. Each agent: system prompt + RAG context → Groq API → static tools → typed findings
  8. Synthesizer: deduplicate + resolve conflicts + Health Score + executive summary
  9. GitHub API: post inline comment per finding + PR summary comment
  10. Write review to Neon Postgres + set Redis cache (TTL: 7 days)
  11. Next.js dashboard fetches from Neon and updates Health Score chart
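
Step 6 (parallel dispatch) followed by the sequential hand-off to the Synthesizer can be sketched as below. This is a minimal illustration, not the project's orchestrator: `run_review`, the `Agent` type alias, and the policy of skipping a failed agent are all assumptions made for the example.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

Finding = Dict[str, object]                     # stand-in for the Finding schema
Agent = Callable[[str], Awaitable[List[Finding]]]  # hypothetical agent signature


async def run_review(diff: str, agents: Dict[str, Agent]) -> List[Finding]:
    """Run all domain agents concurrently and collect their findings."""
    names = list(agents)
    # return_exceptions=True hands back a failing agent's exception as a value
    # instead of raising it out of gather().
    results = await asyncio.gather(
        *(agents[name](diff) for name in names), return_exceptions=True
    )
    merged: List[Finding] = []
    for name, result in zip(names, results):
        if isinstance(result, BaseException):
            continue  # assumed policy: drop the failed agent, keep the others
        merged.extend(result)
    return merged  # the Synthesizer would run on this list next, sequentially
```

Total latency for this stage is then roughly that of the slowest agent rather than the sum of all three.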

2.3 Context Loading (5 Layers per Agent)

  1. Raw PR diff (changed lines, file paths, additions/deletions)
  2. Relevant file sections from full repo (ChromaDB semantic search on diff)
  3. Recent commit history for changed files (pattern detection)
  4. Repo configuration (language, framework, linter rules, test coverage)
  5. Domain-specific knowledge base (OWASP Top 10, DDIA patterns, style guides)

3. Complete Tech Stack

3.1 LLM & AI

| Tool | Free Tier | Purpose |
|---|---|---|
| Groq API (Llama-3.1-70B) | 14,400 req/day, 500 tok/sec | Primary LLM for all agents |
| Gemini 1.5 Flash | 1M tokens/day | Fallback when Groq exhausted |
| LangChain | OSS | Agent orchestration, LCEL, ReAct framework |
| sentence-transformers | Local (GPU) | Embeddings for ChromaDB; runs on RTX 5070 via WSL |

3.2 Backend & APIs

| Tool | Free Tier | Purpose |
|---|---|---|
| FastAPI | OSS | Webhook receiver, agent dispatcher, REST API |
| Render.com | Free web service | Hosts backend (30s cold start after 15min idle) |
| GitHub Apps API | Free | Webhooks, PR comments, file fetching |
| Upstash Redis | 10K req/day | Cache PR analysis by commit SHA |
| Neon.tech | Free Postgres 512MB | Review history, Health Score trends |

3.3 Knowledge & Static Analysis

| Tool | Free Tier | Purpose |
|---|---|---|
| ChromaDB | OSS, in-memory/persisted | Vector store for RAG context retrieval |
| Semgrep OSS | Free, 3K+ rules | SAST rules for Security Agent |
| Bandit | Free | Python AST security analysis |
| detect-secrets | Free | Credential/API key scanning |
| radon | Free | Cyclomatic complexity & maintainability index |
| pylint/ESLint/Ruff | Free | Linting for Style Agent |

3.4 Frontend & Deployment

| Tool | Free Tier | Purpose |
|---|---|---|
| Vercel | Free hobby tier | Hosts Next.js dashboard |
| Next.js | OSS | Dashboard UI |
| Recharts | OSS | Health Score trend charts, pie charts |
| GitHub Actions | 2K min/month | CI/CD for Sentinel AI itself |

4. Directory Structure

sentinel-ai/
├── app/
│   ├── __init__.py
│   ├── main.py                    # FastAPI app, webhook endpoint, lifespan
│   ├── config.py                  # Settings via pydantic-settings (env vars)
│   ├── agents/
│   │   ├── __init__.py
│   │   ├── base_agent.py          # Shared agent interface / base class
│   │   ├── security_agent.py      # Security ReAct agent
│   │   ├── performance_agent.py   # Performance ReAct agent
│   │   ├── style_agent.py         # Style & Maintainability agent
│   │   └── synthesizer.py         # Synthesizer + Health Score + dedup
│   ├── tools/
│   │   ├── __init__.py
│   │   ├── semgrep_tool.py        # LangChain tool wrapper for Semgrep
│   │   ├── bandit_tool.py         # LangChain tool wrapper for Bandit
│   │   ├── detect_secrets_tool.py # Credential scanner tool
│   │   ├── radon_tool.py          # Complexity metrics tool
│   │   ├── ast_analyzer.py        # Python AST analysis (N+1, patterns)
│   │   └── linter_tool.py         # Ruff/ESLint/pylint subprocess tool
│   ├── context/
│   │   ├── __init__.py
│   │   ├── embedder.py            # sentence-transformers embedding pipeline
│   │   ├── indexer.py             # ChromaDB repo indexer (upsert chunks)
│   │   └── retriever.py           # RAG retriever (query ChromaDB for context)
│   ├── github/
│   │   ├── __init__.py
│   │   ├── webhook.py             # Webhook validation (HMAC-SHA256)
│   │   ├── client.py              # GitHub API client (fetch diff, post comments)
│   │   └── comment_formatter.py   # Format findings as GitHub Markdown comments
│   ├── models/
│   │   ├── __init__.py
│   │   ├── findings.py            # Finding, PRReview Pydantic schemas
│   │   └── webhook_payloads.py    # GitHub webhook event schemas
│   ├── db/
│   │   ├── __init__.py
│   │   ├── postgres.py            # Neon Postgres connection + queries
│   │   └── redis_cache.py         # Upstash Redis cache logic
│   └── services/
│       ├── __init__.py
│       ├── orchestrator.py        # Main orchestration: dispatch agents, synthesize
│       └── health_score.py        # Health Score calculation formula
├── dashboard/                     # Next.js app (deployed to Vercel)
│   ├── package.json
│   ├── next.config.js
│   ├── tsconfig.json
│   ├── app/
│   │   ├── layout.tsx
│   │   ├── page.tsx               # Route /: Repository Overview
│   │   ├── repos/
│   │   │   └── [owner]/
│   │   │       └── [repo]/
│   │   │           ├── page.tsx   # Repo Detail (trends, charts)
│   │   │           └── prs/
│   │   │               └── [number]/
│   │   │                   └── page.tsx  # PR Review Detail
│   │   └── api/
│   │       ├── repos/
│   │       │   └── route.ts       # API proxy to FastAPI backend
│   │       └── health/
│   │           └── route.ts
│   ├── components/
│   │   ├── HealthScoreRing.tsx    # Circular gauge 0-100
│   │   ├── FindingsTable.tsx      # Sortable, filterable findings
│   │   ├── TrendChart.tsx         # Recharts LineChart
│   │   ├── AgentBreakdown.tsx     # 3-column agent summary cards
│   │   ├── SeverityBadge.tsx      # Color-coded severity pill
│   │   └── Navbar.tsx
│   └── lib/
│       ├── api.ts                 # Fetch wrapper for backend API
│       └── types.ts               # TypeScript types matching backend schemas
├── tests/
│   ├── __init__.py
│   ├── conftest.py                # Shared fixtures
│   ├── unit/
│   │   ├── test_findings_schema.py
│   │   ├── test_synthesizer_dedup.py
│   │   ├── test_webhook_validation.py
│   │   ├── test_redis_cache.py
│   │   └── test_health_score.py
│   ├── integration/
│   │   ├── test_full_pipeline.py
│   │   └── test_github_posting.py
│   └── eval/
│       ├── dataset/               # 20-PR benchmark dataset (JSON fixtures)
│       ├── run_eval.py            # Evaluation harness
│       └── metrics.py             # Precision, recall, latency tracking
├── prompts/
│   ├── security_system.md         # Security Agent system prompt
│   ├── performance_system.md      # Performance Agent system prompt
│   ├── style_system.md            # Style Agent system prompt
│   └── synthesizer_system.md      # Synthesizer system prompt
├── knowledge/
│   ├── owasp_top10_2025.md        # OWASP cheat sheet for Security RAG
│   ├── ddia_patterns.md           # DDIA patterns for Performance RAG
│   └── style_guides/              # Language style guides for Style RAG
├── .env.example                   # Template for env vars (no secrets)
├── .gitignore
├── requirements.txt               # Python dependencies
├── requirements-dev.txt           # Dev/test dependencies
├── render.yaml                    # Render deployment config
├── sentinel.yml.example           # Per-repo config template
├── Dockerfile                     # For Render deployment
├── pyproject.toml                 # Project metadata + tool configs
└── README.md                      # Installation, usage, architecture docs

5. Week-by-Week Implementation Plan

WEEK 1: Foundation & Setup

Goal: Project skeleton running locally, all external services provisioned.

| # | Task | Type | Status |
|---|---|---|---|
| 1.1 | Initialize git repo, create directory structure | Code | [ ] |
| 1.2 | Set up Python virtual environment + requirements.txt | Code | [ ] |
| 1.3 | Register GitHub App (github.com/settings/apps/new) | Config | [ ] |
| 1.4 | Provision Neon.tech Postgres database + create pr_reviews table | Config | [ ] |
| 1.5 | Provision Upstash Redis instance | Config | [ ] |
| 1.6 | Get Groq API key (console.groq.com) | Config | [ ] |
| 1.7 | Get Gemini API key (aistudio.google.com) | Config | [ ] |
| 1.8 | Create FastAPI skeleton (app/main.py) with health endpoint | Code | [ ] |
| 1.9 | Create app/config.py with pydantic-settings (all env vars) | Code | [ ] |
| 1.10 | Create Pydantic models (Finding, PRReview schemas) | Code | [ ] |
| 1.11 | Set up .env.example, .gitignore, pyproject.toml | Code | [ ] |
| 1.12 | Deploy FastAPI skeleton to Render (verify /health works) | Deploy | [ ] |
| 1.13 | Write unit tests for Finding schema validation | Test | [ ] |
| 1.14 | Set up GitHub Actions CI (lint + test on push) | CI/CD | [ ] |

WEEK 2: GitHub Integration

Goal: Receive webhooks, validate signatures, fetch PR data, post dummy comment.

| # | Task | Type | Status |
|---|---|---|---|
| 2.1 | Implement HMAC-SHA256 webhook validation (app/github/webhook.py) | Code | [ ] |
| 2.2 | Implement GitHub API client: fetch PR diff (app/github/client.py) | Code | [ ] |
| 2.3 | Implement GitHub API client: fetch file contents | Code | [ ] |
| 2.4 | Implement GitHub API client: fetch commit history | Code | [ ] |
| 2.5 | Implement GitHub API client: post inline review comments | Code | [ ] |
| 2.6 | Implement GitHub API client: post PR summary comment | Code | [ ] |
| 2.7 | Create webhook endpoint (POST /webhook/github) in main.py | Code | [ ] |
| 2.8 | Implement comment formatter (app/github/comment_formatter.py) | Code | [ ] |
| 2.9 | Set up ngrok for local webhook testing | Config | [ ] |
| 2.10 | End-to-end test: open PR on test repo → dummy comment posted | Test | [ ] |
| 2.11 | Implement Redis cache check (skip if commit SHA already reviewed) | Code | [ ] |
| 2.12 | Write unit tests for HMAC validation (valid + invalid signatures) | Test | [ ] |
| 2.13 | Write unit tests for Redis cache hit/miss logic | Test | [ ] |
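
Task 2.1 can be sketched with the standard library alone. GitHub sends the signature in the `X-Hub-Signature-256` header as `sha256=<hex digest>` computed over the raw request body; the function name `verify_signature` and its argument order are assumptions for illustration, not the contents of app/github/webhook.py.

```python
import hashlib
import hmac


def verify_signature(secret: str, body: bytes, signature_header: str) -> bool:
    """Check an X-Hub-Signature-256 header value against the raw request body."""
    if not signature_header or not signature_header.startswith("sha256="):
        return False
    expected = "sha256=" + hmac.new(
        secret.encode("utf-8"), body, hashlib.sha256
    ).hexdigest()
    # compare_digest runs in constant time, so a mismatch leaks no timing signal.
    return hmac.compare_digest(
        expected.encode("utf-8"), signature_header.encode("utf-8")
    )
```

The digest has to be computed over the exact bytes received, before any JSON parsing, since re-serializing the payload can change whitespace and break the comparison.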

WEEK 3: Security Agent v1

Goal: Security Agent analyzes diffs, returns structured findings with CWE IDs.

| # | Task | Type | Status |
|---|---|---|---|
| 3.1 | Install & configure Semgrep OSS with security rulesets | Config | [ ] |
| 3.2 | Create Semgrep LangChain tool (app/tools/semgrep_tool.py) | Code | [ ] |
| 3.3 | Install & configure Bandit for Python AST security analysis | Config | [ ] |
| 3.4 | Create Bandit LangChain tool (app/tools/bandit_tool.py) | Code | [ ] |
| 3.5 | Install & configure detect-secrets | Config | [ ] |
| 3.6 | Create detect-secrets LangChain tool (app/tools/detect_secrets_tool.py) | Code | [ ] |
| 3.7 | Write Security Agent system prompt (prompts/security_system.md) | Prompt | [ ] |
| 3.8 | Prepare OWASP Top 10 (2025) knowledge base (knowledge/owasp_top10_2025.md) | Data | [ ] |
| 3.9 | Implement Security Agent ReAct loop (app/agents/security_agent.py) | Code | [ ] |
| 3.10 | Implement base agent interface (app/agents/base_agent.py) | Code | [ ] |
| 3.11 | Set up Groq LLM client via LangChain (ChatGroq) | Code | [ ] |
| 3.12 | Implement structured output parsing (JSON → Finding objects) | Code | [ ] |
| 3.13 | Create 10 synthetic security-vulnerable PRs for testing | Data | [ ] |
| 3.14 | Evaluate Security Agent on synthetic dataset: measure precision/recall | Eval | [ ] |
| 3.15 | Iterate on system prompt based on eval results | Prompt | [ ] |

WEEK 4: Performance Agent v1

Goal: Performance Agent detects N+1 queries, complexity issues, returns findings.

| # | Task | Type | Status |
|---|---|---|---|
| 4.1 | Create Python AST analyzer tool (app/tools/ast_analyzer.py) | Code | [ ] |
| 4.2 | Implement N+1 query pattern detector (Django/SQLAlchemy ORM patterns) | Code | [ ] |
| 4.3 | Create radon complexity tool (app/tools/radon_tool.py) | Code | [ ] |
| 4.4 | Write Performance Agent system prompt (prompts/performance_system.md) | Prompt | [ ] |
| 4.5 | Prepare DDIA patterns knowledge base (knowledge/ddia_patterns.md) | Data | [ ] |
| 4.6 | Implement Performance Agent ReAct loop (app/agents/performance_agent.py) | Code | [ ] |
| 4.7 | Fetch 10 Django PRs with known performance issues for testing | Data | [ ] |
| 4.8 | Evaluate Performance Agent on Django PR dataset | Eval | [ ] |
| 4.9 | Iterate on system prompt based on eval results | Prompt | [ ] |

WEEK 5: Style Agent v1

Goal: Style Agent checks naming, complexity, dead code, test coverage gaps.

| # | Task | Type | Status |
|---|---|---|---|
| 5.1 | Create linter tool wrapper for Ruff/ESLint/pylint (app/tools/linter_tool.py) | Code | [ ] |
| 5.2 | Implement dead code detector (unused imports, unreachable branches) | Code | [ ] |
| 5.3 | Write Style Agent system prompt (prompts/style_system.md) | Prompt | [ ] |
| 5.4 | Prepare language style guides knowledge base (knowledge/style_guides/) | Data | [ ] |
| 5.5 | Implement Style Agent ReAct loop (app/agents/style_agent.py) | Code | [ ] |
| 5.6 | Fetch 10 Exercism PRs with style/refactoring issues | Data | [ ] |
| 5.7 | Evaluate Style Agent on Exercism dataset | Eval | [ ] |
| 5.8 | Iterate on system prompt based on eval results | Prompt | [ ] |

WEEK 6: ChromaDB + RAG Context

Goal: Full RAG pipeline: embed repo, retrieve context, inject into agents.

| # | Task | Type | Status |
|---|---|---|---|
| 6.1 | Set up sentence-transformers embedding pipeline (app/context/embedder.py) | Code | [ ] |
| 6.2 | Run embedding model on RTX 5070 via WSL and benchmark speed | GPU | [ ] |
| 6.3 | Implement ChromaDB repo indexer (app/context/indexer.py): chunk files, upsert | Code | [ ] |
| 6.4 | Implement RAG retriever (app/context/retriever.py): query by diff content | Code | [ ] |
| 6.5 | Integrate RAG context into Security Agent | Code | [ ] |
| 6.6 | Integrate RAG context into Performance Agent | Code | [ ] |
| 6.7 | Integrate RAG context into Style Agent | Code | [ ] |
| 6.8 | Evaluate: does cross-file RAG context improve recall vs. diff-only? | Eval | [ ] |
| 6.9 | Optimize chunk size and retrieval top-k for quality vs. latency | Code | [ ] |
| 6.10 | Limit repo index to 500 most recently changed files (Render memory constraint) | Code | [ ] |
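
The chunking step in task 6.3 might look like the sketch below, which splits a file into overlapping line windows and keeps line numbers as metadata. The function name, the default `size` and `overlap` values, and the chunk ID format are assumptions for illustration; task 6.9 is where real values would be tuned.

```python
from typing import Dict, List


def chunk_lines(
    path: str, text: str, size: int = 40, overlap: int = 10
) -> List[Dict[str, object]]:
    """Split a file into overlapping windows of `size` lines."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    lines = text.splitlines()
    chunks: List[Dict[str, object]] = []
    step = size - overlap  # each window starts this many lines after the last
    for start in range(0, len(lines), step):
        window = lines[start:start + size]
        chunks.append({
            "id": f"{path}:{start + 1}-{start + len(window)}",  # hypothetical ID scheme
            "path": path,
            "line_start": start + 1,          # 1-indexed, inclusive
            "line_end": start + len(window),
            "text": "\n".join(window),
        })
        if start + size >= len(lines):
            break  # the window already reached the end of the file
    return chunks
```

Deterministic IDs built from path and line range let an upsert overwrite a stale chunk in place when the same region of a file is re-indexed.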

WEEK 7: Synthesizer Agent

Goal: Deduplication, conflict resolution, Health Score, executive summary, full pipeline.

| # | Task | Type | Status |
|---|---|---|---|
| 7.1 | Write Synthesizer system prompt (prompts/synthesizer_system.md) | Prompt | [ ] |
| 7.2 | Implement deduplication logic (cosine similarity on findings via ChromaDB) | Code | [ ] |
| 7.3 | Implement severity conflict resolution (Security > Performance > Style precedence) | Code | [ ] |
| 7.4 | Implement composite re-ranking: severity × exploitability × fix_complexity | Code | [ ] |
| 7.5 | Implement PR Health Score formula (0-100) (app/services/health_score.py) | Code | [ ] |
| 7.6 | Implement executive summary generation (3-5 sentences) | Code | [ ] |
| 7.7 | Implement auto-block logic (Critical findings → block merge recommendation) | Code | [ ] |
| 7.8 | Implement Synthesizer Agent (app/agents/synthesizer.py) | Code | [ ] |
| 7.9 | Build main orchestrator (app/services/orchestrator.py): ties everything together | Code | [ ] |
| 7.10 | Implement Gemini Flash fallback when Groq quota exhausted | Code | [ ] |
| 7.11 | Full end-to-end pipeline test: PR → agents → synthesizer → GitHub comments | Test | [ ] |
| 7.12 | Write unit tests for Health Score formula | Test | [ ] |
| 7.13 | Write unit tests for deduplication with synthetic conflicting findings | Test | [ ] |
| 7.14 | Implement Neon Postgres write (store review record) | Code | [ ] |
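
Tasks 7.2 and 7.3 can be illustrated together. The sketch below swaps in a bag-of-words cosine similarity for the embedding-based similarity the plan calls for, so that it runs with the standard library only; the 0.8 threshold and the rule that duplicates must overlap in file and line range are assumptions.

```python
import math
import re
from collections import Counter
from typing import Dict, List

# Lower number wins when two agents report the same issue (task 7.3).
AGENT_PRECEDENCE = {"security": 0, "performance": 1, "style": 2}


def _vector(finding: Dict) -> Counter:
    text = f"{finding['title']} {finding['description']}".lower()
    return Counter(re.findall(r"[a-z0-9_]+", text))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def deduplicate(findings: List[Dict], threshold: float = 0.8) -> List[Dict]:
    """Keep one finding per near-duplicate group, preferring higher precedence."""
    # Visiting higher-precedence agents first means their copy is the one kept.
    ordered = sorted(findings, key=lambda f: AGENT_PRECEDENCE[f["agent"]])
    kept: List[Dict] = []
    for f in ordered:
        vec = _vector(f)
        is_duplicate = any(
            k["file_path"] == f["file_path"]
            and k["line_start"] <= f["line_end"]
            and f["line_start"] <= k["line_end"]
            and cosine(vec, _vector(k)) >= threshold
            for k in kept
        )
        if not is_duplicate:
            kept.append(f)
    return kept
```

Requiring a location overlap before comparing text keeps two similar-sounding findings in different files from being merged by accident.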

WEEK 8: Next.js Dashboard

Goal: Dashboard on Vercel showing review history, Health Scores, charts.

| # | Task | Type | Status |
|---|---|---|---|
| 8.1 | Initialize Next.js app in dashboard/ with TypeScript | Code | [ ] |
| 8.2 | Deploy to Vercel (connect GitHub repo) | Deploy | [ ] |
| 8.3 | Create TypeScript types matching backend schemas (lib/types.ts) | Code | [ ] |
| 8.4 | Create API fetch wrapper (lib/api.ts) that calls the FastAPI backend | Code | [ ] |
| 8.5 | Build HealthScoreRing component (circular gauge, animated) | Code | [ ] |
| 8.6 | Build SeverityBadge component (color-coded pills) | Code | [ ] |
| 8.7 | Build TrendChart component (Recharts LineChart, 30-day trend) | Code | [ ] |
| 8.8 | Build FindingsTable component (sortable, filterable) | Code | [ ] |
| 8.9 | Build AgentBreakdown component (3-column cards) | Code | [ ] |
| 8.10 | Build / page: Repository Overview (connected repos, avg scores) | Code | [ ] |
| 8.11 | Build /repos/[owner]/[repo] page: Repo Detail (charts, PR list) | Code | [ ] |
| 8.12 | Build /repos/[owner]/[repo]/prs/[number] page: PR Review Detail | Code | [ ] |
| 8.13 | Add FastAPI CORS middleware for Vercel domain | Code | [ ] |
| 8.14 | Implement REST API endpoints on FastAPI side for dashboard | Code | [ ] |

WEEK 9: Polish & Evaluation

Goal: Full benchmark, prompt tuning, latency optimization, documentation.

| # | Task | Type | Status |
|---|---|---|---|
| 9.1 | Curate full 20-PR benchmark dataset (Django, Next.js, synthetic, Exercism) | Data | [ ] |
| 9.2 | Build evaluation harness (tests/eval/run_eval.py) | Code | [ ] |
| 9.3 | Run full benchmark: measure precision, recall, latency per agent | Eval | [ ] |
| 9.4 | Tune agent prompts to reduce false positives (target: <30% FP rate) | Prompt | [ ] |
| 9.5 | Implement confidence threshold: findings <0.6 shown as 'Suggestions' | Code | [ ] |
| 9.6 | Latency optimization: measure p50/p95/p99 per PR size bucket | Eval | [ ] |
| 9.7 | Optimize Groq API calls (reduce token usage, cache prompts) | Code | [ ] |
| 9.8 | Write comprehensive README.md | Docs | [ ] |
| 9.9 | Write installation guide in README | Docs | [ ] |
| 9.10 | Add GitHub Actions pre-warm cron (ping /health every 10min) | CI/CD | [ ] |
WEEK 10: Launch & Promotion

Goal: Live on GitHub Marketplace, installed on public repos, launch posts published.

| # | Task | Type | Status |
|---|---|---|---|
| 10.1 | Install Sentinel AI on 3 public open-source repos | Launch | [ ] |
| 10.2 | Record demo video (screen recording: PR opened → comments posted) | Content | [ ] |
| 10.3 | Write Dev.to / HackerNews launch post | Content | [ ] |
| 10.4 | Write LinkedIn demo post | Content | [ ] |
| 10.5 | Submit to GitHub Marketplace (needs privacy policy, logo, description) | Launch | [ ] |
| 10.6 | Create sentinel.yml.example per-repo config template | Code | [ ] |
| 10.7 | Monitor first 48 hours and fix any production bugs | Ops | [ ] |

6. Non-Coding Tasks

These tasks involve no project code but are essential to delivery:

6.1 External Service Provisioning

| Service | Action | URL | Notes |
|---|---|---|---|
| GitHub App | Register new app | github.com/settings/apps/new | Need: App ID, Private Key (.pem), Webhook Secret |
| Groq | Get API key | console.groq.com | Free: 14,400 req/day |
| Google AI Studio | Get Gemini key | aistudio.google.com | Free: 1M tokens/day |
| Neon.tech | Create Postgres DB | console.neon.tech | Free: 512MB, create pr_reviews table |
| Upstash | Create Redis instance | console.upstash.com | Free: 10K req/day |
| Render | Create web service | dashboard.render.com | Free tier, connect GitHub repo |
| Vercel | Create project | vercel.com/new | Free hobby tier, connect dashboard/ |
| ngrok | Install for local testing | ngrok.com | Free: 1 tunnel |

6.2 GitHub App Configuration

Permissions required:

  • Pull requests: Read & Write
  • Contents: Read
  • Metadata: Read
  • Commit statuses: Write (optional)

Webhook events to subscribe:

  • pull_request (opened, synchronize, reopened, ready_for_review)
  • pull_request_review_comment (for @sentinel-ai re-review)
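
The subscription list above translates into a small gate at the top of the webhook handler. In this sketch `event` is the value of GitHub's `X-GitHub-Event` header and `payload` is the parsed JSON body; the function name and the decision to skip draft PRs until `ready_for_review` arrives are assumptions, not something the plan states.

```python
from typing import Any, Dict

# pull_request actions listed in section 6.2
REVIEW_ACTIONS = {"opened", "synchronize", "reopened", "ready_for_review"}


def should_review(event: str, payload: Dict[str, Any]) -> bool:
    """Decide whether a webhook delivery should start a review."""
    if event != "pull_request":
        return False
    if payload.get("action") not in REVIEW_ACTIONS:
        return False
    # Assumed policy: ignore drafts until the author marks them ready.
    return not payload.get("pull_request", {}).get("draft", False)
```

The `pull_request_review_comment` path for re-review requests would need its own check and is left out here.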

6.3 Data Curation Tasks

| Dataset | Source | Count | Purpose |
|---|---|---|---|
| Synthetic security PRs | Hand-crafted | 10 PRs | SQL injection, XSS, IDOR, hardcoded secrets |
| Django security PRs | github.com/django/django | 5 PRs | Real-world Python security fixes |
| Next.js performance PRs | github.com/vercel/next.js | 5 PRs | JS/TS performance changes |
| Exercism style PRs | github.com/exercism | 5 PRs | Naming, complexity, documentation issues |
| Mixed benchmark set | All above | 20 PRs | Full evaluation benchmark |

6.4 Knowledge Base Curation

| Document | Source | For Agent |
|---|---|---|
| OWASP Top 10 (2025) | owasp.org | Security Agent RAG |
| DDIA performance patterns | "Designing Data-Intensive Applications" | Performance Agent RAG |
| Python style guide (PEP 8) | python.org | Style Agent RAG |
| JavaScript style guide | Various (Airbnb, Google) | Style Agent RAG |
| TypeScript best practices | typescript-eslint.io | Style Agent RAG |

7. GPU / WSL Tasks

Your RTX 5070 with WSL will be used for:

7.1 sentence-transformers Embedding (Required)

No training needed: these are pre-trained models used only for embedding generation.

Model: all-MiniLM-L6-v2 (or all-mpnet-base-v2 for higher quality)
Task: Embed code chunks for ChromaDB indexing
Where: Runs locally during repo indexing (can also run on Render CPU, slower)
GPU benefit: ~10-50x faster embedding generation vs CPU

Setup steps:

  1. Ensure CUDA toolkit installed in WSL (nvidia-smi should show RTX 5070)
  2. pip install sentence-transformers torch (with CUDA support)
  3. Benchmark: embed 1000 code chunks, measure time GPU vs CPU
  4. Decision: if embedding is fast enough on CPU, skip GPU for deployment simplicity

7.2 Local LLM Testing (Optional, Recommended)

Running a local LLM for testing avoids burning Groq API quota during development:

Model: Llama-3.1-8B-Instruct (via Ollama or vLLM)
Task: Test agent prompts locally before hitting Groq
GPU benefit: Full inference locally, no API calls, no quota burn

Setup steps:

  1. Install Ollama in WSL: curl -fsSL https://ollama.com/install.sh | sh
  2. Pull model: ollama pull llama3.1:8b
  3. Use for prompt iteration, then switch to Groq (70B) for production quality

7.3 What You Do NOT Need to Train

| Item | Reason |
|---|---|
| LLM (Llama-3.1-70B) | Used via Groq API; inference only, no fine-tuning |
| sentence-transformers | Pre-trained model, no fine-tuning needed for code embeddings |
| Semgrep/Bandit/radon | Rule-based tools, no ML training |
| Agent prompts | Iterative prompt engineering, not model training |

Bottom line: This is an inference and orchestration project, not a training project. Your GPU is used for fast local embeddings and optional local LLM testing; no model training is required.


8. Data Models & Schemas

8.1 Finding (per agent output)

from typing import Literal, Optional

from pydantic import BaseModel, Field


class Finding(BaseModel):
    agent: Literal['security', 'performance', 'style']
    file_path: str              # e.g. 'src/auth/login.py'
    line_start: int
    line_end: int
    severity: Literal['critical', 'high', 'medium', 'low']
    category: str               # e.g. 'sql_injection', 'n+1_query', 'naming'
    title: str                  # Short one-liner
    description: str            # Full explanation
    suggested_fix: str          # Corrected code snippet
    cwe_id: Optional[str] = None  # Security findings only (e.g. 'CWE-89'); None otherwise
    confidence: float = Field(ge=0.0, le=1.0)  # Validated to lie in 0.0 to 1.0

8.2 SynthesizedReview (Synthesizer output)

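from typing import List, Literal

from pydantic import BaseModel

# Finding is the model from 8.1, assumed to live in the same module.


class SynthesizedReview(BaseModel):
    health_score: int                        # 0-100
    executive_summary: str                   # 3-5 sentences
    recommendation: Literal['approve', 'request_changes', 'block']
    findings: List[Finding]                  # Deduplicated, re-ranked
    critical_count: int
    high_count: int
    medium_count: int
    low_count: int
    duration_ms: int                         # Total pipeline time in milliseconds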
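
To show how the count, score, and recommendation fields relate, here is one possible derivation from a list of findings. The plan does not specify the Health Score formula, so the penalty weights below are purely hypothetical; only the "critical finding means block" rule comes from task 7.7, and the "high finding means request_changes" rule is an assumption.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical penalty per finding; the real formula is not given in this plan.
PENALTY = {"critical": 25, "high": 10, "medium": 4, "low": 1}


def summarize(findings: List[Dict]) -> Dict[str, object]:
    """Derive severity counts, a 0-100 score, and a recommendation."""
    counts = Counter(f["severity"] for f in findings)
    penalty = sum(PENALTY[sev] * n for sev, n in counts.items())
    if counts["critical"]:
        recommendation = "block"            # auto-block on any critical finding
    elif counts["high"]:
        recommendation = "request_changes"  # assumed rule
    else:
        recommendation = "approve"
    return {
        "health_score": max(0, 100 - penalty),  # clamp so the score stays in 0-100
        "recommendation": recommendation,
        "critical_count": counts["critical"],
        "high_count": counts["high"],
        "medium_count": counts["medium"],
        "low_count": counts["low"],
    }
```

Whatever weights are eventually chosen, clamping at zero keeps a PR with many findings inside the documented 0-100 range.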

8.3 PR Review Record (Neon Postgres)

CREATE TABLE pr_reviews (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    repo_full_name  TEXT NOT NULL,
    pr_number       INT NOT NULL,
    commit_sha      TEXT NOT NULL,
    health_score    INT NOT NULL,
    critical_count  INT DEFAULT 0,
    high_count      INT DEFAULT 0,
    medium_count    INT DEFAULT 0,
    low_count       INT DEFAULT 0,
    summary         TEXT,
    findings        JSONB NOT NULL,
    duration_ms     INT,
    created_at      TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_pr_reviews_repo ON pr_reviews(repo_full_name);
CREATE INDEX idx_pr_reviews_sha ON pr_reviews(commit_sha);

9. API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| /webhook/github | POST | Receive GitHub webhook, validate HMAC, enqueue analysis |
| /api/repos/{owner}/{repo}/reviews | GET | Paginated PR review list + Health Score trend |
| /api/repos/{owner}/{repo}/reviews/{pr_number} | GET | Full findings for specific PR |
| /api/repos/{owner}/{repo}/stats | GET | Aggregate stats: avg score, top categories, 30-day trend |
| /api/repos/{owner}/{repo}/reanalyze/{pr_number} | POST | Re-trigger analysis (bypass cache) |
| /health | GET | Health check: agent status, Groq quota remaining |

10. Agent Prompt Design

Each agent prompt must include:

  1. Role definition: who the agent is (e.g., "senior AppSec engineer")
  2. Scope boundaries: what to look for and what to ignore
  3. Output schema: exact JSON structure expected
  4. Severity guidelines: when to use Critical vs. High vs. Medium vs. Low
  5. Confidence scoring: how to self-assess confidence (0.0-1.0)
  6. Examples: 2-3 few-shot examples of good findings
  7. Anti-patterns: common false positives to avoid

Prompts are stored in prompts/ as Markdown files and loaded at agent initialization.


11. Evaluation Plan

11.1 Metrics

| Metric | Target | Formula |
|---|---|---|
| Security precision | >70% | true_positives / (true_positives + false_positives) |
| Performance recall | >60% | true_positives / (true_positives + false_negatives) |
| Deduplication rate | >15% | duplicates_removed / total_findings |
| E2E latency (p95) | <20s | Time from webhook to first comment posted |
| Groq quota usage | <10K/day | Total API calls per day |
| System uptime | >95% | (total_time - downtime) / total_time |
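
The precision, recall, and latency rows above can be computed as in this sketch. It assumes each finding is reduced to a hashable key such as (file, line, category) so that predicted and expected sets can be compared, and it uses the nearest-rank definition of a percentile; both choices are assumptions rather than what tests/eval/metrics.py necessarily does.

```python
import math
from typing import Hashable, List, Set, Tuple


def precision_recall_f1(
    predicted: Set[Hashable], expected: Set[Hashable]
) -> Tuple[float, float, float]:
    """Compare predicted finding keys against ground-truth keys."""
    tp = len(predicted & expected)   # reported and real
    fp = len(predicted - expected)   # reported but not real
    fn = len(expected - predicted)   # real but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

With only 20 PRs in the benchmark, p95 and p99 land on the last one or two samples, so the tail figures will be noisy until more runs are collected.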

11.2 Evaluation Harness

Located in tests/eval/:

  • dataset/: 20 PRs as JSON fixtures (diff, expected findings, ground truth labels)
  • run_eval.py: runs each PR through the full pipeline and compares output against ground truth
  • metrics.py: computes precision, recall, F1, latency percentiles
  • Results are logged to the console and optionally to LangSmith (free tier)

12. Deployment Checklist

Render (FastAPI Backend)

  • render.yaml configured with build + start commands
  • Environment variables set in Render dashboard
  • Health check endpoint (/health) configured
  • Auto-deploy from main branch enabled

Vercel (Next.js Dashboard)

  • Connected to GitHub repo dashboard/ directory
  • Environment variable: NEXT_PUBLIC_API_URL pointing to Render backend
  • Custom domain (optional)

GitHub App

  • App registered with correct permissions
  • Webhook URL set to Render endpoint (/webhook/github)
  • Private key (.pem) downloaded and stored securely
  • App installed on test repo for development

GitHub Actions

  • CI workflow: lint (ruff) + test (pytest) on push/PR
  • Pre-warm cron: ping /health every 10 minutes during working hours

13. Progress Tracker

Overall Status

| Week | Milestone | Status | Notes |
|---|---|---|---|
| 1 | Foundation & Setup | COMPLETE | All services provisioned, project scaffolded |
| 2 | GitHub Integration | COMPLETE | E2E tested: webhook → fetch → comment on PR #1 |
| 3 | Security Agent v1 | COMPLETE | Bandit + Llama-3.3-70B, live-tested on PR #3, 4 findings |
| 4 | Performance Agent v1 | COMPLETE | Radon complexity + Llama-3.3-70B, 3 findings on PR #4 |
| 5 | Style Agent v1 | COMPLETE | Ruff linter + Llama-3.3-70B, 6 findings on PR #4 |
| 6 | ChromaDB + RAG Context | COMPLETE | sentence-transformers + ChromaDB, integrated into all agents |
| 7 | Synthesizer Agent | COMPLETE | Dedup, conflict resolution, Health Score formula, exec summary |
| 8 | Next.js Dashboard | COMPLETE | Next.js + Tailwind + Recharts, mock data, all pages |
| 9 | Polish & Evaluation | COMPLETE | Eval harness, metrics, README, DB persistence |
| 10 | Launch & Promotion | COMPLETE | Render config, Vercel ready, API endpoints for dashboard |

Key Decisions Log

| Date | Decision | Rationale |
|---|---|---|
| 2026-03-19 | Project plan created | Starting from scratch, PDF spec as source of truth |
| 2026-03-19 | Project renamed to "Ninja Code Guard" | User's personal branding choice |
| 2026-03-19 | GitHub App: "Ninja's Code Guard" (ID: 3133457) | Registered and tested with live PR |
| 2026-03-19 | Test repo: ninjacode911/codeguard-test | Used for e2e webhook testing |
| 2026-03-19 | Fail-open pattern for Redis cache | Missing a review is worse than duplicating |
| 2026-03-19 | Background tasks for webhook processing | GitHub's 10s timeout requires async processing |
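
The fail-open cache decision can be sketched as a thin wrapper around the cache lookup. Here `fetch` stands in for whatever call performs the Redis GET, and the `review:<sha>` key format is hypothetical; neither is taken from app/db/redis_cache.py.

```python
from typing import Callable, Optional


def get_cached_review(
    fetch: Callable[[str], Optional[str]], commit_sha: str
) -> Optional[str]:
    """Return the cached review, or None on a miss or on any cache failure."""
    try:
        return fetch(f"review:{commit_sha}")  # hypothetical key format
    except Exception:
        # Fail open: a broken cache is treated as a miss so the review still
        # runs. The cost is a possible duplicate review, which the decision
        # log accepts as better than a missing one.
        return None
```

In practice the `except` branch would also log the error, so that a persistently unreachable cache does not go unnoticed while quietly consuming LLM quota.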

Last updated: 2026-03-19