CodeProbe: Complete Project Plan & Progress Tracker
Multi-Agent Code Review System
Author: Ninjacode911 | Started: March 2026 | Target: 10 Weeks
Table of Contents
- Project Overview
- Architecture Deep Dive
- Complete Tech Stack
- Directory Structure
- Week-by-Week Implementation Plan
- Non-Coding Tasks
- GPU / WSL Tasks
- Data Models & Schemas
- API Endpoints
- Agent Prompt Design
- Evaluation Plan
- Deployment Checklist
- Progress Tracker
1. Project Overview
What: A multi-agent PR review system that reviews GitHub pull requests using 4 specialized LangChain agents (Security, Performance, Style, Synthesizer), posts inline GitHub comments, and tracks code health via a Next.js dashboard.
Why: AI-generated code (41% of GitHub commits) introduces 1.7x more issues. Existing tools use single-pass LLM calls. Sentinel AI uses domain-specialized agents with debate/consensus, RAG context, and static analysis tools.
Core Thesis: Separate security, performance, and style review into specialized agents, each with distinct prompts, tools, and context, then merge via a Synthesizer into a coherent, ranked, deduplicated review.
Key Differentiators:
- Multi-agent specialization (3 domain + 1 synthesizer)
- Debate & consensus protocol (agents challenge each other before synthesis)
- Repo-aware RAG context (ChromaDB indexes full repo, not just diff)
- $0/month architecture (all free tiers)
- Structured severity scoring (Critical/High/Medium/Low with CWE IDs)
- Auto-fix suggestions (corrected code snippets inline)
2. Architecture Deep Dive
2.1 Four Layers
    ┌─────────────────────────────────────────────────────┐
    │                    GITHUB LAYER                     │
    │       Webhooks · PR Events · Inline Comments        │
    └──────────────────────┬──────────────────────────────┘
                           │ pull_request webhook
    ┌──────────────────────▼──────────────────────────────┐
    │       ORCHESTRATION LAYER (FastAPI on Render)       │
    │  Webhook receiver · HMAC validation · Redis cache   │
    │        Agent dispatcher · GitHub API client         │
    └──────────────────────┬──────────────────────────────┘
                           │ asyncio.gather()
    ┌──────────────────────▼──────────────────────────────┐
    │        AGENT LAYER (LangChain ReAct Agents)         │
    │ ┌──────────┐  ┌──────────────┐  ┌─────────┐         │
    │ │ Security │  │ Performance  │  │  Style  │ PARALLEL│
    │ │  Agent   │  │    Agent     │  │  Agent  │         │
    │ └────┬─────┘  └──────┬───────┘  └────┬────┘         │
    │      └───────────────┼───────────────┘              │
    │                      ▼                              │
    │             ┌──────────────────┐                    │
    │             │   Synthesizer    │ SEQUENTIAL         │
    │             │      Agent       │                    │
    │             └──────────────────┘                    │
    └──────────────────────┬──────────────────────────────┘
                           │
    ┌──────────────────────▼──────────────────────────────┐
    │                   KNOWLEDGE LAYER                   │
    │   ChromaDB (vector store) · Upstash Redis (cache)   │
    │   Neon Postgres (history) · sentence-transformers   │
    └─────────────────────────────────────────────────────┘
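The parallel-then-sequential shape of the agent layer can be sketched with plain `asyncio`. This is a minimal illustration, not the project's orchestrator: `run_agent`, `synthesize`, and `review` are hypothetical stand-ins, and the real agents would call the LLM and static tools instead of returning canned data.

```python
import asyncio


async def run_agent(name: str, diff: str) -> list[dict]:
    """Hypothetical stand-in for one domain agent (LLM call + tools)."""
    await asyncio.sleep(0)  # placeholder for network-bound work
    return [{"agent": name, "title": f"{name} finding", "diff_len": len(diff)}]


async def synthesize(per_agent: list[list[dict]]) -> list[dict]:
    """Hypothetical stand-in synthesizer: flattens per-agent results."""
    return [finding for findings in per_agent for finding in findings]


async def review(diff: str) -> list[dict]:
    # Parallel stage: the three domain agents run concurrently.
    per_agent = await asyncio.gather(
        run_agent("security", diff),
        run_agent("performance", diff),
        run_agent("style", diff),
    )
    # Sequential stage: the synthesizer starts only after all three finish.
    return await synthesize(list(per_agent))


findings = asyncio.run(review("+ print('hello')"))
```

`asyncio.gather` returns results in the order the awaitables were passed, so the synthesizer can rely on a stable agent order. Passing `return_exceptions=True` would be one way to keep a single failing agent from cancelling the whole review.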
2.2 Data Flow (11 Steps)
- GitHub fires `pull_request` webhook → Render FastAPI endpoint
- FastAPI validates HMAC-SHA256 signature (GitHub App secret)
- Check Upstash Redis: commit SHA already reviewed? → return cached
- Fetch via GitHub API: PR diff, changed files, full contents, commit history
- Build repo context: embed chunks with sentence-transformers → upsert ChromaDB
- Dispatch 3 parallel agents: `asyncio.gather(security, performance, style)`
- Each agent: system prompt + RAG context → Groq API → static tools → typed findings
- Synthesizer: deduplicate + resolve conflicts + Health Score + executive summary
- GitHub API: post inline comment per finding + PR summary comment
- Write review to Neon Postgres + set Redis cache (TTL: 7 days)
- Next.js dashboard fetches from Neon and updates Health Score chart
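The signature check in the second step can be sketched with the standard library. GitHub sends the HMAC-SHA256 digest of the raw request body in the `X-Hub-Signature-256` header, prefixed with `sha256=`. The function name `verify_signature` and its parameters are illustrative assumptions, not the project's actual `app/github/webhook.py`.

```python
import hashlib
import hmac
from typing import Optional


def verify_signature(secret: str, body: bytes, signature_header: Optional[str]) -> bool:
    """Return True only if the header matches the HMAC-SHA256 of the raw body."""
    if not signature_header or not signature_header.startswith("sha256="):
        return False
    digest = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    expected = "sha256=" + digest
    # compare_digest runs in constant time, which avoids leaking how many
    # leading characters of the signature were correct.
    return hmac.compare_digest(expected.encode(), signature_header.encode())
```

The check has to run on the raw bytes of the request body, before any JSON parsing, because re-serializing the payload can change whitespace and invalidate the digest.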
2.3 Context Loading (5 Layers per Agent)
- Raw PR diff (changed lines, file paths, additions/deletions)
- Relevant file sections from full repo (ChromaDB semantic search on diff)
- Recent commit history for changed files (pattern detection)
- Repo configuration (language, framework, linter rules, test coverage)
- Domain-specific knowledge base (OWASP Top 10, DDIA patterns, style guides)
3. Complete Tech Stack
3.1 LLM & AI
| Tool | Free Tier | Purpose |
|---|---|---|
| Groq API (Llama-3.1-70B) | 14,400 req/day, 500 tok/sec | Primary LLM for all agents |
| Gemini 1.5 Flash | 1M tokens/day | Fallback when Groq exhausted |
| LangChain | OSS | Agent orchestration, LCEL, ReAct framework |
| sentence-transformers | Local (GPU) | Embeddings for ChromaDB; runs on RTX 5070 via WSL |
3.2 Backend & APIs
| Tool | Free Tier | Purpose |
|---|---|---|
| FastAPI | OSS | Webhook receiver, agent dispatcher, REST API |
| Render.com | Free web service | Hosts backend (30s cold start after 15min idle) |
| GitHub Apps API | Free | Webhooks, PR comments, file fetching |
| Upstash Redis | 10K req/day | Cache PR analysis by commit SHA |
| Neon.tech | Free Postgres 512MB | Review history, Health Score trends |
3.3 Knowledge & Static Analysis
| Tool | Free Tier | Purpose |
|---|---|---|
| ChromaDB | OSS, in-memory/persisted | Vector store for RAG context retrieval |
| Semgrep OSS | Free, 3K+ rules | SAST rules for Security Agent |
| Bandit | Free | Python AST security analysis |
| detect-secrets | Free | Credential/API key scanning |
| radon | Free | Cyclomatic complexity & maintainability index |
| pylint/ESLint/Ruff | Free | Linting for Style Agent |
3.4 Frontend & Deployment
| Tool | Free Tier | Purpose |
|---|---|---|
| Vercel | Free hobby tier | Hosts Next.js dashboard |
| Next.js | OSS | Dashboard UI |
| Recharts | OSS | Health Score trend charts, pie charts |
| GitHub Actions | 2K min/month | CI/CD for Sentinel AI itself |
4. Directory Structure
    sentinel-ai/
    ├── app/
    │   ├── __init__.py
    │   ├── main.py  # FastAPI app, webhook endpoint, lifespan
    │   ├── config.py  # Settings via pydantic-settings (env vars)
    │   ├── agents/
    │   │   ├── __init__.py
    │   │   ├── base_agent.py  # Shared agent interface / base class
    │   │   ├── security_agent.py  # Security ReAct agent
    │   │   ├── performance_agent.py  # Performance ReAct agent
    │   │   ├── style_agent.py  # Style & Maintainability agent
    │   │   └── synthesizer.py  # Synthesizer + Health Score + dedup
    │   ├── tools/
    │   │   ├── __init__.py
    │   │   ├── semgrep_tool.py  # LangChain tool wrapper for Semgrep
    │   │   ├── bandit_tool.py  # LangChain tool wrapper for Bandit
    │   │   ├── detect_secrets_tool.py  # Credential scanner tool
    │   │   ├── radon_tool.py  # Complexity metrics tool
    │   │   ├── ast_analyzer.py  # Python AST analysis (N+1, patterns)
    │   │   └── linter_tool.py  # Ruff/ESLint/pylint subprocess tool
    │   ├── context/
    │   │   ├── __init__.py
    │   │   ├── embedder.py  # sentence-transformers embedding pipeline
    │   │   ├── indexer.py  # ChromaDB repo indexer (upsert chunks)
    │   │   └── retriever.py  # RAG retriever (query ChromaDB for context)
    │   ├── github/
    │   │   ├── __init__.py
    │   │   ├── webhook.py  # Webhook validation (HMAC-SHA256)
    │   │   ├── client.py  # GitHub API client (fetch diff, post comments)
    │   │   └── comment_formatter.py  # Format findings as GitHub Markdown comments
    │   ├── models/
    │   │   ├── __init__.py
    │   │   ├── findings.py  # Finding, PRReview Pydantic schemas
    │   │   └── webhook_payloads.py  # GitHub webhook event schemas
    │   ├── db/
    │   │   ├── __init__.py
    │   │   ├── postgres.py  # Neon Postgres connection + queries
    │   │   └── redis_cache.py  # Upstash Redis cache logic
    │   └── services/
    │       ├── __init__.py
    │       ├── orchestrator.py  # Main orchestration: dispatch agents, synthesize
    │       └── health_score.py  # Health Score calculation formula
    ├── dashboard/  # Next.js app (deployed to Vercel)
    │   ├── package.json
    │   ├── next.config.js
    │   ├── tsconfig.json
    │   ├── app/
    │   │   ├── layout.tsx
    │   │   ├── page.tsx  # / (Repository Overview)
    │   │   ├── repos/
    │   │   │   └── [owner]/
    │   │   │       └── [repo]/
    │   │   │           ├── page.tsx  # Repo Detail (trends, charts)
    │   │   │           └── prs/
    │   │   │               └── [number]/
    │   │   │                   └── page.tsx  # PR Review Detail
    │   │   └── api/
    │   │       ├── repos/
    │   │       │   └── route.ts  # API proxy to FastAPI backend
    │   │       └── health/
    │   │           └── route.ts
    │   ├── components/
    │   │   ├── HealthScoreRing.tsx  # Circular gauge 0-100
    │   │   ├── FindingsTable.tsx  # Sortable, filterable findings
    │   │   ├── TrendChart.tsx  # Recharts LineChart
    │   │   ├── AgentBreakdown.tsx  # 3-column agent summary cards
    │   │   ├── SeverityBadge.tsx  # Color-coded severity pill
    │   │   └── Navbar.tsx
    │   └── lib/
    │       ├── api.ts  # Fetch wrapper for backend API
    │       └── types.ts  # TypeScript types matching backend schemas
    ├── tests/
    │   ├── __init__.py
    │   ├── conftest.py  # Shared fixtures
    │   ├── unit/
    │   │   ├── test_findings_schema.py
    │   │   ├── test_synthesizer_dedup.py
    │   │   ├── test_webhook_validation.py
    │   │   ├── test_redis_cache.py
    │   │   └── test_health_score.py
    │   ├── integration/
    │   │   ├── test_full_pipeline.py
    │   │   └── test_github_posting.py
    │   └── eval/
    │       ├── dataset/  # 20-PR benchmark dataset (JSON fixtures)
    │       ├── run_eval.py  # Evaluation harness
    │       └── metrics.py  # Precision, recall, latency tracking
    ├── prompts/
    │   ├── security_system.md  # Security Agent system prompt
    │   ├── performance_system.md  # Performance Agent system prompt
    │   ├── style_system.md  # Style Agent system prompt
    │   └── synthesizer_system.md  # Synthesizer system prompt
    ├── knowledge/
    │   ├── owasp_top10_2025.md  # OWASP cheat sheet for Security RAG
    │   ├── ddia_patterns.md  # DDIA patterns for Performance RAG
    │   └── style_guides/  # Language style guides for Style RAG
    ├── .env.example  # Template for env vars (no secrets)
    ├── .gitignore
    ├── requirements.txt  # Python dependencies
    ├── requirements-dev.txt  # Dev/test dependencies
    ├── render.yaml  # Render deployment config
    ├── sentinel.yml.example  # Per-repo config template
    ├── Dockerfile  # For Render deployment
    ├── pyproject.toml  # Project metadata + tool configs
    └── README.md  # Installation, usage, architecture docs
5. Week-by-Week Implementation Plan
WEEK 1: Foundation & Setup
Goal: Project skeleton running locally, all external services provisioned.
| # | Task | Type | Status |
|---|---|---|---|
| 1.1 | Initialize git repo, create directory structure | Code | [ ] |
| 1.2 | Set up Python virtual environment + requirements.txt | Code | [ ] |
| 1.3 | Register GitHub App (github.com/settings/apps/new) | Config | [ ] |
| 1.4 | Provision Neon.tech Postgres database + create `pr_reviews` table | Config | [ ] |
| 1.5 | Provision Upstash Redis instance | Config | [ ] |
| 1.6 | Get Groq API key (console.groq.com) | Config | [ ] |
| 1.7 | Get Gemini API key (aistudio.google.com) | Config | [ ] |
| 1.8 | Create FastAPI skeleton (`app/main.py`) with health endpoint | Code | [ ] |
| 1.9 | Create `app/config.py` with pydantic-settings (all env vars) | Code | [ ] |
| 1.10 | Create Pydantic models (`Finding`, `PRReview` schemas) | Code | [ ] |
| 1.11 | Set up .env.example, .gitignore, pyproject.toml | Code | [ ] |
| 1.12 | Deploy FastAPI skeleton to Render (verify /health works) | Deploy | [ ] |
| 1.13 | Write unit tests for Finding schema validation | Test | [ ] |
| 1.14 | Set up GitHub Actions CI (lint + test on push) | CI/CD | [ ] |
WEEK 2: GitHub Integration
Goal: Receive webhooks, validate signatures, fetch PR data, post dummy comment.
| # | Task | Type | Status |
|---|---|---|---|
| 2.1 | Implement HMAC-SHA256 webhook validation (`app/github/webhook.py`) | Code | [ ] |
| 2.2 | Implement GitHub API client: fetch PR diff (`app/github/client.py`) | Code | [ ] |
| 2.3 | Implement GitHub API client: fetch file contents | Code | [ ] |
| 2.4 | Implement GitHub API client: fetch commit history | Code | [ ] |
| 2.5 | Implement GitHub API client: post inline review comments | Code | [ ] |
| 2.6 | Implement GitHub API client: post PR summary comment | Code | [ ] |
| 2.7 | Create webhook endpoint (`POST /webhook/github`) in `main.py` | Code | [ ] |
| 2.8 | Implement comment formatter (`app/github/comment_formatter.py`) | Code | [ ] |
| 2.9 | Set up ngrok for local webhook testing | Config | [ ] |
| 2.10 | End-to-end test: open PR on test repo → dummy comment posted | Test | [ ] |
| 2.11 | Implement Redis cache check (skip if commit SHA already reviewed) | Code | [ ] |
| 2.12 | Write unit tests for HMAC validation (valid + invalid signatures) | Test | [ ] |
| 2.13 | Write unit tests for Redis cache hit/miss logic | Test | [ ] |
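The comment formatter from task 2.8 could take roughly the following shape. This is a sketch under assumptions: `format_comment` is a hypothetical name, the finding is passed as a plain dict with the fields of the `Finding` schema, and the fix is rendered as a GitHub suggested-change block, which only applies cleanly when the comment is anchored to the exact lines being replaced.

```python
FENCE = "`" * 3  # built programmatically so this example nests safely in Markdown


def format_comment(finding: dict) -> str:
    """Render one finding as a GitHub-flavoured Markdown review comment."""
    header = f"**[{finding['severity'].upper()}] {finding['title']}**"
    if finding.get("cwe_id"):
        header += f" ({finding['cwe_id']})"
    lines = [header, "", finding["description"]]
    if finding.get("suggested_fix"):
        # A "suggestion" fenced block lets the PR author apply the fix in one click.
        lines += ["", FENCE + "suggestion", finding["suggested_fix"], FENCE]
    return "\n".join(lines)


example = format_comment({
    "severity": "high",
    "title": "SQL injection in login query",
    "description": "User input is concatenated into the SQL string.",
    "cwe_id": "CWE-89",
    "suggested_fix": 'cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))',
})
```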
WEEK 3: Security Agent v1
Goal: Security Agent analyzes diffs, returns structured findings with CWE IDs.
| # | Task | Type | Status |
|---|---|---|---|
| 3.1 | Install & configure Semgrep OSS with security rulesets | Config | [ ] |
| 3.2 | Create Semgrep LangChain tool (`app/tools/semgrep_tool.py`) | Code | [ ] |
| 3.3 | Install & configure Bandit for Python AST security analysis | Config | [ ] |
| 3.4 | Create Bandit LangChain tool (`app/tools/bandit_tool.py`) | Code | [ ] |
| 3.5 | Install & configure detect-secrets | Config | [ ] |
| 3.6 | Create detect-secrets LangChain tool (`app/tools/detect_secrets_tool.py`) | Code | [ ] |
| 3.7 | Write Security Agent system prompt (`prompts/security_system.md`) | Prompt | [ ] |
| 3.8 | Prepare OWASP Top 10 (2025) knowledge base (`knowledge/owasp_top10_2025.md`) | Data | [ ] |
| 3.9 | Implement Security Agent ReAct loop (`app/agents/security_agent.py`) | Code | [ ] |
| 3.10 | Implement base agent interface (`app/agents/base_agent.py`) | Code | [ ] |
| 3.11 | Set up Groq LLM client via LangChain (`ChatGroq`) | Code | [ ] |
| 3.12 | Implement structured output parsing (JSON → `Finding` objects) | Code | [ ] |
| 3.13 | Create 10 synthetic security-vulnerable PRs for testing | Data | [ ] |
| 3.14 | Evaluate Security Agent on synthetic dataset: measure precision/recall | Eval | [ ] |
| 3.15 | Iterate on system prompt based on eval results | Prompt | [ ] |
WEEK 4: Performance Agent v1
Goal: Performance Agent detects N+1 queries, complexity issues, returns findings.
| # | Task | Type | Status |
|---|---|---|---|
| 4.1 | Create Python AST analyzer tool (`app/tools/ast_analyzer.py`) | Code | [ ] |
| 4.2 | Implement N+1 query pattern detector (Django/SQLAlchemy ORM patterns) | Code | [ ] |
| 4.3 | Create radon complexity tool (`app/tools/radon_tool.py`) | Code | [ ] |
| 4.4 | Write Performance Agent system prompt (`prompts/performance_system.md`) | Prompt | [ ] |
| 4.5 | Prepare DDIA patterns knowledge base (`knowledge/ddia_patterns.md`) | Data | [ ] |
| 4.6 | Implement Performance Agent ReAct loop (`app/agents/performance_agent.py`) | Code | [ ] |
| 4.7 | Fetch 10 Django PRs with known performance issues for testing | Data | [ ] |
| 4.8 | Evaluate Performance Agent on Django PR dataset | Eval | [ ] |
| 4.9 | Iterate on system prompt based on eval results | Prompt | [ ] |
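One way to approach the N+1 detector from task 4.2 is to walk the Python AST and flag ORM-style method calls that sit inside a loop body. The sketch below is a rough heuristic, not the project's `ast_analyzer.py`: the function name and the `QUERY_METHODS` set are assumptions, and names like `get` will also match unrelated calls such as `dict.get`, so results would need filtering by the agent.

```python
import ast

# Assumed set of ORM method names that typically trigger a database query.
QUERY_METHODS = {"filter", "get", "all", "first", "count"}


def find_queries_in_loops(source: str) -> list[tuple[int, str]]:
    """Return sorted (line, method) pairs for query-like calls inside loops."""
    tree = ast.parse(source)
    hits = set()  # a set avoids double-counting calls inside nested loops
    for loop in ast.walk(tree):
        if not isinstance(loop, (ast.For, ast.AsyncFor, ast.While)):
            continue
        for stmt in loop.body:
            for node in ast.walk(stmt):
                if (
                    isinstance(node, ast.Call)
                    and isinstance(node.func, ast.Attribute)
                    and node.func.attr in QUERY_METHODS
                ):
                    hits.add((node.lineno, node.func.attr))
    return sorted(hits)


SAMPLE = """
for order in orders:
    customer = Customer.objects.get(id=order.customer_id)
    print(customer.name)
"""

hits = find_queries_in_loops(SAMPLE)
```

For `SAMPLE`, the only hit is the `get` call on line 3, which is the classic one-query-per-iteration pattern.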
WEEK 5: Style Agent v1
Goal: Style Agent checks naming, complexity, dead code, test coverage gaps.
| # | Task | Type | Status |
|---|---|---|---|
| 5.1 | Create linter tool wrapper for Ruff/ESLint/pylint (`app/tools/linter_tool.py`) | Code | [ ] |
| 5.2 | Implement dead code detector (unused imports, unreachable branches) | Code | [ ] |
| 5.3 | Write Style Agent system prompt (`prompts/style_system.md`) | Prompt | [ ] |
| 5.4 | Prepare language style guides knowledge base (`knowledge/style_guides/`) | Data | [ ] |
| 5.5 | Implement Style Agent ReAct loop (`app/agents/style_agent.py`) | Code | [ ] |
| 5.6 | Fetch 10 Exercism PRs with style/refactoring issues | Data | [ ] |
| 5.7 | Evaluate Style Agent on Exercism dataset | Eval | [ ] |
| 5.8 | Iterate on system prompt based on eval results | Prompt | [ ] |
WEEK 6: ChromaDB + RAG Context
Goal: Full RAG pipeline β embed repo, retrieve context, inject into agents.
| # | Task | Type | Status |
|---|---|---|---|
| 6.1 | Set up sentence-transformers embedding pipeline (`app/context/embedder.py`) | Code | [ ] |
| 6.2 | Run embedding model on RTX 5070 via WSL and benchmark speed | GPU | [ ] |
| 6.3 | Implement ChromaDB repo indexer (`app/context/indexer.py`): chunk files, upsert | Code | [ ] |
| 6.4 | Implement RAG retriever (`app/context/retriever.py`): query by diff content | Code | [ ] |
| 6.5 | Integrate RAG context into Security Agent | Code | [ ] |
| 6.6 | Integrate RAG context into Performance Agent | Code | [ ] |
| 6.7 | Integrate RAG context into Style Agent | Code | [ ] |
| 6.8 | Evaluate: does cross-file RAG context improve recall vs. diff-only? | Eval | [ ] |
| 6.9 | Optimize chunk size and retrieval top-k for quality vs. latency | Code | [ ] |
| 6.10 | Limit repo index to 500 most recently changed files (Render memory constraint) | Code | [ ] |
WEEK 7: Synthesizer Agent
Goal: Deduplication, conflict resolution, Health Score, executive summary, full pipeline.
| # | Task | Type | Status |
|---|---|---|---|
| 7.1 | Write Synthesizer system prompt (`prompts/synthesizer_system.md`) | Prompt | [ ] |
| 7.2 | Implement deduplication logic (cosine similarity on findings via ChromaDB) | Code | [ ] |
| 7.3 | Implement severity conflict resolution (Security > Performance > Style precedence) | Code | [ ] |
| 7.4 | Implement composite re-ranking: severity × exploitability × fix_complexity | Code | [ ] |
| 7.5 | Implement PR Health Score formula (0-100) (`app/services/health_score.py`) | Code | [ ] |
| 7.6 | Implement executive summary generation (3-5 sentences) | Code | [ ] |
| 7.7 | Implement auto-block logic (Critical findings → block merge recommendation) | Code | [ ] |
| 7.8 | Implement Synthesizer Agent (`app/agents/synthesizer.py`) | Code | [ ] |
| 7.9 | Build main orchestrator (`app/services/orchestrator.py`) to tie everything together | Code | [ ] |
| 7.10 | Implement Gemini Flash fallback when Groq quota exhausted | Code | [ ] |
| 7.11 | Full end-to-end pipeline test: PR → agents → synthesizer → GitHub comments | Test | [ ] |
| 7.12 | Write unit tests for Health Score formula | Test | [ ] |
| 7.13 | Write unit tests for deduplication with synthetic conflicting findings | Test | [ ] |
| 7.14 | Implement Neon Postgres write (store review record) | Code | [ ] |
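Tasks 7.2 and 7.3 (deduplication by cosine similarity, with Security > Performance > Style precedence) can be illustrated without a vector store. The sketch below swaps in a bag-of-words vector for the sentence-transformers embeddings the plan calls for, purely so it runs with the standard library. The `0.8` threshold, the same-file-and-line rule, and all function names are assumptions.

```python
import math
from collections import Counter

# Lower number wins when two agents report the same issue.
AGENT_PRECEDENCE = {"security": 0, "performance": 1, "style": 2}


def vectorize(text: str) -> Counter:
    """Bag-of-words stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def deduplicate(findings: list[dict], threshold: float = 0.8) -> list[dict]:
    """Keep the highest-precedence finding among near-duplicates at one location."""
    kept: list[dict] = []
    for f in sorted(findings, key=lambda f: AGENT_PRECEDENCE[f["agent"]]):
        vec = vectorize(f["title"])
        is_dup = any(
            k["file_path"] == f["file_path"]
            and k["line_start"] == f["line_start"]
            and cosine(vec, vectorize(k["title"])) >= threshold
            for k in kept
        )
        if not is_dup:
            kept.append(f)
    return kept


raw = [
    {"agent": "performance", "file_path": "app/db.py", "line_start": 10,
     "title": "SQL query built with string concatenation in loop"},
    {"agent": "style", "file_path": "app/db.py", "line_start": 30,
     "title": "Variable name too short"},
    {"agent": "security", "file_path": "app/db.py", "line_start": 10,
     "title": "SQL query built with string concatenation"},
]
unique = deduplicate(raw)
```

Sorting by precedence before the scan means the Security finding is always seen first, so the overlapping Performance finding is the one dropped.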
WEEK 8: Next.js Dashboard
Goal: Dashboard on Vercel showing review history, Health Scores, charts.
| # | Task | Type | Status |
|---|---|---|---|
| 8.1 | Initialize Next.js app in `dashboard/` with TypeScript | Code | [ ] |
| 8.2 | Deploy to Vercel (connect GitHub repo) | Deploy | [ ] |
| 8.3 | Create TypeScript types matching backend schemas (`lib/types.ts`) | Code | [ ] |
| 8.4 | Create API fetch wrapper (`lib/api.ts`) that calls the FastAPI backend | Code | [ ] |
| 8.5 | Build `HealthScoreRing` component (circular gauge, animated) | Code | [ ] |
| 8.6 | Build `SeverityBadge` component (color-coded pills) | Code | [ ] |
| 8.7 | Build `TrendChart` component (Recharts LineChart, 30-day trend) | Code | [ ] |
| 8.8 | Build `FindingsTable` component (sortable, filterable) | Code | [ ] |
| 8.9 | Build `AgentBreakdown` component (3-column cards) | Code | [ ] |
| 8.10 | Build `/` page: Repository Overview (connected repos, avg scores) | Code | [ ] |
| 8.11 | Build `/repos/[owner]/[repo]` page: Repo Detail (charts, PR list) | Code | [ ] |
| 8.12 | Build `/repos/[owner]/[repo]/prs/[number]` page: PR Review Detail | Code | [ ] |
| 8.13 | Add FastAPI CORS middleware for Vercel domain | Code | [ ] |
| 8.14 | Implement REST API endpoints on FastAPI side for dashboard | Code | [ ] |
WEEK 9: Polish & Evaluation
Goal: Full benchmark, prompt tuning, latency optimization, documentation.
| # | Task | Type | Status |
|---|---|---|---|
| 9.1 | Curate full 20-PR benchmark dataset (Django, Next.js, synthetic, Exercism) | Data | [ ] |
| 9.2 | Build evaluation harness (`tests/eval/run_eval.py`) | Code | [ ] |
| 9.3 | Run full benchmark: measure precision, recall, latency per agent | Eval | [ ] |
| 9.4 | Tune agent prompts to reduce false positives (target: <30% FP rate) | Prompt | [ ] |
| 9.5 | Implement confidence threshold: findings <0.6 shown as 'Suggestions' | Code | [ ] |
| 9.6 | Latency optimization: measure p50/p95/p99 per PR size bucket | Eval | [ ] |
| 9.7 | Optimize Groq API calls (reduce token usage, cache prompts) | Code | [ ] |
| 9.8 | Write comprehensive README.md | Docs | [ ] |
| 9.9 | Write installation guide in README | Docs | [ ] |
| 9.10 | Add GitHub Actions pre-warm cron (ping /health every 10min) | CI/CD | [ ] |
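The confidence threshold in task 9.5 amounts to a simple partition of the findings list. A minimal sketch, assuming findings are dicts carrying the `confidence` field from the `Finding` schema; the name `split_by_confidence` is hypothetical.

```python
SUGGESTION_THRESHOLD = 0.6  # from task 9.5: below this, show as a 'Suggestion'


def split_by_confidence(
    findings: list[dict], threshold: float = SUGGESTION_THRESHOLD
) -> tuple[list[dict], list[dict]]:
    """Split findings into (confirmed, suggestions) by model confidence."""
    confirmed = [f for f in findings if f["confidence"] >= threshold]
    suggestions = [f for f in findings if f["confidence"] < threshold]
    return confirmed, suggestions


confirmed, suggestions = split_by_confidence(
    [{"title": "A", "confidence": 0.9}, {"title": "B", "confidence": 0.4}]
)
```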
WEEK 10: Launch & Promotion
Goal: Live on GitHub Marketplace, installed on public repos, launch posts published.
| # | Task | Type | Status |
|---|---|---|---|
| 10.1 | Install Sentinel AI on 3 public open-source repos | Launch | [ ] |
| 10.2 | Record demo video (screen recording: PR opened → comments posted) | Content | [ ] |
| 10.3 | Write Dev.to / HackerNews launch post | Content | [ ] |
| 10.4 | Write LinkedIn demo post | Content | [ ] |
| 10.5 | Submit to GitHub Marketplace (needs privacy policy, logo, description) | Launch | [ ] |
| 10.6 | Create sentinel.yml.example per-repo config template | Code | [ ] |
| 10.7 | Monitor first 48 hours β fix any production bugs | Ops | [ ] |
6. Non-Coding Tasks
These tasks don't involve writing project code but are essential for the project:
6.1 External Service Provisioning
| Service | Action | URL | Notes |
|---|---|---|---|
| GitHub App | Register new app | github.com/settings/apps/new | Need: App ID, Private Key (.pem), Webhook Secret |
| Groq | Get API key | console.groq.com | Free: 14,400 req/day |
| Google AI Studio | Get Gemini key | aistudio.google.com | Free: 1M tokens/day |
| Neon.tech | Create Postgres DB | console.neon.tech | Free: 512MB, create pr_reviews table |
| Upstash | Create Redis instance | console.upstash.com | Free: 10K req/day |
| Render | Create web service | dashboard.render.com | Free tier, connect GitHub repo |
| Vercel | Create project | vercel.com/new | Free hobby tier, connect dashboard/ |
| ngrok | Install for local testing | ngrok.com | Free: 1 tunnel |
6.2 GitHub App Configuration
Permissions required:
- Pull requests: Read & Write
- Contents: Read
- Metadata: Read
- Commit statuses: Write (optional)
Webhook events to subscribe:
- `pull_request` (opened, synchronize, reopened, ready_for_review)
- `pull_request_review_comment` (for @sentinel-ai re-review)
6.3 Data Curation Tasks
| Dataset | Source | Count | Purpose |
|---|---|---|---|
| Synthetic security PRs | Hand-crafted | 10 PRs | SQL injection, XSS, IDOR, hardcoded secrets |
| Django security PRs | github.com/django/django | 5 PRs | Real-world Python security fixes |
| Next.js performance PRs | github.com/vercel/next.js | 5 PRs | JS/TS performance changes |
| Exercism style PRs | github.com/exercism | 5 PRs | Naming, complexity, documentation issues |
| Mixed benchmark set | All above | 20 PRs | Full evaluation benchmark |
6.4 Knowledge Base Curation
| Document | Source | For Agent |
|---|---|---|
| OWASP Top 10 (2025) | owasp.org | Security Agent RAG |
| DDIA performance patterns | "Designing Data-Intensive Applications" | Performance Agent RAG |
| Python style guide (PEP 8) | python.org | Style Agent RAG |
| JavaScript style guide | Various (Airbnb, Google) | Style Agent RAG |
| TypeScript best practices | typescript-eslint.io | Style Agent RAG |
7. GPU / WSL Tasks
Your RTX 5070 with WSL will be used for:
7.1 sentence-transformers Embedding (Required)
No training needed: these are pre-trained models used for embedding generation.
Model: all-MiniLM-L6-v2 (or all-mpnet-base-v2 for higher quality)
Task: Embed code chunks for ChromaDB indexing
Where: Runs locally during repo indexing (can also run on Render CPU, slower)
GPU benefit: ~10-50x faster embedding generation vs CPU
Setup steps:
- Ensure CUDA toolkit installed in WSL (`nvidia-smi` should show RTX 5070)
- `pip install sentence-transformers torch` (with CUDA support)
- Benchmark: embed 1000 code chunks, measure time GPU vs CPU
- Decision: if embedding is fast enough on CPU, skip GPU for deployment simplicity
7.2 Local LLM Testing (Optional, Recommended)
Running a local LLM for testing avoids burning Groq API quota during development:
Model: Llama-3.1-8B-Instruct (via Ollama or vLLM)
Task: Test agent prompts locally before hitting Groq
GPU benefit: Full inference locally, no API calls, no quota burn
Setup steps:
- Install Ollama in WSL: `curl -fsSL https://ollama.com/install.sh | sh`
- Pull model: `ollama pull llama3.1:8b`
- Use for prompt iteration, then switch to Groq (70B) for production quality
7.3 What You Do NOT Need to Train
| Item | Reason |
|---|---|
| LLM (Llama-3.1-70B) | Used via Groq API: inference only, no fine-tuning |
| sentence-transformers | Pre-trained model, no fine-tuning needed for code embeddings |
| Semgrep/Bandit/radon | Rule-based tools, no ML training |
| Agent prompts | Iterative prompt engineering, not model training |
Bottom line: This project is an inference and orchestration project, not a training project. Your GPU is used for fast local embeddings and optional local LLM testing; no model training is required.
8. Data Models & Schemas
8.1 Finding (per agent output)
    from typing import Literal, Optional

    from pydantic import BaseModel

    class Finding(BaseModel):
        agent: Literal['security', 'performance', 'style']
        file_path: str  # e.g. 'src/auth/login.py'
        line_start: int
        line_end: int
        severity: Literal['critical', 'high', 'medium', 'low']
        category: str  # e.g. 'sql_injection', 'n+1_query', 'naming'
        title: str  # Short one-liner
        description: str  # Full explanation
        suggested_fix: str  # Corrected code snippet
        # Security findings only (e.g. 'CWE-89'); defaults to None so other agents can omit it
        cwe_id: Optional[str] = None
        confidence: float  # 0.0 to 1.0
8.2 SynthesizedReview (Synthesizer output)
    from typing import List, Literal

    from pydantic import BaseModel

    class SynthesizedReview(BaseModel):
        health_score: int  # 0-100
        executive_summary: str  # 3-5 sentences
        recommendation: Literal['approve', 'request_changes', 'block']
        findings: List[Finding]  # Deduplicated, re-ranked (Finding as defined in 8.1)
        critical_count: int
        high_count: int
        medium_count: int
        low_count: int
        duration_ms: int
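The plan does not spell out the Health Score formula, so the following is only one possible shape for it: a penalty-based score clamped to 0-100, plus a recommendation rule that blocks on any Critical finding, as the auto-block task describes. The penalty weights, the `request_changes` rule for High findings, and both function names are assumptions made for illustration.

```python
# Hypothetical per-finding penalties; the real weights are not given in the plan.
SEVERITY_PENALTY = {"critical": 25, "high": 10, "medium": 4, "low": 1}


def health_score(critical: int, high: int, medium: int, low: int) -> int:
    """Start at 100 and subtract a weighted penalty, clamped to 0-100."""
    penalty = (
        critical * SEVERITY_PENALTY["critical"]
        + high * SEVERITY_PENALTY["high"]
        + medium * SEVERITY_PENALTY["medium"]
        + low * SEVERITY_PENALTY["low"]
    )
    return max(0, 100 - penalty)


def recommend(critical: int, high: int) -> str:
    """Any Critical finding blocks the merge; High findings request changes (assumed)."""
    if critical > 0:
        return "block"
    if high > 0:
        return "request_changes"
    return "approve"
```

With these weights, a PR with 1 critical, 2 high, 3 medium, and 4 low findings scores 100 - (25 + 20 + 12 + 4) = 39 and is blocked.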
8.3 PR Review Record (Neon Postgres)
    CREATE TABLE pr_reviews (
        id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
        repo_full_name TEXT NOT NULL,
        pr_number INT NOT NULL,
        commit_sha TEXT NOT NULL,
        health_score INT NOT NULL,
        critical_count INT DEFAULT 0,
        high_count INT DEFAULT 0,
        medium_count INT DEFAULT 0,
        low_count INT DEFAULT 0,
        summary TEXT,
        findings JSONB NOT NULL,
        duration_ms INT,
        created_at TIMESTAMPTZ DEFAULT NOW()
    );
    CREATE INDEX idx_pr_reviews_repo ON pr_reviews(repo_full_name);
    CREATE INDEX idx_pr_reviews_sha ON pr_reviews(commit_sha);
9. API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `POST /webhook/github` | POST | Receive GitHub webhook, validate HMAC, enqueue analysis |
| `GET /api/repos/{owner}/{repo}/reviews` | GET | Paginated PR review list + Health Score trend |
| `GET /api/repos/{owner}/{repo}/reviews/{pr_number}` | GET | Full findings for specific PR |
| `GET /api/repos/{owner}/{repo}/stats` | GET | Aggregate stats: avg score, top categories, 30-day trend |
| `POST /api/repos/{owner}/{repo}/reanalyze/{pr_number}` | POST | Re-trigger analysis (bypass cache) |
| `GET /health` | GET | Health check: agent status, Groq quota remaining |
10. Agent Prompt Design
Each agent prompt must include:
- Role definition: who the agent is (e.g., "senior AppSec engineer")
- Scope boundaries: what to look for and what to ignore
- Output schema: exact JSON structure expected
- Severity guidelines: when to use Critical vs. High vs. Medium vs. Low
- Confidence scoring: how to self-assess confidence (0.0-1.0)
- Examples: 2-3 few-shot examples of good findings
- Anti-patterns: common false positives to avoid
Prompts are stored in prompts/ as Markdown files and loaded at agent initialization.
11. Evaluation Plan
11.1 Metrics
| Metric | Target | Formula |
|---|---|---|
| Security precision | >70% | true_positives / (true_positives + false_positives) |
| Performance recall | >60% | true_positives / (true_positives + false_negatives) |
| Deduplication rate | >15% | duplicates_removed / total_findings |
| e2e latency (p95) | <20s | Time from webhook to first comment posted |
| Groq quota usage | <10K/day | Total API calls per day |
| System uptime | >95% | (total_time - downtime) / total_time |
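The precision and recall formulas in the table, together with the F1 and latency percentiles that `metrics.py` is meant to compute, can be sketched as follows. Function names are illustrative, and the percentile uses the nearest-rank method, which is one of several common definitions.

```python
import math


def precision(tp: int, fp: int) -> float:
    """true_positives / (true_positives + false_positives)"""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """true_positives / (true_positives + false_negatives)"""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value covering pct% of the samples."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

For example, 7 true positives and 3 false positives give a precision of 0.7, which sits exactly on the Security precision target rather than above it.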
11.2 Evaluation Harness
Located in tests/eval/:
- `dataset/`: 20 PRs as JSON fixtures (diff, expected findings, ground truth labels)
- `run_eval.py`: runs each PR through the full pipeline, compares output vs ground truth
- `metrics.py`: computes precision, recall, F1, latency percentiles
- Results logged to console + optionally to LangSmith (free self-hosted)
12. Deployment Checklist
Render (FastAPI Backend)
- `render.yaml` configured with build + start commands
- Environment variables set in Render dashboard
- Health check endpoint (`/health`) configured
- Auto-deploy from `main` branch enabled
Vercel (Next.js Dashboard)
- Connected to GitHub repo `dashboard/` directory
- Environment variable: `NEXT_PUBLIC_API_URL` pointing to Render backend
- Custom domain (optional)
GitHub App
- App registered with correct permissions
- Webhook URL set to Render endpoint (`/webhook/github`)
- Private key (.pem) downloaded and stored securely
- App installed on test repo for development
GitHub Actions
- CI workflow: lint (ruff) + test (pytest) on push/PR
- Pre-warm cron: ping /health every 10 minutes during working hours
13. Progress Tracker
Overall Status
| Week | Milestone | Status | Notes |
|---|---|---|---|
| 1 | Foundation & Setup | COMPLETE | All services provisioned, project scaffolded |
| 2 | GitHub Integration | COMPLETE | E2E tested: webhook → fetch → comment on PR #1 |
| 3 | Security Agent v1 | COMPLETE | Bandit + Llama-3.3-70B, live-tested on PR #3, 4 findings |
| 4 | Performance Agent v1 | COMPLETE | Radon complexity + Llama-3.3-70B, 3 findings on PR #4 |
| 5 | Style Agent v1 | COMPLETE | Ruff linter + Llama-3.3-70B, 6 findings on PR #4 |
| 6 | ChromaDB + RAG Context | COMPLETE | sentence-transformers + ChromaDB, integrated into all agents |
| 7 | Synthesizer Agent | COMPLETE | Dedup, conflict resolution, Health Score formula, exec summary |
| 8 | Next.js Dashboard | COMPLETE | Next.js + Tailwind + Recharts, mock data, all pages |
| 9 | Polish & Evaluation | COMPLETE | Eval harness, metrics, README, DB persistence |
| 10 | Launch & Promotion | COMPLETE | Render config, Vercel ready, API endpoints for dashboard |
Key Decisions Log
| Date | Decision | Rationale |
|---|---|---|
| 2026-03-19 | Project plan created | Starting from scratch, PDF spec as source of truth |
| 2026-03-19 | Project renamed to "Ninja Code Guard" | User's personal branding choice |
| 2026-03-19 | GitHub App: "Ninja's Code Guard" (ID: 3133457) | Registered and tested with live PR |
| 2026-03-19 | Test repo: ninjacode911/codeguard-test | Used for e2e webhook testing |
| 2026-03-19 | Fail-open pattern for Redis cache | Missing a review is worse than duplicating |
| 2026-03-19 | Background tasks for webhook processing | GitHub's 10s timeout requires async processing |
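The fail-open cache decision logged above can be illustrated with a small wrapper: if the cache lookup raises for any reason, the commit is treated as not yet reviewed, so the review still runs. This is a sketch under assumptions; `already_reviewed`, the `review:` key prefix, and the injected `get` callable are hypothetical, and the TTL constant mirrors the 7-day value from the data flow.

```python
from typing import Callable, Optional

CACHE_TTL_SECONDS = 7 * 24 * 60 * 60  # 7 days, as in the data flow


def already_reviewed(get: Callable[[str], Optional[str]], commit_sha: str) -> bool:
    """Return True on a cache hit; fail open (False) if the cache is unreachable."""
    try:
        return get(f"review:{commit_sha}") is not None
    except Exception:
        # Fail open: a duplicate review is cheaper than a missed one.
        return False


def broken_get(key: str) -> Optional[str]:
    """Simulates a cache outage for the demonstration below."""
    raise ConnectionError("cache unavailable")


store = {"review:abc123": "cached-review-json"}
hit = already_reviewed(store.get, "abc123")
miss = already_reviewed(store.get, "def456")
outage = already_reviewed(broken_get, "abc123")
```

Injecting the lookup as a callable keeps the sketch independent of any particular Redis client and makes the outage path easy to unit-test.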
Last updated: 2026-03-19