---
title: CodeTribunal
emoji: 💻
colorFrom: pink
colorTo: red
sdk: docker
pinned: false
license: mit
short_description: The AI Courtroom That Exposes Bad Freelance Code
---

# CodeTribunal

### Put Freelance Code on Trial.

**Upload code. Get a verdict. Know the risk.**

Built with **GLM 5.1 + CrewAI + GritQL**

[![Tests](https://github.com/amineyagoub/CodeTribunal/actions/workflows/tests.yml/badge.svg)](https://github.com/amineyagoub/CodeTribunal/actions/workflows/tests.yml)
[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Built for GLM 5.1 Hackathon](https://img.shields.io/badge/Built%20for-GLM%205.1-ff69b4)](https://build-with-glm-5-1-challenge.devpost.com)

---

## 🚨 The Problem

Clients receive code they don’t understand.

- Looks clean… but hides security risks
- Passes linters… but fails in production
- Works… but is architecturally broken

**No one answers the only question that matters:**

> _Is this code safe, professional, and worth paying for?_

---

## The Solution

**CodeTribunal turns code review into a courtroom trial.**

Upload a `.zip` → get:

- Forensic evidence (AST-level)
- Multi-agent investigation
- AI courtroom debate
- Final verdict + risk score

> Not just analysis, but **judgment**.

---

## 🧠 Why This Exists

### 1. Real System

- 6-phase pipeline
- 8 specialized agents
- Persistent execution engine

### 2. Agents That Actually Act

- File reads, pattern search, call tracing
- Real tool usage via function calling (not fake reasoning)

### 3. Deterministic + AI Hybrid

- **GritQL = ground truth**
- **Agents = interpretation + argument**

### 4. End-to-End Story

From raw code → evidence → debate → verdict → report

## How It Works

CodeTribunal runs a **6-phase pipeline**, with each phase building on the last:

### Phase 1: Forensic Evidence (Deterministic, No LLM)

GritQL scans the entire codebase with **17 forensic patterns** across security and quality domains:

| Domain      | Patterns | Examples                                                                                 |
| ----------- | -------- | ---------------------------------------------------------------------------------------- |
| 🔴 Security | 13       | Hardcoded secrets, `eval()`, SQL injection, `pickle.load()`, `os.system()`, weak hashing |
| 🟡 Quality  | 4        | `TODO`, `FIXME`, `HACK` comments                                                         |

All scanning is **read-only** (`--dry-run`) and runs in **parallel** across patterns.
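
The parallel, read-only scan in Phase 1 can be sketched roughly as follows. This is a minimal illustration, not CodeTribunal's actual scanner: `scan_patterns`, `fake_scan`, the pattern names, and the finding shape are all hypothetical, and the GritQL invocation is replaced by a stub so the example runs without the CLI installed.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical finding shape: {"pattern": str, "file": str, "line": int}
Finding = dict


def scan_patterns(
    patterns: list[str],
    scan_one: Callable[[str], list[Finding]],
    max_workers: int = 4,
) -> list[Finding]:
    """Run scan_one(pattern) for every pattern concurrently and merge the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input patterns
        batches = pool.map(scan_one, patterns)
        return [finding for batch in batches for finding in batch]


def fake_scan(pattern: str) -> list[Finding]:
    """Stand-in for a single read-only GritQL pattern run (assumption, not the real call)."""
    hits = {
        "eval_call": [("app.py", 12)],
        "todo_comment": [("app.py", 3), ("db.py", 40)],
    }
    return [{"pattern": pattern, "file": f, "line": n} for f, n in hits.get(pattern, [])]


findings = scan_patterns(["eval_call", "hardcoded_secret", "todo_comment"], fake_scan)
for finding in findings:
    print(finding)
```

Because each pattern run is independent and never modifies the codebase, the runs can share a thread pool without coordination; in a real scanner, `scan_one` would shell out to the GritQL CLI and parse its output.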

### Phase 2: Code Dependency Graph (AST, No LLM)

Python's `ast` module and regex-based JS parsing build a **lightweight dependency graph**:

- Nodes: files, functions, classes, imports
- Edges: calls, imports, containment, inheritance
- Enables call-chain tracing: `eval() → handle_request() → app.route()`

### Phase 3: Investigation (3 ReACT Agents + 4 Tools)

Three specialist investigators each run a **genuine ReACT loop** (Reason → Act → Observe → Repeat) using **Z.ai's native function calling** via LiteLLM:

| Agent                        | Tools                                                     | Purpose                                    |
| ---------------------------- | --------------------------------------------------------- | ------------------------------------------ |
| 🛡️ Security Investigator     | FileReader, PatternSearch, CodeGraphQuery, FindingContext | Find vulnerabilities, trace attack vectors |
| 📋 Quality Investigator      | FileReader, FindingContext                                | Assess technical debt, detect negligence   |
| 🏗️ Architecture Investigator | FileReader, CodeGraphQuery                                | Analyze structure, trace dependencies      |

Each agent **autonomously decides which tools to call**, observes the results, and iterates. For example, the Security Investigator might:

1. Call `file_reader` to read a flagged file
2. Observe hardcoded secrets on specific lines
3. Call `code_graph_query` to trace where those secrets are used
4. Produce a detailed report with file paths, line numbers, and severity ratings

**Verified working**: GLM-5 + LiteLLM function calling confirmed. Agents make real tool calls that execute real code analysis.

### Phase 4: The Trial (3 Agents)

A courtroom debate between AI agents:

1. **The Prosecutor**: builds the case for negligence, cites specific evidence
2. **The Defense Attorney**: challenges claims, argues context and proportionality
3. **Rebuttal**: the prosecutor responds to the defense

Agents use CrewAI's `context` parameter to chain arguments: the prosecution's output feeds into the defense's context, and both feed into the rebuttal.
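
The chaining described for the trial can be illustrated with a small plain-Python sketch. It deliberately does not use CrewAI: `Step`, `run_trial`, and the canned arguments are hypothetical stand-ins meant only to show how each step reads the outputs of the steps listed in its context.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    """One trial step; `context` lists the earlier steps whose output it may read."""

    name: str
    argue: Callable[[list[str]], str]
    context: list["Step"] = field(default_factory=list)
    output: str = ""


def run_trial(steps: list[Step]) -> dict[str, str]:
    """Run steps in order, handing each one the outputs of its context steps."""
    transcript: dict[str, str] = {}
    for step in steps:
        prior = [s.output for s in step.context]
        step.output = step.argue(prior)
        transcript[step.name] = step.output
    return transcript


# Canned arguments stand in for LLM calls (assumption for illustration only).
prosecution = Step("prosecution", lambda prior: "eval() on user input at app.py:12")
defense = Step(
    "defense",
    lambda prior: f"Re '{prior[0]}': endpoint is internal-only",
    context=[prosecution],
)
rebuttal = Step(
    "rebuttal",
    lambda prior: f"Responding to {len(prior)} prior arguments",
    context=[prosecution, defense],
)

transcript = run_trial([prosecution, defense, rebuttal])
for name, text in transcript.items():
    print(f"{name}: {text}")
```

Declaring dependencies explicitly, rather than passing the whole transcript to every step, keeps each agent's prompt limited to the arguments it is supposed to answer.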

### Phase 5: The Verdict

**🔨 The Judge** reviews all evidence, investigation reports, and the full trial transcript, and delivers:

- Overall ruling: GUILTY / MIXED / NOT GUILTY
- Reputational Risk Score (0-100)
- Findings summary with severity rankings

### Phase 6: Structured Report

**📝 Verdict Report Agent** compiles everything into a professional report:

- Executive Summary
- Findings Table (sorted by severity)
- Per-Finding Analysis (impact, remediation, estimated fix effort)
- Sentencing Recommendations

---

## Architecture

```
                    ┌──────────────┐
                    │  Gradio UI   │
                    │  + Export    │
                    └──────┬───────┘
                           │
               ┌───────────▼────────────┐
               │    Pipeline Engine     │
               │  State · Persistence   │
               │  Cancel · Resume       │
               └───────────┬────────────┘
                           │
    ┌──────────┬───────────┼───────────┬──────────┐
    ▼          ▼           ▼           ▼          ▼
┌─────────┐ ┌──────┐ ┌─────────┐ ┌─────────┐ ┌──────┐
│Evidence │ │Code  │ │Invest.  │ │ Trial   │ │Report│
│ Scanner │ │Graph │ │ Agents  │ │ Agents  │ │Agent │
│(GritQL) │ │(AST) │ │+ Tools  │ │         │ │      │
└─────────┘ └──────┘ └─────────┘ └─────────┘ └──────┘
    │          │           │           │          │
    └──────────┴───────────┴───────────┴──────────┘
                           │
               ┌───────────▼────────────┐
               │   Custom Tool Layer    │
               │  FileReader · Pattern  │
               │  CodeGraph · Context   │
               └────────────────────────┘
```

### Key Design Decisions

| Decision                              | Why                                                                                              |
| ------------------------------------- | ------------------------------------------------------------------------------------------------ |
| **Agents have tools, not text dumps** | Agents read files, search patterns, and trace calls on demand, which scales to any codebase size |
| **ReACT loop via LiteLLM**            | Direct function calling with GLM-5; bypasses CrewAI's unreliable tool routing                    |
| **Pipeline state persisted to JSON**  | Runs can resume after crashes. State is queryable                                                |
| **GritQL for evidence**               | AST-level pattern matching, not regex. Language-aware, precise                                   |
| **Custom CrewAI tools (BaseTool)**    | Pydantic-validated inputs, proper error handling, CrewAI-native integration                      |
| **Rate-limit retry with backoff**     | Exponential backoff (4s → 64s) on Z.ai 429 errors, so the pipeline survives API spikes           |

---

## Tech Stack

| Component            | Technology               | Purpose                                              |
| -------------------- | ------------------------ | ---------------------------------------------------- |
| **LLM**              | GLM 5 via Z.ai (LiteLLM) | Agent reasoning and debate                           |
| **Code Scanning**    | GritQL                   | Deterministic AST-level pattern matching             |
| **Multi-Agent**      | CrewAI 1.12              | Agent orchestration, task chaining, context handoffs |
| **Function Calling** | LiteLLM                  | Direct ReACT loop with GLM-5 tool calling            |
| **Code Graph**       | Python `ast` + regex     | Dependency graph (Python + JS)                       |
| **UI**               | Gradio 6                 | Streaming chatbot, file upload, export               |
| **Export**           | fpdf2                    | PDF report generation                                |

---

## Install

```bash
# Clone
git clone https://github.com/amineyagoub/CodeTribunal.git
cd CodeTribunal

# Install dependencies
pip install -e .

# Install GritQL CLI
npm install -g @getgrit/cli

# Configure
cp .env.example .env
# Edit .env: set ZAI_API_KEY (get one at https://open.bigmodel.cn/)
```

### Requirements

- Python 3.11+
- Node.js (for GritQL CLI)
- Z.ai API key ([get one here](https://open.bigmodel.cn/))

---

## Usage

### Web UI (Recommended)

```bash
python3 -m code_tribunal.app
```

Open http://localhost:7860, upload a `.zip` of code, and watch the trial unfold.

### CLI

```bash
# Full trial
code-tribunal ./path/to/codebase

# Evidence only (no LLM, fast)
code-tribunal ./path/to/codebase --evidence-only

# Save results to JSON
code-tribunal ./path/to/codebase --output report.json
```

### Python API

```python
from code_tribunal.config import TribunalConfig
from code_tribunal.courtroom import Courtroom
from code_tribunal.pipeline import Phase

config = TribunalConfig()
courtroom = Courtroom(config)

# Stream pipeline events as each phase progresses
for event in courtroom.run("./path/to/code"):
    print(f"[{event.phase.value}] {event.status}")

# Interactive Q&A
answer = courtroom.ask_question(
    "Why was eval() considered critical?",
    context={"evidence": "...", "verdict": "..."},  # plus any further context entries
)
```

---

## 🔧 Production Features

| Feature                        | Details                                                                                 |
| ------------------------------ | --------------------------------------------------------------------------------------- |
| **4 Custom Tools**             | FileReader, PatternSearch, CodeGraphQuery, FindingContext: agents actively investigate  |
| **8 Specialized Agents**       | 3 investigators, prosecutor, defense, rebuttal, judge, verdict report, expert witness   |
| **ReACT Engine**               | Custom Reason-Act-Observe loop via LiteLLM function calling with GLM-5                  |
| **Code Dependency Graph**      | AST-based (Python + JS), with call-chain tracing and impact analysis                    |
| **Parallel Evidence Scanning** | ThreadPoolExecutor for GritQL patterns, 4x faster than sequential                       |
| **Rate-Limit Resilience**      | Exponential backoff retry on 429 errors, so the pipeline survives API rate limits       |
| **Pipeline Persistence**       | State saved to JSON, runs can resume after interruption                                 |
| **Deduplication**              | Same file+line merged into one finding with multiple categories                         |
| **Zip Safety**                 | Zip-slip attack prevention                                                              |
| **Streaming UI**               | Real-time pipeline progress in Gradio Chatbot with phase indicators                     |
| **Export**                     | Markdown and PDF report generation                                                      |

---

## 🧪 Testing

```bash
# Run evidence scan on test fixtures
code-tribunal tests/fixtures/locale/ --evidence-only

# Run Python tests
pytest tests/
```

Test fixtures in `tests/fixtures/locale/` contain deliberately bad Python and JavaScript code with:

- Hardcoded passwords, API keys, AWS secrets, Stripe keys, JWT secrets
- SQL injection via f-strings and template literals
- `eval()`, `pickle.load()`, `os.system()`, `subprocess.call(shell=True)`
- MD5 hashing
- TODO, FIXME, HACK comments

---

Built for the [Build with GLM 5.1](https://build-with-glm-5-1-challenge.devpost.com) hackathon.