Spaces:
Running
Running
| title: CodeTribunal | |
| emoji: π» | |
| colorFrom: pink | |
| colorTo: red | |
| sdk: docker | |
| pinned: false | |
| license: mit | |
| short_description: The AI Courtroom That Exposes Bad Freelance Code | |
| <div align="center"> | |
| # CodeTribunal | |
| ### Put Freelance Code on Trial. | |
| **Upload code. Get a verdict. Know the risk.** | |
| Built with **GLM 5.1 + CrewAI + GritQL** | |
| [](https://github.com/amineyagoub/CodeTribunal/actions/workflows/tests.yml) | |
| [](https://www.python.org/downloads/) | |
| [](https://opensource.org/licenses/MIT) | |
| [](https://build-with-glm-5-1-challenge.devpost.com) | |
| </div> | |
| --- | |
| ## π¨ The Problem | |
| Clients receive code they donβt understand. | |
| - Looks clean⦠but hides security risks | |
| - Passes linters⦠but fails in production | |
| - Works⦠but is architecturally broken | |
| **No one answers the only question that matters:** | |
| > _Is this code safe, professional, and worth paying for?_ | |
| --- | |
| ## The Solution | |
| **CodeTribunal turns code review into a courtroom trial.** | |
| Upload a `.zip` β get: | |
| - Forensic evidence (AST-level) | |
| - Multi-agent investigation | |
| - AI courtroom debate | |
| - Final verdict + risk score | |
| > Not just analysis β **judgment**. | |
| --- | |
| ## π§ Why This Exist | |
| ### 1. Real System | |
| - 6-phase pipeline | |
| - 8 specialized agents | |
| - Persistent execution engine | |
| ### 2. Agents That Actually Act | |
| - File reads, pattern search, call tracing | |
| - Real tool usage via function calling (not fake reasoning) | |
| ### 3. Deterministic + AI Hybrid | |
| - **GritQL = ground truth** | |
| - **Agents = interpretation + argument** | |
| ### 4. End-to-End Story | |
| From raw code β evidence β debate β verdict β report | |
| ## How It Works | |
| CodeTribunal runs a **6-phase pipeline**, each building on the last: | |
| ### Phase 1: Forensic Evidence (Deterministic β No LLM) | |
| GritQL scans the entire codebase with **17 forensic patterns** across security and quality domains: | |
| | Domain | Patterns | Examples | | |
| | ----------- | -------- | ---------------------------------------------------------------------------------------- | | |
| | π΄ Security | 13 | Hardcoded secrets, `eval()`, SQL injection, `pickle.load()`, `os.system()`, weak hashing | | |
| | π‘ Quality | 4 | `TODO`, `FIXME`, `HACK` comments | | |
| All scanning is **read-only** (`--dry-run`) and runs in **parallel** across patterns. | |
| ### Phase 2: Code Dependency Graph (AST β No LLM) | |
| Python's `ast` module and regex-based JS parsing build a **lightweight dependency graph**: | |
| - Nodes: files, functions, classes, imports | |
| - Edges: calls, imports, containment, inheritance | |
| - Enables call-chain tracing: `eval() β handle_request() β app.route()` | |
| ### Phase 3: Investigation (3 ReACT Agents + 4 Tools) | |
| Three specialist investigators, each running a **genuine ReACT loop** (Reason β Act β Observe β Repeat) using **Z.ai's native function calling** via LiteLLM: | |
| | Agent | Tools | Purpose | | |
| | ---------------------------- | --------------------------------------------------------- | ------------------------------------------ | | |
| | π‘οΈ Security Investigator | FileReader, PatternSearch, CodeGraphQuery, FindingContext | Find vulnerabilities, trace attack vectors | | |
| | π Quality Investigator | FileReader, FindingContext | Assess technical debt, detect negligence | | |
| | ποΈ Architecture Investigator | FileReader, CodeGraphQuery | Analyze structure, trace dependencies | | |
| Each agent **autonomously decides which tools to call**, observes the results, and iterates. For example, the Security Investigator might: | |
| 1. Call `file_reader` to read a flagged file | |
| 2. Observe hardcoded secrets on specific lines | |
| 3. Call `code_graph_query` to trace where those secrets are used | |
| 4. Produce a detailed report with file paths, line numbers, and severity ratings | |
| **Verified working**: GLM-5 + LiteLLM function calling confirmed. Agents make real tool calls that execute real code analysis. | |
| ### Phase 4: The Trial (3 Agents) | |
| A courtroom debate between AI agents: | |
| 1. ** The Prosecutor** β builds the case for negligence, cites specific evidence | |
| 2. ** The Defense Attorney** β challenges claims, argues context and proportionality | |
| 3. ** Rebuttal** β the prosecutor responds to the defense | |
| Agents use CrewAI's `context` parameter to chain arguments: prosecution output feeds into defense context, both feed into rebuttal. | |
| ### Phase 5: The Verdict | |
| **π¨ The Judge** reviews all evidence, investigation reports, and the full trial transcript. Delivers: | |
| - Overall ruling: GUILTY / MIXED / NOT GUILTY | |
| - Reputational Risk Score (0-100) | |
| - Findings summary with severity rankings | |
| ### Phase 6: Structured Report | |
| **π Verdict Report Agent** compiles everything into a professional report: | |
| - Executive Summary | |
| - Findings Table (sorted by severity) | |
| - Per-Finding Analysis (impact, remediation, estimated fix effort) | |
| - Sentencing Recommendations | |
| --- | |
| ## Architecture | |
| ``` | |
| ββββββββββββββββ | |
| β Gradio UI β | |
| β + Export β | |
| ββββββββ¬ββββββββ | |
| β | |
| βββββββββββββΌβββββββββββββ | |
| β Pipeline Engine β | |
| β State Β· Persistence β | |
| β Cancel Β· Resume β | |
| βββββββββββββ¬βββββββββββββ | |
| β | |
| ββββββββββββ¬ββββββββββββΌββββββββββββ¬βββββββββββ | |
| βΌ βΌ βΌ βΌ βΌ | |
| βββββββββββ ββββββββ βββββββββββ βββββββββββ ββββββββ | |
| βEvidence β βCode β βInvest. β β Trial β βReportβ | |
| β Scanner β βGraph β β Agents β β Agents β βAgent β | |
| β(GritQL) β β(AST) β β+ Tools β β β β β | |
| βββββββββββ ββββββββ βββββββββββ βββββββββββ ββββββββ | |
| β β β β β | |
| ββββββββββββ΄ββββββββββββ΄ββββββββββββ΄βββββββββββ | |
| β | |
| βββββββββββββΌβββββββββββββ | |
| β Custom Tool Layer β | |
| β FileReader Β· Pattern β | |
| β CodeGraph Β· Context β | |
| ββββββββββββββββββββββββββ | |
| ``` | |
| ### Key Design Decisions | |
| | Decision | Why | | |
| | ------------------------------------- | ------------------------------------------------------------------------------------------- | | |
| | **Agents have tools, not text dumps** | Agents read files, search patterns, and trace calls on demand β scales to any codebase size | | |
| | **ReACT loop via LiteLLM** | Direct function calling with GLM-5 β bypasses CrewAI's unreliable tool routing | | |
| | **Pipeline state persisted to JSON** | Runs can resume after crashes. State is queryable | | |
| | **GritQL for evidence** | AST-level pattern matching, not regex. Language-aware, precise | | |
| | **Custom CrewAI tools (BaseTool)** | Pydantic-validated inputs, proper error handling, CrewAI-native integration | | |
| | **Rate-limit retry with backoff** | Exponential backoff (4s β 64s) on Z.ai 429 errors β pipeline survives API spikes | | |
| --- | |
| ## Tech Stack | |
| | Component | Technology | Purpose | | |
| | -------------------- | ------------------------ | ---------------------------------------------------- | | |
| | **LLM** | GLM 5 via Z.ai (LiteLLM) | Agent reasoning and debate | | |
| | **Code Scanning** | GritQL | Deterministic AST-level pattern matching | | |
| | **Multi-Agent** | CrewAI 1.12 | Agent orchestration, task chaining, context handoffs | | |
| | **Function Calling** | LiteLLM | Direct ReACT loop with GLM-5 tool calling | | |
| | **Code Graph** | Python `ast` + regex | Dependency graph (Python + JS) | | |
| | **UI** | Gradio 6 | Streaming chatbot, file upload, export | | |
| | **Export** | fpdf2 | PDF report generation | | |
| --- | |
| ## Install | |
| ```bash | |
| # Clone | |
| git clone https://github.com/amineyagoub/CodeTribunal.git | |
| cd CodeTribunal | |
| # Install dependencies | |
| pip install -e . | |
| # Install GritQL CLI | |
| npm install -g @getgrit/cli | |
| # Configure | |
| cp .env.example .env | |
| # Edit .env: set ZAI_API_KEY (get one at https://open.bigmodel.cn/) | |
| ``` | |
| ### Requirements | |
| - Python 3.11+ | |
| - Node.js (for GritQL CLI) | |
| - Z.ai API key ([get one here](https://open.bigmodel.cn/)) | |
| --- | |
| ## Usage | |
| ### Web UI (Recommended) | |
| ```bash | |
| python3 -m code_tribunal.app | |
| ``` | |
| Open http://localhost:7860, upload a `.zip` of code, and watch the trial unfold. | |
| ### CLI | |
| ```bash | |
| # Full trial | |
| code-tribunal ./path/to/codebase | |
| # Evidence only (no LLM, fast) | |
| code-tribunal ./path/to/codebase --evidence-only | |
| # Save results to JSON | |
| code-tribunal ./path/to/codebase --output report.json | |
| ``` | |
| ### Python API | |
| ```python | |
| from code_tribunal.config import TribunalConfig | |
| from code_tribunal.courtroom import Courtroom | |
| from code_tribunal.pipeline import Phase | |
| config = TribunalConfig() | |
| courtroom = Courtroom(config) | |
| for event in courtroom.run("./path/to/code"): | |
| print(f"[{event.phase.value}] {event.status}") | |
| # Interactive Q&A | |
| answer = courtroom.ask_question( | |
| "Why was eval() considered critical?", | |
| context={"evidence": "...", "verdict": "...", ...} | |
| ) | |
| ``` | |
| --- | |
| ## π§ Production Features | |
| | Feature | Details | | |
| | ------------------------------ | --------------------------------------------------------------------------------------- | | |
| | **4 Custom Tools** | FileReader, PatternSearch, CodeGraphQuery, FindingContext β agents actively investigate | | |
| | **8 Specialized Agents** | 3 investigators, prosecutor, defense, rebuttal, judge, verdict report, expert witness | | |
| | **ReACT Engine** | Custom Reason-Act-Observe loop via LiteLLM function calling with GLM-5 | | |
| | **Code Dependency Graph** | AST-based (Python + JS), with call-chain tracing and impact analysis | | |
| | **Parallel Evidence Scanning** | ThreadPoolExecutor for GritQL patterns β 4x faster than sequential | | |
| | **Rate-Limit Resilience** | Exponential backoff retry on 429 errors β survives API rate limits | | |
| | **Pipeline Persistence** | State saved to JSON, runs can resume after interruption | | |
| | **Deduplication** | Same file+line merged into one finding with multiple categories | | |
| | **Zip Safety** | Zip-slip attack prevention | | |
| | **Streaming UI** | Real-time pipeline progress in Gradio Chatbot with phase indicators | | |
| | **Export** | Markdown and PDF report generation | | |
| --- | |
| ## π§ͺ Testing | |
| ```bash | |
| # Run evidence scan on test fixtures | |
| code-tribunal tests/fixtures/locale/ --evidence-only | |
| # Run Python tests | |
| pytest tests/ | |
| ``` | |
| Test fixtures in `tests/fixtures/locale/` contain deliberately bad Python and JavaScript code with: | |
| - Hardcoded passwords, API keys, AWS secrets, Stripe keys, JWT secrets | |
| - SQL injection via f-strings and template literals | |
| - `eval()`, `pickle.load()`, `os.system()`, `subprocess.call(shell=True)` | |
| - MD5 hashing | |
| - TODO, FIXME, HACK comments | |
| --- | |
| --- | |
| <div align="center"> | |
| Built for the [Build with GLM 5.1](https://build-with-glm-5-1-challenge.devpost.com) hackathon. | |
| > > > > > > > b4fcdee (feat: Add initial CodeTribunal implementation) | |
| </div> | |