---
title: CodeTribunal
emoji: 💻
colorFrom: pink
colorTo: red
sdk: docker
pinned: false
license: mit
short_description: The AI Courtroom That Exposes Bad Freelance Code
---
# CodeTribunal

**Put Freelance Code on Trial.** Upload code. Get a verdict. Know the risk.

Built with GLM 5.1 + CrewAI + GritQL
## 🚨 The Problem

Clients receive code they don't understand.

- Looks clean… but hides security risks
- Passes linters… but fails in production
- Works… but is architecturally broken

No one answers the only question that matters:

**Is this code safe, professional, and worth paying for?**
## The Solution

CodeTribunal turns code review into a courtroom trial.

Upload a `.zip` → get:

- Forensic evidence (AST-level)
- Multi-agent investigation
- AI courtroom debate
- Final verdict + risk score

Not just analysis: judgment.
## 🧠 Why This Exists

1. **Real System**
   - 6-phase pipeline
   - 8 specialized agents
   - Persistent execution engine
2. **Agents That Actually Act**
   - File reads, pattern search, call tracing
   - Real tool usage via function calling (not fake reasoning)
3. **Deterministic + AI Hybrid**
   - GritQL = ground truth
   - Agents = interpretation + argument
4. **End-to-End Story**
   - From raw code → evidence → debate → verdict → report
## How It Works

CodeTribunal runs a 6-phase pipeline, each phase building on the last.

### Phase 1: Forensic Evidence (Deterministic, No LLM)

GritQL scans the entire codebase with 17 forensic patterns across security and quality domains:
| Domain | Patterns | Examples |
|---|---|---|
| 🔴 Security | 13 | Hardcoded secrets, `eval()`, SQL injection, `pickle.load()`, `os.system()`, weak hashing |
| 🟡 Quality | 4 | TODO, FIXME, HACK comments |

All scanning is read-only (`--dry-run`) and runs in parallel across patterns.
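The parallel, read-only scan can be sketched in plain Python. This is an illustrative stand-in: the real pipeline shells out to the `grit` CLI per pattern, while here a few regexes play the role of the 17 GritQL patterns, one worker per pattern.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical regex stand-ins for a few of the forensic patterns;
# the real pipeline runs the `grit` CLI with --dry-run instead.
PATTERNS = {
    "eval_call": re.compile(r"\beval\s*\("),
    "os_system": re.compile(r"\bos\.system\s*\("),
    "todo_comment": re.compile(r"#\s*(TODO|FIXME|HACK)"),
}

def scan_pattern(name, pattern, source, path):
    """Run one pattern over one file's text; read-only, no mutation."""
    return [
        {"pattern": name, "file": path, "line": i}
        for i, line in enumerate(source.splitlines(), start=1)
        if pattern.search(line)
    ]

def scan_parallel(source, path="app.py"):
    # One worker per pattern, mirroring the pipeline's parallel scan.
    with ThreadPoolExecutor(max_workers=len(PATTERNS)) as pool:
        futures = [
            pool.submit(scan_pattern, name, pat, source, path)
            for name, pat in PATTERNS.items()
        ]
        return [f for fut in futures for f in fut.result()]

code = "import os\nos.system(cmd)  # TODO: sanitize\nresult = eval(expr)\n"
print(scan_parallel(code))
```

Because each pattern scan is independent and read-only, fanning them out over a thread pool is safe and keeps total scan time close to the slowest single pattern.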
### Phase 2: Code Dependency Graph (AST, No LLM)

Python's `ast` module and regex-based JS parsing build a lightweight dependency graph:
- Nodes: files, functions, classes, imports
- Edges: calls, imports, containment, inheritance
- Enables call-chain tracing: `eval()` → `handle_request()` → `app.route()`
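A minimal sketch of how Python's `ast` module can yield such call-chain edges. The node/edge schema here is illustrative, not CodeTribunal's actual graph:

```python
import ast

# Walk a module's AST, collect "function -> names it calls" edges,
# then trace who calls a given target, one hop at a time.
source = """
def handle_request(data):
    return eval(data)

def route(data):
    return handle_request(data)
"""

def build_call_edges(tree):
    edges = {}  # caller function name -> set of called names
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            edges[node.name] = {
                n.func.id
                for n in ast.walk(node)
                if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
            }
    return edges

def callers_of(edges, target):
    """Who calls `target`? Repeating this walks a chain like
    eval() -> handle_request() -> route()."""
    return [fn for fn, calls in edges.items() if target in calls]

edges = build_call_edges(ast.parse(source))
print(callers_of(edges, "eval"))            # functions that call eval
print(callers_of(edges, "handle_request"))  # one hop further up
```

Repeatedly asking "who calls this?" is exactly the impact-analysis question the graph exists to answer: a dangerous sink like `eval()` matters far more when it is reachable from a request handler.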
### Phase 3: Investigation (3 ReACT Agents + 4 Tools)

Three specialist investigators, each running a genuine ReACT loop (Reason → Act → Observe → Repeat) using Z.ai's native function calling via LiteLLM:
| Agent | Tools | Purpose |
|---|---|---|
| 🛡️ Security Investigator | FileReader, PatternSearch, CodeGraphQuery, FindingContext | Find vulnerabilities, trace attack vectors |
| 🔍 Quality Investigator | FileReader, FindingContext | Assess technical debt, detect negligence |
| 🏛️ Architecture Investigator | FileReader, CodeGraphQuery | Analyze structure, trace dependencies |
Each agent autonomously decides which tools to call, observes the results, and iterates. For example, the Security Investigator might:
- Call `file_reader` to read a flagged file
- Observe hardcoded secrets on specific lines
- Call `code_graph_query` to trace where those secrets are used
- Produce a detailed report with file paths, line numbers, and severity ratings
Verified working: GLM-5 + LiteLLM function calling confirmed. Agents make real tool calls that execute real code analysis.
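The loop's skeleton can be shown with the model stubbed out. In the real engine, tool schemas go to GLM-5 through LiteLLM and the model's returned tool calls are executed; everything below (tool names, message shapes, the fake model) is illustrative only:

```python
# Reason -> Act -> Observe loop with a stubbed "model" and fake tools.
TOOLS = {
    "file_reader": lambda path: f"contents of {path} (stub)",
    "pattern_search": lambda pattern: f"3 matches for {pattern} (stub)",
}

def fake_model(messages):
    """Stand-in for the LLM: request one tool call, then finish."""
    if len(messages) == 1:
        return {"tool": "file_reader", "args": {"path": "config.py"}}
    return {"final": "Found hardcoded secret in config.py"}

def react_loop(task, model, tools, max_steps=5):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = model(messages)          # Reason: model picks next step
        if "final" in action:             # model decided it is done
            return action["final"]
        observation = tools[action["tool"]](**action["args"])  # Act
        messages.append({"role": "tool", "content": observation})  # Observe
    return "max steps reached"

print(react_loop("Audit config.py for secrets", fake_model, TOOLS))
```

The key property is that the model, not the harness, decides which tool runs next; the harness only executes the call and feeds the observation back.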
### Phase 4: The Trial (3 Agents)

A courtroom debate between AI agents:

- **The Prosecutor**: builds the case for negligence, cites specific evidence
- **The Defense Attorney**: challenges claims, argues context and proportionality
- **Rebuttal**: the prosecutor responds to the defense

Agents use CrewAI's `context` parameter to chain arguments: prosecution output feeds into defense context, and both feed into the rebuttal.
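Stripped of the CrewAI API, the chaining is just each stage consuming the prior stages' outputs. A plain-Python data-flow sketch (all strings hypothetical):

```python
# Each "agent" is reduced to a function of the context it receives.
def prosecutor(evidence):
    return f"Prosecution: negligence, citing {evidence}"

def defense(evidence, prosecution):
    return f"Defense: context matters, rebutting '{prosecution}'"

def rebuttal(prosecution, defense_argument):
    return f"Rebuttal: prosecution answers '{defense_argument}'"

evidence = "hardcoded API key at config.py:12"
p = prosecutor(evidence)
d = defense(evidence, p)   # defense sees the prosecution's full case
r = rebuttal(p, d)         # rebuttal sees both sides
transcript = [p, d, r]
```

In CrewAI the same wiring is declarative: a task's `context` lists the prior tasks whose outputs should be injected into its prompt.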
### Phase 5: The Verdict

👨‍⚖️ The Judge reviews all evidence, investigation reports, and the full trial transcript, then delivers:
- Overall ruling: GUILTY / MIXED / NOT GUILTY
- Reputational Risk Score (0-100)
- Findings summary with severity rankings
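One plausible way to aggregate findings into a 0-100 score is a capped severity-weighted sum. The weights below are hypothetical, not CodeTribunal's actual formula:

```python
# Hypothetical severity weights for illustration only.
SEVERITY_WEIGHTS = {"critical": 25, "high": 10, "medium": 4, "low": 1}

def risk_score(findings):
    """Sum severity weights across findings, capped at 100."""
    raw = sum(SEVERITY_WEIGHTS[f["severity"]] for f in findings)
    return min(raw, 100)

findings = [
    {"id": "hardcoded-secret", "severity": "critical"},
    {"id": "sql-injection", "severity": "critical"},
    {"id": "todo-comment", "severity": "low"},
]
print(risk_score(findings))  # 51
```

Capping keeps the score interpretable: two critical findings already signal serious risk, and piling on minor findings cannot push a codebase past the maximum.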
### Phase 6: Structured Report

📋 The Verdict Report Agent compiles everything into a professional report:
- Executive Summary
- Findings Table (sorted by severity)
- Per-Finding Analysis (impact, remediation, estimated fix effort)
- Sentencing Recommendations
## Architecture

```
                  ┌──────────────┐
                  │  Gradio UI   │
                  │   + Export   │
                  └──────┬───────┘
                         │
             ┌───────────┴───────────┐
             │    Pipeline Engine    │
             │  State · Persistence  │
             │    Cancel · Resume    │
             └───────────┬───────────┘
                         │
    ┌──────────┬─────────┼──────────┬──────────┐
    ▼          ▼         ▼          ▼          ▼
┌─────────┐ ┌──────┐ ┌─────────┐ ┌─────────┐ ┌──────┐
│Evidence │ │Code  │ │Invest.  │ │ Trial   │ │Report│
│ Scanner │ │Graph │ │ Agents  │ │ Agents  │ │Agent │
│(GritQL) │ │(AST) │ │+ Tools  │ │         │ │      │
└────┬────┘ └──┬───┘ └────┬────┘ └────┬────┘ └──┬───┘
     │         │          │           │         │
     └─────────┴──────────┴───────────┴─────────┘
                         │
             ┌───────────┴───────────┐
             │   Custom Tool Layer   │
             │  FileReader · Pattern │
             │  CodeGraph · Context  │
             └───────────────────────┘
```
## Key Design Decisions

| Decision | Why |
|---|---|
| Agents have tools, not text dumps | Agents read files, search patterns, and trace calls on demand; scales to any codebase size |
| ReACT loop via LiteLLM | Direct function calling with GLM-5 bypasses CrewAI's unreliable tool routing |
| Pipeline state persisted to JSON | Runs can resume after crashes; state is queryable |
| GritQL for evidence | AST-level pattern matching, not regex; language-aware and precise |
| Custom CrewAI tools (BaseTool) | Pydantic-validated inputs, proper error handling, CrewAI-native integration |
| Rate-limit retry with backoff | Exponential backoff (4s → 64s) on Z.ai 429 errors; pipeline survives API spikes |
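The retry decision can be sketched as a small wrapper. The exception type and the injectable `sleep` are illustrative; only the 4s-doubling-to-64s schedule comes from the table above:

```python
import time

class RateLimitError(Exception):
    """Stand-in for an HTTP 429 from the LLM provider."""

def with_backoff(call, base=4, factor=2, max_delay=64, sleep=time.sleep):
    """Retry `call` on rate limits, doubling the delay: 4s, 8s, ... 64s.
    Once the delay would exceed the cap, the error propagates."""
    delay = base
    while True:
        try:
            return call()
        except RateLimitError:
            if delay > max_delay:
                raise  # give up after the final 64s wait
            sleep(delay)
            delay *= factor
```

Passing `sleep` as a parameter keeps the wrapper testable: a test can record the delays instead of actually waiting.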
## Tech Stack

| Component | Technology | Purpose |
|---|---|---|
| LLM | GLM 5 via Z.ai (LiteLLM) | Agent reasoning and debate |
| Code Scanning | GritQL | Deterministic AST-level pattern matching |
| Multi-Agent | CrewAI 1.12 | Agent orchestration, task chaining, context handoffs |
| Function Calling | LiteLLM | Direct ReACT loop with GLM-5 tool calling |
| Code Graph | Python `ast` + regex | Dependency graph (Python + JS) |
| UI | Gradio 6 | Streaming chatbot, file upload, export |
| Export | fpdf2 | PDF report generation |
## Install

```bash
# Clone
git clone https://github.com/amineyagoub/CodeTribunal.git
cd CodeTribunal

# Install dependencies
pip install -e .

# Install GritQL CLI
npm install -g @getgrit/cli

# Configure
cp .env.example .env
# Edit .env: set ZAI_API_KEY (get one at https://open.bigmodel.cn/)
```
### Requirements

- Python 3.11+
- Node.js (for GritQL CLI)
- Z.ai API key (get one at https://open.bigmodel.cn/)
## Usage

### Web UI (Recommended)

```bash
python3 -m code_tribunal.app
```

Open http://localhost:7860, upload a `.zip` of code, and watch the trial unfold.
### CLI

```bash
# Full trial
code-tribunal ./path/to/codebase

# Evidence only (no LLM, fast)
code-tribunal ./path/to/codebase --evidence-only

# Save results to JSON
code-tribunal ./path/to/codebase --output report.json
```
### Python API

```python
from code_tribunal.config import TribunalConfig
from code_tribunal.courtroom import Courtroom

config = TribunalConfig()
courtroom = Courtroom(config)

for event in courtroom.run("./path/to/code"):
    print(f"[{event.phase.value}] {event.status}")

# Interactive Q&A
answer = courtroom.ask_question(
    "Why was eval() considered critical?",
    context={"evidence": "...", "verdict": "...", ...},
)
```
## 🔧 Production Features
| Feature | Details |
|---|---|
| 4 Custom Tools | FileReader, PatternSearch, CodeGraphQuery, FindingContext; agents actively investigate |
| 8 Specialized Agents | 3 investigators, prosecutor, defense, rebuttal, judge, verdict report, expert witness |
| ReACT Engine | Custom Reason-Act-Observe loop via LiteLLM function calling with GLM-5 |
| Code Dependency Graph | AST-based (Python + JS), with call-chain tracing and impact analysis |
| Parallel Evidence Scanning | ThreadPoolExecutor for GritQL patterns; 4x faster than sequential |
| Rate-Limit Resilience | Exponential backoff retry on 429 errors; survives API rate limits |
| Pipeline Persistence | State saved to JSON, runs can resume after interruption |
| Deduplication | Same file+line merged into one finding with multiple categories |
| Zip Safety | Zip-slip attack prevention |
| Streaming UI | Real-time pipeline progress in Gradio Chatbot with phase indicators |
| Export | Markdown and PDF report generation |
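The zip-slip prevention in the table amounts to resolving each archive member's path and refusing anything that escapes the extraction directory. A minimal sketch, with a hypothetical helper name (not CodeTribunal's actual function):

```python
import os
import zipfile

def safe_extract(zip_path, dest):
    """Extract a zip, rejecting members like '../../etc/passwd'
    whose resolved path would land outside `dest` (zip-slip)."""
    dest = os.path.realpath(dest)
    with zipfile.ZipFile(zip_path) as zf:
        for member in zf.namelist():
            target = os.path.realpath(os.path.join(dest, member))
            if not target.startswith(dest + os.sep):
                raise ValueError(f"zip-slip attempt blocked: {member}")
        zf.extractall(dest)
```

Validating every member before extracting anything means a single malicious entry aborts the whole upload rather than leaving a half-extracted archive behind.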
## 🧪 Testing

```bash
# Run evidence scan on test fixtures
code-tribunal tests/fixtures/locale/ --evidence-only

# Run Python tests
pytest tests/
```

Test fixtures in `tests/fixtures/locale/` contain deliberately bad Python and JavaScript code with:

- Hardcoded passwords, API keys, AWS secrets, Stripe keys, JWT secrets
- SQL injection via f-strings and template literals
- `eval()`, `pickle.load()`, `os.system()`, `subprocess.call(shell=True)`
- MD5 hashing
- TODO, FIXME, HACK comments
Built for the Build with GLM 5.1 hackathon.