CodeTribunal / README.md
amine-yagoub's picture
fix: resolve merge conflicts in README.md for HF Spaces
3a09553
---
title: CodeTribunal
emoji: πŸ’»
colorFrom: pink
colorTo: red
sdk: docker
pinned: false
license: mit
short_description: The AI Courtroom That Exposes Bad Freelance Code
---
<div align="center">
# CodeTribunal
### Put Freelance Code on Trial.
**Upload code. Get a verdict. Know the risk.**
Built with **GLM 5.1 + CrewAI + GritQL**
[![Tests](https://github.com/amineyagoub/CodeTribunal/actions/workflows/tests.yml/badge.svg)](https://github.com/amineyagoub/CodeTribunal/actions/workflows/tests.yml)
[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Built for GLM 5.1 Hackathon](https://img.shields.io/badge/Built%20for-GLM%205.1-ff69b4)](https://build-with-glm-5-1-challenge.devpost.com)
</div>
---
## 🚨 The Problem
Clients receive code they don’t understand.
- Looks clean… but hides security risks
- Passes linters… but fails in production
- Works… but is architecturally broken
**No one answers the only question that matters:**
> _Is this code safe, professional, and worth paying for?_
---
## The Solution
**CodeTribunal turns code review into a courtroom trial.**
Upload a `.zip` β†’ get:
- Forensic evidence (AST-level)
- Multi-agent investigation
- AI courtroom debate
- Final verdict + risk score
> Not just analysis β€” **judgment**.
---
## 🧠 Why This Exist
### 1. Real System
- 6-phase pipeline
- 8 specialized agents
- Persistent execution engine
### 2. Agents That Actually Act
- File reads, pattern search, call tracing
- Real tool usage via function calling (not fake reasoning)
### 3. Deterministic + AI Hybrid
- **GritQL = ground truth**
- **Agents = interpretation + argument**
### 4. End-to-End Story
From raw code β†’ evidence β†’ debate β†’ verdict β†’ report
## How It Works
CodeTribunal runs a **6-phase pipeline**, each building on the last:
### Phase 1: Forensic Evidence (Deterministic β€” No LLM)
GritQL scans the entire codebase with **17 forensic patterns** across security and quality domains:
| Domain | Patterns | Examples |
| ----------- | -------- | ---------------------------------------------------------------------------------------- |
| πŸ”΄ Security | 13 | Hardcoded secrets, `eval()`, SQL injection, `pickle.load()`, `os.system()`, weak hashing |
| 🟑 Quality | 4 | `TODO`, `FIXME`, `HACK` comments |
All scanning is **read-only** (`--dry-run`) and runs in **parallel** across patterns.
### Phase 2: Code Dependency Graph (AST β€” No LLM)
Python's `ast` module and regex-based JS parsing build a **lightweight dependency graph**:
- Nodes: files, functions, classes, imports
- Edges: calls, imports, containment, inheritance
- Enables call-chain tracing: `eval() β†’ handle_request() β†’ app.route()`
### Phase 3: Investigation (3 ReACT Agents + 4 Tools)
Three specialist investigators, each running a **genuine ReACT loop** (Reason β†’ Act β†’ Observe β†’ Repeat) using **Z.ai's native function calling** via LiteLLM:
| Agent | Tools | Purpose |
| ---------------------------- | --------------------------------------------------------- | ------------------------------------------ |
| πŸ›‘οΈ Security Investigator | FileReader, PatternSearch, CodeGraphQuery, FindingContext | Find vulnerabilities, trace attack vectors |
| πŸ“‹ Quality Investigator | FileReader, FindingContext | Assess technical debt, detect negligence |
| πŸ—οΈ Architecture Investigator | FileReader, CodeGraphQuery | Analyze structure, trace dependencies |
Each agent **autonomously decides which tools to call**, observes the results, and iterates. For example, the Security Investigator might:
1. Call `file_reader` to read a flagged file
2. Observe hardcoded secrets on specific lines
3. Call `code_graph_query` to trace where those secrets are used
4. Produce a detailed report with file paths, line numbers, and severity ratings
**Verified working**: GLM-5 + LiteLLM function calling confirmed. Agents make real tool calls that execute real code analysis.
### Phase 4: The Trial (3 Agents)
A courtroom debate between AI agents:
1. ** The Prosecutor** β€” builds the case for negligence, cites specific evidence
2. ** The Defense Attorney** β€” challenges claims, argues context and proportionality
3. ** Rebuttal** β€” the prosecutor responds to the defense
Agents use CrewAI's `context` parameter to chain arguments: prosecution output feeds into defense context, both feed into rebuttal.
### Phase 5: The Verdict
**πŸ”¨ The Judge** reviews all evidence, investigation reports, and the full trial transcript. Delivers:
- Overall ruling: GUILTY / MIXED / NOT GUILTY
- Reputational Risk Score (0-100)
- Findings summary with severity rankings
### Phase 6: Structured Report
**πŸ“ Verdict Report Agent** compiles everything into a professional report:
- Executive Summary
- Findings Table (sorted by severity)
- Per-Finding Analysis (impact, remediation, estimated fix effort)
- Sentencing Recommendations
---
## Architecture
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Gradio UI β”‚
β”‚ + Export β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Pipeline Engine β”‚
β”‚ State Β· Persistence β”‚
β”‚ Cancel Β· Resume β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β–Ό β–Ό β–Ό β–Ό β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”
β”‚Evidence β”‚ β”‚Code β”‚ β”‚Invest. β”‚ β”‚ Trial β”‚ β”‚Reportβ”‚
β”‚ Scanner β”‚ β”‚Graph β”‚ β”‚ Agents β”‚ β”‚ Agents β”‚ β”‚Agent β”‚
β”‚(GritQL) β”‚ β”‚(AST) β”‚ β”‚+ Tools β”‚ β”‚ β”‚ β”‚ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”˜
β”‚ β”‚ β”‚ β”‚ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Custom Tool Layer β”‚
β”‚ FileReader Β· Pattern β”‚
β”‚ CodeGraph Β· Context β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
### Key Design Decisions
| Decision | Why |
| ------------------------------------- | ------------------------------------------------------------------------------------------- |
| **Agents have tools, not text dumps** | Agents read files, search patterns, and trace calls on demand β€” scales to any codebase size |
| **ReACT loop via LiteLLM** | Direct function calling with GLM-5 β€” bypasses CrewAI's unreliable tool routing |
| **Pipeline state persisted to JSON** | Runs can resume after crashes. State is queryable |
| **GritQL for evidence** | AST-level pattern matching, not regex. Language-aware, precise |
| **Custom CrewAI tools (BaseTool)** | Pydantic-validated inputs, proper error handling, CrewAI-native integration |
| **Rate-limit retry with backoff** | Exponential backoff (4s β†’ 64s) on Z.ai 429 errors β€” pipeline survives API spikes |
---
## Tech Stack
| Component | Technology | Purpose |
| -------------------- | ------------------------ | ---------------------------------------------------- |
| **LLM** | GLM 5 via Z.ai (LiteLLM) | Agent reasoning and debate |
| **Code Scanning** | GritQL | Deterministic AST-level pattern matching |
| **Multi-Agent** | CrewAI 1.12 | Agent orchestration, task chaining, context handoffs |
| **Function Calling** | LiteLLM | Direct ReACT loop with GLM-5 tool calling |
| **Code Graph** | Python `ast` + regex | Dependency graph (Python + JS) |
| **UI** | Gradio 6 | Streaming chatbot, file upload, export |
| **Export** | fpdf2 | PDF report generation |
---
## Install
```bash
# Clone
git clone https://github.com/amineyagoub/CodeTribunal.git
cd CodeTribunal
# Install dependencies
pip install -e .
# Install GritQL CLI
npm install -g @getgrit/cli
# Configure
cp .env.example .env
# Edit .env: set ZAI_API_KEY (get one at https://open.bigmodel.cn/)
```
### Requirements
- Python 3.11+
- Node.js (for GritQL CLI)
- Z.ai API key ([get one here](https://open.bigmodel.cn/))
---
## Usage
### Web UI (Recommended)
```bash
python3 -m code_tribunal.app
```
Open http://localhost:7860, upload a `.zip` of code, and watch the trial unfold.
### CLI
```bash
# Full trial
code-tribunal ./path/to/codebase
# Evidence only (no LLM, fast)
code-tribunal ./path/to/codebase --evidence-only
# Save results to JSON
code-tribunal ./path/to/codebase --output report.json
```
### Python API
```python
from code_tribunal.config import TribunalConfig
from code_tribunal.courtroom import Courtroom
from code_tribunal.pipeline import Phase
config = TribunalConfig()
courtroom = Courtroom(config)
for event in courtroom.run("./path/to/code"):
print(f"[{event.phase.value}] {event.status}")
# Interactive Q&A
answer = courtroom.ask_question(
"Why was eval() considered critical?",
context={"evidence": "...", "verdict": "...", ...}
)
```
---
## πŸ”§ Production Features
| Feature | Details |
| ------------------------------ | --------------------------------------------------------------------------------------- |
| **4 Custom Tools** | FileReader, PatternSearch, CodeGraphQuery, FindingContext β€” agents actively investigate |
| **8 Specialized Agents** | 3 investigators, prosecutor, defense, rebuttal, judge, verdict report, expert witness |
| **ReACT Engine** | Custom Reason-Act-Observe loop via LiteLLM function calling with GLM-5 |
| **Code Dependency Graph** | AST-based (Python + JS), with call-chain tracing and impact analysis |
| **Parallel Evidence Scanning** | ThreadPoolExecutor for GritQL patterns β€” 4x faster than sequential |
| **Rate-Limit Resilience** | Exponential backoff retry on 429 errors β€” survives API rate limits |
| **Pipeline Persistence** | State saved to JSON, runs can resume after interruption |
| **Deduplication** | Same file+line merged into one finding with multiple categories |
| **Zip Safety** | Zip-slip attack prevention |
| **Streaming UI** | Real-time pipeline progress in Gradio Chatbot with phase indicators |
| **Export** | Markdown and PDF report generation |
---
## πŸ§ͺ Testing
```bash
# Run evidence scan on test fixtures
code-tribunal tests/fixtures/locale/ --evidence-only
# Run Python tests
pytest tests/
```
Test fixtures in `tests/fixtures/locale/` contain deliberately bad Python and JavaScript code with:
- Hardcoded passwords, API keys, AWS secrets, Stripe keys, JWT secrets
- SQL injection via f-strings and template literals
- `eval()`, `pickle.load()`, `os.system()`, `subprocess.call(shell=True)`
- MD5 hashing
- TODO, FIXME, HACK comments
---
---
<div align="center">
Built for the [Build with GLM 5.1](https://build-with-glm-5-1-challenge.devpost.com) hackathon.
> > > > > > > b4fcdee (feat: Add initial CodeTribunal implementation)
</div>