Spaces:

amine-yagoub
/

CodeTribunal

Sleeping

App Files Files Community

CodeTribunal / README_HF.md

amine-yagoub

docs: enhance README with frontmatter and add Hugging Face version

c30312e about 2 months ago

preview code

raw

history blame contribute delete

12.5 kB

metadata

title: CodeTribunal
emoji: 💻
colorFrom: pink
colorTo: red
sdk: docker
pinned: false
license: mit
short_description: The AI Courtroom That Exposes Bad Freelance Code

CodeTribunal

Put Freelance Code on Trial.

Upload code. Get a verdict. Know the risk.

Built with GLM 5.1 + CrewAI + GritQL

🚨 The Problem

Clients receive code they don’t understand.

Looks clean… but hides security risks
Passes linters… but fails in production
Works… but is architecturally broken

No one answers the only question that matters:

Is this code safe, professional, and worth paying for?

The Solution

CodeTribunal turns code review into a courtroom trial.

Upload a .zip → get:

Forensic evidence (AST-level)
Multi-agent investigation
AI courtroom debate
Final verdict + risk score

Not just analysis — judgment.

🧠 Why This Exist

1. Real System

6-phase pipeline
8 specialized agents
Persistent execution engine

2. Agents That Actually Act

File reads, pattern search, call tracing
Real tool usage via function calling (not fake reasoning)

3. Deterministic + AI Hybrid

GritQL = ground truth
Agents = interpretation + argument

4. End-to-End Story

From raw code → evidence → debate → verdict → report

How It Works

CodeTribunal runs a 6-phase pipeline, each building on the last:

Phase 1: Forensic Evidence (Deterministic — No LLM)

GritQL scans the entire codebase with 17 forensic patterns across security and quality domains:

Domain	Patterns	Examples
🔴 Security	13	Hardcoded secrets, `eval()`, SQL injection, `pickle.load()`, `os.system()`, weak hashing
🟡 Quality	4	`TODO`, `FIXME`, `HACK` comments

All scanning is read-only (--dry-run) and runs in parallel across patterns.

Phase 2: Code Dependency Graph (AST — No LLM)

Python's ast module and regex-based JS parsing build a lightweight dependency graph:

Nodes: files, functions, classes, imports
Edges: calls, imports, containment, inheritance
Enables call-chain tracing: eval() → handle_request() → app.route()

Phase 3: Investigation (3 ReACT Agents + 4 Tools)

Three specialist investigators, each running a genuine ReACT loop (Reason → Act → Observe → Repeat) using Z.ai's native function calling via LiteLLM:

Agent	Tools	Purpose
Security Investigator	FileReader, PatternSearch, CodeGraphQuery, FindingContext	Find vulnerabilities, trace attack vectors
Quality Investigator	FileReader, FindingContext	Assess technical debt, detect negligence
Architecture Investigator	FileReader, CodeGraphQuery	Analyze structure, trace dependencies

Each agent autonomously decides which tools to call, observes the results, and iterates. For example, the Security Investigator might:

Call file_reader to read a flagged file
Observe hardcoded secrets on specific lines
Call code_graph_query to trace where those secrets are used
Produce a detailed report with file paths, line numbers, and severity ratings

Verified working: GLM-5 + LiteLLM function calling confirmed. Agents make real tool calls that execute real code analysis.

Phase 4: The Trial (3 Agents)

A courtroom debate between AI agents:

** The Prosecutor** — builds the case for negligence, cites specific evidence
** The Defense Attorney** — challenges claims, argues context and proportionality
** Rebuttal** — the prosecutor responds to the defense

Agents use CrewAI's context parameter to chain arguments: prosecution output feeds into defense context, both feed into rebuttal.

Phase 5: The Verdict

** The Judge** reviews all evidence, investigation reports, and the full trial transcript. Delivers:

Overall ruling: GUILTY / MIXED / NOT GUILTY
Reputational Risk Score (0-100)
Findings summary with severity rankings

Phase 6: Structured Report

** Verdict Report Agent** compiles everything into a professional report:

Executive Summary
Findings Table (sorted by severity)
Per-Finding Analysis (impact, remediation, estimated fix effort)
Sentencing Recommendations

Architecture

                          ┌──────────────┐
                          │  Gradio UI   │
                          │  + Export    │
                          └──────┬───────┘
                                 │
                     ┌───────────▼────────────┐
                     │    Pipeline Engine      │
                     │  State · Persistence    │
                     │  Cancel · Resume        │
                     └───────────┬────────────┘
                                 │
          ┌──────────┬───────────┼───────────┬──────────┐
          ▼          ▼           ▼           ▼          ▼
     ┌─────────┐ ┌──────┐ ┌─────────┐ ┌─────────┐ ┌──────┐
     │Evidence │ │Code  │ │Invest.  │ │  Trial  │ │Report│
     │ Scanner │ │Graph │ │ Agents  │ │ Agents  │ │Agent │
     │(GritQL) │ │(AST) │ │+ Tools  │ │         │ │      │
     └─────────┘ └──────┘ └─────────┘ └─────────┘ └──────┘
          │          │           │           │          │
          └──────────┴───────────┴───────────┴──────────┘
                                 │
                     ┌───────────▼────────────┐
                     │   Custom Tool Layer    │
                     │ FileReader · Pattern   │
                     │ CodeGraph · Context    │
                     └────────────────────────┘

Key Design Decisions

Decision	Why
Agents have tools, not text dumps	Agents read files, search patterns, and trace calls on demand — scales to any codebase size
ReACT loop via LiteLLM	Direct function calling with GLM-5 — bypasses CrewAI's unreliable tool routing
Pipeline state persisted to JSON	Runs can resume after crashes. State is queryable
GritQL for evidence	AST-level pattern matching, not regex. Language-aware, precise
Custom CrewAI tools (BaseTool)	Pydantic-validated inputs, proper error handling, CrewAI-native integration
Rate-limit retry with backoff	Exponential backoff (4s → 64s) on Z.ai 429 errors — pipeline survives API spikes

Tech Stack

Component	Technology	Purpose
LLM	GLM 5.1 via Z.ai (LiteLLM)	Agent reasoning and debate
Code Scanning	GritQL	Deterministic AST-level pattern matching
Multi-Agent	CrewAI 1.12	Agent orchestration, task chaining, context handoffs
Function Calling	LiteLLM	Direct ReACT loop with GLM-5 tool calling
Code Graph	Python `ast` + regex	Dependency graph (Python + JS)
UI	Gradio 6	Streaming chatbot, file upload, export
Export	markdown-pdf (PyMuPDF)	PDF report generation from Markdown

Install

# Clone
git clone https://github.com/amineyagoub/CodeTribunal.git
cd CodeTribunal

# Install dependencies
pip install -e .

# Install GritQL CLI
npm install -g @getgrit/cli

# Configure
cp .env.example .env
# Edit .env: set ZAI_API_KEY (get one at https://open.bigmodel.cn/)

Requirements

Python 3.11+
Node.js (for GritQL CLI)
Z.ai API key (get one here)

Usage

Web UI (Recommended)

python3 -m code_tribunal.app

Open http://localhost:7860, upload a .zip of code, and watch the trial unfold.

Features

Feature	Details
4 Custom Tools	FileReader, PatternSearch, CodeGraphQuery, FindingContext — agents actively investigate
8 Specialized Agents	3 investigators, prosecutor, defense, rebuttal, judge, verdict report, expert witness
ReACT Engine	Custom Reason-Act-Observe loop via LiteLLM function calling with GLM-5
Code Dependency Graph	AST-based (Python + JS), with call-chain tracing and impact analysis
Parallel Evidence Scanning	ThreadPoolExecutor for GritQL patterns — 4x faster than sequential
Rate-Limit Resilience	Exponential backoff retry on 429 errors — survives API rate limits
Pipeline Persistence	State saved to JSON, runs can resume after interruption
Deduplication	Same file+line merged into one finding with multiple categories
Zip Safety	Zip-slip attack prevention
Streaming UI	Real-time pipeline progress in Gradio Chatbot with phase indicators
Export	Markdown and PDF report generation

🧪 Testing

# Run evidence scan on test fixtures
code-tribunal tests/fixtures/locale/ --evidence-only

# Run Python tests
pytest tests/

Test fixtures in tests/fixtures/locale/ contain deliberately bad Python and JavaScript code with:

Hardcoded passwords, API keys, AWS secrets, Stripe keys, JWT secrets
SQL injection via f-strings and template literals
eval(), pickle.load(), os.system(), subprocess.call(shell=True)
MD5 hashing
TODO, FIXME, HACK comments

Built for the Build with GLM 5.1 hackathon.