CodeTribunal / README_HF.md
amine-yagoub's picture
docs: enhance README with frontmatter and add Hugging Face version
c30312e
metadata
title: CodeTribunal
emoji: πŸ’»
colorFrom: pink
colorTo: red
sdk: docker
pinned: false
license: mit
short_description: The AI Courtroom That Exposes Bad Freelance Code

CodeTribunal

Put Freelance Code on Trial.

Upload code. Get a verdict. Know the risk.

Built with GLM 5.1 + CrewAI + GritQL

Tests Python 3.11+ License: MIT Built for GLM 5.1 Hackathon


🚨 The Problem

Clients receive code they don’t understand.

  • Looks clean… but hides security risks
  • Passes linters… but fails in production
  • Works… but is architecturally broken

No one answers the only question that matters:

Is this code safe, professional, and worth paying for?


The Solution

CodeTribunal turns code review into a courtroom trial.

Upload a .zip β†’ get:

  • Forensic evidence (AST-level)
  • Multi-agent investigation
  • AI courtroom debate
  • Final verdict + risk score

Not just analysis β€” judgment.


🧠 Why This Exist

1. Real System

  • 6-phase pipeline
  • 8 specialized agents
  • Persistent execution engine

2. Agents That Actually Act

  • File reads, pattern search, call tracing
  • Real tool usage via function calling (not fake reasoning)

3. Deterministic + AI Hybrid

  • GritQL = ground truth
  • Agents = interpretation + argument

4. End-to-End Story

From raw code β†’ evidence β†’ debate β†’ verdict β†’ report

How It Works

CodeTribunal runs a 6-phase pipeline, each building on the last:

Phase 1: Forensic Evidence (Deterministic β€” No LLM)

GritQL scans the entire codebase with 17 forensic patterns across security and quality domains:

Domain Patterns Examples
πŸ”΄ Security 13 Hardcoded secrets, eval(), SQL injection, pickle.load(), os.system(), weak hashing
🟑 Quality 4 TODO, FIXME, HACK comments

All scanning is read-only (--dry-run) and runs in parallel across patterns.

Phase 2: Code Dependency Graph (AST β€” No LLM)

Python's ast module and regex-based JS parsing build a lightweight dependency graph:

  • Nodes: files, functions, classes, imports
  • Edges: calls, imports, containment, inheritance
  • Enables call-chain tracing: eval() β†’ handle_request() β†’ app.route()

Phase 3: Investigation (3 ReACT Agents + 4 Tools)

Three specialist investigators, each running a genuine ReACT loop (Reason β†’ Act β†’ Observe β†’ Repeat) using Z.ai's native function calling via LiteLLM:

Agent Tools Purpose
Security Investigator FileReader, PatternSearch, CodeGraphQuery, FindingContext Find vulnerabilities, trace attack vectors
Quality Investigator FileReader, FindingContext Assess technical debt, detect negligence
Architecture Investigator FileReader, CodeGraphQuery Analyze structure, trace dependencies

Each agent autonomously decides which tools to call, observes the results, and iterates. For example, the Security Investigator might:

  1. Call file_reader to read a flagged file
  2. Observe hardcoded secrets on specific lines
  3. Call code_graph_query to trace where those secrets are used
  4. Produce a detailed report with file paths, line numbers, and severity ratings

Verified working: GLM-5 + LiteLLM function calling confirmed. Agents make real tool calls that execute real code analysis.

Phase 4: The Trial (3 Agents)

A courtroom debate between AI agents:

  1. ** The Prosecutor** β€” builds the case for negligence, cites specific evidence
  2. ** The Defense Attorney** β€” challenges claims, argues context and proportionality
  3. ** Rebuttal** β€” the prosecutor responds to the defense

Agents use CrewAI's context parameter to chain arguments: prosecution output feeds into defense context, both feed into rebuttal.

Phase 5: The Verdict

** The Judge** reviews all evidence, investigation reports, and the full trial transcript. Delivers:

  • Overall ruling: GUILTY / MIXED / NOT GUILTY
  • Reputational Risk Score (0-100)
  • Findings summary with severity rankings

Phase 6: Structured Report

** Verdict Report Agent** compiles everything into a professional report:

  • Executive Summary
  • Findings Table (sorted by severity)
  • Per-Finding Analysis (impact, remediation, estimated fix effort)
  • Sentencing Recommendations

Architecture

                          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                          β”‚  Gradio UI   β”‚
                          β”‚  + Export    β”‚
                          β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
                                 β”‚
                     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                     β”‚    Pipeline Engine      β”‚
                     β”‚  State Β· Persistence    β”‚
                     β”‚  Cancel Β· Resume        β”‚
                     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                 β”‚
          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
          β–Ό          β–Ό           β–Ό           β–Ό          β–Ό
     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”
     β”‚Evidence β”‚ β”‚Code  β”‚ β”‚Invest.  β”‚ β”‚  Trial  β”‚ β”‚Reportβ”‚
     β”‚ Scanner β”‚ β”‚Graph β”‚ β”‚ Agents  β”‚ β”‚ Agents  β”‚ β”‚Agent β”‚
     β”‚(GritQL) β”‚ β”‚(AST) β”‚ β”‚+ Tools  β”‚ β”‚         β”‚ β”‚      β”‚
     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”˜
          β”‚          β”‚           β”‚           β”‚          β”‚
          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                 β”‚
                     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                     β”‚   Custom Tool Layer    β”‚
                     β”‚ FileReader Β· Pattern   β”‚
                     β”‚ CodeGraph Β· Context    β”‚
                     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Key Design Decisions

Decision Why
Agents have tools, not text dumps Agents read files, search patterns, and trace calls on demand β€” scales to any codebase size
ReACT loop via LiteLLM Direct function calling with GLM-5 β€” bypasses CrewAI's unreliable tool routing
Pipeline state persisted to JSON Runs can resume after crashes. State is queryable
GritQL for evidence AST-level pattern matching, not regex. Language-aware, precise
Custom CrewAI tools (BaseTool) Pydantic-validated inputs, proper error handling, CrewAI-native integration
Rate-limit retry with backoff Exponential backoff (4s β†’ 64s) on Z.ai 429 errors β€” pipeline survives API spikes

Tech Stack

Component Technology Purpose
LLM GLM 5.1 via Z.ai (LiteLLM) Agent reasoning and debate
Code Scanning GritQL Deterministic AST-level pattern matching
Multi-Agent CrewAI 1.12 Agent orchestration, task chaining, context handoffs
Function Calling LiteLLM Direct ReACT loop with GLM-5 tool calling
Code Graph Python ast + regex Dependency graph (Python + JS)
UI Gradio 6 Streaming chatbot, file upload, export
Export markdown-pdf (PyMuPDF) PDF report generation from Markdown

Install

# Clone
git clone https://github.com/amineyagoub/CodeTribunal.git
cd CodeTribunal

# Install dependencies
pip install -e .

# Install GritQL CLI
npm install -g @getgrit/cli

# Configure
cp .env.example .env
# Edit .env: set ZAI_API_KEY (get one at https://open.bigmodel.cn/)

Requirements

  • Python 3.11+
  • Node.js (for GritQL CLI)
  • Z.ai API key (get one here)

Usage

Web UI (Recommended)

python3 -m code_tribunal.app

Open http://localhost:7860, upload a .zip of code, and watch the trial unfold.


Features

Feature Details
4 Custom Tools FileReader, PatternSearch, CodeGraphQuery, FindingContext β€” agents actively investigate
8 Specialized Agents 3 investigators, prosecutor, defense, rebuttal, judge, verdict report, expert witness
ReACT Engine Custom Reason-Act-Observe loop via LiteLLM function calling with GLM-5
Code Dependency Graph AST-based (Python + JS), with call-chain tracing and impact analysis
Parallel Evidence Scanning ThreadPoolExecutor for GritQL patterns β€” 4x faster than sequential
Rate-Limit Resilience Exponential backoff retry on 429 errors β€” survives API rate limits
Pipeline Persistence State saved to JSON, runs can resume after interruption
Deduplication Same file+line merged into one finding with multiple categories
Zip Safety Zip-slip attack prevention
Streaming UI Real-time pipeline progress in Gradio Chatbot with phase indicators
Export Markdown and PDF report generation

πŸ§ͺ Testing

# Run evidence scan on test fixtures
code-tribunal tests/fixtures/locale/ --evidence-only

# Run Python tests
pytest tests/

Test fixtures in tests/fixtures/locale/ contain deliberately bad Python and JavaScript code with:

  • Hardcoded passwords, API keys, AWS secrets, Stripe keys, JWT secrets
  • SQL injection via f-strings and template literals
  • eval(), pickle.load(), os.system(), subprocess.call(shell=True)
  • MD5 hashing
  • TODO, FIXME, HACK comments


Built for the Build with GLM 5.1 hackathon.