---
title: CodeTribunal
emoji: 💻
colorFrom: pink
colorTo: red
sdk: docker
pinned: false
license: mit
short_description: The AI Courtroom That Exposes Bad Freelance Code
---

# CodeTribunal

Put Freelance Code on Trial.

Upload code. Get a verdict. Know the risk.

Built with GLM 5.1 + CrewAI + GritQL

Tests · Python 3.11+ · License: MIT · Built for the GLM 5.1 Hackathon


## 🚨 The Problem

Clients receive code they don't understand.

- Looks clean… but hides security risks
- Passes linters… but fails in production
- Works… but is architecturally broken

No one answers the only question that matters:

Is this code safe, professional, and worth paying for?


## The Solution

CodeTribunal turns code review into a courtroom trial.

Upload a `.zip` → get:

- Forensic evidence (AST-level)
- Multi-agent investigation
- AI courtroom debate
- Final verdict + risk score

Not just analysis: judgment.


## 🧠 Why This Exists

**1. Real System**

- 6-phase pipeline
- 8 specialized agents
- Persistent execution engine

**2. Agents That Actually Act**

- File reads, pattern search, call tracing
- Real tool usage via function calling (not fake reasoning)

**3. Deterministic + AI Hybrid**

- GritQL = ground truth
- Agents = interpretation + argument

**4. End-to-End Story**

From raw code → evidence → debate → verdict → report

## How It Works

CodeTribunal runs a 6-phase pipeline, each building on the last:

### Phase 1: Forensic Evidence (Deterministic, No LLM)

GritQL scans the entire codebase with 17 forensic patterns across security and quality domains:

| Domain | Patterns | Examples |
| --- | --- | --- |
| 🔴 Security | 13 | Hardcoded secrets, `eval()`, SQL injection, `pickle.load()`, `os.system()`, weak hashing |
| 🟡 Quality | 4 | `TODO`, `FIXME`, `HACK` comments |

All scanning is read-only (`--dry-run`) and runs in parallel across patterns.
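A minimal sketch of how such a parallel, read-only scan can be wired. The pattern names below are hypothetical placeholders (the real project ships 17 patterns), and the exact `grit` invocation is an assumption beyond the `--dry-run` flag mentioned above:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pattern names; the real project ships 17 GritQL patterns.
PATTERNS = ["hardcoded_secret", "eval_usage", "sql_injection_fstring"]

def grit_command(pattern: str, path: str) -> list[str]:
    # --dry-run keeps the scan read-only: matches are reported, files untouched
    return ["grit", "apply", pattern, path, "--dry-run"]

def scan_pattern(pattern: str, path: str) -> str:
    result = subprocess.run(
        grit_command(pattern, path), capture_output=True, text=True
    )
    return result.stdout

def scan_all(path: str) -> dict[str, str]:
    # One worker per pattern: patterns are independent, so they run in parallel
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {p: pool.submit(scan_pattern, p, path) for p in PATTERNS}
        return {p: f.result() for p, f in futures.items()}
```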

### Phase 2: Code Dependency Graph (AST, No LLM)

Python's `ast` module and regex-based JS parsing build a lightweight dependency graph:

- Nodes: files, functions, classes, imports
- Edges: calls, imports, containment, inheritance
- Enables call-chain tracing: `eval()` → `handle_request()` → `app.route()`
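On the Python side, a sketch of the kind of node and call-edge data the stdlib `ast` module yields (the project's real builder also covers files, containment, inheritance, and JS):

```python
import ast

def build_graph(source: str, filename: str) -> dict:
    """Collect nodes (functions, classes, imports) and call edges from one file."""
    tree = ast.parse(source, filename=filename)
    nodes, edges = [], []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            nodes.append(("function", node.name))
            # Record which names this function calls, for call-chain tracing
            for sub in ast.walk(node):
                if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
                    edges.append((node.name, sub.func.id))
        elif isinstance(node, ast.ClassDef):
            nodes.append(("class", node.name))
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                nodes.append(("import", alias.name))
    return {"nodes": nodes, "edges": edges}
```

Running this on a file containing `def handle_request(data): return eval(data)` yields the edge `("handle_request", "eval")`, which is exactly the kind of link that makes a flagged `eval()` traceable to its entry point.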

### Phase 3: Investigation (3 ReACT Agents + 4 Tools)

Three specialist investigators, each running a genuine ReACT loop (Reason → Act → Observe → Repeat) using Z.ai's native function calling via LiteLLM:

| Agent | Tools | Purpose |
| --- | --- | --- |
| 🛡️ Security Investigator | FileReader, PatternSearch, CodeGraphQuery, FindingContext | Find vulnerabilities, trace attack vectors |
| 📋 Quality Investigator | FileReader, FindingContext | Assess technical debt, detect negligence |
| 🏗️ Architecture Investigator | FileReader, CodeGraphQuery | Analyze structure, trace dependencies |

Each agent autonomously decides which tools to call, observes the results, and iterates. For example, the Security Investigator might:

1. Call `file_reader` to read a flagged file
2. Observe hardcoded secrets on specific lines
3. Call `code_graph_query` to trace where those secrets are used
4. Produce a detailed report with file paths, line numbers, and severity ratings

Verified working: GLM-5 + LiteLLM function calling confirmed. Agents make real tool calls that execute real code analysis.
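The shape of such a loop can be sketched as follows. The model is stubbed out here via a `call_llm` callable (the real agents call GLM-5 through LiteLLM), and `file_reader` is a simplified stand-in for the project's FileReader tool:

```python
import json

def file_reader(path: str) -> str:
    # Simplified stand-in for the FileReader tool
    with open(path, encoding="utf-8") as f:
        return f.read()

TOOLS = {"file_reader": file_reader}

def dispatch(tool_call: dict) -> str:
    # Act: execute the tool the model asked for, return its output as the observation
    fn = TOOLS[tool_call["name"]]
    return fn(**json.loads(tool_call["arguments"]))

def react_loop(call_llm, messages: list, max_turns: int = 5) -> str:
    # Reason -> Act -> Observe, repeated until the model stops requesting tools
    for _ in range(max_turns):
        reply = call_llm(messages)
        if not reply.get("tool_calls"):
            return reply["content"]  # the final investigation report
        for tc in reply["tool_calls"]:
            messages.append(
                {"role": "tool", "name": tc["name"], "content": dispatch(tc)}
            )
    return "max turns reached"
```

The key property is that tool calls are real function executions whose results are appended back into the conversation, not text the model invents.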

### Phase 4: The Trial (3 Agents)

A courtroom debate between AI agents:

1. **The Prosecutor** builds the case for negligence and cites specific evidence
2. **The Defense Attorney** challenges claims, arguing context and proportionality
3. **Rebuttal**: the prosecutor responds to the defense

Agents use CrewAI's `context` parameter to chain arguments: prosecution output feeds into the defense context, and both feed into the rebuttal.
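Stripped of CrewAI, the handoff amounts to the following (here `argue` stands in for running one agent's task; in the pipeline each step is a CrewAI task whose `context` lists the earlier tasks):

```python
def run_trial(evidence: str, argue) -> dict:
    # Each argument sees the evidence plus everything said before it
    prosecution = argue("prosecutor", context=[evidence])
    defense = argue("defense", context=[evidence, prosecution])
    rebuttal = argue("prosecutor", context=[evidence, prosecution, defense])
    return {"prosecution": prosecution, "defense": defense, "rebuttal": rebuttal}
```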

### Phase 5: The Verdict

🔨 The Judge reviews all evidence, investigation reports, and the full trial transcript, then delivers:

- Overall ruling: GUILTY / MIXED / NOT GUILTY
- Reputational Risk Score (0-100)
- Findings summary with severity rankings
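The score itself comes from the Judge agent, not a formula; purely as a hypothetical illustration, a clamped 0-100 severity roll-up could look like:

```python
# Hypothetical weights; the actual score is assigned by the Judge agent
SEVERITY_WEIGHTS = {"critical": 25, "high": 15, "medium": 5, "low": 1}

def risk_score(findings: list) -> int:
    """Clamp a weighted sum of finding severities into the 0-100 range."""
    raw = sum(SEVERITY_WEIGHTS.get(f["severity"], 0) for f in findings)
    return min(100, raw)
```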

### Phase 6: Structured Report

📝 The Verdict Report Agent compiles everything into a professional report:

- Executive Summary
- Findings Table (sorted by severity)
- Per-Finding Analysis (impact, remediation, estimated fix effort)
- Sentencing Recommendations

## Architecture

```
                          ┌──────────────┐
                          │  Gradio UI   │
                          │  + Export    │
                          └──────┬───────┘
                                 │
                     ┌───────────▼────────────┐
                     │    Pipeline Engine     │
                     │  State · Persistence   │
                     │  Cancel · Resume       │
                     └───────────┬────────────┘
                                 │
          ┌──────────┬───────────┼───────────┬──────────┐
          ▼          ▼           ▼           ▼          ▼
     ┌─────────┐  ┌──────┐  ┌─────────┐ ┌─────────┐  ┌──────┐
     │Evidence │  │Code  │  │Invest.  │ │  Trial  │  │Report│
     │ Scanner │  │Graph │  │ Agents  │ │ Agents  │  │Agent │
     │(GritQL) │  │(AST) │  │+ Tools  │ │         │  │      │
     └─────────┘  └──────┘  └─────────┘ └─────────┘  └──────┘
          │          │           │           │          │
          └──────────┴───────────┴───────────┴──────────┘
                                 │
                     ┌───────────▼────────────┐
                     │   Custom Tool Layer    │
                     │ FileReader · Pattern   │
                     │ CodeGraph · Context    │
                     └────────────────────────┘
```

## Key Design Decisions

| Decision | Why |
| --- | --- |
| Agents have tools, not text dumps | Agents read files, search patterns, and trace calls on demand; scales to any codebase size |
| ReACT loop via LiteLLM | Direct function calling with GLM-5, bypassing CrewAI's unreliable tool routing |
| Pipeline state persisted to JSON | Runs can resume after crashes; state is queryable |
| GritQL for evidence | AST-level pattern matching, not regex; language-aware and precise |
| Custom CrewAI tools (BaseTool) | Pydantic-validated inputs, proper error handling, CrewAI-native integration |
| Rate-limit retry with backoff | Exponential backoff (4s → 64s) on Z.ai 429 errors; the pipeline survives API spikes |
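The retry decision maps onto a standard exponential-backoff pattern. A sketch, with the API client's HTTP 429 exception modeled as a local `RateLimitError`:

```python
import time

class RateLimitError(Exception):
    """Stand-in for the API client's HTTP 429 exception."""

def with_backoff(call, retries: int = 5, base: float = 4.0):
    # Waits 4s, 8s, 16s, 32s, 64s between attempts before giving up
    for attempt in range(retries):
        try:
            return call()
        except RateLimitError:
            time.sleep(base * (2 ** attempt))
    return call()  # final attempt: let any error propagate to the caller
```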

## Tech Stack

| Component | Technology | Purpose |
| --- | --- | --- |
| LLM | GLM 5 via Z.ai (LiteLLM) | Agent reasoning and debate |
| Code Scanning | GritQL | Deterministic AST-level pattern matching |
| Multi-Agent | CrewAI 1.12 | Agent orchestration, task chaining, context handoffs |
| Function Calling | LiteLLM | Direct ReACT loop with GLM-5 tool calling |
| Code Graph | Python `ast` + regex | Dependency graph (Python + JS) |
| UI | Gradio 6 | Streaming chatbot, file upload, export |
| Export | fpdf2 | PDF report generation |

## Install

```bash
# Clone
git clone https://github.com/amineyagoub/CodeTribunal.git
cd CodeTribunal

# Install dependencies
pip install -e .

# Install GritQL CLI
npm install -g @getgrit/cli

# Configure
cp .env.example .env
# Edit .env: set ZAI_API_KEY (get one at https://open.bigmodel.cn/)
```

## Requirements

- Python 3.11+
- Node.js (for GritQL CLI)
- Z.ai API key (get one at https://open.bigmodel.cn/)

## Usage

### Web UI (Recommended)

```bash
python3 -m code_tribunal.app
```

Open http://localhost:7860, upload a `.zip` of code, and watch the trial unfold.

### CLI

```bash
# Full trial
code-tribunal ./path/to/codebase

# Evidence only (no LLM, fast)
code-tribunal ./path/to/codebase --evidence-only

# Save results to JSON
code-tribunal ./path/to/codebase --output report.json
```

### Python API

```python
from code_tribunal.config import TribunalConfig
from code_tribunal.courtroom import Courtroom
from code_tribunal.pipeline import Phase

config = TribunalConfig()
courtroom = Courtroom(config)

for event in courtroom.run("./path/to/code"):
    print(f"[{event.phase.value}] {event.status}")

# Interactive Q&A
answer = courtroom.ask_question(
    "Why was eval() considered critical?",
    context={"evidence": "...", "verdict": "..."},  # other phase outputs elided
)
```

## 🔧 Production Features

| Feature | Details |
| --- | --- |
| 4 Custom Tools | FileReader, PatternSearch, CodeGraphQuery, FindingContext; agents actively investigate |
| 8 Specialized Agents | 3 investigators, prosecutor, defense, rebuttal, judge, verdict report, expert witness |
| ReACT Engine | Custom Reason-Act-Observe loop via LiteLLM function calling with GLM-5 |
| Code Dependency Graph | AST-based (Python + JS), with call-chain tracing and impact analysis |
| Parallel Evidence Scanning | ThreadPoolExecutor for GritQL patterns; 4x faster than sequential |
| Rate-Limit Resilience | Exponential backoff retry on 429 errors; survives API rate limits |
| Pipeline Persistence | State saved to JSON; runs can resume after interruption |
| Deduplication | Same file+line merged into one finding with multiple categories |
| Zip Safety | Zip-slip attack prevention |
| Streaming UI | Real-time pipeline progress in Gradio Chatbot with phase indicators |
| Export | Markdown and PDF report generation |
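Of these, the zip-slip guard is fully determined by the stdlib; a sketch of the check (the function name is hypothetical):

```python
import os
import zipfile

def safe_extract(zip_path: str, dest: str) -> None:
    """Extract a zip, rejecting entries that would escape dest (zip-slip)."""
    dest = os.path.realpath(dest)
    with zipfile.ZipFile(zip_path) as zf:
        for member in zf.namelist():
            target = os.path.realpath(os.path.join(dest, member))
            # A malicious entry like "../../etc/cron.d/x" resolves outside dest
            if not target.startswith(dest + os.sep):
                raise ValueError(f"blocked zip-slip entry: {member}")
        zf.extractall(dest)
```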

## 🧪 Testing

```bash
# Run evidence scan on test fixtures
code-tribunal tests/fixtures/locale/ --evidence-only

# Run Python tests
pytest tests/
```

Test fixtures in `tests/fixtures/locale/` contain deliberately bad Python and JavaScript code with:

- Hardcoded passwords, API keys, AWS secrets, Stripe keys, JWT secrets
- SQL injection via f-strings and template literals
- `eval()`, `pickle.load()`, `os.system()`, `subprocess.call(shell=True)`
- MD5 hashing
- `TODO`, `FIXME`, `HACK` comments


Built for the Build with GLM 5.1 hackathon.
