Spaces:

amine-yagoub
/

CodeTribunal

Sleeping

File size: 13,274 Bytes

c30312e
38cd7bb
 
 
 
 
 
 
 
c30312e
eecc2a5
 
1de0435
eecc2a5
c30312e
eecc2a5
c30312e
 
 
64d4a2f
662e309
eecc2a5
 
 
 
 
 
 
 
c30312e
 
 
 
 
 
 
 
 
 
 
 
 
 
 
eecc2a5
c30312e
eecc2a5
c30312e
eecc2a5
c30312e
 
 
 
eecc2a5
c30312e
eecc2a5
 
 
c30312e
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1de0435
eecc2a5
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1de0435
 
 
eecc2a5
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
38cd7bb
 
1de0435
eecc2a5
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
d5341cc
eecc2a5
 
 
 
 
 
 
 
d5341cc
eecc2a5
d5341cc
1de0435
eecc2a5
 
 
 
 
 
 
 
 
 
 
 
 
1de0435
d5341cc
 
eecc2a5
 
 
 
 
d5341cc
eecc2a5
 
 
 
 
 
1de0435
eecc2a5
 
 
 
 
 
 
 
 
 
1de0435
eecc2a5
 
 
 
 
d5341cc
 
eecc2a5
 
 
d5341cc
 
eecc2a5
d5341cc
eecc2a5
 
 
 
 
 
d5341cc
 
eecc2a5
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
d5341cc
eecc2a5
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
d5341cc
 
eecc2a5

---
title: CodeTribunal
emoji: 💻
colorFrom: pink
colorTo: red
sdk: docker
pinned: false
license: mit
short_description: The AI Courtroom That Exposes Bad Freelance Code
---
<div align="center">

# CodeTribunal

### Put Freelance Code on Trial.

**Upload code. Get a verdict. Know the risk.**

Built with **GLM 5.1 + CrewAI + GritQL**

[![Tests](https://github.com/amineyagoub/CodeTribunal/actions/workflows/tests.yml/badge.svg)](https://github.com/amineyagoub/CodeTribunal/actions/workflows/tests.yml)
[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Built for GLM 5.1 Hackathon](https://img.shields.io/badge/Built%20for-GLM%205.1-ff69b4)](https://build-with-glm-5-1-challenge.devpost.com)

</div>

---

## 🚨 The Problem

Clients receive code they don’t understand.

- Looks clean… but hides security risks
- Passes linters… but fails in production
- Works… but is architecturally broken

**No one answers the only question that matters:**

> _Is this code safe, professional, and worth paying for?_

---

## The Solution

**CodeTribunal turns code review into a courtroom trial.**

Upload a `.zip` → get:

- Forensic evidence (AST-level)
- Multi-agent investigation
- AI courtroom debate
- Final verdict + risk score

> Not just analysis — **judgment**.

---

## 🧠 Why This Exist

### 1. Real System

- 6-phase pipeline
- 8 specialized agents
- Persistent execution engine

### 2. Agents That Actually Act

- File reads, pattern search, call tracing
- Real tool usage via function calling (not fake reasoning)

### 3. Deterministic + AI Hybrid

- **GritQL = ground truth**
- **Agents = interpretation + argument**

### 4. End-to-End Story

From raw code → evidence → debate → verdict → report

## How It Works

CodeTribunal runs a **6-phase pipeline**, each building on the last:

### Phase 1: Forensic Evidence (Deterministic — No LLM)

GritQL scans the entire codebase with **17 forensic patterns** across security and quality domains:

| Domain      | Patterns | Examples                                                                                 |
| ----------- | -------- | ---------------------------------------------------------------------------------------- |
| 🔴 Security | 13       | Hardcoded secrets, `eval()`, SQL injection, `pickle.load()`, `os.system()`, weak hashing |
| 🟡 Quality  | 4        | `TODO`, `FIXME`, `HACK` comments                                                         |

All scanning is **read-only** (`--dry-run`) and runs in **parallel** across patterns.

### Phase 2: Code Dependency Graph (AST — No LLM)

Python's `ast` module and regex-based JS parsing build a **lightweight dependency graph**:

- Nodes: files, functions, classes, imports
- Edges: calls, imports, containment, inheritance
- Enables call-chain tracing: `eval() → handle_request() → app.route()`

### Phase 3: Investigation (3 ReACT Agents + 4 Tools)

Three specialist investigators, each running a **genuine ReACT loop** (Reason → Act → Observe → Repeat) using **Z.ai's native function calling** via LiteLLM:

| Agent                        | Tools                                                     | Purpose                                    |
| ---------------------------- | --------------------------------------------------------- | ------------------------------------------ |
| 🛡️ Security Investigator     | FileReader, PatternSearch, CodeGraphQuery, FindingContext | Find vulnerabilities, trace attack vectors |
| 📋 Quality Investigator      | FileReader, FindingContext                                | Assess technical debt, detect negligence   |
| 🏗️ Architecture Investigator | FileReader, CodeGraphQuery                                | Analyze structure, trace dependencies      |

Each agent **autonomously decides which tools to call**, observes the results, and iterates. For example, the Security Investigator might:

1. Call `file_reader` to read a flagged file
2. Observe hardcoded secrets on specific lines
3. Call `code_graph_query` to trace where those secrets are used
4. Produce a detailed report with file paths, line numbers, and severity ratings

**Verified working**: GLM-5 + LiteLLM function calling confirmed. Agents make real tool calls that execute real code analysis.

### Phase 4: The Trial (3 Agents)

A courtroom debate between AI agents:

1. ** The Prosecutor** — builds the case for negligence, cites specific evidence
2. ** The Defense Attorney** — challenges claims, argues context and proportionality
3. ** Rebuttal** — the prosecutor responds to the defense

Agents use CrewAI's `context` parameter to chain arguments: prosecution output feeds into defense context, both feed into rebuttal.

### Phase 5: The Verdict

**🔨 The Judge** reviews all evidence, investigation reports, and the full trial transcript. Delivers:

- Overall ruling: GUILTY / MIXED / NOT GUILTY
- Reputational Risk Score (0-100)
- Findings summary with severity rankings

### Phase 6: Structured Report

**📝 Verdict Report Agent** compiles everything into a professional report:

- Executive Summary
- Findings Table (sorted by severity)
- Per-Finding Analysis (impact, remediation, estimated fix effort)
- Sentencing Recommendations

---

## Architecture

```
                          ┌──────────────┐
                          │  Gradio UI   │
                          │  + Export    │
                          └──────┬───────┘
                                 │
                     ┌───────────▼────────────┐
                     │    Pipeline Engine      │
                     │  State · Persistence    │
                     │  Cancel · Resume        │
                     └───────────┬────────────┘
                                 │
          ┌──────────┬───────────┼───────────┬──────────┐
          ▼          ▼           ▼           ▼          ▼
     ┌─────────┐ ┌──────┐ ┌─────────┐ ┌─────────┐ ┌──────┐
     │Evidence │ │Code  │ │Invest.  │ │  Trial  │ │Report│
     │ Scanner │ │Graph │ │ Agents  │ │ Agents  │ │Agent │
     │(GritQL) │ │(AST) │ │+ Tools  │ │         │ │      │
     └─────────┘ └──────┘ └─────────┘ └─────────┘ └──────┘
          │          │           │           │          │
          └──────────┴───────────┴───────────┴──────────┘
                                 │
                     ┌───────────▼────────────┐
                     │   Custom Tool Layer    │
                     │ FileReader · Pattern   │
                     │ CodeGraph · Context    │
                     └────────────────────────┘
```

### Key Design Decisions

| Decision                              | Why                                                                                         |
| ------------------------------------- | ------------------------------------------------------------------------------------------- |
| **Agents have tools, not text dumps** | Agents read files, search patterns, and trace calls on demand — scales to any codebase size |
| **ReACT loop via LiteLLM**            | Direct function calling with GLM-5 — bypasses CrewAI's unreliable tool routing              |
| **Pipeline state persisted to JSON**  | Runs can resume after crashes. State is queryable                                           |
| **GritQL for evidence**               | AST-level pattern matching, not regex. Language-aware, precise                              |
| **Custom CrewAI tools (BaseTool)**    | Pydantic-validated inputs, proper error handling, CrewAI-native integration                 |
| **Rate-limit retry with backoff**     | Exponential backoff (4s → 64s) on Z.ai 429 errors — pipeline survives API spikes            |

---

## Tech Stack

| Component            | Technology               | Purpose                                              |
| -------------------- | ------------------------ | ---------------------------------------------------- |
| **LLM**              | GLM 5 via Z.ai (LiteLLM) | Agent reasoning and debate                           |
| **Code Scanning**    | GritQL                   | Deterministic AST-level pattern matching             |
| **Multi-Agent**      | CrewAI 1.12              | Agent orchestration, task chaining, context handoffs |
| **Function Calling** | LiteLLM                  | Direct ReACT loop with GLM-5 tool calling            |
| **Code Graph**       | Python `ast` + regex     | Dependency graph (Python + JS)                       |
| **UI**               | Gradio 6                 | Streaming chatbot, file upload, export               |
| **Export**           | fpdf2                    | PDF report generation                                |

---

## Install

```bash
# Clone
git clone https://github.com/amineyagoub/CodeTribunal.git
cd CodeTribunal

# Install dependencies
pip install -e .

# Install GritQL CLI
npm install -g @getgrit/cli

# Configure
cp .env.example .env
# Edit .env: set ZAI_API_KEY (get one at https://open.bigmodel.cn/)
```

### Requirements

- Python 3.11+
- Node.js (for GritQL CLI)
- Z.ai API key ([get one here](https://open.bigmodel.cn/))

---

## Usage

### Web UI (Recommended)

```bash
python3 -m code_tribunal.app
```

Open http://localhost:7860, upload a `.zip` of code, and watch the trial unfold.

### CLI

```bash
# Full trial
code-tribunal ./path/to/codebase

# Evidence only (no LLM, fast)
code-tribunal ./path/to/codebase --evidence-only

# Save results to JSON
code-tribunal ./path/to/codebase --output report.json
```

### Python API

```python
from code_tribunal.config import TribunalConfig
from code_tribunal.courtroom import Courtroom
from code_tribunal.pipeline import Phase

config = TribunalConfig()
courtroom = Courtroom(config)

for event in courtroom.run("./path/to/code"):
    print(f"[{event.phase.value}] {event.status}")

# Interactive Q&A
answer = courtroom.ask_question(
    "Why was eval() considered critical?",
    context={"evidence": "...", "verdict": "...", ...}
)
```

---

## 🔧 Production Features

| Feature                        | Details                                                                                 |
| ------------------------------ | --------------------------------------------------------------------------------------- |
| **4 Custom Tools**             | FileReader, PatternSearch, CodeGraphQuery, FindingContext — agents actively investigate |
| **8 Specialized Agents**       | 3 investigators, prosecutor, defense, rebuttal, judge, verdict report, expert witness   |
| **ReACT Engine**               | Custom Reason-Act-Observe loop via LiteLLM function calling with GLM-5                  |
| **Code Dependency Graph**      | AST-based (Python + JS), with call-chain tracing and impact analysis                    |
| **Parallel Evidence Scanning** | ThreadPoolExecutor for GritQL patterns — 4x faster than sequential                      |
| **Rate-Limit Resilience**      | Exponential backoff retry on 429 errors — survives API rate limits                      |
| **Pipeline Persistence**       | State saved to JSON, runs can resume after interruption                                 |
| **Deduplication**              | Same file+line merged into one finding with multiple categories                         |
| **Zip Safety**                 | Zip-slip attack prevention                                                              |
| **Streaming UI**               | Real-time pipeline progress in Gradio Chatbot with phase indicators                     |
| **Export**                     | Markdown and PDF report generation                                                      |

---

## 🧪 Testing

```bash
# Run evidence scan on test fixtures
code-tribunal tests/fixtures/locale/ --evidence-only

# Run Python tests
pytest tests/
```

Test fixtures in `tests/fixtures/locale/` contain deliberately bad Python and JavaScript code with:

- Hardcoded passwords, API keys, AWS secrets, Stripe keys, JWT secrets
- SQL injection via f-strings and template literals
- `eval()`, `pickle.load()`, `os.system()`, `subprocess.call(shell=True)`
- MD5 hashing
- TODO, FIXME, HACK comments

---

---

<div align="center">

Built for the [Build with GLM 5.1](https://build-with-glm-5-1-challenge.devpost.com) hackathon.

> > > > > > > b4fcdee (feat: Add initial CodeTribunal implementation)

</div>