Spaces:

amine-yagoub
/

CodeTribunal

Running

App Files Files Community

CodeTribunal / README.md

amine-yagoub

fix: resolve merge conflicts in README.md for HF Spaces

3a09553 about 1 month ago

preview code

raw

history blame contribute delete

13.3 kB

	---
	title: CodeTribunal
	emoji: 💻
	colorFrom: pink
	colorTo: red
	sdk: docker
	pinned: false
	license: mit
	short_description: The AI Courtroom That Exposes Bad Freelance Code
	---
	<div align="center">

	# CodeTribunal

	### Put Freelance Code on Trial.

	Upload code. Get a verdict. Know the risk.

	Built with GLM 5.1 + CrewAI + GritQL

	[![Tests](https://github.com/amineyagoub/CodeTribunal/actions/workflows/tests.yml/badge.svg)](https://github.com/amineyagoub/CodeTribunal/actions/workflows/tests.yml)
	[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
	[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
	[![Built for GLM 5.1 Hackathon](https://img.shields.io/badge/Built%20for-GLM%205.1-ff69b4)](https://build-with-glm-5-1-challenge.devpost.com)

	</div>

	---

	## 🚨 The Problem

	Clients receive code they don’t understand.

	- Looks clean… but hides security risks
	- Passes linters… but fails in production
	- Works… but is architecturally broken

	No one answers the only question that matters:

	> _Is this code safe, professional, and worth paying for?_

	---

	## The Solution

	CodeTribunal turns code review into a courtroom trial.

	Upload a `.zip` → get:

	- Forensic evidence (AST-level)
	- Multi-agent investigation
	- AI courtroom debate
	- Final verdict + risk score

	> Not just analysis — judgment.

	---

	## 🧠 Why This Exist

	### 1. Real System

	- 6-phase pipeline
	- 8 specialized agents
	- Persistent execution engine

	### 2. Agents That Actually Act

	- File reads, pattern search, call tracing
	- Real tool usage via function calling (not fake reasoning)

	### 3. Deterministic + AI Hybrid

	- GritQL = ground truth
	- Agents = interpretation + argument

	### 4. End-to-End Story

	From raw code → evidence → debate → verdict → report

	## How It Works

	CodeTribunal runs a 6-phase pipeline, each building on the last:

	### Phase 1: Forensic Evidence (Deterministic — No LLM)

	GritQL scans the entire codebase with 17 forensic patterns across security and quality domains:

	\| Domain \| Patterns \| Examples \|
	\| ----------- \| -------- \| ---------------------------------------------------------------------------------------- \|
	\| 🔴 Security \| 13 \| Hardcoded secrets, `eval()`, SQL injection, `pickle.load()`, `os.system()`, weak hashing \|
	\| 🟡 Quality \| 4 \| `TODO`, `FIXME`, `HACK` comments \|

	All scanning is read-only (`--dry-run`) and runs in parallel across patterns.

	### Phase 2: Code Dependency Graph (AST — No LLM)

	Python's `ast` module and regex-based JS parsing build a lightweight dependency graph:

	- Nodes: files, functions, classes, imports
	- Edges: calls, imports, containment, inheritance
	- Enables call-chain tracing: `eval() → handle_request() → app.route()`

	### Phase 3: Investigation (3 ReACT Agents + 4 Tools)

	Three specialist investigators, each running a genuine ReACT loop (Reason → Act → Observe → Repeat) using Z.ai's native function calling via LiteLLM:

	\| Agent \| Tools \| Purpose \|
	\| ---------------------------- \| --------------------------------------------------------- \| ------------------------------------------ \|
	\| 🛡️ Security Investigator \| FileReader, PatternSearch, CodeGraphQuery, FindingContext \| Find vulnerabilities, trace attack vectors \|
	\| 📋 Quality Investigator \| FileReader, FindingContext \| Assess technical debt, detect negligence \|
	\| 🏗️ Architecture Investigator \| FileReader, CodeGraphQuery \| Analyze structure, trace dependencies \|

	Each agent autonomously decides which tools to call, observes the results, and iterates. For example, the Security Investigator might:

	1. Call `file_reader` to read a flagged file
	2. Observe hardcoded secrets on specific lines
	3. Call `code_graph_query` to trace where those secrets are used
	4. Produce a detailed report with file paths, line numbers, and severity ratings

	Verified working: GLM-5 + LiteLLM function calling confirmed. Agents make real tool calls that execute real code analysis.

	### Phase 4: The Trial (3 Agents)

	A courtroom debate between AI agents:

	1. The Prosecutor — builds the case for negligence, cites specific evidence
	2. The Defense Attorney — challenges claims, argues context and proportionality
	3. Rebuttal — the prosecutor responds to the defense

	Agents use CrewAI's `context` parameter to chain arguments: prosecution output feeds into defense context, both feed into rebuttal.

	### Phase 5: The Verdict

	🔨 The Judge reviews all evidence, investigation reports, and the full trial transcript. Delivers:

	- Overall ruling: GUILTY / MIXED / NOT GUILTY
	- Reputational Risk Score (0-100)
	- Findings summary with severity rankings

	### Phase 6: Structured Report

	📝 Verdict Report Agent compiles everything into a professional report:

	- Executive Summary
	- Findings Table (sorted by severity)
	- Per-Finding Analysis (impact, remediation, estimated fix effort)
	- Sentencing Recommendations

	---

	## Architecture

	```
	┌──────────────┐
	│ Gradio UI │
	│ + Export │
	└──────┬───────┘
	│
	┌───────────▼────────────┐
	│ Pipeline Engine │
	│ State · Persistence │
	│ Cancel · Resume │
	└───────────┬────────────┘
	│
	┌──────────┬───────────┼───────────┬──────────┐
	▼ ▼ ▼ ▼ ▼
	┌─────────┐ ┌──────┐ ┌─────────┐ ┌─────────┐ ┌──────┐
	│Evidence │ │Code │ │Invest. │ │ Trial │ │Report│
	│ Scanner │ │Graph │ │ Agents │ │ Agents │ │Agent │
	│(GritQL) │ │(AST) │ │+ Tools │ │ │ │ │
	└─────────┘ └──────┘ └─────────┘ └─────────┘ └──────┘
	│ │ │ │ │
	└──────────┴───────────┴───────────┴──────────┘
	│
	┌───────────▼────────────┐
	│ Custom Tool Layer │
	│ FileReader · Pattern │
	│ CodeGraph · Context │
	└────────────────────────┘
	```

	### Key Design Decisions

	\| Decision \| Why \|
	\| ------------------------------------- \| ------------------------------------------------------------------------------------------- \|
	\| Agents have tools, not text dumps \| Agents read files, search patterns, and trace calls on demand — scales to any codebase size \|
	\| ReACT loop via LiteLLM \| Direct function calling with GLM-5 — bypasses CrewAI's unreliable tool routing \|
	\| Pipeline state persisted to JSON \| Runs can resume after crashes. State is queryable \|
	\| GritQL for evidence \| AST-level pattern matching, not regex. Language-aware, precise \|
	\| Custom CrewAI tools (BaseTool) \| Pydantic-validated inputs, proper error handling, CrewAI-native integration \|
	\| Rate-limit retry with backoff \| Exponential backoff (4s → 64s) on Z.ai 429 errors — pipeline survives API spikes \|

	---

	## Tech Stack

	\| Component \| Technology \| Purpose \|
	\| -------------------- \| ------------------------ \| ---------------------------------------------------- \|
	\| LLM \| GLM 5 via Z.ai (LiteLLM) \| Agent reasoning and debate \|
	\| Code Scanning \| GritQL \| Deterministic AST-level pattern matching \|
	\| Multi-Agent \| CrewAI 1.12 \| Agent orchestration, task chaining, context handoffs \|
	\| Function Calling \| LiteLLM \| Direct ReACT loop with GLM-5 tool calling \|
	\| Code Graph \| Python `ast` + regex \| Dependency graph (Python + JS) \|
	\| UI \| Gradio 6 \| Streaming chatbot, file upload, export \|
	\| Export \| fpdf2 \| PDF report generation \|

	---

	## Install

	```bash
	# Clone
	git clone https://github.com/amineyagoub/CodeTribunal.git
	cd CodeTribunal

	# Install dependencies
	pip install -e .

	# Install GritQL CLI
	npm install -g @getgrit/cli

	# Configure
	cp .env.example .env
	# Edit .env: set ZAI_API_KEY (get one at https://open.bigmodel.cn/)
	```

	### Requirements

	- Python 3.11+
	- Node.js (for GritQL CLI)
	- Z.ai API key ([get one here](https://open.bigmodel.cn/))

	---

	## Usage

	### Web UI (Recommended)

	```bash
	python3 -m code_tribunal.app
	```

	Open http://localhost:7860, upload a `.zip` of code, and watch the trial unfold.

	### CLI

	```bash
	# Full trial
	code-tribunal ./path/to/codebase

	# Evidence only (no LLM, fast)
	code-tribunal ./path/to/codebase --evidence-only

	# Save results to JSON
	code-tribunal ./path/to/codebase --output report.json
	```

	### Python API

	```python
	from code_tribunal.config import TribunalConfig
	from code_tribunal.courtroom import Courtroom
	from code_tribunal.pipeline import Phase

	config = TribunalConfig()
	courtroom = Courtroom(config)

	for event in courtroom.run("./path/to/code"):
	print(f"[{event.phase.value}] {event.status}")

	# Interactive Q&A
	answer = courtroom.ask_question(
	"Why was eval() considered critical?",
	context={"evidence": "...", "verdict": "...", ...}
	)
	```

	---

	## 🔧 Production Features

	\| Feature \| Details \|
	\| ------------------------------ \| --------------------------------------------------------------------------------------- \|
	\| 4 Custom Tools \| FileReader, PatternSearch, CodeGraphQuery, FindingContext — agents actively investigate \|
	\| 8 Specialized Agents \| 3 investigators, prosecutor, defense, rebuttal, judge, verdict report, expert witness \|
	\| ReACT Engine \| Custom Reason-Act-Observe loop via LiteLLM function calling with GLM-5 \|
	\| Code Dependency Graph \| AST-based (Python + JS), with call-chain tracing and impact analysis \|
	\| Parallel Evidence Scanning \| ThreadPoolExecutor for GritQL patterns — 4x faster than sequential \|
	\| Rate-Limit Resilience \| Exponential backoff retry on 429 errors — survives API rate limits \|
	\| Pipeline Persistence \| State saved to JSON, runs can resume after interruption \|
	\| Deduplication \| Same file+line merged into one finding with multiple categories \|
	\| Zip Safety \| Zip-slip attack prevention \|
	\| Streaming UI \| Real-time pipeline progress in Gradio Chatbot with phase indicators \|
	\| Export \| Markdown and PDF report generation \|

	---

	## 🧪 Testing

	```bash
	# Run evidence scan on test fixtures
	code-tribunal tests/fixtures/locale/ --evidence-only

	# Run Python tests
	pytest tests/
	```

	Test fixtures in `tests/fixtures/locale/` contain deliberately bad Python and JavaScript code with:

	- Hardcoded passwords, API keys, AWS secrets, Stripe keys, JWT secrets
	- SQL injection via f-strings and template literals
	- `eval()`, `pickle.load()`, `os.system()`, `subprocess.call(shell=True)`
	- MD5 hashing
	- TODO, FIXME, HACK comments

	---

	---

	<div align="center">

	Built for the [Build with GLM 5.1](https://build-with-glm-5-1-challenge.devpost.com) hackathon.

	> > > > > > > b4fcdee (feat: Add initial CodeTribunal implementation)

	</div>