---
title: CodeTribunal
emoji: 💻
colorFrom: pink
colorTo: red
sdk: docker
pinned: false
license: mit
short_description: The AI Courtroom That Exposes Bad Freelance Code
---
<div align="center">

# CodeTribunal

### Put Freelance Code on Trial.

**Upload code. Get a verdict. Know the risk.**

Built with **GLM 5.1 + CrewAI + GritQL**

[![Tests](https://github.com/amineyagoub/CodeTribunal/actions/workflows/tests.yml/badge.svg)](https://github.com/amineyagoub/CodeTribunal/actions/workflows/tests.yml)
[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Built for GLM 5.1 Hackathon](https://img.shields.io/badge/Built%20for-GLM%205.1-ff69b4)](https://build-with-glm-5-1-challenge.devpost.com)

</div>

---

## 🚨 The Problem

Clients receive code they don't understand.

- Looks clean… but hides security risks
- Passes linters… but fails in production
- Works… but is architecturally broken

**No one answers the only question that matters:**

> _Is this code safe, professional, and worth paying for?_

---

## The Solution

**CodeTribunal turns code review into a courtroom trial.**

Upload a `.zip` → get:

- Forensic evidence (AST-level)
- Multi-agent investigation
- AI courtroom debate
- Final verdict + risk score

> Not just analysis: **judgment**.

---

## 🧠 Why This Exists

### 1. Real System

- 6-phase pipeline
- 8 specialized agents
- Persistent execution engine

### 2. Agents That Actually Act

- File reads, pattern search, call tracing
- Real tool usage via function calling (not fake reasoning)

### 3. Deterministic + AI Hybrid

- **GritQL = ground truth**
- **Agents = interpretation + argument**

### 4. End-to-End Story

From raw code → evidence → debate → verdict → report

## How It Works

CodeTribunal runs a **6-phase pipeline**, each building on the last:

### Phase 1: Forensic Evidence (Deterministic, No LLM)

GritQL scans the entire codebase with **17 forensic patterns** across security and quality domains:

| Domain      | Patterns | Examples                                                                                 |
| ----------- | -------- | ---------------------------------------------------------------------------------------- |
| 🔴 Security | 13       | Hardcoded secrets, `eval()`, SQL injection, `pickle.load()`, `os.system()`, weak hashing |
| 🟡 Quality  | 4        | `TODO`, `FIXME`, `HACK` comments                                                         |

All scanning is **read-only** (`--dry-run`) and runs in **parallel** across patterns.
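
The parallel, read-only scan described above can be sketched roughly as follows. This is a minimal illustration, not the project's scanner: `scan_all`, `fake_scan`, and the pattern names are hypothetical, and the per-pattern scanner is injected as a plain function so that no GritQL command line is assumed.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Hypothetical type: a scanner takes a pattern name and returns its findings.
Scanner = Callable[[str], List[dict]]


def scan_all(
    patterns: List[str], scan_one: Scanner, max_workers: int = 4
) -> Dict[str, List[dict]]:
    """Run one scan per pattern concurrently and collect findings by pattern."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as `patterns`.
        results = pool.map(scan_one, patterns)
        return dict(zip(patterns, results))


def fake_scan(pattern: str) -> List[dict]:
    """Stand-in for a read-only scan; returns canned findings (assumption)."""
    canned = {"eval_call": [{"file": "app.py", "line": 12}]}
    return canned.get(pattern, [])


if __name__ == "__main__":
    print(scan_all(["eval_call", "todo_comment"], fake_scan))
```

Because each scan only reads the codebase, the workers share no mutable state, which is what makes a simple thread pool a reasonable fit here.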

### Phase 2: Code Dependency Graph (AST, No LLM)

Python's `ast` module and regex-based JS parsing build a **lightweight dependency graph**:

- Nodes: files, functions, classes, imports
- Edges: calls, imports, containment, inheritance
- Enables call-chain tracing: `eval() → handle_request() → app.route()`
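
As a rough illustration of the Python side of such a graph, the sketch below extracts caller/callee edges with the standard `ast` module. The helper names `call_edges` and `callers_of` and the sample source are assumptions made for this example; the project's real graph also tracks imports, classes, and inheritance, which are omitted here.

```python
import ast
from typing import List, Tuple

SAMPLE = '''
def handle_request(data):
    return eval(data)

def route():
    return handle_request("1 + 1")
'''


def call_edges(source: str) -> List[Tuple[str, str]]:
    """Return (caller, callee) pairs for calls made inside function bodies.

    Calls inside a nested function are attributed to both the inner and
    the outer function, which is acceptable for a lightweight graph.
    """
    edges = []
    for func in ast.walk(ast.parse(source)):
        if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for node in ast.walk(func):
            if not isinstance(node, ast.Call):
                continue
            target = node.func
            if isinstance(target, ast.Name):         # plain call: eval(x)
                edges.append((func.name, target.id))
            elif isinstance(target, ast.Attribute):  # method call: os.system(x)
                edges.append((func.name, target.attr))
    return edges


def callers_of(edges: List[Tuple[str, str]], name: str) -> List[str]:
    """List the functions that call `name` directly."""
    return sorted({caller for caller, callee in edges if callee == name})
```

Applying `callers_of` repeatedly, first to `eval` and then to each caller it returns, walks a call chain outward from a dangerous sink toward its entry points.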

### Phase 3: Investigation (3 ReACT Agents + 4 Tools)

Three specialist investigators, each running a **genuine ReACT loop** (Reason → Act → Observe → Repeat) using **Z.ai's native function calling** via LiteLLM:

| Agent                        | Tools                                                     | Purpose                                    |
| ---------------------------- | --------------------------------------------------------- | ------------------------------------------ |
| 🛡️ Security Investigator     | FileReader, PatternSearch, CodeGraphQuery, FindingContext | Find vulnerabilities, trace attack vectors |
| 📋 Quality Investigator      | FileReader, FindingContext                                | Assess technical debt, detect negligence   |
| 🏗️ Architecture Investigator | FileReader, CodeGraphQuery                                | Analyze structure, trace dependencies      |

Each agent **autonomously decides which tools to call**, observes the results, and iterates. For example, the Security Investigator might:

1. Call `file_reader` to read a flagged file
2. Observe hardcoded secrets on specific lines
3. Call `code_graph_query` to trace where those secrets are used
4. Produce a detailed report with file paths, line numbers, and severity ratings
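
The steps above follow a Reason → Act → Observe cycle, which can be sketched in isolation as below. This is a simplified model under stated assumptions: the model call is replaced by a hypothetical `decide` function, the tools are stubs, and nothing here reflects the LiteLLM or Z.ai request format.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tool registry: tool name -> function taking one string argument.
Tools = Dict[str, Callable[[str], str]]
# A decision is ("call", tool_name, argument) or ("final", report, "").
Decision = Tuple[str, str, str]


def react_loop(
    decide: Callable[[List[str]], Decision], tools: Tools, max_steps: int = 5
) -> str:
    """Alternate between deciding and acting until a final report is produced."""
    history: List[str] = []
    for _ in range(max_steps):
        kind, name, arg = decide(history)                  # Reason
        if kind == "final":
            return name
        observation = tools[name](arg)                     # Act
        history.append(f"{name}({arg}) -> {observation}")  # Observe
    return "step limit reached"


def scripted_decide(history: List[str]) -> Decision:
    """Stand-in for the LLM: read a file, trace a symbol, then report."""
    if not history:
        return ("call", "file_reader", "config.py")
    if len(history) == 1:
        return ("call", "code_graph_query", "API_KEY")
    return ("final", f"report based on {len(history)} observations", "")


TOOLS: Tools = {
    "file_reader": lambda path: f"contents of {path}",
    "code_graph_query": lambda symbol: f"usages of {symbol}",
}
```

The `max_steps` cap is a design choice in this sketch: it keeps a misbehaving agent from looping on tool calls indefinitely.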

**Verified working**: GLM-5 + LiteLLM function calling confirmed. Agents make real tool calls that execute real code analysis.

### Phase 4: The Trial (3 Agents)

A courtroom debate between AI agents:

1. **The Prosecutor**: builds the case for negligence, cites specific evidence
2. **The Defense Attorney**: challenges claims, argues context and proportionality
3. **Rebuttal**: the prosecutor responds to the defense

Agents use CrewAI's `context` parameter to chain arguments: prosecution output feeds into defense context, both feed into rebuttal.

### Phase 5: The Verdict

**🔨 The Judge** reviews all evidence, investigation reports, and the full trial transcript, then delivers:

- Overall ruling: GUILTY / MIXED / NOT GUILTY
- Reputational Risk Score (0-100)
- Findings summary with severity rankings

### Phase 6: Structured Report

**πŸ“ Verdict Report Agent** compiles everything into a professional report:

- Executive Summary
- Findings Table (sorted by severity)
- Per-Finding Analysis (impact, remediation, estimated fix effort)
- Sentencing Recommendations

---

## Architecture

```
                          ┌──────────────┐
                          │  Gradio UI   │
                          │  + Export    │
                          └──────┬───────┘
                                 │
                     ┌───────────▼─────────────┐
                     │    Pipeline Engine      │
                     │  State · Persistence    │
                     │  Cancel · Resume        │
                     └───────────┬─────────────┘
                                 │
          ┌──────────┬───────────┼───────────┬──────────┐
          ▼          ▼           ▼           ▼          ▼
     ┌─────────┐ ┌──────┐ ┌─────────┐ ┌─────────┐ ┌──────┐
     │Evidence │ │Code  │ │Invest.  │ │  Trial  │ │Report│
     │ Scanner │ │Graph │ │ Agents  │ │ Agents  │ │Agent │
     │(GritQL) │ │(AST) │ │+ Tools  │ │         │ │      │
     └─────────┘ └──────┘ └─────────┘ └─────────┘ └──────┘
          │          │           │           │          │
          └──────────┴───────────┴───────────┴──────────┘
                                 │
                     ┌───────────▼────────────┐
                     │   Custom Tool Layer    │
                     │ FileReader · Pattern   │
                     │ CodeGraph · Context    │
                     └────────────────────────┘
```

### Key Design Decisions

| Decision                              | Why                                                                                         |
| ------------------------------------- | ------------------------------------------------------------------------------------------- |
| **Agents have tools, not text dumps** | Agents read files, search patterns, and trace calls on demand instead of receiving the whole codebase in one prompt, which scales better to large codebases |
| **ReACT loop via LiteLLM**            | Direct function calling with GLM-5; bypasses CrewAI's unreliable tool routing               |
| **Pipeline state persisted to JSON**  | Runs can resume after crashes. State is queryable                                           |
| **GritQL for evidence**               | AST-level pattern matching, not regex. Language-aware, precise                              |
| **Custom CrewAI tools (BaseTool)**    | Pydantic-validated inputs, proper error handling, CrewAI-native integration                 |
| **Rate-limit retry with backoff**     | Exponential backoff (4s → 64s) on Z.ai 429 errors, so the pipeline survives API spikes      |
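
The retry policy in the last row can be sketched as follows, assuming the delay doubles on each attempt from 4 s up to 64 s. `RateLimitError`, `backoff_delays`, and `call_with_backoff` are hypothetical names for this illustration; the real client's exception type and retry wiring may differ.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for an HTTP 429 error raised by an API client."""


def backoff_delays(first: float = 4.0, last: float = 64.0) -> List[float]:
    """Doubling delays from `first` up to and including `last`."""
    delays = []
    delay = first
    while delay <= last:
        delays.append(delay)
        delay *= 2
    return delays


def call_with_backoff(
    fn: Callable[[], T], sleep: Callable[[float], None] = time.sleep
) -> T:
    """Call `fn`, sleeping through the delay schedule on rate-limit errors."""
    for delay in backoff_delays():
        try:
            return fn()
        except RateLimitError:
            sleep(delay)
    # Final attempt after the last delay; any error now propagates.
    return fn()
```

Passing `sleep` as a parameter lets a test record the delays instead of actually waiting.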

---

## Tech Stack

| Component            | Technology               | Purpose                                              |
| -------------------- | ------------------------ | ---------------------------------------------------- |
| **LLM**              | GLM 5 via Z.ai (LiteLLM) | Agent reasoning and debate                           |
| **Code Scanning**    | GritQL                   | Deterministic AST-level pattern matching             |
| **Multi-Agent**      | CrewAI 1.12              | Agent orchestration, task chaining, context handoffs |
| **Function Calling** | LiteLLM                  | Direct ReACT loop with GLM-5 tool calling            |
| **Code Graph**       | Python `ast` + regex     | Dependency graph (Python + JS)                       |
| **UI**               | Gradio 6                 | Streaming chatbot, file upload, export               |
| **Export**           | fpdf2                    | PDF report generation                                |

---

## Install

```bash
# Clone
git clone https://github.com/amineyagoub/CodeTribunal.git
cd CodeTribunal

# Install dependencies
pip install -e .

# Install GritQL CLI
npm install -g @getgrit/cli

# Configure
cp .env.example .env
# Edit .env: set ZAI_API_KEY (get one at https://open.bigmodel.cn/)
```

### Requirements

- Python 3.11+
- Node.js (for GritQL CLI)
- Z.ai API key ([get one here](https://open.bigmodel.cn/))

---

## Usage

### Web UI (Recommended)

```bash
python3 -m code_tribunal.app
```

Open http://localhost:7860, upload a `.zip` of code, and watch the trial unfold.

### CLI

```bash
# Full trial
code-tribunal ./path/to/codebase

# Evidence only (no LLM, fast)
code-tribunal ./path/to/codebase --evidence-only

# Save results to JSON
code-tribunal ./path/to/codebase --output report.json
```

### Python API

```python
from code_tribunal.config import TribunalConfig
from code_tribunal.courtroom import Courtroom
from code_tribunal.pipeline import Phase

config = TribunalConfig()
courtroom = Courtroom(config)

for event in courtroom.run("./path/to/code"):
    print(f"[{event.phase.value}] {event.status}")

# Interactive Q&A
answer = courtroom.ask_question(
    "Why was eval() considered critical?",
    # Pass whatever context the question needs; further keys are omitted here.
    context={"evidence": "...", "verdict": "..."},
)
```

---

## 🔧 Production Features

| Feature                        | Details                                                                                 |
| ------------------------------ | --------------------------------------------------------------------------------------- |
| **4 Custom Tools**             | FileReader, PatternSearch, CodeGraphQuery, FindingContext, so agents actively investigate |
| **8 Specialized Agents**       | 3 investigators, prosecutor, defense, rebuttal, judge, verdict report, expert witness   |
| **ReACT Engine**               | Custom Reason-Act-Observe loop via LiteLLM function calling with GLM-5                  |
| **Code Dependency Graph**      | AST-based (Python + JS), with call-chain tracing and impact analysis                    |
| **Parallel Evidence Scanning** | ThreadPoolExecutor for GritQL patterns, 4x faster than sequential                       |
| **Rate-Limit Resilience**      | Exponential backoff retry on 429 errors, survives API rate limits                       |
| **Pipeline Persistence**       | State saved to JSON, runs can resume after interruption                                 |
| **Deduplication**              | Same file+line merged into one finding with multiple categories                         |
| **Zip Safety**                 | Zip-slip attack prevention                                                              |
| **Streaming UI**               | Real-time pipeline progress in Gradio Chatbot with phase indicators                     |
| **Export**                     | Markdown and PDF report generation                                                      |
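
For the Zip Safety row, a common way to guard against zip-slip is to resolve every member path against the extraction root before extracting anything. The sketch below assumes that approach; `safe_extract` is a hypothetical helper, not the project's function.

```python
import zipfile
from pathlib import Path


def safe_extract(archive: zipfile.ZipFile, dest: str) -> None:
    """Extract `archive` into `dest`, rejecting members that would escape it."""
    root = Path(dest).resolve()
    for member in archive.namelist():
        # Resolve the would-be output path, collapsing any ".." components.
        target = (root / member).resolve()
        if not target.is_relative_to(root):  # Path.is_relative_to: Python 3.9+
            raise ValueError(f"unsafe path in archive: {member}")
    # Only extract once every member has been checked.
    archive.extractall(root)
```

Python's own `ZipFile.extractall` already strips leading slashes and `..` components from member names, so this explicit check mainly changes the behavior from silently rewriting a suspicious path to rejecting the whole upload.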

---

## 🧪 Testing

```bash
# Run evidence scan on test fixtures
code-tribunal tests/fixtures/locale/ --evidence-only

# Run Python tests
pytest tests/
```

Test fixtures in `tests/fixtures/locale/` contain deliberately bad Python and JavaScript code with:

- Hardcoded passwords, API keys, AWS secrets, Stripe keys, JWT secrets
- SQL injection via f-strings and template literals
- `eval()`, `pickle.load()`, `os.system()`, `subprocess.call(shell=True)`
- MD5 hashing
- TODO, FIXME, HACK comments

---

<div align="center">

Built for the [Build with GLM 5.1](https://build-with-glm-5-1-challenge.devpost.com) hackathon.

</div>