amine-yagoub commited on
Commit
eecc2a5
Β·
1 Parent(s): 12c5d69

docs: expand README with comprehensive project documentation

Browse files
Files changed (1) hide show
  1. README.md +300 -15
README.md CHANGED
@@ -1,5 +1,5 @@
1
- <<<<<<< HEAD
2
- ---
3
  title: CodeTribunal
4
  emoji: πŸ’»
5
  colorFrom: pink
@@ -8,34 +8,319 @@ sdk: docker
8
  pinned: false
9
  license: mit
10
  short_description: The AI Courtroom That Exposes Bad Freelance Code
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
11
  ---
12
 
13
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
14
- =======
15
- # CodeTribunal
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
16
 
17
- The AI courtroom that exposes bad freelance code.
 
 
 
 
 
 
 
18
 
19
- Multi-agent forensic investigation powered by GLM 5.1. Instead of guessing code quality, CodeTribunal puts it on trial β€” a live-streaming debate where an AI Prosecutor and Defense Attorney clash over real, deterministic technical evidence.
20
 
21
- ## Install
 
 
 
 
 
 
 
 
 
 
 
 
 
 
22
 
23
  ```bash
 
 
 
 
 
24
  pip install -e .
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
25
  ```
26
 
27
- ## Usage
 
 
28
 
29
  ```bash
 
30
  code-tribunal ./path/to/codebase
 
 
 
 
 
 
31
  ```
32
 
33
- ## How it works
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
34
 
35
- 1. **Evidence Gathering** β€” Deterministic scans (security, code smells, hardcoded secrets, TODOs)
36
- 2. **Investigation** β€” GLM 5.1 agents analyze the evidence
37
- 3. **The Trial** β€” Prosecutor and Defense debate in a live-streamed courtroom
38
- 4. **Verdict** β€” The Judge delivers a final ruling
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
39
 
40
  Built for the [Build with GLM 5.1](https://build-with-glm-5-1-challenge.devpost.com) hackathon.
41
- >>>>>>> b4fcdee (feat: Add initial CodeTribunal implementation)
 
 
 
 
1
+ ## <<<<<<< HEAD
2
+
3
  title: CodeTribunal
4
  emoji: πŸ’»
5
  colorFrom: pink
 
8
  pinned: false
9
  license: mit
10
  short_description: The AI Courtroom That Exposes Bad Freelance Code
11
+
12
+ ---
13
+
14
+ # Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
15
+
16
+ <div align="center">
17
+
18
+ # βš–οΈ CodeTribunal
19
+
20
+ ### The AI Courtroom That Exposes Bad Freelance Code
21
+
22
+ **Multi-Agent Forensic Investigation Powered by GLM 5 + GritQL + CrewAI**
23
+
24
+ [![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
25
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
26
+ [![Built for GLM 5.1 Hackathon](https://img.shields.io/badge/Built%20for-GLM%205.1-ff69b4)](https://build-with-glm-5-1-challenge.devpost.com)
27
+
28
+ [How It Works](#-how-it-works) β€’ [Architecture](#-architecture) β€’ [Install](#-install) β€’ [Usage](#-usage) β€’ [Demo](#-demo)
29
+
30
+ </div>
31
+
32
+ ---
33
+
34
+ ## 🎬 The Problem
35
+
36
+ A freelancer delivers code. The client can't tell if it's professional work or a security nightmare. Traditional linters find syntax errors. Code reviews miss architectural flaws. Nobody puts it all together and tells you:
37
+
38
+ > _"This code is negligent, here's exactly why, and here's what it will cost you."_
39
+
40
+ **CodeTribunal does.**
41
+
42
+ Upload a `.zip` of code and watch a full courtroom trial unfold β€” evidence gathering, investigation by specialist agents, a live-streamed debate between an AI Prosecutor and Defense Attorney, and a Judge's verdict with a reputational risk score.
43
+
44
+ ---
45
+
46
+ ## πŸ›οΈ How It Works
47
+
48
+ CodeTribunal runs a **6-phase pipeline**, each building on the last:
49
+
50
+ ### Phase 1: Forensic Evidence (Deterministic β€” No LLM)
51
+
52
+ GritQL scans the entire codebase with **17 forensic patterns** across security and quality domains:
53
+
54
+ | Domain | Patterns | Examples |
55
+ | ----------- | -------- | ---------------------------------------------------------------------------------------- |
56
+ | πŸ”΄ Security | 13 | Hardcoded secrets, `eval()`, SQL injection, `pickle.load()`, `os.system()`, weak hashing |
57
+ | 🟑 Quality | 4 | `TODO`, `FIXME`, `HACK` comments |
58
+
59
+ All scanning is **read-only** (`--dry-run`) and runs in **parallel** across patterns.
60
+
61
+ ### Phase 2: Code Dependency Graph (AST β€” No LLM)
62
+
63
+ Python's `ast` module and regex-based JS parsing build a **lightweight dependency graph**:
64
+
65
+ - Nodes: files, functions, classes, imports
66
+ - Edges: calls, imports, containment, inheritance
67
+ - Enables call-chain tracing: `eval() β†’ handle_request() β†’ app.route()`
68
+
69
+ ### Phase 3: Investigation (3 ReACT Agents + 4 Tools)
70
+
71
+ Three specialist investigators, each running a **genuine ReACT loop** (Reason β†’ Act β†’ Observe β†’ Repeat) using **Z.ai's native function calling** via LiteLLM:
72
+
73
+ | Agent | Tools | Purpose |
74
+ | ---------------------------- | --------------------------------------------------------- | ------------------------------------------ |
75
+ | πŸ›‘οΈ Security Investigator | FileReader, PatternSearch, CodeGraphQuery, FindingContext | Find vulnerabilities, trace attack vectors |
76
+ | πŸ“‹ Quality Investigator | FileReader, FindingContext | Assess technical debt, detect negligence |
77
+ | πŸ—οΈ Architecture Investigator | FileReader, CodeGraphQuery | Analyze structure, trace dependencies |
78
+
79
+ Each agent **autonomously decides which tools to call**, observes the results, and iterates. For example, the Security Investigator might:
80
+
81
+ 1. Call `file_reader` to read a flagged file
82
+ 2. Observe hardcoded secrets on specific lines
83
+ 3. Call `code_graph_query` to trace where those secrets are used
84
+ 4. Produce a detailed report with file paths, line numbers, and severity ratings
85
+
86
+ **Verified working**: GLM-5 + LiteLLM function calling confirmed. Agents make real tool calls that execute real code analysis.
87
+
88
+ ### Phase 4: The Trial (3 Agents)
89
+
90
+ A courtroom debate between AI agents:
91
+
92
+ 1. **βš–οΈ The Prosecutor** β€” builds the case for negligence, cites specific evidence
93
+ 2. **πŸ›‘οΈ The Defense Attorney** β€” challenges claims, argues context and proportionality
94
+ 3. **βš–οΈ Rebuttal** β€” the prosecutor responds to the defense
95
+
96
+ Agents use CrewAI's `context` parameter to chain arguments: prosecution output feeds into defense context, both feed into rebuttal.
97
+
98
+ ### Phase 5: The Verdict
99
+
100
+ **πŸ”¨ The Judge** reviews all evidence, investigation reports, and the full trial transcript. Delivers:
101
+
102
+ - Overall ruling: GUILTY / MIXED / NOT GUILTY
103
+ - Reputational Risk Score (0-100)
104
+ - Findings summary with severity rankings
105
+
106
+ ### Phase 6: Structured Report
107
+
108
+ **πŸ“ Verdict Report Agent** compiles everything into a professional report:
109
+
110
+ - Executive Summary
111
+ - Findings Table (sorted by severity)
112
+ - Per-Finding Analysis (impact, remediation, estimated fix effort)
113
+ - Sentencing Recommendations
114
+
115
  ---
116
 
117
+ ## πŸ—οΈ Architecture
118
+
119
+ ```
120
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
121
+ β”‚ Gradio UI β”‚
122
+ β”‚ + Export β”‚
123
+ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
124
+ β”‚
125
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
126
+ β”‚ Pipeline Engine β”‚
127
+ β”‚ State Β· Persistence β”‚
128
+ β”‚ Cancel Β· Resume β”‚
129
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
130
+ β”‚
131
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
132
+ β–Ό β–Ό β–Ό β–Ό β–Ό
133
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”
134
+ β”‚Evidence β”‚ β”‚Code β”‚ β”‚Invest. β”‚ β”‚ Trial β”‚ β”‚Reportβ”‚
135
+ β”‚ Scanner β”‚ β”‚Graph β”‚ β”‚ Agents β”‚ β”‚ Agents β”‚ β”‚Agent β”‚
136
+ β”‚(GritQL) β”‚ β”‚(AST) β”‚ β”‚+ Tools β”‚ β”‚ β”‚ β”‚ β”‚
137
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”˜
138
+ β”‚ β”‚ β”‚ β”‚ β”‚
139
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
140
+ β”‚
141
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
142
+ β”‚ Custom Tool Layer β”‚
143
+ β”‚ FileReader Β· Pattern β”‚
144
+ β”‚ CodeGraph Β· Context β”‚
145
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
146
+ ```
147
+
148
+ ### Key Design Decisions
149
 
150
+ | Decision | Why |
151
+ | ------------------------------------- | ------------------------------------------------------------------------------------------- |
152
+ | **Agents have tools, not text dumps** | Agents read files, search patterns, and trace calls on demand β€” scales to any codebase size |
153
+ | **ReACT loop via LiteLLM** | Direct function calling with GLM-5 β€” bypasses CrewAI's unreliable tool routing |
154
+ | **Pipeline state persisted to JSON** | Runs can resume after crashes. State is queryable |
155
+ | **GritQL for evidence** | AST-level pattern matching, not regex. Language-aware, precise |
156
+ | **Custom CrewAI tools (BaseTool)** | Pydantic-validated inputs, proper error handling, CrewAI-native integration |
157
+ | **Rate-limit retry with backoff** | Exponential backoff (4s β†’ 64s) on Z.ai 429 errors β€” pipeline survives API spikes |
158
 
159
+ ---
160
 
161
+ ## 🧰 Tech Stack
162
+
163
+ | Component | Technology | Purpose |
164
+ | -------------------- | ------------------------ | ---------------------------------------------------- |
165
+ | **LLM** | GLM 5 via Z.ai (LiteLLM) | Agent reasoning and debate |
166
+ | **Code Scanning** | GritQL | Deterministic AST-level pattern matching |
167
+ | **Multi-Agent** | CrewAI 1.12 | Agent orchestration, task chaining, context handoffs |
168
+ | **Function Calling** | LiteLLM | Direct ReACT loop with GLM-5 tool calling |
169
+ | **Code Graph** | Python `ast` + regex | Dependency graph (Python + JS) |
170
+ | **UI** | Gradio 6 | Streaming chatbot, file upload, export |
171
+ | **Export** | fpdf2 | PDF report generation |
172
+
173
+ ---
174
+
175
+ ## πŸ“¦ Install
176
 
177
  ```bash
178
+ # Clone
179
+ git clone https://github.com/amineyagoub/CodeTribunal.git
180
+ cd CodeTribunal
181
+
182
+ # Install dependencies
183
  pip install -e .
184
+
185
+ # Install GritQL CLI
186
+ npm install -g @getgrit/cli
187
+
188
+ # Configure
189
+ cp .env.example .env
190
+ # Edit .env: set ZAI_API_KEY
191
+ ```
192
+
193
+ ### Requirements
194
+
195
+ - Python 3.11+
196
+ - Node.js (for GritQL CLI)
197
+ - Z.ai API key ([get one here](https://open.bigmodel.cn/))
198
+
199
+ ---
200
+
201
+ ## πŸš€ Usage
202
+
203
+ ### Web UI (Recommended)
204
+
205
+ ```bash
206
+ python3 -m code_tribunal.app
207
  ```
208
 
209
+ Open http://localhost:7860, upload a `.zip` of code, and watch the trial unfold.
210
+
211
+ ### CLI
212
 
213
  ```bash
214
+ # Full trial
215
  code-tribunal ./path/to/codebase
216
+
217
+ # Evidence only (no LLM, fast)
218
+ code-tribunal ./path/to/codebase --evidence-only
219
+
220
+ # Save results to JSON
221
+ code-tribunal ./path/to/codebase --output report.json
222
  ```
223
 
224
+ ### Python API
225
+
226
+ ```python
227
+ from code_tribunal.config import TribunalConfig
228
+ from code_tribunal.courtroom import Courtroom
229
+ from code_tribunal.pipeline import Phase
230
+
231
+ config = TribunalConfig()
232
+ courtroom = Courtroom(config)
233
+
234
+ for event in courtroom.run("./path/to/code"):
235
+ print(f"[{event.phase.value}] {event.status}")
236
+
237
+ # Interactive Q&A
238
+ answer = courtroom.ask_question(
239
+ "Why was eval() considered critical?",
240
+ context={"evidence": "...", "verdict": "...", ...}
241
+ )
242
+ ```
243
+
244
+ ---
245
 
246
+ ## πŸ”§ Production Features
247
+
248
+ | Feature | Details |
249
+ | ------------------------------ | --------------------------------------------------------------------------------------- |
250
+ | **4 Custom Tools** | FileReader, PatternSearch, CodeGraphQuery, FindingContext β€” agents actively investigate |
251
+ | **8 Specialized Agents** | 3 investigators, prosecutor, defense, rebuttal, judge, verdict report, expert witness |
252
+ | **ReACT Engine** | Custom Reason-Act-Observe loop via LiteLLM function calling with GLM-5 |
253
+ | **Code Dependency Graph** | AST-based (Python + JS), with call-chain tracing and impact analysis |
254
+ | **Parallel Evidence Scanning** | ThreadPoolExecutor for GritQL patterns β€” 4x faster than sequential |
255
+ | **Rate-Limit Resilience** | Exponential backoff retry on 429 errors β€” survives API rate limits |
256
+ | **Pipeline Persistence** | State saved to JSON, runs can resume after interruption |
257
+ | **Deduplication** | Same file+line merged into one finding with multiple categories |
258
+ | **Zip Safety** | Zip-slip attack prevention |
259
+ | **Streaming UI** | Real-time pipeline progress in Gradio Chatbot with phase indicators |
260
+ | **Export** | Markdown and PDF report generation |
261
+
262
+ ---
263
+
264
+ ## πŸ§ͺ Testing
265
+
266
+ ```bash
267
+ # Run evidence scan on test fixtures
268
+ code-tribunal tests/fixtures/locale/ --evidence-only
269
+
270
+ # Run Python tests
271
+ pytest tests/
272
+ ```
273
+
274
+ Test fixtures in `tests/fixtures/locale/` contain deliberately bad Python and JavaScript code with:
275
+
276
+ - Hardcoded passwords, API keys, AWS secrets, Stripe keys, JWT secrets
277
+ - SQL injection via f-strings and template literals
278
+ - `eval()`, `pickle.load()`, `os.system()`, `subprocess.call(shell=True)`
279
+ - MD5 hashing
280
+ - TODO, FIXME, HACK comments
281
+
282
+ ---
283
+
284
+ ## πŸ“ Project Structure
285
+
286
+ ```
287
+ CodeTribunal/
288
+ β”œβ”€β”€ src/code_tribunal/
289
+ β”‚ β”œβ”€β”€ config.py # Centralized configuration
290
+ β”‚ β”œβ”€β”€ evidence.py # GritQL forensic scanning (17 patterns)
291
+ β”‚ β”œβ”€β”€ code_graph.py # AST code dependency graph
292
+ β”‚ β”œβ”€β”€ tools.py # 4 custom CrewAI tools
293
+ β”‚ β”œβ”€β”€ agents.py # 8 agent definitions
294
+ β”‚ β”œβ”€β”€ react.py # ReACT engine (LiteLLM function calling)
295
+ β”‚ β”œβ”€β”€ courtroom.py # 6-phase pipeline orchestrator
296
+ β”‚ β”œβ”€β”€ pipeline.py # State machine + persistence
297
+ β”‚ β”œβ”€β”€ app.py # Gradio UI + export
298
+ β”‚ └── cli.py # CLI entry point
299
+ β”œβ”€β”€ tests/
300
+ β”‚ β”œβ”€β”€ fixtures/locale/ # Deliberately bad code samples
301
+ β”‚ └── bad_code.zip # Zip fixture for UI testing
302
+ β”œβ”€β”€ assets/
303
+ β”‚ └── logo.png # 3D courtroom logo
304
+ └── pyproject.toml
305
+ ```
306
+
307
+ ---
308
+
309
+ ## πŸŽ“ What Makes This a Strong Hackathon Entry
310
+
311
+ 1. **System Complexity** β€” 6-phase pipeline with 8 agents, 4 custom tools, code graph, and streaming
312
+ 2. **Effective Tool Use** β€” Agents use BaseTool with Pydantic schemas to read files, search patterns, and trace calls
313
+ 3. **Context Handoffs** β€” CrewAI `context` parameter chains prosecution β†’ defense β†’ rebuttal
314
+ 4. **Custom ReACT Engine** β€” Direct LiteLLM function calling with GLM-5 for reliable tool use
315
+ 5. **Deterministic + AI** β€” GritQL provides ground-truth evidence, agents provide interpretation and debate
316
+ 6. **Resilient** β€” Rate-limit retry, pipeline persistence, error recovery
317
+
318
+ ---
319
+
320
+ <div align="center">
321
 
322
  Built for the [Build with GLM 5.1](https://build-with-glm-5-1-challenge.devpost.com) hackathon.
323
+
324
+ > > > > > > > b4fcdee (feat: Add initial CodeTribunal implementation)
325
+
326
+ </div>