ps2181 committed on
Commit
3391ffe
·
1 Parent(s): bbe2575

Rewrite README: blog-style with animations, 10 tasks, new endpoints

Files changed (1)
  1. README.md +200 -395
README.md CHANGED
@@ -1,522 +1,327 @@
1
- <div class="card">
2
- <div class="card-header">
3
- <div class="card-header-dot"></div>
4
- <span class="card-header-title"></span>
5
- </div>
6
- <!-- yaml rows + tag rows + footer badges -->
7
- </div>
8
- <div align="center">
 
 
 
 
 
 
 
9
 
10
- <!-- Animated header banner -->
11
- <img src="https://capsule-render.vercel.app/api?type=waving&color=gradient&customColorList=6,11,20&height=200&section=header&text=Invoice%20Processing%20Pipeline&fontSize=40&fontColor=fff&animation=twinkling&fontAlignY=35&desc=Self-Improving%20Multi-Agent%20Fraud%20Detection%20%7C%20OpenEnv%20%2B%20GRPO%20%2B%20Qwen2.5&descAlignY=55&descSize=16" width="100%"/>
12
-
13
- <!-- Badges row 1 -->
14
- <p>
15
- <a href="https://ps2181-invoice-processing-pipeline.hf.space/web">
16
- <img src="https://img.shields.io/badge/🚀%20Live%20Demo-HuggingFace%20Spaces-FF9D00?style=for-the-badge&logo=huggingface&logoColor=white" />
17
- </a>
18
- <a href="https://colab.research.google.com/drive/1C1_3giNt-NmbzKNFJr5_L1fms3L8LfmB">
19
- <img src="https://img.shields.io/badge/Training%20Colab-Open%20Notebook-F9AB00?style=for-the-badge&logo=googlecolab&logoColor=white" />
20
- </a>
21
- <a href="https://ps2181-invoice-processing-pipeline.hf.space/docs">
22
- <img src="https://img.shields.io/badge/API%20Docs-FastAPI-009688?style=for-the-badge&logo=fastapi&logoColor=white" />
23
- </a>
24
- </p>
25
-
26
- <!-- Badges row 2 -->
27
- <p>
28
- <img src="https://img.shields.io/badge/Framework-OpenEnv-1A356E?style=for-the-badge" />
29
- <img src="https://img.shields.io/badge/Model-Qwen2.5--1.5B%20+%20LoRA%20r%3D16-8B1A4E?style=for-the-badge" />
30
- <img src="https://img.shields.io/badge/Training-GRPO%20+%20Unsloth-00A67E?style=for-the-badge" />
31
- <img src="https://img.shields.io/badge/Agents-5%20Adversarial-E44D26?style=for-the-badge" />
32
- </p>
33
-
34
- <!-- Badges row 3 -->
35
- <p>
36
- <img src="https://img.shields.io/badge/Tasks-7%20Progressive-6C3483?style=for-the-badge" />
37
- <img src="https://img.shields.io/badge/Deployment-Docker%20%7C%20HF%20Spaces-0D1117?style=for-the-badge&logo=docker" />
38
- <img src="https://img.shields.io/badge/License-MIT-green?style=for-the-badge" />
39
- <img src="https://img.shields.io/badge/Hackathon-Meta%20PyTorch%202026-FF6B35?style=for-the-badge" />
40
- </p>
41
 
42
- <br/>
 
43
 
44
- > **Meta PyTorch OpenEnv Hackathon Grand Finale · April 25–26, 2026**
45
- >
46
- > Team: **Pritam Satpathy** & **Gnana Nawin T** · Scaler School of Technology, Bangalore
47
 
48
  <br/>
49
 
50
- <!-- Animated typing headline -->
51
- <a href="https://git.io/typing-svg">
52
- <img src="https://readme-typing-svg.demolab.com?font=Fira+Code&weight=600&size=22&pause=1000&color=007A87&center=true&vCenter=true&width=750&lines=5-Agent+Adversarial+Fraud+Detection+System;Self-Improving+via+Cross-Episode+Regulator;GRPO-Trained+LoRA+Agents+on+Live+Environment;Invoice+%E2%86%92+Extract+%E2%86%92+Audit+%E2%86%92+Approve+%E2%86%92+Improve" alt="Typing SVG" />
53
- </a>
 
54
 
55
  </div>
56
 
57
  ---
58
 
59
- ## 🔥 What Makes This Different
60
 
61
- > Most multi-agent systems are **static pipelines**. Ours **gets harder for itself over time**.
62
 
63
- The system contains a **Predictive Regulator**, a cross-episode meta-agent that monitors the Auditor across 30 rolling episodes, detects fraud types it systematically fails on (**blind spots**), and **automatically biases the Generator** to produce more of exactly those fraud types. No human intervention. No manual curriculum design. The system pressure-tests its own weakest point, every single episode.
64
 
65
  <div align="center">
66
- <img width="1462" height="731" alt="image" src="https://github.com/user-attachments/assets/7d863b87-1921-45f5-8d94-a06ba3ed6fc1" />
67
  </div>
68
 
69
  ---
70
 
71
- ## Three Novel Features
72
-
73
- <table>
74
- <tr>
75
- <td width="33%" align="center">
76
-
77
- ### 🔮 Predictive Regulator
78
-
79
- Computes **trend slope** over 5-episode windows.<br/>Warns of *emerging* blind spots **before** detection rates cross the critical threshold — proactive oversight, not reactive retraining.
80
-
81
- `+0.15 early-warning bonus`
82
-
83
- </td>
84
- <td width="33%" align="center">
85
-
86
- ### 🧩 Compound Fraud
87
 
88
- Invoices carry **two fraud signals simultaneously** (e.g. phantom vendor + price gouging).<br/>Partial credit `+0.65` for catching one; full reward `+0.99` for both.
89
-
90
- Prevents single-signal heuristics.
91
-
92
- </td>
93
- <td width="33%" align="center">
94
-
95
- ### 📊 Confidence Calibration
96
-
97
- Tracks `(confidence, correct?)` pairs per fraud type.<br/>Detects **overconfident misses** — the Auditor saying "90% sure, approved" on fraud — the most dangerous real-world failure mode.
98
-
99
- </td>
100
- </tr>
101
- </table>
102
-
103
- ---
104
-
105
- ## 🤖 Five Agents, One Closed Loop
106
 
107
  <div align="center">
108
 
109
  | Agent | Role | Reward Signal |
110
  |:---:|:---|:---|
111
- | 🏭 **Generator** | Creates clean or fraudulent invoices, biased by Regulator blind-spot weights | `+0.85` evades Auditor + Approver · `+0.60` evades Auditor only · `+0.10` caught |
112
- | 🔍 **Extractor** | Parses raw OCR invoice text into structured JSON | 4 independent signals: format `0.10` · field accuracy `0.40` · math `0.25` · completeness `0.25` |
113
- | 🕵️ **Auditor** | Classifies each invoice with fraud type + confidence score | `+0.99` correct type · `+0.90` clean clearance · `+0.65` compound (one caught) · `+0.01` miss/FP |
114
- | **Approver** | Final approve / escalate / reject (rule-based, confidence-gated) | `0.80` confidence reject · `0.50–0.80` escalate · approved approve |
115
- | 🧠 **Regulator** | Cross-episode meta-agent: 30-episode rolling window, blind-spot tracker | Precision `0.35` + Recall `0.35` + No over-flagging `0.15` + Early warning `0.15` |
116
 
117
  </div>
118
 
119
  ---
120
 
121
- ## 🎯 Seven Tasks — Progressive Difficulty
122
-
123
- | # | Task | Difficulty | What the Agent Must Do |
124
- |:---:|:---|:---:|:---|
125
- | 1 | `easy` | 🟢 Easy | Extract `vendor`, `date`, `currency`, `total`, `line_items` from a single clean invoice |
126
- | 2 | `medium` | 🟡 Medium | Clean & normalise a batch: fix date format chaos, vendor typos, currency symbol pollution |
127
- | 3 | `hard` | 🟠 Hard | Extract + reconcile against purchase orders — flag overcharges, extra items, missing items |
128
- | 4 | `expert` | 🔴 Expert | Fraud audit using vendor registry, market prices, and invoice history — classify fraud type exactly |
129
- | 5 | `adversarial` | 🟠 Hard | Ignore SUBTOTAL trap + fake TAX/ADJUSTMENT + FX noise; OCR-corrupted vendor labels |
130
- | 6 | `negotiate` | 🟡 Medium | Ask clarification questions `{"question": "..."}` then extract; `+15%` bonus for ≤2 questions |
131
- | 7 | `supply_chain` | 🔴 Expert | Detect `quantity_shortfall`, `price_spike`, `unauthorized_substitution`, `phantom_delivery` |
132
-
133
- ---
134
-
135
- ## 🧠 Trained LoRA Agents
136
-
137
- All three generative agents trained with **GRPO on live environment data** — the HF Space `/grader` endpoint *is* the reward function during training.
138
 
139
  <div align="center">
140
 
141
- | Agent | Base Model | LoRA Config | HuggingFace Hub |
142
- |:---:|:---|:---:|:---|
143
- | 🔍 Extractor | Qwen2.5-1.5B-Instruct | r=16, α=16, 4-bit QLoRA | [ps2181/extractor-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/extractor-lora-qwen2.5-1.5b) |
144
- | 🕵️ Auditor | Qwen2.5-1.5B-Instruct | r=16, α=16, 4-bit QLoRA | [ps2181/auditor-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/auditor-lora-qwen2.5-1.5b) |
145
- | 🏭 Generator | Qwen2.5-1.5B-Instruct | r=16, α=16, 4-bit QLoRA | [ps2181/generator-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/generator-lora-qwen2.5-1.5b) |
146
 
147
  </div>
148
 
149
- **LoRA target modules:** `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
150
-
151
  ---
152
 
153
- ## 📈 Training Results
154
-
155
- ### Extractor — GRPO Training Progress
156
-
157
- The model learned to extract structured JSON from noisy invoice text via **reinforcement learning with 4 independent reward signals**, scoring directly against the live environment grader.
158
-
159
- | Step | Total Reward | Env Score | Format | Math Consistency |
160
- |:---:|:---:|:---:|:---:|:---:|
161
- | 10 | 2.361 | 0.113 | 0.900 | 0.347 |
162
- | 20 | 2.595 | 0.282 | 0.900 | 0.413 |
163
- | 30 | 2.657 | 0.304 | **0.950** | 0.403 |
164
-
165
- > 📊 **Environment score: `0.113 → 0.304` in 30 steps — a 169% improvement** in live-graded extraction accuracy.
166
 
167
- ### 🔍 Reward Hacking Caught in Training
168
-
169
- At step 10, we observed the model achieving `math_consistency = 0.97` and `completeness = 1.0` while `field_accuracy = 0.00` — it had learned to output **arithmetically-consistent JSON with entirely hallucinated values**.
170
 
171
- Our 4 **independent** reward signals made this visible immediately. A single aggregated reward would have never surfaced this.
 
 
 
 
 
 
 
 
 
 
 
172
 
173
- ```
174
- Step 10 — Reward Hacking Detected:
175
- format: 0.10 ✅
176
- math_consistency: 0.97 ✅ ← model gaming this signal
177
- completeness: 1.00 ✅ ← model gaming this signal
178
- field_accuracy: 0.00 ❌ ← hallucinating all values
179
-
180
- Action: adjusted training emphasis on field_accuracy weight
181
- Result: field_accuracy climbed to 0.30+ by step 30
182
- ```
183
 
184
- This is exactly why multiple independent reward signals matter and why we have 4.
185
 
186
  ---
187
 
188
- ## 🎁 Reward Architecture
189
 
190
- ### Extractor — 4 Independent Signals
191
 
192
  ```python
193
- def reward_format(extracted) -> float: # weight 0.10
194
- """Are all 5 required JSON keys present?"""
195
-
196
- def reward_field_accuracy(extracted, gt) -> float: # weight 0.40
197
- """Do vendor / date / currency / total match ground truth?"""
198
 
199
- def reward_math_consistency(extracted) -> float: # weight 0.25
200
- """Does qty × unit_price = amount for every line item?"""
201
-
202
- def reward_completeness(extracted, gt) -> float: # weight 0.25
203
- """Recall: what fraction of expected line items are present?"""
204
-
205
- # All rewards clamped to (0.01, 0.99) — no log(0), no gradient collapse
206
  ```
207
 
208
- ### Auditor Reward
 
 
209
 
210
  | Outcome | Reward | Why |
211
  |:---|:---:|:---|
212
- | Correct fraud type detected | **0.99** | Incentivises precise classification, not just binary flagging |
213
- | Clean invoice correctly approved | **0.90** | High reward keeps false-positive rate low |
214
- | Compound fraud — one of two types caught | **0.65** | Partial credit prevents cliff on hard cases |
215
- | Fraud flagged but wrong type | **0.50** | Penalises sloppiness; rewards catching *something* |
216
  | Miss or false positive | **0.01** | Near-zero punishes both failure modes symmetrically |
217
 
218
- ### Generator Reward (Adversarial Self-Play)
219
-
220
- | Outcome | Reward |
221
- |:---|:---:|
222
- | Fraud evades **both** Auditor and Approver | **0.85** |
223
- | Auditor misses, Approver catches | **0.60** |
224
- | Auditor catches it | **0.10** |
225
 
226
- ### Regulator Reward
227
 
228
  ```
229
  Total = Precision(0.35) + Recall(0.35) + No-over-flagging(0.15) + Early-warning-bonus(0.15)
230
  ```
231
 
 
 
232
  ---
233
 
234
- ## 🦺 Five Fraud Types
 
 
235
 
236
  <div align="center">
237
 
238
- | Type | Detection Method | Example |
239
- |:---|:---|:---|
240
- | 🏚️ `phantom_vendor` | Vendor not in Approved Vendor Registry | "QuickSupply Hub" is not in the approved list |
241
- | 💸 `price_gouging` | Unit price > 150% of market ceiling | Laptop at $2,800 when market max is $1,299 |
242
- | `math_fraud` | Invoice total ≠ sum of line items | Total $5,200 when items sum to $4,400 |
243
- | 📋 `duplicate_submission` | Same invoice_id or vendor+date+total already in history | INV-83221 submitted twice |
244
- | 🔀 `compound_fraud` | Two fraud signals in one invoice | Phantom vendor **AND** price gouging simultaneously |
245
 
246
  </div>
247
 
248
- ---
249
 
250
- ## 🌍 The Regulator in Action
251
 
252
- After each episode, the Regulator publishes a report that the Generator reads to bias its next batch:
253
 
254
- ```
255
- GET /regulator/report
256
-
257
- {
258
- "total_audits_recorded": 20,
259
- "detection_rates": {
260
- "phantom_vendor": "31% ⚠ BLIND SPOT (-0.08↓)",
261
- "price_gouging": "74% ✓ OK (+0.03↑)",
262
- "math_fraud": "81% ✓ OK (+0.01↑)",
263
- "duplicate_submission": "62% ⚡ EMERGING (-0.02↓)"
264
- },
265
- "false_positive_rate": "12% ✓ OK",
266
- "blind_spots": ["phantom_vendor"],
267
- "emerging_blind_spots": ["duplicate_submission"],
268
- "generator_weights": {
269
- "phantom_vendor": 0.30, ← 3× upweighted (blind spot)
270
- "duplicate_submission": 0.20, ← 2× upweighted (emerging)
271
- "price_gouging": 0.125,
272
- "math_fraud": 0.125,
273
- "compound_fraud": 0.10
274
- },
275
- "verdict": "Recommend retraining on: phantom_vendor"
276
- }
277
- ```
278
 
279
- ---
280
 
281
- ## 🚀 Quick Start
282
 
283
- ### Try the Live Demo
 
 
 
 
284
 
285
- ```bash
286
- # Health check
287
- curl https://ps2181-invoice-processing-pipeline.hf.space/health
288
 
289
- # List all 7 tasks with schemas
290
- curl https://ps2181-invoice-processing-pipeline.hf.space/tasks
291
 
292
- # Start a single-agent episode
293
- curl -X POST https://ps2181-invoice-processing-pipeline.hf.space/reset \
294
- -H "Content-Type: application/json" \
295
- -d '{"task_id": "easy"}'
296
-
297
- # Submit an extraction (replace EPISODE_ID from reset response)
298
- curl -X POST https://ps2181-invoice-processing-pipeline.hf.space/step \
299
- -H "Content-Type: application/json" \
300
- -d '{
301
- "episode_id": "EPISODE_ID",
302
- "extracted_data": {
303
- "vendor": "Acme Corp",
304
- "date": "2024-08-15",
305
- "currency": "USD",
306
- "total": 2374.93,
307
- "line_items": [
308
- {"description": "Laptop Computer", "qty": 2, "unit_price": 1099.99, "amount": 2199.98},
309
- {"description": "Wireless Mouse", "qty": 5, "unit_price": 34.99, "amount": 174.95}
310
- ]
311
- }
312
- }'
313
- ```
314
 
315
- ### Run the Multi-Agent Pipeline
 
 
 
 
 
 
 
316
 
317
- ```bash
318
- # Step 1 — Start 5-agent episode (Generator biased by Regulator)
319
- curl -X POST https://ps2181-invoice-processing-pipeline.hf.space/multi/reset
320
 
321
- # Step 2 — Score Extractor output (4 signals)
322
- curl -X POST https://ps2181-invoice-processing-pipeline.hf.space/multi/extract \
323
- -H "Content-Type: application/json" \
324
- -d '{"episode_id": "EP_ID", "extracted_data": {...}}'
325
-
326
- # Step 3 — Score Auditor output (updates 30-episode tracker)
327
- curl -X POST https://ps2181-invoice-processing-pipeline.hf.space/multi/audit \
328
- -H "Content-Type: application/json" \
329
- -d '{"episode_id": "EP_ID", "audit_results": [
330
- {"invoice_id": "INV-83221", "verdict": "flagged",
331
- "fraud_type": "phantom_vendor", "confidence": 0.87}
332
- ]}'
333
-
334
- # Step 4 — Run Approver, compute Generator adversarial reward
335
- curl -X POST https://ps2181-invoice-processing-pipeline.hf.space/multi/approve \
336
- -H "Content-Type: application/json" \
337
- -d '{"episode_id": "EP_ID"}'
338
-
339
- # Check Regulator state anytime
340
- curl https://ps2181-invoice-processing-pipeline.hf.space/regulator/report
341
- curl https://ps2181-invoice-processing-pipeline.hf.space/regulator/forecast
342
- curl https://ps2181-invoice-processing-pipeline.hf.space/regulator/calibration
343
- ```
344
 
345
- ### Run Training (Google Colab)
346
 
347
- [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1C1_3giNt-NmbzKNFJr5_L1fms3L8LfmB)
348
 
349
- The training loop connects **directly** to the live HF Space environment:
350
 
351
- ```
352
- Colab → /reset (fresh synthetic invoice) → model generates JSON
353
- /grader (scores vs ground truth) → GRPO weight update
354
- repeat 200 steps
355
- ```
 
 
356
 
357
  ---
358
 
359
- ## 🗂️ Repository Structure
360
 
361
  ```
362
- invoice-processing-pipeline/
363
-
364
- ├── server/
365
- │ ├── app.py # FastAPI — 18 endpoints
366
- │ ├── environment.py # 7 tasks · graders · dynamic difficulty
367
- │ ├── multi_agent_environment.py # 5-agent system + AuditorPerformanceTracker
368
- │ ├── agents.py # Lazy-loading LoRA inference wrappers
369
- │ └── web_ui.py # Gradio UI (mounted at /web)
370
-
371
- ├── models.py # Pydantic: Action · Observation · State
372
- ├── inference.py # Standalone inference helper
373
- ├── client.py # OpenEnv-compatible Python client
374
-
375
- ├── extractor_training_grpo.ipynb # Extractor GRPO training (Unsloth + TRL)
376
- ├── auditor_grpo_training.ipynb # Auditor GRPO training
377
- ├── generator_grpo_training.ipynb # Generator GRPO training
378
-
379
- ├── openenv.yaml # OpenEnv manifest (all 7 tasks declared)
380
- ├── Dockerfile # HF Spaces Docker (port 7860, non-root UID 1000)
381
- ├── pyproject.toml # Project metadata + dependencies
382
- ├── requirements.txt # Runtime dependencies
383
- ├── validate-submission.sh # Submission validator script
384
-
385
- ├── ROUND2_PROBLEM_STATEMENT.md # Full problem statement + reward design rationale
386
- └── BLOG_DRAFT.md # HuggingFace blog post draft
 
 
 
 
 
 
 
 
 
387
  ```
388
 
389
  ---
390
 
391
- ## 🔌 API Reference
392
 
393
  ### Core OpenEnv
394
 
395
  | Endpoint | Method | Description |
396
  |:---|:---:|:---|
397
- | `/health` | `GET` | Health check; returns `{"status": "ok", "active_sessions": N}` |
398
- | `/tasks` | `GET` | All 7 tasks with descriptions, max_attempts, action/observation schemas |
399
- | `/reset` | `POST` | Start episode `{"task_id": "easy\|medium\|hard\|expert\|adversarial\|negotiate\|supply_chain"}` |
400
- | `/step` | `POST` | Submit extraction → reward + feedback + hint + reward_breakdown |
401
- | `/grader` | `POST` | Score without consuming an attempt (used by training Colab) |
402
- | `/state` | `GET` | Episode metadata — step_count, done, best_reward, full rewards history |
403
- | `/ws` | `WS` | Full episode over WebSocket (OpenEnv standard) |
404
- | `/web` | `GET` | Gradio interactive demo UI |
405
 
406
  ### Multi-Agent
407
 
408
  | Endpoint | Method | Description |
409
  |:---|:---:|:---|
410
- | `/multi/reset` | `POST` | Start 5-agent episode (Generator biased by Regulator weights) |
411
- | `/multi/extract` | `POST` | Score Extractor output (4 signals) |
412
- | `/multi/audit` | `POST` | Score Auditor output, update 30-episode performance tracker |
413
- | `/multi/approve` | `POST` | Run Approver, compute Generator adversarial reward |
414
- | `/multi/state/{id}` | `GET` | Full episode state including all agent scores |
415
 
416
  ### Regulator
417
 
418
  | Endpoint | Method | Description |
419
  |:---|:---:|:---|
420
- | `/regulator/report` | `GET` | Detection rates, blind spots, calibration, generator weights |
421
- | `/regulator/forecast` | `GET` | Predictive trend analysis — critical + emerging blind spots with slopes |
422
- | `/regulator/calibration` | `GET` | Overconfidence / underconfidence per fraud type |
423
- | `/regulator/predict` | `POST` | Score a Regulator blind-spot prediction |
424
- | `/regulator/demo_seed` | `POST` | Seed tracker with realistic demo data |
425
- | `/generator/score` | `POST` | Compute Generator reward given auditor/approver outcomes |
426
 
427
  ---
428
 
429
- ## 🏗️ Tech Stack
430
 
431
- <div align="center">
432
-
433
- | Layer | Technology |
434
- |:---|:---|
435
- | **Environment** | [OpenEnv](https://github.com/meta-pytorch/OpenEnv) · FastAPI · Pydantic v2 |
436
- | **UI** | Gradio 4.x (mounted at `/web`) |
437
- | **Deployment** | Docker · HuggingFace Spaces (vcpu-2 / 8 GB) |
438
- | **Training** | [TRL GRPOTrainer](https://huggingface.co/docs/trl) · [Unsloth](https://github.com/unslothai/unsloth) |
439
- | **Model** | `unsloth/Qwen2.5-1.5B-Instruct` · 4-bit QLoRA · r=16 |
440
- | **Reward** | Live `/grader` endpoint on HF Space as verifier |
441
- | **Session Mgmt** | Thread-safe `OrderedDict` · 200-session cap · LRU eviction |
442
- | **Dynamic Difficulty** | Per-task rolling window (maxlen=10) → adjusts OCR intensity, batch size, discrepancy count |
443
-
444
- </div>
445
 
446
- ---
 
447
 
448
- ## 🔍 Dynamic Difficulty
 
 
449
 
450
- The environment adapts generation parameters to the agent's recent performance:
 
451
 
452
- ```python
453
- if avg_score >= 0.85: # Agent is doing well → harder
454
- n_invoices = (4, 6)
455
- ocr_intensity = 0.55 # heavier corruption
456
- n_discrepancies = (3, 5)
457
- n_anomalies = 3
458
-
459
- elif avg_score < 0.60: # Agent is struggling → easier
460
- n_invoices = (2, 3)
461
- ocr_intensity = 0.15
462
- n_discrepancies = (1, 2)
463
- n_anomalies = 2
464
-
465
- else: # balanced
466
- n_invoices = (3, 5)
467
- ocr_intensity = 0.35
468
- n_discrepancies = (2, 3)
469
  ```
470
 
471
  ---
472
 
473
- ## 🎭 Theme Alignment
474
-
475
- <div align="center">
476
-
477
- | Theme | Alignment | Evidence |
478
- |:---:|:---|:---|
479
- | **#1 Multi-Agent Interactions** | ✅ Core | 5 agents with cooperation, competition, and adversarial self-play |
480
- | **#1 Fleet AI Scalable Oversight** | ✅ Bonus | Regulator monitors Auditor cross-episode — fully autonomous oversight loop |
481
- | **#2 Long-Horizon Planning** | ✅ Partial | `negotiate` task: multi-turn clarification with attempt budget penalty |
482
- | **#3.1 Professional Tasks** | ✅ Core | Invoice + PO + vendor registry + supply chain = real finance operations |
483
- | **#4 Self-Improvement** | ✅ Core | Regulator → Generator bias → harder adversarial batches → Auditor improves |
484
-
485
- </div>
486
-
487
- ---
488
-
489
- ## 👥 Team
490
-
491
- <div align="center">
492
-
493
- | | |
494
- |:---:|:---:|
495
- | **Pritam Satpathy** | **Gnana Nawin T** |
496
- | [🤗 ps2181](https://huggingface.co/ps2181) | [🤗 gnananawin](https://huggingface.co/gnananawin) |
497
- | Scaler School of Technology | Scaler School of Technology |
498
-
499
- **Meta PyTorch OpenEnv Hackathon — Grand Finale · April 25–26, 2026 · Bangalore**
500
-
501
- </div>
502
-
503
- ---
504
-
505
- ## 🔗 All Links
506
 
507
  <div align="center">
508
 
509
- | Resource | Link |
510
  |:---|:---|
511
- | 🚀 **Live Environment** | https://ps2181-invoice-processing-pipeline.hf.space |
512
- | 🖥️ **Gradio Demo UI** | https://ps2181-invoice-processing-pipeline.hf.space/web |
513
- | 📖 **API Documentation** | https://ps2181-invoice-processing-pipeline.hf.space/docs |
514
- | 🤗 **Extractor Model** | https://huggingface.co/ps2181/extractor-lora-qwen2.5-1.5b |
515
- | 🕵️ **Auditor Model** | https://huggingface.co/ps2181/auditor-lora-qwen2.5-1.5b |
516
- | 🏭 **Generator Model** | https://huggingface.co/ps2181/generator-lora-qwen2.5-1.5b |
517
- | 📓 **Training Colab** | https://colab.research.google.com/drive/1C1_3giNt-NmbzKNFJr5_L1fms3L8LfmB |
518
- | 💻 **GitHub** | https://github.com/ps2181/invoice-processing-pipeline |
519
- | 🧩 **OpenEnv Framework** | https://github.com/meta-pytorch/OpenEnv |
520
 
521
  </div>
522
 
@@ -524,8 +329,8 @@ else: # balanced
524
 
525
  <div align="center">
526
 
527
- <img src="https://capsule-render.vercel.app/api?type=waving&color=gradient&customColorList=6,11,20&height=100&section=footer&animation=twinkling" width="100%"/>
528
 
529
- **Built with ❤️ for the Meta PyTorch OpenEnv Hackathon 2026**
530
 
531
  </div>
 
1
+ ---
2
+ title: Invoice Processing Pipeline
3
+ emoji: 🧾
4
+ colorFrom: blue
5
+ colorTo: indigo
6
+ sdk: docker
7
+ app_port: 7860
8
+ tags:
9
+ - openenv
10
+ - multi-agent
11
+ - grpo
12
+ - rlhf
13
+ - fraud-detection
14
+ - invoice
15
+ ---
16
 
17
+ <div align="center">
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
18
 
19
+ # 🧾 Invoice Processing Pipeline
20
+ ### Self-Improving Adversarial Fraud Detection Environment
21
 
22
+ **Meta PyTorch OpenEnv Hackathon · Grand Finale · April 25–26, 2026**
23
+ *Pritam Satpathy & Gnana Nawin T · Scaler School of Technology, Bangalore*
 
24
 
25
  <br/>
26
 
27
+ [![Live Demo](https://img.shields.io/badge/🚀_Live_Demo-HF_Space-yellow)](https://ps2181-invoice-processing-pipeline.hf.space/web)
28
+ [![API Docs](https://img.shields.io/badge/📖_API_Docs-Swagger-blue)](https://ps2181-invoice-processing-pipeline.hf.space/docs)
29
+ [![GitHub](https://img.shields.io/badge/GitHub-invoice--pipeline-black?logo=github)](https://github.com/ps2181/invoice-processing-pipeline)
30
+
31
+ > **Primary theme: #4 Self-Improvement · Secondary: #1 Multi-Agent Interactions**
32
 
33
  </div>
34
 
35
  ---
36
 
37
+ ## The Core Idea
38
 
39
+ > *A system that continuously generates harder challenges targeting its own weakest points.*
40
 
41
+ Most fraud detection pipelines are static. Ours **gets harder for itself over time**: the Regulator finds where the Auditor keeps failing, the Generator exploits those exact blind spots in the next episode, the Auditor's new mistakes update the Regulator, and the loop closes without any human intervention.
42
 
43
  <div align="center">
44
+ <img width="1710" height="326" alt="5-agent loop" src="https://github.com/user-attachments/assets/319654c3-aa24-47e8-9716-734d4e902168" />
45
  </div>
46
 
47
  ---
48
 
49
+ ## 5-Agent Architecture
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
50
 
51
+ ```mermaid
52
+ graph LR
53
+ R[🎯 Regulator\nDetects blind spots\nUpdates weights] -->|bias weights| G[⚡ Generator\nCreates adversarial\ninvoices]
54
+ G -->|raw invoice text| E[🔍 Extractor\nParses structured\nJSON fields]
55
+ E -->|structured data| A[🕵️ Auditor\nFlags fraud with\nconfidence scores]
56
+ A -->|audit results| AP[✅ Approver\nApprove / Escalate\n/ Reject]
57
+ AP -->|episode outcome| R
58
+ A -->|missed fraud types| R
59
+ ```
 
 
 
 
 
 
 
 
 
60
 
61
  <div align="center">
62
 
63
  | Agent | Role | Reward Signal |
64
  |:---:|:---|:---|
65
+ | **🎯 Regulator** | Cross-episode oversight: detects Auditor blind spots, reweights Generator | Precision `0.35` + Recall `0.35` + No over-flagging `0.15` + Early warning `0.15` |
66
+ | **⚡ Generator** | Adversary: creates invoices biased toward blind spots | `+0.85` evades both · `+0.60` evades Auditor · `+0.10` caught |
67
+ | **🔍 Extractor** | Parser: text → structured JSON, scored on 4 independent signals | Format `0.10` · Field accuracy `0.40` · Math `0.25` · Completeness `0.25` |
68
+ | **🕵️ Auditor** | Detector: fraud classification with confidence scores | `+0.99` correct type · `+0.90` clean cleared · `+0.01` miss or FP |
69
+ | **✅ Approver** | Gatekeeper: final approve / escalate / reject | Rule-based on confidence threshold |
70
 
71
  </div>
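
The Generator's three reward tiers in the table above can be read as a small decision function. The following is a minimal sketch, assuming each fraudulent invoice yields two booleans (whether the Auditor caught it and whether the Approver caught it); the function name and signature are hypothetical and are not taken from the repository.

```python
def generator_reward(auditor_caught: bool, approver_caught: bool) -> float:
    """Adversarial reward for one fraudulent invoice.

    Hypothetical helper: the reward values come from the agent table,
    the function shape is an assumption for illustration.
    """
    if auditor_caught:
        return 0.10  # the Auditor flagged the fraud
    if approver_caught:
        return 0.60  # evaded the Auditor, stopped by the Approver
    return 0.85      # evaded both downstream agents


if __name__ == "__main__":
    for outcome in [(True, False), (False, True), (False, False)]:
        print(outcome, generator_reward(*outcome))
```

Checking the Auditor first mirrors the table: once the Auditor catches the invoice, the Approver's decision no longer changes the Generator's reward.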
72
 
73
  ---
74
 
75
+ ## Three Novel Features
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
76
 
77
  <div align="center">
78
 
79
+ | Feature | What it does |
80
+ |:---|:---|
81
+ | **🔮 Predictive Regulator** | Computes trend slopes over 5-episode windows and warns of *emerging* blind spots before they go critical |
82
+ | **🧬 Compound Fraud** | Invoices can carry two simultaneous fraud signals (e.g. phantom vendor + price gouging). Partial credit for catching one; full reward for both |
83
+ | **📊 Confidence Calibration** | Tracks (confidence, correct?) pairs per fraud type. Flags *overconfident misses* — the most dangerous Auditor failure mode |
84
 
85
  </div>
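
Confidence calibration, as described in the table, amounts to storing `(confidence, correct?)` pairs per fraud type and counting the misses that came with high confidence. The sketch below is one possible shape for such a tracker; the class name, the `0.80` overconfidence cutoff, and the gap metric are assumptions for illustration, not the project's actual implementation.

```python
from collections import defaultdict


class CalibrationTracker:
    """Hypothetical tracker of (confidence, correct?) pairs per fraud type."""

    def __init__(self, overconfident_at: float = 0.80):
        # Assumed cutoff: a miss at or above this confidence is "overconfident".
        self.overconfident_at = overconfident_at
        self.pairs = defaultdict(list)

    def record(self, fraud_type: str, confidence: float, correct: bool) -> None:
        self.pairs[fraud_type].append((confidence, correct))

    def overconfident_misses(self, fraud_type: str) -> int:
        """Count wrong verdicts that were issued with high confidence."""
        return sum(
            1
            for conf, ok in self.pairs.get(fraud_type, [])
            if not ok and conf >= self.overconfident_at
        )

    def calibration_gap(self, fraud_type: str) -> float:
        """Mean confidence minus accuracy; positive means overconfident."""
        pairs = self.pairs.get(fraud_type, [])
        if not pairs:
            return 0.0
        mean_conf = sum(conf for conf, _ in pairs) / len(pairs)
        accuracy = sum(1 for _, ok in pairs if ok) / len(pairs)
        return mean_conf - accuracy


if __name__ == "__main__":
    tracker = CalibrationTracker()
    tracker.record("phantom_vendor", 0.90, False)
    tracker.record("phantom_vendor", 0.85, True)
    tracker.record("phantom_vendor", 0.60, False)
    print(tracker.overconfident_misses("phantom_vendor"))
```

A positive gap means the Auditor claims more confidence than its accuracy supports for that fraud type.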
86
 
 
 
87
  ---
88
 
89
+ ## 10 Tasks — Progressive Curriculum
 
 
 
 
 
 
 
 
 
 
 
 
90
 
91
+ <div align="center">
 
 
92
 
93
+ | # | Task | What the Agent Faces | Difficulty |
94
+ |:---:|:---|:---|:---:|
95
+ | 1 | `easy` | Single clean invoice — extract 5 fields | Easy |
96
+ | 2 | `medium` | Batch with date chaos, vendor typos, currency noise | Medium |
97
+ | 3 | `hard` | Extraction + PO reconciliation — flag overcharges, missing items | Hard |
98
+ | 4 | `expert` | Full fraud audit across all four fraud types | Expert |
99
+ | 5 | `adversarial` | OCR corruption, SUBTOTAL traps, fake TAX/FX noise lines | Expert |
100
+ | 6 | `negotiate` | Ask clarifying questions first (bonus for ≤2), then extract | Medium |
101
+ | 7 | `supply_chain` | Detect quantity shortfalls, price spikes, phantom deliveries | Expert |
102
+ | 8 | `long_horizon` | 20-step 4-phase investigation: extract → reconcile → audit → risk forecast | Expert |
103
+ | 9 | `personalized` | Adapts to your weak fields — next invoice always targets your worst category | Adaptive |
104
+ | 10 | `curriculum` | Auto-progresses easy→medium→hard→expert based on score (≥0.80 to advance) | Auto |
105
 
106
+ </div>
 
 
 
 
 
 
 
 
 
107
 
108
+ Dynamic difficulty also adjusts **within** each task via a rolling 10-episode score window: an average above `0.85` brings heavier OCR corruption, more discrepancies, and deeper traps, while an average below `0.60` eases the difficulty.
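
The rolling-window rule can be sketched with a bounded `deque`. This is a minimal illustration, assuming the window stores one score per episode and that the thresholds are compared against the window average; the class name, the level labels, and the exact comparison operators are hypothetical.

```python
from collections import deque


class DifficultyTracker:
    """Hypothetical per-task tracker over the last `window` episode scores."""

    def __init__(self, window: int = 10):
        # deque(maxlen=...) silently drops the oldest score once full.
        self.scores = deque(maxlen=window)

    def record(self, score: float) -> None:
        self.scores.append(score)

    def level(self) -> str:
        if not self.scores:
            return "balanced"  # no history yet, start in the middle
        avg = sum(self.scores) / len(self.scores)
        if avg >= 0.85:
            return "harder"    # heavier OCR, more discrepancies
        if avg < 0.60:
            return "easier"    # the environment eases off
        return "balanced"


if __name__ == "__main__":
    tracker = DifficultyTracker()
    for s in [0.9] * 10:
        tracker.record(s)
    print(tracker.level())
```

Because the window is bounded, a run of poor scores fully replaces earlier good ones after ten episodes, so difficulty can move in both directions.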
109
 
110
  ---
111
 
112
+ ## Reward Architecture
113
 
114
+ ### 🔍 Extractor — 4 Independent Signals
115
 
116
  ```python
117
+ reward_format(extracted) # 0.10 — all 5 required JSON keys present?
118
+ reward_field_accuracy(extracted, gt) # 0.40 vendor / date / currency / total match?
119
+ reward_math_consistency(extracted) # 0.25 — qty × unit_price = amount per line?
120
+ reward_completeness(extracted, gt) # 0.25 — all expected line items captured?
 
121
 
122
+ # All clamped to (0.01, 0.99) — no log(0), no gradient collapse at boundaries
 
 
 
 
 
 
123
  ```
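
One way the four signals might be clamped and combined is shown below. This is a sketch under stated assumptions: it treats the listed weights as coefficients of a weighted sum and clamps both the individual signals and the total, which the README does not spell out. The names `WEIGHTS`, `clamp`, and `combine` are hypothetical.

```python
# Weights as listed in the block above; the aggregation rule is an assumption.
WEIGHTS = {
    "format": 0.10,
    "field_accuracy": 0.40,
    "math_consistency": 0.25,
    "completeness": 0.25,
}


def clamp(value: float, lo: float = 0.01, hi: float = 0.99) -> float:
    """Keep a reward away from exact 0 and 1."""
    return max(lo, min(hi, value))


def combine(signals: dict) -> float:
    """Weighted sum of the clamped signals, clamped again as a total."""
    total = sum(WEIGHTS[name] * clamp(signals[name]) for name in WEIGHTS)
    return clamp(total)


if __name__ == "__main__":
    example = {
        "format": 1.0,
        "field_accuracy": 0.5,
        "math_consistency": 0.8,
        "completeness": 0.6,
    }
    print(round(combine(example), 4))
```

Keeping the per-signal values available alongside the total is what allows each component to be inspected separately during training.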
124
 
125
+ ### 🕵️ Auditor
126
+
127
+ <div align="center">
128
 
129
  | Outcome | Reward | Why |
130
  |:---|:---:|:---|
131
+ | Correct fraud type detected | **0.99** | Rewards precise classification, not just flagging |
132
+ | Clean invoice correctly approved | **0.90** | Keeps false-positive rate honest |
133
+ | Compound fraud — one of two types caught | **0.65** | Partial credit on hard cases |
134
+ | Fraud flagged but wrong type | **0.50** | Penalises sloppiness while crediting intent |
135
  | Miss or false positive | **0.01** | Near-zero punishes both failure modes symmetrically |
136
 
137
+ </div>
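
The outcome table can be expressed as a lookup over sets of fraud types. The sketch below assumes the ground truth and the Auditor's verdict are both given as sets of fraud-type strings (an empty set meaning "clean" or "not flagged"); that representation and the function name are hypothetical choices for illustration.

```python
def auditor_reward(true_types: set, predicted_types: set) -> float:
    """Map one audit outcome to the reward values from the table above.

    Hypothetical helper: empty `true_types` means a clean invoice,
    empty `predicted_types` means the Auditor approved it.
    """
    if not true_types:
        # Clean invoice: approving is correct, flagging is a false positive.
        return 0.90 if not predicted_types else 0.01
    if not predicted_types:
        return 0.01  # fraud missed entirely
    hits = true_types & predicted_types
    if hits == true_types:
        return 0.99  # every fraud type identified
    if hits:
        return 0.65  # compound fraud, one of two types caught
    return 0.50      # flagged, but with the wrong type


if __name__ == "__main__":
    compound = {"phantom_vendor", "price_gouging"}
    print(auditor_reward(compound, {"phantom_vendor"}))
```

Using set intersection keeps the compound case and the single-type case on one code path.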
 
 
 
 
 
 
138
 
139
+ ### 🎯 Regulator — Cross-Episode
140
 
141
  ```
142
  Total = Precision(0.35) + Recall(0.35) + No-over-flagging(0.15) + Early-warning-bonus(0.15)
143
  ```
144
 
145
+ The early-warning bonus rewards predictions of *emerging* blind spots — before detection rates cross the critical threshold.
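
To make the "emerging" idea concrete, the sketch below fits a least-squares slope to the most recent detection rates and labels a fraud type before it crosses a critical level. The `0.50` critical threshold and the `-0.02` warning slope are assumptions chosen for illustration, as are the function names; only the 5-episode window comes from the README.

```python
from statistics import mean


def trend_slope(rates: list) -> float:
    """Ordinary least-squares slope of detection rate against episode index."""
    xs = range(len(rates))
    x_bar, y_bar = mean(xs), mean(rates)
    denom = sum((x - x_bar) ** 2 for x in xs)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, rates)) / denom


def classify(rates: list, critical: float = 0.50,
             warn_slope: float = -0.02, window: int = 5) -> str:
    """Label one fraud type from its recent detection rates.

    `critical` and `warn_slope` are hypothetical thresholds.
    """
    recent = rates[-window:]
    if recent[-1] < critical:
        return "blind_spot"
    if len(recent) >= 2 and trend_slope(recent) <= warn_slope:
        return "emerging"  # still above critical, but falling
    return "ok"


if __name__ == "__main__":
    print(classify([0.80, 0.75, 0.70, 0.65, 0.60]))
```

A type whose rate is still acceptable but falling steadily is reported as emerging, which is the case the early-warning bonus is meant to reward.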
146
+
147
  ---
148
 
149
+ ## Training Results: GRPO on Live Environment
150
+
151
+ All 3 agents trained with **TRL GRPOTrainer + Unsloth** using the deployed HF Space as the live reward verifier:
152
 
153
  <div align="center">
154
 
155
+ | Agent | Baseline | Best Achieved | Notes |
156
+ |:---:|:---:|:---:|:---|
157
+ | **🔍 Extractor** | 0.10 (random) | **0.914** live grader | Peaked at step 15, above the Qwen 72B baseline (0.67) |
158
+ | **🕵️ Auditor** | 0.01 (dead signal) | **0.719** total reward | Run 1 had an `episode_id` bug; Run 2 raised live reward from 0.01 to 0.52 |
159
+ | **⚡ Generator** | | Format learned (~0.22) | Plausibility reward improved; evasion had same bug as Run 1 |
 
 
160
 
161
  </div>
162
 
163
+ ![Extractor Training](https://raw.githubusercontent.com/ps2181/invoice-processing-pipeline/main/assets/reward_curve.png)
164
 
165
+ ![Auditor Training Run 2](https://raw.githubusercontent.com/ps2181/invoice-processing-pipeline/main/assets/auditor_reward_curve_run2.png)
166
 
167
+ ![Generator Training](https://raw.githubusercontent.com/ps2181/invoice-processing-pipeline/main/assets/generator_reward_curve.png)
168
 
169
+ **Setup:** Qwen2.5-1.5B-Instruct · 4-bit QLoRA r=16 · Unsloth + TRL · Google Colab A100
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
170
 
171
+ ### The Reward Hacking We Caught at Step 10
172
 
173
+ At step 10 the model had figured out it could score high by producing *arithmetically consistent* JSON while **hallucinating every actual value**:
174
 
175
+ ```
176
+ math_consistency: 0.97 ✓
177
+ completeness: 1.00 ✓
178
+ field_accuracy: 0.00 ✗ ← vendor, date, total all fabricated
179
+ ```
180
 
181
+ A single aggregated reward would have called this success. The 4 independent signals made the failure immediately visible and diagnosable.
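A check for this failure pattern can be sketched in a few lines of Python. This is not the project's code: the function name and the `high`/`low` thresholds are hypothetical, and only the three signal names and the step-10 values come from the log above.

```python
def looks_like_reward_hacking(signals, high=0.9, low=0.1):
    """Flag outputs that are structurally clean but factually wrong.

    signals: dict with 'math_consistency', 'completeness' and 'field_accuracy'.
    high / low: illustrative thresholds, not values taken from the project.
    """
    structural = min(signals["math_consistency"], signals["completeness"])
    return structural >= high and signals["field_accuracy"] <= low


step_10 = {"math_consistency": 0.97, "completeness": 1.00, "field_accuracy": 0.00}
print(looks_like_reward_hacking(step_10))
```

Logging a flag like this alongside the aggregate reward is one way to surface the pattern during training rather than after it.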
 
 
182
 
183
+ ### Auditor Training Log: Run 2 (exact data)
 
184
 
185
+ <div align="center">
186
 
187
+ | Step | Total Reward | Live Env Reward | ±Std |
188
+ |:---:|:---:|:---:|:---:|
189
+ | 5 | 0.4828 | 0.2828 | ±0.194 |
190
+ | 10 | **0.7188** | **0.5188** | ±0.239 |
191
+ | 15 | 0.4538 | 0.2538 | ±0.123 |
192
+ | 20 | 0.5733 | 0.3733 | ±0.212 |
193
+ | 25 | 0.5325 | 0.3325 | ±0.232 |
194
+ | 30 | 0.6038 | 0.4038 | ±0.147 |
195
 
196
+ *Run 1 (dead signal): live env reward stayed flat at 0.010 because TRL passes `episode_id` as a list and the old code sent the whole list instead of indexing it per completion.*
 
 
197
 
198
+ </div>
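The Run 1 bug can be illustrated with a small sketch of a TRL-style reward function. TRL's GRPOTrainer calls reward functions with `prompts`, `completions`, and any extra dataset columns as keyword arguments, each a list with one entry per completion. Everything else here is an assumption: `make_live_reward` and `score_fn` are hypothetical names, and `score_fn` stands in for whatever call the real training code makes to the live environment.

```python
def make_live_reward(score_fn):
    """Build a reward function that scores each completion against its own episode.

    score_fn(episode_id, completion) -> float is a hypothetical stand-in for
    the request sent to the live environment.
    """
    def live_reward(prompts, completions, episode_id, **kwargs):
        # episode_id is a list aligned with completions, so pair them up
        # instead of sending the whole list with every request (the Run 1 bug).
        return [float(score_fn(eid, completion))
                for eid, completion in zip(episode_id, completions)]
    return live_reward
```

Injecting `score_fn` also makes the function easy to exercise offline with a fake scorer before pointing it at the deployed Space.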
199
 
200
+ ---
201
 
202
+ ## Trained LoRA Agents
203
 
204
+ <div align="center">
205
 
206
+ | Agent | HF Hub |
207
+ |:---:|:---|
208
+ | 🔍 Extractor | [ps2181/extractor-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/extractor-lora-qwen2.5-1.5b) |
209
+ | 🕵️ Auditor | [ps2181/auditor-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/auditor-lora-qwen2.5-1.5b) |
210
+ | ⚡ Generator | [ps2181/generator-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/generator-lora-qwen2.5-1.5b) |
211
+
212
+ </div>
213
 
214
  ---
215
 
216
+ ## Sample Multi-Agent Episode
217
 
218
  ```
219
+ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
220
+ MULTI-AGENT PIPELINE · LIVE EPISODE
221
+ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
222
+
223
+ 🎯 REGULATOR (30-episode rolling window)
224
+ ────────────────────────────────────────────────
225
+ phantom_vendor 31% ⚠ BLIND SPOT ← prioritised 60%
226
+ price_gouging 74% ✓ OK
227
+ math_fraud 81% ✓ OK
228
+ duplicate 62% ✓ OK
229
+
230
+ ⚡ GENERATOR (Qwen2.5 LoRA)
231
+ ────────────────────────────────────────────────
232
+ Fraud focus : phantom_vendor (60% Regulator weight)
233
+ Vendor : ShadowByte Technologies ← not in registry
234
+
235
+ 🔍 EXTRACTOR (Qwen2.5 LoRA)
236
+ ────────────────────────────────────────────────
237
+ Reward : 0.847 [format 0.10 · field 0.38 · math 0.25 · completeness 0.12]
238
+
239
+ 🕵️ AUDITOR (Qwen2.5 LoRA)
240
+ ────────────────────────────────────────────────
241
+ INV-85529 → 🚨 FLAGGED [PHANTOM VENDOR] conf=0.91
242
+ INV-85530 → ✅ APPROVED conf=0.88
243
+
244
+ ✅ APPROVER
245
+ ────────────────────────────────────────────────
246
+ INV-85529 → ❌ REJECT
247
+ Generator reward : 0.60 (evaded Auditor on 1/3, Approver caught)
248
+
249
+ 🎯 REGULATOR UPDATE
250
+ ────────────────────────────────────────────────
251
+ phantom_vendor detection: 31% → 45% ↑ improving
252
+ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
253
  ```
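The Regulator step in the episode above can be approximated as follows. This is a sketch under stated assumptions rather than the environment's logic: the 0.5 threshold is a guess (the sample only shows that 31% is a blind spot and 62% is not), the even split of the remaining weight is invented for illustration, and both function names are hypothetical. Only the detection rates and the 60% share come from the sample output.

```python
def find_blind_spots(detection_rates, threshold=0.5):
    """Return fraud types whose detection rate is below the (assumed) threshold."""
    return sorted(t for t, rate in detection_rates.items() if rate < threshold)


def generator_weights(detection_rates, threshold=0.5, blind_share=0.6):
    """Give blind spots a fixed share of Generator sampling weight.

    Blind spots split blind_share evenly and the other types split the rest.
    Falls back to uniform weights when there is nothing to prioritise.
    """
    blind = set(find_blind_spots(detection_rates, threshold))
    others = [t for t in detection_rates if t not in blind]
    if not blind or not others:
        return {t: 1.0 / len(detection_rates) for t in detection_rates}
    weights = {t: blind_share / len(blind) for t in blind}
    weights.update({t: (1.0 - blind_share) / len(others) for t in others})
    return weights


rates = {"phantom_vendor": 0.31, "price_gouging": 0.74,
         "math_fraud": 0.81, "duplicate": 0.62}
print(find_blind_spots(rates))
print(generator_weights(rates))
```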
254
 
255
  ---
256
 
257
+ ## API Reference
258
 
259
  ### Core OpenEnv
260
 
261
  | Endpoint | Method | Description |
262
  |:---|:---:|:---|
263
+ | `/reset` | POST | Start episode (`{"task_id": "easy\|medium\|hard\|...\|curriculum"}`) |
264
+ | `/step` | POST | Submit extracted data, get reward + feedback |
265
+ | `/grader` | POST | Score without modifying state (training reward signal) |
266
+ | `/state` | GET | Episode metadata |
267
+ | `/health` | GET | Health check + active session count |
268
+ | `/metrics` | GET | Per-task episode counts, avg/best scores, Regulator state |
269
+ | `/tasks` | GET | Full task list with schemas |
270
+ | `/ws` | WS | WebSocket interface |
271
 
272
  ### Multi-Agent
273
 
274
  | Endpoint | Method | Description |
275
  |:---|:---:|:---|
276
+ | `/multi/reset` | POST | Start 5-agent episode, Generator biased by Regulator |
277
+ | `/multi/extract` | POST | Score Extractor output (4 signals) |
278
+ | `/multi/audit` | POST | Score Auditor output, update tracker |
279
+ | `/multi/approve` | POST | Run Approver, compute Generator adversarial reward |
280
+ | `/generator/score` | POST | Direct Generator scoring through Auditor+Approver pipeline |
281
 
282
  ### Regulator
283
 
284
  | Endpoint | Method | Description |
285
  |:---|:---:|:---|
286
+ | `/regulator/report` | GET | Detection rates, blind spots, generator weights |
287
+ | `/regulator/forecast` | GET | Trend slopes + emerging blind spot warnings |
288
+ | `/regulator/calibration` | GET | Confidence calibration per fraud type |
289
+ | `/regulator/predict` | POST | Score Regulator blind spot predictions |
 
 
290
 
291
  ---
292
 
293
+ ## Quick Start
294
 
295
+ ```bash
296
+ # Health check
297
+ curl https://ps2181-invoice-processing-pipeline.hf.space/health
298
 
299
+ # Environment-wide metrics
300
+ curl https://ps2181-invoice-processing-pipeline.hf.space/metrics
301
 
302
+ # Auto-progressive curriculum episode
303
+ curl -X POST https://ps2181-invoice-processing-pipeline.hf.space/reset \
304
+ -H "Content-Type: application/json" -d '{"task_id": "curriculum"}'
305
 
306
+ # Start multi-agent episode
307
+ curl -X POST https://ps2181-invoice-processing-pipeline.hf.space/multi/reset
308
 
309
+ # Regulator blind spot report
310
+ curl https://ps2181-invoice-processing-pipeline.hf.space/regulator/report
311
  ```
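For scripted use, the curriculum `/reset` call can also be built in Python with only the standard library. This is a minimal sketch: the helper names `build_reset_request` and `send` are hypothetical, and it assumes the endpoint returns a JSON body, which this README does not show.

```python
import json
import urllib.request

BASE_URL = "https://ps2181-invoice-processing-pipeline.hf.space"


def build_reset_request(task_id="curriculum", base_url=BASE_URL):
    """Build (but do not send) the POST /reset request from the Quick Start."""
    body = json.dumps({"task_id": task_id}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/reset",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30):
    """Send a prepared request and decode the reply, assuming it is JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `send(build_reset_request("curriculum"))` would issue the same request as the curl example; keeping construction and sending separate lets the request be inspected without network access.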
312
 
313
  ---
314
 
315
+ ## Theme Alignment
316
 
317
  <div align="center">
318
 
319
+ | Theme | Alignment |
320
  |:---|:---|
321
+ | **#4 Self-Improvement** (primary) | Regulator detects blind spots → Generator biases toward them → Auditor improves → loop repeats |
322
+ | **#1 Multi-Agent Interactions** | 5 agents with conflicting incentives (Generator vs Auditor adversarial self-play) |
323
+ | **#1 Fleet AI Scalable Oversight** (bonus) | Regulator monitors Auditor cross-episode with predictive trend detection |
324
+ | **#3.1 Professional Tasks** | Invoice fraud detection = core enterprise financial workflow |
 
 
 
 
 
325
 
326
  </div>
327
 
 
329
 
330
  <div align="center">
331
 
332
+ *Built for the Meta PyTorch OpenEnv Hackathon 2026.*
333
 
334
+ **Pritam Satpathy & Gnana Nawin T · Scaler School of Technology · Bangalore**
335
 
336
  </div>