---
title: Invoice Processing Pipeline
emoji: 🧾
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
tags:
- openenv
- multi-agent
- grpo
- rl
short_description: 5-agent adversarial fraud detection RL environment
---
<div align="center">
<img src="https://capsule-render.vercel.app/api?type=waving&color=gradient&customColorList=6,11,20&height=200&section=header&text=Invoice%20Processing%20Pipeline&fontSize=40&fontColor=fff&animation=twinkling&fontAlignY=35&desc=Self-Improving%20Multi-Agent%20Fraud%20Detection%20%7C%20OpenEnv%20%2B%20GRPO%20%2B%20Qwen2.5&descAlignY=55&descSize=16" width="100%"/>
<p>
<a href="https://ps2181-invoice-processing-pipeline.hf.space/web">
<img src="https://img.shields.io/badge/Live%20Demo-HuggingFace%20Spaces-FF9D00?style=for-the-badge&logo=huggingface&logoColor=white" />
</a>
<a href="https://colab.research.google.com/drive/1C1_3giNt-NmbzKNFJr5_L1fms3L8LfmB">
<img src="https://img.shields.io/badge/Training%20Colab-Open%20Notebook-F9AB00?style=for-the-badge&logo=googlecolab&logoColor=white" />
</a>
<a href="https://ps2181-invoice-processing-pipeline.hf.space/docs">
<img src="https://img.shields.io/badge/API%20Docs-FastAPI-009688?style=for-the-badge&logo=fastapi&logoColor=white" />
</a>
</p>
<p>
<img src="https://img.shields.io/badge/Framework-OpenEnv-1A356E?style=for-the-badge" />
<img src="https://img.shields.io/badge/Model-Qwen2.5--1.5B%20+%20LoRA%20r%3D16-8B1A4E?style=for-the-badge" />
<img src="https://img.shields.io/badge/Training-GRPO%20+%20Unsloth-00A67E?style=for-the-badge" />
<img src="https://img.shields.io/badge/Agents-5%20Adversarial-E44D26?style=for-the-badge" />
</p>
<p>
<img src="https://img.shields.io/badge/Tasks-10%20Progressive-6C3483?style=for-the-badge" />
<img src="https://img.shields.io/badge/Deployment-Docker%20%7C%20HF%20Spaces-0D1117?style=for-the-badge&logo=docker" />
<img src="https://img.shields.io/badge/Theme-%234%20Self--Improvement-FF6B35?style=for-the-badge" />
<img src="https://img.shields.io/badge/Hackathon-Meta%20PyTorch%202026-185FA5?style=for-the-badge" />
</p>
<br/>
> **Meta PyTorch OpenEnv Hackathon · Grand Finale · April 25–26, 2026**
>
> Team: **Pritam Satpathy** & **Gnana Nawin T** · VIT, Vellore
<br/>
<a href="https://git.io/typing-svg">
<img src="https://readme-typing-svg.demolab.com?font=Fira+Code&weight=600&size=22&pause=1000&color=007A87&center=true&vCenter=true&width=750&lines=5-Agent+Adversarial+Fraud+Detection+System;Self-Improving+via+Cross-Episode+Regulator;GRPO-Trained+LoRA+Agents+on+Live+Environment;Invoice+%E2%86%92+Extract+%E2%86%92+Audit+%E2%86%92+Approve+%E2%86%92+Improve" alt="Typing SVG" />
</a>
</div>
---
## 🔥 The Core Idea
> *A system that continuously generates harder challenges targeting its own weakest points.*
Most fraud detection pipelines are **static**. Ours **gets harder for itself over time**: the Regulator finds where the Auditor keeps failing, the Generator exploits those exact blind spots in the next episode, the Auditor's new mistakes update the Regulator, and the loop closes without any human intervention.
**Primary theme: #4 Self-Improvement · Secondary: #1 Multi-Agent Interactions**
<div align="center">
<img width="1710" height="326" alt="5-agent self-improvement loop" src="https://github.com/user-attachments/assets/319654c3-aa24-47e8-9716-734d4e902168" />
</div>
---
## 🤖 5-Agent Architecture
```
🎯 Regulator ──bias weights────────────────────────────► ⚡ Generator
     ▲                                                       │
     │                                                raw invoice text
     │ missed fraud types                                    ▼
     │                                                   Extractor
     │                                                       │
     │                                                structured data
     │                                                       ▼
     └──── episode outcome ──── ✅ Approver ◄──audit results─── 🕵️ Auditor
```
<div align="center">
| Agent | Role | Reward Signal |
|:---:|:---|:---|
| 🎯 **Regulator** | Cross-episode oversight: detects Auditor blind spots, reweights Generator | Precision `0.35` + Recall `0.35` + No over-flagging `0.15` + Early warning `0.15` |
| ⚡ **Generator** | Adversary: creates invoices biased toward blind spots | `+0.85` evades both · `+0.60` evades Auditor · `+0.10` caught |
| **Extractor** | Parser: text → structured JSON, scored on 4 independent signals | Format `0.10` · Field accuracy `0.40` · Math `0.25` · Completeness `0.25` |
| 🕵️ **Auditor** | Detector: fraud classification with confidence scores | `+0.99` correct type · `+0.90` clean cleared · `+0.01` miss or FP |
| ✅ **Approver** | Gatekeeper: final approve / escalate / reject | `≥0.80` → reject · `0.50–0.80` → escalate · `<0.50` → approve |
</div>
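The Approver thresholds in the table can be read as a simple decision rule. The sketch below is illustrative only: it assumes the thresholds apply to a fraud-confidence score in `[0, 1]`, treats `0.50` as inclusive for escalation, and uses a hypothetical function name that is not taken from the repository.

```python
def approver_decision(fraud_confidence: float) -> str:
    """Map a fraud-confidence score in [0, 1] to a final decision.

    Assumption: the thresholds from the agent table are applied to the
    Auditor's fraud confidence; the real Approver may use other inputs.
    """
    if not 0.0 <= fraud_confidence <= 1.0:
        raise ValueError("fraud_confidence must be in [0, 1]")
    if fraud_confidence >= 0.80:
        return "reject"
    if fraud_confidence >= 0.50:
        return "escalate"
    return "approve"


if __name__ == "__main__":
    for score in (0.91, 0.65, 0.20):
        print(score, approver_decision(score))
```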
---
## ⚡ Three Novel Features
<table>
<tr>
<td width="33%" align="center">
### 🔮 Predictive Regulator
Computes **trend slopes** over 5-episode windows.<br/>Warns of *emerging* blind spots **before** detection rates cross the critical threshold: proactive oversight, not reactive retraining.
`+0.15 early-warning bonus`
</td>
<td width="33%" align="center">
### 🧬 Compound Fraud
Invoices carry **two fraud signals simultaneously** (e.g. phantom vendor + price gouging).<br/>Partial credit `+0.65` for catching one; full reward `+0.99` for both.
Prevents single-signal heuristics.
</td>
<td width="33%" align="center">
### Confidence Calibration
Tracks `(confidence, correct?)` pairs per fraud type.<br/>Detects **overconfident misses** (the Auditor saying "90% sure, approved" on fraud), the most dangerous real-world failure mode.
</td>
</tr>
</table>
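As a rough illustration of the confidence-calibration idea, the sketch below stores `(confidence, correct?)` pairs per fraud type and counts overconfident misses. The class name, method names, and the `0.8` overconfidence threshold are assumptions made for this example, not the repository's implementation.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple


class CalibrationTracker:
    """Hypothetical tracker of (confidence, correct) pairs per fraud type."""

    def __init__(self, overconfidence_threshold: float = 0.8) -> None:
        # Assumed cut-off above which a wrong answer counts as "overconfident".
        self.threshold = overconfidence_threshold
        self.pairs: DefaultDict[str, List[Tuple[float, bool]]] = defaultdict(list)

    def record(self, fraud_type: str, confidence: float, correct: bool) -> None:
        self.pairs[fraud_type].append((confidence, correct))

    def overconfident_misses(self, fraud_type: str) -> int:
        """Count wrong verdicts made with confidence at or above the threshold."""
        return sum(
            1
            for conf, ok in self.pairs.get(fraud_type, [])
            if not ok and conf >= self.threshold
        )

    def calibration_gap(self, fraud_type: str) -> float:
        """Mean confidence minus accuracy; positive means overconfident."""
        pairs = self.pairs.get(fraud_type, [])
        if not pairs:
            return 0.0
        mean_conf = sum(conf for conf, _ in pairs) / len(pairs)
        accuracy = sum(1 for _, ok in pairs if ok) / len(pairs)
        return mean_conf - accuracy
```

A positive gap for a fraud type suggests the Auditor's stated confidence runs ahead of its actual hit rate on that type.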
---
## 🎯 10 Tasks: Progressive Curriculum
<div align="center">
| # | Task | What the Agent Faces | Difficulty |
|:---:|:---|:---|:---:|
| 1 | `easy` | Single clean invoice: extract 5 fields | 🟢 Easy |
| 2 | `medium` | Batch with date chaos, vendor typos, currency noise | 🟡 Medium |
| 3 | `hard` | Extraction + PO reconciliation: flag overcharges, missing items | 🟠 Hard |
| 4 | `expert` | Full fraud audit across all four fraud types | 🔴 Expert |
| 5 | `adversarial` | OCR corruption, SUBTOTAL traps, fake TAX/FX noise lines | 🔴 Expert |
| 6 | `negotiate` | Ask clarifying questions first (bonus for ≤2), then extract | 🟡 Medium |
| 7 | `supply_chain` | Detect quantity shortfalls, price spikes, phantom deliveries | 🔴 Expert |
| 8 | `long_horizon` | 20-step 4-phase investigation: extract → reconcile → audit → risk forecast | 🔴 Expert |
| 9 | `personalized` | Adapts to your weak fields: the next invoice always targets your worst category | Adaptive |
| 10 | `curriculum` | Auto-progresses easy → medium → hard → expert based on score (≥0.80 to advance) | Auto |
</div>
Dynamic difficulty also adjusts **within** each task via a rolling 10-episode score window: a score above `0.85` brings heavier OCR noise, more discrepancies, and deeper traps, while a drop below `0.60` eases the task off.
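A minimal sketch of such a rolling-window controller follows. It assumes the rolling mean of the last 10 scores is compared against the two thresholds and that difficulty is an integer level; the class name, the level range, and the one-step adjustment are all hypothetical.

```python
from collections import deque


class DifficultyController:
    """Hypothetical rolling-window difficulty adjuster for one task."""

    def __init__(self, window: int = 10, raise_above: float = 0.85,
                 lower_below: float = 0.60, max_level: int = 5) -> None:
        self.scores = deque(maxlen=window)  # old scores fall out automatically
        self.raise_above = raise_above
        self.lower_below = lower_below
        self.max_level = max_level
        self.level = 0  # 0 = baseline difficulty (assumed scale)

    def update(self, score: float) -> int:
        """Record an episode score and return the new difficulty level."""
        self.scores.append(score)
        mean = sum(self.scores) / len(self.scores)
        if mean > self.raise_above:
            self.level = min(self.level + 1, self.max_level)
        elif mean < self.lower_below:
            self.level = max(self.level - 1, 0)
        return self.level
```

Using `deque(maxlen=10)` keeps the window bounded without any manual trimming.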
---
## Training Results: GRPO on Live Environment
All three LoRA agents (Extractor, Auditor, Generator) were trained with **TRL GRPOTrainer + Unsloth**, using the deployed HF Space as the live reward verifier: the `/grader` endpoint *is* the reward function during training.
### Before vs After Training
<div align="center">
| Agent | Untrained (random) | Qwen 72B baseline | After GRPO | Improvement |
|:---:|:---:|:---:|:---:|:---:|
| **Extractor** | 0.10 | 0.67 | **0.914** | +814% vs random |
| 🕵️ **Auditor** | 0.01 | n/a | **0.52** live reward | Dead → active signal |
| ⚡ **Generator** | n/a | n/a | **0.22** plausibility | Format & realism learned |
</div>
**Setup:** Qwen2.5-1.5B-Instruct · 4-bit QLoRA r=16 · Unsloth + TRL · Google Colab A100
### Extractor Reward Curve

*X-axis: training step (1–20) · Y-axis: reward (0–1). Left: total GRPO reward across 4 independent signals (format 0.10 + field accuracy 0.40 + math 0.25 + completeness 0.25). Right: live `/grader` score peaking at **0.914**, above the Qwen 72B baseline (0.67) and the untrained 1.5B baseline (0.46).*
### Auditor Reward Curve (Run 2, Bug Fixed)

*X-axis: training step (1–30) · Y-axis: reward (0–1). Total reward (blue) and live env reward (orange) with ±1 std band. Best total: **0.719** at step 10. Live env reward climbed from 0.01 (dead signal, Run 1) to **0.52** after fixing the TRL `episode_id` list indexing bug.*
### Generator Reward Curve

*X-axis: training step (1–30) · Y-axis: reward (0–1). Live evasion reward (red) stays flat near 0: the Auditor and Approver caught all fraud attempts. Fraud plausibility reward (orange dashed) is stable at ~0.20, showing the Generator learned realistic invoice structure even without successful evasion.*
### Reward Hacking Caught at Step 10
At step 10 the model achieved `math_consistency = 0.97` and `completeness = 1.0` while `field_accuracy = 0.00`: it had learned to output **arithmetically-consistent JSON with entirely hallucinated values**:
```
Step 10: Reward Hacking Detected
  format:            0.10 ✅
  math_consistency:  0.97 ✅ ← model gaming this signal
  completeness:      1.00 ✅ ← model gaming this signal
  field_accuracy:    0.00 ❌ ← hallucinating all values
Action: adjusted training emphasis on field_accuracy weight
Result: field_accuracy climbed to 0.30+ by step 30
```
Without 4 independent signals, a single aggregated reward would have called this success. **Independent signals are diagnostics, not just incentives.**
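The check that surfaced this can be approximated in a few lines. The sketch below flags signals that score high while the ground-truth-anchored signal collapses; the function name and the `high`/`low` cut-offs are assumptions for illustration, not values from the training notebooks.

```python
from typing import Dict, List


def gamed_signals(signals: Dict[str, float], anchor: str = "field_accuracy",
                  high: float = 0.9, low: float = 0.1) -> List[str]:
    """Return signals that look gamed: high while the anchor signal is near zero.

    `anchor` is the signal tied to ground truth; `high` and `low` are
    hypothetical thresholds chosen for this example.
    """
    if signals[anchor] > low:
        return []  # anchor is healthy, nothing suspicious
    return sorted(
        name for name, value in signals.items()
        if name != anchor and value >= high
    )


if __name__ == "__main__":
    step_10 = {
        "format": 0.10,
        "math_consistency": 0.97,
        "completeness": 1.00,
        "field_accuracy": 0.00,
    }
    print(gamed_signals(step_10))
```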
### Auditor Training: Run 2 (exact data)
<div align="center">
| Step | Total Reward | Live Env Reward | ±Std |
|:---:|:---:|:---:|:---:|
| 5 | 0.4828 | 0.2828 | ±0.194 |
| 10 | **0.7188** | **0.5188** | ±0.239 |
| 15 | 0.4538 | 0.2538 | ±0.123 |
| 20 | 0.5733 | 0.3733 | ±0.212 |
| 25 | 0.5325 | 0.3325 | ±0.232 |
| 30 | 0.6038 | 0.4038 | ±0.147 |

*Run 1 (dead signal): live env reward flat at 0.010. TRL passes `episode_id` as a list; the old code sent the whole list instead of indexing it per completion.*
</div>
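The Run 1 bug comes from how TRL's `GRPOTrainer` calls custom reward functions: extra dataset columns arrive as keyword arguments holding one entry per completion. A hedged sketch of the per-completion indexing is below. It assumes plain-string completions and a hypothetical `score_fn(episode_id, text)` that stands in for the HTTP call to `/grader`; none of these names come from the notebooks.

```python
from typing import Callable, List, Sequence


def make_env_reward(score_fn: Callable[[str, str], float]):
    """Build a GRPO-style reward function around a scoring callable.

    `score_fn` is a hypothetical stand-in for a request to the live
    /grader endpoint; it receives ONE episode id and ONE completion.
    """

    def env_reward(completions: Sequence[str], episode_id: Sequence[str],
                   **kwargs) -> List[float]:
        # episode_id is a list aligned with completions: pair them up
        # instead of sending the whole list with every request.
        if len(completions) != len(episode_id):
            raise ValueError("completions and episode_id must align")
        return [score_fn(eid, text) for eid, text in zip(episode_id, completions)]

    return env_reward
```

Injecting `score_fn` keeps the pairing logic testable offline, without touching the deployed Space.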
---
## Reward Architecture
### Extractor: 4 Independent Signals
```python
reward_format(extracted) # 0.10: all 5 required JSON keys present?
reward_field_accuracy(extracted, gt) # 0.40: vendor / date / currency / total match?
reward_math_consistency(extracted) # 0.25: qty × unit_price = amount per line?
reward_completeness(extracted, gt) # 0.25: all expected line items captured?
# All clamped to (0.01, 0.99): no log(0), no gradient collapse at boundaries
```
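One way the four signals could combine into a single scalar is sketched below. It assumes each signal is first scored in `[0, 1]`, multiplied by its weight, and that the clamp to `(0.01, 0.99)` is applied to the weighted total; the source does not say whether clamping happens per signal or on the sum, so treat that as an assumption.

```python
from typing import Dict

# Weights as listed above.
WEIGHTS: Dict[str, float] = {
    "format": 0.10,
    "field_accuracy": 0.40,
    "math_consistency": 0.25,
    "completeness": 0.25,
}


def clamp(value: float, lo: float = 0.01, hi: float = 0.99) -> float:
    """Keep a reward strictly away from 0 and 1."""
    return max(lo, min(hi, value))


def total_reward(raw: Dict[str, float]) -> float:
    """Weighted sum of unweighted per-signal scores in [0, 1], then clamped.

    Assumption: clamping is applied to the total, not to each signal.
    """
    return clamp(sum(WEIGHTS[name] * raw[name] for name in WEIGHTS))


if __name__ == "__main__":
    # Step-10 style output: consistent math, hallucinated fields.
    print(total_reward({"format": 1.0, "field_accuracy": 0.0,
                        "math_consistency": 0.97, "completeness": 1.0}))
```

Under these assumptions the step-10 reward-hacking case still earns a mid-range aggregate, which is why the per-signal breakdown matters.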
### 🕵️ Auditor
<div align="center">
| Outcome | Reward | Why |
|:---|:---:|:---|
| Correct fraud type detected | **0.99** | Rewards precise classification, not just binary flagging |
| Clean invoice correctly approved | **0.90** | Keeps false-positive rate honest |
| Compound fraud, one of two types caught | **0.65** | Partial credit prevents cliff on hard cases |
| Fraud flagged but wrong type | **0.50** | Penalises sloppiness; rewards catching *something* |
| Miss or false positive | **0.01** | Near-zero punishes both failure modes symmetrically |
</div>
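The table can be expressed as a small scoring function. This is a sketch under assumptions: fraud types are modelled as sets of labels, an empty set means "clean", and extra predicted labels are not penalised once every true type is caught. The function name is hypothetical.

```python
from typing import Set


def auditor_reward(true_types: Set[str], predicted_types: Set[str]) -> float:
    """Score one Auditor verdict using the reward values from the table."""
    if not true_types:
        # Clean invoice: cleared correctly, or a false positive.
        return 0.90 if not predicted_types else 0.01
    if not predicted_types:
        return 0.01  # fraud missed entirely
    hits = true_types & predicted_types
    if hits == true_types:
        return 0.99  # every fraud type identified
    if hits:
        return 0.65  # compound fraud, only part of it caught
    return 0.50  # flagged as fraud, but wrong type


if __name__ == "__main__":
    print(auditor_reward({"phantom_vendor", "price_gouging"}, {"price_gouging"}))
```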
### ⚡ Generator (Adversarial Self-Play)
| Outcome | Reward |
|:---|:---:|
| Fraud evades **both** Auditor and Approver | **0.85** |
| Auditor misses, Approver catches | **0.60** |
| Auditor catches it | **0.10** |
### 🎯 Regulator: Cross-Episode
```
Total = Precision(0.35) + Recall(0.35) + No-over-flagging(0.15) + Early-warning-bonus(0.15)
```
The early-warning bonus rewards predictions of *emerging* blind spots, before detection rates cross the critical threshold.
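To make "emerging" concrete, the sketch below fits a least-squares slope to the last five detection rates and projects it forward. The critical threshold (`0.50`), the projection horizon (10 episodes), and both function names are assumptions for illustration; the source does not state these values.

```python
from typing import Sequence


def trend_slope(rates: Sequence[float]) -> float:
    """Ordinary least-squares slope of detection rate against episode index."""
    n = len(rates)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(rates) / n
    num = sum((i - mean_x) * (r - mean_y) for i, r in enumerate(rates))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def is_emerging(rates: Sequence[float], critical: float = 0.50,
                window: int = 5, horizon: int = 10) -> bool:
    """True if a still-acceptable rate is projected to fall below `critical`.

    `critical` and `horizon` are hypothetical values for this sketch.
    """
    recent = list(rates)[-window:]
    slope = trend_slope(recent)
    current = recent[-1]
    if current < critical or slope >= 0:
        return False  # already a blind spot, or not declining
    return current + slope * horizon < critical


if __name__ == "__main__":
    print(is_emerging([0.70, 0.68, 0.66, 0.64, 0.62]))
```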
---
## 🧠 Trained LoRA Agents
<div align="center">
| Agent | Base Model | LoRA Config | HuggingFace Hub |
|:---:|:---|:---:|:---|
| Extractor | Qwen2.5-1.5B-Instruct | r=16, α=16, 4-bit QLoRA | [ps2181/extractor-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/extractor-lora-qwen2.5-1.5b) |
| 🕵️ Auditor | Qwen2.5-1.5B-Instruct | r=16, α=16, 4-bit QLoRA | [ps2181/auditor-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/auditor-lora-qwen2.5-1.5b) |
| ⚡ Generator | Qwen2.5-1.5B-Instruct | r=16, α=16, 4-bit QLoRA | [ps2181/generator-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/generator-lora-qwen2.5-1.5b) |
</div>
**LoRA target modules:** `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
---
## The Regulator in Action
After each episode, the Regulator publishes a report the Generator uses to bias its next batch:
```
GET /regulator/report
{
  "total_audits_recorded": 20,
  "detection_rates": {
    "phantom_vendor": "31% BLIND SPOT (-0.08↓)",
    "price_gouging": "74% OK (+0.03↑)",
    "math_fraud": "81% OK (+0.01↑)",
    "duplicate_submission": "62% ⚡ EMERGING (-0.02↓)"
  },
  "blind_spots": ["phantom_vendor"],
  "emerging_blind_spots": ["duplicate_submission"],
  "generator_weights": {
    "phantom_vendor": 0.30,        ← 3× upweighted (blind spot)
    "duplicate_submission": 0.20,  ← 2× upweighted (emerging)
    "price_gouging": 0.125,
    "math_fraud": 0.125,
    "compound_fraud": 0.10
  },
  "verdict": "Recommend retraining on: phantom_vendor"
}
```
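A Generator could consume `generator_weights` by sampling the next fraud type in proportion to those weights. The sketch below assumes that interpretation; `pick_fraud_type` is a hypothetical helper, and `random.choices` accepts relative weights, so the values do not need to sum to 1.

```python
import random
from typing import Dict

# Weights copied from the sample report above.
REPORT_WEIGHTS: Dict[str, float] = {
    "phantom_vendor": 0.30,
    "duplicate_submission": 0.20,
    "price_gouging": 0.125,
    "math_fraud": 0.125,
    "compound_fraud": 0.10,
}


def pick_fraud_type(weights: Dict[str, float], rng: random.Random) -> str:
    """Sample one fraud type with probability proportional to its weight."""
    types = list(weights)
    return rng.choices(types, weights=[weights[t] for t in types], k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    print([pick_fraud_type(REPORT_WEIGHTS, rng) for _ in range(5)])
```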
---
## 🎭 Sample Multi-Agent Episode
```
══════════════════════════════════════════════════════════
  MULTI-AGENT PIPELINE · LIVE EPISODE
══════════════════════════════════════════════════════════
🎯 REGULATOR (30-episode rolling window)
──────────────────────────────────────────────
  phantom_vendor   31%  BLIND SPOT → prioritised 60%
  price_gouging    74%  OK
  math_fraud       81%  OK
  duplicate        62%  OK
⚡ GENERATOR (Qwen2.5 LoRA)
──────────────────────────────────────────────
  Fraud focus : phantom_vendor (60% Regulator weight)
  Vendor : ShadowByte Technologies (not in registry)
EXTRACTOR (Qwen2.5 LoRA)
──────────────────────────────────────────────
  Reward : 0.847 [format 0.10 · field 0.38 · math 0.25 · completeness 0.12]
🕵️ AUDITOR (Qwen2.5 LoRA)
──────────────────────────────────────────────
  INV-85529 → 🚨 FLAGGED [PHANTOM VENDOR] conf=0.91
  INV-85530 → ✅ APPROVED conf=0.88
✅ APPROVER
──────────────────────────────────────────────
  INV-85529 → ❌ REJECT
  Generator reward : 0.60 (evaded Auditor on 1/3, Approver caught)
🎯 REGULATOR UPDATE
──────────────────────────────────────────────
  phantom_vendor detection: 31% → 45% ↑ improving
══════════════════════════════════════════════════════════
```
---
## Quick Start
```bash
# Health check
curl https://ps2181-invoice-processing-pipeline.hf.space/health
# Environment-wide metrics
curl https://ps2181-invoice-processing-pipeline.hf.space/metrics
# Auto-progressive curriculum episode
curl -X POST https://ps2181-invoice-processing-pipeline.hf.space/reset \
-H "Content-Type: application/json" -d '{"task_id": "curriculum"}'
# Start multi-agent episode
curl -X POST https://ps2181-invoice-processing-pipeline.hf.space/multi/reset
# Regulator blind spot report
curl https://ps2181-invoice-processing-pipeline.hf.space/regulator/report
```
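The same calls can be made from Python with only the standard library. The sketch below mirrors the `/reset` curl command; the helper names are hypothetical, and the response shape is not assumed beyond it being JSON.

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "https://ps2181-invoice-processing-pipeline.hf.space"


def build_reset_request(task_id: str) -> Request:
    """Build the POST /reset request shown in the curl example."""
    body = json.dumps({"task_id": task_id}).encode("utf-8")
    return Request(
        f"{BASE_URL}/reset",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(request: Request, timeout: float = 30.0) -> dict:
    """Send a request and decode the JSON response (performs network I/O)."""
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    print(call(build_reset_request("curriculum")))
```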
### Run Training (Google Colab)
[Open the training notebook in Colab](https://colab.research.google.com/drive/1C1_3giNt-NmbzKNFJr5_L1fms3L8LfmB)
```
Colab → /reset (fresh synthetic invoice from live environment)
      → model generates JSON
      → /grader scores against ground truth
      → GRPO updates weights toward higher-reward completions
      → repeat 200 steps
```
---
## Repository Structure
```
invoice-processing-pipeline/
│
├── server/
│   ├── app.py                       # FastAPI: 18 endpoints
│   ├── environment.py               # 10 tasks · graders · dynamic difficulty
│   ├── multi_agent_environment.py   # 5-agent system + AuditorPerformanceTracker
│   ├── agents.py                    # Lazy-loading LoRA inference wrappers
│   └── web_ui.py                    # Gradio UI (mounted at /web)
│
├── models.py                        # Pydantic: Action · Observation · State
├── inference.py                     # Standalone inference helper
├── client.py                        # OpenEnv-compatible Python client
│
├── extractor_training_grpo.ipynb    # 🔥 Extractor GRPO training (Unsloth + TRL)
├── auditor_grpo_training.ipynb      # 🔥 Auditor GRPO training
├── generator_grpo_training.ipynb    # 🔥 Generator GRPO training
│
├── assets/
│   ├── reward_curve.png             # Extractor training curve
│   ├── auditor_reward_curve_run2.png
│   └── generator_reward_curve.png
│
├── openenv.yaml                     # OpenEnv manifest (all tasks declared)
├── Dockerfile                       # HF Spaces Docker (port 7860, non-root UID 1000)
├── pyproject.toml                   # Project metadata + dependencies
├── requirements.txt                 # Runtime dependencies
├── validate-submission.sh           # Submission validator script
├── BLOG.md                          # HuggingFace blog post
└── ROUND2_PROBLEM_STATEMENT.md      # Full problem statement + reward design rationale
```
---
## API Reference
### Core OpenEnv
| Endpoint | Method | Description |
|:---|:---:|:---|
| `/health` | `GET` | Health check → `{"status": "ok", "active_sessions": N}` |
| `/tasks` | `GET` | All tasks with descriptions, schemas, difficulty levels |
| `/metrics` | `GET` | Per-task episode counts, avg/best scores, Regulator state |
| `/reset` | `POST` | Start episode `{"task_id": "easy\|medium\|...\|curriculum"}` |
| `/step` | `POST` | Submit extraction → reward + feedback + hint + reward_breakdown |
| `/grader` | `POST` | Score without consuming an attempt (training reward signal) |
| `/state` | `GET` | Episode metadata: step_count, done, best_reward, history |
| `/ws` | `WS` | Full episode over WebSocket (OpenEnv standard) |
| `/web` | `GET` | Gradio interactive demo UI |
### Multi-Agent
| Endpoint | Method | Description |
|:---|:---:|:---|
| `/multi/reset` | `POST` | Start 5-agent episode with the Generator biased by Regulator weights |
| `/multi/extract` | `POST` | Score Extractor output (4 independent signals) |
| `/multi/audit` | `POST` | Score Auditor output, update 30-episode performance tracker |
| `/multi/approve` | `POST` | Run Approver, compute Generator adversarial reward |
| `/multi/state/{id}` | `GET` | Full episode state including all agent scores |
| `/generator/score` | `POST` | Direct Generator scoring through Auditor+Approver pipeline |
### Regulator
| Endpoint | Method | Description |
|:---|:---:|:---|
| `/regulator/report` | `GET` | Detection rates, blind spots, calibration, generator weights |
| `/regulator/forecast` | `GET` | Trend slopes + emerging blind spot warnings with episode countdown |
| `/regulator/calibration` | `GET` | Overconfidence / underconfidence per fraud type |
| `/regulator/predict` | `POST` | Score a Regulator blind-spot prediction |
| `/regulator/demo_seed` | `POST` | Seed tracker with realistic demo data |
---
## Tech Stack
<div align="center">
| Layer | Technology |
|:---|:---|
| **Environment** | [OpenEnv](https://github.com/meta-pytorch/OpenEnv) · FastAPI · Pydantic v2 |
| **UI** | Gradio 4.x (mounted at `/web`) |
| **Deployment** | Docker · HuggingFace Spaces (vcpu-2 / 8 GB) |
| **Training** | [TRL GRPOTrainer](https://huggingface.co/docs/trl) · [Unsloth](https://github.com/unslothai/unsloth) |
| **Model** | `unsloth/Qwen2.5-1.5B-Instruct` · 4-bit QLoRA · r=16 · A100 |
| **Reward** | Live `/grader` endpoint on HF Space as verifier |
| **Session Mgmt** | Thread-safe `OrderedDict` · 200-session cap · LRU eviction |
| **Dynamic Difficulty** | Per-task rolling window (maxlen=10) that adjusts OCR intensity, batch size, discrepancy count |
</div>
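The session-management row describes a common pattern: an `OrderedDict` guarded by a lock, with the least recently used session evicted once the cap is hit. A minimal sketch follows; the class and method names are hypothetical and the server's actual code may differ.

```python
import threading
from collections import OrderedDict
from typing import Any, Optional


class SessionStore:
    """Hypothetical thread-safe LRU store with a fixed session cap."""

    def __init__(self, capacity: int = 200) -> None:
        self._capacity = capacity
        self._lock = threading.Lock()
        self._sessions: "OrderedDict[str, Any]" = OrderedDict()

    def put(self, session_id: str, state: Any) -> None:
        with self._lock:
            self._sessions[session_id] = state
            self._sessions.move_to_end(session_id)  # mark as most recent
            while len(self._sessions) > self._capacity:
                self._sessions.popitem(last=False)  # evict least recent

    def get(self, session_id: str) -> Optional[Any]:
        with self._lock:
            if session_id not in self._sessions:
                return None
            self._sessions.move_to_end(session_id)  # reads refresh recency
            return self._sessions[session_id]

    def __len__(self) -> int:
        with self._lock:
            return len(self._sessions)
```

Refreshing recency on reads means an episode that is still being stepped is unlikely to be evicted under load.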
---
## 🎭 Theme Alignment
<div align="center">
| Theme | Alignment | Evidence |
|:---:|:---|:---|
| **#4 Self-Improvement** (primary) | ✅ Core | Regulator detects blind spots → Generator biases toward them → Auditor improves → loop repeats |
| **#1 Multi-Agent Interactions** | ✅ Core | 5 agents with conflicting incentives: Generator vs Auditor adversarial self-play |
| **#1 Fleet AI Scalable Oversight** | ✅ Bonus | Regulator monitors Auditor cross-episode with predictive trend detection |
| **#3.1 Professional Tasks** | ✅ Core | Invoice + PO + vendor registry + supply chain = real enterprise AP workflow |
| **#2 Long-Horizon Planning** | ✅ Partial | `long_horizon` task: 20-step 4-phase investigation with multi-turn state |
</div>
---
## 👥 Team
<div align="center">
| | |
|:---:|:---:|
| **Pritam Satpathy** | **Gnana Nawin T** |
| [🤗 ps2181](https://huggingface.co/ps2181) | [🤗 gnananawin](https://huggingface.co/gnananawin) |
| Scaler School of Technology | Scaler School of Technology |

**Meta PyTorch OpenEnv Hackathon · Grand Finale · April 25–26, 2026 · Bangalore**
</div>
---
## All Links
<div align="center">
| Resource | Link |
|:---|:---|
| **Live Environment** | https://ps2181-invoice-processing-pipeline.hf.space |
| 🖥️ **Gradio Demo UI** | https://ps2181-invoice-processing-pipeline.hf.space/web |
| **API Documentation** | https://ps2181-invoice-processing-pipeline.hf.space/docs |
| **Metrics Dashboard** | https://ps2181-invoice-processing-pipeline.hf.space/metrics |
| **Blog Post** | https://github.com/ps2181/invoice-processing-pipeline/blob/main/BLOG.md |
| **Extractor Model** | https://huggingface.co/ps2181/extractor-lora-qwen2.5-1.5b |
| 🕵️ **Auditor Model** | https://huggingface.co/ps2181/auditor-lora-qwen2.5-1.5b |
| ⚡ **Generator Model** | https://huggingface.co/ps2181/generator-lora-qwen2.5-1.5b |
| **Training Colab (Auditor Agent)** | https://colab.research.google.com/drive/1C1_3giNt-NmbzKNFJr5_L1fms3L8LfmB |
| **Training Colab (Extractor Agent)** | https://colab.research.google.com/drive/1fxfBt13LjmT4m98pJq-b5B__1ytFeszK?usp=sharing |
| **Training Colab (Generator Agent)** | https://colab.research.google.com/drive/1O293_VBZQCthxlGpgvz5kxoty3zcsWGH?usp=sharing |
| 💻 **GitHub** | https://github.com/ps2181/invoice-processing-pipeline |
| 🎥 **Demo Video** | https://youtu.be/QSB4UOLvaC8?si=SGnIwsfTW4JGsU3e |
| 🧩 **OpenEnv Framework** | https://github.com/meta-pytorch/OpenEnv |
</div>
---
<div align="center">
<img src="https://capsule-render.vercel.app/api?type=waving&color=gradient&customColorList=6,11,20&height=100&section=footer&animation=twinkling" width="100%"/>
**Built with ❤️ for the Meta PyTorch OpenEnv Hackathon 2026**
*"The system that gets harder for itself, so the agent never stops learning."*
</div>