ps2181 committed on
Commit
bbe2575
·
1 Parent(s): 95d36d4

Update BLOG.md and README.md

Files changed (2)
  1. BLOG.md +139 -188
  2. README.md +458 -183
BLOG.md CHANGED
@@ -1,276 +1,227 @@
1
- # Invoice Processing Pipeline — Multi-Agent RL Environment for Financial Fraud Detection
2
 
3
- **Meta PyTorch OpenEnv Hackathon Grand Finale | April 25–26, 2026**
4
- **Team: Pritam Satpathy + Gnana Nawin T**
5
 
6
- ---
7
-
8
- ## The Problem
9
-
10
- Invoice fraud costs businesses an estimated 5% of annual revenue. Finance teams manually process thousands of invoices every month — extracting vendor names, dates, line items, totals — and checking them against purchase orders for discrepancies. The work is slow (hours per batch), error-prone (typos, OCR noise, format chaos), and gameable (phantom vendors, price gouging, duplicate submissions).
11
-
12
- We built an RL training environment that teaches LLMs to do this automatically — and improves itself when it discovers its own blind spots.
13
-
14
- ---
15
 
16
- ## What We Built
 
 
17
 
18
- An OpenEnv-compatible environment deployed on HuggingFace Spaces:
19
- **[https://ps2181-invoice-processing-pipeline.hf.space](https://ps2181-invoice-processing-pipeline.hf.space)**
20
 
21
  ---
22
 
23
- ## Architecture: 5-Agent System
24
 
25
- ```
26
- ┌─────────────────────────────────────────────────────────┐
27
- │ ADVERSARIAL REWARD (dashed) │
28
- │ │
29
- ▼ │
30
- ┌───────────────┐ │
31
- │ Generator │◄───── Regulator biases fraud type ◄────┐ │
32
- │ Creates fraud │ │ │
33
- └───────┬───────┘ │ │
34
- │ Raw invoice text │ │
35
- ▼ │ │
36
- ┌───────────────┐ │ │
37
- │ Extractor │ │ │
38
- │ Text → JSON │ │ │
39
- └───────┬───────┘ │ │
40
- │ Structured data ┌─────┴─────┐ │
41
- ▼ │ Regulator │ │
42
- ┌───────────────┐ │ Cross- │ │
43
- │ Auditor │────── decision history ────────►│ episode │ │
44
- │ Fraud detect │ │ meta-agent │ │
45
- └───────┬───────┘ └───────────┘ │
46
- │ Verdict + flags │
47
- ▼ │
48
- ┌───────────────┐ │
49
- │ Approver │────────────────────────────────────────────────┘
50
- │ Approve/reject │
51
- └───────┬───────┘
52
-
53
-
54
- ┌──────────────────────────────────────┐
55
- │ 4 Independent Reward Signals │
56
- │ Format · Field · Math · Completeness│
57
- └──────────────────────────────────────┘
58
- ```
59
 
60
- | Agent | Role | Reward Signal |
61
- |-------|------|---------------|
62
- | **Generator** | Creates clean or fraudulent invoices | Rewarded when fraud slips past Auditor (adversarial self-play) |
63
- | **Extractor** | Reads raw invoice text → structured JSON | 4 independent signals: format, field accuracy, math consistency, completeness |
64
- | **Auditor** | Reviews extraction, flags fraud | +0.99 correct detection, +0.90 clean clearance, 0.01 for miss/false positive |
65
- | **Approver** | Final approve/reject/escalate decision | +0.95 correct decision |
66
- | **Regulator** | Monitors Auditor blind spots across episodes | Precision + recall of blind spot predictions |
67
 
68
- ---
69
 
70
- ## The Key Innovation: The Regulator
71
 
72
- The Regulator is a cross-episode meta-agent that watches the Auditor's decision history over 30 episodes and identifies systematic failure patterns:
73
 
74
- ```
75
- AUDITOR PERFORMANCE TRACKER (last 30 episodes)
76
 
77
- Fraud Type Detection Rate
78
- ─────────────────────────────────────
79
- phantom_vendor 31% ⚠ BLIND SPOT
80
- price_gouging 74% ✓ OK
81
- math_fraud 81% ✓ OK
82
- duplicate_submission 62% ✓ OK
83
 
84
- False Positive Rate: 12% ✓ OK
85
 
86
- REGULATOR VERDICT: Recommend retraining on phantom_vendor
87
- ```
88
 
89
- When the Regulator detects a blind spot, the Generator automatically starts producing more of that fraud type, closing the self-improvement loop without human intervention.
90
 
91
- This directly addresses **Theme #1 (Fleet AI Scalable Oversight)** and **Theme #4 (Self-Improvement)**.
92
 
93
- ---
 
 
94
 
95
- ## 7 Tasks (Progressive Difficulty)
96
 
97
- | Task | Difficulty | What the Agent Does |
98
- |------|-----------|---------------------|
99
- | `easy` | Easy | Extract fields from a single clean invoice |
100
- | `medium` | Medium | Clean + normalise a batch of messy invoices (typos, date chaos, currency symbols) |
101
- | `hard` | Hard | Extract + reconcile against purchase orders, flag discrepancies |
102
- | `expert` | Expert | Fraud audit: classify phantom_vendor / price_gouging / math_fraud / duplicate_submission |
103
- | `adversarial` | Hard | Extract from OCR-corrupted invoice with SUBTOTAL trap and FX noise lines |
104
- | `negotiate` | Medium | Ask clarification questions then submit extraction (bonus for ≤2 questions) |
105
- | `supply_chain` | Expert | Detect quantity shortfalls, price spikes, phantom deliveries in delivery records |
106
 
107
  ---
108
 
109
- ## Design Decisions
110
-
111
- ### 4 Independent Reward Functions (Anti-Hacking)
112
 
113
- Per the hackathon guide: *"use multiple independent reward functions — if you only have one, it is easier for the model to hack it."*
114
 
115
- ```python
116
- format_reward() # Are all 5 required JSON keys present? weight: 0.10
117
- field_reward() # Do vendor/date/currency/total match? weight: 0.40
118
- math_reward() # Does qty × unit_price = amount for all items? weight: 0.25
119
- completeness_reward() # Are all line items present (recall)? weight: 0.25
120
- ```
 
121
 
122
- During training we observed the model maximising `math_reward` (0.97) and `completeness_reward` (1.0) while `field_reward` stayed at 0.00 — the model learned to output arithmetic-consistent JSON while hallucinating values. Our independent signals made this reward hacking immediately visible, confirming the design choice.
123
 
124
- ### Adversarial Self-Play
125
 
126
- The Generator is rewarded when its fraud evades the Auditor:
127
- - Fraud undetected, Approver approves → Generator reward: **0.85**
128
- - Auditor missed but Approver caught → Generator reward: **0.60**
129
- - Auditor caught it → Generator reward: **0.10**
130
 
131
- This creates evolutionary pressure: the Generator evolves harder-to-detect fraud, forcing the Auditor to improve.
132
 
133
- ### Dynamic Difficulty
134
 
135
- The environment tracks recent agent scores per task (rolling window of 10 episodes) and adjusts generation parameters:
136
- - Agent scoring ≥ 0.85 → harder parameters (more invoices, more OCR noise, more discrepancies)
137
- - Agent scoring < 0.60 → easier parameters
138
- - In between → standard
 
 
 
 
 
139
 
140
- ### All Rewards Clamped to (0.01, 0.99)
141
 
142
- Avoids `log(0)` in policy gradient and prevents the model from getting stuck at boundaries.
143
 
144
  ---
145
 
146
- ## Tech Stack
147
 
148
- ```
149
- Environment: FastAPI + OpenEnv-core + Pydantic
150
- Deployment: HuggingFace Spaces (Docker, port 7860)
151
- UI: Gradio (mounted at /web)
152
- Training: TRL GRPOTrainer + Unsloth (Qwen2.5-1.5B-Instruct, 4-bit QLoRA)
153
- Model: unsloth/Qwen2.5-1.5B-Instruct r=16 LoRA
154
- Reward: 4 local signals + live /grader endpoint on HF Space
155
- ```
156
-
157
- ---
158
 
159
- ## Training Setup
160
 
161
- GRPO (Group Relative Policy Optimization) with:
162
- - `num_generations = 4` — 4 completions per prompt, compared within group
163
- - `max_steps = 200`
164
- - `learning_rate = 5e-6`
165
- - Live `/grader` endpoint on HF Space as environment verifier
166
-
167
- The training loop:
168
  ```
169
- Colab samples episode → HF Space /reset → gets live invoice
170
- Model generates JSON extraction
171
- HF Space /grader scores it against ground truth
172
- GRPO updates model toward higher-scoring completions
173
  ```
174
 
175
- ---
176
 
177
- ## What Worked (Achievements)
178
 
179
- ### 1. Reward Hacking Detection Caught at Step 10
180
 
181
- The independent reward signals caught a classic reward hacking pattern immediately. The model maximised math and completeness while hallucinating field values. Without 4 independent signals, this would have been invisible behind a rising aggregate reward.
182
 
183
- | Step | Total Reward | Env Score | Format | Math |
184
- |------|-------------|-----------|--------|------|
185
- | 10 | 2.361 | 0.113 | 0.900 | 0.347 |
186
- | 20 | 2.595 | 0.282 | 0.900 | 0.413 |
187
- | 30 | 2.657 | 0.304 | 0.950 | 0.403 |
188
 
189
- Environment score rose **0.113 → 0.304 in 30 steps** — a 169% improvement in correct invoice extraction as scored by the live environment grader.
190
 
191
- ### 2. Live Environment as Verifier
 
 
 
 
192
 
193
- Training Colab directly calls `/grader` on the deployed HF Space: the environment IS the reward function. No separate reward model. Deterministic and reproducible.
194
 
195
- ### 3. Regulator Concept Validated
196
 
197
- The cross-episode tracking logic works: the Regulator correctly identifies `phantom_vendor` as the Auditor's weakest category and triggers Generator bias toward that fraud type. No other OpenEnv environment we've seen implements a cross-episode meta-agent.
198
 
199
- ### 4. Full 7-Task Ladder Deployed
200
 
201
- All 7 tasks are live on the HF Space with independent graders, schemas, and difficulty calibration. The progressive structure directly supports curriculum learning.
202
 
203
- ### 5. Clean OpenEnv API Compliance
 
 
 
 
204
 
205
- Standard `reset()` / `step()` / `state()` interface, WebSocket support, Swagger docs at `/docs`, Gradio UI at `/web`. Drop-in compatible with any OpenEnv training script.
 
206
 
207
- ---
208
 
209
- ## Where We're Having Problems (Honest Assessment)
210
 
211
- ### 1. Field Reward Plateau
 
 
 
 
 
 
212
 
213
- The `field_reward` (vendor name, date, currency, total accuracy) remains the hardest signal for the 1.5B model to crack. Even at step 30, the environment score is 0.304 — meaning the model still hallucinates field values despite correct structure and math. We suspect this is a model capacity issue: Qwen2.5-1.5B may not have enough parameters to learn extraction patterns from raw OCR text in 200 steps.
214
 
215
- **Potential fix:** Switching to Qwen2.5-7B or adding a light SFT warmup phase with 50–100 correct extraction examples before RL.
216
 
217
- ### 2. Multi-Agent Coordination Not Yet Trained End-to-End
 
 
218
 
219
- The 5-agent architecture is designed and the environment supports it, but we haven't yet run the full adversarial training loop (Generator vs Auditor) end-to-end with GRPO. Currently, the Extractor is trained in isolation. The Regulator logic runs as environment-side code, not as a trainable agent.
220
 
221
- **Potential fix:** Implementing a two-phase training loop — Phase 1: train Extractor on easy/medium, Phase 2: train Auditor against Generator with Regulator feedback.
222
 
223
- ### 3. Compute Constraints
224
 
225
- 4-bit QLoRA on a Colab T4 limits batch sizes and generation counts. With `num_generations=4`, each step is slow enough that we couldn't push past ~50 steps in the available time. The reward curves are trending upward but haven't converged.
226
 
227
- **Potential fix:** Onsite compute credits (HF GPU Spaces) should allow `num_generations=8` and 500+ steps.
 
 
228
 
229
- ### 4. OCR Noise Robustness
 
230
 
231
- The `adversarial` task (trap-resistant extraction with SUBTOTAL/FX noise lines) works as an environment, but the model hasn't been trained on it yet. Early inference tests show the model consistently falls for fake SUBTOTAL lines.
 
 
 
 
232
 
233
- **Potential fix:** Adding adversarial examples to the curriculum after the model achieves ≥0.60 on `medium`.
 
 
 
 
 
 
 
 
234
 
235
  ---
236
 
237
- ## What Makes This Novel
238
 
239
- 1. **Regulator agent** — no other OpenEnv environment has a cross-episode meta-agent that monitors another agent for systematic cognitive blind spots
240
 
241
- 2. **Closed self-improvement loop**: Regulator detects blind spot → Generator biases fraud generation toward that type → Auditor forced to improve → no human intervention required
242
 
243
- 3. **Adversarial Generator arms race**: Generator rewarded for evading Auditor → creates evolutionary pressure on fraud detection
244
 
245
- 4. **Live environment as verifier** — training Colab directly calls `/grader` on deployed HF Space — the environment IS the reward function
246
 
247
- 5. **4 independent reward signals** — made reward hacking immediately visible during training (detected it at step 10)
248
 
249
- ---
250
 
251
- ## Theme Alignment
 
 
 
 
 
 
 
 
 
252
 
253
- | Theme | Alignment |
254
- |-------|-----------|
255
- | **#1 Multi-Agent** | 5 agents with conflicting incentives (Generator vs Auditor) |
256
- | **#1 Sub: Fleet AI Oversight** (bonus) | Regulator monitors Auditor cross-episode |
257
- | **#3.1 Professional Tasks** | Invoice processing = core enterprise workflow |
258
- | **#3.1 Sub: Scaler AI Labs** (bonus) | Multi-agent RL for enterprise financial workflows |
259
- | **#4 Self-Improvement** | Generator adapts based on Regulator blind spot findings |
260
 
261
  ---
262
 
263
- ## Links
264
 
265
- - **Live Environment:** [https://ps2181-invoice-processing-pipeline.hf.space](https://ps2181-invoice-processing-pipeline.hf.space)
266
- - **Gradio UI:** [https://ps2181-invoice-processing-pipeline.hf.space/web](https://ps2181-invoice-processing-pipeline.hf.space/web)
267
- - **API Docs:** [https://ps2181-invoice-processing-pipeline.hf.space/docs](https://ps2181-invoice-processing-pipeline.hf.space/docs)
268
- - **GitHub:** [https://github.com/ps2181/invoice-processing-pipeline](https://github.com/ps2181/invoice-processing-pipeline)
269
 
270
- ---
271
 
272
- ## Team
273
 
274
- **Pritam Satpathy** + **Gnana Nawin T**
275
- Meta PyTorch OpenEnv Hackathon Grand Finale
276
- Scaler School of Technology, Bangalore — April 25–26, 2026
 
1
+ <div align="center">
2
 
3
+ # When the System Learns to Pressure-Test Itself
 
4
 
5
+ **How we built a 5-agent adversarial RL environment that detects invoice fraud —**
6
+ **and automatically gets harder when it finds its own blind spots.**
 
 
 
 
 
 
 
7
 
8
+ <br/>
9
+ *Meta PyTorch OpenEnv Hackathon · Grand Finale · April 25–26, 2026*
10
+ *Pritam Satpathy & Gnana Nawin T · Scaler School of Technology, Bangalore*
11
 
12
+ </div>
 
13
 
14
  ---
15
 
16
+ ## The Problem Nobody Talks About
17
 
18
+ Invoice fraud is boring to talk about and devastating in practice.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
19
 
20
+ It costs businesses an estimated **5% of annual revenue**, and it doesn't announce itself — it hides in purchase order line items, disguised as rounding errors, vendor name typos, and suspiciously round numbers that only look wrong if you already know what to look for.
 
 
 
 
 
 
21
 
22
+ Finance teams today catch it manually. They compare thousands of invoices against purchase orders, cross-reference vendor registries, and flag anything that smells off. It's slow, it's error-prone, and critically — **it doesn't improve**. A human who misses phantom vendor fraud on Monday is statistically likely to miss it again on Friday.
23
 
24
+ We asked a different question:
25
 
26
+ > *What if you could build an LLM system that not only detects fraud, but gets better at detecting the exact fraud types it's currently failing on, automatically, without any human retraining the loop?*
27
 
28
+ That's what we built.
 
29
 
30
+ ---
 
 
 
 
 
31
 
32
+ ## The Core Idea: Make the System Pressure-Test Itself
33
 
34
+ Most multi-agent RL setups have agents that operate independently within a single episode. Ours doesn't.
 
35
 
36
+ We added a **cross-episode Regulator**: an agent that watches the Auditor across 30 rolling episodes, tracks which fraud types it's systematically missing, and quietly biases the Generator to produce more of those exact scenarios.
37
 
38
+ No human decides *"let's train more on phantom vendors."* The Regulator notices the detection rate for phantom vendors is at `31%` and trending downward, raises the alarm, and tells the Generator to send more phantom vendor invoices. **The loop closes itself.**
39
 
40
+ <div align="center">
41
+ <img width="1710" height="326" alt="image" src="https://github.com/user-attachments/assets/319654c3-aa24-47e8-9716-734d4e902168" />
42
+ </div>
43
 
 
44
 
45
+ The Auditor sees more of exactly what it's failing on. The Generator gets rewarded for finding those gaps. The Regulator earns points for predicting blind spots *before* they go critical. Every agent has skin in the game.
 
 
 
 
 
 
 
 
46
 
47
  ---
48
 
49
+ ## Five Agents, One Closed Loop
 
 
50
 
51
+ <div align="center">
52
 
53
+ | Agent | Role | Reward Signal |
54
+ |:---:|:---|:---|
55
+ | **Generator** | Creates clean or fraudulent invoices, biased by Regulator's blind-spot weights | `+0.85` evades both · `+0.60` evades Auditor · `+0.10` caught |
56
+ | **Extractor** | Raw invoice text → structured JSON | format `0.10` · field accuracy `0.40` · math `0.25` · completeness `0.25` |
57
+ | **Auditor** | Fraud classification with fraud type + confidence score | `+0.99` correct type · `+0.90` clean cleared · `+0.01` miss or FP |
58
+ | **Approver** | Final approve / escalate / reject, gated by confidence | `≥0.80` → reject · `0.50–0.80` → escalate · `<0.50` → approve |
59
+ | **Regulator** | Cross-episode meta-agent, 30-episode rolling window | precision `0.35` + recall `0.35` + no over-flagging `0.15` + early warning `0.15` |
60
 
61
+ </div>
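The Approver row in the table above describes a confidence gate rather than a learned reward. A minimal sketch of that gate follows. The function name `approver_decision` and the `flagged` argument are hypothetical, and the sketch assumes the thresholds apply to the Auditor's confidence on invoices it has flagged as fraud, with unflagged invoices passing straight through.

```python
def approver_decision(flagged: bool, confidence: float) -> str:
    """Map an Auditor verdict to a final action using the table's thresholds.

    Assumption: `confidence` is the Auditor's confidence in a fraud flag.
    Invoices the Auditor did not flag are approved without gating.
    """
    if not flagged:
        return "approve"
    if confidence >= 0.80:
        return "reject"      # high-confidence fraud flag
    if confidence >= 0.50:
        return "escalate"    # uncertain flag goes to a human
    return "approve"         # weak flag is not acted on
```

Treating exactly `0.50` as "escalate" is a choice made here; the table only states `<0.50` for approval.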
62
 
63
+ The **Regulator** is the part that makes this genuinely different. Most RL environments treat each episode as independent. The Regulator sits outside that — accumulating detection rates, computing trend slopes over 5-episode windows, and warning of *emerging* blind spots before they go critical. It's proactive oversight, not reactive retraining.
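The trend-slope idea in the paragraph above can be sketched as an ordinary least-squares slope over the most recent five detection rates. This is an illustration only: the names `trend_slope` and `classify`, the critical rate of `0.50`, and the slope cutoff of `-0.02` are assumptions, not values taken from the environment's code.

```python
SLOPE_WINDOW = 5          # episodes per trend window (from the post)
CRITICAL_RATE = 0.50      # hypothetical: below this a fraud type is a blind spot
EMERGING_SLOPE = -0.02    # hypothetical: steeper decline than this triggers a warning


def trend_slope(rates: list[float]) -> float:
    """Least-squares slope of the last SLOPE_WINDOW detection rates."""
    ys = rates[-SLOPE_WINDOW:]
    n = len(ys)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(ys))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def classify(rates: list[float]) -> str:
    """Label one fraud type as a current blind spot, an emerging one, or fine."""
    if rates[-1] < CRITICAL_RATE:
        return "blind_spot"
    if trend_slope(rates) < EMERGING_SLOPE:
        return "emerging"
    return "ok"
```

A fraud type whose rate is still acceptable but falling steadily is labelled `"emerging"`, which is the case the early warning is meant to catch.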
64
 
65
+ ---
 
 
 
66
 
67
+ ## Seven Tasks, One Curriculum
68
 
69
+ <div align="center">
70
 
71
+ | # | Task | What the Agent Faces | Difficulty |
72
+ |:---:|:---|:---|:---:|
73
+ | 1 | `easy` | Single clean invoice — extract 5 fields | Easy |
74
+ | 2 | `medium` | Batch with date chaos, vendor typos, currency noise | Medium |
75
+ | 3 | `hard` | Extraction + PO reconciliation — flag overcharges, missing items | Hard |
76
+ | 4 | `expert` | Full fraud audit across all four fraud types | Expert |
77
+ | 5 | `adversarial` | OCR corruption, SUBTOTAL traps, fake TAX/FX noise lines | Expert |
78
+ | 6 | `negotiate` | Ask clarifying questions first (bonus for ≤2), then extract | Medium |
79
+ | 7 | `supply_chain` | Detect quantity shortfalls, price spikes, phantom deliveries | Expert |
80
 
81
+ </div>
82
 
83
+ The difficulty also adjusts **dynamically** based on the agent's rolling score. Score above `0.85`? The next batch gets heavier OCR corruption, more PO discrepancies, deeper adversarial traps. Drop below `0.60`? It eases off. The agent is always working at its productive edge.
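A rough sketch of how the rolling-score adjustment described above could work. The class name `DifficultyTracker` and the use of a plain mean are assumptions. The 10-episode window comes from the earlier version of this post and is assumed to still apply, and the sketch follows that version in treating a score of exactly `0.85` as "harder".

```python
from collections import deque


class DifficultyTracker:
    """Track recent scores for one task and pick a difficulty tier."""

    def __init__(self, window: int = 10):
        # deque with maxlen drops the oldest score automatically
        self.scores = deque(maxlen=window)

    def record(self, score: float) -> None:
        self.scores.append(score)

    def tier(self) -> str:
        if not self.scores:
            return "standard"  # no history yet
        mean = sum(self.scores) / len(self.scores)
        if mean >= 0.85:
            return "harder"    # more OCR noise, more discrepancies
        if mean < 0.60:
            return "easier"
        return "standard"
```

Keeping one tracker per task means a strong `easy` score does not make `adversarial` harder.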
84
 
85
  ---
86
 
87
+ ## The Part Where We Caught Our Own Reward Hacking
88
 
89
+ This was the most interesting moment in the project.
 
 
 
 
 
 
 
 
 
90
 
91
+ At training step 10, we had:
92
 
 
 
 
 
 
 
 
93
  ```
94
+ math_consistency: 0.97
95
+ completeness: 1.00
96
+ field_accuracy: 0.00 :( ← hallucinating every actual value
 
97
  ```
98
 
99
+ The model had figured out that it could score well by outputting JSON that was *arithmetically correct* — quantities times unit prices summed to the totals perfectly — while **hallucinating every actual value**. Vendor name: made up. Date: made up. Currency: made up. All internally consistent. All completely wrong.
100
 
101
+ This is reward hacking. A single aggregated reward would have happily reported high performance and called it a day.
102
 
103
+ Our four **independent** reward signals made the failure immediately visible. We could see exactly which signal the model had learned to game and which it was ignoring.
104
 
105
+ > **That's the entire argument for independent reward functions: not just diversity, but diagnosability.**
106
 
107
+ We adjusted training emphasis. By step 30, field accuracy had climbed from `0.00` to `0.30+` while math consistency stayed stable.
 
 
 
 
108
 
109
+ <div align="center">
110
 
111
+ | Step | Total Reward | Env Score | Format | Math Consistency |
112
+ |:---:|:---:|:---:|:---:|:---:|
113
+ | 10 | 2.361 | 0.113 | 0.900 | 0.347 |
114
+ | 20 | 2.595 | 0.282 | 0.900 | 0.413 |
115
+ | 30 | 2.657 | **0.304** | **0.950** | 0.403 |
116
 
117
+ **Environment score: `0.113 → 0.304` in 30 steps, a 169% improvement in live-graded extraction accuracy.**
118
 
119
+ </div>
120
 
121
+ ---
122
 
123
+ ## The Reward Architecture
124
 
125
+ ### 🔍 Extractor: 4 Independent Signals
126
 
127
+ ```python
128
+ reward_format(extracted) # weight 0.10 — all 5 required JSON keys present?
129
+ reward_field_accuracy(extracted, gt) # weight 0.40 — vendor / date / currency / total match?
130
+ reward_math_consistency(extracted) # weight 0.25 — qty × unit_price = amount per line?
131
+ reward_completeness(extracted, gt) # weight 0.25 — all expected line items present?
132
 
133
+ # All clamped to (0.01, 0.99): no log(0), no gradient collapse at boundaries
134
+ ```
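To make the weighting concrete, here is a minimal sketch of how four per-signal scores could be combined and clamped. It assumes each signal returns a value in `[0, 1]` that is scaled by its weight before summing; the helper names `clamp` and `combine` are hypothetical and not taken from the repository.

```python
WEIGHTS = {
    "format": 0.10,
    "field_accuracy": 0.40,
    "math_consistency": 0.25,
    "completeness": 0.25,
}


def clamp(x: float, lo: float = 0.01, hi: float = 0.99) -> float:
    """Keep a reward strictly inside (0, 1)."""
    return max(lo, min(hi, x))


def combine(signals: dict[str, float]) -> float:
    """Weighted sum of the four signals, clamped to [0.01, 0.99]."""
    total = sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)
    return clamp(total)


# The step-10 pattern from the post: consistent math, hallucinated fields.
hacked = {"format": 1.0, "field_accuracy": 0.0,
          "math_consistency": 0.97, "completeness": 1.0}
aggregate = combine(hacked)
```

Under these assumptions the aggregate for the step-10 pattern is about `0.59`, a middling number that says nothing about `field_accuracy` being zero. Logging the four values separately is what exposes it.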
135
 
136
+ ### Auditor — Precision-Weighted
137
 
138
+ <div align="center">
139
 
140
+ | Outcome | Reward | Why |
141
+ |:---|:---:|:---|
142
+ | Correct fraud type detected | **0.99** | Rewards precise classification, not just flagging |
143
+ | Clean invoice correctly approved | **0.90** | Keeps false-positive rate honest |
144
+ | Compound fraud — one of two types caught | **0.65** | Partial credit prevents discouragement on hard cases |
145
+ | Fraud flagged but wrong type | **0.50** | Penalises sloppiness while crediting intent |
146
+ | Miss or false positive | **0.01** | Near-zero punishes both failure modes symmetrically |
147
 
148
+ </div>
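The outcome table above can be read as a small lookup. The sketch below is one way to encode it, assuming the ground truth and the Auditor's prediction are both sets of fraud-type labels; the function name `auditor_reward` is hypothetical, and extra predicted types beyond the true ones are ignored for simplicity.

```python
def auditor_reward(true_types: set[str], predicted: set[str]) -> float:
    """Score one Auditor verdict against ground truth using the table's values."""
    if not true_types:
        # Clean invoice: cleared correctly, or a false positive.
        return 0.90 if not predicted else 0.01
    if not predicted:
        return 0.01                  # fraud missed entirely
    hits = true_types & predicted
    if hits == true_types:
        return 0.99                  # every true fraud type identified
    if hits:
        return 0.65                  # compound fraud, one of two caught
    return 0.50                      # flagged, but wrong type
```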
149
 
150
+ ### Regulator: Cross-Episode
151
 
152
+ ```
153
+ Total = Precision(0.35) + Recall(0.35) + No-over-flagging(0.15) + Early-warning-bonus(0.15)
154
+ ```
155
 
156
+ The early-warning bonus rewards the Regulator for predicting emerging blind spots *before* detection rates cross the critical threshold: proactive oversight, not reactive alarm.
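As a worked illustration of the weighted total, the sketch below computes precision and recall from sets of predicted and actual blind spots and adds the two remaining terms as scores in `[0, 1]`. How the environment actually scores over-flagging and early warnings is not stated here, so those two inputs, the empty-set conventions, and the function names are all assumptions.

```python
def precision_recall(predicted: set[str], actual: set[str]) -> tuple[float, float]:
    """Precision and recall of blind-spot predictions (empty-set handling is assumed)."""
    tp = len(predicted & actual)
    if predicted:
        precision = tp / len(predicted)
    else:
        precision = 1.0 if not actual else 0.0
    recall = tp / len(actual) if actual else 1.0
    return precision, recall


def regulator_reward(predicted: set[str], actual: set[str],
                     no_over_flagging: float, early_warning: float) -> float:
    """Total = 0.35*precision + 0.35*recall + 0.15*no-over-flagging + 0.15*early-warning."""
    precision, recall = precision_recall(predicted, actual)
    return (0.35 * precision + 0.35 * recall
            + 0.15 * no_over_flagging + 0.15 * early_warning)
```

For example, correctly naming `phantom_vendor` as the only blind spot with no over-flagging but no early warning would score `0.35 + 0.35 + 0.15 = 0.85` under these assumptions.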
157
 
158
+ ---
159
 
160
+ ## Building With OpenEnv
161
 
162
+ The environment is a FastAPI app deployed on HuggingFace Spaces, exposing the standard OpenEnv interface. The training Colab connects directly to the live Space: `/grader` *is* the reward function. There's no separate scoring script. **The environment and the verifier are the same thing.**
163
 
164
+ ```bash
165
+ # Start an episode
166
+ POST /reset {"task_id": "expert"}
167
 
168
+ # Submit an extraction or audit result
169
+ POST /step {"episode_id": "...", "extracted_data": {...}}
170
 
171
+ # Check Regulator state anytime
172
+ GET /regulator/report # detection rates, blind spots, generator bias weights
173
+ GET /regulator/forecast # trend slopes, emerging blind spots with early warnings
174
+ GET /regulator/calibration # overconfidence / underconfidence per fraud type
175
+ ```
176
 
177
+ Training uses **GRPO via TRL** with **Unsloth-optimised 4-bit QLoRA** on `Qwen2.5-1.5B-Instruct`, with three separate LoRA adapters for Extractor, Auditor, and Generator, each trained on its own reward signal.
178
+
179
+ ```
180
+ Colab → /reset (fresh synthetic invoice from live environment)
181
+ → model generates JSON extraction
182
+ → /grader scores against ground truth
183
+ → GRPO updates weights toward higher-reward completions
184
+ → repeat 200 steps
185
+ ```
186
 
187
  ---
188
 
189
+ ## What We Learned
190
 
191
+ **Reward design is product design.** Every reward function is a specification for the behaviour you actually want. Getting the Auditor reward right (catching the *right* fraud type earns `0.99`, the *wrong* type earns `0.50`, and missing entirely earns `0.01`) took more thinking than most of the engineering.
192
 
193
+ **Multiple reward signals are diagnostics, not just incentives.** We didn't add four signals to the Extractor because the theory said to. We added them because we wanted to *see* where the model was failing. They paid off immediately at step 10.
194
 
195
+ **Cross-episode agents change what's possible.** The Regulator couldn't exist in a single-episode design. Most RL environments treat each episode as independent. Giving one agent access to the history of another creates a fundamentally different kind of oversight — one that looks less like evaluation and more like a genuine colleague watching your back.
196
 
197
+ ---
198
 
199
+ ## Try It
200
 
201
+ <div align="center">
202
 
203
+ | Resource | Link |
204
+ |:---|:---|
205
+ | **Live Environment** | [ps2181-invoice-processing-pipeline.hf.space](https://ps2181-invoice-processing-pipeline.hf.space) |
206
+ | **Gradio Demo UI** | [/web](https://ps2181-invoice-processing-pipeline.hf.space/web) |
207
+ | **API Docs** | [/docs](https://ps2181-invoice-processing-pipeline.hf.space/docs) |
208
+ | **Training Colab** | [Open notebook](https://colab.research.google.com/drive/1C1_3giNt-NmbzKNFJr5_L1fms3L8LfmB) |
209
+ | **GitHub** | [invoice-processing-pipeline](https://github.com/ps2181/invoice-processing-pipeline) |
210
+ | **Extractor Model** | [ps2181/extractor-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/extractor-lora-qwen2.5-1.5b) |
211
+ | **Auditor Model** | [ps2181/auditor-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/auditor-lora-qwen2.5-1.5b) |
212
+ | **Generator Model** | [ps2181/generator-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/generator-lora-qwen2.5-1.5b) |
213
 
214
+ </div>
 
 
 
 
 
 
215
 
216
  ---
217
 
218
+ <div align="center">
219
 
220
+ *Built for the Meta PyTorch OpenEnv Hackathon 2026.*
221
+ *Theme alignment: Multi-Agent Interactions (#1) · Fleet AI Scalable Oversight (#1 bonus) · Professional Tasks (#3.1) · Self-Improvement (#4)*
 
 
222
 
223
+ <br/>
224
 
225
+ **Pritam Satpathy & Gnana Nawin T · Scaler School of Technology · Bangalore**
226
 
227
+ </div>
 
 
README.md CHANGED
@@ -1,256 +1,531 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
  ---
2
- title: Invoice Processing Pipeline
3
- emoji: 🧾
4
- colorFrom: blue
5
- colorTo: indigo
6
- sdk: docker
7
- app_port: 7860
8
- tags:
9
- - openenv
10
- - multi-agent
11
- - grpo
12
- - rlhf
13
- - fraud-detection
14
- - invoice
15
  ---
16
 
17
- # 🧾 Invoice Processing Pipeline — Self-Improving Adversarial Fraud Detection
18
 
19
- > **Meta PyTorch OpenEnv Hackathon** · Team: Pritam Satpathy & Gnana Nawin T
20
- >
21
- > **Primary theme: #4 Self-Improvement** · **Secondary: #1 Multi-Agent Interactions**
 
 
 
 
 
 
22
 
23
- **Live Demo** → [ps2181-invoice-processing-pipeline.hf.space/web](https://ps2181-invoice-processing-pipeline.hf.space/web)
24
- **API Docs** → [ps2181-invoice-processing-pipeline.hf.space/docs](https://ps2181-invoice-processing-pipeline.hf.space/docs)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
25
 
26
  ---
27
 
28
- ## The Core Idea
29
 
30
- > *A system that continuously generates harder challenges targeting its own weakest points.*
31
 
32
- Most fraud detection pipelines are static. Ours **gets harder for itself over time**: the Regulator finds where the Auditor keeps failing, the Generator exploits those exact blind spots in the next episode, the Auditor's new mistakes update the Regulator — and the loop closes.
 
 
 
 
 
 
 
 
33
 
34
  ---
35
 
36
- ## 5-Agent Architecture
37
 
38
- ```mermaid
39
- graph LR
40
- R[🎯 Regulator\nDetects blind spots\nUpdates weights] -->|bias weights| G[⚡ Generator\nCreates adversarial\ninvoices]
41
- G -->|raw invoice text| E[🔍 Extractor\nParses structured\nJSON fields]
42
- E -->|structured data| A[🕵️ Auditor\nFlags fraud with\nconfidence scores]
43
- A -->|audit results| AP[✅ Approver\nApprove / Escalate\n/ Reject]
44
- AP -->|episode outcome| R
45
- A -->|missed fraud types| R
46
- ```
 
 
 
 
 
 
 
 
47
 
48
- Each agent has **independent reward signals**: no shared objective, genuine multi-agent dynamics:
 
 
 
 
49
 
50
- | Agent | Role | Reward signal |
51
- |-------|------|---------------|
52
- | **Regulator** | Oversight: detects Auditor blind spots, reweights Generator | Precision + Recall + Early-warning bonus |
53
- | **Generator** | Adversary: creates invoices biased toward blind spots | Evasion rate (0.85 evades both, 0.10 if caught) |
54
- | **Extractor** | Parser: structured JSON extraction with 4 signals | Format + Field accuracy + Math + Completeness |
55
- | **Auditor** | Detector: fraud classification with confidence | 0.99 correct type, 0.90 clean, 0.01 miss |
56
- | **Approver** | Gatekeeper: final approve/escalate/reject | Rule-based (confidence threshold) |
57
 
58
  ---
59
 
60
- ## Three Novel Features
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
61
 
62
- | Feature | What it does |
63
- |---------|-------------|
64
- | **Predictive Regulator** | Computes trend slope over 5-episode windows — warns of *emerging* blind spots before they become critical, not just current ones |
65
- | **Compound Fraud** | Invoices can carry two simultaneous fraud signals (e.g. phantom vendor + price gouging). Partial credit for catching one; full reward for both |
66
- | **Confidence Calibration** | Tracks (confidence, correct?) pairs per fraud type. Flags *overconfident misses* — Auditor saying "90% sure, approved" on a fraudulent invoice — the most dangerous failure mode |
 
 
 
 
 
 
 
 
 
67
 
68
  ---
69
 
70
- ## Training Results — GRPO on Live Environment
 
 
 
 
 
 
 
 
 
71
 
72
- All 3 agents trained with **TRL GRPOTrainer + Unsloth** using the deployed HF Space as the live reward verifier:
 
73
 
74
- | Agent | Baseline | Best Achieved | Notes |
75
- |-------|----------|--------------|-------|
76
- | **Extractor** | 0.10 (random) | **0.914** live grader score | Peaked step 15; crashed due to `_MAX_SESSIONS=50` bug (fixed to 200) |
77
- | **Auditor** | 0.01 (dead signal) | **0.719** total reward | Run 1 had dead live reward (episode_id list bug); Run 2 fixed → 0.01→0.52 |
78
- | **Generator** | — | Format learned (~0.22) | Live evasion reward had same bug; format/plausibility reward improved |
79
 
80
- **Training setup:** Qwen2.5-1.5B-Instruct, 4-bit QLoRA r=16, Unsloth + TRL, Google Colab A100
 
 
 
81
 
82
- ### Auditor Training Log (Run 2 — exact data)
 
 
 
 
 
 
83
 
84
- | Step | Total Reward | Live Env Reward | ±Std |
85
- |------|-------------|----------------|------|
86
- | 5 | 0.4828 | 0.2828 | ±0.194 |
87
- | 10 | **0.7188** | **0.5188** | ±0.239 |
88
- | 15 | 0.4538 | 0.2538 | ±0.123 |
89
- | 20 | 0.5733 | 0.3733 | ±0.212 |
90
- | 25 | 0.5325 | 0.3325 | ±0.232 |
91
- | 30 | 0.6038 | 0.4038 | ±0.147 |
92
 
93
- *Run 1 (dead signal): live env reward = 0.010 flat across all steps (episode_id list bug — TRL passes episode_id as a list, old code sent the whole list to the server instead of indexing per completion)*
 
 
 
 
94
 
95
  ---
96
 
97
- ## Trained LoRA Agents
98
 
99
- | Agent | HF Hub |
100
- |-------|--------|
101
- | Extractor | [ps2181/extractor-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/extractor-lora-qwen2.5-1.5b) |
102
- | Auditor | [ps2181/auditor-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/auditor-lora-qwen2.5-1.5b) |
103
- | Generator | [ps2181/generator-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/generator-lora-qwen2.5-1.5b) |
 
 
 
 
 
 
104
 
105
  ---
106
 
107
- ## Sample Episode Trace
 
 
108
 
109
  ```
110
- ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
111
- MULTI-AGENT PIPELINE · LIVE EPISODE
112
- ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
113
-
114
- 🎯 STEP 0 — REGULATOR
115
- ────────────────────────────────────────────────
116
- Blind spots detected : phantom_vendor
117
- Fraud weights → next episode:
118
- phantom_vendor 60% ▓▓▓▓▓▓▓▓▓▓▓▓▓▓ ← prioritised
119
- price_gouging 20% ▓▓▓▓▓
120
- math_fraud 10% ▓▓
121
- duplicate_submission 10% ▓▓
122
-
123
- ⚡ STEP 1 — GENERATOR (Qwen2.5 LoRA)
124
- ────────────────────────────────────────────────
125
- Episode : ep_8f3a2c…
126
- Invoices : 3
127
- Fraud focus : Phantom Vendor (60% Regulator weight)
128
-
129
- 🔍 STEP 2 — EXTRACTOR (Qwen2.5 LoRA)
130
- ────────────────────────────────────────────────
131
- Vendor : ShadowByte Technologies ← not in registry
132
- Total : $12,450.00
133
- Reward : 0.847 [format 0.10 field 0.38 math 0.25 completeness 0.12]
134
-
135
- 🕵️ STEP 3 — AUDITOR (Qwen2.5 LoRA)
136
- ────────────────────────────────────────────────
137
- INV-85529 → 🚨 FLAGGED [PHANTOM VENDOR] conf=0.91
138
- INV-85530 → ✅ APPROVED conf=0.88
139
- Mean reward : 0.623
140
-
141
- ✅ STEP 4 — APPROVER
142
- ────────────────────────────────────────────────
143
- INV-85529 → ❌ REJECT
144
- INV-85530 → ✅ APPROVE
145
-
146
- Generator adversarial reward : 0.60 (evaded Auditor on 1/3, Approver caught)
147
-
148
- 🎯 STEP 5 — REGULATOR UPDATE
149
- ────────────────────────────────────────────────
150
- phantom_vendor detection improved: 31% → 45%
151
- Generator weights updated for next episode
152
- ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
153
  ```
154
 
155
  ---
156
 
157
- ## Reward Signals
158
 
159
- ### Extractor (4 independent signals)
160
- | Signal | Max | What it measures |
161
- |--------|-----|-----------------|
162
- | Format | 0.10 | Required fields present |
163
- | Field accuracy | 0.40 | Vendor / date / currency / total correct |
164
- | Math consistency | 0.25 | qty × unit_price = amount, sum = total |
165
- | Completeness | 0.25 | All line items captured |
166
 
167
- ### Auditor
168
- | Outcome | Reward |
169
- |---------|--------|
170
- | Correct fraud type detected | 0.99 |
171
- | Clean invoice correctly approved | 0.90 |
172
- | Compound fraud — one type caught | 0.65 |
173
- | Fraud detected, wrong type | 0.50 |
174
- | Miss or false positive | 0.01 |
175
-
176
- ### Generator (adversarial)
177
- | Outcome | Reward |
178
- |---------|--------|
179
- | Evades both Auditor and Approver | 0.85 |
180
- | Evades Auditor, Approver catches | 0.60 |
181
- | Auditor catches it | 0.10 |
182
 
183
- ### Regulator
184
- Precision (0.35) + Recall (0.35) + No over-flagging (0.15) + Early warning bonus (0.15)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
185
 
186
  ---
187
 
188
- ## API Endpoints
189
 
190
  ### Core OpenEnv
 
191
  | Endpoint | Method | Description |
192
- |----------|--------|-------------|
193
- | `/reset` | POST | Start episode (`{"task_id": "easy\|medium\|hard\|expert\|adversarial\|negotiate\|supply_chain"}`) |
194
- | `/step` | POST | Submit extracted data, get reward + feedback |
195
- | `/grader`| POST | Score without modifying state |
196
- | `/state` | GET | Episode metadata |
197
- | `/health`| GET | Health check |
198
- | `/ws` | WS | WebSocket interface |
 
 
199
 
200
  ### Multi-Agent
 
201
  | Endpoint | Method | Description |
202
- |----------|--------|-------------|
203
- | `/multi/reset` | POST | Start 5-agent episode, Generator biased by Regulator |
204
- | `/multi/extract` | POST | Score Extractor output (4 signals) |
205
- | `/multi/audit` | POST | Score Auditor output, update tracker |
206
- | `/multi/approve` | POST | Run Approver, compute Generator reward |
 
207
 
208
  ### Regulator
 
209
  | Endpoint | Method | Description |
210
- |----------|--------|-------------|
211
- | `/regulator/report` | GET | Detection rates, blind spots, weights |
212
- | `/regulator/forecast` | GET | Predictive trend analysis |
213
- | `/regulator/calibration` | GET | Confidence calibration per fraud type |
214
- | `/regulator/predict` | POST | Score Regulator blind spot predictions |
 
 
215
 
216
  ---
217
 
218
- ## Quick Start
219
 
220
- ```bash
221
- # Health check
222
- curl https://ps2181-invoice-processing-pipeline.hf.space/health
223
 
224
- # Start a multi-agent episode
225
- curl -X POST https://ps2181-invoice-processing-pipeline.hf.space/multi/reset
 
 
 
 
 
 
 
 
226
 
227
- # Get Regulator blind spot report
228
- curl https://ps2181-invoice-processing-pipeline.hf.space/regulator/report
229
 
230
- # Predictive forecast
231
- curl https://ps2181-invoice-processing-pipeline.hf.space/regulator/forecast
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
232
  ```
233
 
234
  ---
235
 
236
- ## Fraud Types
 
 
237
 
238
- | Type | Description |
239
- |------|-------------|
240
- | `phantom_vendor` | Vendor not in the Approved Vendor Registry |
241
- | `price_gouging` | Unit price > 150% of market max |
242
- | `math_fraud` | Invoice total ≠ sum of line items |
243
- | `duplicate_submission` | Same invoice_id or vendor+date+total already seen |
244
- | `compound_fraud` | Two fraud signals in one invoice |
 
 
245
 
246
  ---
247
 
248
- ## Links
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
249
 
250
- - **Live Demo**: [ps2181-invoice-processing-pipeline.hf.space/web](https://ps2181-invoice-processing-pipeline.hf.space/web)
251
- - **API Docs**: [ps2181-invoice-processing-pipeline.hf.space/docs](https://ps2181-invoice-processing-pipeline.hf.space/docs)
252
- - **GitHub**: [github.com/ps2181/invoice-processing-pipeline](https://github.com/ps2181/invoice-processing-pipeline)
253
- - **OpenEnv**: [github.com/meta-pytorch/OpenEnv](https://github.com/meta-pytorch/OpenEnv)
254
- - **Extractor LoRA**: [ps2181/extractor-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/extractor-lora-qwen2.5-1.5b)
255
- - **Auditor LoRA**: [ps2181/auditor-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/auditor-lora-qwen2.5-1.5b)
256
- - **Generator LoRA**: [ps2181/generator-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/generator-lora-qwen2.5-1.5b)
 
1
+ <div class="card">
2
+ <div class="card-header">
3
+ <div class="card-header-dot"></div>
4
+ <span class="card-header-title"></span>
5
+ </div>
6
+ <!-- yaml rows + tag rows + footer badges -->
7
+ </div>
8
+ <div align="center">
9
+
10
+ <!-- Animated header banner -->
11
+ <img src="https://capsule-render.vercel.app/api?type=waving&color=gradient&customColorList=6,11,20&height=200&section=header&text=Invoice%20Processing%20Pipeline&fontSize=40&fontColor=fff&animation=twinkling&fontAlignY=35&desc=Self-Improving%20Multi-Agent%20Fraud%20Detection%20%7C%20OpenEnv%20%2B%20GRPO%20%2B%20Qwen2.5&descAlignY=55&descSize=16" width="100%"/>
12
+
13
+ <!-- Badges row 1 -->
14
+ <p>
15
+ <a href="https://ps2181-invoice-processing-pipeline.hf.space/web">
16
+ <img src="https://img.shields.io/badge/🚀%20Live%20Demo-HuggingFace%20Spaces-FF9D00?style=for-the-badge&logo=huggingface&logoColor=white" />
17
+ </a>
18
+ <a href="https://colab.research.google.com/drive/1C1_3giNt-NmbzKNFJr5_L1fms3L8LfmB">
19
+ <img src="https://img.shields.io/badge/Training%20Colab-Open%20Notebook-F9AB00?style=for-the-badge&logo=googlecolab&logoColor=white" />
20
+ </a>
21
+ <a href="https://ps2181-invoice-processing-pipeline.hf.space/docs">
22
+ <img src="https://img.shields.io/badge/API%20Docs-FastAPI-009688?style=for-the-badge&logo=fastapi&logoColor=white" />
23
+ </a>
24
+ </p>
25
+
26
+ <!-- Badges row 2 -->
27
+ <p>
28
+ <img src="https://img.shields.io/badge/Framework-OpenEnv-1A356E?style=for-the-badge" />
29
+ <img src="https://img.shields.io/badge/Model-Qwen2.5--1.5B%20+%20LoRA%20r%3D16-8B1A4E?style=for-the-badge" />
30
+ <img src="https://img.shields.io/badge/Training-GRPO%20+%20Unsloth-00A67E?style=for-the-badge" />
31
+ <img src="https://img.shields.io/badge/Agents-5%20Adversarial-E44D26?style=for-the-badge" />
32
+ </p>
33
+
34
+ <!-- Badges row 3 -->
35
+ <p>
36
+ <img src="https://img.shields.io/badge/Tasks-7%20Progressive-6C3483?style=for-the-badge" />
37
+ <img src="https://img.shields.io/badge/Deployment-Docker%20%7C%20HF%20Spaces-0D1117?style=for-the-badge&logo=docker" />
38
+ <img src="https://img.shields.io/badge/License-MIT-green?style=for-the-badge" />
39
+ <img src="https://img.shields.io/badge/Hackathon-Meta%20PyTorch%202026-FF6B35?style=for-the-badge" />
40
+ </p>
41
+
42
+ <br/>
43
+
44
+ > **Meta PyTorch OpenEnv Hackathon — Grand Finale · April 25–26, 2026**
45
+ >
46
+ > Team: **Pritam Satpathy** & **Gnana Nawin T** · Scaler School of Technology, Bangalore
47
+
48
+ <br/>
49
+
50
+ <!-- Animated typing headline -->
51
+ <a href="https://git.io/typing-svg">
52
+ <img src="https://readme-typing-svg.demolab.com?font=Fira+Code&weight=600&size=22&pause=1000&color=007A87&center=true&vCenter=true&width=750&lines=5-Agent+Adversarial+Fraud+Detection+System;Self-Improving+via+Cross-Episode+Regulator;GRPO-Trained+LoRA+Agents+on+Live+Environment;Invoice+%E2%86%92+Extract+%E2%86%92+Audit+%E2%86%92+Approve+%E2%86%92+Improve" alt="Typing SVG" />
53
+ </a>
54
+
55
+ </div>
56
+
57
  ---
58
+
59
+ ## 🔥 What Makes This Different
60
+
61
+ > Most multi-agent systems are **static pipelines**. Ours **gets harder for itself over time**.
62
+
63
+ The system contains a **Predictive Regulator**, a cross-episode meta-agent that monitors the Auditor over a rolling 30-episode window, detects the fraud types it systematically fails on (**blind spots**), and **automatically biases the Generator** to produce more of exactly those fraud types. No human intervention. No manual curriculum design. The system pressure-tests its own weakest point every episode.
64
+
65
+ <div align="center">
66
+ <img width="1462" height="731" alt="image" src="https://github.com/user-attachments/assets/7d863b87-1921-45f5-8d94-a06ba3ed6fc1" />
67
+ </div>
68
+
 
 
69
  ---
70
 
71
+ ## Three Novel Features
72
 
73
+ <table>
74
+ <tr>
75
+ <td width="33%" align="center">
76
+
77
+ ### 🔮 Predictive Regulator
78
+
79
+ Computes **trend slope** over 5-episode windows.<br/>Warns of *emerging* blind spots **before** detection rates cross the critical threshold — proactive oversight, not reactive retraining.
80
+
81
+ `+0.15 early-warning bonus`
82
 
83
+ </td>
84
+ <td width="33%" align="center">
85
+
86
+ ### 🧩 Compound Fraud
87
+
88
+ Invoices carry **two fraud signals simultaneously** (e.g. phantom vendor + price gouging).<br/>Partial credit `+0.65` for catching one; full reward `+0.99` for both.
89
+
90
+ Prevents single-signal heuristics.
91
+
92
+ </td>
93
+ <td width="33%" align="center">
94
+
95
+ ### 📊 Confidence Calibration
96
+
97
+ Tracks `(confidence, correct?)` pairs per fraud type.<br/>Detects **overconfident misses** — the Auditor saying "90% sure, approved" on fraud — the most dangerous real-world failure mode.
98
+
99
+ </td>
100
+ </tr>
101
+ </table>
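The trend-slope idea behind the Predictive Regulator can be sketched as below. This is an illustration under stated assumptions, not the project's implementation: it uses an ordinary least-squares slope over the last 5 detection rates, and the thresholds `critical=0.50` and `slope_warn=-0.02` are hypothetical values chosen for the demo.

```python
from typing import List, Sequence

def trend_slope(rates: Sequence[float]) -> float:
    """Least-squares slope of detection rate per episode (x = 0, 1, ..., n-1)."""
    n = len(rates)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(rates) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(rates))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var

def classify(rates: Sequence[float], window: int = 5,
             critical: float = 0.50, slope_warn: float = -0.02) -> str:
    """Label a fraud type as 'blind_spot', 'emerging', or 'ok' (thresholds assumed)."""
    recent: List[float] = list(rates)[-window:]
    if not recent:
        return "ok"
    if recent[-1] < critical:
        return "blind_spot"      # already below the critical detection rate
    if trend_slope(recent) <= slope_warn:
        return "emerging"        # still above threshold, but trending down
    return "ok"

status_falling = classify([0.80, 0.76, 0.70, 0.66, 0.62])
status_stable = classify([0.74, 0.75, 0.73, 0.76, 0.74])
status_low = classify([0.40, 0.35, 0.33, 0.31, 0.31])
```

The "emerging" branch is what allows a warning before the detection rate itself crosses the critical threshold.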
102
 
103
  ---
104
 
105
+ ## 🤖 Five Agents, One Closed Loop
106
 
107
+ <div align="center">
108
 
109
+ | Agent | Role | Reward Signal |
110
+ |:---:|:---|:---|
111
+ | 🏭 **Generator** | Creates clean or fraudulent invoices, biased by Regulator blind-spot weights | `+0.85` evades Auditor + Approver · `+0.60` evades Auditor only · `+0.10` caught |
112
+ | 🔍 **Extractor** | Parses raw OCR invoice text → structured JSON | 4 independent signals: format `0.10` · field accuracy `0.40` · math `0.25` · completeness `0.25` |
113
+ | 🕵️ **Auditor** | Classifies each invoice with fraud type + confidence score | `+0.99` correct type · `+0.90` clean clearance · `+0.65` compound (one caught) · `+0.01` miss/FP |
114
+ | ✅ **Approver** | Final approve / escalate / reject (rule-based, confidence-gated) | `≥0.80` confidence → reject · `0.50–0.80` → escalate · approved → approve |
115
+ | 🧠 **Regulator** | Cross-episode meta-agent — 30-episode rolling window, blind-spot tracker | Precision `0.35` + Recall `0.35` + No over-flagging `0.15` + Early warning `0.15` |
116
+
117
+ </div>
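The Approver's confidence gating from the table above can be sketched as a small rule function. The verdict strings follow the `/multi/audit` payload format; the handling of a flag with confidence below 0.50 is not specified in the table, so escalating it here is an assumption.

```python
def approver_decision(verdict: str, confidence: float) -> str:
    """Rule-based final decision from the Auditor's verdict and confidence."""
    if verdict == "approved":
        return "approve"
    if confidence >= 0.80:
        return "reject"
    if confidence >= 0.50:
        return "escalate"
    # Assumption: the table does not cover flags below 0.50 confidence;
    # this sketch escalates them rather than approving silently.
    return "escalate"

decisions = [
    approver_decision("flagged", 0.91),
    approver_decision("flagged", 0.65),
    approver_decision("approved", 0.88),
]
```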
118
 
119
  ---
120
 
121
+ ## 🎯 Seven Tasks — Progressive Difficulty
122
 
123
+ | # | Task | Difficulty | What the Agent Must Do |
124
+ |:---:|:---|:---:|:---|
125
+ | 1 | `easy` | 🟢 Easy | Extract `vendor`, `date`, `currency`, `total`, `line_items` from a single clean invoice |
126
+ | 2 | `medium` | 🟡 Medium | Clean & normalise a batch: fix date format chaos, vendor typos, currency symbol pollution |
127
+ | 3 | `hard` | 🟠 Hard | Extract + reconcile against purchase orders — flag overcharges, extra items, missing items |
128
+ | 4 | `expert` | 🔴 Expert | Fraud audit using vendor registry, market prices, and invoice history — classify fraud type exactly |
129
+ | 5 | `adversarial` | 🟠 Hard | Ignore SUBTOTAL trap + fake TAX/ADJUSTMENT + FX noise; OCR-corrupted vendor labels |
130
+ | 6 | `negotiate` | 🟡 Medium | Ask clarification questions `{"question": "..."}` then extract; `+15%` bonus for ≤2 questions |
131
+ | 7 | `supply_chain` | 🔴 Expert | Detect `quantity_shortfall`, `price_spike`, `unauthorized_substitution`, `phantom_delivery` |
132
+
133
+ ---
134
+
135
+ ## 🧠 Trained LoRA Agents
136
+
137
+ All three generative agents were trained with **GRPO on live environment data**: the HF Space `/grader` endpoint *is* the reward function during training.
138
+
139
+ <div align="center">
140
 
141
+ | Agent | Base Model | LoRA Config | HuggingFace Hub |
142
+ |:---:|:---|:---:|:---|
143
+ | 🔍 Extractor | Qwen2.5-1.5B-Instruct | r=16, α=16, 4-bit QLoRA | [ps2181/extractor-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/extractor-lora-qwen2.5-1.5b) |
144
+ | 🕵️ Auditor | Qwen2.5-1.5B-Instruct | r=16, α=16, 4-bit QLoRA | [ps2181/auditor-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/auditor-lora-qwen2.5-1.5b) |
145
+ | 🏭 Generator | Qwen2.5-1.5B-Instruct | r=16, α=16, 4-bit QLoRA | [ps2181/generator-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/generator-lora-qwen2.5-1.5b) |
146
 
147
+ </div>
148
+
149
+ **LoRA target modules:** `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
 
 
 
 
150
 
151
  ---
152
 
153
+ ## 📈 Training Results
154
+
155
+ ### Extractor — GRPO Training Progress
156
+
157
+ The model learned to extract structured JSON from noisy invoice text via **reinforcement learning with 4 independent reward signals**, scoring directly against the live environment grader.
158
+
159
+ | Step | Total Reward | Env Score | Format | Math Consistency |
160
+ |:---:|:---:|:---:|:---:|:---:|
161
+ | 10 | 2.361 | 0.113 | 0.900 | 0.347 |
162
+ | 20 | 2.595 | 0.282 | 0.900 | 0.413 |
163
+ | 30 | 2.657 | 0.304 | **0.950** | 0.403 |
164
+
165
+ > 📊 **Environment score: `0.113 → 0.304` between steps 10 and 30, a 169% relative improvement** in live-graded extraction accuracy.
166
+
167
+ ### 🔍 Reward Hacking Caught in Training
168
+
169
+ At step 10, we observed the model achieving `math_consistency = 0.97` and `completeness = 1.0` while `field_accuracy = 0.00` — it had learned to output **arithmetically-consistent JSON with entirely hallucinated values**.
170
 
171
+ Our 4 **independent** reward signals made this visible immediately. A single aggregated reward would never have surfaced it.
172
+
173
+ ```
174
+ Step 10 Reward Hacking Detected:
175
+ format: 0.10 ✅
176
+ math_consistency: 0.97 ✅ ← model gaming this signal
177
+ completeness: 1.00 ✅ ← model gaming this signal
178
+ field_accuracy: 0.00 ❌ ← hallucinating all values
179
+
180
+ Action: adjusted training emphasis on field_accuracy weight
181
+ Result: field_accuracy climbed to 0.30+ by step 30
182
+ ```
183
+
184
+ This is exactly why multiple independent reward signals matter — and why we have 4.
185
 
186
  ---
187
 
188
+ ## 🎁 Reward Architecture
189
+
190
+ ### Extractor — 4 Independent Signals
191
+
192
+ ```python
193
+ def reward_format(extracted) -> float: # weight 0.10
194
+     """Are all 5 required JSON keys present?"""
195
+
196
+ def reward_field_accuracy(extracted, gt) -> float: # weight 0.40
197
+     """Do vendor / date / currency / total match ground truth?"""
198
 
199
+ def reward_math_consistency(extracted) -> float: # weight 0.25
200
+     """Does qty × unit_price = amount for every line item?"""
201
 
202
+ def reward_completeness(extracted, gt) -> float: # weight 0.25
203
+     """Recall: what fraction of expected line items are present?"""
 
 
 
204
 
205
+ # All rewards are clamped to (0.01, 0.99), so no log(0) and no gradient collapse
206
+ ```
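The four stubs above only give signatures and docstrings. A minimal runnable sketch of how they might be filled in follows; the bodies are assumptions (exact-match field comparison, a 0.01 tolerance on line-item arithmetic, matching line items by `description`), and `score` is a hypothetical helper that applies the stated weights and the clamp. The demo input reproduces the reward-hacking pattern: consistent arithmetic with hallucinated field values.

```python
from typing import Any, Dict, Tuple

WEIGHTS = {"format": 0.10, "field_accuracy": 0.40,
           "math_consistency": 0.25, "completeness": 0.25}
REQUIRED_KEYS = ("vendor", "date", "currency", "total", "line_items")

def clamp(x: float, lo: float = 0.01, hi: float = 0.99) -> float:
    return max(lo, min(hi, x))

def reward_format(extracted: Dict[str, Any]) -> float:
    """Fraction of the 5 required keys that are present."""
    return sum(k in extracted for k in REQUIRED_KEYS) / len(REQUIRED_KEYS)

def reward_field_accuracy(extracted: Dict[str, Any], gt: Dict[str, Any]) -> float:
    """Fraction of header fields that exactly match ground truth (assumed rule)."""
    fields = ("vendor", "date", "currency", "total")
    return sum(extracted.get(f) == gt.get(f) for f in fields) / len(fields)

def reward_math_consistency(extracted: Dict[str, Any], tol: float = 0.01) -> float:
    """Fraction of line items where qty * unit_price matches amount within tol."""
    items = extracted.get("line_items") or []
    if not items:
        return 0.0
    ok = sum(abs(i.get("qty", 0) * i.get("unit_price", 0) - i.get("amount", 0)) <= tol
             for i in items)
    return ok / len(items)

def reward_completeness(extracted: Dict[str, Any], gt: Dict[str, Any]) -> float:
    """Recall of expected line items, matched by description (assumed key)."""
    expected = {i["description"] for i in gt.get("line_items", [])}
    found = {i.get("description") for i in extracted.get("line_items") or []}
    return len(expected & found) / len(expected) if expected else 1.0

def score(extracted: Dict[str, Any], gt: Dict[str, Any]) -> Tuple[float, Dict[str, float]]:
    """Weighted, clamped total plus the per-signal breakdown."""
    parts = {
        "format": reward_format(extracted),
        "field_accuracy": reward_field_accuracy(extracted, gt),
        "math_consistency": reward_math_consistency(extracted),
        "completeness": reward_completeness(extracted, gt),
    }
    return clamp(sum(WEIGHTS[k] * v for k, v in parts.items())), parts

gt = {"vendor": "Acme Corp", "date": "2024-08-15", "currency": "USD", "total": 2374.93,
      "line_items": [
          {"description": "Laptop Computer", "qty": 2, "unit_price": 1099.99, "amount": 2199.98},
          {"description": "Wireless Mouse", "qty": 5, "unit_price": 34.99, "amount": 174.95}]}
# Hallucinated header values, but internally consistent arithmetic.
hacked = {"vendor": "Globex", "date": "2023-01-01", "currency": "EUR", "total": 30.0,
          "line_items": [
              {"description": "Laptop Computer", "qty": 1, "unit_price": 10.0, "amount": 10.0},
              {"description": "Wireless Mouse", "qty": 2, "unit_price": 10.0, "amount": 20.0}]}
total, parts = score(hacked, gt)
```

Returning the breakdown alongside the total is what makes the failure visible: the aggregate looks respectable while `field_accuracy` sits at zero.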
207
+
208
+ ### Auditor Reward
209
 
210
+ | Outcome | Reward | Why |
211
+ |:---|:---:|:---|
212
+ | Correct fraud type detected | **0.99** | Incentivises precise classification, not just binary flagging |
213
+ | Clean invoice correctly approved | **0.90** | High reward keeps false-positive rate low |
214
+ | Compound fraud — one of two types caught | **0.65** | Partial credit prevents cliff on hard cases |
215
+ | Fraud flagged but wrong type | **0.50** | Penalises sloppiness; rewards catching *something* |
216
+ | Miss or false positive | **0.01** | Near-zero punishes both failure modes symmetrically |
217
 
218
+ ### Generator Reward (Adversarial Self-Play)
219
+
220
+ | Outcome | Reward |
221
+ |:---|:---:|
222
+ | Fraud evades **both** Auditor and Approver | **0.85** |
223
+ | Auditor misses, Approver catches | **0.60** |
224
+ | Auditor catches it | **0.10** |
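The Auditor and Generator reward tables above can be sketched as two lookup functions. The reward values come from the tables; the interface is an assumption: this sketch represents ground truth and the Auditor's prediction as sets of fraud-type strings so that compound fraud (two types) can earn partial credit.

```python
from typing import Set

def auditor_reward(true_types: Set[str], verdict: str, predicted: Set[str]) -> float:
    """Reward for one invoice; `true_types` is empty for a clean invoice."""
    if not true_types:
        # Clean invoice: correct approval vs. false positive.
        return 0.90 if verdict == "approved" else 0.01
    if verdict != "flagged":
        return 0.01                      # missed fraud
    if predicted == true_types:
        return 0.99                      # every fraud type identified
    if len(true_types) > 1 and predicted & true_types:
        return 0.65                      # compound fraud, one type caught
    return 0.50                          # fraud flagged, wrong type

def generator_reward(auditor_caught: bool, approver_caught: bool) -> float:
    """Adversarial reward: higher when the fraud slips past more reviewers."""
    if auditor_caught:
        return 0.10
    return 0.60 if approver_caught else 0.85

examples = [
    auditor_reward({"phantom_vendor"}, "flagged", {"phantom_vendor"}),
    auditor_reward({"phantom_vendor", "price_gouging"}, "flagged", {"price_gouging"}),
    auditor_reward({"math_fraud"}, "flagged", {"price_gouging"}),
    auditor_reward(set(), "flagged", {"math_fraud"}),
]
```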
 
225
 
226
+ ### Regulator Reward
227
+
228
+ ```
229
+ Total = Precision(0.35) + Recall(0.35) + No-over-flagging(0.15) + Early-warning-bonus(0.15)
230
+ ```
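One way to read the weighted sum above is sketched below. Only the four weights come from the formula; how each component is computed is an assumption of this sketch: precision and recall are taken over predicted versus actual blind-spot sets, "no over-flagging" is 1 when nothing outside the actual set is predicted, and the early-warning bonus is passed in as a boolean.

```python
from typing import Set

def regulator_reward(predicted: Set[str], actual: Set[str], early_warning: bool) -> float:
    """Weighted sum of precision, recall, no-over-flagging, and early-warning terms."""
    true_pos = len(predicted & actual)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(actual) if actual else 0.0
    # Assumed definition: full credit only if no fraud type is flagged spuriously.
    no_over_flagging = 1.0 if predicted <= actual else 0.0
    bonus = 1.0 if early_warning else 0.0
    return 0.35 * precision + 0.35 * recall + 0.15 * no_over_flagging + 0.15 * bonus

perfect = regulator_reward({"phantom_vendor"}, {"phantom_vendor"}, early_warning=True)
over_flagged = regulator_reward({"phantom_vendor", "math_fraud"}, {"phantom_vendor"},
                                early_warning=False)
```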
231
 
232
  ---
233
 
234
+ ## 🦺 Five Fraud Types
235
 
236
+ <div align="center">
237
+
238
+ | Type | Detection Method | Example |
239
+ |:---|:---|:---|
240
+ | 🏚️ `phantom_vendor` | Vendor not in Approved Vendor Registry | "QuickSupply Hub" — not in approved list |
241
+ | 💸 `price_gouging` | Unit price > 150% of market ceiling | Laptop at $2,800 when market max is $1,299 |
242
+ | ➕ `math_fraud` | Invoice total ≠ sum of line items | Total $5,200 when items sum to $4,400 |
243
+ | 📋 `duplicate_submission` | Same invoice_id or vendor+date+total already in history | INV-83221 submitted twice |
244
+ | 🔀 `compound_fraud` | Two fraud signals in one invoice | Phantom vendor **AND** price gouging simultaneously |
245
+
246
+ </div>
247
 
248
  ---
249
 
250
+ ## 🌍 The Regulator in Action
251
+
252
+ After each episode, the Regulator publishes a report that the Generator reads to bias its next batch:
253
 
254
  ```
255
+ GET /regulator/report
256
+
257
+ {
258
+ "total_audits_recorded": 20,
259
+ "detection_rates": {
260
+ "phantom_vendor": "31% ⚠ BLIND SPOT (-0.08↓)",
261
+ "price_gouging": "74% OK (+0.03↑)",
262
+ "math_fraud": "81% OK (+0.01↑)",
263
+ "duplicate_submission": "62% EMERGING (-0.02↓)"
264
+ },
265
+ "false_positive_rate": "12% ✓ OK",
266
+ "blind_spots": ["phantom_vendor"],
267
+ "emerging_blind_spots": ["duplicate_submission"],
268
+ "generator_weights": {
269
+ "phantom_vendor": 0.30, ← 3× upweighted (blind spot)
270
+ "duplicate_submission": 0.20, ← 2× upweighted (emerging)
271
+ "price_gouging": 0.125,
272
+ "math_fraud": 0.125,
273
+ "compound_fraud": 0.10
274
+ },
275
+ "verdict": "Recommend retraining on: phantom_vendor"
276
+ }
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
277
  ```
278
 
279
  ---
280
 
281
+ ## 🚀 Quick Start
282
 
283
+ ### Try the Live Demo
 
 
 
 
 
 
284
 
285
+ ```bash
286
+ # Health check
287
+ curl https://ps2181-invoice-processing-pipeline.hf.space/health
 
 
 
 
 
 
 
 
 
 
 
 
288
 
289
+ # List all 7 tasks with schemas
290
+ curl https://ps2181-invoice-processing-pipeline.hf.space/tasks
291
+
292
+ # Start a single-agent episode
293
+ curl -X POST https://ps2181-invoice-processing-pipeline.hf.space/reset \
294
+ -H "Content-Type: application/json" \
295
+ -d '{"task_id": "easy"}'
296
+
297
+ # Submit an extraction (replace EPISODE_ID from reset response)
298
+ curl -X POST https://ps2181-invoice-processing-pipeline.hf.space/step \
299
+ -H "Content-Type: application/json" \
300
+ -d '{
301
+ "episode_id": "EPISODE_ID",
302
+ "extracted_data": {
303
+ "vendor": "Acme Corp",
304
+ "date": "2024-08-15",
305
+ "currency": "USD",
306
+ "total": 2374.93,
307
+ "line_items": [
308
+ {"description": "Laptop Computer", "qty": 2, "unit_price": 1099.99, "amount": 2199.98},
309
+ {"description": "Wireless Mouse", "qty": 5, "unit_price": 34.99, "amount": 174.95}
310
+ ]
311
+ }
312
+ }'
313
+ ```
314
+
315
+ ### Run the Multi-Agent Pipeline
316
+
317
+ ```bash
318
+ # Step 1 — Start 5-agent episode (Generator biased by Regulator)
319
+ curl -X POST https://ps2181-invoice-processing-pipeline.hf.space/multi/reset
320
+
321
+ # Step 2 — Score Extractor output (4 signals)
322
+ curl -X POST https://ps2181-invoice-processing-pipeline.hf.space/multi/extract \
323
+ -H "Content-Type: application/json" \
324
+ -d '{"episode_id": "EP_ID", "extracted_data": {...}}'
325
+
326
+ # Step 3 — Score Auditor output (updates 30-episode tracker)
327
+ curl -X POST https://ps2181-invoice-processing-pipeline.hf.space/multi/audit \
328
+ -H "Content-Type: application/json" \
329
+ -d '{"episode_id": "EP_ID", "audit_results": [
330
+ {"invoice_id": "INV-83221", "verdict": "flagged",
331
+ "fraud_type": "phantom_vendor", "confidence": 0.87}
332
+ ]}'
333
+
334
+ # Step 4 — Run Approver, compute Generator adversarial reward
335
+ curl -X POST https://ps2181-invoice-processing-pipeline.hf.space/multi/approve \
336
+ -H "Content-Type: application/json" \
337
+ -d '{"episode_id": "EP_ID"}'
338
+
339
+ # Check Regulator state anytime
340
+ curl https://ps2181-invoice-processing-pipeline.hf.space/regulator/report
341
+ curl https://ps2181-invoice-processing-pipeline.hf.space/regulator/forecast
342
+ curl https://ps2181-invoice-processing-pipeline.hf.space/regulator/calibration
343
+ ```
344
+
345
+ ### Run Training (Google Colab)
346
+
347
+ [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1C1_3giNt-NmbzKNFJr5_L1fms3L8LfmB)
348
+
349
+ The training loop connects **directly** to the live HF Space environment:
350
+
351
+ ```
352
+ Colab → /reset (fresh synthetic invoice) → model generates JSON
353
+ → /grader (scores vs ground truth) → GRPO weight update
354
+ → repeat 200 steps
355
+ ```
356
+
357
+ ---
358
+
359
+ ## 🗂️ Repository Structure
360
+
361
+ ```
362
+ invoice-processing-pipeline/
363
+
364
+ ├── server/
365
+ │ ├── app.py # FastAPI — 18 endpoints
366
+ │ ├── environment.py # 7 tasks · graders · dynamic difficulty
367
+ │ ├── multi_agent_environment.py # 5-agent system + AuditorPerformanceTracker
368
+ │ ├── agents.py # Lazy-loading LoRA inference wrappers
369
+ │ └── web_ui.py # Gradio UI (mounted at /web)
370
+
371
+ ├── models.py # Pydantic: Action · Observation · State
372
+ ├── inference.py # Standalone inference helper
373
+ ├── client.py # OpenEnv-compatible Python client
374
+
375
+ ├── extractor_training_grpo.ipynb # Extractor GRPO training (Unsloth + TRL)
376
+ ├── auditor_grpo_training.ipynb # Auditor GRPO training
377
+ ├── generator_grpo_training.ipynb # Generator GRPO training
378
+
379
+ ├── openenv.yaml # OpenEnv manifest (all 7 tasks declared)
380
+ ├── Dockerfile # HF Spaces Docker (port 7860, non-root UID 1000)
381
+ ├── pyproject.toml # Project metadata + dependencies
382
+ ├── requirements.txt # Runtime dependencies
383
+ ├── validate-submission.sh # Submission validator script
384
+
385
+ ├── ROUND2_PROBLEM_STATEMENT.md # Full problem statement + reward design rationale
386
+ └── BLOG_DRAFT.md # HuggingFace blog post draft
387
+ ```
388
 
389
  ---
390
 
391
+ ## 🔌 API Reference
392
 
393
  ### Core OpenEnv
394
+
395
  | Endpoint | Method | Description |
396
+ |:---|:---:|:---|
397
+ | `/health` | `GET` | Health check; returns `{"status": "ok", "active_sessions": N}` |
398
+ | `/tasks` | `GET` | All 7 tasks with descriptions, max_attempts, action/observation schemas |
399
+ | `/reset` | `POST` | Start episode `{"task_id": "easy\|medium\|hard\|expert\|adversarial\|negotiate\|supply_chain"}` |
400
+ | `/step` | `POST` | Submit extraction → reward + feedback + hint + reward_breakdown |
401
+ | `/grader` | `POST` | Score without consuming an attempt (used by training Colab) |
402
+ | `/state` | `GET` | Episode metadata — step_count, done, best_reward, full rewards history |
403
+ | `/ws` | `WS` | Full episode over WebSocket (OpenEnv standard) |
404
+ | `/web` | `GET` | Gradio interactive demo UI |
405
 
406
  ### Multi-Agent
407
+
408
  | Endpoint | Method | Description |
409
+ |:---|:---:|:---|
410
+ | `/multi/reset` | `POST` | Start 5-agent episode; Generator biased by Regulator weights |
411
+ | `/multi/extract` | `POST` | Score Extractor output (4 signals) |
412
+ | `/multi/audit` | `POST` | Score Auditor output, update 30-episode performance tracker |
413
+ | `/multi/approve` | `POST` | Run Approver, compute Generator adversarial reward |
414
+ | `/multi/state/{id}` | `GET` | Full episode state including all agent scores |
415
 
416
  ### Regulator
417
+
418
  | Endpoint | Method | Description |
419
+ |:---|:---:|:---|
420
+ | `/regulator/report` | `GET` | Detection rates, blind spots, calibration, generator weights |
421
+ | `/regulator/forecast` | `GET` | Predictive trend analysis — critical + emerging blind spots with slopes |
422
+ | `/regulator/calibration` | `GET` | Overconfidence / underconfidence per fraud type |
423
+ | `/regulator/predict` | `POST` | Score a Regulator blind-spot prediction |
424
+ | `/regulator/demo_seed` | `POST` | Seed tracker with realistic demo data |
425
+ | `/generator/score` | `POST` | Compute Generator reward given auditor/approver outcomes |
426
 
427
  ---
428
 
429
+ ## 🏗️ Tech Stack
430
 
431
+ <div align="center">
 
 
432
 
433
+ | Layer | Technology |
434
+ |:---|:---|
435
+ | **Environment** | [OpenEnv](https://github.com/meta-pytorch/OpenEnv) · FastAPI · Pydantic v2 |
436
+ | **UI** | Gradio 4.x (mounted at `/web`) |
437
+ | **Deployment** | Docker · HuggingFace Spaces (vcpu-2 / 8 GB) |
438
+ | **Training** | [TRL GRPOTrainer](https://huggingface.co/docs/trl) · [Unsloth](https://github.com/unslothai/unsloth) |
439
+ | **Model** | `unsloth/Qwen2.5-1.5B-Instruct` · 4-bit QLoRA · r=16 |
440
+ | **Reward** | Live `/grader` endpoint on HF Space as verifier |
441
+ | **Session Mgmt** | Thread-safe `OrderedDict` · 200-session cap · LRU eviction |
442
+ | **Dynamic Difficulty** | Per-task rolling window (maxlen=10) → adjusts OCR intensity, batch size, discrepancy count |
443
 
444
+ </div>
 
445
 
446
+ ---
447
+
448
+ ## 🔍 Dynamic Difficulty
449
+
450
+ The environment adapts generation parameters to the agent's recent performance:
451
+
452
+ ```python
453
+ if avg_score >= 0.85:      # Agent is doing well → harder
454
+     n_invoices = (4, 6)
455
+     ocr_intensity = 0.55   # heavier corruption
456
+     n_discrepancies = (3, 5)
457
+     n_anomalies = 3
458
+
459
+ elif avg_score < 0.60:     # Agent is struggling → easier
460
+     n_invoices = (2, 3)
461
+     ocr_intensity = 0.15
462
+     n_discrepancies = (1, 2)
463
+     n_anomalies = 2
464
+
465
+ else:                      # balanced
466
+     n_invoices = (3, 5)
467
+     ocr_intensity = 0.35
468
+     n_discrepancies = (2, 3)
469
  ```
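The snippet above leaves `avg_score` undefined. A self-contained sketch of how it could be driven by the per-task rolling window (maxlen=10) mentioned in the Tech Stack table is shown below. The `DifficultyTracker` class, the neutral default of 0.70 before any score is recorded, and the dictionary return shape are assumptions of this sketch; the thresholds and parameter values are the ones listed above, and the balanced tier omits `n_anomalies` because the source does not give a value for it.

```python
from collections import deque
from typing import Any, Dict

class DifficultyTracker:
    """Rolling window of recent scores that selects generation parameters."""

    def __init__(self, maxlen: int = 10) -> None:
        self.scores: deque = deque(maxlen=maxlen)  # oldest scores are evicted

    def record(self, score: float) -> None:
        self.scores.append(score)

    def params(self) -> Dict[str, Any]:
        # Assumption: start in the balanced tier until a score is recorded.
        avg = sum(self.scores) / len(self.scores) if self.scores else 0.70
        if avg >= 0.85:   # agent is doing well: harder
            return {"n_invoices": (4, 6), "ocr_intensity": 0.55,
                    "n_discrepancies": (3, 5), "n_anomalies": 3}
        if avg < 0.60:    # agent is struggling: easier
            return {"n_invoices": (2, 3), "ocr_intensity": 0.15,
                    "n_discrepancies": (1, 2), "n_anomalies": 2}
        # Balanced tier; n_anomalies is not specified in the source.
        return {"n_invoices": (3, 5), "ocr_intensity": 0.35,
                "n_discrepancies": (2, 3)}

tracker = DifficultyTracker()
initial = tracker.params()
for _ in range(10):
    tracker.record(0.90)
hard = tracker.params()
for _ in range(10):
    tracker.record(0.30)   # pushes the earlier high scores out of the window
easy = tracker.params()
```

Because the window is bounded, a run of poor scores fully replaces earlier good ones after ten episodes, so the difficulty can drop as quickly as it rose.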
470
 
471
  ---
472
 
473
+ ## 🎭 Theme Alignment
474
+
475
+ <div align="center">
476
 
477
+ | Theme | Alignment | Evidence |
478
+ |:---:|:---|:---|
479
+ | **#1 Multi-Agent Interactions** | Core | 5 agents with cooperation, competition, and adversarial self-play |
480
+ | **#1 Fleet AI Scalable Oversight** | Bonus | Regulator monitors Auditor cross-episode — fully autonomous oversight loop |
481
+ | **#2 Long-Horizon Planning** | Partial | `negotiate` task: multi-turn clarification with attempt budget penalty |
482
+ | **#3.1 Professional Tasks** | Core | Invoice + PO + vendor registry + supply chain = real finance operations |
483
+ | **#4 Self-Improvement** | Core | Regulator → Generator bias → harder adversarial batches → Auditor improves |
484
+
485
+ </div>
486
 
487
  ---
488
 
489
+ ## 👥 Team
490
+
491
+ <div align="center">
492
+
493
+ | | |
494
+ |:---:|:---:|
495
+ | **Pritam Satpathy** | **Gnana Nawin T** |
496
+ | [🤗 ps2181](https://huggingface.co/ps2181) | [🤗 gnananawin](https://huggingface.co/gnananawin) |
497
+ | Scaler School of Technology | Scaler School of Technology |
498
+
499
+ **Meta PyTorch OpenEnv Hackathon — Grand Finale · April 25–26, 2026 · Bangalore**
500
+
501
+ </div>
502
+
503
+ ---
504
+
505
+ ## 🔗 All Links
506
+
507
+ <div align="center">
508
+
509
+ | Resource | Link |
510
+ |:---|:---|
511
+ | 🚀 **Live Environment** | https://ps2181-invoice-processing-pipeline.hf.space |
512
+ | 🖥️ **Gradio Demo UI** | https://ps2181-invoice-processing-pipeline.hf.space/web |
513
+ | 📖 **API Documentation** | https://ps2181-invoice-processing-pipeline.hf.space/docs |
514
+ | 🤗 **Extractor Model** | https://huggingface.co/ps2181/extractor-lora-qwen2.5-1.5b |
515
+ | 🕵️ **Auditor Model** | https://huggingface.co/ps2181/auditor-lora-qwen2.5-1.5b |
516
+ | 🏭 **Generator Model** | https://huggingface.co/ps2181/generator-lora-qwen2.5-1.5b |
517
+ | 📓 **Training Colab** | https://colab.research.google.com/drive/1C1_3giNt-NmbzKNFJr5_L1fms3L8LfmB |
518
+ | 💻 **GitHub** | https://github.com/ps2181/invoice-processing-pipeline |
519
+ | 🧩 **OpenEnv Framework** | https://github.com/meta-pytorch/OpenEnv |
520
+
521
+ </div>
522
+
523
+ ---
524
+
525
+ <div align="center">
526
+
527
+ <img src="https://capsule-render.vercel.app/api?type=waving&color=gradient&customColorList=6,11,20&height=100&section=footer&animation=twinkling" width="100%"/>
528
+
529
+ **Built with ❤️ for the Meta PyTorch OpenEnv Hackathon 2026**
530
 
531
+ </div>