Commit df15bd6 by ps2181 (1 parent: 8afb151): Create Blog

# Invoice Processing Pipeline: Multi-Agent RL Environment for Financial Fraud Detection

**Meta PyTorch OpenEnv Hackathon Grand Finale | April 25–26, 2026**
**Team: Pritam Satpathy + Gnana Nawin T**

---

## The Problem

Invoice fraud costs businesses an estimated 5% of annual revenue. Finance teams manually process thousands of invoices every month, extracting vendor names, dates, line items, and totals, then checking them against purchase orders for discrepancies. The work is slow (hours per batch), error-prone (typos, OCR noise, inconsistent formats), and gameable (phantom vendors, price gouging, duplicate submissions).

We built an RL training environment that teaches LLMs to do this automatically, and that improves itself when it discovers its own blind spots.

---

## What We Built

An OpenEnv-compatible environment deployed on HuggingFace Spaces:
**[https://ps2181-invoice-processing-pipeline.hf.space](https://ps2181-invoice-processing-pipeline.hf.space)**

---

## Architecture: 5-Agent System

```
        ┌───────────────────────────────────────────────────────────┐
        │                ADVERSARIAL REWARD (dashed)                │
        │                                                           │
        ▼                                                           │
┌───────────────┐                                                   │
│   Generator   │◄───── Regulator biases fraud type ◄────┐          │
│ Creates fraud │                                        │          │
└───────┬───────┘                                        │          │
        │ Raw invoice text                               │          │
        ▼                                                │          │
┌───────────────┐                                        │          │
│   Extractor   │                                        │          │
│  Text → JSON  │                                        │          │
└───────┬───────┘                                        │          │
        │ Structured data                          ┌─────┴──────┐   │
        ▼                                          │ Regulator  │   │
┌───────────────┐                                  │ Cross-     │   │
│    Auditor    │────── decision history ─────────►│ episode    │   │
│ Fraud detect  │                                  │ meta-agent │   │
└───────┬───────┘                                  └────────────┘   │
        │ Verdict + flags                                           │
        ▼                                                           │
┌───────────────┐                                                   │
│   Approver    │───────────────────────────────────────────────────┘
│ Approve/reject│
└───────┬───────┘
        │
        ▼
┌──────────────────────────────────────┐
│     4 Independent Reward Signals     │
│ Format · Field · Math · Completeness │
└──────────────────────────────────────┘
```

| Agent | Role | Reward Signal |
|-------|------|---------------|
| **Generator** | Creates clean or fraudulent invoices | Rewarded when fraud slips past Auditor (adversarial self-play) |
| **Extractor** | Reads raw invoice text → structured JSON | 4 independent signals: format, field accuracy, math consistency, completeness |
| **Auditor** | Reviews extraction, flags fraud | +0.99 correct detection, +0.90 clean clearance, +0.01 for miss/false positive |
| **Approver** | Final approve/reject/escalate decision | +0.95 correct decision |
| **Regulator** | Monitors Auditor blind spots across episodes | Precision + recall of blind spot predictions |
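
As a rough illustration of two rows in the table, the sketch below encodes the Auditor's reward values and computes precision and recall for the Regulator's blind-spot predictions. The function names and signatures are hypothetical, and the post does not say how precision and recall are combined into one Regulator reward, so the sketch simply returns both.

```python
from __future__ import annotations


def auditor_reward(is_fraud: bool, flagged: bool) -> float:
    """Auditor reward, using the values listed in the table (names are assumed)."""
    if is_fraud and flagged:
        return 0.99  # correct detection
    if not is_fraud and not flagged:
        return 0.90  # clean invoice correctly cleared
    return 0.01      # missed fraud or false positive


def regulator_scores(predicted: set[str], actual: set[str]) -> tuple[float, float]:
    """Precision and recall of predicted blind-spot fraud types.

    `predicted` is the set of fraud types the Regulator flags as blind spots,
    `actual` is the set that really are blind spots (both hypothetical inputs).
    """
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall
```

For example, predicting `{"phantom_vendor", "math_fraud"}` when only `phantom_vendor` is a real blind spot gives precision 0.5 and recall 1.0.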

---

## The Key Innovation: The Regulator

The Regulator is a cross-episode meta-agent. It watches the Auditor's decision history over the last 30 episodes and identifies systematic failure patterns:

```
AUDITOR PERFORMANCE TRACKER (last 30 episodes)

Fraud Type              Detection Rate
─────────────────────────────────────────
phantom_vendor          31%  ⚠ BLIND SPOT
price_gouging           74%  ✓ OK
math_fraud              81%  ✓ OK
duplicate_submission    62%  ✓ OK

False Positive Rate:    12%  ✓ OK

REGULATOR VERDICT: Recommend retraining on phantom_vendor
```

When the Regulator detects a blind spot, the Generator automatically starts producing more of that fraud type, closing the self-improvement loop without human intervention.
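
A minimal sketch of how such a loop could look, not the environment's actual code: the class name, the 0.50 blind-spot threshold, and the 3x bias factor are all assumptions made for illustration (the post only shows that 31% was flagged and 62% was not).

```python
from collections import defaultdict, deque

WINDOW = 30              # episodes tracked, as in the tracker output
BLIND_SPOT_BELOW = 0.50  # assumed threshold; the post does not state one
BIAS_FACTOR = 3.0        # assumed up-weighting for blind-spot fraud types


class Regulator:
    """Hypothetical cross-episode tracker of Auditor detection rates."""

    def __init__(self):
        # Each entry is (fraud_type, detected) for one fraudulent episode.
        self.history = deque(maxlen=WINDOW)

    def record(self, fraud_type, detected):
        self.history.append((fraud_type, bool(detected)))

    def detection_rates(self):
        hits, totals = defaultdict(int), defaultdict(int)
        for fraud_type, detected in self.history:
            totals[fraud_type] += 1
            hits[fraud_type] += int(detected)
        return {t: hits[t] / totals[t] for t in totals}

    def blind_spots(self):
        rates = self.detection_rates()
        return sorted(t for t, rate in rates.items() if rate < BLIND_SPOT_BELOW)

    def generator_weights(self, fraud_types):
        """Sampling probabilities for the Generator, biased toward blind spots."""
        spots = set(self.blind_spots())
        raw = {t: BIAS_FACTOR if t in spots else 1.0 for t in fraud_types}
        norm = sum(raw.values())
        return {t: weight / norm for t, weight in raw.items()}
```

With these assumed constants, a single blind spot among four fraud types would be sampled half of the time instead of a quarter.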

This directly addresses **Theme #1 (Fleet AI Scalable Oversight)** and **Theme #4 (Self-Improvement)**.

---

## 7 Tasks (Progressive Difficulty)

| Task | Difficulty | What the Agent Does |
|------|-----------|---------------------|
| `easy` | Easy | Extract fields from a single clean invoice |
| `medium` | Medium | Clean + normalise a batch of messy invoices (typos, date chaos, currency symbols) |
| `hard` | Hard | Extract + reconcile against purchase orders, flag discrepancies |
| `expert` | Expert | Fraud audit: classify phantom_vendor / price_gouging / math_fraud / duplicate_submission |
| `adversarial` | Hard | Extract from OCR-corrupted invoice with SUBTOTAL trap and FX noise lines |
| `negotiate` | Medium | Ask clarification questions then submit extraction (bonus for ≤2 questions) |
| `supply_chain` | Expert | Detect quantity shortfalls, price spikes, phantom deliveries in delivery records |

---

## Design Decisions

### 4 Independent Reward Functions (Anti-Hacking)

Per the hackathon guide: *"use multiple independent reward functions; if you only have one, it is easier for the model to hack it."*

```python
format_reward()        # Are all 5 required JSON keys present?         weight: 0.10
field_reward()         # Do vendor/date/currency/total match?          weight: 0.40
math_reward()          # Does qty × unit_price = amount for all items? weight: 0.25
completeness_reward()  # Are all line items present (recall)?          weight: 0.25
```

During training we observed the model maximising `math_reward` (0.97) and `completeness_reward` (1.0) while `field_reward` stayed at 0.00: the model learned to output arithmetically consistent JSON while hallucinating values. Because each signal is tracked separately, this reward hacking was immediately visible, which confirmed the design choice.
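
To make the four signals concrete, here is a sketch under stated assumptions. Only the function names and weights come from the listing above; the JSON schema (`vendor`, `date`, `currency`, `total`, `line_items` with `description`, `qty`, `unit_price`, `amount`), the fractional scoring, and the `score` helper are hypothetical, and the deployed grader may work differently.

```python
REQUIRED_KEYS = ("vendor", "date", "currency", "total", "line_items")  # assumed schema
WEIGHTS = {"format": 0.10, "field": 0.40, "math": 0.25, "completeness": 0.25}


def format_reward(pred):
    """Fraction of the 5 required top-level keys that are present."""
    return sum(key in pred for key in REQUIRED_KEYS) / len(REQUIRED_KEYS)


def field_reward(pred, truth):
    """Fraction of scalar fields that exactly match the ground truth."""
    fields = ("vendor", "date", "currency", "total")
    return sum(pred.get(f) == truth.get(f) for f in fields) / len(fields)


def _item_is_consistent(item, tol=0.01):
    try:
        return abs(item["qty"] * item["unit_price"] - item["amount"]) < tol
    except (KeyError, TypeError):
        return False


def math_reward(pred):
    """Fraction of line items where qty * unit_price equals amount."""
    items = pred.get("line_items") or []
    if not items:
        return 0.0
    return sum(_item_is_consistent(i) for i in items) / len(items)


def completeness_reward(pred, truth):
    """Recall of ground-truth line items, matched by description (assumed key)."""
    wanted = {i["description"] for i in truth.get("line_items", [])}
    found = {i.get("description") for i in pred.get("line_items") or []}
    return len(wanted & found) / len(wanted) if wanted else 1.0


def score(pred, truth):
    """Return the weighted total plus the per-signal breakdown."""
    parts = {
        "format": format_reward(pred),
        "field": field_reward(pred, truth),
        "math": math_reward(pred),
        "completeness": completeness_reward(pred, truth),
    }
    total = sum(WEIGHTS[name] * value for name, value in parts.items())
    return total, parts
```

Under these assumptions, a prediction with every key present, self-consistent arithmetic, and the right item descriptions but wrong vendor, date, currency, and total would score 0.60 in aggregate while `field` sits at 0.0. Logging the breakdown rather than only the total is what exposes that pattern.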

### Adversarial Self-Play

The Generator is rewarded when its fraud evades the Auditor:
- Fraud undetected, Approver approves → Generator reward: **0.85**
- Auditor missed but Approver caught → Generator reward: **0.60**
- Auditor caught it → Generator reward: **0.10**

This creates evolutionary pressure: as the Generator produces harder-to-detect fraud, the Auditor is forced to improve.
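
The three cases above can be sketched as a small function. The name and boolean parameters are hypothetical, and treating any non-approval by the Approver as "caught" is an assumption.

```python
def generator_reward(auditor_flagged: bool, approver_caught: bool) -> float:
    """Generator reward for one fraudulent invoice (values from the list above)."""
    if auditor_flagged:
        return 0.10  # Auditor caught the fraud
    if approver_caught:
        return 0.60  # Auditor missed it, Approver caught it
    return 0.85      # fraud went undetected and was approved
```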

### Dynamic Difficulty

The environment tracks recent agent scores per task (rolling window of 10 episodes) and adjusts generation parameters:
- Agent scoring ≥ 0.85 → harder parameters (more invoices, more OCR noise, more discrepancies)
- Agent scoring < 0.60 → easier parameters
- In between → standard
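
One way to implement the thresholds above is a per-task rolling window. This is a sketch, not the environment's code: the class name is hypothetical, and using the window mean as the "agent score" is an assumption, since the post does not say how the window is aggregated.

```python
from collections import defaultdict, deque
from statistics import mean


class DifficultyTracker:
    """Hypothetical per-task rolling window of recent scores."""

    def __init__(self, window=10):
        self.scores = defaultdict(lambda: deque(maxlen=window))

    def record(self, task, score):
        self.scores[task].append(score)

    def level(self, task):
        recent = self.scores[task]
        if not recent:
            return "standard"  # no history yet
        average = mean(recent)  # assumed aggregation over the window
        if average >= 0.85:
            return "harder"
        if average < 0.60:
            return "easier"
        return "standard"
```

Because the deque is bounded, old scores drop out automatically, so the level follows the agent's recent performance instead of its lifetime average.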

### All Rewards Clamped to [0.01, 0.99]

Clamping avoids `log(0)` in policy gradient computations and prevents the model from getting stuck at the boundaries.
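
A minimal sketch of the clamp, with a hypothetical function name. It shows that a raw reward of 0 maps to 0.01, so its logarithm stays finite.

```python
import math


def clamp_reward(raw, low=0.01, high=0.99):
    """Clamp a raw reward into the closed interval [low, high]."""
    return max(low, min(high, raw))


# log(0) is undefined, but the clamped value always has a finite logarithm.
finite_log = math.log(clamp_reward(0.0))
```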

---

## Tech Stack

```
Environment:  FastAPI + OpenEnv-core + Pydantic
Deployment:   HuggingFace Spaces (Docker, port 7860)
UI:           Gradio (mounted at /web)
Training:     TRL GRPOTrainer + Unsloth (Qwen2.5-1.5B-Instruct, 4-bit QLoRA)
Model:        unsloth/Qwen2.5-1.5B-Instruct, r=16 LoRA
Reward:       4 local signals + live /grader endpoint on HF Space
```

---

## Training Setup

GRPO (Group Relative Policy Optimization) with:
- `num_generations = 4`: 4 completions per prompt, compared within the group
- `max_steps = 200`
- `learning_rate = 5e-6`
- Live `/grader` endpoint on HF Space as environment verifier
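
"Compared within the group" means each completion's reward is normalised against the other completions for the same prompt. The sketch below illustrates that idea in plain Python. It is not TRL's implementation: the function name is hypothetical, and the choice of population standard deviation and the `eps` value are assumptions, since implementations differ on those details.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Group-relative advantages for one prompt's completions.

    Each reward is centred on the group mean and scaled by the group's
    standard deviation; eps avoids division by zero when all rewards tie.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With `num_generations = 4`, a group such as `[0.1, 0.3, 0.3, 0.5]` gives a negative advantage to the first completion and a positive one to the last, while a group of identical rewards yields no learning signal at all.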

The training loop:
```
Colab samples episode → HF Space /reset → gets live invoice
Model generates JSON extraction
HF Space /grader scores it against ground truth
GRPO updates model toward higher-scoring completions
```

---

## What Worked (Achievements)

### 1. Reward Hacking Detection: Caught at Step 10

The independent reward signals caught a classic reward hacking pattern early: the model maximised math and completeness while hallucinating field values. With a single aggregate reward, this would have been hidden behind a rising total.

| Step | Total Reward | Env Score | Format | Math  |
|------|--------------|-----------|--------|-------|
| 10   | 2.361        | 0.113     | 0.900  | 0.347 |
| 20   | 2.595        | 0.282     | 0.900  | 0.413 |
| 30   | 2.657        | 0.304     | 0.950  | 0.403 |

Environment score rose **from 0.113 at step 10 to 0.304 at step 30**, a 169% relative improvement in correct invoice extraction as scored by the live environment grader.
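
The 169% figure is a relative improvement over the step-10 score, which can be checked directly from the table (variable and function names are illustrative):

```python
env_score = {10: 0.113, 20: 0.282, 30: 0.304}  # Env Score column from the table


def relative_improvement(start, end):
    """Improvement of `end` over `start`, as a fraction of `start`."""
    return (end - start) / start


gain = relative_improvement(env_score[10], env_score[30])
print(f"{gain:.0%}")  # prints 169%
```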

### 2. Live Environment as Verifier

The training Colab calls `/grader` on the deployed HF Space directly, so the environment is the reward function. There is no separate reward model, and scoring is deterministic and reproducible.

### 3. Regulator Concept Validated

The cross-episode tracking logic works: the Regulator correctly identifies `phantom_vendor` as the Auditor's weakest category and triggers Generator bias toward that fraud type. No other OpenEnv environment we've seen implements a cross-episode meta-agent.

### 4. Full 7-Task Ladder Deployed

All 7 tasks are live on the HF Space with independent graders, schemas, and difficulty calibration. The progressive structure directly supports curriculum learning.

### 5. Clean OpenEnv API Compliance

Standard `reset()` / `step()` / `state()` interface, WebSocket support, Swagger docs at `/docs`, and a Gradio UI at `/web`. Drop-in compatible with any OpenEnv training script.

---

## Where We're Having Problems (Honest Assessment)

### 1. Field Reward Plateau

The `field_reward` (vendor name, date, currency, total accuracy) remains the hardest signal for the 1.5B model to crack. Even at step 30, the environment score is only 0.304, meaning the model still hallucinates field values despite correct structure and math. We suspect this is a model capacity issue: Qwen2.5-1.5B may not have enough parameters to learn extraction patterns from raw OCR text in 200 steps.

**Potential fix:** Switch to Qwen2.5-7B, or add a light SFT warmup phase with 50–100 correct extraction examples before RL.

### 2. Multi-Agent Coordination Not Yet Trained End-to-End

The 5-agent architecture is designed and the environment supports it, but we haven't yet run the full adversarial training loop (Generator vs Auditor) end-to-end with GRPO. Currently, the Extractor is trained in isolation, and the Regulator logic runs as environment-side code, not as a trainable agent.

**Potential fix:** Implement a two-phase training loop. Phase 1: train the Extractor on easy/medium. Phase 2: train the Auditor against the Generator with Regulator feedback.

### 3. Compute Constraints

4-bit QLoRA on a Colab T4 limits batch sizes and generation counts. With `num_generations=4`, each step is slow enough that we couldn't push past ~50 steps in the available time. The reward curves are trending upward but haven't converged.

**Potential fix:** Onsite compute credits (HF GPU Spaces) should allow `num_generations=8` and 500+ steps.

### 4. OCR Noise Robustness

The `adversarial` task (trap-resistant extraction with SUBTOTAL/FX noise lines) works as an environment, but the model hasn't been trained on it yet. Early inference tests show the model consistently falls for fake SUBTOTAL lines.

**Potential fix:** Add adversarial examples to the curriculum after the model achieves ≥0.60 on `medium`.
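
The proposed curriculum gate could look roughly like the sketch below. It is hypothetical throughout: the function name, the base task list, and the use of a mean over recent `medium` scores are assumptions, and only the 0.60 threshold comes from the text.

```python
def curriculum_tasks(recent_scores, gate_task="medium", threshold=0.60):
    """Return the tasks to sample from, unlocking `adversarial` once the gate is met.

    `recent_scores` maps task name to a list of recent scores (assumed format).
    """
    tasks = ["easy", "medium"]  # assumed starting curriculum
    scores = recent_scores.get(gate_task, [])
    if scores and sum(scores) / len(scores) >= threshold:
        tasks.append("adversarial")
    return tasks
```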

---

## What Makes This Novel

1. **Regulator agent**: no other OpenEnv environment we've seen has a cross-episode meta-agent that monitors another agent for systematic blind spots

2. **Closed self-improvement loop**: Regulator detects blind spot → Generator biases fraud generation toward that type → Auditor forced to improve, with no human intervention required

3. **Adversarial Generator arms race**: rewarding the Generator for evading the Auditor creates evolutionary pressure on fraud detection

4. **Live environment as verifier**: the training Colab calls `/grader` on the deployed HF Space directly, so the environment is the reward function

5. **4 independent reward signals**: made reward hacking visible during training (detected at step 10)

---

## Theme Alignment

| Theme | Alignment |
|-------|-----------|
| **#1 Multi-Agent** | 5 agents with conflicting incentives (Generator vs Auditor) |
| **#1 Sub: Fleet AI Oversight** (bonus) | Regulator monitors Auditor cross-episode |
| **#3.1 Professional Tasks** | Invoice processing = core enterprise workflow |
| **#3.1 Sub: Scaler AI Labs** (bonus) | Multi-agent RL for enterprise financial workflows |
| **#4 Self-Improvement** | Generator adapts based on Regulator blind spot findings |

---

## Links

- **Live Environment:** [https://ps2181-invoice-processing-pipeline.hf.space](https://ps2181-invoice-processing-pipeline.hf.space)
- **Gradio UI:** [https://ps2181-invoice-processing-pipeline.hf.space/web](https://ps2181-invoice-processing-pipeline.hf.space/web)
- **API Docs:** [https://ps2181-invoice-processing-pipeline.hf.space/docs](https://ps2181-invoice-processing-pipeline.hf.space/docs)
- **GitHub:** [https://github.com/ps2181/invoice-processing-pipeline](https://github.com/ps2181/invoice-processing-pipeline)

---

## Team

**Pritam Satpathy** + **Gnana Nawin T**
Meta PyTorch OpenEnv Hackathon Grand Finale
Scaler School of Technology, Bangalore, April 25–26, 2026