# Invoice Processing Pipeline: Multi-Agent RL Environment for Financial Fraud Detection

**Meta PyTorch OpenEnv Hackathon Grand Finale | April 25-26, 2026**
**Team: Pritam Satpathy + Gnana Nawin T**

---

## The Problem

Invoice fraud costs businesses an estimated 5% of annual revenue. Finance teams manually process thousands of invoices every month, extracting vendor names, dates, line items, and totals, then checking them against purchase orders for discrepancies. The work is slow (hours per batch), error-prone (typos, OCR noise, inconsistent formats), and gameable (phantom vendors, price gouging, duplicate submissions).

We built an RL training environment that teaches LLMs to do this automatically, and that improves itself when it discovers its own blind spots.

---

## What We Built

An OpenEnv-compatible environment deployed on HuggingFace Spaces:
**[https://ps2181-invoice-processing-pipeline.hf.space](https://ps2181-invoice-processing-pipeline.hf.space)**

---

## Architecture: 5-Agent System

```
        ┌───────────────────────────────────────────────────┐
        │            ADVERSARIAL REWARD (dashed)            │
        │                                                   │
        ▼                                                   │
┌───────────────┐                                           │
│   Generator   │◄── Regulator biases fraud type ──┐        │
│ Creates fraud │                                  │        │
└───────┬───────┘                                  │        │
        │  Raw invoice text                        │        │
        ▼                                          │        │
┌───────────────┐                                  │        │
│   Extractor   │                                  │        │
│  Text → JSON  │                                  │        │
└───────┬───────┘                                  │        │
        │  Structured data                  ┌──────┴─────┐  │
        ▼                                   │ Regulator  │  │
┌───────────────┐                           │   Cross-   │  │
│    Auditor    │──── decision history ────►│  episode   │  │
│ Fraud detect  │                           │ meta-agent │  │
└───────┬───────┘                           └────────────┘  │
        │  Verdict + flags                                  │
        ▼                                                   │
┌───────────────┐                                           │
│   Approver    │───────────────────────────────────────────┘
│ Approve/reject│
└───────┬───────┘
        │
        ▼
┌──────────────────────────────────────┐
│     4 Independent Reward Signals     │
│ Format · Field · Math · Completeness │
└──────────────────────────────────────┘
```

| Agent | Role | Reward Signal |
|-------|------|---------------|
| **Generator** | Creates clean or fraudulent invoices | Rewarded when fraud slips past the Auditor (adversarial self-play) |
| **Extractor** | Reads raw invoice text → structured JSON | 4 independent signals: format, field accuracy, math consistency, completeness |
| **Auditor** | Reviews extraction, flags fraud | +0.99 correct detection, +0.90 clean clearance, 0.01 for a miss or false positive |
| **Approver** | Final approve/reject/escalate decision | +0.95 correct decision |
| **Regulator** | Monitors Auditor blind spots across episodes | Precision + recall of blind spot predictions |
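
The Auditor row of the table can be read as a small lookup on two booleans. The sketch below is only an illustration of that reading: the function name and its arguments are hypothetical, and the deployed grader may compute this differently.

```python
# Reward values copied from the Auditor row of the agent table.
CORRECT_DETECTION = 0.99
CLEAN_CLEARANCE = 0.90
MISS_OR_FALSE_POSITIVE = 0.01


def auditor_reward(is_fraud: bool, flagged: bool) -> float:
    """Score one Auditor verdict against the ground-truth label (hypothetical helper)."""
    if is_fraud and flagged:
        return CORRECT_DETECTION        # fraud present and caught
    if not is_fraud and not flagged:
        return CLEAN_CLEARANCE          # clean invoice correctly cleared
    return MISS_OR_FALSE_POSITIVE       # missed fraud, or clean invoice flagged


for case in [(True, True), (False, False), (True, False), (False, True)]:
    print(case, auditor_reward(*case))
```
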

---

## The Key Innovation: The Regulator

The Regulator is a cross-episode meta-agent. It watches the Auditor's decision history over the last 30 episodes and identifies systematic failure patterns:

```
AUDITOR PERFORMANCE TRACKER (last 30 episodes)

Fraud Type              Detection Rate
──────────────────────────────────────
phantom_vendor          31%   BLIND SPOT
price_gouging           74%   OK
math_fraud              81%   OK
duplicate_submission    62%   OK

False Positive Rate: 12%   OK

REGULATOR VERDICT: Recommend retraining on phantom_vendor
```

When the Regulator detects a blind spot, the Generator automatically starts producing more of that fraud type, closing the self-improvement loop without human intervention.

This directly addresses **Theme #1 (Multi-Agent, Fleet AI Scalable Oversight sub-theme)** and **Theme #4 (Self-Improvement)**.
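
A minimal sketch of the tracking logic described in this section, under stated assumptions: the class name, the 0.50 blind-spot threshold, and the 3x sampling boost are all hypothetical (the write-up gives none of them), and only the 30-episode window comes from the text.

```python
from collections import defaultdict, deque

WINDOW = 30              # episodes of Auditor history, as stated above
BLIND_SPOT_BELOW = 0.50  # assumed threshold; not given in the write-up


class RegulatorSketch:
    """Tracks per-fraud-type detection rates over a rolling window."""

    def __init__(self, window=WINDOW):
        self.history = deque(maxlen=window)  # (fraud_type, detected) pairs

    def record(self, fraud_type, detected):
        self.history.append((fraud_type, bool(detected)))

    def detection_rates(self):
        hits, totals = defaultdict(int), defaultdict(int)
        for fraud_type, detected in self.history:
            totals[fraud_type] += 1
            hits[fraud_type] += detected
        return {t: hits[t] / totals[t] for t in totals}

    def blind_spots(self, threshold=BLIND_SPOT_BELOW):
        rates = self.detection_rates()
        return sorted(t for t, rate in rates.items() if rate < threshold)

    def generator_weights(self, fraud_types, boost=3.0):
        """Sampling weights for the Generator; blind spots are up-weighted."""
        spots = set(self.blind_spots())
        raw = {t: (boost if t in spots else 1.0) for t in fraud_types}
        total = sum(raw.values())
        return {t: w / total for t, w in raw.items()}


reg = RegulatorSketch()
for i in range(10):
    reg.record("phantom_vendor", i < 3)  # 3 of 10 detected
    reg.record("price_gouging", i < 8)   # 8 of 10 detected
print(reg.blind_spots())
print(reg.generator_weights(["phantom_vendor", "price_gouging"]))
```
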

---

## 7 Tasks (Progressive Difficulty)

| Task | Difficulty | What the Agent Does |
|------|-----------|---------------------|
| `easy` | Easy | Extract fields from a single clean invoice |
| `medium` | Medium | Clean + normalise a batch of messy invoices (typos, inconsistent date formats, currency symbols) |
| `hard` | Hard | Extract + reconcile against purchase orders, flag discrepancies |
| `expert` | Expert | Fraud audit: classify phantom_vendor / price_gouging / math_fraud / duplicate_submission |
| `adversarial` | Hard | Extract from OCR-corrupted invoice with SUBTOTAL trap and FX noise lines |
| `negotiate` | Medium | Ask clarification questions, then submit extraction (bonus for ≤2 questions) |
| `supply_chain` | Expert | Detect quantity shortfalls, price spikes, phantom deliveries in delivery records |

---

## Design Decisions

### 4 Independent Reward Functions (Anti-Hacking)

Per the hackathon guide: *"use multiple independent reward functions; if you only have one, it is easier for the model to hack it."*

```python
format_reward()        # Are all 5 required JSON keys present?            weight: 0.10
field_reward()         # Do vendor/date/currency/total match?             weight: 0.40
math_reward()          # Does qty * unit_price == amount for all items?   weight: 0.25
completeness_reward()  # Are all line items present (recall)?             weight: 0.25
```

During training we observed the model maximising `math_reward` (0.97) and `completeness_reward` (1.0) while `field_reward` stayed at 0.00: it had learned to output arithmetically consistent JSON while hallucinating the field values. Because the signals are reported independently, this reward hacking was immediately visible, which confirmed the design choice.
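
To make the point concrete, here is a minimal sketch of how the four signals might be combined and inspected. It assumes the aggregate is a plain weighted sum using the weights listed above. The `diagnose` helper and its 0.9 / 0.1 thresholds are hypothetical, and the `format` value in the example is made up for illustration (the other three values are the ones reported above).

```python
# Weights as listed in the reward function block above.
WEIGHTS = {"format": 0.10, "field": 0.40, "math": 0.25, "completeness": 0.25}


def aggregate(signals):
    """Weighted sum of the four signals (assumed aggregation rule)."""
    return sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)


def diagnose(signals, high=0.9, low=0.1):
    """Hypothetical check: which signals are near-maxed and which are starved."""
    maxed = [name for name in WEIGHTS if signals[name] >= high]
    starved = [name for name in WEIGHTS if signals[name] <= low]
    return {"maxed": maxed, "starved": starved, "suspicious": bool(maxed and starved)}


# math, completeness, and field are the values reported in the text;
# the format value is an assumption for this example.
observed = {"format": 0.90, "field": 0.00, "math": 0.97, "completeness": 1.00}
print(round(aggregate(observed), 4))
print(diagnose(observed))
```

The aggregate alone looks like a mediocre but improving score. The per-signal view shows that the most heavily weighted signal contributes nothing.
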

### Adversarial Self-Play

The Generator is rewarded when its fraud evades the Auditor:
- Fraud undetected, Approver approves → Generator reward: **0.85**
- Auditor missed it but Approver caught it → Generator reward: **0.60**
- Auditor caught it → Generator reward: **0.10**

The intent is an arms race: the Generator is pushed toward harder-to-detect fraud, which in turn forces the Auditor to improve.
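
The three cases above map onto a short function. This is a sketch for fraudulent invoices only; the function name and argument names are hypothetical, and the reward values are the ones listed above.

```python
def generator_reward(auditor_flagged: bool, approver_rejected: bool) -> float:
    """Reward for a fraudulent invoice, given how far it got (hypothetical helper)."""
    if auditor_flagged:
        return 0.10   # Auditor caught it
    if approver_rejected:
        return 0.60   # Auditor missed it, Approver caught it
    return 0.85       # slipped past both agents


print(generator_reward(auditor_flagged=False, approver_rejected=False))
print(generator_reward(auditor_flagged=False, approver_rejected=True))
print(generator_reward(auditor_flagged=True, approver_rejected=True))
```
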

### Dynamic Difficulty

The environment tracks recent agent scores per task (rolling window of 10 episodes) and adjusts generation parameters:
- Agent scoring ≥ 0.85 → harder parameters (more invoices, more OCR noise, more discrepancies)
- Agent scoring < 0.60 → easier parameters
- In between → standard
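
A minimal sketch of such a tracker follows. It assumes the thresholds are applied to the mean of the rolling window, which the text does not specify, and the class name and level labels are hypothetical.

```python
from collections import defaultdict, deque


class DifficultyTracker:
    """Per-task rolling window of scores mapped to a difficulty level."""

    def __init__(self, window=10, harder_at=0.85, easier_below=0.60):
        self.harder_at = harder_at
        self.easier_below = easier_below
        self.scores = defaultdict(lambda: deque(maxlen=window))

    def record(self, task, score):
        self.scores[task].append(score)

    def level(self, task):
        recent = self.scores[task]
        if not recent:
            return "standard"             # no history yet
        mean = sum(recent) / len(recent)  # assumption: thresholds apply to the mean
        if mean >= self.harder_at:
            return "harder"
        if mean < self.easier_below:
            return "easier"
        return "standard"


tracker = DifficultyTracker()
for _ in range(10):
    tracker.record("medium", 0.9)
print(tracker.level("medium"))
```
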

### All Rewards Clamped to [0.01, 0.99]

Clamping avoids `log(0)` wherever a reward is log-transformed in the policy-gradient pipeline, and keeps the model from getting stuck at the boundaries.
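
As a sketch, the clamp is a one-liner. The helper name is hypothetical; the bounds are the ones stated in the heading.

```python
import math


def clamp_reward(raw: float, low: float = 0.01, high: float = 0.99) -> float:
    """Clamp a raw score into the closed interval [low, high]."""
    return max(low, min(high, raw))


# A raw score of 0.0 becomes 0.01, so its log is finite.
print(clamp_reward(0.0), math.log(clamp_reward(0.0)))
print(clamp_reward(1.0))
```
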

---

## Tech Stack

```
Environment:  FastAPI + OpenEnv-core + Pydantic
Deployment:   HuggingFace Spaces (Docker, port 7860)
UI:           Gradio (mounted at /web)
Training:     TRL GRPOTrainer + Unsloth (Qwen2.5-1.5B-Instruct, 4-bit QLoRA)
Model:        unsloth/Qwen2.5-1.5B-Instruct, r=16 LoRA
Reward:       4 local signals + live /grader endpoint on HF Space
```

---

## Training Setup

GRPO (Group Relative Policy Optimization) with:
- `num_generations = 4`: 4 completions per prompt, compared within the group
- `max_steps = 200`
- `learning_rate = 5e-6`
- Live `/grader` endpoint on HF Space as environment verifier
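
"Compared within the group" refers to GRPO's group-relative advantage: each completion's reward is centred on the group mean and scaled by the group's standard deviation. The sketch below shows that idea in plain Python. It uses the population standard deviation and a small `eps` for stability, both simplifying assumptions, and the normalisation inside TRL may differ in detail.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Centre and scale the rewards of one prompt's completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt (num_generations = 4), with made-up scores.
print(group_relative_advantages([0.1, 0.3, 0.3, 0.9]))
```

Completions that score above the group mean get a positive advantage and are reinforced. If all four score the same, every advantage is zero and the prompt contributes no gradient.
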

The training loop:
```
Colab samples episode → HF Space /reset → gets live invoice
Model generates JSON extraction
HF Space /grader scores it against ground truth
GRPO updates model toward higher-scoring completions
```
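
The first three steps of the loop can be sketched as a small client. The request and response shapes here are assumptions: the payload keys (`task`, `extraction`) and response keys (`observation`, `score`) are hypothetical, as is the use of POST for both endpoints. The real schema is in the Space's `/docs` page. The HTTP call is passed in as a parameter so the example below runs against a stub instead of the network.

```python
import json
from urllib import request

BASE_URL = "https://ps2181-invoice-processing-pipeline.hf.space"


def post_json(path, payload, base_url=BASE_URL, timeout=30):
    """POST a JSON payload and decode the JSON response."""
    req = request.Request(
        base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def score_completions(task, generate, post=post_json, num_generations=4):
    """Fetch one invoice, sample completions, and grade each one."""
    episode = post("/reset", {"task": task})   # hypothetical payload
    invoice_text = episode["observation"]      # hypothetical response key
    completions = [generate(invoice_text) for _ in range(num_generations)]
    scores = [
        post("/grader", {"task": task, "extraction": c})["score"]  # hypothetical keys
        for c in completions
    ]
    return completions, scores


def stub_post(path, payload):
    """Offline stand-in for the Space, used only for this example."""
    return {"observation": "INVOICE ..."} if path == "/reset" else {"score": 0.5}


print(score_completions("easy", lambda text: '{"vendor": "ACME"}', post=stub_post))
```
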

---

## What Worked (Achievements)

### 1. Reward Hacking Detection: Caught at Step 10

The independent reward signals caught a classic reward hacking pattern immediately. The model maximised math and completeness while hallucinating field values. Without 4 independent signals, this would have been hidden behind a rising aggregate reward.

| Step | Total Reward | Env Score | Format | Math |
|------|-------------|-----------|--------|------|
| 10 | 2.361 | 0.113 | 0.900 | 0.347 |
| 20 | 2.595 | 0.282 | 0.900 | 0.413 |
| 30 | 2.657 | 0.304 | 0.950 | 0.403 |

Environment score rose **0.113 → 0.304** between steps 10 and 30, a 169% improvement in correct invoice extraction as scored by the live environment grader.

### 2. Live Environment as Verifier

The training Colab calls `/grader` directly on the deployed HF Space, so the environment is the reward function. There is no separate reward model, and scoring is deterministic and reproducible.

### 3. Regulator Concept Validated

The cross-episode tracking logic works: the Regulator correctly identifies `phantom_vendor` as the Auditor's weakest category and triggers Generator bias toward that fraud type. No other OpenEnv environment we've seen implements a cross-episode meta-agent.

### 4. Full 7-Task Ladder Deployed

All 7 tasks are live on the HF Space with independent graders, schemas, and difficulty calibration. The progressive structure directly supports curriculum learning.

### 5. Clean OpenEnv API Compliance

Standard `reset()` / `step()` / `state()` interface, WebSocket support, Swagger docs at `/docs`, and a Gradio UI at `/web`. Drop-in compatible with any OpenEnv training script.

---

## Where We're Having Problems (Honest Assessment)

### 1. Field Reward Plateau

The `field_reward` (vendor name, date, currency, total accuracy) remains the hardest signal for the 1.5B model to crack. Even at step 30, the environment score is 0.304, meaning the model still hallucinates field values despite correct structure and math. We suspect a model capacity issue: Qwen2.5-1.5B may not have enough parameters to learn extraction patterns from raw OCR text within a 200-step budget.

**Potential fix:** Switch to Qwen2.5-7B, or add a light SFT warmup phase with 50-100 correct extraction examples before RL.

### 2. Multi-Agent Coordination Not Yet Trained End-to-End

The 5-agent architecture is designed and the environment supports it, but we haven't yet run the full adversarial training loop (Generator vs Auditor) end-to-end with GRPO. Currently, the Extractor is trained in isolation. The Regulator logic runs as environment-side code, not as a trainable agent.

**Potential fix:** Implement a two-phase training loop. Phase 1: train the Extractor on easy/medium. Phase 2: train the Auditor against the Generator with Regulator feedback.

### 3. Compute Constraints

4-bit QLoRA on a Colab T4 limits batch sizes and generation counts. With `num_generations=4`, each step is slow enough that we couldn't push past ~50 steps in the available time. The reward curves are trending upward but haven't converged.

**Potential fix:** Onsite compute credits (HF GPU Spaces) should allow `num_generations=8` and 500+ steps.

### 4. OCR Noise Robustness

The `adversarial` task (trap-resistant extraction with SUBTOTAL/FX noise lines) works as an environment, but the model hasn't been trained on it yet. Early inference tests show the model consistently falls for fake SUBTOTAL lines.

**Potential fix:** Add adversarial examples to the curriculum once the model achieves ≥0.60 on `medium`.

---

## What Makes This Novel

1. **Regulator agent**: no other OpenEnv environment we've seen has a cross-episode meta-agent that monitors another agent for systematic blind spots

2. **Closed self-improvement loop**: Regulator detects blind spot → Generator biases fraud generation toward that type → Auditor forced to improve, with no human intervention required

3. **Adversarial Generator arms race**: rewarding the Generator for evading the Auditor creates evolutionary pressure on fraud detection

4. **Live environment as verifier**: the training Colab calls `/grader` directly on the deployed HF Space, so the environment is the reward function

5. **4 independent reward signals**: these made reward hacking immediately visible during training (detected at step 10)

---

## Theme Alignment

| Theme | Alignment |
|-------|-----------|
| **#1 Multi-Agent** | 5 agents with conflicting incentives (Generator vs Auditor) |
| **#1 Sub: Fleet AI Oversight** (bonus) | Regulator monitors Auditor cross-episode |
| **#3.1 Professional Tasks** | Invoice processing = core enterprise workflow |
| **#3.1 Sub: Scaler AI Labs** (bonus) | Multi-agent RL for enterprise financial workflows |
| **#4 Self-Improvement** | Generator adapts based on Regulator blind spot findings |

---

## Links

- **Live Environment:** [https://ps2181-invoice-processing-pipeline.hf.space](https://ps2181-invoice-processing-pipeline.hf.space)
- **Gradio UI:** [https://ps2181-invoice-processing-pipeline.hf.space/web](https://ps2181-invoice-processing-pipeline.hf.space/web)
- **API Docs:** [https://ps2181-invoice-processing-pipeline.hf.space/docs](https://ps2181-invoice-processing-pipeline.hf.space/docs)
- **GitHub:** [https://github.com/ps2181/invoice-processing-pipeline](https://github.com/ps2181/invoice-processing-pipeline)

---

## Team

**Pritam Satpathy** + **Gnana Nawin T**
Meta PyTorch OpenEnv Hackathon Grand Finale
Scaler School of Technology, Bangalore, April 25-26, 2026