| <div align="center"> |
|
|
| <h1>When the System Learns to Pressure-Test Itself</h1> |
|
|
| **How we built a 5-agent adversarial RL environment that detects invoice fraud β** |
| **and automatically gets harder when it finds its own blind spots.** |
|
|
| <br/> |
|
|
| **Pritam Satpathy & Gnana Nawin T Β· VIT, Vellore** |
|
|
| **Meta PyTorch OpenEnv Hackathon Β· Grand Finale Β· April 25β26, 2026** |
|
|
| </div> |
|
|
|
|
|
|
| --- |
|
|
| ## The Problem Nobody Talks About |
|
|
| Invoice fraud is boring to talk about and devastating in practice. |
|
|
| It costs businesses an estimated **5% of annual revenue**, and it doesn't announce itself β it hides in purchase order line items, disguised as rounding errors, vendor name typos, and suspiciously round numbers that only look wrong if you already know what to look for. |
|
|
| Finance teams today catch it manually. They compare thousands of invoices against purchase orders, cross-reference vendor registries, and flag anything that smells off. It's slow, it's error-prone, and critically β **it doesn't improve**. A human who misses phantom vendor fraud on Monday is statistically likely to miss it again on Friday. |
|
|
| We asked a different question: |
|
|
| > *What if you could build an LLM system that not only detects fraud, but gets better at detecting the exact fraud types it's currently failing on β automatically, without any human retraining the loop?* |
|
|
| That's what we built. |
|
|
| --- |
|
|
| ## The Core Idea: Make the System Pressure-Test Itself |
|
|
| Most multi-agent RL setups have agents that operate independently within a single episode. Ours doesn't. |
|
|
| We added a **cross-episode Regulator** β an agent that watches the Auditor across 30 rolling episodes, tracks which fraud types it's systematically missing, and quietly biases the Generator to produce more of those exact scenarios. |
|
|
| No human decides *"let's train more on phantom vendors."* The Regulator notices the detection rate for phantom vendors is at `31%` and trending downward, raises the alarm, and tells the Generator to send more phantom vendor invoices. **The loop closes itself.** |
|
|
| <div align="center"> |
| <img width="1710" height="326" alt="image" src="https://github.com/user-attachments/assets/319654c3-aa24-47e8-9716-734d4e902168" /> |
| </div> |
|
|
|
|
| The Auditor sees more of exactly what it's failing on. The Generator gets rewarded for finding those gaps. The Regulator earns points for predicting blind spots *before* they go critical. Every agent has skin in the game. |
|
|
| --- |
|
|
| ## Five Agents, One Closed Loop |
|
|
| <div align="center"> |
|
|
| | Agent | Role | Reward Signal | |
| |:---:|:---|:---| |
| | **Generator** | Creates clean or fraudulent invoices, biased by Regulator's blind-spot weights | `+0.85` evades both Β· `+0.60` evades Auditor Β· `+0.10` caught | |
| | **Extractor** | Raw invoice text β structured JSON | format `0.10` Β· field accuracy `0.40` Β· math `0.25` Β· completeness `0.25` | |
| | **Auditor** | Fraud classification with fraud type + confidence score | `+0.99` correct type Β· `+0.90` clean cleared Β· `+0.01` miss or FP | |
| | **Approver** | Final approve / escalate / reject, gated by confidence | `β₯0.80` β reject Β· `0.50β0.80` β escalate Β· `<0.50` β approve | |
| | **Regulator** | Cross-episode meta-agent, 30-episode rolling window | precision `0.35` + recall `0.35` + no over-flagging `0.15` + early warning `0.15` | |
|
|
| </div> |
|
|
| The **Regulator** is the part that makes this genuinely different. Most RL environments treat each episode as independent. The Regulator sits outside that β accumulating detection rates, computing trend slopes over 5-episode windows, and warning of *emerging* blind spots before they go critical. It's proactive oversight, not reactive retraining. |
|
|
| --- |
|
|
| ## Seven Tasks, One Curriculum |
|
|
| <div align="center"> |
|
|
| | # | Task | What the Agent Faces | Difficulty | |
| |:---:|:---|:---|:---:| |
| | 1 | `easy` | Single clean invoice β extract 5 fields | Easy | |
| | 2 | `medium` | Batch with date chaos, vendor typos, currency noise | Medium | |
| | 3 | `hard` | Extraction + PO reconciliation β flag overcharges, missing items | Hard | |
| | 4 | `expert` | Full fraud audit across all four fraud types | Expert | |
| | 5 | `adversarial` | OCR corruption, SUBTOTAL traps, fake TAX/FX noise lines | Expert | |
| | 6 | `negotiate` | Ask clarifying questions first (bonus for β€2), then extract | Medium | |
| | 7 | `supply_chain` | Detect quantity shortfalls, price spikes, phantom deliveries | Expert | |
|
|
| </div> |
|
|
| The difficulty also adjusts **dynamically** based on the agent's rolling score. Score above `0.85`? The next batch gets heavier OCR corruption, more PO discrepancies, deeper adversarial traps. Drop below `0.60`? It eases off. The agent is always working at its productive edge. |
|
|
| --- |
|
|
| ## The Part Where We Caught Our Own Reward Hacking |
|
|
| This was the most interesting moment in the project. |
|
|
| At training step 10, we had: |
|
|
| ``` |
| math_consistency: 0.97 |
| completeness: 1.00 |
| field_accuracy: 0.00 :( β hallucinating every actual value |
| ``` |
|
|
| The model had figured out that it could score well by outputting JSON that was *arithmetically correct* β quantities times unit prices summed to the totals perfectly β while **hallucinating every actual value**. Vendor name: made up. Date: made up. Currency: made up. All internally consistent. All completely wrong. |
|
|
| This is reward hacking. A single aggregated reward would have happily reported high performance and called it a day. |
|
|
| Our four **independent** reward signals made the failure immediately visible. We could see exactly which signal the model had learned to game and which it was ignoring. |
|
|
| > **That's the entire argument for independent reward functions: not just diversity, but diagnosability.** |
|
|
| We adjusted training emphasis. By step 30, field accuracy had climbed from `0.00` to `0.30+` while math consistency stayed stable. |
|
|
| <div align="center"> |
|
|
| | Step | Total Reward | Env Score | Format | Math Consistency | |
| |:---:|:---:|:---:|:---:|:---:| |
| | 10 | 2.361 | 0.113 | 0.900 | 0.347 | |
| | 20 | 2.595 | 0.282 | 0.900 | 0.413 | |
| | 30 | 2.657 | **0.304** | **0.950** | 0.403 | |
|
|
| **Environment score: `0.113 β 0.304` in 30 steps β a 169% improvement in live-graded extraction accuracy.** |
|
|
| </div> |
|
|
| --- |
|
|
| ## The Reward Architecture |
|
|
| ### π Extractor β 4 Independent Signals |
|
|
| ```python |
| reward_format(extracted) # weight 0.10 β all 5 required JSON keys present? |
| reward_field_accuracy(extracted, gt) # weight 0.40 β vendor / date / currency / total match? |
| reward_math_consistency(extracted) # weight 0.25 β qty Γ unit_price = amount per line? |
| reward_completeness(extracted, gt) # weight 0.25 β all expected line items present? |
| |
| # All clamped to (0.01, 0.99) β no log(0), no gradient collapse at boundaries |
| ``` |
|
|
| ### Auditor β Precision-Weighted |
|
|
| <div align="center"> |
|
|
| | Outcome | Reward | Why | |
| |:---|:---:|:---| |
| | Correct fraud type detected | **0.99** | Rewards precise classification, not just flagging | |
| | Clean invoice correctly approved | **0.90** | Keeps false-positive rate honest | |
| | Compound fraud β one of two types caught | **0.65** | Partial credit prevents discouragement on hard cases | |
| | Fraud flagged but wrong type | **0.50** | Penalises sloppiness while crediting intent | |
| | Miss or false positive | **0.01** | Near-zero punishes both failure modes symmetrically | |
|
|
| </div> |
|
|
| ### Regulator β Cross-Episode |
|
|
| ``` |
| Total = Precision(0.35) + Recall(0.35) + No-over-flagging(0.15) + Early-warning-bonus(0.15) |
| ``` |
|
|
| The early-warning bonus rewards the Regulator for predicting emerging blind spots *before* detection rates cross the critical threshold β proactive oversight, not reactive alarm. |
|
|
| --- |
|
|
| ## Building With OpenEnv |
|
|
| The environment is a FastAPI app deployed on HuggingFace Spaces, exposing the standard OpenEnv interface. The training Colab connects directly to the live Space β `/grader` *is* the reward function. There's no separate scoring script. **The environment and the verifier are the same thing.** |
|
|
| ```bash |
| # Start an episode |
| POST /reset {"task_id": "expert"} |
| |
| # Submit an extraction or audit result |
| POST /step {"episode_id": "...", "extracted_data": {...}} |
| |
| # Check Regulator state anytime |
| GET /regulator/report # detection rates, blind spots, generator bias weights |
| GET /regulator/forecast # trend slopes, emerging blind spots with early warnings |
| GET /regulator/calibration # overconfidence / underconfidence per fraud type |
| ``` |
|
|
| Training uses **GRPO via TRL** with **Unsloth-optimised 4-bit QLoRA** on `Qwen2.5-1.5B-Instruct` β three separate LoRA adapters for Extractor, Auditor, and Generator, each trained on their own reward signal. |
|
|
| ``` |
| Colab β /reset (fresh synthetic invoice from live environment) |
| β model generates JSON extraction |
| β /grader scores against ground truth |
| β GRPO updates weights toward higher-reward completions |
| β repeat 200 steps |
| ``` |
|
|
| --- |
|
|
| ## What We Learned |
|
|
| **Reward design is product design.** Every reward function is a specification for the behaviour you actually want. Getting the Auditor reward right β where catching the *right* fraud type earns `0.99` but the *wrong* type earns `0.50` and missing entirely earns `0.01` β took more thinking than most of the engineering. |
|
|
| **Multiple reward signals are diagnostics, not just incentives.** We didn't add four signals to the Extractor because the theory said to. We added them because we wanted to *see* where the model was failing. They paid off immediately at step 10. |
|
|
| **Cross-episode agents change what's possible.** The Regulator couldn't exist in a single-episode design. Most RL environments treat each episode as independent. Giving one agent access to the history of another creates a fundamentally different kind of oversight β one that looks less like evaluation and more like a genuine colleague watching your back. |
|
|
| --- |
|
|
| ## Try It |
|
|
| <div align="center"> |
|
|
| | Resource | Link | |
| |:---|:---| |
| | **Live Environment** | [ps2181-invoice-processing-pipeline.hf.space](https://ps2181-invoice-processing-pipeline.hf.space) | |
| | **Gradio Demo UI** | [/web](https://ps2181-invoice-processing-pipeline.hf.space/web) | |
| | **API Docs** | [/docs](https://ps2181-invoice-processing-pipeline.hf.space/docs) | |
| | **Training Colab** | [Open notebook](https://colab.research.google.com/drive/1C1_3giNt-NmbzKNFJr5_L1fms3L8LfmB) | |
| | **GitHub** | [invoice-processing-pipeline](https://github.com/ps2181/invoice-processing-pipeline) | |
| | **Extractor Model** | [ps2181/extractor-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/extractor-lora-qwen2.5-1.5b) | |
| | **Auditor Model** | [ps2181/auditor-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/auditor-lora-qwen2.5-1.5b) | |
| | **Generator Model** | [ps2181/generator-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/generator-lora-qwen2.5-1.5b) | |
| | **Demo Video** | [video](https://youtu.be/QSB4UOLvaC8?si=SGnIwsfTW4JGsU3e) | |
|
|
| </div> |
|
|
| --- |
|
|
| <div align="center"> |
|
|
| *Built for the Meta PyTorch OpenEnv Hackathon 2026.* |
| *Theme alignment: Multi-Agent Interactions (#1) Β· Fleet AI Scalable Oversight (#1 bonus) Β· Professional Tasks (#3.1) Β· Self-Improvement (#4)* |
|
|
| <br/> |
|
|
| **Pritam Satpathy & Gnana Nawin T Β· VIT, Vellore** |
|
|
| </div> |
|
|