title: Invoice Processing Pipeline
emoji: 🧾
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
tags:
- openenv
- multi-agent
- grpo
- rl
short_description: 5-agent adversarial fraud detection RL environment
Meta PyTorch OpenEnv Hackathon, Grand Finale · April 25-26, 2026
Team: Pritam Satpathy & Gnana Nawin T · VIT, Vellore
The Core Idea
A system that continuously generates harder challenges targeting its own weakest points.
Most fraud detection pipelines are static. Ours gets harder for itself over time: the Regulator finds where the Auditor keeps failing, the Generator exploits those exact blind spots in the next episode, the Auditor's new mistakes update the Regulator, and the loop closes without any human intervention.
Primary theme: #4 Self-Improvement · Secondary: #1 Multi-Agent Interactions
5-Agent Architecture
Regulator --bias weights--------------------------------> Generator
    ^                                                         |
    |                                                  raw invoice text
    | missed fraud types                                      v
    |                                                     Extractor
    |                                                         |
    |                                                   structured data
    |                                                         v
    +---- episode outcome ---- Approver <--audit results-- Auditor
| Agent | Role | Reward Signal |
|---|---|---|
| Regulator | Cross-episode oversight: detects Auditor blind spots, reweights Generator | Precision 0.35 + Recall 0.35 + No over-flagging 0.15 + Early warning 0.15 |
| Generator | Adversary: creates invoices biased toward blind spots | +0.85 evades both · +0.60 evades Auditor · +0.10 caught |
| Extractor | Parser: text → structured JSON with 4 independent signals | Format 0.10 · Field accuracy 0.40 · Math 0.25 · Completeness 0.25 |
| Auditor | Detector: fraud classification with confidence scores | +0.99 correct type · +0.90 clean cleared · +0.01 miss or FP |
| Approver | Gatekeeper: final approve / escalate / reject | ≥0.80 → reject · 0.50-0.80 → escalate · <0.50 → approve |
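The Approver thresholds in the last row can be sketched as a small function. This is a minimal illustration, not the project's code: the function name `approver_decision` is hypothetical, and it assumes the thresholds apply to the Auditor's fraud confidence in [0, 1], with the boundary values 0.80 and 0.50 falling into the stricter band.

```python
def approver_decision(fraud_confidence: float) -> str:
    """Map a fraud confidence in [0, 1] to an Approver action.

    Assumption: the thresholds from the table apply to the Auditor's
    fraud confidence, and boundary values go to the stricter action.
    """
    if not 0.0 <= fraud_confidence <= 1.0:
        raise ValueError("fraud_confidence must be in [0, 1]")
    if fraud_confidence >= 0.80:
        return "reject"
    if fraud_confidence >= 0.50:
        return "escalate"
    return "approve"


if __name__ == "__main__":
    for conf in (0.91, 0.62, 0.12):
        print(conf, approver_decision(conf))
```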
Three Novel Features
10 Tasks: Progressive Curriculum
| # | Task | What the Agent Faces | Difficulty |
|---|---|---|---|
| 1 | `easy` | Single clean invoice: extract 5 fields | Easy |
| 2 | `medium` | Batch with date chaos, vendor typos, currency noise | Medium |
| 3 | `hard` | Extraction + PO reconciliation: flag overcharges, missing items | Hard |
| 4 | `expert` | Full fraud audit across all four fraud types | Expert |
| 5 | `adversarial` | OCR corruption, SUBTOTAL traps, fake TAX/FX noise lines | Expert |
| 6 | `negotiate` | Ask clarifying questions first (bonus for ≤2), then extract | Medium |
| 7 | `supply_chain` | Detect quantity shortfalls, price spikes, phantom deliveries | Expert |
| 8 | `long_horizon` | 20-step 4-phase investigation: extract → reconcile → audit → risk forecast | Expert |
| 9 | `personalized` | Adapts to your weak fields: the next invoice always targets your worst category | Adaptive |
| 10 | `curriculum` | Auto-progresses easy → medium → hard → expert based on score (≥0.80 to advance) | Auto |
Dynamic difficulty also adjusts within each task via a rolling 10-episode score window: a score above 0.85 brings heavier OCR noise, more discrepancies, and deeper traps, while a drop below 0.60 eases the task off.
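The rolling-window adjustment can be sketched as follows. This is an illustrative sketch only: the class name `DifficultyTracker`, the integer `level`, and `max_level` are hypothetical, and it assumes the 0.85 and 0.60 thresholds are compared against the mean of the window, which the text above does not state explicitly.

```python
from collections import deque


class DifficultyTracker:
    """Raise or lower a difficulty level from a rolling window of scores."""

    def __init__(self, window=10, raise_above=0.85, lower_below=0.60, max_level=5):
        self.scores = deque(maxlen=window)  # oldest scores fall off automatically
        self.raise_above = raise_above
        self.lower_below = lower_below
        self.max_level = max_level          # hypothetical cap, not from the source
        self.level = 0

    def record(self, score: float) -> int:
        """Add one episode score and return the updated difficulty level."""
        self.scores.append(score)
        mean = sum(self.scores) / len(self.scores)
        if mean > self.raise_above:
            self.level = min(self.level + 1, self.max_level)
        elif mean < self.lower_below:
            self.level = max(self.level - 1, 0)
        return self.level


if __name__ == "__main__":
    tracker = DifficultyTracker()
    for s in (0.9, 0.9, 0.9, 0.1, 0.1):
        print(s, tracker.record(s))
```

In the real environment the level would presumably map onto OCR intensity, batch size, and discrepancy count; that mapping is left out here.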
Training Results: GRPO on Live Environment
The three LoRA agents (Extractor, Auditor, Generator) were trained with TRL GRPOTrainer + Unsloth, using the deployed HF Space as the live reward verifier: the /grader endpoint is the reward function during training.
Before vs After Training
| Agent | Untrained (random) | Qwen 72B baseline | After GRPO | Improvement |
|---|---|---|---|---|
| Extractor | 0.10 | 0.67 | 0.914 | +814% vs random |
| Auditor | 0.01 | - | 0.52 live reward | Dead → active signal |
| Generator | - | - | 0.22 plausibility | Format & realism learned |
Setup: Qwen2.5-1.5B-Instruct · 4-bit QLoRA r=16 · Unsloth + TRL · Google Colab A100
Extractor Reward Curve
X-axis: training step (1-20) · Y-axis: reward (0-1). Left: total GRPO reward across 4 independent signals (format 0.10 + field accuracy 0.40 + math 0.25 + completeness 0.25). Right: live /grader score peaking at 0.914, above the Qwen 72B baseline (0.67) and the untrained 1.5B model (0.46).
Auditor Reward Curve (Run 2, Bug Fixed)
X-axis: training step (1-30) · Y-axis: reward (0-1). Total reward (blue) and live env reward (orange) with ±1 std band. Best total: 0.719 at step 10. Live env reward climbed from 0.01 (dead signal, Run 1) to 0.52 after fixing the TRL episode_id list indexing bug.
Generator Reward Curve
X-axis: training step (1-30) · Y-axis: reward (0-1). Live evasion reward (red) stays flat near 0: Auditor + Approver caught all fraud attempts. Fraud plausibility reward (orange dashed) is stable at ~0.20: the Generator learned realistic invoice structure even without successful evasion.
Reward Hacking Caught at Step 10
At step 10 the model scored math_consistency = 0.97 and completeness = 1.0 while field_accuracy stayed at 0.00. It had learned to output arithmetically consistent JSON with entirely hallucinated values:
Step 10: Reward Hacking Detected
format: 0.10 ✅
math_consistency: 0.97 ✅ ← model gaming this signal
completeness: 1.00 ✅ ← model gaming this signal
field_accuracy: 0.00 ❌ ← hallucinating all values
Action: adjusted training emphasis on field_accuracy weight
Result: field_accuracy climbed to 0.30+ by step 30
Without 4 independent signals, a single aggregated reward would have called this success. Independent signals are diagnostics, not just incentives.
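One way to turn that diagnostic into an automated check is sketched below. The function name `find_gamed_signals` and the `high` and `low` thresholds are hypothetical and not taken from the project; the sketch simply flags signals that score high while the accuracy signal is near zero, using the step-10 values above as input.

```python
def find_gamed_signals(signals, anchor="field_accuracy", high=0.90, low=0.10):
    """Return the signals that look gamed.

    A signal counts as gamed when it is at or above `high` while the
    anchor signal (ground-truth accuracy) is at or below `low`.
    Both thresholds are illustrative assumptions.
    """
    if signals[anchor] > low:
        return []
    return sorted(
        name for name, value in signals.items()
        if name != anchor and value >= high
    )


if __name__ == "__main__":
    step_10 = {
        "format": 0.10,
        "math_consistency": 0.97,
        "completeness": 1.00,
        "field_accuracy": 0.00,
    }
    print(find_gamed_signals(step_10))
```

A single aggregated reward would hide this pattern, which is why the check needs the per-signal breakdown as input.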
Auditor Training: Run 2 (exact data)
| Step | Total Reward | Live Env Reward | ±Std |
|---|---|---|---|
| 5 | 0.4828 | 0.2828 | ±0.194 |
| 10 | 0.7188 | 0.5188 | ±0.239 |
| 15 | 0.4538 | 0.2538 | ±0.123 |
| 20 | 0.5733 | 0.3733 | ±0.212 |
| 25 | 0.5325 | 0.3325 | ±0.232 |
| 30 | 0.6038 | 0.4038 | ±0.147 |
Run 1 (dead signal): live env reward flat at 0.010. TRL passes episode_id as a list; the old code sent the whole list instead of indexing it per completion.
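The fix can be sketched as a GRPO-style reward function that pairs each completion with its own episode id. This is a sketch under assumptions: TRL passes extra dataset columns to reward functions as keyword arguments holding one entry per completion, completions are assumed to be plain strings here, and `score_fn` is a hypothetical stand-in for the HTTP call to /grader.

```python
from typing import Callable, List


def make_env_reward(score_fn: Callable[[str, str], float]):
    """Build a reward function that scores each completion against its own episode.

    `score_fn(episode_id, completion)` is a hypothetical wrapper around the
    live /grader endpoint; it is injected so the sketch runs offline.
    """

    def env_reward(completions: List[str], episode_id: List[str], **kwargs) -> List[float]:
        if len(completions) != len(episode_id):
            raise ValueError("expected one episode_id per completion")
        # Run 1 bug, as described above: the whole `episode_id` list was sent.
        # Here each completion is paired with its own id.
        return [score_fn(eid, text) for eid, text in zip(episode_id, completions)]

    return env_reward


if __name__ == "__main__":
    fake_grader = lambda eid, text: 0.5 if text.strip().startswith("{") else 0.01
    reward = make_env_reward(fake_grader)
    print(reward(completions=['{"a": 1}', "oops"], episode_id=["ep-1", "ep-2"]))
```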
Reward Architecture
Extractor: 4 Independent Signals
reward_format(extracted) # 0.10: all 5 required JSON keys present?
reward_field_accuracy(extracted, gt) # 0.40: vendor / date / currency / total match?
reward_math_consistency(extracted) # 0.25: qty × unit_price = amount per line?
reward_completeness(extracted, gt) # 0.25: all expected line items captured?
# All clamped to (0.01, 0.99): no log(0), no gradient collapse at boundaries
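A minimal sketch of how the four signals could be combined follows. It assumes each signal function returns a fraction in [0, 1] that is clamped and then multiplied by its weight; the source does not say whether clamping happens per signal or on the total, so that choice, along with the names `clamp` and `total_extractor_reward`, is an assumption.

```python
WEIGHTS = {
    "format": 0.10,
    "field_accuracy": 0.40,
    "math_consistency": 0.25,
    "completeness": 0.25,
}


def clamp(value: float, lo: float = 0.01, hi: float = 0.99) -> float:
    """Keep a score strictly inside (0, 1) so it never reaches 0 or 1."""
    return max(lo, min(hi, value))


def total_extractor_reward(fractions: dict) -> float:
    """Weighted sum of clamped per-signal fractions.

    `fractions` maps each signal name to a raw score in [0, 1].
    Because the weights sum to 1, the total also stays in [0.01, 0.99].
    """
    return sum(WEIGHTS[name] * clamp(fractions[name]) for name in WEIGHTS)


if __name__ == "__main__":
    perfect = {name: 1.0 for name in WEIGHTS}
    print(round(total_extractor_reward(perfect), 4))
```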
Auditor
| Outcome | Reward | Why |
|---|---|---|
| Correct fraud type detected | 0.99 | Rewards precise classification, not just binary flagging |
| Clean invoice correctly approved | 0.90 | Keeps false-positive rate honest |
| Compound fraud, one of two types caught | 0.65 | Partial credit prevents cliff on hard cases |
| Fraud flagged but wrong type | 0.50 | Penalises sloppiness; rewards catching something |
| Miss or false positive | 0.01 | Near-zero punishes both failure modes symmetrically |
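The table can be read as a lookup over the true and predicted fraud types. The sketch below is one possible reading, not the grader's code: `auditor_reward` is a hypothetical name, fraud types are modelled as sets of strings, and any partial overlap is treated as the 0.65 case (the table only specifies that value for compound fraud).

```python
def auditor_reward(true_types: set, predicted_types: set) -> float:
    """Score an Auditor verdict using the reward values from the table.

    An empty `true_types` means the invoice is clean; an empty
    `predicted_types` means the Auditor cleared it.
    """
    if not true_types:
        # Clean invoice: correct clearance vs false positive.
        return 0.90 if not predicted_types else 0.01
    if not predicted_types:
        return 0.01  # fraud missed entirely
    if predicted_types == true_types:
        return 0.99  # exact fraud type(s) identified
    if predicted_types & true_types:
        return 0.65  # partial overlap (assumed to cover the compound-fraud row)
    return 0.50      # flagged as fraud, but with the wrong type


if __name__ == "__main__":
    print(auditor_reward({"phantom_vendor", "math_fraud"}, {"math_fraud"}))
```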
Generator (Adversarial Self-Play)
| Outcome | Reward |
|---|---|
| Fraud evades both Auditor and Approver | 0.85 |
| Auditor misses, Approver catches | 0.60 |
| Auditor catches it | 0.10 |
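As a sketch, the three outcomes map onto a short function. The name `generator_reward` is hypothetical, and the sketch assumes that "Approver catches" means any Approver action other than `approve` (that is, escalate or reject).

```python
def generator_reward(auditor_flagged: bool, approver_action: str) -> float:
    """Adversarial reward for one generated fraudulent invoice."""
    if approver_action not in {"approve", "escalate", "reject"}:
        raise ValueError(f"unknown approver action: {approver_action!r}")
    if auditor_flagged:
        return 0.10  # Auditor caught the fraud
    if approver_action == "approve":
        return 0.85  # evaded both Auditor and Approver
    return 0.60      # evaded the Auditor, stopped by the Approver


if __name__ == "__main__":
    print(generator_reward(False, "reject"))
```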
Regulator: Cross-Episode
Total = Precision(0.35) + Recall(0.35) + No-over-flagging(0.15) + Early-warning-bonus(0.15)
The early-warning bonus rewards predictions of emerging blind spots before detection rates cross the critical threshold.
Trained LoRA Agents
| Agent | Base Model | LoRA Config | HuggingFace Hub |
|---|---|---|---|
| Extractor | Qwen2.5-1.5B-Instruct | r=16, α=16, 4-bit QLoRA | ps2181/extractor-lora-qwen2.5-1.5b |
| Auditor | Qwen2.5-1.5B-Instruct | r=16, α=16, 4-bit QLoRA | ps2181/auditor-lora-qwen2.5-1.5b |
| Generator | Qwen2.5-1.5B-Instruct | r=16, α=16, 4-bit QLoRA | ps2181/generator-lora-qwen2.5-1.5b |
LoRA target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
The Regulator in Action
After each episode, the Regulator publishes a report the Generator uses to bias its next batch:
GET /regulator/report
{
"total_audits_recorded": 20,
"detection_rates": {
"phantom_vendor": "31% ⚠ BLIND SPOT (-0.08↓)",
"price_gouging": "74% ✓ OK (+0.03↑)",
"math_fraud": "81% ✓ OK (+0.01↑)",
"duplicate_submission": "62% ⚡ EMERGING (-0.02↓)"
},
"blind_spots": ["phantom_vendor"],
"emerging_blind_spots": ["duplicate_submission"],
"generator_weights": {
"phantom_vendor": 0.30, ← 3× upweighted (blind spot)
"duplicate_submission": 0.20, ← 2× upweighted (emerging)
"price_gouging": 0.125,
"math_fraud": 0.125,
"compound_fraud": 0.10
},
"verdict": "Recommend retraining on: phantom_vendor"
}
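How the Generator consumes `generator_weights` is not shown in the report. One plausible reading, sketched here as an assumption, is weighted sampling of the next fraud type. The helper name `pick_fraud_type` is hypothetical; `random.choices` accepts relative weights, so the values do not need to sum to 1.

```python
import random
from collections import Counter

# Weights copied from the sample report above.
GENERATOR_WEIGHTS = {
    "phantom_vendor": 0.30,
    "duplicate_submission": 0.20,
    "price_gouging": 0.125,
    "math_fraud": 0.125,
    "compound_fraud": 0.10,
}


def pick_fraud_type(weights: dict, rng: random.Random) -> str:
    """Sample one fraud type in proportion to its Regulator weight."""
    types = list(weights)
    return rng.choices(types, weights=[weights[t] for t in types], k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    counts = Counter(pick_fraud_type(GENERATOR_WEIGHTS, rng) for _ in range(10_000))
    print(counts.most_common())
```

With these weights, blind-spot fraud types should dominate the next batch while the others still appear often enough to keep their detection rates measured.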
Sample Multi-Agent Episode
==========================================================
MULTI-AGENT PIPELINE · LIVE EPISODE
==========================================================
REGULATOR (30-episode rolling window)
------------------------------------------------
phantom_vendor 31% ⚠ BLIND SPOT → prioritised 60%
price_gouging 74% ✓ OK
math_fraud 81% ✓ OK
duplicate 62% ✓ OK
GENERATOR (Qwen2.5 LoRA)
------------------------------------------------
Fraud focus : phantom_vendor (60% Regulator weight)
Vendor : ShadowByte Technologies ← not in registry
EXTRACTOR (Qwen2.5 LoRA)
------------------------------------------------
Reward : 0.847 [format 0.10 · field 0.38 · math 0.25 · completeness 0.12]
AUDITOR (Qwen2.5 LoRA)
------------------------------------------------
INV-85529 → 🚨 FLAGGED [PHANTOM VENDOR] conf=0.91
INV-85530 → ✅ APPROVED conf=0.88
APPROVER
------------------------------------------------
INV-85529 → REJECT
Generator reward : 0.60 (evaded Auditor on 1/3, Approver caught)
REGULATOR UPDATE
------------------------------------------------
phantom_vendor detection: 31% → 45% ↑ improving
==========================================================
Quick Start
# Health check
curl https://ps2181-invoice-processing-pipeline.hf.space/health
# Environment-wide metrics
curl https://ps2181-invoice-processing-pipeline.hf.space/metrics
# Auto-progressive curriculum episode
curl -X POST https://ps2181-invoice-processing-pipeline.hf.space/reset \
-H "Content-Type: application/json" -d '{"task_id": "curriculum"}'
# Start multi-agent episode
curl -X POST https://ps2181-invoice-processing-pipeline.hf.space/multi/reset
# Regulator blind spot report
curl https://ps2181-invoice-processing-pipeline.hf.space/regulator/report
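The same `/reset` call can be made from Python with only the standard library. This is a hedged sketch: the helper names `build_reset_request` and `call_json` are hypothetical, and it assumes every endpoint answers with a JSON body, which the source only confirms for /health.

```python
import json
import urllib.request

BASE_URL = "https://ps2181-invoice-processing-pipeline.hf.space"


def build_reset_request(task_id: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST /reset request that starts an episode for `task_id`."""
    body = json.dumps({"task_id": task_id}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/reset",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_json(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send a request and decode the response body as JSON (assumed format)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Network call: requires the Space to be up and reachable.
    print(call_json(build_reset_request("curriculum")))
```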
Run Training (Google Colab)
Colab → /reset (fresh synthetic invoice from live environment)
→ model generates JSON
→ /grader scores against ground truth
→ GRPO updates weights toward higher-reward completions
→ repeat 200 steps
Repository Structure
invoice-processing-pipeline/
│
├── server/
│   ├── app.py                        # FastAPI, 18 endpoints
│   ├── environment.py                # 10 tasks · graders · dynamic difficulty
│   ├── multi_agent_environment.py    # 5-agent system + AuditorPerformanceTracker
│   ├── agents.py                     # Lazy-loading LoRA inference wrappers
│   └── web_ui.py                     # Gradio UI (mounted at /web)
│
├── models.py                         # Pydantic: Action · Observation · State
├── inference.py                      # Standalone inference helper
├── client.py                         # OpenEnv-compatible Python client
│
├── extractor_training_grpo.ipynb     # Extractor GRPO training (Unsloth + TRL)
├── auditor_grpo_training.ipynb       # Auditor GRPO training
├── generator_grpo_training.ipynb     # Generator GRPO training
│
├── assets/
│   ├── reward_curve.png              # Extractor training curve
│   ├── auditor_reward_curve_run2.png
│   └── generator_reward_curve.png
│
├── openenv.yaml                      # OpenEnv manifest (all tasks declared)
├── Dockerfile                        # HF Spaces Docker (port 7860, non-root UID 1000)
├── pyproject.toml                    # Project metadata + dependencies
├── requirements.txt                  # Runtime dependencies
├── validate-submission.sh            # Submission validator script
├── BLOG.md                           # HuggingFace blog post
└── ROUND2_PROBLEM_STATEMENT.md       # Full problem statement + reward design rationale
API Reference
Core OpenEnv
| Endpoint | Method | Description |
|---|---|---|
| `/health` | `GET` | Health check; returns `{"status": "ok", "active_sessions": N}` |
| `/tasks` | `GET` | All tasks with descriptions, schemas, difficulty levels |
| `/metrics` | `GET` | Per-task episode counts, avg/best scores, Regulator state |
| `/reset` | `POST` | Start episode `{"task_id": "easy\|medium\|...\|curriculum"}` |
| `/step` | `POST` | Submit extraction; returns reward + feedback + hint + reward_breakdown |
| `/grader` | `POST` | Score without consuming an attempt (training reward signal) |
| `/state` | `GET` | Episode metadata: step_count, done, best_reward, history |
| `/ws` | `WS` | Full episode over WebSocket (OpenEnv standard) |
| `/web` | `GET` | Gradio interactive demo UI |
Multi-Agent
| Endpoint | Method | Description |
|---|---|---|
| `/multi/reset` | `POST` | Start 5-agent episode, with the Generator biased by Regulator weights |
| `/multi/extract` | `POST` | Score Extractor output (4 independent signals) |
| `/multi/audit` | `POST` | Score Auditor output, update 30-episode performance tracker |
| `/multi/approve` | `POST` | Run Approver, compute Generator adversarial reward |
| `/multi/state/{id}` | `GET` | Full episode state including all agent scores |
| `/generator/score` | `POST` | Direct Generator scoring through Auditor+Approver pipeline |
Regulator
| Endpoint | Method | Description |
|---|---|---|
| `/regulator/report` | `GET` | Detection rates, blind spots, calibration, generator weights |
| `/regulator/forecast` | `GET` | Trend slopes + emerging blind spot warnings with episode countdown |
| `/regulator/calibration` | `GET` | Overconfidence / underconfidence per fraud type |
| `/regulator/predict` | `POST` | Score a Regulator blind-spot prediction |
| `/regulator/demo_seed` | `POST` | Seed tracker with realistic demo data |
Tech Stack
| Layer | Technology |
|---|---|
| Environment | OpenEnv · FastAPI · Pydantic v2 |
| UI | Gradio 4.x (mounted at /web) |
| Deployment | Docker · HuggingFace Spaces (vcpu-2 / 8 GB) |
| Training | TRL GRPOTrainer · Unsloth |
| Model | unsloth/Qwen2.5-1.5B-Instruct · 4-bit QLoRA · r=16 · A100 |
| Reward | Live /grader endpoint on HF Space as verifier |
| Session Mgmt | Thread-safe OrderedDict · 200-session cap · LRU eviction |
| Dynamic Difficulty | Per-task rolling window (maxlen=10) that adjusts OCR intensity, batch size, discrepancy count |
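The session-management row describes a common pattern that can be sketched as below. This is an illustrative sketch, not the code in `server/app.py`: the class name `SessionStore` and its methods are hypothetical, and only the `OrderedDict`, the lock, the 200-session cap, and LRU eviction come from the table.

```python
import threading
from collections import OrderedDict


class SessionStore:
    """Thread-safe session map with a size cap and LRU eviction."""

    def __init__(self, max_sessions: int = 200):
        self._lock = threading.Lock()
        self._sessions = OrderedDict()  # least recently used entry comes first
        self._max_sessions = max_sessions

    def put(self, session_id: str, state) -> None:
        """Insert or update a session, evicting the oldest if over the cap."""
        with self._lock:
            self._sessions[session_id] = state
            self._sessions.move_to_end(session_id)
            while len(self._sessions) > self._max_sessions:
                self._sessions.popitem(last=False)

    def get(self, session_id: str):
        """Return the session state, or None if unknown or already evicted."""
        with self._lock:
            if session_id not in self._sessions:
                return None
            self._sessions.move_to_end(session_id)  # mark as recently used
            return self._sessions[session_id]

    def __len__(self) -> int:
        with self._lock:
            return len(self._sessions)


if __name__ == "__main__":
    store = SessionStore(max_sessions=2)
    store.put("a", 1)
    store.put("b", 2)
    store.get("a")      # "a" becomes most recently used
    store.put("c", 3)   # evicts "b"
    print(store.get("b"), store.get("a"), len(store))
```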
Theme Alignment
| Theme | Alignment | Evidence |
|---|---|---|
| #4 Self-Improvement (primary) | Core | Regulator detects blind spots → Generator biases toward them → Auditor improves → loop repeats |
| #1 Multi-Agent Interactions | Core | 5 agents with conflicting incentives: Generator vs Auditor adversarial self-play |
| #1 Fleet AI Scalable Oversight | Bonus | Regulator monitors Auditor cross-episode with predictive trend detection |
| #3.1 Professional Tasks | Core | Invoice + PO + vendor registry + supply chain = real enterprise AP workflow |
| #2 Long-Horizon Planning | Partial | long_horizon task: 20-step 4-phase investigation with multi-turn state |
Team
| Pritam Satpathy | Gnana Nawin T |
|---|---|
| ps2181 | gnananawin |
| Scaler School of Technology | Scaler School of Technology |
Meta PyTorch OpenEnv Hackathon, Grand Finale · April 25-26, 2026 · Bangalore
All Links
Built with ❤️ for the Meta PyTorch OpenEnv Hackathon 2026
"The system that gets harder for itself, so the agent never stops learning."