title: Invoice Processing Pipeline
emoji: 🧾
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
tags:
  - openenv
  - multi-agent
  - grpo
  - rl
short_description: 5-agent adversarial fraud detection RL environment


Meta PyTorch OpenEnv Hackathon: Grand Finale · April 25–26, 2026

Team: Pritam Satpathy & Gnana Nawin T · VIT, Vellore



🔥 The Core Idea

A system that continuously generates harder challenges targeting its own weakest points.

Most fraud detection pipelines are static. Ours gets harder for itself over time: the Regulator finds where the Auditor keeps failing, the Generator exploits those exact blind spots in the next episode, and the Auditor's new mistakes update the Regulator, so the loop closes without any human intervention.

Primary theme: #4 Self-Improvement · Secondary: #1 Multi-Agent Interactions

5-agent self-improvement loop

🤖 5-Agent Architecture

🎯 Regulator ──bias weights──────────────────────────────────► ⚡ Generator
     ▲                                                                │
     │                                                      raw invoice text
     │ missed fraud types                                             ▼
     │                                                         🔍 Extractor
     │                                                                │
     │                                                     structured data
     │                                                                ▼
     └──── episode outcome ──── ✅ Approver ◄─audit results─── 🕵️ Auditor
| Agent | Role | Reward Signal |
|---|---|---|
| 🎯 Regulator | Cross-episode oversight: detects Auditor blind spots, reweights Generator | Precision 0.35 + Recall 0.35 + No over-flagging 0.15 + Early warning 0.15 |
| ⚡ Generator | Adversary: creates invoices biased toward blind spots | +0.85 evades both · +0.60 evades Auditor · +0.10 caught |
| 🔍 Extractor | Parser: text → structured JSON with 4 independent signals | Format 0.10 · Field accuracy 0.40 · Math 0.25 · Completeness 0.25 |
| 🕵️ Auditor | Detector: fraud classification with confidence scores | +0.99 correct type · +0.90 clean cleared · +0.01 miss or FP |
| ✅ Approver | Gatekeeper: final approve / escalate / reject | ≥0.80 → reject · 0.50–0.80 → escalate · <0.50 → approve |
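
The Approver thresholds in the last row can be read as a simple banding rule. The sketch below is only an illustration: the function name `approver_decision` is hypothetical, and treating exactly 0.80 as "reject" and exactly 0.50 as "escalate" is an assumption about the boundaries, not something the repository confirms.

```python
def approver_decision(fraud_confidence: float) -> str:
    """Map the Auditor's fraud confidence to a final decision.

    Bands follow the table above; the handling of the exact
    boundary values (0.80 and 0.50) is an assumption.
    """
    if fraud_confidence >= 0.80:
        return "reject"
    if fraud_confidence >= 0.50:
        return "escalate"
    return "approve"


if __name__ == "__main__":
    for conf in (0.91, 0.65, 0.20):
        print(conf, approver_decision(conf))
```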

⚡ Three Novel Features

🔮 Predictive Regulator

Computes trend slopes over 5-episode windows.
Warns of emerging blind spots before detection rates cross the critical threshold: proactive oversight, not reactive retraining.

+0.15 early-warning bonus
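
One way to picture the trend check is a least-squares slope over the last five detection rates, projected a few episodes ahead. This is a sketch under stated assumptions: the function names, the critical threshold of 0.50, and the 3-episode projection horizon are all hypothetical, since the text above only says that slopes are computed over 5-episode windows.

```python
from typing import Sequence


def trend_slope(rates: Sequence[float]) -> float:
    """Ordinary least-squares slope of rates against episode index."""
    n = len(rates)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = sum(rates) / n
    num = sum((i - x_mean) * (r - y_mean) for i, r in enumerate(rates))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def is_emerging(rates: Sequence[float], window: int = 5,
                critical: float = 0.50, horizon: int = 3) -> bool:
    """Flag a fraud type that is still above `critical` but trending below it.

    `critical` and `horizon` are assumed values for illustration only.
    """
    recent = list(rates)[-window:]
    slope = trend_slope(recent)
    projected = recent[-1] + slope * horizon
    return recent[-1] >= critical and slope < 0 and projected < critical


if __name__ == "__main__":
    falling = [0.70, 0.66, 0.62, 0.58, 0.54]
    print(trend_slope(falling), is_emerging(falling))
```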

🧬 Compound Fraud

Invoices carry two fraud signals simultaneously (e.g. phantom vendor + price gouging).
Partial credit +0.65 for catching one; full reward +0.99 for both.

Prevents single-signal heuristics.

📊 Confidence Calibration

Tracks (confidence, correct?) pairs per fraud type.
Detects overconfident misses, such as the Auditor saying "90% sure, approved" on a fraudulent invoice: the most dangerous real-world failure mode.
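
A minimal sketch of such a tracker is shown below. The class name `CalibrationTracker`, the 0.80 overconfidence cut-off, and the "mean confidence minus accuracy" gap are assumptions chosen for illustration; the environment's actual tracker may compute calibration differently.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple


class CalibrationTracker:
    """Stores (confidence, correct?) pairs per fraud type (hypothetical helper)."""

    def __init__(self, overconfident_at: float = 0.80) -> None:
        self.overconfident_at = overconfident_at  # assumed cut-off
        self.pairs: DefaultDict[str, List[Tuple[float, bool]]] = defaultdict(list)

    def record(self, fraud_type: str, confidence: float, correct: bool) -> None:
        self.pairs[fraud_type].append((confidence, correct))

    def overconfident_misses(self, fraud_type: str) -> int:
        """Count wrong verdicts that were issued with high confidence."""
        return sum(1 for conf, ok in self.pairs[fraud_type]
                   if not ok and conf >= self.overconfident_at)

    def calibration_gap(self, fraud_type: str) -> float:
        """Mean confidence minus accuracy; positive means overconfident."""
        pairs = self.pairs[fraud_type]
        if not pairs:
            return 0.0
        mean_conf = sum(conf for conf, _ in pairs) / len(pairs)
        accuracy = sum(1 for _, ok in pairs if ok) / len(pairs)
        return mean_conf - accuracy


if __name__ == "__main__":
    tracker = CalibrationTracker()
    tracker.record("phantom_vendor", 0.90, False)
    tracker.record("phantom_vendor", 0.60, True)
    print(tracker.overconfident_misses("phantom_vendor"))
```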


🎯 10 Tasks: Progressive Curriculum

| # | Task | What the Agent Faces | Difficulty |
|---|---|---|---|
| 1 | easy | Single clean invoice: extract 5 fields | 🟢 Easy |
| 2 | medium | Batch with date chaos, vendor typos, currency noise | 🟡 Medium |
| 3 | hard | Extraction + PO reconciliation: flag overcharges, missing items | 🟠 Hard |
| 4 | expert | Full fraud audit across all four fraud types | 🔴 Expert |
| 5 | adversarial | OCR corruption, SUBTOTAL traps, fake TAX/FX noise lines | 🔴 Expert |
| 6 | negotiate | Ask clarifying questions first (bonus for ≤2), then extract | 🟡 Medium |
| 7 | supply_chain | Detect quantity shortfalls, price spikes, phantom deliveries | 🔴 Expert |
| 8 | long_horizon | 20-step 4-phase investigation: extract → reconcile → audit → risk forecast | 🔴 Expert |
| 9 | personalized | Adapts to your weak fields: the next invoice always targets your worst category | 🔄 Adaptive |
| 10 | curriculum | Auto-progresses easy→medium→hard→expert based on score (≥0.80 to advance) | 🔄 Auto |

Dynamic difficulty also adjusts within each task via a rolling 10-episode score window: score above 0.85 → heavier OCR, more discrepancies, deeper traps; score below 0.60 → it eases off.
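
The rolling-window rule can be sketched with a bounded deque. The class name `DifficultyController`, the use of the window mean, and the integer 0–5 difficulty scale are assumptions made for this example; only the window length (10) and the 0.85 / 0.60 thresholds come from the description above.

```python
from collections import deque


class DifficultyController:
    """Raises or lowers a difficulty level from a rolling score window."""

    def __init__(self, window: int = 10, raise_above: float = 0.85,
                 lower_below: float = 0.60, max_level: int = 5) -> None:
        self.scores = deque(maxlen=window)  # keeps only the last `window` scores
        self.raise_above = raise_above
        self.lower_below = lower_below
        self.max_level = max_level          # hypothetical scale 0..max_level
        self.level = 0

    def update(self, score: float) -> int:
        """Record one episode score and return the new difficulty level."""
        self.scores.append(score)
        avg = sum(self.scores) / len(self.scores)
        if avg > self.raise_above:
            self.level = min(self.level + 1, self.max_level)
        elif avg < self.lower_below:
            self.level = max(self.level - 1, 0)
        return self.level


if __name__ == "__main__":
    ctrl = DifficultyController()
    for s in (0.90, 0.92, 0.95, 0.40, 0.30, 0.20):
        print(s, ctrl.update(s))
```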


📈 Training Results: GRPO on Live Environment

All 3 agents were trained with TRL GRPOTrainer + Unsloth, using the deployed HF Space as the live reward verifier: the /grader endpoint is the reward function during training.

Before vs After Training

| Agent | Untrained (random) | Qwen 72B baseline | After GRPO | Improvement |
|---|---|---|---|---|
| 🔍 Extractor | 0.10 | 0.67 | 0.914 | +814% vs random |
| 🕵️ Auditor | 0.01 | n/a | 0.52 live reward | Dead → active signal |
| ⚡ Generator | n/a | n/a | 0.22 plausibility | Format & realism learned |

Setup: Qwen2.5-1.5B-Instruct · 4-bit QLoRA r=16 · Unsloth + TRL · Google Colab A100

Extractor Reward Curve

Figure: Extractor training. X-axis: training step (1–20) · Y-axis: reward (0–1). Left: total GRPO reward across 4 independent signals (format 0.10 + field accuracy 0.40 + math 0.25 + completeness 0.25). Right: live /grader score peaking at 0.914, above the Qwen 72B baseline (0.67) and the untrained 1.5B baseline (0.46).

Auditor Reward Curve (Run 2, Bug Fixed)

Figure: Auditor training, Run 2. X-axis: training step (1–30) · Y-axis: reward (0–1). Total reward (blue) and live env reward (orange) with ±1 std band. Best total reward: 0.719 at step 10. Live env reward climbed from 0.01 (dead signal, Run 1) to 0.52 after fixing the TRL episode_id list indexing bug.

Generator Reward Curve

Figure: Generator training. X-axis: training step (1–30) · Y-axis: reward (0–1). Live evasion reward (red) stays flat near 0: Auditor+Approver caught all fraud attempts. Fraud plausibility reward (orange dashed) is stable at ~0.20, showing the Generator learned to produce realistic-looking invoices even without successful evasion.

🔍 Reward Hacking Caught at Step 10

At step 10 the model achieved math_consistency = 0.97 and completeness = 1.0 while field_accuracy = 0.00. It had learned to output arithmetically consistent JSON with entirely hallucinated values:

Step 10 - Reward Hacking Detected:
  format:            0.10  ✅
  math_consistency:  0.97  ✅ ← model gaming this signal
  completeness:      1.00  ✅ ← model gaming this signal
  field_accuracy:    0.00  ❌ ← hallucinating all values

  Action: adjusted training emphasis on field_accuracy weight
  Result: field_accuracy climbed to 0.30+ by step 30

Without 4 independent signals, a single aggregated reward would have called this success. Independent signals are diagnostics, not just incentives.
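
A check like the one below could surface this pattern automatically. It is a hypothetical helper, not the project's code: it assumes every signal has been normalised to [0, 1] (so format at its full 0.10 weight is passed as 1.0), and the 0.90 / 0.10 cut-offs are arbitrary illustration values.

```python
from typing import Dict, List, Union


def detect_reward_hacking(signals: Dict[str, float], high: float = 0.90,
                          low: float = 0.10) -> Dict[str, Union[bool, List[str]]]:
    """Flag steps where some signals are near-perfect while others are near zero.

    `signals` maps signal name -> score normalised to [0, 1] (assumption).
    """
    gamed = [name for name, value in signals.items() if value >= high]
    starved = [name for name, value in signals.items() if value <= low]
    return {
        "suspicious": bool(gamed and starved),
        "gamed": gamed,
        "starved": starved,
    }


if __name__ == "__main__":
    step_10 = {
        "format": 1.0,            # 0.10 out of a possible 0.10
        "math_consistency": 0.97,
        "completeness": 1.0,
        "field_accuracy": 0.0,
    }
    print(detect_reward_hacking(step_10))
```

With a single aggregated reward, the same step would only show a mid-range total and nothing to flag.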

Auditor Training: Run 2 (exact data)

| Step | Total Reward | Live Env Reward | ±Std |
|---|---|---|---|
| 5 | 0.4828 | 0.2828 | ±0.194 |
| 10 | 0.7188 | 0.5188 | ±0.239 |
| 15 | 0.4538 | 0.2538 | ±0.123 |
| 20 | 0.5733 | 0.3733 | ±0.212 |
| 25 | 0.5325 | 0.3325 | ±0.232 |
| 30 | 0.6038 | 0.4038 | ±0.147 |

Run 1 (dead signal): live env reward stayed flat at 0.010. TRL passes episode_id as a list, and the old code sent the whole list instead of indexing it per completion.
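
The shape of the fix can be sketched as follows. This assumes the reward function receives `completions` and an `episode_id` dataset column as parallel lists, as the bug description implies, and that completions are plain strings. `make_env_reward` and the `grade` callable (standing in for a request to /grader) are hypothetical names, not the notebook's actual code.

```python
from typing import Callable, List


def make_env_reward(grade: Callable[[str, str], float]):
    """Build a reward function that scores each completion against its own episode.

    `grade(episode_id, completion)` is a stand-in for a call to the live
    /grader endpoint; its name and signature are assumptions.
    """

    def env_reward(completions: List[str], episode_id: List[str],
                   **kwargs) -> List[float]:
        # The dead-signal bug corresponds to passing the whole `episode_id`
        # list to every call. Pairing by position gives one ID per completion.
        return [grade(eid, completion)
                for eid, completion in zip(episode_id, completions)]

    return env_reward


if __name__ == "__main__":
    seen = []

    def fake_grade(eid: str, completion: str) -> float:
        seen.append(eid)
        return 0.5

    reward_fn = make_env_reward(fake_grade)
    print(reward_fn(prompts=["p1", "p2"], completions=["a", "b"],
                    episode_id=["ep-1", "ep-2"]))
    print(seen)
```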


🎁 Reward Architecture

🔍 Extractor: 4 Independent Signals

reward_format(extracted)             # 0.10 - all 5 required JSON keys present?
reward_field_accuracy(extracted, gt) # 0.40 - vendor / date / currency / total match?
reward_math_consistency(extracted)   # 0.25 - qty × unit_price = amount per line?
reward_completeness(extracted, gt)   # 0.25 - all expected line items captured?

# All clamped to (0.01, 0.99): no log(0), no gradient collapse at boundaries
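
As a rough illustration of how the four signals might combine, the sketch below weights each normalised signal and clamps the total. Applying the clamp to the total rather than to each signal, and the names `WEIGHTS`, `clamp` and `extractor_reward`, are assumptions; the listing above does not say where the clamp is applied.

```python
from typing import Dict, Tuple

# Weights taken from the listing above.
WEIGHTS: Dict[str, float] = {
    "format": 0.10,
    "field_accuracy": 0.40,
    "math_consistency": 0.25,
    "completeness": 0.25,
}


def clamp(value: float, lo: float = 0.01, hi: float = 0.99) -> float:
    """Keep a reward strictly inside (0, 1)."""
    return max(lo, min(hi, value))


def extractor_reward(signals: Dict[str, float]) -> Tuple[float, Dict[str, float]]:
    """Return (clamped total, per-signal weighted breakdown).

    `signals` holds each signal normalised to [0, 1]; missing keys count as 0.
    """
    breakdown = {name: weight * signals.get(name, 0.0)
                 for name, weight in WEIGHTS.items()}
    return clamp(sum(breakdown.values())), breakdown


if __name__ == "__main__":
    total, parts = extractor_reward({
        "format": 1.0, "field_accuracy": 0.0,
        "math_consistency": 0.97, "completeness": 1.0,
    })
    print(round(total, 4), parts)
```

Returning the breakdown alongside the total keeps the per-signal values available for diagnostics.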

🕵️ Auditor

| Outcome | Reward | Why |
|---|---|---|
| Correct fraud type detected | 0.99 | Rewards precise classification, not just binary flagging |
| Clean invoice correctly approved | 0.90 | Keeps false-positive rate honest |
| Compound fraud, one of two types caught | 0.65 | Partial credit prevents cliff on hard cases |
| Fraud flagged but wrong type | 0.50 | Penalises sloppiness; rewards catching something |
| Miss or false positive | 0.01 | Near-zero punishes both failure modes symmetrically |
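
The table can be expressed as a small lookup over sets of fraud types. This is a sketch, not the grader itself: `auditor_reward` is a hypothetical name, and scoring a prediction that contains every true type plus extra ones as fully correct is an assumption the table does not settle.

```python
from typing import Set


def auditor_reward(true_types: Set[str], predicted_types: Set[str]) -> float:
    """Score one Auditor verdict using the reward values from the table above."""
    if not true_types:
        # Clean invoice: cleared correctly, or a false positive.
        return 0.90 if not predicted_types else 0.01
    if not predicted_types:
        return 0.01  # fraud missed entirely
    hits = true_types & predicted_types
    if hits == true_types:
        return 0.99  # every true fraud type identified
    if hits:
        return 0.65  # compound fraud, only part of it caught
    return 0.50      # flagged, but with the wrong type


if __name__ == "__main__":
    compound = {"phantom_vendor", "price_gouging"}
    print(auditor_reward(compound, {"phantom_vendor"}))
```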

⚡ Generator (Adversarial Self-Play)

| Outcome | Reward |
|---|---|
| Fraud evades both Auditor and Approver | 0.85 |
| Auditor misses, Approver catches | 0.60 |
| Auditor catches it | 0.10 |
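
In code form, the Generator's payoff depends only on which stage stops the invoice. The sketch below assumes that an Approver decision of "reject" or "escalate" both count as the Approver catching the fraud; the function name is hypothetical.

```python
def generator_reward(auditor_flagged: bool, approver_decision: str) -> float:
    """Adversarial reward for one fraudulent invoice (values from the table above)."""
    if auditor_flagged:
        return 0.10  # Auditor caught it
    if approver_decision == "approve":
        return 0.85  # evaded both Auditor and Approver
    return 0.60      # Auditor missed, Approver caught (reject or escalate, assumed)


if __name__ == "__main__":
    print(generator_reward(False, "approve"))
    print(generator_reward(False, "reject"))
    print(generator_reward(True, "reject"))
```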

🎯 Regulator: Cross-Episode

Total = 0.35 × Precision + 0.35 × Recall + 0.15 × No-over-flagging + 0.15 × Early-warning bonus

The early-warning bonus rewards predictions of emerging blind spots made before detection rates cross the critical threshold.
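
As a worked example of the weighted sum, the sketch below assumes each component is already normalised to [0, 1]; the function name and argument names are hypothetical.

```python
def regulator_reward(precision: float, recall: float,
                     no_over_flagging: float, early_warning: float) -> float:
    """Weighted sum of the four Regulator components (each assumed in [0, 1])."""
    return (0.35 * precision
            + 0.35 * recall
            + 0.15 * no_over_flagging
            + 0.15 * early_warning)


if __name__ == "__main__":
    # Perfect precision, half the blind spots found, no over-flagging,
    # no early warning issued.
    print(round(regulator_reward(1.0, 0.5, 1.0, 0.0), 3))
```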


🧠 Trained LoRA Agents

| Agent | Base Model | LoRA Config | HuggingFace Hub |
|---|---|---|---|
| 🔍 Extractor | Qwen2.5-1.5B-Instruct | r=16, α=16, 4-bit QLoRA | ps2181/extractor-lora-qwen2.5-1.5b |
| 🕵️ Auditor | Qwen2.5-1.5B-Instruct | r=16, α=16, 4-bit QLoRA | ps2181/auditor-lora-qwen2.5-1.5b |
| ⚡ Generator | Qwen2.5-1.5B-Instruct | r=16, α=16, 4-bit QLoRA | ps2181/generator-lora-qwen2.5-1.5b |

LoRA target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj


The Regulator in Action

After each episode, the Regulator publishes a report the Generator uses to bias its next batch:

GET /regulator/report

{
  "total_audits_recorded": 20,
  "detection_rates": {
    "phantom_vendor":        "31%  ⚠ BLIND SPOT (-0.08↓)",
    "price_gouging":         "74%  ✓ OK (+0.03↑)",
    "math_fraud":            "81%  ✓ OK (+0.01↑)",
    "duplicate_submission":  "62%  ⚡ EMERGING (-0.02↓)"
  },
  "blind_spots": ["phantom_vendor"],
  "emerging_blind_spots": ["duplicate_submission"],
  "generator_weights": {
    "phantom_vendor":       0.30,    ← 3× upweighted (blind spot)
    "duplicate_submission": 0.20,    ← 2× upweighted (emerging)
    "price_gouging":        0.125,
    "math_fraud":           0.125,
    "compound_fraud":       0.10
  },
  "verdict": "Recommend retraining on: phantom_vendor"
}
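
To show how weights like these could bias the next batch, the sketch below samples fraud types with `random.choices`. The sampling scheme and the helper name `pick_fraud_type` are assumptions, not the Generator's confirmed logic. The weights in the sample report sum to 0.85, and `random.choices` treats them as relative weights, so no renormalisation is needed.

```python
import random
from collections import Counter
from typing import Dict

# Values copied from the sample report above.
GENERATOR_WEIGHTS: Dict[str, float] = {
    "phantom_vendor": 0.30,
    "duplicate_submission": 0.20,
    "price_gouging": 0.125,
    "math_fraud": 0.125,
    "compound_fraud": 0.10,
}


def pick_fraud_type(weights: Dict[str, float], rng: random.Random) -> str:
    """Draw one fraud type with probability proportional to its weight."""
    types = list(weights)
    return rng.choices(types, weights=[weights[t] for t in types], k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    draws = Counter(pick_fraud_type(GENERATOR_WEIGHTS, rng) for _ in range(1000))
    print(draws.most_common())
```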

🎭 Sample Multi-Agent Episode

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  MULTI-AGENT PIPELINE  ·  LIVE EPISODE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

  🎯  REGULATOR  (30-episode rolling window)
  ────────────────────────────────────────────────
  phantom_vendor     31%  ⚠ BLIND SPOT  ← prioritised 60%
  price_gouging      74%  ✓ OK
  math_fraud         81%  ✓ OK
  duplicate          62%  ✓ OK

  ⚡  GENERATOR  (Qwen2.5 LoRA)
  ────────────────────────────────────────────────
  Fraud focus : phantom_vendor (60% Regulator weight)
  Vendor      : ShadowByte Technologies  ← not in registry

  🔍  EXTRACTOR  (Qwen2.5 LoRA)
  ────────────────────────────────────────────────
  Reward : 0.847  [format 0.10 · field 0.38 · math 0.25 · completeness 0.12]

  🕵️  AUDITOR  (Qwen2.5 LoRA)
  ────────────────────────────────────────────────
  INV-85529  →  🚨 FLAGGED  [PHANTOM VENDOR]  conf=0.91
  INV-85530  →  ✅ APPROVED                   conf=0.88

  ✅  APPROVER
  ────────────────────────────────────────────────
  INV-85529  →  ❌ REJECT
  Generator reward : 0.60  (evaded Auditor on 1/3, Approver caught)

  🎯  REGULATOR UPDATE
  ────────────────────────────────────────────────
  phantom_vendor detection: 31% → 45%  ↑ improving
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

🚀 Quick Start

# Health check
curl https://ps2181-invoice-processing-pipeline.hf.space/health

# Environment-wide metrics
curl https://ps2181-invoice-processing-pipeline.hf.space/metrics

# Auto-progressive curriculum episode
curl -X POST https://ps2181-invoice-processing-pipeline.hf.space/reset \
  -H "Content-Type: application/json" -d '{"task_id": "curriculum"}'

# Start multi-agent episode
curl -X POST https://ps2181-invoice-processing-pipeline.hf.space/multi/reset

# Regulator blind spot report
curl https://ps2181-invoice-processing-pipeline.hf.space/regulator/report

Run Training (Google Colab)

Open In Colab

Colab → /reset (fresh synthetic invoice from live environment)
      → model generates JSON
      → /grader scores against ground truth
      → GRPO updates weights toward higher-reward completions
      → repeat 200 steps

🗂️ Repository Structure

invoice-processing-pipeline/
│
├── server/
│   ├── app.py                      # FastAPI - 18 endpoints
│   ├── environment.py              # 10 tasks · graders · dynamic difficulty
│   ├── multi_agent_environment.py  # 5-agent system + AuditorPerformanceTracker
│   ├── agents.py                   # Lazy-loading LoRA inference wrappers
│   └── web_ui.py                   # Gradio UI (mounted at /web)
│
├── models.py                       # Pydantic: Action · Observation · State
├── inference.py                    # Standalone inference helper
├── client.py                       # OpenEnv-compatible Python client
│
├── extractor_training_grpo.ipynb   # 🔥 Extractor GRPO training (Unsloth + TRL)
├── auditor_grpo_training.ipynb     # 🔥 Auditor GRPO training
├── generator_grpo_training.ipynb   # 🔥 Generator GRPO training
│
├── assets/
│   ├── reward_curve.png            # Extractor training curve
│   ├── auditor_reward_curve_run2.png
│   └── generator_reward_curve.png
│
├── openenv.yaml                    # OpenEnv manifest (all tasks declared)
├── Dockerfile                      # HF Spaces Docker (port 7860, non-root UID 1000)
├── pyproject.toml                  # Project metadata + dependencies
├── requirements.txt                # Runtime dependencies
├── validate-submission.sh          # Submission validator script
├── BLOG.md                         # HuggingFace blog post
└── ROUND2_PROBLEM_STATEMENT.md     # Full problem statement + reward design rationale

🔌 API Reference

Core OpenEnv

| Endpoint | Method | Description |
|---|---|---|
| /health | GET | Health check → {"status": "ok", "active_sessions": N} |
| /tasks | GET | All tasks with descriptions, schemas, difficulty levels |
| /metrics | GET | Per-task episode counts, avg/best scores, Regulator state |
| /reset | POST | Start episode {"task_id": "easy\|medium\|...\|curriculum"} |
| /step | POST | Submit extraction → reward + feedback + hint + reward_breakdown |
| /grader | POST | Score without consuming an attempt (training reward signal) |
| /state | GET | Episode metadata: step_count, done, best_reward, history |
| /ws | WS | Full episode over WebSocket (OpenEnv standard) |
| /web | GET | Gradio interactive demo UI |

Multi-Agent

| Endpoint | Method | Description |
|---|---|---|
| /multi/reset | POST | Start 5-agent episode, with the Generator biased by Regulator weights |
| /multi/extract | POST | Score Extractor output (4 independent signals) |
| /multi/audit | POST | Score Auditor output, update 30-episode performance tracker |
| /multi/approve | POST | Run Approver, compute Generator adversarial reward |
| /multi/state/{id} | GET | Full episode state including all agent scores |
| /generator/score | POST | Direct Generator scoring through Auditor+Approver pipeline |

Regulator

| Endpoint | Method | Description |
|---|---|---|
| /regulator/report | GET | Detection rates, blind spots, calibration, generator weights |
| /regulator/forecast | GET | Trend slopes + emerging blind spot warnings with episode countdown |
| /regulator/calibration | GET | Overconfidence / underconfidence per fraud type |
| /regulator/predict | POST | Score a Regulator blind-spot prediction |
| /regulator/demo_seed | POST | Seed tracker with realistic demo data |

🏗️ Tech Stack

| Layer | Technology |
|---|---|
| Environment | OpenEnv · FastAPI · Pydantic v2 |
| UI | Gradio 4.x (mounted at /web) |
| Deployment | Docker · HuggingFace Spaces (vcpu-2 / 8 GB) |
| Training | TRL GRPOTrainer · Unsloth |
| Model | unsloth/Qwen2.5-1.5B-Instruct · 4-bit QLoRA · r=16 · A100 |
| Reward | Live /grader endpoint on HF Space as verifier |
| Session Mgmt | Thread-safe OrderedDict · 200-session cap · LRU eviction |
| Dynamic Difficulty | Per-task rolling window (maxlen=10) → adjusts OCR intensity, batch size, discrepancy count |
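
The session-management row describes a familiar pattern, which can be sketched as below. The class `SessionStore` and its methods are hypothetical names for illustration; only the ingredients (a lock, an `OrderedDict`, a 200-session cap, LRU eviction) come from the table.

```python
import threading
from collections import OrderedDict
from typing import Any, Optional


class SessionStore:
    """Thread-safe session map that evicts the least recently used entry."""

    def __init__(self, max_sessions: int = 200) -> None:
        self._lock = threading.Lock()
        self._sessions: "OrderedDict[str, Any]" = OrderedDict()
        self._max = max_sessions

    def put(self, session_id: str, state: Any) -> None:
        with self._lock:
            self._sessions[session_id] = state
            self._sessions.move_to_end(session_id)   # mark as most recent
            while len(self._sessions) > self._max:
                self._sessions.popitem(last=False)   # drop the oldest entry

    def get(self, session_id: str) -> Optional[Any]:
        with self._lock:
            state = self._sessions.get(session_id)
            if state is not None:
                self._sessions.move_to_end(session_id)  # a read refreshes recency
            return state

    def __len__(self) -> int:
        with self._lock:
            return len(self._sessions)


if __name__ == "__main__":
    store = SessionStore(max_sessions=2)
    store.put("a", 1)
    store.put("b", 2)
    store.get("a")        # "a" is now the most recently used
    store.put("c", 3)     # evicts "b"
    print(store.get("b"), store.get("a"), len(store))
```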

🎭 Theme Alignment

| Theme | Alignment | Evidence |
|---|---|---|
| #4 Self-Improvement (primary) | ✅ Core | Regulator detects blind spots → Generator biases toward them → Auditor improves → loop repeats |
| #1 Multi-Agent Interactions | ✅ Core | 5 agents with conflicting incentives: Generator vs Auditor adversarial self-play |
| #1 Fleet AI Scalable Oversight | ✅ Bonus | Regulator monitors Auditor cross-episode with predictive trend detection |
| #3.1 Professional Tasks | ✅ Core | Invoice + PO + vendor registry + supply chain = real enterprise AP workflow |
| #2 Long-Horizon Planning | ✅ Partial | long_horizon task: 20-step 4-phase investigation with multi-turn state |

👥 Team

| Pritam Satpathy | Gnana Nawin T |
|---|---|
| 🤗 ps2181 | 🤗 gnananawin |
| Scaler School of Technology | Scaler School of Technology |

Meta PyTorch OpenEnv Hackathon: Grand Finale · April 25–26, 2026 · Bangalore




Built with ❤️ for the Meta PyTorch OpenEnv Hackathon 2026

"The system that gets harder for itself, so the agent never stops learning."