Spaces: ps2181/invoice-processing-pipeline


ps2181 committed on Apr 25

Commit

bbe2575

1 Parent(s): 95d36d4

Update BLOG.md and README.md

Files changed (2)

BLOG.md +139 -188
README.md +458 -183

BLOG.md CHANGED

@@ -1,276 +1,227 @@
-# Invoice Processing Pipeline — Multi-Agent RL Environment for Financial Fraud Detection
-**Meta PyTorch OpenEnv Hackathon Grand Finale | April 25–26, 2026**
-**Team: Pritam Satpathy + Gnana Nawin T**
----
-## The Problem
-Invoice fraud costs businesses an estimated 5% of annual revenue. Finance teams manually process thousands of invoices every month — extracting vendor names, dates, line items, totals — and checking them against purchase orders for discrepancies. The work is slow (hours per batch), error-prone (typos, OCR noise, format chaos), and gameable (phantom vendors, price gouging, duplicate submissions).
-We built an RL training environment that teaches LLMs to do this automatically — and improves itself when it discovers its own blind spots.
----
-## What We Built
-An OpenEnv-compatible environment deployed on HuggingFace Spaces:
-**[https://ps2181-invoice-processing-pipeline.hf.space](https://ps2181-invoice-processing-pipeline.hf.space)**
 ---
-## Architecture: 5-Agent System
-```
-          ┌─────────────────────────────────────────────────────────┐
-          │              ADVERSARIAL REWARD (dashed)                │
-          │                                                         │
-          ▼                                                         │
-  ┌───────────────┐                                                 │
-  │   Generator    │◄───── Regulator biases fraud type ◄────┐       │
-  │ Creates fraud  │                                        │       │
-  └───────┬───────┘                                        │       │
-          │ Raw invoice text                                │       │
-          ▼                                                 │       │
-  ┌───────────────┐                                        │       │
-  │   Extractor    │                                        │       │
-  │ Text → JSON    │                                        │       │
-  └───────┬───────┘                                        │       │
-          │ Structured data                          ┌─────┴─────┐ │
-          ▼                                          │ Regulator  │ │
-  ┌───────────────┐                                  │ Cross-     │ │
-  │    Auditor     │────── decision history ────────►│ episode    │ │
-  │ Fraud detect   │                                  │ meta-agent │ │
-  └───────┬───────┘                                  └───────────┘ │
-          │ Verdict + flags                                         │
-          ▼                                                         │
-  ┌───────────────┐                                                 │
-  │   Approver     │────────────────────────────────────────────────┘
-  │ Approve/reject │
-  └───────┬───────┘
-          │
-          ▼
-  ┌──────────────────────────────────────┐
-  │  4 Independent Reward Signals        │
-  │  Format · Field · Math · Completeness│
-  └──────────────────────────────────────┘
-```
-| Agent | Role | Reward Signal |
-|-------|------|---------------|
-| **Generator** | Creates clean or fraudulent invoices | Rewarded when fraud slips past Auditor (adversarial self-play) |
-| **Extractor** | Reads raw invoice text → structured JSON | 4 independent signals: format, field accuracy, math consistency, completeness |
-| **Auditor** | Reviews extraction, flags fraud | +0.99 correct detection, +0.90 clean clearance, 0.01 for miss/false positive |
-| **Approver** | Final approve/reject/escalate decision | +0.95 correct decision |
-| **Regulator** | Monitors Auditor blind spots across episodes | Precision + recall of blind spot predictions |
----
-## The Key Innovation: The Regulator
-The Regulator is a cross-episode meta-agent — it watches the Auditor's decision history over 30 episodes and identifies systematic failure patterns:
-```
-AUDITOR PERFORMANCE TRACKER (last 30 episodes)
-Fraud Type            Detection Rate
-─────────────────────────────────────
-phantom_vendor        31%   ⚠ BLIND SPOT
-price_gouging         74%   ✓ OK
-math_fraud            81%   ✓ OK
-duplicate_submission  62%   ✓ OK
-False Positive Rate:  12%   ✓ OK
-REGULATOR VERDICT: Recommend retraining on phantom_vendor
-```
-When the Regulator detects a blind spot, the Generator automatically starts producing more of that fraud type — closing the self-improvement loop without human intervention.
-This directly addresses **Theme #1 (Fleet AI Scalable Oversight)** and **Theme #4 (Self-Improvement)**.
----
-## 7 Tasks (Progressive Difficulty)
-| Task | Difficulty | What the Agent Does |
-|------|-----------|---------------------|
-| `easy` | Easy | Extract fields from a single clean invoice |
-| `medium` | Medium | Clean + normalise a batch of messy invoices (typos, date chaos, currency symbols) |
-| `hard` | Hard | Extract + reconcile against purchase orders, flag discrepancies |
-| `expert` | Expert | Fraud audit: classify phantom_vendor / price_gouging / math_fraud / duplicate_submission |
-| `adversarial` | Hard | Extract from OCR-corrupted invoice with SUBTOTAL trap and FX noise lines |
-| `negotiate` | Medium | Ask clarification questions then submit extraction (bonus for ≤2 questions) |
-| `supply_chain` | Expert | Detect quantity shortfalls, price spikes, phantom deliveries in delivery records |
 ---
-## Design Decisions
-### 4 Independent Reward Functions (Anti-Hacking)
-Per the hackathon guide: *"use multiple independent reward functions — if you only have one, it is easier for the model to hack it."*
-```python
-format_reward()       # Are all 5 required JSON keys present?       weight: 0.10
-field_reward()        # Do vendor/date/currency/total match?         weight: 0.40
-math_reward()         # Does qty × unit_price = amount for all items? weight: 0.25
-completeness_reward() # Are all line items present (recall)?          weight: 0.25
-```
-During training we observed the model maximising `math_reward` (0.97) and `completeness_reward` (1.0) while `field_reward` stayed at 0.00 — the model learned to output arithmetic-consistent JSON while hallucinating values. Our independent signals made this reward hacking immediately visible, confirming the design choice.
-### Adversarial Self-Play
-The Generator is rewarded when its fraud evades the Auditor:
-- Fraud undetected, Approver approves → Generator reward: **0.85**
-- Auditor missed but Approver caught → Generator reward: **0.60**
-- Auditor caught it → Generator reward: **0.10**
-This creates evolutionary pressure: the Generator evolves harder-to-detect fraud, forcing the Auditor to improve.
-### Dynamic Difficulty
-The environment tracks recent agent scores per task (rolling window of 10 episodes) and adjusts generation parameters:
-- Agent scoring ≥ 0.85 → harder parameters (more invoices, more OCR noise, more discrepancies)
-- Agent scoring < 0.60 → easier parameters
-- In between → standard
-### All Rewards Clamped to (0.01, 0.99)
-Avoids `log(0)` in policy gradient and prevents the model from getting stuck at boundaries.
 ---
-## Tech Stack
-```
-Environment:  FastAPI + OpenEnv-core + Pydantic
-Deployment:   HuggingFace Spaces (Docker, port 7860)
-UI:           Gradio (mounted at /web)
-Training:     TRL GRPOTrainer + Unsloth (Qwen2.5-1.5B-Instruct, 4-bit QLoRA)
-Model:        unsloth/Qwen2.5-1.5B-Instruct  r=16 LoRA
-Reward:       4 local signals + live /grader endpoint on HF Space
-```
----
-## Training Setup
-GRPO (Group Relative Policy Optimization) with:
-- `num_generations = 4` — 4 completions per prompt, compared within group
-- `max_steps = 200`
-- `learning_rate = 5e-6`
-- Live `/grader` endpoint on HF Space as environment verifier
-The training loop:
 ```
-Colab samples episode → HF Space /reset → gets live invoice
-Model generates JSON extraction
-HF Space /grader scores it against ground truth
-GRPO updates model toward higher-scoring completions
 ```
----
-## What Worked (Achievements)
-### 1. Reward Hacking Detection — Caught at Step 10
-The independent reward signals caught a classic reward hacking pattern immediately. The model maximised math and completeness while hallucinating field values. Without 4 independent signals, this would have been invisible behind a rising aggregate reward.
-| Step | Total Reward | Env Score | Format | Math |
-|------|-------------|-----------|--------|------|
-| 10   | 2.361       | 0.113     | 0.900  | 0.347 |
-| 20   | 2.595       | 0.282     | 0.900  | 0.413 |
-| 30   | 2.657       | 0.304     | 0.950  | 0.403 |
-Environment score rose **0.113 → 0.304 in 30 steps** — a 169% improvement in correct invoice extraction as scored by the live environment grader.
-### 2. Live Environment as Verifier
-Training Colab directly calls `/grader` on the deployed HF Space — the environment IS the reward function. No separate reward model. Deterministic and reproducible.
-### 3. Regulator Concept Validated
-The cross-episode tracking logic works: the Regulator correctly identifies `phantom_vendor` as the Auditor's weakest category and triggers Generator bias toward that fraud type. No other OpenEnv environment we've seen implements a cross-episode meta-agent.
-### 4. Full 7-Task Ladder Deployed
-All 7 tasks are live on the HF Space with independent graders, schemas, and difficulty calibration. The progressive structure directly supports curriculum learning.
-### 5. Clean OpenEnv API Compliance
-Standard `reset()` / `step()` / `state()` interface, WebSocket support, Swagger docs at `/docs`, Gradio UI at `/web`. Drop-in compatible with any OpenEnv training script.
----
-## Where We're Having Problems (Honest Assessment)
-### 1. Field Reward Plateau
-The `field_reward` (vendor name, date, currency, total accuracy) remains the hardest signal for the 1.5B model to crack. Even at step 30, the environment score is 0.304 — meaning the model still hallucinates field values despite correct structure and math. We suspect this is a model capacity issue: Qwen2.5-1.5B may not have enough parameters to learn extraction patterns from raw OCR text in 200 steps.
-**Potential fix:** Switching to Qwen2.5-7B or adding a light SFT warmup phase with 50–100 correct extraction examples before RL.
-### 2. Multi-Agent Coordination Not Yet Trained End-to-End
-The 5-agent architecture is designed and the environment supports it, but we haven't yet run the full adversarial training loop (Generator vs Auditor) end-to-end with GRPO. Currently, the Extractor is trained in isolation. The Regulator logic runs as environment-side code, not as a trainable agent.
-**Potential fix:** Implementing a two-phase training loop — Phase 1: train Extractor on easy/medium, Phase 2: train Auditor against Generator with Regulator feedback.
-### 3. Compute Constraints
-4-bit QLoRA on a Colab T4 limits batch sizes and generation counts. With `num_generations=4`, each step is slow enough that we couldn't push past ~50 steps in the available time. The reward curves are trending upward but haven't converged.
-**Potential fix:** Onsite compute credits (HF GPU Spaces) should allow `num_generations=8` and 500+ steps.
-### 4. OCR Noise Robustness
-The `adversarial` task (trap-resistant extraction with SUBTOTAL/FX noise lines) works as an environment, but the model hasn't been trained on it yet. Early inference tests show the model consistently falls for fake SUBTOTAL lines.
-**Potential fix:** Adding adversarial examples to the curriculum after the model achieves ≥0.60 on `medium`.
 ---
-## What Makes This Novel
-1. **Regulator agent** — no other OpenEnv environment has a cross-episode meta-agent that monitors another agent for systematic cognitive blind spots
-2. **Closed self-improvement loop** — Regulator detects blind spot → Generator biases fraud generation toward that type → Auditor forced to improve → no human intervention required
-3. **Adversarial Generator arms race** — Generator rewarded for evading Auditor creates evolutionary pressure on fraud detection
-4. **Live environment as verifier** — training Colab directly calls `/grader` on deployed HF Space — the environment IS the reward function
-5. **4 independent reward signals** — made reward hacking immediately visible during training (detected it at step 10)
----
-## Theme Alignment
-| Theme | Alignment |
-|-------|-----------|
-| **#1 Multi-Agent** | 5 agents with conflicting incentives (Generator vs Auditor) |
-| **#1 Sub: Fleet AI Oversight** (bonus) | Regulator monitors Auditor cross-episode |
-| **#3.1 Professional Tasks** | Invoice processing = core enterprise workflow |
-| **#3.1 Sub: Scaler AI Labs** (bonus) | Multi-agent RL for enterprise financial workflows |
-| **#4 Self-Improvement** | Generator adapts based on Regulator blind spot findings |
 ---
-## Links
-- **Live Environment:** [https://ps2181-invoice-processing-pipeline.hf.space](https://ps2181-invoice-processing-pipeline.hf.space)
-- **Gradio UI:** [https://ps2181-invoice-processing-pipeline.hf.space/web](https://ps2181-invoice-processing-pipeline.hf.space/web)
-- **API Docs:** [https://ps2181-invoice-processing-pipeline.hf.space/docs](https://ps2181-invoice-processing-pipeline.hf.space/docs)
-- **GitHub:** [https://github.com/ps2181/invoice-processing-pipeline](https://github.com/ps2181/invoice-processing-pipeline)
----
-## Team
-**Pritam Satpathy** + **Gnana Nawin T**
-Meta PyTorch OpenEnv Hackathon Grand Finale
-Scaler School of Technology, Bangalore — April 25–26, 2026

+<div align="center">
+# When the System Learns to Pressure-Test Itself
+**How we built a 5-agent adversarial RL environment that detects invoice fraud —**
+**and automatically gets harder when it finds its own blind spots.**
+<br/>
+*Meta PyTorch OpenEnv Hackathon · Grand Finale · April 25–26, 2026*
+*Pritam Satpathy & Gnana Nawin T · Scaler School of Technology, Bangalore*
+</div>
 ---
+## The Problem Nobody Talks About
+Invoice fraud is boring to talk about and devastating in practice.
+It costs businesses an estimated **5% of annual revenue**, and it doesn't announce itself — it hides in purchase order line items, disguised as rounding errors, vendor name typos, and suspiciously round numbers that only look wrong if you already know what to look for.
+Finance teams today catch it manually. They compare thousands of invoices against purchase orders, cross-reference vendor registries, and flag anything that smells off. It's slow, it's error-prone, and critically — **it doesn't improve**. A human who misses phantom vendor fraud on Monday is statistically likely to miss it again on Friday.
+We asked a different question:
+> *What if you could build an LLM system that not only detects fraud, but also gets better at the exact fraud types it is currently missing, automatically and without a human in the retraining loop?*
+That's what we built.
+---
+## The Core Idea: Make the System Pressure-Test Itself
+Most multi-agent RL setups have agents that operate independently within a single episode. Ours doesn't.
+We added a **cross-episode Regulator** — an agent that watches the Auditor across 30 rolling episodes, tracks which fraud types it's systematically missing, and quietly biases the Generator to produce more of those exact scenarios.
+No human decides *"let's train more on phantom vendors."* The Regulator notices the detection rate for phantom vendors is at `31%` and trending downward, raises the alarm, and tells the Generator to send more phantom vendor invoices. **The loop closes itself.**
+<div align="center">
+<img width="1710" height="326" alt="image" src="https://github.com/user-attachments/assets/319654c3-aa24-47e8-9716-734d4e902168" />
+</div>
+The Auditor sees more of exactly what it's failing on. The Generator gets rewarded for finding those gaps. The Regulator earns points for predicting blind spots *before* they go critical. Every agent has skin in the game.
 ---
+## Five Agents, One Closed Loop
+<div align="center">
+| Agent | Role | Reward Signal |
+|:---:|:---|:---|
+| **Generator** | Creates clean or fraudulent invoices, biased by Regulator's blind-spot weights | `+0.85` evades both · `+0.60` evades Auditor · `+0.10` caught |
+| **Extractor** | Raw invoice text → structured JSON | format `0.10` · field accuracy `0.40` · math `0.25` · completeness `0.25` |
+| **Auditor** | Fraud classification with fraud type + confidence score | `+0.99` correct type · `+0.90` clean cleared · `+0.01` miss or false positive |
+| **Approver** | Final approve / escalate / reject, gated by confidence | `≥0.80` → reject · `0.50–0.80` → escalate · `<0.50` → approve |
+| **Regulator** | Cross-episode meta-agent, 30-episode rolling window | precision `0.35` + recall `0.35` + no over-flagging `0.15` + early warning `0.15` |
+</div>
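The Approver thresholds and Generator payouts in the table can be read as two small decision functions. The sketch below is illustrative only: it assumes the gating confidence is the Auditor's confidence that an invoice is fraudulent, and the function names `approver_decision` and `generator_reward` are hypothetical, not taken from the repository.

```python
def approver_decision(fraud_confidence: float) -> str:
    """Map a fraud confidence in [0, 1] to a final decision (thresholds from the table)."""
    if fraud_confidence >= 0.80:
        return "reject"
    if fraud_confidence >= 0.50:
        return "escalate"
    return "approve"


def generator_reward(auditor_flagged: bool, decision: str) -> float:
    """Adversarial payout for one fraudulent invoice (values from the table)."""
    if auditor_flagged:
        return 0.10  # the Auditor caught the fraud
    if decision == "approve":
        return 0.85  # the fraud evaded both the Auditor and the Approver
    return 0.60      # the Auditor missed it, but the Approver did not approve


if __name__ == "__main__":
    for confidence in (0.91, 0.65, 0.20):
        print(confidence, approver_decision(confidence))
    print(generator_reward(auditor_flagged=False, decision="approve"))
```

Treating `0.80` as a reject and `0.50` as an escalate is a boundary choice made for this sketch; the table does not say which side the endpoints fall on.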
+The **Regulator** is the part that makes this genuinely different. Most RL environments treat each episode as independent. The Regulator sits outside that — accumulating detection rates, computing trend slopes over 5-episode windows, and warning of *emerging* blind spots before they go critical. It's proactive oversight, not reactive retraining.
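A minimal sketch of that bookkeeping for a single fraud type might look like the following. The 30-episode window and the 5-episode trend window come from the post; the `DetectionTracker` class, the least-squares slope, and the `BLIND_SPOT_RATE` threshold are assumptions made for illustration, since the post does not spell out how the slope or the critical threshold is computed.

```python
from collections import deque

WINDOW = 30             # rolling window size, from the post
SLOPE_WINDOW = 5        # trend window size, from the post
BLIND_SPOT_RATE = 0.50  # hypothetical critical threshold, not from the post


class DetectionTracker:
    """Tracks Auditor hits and misses for one fraud type across episodes."""

    def __init__(self) -> None:
        # 1.0 = fraud detected, 0.0 = fraud missed; old episodes fall off the left.
        self.outcomes = deque(maxlen=WINDOW)

    def record(self, detected: bool) -> None:
        self.outcomes.append(1.0 if detected else 0.0)

    def detection_rate(self) -> float:
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 0.0

    def trend_slope(self) -> float:
        """Least-squares slope of the outcomes over the last SLOPE_WINDOW episodes."""
        ys = list(self.outcomes)[-SLOPE_WINDOW:]
        n = len(ys)
        if n < 2:
            return 0.0
        x_mean = (n - 1) / 2
        y_mean = sum(ys) / n
        num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
        den = sum((x - x_mean) ** 2 for x in range(n))
        return num / den

    def status(self) -> str:
        if self.detection_rate() < BLIND_SPOT_RATE:
            return "blind_spot"
        if self.trend_slope() < 0:
            return "emerging"  # still above threshold, but getting worse
        return "ok"
```

The "emerging" state is what separates an early warning from a reactive alarm: the rate is still acceptable, but the recent slope is negative.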
+---
+## Seven Tasks, One Curriculum
+<div align="center">
+| # | Task | What the Agent Faces | Difficulty |
+|:---:|:---|:---|:---:|
+| 1 | `easy` | Single clean invoice — extract 5 fields | Easy |
+| 2 | `medium` | Batch with date chaos, vendor typos, currency noise | Medium |
+| 3 | `hard` | Extraction + PO reconciliation — flag overcharges, missing items | Hard |
+| 4 | `expert` | Full fraud audit across all four fraud types | Expert |
+| 5 | `adversarial` | OCR corruption, SUBTOTAL traps, fake TAX/FX noise lines | Expert |
+| 6 | `negotiate` | Ask clarifying questions first (bonus for ≤2), then extract | Medium |
+| 7 | `supply_chain` | Detect quantity shortfalls, price spikes, phantom deliveries | Expert |
+</div>
+The difficulty also adjusts **dynamically** based on the agent's rolling score. Score above `0.85`? The next batch gets heavier OCR corruption, more PO discrepancies, deeper adversarial traps. Drop below `0.60`? It eases off. The agent is always working at its productive edge.
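The adjustment rule can be sketched as a small controller. The `0.85` and `0.60` thresholds are from the paragraph above, and the 10-episode window is the figure given in the removed version of this post; the `DifficultyController` name, the use of a plain mean as the "rolling score", and the tier labels are assumptions.

```python
from collections import deque


class DifficultyController:
    """Chooses a generation tier from the agent's recent scores on one task."""

    def __init__(self, window: int = 10) -> None:
        self.scores = deque(maxlen=window)

    def record(self, score: float) -> None:
        self.scores.append(score)

    def tier(self) -> str:
        if not self.scores:
            return "standard"  # no history yet, so start in the middle
        rolling = sum(self.scores) / len(self.scores)
        if rolling > 0.85:
            return "harder"    # e.g. more OCR corruption and PO discrepancies
        if rolling < 0.60:
            return "easier"
        return "standard"


if __name__ == "__main__":
    controller = DifficultyController()
    for score in (0.90, 0.92, 0.88):
        controller.record(score)
    print(controller.tier())
```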
 ---
+## The Part Where We Caught Our Own Reward Hacking
+This was the most interesting moment in the project.
+At training step 10, we had:
 ```
+math_consistency:   0.97
+completeness:       1.00
+field_accuracy:     0.00  :(  ← hallucinating every actual value
 ```
+The model had figured out that it could score well by outputting JSON that was *arithmetically correct* — quantities times unit prices summed to the totals perfectly — while **hallucinating every actual value**. Vendor name: made up. Date: made up. Currency: made up. All internally consistent. All completely wrong.
+This is reward hacking. A single aggregated reward would have happily reported high performance and called it a day.
+Our four **independent** reward signals made the failure immediately visible. We could see exactly which signal the model had learned to game and which it was ignoring.
+> **That's the entire argument for independent reward functions: not just diversity, but diagnosability.**
+We adjusted training emphasis. By step 30, field accuracy had climbed from `0.00` to `0.30+` while math consistency stayed stable.
+<div align="center">
+| Step | Total Reward | Env Score | Format | Math Consistency |
+|:---:|:---:|:---:|:---:|:---:|
+| 10 | 2.361 | 0.113 | 0.900 | 0.347 |
+| 20 | 2.595 | 0.282 | 0.900 | 0.413 |
+| 30 | 2.657 | **0.304** | **0.950** | 0.403 |
+**Environment score: `0.113 → 0.304` in 30 steps — a 169% improvement in live-graded extraction accuracy.**
+</div>
+---
+## The Reward Architecture
+### 🔍 Extractor — 4 Independent Signals
+```python
+reward_format(extracted)             # weight 0.10 — all 5 required JSON keys present?
+reward_field_accuracy(extracted, gt) # weight 0.40 — vendor / date / currency / total match?
+reward_math_consistency(extracted)   # weight 0.25 — qty × unit_price = amount per line?
+reward_completeness(extracted, gt)   # weight 0.25 — all expected line items present?
+# All clamped to (0.01, 0.99) — no log(0), no gradient collapse at boundaries
+```
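To make the weighting concrete, here is a hedged sketch of how two of the signals and the weighted total could be computed. The weights and the clamp range come from the block above. The key names in `REQUIRED_KEYS`, the line-item field names (`qty`, `unit_price`, `amount`), and the choice to clamp the weighted total rather than each signal are assumptions for illustration.

```python
# Hypothetical names for the 5 required keys; the post does not list them.
REQUIRED_KEYS = ("vendor", "date", "currency", "total", "line_items")
WEIGHTS = {"format": 0.10, "field": 0.40, "math": 0.25, "completeness": 0.25}


def clamp(value: float, lo: float = 0.01, hi: float = 0.99) -> float:
    """Keep a reward strictly inside (0, 1)."""
    return max(lo, min(hi, value))


def format_score(extracted: dict) -> float:
    """Fraction of required keys present in the extraction."""
    return sum(key in extracted for key in REQUIRED_KEYS) / len(REQUIRED_KEYS)


def math_score(extracted: dict, tol: float = 0.01) -> float:
    """Fraction of line items where qty * unit_price matches amount."""
    items = extracted.get("line_items") or []
    if not items:
        return 0.0
    consistent = sum(
        abs(item["qty"] * item["unit_price"] - item["amount"]) <= tol
        for item in items
    )
    return consistent / len(items)


def total_reward(signals: dict) -> float:
    """Weighted sum of per-signal scores in [0, 1], clamped to (0.01, 0.99)."""
    return clamp(sum(WEIGHTS[name] * signals[name] for name in WEIGHTS))


if __name__ == "__main__":
    # The step-10 pattern from the post: consistent math, hallucinated fields.
    hacked = {"format": 1.0, "field": 0.0, "math": 0.97, "completeness": 1.0}
    print(total_reward(hacked))
```

Plugging in the step-10 pattern shows why the breakdown matters: the weighted total lands near 0.59 even though `field` is zero, so the aggregate alone would not reveal that every extracted value was wrong.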
+### Auditor — Precision-Weighted
+<div align="center">
+| Outcome | Reward | Why |
+|:---|:---:|:---|
+| Correct fraud type detected | **0.99** | Rewards precise classification, not just flagging |
+| Clean invoice correctly approved | **0.90** | Keeps false-positive rate honest |
+| Compound fraud — one of two types caught | **0.65** | Partial credit prevents discouragement on hard cases |
+| Fraud flagged but wrong type | **0.50** | Penalises sloppiness while crediting intent |
+| Miss or false positive | **0.01** | Near-zero punishes both failure modes symmetrically |
+</div>
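The table translates fairly directly into a lookup over sets of fraud types. This is a sketch under assumptions: `auditor_reward` is a hypothetical name, an empty set stands for "clean", and the handling of edge cases the table does not cover (for example, one correct type plus an extra wrong one on a single-fraud invoice) is a guess.

```python
def auditor_reward(true_types: set, predicted_types: set) -> float:
    """Score one Auditor verdict; an empty set means the invoice is clean."""
    if not true_types:
        # Clean invoice: cleared correctly, or a false positive.
        return 0.90 if not predicted_types else 0.01
    if not predicted_types:
        return 0.01  # fraud missed entirely
    if predicted_types == true_types:
        return 0.99  # exact fraud type(s) identified
    if len(true_types) == 2 and predicted_types & true_types:
        return 0.65  # compound fraud, one of the two types caught
    return 0.50      # fraud flagged, but with the wrong type


if __name__ == "__main__":
    print(auditor_reward({"phantom_vendor"}, {"phantom_vendor"}))
    print(auditor_reward({"phantom_vendor", "price_gouging"}, {"price_gouging"}))
    print(auditor_reward(set(), {"math_fraud"}))
```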
+### Regulator — Cross-Episode
+```
+Total = Precision(0.35) + Recall(0.35) + No-over-flagging(0.15) + Early-warning-bonus(0.15)
+```
+The early-warning bonus rewards the Regulator for predicting emerging blind spots *before* detection rates cross the critical threshold — proactive oversight, not reactive alarm.
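As a rough illustration of the formula, the sketch below computes precision and recall over sets of fraud types and takes the other two components as pre-computed scores in `[0, 1]`. The weights come from the formula above; the `regulator_reward` name, the set-based definition of a blind-spot prediction, and passing the over-flagging and early-warning terms in as inputs are assumptions, because the post does not define how those two terms are measured.

```python
def regulator_reward(
    predicted: set,
    actual: set,
    no_over_flagging: float = 1.0,
    early_warning: float = 0.0,
) -> float:
    """Weighted Regulator score for one round of blind-spot predictions.

    predicted: fraud types the Regulator flagged as blind spots
    actual:    fraud types that really were blind spots
    The last two arguments are scores in [0, 1] computed elsewhere.
    """
    hits = len(predicted & actual)
    # Convention for this sketch: an empty set scores 0 rather than raising.
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    return (
        0.35 * precision
        + 0.35 * recall
        + 0.15 * no_over_flagging
        + 0.15 * early_warning
    )


if __name__ == "__main__":
    print(regulator_reward({"phantom_vendor", "math_fraud"}, {"phantom_vendor"}))
```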
+---
+## Building With OpenEnv
+The environment is a FastAPI app deployed on HuggingFace Spaces, exposing the standard OpenEnv interface. The training Colab connects directly to the live Space — `/grader` *is* the reward function. There's no separate scoring script. **The environment and the verifier are the same thing.**
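+```bash
+BASE=https://ps2181-invoice-processing-pipeline.hf.space
+# Start an episode
+curl -X POST "$BASE/reset" -H "Content-Type: application/json" -d '{"task_id": "expert"}'
+# Submit an extraction or audit result (replace the "..." and {...} placeholders with real values)
+curl -X POST "$BASE/step" -H "Content-Type: application/json" -d '{"episode_id": "...", "extracted_data": {...}}'
+# Check Regulator state anytime
+curl "$BASE/regulator/report"       # detection rates, blind spots, generator bias weights
+curl "$BASE/regulator/forecast"     # trend slopes, emerging blind spots with early warnings
+curl "$BASE/regulator/calibration"  # overconfidence / underconfidence per fraud type
+```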
+Training uses **GRPO via TRL** with **Unsloth-optimised 4-bit QLoRA** on `Qwen2.5-1.5B-Instruct`, with three separate LoRA adapters (Extractor, Auditor, and Generator), each trained on its own reward signal.
+```
+Colab → /reset  (fresh synthetic invoice from live environment)
+      → model generates JSON extraction
+      → /grader  scores against ground truth
+      → GRPO updates weights toward higher-reward completions
+      → repeat 200 steps
+```
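The "toward higher-reward completions" step rests on GRPO's group-relative advantage: each completion's reward is compared with the other completions sampled for the same prompt. The sketch below shows only that normalisation. It is not TRL's code, and the population standard deviation, the `eps` value, and the `group_relative_advantages` name are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list, eps: float = 1e-4) -> list:
    """Normalise the rewards of one prompt's completions against their own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every completion scores the same.
    return [(reward - mu) / (sigma + eps) for reward in rewards]


if __name__ == "__main__":
    # Four grader scores for four completions of the same invoice prompt.
    print(group_relative_advantages([0.2, 0.4, 0.6, 0.8]))
```

Completions that score above their group's mean get a positive advantage and are reinforced; those below get a negative one. When all completions in a group score the same, every advantage is zero and the group contributes no learning signal, which is one reason a flat reward stalls training.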
 ---
+## What We Learned
+**Reward design is product design.** Every reward function is a specification for the behaviour you actually want. Getting the Auditor reward right — where catching the *right* fraud type earns `0.99` but the *wrong* type earns `0.50` and missing entirely earns `0.01` — took more thinking than most of the engineering.
+**Multiple reward signals are diagnostics, not just incentives.** We didn't add four signals to the Extractor because the theory said to. We added them because we wanted to *see* where the model was failing. They paid off immediately at step 10.
+**Cross-episode agents change what's possible.** The Regulator couldn't exist in a single-episode design. Most RL environments treat each episode as independent. Giving one agent access to the history of another creates a fundamentally different kind of oversight — one that looks less like evaluation and more like a genuine colleague watching your back.
+---
+## Try It
+<div align="center">
+| Resource | Link |
+|:---|:---|
+| **Live Environment** | [ps2181-invoice-processing-pipeline.hf.space](https://ps2181-invoice-processing-pipeline.hf.space) |
+| **Gradio Demo UI** | [/web](https://ps2181-invoice-processing-pipeline.hf.space/web) |
+| **API Docs** | [/docs](https://ps2181-invoice-processing-pipeline.hf.space/docs) |
+| **Training Colab** | [Open notebook](https://colab.research.google.com/drive/1C1_3giNt-NmbzKNFJr5_L1fms3L8LfmB) |
+| **GitHub** | [invoice-processing-pipeline](https://github.com/ps2181/invoice-processing-pipeline) |
+| **Extractor Model** | [ps2181/extractor-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/extractor-lora-qwen2.5-1.5b) |
+| **Auditor Model** | [ps2181/auditor-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/auditor-lora-qwen2.5-1.5b) |
+| **Generator Model** | [ps2181/generator-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/generator-lora-qwen2.5-1.5b) |
+</div>
 ---
+<div align="center">
+*Built for the Meta PyTorch OpenEnv Hackathon 2026.*
+*Theme alignment: Multi-Agent Interactions (#1) · Fleet AI Scalable Oversight (#1 bonus) · Professional Tasks (#3.1) · Self-Improvement (#4)*
+<br/>
+**Pritam Satpathy & Gnana Nawin T · Scaler School of Technology · Bangalore**
+</div>

README.md CHANGED

@@ -1,256 +1,531 @@
 ---
-title: Invoice Processing Pipeline
-emoji: 🧾
-colorFrom: blue
-colorTo: indigo
-sdk: docker
-app_port: 7860
-tags:
-  - openenv
-  - multi-agent
-  - grpo
-  - rlhf
-  - fraud-detection
-  - invoice
 ---
-# 🧾 Invoice Processing Pipeline — Self-Improving Adversarial Fraud Detection
-> **Meta PyTorch OpenEnv Hackathon** · Team: Pritam Satpathy & Gnana Nawin T
->
-> **Primary theme: #4 Self-Improvement** · **Secondary: #1 Multi-Agent Interactions**
-**Live Demo** → [ps2181-invoice-processing-pipeline.hf.space/web](https://ps2181-invoice-processing-pipeline.hf.space/web)
-**API Docs** → [ps2181-invoice-processing-pipeline.hf.space/docs](https://ps2181-invoice-processing-pipeline.hf.space/docs)
 ---
-## The Core Idea
-> *A system that continuously generates harder challenges targeting its own weakest points.*
-Most fraud detection pipelines are static. Ours **gets harder for itself over time**: the Regulator finds where the Auditor keeps failing, the Generator exploits those exact blind spots in the next episode, the Auditor's new mistakes update the Regulator — and the loop closes.
 ---
-## 5-Agent Architecture
-```mermaid
-graph LR
-    R[🎯 Regulator\nDetects blind spots\nUpdates weights] -->|bias weights| G[⚡ Generator\nCreates adversarial\ninvoices]
-    G -->|raw invoice text| E[🔍 Extractor\nParses structured\nJSON fields]
-    E -->|structured data| A[🕵️ Auditor\nFlags fraud with\nconfidence scores]
-    A -->|audit results| AP[✅ Approver\nApprove / Escalate\n/ Reject]
-    AP -->|episode outcome| R
-    A -->|missed fraud types| R
-```
-Each agent has **independent reward signals** — no shared objective, genuine multi-agent dynamics:
-| Agent | Role | Reward signal |
-|-------|------|---------------|
-| **Regulator** | Oversight: detects Auditor blind spots, reweights Generator | Precision + Recall + Early-warning bonus |
-| **Generator** | Adversary: creates invoices biased toward blind spots | Evasion rate (0.85 evades both, 0.10 if caught) |
-| **Extractor** | Parser: structured JSON extraction with 4 signals | Format + Field accuracy + Math + Completeness |
-| **Auditor** | Detector: fraud classification with confidence | 0.99 correct type, 0.90 clean, 0.01 miss |
-| **Approver** | Gatekeeper: final approve/escalate/reject | Rule-based (confidence threshold) |
 ---
-## Three Novel Features
-| Feature | What it does |
-|---------|-------------|
-| **Predictive Regulator** | Computes trend slope over 5-episode windows — warns of *emerging* blind spots before they become critical, not just current ones |
-| **Compound Fraud** | Invoices can carry two simultaneous fraud signals (e.g. phantom vendor + price gouging). Partial credit for catching one; full reward for both |
-| **Confidence Calibration** | Tracks (confidence, correct?) pairs per fraud type. Flags *overconfident misses* — Auditor saying "90% sure, approved" on a fraudulent invoice — the most dangerous failure mode |
 ---
-## Training Results — GRPO on Live Environment
-All 3 agents trained with **TRL GRPOTrainer + Unsloth** using the deployed HF Space as the live reward verifier:
-| Agent | Baseline | Best Achieved | Notes |
-|-------|----------|--------------|-------|
-| **Extractor** | 0.10 (random) | **0.914** live grader score | Peaked step 15; crashed due to `_MAX_SESSIONS=50` bug (fixed to 200) |
-| **Auditor** | 0.01 (dead signal) | **0.719** total reward | Run 1 had dead live reward (episode_id list bug); Run 2 fixed → 0.01→0.52 |
-| **Generator** | — | Format learned (~0.22) | Live evasion reward had same bug; format/plausibility reward improved |
-**Training setup:** Qwen2.5-1.5B-Instruct, 4-bit QLoRA r=16, Unsloth + TRL, Google Colab A100
-### Auditor Training Log (Run 2 — exact data)
-| Step | Total Reward | Live Env Reward | ±Std |
-|------|-------------|----------------|------|
-| 5  | 0.4828 | 0.2828 | ±0.194 |
-| 10 | **0.7188** | **0.5188** | ±0.239 |
-| 15 | 0.4538 | 0.2538 | ±0.123 |
-| 20 | 0.5733 | 0.3733 | ±0.212 |
-| 25 | 0.5325 | 0.3325 | ±0.232 |
-| 30 | 0.6038 | 0.4038 | ±0.147 |
-*Run 1 (dead signal): live env reward = 0.010 flat across all steps (episode_id list bug — TRL passes episode_id as a list, old code sent the whole list to the server instead of indexing per completion)*
 ---
-## Trained LoRA Agents
-| Agent | HF Hub |
-|-------|--------|
-| Extractor | [ps2181/extractor-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/extractor-lora-qwen2.5-1.5b) |
-| Auditor   | [ps2181/auditor-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/auditor-lora-qwen2.5-1.5b) |
-| Generator | [ps2181/generator-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/generator-lora-qwen2.5-1.5b) |
 ---
-## Sample Episode Trace
 ```
-━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
-  MULTI-AGENT PIPELINE  ·  LIVE EPISODE
-━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
-  🎯  STEP 0 — REGULATOR
-  ────────────────────────────────────────────────
-  Blind spots detected : phantom_vendor
-  Fraud weights → next episode:
-    phantom_vendor             60%  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓  ← prioritised
-    price_gouging              20%  ▓▓▓▓▓
-    math_fraud                 10%  ▓▓
-    duplicate_submission       10%  ▓▓
-  ⚡  STEP 1 — GENERATOR  (Qwen2.5 LoRA)
-  ────────────────────────────────────────────────
-  Episode  : ep_8f3a2c…
-  Invoices : 3
-  Fraud focus : Phantom Vendor (60% Regulator weight)
-  🔍  STEP 2 — EXTRACTOR  (Qwen2.5 LoRA)
-  ────────────────────────────────────────────────
-  Vendor   : ShadowByte Technologies          ← not in registry
-  Total    : $12,450.00
-  Reward   : 0.847  [format 0.10  field 0.38  math 0.25  completeness 0.12]
-  🕵️  STEP 3 — AUDITOR  (Qwen2.5 LoRA)
-  ────────────────────────────────────────────────
-  INV-85529  →  🚨 FLAGGED  [PHANTOM VENDOR]  conf=0.91
-  INV-85530  →  ✅ APPROVED                   conf=0.88
-  Mean reward : 0.623
-  ✅  STEP 4 — APPROVER
-  ────────────────────────────────────────────────
-  INV-85529  →  ❌ REJECT
-  INV-85530  →  ✅ APPROVE
-  Generator adversarial reward : 0.60   (evaded Auditor on 1/3, Approver caught)
-  🎯  STEP 5 — REGULATOR UPDATE
-  ────────────────────────────────────────────────
-  phantom_vendor detection improved: 31% → 45%
-  Generator weights updated for next episode
-━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 ```
 ---
-## Reward Signals
-### Extractor (4 independent signals)
-| Signal | Max | What it measures |
-|--------|-----|-----------------|
-| Format | 0.10 | Required fields present |
-| Field accuracy | 0.40 | Vendor / date / currency / total correct |
-| Math consistency | 0.25 | qty × unit_price = amount, sum = total |
-| Completeness | 0.25 | All line items captured |
-### Auditor
-| Outcome | Reward |
-|---------|--------|
-| Correct fraud type detected | 0.99 |
-| Clean invoice correctly approved | 0.90 |
-| Compound fraud — one type caught | 0.65 |
-| Fraud detected, wrong type | 0.50 |
-| Miss or false positive | 0.01 |
-### Generator (adversarial)
-| Outcome | Reward |
-|---------|--------|
-| Evades both Auditor and Approver | 0.85 |
-| Evades Auditor, Approver catches | 0.60 |
-| Auditor catches it | 0.10 |
-### Regulator
-Precision (0.35) + Recall (0.35) + No over-flagging (0.15) + Early warning bonus (0.15)
 ---
-## API Endpoints
 ### Core OpenEnv
 | Endpoint | Method | Description |
-|----------|--------|-------------|
-| `/reset` | POST | Start episode (`{"task_id": "easy\|medium\|hard\|expert\|adversarial\|negotiate\|supply_chain"}`) |
-| `/step`  | POST | Submit extracted data, get reward + feedback |
-| `/grader`| POST | Score without modifying state |
-| `/state` | GET  | Episode metadata |
-| `/health`| GET  | Health check |
-| `/ws`    | WS   | WebSocket interface |
 ### Multi-Agent
 | Endpoint | Method | Description |
-|----------|--------|-------------|
-| `/multi/reset`   | POST | Start 5-agent episode, Generator biased by Regulator |
-| `/multi/extract` | POST | Score Extractor output (4 signals) |
-| `/multi/audit`   | POST | Score Auditor output, update tracker |
-| `/multi/approve` | POST | Run Approver, compute Generator reward |
 ### Regulator
 | Endpoint | Method | Description |
-|----------|--------|-------------|
-| `/regulator/report`      | GET  | Detection rates, blind spots, weights |
-| `/regulator/forecast`    | GET  | Predictive trend analysis |
-| `/regulator/calibration` | GET  | Confidence calibration per fraud type |
-| `/regulator/predict`     | POST | Score Regulator blind spot predictions |
 ---
-## Quick Start
-```bash
-# Health check
-curl https://ps2181-invoice-processing-pipeline.hf.space/health
-# Start a multi-agent episode
-curl -X POST https://ps2181-invoice-processing-pipeline.hf.space/multi/reset
-# Get Regulator blind spot report
-curl https://ps2181-invoice-processing-pipeline.hf.space/regulator/report
-# Predictive forecast
-curl https://ps2181-invoice-processing-pipeline.hf.space/regulator/forecast
 ```
 ---
-## Fraud Types
-| Type | Description |
-|------|-------------|
-| `phantom_vendor` | Vendor not in the Approved Vendor Registry |
-| `price_gouging` | Unit price > 150% of market max |
-| `math_fraud` | Invoice total ≠ sum of line items |
-| `duplicate_submission` | Same invoice_id or vendor+date+total already seen |
-| `compound_fraud` | Two fraud signals in one invoice |
 ---
-## Links
-- **Live Demo**: [ps2181-invoice-processing-pipeline.hf.space/web](https://ps2181-invoice-processing-pipeline.hf.space/web)
-- **API Docs**: [ps2181-invoice-processing-pipeline.hf.space/docs](https://ps2181-invoice-processing-pipeline.hf.space/docs)
-- **GitHub**: [github.com/ps2181/invoice-processing-pipeline](https://github.com/ps2181/invoice-processing-pipeline)
-- **OpenEnv**: [github.com/meta-pytorch/OpenEnv](https://github.com/meta-pytorch/OpenEnv)
-- **Extractor LoRA**: [ps2181/extractor-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/extractor-lora-qwen2.5-1.5b)
-- **Auditor LoRA**: [ps2181/auditor-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/auditor-lora-qwen2.5-1.5b)
-- **Generator LoRA**: [ps2181/generator-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/generator-lora-qwen2.5-1.5b)

+<div align="center">
+<!-- Animated header banner -->
+<img src="https://capsule-render.vercel.app/api?type=waving&color=gradient&customColorList=6,11,20&height=200&section=header&text=Invoice%20Processing%20Pipeline&fontSize=40&fontColor=fff&animation=twinkling&fontAlignY=35&desc=Self-Improving%20Multi-Agent%20Fraud%20Detection%20%7C%20OpenEnv%20%2B%20GRPO%20%2B%20Qwen2.5&descAlignY=55&descSize=16" width="100%"/>
+<!-- Badges row 1 -->
+<p>
+  <a href="https://ps2181-invoice-processing-pipeline.hf.space/web">
+    <img src="https://img.shields.io/badge/🚀%20Live%20Demo-HuggingFace%20Spaces-FF9D00?style=for-the-badge&logo=huggingface&logoColor=white" />
+  </a>
+  <a href="https://colab.research.google.com/drive/1C1_3giNt-NmbzKNFJr5_L1fms3L8LfmB">
+    <img src="https://img.shields.io/badge/Training%20Colab-Open%20Notebook-F9AB00?style=for-the-badge&logo=googlecolab&logoColor=white" />
+  </a>
+  <a href="https://ps2181-invoice-processing-pipeline.hf.space/docs">
+    <img src="https://img.shields.io/badge/API%20Docs-FastAPI-009688?style=for-the-badge&logo=fastapi&logoColor=white" />
+  </a>
+</p>
+<!-- Badges row 2 -->
+<p>
+  <img src="https://img.shields.io/badge/Framework-OpenEnv-1A356E?style=for-the-badge" />
+  <img src="https://img.shields.io/badge/Model-Qwen2.5--1.5B%20+%20LoRA%20r%3D16-8B1A4E?style=for-the-badge" />
+  <img src="https://img.shields.io/badge/Training-GRPO%20+%20Unsloth-00A67E?style=for-the-badge" />
+  <img src="https://img.shields.io/badge/Agents-5%20Adversarial-E44D26?style=for-the-badge" />
+</p>
+<!-- Badges row 3 -->
+<p>
+  <img src="https://img.shields.io/badge/Tasks-7%20Progressive-6C3483?style=for-the-badge" />
+  <img src="https://img.shields.io/badge/Deployment-Docker%20%7C%20HF%20Spaces-0D1117?style=for-the-badge&logo=docker" />
+  <img src="https://img.shields.io/badge/License-MIT-green?style=for-the-badge" />
+  <img src="https://img.shields.io/badge/Hackathon-Meta%20PyTorch%202026-FF6B35?style=for-the-badge" />
+</p>
+<br/>
+> **Meta PyTorch OpenEnv Hackathon — Grand Finale · April 25–26, 2026**
+>
+> Team: **Pritam Satpathy** & **Gnana Nawin T** · Scaler School of Technology, Bangalore
+<br/>
+<!-- Animated typing headline -->
+<a href="https://git.io/typing-svg">
+  <img src="https://readme-typing-svg.demolab.com?font=Fira+Code&weight=600&size=22&pause=1000&color=007A87&center=true&vCenter=true&width=750&lines=5-Agent+Adversarial+Fraud+Detection+System;Self-Improving+via+Cross-Episode+Regulator;GRPO-Trained+LoRA+Agents+on+Live+Environment;Invoice+%E2%86%92+Extract+%E2%86%92+Audit+%E2%86%92+Approve+%E2%86%92+Improve" alt="Typing SVG" />
+</a>
+</div>
 ---
+## 🔥 What Makes This Different
+> Most multi-agent systems are **static pipelines**. Ours **gets harder for itself over time**.
+The system contains a **Predictive Regulator**: a cross-episode meta-agent that monitors the Auditor over a rolling 30-episode window, detects the fraud types it systematically misses (**blind spots**), and **automatically biases the Generator** toward producing more of exactly those types. There is no human intervention and no manual curriculum design; every episode, the system pressure-tests its own weakest point.
+<div align="center">
+<img width="1462" height="731" alt="image" src="https://github.com/user-attachments/assets/7d863b87-1921-45f5-8d94-a06ba3ed6fc1" />
+</div>
 ---
+## ⚡ Three Novel Features
+<table>
+<tr>
+<td width="33%" align="center">
+### 🔮 Predictive Regulator
+Computes **trend slope** over 5-episode windows.<br/>Warns of *emerging* blind spots **before** detection rates cross the critical threshold — proactive oversight, not reactive retraining.
+`+0.15 early-warning bonus`
+</td>
+<td width="33%" align="center">
+### 🧩 Compound Fraud
+Invoices carry **two fraud signals simultaneously** (e.g. phantom vendor + price gouging).<br/>Partial credit `+0.65` for catching one; full reward `+0.99` for both.
+Prevents single-signal heuristics.
+</td>
+<td width="33%" align="center">
+### 📊 Confidence Calibration
+Tracks `(confidence, correct?)` pairs per fraud type.<br/>Detects **overconfident misses** — the Auditor saying "90% sure, approved" on fraud — the most dangerous real-world failure mode.
+</td>
+</tr>
+</table>
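One way to picture the Predictive Regulator's early warning is a least-squares slope fitted to the last five detection rates of a fraud type. The sketch below is an illustration under stated assumptions, not the project's code: the function names and the `critical` and `warn_slope` thresholds are hypothetical.

```python
from statistics import mean


def trend_slope(rates):
    """Least-squares slope of detection rate per episode (0.0 if < 2 points)."""
    n = len(rates)
    if n < 2:
        return 0.0
    xs = list(range(n))
    x_bar, y_bar = mean(xs), mean(rates)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, rates))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


def classify(rates, window=5, critical=0.50, warn_slope=-0.01):
    """Label a fraud type as 'blind_spot', 'emerging' or 'ok'.

    `critical` and `warn_slope` are assumed values chosen for illustration.
    """
    recent = rates[-window:]
    if not recent:
        return "ok"  # no data yet: nothing to warn about
    if recent[-1] < critical:
        return "blind_spot"
    if trend_slope(recent) <= warn_slope:
        return "emerging"  # still above the threshold, but falling
    return "ok"


print(classify([0.70, 0.68, 0.66, 0.64, 0.62]))
```

A steadily falling series is flagged as `emerging` while its latest rate is still above the critical threshold, which is the "warn before it crosses" behaviour described above.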
 ---
+## 🤖 Five Agents, One Closed Loop
+<div align="center">
+| Agent | Role | Reward Signal |
+|:---:|:---|:---|
+| 🏭 **Generator** | Creates clean or fraudulent invoices, biased by Regulator blind-spot weights | `+0.85` evades Auditor + Approver · `+0.60` evades Auditor only · `+0.10` caught |
+| 🔍 **Extractor** | Parses raw OCR invoice text → structured JSON | 4 independent signals: format `0.10` · field accuracy `0.40` · math `0.25` · completeness `0.25` |
+| 🕵️ **Auditor** | Classifies each invoice with fraud type + confidence score | `+0.99` correct type · `+0.90` clean clearance · `+0.65` compound (one caught) · `+0.01` miss/FP |
+| ✅ **Approver** | Final approve / escalate / reject (rule-based, confidence-gated) | Decision rule rather than a learned reward: flagged with confidence `≥0.80` → reject · flagged with `0.50–0.80` → escalate · Auditor-approved → approve |
+| 🧠 **Regulator** | Cross-episode meta-agent — 30-episode rolling window, blind-spot tracker | Precision `0.35` + Recall `0.35` + No over-flagging `0.15` + Early warning `0.15` |
+</div>
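The Approver's confidence gate from the table can be written as a few lines of Python. This is a minimal sketch, assuming the Auditor verdict is the string `"flagged"` or `"approved"`; the function name is hypothetical, and the table does not say what happens to a flag below 0.50 confidence, so escalating it is an assumption made here.

```python
def approver_decision(verdict, confidence):
    """Map one Auditor result to 'approve', 'escalate' or 'reject'."""
    if verdict == "approved":
        return "approve"
    if confidence >= 0.80:
        return "reject"
    # 0.50-0.80 escalates per the table. Flags below 0.50 are not covered
    # there; this sketch escalates them as well (assumption).
    return "escalate"


print(approver_decision("flagged", 0.91))
print(approver_decision("approved", 0.88))
```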
 ---
+## 🎯 Seven Tasks — Progressive Difficulty
+| # | Task | Difficulty | What the Agent Must Do |
+|:---:|:---|:---:|:---|
+| 1 | `easy` | 🟢 Easy | Extract `vendor`, `date`, `currency`, `total`, `line_items` from a single clean invoice |
+| 2 | `medium` | 🟡 Medium | Clean & normalise a batch: fix date format chaos, vendor typos, currency symbol pollution |
+| 3 | `hard` | 🟠 Hard | Extract + reconcile against purchase orders — flag overcharges, extra items, missing items |
+| 4 | `expert` | 🔴 Expert | Fraud audit using vendor registry, market prices, and invoice history — classify fraud type exactly |
+| 5 | `adversarial` | 🟠 Hard | Extract correctly despite SUBTOTAL traps, fake TAX/ADJUSTMENT lines, FX noise, and OCR-corrupted vendor labels |
+| 6 | `negotiate` | 🟡 Medium | Ask clarification questions `{"question": "..."}` then extract; `+15%` bonus for ≤2 questions |
+| 7 | `supply_chain` | 🔴 Expert | Detect `quantity_shortfall`, `price_spike`, `unauthorized_substitution`, `phantom_delivery` |
+---
+## 🧠 Trained LoRA Agents
+All three generative agents were trained with **GRPO on live environment data**: the HF Space `/grader` endpoint *is* the reward function during training.
+<div align="center">
+| Agent | Base Model | LoRA Config | HuggingFace Hub |
+|:---:|:---|:---:|:---|
+| 🔍 Extractor | Qwen2.5-1.5B-Instruct | r=16, α=16, 4-bit QLoRA | [ps2181/extractor-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/extractor-lora-qwen2.5-1.5b) |
+| 🕵️ Auditor | Qwen2.5-1.5B-Instruct | r=16, α=16, 4-bit QLoRA | [ps2181/auditor-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/auditor-lora-qwen2.5-1.5b) |
+| 🏭 Generator | Qwen2.5-1.5B-Instruct | r=16, α=16, 4-bit QLoRA | [ps2181/generator-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/generator-lora-qwen2.5-1.5b) |
+</div>
+**LoRA target modules:** `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
 ---
+## 📈 Training Results
+### Extractor — GRPO Training Progress
+The model learned to extract structured JSON from noisy invoice text via **reinforcement learning with 4 independent reward signals**, scoring directly against the live environment grader.
+| Step | Total Reward | Env Score | Format | Math Consistency |
+|:---:|:---:|:---:|:---:|:---:|
+| 10 | 2.361 | 0.113 | 0.900 | 0.347 |
+| 20 | 2.595 | 0.282 | 0.900 | 0.413 |
+| 30 | 2.657 | 0.304 | **0.950** | 0.403 |
+> 📊 **Environment score: `0.113 → 0.304` in 30 steps — a 169% improvement** in live-graded extraction accuracy.
+### 🔍 Reward Hacking Caught in Training
+At step 10, we observed the model achieving `math_consistency = 0.97` and `completeness = 1.0` while `field_accuracy = 0.00` — it had learned to output **arithmetically-consistent JSON with entirely hallucinated values**.
+Our 4 **independent** reward signals made this visible immediately. A single aggregated reward would never have surfaced it.
+```
+Step 10 — Reward Hacking Detected:
+  format:             0.10  ✅
+  math_consistency:   0.97  ✅ ← model gaming this signal
+  completeness:       1.00  ✅ ← model gaming this signal
+  field_accuracy:     0.00  ❌ ← hallucinating all values
+  Action: adjusted training emphasis on field_accuracy weight
+  Result: field_accuracy climbed to 0.30+ by step 30
+```
+This is exactly why multiple independent reward signals matter — and why we have 4.
 ---
+## 🎁 Reward Architecture
+### Extractor — 4 Independent Signals
+```python
+def reward_format(extracted) -> float:              # weight 0.10
+    """Are all 5 required JSON keys present?"""
+def reward_field_accuracy(extracted, gt) -> float:  # weight 0.40
+    """Do vendor / date / currency / total match ground truth?"""
+def reward_math_consistency(extracted) -> float:    # weight 0.25
+    """Does qty × unit_price = amount for every line item?"""
+def reward_completeness(extracted, gt) -> float:    # weight 0.25
+    """Recall: what fraction of expected line items are present?"""
+# All rewards clamped to (0.01, 0.99) — no log(0), no gradient collapse
+```
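The stubs above give only signatures and docstrings. One possible way to fill them in is sketched below. The matching rules (case-insensitive string comparison, a 0.01 numeric tolerance, description-based recall) and the `extractor_reward` combiner are assumptions for illustration; the real grader may score differently. Each signal returns a fraction in [0, 1] and the weights are applied when combining.

```python
REQUIRED_KEYS = ("vendor", "date", "currency", "total", "line_items")
WEIGHTS = {"format": 0.10, "field_accuracy": 0.40,
           "math_consistency": 0.25, "completeness": 0.25}


def clamp(x, lo=0.01, hi=0.99):
    return max(lo, min(hi, x))


def _items(doc):
    items = doc.get("line_items")
    return items if isinstance(items, list) else []


def reward_format(extracted):
    return sum(k in extracted for k in REQUIRED_KEYS) / len(REQUIRED_KEYS)


def reward_field_accuracy(extracted, gt):
    hits = 0
    for key in ("vendor", "date", "currency"):
        got = str(extracted.get(key, "")).strip().lower()
        hits += got == str(gt[key]).strip().lower()
    try:
        hits += abs(float(extracted.get("total")) - float(gt["total"])) < 0.01
    except (TypeError, ValueError):
        pass  # missing or non-numeric total earns nothing
    return hits / 4


def reward_math_consistency(extracted):
    items = _items(extracted)
    if not items:
        return 0.0
    ok = 0
    for it in items:
        try:
            ok += abs(it["qty"] * it["unit_price"] - it["amount"]) < 0.01
        except (KeyError, TypeError):
            pass  # malformed line item counts as inconsistent
    return ok / len(items)


def reward_completeness(extracted, gt):
    expected = {it["description"].lower() for it in _items(gt)}
    if not expected:
        return 1.0
    found = {str(it.get("description", "")).lower()
             for it in _items(extracted) if isinstance(it, dict)}
    return len(expected & found) / len(expected)


def extractor_reward(extracted, gt):
    """Weighted sum of the four signals, clamped to (0.01, 0.99)."""
    parts = {
        "format": reward_format(extracted),
        "field_accuracy": reward_field_accuracy(extracted, gt),
        "math_consistency": reward_math_consistency(extracted),
        "completeness": reward_completeness(extracted, gt),
    }
    total = sum(WEIGHTS[name] * value for name, value in parts.items())
    return clamp(total), parts
```

Returning the per-signal breakdown alongside the total keeps a pattern such as high `math_consistency` with zero `field_accuracy` visible rather than averaged away.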
+### Auditor Reward
+| Outcome | Reward | Why |
+|:---|:---:|:---|
+| Correct fraud type detected | **0.99** | Incentivises precise classification, not just binary flagging |
+| Clean invoice correctly approved | **0.90** | High reward keeps false-positive rate low |
+| Compound fraud — one of two types caught | **0.65** | Partial credit prevents cliff on hard cases |
+| Fraud flagged but wrong type | **0.50** | Penalises sloppiness; rewards catching *something* |
+| Miss or false positive | **0.01** | Near-zero punishes both failure modes symmetrically |
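The table above can be read as a small lookup function. The sketch below assumes ground truth and prediction are both sets of fraud-type strings, with an empty set meaning "clean" or "approved"; that representation and the function name are hypothetical. The `0.99` for catching both types of a compound fraud follows the Compound Fraud description earlier in this README.

```python
def auditor_reward(true_types, predicted_types):
    """Score one audit decision against ground truth."""
    true_types, predicted_types = set(true_types), set(predicted_types)
    if not true_types:
        # Clean invoice: approving is correct, flagging is a false positive.
        return 0.90 if not predicted_types else 0.01
    if not predicted_types:
        return 0.01  # fraud missed entirely
    if predicted_types == true_types:
        return 0.99  # exact type(s) identified
    if len(true_types) > 1 and predicted_types & true_types:
        return 0.65  # compound fraud, one of two signals caught
    return 0.50      # flagged, but with the wrong type


print(auditor_reward({"phantom_vendor", "price_gouging"}, {"price_gouging"}))
```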
+### Generator Reward (Adversarial Self-Play)
+| Outcome | Reward |
+|:---|:---:|
+| Fraud evades **both** Auditor and Approver | **0.85** |
+| Auditor misses, Approver catches | **0.60** |
+| Auditor catches it | **0.10** |
+### Regulator Reward
+```
+Total = Precision(0.35) + Recall(0.35) + No-over-flagging(0.15) + Early-warning-bonus(0.15)
+```
 ---
+## 🦺 Five Fraud Types
+<div align="center">
+| Type | Detection Method | Example |
+|:---|:---|:---|
+| 🏚️ `phantom_vendor` | Vendor not in Approved Vendor Registry | "QuickSupply Hub" — not in approved list |
+| 💸 `price_gouging` | Unit price > 150% of market ceiling | Laptop at $2,800 when market max is $1,299 |
+| ➕ `math_fraud` | Invoice total ≠ sum of line items | Total $5,200 when items sum to $4,400 |
+| 📋 `duplicate_submission` | Same invoice_id or vendor+date+total already in history | INV-83221 submitted twice |
+| 🔀 `compound_fraud` | Two fraud signals in one invoice | Phantom vendor **AND** price gouging simultaneously |
+</div>
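The detection methods in the table translate directly into rule checks. The following is a rule-based reference sketch, not the trained Auditor: the function names, the invoice dictionary layout, and the shapes of `approved_vendors`, `market_max` and `history` are assumptions made for illustration.

```python
def fraud_signals(invoice, approved_vendors, market_max, history):
    """Return the list of fraud signals that fire for one invoice."""
    signals = []
    if invoice["vendor"] not in approved_vendors:
        signals.append("phantom_vendor")
    # Price gouging: unit price above 150% of the known market ceiling.
    if any(item["unit_price"] > 1.5 * market_max[item["description"]]
           for item in invoice["line_items"]
           if item["description"] in market_max):
        signals.append("price_gouging")
    line_sum = sum(item["amount"] for item in invoice["line_items"])
    if abs(line_sum - invoice["total"]) > 0.01:
        signals.append("math_fraud")
    key = (invoice["vendor"], invoice["date"], invoice["total"])
    for past in history:
        same_id = past["invoice_id"] == invoice["invoice_id"]
        same_key = (past["vendor"], past["date"], past["total"]) == key
        if same_id or same_key:
            signals.append("duplicate_submission")
            break
    return signals


def fraud_label(signals):
    """Collapse signals into one label; two or more means compound fraud."""
    if not signals:
        return "clean"
    return signals[0] if len(signals) == 1 else "compound_fraud"


invoice = {
    "invoice_id": "INV-83221", "vendor": "QuickSupply Hub",
    "date": "2024-08-15", "total": 2800.00,
    "line_items": [{"description": "Laptop", "qty": 1,
                    "unit_price": 2800.00, "amount": 2800.00}],
}
signals = fraud_signals(invoice, {"Acme Corp"}, {"Laptop": 1299.00}, [])
print(signals, fraud_label(signals))
```

The example reuses the table's figures: a $2,800 laptop exceeds 1.5 × $1,299, and "QuickSupply Hub" is absent from the approved list, so two signals fire and the invoice counts as compound fraud.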
 ---
+## 🌍 The Regulator in Action
+After each episode, the Regulator publishes a report that the Generator reads to bias its next batch:
 ```
+GET /regulator/report
+{
+  "total_audits_recorded": 20,
+  "detection_rates": {
+    "phantom_vendor":        "31%  ⚠ BLIND SPOT (-0.08↓)",
+    "price_gouging":         "74%  ✓ OK (+0.03↑)",
+    "math_fraud":            "81%  ✓ OK (+0.01↑)",
+    "duplicate_submission":  "62%  ⚡ EMERGING (-0.02↓)"
+  },
+  "false_positive_rate": "12%  ✓ OK",
+  "blind_spots": ["phantom_vendor"],
+  "emerging_blind_spots": ["duplicate_submission"],
+  "generator_weights": {
+    "phantom_vendor":       0.30,   ← 3× upweighted (blind spot)
+    "duplicate_submission": 0.20,   ← 2× upweighted (emerging)
+    "price_gouging":        0.125,
+    "math_fraud":           0.125,
+    "compound_fraud":       0.10
+  },
+  "verdict": "Recommend retraining on: phantom_vendor"
+}
 ```
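How blind spots turn into Generator weights is not spelled out in the report, so the sketch below shows just one plausible scheme: start every fraud type at 1, multiply blind spots by 3 and emerging blind spots by 2, then normalise. The function names and the normalisation step are assumptions, and the resulting numbers are not expected to reproduce the report's figures exactly.

```python
import random

FRAUD_TYPES = ["phantom_vendor", "price_gouging", "math_fraud",
               "duplicate_submission", "compound_fraud"]


def generator_weights(blind_spots, emerging, boost_blind=3.0, boost_emerging=2.0):
    """Upweight weak fraud types, then normalise so the weights sum to 1."""
    raw = {}
    for fraud_type in FRAUD_TYPES:
        if fraud_type in blind_spots:
            raw[fraud_type] = boost_blind
        elif fraud_type in emerging:
            raw[fraud_type] = boost_emerging
        else:
            raw[fraud_type] = 1.0
    total = sum(raw.values())
    return {fraud_type: w / total for fraud_type, w in raw.items()}


def sample_fraud_type(weights, rng=random):
    """Draw the next fraud type in proportion to its weight."""
    types = list(weights)
    return rng.choices(types, weights=[weights[t] for t in types], k=1)[0]


weights = generator_weights(["phantom_vendor"], ["duplicate_submission"])
print(weights)
print(sample_fraud_type(weights, random.Random(0)))
```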
 ---
+## 🚀 Quick Start
+### Try the Live Demo
+```bash
+# Health check
+curl https://ps2181-invoice-processing-pipeline.hf.space/health
+# List all 7 tasks with schemas
+curl https://ps2181-invoice-processing-pipeline.hf.space/tasks
+# Start a single-agent episode
+curl -X POST https://ps2181-invoice-processing-pipeline.hf.space/reset \
+     -H "Content-Type: application/json" \
+     -d '{"task_id": "easy"}'
+# Submit an extraction (replace EPISODE_ID from reset response)
+curl -X POST https://ps2181-invoice-processing-pipeline.hf.space/step \
+     -H "Content-Type: application/json" \
+     -d '{
+       "episode_id": "EPISODE_ID",
+       "extracted_data": {
+         "vendor": "Acme Corp",
+         "date": "2024-08-15",
+         "currency": "USD",
+         "total": 2374.93,
+         "line_items": [
+           {"description": "Laptop Computer", "qty": 2, "unit_price": 1099.99, "amount": 2199.98},
+           {"description": "Wireless Mouse",  "qty": 5, "unit_price":   34.99, "amount":  174.95}
+         ]
+       }
+     }'
+```
+### Run the Multi-Agent Pipeline
+```bash
+# Step 1 — Start 5-agent episode (Generator biased by Regulator)
+curl -X POST https://ps2181-invoice-processing-pipeline.hf.space/multi/reset
+# Step 2 — Score Extractor output (4 signals)
+curl -X POST https://ps2181-invoice-processing-pipeline.hf.space/multi/extract \
+     -H "Content-Type: application/json" \
+     -d '{"episode_id": "EP_ID", "extracted_data": {...}}'
+# Step 3 — Score Auditor output (updates 30-episode tracker)
+curl -X POST https://ps2181-invoice-processing-pipeline.hf.space/multi/audit \
+     -H "Content-Type: application/json" \
+     -d '{"episode_id": "EP_ID", "audit_results": [
+       {"invoice_id": "INV-83221", "verdict": "flagged",
+        "fraud_type": "phantom_vendor", "confidence": 0.87}
+     ]}'
+# Step 4 — Run Approver, compute Generator adversarial reward
+curl -X POST https://ps2181-invoice-processing-pipeline.hf.space/multi/approve \
+     -H "Content-Type: application/json" \
+     -d '{"episode_id": "EP_ID"}'
+# Check Regulator state anytime
+curl https://ps2181-invoice-processing-pipeline.hf.space/regulator/report
+curl https://ps2181-invoice-processing-pipeline.hf.space/regulator/forecast
+curl https://ps2181-invoice-processing-pipeline.hf.space/regulator/calibration
+```
+### Run Training (Google Colab)
+[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1C1_3giNt-NmbzKNFJr5_L1fms3L8LfmB)
+The training loop connects **directly** to the live HF Space environment:
+```
+Colab → /reset (fresh synthetic invoice) → model generates JSON
+      → /grader (scores vs ground truth) → GRPO weight update
+      → repeat 200 steps
+```
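The loop above can be wired into a reward function roughly as follows. This sketch deliberately leaves the HTTP call out: `grade` is a hypothetical callable standing in for a POST to `/grader`, because the request and response schema of that endpoint are not shown here. The outer function follows the `(prompts, completions, **kwargs)` convention that TRL's `GRPOTrainer` uses for custom reward functions with plain-text completions, and the `0.01` floor for unparseable output mirrors the reward clamp.

```python
import json


def parse_json_object(text):
    """Pull the outermost {...} span out of a completion, or return None."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        obj = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None


def make_reward_fn(grade, floor=0.01):
    """Wrap a grader callable (extracted dict -> score) as a batch reward function."""
    def reward_fn(prompts, completions, **kwargs):
        rewards = []
        for completion in completions:
            extracted = parse_json_object(completion)
            if extracted is None:
                rewards.append(floor)  # invalid JSON earns the minimum
            else:
                rewards.append(float(grade(extracted)))
        return rewards
    return reward_fn


# Stand-in grader for a dry run; a real one would call the live environment.
def fake_grade(extracted):
    return 0.5 if extracted.get("vendor") == "Acme Corp" else 0.1


reward_fn = make_reward_fn(fake_grade)
print(reward_fn(prompts=["p"] * 3,
                completions=['Here: {"vendor": "Acme Corp"}', "not json", '{"vendor": "X"}']))
```

Injecting the grader keeps the parsing and fallback logic testable offline, with the network call swapped in only for real training runs.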
+---
+## 🗂️ Repository Structure
+```
+invoice-processing-pipeline/
+│
+├── server/
+│   ├── app.py                      # FastAPI — 18 endpoints
+│   ├── environment.py              # 7 tasks · graders · dynamic difficulty
+│   ├── multi_agent_environment.py  # 5-agent system + AuditorPerformanceTracker
+│   ├── agents.py                   # Lazy-loading LoRA inference wrappers
+│   └── web_ui.py                   # Gradio UI (mounted at /web)
+│
+├── models.py                       # Pydantic: Action · Observation · State
+├── inference.py                    # Standalone inference helper
+├── client.py                       # OpenEnv-compatible Python client
+│
+├── extractor_training_grpo.ipynb   # Extractor GRPO training (Unsloth + TRL)
+├── auditor_grpo_training.ipynb     # Auditor GRPO training
+├── generator_grpo_training.ipynb   # Generator GRPO training
+│
+├── openenv.yaml                    # OpenEnv manifest (all 7 tasks declared)
+├── Dockerfile                      # HF Spaces Docker (port 7860, non-root UID 1000)
+├── pyproject.toml                  # Project metadata + dependencies
+├── requirements.txt                # Runtime dependencies
+├── validate-submission.sh          # Submission validator script
+│
+├── ROUND2_PROBLEM_STATEMENT.md     # Full problem statement + reward design rationale
+└── BLOG_DRAFT.md                   # HuggingFace blog post draft
+```
 ---
+## 🔌 API Reference
 ### Core OpenEnv
 | Endpoint | Method | Description |
+|:---|:---:|:---|
+| `/health` | `GET` | Health check → `{"status": "ok", "active_sessions": N}` |
+| `/tasks` | `GET` | All 7 tasks with descriptions, max_attempts, action/observation schemas |
+| `/reset` | `POST` | Start episode `{"task_id": "easy\|medium\|hard\|expert\|adversarial\|negotiate\|supply_chain"}` |
+| `/step` | `POST` | Submit extraction → reward + feedback + hint + reward_breakdown |
+| `/grader` | `POST` | Score without consuming an attempt (used by training Colab) |
+| `/state` | `GET` | Episode metadata — step_count, done, best_reward, full rewards history |
+| `/ws` | `WS` | Full episode over WebSocket (OpenEnv standard) |
+| `/web` | `GET` | Gradio interactive demo UI |
 ### Multi-Agent
 | Endpoint | Method | Description |
+|:---|:---:|:---|
+| `/multi/reset` | `POST` | Start 5-agent episode — Generator biased by Regulator weights |
+| `/multi/extract` | `POST` | Score Extractor output (4 signals) |
+| `/multi/audit` | `POST` | Score Auditor output, update 30-episode performance tracker |
+| `/multi/approve` | `POST` | Run Approver, compute Generator adversarial reward |
+| `/multi/state/{id}` | `GET` | Full episode state including all agent scores |
 ### Regulator
 | Endpoint | Method | Description |
+|:---|:---:|:---|
+| `/regulator/report` | `GET` | Detection rates, blind spots, calibration, generator weights |
+| `/regulator/forecast` | `GET` | Predictive trend analysis — critical + emerging blind spots with slopes |
+| `/regulator/calibration` | `GET` | Overconfidence / underconfidence per fraud type |
+| `/regulator/predict` | `POST` | Score a Regulator blind-spot prediction |
+| `/regulator/demo_seed` | `POST` | Seed tracker with realistic demo data |
+| `/generator/score` | `POST` | Compute Generator reward given auditor/approver outcomes |
 ---
+## 🏗️ Tech Stack
+<div align="center">
+| Layer | Technology |
+|:---|:---|
+| **Environment** | [OpenEnv](https://github.com/meta-pytorch/OpenEnv) · FastAPI · Pydantic v2 |
+| **UI** | Gradio 4.x (mounted at `/web`) |
+| **Deployment** | Docker · HuggingFace Spaces (vcpu-2 / 8 GB) |
+| **Training** | [TRL GRPOTrainer](https://huggingface.co/docs/trl) · [Unsloth](https://github.com/unslothai/unsloth) |
+| **Model** | `unsloth/Qwen2.5-1.5B-Instruct` · 4-bit QLoRA · r=16 |
+| **Reward** | Live `/grader` endpoint on HF Space as verifier |
+| **Session Mgmt** | Thread-safe `OrderedDict` · 200-session cap · LRU eviction |
+| **Dynamic Difficulty** | Per-task rolling window (maxlen=10) → adjusts OCR intensity, batch size, discrepancy count |
+</div>
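The session-management row describes a thread-safe `OrderedDict` with a 200-session cap and LRU eviction. A minimal sketch of that pattern is shown below; the `SessionStore` class and its method names are hypothetical and only illustrate the technique.

```python
import threading
from collections import OrderedDict


class SessionStore:
    """LRU-bounded session map guarded by a lock."""

    def __init__(self, max_sessions=200):
        self._lock = threading.Lock()
        self._sessions = OrderedDict()
        self._max = max_sessions

    def put(self, session_id, state):
        with self._lock:
            self._sessions[session_id] = state
            self._sessions.move_to_end(session_id)   # mark as most recent
            while len(self._sessions) > self._max:
                self._sessions.popitem(last=False)   # evict least recent

    def get(self, session_id):
        with self._lock:
            state = self._sessions.get(session_id)
            if state is not None:
                self._sessions.move_to_end(session_id)
            return state

    def __len__(self):
        with self._lock:
            return len(self._sessions)


store = SessionStore(max_sessions=2)
store.put("a", 1)
store.put("b", 2)
store.get("a")        # touching "a" makes "b" the eviction candidate
store.put("c", 3)
print(store.get("b"))
```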
+---
+## 🔍 Dynamic Difficulty
+The environment adapts generation parameters to the agent's recent performance:
+```python
+if avg_score >= 0.85:   # Agent is doing well → harder
+    n_invoices    = (4, 6)
+    ocr_intensity = 0.55        # heavier corruption
+    n_discrepancies = (3, 5)
+    n_anomalies   = 3
+elif avg_score < 0.60:  # Agent is struggling → easier
+    n_invoices    = (2, 3)
+    ocr_intensity = 0.15
+    n_discrepancies = (1, 2)
+    n_anomalies   = 2
+else:                   # balanced
+    n_invoices    = (3, 5)
+    ocr_intensity = 0.35
+    n_discrepancies = (2, 3)
 ```
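The `avg_score` used above comes from a per-task rolling window of the last 10 scores. A sketch of how such a tracker could be kept is shown here; the `DifficultyTracker` class, its tier names, and the choice to stay "balanced" before any score has been recorded are assumptions.

```python
from collections import defaultdict, deque


class DifficultyTracker:
    """Keeps the most recent scores per task and maps their mean to a tier."""

    def __init__(self, window=10):
        self._scores = defaultdict(lambda: deque(maxlen=window))

    def record(self, task_id, score):
        self._scores[task_id].append(score)  # oldest score drops out at maxlen

    def tier(self, task_id):
        scores = self._scores[task_id]
        if not scores:
            return "balanced"  # assumption: no history means default settings
        avg_score = sum(scores) / len(scores)
        if avg_score >= 0.85:
            return "harder"
        if avg_score < 0.60:
            return "easier"
        return "balanced"


tracker = DifficultyTracker()
for _ in range(12):
    tracker.record("easy", 0.9)
print(tracker.tier("easy"))
```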
 ---
+## 🎭 Theme Alignment
+<div align="center">
+| Theme | Alignment | Evidence |
+|:---:|:---|:---|
+| **#1 Multi-Agent Interactions** | ✅ Core | 5 agents with cooperation, competition, and adversarial self-play |
+| **#1 Fleet AI Scalable Oversight** | ✅ Bonus | Regulator monitors Auditor cross-episode — fully autonomous oversight loop |
+| **#2 Long-Horizon Planning** | ✅ Partial | `negotiate` task: multi-turn clarification with attempt budget penalty |
+| **#3.1 Professional Tasks** | ✅ Core | Invoice + PO + vendor registry + supply chain = real finance operations |
+| **#4 Self-Improvement** | ✅ Core | Regulator → Generator bias → harder adversarial batches → Auditor improves |
+</div>
 ---
+## 👥 Team
+<div align="center">
+| | |
+|:---:|:---:|
+| **Pritam Satpathy** | **Gnana Nawin T** |
+| [🤗 ps2181](https://huggingface.co/ps2181) | [🤗 gnananawin](https://huggingface.co/gnananawin) |
+| Scaler School of Technology | Scaler School of Technology |
+**Meta PyTorch OpenEnv Hackathon — Grand Finale · April 25–26, 2026 · Bangalore**
+</div>
+---
+## 🔗 All Links
+<div align="center">
+| Resource | Link |
+|:---|:---|
+| 🚀 **Live Environment** | https://ps2181-invoice-processing-pipeline.hf.space |
+| 🖥️ **Gradio Demo UI** | https://ps2181-invoice-processing-pipeline.hf.space/web |
+| 📖 **API Documentation** | https://ps2181-invoice-processing-pipeline.hf.space/docs |
+| 🤗 **Extractor Model** | https://huggingface.co/ps2181/extractor-lora-qwen2.5-1.5b |
+| 🕵️ **Auditor Model** | https://huggingface.co/ps2181/auditor-lora-qwen2.5-1.5b |
+| 🏭 **Generator Model** | https://huggingface.co/ps2181/generator-lora-qwen2.5-1.5b |
+| 📓 **Training Colab** | https://colab.research.google.com/drive/1C1_3giNt-NmbzKNFJr5_L1fms3L8LfmB |
+| 💻 **GitHub** | https://github.com/ps2181/invoice-processing-pipeline |
+| 🧩 **OpenEnv Framework** | https://github.com/meta-pytorch/OpenEnv |
+</div>
+---
+<div align="center">
+<img src="https://capsule-render.vercel.app/api?type=waving&color=gradient&customColorList=6,11,20&height=100&section=footer&animation=twinkling" width="100%"/>
+**Built with ❤️ for the Meta PyTorch OpenEnv Hackathon 2026**
+</div>