Spaces: ps2181/invoice-processing-pipeline

ps2181 committed on Apr 25

Commit 3391ffe (1 parent: bbe2575)

Rewrite README: blog-style with animations, 10 tasks, new endpoints

README.md +200 -395

@@ -1,522 +1,327 @@
-<div class="card">
-  <div class="card-header">
-    <div class="card-header-dot"></div>
-    <span class="card-header-title"></span>
-  </div>
-  <!-- yaml rows + tag rows + footer badges -->
-</div>
-<div align="center">
-<!-- Animated header banner -->
-<img src="https://capsule-render.vercel.app/api?type=waving&color=gradient&customColorList=6,11,20&height=200&section=header&text=Invoice%20Processing%20Pipeline&fontSize=40&fontColor=fff&animation=twinkling&fontAlignY=35&desc=Self-Improving%20Multi-Agent%20Fraud%20Detection%20%7C%20OpenEnv%20%2B%20GRPO%20%2B%20Qwen2.5&descAlignY=55&descSize=16" width="100%"/>
-<!-- Badges row 1 -->
-<p>
-  <a href="https://ps2181-invoice-processing-pipeline.hf.space/web">
-    <img src="https://img.shields.io/badge/🚀%20Live%20Demo-HuggingFace%20Spaces-FF9D00?style=for-the-badge&logo=huggingface&logoColor=white" />
-  </a>
-  <a href="https://colab.research.google.com/drive/1C1_3giNt-NmbzKNFJr5_L1fms3L8LfmB">
-    <img src="https://img.shields.io/badge/Training%20Colab-Open%20Notebook-F9AB00?style=for-the-badge&logo=googlecolab&logoColor=white" />
-  </a>
-  <a href="https://ps2181-invoice-processing-pipeline.hf.space/docs">
-    <img src="https://img.shields.io/badge/API%20Docs-FastAPI-009688?style=for-the-badge&logo=fastapi&logoColor=white" />
-  </a>
-</p>
-<!-- Badges row 2 -->
-<p>
-  <img src="https://img.shields.io/badge/Framework-OpenEnv-1A356E?style=for-the-badge" />
-  <img src="https://img.shields.io/badge/Model-Qwen2.5--1.5B%20+%20LoRA%20r%3D16-8B1A4E?style=for-the-badge" />
-  <img src="https://img.shields.io/badge/Training-GRPO%20+%20Unsloth-00A67E?style=for-the-badge" />
-  <img src="https://img.shields.io/badge/Agents-5%20Adversarial-E44D26?style=for-the-badge" />
-</p>
-<!-- Badges row 3 -->
-<p>
-  <img src="https://img.shields.io/badge/Tasks-7%20Progressive-6C3483?style=for-the-badge" />
-  <img src="https://img.shields.io/badge/Deployment-Docker%20%7C%20HF%20Spaces-0D1117?style=for-the-badge&logo=docker" />
-  <img src="https://img.shields.io/badge/License-MIT-green?style=for-the-badge" />
-  <img src="https://img.shields.io/badge/Hackathon-Meta%20PyTorch%202026-FF6B35?style=for-the-badge" />
-</p>
-<br/>
-> **Meta PyTorch OpenEnv Hackathon — Grand Finale · April 25–26, 2026**
->
-> Team: **Pritam Satpathy** & **Gnana Nawin T** · Scaler School of Technology, Bangalore
 <br/>
-<!-- Animated typing headline -->
-<a href="https://git.io/typing-svg">
-  <img src="https://readme-typing-svg.demolab.com?font=Fira+Code&weight=600&size=22&pause=1000&color=007A87&center=true&vCenter=true&width=750&lines=5-Agent+Adversarial+Fraud+Detection+System;Self-Improving+via+Cross-Episode+Regulator;GRPO-Trained+LoRA+Agents+on+Live+Environment;Invoice+%E2%86%92+Extract+%E2%86%92+Audit+%E2%86%92+Approve+%E2%86%92+Improve" alt="Typing SVG" />
-</a>
 </div>
 ---
-## 🔥 What Makes This Different
-> Most multi-agent systems are **static pipelines**. Ours **gets harder for itself over time**.
-The system contains a **Predictive Regulator** — a cross-episode meta-agent that monitors the Auditor across 30 rolling episodes, detects fraud types it systematically fails on (**blind spots**), and **automatically biases the Generator** to produce more of exactly those fraud types. No human intervention. No manual curriculum design. The system pressure-tests its own weakest point, every single episode.
 <div align="center">
-<img width="1462" height="731" alt="image" src="https://github.com/user-attachments/assets/7d863b87-1921-45f5-8d94-a06ba3ed6fc1" />
 </div>
 ---
-## ⚡ Three Novel Features
-<table>
-<tr>
-<td width="33%" align="center">
-### 🔮 Predictive Regulator
-Computes **trend slope** over 5-episode windows.<br/>Warns of *emerging* blind spots **before** detection rates cross the critical threshold — proactive oversight, not reactive retraining.
-`+0.15 early-warning bonus`
-</td>
-<td width="33%" align="center">
-### 🧩 Compound Fraud
-Invoices carry **two fraud signals simultaneously** (e.g. phantom vendor + price gouging).<br/>Partial credit `+0.65` for catching one; full reward `+0.99` for both.
-Prevents single-signal heuristics.
-</td>
-<td width="33%" align="center">
-### 📊 Confidence Calibration
-Tracks `(confidence, correct?)` pairs per fraud type.<br/>Detects **overconfident misses** — the Auditor saying "90% sure, approved" on fraud — the most dangerous real-world failure mode.
-</td>
-</tr>
-</table>
----
-## 🤖 Five Agents, One Closed Loop
 <div align="center">
 | Agent | Role | Reward Signal |
 |:---:|:---|:---|
-| 🏭 **Generator** | Creates clean or fraudulent invoices, biased by Regulator blind-spot weights | `+0.85` evades Auditor + Approver · `+0.60` evades Auditor only · `+0.10` caught |
-| 🔍 **Extractor** | Parses raw OCR invoice text → structured JSON | 4 independent signals: format `0.10` · field accuracy `0.40` · math `0.25` · completeness `0.25` |
-| 🕵️ **Auditor** | Classifies each invoice with fraud type + confidence score | `+0.99` correct type · `+0.90` clean clearance · `+0.65` compound (one caught) · `+0.01` miss/FP |
-| ✅ **Approver** | Final approve / escalate / reject (rule-based, confidence-gated) | `≥0.80` confidence → reject · `0.50–0.80` → escalate · approved → approve |
-| 🧠 **Regulator** | Cross-episode meta-agent — 30-episode rolling window, blind-spot tracker | Precision `0.35` + Recall `0.35` + No over-flagging `0.15` + Early warning `0.15` |
 </div>
 ---
-## 🎯 Seven Tasks — Progressive Difficulty
-| # | Task | Difficulty | What the Agent Must Do |
-|:---:|:---|:---:|:---|
-| 1 | `easy` | 🟢 Easy | Extract `vendor`, `date`, `currency`, `total`, `line_items` from a single clean invoice |
-| 2 | `medium` | 🟡 Medium | Clean & normalise a batch: fix date format chaos, vendor typos, currency symbol pollution |
-| 3 | `hard` | 🟠 Hard | Extract + reconcile against purchase orders — flag overcharges, extra items, missing items |
-| 4 | `expert` | 🔴 Expert | Fraud audit using vendor registry, market prices, and invoice history — classify fraud type exactly |
-| 5 | `adversarial` | 🟠 Hard | Ignore SUBTOTAL trap + fake TAX/ADJUSTMENT + FX noise; OCR-corrupted vendor labels |
-| 6 | `negotiate` | 🟡 Medium | Ask clarification questions `{"question": "..."}` then extract; `+15%` bonus for ≤2 questions |
-| 7 | `supply_chain` | 🔴 Expert | Detect `quantity_shortfall`, `price_spike`, `unauthorized_substitution`, `phantom_delivery` |
----
-## 🧠 Trained LoRA Agents
-All three generative agents trained with **GRPO on live environment data** — the HF Space `/grader` endpoint *is* the reward function during training.
 <div align="center">
-| Agent | Base Model | LoRA Config | HuggingFace Hub |
-|:---:|:---|:---:|:---|
-| 🔍 Extractor | Qwen2.5-1.5B-Instruct | r=16, α=16, 4-bit QLoRA | [ps2181/extractor-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/extractor-lora-qwen2.5-1.5b) |
-| 🕵️ Auditor | Qwen2.5-1.5B-Instruct | r=16, α=16, 4-bit QLoRA | [ps2181/auditor-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/auditor-lora-qwen2.5-1.5b) |
-| 🏭 Generator | Qwen2.5-1.5B-Instruct | r=16, α=16, 4-bit QLoRA | [ps2181/generator-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/generator-lora-qwen2.5-1.5b) |
 </div>
-**LoRA target modules:** `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
 ---
-## 📈 Training Results
-### Extractor — GRPO Training Progress
-The model learned to extract structured JSON from noisy invoice text via **reinforcement learning with 4 independent reward signals**, scoring directly against the live environment grader.
-| Step | Total Reward | Env Score | Format | Math Consistency |
-|:---:|:---:|:---:|:---:|:---:|
-| 10 | 2.361 | 0.113 | 0.900 | 0.347 |
-| 20 | 2.595 | 0.282 | 0.900 | 0.413 |
-| 30 | 2.657 | 0.304 | **0.950** | 0.403 |
-> 📊 **Environment score: `0.113 → 0.304` in 30 steps — a 169% improvement** in live-graded extraction accuracy.
-### 🔍 Reward Hacking Caught in Training
-At step 10, we observed the model achieving `math_consistency = 0.97` and `completeness = 1.0` while `field_accuracy = 0.00` — it had learned to output **arithmetically-consistent JSON with entirely hallucinated values**.
-Our 4 **independent** reward signals made this visible immediately. A single aggregated reward would have never surfaced this.
-```
-Step 10 — Reward Hacking Detected:
-  format:             0.10  ✅
-  math_consistency:   0.97  ✅ ← model gaming this signal
-  completeness:       1.00  ✅ ← model gaming this signal
-  field_accuracy:     0.00  ❌ ← hallucinating all values
-  Action: adjusted training emphasis on field_accuracy weight
-  Result: field_accuracy climbed to 0.30+ by step 30
-```
-This is exactly why multiple independent reward signals matter — and why we have 4.
 ---
-## 🎁 Reward Architecture
-### Extractor — 4 Independent Signals
 ```python
-def reward_format(extracted) -> float:              # weight 0.10
-    """Are all 5 required JSON keys present?"""
-def reward_field_accuracy(extracted, gt) -> float:  # weight 0.40
-    """Do vendor / date / currency / total match ground truth?"""
-def reward_math_consistency(extracted) -> float:    # weight 0.25
-    """Does qty × unit_price = amount for every line item?"""
-def reward_completeness(extracted, gt) -> float:    # weight 0.25
-    """Recall: what fraction of expected line items are present?"""
-# All rewards clamped to (0.01, 0.99) — no log(0), no gradient collapse
 ```
-### Auditor Reward
 | Outcome | Reward | Why |
 |:---|:---:|:---|
-| Correct fraud type detected | **0.99** | Incentivises precise classification, not just binary flagging |
-| Clean invoice correctly approved | **0.90** | High reward keeps false-positive rate low |
-| Compound fraud — one of two types caught | **0.65** | Partial credit prevents cliff on hard cases |
-| Fraud flagged but wrong type | **0.50** | Penalises sloppiness; rewards catching *something* |
 | Miss or false positive | **0.01** | Near-zero punishes both failure modes symmetrically |
-### Generator Reward (Adversarial Self-Play)
-| Outcome | Reward |
-|:---|:---:|
-| Fraud evades **both** Auditor and Approver | **0.85** |
-| Auditor misses, Approver catches | **0.60** |
-| Auditor catches it | **0.10** |
-### Regulator Reward
 ```
 Total = Precision(0.35) + Recall(0.35) + No-over-flagging(0.15) + Early-warning-bonus(0.15)
 ```
 ---
-## 🦺 Five Fraud Types
 <div align="center">
-| Type | Detection Method | Example |
-|:---|:---|:---|
-| 🏚️ `phantom_vendor` | Vendor not in Approved Vendor Registry | "QuickSupply Hub" — not in approved list |
-| 💸 `price_gouging` | Unit price > 150% of market ceiling | Laptop at $2,800 when market max is $1,299 |
-| ➕ `math_fraud` | Invoice total ≠ sum of line items | Total $5,200 when items sum to $4,400 |
-| 📋 `duplicate_submission` | Same invoice_id or vendor+date+total already in history | INV-83221 submitted twice |
-| 🔀 `compound_fraud` | Two fraud signals in one invoice | Phantom vendor **AND** price gouging simultaneously |
 </div>
----
-## 🌍 The Regulator in Action
-After each episode, the Regulator publishes a report that the Generator reads to bias its next batch:
-```
-GET /regulator/report
-{
-  "total_audits_recorded": 20,
-  "detection_rates": {
-    "phantom_vendor":        "31%  ⚠ BLIND SPOT (-0.08↓)",
-    "price_gouging":         "74%  ✓ OK (+0.03↑)",
-    "math_fraud":            "81%  ✓ OK (+0.01↑)",
-    "duplicate_submission":  "62%  ⚡ EMERGING (-0.02↓)"
-  },
-  "false_positive_rate": "12%  ✓ OK",
-  "blind_spots": ["phantom_vendor"],
-  "emerging_blind_spots": ["duplicate_submission"],
-  "generator_weights": {
-    "phantom_vendor":       0.30,   ← 3× upweighted (blind spot)
-    "duplicate_submission": 0.20,   ← 2× upweighted (emerging)
-    "price_gouging":        0.125,
-    "math_fraud":           0.125,
-    "compound_fraud":       0.10
-  },
-  "verdict": "Recommend retraining on: phantom_vendor"
-}
-```
----
-## 🚀 Quick Start
-### Try the Live Demo
-```bash
-# Health check
-curl https://ps2181-invoice-processing-pipeline.hf.space/health
-# List all 7 tasks with schemas
-curl https://ps2181-invoice-processing-pipeline.hf.space/tasks
-# Start a single-agent episode
-curl -X POST https://ps2181-invoice-processing-pipeline.hf.space/reset \
-     -H "Content-Type: application/json" \
-     -d '{"task_id": "easy"}'
-# Submit an extraction (replace EPISODE_ID from reset response)
-curl -X POST https://ps2181-invoice-processing-pipeline.hf.space/step \
-     -H "Content-Type: application/json" \
-     -d '{
-       "episode_id": "EPISODE_ID",
-       "extracted_data": {
-         "vendor": "Acme Corp",
-         "date": "2024-08-15",
-         "currency": "USD",
-         "total": 2374.93,
-         "line_items": [
-           {"description": "Laptop Computer", "qty": 2, "unit_price": 1099.99, "amount": 2199.98},
-           {"description": "Wireless Mouse",  "qty": 5, "unit_price":   34.99, "amount":  174.95}
-         ]
-       }
-     }'
-```
-### Run the Multi-Agent Pipeline
-```bash
-# Step 1 — Start 5-agent episode (Generator biased by Regulator)
-curl -X POST https://ps2181-invoice-processing-pipeline.hf.space/multi/reset
-# Step 2 — Score Extractor output (4 signals)
-curl -X POST https://ps2181-invoice-processing-pipeline.hf.space/multi/extract \
-     -H "Content-Type: application/json" \
-     -d '{"episode_id": "EP_ID", "extracted_data": {...}}'
-# Step 3 — Score Auditor output (updates 30-episode tracker)
-curl -X POST https://ps2181-invoice-processing-pipeline.hf.space/multi/audit \
-     -H "Content-Type: application/json" \
-     -d '{"episode_id": "EP_ID", "audit_results": [
-       {"invoice_id": "INV-83221", "verdict": "flagged",
-        "fraud_type": "phantom_vendor", "confidence": 0.87}
-     ]}'
-# Step 4 — Run Approver, compute Generator adversarial reward
-curl -X POST https://ps2181-invoice-processing-pipeline.hf.space/multi/approve \
-     -H "Content-Type: application/json" \
-     -d '{"episode_id": "EP_ID"}'
-# Check Regulator state anytime
-curl https://ps2181-invoice-processing-pipeline.hf.space/regulator/report
-curl https://ps2181-invoice-processing-pipeline.hf.space/regulator/forecast
-curl https://ps2181-invoice-processing-pipeline.hf.space/regulator/calibration
-```
-### Run Training (Google Colab)
-[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1C1_3giNt-NmbzKNFJr5_L1fms3L8LfmB)
-The training loop connects **directly** to the live HF Space environment:
-```
-Colab → /reset (fresh synthetic invoice) → model generates JSON
-      → /grader (scores vs ground truth) → GRPO weight update
-      → repeat 200 steps
-```
 ---
-## 🗂️ Repository Structure
 ```
-invoice-processing-pipeline/
-│
-├── server/
-│   ├── app.py                      # FastAPI — 18 endpoints
-│   ├── environment.py              # 7 tasks · graders · dynamic difficulty
-│   ├── multi_agent_environment.py  # 5-agent system + AuditorPerformanceTracker
-│   ├── agents.py                   # Lazy-loading LoRA inference wrappers
-│   └── web_ui.py                   # Gradio UI (mounted at /web)
-│
-├── models.py                       # Pydantic: Action · Observation · State
-├── inference.py                    # Standalone inference helper
-├── client.py                       # OpenEnv-compatible Python client
-│
-├── extractor_training_grpo.ipynb   # Extractor GRPO training (Unsloth + TRL)
-├── auditor_grpo_training.ipynb     # Auditor GRPO training
-├── generator_grpo_training.ipynb   # Generator GRPO training
-│
-├── openenv.yaml                    # OpenEnv manifest (all 7 tasks declared)
-├── Dockerfile                      # HF Spaces Docker (port 7860, non-root UID 1000)
-├── pyproject.toml                  # Project metadata + dependencies
-├── requirements.txt                # Runtime dependencies
-├── validate-submission.sh          # Submission validator script
-│
-├── ROUND2_PROBLEM_STATEMENT.md     # Full problem statement + reward design rationale
-└── BLOG_DRAFT.md                   # HuggingFace blog post draft
 ```
 ---
-## 🔌 API Reference
 ### Core OpenEnv
 | Endpoint | Method | Description |
 |:---|:---:|:---|
-| `/health` | `GET` | Health check → `{"status": "ok", "active_sessions": N}` |
-| `/tasks` | `GET` | All 7 tasks with descriptions, max_attempts, action/observation schemas |
-| `/reset` | `POST` | Start episode `{"task_id": "easy\|medium\|hard\|expert\|adversarial\|negotiate\|supply_chain"}` |
-| `/step` | `POST` | Submit extraction → reward + feedback + hint + reward_breakdown |
-| `/grader` | `POST` | Score without consuming an attempt (used by training Colab) |
-| `/state` | `GET` | Episode metadata — step_count, done, best_reward, full rewards history |
-| `/ws` | `WS` | Full episode over WebSocket (OpenEnv standard) |
-| `/web` | `GET` | Gradio interactive demo UI |
 ### Multi-Agent
 | Endpoint | Method | Description |
 |:---|:---:|:---|
-| `/multi/reset` | `POST` | Start 5-agent episode — Generator biased by Regulator weights |
-| `/multi/extract` | `POST` | Score Extractor output (4 signals) |
-| `/multi/audit` | `POST` | Score Auditor output, update 30-episode performance tracker |
-| `/multi/approve` | `POST` | Run Approver, compute Generator adversarial reward |
-| `/multi/state/{id}` | `GET` | Full episode state including all agent scores |
 ### Regulator
 | Endpoint | Method | Description |
 |:---|:---:|:---|
-| `/regulator/report` | `GET` | Detection rates, blind spots, calibration, generator weights |
-| `/regulator/forecast` | `GET` | Predictive trend analysis — critical + emerging blind spots with slopes |
-| `/regulator/calibration` | `GET` | Overconfidence / underconfidence per fraud type |
-| `/regulator/predict` | `POST` | Score a Regulator blind-spot prediction |
-| `/regulator/demo_seed` | `POST` | Seed tracker with realistic demo data |
-| `/generator/score` | `POST` | Compute Generator reward given auditor/approver outcomes |
 ---
-## 🏗️ Tech Stack
-<div align="center">
-| Layer | Technology |
-|:---|:---|
-| **Environment** | [OpenEnv](https://github.com/meta-pytorch/OpenEnv) · FastAPI · Pydantic v2 |
-| **UI** | Gradio 4.x (mounted at `/web`) |
-| **Deployment** | Docker · HuggingFace Spaces (vcpu-2 / 8 GB) |
-| **Training** | [TRL GRPOTrainer](https://huggingface.co/docs/trl) · [Unsloth](https://github.com/unslothai/unsloth) |
-| **Model** | `unsloth/Qwen2.5-1.5B-Instruct` · 4-bit QLoRA · r=16 |
-| **Reward** | Live `/grader` endpoint on HF Space as verifier |
-| **Session Mgmt** | Thread-safe `OrderedDict` · 200-session cap · LRU eviction |
-| **Dynamic Difficulty** | Per-task rolling window (maxlen=10) → adjusts OCR intensity, batch size, discrepancy count |
-</div>
----
-## 🔍 Dynamic Difficulty
-The environment adapts generation parameters to the agent's recent performance:
-```python
-if avg_score >= 0.85:   # Agent is doing well → harder
-    n_invoices    = (4, 6)
-    ocr_intensity = 0.55        # heavier corruption
-    n_discrepancies = (3, 5)
-    n_anomalies   = 3
-elif avg_score < 0.60:  # Agent is struggling → easier
-    n_invoices    = (2, 3)
-    ocr_intensity = 0.15
-    n_discrepancies = (1, 2)
-    n_anomalies   = 2
-else:                   # balanced
-    n_invoices    = (3, 5)
-    ocr_intensity = 0.35
-    n_discrepancies = (2, 3)
 ```
 ---
-## 🎭 Theme Alignment
-<div align="center">
-| Theme | Alignment | Evidence |
-|:---:|:---|:---|
-| **#1 Multi-Agent Interactions** | ✅ Core | 5 agents with cooperation, competition, and adversarial self-play |
-| **#1 Fleet AI Scalable Oversight** | ✅ Bonus | Regulator monitors Auditor cross-episode — fully autonomous oversight loop |
-| **#2 Long-Horizon Planning** | ✅ Partial | `negotiate` task: multi-turn clarification with attempt budget penalty |
-| **#3.1 Professional Tasks** | ✅ Core | Invoice + PO + vendor registry + supply chain = real finance operations |
-| **#4 Self-Improvement** | ✅ Core | Regulator → Generator bias → harder adversarial batches → Auditor improves |
-</div>
----
-## 👥 Team
-<div align="center">
-| | |
-|:---:|:---:|
-| **Pritam Satpathy** | **Gnana Nawin T** |
-| [🤗 ps2181](https://huggingface.co/ps2181) | [🤗 gnananawin](https://huggingface.co/gnananawin) |
-| Scaler School of Technology | Scaler School of Technology |
-**Meta PyTorch OpenEnv Hackathon — Grand Finale · April 25–26, 2026 · Bangalore**
-</div>
----
-## 🔗 All Links
 <div align="center">
-| Resource | Link |
 |:---|:---|
-| 🚀 **Live Environment** | https://ps2181-invoice-processing-pipeline.hf.space |
-| 🖥️ **Gradio Demo UI** | https://ps2181-invoice-processing-pipeline.hf.space/web |
-| 📖 **API Documentation** | https://ps2181-invoice-processing-pipeline.hf.space/docs |
-| 🤗 **Extractor Model** | https://huggingface.co/ps2181/extractor-lora-qwen2.5-1.5b |
-| 🕵️ **Auditor Model** | https://huggingface.co/ps2181/auditor-lora-qwen2.5-1.5b |
-| 🏭 **Generator Model** | https://huggingface.co/ps2181/generator-lora-qwen2.5-1.5b |
-| 📓 **Training Colab** | https://colab.research.google.com/drive/1C1_3giNt-NmbzKNFJr5_L1fms3L8LfmB |
-| 💻 **GitHub** | https://github.com/ps2181/invoice-processing-pipeline |
-| 🧩 **OpenEnv Framework** | https://github.com/meta-pytorch/OpenEnv |
 </div>
@@ -524,8 +329,8 @@ else:                   # balanced
 <div align="center">
-<img src="https://capsule-render.vercel.app/api?type=waving&color=gradient&customColorList=6,11,20&height=100&section=footer&animation=twinkling" width="100%"/>
-**Built with ❤️ for the Meta PyTorch OpenEnv Hackathon 2026**
 </div>

+---
+title: Invoice Processing Pipeline
+emoji: 🧾
+colorFrom: blue
+colorTo: indigo
+sdk: docker
+app_port: 7860
+tags:
+  - openenv
+  - multi-agent
+  - grpo
+  - rlhf
+  - fraud-detection
+  - invoice
+---
+<div align="center">
+# 🧾 Invoice Processing Pipeline
+### Self-Improving Adversarial Fraud Detection Environment
+**Meta PyTorch OpenEnv Hackathon · Grand Finale · April 25–26, 2026**
+*Pritam Satpathy & Gnana Nawin T · Scaler School of Technology, Bangalore*
 <br/>
+[![Live Demo](https://img.shields.io/badge/🚀_Live_Demo-HF_Space-yellow)](https://ps2181-invoice-processing-pipeline.hf.space/web)
+[![API Docs](https://img.shields.io/badge/📖_API_Docs-Swagger-blue)](https://ps2181-invoice-processing-pipeline.hf.space/docs)
+[![GitHub](https://img.shields.io/badge/GitHub-invoice--pipeline-black?logo=github)](https://github.com/ps2181/invoice-processing-pipeline)
+> **Primary theme: #4 Self-Improvement · Secondary: #1 Multi-Agent Interactions**
 </div>
 ---
+## The Core Idea
+> *A system that continuously generates harder challenges targeting its own weakest points.*
+Most fraud detection pipelines are static. Ours **gets harder for itself over time**: the Regulator finds where the Auditor keeps failing, the Generator exploits those exact blind spots in the next episode, the Auditor's new mistakes update the Regulator — and the loop closes without any human intervention.
 <div align="center">
+<img width="1710" height="326" alt="5-agent loop" src="https://github.com/user-attachments/assets/319654c3-aa24-47e8-9716-734d4e902168" />
 </div>
 ---
+## 5-Agent Architecture
+```mermaid
+graph LR
+    R[🎯 Regulator\nDetects blind spots\nUpdates weights] -->|bias weights| G[⚡ Generator\nCreates adversarial\ninvoices]
+    G -->|raw invoice text| E[🔍 Extractor\nParses structured\nJSON fields]
+    E -->|structured data| A[🕵️ Auditor\nFlags fraud with\nconfidence scores]
+    A -->|audit results| AP[✅ Approver\nApprove / Escalate\n/ Reject]
+    AP -->|episode outcome| R
+    A -->|missed fraud types| R
+```
 <div align="center">
 | Agent | Role | Reward Signal |
 |:---:|:---|:---|
+| **🎯 Regulator** | Cross-episode oversight: detects Auditor blind spots, reweights Generator | Precision `0.35` + Recall `0.35` + No over-flagging `0.15` + Early warning `0.15` |
+| **⚡ Generator** | Adversary: creates invoices biased toward blind spots | `+0.85` evades both · `+0.60` evades Auditor · `+0.10` caught |
+| **🔍 Extractor** | Parser: text → structured JSON with 4 independent signals | Format `0.10` · Field accuracy `0.40` · Math `0.25` · Completeness `0.25` |
+| **🕵️ Auditor** | Detector: fraud classification with confidence scores | `+0.99` correct type · `+0.90` clean cleared · `+0.01` miss or FP |
+| **✅ Approver** | Gatekeeper: final approve / escalate / reject | Rule-based on confidence threshold |
 </div>
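The Approver's gating and the Generator's adversarial payoff can be sketched as below. The `0.80` and `0.50` cut-offs come from the previous README revision (reject at confidence ≥ 0.80, escalate between 0.50 and 0.80). The function names, the string labels, and the handling of flags below 0.50 are hypothetical and are not taken from the server code.

```python
def approver_decision(verdict: str, confidence: float) -> str:
    """Rule-based gate using the thresholds from the earlier README text."""
    if verdict != "flagged":
        return "approve"      # the Auditor cleared the invoice
    if confidence >= 0.80:
        return "reject"       # high-confidence fraud flag
    if confidence >= 0.50:
        return "escalate"     # uncertain flag goes to review
    return "approve"          # assumption: very weak flags pass through


def generator_reward(auditor_caught: bool, approver_caught: bool) -> float:
    """Adversarial payoff from the Generator row of the table above."""
    if auditor_caught:
        return 0.10           # fraud caught by the Auditor
    if approver_caught:
        return 0.60           # evaded the Auditor only
    return 0.85               # evaded both Auditor and Approver
```

Under this sketch, `approver_decision("flagged", 0.91)` rejects the invoice, which matches the `conf=0.91` case in the sample episode further down.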
 ---
+## Three Novel Features
 <div align="center">
+| Feature | What it does |
+|:---|:---|
+| **🔮 Predictive Regulator** | Computes trend slopes over 5-episode windows — warns of *emerging* blind spots before they go critical |
+| **🧬 Compound Fraud** | Invoices can carry two simultaneous fraud signals (e.g. phantom vendor + price gouging). Partial credit for catching one; full reward for both |
+| **📊 Confidence Calibration** | Tracks (confidence, correct?) pairs per fraud type. Flags *overconfident misses* — the most dangerous Auditor failure mode |
 </div>
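One way to compute a trend slope over a 5-episode window is an ordinary least-squares fit of detection rate against episode index. The sketch below assumes that approach. The window size comes from the table above, while `CRITICAL_RATE`, `WARN_SLOPE`, and the function names are illustrative assumptions rather than the Regulator's actual values.

```python
from statistics import mean

WINDOW = 5             # episodes per trend window (from the table above)
CRITICAL_RATE = 0.50   # assumed: below this, a fraud type is a blind spot
WARN_SLOPE = -0.02     # assumed: a decline at least this steep is "emerging"


def trend_slope(rates: list[float]) -> float:
    """Least-squares slope of detection rate against episode index."""
    window = rates[-WINDOW:]
    if len(window) < 2:
        return 0.0
    xs = range(len(window))
    x_bar, y_bar = mean(xs), mean(window)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, window))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


def classify(rates: list[float]) -> str:
    """Label a fraud type as blind_spot, emerging, or ok."""
    if rates[-1] < CRITICAL_RATE:
        return "blind_spot"
    if trend_slope(rates) <= WARN_SLOPE:
        return "emerging"
    return "ok"
```

A fraud type whose detection rate is still above the critical level but falling steadily is labelled `emerging`, which is the early-warning case the Predictive Regulator is meant to catch.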
 ---
+## 10 Tasks — Progressive Curriculum
+<div align="center">
+| # | Task | What the Agent Faces | Difficulty |
+|:---:|:---|:---|:---:|
+| 1 | `easy` | Single clean invoice — extract 5 fields | Easy |
+| 2 | `medium` | Batch with date chaos, vendor typos, currency noise | Medium |
+| 3 | `hard` | Extraction + PO reconciliation — flag overcharges, missing items | Hard |
+| 4 | `expert` | Full fraud audit across all four fraud types | Expert |
+| 5 | `adversarial` | OCR corruption, SUBTOTAL traps, fake TAX/FX noise lines | Expert |
+| 6 | `negotiate` | Ask clarifying questions first (bonus for ≤2), then extract | Medium |
+| 7 | `supply_chain` | Detect quantity shortfalls, price spikes, phantom deliveries | Expert |
+| 8 | `long_horizon` | 20-step 4-phase investigation: extract → reconcile → audit → risk forecast | Expert |
+| 9 | `personalized` | Adapts to your weak fields — next invoice always targets your worst category | Adaptive |
+| 10 | `curriculum` | Auto-progresses easy→medium→hard→expert based on score (≥0.80 to advance) | Auto |
+</div>
+Dynamic difficulty also adjusts **within** each task using a rolling 10-episode score window: an average score above `0.85` brings heavier OCR corruption, more discrepancies, and deeper traps, while an average below `0.60` eases generation off.
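The rolling window can be sketched with a bounded deque. The class name and tier labels here are hypothetical. The `0.85` and `0.60` thresholds and the window length of 10 come from the text, and the `>=` comparison follows the snippet in the previous README revision.

```python
from collections import deque


class DifficultyTracker:
    """Rolling window of recent episode scores for one task."""

    def __init__(self, maxlen: int = 10):
        self.scores = deque(maxlen=maxlen)   # old scores fall off automatically

    def record(self, score: float) -> None:
        self.scores.append(score)

    def level(self) -> str:
        """Map the rolling average to a difficulty tier."""
        if not self.scores:
            return "balanced"    # assumption: no history means the default tier
        avg = sum(self.scores) / len(self.scores)
        if avg >= 0.85:
            return "harder"      # heavier OCR, more discrepancies
        if avg < 0.60:
            return "easier"      # lighter corruption, fewer discrepancies
        return "balanced"
```

Because the deque is capped at 10 entries, a run of poor scores fully replaces an earlier run of good ones after ten episodes, so the tier reflects only recent performance.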
 ---
+## Reward Architecture
+### 🔍 Extractor — 4 Independent Signals
 ```python
+reward_format(extracted)             # 0.10 — all 5 required JSON keys present?
+reward_field_accuracy(extracted, gt) # 0.40 — vendor / date / currency / total match?
+reward_math_consistency(extracted)   # 0.25 — qty × unit_price = amount per line?
+reward_completeness(extracted, gt)   # 0.25 — all expected line items captured?
+# All clamped to (0.01, 0.99) — no log(0), no gradient collapse at boundaries
 ```
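As a rough illustration of why the breakdown matters, the sketch below combines the four signals as a weighted sum and returns the per-signal contributions alongside the total. The weights are from the block above. The weighted-sum combination, clamping only the total, and the names `WEIGHTS`, `clamp`, and `extractor_reward` are assumptions.

```python
WEIGHTS = {
    "format": 0.10,
    "field_accuracy": 0.40,
    "math_consistency": 0.25,
    "completeness": 0.25,
}


def clamp(x: float, lo: float = 0.01, hi: float = 0.99) -> float:
    """Keep a reward strictly inside (0, 1)."""
    return max(lo, min(hi, x))


def extractor_reward(signals: dict[str, float]) -> tuple[float, dict[str, float]]:
    """Return the clamped total and the weighted per-signal breakdown.

    `signals` holds raw scores in [0, 1]; missing signals count as 0.
    """
    breakdown = {name: WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS}
    return clamp(sum(breakdown.values())), breakdown
```

With raw signals of format 1.0, math consistency 0.97, completeness 1.0, and field accuracy 0.0, the total under these assumptions is about 0.59, which looks like moderate progress, while the breakdown shows a field accuracy contribution of exactly zero.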
+### 🕵️ Auditor
+<div align="center">
 | Outcome | Reward | Why |
 |:---|:---:|:---|
+| Correct fraud type detected | **0.99** | Rewards precise classification, not just flagging |
+| Clean invoice correctly approved | **0.90** | Keeps false-positive rate honest |
+| Compound fraud — one of two types caught | **0.65** | Partial credit on hard cases |
+| Fraud flagged but wrong type | **0.50** | Penalises sloppiness while crediting intent |
 | Miss or false positive | **0.01** | Near-zero punishes both failure modes symmetrically |
+</div>
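The table can be read as a small decision function. The sketch below represents fraud labels as sets, with an empty set meaning a clean invoice or an approval. That representation and the function name are hypothetical, and the reward values are the ones in the table.

```python
def auditor_reward(true_types: set[str], predicted_types: set[str]) -> float:
    """Score one invoice; empty sets mean 'clean' and 'approved'."""
    if not true_types:
        # Clean invoice: approving is right, flagging is a false positive.
        return 0.90 if not predicted_types else 0.01
    if not predicted_types:
        return 0.01               # fraud present but approved: a miss
    hits = true_types & predicted_types
    if hits == true_types:
        return 0.99               # every fraud type identified
    if hits:
        return 0.65               # compound fraud, one of two types caught
    return 0.50                   # flagged, but with the wrong type
```

For a compound invoice such as `{"phantom_vendor", "price_gouging"}`, predicting only `{"price_gouging"}` earns the `0.65` partial credit.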
+### 🎯 Regulator — Cross-Episode
 ```
 Total = Precision(0.35) + Recall(0.35) + No-over-flagging(0.15) + Early-warning-bonus(0.15)
 ```
+The early-warning bonus rewards predictions of *emerging* blind spots — before detection rates cross the critical threshold.
 ---
+## Training Results — GRPO on Live Environment
+All 3 agents were trained with **TRL GRPOTrainer + Unsloth** using the deployed HF Space as the live reward verifier:
 <div align="center">
+| Agent | Baseline | Best Achieved | Notes |
+|:---:|:---:|:---:|:---|
+| **🔍 Extractor** | 0.10 (random) | **0.914** live grader | Peaked at step 15, above the Qwen 72B baseline (0.67) |
+| **🕵️ Auditor** | 0.01 (dead signal) | **0.719** total reward | Run 1 had an `episode_id` bug; in Run 2 the live reward rose from 0.01 to 0.52 |
+| **⚡ Generator** | — | Format learned (~0.22) | Plausibility reward improved; evasion had same bug as Run 1 |
 </div>
+![Extractor Training](https://raw.githubusercontent.com/ps2181/invoice-processing-pipeline/main/assets/reward_curve.png)
+![Auditor Training Run 2](https://raw.githubusercontent.com/ps2181/invoice-processing-pipeline/main/assets/auditor_reward_curve_run2.png)
+![Generator Training](https://raw.githubusercontent.com/ps2181/invoice-processing-pipeline/main/assets/generator_reward_curve.png)
+**Setup:** Qwen2.5-1.5B-Instruct · 4-bit QLoRA r=16 · Unsloth + TRL · Google Colab A100
+### The Reward Hacking We Caught at Step 10
+At step 10 the model had figured out it could score high by producing *arithmetically consistent* JSON while **hallucinating every actual value**:
+```
+math_consistency:  0.97   ✓
+completeness:      1.00   ✓
+field_accuracy:    0.00   ✗  ← vendor, date, total all fabricated
+```
+Without 4 independent signals, a single aggregated reward would have called this success. The independent signals made the failure immediately visible — and diagnosable.
+### Auditor Training Log — Run 2 (exact data)
+<div align="center">
+| Step | Total Reward | Live Env Reward | ±Std |
+|:---:|:---:|:---:|:---:|
+| 5 | 0.4828 | 0.2828 | ±0.194 |
+| 10 | **0.7188** | **0.5188** | ±0.239 |
+| 15 | 0.4538 | 0.2538 | ±0.123 |
+| 20 | 0.5733 | 0.3733 | ±0.212 |
+| 25 | 0.5325 | 0.3325 | ±0.232 |
+| 30 | 0.6038 | 0.4038 | ±0.147 |
+*Run 1 (dead signal): live env reward flat at 0.010 — TRL passes episode_id as a list; old code sent the whole list instead of indexing per completion*
+</div>
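The Run 1 bug is easy to reproduce in miniature. TRL's `GRPOTrainer` calls a custom reward function with `prompts`, `completions`, and each extra dataset column as a keyword argument holding one value per completion. The sketch below shows the per-completion pairing. `score_fn` is a hypothetical stand-in for the HTTP call to the environment, and completions are assumed to be plain strings.

```python
from typing import Callable


def make_env_reward(score_fn: Callable[[str, str], float]):
    """Build a GRPO-style reward function around a grader call.

    `score_fn(episode_id, completion)` is a hypothetical wrapper for the
    request sent to the live environment.
    """
    def reward_func(prompts, completions, episode_id, **kwargs):
        # episode_id is a list aligned with completions, so pair them up
        # instead of sending the whole list with every request.
        return [score_fn(ep, text) for ep, text in zip(episode_id, completions)]

    return reward_func
```

Sending the whole `episode_id` list with every request means the environment never sees a valid ID, which would be consistent with the flat `0.010` reward reported for Run 1.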
+---
+## Trained LoRA Agents
+<div align="center">
+| Agent | HF Hub |
+|:---:|:---|
+| 🔍 Extractor | [ps2181/extractor-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/extractor-lora-qwen2.5-1.5b) |
+| 🕵️ Auditor | [ps2181/auditor-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/auditor-lora-qwen2.5-1.5b) |
+| ⚡ Generator | [ps2181/generator-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/generator-lora-qwen2.5-1.5b) |
+</div>
 ---
+## Sample Multi-Agent Episode
 ```
+━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+  MULTI-AGENT PIPELINE  ·  LIVE EPISODE
+━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+  🎯  REGULATOR  (30-episode rolling window)
+  ────────────────────────────────────────────────
+  phantom_vendor     31%  ⚠ BLIND SPOT  ← prioritised 60%
+  price_gouging      74%  ✓ OK
+  math_fraud         81%  ✓ OK
+  duplicate          62%  ✓ OK
+  ⚡  GENERATOR  (Qwen2.5 LoRA)
+  ────────────────────────────────────────────────
+  Fraud focus : phantom_vendor (60% Regulator weight)
+  Vendor      : ShadowByte Technologies  ← not in registry
+  🔍  EXTRACTOR  (Qwen2.5 LoRA)
+  ────────────────────────────────────────────────
+  Reward : 0.847  [format 0.10 · field 0.38 · math 0.25 · completeness 0.12]
+  🕵️  AUDITOR  (Qwen2.5 LoRA)
+  ────────────────────────────────────────────────
+  INV-85529  →  🚨 FLAGGED  [PHANTOM VENDOR]  conf=0.91
+  INV-85530  →  ✅ APPROVED                   conf=0.88
+  ✅  APPROVER
+  ────────────────────────────────────────────────
+  INV-85529  →  ❌ REJECT
+  Generator reward : 0.60  (evaded Auditor on 1/3, Approver caught)
+  🎯  REGULATOR UPDATE
+  ────────────────────────────────────────────────
+  phantom_vendor detection: 31% → 45%  ↑ improving
+━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 ```
 ---
+## API Reference
 ### Core OpenEnv
 | Endpoint | Method | Description |
 |:---|:---:|:---|
+| `/reset` | POST | Start episode (`{"task_id": "easy\|medium\|hard\|...\|curriculum"}`) |
+| `/step` | POST | Submit extracted data, get reward + feedback |
+| `/grader` | POST | Score without modifying state (training reward signal) |
+| `/state` | GET | Episode metadata |
+| `/health` | GET | Health check + active session count |
+| `/metrics` | GET | Per-task episode counts, avg/best scores, Regulator state |
+| `/tasks` | GET | Full task list with schemas |
+| `/ws` | WS | WebSocket interface |
 ### Multi-Agent
 | Endpoint | Method | Description |
 |:---|:---:|:---|
+| `/multi/reset` | POST | Start 5-agent episode, Generator biased by Regulator |
+| `/multi/extract` | POST | Score Extractor output (4 signals) |
+| `/multi/audit` | POST | Score Auditor output, update tracker |
+| `/multi/approve` | POST | Run Approver, compute Generator adversarial reward |
+| `/generator/score` | POST | Direct Generator scoring through Auditor+Approver pipeline |
 ### Regulator
 | Endpoint | Method | Description |
 |:---|:---:|:---|
+| `/regulator/report` | GET | Detection rates, blind spots, generator weights |
+| `/regulator/forecast` | GET | Trend slopes + emerging blind spot warnings |
+| `/regulator/calibration` | GET | Confidence calibration per fraud type |
+| `/regulator/predict` | POST | Score Regulator blind spot predictions |
 ---
+## Quick Start
+```bash
+# Health check
+curl https://ps2181-invoice-processing-pipeline.hf.space/health
+# Environment-wide metrics
+curl https://ps2181-invoice-processing-pipeline.hf.space/metrics
+# Auto-progressive curriculum episode
+curl -X POST https://ps2181-invoice-processing-pipeline.hf.space/reset \
+  -H "Content-Type: application/json" -d '{"task_id": "curriculum"}'
+# Start multi-agent episode
+curl -X POST https://ps2181-invoice-processing-pipeline.hf.space/multi/reset
+# Regulator blind spot report
+curl https://ps2181-invoice-processing-pipeline.hf.space/regulator/report
 ```
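The same `/reset` call can be made from Python with only the standard library. This is a minimal sketch: the helper names are hypothetical, and decoding the response as JSON is an assumption about the server's reply.

```python
import json
import urllib.request

BASE_URL = "https://ps2181-invoice-processing-pipeline.hf.space"


def build_reset_request(task_id: str) -> urllib.request.Request:
    """Build (but do not send) the POST /reset request for a task."""
    body = json.dumps({"task_id": task_id}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/reset",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send a request and decode the response body as JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `send(build_reset_request("curriculum"))` would start a curriculum episode, provided the Space is running and reachable.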
 ---
+## Theme Alignment
 <div align="center">
+| Theme | Alignment |
 |:---|:---|
+| **#4 Self-Improvement** (primary) | Regulator detects blind spots → Generator biases toward them → Auditor improves → loop repeats |
+| **#1 Multi-Agent Interactions** | 5 agents with conflicting incentives (Generator vs Auditor adversarial self-play) |
+| **#1 Fleet AI Scalable Oversight** (bonus) | Regulator monitors Auditor cross-episode with predictive trend detection |
+| **#3.1 Professional Tasks** | Invoice fraud detection = core enterprise financial workflow |
 </div>
 <div align="center">
+*Built for the Meta PyTorch OpenEnv Hackathon 2026.*
+**Pritam Satpathy & Gnana Nawin T · Scaler School of Technology · Bangalore**
 </div>