Spaces: ps2181/invoice-processing-pipeline
ps2181 Claude Sonnet 4.6 committed on Apr 26

Commit fcd74c3

Parent: 3d2e6c1

Fix all submission gaps: openenv.yaml, README captions, baseline table, blog link

- Add long_horizon, personalized, curriculum tasks to openenv.yaml
- Update openenv.yaml description and tags
- Add before/after baseline comparison table to README
- Add one-line captions under all 3 training curve images
- Add BLOG.md link in All Links section
- Fix Blog.md typo → BLOG.md in repo structure

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Files changed (2)

README.md +16 -10
openenv.yaml +26 -7

README.md CHANGED

@@ -160,13 +160,15 @@ Dynamic difficulty also adjusts **within** each task via a rolling 10-episode sc
 All 3 agents trained with **TRL GRPOTrainer + Unsloth** using the deployed HF Space as the live reward verifier — `/grader` endpoint *is* the reward function during training.
 <div align="center">
-| Agent | Baseline | Best Achieved | Notes |
-|:---:|:---:|:---:|:---|
-| 🔍 **Extractor** | 0.10 (random) | **0.914** live grader | Peaked step 15 — above Qwen 72B baseline (0.67) |
-| 🕵️ **Auditor** | 0.01 (dead signal) | **0.719** total reward | Run 1 had episode_id bug; Run 2 → 0.01→0.52 live reward |
-| ⚡ **Generator** | — | Format learned (~0.22) | Plausibility reward improved; evasion had same bug as Run 1 |
 </div>
@@ -175,14 +177,17 @@ All 3 agents trained with **TRL GRPOTrainer + Unsloth** using the deployed HF Sp
 ### Extractor Reward Curve
 ![Extractor Training](https://raw.githubusercontent.com/ps2181/invoice-processing-pipeline/main/assets/reward_curve.png)
-### Auditor Reward Curve (Run 2)
 ![Auditor Training Run 2](https://raw.githubusercontent.com/ps2181/invoice-processing-pipeline/main/assets/auditor_reward_curve_run2.png)
 ### Generator Reward Curve
 ![Generator Training](https://raw.githubusercontent.com/ps2181/invoice-processing-pipeline/main/assets/generator_reward_curve.png)
 ### 🔍 Reward Hacking Caught at Step 10
@@ -416,7 +421,7 @@ invoice-processing-pipeline/
 ├── pyproject.toml                  # Project metadata + dependencies
 ├── requirements.txt                # Runtime dependencies
 ├── validate-submission.sh          # Submission validator script
-├── Blog.md                         # HuggingFace blog post
 └── ROUND2_PROBLEM_STATEMENT.md     # Full problem statement + reward design rationale
 ```
@@ -522,12 +527,13 @@ invoice-processing-pipeline/
 | 🖥️ **Gradio Demo UI** | https://ps2181-invoice-processing-pipeline.hf.space/web |
 | 📖 **API Documentation** | https://ps2181-invoice-processing-pipeline.hf.space/docs |
 | 📊 **Metrics Dashboard** | https://ps2181-invoice-processing-pipeline.hf.space/metrics |
 | 🤗 **Extractor Model** | https://huggingface.co/ps2181/extractor-lora-qwen2.5-1.5b |
 | 🕵️ **Auditor Model** | https://huggingface.co/ps2181/auditor-lora-qwen2.5-1.5b |
 | ⚡ **Generator Model** | https://huggingface.co/ps2181/generator-lora-qwen2.5-1.5b |
-| 📓 **Training Colab(Auditor Agent)** | https://colab.research.google.com/drive/1C1_3giNt-NmbzKNFJr5_L1fms3L8LfmB |
-| 📓 **Training Colab(Extractor Agent)** | https://colab.research.google.com/drive/1fxfBt13LjmT4m98pJq-b5B__1ytFeszK?usp=sharing |
-| 📓 **Training Colab(Generator Agent)** | https://colab.research.google.com/drive/1O293_VBZQCthxlGpgvz5kxoty3zcsWGH?usp=sharing |
 | 💻 **GitHub** | https://github.com/ps2181/invoice-processing-pipeline |
 | 🧩 **OpenEnv Framework** | https://github.com/meta-pytorch/OpenEnv |

 All 3 agents trained with **TRL GRPOTrainer + Unsloth** using the deployed HF Space as the live reward verifier — `/grader` endpoint *is* the reward function during training.
+### Before vs After Training
 <div align="center">
+| Agent | Untrained (random) | Qwen 72B baseline | After GRPO | Improvement |
+|:---:|:---:|:---:|:---:|:---:|
+| 🔍 **Extractor** | 0.10 | 0.67 | **0.914** | +814% vs random |
+| 🕵️ **Auditor** | 0.01 | — | **0.52** live reward | Dead → active signal |
+| ⚡ **Generator** | — | — | **0.22** plausibility | Format & realism learned |
 </div>
 ### Extractor Reward Curve
 ![Extractor Training](https://raw.githubusercontent.com/ps2181/invoice-processing-pipeline/main/assets/reward_curve.png)
+*Left: Total GRPO reward across 4 signals (format + field + math + completeness) over 20 training steps. Right: Live environment grader score peaking at **0.914** — above Qwen 72B baseline (0.67) and untrained 1.5B baseline (0.46).*
+### Auditor Reward Curve (Run 2 — Bug Fixed)
 ![Auditor Training Run 2](https://raw.githubusercontent.com/ps2181/invoice-processing-pipeline/main/assets/auditor_reward_curve_run2.png)
+*Total reward (blue) and live env reward (orange) over 30 steps with ±1 std band. Best total reward: **0.719**. Live env reward rose from 0.01 (dead signal in Run 1) to **0.52** after fixing the episode_id list bug.*
 ### Generator Reward Curve
 ![Generator Training](https://raw.githubusercontent.com/ps2181/invoice-processing-pipeline/main/assets/generator_reward_curve.png)
+*Live evasion reward (red) flat near 0 — Auditor+Approver caught all fraud attempts. Fraud plausibility reward (orange dashed) learned and stable at ~0.20, showing the Generator learned to produce realistic-looking invoices even without successful evasion.*
 ### 🔍 Reward Hacking Caught at Step 10
 ├── pyproject.toml                  # Project metadata + dependencies
 ├── requirements.txt                # Runtime dependencies
 ├── validate-submission.sh          # Submission validator script
+├── BLOG.md                         # HuggingFace blog post
 └── ROUND2_PROBLEM_STATEMENT.md     # Full problem statement + reward design rationale
 ```
 | 🖥️ **Gradio Demo UI** | https://ps2181-invoice-processing-pipeline.hf.space/web |
 | 📖 **API Documentation** | https://ps2181-invoice-processing-pipeline.hf.space/docs |
 | 📊 **Metrics Dashboard** | https://ps2181-invoice-processing-pipeline.hf.space/metrics |
+| 📝 **Blog Post** | https://github.com/ps2181/invoice-processing-pipeline/blob/main/BLOG.md |
 | 🤗 **Extractor Model** | https://huggingface.co/ps2181/extractor-lora-qwen2.5-1.5b |
 | 🕵️ **Auditor Model** | https://huggingface.co/ps2181/auditor-lora-qwen2.5-1.5b |
 | ⚡ **Generator Model** | https://huggingface.co/ps2181/generator-lora-qwen2.5-1.5b |
+| 📓 **Training Colab (Auditor Agent)** | https://colab.research.google.com/drive/1C1_3giNt-NmbzKNFJr5_L1fms3L8LfmB |
+| 📓 **Training Colab (Extractor Agent)** | https://colab.research.google.com/drive/1fxfBt13LjmT4m98pJq-b5B__1ytFeszK?usp=sharing |
+| 📓 **Training Colab (Generator Agent)** | https://colab.research.google.com/drive/1O293_VBZQCthxlGpgvz5kxoty3zcsWGH?usp=sharing |
 | 💻 **GitHub** | https://github.com/ps2181/invoice-processing-pipeline |
 | 🧩 **OpenEnv Framework** | https://github.com/meta-pytorch/OpenEnv |
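
Reviewer sketch: the README says the `/grader` endpoint is the reward function during GRPO training. The Python below shows one way such a reward function could be wired up, assuming completions arrive as plain strings. The request payload (`{"completion": ...}`), the response field (`score`), and the helper names `post_json` and `make_grader_reward` are hypothetical, since the diff does not show the grader's schema. The URL is the Space base URL from the links table joined with `/grader`, which is also an assumption.

```python
import json
from typing import Callable, Sequence
from urllib import request

# ASSUMPTION: Space base URL from the links table plus the /grader path named in the README.
GRADER_URL = "https://ps2181-invoice-processing-pipeline.hf.space/grader"


def post_json(url: str, payload: dict, timeout: float = 30.0) -> dict:
    """POST a JSON payload and decode the JSON response."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def make_grader_reward(
    post: Callable[[str, dict], dict] = post_json,
    url: str = GRADER_URL,
):
    """Build a reward function with the (completions, **kwargs) -> list[float] shape."""

    def grader_reward(completions: Sequence[str], **kwargs) -> list[float]:
        rewards = []
        for completion in completions:
            try:
                # ASSUMPTION: hypothetical request schema and response field.
                reply = post(url, {"completion": completion})
                rewards.append(float(reply.get("score", 0.0)))
            except Exception:
                # A failed or malformed grader call yields zero reward
                # instead of aborting the training step.
                rewards.append(0.0)
        return rewards

    return grader_reward
```

The HTTP call is injected through `post` so the reward logic can be exercised offline with a stub. Mapping grader failures to 0.0 keeps training running but can hide an outage as a "dead signal", so logging those failures would be worth adding in practice.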

openenv.yaml CHANGED

@@ -1,20 +1,24 @@
 name: invoice_processing_pipeline
 version: "1.0.0"
 description: >
-  An OpenEnv environment for training AI agents on real-world invoice processing:
-  data extraction from OCR text, batch cleaning & normalisation, and
-  reconciliation against purchase orders with discrepancy detection.
-author: "OpenEnv Challenge Submission"
 license: "MIT"
 tags:
   - openenv
   - invoice
-  - data-extraction
-  - data-cleaning
-  - reconciliation
   - finance
 environment:
   module: server.app
@@ -58,6 +62,21 @@ tasks:
     description: "Identify quantity shortfalls, price spikes, unauthorized substitutions, and phantom deliveries in a set of supply chain delivery records."
     difficulty: expert
 endpoints:
   reset: /reset
   step: /step

 name: invoice_processing_pipeline
 version: "1.0.0"
 description: >
+  A self-improving 5-agent adversarial RL environment for invoice fraud detection.
+  A cross-episode Regulator monitors the Auditor's blind spots and biases the Generator
+  to produce harder fraud — closing a self-improvement loop without human intervention.
+  10 tasks from easy extraction to 20-step long-horizon investigations and adaptive
+  personalized curricula.
+author: "Pritam Satpathy & Gnana Nawin T"
 license: "MIT"
 tags:
   - openenv
   - invoice
+  - fraud-detection
+  - multi-agent
+  - self-improvement
+  - grpo
   - finance
+  - curriculum
 environment:
   module: server.app
     description: "Identify quantity shortfalls, price spikes, unauthorized substitutions, and phantom deliveries in a set of supply chain delivery records."
     difficulty: expert
+  - id: long_horizon
+    name: "Long-Horizon Financial Investigation"
+    description: "20-step, 4-phase investigation with sparse rewards. Phase 1: extract 3 invoices. Phase 2: reconcile against POs (unlocked). Phase 3: fraud audit (registry unlocked). Phase 4: risk forecast. Each phase must be completed to unlock the next phase's reference data."
+    difficulty: expert
+  - id: personalized
+    name: "Personalized Adaptive Task"
+    description: "Tracks per-field accuracy (vendor, date, math, completeness) across steps and generates the next invoice to target the agent's weakest field. Reward weighted toward the historically weakest category."
+    difficulty: adaptive
+  - id: curriculum
+    name: "Auto-Progressive Curriculum"
+    description: "Automatically progresses the agent through easy→medium→hard→expert based on score. A score ≥0.80 advances to the next stage; a score <0.40 holds the agent back. Up to 20 steps across all stages."
+    difficulty: adaptive
 endpoints:
   reset: /reset
   step: /step
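
Reviewer sketch: the `curriculum` task description gives a threshold rule (advance at a score of 0.80 or above, held back below 0.40, at most 20 steps). The Python below is one possible reading of that rule, not the environment's actual implementation. The names `next_stage` and `run_curriculum` are hypothetical. The YAML does not say how being "held back" differs from scoring between the two thresholds, so by default a low score simply keeps the current stage; the `demote_on_low` flag is an assumption added to show the alternative reading.

```python
from typing import Iterable

STAGES = ("easy", "medium", "hard", "expert")
ADVANCE_AT = 0.80       # from the task description
HOLD_BACK_BELOW = 0.40  # from the task description
MAX_STEPS = 20          # from the task description


def next_stage(stage: str, score: float, demote_on_low: bool = False) -> str:
    """Return the stage to use after observing `score` at `stage`."""
    i = STAGES.index(stage)
    if score >= ADVANCE_AT:
        # Advance one stage, staying at "expert" once it is reached.
        return STAGES[min(i + 1, len(STAGES) - 1)]
    if score < HOLD_BACK_BELOW and demote_on_low:
        # ASSUMPTION: optional reading where "held back" means dropping a stage.
        return STAGES[max(i - 1, 0)]
    # Scores below the advance threshold keep the agent at the current stage.
    return stage


def run_curriculum(scores: Iterable[float], demote_on_low: bool = False) -> list[str]:
    """Replay a score sequence and return the stage before each step plus the final one."""
    stage = STAGES[0]
    visited = [stage]
    for score in list(scores)[:MAX_STEPS]:
        stage = next_stage(stage, score, demote_on_low)
        visited.append(stage)
    return visited
```

For example, `run_curriculum([0.9, 0.9, 0.3, 0.85])` walks from `easy` to `medium` to `hard`, stays at `hard` after the 0.3 score, then reaches `expert`.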