---
title: Invoice Processing Pipeline
emoji: 🧾
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
tags:
  - openenv
  - multi-agent
  - grpo
  - rl
short_description: 5-agent adversarial fraud detection RL environment
---


> **Meta PyTorch OpenEnv Hackathon, Grand Finale · April 25–26, 2026**
>
> Team: **Pritam Satpathy** & **Gnana Nawin T** · VIT, Vellore

---

## 🔥 The Core Idea

> *A system that continuously generates harder challenges targeting its own weakest points.*

Most fraud detection pipelines are **static**. Ours **gets harder for itself over time**: the Regulator finds where the Auditor keeps failing, the Generator exploits those exact blind spots in the next episode, and the Auditor's new mistakes update the Regulator. The loop closes without any human intervention.

**Primary theme: #4 Self-Improvement · Secondary: #1 Multi-Agent Interactions**

*Figure: 5-agent self-improvement loop.*

---

## 🤖 5-Agent Architecture

```
🎯 Regulator ──bias weights──────────────────────────────────► ⚡ Generator
      ▲                                                              │
      │                                                              │ raw invoice text
      │ missed fraud types                                           ▼
      │                                                        🔍 Extractor
      │                                                              │
      │                                                              │ structured data
      │                                                              ▼
      └──── episode outcome ──── ✅ Approver ◄─audit results─── 🕵️ Auditor
```

| Agent | Role | Reward Signal |
|:---:|:---|:---|
| 🎯 **Regulator** | Cross-episode oversight: detects Auditor blind spots, reweights Generator | Precision `0.35` + Recall `0.35` + No over-flagging `0.15` + Early warning `0.15` |
| ⚡ **Generator** | Adversary: creates invoices biased toward blind spots | `+0.85` evades both · `+0.60` evades Auditor · `+0.10` caught |
| 🔍 **Extractor** | Parser: text → structured JSON with 4 independent signals | Format `0.10` · Field accuracy `0.40` · Math `0.25` · Completeness `0.25` |
| 🕵️ **Auditor** | Detector: fraud classification with confidence scores | `+0.99` correct type · `+0.90` clean cleared · `+0.01` miss or FP |
| ✅ **Approver** | Gatekeeper: final approve / escalate / reject | `≥0.80` → reject · `0.50–0.80` → escalate · `<0.50` → approve |
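
The Approver thresholds in the table can be read as a small decision function. The sketch below is illustrative only: the function name `approver_decision` is hypothetical, it assumes the thresholds apply to the Auditor's fraud confidence, and it treats exactly `0.80` as a reject because the table lists `≥0.80` first.

```python
def approver_decision(fraud_confidence: float) -> str:
    """Map a fraud confidence in [0, 1] to a final action.

    Thresholds are taken from the agent table; the name and the
    boundary handling at exactly 0.80 are assumptions.
    """
    if not 0.0 <= fraud_confidence <= 1.0:
        raise ValueError("fraud_confidence must be in [0, 1]")
    if fraud_confidence >= 0.80:
        return "reject"      # high confidence of fraud
    if fraud_confidence >= 0.50:
        return "escalate"    # uncertain band goes to a human or stricter check
    return "approve"         # low confidence of fraud


if __name__ == "__main__":
    for c in (0.91, 0.62, 0.12):
        print(c, "->", approver_decision(c))
```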

---

## ⚡ Three Novel Features

### 🔮 Predictive Regulator

Computes **trend slopes** over 5-episode windows. Warns of *emerging* blind spots **before** detection rates cross the critical threshold: proactive oversight, not reactive retraining. `+0.15 early-warning bonus`
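
One way to picture the trend-slope idea is an ordinary least-squares slope over the last five detection rates, combined with a simple linear projection. This is a sketch under stated assumptions, not the project's actual Regulator code: the function names, the `critical=0.50` threshold, and the `horizon=5` projection length are all hypothetical, since the README does not state them.

```python
from typing import Sequence


def trend_slope(rates: Sequence[float]) -> float:
    """Least-squares slope of detection rate against episode index."""
    n = len(rates)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(rates) / n
    cov = sum((i - mean_x) * (r - mean_y) for i, r in enumerate(rates))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def is_emerging_blind_spot(
    rates: Sequence[float],
    window: int = 5,          # 5-episode window, as described above
    critical: float = 0.50,   # hypothetical critical threshold
    horizon: int = 5,         # hypothetical look-ahead in episodes
) -> bool:
    """Flag a fraud type that is still above the threshold but trending below it."""
    recent = list(rates)[-window:]
    if len(recent) < window:
        return False
    slope = trend_slope(recent)
    current = recent[-1]
    projected = current + slope * horizon
    return current >= critical and slope < 0 and projected < critical
```

With these assumed values, a series such as `[0.70, 0.66, 0.62, 0.58, 0.54]` is flagged as emerging, while a flat or rising series is not.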

### 🧬 Compound Fraud

Invoices carry **two fraud signals simultaneously** (e.g. phantom vendor + price gouging). Partial credit `+0.65` for catching one; full reward `+0.99` for both. This prevents single-signal heuristics.

### 📊 Confidence Calibration

Tracks `(confidence, correct?)` pairs per fraud type. Detects **overconfident misses** (the Auditor saying "90% sure, approved" on fraud), the most dangerous real-world failure mode.
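
A minimal sketch of such a tracker is shown below. It is not the project's implementation: the class name `CalibrationTracker`, the `0.80` overconfidence cut-off, and the "mean confidence minus accuracy" gap are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class CalibrationTracker:
    """Stores (confidence, correct?) pairs per fraud type."""

    def __init__(self, overconfidence_threshold: float = 0.80) -> None:
        # Hypothetical cut-off above which a wrong answer counts as overconfident.
        self.threshold = overconfidence_threshold
        self._pairs: Dict[str, List[Tuple[float, bool]]] = defaultdict(list)

    def record(self, fraud_type: str, confidence: float, correct: bool) -> None:
        self._pairs[fraud_type].append((confidence, correct))

    def overconfident_misses(self, fraud_type: str) -> int:
        """Count wrong decisions made with confidence at or above the threshold."""
        return sum(
            1
            for conf, ok in self._pairs.get(fraud_type, [])
            if conf >= self.threshold and not ok
        )

    def calibration_gap(self, fraud_type: str) -> float:
        """Mean confidence minus accuracy; positive values suggest overconfidence."""
        pairs = self._pairs.get(fraud_type, [])
        if not pairs:
            return 0.0
        mean_conf = sum(conf for conf, _ in pairs) / len(pairs)
        accuracy = sum(1 for _, ok in pairs if ok) / len(pairs)
        return mean_conf - accuracy
```

Recording `("phantom_vendor", 0.9, False)` and `("phantom_vendor", 0.6, True)` gives one overconfident miss and a gap of `0.25` for that fraud type.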

---

## 🎯 10 Tasks: Progressive Curriculum

| # | Task | What the Agent Faces | Difficulty |
|:---:|:---|:---|:---:|
| 1 | `easy` | Single clean invoice: extract 5 fields | 🟢 Easy |
| 2 | `medium` | Batch with date chaos, vendor typos, currency noise | 🟡 Medium |
| 3 | `hard` | Extraction + PO reconciliation: flag overcharges, missing items | 🟠 Hard |
| 4 | `expert` | Full fraud audit across all four fraud types | 🔴 Expert |
| 5 | `adversarial` | OCR corruption, SUBTOTAL traps, fake TAX/FX noise lines | 🔴 Expert |
| 6 | `negotiate` | Ask clarifying questions first (bonus for ≤2), then extract | 🟡 Medium |
| 7 | `supply_chain` | Detect quantity shortfalls, price spikes, phantom deliveries | 🔴 Expert |
| 8 | `long_horizon` | 20-step 4-phase investigation: extract → reconcile → audit → risk forecast | 🔴 Expert |
| 9 | `personalized` | Adapts to your weak fields: the next invoice always targets your worst category | 🔄 Adaptive |
| 10 | `curriculum` | Auto-progresses easy→medium→hard→expert based on score (≥0.80 to advance) | 🔄 Auto |
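
The `curriculum` row above describes a simple promotion rule. As a rough sketch (the names `LEVELS` and `next_level` are hypothetical, and the real environment may average several episodes before advancing), it could look like this:

```python
LEVELS = ["easy", "medium", "hard", "expert"]


def next_level(current: str, score: float, advance_at: float = 0.80) -> str:
    """Advance one level when the score reaches the threshold from the table.

    Assumption: a single score decides promotion and there is no demotion.
    """
    idx = LEVELS.index(current)  # raises ValueError for an unknown level
    if score >= advance_at and idx < len(LEVELS) - 1:
        return LEVELS[idx + 1]
    return current
```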

Dynamic difficulty also adjusts **within** each task via a rolling 10-episode score window: a score above `0.85` brings heavier OCR noise, more discrepancies, and deeper traps; a drop below `0.60` eases the difficulty off.

---

## 📈 Training Results: GRPO on Live Environment

All 3 agents were trained with **TRL GRPOTrainer + Unsloth**, using the deployed HF Space as the live reward verifier: the `/grader` endpoint *is* the reward function during training.

### Before vs After Training

| Agent | Untrained (random) | Qwen 72B baseline | After GRPO | Improvement |
|:---:|:---:|:---:|:---:|:---:|
| 🔍 **Extractor** | 0.10 | 0.67 | **0.914** | +714% vs random |
| 🕵️ **Auditor** | 0.01 | - | **0.52** live reward | Dead → active signal |
| ⚡ **Generator** | - | - | **0.22** plausibility | Format & realism learned |

**Setup:** Qwen2.5-1.5B-Instruct · 4-bit QLoRA r=16 · Unsloth + TRL · Google Colab A100

### Extractor Reward Curve

![Extractor Training](https://raw.githubusercontent.com/ps2181/invoice-processing-pipeline/main/assets/reward_curve.png)

*X-axis: training step (1–20) · Y-axis: reward (0–1). Left: total GRPO reward across 4 independent signals (format 0.10 + field accuracy 0.40 + math 0.25 + completeness 0.25). Right: live `/grader` score peaking at **0.914**, above the Qwen 72B baseline (0.67) and the untrained 1.5B baseline (0.46).*

### Auditor Reward Curve (Run 2, Bug Fixed)

![Auditor Training Run 2](https://raw.githubusercontent.com/ps2181/invoice-processing-pipeline/main/assets/auditor_reward_curve_run2.png)

*X-axis: training step (1–30) · Y-axis: reward (0–1). Total reward (blue) and live env reward (orange) with ±1 std band. Best total: **0.719** at step 10. Live env reward climbed from 0.01 (dead signal, Run 1) to **0.52** after fixing the TRL `episode_id` list indexing bug.*

### Generator Reward Curve

![Generator Training](https://raw.githubusercontent.com/ps2181/invoice-processing-pipeline/main/assets/generator_reward_curve.png)

*X-axis: training step (1–30) · Y-axis: reward (0–1). Live evasion reward (red) stays flat near 0: Auditor + Approver caught all fraud attempts. Fraud plausibility reward (orange dashed) is stable at ~0.20: the Generator learned realistic invoice structure even without successful evasion.*

### 🔍 Reward Hacking Caught at Step 10

At step 10 the model achieved `math_consistency = 0.97` and `completeness = 1.0` while `field_accuracy = 0.00`. It had learned to output **arithmetically consistent JSON with entirely hallucinated values**:

```
Step 10 - Reward Hacking Detected:
  format:            0.10 ✅
  math_consistency:  0.97 ✅  ← model gaming this signal
  completeness:      1.00 ✅  ← model gaming this signal
  field_accuracy:    0.00 ❌  ← hallucinating all values

Action: adjusted training emphasis on field_accuracy weight
Result: field_accuracy climbed to 0.30+ by step 30
```

Without 4 independent signals, a single aggregated reward would have called this success. **Independent signals are diagnostics, not just incentives.**

### Auditor Training: Run 2 (exact data)

| Step | Total Reward | Live Env Reward | ±Std |
|:---:|:---:|:---:|:---:|
| 5 | 0.4828 | 0.2828 | ±0.194 |
| 10 | **0.7188** | **0.5188** | ±0.239 |
| 15 | 0.4538 | 0.2538 | ±0.123 |
| 20 | 0.5733 | 0.3733 | ±0.212 |
| 25 | 0.5325 | 0.3325 | ±0.232 |
| 30 | 0.6038 | 0.4038 | ±0.147 |

*Run 1 (dead signal): live env reward flat at 0.010. TRL passes `episode_id` as a list; the old code sent the whole list instead of indexing per completion.*
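
The Run 1 bug comes from how batched reward functions receive their inputs: extra dataset columns arrive as lists aligned with `completions`, so each completion needs its own element. The sketch below illustrates the per-completion indexing under assumptions: `make_env_reward` and `score_fn` are hypothetical names, completions are assumed to be plain strings, and `score_fn` stands in for whatever call reaches the live `/grader` endpoint.

```python
from typing import Callable, List, Sequence


def make_env_reward(
    score_fn: Callable[[str, str], float],
) -> Callable[..., List[float]]:
    """Wrap a per-episode scorer as a batched reward function.

    `score_fn(episode_id, completion)` is a placeholder for the real
    environment call; it is not part of TRL or of this repository.
    """

    def env_reward(
        completions: Sequence[str],
        episode_id: Sequence[str],
        **kwargs,  # prompts and other dataset columns are ignored here
    ) -> List[float]:
        # Pair each completion with its own episode id. Passing the whole
        # `episode_id` list to every call is the failure described above.
        return [
            score_fn(ep_id, completion)
            for completion, ep_id in zip(completions, episode_id)
        ]

    return env_reward
```

A quick check with a stub scorer that records its arguments confirms that every call receives a single id rather than the list.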

---

## 🎁 Reward Architecture

### 🔍 Extractor: 4 Independent Signals

```python
reward_format(extracted)              # 0.10 - all 5 required JSON keys present?
reward_field_accuracy(extracted, gt)  # 0.40 - vendor / date / currency / total match?
reward_math_consistency(extracted)    # 0.25 - qty × unit_price = amount per line?
reward_completeness(extracted, gt)    # 0.25 - all expected line items captured?
# All clamped to (0.01, 0.99) - no log(0), no gradient collapse at boundaries
```

### 🕵️ Auditor

| Outcome | Reward | Why |
|:---|:---:|:---|
| Correct fraud type detected | **0.99** | Rewards precise classification, not just binary flagging |
| Clean invoice correctly approved | **0.90** | Keeps false-positive rate honest |
| Compound fraud, one of two types caught | **0.65** | Partial credit prevents cliff on hard cases |
| Fraud flagged but wrong type | **0.50** | Penalises sloppiness; rewards catching *something* |
| Miss or false positive | **0.01** | Near-zero punishes both failure modes symmetrically |
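
The table maps onto a short scoring function. The version below is a sketch, not the environment's grader: representing labels as sets of fraud-type strings, the name `auditor_reward`, and the choice to give full credit when all true types are caught even if an extra type is also flagged are assumptions.

```python
from typing import Set


def auditor_reward(true_types: Set[str], predicted_types: Set[str]) -> float:
    """Score one audit decision using the reward values from the table."""
    if not true_types:
        # Clean invoice: cleared correctly, or a false positive.
        return 0.90 if not predicted_types else 0.01
    if not predicted_types:
        return 0.01  # fraud present but nothing flagged: a miss
    hits = true_types & predicted_types
    if hits == true_types:
        return 0.99  # every true fraud type identified
    if hits and len(true_types) > 1:
        return 0.65  # compound fraud, one of two types caught
    return 0.50      # something flagged, but the wrong type


if __name__ == "__main__":
    compound = {"phantom_vendor", "price_gouging"}
    print(auditor_reward(compound, {"phantom_vendor"}))
```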

### ⚡ Generator (Adversarial Self-Play)

| Outcome | Reward |
|:---|:---:|
| Fraud evades **both** Auditor and Approver | **0.85** |
| Auditor misses, Approver catches | **0.60** |
| Auditor catches it | **0.10** |

### 🎯 Regulator: Cross-Episode

```
Total = Precision(0.35) + Recall(0.35) + No-over-flagging(0.15) + Early-warning-bonus(0.15)
```

The early-warning bonus rewards predictions of *emerging* blind spots, made before detection rates cross the critical threshold.

---

## 🧠 Trained LoRA Agents

| Agent | Base Model | LoRA Config | HuggingFace Hub |
|:---:|:---|:---:|:---|
| 🔍 Extractor | Qwen2.5-1.5B-Instruct | r=16, α=16, 4-bit QLoRA | [ps2181/extractor-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/extractor-lora-qwen2.5-1.5b) |
| 🕵️ Auditor | Qwen2.5-1.5B-Instruct | r=16, α=16, 4-bit QLoRA | [ps2181/auditor-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/auditor-lora-qwen2.5-1.5b) |
| ⚡ Generator | Qwen2.5-1.5B-Instruct | r=16, α=16, 4-bit QLoRA | [ps2181/generator-lora-qwen2.5-1.5b](https://huggingface.co/ps2181/generator-lora-qwen2.5-1.5b) |

**LoRA target modules:** `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`

---

## 🌐 The Regulator in Action

After each episode, the Regulator publishes a report the Generator uses to bias its next batch:

```
GET /regulator/report

{
  "total_audits_recorded": 20,
  "detection_rates": {
    "phantom_vendor":       "31% ⚠ BLIND SPOT (-0.08↓)",
    "price_gouging":        "74% ✓ OK (+0.03↑)",
    "math_fraud":           "81% ✓ OK (+0.01↑)",
    "duplicate_submission": "62% ⚡ EMERGING (-0.02↓)"
  },
  "blind_spots": ["phantom_vendor"],
  "emerging_blind_spots": ["duplicate_submission"],
  "generator_weights": {
    "phantom_vendor":       0.30,   ← 3× upweighted (blind spot)
    "duplicate_submission": 0.20,   ← 2× upweighted (emerging)
    "price_gouging":        0.125,
    "math_fraud":           0.125,
    "compound_fraud":       0.10
  },
  "verdict": "Recommend retraining on: phantom_vendor"
}
```

---

## 🎭 Sample Multi-Agent Episode

```
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  MULTI-AGENT PIPELINE · LIVE EPISODE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

🎯 REGULATOR (30-episode rolling window)
────────────────────────────────────────────────
  phantom_vendor   31%  ⚠ BLIND SPOT  ← prioritised 60%
  price_gouging    74%  ✓ OK
  math_fraud       81%  ✓ OK
  duplicate        62%  ✓ OK

⚡ GENERATOR (Qwen2.5 LoRA)
────────────────────────────────────────────────
  Fraud focus : phantom_vendor (60% Regulator weight)
  Vendor      : ShadowByte Technologies  ← not in registry

🔍 EXTRACTOR (Qwen2.5 LoRA)
────────────────────────────────────────────────
  Reward : 0.847  [format 0.10 · field 0.38 · math 0.25 · completeness 0.12]

🕵️ AUDITOR (Qwen2.5 LoRA)
────────────────────────────────────────────────
  INV-85529 → 🚨 FLAGGED [PHANTOM VENDOR] conf=0.91
  INV-85530 → ✅ APPROVED conf=0.88

✅ APPROVER
────────────────────────────────────────────────
  INV-85529 → ❌ REJECT
  Generator reward : 0.60 (evaded Auditor on 1/3, Approver caught)

🎯 REGULATOR UPDATE
────────────────────────────────────────────────
  phantom_vendor detection: 31% → 45% ↑ improving
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```

---

## 🚀 Quick Start

```bash
# Health check
curl https://ps2181-invoice-processing-pipeline.hf.space/health

# Environment-wide metrics
curl https://ps2181-invoice-processing-pipeline.hf.space/metrics

# Auto-progressive curriculum episode
curl -X POST https://ps2181-invoice-processing-pipeline.hf.space/reset \
  -H "Content-Type: application/json" -d '{"task_id": "curriculum"}'

# Start multi-agent episode
curl -X POST https://ps2181-invoice-processing-pipeline.hf.space/multi/reset

# Regulator blind spot report
curl https://ps2181-invoice-processing-pipeline.hf.space/regulator/report
```

### Run Training (Google Colab)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1C1_3giNt-NmbzKNFJr5_L1fms3L8LfmB)

```
Colab → /reset  (fresh synthetic invoice from live environment)
      → model generates JSON
      → /grader scores against ground truth
      → GRPO updates weights toward higher-reward completions
      → repeat 200 steps
```

---

## 🗂️ Repository Structure

```
invoice-processing-pipeline/
│
├── server/
│   ├── app.py                        # FastAPI - 18 endpoints
│   ├── environment.py                # 10 tasks · graders · dynamic difficulty
│   ├── multi_agent_environment.py    # 5-agent system + AuditorPerformanceTracker
│   ├── agents.py                     # Lazy-loading LoRA inference wrappers
│   └── web_ui.py                     # Gradio UI (mounted at /web)
│
├── models.py                         # Pydantic: Action · Observation · State
├── inference.py                      # Standalone inference helper
├── client.py                         # OpenEnv-compatible Python client
│
├── extractor_training_grpo.ipynb     # 🔥 Extractor GRPO training (Unsloth + TRL)
├── auditor_grpo_training.ipynb       # 🔥 Auditor GRPO training
├── generator_grpo_training.ipynb     # 🔥 Generator GRPO training
│
├── assets/
│   ├── reward_curve.png              # Extractor training curve
│   ├── auditor_reward_curve_run2.png
│   └── generator_reward_curve.png
│
├── openenv.yaml                      # OpenEnv manifest (all tasks declared)
├── Dockerfile                        # HF Spaces Docker (port 7860, non-root UID 1000)
├── pyproject.toml                    # Project metadata + dependencies
├── requirements.txt                  # Runtime dependencies
├── validate-submission.sh            # Submission validator script
├── BLOG.md                           # HuggingFace blog post
└── ROUND2_PROBLEM_STATEMENT.md       # Full problem statement + reward design rationale
```

---

## 🔌 API Reference

### Core OpenEnv

| Endpoint | Method | Description |
|:---|:---:|:---|
| `/health` | `GET` | Health check → `{"status": "ok", "active_sessions": N}` |
| `/tasks` | `GET` | All tasks with descriptions, schemas, difficulty levels |
| `/metrics` | `GET` | Per-task episode counts, avg/best scores, Regulator state |
| `/reset` | `POST` | Start episode `{"task_id": "easy\|medium\|...\|curriculum"}` |
| `/step` | `POST` | Submit extraction → reward + feedback + hint + reward_breakdown |
| `/grader` | `POST` | Score without consuming an attempt (training reward signal) |
| `/state` | `GET` | Episode metadata: step_count, done, best_reward, history |
| `/ws` | `WS` | Full episode over WebSocket (OpenEnv standard) |
| `/web` | `GET` | Gradio interactive demo UI |

### Multi-Agent

| Endpoint | Method | Description |
|:---|:---:|:---|
| `/multi/reset` | `POST` | Start 5-agent episode, with the Generator biased by Regulator weights |
| `/multi/extract` | `POST` | Score Extractor output (4 independent signals) |
| `/multi/audit` | `POST` | Score Auditor output, update 30-episode performance tracker |
| `/multi/approve` | `POST` | Run Approver, compute Generator adversarial reward |
| `/multi/state/{id}` | `GET` | Full episode state including all agent scores |
| `/generator/score` | `POST` | Direct Generator scoring through Auditor+Approver pipeline |

### Regulator

| Endpoint | Method | Description |
|:---|:---:|:---|
| `/regulator/report` | `GET` | Detection rates, blind spots, calibration, generator weights |
| `/regulator/forecast` | `GET` | Trend slopes + emerging blind spot warnings with episode countdown |
| `/regulator/calibration` | `GET` | Overconfidence / underconfidence per fraud type |
| `/regulator/predict` | `POST` | Score a Regulator blind-spot prediction |
| `/regulator/demo_seed` | `POST` | Seed tracker with realistic demo data |

---

## 🏗️ Tech Stack

| Layer | Technology |
|:---|:---|
| **Environment** | [OpenEnv](https://github.com/meta-pytorch/OpenEnv) · FastAPI · Pydantic v2 |
| **UI** | Gradio 4.x (mounted at `/web`) |
| **Deployment** | Docker · HuggingFace Spaces (vcpu-2 / 8 GB) |
| **Training** | [TRL GRPOTrainer](https://huggingface.co/docs/trl) · [Unsloth](https://github.com/unslothai/unsloth) |
| **Model** | `unsloth/Qwen2.5-1.5B-Instruct` · 4-bit QLoRA · r=16 · A100 |
| **Reward** | Live `/grader` endpoint on HF Space as verifier |
| **Session Mgmt** | Thread-safe `OrderedDict` · 200-session cap · LRU eviction |
| **Dynamic Difficulty** | Per-task rolling window (maxlen=10) → adjusts OCR intensity, batch size, discrepancy count |
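
The session-management row can be pictured as a lock-guarded `OrderedDict` that evicts the least recently used entry once the cap is reached. This is a sketch of that general pattern, not the server's code: the class name `SessionStore` and its method names are hypothetical, and only the 200-session cap comes from the table.

```python
import threading
from collections import OrderedDict
from typing import Any, Optional


class SessionStore:
    """Thread-safe session map with a size cap and LRU eviction."""

    def __init__(self, max_sessions: int = 200) -> None:
        self._max = max_sessions
        self._lock = threading.Lock()
        self._sessions: "OrderedDict[str, Any]" = OrderedDict()

    def put(self, session_id: str, state: Any) -> None:
        with self._lock:
            self._sessions[session_id] = state
            self._sessions.move_to_end(session_id)  # mark as most recent
            while len(self._sessions) > self._max:
                self._sessions.popitem(last=False)  # drop least recent

    def get(self, session_id: str) -> Optional[Any]:
        with self._lock:
            if session_id not in self._sessions:
                return None
            self._sessions.move_to_end(session_id)  # reads refresh recency
            return self._sessions[session_id]

    def __len__(self) -> int:
        with self._lock:
            return len(self._sessions)
```

With a cap of 2, storing `a` and `b`, reading `a`, then storing `c` evicts `b`, because `a` was touched more recently.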

---

## 🎭 Theme Alignment

| Theme | Alignment | Evidence |
|:---:|:---|:---|
| **#4 Self-Improvement** (primary) | ✅ Core | Regulator detects blind spots → Generator biases toward them → Auditor improves → loop repeats |
| **#1 Multi-Agent Interactions** | ✅ Core | 5 agents with conflicting incentives: Generator vs Auditor adversarial self-play |
| **#1 Fleet AI Scalable Oversight** | ✅ Bonus | Regulator monitors Auditor cross-episode with predictive trend detection |
| **#3.1 Professional Tasks** | ✅ Core | Invoice + PO + vendor registry + supply chain = real enterprise AP workflow |
| **#2 Long-Horizon Planning** | ✅ Partial | `long_horizon` task: 20-step 4-phase investigation with multi-turn state |

---

## 👥 Team

| | |
|:---:|:---:|
| **Pritam Satpathy** | **Gnana Nawin T** |
| [🤗 ps2181](https://huggingface.co/ps2181) | [🤗 gnananawin](https://huggingface.co/gnananawin) |
| Scaler School of Technology | Scaler School of Technology |

**Meta PyTorch OpenEnv Hackathon, Grand Finale · April 25–26, 2026 · Bangalore**

---

## 🔗 All Links

| Resource | Link |
|:---|:---|
| 🚀 **Live Environment** | https://ps2181-invoice-processing-pipeline.hf.space |
| 🖥️ **Gradio Demo UI** | https://ps2181-invoice-processing-pipeline.hf.space/web |
| 📖 **API Documentation** | https://ps2181-invoice-processing-pipeline.hf.space/docs |
| 📊 **Metrics Dashboard** | https://ps2181-invoice-processing-pipeline.hf.space/metrics |
| 📝 **Blog Post** | https://github.com/ps2181/invoice-processing-pipeline/blob/main/BLOG.md |
| 🤗 **Extractor Model** | https://huggingface.co/ps2181/extractor-lora-qwen2.5-1.5b |
| 🕵️ **Auditor Model** | https://huggingface.co/ps2181/auditor-lora-qwen2.5-1.5b |
| ⚡ **Generator Model** | https://huggingface.co/ps2181/generator-lora-qwen2.5-1.5b |
| 📓 **Training Colab (Auditor Agent)** | https://colab.research.google.com/drive/1C1_3giNt-NmbzKNFJr5_L1fms3L8LfmB |
| 📓 **Training Colab (Extractor Agent)** | https://colab.research.google.com/drive/1fxfBt13LjmT4m98pJq-b5B__1ytFeszK?usp=sharing |
| 📓 **Training Colab (Generator Agent)** | https://colab.research.google.com/drive/1O293_VBZQCthxlGpgvz5kxoty3zcsWGH?usp=sharing |
| 💻 **GitHub** | https://github.com/ps2181/invoice-processing-pipeline |
| 🎥 **Demo Video** | https://youtu.be/QSB4UOLvaC8?si=SGnIwsfTW4JGsU3e |
| 🧩 **OpenEnv Framework** | https://github.com/meta-pytorch/OpenEnv |

---

**Built with ❤️ for the Meta PyTorch OpenEnv Hackathon 2026**

*"The system that gets harder for itself, so the agent never stops learning."*