Space: musharraf7/esctr-environment

Musharraf committed on Apr 9

Commit a2ae67c (1 parent: 7de3176)

Add 2 new frontier-challenging tasks + reward shaping system

- corrupted_scan: OCR-corrupted documents with char substitutions (0/O, 1/l, 5/S)
- adversarial_invoice: decoy fields, contradictions, hidden calculations, discrepancy detection
- Reward shaping: consistency bonus (+0.03), efficiency signal (+0.01-0.02), improvement tracking (+0.02)
- 15 total documents across 5 difficulty tiers (easy -> expert)
- Updated inference.py with task-specific prompts for all 5 tasks
- Comprehensive README rewrite with reward design documentation
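
For readers who want a concrete picture of the reward shaping described in the commit message, here is a minimal sketch. It is not the code in `server/environment.py`. The function name `shape_reward`, the first-step versus later-step split of the efficiency bonus, the proportional improvement bonus, and the inclusive clamp are all assumptions made for illustration; only the bonus magnitudes and the 0.01 to 0.99 range come from this commit.

```python
def shape_reward(base_score, extracted, step, prev_best=None, max_fast_steps=5):
    """Hypothetical shaping: base score plus small bonuses, clamped to [0.01, 0.99]."""
    bonus = 0.0

    # Consistency bonus (+0.03): extracted subtotal + tax matches total to the cent.
    try:
        gap = float(extracted["subtotal"]) + float(extracted["tax"]) - float(extracted["total"])
        if abs(gap) < 0.005:
            bonus += 0.03
    except (KeyError, TypeError, ValueError):
        pass  # missing or non-numeric fields simply earn no bonus

    # Efficiency signal (+0.01 to +0.02). Assumed split: first step earns the
    # larger bonus, later steps within the fast window earn the smaller one.
    if step == 1:
        bonus += 0.02
    elif step <= max_fast_steps:
        bonus += 0.01

    # Improvement tracking (up to +0.02). Assumed to scale with the gain
    # over the previous best score.
    if prev_best is not None and base_score > prev_best:
        bonus += min(0.02, base_score - prev_best)

    return min(0.99, max(0.01, base_score + bonus))
```

Keeping every bonus small relative to the base score means field accuracy still dominates the signal, and the final clamp keeps the result inside the documented range.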

Files changed (7)

README.md +61 -28
inference.py +51 -8
project_juiding_criterion.txt +145 -0
server/app.py +12 -4
server/documents.py +364 -1
server/environment.py +66 -6
server/graders.py +10 -3

README.md CHANGED

@@ -12,7 +12,7 @@ tags:
 # Invoice Extraction Environment
-An OpenEnv-compliant environment where AI agents extract structured data from unstructured invoice and receipt documents.
 **Space URL:** `https://huggingface.co/spaces/musharraf7/invoice-extraction-env`
@@ -25,14 +25,14 @@ r = requests.post(f"{url}/reset", json={"task_name": "simple_invoice"})
 print(r.json())
 ```
-## Environment Description
-This environment simulates real-world document data extraction — a task faced daily by businesses processing invoices, receipts, and purchase orders. The agent receives unstructured text documents and must extract specific structured fields (invoice numbers, dates, vendor names, line items, totals, etc.).
-### Why This Matters
-- **$5B+ industry:** Automated document processing is one of the largest business process automation markets
-- **Real RL training signal:** Partial-credit rewards on a per-field basis provide rich gradient
-- **Difficulty progression:** Three task levels test increasingly complex reasoning
 ## Action Space
@@ -65,35 +65,63 @@ Each step returns an `InvoiceObservation`:
 | `attempts_remaining` | int | Remaining extraction attempts |
 | `required_fields` | list | Fields to extract |
 | `done` | bool | Whether the episode has ended |
-| `reward` | float | Reward signal [0.0–1.0] |
-## Tasks
-### 1. `simple_invoice` (Easy)
-Clean, well-formatted invoices with clear field labels. The agent must extract 8 fields including invoice number, date, vendor/customer names, financial totals, and itemized line items.
 **Required fields:** `invoice_number`, `date`, `vendor_name`, `customer_name`, `subtotal`, `tax`, `total`, `line_items`
-### 2. `messy_invoice` (Medium)
 Same fields but from messy, inconsistently formatted documents with abbreviations, typos, and non-standard layouts.
 **Required fields:** Same as simple_invoice
-### 3. `multi_document` (Hard)
-Complex multi-section documents containing a purchase order, invoice, and credit memo/payment receipt. The agent must cross-reference sections and extract 11 fields including the adjusted total.
-**Required fields:** All of the above + `po_number`, `adjustment_reason`, `adjusted_total`
 ## Reward Design
-- **Per-field scoring:** Each field is scored independently (0.0–1.0)
-  - Text fields: Fuzzy matching with SequenceMatcher
-  - Numeric fields: Exact match (1.0), within 1% (0.9), within 5% (0.5)
-  - Date fields: Normalized comparison (YYYY-MM-DD)
-  - Line items: Matched by best-fit comparison of description, qty, price, amount
-- **Overall score:** Weighted average of all field scores
-- **Episode rewards:** Best score across all extraction attempts
-- **Partial progress:** Feedback identifies weak fields for refinement
 ## Setup Instructions
@@ -109,6 +137,11 @@ pip install -r requirements.txt
 uvicorn server.app:app --host 0.0.0.0 --port 7860
 ```
 ### Run inference
 ```bash
 export ENV_URL="http://localhost:7860"
@@ -135,12 +168,12 @@ python inference.py
 ├── server/
 │   ├── __init__.py
 │   ├── app.py             # FastAPI application
-│   ├── environment.py     # Core environment logic
-│   ├── documents.py       # Document corpus
-│   ├── graders.py         # Scoring/grading logic
-│   └── models.py          # Pydantic Action/Observation types
 ├── __init__.py            # Package declaration
-├── inference.py           # Baseline inference script
 ├── openenv.yaml           # OpenEnv manifest
 ├── pyproject.toml         # Package configuration
 ├── requirements.txt       # Dependencies

 # Invoice Extraction Environment
+An OpenEnv-compliant environment where AI agents extract structured data from unstructured invoice and receipt documents. Features **5 difficulty tiers** — from clean invoices to adversarial documents with decoy fields, OCR corruption, and hidden calculations.
 **Space URL:** `https://huggingface.co/spaces/musharraf7/invoice-extraction-env`
 print(r.json())
 ```
+## Why This Environment?
+Invoice data extraction is a **$5B+ industry** problem faced daily by every business. This environment provides:
+- **Real RL training signal**: Per-field partial-credit scoring gives dense reward gradients
+- **Genuine difficulty progression**: From clean invoices to adversarial traps that challenge frontier models
+- **Reward shaping**: Consistency bonuses, efficiency signals, and improvement tracking provide rich learning signals beyond simple field matching
+- **Production relevance**: The task directly models what commercial document processing systems must solve
 ## Action Space
 | `attempts_remaining` | int | Remaining extraction attempts |
 | `required_fields` | list | Fields to extract |
 | `done` | bool | Whether the episode has ended |
+| `reward` | float | Reward signal (0.01–0.99) |
+## Tasks (5 Difficulty Tiers)
+### 1. `simple_invoice` (Easy) — 3 attempts
+Clean, well-formatted invoices with clear field labels.
 **Required fields:** `invoice_number`, `date`, `vendor_name`, `customer_name`, `subtotal`, `tax`, `total`, `line_items`
+### 2. `messy_invoice` (Medium) — 3 attempts
 Same fields but from messy, inconsistently formatted documents with abbreviations, typos, and non-standard layouts.
 **Required fields:** Same as simple_invoice
+### 3. `multi_document` (Hard) — 5 attempts
+Complex multi-section documents containing a purchase order, invoice, and credit memo/payment receipt. The agent must cross-reference sections.
+**Required fields:** All basic fields + `po_number`, `adjustment_reason`, `adjusted_total`
+### 4. `corrupted_scan` (Very Hard) — 4 attempts
+Simulates OCR-scanned/faxed invoices with systematic character errors:
+- Character substitutions: `0`↔`O`, `1`↔`l`↔`I`, `5`↔`S`, `8`↔`B`
+- Garbled sections and scan artifacts
+- The agent must **reason through noise** to recover the true values
+**Required fields:** Same as simple_invoice
+### 5. `adversarial_invoice` (Expert) — 6 attempts
+Adversarial documents designed to trap and challenge frontier models:
+- **Decoy fields**: Multiple invoice numbers — only one is current
+- **Hidden calculations**: Discounts the agent must compute
+- **Contradictory sections**: PO vs invoice disagreements
+- **Budget variance alerts**: Agent must identify and explain discrepancies
+**Required fields:** All basic fields + `po_number`, `discount_amount`, `original_total`, `discrepancy_notes`
 ## Reward Design
+### Per-Field Scoring (Base Score)
+- **Text fields**: Fuzzy matching with SequenceMatcher (0.0–1.0)
+- **Numeric fields**: Exact match (1.0), within 1% (0.9), within 5% (0.5)
+- **Date fields**: Normalized comparison (YYYY-MM-DD)
+- **Line items**: Best-fit matching of description, qty, price, amount
+- **Reasoning fields** (discrepancy_notes): Fuzzy matching with lower threshold
+### Reward Shaping Bonuses
+| Bonus | Value | Trigger |
+|-------|-------|---------|
+| **Consistency** | +0.03 | Agent's subtotal + tax = total |
+| **Efficiency** | +0.01–0.02 | Solution found in ≤5 steps |
+| **Improvement** | up to +0.02 | Score improves on retry |
+### Episode Mechanics
+- **Best score tracked** across all extraction attempts
+- **Partial progress** feedback identifies weak fields for refinement
+- **Early termination** at score ≥ 0.95
+- **All scores** clamped to strict (0.01, 0.99) range
 ## Setup Instructions
 uvicorn server.app:app --host 0.0.0.0 --port 7860
 ```
+### Run with uv
+```bash
+uv run server
+```
 ### Run inference
 ```bash
 export ENV_URL="http://localhost:7860"
 ├── server/
 │   ├── __init__.py
 │   ├── app.py             # FastAPI application
+│   ├── environment.py     # Core environment logic + reward shaping
+│   ├── documents.py       # 15-document corpus across 5 difficulty tiers
+│   ├── graders.py         # Field-level scoring with fuzzy matching
+│   └── models.py          # Pydantic Action/Observation/State types
 ├── __init__.py            # Package declaration
+├── inference.py           # Baseline inference script (all 5 tasks)
 ├── openenv.yaml           # OpenEnv manifest
 ├── pyproject.toml         # Package configuration
 ├── requirements.txt       # Dependencies
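
The per-field scoring rules in the README's "Reward Design" section can be sketched as follows. This is an illustration only, not the contents of `server/graders.py`. The function names `score_text` and `score_numeric`, the lowercasing and whitespace stripping before comparison, and the handling of a zero expected value are assumptions. The thresholds (1.0 exact, 0.9 within 1%, 0.5 within 5%) and the use of `SequenceMatcher` come from the README.

```python
from difflib import SequenceMatcher


def score_text(predicted, expected):
    """Fuzzy text score in [0.0, 1.0] using difflib's similarity ratio."""
    if predicted is None:
        return 0.0
    # Assumption: comparison ignores case and surrounding whitespace.
    a = str(predicted).strip().lower()
    b = str(expected).strip().lower()
    return SequenceMatcher(None, a, b).ratio()


def score_numeric(predicted, expected):
    """Tiered numeric score: exact 1.0, within 1% 0.9, within 5% 0.5, else 0.0."""
    try:
        p, e = float(predicted), float(expected)
    except (TypeError, ValueError):
        return 0.0
    if p == e:
        return 1.0
    if e == 0:
        return 0.0  # assumption: no relative tolerance around a zero target
    rel_error = abs(p - e) / abs(e)
    if rel_error <= 0.01:
        return 0.9
    if rel_error <= 0.05:
        return 0.5
    return 0.0


print(score_numeric(1130, 1134))   # 0.9 (about 0.35% off)
print(score_numeric(1100, 1134))   # 0.5 (about 3% off)
```

Both functions depend only on their inputs, so repeated grading of the same submission gives the same score.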

inference.py CHANGED

@@ -3,9 +3,9 @@
 Baseline inference script for the Invoice Extraction Environment.
 This script demonstrates how an LLM agent interacts with the environment
-to extract structured data from invoice documents. It runs all three tasks
-(simple_invoice, messy_invoice, multi_document) and logs results in the
-mandatory OpenEnv [START]/[STEP]/[END] format.
 Required environment variables:
     API_BASE_URL       — OpenAI-compatible API endpoint
@@ -35,7 +35,7 @@ ENV_URL = os.getenv("ENV_URL", "http://localhost:7860")
 # Optional — if you use from_docker_image():
 LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME")
-TASKS = ["simple_invoice", "messy_invoice", "multi_document"]
 # ---------------------------------------------------------------------------
 # LLM Client (OpenAI-compatible, configured via env vars)
@@ -100,12 +100,46 @@ RULES:
 - For monetary amounts, use plain numbers without currency symbols (e.g. 1134.00)
 - For line_items, use an array of objects with keys: description, quantity, unit_price, amount
 - If a field cannot be found, use null
-- For the multi_document task, look across all document sections (invoice, credit memo, PO, etc.)
-- adjusted_total is the final amount after credits/payments
-- po_number is the purchase order reference number
 JSON:"""
 REFINE_PROMPT = """You previously extracted data from an invoice but some fields were incorrect.
 DOCUMENT:
@@ -128,6 +162,8 @@ RULES:
 - For monetary amounts, use plain numbers without currency symbols
 - For line_items, use an array of objects with keys: description, quantity, unit_price, amount
 - If a field cannot be determined, use null
 JSON:"""
@@ -217,7 +253,12 @@ def run_task(env_url: str, task_name: str, seed: int = 0) -> float:
         # Step 3: Use LLM to extract fields
         fields_str = "\n".join(f"- {f}" for f in required_fields)
-        prompt = EXTRACT_PROMPT.format(document=document_text, fields=fields_str)
         llm_response = call_llm(prompt)
         extracted_json = extract_json_from_response(llm_response)
@@ -253,11 +294,13 @@ def run_task(env_url: str, task_name: str, seed: int = 0) -> float:
             weak_fields = [f for f, s in field_scores.items() if s < 0.8]
             # Refine with LLM
             refine_prompt = REFINE_PROMPT.format(
                 document=document_text,
                 previous=extracted_json,
                 weak_fields=", ".join(weak_fields) if weak_fields else "all fields",
                 feedback=feedback_text,
             )
             refined_response = call_llm(refine_prompt)
             refined_json = extract_json_from_response(refined_response)

 Baseline inference script for the Invoice Extraction Environment.
 This script demonstrates how an LLM agent interacts with the environment
+to extract structured data from invoice documents. It runs all five tasks
+(simple_invoice, messy_invoice, multi_document, corrupted_scan, adversarial_invoice)
+and logs results in the mandatory OpenEnv [START]/[STEP]/[END] format.
 Required environment variables:
     API_BASE_URL       — OpenAI-compatible API endpoint
 # Optional — if you use from_docker_image():
 LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME")
+TASKS = ["simple_invoice", "messy_invoice", "multi_document", "corrupted_scan", "adversarial_invoice"]
 # ---------------------------------------------------------------------------
 # LLM Client (OpenAI-compatible, configured via env vars)
 - For monetary amounts, use plain numbers without currency symbols (e.g. 1134.00)
 - For line_items, use an array of objects with keys: description, quantity, unit_price, amount
 - If a field cannot be found, use null
+{task_specific_rules}
+IMPORTANT: Ensure your extracted subtotal + tax = total. Verify math consistency.
 JSON:"""
+TASK_RULES = {
+    "simple_invoice": "",
+    "messy_invoice": (
+        "- This document uses informal formatting, abbreviations, and shorthand\n"
+        "- Look past formatting irregularities to find the actual values\n"
+        "- 'subtot', 's/t', 'sub' = subtotal; 'tx' = tax; 'amt due' = total"
+    ),
+    "multi_document": (
+        "- This contains MULTIPLE document sections (PO, Invoice, Credit Memo, etc.)\n"
+        "- Extract from the INVOICE section primarily\n"
+        "- adjusted_total is the final amount after credits/payments\n"
+        "- po_number is the purchase order reference number\n"
+        "- adjustment_reason describes why the total was adjusted"
+    ),
+    "corrupted_scan": (
+        "- WARNING: This is an OCR-scanned document with character errors\n"
+        "- Common OCR substitutions: 0<->O, 1<->l<->I, 5<->S, 8<->B\n"
+        "- Mentally correct OCR errors to recover the true values\n"
+        "- 'lNV' = 'INV', 'S' in numbers = '5', 'O' in numbers = '0'\n"
+        "- Verify all numbers by cross-checking (qty * unit_price = amount)"
+    ),
+    "adversarial_invoice": (
+        "- CAUTION: This document contains DECOY fields and contradictions\n"
+        "- Multiple invoice numbers may appear — use the CURRENT/ACTIVE one, not voided/draft ones\n"
+        "- If there is a reissue date, use that as the date (not the original date)\n"
+        "- subtotal is the ADJUSTED subtotal after any discounts\n"
+        "- discount_amount is the monetary discount value\n"
+        "- original_total is what the total WOULD have been without adjustments\n"
+        "- discrepancy_notes: describe ALL discrepancies, adjustments, and calculations\n"
+        "- po_number: the purchase order reference if present, else null\n"
+        "- Cross-reference different sections to find contradictions"
+    ),
+}
 REFINE_PROMPT = """You previously extracted data from an invoice but some fields were incorrect.
 DOCUMENT:
 - For monetary amounts, use plain numbers without currency symbols
 - For line_items, use an array of objects with keys: description, quantity, unit_price, amount
 - If a field cannot be determined, use null
+- VERIFY: subtotal + tax should equal total
+{task_specific_rules}
 JSON:"""
         # Step 3: Use LLM to extract fields
         fields_str = "\n".join(f"- {f}" for f in required_fields)
+        task_rules = TASK_RULES.get(task_name, "")
+        prompt = EXTRACT_PROMPT.format(
+            document=document_text,
+            fields=fields_str,
+            task_specific_rules=task_rules,
+        )
         llm_response = call_llm(prompt)
         extracted_json = extract_json_from_response(llm_response)
             weak_fields = [f for f, s in field_scores.items() if s < 0.8]
             # Refine with LLM
+            task_rules = TASK_RULES.get(task_name, "")
             refine_prompt = REFINE_PROMPT.format(
                 document=document_text,
                 previous=extracted_json,
                 weak_fields=", ".join(weak_fields) if weak_fields else "all fields",
                 feedback=feedback_text,
+                task_specific_rules=task_rules,
             )
             refined_response = call_llm(refine_prompt)
             refined_json = extract_json_from_response(refined_response)
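
The prompt change above threads a per-task rule block into the template through a `{task_specific_rules}` placeholder. The standalone sketch below shows that pattern. `EXTRACT_TEMPLATE` and `build_prompt` are hypothetical, shortened stand-ins rather than the actual `EXTRACT_PROMPT` or `run_task` code, and the `TASK_RULES` entries are trimmed to two.

```python
# Hypothetical, shortened template; the real EXTRACT_PROMPT has more rules.
EXTRACT_TEMPLATE = """Extract the fields below from the document.

DOCUMENT:
{document}

FIELDS:
{fields}

RULES:
- If a field cannot be found, use null
{task_specific_rules}

JSON:"""

TASK_RULES = {
    "simple_invoice": "",
    "corrupted_scan": "- Common OCR substitutions: 0<->O, 1<->l<->I, 5<->S, 8<->B",
}


def build_prompt(task_name, document_text, required_fields):
    """Fill the template; unknown task names fall back to no extra rules."""
    fields_str = "\n".join(f"- {f}" for f in required_fields)
    rules = TASK_RULES.get(task_name, "")
    return EXTRACT_TEMPLATE.format(
        document=document_text,
        fields=fields_str,
        task_specific_rules=rules,
    )


prompt = build_prompt("corrupted_scan", "T0tal: $1,l34.OO", ["total"])
```

`str.format` parses only the template string, so braces that appear inside the substituted document text pass through unchanged. Any literal brace added to the template itself would need to be doubled.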

project_juiding_criterion.txt ADDED

	@@ -0,0 +1,145 @@

+Real-world task simulation
+The environment must simulate a task humans actually do. Not games, not toys. Examples: email triage, code review, data cleaning, scheduling, customer support, content moderation.
+OpenEnv spec compliance
+Implement the full OpenEnv interface: typed Observation, Action, and Reward Pydantic models. step(action) → returns observation, reward, done, info. reset() → returns initial observation. state() → returns current state. openenv.yaml with metadata. Tested via openenv validate.
+Minimum 3 tasks with agent graders
+Each task defines a concrete objective an agent must accomplish, with a programmatic grader that scores performance (0.0–1.0). Tasks should range: easy → medium → hard. Graders must have clear, deterministic success/failure criteria.
+Meaningful reward function
+Provides signal over the full trajectory (not just binary end-of-episode). Rewards partial progress toward task completion. Penalizes clearly undesirable behavior (e.g. infinite loops, destructive actions).
+Baseline inference script
+Uses the OpenAI API client to run a model against the environment. Reads API credentials from environment variables (OPENAI_API_KEY). Produces a reproducible baseline score on all 3 tasks.
+Detailed Requirements
+Non-Functional Requirements
+Deploys to a Hugging Face Space
+Environment must run as a containerized HF Space tagged with openenv.
+Containerized execution
+Must include a working Dockerfile. The environment should start cleanly with docker build + docker run.
+Documentation
+README must include: environment description and motivation, action and observation space definitions, task descriptions with expected difficulty, setup and usage instructions, baseline scores.
+Parameter
+Weight
+Description
+Real-world utility
+30%
+Does the environment model a genuine task? Would someone actually use this to train or evaluate agents?
+Task & grader quality
+25%
+Are tasks well-defined with clear objectives? Do graders accurately and fairly measure success? Meaningful difficulty progression?
+Environment design
+20%
+Clean state management, sensible action/observation spaces, good reward shaping, proper episode boundaries.
+Code quality & spec compliance
+15%
+Follows OpenEnv spec, clean project structure, typed models, documented, tested, Dockerfile works.
+Creativity & novelty
+10%
+Novel problem domain, interesting mechanics, clever reward design, original approach.
+Scoring Breakdown
+Real-world utility (30%)
+•  0–5: Toy/artificial problem with no practical application
+•  6–15: Valid domain but shallow modeling of the real task
+•  16–25: Good domain modeling, would be useful for agent evaluation
+•  26–30: Excellent — fills a real gap, immediate value for the RL/agent community
+Task & grader quality (25%)
+•  3+ tasks with difficulty range?
+•  Graders produce scores between 0.0–1.0?
+•  Graders deterministic and reproducible?
+•  Hard task genuinely challenges frontier models?
+Environment design (20%)
+•  reset() produces clean state?
+•  Action/observation types well-designed and documented?
+•  Reward function provides useful varying signal (not just sparse)?
+•  Episode boundaries sensible?
+Code quality & spec compliance (15%)
+•  openenv validate passes?
+•  docker build && docker run works?
+•  HF Space deploys and responds?
+•  Baseline script runs and reproduces scores?
+Creativity & novelty (10%)
+•  Domain we haven’t seen in OpenEnv before?
+•  Reward design has interesting properties?
+•  Clever mechanics that make the environment engaging?
+Evaluation Criteria
+Phase 1: Automated Validation
+Pass/fail gate — HF Space deploys, OpenEnv spec compliance, Dockerfile builds, baseline reproduces, 3+ tasks with graders.
+Phase 2: Agentic Evaluation
+Scored — baseline agent re-run, standard Open LLM agent (e.g. Nemotron 3 Super) run against all environments, score variance check.
+Phase 3: Human Review
+Top submissions reviewed by Meta and Hugging Face engineers for real-world utility, creativity, and exploit checks.
+Disqualification Criteria
+Environment does not deploy or respond
+Plagiarized or trivially modified existing environments
+Graders that always return the same score
+No baseline inference script

server/app.py CHANGED

@@ -131,11 +131,19 @@ def create_invoice_app() -> FastAPI:
             "name": "invoice_extraction_env",
             "description": (
                 "An environment for extracting structured data from unstructured "
-                "invoice and receipt documents. The agent must identify and extract "
-                "fields like invoice number, dates, vendor name, line items, and totals."
             ),
-            "version": "0.1.0",
-            "tasks": ["simple_invoice", "messy_invoice", "multi_document"],
         }
     # === WebSocket (for persistent sessions) ===

             "name": "invoice_extraction_env",
             "description": (
                 "An environment for extracting structured data from unstructured "
+                "invoice and receipt documents. Features 5 difficulty tiers from "
+                "clean invoices to adversarial documents with decoy fields, OCR "
+                "corruption, and hidden calculations. Reward shaping includes "
+                "consistency bonuses, efficiency signals, and improvement tracking."
             ),
+            "version": "0.2.0",
+            "tasks": [
+                "simple_invoice",
+                "messy_invoice",
+                "multi_document",
+                "corrupted_scan",
+                "adversarial_invoice",
+            ],
         }
     # === WebSocket (for persistent sessions) ===
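
The `corrupted_scan` task registered above relies on the character confusions listed in the commit message (`0`/`O`, `1`/`l`/`I`, `5`/`S`, `8`/`B`). As a rough illustration of the recovery an agent has to perform on monetary fields, the sketch below maps look-alike letters back to digits. `parse_ocr_amount` is a hypothetical helper, not part of this repository, and it is only safe on tokens already known to be numeric.

```python
# Letter-to-digit fixes, valid only inside tokens known to be numeric.
OCR_DIGIT_FIXES = str.maketrans({"O": "0", "l": "1", "I": "1", "S": "5", "B": "8"})


def parse_ocr_amount(token):
    """Parse an OCR-damaged currency token such as '$1,l34.OO' into a float."""
    cleaned = token.strip().lstrip("$").replace(",", "")
    cleaned = cleaned.translate(OCR_DIGIT_FIXES)
    return float(cleaned)


print(parse_ocr_amount("$1,l34.OO"))  # 1134.0
print(parse_ocr_amount("$2,98S.3O"))  # 2985.3
```

The same mapping cannot be applied blindly to text fields such as vendor names, where `O`, `S`, and `B` are often legitimate letters.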

server/documents.py CHANGED

@@ -465,6 +465,357 @@ Total with Backorder: $2,498.45 + $170.00 = $2,668.45
             },
         },
     ],
 }
@@ -483,6 +834,16 @@ TASK_REQUIRED_FIELDS = {
         "subtotal", "tax", "total", "line_items",
         "po_number", "adjustment_reason", "adjusted_total",
     ],
 }
@@ -490,7 +851,8 @@ def get_document(task_name: str, doc_index: int = 0) -> dict:
     """Get a document and its metadata for a given task.
     Args:
-        task_name: One of 'simple_invoice', 'messy_invoice', 'multi_document'
         doc_index: Index into the document pool (will wrap around)
     Returns:
@@ -504,3 +866,4 @@ def get_document(task_name: str, doc_index: int = 0) -> dict:
         "ground_truth": doc["ground_truth"],
         "required_fields": TASK_REQUIRED_FIELDS.get(task_name, TASK_REQUIRED_FIELDS["simple_invoice"]),
     }

             },
         },
     ],
+    # =========================================================================
+    # CORRUPTED SCAN — OCR-like artifacts, character substitutions, garbled text
+    # These simulate real scanned/faxed invoices with OCR errors.
+    # =========================================================================
+    "corrupted_scan": [
+        {
+            "id": "corrupt_001",
+            "text": """SC4NNED D0CUMENT - Page 1 of 1
+lNVOlCE
+lnvoice Nurnber: lNV-2O24-OO1
+Dat.e: Januery 1S, 2O24
+Frorn:
+  Acrne Corporati0n
+  l23 Business Avenue
+  New Y0rk, NY 1OOO1
+BilI To:
+  Widget C0.
+  4S6 Cornmerce Street
+  Chicag0, lL 6O6O1
+Descripti0n                Qty    Unit Price    Arnount
+---------------------------------------------------------
+Widget Type A               1O      $2S.OO     $2SO.OO
+Widget Type 8                S      $4O.OO     $2OO.OO
+ConsuIting Service           8      $7S.OO     $6OO.OO
+                                   Subtotal:  $1,OSO.OO
+                                   Tax (8%):     $84.OO
+                                   T0tal:     $1,l34.OO
+Payrnent Terrns: Net 3O
+--- END 0F SCAN ---
+""",
+            "ground_truth": {
+                "invoice_number": "INV-2024-001",
+                "date": "2024-01-15",
+                "vendor_name": "Acme Corporation",
+                "customer_name": "Widget Co.",
+                "subtotal": 1050.00,
+                "tax": 84.00,
+                "total": 1134.00,
+                "line_items": [
+                    {"description": "Widget Type A", "quantity": 10, "unit_price": 25.00, "amount": 250.00},
+                    {"description": "Widget Type B", "quantity": 5, "unit_price": 40.00, "amount": 200.00},
+                    {"description": "Consulting Service", "quantity": 8, "unit_price": 75.00, "amount": 600.00},
+                ],
+            },
+        },
+        {
+            "id": "corrupt_002",
+            "text": """[SCAN QUALITY: P00R - SOME CHARACTERS MAY BE lNCORRECT]
+TECHSTART S0LUTl0NS LLC
+89O lnnovation Dr, Suite 2OO
+San Francisc0, CA 941OS
+lNV0lCE #: TS~S892
+DATE: O3/O3/2O24
+CUSTOMERr DataFIow lnc.
+          321 AnaIytics BIvd
+          Austin, TX 787O1
+Servicc                       Qty   Unit Pricc     Total
+----------------------------------------------------------
+CIoud Hosting (MonthIy)         l     $4SO.OO    $4SO.OO
+APl lntegration Setup           l   $l,2OO.OO  $l,2OO.OO
+TechnicaI Support (hours)      l2      $9S.OO  $l,l4O.OO
+                                    SubtotaI:  $2,79O.OO
+                                    Tax (7%)):    $l9S.3O
+                                    TotaI:     $2,98S.3O
+Due Date: ApriI 2, 2O24
+[PAGE 1/1 - SCAN C0MPLETE]
+""",
+            "ground_truth": {
+                "invoice_number": "TS-5892",
+                "date": "2024-03-03",
+                "vendor_name": "TechStart Solutions LLC",
+                "customer_name": "DataFlow Inc.",
+                "subtotal": 2790.00,
+                "tax": 195.30,
+                "total": 2985.30,
+                "line_items": [
+                    {"description": "Cloud Hosting (Monthly)", "quantity": 1, "unit_price": 450.00, "amount": 450.00},
+                    {"description": "API Integration Setup", "quantity": 1, "unit_price": 1200.00, "amount": 1200.00},
+                    {"description": "Technical Support (hours)", "quantity": 12, "unit_price": 95.00, "amount": 1140.00},
+                ],
+            },
+        },
+        {
+            "id": "corrupt_003",
+            "text": """---FAXED DOCUMENT---
+RECEIVED: 02/20/2024 14:32
+QUALITY: [####===---] 40%
+GL0BAL SUPPLlES lNC.
+25OO lndustriaI Parkway
+Detr0it, Ml 482Ol
+lNVOlCE
+lnvoice Number: GS-2O24-Ol47
+Date: February 2O, 2024
+T0:
+  Riverside Manufactur1ng
+  78O Factory R0ad
+  CIeveIand, 0H 44l0l
+Product                    Qty    Price Each    Line Total
+-----------------------------------------------------------
+SteeI BoIts (Box/lOO)       SO       $l2.SO       $62S.OO
+Copper Wire (SOOft RoII)     8       $8S.OO       $68O.OO
+Safety GoggIes (Pack/lO)    2O       $3S.OO       $7OO.OO
+WeIding Rods (BundIe)       lS       $22.OO       $33O.OO
+                   [iIIegibIe]
+                                    SubtotaI:   $2,33S.OO
+                                    SaIes Tax:    $l63.4S
+                                    lnvoice T0tal: $2,498.4S
+Terrns: Net 4S
+---END FAX---
+""",
+            "ground_truth": {
+                "invoice_number": "GS-2024-0147",
+                "date": "2024-02-20",
+                "vendor_name": "Global Supplies Inc.",
+                "customer_name": "Riverside Manufacturing",
+                "subtotal": 2335.00,
+                "tax": 163.45,
+                "total": 2498.45,
+                "line_items": [
+                    {"description": "Steel Bolts (Box/100)", "quantity": 50, "unit_price": 12.50, "amount": 625.00},
+                    {"description": "Copper Wire (500ft Roll)", "quantity": 8, "unit_price": 85.00, "amount": 680.00},
+                    {"description": "Safety Goggles (Pack/10)", "quantity": 20, "unit_price": 35.00, "amount": 700.00},
+                    {"description": "Welding Rods (Bundle)", "quantity": 15, "unit_price": 22.00, "amount": 330.00},
+                ],
+            },
+        },
+    ],
+    # =========================================================================
+    # ADVERSARIAL INVOICE — Decoy fields, contradictions, hidden calculations
+    # Designed to genuinely challenge frontier models with traps.
+    # =========================================================================
+    "adversarial_invoice": [
+        {
+            "id": "adversarial_001",
+            "text": """INVOICE
+*** IMPORTANT: This replaces previous invoice DRAFT-INV-999 which was voided ***
+Invoice Number: INV-2024-001-R2
+Previous Reference: DRAFT-INV-999 (VOIDED — DO NOT USE)
+Date: January 15, 2024
+Reissue Date: January 20, 2024
+From:
+  Acme Corporation
+  123 Business Avenue, New York, NY 10001
+  Tax ID: 12-3456789
+Bill To:
+  Widget Co. (DBA "WidgetCorp International")
+  456 Commerce Street, Chicago, IL 60601
+  Customer Account: WC-0042
+Description                Qty    Unit Price    Amount
+---------------------------------------------------------
+Widget Type A               10      $25.00     $250.00
+Widget Type B                5      $40.00     $200.00
+Consulting Service           8      $75.00     $600.00
+  ** EARLY PAYMENT DISCOUNT: -5% on consulting **
+                                   Subtotal:  $1,050.00
+                              Discount (5%):    -$30.00
+                         Adjusted Subtotal:   $1,020.00
+                                   Tax (8%):     $81.60
+                                   Total:     $1,101.60
+NOTE: Original quote (QT-2024-555) was $1,134.00 but discount applied.
+Per agreement dated Jan 12, if paid within 10 days.
+Payment Terms: Net 10 (discounted) / Net 30 (full price $1,134.00)
+""",
+            "ground_truth": {
+                "invoice_number": "INV-2024-001-R2",
+                "date": "2024-01-20",
+                "vendor_name": "Acme Corporation",
+                "customer_name": "Widget Co.",
+                "subtotal": 1020.00,
+                "tax": 81.60,
+                "total": 1101.60,
+                "discount_amount": 30.00,
+                "original_total": 1134.00,
+                "line_items": [
+                    {"description": "Widget Type A", "quantity": 10, "unit_price": 25.00, "amount": 250.00},
+                    {"description": "Widget Type B", "quantity": 5, "unit_price": 40.00, "amount": 200.00},
+                    {"description": "Consulting Service", "quantity": 8, "unit_price": 75.00, "amount": 600.00},
+                ],
+                "discrepancy_notes": "5% early payment discount applied to consulting. Reissued invoice replaces voided DRAFT-INV-999. Adjusted subtotal $1,020 vs original $1,050.",
+            },
+        },
+        {
+            "id": "adversarial_002",
+            "text": """--- PURCHASE ORDER ---
+PO#: PO-DF-2024-112
+Date: February 28, 2024
+Vendor: TechStart Solutions LLC
+Buyer: DataFlow Inc.
+Authorized Budget: $2,600.00 (pre-tax)
+Items:
+1. Cloud Hosting - 1 unit @ $450.00 = $450.00
+2. API Integration - 1 unit @ $1,200.00 = $1,200.00
+3. Tech Support - 10 hours @ $95.00/hr = $950.00
+PO Total: $2,600.00
+--- INVOICE ---
+Invoice: TS-5892-FINAL
+Date: March 3, 2024
+PO Reference: PO-DF-2024-112
+From: TechStart Solutions LLC
+To: DataFlow Inc.
+Service                       Qty   Rate        Amount
+Cloud Hosting (Monthly)         1   $450.00    $450.00
+API Integration Setup           1   $1,200.00  $1,200.00
+Technical Support (actual)     12   $95.00     $1,140.00
+  >> 2 hrs over PO estimate, approved by J. Smith 03/01/2024
+Rush Processing Fee             1   $150.00    $150.00
+  >> Added per emergency request ER-2024-033
+Subtotal: $2,940.00
+Tax (7%): $205.80
+Total: $3,145.80
+!!! BUDGET VARIANCE ALERT !!!
+PO Authorized: $2,600.00
+Actual (pre-tax): $2,940.00
+Variance: $340.00 OVER BUDGET (13.1%)
+Causes: Support overage ($190), Rush fee ($150)
+--- PAYMENT SCHEDULE ---
+Payment 1 (due 03/15): $1,500.00
+Payment 2 (due 04/02): $1,645.80
+""",
+            "ground_truth": {
+                "invoice_number": "TS-5892-FINAL",
+                "date": "2024-03-03",
+                "vendor_name": "TechStart Solutions LLC",
+                "customer_name": "DataFlow Inc.",
+                "subtotal": 2940.00,
+                "tax": 205.80,
+                "total": 3145.80,
+                "po_number": "PO-DF-2024-112",
+                "discount_amount": 0.00,
+                "original_total": 2600.00,
+                "line_items": [
+                    {"description": "Cloud Hosting (Monthly)", "quantity": 1, "unit_price": 450.00, "amount": 450.00},
+                    {"description": "API Integration Setup", "quantity": 1, "unit_price": 1200.00, "amount": 1200.00},
+                    {"description": "Technical Support (actual)", "quantity": 12, "unit_price": 95.00, "amount": 1140.00},
+                    {"description": "Rush Processing Fee", "quantity": 1, "unit_price": 150.00, "amount": 150.00},
+                ],
+                "discrepancy_notes": "Invoice exceeds PO by $340 (13.1%). 2 extra support hours ($190) and rush processing fee ($150) added. PO authorized $2,600 but actual pre-tax is $2,940.",
+            },
+        },
+        {
+            "id": "adversarial_003",
+            "text": """CONSOLIDATED STATEMENT
+Account: Riverside Manufacturing
+Statement Period: February 2024
+Prepared by: Global Supplies Inc., Accounts Receivable
+=== TRANSACTION 1: ORIGINAL INVOICE ===
+Invoice: GS-2024-0147
+Date: February 20, 2024
+PO: PO-RM-2024-033
+Steel Bolts (Box/100)       50   @ $12.50    =    $625.00
+Copper Wire (500ft Roll)    10   @ $85.00    =    $850.00
+Safety Goggles (Pack/10)    20   @ $35.00    =    $700.00
+Welding Rods (Bundle)       15   @ $22.00    =    $330.00
+Invoice Subtotal: $2,505.00
+Tax (7%): $175.35
+Invoice Total: $2,680.35
+=== TRANSACTION 2: ADJUSTMENT ===
+Credit Memo: CM-2024-0201
+Date: February 25, 2024
+Reference: GS-2024-0147
+Issue: Copper Wire — only 8 of 10 rolls delivered.
+2 rolls backordered (BO-2024-0089).
+Credit for undelivered: 2 x $85.00 = $170.00
+Tax adjustment: -$11.90
+Total Credit: -$181.90
+=== TRANSACTION 3: PRICE CORRECTION ===
+Debit Memo: DM-2024-0055
+Date: February 27, 2024
+Steel Bolts price was quoted at $12.50 but contract
+rate is $13.00. Underbilled on 50 boxes.
+Price difference: 50 x $0.50 = $25.00
+Tax on adjustment: $1.75
+Total Debit: $26.75
+=== ACCOUNT SUMMARY ===
+Original Invoice:           $2,680.35
+Credit (undelivered wire): -$181.90
+Debit (price correction):   +$26.75
+================================
+Net Amount Due:             $2,525.20
+Payment due by: March 20, 2024
+""",
+            "ground_truth": {
+                "invoice_number": "GS-2024-0147",
+                "date": "2024-02-20",
+                "vendor_name": "Global Supplies Inc.",
+                "customer_name": "Riverside Manufacturing",
+                "subtotal": 2505.00,
+                "tax": 175.35,
+                "total": 2680.35,
+                "po_number": "PO-RM-2024-033",
+                "discount_amount": 0.00,
+                "original_total": 2680.35,
+                "line_items": [
+                    {"description": "Steel Bolts (Box/100)", "quantity": 50, "unit_price": 12.50, "amount": 625.00},
+                    {"description": "Copper Wire (500ft Roll)", "quantity": 10, "unit_price": 85.00, "amount": 850.00},
+                    {"description": "Safety Goggles (Pack/10)", "quantity": 20, "unit_price": 35.00, "amount": 700.00},
+                    {"description": "Welding Rods (Bundle)", "quantity": 15, "unit_price": 22.00, "amount": 330.00},
+                ],
+                "discrepancy_notes": "Credit memo CM-2024-0201 for 2 undelivered Copper Wire rolls (-$181.90). Debit memo DM-2024-0055 for Steel Bolts price correction (+$26.75). Net adjustment: -$155.15. Final amount due: $2,525.20.",
+            },
+        },
+    ],
 }
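The adversarial fixtures above rely on exact arithmetic, so a quick consistency check is useful before relying on them as ground truth. The following is a minimal sketch, not part of the commit: `money` and `check_totals` are hypothetical helper names, the 7% tax rate is taken from the document text, and the figures are copied from `adversarial_002` and `adversarial_003`.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def money(value) -> Decimal:
    """Convert a number to a Decimal rounded to whole cents."""
    return Decimal(str(value)).quantize(CENT, rounding=ROUND_HALF_UP)


def check_totals(gt: dict, tax_rate: str) -> dict:
    """Hypothetical helper: verify the arithmetic inside one ground_truth dict."""
    line_sum = sum((money(i["amount"]) for i in gt["line_items"]), Decimal("0"))
    subtotal, tax, total = money(gt["subtotal"]), money(gt["tax"]), money(gt["total"])
    return {
        "line_items_match_subtotal": line_sum == subtotal,
        "tax_matches_rate": money(subtotal * Decimal(tax_rate)) == tax,
        "subtotal_plus_tax_is_total": subtotal + tax == total,
    }


# Figures copied from the adversarial_002 ground truth (descriptions omitted).
adversarial_002 = {
    "subtotal": 2940.00,
    "tax": 205.80,
    "total": 3145.80,
    "line_items": [{"amount": 450.00}, {"amount": 1200.00},
                   {"amount": 1140.00}, {"amount": 150.00}],
}
report_002 = check_totals(adversarial_002, "0.07")

# adversarial_003: original invoice total, minus credit memo, plus debit memo.
net_due_003 = money(2680.35) - money(181.90) + money(26.75)
net_adjustment_003 = money(26.75) - money(181.90)
```

Note that the same check would flag `adversarial_001` on the line-item test, because its subtotal ($1,020) is stated after the $30 discount while the line items still sum to $1,050. That mismatch appears to be intentional in the fixture.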
         "subtotal", "tax", "total", "line_items",
         "po_number", "adjustment_reason", "adjusted_total",
     ],
+    "corrupted_scan": [
+        "invoice_number", "date", "vendor_name", "customer_name",
+        "subtotal", "tax", "total", "line_items",
+    ],
+    "adversarial_invoice": [
+        "invoice_number", "date", "vendor_name", "customer_name",
+        "subtotal", "tax", "total", "line_items",
+        "po_number", "discount_amount", "original_total",
+        "discrepancy_notes",
+    ],
 }
     """Get a document and its metadata for a given task.
     Args:
+        task_name: One of 'simple_invoice', 'messy_invoice', 'multi_document',
+                   'corrupted_scan', 'adversarial_invoice'
         doc_index: Index into the document pool (will wrap around)
     Returns:
         "ground_truth": doc["ground_truth"],
         "required_fields": TASK_REQUIRED_FIELDS.get(task_name, TASK_REQUIRED_FIELDS["simple_invoice"]),
     }
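The docstring and return value above describe two lookup behaviours: the document index wraps around the pool, and unknown task names fall back to the `simple_invoice` field list. A minimal sketch of both, assuming the wrap-around is a plain modulo (the diff does not show that line); `pick_document` and `required_fields_for` are hypothetical names, and the `simple_invoice` list below is an illustrative placeholder, not the real one.

```python
TASK_REQUIRED_FIELDS = {
    # Placeholder list for illustration only; the real one is not shown in this diff.
    "simple_invoice": ["invoice_number", "date", "total"],
    # The two lists below are copied from the diff.
    "corrupted_scan": [
        "invoice_number", "date", "vendor_name", "customer_name",
        "subtotal", "tax", "total", "line_items",
    ],
    "adversarial_invoice": [
        "invoice_number", "date", "vendor_name", "customer_name",
        "subtotal", "tax", "total", "line_items",
        "po_number", "discount_amount", "original_total",
        "discrepancy_notes",
    ],
}


def pick_document(pool: list, doc_index: int):
    """Select a document, wrapping the index around the pool (assumed modulo)."""
    if not pool:
        raise ValueError("document pool is empty")
    return pool[doc_index % len(pool)]


def required_fields_for(task_name: str) -> list:
    """Return the task's required fields, falling back to simple_invoice."""
    return TASK_REQUIRED_FIELDS.get(task_name, TASK_REQUIRED_FIELDS["simple_invoice"])


pool = ["adversarial_001", "adversarial_002", "adversarial_003"]
wrapped = pick_document(pool, 4)  # index 4 wraps to position 1
```

One consequence of the fallback is that a misspelled task name silently yields the easiest field list rather than an error, so callers may want to validate against `VALID_TASKS` first.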

server/environment.py CHANGED

@@ -20,6 +20,8 @@ MAX_ATTEMPTS = {
     "simple_invoice": 3,
     "messy_invoice": 3,
     "multi_document": 5,
 }
 VALID_TASKS = list(TASK_REQUIRED_FIELDS.keys())
@@ -174,16 +176,19 @@ class InvoiceExtractionEnvironment:
         """Return the list of required fields with descriptions."""
         field_descriptions = {
             "invoice_number": "The invoice/document number (string)",
-            "date": "Invoice date in YYYY-MM-DD format",
             "vendor_name": "Name of the vendor/seller/supplier",
             "customer_name": "Name of the customer/buyer/bill-to party",
-            "subtotal": "Subtotal before tax (number)",
             "tax": "Tax amount (number)",
             "total": "Total amount due (number)",
             "line_items": "Array of items: [{description, quantity, unit_price, amount}]",
             "po_number": "Purchase order reference number (string)",
             "adjustment_reason": "Reason for any adjustments/credits (string)",
             "adjusted_total": "Final adjusted total after credits/payments (number)",
         }
         lines = ["Required fields to extract:\n"]
@@ -243,10 +248,43 @@ class InvoiceExtractionEnvironment:
         # Grade the extraction
         self._state.attempts_used += 1
-        score, feedback = grade_extraction(
             extracted, self._ground_truth, self._required_fields
         )
         # Track best score
         self._state.best_score = max(self._state.best_score, score)
         self._last_feedback = feedback
@@ -259,12 +297,15 @@ class InvoiceExtractionEnvironment:
         matched = sum(1 for f in feedback.values() if f.get("matched", False))
         total = len(feedback)
         feedback_text = (
-            f"Extraction scored: {score:.2f}\n"
             f"Fields matched: {matched}/{total}\n"
-            f"Best score so far: {self._state.best_score:.2f}\n"
             f"Attempts remaining: {attempts_remaining}\n"
         )
         if not done and score < 0.95:
             weak_fields = [
                 name for name, data in feedback.items()
@@ -275,7 +316,7 @@ class InvoiceExtractionEnvironment:
                 feedback_text += "\nUse 'get_feedback' for detailed per-field scores."
         if done:
-            feedback_text += f"\n\nEpisode complete. Final score: {self._state.best_score:.2f}"
         return InvoiceObservation(
             done=done,
@@ -287,6 +328,9 @@ class InvoiceExtractionEnvironment:
             required_fields=self._required_fields,
             metadata={
                 "score": score,
                 "best_score": self._state.best_score,
                 "field_scores": {k: v["score"] for k, v in feedback.items()},
             },
@@ -334,3 +378,19 @@ class InvoiceExtractionEnvironment:
     def close(self) -> None:
         """Clean up resources."""
         self._initialized = False

     "simple_invoice": 3,
     "messy_invoice": 3,
     "multi_document": 5,
+    "corrupted_scan": 4,
+    "adversarial_invoice": 6,
 }
 VALID_TASKS = list(TASK_REQUIRED_FIELDS.keys())
         """Return the list of required fields with descriptions."""
         field_descriptions = {
             "invoice_number": "The invoice/document number (string)",
+            "date": "Invoice date in YYYY-MM-DD format (use reissue date if applicable)",
             "vendor_name": "Name of the vendor/seller/supplier",
             "customer_name": "Name of the customer/buyer/bill-to party",
+            "subtotal": "Subtotal before tax, after discounts (number)",
             "tax": "Tax amount (number)",
             "total": "Total amount due (number)",
             "line_items": "Array of items: [{description, quantity, unit_price, amount}]",
             "po_number": "Purchase order reference number (string)",
             "adjustment_reason": "Reason for any adjustments/credits (string)",
             "adjusted_total": "Final adjusted total after credits/payments (number)",
+            "discount_amount": "Monetary discount value applied (number, 0 if none)",
+            "original_total": "What the total would have been without adjustments (number)",
+            "discrepancy_notes": "Free-text description of all discrepancies, adjustments, and anomalies found",
         }
         lines = ["Required fields to extract:\n"]
         # Grade the extraction
         self._state.attempts_used += 1
+        base_score, feedback = grade_extraction(
             extracted, self._ground_truth, self._required_fields
         )
+        # === REWARD SHAPING BONUSES ===
+        bonus = 0.0
+        bonus_details = []
+        # 1. Mathematical consistency bonus: subtotal + tax ≈ total
+        ext_sub = _safe_float(extracted.get("subtotal"))
+        ext_tax = _safe_float(extracted.get("tax"))
+        ext_total = _safe_float(extracted.get("total"))
+        if ext_sub is not None and ext_tax is not None and ext_total is not None:
+            computed = round(ext_sub + ext_tax, 2)
+            if abs(computed - ext_total) < 0.02:
+                bonus += 0.03
+                bonus_details.append("consistency_check: +0.03")
+        # 2. Improvement tracking: rewarding learning from feedback
+        prev_score = self._state.best_score
+        if self._state.attempts_used > 1 and base_score > prev_score:
+            improvement = min(base_score - prev_score, 0.02)
+            bonus += improvement
+            bonus_details.append(f"improvement: +{improvement:.3f}")
+        # 3. Step efficiency signal: fewer steps = small bonus
+        steps_used = self._state.step_count
+        if steps_used <= 3:
+            bonus += 0.02  # Very efficient
+            bonus_details.append("efficiency: +0.02")
+        elif steps_used <= 5:
+            bonus += 0.01  # Moderately efficient
+            bonus_details.append("efficiency: +0.01")
+        # Apply bonus, then clamp to [0.01, 0.99] so the score stays strictly inside (0, 1)
+        score = round(max(0.01, min(0.99, base_score + bonus)), 4)
         # Track best score
         self._state.best_score = max(self._state.best_score, score)
         self._last_feedback = feedback
         matched = sum(1 for f in feedback.values() if f.get("matched", False))
         total = len(feedback)
         feedback_text = (
+            f"Extraction scored: {score:.4f} (base: {base_score:.4f}, bonus: {bonus:.3f})\n"
             f"Fields matched: {matched}/{total}\n"
+            f"Best score so far: {self._state.best_score:.4f}\n"
             f"Attempts remaining: {attempts_remaining}\n"
         )
+        if bonus_details:
+            feedback_text += f"Reward bonuses: {', '.join(bonus_details)}\n"
         if not done and score < 0.95:
             weak_fields = [
                 name for name, data in feedback.items()
                 feedback_text += "\nUse 'get_feedback' for detailed per-field scores."
         if done:
+            feedback_text += f"\n\nEpisode complete. Final score: {self._state.best_score:.4f}"
         return InvoiceObservation(
             done=done,
             required_fields=self._required_fields,
             metadata={
                 "score": score,
+                "base_score": base_score,
+                "bonus": bonus,
+                "bonus_details": bonus_details,
                 "best_score": self._state.best_score,
                 "field_scores": {k: v["score"] for k, v in feedback.items()},
             },
     def close(self) -> None:
         """Clean up resources."""
         self._initialized = False
+def _safe_float(value) -> "float | None":
+    """Safely convert a value to float, returning None on failure."""
+    if value is None:
+        return None
+    if isinstance(value, (int, float)):
+        return float(value)
+    if isinstance(value, str):
+        import re
+        cleaned = re.sub(r"[$ ,]", "", value.strip())
+        try:
+            return float(cleaned)
+        except (ValueError, TypeError):
+            return None
+    return None
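To make the reward shaping easier to reason about in isolation, the three bonuses can be restated as a pure function. This is a sketch that mirrors the logic in the diff rather than the environment's actual code path: `to_float` and `shaped_score` are hypothetical names, and the state fields (`best_score`, `attempts_used`, `step_count`) are passed in as plain arguments.

```python
import re
from typing import Optional


def to_float(value) -> Optional[float]:
    """Parse a number or a string such as "$1,101.60"; return None on failure."""
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        try:
            return float(re.sub(r"[$ ,]", "", value.strip()))
        except ValueError:
            return None
    return None


def shaped_score(base_score: float, extracted: dict, prev_best: float,
                 attempts_used: int, step_count: int) -> float:
    """Apply the consistency, improvement, and efficiency bonuses, then clamp."""
    bonus = 0.0
    sub, tax, total = (to_float(extracted.get(k)) for k in ("subtotal", "tax", "total"))
    # 1. Consistency: subtotal + tax must agree with total to within two cents.
    if None not in (sub, tax, total) and abs(round(sub + tax, 2) - total) < 0.02:
        bonus += 0.03
    # 2. Improvement: only from the second attempt on, capped at 0.02.
    if attempts_used > 1 and base_score > prev_best:
        bonus += min(base_score - prev_best, 0.02)
    # 3. Efficiency: 0.02 for three steps or fewer, 0.01 for up to five.
    if step_count <= 3:
        bonus += 0.02
    elif step_count <= 5:
        bonus += 0.01
    return round(max(0.01, min(0.99, base_score + bonus)), 4)


extracted = {"subtotal": "$2,940.00", "tax": 205.80, "total": "3145.80"}
second_attempt = shaped_score(0.80, extracted, prev_best=0.70, attempts_used=2, step_count=3)
near_perfect = shaped_score(0.97, extracted, prev_best=0.0, attempts_used=1, step_count=1)
```

The total bonus is at most 0.07, and the clamp caps the shaped score at 0.99. In the diff, `best_score` stores the shaped score while the improvement check compares it against the unshaped `base_score`, so an improvement smaller than the previous bonus would not register; whether that is intended is not stated in the commit.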

server/graders.py CHANGED

@@ -236,9 +236,12 @@ def grade_extraction(
     field_scores = {}
     feedback = {}
-    numeric_fields = {"total", "subtotal", "tax", "adjusted_total"}
     date_fields = {"date", "due_date"}
     list_fields = {"line_items"}
     for field in required_fields:
         expected = ground_truth.get(field)
@@ -250,6 +253,9 @@ def grade_extraction(
             score = grade_numeric(actual, expected)
         elif field in date_fields:
             score = grade_date(actual, expected)
         else:
             score = grade_text(actual, expected)
@@ -258,8 +264,9 @@ def grade_extraction(
             "score": score,
             "expected_type": "list" if field in list_fields else
                             "number" if field in numeric_fields else
-                            "date" if field in date_fields else "text",
-            "matched": score >= 0.8,
         }
     # Overall score = weighted average

     field_scores = {}
     feedback = {}
+    numeric_fields = {"total", "subtotal", "tax", "adjusted_total",
+                       "discount_amount", "original_total"}
     date_fields = {"date", "due_date"}
     list_fields = {"line_items"}
+    # Free-text reasoning fields: counted as matched at score >= 0.5 instead of 0.8
+    reasoning_fields = {"discrepancy_notes", "adjustment_reason"}
     for field in required_fields:
         expected = ground_truth.get(field)
             score = grade_numeric(actual, expected)
         elif field in date_fields:
             score = grade_date(actual, expected)
+        elif field in reasoning_fields:
+            # Free-text reasoning: same fuzzy grade_text scorer as other text fields;
+            # the leniency comes from the lower "matched" threshold (0.5) applied below
+            score = grade_text(actual, expected)
         else:
             score = grade_text(actual, expected)
             "score": score,
             "expected_type": "list" if field in list_fields else
                             "number" if field in numeric_fields else
+                            "date" if field in date_fields else
+                            "reasoning" if field in reasoning_fields else "text",
+            "matched": score >= 0.5 if field in reasoning_fields else score >= 0.8,
         }
     # Overall score = weighted average
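The grader change above leaves the scoring function untouched and only lowers the "matched" cutoff for reasoning fields. A minimal sketch of that rule follows. `grade_text` is not shown in this diff, so `fuzzy_ratio` is a hypothetical stand-in built on `difflib.SequenceMatcher` and may differ from the real scorer; `is_matched` is likewise an illustrative name.

```python
from difflib import SequenceMatcher

REASONING_FIELDS = {"discrepancy_notes", "adjustment_reason"}


def fuzzy_ratio(actual: str, expected: str) -> float:
    """Hypothetical stand-in for grade_text: case-insensitive similarity in [0, 1]."""
    return SequenceMatcher(None, actual.lower().strip(), expected.lower().strip()).ratio()


def is_matched(field: str, score: float) -> bool:
    """Apply the per-field cutoff: 0.5 for reasoning fields, 0.8 otherwise."""
    threshold = 0.5 if field in REASONING_FIELDS else 0.8
    return score >= threshold


expected_note = "Invoice exceeds PO by $340 (13.1%)."
agent_note = "Invoice is over the PO by $340, about 13.1%."
note_score = fuzzy_ratio(agent_note, expected_note)
note_matched = is_matched("discrepancy_notes", note_score)
```

Because the threshold affects only the `matched` flag, the numeric score fed into the weighted average is the same for a reasoning field as for any other text field with the same similarity.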