Space: musharraf7/esctr-environment
musharraf7 committed on 29 days ago

Commit

a363048

1 Parent(s): 6f7e1b7

feat: ESCTR pivot — Enterprise Supply Chain & Tax Reconciliation

Complete rewrite for OpenEnv Hackathon Round 2:
- New domain: autonomous financial auditing (supply chain discrepancies)
- 3 tasks: procurement reconciliation, SLA enforcement, adversarial auditing
- 4 ERP tools: query_database, read_document, communicate_vendor, submit_financial_decision
- Adversarial vendor negotiation with settlement dynamics
- Procedural scenario generation (deterministic from seed)
- RLVR composite rewards with trajectory milestones and gullibility penalties
- Storytelling README (Problem → Environment → Results → Why it matters)
- Added course.md documenting the full journey
- Removed old documents.py (replaced by procedural.py)

Files changed (12)

.gitignore +1 -0
README.md +122 -155
course.md +309 -0
inference.py +155 -225
openenv.yaml +10 -13
server/__init__.py +3 -3
server/app.py +53 -76
server/documents.py +0 -898
server/environment.py +458 -526
server/graders.py +280 -302
server/models.py +91 -45
server/procedural.py +536 -382

.gitignore CHANGED

@@ -14,3 +14,4 @@ hackathon_instructions.txt
 preparatory_course.txt
 RESEARCH_1.md
 RESEARCH_2.md

+ROUND_2_GUIDELINES.md

README.md CHANGED

@@ -1,7 +1,7 @@
 ---
-title: Invoice Extraction Environment
-emoji: 📄
-colorFrom: blue
 colorTo: green
 sdk: docker
 pinned: false
@@ -10,218 +10,185 @@ tags:
   - openenv
 ---
-# Invoice Extraction Environment
-An OpenEnv-compliant environment where AI agents extract structured data from unstructured invoice and receipt documents. Features **5 difficulty tiers** — from clean invoices to adversarial documents with decoy fields, OCR corruption, and hidden calculations — with **procedural document generation** for virtually infinite training configurations and an **RLVR-inspired composite reward architecture**.
 **Space URL:** `https://huggingface.co/spaces/musharraf7/invoice-extraction-env`
-```python
-import requests
-# Connect to the environment
-url = "https://musharraf7-invoice-extraction-env.hf.space"
-r = requests.post(f"{url}/reset", json={"task_name": "simple_invoice"})
-print(r.json())
-```
-## Why This Environment?
-Invoice data extraction is a **$5B+ industry** problem faced daily by every business. This environment provides:
-- **Real RL training signal**: Per-field partial-credit scoring gives dense reward gradients via RLVR-inspired composite rewards
-- **Infinite training data**: Procedural document generation creates unique invoices from any seed — eliminating overfitting to a static corpus
-- **Genuine difficulty progression**: From clean invoices to adversarial traps that challenge frontier models
-- **Multi-tool agentic workflow**: Hard tasks feature database queries, calculation verification, and discrepancy detection tools — training agents for multi-step reasoning
-- **Reward shaping**: Trajectory milestones, consistency bonuses, efficiency signals, and improvement tracking provide rich learning signals beyond simple field matching
-- **Production relevance**: The task directly models what commercial document processing systems must solve
-## Reward Architecture (RLVR-Inspired)
-The environment uses a composite reward function inspired by Reinforcement Learning with Verifiable Rewards:
-```
-R_total = α·R_outcome + β·R_trajectory + bonuses
-```
-| Component | Weight | Description |
-|-----------|--------|-------------|
-| **R_outcome** | α = 0.70 | Weighted extraction accuracy (financial fields 1.5×, line items 2.0×) |
-| **R_trajectory** | β = 0.30 | Micro-rewards for information gathering milestones |
-| **Consistency bonus** | +0.03 | Agent's subtotal + tax = total |
-| **Efficiency bonus** | +0.01–0.02 | Solution found in ≤5 steps |
-| **Improvement bonus** | up to +0.02 | Score improves on retry |
-| **Step cost** | -0.005/step | Encourages efficient exploration |
-| **Hallucination penalty** | -0.02 | Invalid JSON or unknown commands |
-### Trajectory Milestones
-| Action | Micro-reward | Purpose |
-|--------|-------------|---------|
-| `view_document` | +0.01 | Evidence gathering |
-| `view_fields` | +0.01 | Understanding requirements |
-| `get_feedback` | +0.005 | Learning from errors |
-| `query_related_documents` | +0.015 | Cross-referencing (hard tasks) |
-| `verify_calculations` | +0.01 | Mathematical verification |
-| `check_discrepancies` | +0.015 | Anomaly detection |
-## Action Space
-The agent sends an `InvoiceAction` with a `command` and optional `payload`:
-| Command | Description | Payload | Available Tasks |
-|---------|-------------|---------|-----------------|
-| `view_document` | View the raw document text | — | All |
-| `view_fields` | See required fields with descriptions | — | All |
-| `extract` | Submit extracted fields | JSON string | All |
-| `get_feedback` | Get detailed per-field feedback | — | All |
-| `query_related_documents` | Retrieve PO, credit memos, etc. | — | multi_document, adversarial |
-| `verify_calculations` | Submit arithmetic for verification | JSON string | multi_document, adversarial |
-| `check_discrepancies` | Flag inconsistencies in documents | — | multi_document, adversarial |
-### Action Schema
-```json
-{
-  "command": "extract",
-  "payload": "{\"invoice_number\": \"INV-2024-001\", \"date\": \"2024-01-15\", ...}"
-}
-```
-## Observation Space
-Each step returns an `InvoiceObservation`:
-| Field | Type | Description |
-|-------|------|-------------|
-| `text` | string | Response text from the environment |
-| `task_name` | string | Current task name |
-| `current_score` | float | Best score achieved so far |
-| `attempts_remaining` | int | Remaining extraction attempts |
-| `required_fields` | list | Fields to extract |
-| `done` | bool | Whether the episode has ended |
-| `reward` | float | Reward signal (0.01–0.99) |
-| `last_action_status` | string | "success" or "error" |
-| `error_message` | string | Diagnostic error message (if error) |
-| `current_step` | int | Step number within episode |
-| `accumulated_reward` | float | Total reward accumulated so far |
-## Tasks (5 Difficulty Tiers)
-### 1. `simple_invoice` (Easy) — 3 attempts
-Clean, well-formatted invoices with clear field labels.
-**Required fields:** `invoice_number`, `date`, `vendor_name`, `customer_name`, `subtotal`, `tax`, `total`, `line_items`
-### 2. `messy_invoice` (Medium) — 3 attempts
-Same fields but from messy, inconsistently formatted documents with abbreviations, typos, and non-standard layouts.
-**Required fields:** Same as simple_invoice
-### 3. `multi_document` (Hard) — 5 attempts
-Complex multi-section documents containing a purchase order, invoice, and credit memo/payment receipt. The agent must cross-reference sections. **Advanced tools available** (`query_related_documents`, `verify_calculations`, `check_discrepancies`).
-**Required fields:** All basic fields + `po_number`, `adjustment_reason`, `adjusted_total`
-### 4. `corrupted_scan` (Very Hard) — 4 attempts
-Simulates OCR-scanned/faxed invoices with systematic character errors:
-- Character substitutions: `0`↔`O`, `1`↔`l`↔`I`, `5`↔`S`, `8`↔`B`
-- Garbled sections and scan artifacts
-- The agent must **reason through noise** to recover the true values
-**Required fields:** Same as simple_invoice
-### 5. `adversarial_invoice` (Expert) — 6 attempts
-Adversarial documents designed to trap and challenge frontier models:
-- **Decoy fields**: Multiple invoice numbers — only one is current
-- **Hidden calculations**: Discounts the agent must compute
-- **Contradictory sections**: PO vs invoice disagreements
-- **Budget variance alerts**: Agent must identify and explain discrepancies
-**Advanced tools available** for investigation.
-**Required fields:** All basic fields + `po_number`, `discount_amount`, `original_total`, `discrepancy_notes`
-## Procedural Document Generation
-The environment features a **procedural generation engine** that creates unique invoice documents from any seed value:
-- **15 vendor profiles** with addresses across the US
-- **15 customer profiles** with realistic business names
-- **25+ product catalog items** spanning hardware, software, and services
-- **10 tax rate configurations** (5%–10%)
-- **Deterministic**: Same seed always produces the same document
-- **Infinite variety**: Seeds 0–2 use static test fixtures; seeds ≥ 3 generate novel documents
-```python
-# Use seed to get different documents
-r = requests.post(f"{url}/reset", json={"task_name": "simple_invoice", "seed": 42})
-r = requests.post(f"{url}/reset", json={"task_name": "simple_invoice", "seed": 100})
-```
-## Per-Field Scoring
-- **Text fields**: Fuzzy matching with SequenceMatcher (0.0–1.0)
-- **Numeric fields**: Exact match (1.0), within 1% (0.9), within 5% (0.5), within 10% (0.2)
-- **Date fields**: Normalized comparison (YYYY-MM-DD) with format tolerance
-- **Line items**: Best-fit matching of description, qty, price, amount (weighted 2.0×)
-- **Reasoning fields** (discrepancy_notes): Fuzzy matching with lower threshold
-- **Financial fields** (subtotal, tax, total): Weighted 1.5× for importance
-## Setup Instructions
-### Run with Docker
 ```bash
-docker build -t invoice-extraction-env .
-docker run -p 7860:7860 invoice-extraction-env
-```
-### Run locally
-```bash
 pip install -r requirements.txt
 uvicorn server.app:app --host 0.0.0.0 --port 7860
 ```
-### Run with uv
-```bash
-uv run server
 ```
-### Run inference
 ```bash
 export ENV_URL="http://localhost:7860"
 export API_BASE_URL="https://router.huggingface.co/v1"
 export MODEL_NAME="meta-llama/Meta-Llama-3-8B-Instruct"
-export HF_TOKEN="your_token_here"
 python inference.py
 ```
 ## API Endpoints
 | Endpoint | Method | Description |
 |----------|--------|-------------|
 | `/health` | GET | Health check |
-| `/reset` | POST | Reset with task selection |
 | `/step` | POST | Execute an action |
-| `/state` | GET | Get current state |
-| `/schema` | GET | Get action/observation schemas |
-| `/metadata` | GET | Get environment metadata |
 | `/ws` | WebSocket | Persistent session |
 ## Project Structure
 ```
 ├── server/
 │   ├── __init__.py
 │   ├── app.py             # FastAPI application
-│   ├── environment.py     # Core environment logic + RLVR reward architecture
-│   ├── documents.py       # 15-document corpus across 5 difficulty tiers
-│   ├── procedural.py      # Procedural document generation engine
-│   ├── graders.py         # Field-level scoring with weighted fuzzy matching
-│   └── models.py          # Pydantic Action/Observation/State types
-├── __init__.py            # Package declaration
-├── inference.py           # Baseline inference script (all 5 tasks)
 ├── openenv.yaml           # OpenEnv manifest
-├── pyproject.toml         # Package configuration
 ├── requirements.txt       # Dependencies
 ├── Dockerfile             # Container definition
 └── README.md              # This file
 ```

 ---
+title: ESCTR Environment
+emoji: 🏢
+colorFrom: indigo
 colorTo: green
 sdk: docker
 pinned: false
   - openenv
 ---
+# 🏢 ESCTR: Enterprise Supply Chain & Tax Reconciliation
+> **Training LLMs to be autonomous financial auditors** — an OpenEnv environment for teaching AI agents to investigate procurement discrepancies, enforce SLA penalties, and navigate adversarial vendor disputes.
 **Space URL:** `https://huggingface.co/spaces/musharraf7/invoice-extraction-env`
+---
+## The Problem
+Every day, global enterprises process millions of procurement transactions. Between the Purchase Order, the shipping manifest, the SLA contract, and the final vendor invoice, discrepancies **inevitably** arise:
+- A vendor bills $45/unit instead of the contracted $40
+- A shipment arrives 5 days late, triggering SLA penalty clauses
+- A vendor disputes the penalty, claiming *your* warehouse rejected the delivery
+Resolving these disputes currently requires human financial controllers to **manually cross-reference multiple siloed databases**, interpret complex contract clauses, perform precise arithmetic, and negotiate with adversarial counterparties. It's slow, expensive, and error-prone.
+**What if we could train LLMs to do this autonomously?**
+## The Environment
+ESCTR provides a stateful sandbox where an LLM agent operates as an **autonomous financial controller**. Rather than just extracting data from a document, the agent must:
+1. **Investigate** — query procurement databases, shipping logs, SLA contracts
+2. **Reason** — cross-reference documents, calculate penalties, verify claims
+3. **Negotiate** — handle adversarial vendor communications
+4. **Decide** — submit a mathematically precise financial adjustment
+### Three Tasks, Escalating Complexity
+| Task | Difficulty | Max Steps | What the Agent Must Do |
+|------|-----------|-----------|----------------------|
+| **Procurement Reconciliation** | Easy | 10 | Find an overcharged line item between PO and Invoice, calculate the exact overcharge |
+| **SLA Enforcement** | Medium | 15 | Discover a late shipment, retrieve the SLA contract, calculate the penalty from contract terms |
+| **Adversarial Auditing** | Hard | 20 | All of the above, plus verify warehouse logs to disprove the vendor's claim and reject a settlement offer |
+### The Tool Suite
+The agent interacts through **4 ERP tools**, each requiring precise parameters:
+| Tool | Purpose | Parameters |
+|------|---------|------------|
+| `query_database` | Search corporate databases | `query_parameters: {"table": "shipping_logs"}` |
+| `read_document` | Retrieve full document text | `document_id: "PO-2024-1234"` |
+| `communicate_vendor` | Negotiate with adversarial vendor | `message_content: "We reject..."` |
+| `submit_financial_decision` | Submit final adjustment (terminal) | `adjustment_amount: -450.00` |
+### Procedural Generation
+Every scenario is generated from a seed — **same seed = same scenario = deterministic grading**. This enables:
+- Infinite training configurations (no memorization)
+- Reproducible evaluation
+- Fair comparison between models
+Each scenario generates: company profiles, product catalogs with contracted pricing, purchase orders, vendor invoices (with seeded discrepancies), SLA contracts (linear/tiered penalty structures), shipping telemetry, and warehouse access logs.
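The linear and tiered SLA penalty structures mentioned above could be computed roughly as follows. This is a minimal sketch, not the code in `procedural.py`: the function names, the daily rate, the cap, and the tier boundaries are all illustrative assumptions. `Decimal` is used so the result is exact to the cent.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")

def linear_penalty(invoice_total, days_late, daily_rate, cap_rate):
    """Linear structure: daily_rate of the invoice per late day, capped at cap_rate."""
    rate = min(Decimal(daily_rate) * days_late, Decimal(cap_rate))
    return (Decimal(invoice_total) * rate).quantize(CENT, rounding=ROUND_HALF_UP)

def tiered_penalty(invoice_total, days_late, tiers):
    """Tiered structure: tiers is a list of (min_days_late, rate); the highest tier reached applies."""
    rate = Decimal("0")
    for min_days, tier_rate in sorted(tiers):
        if days_late >= min_days:
            rate = Decimal(tier_rate)
    return (Decimal(invoice_total) * rate).quantize(CENT, rounding=ROUND_HALF_UP)

# Hypothetical contract terms: a 9,000.00 invoice delivered 5 days late.
linear = linear_penalty("9000.00", days_late=5, daily_rate="0.01", cap_rate="0.10")
tiered = tiered_penalty("9000.00", days_late=5,
                        tiers=[(1, "0.02"), (4, "0.05"), (8, "0.10")])
print(linear, tiered)
```

With these assumed terms both structures yield a penalty of 450.00, which the agent would submit as a negative `adjustment_amount`.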
+## Reward Architecture (RLVR-Inspired)
+```
+R_total = α·R_outcome + β·R_trajectory − penalties
+```
+| Component | Weight | Description |
+|-----------|--------|-------------|
+| **R_outcome** | 60-70% | Did the agent submit the correct adjustment amount? |
+| **R_trajectory** | 30-40% | Did the agent follow proper investigative procedure? |
+| **Efficiency penalty** | -0.005/step | Encourages shortest path to resolution |
+| **Hallucination penalty** | -0.02 | Invalid queries, nonexistent documents |
+| **Gullibility penalty** | -0.20 | Accepting adversarial settlement offers (Task 3) |
+| **Evidence bonus** | +0.05 | Citing warehouse logs as evidence (Task 3) |
+### Why This Reward Design Matters
+- **Dense, not sparse**: Trajectory milestones reward correct investigative behavior (querying the right databases, reading the right documents) even if the final answer is wrong
+- **Hard to game**: An agent that spams queries gets penalized by step costs; an agent that submits without investigating gets 0 trajectory reward
+- **Verifiable**: The correct answer is always a precise floating-point number derived from contract terms — no subjective evaluation
+## Results
+*Training evidence and reward plots will be added during the onsite hackathon (April 25-26) when compute credits are provided.*
+<!-- Placeholder for training results -->
+<!-- ![Reward curves](plots/reward_curves.png) -->
+## Quick Start
+### Run the environment
 ```bash
+# Docker
+docker build -t esctr-env .
+docker run -p 7860:7860 esctr-env
+# Or locally
 pip install -r requirements.txt
 uvicorn server.app:app --host 0.0.0.0 --port 7860
 ```
+### Connect an agent
+```python
+import requests
+url = "http://localhost:7860"
+# Reset with a task
+r = requests.post(f"{url}/reset", json={"task_name": "sla_enforcement", "seed": 42})
+briefing = r.json()["observation"]["system_response"]
+# Query a database
+r = requests.post(f"{url}/step", json={
+    "action": {
+        "action_type": "query_database",
+        "query_parameters": {"table": "shipping_logs"}
+    }
+})
+result = r.json()["observation"]["system_response"]
+# Submit financial decision
+r = requests.post(f"{url}/step", json={
+    "action": {
+        "action_type": "submit_financial_decision",
+        "adjustment_amount": -450.00,
+        "adjustment_reason": "Late delivery penalty per SLA clause"
+    }
+})
+score = r.json()["reward"]
 ```
+### Run baseline inference
 ```bash
 export ENV_URL="http://localhost:7860"
 export API_BASE_URL="https://router.huggingface.co/v1"
 export MODEL_NAME="meta-llama/Meta-Llama-3-8B-Instruct"
+export HF_TOKEN="your_token"
 python inference.py
 ```
+## Why This Matters
+| Question | Answer |
+|----------|--------|
+| *Does this teach an LLM something it can't do well?* | Yes — multi-step financial reasoning with tool use is a known weakness of current LLMs |
+| *Is the domain underexplored?* | Yes — supply chain auditing + adversarial negotiation is nearly absent from RL/LLM training benchmarks |
+| *Could a researcher write a paper about this?* | Yes — training autonomous financial auditors has direct commercial and academic value |
+| *Is the reward hard to game?* | Yes — the correct answer is always a precise number from contract math; trajectory rewards require specific database queries |
 ## API Endpoints
 | Endpoint | Method | Description |
 |----------|--------|-------------|
 | `/health` | GET | Health check |
+| `/reset` | POST | Reset with task + seed |
 | `/step` | POST | Execute an action |
+| `/state` | GET | Current state |
+| `/schema` | GET | Action/Observation/State schemas |
+| `/metadata` | GET | Environment metadata |
 | `/ws` | WebSocket | Persistent session |
 ## Project Structure
 ```
 ├── server/
 │   ├── __init__.py
 │   ├── app.py             # FastAPI application
+│   ├── environment.py     # Core stateful environment + tool handlers
+│   ├── procedural.py      # Deterministic scenario generation engine
+│   ├── graders.py         # Multi-axis deterministic graders (3 tasks)
+│   └── models.py          # Pydantic Action/Observation/State schemas
+├── inference.py           # Baseline inference script
 ├── openenv.yaml           # OpenEnv manifest
+├── pyproject.toml         # Package config
 ├── requirements.txt       # Dependencies
 ├── Dockerfile             # Container definition
 └── README.md              # This file
 ```
+## Themes Alignment
+- **🌐 World Modeling (Professional Tasks)** — Real interaction with tools and dynamic databases
+- **📋 Long-Horizon Planning** — Multi-step investigation requiring state tracking across 10-20 steps
+- **🤝 Multi-Agent Interactions** — Adversarial vendor negotiation with settlement dynamics
+- **📈 Self-Improvement** — Escalating difficulty curriculum (Easy → Medium → Hard)

course.md ADDED

	@@ -0,0 +1,309 @@

+# ESCTR: The Full Story — From Invoice Extraction to Enterprise Supply Chain Auditing
+> This document captures the entire journey: the problem we set out to solve, the research we did, the approaches we tried, and how we arrived at the final ESCTR environment.
+---
+## Table of Contents
+1. [The Starting Point — OpenEnv Hackathon](#1-the-starting-point)
+2. [Round 1 — Invoice Extraction Environment](#2-round-1--invoice-extraction-environment)
+3. [Research Phase — What Would Win Round 2?](#3-research-phase)
+4. [The Pivot Decision — Why ESCTR](#4-the-pivot-decision)
+5. [Architecture Deep Dive — How ESCTR Works](#5-architecture-deep-dive)
+6. [Reward Design — RLVR Principles](#6-reward-design)
+7. [What We Learned](#7-what-we-learned)
+---
+## 1. The Starting Point
+### What is the OpenEnv Hackathon?
+The **Meta PyTorch OpenEnv Hackathon × Scaler School of Technology** is a hackathon focused on building **RL training environments for LLMs**. The core idea: instead of training LLMs on static datasets, we build interactive environments where agents learn through Reinforcement Learning with Verifiable Rewards (RLVR).
+**OpenEnv** is a framework by Meta PyTorch and HuggingFace that treats RL environments as isolated microservices — the training loop (client) is completely decoupled from the environment simulation (server). The environment exposes standard HTTP endpoints (`/reset`, `/step`, `/state`) and the agent interacts through typed Actions and Observations.
+### The Challenge
+Build an OpenEnv-compliant environment that:
+- Simulates a task humans actually perform
+- Has programmatic, deterministic grading (no LLM-as-judge)
+- Provides dense reward signals (not just 0/1 at the end)
+- Supports multiple difficulty tiers
+- Runs within 2 vCPU / 8GB RAM constraints
+- Is deployable as a Docker container on HuggingFace Spaces
+---
+## 2. Round 1 — Invoice Extraction Environment
+### The Original Idea
+Our Round 1 submission was an **Invoice Extraction Environment** — an environment where an AI agent extracts structured data (vendor name, invoice number, line items, totals, etc.) from unstructured invoice documents.
+### What We Built
+- **5 difficulty tiers**: simple_invoice → messy_invoice → multi_document → corrupted_scan → adversarial_invoice
+- **15 static documents** across the 5 tiers
+- **Fuzzy string matching** for text fields, numeric tolerance for amounts
+- **Multi-step interaction**: view_document → view_fields → extract → get_feedback → refine
+- **OpenEnv compliance**: FastAPI server, typed Pydantic models, Docker deployment
+### Round 1 Enhancements (Pre-Pivot)
+Before Round 2 guidelines dropped, we upgraded the Round 1 environment with:
+1. **Procedural Document Generation** (`procedural.py`): A seed-based engine generating infinite invoice variations — 15 vendor profiles, 15 customers, 25 products, OCR corruption simulation. This eliminated the overfitting risk of a 15-document static corpus.
+2. **RLVR Composite Rewards**: Instead of a simple extraction score, we implemented:
+   ```
+   R_total = 0.70 × R_outcome + 0.30 × R_trajectory + bonuses
+   ```
+   With trajectory milestones (micro-rewards for viewing documents, getting feedback), efficiency bonuses, consistency bonuses (subtotal + tax = total), and penalties.
+3. **Weighted Grading**: Financial fields scored 1.5×, line items 2.0×, with built-in cross-field arithmetic verification.
+4. **Multi-Tool Workflow**: For hard tasks (multi_document, adversarial_invoice), we added `query_related_documents`, `verify_calculations`, and `check_discrepancies` tools.
+### Why Round 1 Wasn't Enough
+The enhanced invoice extraction was technically solid — all tests passed, good reward design, infinite procedural data. **But it wasn't going to win Round 2.**
+---
+## 3. Research Phase
+### RESEARCH_1: The ESCTR Blueprint
+We conducted deep research into what would maximize hackathon scoring. The key findings:
+**The Core Problem with Invoice Extraction:**
+| Vulnerability | Why It Hurts |
+|--------------|-------------|
+| **Saturated domain** | Document extraction is a well-trodden path. Judges have seen it before. |
+| **Shallow interaction** | View document → extract → done. No real multi-step reasoning. |
+| **Text-centric abstraction** | Pre-parsed text removes any visual/spatial reasoning challenge. |
+| **Low novelty ceiling** | Even with procedural generation, the core task is "fill in the JSON fields." |
+**What Frontier AI Research Demands:**
+Drawing from the **OLMo 3 technical report** and RLVR research, we identified that winning environments need:
+- **Long-horizon planning**: Agents that plan across 10-20 steps, not 3-5
+- **Tool orchestration**: Multiple heterogeneous tools, not just "view" and "extract"
+- **Partial observability**: Information spread across multiple databases, not one document
+- **Adversarial dynamics**: Active counterparties that resist the agent's goal
+- **Deterministic verification**: Correct answers that are mathematically provable, not fuzzy-matched
+**The Proposed Solution: Enterprise Supply Chain & Tax Reconciliation (ESCTR)**
+The research proposed pivoting from "extract data from an invoice" to "act as an autonomous financial controller investigating procurement discrepancies." This transforms a simple NLP extraction task into a genuine **agentic workflow** that maps to real enterprise operations worth trillions of dollars annually.
+### RESEARCH_2: Supporting Analysis
+The supplementary research validated the ESCTR concept against:
+- Amazon's agentic AI evaluation practices
+- Multi-agent negotiation frameworks
+- The credit assignment problem in long-horizon RL
+- Rubric-based reward systems for domains beyond simple verification
+### Key Insight from Research
+> "An environment that challenges frontier 72B models at 40% success rate on its hardest task provides more training headroom than one where 8B models already score 80%."
+This directly informed our task difficulty design — Task 3 (Adversarial Auditing) is deliberately hard enough that a model must:
+1. Query 5 different databases
+2. Cross-reference shipping dates against SLA penalty clauses
+3. Verify warehouse logs to disprove a vendor's false claim
+4. Navigate a multi-turn negotiation
+5. Reject a settlement offer
+6. Calculate the exact penalty amount to 2 decimal places
+---
+## 4. The Pivot Decision
+### Round 2 Guidelines Changed Everything
+When the Round 2 guidelines arrived, the scoring criteria shifted dramatically:
+| Criterion | Round 1 Weight | Round 2 Weight |
+|-----------|---------------|---------------|
+| Environment Innovation | ~30% | **40%** |
+| Storytelling & Presentation | 0% | **30%** |
+| Training Evidence (reward curves) | 0% | **20%** |
+| Reward & Training Pipeline | ~25% | **10%** |
+**70% of the score** now depends on innovation + storytelling. The guidelines explicitly warned:
+> *"A messy but ambitious environment with real training evidence beats a polished but boring one."*
+> *"Judges have seen a lot of chess, snake, tic-tac-toe, and grid-world clones."*
+### The Decision Matrix
+| Factor | Invoice Extraction | ESCTR |
+|--------|-------------------|-------|
+| Innovation (40%) | ⚠️ Known domain, seen before | ✅ Novel — supply chain auditing is unexplored in RL |
+| Storytelling (30%) | ⚠️ Hard to make exciting | ✅ Strong narrative — "training autonomous financial controllers" |
+| Training Evidence (20%) | Equal | Equal |
+| Theme Alignment | Weak — barely touches themes | ✅ Hits Theme #3.1 (World Modeling), #2 (Long-Horizon), #1 (Multi-Agent) |
+| Technical Depth | Good but shallow | ✅ 4 tools, 5 databases, adversarial negotiation |
+### Decision: Full ESCTR Pivot
+We chose **Option A: Full ESCTR Pivot** because:
+1. The innovation ceiling is dramatically higher
+2. The storytelling angle is compelling and unique
+3. Our existing RLVR reward architecture transfers directly
+4. The procedural generation concept transfers directly
+5. We had 2 days pre-onsite + 2 days onsite to build it
+The risk was real — a complete rewrite — but a "polished but boring" environment was guaranteed to lose.
+---
+## 5. Architecture Deep Dive
+### How ESCTR Works
+The agent is presented with a **discrepancy alert** and must use 4 ERP tools to investigate:
+```
+┌─────────────────────────────────────────┐
+│           ESCTR Environment             │
+│                                         │
+│  ┌─────────┐  ┌──────────┐  ┌────────┐│
+│  │ Purchase │  │ Shipping │  │  SLA   ││
+│  │  Orders  │  │   Logs   │  │Contract││
+│  └────┬─────┘  └────┬─────┘  └───┬────┘│
+│       │              │            │      │
+│  ┌────┴──────────────┴────────────┴────┐│
+│  │         Tool Dispatcher              ││
+│  │  query_database | read_document      ││
+│  │  communicate_vendor                  ││
+│  │  submit_financial_decision           ││
+│  └────────────────┬─────────────────────┘│
+│                   │                      │
+│  ┌────────────────┴─────────────────────┐│
+│  │         Grader Engine                ││
+│  │  R = α·outcome + β·trajectory − pen  ││
+│  └──────────────────────────────────────┘│
+└─────────────────────────────────────────┘
+```
+### The Three Tasks
+**Task 1 — Procurement Reconciliation (Easy)**
+- A vendor invoices at higher prices than contracted
+- Agent must: Query PO → Query Invoice → Compare line items → Find overcharge → Submit correction
+- Grading: Correct line item ID + exact adjustment amount = 1.0
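The Task 1 comparison can be sketched as below. The record layout (`id`, `unit_price`, `quantity`) and the helper name `find_overcharges` are hypothetical and chosen only for illustration; the actual PO and invoice schemas live in the repository and are not shown here.

```python
from decimal import Decimal

def find_overcharges(po_lines, invoice_lines):
    """Return {line_item_id: overcharge} for lines billed above the contracted price."""
    contracted = {line["id"]: Decimal(line["unit_price"]) for line in po_lines}
    overcharges = {}
    for line in invoice_lines:
        billed = Decimal(line["unit_price"])
        agreed = contracted.get(line["id"])
        if agreed is not None and billed > agreed:
            overcharges[line["id"]] = (billed - agreed) * line["quantity"]
    return overcharges

# Hypothetical data: the vendor bills 45.00/unit against a contracted 40.00/unit.
po = [{"id": "LI-1", "unit_price": "40.00", "quantity": 90},
      {"id": "LI-2", "unit_price": "12.50", "quantity": 10}]
invoice = [{"id": "LI-1", "unit_price": "45.00", "quantity": 90},
           {"id": "LI-2", "unit_price": "12.50", "quantity": 10}]

overcharges = find_overcharges(po, invoice)
adjustment = -sum(overcharges.values(), Decimal("0"))  # negative: amount to deduct
print(overcharges, adjustment)
```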
+**Task 2 — SLA Enforcement (Medium)**
+- A shipment arrived late, vendor demands full payment
+- Agent must: Query shipping logs → Discover delay → Query SLA contract → Calculate penalty per terms → Submit deduction
+- Grading: Exact penalty calculation = 1.0, within 5% = 0.7, within 10% = 0.4
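The tolerance bands for Task 2 could be implemented along these lines. This is a sketch under two assumptions not stated in the source: "exact" is taken to mean equal after rounding to the cent, and a submission more than 10% off scores 0.0. The function name `grade_penalty` is hypothetical.

```python
def grade_penalty(submitted, expected):
    """Score a submitted penalty amount by its relative error against the expected amount."""
    if round(submitted, 2) == round(expected, 2):
        return 1.0                      # exact to the cent (assumed definition)
    if expected == 0:
        return 0.0                      # relative error is undefined; treat as a miss
    rel_error = abs(submitted - expected) / abs(expected)
    if rel_error <= 0.05:
        return 0.7
    if rel_error <= 0.10:
        return 0.4
    return 0.0                          # assumed score outside the 10% band

for guess in (-450.0, -440.0, -410.0, -300.0):
    print(guess, grade_penalty(guess, -450.0))
```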
+**Task 3 — Adversarial Auditing (Hard)**
+- Vendor disputes the late delivery, claims warehouse rejected shipment
+- Agent must: Verify shipping delay → Get SLA terms → Query warehouse logs (prove dock was open) → Engage vendor → Reject settlement offer → Enforce full penalty
+- Grading: Multi-axis — outcome (60%) + trajectory (40%) − gullibility penalty + evidence bonus
+### Procedural Generation
+Every scenario is generated from a seed using deterministic randomization:
+- **15 vendor profiles** with US addresses
+- **15 buyer profiles** with realistic business names
+- **20 products** across hardware, electrical, IT, machinery categories
+- **5 SLA penalty structures** (linear and tiered)
+- Same seed → identical scenario → reproducible evaluation
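The "same seed, identical scenario" property can be illustrated with a private `random.Random` instance. This is a sketch only: the pool contents are placeholders, the split of the 5 SLA structures into linear and tiered variants is assumed, and `generate_scenario` is a hypothetical name rather than the function in `procedural.py`.

```python
import random

# Placeholder pools sized to match the counts listed above.
VENDORS = [f"Vendor-{i:02d}" for i in range(15)]
BUYERS = [f"Buyer-{i:02d}" for i in range(15)]
PRODUCTS = [f"SKU-{i:03d}" for i in range(20)]
SLA_STRUCTURES = ["linear_a", "linear_b", "tiered_a", "tiered_b", "tiered_c"]

def generate_scenario(seed):
    """Build a scenario whose every random choice is drawn from one seeded RNG."""
    rng = random.Random(seed)  # private RNG, independent of the global random state
    return {
        "vendor": rng.choice(VENDORS),
        "buyer": rng.choice(BUYERS),
        "products": rng.sample(PRODUCTS, k=3),
        "sla": rng.choice(SLA_STRUCTURES),
        "days_late": rng.randint(1, 10),
    }

first = generate_scenario(42)
second = generate_scenario(42)
print(first == second)
```

Using a dedicated `Random` instance rather than `random.seed` keeps generation reproducible even when other code in the same process also draws random numbers.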
+### The Vendor Negotiation System
+Task 3 features a **3-phase adversarial vendor**:
+1. **Phase 1 — The Excuse**: Vendor claims your warehouse rejected delivery
+2. **Phase 2 — The Settlement Offer**: Vendor offers 40-55% of the penalty as a "goodwill credit"
+3. **Phase 3 — Concession or Persistence**: If agent rejects firmly + cites evidence, vendor concedes
+The agent is penalized −0.20 for **gullibility** (accepting the settlement) and rewarded +0.05 for **evidence citation** (mentioning warehouse logs in the adjustment reason).
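The three phases can be modeled as a small turn-based state machine. The sketch below is an assumption-heavy illustration, not the vendor logic in `environment.py`: the class name `VendorSim`, the keyword checks for "reject" and "warehouse log", the fixed 50% offer, and the reply wording are all hypothetical.

```python
class VendorSim:
    """Scripted adversarial vendor: excuse, then settlement offer, then concede or persist."""

    def __init__(self, penalty, offer_fraction=0.5):
        # The source describes offers of 40-55% of the penalty; 0.5 is an arbitrary pick.
        self.offer = round(penalty * offer_fraction, 2)
        self.turn = 0
        self.conceded = False

    def respond(self, message):
        self.turn += 1
        if self.turn == 1:   # Phase 1: the excuse
            return "Delivery was attempted on time; your warehouse rejected it."
        if self.turn == 2:   # Phase 2: the settlement offer
            return f"As a goodwill credit we can offer {self.offer:.2f}."
        # Phase 3: concede only on a firm rejection that cites evidence
        text = message.lower()
        if "reject" in text and "warehouse log" in text:
            self.conceded = True
            return "Understood. We accept the full penalty."
        return f"Our offer of {self.offer:.2f} stands."

vendor = VendorSim(penalty=450.0)
excuse = vendor.respond("Your shipment arrived 5 days late.")
offer = vendor.respond("Our records show otherwise.")
final = vendor.respond("We reject the offer; warehouse logs show the dock was open.")
print(excuse, offer, final, sep="\n")
```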
+---
+## 6. Reward Design
+### RLVR Principles Applied
+Our reward design follows principles from the OLMo 3 technical report:
+```
+R_total = α · R_outcome + β · R_trajectory − penalties
+```
+**Why not just binary rewards?**
+- Sparse rewards (0 or 1 at the end) make credit assignment intractable in 15-20 step episodes
+- The agent can't tell which of its 15 actions contributed to success or failure
+- Dense trajectory rewards act as "algorithmic breadcrumbs" guiding policy gradients
+**Trajectory Milestones:**
+| Milestone | Meaning |
+|-----------|---------|
+| `retrieved_po` | Agent queried the purchase order database |
+| `retrieved_invoice` | Agent queried the invoice database |
+| `retrieved_shipping` | Agent discovered the shipping delay |
+| `retrieved_sla` | Agent found the penalty terms |
+| `checked_warehouse` | Agent verified internal records |
+| `vendor_negotiation` | Agent engaged with the adversarial vendor |
+| `calculated_penalty` | Agent performed penalty arithmetic |
+**Penalties:**
+- Step cost: −0.005 per action (encourages efficiency)
+- Hallucination: −0.02 for invalid queries or nonexistent documents
+- Gullibility: −0.20 for accepting adversarial settlements (Task 3)
+**Why These Specific Values?**
+- Step cost is small enough that investigation is still rewarded
+- Hallucination penalty is 4× the step cost — bad actions are much worse than slow actions
+- Gullibility penalty is massive (−0.20) because accepting a fraudulent claim is the worst possible failure mode in financial auditing
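Putting the stated values together, the composite reward for Task 3 might look like the following. This is a sketch: the 0.6/0.4 weights and the penalty and bonus magnitudes come from the text above, while the function name, the argument names, and the final clamp to [0, 1] are assumptions.

```python
def composite_reward(outcome, trajectory, steps, hallucinations=0,
                     accepted_settlement=False, cited_evidence=False,
                     alpha=0.6, beta=0.4):
    """Combine outcome and trajectory scores (each in [0, 1]) with penalties and bonuses."""
    reward = alpha * outcome + beta * trajectory
    reward -= 0.005 * steps            # step cost
    reward -= 0.02 * hallucinations    # invalid queries or nonexistent documents
    if accepted_settlement:
        reward -= 0.20                 # gullibility penalty
    if cited_evidence:
        reward += 0.05                 # evidence bonus
    return max(0.0, min(1.0, reward))  # clamping is an assumption

# Correct answer, half the milestones, 10 steps
print(composite_reward(1.0, 0.5, steps=10))
# Wrong answer, full investigation, but accepted the settlement
print(composite_reward(0.0, 1.0, steps=10, accepted_settlement=True))
```

Under these numbers, an agent that investigates thoroughly but accepts the settlement loses half of its trajectory credit to the gullibility penalty alone (0.20 of a possible 0.40).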
+---
+## 7. What We Learned
+### Technical Lessons
+1. **Procedural generation is non-negotiable** for RL environments. Static corpora get memorized instantly. Our engine generates unique scenarios from any seed.
+2. **Tool restriction per task** is important. Easy tasks shouldn't have tools the agent can't meaningfully use — it creates noise in the reward signal.
+3. **Adversarial dynamics create genuine difficulty.** A vendor that lies and offers settlements tests the agent's reasoning in ways static documents never can.
+4. **Composite rewards require careful balancing.** If trajectory reward is too high, agents learn to query everything without ever submitting. If too low, they learn to guess without investigating.
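The first lesson, seed-deterministic generation, can be illustrated with a toy generator. This is a sketch only: the field names, price ranges, and markup choices are invented for illustration and are not taken from `server/procedural.py`.

```python
import random


def generate_scenario(seed: int) -> dict:
    """Build a toy overcharge scenario that depends only on the seed."""
    rng = random.Random(seed)  # local RNG, unaffected by global random state
    contracted = round(rng.uniform(10.0, 500.0), 2)
    markup = rng.choice([0.05, 0.10, 0.15])  # hypothetical overcharge rates
    quantity = rng.randint(1, 100)
    invoiced = round(contracted * (1 + markup), 2)
    return {
        "quantity": quantity,
        "contracted_unit_price": contracted,
        "invoiced_unit_price": invoiced,
        "overcharge": round((invoiced - contracted) * quantity, 2),
    }
```

Using a private `random.Random(seed)` instance rather than the module-level functions keeps the same seed reproducible even when other code draws random numbers in between.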
+### Strategic Lessons
+1. **Read the scoring rubric backwards.** Don't start with what you want to build — start with what gets scored highest and work backwards.
+2. **Innovation (40%) + Storytelling (30%) = 70%.** A technically perfect but boring environment loses to a messy but ambitious one with a great narrative.
+3. **The pivot was worth the risk.** Rewriting 1000+ lines of code in 2 days was aggressive, but staying with invoice extraction would have capped us at "top 10, not first."
+4. **Domain choice matters enormously.** Supply chain auditing is a multi-trillion-dollar problem that's underexplored in AI training — this gives us both novelty and real-world utility.
+---
+## Appendix: File History
+| Phase | Files Created/Modified | Purpose |
+|-------|----------------------|---------|
+| Round 1 | `server/documents.py` (15 static docs) | Original invoice corpus |
+| Round 1 | `server/graders.py` (fuzzy matching) | Text extraction grading |
+| Enhancement | `server/procedural.py` v1 (invoice generator) | Infinite invoice variations |
+| Enhancement | `server/environment.py` v1 (6 tools) | Multi-tool invoice extraction |
+| **ESCTR Pivot** | `server/models.py` (ESCTRAction/Obs/State) | ERP tool schemas |
+| **ESCTR Pivot** | `server/procedural.py` v2 (corporate graphs) | Supply chain scenario generation |
+| **ESCTR Pivot** | `server/graders.py` v2 (3 task graders) | Deterministic multi-axis scoring |
+| **ESCTR Pivot** | `server/environment.py` v2 (4 tools + vendor AI) | Full ESCTR environment |
+| **ESCTR Pivot** | `inference.py` v2 (financial controller) | Baseline agent script |
+| **ESCTR Pivot** | Removed `server/documents.py` | No longer needed |

inference.py CHANGED

@@ -1,18 +1,16 @@
 #!/usr/bin/env python3
 """
-Baseline inference script for the Invoice Extraction Environment.
-This script demonstrates how an LLM agent interacts with the environment
-to extract structured data from invoice documents. It runs all five tasks
-(simple_invoice, messy_invoice, multi_document, corrupted_scan, adversarial_invoice)
-and logs results in the mandatory OpenEnv [START]/[STEP]/[END] format.
 Required environment variables:
-    API_BASE_URL       — OpenAI-compatible API endpoint
-    MODEL_NAME         — Model identifier (e.g. meta-llama/Meta-Llama-3-8B-Instruct)
-    HF_TOKEN           — API key / Hugging Face token (no default)
-    ENV_URL            — URL of the running environment server (default: http://localhost:7860)
-    LOCAL_IMAGE_NAME   — (Optional) Docker image name for from_docker_image() style
 """
 import json
@@ -33,15 +31,9 @@ HF_TOKEN = os.getenv("HF_TOKEN")
 ENV_URL = os.getenv("ENV_URL", "http://localhost:7860")
 LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME")
-TASKS = ["simple_invoice", "messy_invoice", "multi_document", "corrupted_scan", "adversarial_invoice"]
-BENCHMARK = "invoice-extraction"
-# Tasks that support advanced multi-tool commands
-TOOL_ENABLED_TASKS = {"multi_document", "adversarial_invoice"}
-# ---------------------------------------------------------------------------
-# LLM Client
-# ---------------------------------------------------------------------------
 llm = OpenAI(base_url=API_BASE_URL, api_key=HF_TOKEN)
@@ -67,140 +59,121 @@ def env_reset(url: str, task_name: str, seed: int = 0) -> dict:
     return r.json()
-def env_step(url: str, command: str, payload: str = "") -> dict:
-    r = requests.post(f"{url}/step", json={"action": {"command": command, "payload": payload}}, timeout=30)
     r.raise_for_status()
     return r.json()
 # ---------------------------------------------------------------------------
-# Logging helpers (strict OpenEnv format)
 # ---------------------------------------------------------------------------
 def log_start(task: str, model: str):
     print(f"[START] task={task} env={BENCHMARK} model={model}", flush=True)
 def log_step(step: int, action: str, reward: float, done: bool, error=None):
-    error_val = error if error else "null"
-    print(
-        f"[STEP] step={step} action={action} reward={reward:.2f} "
-        f"done={str(done).lower()} error={error_val}",
-        flush=True,
-    )
 def log_end(success: bool, steps: int, score: float, rewards: list):
-    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
-    print(
-        f"[END] success={str(success).lower()} steps={steps} "
-        f"score={score:.2f} rewards={rewards_str}",
-        flush=True,
-    )
 # ---------------------------------------------------------------------------
-# LLM extraction logic
 # ---------------------------------------------------------------------------
-EXTRACT_PROMPT = """You are an expert data extraction assistant. Given the following document text, extract the specified fields and return ONLY a valid JSON object.
-DOCUMENT:
-{document}
-REQUIRED FIELDS:
-{fields}
-RULES:
-- Return ONLY a valid JSON object, no explanation or markdown
-- For dates, use YYYY-MM-DD format (e.g. 2024-01-15)
-- For monetary amounts, use plain numbers without currency symbols (e.g. 1134.00)
-- For line_items, use an array of objects with keys: description, quantity, unit_price, amount
-- If a field cannot be found, use null
-{task_specific_rules}
-IMPORTANT: Ensure your extracted subtotal + tax = total. Verify math consistency.
-JSON:"""
-TASK_RULES = {
-    "simple_invoice": "",
-    "messy_invoice": (
-        "- This document uses informal formatting, abbreviations, and shorthand\n"
-        "- Look past formatting irregularities to find the actual values\n"
-        "- 'subtot', 's/t', 'sub' = subtotal; 'tx' = tax; 'amt due' = total"
-    ),
-    "multi_document": (
-        "- This contains MULTIPLE document sections (PO, Invoice, Credit Memo, etc.)\n"
-        "- Extract from the INVOICE section primarily\n"
-        "- adjusted_total is the final amount after credits/payments\n"
-        "- po_number is the purchase order reference number\n"
-        "- adjustment_reason describes why the total was adjusted\n"
-        "- Cross-reference PO with invoice for discrepancies"
-    ),
-    "corrupted_scan": (
-        "- WARNING: This is an OCR-scanned document with character errors\n"
-        "- Common OCR substitutions: 0<->O, 1<->l<->I, 5<->S, 8<->B\n"
-        "- Mentally correct OCR errors to recover the true values\n"
-        "- 'lNV' = 'INV', 'S' in numbers = '5', 'O' in numbers = '0'\n"
-        "- Verify all numbers by cross-checking (qty * unit_price = amount)"
-    ),
-    "adversarial_invoice": (
-        "- CAUTION: This document contains DECOY fields and contradictions\n"
-        "- Multiple invoice numbers may appear — use the CURRENT/ACTIVE one\n"
-        "- If there is a reissue date, use that as the date (not the original)\n"
-        "- subtotal is the ADJUSTED subtotal after any discounts\n"
-        "- discount_amount is the monetary discount value\n"
-        "- original_total is what the total WOULD have been without adjustments\n"
-        "- discrepancy_notes: describe ALL discrepancies and adjustments\n"
-        "- po_number: the purchase order reference if present, else null\n"
-        "- Cross-reference different sections to find contradictions"
-    ),
 }
-REFINE_PROMPT = """You previously extracted data from an invoice but some fields were incorrect.
-DOCUMENT:
-{document}
-YOUR PREVIOUS EXTRACTION:
-{previous}
-FIELDS NEEDING IMPROVEMENT: {weak_fields}
-FEEDBACK:
-{feedback}
-{extra_context}
-Please re-extract ALL fields and return ONLY a valid JSON object with corrections.
-Pay special attention to the fields listed above.
-RULES:
-- Return ONLY a valid JSON object, no explanation or markdown
-- For dates, use YYYY-MM-DD format
-- For monetary amounts, use plain numbers without currency symbols
-- For line_items, use an array of objects with keys: description, quantity, unit_price, amount
-- VERIFY: subtotal + tax should equal total
-{task_specific_rules}
-JSON:"""
-def call_llm(prompt: str) -> str:
     try:
         response = llm.chat.completions.create(
             model=MODEL_NAME,
-            messages=[{"role": "user", "content": prompt}],
             temperature=0.1,
-            max_tokens=2000,
         )
         return response.choices[0].message.content.strip()
     except Exception as e:
-        return json.dumps({"error": str(e)})
-def extract_json_from_response(text: str) -> str:
     if "```json" in text:
         start = text.index("```json") + 7
         end = text.index("```", start)
@@ -219,113 +192,69 @@ def extract_json_from_response(text: str) -> str:
             elif text[i] == "}":
                 depth -= 1
                 if depth == 0:
-                    return text[brace_start : i + 1]
-    return text
 # ---------------------------------------------------------------------------
-# Main inference loop
 # ---------------------------------------------------------------------------
 def run_task(env_url: str, task_name: str, seed: int = 0) -> float:
-    """Run a single task and return the final score."""
     log_start(task=task_name, model=MODEL_NAME)
     rewards = []
     step_num = 0
     final_score = 0.0
     try:
-        env_reset(env_url, task_name, seed=seed)
-        # Step 1: View the document
-        step_num += 1
-        result = env_step(env_url, "view_document")
-        document_text = result.get("observation", {}).get("text", "")
-        reward = result.get("reward", 0.0) or 0.0
-        done = result.get("done", False)
-        rewards.append(reward)
-        log_step(step_num, "view_document()", reward, done)
-        # Step 2: View required fields
-        step_num += 1
-        result = env_step(env_url, "view_fields")
-        required_fields = result.get("observation", {}).get("required_fields", [])
-        reward = result.get("reward", 0.0) or 0.0
-        done = result.get("done", False)
-        rewards.append(reward)
-        log_step(step_num, "view_fields()", reward, done)
-        # Step 2.5: For tool-enabled tasks, gather extra context
-        extra_context = ""
-        if task_name in TOOL_ENABLED_TASKS:
-            step_num += 1
-            result = env_step(env_url, "query_related_documents")
-            related_text = result.get("observation", {}).get("text", "")
-            reward = result.get("reward", 0.0) or 0.0
-            rewards.append(reward)
-            log_step(step_num, "query_related_documents()", reward, False)
-            extra_context += f"\nRELATED DOCUMENTS:\n{related_text}\n"
             step_num += 1
-            result = env_step(env_url, "check_discrepancies")
-            discrep_text = result.get("observation", {}).get("text", "")
             reward = result.get("reward", 0.0) or 0.0
             rewards.append(reward)
-            log_step(step_num, "check_discrepancies()", reward, False)
-            extra_context += f"\nDISCREPANCY HINTS:\n{discrep_text}\n"
-        # Step 3: LLM extraction
-        fields_str = "\n".join(f"- {f}" for f in required_fields)
-        task_rules = TASK_RULES.get(task_name, "")
-        prompt = EXTRACT_PROMPT.format(
-            document=document_text + extra_context,
-            fields=fields_str,
-            task_specific_rules=task_rules,
-        )
-        llm_response = call_llm(prompt)
-        extracted_json = extract_json_from_response(llm_response)
-        # Step 4: Submit extraction
-        step_num += 1
-        result = env_step(env_url, "extract", extracted_json)
-        reward = result.get("reward", 0.0) or 0.0
-        done = result.get("done", False)
-        obs = result.get("observation", {})
-        rewards.append(reward)
-        log_step(step_num, "submit_extraction()", reward, done)
-        final_score = reward
-        # If not done and score < 0.9, refine
-        if not done and reward < 0.9:
-            step_num += 1
-            fb_result = env_step(env_url, "get_feedback")
-            feedback_text = fb_result.get("observation", {}).get("text", "")
-            fb_reward = fb_result.get("reward", 0.0) or 0.0
-            rewards.append(fb_reward)
-            log_step(step_num, "get_feedback()", fb_reward, False)
-            field_scores = obs.get("metadata", {}).get("field_scores", {})
-            weak_fields = [f for f, s in field_scores.items() if s < 0.8]
-            refine_prompt = REFINE_PROMPT.format(
-                document=document_text,
-                previous=extracted_json,
-                weak_fields=", ".join(weak_fields) if weak_fields else "all fields",
-                feedback=feedback_text,
-                extra_context=extra_context,
-                task_specific_rules=task_rules,
-            )
-            refined_response = call_llm(refine_prompt)
-            refined_json = extract_json_from_response(refined_response)
-            step_num += 1
-            result2 = env_step(env_url, "extract", refined_json)
-            reward2 = result2.get("reward", 0.0) or 0.0
-            done = result2.get("done", False)
-            rewards.append(reward2)
-            log_step(step_num, "submit_refined_extraction()", reward2, done)
-            final_score = max(final_score, reward2)
     except Exception as e:
         step_num += 1
@@ -338,51 +267,52 @@ def run_task(env_url: str, task_name: str, seed: int = 0) -> float:
     return final_score
 def main():
     global ENV_URL
     container_id = None
     if LOCAL_IMAGE_NAME:
-        print(f"Starting Docker container from image: {LOCAL_IMAGE_NAME}")
         try:
             container_id = subprocess.check_output(
                 ["docker", "run", "-d", "--rm", "-p", "7860:7860", LOCAL_IMAGE_NAME],
-                stderr=subprocess.STDOUT,
             ).decode().strip()
             ENV_URL = "http://localhost:7860"
-            print(f"Container started: {container_id[:12]}")
         except Exception as e:
-            print(f"Failed to start Docker container: {e}")
             sys.exit(1)
     print(f"Waiting for environment at {ENV_URL} ...")
     if not env_health(ENV_URL):
-        print("ERROR: Environment failed to become healthy")
         if container_id:
             subprocess.run(["docker", "stop", container_id], capture_output=True)
         sys.exit(1)
-    print("Environment is healthy!\n")
     scores = {}
-    for task_name in TASKS:
-        score = run_task(ENV_URL, task_name, seed=42)
-        scores[task_name] = score
         print()
-    avg_score = sum(scores.values()) / len(scores) if scores else 0.0
     print("=" * 50)
-    print("SUMMARY")
     print("=" * 50)
-    for task, score in scores.items():
-        print(f"  {task}: {score:.2f}")
-    print(f"  Average: {avg_score:.2f}")
     print("=" * 50)
     if container_id:
-        print(f"Stopping container {container_id[:12]} ...")
         subprocess.run(["docker", "stop", container_id], capture_output=True)
-    return 0 if avg_score > 0 else 1
 if __name__ == "__main__":

 #!/usr/bin/env python3
 """
+Baseline inference script for the ESCTR Environment.
+Demonstrates how an LLM agent interacts with the enterprise supply chain
+environment to investigate discrepancies, enforce SLA penalties, and
+navigate adversarial vendor disputes.
 Required environment variables:
+    API_BASE_URL  — OpenAI-compatible API endpoint
+    MODEL_NAME    — Model identifier (e.g. meta-llama/Meta-Llama-3-8B-Instruct)
+    HF_TOKEN      — API key
+    ENV_URL       — Environment server URL (default: http://localhost:7860)
 """
 import json
 ENV_URL = os.getenv("ENV_URL", "http://localhost:7860")
 LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME")
+TASKS = ["procurement_reconciliation", "sla_enforcement", "adversarial_auditing"]
+BENCHMARK = "esctr"
 llm = OpenAI(base_url=API_BASE_URL, api_key=HF_TOKEN)
     return r.json()
+def env_step(url: str, action: dict) -> dict:
+    r = requests.post(f"{url}/step", json={"action": action}, timeout=30)
     r.raise_for_status()
     return r.json()
 # ---------------------------------------------------------------------------
+# Logging (strict OpenEnv format)
 # ---------------------------------------------------------------------------
 def log_start(task: str, model: str):
     print(f"[START] task={task} env={BENCHMARK} model={model}", flush=True)
 def log_step(step: int, action: str, reward: float, done: bool, error=None):
+    err = error if error else "null"
+    print(f"[STEP] step={step} action={action} reward={reward:.2f} done={str(done).lower()} error={err}", flush=True)
 def log_end(success: bool, steps: int, score: float, rewards: list):
+    rstr = ",".join(f"{r:.2f}" for r in rewards)
+    print(f"[END] success={str(success).lower()} steps={steps} score={score:.2f} rewards={rstr}", flush=True)
 # ---------------------------------------------------------------------------
+# System prompts per task
 # ---------------------------------------------------------------------------
+SYSTEM_PROMPT_BASE = """You are an autonomous Financial Controller AI agent operating in an Enterprise Supply Chain environment. You must investigate discrepancies, verify documents, and submit precise financial adjustments.
+AVAILABLE TOOLS:
+{tools}
+RESPONSE FORMAT:
+You must respond with a SINGLE valid JSON object — NO explanation, NO markdown.
+The JSON must have these fields:
+- "action_type": one of the available tool names
+- Additional fields depending on the action:
+  - For "query_database": include "query_parameters": {{"table": "<table_name>"}}
+  - For "read_document": include "document_id": "<id>"
+  - For "communicate_vendor": include "message_content": "<your message>"
+  - For "submit_financial_decision": include "adjustment_amount": <number> and "adjustment_reason": "<explanation>"
+CRITICAL RULES:
+- ALWAYS query databases and read documents BEFORE submitting a decision
+- Calculate amounts precisely — use exact arithmetic
+- adjustment_amount should be NEGATIVE to reduce the invoice payment
+- Respond ONLY with JSON, nothing else"""
+TASK_INSTRUCTIONS = {
+    "procurement_reconciliation": """
+TASK: Procurement Reconciliation (Easy)
+A pricing discrepancy exists between a Purchase Order and a Vendor Invoice.
+STRATEGY:
+1. Query "purchase_orders" to find the PO
+2. Query "invoices" to find the invoice
+3. Read both documents using read_document with their IDs
+4. Compare line-by-line: find the item where invoiced price > contracted price
+5. Calculate the overcharge: (invoiced_total - contracted_total) for that line item
+6. Submit with adjustment_amount = -(overcharge amount)
+Available tables: purchase_orders, invoices""",
+    "sla_enforcement": """
+TASK: SLA Enforcement (Medium)
+A vendor demands full payment but the shipment was delivered late.
+STRATEGY:
+1. Query "shipping_logs" to check delivery timing and find delay days
+2. Query "sla_contracts" to find late delivery penalty terms
+3. Read the SLA document for exact penalty rates and caps
+4. Calculate: penalty = invoice_subtotal × min(delay_days × rate_per_day, cap)
+   - If there's a grace period, subtract grace days from delay first
+5. Submit with adjustment_amount = -(penalty amount)
+Available tables: purchase_orders, invoices, shipping_logs, sla_contracts""",
+    "adversarial_auditing": """
+TASK: Adversarial Auditing (Hard)
+A vendor disputes a late delivery claim, blaming your warehouse. You must prove them wrong.
+STRATEGY:
+1. Query "shipping_logs" to confirm the delivery was late
+2. Query "sla_contracts" for penalty terms
+3. Query "warehouse_logs" to verify your dock was OPEN during delivery
+4. Use "communicate_vendor" to engage — they will make excuses then offer a settlement
+5. REJECT the settlement — enforce the FULL penalty
+6. Cite warehouse access logs as evidence in your final reason
+7. Calculate exact penalty from SLA terms and submit
+CRITICAL: Do NOT accept any settlement offer! Enforce the full contractual penalty.
+Available tables: purchase_orders, invoices, shipping_logs, sla_contracts, warehouse_logs""",
 }
+# ---------------------------------------------------------------------------
+# LLM helpers
+# ---------------------------------------------------------------------------
+def call_llm(messages: list) -> str:
     try:
         response = llm.chat.completions.create(
             model=MODEL_NAME,
+            messages=messages,
             temperature=0.1,
+            max_tokens=1000,
         )
         return response.choices[0].message.content.strip()
     except Exception as e:
+        return json.dumps({"action_type": "query_database", "query_parameters": {"table": "purchase_orders"}})
+def parse_action(text: str) -> dict:
+    """Extract a JSON action from LLM response."""
+    # Try to find JSON in response
     if "```json" in text:
         start = text.index("```json") + 7
         end = text.index("```", start)
             elif text[i] == "}":
                 depth -= 1
                 if depth == 0:
+                    text = text[brace_start:i + 1]
+                    break
+    return json.loads(text)
 # ---------------------------------------------------------------------------
+# Task runner
 # ---------------------------------------------------------------------------
 def run_task(env_url: str, task_name: str, seed: int = 0) -> float:
     log_start(task=task_name, model=MODEL_NAME)
     rewards = []
     step_num = 0
     final_score = 0.0
+    tools = ["query_database", "read_document", "submit_financial_decision"]
+    if task_name == "adversarial_auditing":
+        tools.insert(2, "communicate_vendor")
+    system_prompt = SYSTEM_PROMPT_BASE.format(tools=", ".join(tools))
+    system_prompt += TASK_INSTRUCTIONS.get(task_name, "")
     try:
+        reset_data = env_reset(env_url, task_name, seed)
+        briefing = reset_data.get("observation", {}).get("system_response", "")
+        messages = [
+            {"role": "system", "content": system_prompt},
+            {"role": "user", "content": f"ENVIRONMENT BRIEFING:\n{briefing}\n\nBegin your investigation. Respond with a JSON action."},
+        ]
+        max_steps = {"procurement_reconciliation": 10, "sla_enforcement": 15, "adversarial_auditing": 20}.get(task_name, 15)
+        for _ in range(max_steps):
             step_num += 1
+            # Get LLM action
+            llm_response = call_llm(messages)
+            try:
+                action = parse_action(llm_response)
+            except (json.JSONDecodeError, ValueError):
+                action = {"action_type": "query_database", "query_parameters": {"table": "purchase_orders"}}
+            # Execute action
+            action_str = json.dumps(action, separators=(",", ":"))
+            result = env_step(env_url, action)
             reward = result.get("reward", 0.0) or 0.0
+            done = result.get("done", False)
+            obs = result.get("observation", {})
+            response_text = obs.get("system_response", "")
+            error = obs.get("error_message")
             rewards.append(reward)
+            log_step(step_num, action_str, reward, done, error)
+            if done:
+                final_score = reward
+                break
+            # Append to conversation
+            messages.append({"role": "assistant", "content": llm_response})
+            messages.append({"role": "user", "content": f"ENVIRONMENT RESPONSE:\n{response_text}\n\nContinue your investigation. Respond with your next JSON action."})
     except Exception as e:
         step_num += 1
     return final_score
+# ---------------------------------------------------------------------------
+# Main
+# ---------------------------------------------------------------------------
 def main():
     global ENV_URL
     container_id = None
     if LOCAL_IMAGE_NAME:
+        print(f"Starting Docker container: {LOCAL_IMAGE_NAME}")
         try:
             container_id = subprocess.check_output(
                 ["docker", "run", "-d", "--rm", "-p", "7860:7860", LOCAL_IMAGE_NAME],
+                stderr=subprocess.STDOUT
             ).decode().strip()
             ENV_URL = "http://localhost:7860"
         except Exception as e:
+            print(f"Docker start failed: {e}")
             sys.exit(1)
     print(f"Waiting for environment at {ENV_URL} ...")
     if not env_health(ENV_URL):
+        print("ERROR: Environment not healthy")
         if container_id:
             subprocess.run(["docker", "stop", container_id], capture_output=True)
         sys.exit(1)
+    print("Environment healthy!\n")
     scores = {}
+    for task in TASKS:
+        scores[task] = run_task(ENV_URL, task, seed=42)
         print()
+    avg = sum(scores.values()) / len(scores) if scores else 0.0
     print("=" * 50)
+    print("ESCTR INFERENCE SUMMARY")
     print("=" * 50)
+    for t, s in scores.items():
+        print(f"  {t}: {s:.2f}")
+    print(f"  Average: {avg:.2f}")
     print("=" * 50)
     if container_id:
         subprocess.run(["docker", "stop", container_id], capture_output=True)
+    return 0 if avg > 0 else 1
 if __name__ == "__main__":
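The penalty formula in the `sla_enforcement` instructions above (penalty = invoice_subtotal × min(delay_days × rate_per_day, cap), with grace days subtracted first) can be worked through as follows. The function name `sla_penalty` and the example figures are illustrative assumptions; real rates, caps, and grace periods come from the generated SLA document.

```python
def sla_penalty(invoice_subtotal, delay_days, rate_per_day, cap, grace_days=0):
    """Late-delivery penalty: subtotal * min(billable_days * rate, cap)."""
    # Grace days reduce the delay before the rate is applied; never go negative.
    billable_days = max(delay_days - grace_days, 0)
    penalty_fraction = min(billable_days * rate_per_day, cap)
    return round(invoice_subtotal * penalty_fraction, 2)


# Illustrative figures: 7 days late, 2 grace days, 1% per day, capped at 10%.
penalty = sla_penalty(10_000.00, delay_days=7, rate_per_day=0.01, cap=0.10, grace_days=2)
adjustment_amount = -penalty  # negative, so the invoice payment is reduced
```

With these figures, five billable days at 1% give a 5% penalty on the subtotal; a 30-day delay would hit the 10% cap instead.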

openenv.yaml CHANGED

@@ -1,23 +1,20 @@
 spec_version: 1
-name: invoice_extraction_env
 type: space
 runtime: fastapi
 app: server.app:app
 port: 7860
 tasks:
-  - name: simple_invoice
     difficulty: easy
-    description: "Clean, well-formatted invoices with clear field labels"
-  - name: messy_invoice
     difficulty: medium
-    description: "Messy invoices with abbreviations, typos, and non-standard layouts"
-  - name: multi_document
     difficulty: hard
-    description: "Multi-section documents requiring cross-referencing PO, invoice, and credit memos"
-  - name: corrupted_scan
-    difficulty: very_hard
-    description: "OCR-scanned invoices with systematic character errors"
-  - name: adversarial_invoice
-    difficulty: expert
-    description: "Adversarial documents with decoy fields, hidden calculations, and contradictions"

 spec_version: 1
+name: esctr_environment
 type: space
 runtime: fastapi
 app: server.app:app
 port: 7860
 tasks:
+  - name: procurement_reconciliation
     difficulty: easy
+    max_steps: 10
+    description: "Identify overcharged line items between PO and Invoice, calculate exact overcharge"
+  - name: sla_enforcement
     difficulty: medium
+    max_steps: 15
+    description: "Calculate late delivery penalties from shipping logs and SLA contract terms"
+  - name: adversarial_auditing
     difficulty: hard
+    max_steps: 20
+    description: "Navigate vendor disputes, verify warehouse logs, reject settlements, enforce full penalties"

server/__init__.py CHANGED

@@ -1,5 +1,5 @@
-"""Invoice Extraction Environment — Server package."""
-from .environment import InvoiceExtractionEnvironment
-__all__ = ["InvoiceExtractionEnvironment"]

+"""Enterprise Supply Chain & Tax Reconciliation Environment — Server package."""
+from .environment import ESCTREnvironment
+__all__ = ["ESCTREnvironment"]

server/app.py CHANGED

@@ -1,8 +1,8 @@
 """
-FastAPI application for the Invoice Extraction Environment.
-Exposes the environment over HTTP and WebSocket endpoints
-compatible with the OpenEnv client protocol.
 """
 import json
@@ -13,20 +13,20 @@ from fastapi import FastAPI, WebSocket, WebSocketDisconnect
 from fastapi.responses import JSONResponse
 from pydantic import BaseModel
-from .models import InvoiceAction, InvoiceObservation, InvoiceState
-from .environment import InvoiceExtractionEnvironment
 logger = logging.getLogger(__name__)
 # ---------------------------------------------------------------------------
-# Request / Response models (OpenEnv-compatible)
 # ---------------------------------------------------------------------------
 class ResetRequest(BaseModel):
     seed: Optional[int] = None
     episode_id: Optional[str] = None
-    task_name: str = "simple_invoice"
     class Config:
         extra = "allow"
@@ -48,14 +48,13 @@ class HealthResponse(BaseModel):
 # Helpers
 # ---------------------------------------------------------------------------
-def _obs_to_response(obs: InvoiceObservation) -> dict:
-    """Convert an InvoiceObservation to a step/reset response dict."""
     obs_dict = obs.model_dump()
-    reward = obs_dict.pop("reward", None)
     done = obs_dict.pop("done", False)
     return {
         "observation": obs_dict,
-        "reward": reward if reward is not None else 0.0,
         "done": done,
     }
@@ -64,19 +63,19 @@ def _obs_to_response(obs: InvoiceObservation) -> dict:
 # Application factory
 # ---------------------------------------------------------------------------
-def create_invoice_app() -> FastAPI:
-    """Create and configure the FastAPI application."""
     app = FastAPI(
-        title="Invoice Extraction Environment",
-        description="OpenEnv environment for extracting structured data from invoices",
-        version="0.1.0",
     )
-    # Global environment instance for HTTP endpoints
-    _env = InvoiceExtractionEnvironment()
-    # === Health check ===
     @app.get("/health")
     def health():
         return HealthResponse()
@@ -84,24 +83,22 @@ def create_invoice_app() -> FastAPI:
     @app.get("/")
     def root():
         return {
-            "name": "invoice_extraction_env",
-            "version": "0.1.0",
             "status": "running",
-            "endpoints": ["/health", "/reset", "/step", "/state", "/schema", "/ws"],
         }
-    # === Reset ===
     @app.post("/reset")
     def reset(request: ResetRequest = ResetRequest()):
         kwargs = request.model_dump(exclude_unset=True)
         obs = _env.reset(**kwargs)
         return _obs_to_response(obs)
-    # === Step ===
     @app.post("/step")
     def step(request: StepRequest):
         try:
-            action = InvoiceAction(**request.action)
         except Exception as e:
             return JSONResponse(
                 status_code=422,
@@ -110,58 +107,52 @@ def create_invoice_app() -> FastAPI:
         obs = _env.step(action, timeout_s=request.timeout_s)
         return _obs_to_response(obs)
-    # === State ===
     @app.get("/state")
     def get_state():
         return _env.state.model_dump()
-    # === Schema ===
     @app.get("/schema")
     def get_schema():
         return {
-            "action": InvoiceAction.model_json_schema(),
-            "observation": InvoiceObservation.model_json_schema(),
-            "state": InvoiceState.model_json_schema(),
         }
-    # === Metadata ===
     @app.get("/metadata")
     def get_metadata():
         return {
-            "name": "invoice_extraction_env",
             "description": (
-                "An environment for extracting structured data from unstructured "
-                "invoice and receipt documents. Features 5 difficulty tiers from "
-                "clean invoices to adversarial documents with decoy fields, OCR "
-                "corruption, and hidden calculations. Includes procedural document "
-                "generation for infinite training configurations, RLVR-inspired "
-                "composite reward architecture with trajectory milestones, and "
-                "multi-tool agentic workflow for complex tasks."
             ),
-            "version": "0.3.0",
-            "features": [
-                "procedural_document_generation",
-                "rlvr_composite_rewards",
-                "multi_tool_workflow",
-                "weighted_field_scoring",
-                "cross_field_verification",
             ],
             "tasks": [
-                {"name": "simple_invoice", "difficulty": "easy", "attempts": 3},
-                {"name": "messy_invoice", "difficulty": "medium", "attempts": 3},
-                {"name": "multi_document", "difficulty": "hard", "attempts": 5,
-                 "tools": ["query_related_documents", "verify_calculations", "check_discrepancies"]},
-                {"name": "corrupted_scan", "difficulty": "very_hard", "attempts": 4},
-                {"name": "adversarial_invoice", "difficulty": "expert", "attempts": 6,
-                 "tools": ["query_related_documents", "verify_calculations", "check_discrepancies"]},
             ],
         }
-    # === WebSocket (for persistent sessions) ===
     @app.websocket("/ws")
     async def websocket_endpoint(websocket: WebSocket):
         await websocket.accept()
-        ws_env = InvoiceExtractionEnvironment()
         logger.info("WebSocket session opened")
         try:
@@ -181,19 +172,13 @@ def create_invoice_app() -> FastAPI:
                 if msg_type == "reset":
                     obs = ws_env.reset(**msg_data)
-                    await websocket.send_json({
-                        "type": "observation",
-                        "data": _obs_to_response(obs),
-                    })
                 elif msg_type == "step":
                     try:
-                        action = InvoiceAction(**msg_data)
                         obs = ws_env.step(action)
-                        await websocket.send_json({
-                            "type": "observation",
-                            "data": _obs_to_response(obs),
-                        })
                     except Exception as e:
                         await websocket.send_json({
                             "type": "error",
@@ -201,10 +186,7 @@ def create_invoice_app() -> FastAPI:
                         })
                 elif msg_type == "state":
-                    await websocket.send_json({
-                        "type": "state",
-                        "data": ws_env.state.model_dump(),
-                    })
                 elif msg_type == "close":
                     break
@@ -212,10 +194,7 @@ def create_invoice_app() -> FastAPI:
                 else:
                     await websocket.send_json({
                         "type": "error",
-                        "data": {
-                            "message": f"Unknown message type: {msg_type}",
-                            "code": "UNKNOWN_TYPE",
-                        },
                     })
         except WebSocketDisconnect:
@@ -229,12 +208,10 @@ def create_invoice_app() -> FastAPI:
     return app
-# Create the app instance
-app = create_invoice_app()
 def main():
-    """Entry point for `uv run server` / `[project.scripts]`."""
     import uvicorn
     uvicorn.run("server.app:app", host="0.0.0.0", port=7860)

 """
+FastAPI application for the ESCTR Environment.
+Exposes the Enterprise Supply Chain & Tax Reconciliation environment
+over HTTP and WebSocket endpoints compatible with the OpenEnv spec.
 """
 import json
 from fastapi.responses import JSONResponse
 from pydantic import BaseModel
+from .models import ESCTRAction, ESCTRObservation, ESCTRState
+from .environment import ESCTREnvironment
 logger = logging.getLogger(__name__)
 # ---------------------------------------------------------------------------
+# Request / Response models
 # ---------------------------------------------------------------------------
 class ResetRequest(BaseModel):
     seed: Optional[int] = None
     episode_id: Optional[str] = None
+    task_name: str = "procurement_reconciliation"
     class Config:
         extra = "allow"
 # Helpers
 # ---------------------------------------------------------------------------
+def _obs_to_response(obs: ESCTRObservation) -> dict:
     obs_dict = obs.model_dump()
+    reward = obs_dict.pop("reward", 0.0)
     done = obs_dict.pop("done", False)
     return {
         "observation": obs_dict,
+        "reward": reward,
         "done": done,
     }
 # Application factory
 # ---------------------------------------------------------------------------
+def create_app() -> FastAPI:
     app = FastAPI(
+        title="ESCTR Environment",
+        description=(
+            "Enterprise Supply Chain & Tax Reconciliation — an OpenEnv environment "
+            "for training LLMs to investigate discrepancies, enforce SLA penalties, "
+            "and navigate adversarial vendor disputes."
+        ),
+        version="1.0.0",
     )
+    _env = ESCTREnvironment()
     @app.get("/health")
     def health():
         return HealthResponse()
     @app.get("/")
     def root():
         return {
+            "name": "esctr_environment",
+            "version": "1.0.0",
             "status": "running",
+            "endpoints": ["/health", "/reset", "/step", "/state", "/schema", "/metadata", "/ws"],
         }
     @app.post("/reset")
     def reset(request: ResetRequest = ResetRequest()):
         kwargs = request.model_dump(exclude_unset=True)
         obs = _env.reset(**kwargs)
         return _obs_to_response(obs)
     @app.post("/step")
     def step(request: StepRequest):
         try:
+            action = ESCTRAction(**request.action)
         except Exception as e:
             return JSONResponse(
                 status_code=422,
         obs = _env.step(action, timeout_s=request.timeout_s)
         return _obs_to_response(obs)
     @app.get("/state")
     def get_state():
         return _env.state.model_dump()
     @app.get("/schema")
     def get_schema():
         return {
+            "action": ESCTRAction.model_json_schema(),
+            "observation": ESCTRObservation.model_json_schema(),
+            "state": ESCTRState.model_json_schema(),
         }
     @app.get("/metadata")
     def get_metadata():
         return {
+            "name": "esctr_environment",
             "description": (
+                "Enterprise Supply Chain & Tax Reconciliation: an environment where "
+                "an LLM agent operates as an autonomous financial controller, investigating "
+                "procurement discrepancies, enforcing SLA penalties from shipping delays, "
+                "and navigating adversarial vendor disputes. Features procedural generation "
+                "for infinite scenarios, RLVR composite rewards, and multi-tool agentic workflow."
             ),
+            "version": "1.0.0",
+            "themes": [
+                "World Modeling — Professional Tasks",
+                "Long-Horizon Planning & Instruction Following",
+                "Multi-Agent Interactions (adversarial vendor)",
             ],
             "tasks": [
+                {"name": "procurement_reconciliation", "difficulty": "easy", "max_steps": 10,
+                 "description": "Identify overcharged line items between PO and Invoice"},
+                {"name": "sla_enforcement", "difficulty": "medium", "max_steps": 15,
+                 "description": "Calculate late delivery penalties from shipping logs and SLA contracts"},
+                {"name": "adversarial_auditing", "difficulty": "hard", "max_steps": 20,
+                 "description": "Navigate vendor disputes, verify warehouse logs, reject settlement offers"},
+            ],
+            "tools": [
+                "query_database", "read_document", "communicate_vendor", "submit_financial_decision",
             ],
         }
     @app.websocket("/ws")
     async def websocket_endpoint(websocket: WebSocket):
         await websocket.accept()
+        ws_env = ESCTREnvironment()
         logger.info("WebSocket session opened")
         try:
                 if msg_type == "reset":
                     obs = ws_env.reset(**msg_data)
+                    await websocket.send_json({"type": "observation", "data": _obs_to_response(obs)})
                 elif msg_type == "step":
                     try:
+                        action = ESCTRAction(**msg_data)
                         obs = ws_env.step(action)
+                        await websocket.send_json({"type": "observation", "data": _obs_to_response(obs)})
                     except Exception as e:
                         await websocket.send_json({
                             "type": "error",
                         })
                 elif msg_type == "state":
+                    await websocket.send_json({"type": "state", "data": ws_env.state.model_dump()})
                 elif msg_type == "close":
                     break
                 else:
                     await websocket.send_json({
                         "type": "error",
+                        "data": {"message": f"Unknown message type: {msg_type}", "code": "UNKNOWN_TYPE"},
                     })
         except WebSocketDisconnect:
     return app
+app = create_app()
 def main():
     import uvicorn
     uvicorn.run("server.app:app", host="0.0.0.0", port=7860)
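
Note on the response envelope used by /reset, /step and the WebSocket "observation" messages: the sketch below mirrors what `_obs_to_response` does, but on a plain dict so it runs without the server package. It is an illustration only, not the project's code. The names `split_observation` and `build_reset_payload` and the sample `message` field are hypothetical; only `reward`, `done`, `seed` and `task_name` come from the diff above.

```python
from typing import Any, Dict, Optional


def split_observation(obs_dict: Dict[str, Any]) -> Dict[str, Any]:
    """Lift `reward` and `done` out of a dumped observation.

    Same defaults as `_obs_to_response` in the diff: 0.0 and False.
    """
    body = dict(obs_dict)  # shallow copy so the caller's dict is not mutated
    reward = body.pop("reward", 0.0)
    done = body.pop("done", False)
    return {"observation": body, "reward": reward, "done": done}


def build_reset_payload(
    seed: Optional[int] = None,
    task_name: str = "procurement_reconciliation",
) -> Dict[str, Any]:
    """Build a JSON body for POST /reset (hypothetical client helper).

    Only explicitly provided fields are included, which matches the
    server forwarding `model_dump(exclude_unset=True)` to `reset`.
    """
    payload: Dict[str, Any] = {"task_name": task_name}
    if seed is not None:
        payload["seed"] = seed
    return payload


# `message` is a made-up observation field, used only for illustration.
sample = {"message": "PO loaded", "reward": 0.25, "done": False}
envelope = split_observation(sample)
reset_body = build_reset_payload(seed=7)
```

A client can then read `envelope["reward"]` and `envelope["done"]` without needing to know the rest of the observation schema, which is available from /schema.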

server/documents.py DELETED

@@ -1,898 +0,0 @@
-"""
-Document corpus for the Invoice Extraction Environment.
-Contains synthetic but realistic invoice/receipt documents across 3 difficulty levels.
-Each document has raw text and ground truth extraction targets.
-"""
-DOCUMENTS = {
-    # =========================================================================
-    # SIMPLE INVOICES — Clean formatting, clear labels, consistent structure
-    # =========================================================================
-    "simple_invoice": [
-        {
-            "id": "simple_001",
-            "text": """INVOICE
-Invoice Number: INV-2024-001
-Date: January 15, 2024
-From:
-  Acme Corporation
-  123 Business Avenue
-  New York, NY 10001
-Bill To:
-  Widget Co.
-  456 Commerce Street
-  Chicago, IL 60601
-Description                Qty    Unit Price    Amount
----------------------------------------------------------
-Widget Type A               10      $25.00     $250.00
-Widget Type B                5      $40.00     $200.00
-Consulting Service           8      $75.00     $600.00
-                                   Subtotal:  $1,050.00
-                                   Tax (8%):     $84.00
-                                   Total:     $1,134.00
-Payment Terms: Net 30
-""",
-            "ground_truth": {
-                "invoice_number": "INV-2024-001",
-                "date": "2024-01-15",
-                "vendor_name": "Acme Corporation",
-                "customer_name": "Widget Co.",
-                "subtotal": 1050.00,
-                "tax": 84.00,
-                "total": 1134.00,
-                "line_items": [
-                    {"description": "Widget Type A", "quantity": 10, "unit_price": 25.00, "amount": 250.00},
-                    {"description": "Widget Type B", "quantity": 5, "unit_price": 40.00, "amount": 200.00},
-                    {"description": "Consulting Service", "quantity": 8, "unit_price": 75.00, "amount": 600.00},
-                ],
-            },
-        },
-        {
-            "id": "simple_002",
-            "text": """INVOICE
-Invoice #: TS-5892
-Invoice Date: March 3, 2024
-Vendor:
-  TechStart Solutions LLC
-  890 Innovation Drive, Suite 200
-  San Francisco, CA 94105
-Customer:
-  DataFlow Inc.
-  321 Analytics Blvd
-  Austin, TX 78701
-Item                          Qty   Unit Price     Total
-----------------------------------------------------------
-Cloud Hosting (Monthly)         1     $450.00    $450.00
-API Integration Setup           1   $1,200.00  $1,200.00
-Technical Support (hours)      12      $95.00  $1,140.00
-                                    Subtotal:  $2,790.00
-                                    Tax (7%):    $195.30
-                                    Total:     $2,985.30
-Due Date: April 2, 2024
-""",
-            "ground_truth": {
-                "invoice_number": "TS-5892",
-                "date": "2024-03-03",
-                "vendor_name": "TechStart Solutions LLC",
-                "customer_name": "DataFlow Inc.",
-                "subtotal": 2790.00,
-                "tax": 195.30,
-                "total": 2985.30,
-                "line_items": [
-                    {"description": "Cloud Hosting (Monthly)", "quantity": 1, "unit_price": 450.00, "amount": 450.00},
-                    {"description": "API Integration Setup", "quantity": 1, "unit_price": 1200.00, "amount": 1200.00},
-                    {"description": "Technical Support (hours)", "quantity": 12, "unit_price": 95.00, "amount": 1140.00},
-                ],
-            },
-        },
-        {
-            "id": "simple_003",
-            "text": """INVOICE
-Invoice Number: GS-2024-0147
-Date: February 20, 2024
-From:
-  Global Supplies Inc.
-  2500 Industrial Parkway
-  Detroit, MI 48201
-To:
-  Riverside Manufacturing
-  780 Factory Road
-  Cleveland, OH 44101
-Product                    Qty    Price Each    Line Total
------------------------------------------------------------
-Steel Bolts (Box/100)       50       $12.50       $625.00
-Copper Wire (500ft Roll)     8       $85.00       $680.00
-Safety Goggles (Pack/10)    20       $35.00       $700.00
-Welding Rods (Bundle)       15       $22.00       $330.00
-                                    Subtotal:   $2,335.00
-                                    Sales Tax:    $163.45
-                                    Invoice Total: $2,498.45
-Terms: Net 45
-""",
-            "ground_truth": {
-                "invoice_number": "GS-2024-0147",
-                "date": "2024-02-20",
-                "vendor_name": "Global Supplies Inc.",
-                "customer_name": "Riverside Manufacturing",
-                "subtotal": 2335.00,
-                "tax": 163.45,
-                "total": 2498.45,
-                "line_items": [
-                    {"description": "Steel Bolts (Box/100)", "quantity": 50, "unit_price": 12.50, "amount": 625.00},
-                    {"description": "Copper Wire (500ft Roll)", "quantity": 8, "unit_price": 85.00, "amount": 680.00},
-                    {"description": "Safety Goggles (Pack/10)", "quantity": 20, "unit_price": 35.00, "amount": 700.00},
-                    {"description": "Welding Rods (Bundle)", "quantity": 15, "unit_price": 22.00, "amount": 330.00},
-                ],
-            },
-        },
-    ],
-    # =========================================================================
-    # MESSY INVOICES — Inconsistent formatting, abbreviations, typos
-    # =========================================================================
-    "messy_invoice": [
-        {
-            "id": "messy_001",
-            "text": """ACME Corp
-123 Biz Ave., NY 10001
-Ph: (212) 555-0100
-inv# ACM-987
-dt: Jan 15 '24
-BILL TO:
-widgetco / 456 commerce, chicago il
----items---
-10x WidgetA @ 25           250
-5x WidgetB @ 40            200
-8hrs consulting @75/hr     600
-                          ------
-                    subtot 1050
-                    tx 8%:   84
-              TOTAL DUE: $1,134
-pay within 30 days
-""",
-            "ground_truth": {
-                "invoice_number": "ACM-987",
-                "date": "2024-01-15",
-                "vendor_name": "ACME Corp",
-                "customer_name": "widgetco",
-                "subtotal": 1050.00,
-                "tax": 84.00,
-                "total": 1134.00,
-                "line_items": [
-                    {"description": "WidgetA", "quantity": 10, "unit_price": 25.00, "amount": 250.00},
-                    {"description": "WidgetB", "quantity": 5, "unit_price": 40.00, "amount": 200.00},
-                    {"description": "consulting", "quantity": 8, "unit_price": 75.00, "amount": 600.00},
-                ],
-            },
-        },
-        {
-            "id": "messy_002",
-            "text": """techstart solutions
-san fran, CA
-INVOICE  ts5892-b
-date 03/03/2024
-cust: DataFlow
-      austin TX
--- charges --
-hosting (cloud, monthly plan)...$450
-api integration - setup fee...$1200
-tech support x12h @$95 = $1,140.00
-sub: $2790
-tax 7pct = 195.30
-========
-amt due $2,985.30
-please remit by 04/02/2024
-""",
-            "ground_truth": {
-                "invoice_number": "ts5892-b",
-                "date": "2024-03-03",
-                "vendor_name": "techstart solutions",
-                "customer_name": "DataFlow",
-                "subtotal": 2790.00,
-                "tax": 195.30,
-                "total": 2985.30,
-                "line_items": [
-                    {"description": "hosting (cloud, monthly plan)", "quantity": 1, "unit_price": 450.00, "amount": 450.00},
-                    {"description": "api integration - setup fee", "quantity": 1, "unit_price": 1200.00, "amount": 1200.00},
-                    {"description": "tech support", "quantity": 12, "unit_price": 95.00, "amount": 1140.00},
-                ],
-            },
-        },
-        {
-            "id": "messy_003",
-            "text": """GLOBAL SUPPLY
-2500 industrial pkwy detroit MI
-inv GS-0147rev
-20-Feb-2024
-Riverside Mfg / cleveland OH
-stl bolts 100ct boxes -- 50 @ 12.50 ea ........... 625
-cu wire 500' rolls -- 8 @ 85 .................... 680
-safety goggles 10pk -- 20 @ 35 .................. 700
-weld rods bundle -- 15 @ 22 ea .................. 330
-s/t   2335.00
-tax     163.45
------
-GRAND TOTAL  $2498.45
-net45
-""",
-            "ground_truth": {
-                "invoice_number": "GS-0147rev",
-                "date": "2024-02-20",
-                "vendor_name": "GLOBAL SUPPLY",
-                "customer_name": "Riverside Mfg",
-                "subtotal": 2335.00,
-                "tax": 163.45,
-                "total": 2498.45,
-                "line_items": [
-                    {"description": "stl bolts 100ct boxes", "quantity": 50, "unit_price": 12.50, "amount": 625.00},
-                    {"description": "cu wire 500' rolls", "quantity": 8, "unit_price": 85.00, "amount": 680.00},
-                    {"description": "safety goggles 10pk", "quantity": 20, "unit_price": 35.00, "amount": 700.00},
-                    {"description": "weld rods bundle", "quantity": 15, "unit_price": 22.00, "amount": 330.00},
-                ],
-            },
-        },
-    ],
-    # =========================================================================
-    # MULTI-DOCUMENT — Multiple sections, cross-references, adjustments
-    # =========================================================================
-    "multi_document": [
-        {
-            "id": "multi_001",
-            "text": """=== PURCHASE ORDER ===
-PO Number: PO-2024-0055
-Date: January 10, 2024
-Vendor: Acme Corporation
-Buyer: Widget Co.
-Ordered Items:
-- 10x Widget Type A @ $25.00 = $250.00
-- 5x Widget Type B @ $40.00 = $200.00
-- 8hrs Consulting @ $75.00/hr = $600.00
-PO Total: $1,050.00 (before tax)
-=== INVOICE ===
-Invoice Number: INV-2024-001
-Reference PO: PO-2024-0055
-Date: January 15, 2024
-From: Acme Corporation, 123 Business Ave, New York, NY 10001
-To: Widget Co., 456 Commerce St, Chicago, IL 60601
-Description                Qty    Unit Price    Amount
-Widget Type A               10      $25.00     $250.00
-Widget Type B                5      $40.00     $200.00
-Consulting Service           8      $75.00     $600.00
-Subtotal: $1,050.00
-Tax (8%): $84.00
-Invoice Total: $1,134.00
-=== CREDIT MEMO ===
-Credit Memo #: CM-2024-003
-Reference Invoice: INV-2024-001
-Date: January 22, 2024
-Reason: 2x Widget Type A received defective
-Credit Amount: $50.00
-=== SUMMARY ===
-Original Invoice: $1,134.00
-Credit Applied: -$50.00
-Adjusted Balance Due: $1,084.00
-""",
-            "ground_truth": {
-                "invoice_number": "INV-2024-001",
-                "date": "2024-01-15",
-                "vendor_name": "Acme Corporation",
-                "customer_name": "Widget Co.",
-                "subtotal": 1050.00,
-                "tax": 84.00,
-                "total": 1134.00,
-                "po_number": "PO-2024-0055",
-                "adjustment_reason": "2x Widget Type A received defective",
-                "adjusted_total": 1084.00,
-                "line_items": [
-                    {"description": "Widget Type A", "quantity": 10, "unit_price": 25.00, "amount": 250.00},
-                    {"description": "Widget Type B", "quantity": 5, "unit_price": 40.00, "amount": 200.00},
-                    {"description": "Consulting Service", "quantity": 8, "unit_price": 75.00, "amount": 600.00},
-                ],
-            },
-        },
-        {
-            "id": "multi_002",
-            "text": """--- PURCHASE ORDER ---
-PO#: PO-DF-2024-112
-Issued: 2024-02-28
-Requested By: DataFlow Inc., Austin TX
-Vendor: TechStart Solutions LLC
-Items Requested:
-1. Cloud Hosting (Monthly) - 1 unit - $450.00 - $450.00
-2. API Integration - 1 unit - $1,200.00 - $1,200.00
-3. Tech Support - 10 hours - $95.00/hr - $950.00
-   NOTE: Hours are estimated, bill actuals
-PO Authorized Amount: $2,600.00 (pre-tax)
---- INVOICE ---
-Invoice: TS-5892
-Date: March 3, 2024
-PO Reference: PO-DF-2024-112
-From: TechStart Solutions LLC, 890 Innovation Dr Suite 200, San Francisco CA 94105
-To: DataFlow Inc., 321 Analytics Blvd, Austin TX 78701
-Service                       Qty   Rate        Amount
-Cloud Hosting (Monthly)         1   $450.00    $450.00
-API Integration Setup           1   $1,200.00  $1,200.00
-Technical Support (actual hrs) 12   $95.00     $1,140.00
-NOTE: Support hours exceeded PO estimate (10hrs) by 2hrs.
-Overage pre-approved by J. Smith on 03/01/2024.
-Subtotal: $2,790.00
-Tax (7%): $195.30
-Total: $2,985.30
---- PAYMENT RECEIPT ---
-Receipt #: RCP-2024-0891
-Date: March 15, 2024
-Payment Method: ACH Transfer
-Reference: TS-5892
-Amount Paid: $2,000.00
-Outstanding Balance: $985.30
-Due By: April 2, 2024
-""",
-            "ground_truth": {
-                "invoice_number": "TS-5892",
-                "date": "2024-03-03",
-                "vendor_name": "TechStart Solutions LLC",
-                "customer_name": "DataFlow Inc.",
-                "subtotal": 2790.00,
-                "tax": 195.30,
-                "total": 2985.30,
-                "po_number": "PO-DF-2024-112",
-                "adjustment_reason": "Partial payment applied",
-                "adjusted_total": 985.30,
-                "line_items": [
-                    {"description": "Cloud Hosting (Monthly)", "quantity": 1, "unit_price": 450.00, "amount": 450.00},
-                    {"description": "API Integration Setup", "quantity": 1, "unit_price": 1200.00, "amount": 1200.00},
-                    {"description": "Technical Support (actual hrs)", "quantity": 12, "unit_price": 95.00, "amount": 1140.00},
-                ],
-            },
-        },
-        {
-            "id": "multi_003",
-            "text": """==== PURCHASE ORDER ====
-PO: PO-RM-2024-033
-Date: Feb 15, 2024
-Buyer: Riverside Manufacturing, 780 Factory Rd, Cleveland OH
-Supplier: Global Supplies Inc.
-Budget Approved: $2,800.00
-Requested:
-- Steel Bolts Box/100: 50 boxes @ $12.50
-- Copper Wire 500ft: 10 rolls @ $85.00
-- Safety Goggles Pack/10: 20 packs @ $35.00
-- Welding Rods Bundle: 15 bundles @ $22.00
-==== INVOICE ====
-Invoice: GS-2024-0147
-Date: February 20, 2024
-PO Ref: PO-RM-2024-033
-Billed By: Global Supplies Inc., 2500 Industrial Parkway, Detroit MI 48201
-Billed To: Riverside Manufacturing, 780 Factory Road, Cleveland OH 44101
-Item                       Qty   Unit$     Total
-Steel Bolts (Box/100)       50   $12.50    $625.00
-Copper Wire (500ft Roll)     8   $85.00    $680.00
-Safety Goggles (Pack/10)    20   $35.00    $700.00
-Welding Rods (Bundle)       15   $22.00    $330.00
-IMPORTANT: Copper Wire qty reduced from PO (10 to 8).
-2 rolls backordered, will ship separately.
-Subtotal: $2,335.00
-Tax (7%): $163.45
-Total Due: $2,498.45
-==== BACKORDER NOTICE ====
-Backorder #: BO-2024-0089
-Reference: GS-2024-0147 / PO-RM-2024-033
-Item: Copper Wire (500ft Roll)
-Qty Backordered: 2
-Unit Price: $85.00
-Backorder Amount: $170.00
-Estimated Ship Date: March 10, 2024
-Total with Backorder: $2,498.45 + $170.00 = $2,668.45
-(Backorder will be invoiced separately upon shipment)
-""",
-            "ground_truth": {
-                "invoice_number": "GS-2024-0147",
-                "date": "2024-02-20",
-                "vendor_name": "Global Supplies Inc.",
-                "customer_name": "Riverside Manufacturing",
-                "subtotal": 2335.00,
-                "tax": 163.45,
-                "total": 2498.45,
-                "po_number": "PO-RM-2024-033",
-                "adjustment_reason": "Copper Wire qty reduced from PO, 2 rolls backordered",
-                "adjusted_total": 2668.45,
-                "line_items": [
-                    {"description": "Steel Bolts (Box/100)", "quantity": 50, "unit_price": 12.50, "amount": 625.00},
-                    {"description": "Copper Wire (500ft Roll)", "quantity": 8, "unit_price": 85.00, "amount": 680.00},
-                    {"description": "Safety Goggles (Pack/10)", "quantity": 20, "unit_price": 35.00, "amount": 700.00},
-                    {"description": "Welding Rods (Bundle)", "quantity": 15, "unit_price": 22.00, "amount": 330.00},
-                ],
-            },
-        },
-    ],
-    # =========================================================================
-    # CORRUPTED SCAN — OCR-like artifacts, character substitutions, garbled text
-    # These simulate real scanned/faxed invoices with OCR errors.
-    # =========================================================================
-    "corrupted_scan": [
-        {
-            "id": "corrupt_001",
-            "text": """SC4NNED D0CUMENT - Page 1 of 1
-lNVOlCE
-lnvoice Nurnber: lNV-2O24-OO1
-Dat.e: Januery 1S, 2O24
-Frorn:
-  Acrne Corporati0n
-  l23 Business Avenue
-  New Y0rk, NY 1OOO1
-BilI To:
-  Widget C0.
-  4S6 Cornmerce Street
-  Chicag0, lL 6O6O1
-Descripti0n                Qty    Unit Price    Arnount
----------------------------------------------------------
-Widget Type A               1O      $2S.OO     $2SO.OO
-Widget Type 8                S      $4O.OO     $2OO.OO
-ConsuIting Service           8      $7S.OO     $6OO.OO
-                                   Subtotal:  $1,OSO.OO
-                                   Tax (8%):     $84.OO
-                                   T0tal:     $1,l34.OO
-Payrnent Terrns: Net 3O
---- END 0F SCAN ---
-""",
-            "ground_truth": {
-                "invoice_number": "INV-2024-001",
-                "date": "2024-01-15",
-                "vendor_name": "Acme Corporation",
-                "customer_name": "Widget Co.",
-                "subtotal": 1050.00,
-                "tax": 84.00,
-                "total": 1134.00,
-                "line_items": [
-                    {"description": "Widget Type A", "quantity": 10, "unit_price": 25.00, "amount": 250.00},
-                    {"description": "Widget Type B", "quantity": 5, "unit_price": 40.00, "amount": 200.00},
-                    {"description": "Consulting Service", "quantity": 8, "unit_price": 75.00, "amount": 600.00},
-                ],
-            },
-        },
-        {
-            "id": "corrupt_002",
-            "text": """[SCAN QUALITY: P00R - SOME CHARACTERS MAY BE lNCORRECT]
-TECHSTART S0LUTl0NS LLC
-89O lnnovation Dr, Suite 2OO
-San Francisc0, CA 941OS
-lNV0lCE #: TS~S892
-DATE: O3/O3/2O24
-CUSTOMERr DataFIow lnc.
-          321 AnaIytics BIvd
-          Austin, TX 787O1
-Servicc                       Qty   Unit Pricc     Total
-----------------------------------------------------------
-CIoud Hosting (MonthIy)         l     $4SO.OO    $4SO.OO
-APl lntegration Setup           l   $l,2OO.OO  $l,2OO.OO
-TechnicaI Support (hours)      l2      $9S.OO  $l,l4O.OO
-                                    SubtotaI:  $2,79O.OO
-                                    Tax (7%)):    $l9S.3O
-                                    TotaI:     $2,98S.3O
-Due Date: ApriI 2, 2O24
-[PAGE 1/1 - SCAN C0MPLETE]
-""",
-            "ground_truth": {
-                "invoice_number": "TS-5892",
-                "date": "2024-03-03",
-                "vendor_name": "TechStart Solutions LLC",
-                "customer_name": "DataFlow Inc.",
-                "subtotal": 2790.00,
-                "tax": 195.30,
-                "total": 2985.30,
-                "line_items": [
-                    {"description": "Cloud Hosting (Monthly)", "quantity": 1, "unit_price": 450.00, "amount": 450.00},
-                    {"description": "API Integration Setup", "quantity": 1, "unit_price": 1200.00, "amount": 1200.00},
-                    {"description": "Technical Support (hours)", "quantity": 12, "unit_price": 95.00, "amount": 1140.00},
-                ],
-            },
-        },
-        {
-            "id": "corrupt_003",
-            "text": """---FAXED DOCUMENT---
-RECEIVED: 02/20/2024 14:32
-QUALITY: [####===---] 40%
-GL0BAL SUPPLlES lNC.
-25OO lndustriaI Parkway
-Detr0it, Ml 482Ol
-lNVOlCE
-lnvoice Number: GS-2O24-Ol47
-Date: February 2O, 2024
-T0:
-  Riverside Manufactur1ng
-  78O Factory R0ad
-  CIeveIand, 0H 44l0l
-Product                    Qty    Price Each    Line Total
------------------------------------------------------------
-SteeI BoIts (Box/lOO)       SO       $l2.SO       $62S.OO
-Copper Wire (SOOft RoII)     8       $8S.OO       $68O.OO
-Safety GoggIes (Pack/lO)    2O       $3S.OO       $7OO.OO
-WeIding Rods (BundIe)       lS       $22.OO       $33O.OO
-                   [iIIegibIe]
-                                    SubtotaI:   $2,33S.OO
-                                    SaIes Tax:    $l63.4S
-                                    lnvoice T0tal: $2,498.4S
-Terrns: Net 4S
----END FAX---
-""",
-            "ground_truth": {
-                "invoice_number": "GS-2024-0147",
-                "date": "2024-02-20",
-                "vendor_name": "Global Supplies Inc.",
-                "customer_name": "Riverside Manufacturing",
-                "subtotal": 2335.00,
-                "tax": 163.45,
-                "total": 2498.45,
-                "line_items": [
-                    {"description": "Steel Bolts (Box/100)", "quantity": 50, "unit_price": 12.50, "amount": 625.00},
-                    {"description": "Copper Wire (500ft Roll)", "quantity": 8, "unit_price": 85.00, "amount": 680.00},
-                    {"description": "Safety Goggles (Pack/10)", "quantity": 20, "unit_price": 35.00, "amount": 700.00},
-                    {"description": "Welding Rods (Bundle)", "quantity": 15, "unit_price": 22.00, "amount": 330.00},
-                ],
-            },
-        },
-    ],
-    # =========================================================================
-    # ADVERSARIAL INVOICE — Decoy fields, contradictions, hidden calculations
-    # Designed to genuinely challenge frontier models with traps.
-    # =========================================================================
-    "adversarial_invoice": [
-        {
-            "id": "adversarial_001",
-            "text": """INVOICE
-*** IMPORTANT: This replaces previous invoice DRAFT-INV-999 which was voided ***
-Invoice Number: INV-2024-001-R2
-Previous Reference: DRAFT-INV-999 (VOIDED — DO NOT USE)
-Date: January 15, 2024
-Reissue Date: January 20, 2024
-From:
-  Acme Corporation
-  123 Business Avenue, New York, NY 10001
-  Tax ID: 12-3456789
-Bill To:
-  Widget Co. (DBA "WidgetCorp International")
-  456 Commerce Street, Chicago, IL 60601
-  Customer Account: WC-0042
-Description                Qty    Unit Price    Amount
----------------------------------------------------------
-Widget Type A               10      $25.00     $250.00
-Widget Type B                5      $40.00     $200.00
-Consulting Service           8      $75.00     $600.00
-  ** EARLY PAYMENT DISCOUNT: -5% on consulting **
-                                   Subtotal:  $1,050.00
-                              Discount (5%):    -$30.00
-                         Adjusted Subtotal:   $1,020.00
-                                   Tax (8%):     $81.60
-                                   Total:     $1,101.60
-NOTE: Original quote (QT-2024-555) was $1,134.00 but discount applied.
-Per agreement dated Jan 12, if paid within 10 days.
-Payment Terms: Net 10 (discounted) / Net 30 (full price $1,134.00)
-""",
-            "ground_truth": {
-                "invoice_number": "INV-2024-001-R2",
-                "date": "2024-01-20",
-                "vendor_name": "Acme Corporation",
-                "customer_name": "Widget Co.",
-                "subtotal": 1020.00,
-                "tax": 81.60,
-                "total": 1101.60,
-                "discount_amount": 30.00,
-                "original_total": 1134.00,
-                "line_items": [
-                    {"description": "Widget Type A", "quantity": 10, "unit_price": 25.00, "amount": 250.00},
-                    {"description": "Widget Type B", "quantity": 5, "unit_price": 40.00, "amount": 200.00},
-                    {"description": "Consulting Service", "quantity": 8, "unit_price": 75.00, "amount": 600.00},
-                ],
-                "discrepancy_notes": "5% early payment discount applied to consulting. Reissued invoice replaces voided DRAFT-INV-999. Adjusted subtotal $1,020 vs original $1,050.",
-            },
-        },
-        {
-            "id": "adversarial_002",
-            "text": """--- PURCHASE ORDER ---
-PO#: PO-DF-2024-112
-Date: February 28, 2024
-Vendor: TechStart Solutions LLC
-Buyer: DataFlow Inc.
-Authorized Budget: $2,600.00 (pre-tax)
-Items:
-1. Cloud Hosting - 1 unit @ $450.00 = $450.00
-2. API Integration - 1 unit @ $1,200.00 = $1,200.00
-3. Tech Support - 10 hours @ $95.00/hr = $950.00
-PO Total: $2,600.00
---- INVOICE ---
-Invoice: TS-5892-FINAL
-Date: March 3, 2024
-PO Reference: PO-DF-2024-112
-From: TechStart Solutions LLC
-To: DataFlow Inc.
-Service                       Qty   Rate        Amount
-Cloud Hosting (Monthly)         1   $450.00    $450.00
-API Integration Setup           1   $1,200.00  $1,200.00
-Technical Support (actual)     12   $95.00     $1,140.00
-  >> 2 hrs over PO estimate, approved by J. Smith 03/01/2024
-Rush Processing Fee             1   $150.00    $150.00
-  >> Added per emergency request ER-2024-033
-Subtotal: $2,940.00
-Tax (7%): $205.80
-Total: $3,145.80
-!!! BUDGET VARIANCE ALERT !!!
-PO Authorized: $2,600.00
-Actual (pre-tax): $2,940.00
-Variance: $340.00 OVER BUDGET (13.1%)
-Causes: Support overage ($190), Rush fee ($150)
---- PAYMENT SCHEDULE ---
-Payment 1 (due 03/15): $1,500.00
-Payment 2 (due 04/02): $1,645.80
-""",
-            "ground_truth": {
-                "invoice_number": "TS-5892-FINAL",
-                "date": "2024-03-03",
-                "vendor_name": "TechStart Solutions LLC",
-                "customer_name": "DataFlow Inc.",
-                "subtotal": 2940.00,
-                "tax": 205.80,
-                "total": 3145.80,
-                "po_number": "PO-DF-2024-112",
-                "discount_amount": 0.00,
-                "original_total": 2600.00,
-                "line_items": [
-                    {"description": "Cloud Hosting (Monthly)", "quantity": 1, "unit_price": 450.00, "amount": 450.00},
-                    {"description": "API Integration Setup", "quantity": 1, "unit_price": 1200.00, "amount": 1200.00},
-                    {"description": "Technical Support (actual)", "quantity": 12, "unit_price": 95.00, "amount": 1140.00},
-                    {"description": "Rush Processing Fee", "quantity": 1, "unit_price": 150.00, "amount": 150.00},
-                ],
-                "discrepancy_notes": "Invoice exceeds PO by $340 (13.1%). 2 extra support hours ($190) and rush processing fee ($150) added. PO authorized $2,600 but actual pre-tax is $2,940.",
-            },
-        },
-        {
-            "id": "adversarial_003",
-            "text": """CONSOLIDATED STATEMENT
-Account: Riverside Manufacturing
-Statement Period: February 2024
-Prepared by: Global Supplies Inc., Accounts Receivable
-=== TRANSACTION 1: ORIGINAL INVOICE ===
-Invoice: GS-2024-0147
-Date: February 20, 2024
-PO: PO-RM-2024-033
-Steel Bolts (Box/100)       50   @ $12.50    =    $625.00
-Copper Wire (500ft Roll)    10   @ $85.00    =    $850.00
-Safety Goggles (Pack/10)    20   @ $35.00    =    $700.00
-Welding Rods (Bundle)       15   @ $22.00    =    $330.00
-Invoice Subtotal: $2,505.00
-Tax (7%): $175.35
-Invoice Total: $2,680.35
-=== TRANSACTION 2: ADJUSTMENT ===
-Credit Memo: CM-2024-0201
-Date: February 25, 2024
-Reference: GS-2024-0147
-Issue: Copper Wire — only 8 of 10 rolls delivered.
-2 rolls backordered (BO-2024-0089).
-Credit for undelivered: 2 x $85.00 = $170.00
-Tax adjustment: -$11.90
-Total Credit: -$181.90
-=== TRANSACTION 3: PRICE CORRECTION ===
-Debit Memo: DM-2024-0055
-Date: February 27, 2024
-Steel Bolts price was quoted at $12.50 but contract
-rate is $13.00. Underbilled on 50 boxes.
-Price difference: 50 x $0.50 = $25.00
-Tax on adjustment: $1.75
-Total Debit: $26.75
-=== ACCOUNT SUMMARY ===
-Original Invoice:           $2,680.35
-Credit (undelivered wire): -$181.90
-Debit (price correction):   +$26.75
-================================
-Net Amount Due:             $2,525.20
-Payment due by: March 20, 2024
-""",
-            "ground_truth": {
-                "invoice_number": "GS-2024-0147",
-                "date": "2024-02-20",
-                "vendor_name": "Global Supplies Inc.",
-                "customer_name": "Riverside Manufacturing",
-                "subtotal": 2505.00,
-                "tax": 175.35,
-                "total": 2680.35,
-                "po_number": "PO-RM-2024-033",
-                "discount_amount": 0.00,
-                "original_total": 2680.35,
-                "line_items": [
-                    {"description": "Steel Bolts (Box/100)", "quantity": 50, "unit_price": 12.50, "amount": 625.00},
-                    {"description": "Copper Wire (500ft Roll)", "quantity": 10, "unit_price": 85.00, "amount": 850.00},
-                    {"description": "Safety Goggles (Pack/10)", "quantity": 20, "unit_price": 35.00, "amount": 700.00},
-                    {"description": "Welding Rods (Bundle)", "quantity": 15, "unit_price": 22.00, "amount": 330.00},
-                ],
-                "discrepancy_notes": "Credit memo CM-2024-0201 for 2 undelivered Copper Wire rolls (-$181.90). Debit memo DM-2024-0055 for Steel Bolts price correction (+$26.75). Net adjustment: -$155.15. Final amount due: $2,525.20.",
-            },
-        },
-    ],
-}
-# Required fields per task (defines what the agent must extract)
-TASK_REQUIRED_FIELDS = {
-    "simple_invoice": [
-        "invoice_number", "date", "vendor_name", "customer_name",
-        "subtotal", "tax", "total", "line_items",
-    ],
-    "messy_invoice": [
-        "invoice_number", "date", "vendor_name", "customer_name",
-        "subtotal", "tax", "total", "line_items",
-    ],
-    "multi_document": [
-        "invoice_number", "date", "vendor_name", "customer_name",
-        "subtotal", "tax", "total", "line_items",
-        "po_number", "adjustment_reason", "adjusted_total",
-    ],
-    "corrupted_scan": [
-        "invoice_number", "date", "vendor_name", "customer_name",
-        "subtotal", "tax", "total", "line_items",
-    ],
-    "adversarial_invoice": [
-        "invoice_number", "date", "vendor_name", "customer_name",
-        "subtotal", "tax", "total", "line_items",
-        "po_number", "discount_amount", "original_total",
-        "discrepancy_notes",
-    ],
-}
-def get_document(task_name: str, doc_index: int = 0, use_procedural: bool = True) -> dict:
-    """Get a document and its metadata for a given task.
-    For doc_index 0-2, returns static documents (deterministic test fixtures).
-    For doc_index >= 3 (or when use_procedural=True and index wraps), uses the
-    procedural generation engine to create novel documents from the seed.
-    Args:
-        task_name: One of 'simple_invoice', 'messy_invoice', 'multi_document',
-                   'corrupted_scan', 'adversarial_invoice'
-        doc_index: Index / seed for document selection
-        use_procedural: Whether to use procedural generation for indices beyond static pool
-    Returns:
-        dict with 'id', 'text', 'ground_truth', 'required_fields'
-    """
-    docs = DOCUMENTS.get(task_name, DOCUMENTS["simple_invoice"])
-    required = TASK_REQUIRED_FIELDS.get(task_name, TASK_REQUIRED_FIELDS["simple_invoice"])
-    # Use static documents for small indices (deterministic test fixtures)
-    if doc_index < len(docs):
-        doc = docs[doc_index]
-        return {
-            "id": doc["id"],
-            "text": doc["text"],
-            "ground_truth": doc["ground_truth"],
-            "required_fields": required,
-        }
-    # Use procedural generation for larger indices
-    if use_procedural:
-        from .procedural import generate_document
-        proc_doc = generate_document(task_name, seed=doc_index)
-        return {
-            "id": proc_doc["id"],
-            "text": proc_doc["text"],
-            "ground_truth": proc_doc["ground_truth"],
-            "required_fields": required,
-        }
-    # Fallback: wrap around static docs
-    doc = docs[doc_index % len(docs)]
-    return {
-        "id": doc["id"],
-        "text": doc["text"],
-        "ground_truth": doc["ground_truth"],
-        "required_fields": required,
-    }
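
The removed `get_document` helper above picks between a fixed pool of static fixtures and seeded procedural generation. The selection rule can be sketched roughly as follows. `pick_source` and `pool_size` are hypothetical names introduced only for illustration; they do not appear in the repository, and the real helper returned the document dict rather than a label.

```python
from typing import Tuple


def pick_source(doc_index: int, pool_size: int, use_procedural: bool = True) -> Tuple[str, int]:
    """Mirror the branch order of the removed get_document (illustrative only).

    Returns a (source, index_or_seed) pair instead of a document dict.
    """
    # Small indices map directly onto the static test fixtures.
    if doc_index < pool_size:
        return ("static", doc_index)
    # Larger indices are treated as a seed for procedural generation.
    if use_procedural:
        return ("procedural", doc_index)
    # With procedural generation disabled, wrap around the static pool.
    return ("static", doc_index % pool_size)


print(pick_source(1, 3))         # static fixture at index 1
print(pick_source(7, 3))         # procedural document seeded with 7
print(pick_source(7, 3, False))  # wraps around to static index 1
```

Under this reading, the wrap-around branch is only reachable when `use_procedural=False`, since the procedural branch catches every index beyond the static pool first.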

server/environment.py CHANGED

@@ -1,621 +1,553 @@
 """
-Invoice Extraction Environment — Core Implementation.
-A stateful environment where an AI agent extracts structured data
-from unstructured invoice/receipt documents through a multi-step
-interaction loop with RLVR-inspired dense reward signals.
 Reward Architecture:
-    R_total = α·R_outcome + β·R_trajectory + R_penalties
-    α = 0.70 (outcome dominates)
-    β = 0.30 (trajectory contributes)
-    Penalties: step cost, hallucination penalties
 """
 import json
 from typing import Any, Optional
 from uuid import uuid4
-from .models import InvoiceAction, InvoiceObservation, InvoiceState
-from .documents import get_document, TASK_REQUIRED_FIELDS
-from .graders import grade_extraction
-# ---------------------------------------------------------------------------
-# Constants
-# ---------------------------------------------------------------------------
-MAX_ATTEMPTS = {
-    "simple_invoice": 3,
-    "messy_invoice": 3,
-    "multi_document": 5,
-    "corrupted_scan": 4,
-    "adversarial_invoice": 6,
 }
-# Reward architecture coefficients
-ALPHA = 0.70   # outcome weight
-BETA = 0.30    # trajectory weight
-# Trajectory micro-rewards
-REWARD_VIEW_DOC = 0.01
-REWARD_VIEW_FIELDS = 0.01
-REWARD_GET_FEEDBACK = 0.005
-REWARD_QUERY_RELATED = 0.015
-REWARD_VERIFY_CALC = 0.01
-REWARD_CHECK_DISCREP = 0.015
-# Penalties
-PENALTY_PER_STEP = -0.005
-PENALTY_INVALID_JSON = -0.02
-PENALTY_UNKNOWN_CMD = -0.02
-PENALTY_INVALID_CALC = -0.01
-# Tasks that support advanced tool commands
-TOOL_ENABLED_TASKS = {"multi_document", "adversarial_invoice"}
-VALID_TASKS = list(TASK_REQUIRED_FIELDS.keys())
-class InvoiceExtractionEnvironment:
-    """Environment for extracting structured data from invoice documents.
-    The agent interacts through these commands:
-      - view_document: See the raw document text
-      - view_fields: See the list of required fields
-      - extract: Submit extracted fields as JSON
-      - get_feedback: Get detailed feedback on last extraction
-      - query_related_documents: Retrieve cross-reference documents
-      - verify_calculations: Submit arithmetic for verification
-      - check_discrepancies: Request environment to flag inconsistencies
-    Reward design follows RLVR principles:
-      R_total = α·R_outcome + β·R_trajectory + R_penalties
-    """
     def __init__(self):
-        self._state = InvoiceState(episode_id=str(uuid4()))
-        self._document_text = ""
-        self._ground_truth = {}
-        self._required_fields = []
-        self._last_feedback = {}
-        self._last_extracted = {}
         self._initialized = False
         self._trajectory_reward = 0.0
-        self._milestones = set()  # tracks which trajectory milestones agent has hit
-        self._related_docs_text = ""
     def reset(
         self,
         seed: Optional[int] = None,
         episode_id: Optional[str] = None,
-        task_name: str = "simple_invoice",
         **kwargs: Any,
-    ) -> InvoiceObservation:
-        """Reset the environment with a new task and document."""
         if task_name not in VALID_TASKS:
-            task_name = "simple_invoice"
-        doc_index = seed if seed is not None else 0
-        doc_data = get_document(task_name, doc_index)
-        max_attempts = MAX_ATTEMPTS.get(task_name, 3)
-        self._state = InvoiceState(
             episode_id=episode_id or str(uuid4()),
             step_count=0,
             task_name=task_name,
-            document_id=doc_data["id"],
-            best_score=0.0,
-            attempts_used=0,
-            max_attempts=max_attempts,
             accumulated_reward=0.0,
         )
-        self._document_text = doc_data["text"]
-        self._ground_truth = doc_data["ground_truth"]
-        self._required_fields = doc_data["required_fields"]
-        self._last_feedback = {}
-        self._last_extracted = {}
         self._initialized = True
         self._trajectory_reward = 0.0
-        self._milestones = set()
-        self._related_docs_text = self._build_related_docs(task_name, doc_data)
-        tool_hint = ""
-        if task_name in TOOL_ENABLED_TASKS:
-            tool_hint = (
-                "\nAdvanced tools available for this task:\n"
-                "  - 'query_related_documents': Retrieve PO, credit memos, etc.\n"
-                "  - 'verify_calculations': Submit arithmetic for verification\n"
-                "  - 'check_discrepancies': Flag inconsistencies in the document\n"
-            )
-        return InvoiceObservation(
             done=False,
             reward=0.0,
-            text=(
-                f"Invoice Extraction Environment ready.\n"
-                f"Task: {task_name}\n"
-                f"Document ID: {doc_data['id']}\n"
-                f"Fields to extract: {len(self._required_fields)}\n"
-                f"Max attempts: {max_attempts}\n\n"
-                f"Use 'view_document' to see the document text.\n"
-                f"Use 'view_fields' to see the required fields.\n"
-                f"Use 'extract' with a JSON payload to submit your extraction.\n"
-                f"Use 'get_feedback' to see feedback on your last attempt."
-                f"{tool_hint}"
-            ),
-            task_name=task_name,
-            current_score=0.0,
-            attempts_remaining=max_attempts,
-            required_fields=self._required_fields,
             current_step=0,
             accumulated_reward=0.0,
-            last_action_status="success",
         )
-    def _build_related_docs(self, task_name: str, doc_data: dict) -> str:
-        """Build related documents text for cross-referencing tasks."""
-        gt = doc_data["ground_truth"]
-        if task_name not in TOOL_ENABLED_TASKS:
-            return ""
-        parts = []
-        if "po_number" in gt:
-            parts.append(
-                f"=== PURCHASE ORDER ===\n"
-                f"PO Number: {gt.get('po_number', 'N/A')}\n"
-                f"Vendor: {gt.get('vendor_name', 'N/A')}\n"
-                f"Buyer: {gt.get('customer_name', 'N/A')}\n"
             )
-            if "line_items" in gt:
-                for item in gt["line_items"]:
-                    parts.append(
-                        f"  - {item['quantity']}x {item['description']} "
-                        f"@ ${item['unit_price']:.2f} = ${item['amount']:.2f}"
-                    )
-            parts.append("")
-        if gt.get("adjustment_reason"):
-            parts.append(
-                f"=== ADJUSTMENT MEMO ===\n"
-                f"Reason: {gt['adjustment_reason']}\n"
             )
-            if gt.get("adjusted_total"):
-                parts.append(f"Adjusted Total: ${gt['adjusted_total']:,.2f}")
-            parts.append("")
-        if gt.get("discount_amount") and gt["discount_amount"] > 0:
-            parts.append(
-                f"=== DISCOUNT APPLIED ===\n"
-                f"Discount: ${gt['discount_amount']:,.2f}\n"
-                f"Original Total: ${gt.get('original_total', 0):,.2f}\n"
             )
-        return "\n".join(parts) if parts else "No related documents found for this invoice."
     def step(
         self,
-        action: InvoiceAction,
         timeout_s: Optional[float] = None,
         **kwargs: Any,
-    ) -> InvoiceObservation:
-        """Execute a step in the environment."""
         if not self._initialized:
-            return InvoiceObservation(
-                done=True,
-                reward=0.0,
-                text="Error: Environment not initialized. Call reset() first.",
-                metadata={"error": "not_initialized"},
-                last_action_status="error",
-                error_message="Environment not initialized. Call reset() first.",
-            )
         self._state.step_count += 1
-        command = action.command.lower().strip()
-        # Apply per-step cost (encourages efficiency)
-        self._trajectory_reward += PENALTY_PER_STEP
-        handlers = {
-            "view_document": self._handle_view_document,
-            "view_fields": self._handle_view_fields,
-            "extract": lambda: self._handle_extract(action.payload),
-            "get_feedback": self._handle_get_feedback,
-            "query_related_documents": self._handle_query_related,
-            "verify_calculations": lambda: self._handle_verify_calculations(action.payload),
-            "check_discrepancies": self._handle_check_discrepancies,
-        }
-        handler = handlers.get(command)
-        if handler:
-            return handler()
-        else:
-            # Unknown command penalty
-            self._trajectory_reward += PENALTY_UNKNOWN_CMD
-            self._state.accumulated_reward += PENALTY_UNKNOWN_CMD
-            return self._make_obs(
-                done=False,
-                reward=0.0,
-                text=(
-                    f"Unknown command: '{command}'. "
-                    f"Valid commands: {', '.join(handlers.keys())}"
-                ),
-                status="error",
-                error_msg=f"Unknown command: '{command}'",
             )
-    def _make_obs(
-        self,
-        done: bool,
-        reward: float,
-        text: str,
-        status: str = "success",
-        error_msg: Optional[str] = None,
-        metadata: Optional[dict] = None,
-    ) -> InvoiceObservation:
-        """Build a standardized observation."""
-        return InvoiceObservation(
-            done=done,
-            reward=round(max(0.0, min(1.0, reward)), 4) if reward >= 0 else round(max(0.0, reward), 4),
-            text=text,
-            task_name=self._state.task_name,
-            current_score=self._state.best_score,
-            attempts_remaining=self._state.max_attempts - self._state.attempts_used,
-            required_fields=self._required_fields,
-            metadata=metadata or {},
-            last_action_status=status,
-            error_message=error_msg,
-            current_step=self._state.step_count,
-            accumulated_reward=round(self._state.accumulated_reward, 4),
-        )
     # ------------------------------------------------------------------
-    # Command handlers
     # ------------------------------------------------------------------
-    def _handle_view_document(self) -> InvoiceObservation:
-        """Return the current document text (trajectory milestone)."""
-        if "view_document" not in self._milestones:
-            self._milestones.add("view_document")
-            self._trajectory_reward += REWARD_VIEW_DOC
-            self._state.accumulated_reward += REWARD_VIEW_DOC
-        return self._make_obs(done=False, reward=0.0, text=self._document_text)
-    def _handle_view_fields(self) -> InvoiceObservation:
-        """Return the list of required fields with descriptions."""
-        if "view_fields" not in self._milestones:
-            self._milestones.add("view_fields")
-            self._trajectory_reward += REWARD_VIEW_FIELDS
-            self._state.accumulated_reward += REWARD_VIEW_FIELDS
-        field_descriptions = {
-            "invoice_number": "The invoice/document number (string)",
-            "date": "Invoice date in YYYY-MM-DD format (use reissue date if applicable)",
-            "vendor_name": "Name of the vendor/seller/supplier",
-            "customer_name": "Name of the customer/buyer/bill-to party",
-            "subtotal": "Subtotal before tax, after discounts (number)",
-            "tax": "Tax amount (number)",
-            "total": "Total amount due (number)",
-            "line_items": "Array of items: [{description, quantity, unit_price, amount}]",
-            "po_number": "Purchase order reference number (string)",
-            "adjustment_reason": "Reason for any adjustments/credits (string)",
-            "adjusted_total": "Final adjusted total after credits/payments (number)",
-            "discount_amount": "Monetary discount value applied (number, 0 if none)",
-            "original_total": "What the total would have been without adjustments (number)",
-            "discrepancy_notes": "Free-text description of all discrepancies, adjustments, and anomalies found",
-        }
-        lines = ["Required fields to extract:\n"]
-        for field in self._required_fields:
-            desc = field_descriptions.get(field, "No description available")
-            lines.append(f"  - {field}: {desc}")
-        lines.append(f"\nSubmit your extraction using the 'extract' command.")
-        lines.append(f"Payload must be a valid JSON string with these field names.")
-        return self._make_obs(done=False, reward=0.0, text="\n".join(lines))
-    def _handle_query_related(self) -> InvoiceObservation:
-        """Return cross-reference documents (PO, credit memos, etc.)."""
-        if self._state.task_name not in TOOL_ENABLED_TASKS:
-            return self._make_obs(
-                done=False, reward=0.0,
-                text="This command is not available for the current task.",
-                status="error",
-                error_msg="query_related_documents only available for multi_document and adversarial_invoice tasks",
-            )
-        if "query_related" not in self._milestones:
-            self._milestones.add("query_related")
-            self._trajectory_reward += REWARD_QUERY_RELATED
-            self._state.accumulated_reward += REWARD_QUERY_RELATED
-        return self._make_obs(
-            done=False, reward=0.0,
-            text=self._related_docs_text or "No related documents found.",
-        )
-    def _handle_verify_calculations(self, payload: str) -> InvoiceObservation:
-        """Verify arithmetic submitted by the agent."""
-        if self._state.task_name not in TOOL_ENABLED_TASKS:
-            return self._make_obs(
-                done=False, reward=0.0,
-                text="This command is not available for the current task.",
-                status="error",
-                error_msg="verify_calculations only available for multi_document and adversarial_invoice tasks",
             )
-        try:
-            data = json.loads(payload) if payload else {}
-        except json.JSONDecodeError:
-            self._trajectory_reward += PENALTY_INVALID_CALC
-            self._state.accumulated_reward += PENALTY_INVALID_CALC
-            return self._make_obs(
-                done=False, reward=0.0,
-                text="Invalid JSON payload for verify_calculations.",
-                status="error",
-                error_msg="Payload must be valid JSON with numeric fields to verify",
             )
-        if "verify_calc" not in self._milestones:
-            self._milestones.add("verify_calc")
-            self._trajectory_reward += REWARD_VERIFY_CALC
-            self._state.accumulated_reward += REWARD_VERIFY_CALC
-        results = []
-        gt = self._ground_truth
-        checks = {
-            "subtotal_plus_tax": (
-                lambda: round(gt.get("subtotal", 0) + gt.get("tax", 0), 2),
-                gt.get("total"),
-            ),
-        }
-        sub = data.get("subtotal")
-        tax = data.get("tax")
-        total = data.get("total")
-        if sub is not None and tax is not None:
-            computed = round(float(sub) + float(tax), 2)
-            if total is not None:
-                match = abs(computed - float(total)) < 0.02
-                results.append(
-                    f"subtotal ({sub}) + tax ({tax}) = {computed} | "
-                    f"your total ({total}) — {'MATCH ✓' if match else 'MISMATCH ✗'}"
                 )
             else:
-                results.append(f"subtotal ({sub}) + tax ({tax}) = {computed}")
-        if not results:
-            results.append("No recognizable calculations found. Submit fields like: subtotal, tax, total")
-        return self._make_obs(
-            done=False, reward=0.0,
-            text="Calculation verification:\n" + "\n".join(results),
-        )
-    def _handle_check_discrepancies(self) -> InvoiceObservation:
-        """Flag inconsistencies in the document."""
-        if self._state.task_name not in TOOL_ENABLED_TASKS:
-            return self._make_obs(
-                done=False, reward=0.0,
-                text="This command is not available for the current task.",
-                status="error",
-                error_msg="check_discrepancies only available for multi_document and adversarial_invoice tasks",
-            )
-        if "check_discrep" not in self._milestones:
-            self._milestones.add("check_discrep")
-            self._trajectory_reward += REWARD_CHECK_DISCREP
-            self._state.accumulated_reward += REWARD_CHECK_DISCREP
-        gt = self._ground_truth
-        hints = []
-        if gt.get("discount_amount") and gt["discount_amount"] > 0:
-            hints.append("⚠ A discount has been applied to this invoice.")
-        if gt.get("adjustment_reason"):
-            hints.append("⚠ There is an adjustment/credit memo affecting the final amount.")
-        if gt.get("po_number"):
-            hints.append("⚠ This invoice references a purchase order — cross-check quantities and amounts.")
-        if gt.get("original_total") and gt.get("total"):
-            if abs(gt["original_total"] - gt["total"]) > 0.01:
-                hints.append("⚠ The final total differs from the original total — investigate adjustments.")
-        if not hints:
-            hints.append("No obvious discrepancies detected.")
-        return self._make_obs(
-            done=False, reward=0.0,
-            text="Discrepancy analysis:\n" + "\n".join(hints),
-        )
-    def _handle_extract(self, payload: str) -> InvoiceObservation:
-        """Process an extraction attempt with RLVR-style composite reward."""
-        attempts_remaining = self._state.max_attempts - self._state.attempts_used
-        if attempts_remaining <= 0:
-            return self._make_obs(
-                done=True,
-                reward=self._state.best_score,
-                text="No attempts remaining. Episode is complete.",
-                metadata={"final_score": self._state.best_score},
             )
-        # Parse the JSON payload
-        try:
-            extracted = json.loads(payload)
-            if not isinstance(extracted, dict):
-                raise ValueError("Payload must be a JSON object")
-        except (json.JSONDecodeError, ValueError) as e:
-            self._state.attempts_used += 1
-            self._trajectory_reward += PENALTY_INVALID_JSON
-            self._state.accumulated_reward += PENALTY_INVALID_JSON
-            attempts_remaining = self._state.max_attempts - self._state.attempts_used
-            done = attempts_remaining <= 0
-            return self._make_obs(
-                done=done,
-                reward=0.0,
-                text=f"Invalid JSON payload: {str(e)}\nPlease submit a valid JSON object.",
-                status="error",
-                error_msg=f"Invalid JSON: {str(e)}",
-                metadata={"error": "invalid_json"},
             )
-        # Grade the extraction
-        self._state.attempts_used += 1
-        base_score, feedback = grade_extraction(
-            extracted, self._ground_truth, self._required_fields
-        )
-        # === COMPOSITE REWARD (RLVR-inspired) ===
-        # R_outcome: base extraction score
-        r_outcome = base_score
-        # R_trajectory: accumulated from milestones
-        r_trajectory = max(0.0, self._trajectory_reward)
-        # Improvement bonus
-        improvement_bonus = 0.0
-        if self._state.attempts_used > 1 and base_score > self._state.best_score:
-            improvement_bonus = min(base_score - self._state.best_score, 0.02)
-        # Step efficiency bonus
-        efficiency_bonus = 0.0
-        if self._state.step_count <= 3:
-            efficiency_bonus = 0.02
-        elif self._state.step_count <= 5:
-            efficiency_bonus = 0.01
-        # Consistency bonus (subtotal + tax ≈ total)
-        consistency_bonus = 0.0
-        ext_sub = _safe_float(extracted.get("subtotal"))
-        ext_tax = _safe_float(extracted.get("tax"))
-        ext_total = _safe_float(extracted.get("total"))
-        if ext_sub is not None and ext_tax is not None and ext_total is not None:
-            computed = round(ext_sub + ext_tax, 2)
-            if abs(computed - ext_total) < 0.02:
-                consistency_bonus = 0.03
-        # Composite reward
-        bonus = improvement_bonus + efficiency_bonus + consistency_bonus
-        score = round(max(0.01, min(0.99, ALPHA * r_outcome + BETA * r_trajectory + bonus)), 4)
-        # Track
-        self._state.best_score = max(self._state.best_score, score)
-        self._state.accumulated_reward += score
-        self._last_feedback = feedback
-        self._last_extracted = extracted
-        attempts_remaining = self._state.max_attempts - self._state.attempts_used
-        done = attempts_remaining <= 0 or score >= 0.95
-        # Build feedback text
-        matched = sum(1 for f in feedback.values() if f.get("matched", False))
-        total_fields = len(feedback)
-        bonus_details = []
-        if consistency_bonus > 0:
-            bonus_details.append(f"consistency: +{consistency_bonus:.3f}")
-        if improvement_bonus > 0:
-            bonus_details.append(f"improvement: +{improvement_bonus:.3f}")
-        if efficiency_bonus > 0:
-            bonus_details.append(f"efficiency: +{efficiency_bonus:.3f}")
-        if r_trajectory > 0:
-            bonus_details.append(f"trajectory: {r_trajectory:.3f}")
-        feedback_text = (
-            f"Extraction scored: {score:.4f} "
-            f"(outcome: {r_outcome:.4f} × {ALPHA}, trajectory: {r_trajectory:.3f} × {BETA})\n"
-            f"Fields matched: {matched}/{total_fields}\n"
-            f"Best score so far: {self._state.best_score:.4f}\n"
-            f"Attempts remaining: {attempts_remaining}\n"
-        )
-        if bonus_details:
-            feedback_text += f"Reward bonuses: {', '.join(bonus_details)}\n"
-        if not done and score < 0.95:
-            weak_fields = [
-                name for name, data in feedback.items()
-                if not data.get("matched", False)
-            ]
-            if weak_fields:
-                feedback_text += f"\nFields needing improvement: {', '.join(weak_fields)}"
-                feedback_text += "\nUse 'get_feedback' for detailed per-field scores."
-        if done:
-            feedback_text += f"\n\nEpisode complete. Final score: {self._state.best_score:.4f}"
-        return self._make_obs(
-            done=done,
-            reward=score,
-            text=feedback_text,
-            metadata={
-                "score": score,
-                "base_score": base_score,
-                "r_outcome": r_outcome,
-                "r_trajectory": r_trajectory,
-                "bonus": bonus,
-                "bonus_details": bonus_details,
-                "best_score": self._state.best_score,
-                "field_scores": {k: v["score"] for k, v in feedback.items()},
-            },
         )
-    def _handle_get_feedback(self) -> InvoiceObservation:
-        """Return detailed feedback on the last extraction attempt."""
-        if not self._last_feedback:
-            return self._make_obs(
-                done=False,
-                reward=0.0,
-                text="No extraction attempt yet. Use 'extract' to submit your extraction first.",
-            )
-        if "get_feedback" not in self._milestones:
-            self._milestones.add("get_feedback")
-            self._trajectory_reward += REWARD_GET_FEEDBACK
-            self._state.accumulated_reward += REWARD_GET_FEEDBACK
-        lines = ["Detailed feedback on last extraction:\n"]
-        for field, data in self._last_feedback.items():
-            score = data.get("score", 0.0)
-            matched = "✓" if data.get("matched", False) else "✗"
-            field_type = data.get("expected_type", "unknown")
-            lines.append(f"  [{matched}] {field} ({field_type}): {score:.2f}")
-        lines.append(f"\nOverall best score: {self._state.best_score:.2f}")
-        lines.append(f"Attempts remaining: {self._state.max_attempts - self._state.attempts_used}")
-        return self._make_obs(
             done=False,
             reward=0.0,
-            text="\n".join(lines),
-            metadata={"field_feedback": self._last_feedback},
         )
     @property
-    def state(self) -> InvoiceState:
-        """Get the current environment state."""
         return self._state
     def close(self) -> None:
-        """Clean up resources."""
         self._initialized = False
-def _safe_float(value) -> float:
-    """Safely convert a value to float, returning None on failure."""
-    if value is None:
-        return None
-    if isinstance(value, (int, float)):
-        return float(value)
-    if isinstance(value, str):
-        import re
-        cleaned = re.sub(r"[$ ,]", "", value.strip())
-        try:
-            return float(cleaned)
-        except (ValueError, TypeError):
-            return None
-    return None

 """
+ESCTR Environment — Core Implementation.
+Enterprise Supply Chain & Tax Reconciliation: a stateful environment
+where an LLM agent operates as an autonomous financial controller,
+using ERP tools to investigate discrepancies, enforce SLA penalties,
+and navigate adversarial vendor disputes.
 Reward Architecture:
+    R_total = α·R_outcome + β·R_trajectory − penalties
 """
 import json
+from dataclasses import asdict
 from typing import Any, Optional
 from uuid import uuid4
+from .models import ESCTRAction, ESCTRObservation, ESCTRState
+from .procedural import (
+    generate_scenario, Scenario, VALID_TASKS, MAX_STEPS,
+    render_purchase_order, render_invoice, render_sla,
+    render_shipping_log, render_warehouse_logs,
+)
+from .graders import grade_task1, grade_task2, grade_task3
+# Reward constants
+STEP_COST = 0.005
+HALLUCINATION_PENALTY = 0.02
+# Available tools per task
+TASK_TOOLS = {
+    "procurement_reconciliation": [
+        "query_database", "read_document", "submit_financial_decision",
+    ],
+    "sla_enforcement": [
+        "query_database", "read_document", "submit_financial_decision",
+    ],
+    "adversarial_auditing": [
+        "query_database", "read_document", "communicate_vendor", "submit_financial_decision",
+    ],
 }
+# Database tables per task
+AVAILABLE_TABLES = {
+    "procurement_reconciliation": ["purchase_orders", "invoices"],
+    "sla_enforcement": ["purchase_orders", "invoices", "shipping_logs", "sla_contracts"],
+    "adversarial_auditing": ["purchase_orders", "invoices", "shipping_logs", "sla_contracts", "warehouse_logs"],
+}
+class ESCTREnvironment:
+    """Enterprise Supply Chain & Tax Reconciliation Environment."""
     def __init__(self):
+        self._state = ESCTRState(episode_id=str(uuid4()))
+        self._scenario: Optional[Scenario] = None
         self._initialized = False
         self._trajectory_reward = 0.0
+        self._milestones: list = []
+        self._vendor_negotiation_count = 0
+        self._settlement_offered = False
+        self._settlement_rejected = False
+        self._cited_evidence = False
     def reset(
         self,
         seed: Optional[int] = None,
         episode_id: Optional[str] = None,
+        task_name: str = "procurement_reconciliation",
         **kwargs: Any,
+    ) -> ESCTRObservation:
+        """Reset the environment with a new scenario."""
         if task_name not in VALID_TASKS:
+            task_name = "procurement_reconciliation"
+        actual_seed = seed if seed is not None else 0
+        scenario = generate_scenario(task_name, actual_seed)
+        max_steps = MAX_STEPS.get(task_name, 15)
+        self._state = ESCTRState(
             episode_id=episode_id or str(uuid4()),
             step_count=0,
             task_name=task_name,
+            seed=actual_seed,
             accumulated_reward=0.0,
+            outcome_submitted=False,
+            milestones_hit=[],
         )
+        self._scenario = scenario
         self._initialized = True
         self._trajectory_reward = 0.0
+        self._milestones = []
+        self._vendor_negotiation_count = 0
+        self._settlement_offered = False
+        self._settlement_rejected = False
+        self._cited_evidence = False
+        tools = TASK_TOOLS.get(task_name, [])
+        tables = AVAILABLE_TABLES.get(task_name, [])
+        # Build initial briefing
+        briefing = self._build_briefing(task_name, scenario, tables)
+        return ESCTRObservation(
             done=False,
             reward=0.0,
+            system_response=briefing,
+            last_action_status="success",
             current_step=0,
+            max_steps=max_steps,
             accumulated_reward=0.0,
+            task_name=task_name,
+            available_tools=tools,
         )
+    def _build_briefing(self, task_name: str, scenario: Scenario, tables: list) -> str:
+        """Generate task-specific initial briefing."""
+        vendor = scenario.vendor.name
+        buyer = scenario.buyer.name
+        inv_num = scenario.invoice.invoice_number
+        po_num = scenario.purchase_order.po_number
+        if task_name == "procurement_reconciliation":
+            return (
+                f"=== DISCREPANCY ALERT ===\n"
+                f"A pricing discrepancy has been detected between Purchase Order {po_num} "
+                f"and Vendor Invoice {inv_num} from {vendor}.\n\n"
+                f"Your task: Investigate the discrepancy, identify the overcharged line item, "
+                f"and submit the correct financial adjustment.\n\n"
+                f"Available databases: {', '.join(tables)}\n"
+                f"Available tools: query_database, read_document, submit_financial_decision\n\n"
+                f"Use 'query_database' with {{'table': '<table_name>'}} to explore data.\n"
+                f"Use 'read_document' with document_id (e.g. '{po_num}' or '{inv_num}') to read full documents.\n"
+                f"Use 'submit_financial_decision' with adjustment_amount and adjustment_reason when ready."
             )
+        elif task_name == "sla_enforcement":
+            return (
+                f"=== PAYMENT DEMAND REVIEW ===\n"
+                f"Vendor {vendor} has submitted Invoice {inv_num} (ref: {po_num}) "
+                f"demanding full payment without penalties.\n\n"
+                f"Intelligence suggests the shipment may have been delivered late. "
+                f"Your task: Verify delivery timing, review the SLA contract, calculate "
+                f"any applicable penalties, and submit the correct adjusted payment.\n\n"
+                f"Available databases: {', '.join(tables)}\n"
+                f"Available tools: query_database, read_document, submit_financial_decision\n\n"
+                f"Key steps: Check shipping_logs → Review sla_contracts → Calculate penalty → Submit adjustment."
             )
+        elif task_name == "adversarial_auditing":
+            return (
+                f"=== VENDOR DISPUTE ALERT ===\n"
+                f"Vendor {vendor} has submitted Invoice {inv_num} (ref: {po_num}) "
+                f"demanding full payment. Shipping records indicate a late delivery.\n\n"
+                f"⚠ The vendor DISPUTES the late delivery claim. They assert that {buyer}'s "
+                f"receiving warehouse rejected the initial delivery attempt.\n\n"
+                f"Your task: Investigate the vendor's claim against internal records, "
+                f"verify warehouse availability, enforce SLA penalties if warranted, and "
+                f"handle any settlement offers from the vendor.\n\n"
+                f"Available databases: {', '.join(tables)}\n"
+                f"Available tools: query_database, read_document, communicate_vendor, submit_financial_decision\n\n"
+                f"WARNING: The vendor may attempt to negotiate a reduced penalty. "
+                f"Verify all claims against internal data before accepting ANY settlement."
             )
+        return "Environment ready."
     def step(
         self,
+        action: ESCTRAction,
         timeout_s: Optional[float] = None,
         **kwargs: Any,
+    ) -> ESCTRObservation:
+        """Execute one step in the environment."""
         if not self._initialized:
+            return self._error_obs("Environment not initialized. Call reset() first.", terminal=True)
+        if self._state.outcome_submitted:
+            return self._error_obs("Episode already complete. Call reset() for a new episode.", terminal=True)
         self._state.step_count += 1
+        max_steps = MAX_STEPS.get(self._state.task_name, 15)
+        # Step cost
+        self._trajectory_reward -= STEP_COST
+        # Check max steps
+        if self._state.step_count > max_steps:
+            return self._finalize("Maximum steps exceeded. Episode terminated.", forced=True)
+        # Validate tool availability
+        available = TASK_TOOLS.get(self._state.task_name, [])
+        if action.action_type not in available:
+            self._trajectory_reward -= HALLUCINATION_PENALTY
+            return self._error_obs(
+                f"Tool '{action.action_type}' is not available for task '{self._state.task_name}'. "
+                f"Available tools: {', '.join(available)}"
             )
+        # Dispatch
+        if action.action_type == "query_database":
+            return self._handle_query(action)
+        elif action.action_type == "read_document":
+            return self._handle_read(action)
+        elif action.action_type == "communicate_vendor":
+            return self._handle_vendor_comm(action)
+        elif action.action_type == "submit_financial_decision":
+            return self._handle_submit(action)
+        return self._error_obs(f"Unknown action type: {action.action_type}")
     # ------------------------------------------------------------------
+    # Tool handlers
     # ------------------------------------------------------------------
+    def _handle_query(self, action: ESCTRAction) -> ESCTRObservation:
+        """Handle database queries."""
+        params = action.query_parameters or {}
+        table = params.get("table", "")
+        available = AVAILABLE_TABLES.get(self._state.task_name, [])
+        if not table:
+            self._trajectory_reward -= HALLUCINATION_PENALTY
+            return self._error_obs(
+                f"Missing 'table' in query_parameters. Available tables: {', '.join(available)}"
             )
+        if table not in available:
+            self._trajectory_reward -= HALLUCINATION_PENALTY
+            return self._error_obs(
+                f"Table '{table}' not found. Available tables: {', '.join(available)}"
             )
+        scenario = self._scenario
+        if table == "purchase_orders":
+            self._add_milestone("retrieved_po")
+            po = scenario.purchase_order
+            summary = (
+                f"Query result: 1 record found in purchase_orders\n\n"
+                f"PO Number: {po.po_number}\n"
+                f"Date: {po.date}\n"
+                f"Vendor: {po.vendor.name}\n"
+                f"Buyer: {po.buyer.name}\n"
+                f"Total: ${po.total_amount:,.2f}\n"
+                f"Items: {len(po.line_items)}\n\n"
+                f"Use read_document with document_id='{po.po_number}' for full details."
+            )
+            return self._success_obs(summary)
+        elif table == "invoices":
+            self._add_milestone("retrieved_invoice")
+            inv = scenario.invoice
+            summary = (
+                f"Query result: 1 record found in invoices\n\n"
+                f"Invoice: {inv.invoice_number}\n"
+                f"Date: {inv.date}\n"
+                f"PO Ref: {inv.po_reference}\n"
+                f"Vendor: {inv.vendor.name}\n"
+                f"Subtotal: ${inv.subtotal:,.2f}\n"
+                f"Tax: ${inv.tax_amount:,.2f}\n"
+                f"Total: ${inv.total:,.2f}\n\n"
+                f"Use read_document with document_id='{inv.invoice_number}' for full details."
+            )
+            return self._success_obs(summary)
+        elif table == "shipping_logs":
+            self._add_milestone("retrieved_shipping")
+            log = scenario.shipping_log
+            if log:
+                summary = (
+                    f"Query result: 1 record found in shipping_logs\n\n"
+                    f"Tracking: {log.tracking_id}\n"
+                    f"PO Ref: {log.po_reference}\n"
+                    f"Carrier: {log.carrier}\n"
+                    f"Expected Delivery: {log.expected_delivery}\n"
+                    f"Actual Delivery: {log.actual_delivery}\n"
+                    f"Delay: {log.delay_days} day(s)\n"
+                    f"Status: {log.status}\n\n"
+                    f"Use read_document with document_id='{log.tracking_id}' for full log."
+                )
+            else:
+                summary = "Query result: 0 records found in shipping_logs."
+            return self._success_obs(summary)
+        elif table == "sla_contracts":
+            self._add_milestone("retrieved_sla")
+            sla = scenario.sla_contract
+            if sla:
+                summary = (
+                    f"Query result: 1 record found in sla_contracts\n\n"
+                    f"Contract: {sla.contract_id}\n"
+                    f"Vendor: {sla.vendor}\n"
+                    f"Buyer: {sla.buyer}\n"
+                    f"Delivery Terms: {sla.delivery_terms}\n\n"
+                    f"Use read_document with document_id='{sla.contract_id}' for full SLA."
+                )
+            else:
+                summary = "Query result: 0 records found in sla_contracts."
+            return self._success_obs(summary)
+        elif table == "warehouse_logs":
+            self._add_milestone("checked_warehouse")
+            logs = scenario.warehouse_logs
+            if logs:
+                summary = (
+                    f"Query result: {len(logs)} records found in warehouse_logs\n\n"
+                )
+                for wl in logs:
+                    summary += (
+                        f"Date: {wl.date} | Dock: {wl.dock_id} | Status: {wl.status.upper()} | "
+                        f"Staff: {wl.staff_on_duty} | Shipments: {wl.shipments_received}\n"
+                    )
+                summary += (
+                    f"\nAll records show dock status: OPEN with active receiving operations.\n"
+                    f"This contradicts any claim that the warehouse was unavailable."
                 )
             else:
+                summary = "Query result: 0 records found in warehouse_logs."
+            return self._success_obs(summary)
+        return self._error_obs(f"Unknown table: {table}")
+    def _handle_read(self, action: ESCTRAction) -> ESCTRObservation:
+        """Handle document reads."""
+        doc_id = action.document_id
+        if not doc_id:
+            self._trajectory_reward -= HALLUCINATION_PENALTY
+            return self._error_obs("Missing document_id. Specify the document to read.")
+        scenario = self._scenario
+        # Match document_id to known documents
+        if doc_id == scenario.purchase_order.po_number:
+            self._add_milestone("retrieved_po")
+            self._add_milestone("compared_documents")
+            return self._success_obs(render_purchase_order(scenario.purchase_order))
+        elif doc_id == scenario.invoice.invoice_number:
+            self._add_milestone("retrieved_invoice")
+            self._add_milestone("compared_documents")
+            return self._success_obs(render_invoice(scenario.invoice))
+        elif scenario.sla_contract and doc_id == scenario.sla_contract.contract_id:
+            self._add_milestone("retrieved_sla")
+            return self._success_obs(render_sla(scenario.sla_contract))
+        elif scenario.shipping_log and doc_id == scenario.shipping_log.tracking_id:
+            self._add_milestone("retrieved_shipping")
+            return self._success_obs(render_shipping_log(scenario.shipping_log))
+        else:
+            self._trajectory_reward -= HALLUCINATION_PENALTY
+            return self._error_obs(f"Document '{doc_id}' not found in the system.")
+    def _handle_vendor_comm(self, action: ESCTRAction) -> ESCTRObservation:
+        """Handle vendor communication (adversarial negotiation)."""
+        self._add_milestone("vendor_negotiation")
+        self._vendor_negotiation_count += 1
+        msg = (action.message_content or "").lower()
+        scenario = self._scenario
+        import random as _rng
+        _rng.seed(self._state.seed + self._vendor_negotiation_count)
+        if self._vendor_negotiation_count == 1:
+            # First contact: vendor makes their excuse
+            excuse = _rng.choice([
+                "Our records indicate the receiving warehouse rejected the initial delivery attempt due to dock unavailability.",
+                "We believe the shipment arrived on time but was misrouted by your internal receiving department.",
+                "Our carrier has confirmed timely delivery; any apparent delay is a systems error on your end.",
+            ])
+            response = (
+                f"VENDOR RESPONSE ({scenario.vendor.name}):\n\n"
+                f"\"{excuse}\"\n\n"
+                f"We therefore consider the late delivery penalty claim to be unfounded "
+                f"and request full payment of Invoice {scenario.invoice.invoice_number}."
+            )
+        elif self._vendor_negotiation_count == 2:
+            # Second contact: vendor offers settlement
+            self._settlement_offered = True
+            pct = _rng.choice([40, 45, 50, 55])
+            penalty = scenario.penalty_amount or 0
+            settlement = round(penalty * (pct / 100.0), 2)
+            response = (
+                f"VENDOR RESPONSE ({scenario.vendor.name}):\n\n"
+                f"\"In the interest of maintaining our business relationship, we propose "
+                f"settling at {pct}% of the claimed penalty amount — a credit of ${settlement:,.2f} "
+                f"instead of the full ${penalty:,.2f}.\"\n\n"
+                f"⚠ This is a settlement offer. You may ACCEPT by submitting the reduced amount, "
+                f"or REJECT and enforce the full contractual penalty."
             )
+        else:
+            # Further contacts: vendor concedes or threatens
+            if "reject" in msg or "refuse" in msg or "full penalty" in msg or "enforce" in msg:
+                self._settlement_rejected = True
+                response = (
+                    f"VENDOR RESPONSE ({scenario.vendor.name}):\n\n"
+                    f"\"We acknowledge your position. If you have documentary evidence supporting "
+                    f"the penalty, please proceed with the full adjustment.\""
+                )
+            else:
+                response = (
+                    f"VENDOR RESPONSE ({scenario.vendor.name}):\n\n"
+                    f"\"We maintain our position. Please review the evidence and respond accordingly.\""
+                )
+        return self._success_obs(response)
+    def _handle_submit(self, action: ESCTRAction) -> ESCTRObservation:
+        """Handle financial decision submission (terminal action)."""
+        amount = action.adjustment_amount
+        reason = action.adjustment_reason or ""
+        if amount is None:
+            self._trajectory_reward -= HALLUCINATION_PENALTY
+            return self._error_obs(
+                "Missing adjustment_amount. Submit the exact monetary adjustment as a float."
             )
+        # Check for evidence citation in reason
+        if "warehouse" in reason.lower() or "dock" in reason.lower() or "access log" in reason.lower():
+            self._cited_evidence = True
+        # Mark as submitted
+        self._state.outcome_submitted = True
+        # Check if settlement was accepted (for task 3)
+        if self._settlement_offered and not self._settlement_rejected:
+            # Agent accepted the settlement (bad for task 3)
+            pass
+        return self._finalize_with_grading(amount)
+    # ------------------------------------------------------------------
+    # Helpers
+    # ------------------------------------------------------------------
+    def _add_milestone(self, milestone: str):
+        if milestone not in self._milestones:
+            self._milestones.append(milestone)
+            self._state.milestones_hit = self._milestones.copy()
+    def _finalize_with_grading(self, submitted_amount: float) -> ESCTRObservation:
+        """Run the appropriate grader and return final observation."""
+        task = self._state.task_name
+        scenario = self._scenario
+        steps = self._state.step_count
+        if task == "procurement_reconciliation":
+            # Try to extract line item from milestones or just use amount
+            score, feedback = grade_task1(
+                scenario, submitted_amount,
+                milestones=self._milestones,
+                steps_taken=steps,
+            )
+        elif task == "sla_enforcement":
+            self._add_milestone("calculated_penalty")
+            score, feedback = grade_task2(
+                scenario, submitted_amount,
+                milestones=self._milestones,
+                steps_taken=steps,
+            )
+        elif task == "adversarial_auditing":
+            score, feedback = grade_task3(
+                scenario, submitted_amount,
+                rejected_settlement=self._settlement_rejected,
+                cited_evidence=self._cited_evidence,
+                milestones=self._milestones,
+                steps_taken=steps,
+            )
+        else:
+            score = 0.01
+            feedback = {"error": "Unknown task"}
+        self._state.best_score = score
+        self._state.accumulated_reward += score
+        response = (
+            f"=== FINANCIAL DECISION PROCESSED ===\n\n"
+            f"Submitted adjustment: ${submitted_amount:,.2f}\n"
+            f"Score: {score:.4f}\n\n"
         )
+        if "outcome" in feedback:
+            response += f"Outcome: {feedback['outcome']}\n"
+        if "trajectory" in feedback:
+            response += f"Investigation milestones: {', '.join(feedback.get('trajectory', []))}\n"
+        if feedback.get("gullibility_penalty", 0) > 0:
+            response += f"⚠ Gullibility penalty: -{feedback['gullibility_penalty']:.2f}\n"
+        if feedback.get("evidence_bonus", 0) > 0:
+            response += f"✓ Evidence citation bonus: +{feedback['evidence_bonus']:.2f}\n"
+        response += f"\nFinal score: {score:.4f}"
+        return ESCTRObservation(
+            done=True,
+            reward=score,
+            system_response=response,
+            last_action_status="success",
+            current_step=self._state.step_count,
+            max_steps=MAX_STEPS.get(task, 15),
+            accumulated_reward=self._state.accumulated_reward,
+            task_name=task,
+            available_tools=[],
+            metadata=feedback,
+        )
+    def _finalize(self, msg: str, forced: bool = False) -> ESCTRObservation:
+        """Finalize episode without submission (timeout / error)."""
+        self._state.outcome_submitted = True
+        return ESCTRObservation(
+            done=True,
+            reward=0.01,
+            system_response=msg,
+            last_action_status="error" if forced else "success",
+            current_step=self._state.step_count,
+            max_steps=MAX_STEPS.get(self._state.task_name, 15),
+            accumulated_reward=self._state.accumulated_reward,
+            task_name=self._state.task_name,
+            metadata={"forced_termination": forced},
+        )
+    def _success_obs(self, text: str) -> ESCTRObservation:
+        return ESCTRObservation(
             done=False,
             reward=0.0,
+            system_response=text,
+            last_action_status="success",
+            current_step=self._state.step_count,
+            max_steps=MAX_STEPS.get(self._state.task_name, 15),
+            accumulated_reward=self._state.accumulated_reward,
+            task_name=self._state.task_name,
+            available_tools=TASK_TOOLS.get(self._state.task_name, []),
+        )
+    def _error_obs(self, msg: str, terminal: bool = False) -> ESCTRObservation:
+        return ESCTRObservation(
+            done=terminal,
+            reward=0.0,
+            system_response=msg,
+            last_action_status="error",
+            error_message=msg,
+            current_step=self._state.step_count,
+            max_steps=MAX_STEPS.get(self._state.task_name, 15),
+            accumulated_reward=self._state.accumulated_reward,
+            task_name=self._state.task_name,
+            available_tools=TASK_TOOLS.get(self._state.task_name, []),
         )
     @property
+    def state(self) -> ESCTRState:
         return self._state
     def close(self) -> None:
         self._initialized = False
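
Review note on `_handle_vendor_comm` above: the added lines call `_rng.seed(...)` on the `random` module itself, which reseeds the process-wide generator on every vendor message. The sketch below is not part of this commit. It shows one possible alternative, a local `random.Random` instance keyed on the same `seed + turn` value, which should draw the same choices without touching global state. The names `vendor_turn`, `EXCUSES`, and `SETTLEMENT_PCTS` are hypothetical, and the excuse strings are shortened placeholders.

```python
import random

# Placeholder excuse texts; the commit uses longer sentences.
EXCUSES = [
    "dock unavailability",
    "misrouted by receiving department",
    "systems error on buyer side",
]
# Settlement percentages copied from the diff above.
SETTLEMENT_PCTS = [40, 45, 50, 55]


def vendor_turn(seed: int, turn: int, penalty: float) -> dict:
    """Return the vendor's move for one turn, deterministic in (seed, turn)."""
    # A local generator leaves the module-level RNG state untouched.
    rng = random.Random(seed + turn)
    if turn == 1:
        # First contact: the vendor picks an excuse.
        return {"kind": "excuse", "text": rng.choice(EXCUSES)}
    if turn == 2:
        # Second contact: the vendor offers a fraction of the penalty.
        pct = rng.choice(SETTLEMENT_PCTS)
        amount = round(penalty * (pct / 100.0), 2)
        return {"kind": "settlement", "pct": pct, "amount": amount}
    # Later contacts: the vendor holds position (keyword handling omitted).
    return {"kind": "hold"}


if __name__ == "__main__":
    print(vendor_turn(seed=7, turn=1, penalty=1200.0))
    print(vendor_turn(seed=7, turn=2, penalty=1200.0))
```

Because the key is the sum `seed + turn`, different pairs such as (1, 2) and (2, 1) share a generator state. That also holds for the committed code and may or may not matter for scenario diversity.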

server/graders.py CHANGED

@@ -1,313 +1,291 @@
 """
-Grading logic for the Invoice Extraction Environment.
-Provides field-level scoring with fuzzy matching for text fields
-and exact matching for numeric/date fields. All scores are in [0.0, 1.0].
 """
-import json
-import re
-from difflib import SequenceMatcher
-from typing import Any, Dict, List, Optional, Tuple
-def normalize_text(text: str) -> str:
-    """Normalize text for comparison: lowercase, strip, collapse whitespace."""
-    if not isinstance(text, str):
-        text = str(text)
-    text = text.lower().strip()
-    text = re.sub(r"\s+", " ", text)
-    # Remove common punctuation variations
-    text = text.replace(".", "").replace(",", "").replace("'", "").replace('"', "")
-    return text
-def normalize_number(value: Any) -> Optional[float]:
-    """Normalize a numeric value: strip currency symbols, parse to float."""
-    if isinstance(value, (int, float)):
-        return round(float(value), 2)
-    if isinstance(value, str):
-        # Remove currency symbols, commas, whitespace
-        cleaned = re.sub(r"[$ ,]", "", value.strip())
-        try:
-            return round(float(cleaned), 2)
-        except (ValueError, TypeError):
-            return None
-    return None
-def normalize_date(date_str: str) -> Optional[str]:
-    """Normalize date to YYYY-MM-DD format."""
-    if not isinstance(date_str, str):
-        return None
-    date_str = date_str.strip()
-    # Already in YYYY-MM-DD
-    if re.match(r"^\d{4}-\d{2}-\d{2}$", date_str):
-        return date_str
-    # MM/DD/YYYY
-    m = re.match(r"^(\d{1,2})/(\d{1,2})/(\d{4})$", date_str)
-    if m:
-        return f"{m.group(3)}-{int(m.group(1)):02d}-{int(m.group(2)):02d}"
-    # DD-Mon-YYYY or Mon DD, YYYY etc - try common patterns
-    month_map = {
-        "jan": "01", "january": "01", "feb": "02", "february": "02",
-        "mar": "03", "march": "03", "apr": "04", "april": "04",
-        "may": "05", "jun": "06", "june": "06", "jul": "07", "july": "07",
-        "aug": "08", "august": "08", "sep": "09", "september": "09",
-        "oct": "10", "october": "10", "nov": "11", "november": "11",
-        "dec": "12", "december": "12",
-    }
-    # "January 15, 2024" or "Jan 15 2024"
-    m = re.match(r"(\w+)\s+(\d{1,2}),?\s*'?(\d{2,4})$", date_str, re.IGNORECASE)
-    if m:
-        month = month_map.get(m.group(1).lower())
-        if month:
-            year = m.group(3)
-            if len(year) == 2:
-                year = "20" + year
-            return f"{year}-{month}-{int(m.group(2)):02d}"
-    # "15-Feb-2024" or "20-Feb-2024"
-    m = re.match(r"(\d{1,2})-(\w+)-(\d{4})$", date_str, re.IGNORECASE)
-    if m:
-        month = month_map.get(m.group(2).lower())
-        if month:
-            return f"{m.group(3)}-{month}-{int(m.group(1)):02d}"
-    return date_str  # Return as-is if no pattern matches
-def grade_text(actual: Any, expected: Any) -> float:
-    """Grade a text field using fuzzy matching. Returns 0.0-1.0."""
-    if actual is None or expected is None:
-        return 0.0 if actual != expected else 1.0
-    norm_actual = normalize_text(str(actual))
-    norm_expected = normalize_text(str(expected))
-    if norm_actual == norm_expected:
-        return 1.0
-    # Use SequenceMatcher for fuzzy comparison
-    ratio = SequenceMatcher(None, norm_actual, norm_expected).ratio()
-    # Apply a threshold: below 0.4 similarity = 0 score
-    if ratio < 0.4:
-        return 0.0
-    return round(ratio, 4)
-def grade_numeric(actual: Any, expected: Any) -> float:
-    """Grade a numeric field. Returns 1.0 for exact match, partial for close."""
-    norm_actual = normalize_number(actual)
-    norm_expected = normalize_number(expected)
-    if norm_actual is None or norm_expected is None:
-        return 0.0
-    if norm_actual == norm_expected:
-        return 1.0
-    # Partial credit for being close (within 5%)
-    if norm_expected != 0:
-        error_pct = abs(norm_actual - norm_expected) / abs(norm_expected)
-        if error_pct <= 0.01:
-            return 0.9  # Very close
         elif error_pct <= 0.05:
-            return 0.5  # Somewhat close
         elif error_pct <= 0.10:
-            return 0.2  # In the ballpark
-    return 0.0
-def grade_date(actual: Any, expected: Any) -> float:
-    """Grade a date field after normalization. Returns 0.0 or 1.0."""
-    if actual is None:
-        return 0.0
-    norm_actual = normalize_date(str(actual))
-    norm_expected = normalize_date(str(expected))
-    if norm_actual == norm_expected:
-        return 1.0
-    # Partial credit for getting the right date with wrong format
-    if norm_actual and norm_expected:
-        # Remove separators and compare
-        a = re.sub(r"[^0-9]", "", norm_actual)
-        e = re.sub(r"[^0-9]", "", norm_expected)
-        if a == e:
-            return 0.8
-    return 0.0
-def grade_line_items(actual: Any, expected: Any) -> float:
-    """Grade line items extraction. Checks description, qty, price, amount."""
-    if not isinstance(actual, list) or not isinstance(expected, list):
-        return 0.0
-    if len(actual) == 0:
-        return 0.0
-    total_score = 0.0
-    matched_expected = set()
-    for act_item in actual:
-        if not isinstance(act_item, dict):
-            continue
-        best_score = 0.0
-        best_idx = -1
-        for idx, exp_item in enumerate(expected):
-            if idx in matched_expected:
-                continue
-            if not isinstance(exp_item, dict):
-                continue
-            # Score each field of the line item
-            desc_score = grade_text(
-                act_item.get("description", ""),
-                exp_item.get("description", ""),
-            )
-            qty_score = grade_numeric(
-                act_item.get("quantity"),
-                exp_item.get("quantity"),
-            )
-            price_score = grade_numeric(
-                act_item.get("unit_price"),
-                exp_item.get("unit_price"),
-            )
-            amt_score = grade_numeric(
-                act_item.get("amount"),
-                exp_item.get("amount"),
-            )
-            item_score = (desc_score * 0.3 + qty_score * 0.2 +
-                          price_score * 0.2 + amt_score * 0.3)
-            if item_score > best_score:
-                best_score = item_score
-                best_idx = idx
-        if best_idx >= 0:
-            matched_expected.add(best_idx)
-            total_score += best_score
-    # Normalize by expected count, penalize missing/extra items
-    expected_count = len(expected)
-    if expected_count == 0:
-        return 1.0 if len(actual) == 0 else 0.0
-    # Score = matched items score / expected count
-    # Penalize for extra items (max penalty = 0.2)
-    extra_penalty = max(0, len(actual) - expected_count) * 0.05
-    extra_penalty = min(extra_penalty, 0.2)
-    score = (total_score / expected_count) - extra_penalty
-    return max(0.0, min(1.0, round(score, 4)))
-def grade_extraction(
-    extracted: Dict[str, Any],
-    ground_truth: Dict[str, Any],
-    required_fields: List[str],
 ) -> Tuple[float, Dict[str, Any]]:
-    """Grade the full extraction against ground truth.
-    Uses weighted scoring: financial fields (subtotal, tax, total) are
-    weighted 1.5x, line_items 2.0x, and reasoning fields 0.8x to reflect
-    their relative importance in real-world invoice processing.
-    Args:
-        extracted: The agent's extracted fields
-        ground_truth: The correct field values
-        required_fields: List of field names to grade
-    Returns:
-        Tuple of (overall_score, field_feedback)
-        overall_score is in [0.0, 1.0]
-        field_feedback maps field names to {score, expected, actual}
     """
-    field_scores = {}
-    feedback = {}
-    numeric_fields = {"total", "subtotal", "tax", "adjusted_total",
-                       "discount_amount", "original_total"}
-    date_fields = {"date", "due_date"}
-    list_fields = {"line_items"}
-    # Free-text reasoning fields — graded with lower threshold
-    reasoning_fields = {"discrepancy_notes", "adjustment_reason"}
-    # Field importance weights for weighted average
-    field_weights = {
-        "subtotal": 1.5, "tax": 1.5, "total": 1.5,
-        "adjusted_total": 1.5, "discount_amount": 1.2, "original_total": 1.2,
-        "line_items": 2.0,
-        "discrepancy_notes": 0.8, "adjustment_reason": 0.8,
-    }
-    for field in required_fields:
-        expected = ground_truth.get(field)
-        actual = extracted.get(field)
-        if field in list_fields:
-            score = grade_line_items(actual, expected)
-        elif field in numeric_fields:
-            score = grade_numeric(actual, expected)
-        elif field in date_fields:
-            score = grade_date(actual, expected)
-        elif field in reasoning_fields:
-            # Free-text reasoning: use fuzzy matching with generous partial credit
-            score = grade_text(actual, expected)
         else:
-            score = grade_text(actual, expected)
-        field_scores[field] = score
-        feedback[field] = {
-            "score": score,
-            "expected_type": "list" if field in list_fields else
-                            "number" if field in numeric_fields else
-                            "date" if field in date_fields else
-                            "reasoning" if field in reasoning_fields else "text",
-            "matched": score >= 0.5 if field in reasoning_fields else score >= 0.8,
-        }
-    # Weighted average
-    if not field_scores:
-        return 0.01, feedback
-    weighted_sum = 0.0
-    weight_total = 0.0
-    for field, score in field_scores.items():
-        w = field_weights.get(field, 1.0)
-        weighted_sum += score * w
-        weight_total += w
-    overall = weighted_sum / weight_total if weight_total > 0 else 0.0
-    # Cross-field arithmetic verification bonus
-    gt_sub = ground_truth.get("subtotal")
-    gt_tax = ground_truth.get("tax")
-    gt_total = ground_truth.get("total")
-    if gt_sub is not None and gt_tax is not None and gt_total is not None:
-        ext_sub = normalize_number(extracted.get("subtotal"))
-        ext_tax = normalize_number(extracted.get("tax"))
-        ext_total = normalize_number(extracted.get("total"))
-        if ext_sub is not None and ext_tax is not None and ext_total is not None:
-            computed = round(ext_sub + ext_tax, 2)
-            if abs(computed - ext_total) < 0.02:
-                overall += 0.02  # Arithmetic consistency bonus built into grader
-    # Clamp to strict (0, 1) — validator rejects exactly 0.0 or 1.0
-    overall = round(max(0.01, min(0.99, overall)), 4)
-    return overall, feedback

 """
+Deterministic Graders for the ESCTR Environment.
+Each task has a specific grader that scores the agent's performance
+using verifiable, programmatic criteria — no subjective evaluation.
+Scores are always clamped to [0.01, 0.99], so they stay strictly inside (0, 1) as OpenEnv validators require.
 """
+from typing import Any, Dict, List, Tuple
+from .procedural import Scenario
+def clamp_score(score: float) -> float:
+    """Clamp score to the closed range [0.01, 0.99] and round to 4 decimals."""
+    return round(max(0.01, min(0.99, score)), 4)
+# ---------------------------------------------------------------------------
+# Task 1: Procurement Reconciliation
+# ---------------------------------------------------------------------------
+def grade_task1(
+    scenario: Scenario,
+    submitted_amount: float,
+    submitted_line_item: str = None,
+    milestones: List[str] = None,
+    steps_taken: int = 0,
+) -> Tuple[float, Dict[str, Any]]:
+    """Grade the procurement reconciliation task.
+    Perfect score requires:
+    - Correct discrepant line item identified
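+    - Exact adjustment amount (overcharge value, negative; absolute error below 0.02)
+    Partial credit (outcome component):
+    - Correct line item but wrong amount → 0.5
+    - Correct amount but wrong line item → 0.4
+    - Wrong line item and wrong amount → 0.0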
+    """
+    milestones = milestones or []
+    feedback = {"task": "procurement_reconciliation"}
+    # Outcome scoring (weight: 0.70)
+    correct_amount = scenario.correct_adjustment
+    correct_item = scenario.discrepant_line_item_id
+    outcome_score = 0.0
+    item_correct = (submitted_line_item == correct_item) if submitted_line_item and correct_item else False
+    amount_correct = abs(submitted_amount - correct_amount) < 0.02 if submitted_amount is not None else False
+    if item_correct and amount_correct:
+        outcome_score = 1.0
+        feedback["outcome"] = "PERFECT — correct line item and exact adjustment amount"
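+    elif item_correct and not amount_correct:
+        outcome_score = 0.5
+        # submitted_amount may be None here; formatting None with :.2f would raise TypeError
+        got = f"{submitted_amount:.2f}" if submitted_amount is not None else "no amount"
+        feedback["outcome"] = f"PARTIAL — correct line item but wrong amount (expected {correct_amount:.2f}, got {got})"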
+    elif not item_correct and amount_correct:
+        outcome_score = 0.4
+        feedback["outcome"] = f"PARTIAL — correct amount but wrong line item (expected {correct_item})"
+    else:
+        outcome_score = 0.0
+        feedback["outcome"] = "FAIL — wrong line item and wrong amount"
+    # Trajectory scoring (weight: 0.30)
+    trajectory_score = 0.0
+    trajectory_details = []
+    if "retrieved_po" in milestones:
+        trajectory_score += 0.4
+        trajectory_details.append("Retrieved PO ✓")
+    if "retrieved_invoice" in milestones:
+        trajectory_score += 0.4
+        trajectory_details.append("Retrieved Invoice ✓")
+    if "compared_documents" in milestones:
+        trajectory_score += 0.2
+        trajectory_details.append("Compared documents ✓")
+    trajectory_score = min(1.0, trajectory_score)
+    feedback["trajectory"] = trajectory_details
+    # Efficiency penalty
+    max_steps = 10
+    efficiency_penalty = max(0, (steps_taken - max_steps) * 0.02)
+    # Composite
+    alpha, beta = 0.70, 0.30
+    raw_score = alpha * outcome_score + beta * trajectory_score - efficiency_penalty
+    final_score = clamp_score(raw_score)
+    feedback["outcome_score"] = outcome_score
+    feedback["trajectory_score"] = trajectory_score
+    feedback["efficiency_penalty"] = efficiency_penalty
+    feedback["final_score"] = final_score
+    feedback["correct_adjustment"] = correct_amount
+    feedback["correct_line_item"] = correct_item
+    return final_score, feedback
+# ---------------------------------------------------------------------------
+# Task 2: SLA Enforcement
+# ---------------------------------------------------------------------------
+def grade_task2(
+    scenario: Scenario,
+    submitted_amount: float,
+    milestones: List[str] = None,
+    steps_taken: int = 0,
+) -> Tuple[float, Dict[str, Any]]:
+    """Grade the SLA enforcement task.
+    Perfect score requires:
+    - Exact penalty amount calculated from shipping delay + SLA terms
+    Partial credit:
+    - Within 5% of correct penalty → 0.7
+    - Within 10% → 0.4
+    - Off by more than 10% → 0.1
+    - Approved invoice without penalty → 0.0
+    """
+    milestones = milestones or []
+    feedback = {"task": "sla_enforcement"}
+    correct_penalty = scenario.penalty_amount or 0.0
+    correct_adjustment = scenario.correct_adjustment  # negative
+    # Outcome scoring (weight: 0.60)
+    outcome_score = 0.0
+    if submitted_amount is None or submitted_amount == 0:
+        # Check this first: a missing or zero adjustment approves the invoice as billed
+        outcome_score = 0.0
+        feedback["outcome"] = "FAIL — approved invoice without applying penalty"
+    elif correct_adjustment != 0:
+        error = abs(submitted_amount - correct_adjustment)
+        error_pct = error / abs(correct_adjustment)
+        if error < 0.02:
+            outcome_score = 1.0
+            feedback["outcome"] = "PERFECT — exact penalty amount"
         elif error_pct <= 0.05:
+            outcome_score = 0.7
+            feedback["outcome"] = f"CLOSE — within 5% (expected {correct_adjustment:.2f}, got {submitted_amount:.2f})"
         elif error_pct <= 0.10:
+            outcome_score = 0.4
+            feedback["outcome"] = f"PARTIAL — within 10% (expected {correct_adjustment:.2f}, got {submitted_amount:.2f})"
+        else:
+            outcome_score = 0.1
+            feedback["outcome"] = f"INCORRECT — expected {correct_adjustment:.2f}, got {submitted_amount:.2f}"
+    # Trajectory scoring (weight: 0.40)
+    trajectory_score = 0.0
+    trajectory_details = []
+    if "retrieved_shipping" in milestones:
+        trajectory_score += 0.30
+        trajectory_details.append("Retrieved shipping log ✓")
+    if "retrieved_sla" in milestones:
+        trajectory_score += 0.30
+        trajectory_details.append("Retrieved SLA contract ✓")
+    if "retrieved_po" in milestones:
+        trajectory_score += 0.15
+        trajectory_details.append("Retrieved PO ✓")
+    if "retrieved_invoice" in milestones:
+        trajectory_score += 0.15
+        trajectory_details.append("Retrieved Invoice ✓")
+    if "calculated_penalty" in milestones:
+        trajectory_score += 0.10
+        trajectory_details.append("Performed penalty calculation ✓")
+    trajectory_score = min(1.0, trajectory_score)
+    feedback["trajectory"] = trajectory_details
+    # Efficiency
+    max_steps = 15
+    efficiency_penalty = max(0, (steps_taken - max_steps) * 0.02)
+    alpha, beta = 0.60, 0.40
+    raw_score = alpha * outcome_score + beta * trajectory_score - efficiency_penalty
+    final_score = clamp_score(raw_score)
+    feedback["outcome_score"] = outcome_score
+    feedback["trajectory_score"] = trajectory_score
+    feedback["efficiency_penalty"] = efficiency_penalty
+    feedback["final_score"] = final_score
+    feedback["correct_adjustment"] = correct_adjustment
+    feedback["penalty_amount"] = correct_penalty
+    return final_score, feedback
+# ---------------------------------------------------------------------------
+# Task 3: Adversarial Auditing
+# ---------------------------------------------------------------------------
+def grade_task3(
+    scenario: Scenario,
+    submitted_amount: float,
+    rejected_settlement: bool = False,
+    cited_evidence: bool = False,
+    milestones: List[str] = None,
+    steps_taken: int = 0,
 ) -> Tuple[float, Dict[str, Any]]:
+    """Grade the adversarial auditing task.
+    Perfect score requires:
+    - Full penalty enforced (rejected vendor's bogus claim)
+    - Settlement offer rejected
+    - Warehouse logs cited as evidence
+    Deductions for:
+    - Not rejecting the settlement after negotiating (gullibility): 0.20 penalty
+    - Not checking warehouse logs: forfeits that trajectory milestone and the evidence bonus
     """
+    milestones = milestones or []
+    feedback = {"task": "adversarial_auditing"}
+    correct_adjustment = scenario.correct_adjustment
+    # Outcome scoring (weight: 0.60)
+    outcome_score = 0.0
+    if submitted_amount is not None and correct_adjustment != 0:
+        error = abs(submitted_amount - correct_adjustment)
+        error_pct = error / abs(correct_adjustment) if correct_adjustment != 0 else float('inf')
+        if error < 0.02:
+            outcome_score = 1.0
+            feedback["outcome"] = "PERFECT — full contractual penalty enforced"
+        elif error_pct <= 0.05:
+            outcome_score = 0.6
+            feedback["outcome"] = f"CLOSE — nearly correct (expected {correct_adjustment:.2f}, got {submitted_amount:.2f})"
         else:
+            outcome_score = 0.2
+            feedback["outcome"] = f"INCORRECT — expected {correct_adjustment:.2f}, got {submitted_amount:.2f}"
+    else:
+        outcome_score = 0.0
+        feedback["outcome"] = "FAIL — no financial decision submitted"
+    # Gullibility penalty: accepted adversarial settlement
+    gullibility_penalty = 0.0
+    if not rejected_settlement and "vendor_negotiation" in milestones:
+        gullibility_penalty = 0.20
+        feedback["gullibility"] = "PENALIZED — accepted vendor's settlement offer"
+    elif rejected_settlement:
+        feedback["gullibility"] = "GOOD — rejected vendor's settlement offer"
+    # Evidence citation bonus
+    evidence_bonus = 0.0
+    if cited_evidence and "checked_warehouse" in milestones:
+        evidence_bonus = 0.05
+        feedback["evidence"] = "GOOD — cited warehouse logs as evidence"
+    # Trajectory scoring (weight: 0.40)
+    trajectory_score = 0.0
+    trajectory_details = []
+    if "retrieved_shipping" in milestones:
+        trajectory_score += 0.20
+        trajectory_details.append("Retrieved shipping log ✓")
+    if "retrieved_sla" in milestones:
+        trajectory_score += 0.20
+        trajectory_details.append("Retrieved SLA contract ✓")
+    if "checked_warehouse" in milestones:
+        trajectory_score += 0.25
+        trajectory_details.append("Checked warehouse access logs ✓")
+    if "vendor_negotiation" in milestones:
+        trajectory_score += 0.15
+        trajectory_details.append("Engaged in vendor negotiation ✓")
+    if "retrieved_po" in milestones:
+        trajectory_score += 0.10
+        trajectory_details.append("Retrieved PO ✓")
+    if "retrieved_invoice" in milestones:
+        trajectory_score += 0.10
+        trajectory_details.append("Retrieved Invoice ✓")
+    trajectory_score = min(1.0, trajectory_score)
+    feedback["trajectory"] = trajectory_details
+    # Efficiency
+    max_steps = 20
+    efficiency_penalty = max(0, (steps_taken - max_steps) * 0.015)
+    alpha, beta = 0.60, 0.40
+    raw_score = (alpha * outcome_score + beta * trajectory_score
+                 + evidence_bonus - gullibility_penalty - efficiency_penalty)
+    final_score = clamp_score(raw_score)
+    feedback["outcome_score"] = outcome_score
+    feedback["trajectory_score"] = trajectory_score
+    feedback["gullibility_penalty"] = gullibility_penalty
+    feedback["evidence_bonus"] = evidence_bonus
+    feedback["efficiency_penalty"] = efficiency_penalty
+    feedback["final_score"] = final_score
+    feedback["correct_adjustment"] = correct_adjustment
+    return final_score, feedback
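
The three graders above share one composite form: a weighted outcome score plus a weighted trajectory score, adjusted by any bonus and penalties, then clamped. The sketch below works through that arithmetic for a few cases. The `composite` helper and its parameter names are hypothetical and not part of this commit; the weights, tier values, and penalties are copied from the diff.

```python
def clamp_score(score: float) -> float:
    """Same clamp as in graders.py: keep the score inside [0.01, 0.99]."""
    return round(max(0.01, min(0.99, score)), 4)


def composite(outcome: float, trajectory: float, alpha: float, beta: float,
              bonus: float = 0.0, penalty: float = 0.0) -> float:
    """Hypothetical helper: alpha * outcome + beta * trajectory + bonus - penalty, clamped."""
    return clamp_score(alpha * outcome + beta * trajectory + bonus - penalty)


# Task 1: correct line item but wrong amount (0.5), PO and invoice retrieved
# (0.4 + 0.4), 12 steps against a 10-step budget (2 extra steps * 0.02).
task1_score = composite(0.5, 0.8, alpha=0.70, beta=0.30,
                        penalty=max(0, (12 - 10) * 0.02))
# 0.35 + 0.24 - 0.04 = 0.55

# Task 3, best case: exact amount, every milestone hit, evidence cited.
# 0.60 + 0.40 + 0.05 = 1.05, which the clamp caps at 0.99.
task3_best = composite(1.0, 1.0, alpha=0.60, beta=0.40, bonus=0.05)

# Task 3, gullible agent: amount off by more than 5% (0.2), warehouse logs
# never checked (trajectory 0.75), settlement not rejected (0.20 penalty).
# 0.12 + 0.30 - 0.20 = 0.22
task3_gullible = composite(0.2, 0.75, alpha=0.60, beta=0.40, penalty=0.20)

print(task1_score, task3_best, task3_gullible)
```

Because of the clamp, even a flawless Task 3 run reports 0.99 rather than 1.05, so the evidence bonus only matters when some other component falls short.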

server/models.py CHANGED

@@ -1,8 +1,9 @@
 """
-Pydantic models for the Invoice Extraction Environment.
-Defines the Action and Observation types used for communication
-between the agent and the environment.
 """
 from typing import Any, Dict, List, Literal, Optional
@@ -10,84 +11,129 @@ from typing import Any, Dict, List, Literal, Optional
 from pydantic import BaseModel, ConfigDict, Field
-class InvoiceAction(BaseModel):
-    """Action sent by the agent to the environment.
-    Commands:
-        - 'view_document': View the current document text
-        - 'view_fields': View the required fields to extract
-        - 'extract': Submit extracted fields (payload = JSON string)
-        - 'get_feedback': Get feedback on the last extraction attempt
-        - 'query_related_documents': Retrieve cross-reference documents (PO, credit memos)
-        - 'verify_calculations': Submit arithmetic for verification (payload = JSON)
-        - 'check_discrepancies': Request environment to flag inconsistencies
     """
     model_config = ConfigDict(extra="forbid")
-    command: str = Field(
         ...,
         description=(
-            "Command to execute: 'view_document', 'view_fields', 'extract', "
-            "'get_feedback', 'query_related_documents', 'verify_calculations', "
-            "or 'check_discrepancies'"
         ),
     )
-    payload: str = Field(
-        default="",
-        description="JSON string payload (used with 'extract' and 'verify_calculations' commands)",
     )
-    metadata: Dict[str, Any] = Field(
-        default_factory=dict,
-        description="Additional metadata",
     )
-class InvoiceObservation(BaseModel):
-    """Observation returned by the environment after each step.
-    Contains the response text, task metadata, current score,
-    and episode control signals (done, reward).
     """
     model_config = ConfigDict(extra="forbid")
     done: bool = Field(default=False, description="Whether the episode has ended")
-    reward: Optional[float] = Field(default=None, description="Reward signal [0.0-1.0]")
-    text: str = Field(default="", description="Response text from the environment")
-    task_name: str = Field(default="", description="Current task name")
-    current_score: float = Field(default=0.0, description="Best score achieved so far")
-    attempts_remaining: int = Field(default=0, description="Remaining extraction attempts")
-    required_fields: List[str] = Field(default_factory=list, description="Fields to extract")
-    metadata: Dict[str, Any] = Field(default_factory=dict, description="Additional metadata")
     last_action_status: Literal["success", "error"] = Field(
         default="success",
-        description="Whether the last action was valid and executed successfully",
     )
     error_message: Optional[str] = Field(
         default=None,
-        description="Diagnostic error message if last_action_status is 'error'",
     )
     current_step: int = Field(
         default=0,
-        description="Current step number within the episode",
     )
     accumulated_reward: float = Field(
         default=0.0,
-        description="Total accumulated reward across all steps in this episode",
     )
-class InvoiceState(BaseModel):
-    """Internal environment state."""
     model_config = ConfigDict(extra="allow")
     episode_id: Optional[str] = Field(default=None, description="Current episode ID")
     step_count: int = Field(default=0, ge=0, description="Steps taken in current episode")
     task_name: str = Field(default="", description="Current task name")
-    document_id: str = Field(default="", description="Current document ID")
-    best_score: float = Field(default=0.0, description="Best extraction score so far")
-    attempts_used: int = Field(default=0, description="Extraction attempts used")
-    max_attempts: int = Field(default=3, description="Maximum extraction attempts")
-    accumulated_reward: float = Field(default=0.0, description="Total reward accumulated in episode")

 """
+Pydantic models for the Enterprise Supply Chain & Tax Reconciliation Environment.
+Defines the Action, Observation, and State types used for communication
+between the agent and the environment. Designed for type-safe interaction
+with an ERP-like tool suite.
 """
 from typing import Any, Dict, List, Literal, Optional
 from pydantic import BaseModel, ConfigDict, Field
+# ---------------------------------------------------------------------------
+# Action — what the agent sends to the environment
+# ---------------------------------------------------------------------------
+class ESCTRAction(BaseModel):
+    """Action sent by the agent to the ESCTR environment.
+    The agent operates as an autonomous financial controller using 4 tool verbs:
+      - 'query_database': Search procurement, accounts payable, shipping, or warehouse databases
+      - 'read_document': Retrieve a specific contract, SLA, PO, or invoice by document_id
+      - 'communicate_vendor': Send a negotiation message to the simulated vendor
+      - 'submit_financial_decision': Submit the final ledger adjustment (terminal action)
     """
     model_config = ConfigDict(extra="forbid")
+    action_type: Literal[
+        "query_database",
+        "read_document",
+        "communicate_vendor",
+        "submit_financial_decision",
+    ] = Field(
         ...,
         description=(
+            "The tool verb to execute. One of: 'query_database', 'read_document', "
+            "'communicate_vendor', or 'submit_financial_decision'."
         ),
     )
+    query_parameters: Optional[Dict[str, Any]] = Field(
+        default=None,
+        description=(
+            "Structured query for database lookups. Example: "
+            '{"table": "shipping_logs", "tracking_id": "TRK-9921"}'
+        ),
     )
+    document_id: Optional[str] = Field(
+        default=None,
+        description="Unique alphanumeric identifier of the document to read (e.g. 'PO-2024-0055').",
+    )
+    message_content: Optional[str] = Field(
+        default=None,
+        description="Natural language message for vendor negotiation (used with 'communicate_vendor').",
+    )
+    adjustment_amount: Optional[float] = Field(
+        default=None,
+        description=(
+            "The signed monetary adjustment to submit (used with 'submit_financial_decision'). "
+            "Should match the value calculated from contract terms; overcharges and penalties are negative."
+        ),
+    )
+    adjustment_reason: Optional[str] = Field(
+        default=None,
+        description="Brief explanation of the adjustment rationale (used with 'submit_financial_decision').",
     )
+# ---------------------------------------------------------------------------
+# Observation — what the environment returns after each step
+# ---------------------------------------------------------------------------
+class ESCTRObservation(BaseModel):
+    """Observation returned by the ESCTR environment after each step.
+    Provides structured telemetry to help the agent understand the
+    outcome of its action and plan the next move.
     """
     model_config = ConfigDict(extra="forbid")
     done: bool = Field(default=False, description="Whether the episode has ended")
+    reward: float = Field(default=0.0, description="Reward signal for this step (0.0-1.0)")
+    system_response: str = Field(
+        default="",
+        description="Output from the tool: database results, document text, vendor reply, or grader feedback.",
+    )
     last_action_status: Literal["success", "error"] = Field(
         default="success",
+        description="Whether the last action was valid and executed successfully.",
     )
     error_message: Optional[str] = Field(
         default=None,
+        description="Diagnostic error message if last_action_status is 'error'.",
     )
     current_step: int = Field(
         default=0,
+        description="Current step number within the episode (0 immediately after reset).",
+    )
+    max_steps: int = Field(
+        default=15,
+        description="Maximum steps allowed for this task.",
     )
     accumulated_reward: float = Field(
         default=0.0,
+        description="Total reward accumulated across all steps in this episode.",
     )
+    task_name: str = Field(default="", description="Current task name.")
+    available_tools: List[str] = Field(
+        default_factory=list,
+        description="List of tool verbs available in this task.",
+    )
+    metadata: Dict[str, Any] = Field(
+        default_factory=dict,
+        description="Additional structured metadata (scores, milestones, etc.).",
+    )
+# ---------------------------------------------------------------------------
+# State — internal environment state (exposed via GET /state)
+# ---------------------------------------------------------------------------
+class ESCTRState(BaseModel):
+    """Internal environment state for the ESCTR environment."""
     model_config = ConfigDict(extra="allow")
     episode_id: Optional[str] = Field(default=None, description="Current episode ID")
     step_count: int = Field(default=0, ge=0, description="Steps taken in current episode")
     task_name: str = Field(default="", description="Current task name")
+    seed: int = Field(default=0, description="Seed used for procedural generation")
+    accumulated_reward: float = Field(default=0.0, description="Total reward accumulated")
+    outcome_submitted: bool = Field(default=False, description="Whether final decision was submitted")
+    milestones_hit: List[str] = Field(
+        default_factory=list,
+        description="Trajectory milestones achieved (e.g. 'retrieved_po', 'retrieved_sla').",
+    )
+    best_score: float = Field(default=0.0, description="Best score achieved")
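
To make the action schema above concrete, the sketch below builds a few payloads shaped like `ESCTRAction` using only the standard library. The `build_action` helper is hypothetical and only mimics the `Literal` check on `action_type` and the `extra="forbid"` setting. The document and tracking IDs come from the field descriptions in the diff. The adjustment amount and reason are made-up illustrative values. How the server actually receives these payloads is not shown in this part of the commit.

```python
import json

# Tool verbs and optional fields as declared on ESCTRAction in the diff.
TOOL_VERBS = (
    "query_database",
    "read_document",
    "communicate_vendor",
    "submit_financial_decision",
)
OPTIONAL_FIELDS = (
    "query_parameters",
    "document_id",
    "message_content",
    "adjustment_amount",
    "adjustment_reason",
)


def build_action(action_type: str, **fields) -> dict:
    """Hypothetical helper: reject unknown verbs and unknown fields."""
    if action_type not in TOOL_VERBS:
        raise ValueError(f"unknown action_type: {action_type!r}")
    unknown = set(fields) - set(OPTIONAL_FIELDS)
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    return {"action_type": action_type, **fields}


lookup = build_action(
    "query_database",
    query_parameters={"table": "shipping_logs", "tracking_id": "TRK-9921"},
)
read_po = build_action("read_document", document_id="PO-2024-0055")
decision = build_action(
    "submit_financial_decision",
    adjustment_amount=-150.0,  # illustrative value, negative for a penalty
    adjustment_reason="Late delivery penalty per SLA",  # illustrative text
)
payload = json.dumps(decision)
print(payload)
```

A verb from the old schema, such as `extract`, would raise `ValueError` here, much as the `Literal` annotation would reject it during Pydantic validation.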

server/procedural.py CHANGED

@@ -1,426 +1,580 @@
 """
-Procedural Document Generation Engine.
-Generates infinite invoice variations using seed-based randomization.
-Addresses the "data sparsity" critique by providing virtually unlimited
-training configurations while maintaining deterministic reproducibility.
 """
 import random
-import string
-from typing import Any, Dict, List, Tuple
 # ---------------------------------------------------------------------------
-# Data pools for procedural generation
 # ---------------------------------------------------------------------------
-VENDOR_POOL = [
-    ("Acme Corporation", "123 Business Avenue", "New York", "NY", "10001"),
-    ("TechStart Solutions LLC", "890 Innovation Drive, Suite 200", "San Francisco", "CA", "94105"),
-    ("Global Supplies Inc.", "2500 Industrial Parkway", "Detroit", "MI", "48201"),
-    ("Pinnacle Systems Ltd.", "77 Summit Road", "Boston", "MA", "02101"),
-    ("Nexus Digital Services", "400 Cloud Way", "Seattle", "WA", "98101"),
-    ("Ironclad Manufacturing Co.", "1200 Forge Lane", "Pittsburgh", "PA", "15201"),
-    ("Brightwave Analytics", "55 Data Drive", "Austin", "TX", "78701"),
-    ("SilverLine Logistics", "909 Transport Blvd", "Memphis", "TN", "38101"),
-    ("Quantum Computing Corp.", "1 Qubit Plaza", "Boulder", "CO", "80301"),
-    ("Evergreen Office Supplies", "330 Elm Street", "Portland", "OR", "97201"),
-    ("Atlas Engineering Group", "620 Blueprint Ave", "Houston", "TX", "77001"),
-    ("Cobalt Healthcare Solutions", "88 Wellness Pkwy", "Minneapolis", "MN", "55401"),
-    ("Meridian Consulting Partners", "250 Strategy Lane", "Chicago", "IL", "60601"),
-    ("Vanguard Robotics Inc.", "15 Automation Circle", "San Jose", "CA", "95101"),
-    ("Horizon Energy Systems", "700 Solar Way", "Denver", "CO", "80201"),
 ]
-CUSTOMER_POOL = [
-    ("Widget Co.", "456 Commerce Street", "Chicago", "IL", "60601"),
-    ("DataFlow Inc.", "321 Analytics Blvd", "Austin", "TX", "78701"),
-    ("Riverside Manufacturing", "780 Factory Road", "Cleveland", "OH", "44101"),
-    ("Summit Enterprises", "100 Peak Drive", "Denver", "CO", "80201"),
-    ("Cascade Solutions Group", "55 River Bend Rd", "Portland", "OR", "97201"),
-    ("Sterling Financial Corp.", "800 Wall St", "New York", "NY", "10005"),
-    ("Bluestone Retail Inc.", "120 Market Square", "Philadelphia", "PA", "19101"),
-    ("Northstar Logistics", "450 Freight Way", "Minneapolis", "MN", "55401"),
-    ("Pacific Tech Ventures", "700 Bay Ave", "San Diego", "CA", "92101"),
-    ("Redwood Construction LLC", "35 Builder Lane", "Sacramento", "CA", "95801"),
-    ("Falcon Aerospace", "1 Launchpad Dr", "Huntsville", "AL", "35801"),
-    ("Cedar Health Systems", "200 Wellness Blvd", "Nashville", "TN", "37201"),
-    ("Granite Insurance Group", "90 Coverage Ct", "Hartford", "CT", "06101"),
-    ("Oakmont Education Trust", "60 Campus Way", "Ann Arbor", "MI", "48101"),
-    ("Sapphire Media Holdings", "500 Broadcast Pl", "Los Angeles", "CA", "90001"),
 ]
 PRODUCT_CATALOG = [
-    # (description, min_price, max_price, unit)
-    ("Widget Type A", 15.00, 50.00, "unit"),
-    ("Widget Type B", 25.00, 80.00, "unit"),
-    ("Consulting Service", 50.00, 200.00, "hour"),
-    ("Cloud Hosting (Monthly)", 200.00, 800.00, "month"),
-    ("API Integration Setup", 500.00, 3000.00, "unit"),
-    ("Technical Support", 60.00, 150.00, "hour"),
-    ("Steel Bolts (Box/100)", 8.00, 20.00, "box"),
-    ("Copper Wire (500ft Roll)", 50.00, 120.00, "roll"),
-    ("Safety Goggles (Pack/10)", 20.00, 60.00, "pack"),
-    ("Welding Rods (Bundle)", 15.00, 40.00, "bundle"),
-    ("Software License (Annual)", 100.00, 2000.00, "license"),
-    ("Office Furniture Set", 200.00, 800.00, "set"),
-    ("Network Switch (24-port)", 150.00, 500.00, "unit"),
-    ("Printer Ink Cartridge", 20.00, 80.00, "unit"),
-    ("Industrial Adhesive (Gallon)", 25.00, 75.00, "gallon"),
-    ("LED Panel Light", 30.00, 100.00, "unit"),
-    ("HVAC Filter (Pack/4)", 15.00, 45.00, "pack"),
-    ("Hydraulic Pump Assembly", 300.00, 1200.00, "unit"),
-    ("Precision Bearing Set", 40.00, 150.00, "set"),
-    ("Thermal Insulation Roll", 60.00, 200.00, "roll"),
-    ("Data Backup Service", 75.00, 300.00, "month"),
-    ("Security Audit", 500.00, 2500.00, "audit"),
-    ("Custom Report Development", 200.00, 1000.00, "report"),
-    ("Training Workshop", 150.00, 500.00, "session"),
-    ("Prototype Fabrication", 1000.00, 5000.00, "unit"),
 ]
-TAX_RATES = [0.05, 0.06, 0.065, 0.07, 0.075, 0.08, 0.085, 0.09, 0.095, 0.10]
-OCR_SUBSTITUTIONS = {
-    "O": "0", "0": "O", "l": "1", "1": "l", "I": "l",
-    "S": "5", "5": "S", "B": "8", "8": "B", "m": "rn",
-    "a": "o", "e": "c", "n": "ri",
-}
-MONTHS = [
-    "January", "February", "March", "April", "May", "June",
-    "July", "August", "September", "October", "November", "December",
 ]
 class ProceduralEngine:
-    """Seed-based procedural document generator."""
     def __init__(self, seed: int = 0):
         self.rng = random.Random(seed)
     def _pick(self, pool: list) -> Any:
         return self.rng.choice(pool)
-    def _gen_invoice_number(self, prefix: str = "") -> str:
-        if not prefix:
-            prefix = self.rng.choice(["INV", "TS", "GS", "NX", "PC", "BW", "SL", "QC"])
-        year = self.rng.choice([2023, 2024, 2025])
-        num = self.rng.randint(1, 9999)
-        fmt = self.rng.choice([
-            f"{prefix}-{year}-{num:04d}",
-            f"{prefix}{num:04d}",
-            f"{prefix}-{num:04d}-{self.rng.choice(['A','B','R1','R2'])}",
-        ])
-        return fmt
-    def _gen_date(self) -> Tuple[str, str]:
-        """Returns (display_date, normalized YYYY-MM-DD)."""
-        year = self.rng.choice([2023, 2024, 2025])
-        month = self.rng.randint(1, 12)
         day = self.rng.randint(1, 28)
-        norm = f"{year}-{month:02d}-{day:02d}"
-        fmt_choice = self.rng.randint(0, 3)
-        if fmt_choice == 0:
-            display = f"{MONTHS[month-1]} {day}, {year}"
-        elif fmt_choice == 1:
-            display = f"{month:02d}/{day:02d}/{year}"
-        elif fmt_choice == 2:
-            display = f"{day}-{MONTHS[month-1][:3]}-{year}"
-        else:
-            display = norm
-        return display, norm
-    def _gen_line_items(self, count: int = 0) -> List[Dict[str, Any]]:
-        if count == 0:
-            count = self.rng.randint(2, 6)
-        products = self.rng.sample(PRODUCT_CATALOG, min(count, len(PRODUCT_CATALOG)))
-        items = []
-        for desc, min_p, max_p, _unit in products:
-            qty = self.rng.randint(1, 50)
-            price = round(self.rng.uniform(min_p, max_p), 2)
-            amount = round(qty * price, 2)
-            items.append({
-                "description": desc,
-                "quantity": qty,
-                "unit_price": price,
-                "amount": amount,
-            })
-        return items
-    def generate_simple(self) -> Dict[str, Any]:
-        vendor = self._pick(VENDOR_POOL)
-        customer = self._pick(CUSTOMER_POOL)
-        inv_num = self._gen_invoice_number()
-        display_date, norm_date = self._gen_date()
-        items = self._gen_line_items()
-        subtotal = round(sum(i["amount"] for i in items), 2)
-        tax_rate = self._pick(TAX_RATES)
-        tax = round(subtotal * tax_rate, 2)
-        total = round(subtotal + tax, 2)
-        tax_pct = int(tax_rate * 100) if tax_rate * 100 == int(tax_rate * 100) else tax_rate * 100
-        items_text = ""
-        for it in items:
-            items_text += f"{it['description']:<30s} {it['quantity']:>5d}    ${it['unit_price']:>10.2f}   ${it['amount']:>10.2f}\n"
-        text = f"""INVOICE
-Invoice Number: {inv_num}
-Date: {display_date}
-From:
-  {vendor[0]}
-  {vendor[1]}
-  {vendor[2]}, {vendor[3]} {vendor[4]}
-Bill To:
-  {customer[0]}
-  {customer[1]}
-  {customer[2]}, {customer[3]} {customer[4]}
-Description                    Qty    Unit Price      Amount
----------------------------------------------------------------
-{items_text}
-                                      Subtotal:  ${subtotal:,.2f}
-                                      Tax ({tax_pct}%):   ${tax:,.2f}
-                                      Total:     ${total:,.2f}
-Payment Terms: Net {self.rng.choice([15, 30, 45, 60])}
-"""
-        ground_truth = {
-            "invoice_number": inv_num,
-            "date": norm_date,
-            "vendor_name": vendor[0],
-            "customer_name": customer[0],
-            "subtotal": subtotal,
-            "tax": tax,
-            "total": total,
-            "line_items": items,
-        }
-        return {"id": f"proc_simple_{self.rng.randint(1000,9999)}", "text": text, "ground_truth": ground_truth}
-    def generate_messy(self) -> Dict[str, Any]:
-        base = self.generate_simple()
-        gt = base["ground_truth"]
-        vendor = gt["vendor_name"]
-        customer = gt["customer_name"]
-        items = gt["line_items"]
-        abbrevs = {"Subtotal": self._pick(["subtot", "s/t", "sub"]),
-                    "Tax": self._pick(["tx", "tax", "vat"]),
-                    "Total": self._pick(["TOTAL DUE", "amt due", "grand total", "balance"])}
-        items_text = ""
-        for it in items:
-            desc_short = it["description"].split("(")[0].strip().lower()
-            qty = it["quantity"]
-            price = it["unit_price"]
-            amt = it["amount"]
-            fmt = self.rng.choice([
-                f"{qty}x {desc_short} @ {price:.0f}           {amt:.0f}",
-                f"{desc_short} -- {qty} @ {price:.2f} ea ........... {amt:.0f}",
-                f"{desc_short}...${amt:.0f}",
-            ])
-            items_text += fmt + "\n"
-        text = f"""{vendor.lower()}
-{self._pick(VENDOR_POOL)[2].lower()}, {self._pick(VENDOR_POOL)[3]}
-inv# {gt['invoice_number']}
-dt: {gt['date']}
-cust: {customer.split('.')[0].split(',')[0].lower()}
--- charges --
-{items_text}
-{abbrevs['Subtotal']}: ${gt['subtotal']:.0f}
-{abbrevs['Tax']}: {gt['tax']:.2f}
-========
-{abbrevs['Total']} ${gt['total']:,.2f}
-pay within 30 days
-"""
-        return {"id": f"proc_messy_{self.rng.randint(1000,9999)}", "text": text, "ground_truth": gt}
-    def _apply_ocr_corruption(self, text: str, intensity: float = 0.15) -> str:
-        result = list(text)
-        for i, ch in enumerate(result):
-            if ch in OCR_SUBSTITUTIONS and self.rng.random() < intensity:
-                result[i] = OCR_SUBSTITUTIONS[ch]
-        return "".join(result)
-    def generate_corrupted(self) -> Dict[str, Any]:
-        base = self.generate_simple()
-        corrupted_text = self._apply_ocr_corruption(base["text"], 0.18)
-        header = self._pick([
-            "SC4NNED D0CUMENT - Page 1 of 1\n\n",
-            "[SCAN QUALITY: P00R - SOME CHARACTERS MAY BE lNCORRECT]\n\n",
-            "---FAXED DOCUMENT---\nQUALITY: [####===---] 40%\n\n",
-        ])
-        footer = self._pick([
-            "\n\n--- END 0F SCAN ---",
-            "\n\n[PAGE 1/1 - SCAN C0MPLETE]",
-            "\n\n---END FAX---",
-        ])
-        return {
-            "id": f"proc_corrupt_{self.rng.randint(1000,9999)}",
-            "text": header + corrupted_text + footer,
-            "ground_truth": base["ground_truth"],
-        }
-    def generate_multi_document(self) -> Dict[str, Any]:
-        base = self.generate_simple()
-        gt = base["ground_truth"]
-        po_num = f"PO-{self.rng.choice(['A','B','C','D'])}-{self.rng.randint(2024,2025)}-{self.rng.randint(100,999)}"
-        po_date_display, _po_norm = self._gen_date()
-        items_po = ""
-        for it in gt["line_items"]:
-            items_po += f"- {it['quantity']}x {it['description']} @ ${it['unit_price']:.2f} = ${it['amount']:.2f}\n"
-        credit_amount = round(self._pick(gt["line_items"])["unit_price"] * self.rng.randint(1, 3), 2)
-        credit_tax = round(credit_amount * 0.07, 2)
-        credit_total = round(credit_amount + credit_tax, 2)
-        adjusted_total = round(gt["total"] - credit_total, 2)
-        reason = self._pick([
-            "Defective items returned",
-            "Partial delivery — remaining items backordered",
-            "Pricing error on original invoice",
-            "Duplicate charge for services",
-        ])
-        text = f"""=== PURCHASE ORDER ===
-PO Number: {po_num}
-Date: {po_date_display}
-Vendor: {gt['vendor_name']}
-Buyer: {gt['customer_name']}
-Ordered Items:
-{items_po}
-PO Total: ${gt['subtotal']:,.2f} (before tax)
-=== INVOICE ===
-{base['text']}
-Reference PO: {po_num}
-=== CREDIT MEMO ===
-Credit Memo #: CM-{self.rng.randint(2024,2025)}-{self.rng.randint(100,999)}
-Reference Invoice: {gt['invoice_number']}
-Reason: {reason}
-Credit Amount: ${credit_amount:.2f}
-Tax Adjustment: ${credit_tax:.2f}
-Total Credit: -${credit_total:.2f}
-=== SUMMARY ===
-Original Invoice: ${gt['total']:,.2f}
-Credit Applied: -${credit_total:.2f}
-Adjusted Balance Due: ${adjusted_total:,.2f}
-"""
-        gt_multi = dict(gt)
-        gt_multi["po_number"] = po_num
-        gt_multi["adjustment_reason"] = reason
-        gt_multi["adjusted_total"] = adjusted_total
-        return {"id": f"proc_multi_{self.rng.randint(1000,9999)}", "text": text, "ground_truth": gt_multi}
-    def generate_adversarial(self) -> Dict[str, Any]:
-        base = self.generate_simple()
-        gt = base["ground_truth"]
-        original_subtotal = gt["subtotal"]
-        discount_pct = self._pick([0.05, 0.08, 0.10, 0.12, 0.15])
-        discount_amount = round(original_subtotal * discount_pct, 2)
-        adjusted_subtotal = round(original_subtotal - discount_amount, 2)
-        tax_rate = self._pick(TAX_RATES)
-        new_tax = round(adjusted_subtotal * tax_rate, 2)
-        new_total = round(adjusted_subtotal + new_tax, 2)
-        old_tax = round(original_subtotal * tax_rate, 2)
-        original_total = round(original_subtotal + old_tax, 2)
-        draft_inv = f"DRAFT-INV-{self.rng.randint(100,999)}"
-        real_inv = gt["invoice_number"] + self._pick(["-R2", "-FINAL", "-REV1"])
-        po_num = f"PO-{self.rng.randint(2024,2025)}-{self.rng.randint(100,999)}"
-        _, reissue_date = self._gen_date()
-        tax_pct = int(tax_rate * 100) if tax_rate * 100 == int(tax_rate * 100) else round(tax_rate * 100, 1)
-        items_text = ""
-        for it in gt["line_items"]:
-            items_text += f"{it['description']:<30s} {it['quantity']:>5d}    ${it['unit_price']:>10.2f}   ${it['amount']:>10.2f}\n"
-        discount_pct_display = int(discount_pct * 100) if discount_pct * 100 == int(discount_pct * 100) else round(discount_pct * 100, 1)
-        text = f"""INVOICE
-*** IMPORTANT: This replaces previous invoice {draft_inv} which was voided ***
-Invoice Number: {real_inv}
-Previous Reference: {draft_inv} (VOIDED — DO NOT USE)
-Date: {gt['date']}
-Reissue Date: {reissue_date}
-PO Reference: {po_num}
-From:
-  {gt['vendor_name']}
-Bill To:
-  {gt['customer_name']}
-Description                    Qty    Unit Price      Amount
----------------------------------------------------------------
-{items_text}  ** EARLY PAYMENT DISCOUNT: -{discount_pct_display}% applied **
-                                      Subtotal:    ${original_subtotal:,.2f}
-                                 Discount ({discount_pct_display}%):  -${discount_amount:,.2f}
-                            Adjusted Subtotal:   ${adjusted_subtotal:,.2f}
-                                      Tax ({tax_pct}%):    ${new_tax:,.2f}
-                                      Total:       ${new_total:,.2f}
-NOTE: Original quote was ${original_total:,.2f} but discount applied.
-!!! BUDGET VARIANCE ALERT !!!
-PO Authorized: ${original_subtotal:,.2f}
-Actual (pre-tax): ${adjusted_subtotal:,.2f}
-Variance: -${discount_amount:,.2f} UNDER BUDGET
-Payment Terms: Net 10 (discounted) / Net 30 (full price ${original_total:,.2f})
-"""
-        discrepancy = (
-            f"{discount_pct_display}% early payment discount applied. "
-            f"Reissued invoice replaces voided {draft_inv}. "
-            f"Adjusted subtotal ${adjusted_subtotal:,.2f} vs original ${original_subtotal:,.2f}."
         )
-        gt_adv = {
-            "invoice_number": real_inv,
-            "date": reissue_date,
-            "vendor_name": gt["vendor_name"],
-            "customer_name": gt["customer_name"],
-            "subtotal": adjusted_subtotal,
-            "tax": new_tax,
-            "total": new_total,
-            "line_items": gt["line_items"],
-            "po_number": po_num,
-            "discount_amount": discount_amount,
-            "original_total": original_total,
-            "discrepancy_notes": discrepancy,
-        }
-        return {"id": f"proc_adv_{self.rng.randint(1000,9999)}", "text": text, "ground_truth": gt_adv}
 # ---------------------------------------------------------------------------
 # Public API
 # ---------------------------------------------------------------------------
-GENERATORS = {
-    "simple_invoice": "generate_simple",
-    "messy_invoice": "generate_messy",
-    "multi_document": "generate_multi_document",
-    "corrupted_scan": "generate_corrupted",
-    "adversarial_invoice": "generate_adversarial",
 }
-def generate_document(task_name: str, seed: int = 0) -> Dict[str, Any]:
-    """Generate a procedural document for the given task and seed."""
     engine = ProceduralEngine(seed)
-    method = GENERATORS.get(task_name, "generate_simple")
     return getattr(engine, method)()

 """
+Procedural Generation Engine for the ESCTR Environment.
+Generates deterministic corporate supply chain scenarios from any seed:
+- Company profiles (vendors, buyers)
+- Product catalogs with contracted pricing
+- Purchase Orders
+- Vendor Invoices (with seeded discrepancies)
+- Service Level Agreements (penalty clauses)
+- Shipping / logistics telemetry
+- Warehouse access logs
+- Vendor negotiation responses
+Design principle: same seed → identical scenario → deterministic grading.
 """
 import random
+import hashlib
+from dataclasses import dataclass, field, asdict
+from typing import Any, Dict, List, Optional, Tuple
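The module docstring's design principle (same seed, identical scenario, deterministic grading) rests on drawing every random value from a private `random.Random(seed)` instance rather than the global RNG. The sketch below illustrates that property in isolation; `sample_scenario_fields` and its field names are hypothetical and are not part of this commit.

```python
import random


def sample_scenario_fields(seed: int) -> dict:
    """Hypothetical helper: draw a few scenario-like fields from a seeded RNG."""
    rng = random.Random(seed)  # private RNG, independent of global random state
    return {
        "po_number": f"PO-{rng.randint(2024, 2025)}-{rng.randint(1000, 9999)}",
        "quantity": rng.randint(5, 100),
        "unit_price": round(rng.uniform(10.00, 25.00), 2),
    }


first = sample_scenario_fields(42)
random.random()  # advancing the global RNG must not affect the seeded stream
second = sample_scenario_fields(42)
assert first == second  # same seed, same draw order, same fields
```

The guarantee only holds while the order of RNG calls stays fixed, so adding or reordering a draw inside a generator changes every scenario produced from existing seeds.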
 # ---------------------------------------------------------------------------
+# Data pools
 # ---------------------------------------------------------------------------
+VENDOR_NAMES = [
+    "Apex Industrial Supply Co.", "Meridian Components LLC", "Vanguard Materials Group",
+    "Sterling Precision Parts", "Ironclad Manufacturing Corp.", "Cobalt Logistics Inc.",
+    "Pinnacle Hardware Solutions", "Atlas Engineering Supply", "Nexus Digital Components",
+    "Brightwave Technical Services", "SilverLine Distribution", "Quantum Parts International",
+    "Evergreen Industrial Ltd.", "Horizon Supply Chain Corp.", "Titan Fabrication Works",
+]
+BUYER_NAMES = [
+    "Cascade Electronics Inc.", "Redwood Construction Group", "Summit Aerospace Ltd.",
+    "Pacific Manufacturing Co.", "Northstar Automotive", "Falcon Defense Systems",
+    "Bluestone Energy Corp.", "Cedar Health Technologies", "Granite Infrastructure LLC",
+    "Oakmont Robotics Inc.", "Sapphire Semiconductor", "Emerald Biotech Group",
+    "Diamond Precision Engineering", "Ruby Telecommunications", "Topaz Data Systems",
 ]
+CITIES = [
+    ("New York", "NY"), ("Chicago", "IL"), ("Houston", "TX"), ("San Francisco", "CA"),
+    ("Detroit", "MI"), ("Seattle", "WA"), ("Boston", "MA"), ("Denver", "CO"),
+    ("Austin", "TX"), ("Portland", "OR"), ("Minneapolis", "MN"), ("Cleveland", "OH"),
+    ("Pittsburgh", "PA"), ("Nashville", "TN"), ("San Diego", "CA"),
 ]
 PRODUCT_CATALOG = [
+    # (name, category, min_price, max_price)
+    ("Stainless Steel Bolts M10 (Box/100)", "hardware", 10.00, 25.00),
+    ("Copper Wire 500ft Roll AWG-12", "electrical", 65.00, 120.00),
+    ("Industrial Safety Goggles (Pack/10)", "safety", 25.00, 55.00),
+    ("Welding Rod E6013 (Bundle/50)", "consumables", 18.00, 42.00),
+    ("Hydraulic Cylinder Assembly HCA-200", "machinery", 280.00, 550.00),
+    ("Precision Bearing Set 6205-2RS", "components", 35.00, 90.00),
+    ("HVAC Filter MERV-13 (Pack/4)", "facilities", 22.00, 48.00),
+    ("LED Panel Light 600x600mm", "electrical", 35.00, 85.00),
+    ("Thermal Insulation Roll R-30", "construction", 55.00, 140.00),
+    ("Network Switch 24-Port Managed", "IT", 180.00, 420.00),
+    ("Server Rack Mount Kit 42U", "IT", 350.00, 800.00),
+    ("Pneumatic Valve Assembly PVA-100", "machinery", 120.00, 280.00),
+    ("Carbon Steel Pipe Schedule 40 (10ft)", "construction", 45.00, 110.00),
+    ("Circuit Breaker Panel 200A", "electrical", 150.00, 380.00),
+    ("Laser Calibration Module LCM-5", "precision", 400.00, 950.00),
+    ("Industrial Adhesive Epoxy (Gallon)", "consumables", 28.00, 72.00),
+    ("Fiber Optic Cable OM3 (1000ft)", "IT", 200.00, 480.00),
+    ("Pressure Gauge 0-300 PSI", "instruments", 40.00, 95.00),
+    ("Anti-Vibration Mount Set (Pack/8)", "machinery", 60.00, 150.00),
+    ("Clean Room Wipes (Case/5000)", "consumables", 80.00, 190.00),
 ]
+SLA_PENALTY_STRUCTURES = [
+    {"type": "linear", "rate_per_day": 0.02, "cap": 0.10, "grace_days": 0},
+    {"type": "linear", "rate_per_day": 0.015, "cap": 0.15, "grace_days": 1},
+    {"type": "linear", "rate_per_day": 0.03, "cap": 0.12, "grace_days": 0},
+    {"type": "tiered", "tiers": [(3, 0.02), (7, 0.03), (999, 0.05)], "cap": 0.20, "grace_days": 0},
+    {"type": "linear", "rate_per_day": 0.025, "cap": 0.10, "grace_days": 2},
+]
+VENDOR_EXCUSES = [
+    "Our records indicate the receiving warehouse rejected the initial delivery attempt due to dock unavailability.",
+    "The delay was caused by a force majeure weather event that affected our shipping lane.",
+    "We believe the shipment arrived on time but was misrouted by your internal receiving department.",
+    "Our carrier has confirmed timely delivery; any apparent delay is a systems error on your end.",
+    "The contract clearly states penalties apply only to manufacturing delays, not logistics delays.",
+]
+SETTLEMENT_OFFERS = [
+    "We are prepared to offer a goodwill credit of {pct}% of the penalty amount to resolve this matter.",
+    "In the interest of maintaining our business relationship, we propose settling at {pct}% of the claimed penalty.",
+    "Our legal team has reviewed the claim. We can offer {pct}% as a final settlement.",
 ]
+# ---------------------------------------------------------------------------
+# Data classes for generated scenarios
+# ---------------------------------------------------------------------------
+@dataclass
+class Company:
+    name: str
+    address: str
+    city: str
+    state: str
+    zip_code: str
+    tax_id: str
+@dataclass
+class LineItem:
+    item_id: str
+    description: str
+    category: str
+    quantity: int
+    contracted_unit_price: float
+    invoiced_unit_price: float
+    contracted_total: float
+    invoiced_total: float
+    has_discrepancy: bool = False
+@dataclass
+class PurchaseOrder:
+    po_number: str
+    date: str
+    vendor: Company
+    buyer: Company
+    line_items: List[LineItem]
+    total_amount: float
+    approved_budget: float
+@dataclass
+class Invoice:
+    invoice_number: str
+    date: str
+    po_reference: str
+    vendor: Company
+    buyer: Company
+    line_items: List[LineItem]
+    subtotal: float
+    tax_rate: float
+    tax_amount: float
+    total: float
+@dataclass
+class SLAContract:
+    contract_id: str
+    vendor: str
+    buyer: str
+    effective_date: str
+    penalty_structure: Dict[str, Any]
+    delivery_terms: str
+@dataclass
+class ShippingLog:
+    tracking_id: str
+    po_reference: str
+    carrier: str
+    ship_date: str
+    expected_delivery: str
+    actual_delivery: str
+    delay_days: int
+    status: str
+@dataclass
+class WarehouseLog:
+    date: str
+    dock_id: str
+    status: str  # "open", "closed", "maintenance"
+    staff_on_duty: int
+    shipments_received: int
+    notes: str
+@dataclass
+class Scenario:
+    """Complete scenario for one ESCTR episode."""
+    task_name: str
+    seed: int
+    vendor: Company
+    buyer: Company
+    purchase_order: PurchaseOrder
+    invoice: Invoice
+    sla_contract: Optional[SLAContract] = None
+    shipping_log: Optional[ShippingLog] = None
+    warehouse_logs: Optional[List[WarehouseLog]] = None
+    # Ground truth for grading
+    correct_adjustment: float = 0.0
+    discrepant_line_item_id: Optional[str] = None
+    correct_line_item_price: Optional[float] = None
+    penalty_amount: Optional[float] = None
+    vendor_claim_valid: Optional[bool] = None
+# ---------------------------------------------------------------------------
+# Procedural Engine
+# ---------------------------------------------------------------------------
 class ProceduralEngine:
+    """Seed-deterministic corporate scenario generator."""
     def __init__(self, seed: int = 0):
         self.rng = random.Random(seed)
+        self.seed = seed
     def _pick(self, pool: list) -> Any:
         return self.rng.choice(pool)
+    def _gen_company(self, names: list) -> Company:
+        name = self._pick(names)
+        city, state = self._pick(CITIES)
+        return Company(
+            name=name,
+            address=f"{self.rng.randint(100, 9999)} {self._pick(['Industrial', 'Commerce', 'Innovation', 'Enterprise', 'Technology'])} {self._pick(['Drive', 'Avenue', 'Parkway', 'Boulevard', 'Street'])}",
+            city=city,
+            state=state,
+            zip_code=f"{self.rng.randint(10000, 99999)}",
+            tax_id=f"{self.rng.randint(10, 99)}-{self.rng.randint(1000000, 9999999)}",
+        )
+    def _gen_date(self, year: int = 2024, month_range: Tuple[int, int] = (1, 12)) -> str:
+        month = self.rng.randint(*month_range)
         day = self.rng.randint(1, 28)
+        return f"{year}-{month:02d}-{day:02d}"
+    def _gen_id(self, prefix: str) -> str:
+        return f"{prefix}-{self.rng.randint(2024, 2025)}-{self.rng.randint(1000, 9999)}"
+    def _gen_tracking_id(self) -> str:
+        return f"TRK-{self.rng.randint(10000, 99999)}"
+    # ------------------------------------------------------------------
+    # Task 1: Easy — Procurement Reconciliation
+    # ------------------------------------------------------------------
+    def generate_task1(self) -> Scenario:
+        """Generate a simple PO vs Invoice overcharge scenario."""
+        vendor = self._gen_company(VENDOR_NAMES)
+        buyer = self._gen_company(BUYER_NAMES)
+        po_date = self._gen_date(month_range=(1, 6))
+        inv_date = self._gen_date(month_range=(2, 7))
+        # Generate 3-5 line items
+        num_items = self.rng.randint(3, 5)
+        products = self.rng.sample(PRODUCT_CATALOG, num_items)
+        discrepant_idx = self.rng.randint(0, num_items - 1)
+        line_items = []
+        for i, (name, cat, min_p, max_p) in enumerate(products):
+            qty = self.rng.randint(5, 100)
+            contracted_price = round(self.rng.uniform(min_p, max_p), 2)
+            if i == discrepant_idx:
+                # Overcharge: invoice price higher than contracted
+                markup = round(self.rng.uniform(2.00, 15.00), 2)
+                invoiced_price = round(contracted_price + markup, 2)
+                has_discrepancy = True
+            else:
+                invoiced_price = contracted_price
+                has_discrepancy = False
+            item_id = f"LI-{self.rng.randint(1000, 9999)}"
+            line_items.append(LineItem(
+                item_id=item_id,
+                description=name,
+                category=cat,
+                quantity=qty,
+                contracted_unit_price=contracted_price,
+                invoiced_unit_price=invoiced_price,
+                contracted_total=round(qty * contracted_price, 2),
+                invoiced_total=round(qty * invoiced_price, 2),
+                has_discrepancy=has_discrepancy,
+            ))
+        po_total = round(sum(li.contracted_total for li in line_items), 2)
+        inv_subtotal = round(sum(li.invoiced_total for li in line_items), 2)
+        tax_rate = self._pick([0.05, 0.06, 0.07, 0.08, 0.09, 0.10])
+        tax_amount = round(inv_subtotal * tax_rate, 2)
+        inv_total = round(inv_subtotal + tax_amount, 2)
+        po_number = self._gen_id("PO")
+        inv_number = self._gen_id("INV")
+        po = PurchaseOrder(
+            po_number=po_number, date=po_date, vendor=vendor, buyer=buyer,
+            line_items=line_items, total_amount=po_total, approved_budget=round(po_total * 1.05, 2),
+        )
+        invoice = Invoice(
+            invoice_number=inv_number, date=inv_date, po_reference=po_number,
+            vendor=vendor, buyer=buyer, line_items=line_items,
+            subtotal=inv_subtotal, tax_rate=tax_rate, tax_amount=tax_amount, total=inv_total,
+        )
+        discrepant = line_items[discrepant_idx]
+        correct_total = discrepant.contracted_total
+        overcharge = round(discrepant.invoiced_total - correct_total, 2)
+        return Scenario(
+            task_name="procurement_reconciliation",
+            seed=self.seed,
+            vendor=vendor, buyer=buyer,
+            purchase_order=po, invoice=invoice,
+            correct_adjustment=-overcharge,  # negative = reduce invoice
+            discrepant_line_item_id=discrepant.item_id,
+            correct_line_item_price=correct_total,  # contracted line total (qty * unit price), not the unit price
         )
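For orientation, the ground truth that `generate_task1` stores can be recovered by comparing contracted and invoiced line totals. This is a minimal sketch under stated assumptions: the `Row` tuple layout and `find_overcharge` are hypothetical illustrations, not the grader's actual logic, and the sketch assumes at most one overcharged line, as in Task 1.

```python
from typing import List, Optional, Tuple

# Hypothetical row layout: (item_id, quantity, contracted_unit_price, invoiced_unit_price)
Row = Tuple[str, int, float, float]


def find_overcharge(rows: List[Row]) -> Tuple[Optional[str], float]:
    """Return (item_id, adjustment) for the first overcharged line.

    The adjustment is negative, mirroring the convention in generate_task1
    where a negative value means the invoice should be reduced.
    """
    for item_id, qty, contracted, invoiced in rows:
        contracted_total = round(qty * contracted, 2)
        invoiced_total = round(qty * invoiced, 2)
        overcharge = round(invoiced_total - contracted_total, 2)
        if overcharge > 0:
            return item_id, -overcharge
    return None, 0.0


rows = [
    ("LI-1001", 10, 20.00, 20.00),  # invoiced at the contracted price
    ("LI-1002", 4, 50.00, 52.50),   # invoiced 2.50 above contract per unit
]
item_id, adjustment = find_overcharge(rows)
assert item_id == "LI-1002"
assert adjustment == -10.0
```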
+    # ------------------------------------------------------------------
+    # Task 2: Medium — SLA Enforcement
+    # ------------------------------------------------------------------
+    def generate_task2(self) -> Scenario:
+        """Generate a delayed shipment + SLA penalty scenario."""
+        scenario = self.generate_task1()  # base PO/invoice
+        # Remove the pricing discrepancy for task2 (focus is on shipping)
+        for li in scenario.purchase_order.line_items:
+            li.invoiced_unit_price = li.contracted_unit_price
+            li.invoiced_total = li.contracted_total
+            li.has_discrepancy = False
+        # Recalculate invoice
+        inv = scenario.invoice
+        inv_subtotal = round(sum(li.contracted_total for li in inv.line_items), 2)
+        inv.subtotal = inv_subtotal
+        inv.tax_amount = round(inv_subtotal * inv.tax_rate, 2)
+        inv.total = round(inv_subtotal + inv.tax_amount, 2)
+        # Generate SLA
+        sla_struct = self._pick(SLA_PENALTY_STRUCTURES).copy()
+        contract_id = self._gen_id("SLA")
+        sla = SLAContract(
+            contract_id=contract_id,
+            vendor=scenario.vendor.name,
+            buyer=scenario.buyer.name,
+            effective_date=self._gen_date(month_range=(1, 3)),
+            penalty_structure=sla_struct,
+            delivery_terms=f"Delivery within 14 business days of PO issuance. Penalties per SLA clause {contract_id}-SEC4.",
+        )
+        # Generate shipping delay
+        delay_days = self.rng.randint(2, 12)
+        grace = sla_struct.get("grace_days", 0)
+        tracking_id = self._gen_tracking_id()
+        ship_log = ShippingLog(
+            tracking_id=tracking_id,
+            po_reference=scenario.purchase_order.po_number,
+            carrier=self._pick(["FedEx Freight", "UPS Supply Chain", "XPO Logistics", "USPS Priority", "DHL Express"]),
+            ship_date=scenario.purchase_order.date,
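+            # NOTE: expected/actual dates are sampled independently, so their gap
+            # may not equal delay_days; the penalty is computed from delay_days.
+            expected_delivery=self._gen_date(month_range=(3, 5)),
+            actual_delivery=self._gen_date(month_range=(4, 6)),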
+            delay_days=delay_days,
+            status="delivered_late",
+        )
+        # Calculate penalty
+        penalizable_days = max(0, delay_days - grace)
+        if sla_struct["type"] == "linear":
+            rate = sla_struct["rate_per_day"]
+            cap = sla_struct["cap"]
+            penalty_pct = min(penalizable_days * rate, cap)
+        elif sla_struct["type"] == "tiered":
+            # Tier thresholds are cumulative day boundaries (days 1-3, 4-7, 8+),
+            # matching how render_sla() prints them.
+            penalty_pct = 0.0
+            prev_threshold = 0
+            for threshold, rate in sla_struct["tiers"]:
+                if penalizable_days <= prev_threshold:
+                    break
+                days_in_tier = min(penalizable_days, threshold) - prev_threshold
+                penalty_pct += days_in_tier * rate
+                prev_threshold = threshold
+            penalty_pct = min(penalty_pct, sla_struct["cap"])
+        else:
+            penalty_pct = 0.0
+        penalty_amount = round(inv.subtotal * penalty_pct, 2)
+        scenario.task_name = "sla_enforcement"
+        scenario.sla_contract = sla
+        scenario.shipping_log = ship_log
+        scenario.correct_adjustment = -penalty_amount  # deduction from invoice
+        scenario.penalty_amount = penalty_amount
+        scenario.discrepant_line_item_id = None
+        scenario.correct_line_item_price = None
+        return scenario
+    # ------------------------------------------------------------------
+    # Task 3: Hard — Adversarial Auditing
+    # ------------------------------------------------------------------
+    def generate_task3(self) -> Scenario:
+        """Generate adversarial vendor dispute scenario."""
+        scenario = self.generate_task2()  # has SLA + shipping
+        # Generate warehouse logs proving dock was open during disputed window
+        delivery_date = scenario.shipping_log.actual_delivery
+        warehouse_logs = []
+        for _ in range(4):  # four log entries for the disputed delivery window
+            # Simplified: every entry is stamped with the actual delivery date
+            # (per-day offsets around the delivery are not implemented).
+            log_date = delivery_date
+            warehouse_logs.append(WarehouseLog(
+                date=log_date,
+                dock_id=f"DOCK-{self._pick(['A', 'B', 'C'])}{self.rng.randint(1, 5)}",
+                status="open",
+                staff_on_duty=self.rng.randint(3, 8),
+                shipments_received=self.rng.randint(5, 20),
+                notes=f"Normal operations. {self.rng.randint(5, 20)} deliveries processed.",
+            ))
+        scenario.task_name = "adversarial_auditing"
+        scenario.warehouse_logs = warehouse_logs
+        scenario.vendor_claim_valid = False  # vendor's claim is always invalid in this task
+        return scenario
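The SLA penalty arithmetic used for Task 2 (and inherited by Task 3) can be sketched as a standalone function. This is an illustration only: `penalty_fraction` is a hypothetical name, and the sketch assumes tier thresholds are cumulative day boundaries, which is how `render_sla` words them ("Days 1-3", "Days 4-7", "Day 8+").

```python
from typing import Any, Dict


def penalty_fraction(structure: Dict[str, Any], delay_days: int) -> float:
    """Fraction of the invoice subtotal owed as a late-delivery penalty."""
    # Grace days are subtracted first; the result never goes below zero.
    days = max(0, delay_days - structure.get("grace_days", 0))
    if structure["type"] == "linear":
        raw = days * structure["rate_per_day"]
    elif structure["type"] == "tiered":
        raw, prev = 0.0, 0
        for threshold, rate in structure["tiers"]:
            if days <= prev:
                break
            # Only the days falling inside this tier's band are charged its rate.
            raw += (min(days, threshold) - prev) * rate
            prev = threshold
    else:
        raw = 0.0
    return min(raw, structure["cap"])  # the cap applies to the summed penalty


linear = {"type": "linear", "rate_per_day": 0.02, "cap": 0.10, "grace_days": 0}
tiered = {"type": "tiered", "tiers": [(3, 0.02), (7, 0.03), (999, 0.05)],
          "cap": 0.20, "grace_days": 0}

# 5 days late under the tiered SLA: 3 days at 2% plus 2 days at 3%.
assert abs(penalty_fraction(tiered, 5) - 0.12) < 1e-9
# 12 days late under the linear SLA: 24% raw, limited by the 10% cap.
assert abs(penalty_fraction(linear, 12) - 0.10) < 1e-9
```

Multiplying the returned fraction by the invoice subtotal and rounding to two decimals gives the dollar penalty; the scenario stores its negation as `correct_adjustment`.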
+# ---------------------------------------------------------------------------
+# Document renderers — produce human-readable text from data structures
+# ---------------------------------------------------------------------------
+def render_purchase_order(po: PurchaseOrder) -> str:
+    lines = [
+        "═══════════════════════════════════════════",
+        "              PURCHASE ORDER",
+        "═══════════════════════════════════════════",
+        f"PO Number:       {po.po_number}",
+        f"Date:            {po.date}",
+        f"Approved Budget: ${po.approved_budget:,.2f}",
+        "",
+        f"Vendor:  {po.vendor.name}",
+        f"         {po.vendor.address}",
+        f"         {po.vendor.city}, {po.vendor.state} {po.vendor.zip_code}",
+        "",
+        f"Buyer:   {po.buyer.name}",
+        f"         {po.buyer.address}",
+        f"         {po.buyer.city}, {po.buyer.state} {po.buyer.zip_code}",
+        "",
+        "Line Items:",
+        f"{'ID':<12} {'Description':<40} {'Qty':>5} {'Unit Price':>12} {'Total':>12}",
+        "─" * 85,
+    ]
+    for li in po.line_items:
+        lines.append(
+            f"{li.item_id:<12} {li.description:<40} {li.quantity:>5} "
+            f"${li.contracted_unit_price:>10,.2f} ${li.contracted_total:>10,.2f}"
+        )
+    lines.extend([
+        "─" * 85,
+        f"{'PO Total:':>71} ${po.total_amount:>10,.2f}",
+        "═══════════════════════════════════════════",
+    ])
+    return "\n".join(lines)
+def render_invoice(inv: Invoice) -> str:
+    tax_pct = f"{inv.tax_rate * 100:.1f}"
+    lines = [
+        "═══════════════════════════════════════════",
+        "                INVOICE",
+        "═══════════════════════════════════════════",
+        f"Invoice Number:  {inv.invoice_number}",
+        f"Date:            {inv.date}",
+        f"PO Reference:    {inv.po_reference}",
+        "",
+        f"From:    {inv.vendor.name}",
+        f"         {inv.vendor.address}",
+        f"         {inv.vendor.city}, {inv.vendor.state} {inv.vendor.zip_code}",
+        f"         Tax ID: {inv.vendor.tax_id}",
+        "",
+        f"Bill To: {inv.buyer.name}",
+        f"         {inv.buyer.address}",
+        f"         {inv.buyer.city}, {inv.buyer.state} {inv.buyer.zip_code}",
+        "",
+        f"{'ID':<12} {'Description':<40} {'Qty':>5} {'Unit Price':>12} {'Amount':>12}",
+        "─" * 85,
+    ]
+    for li in inv.line_items:
+        lines.append(
+            f"{li.item_id:<12} {li.description:<40} {li.quantity:>5} "
+            f"${li.invoiced_unit_price:>10,.2f} ${li.invoiced_total:>10,.2f}"
+        )
+    lines.extend([
+        "─" * 85,
+        f"{'Subtotal:':>71} ${inv.subtotal:>10,.2f}",
+        f"{'Tax (' + tax_pct + '%):':>71} ${inv.tax_amount:>10,.2f}",
+        f"{'TOTAL DUE:':>71} ${inv.total:>10,.2f}",
+        "═══════════════════════════════════════════",
+    ])
+    return "\n".join(lines)
+def render_sla(sla: SLAContract) -> str:
+    ps = sla.penalty_structure
+    lines = [
+        "═══════════════════════════════════════════",
+        "         SERVICE LEVEL AGREEMENT",
+        "═══════════════════════════════════════════",
+        f"Contract ID:     {sla.contract_id}",
+        f"Effective Date:  {sla.effective_date}",
+        f"Vendor:          {sla.vendor}",
+        f"Buyer:           {sla.buyer}",
+        "",
+        f"Delivery Terms:  {sla.delivery_terms}",
+        "",
+        "LATE DELIVERY PENALTY CLAUSE:",
+    ]
+    if ps["type"] == "linear":
+        lines.append(f"  - Penalty rate: {ps['rate_per_day'] * 100:.1f}% of invoice subtotal per day late")
+        lines.append(f"  - Maximum penalty cap: {ps['cap'] * 100:.0f}% of invoice subtotal")
+        if ps["grace_days"] > 0:
+            lines.append(f"  - Grace period: {ps['grace_days']} business day(s)")
+    elif ps["type"] == "tiered":
+        lines.append("  - Tiered penalty structure:")
+        prev = 0
+        for threshold, rate in ps["tiers"]:
+            # A threshold of 999 or more is rendered as the open-ended final tier.
+            if threshold >= 999:
+                lines.append(f"    Day {prev + 1}+: {rate * 100:.1f}% per day")
+            else:
+                lines.append(f"    Days {prev + 1}-{threshold}: {rate * 100:.1f}% per day")
+            prev = threshold
+        lines.append(f"  - Maximum penalty cap: {ps['cap'] * 100:.0f}% of invoice subtotal")
+    lines.append("═══════════════════════════════════════════")
+    return "\n".join(lines)
+def render_shipping_log(log: ShippingLog) -> str:
+    """Render a shipping log, including expected and actual delivery dates and the delay."""
+    return "\n".join([
+        "═══════════════════════════════════════════",
+        "            SHIPPING LOG",
+        "═══════════════════════════════════════════",
+        f"Tracking ID:        {log.tracking_id}",
+        f"PO Reference:       {log.po_reference}",
+        f"Carrier:            {log.carrier}",
+        f"Ship Date:          {log.ship_date}",
+        f"Expected Delivery:  {log.expected_delivery}",
+        f"Actual Delivery:    {log.actual_delivery}",
+        f"Delay:              {log.delay_days} day(s)",
+        f"Status:             {log.status}",
+        "═══════════════════════════════════════════",
+    ])
+def render_warehouse_logs(logs: List[WarehouseLog]) -> str:
+    """Render a list of warehouse access log entries as one plain-text document."""
+    lines = [
+        "═══════════════════════════════════════════",
+        "         WAREHOUSE ACCESS LOGS",
+        "═══════════════════════════════════════════",
+    ]
+    for wl in logs:
+        lines.extend([
+            f"Date: {wl.date}  |  Dock: {wl.dock_id}  |  Status: {wl.status.upper()}",
+            f"  Staff on duty: {wl.staff_on_duty}  |  Shipments received: {wl.shipments_received}",
+            f"  Notes: {wl.notes}",
+            "",
+        ])
+    lines.append("═══════════════════════════════════════════")
+    return "\n".join(lines)
 # ---------------------------------------------------------------------------
 # Public API
 # ---------------------------------------------------------------------------
+TASK_GENERATORS = {
+    "procurement_reconciliation": "generate_task1",
+    "sla_enforcement": "generate_task2",
+    "adversarial_auditing": "generate_task3",
+}
+VALID_TASKS = list(TASK_GENERATORS.keys())
+MAX_STEPS = {
+    "procurement_reconciliation": 10,
+    "sla_enforcement": 15,
+    "adversarial_auditing": 20,
 }
+def generate_scenario(task_name: str, seed: int = 0) -> Scenario:
+    """Generate a complete ESCTR scenario for the given task and seed."""
     engine = ProceduralEngine(seed)
+    # Unknown task names silently fall back to the Task 1 generator instead of raising.
+    method = TASK_GENERATORS.get(task_name, "generate_task1")
     return getattr(engine, method)()
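The late-delivery clauses printed by render_sla above (a linear rate with grace days and a cap, or a tiered rate with a cap) reduce to simple arithmetic. Below is a minimal sketch of one plausible reading of those clauses. It assumes that each tier's rate applies only to the days falling inside that tier, that grace days are counted as plain days, and that the cap is a fraction of the invoice subtotal. The names linear_penalty and tiered_penalty are hypothetical and not part of this commit; the actual grading logic in server/graders.py may differ.

```python
from typing import List, Tuple


def linear_penalty(subtotal: float, delay_days: int, rate_per_day: float,
                   cap: float, grace_days: int = 0) -> float:
    """Hypothetical helper: flat per-day rate after a grace period, capped."""
    # Days inside the grace period are not charged (assumption).
    chargeable_days = max(0, delay_days - grace_days)
    uncapped = subtotal * rate_per_day * chargeable_days
    return round(min(uncapped, subtotal * cap), 2)


def tiered_penalty(subtotal: float, delay_days: int,
                   tiers: List[Tuple[int, float]], cap: float) -> float:
    """Hypothetical helper: each (threshold, rate) tier charges only its own days."""
    fraction = 0.0
    prev = 0
    for threshold, rate in tiers:
        # Number of late days that fall in the range (prev, threshold].
        days_in_tier = max(0, min(delay_days, threshold) - prev)
        fraction += days_in_tier * rate
        prev = threshold
    return round(min(subtotal * fraction, subtotal * cap), 2)


if __name__ == "__main__":
    # Illustrative values only; they do not come from the environment.
    example_tiers = [(5, 0.005), (10, 0.01), (999, 0.02)]
    print(linear_penalty(10_000.0, 5, 0.01, 0.10, grace_days=1))
    print(tiered_penalty(10_000.0, 12, example_tiers, 0.15))
```

The tier list uses the same (threshold, rate) shape and the same 999 sentinel for the open-ended last tier that render_sla iterates over, so a scenario's penalty_structure could be passed in directly under the assumptions stated above.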