musharraf7 committed on
Commit a363048 · 1 Parent(s): 6f7e1b7

feat: ESCTR pivot — Enterprise Supply Chain & Tax Reconciliation

Complete rewrite for OpenEnv Hackathon Round 2:
- New domain: autonomous financial auditing (supply chain discrepancies)
- 3 tasks: procurement reconciliation, SLA enforcement, adversarial auditing
- 4 ERP tools: query_database, read_document, communicate_vendor, submit_financial_decision
- Adversarial vendor negotiation with settlement dynamics
- Procedural scenario generation (deterministic from seed)
- RLVR composite rewards with trajectory milestones and gullibility penalties
- Storytelling README (Problem → Environment → Results → Why it matters)
- Added course.md documenting the full journey
- Removed old documents.py (replaced by procedural.py)

Files changed (12)
  1. .gitignore +1 -0
  2. README.md +122 -155
  3. course.md +309 -0
  4. inference.py +155 -225
  5. openenv.yaml +10 -13
  6. server/__init__.py +3 -3
  7. server/app.py +53 -76
  8. server/documents.py +0 -898
  9. server/environment.py +458 -526
  10. server/graders.py +280 -302
  11. server/models.py +91 -45
  12. server/procedural.py +536 -382
.gitignore CHANGED
@@ -14,3 +14,4 @@ hackathon_instructions.txt
14
  preparatory_course.txt
15
  RESEARCH_1.md
16
  RESEARCH_2.md
 
 
14
  preparatory_course.txt
15
  RESEARCH_1.md
16
  RESEARCH_2.md
17
+ ROUND_2_GUIDELINES.md
README.md CHANGED
@@ -1,7 +1,7 @@
1
  ---
2
- title: Invoice Extraction Environment
3
- emoji: 📄
4
- colorFrom: blue
5
  colorTo: green
6
  sdk: docker
7
  pinned: false
@@ -10,218 +10,185 @@ tags:
10
  - openenv
11
  ---
12
 
13
- # Invoice Extraction Environment
14
 
15
- An OpenEnv-compliant environment where AI agents extract structured data from unstructured invoice and receipt documents. Features **5 difficulty tiers** from clean invoices to adversarial documents with decoy fields, OCR corruption, and hidden calculations — with **procedural document generation** for virtually infinite training configurations and an **RLVR-inspired composite reward architecture**.
16
 
17
  **Space URL:** `https://huggingface.co/spaces/musharraf7/invoice-extraction-env`
18
 
19
- ```python
20
- import requests
21
-
22
- # Connect to the environment
23
- url = "https://musharraf7-invoice-extraction-env.hf.space"
24
- r = requests.post(f"{url}/reset", json={"task_name": "simple_invoice"})
25
- print(r.json())
26
- ```
27
-
28
- ## Why This Environment?
29
-
30
- Invoice data extraction is a **$5B+ industry** problem faced daily by every business. This environment provides:
31
-
32
- - **Real RL training signal**: Per-field partial-credit scoring gives dense reward gradients via RLVR-inspired composite rewards
33
- - **Infinite training data**: Procedural document generation creates unique invoices from any seed — eliminating overfitting to a static corpus
34
- - **Genuine difficulty progression**: From clean invoices to adversarial traps that challenge frontier models
35
- - **Multi-tool agentic workflow**: Hard tasks feature database queries, calculation verification, and discrepancy detection tools — training agents for multi-step reasoning
36
- - **Reward shaping**: Trajectory milestones, consistency bonuses, efficiency signals, and improvement tracking provide rich learning signals beyond simple field matching
37
- - **Production relevance**: The task directly models what commercial document processing systems must solve
38
-
39
- ## Reward Architecture (RLVR-Inspired)
40
 
41
- The environment uses a composite reward function inspired by Reinforcement Learning with Verifiable Rewards:
42
 
43
- ```
44
- R_total = α·R_outcome + β·R_trajectory + bonuses
45
- ```
46
 
47
- | Component | Weight | Description |
48
- |-----------|--------|-------------|
49
- | **R_outcome** | α = 0.70 | Weighted extraction accuracy (financial fields 1.5×, line items 2.0×) |
50
- | **R_trajectory** | β = 0.30 | Micro-rewards for information gathering milestones |
51
- | **Consistency bonus** | +0.03 | Agent's subtotal + tax = total |
52
- | **Efficiency bonus** | +0.01–0.02 | Solution found in ≤5 steps |
53
- | **Improvement bonus** | up to +0.02 | Score improves on retry |
54
- | **Step cost** | -0.005/step | Encourages efficient exploration |
55
- | **Hallucination penalty** | -0.02 | Invalid JSON or unknown commands |
56
-
57
- ### Trajectory Milestones
58
-
59
- | Action | Micro-reward | Purpose |
60
- |--------|-------------|---------|
61
- | `view_document` | +0.01 | Evidence gathering |
62
- | `view_fields` | +0.01 | Understanding requirements |
63
- | `get_feedback` | +0.005 | Learning from errors |
64
- | `query_related_documents` | +0.015 | Cross-referencing (hard tasks) |
65
- | `verify_calculations` | +0.01 | Mathematical verification |
66
- | `check_discrepancies` | +0.015 | Anomaly detection |
67
-
68
- ## Action Space
69
-
70
- The agent sends an `InvoiceAction` with a `command` and optional `payload`:
71
-
72
- | Command | Description | Payload | Available Tasks |
73
- |---------|-------------|---------|-----------------|
74
- | `view_document` | View the raw document text | — | All |
75
- | `view_fields` | See required fields with descriptions | — | All |
76
- | `extract` | Submit extracted fields | JSON string | All |
77
- | `get_feedback` | Get detailed per-field feedback | — | All |
78
- | `query_related_documents` | Retrieve PO, credit memos, etc. | — | multi_document, adversarial |
79
- | `verify_calculations` | Submit arithmetic for verification | JSON string | multi_document, adversarial |
80
- | `check_discrepancies` | Flag inconsistencies in documents | — | multi_document, adversarial |
81
-
82
- ### Action Schema
83
- ```json
84
- {
85
- "command": "extract",
86
- "payload": "{\"invoice_number\": \"INV-2024-001\", \"date\": \"2024-01-15\", ...}"
87
- }
88
- ```
89
 
90
- ## Observation Space
91
 
92
- Each step returns an `InvoiceObservation`:
93
 
94
- | Field | Type | Description |
95
- |-------|------|-------------|
96
- | `text` | string | Response text from the environment |
97
- | `task_name` | string | Current task name |
98
- | `current_score` | float | Best score achieved so far |
99
- | `attempts_remaining` | int | Remaining extraction attempts |
100
- | `required_fields` | list | Fields to extract |
101
- | `done` | bool | Whether the episode has ended |
102
- | `reward` | float | Reward signal (0.01–0.99) |
103
- | `last_action_status` | string | "success" or "error" |
104
- | `error_message` | string | Diagnostic error message (if error) |
105
- | `current_step` | int | Step number within episode |
106
- | `accumulated_reward` | float | Total reward accumulated so far |
107
 
108
- ## Tasks (5 Difficulty Tiers)
109
 
110
- ### 1. `simple_invoice` (Easy) — 3 attempts
111
- Clean, well-formatted invoices with clear field labels.
 
 
112
 
113
- **Required fields:** `invoice_number`, `date`, `vendor_name`, `customer_name`, `subtotal`, `tax`, `total`, `line_items`
114
 
115
- ### 2. `messy_invoice` (Medium) — 3 attempts
116
- Same fields but from messy, inconsistently formatted documents with abbreviations, typos, and non-standard layouts.
 
 
 
117
 
118
- **Required fields:** Same as simple_invoice
119
 
120
- ### 3. `multi_document` (Hard) — 5 attempts
121
- Complex multi-section documents containing a purchase order, invoice, and credit memo/payment receipt. The agent must cross-reference sections. **Advanced tools available** (`query_related_documents`, `verify_calculations`, `check_discrepancies`).
122
 
123
- **Required fields:** All basic fields + `po_number`, `adjustment_reason`, `adjusted_total`
 
 
 
 
 
124
 
125
- ### 4. `corrupted_scan` (Very Hard) — 4 attempts
126
- Simulates OCR-scanned/faxed invoices with systematic character errors:
127
- - Character substitutions: `0`↔`O`, `1`↔`l`↔`I`, `5`↔`S`, `8`↔`B`
128
- - Garbled sections and scan artifacts
129
- - The agent must **reason through noise** to recover the true values
130
 
131
- **Required fields:** Same as simple_invoice
 
 
 
132
 
133
- ### 5. `adversarial_invoice` (Expert) — 6 attempts
134
- Adversarial documents designed to trap and challenge frontier models:
135
- - **Decoy fields**: Multiple invoice numbers — only one is current
136
- - **Hidden calculations**: Discounts the agent must compute
137
- - **Contradictory sections**: PO vs invoice disagreements
138
- - **Budget variance alerts**: Agent must identify and explain discrepancies
139
 
140
- **Advanced tools available** for investigation.
141
 
142
- **Required fields:** All basic fields + `po_number`, `discount_amount`, `original_total`, `discrepancy_notes`
 
 
143
 
144
- ## Procedural Document Generation
 
 
 
 
 
 
 
145
 
146
- The environment features a **procedural generation engine** that creates unique invoice documents from any seed value:
147
 
148
- - **15 vendor profiles** with addresses across the US
149
- - **15 customer profiles** with realistic business names
150
- - **25+ product catalog items** spanning hardware, software, and services
151
- - **10 tax rate configurations** (5%–10%)
152
- - **Deterministic**: Same seed always produces the same document
153
- - **Infinite variety**: Seeds 0–2 use static test fixtures; seeds ≥ 3 generate novel documents
154
 
155
- ```python
156
- # Use seed to get different documents
157
- r = requests.post(f"{url}/reset", json={"task_name": "simple_invoice", "seed": 42})
158
- r = requests.post(f"{url}/reset", json={"task_name": "simple_invoice", "seed": 100})
159
- ```
160
 
161
- ## Per-Field Scoring
162
 
163
- - **Text fields**: Fuzzy matching with SequenceMatcher (0.0–1.0)
164
- - **Numeric fields**: Exact match (1.0), within 1% (0.9), within 5% (0.5), within 10% (0.2)
165
- - **Date fields**: Normalized comparison (YYYY-MM-DD) with format tolerance
166
- - **Line items**: Best-fit matching of description, qty, price, amount (weighted 2.0×)
167
- - **Reasoning fields** (discrepancy_notes): Fuzzy matching with lower threshold
168
- - **Financial fields** (subtotal, tax, total): Weighted 1.5× for importance
169
 
170
- ## Setup Instructions
171
 
172
- ### Run with Docker
173
  ```bash
174
- docker build -t invoice-extraction-env .
175
- docker run -p 7860:7860 invoice-extraction-env
176
- ```
177
 
178
- ### Run locally
179
- ```bash
180
  pip install -r requirements.txt
181
  uvicorn server.app:app --host 0.0.0.0 --port 7860
182
  ```
183
 
184
- ### Run with uv
185
- ```bash
186
- uv run server
187
  ```
188
 
189
- ### Run inference
190
  ```bash
191
  export ENV_URL="http://localhost:7860"
192
  export API_BASE_URL="https://router.huggingface.co/v1"
193
  export MODEL_NAME="meta-llama/Meta-Llama-3-8B-Instruct"
194
- export HF_TOKEN="your_token_here"
195
  python inference.py
196
  ```
197
 
 
 
 
 
 
 
 
 
 
198
  ## API Endpoints
199
 
200
  | Endpoint | Method | Description |
201
  |----------|--------|-------------|
202
  | `/health` | GET | Health check |
203
- | `/reset` | POST | Reset with task selection |
204
  | `/step` | POST | Execute an action |
205
- | `/state` | GET | Get current state |
206
- | `/schema` | GET | Get action/observation schemas |
207
- | `/metadata` | GET | Get environment metadata |
208
  | `/ws` | WebSocket | Persistent session |
209
 
210
  ## Project Structure
 
211
  ```
212
  ├── server/
213
  │ ├── __init__.py
214
  │ ├── app.py # FastAPI application
215
- │ ├── environment.py # Core environment logic + RLVR reward architecture
216
- │ ├── documents.py # 15-document corpus across 5 difficulty tiers
217
- │ ├── procedural.py # Procedural document generation engine
218
- │ ├── graders.py # Field-level scoring with weighted fuzzy matching
219
- │ └── models.py # Pydantic Action/Observation/State types
220
- ├── __init__.py # Package declaration
221
- ├── inference.py # Baseline inference script (all 5 tasks)
222
  ├── openenv.yaml # OpenEnv manifest
223
- ├── pyproject.toml # Package configuration
224
  ├── requirements.txt # Dependencies
225
  ├── Dockerfile # Container definition
226
  └── README.md # This file
227
  ```
 
 
 
 
 
 
 
 
1
  ---
2
+ title: ESCTR Environment
3
+ emoji: 🏢
4
+ colorFrom: indigo
5
  colorTo: green
6
  sdk: docker
7
  pinned: false
 
10
  - openenv
11
  ---
12
 
13
+ # 🏢 ESCTR: Enterprise Supply Chain & Tax Reconciliation
14
 
15
+ > **Training LLMs to be autonomous financial auditors** — an OpenEnv environment for teaching AI agents to investigate procurement discrepancies, enforce SLA penalties, and navigate adversarial vendor disputes.
16
 
17
  **Space URL:** `https://huggingface.co/spaces/musharraf7/invoice-extraction-env`
18
 
19
+ ---
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
20
 
21
+ ## The Problem
22
 
23
+ Every day, global enterprises process millions of procurement transactions. Between the Purchase Order, the shipping manifest, the SLA contract, and the final vendor invoice, discrepancies **inevitably** arise:
 
 
24
 
25
+ - A vendor bills $45/unit instead of the contracted $40
26
+ - A shipment arrives 5 days late, triggering SLA penalty clauses
27
+ - A vendor disputes the penalty, claiming *your* warehouse rejected the delivery
 
28
 
29
+ Resolving these disputes currently requires human financial controllers to **manually cross-reference multiple siloed databases**, interpret complex contract clauses, perform precise arithmetic, and negotiate with adversarial counterparties. It's slow, expensive, and error-prone.
30
 
31
+ **What if we could train LLMs to do this autonomously?**
32
 
33
+ ## The Environment
 
 
 
 
 
 
 
 
 
 
 
 
34
 
35
+ ESCTR provides a stateful sandbox where an LLM agent operates as an **autonomous financial controller**. Rather than just extracting data from a document, the agent must:
36
 
37
+ 1. **Investigate** — query procurement databases, shipping logs, SLA contracts
38
+ 2. **Reason** — cross-reference documents, calculate penalties, verify claims
39
+ 3. **Negotiate** — handle adversarial vendor communications
40
+ 4. **Decide** — submit a mathematically precise financial adjustment
41
 
42
+ ### Three Tasks, Escalating Complexity
43
 
44
+ | Task | Difficulty | Max Steps | What the Agent Must Do |
45
+ |------|-----------|-----------|----------------------|
46
+ | **Procurement Reconciliation** | Easy | 10 | Find an overcharged line item between PO and Invoice, calculate the exact overcharge |
47
+ | **SLA Enforcement** | Medium | 15 | Discover a late shipment, retrieve the SLA contract, calculate the penalty from contract terms |
48
+ | **Adversarial Auditing** | Hard | 20 | All of the above + verify warehouse logs to disprove vendor's claim + reject a settlement offer |
49
 
50
+ ### The Tool Suite
51
 
52
+ The agent interacts through **4 ERP tools**, each requiring precise parameters:
 
53
 
54
+ | Tool | Purpose | Parameters |
55
+ |------|---------|------------|
56
+ | `query_database` | Search corporate databases | `{"table": "shipping_logs"}` |
57
+ | `read_document` | Retrieve full document text | `document_id: "PO-2024-1234"` |
58
+ | `communicate_vendor` | Negotiate with adversarial vendor | `message_content: "We reject..."` |
59
+ | `submit_financial_decision` | Submit final adjustment (terminal) | `adjustment_amount: -450.00` |
60
 
61
+ ### Procedural Generation
 
 
 
 
62
 
63
+ Every scenario is generated from a seed — **same seed = same scenario = deterministic grading**. This enables:
64
+ - Infinite training configurations (no memorization)
65
+ - Reproducible evaluation
66
+ - Fair comparison between models
67
 
68
+ Each scenario generates: company profiles, product catalogs with contracted pricing, purchase orders, vendor invoices (with seeded discrepancies), SLA contracts (linear/tiered penalty structures), shipping telemetry, and warehouse access logs.
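The "linear/tiered penalty structures" mentioned above could be computed roughly as follows. This is a minimal sketch, not the code in `server/procedural.py`: the function names, the per-day rate, the cap, and the tier thresholds are all hypothetical values chosen for illustration.

```python
def linear_penalty(invoice_total: float, days_late: int,
                   rate_per_day: float, cap: float = 1.0) -> float:
    """Linear SLA penalty: a fixed fraction of the invoice per day late (assumed form)."""
    fraction = min(days_late * rate_per_day, cap)  # never exceed the cap
    return round(invoice_total * fraction, 2)


def tiered_penalty(invoice_total: float, days_late: int,
                   tiers: list[tuple[int, float]]) -> float:
    """Tiered SLA penalty: the highest tier whose day threshold is reached applies.

    `tiers` is a list of (minimum days late, penalty fraction), sorted ascending.
    """
    fraction = 0.0
    for min_days, tier_fraction in tiers:
        if days_late >= min_days:
            fraction = tier_fraction
    return round(invoice_total * fraction, 2)


# Hypothetical example: a 9,000.00 invoice delivered 5 days late.
print(linear_penalty(9000.0, 5, rate_per_day=0.01))
print(tiered_penalty(9000.0, 5, tiers=[(1, 0.02), (4, 0.05), (8, 0.10)]))
```

Rounding to two decimals keeps the expected adjustment comparable at cent precision, which matters when the grader checks for an exact amount.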
 
 
 
 
 
69
 
70
+ ## Reward Architecture (RLVR-Inspired)
71
 
72
+ ```
73
+ R_total = α·R_outcome + β·R_trajectory − penalties
74
+ ```
75
 
76
+ | Component | Weight | Description |
77
+ |-----------|--------|-------------|
78
+ | **R_outcome** | 60-70% | Did the agent submit the correct adjustment amount? |
79
+ | **R_trajectory** | 30-40% | Did the agent follow proper investigative procedure? |
80
+ | **Efficiency penalty** | -0.005/step | Encourages shortest path to resolution |
81
+ | **Hallucination penalty** | -0.02 | Invalid queries, nonexistent documents |
82
+ | **Gullibility penalty** | -0.20 | Accepting adversarial settlement offers (Task 3) |
83
+ | **Evidence bonus** | +0.05 | Citing warehouse logs as evidence (Task 3) |
84
 
85
+ ### Why This Reward Design Matters
86
 
87
+ - **Dense, not sparse**: Trajectory milestones reward correct investigative behavior (querying the right databases, reading the right documents) even if the final answer is wrong
88
+ - **Hard to game**: An agent that spams queries gets penalized by step costs; an agent that submits without investigating gets 0 trajectory reward
89
+ - **Verifiable**: The correct answer is always a precise floating-point number derived from contract terms — no subjective evaluation
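To make the reward table concrete, the components could be combined along these lines. This is only a sketch under stated assumptions: `composite_reward` is a hypothetical name, the fixed 0.6/0.4 split is one point inside the 60-70% / 30-40% ranges given in the table, and the final clamp to [0, 1] is assumed rather than taken from `server/graders.py`.

```python
def composite_reward(
    outcome: float,
    trajectory: float,
    steps: int,
    hallucinations: int = 0,
    accepted_settlement: bool = False,
    cited_evidence: bool = False,
    alpha: float = 0.6,   # assumed weight for R_outcome
    beta: float = 0.4,    # assumed weight for R_trajectory
) -> float:
    """Combine outcome and trajectory scores with the penalties from the table."""
    reward = alpha * outcome + beta * trajectory
    reward -= 0.005 * steps            # efficiency penalty per step
    reward -= 0.02 * hallucinations    # invalid queries, nonexistent documents
    if accepted_settlement:
        reward -= 0.20                 # gullibility penalty (Task 3)
    if cited_evidence:
        reward += 0.05                 # evidence bonus (Task 3)
    return max(0.0, min(1.0, reward))  # assumed clamp to [0, 1]


# A perfect 10-step episode versus the same episode after accepting a settlement.
print(composite_reward(1.0, 1.0, steps=10))
print(composite_reward(1.0, 1.0, steps=10, accepted_settlement=True))
```

Under these assumptions, an agent that submits without investigating earns at most the outcome share, which is the "hard to game" property described above.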
 
 
 
90
 
91
+ ## Results
 
 
 
 
92
 
93
+ *Training evidence and reward plots will be added during the onsite hackathon (April 25-26) when compute credits are provided.*
94
 
95
+ <!-- Placeholder for training results -->
96
+ <!-- ![Reward curves](plots/reward_curves.png) -->
 
 
 
 
97
 
98
+ ## Quick Start
99
 
100
+ ### Run the environment
101
  ```bash
102
+ # Docker
103
+ docker build -t esctr-env .
104
+ docker run -p 7860:7860 esctr-env
105
 
106
+ # Or locally
 
107
  pip install -r requirements.txt
108
  uvicorn server.app:app --host 0.0.0.0 --port 7860
109
  ```
110
 
111
+ ### Connect an agent
112
+ ```python
113
+ import requests
114
+
115
+ url = "http://localhost:7860"
116
+
117
+ # Reset with a task
118
+ r = requests.post(f"{url}/reset", json={"task_name": "sla_enforcement", "seed": 42})
119
+ briefing = r.json()["observation"]["system_response"]
120
+
121
+ # Query a database
122
+ r = requests.post(f"{url}/step", json={
123
+ "action": {
124
+ "action_type": "query_database",
125
+ "query_parameters": {"table": "shipping_logs"}
126
+ }
127
+ })
128
+ result = r.json()["observation"]["system_response"]
129
+
130
+ # Submit financial decision
131
+ r = requests.post(f"{url}/step", json={
132
+ "action": {
133
+ "action_type": "submit_financial_decision",
134
+ "adjustment_amount": -450.00,
135
+ "adjustment_reason": "Late delivery penalty per SLA clause"
136
+ }
137
+ })
138
+ score = r.json()["reward"]
139
  ```
140
 
141
+ ### Run baseline inference
142
  ```bash
143
  export ENV_URL="http://localhost:7860"
144
  export API_BASE_URL="https://router.huggingface.co/v1"
145
  export MODEL_NAME="meta-llama/Meta-Llama-3-8B-Instruct"
146
+ export HF_TOKEN="your_token"
147
  python inference.py
148
  ```
149
 
150
+ ## Why This Matters
151
+
152
+ | Question | Answer |
153
+ |----------|--------|
154
+ | *Does this teach an LLM something it can't do well?* | Yes — multi-step financial reasoning with tool use is a known weakness of current LLMs |
155
+ | *Is the domain underexplored?* | Yes — supply chain auditing + adversarial negotiation is nearly absent from RL/LLM training benchmarks |
156
+ | *Could a researcher write a paper about this?* | Yes — training autonomous financial auditors has direct commercial and academic value |
157
+ | *Is the reward hard to game?* | Yes — the correct answer is always a precise number from contract math; trajectory rewards require specific database queries |
158
+
159
  ## API Endpoints
160
 
161
  | Endpoint | Method | Description |
162
  |----------|--------|-------------|
163
  | `/health` | GET | Health check |
164
+ | `/reset` | POST | Reset with task + seed |
165
  | `/step` | POST | Execute an action |
166
+ | `/state` | GET | Current state |
167
+ | `/schema` | GET | Action/Observation/State schemas |
168
+ | `/metadata` | GET | Environment metadata |
169
  | `/ws` | WebSocket | Persistent session |
170
 
171
  ## Project Structure
172
+
173
  ```
174
  ├── server/
175
  │ ├── __init__.py
176
  │ ├── app.py # FastAPI application
177
+ │ ├── environment.py # Core stateful environment + tool handlers
178
+ │ ├── procedural.py # Deterministic scenario generation engine
179
+ │ ├── graders.py # Multi-axis deterministic graders (3 tasks)
180
+ │ └── models.py # Pydantic Action/Observation/State schemas
181
+ ├── inference.py # Baseline inference script
 
 
182
  ├── openenv.yaml # OpenEnv manifest
183
+ ├── pyproject.toml # Package config
184
  ├── requirements.txt # Dependencies
185
  ├── Dockerfile # Container definition
186
  └── README.md # This file
187
  ```
188
+
189
+ ## Themes Alignment
190
+
191
+ - **🌐 World Modeling (Professional Tasks)** — Real interaction with tools and dynamic databases
192
+ - **📋 Long-Horizon Planning** — Multi-step investigation requiring state tracking across 10-20 steps
193
+ - **🤝 Multi-Agent Interactions** — Adversarial vendor negotiation with settlement dynamics
194
+ - **📈 Self-Improvement** — Escalating difficulty curriculum (Easy → Medium → Hard)
course.md ADDED
@@ -0,0 +1,309 @@
1
+ # ESCTR: The Full Story — From Invoice Extraction to Enterprise Supply Chain Auditing
2
+
3
+ > This document captures the entire journey: the problem we set out to solve, the research we did, the approaches we tried, and how we arrived at the final ESCTR environment.
4
+
5
+ ---
6
+
7
+ ## Table of Contents
8
+
9
+ 1. [The Starting Point — OpenEnv Hackathon](#1-the-starting-point)
10
+ 2. [Round 1 — Invoice Extraction Environment](#2-round-1--invoice-extraction-environment)
11
+ 3. [Research Phase — What Would Win Round 2?](#3-research-phase)
12
+ 4. [The Pivot Decision — Why ESCTR](#4-the-pivot-decision)
13
+ 5. [Architecture Deep Dive — How ESCTR Works](#5-architecture-deep-dive)
14
+ 6. [Reward Design — RLVR Principles](#6-reward-design)
15
+ 7. [What We Learned](#7-what-we-learned)
16
+
17
+ ---
18
+
19
+ ## 1. The Starting Point
20
+
21
+ ### What is the OpenEnv Hackathon?
22
+
23
+ The **Meta PyTorch OpenEnv Hackathon × Scaler School of Technology** is a hackathon focused on building **RL training environments for LLMs**. The core idea: instead of training LLMs on static datasets, we build interactive environments where agents learn through Reinforcement Learning with Verifiable Rewards (RLVR).
24
+
25
+ **OpenEnv** is a framework by Meta PyTorch and HuggingFace that treats RL environments as isolated microservices — the training loop (client) is completely decoupled from the environment simulation (server). The environment exposes standard HTTP endpoints (`/reset`, `/step`, `/state`) and the agent interacts through typed Actions and Observations.
26
+
27
+ ### The Challenge
28
+
29
+ Build an OpenEnv-compliant environment that:
30
+ - Simulates a task humans actually perform
31
+ - Has programmatic, deterministic grading (no LLM-as-judge)
32
+ - Provides dense reward signals (not just 0/1 at the end)
33
+ - Supports multiple difficulty tiers
34
+ - Runs within 2 vCPU / 8GB RAM constraints
35
+ - Is deployable as a Docker container on HuggingFace Spaces
36
+
37
+ ---
38
+
39
+ ## 2. Round 1 — Invoice Extraction Environment
40
+
41
+ ### The Original Idea
42
+
43
+ Our Round 1 submission was an **Invoice Extraction Environment** — an environment where an AI agent extracts structured data (vendor name, invoice number, line items, totals, etc.) from unstructured invoice documents.
44
+
45
+ ### What We Built
46
+
47
+ - **5 difficulty tiers**: simple_invoice → messy_invoice → multi_document → corrupted_scan → adversarial_invoice
48
+ - **15 static documents** across the 5 tiers
49
+ - **Fuzzy string matching** for text fields, numeric tolerance for amounts
50
+ - **Multi-step interaction**: view_document → view_fields → extract → get_feedback → refine
51
+ - **OpenEnv compliance**: FastAPI server, typed Pydantic models, Docker deployment
52
+
53
+ ### Round 1 Enhancements (Pre-Pivot)
54
+
55
+ Before Round 2 guidelines dropped, we upgraded the Round 1 environment with:
56
+
57
+ 1. **Procedural Document Generation** (`procedural.py`): A seed-based engine generating infinite invoice variations — 15 vendor profiles, 15 customers, 25 products, OCR corruption simulation. This eliminated the overfitting risk of a 15-document static corpus.
58
+
59
+ 2. **RLVR Composite Rewards**: Instead of a simple extraction score, we implemented:
60
+ ```
61
+ R_total = 0.70 × R_outcome + 0.30 × R_trajectory + bonuses
62
+ ```
63
+ With trajectory milestones (micro-rewards for viewing documents, getting feedback), efficiency bonuses, consistency bonuses (subtotal + tax = total), and penalties.
64
+
65
+ 3. **Weighted Grading**: Financial fields scored 1.5×, line items 2.0×, with built-in cross-field arithmetic verification.
66
+
67
+ 4. **Multi-Tool Workflow**: For hard tasks (multi_document, adversarial_invoice), we added `query_related_documents`, `verify_calculations`, and `check_discrepancies` tools.
68
+
69
+ ### Why Round 1 Wasn't Enough
70
+
71
+ The enhanced invoice extraction was technically solid — all tests passed, good reward design, infinite procedural data. **But it wasn't going to win Round 2.**
72
+
73
+ ---
74
+
75
+ ## 3. Research Phase
76
+
77
+ ### RESEARCH_1: The ESCTR Blueprint
78
+
79
+ We conducted deep research into what would maximize hackathon scoring. The key findings:
80
+
81
+ **The Core Problem with Invoice Extraction:**
82
+
83
+ | Vulnerability | Why It Hurts |
84
+ |--------------|-------------|
85
+ | **Saturated domain** | Document extraction is a well-trodden path. Judges have seen it before. |
86
+ | **Shallow interaction** | View document → extract → done. No real multi-step reasoning. |
87
+ | **Text-centric abstraction** | Pre-parsed text removes any visual/spatial reasoning challenge. |
88
+ | **Low novelty ceiling** | Even with procedural generation, the core task is "fill in the JSON fields." |
89
+
90
+ **What Frontier AI Research Demands:**
91
+
92
+ Drawing from the **OLMo 3 technical report** and RLVR research, we identified that winning environments need:
93
+ - **Long-horizon planning**: Agents that plan across 10-20 steps, not 3-5
94
+ - **Tool orchestration**: Multiple heterogeneous tools, not just "view" and "extract"
95
+ - **Partial observability**: Information spread across multiple databases, not one document
96
+ - **Adversarial dynamics**: Active counterparties that resist the agent's goal
97
+ - **Deterministic verification**: Correct answers that are mathematically provable, not fuzzy-matched
98
+
99
+ **The Proposed Solution: Enterprise Supply Chain & Tax Reconciliation (ESCTR)**
100
+
101
+ The research proposed pivoting from "extract data from an invoice" to "act as an autonomous financial controller investigating procurement discrepancies." This transforms a simple NLP extraction task into a genuine **agentic workflow** that maps to real enterprise operations worth trillions of dollars annually.
102
+
103
+ ### RESEARCH_2: Supporting Analysis
104
+
105
+ The supplementary research validated the ESCTR concept against:
106
+ - Amazon's agentic AI evaluation practices
107
+ - Multi-agent negotiation frameworks
108
+ - The credit assignment problem in long-horizon RL
109
+ - Rubric-based reward systems for domains beyond simple verification
110
+
111
+ ### Key Insight from Research
112
+
113
+ > "An environment that challenges frontier 72B models at 40% success rate on its hardest task provides more training headroom than one where 8B models already score 80%."
114
+
115
+ This directly informed our task difficulty design — Task 3 (Adversarial Auditing) is deliberately hard enough that a model must:
116
+ 1. Query 5 different databases
117
+ 2. Cross-reference shipping dates against SLA penalty clauses
118
+ 3. Verify warehouse logs to disprove a vendor's false claim
119
+ 4. Navigate a multi-turn negotiation
120
+ 5. Reject a settlement offer
121
+ 6. Calculate the exact penalty amount to 2 decimal places
122
+
123
+ ---
124
+
125
+ ## 4. The Pivot Decision
126
+
127
+ ### Round 2 Guidelines Changed Everything
128
+
129
+ When the Round 2 guidelines arrived, the scoring criteria shifted dramatically:
130
+
131
+ | Criterion | Round 1 Weight | Round 2 Weight |
132
+ |-----------|---------------|---------------|
133
+ | Environment Innovation | ~30% | **40%** |
134
+ | Storytelling & Presentation | 0% | **30%** |
135
+ | Training Evidence (reward curves) | 0% | **20%** |
136
+ | Reward & Training Pipeline | ~25% | **10%** |
137
+
138
+ **70% of the score** now depends on innovation + storytelling. The guidelines explicitly warned:
139
+
140
+ > *"A messy but ambitious environment with real training evidence beats a polished but boring one."*
141
+ > *"Judges have seen a lot of chess, snake, tic-tac-toe, and grid-world clones."*
142
+
143
+ ### The Decision Matrix
144
+
145
+ | Factor | Invoice Extraction | ESCTR |
146
+ |--------|-------------------|-------|
147
+ | Innovation (40%) | ⚠️ Known domain, seen before | ✅ Novel — supply chain auditing is unexplored in RL |
148
+ | Storytelling (30%) | ⚠️ Hard to make exciting | ✅ Strong narrative — "training autonomous financial controllers" |
149
+ | Training Evidence (20%) | Equal | Equal |
150
+ | Theme Alignment | Weak — barely touches themes | ✅ Hits Theme #3.1 (World Modeling), #2 (Long-Horizon), #1 (Multi-Agent) |
151
+ | Technical Depth | Good but shallow | ✅ 4 tools, 5 databases, adversarial negotiation |
152
+
153
+ ### Decision: Full ESCTR Pivot
154
+
155
+ We chose **Option A: Full ESCTR Pivot** because:
156
+ 1. The innovation ceiling is dramatically higher
157
+ 2. The storytelling angle is compelling and unique
158
+ 3. Our existing RLVR reward architecture transfers directly
159
+ 4. The procedural generation concept transfers directly
160
+ 5. We had 2 days pre-onsite + 2 days onsite to build it
161
+
162
+ The risk was real — a complete rewrite — but a "polished but boring" environment was guaranteed to lose.
163
+
164
+ ---
165
+
166
+ ## 5. Architecture Deep Dive
167
+
168
+ ### How ESCTR Works
169
+
170
+ The agent is presented with a **discrepancy alert** and must use 4 ERP tools to investigate:
171
+
172
+ ```
173
+ ┌─────────────────────────────────────────┐
174
+ │            ESCTR Environment            │
175
+ │                                         │
176
+ │  ┌──────────┐  ┌──────────┐  ┌────────┐ │
177
+ │  │ Purchase │  │ Shipping │  │  SLA   │ │
178
+ │  │  Orders  │  │   Logs   │  │Contract│ │
179
+ │  └────┬─────┘  └────┬─────┘  └───┬────┘ │
180
+ │       │             │            │      │
181
+ │  ┌────┴─────────────┴────────────┴────┐ │
182
+ │  │          Tool Dispatcher           │ │
183
+ │  │   query_database | read_document   │ │
184
+ │  │         communicate_vendor         │ │
185
+ │  │     submit_financial_decision      │ │
186
+ │  └─────────────────┬──────────────────┘ │
187
+ │                    │                    │
188
+ │  ┌─────────────────┴──────────────────┐ │
189
+ │  │           Grader Engine            │ │
190
+ │  │ R = α·outcome + β·trajectory − pen │ │
191
+ │  └────────────────────────────────────┘ │
192
+ └─────────────────────────────────────────┘
193
+ ```
194
+
195
+ ### The Three Tasks
196
+
197
+ **Task 1 — Procurement Reconciliation (Easy)**
198
+ - A vendor invoices at higher prices than contracted
199
+ - Agent must: Query PO → Query Invoice → Compare line items → Find overcharge → Submit correction
200
+ - Grading: Correct line item ID + exact adjustment amount = 1.0
201
+
202
+ **Task 2 — SLA Enforcement (Medium)**
203
+ - A shipment arrived late, vendor demands full payment
204
+ - Agent must: Query shipping logs → Discover delay → Query SLA contract → Calculate penalty per terms → Submit deduction
205
+ - Grading: exact penalty = 1.0; within 5% of the correct amount = 0.7; within 10% = 0.4
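A rough sketch of the Task 2 arithmetic and its tiered grading. The linear formula (subtotal × min(billable days × daily rate, cap), with grace days subtracted first) is the one the baseline agent's prompt describes. The function names, the one-cent tolerance for "exact", and the 0.0 score beyond 10% are our assumptions here, not confirmed grader behavior. Tiered SLA structures are not covered.

```python
def linear_sla_penalty(subtotal, delay_days, rate_per_day, cap, grace_days=0):
    """Penalty for a linear SLA: a daily rate on the subtotal, capped at `cap`."""
    billable_days = max(delay_days - grace_days, 0)  # grace period comes off first
    return round(subtotal * min(billable_days * rate_per_day, cap), 2)


def grade_penalty(submitted, expected, exact_tol=0.01):
    """Tiered score on penalty magnitudes: exact, within 5%, within 10%, else 0."""
    error = abs(submitted - expected)
    if error <= exact_tol:  # assumed tolerance for "exact"
        return 1.0
    relative = error / abs(expected) if expected else float("inf")
    if relative <= 0.05:
        return 0.7
    if relative <= 0.10:
        return 0.4
    return 0.0  # assumed: no credit beyond 10%


# 6 days late, 1 grace day, 1% per day, capped at 10% of a 10,000 subtotal.
expected = linear_sla_penalty(10_000, delay_days=6, rate_per_day=0.01, cap=0.10, grace_days=1)
print(expected, grade_penalty(480.0, expected))
```

The tiers mean an agent that forgets the grace day or rounds early still gets partial credit, while a guess far from the contractual amount gets none.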
206
+
207
+ **Task 3 — Adversarial Auditing (Hard)**
208
+ - Vendor disputes the late delivery, claims warehouse rejected shipment
209
+ - Agent must: Verify shipping delay → Get SLA terms → Query warehouse logs (prove dock was open) → Engage vendor → Reject settlement offer → Enforce full penalty
210
+ - Grading: Multi-axis — outcome (60%) + trajectory (40%) − gullibility penalty + evidence bonus
211
+
212
+ ### Procedural Generation
213
+
214
+ Every scenario is generated from a seed using deterministic randomization:
215
+ - **15 vendor profiles** with US addresses
216
+ - **15 buyer profiles** with realistic business names
217
+ - **20 products** across hardware, electrical, IT, machinery categories
218
+ - **5 SLA penalty structures** (linear and tiered)
219
+ - Same seed → identical scenario → reproducible evaluation
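The seed-to-scenario idea can be sketched in a few lines. The pools below are placeholders sized to match the counts above, and `generate_scenario`, the SLA names, and the `delay_days` range are all hypothetical; the real generator lives in `server/procedural.py` and is considerably richer.

```python
import random

# Placeholder pools sized like the real ones (15 vendors, 15 buyers, 20 products, 5 SLAs).
VENDORS = [f"Vendor {i}" for i in range(15)]
BUYERS = [f"Buyer {i}" for i in range(15)]
PRODUCTS = [f"Product {i}" for i in range(20)]
SLA_STRUCTURES = ["linear_a", "linear_b", "tiered_a", "tiered_b", "tiered_c"]  # made-up names


def generate_scenario(seed: int) -> dict:
    """Build a scenario from a seed using a private RNG, so runs are reproducible."""
    rng = random.Random(seed)  # does not touch the global random state
    return {
        "vendor": rng.choice(VENDORS),
        "buyer": rng.choice(BUYERS),
        "products": rng.sample(PRODUCTS, k=3),
        "sla": rng.choice(SLA_STRUCTURES),
        "delay_days": rng.randint(1, 14),  # assumed range
    }


print(generate_scenario(42) == generate_scenario(42))
```

Using a dedicated `random.Random(seed)` instance instead of `random.seed()` keeps one episode's generation from disturbing any other code that draws random numbers.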
220
+
221
+ ### The Vendor Negotiation System
222
+
223
+ Task 3 features a **3-phase adversarial vendor**:
224
+
225
+ 1. **Phase 1 — The Excuse**: Vendor claims your warehouse rejected delivery
226
+ 2. **Phase 2 — The Settlement Offer**: Vendor offers 40–55% of the penalty as a "goodwill credit"
227
+ 3. **Phase 3 — Concession or Persistence**: If the agent firmly rejects the offer and cites evidence, the vendor concedes
228
+
229
+ The agent is penalized −0.20 for **gullibility** (accepting the settlement) and rewarded +0.05 for **evidence citation** (mentioning warehouse logs in the adjustment reason).
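One way to picture the three phases is as a small state machine. This is a sketch under our own assumptions, not the vendor logic in `server/environment.py`: the class name, the reply strings, and the boolean flags are hypothetical, and the real vendor presumably inspects the agent's message text to decide whether it was a firm, evidence-backed rejection.

```python
import random


class VendorSketch:
    """Hypothetical 3-phase vendor: excuse, settlement offer, then concede or persist."""

    def __init__(self, penalty: float, seed: int = 0):
        rng = random.Random(seed)
        self.phase = 1
        # Settlement is 40-55% of the full penalty, fixed per episode by the seed.
        self.settlement = round(penalty * rng.uniform(0.40, 0.55), 2)

    def reply(self, rejects_offer: bool = False, cites_evidence: bool = False) -> str:
        if self.phase == 1:
            self.phase = 2
            return "excuse"  # claims the warehouse rejected delivery
        if self.phase == 2:
            self.phase = 3
            return f"settlement_offer:{self.settlement}"
        # Phase 3: concede only to a firm rejection backed by evidence.
        if rejects_offer and cites_evidence:
            return "concede"
        return "persist"


vendor = VendorSketch(penalty=1000.0, seed=7)
print(vendor.reply())
print(vendor.reply())
print(vendor.reply(rejects_offer=True, cites_evidence=True))
```

Because the settlement is drawn from a seeded RNG, the same episode always produces the same offer, which keeps grading deterministic.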
230
+
231
+ ---
232
+
233
+ ## 6. Reward Design
234
+
235
+ ### RLVR Principles Applied
236
+
237
+ Our reward design follows principles from the OLMo 3 technical report:
238
+
239
+ ```
240
+ R_total = α · R_outcome + β · R_trajectory − penalties
241
+ ```
242
+
243
+ **Why not just binary rewards?**
244
+ - Sparse rewards (0 or 1 at the end) make credit assignment very hard in 15–20 step episodes
245
+ - The agent can't tell which of its 15 actions contributed to success or failure
246
+ - Dense trajectory rewards act as "algorithmic breadcrumbs" guiding policy gradients
247
+
248
+ **Trajectory Milestones:**
249
+
250
+ | Milestone | Meaning |
251
+ |-----------|---------|
252
+ | `retrieved_po` | Agent queried the purchase order database |
253
+ | `retrieved_invoice` | Agent queried the invoice database |
254
+ | `retrieved_shipping` | Agent discovered the shipping delay |
255
+ | `retrieved_sla` | Agent found the penalty terms |
256
+ | `checked_warehouse` | Agent verified internal records |
257
+ | `vendor_negotiation` | Agent engaged with the adversarial vendor |
258
+ | `calculated_penalty` | Agent performed penalty arithmetic |
259
+
260
+ **Penalties:**
261
+ - Step cost: −0.005 per action (encourages efficiency)
262
+ - Hallucination: −0.02 for invalid queries or nonexistent documents
263
+ - Gullibility: −0.20 for accepting adversarial settlements (Task 3)
264
+
265
+ **Why These Specific Values?**
266
+ - Step cost is small enough that investigation is still rewarded
267
+ - Hallucination penalty is 4× the step cost — bad actions are much worse than slow actions
268
+ - Gullibility penalty is massive (−0.20) because accepting a fraudulent claim is the worst possible failure mode in financial auditing
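Putting the numbers together, the composite formula can be sketched as below. The penalty constants and the +0.05 evidence bonus are the values stated in this document, and α = 0.6, β = 0.4 mirror the Task 3 outcome/trajectory split. Treating trajectory as the fraction of milestones reached and clamping the result to [0, 1] are our assumptions for illustration; the actual graders in `server/graders.py` may weight things differently.

```python
STEP_COST = 0.005
HALLUCINATION_PENALTY = 0.02
GULLIBILITY_PENALTY = 0.20
EVIDENCE_BONUS = 0.05


def composite_reward(outcome, milestones_hit, milestones_total, steps,
                     invalid_actions=0, accepted_settlement=False,
                     cited_evidence=False, alpha=0.6, beta=0.4):
    """R = alpha * outcome + beta * trajectory - penalties (+ evidence bonus)."""
    # Assumed: trajectory score is the share of milestones the agent reached.
    trajectory = milestones_hit / milestones_total if milestones_total else 0.0
    reward = alpha * outcome + beta * trajectory
    reward -= STEP_COST * steps + HALLUCINATION_PENALTY * invalid_actions
    if accepted_settlement:
        reward -= GULLIBILITY_PENALTY
    if cited_evidence:
        reward += EVIDENCE_BONUS
    return max(0.0, min(1.0, reward))  # assumed clamp to [0, 1]


# Thorough but gullible: partial outcome, all milestones, accepted the settlement.
print(composite_reward(0.4, 5, 5, steps=8, accepted_settlement=True))
# Efficient but sloppy: correct outcome, 3 of 5 milestones, one invalid query.
print(composite_reward(1.0, 3, 5, steps=6, invalid_actions=1))
```

In the first case the single gullibility penalty costs as much as 40 extra steps, which is the asymmetry the values above are meant to create.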
269
+
270
+ ---
271
+
272
+ ## 7. What We Learned
273
+
274
+ ### Technical Lessons
275
+
276
+ 1. **Procedural generation is non-negotiable** for RL environments. Static corpora can be memorized quickly. Our engine generates a reproducible scenario from any seed.
277
+
278
+ 2. **Tool restriction per task** is important. Easy tasks shouldn't have tools the agent can't meaningfully use — it creates noise in the reward signal.
279
+
280
+ 3. **Adversarial dynamics create genuine difficulty.** A vendor that lies and offers settlements tests the agent's reasoning in ways static documents never can.
281
+
282
+ 4. **Composite rewards require careful balancing.** If trajectory reward is too high, agents learn to query everything without ever submitting. If too low, they learn to guess without investigating.
283
+
284
+ ### Strategic Lessons
285
+
286
+ 1. **Read the scoring rubric backwards.** Don't start with what you want to build — start with what gets scored highest and work backwards.
287
+
288
+ 2. **Innovation (40%) + Storytelling (30%) = 70%.** A technically perfect but boring environment loses to a messy but ambitious one with a great narrative.
289
+
290
+ 3. **The pivot was worth the risk.** Rewriting 1000+ lines of code in 2 days was aggressive, but staying with invoice extraction would have capped us at "top 10, not first."
291
+
292
+ 4. **Domain choice matters enormously.** Supply chain auditing is a multi-trillion-dollar problem that's underexplored in AI training — this gives us both novelty and real-world utility.
293
+
294
+ ---
295
+
296
+ ## Appendix: File History
297
+
298
+ | Phase | Files Created/Modified | Purpose |
299
+ |-------|----------------------|---------|
300
+ | Round 1 | `server/documents.py` (15 static docs) | Original invoice corpus |
301
+ | Round 1 | `server/graders.py` (fuzzy matching) | Text extraction grading |
302
+ | Enhancement | `server/procedural.py` v1 (invoice generator) | Infinite invoice variations |
303
+ | Enhancement | `server/environment.py` v1 (6 tools) | Multi-tool invoice extraction |
304
+ | **ESCTR Pivot** | `server/models.py` (ESCTRAction/Obs/State) | ERP tool schemas |
305
+ | **ESCTR Pivot** | `server/procedural.py` v2 (corporate graphs) | Supply chain scenario generation |
306
+ | **ESCTR Pivot** | `server/graders.py` v2 (3 task graders) | Deterministic multi-axis scoring |
307
+ | **ESCTR Pivot** | `server/environment.py` v2 (4 tools + vendor AI) | Full ESCTR environment |
308
+ | **ESCTR Pivot** | `inference.py` v2 (financial controller) | Baseline agent script |
309
+ | **ESCTR Pivot** | Removed `server/documents.py` | No longer needed |
inference.py CHANGED
@@ -1,18 +1,16 @@
1
  #!/usr/bin/env python3
2
  """
3
- Baseline inference script for the Invoice Extraction Environment.
4
 
5
- This script demonstrates how an LLM agent interacts with the environment
6
- to extract structured data from invoice documents. It runs all five tasks
7
- (simple_invoice, messy_invoice, multi_document, corrupted_scan, adversarial_invoice)
8
- and logs results in the mandatory OpenEnv [START]/[STEP]/[END] format.
9
 
10
  Required environment variables:
11
- API_BASE_URL — OpenAI-compatible API endpoint
12
- MODEL_NAME — Model identifier (e.g. meta-llama/Meta-Llama-3-8B-Instruct)
13
- HF_TOKEN — API key / Hugging Face token (no default)
14
- ENV_URL — URL of the running environment server (default: http://localhost:7860)
15
- LOCAL_IMAGE_NAME — (Optional) Docker image name for from_docker_image() style
16
  """
17
 
18
  import json
@@ -33,15 +31,9 @@ HF_TOKEN = os.getenv("HF_TOKEN")
33
  ENV_URL = os.getenv("ENV_URL", "http://localhost:7860")
34
  LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME")
35
 
36
- TASKS = ["simple_invoice", "messy_invoice", "multi_document", "corrupted_scan", "adversarial_invoice"]
37
- BENCHMARK = "invoice-extraction"
38
 
39
- # Tasks that support advanced multi-tool commands
40
- TOOL_ENABLED_TASKS = {"multi_document", "adversarial_invoice"}
41
-
42
- # ---------------------------------------------------------------------------
43
- # LLM Client
44
- # ---------------------------------------------------------------------------
45
  llm = OpenAI(base_url=API_BASE_URL, api_key=HF_TOKEN)
46
 
47
 
@@ -67,140 +59,121 @@ def env_reset(url: str, task_name: str, seed: int = 0) -> dict:
67
  return r.json()
68
 
69
 
70
- def env_step(url: str, command: str, payload: str = "") -> dict:
71
- r = requests.post(f"{url}/step", json={"action": {"command": command, "payload": payload}}, timeout=30)
72
  r.raise_for_status()
73
  return r.json()
74
 
75
 
76
  # ---------------------------------------------------------------------------
77
- # Logging helpers (strict OpenEnv format)
78
  # ---------------------------------------------------------------------------
79
 
80
  def log_start(task: str, model: str):
81
  print(f"[START] task={task} env={BENCHMARK} model={model}", flush=True)
82
 
83
-
84
  def log_step(step: int, action: str, reward: float, done: bool, error=None):
85
- error_val = error if error else "null"
86
- print(
87
- f"[STEP] step={step} action={action} reward={reward:.2f} "
88
- f"done={str(done).lower()} error={error_val}",
89
- flush=True,
90
- )
91
-
92
 
93
  def log_end(success: bool, steps: int, score: float, rewards: list):
94
- rewards_str = ",".join(f"{r:.2f}" for r in rewards)
95
- print(
96
- f"[END] success={str(success).lower()} steps={steps} "
97
- f"score={score:.2f} rewards={rewards_str}",
98
- flush=True,
99
- )
100
 
101
 
102
  # ---------------------------------------------------------------------------
103
- # LLM extraction logic
104
  # ---------------------------------------------------------------------------
105
 
106
- EXTRACT_PROMPT = """You are an expert data extraction assistant. Given the following document text, extract the specified fields and return ONLY a valid JSON object.
107
-
108
- DOCUMENT:
109
- {document}
110
-
111
- REQUIRED FIELDS:
112
- {fields}
113
-
114
- RULES:
115
- - Return ONLY a valid JSON object, no explanation or markdown
116
- - For dates, use YYYY-MM-DD format (e.g. 2024-01-15)
117
- - For monetary amounts, use plain numbers without currency symbols (e.g. 1134.00)
118
- - For line_items, use an array of objects with keys: description, quantity, unit_price, amount
119
- - If a field cannot be found, use null
120
- {task_specific_rules}
121
-
122
- IMPORTANT: Ensure your extracted subtotal + tax = total. Verify math consistency.
123
-
124
- JSON:"""
125
-
126
- TASK_RULES = {
127
- "simple_invoice": "",
128
- "messy_invoice": (
129
- "- This document uses informal formatting, abbreviations, and shorthand\n"
130
- "- Look past formatting irregularities to find the actual values\n"
131
- "- 'subtot', 's/t', 'sub' = subtotal; 'tx' = tax; 'amt due' = total"
132
- ),
133
- "multi_document": (
134
- "- This contains MULTIPLE document sections (PO, Invoice, Credit Memo, etc.)\n"
135
- "- Extract from the INVOICE section primarily\n"
136
- "- adjusted_total is the final amount after credits/payments\n"
137
- "- po_number is the purchase order reference number\n"
138
- "- adjustment_reason describes why the total was adjusted\n"
139
- "- Cross-reference PO with invoice for discrepancies"
140
- ),
141
- "corrupted_scan": (
142
- "- WARNING: This is an OCR-scanned document with character errors\n"
143
- "- Common OCR substitutions: 0<->O, 1<->l<->I, 5<->S, 8<->B\n"
144
- "- Mentally correct OCR errors to recover the true values\n"
145
- "- 'lNV' = 'INV', 'S' in numbers = '5', 'O' in numbers = '0'\n"
146
- "- Verify all numbers by cross-checking (qty * unit_price = amount)"
147
- ),
148
- "adversarial_invoice": (
149
- "- CAUTION: This document contains DECOY fields and contradictions\n"
150
- "- Multiple invoice numbers may appear use the CURRENT/ACTIVE one\n"
151
- "- If there is a reissue date, use that as the date (not the original)\n"
152
- "- subtotal is the ADJUSTED subtotal after any discounts\n"
153
- "- discount_amount is the monetary discount value\n"
154
- "- original_total is what the total WOULD have been without adjustments\n"
155
- "- discrepancy_notes: describe ALL discrepancies and adjustments\n"
156
- "- po_number: the purchase order reference if present, else null\n"
157
- "- Cross-reference different sections to find contradictions"
158
- ),
 
 
 
 
 
 
 
 
 
 
 
 
 
159
  }
160
 
161
- REFINE_PROMPT = """You previously extracted data from an invoice but some fields were incorrect.
162
-
163
- DOCUMENT:
164
- {document}
165
-
166
- YOUR PREVIOUS EXTRACTION:
167
- {previous}
168
-
169
- FIELDS NEEDING IMPROVEMENT: {weak_fields}
170
-
171
- FEEDBACK:
172
- {feedback}
173
-
174
- {extra_context}
175
-
176
- Please re-extract ALL fields and return ONLY a valid JSON object with corrections.
177
- Pay special attention to the fields listed above.
178
-
179
- RULES:
180
- - Return ONLY a valid JSON object, no explanation or markdown
181
- - For dates, use YYYY-MM-DD format
182
- - For monetary amounts, use plain numbers without currency symbols
183
- - For line_items, use an array of objects with keys: description, quantity, unit_price, amount
184
- - VERIFY: subtotal + tax should equal total
185
- {task_specific_rules}
186
-
187
- JSON:"""
188
 
 
 
 
189
 
190
- def call_llm(prompt: str) -> str:
191
  try:
192
  response = llm.chat.completions.create(
193
  model=MODEL_NAME,
194
- messages=[{"role": "user", "content": prompt}],
195
  temperature=0.1,
196
- max_tokens=2000,
197
  )
198
  return response.choices[0].message.content.strip()
199
  except Exception as e:
200
- return json.dumps({"error": str(e)})
201
 
202
 
203
- def extract_json_from_response(text: str) -> str:
 
 
204
  if "```json" in text:
205
  start = text.index("```json") + 7
206
  end = text.index("```", start)
@@ -219,113 +192,69 @@ def extract_json_from_response(text: str) -> str:
219
  elif text[i] == "}":
220
  depth -= 1
221
  if depth == 0:
222
- return text[brace_start : i + 1]
223
- return text
 
 
224
 
225
 
226
  # ---------------------------------------------------------------------------
227
- # Main inference loop
228
  # ---------------------------------------------------------------------------
229
 
230
  def run_task(env_url: str, task_name: str, seed: int = 0) -> float:
231
- """Run a single task and return the final score."""
232
  log_start(task=task_name, model=MODEL_NAME)
233
-
234
  rewards = []
235
  step_num = 0
236
  final_score = 0.0
237
 
 
 
 
 
 
 
 
238
  try:
239
- env_reset(env_url, task_name, seed=seed)
 
240
 
241
- # Step 1: View the document
242
- step_num += 1
243
- result = env_step(env_url, "view_document")
244
- document_text = result.get("observation", {}).get("text", "")
245
- reward = result.get("reward", 0.0) or 0.0
246
- done = result.get("done", False)
247
- rewards.append(reward)
248
- log_step(step_num, "view_document()", reward, done)
249
-
250
- # Step 2: View required fields
251
- step_num += 1
252
- result = env_step(env_url, "view_fields")
253
- required_fields = result.get("observation", {}).get("required_fields", [])
254
- reward = result.get("reward", 0.0) or 0.0
255
- done = result.get("done", False)
256
- rewards.append(reward)
257
- log_step(step_num, "view_fields()", reward, done)
258
-
259
- # Step 2.5: For tool-enabled tasks, gather extra context
260
- extra_context = ""
261
- if task_name in TOOL_ENABLED_TASKS:
262
- step_num += 1
263
- result = env_step(env_url, "query_related_documents")
264
- related_text = result.get("observation", {}).get("text", "")
265
- reward = result.get("reward", 0.0) or 0.0
266
- rewards.append(reward)
267
- log_step(step_num, "query_related_documents()", reward, False)
268
- extra_context += f"\nRELATED DOCUMENTS:\n{related_text}\n"
269
 
 
270
  step_num += 1
271
- result = env_step(env_url, "check_discrepancies")
272
- discrep_text = result.get("observation", {}).get("text", "")
 
 
 
 
 
 
 
 
 
273
  reward = result.get("reward", 0.0) or 0.0
 
 
 
 
 
274
  rewards.append(reward)
275
- log_step(step_num, "check_discrepancies()", reward, False)
276
- extra_context += f"\nDISCREPANCY HINTS:\n{discrep_text}\n"
277
-
278
- # Step 3: LLM extraction
279
- fields_str = "\n".join(f"- {f}" for f in required_fields)
280
- task_rules = TASK_RULES.get(task_name, "")
281
- prompt = EXTRACT_PROMPT.format(
282
- document=document_text + extra_context,
283
- fields=fields_str,
284
- task_specific_rules=task_rules,
285
- )
286
- llm_response = call_llm(prompt)
287
- extracted_json = extract_json_from_response(llm_response)
288
 
289
- # Step 4: Submit extraction
290
- step_num += 1
291
- result = env_step(env_url, "extract", extracted_json)
292
- reward = result.get("reward", 0.0) or 0.0
293
- done = result.get("done", False)
294
- obs = result.get("observation", {})
295
- rewards.append(reward)
296
- log_step(step_num, "submit_extraction()", reward, done)
297
- final_score = reward
298
-
299
- # If not done and score < 0.9, refine
300
- if not done and reward < 0.9:
301
- step_num += 1
302
- fb_result = env_step(env_url, "get_feedback")
303
- feedback_text = fb_result.get("observation", {}).get("text", "")
304
- fb_reward = fb_result.get("reward", 0.0) or 0.0
305
- rewards.append(fb_reward)
306
- log_step(step_num, "get_feedback()", fb_reward, False)
307
-
308
- field_scores = obs.get("metadata", {}).get("field_scores", {})
309
- weak_fields = [f for f, s in field_scores.items() if s < 0.8]
310
-
311
- refine_prompt = REFINE_PROMPT.format(
312
- document=document_text,
313
- previous=extracted_json,
314
- weak_fields=", ".join(weak_fields) if weak_fields else "all fields",
315
- feedback=feedback_text,
316
- extra_context=extra_context,
317
- task_specific_rules=task_rules,
318
- )
319
- refined_response = call_llm(refine_prompt)
320
- refined_json = extract_json_from_response(refined_response)
321
 
322
- step_num += 1
323
- result2 = env_step(env_url, "extract", refined_json)
324
- reward2 = result2.get("reward", 0.0) or 0.0
325
- done = result2.get("done", False)
326
- rewards.append(reward2)
327
- log_step(step_num, "submit_refined_extraction()", reward2, done)
328
- final_score = max(final_score, reward2)
329
 
330
  except Exception as e:
331
  step_num += 1
@@ -338,51 +267,52 @@ def run_task(env_url: str, task_name: str, seed: int = 0) -> float:
338
  return final_score
339
 
340
 
 
 
 
 
341
  def main():
342
  global ENV_URL
343
  container_id = None
344
 
345
  if LOCAL_IMAGE_NAME:
346
- print(f"Starting Docker container from image: {LOCAL_IMAGE_NAME}")
347
  try:
348
  container_id = subprocess.check_output(
349
  ["docker", "run", "-d", "--rm", "-p", "7860:7860", LOCAL_IMAGE_NAME],
350
- stderr=subprocess.STDOUT,
351
  ).decode().strip()
352
  ENV_URL = "http://localhost:7860"
353
- print(f"Container started: {container_id[:12]}")
354
  except Exception as e:
355
- print(f"Failed to start Docker container: {e}")
356
  sys.exit(1)
357
 
358
  print(f"Waiting for environment at {ENV_URL} ...")
359
  if not env_health(ENV_URL):
360
- print("ERROR: Environment failed to become healthy")
361
  if container_id:
362
  subprocess.run(["docker", "stop", container_id], capture_output=True)
363
  sys.exit(1)
364
- print("Environment is healthy!\n")
365
 
366
  scores = {}
367
- for task_name in TASKS:
368
- score = run_task(ENV_URL, task_name, seed=42)
369
- scores[task_name] = score
370
  print()
371
 
372
- avg_score = sum(scores.values()) / len(scores) if scores else 0.0
373
  print("=" * 50)
374
- print("SUMMARY")
375
  print("=" * 50)
376
- for task, score in scores.items():
377
- print(f" {task}: {score:.2f}")
378
- print(f" Average: {avg_score:.2f}")
379
  print("=" * 50)
380
 
381
  if container_id:
382
- print(f"Stopping container {container_id[:12]} ...")
383
  subprocess.run(["docker", "stop", container_id], capture_output=True)
384
 
385
- return 0 if avg_score > 0 else 1
386
 
387
 
388
  if __name__ == "__main__":
 
1
  #!/usr/bin/env python3
2
  """
3
+ Baseline inference script for the ESCTR Environment.
4
 
5
+ Demonstrates how an LLM agent interacts with the enterprise supply chain
6
+ environment to investigate discrepancies, enforce SLA penalties, and
7
+ navigate adversarial vendor disputes.
 
8
 
9
  Required environment variables:
10
+ API_BASE_URL — OpenAI-compatible API endpoint
11
+ MODEL_NAME — Model identifier (e.g. meta-llama/Meta-Llama-3-8B-Instruct)
12
+ HF_TOKEN — API key
13
+ ENV_URL — Environment server URL (default: http://localhost:7860)
 
14
  """
15
 
16
  import json
 
31
  ENV_URL = os.getenv("ENV_URL", "http://localhost:7860")
32
  LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME")
33
 
34
+ TASKS = ["procurement_reconciliation", "sla_enforcement", "adversarial_auditing"]
35
+ BENCHMARK = "esctr"
36
 
 
 
 
 
 
 
37
  llm = OpenAI(base_url=API_BASE_URL, api_key=HF_TOKEN)
38
 
39
 
 
59
  return r.json()
60
 
61
 
62
+ def env_step(url: str, action: dict) -> dict:
63
+ r = requests.post(f"{url}/step", json={"action": action}, timeout=30)
64
  r.raise_for_status()
65
  return r.json()
66
 
67
 
68
  # ---------------------------------------------------------------------------
69
+ # Logging (strict OpenEnv format)
70
  # ---------------------------------------------------------------------------
71
 
72
  def log_start(task: str, model: str):
73
  print(f"[START] task={task} env={BENCHMARK} model={model}", flush=True)
74
 
 
75
  def log_step(step: int, action: str, reward: float, done: bool, error=None):
76
+ err = error if error else "null"
77
+ print(f"[STEP] step={step} action={action} reward={reward:.2f} done={str(done).lower()} error={err}", flush=True)
 
 
 
 
 
78
 
79
  def log_end(success: bool, steps: int, score: float, rewards: list):
80
+ rstr = ",".join(f"{r:.2f}" for r in rewards)
81
+ print(f"[END] success={str(success).lower()} steps={steps} score={score:.2f} rewards={rstr}", flush=True)
 
 
 
 
82
 
83
 
84
  # ---------------------------------------------------------------------------
85
+ # System prompts per task
86
  # ---------------------------------------------------------------------------
87
 
88
+ SYSTEM_PROMPT_BASE = """You are an autonomous Financial Controller AI agent operating in an Enterprise Supply Chain environment. You must investigate discrepancies, verify documents, and submit precise financial adjustments.
89
+
90
+ AVAILABLE TOOLS:
91
+ {tools}
92
+
93
+ RESPONSE FORMAT:
94
+ You must respond with a SINGLE valid JSON object — NO explanation, NO markdown.
95
+ The JSON must have these fields:
96
+ - "action_type": one of the available tool names
97
+ - Additional fields depending on the action:
98
+ - For "query_database": include "query_parameters": {{"table": "<table_name>"}}
99
+ - For "read_document": include "document_id": "<id>"
100
+ - For "communicate_vendor": include "message_content": "<your message>"
101
+ - For "submit_financial_decision": include "adjustment_amount": <number> and "adjustment_reason": "<explanation>"
102
+
103
+ CRITICAL RULES:
104
+ - ALWAYS query databases and read documents BEFORE submitting a decision
105
+ - Calculate amounts precisely — use exact arithmetic
106
+ - adjustment_amount should be NEGATIVE to reduce the invoice payment
107
+ - Respond ONLY with JSON, nothing else"""
108
+
109
+ TASK_INSTRUCTIONS = {
110
+ "procurement_reconciliation": """
111
+ TASK: Procurement Reconciliation (Easy)
112
+ A pricing discrepancy exists between a Purchase Order and a Vendor Invoice.
113
+
114
+ STRATEGY:
115
+ 1. Query "purchase_orders" to find the PO
116
+ 2. Query "invoices" to find the invoice
117
+ 3. Read both documents using read_document with their IDs
118
+ 4. Compare line-by-line: find the item where invoiced price > contracted price
119
+ 5. Calculate the overcharge: (invoiced_total - contracted_total) for that line item
120
+ 6. Submit with adjustment_amount = -(overcharge amount)
121
+
122
+ Available tables: purchase_orders, invoices""",
123
+
124
+ "sla_enforcement": """
125
+ TASK: SLA Enforcement (Medium)
126
+ A vendor demands full payment but the shipment was delivered late.
127
+
128
+ STRATEGY:
129
+ 1. Query "shipping_logs" to check delivery timing and find delay days
130
+ 2. Query "sla_contracts" to find late delivery penalty terms
131
+ 3. Read the SLA document for exact penalty rates and caps
132
+ 4. Calculate: penalty = invoice_subtotal × min(delay_days × rate_per_day, cap)
133
+ - If there's a grace period, subtract grace days from delay first
134
+ 5. Submit with adjustment_amount = -(penalty amount)
135
+
136
+ Available tables: purchase_orders, invoices, shipping_logs, sla_contracts""",
137
+
138
+ "adversarial_auditing": """
139
+ TASK: Adversarial Auditing (Hard)
140
+ A vendor disputes a late delivery claim, blaming your warehouse. You must prove them wrong.
141
+
142
+ STRATEGY:
143
+ 1. Query "shipping_logs" to confirm the delivery was late
144
+ 2. Query "sla_contracts" for penalty terms
145
+ 3. Query "warehouse_logs" to verify your dock was OPEN during delivery
146
+ 4. Use "communicate_vendor" to engage — they will make excuses then offer a settlement
147
+ 5. REJECT the settlement — enforce the FULL penalty
148
+ 6. Cite warehouse access logs as evidence in your final reason
149
+ 7. Calculate exact penalty from SLA terms and submit
150
+
151
+ CRITICAL: Do NOT accept any settlement offer! Enforce the full contractual penalty.
152
+
153
+ Available tables: purchase_orders, invoices, shipping_logs, sla_contracts, warehouse_logs""",
154
  }
155
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
156
 
157
+ # ---------------------------------------------------------------------------
158
+ # LLM helpers
159
+ # ---------------------------------------------------------------------------
160
 
161
+ def call_llm(messages: list) -> str:
162
  try:
163
  response = llm.chat.completions.create(
164
  model=MODEL_NAME,
165
+ messages=messages,
166
  temperature=0.1,
167
+ max_tokens=1000,
168
  )
169
  return response.choices[0].message.content.strip()
170
  except Exception as e:
171
+ return json.dumps({"action_type": "query_database", "query_parameters": {"table": "purchase_orders"}})
172
 
173
 
174
+ def parse_action(text: str) -> dict:
175
+ """Extract a JSON action from LLM response."""
176
+ # Try to find JSON in response
177
  if "```json" in text:
178
  start = text.index("```json") + 7
179
  end = text.index("```", start)
 
192
  elif text[i] == "}":
193
  depth -= 1
194
  if depth == 0:
195
+ text = text[brace_start:i + 1]
196
+ break
197
+
198
+ return json.loads(text)
199
 
200
 
201
  # ---------------------------------------------------------------------------
202
+ # Task runner
203
  # ---------------------------------------------------------------------------
204
 
205
  def run_task(env_url: str, task_name: str, seed: int = 0) -> float:
 
206
  log_start(task=task_name, model=MODEL_NAME)
 
207
  rewards = []
208
  step_num = 0
209
  final_score = 0.0
210
 
211
+ tools = ["query_database", "read_document", "submit_financial_decision"]
212
+ if task_name == "adversarial_auditing":
213
+ tools.insert(2, "communicate_vendor")
214
+
215
+ system_prompt = SYSTEM_PROMPT_BASE.format(tools=", ".join(tools))
216
+ system_prompt += TASK_INSTRUCTIONS.get(task_name, "")
217
+
218
  try:
219
+ reset_data = env_reset(env_url, task_name, seed)
220
+ briefing = reset_data.get("observation", {}).get("system_response", "")
221
 
222
+ messages = [
223
+ {"role": "system", "content": system_prompt},
224
+ {"role": "user", "content": f"ENVIRONMENT BRIEFING:\n{briefing}\n\nBegin your investigation. Respond with a JSON action."},
225
+ ]
226
+
227
+ max_steps = {"procurement_reconciliation": 10, "sla_enforcement": 15, "adversarial_auditing": 20}.get(task_name, 15)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
228
 
229
+ for _ in range(max_steps):
230
  step_num += 1
231
+
232
+ # Get LLM action
233
+ llm_response = call_llm(messages)
234
+ try:
235
+ action = parse_action(llm_response)
236
+ except (json.JSONDecodeError, ValueError):
237
+ action = {"action_type": "query_database", "query_parameters": {"table": "purchase_orders"}}
238
+
239
+ # Execute action
240
+ action_str = json.dumps(action, separators=(",", ":"))
241
+ result = env_step(env_url, action)
242
  reward = result.get("reward", 0.0) or 0.0
243
+ done = result.get("done", False)
244
+ obs = result.get("observation", {})
245
+ response_text = obs.get("system_response", "")
246
+ error = obs.get("error_message")
247
+
248
  rewards.append(reward)
249
+ log_step(step_num, action_str, reward, done, error)
 
 
 
 
 
 
 
 
 
 
 
 
250
 
251
+ if done:
252
+ final_score = reward
253
+ break
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
254
 
255
+ # Append to conversation
256
+ messages.append({"role": "assistant", "content": llm_response})
257
+ messages.append({"role": "user", "content": f"ENVIRONMENT RESPONSE:\n{response_text}\n\nContinue your investigation. Respond with your next JSON action."})
 
 
 
 
258
 
259
  except Exception as e:
260
  step_num += 1
 
267
  return final_score
268
 
269
 
270
+ # ---------------------------------------------------------------------------
271
+ # Main
272
+ # ---------------------------------------------------------------------------
273
+
274
  def main():
275
  global ENV_URL
276
  container_id = None
277
 
278
  if LOCAL_IMAGE_NAME:
279
+ print(f"Starting Docker container: {LOCAL_IMAGE_NAME}")
280
  try:
281
  container_id = subprocess.check_output(
282
  ["docker", "run", "-d", "--rm", "-p", "7860:7860", LOCAL_IMAGE_NAME],
283
+ stderr=subprocess.STDOUT
284
  ).decode().strip()
285
  ENV_URL = "http://localhost:7860"
 
286
  except Exception as e:
287
+ print(f"Docker start failed: {e}")
288
  sys.exit(1)
289
 
290
  print(f"Waiting for environment at {ENV_URL} ...")
291
  if not env_health(ENV_URL):
292
+ print("ERROR: Environment not healthy")
293
  if container_id:
294
  subprocess.run(["docker", "stop", container_id], capture_output=True)
295
  sys.exit(1)
296
+ print("Environment healthy!\n")
297
 
298
  scores = {}
299
+ for task in TASKS:
300
+ scores[task] = run_task(ENV_URL, task, seed=42)
 
301
  print()
302
 
303
+ avg = sum(scores.values()) / len(scores) if scores else 0.0
304
  print("=" * 50)
305
+ print("ESCTR INFERENCE SUMMARY")
306
  print("=" * 50)
307
+ for t, s in scores.items():
308
+ print(f" {t}: {s:.2f}")
309
+ print(f" Average: {avg:.2f}")
310
  print("=" * 50)
311
 
312
  if container_id:
 
313
  subprocess.run(["docker", "stop", container_id], capture_output=True)
314
 
315
+ return 0 if avg > 0 else 1
316
 
317
 
318
  if __name__ == "__main__":
openenv.yaml CHANGED
@@ -1,23 +1,20 @@
1
  spec_version: 1
2
- name: invoice_extraction_env
3
  type: space
4
  runtime: fastapi
5
  app: server.app:app
6
  port: 7860
7
 
8
  tasks:
9
- - name: simple_invoice
10
  difficulty: easy
11
- description: "Clean, well-formatted invoices with clear field labels"
12
- - name: messy_invoice
 
13
  difficulty: medium
14
- description: "Messy invoices with abbreviations, typos, and non-standard layouts"
15
- - name: multi_document
 
16
  difficulty: hard
17
- description: "Multi-section documents requiring cross-referencing PO, invoice, and credit memos"
18
- - name: corrupted_scan
19
- difficulty: very_hard
20
- description: "OCR-scanned invoices with systematic character errors"
21
- - name: adversarial_invoice
22
- difficulty: expert
23
- description: "Adversarial documents with decoy fields, hidden calculations, and contradictions"
 
1
  spec_version: 1
2
+ name: esctr_environment
3
  type: space
4
  runtime: fastapi
5
  app: server.app:app
6
  port: 7860
7
 
8
  tasks:
9
+ - name: procurement_reconciliation
10
  difficulty: easy
11
+ max_steps: 10
12
+ description: "Identify overcharged line items between PO and Invoice, calculate exact overcharge"
13
+ - name: sla_enforcement
14
  difficulty: medium
15
+ max_steps: 15
16
+ description: "Calculate late delivery penalties from shipping logs and SLA contract terms"
17
+ - name: adversarial_auditing
18
  difficulty: hard
19
+ max_steps: 20
20
+ description: "Navigate vendor disputes, verify warehouse logs, reject settlements, enforce full penalties"
 
 
 
 
 
server/__init__.py CHANGED
@@ -1,5 +1,5 @@
- """Invoice Extraction Environment — Server package."""
+ """Enterprise Supply Chain & Tax Reconciliation Environment — Server package."""

- from .environment import InvoiceExtractionEnvironment
+ from .environment import ESCTREnvironment

- __all__ = ["InvoiceExtractionEnvironment"]
+ __all__ = ["ESCTREnvironment"]
server/app.py CHANGED
@@ -1,8 +1,8 @@
  """
- FastAPI application for the Invoice Extraction Environment.
+ FastAPI application for the ESCTR Environment.

- Exposes the environment over HTTP and WebSocket endpoints
- compatible with the OpenEnv client protocol.
+ Exposes the Enterprise Supply Chain & Tax Reconciliation environment
+ over HTTP and WebSocket endpoints compatible with the OpenEnv spec.
  """

  import json
@@ -13,20 +13,20 @@ from fastapi import FastAPI, WebSocket, WebSocketDisconnect
  from fastapi.responses import JSONResponse
  from pydantic import BaseModel

- from .models import InvoiceAction, InvoiceObservation, InvoiceState
- from .environment import InvoiceExtractionEnvironment
+ from .models import ESCTRAction, ESCTRObservation, ESCTRState
+ from .environment import ESCTREnvironment

  logger = logging.getLogger(__name__)


  # ---------------------------------------------------------------------------
- # Request / Response models (OpenEnv-compatible)
+ # Request / Response models
  # ---------------------------------------------------------------------------

  class ResetRequest(BaseModel):
      seed: Optional[int] = None
      episode_id: Optional[str] = None
-     task_name: str = "simple_invoice"
+     task_name: str = "procurement_reconciliation"

      class Config:
          extra = "allow"
@@ -48,14 +48,13 @@ class HealthResponse(BaseModel):
  # Helpers
  # ---------------------------------------------------------------------------

- def _obs_to_response(obs: InvoiceObservation) -> dict:
-     """Convert an InvoiceObservation to a step/reset response dict."""
+ def _obs_to_response(obs: ESCTRObservation) -> dict:
      obs_dict = obs.model_dump()
-     reward = obs_dict.pop("reward", None)
+     reward = obs_dict.pop("reward", 0.0)
      done = obs_dict.pop("done", False)
      return {
          "observation": obs_dict,
-         "reward": reward if reward is not None else 0.0,
+         "reward": reward,
          "done": done,
      }

@@ -64,19 +63,19 @@ def _obs_to_response(obs: InvoiceObservation) -> dict:
  # Application factory
  # ---------------------------------------------------------------------------

- def create_invoice_app() -> FastAPI:
-     """Create and configure the FastAPI application."""
-
+ def create_app() -> FastAPI:
      app = FastAPI(
-         title="Invoice Extraction Environment",
-         description="OpenEnv environment for extracting structured data from invoices",
-         version="0.1.0",
+         title="ESCTR Environment",
+         description=(
+             "Enterprise Supply Chain & Tax Reconciliation — an OpenEnv environment "
+             "for training LLMs to investigate discrepancies, enforce SLA penalties, "
+             "and navigate adversarial vendor disputes."
+         ),
+         version="1.0.0",
      )

-     # Global environment instance for HTTP endpoints
-     _env = InvoiceExtractionEnvironment()
+     _env = ESCTREnvironment()

-     # === Health check ===
      @app.get("/health")
      def health():
          return HealthResponse()
@@ -84,24 +83,22 @@ def create_invoice_app() -> FastAPI:
      @app.get("/")
      def root():
          return {
-             "name": "invoice_extraction_env",
-             "version": "0.1.0",
+             "name": "esctr_environment",
+             "version": "1.0.0",
              "status": "running",
-             "endpoints": ["/health", "/reset", "/step", "/state", "/schema", "/ws"],
+             "endpoints": ["/health", "/reset", "/step", "/state", "/schema", "/metadata", "/ws"],
          }

-     # === Reset ===
      @app.post("/reset")
      def reset(request: ResetRequest = ResetRequest()):
          kwargs = request.model_dump(exclude_unset=True)
          obs = _env.reset(**kwargs)
          return _obs_to_response(obs)

-     # === Step ===
      @app.post("/step")
      def step(request: StepRequest):
          try:
-             action = InvoiceAction(**request.action)
+             action = ESCTRAction(**request.action)
          except Exception as e:
              return JSONResponse(
                  status_code=422,
@@ -110,58 +107,52 @@ def create_invoice_app() -> FastAPI:
          obs = _env.step(action, timeout_s=request.timeout_s)
          return _obs_to_response(obs)

-     # === State ===
      @app.get("/state")
      def get_state():
          return _env.state.model_dump()

-     # === Schema ===
      @app.get("/schema")
      def get_schema():
          return {
-             "action": InvoiceAction.model_json_schema(),
-             "observation": InvoiceObservation.model_json_schema(),
-             "state": InvoiceState.model_json_schema(),
+             "action": ESCTRAction.model_json_schema(),
+             "observation": ESCTRObservation.model_json_schema(),
+             "state": ESCTRState.model_json_schema(),
          }

-     # === Metadata ===
      @app.get("/metadata")
      def get_metadata():
          return {
-             "name": "invoice_extraction_env",
+             "name": "esctr_environment",
              "description": (
-                 "An environment for extracting structured data from unstructured "
-                 "invoice and receipt documents. Features 5 difficulty tiers from "
-                 "clean invoices to adversarial documents with decoy fields, OCR "
-                 "corruption, and hidden calculations. Includes procedural document "
-                 "generation for infinite training configurations, RLVR-inspired "
-                 "composite reward architecture with trajectory milestones, and "
-                 "multi-tool agentic workflow for complex tasks."
+                 "Enterprise Supply Chain & Tax Reconciliation: an environment where "
+                 "an LLM agent operates as an autonomous financial controller, investigating "
+                 "procurement discrepancies, enforcing SLA penalties from shipping delays, "
+                 "and navigating adversarial vendor disputes. Features procedural generation "
+                 "for infinite scenarios, RLVR composite rewards, and multi-tool agentic workflow."
              ),
-             "version": "0.3.0",
-             "features": [
-                 "procedural_document_generation",
-                 "rlvr_composite_rewards",
-                 "multi_tool_workflow",
-                 "weighted_field_scoring",
-                 "cross_field_verification",
+             "version": "1.0.0",
+             "themes": [
+                 "World Modeling — Professional Tasks",
+                 "Long-Horizon Planning & Instruction Following",
+                 "Multi-Agent Interactions (adversarial vendor)",
              ],
              "tasks": [
-                 {"name": "simple_invoice", "difficulty": "easy", "attempts": 3},
-                 {"name": "messy_invoice", "difficulty": "medium", "attempts": 3},
-                 {"name": "multi_document", "difficulty": "hard", "attempts": 5,
-                  "tools": ["query_related_documents", "verify_calculations", "check_discrepancies"]},
-                 {"name": "corrupted_scan", "difficulty": "very_hard", "attempts": 4},
-                 {"name": "adversarial_invoice", "difficulty": "expert", "attempts": 6,
-                  "tools": ["query_related_documents", "verify_calculations", "check_discrepancies"]},
+                 {"name": "procurement_reconciliation", "difficulty": "easy", "max_steps": 10,
+                  "description": "Identify overcharged line items between PO and Invoice"},
+                 {"name": "sla_enforcement", "difficulty": "medium", "max_steps": 15,
+                  "description": "Calculate late delivery penalties from shipping logs and SLA contracts"},
+                 {"name": "adversarial_auditing", "difficulty": "hard", "max_steps": 20,
+                  "description": "Navigate vendor disputes, verify warehouse logs, reject settlement offers"},
+             ],
+             "tools": [
+                 "query_database", "read_document", "communicate_vendor", "submit_financial_decision",
              ],
          }

-     # === WebSocket (for persistent sessions) ===
      @app.websocket("/ws")
      async def websocket_endpoint(websocket: WebSocket):
          await websocket.accept()
-         ws_env = InvoiceExtractionEnvironment()
+         ws_env = ESCTREnvironment()
          logger.info("WebSocket session opened")

          try:
@@ -181,19 +172,13 @@ def create_invoice_app() -> FastAPI:

                  if msg_type == "reset":
                      obs = ws_env.reset(**msg_data)
-                     await websocket.send_json({
-                         "type": "observation",
-                         "data": _obs_to_response(obs),
-                     })
+                     await websocket.send_json({"type": "observation", "data": _obs_to_response(obs)})

                  elif msg_type == "step":
                      try:
-                         action = InvoiceAction(**msg_data)
+                         action = ESCTRAction(**msg_data)
                          obs = ws_env.step(action)
-                         await websocket.send_json({
-                             "type": "observation",
-                             "data": _obs_to_response(obs),
-                         })
+                         await websocket.send_json({"type": "observation", "data": _obs_to_response(obs)})
                      except Exception as e:
                          await websocket.send_json({
                              "type": "error",
@@ -201,10 +186,7 @@ def create_invoice_app() -> FastAPI:
                          })

                  elif msg_type == "state":
-                     await websocket.send_json({
-                         "type": "state",
-                         "data": ws_env.state.model_dump(),
-                     })
+                     await websocket.send_json({"type": "state", "data": ws_env.state.model_dump()})

                  elif msg_type == "close":
                      break
@@ -212,10 +194,7 @@ def create_invoice_app() -> FastAPI:
                  else:
                      await websocket.send_json({
                          "type": "error",
-                         "data": {
-                             "message": f"Unknown message type: {msg_type}",
-                             "code": "UNKNOWN_TYPE",
-                         },
+                         "data": {"message": f"Unknown message type: {msg_type}", "code": "UNKNOWN_TYPE"},
                      })

          except WebSocketDisconnect:
@@ -229,12 +208,10 @@ def create_invoice_app() -> FastAPI:
      return app


- # Create the app instance
- app = create_invoice_app()
+ app = create_app()


  def main():
-     """Entry point for `uv run server` / `[project.scripts]`."""
      import uvicorn
      uvicorn.run("server.app:app", host="0.0.0.0", port=7860)
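The `_obs_to_response` change in this file swaps an explicit `None` check for a `pop` default. A minimal sketch of the difference, using plain dictionaries in place of `ESCTRObservation.model_dump()`: the function names and sample payloads below are hypothetical, and whether the real observation model can ever carry `reward=None` is not visible in this diff. The only fact relied on is that `dict.pop(key, default)` returns the default only when the key is absent.

```python
from typing import Any, Dict


def obs_to_response(obs_dict: Dict[str, Any]) -> Dict[str, Any]:
    """Shape a response the way the new helper does (pop with a 0.0 default)."""
    payload = dict(obs_dict)  # copy so the caller's dict is not mutated
    reward = payload.pop("reward", 0.0)
    done = payload.pop("done", False)
    return {"observation": payload, "reward": reward, "done": done}


def obs_to_response_none_safe(obs_dict: Dict[str, Any]) -> Dict[str, Any]:
    """Shape a response the way the old helper did (explicit None check)."""
    payload = dict(obs_dict)
    reward = payload.pop("reward", None)
    done = payload.pop("done", False)
    return {
        "observation": payload,
        "reward": 0.0 if reward is None else reward,
        "done": done,
    }


if __name__ == "__main__":
    # Hypothetical observation whose reward field is present but unset.
    sample = {"message": "PO loaded", "reward": None, "done": False}
    print(obs_to_response(sample)["reward"])
    print(obs_to_response_none_safe(sample)["reward"])
```

If the observation model declares `reward` as a non-optional float, the two variants behave identically; the distinction only matters when the field may be present with a `None` value.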
 
server/documents.py DELETED
@@ -1,898 +0,0 @@
1
- """
2
- Document corpus for the Invoice Extraction Environment.
3
-
4
- Contains synthetic but realistic invoice/receipt documents across 3 difficulty levels.
5
- Each document has raw text and ground truth extraction targets.
6
- """
7
-
8
- DOCUMENTS = {
9
- # =========================================================================
10
- # SIMPLE INVOICES — Clean formatting, clear labels, consistent structure
11
- # =========================================================================
12
- "simple_invoice": [
13
- {
14
- "id": "simple_001",
15
- "text": """INVOICE
16
-
17
- Invoice Number: INV-2024-001
18
- Date: January 15, 2024
19
-
20
- From:
21
- Acme Corporation
22
- 123 Business Avenue
23
- New York, NY 10001
24
-
25
- Bill To:
26
- Widget Co.
27
- 456 Commerce Street
28
- Chicago, IL 60601
29
-
30
- Description Qty Unit Price Amount
31
- ---------------------------------------------------------
32
- Widget Type A 10 $25.00 $250.00
33
- Widget Type B 5 $40.00 $200.00
34
- Consulting Service 8 $75.00 $600.00
35
-
36
- Subtotal: $1,050.00
37
- Tax (8%): $84.00
38
- Total: $1,134.00
39
-
40
- Payment Terms: Net 30
41
- """,
42
- "ground_truth": {
43
- "invoice_number": "INV-2024-001",
44
- "date": "2024-01-15",
45
- "vendor_name": "Acme Corporation",
46
- "customer_name": "Widget Co.",
47
- "subtotal": 1050.00,
48
- "tax": 84.00,
49
- "total": 1134.00,
50
- "line_items": [
51
- {"description": "Widget Type A", "quantity": 10, "unit_price": 25.00, "amount": 250.00},
52
- {"description": "Widget Type B", "quantity": 5, "unit_price": 40.00, "amount": 200.00},
53
- {"description": "Consulting Service", "quantity": 8, "unit_price": 75.00, "amount": 600.00},
54
- ],
55
- },
56
- },
57
- {
58
- "id": "simple_002",
59
- "text": """INVOICE
60
-
61
- Invoice #: TS-5892
62
- Invoice Date: March 3, 2024
63
-
64
- Vendor:
65
- TechStart Solutions LLC
66
- 890 Innovation Drive, Suite 200
67
- San Francisco, CA 94105
68
-
69
- Customer:
70
- DataFlow Inc.
71
- 321 Analytics Blvd
72
- Austin, TX 78701
73
-
74
- Item Qty Unit Price Total
75
- ----------------------------------------------------------
76
- Cloud Hosting (Monthly) 1 $450.00 $450.00
77
- API Integration Setup 1 $1,200.00 $1,200.00
78
- Technical Support (hours) 12 $95.00 $1,140.00
79
-
80
- Subtotal: $2,790.00
81
- Tax (7%): $195.30
82
- Total: $2,985.30
83
-
84
- Due Date: April 2, 2024
85
- """,
86
- "ground_truth": {
87
- "invoice_number": "TS-5892",
88
- "date": "2024-03-03",
89
- "vendor_name": "TechStart Solutions LLC",
90
- "customer_name": "DataFlow Inc.",
91
- "subtotal": 2790.00,
92
- "tax": 195.30,
93
- "total": 2985.30,
94
- "line_items": [
95
- {"description": "Cloud Hosting (Monthly)", "quantity": 1, "unit_price": 450.00, "amount": 450.00},
96
- {"description": "API Integration Setup", "quantity": 1, "unit_price": 1200.00, "amount": 1200.00},
97
- {"description": "Technical Support (hours)", "quantity": 12, "unit_price": 95.00, "amount": 1140.00},
98
- ],
99
- },
100
- },
101
- {
102
- "id": "simple_003",
103
- "text": """INVOICE
104
-
105
- Invoice Number: GS-2024-0147
106
- Date: February 20, 2024
107
-
108
- From:
109
- Global Supplies Inc.
110
- 2500 Industrial Parkway
111
- Detroit, MI 48201
112
-
113
- To:
114
- Riverside Manufacturing
115
- 780 Factory Road
116
- Cleveland, OH 44101
117
-
118
- Product Qty Price Each Line Total
119
- -----------------------------------------------------------
120
- Steel Bolts (Box/100) 50 $12.50 $625.00
121
- Copper Wire (500ft Roll) 8 $85.00 $680.00
122
- Safety Goggles (Pack/10) 20 $35.00 $700.00
123
- Welding Rods (Bundle) 15 $22.00 $330.00
124
-
125
- Subtotal: $2,335.00
126
- Sales Tax: $163.45
127
- Invoice Total: $2,498.45
128
-
129
- Terms: Net 45
130
- """,
131
- "ground_truth": {
132
- "invoice_number": "GS-2024-0147",
133
- "date": "2024-02-20",
134
- "vendor_name": "Global Supplies Inc.",
135
- "customer_name": "Riverside Manufacturing",
136
- "subtotal": 2335.00,
137
- "tax": 163.45,
138
- "total": 2498.45,
139
- "line_items": [
140
- {"description": "Steel Bolts (Box/100)", "quantity": 50, "unit_price": 12.50, "amount": 625.00},
141
- {"description": "Copper Wire (500ft Roll)", "quantity": 8, "unit_price": 85.00, "amount": 680.00},
142
- {"description": "Safety Goggles (Pack/10)", "quantity": 20, "unit_price": 35.00, "amount": 700.00},
143
- {"description": "Welding Rods (Bundle)", "quantity": 15, "unit_price": 22.00, "amount": 330.00},
144
- ],
145
- },
146
- },
147
- ],
148
-
149
- # =========================================================================
150
- # MESSY INVOICES — Inconsistent formatting, abbreviations, typos
151
- # =========================================================================
152
- "messy_invoice": [
153
- {
154
- "id": "messy_001",
155
- "text": """ACME Corp
156
- 123 Biz Ave., NY 10001
157
- Ph: (212) 555-0100
158
-
159
- inv# ACM-987
160
- dt: Jan 15 '24
161
-
162
- BILL TO:
163
- widgetco / 456 commerce, chicago il
164
-
165
- ---items---
166
- 10x WidgetA @ 25 250
167
- 5x WidgetB @ 40 200
168
- 8hrs consulting @75/hr 600
169
- ------
170
- subtot 1050
171
- tx 8%: 84
172
- TOTAL DUE: $1,134
173
-
174
- pay within 30 days
175
- """,
176
- "ground_truth": {
177
- "invoice_number": "ACM-987",
178
- "date": "2024-01-15",
179
- "vendor_name": "ACME Corp",
180
- "customer_name": "widgetco",
181
- "subtotal": 1050.00,
182
- "tax": 84.00,
183
- "total": 1134.00,
184
- "line_items": [
185
- {"description": "WidgetA", "quantity": 10, "unit_price": 25.00, "amount": 250.00},
186
- {"description": "WidgetB", "quantity": 5, "unit_price": 40.00, "amount": 200.00},
187
- {"description": "consulting", "quantity": 8, "unit_price": 75.00, "amount": 600.00},
188
- ],
189
- },
190
- },
191
- {
192
- "id": "messy_002",
193
- "text": """techstart solutions
194
- san fran, CA
195
-
196
- INVOICE ts5892-b
197
- date 03/03/2024
198
-
199
- cust: DataFlow
200
- austin TX
201
-
202
- -- charges --
203
- hosting (cloud, monthly plan)...$450
204
- api integration - setup fee...$1200
205
- tech support x12h @$95 = $1,140.00
206
-
207
- sub: $2790
208
- tax 7pct = 195.30
209
- ========
210
- amt due $2,985.30
211
-
212
- please remit by 04/02/2024
213
- """,
214
- "ground_truth": {
215
- "invoice_number": "ts5892-b",
216
- "date": "2024-03-03",
217
- "vendor_name": "techstart solutions",
218
- "customer_name": "DataFlow",
219
- "subtotal": 2790.00,
220
- "tax": 195.30,
221
- "total": 2985.30,
222
- "line_items": [
223
- {"description": "hosting (cloud, monthly plan)", "quantity": 1, "unit_price": 450.00, "amount": 450.00},
224
- {"description": "api integration - setup fee", "quantity": 1, "unit_price": 1200.00, "amount": 1200.00},
225
- {"description": "tech support", "quantity": 12, "unit_price": 95.00, "amount": 1140.00},
226
- ],
227
- },
228
- },
229
- {
230
- "id": "messy_003",
231
- "text": """GLOBAL SUPPLY
232
- 2500 industrial pkwy detroit MI
233
-
234
- inv GS-0147rev
235
- 20-Feb-2024
236
-
237
- Riverside Mfg / cleveland OH
238
-
239
- stl bolts 100ct boxes -- 50 @ 12.50 ea ........... 625
240
- cu wire 500' rolls -- 8 @ 85 .................... 680
241
- safety goggles 10pk -- 20 @ 35 .................. 700
242
- weld rods bundle -- 15 @ 22 ea .................. 330
243
-
244
- s/t 2335.00
245
- tax 163.45
246
- -----
247
- GRAND TOTAL $2498.45
248
-
249
- net45
250
- """,
251
- "ground_truth": {
252
- "invoice_number": "GS-0147rev",
253
- "date": "2024-02-20",
254
- "vendor_name": "GLOBAL SUPPLY",
255
- "customer_name": "Riverside Mfg",
256
- "subtotal": 2335.00,
257
- "tax": 163.45,
258
- "total": 2498.45,
259
- "line_items": [
260
- {"description": "stl bolts 100ct boxes", "quantity": 50, "unit_price": 12.50, "amount": 625.00},
261
- {"description": "cu wire 500' rolls", "quantity": 8, "unit_price": 85.00, "amount": 680.00},
262
- {"description": "safety goggles 10pk", "quantity": 20, "unit_price": 35.00, "amount": 700.00},
263
- {"description": "weld rods bundle", "quantity": 15, "unit_price": 22.00, "amount": 330.00},
264
- ],
265
- },
266
- },
267
- ],
268
-
269
- # =========================================================================
270
- # MULTI-DOCUMENT — Multiple sections, cross-references, adjustments
271
- # =========================================================================
272
- "multi_document": [
273
- {
274
- "id": "multi_001",
275
- "text": """=== PURCHASE ORDER ===
276
- PO Number: PO-2024-0055
277
- Date: January 10, 2024
278
- Vendor: Acme Corporation
279
- Buyer: Widget Co.
280
-
281
- Ordered Items:
282
- - 10x Widget Type A @ $25.00 = $250.00
283
- - 5x Widget Type B @ $40.00 = $200.00
284
- - 8hrs Consulting @ $75.00/hr = $600.00
285
-
286
- PO Total: $1,050.00 (before tax)
287
-
288
- === INVOICE ===
289
- Invoice Number: INV-2024-001
290
- Reference PO: PO-2024-0055
291
- Date: January 15, 2024
292
-
293
- From: Acme Corporation, 123 Business Ave, New York, NY 10001
294
- To: Widget Co., 456 Commerce St, Chicago, IL 60601
295
-
296
- Description Qty Unit Price Amount
297
- Widget Type A 10 $25.00 $250.00
298
- Widget Type B 5 $40.00 $200.00
299
- Consulting Service 8 $75.00 $600.00
300
-
301
- Subtotal: $1,050.00
302
- Tax (8%): $84.00
303
- Invoice Total: $1,134.00
304
-
305
- === CREDIT MEMO ===
306
- Credit Memo #: CM-2024-003
307
- Reference Invoice: INV-2024-001
308
- Date: January 22, 2024
309
-
310
- Reason: 2x Widget Type A received defective
311
- Credit Amount: $50.00
312
-
313
- === SUMMARY ===
314
- Original Invoice: $1,134.00
315
- Credit Applied: -$50.00
316
- Adjusted Balance Due: $1,084.00
317
- """,
318
- "ground_truth": {
319
- "invoice_number": "INV-2024-001",
320
- "date": "2024-01-15",
321
- "vendor_name": "Acme Corporation",
322
- "customer_name": "Widget Co.",
323
- "subtotal": 1050.00,
324
- "tax": 84.00,
325
- "total": 1134.00,
326
- "po_number": "PO-2024-0055",
327
- "adjustment_reason": "2x Widget Type A received defective",
328
- "adjusted_total": 1084.00,
329
- "line_items": [
330
- {"description": "Widget Type A", "quantity": 10, "unit_price": 25.00, "amount": 250.00},
331
- {"description": "Widget Type B", "quantity": 5, "unit_price": 40.00, "amount": 200.00},
332
- {"description": "Consulting Service", "quantity": 8, "unit_price": 75.00, "amount": 600.00},
333
- ],
334
- },
335
- },
336
- {
337
- "id": "multi_002",
338
- "text": """--- PURCHASE ORDER ---
339
- PO#: PO-DF-2024-112
340
- Issued: 2024-02-28
341
- Requested By: DataFlow Inc., Austin TX
342
- Vendor: TechStart Solutions LLC
343
-
344
- Items Requested:
345
- 1. Cloud Hosting (Monthly) - 1 unit - $450.00 - $450.00
346
- 2. API Integration - 1 unit - $1,200.00 - $1,200.00
347
- 3. Tech Support - 10 hours - $95.00/hr - $950.00
348
- NOTE: Hours are estimated, bill actuals
349
-
350
- PO Authorized Amount: $2,600.00 (pre-tax)
351
-
352
- --- INVOICE ---
353
- Invoice: TS-5892
354
- Date: March 3, 2024
355
- PO Reference: PO-DF-2024-112
356
-
357
- From: TechStart Solutions LLC, 890 Innovation Dr Suite 200, San Francisco CA 94105
358
- To: DataFlow Inc., 321 Analytics Blvd, Austin TX 78701
359
-
360
- Service Qty Rate Amount
361
- Cloud Hosting (Monthly) 1 $450.00 $450.00
362
- API Integration Setup 1 $1,200.00 $1,200.00
363
- Technical Support (actual hrs) 12 $95.00 $1,140.00
364
-
365
- NOTE: Support hours exceeded PO estimate (10hrs) by 2hrs.
366
- Overage pre-approved by J. Smith on 03/01/2024.
367
-
368
- Subtotal: $2,790.00
369
- Tax (7%): $195.30
370
- Total: $2,985.30
371
-
372
- --- PAYMENT RECEIPT ---
373
- Receipt #: RCP-2024-0891
374
- Date: March 15, 2024
375
- Payment Method: ACH Transfer
376
- Reference: TS-5892
377
-
378
- Amount Paid: $2,000.00
379
- Outstanding Balance: $985.30
380
- Due By: April 2, 2024
381
- """,
382
- "ground_truth": {
383
- "invoice_number": "TS-5892",
384
- "date": "2024-03-03",
385
- "vendor_name": "TechStart Solutions LLC",
386
- "customer_name": "DataFlow Inc.",
387
- "subtotal": 2790.00,
388
- "tax": 195.30,
389
- "total": 2985.30,
390
- "po_number": "PO-DF-2024-112",
391
- "adjustment_reason": "Partial payment applied",
392
- "adjusted_total": 985.30,
393
- "line_items": [
394
- {"description": "Cloud Hosting (Monthly)", "quantity": 1, "unit_price": 450.00, "amount": 450.00},
395
- {"description": "API Integration Setup", "quantity": 1, "unit_price": 1200.00, "amount": 1200.00},
396
- {"description": "Technical Support (actual hrs)", "quantity": 12, "unit_price": 95.00, "amount": 1140.00},
397
- ],
398
- },
399
- },
400
- {
401
- "id": "multi_003",
402
- "text": """==== PURCHASE ORDER ====
403
- PO: PO-RM-2024-033
404
- Date: Feb 15, 2024
405
- Buyer: Riverside Manufacturing, 780 Factory Rd, Cleveland OH
406
- Supplier: Global Supplies Inc.
407
- Budget Approved: $2,800.00
408
-
409
- Requested:
410
- - Steel Bolts Box/100: 50 boxes @ $12.50
411
- - Copper Wire 500ft: 10 rolls @ $85.00
412
- - Safety Goggles Pack/10: 20 packs @ $35.00
413
- - Welding Rods Bundle: 15 bundles @ $22.00
414
-
415
- ==== INVOICE ====
416
- Invoice: GS-2024-0147
417
- Date: February 20, 2024
418
- PO Ref: PO-RM-2024-033
419
-
420
- Billed By: Global Supplies Inc., 2500 Industrial Parkway, Detroit MI 48201
421
- Billed To: Riverside Manufacturing, 780 Factory Road, Cleveland OH 44101
422
-
423
- Item Qty Unit$ Total
424
- Steel Bolts (Box/100) 50 $12.50 $625.00
425
- Copper Wire (500ft Roll) 8 $85.00 $680.00
426
- Safety Goggles (Pack/10) 20 $35.00 $700.00
427
- Welding Rods (Bundle) 15 $22.00 $330.00
428
-
429
- IMPORTANT: Copper Wire qty reduced from PO (10 to 8).
430
- 2 rolls backordered, will ship separately.
431
-
432
- Subtotal: $2,335.00
433
- Tax (7%): $163.45
434
- Total Due: $2,498.45
435
-
436
- ==== BACKORDER NOTICE ====
437
- Backorder #: BO-2024-0089
438
- Reference: GS-2024-0147 / PO-RM-2024-033
439
- Item: Copper Wire (500ft Roll)
440
- Qty Backordered: 2
441
- Unit Price: $85.00
442
- Backorder Amount: $170.00
443
- Estimated Ship Date: March 10, 2024
444
-
445
- Total with Backorder: $2,498.45 + $170.00 = $2,668.45
446
- (Backorder will be invoiced separately upon shipment)
447
- """,
448
- "ground_truth": {
449
- "invoice_number": "GS-2024-0147",
450
- "date": "2024-02-20",
451
- "vendor_name": "Global Supplies Inc.",
452
- "customer_name": "Riverside Manufacturing",
453
- "subtotal": 2335.00,
454
- "tax": 163.45,
455
- "total": 2498.45,
456
- "po_number": "PO-RM-2024-033",
457
- "adjustment_reason": "Copper Wire qty reduced from PO, 2 rolls backordered",
458
- "adjusted_total": 2668.45,
459
- "line_items": [
460
- {"description": "Steel Bolts (Box/100)", "quantity": 50, "unit_price": 12.50, "amount": 625.00},
461
- {"description": "Copper Wire (500ft Roll)", "quantity": 8, "unit_price": 85.00, "amount": 680.00},
462
- {"description": "Safety Goggles (Pack/10)", "quantity": 20, "unit_price": 35.00, "amount": 700.00},
463
- {"description": "Welding Rods (Bundle)", "quantity": 15, "unit_price": 22.00, "amount": 330.00},
464
- ],
465
- },
466
- },
467
- ],
468
-
469
- # =========================================================================
470
- # CORRUPTED SCAN — OCR-like artifacts, character substitutions, garbled text
471
- # These simulate real scanned/faxed invoices with OCR errors.
472
- # =========================================================================
473
- "corrupted_scan": [
474
- {
475
- "id": "corrupt_001",
476
- "text": """SC4NNED D0CUMENT - Page 1 of 1
477
-
478
- lNVOlCE
479
-
480
- lnvoice Nurnber: lNV-2O24-OO1
481
- Dat.e: Januery 1S, 2O24
482
-
483
- Frorn:
484
- Acrne Corporati0n
485
- l23 Business Avenue
486
- New Y0rk, NY 1OOO1
487
-
488
- BilI To:
489
- Widget C0.
490
- 4S6 Cornmerce Street
491
- Chicag0, lL 6O6O1
492
-
493
- Descripti0n Qty Unit Price Arnount
494
- ---------------------------------------------------------
495
- Widget Type A 1O $2S.OO $2SO.OO
496
- Widget Type 8 S $4O.OO $2OO.OO
497
- ConsuIting Service 8 $7S.OO $6OO.OO
498
-
499
- Subtotal: $1,OSO.OO
500
- Tax (8%): $84.OO
501
- T0tal: $1,l34.OO
502
-
503
- Payrnent Terrns: Net 3O
504
-
505
- --- END 0F SCAN ---
506
- """,
507
- "ground_truth": {
508
- "invoice_number": "INV-2024-001",
509
- "date": "2024-01-15",
510
- "vendor_name": "Acme Corporation",
511
- "customer_name": "Widget Co.",
512
- "subtotal": 1050.00,
513
- "tax": 84.00,
514
- "total": 1134.00,
515
- "line_items": [
516
- {"description": "Widget Type A", "quantity": 10, "unit_price": 25.00, "amount": 250.00},
517
- {"description": "Widget Type B", "quantity": 5, "unit_price": 40.00, "amount": 200.00},
518
- {"description": "Consulting Service", "quantity": 8, "unit_price": 75.00, "amount": 600.00},
519
- ],
520
- },
521
- },
522
- {
523
- "id": "corrupt_002",
524
- "text": """[SCAN QUALITY: P00R - SOME CHARACTERS MAY BE lNCORRECT]
525
-
526
- TECHSTART S0LUTl0NS LLC
527
- 89O lnnovation Dr, Suite 2OO
528
- San Francisc0, CA 941OS
529
-
530
- lNV0lCE #: TS~S892
531
- DATE: O3/O3/2O24
532
-
533
- CUSTOMERr DataFIow lnc.
534
- 321 AnaIytics BIvd
535
- Austin, TX 787O1
536
-
537
- Servicc Qty Unit Pricc Total
538
- ----------------------------------------------------------
539
- CIoud Hosting (MonthIy) l $4SO.OO $4SO.OO
540
- APl lntegration Setup l $l,2OO.OO $l,2OO.OO
541
- TechnicaI Support (hours) l2 $9S.OO $l,l4O.OO
542
-
543
- SubtotaI: $2,79O.OO
544
- Tax (7%)): $l9S.3O
545
- TotaI: $2,98S.3O
546
-
547
- Due Date: ApriI 2, 2O24
548
-
549
- [PAGE 1/1 - SCAN C0MPLETE]
550
- """,
551
- "ground_truth": {
552
- "invoice_number": "TS-5892",
553
- "date": "2024-03-03",
554
- "vendor_name": "TechStart Solutions LLC",
555
- "customer_name": "DataFlow Inc.",
556
- "subtotal": 2790.00,
557
- "tax": 195.30,
558
- "total": 2985.30,
559
- "line_items": [
560
- {"description": "Cloud Hosting (Monthly)", "quantity": 1, "unit_price": 450.00, "amount": 450.00},
561
- {"description": "API Integration Setup", "quantity": 1, "unit_price": 1200.00, "amount": 1200.00},
562
- {"description": "Technical Support (hours)", "quantity": 12, "unit_price": 95.00, "amount": 1140.00},
563
- ],
564
- },
565
- },
566
- {
567
- "id": "corrupt_003",
568
- "text": """---FAXED DOCUMENT---
569
- RECEIVED: 02/20/2024 14:32
570
- QUALITY: [####===---] 40%
571
-
572
- GL0BAL SUPPLlES lNC.
573
- 25OO lndustriaI Parkway
574
- Detr0it, Ml 482Ol
575
-
576
- lNVOlCE
577
-
578
- lnvoice Number: GS-2O24-Ol47
579
- Date: February 2O, 2024
580
-
581
- T0:
582
- Riverside Manufactur1ng
583
- 78O Factory R0ad
584
- CIeveIand, 0H 44l0l
585
-
586
- Product Qty Price Each Line Total
587
- -----------------------------------------------------------
588
- SteeI BoIts (Box/lOO) SO $l2.SO $62S.OO
589
- Copper Wire (SOOft RoII) 8 $8S.OO $68O.OO
590
- Safety GoggIes (Pack/lO) 2O $3S.OO $7OO.OO
591
- WeIding Rods (BundIe) lS $22.OO $33O.OO
592
-
593
- [iIIegibIe]
594
- SubtotaI: $2,33S.OO
595
- SaIes Tax: $l63.4S
596
- lnvoice T0tal: $2,498.4S
597
-
598
- Terrns: Net 4S
599
- ---END FAX---
600
- """,
601
- "ground_truth": {
602
- "invoice_number": "GS-2024-0147",
603
- "date": "2024-02-20",
604
- "vendor_name": "Global Supplies Inc.",
605
- "customer_name": "Riverside Manufacturing",
606
- "subtotal": 2335.00,
607
- "tax": 163.45,
608
- "total": 2498.45,
609
- "line_items": [
610
- {"description": "Steel Bolts (Box/100)", "quantity": 50, "unit_price": 12.50, "amount": 625.00},
611
- {"description": "Copper Wire (500ft Roll)", "quantity": 8, "unit_price": 85.00, "amount": 680.00},
612
- {"description": "Safety Goggles (Pack/10)", "quantity": 20, "unit_price": 35.00, "amount": 700.00},
613
- {"description": "Welding Rods (Bundle)", "quantity": 15, "unit_price": 22.00, "amount": 330.00},
614
- ],
615
- },
616
- },
617
- ],
618
-
619
- # =========================================================================
620
- # ADVERSARIAL INVOICE — Decoy fields, contradictions, hidden calculations
621
- # Designed to genuinely challenge frontier models with traps.
622
- # =========================================================================
623
- "adversarial_invoice": [
624
- {
625
- "id": "adversarial_001",
626
- "text": """INVOICE
627
-
628
- *** IMPORTANT: This replaces previous invoice DRAFT-INV-999 which was voided ***
629
-
630
- Invoice Number: INV-2024-001-R2
631
- Previous Reference: DRAFT-INV-999 (VOIDED — DO NOT USE)
632
- Date: January 15, 2024
633
- Reissue Date: January 20, 2024
634
-
635
- From:
636
- Acme Corporation
637
- 123 Business Avenue, New York, NY 10001
638
- Tax ID: 12-3456789
639
-
640
- Bill To:
641
- Widget Co. (DBA "WidgetCorp International")
642
- 456 Commerce Street, Chicago, IL 60601
643
- Customer Account: WC-0042
644
-
645
- Description Qty Unit Price Amount
646
- ---------------------------------------------------------
647
- Widget Type A 10 $25.00 $250.00
648
- Widget Type B 5 $40.00 $200.00
649
- Consulting Service 8 $75.00 $600.00
650
- ** EARLY PAYMENT DISCOUNT: -5% on consulting **
651
-
652
- Subtotal: $1,050.00
653
- Discount (5%): -$30.00
654
- Adjusted Subtotal: $1,020.00
655
- Tax (8%): $81.60
656
- Total: $1,101.60
657
-
658
- NOTE: Original quote (QT-2024-555) was $1,134.00 but discount applied.
659
- Per agreement dated Jan 12, if paid within 10 days.
660
-
661
- Payment Terms: Net 10 (discounted) / Net 30 (full price $1,134.00)
662
- """,
663
- "ground_truth": {
664
- "invoice_number": "INV-2024-001-R2",
665
- "date": "2024-01-20",
666
- "vendor_name": "Acme Corporation",
667
- "customer_name": "Widget Co.",
668
- "subtotal": 1020.00,
669
- "tax": 81.60,
670
- "total": 1101.60,
671
- "discount_amount": 30.00,
672
- "original_total": 1134.00,
673
- "line_items": [
674
- {"description": "Widget Type A", "quantity": 10, "unit_price": 25.00, "amount": 250.00},
675
- {"description": "Widget Type B", "quantity": 5, "unit_price": 40.00, "amount": 200.00},
676
- {"description": "Consulting Service", "quantity": 8, "unit_price": 75.00, "amount": 600.00},
677
- ],
678
- "discrepancy_notes": "5% early payment discount applied to consulting. Reissued invoice replaces voided DRAFT-INV-999. Adjusted subtotal $1,020 vs original $1,050.",
679
- },
680
- },
681
- {
682
- "id": "adversarial_002",
683
- "text": """--- PURCHASE ORDER ---
684
- PO#: PO-DF-2024-112
685
- Date: February 28, 2024
686
- Vendor: TechStart Solutions LLC
687
- Buyer: DataFlow Inc.
688
- Authorized Budget: $2,600.00 (pre-tax)
689
-
690
- Items:
691
- 1. Cloud Hosting - 1 unit @ $450.00 = $450.00
692
- 2. API Integration - 1 unit @ $1,200.00 = $1,200.00
693
- 3. Tech Support - 10 hours @ $95.00/hr = $950.00
694
- PO Total: $2,600.00
695
-
696
- --- INVOICE ---
697
- Invoice: TS-5892-FINAL
698
- Date: March 3, 2024
699
- PO Reference: PO-DF-2024-112
700
-
701
- From: TechStart Solutions LLC
702
- To: DataFlow Inc.
703
-
704
- Service Qty Rate Amount
705
- Cloud Hosting (Monthly) 1 $450.00 $450.00
706
- API Integration Setup 1 $1,200.00 $1,200.00
707
- Technical Support (actual) 12 $95.00 $1,140.00
708
- >> 2 hrs over PO estimate, approved by J. Smith 03/01/2024
709
- Rush Processing Fee 1 $150.00 $150.00
710
- >> Added per emergency request ER-2024-033
711
-
712
- Subtotal: $2,940.00
713
- Tax (7%): $205.80
714
- Total: $3,145.80
715
-
716
- !!! BUDGET VARIANCE ALERT !!!
717
- PO Authorized: $2,600.00
718
- Actual (pre-tax): $2,940.00
719
- Variance: $340.00 OVER BUDGET (13.1%)
720
- Causes: Support overage ($190), Rush fee ($150)
721
-
722
- --- PAYMENT SCHEDULE ---
723
- Payment 1 (due 03/15): $1,500.00
724
- Payment 2 (due 04/02): $1,645.80
725
- """,
726
- "ground_truth": {
727
- "invoice_number": "TS-5892-FINAL",
728
- "date": "2024-03-03",
729
- "vendor_name": "TechStart Solutions LLC",
730
- "customer_name": "DataFlow Inc.",
731
- "subtotal": 2940.00,
732
- "tax": 205.80,
733
- "total": 3145.80,
734
- "po_number": "PO-DF-2024-112",
735
- "discount_amount": 0.00,
736
- "original_total": 2600.00,
737
- "line_items": [
738
- {"description": "Cloud Hosting (Monthly)", "quantity": 1, "unit_price": 450.00, "amount": 450.00},
739
- {"description": "API Integration Setup", "quantity": 1, "unit_price": 1200.00, "amount": 1200.00},
740
- {"description": "Technical Support (actual)", "quantity": 12, "unit_price": 95.00, "amount": 1140.00},
741
- {"description": "Rush Processing Fee", "quantity": 1, "unit_price": 150.00, "amount": 150.00},
742
- ],
743
- "discrepancy_notes": "Invoice exceeds PO by $340 (13.1%). 2 extra support hours ($190) and rush processing fee ($150) added. PO authorized $2,600 but actual pre-tax is $2,940.",
744
- },
745
- },
746
- {
747
- "id": "adversarial_003",
748
- "text": """CONSOLIDATED STATEMENT
749
-
750
- Account: Riverside Manufacturing
751
- Statement Period: February 2024
752
- Prepared by: Global Supplies Inc., Accounts Receivable
753
-
754
- === TRANSACTION 1: ORIGINAL INVOICE ===
755
- Invoice: GS-2024-0147
756
- Date: February 20, 2024
757
- PO: PO-RM-2024-033
758
-
759
- Steel Bolts (Box/100) 50 @ $12.50 = $625.00
760
- Copper Wire (500ft Roll) 10 @ $85.00 = $850.00
761
- Safety Goggles (Pack/10) 20 @ $35.00 = $700.00
762
- Welding Rods (Bundle) 15 @ $22.00 = $330.00
763
-
764
- Invoice Subtotal: $2,505.00
765
- Tax (7%): $175.35
766
- Invoice Total: $2,680.35
767
-
768
- === TRANSACTION 2: ADJUSTMENT ===
769
- Credit Memo: CM-2024-0201
770
- Date: February 25, 2024
771
- Reference: GS-2024-0147
772
-
773
- Issue: Copper Wire — only 8 of 10 rolls delivered.
774
- 2 rolls backordered (BO-2024-0089).
775
- Credit for undelivered: 2 x $85.00 = $170.00
776
- Tax adjustment: -$11.90
777
- Total Credit: -$181.90
778
-
779
- === TRANSACTION 3: PRICE CORRECTION ===
780
- Debit Memo: DM-2024-0055
781
- Date: February 27, 2024
782
-
783
- Steel Bolts price was quoted at $12.50 but contract
784
- rate is $13.00. Underbilled on 50 boxes.
785
- Price difference: 50 x $0.50 = $25.00
786
- Tax on adjustment: $1.75
787
- Total Debit: $26.75
788
-
789
- === ACCOUNT SUMMARY ===
790
- Original Invoice: $2,680.35
791
- Credit (undelivered wire): -$181.90
792
- Debit (price correction): +$26.75
793
- ================================
794
- Net Amount Due: $2,525.20
795
-
796
- Payment due by: March 20, 2024
797
- """,
798
- "ground_truth": {
799
- "invoice_number": "GS-2024-0147",
800
- "date": "2024-02-20",
801
- "vendor_name": "Global Supplies Inc.",
802
- "customer_name": "Riverside Manufacturing",
803
- "subtotal": 2505.00,
804
- "tax": 175.35,
805
- "total": 2680.35,
806
- "po_number": "PO-RM-2024-033",
807
- "discount_amount": 0.00,
808
- "original_total": 2680.35,
809
- "line_items": [
810
- {"description": "Steel Bolts (Box/100)", "quantity": 50, "unit_price": 12.50, "amount": 625.00},
811
- {"description": "Copper Wire (500ft Roll)", "quantity": 10, "unit_price": 85.00, "amount": 850.00},
812
- {"description": "Safety Goggles (Pack/10)", "quantity": 20, "unit_price": 35.00, "amount": 700.00},
813
- {"description": "Welding Rods (Bundle)", "quantity": 15, "unit_price": 22.00, "amount": 330.00},
814
- ],
815
- "discrepancy_notes": "Credit memo CM-2024-0201 for 2 undelivered Copper Wire rolls (-$181.90). Debit memo DM-2024-0055 for Steel Bolts price correction (+$26.75). Net adjustment: -$155.15. Final amount due: $2,525.20.",
816
- },
817
- },
818
- ],
819
- }
820
-
821
-
822
- # Required fields per task (defines what the agent must extract)
823
- TASK_REQUIRED_FIELDS = {
824
- "simple_invoice": [
825
- "invoice_number", "date", "vendor_name", "customer_name",
826
- "subtotal", "tax", "total", "line_items",
827
- ],
828
- "messy_invoice": [
829
- "invoice_number", "date", "vendor_name", "customer_name",
830
- "subtotal", "tax", "total", "line_items",
831
- ],
832
- "multi_document": [
833
- "invoice_number", "date", "vendor_name", "customer_name",
834
- "subtotal", "tax", "total", "line_items",
835
- "po_number", "adjustment_reason", "adjusted_total",
836
- ],
837
- "corrupted_scan": [
838
- "invoice_number", "date", "vendor_name", "customer_name",
839
- "subtotal", "tax", "total", "line_items",
840
- ],
841
- "adversarial_invoice": [
842
- "invoice_number", "date", "vendor_name", "customer_name",
843
- "subtotal", "tax", "total", "line_items",
844
- "po_number", "discount_amount", "original_total",
845
- "discrepancy_notes",
846
- ],
847
- }
848
-
849
-
850
- def get_document(task_name: str, doc_index: int = 0, use_procedural: bool = True) -> dict:
851
- """Get a document and its metadata for a given task.
852
-
853
- For doc_index 0-2, returns static documents (deterministic test fixtures).
854
- For doc_index >= 3 (or when use_procedural=True and index wraps), uses the
855
- procedural generation engine to create novel documents from the seed.
856
-
857
- Args:
858
- task_name: One of 'simple_invoice', 'messy_invoice', 'multi_document',
859
- 'corrupted_scan', 'adversarial_invoice'
860
- doc_index: Index / seed for document selection
861
- use_procedural: Whether to use procedural generation for indices beyond static pool
862
-
863
- Returns:
864
- dict with 'id', 'text', 'ground_truth', 'required_fields'
865
- """
866
- docs = DOCUMENTS.get(task_name, DOCUMENTS["simple_invoice"])
867
- required = TASK_REQUIRED_FIELDS.get(task_name, TASK_REQUIRED_FIELDS["simple_invoice"])
868
-
869
- # Use static documents for small indices (deterministic test fixtures)
870
- if doc_index < len(docs):
871
- doc = docs[doc_index]
872
- return {
873
- "id": doc["id"],
874
- "text": doc["text"],
875
- "ground_truth": doc["ground_truth"],
876
- "required_fields": required,
877
- }
878
-
879
- # Use procedural generation for larger indices
880
- if use_procedural:
881
- from .procedural import generate_document
882
- proc_doc = generate_document(task_name, seed=doc_index)
883
- return {
884
- "id": proc_doc["id"],
885
- "text": proc_doc["text"],
886
- "ground_truth": proc_doc["ground_truth"],
887
- "required_fields": required,
888
- }
889
-
890
- # Fallback: wrap around static docs
891
- doc = docs[doc_index % len(docs)]
892
- return {
893
- "id": doc["id"],
894
- "text": doc["text"],
895
- "ground_truth": doc["ground_truth"],
896
- "required_fields": required,
897
- }
898
-
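The selection order in the removed `get_document` (static fixtures first, then procedural generation, then wrap-around) can be sketched in isolation. This is a minimal illustration under stated assumptions, not the repository's code: `select_document` and the `generate` callback are hypothetical names standing in for the static pool lookup and for `generate_document` from `server/procedural.py`.

```python
from typing import Any, Callable, Optional, Sequence, Tuple


def select_document(
    static_docs: Sequence[Any],
    index: int,
    generate: Optional[Callable[[int], Any]] = None,
) -> Tuple[str, Any]:
    """Pick a document using the same three-tier order as the removed helper.

    `generate` is a hypothetical stand-in for a seeded procedural generator;
    passing None corresponds to use_procedural=False.
    """
    # Tier 1: indices inside the static pool return deterministic fixtures.
    if index < len(static_docs):
        return "static", static_docs[index]
    # Tier 2: larger indices are treated as seeds for procedural generation.
    if generate is not None:
        return "procedural", generate(index)
    # Tier 3: with no generator, wrap around the static pool.
    return "wrapped", static_docs[index % len(static_docs)]


if __name__ == "__main__":
    pool = ["doc_a", "doc_b", "doc_c"]
    print(select_document(pool, 1))
    print(select_document(pool, 5, lambda seed: f"generated-{seed}"))
    print(select_document(pool, 5))
```

Because the index doubles as the seed, the same `index` always maps to the same document, provided the generator itself is deterministic.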
server/environment.py CHANGED
@@ -1,621 +1,553 @@
1
  """
2
- Invoice Extraction Environment — Core Implementation.
3
 
4
- A stateful environment where an AI agent extracts structured data
5
- from unstructured invoice/receipt documents through a multi-step
6
- interaction loop with RLVR-inspired dense reward signals.
 
7
 
8
  Reward Architecture:
9
- R_total = α·R_outcome + β·R_trajectory + R_penalties
10
- α = 0.70 (outcome dominates)
11
- β = 0.30 (trajectory contributes)
12
- Penalties: step cost, hallucination penalties
13
  """
14
 
15
  import json
 
16
  from typing import Any, Optional
17
  from uuid import uuid4
18
 
19
- from .models import InvoiceAction, InvoiceObservation, InvoiceState
20
- from .documents import get_document, TASK_REQUIRED_FIELDS
21
- from .graders import grade_extraction
22
-
23
- # ---------------------------------------------------------------------------
24
- # Constants
25
- # ---------------------------------------------------------------------------
26
-
27
- MAX_ATTEMPTS = {
28
- "simple_invoice": 3,
29
- "messy_invoice": 3,
30
- "multi_document": 5,
31
- "corrupted_scan": 4,
32
- "adversarial_invoice": 6,
 
 
 
 
 
 
 
 
 
33
  }
34
 
35
- # Reward architecture coefficients
36
- ALPHA = 0.70 # outcome weight
37
- BETA = 0.30 # trajectory weight
38
-
39
- # Trajectory micro-rewards
40
- REWARD_VIEW_DOC = 0.01
41
- REWARD_VIEW_FIELDS = 0.01
42
- REWARD_GET_FEEDBACK = 0.005
43
- REWARD_QUERY_RELATED = 0.015
44
- REWARD_VERIFY_CALC = 0.01
45
- REWARD_CHECK_DISCREP = 0.015
46
-
47
- # Penalties
48
- PENALTY_PER_STEP = -0.005
49
- PENALTY_INVALID_JSON = -0.02
50
- PENALTY_UNKNOWN_CMD = -0.02
51
- PENALTY_INVALID_CALC = -0.01
52
-
53
- # Tasks that support advanced tool commands
54
- TOOL_ENABLED_TASKS = {"multi_document", "adversarial_invoice"}
55
-
56
- VALID_TASKS = list(TASK_REQUIRED_FIELDS.keys())
57
-
58
-
59
- class InvoiceExtractionEnvironment:
60
- """Environment for extracting structured data from invoice documents.
61
 
62
- The agent interacts through these commands:
63
- - view_document: See the raw document text
64
- - view_fields: See the list of required fields
65
- - extract: Submit extracted fields as JSON
66
- - get_feedback: Get detailed feedback on last extraction
67
- - query_related_documents: Retrieve cross-reference documents
68
- - verify_calculations: Submit arithmetic for verification
69
- - check_discrepancies: Request environment to flag inconsistencies
70
 
71
- Reward design follows RLVR principles:
72
- R_total = α·R_outcome + β·R_trajectory + R_penalties
73
- """
74
 
75
  def __init__(self):
76
- self._state = InvoiceState(episode_id=str(uuid4()))
77
- self._document_text = ""
78
- self._ground_truth = {}
79
- self._required_fields = []
80
- self._last_feedback = {}
81
- self._last_extracted = {}
82
  self._initialized = False
83
  self._trajectory_reward = 0.0
84
- self._milestones = set() # tracks which trajectory milestones agent has hit
85
- self._related_docs_text = ""
 
 
 
86
 
87
  def reset(
88
  self,
89
  seed: Optional[int] = None,
90
  episode_id: Optional[str] = None,
91
- task_name: str = "simple_invoice",
92
  **kwargs: Any,
93
- ) -> InvoiceObservation:
94
- """Reset the environment with a new task and document."""
95
  if task_name not in VALID_TASKS:
96
- task_name = "simple_invoice"
97
 
98
- doc_index = seed if seed is not None else 0
99
- doc_data = get_document(task_name, doc_index)
100
- max_attempts = MAX_ATTEMPTS.get(task_name, 3)
101
 
102
- self._state = InvoiceState(
103
  episode_id=episode_id or str(uuid4()),
104
  step_count=0,
105
  task_name=task_name,
106
- document_id=doc_data["id"],
107
- best_score=0.0,
108
- attempts_used=0,
109
- max_attempts=max_attempts,
110
  accumulated_reward=0.0,
 
 
111
  )
112
-
113
- self._document_text = doc_data["text"]
114
- self._ground_truth = doc_data["ground_truth"]
115
- self._required_fields = doc_data["required_fields"]
116
- self._last_feedback = {}
117
- self._last_extracted = {}
118
  self._initialized = True
119
  self._trajectory_reward = 0.0
120
- self._milestones = set()
121
- self._related_docs_text = self._build_related_docs(task_name, doc_data)
122
-
123
- tool_hint = ""
124
- if task_name in TOOL_ENABLED_TASKS:
125
- tool_hint = (
126
- "\nAdvanced tools available for this task:\n"
127
- " - 'query_related_documents': Retrieve PO, credit memos, etc.\n"
128
- " - 'verify_calculations': Submit arithmetic for verification\n"
129
- " - 'check_discrepancies': Flag inconsistencies in the document\n"
130
- )
131
 
132
- return InvoiceObservation(
 
 
 
 
 
 
133
  done=False,
134
  reward=0.0,
135
- text=(
136
- f"Invoice Extraction Environment ready.\n"
137
- f"Task: {task_name}\n"
138
- f"Document ID: {doc_data['id']}\n"
139
- f"Fields to extract: {len(self._required_fields)}\n"
140
- f"Max attempts: {max_attempts}\n\n"
141
- f"Use 'view_document' to see the document text.\n"
142
- f"Use 'view_fields' to see the required fields.\n"
143
- f"Use 'extract' with a JSON payload to submit your extraction.\n"
144
- f"Use 'get_feedback' to see feedback on your last attempt."
145
- f"{tool_hint}"
146
- ),
147
- task_name=task_name,
148
- current_score=0.0,
149
- attempts_remaining=max_attempts,
150
- required_fields=self._required_fields,
151
  current_step=0,
 
152
  accumulated_reward=0.0,
153
- last_action_status="success",
 
154
  )
155
 
156
- def _build_related_docs(self, task_name: str, doc_data: dict) -> str:
157
- """Build related documents text for cross-referencing tasks."""
158
- gt = doc_data["ground_truth"]
159
- if task_name not in TOOL_ENABLED_TASKS:
160
- return ""
161
-
162
- parts = []
163
- if "po_number" in gt:
164
- parts.append(
165
- f"=== PURCHASE ORDER ===\n"
166
- f"PO Number: {gt.get('po_number', 'N/A')}\n"
167
- f"Vendor: {gt.get('vendor_name', 'N/A')}\n"
168
- f"Buyer: {gt.get('customer_name', 'N/A')}\n"
 
 
 
 
 
 
169
  )
170
- if "line_items" in gt:
171
- for item in gt["line_items"]:
172
- parts.append(
173
- f" - {item['quantity']}x {item['description']} "
174
- f"@ ${item['unit_price']:.2f} = ${item['amount']:.2f}"
175
- )
176
- parts.append("")
177
-
178
- if gt.get("adjustment_reason"):
179
- parts.append(
180
- f"=== ADJUSTMENT MEMO ===\n"
181
- f"Reason: {gt['adjustment_reason']}\n"
182
  )
183
- if gt.get("adjusted_total"):
184
- parts.append(f"Adjusted Total: ${gt['adjusted_total']:,.2f}")
185
- parts.append("")
186
-
187
- if gt.get("discount_amount") and gt["discount_amount"] > 0:
188
- parts.append(
189
- f"=== DISCOUNT APPLIED ===\n"
190
- f"Discount: ${gt['discount_amount']:,.2f}\n"
191
- f"Original Total: ${gt.get('original_total', 0):,.2f}\n"
 
 
 
 
 
192
  )
193
-
194
- return "\n".join(parts) if parts else "No related documents found for this invoice."
195
 
196
  def step(
197
  self,
198
- action: InvoiceAction,
199
  timeout_s: Optional[float] = None,
200
  **kwargs: Any,
201
- ) -> InvoiceObservation:
202
- """Execute a step in the environment."""
203
  if not self._initialized:
204
- return InvoiceObservation(
205
- done=True,
206
- reward=0.0,
207
- text="Error: Environment not initialized. Call reset() first.",
208
- metadata={"error": "not_initialized"},
209
- last_action_status="error",
210
- error_message="Environment not initialized. Call reset() first.",
211
- )
212
 
213
  self._state.step_count += 1
214
- command = action.command.lower().strip()
215
-
216
- # Apply per-step cost (encourages efficiency)
217
- self._trajectory_reward += PENALTY_PER_STEP
218
-
219
- handlers = {
220
- "view_document": self._handle_view_document,
221
- "view_fields": self._handle_view_fields,
222
- "extract": lambda: self._handle_extract(action.payload),
223
- "get_feedback": self._handle_get_feedback,
224
- "query_related_documents": self._handle_query_related,
225
- "verify_calculations": lambda: self._handle_verify_calculations(action.payload),
226
- "check_discrepancies": self._handle_check_discrepancies,
227
- }
228
-
229
- handler = handlers.get(command)
230
- if handler:
231
- return handler()
232
- else:
233
- # Unknown command penalty
234
- self._trajectory_reward += PENALTY_UNKNOWN_CMD
235
- self._state.accumulated_reward += PENALTY_UNKNOWN_CMD
236
- return self._make_obs(
237
- done=False,
238
- reward=0.0,
239
- text=(
240
- f"Unknown command: '{command}'. "
241
- f"Valid commands: {', '.join(handlers.keys())}"
242
- ),
243
- status="error",
244
- error_msg=f"Unknown command: '{command}'",
245
  )
246
 
247
- def _make_obs(
248
- self,
249
- done: bool,
250
- reward: float,
251
- text: str,
252
- status: str = "success",
253
- error_msg: Optional[str] = None,
254
- metadata: Optional[dict] = None,
255
- ) -> InvoiceObservation:
256
- """Build a standardized observation."""
257
- return InvoiceObservation(
258
- done=done,
259
- reward=round(max(0.0, min(1.0, reward)), 4) if reward >= 0 else round(max(0.0, reward), 4),
260
- text=text,
261
- task_name=self._state.task_name,
262
- current_score=self._state.best_score,
263
- attempts_remaining=self._state.max_attempts - self._state.attempts_used,
264
- required_fields=self._required_fields,
265
- metadata=metadata or {},
266
- last_action_status=status,
267
- error_message=error_msg,
268
- current_step=self._state.step_count,
269
- accumulated_reward=round(self._state.accumulated_reward, 4),
270
- )
271
 
272
  # ------------------------------------------------------------------
273
- # Command handlers
274
  # ------------------------------------------------------------------
275
 
276
- def _handle_view_document(self) -> InvoiceObservation:
277
- """Return the current document text (trajectory milestone)."""
278
- if "view_document" not in self._milestones:
279
- self._milestones.add("view_document")
280
- self._trajectory_reward += REWARD_VIEW_DOC
281
- self._state.accumulated_reward += REWARD_VIEW_DOC
282
- return self._make_obs(done=False, reward=0.0, text=self._document_text)
283
-
284
- def _handle_view_fields(self) -> InvoiceObservation:
285
- """Return the list of required fields with descriptions."""
286
- if "view_fields" not in self._milestones:
287
- self._milestones.add("view_fields")
288
- self._trajectory_reward += REWARD_VIEW_FIELDS
289
- self._state.accumulated_reward += REWARD_VIEW_FIELDS
290
-
291
- field_descriptions = {
292
- "invoice_number": "The invoice/document number (string)",
293
- "date": "Invoice date in YYYY-MM-DD format (use reissue date if applicable)",
294
- "vendor_name": "Name of the vendor/seller/supplier",
295
- "customer_name": "Name of the customer/buyer/bill-to party",
296
- "subtotal": "Subtotal before tax, after discounts (number)",
297
- "tax": "Tax amount (number)",
298
- "total": "Total amount due (number)",
299
- "line_items": "Array of items: [{description, quantity, unit_price, amount}]",
300
- "po_number": "Purchase order reference number (string)",
301
- "adjustment_reason": "Reason for any adjustments/credits (string)",
302
- "adjusted_total": "Final adjusted total after credits/payments (number)",
303
- "discount_amount": "Monetary discount value applied (number, 0 if none)",
304
- "original_total": "What the total would have been without adjustments (number)",
305
- "discrepancy_notes": "Free-text description of all discrepancies, adjustments, and anomalies found",
306
- }
307
-
308
- lines = ["Required fields to extract:\n"]
309
- for field in self._required_fields:
310
- desc = field_descriptions.get(field, "No description available")
311
- lines.append(f" - {field}: {desc}")
312
-
313
- lines.append(f"\nSubmit your extraction using the 'extract' command.")
314
- lines.append(f"Payload must be a valid JSON string with these field names.")
315
-
316
- return self._make_obs(done=False, reward=0.0, text="\n".join(lines))
317
-
318
- def _handle_query_related(self) -> InvoiceObservation:
319
- """Return cross-reference documents (PO, credit memos, etc.)."""
320
- if self._state.task_name not in TOOL_ENABLED_TASKS:
321
- return self._make_obs(
322
- done=False, reward=0.0,
323
- text="This command is not available for the current task.",
324
- status="error",
325
- error_msg="query_related_documents only available for multi_document and adversarial_invoice tasks",
326
- )
327
-
328
- if "query_related" not in self._milestones:
329
- self._milestones.add("query_related")
330
- self._trajectory_reward += REWARD_QUERY_RELATED
331
- self._state.accumulated_reward += REWARD_QUERY_RELATED
332
-
333
- return self._make_obs(
334
- done=False, reward=0.0,
335
- text=self._related_docs_text or "No related documents found.",
336
- )
337
 
338
- def _handle_verify_calculations(self, payload: str) -> InvoiceObservation:
339
- """Verify arithmetic submitted by the agent."""
340
- if self._state.task_name not in TOOL_ENABLED_TASKS:
341
- return self._make_obs(
342
- done=False, reward=0.0,
343
- text="This command is not available for the current task.",
344
- status="error",
345
- error_msg="verify_calculations only available for multi_document and adversarial_invoice tasks",
346
  )
347
 
348
- try:
349
- data = json.loads(payload) if payload else {}
350
- except json.JSONDecodeError:
351
- self._trajectory_reward += PENALTY_INVALID_CALC
352
- self._state.accumulated_reward += PENALTY_INVALID_CALC
353
- return self._make_obs(
354
- done=False, reward=0.0,
355
- text="Invalid JSON payload for verify_calculations.",
356
- status="error",
357
- error_msg="Payload must be valid JSON with numeric fields to verify",
358
  )
359
 
360
- if "verify_calc" not in self._milestones:
361
- self._milestones.add("verify_calc")
362
- self._trajectory_reward += REWARD_VERIFY_CALC
363
- self._state.accumulated_reward += REWARD_VERIFY_CALC
364
-
365
- results = []
366
- gt = self._ground_truth
367
- checks = {
368
- "subtotal_plus_tax": (
369
- lambda: round(gt.get("subtotal", 0) + gt.get("tax", 0), 2),
370
- gt.get("total"),
371
- ),
372
- }
373
-
374
- sub = data.get("subtotal")
375
- tax = data.get("tax")
376
- total = data.get("total")
377
-
378
- if sub is not None and tax is not None:
379
- computed = round(float(sub) + float(tax), 2)
380
- if total is not None:
381
- match = abs(computed - float(total)) < 0.02
382
- results.append(
383
- f"subtotal ({sub}) + tax ({tax}) = {computed} | "
384
- f"your total ({total}) — {'MATCH ✓' if match else 'MISMATCH ✗'}"
385
  )
386
  else:
387
- results.append(f"subtotal ({sub}) + tax ({tax}) = {computed}")
 
388
 
389
- if not results:
390
- results.append("No recognizable calculations found. Submit fields like: subtotal, tax, total")
391
 
392
- return self._make_obs(
393
- done=False, reward=0.0,
394
- text="Calculation verification:\n" + "\n".join(results),
395
- )
 
 
396
 
397
- def _handle_check_discrepancies(self) -> InvoiceObservation:
398
- """Flag inconsistencies in the document."""
399
- if self._state.task_name not in TOOL_ENABLED_TASKS:
400
- return self._make_obs(
401
- done=False, reward=0.0,
402
- text="This command is not available for the current task.",
403
- status="error",
404
- error_msg="check_discrepancies only available for multi_document and adversarial_invoice tasks",
405
- )
406
 
407
- if "check_discrep" not in self._milestones:
408
- self._milestones.add("check_discrep")
409
- self._trajectory_reward += REWARD_CHECK_DISCREP
410
- self._state.accumulated_reward += REWARD_CHECK_DISCREP
411
-
412
- gt = self._ground_truth
413
- hints = []
414
-
415
- if gt.get("discount_amount") and gt["discount_amount"] > 0:
416
- hints.append("⚠ A discount has been applied to this invoice.")
417
- if gt.get("adjustment_reason"):
418
- hints.append("⚠ There is an adjustment/credit memo affecting the final amount.")
419
- if gt.get("po_number"):
420
- hints.append("⚠ This invoice references a purchase order — cross-check quantities and amounts.")
421
- if gt.get("original_total") and gt.get("total"):
422
- if abs(gt["original_total"] - gt["total"]) > 0.01:
423
- hints.append("⚠ The final total differs from the original total — investigate adjustments.")
424
-
425
- if not hints:
426
- hints.append("No obvious discrepancies detected.")
427
-
428
- return self._make_obs(
429
- done=False, reward=0.0,
430
- text="Discrepancy analysis:\n" + "\n".join(hints),
431
- )
432
 
433
- def _handle_extract(self, payload: str) -> InvoiceObservation:
434
- """Process an extraction attempt with RLVR-style composite reward."""
435
- attempts_remaining = self._state.max_attempts - self._state.attempts_used
436
 
437
- if attempts_remaining <= 0:
438
- return self._make_obs(
439
- done=True,
440
- reward=self._state.best_score,
441
- text="No attempts remaining. Episode is complete.",
442
- metadata={"final_score": self._state.best_score},
443
  )
444
 
445
- # Parse the JSON payload
446
- try:
447
- extracted = json.loads(payload)
448
- if not isinstance(extracted, dict):
449
- raise ValueError("Payload must be a JSON object")
450
- except (json.JSONDecodeError, ValueError) as e:
451
- self._state.attempts_used += 1
452
- self._trajectory_reward += PENALTY_INVALID_JSON
453
- self._state.accumulated_reward += PENALTY_INVALID_JSON
454
- attempts_remaining = self._state.max_attempts - self._state.attempts_used
455
- done = attempts_remaining <= 0
456
-
457
- return self._make_obs(
458
- done=done,
459
- reward=0.0,
460
- text=f"Invalid JSON payload: {str(e)}\nPlease submit a valid JSON object.",
461
- status="error",
462
- error_msg=f"Invalid JSON: {str(e)}",
463
- metadata={"error": "invalid_json"},
464
  )
465
 
466
- # Grade the extraction
467
- self._state.attempts_used += 1
468
- base_score, feedback = grade_extraction(
469
- extracted, self._ground_truth, self._required_fields
470
- )
471
 
472
- # === COMPOSITE REWARD (RLVR-inspired) ===
473
-
474
- # R_outcome: base extraction score
475
- r_outcome = base_score
476
-
477
- # R_trajectory: accumulated from milestones
478
- r_trajectory = max(0.0, self._trajectory_reward)
479
-
480
- # Improvement bonus
481
- improvement_bonus = 0.0
482
- if self._state.attempts_used > 1 and base_score > self._state.best_score:
483
- improvement_bonus = min(base_score - self._state.best_score, 0.02)
484
-
485
- # Step efficiency bonus
486
- efficiency_bonus = 0.0
487
- if self._state.step_count <= 3:
488
- efficiency_bonus = 0.02
489
- elif self._state.step_count <= 5:
490
- efficiency_bonus = 0.01
491
-
492
- # Consistency bonus (subtotal + tax ≈ total)
493
- consistency_bonus = 0.0
494
- ext_sub = _safe_float(extracted.get("subtotal"))
495
- ext_tax = _safe_float(extracted.get("tax"))
496
- ext_total = _safe_float(extracted.get("total"))
497
- if ext_sub is not None and ext_tax is not None and ext_total is not None:
498
- computed = round(ext_sub + ext_tax, 2)
499
- if abs(computed - ext_total) < 0.02:
500
- consistency_bonus = 0.03
501
-
502
- # Composite reward
503
- bonus = improvement_bonus + efficiency_bonus + consistency_bonus
504
- score = round(max(0.01, min(0.99, ALPHA * r_outcome + BETA * r_trajectory + bonus)), 4)
505
-
506
- # Track
507
- self._state.best_score = max(self._state.best_score, score)
508
- self._state.accumulated_reward += score
509
- self._last_feedback = feedback
510
- self._last_extracted = extracted
511
-
512
- attempts_remaining = self._state.max_attempts - self._state.attempts_used
513
- done = attempts_remaining <= 0 or score >= 0.95
514
-
515
- # Build feedback text
516
- matched = sum(1 for f in feedback.values() if f.get("matched", False))
517
- total_fields = len(feedback)
518
- bonus_details = []
519
- if consistency_bonus > 0:
520
- bonus_details.append(f"consistency: +{consistency_bonus:.3f}")
521
- if improvement_bonus > 0:
522
- bonus_details.append(f"improvement: +{improvement_bonus:.3f}")
523
- if efficiency_bonus > 0:
524
- bonus_details.append(f"efficiency: +{efficiency_bonus:.3f}")
525
- if r_trajectory > 0:
526
- bonus_details.append(f"trajectory: {r_trajectory:.3f}")
527
-
528
- feedback_text = (
529
- f"Extraction scored: {score:.4f} "
530
- f"(outcome: {r_outcome:.4f} × {ALPHA}, trajectory: {r_trajectory:.3f} × {BETA})\n"
531
- f"Fields matched: {matched}/{total_fields}\n"
532
- f"Best score so far: {self._state.best_score:.4f}\n"
533
- f"Attempts remaining: {attempts_remaining}\n"
534
- )
535
 
536
- if bonus_details:
537
- feedback_text += f"Reward bonuses: {', '.join(bonus_details)}\n"
 
 
538
 
539
- if not done and score < 0.95:
540
- weak_fields = [
541
- name for name, data in feedback.items()
542
- if not data.get("matched", False)
543
- ]
544
- if weak_fields:
545
- feedback_text += f"\nFields needing improvement: {', '.join(weak_fields)}"
546
- feedback_text += "\nUse 'get_feedback' for detailed per-field scores."
547
 
548
- if done:
549
- feedback_text += f"\n\nEpisode complete. Final score: {self._state.best_score:.4f}"
 
550
 
551
- return self._make_obs(
552
- done=done,
553
- reward=score,
554
- text=feedback_text,
555
- metadata={
556
- "score": score,
557
- "base_score": base_score,
558
- "r_outcome": r_outcome,
559
- "r_trajectory": r_trajectory,
560
- "bonus": bonus,
561
- "bonus_details": bonus_details,
562
- "best_score": self._state.best_score,
563
- "field_scores": {k: v["score"] for k, v in feedback.items()},
564
- },
565
  )
566
 
567
- def _handle_get_feedback(self) -> InvoiceObservation:
568
- """Return detailed feedback on the last extraction attempt."""
569
- if not self._last_feedback:
570
- return self._make_obs(
571
- done=False,
572
- reward=0.0,
573
- text="No extraction attempt yet. Use 'extract' to submit your extraction first.",
574
- )
575
 
576
- if "get_feedback" not in self._milestones:
577
- self._milestones.add("get_feedback")
578
- self._trajectory_reward += REWARD_GET_FEEDBACK
579
- self._state.accumulated_reward += REWARD_GET_FEEDBACK
580
 
581
- lines = ["Detailed feedback on last extraction:\n"]
582
- for field, data in self._last_feedback.items():
583
- score = data.get("score", 0.0)
584
- matched = "✓" if data.get("matched", False) else "✗"
585
- field_type = data.get("expected_type", "unknown")
586
- lines.append(f" [{matched}] {field} ({field_type}): {score:.2f}")
587
 
588
- lines.append(f"\nOverall best score: {self._state.best_score:.2f}")
589
- lines.append(f"Attempts remaining: {self._state.max_attempts - self._state.attempts_used}")
590
 
591
- return self._make_obs(
 
592
  done=False,
593
  reward=0.0,
594
- text="\n".join(lines),
595
- metadata={"field_feedback": self._last_feedback},
596
  )
597
 
598
  @property
599
- def state(self) -> InvoiceState:
600
- """Get the current environment state."""
601
  return self._state
602
 
603
  def close(self) -> None:
604
- """Clean up resources."""
605
  self._initialized = False
606
-
607
-
608
- def _safe_float(value) -> float:
609
- """Safely convert a value to float, returning None on failure."""
610
- if value is None:
611
- return None
612
- if isinstance(value, (int, float)):
613
- return float(value)
614
- if isinstance(value, str):
615
- import re
616
- cleaned = re.sub(r"[$ ,]", "", value.strip())
617
- try:
618
- return float(cleaned)
619
- except (ValueError, TypeError):
620
- return None
621
- return None
 
1
  """
2
+ ESCTR Environment — Core Implementation.
3
 
4
+ Enterprise Supply Chain & Tax Reconciliation: a stateful environment
5
+ where an LLM agent operates as an autonomous financial controller,
6
+ using ERP tools to investigate discrepancies, enforce SLA penalties,
7
+ and navigate adversarial vendor disputes.
8
 
9
  Reward Architecture:
10
+ R_total = α·R_outcome + β·R_trajectory - penalties
11
  """
12
 
13
  import json
14
+ from dataclasses import asdict
15
  from typing import Any, Optional
16
  from uuid import uuid4
17
 
18
+ from .models import ESCTRAction, ESCTRObservation, ESCTRState
19
+ from .procedural import (
20
+ generate_scenario, Scenario, VALID_TASKS, MAX_STEPS,
21
+ render_purchase_order, render_invoice, render_sla,
22
+ render_shipping_log, render_warehouse_logs,
23
+ )
24
+ from .graders import grade_task1, grade_task2, grade_task3
25
+
26
+ # Reward constants
27
+ STEP_COST = 0.005
28
+ HALLUCINATION_PENALTY = 0.02
29
+
30
+ # Available tools per task
31
+ TASK_TOOLS = {
32
+ "procurement_reconciliation": [
33
+ "query_database", "read_document", "submit_financial_decision",
34
+ ],
35
+ "sla_enforcement": [
36
+ "query_database", "read_document", "submit_financial_decision",
37
+ ],
38
+ "adversarial_auditing": [
39
+ "query_database", "read_document", "communicate_vendor", "submit_financial_decision",
40
+ ],
41
  }
42
 
43
+ # Database tables per task
44
+ AVAILABLE_TABLES = {
45
+ "procurement_reconciliation": ["purchase_orders", "invoices"],
46
+ "sla_enforcement": ["purchase_orders", "invoices", "shipping_logs", "sla_contracts"],
47
+ "adversarial_auditing": ["purchase_orders", "invoices", "shipping_logs", "sla_contracts", "warehouse_logs"],
48
+ }
49
 
 
 
 
 
 
 
 
 
50
 
51
+ class ESCTREnvironment:
52
+ """Enterprise Supply Chain & Tax Reconciliation Environment."""
 
53
 
54
  def __init__(self):
55
+ self._state = ESCTRState(episode_id=str(uuid4()))
56
+ self._scenario: Optional[Scenario] = None
 
 
 
 
57
  self._initialized = False
58
  self._trajectory_reward = 0.0
59
+ self._milestones: list = []
60
+ self._vendor_negotiation_count = 0
61
+ self._settlement_offered = False
62
+ self._settlement_rejected = False
63
+ self._cited_evidence = False
64
 
65
  def reset(
66
  self,
67
  seed: Optional[int] = None,
68
  episode_id: Optional[str] = None,
69
+ task_name: str = "procurement_reconciliation",
70
  **kwargs: Any,
71
+ ) -> ESCTRObservation:
72
+ """Reset the environment with a new scenario."""
73
  if task_name not in VALID_TASKS:
74
+ task_name = "procurement_reconciliation"
75
 
76
+ actual_seed = seed if seed is not None else 0
77
+ scenario = generate_scenario(task_name, actual_seed)
78
+ max_steps = MAX_STEPS.get(task_name, 15)
79
 
80
+ self._state = ESCTRState(
81
  episode_id=episode_id or str(uuid4()),
82
  step_count=0,
83
  task_name=task_name,
84
+ seed=actual_seed,
 
 
 
85
  accumulated_reward=0.0,
86
+ outcome_submitted=False,
87
+ milestones_hit=[],
88
  )
89
+ self._scenario = scenario
 
 
 
 
 
90
  self._initialized = True
91
  self._trajectory_reward = 0.0
92
+ self._milestones = []
93
+ self._vendor_negotiation_count = 0
94
+ self._settlement_offered = False
95
+ self._settlement_rejected = False
96
+ self._cited_evidence = False
 
 
 
 
 
 
97
 
98
+ tools = TASK_TOOLS.get(task_name, [])
99
+ tables = AVAILABLE_TABLES.get(task_name, [])
100
+
101
+ # Build initial briefing
102
+ briefing = self._build_briefing(task_name, scenario, tables)
103
+
104
+ return ESCTRObservation(
105
  done=False,
106
  reward=0.0,
107
+ system_response=briefing,
108
+ last_action_status="success",
109
  current_step=0,
110
+ max_steps=max_steps,
111
  accumulated_reward=0.0,
112
+ task_name=task_name,
113
+ available_tools=tools,
114
  )
115
 
116
+ def _build_briefing(self, task_name: str, scenario: Scenario, tables: list) -> str:
117
+ """Generate task-specific initial briefing."""
118
+ vendor = scenario.vendor.name
119
+ buyer = scenario.buyer.name
120
+ inv_num = scenario.invoice.invoice_number
121
+ po_num = scenario.purchase_order.po_number
122
+
123
+ if task_name == "procurement_reconciliation":
124
+ return (
125
+ f"=== DISCREPANCY ALERT ===\n"
126
+ f"A pricing discrepancy has been detected between Purchase Order {po_num} "
127
+ f"and Vendor Invoice {inv_num} from {vendor}.\n\n"
128
+ f"Your task: Investigate the discrepancy, identify the overcharged line item, "
129
+ f"and submit the correct financial adjustment.\n\n"
130
+ f"Available databases: {', '.join(tables)}\n"
131
+ f"Available tools: query_database, read_document, submit_financial_decision\n\n"
132
+ f"Use 'query_database' with {{'table': '<table_name>'}} to explore data.\n"
133
+ f"Use 'read_document' with document_id (e.g. '{po_num}' or '{inv_num}') to read full documents.\n"
134
+ f"Use 'submit_financial_decision' with adjustment_amount and adjustment_reason when ready."
135
  )
136
+ elif task_name == "sla_enforcement":
137
+ return (
138
+ f"=== PAYMENT DEMAND REVIEW ===\n"
139
+ f"Vendor {vendor} has submitted Invoice {inv_num} (ref: {po_num}) "
140
+ f"demanding full payment without penalties.\n\n"
141
+ f"Intelligence suggests the shipment may have been delivered late. "
142
+ f"Your task: Verify delivery timing, review the SLA contract, calculate "
143
+ f"any applicable penalties, and submit the correct adjusted payment.\n\n"
144
+ f"Available databases: {', '.join(tables)}\n"
145
+ f"Available tools: query_database, read_document, submit_financial_decision\n\n"
146
+ f"Key steps: Check shipping_logs → Review sla_contracts → Calculate penalty → Submit adjustment."
 
147
  )
148
+ elif task_name == "adversarial_auditing":
149
+ return (
150
+ f"=== VENDOR DISPUTE ALERT ===\n"
151
+ f"Vendor {vendor} has submitted Invoice {inv_num} (ref: {po_num}) "
152
+ f"demanding full payment. Shipping records indicate a late delivery.\n\n"
153
+ f"⚠ The vendor DISPUTES the late delivery claim. They assert that {buyer}'s "
154
+ f"receiving warehouse rejected the initial delivery attempt.\n\n"
155
+ f"Your task: Investigate the vendor's claim against internal records, "
156
+ f"verify warehouse availability, enforce SLA penalties if warranted, and "
157
+ f"handle any settlement offers from the vendor.\n\n"
158
+ f"Available databases: {', '.join(tables)}\n"
159
+ f"Available tools: query_database, read_document, communicate_vendor, submit_financial_decision\n\n"
160
+ f"WARNING: The vendor may attempt to negotiate a reduced penalty. "
161
+ f"Verify all claims against internal data before accepting ANY settlement."
162
  )
163
+ return "Environment ready."
 
164
 
165
  def step(
166
  self,
167
+ action: ESCTRAction,
168
  timeout_s: Optional[float] = None,
169
  **kwargs: Any,
170
+ ) -> ESCTRObservation:
171
+ """Execute one step in the environment."""
172
  if not self._initialized:
173
+ return self._error_obs("Environment not initialized. Call reset() first.", terminal=True)
174
+
175
+ if self._state.outcome_submitted:
176
+ return self._error_obs("Episode already complete. Call reset() for a new episode.", terminal=True)
 
 
 
 
177
 
178
  self._state.step_count += 1
179
+ max_steps = MAX_STEPS.get(self._state.task_name, 15)
180
+
181
+ # Step cost
182
+ self._trajectory_reward -= STEP_COST
183
+
184
+ # Check max steps
185
+ if self._state.step_count > max_steps:
186
+ return self._finalize("Maximum steps exceeded. Episode terminated.", forced=True)
187
+
188
+ # Validate tool availability
189
+ available = TASK_TOOLS.get(self._state.task_name, [])
190
+ if action.action_type not in available:
191
+ self._trajectory_reward -= HALLUCINATION_PENALTY
192
+ return self._error_obs(
193
+ f"Tool '{action.action_type}' is not available for task '{self._state.task_name}'. "
194
+ f"Available tools: {', '.join(available)}"
195
  )
196
 
197
+ # Dispatch
198
+ if action.action_type == "query_database":
199
+ return self._handle_query(action)
200
+ elif action.action_type == "read_document":
201
+ return self._handle_read(action)
202
+ elif action.action_type == "communicate_vendor":
203
+ return self._handle_vendor_comm(action)
204
+ elif action.action_type == "submit_financial_decision":
205
+ return self._handle_submit(action)
206
+
207
+ return self._error_obs(f"Unknown action type: {action.action_type}")
208
 
209
  # ------------------------------------------------------------------
210
+ # Tool handlers
211
  # ------------------------------------------------------------------
212
 
213
+ def _handle_query(self, action: ESCTRAction) -> ESCTRObservation:
214
+ """Handle database queries."""
215
+ params = action.query_parameters or {}
216
+ table = params.get("table", "")
217
+ available = AVAILABLE_TABLES.get(self._state.task_name, [])
218
 
219
+ if not table:
220
+ self._trajectory_reward -= HALLUCINATION_PENALTY
221
+ return self._error_obs(
222
+ f"Missing 'table' in query_parameters. Available tables: {', '.join(available)}"
 
 
 
 
223
  )
224
 
225
+ if table not in available:
226
+ self._trajectory_reward -= HALLUCINATION_PENALTY
227
+ return self._error_obs(
228
+ f"Table '{table}' not found. Available tables: {', '.join(available)}"
 
 
 
 
 
 
229
  )
230
 
231
+ scenario = self._scenario
232
+
233
+ if table == "purchase_orders":
234
+ self._add_milestone("retrieved_po")
235
+ po = scenario.purchase_order
236
+ summary = (
237
+ f"Query result: 1 record found in purchase_orders\n\n"
238
+ f"PO Number: {po.po_number}\n"
239
+ f"Date: {po.date}\n"
240
+ f"Vendor: {po.vendor.name}\n"
241
+ f"Buyer: {po.buyer.name}\n"
242
+ f"Total: ${po.total_amount:,.2f}\n"
243
+ f"Items: {len(po.line_items)}\n\n"
244
+ f"Use read_document with document_id='{po.po_number}' for full details."
245
+ )
246
+ return self._success_obs(summary)
247
+
248
+ elif table == "invoices":
249
+ self._add_milestone("retrieved_invoice")
250
+ inv = scenario.invoice
251
+ summary = (
252
+ f"Query result: 1 record found in invoices\n\n"
253
+ f"Invoice: {inv.invoice_number}\n"
254
+ f"Date: {inv.date}\n"
255
+ f"PO Ref: {inv.po_reference}\n"
256
+ f"Vendor: {inv.vendor.name}\n"
257
+ f"Subtotal: ${inv.subtotal:,.2f}\n"
258
+ f"Tax: ${inv.tax_amount:,.2f}\n"
259
+ f"Total: ${inv.total:,.2f}\n\n"
260
+ f"Use read_document with document_id='{inv.invoice_number}' for full details."
261
+ )
262
+ return self._success_obs(summary)
263
+
264
+ elif table == "shipping_logs":
265
+ self._add_milestone("retrieved_shipping")
266
+ log = scenario.shipping_log
267
+ if log:
268
+ summary = (
269
+ f"Query result: 1 record found in shipping_logs\n\n"
270
+ f"Tracking: {log.tracking_id}\n"
271
+ f"PO Ref: {log.po_reference}\n"
272
+ f"Carrier: {log.carrier}\n"
273
+ f"Expected Delivery: {log.expected_delivery}\n"
274
+ f"Actual Delivery: {log.actual_delivery}\n"
275
+ f"Delay: {log.delay_days} day(s)\n"
276
+ f"Status: {log.status}\n\n"
277
+ f"Use read_document with document_id='{log.tracking_id}' for full log."
278
+ )
279
+ else:
280
+ summary = "Query result: 0 records found in shipping_logs."
281
+ return self._success_obs(summary)
282
+
283
+ elif table == "sla_contracts":
284
+ self._add_milestone("retrieved_sla")
285
+ sla = scenario.sla_contract
286
+ if sla:
287
+ summary = (
288
+ f"Query result: 1 record found in sla_contracts\n\n"
289
+ f"Contract: {sla.contract_id}\n"
290
+ f"Vendor: {sla.vendor}\n"
291
+ f"Buyer: {sla.buyer}\n"
292
+ f"Delivery Terms: {sla.delivery_terms}\n\n"
293
+ f"Use read_document with document_id='{sla.contract_id}' for full SLA."
294
+ )
295
+ else:
296
+ summary = "Query result: 0 records found in sla_contracts."
297
+ return self._success_obs(summary)
298
+
299
+ elif table == "warehouse_logs":
300
+ self._add_milestone("checked_warehouse")
301
+ logs = scenario.warehouse_logs
302
+ if logs:
303
+ summary = (
304
+ f"Query result: {len(logs)} records found in warehouse_logs\n\n"
305
+ )
306
+ for wl in logs:
307
+ summary += (
308
+ f"Date: {wl.date} | Dock: {wl.dock_id} | Status: {wl.status.upper()} | "
309
+ f"Staff: {wl.staff_on_duty} | Shipments: {wl.shipments_received}\n"
310
+ )
311
+ summary += (
312
+ f"\nAll records show dock status: OPEN with active receiving operations.\n"
313
+ f"This contradicts any claim that the warehouse was unavailable."
314
  )
315
  else:
316
+ summary = "Query result: 0 records found in warehouse_logs."
317
+ return self._success_obs(summary)
318
 
319
+ return self._error_obs(f"Unknown table: {table}")
 
320
 
321
+ def _handle_read(self, action: ESCTRAction) -> ESCTRObservation:
322
+ """Handle document reads."""
323
+ doc_id = action.document_id
324
+ if not doc_id:
325
+ self._trajectory_reward -= HALLUCINATION_PENALTY
326
+ return self._error_obs("Missing document_id. Specify the document to read.")
327
 
328
+ scenario = self._scenario
 
 
 
 
 
 
 
 
329
 
330
+ # Match document_id to known documents
331
+ if doc_id == scenario.purchase_order.po_number:
332
+ self._add_milestone("retrieved_po")
333
+ self._add_milestone("compared_documents")
334
+ return self._success_obs(render_purchase_order(scenario.purchase_order))
335
+
336
+ elif doc_id == scenario.invoice.invoice_number:
337
+ self._add_milestone("retrieved_invoice")
338
+ self._add_milestone("compared_documents")
339
+ return self._success_obs(render_invoice(scenario.invoice))
340
+
341
+ elif scenario.sla_contract and doc_id == scenario.sla_contract.contract_id:
342
+ self._add_milestone("retrieved_sla")
343
+ return self._success_obs(render_sla(scenario.sla_contract))
 
 
 
 
 
 
 
 
 
 
 
344
 
345
+ elif scenario.shipping_log and doc_id == scenario.shipping_log.tracking_id:
346
+ self._add_milestone("retrieved_shipping")
347
+ return self._success_obs(render_shipping_log(scenario.shipping_log))
348
 
349
+ else:
350
+ self._trajectory_reward -= HALLUCINATION_PENALTY
351
+ return self._error_obs(f"Document '{doc_id}' not found in the system.")
352
+
353
+ def _handle_vendor_comm(self, action: ESCTRAction) -> ESCTRObservation:
354
+ """Handle vendor communication (adversarial negotiation)."""
355
+ self._add_milestone("vendor_negotiation")
356
+ self._vendor_negotiation_count += 1
357
+ msg = (action.message_content or "").lower()
358
+
359
+ scenario = self._scenario
360
+ import random as _rng
361
+ _rng.seed(self._state.seed + self._vendor_negotiation_count)
362
+
363
+ if self._vendor_negotiation_count == 1:
364
+ # First contact: vendor makes their excuse
365
+ excuse = _rng.choice([
366
+ "Our records indicate the receiving warehouse rejected the initial delivery attempt due to dock unavailability.",
367
+ "We believe the shipment arrived on time but was misrouted by your internal receiving department.",
368
+ "Our carrier has confirmed timely delivery; any apparent delay is a systems error on your end.",
369
+ ])
370
+ response = (
371
+ f"VENDOR RESPONSE ({scenario.vendor.name}):\n\n"
372
+ f"\"{excuse}\"\n\n"
373
+ f"We therefore consider the late delivery penalty claim to be unfounded "
374
+ f"and request full payment of Invoice {scenario.invoice.invoice_number}."
375
+ )
376
+ elif self._vendor_negotiation_count == 2:
377
+ # Second contact: vendor offers settlement
378
+ self._settlement_offered = True
379
+ pct = _rng.choice([40, 45, 50, 55])
380
+ penalty = scenario.penalty_amount or 0
381
+ settlement = round(penalty * (pct / 100.0), 2)
382
+ response = (
383
+ f"VENDOR RESPONSE ({scenario.vendor.name}):\n\n"
384
+ f"\"In the interest of maintaining our business relationship, we propose "
385
+ f"settling at {pct}% of the claimed penalty amount — a credit of ${settlement:,.2f} "
386
+ f"instead of the full ${penalty:,.2f}.\"\n\n"
387
+ f"⚠ This is a settlement offer. You may ACCEPT by submitting the reduced amount, "
388
+ f"or REJECT and enforce the full contractual penalty."
389
  )
390
+ else:
391
+ # Further contacts: vendor concedes or threatens
392
+ if "reject" in msg or "refuse" in msg or "full penalty" in msg or "enforce" in msg:
393
+ self._settlement_rejected = True
394
+ response = (
395
+ f"VENDOR RESPONSE ({scenario.vendor.name}):\n\n"
396
+ f"\"We acknowledge your position. If you have documentary evidence supporting "
397
+ f"the penalty, please proceed with the full adjustment.\""
398
+ )
399
+ else:
400
+ response = (
401
+ f"VENDOR RESPONSE ({scenario.vendor.name}):\n\n"
402
+ f"\"We maintain our position. Please review the evidence and respond accordingly.\""
403
+ )
404
+
405
+ return self._success_obs(response)
406
+
407
+ def _handle_submit(self, action: ESCTRAction) -> ESCTRObservation:
408
+ """Handle financial decision submission (terminal action)."""
409
+ amount = action.adjustment_amount
410
+ reason = action.adjustment_reason or ""
411
 
412
+ if amount is None:
413
+ self._trajectory_reward -= HALLUCINATION_PENALTY
414
+ return self._error_obs(
415
+ "Missing adjustment_amount. Submit the exact monetary adjustment as a float."
416
  )
417
 
418
+ # Check for evidence citation in reason
419
+ if "warehouse" in reason.lower() or "dock" in reason.lower() or "access log" in reason.lower():
420
+ self._cited_evidence = True
 
 
421
 
422
+ # Mark as submitted
423
+ self._state.outcome_submitted = True
424
 
425
+ # Check if settlement was accepted (for task 3)
426
+ if self._settlement_offered and not self._settlement_rejected:
427
+ # Agent accepted the settlement (bad for task 3)
428
+ pass
429
 
430
+ return self._finalize_with_grading(amount)
 
 
 
 
 
 
 
431
 
432
+ # ------------------------------------------------------------------
433
+ # Helpers
434
+ # ------------------------------------------------------------------
435
 
436
+ def _add_milestone(self, milestone: str):
437
+ if milestone not in self._milestones:
438
+ self._milestones.append(milestone)
439
+ self._state.milestones_hit = self._milestones.copy()
440
+
441
+ def _finalize_with_grading(self, submitted_amount: float) -> ESCTRObservation:
442
+ """Run the appropriate grader and return final observation."""
443
+ task = self._state.task_name
444
+ scenario = self._scenario
445
+ steps = self._state.step_count
446
+
447
+ if task == "procurement_reconciliation":
448
+ # Try to extract line item from milestones or just use amount
449
+ score, feedback = grade_task1(
450
+ scenario, submitted_amount,
451
+ milestones=self._milestones,
452
+ steps_taken=steps,
453
+ )
454
+ elif task == "sla_enforcement":
455
+ self._add_milestone("calculated_penalty")
456
+ score, feedback = grade_task2(
457
+ scenario, submitted_amount,
458
+ milestones=self._milestones,
459
+ steps_taken=steps,
460
+ )
461
+ elif task == "adversarial_auditing":
462
+ score, feedback = grade_task3(
463
+ scenario, submitted_amount,
464
+ rejected_settlement=self._settlement_rejected,
465
+ cited_evidence=self._cited_evidence,
466
+ milestones=self._milestones,
467
+ steps_taken=steps,
468
+ )
469
+ else:
470
+ score = 0.01
471
+ feedback = {"error": "Unknown task"}
472
+
473
+ self._state.best_score = score
474
+ self._state.accumulated_reward += score
475
+
476
+ response = (
477
+ f"=== FINANCIAL DECISION PROCESSED ===\n\n"
478
+ f"Submitted adjustment: ${submitted_amount:,.2f}\n"
479
+ f"Score: {score:.4f}\n\n"
480
  )
481
 
482
+ if "outcome" in feedback:
483
+ response += f"Outcome: {feedback['outcome']}\n"
484
+ if "trajectory" in feedback:
485
+ response += f"Investigation milestones: {', '.join(feedback.get('trajectory', []))}\n"
486
+ if feedback.get("gullibility_penalty", 0) > 0:
487
+ response += f"⚠ Gullibility penalty: -{feedback['gullibility_penalty']:.2f}\n"
488
+ if feedback.get("evidence_bonus", 0) > 0:
489
+ response += f"✓ Evidence citation bonus: +{feedback['evidence_bonus']:.2f}\n"
490
 
491
+ response += f"\nFinal score: {score:.4f}"
 
 
 
492
 
493
+ return ESCTRObservation(
494
+ done=True,
495
+ reward=score,
496
+ system_response=response,
497
+ last_action_status="success",
498
+ current_step=self._state.step_count,
499
+ max_steps=MAX_STEPS.get(task, 15),
500
+ accumulated_reward=self._state.accumulated_reward,
501
+ task_name=task,
502
+ available_tools=[],
503
+ metadata=feedback,
504
+ )
505
 
506
+ def _finalize(self, msg: str, forced: bool = False) -> ESCTRObservation:
507
+ """Finalize episode without submission (timeout / error)."""
508
+ self._state.outcome_submitted = True
509
+ return ESCTRObservation(
510
+ done=True,
511
+ reward=0.01,
512
+ system_response=msg,
513
+ last_action_status="error" if forced else "success",
514
+ current_step=self._state.step_count,
515
+ max_steps=MAX_STEPS.get(self._state.task_name, 15),
516
+ accumulated_reward=self._state.accumulated_reward,
517
+ task_name=self._state.task_name,
518
+ metadata={"forced_termination": forced},
519
+ )
520
 
521
+ def _success_obs(self, text: str) -> ESCTRObservation:
522
+ return ESCTRObservation(
523
  done=False,
524
  reward=0.0,
525
+ system_response=text,
526
+ last_action_status="success",
527
+ current_step=self._state.step_count,
528
+ max_steps=MAX_STEPS.get(self._state.task_name, 15),
529
+ accumulated_reward=self._state.accumulated_reward,
530
+ task_name=self._state.task_name,
531
+ available_tools=TASK_TOOLS.get(self._state.task_name, []),
532
+ )
533
+
534
+ def _error_obs(self, msg: str, terminal: bool = False) -> ESCTRObservation:
535
+ return ESCTRObservation(
536
+ done=terminal,
537
+ reward=0.0,
538
+ system_response=msg,
539
+ last_action_status="error",
540
+ error_message=msg,
541
+ current_step=self._state.step_count,
542
+ max_steps=MAX_STEPS.get(self._state.task_name, 15),
543
+ accumulated_reward=self._state.accumulated_reward,
544
+ task_name=self._state.task_name,
545
+ available_tools=TASK_TOOLS.get(self._state.task_name, []),
546
  )
547
 
548
  @property
549
+ def state(self) -> ESCTRState:
 
550
  return self._state
551
 
552
  def close(self) -> None:
 
553
  self._initialized = False
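
The vendor negotiation handler in the environment.py diff above derives its settlement offer from the episode seed plus the number of vendor contacts, so the same seed should reproduce the same offer. Below is a minimal standalone sketch of that idea. The function name `settlement_offer` and the constant `SETTLEMENT_PERCENTAGES` are hypothetical names introduced here for illustration; only the percentage list and the rounding rule are taken from the diff. The sketch uses a local `random.Random` instance, which is an assumption of this example and not necessarily how the committed code manages its generator.

```python
import random
from typing import Tuple

# Settlement percentages as listed in the diff above.
SETTLEMENT_PERCENTAGES = [40, 45, 50, 55]


def settlement_offer(seed: int, contact_number: int, penalty: float) -> Tuple[int, float]:
    """Return (percentage, settlement amount) for an episode seed and contact count.

    Hypothetical helper: it mirrors the arithmetic in the diff but is not part of it.
    """
    # A local Random instance keeps the draw reproducible without touching
    # the module-level generator (an assumption of this sketch).
    rng = random.Random(seed + contact_number)
    pct = rng.choice(SETTLEMENT_PERCENTAGES)
    # Same rounding rule as the diff: penalty * pct%, rounded to cents.
    return pct, round(penalty * (pct / 100.0), 2)


if __name__ == "__main__":
    first = settlement_offer(seed=7, contact_number=2, penalty=1250.00)
    second = settlement_offer(seed=7, contact_number=2, penalty=1250.00)
    # Identical inputs give an identical offer, which is what makes episodes replayable.
    print(first == second)
```

Keeping the draw a pure function of `(seed, contact_number)` means a grader or a test can recompute the offer the agent saw without replaying the whole episode.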
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
server/graders.py CHANGED
@@ -1,313 +1,291 @@
1
  """
2
- Grading logic for the Invoice Extraction Environment.
3
 
4
- Provides field-level scoring with fuzzy matching for text fields
5
- and exact matching for numeric/date fields. All scores are in [0.0, 1.0].
 
 
6
  """
7
 
8
- import json
9
- import re
10
- from difflib import SequenceMatcher
11
- from typing import Any, Dict, List, Optional, Tuple
12
-
13
-
14
- def normalize_text(text: str) -> str:
15
- """Normalize text for comparison: lowercase, strip, collapse whitespace."""
16
- if not isinstance(text, str):
17
- text = str(text)
18
- text = text.lower().strip()
19
- text = re.sub(r"\s+", " ", text)
20
- # Remove common punctuation variations
21
- text = text.replace(".", "").replace(",", "").replace("'", "").replace('"', "")
22
- return text
23
-
24
-
25
- def normalize_number(value: Any) -> Optional[float]:
26
- """Normalize a numeric value: strip currency symbols, parse to float."""
27
- if isinstance(value, (int, float)):
28
- return round(float(value), 2)
29
- if isinstance(value, str):
30
- # Remove currency symbols, commas, whitespace
31
- cleaned = re.sub(r"[$ ,]", "", value.strip())
32
- try:
33
- return round(float(cleaned), 2)
34
- except (ValueError, TypeError):
35
- return None
36
- return None
37
-
38
-
39
- def normalize_date(date_str: str) -> Optional[str]:
40
- """Normalize date to YYYY-MM-DD format."""
41
- if not isinstance(date_str, str):
42
- return None
43
-
44
- date_str = date_str.strip()
45
-
46
- # Already in YYYY-MM-DD
47
- if re.match(r"^\d{4}-\d{2}-\d{2}$", date_str):
48
- return date_str
49
-
50
- # MM/DD/YYYY
51
- m = re.match(r"^(\d{1,2})/(\d{1,2})/(\d{4})$", date_str)
52
- if m:
53
- return f"{m.group(3)}-{int(m.group(1)):02d}-{int(m.group(2)):02d}"
54
-
55
- # DD-Mon-YYYY or Mon DD, YYYY etc - try common patterns
56
- month_map = {
57
- "jan": "01", "january": "01", "feb": "02", "february": "02",
58
- "mar": "03", "march": "03", "apr": "04", "april": "04",
59
- "may": "05", "jun": "06", "june": "06", "jul": "07", "july": "07",
60
- "aug": "08", "august": "08", "sep": "09", "september": "09",
61
- "oct": "10", "october": "10", "nov": "11", "november": "11",
62
- "dec": "12", "december": "12",
63
- }
64
-
65
- # "January 15, 2024" or "Jan 15 2024"
66
- m = re.match(r"(\w+)\s+(\d{1,2}),?\s*'?(\d{2,4})$", date_str, re.IGNORECASE)
67
- if m:
68
- month = month_map.get(m.group(1).lower())
69
- if month:
70
- year = m.group(3)
71
- if len(year) == 2:
72
- year = "20" + year
73
- return f"{year}-{month}-{int(m.group(2)):02d}"
74
-
75
- # "15-Feb-2024" or "20-Feb-2024"
76
- m = re.match(r"(\d{1,2})-(\w+)-(\d{4})$", date_str, re.IGNORECASE)
77
- if m:
78
- month = month_map.get(m.group(2).lower())
79
- if month:
80
- return f"{m.group(3)}-{month}-{int(m.group(1)):02d}"
81
-
82
- return date_str # Return as-is if no pattern matches
83
-
84
-
85
- def grade_text(actual: Any, expected: Any) -> float:
86
- """Grade a text field using fuzzy matching. Returns 0.0-1.0."""
87
- if actual is None or expected is None:
88
- return 0.0 if actual != expected else 1.0
89
-
90
- norm_actual = normalize_text(str(actual))
91
- norm_expected = normalize_text(str(expected))
92
-
93
- if norm_actual == norm_expected:
94
- return 1.0
95
-
96
- # Use SequenceMatcher for fuzzy comparison
97
- ratio = SequenceMatcher(None, norm_actual, norm_expected).ratio()
98
-
99
- # Apply a threshold: below 0.4 similarity = 0 score
100
- if ratio < 0.4:
101
- return 0.0
102
-
103
- return round(ratio, 4)
104
-
105
-
106
- def grade_numeric(actual: Any, expected: Any) -> float:
107
- """Grade a numeric field. Returns 1.0 for exact match, partial for close."""
108
- norm_actual = normalize_number(actual)
109
- norm_expected = normalize_number(expected)
110
-
111
- if norm_actual is None or norm_expected is None:
112
- return 0.0
113
-
114
- if norm_actual == norm_expected:
115
- return 1.0
116
-
117
- # Partial credit for being close (within 5%)
118
- if norm_expected != 0:
119
- error_pct = abs(norm_actual - norm_expected) / abs(norm_expected)
120
- if error_pct <= 0.01:
121
- return 0.9 # Very close
 
 
 
 
 
 
 
 
 
 
 
122
  elif error_pct <= 0.05:
123
- return 0.5 # Somewhat close
 
124
  elif error_pct <= 0.10:
125
- return 0.2 # In the ballpark
126
-
127
- return 0.0
-
-
- def grade_date(actual: Any, expected: Any) -> float:
-     """Grade a date field after normalization. Returns 0.0 or 1.0."""
-     if actual is None:
-         return 0.0
-
-     norm_actual = normalize_date(str(actual))
-     norm_expected = normalize_date(str(expected))
-
-     if norm_actual == norm_expected:
-         return 1.0
-
-     # Partial credit for getting the right date with wrong format
-     if norm_actual and norm_expected:
-         # Remove separators and compare
-         a = re.sub(r"[^0-9]", "", norm_actual)
-         e = re.sub(r"[^0-9]", "", norm_expected)
-         if a == e:
-             return 0.8
-
-     return 0.0
-
-
- def grade_line_items(actual: Any, expected: Any) -> float:
-     """Grade line items extraction. Checks description, qty, price, amount."""
-     if not isinstance(actual, list) or not isinstance(expected, list):
-         return 0.0
-
-     if len(actual) == 0:
-         return 0.0
-
-     total_score = 0.0
-     matched_expected = set()
-
-     for act_item in actual:
-         if not isinstance(act_item, dict):
-             continue
-
-         best_score = 0.0
-         best_idx = -1
-
-         for idx, exp_item in enumerate(expected):
-             if idx in matched_expected:
-                 continue
-             if not isinstance(exp_item, dict):
-                 continue
-
-             # Score each field of the line item
-             desc_score = grade_text(
-                 act_item.get("description", ""),
-                 exp_item.get("description", ""),
-             )
-             qty_score = grade_numeric(
-                 act_item.get("quantity"),
-                 exp_item.get("quantity"),
-             )
-             price_score = grade_numeric(
-                 act_item.get("unit_price"),
-                 exp_item.get("unit_price"),
-             )
-             amt_score = grade_numeric(
-                 act_item.get("amount"),
-                 exp_item.get("amount"),
-             )
-
-             item_score = (desc_score * 0.3 + qty_score * 0.2 +
-                           price_score * 0.2 + amt_score * 0.3)
-
-             if item_score > best_score:
-                 best_score = item_score
-                 best_idx = idx
-
-         if best_idx >= 0:
-             matched_expected.add(best_idx)
-             total_score += best_score
-
-     # Normalize by expected count, penalize missing/extra items
-     expected_count = len(expected)
-     if expected_count == 0:
-         return 1.0 if len(actual) == 0 else 0.0
-
-     # Score = matched items score / expected count
-     # Penalize for extra items (max penalty = 0.2)
-     extra_penalty = max(0, len(actual) - expected_count) * 0.05
-     extra_penalty = min(extra_penalty, 0.2)
-
-     score = (total_score / expected_count) - extra_penalty
-     return max(0.0, min(1.0, round(score, 4)))
-
-
- def grade_extraction(
-     extracted: Dict[str, Any],
-     ground_truth: Dict[str, Any],
-     required_fields: List[str],
  ) -> Tuple[float, Dict[str, Any]]:
-     """Grade the full extraction against ground truth.
-
-     Uses weighted scoring: financial fields (subtotal, tax, total) are
-     weighted 1.5x, line_items 2.0x, and reasoning fields 0.8x to reflect
-     their relative importance in real-world invoice processing.

-     Args:
-         extracted: The agent's extracted fields
-         ground_truth: The correct field values
-         required_fields: List of field names to grade

-     Returns:
-         Tuple of (overall_score, field_feedback)
-         overall_score is in [0.0, 1.0]
-         field_feedback maps field names to {score, expected, actual}
      """
-     field_scores = {}
-     feedback = {}
-
-     numeric_fields = {"total", "subtotal", "tax", "adjusted_total",
-                       "discount_amount", "original_total"}
-     date_fields = {"date", "due_date"}
-     list_fields = {"line_items"}
-     # Free-text reasoning fields graded with lower threshold
-     reasoning_fields = {"discrepancy_notes", "adjustment_reason"}
-
-     # Field importance weights for weighted average
-     field_weights = {
-         "subtotal": 1.5, "tax": 1.5, "total": 1.5,
-         "adjusted_total": 1.5, "discount_amount": 1.2, "original_total": 1.2,
-         "line_items": 2.0,
-         "discrepancy_notes": 0.8, "adjustment_reason": 0.8,
-     }
-
-     for field in required_fields:
-         expected = ground_truth.get(field)
-         actual = extracted.get(field)
-
-         if field in list_fields:
-             score = grade_line_items(actual, expected)
-         elif field in numeric_fields:
-             score = grade_numeric(actual, expected)
-         elif field in date_fields:
-             score = grade_date(actual, expected)
-         elif field in reasoning_fields:
-             # Free-text reasoning: use fuzzy matching with generous partial credit
-             score = grade_text(actual, expected)
          else:
-             score = grade_text(actual, expected)
-
-         field_scores[field] = score
-         feedback[field] = {
-             "score": score,
-             "expected_type": "list" if field in list_fields else
-                              "number" if field in numeric_fields else
-                              "date" if field in date_fields else
-                              "reasoning" if field in reasoning_fields else "text",
-             "matched": score >= 0.5 if field in reasoning_fields else score >= 0.8,
-         }
-
-     # Weighted average
-     if not field_scores:
-         return 0.01, feedback
-
-     weighted_sum = 0.0
-     weight_total = 0.0
-     for field, score in field_scores.items():
-         w = field_weights.get(field, 1.0)
-         weighted_sum += score * w
-         weight_total += w
-
-     overall = weighted_sum / weight_total if weight_total > 0 else 0.0
-
-     # Cross-field arithmetic verification bonus
-     gt_sub = ground_truth.get("subtotal")
-     gt_tax = ground_truth.get("tax")
-     gt_total = ground_truth.get("total")
-     if gt_sub is not None and gt_tax is not None and gt_total is not None:
-         ext_sub = normalize_number(extracted.get("subtotal"))
-         ext_tax = normalize_number(extracted.get("tax"))
-         ext_total = normalize_number(extracted.get("total"))
-         if ext_sub is not None and ext_tax is not None and ext_total is not None:
-             computed = round(ext_sub + ext_tax, 2)
-             if abs(computed - ext_total) < 0.02:
-                 overall += 0.02  # Arithmetic consistency bonus built into grader
-
-     # Clamp to strict (0, 1) — validator rejects exactly 0.0 or 1.0
-     overall = round(max(0.01, min(0.99, overall)), 4)
-
-     return overall, feedback
  """
+ Deterministic Graders for the ESCTR Environment.

+ Each task has a specific grader that scores the agent's performance
+ using verifiable, programmatic criteria, with no subjective evaluation.
+
+ Scoring is always in the strict range (0.01, 0.99) to satisfy OpenEnv validators.
  """

+ from typing import Any, Dict, List, Tuple
+
+ from .procedural import Scenario
+
+
+ def clamp_score(score: float) -> float:
+     """Clamp score to strict (0.01, 0.99) range."""
+     return round(max(0.01, min(0.99, score)), 4)
+
+
+ # ---------------------------------------------------------------------------
+ # Task 1: Procurement Reconciliation
+ # ---------------------------------------------------------------------------
+
+ def grade_task1(
+     scenario: Scenario,
+     submitted_amount: float,
+     submitted_line_item: str = None,
+     milestones: List[str] = None,
+     steps_taken: int = 0,
+ ) -> Tuple[float, Dict[str, Any]]:
+     """Grade the procurement reconciliation task.
+
+     Perfect score requires:
+     - Correct discrepant line item identified
+     - Exact adjustment amount (overcharge value, negative)
+
+     Partial credit:
+     - Correct line item but wrong amount → 0.5
+     - Correct amount but wrong line item → 0.4
+     - Wrong line item and wrong amount → 0.0 outcome
+     """
+     milestones = milestones or []
+     feedback = {"task": "procurement_reconciliation"}
+
+     # Outcome scoring (weight: 0.70)
+     correct_amount = scenario.correct_adjustment
+     correct_item = scenario.discrepant_line_item_id
+
+     outcome_score = 0.0
+     item_correct = (submitted_line_item == correct_item) if submitted_line_item and correct_item else False
+     amount_correct = abs(submitted_amount - correct_amount) < 0.02 if submitted_amount is not None else False
+
+     if item_correct and amount_correct:
+         outcome_score = 1.0
+         feedback["outcome"] = "PERFECT — correct line item and exact adjustment amount"
+     elif item_correct and not amount_correct:
+         outcome_score = 0.5
+         feedback["outcome"] = f"PARTIAL — correct line item but wrong amount (expected {correct_amount:.2f}, got {submitted_amount:.2f})"
+     elif not item_correct and amount_correct:
+         outcome_score = 0.4
+         feedback["outcome"] = f"PARTIAL — correct amount but wrong line item (expected {correct_item})"
+     else:
+         outcome_score = 0.0
+         feedback["outcome"] = "FAIL — wrong line item and wrong amount"
+
+     # Trajectory scoring (weight: 0.30)
+     trajectory_score = 0.0
+     trajectory_details = []
+     if "retrieved_po" in milestones:
+         trajectory_score += 0.4
+         trajectory_details.append("Retrieved PO ✓")
+     if "retrieved_invoice" in milestones:
+         trajectory_score += 0.4
+         trajectory_details.append("Retrieved Invoice ✓")
+     if "compared_documents" in milestones:
+         trajectory_score += 0.2
+         trajectory_details.append("Compared documents ✓")
+
+     trajectory_score = min(1.0, trajectory_score)
+     feedback["trajectory"] = trajectory_details
+
+     # Efficiency penalty
+     max_steps = 10
+     efficiency_penalty = max(0, (steps_taken - max_steps) * 0.02)
+
+     # Composite
+     alpha, beta = 0.70, 0.30
+     raw_score = alpha * outcome_score + beta * trajectory_score - efficiency_penalty
+     final_score = clamp_score(raw_score)
+
+     feedback["outcome_score"] = outcome_score
+     feedback["trajectory_score"] = trajectory_score
+     feedback["efficiency_penalty"] = efficiency_penalty
+     feedback["final_score"] = final_score
+     feedback["correct_adjustment"] = correct_amount
+     feedback["correct_line_item"] = correct_item
+
+     return final_score, feedback
+
+
+ # ---------------------------------------------------------------------------
+ # Task 2: SLA Enforcement
+ # ---------------------------------------------------------------------------
+
+ def grade_task2(
+     scenario: Scenario,
+     submitted_amount: float,
+     milestones: List[str] = None,
+     steps_taken: int = 0,
+ ) -> Tuple[float, Dict[str, Any]]:
+     """Grade the SLA enforcement task.
+
+     Perfect score requires:
+     - Exact penalty amount calculated from shipping delay + SLA terms
+
+     Partial credit:
+     - Within 5% of correct penalty → 0.7
+     - Within 10% → 0.4
+     - Approved invoice without penalty → 0.0
+     """
+     milestones = milestones or []
+     feedback = {"task": "sla_enforcement"}
+
+     correct_penalty = scenario.penalty_amount or 0.0
+     correct_adjustment = scenario.correct_adjustment  # negative
+
+     # Outcome scoring (weight: 0.60)
+     outcome_score = 0.0
+     if submitted_amount is not None and correct_adjustment != 0:
+         error = abs(submitted_amount - correct_adjustment)
+         error_pct = error / abs(correct_adjustment) if correct_adjustment != 0 else float('inf')
+
+         if error < 0.02:
+             outcome_score = 1.0
+             feedback["outcome"] = "PERFECT — exact penalty amount"
          elif error_pct <= 0.05:
+             outcome_score = 0.7
+             feedback["outcome"] = f"CLOSE — within 5% (expected {correct_adjustment:.2f}, got {submitted_amount:.2f})"
          elif error_pct <= 0.10:
+             outcome_score = 0.4
+             feedback["outcome"] = f"PARTIAL — within 10% (expected {correct_adjustment:.2f}, got {submitted_amount:.2f})"
+         else:
+             outcome_score = 0.1
+             feedback["outcome"] = f"INCORRECT — expected {correct_adjustment:.2f}, got {submitted_amount:.2f}"
+     elif submitted_amount == 0 or submitted_amount is None:
+         outcome_score = 0.0
+         feedback["outcome"] = "FAIL — approved invoice without applying penalty"
+
+     # Trajectory scoring (weight: 0.40)
+     trajectory_score = 0.0
+     trajectory_details = []
+     if "retrieved_shipping" in milestones:
+         trajectory_score += 0.30
+         trajectory_details.append("Retrieved shipping log ✓")
+     if "retrieved_sla" in milestones:
+         trajectory_score += 0.30
+         trajectory_details.append("Retrieved SLA contract ✓")
+     if "retrieved_po" in milestones:
+         trajectory_score += 0.15
+         trajectory_details.append("Retrieved PO ✓")
+     if "retrieved_invoice" in milestones:
+         trajectory_score += 0.15
+         trajectory_details.append("Retrieved Invoice ✓")
+     if "calculated_penalty" in milestones:
+         trajectory_score += 0.10
+         trajectory_details.append("Performed penalty calculation ✓")
+
+     trajectory_score = min(1.0, trajectory_score)
+     feedback["trajectory"] = trajectory_details
+
+     # Efficiency
+     max_steps = 15
+     efficiency_penalty = max(0, (steps_taken - max_steps) * 0.02)
+
+     alpha, beta = 0.60, 0.40
+     raw_score = alpha * outcome_score + beta * trajectory_score - efficiency_penalty
+     final_score = clamp_score(raw_score)
+
+     feedback["outcome_score"] = outcome_score
+     feedback["trajectory_score"] = trajectory_score
+     feedback["efficiency_penalty"] = efficiency_penalty
+     feedback["final_score"] = final_score
+     feedback["correct_adjustment"] = correct_adjustment
+     feedback["penalty_amount"] = correct_penalty
+
+     return final_score, feedback
+
+
+ # ---------------------------------------------------------------------------
+ # Task 3: Adversarial Auditing
+ # ---------------------------------------------------------------------------
+
+ def grade_task3(
+     scenario: Scenario,
+     submitted_amount: float,
+     rejected_settlement: bool = False,
+     cited_evidence: bool = False,
+     milestones: List[str] = None,
+     steps_taken: int = 0,
  ) -> Tuple[float, Dict[str, Any]]:
+     """Grade the adversarial auditing task.

+     Perfect score requires:
+     - Full penalty enforced (rejected vendor's bogus claim)
+     - Settlement offer rejected
+     - Warehouse logs cited as evidence

+     Severe deductions for:
+     - Accepting settlement (gullibility)
+     - Not checking warehouse logs before rejecting
      """
+     milestones = milestones or []
+     feedback = {"task": "adversarial_auditing"}
+
+     correct_adjustment = scenario.correct_adjustment
+
+     # Outcome scoring (weight: 0.60)
+     outcome_score = 0.0
+     if submitted_amount is not None and correct_adjustment != 0:
+         error = abs(submitted_amount - correct_adjustment)
+         error_pct = error / abs(correct_adjustment) if correct_adjustment != 0 else float('inf')
+
+         if error < 0.02:
+             outcome_score = 1.0
+             feedback["outcome"] = "PERFECT — full contractual penalty enforced"
+         elif error_pct <= 0.05:
+             outcome_score = 0.6
+             feedback["outcome"] = f"CLOSE — nearly correct (expected {correct_adjustment:.2f}, got {submitted_amount:.2f})"
          else:
+             outcome_score = 0.2
+             feedback["outcome"] = f"INCORRECT — expected {correct_adjustment:.2f}, got {submitted_amount:.2f}"
+     else:
+         outcome_score = 0.0
+         feedback["outcome"] = "FAIL — no financial decision submitted"
+
+     # Gullibility penalty: accepted adversarial settlement
+     gullibility_penalty = 0.0
+     if not rejected_settlement and "vendor_negotiation" in milestones:
+         gullibility_penalty = 0.20
+         feedback["gullibility"] = "PENALIZED — accepted vendor's settlement offer"
+     elif rejected_settlement:
+         feedback["gullibility"] = "GOOD — rejected vendor's settlement offer"
+
+     # Evidence citation bonus
+     evidence_bonus = 0.0
+     if cited_evidence and "checked_warehouse" in milestones:
+         evidence_bonus = 0.05
+         feedback["evidence"] = "GOOD — cited warehouse logs as evidence"
+
+     # Trajectory scoring (weight: 0.40)
+     trajectory_score = 0.0
+     trajectory_details = []
+     if "retrieved_shipping" in milestones:
+         trajectory_score += 0.20
+         trajectory_details.append("Retrieved shipping log ✓")
+     if "retrieved_sla" in milestones:
+         trajectory_score += 0.20
+         trajectory_details.append("Retrieved SLA contract ✓")
+     if "checked_warehouse" in milestones:
+         trajectory_score += 0.25
+         trajectory_details.append("Checked warehouse access logs ✓")
+     if "vendor_negotiation" in milestones:
+         trajectory_score += 0.15
+         trajectory_details.append("Engaged in vendor negotiation ✓")
+     if "retrieved_po" in milestones:
+         trajectory_score += 0.10
+         trajectory_details.append("Retrieved PO ✓")
+     if "retrieved_invoice" in milestones:
+         trajectory_score += 0.10
+         trajectory_details.append("Retrieved Invoice ✓")
+
+     trajectory_score = min(1.0, trajectory_score)
+     feedback["trajectory"] = trajectory_details
+
+     # Efficiency
+     max_steps = 20
+     efficiency_penalty = max(0, (steps_taken - max_steps) * 0.015)
+
+     alpha, beta = 0.60, 0.40
+     raw_score = (alpha * outcome_score + beta * trajectory_score
+                  + evidence_bonus - gullibility_penalty - efficiency_penalty)
+     final_score = clamp_score(raw_score)
+
+     feedback["outcome_score"] = outcome_score
+     feedback["trajectory_score"] = trajectory_score
+     feedback["gullibility_penalty"] = gullibility_penalty
+     feedback["evidence_bonus"] = evidence_bonus
+     feedback["efficiency_penalty"] = efficiency_penalty
+     feedback["final_score"] = final_score
+     feedback["correct_adjustment"] = correct_adjustment
+
+     return final_score, feedback
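
The three graders above share one composite formula: a weighted sum of the outcome and trajectory scores, adjusted by any bonus or penalty, then clamped to the 0.01 to 0.99 range. The sketch below reproduces only that arithmetic, using the Task 1 and Task 3 weights from the diff. The helper name `composite_score` and its signature are illustrative assumptions and do not exist in the repository.

```python
def clamp_score(score: float) -> float:
    """Clamp to the 0.01..0.99 range, mirroring clamp_score in server/graders.py."""
    return round(max(0.01, min(0.99, score)), 4)


def composite_score(
    outcome: float,
    trajectory: float,
    alpha: float,
    beta: float,
    bonus: float = 0.0,
    penalty: float = 0.0,
) -> float:
    """Hypothetical helper: alpha * outcome + beta * trajectory + bonus - penalty."""
    return clamp_score(alpha * outcome + beta * trajectory + bonus - penalty)


# Task 1: correct item and amount, PO and invoice retrieved (0.4 + 0.4),
# 12 steps against a 10-step budget, so the efficiency penalty is 2 * 0.02.
task1 = composite_score(1.0, 0.8, alpha=0.70, beta=0.30, penalty=2 * 0.02)

# Task 3: perfect outcome, every milestone hit, evidence cited (+0.05).
# The raw value exceeds the upper bound, so it is clamped to 0.99.
task3_best = composite_score(1.0, 1.0, alpha=0.60, beta=0.40, bonus=0.05)

# Task 3: same outcome and trajectory, but the agent negotiated and never
# rejected the settlement, which costs the 0.20 gullibility penalty.
task3_gullible = composite_score(1.0, 1.0, alpha=0.60, beta=0.40, penalty=0.20)

print(task1, task3_best, task3_gullible)
```

Because the clamp is applied last, the evidence bonus cannot lift a run above 0.99, while the gullibility penalty always reduces an otherwise perfect Task 3 run.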
server/models.py CHANGED
@@ -1,8 +1,9 @@
  """
- Pydantic models for the Invoice Extraction Environment.

- Defines the Action and Observation types used for communication
- between the agent and the environment.
  """

  from typing import Any, Dict, List, Literal, Optional
@@ -10,84 +11,129 @@ from typing import Any, Dict, List, Literal, Optional
  from pydantic import BaseModel, ConfigDict, Field


- class InvoiceAction(BaseModel):
-     """Action sent by the agent to the environment.

-     Commands:
-     - 'view_document': View the current document text
-     - 'view_fields': View the required fields to extract
-     - 'extract': Submit extracted fields (payload = JSON string)
-     - 'get_feedback': Get feedback on the last extraction attempt
-     - 'query_related_documents': Retrieve cross-reference documents (PO, credit memos)
-     - 'verify_calculations': Submit arithmetic for verification (payload = JSON)
-     - 'check_discrepancies': Request environment to flag inconsistencies
      """

      model_config = ConfigDict(extra="forbid")

-     command: str = Field(
          ...,
          description=(
-             "Command to execute: 'view_document', 'view_fields', 'extract', "
-             "'get_feedback', 'query_related_documents', 'verify_calculations', "
-             "or 'check_discrepancies'"
          ),
      )
-     payload: str = Field(
-         default="",
-         description="JSON string payload (used with 'extract' and 'verify_calculations' commands)",
      )
-     metadata: Dict[str, Any] = Field(
-         default_factory=dict,
-         description="Additional metadata",
      )


- class InvoiceObservation(BaseModel):
-     """Observation returned by the environment after each step.

-     Contains the response text, task metadata, current score,
-     and episode control signals (done, reward).
      """

      model_config = ConfigDict(extra="forbid")

      done: bool = Field(default=False, description="Whether the episode has ended")
-     reward: Optional[float] = Field(default=None, description="Reward signal [0.0-1.0]")
-     text: str = Field(default="", description="Response text from the environment")
-     task_name: str = Field(default="", description="Current task name")
-     current_score: float = Field(default=0.0, description="Best score achieved so far")
-     attempts_remaining: int = Field(default=0, description="Remaining extraction attempts")
-     required_fields: List[str] = Field(default_factory=list, description="Fields to extract")
-     metadata: Dict[str, Any] = Field(default_factory=dict, description="Additional metadata")
      last_action_status: Literal["success", "error"] = Field(
          default="success",
-         description="Whether the last action was valid and executed successfully",
      )
      error_message: Optional[str] = Field(
          default=None,
-         description="Diagnostic error message if last_action_status is 'error'",
      )
      current_step: int = Field(
          default=0,
-         description="Current step number within the episode",
      )
      accumulated_reward: float = Field(
          default=0.0,
-         description="Total accumulated reward across all steps in this episode",
      )


- class InvoiceState(BaseModel):
-     """Internal environment state."""

      model_config = ConfigDict(extra="allow")

      episode_id: Optional[str] = Field(default=None, description="Current episode ID")
      step_count: int = Field(default=0, ge=0, description="Steps taken in current episode")
      task_name: str = Field(default="", description="Current task name")
-     document_id: str = Field(default="", description="Current document ID")
-     best_score: float = Field(default=0.0, description="Best extraction score so far")
-     attempts_used: int = Field(default=0, description="Extraction attempts used")
-     max_attempts: int = Field(default=3, description="Maximum extraction attempts")
-     accumulated_reward: float = Field(default=0.0, description="Total reward accumulated in episode")
  """
+ Pydantic models for the Enterprise Supply Chain & Tax Reconciliation Environment.

+ Defines the Action, Observation, and State types used for communication
+ between the agent and the environment. Designed for type-safe interaction
+ with an ERP-like tool suite.
  """

  from typing import Any, Dict, List, Literal, Optional

  from pydantic import BaseModel, ConfigDict, Field


+ # ---------------------------------------------------------------------------
+ # Action — what the agent sends to the environment
+ # ---------------------------------------------------------------------------

+ class ESCTRAction(BaseModel):
+     """Action sent by the agent to the ESCTR environment.
+
+     The agent operates as an autonomous financial controller using 4 tool verbs:
+     - 'query_database': Search procurement, accounts payable, shipping, or warehouse databases
+     - 'read_document': Retrieve a specific contract, SLA, PO, or invoice by document_id
+     - 'communicate_vendor': Send a negotiation message to the simulated vendor
+     - 'submit_financial_decision': Submit the final ledger adjustment (terminal action)
      """

      model_config = ConfigDict(extra="forbid")

+     action_type: Literal[
+         "query_database",
+         "read_document",
+         "communicate_vendor",
+         "submit_financial_decision",
+     ] = Field(
          ...,
          description=(
+             "The tool verb to execute. One of: 'query_database', 'read_document', "
+             "'communicate_vendor', or 'submit_financial_decision'."
          ),
      )
+     query_parameters: Optional[Dict[str, Any]] = Field(
+         default=None,
+         description=(
+             "Structured query for database lookups. Example: "
+             '{"table": "shipping_logs", "tracking_id": "TRK-9921"}'
+         ),
      )
+     document_id: Optional[str] = Field(
+         default=None,
+         description="Unique alphanumeric identifier of the document to read (e.g. 'PO-2024-0055').",
+     )
+     message_content: Optional[str] = Field(
+         default=None,
+         description="Natural language message for vendor negotiation (used with 'communicate_vendor').",
+     )
+     adjustment_amount: Optional[float] = Field(
+         default=None,
+         description=(
+             "The precise monetary adjustment to submit (used with 'submit_financial_decision'). "
+             "Must be the exact floating-point value calculated from contract terms."
+         ),
+     )
+     adjustment_reason: Optional[str] = Field(
+         default=None,
+         description="Brief explanation of the adjustment rationale (used with 'submit_financial_decision').",
      )


+ # ---------------------------------------------------------------------------
+ # Observation — what the environment returns after each step
+ # ---------------------------------------------------------------------------

+ class ESCTRObservation(BaseModel):
+     """Observation returned by the ESCTR environment after each step.
+
+     Provides structured telemetry to help the agent understand the
+     outcome of its action and plan the next move.
      """

      model_config = ConfigDict(extra="forbid")

      done: bool = Field(default=False, description="Whether the episode has ended")
+     reward: float = Field(default=0.0, description="Reward signal for this step (0.0-1.0)")
+     system_response: str = Field(
+         default="",
+         description="Output from the tool: database results, document text, vendor reply, or grader feedback.",
+     )
      last_action_status: Literal["success", "error"] = Field(
          default="success",
+         description="Whether the last action was valid and executed successfully.",
      )
      error_message: Optional[str] = Field(
          default=None,
+         description="Diagnostic error message if last_action_status is 'error'.",
      )
      current_step: int = Field(
          default=0,
+         description="Current step number within the episode (0-indexed at reset).",
+     )
+     max_steps: int = Field(
+         default=15,
+         description="Maximum steps allowed for this task.",
      )
      accumulated_reward: float = Field(
          default=0.0,
+         description="Total reward accumulated across all steps in this episode.",
      )
+     task_name: str = Field(default="", description="Current task name.")
+     available_tools: List[str] = Field(
+         default_factory=list,
+         description="List of tool verbs available in this task.",
+     )
+     metadata: Dict[str, Any] = Field(
+         default_factory=dict,
+         description="Additional structured metadata (scores, milestones, etc.).",
+     )
+

+ # ---------------------------------------------------------------------------
+ # State — internal environment state (exposed via GET /state)
+ # ---------------------------------------------------------------------------

+ class ESCTRState(BaseModel):
+     """Internal environment state for the ESCTR environment."""

      model_config = ConfigDict(extra="allow")

      episode_id: Optional[str] = Field(default=None, description="Current episode ID")
      step_count: int = Field(default=0, ge=0, description="Steps taken in current episode")
      task_name: str = Field(default="", description="Current task name")
+     seed: int = Field(default=0, description="Seed used for procedural generation")
+     accumulated_reward: float = Field(default=0.0, description="Total reward accumulated")
+     outcome_submitted: bool = Field(default=False, description="Whether final decision was submitted")
+     milestones_hit: List[str] = Field(
+         default_factory=list,
+         description="Trajectory milestones achieved (e.g. 'retrieved_po', 'retrieved_sla').",
+     )
+     best_score: float = Field(default=0.0, description="Best score achieved")
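
To make the action schema concrete, here is a rough sketch of the payloads an agent might send over one episode, written as plain dictionaries so that it runs without Pydantic installed. The tool verbs, field names, and the `shipping_logs`, `TRK-9921`, and `PO-2024-0055` examples come from `ESCTRAction` above. The `check_action` helper, the message text, and the adjustment amount are invented for illustration and are not part of the repository.

```python
import json

# Tool verbs and field names as declared on ESCTRAction in the diff above.
TOOL_VERBS = {
    "query_database",
    "read_document",
    "communicate_vendor",
    "submit_financial_decision",
}
ACTION_FIELDS = {
    "action_type",
    "query_parameters",
    "document_id",
    "message_content",
    "adjustment_amount",
    "adjustment_reason",
}


def check_action(payload: dict) -> dict:
    """Hypothetical stand-in for the model's validation (not the real Pydantic logic)."""
    if payload.get("action_type") not in TOOL_VERBS:
        raise ValueError(f"unknown action_type: {payload.get('action_type')!r}")
    extra = set(payload) - ACTION_FIELDS
    if extra:  # roughly mirrors ConfigDict(extra="forbid")
        raise ValueError(f"unexpected fields: {sorted(extra)}")
    return payload


episode = [
    check_action({"action_type": "query_database",
                  "query_parameters": {"table": "shipping_logs", "tracking_id": "TRK-9921"}}),
    check_action({"action_type": "read_document", "document_id": "PO-2024-0055"}),
    check_action({"action_type": "communicate_vendor",
                  "message_content": "The shipping log shows a late delivery."}),
    # The amount below is a made-up value, not derived from any scenario.
    check_action({"action_type": "submit_financial_decision",
                  "adjustment_amount": -125.50,
                  "adjustment_reason": "Late-delivery penalty per SLA."}),
]

for action in episode:
    print(json.dumps(action))
```

In the real environment the same dictionaries would presumably be passed to `ESCTRAction(**payload)`, where the `Literal` type and `extra="forbid"` perform the equivalent checks.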
server/procedural.py CHANGED
@@ -1,426 +1,580 @@
  """
- Procedural Document Generation Engine.
-
- Generates infinite invoice variations using seed-based randomization.
- Addresses the "data sparsity" critique by providing virtually unlimited
- training configurations while maintaining deterministic reproducibility.
  """

  import random
- import string
- from typing import Any, Dict, List, Tuple

  # ---------------------------------------------------------------------------
- # Data pools for procedural generation
  # ---------------------------------------------------------------------------

- VENDOR_POOL = [
-     ("Acme Corporation", "123 Business Avenue", "New York", "NY", "10001"),
-     ("TechStart Solutions LLC", "890 Innovation Drive, Suite 200", "San Francisco", "CA", "94105"),
-     ("Global Supplies Inc.", "2500 Industrial Parkway", "Detroit", "MI", "48201"),
-     ("Pinnacle Systems Ltd.", "77 Summit Road", "Boston", "MA", "02101"),
-     ("Nexus Digital Services", "400 Cloud Way", "Seattle", "WA", "98101"),
-     ("Ironclad Manufacturing Co.", "1200 Forge Lane", "Pittsburgh", "PA", "15201"),
-     ("Brightwave Analytics", "55 Data Drive", "Austin", "TX", "78701"),
-     ("SilverLine Logistics", "909 Transport Blvd", "Memphis", "TN", "38101"),
-     ("Quantum Computing Corp.", "1 Qubit Plaza", "Boulder", "CO", "80301"),
-     ("Evergreen Office Supplies", "330 Elm Street", "Portland", "OR", "97201"),
-     ("Atlas Engineering Group", "620 Blueprint Ave", "Houston", "TX", "77001"),
-     ("Cobalt Healthcare Solutions", "88 Wellness Pkwy", "Minneapolis", "MN", "55401"),
-     ("Meridian Consulting Partners", "250 Strategy Lane", "Chicago", "IL", "60601"),
-     ("Vanguard Robotics Inc.", "15 Automation Circle", "San Jose", "CA", "95101"),
-     ("Horizon Energy Systems", "700 Solar Way", "Denver", "CO", "80201"),
  ]

- CUSTOMER_POOL = [
-     ("Widget Co.", "456 Commerce Street", "Chicago", "IL", "60601"),
-     ("DataFlow Inc.", "321 Analytics Blvd", "Austin", "TX", "78701"),
-     ("Riverside Manufacturing", "780 Factory Road", "Cleveland", "OH", "44101"),
-     ("Summit Enterprises", "100 Peak Drive", "Denver", "CO", "80201"),
-     ("Cascade Solutions Group", "55 River Bend Rd", "Portland", "OR", "97201"),
-     ("Sterling Financial Corp.", "800 Wall St", "New York", "NY", "10005"),
-     ("Bluestone Retail Inc.", "120 Market Square", "Philadelphia", "PA", "19101"),
-     ("Northstar Logistics", "450 Freight Way", "Minneapolis", "MN", "55401"),
-     ("Pacific Tech Ventures", "700 Bay Ave", "San Diego", "CA", "92101"),
-     ("Redwood Construction LLC", "35 Builder Lane", "Sacramento", "CA", "95801"),
-     ("Falcon Aerospace", "1 Launchpad Dr", "Huntsville", "AL", "35801"),
-     ("Cedar Health Systems", "200 Wellness Blvd", "Nashville", "TN", "37201"),
-     ("Granite Insurance Group", "90 Coverage Ct", "Hartford", "CT", "06101"),
-     ("Oakmont Education Trust", "60 Campus Way", "Ann Arbor", "MI", "48101"),
-     ("Sapphire Media Holdings", "500 Broadcast Pl", "Los Angeles", "CA", "90001"),
  ]

  PRODUCT_CATALOG = [
-     # (description, min_price, max_price, unit)
-     ("Widget Type A", 15.00, 50.00, "unit"),
-     ("Widget Type B", 25.00, 80.00, "unit"),
-     ("Consulting Service", 50.00, 200.00, "hour"),
-     ("Cloud Hosting (Monthly)", 200.00, 800.00, "month"),
-     ("API Integration Setup", 500.00, 3000.00, "unit"),
-     ("Technical Support", 60.00, 150.00, "hour"),
-     ("Steel Bolts (Box/100)", 8.00, 20.00, "box"),
-     ("Copper Wire (500ft Roll)", 50.00, 120.00, "roll"),
-     ("Safety Goggles (Pack/10)", 20.00, 60.00, "pack"),
-     ("Welding Rods (Bundle)", 15.00, 40.00, "bundle"),
-     ("Software License (Annual)", 100.00, 2000.00, "license"),
-     ("Office Furniture Set", 200.00, 800.00, "set"),
-     ("Network Switch (24-port)", 150.00, 500.00, "unit"),
-     ("Printer Ink Cartridge", 20.00, 80.00, "unit"),
-     ("Industrial Adhesive (Gallon)", 25.00, 75.00, "gallon"),
-     ("LED Panel Light", 30.00, 100.00, "unit"),
-     ("HVAC Filter (Pack/4)", 15.00, 45.00, "pack"),
-     ("Hydraulic Pump Assembly", 300.00, 1200.00, "unit"),
-     ("Precision Bearing Set", 40.00, 150.00, "set"),
-     ("Thermal Insulation Roll", 60.00, 200.00, "roll"),
-     ("Data Backup Service", 75.00, 300.00, "month"),
-     ("Security Audit", 500.00, 2500.00, "audit"),
-     ("Custom Report Development", 200.00, 1000.00, "report"),
-     ("Training Workshop", 150.00, 500.00, "session"),
-     ("Prototype Fabrication", 1000.00, 5000.00, "unit"),
  ]

- TAX_RATES = [0.05, 0.06, 0.065, 0.07, 0.075, 0.08, 0.085, 0.09, 0.095, 0.10]

- OCR_SUBSTITUTIONS = {
-     "O": "0", "0": "O", "l": "1", "1": "l", "I": "l",
-     "S": "5", "5": "S", "B": "8", "8": "B", "m": "rn",
-     "a": "o", "e": "c", "n": "ri",
- }

- MONTHS = [
-     "January", "February", "March", "April", "May", "June",
-     "July", "August", "September", "October", "November", "December",
  ]


  class ProceduralEngine:
-     """Seed-based procedural document generator."""

      def __init__(self, seed: int = 0):
          self.rng = random.Random(seed)

      def _pick(self, pool: list) -> Any:
          return self.rng.choice(pool)

-     def _gen_invoice_number(self, prefix: str = "") -> str:
-         if not prefix:
-             prefix = self.rng.choice(["INV", "TS", "GS", "NX", "PC", "BW", "SL", "QC"])
-         year = self.rng.choice([2023, 2024, 2025])
-         num = self.rng.randint(1, 9999)
-         fmt = self.rng.choice([
-             f"{prefix}-{year}-{num:04d}",
-             f"{prefix}{num:04d}",
-             f"{prefix}-{num:04d}-{self.rng.choice(['A','B','R1','R2'])}",
-         ])
-         return fmt

-     def _gen_date(self) -> Tuple[str, str]:
-         """Returns (display_date, normalized YYYY-MM-DD)."""
-         year = self.rng.choice([2023, 2024, 2025])
-         month = self.rng.randint(1, 12)
          day = self.rng.randint(1, 28)
-         norm = f"{year}-{month:02d}-{day:02d}"
-         fmt_choice = self.rng.randint(0, 3)
-         if fmt_choice == 0:
-             display = f"{MONTHS[month-1]} {day}, {year}"
-         elif fmt_choice == 1:
-             display = f"{month:02d}/{day:02d}/{year}"
-         elif fmt_choice == 2:
-             display = f"{day}-{MONTHS[month-1][:3]}-{year}"
-         else:
-             display = norm
-         return display, norm
-
-     def _gen_line_items(self, count: int = 0) -> List[Dict[str, Any]]:
-         if count == 0:
-             count = self.rng.randint(2, 6)
-         products = self.rng.sample(PRODUCT_CATALOG, min(count, len(PRODUCT_CATALOG)))
-         items = []
-         for desc, min_p, max_p, _unit in products:
-             qty = self.rng.randint(1, 50)
-             price = round(self.rng.uniform(min_p, max_p), 2)
-             amount = round(qty * price, 2)
-             items.append({
-                 "description": desc,
-                 "quantity": qty,
-                 "unit_price": price,
-                 "amount": amount,
-             })
-         return items
-
-     def generate_simple(self) -> Dict[str, Any]:
-         vendor = self._pick(VENDOR_POOL)
-         customer = self._pick(CUSTOMER_POOL)
-         inv_num = self._gen_invoice_number()
-         display_date, norm_date = self._gen_date()
-         items = self._gen_line_items()
-         subtotal = round(sum(i["amount"] for i in items), 2)
-         tax_rate = self._pick(TAX_RATES)
-         tax = round(subtotal * tax_rate, 2)
-         total = round(subtotal + tax, 2)
-         tax_pct = int(tax_rate * 100) if tax_rate * 100 == int(tax_rate * 100) else tax_rate * 100
-
-         items_text = ""
-         for it in items:
-             items_text += f"{it['description']:<30s} {it['quantity']:>5d} ${it['unit_price']:>10.2f} ${it['amount']:>10.2f}\n"
-
-         text = f"""INVOICE
-
- Invoice Number: {inv_num}
- Date: {display_date}
-
- From:
- {vendor[0]}
- {vendor[1]}
- {vendor[2]}, {vendor[3]} {vendor[4]}
-
- Bill To:
- {customer[0]}
- {customer[1]}
- {customer[2]}, {customer[3]} {customer[4]}
-
- Description Qty Unit Price Amount
- ---------------------------------------------------------------
- {items_text}
- Subtotal: ${subtotal:,.2f}
- Tax ({tax_pct}%): ${tax:,.2f}
187
- Total: ${total:,.2f}
188
-
189
- Payment Terms: Net {self.rng.choice([15, 30, 45, 60])}
190
- """
191
- ground_truth = {
192
- "invoice_number": inv_num,
193
- "date": norm_date,
194
- "vendor_name": vendor[0],
195
- "customer_name": customer[0],
196
- "subtotal": subtotal,
197
- "tax": tax,
198
- "total": total,
199
- "line_items": items,
200
- }
201
- return {"id": f"proc_simple_{self.rng.randint(1000,9999)}", "text": text, "ground_truth": ground_truth}
202
-
203
- def generate_messy(self) -> Dict[str, Any]:
204
- base = self.generate_simple()
205
- gt = base["ground_truth"]
206
- vendor = gt["vendor_name"]
207
- customer = gt["customer_name"]
208
- items = gt["line_items"]
209
-
210
- abbrevs = {"Subtotal": self._pick(["subtot", "s/t", "sub"]),
211
- "Tax": self._pick(["tx", "tax", "vat"]),
212
- "Total": self._pick(["TOTAL DUE", "amt due", "grand total", "balance"])}
213
-
214
- items_text = ""
215
- for it in items:
216
- desc_short = it["description"].split("(")[0].strip().lower()
217
- qty = it["quantity"]
218
- price = it["unit_price"]
219
- amt = it["amount"]
220
- fmt = self.rng.choice([
221
- f"{qty}x {desc_short} @ {price:.0f} {amt:.0f}",
222
- f"{desc_short} -- {qty} @ {price:.2f} ea ........... {amt:.0f}",
223
- f"{desc_short}...${amt:.0f}",
224
- ])
225
- items_text += fmt + "\n"
226
-
227
- text = f"""{vendor.lower()}
228
- {self._pick(VENDOR_POOL)[2].lower()}, {self._pick(VENDOR_POOL)[3]}
229
-
230
- inv# {gt['invoice_number']}
231
- dt: {gt['date']}
232
-
233
- cust: {customer.split('.')[0].split(',')[0].lower()}
234
-
235
- -- charges --
236
- {items_text}
237
- {abbrevs['Subtotal']}: ${gt['subtotal']:.0f}
238
- {abbrevs['Tax']}: {gt['tax']:.2f}
239
- ========
240
- {abbrevs['Total']} ${gt['total']:,.2f}
241
-
242
- pay within 30 days
243
- """
244
- return {"id": f"proc_messy_{self.rng.randint(1000,9999)}", "text": text, "ground_truth": gt}
245
-
246
- def _apply_ocr_corruption(self, text: str, intensity: float = 0.15) -> str:
247
- result = list(text)
248
- for i, ch in enumerate(result):
249
- if ch in OCR_SUBSTITUTIONS and self.rng.random() < intensity:
250
- result[i] = OCR_SUBSTITUTIONS[ch]
251
- return "".join(result)
252
-
253
- def generate_corrupted(self) -> Dict[str, Any]:
254
- base = self.generate_simple()
255
- corrupted_text = self._apply_ocr_corruption(base["text"], 0.18)
256
- header = self._pick([
257
- "SC4NNED D0CUMENT - Page 1 of 1\n\n",
258
- "[SCAN QUALITY: P00R - SOME CHARACTERS MAY BE lNCORRECT]\n\n",
259
- "---FAXED DOCUMENT---\nQUALITY: [####===---] 40%\n\n",
260
- ])
261
- footer = self._pick([
262
- "\n\n--- END 0F SCAN ---",
263
- "\n\n[PAGE 1/1 - SCAN C0MPLETE]",
264
- "\n\n---END FAX---",
265
- ])
266
- return {
267
- "id": f"proc_corrupt_{self.rng.randint(1000,9999)}",
268
- "text": header + corrupted_text + footer,
269
- "ground_truth": base["ground_truth"],
270
- }
271
-
272
- def generate_multi_document(self) -> Dict[str, Any]:
273
- base = self.generate_simple()
274
- gt = base["ground_truth"]
275
- po_num = f"PO-{self.rng.choice(['A','B','C','D'])}-{self.rng.randint(2024,2025)}-{self.rng.randint(100,999)}"
276
- po_date_display, _po_norm = self._gen_date()
277
-
278
- items_po = ""
279
- for it in gt["line_items"]:
280
- items_po += f"- {it['quantity']}x {it['description']} @ ${it['unit_price']:.2f} = ${it['amount']:.2f}\n"
281
-
282
- credit_amount = round(self._pick(gt["line_items"])["unit_price"] * self.rng.randint(1, 3), 2)
283
- credit_tax = round(credit_amount * 0.07, 2)
284
- credit_total = round(credit_amount + credit_tax, 2)
285
- adjusted_total = round(gt["total"] - credit_total, 2)
286
- reason = self._pick([
287
- "Defective items returned",
288
- "Partial delivery — remaining items backordered",
289
- "Pricing error on original invoice",
290
- "Duplicate charge for services",
291
- ])
292
 
293
- text = f"""=== PURCHASE ORDER ===
294
- PO Number: {po_num}
295
- Date: {po_date_display}
296
- Vendor: {gt['vendor_name']}
297
- Buyer: {gt['customer_name']}
298
-
299
- Ordered Items:
300
- {items_po}
301
- PO Total: ${gt['subtotal']:,.2f} (before tax)
302
-
303
- === INVOICE ===
304
- {base['text']}
305
- Reference PO: {po_num}
306
-
307
- === CREDIT MEMO ===
308
- Credit Memo #: CM-{self.rng.randint(2024,2025)}-{self.rng.randint(100,999)}
309
- Reference Invoice: {gt['invoice_number']}
310
- Reason: {reason}
311
- Credit Amount: ${credit_amount:.2f}
312
- Tax Adjustment: ${credit_tax:.2f}
313
- Total Credit: -${credit_total:.2f}
314
-
315
- === SUMMARY ===
316
- Original Invoice: ${gt['total']:,.2f}
317
- Credit Applied: -${credit_total:.2f}
318
- Adjusted Balance Due: ${adjusted_total:,.2f}
319
- """
320
- gt_multi = dict(gt)
321
- gt_multi["po_number"] = po_num
322
- gt_multi["adjustment_reason"] = reason
323
- gt_multi["adjusted_total"] = adjusted_total
324
- return {"id": f"proc_multi_{self.rng.randint(1000,9999)}", "text": text, "ground_truth": gt_multi}
325
-
326
- def generate_adversarial(self) -> Dict[str, Any]:
327
- base = self.generate_simple()
328
- gt = base["ground_truth"]
329
- original_subtotal = gt["subtotal"]
330
- discount_pct = self._pick([0.05, 0.08, 0.10, 0.12, 0.15])
331
- discount_amount = round(original_subtotal * discount_pct, 2)
332
- adjusted_subtotal = round(original_subtotal - discount_amount, 2)
333
- tax_rate = self._pick(TAX_RATES)
334
- new_tax = round(adjusted_subtotal * tax_rate, 2)
335
- new_total = round(adjusted_subtotal + new_tax, 2)
336
- old_tax = round(original_subtotal * tax_rate, 2)
337
- original_total = round(original_subtotal + old_tax, 2)
338
-
339
- draft_inv = f"DRAFT-INV-{self.rng.randint(100,999)}"
340
- real_inv = gt["invoice_number"] + self._pick(["-R2", "-FINAL", "-REV1"])
341
- po_num = f"PO-{self.rng.randint(2024,2025)}-{self.rng.randint(100,999)}"
342
- _, reissue_date = self._gen_date()
343
- tax_pct = int(tax_rate * 100) if tax_rate * 100 == int(tax_rate * 100) else round(tax_rate * 100, 1)
344
-
345
- items_text = ""
346
- for it in gt["line_items"]:
347
- items_text += f"{it['description']:<30s} {it['quantity']:>5d} ${it['unit_price']:>10.2f} ${it['amount']:>10.2f}\n"
348
-
349
- discount_pct_display = int(discount_pct * 100) if discount_pct * 100 == int(discount_pct * 100) else round(discount_pct * 100, 1)
350
-
351
- text = f"""INVOICE
352
-
353
- *** IMPORTANT: This replaces previous invoice {draft_inv} which was voided ***
354
-
355
- Invoice Number: {real_inv}
356
- Previous Reference: {draft_inv} (VOIDED — DO NOT USE)
357
- Date: {gt['date']}
358
- Reissue Date: {reissue_date}
359
- PO Reference: {po_num}
360
-
361
- From:
362
- {gt['vendor_name']}
363
-
364
- Bill To:
365
- {gt['customer_name']}
366
-
367
- Description Qty Unit Price Amount
368
- ---------------------------------------------------------------
369
- {items_text} ** EARLY PAYMENT DISCOUNT: -{discount_pct_display}% applied **
370
-
371
- Subtotal: ${original_subtotal:,.2f}
372
- Discount ({discount_pct_display}%): -${discount_amount:,.2f}
373
- Adjusted Subtotal: ${adjusted_subtotal:,.2f}
374
- Tax ({tax_pct}%): ${new_tax:,.2f}
375
- Total: ${new_total:,.2f}
376
-
377
- NOTE: Original quote was ${original_total:,.2f} but discount applied.
378
-
379
- !!! BUDGET VARIANCE ALERT !!!
380
- PO Authorized: ${original_subtotal:,.2f}
381
- Actual (pre-tax): ${adjusted_subtotal:,.2f}
382
- Variance: -${discount_amount:,.2f} UNDER BUDGET
383
-
384
- Payment Terms: Net 10 (discounted) / Net 30 (full price ${original_total:,.2f})
385
- """
386
- discrepancy = (
387
- f"{discount_pct_display}% early payment discount applied. "
388
- f"Reissued invoice replaces voided {draft_inv}. "
389
- f"Adjusted subtotal ${adjusted_subtotal:,.2f} vs original ${original_subtotal:,.2f}."
390
  )
391
 
392
- gt_adv = {
393
- "invoice_number": real_inv,
394
- "date": reissue_date,
395
- "vendor_name": gt["vendor_name"],
396
- "customer_name": gt["customer_name"],
397
- "subtotal": adjusted_subtotal,
398
- "tax": new_tax,
399
- "total": new_total,
400
- "line_items": gt["line_items"],
401
- "po_number": po_num,
402
- "discount_amount": discount_amount,
403
- "original_total": original_total,
404
- "discrepancy_notes": discrepancy,
405
- }
406
- return {"id": f"proc_adv_{self.rng.randint(1000,9999)}", "text": text, "ground_truth": gt_adv}
407
 
408
 
409
  # ---------------------------------------------------------------------------
410
  # Public API
411
  # ---------------------------------------------------------------------------
412
 
413
- GENERATORS = {
414
- "simple_invoice": "generate_simple",
415
- "messy_invoice": "generate_messy",
416
- "multi_document": "generate_multi_document",
417
- "corrupted_scan": "generate_corrupted",
418
- "adversarial_invoice": "generate_adversarial",
 
 
 
 
 
 
419
  }
420
 
421
 
422
- def generate_document(task_name: str, seed: int = 0) -> Dict[str, Any]:
423
- """Generate a procedural document for the given task and seed."""
424
  engine = ProceduralEngine(seed)
425
- method = GENERATORS.get(task_name, "generate_simple")
426
  return getattr(engine, method)()
 
1
  """
2
+ Procedural Generation Engine for the ESCTR Environment.
3
+
4
+ Generates deterministic corporate supply chain scenarios from any seed:
5
+ - Company profiles (vendors, buyers)
6
+ - Product catalogs with contracted pricing
7
+ - Purchase Orders
8
+ - Vendor Invoices (with seeded discrepancies)
9
+ - Service Level Agreements (penalty clauses)
10
+ - Shipping / logistics telemetry
11
+ - Warehouse access logs
12
+ - Vendor negotiation responses
13
+
14
+ Design principle: same seed → identical scenario → deterministic grading.
15
  """
16
 
17
  import random
18
+ import hashlib
19
+ from dataclasses import dataclass, field, asdict
20
+ from typing import Any, Dict, List, Optional, Tuple
21
+
22
 
23
  # ---------------------------------------------------------------------------
24
+ # Data pools
25
  # ---------------------------------------------------------------------------
26
 
27
+ VENDOR_NAMES = [
28
+ "Apex Industrial Supply Co.", "Meridian Components LLC", "Vanguard Materials Group",
29
+ "Sterling Precision Parts", "Ironclad Manufacturing Corp.", "Cobalt Logistics Inc.",
30
+ "Pinnacle Hardware Solutions", "Atlas Engineering Supply", "Nexus Digital Components",
31
+ "Brightwave Technical Services", "SilverLine Distribution", "Quantum Parts International",
32
+ "Evergreen Industrial Ltd.", "Horizon Supply Chain Corp.", "Titan Fabrication Works",
33
+ ]
34
+
35
+ BUYER_NAMES = [
36
+ "Cascade Electronics Inc.", "Redwood Construction Group", "Summit Aerospace Ltd.",
37
+ "Pacific Manufacturing Co.", "Northstar Automotive", "Falcon Defense Systems",
38
+ "Bluestone Energy Corp.", "Cedar Health Technologies", "Granite Infrastructure LLC",
39
+ "Oakmont Robotics Inc.", "Sapphire Semiconductor", "Emerald Biotech Group",
40
+ "Diamond Precision Engineering", "Ruby Telecommunications", "Topaz Data Systems",
 
 
41
  ]
42
 
43
+ CITIES = [
44
+ ("New York", "NY"), ("Chicago", "IL"), ("Houston", "TX"), ("San Francisco", "CA"),
45
+ ("Detroit", "MI"), ("Seattle", "WA"), ("Boston", "MA"), ("Denver", "CO"),
46
+ ("Austin", "TX"), ("Portland", "OR"), ("Minneapolis", "MN"), ("Cleveland", "OH"),
47
+ ("Pittsburgh", "PA"), ("Nashville", "TN"), ("San Diego", "CA"),
48
  ]
49
 
50
  PRODUCT_CATALOG = [
51
+ # (name, category, min_price, max_price)
52
+ ("Stainless Steel Bolts M10 (Box/100)", "hardware", 10.00, 25.00),
53
+ ("Copper Wire 500ft Roll AWG-12", "electrical", 65.00, 120.00),
54
+ ("Industrial Safety Goggles (Pack/10)", "safety", 25.00, 55.00),
55
+ ("Welding Rod E6013 (Bundle/50)", "consumables", 18.00, 42.00),
56
+ ("Hydraulic Cylinder Assembly HCA-200", "machinery", 280.00, 550.00),
57
+ ("Precision Bearing Set 6205-2RS", "components", 35.00, 90.00),
58
+ ("HVAC Filter MERV-13 (Pack/4)", "facilities", 22.00, 48.00),
59
+ ("LED Panel Light 600x600mm", "electrical", 35.00, 85.00),
60
+ ("Thermal Insulation Roll R-30", "construction", 55.00, 140.00),
61
+ ("Network Switch 24-Port Managed", "IT", 180.00, 420.00),
62
+ ("Server Rack Mount Kit 42U", "IT", 350.00, 800.00),
63
+ ("Pneumatic Valve Assembly PVA-100", "machinery", 120.00, 280.00),
64
+ ("Carbon Steel Pipe Schedule 40 (10ft)", "construction", 45.00, 110.00),
65
+ ("Circuit Breaker Panel 200A", "electrical", 150.00, 380.00),
66
+ ("Laser Calibration Module LCM-5", "precision", 400.00, 950.00),
67
+ ("Industrial Adhesive Epoxy (Gallon)", "consumables", 28.00, 72.00),
68
+ ("Fiber Optic Cable OM3 (1000ft)", "IT", 200.00, 480.00),
69
+ ("Pressure Gauge 0-300 PSI", "instruments", 40.00, 95.00),
70
+ ("Anti-Vibration Mount Set (Pack/8)", "machinery", 60.00, 150.00),
71
+ ("Clean Room Wipes (Case/5000)", "consumables", 80.00, 190.00),
72
  ]
73
 
74
+ SLA_PENALTY_STRUCTURES = [
75
+ {"type": "linear", "rate_per_day": 0.02, "cap": 0.10, "grace_days": 0},
76
+ {"type": "linear", "rate_per_day": 0.015, "cap": 0.15, "grace_days": 1},
77
+ {"type": "linear", "rate_per_day": 0.03, "cap": 0.12, "grace_days": 0},
78
+ {"type": "tiered", "tiers": [(3, 0.02), (7, 0.03), (999, 0.05)], "cap": 0.20, "grace_days": 0},
79
+ {"type": "linear", "rate_per_day": 0.025, "cap": 0.10, "grace_days": 2},
80
+ ]
81
 
82
+ VENDOR_EXCUSES = [
83
+ "Our records indicate the receiving warehouse rejected the initial delivery attempt due to dock unavailability.",
84
+ "The delay was caused by a force majeure weather event that affected our shipping lane.",
85
+ "We believe the shipment arrived on time but was misrouted by your internal receiving department.",
86
+ "Our carrier has confirmed timely delivery; any apparent delay is a systems error on your end.",
87
+ "The contract clearly states penalties apply only to manufacturing delays, not logistics delays.",
88
+ ]
89
 
90
+ SETTLEMENT_OFFERS = [
91
+ "We are prepared to offer a goodwill credit of {pct}% of the penalty amount to resolve this matter.",
92
+ "In the interest of maintaining our business relationship, we propose settling at {pct}% of the claimed penalty.",
93
+ "Our legal team has reviewed the claim. We can offer {pct}% as a final settlement.",
94
  ]
95
 
96
 
97
+ # ---------------------------------------------------------------------------
98
+ # Data classes for generated scenarios
99
+ # ---------------------------------------------------------------------------
100
+
101
+ @dataclass
102
+ class Company:
103
+ name: str
104
+ address: str
105
+ city: str
106
+ state: str
107
+ zip_code: str
108
+ tax_id: str
109
+
110
+ @dataclass
111
+ class LineItem:
112
+ item_id: str
113
+ description: str
114
+ category: str
115
+ quantity: int
116
+ contracted_unit_price: float
117
+ invoiced_unit_price: float
118
+ contracted_total: float
119
+ invoiced_total: float
120
+ has_discrepancy: bool = False
121
+
122
+ @dataclass
123
+ class PurchaseOrder:
124
+ po_number: str
125
+ date: str
126
+ vendor: Company
127
+ buyer: Company
128
+ line_items: List[LineItem]
129
+ total_amount: float
130
+ approved_budget: float
131
+
132
+ @dataclass
133
+ class Invoice:
134
+ invoice_number: str
135
+ date: str
136
+ po_reference: str
137
+ vendor: Company
138
+ buyer: Company
139
+ line_items: List[LineItem]
140
+ subtotal: float
141
+ tax_rate: float
142
+ tax_amount: float
143
+ total: float
144
+
145
+ @dataclass
146
+ class SLAContract:
147
+ contract_id: str
148
+ vendor: str
149
+ buyer: str
150
+ effective_date: str
151
+ penalty_structure: Dict[str, Any]
152
+ delivery_terms: str
153
+
154
+ @dataclass
155
+ class ShippingLog:
156
+ tracking_id: str
157
+ po_reference: str
158
+ carrier: str
159
+ ship_date: str
160
+ expected_delivery: str
161
+ actual_delivery: str
162
+ delay_days: int
163
+ status: str
164
+
165
+ @dataclass
166
+ class WarehouseLog:
167
+ date: str
168
+ dock_id: str
169
+ status: str # "open", "closed", "maintenance"
170
+ staff_on_duty: int
171
+ shipments_received: int
172
+ notes: str
173
+
174
+ @dataclass
175
+ class Scenario:
176
+ """Complete scenario for one ESCTR episode."""
177
+ task_name: str
178
+ seed: int
179
+ vendor: Company
180
+ buyer: Company
181
+ purchase_order: PurchaseOrder
182
+ invoice: Invoice
183
+ sla_contract: Optional[SLAContract] = None
184
+ shipping_log: Optional[ShippingLog] = None
185
+ warehouse_logs: Optional[List[WarehouseLog]] = None
186
+ # Ground truth for grading
187
+ correct_adjustment: float = 0.0
188
+ discrepant_line_item_id: Optional[str] = None
189
+ correct_line_item_price: Optional[float] = None
190
+ penalty_amount: Optional[float] = None
191
+ vendor_claim_valid: Optional[bool] = None
192
+
193
+
194
+ # ---------------------------------------------------------------------------
195
+ # Procedural Engine
196
+ # ---------------------------------------------------------------------------
197
+
198
  class ProceduralEngine:
199
+ """Seed-deterministic corporate scenario generator."""
200
 
201
  def __init__(self, seed: int = 0):
202
  self.rng = random.Random(seed)
203
+ self.seed = seed
204
 
205
  def _pick(self, pool: list) -> Any:
206
  return self.rng.choice(pool)
207
 
208
+ def _gen_company(self, names: list) -> Company:
209
+ name = self._pick(names)
210
+ city, state = self._pick(CITIES)
211
+ return Company(
212
+ name=name,
213
+ address=f"{self.rng.randint(100, 9999)} {self._pick(['Industrial', 'Commerce', 'Innovation', 'Enterprise', 'Technology'])} {self._pick(['Drive', 'Avenue', 'Parkway', 'Boulevard', 'Street'])}",
214
+ city=city,
215
+ state=state,
216
+ zip_code=f"{self.rng.randint(10000, 99999)}",
217
+ tax_id=f"{self.rng.randint(10, 99)}-{self.rng.randint(1000000, 9999999)}",
218
+ )
219
 
220
+ def _gen_date(self, year: int = 2024, month_range: Tuple[int, int] = (1, 12)) -> str:
221
+ month = self.rng.randint(*month_range)
222
  day = self.rng.randint(1, 28)
223
+ return f"{year}-{month:02d}-{day:02d}"
224
+
225
+ def _gen_id(self, prefix: str) -> str:
226
+ return f"{prefix}-{self.rng.randint(2024, 2025)}-{self.rng.randint(1000, 9999)}"
227
+
228
+ def _gen_tracking_id(self) -> str:
229
+ return f"TRK-{self.rng.randint(10000, 99999)}"
230
+
231
+ # ------------------------------------------------------------------
232
+ # Task 1: Easy — Procurement Reconciliation
233
+ # ------------------------------------------------------------------
234
+ def generate_task1(self) -> Scenario:
235
+ """Generate a simple PO vs Invoice overcharge scenario."""
236
+ vendor = self._gen_company(VENDOR_NAMES)
237
+ buyer = self._gen_company(BUYER_NAMES)
238
+ po_date = self._gen_date(month_range=(1, 6))
239
+ inv_date = self._gen_date(month_range=(2, 7))
240
+
241
+ # Generate 3-5 line items
242
+ num_items = self.rng.randint(3, 5)
243
+ products = self.rng.sample(PRODUCT_CATALOG, num_items)
244
+ discrepant_idx = self.rng.randint(0, num_items - 1)
245
+
246
+ line_items = []
247
+ for i, (name, cat, min_p, max_p) in enumerate(products):
248
+ qty = self.rng.randint(5, 100)
249
+ contracted_price = round(self.rng.uniform(min_p, max_p), 2)
250
+
251
+ if i == discrepant_idx:
252
+ # Overcharge: invoice price higher than contracted
253
+ markup = round(self.rng.uniform(2.00, 15.00), 2)
254
+ invoiced_price = round(contracted_price + markup, 2)
255
+ has_discrepancy = True
256
+ else:
257
+ invoiced_price = contracted_price
258
+ has_discrepancy = False
259
+
260
+ item_id = f"LI-{self.rng.randint(1000, 9999)}"
261
+ line_items.append(LineItem(
262
+ item_id=item_id,
263
+ description=name,
264
+ category=cat,
265
+ quantity=qty,
266
+ contracted_unit_price=contracted_price,
267
+ invoiced_unit_price=invoiced_price,
268
+ contracted_total=round(qty * contracted_price, 2),
269
+ invoiced_total=round(qty * invoiced_price, 2),
270
+ has_discrepancy=has_discrepancy,
271
+ ))
272
+
273
+ po_total = round(sum(li.contracted_total for li in line_items), 2)
274
+ inv_subtotal = round(sum(li.invoiced_total for li in line_items), 2)
275
+ tax_rate = self._pick([0.05, 0.06, 0.07, 0.08, 0.09, 0.10])
276
+ tax_amount = round(inv_subtotal * tax_rate, 2)
277
+ inv_total = round(inv_subtotal + tax_amount, 2)
278
+
279
+ po_number = self._gen_id("PO")
280
+ inv_number = self._gen_id("INV")
281
+
282
+ po = PurchaseOrder(
283
+ po_number=po_number, date=po_date, vendor=vendor, buyer=buyer,
284
+ line_items=line_items, total_amount=po_total, approved_budget=round(po_total * 1.05, 2),
285
+ )
286
 
287
+ invoice = Invoice(
288
+ invoice_number=inv_number, date=inv_date, po_reference=po_number,
289
+ vendor=vendor, buyer=buyer, line_items=line_items,
290
+ subtotal=inv_subtotal, tax_rate=tax_rate, tax_amount=tax_amount, total=inv_total,
291
+ )
292
+
293
+ discrepant = line_items[discrepant_idx]
294
+ correct_total = discrepant.contracted_total
295
+ overcharge = round(discrepant.invoiced_total - correct_total, 2)
296
+
297
+ return Scenario(
298
+ task_name="procurement_reconciliation",
299
+ seed=self.seed,
300
+ vendor=vendor, buyer=buyer,
301
+ purchase_order=po, invoice=invoice,
302
+ correct_adjustment=-overcharge, # negative = reduce invoice
303
+ discrepant_line_item_id=discrepant.item_id,
304
+ correct_line_item_price=correct_total,
305
  )
306
 
307
+ # ------------------------------------------------------------------
308
+ # Task 2: Medium — SLA Enforcement
309
+ # ------------------------------------------------------------------
310
+ def generate_task2(self) -> Scenario:
311
+ """Generate a delayed shipment + SLA penalty scenario."""
312
+ scenario = self.generate_task1() # base PO/invoice
313
+ # Remove the pricing discrepancy for task2 (focus is on shipping)
314
+ for li in scenario.purchase_order.line_items:
315
+ li.invoiced_unit_price = li.contracted_unit_price
316
+ li.invoiced_total = li.contracted_total
317
+ li.has_discrepancy = False
318
+
319
+ # Recalculate invoice
320
+ inv = scenario.invoice
321
+ inv_subtotal = round(sum(li.contracted_total for li in inv.line_items), 2)
322
+ inv.subtotal = inv_subtotal
323
+ inv.tax_amount = round(inv_subtotal * inv.tax_rate, 2)
324
+ inv.total = round(inv_subtotal + inv.tax_amount, 2)
325
+
326
+ # Generate SLA
327
+ sla_struct = self._pick(SLA_PENALTY_STRUCTURES).copy()
328
+ contract_id = self._gen_id("SLA")
329
+ sla = SLAContract(
330
+ contract_id=contract_id,
331
+ vendor=scenario.vendor.name,
332
+ buyer=scenario.buyer.name,
333
+ effective_date=self._gen_date(month_range=(1, 3)),
334
+ penalty_structure=sla_struct,
335
+ delivery_terms=f"Delivery within 14 business days of PO issuance. Penalties per SLA clause {contract_id}-SEC4.",
336
+ )
337
+
338
+ # Generate shipping delay
339
+ delay_days = self.rng.randint(2, 12)
340
+ grace = sla_struct.get("grace_days", 0)
341
+ tracking_id = self._gen_tracking_id()
342
+
343
+ ship_log = ShippingLog(
344
+ tracking_id=tracking_id,
345
+ po_reference=scenario.purchase_order.po_number,
346
+ carrier=self._pick(["FedEx Freight", "UPS Supply Chain", "XPO Logistics", "USPS Priority", "DHL Express"]),
347
+ ship_date=scenario.purchase_order.date,
348
+ expected_delivery=self._gen_date(month_range=(3, 5)),
349
+ actual_delivery=self._gen_date(month_range=(4, 6)),
350
+ delay_days=delay_days,
351
+ status="delivered_late",
352
+ )
353
+
354
+ # Calculate penalty
355
+ penalizable_days = max(0, delay_days - grace)
356
+ if sla_struct["type"] == "linear":
357
+ rate = sla_struct["rate_per_day"]
358
+ cap = sla_struct["cap"]
359
+ penalty_pct = min(penalizable_days * rate, cap)
360
+ elif sla_struct["type"] == "tiered":
361
+ penalty_pct = 0.0
362
+ remaining = penalizable_days
363
+ for threshold, rate in sla_struct["tiers"]:
364
+ if remaining <= 0:
365
+ break
366
+ days_in_tier = min(remaining, threshold)
367
+ penalty_pct += days_in_tier * rate
368
+ remaining -= days_in_tier
369
+ penalty_pct = min(penalty_pct, sla_struct["cap"])
370
+ else:
371
+ penalty_pct = 0.0
372
+
373
+ penalty_amount = round(inv.subtotal * penalty_pct, 2)
374
+
375
+ scenario.task_name = "sla_enforcement"
376
+ scenario.sla_contract = sla
377
+ scenario.shipping_log = ship_log
378
+ scenario.correct_adjustment = -penalty_amount # deduction from invoice
379
+ scenario.penalty_amount = penalty_amount
380
+ scenario.discrepant_line_item_id = None
381
+ scenario.correct_line_item_price = None
382
+
383
+ return scenario
384
+
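In `generate_task2` above, `expected_delivery` and `actual_delivery` are drawn independently from `_gen_date`, so their difference need not equal `delay_days`, and the actual date can even fall before the expected one. A minimal sketch of one way to keep the three fields consistent is shown here. The helper name `consistent_delivery_dates` is hypothetical, and the delay is counted in calendar days, which is an assumption (the SLA text speaks of business days).

```python
from datetime import date, timedelta
from typing import Tuple


def consistent_delivery_dates(expected_delivery: str, delay_days: int) -> Tuple[str, str]:
    """Return (expected, actual) ISO dates with actual = expected + delay_days.

    Hypothetical helper, not part of the commit; counts calendar days.
    """
    expected = date.fromisoformat(expected_delivery)
    actual = expected + timedelta(days=delay_days)
    return expected.isoformat(), actual.isoformat()


expected, actual = consistent_delivery_dates("2024-03-27", 9)
print(expected, actual)
```

Deriving the actual date from the expected date keeps the rendered shipping log in agreement with the `delay_days` value that the penalty calculation uses.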
385
+ # ------------------------------------------------------------------
386
+ # Task 3: Hard — Adversarial Auditing
387
+ # ------------------------------------------------------------------
388
+ def generate_task3(self) -> Scenario:
389
+ """Generate adversarial vendor dispute scenario."""
390
+ scenario = self.generate_task2() # has SLA + shipping
391
+
392
+ # Generate warehouse logs proving dock was open during disputed window
393
+ delivery_date = scenario.shipping_log.actual_delivery
394
+ warehouse_logs = []
395
+ for i in range(-1, 3): # four entries; intended window is the day before through 2 days after
396
+ # Simplified: offset i is not applied, so every entry is stamped with the actual delivery date
397
+ log_date = delivery_date
398
+ warehouse_logs.append(WarehouseLog(
399
+ date=log_date,
400
+ dock_id=f"DOCK-{self._pick(['A', 'B', 'C'])}{self.rng.randint(1, 5)}",
401
+ status="open",
402
+ staff_on_duty=self.rng.randint(3, 8),
403
+ shipments_received=self.rng.randint(5, 20),
404
+ notes=f"Normal operations. {self.rng.randint(5, 20)} deliveries processed.",
405
+ ))
406
+
407
+ scenario.task_name = "adversarial_auditing"
408
+ scenario.warehouse_logs = warehouse_logs
409
+ scenario.vendor_claim_valid = False # vendor's claim is always invalid in this task
410
+
411
+ return scenario
412
+
413
+
414
+ # ---------------------------------------------------------------------------
415
+ # Document renderers — produce human-readable text from data structures
416
+ # ---------------------------------------------------------------------------
417
+
418
+ def render_purchase_order(po: PurchaseOrder) -> str:
419
+ lines = [
420
+ "═══════════════════════════════════════════",
421
+ " PURCHASE ORDER",
422
+ "═══════════════════════════════════════════",
423
+ f"PO Number: {po.po_number}",
424
+ f"Date: {po.date}",
425
+ f"Approved Budget: ${po.approved_budget:,.2f}",
426
+ "",
427
+ f"Vendor: {po.vendor.name}",
428
+ f" {po.vendor.address}",
429
+ f" {po.vendor.city}, {po.vendor.state} {po.vendor.zip_code}",
430
+ "",
431
+ f"Buyer: {po.buyer.name}",
432
+ f" {po.buyer.address}",
433
+ f" {po.buyer.city}, {po.buyer.state} {po.buyer.zip_code}",
434
+ "",
435
+ "Line Items:",
436
+ f"{'ID':<12} {'Description':<40} {'Qty':>5} {'Unit Price':>12} {'Total':>12}",
437
+ "─" * 85,
438
+ ]
439
+ for li in po.line_items:
440
+ lines.append(
441
+ f"{li.item_id:<12} {li.description:<40} {li.quantity:>5} "
442
+ f"${li.contracted_unit_price:>10,.2f} ${li.contracted_total:>10,.2f}"
443
+ )
444
+ lines.extend([
445
+ "─" * 85,
446
+ f"{'PO Total:':>71} ${po.total_amount:>10,.2f}",
447
+ "═══════════════════════════════════════════",
448
+ ])
449
+ return "\n".join(lines)
450
+
451
+
452
+ def render_invoice(inv: Invoice) -> str:
453
+ tax_pct = f"{inv.tax_rate * 100:.1f}"
454
+ lines = [
455
+ "═══════════════════════════════════════════",
456
+ " INVOICE",
457
+ "═══════════════════════════════════════════",
458
+ f"Invoice Number: {inv.invoice_number}",
459
+ f"Date: {inv.date}",
460
+ f"PO Reference: {inv.po_reference}",
461
+ "",
462
+ f"From: {inv.vendor.name}",
463
+ f" {inv.vendor.address}",
464
+ f" {inv.vendor.city}, {inv.vendor.state} {inv.vendor.zip_code}",
465
+ f" Tax ID: {inv.vendor.tax_id}",
466
+ "",
467
+ f"Bill To: {inv.buyer.name}",
468
+ f" {inv.buyer.address}",
469
+ f" {inv.buyer.city}, {inv.buyer.state} {inv.buyer.zip_code}",
470
+ "",
471
+ f"{'ID':<12} {'Description':<40} {'Qty':>5} {'Unit Price':>12} {'Amount':>12}",
472
+ "─" * 85,
473
+ ]
474
+ for li in inv.line_items:
475
+ lines.append(
476
+ f"{li.item_id:<12} {li.description:<40} {li.quantity:>5} "
477
+ f"${li.invoiced_unit_price:>10,.2f} ${li.invoiced_total:>10,.2f}"
478
+ )
479
+ lines.extend([
480
+ "─" * 85,
481
+ f"{'Subtotal:':>71} ${inv.subtotal:>10,.2f}",
482
+ f"{'Tax (' + tax_pct + '%):':>71} ${inv.tax_amount:>10,.2f}",
483
+ f"{'TOTAL DUE:':>71} ${inv.total:>10,.2f}",
484
+ "═══════════════════════════════════════════",
485
+ ])
486
+ return "\n".join(lines)
487
+
488
+
+ def render_sla(sla: SLAContract) -> str:
+     ps = sla.penalty_structure
+     lines = [
+         "═══════════════════════════════════════════",
+         " SERVICE LEVEL AGREEMENT",
+         "═══════════════════════════════════════════",
+         f"Contract ID: {sla.contract_id}",
+         f"Effective Date: {sla.effective_date}",
+         f"Vendor: {sla.vendor}",
+         f"Buyer: {sla.buyer}",
+         "",
+         f"Delivery Terms: {sla.delivery_terms}",
+         "",
+         "LATE DELIVERY PENALTY CLAUSE:",
+     ]
+     if ps["type"] == "linear":
+         lines.append(f" - Penalty rate: {ps['rate_per_day'] * 100:.1f}% of invoice subtotal per day late")
+         lines.append(f" - Maximum penalty cap: {ps['cap'] * 100:.0f}% of invoice subtotal")
+         if ps["grace_days"] > 0:
+             lines.append(f" - Grace period: {ps['grace_days']} business day(s)")
+     elif ps["type"] == "tiered":
+         lines.append(" - Tiered penalty structure:")
+         prev = 0
+         for threshold, rate in ps["tiers"]:
+             if threshold >= 999:
+                 lines.append(f" Day {prev + 1}+: {rate * 100:.1f}% per day")
+             else:
+                 lines.append(f" Days {prev + 1}-{threshold}: {rate * 100:.1f}% per day")
+             prev = threshold
+         lines.append(f" - Maximum penalty cap: {ps['cap'] * 100:.0f}% of invoice subtotal")
+     lines.append("═══════════════════════════════════════════")
+     return "\n".join(lines)
+
+
+ def render_shipping_log(log: ShippingLog) -> str:
+     return "\n".join([
+         "═══════════════════════════════════════════",
+         " SHIPPING LOG",
+         "═══════════════════════════════════════════",
+         f"Tracking ID: {log.tracking_id}",
+         f"PO Reference: {log.po_reference}",
+         f"Carrier: {log.carrier}",
+         f"Ship Date: {log.ship_date}",
+         f"Expected Delivery: {log.expected_delivery}",
+         f"Actual Delivery: {log.actual_delivery}",
+         f"Delay: {log.delay_days} day(s)",
+         f"Status: {log.status}",
+         "═══════════════════════════════════════════",
+     ])
+
+
+ def render_warehouse_logs(logs: List[WarehouseLog]) -> str:
+     lines = [
+         "═══════════════════════════════════════════",
+         " WAREHOUSE ACCESS LOGS",
+         "═══════════════════════════════════════════",
+     ]
+     for wl in logs:
+         lines.extend([
+             f"Date: {wl.date} | Dock: {wl.dock_id} | Status: {wl.status.upper()}",
+             f" Staff on duty: {wl.staff_on_duty} | Shipments received: {wl.shipments_received}",
+             f" Notes: {wl.notes}",
+             "",
+         ])
+     lines.append("═══════════════════════════════════════════")
+     return "\n".join(lines)
  
  
  # ---------------------------------------------------------------------------
  # Public API
  # ---------------------------------------------------------------------------
  
+ TASK_GENERATORS = {
+     "procurement_reconciliation": "generate_task1",
+     "sla_enforcement": "generate_task2",
+     "adversarial_auditing": "generate_task3",
+ }
+
+ VALID_TASKS = list(TASK_GENERATORS.keys())
+
+ MAX_STEPS = {
+     "procurement_reconciliation": 10,
+     "sla_enforcement": 15,
+     "adversarial_auditing": 20,
  }
  
  
+ def generate_scenario(task_name: str, seed: int = 0) -> Scenario:
+     """Generate a complete ESCTR scenario for the given task and seed."""
      engine = ProceduralEngine(seed)
+     method = TASK_GENERATORS.get(task_name, "generate_task1")
      return getattr(engine, method)()
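
The name-based dispatch in `generate_scenario` can be tried in isolation with a minimal sketch. It assumes a stand-in engine: `StubEngine` and `generate_scenario_strict` are hypothetical names that do not appear in this commit, and the stub returns plain dicts rather than real `Scenario` objects. It illustrates two properties of the pattern: the same seed reproduces the same output, and an unknown task name silently falls back to `generate_task1` unless the caller validates it first.

```python
import random

TASK_GENERATORS = {
    "procurement_reconciliation": "generate_task1",
    "sla_enforcement": "generate_task2",
    "adversarial_auditing": "generate_task3",
}


class StubEngine:
    """Hypothetical stand-in for ProceduralEngine (not the real class)."""

    def __init__(self, seed: int):
        # A private RNG instance, so output depends only on the seed.
        self.rng = random.Random(seed)

    def _make(self, task: str) -> dict:
        return {"task": task, "invoice_number": self.rng.randint(10000, 99999)}

    def generate_task1(self) -> dict:
        return self._make("procurement_reconciliation")

    def generate_task2(self) -> dict:
        return self._make("sla_enforcement")

    def generate_task3(self) -> dict:
        return self._make("adversarial_auditing")


def generate_scenario(task_name: str, seed: int = 0) -> dict:
    """Same dispatch shape as the diff: look up a method name, then call it."""
    engine = StubEngine(seed)
    method = TASK_GENERATORS.get(task_name, "generate_task1")
    return getattr(engine, method)()


def generate_scenario_strict(task_name: str, seed: int = 0) -> dict:
    """Assumed variant: reject unknown task names instead of falling back."""
    if task_name not in TASK_GENERATORS:
        raise ValueError(f"unknown task: {task_name!r}")
    return generate_scenario(task_name, seed)


if __name__ == "__main__":
    print(generate_scenario("sla_enforcement", seed=7))
    print(generate_scenario("no_such_task", seed=7)["task"])
```

Whether the silent fallback is intended is not stated in the commit. A caller that prefers to fail fast could check the task name against `VALID_TASKS` before calling `generate_scenario`, as the strict variant does.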