Commit ·
a363048
1
Parent(s): 6f7e1b7
feat: ESCTR pivot — Enterprise Supply Chain & Tax Reconciliation
Complete rewrite for OpenEnv Hackathon Round 2:
- New domain: autonomous financial auditing (supply chain discrepancies)
- 3 tasks: procurement reconciliation, SLA enforcement, adversarial auditing
- 4 ERP tools: query_database, read_document, communicate_vendor, submit_financial_decision
- Adversarial vendor negotiation with settlement dynamics
- Procedural scenario generation (deterministic from seed)
- RLVR composite rewards with trajectory milestones and gullibility penalties
- Storytelling README (Problem → Environment → Results → Why it matters)
- Added course.md documenting the full journey
- Removed old documents.py (replaced by procedural.py)
- .gitignore +1 -0
- README.md +122 -155
- course.md +309 -0
- inference.py +155 -225
- openenv.yaml +10 -13
- server/__init__.py +3 -3
- server/app.py +53 -76
- server/documents.py +0 -898
- server/environment.py +458 -526
- server/graders.py +280 -302
- server/models.py +91 -45
- server/procedural.py +536 -382
.gitignore
CHANGED
|
@@ -14,3 +14,4 @@ hackathon_instructions.txt
|
|
| 14 |
preparatory_course.txt
|
| 15 |
RESEARCH_1.md
|
| 16 |
RESEARCH_2.md
|
|
|
|
|
|
| 14 |
preparatory_course.txt
|
| 15 |
RESEARCH_1.md
|
| 16 |
RESEARCH_2.md
|
| 17 |
+
ROUND_2_GUIDELINES.md
|
README.md
CHANGED
|
@@ -1,7 +1,7 @@
|
|
| 1 |
---
|
| 2 |
-
title:
|
| 3 |
-
emoji:
|
| 4 |
-
colorFrom:
|
| 5 |
colorTo: green
|
| 6 |
sdk: docker
|
| 7 |
pinned: false
|
|
@@ -10,218 +10,185 @@ tags:
|
|
| 10 |
- openenv
|
| 11 |
---
|
| 12 |
|
| 13 |
-
#
|
| 14 |
|
| 15 |
-
|
| 16 |
|
| 17 |
**Space URL:** `https://huggingface.co/spaces/musharraf7/invoice-extraction-env`
|
| 18 |
|
| 19 |
-
|
| 20 |
-
import requests
|
| 21 |
-
|
| 22 |
-
# Connect to the environment
|
| 23 |
-
url = "https://musharraf7-invoice-extraction-env.hf.space"
|
| 24 |
-
r = requests.post(f"{url}/reset", json={"task_name": "simple_invoice"})
|
| 25 |
-
print(r.json())
|
| 26 |
-
```
|
| 27 |
-
|
| 28 |
-
## Why This Environment?
|
| 29 |
-
|
| 30 |
-
Invoice data extraction is a **$5B+ industry** problem faced daily by every business. This environment provides:
|
| 31 |
-
|
| 32 |
-
- **Real RL training signal**: Per-field partial-credit scoring gives dense reward gradients via RLVR-inspired composite rewards
|
| 33 |
-
- **Infinite training data**: Procedural document generation creates unique invoices from any seed — eliminating overfitting to a static corpus
|
| 34 |
-
- **Genuine difficulty progression**: From clean invoices to adversarial traps that challenge frontier models
|
| 35 |
-
- **Multi-tool agentic workflow**: Hard tasks feature database queries, calculation verification, and discrepancy detection tools — training agents for multi-step reasoning
|
| 36 |
-
- **Reward shaping**: Trajectory milestones, consistency bonuses, efficiency signals, and improvement tracking provide rich learning signals beyond simple field matching
|
| 37 |
-
- **Production relevance**: The task directly models what commercial document processing systems must solve
|
| 38 |
-
|
| 39 |
-
## Reward Architecture (RLVR-Inspired)
|
| 40 |
|
| 41 |
-
The
|
| 42 |
|
| 43 |
-
|
| 44 |
-
R_total = α·R_outcome + β·R_trajectory + bonuses
|
| 45 |
-
```
|
| 46 |
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
| **R_trajectory** | β = 0.30 | Micro-rewards for information gathering milestones |
|
| 51 |
-
| **Consistency bonus** | +0.03 | Agent's subtotal + tax = total |
|
| 52 |
-
| **Efficiency bonus** | +0.01–0.02 | Solution found in ≤5 steps |
|
| 53 |
-
| **Improvement bonus** | up to +0.02 | Score improves on retry |
|
| 54 |
-
| **Step cost** | -0.005/step | Encourages efficient exploration |
|
| 55 |
-
| **Hallucination penalty** | -0.02 | Invalid JSON or unknown commands |
|
| 56 |
-
|
| 57 |
-
### Trajectory Milestones
|
| 58 |
-
|
| 59 |
-
| Action | Micro-reward | Purpose |
|
| 60 |
-
|--------|-------------|---------|
|
| 61 |
-
| `view_document` | +0.01 | Evidence gathering |
|
| 62 |
-
| `view_fields` | +0.01 | Understanding requirements |
|
| 63 |
-
| `get_feedback` | +0.005 | Learning from errors |
|
| 64 |
-
| `query_related_documents` | +0.015 | Cross-referencing (hard tasks) |
|
| 65 |
-
| `verify_calculations` | +0.01 | Mathematical verification |
|
| 66 |
-
| `check_discrepancies` | +0.015 | Anomaly detection |
|
| 67 |
-
|
| 68 |
-
## Action Space
|
| 69 |
-
|
| 70 |
-
The agent sends an `InvoiceAction` with a `command` and optional `payload`:
|
| 71 |
-
|
| 72 |
-
| Command | Description | Payload | Available Tasks |
|
| 73 |
-
|---------|-------------|---------|-----------------|
|
| 74 |
-
| `view_document` | View the raw document text | — | All |
|
| 75 |
-
| `view_fields` | See required fields with descriptions | — | All |
|
| 76 |
-
| `extract` | Submit extracted fields | JSON string | All |
|
| 77 |
-
| `get_feedback` | Get detailed per-field feedback | — | All |
|
| 78 |
-
| `query_related_documents` | Retrieve PO, credit memos, etc. | — | multi_document, adversarial |
|
| 79 |
-
| `verify_calculations` | Submit arithmetic for verification | JSON string | multi_document, adversarial |
|
| 80 |
-
| `check_discrepancies` | Flag inconsistencies in documents | — | multi_document, adversarial |
|
| 81 |
-
|
| 82 |
-
### Action Schema
|
| 83 |
-
```json
|
| 84 |
-
{
|
| 85 |
-
"command": "extract",
|
| 86 |
-
"payload": "{\"invoice_number\": \"INV-2024-001\", \"date\": \"2024-01-15\", ...}"
|
| 87 |
-
}
|
| 88 |
-
```
|
| 89 |
|
| 90 |
-
|
| 91 |
|
| 92 |
-
|
| 93 |
|
| 94 |
-
|
| 95 |
-
|-------|------|-------------|
|
| 96 |
-
| `text` | string | Response text from the environment |
|
| 97 |
-
| `task_name` | string | Current task name |
|
| 98 |
-
| `current_score` | float | Best score achieved so far |
|
| 99 |
-
| `attempts_remaining` | int | Remaining extraction attempts |
|
| 100 |
-
| `required_fields` | list | Fields to extract |
|
| 101 |
-
| `done` | bool | Whether the episode has ended |
|
| 102 |
-
| `reward` | float | Reward signal (0.01–0.99) |
|
| 103 |
-
| `last_action_status` | string | "success" or "error" |
|
| 104 |
-
| `error_message` | string | Diagnostic error message (if error) |
|
| 105 |
-
| `current_step` | int | Step number within episode |
|
| 106 |
-
| `accumulated_reward` | float | Total reward accumulated so far |
|
| 107 |
|
| 108 |
-
|
| 109 |
|
| 110 |
-
|
| 111 |
-
|
|
|
|
|
|
|
| 112 |
|
| 113 |
-
|
| 114 |
|
| 115 |
-
|
| 116 |
-
|
|
|
|
|
|
|
|
|
|
| 117 |
|
| 118 |
-
|
| 119 |
|
| 120 |
-
|
| 121 |
-
Complex multi-section documents containing a purchase order, invoice, and credit memo/payment receipt. The agent must cross-reference sections. **Advanced tools available** (`query_related_documents`, `verify_calculations`, `check_discrepancies`).
|
| 122 |
|
| 123 |
-
|
| 124 |
|
| 125 |
-
###
|
| 126 |
-
Simulates OCR-scanned/faxed invoices with systematic character errors:
|
| 127 |
-
- Character substitutions: `0`↔`O`, `1`↔`l`↔`I`, `5`↔`S`, `8`↔`B`
|
| 128 |
-
- Garbled sections and scan artifacts
|
| 129 |
-
- The agent must **reason through noise** to recover the true values
|
| 130 |
|
| 131 |
-
**
|
|
|
|
|
|
|
|
|
|
| 132 |
|
| 133 |
-
|
| 134 |
-
Adversarial documents designed to trap and challenge frontier models:
|
| 135 |
-
- **Decoy fields**: Multiple invoice numbers — only one is current
|
| 136 |
-
- **Hidden calculations**: Discounts the agent must compute
|
| 137 |
-
- **Contradictory sections**: PO vs invoice disagreements
|
| 138 |
-
- **Budget variance alerts**: Agent must identify and explain discrepancies
|
| 139 |
|
| 140 |
-
|
| 141 |
|
| 142 |
-
|
|
|
|
|
|
|
| 143 |
|
| 144 |
-
|
| 145 |
|
| 146 |
-
|
| 147 |
|
| 148 |
-
- **
|
| 149 |
-
- **
|
| 150 |
-
- **
|
| 151 |
-
- **10 tax rate configurations** (5%–10%)
|
| 152 |
-
- **Deterministic**: Same seed always produces the same document
|
| 153 |
-
- **Infinite variety**: Seeds 0–2 use static test fixtures; seeds ≥ 3 generate novel documents
|
| 154 |
|
| 155 |
-
|
| 156 |
-
# Use seed to get different documents
|
| 157 |
-
r = requests.post(f"{url}/reset", json={"task_name": "simple_invoice", "seed": 42})
|
| 158 |
-
r = requests.post(f"{url}/reset", json={"task_name": "simple_invoice", "seed": 100})
|
| 159 |
-
```
|
| 160 |
|
| 161 |
-
|
| 162 |
|
| 163 |
-
-
|
| 164 |
-
-
|
| 165 |
-
- **Date fields**: Normalized comparison (YYYY-MM-DD) with format tolerance
|
| 166 |
-
- **Line items**: Best-fit matching of description, qty, price, amount (weighted 2.0×)
|
| 167 |
-
- **Reasoning fields** (discrepancy_notes): Fuzzy matching with lower threshold
|
| 168 |
-
- **Financial fields** (subtotal, tax, total): Weighted 1.5× for importance
|
| 169 |
|
| 170 |
-
##
|
| 171 |
|
| 172 |
-
### Run
|
| 173 |
```bash
|
| 174 |
-
|
| 175 |
-
docker
|
| 176 |
-
|
| 177 |
|
| 178 |
-
#
|
| 179 |
-
```bash
|
| 180 |
pip install -r requirements.txt
|
| 181 |
uvicorn server.app:app --host 0.0.0.0 --port 7860
|
| 182 |
```
|
| 183 |
|
| 184 |
-
###
|
| 185 |
-
```
|
| 186 |
-
|
| 187 |
```
|
| 188 |
|
| 189 |
-
### Run inference
|
| 190 |
```bash
|
| 191 |
export ENV_URL="http://localhost:7860"
|
| 192 |
export API_BASE_URL="https://router.huggingface.co/v1"
|
| 193 |
export MODEL_NAME="meta-llama/Meta-Llama-3-8B-Instruct"
|
| 194 |
-
export HF_TOKEN="
|
| 195 |
python inference.py
|
| 196 |
```
|
| 197 |
|
| 198 |
## API Endpoints
|
| 199 |
|
| 200 |
| Endpoint | Method | Description |
|
| 201 |
|----------|--------|-------------|
|
| 202 |
| `/health` | GET | Health check |
|
| 203 |
-
| `/reset` | POST | Reset with task
|
| 204 |
| `/step` | POST | Execute an action |
|
| 205 |
-
| `/state` | GET |
|
| 206 |
-
| `/schema` | GET |
|
| 207 |
-
| `/metadata` | GET |
|
| 208 |
| `/ws` | WebSocket | Persistent session |
|
| 209 |
|
| 210 |
## Project Structure
|
|
|
|
| 211 |
```
|
| 212 |
├── server/
|
| 213 |
│ ├── __init__.py
|
| 214 |
│ ├── app.py # FastAPI application
|
| 215 |
-
│ ├── environment.py # Core environment
|
| 216 |
-
│ ├──
|
| 217 |
-
│ ├──
|
| 218 |
-
│
|
| 219 |
-
|
| 220 |
-
├── __init__.py # Package declaration
|
| 221 |
-
├── inference.py # Baseline inference script (all 5 tasks)
|
| 222 |
├── openenv.yaml # OpenEnv manifest
|
| 223 |
-
├── pyproject.toml # Package
|
| 224 |
├── requirements.txt # Dependencies
|
| 225 |
├── Dockerfile # Container definition
|
| 226 |
└── README.md # This file
|
| 227 |
```
|
| 1 |
---
|
| 2 |
+
title: ESCTR Environment
|
| 3 |
+
emoji: 🏢
|
| 4 |
+
colorFrom: indigo
|
| 5 |
colorTo: green
|
| 6 |
sdk: docker
|
| 7 |
pinned: false
|
|
|
|
| 10 |
- openenv
|
| 11 |
---
|
| 12 |
|
| 13 |
+
# 🏢 ESCTR: Enterprise Supply Chain & Tax Reconciliation
|
| 14 |
|
| 15 |
+
> **Training LLMs to be autonomous financial auditors** — an OpenEnv environment for teaching AI agents to investigate procurement discrepancies, enforce SLA penalties, and navigate adversarial vendor disputes.
|
| 16 |
|
| 17 |
**Space URL:** `https://huggingface.co/spaces/musharraf7/invoice-extraction-env`
|
| 18 |
|
| 19 |
+
---
|
| 20 |
|
| 21 |
+
## The Problem
|
| 22 |
|
| 23 |
+
Every day, global enterprises process millions of procurement transactions. Between the Purchase Order, the shipping manifest, the SLA contract, and the final vendor invoice, discrepancies **inevitably** arise:
|
|
|
|
|
|
|
| 24 |
|
| 25 |
+
- A vendor bills $45/unit instead of the contracted $40
|
| 26 |
+
- A shipment arrives 5 days late, triggering SLA penalty clauses
|
| 27 |
+
- A vendor disputes the penalty, claiming *your* warehouse rejected the delivery
|
| 28 |
|
| 29 |
+
Resolving these disputes currently requires human financial controllers to **manually cross-reference multiple siloed databases**, interpret complex contract clauses, perform precise arithmetic, and negotiate with adversarial counterparties. It's slow, expensive, and error-prone.
|
| 30 |
|
| 31 |
+
**What if we could train LLMs to do this autonomously?**
|
| 32 |
|
| 33 |
+
## The Environment
|
| 34 |
|
| 35 |
+
ESCTR provides a stateful sandbox where an LLM agent operates as an **autonomous financial controller**. Rather than just extracting data from a document, the agent must:
|
| 36 |
|
| 37 |
+
1. **Investigate** — query procurement databases, shipping logs, SLA contracts
|
| 38 |
+
2. **Reason** — cross-reference documents, calculate penalties, verify claims
|
| 39 |
+
3. **Negotiate** — handle adversarial vendor communications
|
| 40 |
+
4. **Decide** — submit a mathematically precise financial adjustment
|
| 41 |
|
| 42 |
+
### Three Tasks, Escalating Complexity
|
| 43 |
|
| 44 |
+
| Task | Difficulty | Max Steps | What the Agent Must Do |
|
| 45 |
+
|------|-----------|-----------|----------------------|
|
| 46 |
+
| **Procurement Reconciliation** | Easy | 10 | Find an overcharged line item between PO and Invoice, calculate the exact overcharge |
|
| 47 |
+
| **SLA Enforcement** | Medium | 15 | Discover a late shipment, retrieve the SLA contract, calculate the penalty from contract terms |
|
| 48 |
+
| **Adversarial Auditing** | Hard | 20 | All of the above + verify warehouse logs to disprove vendor's claim + reject a settlement offer |
|
| 49 |
|
| 50 |
+
### The Tool Suite
|
| 51 |
|
| 52 |
+
The agent interacts through **4 ERP tools**, each requiring precise parameters:
|
|
|
|
| 53 |
|
| 54 |
+
| Tool | Purpose | Parameters |
|
| 55 |
+
|------|---------|------------|
|
| 56 |
+
| `query_database` | Search corporate databases | `{"table": "shipping_logs"}` |
|
| 57 |
+
| `read_document` | Retrieve full document text | `document_id: "PO-2024-1234"` |
|
| 58 |
+
| `communicate_vendor` | Negotiate with adversarial vendor | `message_content: "We reject..."` |
|
| 59 |
+
| `submit_financial_decision` | Submit final adjustment (terminal) | `adjustment_amount: -450.00` |
|
| 60 |
|
| 61 |
+
### Procedural Generation
|
|
|
|
|
|
|
|
|
|
|
|
|
| 62 |
|
| 63 |
+
Every scenario is generated from a seed — **same seed = same scenario = deterministic grading**. This enables:
|
| 64 |
+
- Infinite training configurations (no memorization)
|
| 65 |
+
- Reproducible evaluation
|
| 66 |
+
- Fair comparison between models
|
| 67 |
|
| 68 |
+
Each scenario generates: company profiles, product catalogs with contracted pricing, purchase orders, vendor invoices (with seeded discrepancies), SLA contracts (linear/tiered penalty structures), shipping telemetry, and warehouse access logs.
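The "same seed = same scenario" property can be sketched as below. This is a minimal illustration, not the actual `procedural.py`: the `Scenario` fields, the price ranges, and the 5-20% overcharge band are all assumptions made up for the example. The one real technique shown is using a private `random.Random(seed)` instance so that generation does not depend on global RNG state.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Hypothetical, trimmed-down scenario record (not the real schema)."""
    po_id: str
    contracted_unit_price: float
    invoiced_unit_price: float
    quantity: int


def generate_scenario(seed: int) -> Scenario:
    # A private Random instance makes the output a pure function of the seed.
    rng = random.Random(seed)
    po_id = f"PO-2024-{rng.randint(1000, 9999)}"
    contracted = round(rng.uniform(10.0, 100.0), 2)
    # Seed a discrepancy: the invoice overcharges by an assumed 5-20%.
    invoiced = round(contracted * (1 + rng.uniform(0.05, 0.20)), 2)
    quantity = rng.randint(10, 500)
    return Scenario(po_id, contracted, invoiced, quantity)


def expected_overcharge(scenario: Scenario) -> float:
    # Ground truth the grader would compare the agent's answer against.
    delta = scenario.invoiced_unit_price - scenario.contracted_unit_price
    return round(delta * scenario.quantity, 2)
```

Because the ground truth is derived from the same seeded scenario, a grader can recompute it on demand instead of storing answers.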
|
| 69 |
|
| 70 |
+
## Reward Architecture (RLVR-Inspired)
|
| 71 |
|
| 72 |
+
```
|
| 73 |
+
R_total = α·R_outcome + β·R_trajectory − penalties
|
| 74 |
+
```
|
| 75 |
|
| 76 |
+
| Component | Weight | Description |
|
| 77 |
+
|-----------|--------|-------------|
|
| 78 |
+
| **R_outcome** | 60-70% | Did the agent submit the correct adjustment amount? |
|
| 79 |
+
| **R_trajectory** | 30-40% | Did the agent follow proper investigative procedure? |
|
| 80 |
+
| **Efficiency penalty** | -0.005/step | Encourages shortest path to resolution |
|
| 81 |
+
| **Hallucination penalty** | -0.02 | Invalid queries, nonexistent documents |
|
| 82 |
+
| **Gullibility penalty** | -0.20 | Accepting adversarial settlement offers (Task 3) |
|
| 83 |
+
| **Evidence bonus** | +0.05 | Citing warehouse logs as evidence (Task 3) |
|
| 84 |
|
| 85 |
+
### Why This Reward Design Matters
|
| 86 |
|
| 87 |
+
- **Dense, not sparse**: Trajectory milestones reward correct investigative behavior (querying the right databases, reading the right documents) even if the final answer is wrong
|
| 88 |
+
- **Hard to game**: An agent that spams queries gets penalized by step costs; an agent that submits without investigating gets 0 trajectory reward
|
| 89 |
+
- **Verifiable**: The correct answer is always a precise floating-point number derived from contract terms — no subjective evaluation
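To make the composition concrete, here is a rough sketch of how the components in the table could be combined. It is an assumption-laden illustration rather than the code in `graders.py`: the `EpisodeStats` container is hypothetical, α = 0.7 and β = 0.3 are one point inside the stated 60-70% / 30-40% ranges, and no clamping is applied because the source does not say whether the real grader clamps.

```python
from dataclasses import dataclass


@dataclass
class EpisodeStats:
    """Hypothetical summary of one episode (field names are illustrative)."""
    outcome_score: float       # 0.0-1.0, correctness of the submitted adjustment
    trajectory_score: float    # 0.0-1.0, fraction of investigative milestones hit
    steps: int                 # total actions taken
    hallucinations: int        # invalid queries or nonexistent documents
    accepted_settlement: bool  # Task 3: agent accepted the vendor's offer
    cited_warehouse_logs: bool # Task 3: agent cited warehouse logs as evidence


def total_reward(stats: EpisodeStats, alpha: float = 0.7, beta: float = 0.3) -> float:
    # R_total = alpha * R_outcome + beta * R_trajectory - penalties (+ bonus)
    reward = alpha * stats.outcome_score + beta * stats.trajectory_score
    reward -= 0.005 * stats.steps          # efficiency penalty per step
    reward -= 0.02 * stats.hallucinations  # hallucination penalty per invalid action
    if stats.accepted_settlement:
        reward -= 0.20                     # gullibility penalty
    if stats.cited_warehouse_logs:
        reward += 0.05                     # evidence bonus
    return round(reward, 4)
```

Under these assumed weights, an agent that accepts the settlement loses more than an entire perfect trajectory score is worth once step costs are counted, which is the intended deterrent.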
|
|
|
|
|
|
|
|
|
|
| 90 |
|
| 91 |
+
## Results
|
|
|
|
|
|
|
|
|
|
|
|
|
| 92 |
|
| 93 |
+
*Training evidence and reward plots will be added during the onsite hackathon (April 25-26) when compute credits are provided.*
|
| 94 |
|
| 95 |
+
<!-- Placeholder for training results -->
|
| 96 |
+
<!--  -->
|
|
|
|
|
|
|
|
|
|
|
|
|
| 97 |
|
| 98 |
+
## Quick Start
|
| 99 |
|
| 100 |
+
### Run the environment
|
| 101 |
```bash
|
| 102 |
+
# Docker
|
| 103 |
+
docker build -t esctr-env .
|
| 104 |
+
docker run -p 7860:7860 esctr-env
|
| 105 |
|
| 106 |
+
# Or locally
|
|
|
|
| 107 |
pip install -r requirements.txt
|
| 108 |
uvicorn server.app:app --host 0.0.0.0 --port 7860
|
| 109 |
```
|
| 110 |
|
| 111 |
+
### Connect an agent
|
| 112 |
+
```python
|
| 113 |
+
import requests
|
| 114 |
+
|
| 115 |
+
url = "http://localhost:7860"
|
| 116 |
+
|
| 117 |
+
# Reset with a task
|
| 118 |
+
r = requests.post(f"{url}/reset", json={"task_name": "sla_enforcement", "seed": 42})
|
| 119 |
+
briefing = r.json()["observation"]["system_response"]
|
| 120 |
+
|
| 121 |
+
# Query a database
|
| 122 |
+
r = requests.post(f"{url}/step", json={
|
| 123 |
+
"action": {
|
| 124 |
+
"action_type": "query_database",
|
| 125 |
+
"query_parameters": {"table": "shipping_logs"}
|
| 126 |
+
}
|
| 127 |
+
})
|
| 128 |
+
result = r.json()["observation"]["system_response"]
|
| 129 |
+
|
| 130 |
+
# Submit financial decision
|
| 131 |
+
r = requests.post(f"{url}/step", json={
|
| 132 |
+
"action": {
|
| 133 |
+
"action_type": "submit_financial_decision",
|
| 134 |
+
"adjustment_amount": -450.00,
|
| 135 |
+
"adjustment_reason": "Late delivery penalty per SLA clause"
|
| 136 |
+
}
|
| 137 |
+
})
|
| 138 |
+
score = r.json()["reward"]
|
| 139 |
```
|
| 140 |
|
| 141 |
+
### Run baseline inference
|
| 142 |
```bash
|
| 143 |
export ENV_URL="http://localhost:7860"
|
| 144 |
export API_BASE_URL="https://router.huggingface.co/v1"
|
| 145 |
export MODEL_NAME="meta-llama/Meta-Llama-3-8B-Instruct"
|
| 146 |
+
export HF_TOKEN="your_token"
|
| 147 |
python inference.py
|
| 148 |
```
|
| 149 |
|
| 150 |
+
## Why This Matters
|
| 151 |
+
|
| 152 |
+
| Question | Answer |
|
| 153 |
+
|----------|--------|
|
| 154 |
+
| *Does this teach an LLM something it can't do well?* | Yes — multi-step financial reasoning with tool use is a known weakness of current LLMs |
|
| 155 |
+
| *Is the domain underexplored?* | Yes — supply chain auditing + adversarial negotiation is nearly absent from RL/LLM training benchmarks |
|
| 156 |
+
| *Could a researcher write a paper about this?* | Yes — training autonomous financial auditors has direct commercial and academic value |
|
| 157 |
+
| *Is the reward hard to game?* | Yes — the correct answer is always a precise number from contract math; trajectory rewards require specific database queries |
|
| 158 |
+
|
| 159 |
## API Endpoints
|
| 160 |
|
| 161 |
| Endpoint | Method | Description |
|
| 162 |
|----------|--------|-------------|
|
| 163 |
| `/health` | GET | Health check |
|
| 164 |
+
| `/reset` | POST | Reset with task + seed |
|
| 165 |
| `/step` | POST | Execute an action |
|
| 166 |
+
| `/state` | GET | Current state |
|
| 167 |
+
| `/schema` | GET | Action/Observation/State schemas |
|
| 168 |
+
| `/metadata` | GET | Environment metadata |
|
| 169 |
| `/ws` | WebSocket | Persistent session |
|
| 170 |
|
| 171 |
## Project Structure
|
| 172 |
+
|
| 173 |
```
|
| 174 |
├── server/
|
| 175 |
│ ├── __init__.py
|
| 176 |
│ ├── app.py # FastAPI application
|
| 177 |
+
│ ├── environment.py # Core stateful environment + tool handlers
|
| 178 |
+
│ ├── procedural.py # Deterministic scenario generation engine
|
| 179 |
+
│ ├── graders.py # Multi-axis deterministic graders (3 tasks)
|
| 180 |
+
│ └── models.py # Pydantic Action/Observation/State schemas
|
| 181 |
+
├── inference.py # Baseline inference script
|
|
|
|
|
|
|
| 182 |
├── openenv.yaml # OpenEnv manifest
|
| 183 |
+
├── pyproject.toml # Package config
|
| 184 |
├── requirements.txt # Dependencies
|
| 185 |
├── Dockerfile # Container definition
|
| 186 |
└── README.md # This file
|
| 187 |
```
|
| 188 |
+
|
| 189 |
+
## Themes Alignment
|
| 190 |
+
|
| 191 |
+
- **🌐 World Modeling (Professional Tasks)** — Real interaction with tools and dynamic databases
|
| 192 |
+
- **📋 Long-Horizon Planning** — Multi-step investigation requiring state tracking across 10-20 steps
|
| 193 |
+
- **🤝 Multi-Agent Interactions** — Adversarial vendor negotiation with settlement dynamics
|
| 194 |
+
- **📈 Self-Improvement** — Escalating difficulty curriculum (Easy → Medium → Hard)
|
course.md
ADDED
|
@@ -0,0 +1,309 @@
|
| 1 |
+
# ESCTR: The Full Story — From Invoice Extraction to Enterprise Supply Chain Auditing
|
| 2 |
+
|
| 3 |
+
> This document captures the entire journey: the problem we set out to solve, the research we did, the approaches we tried, and how we arrived at the final ESCTR environment.
|
| 4 |
+
|
| 5 |
+
---
|
| 6 |
+
|
| 7 |
+
## Table of Contents
|
| 8 |
+
|
| 9 |
+
1. [The Starting Point — OpenEnv Hackathon](#1-the-starting-point)
|
| 10 |
+
2. [Round 1 — Invoice Extraction Environment](#2-round-1--invoice-extraction-environment)
|
| 11 |
+
3. [Research Phase — What Would Win Round 2?](#3-research-phase)
|
| 12 |
+
4. [The Pivot Decision — Why ESCTR](#4-the-pivot-decision)
|
| 13 |
+
5. [Architecture Deep Dive — How ESCTR Works](#5-architecture-deep-dive)
|
| 14 |
+
6. [Reward Design — RLVR Principles](#6-reward-design)
|
| 15 |
+
7. [What We Learned](#7-what-we-learned)
|
| 16 |
+
|
| 17 |
+
---
|
| 18 |
+
|
| 19 |
+
## 1. The Starting Point
|
| 20 |
+
|
| 21 |
+
### What is the OpenEnv Hackathon?
|
| 22 |
+
|
| 23 |
+
The **Meta PyTorch OpenEnv Hackathon × Scaler School of Technology** is a hackathon focused on building **RL training environments for LLMs**. The core idea: instead of training LLMs on static datasets, we build interactive environments where agents learn through Reinforcement Learning with Verifiable Rewards (RLVR).
|
| 24 |
+
|
| 25 |
+
**OpenEnv** is a framework by Meta PyTorch and HuggingFace that treats RL environments as isolated microservices — the training loop (client) is completely decoupled from the environment simulation (server). The environment exposes standard HTTP endpoints (`/reset`, `/step`, `/state`) and the agent interacts through typed Actions and Observations.
|
| 26 |
+
|
| 27 |
+
### The Challenge
|
| 28 |
+
|
| 29 |
+
Build an OpenEnv-compliant environment that:
|
| 30 |
+
- Simulates a task humans actually perform
|
| 31 |
+
- Has programmatic, deterministic grading (no LLM-as-judge)
|
| 32 |
+
- Provides dense reward signals (not just 0/1 at the end)
|
| 33 |
+
- Supports multiple difficulty tiers
|
| 34 |
+
- Runs within 2 vCPU / 8GB RAM constraints
|
| 35 |
+
- Is deployable as a Docker container on HuggingFace Spaces
|
| 36 |
+
|
| 37 |
+
---
|
| 38 |
+
|
| 39 |
+
## 2. Round 1 — Invoice Extraction Environment
|
| 40 |
+
|
| 41 |
+
### The Original Idea
|
| 42 |
+
|
| 43 |
+
Our Round 1 submission was an **Invoice Extraction Environment** — an environment where an AI agent extracts structured data (vendor name, invoice number, line items, totals, etc.) from unstructured invoice documents.
|
| 44 |
+
|
| 45 |
+
### What We Built
|
| 46 |
+
|
| 47 |
+
- **5 difficulty tiers**: simple_invoice → messy_invoice → multi_document → corrupted_scan → adversarial_invoice
|
| 48 |
+
- **15 static documents** across the 5 tiers
|
| 49 |
+
- **Fuzzy string matching** for text fields, numeric tolerance for amounts
|
| 50 |
+
- **Multi-step interaction**: view_document → view_fields → extract → get_feedback → refine
|
| 51 |
+
- **OpenEnv compliance**: FastAPI server, typed Pydantic models, Docker deployment
|
| 52 |
+
|
| 53 |
+
### Round 1 Enhancements (Pre-Pivot)
|
| 54 |
+
|
| 55 |
+
Before Round 2 guidelines dropped, we upgraded the Round 1 environment with:
|
| 56 |
+
|
| 57 |
+
1. **Procedural Document Generation** (`procedural.py`): A seed-based engine generating infinite invoice variations — 15 vendor profiles, 15 customers, 25 products, OCR corruption simulation. This eliminated the overfitting risk of a 15-document static corpus.
|
| 58 |
+
|
| 59 |
+
2. **RLVR Composite Rewards**: Instead of a simple extraction score, we implemented:
|
| 60 |
+
```
|
| 61 |
+
R_total = 0.70 × R_outcome + 0.30 × R_trajectory + bonuses
|
| 62 |
+
```
|
| 63 |
+
   This adds trajectory milestones (micro-rewards for viewing documents and getting feedback), efficiency bonuses, consistency bonuses (subtotal + tax = total), and penalties.
|
| 64 |
+
|
| 65 |
+
3. **Weighted Grading**: Financial fields scored 1.5×, line items 2.0×, with built-in cross-field arithmetic verification.
|
| 66 |
+
|
| 67 |
+
4. **Multi-Tool Workflow**: For hard tasks (multi_document, adversarial_invoice), we added `query_related_documents`, `verify_calculations`, and `check_discrepancies` tools.
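The weighted grading in item 3 can be illustrated with a short sketch. The weights (1.5× for financial fields, 2.0× for line items) come from the text above; the function names, the default weight of 1.0 for other fields, and the 0.01 tolerance are assumptions for the example, not the original grader.

```python
# Weights stated in the text; any field not listed is assumed to weigh 1.0.
FIELD_WEIGHTS = {"subtotal": 1.5, "tax": 1.5, "total": 1.5, "line_items": 2.0}


def weighted_score(field_scores: dict[str, float]) -> float:
    """Weighted mean of per-field scores, each in the range 0.0-1.0."""
    weighted_sum = 0.0
    total_weight = 0.0
    for name, score in field_scores.items():
        weight = FIELD_WEIGHTS.get(name, 1.0)
        weighted_sum += weight * score
        total_weight += weight
    return weighted_sum / total_weight if total_weight else 0.0


def totals_consistent(subtotal: float, tax: float, total: float, tol: float = 0.01) -> bool:
    # Cross-field arithmetic check: subtotal + tax should equal total.
    return abs(subtotal + tax - total) <= tol
```

With these weights, getting `total` wrong costs more than getting a plain text field wrong, which matches the stated intent of emphasizing financial accuracy.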
|
| 68 |
+
|
| 69 |
+
### Why Round 1 Wasn't Enough
|
| 70 |
+
|
| 71 |
+
The enhanced invoice extraction environment was technically solid: all tests passed, the reward design was sound, and procedural generation supplied unlimited data. **But it wasn't going to win Round 2.**
|
| 72 |
+
|
| 73 |
+
---
|
| 74 |
+
|
| 75 |
+
## 3. Research Phase
|
| 76 |
+
|
| 77 |
+
### RESEARCH_1: The ESCTR Blueprint
|
| 78 |
+
|
| 79 |
+
We conducted deep research into what would maximize hackathon scoring. The key findings:
|
| 80 |
+
|
| 81 |
+
**The Core Problem with Invoice Extraction:**
|
| 82 |
+
|
| 83 |
+
| Vulnerability | Why It Hurts |
|
| 84 |
+
|--------------|-------------|
|
| 85 |
+
| **Saturated domain** | Document extraction is a well-trodden path. Judges have seen it before. |
|
| 86 |
+
| **Shallow interaction** | View document → extract → done. No real multi-step reasoning. |
|
| 87 |
+
| **Text-centric abstraction** | Pre-parsed text removes any visual/spatial reasoning challenge. |
|
| 88 |
+
| **Low novelty ceiling** | Even with procedural generation, the core task is "fill in the JSON fields." |
|
| 89 |
+
|
| 90 |
+
**What Frontier AI Research Demands:**
|
| 91 |
+
|
| 92 |
+
Drawing from the **OLMo 3 technical report** and RLVR research, we identified that winning environments need:
|
| 93 |
+
- **Long-horizon planning**: Agents that plan across 10-20 steps, not 3-5
|
| 94 |
+
- **Tool orchestration**: Multiple heterogeneous tools, not just "view" and "extract"
|
| 95 |
+
- **Partial observability**: Information spread across multiple databases, not one document
|
| 96 |
+
- **Adversarial dynamics**: Active counterparties that resist the agent's goal
|
| 97 |
+
- **Deterministic verification**: Correct answers that are mathematically provable, not fuzzy-matched
|
| 98 |
+
|
| 99 |
+
**The Proposed Solution: Enterprise Supply Chain & Tax Reconciliation (ESCTR)**
|
| 100 |
+
|
| 101 |
+
The research proposed pivoting from "extract data from an invoice" to "act as an autonomous financial controller investigating procurement discrepancies." This transforms a simple NLP extraction task into a genuine **agentic workflow** that maps to real enterprise operations worth trillions of dollars annually.
|
| 102 |
+
|
| 103 |
+
### RESEARCH_2: Supporting Analysis
|
| 104 |
+
|
| 105 |
+
The supplementary research validated the ESCTR concept against:
|
| 106 |
+
- Amazon's agentic AI evaluation practices
|
| 107 |
+
- Multi-agent negotiation frameworks
|
| 108 |
+
- The credit assignment problem in long-horizon RL
|
| 109 |
+
- Rubric-based reward systems for domains beyond simple verification
|
| 110 |
+
|
| 111 |
+
### Key Insight from Research
|
| 112 |
+
|
| 113 |
+
> "An environment that challenges frontier 72B models at 40% success rate on its hardest task provides more training headroom than one where 8B models already score 80%."
|
| 114 |
+
|
| 115 |
+
This directly informed our task difficulty design — Task 3 (Adversarial Auditing) is deliberately hard enough that a model must:
|
| 116 |
+
1. Query 5 different databases
|
| 117 |
+
2. Cross-reference shipping dates against SLA penalty clauses
|
| 118 |
+
3. Verify warehouse logs to disprove a vendor's false claim
|
| 119 |
+
4. Navigate a multi-turn negotiation
|
| 120 |
+
5. Reject a settlement offer
|
| 121 |
+
6. Calculate the exact penalty amount to 2 decimal places
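Step 6 is where exact arithmetic matters. The sketch below shows one way a linear or tiered SLA penalty could be computed to the cent. The rates, cap, and tier boundaries are invented for illustration, since the actual contract terms are generated per scenario; `Decimal` is used here to avoid binary floating-point rounding surprises.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def linear_penalty(invoice_total: Decimal, days_late: int,
                   daily_rate: Decimal, cap_rate: Decimal) -> Decimal:
    """Penalty grows by daily_rate per late day, capped at cap_rate of the invoice."""
    if days_late <= 0:
        return Decimal("0.00")
    raw = invoice_total * daily_rate * days_late
    cap = invoice_total * cap_rate
    return min(raw, cap).quantize(CENT, rounding=ROUND_HALF_UP)


def tiered_penalty(invoice_total: Decimal, days_late: int,
                   tiers: list[tuple[int, Decimal]]) -> Decimal:
    """tiers: (min_days_late, rate) pairs sorted ascending; the highest tier reached applies."""
    rate = Decimal("0")
    for min_days, tier_rate in tiers:
        if days_late >= min_days:
            rate = tier_rate
    return (invoice_total * rate).quantize(CENT, rounding=ROUND_HALF_UP)


# Hypothetical contract: 1% per late day, capped at 10% of the invoice.
penalty = linear_penalty(Decimal("9000.00"), 5, Decimal("0.01"), Decimal("0.10"))
adjustment_amount = -penalty  # submitted as a negative adjustment
```

In this made-up case a 5-day delay on a $9,000.00 invoice gives a penalty of 450.00, so the submitted adjustment would be -450.00.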
|
| 122 |
+
|
| 123 |
+
---
|
| 124 |
+
|
| 125 |
+
## 4. The Pivot Decision
|
| 126 |
+
|
| 127 |
+
### Round 2 Guidelines Changed Everything
|
| 128 |
+
|
| 129 |
+
When the Round 2 guidelines arrived, the scoring criteria shifted dramatically:
|
| 130 |
+
|
| 131 |
+
| Criterion | Round 1 Weight | Round 2 Weight |
|
| 132 |
+
|-----------|---------------|---------------|
|
| 133 |
+
| Environment Innovation | ~30% | **40%** |
|
| 134 |
+
| Storytelling & Presentation | 0% | **30%** |
|
| 135 |
+
| Training Evidence (reward curves) | 0% | **20%** |
|
| 136 |
+
| Reward & Training Pipeline | ~25% | **10%** |
|
| 137 |
+
|
| 138 |
+
**70% of the score** now depends on innovation and storytelling (40% + 30%). The guidelines explicitly warned:
|
| 139 |
+
|
| 140 |
+
> *"A messy but ambitious environment with real training evidence beats a polished but boring one."*
|
| 141 |
+
> *"Judges have seen a lot of chess, snake, tic-tac-toe, and grid-world clones."*
|
| 142 |
+
|
| 143 |
+
### The Decision Matrix
|
| 144 |
+
|
| 145 |
+
| Factor | Invoice Extraction | ESCTR |
|
| 146 |
+
|--------|-------------------|-------|
|
| 147 |
+
| Innovation (40%) | ⚠️ Known domain, seen before | ✅ Novel — supply chain auditing is unexplored in RL |
|
| 148 |
+
| Storytelling (30%) | ⚠️ Hard to make exciting | ✅ Strong narrative — "training autonomous financial controllers" |
|
| 149 |
+
| Training Evidence (20%) | Equal | Equal |
|
| 150 |
+
| Theme Alignment | Weak — barely touches themes | ✅ Hits Theme #3.1 (World Modeling), #2 (Long-Horizon), #1 (Multi-Agent) |
|
| 151 |
+
| Technical Depth | Good but shallow | ✅ 4 tools, 5 databases, adversarial negotiation |
|
| 152 |
+
|
| 153 |
+
### Decision: Full ESCTR Pivot
|
| 154 |
+
|
| 155 |
+
We chose **Option A: Full ESCTR Pivot** because:
|
| 156 |
+
1. The innovation ceiling is dramatically higher
|
| 157 |
+
2. The storytelling angle is compelling and unique
|
| 158 |
+
3. Our existing RLVR reward architecture transfers directly
|
| 159 |
+
4. The procedural generation concept transfers directly
|
| 160 |
+
5. We had 2 days pre-onsite + 2 days onsite to build it
|
| 161 |
+
|
| 162 |
+
The risk was real — a complete rewrite — but a "polished but boring" environment was guaranteed to lose.
|
| 163 |
+
|
| 164 |
+
---

## 5. Architecture Deep Dive

### How ESCTR Works

The agent is presented with a **discrepancy alert** and must use 4 ERP tools to investigate:

```
┌─────────────────────────────────────────┐
│            ESCTR Environment            │
│                                         │
│  ┌──────────┐  ┌──────────┐  ┌────────┐ │
│  │ Purchase │  │ Shipping │  │  SLA   │ │
│  │  Orders  │  │   Logs   │  │Contract│ │
│  └────┬─────┘  └────┬─────┘  └───┬────┘ │
│       │             │            │      │
│  ┌────┴─────────────┴────────────┴────┐ │
│  │          Tool Dispatcher           │ │
│  │  query_database | read_document    │ │
│  │  communicate_vendor                │ │
│  │  submit_financial_decision         │ │
│  └─────────────────┬──────────────────┘ │
│                    │                    │
│  ┌─────────────────┴──────────────────┐ │
│  │           Grader Engine            │ │
│  │ R = α·outcome + β·trajectory − pen │ │
│  └────────────────────────────────────┘ │
└─────────────────────────────────────────┘
```
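
To make the tool interface concrete, the four actions can be written out as JSON-style payloads. The field names below follow the response format in this commit's `inference.py` system prompt; the document ID, message text, and amount are made-up placeholders, and `validate_action` is a hypothetical client-side check rather than part of the environment.

```python
import json

# Extra field(s) each tool expects, per the baseline agent's system prompt.
REQUIRED_FIELDS = {
    "query_database": ["query_parameters"],
    "read_document": ["document_id"],
    "communicate_vendor": ["message_content"],
    "submit_financial_decision": ["adjustment_amount", "adjustment_reason"],
}

# One example payload per tool; all values are placeholders.
example_actions = [
    {"action_type": "query_database", "query_parameters": {"table": "shipping_logs"}},
    {"action_type": "read_document", "document_id": "SLA-0001"},
    {"action_type": "communicate_vendor",
     "message_content": "Our warehouse logs show the dock was open on the delivery date."},
    {"action_type": "submit_financial_decision",
     "adjustment_amount": -3618.75,
     "adjustment_reason": "Late delivery penalty per SLA; warehouse logs confirm dock access."},
]


def validate_action(action: dict) -> list:
    """Return a list of problems; an empty list means the payload is well-formed."""
    action_type = action.get("action_type")
    if action_type not in REQUIRED_FIELDS:
        return [f"unknown action_type: {action_type!r}"]
    return [
        f"missing field: {name}"
        for name in REQUIRED_FIELDS[action_type]
        if name not in action
    ]


for action in example_actions:
    # Compact JSON, matching how the baseline script logs actions.
    print(json.dumps(action, separators=(",", ":")), validate_action(action))
```

In the baseline script, each payload is wrapped as `{"action": ...}` and posted to the `/step` endpoint by `env_step`.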

### The Three Tasks

**Task 1: Procurement Reconciliation (Easy)**
- A vendor invoices at higher prices than contracted
- Agent must: Query PO → Query Invoice → Compare line items → Find overcharge → Submit correction
- Grading: Correct line item ID + exact adjustment amount = 1.0

**Task 2: SLA Enforcement (Medium)**
- A shipment arrived late, vendor demands full payment
- Agent must: Query shipping logs → Discover delay → Query SLA contract → Calculate penalty per terms → Submit deduction
- Grading: Exact penalty calculation = 1.0, within 5% = 0.7, within 10% = 0.4
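
The Task 2 tiers can be sketched as a small scoring function. This is an illustration of the stated thresholds, not the code in `server/graders.py`: the function name, the 0.01 tolerance used for "exact", and the 0.0 score outside the 10% band are all assumptions.

```python
def grade_penalty(submitted: float, expected: float, exact_tolerance: float = 0.01) -> float:
    """Tiered score for a submitted penalty amount.

    Tiers follow the course notes (exact = 1.0, within 5% = 0.7, within 10% = 0.4).
    The exact-match tolerance and the 0.0 floor are assumed for illustration.
    """
    error = abs(submitted - expected)
    if error <= exact_tolerance:
        return 1.0
    if expected == 0:
        # No meaningful relative error when the expected amount is zero.
        return 0.0
    relative_error = error / abs(expected)
    if relative_error <= 0.05:
        return 0.7
    if relative_error <= 0.10:
        return 0.4
    return 0.0


# Hypothetical expected adjustment of -3618.75:
for guess in (-3618.75, -3500.00, -3300.00, -1000.00):
    print(guess, grade_penalty(guess, -3618.75))
```

Partial credit of this kind gives a learning signal to agents that found the right documents but slipped on the arithmetic.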

**Task 3: Adversarial Auditing (Hard)**
- Vendor disputes the late delivery, claims warehouse rejected shipment
- Agent must: Verify shipping delay → Get SLA terms → Query warehouse logs (prove dock was open) → Engage vendor → Reject settlement offer → Enforce full penalty
- Grading: Multi-axis, combining outcome (60%) + trajectory (40%) − gullibility penalty + evidence bonus

### Procedural Generation

Every scenario is generated from a seed using deterministic randomization:
- **15 vendor profiles** with US addresses
- **15 buyer profiles** with realistic business names
- **20 products** across hardware, electrical, IT, machinery categories
- **5 SLA penalty structures** (linear and tiered)
- Same seed → identical scenario → reproducible evaluation
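
The "same seed, identical scenario" property comes from drawing every random choice from a private, seeded generator. A minimal sketch, assuming small stand-in pools in place of the real profile lists; `generate_scenario` and all pool contents are hypothetical and do not reflect the structure of `server/procedural.py`.

```python
import random

# Tiny stand-in pools; the real engine is described as having 15 vendors,
# 15 buyers, 20 products, and 5 SLA penalty structures.
VENDORS = ["Acme Industrial", "Northwind Supply", "Keystone Parts"]
BUYERS = ["Harbor Logistics", "Summit Manufacturing"]
PRODUCTS = [("Steel bolts", 0.42), ("Copper wire", 3.10), ("Server rack", 899.00)]
SLA_STRUCTURES = ["linear", "tiered"]


def generate_scenario(seed: int) -> dict:
    """Build one scenario using only a seeded, private RNG."""
    rng = random.Random(seed)  # no reliance on the global random state
    product_name, unit_price = rng.choice(PRODUCTS)
    return {
        "seed": seed,
        "vendor": rng.choice(VENDORS),
        "buyer": rng.choice(BUYERS),
        "product": product_name,
        "unit_price": unit_price,
        "quantity": rng.randint(10, 500),
        "sla_structure": rng.choice(SLA_STRUCTURES),
        "delay_days": rng.randint(1, 14),
    }


first = generate_scenario(42)
second = generate_scenario(42)
print(first == second)  # the same seed reproduces the same scenario
```

Using `random.Random(seed)` instead of `random.seed(seed)` keeps concurrent episodes from interfering with each other's draws.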

### The Vendor Negotiation System

Task 3 features a **3-phase adversarial vendor**:

1. **Phase 1 (The Excuse)**: The vendor claims your warehouse rejected the delivery
2. **Phase 2 (The Settlement Offer)**: The vendor offers 40-55% of the penalty as a "goodwill credit"
3. **Phase 3 (Concession or Persistence)**: If the agent rejects firmly and cites evidence, the vendor concedes

The agent is penalized −0.20 for **gullibility** (accepting the settlement) and rewarded +0.05 for **evidence citation** (mentioning warehouse logs in the adjustment reason).
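
The three phases can be pictured as a small state machine. The sketch below is hypothetical throughout: the class name, the canned replies, and the keyword matching that stands in for "rejects firmly and cites evidence" are assumptions, and the real vendor logic in `server/environment.py` may differ substantially. Only the phase order and the 40-55% offer band come from the notes above.

```python
import random
from enum import Enum


class Phase(Enum):
    EXCUSE = 1
    SETTLEMENT_OFFER = 2
    RESOLUTION = 3


class VendorSketch:
    """Hypothetical 3-phase vendor; keyword checks stand in for real logic."""

    def __init__(self, penalty: float, seed: int):
        rng = random.Random(seed)
        # Offer a fraction of the full penalty within the 40-55% band.
        self.offer = round(penalty * rng.uniform(0.40, 0.55), 2)
        self.phase = Phase.EXCUSE
        self.conceded = False

    def respond(self, agent_message: str) -> str:
        text = agent_message.lower()
        if self.phase is Phase.EXCUSE:
            self.phase = Phase.SETTLEMENT_OFFER
            return "Your warehouse rejected our delivery."
        if self.phase is Phase.SETTLEMENT_OFFER:
            self.phase = Phase.RESOLUTION
            return f"We can offer a goodwill credit of {self.offer:.2f}."
        # Phase 3: concede only on a firm rejection that cites evidence.
        rejects = "reject" in text
        cites_evidence = "warehouse log" in text
        self.conceded = rejects and cites_evidence
        if self.conceded:
            return "We concede the full penalty."
        return "We stand by our offer."


vendor = VendorSketch(penalty=3618.75, seed=7)
print(vendor.respond("The shipment arrived late."))
print(vendor.respond("That claim is false."))
print(vendor.respond("We reject the offer; warehouse logs show the dock was open."))
```

Seeding the offer keeps the negotiation reproducible per scenario, in line with the procedural generation design.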

---

## 6. Reward Design

### RLVR Principles Applied

Our reward design follows principles from the OLMo 3 technical report:

```
R_total = α · R_outcome + β · R_trajectory − penalties
```

**Why not just binary rewards?**
- Sparse rewards (0 or 1 at the end) make credit assignment intractable in 15-20 step episodes
- The agent can't tell which of its 15 actions contributed to success or failure
- Dense trajectory rewards act as "algorithmic breadcrumbs" guiding policy gradients

**Trajectory Milestones:**

| Milestone | Meaning |
|-----------|---------|
| `retrieved_po` | Agent queried the purchase order database |
| `retrieved_invoice` | Agent queried the invoice database |
| `retrieved_shipping` | Agent discovered the shipping delay |
| `retrieved_sla` | Agent found the penalty terms |
| `checked_warehouse` | Agent verified internal records |
| `vendor_negotiation` | Agent engaged with the adversarial vendor |
| `calculated_penalty` | Agent performed penalty arithmetic |

**Penalties:**
- Step cost: −0.005 per action (encourages efficiency)
- Hallucination: −0.02 for invalid queries or nonexistent documents
- Gullibility: −0.20 for accepting adversarial settlements (Task 3)

**Why These Specific Values?**
- Step cost is small enough that investigation is still rewarded
- Hallucination penalty is 4× the step cost: bad actions are much worse than slow actions
- Gullibility penalty is massive (−0.20) because accepting a fraudulent claim is the worst possible failure mode in financial auditing
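
Putting the formula and the penalty values together, a composite score for one episode could be computed roughly as follows. The penalty and bonus constants come from the notes above; everything else is an assumption made for illustration: the weights α = 0.6 and β = 0.4 are borrowed from the Task 3 outcome/trajectory split, the trajectory term is taken as the fraction of expected milestones reached, and no clamping is applied. The actual aggregation in `server/graders.py` may differ.

```python
from dataclasses import dataclass, field

STEP_COST = 0.005              # per action
HALLUCINATION_PENALTY = 0.02   # per invalid query or nonexistent document
GULLIBILITY_PENALTY = 0.20     # accepting the vendor's settlement (Task 3)
EVIDENCE_BONUS = 0.05          # citing warehouse logs in the adjustment reason


@dataclass
class EpisodeSummary:
    outcome_score: float                        # 0.0 to 1.0 from the task grader
    milestones_hit: set = field(default_factory=set)
    milestones_expected: set = field(default_factory=set)
    steps: int = 0
    invalid_actions: int = 0
    accepted_settlement: bool = False
    cited_evidence: bool = False


def composite_reward(ep: EpisodeSummary, alpha: float = 0.6, beta: float = 0.4) -> float:
    """R_total = alpha * R_outcome + beta * R_trajectory - penalties (+ bonus)."""
    if ep.milestones_expected:
        reached = ep.milestones_hit & ep.milestones_expected
        trajectory = len(reached) / len(ep.milestones_expected)
    else:
        trajectory = 0.0

    penalties = STEP_COST * ep.steps + HALLUCINATION_PENALTY * ep.invalid_actions
    if ep.accepted_settlement:
        penalties += GULLIBILITY_PENALTY

    bonus = EVIDENCE_BONUS if ep.cited_evidence else 0.0
    return alpha * ep.outcome_score + beta * trajectory - penalties + bonus


TASK3_MILESTONES = {
    "retrieved_shipping", "retrieved_sla", "checked_warehouse",
    "vendor_negotiation", "calculated_penalty",
}

careful = EpisodeSummary(1.0, set(TASK3_MILESTONES), TASK3_MILESTONES,
                         steps=8, invalid_actions=1, cited_evidence=True)
gullible = EpisodeSummary(1.0, set(TASK3_MILESTONES), TASK3_MILESTONES,
                          steps=8, invalid_actions=1, accepted_settlement=True)
print(round(composite_reward(careful), 3), round(composite_reward(gullible), 3))
```

Under these assumptions, a single accepted settlement costs as much as 40 extra steps, which matches the stated intent that gullibility is the worst failure mode.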

---

## 7. What We Learned

### Technical Lessons

1. **Procedural generation is non-negotiable** for RL environments. Static corpora get memorized quickly. Our engine generates unique scenarios from any seed.

2. **Tool restriction per task** is important. Easy tasks shouldn't expose tools the agent can't meaningfully use, because unusable tools add noise to the reward signal.

3. **Adversarial dynamics create genuine difficulty.** A vendor that lies and offers settlements tests the agent's reasoning in ways static documents never can.

4. **Composite rewards require careful balancing.** If the trajectory reward is too high, agents learn to query everything without ever submitting. If it is too low, they learn to guess without investigating.

### Strategic Lessons

1. **Read the scoring rubric backwards.** Don't start with what you want to build; start with what gets scored highest and work backwards.

2. **Innovation (40%) + Storytelling (30%) = 70%.** A technically perfect but boring environment loses to a messy but ambitious one with a great narrative.

3. **The pivot was worth the risk.** Rewriting 1000+ lines of code in 2 days was aggressive, but staying with invoice extraction would have capped us at "top 10, not first."

4. **Domain choice matters enormously.** Supply chain auditing is a multi-trillion dollar problem that's underexplored in AI training, which gives us both novelty and real-world utility.

---

## Appendix: File History

| Phase | Files Created/Modified | Purpose |
|-------|----------------------|---------|
| Round 1 | `server/documents.py` (15 static docs) | Original invoice corpus |
| Round 1 | `server/graders.py` (fuzzy matching) | Text extraction grading |
| Enhancement | `server/procedural.py` v1 (invoice generator) | Infinite invoice variations |
| Enhancement | `server/environment.py` v1 (6 tools) | Multi-tool invoice extraction |
| **ESCTR Pivot** | `server/models.py` (ESCTRAction/Obs/State) | ERP tool schemas |
| **ESCTR Pivot** | `server/procedural.py` v2 (corporate graphs) | Supply chain scenario generation |
| **ESCTR Pivot** | `server/graders.py` v2 (3 task graders) | Deterministic multi-axis scoring |
| **ESCTR Pivot** | `server/environment.py` v2 (4 tools + vendor AI) | Full ESCTR environment |
| **ESCTR Pivot** | `inference.py` v2 (financial controller) | Baseline agent script |
| **ESCTR Pivot** | Removed `server/documents.py` | No longer needed |
|
inference.py
CHANGED
|
@@ -1,18 +1,16 @@
|
|
| 1 |
#!/usr/bin/env python3
|
| 2 |
"""
|
| 3 |
-
Baseline inference script for the
|
| 4 |
|
| 5 |
-
|
| 6 |
-
to
|
| 7 |
-
|
| 8 |
-
and logs results in the mandatory OpenEnv [START]/[STEP]/[END] format.
|
| 9 |
|
| 10 |
Required environment variables:
|
| 11 |
-
API_BASE_URL
|
| 12 |
-
MODEL_NAME
|
| 13 |
-
HF_TOKEN
|
| 14 |
-
ENV_URL
|
| 15 |
-
LOCAL_IMAGE_NAME — (Optional) Docker image name for from_docker_image() style
|
| 16 |
"""
|
| 17 |
|
| 18 |
import json
|
|
@@ -33,15 +31,9 @@ HF_TOKEN = os.getenv("HF_TOKEN")
|
|
| 33 |
ENV_URL = os.getenv("ENV_URL", "http://localhost:7860")
|
| 34 |
LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME")
|
| 35 |
|
| 36 |
-
TASKS = ["
|
| 37 |
-
BENCHMARK = "
|
| 38 |
|
| 39 |
-
# Tasks that support advanced multi-tool commands
|
| 40 |
-
TOOL_ENABLED_TASKS = {"multi_document", "adversarial_invoice"}
|
| 41 |
-
|
| 42 |
-
# ---------------------------------------------------------------------------
|
| 43 |
-
# LLM Client
|
| 44 |
-
# ---------------------------------------------------------------------------
|
| 45 |
llm = OpenAI(base_url=API_BASE_URL, api_key=HF_TOKEN)
|
| 46 |
|
| 47 |
|
|
@@ -67,140 +59,121 @@ def env_reset(url: str, task_name: str, seed: int = 0) -> dict:
|
|
| 67 |
return r.json()
|
| 68 |
|
| 69 |
|
| 70 |
-
def env_step(url: str,
|
| 71 |
-
r = requests.post(f"{url}/step", json={"action":
|
| 72 |
r.raise_for_status()
|
| 73 |
return r.json()
|
| 74 |
|
| 75 |
|
| 76 |
# ---------------------------------------------------------------------------
|
| 77 |
-
# Logging
|
| 78 |
# ---------------------------------------------------------------------------
|
| 79 |
|
| 80 |
def log_start(task: str, model: str):
|
| 81 |
print(f"[START] task={task} env={BENCHMARK} model={model}", flush=True)
|
| 82 |
|
| 83 |
-
|
| 84 |
def log_step(step: int, action: str, reward: float, done: bool, error=None):
|
| 85 |
-
|
| 86 |
-
print(
|
| 87 |
-
f"[STEP] step={step} action={action} reward={reward:.2f} "
|
| 88 |
-
f"done={str(done).lower()} error={error_val}",
|
| 89 |
-
flush=True,
|
| 90 |
-
)
|
| 91 |
-
|
| 92 |
|
| 93 |
def log_end(success: bool, steps: int, score: float, rewards: list):
|
| 94 |
-
|
| 95 |
-
print(
|
| 96 |
-
f"[END] success={str(success).lower()} steps={steps} "
|
| 97 |
-
f"score={score:.2f} rewards={rewards_str}",
|
| 98 |
-
flush=True,
|
| 99 |
-
)
|
| 100 |
|
| 101 |
|
| 102 |
# ---------------------------------------------------------------------------
|
| 103 |
-
#
|
| 104 |
# ---------------------------------------------------------------------------
|
| 105 |
|
| 106 |
-
|
| 107 |
-
|
| 108 |
-
|
| 109 |
-
{
|
| 110 |
-
|
| 111 |
-
|
| 112 |
-
|
| 113 |
-
|
| 114 |
-
|
| 115 |
-
-
|
| 116 |
-
- For
|
| 117 |
-
- For
|
| 118 |
-
- For
|
| 119 |
-
-
|
| 120 |
-
|
| 121 |
-
|
| 122 |
-
|
| 123 |
-
|
| 124 |
-
|
| 125 |
-
|
| 126 |
-
|
| 127 |
-
|
| 128 |
-
"
|
| 129 |
-
|
| 130 |
-
|
| 131 |
-
|
| 132 |
-
|
| 133 |
-
|
| 134 |
-
|
| 135 |
-
|
| 136 |
-
|
| 137 |
-
|
| 138 |
-
|
| 139 |
-
|
| 140 |
-
|
| 141 |
-
|
| 142 |
-
|
| 143 |
-
|
| 144 |
-
|
| 145 |
-
|
| 146 |
-
|
| 147 |
-
|
| 148 |
-
|
| 149 |
-
|
| 150 |
-
|
| 151 |
-
|
| 152 |
-
|
| 153 |
-
|
| 154 |
-
|
| 155 |
-
|
| 156 |
-
|
| 157 |
-
|
| 158 |
-
|
| 159 |
}
|
| 160 |
|
| 161 |
-
REFINE_PROMPT = """You previously extracted data from an invoice but some fields were incorrect.
|
| 162 |
-
|
| 163 |
-
DOCUMENT:
|
| 164 |
-
{document}
|
| 165 |
-
|
| 166 |
-
YOUR PREVIOUS EXTRACTION:
|
| 167 |
-
{previous}
|
| 168 |
-
|
| 169 |
-
FIELDS NEEDING IMPROVEMENT: {weak_fields}
|
| 170 |
-
|
| 171 |
-
FEEDBACK:
|
| 172 |
-
{feedback}
|
| 173 |
-
|
| 174 |
-
{extra_context}
|
| 175 |
-
|
| 176 |
-
Please re-extract ALL fields and return ONLY a valid JSON object with corrections.
|
| 177 |
-
Pay special attention to the fields listed above.
|
| 178 |
-
|
| 179 |
-
RULES:
|
| 180 |
-
- Return ONLY a valid JSON object, no explanation or markdown
|
| 181 |
-
- For dates, use YYYY-MM-DD format
|
| 182 |
-
- For monetary amounts, use plain numbers without currency symbols
|
| 183 |
-
- For line_items, use an array of objects with keys: description, quantity, unit_price, amount
|
| 184 |
-
- VERIFY: subtotal + tax should equal total
|
| 185 |
-
{task_specific_rules}
|
| 186 |
-
|
| 187 |
-
JSON:"""
|
| 188 |
|
|
|
|
|
|
|
|
|
|
| 189 |
|
| 190 |
-
def call_llm(
|
| 191 |
try:
|
| 192 |
response = llm.chat.completions.create(
|
| 193 |
model=MODEL_NAME,
|
| 194 |
-
messages=
|
| 195 |
temperature=0.1,
|
| 196 |
-
max_tokens=
|
| 197 |
)
|
| 198 |
return response.choices[0].message.content.strip()
|
| 199 |
except Exception as e:
|
| 200 |
-
return json.dumps({"
|
| 201 |
|
| 202 |
|
| 203 |
-
def
|
|
|
|
|
|
|
| 204 |
if "```json" in text:
|
| 205 |
start = text.index("```json") + 7
|
| 206 |
end = text.index("```", start)
|
|
@@ -219,113 +192,69 @@ def extract_json_from_response(text: str) -> str:
|
|
| 219 |
elif text[i] == "}":
|
| 220 |
depth -= 1
|
| 221 |
if depth == 0:
|
| 222 |
-
|
| 223 |
-
|
|
|
|
|
|
|
| 224 |
|
| 225 |
|
| 226 |
# ---------------------------------------------------------------------------
|
| 227 |
-
#
|
| 228 |
# ---------------------------------------------------------------------------
|
| 229 |
|
| 230 |
def run_task(env_url: str, task_name: str, seed: int = 0) -> float:
|
| 231 |
-
"""Run a single task and return the final score."""
|
| 232 |
log_start(task=task_name, model=MODEL_NAME)
|
| 233 |
-
|
| 234 |
rewards = []
|
| 235 |
step_num = 0
|
| 236 |
final_score = 0.0
|
| 237 |
|
| 238 |
try:
|
| 239 |
-
env_reset(env_url, task_name, seed
|
|
|
|
| 240 |
|
| 241 |
-
|
| 242 |
-
|
| 243 |
-
|
| 244 |
-
|
| 245 |
-
|
| 246 |
-
|
| 247 |
-
rewards.append(reward)
|
| 248 |
-
log_step(step_num, "view_document()", reward, done)
|
| 249 |
-
|
| 250 |
-
# Step 2: View required fields
|
| 251 |
-
step_num += 1
|
| 252 |
-
result = env_step(env_url, "view_fields")
|
| 253 |
-
required_fields = result.get("observation", {}).get("required_fields", [])
|
| 254 |
-
reward = result.get("reward", 0.0) or 0.0
|
| 255 |
-
done = result.get("done", False)
|
| 256 |
-
rewards.append(reward)
|
| 257 |
-
log_step(step_num, "view_fields()", reward, done)
|
| 258 |
-
|
| 259 |
-
# Step 2.5: For tool-enabled tasks, gather extra context
|
| 260 |
-
extra_context = ""
|
| 261 |
-
if task_name in TOOL_ENABLED_TASKS:
|
| 262 |
-
step_num += 1
|
| 263 |
-
result = env_step(env_url, "query_related_documents")
|
| 264 |
-
related_text = result.get("observation", {}).get("text", "")
|
| 265 |
-
reward = result.get("reward", 0.0) or 0.0
|
| 266 |
-
rewards.append(reward)
|
| 267 |
-
log_step(step_num, "query_related_documents()", reward, False)
|
| 268 |
-
extra_context += f"\nRELATED DOCUMENTS:\n{related_text}\n"
|
| 269 |
|
|
|
|
| 270 |
step_num += 1
|
| 271 |
-
|
| 272 |
-
|
| 273 |
reward = result.get("reward", 0.0) or 0.0
|
| 274 |
rewards.append(reward)
|
| 275 |
-
log_step(step_num,
|
| 276 |
-
extra_context += f"\nDISCREPANCY HINTS:\n{discrep_text}\n"
|
| 277 |
-
|
| 278 |
-
# Step 3: LLM extraction
|
| 279 |
-
fields_str = "\n".join(f"- {f}" for f in required_fields)
|
| 280 |
-
task_rules = TASK_RULES.get(task_name, "")
|
| 281 |
-
prompt = EXTRACT_PROMPT.format(
|
| 282 |
-
document=document_text + extra_context,
|
| 283 |
-
fields=fields_str,
|
| 284 |
-
task_specific_rules=task_rules,
|
| 285 |
-
)
|
| 286 |
-
llm_response = call_llm(prompt)
|
| 287 |
-
extracted_json = extract_json_from_response(llm_response)
|
| 288 |
|
| 289 |
-
|
| 290 |
-
|
| 291 |
-
|
| 292 |
-
reward = result.get("reward", 0.0) or 0.0
|
| 293 |
-
done = result.get("done", False)
|
| 294 |
-
obs = result.get("observation", {})
|
| 295 |
-
rewards.append(reward)
|
| 296 |
-
log_step(step_num, "submit_extraction()", reward, done)
|
| 297 |
-
final_score = reward
|
| 298 |
-
|
| 299 |
-
# If not done and score < 0.9, refine
|
| 300 |
-
if not done and reward < 0.9:
|
| 301 |
-
step_num += 1
|
| 302 |
-
fb_result = env_step(env_url, "get_feedback")
|
| 303 |
-
feedback_text = fb_result.get("observation", {}).get("text", "")
|
| 304 |
-
fb_reward = fb_result.get("reward", 0.0) or 0.0
|
| 305 |
-
rewards.append(fb_reward)
|
| 306 |
-
log_step(step_num, "get_feedback()", fb_reward, False)
|
| 307 |
-
|
| 308 |
-
field_scores = obs.get("metadata", {}).get("field_scores", {})
|
| 309 |
-
weak_fields = [f for f, s in field_scores.items() if s < 0.8]
|
| 310 |
-
|
| 311 |
-
refine_prompt = REFINE_PROMPT.format(
|
| 312 |
-
document=document_text,
|
| 313 |
-
previous=extracted_json,
|
| 314 |
-
weak_fields=", ".join(weak_fields) if weak_fields else "all fields",
|
| 315 |
-
feedback=feedback_text,
|
| 316 |
-
extra_context=extra_context,
|
| 317 |
-
task_specific_rules=task_rules,
|
| 318 |
-
)
|
| 319 |
-
refined_response = call_llm(refine_prompt)
|
| 320 |
-
refined_json = extract_json_from_response(refined_response)
|
| 321 |
|
| 322 |
-
|
| 323 |
-
|
| 324 |
-
|
| 325 |
-
done = result2.get("done", False)
|
| 326 |
-
rewards.append(reward2)
|
| 327 |
-
log_step(step_num, "submit_refined_extraction()", reward2, done)
|
| 328 |
-
final_score = max(final_score, reward2)
|
| 329 |
|
| 330 |
except Exception as e:
|
| 331 |
step_num += 1
|
|
@@ -338,51 +267,52 @@ def run_task(env_url: str, task_name: str, seed: int = 0) -> float:
|
|
| 338 |
return final_score
|
| 339 |
|
| 340 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 341 |
def main():
|
| 342 |
global ENV_URL
|
| 343 |
container_id = None
|
| 344 |
|
| 345 |
if LOCAL_IMAGE_NAME:
|
| 346 |
-
print(f"Starting Docker container
|
| 347 |
try:
|
| 348 |
container_id = subprocess.check_output(
|
| 349 |
["docker", "run", "-d", "--rm", "-p", "7860:7860", LOCAL_IMAGE_NAME],
|
| 350 |
-
stderr=subprocess.STDOUT
|
| 351 |
).decode().strip()
|
| 352 |
ENV_URL = "http://localhost:7860"
|
| 353 |
-
print(f"Container started: {container_id[:12]}")
|
| 354 |
except Exception as e:
|
| 355 |
-
print(f"
|
| 356 |
sys.exit(1)
|
| 357 |
|
| 358 |
print(f"Waiting for environment at {ENV_URL} ...")
|
| 359 |
if not env_health(ENV_URL):
|
| 360 |
-
print("ERROR: Environment
|
| 361 |
if container_id:
|
| 362 |
subprocess.run(["docker", "stop", container_id], capture_output=True)
|
| 363 |
sys.exit(1)
|
| 364 |
-
print("Environment
|
| 365 |
|
| 366 |
scores = {}
|
| 367 |
-
for
|
| 368 |
-
|
| 369 |
-
scores[task_name] = score
|
| 370 |
print()
|
| 371 |
|
| 372 |
-
|
| 373 |
print("=" * 50)
|
| 374 |
-
print("SUMMARY")
|
| 375 |
print("=" * 50)
|
| 376 |
-
for
|
| 377 |
-
print(f" {
|
| 378 |
-
print(f" Average: {
|
| 379 |
print("=" * 50)
|
| 380 |
|
| 381 |
if container_id:
|
| 382 |
-
print(f"Stopping container {container_id[:12]} ...")
|
| 383 |
subprocess.run(["docker", "stop", container_id], capture_output=True)
|
| 384 |
|
| 385 |
-
return 0 if
|
| 386 |
|
| 387 |
|
| 388 |
if __name__ == "__main__":
|
|
|
|
| 1 |
#!/usr/bin/env python3
|
| 2 |
"""
|
| 3 |
+
Baseline inference script for the ESCTR Environment.
|
| 4 |
|
| 5 |
+
Demonstrates how an LLM agent interacts with the enterprise supply chain
|
| 6 |
+
environment to investigate discrepancies, enforce SLA penalties, and
|
| 7 |
+
navigate adversarial vendor disputes.
|
|
|
|
| 8 |
|
| 9 |
Required environment variables:
|
| 10 |
+
API_BASE_URL — OpenAI-compatible API endpoint
|
| 11 |
+
MODEL_NAME — Model identifier (e.g. meta-llama/Meta-Llama-3-8B-Instruct)
|
| 12 |
+
HF_TOKEN — API key
|
| 13 |
+
ENV_URL — Environment server URL (default: http://localhost:7860)
|
|
|
|
| 14 |
"""
|
| 15 |
|
| 16 |
import json
|
|
|
|
| 31 |
ENV_URL = os.getenv("ENV_URL", "http://localhost:7860")
|
| 32 |
LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME")
|
| 33 |
|
| 34 |
+
TASKS = ["procurement_reconciliation", "sla_enforcement", "adversarial_auditing"]
|
| 35 |
+
BENCHMARK = "esctr"
|
| 36 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 37 |
llm = OpenAI(base_url=API_BASE_URL, api_key=HF_TOKEN)
|
| 38 |
|
| 39 |
|
|
|
|
| 59 |
return r.json()
|
| 60 |
|
| 61 |
|
| 62 |
+
def env_step(url: str, action: dict) -> dict:
|
| 63 |
+
r = requests.post(f"{url}/step", json={"action": action}, timeout=30)
|
| 64 |
r.raise_for_status()
|
| 65 |
return r.json()
|
| 66 |
|
| 67 |
|
| 68 |
# ---------------------------------------------------------------------------
|
| 69 |
+
# Logging (strict OpenEnv format)
|
| 70 |
# ---------------------------------------------------------------------------
|
| 71 |
|
| 72 |
def log_start(task: str, model: str):
|
| 73 |
print(f"[START] task={task} env={BENCHMARK} model={model}", flush=True)
|
| 74 |
|
|
|
|
| 75 |
def log_step(step: int, action: str, reward: float, done: bool, error=None):
|
| 76 |
+
err = error if error else "null"
|
| 77 |
+
print(f"[STEP] step={step} action={action} reward={reward:.2f} done={str(done).lower()} error={err}", flush=True)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 78 |
|
| 79 |
def log_end(success: bool, steps: int, score: float, rewards: list):
|
| 80 |
+
rstr = ",".join(f"{r:.2f}" for r in rewards)
|
| 81 |
+
print(f"[END] success={str(success).lower()} steps={steps} score={score:.2f} rewards={rstr}", flush=True)
|
|
|
|
|
|
|
|
|
|
|
|
|
| 82 |
|
| 83 |
|
| 84 |
# ---------------------------------------------------------------------------
|
| 85 |
+
# System prompts per task
|
| 86 |
# ---------------------------------------------------------------------------
|
| 87 |
|
| 88 |
+
SYSTEM_PROMPT_BASE = """You are an autonomous Financial Controller AI agent operating in an Enterprise Supply Chain environment. You must investigate discrepancies, verify documents, and submit precise financial adjustments.
|
| 89 |
+
|
| 90 |
+
AVAILABLE TOOLS:
|
| 91 |
+
{tools}
|
| 92 |
+
|
| 93 |
+
RESPONSE FORMAT:
|
| 94 |
+
You must respond with a SINGLE valid JSON object — NO explanation, NO markdown.
|
| 95 |
+
The JSON must have these fields:
|
| 96 |
+
- "action_type": one of the available tool names
|
| 97 |
+
- Additional fields depending on the action:
|
| 98 |
+
- For "query_database": include "query_parameters": {{"table": "<table_name>"}}
|
| 99 |
+
- For "read_document": include "document_id": "<id>"
|
| 100 |
+
- For "communicate_vendor": include "message_content": "<your message>"
|
| 101 |
+
- For "submit_financial_decision": include "adjustment_amount": <number> and "adjustment_reason": "<explanation>"
|
| 102 |
+
|
| 103 |
+
CRITICAL RULES:
|
| 104 |
+
- ALWAYS query databases and read documents BEFORE submitting a decision
|
| 105 |
+
- Calculate amounts precisely — use exact arithmetic
|
| 106 |
+
- adjustment_amount should be NEGATIVE to reduce the invoice payment
|
| 107 |
+
- Respond ONLY with JSON, nothing else"""
|
| 108 |
+
|
| 109 |
+
TASK_INSTRUCTIONS = {
|
| 110 |
+
"procurement_reconciliation": """
|
| 111 |
+
TASK: Procurement Reconciliation (Easy)
|
| 112 |
+
A pricing discrepancy exists between a Purchase Order and a Vendor Invoice.
|
| 113 |
+
|
| 114 |
+
STRATEGY:
|
| 115 |
+
1. Query "purchase_orders" to find the PO
|
| 116 |
+
2. Query "invoices" to find the invoice
|
| 117 |
+
3. Read both documents using read_document with their IDs
|
| 118 |
+
4. Compare line-by-line: find the item where invoiced price > contracted price
|
| 119 |
+
5. Calculate the overcharge: (invoiced_total - contracted_total) for that line item
|
| 120 |
+
6. Submit with adjustment_amount = -(overcharge amount)
|
| 121 |
+
|
| 122 |
+
Available tables: purchase_orders, invoices""",
|
| 123 |
+
|
| 124 |
+
"sla_enforcement": """
|
| 125 |
+
TASK: SLA Enforcement (Medium)
|
| 126 |
+
A vendor demands full payment but the shipment was delivered late.
|
| 127 |
+
|
| 128 |
+
STRATEGY:
|
| 129 |
+
1. Query "shipping_logs" to check delivery timing and find delay days
|
| 130 |
+
2. Query "sla_contracts" to find late delivery penalty terms
|
| 131 |
+
3. Read the SLA document for exact penalty rates and caps
|
| 132 |
+
4. Calculate: penalty = invoice_subtotal × min(delay_days × rate_per_day, cap)
|
| 133 |
+
- If there's a grace period, subtract grace days from delay first
|
| 134 |
+
5. Submit with adjustment_amount = -(penalty amount)
|
| 135 |
+
|
| 136 |
+
Available tables: purchase_orders, invoices, shipping_logs, sla_contracts""",
|
| 137 |
+
|
| 138 |
+
"adversarial_auditing": """
|
| 139 |
+
TASK: Adversarial Auditing (Hard)
|
| 140 |
+
A vendor disputes a late delivery claim, blaming your warehouse. You must prove them wrong.
|
| 141 |
+
|
| 142 |
+
STRATEGY:
|
| 143 |
+
1. Query "shipping_logs" to confirm the delivery was late
|
| 144 |
+
2. Query "sla_contracts" for penalty terms
|
| 145 |
+
3. Query "warehouse_logs" to verify your dock was OPEN during delivery
|
| 146 |
+
4. Use "communicate_vendor" to engage — they will make excuses then offer a settlement
|
| 147 |
+
5. REJECT the settlement — enforce the FULL penalty
|
| 148 |
+
6. Cite warehouse access logs as evidence in your final reason
|
| 149 |
+
7. Calculate exact penalty from SLA terms and submit
|
| 150 |
+
|
| 151 |
+
CRITICAL: Do NOT accept any settlement offer! Enforce the full contractual penalty.
|
| 152 |
+
|
| 153 |
+
Available tables: purchase_orders, invoices, shipping_logs, sla_contracts, warehouse_logs""",
|
| 154 |
}
|
| 155 |
|
| 156 |
|
| 157 |
+
# ---------------------------------------------------------------------------
|
| 158 |
+
# LLM helpers
|
| 159 |
+
# ---------------------------------------------------------------------------
|
| 160 |
|
| 161 |
+
def call_llm(messages: list) -> str:
|
| 162 |
try:
|
| 163 |
response = llm.chat.completions.create(
|
| 164 |
model=MODEL_NAME,
|
| 165 |
+
messages=messages,
|
| 166 |
temperature=0.1,
|
| 167 |
+
max_tokens=1000,
|
| 168 |
)
|
| 169 |
return response.choices[0].message.content.strip()
|
| 170 |
except Exception as e:
|
| 171 |
+
return json.dumps({"action_type": "query_database", "query_parameters": {"table": "purchase_orders"}})
|
| 172 |
|
| 173 |
|
| 174 |
+
def parse_action(text: str) -> dict:
|
| 175 |
+
"""Extract a JSON action from LLM response."""
|
| 176 |
+
# Try to find JSON in response
|
| 177 |
if "```json" in text:
|
| 178 |
start = text.index("```json") + 7
|
| 179 |
end = text.index("```", start)
|
|
|
|
| 192 |
elif text[i] == "}":
|
| 193 |
depth -= 1
|
| 194 |
if depth == 0:
|
| 195 |
+
text = text[brace_start:i + 1]
|
| 196 |
+
break
|
| 197 |
+
|
| 198 |
+
return json.loads(text)
|
| 199 |
|
| 200 |
|
| 201 |
# ---------------------------------------------------------------------------
|
| 202 |
+
# Task runner
|
| 203 |
# ---------------------------------------------------------------------------
|
| 204 |
|
| 205 |
def run_task(env_url: str, task_name: str, seed: int = 0) -> float:
|
|
|
|
| 206 |
log_start(task=task_name, model=MODEL_NAME)
|
|
|
|
| 207 |
rewards = []
|
| 208 |
step_num = 0
|
| 209 |
final_score = 0.0
|
| 210 |
|
| 211 |
+
tools = ["query_database", "read_document", "submit_financial_decision"]
|
| 212 |
+
if task_name == "adversarial_auditing":
|
| 213 |
+
tools.insert(2, "communicate_vendor")
|
| 214 |
+
|
| 215 |
+
system_prompt = SYSTEM_PROMPT_BASE.format(tools=", ".join(tools))
|
| 216 |
+
system_prompt += TASK_INSTRUCTIONS.get(task_name, "")
|
| 217 |
+
|
| 218 |
try:
|
| 219 |
+
reset_data = env_reset(env_url, task_name, seed)
|
| 220 |
+
briefing = reset_data.get("observation", {}).get("system_response", "")
|
| 221 |
|
| 222 |
+
messages = [
|
| 223 |
+
{"role": "system", "content": system_prompt},
|
| 224 |
+
{"role": "user", "content": f"ENVIRONMENT BRIEFING:\n{briefing}\n\nBegin your investigation. Respond with a JSON action."},
|
| 225 |
+
]
|
| 226 |
+
|
| 227 |
+
max_steps = {"procurement_reconciliation": 10, "sla_enforcement": 15, "adversarial_auditing": 20}.get(task_name, 15)
|
| 228 |
|
| 229 |
+
for _ in range(max_steps):
|
| 230 |
step_num += 1
|
| 231 |
+
|
| 232 |
+
# Get LLM action
|
| 233 |
+
llm_response = call_llm(messages)
|
| 234 |
+
try:
|
| 235 |
+
action = parse_action(llm_response)
|
| 236 |
+
except (json.JSONDecodeError, ValueError):
|
| 237 |
+
action = {"action_type": "query_database", "query_parameters": {"table": "purchase_orders"}}
|
| 238 |
+
|
| 239 |
+
# Execute action
|
| 240 |
+
action_str = json.dumps(action, separators=(",", ":"))
|
| 241 |
+
result = env_step(env_url, action)
|
| 242 |
reward = result.get("reward", 0.0) or 0.0
|
| 243 |
+
done = result.get("done", False)
|
| 244 |
+
obs = result.get("observation", {})
|
| 245 |
+
response_text = obs.get("system_response", "")
|
| 246 |
+
error = obs.get("error_message")
|
| 247 |
+
|
| 248 |
rewards.append(reward)
|
| 249 |
+
log_step(step_num, action_str, reward, done, error)
|
| 250 |
|
| 251 |
+
if done:
|
| 252 |
+
final_score = reward
|
| 253 |
+
break
|
| 254 |
|
| 255 |
+
# Append to conversation
|
| 256 |
+
messages.append({"role": "assistant", "content": llm_response})
|
| 257 |
+
messages.append({"role": "user", "content": f"ENVIRONMENT RESPONSE:\n{response_text}\n\nContinue your investigation. Respond with your next JSON action."})
|
|
|
|
|
|
|
|
|
|
|
|
|
| 258 |
|
| 259 |
except Exception as e:
|
| 260 |
step_num += 1
|
|
|
|
| 267 |
return final_score
|
| 268 |
|
| 269 |
|
| 270 |
+
# ---------------------------------------------------------------------------
|
| 271 |
+
# Main
|
| 272 |
+
# ---------------------------------------------------------------------------
|
| 273 |
+
|
| 274 |
def main():
|
| 275 |
global ENV_URL
|
| 276 |
container_id = None
|
| 277 |
|
| 278 |
if LOCAL_IMAGE_NAME:
|
| 279 |
+
print(f"Starting Docker container: {LOCAL_IMAGE_NAME}")
|
| 280 |
try:
|
| 281 |
container_id = subprocess.check_output(
|
| 282 |
["docker", "run", "-d", "--rm", "-p", "7860:7860", LOCAL_IMAGE_NAME],
|
| 283 |
+
stderr=subprocess.STDOUT
|
| 284 |
).decode().strip()
|
| 285 |
ENV_URL = "http://localhost:7860"
|
|
|
|
| 286 |
except Exception as e:
|
| 287 |
+
print(f"Docker start failed: {e}")
|
| 288 |
sys.exit(1)
|
| 289 |
|
| 290 |
print(f"Waiting for environment at {ENV_URL} ...")
|
| 291 |
if not env_health(ENV_URL):
|
| 292 |
+
print("ERROR: Environment not healthy")
|
| 293 |
if container_id:
|
| 294 |
subprocess.run(["docker", "stop", container_id], capture_output=True)
|
| 295 |
sys.exit(1)
|
| 296 |
+
print("Environment healthy!\n")
|
| 297 |
|
| 298 |
scores = {}
|
| 299 |
+
for task in TASKS:
|
| 300 |
+
scores[task] = run_task(ENV_URL, task, seed=42)
|
|
|
|
| 301 |
print()
|
| 302 |
|
| 303 |
+
avg = sum(scores.values()) / len(scores) if scores else 0.0
|
| 304 |
print("=" * 50)
|
| 305 |
+
print("ESCTR INFERENCE SUMMARY")
|
| 306 |
print("=" * 50)
|
| 307 |
+
for t, s in scores.items():
|
| 308 |
+
print(f" {t}: {s:.2f}")
|
| 309 |
+
print(f" Average: {avg:.2f}")
|
| 310 |
print("=" * 50)
|
| 311 |
|
| 312 |
if container_id:
|
|
|
|
| 313 |
subprocess.run(["docker", "stop", container_id], capture_output=True)
|
| 314 |
|
| 315 |
+
return 0 if avg > 0 else 1
|
| 316 |
|
| 317 |
|
| 318 |
if __name__ == "__main__":
|
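Note on the inference.py changes above: the new loop reads `done`, `observation.system_response` and `observation.error_message` from each step reply, then averages per-task scores and derives the exit code from that average. The sketch below restates those two pieces as pure functions so they can be checked without a running environment. The helper names `parse_step_result` and `summarize` are hypothetical, and reading `reward` from the top level of the reply is an assumption based on the shape returned by `_obs_to_response` in server/app.py.

```python
from typing import Any, Dict, Optional, Tuple


def parse_step_result(result: Dict[str, Any]) -> Tuple[float, bool, str, Optional[str]]:
    """Extract (reward, done, response_text, error) from one step reply.

    Hypothetical helper: the field names mirror the diff above, but the
    top-level "reward" key is an assumption about the reply shape.
    """
    reward = float(result.get("reward", 0.0) or 0.0)
    done = bool(result.get("done", False))
    obs = result.get("observation", {}) or {}
    response_text = obs.get("system_response", "")
    error = obs.get("error_message")
    return reward, done, response_text, error


def summarize(scores: Dict[str, float]) -> Tuple[float, int]:
    """Return (average score, process exit code) as main() computes them."""
    avg = sum(scores.values()) / len(scores) if scores else 0.0
    # Exit code 0 only when at least some reward was earned on average.
    exit_code = 0 if avg > 0 else 1
    return avg, exit_code
```

With this split, a missing or empty reply degrades to a zero reward and no error rather than raising, which matches the defensive `.get(...)` calls in the diff.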
openenv.yaml
CHANGED
|
@@ -1,23 +1,20 @@
|
|
| 1 |
spec_version: 1
|
| 2 |
-
name:
|
| 3 |
type: space
|
| 4 |
runtime: fastapi
|
| 5 |
app: server.app:app
|
| 6 |
port: 7860
|
| 7 |
|
| 8 |
tasks:
|
| 9 |
-
- name:
|
| 10 |
difficulty: easy
|
| 11 |
-
|
| 12 |
-
|
|
|
|
| 13 |
difficulty: medium
|
| 14 |
-
|
| 15 |
-
|
|
|
|
| 16 |
difficulty: hard
|
| 17 |
-
|
| 18 |
-
|
| 19 |
-
difficulty: very_hard
|
| 20 |
-
description: "OCR-scanned invoices with systematic character errors"
|
| 21 |
-
- name: adversarial_invoice
|
| 22 |
-
difficulty: expert
|
| 23 |
-
description: "Adversarial documents with decoy fields, hidden calculations, and contradictions"
|
|
|
|
| 1 |
spec_version: 1
|
| 2 |
+
name: esctr_environment
|
| 3 |
type: space
|
| 4 |
runtime: fastapi
|
| 5 |
app: server.app:app
|
| 6 |
port: 7860
|
| 7 |
|
| 8 |
tasks:
|
| 9 |
+
- name: procurement_reconciliation
|
| 10 |
difficulty: easy
|
| 11 |
+
max_steps: 10
|
| 12 |
+
description: "Identify overcharged line items between PO and Invoice, calculate exact overcharge"
|
| 13 |
+
- name: sla_enforcement
|
| 14 |
difficulty: medium
|
| 15 |
+
max_steps: 15
|
| 16 |
+
description: "Calculate late delivery penalties from shipping logs and SLA contract terms"
|
| 17 |
+
- name: adversarial_auditing
|
| 18 |
difficulty: hard
|
| 19 |
+
max_steps: 20
|
| 20 |
+
description: "Navigate vendor disputes, verify warehouse logs, reject settlements, enforce full penalties"
|
|
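Note on the openenv.yaml changes above: each of the three new tasks now declares a `max_steps` budget (10, 15 and 20). A client could use those values to stop an episode before the environment cuts it off. The following is a minimal sketch under that assumption; the `TASKS` list simply mirrors the YAML as Python data, and `step_budget` and `steps_remaining` are hypothetical helpers, not part of this commit.

```python
from typing import Dict, List, Union

# Mirrors the task table in openenv.yaml (names, difficulty, max_steps).
TASKS: List[Dict[str, Union[str, int]]] = [
    {"name": "procurement_reconciliation", "difficulty": "easy", "max_steps": 10},
    {"name": "sla_enforcement", "difficulty": "medium", "max_steps": 15},
    {"name": "adversarial_auditing", "difficulty": "hard", "max_steps": 20},
]


def step_budget(task_name: str) -> int:
    """Look up the declared max_steps for a task (hypothetical helper)."""
    for task in TASKS:
        if task["name"] == task_name:
            return int(task["max_steps"])
    raise KeyError(f"unknown task: {task_name}")


def steps_remaining(task_name: str, steps_taken: int) -> int:
    """Steps left before the declared budget is exhausted, never negative."""
    return max(0, step_budget(task_name) - steps_taken)
```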
server/__init__.py
CHANGED
|
@@ -1,5 +1,5 @@
|
|
| 1 |
-
"""
|
| 2 |
|
| 3 |
-
from .environment import
|
| 4 |
|
| 5 |
-
__all__ = ["
|
|
|
|
| 1 |
+
"""Enterprise Supply Chain & Tax Reconciliation Environment — Server package."""
|
| 2 |
|
| 3 |
+
from .environment import ESCTREnvironment
|
| 4 |
|
| 5 |
+
__all__ = ["ESCTREnvironment"]
|
server/app.py
CHANGED
|
@@ -1,8 +1,8 @@
|
|
| 1 |
"""
|
| 2 |
-
FastAPI application for the
|
| 3 |
|
| 4 |
-
Exposes the
|
| 5 |
-
compatible with the OpenEnv
|
| 6 |
"""
|
| 7 |
|
| 8 |
import json
|
|
@@ -13,20 +13,20 @@ from fastapi import FastAPI, WebSocket, WebSocketDisconnect
|
|
| 13 |
from fastapi.responses import JSONResponse
|
| 14 |
from pydantic import BaseModel
|
| 15 |
|
| 16 |
-
from .models import
|
| 17 |
-
from .environment import
|
| 18 |
|
| 19 |
logger = logging.getLogger(__name__)
|
| 20 |
|
| 21 |
|
| 22 |
# ---------------------------------------------------------------------------
|
| 23 |
-
# Request / Response models
|
| 24 |
# ---------------------------------------------------------------------------
|
| 25 |
|
| 26 |
class ResetRequest(BaseModel):
|
| 27 |
seed: Optional[int] = None
|
| 28 |
episode_id: Optional[str] = None
|
| 29 |
-
task_name: str = "
|
| 30 |
|
| 31 |
class Config:
|
| 32 |
extra = "allow"
|
|
@@ -48,14 +48,13 @@ class HealthResponse(BaseModel):
|
|
| 48 |
# Helpers
|
| 49 |
# ---------------------------------------------------------------------------
|
| 50 |
|
| 51 |
-
def _obs_to_response(obs:
|
| 52 |
-
"""Convert an InvoiceObservation to a step/reset response dict."""
|
| 53 |
obs_dict = obs.model_dump()
|
| 54 |
-
reward = obs_dict.pop("reward",
|
| 55 |
done = obs_dict.pop("done", False)
|
| 56 |
return {
|
| 57 |
"observation": obs_dict,
|
| 58 |
-
"reward": reward
|
| 59 |
"done": done,
|
| 60 |
}
|
| 61 |
|
|
@@ -64,19 +63,19 @@ def _obs_to_response(obs: InvoiceObservation) -> dict:
|
|
| 64 |
# Application factory
|
| 65 |
# ---------------------------------------------------------------------------
|
| 66 |
|
| 67 |
-
def
|
| 68 |
-
"""Create and configure the FastAPI application."""
|
| 69 |
-
|
| 70 |
app = FastAPI(
|
| 71 |
-
title="
|
| 72 |
-
description=
|
| 73 |
-
|
| 74 |
)
|
| 75 |
|
| 76 |
-
|
| 77 |
-
_env = InvoiceExtractionEnvironment()
|
| 78 |
|
| 79 |
-
# === Health check ===
|
| 80 |
@app.get("/health")
|
| 81 |
def health():
|
| 82 |
return HealthResponse()
|
|
@@ -84,24 +83,22 @@ def create_invoice_app() -> FastAPI:
|
|
| 84 |
@app.get("/")
|
| 85 |
def root():
|
| 86 |
return {
|
| 87 |
-
"name": "
|
| 88 |
-
"version": "
|
| 89 |
"status": "running",
|
| 90 |
-
"endpoints": ["/health", "/reset", "/step", "/state", "/schema", "/ws"],
|
| 91 |
}
|
| 92 |
|
| 93 |
-
# === Reset ===
|
| 94 |
@app.post("/reset")
|
| 95 |
def reset(request: ResetRequest = ResetRequest()):
|
| 96 |
kwargs = request.model_dump(exclude_unset=True)
|
| 97 |
obs = _env.reset(**kwargs)
|
| 98 |
return _obs_to_response(obs)
|
| 99 |
|
| 100 |
-
# === Step ===
|
| 101 |
@app.post("/step")
|
| 102 |
def step(request: StepRequest):
|
| 103 |
try:
|
| 104 |
-
action =
|
| 105 |
except Exception as e:
|
| 106 |
return JSONResponse(
|
| 107 |
status_code=422,
|
|
@@ -110,58 +107,52 @@ def create_invoice_app() -> FastAPI:
|
|
| 110 |
obs = _env.step(action, timeout_s=request.timeout_s)
|
| 111 |
return _obs_to_response(obs)
|
| 112 |
|
| 113 |
-
# === State ===
|
| 114 |
@app.get("/state")
|
| 115 |
def get_state():
|
| 116 |
return _env.state.model_dump()
|
| 117 |
|
| 118 |
-
# === Schema ===
|
| 119 |
@app.get("/schema")
|
| 120 |
def get_schema():
|
| 121 |
return {
|
| 122 |
-
"action":
|
| 123 |
-
"observation":
|
| 124 |
-
"state":
|
| 125 |
}
|
| 126 |
|
| 127 |
-
# === Metadata ===
|
| 128 |
@app.get("/metadata")
|
| 129 |
def get_metadata():
|
| 130 |
return {
|
| 131 |
-
"name": "
|
| 132 |
"description": (
|
| 133 |
-
"
|
| 134 |
-
"
|
| 135 |
-
"
|
| 136 |
-
"
|
| 137 |
-
"
|
| 138 |
-
"composite reward architecture with trajectory milestones, and "
|
| 139 |
-
"multi-tool agentic workflow for complex tasks."
|
| 140 |
),
|
| 141 |
-
"version": "0.
|
| 142 |
-
"
|
| 143 |
-
"
|
| 144 |
-
"
|
| 145 |
-
"
|
| 146 |
-
"weighted_field_scoring",
|
| 147 |
-
"cross_field_verification",
|
| 148 |
],
|
| 149 |
"tasks": [
|
| 150 |
-
{"name": "
|
| 151 |
-
|
| 152 |
-
{"name": "
|
| 153 |
-
"
|
| 154 |
-
{"name": "
|
| 155 |
-
|
| 156 |
-
|
|
|
|
|
|
|
| 157 |
],
|
| 158 |
}
|
| 159 |
|
| 160 |
-
# === WebSocket (for persistent sessions) ===
|
| 161 |
@app.websocket("/ws")
|
| 162 |
async def websocket_endpoint(websocket: WebSocket):
|
| 163 |
await websocket.accept()
|
| 164 |
-
ws_env =
|
| 165 |
logger.info("WebSocket session opened")
|
| 166 |
|
| 167 |
try:
|
|
@@ -181,19 +172,13 @@ def create_invoice_app() -> FastAPI:
|
|
| 181 |
|
| 182 |
if msg_type == "reset":
|
| 183 |
obs = ws_env.reset(**msg_data)
|
| 184 |
-
await websocket.send_json({
|
| 185 |
-
"type": "observation",
|
| 186 |
-
"data": _obs_to_response(obs),
|
| 187 |
-
})
|
| 188 |
|
| 189 |
elif msg_type == "step":
|
| 190 |
try:
|
| 191 |
-
action =
|
| 192 |
obs = ws_env.step(action)
|
| 193 |
-
await websocket.send_json({
|
| 194 |
-
"type": "observation",
|
| 195 |
-
"data": _obs_to_response(obs),
|
| 196 |
-
})
|
| 197 |
except Exception as e:
|
| 198 |
await websocket.send_json({
|
| 199 |
"type": "error",
|
|
@@ -201,10 +186,7 @@ def create_invoice_app() -> FastAPI:
|
|
| 201 |
})
|
| 202 |
|
| 203 |
elif msg_type == "state":
|
| 204 |
-
await websocket.send_json({
|
| 205 |
-
"type": "state",
|
| 206 |
-
"data": ws_env.state.model_dump(),
|
| 207 |
-
})
|
| 208 |
|
| 209 |
elif msg_type == "close":
|
| 210 |
break
|
|
@@ -212,10 +194,7 @@ def create_invoice_app() -> FastAPI:
|
|
| 212 |
else:
|
| 213 |
await websocket.send_json({
|
| 214 |
"type": "error",
|
| 215 |
-
"data": {
|
| 216 |
-
"message": f"Unknown message type: {msg_type}",
|
| 217 |
-
"code": "UNKNOWN_TYPE",
|
| 218 |
-
},
|
| 219 |
})
|
| 220 |
|
| 221 |
except WebSocketDisconnect:
|
|
@@ -229,12 +208,10 @@ def create_invoice_app() -> FastAPI:
|
|
| 229 |
return app
|
| 230 |
|
| 231 |
|
| 232 |
-
|
| 233 |
-
app = create_invoice_app()
|
| 234 |
|
| 235 |
|
| 236 |
def main():
|
| 237 |
-
"""Entry point for `uv run server` / `[project.scripts]`."""
|
| 238 |
import uvicorn
|
| 239 |
uvicorn.run("server.app:app", host="0.0.0.0", port=7860)
|
| 240 |
|
|
|
|
| 1 |
"""
|
| 2 |
+
FastAPI application for the ESCTR Environment.
|
| 3 |
|
| 4 |
+
Exposes the Enterprise Supply Chain & Tax Reconciliation environment
|
| 5 |
+
over HTTP and WebSocket endpoints compatible with the OpenEnv spec.
|
| 6 |
"""
|
| 7 |
|
| 8 |
import json
|
|
|
|
| 13 |
from fastapi.responses import JSONResponse
|
| 14 |
from pydantic import BaseModel
|
| 15 |
|
| 16 |
+
from .models import ESCTRAction, ESCTRObservation, ESCTRState
|
| 17 |
+
from .environment import ESCTREnvironment
|
| 18 |
|
| 19 |
logger = logging.getLogger(__name__)
|
| 20 |
|
| 21 |
|
| 22 |
# ---------------------------------------------------------------------------
|
| 23 |
+
# Request / Response models
|
| 24 |
# ---------------------------------------------------------------------------
|
| 25 |
|
| 26 |
class ResetRequest(BaseModel):
|
| 27 |
seed: Optional[int] = None
|
| 28 |
episode_id: Optional[str] = None
|
| 29 |
+
task_name: str = "procurement_reconciliation"
|
| 30 |
|
| 31 |
class Config:
|
| 32 |
extra = "allow"
|
|
|
|
| 48 |
# Helpers
|
| 49 |
# ---------------------------------------------------------------------------
|
| 50 |
|
| 51 |
+
def _obs_to_response(obs: ESCTRObservation) -> dict:
|
|
|
|
| 52 |
obs_dict = obs.model_dump()
|
| 53 |
+
reward = obs_dict.pop("reward", 0.0)
|
| 54 |
done = obs_dict.pop("done", False)
|
| 55 |
return {
|
| 56 |
"observation": obs_dict,
|
| 57 |
+
"reward": reward,
|
| 58 |
"done": done,
|
| 59 |
}
|
| 60 |
|
|
|
|
| 63 |
# Application factory
|
| 64 |
# ---------------------------------------------------------------------------
|
| 65 |
|
| 66 |
+
def create_app() -> FastAPI:
|
|
|
|
|
|
|
| 67 |
app = FastAPI(
|
| 68 |
+
title="ESCTR Environment",
|
| 69 |
+
description=(
|
| 70 |
+
"Enterprise Supply Chain & Tax Reconciliation — an OpenEnv environment "
|
| 71 |
+
"for training LLMs to investigate discrepancies, enforce SLA penalties, "
|
| 72 |
+
"and navigate adversarial vendor disputes."
|
| 73 |
+
),
|
| 74 |
+
version="1.0.0",
|
| 75 |
)
|
| 76 |
|
| 77 |
+
_env = ESCTREnvironment()
|
|
|
|
| 78 |
|
|
|
|
| 79 |
@app.get("/health")
|
| 80 |
def health():
|
| 81 |
return HealthResponse()
|
|
|
|
| 83 |
@app.get("/")
|
| 84 |
def root():
|
| 85 |
return {
|
| 86 |
+
"name": "esctr_environment",
|
| 87 |
+
"version": "1.0.0",
|
| 88 |
"status": "running",
|
| 89 |
+
"endpoints": ["/health", "/reset", "/step", "/state", "/schema", "/metadata", "/ws"],
|
| 90 |
}
|
| 91 |
|
|
|
|
| 92 |
@app.post("/reset")
|
| 93 |
def reset(request: ResetRequest = ResetRequest()):
|
| 94 |
kwargs = request.model_dump(exclude_unset=True)
|
| 95 |
obs = _env.reset(**kwargs)
|
| 96 |
return _obs_to_response(obs)
|
| 97 |
|
|
|
|
| 98 |
@app.post("/step")
|
| 99 |
def step(request: StepRequest):
|
| 100 |
try:
|
| 101 |
+
action = ESCTRAction(**request.action)
|
| 102 |
except Exception as e:
|
| 103 |
return JSONResponse(
|
| 104 |
status_code=422,
|
|
|
|
| 107 |
obs = _env.step(action, timeout_s=request.timeout_s)
|
| 108 |
return _obs_to_response(obs)
|
| 109 |
|
|
|
|
| 110 |
@app.get("/state")
|
| 111 |
def get_state():
|
| 112 |
return _env.state.model_dump()
|
| 113 |
|
|
|
|
| 114 |
@app.get("/schema")
|
| 115 |
def get_schema():
|
| 116 |
return {
|
| 117 |
+
"action": ESCTRAction.model_json_schema(),
|
| 118 |
+
"observation": ESCTRObservation.model_json_schema(),
|
| 119 |
+
"state": ESCTRState.model_json_schema(),
|
| 120 |
}
|
| 121 |
|
|
|
|
| 122 |
@app.get("/metadata")
|
| 123 |
def get_metadata():
|
| 124 |
return {
|
| 125 |
+
"name": "esctr_environment",
|
| 126 |
"description": (
|
| 127 |
+
"Enterprise Supply Chain & Tax Reconciliation: an environment where "
|
| 128 |
+
"an LLM agent operates as an autonomous financial controller, investigating "
|
| 129 |
+
"procurement discrepancies, enforcing SLA penalties from shipping delays, "
|
| 130 |
+
"and navigating adversarial vendor disputes. Features procedural generation "
|
| 131 |
+
"for infinite scenarios, RLVR composite rewards, and multi-tool agentic workflow."
|
|
|
|
|
|
|
| 132 |
),
|
| 133 |
+
"version": "1.0.0",
|
| 134 |
+
"themes": [
|
| 135 |
+
"World Modeling — Professional Tasks",
|
| 136 |
+
"Long-Horizon Planning & Instruction Following",
|
| 137 |
+
"Multi-Agent Interactions (adversarial vendor)",
|
|
|
|
|
|
|
| 138 |
],
|
| 139 |
"tasks": [
|
| 140 |
+
{"name": "procurement_reconciliation", "difficulty": "easy", "max_steps": 10,
|
| 141 |
+
"description": "Identify overcharged line items between PO and Invoice"},
|
| 142 |
+
{"name": "sla_enforcement", "difficulty": "medium", "max_steps": 15,
|
| 143 |
+
"description": "Calculate late delivery penalties from shipping logs and SLA contracts"},
|
| 144 |
+
{"name": "adversarial_auditing", "difficulty": "hard", "max_steps": 20,
|
| 145 |
+
"description": "Navigate vendor disputes, verify warehouse logs, reject settlement offers"},
|
| 146 |
+
],
|
| 147 |
+
"tools": [
|
| 148 |
+
"query_database", "read_document", "communicate_vendor", "submit_financial_decision",
|
| 149 |
],
|
| 150 |
}
|
| 151 |
|
|
|
|
| 152 |
@app.websocket("/ws")
|
| 153 |
async def websocket_endpoint(websocket: WebSocket):
|
| 154 |
await websocket.accept()
|
| 155 |
+
ws_env = ESCTREnvironment()
|
| 156 |
logger.info("WebSocket session opened")
|
| 157 |
|
| 158 |
try:
|
|
|
|
| 172 |
|
| 173 |
if msg_type == "reset":
|
| 174 |
obs = ws_env.reset(**msg_data)
|
| 175 |
+
await websocket.send_json({"type": "observation", "data": _obs_to_response(obs)})
|
| 176 |
|
| 177 |
elif msg_type == "step":
|
| 178 |
try:
|
| 179 |
+
action = ESCTRAction(**msg_data)
|
| 180 |
obs = ws_env.step(action)
|
| 181 |
+
await websocket.send_json({"type": "observation", "data": _obs_to_response(obs)})
|
| 182 |
except Exception as e:
|
| 183 |
await websocket.send_json({
|
| 184 |
"type": "error",
|
|
|
|
| 186 |
})
|
| 187 |
|
| 188 |
elif msg_type == "state":
|
| 189 |
+
await websocket.send_json({"type": "state", "data": ws_env.state.model_dump()})
|
| 190 |
|
| 191 |
elif msg_type == "close":
|
| 192 |
break
|
|
|
|
| 194 |
else:
|
| 195 |
await websocket.send_json({
|
| 196 |
"type": "error",
|
| 197 |
+
"data": {"message": f"Unknown message type: {msg_type}", "code": "UNKNOWN_TYPE"},
|
| 198 |
})
|
| 199 |
|
| 200 |
except WebSocketDisconnect:
|
|
|
|
| 208 |
return app
|
| 209 |
|
| 210 |
|
| 211 |
+
app = create_app()
|
|
|
|
| 212 |
|
| 213 |
|
| 214 |
def main():
|
|
|
|
| 215 |
import uvicorn
|
| 216 |
uvicorn.run("server.app:app", host="0.0.0.0", port=7860)
|
| 217 |
|
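Note on the server/app.py changes above: `_obs_to_response` lifts `reward` and `done` out of the dumped observation and returns them beside it, and the WebSocket handler wraps that result as `{"type": "observation", "data": ...}`. The sketch below reproduces that envelope on a plain dict, standing in for `obs.model_dump()`, so the shape can be inspected without FastAPI or Pydantic. The function names and the sample field values are illustrative assumptions.

```python
from typing import Any, Dict


def obs_to_response(obs_dict: Dict[str, Any]) -> Dict[str, Any]:
    """Split reward/done from the observation, as _obs_to_response does.

    Works on a copy so the caller's dict is left untouched.
    """
    body = dict(obs_dict)
    reward = body.pop("reward", 0.0)
    done = body.pop("done", False)
    return {"observation": body, "reward": reward, "done": done}


def ws_observation_message(obs_dict: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap the response in the WebSocket envelope used by the /ws handler."""
    return {"type": "observation", "data": obs_to_response(obs_dict)}


if __name__ == "__main__":
    sample = {"system_response": "PO loaded", "error_message": None,
              "reward": 0.1, "done": False}
    print(ws_observation_message(sample))
```

The HTTP `/reset` and `/step` routes return the inner dict directly, while `/ws` adds the `type` and `data` wrapper, so a client that supports both transports needs to unwrap one level for WebSocket replies.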
server/documents.py
DELETED
|
@@ -1,898 +0,0 @@
|
|
| 1 |
-
"""
|
| 2 |
-
Document corpus for the Invoice Extraction Environment.
|
| 3 |
-
|
| 4 |
-
Contains synthetic but realistic invoice/receipt documents across 3 difficulty levels.
|
| 5 |
-
Each document has raw text and ground truth extraction targets.
|
| 6 |
-
"""
|
| 7 |
-
|
| 8 |
-
DOCUMENTS = {
|
| 9 |
-
# =========================================================================
|
| 10 |
-
# SIMPLE INVOICES — Clean formatting, clear labels, consistent structure
|
| 11 |
-
# =========================================================================
|
| 12 |
-
"simple_invoice": [
|
| 13 |
-
{
|
| 14 |
-
"id": "simple_001",
|
| 15 |
-
"text": """INVOICE
|
| 16 |
-
|
| 17 |
-
Invoice Number: INV-2024-001
|
| 18 |
-
Date: January 15, 2024
|
| 19 |
-
|
| 20 |
-
From:
|
| 21 |
-
Acme Corporation
|
| 22 |
-
123 Business Avenue
|
| 23 |
-
New York, NY 10001
|
| 24 |
-
|
| 25 |
-
Bill To:
|
| 26 |
-
Widget Co.
|
| 27 |
-
456 Commerce Street
|
| 28 |
-
Chicago, IL 60601
|
| 29 |
-
|
| 30 |
-
Description Qty Unit Price Amount
|
| 31 |
-
---------------------------------------------------------
|
| 32 |
-
Widget Type A 10 $25.00 $250.00
|
| 33 |
-
Widget Type B 5 $40.00 $200.00
|
| 34 |
-
Consulting Service 8 $75.00 $600.00
|
| 35 |
-
|
| 36 |
-
Subtotal: $1,050.00
|
| 37 |
-
Tax (8%): $84.00
|
| 38 |
-
Total: $1,134.00
|
| 39 |
-
|
| 40 |
-
Payment Terms: Net 30
|
| 41 |
-
""",
|
| 42 |
-
"ground_truth": {
|
| 43 |
-
"invoice_number": "INV-2024-001",
|
| 44 |
-
"date": "2024-01-15",
|
| 45 |
-
"vendor_name": "Acme Corporation",
|
| 46 |
-
"customer_name": "Widget Co.",
|
| 47 |
-
"subtotal": 1050.00,
|
| 48 |
-
"tax": 84.00,
|
| 49 |
-
"total": 1134.00,
|
| 50 |
-
"line_items": [
|
| 51 |
-
{"description": "Widget Type A", "quantity": 10, "unit_price": 25.00, "amount": 250.00},
|
| 52 |
-
{"description": "Widget Type B", "quantity": 5, "unit_price": 40.00, "amount": 200.00},
|
| 53 |
-
{"description": "Consulting Service", "quantity": 8, "unit_price": 75.00, "amount": 600.00},
|
| 54 |
-
],
|
| 55 |
-
},
|
| 56 |
-
},
|
| 57 |
-
{
|
| 58 |
-
"id": "simple_002",
|
| 59 |
-
"text": """INVOICE
|
| 60 |
-
|
| 61 |
-
Invoice #: TS-5892
|
| 62 |
-
Invoice Date: March 3, 2024
|
| 63 |
-
|
| 64 |
-
Vendor:
|
| 65 |
-
TechStart Solutions LLC
|
| 66 |
-
890 Innovation Drive, Suite 200
|
| 67 |
-
San Francisco, CA 94105
|
| 68 |
-
|
| 69 |
-
Customer:
|
| 70 |
-
DataFlow Inc.
|
| 71 |
-
321 Analytics Blvd
|
| 72 |
-
Austin, TX 78701
|
| 73 |
-
|
| 74 |
-
Item Qty Unit Price Total
|
| 75 |
-
----------------------------------------------------------
|
| 76 |
-
Cloud Hosting (Monthly) 1 $450.00 $450.00
|
| 77 |
-
API Integration Setup 1 $1,200.00 $1,200.00
|
| 78 |
-
Technical Support (hours) 12 $95.00 $1,140.00
|
| 79 |
-
|
| 80 |
-
Subtotal: $2,790.00
|
| 81 |
-
Tax (7%): $195.30
|
| 82 |
-
Total: $2,985.30
|
| 83 |
-
|
| 84 |
-
Due Date: April 2, 2024
|
| 85 |
-
""",
|
| 86 |
-
"ground_truth": {
|
| 87 |
-
"invoice_number": "TS-5892",
|
| 88 |
-
"date": "2024-03-03",
|
| 89 |
-
"vendor_name": "TechStart Solutions LLC",
|
| 90 |
-
"customer_name": "DataFlow Inc.",
|
| 91 |
-
"subtotal": 2790.00,
|
| 92 |
-
"tax": 195.30,
|
| 93 |
-
"total": 2985.30,
|
| 94 |
-
"line_items": [
|
| 95 |
-
{"description": "Cloud Hosting (Monthly)", "quantity": 1, "unit_price": 450.00, "amount": 450.00},
|
| 96 |
-
{"description": "API Integration Setup", "quantity": 1, "unit_price": 1200.00, "amount": 1200.00},
|
| 97 |
-
{"description": "Technical Support (hours)", "quantity": 12, "unit_price": 95.00, "amount": 1140.00},
|
| 98 |
-
],
|
| 99 |
-
},
|
| 100 |
-
},
|
| 101 |
-
{
|
| 102 |
-
"id": "simple_003",
|
| 103 |
-
"text": """INVOICE
|
| 104 |
-
|
| 105 |
-
Invoice Number: GS-2024-0147
|
| 106 |
-
Date: February 20, 2024
|
| 107 |
-
|
| 108 |
-
From:
|
| 109 |
-
Global Supplies Inc.
|
| 110 |
-
2500 Industrial Parkway
|
| 111 |
-
Detroit, MI 48201
|
| 112 |
-
|
| 113 |
-
To:
|
| 114 |
-
Riverside Manufacturing
|
| 115 |
-
780 Factory Road
|
| 116 |
-
Cleveland, OH 44101
|
| 117 |
-
|
| 118 |
-
Product Qty Price Each Line Total
|
| 119 |
-
-----------------------------------------------------------
|
| 120 |
-
Steel Bolts (Box/100) 50 $12.50 $625.00
|
| 121 |
-
Copper Wire (500ft Roll) 8 $85.00 $680.00
|
| 122 |
-
Safety Goggles (Pack/10) 20 $35.00 $700.00
|
| 123 |
-
Welding Rods (Bundle) 15 $22.00 $330.00
|
| 124 |
-
|
| 125 |
-
Subtotal: $2,335.00
|
| 126 |
-
Sales Tax: $163.45
|
| 127 |
-
Invoice Total: $2,498.45
|
| 128 |
-
|
| 129 |
-
Terms: Net 45
|
| 130 |
-
""",
|
| 131 |
-
"ground_truth": {
|
| 132 |
-
"invoice_number": "GS-2024-0147",
|
| 133 |
-
"date": "2024-02-20",
|
| 134 |
-
"vendor_name": "Global Supplies Inc.",
|
| 135 |
-
"customer_name": "Riverside Manufacturing",
|
| 136 |
-
"subtotal": 2335.00,
|
| 137 |
-
"tax": 163.45,
|
| 138 |
-
"total": 2498.45,
|
| 139 |
-
"line_items": [
|
| 140 |
-
{"description": "Steel Bolts (Box/100)", "quantity": 50, "unit_price": 12.50, "amount": 625.00},
|
| 141 |
-
{"description": "Copper Wire (500ft Roll)", "quantity": 8, "unit_price": 85.00, "amount": 680.00},
|
| 142 |
-
{"description": "Safety Goggles (Pack/10)", "quantity": 20, "unit_price": 35.00, "amount": 700.00},
|
| 143 |
-
{"description": "Welding Rods (Bundle)", "quantity": 15, "unit_price": 22.00, "amount": 330.00},
|
| 144 |
-
],
|
| 145 |
-
},
|
| 146 |
-
},
|
| 147 |
-
],
|
| 148 |
-
|
| 149 |
-
# =========================================================================
|
| 150 |
-
# MESSY INVOICES — Inconsistent formatting, abbreviations, typos
|
| 151 |
-
# =========================================================================
|
| 152 |
-
"messy_invoice": [
|
| 153 |
-
{
|
| 154 |
-
"id": "messy_001",
|
| 155 |
-
"text": """ACME Corp
|
| 156 |
-
123 Biz Ave., NY 10001
|
| 157 |
-
Ph: (212) 555-0100
|
| 158 |
-
|
| 159 |
-
inv# ACM-987
|
| 160 |
-
dt: Jan 15 '24
|
| 161 |
-
|
| 162 |
-
BILL TO:
|
| 163 |
-
widgetco / 456 commerce, chicago il
|
| 164 |
-
|
| 165 |
-
---items---
|
| 166 |
-
10x WidgetA @ 25 250
|
| 167 |
-
5x WidgetB @ 40 200
|
| 168 |
-
8hrs consulting @75/hr 600
|
| 169 |
-
------
|
| 170 |
-
subtot 1050
|
| 171 |
-
tx 8%: 84
|
| 172 |
-
TOTAL DUE: $1,134
|
| 173 |
-
|
| 174 |
-
pay within 30 days
|
| 175 |
-
""",
|
| 176 |
-
"ground_truth": {
|
| 177 |
-
"invoice_number": "ACM-987",
|
| 178 |
-
"date": "2024-01-15",
|
| 179 |
-
"vendor_name": "ACME Corp",
|
| 180 |
-
"customer_name": "widgetco",
|
| 181 |
-
"subtotal": 1050.00,
|
| 182 |
-
"tax": 84.00,
|
| 183 |
-
"total": 1134.00,
|
| 184 |
-
"line_items": [
|
| 185 |
-
{"description": "WidgetA", "quantity": 10, "unit_price": 25.00, "amount": 250.00},
|
| 186 |
-
{"description": "WidgetB", "quantity": 5, "unit_price": 40.00, "amount": 200.00},
|
| 187 |
-
{"description": "consulting", "quantity": 8, "unit_price": 75.00, "amount": 600.00},
|
| 188 |
-
],
|
| 189 |
-
},
|
| 190 |
-
},
|
| 191 |
-
{
|
| 192 |
-
"id": "messy_002",
|
| 193 |
-
"text": """techstart solutions
|
| 194 |
-
san fran, CA
|
| 195 |
-
|
| 196 |
-
INVOICE ts5892-b
|
| 197 |
-
date 03/03/2024
|
| 198 |
-
|
| 199 |
-
cust: DataFlow
|
| 200 |
-
austin TX
|
| 201 |
-
|
| 202 |
-
-- charges --
|
| 203 |
-
hosting (cloud, monthly plan)...$450
|
| 204 |
-
api integration - setup fee...$1200
|
| 205 |
-
tech support x12h @$95 = $1,140.00
|
| 206 |
-
|
| 207 |
-
sub: $2790
|
| 208 |
-
tax 7pct = 195.30
|
| 209 |
-
========
|
| 210 |
-
amt due $2,985.30
|
| 211 |
-
|
| 212 |
-
please remit by 04/02/2024
|
| 213 |
-
""",
|
| 214 |
-
"ground_truth": {
|
| 215 |
-
"invoice_number": "ts5892-b",
|
| 216 |
-
"date": "2024-03-03",
|
| 217 |
-
"vendor_name": "techstart solutions",
|
| 218 |
-
"customer_name": "DataFlow",
|
| 219 |
-
"subtotal": 2790.00,
|
| 220 |
-
"tax": 195.30,
|
| 221 |
-
"total": 2985.30,
|
| 222 |
-
"line_items": [
|
| 223 |
-
{"description": "hosting (cloud, monthly plan)", "quantity": 1, "unit_price": 450.00, "amount": 450.00},
|
| 224 |
-
{"description": "api integration - setup fee", "quantity": 1, "unit_price": 1200.00, "amount": 1200.00},
|
| 225 |
-
{"description": "tech support", "quantity": 12, "unit_price": 95.00, "amount": 1140.00},
|
| 226 |
-
],
|
| 227 |
-
},
|
| 228 |
-
},
|
| 229 |
-
{
|
| 230 |
-
"id": "messy_003",
|
| 231 |
-
"text": """GLOBAL SUPPLY
|
| 232 |
-
2500 industrial pkwy detroit MI
|
| 233 |
-
|
| 234 |
-
inv GS-0147rev
|
| 235 |
-
20-Feb-2024
|
| 236 |
-
|
| 237 |
-
Riverside Mfg / cleveland OH
|
| 238 |
-
|
| 239 |
-
stl bolts 100ct boxes -- 50 @ 12.50 ea ........... 625
|
| 240 |
-
cu wire 500' rolls -- 8 @ 85 .................... 680
|
| 241 |
-
safety goggles 10pk -- 20 @ 35 .................. 700
|
| 242 |
-
weld rods bundle -- 15 @ 22 ea .................. 330
|
| 243 |
-
|
| 244 |
-
s/t 2335.00
|
| 245 |
-
tax 163.45
|
| 246 |
-
-----
|
| 247 |
-
GRAND TOTAL $2498.45
|
| 248 |
-
|
| 249 |
-
net45
|
| 250 |
-
""",
|
| 251 |
-
"ground_truth": {
|
| 252 |
-
"invoice_number": "GS-0147rev",
|
| 253 |
-
"date": "2024-02-20",
|
| 254 |
-
"vendor_name": "GLOBAL SUPPLY",
|
| 255 |
-
"customer_name": "Riverside Mfg",
|
| 256 |
-
"subtotal": 2335.00,
|
| 257 |
-
"tax": 163.45,
|
| 258 |
-
"total": 2498.45,
|
| 259 |
-
"line_items": [
|
| 260 |
-
{"description": "stl bolts 100ct boxes", "quantity": 50, "unit_price": 12.50, "amount": 625.00},
|
| 261 |
-
{"description": "cu wire 500' rolls", "quantity": 8, "unit_price": 85.00, "amount": 680.00},
|
| 262 |
-
{"description": "safety goggles 10pk", "quantity": 20, "unit_price": 35.00, "amount": 700.00},
|
| 263 |
-
{"description": "weld rods bundle", "quantity": 15, "unit_price": 22.00, "amount": 330.00},
|
| 264 |
-
],
|
| 265 |
-
},
|
| 266 |
-
},
|
| 267 |
-
],
|
| 268 |
-
|
| 269 |
-
# =========================================================================
|
| 270 |
-
# MULTI-DOCUMENT — Multiple sections, cross-references, adjustments
|
| 271 |
-
# =========================================================================
|
| 272 |
-
"multi_document": [
|
| 273 |
-
{
|
| 274 |
-
"id": "multi_001",
|
| 275 |
-
"text": """=== PURCHASE ORDER ===
|
| 276 |
-
PO Number: PO-2024-0055
|
| 277 |
-
Date: January 10, 2024
|
| 278 |
-
Vendor: Acme Corporation
|
| 279 |
-
Buyer: Widget Co.
|
| 280 |
-
|
| 281 |
-
Ordered Items:
|
| 282 |
-
- 10x Widget Type A @ $25.00 = $250.00
|
| 283 |
-
- 5x Widget Type B @ $40.00 = $200.00
|
| 284 |
-
- 8hrs Consulting @ $75.00/hr = $600.00
|
| 285 |
-
|
| 286 |
-
PO Total: $1,050.00 (before tax)
|
| 287 |
-
|
| 288 |
-
=== INVOICE ===
|
| 289 |
-
Invoice Number: INV-2024-001
|
| 290 |
-
Reference PO: PO-2024-0055
|
| 291 |
-
Date: January 15, 2024
|
| 292 |
-
|
| 293 |
-
From: Acme Corporation, 123 Business Ave, New York, NY 10001
|
| 294 |
-
To: Widget Co., 456 Commerce St, Chicago, IL 60601
|
| 295 |
-
|
| 296 |
-
Description Qty Unit Price Amount
|
| 297 |
-
Widget Type A 10 $25.00 $250.00
|
| 298 |
-
Widget Type B 5 $40.00 $200.00
|
| 299 |
-
Consulting Service 8 $75.00 $600.00
|
| 300 |
-
|
| 301 |
-
Subtotal: $1,050.00
|
| 302 |
-
Tax (8%): $84.00
|
| 303 |
-
Invoice Total: $1,134.00
|
| 304 |
-
|
| 305 |
-
=== CREDIT MEMO ===
|
| 306 |
-
Credit Memo #: CM-2024-003
|
| 307 |
-
Reference Invoice: INV-2024-001
|
| 308 |
-
Date: January 22, 2024
|
| 309 |
-
|
| 310 |
-
Reason: 2x Widget Type A received defective
|
| 311 |
-
Credit Amount: $50.00
|
| 312 |
-
|
| 313 |
-
=== SUMMARY ===
|
| 314 |
-
Original Invoice: $1,134.00
|
| 315 |
-
Credit Applied: -$50.00
|
| 316 |
-
Adjusted Balance Due: $1,084.00
|
| 317 |
-
""",
|
| 318 |
-
"ground_truth": {
|
| 319 |
-
"invoice_number": "INV-2024-001",
|
| 320 |
-
"date": "2024-01-15",
|
| 321 |
-
"vendor_name": "Acme Corporation",
|
| 322 |
-
"customer_name": "Widget Co.",
|
| 323 |
-
"subtotal": 1050.00,
|
| 324 |
-
"tax": 84.00,
|
| 325 |
-
"total": 1134.00,
|
| 326 |
-
"po_number": "PO-2024-0055",
|
| 327 |
-
"adjustment_reason": "2x Widget Type A received defective",
|
| 328 |
-
"adjusted_total": 1084.00,
|
| 329 |
-
"line_items": [
|
| 330 |
-
{"description": "Widget Type A", "quantity": 10, "unit_price": 25.00, "amount": 250.00},
|
| 331 |
-
{"description": "Widget Type B", "quantity": 5, "unit_price": 40.00, "amount": 200.00},
|
| 332 |
-
{"description": "Consulting Service", "quantity": 8, "unit_price": 75.00, "amount": 600.00},
|
| 333 |
-
],
|
| 334 |
-
},
|
| 335 |
-
},
|
| 336 |
-
{
|
| 337 |
-
"id": "multi_002",
|
| 338 |
-
"text": """--- PURCHASE ORDER ---
|
| 339 |
-
PO#: PO-DF-2024-112
|
| 340 |
-
Issued: 2024-02-28
|
| 341 |
-
Requested By: DataFlow Inc., Austin TX
|
| 342 |
-
Vendor: TechStart Solutions LLC
|
| 343 |
-
|
| 344 |
-
Items Requested:
|
| 345 |
-
1. Cloud Hosting (Monthly) - 1 unit - $450.00 - $450.00
|
| 346 |
-
2. API Integration - 1 unit - $1,200.00 - $1,200.00
|
| 347 |
-
3. Tech Support - 10 hours - $95.00/hr - $950.00
|
| 348 |
-
NOTE: Hours are estimated, bill actuals
|
| 349 |
-
|
| 350 |
-
PO Authorized Amount: $2,600.00 (pre-tax)
|
| 351 |
-
|
| 352 |
-
--- INVOICE ---
|
| 353 |
-
Invoice: TS-5892
|
| 354 |
-
Date: March 3, 2024
|
| 355 |
-
PO Reference: PO-DF-2024-112
|
| 356 |
-
|
| 357 |
-
From: TechStart Solutions LLC, 890 Innovation Dr Suite 200, San Francisco CA 94105
|
| 358 |
-
To: DataFlow Inc., 321 Analytics Blvd, Austin TX 78701
|
| 359 |
-
|
| 360 |
-
Service Qty Rate Amount
|
| 361 |
-
Cloud Hosting (Monthly) 1 $450.00 $450.00
|
| 362 |
-
API Integration Setup 1 $1,200.00 $1,200.00
|
| 363 |
-
Technical Support (actual hrs) 12 $95.00 $1,140.00
|
| 364 |
-
|
| 365 |
-
NOTE: Support hours exceeded PO estimate (10hrs) by 2hrs.
|
| 366 |
-
Overage pre-approved by J. Smith on 03/01/2024.
|
| 367 |
-
|
| 368 |
-
Subtotal: $2,790.00
|
| 369 |
-
Tax (7%): $195.30
|
| 370 |
-
Total: $2,985.30
|
| 371 |
-
|
| 372 |
-
--- PAYMENT RECEIPT ---
|
| 373 |
-
Receipt #: RCP-2024-0891
|
| 374 |
-
Date: March 15, 2024
|
| 375 |
-
Payment Method: ACH Transfer
|
| 376 |
-
Reference: TS-5892
|
| 377 |
-
|
| 378 |
-
Amount Paid: $2,000.00
|
| 379 |
-
Outstanding Balance: $985.30
|
| 380 |
-
Due By: April 2, 2024
|
| 381 |
-
""",
|
| 382 |
-
"ground_truth": {
|
| 383 |
-
"invoice_number": "TS-5892",
|
| 384 |
-
"date": "2024-03-03",
|
| 385 |
-
"vendor_name": "TechStart Solutions LLC",
|
| 386 |
-
"customer_name": "DataFlow Inc.",
|
| 387 |
-
"subtotal": 2790.00,
|
| 388 |
-
"tax": 195.30,
|
| 389 |
-
"total": 2985.30,
|
| 390 |
-
"po_number": "PO-DF-2024-112",
|
| 391 |
-
"adjustment_reason": "Partial payment applied",
|
| 392 |
-
"adjusted_total": 985.30,
|
| 393 |
-
"line_items": [
|
| 394 |
-
{"description": "Cloud Hosting (Monthly)", "quantity": 1, "unit_price": 450.00, "amount": 450.00},
|
| 395 |
-
{"description": "API Integration Setup", "quantity": 1, "unit_price": 1200.00, "amount": 1200.00},
|
| 396 |
-
{"description": "Technical Support (actual hrs)", "quantity": 12, "unit_price": 95.00, "amount": 1140.00},
|
| 397 |
-
],
|
| 398 |
-
},
|
| 399 |
-
},
|
| 400 |
-
{
|
| 401 |
-
"id": "multi_003",
|
| 402 |
-
"text": """==== PURCHASE ORDER ====
|
| 403 |
-
PO: PO-RM-2024-033
|
| 404 |
-
Date: Feb 15, 2024
|
| 405 |
-
Buyer: Riverside Manufacturing, 780 Factory Rd, Cleveland OH
|
| 406 |
-
Supplier: Global Supplies Inc.
|
| 407 |
-
Budget Approved: $2,800.00
|
| 408 |
-
|
| 409 |
-
Requested:
|
| 410 |
-
- Steel Bolts Box/100: 50 boxes @ $12.50
|
| 411 |
-
- Copper Wire 500ft: 10 rolls @ $85.00
|
| 412 |
-
- Safety Goggles Pack/10: 20 packs @ $35.00
|
| 413 |
-
- Welding Rods Bundle: 15 bundles @ $22.00
|
| 414 |
-
|
| 415 |
-
==== INVOICE ====
|
| 416 |
-
Invoice: GS-2024-0147
|
| 417 |
-
Date: February 20, 2024
|
| 418 |
-
PO Ref: PO-RM-2024-033
|
| 419 |
-
|
| 420 |
-
Billed By: Global Supplies Inc., 2500 Industrial Parkway, Detroit MI 48201
|
| 421 |
-
Billed To: Riverside Manufacturing, 780 Factory Road, Cleveland OH 44101
|
| 422 |
-
|
| 423 |
-
Item Qty Unit$ Total
|
| 424 |
-
Steel Bolts (Box/100) 50 $12.50 $625.00
|
| 425 |
-
Copper Wire (500ft Roll) 8 $85.00 $680.00
|
| 426 |
-
Safety Goggles (Pack/10) 20 $35.00 $700.00
|
| 427 |
-
Welding Rods (Bundle) 15 $22.00 $330.00
|
| 428 |
-
|
| 429 |
-
IMPORTANT: Copper Wire qty reduced from PO (10 to 8).
|
| 430 |
-
2 rolls backordered, will ship separately.
|
| 431 |
-
|
| 432 |
-
Subtotal: $2,335.00
|
| 433 |
-
Tax (7%): $163.45
|
| 434 |
-
Total Due: $2,498.45
|
| 435 |
-
|
| 436 |
-
==== BACKORDER NOTICE ====
|
| 437 |
-
Backorder #: BO-2024-0089
|
| 438 |
-
Reference: GS-2024-0147 / PO-RM-2024-033
|
| 439 |
-
Item: Copper Wire (500ft Roll)
|
| 440 |
-
Qty Backordered: 2
|
| 441 |
-
Unit Price: $85.00
|
| 442 |
-
Backorder Amount: $170.00
|
| 443 |
-
Estimated Ship Date: March 10, 2024
|
| 444 |
-
|
| 445 |
-
Total with Backorder: $2,498.45 + $170.00 = $2,668.45
|
| 446 |
-
(Backorder will be invoiced separately upon shipment)
|
| 447 |
-
""",
|
| 448 |
-
"ground_truth": {
|
| 449 |
-
"invoice_number": "GS-2024-0147",
|
| 450 |
-
"date": "2024-02-20",
|
| 451 |
-
"vendor_name": "Global Supplies Inc.",
|
| 452 |
-
"customer_name": "Riverside Manufacturing",
|
| 453 |
-
"subtotal": 2335.00,
|
| 454 |
-
"tax": 163.45,
|
| 455 |
-
"total": 2498.45,
|
| 456 |
-
"po_number": "PO-RM-2024-033",
|
| 457 |
-
"adjustment_reason": "Copper Wire qty reduced from PO, 2 rolls backordered",
|
| 458 |
-
"adjusted_total": 2668.45,
|
| 459 |
-
"line_items": [
|
| 460 |
-
{"description": "Steel Bolts (Box/100)", "quantity": 50, "unit_price": 12.50, "amount": 625.00},
|
| 461 |
-
{"description": "Copper Wire (500ft Roll)", "quantity": 8, "unit_price": 85.00, "amount": 680.00},
|
| 462 |
-
{"description": "Safety Goggles (Pack/10)", "quantity": 20, "unit_price": 35.00, "amount": 700.00},
|
| 463 |
-
{"description": "Welding Rods (Bundle)", "quantity": 15, "unit_price": 22.00, "amount": 330.00},
|
| 464 |
-
],
|
| 465 |
-
},
|
| 466 |
-
},
|
| 467 |
-
],
|
| 468 |
-
|
| 469 |
-
# =========================================================================
|
| 470 |
-
# CORRUPTED SCAN — OCR-like artifacts, character substitutions, garbled text
|
| 471 |
-
# These simulate real scanned/faxed invoices with OCR errors.
|
| 472 |
-
# =========================================================================
|
| 473 |
-
"corrupted_scan": [
|
| 474 |
-
{
|
| 475 |
-
"id": "corrupt_001",
|
| 476 |
-
"text": """SC4NNED D0CUMENT - Page 1 of 1
|
| 477 |
-
|
| 478 |
-
lNVOlCE
|
| 479 |
-
|
| 480 |
-
lnvoice Nurnber: lNV-2O24-OO1
|
| 481 |
-
Dat.e: Januery 1S, 2O24
|
| 482 |
-
|
| 483 |
-
Frorn:
|
| 484 |
-
Acrne Corporati0n
|
| 485 |
-
l23 Business Avenue
|
| 486 |
-
New Y0rk, NY 1OOO1
|
| 487 |
-
|
| 488 |
-
BilI To:
|
| 489 |
-
Widget C0.
|
| 490 |
-
4S6 Cornmerce Street
|
| 491 |
-
Chicag0, lL 6O6O1
|
| 492 |
-
|
| 493 |
-
Descripti0n Qty Unit Price Arnount
|
| 494 |
-
---------------------------------------------------------
|
| 495 |
-
Widget Type A 1O $2S.OO $2SO.OO
|
| 496 |
-
Widget Type 8 S $4O.OO $2OO.OO
|
| 497 |
-
ConsuIting Service 8 $7S.OO $6OO.OO
|
| 498 |
-
|
| 499 |
-
Subtotal: $1,OSO.OO
|
| 500 |
-
Tax (8%): $84.OO
|
| 501 |
-
T0tal: $1,l34.OO
|
| 502 |
-
|
| 503 |
-
Payrnent Terrns: Net 3O
|
| 504 |
-
|
| 505 |
-
--- END 0F SCAN ---
|
| 506 |
-
""",
|
| 507 |
-
"ground_truth": {
|
| 508 |
-
"invoice_number": "INV-2024-001",
|
| 509 |
-
"date": "2024-01-15",
|
| 510 |
-
"vendor_name": "Acme Corporation",
|
| 511 |
-
"customer_name": "Widget Co.",
|
| 512 |
-
"subtotal": 1050.00,
|
| 513 |
-
"tax": 84.00,
|
| 514 |
-
"total": 1134.00,
|
| 515 |
-
"line_items": [
|
| 516 |
-
{"description": "Widget Type A", "quantity": 10, "unit_price": 25.00, "amount": 250.00},
|
| 517 |
-
{"description": "Widget Type B", "quantity": 5, "unit_price": 40.00, "amount": 200.00},
|
| 518 |
-
{"description": "Consulting Service", "quantity": 8, "unit_price": 75.00, "amount": 600.00},
|
| 519 |
-
],
|
| 520 |
-
},
|
| 521 |
-
},
|
| 522 |
-
{
|
| 523 |
-
"id": "corrupt_002",
|
| 524 |
-
"text": """[SCAN QUALITY: P00R - SOME CHARACTERS MAY BE lNCORRECT]
|
| 525 |
-
|
| 526 |
-
TECHSTART S0LUTl0NS LLC
|
| 527 |
-
89O lnnovation Dr, Suite 2OO
|
| 528 |
-
San Francisc0, CA 941OS
|
| 529 |
-
|
| 530 |
-
lNV0lCE #: TS~S892
|
| 531 |
-
DATE: O3/O3/2O24
|
| 532 |
-
|
| 533 |
-
CUSTOMERr DataFIow lnc.
|
| 534 |
-
321 AnaIytics BIvd
|
| 535 |
-
Austin, TX 787O1
|
| 536 |
-
|
| 537 |
-
Servicc Qty Unit Pricc Total
|
| 538 |
-
----------------------------------------------------------
|
| 539 |
-
CIoud Hosting (MonthIy) l $4SO.OO $4SO.OO
|
| 540 |
-
APl lntegration Setup l $l,2OO.OO $l,2OO.OO
|
| 541 |
-
TechnicaI Support (hours) l2 $9S.OO $l,l4O.OO
|
| 542 |
-
|
| 543 |
-
SubtotaI: $2,79O.OO
|
| 544 |
-
Tax (7%)): $l9S.3O
|
| 545 |
-
TotaI: $2,98S.3O
|
| 546 |
-
|
| 547 |
-
Due Date: ApriI 2, 2O24
|
| 548 |
-
|
| 549 |
-
[PAGE 1/1 - SCAN C0MPLETE]
|
| 550 |
-
""",
|
| 551 |
-
"ground_truth": {
|
| 552 |
-
"invoice_number": "TS-5892",
|
| 553 |
-
"date": "2024-03-03",
|
| 554 |
-
"vendor_name": "TechStart Solutions LLC",
|
| 555 |
-
"customer_name": "DataFlow Inc.",
|
| 556 |
-
"subtotal": 2790.00,
|
| 557 |
-
"tax": 195.30,
|
| 558 |
-
"total": 2985.30,
|
| 559 |
-
"line_items": [
|
| 560 |
-
{"description": "Cloud Hosting (Monthly)", "quantity": 1, "unit_price": 450.00, "amount": 450.00},
|
| 561 |
-
{"description": "API Integration Setup", "quantity": 1, "unit_price": 1200.00, "amount": 1200.00},
|
| 562 |
-
{"description": "Technical Support (hours)", "quantity": 12, "unit_price": 95.00, "amount": 1140.00},
|
| 563 |
-
],
|
| 564 |
-
},
|
| 565 |
-
},
|
| 566 |
-
{
|
| 567 |
-
"id": "corrupt_003",
|
| 568 |
-
"text": """---FAXED DOCUMENT---
|
| 569 |
-
RECEIVED: 02/20/2024 14:32
|
| 570 |
-
QUALITY: [####===---] 40%
|
| 571 |
-
|
| 572 |
-
GL0BAL SUPPLlES lNC.
|
| 573 |
-
25OO lndustriaI Parkway
|
| 574 |
-
Detr0it, Ml 482Ol
|
| 575 |
-
|
| 576 |
-
lNVOlCE
|
| 577 |
-
|
| 578 |
-
lnvoice Number: GS-2O24-Ol47
|
| 579 |
-
Date: February 2O, 2024
|
| 580 |
-
|
| 581 |
-
T0:
|
| 582 |
-
Riverside Manufactur1ng
|
| 583 |
-
78O Factory R0ad
|
| 584 |
-
CIeveIand, 0H 44l0l
|
| 585 |
-
|
| 586 |
-
Product Qty Price Each Line Total
|
| 587 |
-
-----------------------------------------------------------
|
| 588 |
-
SteeI BoIts (Box/lOO) SO $l2.SO $62S.OO
|
| 589 |
-
Copper Wire (SOOft RoII) 8 $8S.OO $68O.OO
|
| 590 |
-
Safety GoggIes (Pack/lO) 2O $3S.OO $7OO.OO
|
| 591 |
-
WeIding Rods (BundIe) lS $22.OO $33O.OO
|
| 592 |
-
|
| 593 |
-
[iIIegibIe]
|
| 594 |
-
SubtotaI: $2,33S.OO
|
| 595 |
-
SaIes Tax: $l63.4S
|
| 596 |
-
lnvoice T0tal: $2,498.4S
|
| 597 |
-
|
| 598 |
-
Terrns: Net 4S
|
| 599 |
-
---END FAX---
|
| 600 |
-
""",
|
| 601 |
-
"ground_truth": {
|
| 602 |
-
"invoice_number": "GS-2024-0147",
|
| 603 |
-
"date": "2024-02-20",
|
| 604 |
-
"vendor_name": "Global Supplies Inc.",
|
| 605 |
-
"customer_name": "Riverside Manufacturing",
|
| 606 |
-
"subtotal": 2335.00,
|
| 607 |
-
"tax": 163.45,
|
| 608 |
-
"total": 2498.45,
|
| 609 |
-
"line_items": [
|
| 610 |
-
{"description": "Steel Bolts (Box/100)", "quantity": 50, "unit_price": 12.50, "amount": 625.00},
|
| 611 |
-
{"description": "Copper Wire (500ft Roll)", "quantity": 8, "unit_price": 85.00, "amount": 680.00},
|
| 612 |
-
{"description": "Safety Goggles (Pack/10)", "quantity": 20, "unit_price": 35.00, "amount": 700.00},
|
| 613 |
-
{"description": "Welding Rods (Bundle)", "quantity": 15, "unit_price": 22.00, "amount": 330.00},
|
| 614 |
-
],
|
| 615 |
-
},
|
| 616 |
-
},
|
| 617 |
-
],
|
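
Review note: the `corrupted_scan` fixtures above swap digits for look-alike letters (`O` for `0`, `l` for `1`, `S` for `5`). A minimal sketch of how an agent-side parser might undo that for money amounts follows. The names `OCR_DIGIT_FIXES`, `normalize_numeric_token`, and `parse_amount` are hypothetical and do not appear in this commit; the confusion table is an assumption based only on the substitutions visible in these three fixtures.

```python
import re

# Hypothetical confusion table: letters these fixtures use in place of digits.
OCR_DIGIT_FIXES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5"})

# Tokens made only of digits, look-alike letters, and number punctuation.
NUMERIC_TOKEN = re.compile(r"^\$?[0-9OolIS.,]+$")


def normalize_numeric_token(token: str) -> str:
    """Repair look-alike letters, but only in tokens that are clearly numeric.

    A token qualifies if it matches NUMERIC_TOKEN and carries at least one
    real digit or a dollar sign. Fully ambiguous tokens such as "SO" are
    returned unchanged rather than guessed at.
    """
    looks_numeric = NUMERIC_TOKEN.match(token) is not None
    has_anchor = any(ch.isdigit() or ch == "$" for ch in token)
    if looks_numeric and has_anchor:
        return token.translate(OCR_DIGIT_FIXES)
    return token


def parse_amount(token: str) -> float:
    """Parse a (possibly corrupted) currency token into a float."""
    cleaned = normalize_numeric_token(token).replace("$", "").replace(",", "")
    return float(cleaned)


# Amounts taken from corrupt_001 and corrupt_002 above.
subtotal = parse_amount("$1,OSO.OO")
total = parse_amount("$2,98S.3O")
```

The sketch deliberately leaves identifiers such as `lNV-2O24-OO1` and bare quantities such as `SO` alone, since repairing those needs column context that a per-token rule does not have.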
| 618 |
-
|
| 619 |
-
# =========================================================================
|
| 620 |
-
# ADVERSARIAL INVOICE — Decoy fields, contradictions, hidden calculations
|
| 621 |
-
# Designed to genuinely challenge frontier models with traps.
|
| 622 |
-
# =========================================================================
|
| 623 |
-
"adversarial_invoice": [
|
| 624 |
-
{
|
| 625 |
-
"id": "adversarial_001",
|
| 626 |
-
"text": """INVOICE
|
| 627 |
-
|
| 628 |
-
*** IMPORTANT: This replaces previous invoice DRAFT-INV-999 which was voided ***
|
| 629 |
-
|
| 630 |
-
Invoice Number: INV-2024-001-R2
|
| 631 |
-
Previous Reference: DRAFT-INV-999 (VOIDED — DO NOT USE)
|
| 632 |
-
Date: January 15, 2024
|
| 633 |
-
Reissue Date: January 20, 2024
|
| 634 |
-
|
| 635 |
-
From:
|
| 636 |
-
Acme Corporation
|
| 637 |
-
123 Business Avenue, New York, NY 10001
|
| 638 |
-
Tax ID: 12-3456789
|
| 639 |
-
|
| 640 |
-
Bill To:
|
| 641 |
-
Widget Co. (DBA "WidgetCorp International")
|
| 642 |
-
456 Commerce Street, Chicago, IL 60601
|
| 643 |
-
Customer Account: WC-0042
|
| 644 |
-
|
| 645 |
-
Description Qty Unit Price Amount
|
| 646 |
-
---------------------------------------------------------
|
| 647 |
-
Widget Type A 10 $25.00 $250.00
|
| 648 |
-
Widget Type B 5 $40.00 $200.00
|
| 649 |
-
Consulting Service 8 $75.00 $600.00
|
| 650 |
-
** EARLY PAYMENT DISCOUNT: -5% on consulting **
|
| 651 |
-
|
| 652 |
-
Subtotal: $1,050.00
|
| 653 |
-
Discount (5%): -$30.00
|
| 654 |
-
Adjusted Subtotal: $1,020.00
|
| 655 |
-
Tax (8%): $81.60
|
| 656 |
-
Total: $1,101.60
|
| 657 |
-
|
| 658 |
-
NOTE: Original quote (QT-2024-555) was $1,134.00 but discount applied.
|
| 659 |
-
Per agreement dated Jan 12, if paid within 10 days.
|
| 660 |
-
|
| 661 |
-
Payment Terms: Net 10 (discounted) / Net 30 (full price $1,134.00)
|
| 662 |
-
""",
|
| 663 |
-
"ground_truth": {
|
| 664 |
-
"invoice_number": "INV-2024-001-R2",
|
| 665 |
-
"date": "2024-01-20",
|
| 666 |
-
"vendor_name": "Acme Corporation",
|
| 667 |
-
"customer_name": "Widget Co.",
|
| 668 |
-
"subtotal": 1020.00,
|
| 669 |
-
"tax": 81.60,
|
| 670 |
-
"total": 1101.60,
|
| 671 |
-
"discount_amount": 30.00,
|
| 672 |
-
"original_total": 1134.00,
|
| 673 |
-
"line_items": [
|
| 674 |
-
{"description": "Widget Type A", "quantity": 10, "unit_price": 25.00, "amount": 250.00},
|
| 675 |
-
{"description": "Widget Type B", "quantity": 5, "unit_price": 40.00, "amount": 200.00},
|
| 676 |
-
{"description": "Consulting Service", "quantity": 8, "unit_price": 75.00, "amount": 600.00},
|
| 677 |
-
],
|
| 678 |
-
"discrepancy_notes": "5% early payment discount applied to consulting. Reissued invoice replaces voided DRAFT-INV-999. Adjusted subtotal $1,020 vs original $1,050.",
|
| 679 |
-
},
|
| 680 |
-
},
|
| 681 |
-
{
|
| 682 |
-
"id": "adversarial_002",
|
| 683 |
-
"text": """--- PURCHASE ORDER ---
|
| 684 |
-
PO#: PO-DF-2024-112
|
| 685 |
-
Date: February 28, 2024
|
| 686 |
-
Vendor: TechStart Solutions LLC
|
| 687 |
-
Buyer: DataFlow Inc.
|
| 688 |
-
Authorized Budget: $2,600.00 (pre-tax)
|
| 689 |
-
|
| 690 |
-
Items:
|
| 691 |
-
1. Cloud Hosting - 1 unit @ $450.00 = $450.00
|
| 692 |
-
2. API Integration - 1 unit @ $1,200.00 = $1,200.00
|
| 693 |
-
3. Tech Support - 10 hours @ $95.00/hr = $950.00
|
| 694 |
-
PO Total: $2,600.00
|
| 695 |
-
|
| 696 |
-
--- INVOICE ---
|
| 697 |
-
Invoice: TS-5892-FINAL
|
| 698 |
-
Date: March 3, 2024
|
| 699 |
-
PO Reference: PO-DF-2024-112
|
| 700 |
-
|
| 701 |
-
From: TechStart Solutions LLC
|
| 702 |
-
To: DataFlow Inc.
|
| 703 |
-
|
| 704 |
-
Service Qty Rate Amount
|
| 705 |
-
Cloud Hosting (Monthly) 1 $450.00 $450.00
|
| 706 |
-
API Integration Setup 1 $1,200.00 $1,200.00
|
| 707 |
-
Technical Support (actual) 12 $95.00 $1,140.00
|
| 708 |
-
>> 2 hrs over PO estimate, approved by J. Smith 03/01/2024
|
| 709 |
-
Rush Processing Fee 1 $150.00 $150.00
|
| 710 |
-
>> Added per emergency request ER-2024-033
|
| 711 |
-
|
| 712 |
-
Subtotal: $2,940.00
|
| 713 |
-
Tax (7%): $205.80
|
| 714 |
-
Total: $3,145.80
|
| 715 |
-
|
| 716 |
-
!!! BUDGET VARIANCE ALERT !!!
|
| 717 |
-
PO Authorized: $2,600.00
|
| 718 |
-
Actual (pre-tax): $2,940.00
|
| 719 |
-
Variance: $340.00 OVER BUDGET (13.1%)
|
| 720 |
-
Causes: Support overage ($190), Rush fee ($150)
|
| 721 |
-
|
| 722 |
-
--- PAYMENT SCHEDULE ---
|
| 723 |
-
Payment 1 (due 03/15): $1,500.00
|
| 724 |
-
Payment 2 (due 04/02): $1,645.80
|
| 725 |
-
""",
|
| 726 |
-
"ground_truth": {
|
| 727 |
-
"invoice_number": "TS-5892-FINAL",
|
| 728 |
-
"date": "2024-03-03",
|
| 729 |
-
"vendor_name": "TechStart Solutions LLC",
|
| 730 |
-
"customer_name": "DataFlow Inc.",
|
| 731 |
-
"subtotal": 2940.00,
|
| 732 |
-
"tax": 205.80,
|
| 733 |
-
"total": 3145.80,
|
| 734 |
-
"po_number": "PO-DF-2024-112",
|
| 735 |
-
"discount_amount": 0.00,
|
| 736 |
-
"original_total": 2600.00,
|
| 737 |
-
"line_items": [
|
| 738 |
-
{"description": "Cloud Hosting (Monthly)", "quantity": 1, "unit_price": 450.00, "amount": 450.00},
|
| 739 |
-
{"description": "API Integration Setup", "quantity": 1, "unit_price": 1200.00, "amount": 1200.00},
|
| 740 |
-
{"description": "Technical Support (actual)", "quantity": 12, "unit_price": 95.00, "amount": 1140.00},
|
| 741 |
-
{"description": "Rush Processing Fee", "quantity": 1, "unit_price": 150.00, "amount": 150.00},
|
| 742 |
-
],
|
| 743 |
-
"discrepancy_notes": "Invoice exceeds PO by $340 (13.1%). 2 extra support hours ($190) and rush processing fee ($150) added. PO authorized $2,600 but actual pre-tax is $2,940.",
|
| 744 |
-
},
|
| 745 |
-
},
|
| 746 |
-
{
|
| 747 |
-
"id": "adversarial_003",
|
| 748 |
-
"text": """CONSOLIDATED STATEMENT
|
| 749 |
-
|
| 750 |
-
Account: Riverside Manufacturing
|
| 751 |
-
Statement Period: February 2024
|
| 752 |
-
Prepared by: Global Supplies Inc., Accounts Receivable
|
| 753 |
-
|
| 754 |
-
=== TRANSACTION 1: ORIGINAL INVOICE ===
|
| 755 |
-
Invoice: GS-2024-0147
|
| 756 |
-
Date: February 20, 2024
|
| 757 |
-
PO: PO-RM-2024-033
|
| 758 |
-
|
| 759 |
-
Steel Bolts (Box/100) 50 @ $12.50 = $625.00
|
| 760 |
-
Copper Wire (500ft Roll) 10 @ $85.00 = $850.00
|
| 761 |
-
Safety Goggles (Pack/10) 20 @ $35.00 = $700.00
|
| 762 |
-
Welding Rods (Bundle) 15 @ $22.00 = $330.00
|
| 763 |
-
|
| 764 |
-
Invoice Subtotal: $2,505.00
|
| 765 |
-
Tax (7%): $175.35
|
| 766 |
-
Invoice Total: $2,680.35
|
| 767 |
-
|
| 768 |
-
=== TRANSACTION 2: ADJUSTMENT ===
|
| 769 |
-
Credit Memo: CM-2024-0201
|
| 770 |
-
Date: February 25, 2024
|
| 771 |
-
Reference: GS-2024-0147
|
| 772 |
-
|
| 773 |
-
Issue: Copper Wire — only 8 of 10 rolls delivered.
|
| 774 |
-
2 rolls backordered (BO-2024-0089).
|
| 775 |
-
Credit for undelivered: 2 x $85.00 = $170.00
|
| 776 |
-
Tax adjustment: -$11.90
|
| 777 |
-
Total Credit: -$181.90
|
| 778 |
-
|
| 779 |
-
=== TRANSACTION 3: PRICE CORRECTION ===
|
| 780 |
-
Debit Memo: DM-2024-0055
|
| 781 |
-
Date: February 27, 2024
|
| 782 |
-
|
| 783 |
-
Steel Bolts price was quoted at $12.50 but contract
|
| 784 |
-
rate is $13.00. Underbilled on 50 boxes.
|
| 785 |
-
Price difference: 50 x $0.50 = $25.00
|
| 786 |
-
Tax on adjustment: $1.75
|
| 787 |
-
Total Debit: $26.75
|
| 788 |
-
|
| 789 |
-
=== ACCOUNT SUMMARY ===
|
| 790 |
-
Original Invoice: $2,680.35
|
| 791 |
-
Credit (undelivered wire): -$181.90
|
| 792 |
-
Debit (price correction): +$26.75
|
| 793 |
-
================================
|
| 794 |
-
Net Amount Due: $2,525.20
|
| 795 |
-
|
| 796 |
-
Payment due by: March 20, 2024
|
| 797 |
-
""",
|
| 798 |
-
"ground_truth": {
|
| 799 |
-
"invoice_number": "GS-2024-0147",
|
| 800 |
-
"date": "2024-02-20",
|
| 801 |
-
"vendor_name": "Global Supplies Inc.",
|
| 802 |
-
"customer_name": "Riverside Manufacturing",
|
| 803 |
-
"subtotal": 2505.00,
|
| 804 |
-
"tax": 175.35,
|
| 805 |
-
"total": 2680.35,
|
| 806 |
-
"po_number": "PO-RM-2024-033",
|
| 807 |
-
"discount_amount": 0.00,
|
| 808 |
-
"original_total": 2680.35,
|
| 809 |
-
"line_items": [
|
| 810 |
-
{"description": "Steel Bolts (Box/100)", "quantity": 50, "unit_price": 12.50, "amount": 625.00},
|
| 811 |
-
{"description": "Copper Wire (500ft Roll)", "quantity": 10, "unit_price": 85.00, "amount": 850.00},
|
| 812 |
-
{"description": "Safety Goggles (Pack/10)", "quantity": 20, "unit_price": 35.00, "amount": 700.00},
|
| 813 |
-
{"description": "Welding Rods (Bundle)", "quantity": 15, "unit_price": 22.00, "amount": 330.00},
|
| 814 |
-
],
|
| 815 |
-
"discrepancy_notes": "Credit memo CM-2024-0201 for 2 undelivered Copper Wire rolls (-$181.90). Debit memo DM-2024-0055 for Steel Bolts price correction (+$26.75). Net adjustment: -$155.15. Final amount due: $2,525.20.",
|
| 816 |
-
},
|
| 817 |
-
},
|
| 818 |
-
],
|
| 819 |
-
}
|
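
Review note: every `ground_truth` block above encodes the same arithmetic (line amounts sum to the subtotal, less any discount, and subtotal plus tax gives the total). A minimal consistency check could be sketched as below. `money` and `check_ground_truth` are hypothetical helpers, not part of this commit, and treating `discount_amount` as a pre-tax reduction is an assumption drawn from `adversarial_001`.

```python
from decimal import Decimal

CENT = Decimal("0.01")


def money(value) -> Decimal:
    """Convert a fixture value (float, int, or Decimal) to a Decimal in cents."""
    return Decimal(str(value)).quantize(CENT)


def check_ground_truth(gt: dict) -> list:
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    line_sum = Decimal("0.00")
    for item in gt["line_items"]:
        expected = money(Decimal(str(item["quantity"])) * Decimal(str(item["unit_price"])))
        if expected != money(item["amount"]):
            problems.append(f"line '{item['description']}': expected {expected}")
        line_sum += money(item["amount"])

    # Assumption: a discount, when present, is subtracted before tax.
    discount = money(gt.get("discount_amount", 0))
    if line_sum - discount != money(gt["subtotal"]):
        problems.append(f"subtotal: expected {line_sum - discount}")
    if money(gt["subtotal"]) + money(gt["tax"]) != money(gt["total"]):
        problems.append(f"total: expected {money(gt['subtotal']) + money(gt['tax'])}")
    return problems


# Values copied from the adversarial_001 ground truth above.
sample = {
    "subtotal": 1020.00,
    "tax": 81.60,
    "total": 1101.60,
    "discount_amount": 30.00,
    "line_items": [
        {"description": "Widget Type A", "quantity": 10, "unit_price": 25.00, "amount": 250.00},
        {"description": "Widget Type B", "quantity": 5, "unit_price": 40.00, "amount": 200.00},
        {"description": "Consulting Service", "quantity": 8, "unit_price": 75.00, "amount": 600.00},
    ],
}
problems = check_ground_truth(sample)
```

`Decimal` is used instead of float comparison so that values like `81.60` compare exactly at cent precision.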
| 820 |
-
|
| 821 |
-
|
| 822 |
-
# Required fields per task (defines what the agent must extract)
|
| 823 |
-
TASK_REQUIRED_FIELDS = {
|
| 824 |
-
"simple_invoice": [
|
| 825 |
-
"invoice_number", "date", "vendor_name", "customer_name",
|
| 826 |
-
"subtotal", "tax", "total", "line_items",
|
| 827 |
-
],
|
| 828 |
-
"messy_invoice": [
|
| 829 |
-
"invoice_number", "date", "vendor_name", "customer_name",
|
| 830 |
-
"subtotal", "tax", "total", "line_items",
|
| 831 |
-
],
|
| 832 |
-
"multi_document": [
|
| 833 |
-
"invoice_number", "date", "vendor_name", "customer_name",
|
| 834 |
-
"subtotal", "tax", "total", "line_items",
|
| 835 |
-
"po_number", "adjustment_reason", "adjusted_total",
|
| 836 |
-
],
|
| 837 |
-
"corrupted_scan": [
|
| 838 |
-
"invoice_number", "date", "vendor_name", "customer_name",
|
| 839 |
-
"subtotal", "tax", "total", "line_items",
|
| 840 |
-
],
|
| 841 |
-
"adversarial_invoice": [
|
| 842 |
-
"invoice_number", "date", "vendor_name", "customer_name",
|
| 843 |
-
"subtotal", "tax", "total", "line_items",
|
| 844 |
-
"po_number", "discount_amount", "original_total",
|
| 845 |
-
"discrepancy_notes",
|
| 846 |
-
],
|
| 847 |
-
}
|
| 848 |
-
|
| 849 |
-
|
| 850 |
-
def get_document(task_name: str, doc_index: int = 0, use_procedural: bool = True) -> dict:
|
| 851 |
-
"""Get a document and its metadata for a given task.
|
| 852 |
-
|
| 853 |
-
For doc_index 0-2, returns static documents (deterministic test fixtures).
|
| 854 |
-
For doc_index >= 3 with use_procedural=True, uses the procedural generation
|
| 855 |
-
engine to create a novel document seeded by doc_index; with use_procedural=False, the index wraps around the static pool.
|
| 856 |
-
|
| 857 |
-
Args:
|
| 858 |
-
task_name: One of 'simple_invoice', 'messy_invoice', 'multi_document',
|
| 859 |
-
'corrupted_scan', 'adversarial_invoice'
|
| 860 |
-
doc_index: Index / seed for document selection
|
| 861 |
-
use_procedural: Whether to use procedural generation for indices beyond static pool
|
| 862 |
-
|
| 863 |
-
Returns:
|
| 864 |
-
dict with 'id', 'text', 'ground_truth', 'required_fields'
|
| 865 |
-
"""
|
| 866 |
-
docs = DOCUMENTS.get(task_name, DOCUMENTS["simple_invoice"])
|
| 867 |
-
required = TASK_REQUIRED_FIELDS.get(task_name, TASK_REQUIRED_FIELDS["simple_invoice"])
|
| 868 |
-
|
| 869 |
-
# Use static documents for small indices (deterministic test fixtures)
|
| 870 |
-
if doc_index < len(docs):
|
| 871 |
-
doc = docs[doc_index]
|
| 872 |
-
return {
|
| 873 |
-
"id": doc["id"],
|
| 874 |
-
"text": doc["text"],
|
| 875 |
-
"ground_truth": doc["ground_truth"],
|
| 876 |
-
"required_fields": required,
|
| 877 |
-
}
|
| 878 |
-
|
| 879 |
-
# Use procedural generation for larger indices
|
| 880 |
-
if use_procedural:
|
| 881 |
-
from .procedural import generate_document
|
| 882 |
-
proc_doc = generate_document(task_name, seed=doc_index)
|
| 883 |
-
return {
|
| 884 |
-
"id": proc_doc["id"],
|
| 885 |
-
"text": proc_doc["text"],
|
| 886 |
-
"ground_truth": proc_doc["ground_truth"],
|
| 887 |
-
"required_fields": required,
|
| 888 |
-
}
|
| 889 |
-
|
| 890 |
-
# Fallback: wrap around static docs
|
| 891 |
-
doc = docs[doc_index % len(docs)]
|
| 892 |
-
return {
|
| 893 |
-
"id": doc["id"],
|
| 894 |
-
"text": doc["text"],
|
| 895 |
-
"ground_truth": doc["ground_truth"],
|
| 896 |
-
"required_fields": required,
|
| 897 |
-
}
|
| 898 |
-
|
|
server/environment.py
CHANGED
|
@@ -1,621 +1,553 @@
|
|
| 1 |
"""
|
| 2 |
-
|
| 3 |
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
|
| 8 |
Reward Architecture:
|
| 9 |
-
R_total = α·R_outcome + β·R_trajectory
|
| 10 |
-
α = 0.70 (outcome dominates)
|
| 11 |
-
β = 0.30 (trajectory contributes)
|
| 12 |
-
Penalties: step cost, hallucination penalties
|
| 13 |
"""
|
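
Review note: the removed docstring lines above describe the previous reward as a weighted sum, R_total = α·R_outcome + β·R_trajectory with α = 0.70 and β = 0.30. A minimal sketch of that combination follows. `composite_reward` and `clamp01` are hypothetical names, and clamping both components to [0, 1] before mixing is an assumption; the diff does not show how the old code normalized them.

```python
ALPHA = 0.70  # weight of the outcome score (from the removed docstring)
BETA = 0.30   # weight of the trajectory score (from the removed docstring)


def clamp01(value: float) -> float:
    """Clamp a score into the closed interval [0, 1]."""
    return max(0.0, min(1.0, value))


def composite_reward(outcome: float, trajectory: float) -> float:
    """Weighted sum of outcome and trajectory scores, rounded to 4 places.

    Assumption: each component is clamped before mixing, so accumulated
    penalties can zero out the trajectory term but never push the total
    below what the outcome term alone contributes.
    """
    return round(ALPHA * clamp01(outcome) + BETA * clamp01(trajectory), 4)


perfect = composite_reward(1.0, 1.0)
outcome_only = composite_reward(1.0, 0.0)
penalized = composite_reward(0.5, -0.2)
```

With these weights a correct final answer with no trajectory credit still earns 0.7, which matches the docstring's remark that the outcome dominates.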
| 14 |
|
| 15 |
import json
|
| 16 |
from typing import Any, Optional
|
| 17 |
from uuid import uuid4
|
| 18 |
|
| 19 |
-
from .models import
|
| 20 |
-
from .
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
|
| 33 |
}
|
| 34 |
|
| 35 |
-
#
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
REWARD_VIEW_FIELDS = 0.01
|
| 42 |
-
REWARD_GET_FEEDBACK = 0.005
|
| 43 |
-
REWARD_QUERY_RELATED = 0.015
|
| 44 |
-
REWARD_VERIFY_CALC = 0.01
|
| 45 |
-
REWARD_CHECK_DISCREP = 0.015
|
| 46 |
-
|
| 47 |
-
# Penalties
|
| 48 |
-
PENALTY_PER_STEP = -0.005
|
| 49 |
-
PENALTY_INVALID_JSON = -0.02
|
| 50 |
-
PENALTY_UNKNOWN_CMD = -0.02
|
| 51 |
-
PENALTY_INVALID_CALC = -0.01
|
| 52 |
-
|
| 53 |
-
# Tasks that support advanced tool commands
|
| 54 |
-
TOOL_ENABLED_TASKS = {"multi_document", "adversarial_invoice"}
|
| 55 |
-
|
| 56 |
-
VALID_TASKS = list(TASK_REQUIRED_FIELDS.keys())
|
| 57 |
-
|
| 58 |
-
|
| 59 |
-
class InvoiceExtractionEnvironment:
|
| 60 |
-
"""Environment for extracting structured data from invoice documents.
|
| 61 |
|
| 62 |
-
The agent interacts through these commands:
|
| 63 |
-
- view_document: See the raw document text
|
| 64 |
-
- view_fields: See the list of required fields
|
| 65 |
-
- extract: Submit extracted fields as JSON
|
| 66 |
-
- get_feedback: Get detailed feedback on last extraction
|
| 67 |
-
- query_related_documents: Retrieve cross-reference documents
|
| 68 |
-
- verify_calculations: Submit arithmetic for verification
|
| 69 |
-
- check_discrepancies: Request environment to flag inconsistencies
|
| 70 |
|
| 71 |
-
|
| 72 |
-
|
| 73 |
-
"""
|
| 74 |
|
| 75 |
def __init__(self):
|
| 76 |
-
self._state =
|
| 77 |
-
self.
|
| 78 |
-
self._ground_truth = {}
|
| 79 |
-
self._required_fields = []
|
| 80 |
-
self._last_feedback = {}
|
| 81 |
-
self._last_extracted = {}
|
| 82 |
self._initialized = False
|
| 83 |
self._trajectory_reward = 0.0
|
| 84 |
-
self._milestones =
|
| 85 |
-
self.
|
| 86 |
|
| 87 |
def reset(
|
| 88 |
self,
|
| 89 |
seed: Optional[int] = None,
|
| 90 |
episode_id: Optional[str] = None,
|
| 91 |
-
task_name: str = "
|
| 92 |
**kwargs: Any,
|
| 93 |
-
) ->
|
| 94 |
-
"""Reset the environment with a new
|
| 95 |
if task_name not in VALID_TASKS:
|
| 96 |
-
task_name = "
|
| 97 |
|
| 98 |
-
|
| 99 |
-
|
| 100 |
-
|
| 101 |
|
| 102 |
-
self._state =
|
| 103 |
episode_id=episode_id or str(uuid4()),
|
| 104 |
step_count=0,
|
| 105 |
task_name=task_name,
|
| 106 |
-
|
| 107 |
-
best_score=0.0,
|
| 108 |
-
attempts_used=0,
|
| 109 |
-
max_attempts=max_attempts,
|
| 110 |
accumulated_reward=0.0,
|
| 111 |
)
|
| 112 |
-
|
| 113 |
-
self._document_text = doc_data["text"]
|
| 114 |
-
self._ground_truth = doc_data["ground_truth"]
|
| 115 |
-
self._required_fields = doc_data["required_fields"]
|
| 116 |
-
self._last_feedback = {}
|
| 117 |
-
self._last_extracted = {}
|
| 118 |
self._initialized = True
|
| 119 |
self._trajectory_reward = 0.0
|
| 120 |
-
self._milestones =
|
| 121 |
-
self.
|
| 122 |
-
|
| 123 |
-
|
| 124 |
-
|
| 125 |
-
tool_hint = (
|
| 126 |
-
"\nAdvanced tools available for this task:\n"
|
| 127 |
-
" - 'query_related_documents': Retrieve PO, credit memos, etc.\n"
|
| 128 |
-
" - 'verify_calculations': Submit arithmetic for verification\n"
|
| 129 |
-
" - 'check_discrepancies': Flag inconsistencies in the document\n"
|
| 130 |
-
)
|
| 131 |
|
| 132 |
-
|
| 133 |
done=False,
|
| 134 |
reward=0.0,
|
| 135 |
-
|
| 136 |
-
|
| 137 |
-
f"Task: {task_name}\n"
|
| 138 |
-
f"Document ID: {doc_data['id']}\n"
|
| 139 |
-
f"Fields to extract: {len(self._required_fields)}\n"
|
| 140 |
-
f"Max attempts: {max_attempts}\n\n"
|
| 141 |
-
f"Use 'view_document' to see the document text.\n"
|
| 142 |
-
f"Use 'view_fields' to see the required fields.\n"
|
| 143 |
-
f"Use 'extract' with a JSON payload to submit your extraction.\n"
|
| 144 |
-
f"Use 'get_feedback' to see feedback on your last attempt."
|
| 145 |
-
f"{tool_hint}"
|
| 146 |
-
),
|
| 147 |
-
task_name=task_name,
|
| 148 |
-
current_score=0.0,
|
| 149 |
-
attempts_remaining=max_attempts,
|
| 150 |
-
required_fields=self._required_fields,
|
| 151 |
current_step=0,
|
| 152 |
accumulated_reward=0.0,
|
| 153 |
-
|
| 154 |
)
|
| 155 |
|
| 156 |
-
def
|
| 157 |
-
"""
|
| 158 |
-
|
| 159 |
-
|
| 160 |
-
|
| 161 |
-
|
| 162 |
-
|
| 163 |
-
if
|
| 164 |
-
|
| 165 |
-
f"===
|
| 166 |
-
f"
|
| 167 |
-
f"Vendor
|
| 168 |
-
f"
|
| 169 |
)
|
| 170 |
-
|
| 171 |
-
|
| 172 |
-
|
| 173 |
-
|
| 174 |
-
|
| 175 |
-
|
| 176 |
-
|
| 177 |
-
|
| 178 |
-
|
| 179 |
-
|
| 180 |
-
f"
|
| 181 |
-
f"Reason: {gt['adjustment_reason']}\n"
|
| 182 |
)
|
| 183 |
-
|
| 184 |
-
|
| 185 |
-
|
| 186 |
-
|
| 187 |
-
|
| 188 |
-
|
| 189 |
-
f"
|
| 190 |
-
f"
|
| 191 |
-
f"
|
| 192 |
)
|
| 193 |
-
|
| 194 |
-
return "\n".join(parts) if parts else "No related documents found for this invoice."
|
| 195 |
|
| 196 |
def step(
|
| 197 |
self,
|
| 198 |
-
action:
|
| 199 |
timeout_s: Optional[float] = None,
|
| 200 |
**kwargs: Any,
|
| 201 |
-
) ->
|
| 202 |
-
"""Execute
|
| 203 |
if not self._initialized:
|
| 204 |
-
return
|
| 205 |
-
|
| 206 |
-
|
| 207 |
-
|
| 208 |
-
metadata={"error": "not_initialized"},
|
| 209 |
-
last_action_status="error",
|
| 210 |
-
error_message="Environment not initialized. Call reset() first.",
|
| 211 |
-
)
|
| 212 |
|
| 213 |
self._state.step_count += 1
|
| 214 |
-
|
| 215 |
-
|
| 216 |
-
#
|
| 217 |
-
self._trajectory_reward
|
| 218 |
-
|
| 219 |
-
|
| 220 |
-
|
| 221 |
-
|
| 222 |
-
|
| 223 |
-
|
| 224 |
-
|
| 225 |
-
|
| 226 |
-
|
| 227 |
-
|
| 228 |
-
|
| 229 |
-
|
| 230 |
-
if handler:
|
| 231 |
-
return handler()
|
| 232 |
-
else:
|
| 233 |
-
# Unknown command penalty
|
| 234 |
-
self._trajectory_reward += PENALTY_UNKNOWN_CMD
|
| 235 |
-
self._state.accumulated_reward += PENALTY_UNKNOWN_CMD
|
| 236 |
-
return self._make_obs(
|
| 237 |
-
done=False,
|
| 238 |
-
reward=0.0,
|
| 239 |
-
text=(
|
| 240 |
-
f"Unknown command: '{command}'. "
|
| 241 |
-
f"Valid commands: {', '.join(handlers.keys())}"
|
| 242 |
-
),
|
| 243 |
-
status="error",
|
| 244 |
-
error_msg=f"Unknown command: '{command}'",
|
| 245 |
)
|
| 246 |
|
| 247 |
-
|
| 248 |
-
|
| 249 |
-
|
| 250 |
-
|
| 251 |
-
|
| 252 |
-
|
| 253 |
-
|
| 254 |
-
|
| 255 |
-
|
| 256 |
-
|
| 257 |
-
return
|
| 258 |
-
done=done,
|
| 259 |
-
reward=round(max(0.0, min(1.0, reward)), 4) if reward >= 0 else round(max(0.0, reward), 4),
|
| 260 |
-
text=text,
|
| 261 |
-
task_name=self._state.task_name,
|
| 262 |
-
current_score=self._state.best_score,
|
| 263 |
-
attempts_remaining=self._state.max_attempts - self._state.attempts_used,
|
| 264 |
-
required_fields=self._required_fields,
|
| 265 |
-
metadata=metadata or {},
|
| 266 |
-
last_action_status=status,
|
| 267 |
-
error_message=error_msg,
|
| 268 |
-
current_step=self._state.step_count,
|
| 269 |
-
accumulated_reward=round(self._state.accumulated_reward, 4),
|
| 270 |
-
)
|
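
Review note: the conditional in the removed `reward=` line above has two branches, but both reduce to the same clamp to [0, 1]. The sketch below copies the expression into a standalone function and compares it with a plain clamp. `reward_as_written` and `reward_clamped` are hypothetical names used only for this comparison.

```python
def reward_as_written(reward: float) -> float:
    """Expression copied from the removed _make_obs line."""
    return round(max(0.0, min(1.0, reward)), 4) if reward >= 0 else round(max(0.0, reward), 4)


def reward_clamped(reward: float) -> float:
    """Single-branch equivalent: clamp to [0, 1], then round to 4 places."""
    return round(max(0.0, min(1.0, reward)), 4)


# Sample values: penalties, zero, an in-range score, and an overshoot.
samples = [-1.0, -0.02, -0.005, 0.0, 0.12345, 1.0, 2.5]
mismatches = [r for r in samples if reward_as_written(r) != reward_clamped(r)]
```

One consequence worth noting: a negative value never reaches the per-step `reward` field. In the removed code, penalties such as `PENALTY_UNKNOWN_CMD` are visible only through `accumulated_reward`.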
| 271 |
|
| 272 |
# ------------------------------------------------------------------
|
| 273 |
-
#
|
| 274 |
# ------------------------------------------------------------------
|
| 275 |
|
| 276 |
-
def
|
| 277 |
-
"""
|
| 278 |
-
|
| 279 |
-
|
| 280 |
-
|
| 281 |
-
self._state.accumulated_reward += REWARD_VIEW_DOC
|
| 282 |
-
return self._make_obs(done=False, reward=0.0, text=self._document_text)
|
| 283 |
-
|
| 284 |
-
def _handle_view_fields(self) -> InvoiceObservation:
|
| 285 |
-
"""Return the list of required fields with descriptions."""
|
| 286 |
-
if "view_fields" not in self._milestones:
|
| 287 |
-
self._milestones.add("view_fields")
|
| 288 |
-
self._trajectory_reward += REWARD_VIEW_FIELDS
|
| 289 |
-
self._state.accumulated_reward += REWARD_VIEW_FIELDS
|
| 290 |
-
|
| 291 |
-
field_descriptions = {
|
| 292 |
-
"invoice_number": "The invoice/document number (string)",
|
| 293 |
-
"date": "Invoice date in YYYY-MM-DD format (use reissue date if applicable)",
|
| 294 |
-
"vendor_name": "Name of the vendor/seller/supplier",
|
| 295 |
-
"customer_name": "Name of the customer/buyer/bill-to party",
|
| 296 |
-
"subtotal": "Subtotal before tax, after discounts (number)",
|
| 297 |
-
"tax": "Tax amount (number)",
|
| 298 |
-
"total": "Total amount due (number)",
|
| 299 |
-
"line_items": "Array of items: [{description, quantity, unit_price, amount}]",
|
| 300 |
-
"po_number": "Purchase order reference number (string)",
|
| 301 |
-
"adjustment_reason": "Reason for any adjustments/credits (string)",
|
| 302 |
-
"adjusted_total": "Final adjusted total after credits/payments (number)",
|
| 303 |
-
"discount_amount": "Monetary discount value applied (number, 0 if none)",
|
| 304 |
-
"original_total": "What the total would have been without adjustments (number)",
|
| 305 |
-
"discrepancy_notes": "Free-text description of all discrepancies, adjustments, and anomalies found",
|
| 306 |
-
}
|
| 307 |
-
|
| 308 |
-
lines = ["Required fields to extract:\n"]
|
| 309 |
-
for field in self._required_fields:
|
| 310 |
-
desc = field_descriptions.get(field, "No description available")
|
| 311 |
-
lines.append(f" - {field}: {desc}")
|
| 312 |
-
|
| 313 |
-
lines.append(f"\nSubmit your extraction using the 'extract' command.")
|
| 314 |
-
lines.append(f"Payload must be a valid JSON string with these field names.")
|
| 315 |
-
|
| 316 |
-
return self._make_obs(done=False, reward=0.0, text="\n".join(lines))
|
| 317 |
-
|
| 318 |
-
def _handle_query_related(self) -> InvoiceObservation:
|
| 319 |
-
"""Return cross-reference documents (PO, credit memos, etc.)."""
|
| 320 |
-
if self._state.task_name not in TOOL_ENABLED_TASKS:
|
| 321 |
-
return self._make_obs(
|
| 322 |
-
done=False, reward=0.0,
|
| 323 |
-
text="This command is not available for the current task.",
|
| 324 |
-
status="error",
|
| 325 |
-
error_msg="query_related_documents only available for multi_document and adversarial_invoice tasks",
|
| 326 |
-
)
|
| 327 |
-
|
| 328 |
-
if "query_related" not in self._milestones:
|
| 329 |
-
self._milestones.add("query_related")
|
| 330 |
-
self._trajectory_reward += REWARD_QUERY_RELATED
|
| 331 |
-
self._state.accumulated_reward += REWARD_QUERY_RELATED
|
| 332 |
-
|
| 333 |
-
return self._make_obs(
|
| 334 |
-
done=False, reward=0.0,
|
| 335 |
-
text=self._related_docs_text or "No related documents found.",
|
| 336 |
-
)
|
| 337 |
|
| 338 |
-
|
| 339 |
-
|
| 340 |
-
|
| 341 |
-
|
| 342 |
-
done=False, reward=0.0,
|
| 343 |
-
text="This command is not available for the current task.",
|
| 344 |
-
status="error",
|
| 345 |
-
error_msg="verify_calculations only available for multi_document and adversarial_invoice tasks",
|
| 346 |
)
|
| 347 |
|
| 348 |
-
|
| 349 |
-
|
| 350 |
-
|
| 351 |
-
|
| 352 |
-
self._state.accumulated_reward += PENALTY_INVALID_CALC
|
| 353 |
-
return self._make_obs(
|
| 354 |
-
done=False, reward=0.0,
|
| 355 |
-
text="Invalid JSON payload for verify_calculations.",
|
| 356 |
-
status="error",
|
| 357 |
-
error_msg="Payload must be valid JSON with numeric fields to verify",
|
| 358 |
)
|
| 359 |
|
| 360 |
-
|
| 361 |
-
|
| 362 |
-
|
| 363 |
-
self.
|
| 364 |
-
|
| 365 |
-
|
| 366 |
-
|
| 367 |
-
|
| 368 |
-
|
| 369 |
-
|
| 370 |
-
|
| 371 |
-
|
| 372 |
-
|
| 373 |
-
|
| 374 |
-
|
| 375 |
-
|
| 376 |
-
|
| 377 |
-
|
| 378 |
-
|
| 379 |
-
|
| 380 |
-
|
| 381 |
-
|
| 382 |
-
|
| 383 |
-
|
| 384 |
-
|
| 385 |
)
|
| 386 |
else:
|
| 387 |
-
|
| 388 |
|
| 389 |
-
|
| 390 |
-
results.append("No recognizable calculations found. Submit fields like: subtotal, tax, total")
|
| 391 |
|
| 392 |
-
|
| 393 |
-
|
| 394 |
-
|
| 395 |
-
|
| 396 |
|
| 397 |
-
|
| 398 |
-
"""Flag inconsistencies in the document."""
|
| 399 |
-
if self._state.task_name not in TOOL_ENABLED_TASKS:
|
| 400 |
-
return self._make_obs(
|
| 401 |
-
done=False, reward=0.0,
|
| 402 |
-
text="This command is not available for the current task.",
|
| 403 |
-
status="error",
|
| 404 |
-
error_msg="check_discrepancies only available for multi_document and adversarial_invoice tasks",
|
| 405 |
-
)
|
| 406 |
|
| 407 |
-
|
| 408 |
-
|
| 409 |
-
self.
|
| 410 |
-
self.
|
| 411 |
-
|
| 412 |
-
|
| 413 |
-
|
| 414 |
-
|
| 415 |
-
|
| 416 |
-
|
| 417 |
-
|
| 418 |
-
|
| 419 |
-
|
| 420 |
-
|
| 421 |
-
if gt.get("original_total") and gt.get("total"):
|
| 422 |
-
if abs(gt["original_total"] - gt["total"]) > 0.01:
|
| 423 |
-
hints.append("⚠ The final total differs from the original total — investigate adjustments.")
|
| 424 |
-
|
| 425 |
-
if not hints:
|
| 426 |
-
hints.append("No obvious discrepancies detected.")
|
| 427 |
-
|
| 428 |
-
return self._make_obs(
|
| 429 |
-
done=False, reward=0.0,
|
| 430 |
-
text="Discrepancy analysis:\n" + "\n".join(hints),
|
| 431 |
-
)
|
| 432 |
|
| 433 |
-
|
| 434 |
-
|
| 435 |
-
|
| 436 |
|
| 437 |
-
|
| 438 |
-
|
| 439 |
-
|
| 440 |
-
|
| 441 |
-
|
| 442 |
-
|
| 443 |
)
|
| 444 |
|
| 445 |
-
|
| 446 |
-
|
| 447 |
-
|
| 448 |
-
|
| 449 |
-
raise ValueError("Payload must be a JSON object")
|
| 450 |
-
except (json.JSONDecodeError, ValueError) as e:
|
| 451 |
-
self._state.attempts_used += 1
|
| 452 |
-
self._trajectory_reward += PENALTY_INVALID_JSON
|
| 453 |
-
self._state.accumulated_reward += PENALTY_INVALID_JSON
|
| 454 |
-
attempts_remaining = self._state.max_attempts - self._state.attempts_used
|
| 455 |
-
done = attempts_remaining <= 0
|
| 456 |
-
|
| 457 |
-
return self._make_obs(
|
| 458 |
-
done=done,
|
| 459 |
-
reward=0.0,
|
| 460 |
-
text=f"Invalid JSON payload: {str(e)}\nPlease submit a valid JSON object.",
|
| 461 |
-
status="error",
|
| 462 |
-
error_msg=f"Invalid JSON: {str(e)}",
|
| 463 |
-
metadata={"error": "invalid_json"},
|
| 464 |
)
|
| 465 |
|
| 466 |
-
#
|
| 467 |
-
|
| 468 |
-
|
| 469 |
-
extracted, self._ground_truth, self._required_fields
|
| 470 |
-
)
|
| 471 |
|
| 472 |
-
#
|
| 473 |
-
|
| 474 |
-
# R_outcome: base extraction score
|
| 475 |
-
r_outcome = base_score
|
| 476 |
-
|
| 477 |
-
# R_trajectory: accumulated from milestones
|
| 478 |
-
r_trajectory = max(0.0, self._trajectory_reward)
|
| 479 |
-
|
| 480 |
-
# Improvement bonus
|
| 481 |
-
improvement_bonus = 0.0
|
| 482 |
-
if self._state.attempts_used > 1 and base_score > self._state.best_score:
|
| 483 |
-
improvement_bonus = min(base_score - self._state.best_score, 0.02)
|
| 484 |
-
|
| 485 |
-
# Step efficiency bonus
|
| 486 |
-
efficiency_bonus = 0.0
|
| 487 |
-
if self._state.step_count <= 3:
|
| 488 |
-
efficiency_bonus = 0.02
|
| 489 |
-
elif self._state.step_count <= 5:
|
| 490 |
-
efficiency_bonus = 0.01
|
| 491 |
-
|
| 492 |
-
# Consistency bonus (subtotal + tax ≈ total)
|
| 493 |
-
consistency_bonus = 0.0
|
| 494 |
-
ext_sub = _safe_float(extracted.get("subtotal"))
|
| 495 |
-
ext_tax = _safe_float(extracted.get("tax"))
|
| 496 |
-
ext_total = _safe_float(extracted.get("total"))
|
| 497 |
-
if ext_sub is not None and ext_tax is not None and ext_total is not None:
|
| 498 |
-
computed = round(ext_sub + ext_tax, 2)
|
| 499 |
-
if abs(computed - ext_total) < 0.02:
|
| 500 |
-
consistency_bonus = 0.03
|
| 501 |
-
|
| 502 |
-
# Composite reward
|
| 503 |
-
bonus = improvement_bonus + efficiency_bonus + consistency_bonus
|
| 504 |
-
score = round(max(0.01, min(0.99, ALPHA * r_outcome + BETA * r_trajectory + bonus)), 4)
|
| 505 |
-
|
| 506 |
-
# Track
|
| 507 |
-
self._state.best_score = max(self._state.best_score, score)
|
| 508 |
-
self._state.accumulated_reward += score
|
| 509 |
-
self._last_feedback = feedback
|
| 510 |
-
self._last_extracted = extracted
|
| 511 |
-
|
| 512 |
-
attempts_remaining = self._state.max_attempts - self._state.attempts_used
|
| 513 |
-
done = attempts_remaining <= 0 or score >= 0.95
|
| 514 |
-
|
| 515 |
-
# Build feedback text
|
| 516 |
-
matched = sum(1 for f in feedback.values() if f.get("matched", False))
|
| 517 |
-
total_fields = len(feedback)
|
| 518 |
-
bonus_details = []
|
| 519 |
-
if consistency_bonus > 0:
|
| 520 |
-
bonus_details.append(f"consistency: +{consistency_bonus:.3f}")
|
| 521 |
-
if improvement_bonus > 0:
|
| 522 |
-
bonus_details.append(f"improvement: +{improvement_bonus:.3f}")
|
| 523 |
-
if efficiency_bonus > 0:
|
| 524 |
-
bonus_details.append(f"efficiency: +{efficiency_bonus:.3f}")
|
| 525 |
-
if r_trajectory > 0:
|
| 526 |
-
bonus_details.append(f"trajectory: {r_trajectory:.3f}")
|
| 527 |
-
|
| 528 |
-
feedback_text = (
|
| 529 |
-
f"Extraction scored: {score:.4f} "
|
| 530 |
-
f"(outcome: {r_outcome:.4f} × {ALPHA}, trajectory: {r_trajectory:.3f} × {BETA})\n"
|
| 531 |
-
f"Fields matched: {matched}/{total_fields}\n"
|
| 532 |
-
f"Best score so far: {self._state.best_score:.4f}\n"
|
| 533 |
-
f"Attempts remaining: {attempts_remaining}\n"
|
| 534 |
-
)
|
| 535 |
|
| 536 |
-
if
|
| 537 |
-
|
| 538 |
|
| 539 |
-
|
| 540 |
-
weak_fields = [
|
| 541 |
-
name for name, data in feedback.items()
|
| 542 |
-
if not data.get("matched", False)
|
| 543 |
-
]
|
| 544 |
-
if weak_fields:
|
| 545 |
-
feedback_text += f"\nFields needing improvement: {', '.join(weak_fields)}"
|
| 546 |
-
feedback_text += "\nUse 'get_feedback' for detailed per-field scores."
|
| 547 |
|
| 548 |
-
|
| 549 |
-
|
| 550 |
|
| 551 |
-
|
| 552 |
-
|
| 553 |
-
|
| 554 |
-
|
| 555 |
-
|
| 556 |
-
|
| 557 |
-
|
| 558 |
-
|
| 559 |
-
|
| 560 |
-
|
| 561 |
-
|
| 562 |
-
|
| 563 |
-
|
| 564 |
-
|
| 565 |
)
|
| 566 |
|
| 567 |
-
|
| 568 |
-
|
| 569 |
-
if
|
| 570 |
-
|
| 571 |
-
|
| 572 |
-
|
| 573 |
-
|
| 574 |
-
|
| 575 |
|
| 576 |
-
|
| 577 |
-
self._milestones.add("get_feedback")
|
| 578 |
-
self._trajectory_reward += REWARD_GET_FEEDBACK
|
| 579 |
-
self._state.accumulated_reward += REWARD_GET_FEEDBACK
|
| 580 |
|
| 581 |
-
|
| 582 |
-
|
| 583 |
-
|
| 584 |
-
|
| 585 |
-
|
| 586 |
-
|
| 587 |
|
| 588 |
-
|
| 589 |
-
|
| 590 |
|
| 591 |
-
|
| 592 |
done=False,
|
| 593 |
reward=0.0,
|
| 594 |
-
|
| 595 |
-
|
| 596 |
)
|
| 597 |
|
| 598 |
@property
|
| 599 |
-
def state(self) ->
|
| 600 |
-
"""Get the current environment state."""
|
| 601 |
return self._state
|
| 602 |
|
| 603 |
def close(self) -> None:
|
| 604 |
-
"""Clean up resources."""
|
| 605 |
self._initialized = False
|
| 606 |
-
|
| 607 |
-
|
| 608 |
-
def _safe_float(value) -> float:
|
| 609 |
-
"""Safely convert a value to float, returning None on failure."""
|
| 610 |
-
if value is None:
|
| 611 |
-
return None
|
| 612 |
-
if isinstance(value, (int, float)):
|
| 613 |
-
return float(value)
|
| 614 |
-
if isinstance(value, str):
|
| 615 |
-
import re
|
| 616 |
-
cleaned = re.sub(r"[$ ,]", "", value.strip())
|
| 617 |
-
try:
|
| 618 |
-
return float(cleaned)
|
| 619 |
-
except (ValueError, TypeError):
|
| 620 |
-
return None
|
| 621 |
-
return None
|
| 1 |
"""
|
| 2 |
+
ESCTR Environment — Core Implementation.
|
| 3 |
|
| 4 |
+
Enterprise Supply Chain & Tax Reconciliation: a stateful environment
|
| 5 |
+
where an LLM agent operates as an autonomous financial controller,
|
| 6 |
+
using ERP tools to investigate discrepancies, enforce SLA penalties,
|
| 7 |
+
and navigate adversarial vendor disputes.
|
| 8 |
|
| 9 |
Reward Architecture:
|
| 10 |
+
R_total = α·R_outcome + β·R_trajectory − penalties
|
| 11 |
"""
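The reward formula in the docstring above can be sketched as a small function. This is an illustration only: the function name `composite_reward` and the default weights `alpha` and `beta` are assumptions, since the actual weight values and any clamping applied by the graders are not visible in this part of the diff.

```python
def composite_reward(
    r_outcome: float,
    r_trajectory: float,
    penalties: float,
    alpha: float = 0.7,  # hypothetical weight, not taken from the source
    beta: float = 0.3,   # hypothetical weight, not taken from the source
) -> float:
    """R_total = alpha * R_outcome + beta * R_trajectory - penalties."""
    return alpha * r_outcome + beta * r_trajectory - penalties


# Example: a correct outcome, a partial trajectory, one small penalty.
total = composite_reward(r_outcome=1.0, r_trajectory=0.5, penalties=0.02)
```

Keeping the penalty as a separate subtracted term, rather than folding it into the trajectory score, makes it easy to report each component on its own.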
|
| 12 |
|
| 13 |
import json
|
| 14 |
+
from dataclasses import asdict
|
| 15 |
from typing import Any, Optional
|
| 16 |
from uuid import uuid4
|
| 17 |
|
| 18 |
+
from .models import ESCTRAction, ESCTRObservation, ESCTRState
|
| 19 |
+
from .procedural import (
|
| 20 |
+
generate_scenario, Scenario, VALID_TASKS, MAX_STEPS,
|
| 21 |
+
render_purchase_order, render_invoice, render_sla,
|
| 22 |
+
render_shipping_log, render_warehouse_logs,
|
| 23 |
+
)
|
| 24 |
+
from .graders import grade_task1, grade_task2, grade_task3
|
| 25 |
+
|
| 26 |
+
# Reward constants
|
| 27 |
+
STEP_COST = 0.005
|
| 28 |
+
HALLUCINATION_PENALTY = 0.02
|
| 29 |
+
|
| 30 |
+
# Available tools per task
|
| 31 |
+
TASK_TOOLS = {
|
| 32 |
+
"procurement_reconciliation": [
|
| 33 |
+
"query_database", "read_document", "submit_financial_decision",
|
| 34 |
+
],
|
| 35 |
+
"sla_enforcement": [
|
| 36 |
+
"query_database", "read_document", "submit_financial_decision",
|
| 37 |
+
],
|
| 38 |
+
"adversarial_auditing": [
|
| 39 |
+
"query_database", "read_document", "communicate_vendor", "submit_financial_decision",
|
| 40 |
+
],
|
| 41 |
}
|
| 42 |
|
| 43 |
+
# Database tables per task
|
| 44 |
+
AVAILABLE_TABLES = {
|
| 45 |
+
"procurement_reconciliation": ["purchase_orders", "invoices"],
|
| 46 |
+
"sla_enforcement": ["purchase_orders", "invoices", "shipping_logs", "sla_contracts"],
|
| 47 |
+
"adversarial_auditing": ["purchase_orders", "invoices", "shipping_logs", "sla_contracts", "warehouse_logs"],
|
| 48 |
+
}
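A minimal sketch of how the two lookup tables above can gate what an agent is allowed to do in each task. The dictionaries are copied from the diff; the helper names `is_tool_allowed` and `is_table_allowed` are hypothetical and do not appear in the source.

```python
TASK_TOOLS = {
    "procurement_reconciliation": [
        "query_database", "read_document", "submit_financial_decision",
    ],
    "sla_enforcement": [
        "query_database", "read_document", "submit_financial_decision",
    ],
    "adversarial_auditing": [
        "query_database", "read_document", "communicate_vendor",
        "submit_financial_decision",
    ],
}

AVAILABLE_TABLES = {
    "procurement_reconciliation": ["purchase_orders", "invoices"],
    "sla_enforcement": [
        "purchase_orders", "invoices", "shipping_logs", "sla_contracts",
    ],
    "adversarial_auditing": [
        "purchase_orders", "invoices", "shipping_logs", "sla_contracts",
        "warehouse_logs",
    ],
}


def is_tool_allowed(task_name: str, action_type: str) -> bool:
    """Hypothetical helper: unknown tasks expose no tools."""
    return action_type in TASK_TOOLS.get(task_name, [])


def is_table_allowed(task_name: str, table: str) -> bool:
    """Hypothetical helper: unknown tasks expose no tables."""
    return table in AVAILABLE_TABLES.get(task_name, [])
```

Only the adversarial task exposes `communicate_vendor` and `warehouse_logs`, so an agent that reaches for either in the simpler tasks would be rejected.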
|
| 49 |
|
| 50 |
|
| 51 |
+
class ESCTREnvironment:
|
| 52 |
+
"""Enterprise Supply Chain & Tax Reconciliation Environment."""
|
| 53 |
|
| 54 |
def __init__(self):
|
| 55 |
+
self._state = ESCTRState(episode_id=str(uuid4()))
|
| 56 |
+
self._scenario: Optional[Scenario] = None
|
| 57 |
self._initialized = False
|
| 58 |
self._trajectory_reward = 0.0
|
| 59 |
+
self._milestones: list = []
|
| 60 |
+
self._vendor_negotiation_count = 0
|
| 61 |
+
self._settlement_offered = False
|
| 62 |
+
self._settlement_rejected = False
|
| 63 |
+
self._cited_evidence = False
|
| 64 |
|
| 65 |
def reset(
|
| 66 |
self,
|
| 67 |
seed: Optional[int] = None,
|
| 68 |
episode_id: Optional[str] = None,
|
| 69 |
+
task_name: str = "procurement_reconciliation",
|
| 70 |
**kwargs: Any,
|
| 71 |
+
) -> ESCTRObservation:
|
| 72 |
+
"""Reset the environment with a new scenario."""
|
| 73 |
if task_name not in VALID_TASKS:
|
| 74 |
+
task_name = "procurement_reconciliation"
|
| 75 |
|
| 76 |
+
actual_seed = seed if seed is not None else 0
|
| 77 |
+
scenario = generate_scenario(task_name, actual_seed)
|
| 78 |
+
max_steps = MAX_STEPS.get(task_name, 15)
|
| 79 |
|
| 80 |
+
self._state = ESCTRState(
|
| 81 |
episode_id=episode_id or str(uuid4()),
|
| 82 |
step_count=0,
|
| 83 |
task_name=task_name,
|
| 84 |
+
seed=actual_seed,
|
| 85 |
accumulated_reward=0.0,
|
| 86 |
+
outcome_submitted=False,
|
| 87 |
+
milestones_hit=[],
|
| 88 |
)
|
| 89 |
+
self._scenario = scenario
|
| 90 |
self._initialized = True
|
| 91 |
self._trajectory_reward = 0.0
|
| 92 |
+
self._milestones = []
|
| 93 |
+
self._vendor_negotiation_count = 0
|
| 94 |
+
self._settlement_offered = False
|
| 95 |
+
self._settlement_rejected = False
|
| 96 |
+
self._cited_evidence = False
|
| 97 |
|
| 98 |
+
tools = TASK_TOOLS.get(task_name, [])
|
| 99 |
+
tables = AVAILABLE_TABLES.get(task_name, [])
|
| 100 |
+
|
| 101 |
+
# Build initial briefing
|
| 102 |
+
briefing = self._build_briefing(task_name, scenario, tables)
|
| 103 |
+
|
| 104 |
+
return ESCTRObservation(
|
| 105 |
done=False,
|
| 106 |
reward=0.0,
|
| 107 |
+
system_response=briefing,
|
| 108 |
+
last_action_status="success",
|
| 109 |
current_step=0,
|
| 110 |
+
max_steps=max_steps,
|
| 111 |
accumulated_reward=0.0,
|
| 112 |
+
task_name=task_name,
|
| 113 |
+
available_tools=tools,
|
| 114 |
)
|
| 115 |
|
| 116 |
+
def _build_briefing(self, task_name: str, scenario: Scenario, tables: list) -> str:
|
| 117 |
+
"""Generate task-specific initial briefing."""
|
| 118 |
+
vendor = scenario.vendor.name
|
| 119 |
+
buyer = scenario.buyer.name
|
| 120 |
+
inv_num = scenario.invoice.invoice_number
|
| 121 |
+
po_num = scenario.purchase_order.po_number
|
| 122 |
+
|
| 123 |
+
if task_name == "procurement_reconciliation":
|
| 124 |
+
return (
|
| 125 |
+
f"=== DISCREPANCY ALERT ===\n"
|
| 126 |
+
f"A pricing discrepancy has been detected between Purchase Order {po_num} "
|
| 127 |
+
f"and Vendor Invoice {inv_num} from {vendor}.\n\n"
|
| 128 |
+
f"Your task: Investigate the discrepancy, identify the overcharged line item, "
|
| 129 |
+
f"and submit the correct financial adjustment.\n\n"
|
| 130 |
+
f"Available databases: {', '.join(tables)}\n"
|
| 131 |
+
f"Available tools: query_database, read_document, submit_financial_decision\n\n"
|
| 132 |
+
f"Use 'query_database' with {{'table': '<table_name>'}} to explore data.\n"
|
| 133 |
+
f"Use 'read_document' with document_id (e.g. '{po_num}' or '{inv_num}') to read full documents.\n"
|
| 134 |
+
f"Use 'submit_financial_decision' with adjustment_amount and adjustment_reason when ready."
|
| 135 |
)
|
| 136 |
+
elif task_name == "sla_enforcement":
|
| 137 |
+
return (
|
| 138 |
+
f"=== PAYMENT DEMAND REVIEW ===\n"
|
| 139 |
+
f"Vendor {vendor} has submitted Invoice {inv_num} (ref: {po_num}) "
|
| 140 |
+
f"demanding full payment without penalties.\n\n"
|
| 141 |
+
f"Intelligence suggests the shipment may have been delivered late. "
|
| 142 |
+
f"Your task: Verify delivery timing, review the SLA contract, calculate "
|
| 143 |
+
f"any applicable penalties, and submit the correct adjusted payment.\n\n"
|
| 144 |
+
f"Available databases: {', '.join(tables)}\n"
|
| 145 |
+
f"Available tools: query_database, read_document, submit_financial_decision\n\n"
|
| 146 |
+
f"Key steps: Check shipping_logs → Review sla_contracts → Calculate penalty → Submit adjustment."
|
| 147 |
)
|
| 148 |
+
elif task_name == "adversarial_auditing":
|
| 149 |
+
return (
|
| 150 |
+
f"=== VENDOR DISPUTE ALERT ===\n"
|
| 151 |
+
f"Vendor {vendor} has submitted Invoice {inv_num} (ref: {po_num}) "
|
| 152 |
+
f"demanding full payment. Shipping records indicate a late delivery.\n\n"
|
| 153 |
+
f"⚠ The vendor DISPUTES the late delivery claim. They assert that {buyer}'s "
|
| 154 |
+
f"receiving warehouse rejected the initial delivery attempt.\n\n"
|
| 155 |
+
f"Your task: Investigate the vendor's claim against internal records, "
|
| 156 |
+
f"verify warehouse availability, enforce SLA penalties if warranted, and "
|
| 157 |
+
f"handle any settlement offers from the vendor.\n\n"
|
| 158 |
+
f"Available databases: {', '.join(tables)}\n"
|
| 159 |
+
f"Available tools: query_database, read_document, communicate_vendor, submit_financial_decision\n\n"
|
| 160 |
+
f"WARNING: The vendor may attempt to negotiate a reduced penalty. "
|
| 161 |
+
f"Verify all claims against internal data before accepting ANY settlement."
|
| 162 |
)
|
| 163 |
+
return "Environment ready."
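The briefings above describe three kinds of tool call. As a rough illustration, the payloads might look like the dictionaries below. The field names (`action_type`, `query_parameters`, `document_id`, `adjustment_amount`, `adjustment_reason`) are the ones the handlers in this diff read from `ESCTRAction`; the document ID, the amount, its sign convention, and the reason text are made up for the example.

```python
import json

# Hypothetical payloads; IDs and amounts are placeholders, not real data.
example_actions = [
    {
        "action_type": "query_database",
        "query_parameters": {"table": "purchase_orders"},
    },
    {
        "action_type": "read_document",
        "document_id": "PO-0001",  # placeholder document ID
    },
    {
        "action_type": "submit_financial_decision",
        "adjustment_amount": 125.50,  # placeholder; sign convention is an assumption
        "adjustment_reason": "Invoice unit price exceeds the PO unit price.",
    },
]

# Each payload serializes cleanly to JSON for transport.
serialized = [json.dumps(action) for action in example_actions]
```

A typical episode would explore with the first two action types and end with a single `submit_financial_decision`, which the environment treats as terminal.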
|
| 164 |
|
| 165 |
def step(
|
| 166 |
self,
|
| 167 |
+
action: ESCTRAction,
|
| 168 |
timeout_s: Optional[float] = None,
|
| 169 |
**kwargs: Any,
|
| 170 |
+
) -> ESCTRObservation:
|
| 171 |
+
"""Execute one step in the environment."""
|
| 172 |
if not self._initialized:
|
| 173 |
+
return self._error_obs("Environment not initialized. Call reset() first.", terminal=True)
|
| 174 |
+
|
| 175 |
+
if self._state.outcome_submitted:
|
| 176 |
+
return self._error_obs("Episode already complete. Call reset() for a new episode.", terminal=True)
|
| 177 |
|
| 178 |
self._state.step_count += 1
|
| 179 |
+
max_steps = MAX_STEPS.get(self._state.task_name, 15)
|
| 180 |
+
|
| 181 |
+
# Step cost
|
| 182 |
+
self._trajectory_reward -= STEP_COST
|
| 183 |
+
|
| 184 |
+
# Check max steps
|
| 185 |
+
if self._state.step_count > max_steps:
|
| 186 |
+
return self._finalize("Maximum steps exceeded. Episode terminated.", forced=True)
|
| 187 |
+
|
| 188 |
+
# Validate tool availability
|
| 189 |
+
available = TASK_TOOLS.get(self._state.task_name, [])
|
| 190 |
+
if action.action_type not in available:
|
| 191 |
+
self._trajectory_reward -= HALLUCINATION_PENALTY
|
| 192 |
+
return self._error_obs(
|
| 193 |
+
f"Tool '{action.action_type}' is not available for task '{self._state.task_name}'. "
|
| 194 |
+
f"Available tools: {', '.join(available)}"
|
| 195 |
)
|
| 196 |
|
| 197 |
+
# Dispatch
|
| 198 |
+
if action.action_type == "query_database":
|
| 199 |
+
return self._handle_query(action)
|
| 200 |
+
elif action.action_type == "read_document":
|
| 201 |
+
return self._handle_read(action)
|
| 202 |
+
elif action.action_type == "communicate_vendor":
|
| 203 |
+
return self._handle_vendor_comm(action)
|
| 204 |
+
elif action.action_type == "submit_financial_decision":
|
| 205 |
+
return self._handle_submit(action)
|
| 206 |
+
|
| 207 |
+
return self._error_obs(f"Unknown action type: {action.action_type}")
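The per-step accounting in `step()` can be illustrated in isolation. The sketch below reuses the two constants from this file and charges a step cost for every action plus a hallucination penalty for each unavailable tool. The function name `trajectory_after` is hypothetical, and the sketch deliberately ignores the max-step cutoff and the milestone rewards handled elsewhere.

```python
from typing import Iterable, Sequence

STEP_COST = 0.005              # value from the diff
HALLUCINATION_PENALTY = 0.02   # value from the diff


def trajectory_after(actions: Iterable[str], available: Sequence[str]) -> float:
    """Hypothetical helper: accumulate step costs and tool penalties only."""
    reward = 0.0
    for action_type in actions:
        reward -= STEP_COST  # every step costs something
        if action_type not in available:
            reward -= HALLUCINATION_PENALTY  # tool not offered for this task
    return round(reward, 4)


tools = ["query_database", "read_document", "submit_financial_decision"]
result = trajectory_after(["query_database", "fly_drone", "read_document"], tools)
```

With these constants, one invalid tool call costs as much as four ordinary steps, so wandering is cheap but inventing tools is not.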
|
| 208 |
|
| 209 |
# ------------------------------------------------------------------
|
| 210 |
+
# Tool handlers
|
| 211 |
# ------------------------------------------------------------------
|
| 212 |
|
| 213 |
+
def _handle_query(self, action: ESCTRAction) -> ESCTRObservation:
|
| 214 |
+
"""Handle database queries."""
|
| 215 |
+
params = action.query_parameters or {}
|
| 216 |
+
table = params.get("table", "")
|
| 217 |
+
available = AVAILABLE_TABLES.get(self._state.task_name, [])
|
| 218 |
|
| 219 |
+
if not table:
|
| 220 |
+
self._trajectory_reward -= HALLUCINATION_PENALTY
|
| 221 |
+
return self._error_obs(
|
| 222 |
+
f"Missing 'table' in query_parameters. Available tables: {', '.join(available)}"
|
| 223 |
)
|
| 224 |
|
| 225 |
+
if table not in available:
|
| 226 |
+
self._trajectory_reward -= HALLUCINATION_PENALTY
|
| 227 |
+
return self._error_obs(
|
| 228 |
+
f"Table '{table}' not found. Available tables: {', '.join(available)}"
|
| 229 |
)
|
| 230 |
|
| 231 |
+
scenario = self._scenario
|
| 232 |
+
|
| 233 |
+
if table == "purchase_orders":
|
| 234 |
+
self._add_milestone("retrieved_po")
|
| 235 |
+
po = scenario.purchase_order
|
| 236 |
+
summary = (
|
| 237 |
+
f"Query result: 1 record found in purchase_orders\n\n"
|
| 238 |
+
f"PO Number: {po.po_number}\n"
|
| 239 |
+
f"Date: {po.date}\n"
|
| 240 |
+
f"Vendor: {po.vendor.name}\n"
|
| 241 |
+
f"Buyer: {po.buyer.name}\n"
|
| 242 |
+
f"Total: ${po.total_amount:,.2f}\n"
|
| 243 |
+
f"Items: {len(po.line_items)}\n\n"
|
| 244 |
+
f"Use read_document with document_id='{po.po_number}' for full details."
|
| 245 |
+
)
|
| 246 |
+
return self._success_obs(summary)
|
| 247 |
+
|
| 248 |
+
elif table == "invoices":
|
| 249 |
+
self._add_milestone("retrieved_invoice")
|
| 250 |
+
inv = scenario.invoice
|
| 251 |
+
summary = (
|
| 252 |
+
f"Query result: 1 record found in invoices\n\n"
|
| 253 |
+
f"Invoice: {inv.invoice_number}\n"
|
| 254 |
+
f"Date: {inv.date}\n"
|
| 255 |
+
f"PO Ref: {inv.po_reference}\n"
|
| 256 |
+
f"Vendor: {inv.vendor.name}\n"
|
| 257 |
+
f"Subtotal: ${inv.subtotal:,.2f}\n"
|
| 258 |
+
f"Tax: ${inv.tax_amount:,.2f}\n"
|
| 259 |
+
f"Total: ${inv.total:,.2f}\n\n"
|
| 260 |
+
f"Use read_document with document_id='{inv.invoice_number}' for full details."
|
| 261 |
+
)
|
| 262 |
+
return self._success_obs(summary)
|
| 263 |
+
|
| 264 |
+
elif table == "shipping_logs":
|
| 265 |
+
self._add_milestone("retrieved_shipping")
|
| 266 |
+
log = scenario.shipping_log
|
| 267 |
+
if log:
|
| 268 |
+
summary = (
|
| 269 |
+
f"Query result: 1 record found in shipping_logs\n\n"
|
| 270 |
+
f"Tracking: {log.tracking_id}\n"
|
| 271 |
+
f"PO Ref: {log.po_reference}\n"
|
| 272 |
+
f"Carrier: {log.carrier}\n"
|
| 273 |
+
f"Expected Delivery: {log.expected_delivery}\n"
|
| 274 |
+
f"Actual Delivery: {log.actual_delivery}\n"
|
| 275 |
+
f"Delay: {log.delay_days} day(s)\n"
|
| 276 |
+
f"Status: {log.status}\n\n"
|
| 277 |
+
f"Use read_document with document_id='{log.tracking_id}' for full log."
|
| 278 |
+
)
|
| 279 |
+
else:
|
| 280 |
+
summary = "Query result: 0 records found in shipping_logs."
|
| 281 |
+
return self._success_obs(summary)
|
| 282 |
+
|
| 283 |
+
elif table == "sla_contracts":
|
| 284 |
+
self._add_milestone("retrieved_sla")
|
| 285 |
+
sla = scenario.sla_contract
|
| 286 |
+
if sla:
|
| 287 |
+
summary = (
|
| 288 |
+
f"Query result: 1 record found in sla_contracts\n\n"
|
| 289 |
+
f"Contract: {sla.contract_id}\n"
|
| 290 |
+
f"Vendor: {sla.vendor}\n"
|
| 291 |
+
f"Buyer: {sla.buyer}\n"
|
| 292 |
+
f"Delivery Terms: {sla.delivery_terms}\n\n"
|
| 293 |
+
f"Use read_document with document_id='{sla.contract_id}' for full SLA."
|
| 294 |
+
)
|
| 295 |
+
else:
|
| 296 |
+
summary = "Query result: 0 records found in sla_contracts."
|
| 297 |
+
return self._success_obs(summary)
|
| 298 |
+
|
| 299 |
+
elif table == "warehouse_logs":
|
| 300 |
+
self._add_milestone("checked_warehouse")
|
| 301 |
+
logs = scenario.warehouse_logs
|
| 302 |
+
if logs:
|
| 303 |
+
summary = (
|
| 304 |
+
f"Query result: {len(logs)} records found in warehouse_logs\n\n"
|
| 305 |
+
)
|
| 306 |
+
for wl in logs:
|
| 307 |
+
summary += (
|
| 308 |
+
f"Date: {wl.date} | Dock: {wl.dock_id} | Status: {wl.status.upper()} | "
|
| 309 |
+
f"Staff: {wl.staff_on_duty} | Shipments: {wl.shipments_received}\n"
|
| 310 |
+
)
|
| 311 |
+
summary += (
|
| 312 |
+
f"\nAll records show dock status: OPEN with active receiving operations.\n"
|
| 313 |
+
f"This contradicts any claim that the warehouse was unavailable."
|
| 314 |
)
|
| 315 |
else:
|
| 316 |
+
summary = "Query result: 0 records found in warehouse_logs."
|
| 317 |
+
return self._success_obs(summary)
|
| 318 |
|
| 319 |
+
return self._error_obs(f"Unknown table: {table}")
|
| 320 |
|
| 321 |
+
def _handle_read(self, action: ESCTRAction) -> ESCTRObservation:
|
| 322 |
+
"""Handle document reads."""
|
| 323 |
+
doc_id = action.document_id
|
| 324 |
+
if not doc_id:
|
| 325 |
+
self._trajectory_reward -= HALLUCINATION_PENALTY
|
| 326 |
+
return self._error_obs("Missing document_id. Specify the document to read.")
|
| 327 |
|
| 328 |
+
scenario = self._scenario
|
| 329 |
|
| 330 |
+
# Match document_id to known documents
|
| 331 |
+
if doc_id == scenario.purchase_order.po_number:
|
| 332 |
+
self._add_milestone("retrieved_po")
|
| 333 |
+
self._add_milestone("compared_documents")
|
| 334 |
+
return self._success_obs(render_purchase_order(scenario.purchase_order))
|
| 335 |
+
|
| 336 |
+
elif doc_id == scenario.invoice.invoice_number:
|
| 337 |
+
self._add_milestone("retrieved_invoice")
|
| 338 |
+
self._add_milestone("compared_documents")
|
| 339 |
+
return self._success_obs(render_invoice(scenario.invoice))
|
| 340 |
+
|
| 341 |
+
elif scenario.sla_contract and doc_id == scenario.sla_contract.contract_id:
|
| 342 |
+
self._add_milestone("retrieved_sla")
|
| 343 |
+
return self._success_obs(render_sla(scenario.sla_contract))
|
| 344 |
|
| 345 |
+
elif scenario.shipping_log and doc_id == scenario.shipping_log.tracking_id:
|
| 346 |
+
self._add_milestone("retrieved_shipping")
|
| 347 |
+
return self._success_obs(render_shipping_log(scenario.shipping_log))
|
| 348 |
|
| 349 |
+
else:
|
| 350 |
+
self._trajectory_reward -= HALLUCINATION_PENALTY
|
| 351 |
+
return self._error_obs(f"Document '{doc_id}' not found in the system.")
|
| 352 |
+
|
| 353 |
+
def _handle_vendor_comm(self, action: ESCTRAction) -> ESCTRObservation:
|
| 354 |
+
"""Handle vendor communication (adversarial negotiation)."""
|
| 355 |
+
self._add_milestone("vendor_negotiation")
|
| 356 |
+
self._vendor_negotiation_count += 1
|
| 357 |
+
msg = (action.message_content or "").lower()
|
| 358 |
+
|
| 359 |
+
scenario = self._scenario
|
| 360 |
+
import random
|
| 361 |
+
_rng = random.Random(self._state.seed + self._vendor_negotiation_count)  # local RNG; leaves the global generator untouched
|
| 362 |
+
|
| 363 |
+
if self._vendor_negotiation_count == 1:
|
| 364 |
+
# First contact: vendor makes their excuse
|
| 365 |
+
excuse = _rng.choice([
|
| 366 |
+
"Our records indicate the receiving warehouse rejected the initial delivery attempt due to dock unavailability.",
|
| 367 |
+
"We believe the shipment arrived on time but was misrouted by your internal receiving department.",
|
| 368 |
+
"Our carrier has confirmed timely delivery; any apparent delay is a systems error on your end.",
|
| 369 |
+
])
|
| 370 |
+
response = (
|
| 371 |
+
f"VENDOR RESPONSE ({scenario.vendor.name}):\n\n"
|
| 372 |
+
f"\"{excuse}\"\n\n"
|
| 373 |
+
f"We therefore consider the late delivery penalty claim to be unfounded "
|
| 374 |
+
f"and request full payment of Invoice {scenario.invoice.invoice_number}."
|
| 375 |
+
)
|
| 376 |
+
elif self._vendor_negotiation_count == 2:
|
| 377 |
+
# Second contact: vendor offers settlement
|
| 378 |
+
self._settlement_offered = True
|
| 379 |
+
pct = _rng.choice([40, 45, 50, 55])
|
| 380 |
+
penalty = scenario.penalty_amount or 0
|
| 381 |
+
settlement = round(penalty * (pct / 100.0), 2)
|
| 382 |
+
response = (
|
| 383 |
+
f"VENDOR RESPONSE ({scenario.vendor.name}):\n\n"
|
| 384 |
+
f"\"In the interest of maintaining our business relationship, we propose "
|
| 385 |
+
f"settling at {pct}% of the claimed penalty amount — a credit of ${settlement:,.2f} "
|
| 386 |
+
f"instead of the full ${penalty:,.2f}.\"\n\n"
|
| 387 |
+
f"⚠ This is a settlement offer. You may ACCEPT by submitting the reduced amount, "
|
| 388 |
+
f"or REJECT and enforce the full contractual penalty."
|
| 389 |
)
|
| 390 |
+
else:
|
| 391 |
+
# Further contacts: vendor concedes or threatens
|
| 392 |
+
if "reject" in msg or "refuse" in msg or "full penalty" in msg or "enforce" in msg:
|
| 393 |
+
self._settlement_rejected = True
|
| 394 |
+
response = (
|
| 395 |
+
f"VENDOR RESPONSE ({scenario.vendor.name}):\n\n"
|
| 396 |
+
f"\"We acknowledge your position. If you have documentary evidence supporting "
|
| 397 |
+
f"the penalty, please proceed with the full adjustment.\""
|
| 398 |
+
)
|
| 399 |
+
else:
|
| 400 |
+
response = (
|
| 401 |
+
f"VENDOR RESPONSE ({scenario.vendor.name}):\n\n"
|
| 402 |
+
f"\"We maintain our position. Please review the evidence and respond accordingly.\""
|
| 403 |
+
)
|
| 404 |
+
|
| 405 |
+
return self._success_obs(response)
|
| 406 |
+
|
| 407 |
+
def _handle_submit(self, action: ESCTRAction) -> ESCTRObservation:
|
| 408 |
+
"""Handle financial decision submission (terminal action)."""
|
| 409 |
+
amount = action.adjustment_amount
|
| 410 |
+
reason = action.adjustment_reason or ""
|
| 411 |
|
| 412 |
+
if amount is None:
|
| 413 |
+
self._trajectory_reward -= HALLUCINATION_PENALTY
|
| 414 |
+
return self._error_obs(
|
| 415 |
+
"Missing adjustment_amount. Submit the exact monetary adjustment as a float."
|
| 416 |
)
|
| 417 |
|
| 418 |
+
# Check for evidence citation in reason
|
| 419 |
+
if "warehouse" in reason.lower() or "dock" in reason.lower() or "access log" in reason.lower():
|
| 420 |
+
self._cited_evidence = True
|
| 421 |
|
| 422 |
+
# Mark as submitted
|
| 423 |
+
self._state.outcome_submitted = True
|
| 424 |
|
| 425 |
+
# Check if settlement was accepted (for task 3)
|
| 426 |
+
if self._settlement_offered and not self._settlement_rejected:
|
| 427 |
+
# No rejection was recorded after the offer; grade_task3 applies the gullibility penalty via rejected_settlement=False
|
| 428 |
+
pass
|
| 429 |
|
| 430 |
+
return self._finalize_with_grading(amount)
|
| 431 |
|
| 432 |
+
# ------------------------------------------------------------------
|
| 433 |
+
# Helpers
|
| 434 |
+
# ------------------------------------------------------------------
|
| 435 |
|
| 436 |
+
def _add_milestone(self, milestone: str):
|
| 437 |
+
if milestone not in self._milestones:
|
| 438 |
+
self._milestones.append(milestone)
|
| 439 |
+
self._state.milestones_hit = self._milestones.copy()
|
| 440 |
+
|
| 441 |
+
def _finalize_with_grading(self, submitted_amount: float) -> ESCTRObservation:
|
| 442 |
+
"""Run the appropriate grader and return final observation."""
|
| 443 |
+
task = self._state.task_name
|
| 444 |
+
scenario = self._scenario
|
| 445 |
+
steps = self._state.step_count
|
| 446 |
+
|
| 447 |
+
if task == "procurement_reconciliation":
|
| 448 |
+
# NOTE: submitted_line_item is not passed here, so grade_task1 scores on the amount alone (outcome_score is at most 0.4)
|
| 449 |
+
score, feedback = grade_task1(
|
| 450 |
+
scenario, submitted_amount,
|
| 451 |
+
milestones=self._milestones,
|
| 452 |
+
steps_taken=steps,
|
| 453 |
+
)
|
| 454 |
+
elif task == "sla_enforcement":
|
| 455 |
+
self._add_milestone("calculated_penalty")
|
| 456 |
+
score, feedback = grade_task2(
|
| 457 |
+
scenario, submitted_amount,
|
| 458 |
+
milestones=self._milestones,
|
| 459 |
+
steps_taken=steps,
|
| 460 |
+
)
|
| 461 |
+
elif task == "adversarial_auditing":
|
| 462 |
+
score, feedback = grade_task3(
|
| 463 |
+
scenario, submitted_amount,
|
| 464 |
+
rejected_settlement=self._settlement_rejected,
|
| 465 |
+
cited_evidence=self._cited_evidence,
|
| 466 |
+
milestones=self._milestones,
|
| 467 |
+
steps_taken=steps,
|
| 468 |
+
)
|
| 469 |
+
else:
|
| 470 |
+
score = 0.01
|
| 471 |
+
feedback = {"error": "Unknown task"}
|
| 472 |
+
|
| 473 |
+
self._state.best_score = score
|
| 474 |
+
self._state.accumulated_reward += score
|
| 475 |
+
|
| 476 |
+
response = (
|
| 477 |
+
f"=== FINANCIAL DECISION PROCESSED ===\n\n"
|
| 478 |
+
f"Submitted adjustment: ${submitted_amount:,.2f}\n"
|
| 479 |
+
f"Score: {score:.4f}\n\n"
|
| 480 |
)
|
| 481 |
|
| 482 |
+
if "outcome" in feedback:
|
| 483 |
+
response += f"Outcome: {feedback['outcome']}\n"
|
| 484 |
+
if "trajectory" in feedback:
|
| 485 |
+
response += f"Investigation milestones: {', '.join(feedback.get('trajectory', []))}\n"
|
| 486 |
+
if feedback.get("gullibility_penalty", 0) > 0:
|
| 487 |
+
response += f"⚠ Gullibility penalty: -{feedback['gullibility_penalty']:.2f}\n"
|
| 488 |
+
if feedback.get("evidence_bonus", 0) > 0:
|
| 489 |
+
response += f"✓ Evidence citation bonus: +{feedback['evidence_bonus']:.2f}\n"
|
| 490 |
|
| 491 |
+
response += f"\nFinal score: {score:.4f}"
|
| 492 |
|
| 493 |
+
return ESCTRObservation(
|
| 494 |
+
done=True,
|
| 495 |
+
reward=score,
|
| 496 |
+
system_response=response,
|
| 497 |
+
last_action_status="success",
|
| 498 |
+
current_step=self._state.step_count,
|
| 499 |
+
max_steps=MAX_STEPS.get(task, 15),
|
| 500 |
+
accumulated_reward=self._state.accumulated_reward,
|
| 501 |
+
task_name=task,
|
| 502 |
+
available_tools=[],
|
| 503 |
+
metadata=feedback,
|
| 504 |
+
)
|
| 505 |
|
| 506 |
+
def _finalize(self, msg: str, forced: bool = False) -> ESCTRObservation:
|
| 507 |
+
"""Finalize episode without submission (timeout / error)."""
|
| 508 |
+
self._state.outcome_submitted = True
|
| 509 |
+
return ESCTRObservation(
|
| 510 |
+
done=True,
|
| 511 |
+
reward=0.01,
|
| 512 |
+
system_response=msg,
|
| 513 |
+
last_action_status="error" if forced else "success",
|
| 514 |
+
current_step=self._state.step_count,
|
| 515 |
+
max_steps=MAX_STEPS.get(self._state.task_name, 15),
|
| 516 |
+
accumulated_reward=self._state.accumulated_reward,
|
| 517 |
+
task_name=self._state.task_name,
|
| 518 |
+
metadata={"forced_termination": forced},
|
| 519 |
+
)
|
| 520 |
|
| 521 |
+
def _success_obs(self, text: str) -> ESCTRObservation:
|
| 522 |
+
return ESCTRObservation(
|
| 523 |
done=False,
|
| 524 |
reward=0.0,
|
| 525 |
+
system_response=text,
|
| 526 |
+
last_action_status="success",
|
| 527 |
+
current_step=self._state.step_count,
|
| 528 |
+
max_steps=MAX_STEPS.get(self._state.task_name, 15),
|
| 529 |
+
accumulated_reward=self._state.accumulated_reward,
|
| 530 |
+
task_name=self._state.task_name,
|
| 531 |
+
available_tools=TASK_TOOLS.get(self._state.task_name, []),
|
| 532 |
+
)
|
| 533 |
+
|
| 534 |
+
def _error_obs(self, msg: str, terminal: bool = False) -> ESCTRObservation:
|
| 535 |
+
return ESCTRObservation(
|
| 536 |
+
done=terminal,
|
| 537 |
+
reward=0.0,
|
| 538 |
+
system_response=msg,
|
| 539 |
+
last_action_status="error",
|
| 540 |
+
error_message=msg,
|
| 541 |
+
current_step=self._state.step_count,
|
| 542 |
+
max_steps=MAX_STEPS.get(self._state.task_name, 15),
|
| 543 |
+
accumulated_reward=self._state.accumulated_reward,
|
| 544 |
+
task_name=self._state.task_name,
|
| 545 |
+
available_tools=TASK_TOOLS.get(self._state.task_name, []),
|
| 546 |
)
|
| 547 |
|
| 548 |
@property
|
| 549 |
+
def state(self) -> ESCTRState:
|
|
|
|
| 550 |
return self._state
|
| 551 |
|
| 552 |
def close(self) -> None:
|
|
|
|
| 553 |
self._initialized = False
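
The second-contact branch of the vendor handler above offers a settlement at a randomly chosen percentage of the claimed penalty. A minimal standalone sketch of that arithmetic follows; `settlement_offer` and `SETTLEMENT_PERCENTAGES` are hypothetical names introduced for illustration, and a seeded `random.Random` stands in for the environment's `_rng`.

```python
import random
from typing import Tuple

# Same percentage choices as the handler above; the constant name is an assumption.
SETTLEMENT_PERCENTAGES = (40, 45, 50, 55)


def settlement_offer(penalty: float, rng: random.Random) -> Tuple[int, float]:
    """Return (percentage, settlement amount) for a claimed penalty."""
    pct = rng.choice(SETTLEMENT_PERCENTAGES)
    settlement = round(penalty * (pct / 100.0), 2)
    return pct, settlement


# A fixed seed makes the offer reproducible, in the spirit of the
# deterministic scenario generation described in the commit message.
pct, settlement = settlement_offer(1200.0, random.Random(7))
print(f"Vendor offers {pct}% of the penalty: ${settlement:,.2f} instead of $1,200.00")
```

Because the offer is always below the full penalty, an agent that submits the reduced amount lands far outside the grader's tolerance for the correct adjustment.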
|
server/graders.py
CHANGED
|
@@ -1,313 +1,291 @@
|
|
| 1 |
"""
|
| 2 |
-
|
| 3 |
|
| 4 |
-
|
| 5 |
-
|
|
|
|
|
|
|
| 6 |
"""
|
| 7 |
|
| 8 |
-
import
|
| 9 |
-
|
| 10 |
-
from
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
|
| 16 |
-
|
| 17 |
-
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
|
| 51 |
-
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
|
| 55 |
-
|
| 56 |
-
|
| 57 |
-
|
| 58 |
-
"
|
| 59 |
-
|
| 60 |
-
|
| 61 |
-
"
|
| 62 |
-
|
| 63 |
-
|
| 64 |
-
|
| 65 |
-
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
|
| 72 |
-
|
| 73 |
-
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
|
| 80 |
-
|
| 81 |
-
|
| 82 |
-
|
| 83 |
-
|
| 84 |
-
|
| 85 |
-
|
| 86 |
-
|
| 87 |
-
|
| 88 |
-
|
| 89 |
-
|
| 90 |
-
|
| 91 |
-
|
| 92 |
-
|
| 93 |
-
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-
|
| 97 |
-
|
| 98 |
-
|
| 99 |
-
|
| 100 |
-
|
| 101 |
-
|
| 102 |
-
|
| 103 |
-
|
| 104 |
-
|
| 105 |
-
|
| 106 |
-
|
| 107 |
-
|
| 108 |
-
|
| 109 |
-
|
| 110 |
-
|
| 111 |
-
|
| 112 |
-
|
| 113 |
-
|
| 114 |
-
|
| 115 |
-
|
| 116 |
-
|
| 117 |
-
|
| 118 |
-
|
| 119 |
-
|
| 120 |
-
|
| 121 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 122 |
elif error_pct <= 0.05:
|
| 123 |
-
|
|
|
|
| 124 |
elif error_pct <= 0.10:
|
| 125 |
-
|
| 126 |
-
|
| 127 |
-
|
| 128 |
-
|
| 129 |
-
|
| 130 |
-
|
| 131 |
-
|
| 132 |
-
|
| 133 |
-
|
| 134 |
-
|
| 135 |
-
|
| 136 |
-
|
| 137 |
-
|
| 138 |
-
|
| 139 |
-
|
| 140 |
-
|
| 141 |
-
|
| 142 |
-
|
| 143 |
-
|
| 144 |
-
|
| 145 |
-
|
| 146 |
-
|
| 147 |
-
|
| 148 |
-
|
| 149 |
-
|
| 150 |
-
|
| 151 |
-
|
| 152 |
-
|
| 153 |
-
|
| 154 |
-
|
| 155 |
-
|
| 156 |
-
|
| 157 |
-
|
| 158 |
-
|
| 159 |
-
|
| 160 |
-
|
| 161 |
-
|
| 162 |
-
|
| 163 |
-
|
| 164 |
-
|
| 165 |
-
|
| 166 |
-
|
| 167 |
-
|
| 168 |
-
|
| 169 |
-
|
| 170 |
-
|
| 171 |
-
|
| 172 |
-
|
| 173 |
-
|
| 174 |
-
|
| 175 |
-
|
| 176 |
-
|
| 177 |
-
|
| 178 |
-
|
| 179 |
-
|
| 180 |
-
|
| 181 |
-
|
| 182 |
-
|
| 183 |
-
|
| 184 |
-
|
| 185 |
-
price_score = grade_numeric(
|
| 186 |
-
act_item.get("unit_price"),
|
| 187 |
-
exp_item.get("unit_price"),
|
| 188 |
-
)
|
| 189 |
-
amt_score = grade_numeric(
|
| 190 |
-
act_item.get("amount"),
|
| 191 |
-
exp_item.get("amount"),
|
| 192 |
-
)
|
| 193 |
-
|
| 194 |
-
item_score = (desc_score * 0.3 + qty_score * 0.2 +
|
| 195 |
-
price_score * 0.2 + amt_score * 0.3)
|
| 196 |
-
|
| 197 |
-
if item_score > best_score:
|
| 198 |
-
best_score = item_score
|
| 199 |
-
best_idx = idx
|
| 200 |
-
|
| 201 |
-
if best_idx >= 0:
|
| 202 |
-
matched_expected.add(best_idx)
|
| 203 |
-
total_score += best_score
|
| 204 |
-
|
| 205 |
-
# Normalize by expected count, penalize missing/extra items
|
| 206 |
-
expected_count = len(expected)
|
| 207 |
-
if expected_count == 0:
|
| 208 |
-
return 1.0 if len(actual) == 0 else 0.0
|
| 209 |
-
|
| 210 |
-
# Score = matched items score / expected count
|
| 211 |
-
# Penalize for extra items (max penalty = 0.2)
|
| 212 |
-
extra_penalty = max(0, len(actual) - expected_count) * 0.05
|
| 213 |
-
extra_penalty = min(extra_penalty, 0.2)
|
| 214 |
-
|
| 215 |
-
score = (total_score / expected_count) - extra_penalty
|
| 216 |
-
return max(0.0, min(1.0, round(score, 4)))
|
| 217 |
-
|
| 218 |
-
|
| 219 |
-
def grade_extraction(
|
| 220 |
-
extracted: Dict[str, Any],
|
| 221 |
-
ground_truth: Dict[str, Any],
|
| 222 |
-
required_fields: List[str],
|
| 223 |
) -> Tuple[float, Dict[str, Any]]:
|
| 224 |
-
"""Grade the
|
| 225 |
-
|
| 226 |
-
Uses weighted scoring: financial fields (subtotal, tax, total) are
|
| 227 |
-
weighted 1.5x, line_items 2.0x, and reasoning fields 0.8x to reflect
|
| 228 |
-
their relative importance in real-world invoice processing.
|
| 229 |
|
| 230 |
-
|
| 231 |
-
|
| 232 |
-
|
| 233 |
-
|
| 234 |
|
| 235 |
-
|
| 236 |
-
|
| 237 |
-
|
| 238 |
-
field_feedback maps field names to {score, expected, actual}
|
| 239 |
"""
|
| 240 |
-
|
| 241 |
-
feedback = {}
|
| 242 |
-
|
| 243 |
-
|
| 244 |
-
|
| 245 |
-
|
| 246 |
-
|
| 247 |
-
|
| 248 |
-
|
| 249 |
-
|
| 250 |
-
|
| 251 |
-
|
| 252 |
-
|
| 253 |
-
|
| 254 |
-
|
| 255 |
-
|
| 256 |
-
|
| 257 |
-
|
| 258 |
-
for field in required_fields:
|
| 259 |
-
expected = ground_truth.get(field)
|
| 260 |
-
actual = extracted.get(field)
|
| 261 |
-
|
| 262 |
-
if field in list_fields:
|
| 263 |
-
score = grade_line_items(actual, expected)
|
| 264 |
-
elif field in numeric_fields:
|
| 265 |
-
score = grade_numeric(actual, expected)
|
| 266 |
-
elif field in date_fields:
|
| 267 |
-
score = grade_date(actual, expected)
|
| 268 |
-
elif field in reasoning_fields:
|
| 269 |
-
# Free-text reasoning: use fuzzy matching with generous partial credit
|
| 270 |
-
score = grade_text(actual, expected)
|
| 271 |
else:
|
| 272 |
-
|
| 273 |
-
|
| 274 |
-
|
| 275 |
-
|
| 276 |
-
|
| 277 |
-
|
| 278 |
-
|
| 279 |
-
|
| 280 |
-
|
| 281 |
-
|
| 282 |
-
|
| 283 |
-
|
| 284 |
-
|
| 285 |
-
|
| 286 |
-
|
| 287 |
-
|
| 288 |
-
|
| 289 |
-
|
| 290 |
-
|
| 291 |
-
|
| 292 |
-
|
| 293 |
-
|
| 294 |
-
|
| 295 |
-
|
| 296 |
-
|
| 297 |
-
|
| 298 |
-
|
| 299 |
-
|
| 300 |
-
|
| 301 |
-
if
|
| 302 |
-
|
| 303 |
-
|
| 304 |
-
|
| 305 |
-
|
| 306 |
-
|
| 307 |
-
|
| 308 |
-
|
| 309 |
-
|
| 310 |
-
|
| 311 |
-
|
| 312 |
-
|
| 313 |
-
|
| 1 |
"""
|
| 2 |
+
Deterministic Graders for the ESCTR Environment.
|
| 3 |
|
| 4 |
+
Each task has a specific grader that scores the agent's performance
|
| 5 |
+
using verifiable, programmatic criteria — no subjective evaluation.
|
| 6 |
+
|
| 7 |
+
Scores are always clamped to the closed range [0.01, 0.99], so they stay strictly inside (0, 1) to satisfy OpenEnv validators.
|
| 8 |
"""
|
| 9 |
|
| 10 |
+
from typing import Any, Dict, List, Tuple
|
| 11 |
+
|
| 12 |
+
from .procedural import Scenario
|
| 13 |
+
|
| 14 |
+
|
| 15 |
+
def clamp_score(score: float) -> float:
|
| 16 |
+
"""Clamp score to the closed range [0.01, 0.99] and round to 4 decimal places."""
|
| 17 |
+
return round(max(0.01, min(0.99, score)), 4)
|
| 18 |
+
|
| 19 |
+
|
| 20 |
+
# ---------------------------------------------------------------------------
|
| 21 |
+
# Task 1: Procurement Reconciliation
|
| 22 |
+
# ---------------------------------------------------------------------------
|
| 23 |
+
|
| 24 |
+
def grade_task1(
|
| 25 |
+
scenario: Scenario,
|
| 26 |
+
submitted_amount: float,
|
| 27 |
+
submitted_line_item: str = None,
|
| 28 |
+
milestones: List[str] = None,
|
| 29 |
+
steps_taken: int = 0,
|
| 30 |
+
) -> Tuple[float, Dict[str, Any]]:
|
| 31 |
+
"""Grade the procurement reconciliation task.
|
| 32 |
+
|
| 33 |
+
Perfect score requires:
|
| 34 |
+
- Correct discrepant line item identified
|
| 35 |
+
- Exact adjustment amount (overcharge value, negative)
|
| 36 |
+
|
| 37 |
+
Partial credit:
|
| 38 |
+
- Correct line item but wrong amount → 0.5
|
| 39 |
+
- Wrong line item but correct amount → 0.4; wrong line item and wrong amount → 0.0
|
| 40 |
+
"""
|
| 41 |
+
milestones = milestones or []
|
| 42 |
+
feedback = {"task": "procurement_reconciliation"}
|
| 43 |
+
|
| 44 |
+
# Outcome scoring (weight: 0.70)
|
| 45 |
+
correct_amount = scenario.correct_adjustment
|
| 46 |
+
correct_item = scenario.discrepant_line_item_id
|
| 47 |
+
|
| 48 |
+
outcome_score = 0.0
|
| 49 |
+
item_correct = (submitted_line_item == correct_item) if submitted_line_item and correct_item else False
|
| 50 |
+
amount_correct = abs(submitted_amount - correct_amount) < 0.02 if submitted_amount is not None else False
|
| 51 |
+
|
| 52 |
+
if item_correct and amount_correct:
|
| 53 |
+
outcome_score = 1.0
|
| 54 |
+
feedback["outcome"] = "PERFECT — correct line item and exact adjustment amount"
|
| 55 |
+
elif item_correct and not amount_correct:
|
| 56 |
+
outcome_score = 0.5
|
| 57 |
+
feedback["outcome"] = f"PARTIAL — correct line item but wrong amount (expected {correct_amount:.2f}, got {submitted_amount:.2f})"
|
| 58 |
+
elif not item_correct and amount_correct:
|
| 59 |
+
outcome_score = 0.4
|
| 60 |
+
feedback["outcome"] = f"PARTIAL — correct amount but wrong line item (expected {correct_item})"
|
| 61 |
+
else:
|
| 62 |
+
outcome_score = 0.0
|
| 63 |
+
feedback["outcome"] = "FAIL — wrong line item and wrong amount"
|
| 64 |
+
|
| 65 |
+
# Trajectory scoring (weight: 0.30)
|
| 66 |
+
trajectory_score = 0.0
|
| 67 |
+
trajectory_details = []
|
| 68 |
+
if "retrieved_po" in milestones:
|
| 69 |
+
trajectory_score += 0.4
|
| 70 |
+
trajectory_details.append("Retrieved PO ✓")
|
| 71 |
+
if "retrieved_invoice" in milestones:
|
| 72 |
+
trajectory_score += 0.4
|
| 73 |
+
trajectory_details.append("Retrieved Invoice ✓")
|
| 74 |
+
if "compared_documents" in milestones:
|
| 75 |
+
trajectory_score += 0.2
|
| 76 |
+
trajectory_details.append("Compared documents ✓")
|
| 77 |
+
|
| 78 |
+
trajectory_score = min(1.0, trajectory_score)
|
| 79 |
+
feedback["trajectory"] = trajectory_details
|
| 80 |
+
|
| 81 |
+
# Efficiency penalty
|
| 82 |
+
max_steps = 10
|
| 83 |
+
efficiency_penalty = max(0, (steps_taken - max_steps) * 0.02)
|
| 84 |
+
|
| 85 |
+
# Composite
|
| 86 |
+
alpha, beta = 0.70, 0.30
|
| 87 |
+
raw_score = alpha * outcome_score + beta * trajectory_score - efficiency_penalty
|
| 88 |
+
final_score = clamp_score(raw_score)
|
| 89 |
+
|
| 90 |
+
feedback["outcome_score"] = outcome_score
|
| 91 |
+
feedback["trajectory_score"] = trajectory_score
|
| 92 |
+
feedback["efficiency_penalty"] = efficiency_penalty
|
| 93 |
+
feedback["final_score"] = final_score
|
| 94 |
+
feedback["correct_adjustment"] = correct_amount
|
| 95 |
+
feedback["correct_line_item"] = correct_item
|
| 96 |
+
|
| 97 |
+
return final_score, feedback
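
The composite used by `grade_task1` can be checked by hand. The sketch below re-implements only the final arithmetic (weights 0.70/0.30, 0.02 per step beyond 10, clamp to [0.01, 0.99]); `task1_composite` is a hypothetical helper for illustration, not part of the module.

```python
def clamp_score(score: float) -> float:
    """Same clamp as above: closed range [0.01, 0.99], 4 decimal places."""
    return round(max(0.01, min(0.99, score)), 4)


def task1_composite(outcome_score: float, trajectory_score: float, steps_taken: int) -> float:
    """Hypothetical helper mirroring the last lines of grade_task1."""
    alpha, beta = 0.70, 0.30
    max_steps = 10
    efficiency_penalty = max(0, (steps_taken - max_steps) * 0.02)
    return clamp_score(alpha * outcome_score + beta * trajectory_score - efficiency_penalty)


# Correct amount only (outcome 0.4), all three milestones hit, 8 steps.
amount_only = task1_composite(0.4, 1.0, steps_taken=8)

# Perfect outcome and trajectory: the raw score of 1.0 is clamped to 0.99.
perfect = task1_composite(1.0, 1.0, steps_taken=8)

# Same perfect run, but two steps over budget costs 0.04.
slow = task1_composite(1.0, 1.0, steps_taken=12)

print(amount_only, perfect, slow)
```

The trajectory term alone is worth at most 0.30, so an agent that retrieves every document but submits a wrong amount still scores well below one that gets the amount right.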
|
| 98 |
+
|
| 99 |
+
|
| 100 |
+
# ---------------------------------------------------------------------------
|
| 101 |
+
# Task 2: SLA Enforcement
|
| 102 |
+
# ---------------------------------------------------------------------------
|
| 103 |
+
|
| 104 |
+
def grade_task2(
|
| 105 |
+
scenario: Scenario,
|
| 106 |
+
submitted_amount: float,
|
| 107 |
+
milestones: List[str] = None,
|
| 108 |
+
steps_taken: int = 0,
|
| 109 |
+
) -> Tuple[float, Dict[str, Any]]:
|
| 110 |
+
"""Grade the SLA enforcement task.
|
| 111 |
+
|
| 112 |
+
Perfect score requires:
|
| 113 |
+
- Exact penalty amount calculated from shipping delay + SLA terms
|
| 114 |
+
|
| 115 |
+
Partial credit:
|
| 116 |
+
- Within 5% of correct penalty → 0.7
|
| 117 |
+
- Within 10% → 0.4
|
| 118 |
+
- Approved invoice without penalty → 0.0
|
| 119 |
+
"""
|
| 120 |
+
milestones = milestones or []
|
| 121 |
+
feedback = {"task": "sla_enforcement"}
|
| 122 |
+
|
| 123 |
+
correct_penalty = scenario.penalty_amount or 0.0
|
| 124 |
+
correct_adjustment = scenario.correct_adjustment # negative
|
| 125 |
+
|
| 126 |
+
# Outcome scoring (weight: 0.60)
|
| 127 |
+
outcome_score = 0.0
|
| 128 |
+
if submitted_amount is not None and submitted_amount != 0 and correct_adjustment != 0:
|
| 129 |
+
error = abs(submitted_amount - correct_adjustment)
|
| 130 |
+
error_pct = error / abs(correct_adjustment) if correct_adjustment != 0 else float('inf')
|
| 131 |
+
|
| 132 |
+
if error < 0.02:
|
| 133 |
+
outcome_score = 1.0
|
| 134 |
+
feedback["outcome"] = "PERFECT — exact penalty amount"
|
| 135 |
elif error_pct <= 0.05:
|
| 136 |
+
outcome_score = 0.7
|
| 137 |
+
feedback["outcome"] = f"CLOSE — within 5% (expected {correct_adjustment:.2f}, got {submitted_amount:.2f})"
|
| 138 |
elif error_pct <= 0.10:
|
| 139 |
+
outcome_score = 0.4
|
| 140 |
+
feedback["outcome"] = f"PARTIAL — within 10% (expected {correct_adjustment:.2f}, got {submitted_amount:.2f})"
|
| 141 |
+
else:
|
| 142 |
+
outcome_score = 0.1
|
| 143 |
+
feedback["outcome"] = f"INCORRECT — expected {correct_adjustment:.2f}, got {submitted_amount:.2f}"
|
| 144 |
+
elif submitted_amount == 0 or submitted_amount is None:
|
| 145 |
+
outcome_score = 0.0
|
| 146 |
+
feedback["outcome"] = "FAIL — approved invoice without applying penalty"
|
| 147 |
+
|
| 148 |
+
# Trajectory scoring (weight: 0.40)
|
| 149 |
+
trajectory_score = 0.0
|
| 150 |
+
trajectory_details = []
|
| 151 |
+
if "retrieved_shipping" in milestones:
|
| 152 |
+
trajectory_score += 0.30
|
| 153 |
+
trajectory_details.append("Retrieved shipping log ✓")
|
| 154 |
+
if "retrieved_sla" in milestones:
|
| 155 |
+
trajectory_score += 0.30
|
| 156 |
+
trajectory_details.append("Retrieved SLA contract ✓")
|
| 157 |
+
if "retrieved_po" in milestones:
|
| 158 |
+
trajectory_score += 0.15
|
| 159 |
+
trajectory_details.append("Retrieved PO ✓")
|
| 160 |
+
if "retrieved_invoice" in milestones:
|
| 161 |
+
trajectory_score += 0.15
|
| 162 |
+
trajectory_details.append("Retrieved Invoice ✓")
|
| 163 |
+
if "calculated_penalty" in milestones:
|
| 164 |
+
trajectory_score += 0.10
|
| 165 |
+
trajectory_details.append("Performed penalty calculation ✓")
|
| 166 |
+
|
| 167 |
+
trajectory_score = min(1.0, trajectory_score)
|
| 168 |
+
feedback["trajectory"] = trajectory_details
|
| 169 |
+
|
| 170 |
+
# Efficiency
|
| 171 |
+
max_steps = 15
|
| 172 |
+
efficiency_penalty = max(0, (steps_taken - max_steps) * 0.02)
|
| 173 |
+
|
| 174 |
+
alpha, beta = 0.60, 0.40
|
| 175 |
+
raw_score = alpha * outcome_score + beta * trajectory_score - efficiency_penalty
|
| 176 |
+
final_score = clamp_score(raw_score)
|
| 177 |
+
|
| 178 |
+
feedback["outcome_score"] = outcome_score
|
| 179 |
+
feedback["trajectory_score"] = trajectory_score
|
| 180 |
+
feedback["efficiency_penalty"] = efficiency_penalty
|
| 181 |
+
feedback["final_score"] = final_score
|
| 182 |
+
feedback["correct_adjustment"] = correct_adjustment
|
| 183 |
+
feedback["penalty_amount"] = correct_penalty
|
| 184 |
+
|
| 185 |
+
return final_score, feedback
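
The tolerance tiers in `grade_task2` are easier to see in isolation. This is a minimal sketch for a non-zero submission against a non-zero expected adjustment; `tiered_outcome` is a hypothetical name and the sample amounts are made up for illustration.

```python
def tiered_outcome(submitted: float, correct: float) -> float:
    """Hypothetical helper mirroring the outcome tiers of grade_task2."""
    if correct == 0:
        raise ValueError("this sketch only covers a non-zero expected adjustment")
    error = abs(submitted - correct)
    error_pct = error / abs(correct)
    if error < 0.02:          # exact to within two cents
        return 1.0
    if error_pct <= 0.05:     # within 5%
        return 0.7
    if error_pct <= 0.10:     # within 10%
        return 0.4
    return 0.1                # anything further off


correct = -500.00  # adjustments are negative (credits owed by the vendor)
for submitted in (-500.00, -480.00, -455.00, -300.00):
    print(f"{submitted:>8.2f} -> {tiered_outcome(submitted, correct)}")
```

The exact-match check uses an absolute tolerance of two cents, while the partial-credit tiers use relative error, so the bands widen with the size of the penalty.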
|
| 186 |
+
|
| 187 |
+
|
| 188 |
+
# ---------------------------------------------------------------------------
|
| 189 |
+
# Task 3: Adversarial Auditing
|
| 190 |
+
# ---------------------------------------------------------------------------
|
| 191 |
+
|
| 192 |
+
def grade_task3(
|
| 193 |
+
scenario: Scenario,
|
| 194 |
+
submitted_amount: float,
|
| 195 |
+
rejected_settlement: bool = False,
|
| 196 |
+
cited_evidence: bool = False,
|
| 197 |
+
milestones: List[str] = None,
|
| 198 |
+
steps_taken: int = 0,
|
| 199 |
) -> Tuple[float, Dict[str, Any]]:
|
| 200 |
+
"""Grade the adversarial auditing task.
|
| 201 |
|
| 202 |
+
Perfect score requires:
|
| 203 |
+
- Full penalty enforced (rejected vendor's bogus claim)
|
| 204 |
+
- Settlement offer rejected
|
| 205 |
+
- Warehouse logs cited as evidence
|
| 206 |
|
| 207 |
+
Severe deductions for:
|
| 208 |
+
- Accepting settlement (gullibility)
|
| 209 |
+
- Not checking warehouse logs (forfeits the 0.25 trajectory credit and the evidence bonus)
|
|
|
|
| 210 |
"""
|
| 211 |
+
milestones = milestones or []
|
| 212 |
+
feedback = {"task": "adversarial_auditing"}
|
| 213 |
+
|
| 214 |
+
correct_adjustment = scenario.correct_adjustment
|
| 215 |
+
|
| 216 |
+
# Outcome scoring (weight: 0.60)
|
| 217 |
+
outcome_score = 0.0
|
| 218 |
+
if submitted_amount is not None and correct_adjustment != 0:
|
| 219 |
+
error = abs(submitted_amount - correct_adjustment)
|
| 220 |
+
error_pct = error / abs(correct_adjustment) if correct_adjustment != 0 else float('inf')
|
| 221 |
+
|
| 222 |
+
if error < 0.02:
|
| 223 |
+
outcome_score = 1.0
|
| 224 |
+
feedback["outcome"] = "PERFECT — full contractual penalty enforced"
|
| 225 |
+
elif error_pct <= 0.05:
|
| 226 |
+
outcome_score = 0.6
|
| 227 |
+
feedback["outcome"] = f"CLOSE — nearly correct (expected {correct_adjustment:.2f}, got {submitted_amount:.2f})"
|
| 228 |
else:
|
| 229 |
+
outcome_score = 0.2
|
| 230 |
+
feedback["outcome"] = f"INCORRECT — expected {correct_adjustment:.2f}, got {submitted_amount:.2f}"
|
| 231 |
+
else:
|
| 232 |
+
outcome_score = 0.0
|
| 233 |
+
feedback["outcome"] = "FAIL — no financial decision submitted"
|
| 234 |
+
|
| 235 |
+
# Gullibility penalty: accepted adversarial settlement
|
| 236 |
+
gullibility_penalty = 0.0
|
| 237 |
+
if not rejected_settlement and "vendor_negotiation" in milestones:
|
| 238 |
+
gullibility_penalty = 0.20
|
| 239 |
+
feedback["gullibility"] = "PENALIZED — accepted vendor's settlement offer"
|
| 240 |
+
elif rejected_settlement:
|
| 241 |
+
feedback["gullibility"] = "GOOD — rejected vendor's settlement offer"
|
| 242 |
+
|
| 243 |
+
# Evidence citation bonus
|
| 244 |
+
evidence_bonus = 0.0
|
| 245 |
+
if cited_evidence and "checked_warehouse" in milestones:
|
| 246 |
+
evidence_bonus = 0.05
|
| 247 |
+
feedback["evidence"] = "GOOD — cited warehouse logs as evidence"
|
| 248 |
+
|
| 249 |
+
# Trajectory scoring (weight: 0.40)
|
| 250 |
+
trajectory_score = 0.0
|
| 251 |
+
trajectory_details = []
|
| 252 |
+
if "retrieved_shipping" in milestones:
|
| 253 |
+
trajectory_score += 0.20
|
| 254 |
+
trajectory_details.append("Retrieved shipping log ✓")
|
| 255 |
+
if "retrieved_sla" in milestones:
|
| 256 |
+
trajectory_score += 0.20
|
| 257 |
+
trajectory_details.append("Retrieved SLA contract ✓")
|
| 258 |
+
if "checked_warehouse" in milestones:
|
| 259 |
+
trajectory_score += 0.25
|
| 260 |
+
trajectory_details.append("Checked warehouse access logs ✓")
|
| 261 |
+
if "vendor_negotiation" in milestones:
|
| 262 |
+
trajectory_score += 0.15
|
| 263 |
+
trajectory_details.append("Engaged in vendor negotiation ✓")
|
| 264 |
+
if "retrieved_po" in milestones:
|
| 265 |
+
trajectory_score += 0.10
|
| 266 |
+
trajectory_details.append("Retrieved PO ✓")
|
| 267 |
+
if "retrieved_invoice" in milestones:
|
| 268 |
+
trajectory_score += 0.10
|
| 269 |
+
trajectory_details.append("Retrieved Invoice ✓")
|
| 270 |
+
|
| 271 |
+
trajectory_score = min(1.0, trajectory_score)
|
| 272 |
+
feedback["trajectory"] = trajectory_details
|
| 273 |
+
|
| 274 |
+
# Efficiency
|
| 275 |
+
max_steps = 20
|
| 276 |
+
efficiency_penalty = max(0, (steps_taken - max_steps) * 0.015)
|
| 277 |
+
|
| 278 |
+
alpha, beta = 0.60, 0.40
|
| 279 |
+
raw_score = (alpha * outcome_score + beta * trajectory_score
|
| 280 |
+
+ evidence_bonus - gullibility_penalty - efficiency_penalty)
|
| 281 |
+
final_score = clamp_score(raw_score)
|
| 282 |
+
|
| 283 |
+
feedback["outcome_score"] = outcome_score
|
| 284 |
+
feedback["trajectory_score"] = trajectory_score
|
| 285 |
+
feedback["gullibility_penalty"] = gullibility_penalty
|
| 286 |
+
feedback["evidence_bonus"] = evidence_bonus
|
| 287 |
+
feedback["efficiency_penalty"] = efficiency_penalty
|
| 288 |
+
feedback["final_score"] = final_score
|
| 289 |
+
feedback["correct_adjustment"] = correct_adjustment
|
| 290 |
+
|
| 291 |
+
return final_score, feedback
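
To see how the gullibility penalty and evidence bonus interact with the weighted terms in `grade_task3`, the sketch below recomputes the composite for two hypothetical agents. `TASK3_MILESTONE_WEIGHTS` and `task3_composite` are illustrative names; the weights and constants are copied from the grader above, and the efficiency penalty is left out by assuming both agents finish within 20 steps.

```python
from typing import Iterable

# Milestone weights copied from grade_task3; the dict itself is an assumption.
TASK3_MILESTONE_WEIGHTS = {
    "retrieved_shipping": 0.20,
    "retrieved_sla": 0.20,
    "checked_warehouse": 0.25,
    "vendor_negotiation": 0.15,
    "retrieved_po": 0.10,
    "retrieved_invoice": 0.10,
}


def task3_composite(outcome_score: float, milestones: Iterable[str],
                    rejected_settlement: bool, cited_evidence: bool) -> float:
    """Hypothetical helper mirroring grade_task3 without the efficiency term."""
    hit = set(milestones)
    trajectory = min(1.0, sum(w for m, w in TASK3_MILESTONE_WEIGHTS.items() if m in hit))
    gullibility = 0.20 if (not rejected_settlement and "vendor_negotiation" in hit) else 0.0
    bonus = 0.05 if (cited_evidence and "checked_warehouse" in hit) else 0.0
    raw = 0.60 * outcome_score + 0.40 * trajectory + bonus - gullibility
    return round(max(0.01, min(0.99, raw)), 4)


# Diligent agent: every milestone, settlement rejected, warehouse logs cited.
diligent = task3_composite(1.0, TASK3_MILESTONE_WEIGHTS, True, True)

# Gullible agent: skips the warehouse logs and submits the settlement amount,
# which falls in the "INCORRECT" outcome tier (0.2).
gullible = task3_composite(
    0.2,
    ["retrieved_shipping", "retrieved_sla", "vendor_negotiation",
     "retrieved_po", "retrieved_invoice"],
    False, False,
)

print(diligent, gullible)
```

Under these assumptions the diligent agent hits the 0.99 ceiling, while the gullible one ends near 0.22 because the penalty removes most of the trajectory credit it earned.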
|
server/models.py
CHANGED
|
@@ -1,8 +1,9 @@
|
|
| 1 |
"""
|
| 2 |
-
Pydantic models for the
|
| 3 |
|
| 4 |
-
Defines the Action and
|
| 5 |
-
between the agent and the environment.
|
|
|
|
| 6 |
"""
|
| 7 |
|
| 8 |
from typing import Any, Dict, List, Literal, Optional
|
|
@@ -10,84 +11,129 @@ from typing import Any, Dict, List, Literal, Optional
|
|
| 10 |
from pydantic import BaseModel, ConfigDict, Field
|
| 11 |
|
| 12 |
|
| 13 |
-
|
| 14 |
-
|
|
|
|
| 15 |
|
| 16 |
-
|
| 17 |
-
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
"""
|
| 25 |
|
| 26 |
model_config = ConfigDict(extra="forbid")
|
| 27 |
|
| 28 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 29 |
...,
|
| 30 |
description=(
|
| 31 |
-
"
|
| 32 |
-
"'
|
| 33 |
-
"or 'check_discrepancies'"
|
| 34 |
),
|
| 35 |
)
|
| 36 |
-
|
| 37 |
-
default=
|
| 38 |
-
description=
|
|
|
|
|
|
|
|
|
|
| 39 |
)
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
description="
|
| 43 |
)
|
| 44 |
|
| 45 |
|
| 46 |
-
|
| 47 |
-
|
|
|
|
| 48 |
|
| 49 |
-
|
| 50 |
-
|
|
|
|
|
|
|
|
|
|
| 51 |
"""
|
| 52 |
|
| 53 |
model_config = ConfigDict(extra="forbid")
|
| 54 |
|
| 55 |
done: bool = Field(default=False, description="Whether the episode has ended")
|
| 56 |
-
reward:
|
| 57 |
-
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-
|
| 61 |
-
required_fields: List[str] = Field(default_factory=list, description="Fields to extract")
|
| 62 |
-
metadata: Dict[str, Any] = Field(default_factory=dict, description="Additional metadata")
|
| 63 |
last_action_status: Literal["success", "error"] = Field(
|
| 64 |
default="success",
|
| 65 |
-
description="Whether the last action was valid and executed successfully",
|
| 66 |
)
|
| 67 |
error_message: Optional[str] = Field(
|
| 68 |
default=None,
|
| 69 |
-
description="Diagnostic error message if last_action_status is 'error'",
|
| 70 |
)
|
| 71 |
current_step: int = Field(
|
| 72 |
default=0,
|
| 73 |
-
description="Current step number within the episode",
|
|
|
|
|
|
|
|
|
|
|
|
|
| 74 |
)
|
| 75 |
accumulated_reward: float = Field(
|
| 76 |
default=0.0,
|
| 77 |
-
description="Total
|
| 78 |
)
|
| 79 |
|
|
|
|
|
|
|
|
|
|
| 80 |
|
| 81 |
-
class
|
| 82 |
-
"""Internal environment state."""
|
| 83 |
|
| 84 |
model_config = ConfigDict(extra="allow")
|
| 85 |
|
| 86 |
episode_id: Optional[str] = Field(default=None, description="Current episode ID")
|
| 87 |
step_count: int = Field(default=0, ge=0, description="Steps taken in current episode")
|
| 88 |
task_name: str = Field(default="", description="Current task name")
|
| 89 |
-
|
| 90 |
-
|
| 91 |
-
|
| 92 |
-
|
| 93 |
-
|
| 1 |
"""
|
| 2 |
+
Pydantic models for the Enterprise Supply Chain & Tax Reconciliation Environment.
|
| 3 |
|
| 4 |
+
Defines the Action, Observation, and State types used for communication
|
| 5 |
+
between the agent and the environment. Designed for type-safe interaction
|
| 6 |
+
with an ERP-like tool suite.
|
| 7 |
"""
|
| 8 |
|
| 9 |
from typing import Any, Dict, List, Literal, Optional
|
|
|
|
| 11 |
from pydantic import BaseModel, ConfigDict, Field
|
| 12 |
|
| 13 |
|
| 14 |
+
# ---------------------------------------------------------------------------
|
| 15 |
+
# Action — what the agent sends to the environment
|
| 16 |
+
# ---------------------------------------------------------------------------
|
| 17 |
|
| 18 |
+
class ESCTRAction(BaseModel):
|
| 19 |
+
"""Action sent by the agent to the ESCTR environment.
|
| 20 |
+
|
| 21 |
+
The agent operates as an autonomous financial controller using 4 tool verbs:
|
| 22 |
+
- 'query_database': Search procurement, accounts payable, shipping, or warehouse databases
|
| 23 |
+
- 'read_document': Retrieve a specific contract, SLA, PO, or invoice by document_id
|
| 24 |
+
- 'communicate_vendor': Send a negotiation message to the simulated vendor
|
| 25 |
+
- 'submit_financial_decision': Submit the final ledger adjustment (terminal action)
|
| 26 |
"""
|
| 27 |
|
| 28 |
model_config = ConfigDict(extra="forbid")
|
| 29 |
|
| 30 |
+
action_type: Literal[
|
| 31 |
+
"query_database",
|
| 32 |
+
"read_document",
|
| 33 |
+
"communicate_vendor",
|
| 34 |
+
"submit_financial_decision",
|
| 35 |
+
] = Field(
|
| 36 |
...,
|
| 37 |
description=(
|
| 38 |
+
"The tool verb to execute. One of: 'query_database', 'read_document', "
|
| 39 |
+
"'communicate_vendor', or 'submit_financial_decision'."
|
|
|
|
| 40 |
),
|
| 41 |
)
|
| 42 |
+
query_parameters: Optional[Dict[str, Any]] = Field(
|
| 43 |
+
default=None,
|
| 44 |
+
description=(
|
| 45 |
+
"Structured query for database lookups. Example: "
|
| 46 |
+
'{"table": "shipping_logs", "tracking_id": "TRK-9921"}'
|
| 47 |
+
),
|
| 48 |
)
|
| 49 |
+
document_id: Optional[str] = Field(
|
| 50 |
+
default=None,
|
| 51 |
+
description="Unique alphanumeric identifier of the document to read (e.g. 'PO-2024-0055').",
|
| 52 |
+
)
|
| 53 |
+
message_content: Optional[str] = Field(
|
| 54 |
+
default=None,
|
| 55 |
+
description="Natural language message for vendor negotiation (used with 'communicate_vendor').",
|
| 56 |
+
)
|
| 57 |
+
adjustment_amount: Optional[float] = Field(
|
| 58 |
+
default=None,
|
| 59 |
+
description=(
|
| 60 |
+
"The precise monetary adjustment to submit (used with 'submit_financial_decision'). "
|
| 61 |
+
"Must be the exact floating-point value calculated from contract terms."
|
| 62 |
+
),
|
| 63 |
+
)
|
| 64 |
+
adjustment_reason: Optional[str] = Field(
|
| 65 |
+
default=None,
|
| 66 |
+
description="Brief explanation of the adjustment rationale (used with 'submit_financial_decision').",
|
| 67 |
)
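
As a rough illustration of what an agent might send, the sketch below builds action payloads as plain dictionaries using the field names from `ESCTRAction`. It does not import the Pydantic model; `make_action` is a hypothetical helper that imitates the `extra="forbid"` behavior, the `shipping_logs` query comes from the field description above, and the monetary value and reason text are made up.

```python
import json

ALLOWED_ACTIONS = {
    "query_database",
    "read_document",
    "communicate_vendor",
    "submit_financial_decision",
}
KNOWN_FIELDS = {
    "query_parameters",
    "document_id",
    "message_content",
    "adjustment_amount",
    "adjustment_reason",
}


def make_action(action_type: str, **fields) -> dict:
    """Hypothetical helper: build a payload and reject unknown verbs or fields."""
    if action_type not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action_type: {action_type}")
    unknown = set(fields) - KNOWN_FIELDS
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    return {"action_type": action_type, **fields}


query = make_action(
    "query_database",
    query_parameters={"table": "shipping_logs", "tracking_id": "TRK-9921"},
)
submit = make_action(
    "submit_financial_decision",
    adjustment_amount=-450.00,  # illustrative value only
    adjustment_reason="Late delivery penalty per SLA; warehouse access log confirms arrival date.",
)

print(json.dumps(query, indent=2))
print(json.dumps(submit, indent=2))
```

Mentioning the warehouse or access log in `adjustment_reason` matters for the adversarial task, since the environment looks for those keywords when deciding whether evidence was cited.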
|
| 68 |
|
| 69 |
|
| 70 |
+
# ---------------------------------------------------------------------------
|
| 71 |
+
# Observation — what the environment returns after each step
|
| 72 |
+
# ---------------------------------------------------------------------------
|
| 73 |
|
| 74 |
+
class ESCTRObservation(BaseModel):
|
| 75 |
+
"""Observation returned by the ESCTR environment after each step.
|
| 76 |
+
|
| 77 |
+
Provides structured telemetry to help the agent understand the
|
| 78 |
+
outcome of its action and plan the next move.
|
| 79 |
"""
|
| 80 |
|
| 81 |
model_config = ConfigDict(extra="forbid")
|
| 82 |
|
| 83 |
done: bool = Field(default=False, description="Whether the episode has ended")
|
| 84 |
+
reward: float = Field(default=0.0, description="Reward signal for this step (0.0-1.0)")
|
| 85 |
+
system_response: str = Field(
|
| 86 |
+
default="",
|
| 87 |
+
description="Output from the tool: database results, document text, vendor reply, or grader feedback.",
|
| 88 |
+
)
|
| 89 |
last_action_status: Literal["success", "error"] = Field(
|
| 90 |
default="success",
|
| 91 |
+
description="Whether the last action was valid and executed successfully.",
|
| 92 |
)
|
| 93 |
error_message: Optional[str] = Field(
|
| 94 |
default=None,
|
| 95 |
+
description="Diagnostic error message if last_action_status is 'error'.",
|
| 96 |
)
|
| 97 |
current_step: int = Field(
|
| 98 |
default=0,
|
| 99 |
+
description="Current step number within the episode (0-indexed at reset).",
|
| 100 |
+
)
|
| 101 |
+
max_steps: int = Field(
|
| 102 |
+
default=15,
|
| 103 |
+
description="Maximum steps allowed for this task.",
|
| 104 |
)
|
| 105 |
accumulated_reward: float = Field(
|
| 106 |
default=0.0,
|
| 107 |
+
description="Total reward accumulated across all steps in this episode.",
|
| 108 |
)
|
| 109 |
+
task_name: str = Field(default="", description="Current task name.")
|
| 110 |
+
available_tools: List[str] = Field(
|
| 111 |
+
default_factory=list,
|
| 112 |
+
description="List of tool verbs available in this task.",
|
| 113 |
+
)
|
| 114 |
+
metadata: Dict[str, Any] = Field(
|
| 115 |
+
default_factory=dict,
|
| 116 |
+
description="Additional structured metadata (scores, milestones, etc.).",
|
| 117 |
+
)
|
| 118 |
+
|
| 119 |
|
| 120 |
+
# ---------------------------------------------------------------------------
|
| 121 |
+
# State — internal environment state (exposed via GET /state)
|
| 122 |
+
# ---------------------------------------------------------------------------
|
| 123 |
|
| 124 |
+
class ESCTRState(BaseModel):
|
| 125 |
+
"""Internal environment state for the ESCTR environment."""
|
| 126 |
|
| 127 |
model_config = ConfigDict(extra="allow")
|
| 128 |
|
| 129 |
episode_id: Optional[str] = Field(default=None, description="Current episode ID")
|
| 130 |
step_count: int = Field(default=0, ge=0, description="Steps taken in current episode")
|
| 131 |
task_name: str = Field(default="", description="Current task name")
|
| 132 |
+
seed: int = Field(default=0, description="Seed used for procedural generation")
|
| 133 |
+
accumulated_reward: float = Field(default=0.0, description="Total reward accumulated")
|
| 134 |
+
outcome_submitted: bool = Field(default=False, description="Whether final decision was submitted")
|
| 135 |
+
milestones_hit: List[str] = Field(
|
| 136 |
+
default_factory=list,
|
| 137 |
+
description="Trajectory milestones achieved (e.g. 'retrieved_po', 'retrieved_sla').",
|
| 138 |
+
)
|
| 139 |
+
best_score: float = Field(default=0.0, description="Best score achieved")
|
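As a rough illustration of how the `ESCTRAction` fields pair up with the four tool verbs, the sketch below builds one payload per verb as a plain dictionary and rejects unknown keys, loosely mirroring `extra="forbid"`. The helper `check_action` and the `REQUIRED_FIELD` mapping are hypothetical and not part of this commit; the real validation is done by the Pydantic model, and which field each verb actually requires is an assumption inferred from the field descriptions. The IDs in the payloads are the ones used as examples in those descriptions.

```python
from typing import Any, Dict

# Field names taken from the ESCTRAction model shown in this diff.
ALLOWED_FIELDS = {
    "action_type", "query_parameters", "document_id",
    "message_content", "adjustment_amount", "adjustment_reason",
}

# Assumption: the main field each verb relies on, inferred from the descriptions.
REQUIRED_FIELD = {
    "query_database": "query_parameters",
    "read_document": "document_id",
    "communicate_vendor": "message_content",
    "submit_financial_decision": "adjustment_amount",
}


def check_action(payload: Dict[str, Any]) -> None:
    """Hypothetical pre-flight check; raises ValueError on a malformed payload."""
    unknown = set(payload) - ALLOWED_FIELDS
    if unknown:
        # Loosely mirrors the effect of extra="forbid" on the model.
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    verb = payload.get("action_type")
    if verb not in REQUIRED_FIELD:
        raise ValueError(f"unknown action_type: {verb!r}")
    if payload.get(REQUIRED_FIELD[verb]) is None:
        raise ValueError(f"{verb} needs {REQUIRED_FIELD[verb]}")


examples = [
    {"action_type": "query_database",
     "query_parameters": {"table": "shipping_logs", "tracking_id": "TRK-9921"}},
    {"action_type": "read_document", "document_id": "PO-2024-0055"},
    {"action_type": "communicate_vendor",
     "message_content": "Please explain the delivery delay."},
    {"action_type": "submit_financial_decision",
     "adjustment_amount": -55.0, "adjustment_reason": "Line item overcharge"},
]

for payload in examples:
    check_action(payload)  # all four pass the sketch's checks
```

A payload with a misspelled key (for example `doc` instead of `document_id`) fails this check, which is the same class of mistake the `extra="forbid"` setting is meant to surface.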
server/procedural.py
CHANGED
|
@@ -1,426 +1,580 @@
|
|
| 1 |
"""
|
| 2 |
-
Procedural
|
| 3 |
-
|
| 4 |
-
Generates
|
| 5 |
-
|
| 6 |
-
|
| 7 |
"""
|
| 8 |
|
| 9 |
import random
|
| 10 |
-
import
|
| 11 |
-
from
|
| 12 |
|
| 13 |
# ---------------------------------------------------------------------------
|
| 14 |
-
# Data pools
|
| 15 |
# ---------------------------------------------------------------------------
|
| 16 |
|
| 17 |
-
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
("Vanguard Robotics Inc.", "15 Automation Circle", "San Jose", "CA", "95101"),
|
| 32 |
-
("Horizon Energy Systems", "700 Solar Way", "Denver", "CO", "80201"),
|
| 33 |
]
|
| 34 |
|
| 35 |
-
|
| 36 |
-
("
|
| 37 |
-
("
|
| 38 |
-
("
|
| 39 |
-
("
|
| 40 |
-
("Cascade Solutions Group", "55 River Bend Rd", "Portland", "OR", "97201"),
|
| 41 |
-
("Sterling Financial Corp.", "800 Wall St", "New York", "NY", "10005"),
|
| 42 |
-
("Bluestone Retail Inc.", "120 Market Square", "Philadelphia", "PA", "19101"),
|
| 43 |
-
("Northstar Logistics", "450 Freight Way", "Minneapolis", "MN", "55401"),
|
| 44 |
-
("Pacific Tech Ventures", "700 Bay Ave", "San Diego", "CA", "92101"),
|
| 45 |
-
("Redwood Construction LLC", "35 Builder Lane", "Sacramento", "CA", "95801"),
|
| 46 |
-
("Falcon Aerospace", "1 Launchpad Dr", "Huntsville", "AL", "35801"),
|
| 47 |
-
("Cedar Health Systems", "200 Wellness Blvd", "Nashville", "TN", "37201"),
|
| 48 |
-
("Granite Insurance Group", "90 Coverage Ct", "Hartford", "CT", "06101"),
|
| 49 |
-
("Oakmont Education Trust", "60 Campus Way", "Ann Arbor", "MI", "48101"),
|
| 50 |
-
("Sapphire Media Holdings", "500 Broadcast Pl", "Los Angeles", "CA", "90001"),
|
| 51 |
]
|
| 52 |
|
| 53 |
PRODUCT_CATALOG = [
|
| 54 |
-
# (
|
| 55 |
-
("
|
| 56 |
-
("
|
| 57 |
-
("
|
| 58 |
-
("
|
| 59 |
-
("
|
| 60 |
-
("
|
| 61 |
-
("
|
| 62 |
-
("
|
| 63 |
-
("
|
| 64 |
-
("
|
| 65 |
-
("
|
| 66 |
-
("
|
| 67 |
-
("
|
| 68 |
-
("
|
| 69 |
-
("
|
| 70 |
-
("
|
| 71 |
-
("
|
| 72 |
-
("
|
| 73 |
-
("
|
| 74 |
-
("
|
| 75 |
-
("Data Backup Service", 75.00, 300.00, "month"),
|
| 76 |
-
("Security Audit", 500.00, 2500.00, "audit"),
|
| 77 |
-
("Custom Report Development", 200.00, 1000.00, "report"),
|
| 78 |
-
("Training Workshop", 150.00, 500.00, "session"),
|
| 79 |
-
("Prototype Fabrication", 1000.00, 5000.00, "unit"),
|
| 80 |
]
|
| 81 |
|
| 82 |
-
|
| 83 |
|
| 84 |
-
|
| 85 |
-
"
|
| 86 |
-
"
|
| 87 |
-
"
|
| 88 |
-
|
| 89 |
|
| 90 |
-
|
| 91 |
-
"
|
| 92 |
-
"
|
| 93 |
]
|
| 94 |
|
| 95 |
|
| 96 |
class ProceduralEngine:
|
| 97 |
-
"""Seed-
|
| 98 |
|
| 99 |
def __init__(self, seed: int = 0):
|
| 100 |
self.rng = random.Random(seed)
|
| 101 |
|
| 102 |
def _pick(self, pool: list) -> Any:
|
| 103 |
return self.rng.choice(pool)
|
| 104 |
|
| 105 |
-
def
|
| 106 |
-
|
| 107 |
-
|
| 108 |
-
|
| 109 |
-
|
| 110 |
-
|
| 111 |
-
|
| 112 |
-
|
| 113 |
-
f"{
|
| 114 |
-
|
| 115 |
-
|
| 116 |
|
| 117 |
-
def _gen_date(self
|
| 118 |
-
|
| 119 |
-
year = self.rng.choice([2023, 2024, 2025])
|
| 120 |
-
month = self.rng.randint(1, 12)
|
| 121 |
day = self.rng.randint(1, 28)
|
| 122 |
-
|
| 123 |
-
|
| 124 |
-
|
| 125 |
-
|
| 126 |
-
|
| 127 |
-
|
| 128 |
-
|
| 129 |
-
|
| 130 |
-
|
| 131 |
-
|
| 132 |
-
|
| 133 |
-
|
| 134 |
-
|
| 135 |
-
|
| 136 |
-
|
| 137 |
-
|
| 138 |
-
|
| 139 |
-
|
| 140 |
-
|
| 141 |
-
|
| 142 |
-
|
| 143 |
-
|
| 144 |
-
|
| 145 |
-
|
| 146 |
-
|
| 147 |
-
|
| 148 |
-
|
| 149 |
-
|
| 150 |
-
|
| 151 |
-
|
| 152 |
-
|
| 153 |
-
|
| 154 |
-
|
| 155 |
-
|
| 156 |
-
|
| 157 |
-
|
| 158 |
-
|
| 159 |
-
|
| 160 |
-
|
| 161 |
-
|
| 162 |
-
|
| 163 |
-
|
| 164 |
-
|
| 165 |
-
|
| 166 |
-
|
| 167 |
-
|
| 168 |
-
|
| 169 |
-
|
| 170 |
-
|
| 171 |
-
|
| 172 |
-
|
| 173 |
-
|
| 174 |
-
|
| 175 |
-
|
| 176 |
-
|
| 177 |
-
|
| 178 |
-
|
| 179 |
-
|
| 180 |
-
|
| 181 |
-
|
| 182 |
-
|
| 183 |
-
|
| 184 |
-
|
| 185 |
-
Subtotal: ${subtotal:,.2f}
|
| 186 |
-
Tax ({tax_pct}%): ${tax:,.2f}
|
| 187 |
-
Total: ${total:,.2f}
|
| 188 |
-
|
| 189 |
-
Payment Terms: Net {self.rng.choice([15, 30, 45, 60])}
|
| 190 |
-
"""
|
| 191 |
-
ground_truth = {
|
| 192 |
-
"invoice_number": inv_num,
|
| 193 |
-
"date": norm_date,
|
| 194 |
-
"vendor_name": vendor[0],
|
| 195 |
-
"customer_name": customer[0],
|
| 196 |
-
"subtotal": subtotal,
|
| 197 |
-
"tax": tax,
|
| 198 |
-
"total": total,
|
| 199 |
-
"line_items": items,
|
| 200 |
-
}
|
| 201 |
-
return {"id": f"proc_simple_{self.rng.randint(1000,9999)}", "text": text, "ground_truth": ground_truth}
|
| 202 |
-
|
| 203 |
-
def generate_messy(self) -> Dict[str, Any]:
|
| 204 |
-
base = self.generate_simple()
|
| 205 |
-
gt = base["ground_truth"]
|
| 206 |
-
vendor = gt["vendor_name"]
|
| 207 |
-
customer = gt["customer_name"]
|
| 208 |
-
items = gt["line_items"]
|
| 209 |
-
|
| 210 |
-
abbrevs = {"Subtotal": self._pick(["subtot", "s/t", "sub"]),
|
| 211 |
-
"Tax": self._pick(["tx", "tax", "vat"]),
|
| 212 |
-
"Total": self._pick(["TOTAL DUE", "amt due", "grand total", "balance"])}
|
| 213 |
-
|
| 214 |
-
items_text = ""
|
| 215 |
-
for it in items:
|
| 216 |
-
desc_short = it["description"].split("(")[0].strip().lower()
|
| 217 |
-
qty = it["quantity"]
|
| 218 |
-
price = it["unit_price"]
|
| 219 |
-
amt = it["amount"]
|
| 220 |
-
fmt = self.rng.choice([
|
| 221 |
-
f"{qty}x {desc_short} @ {price:.0f} {amt:.0f}",
|
| 222 |
-
f"{desc_short} -- {qty} @ {price:.2f} ea ........... {amt:.0f}",
|
| 223 |
-
f"{desc_short}...${amt:.0f}",
|
| 224 |
-
])
|
| 225 |
-
items_text += fmt + "\n"
|
| 226 |
-
|
| 227 |
-
text = f"""{vendor.lower()}
|
| 228 |
-
{self._pick(VENDOR_POOL)[2].lower()}, {self._pick(VENDOR_POOL)[3]}
|
| 229 |
-
|
| 230 |
-
inv# {gt['invoice_number']}
|
| 231 |
-
dt: {gt['date']}
|
| 232 |
-
|
| 233 |
-
cust: {customer.split('.')[0].split(',')[0].lower()}
|
| 234 |
-
|
| 235 |
-
-- charges --
|
| 236 |
-
{items_text}
|
| 237 |
-
{abbrevs['Subtotal']}: ${gt['subtotal']:.0f}
|
| 238 |
-
{abbrevs['Tax']}: {gt['tax']:.2f}
|
| 239 |
-
========
|
| 240 |
-
{abbrevs['Total']} ${gt['total']:,.2f}
|
| 241 |
-
|
| 242 |
-
pay within 30 days
|
| 243 |
-
"""
|
| 244 |
-
return {"id": f"proc_messy_{self.rng.randint(1000,9999)}", "text": text, "ground_truth": gt}
|
| 245 |
-
|
| 246 |
-
def _apply_ocr_corruption(self, text: str, intensity: float = 0.15) -> str:
|
| 247 |
-
result = list(text)
|
| 248 |
-
for i, ch in enumerate(result):
|
| 249 |
-
if ch in OCR_SUBSTITUTIONS and self.rng.random() < intensity:
|
| 250 |
-
result[i] = OCR_SUBSTITUTIONS[ch]
|
| 251 |
-
return "".join(result)
|
| 252 |
-
|
| 253 |
-
def generate_corrupted(self) -> Dict[str, Any]:
|
| 254 |
-
base = self.generate_simple()
|
| 255 |
-
corrupted_text = self._apply_ocr_corruption(base["text"], 0.18)
|
| 256 |
-
header = self._pick([
|
| 257 |
-
"SC4NNED D0CUMENT - Page 1 of 1\n\n",
|
| 258 |
-
"[SCAN QUALITY: P00R - SOME CHARACTERS MAY BE lNCORRECT]\n\n",
|
| 259 |
-
"---FAXED DOCUMENT---\nQUALITY: [####===---] 40%\n\n",
|
| 260 |
-
])
|
| 261 |
-
footer = self._pick([
|
| 262 |
-
"\n\n--- END 0F SCAN ---",
|
| 263 |
-
"\n\n[PAGE 1/1 - SCAN C0MPLETE]",
|
| 264 |
-
"\n\n---END FAX---",
|
| 265 |
-
])
|
| 266 |
-
return {
|
| 267 |
-
"id": f"proc_corrupt_{self.rng.randint(1000,9999)}",
|
| 268 |
-
"text": header + corrupted_text + footer,
|
| 269 |
-
"ground_truth": base["ground_truth"],
|
| 270 |
-
}
|
| 271 |
-
|
| 272 |
-
def generate_multi_document(self) -> Dict[str, Any]:
|
| 273 |
-
base = self.generate_simple()
|
| 274 |
-
gt = base["ground_truth"]
|
| 275 |
-
po_num = f"PO-{self.rng.choice(['A','B','C','D'])}-{self.rng.randint(2024,2025)}-{self.rng.randint(100,999)}"
|
| 276 |
-
po_date_display, _po_norm = self._gen_date()
|
| 277 |
-
|
| 278 |
-
items_po = ""
|
| 279 |
-
for it in gt["line_items"]:
|
| 280 |
-
items_po += f"- {it['quantity']}x {it['description']} @ ${it['unit_price']:.2f} = ${it['amount']:.2f}\n"
|
| 281 |
-
|
| 282 |
-
credit_amount = round(self._pick(gt["line_items"])["unit_price"] * self.rng.randint(1, 3), 2)
|
| 283 |
-
credit_tax = round(credit_amount * 0.07, 2)
|
| 284 |
-
credit_total = round(credit_amount + credit_tax, 2)
|
| 285 |
-
adjusted_total = round(gt["total"] - credit_total, 2)
|
| 286 |
-
reason = self._pick([
|
| 287 |
-
"Defective items returned",
|
| 288 |
-
"Partial delivery — remaining items backordered",
|
| 289 |
-
"Pricing error on original invoice",
|
| 290 |
-
"Duplicate charge for services",
|
| 291 |
-
])
|
| 292 |
|
| 293 |
-
|
| 294 |
-
|
| 295 |
-
|
| 296 |
-
|
| 297 |
-
|
| 298 |
-
|
| 299 |
-
|
| 300 |
-
|
| 301 |
-
|
| 302 |
-
|
| 303 |
-
|
| 304 |
-
|
| 305 |
-
|
| 306 |
-
|
| 307 |
-
=
|
| 308 |
-
|
| 309 |
-
|
| 310 |
-
|
| 311 |
-
Credit Amount: ${credit_amount:.2f}
|
| 312 |
-
Tax Adjustment: ${credit_tax:.2f}
|
| 313 |
-
Total Credit: -${credit_total:.2f}
|
| 314 |
-
|
| 315 |
-
=== SUMMARY ===
|
| 316 |
-
Original Invoice: ${gt['total']:,.2f}
|
| 317 |
-
Credit Applied: -${credit_total:.2f}
|
| 318 |
-
Adjusted Balance Due: ${adjusted_total:,.2f}
|
| 319 |
-
"""
|
| 320 |
-
gt_multi = dict(gt)
|
| 321 |
-
gt_multi["po_number"] = po_num
|
| 322 |
-
gt_multi["adjustment_reason"] = reason
|
| 323 |
-
gt_multi["adjusted_total"] = adjusted_total
|
| 324 |
-
return {"id": f"proc_multi_{self.rng.randint(1000,9999)}", "text": text, "ground_truth": gt_multi}
|
| 325 |
-
|
| 326 |
-
def generate_adversarial(self) -> Dict[str, Any]:
|
| 327 |
-
base = self.generate_simple()
|
| 328 |
-
gt = base["ground_truth"]
|
| 329 |
-
original_subtotal = gt["subtotal"]
|
| 330 |
-
discount_pct = self._pick([0.05, 0.08, 0.10, 0.12, 0.15])
|
| 331 |
-
discount_amount = round(original_subtotal * discount_pct, 2)
|
| 332 |
-
adjusted_subtotal = round(original_subtotal - discount_amount, 2)
|
| 333 |
-
tax_rate = self._pick(TAX_RATES)
|
| 334 |
-
new_tax = round(adjusted_subtotal * tax_rate, 2)
|
| 335 |
-
new_total = round(adjusted_subtotal + new_tax, 2)
|
| 336 |
-
old_tax = round(original_subtotal * tax_rate, 2)
|
| 337 |
-
original_total = round(original_subtotal + old_tax, 2)
|
| 338 |
-
|
| 339 |
-
draft_inv = f"DRAFT-INV-{self.rng.randint(100,999)}"
|
| 340 |
-
real_inv = gt["invoice_number"] + self._pick(["-R2", "-FINAL", "-REV1"])
|
| 341 |
-
po_num = f"PO-{self.rng.randint(2024,2025)}-{self.rng.randint(100,999)}"
|
| 342 |
-
_, reissue_date = self._gen_date()
|
| 343 |
-
tax_pct = int(tax_rate * 100) if tax_rate * 100 == int(tax_rate * 100) else round(tax_rate * 100, 1)
|
| 344 |
-
|
| 345 |
-
items_text = ""
|
| 346 |
-
for it in gt["line_items"]:
|
| 347 |
-
items_text += f"{it['description']:<30s} {it['quantity']:>5d} ${it['unit_price']:>10.2f} ${it['amount']:>10.2f}\n"
|
| 348 |
-
|
| 349 |
-
discount_pct_display = int(discount_pct * 100) if discount_pct * 100 == int(discount_pct * 100) else round(discount_pct * 100, 1)
|
| 350 |
-
|
| 351 |
-
text = f"""INVOICE
|
| 352 |
-
|
| 353 |
-
*** IMPORTANT: This replaces previous invoice {draft_inv} which was voided ***
|
| 354 |
-
|
| 355 |
-
Invoice Number: {real_inv}
|
| 356 |
-
Previous Reference: {draft_inv} (VOIDED — DO NOT USE)
|
| 357 |
-
Date: {gt['date']}
|
| 358 |
-
Reissue Date: {reissue_date}
|
| 359 |
-
PO Reference: {po_num}
|
| 360 |
-
|
| 361 |
-
From:
|
| 362 |
-
{gt['vendor_name']}
|
| 363 |
-
|
| 364 |
-
Bill To:
|
| 365 |
-
{gt['customer_name']}
|
| 366 |
-
|
| 367 |
-
Description Qty Unit Price Amount
|
| 368 |
-
---------------------------------------------------------------
|
| 369 |
-
{items_text} ** EARLY PAYMENT DISCOUNT: -{discount_pct_display}% applied **
|
| 370 |
-
|
| 371 |
-
Subtotal: ${original_subtotal:,.2f}
|
| 372 |
-
Discount ({discount_pct_display}%): -${discount_amount:,.2f}
|
| 373 |
-
Adjusted Subtotal: ${adjusted_subtotal:,.2f}
|
| 374 |
-
Tax ({tax_pct}%): ${new_tax:,.2f}
|
| 375 |
-
Total: ${new_total:,.2f}
|
| 376 |
-
|
| 377 |
-
NOTE: Original quote was ${original_total:,.2f} but discount applied.
|
| 378 |
-
|
| 379 |
-
!!! BUDGET VARIANCE ALERT !!!
|
| 380 |
-
PO Authorized: ${original_subtotal:,.2f}
|
| 381 |
-
Actual (pre-tax): ${adjusted_subtotal:,.2f}
|
| 382 |
-
Variance: -${discount_amount:,.2f} UNDER BUDGET
|
| 383 |
-
|
| 384 |
-
Payment Terms: Net 10 (discounted) / Net 30 (full price ${original_total:,.2f})
|
| 385 |
-
"""
|
| 386 |
-
discrepancy = (
|
| 387 |
-
f"{discount_pct_display}% early payment discount applied. "
|
| 388 |
-
f"Reissued invoice replaces voided {draft_inv}. "
|
| 389 |
-
f"Adjusted subtotal ${adjusted_subtotal:,.2f} vs original ${original_subtotal:,.2f}."
|
| 390 |
)
|
| 391 |
|
| 392 |
-
|
| 393 |
-
|
| 394 |
-
|
| 395 |
-
|
| 396 |
-
|
| 397 |
-
|
| 398 |
-
|
| 399 |
-
|
| 400 |
-
|
| 401 |
-
|
| 402 |
-
|
| 403 |
-
|
| 404 |
-
|
| 405 |
-
|
| 406 |
-
|
| 407 |
|
| 408 |
|
| 409 |
# ---------------------------------------------------------------------------
|
| 410 |
# Public API
|
| 411 |
# ---------------------------------------------------------------------------
|
| 412 |
|
| 413 |
-
|
| 414 |
-
"
|
| 415 |
-
"
|
| 416 |
-
"
|
| 417 |
-
|
| 418 |
-
|
| 419 |
}
|
| 420 |
|
| 421 |
|
| 422 |
-
def
|
| 423 |
-
"""Generate a
|
| 424 |
engine = ProceduralEngine(seed)
|
| 425 |
-
method =
|
| 426 |
return getattr(engine, method)()
|
| 1 |
"""
|
| 2 |
+
Procedural Generation Engine for the ESCTR Environment.
|
| 3 |
+
|
| 4 |
+
Generates deterministic corporate supply chain scenarios from any seed:
|
| 5 |
+
- Company profiles (vendors, buyers)
|
| 6 |
+
- Product catalogs with contracted pricing
|
| 7 |
+
- Purchase Orders
|
| 8 |
+
- Vendor Invoices (with seeded discrepancies)
|
| 9 |
+
- Service Level Agreements (penalty clauses)
|
| 10 |
+
- Shipping / logistics telemetry
|
| 11 |
+
- Warehouse access logs
|
| 12 |
+
- Vendor negotiation responses
|
| 13 |
+
|
| 14 |
+
Design principle: same seed → identical scenario → deterministic grading.
|
| 15 |
"""
|
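The design principle in the docstring above (same seed, identical scenario) rests on giving each engine its own `random.Random(seed)` instance and drawing values in a fixed order. The sketch below shows the idea in isolation. The function `sample_scenario_ids` is a hypothetical stand-in, not a function from this commit; it only reuses the ID formats that `_gen_id` and `_gen_tracking_id` produce.

```python
import random
from typing import Tuple


def sample_scenario_ids(seed: int) -> Tuple[str, str]:
    """Hypothetical helper: draw a PO number and a tracking ID from one seed."""
    rng = random.Random(seed)  # private generator, independent of the global one
    # The order of draws matters: reordering these lines changes the output.
    po_number = f"PO-{rng.randint(2024, 2025)}-{rng.randint(1000, 9999)}"
    tracking_id = f"TRK-{rng.randint(10000, 99999)}"
    return po_number, tracking_id


first = sample_scenario_ids(42)
second = sample_scenario_ids(42)
print(first == second)  # the same seed reproduces the same IDs
```

Because the generator is private to the function, calls to the module-level `random` functions elsewhere in a program do not disturb the sequence, which is what makes grading against a seeded ground truth repeatable.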
| 16 |
|
| 17 |
import random
|
| 18 |
+
import hashlib
|
| 19 |
+
from dataclasses import dataclass, field, asdict
|
| 20 |
+
from typing import Any, Dict, List, Optional, Tuple
|
| 21 |
+
|
| 22 |
|
| 23 |
# ---------------------------------------------------------------------------
|
| 24 |
+
# Data pools
|
| 25 |
# ---------------------------------------------------------------------------
|
| 26 |
|
| 27 |
+
VENDOR_NAMES = [
|
| 28 |
+
"Apex Industrial Supply Co.", "Meridian Components LLC", "Vanguard Materials Group",
|
| 29 |
+
"Sterling Precision Parts", "Ironclad Manufacturing Corp.", "Cobalt Logistics Inc.",
|
| 30 |
+
"Pinnacle Hardware Solutions", "Atlas Engineering Supply", "Nexus Digital Components",
|
| 31 |
+
"Brightwave Technical Services", "SilverLine Distribution", "Quantum Parts International",
|
| 32 |
+
"Evergreen Industrial Ltd.", "Horizon Supply Chain Corp.", "Titan Fabrication Works",
|
| 33 |
+
]
|
| 34 |
+
|
| 35 |
+
BUYER_NAMES = [
|
| 36 |
+
"Cascade Electronics Inc.", "Redwood Construction Group", "Summit Aerospace Ltd.",
|
| 37 |
+
"Pacific Manufacturing Co.", "Northstar Automotive", "Falcon Defense Systems",
|
| 38 |
+
"Bluestone Energy Corp.", "Cedar Health Technologies", "Granite Infrastructure LLC",
|
| 39 |
+
"Oakmont Robotics Inc.", "Sapphire Semiconductor", "Emerald Biotech Group",
|
| 40 |
+
"Diamond Precision Engineering", "Ruby Telecommunications", "Topaz Data Systems",
|
| 41 |
]
|
| 42 |
|
| 43 |
+
CITIES = [
|
| 44 |
+
("New York", "NY"), ("Chicago", "IL"), ("Houston", "TX"), ("San Francisco", "CA"),
|
| 45 |
+
("Detroit", "MI"), ("Seattle", "WA"), ("Boston", "MA"), ("Denver", "CO"),
|
| 46 |
+
("Austin", "TX"), ("Portland", "OR"), ("Minneapolis", "MN"), ("Cleveland", "OH"),
|
| 47 |
+
("Pittsburgh", "PA"), ("Nashville", "TN"), ("San Diego", "CA"),
|
| 48 |
]
|
| 49 |
|
| 50 |
PRODUCT_CATALOG = [
|
| 51 |
+
# (name, category, min_price, max_price)
|
| 52 |
+
("Stainless Steel Bolts M10 (Box/100)", "hardware", 10.00, 25.00),
|
| 53 |
+
("Copper Wire 500ft Roll AWG-12", "electrical", 65.00, 120.00),
|
| 54 |
+
("Industrial Safety Goggles (Pack/10)", "safety", 25.00, 55.00),
|
| 55 |
+
("Welding Rod E6013 (Bundle/50)", "consumables", 18.00, 42.00),
|
| 56 |
+
("Hydraulic Cylinder Assembly HCA-200", "machinery", 280.00, 550.00),
|
| 57 |
+
("Precision Bearing Set 6205-2RS", "components", 35.00, 90.00),
|
| 58 |
+
("HVAC Filter MERV-13 (Pack/4)", "facilities", 22.00, 48.00),
|
| 59 |
+
("LED Panel Light 600x600mm", "electrical", 35.00, 85.00),
|
| 60 |
+
("Thermal Insulation Roll R-30", "construction", 55.00, 140.00),
|
| 61 |
+
("Network Switch 24-Port Managed", "IT", 180.00, 420.00),
|
| 62 |
+
("Server Rack Mount Kit 42U", "IT", 350.00, 800.00),
|
| 63 |
+
("Pneumatic Valve Assembly PVA-100", "machinery", 120.00, 280.00),
|
| 64 |
+
("Carbon Steel Pipe Schedule 40 (10ft)", "construction", 45.00, 110.00),
|
| 65 |
+
("Circuit Breaker Panel 200A", "electrical", 150.00, 380.00),
|
| 66 |
+
("Laser Calibration Module LCM-5", "precision", 400.00, 950.00),
|
| 67 |
+
("Industrial Adhesive Epoxy (Gallon)", "consumables", 28.00, 72.00),
|
| 68 |
+
("Fiber Optic Cable OM3 (1000ft)", "IT", 200.00, 480.00),
|
| 69 |
+
("Pressure Gauge 0-300 PSI", "instruments", 40.00, 95.00),
|
| 70 |
+
("Anti-Vibration Mount Set (Pack/8)", "machinery", 60.00, 150.00),
|
| 71 |
+
("Clean Room Wipes (Case/5000)", "consumables", 80.00, 190.00),
|
| 72 |
]
|
| 73 |
|
| 74 |
+
SLA_PENALTY_STRUCTURES = [
|
| 75 |
+
{"type": "linear", "rate_per_day": 0.02, "cap": 0.10, "grace_days": 0},
|
| 76 |
+
{"type": "linear", "rate_per_day": 0.015, "cap": 0.15, "grace_days": 1},
|
| 77 |
+
{"type": "linear", "rate_per_day": 0.03, "cap": 0.12, "grace_days": 0},
|
| 78 |
+
{"type": "tiered", "tiers": [(3, 0.02), (7, 0.03), (999, 0.05)], "cap": 0.20, "grace_days": 0},
|
| 79 |
+
{"type": "linear", "rate_per_day": 0.025, "cap": 0.10, "grace_days": 2},
|
| 80 |
+
]
|
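To make the `linear` entries of `SLA_PENALTY_STRUCTURES` concrete, here is a minimal sketch of one plausible way such a structure could be turned into a penalty. It is an assumption, not the commit's grading logic: the code that consumes these structures is not visible in this part of the diff. The sketch assumes the daily rate and cap are fractions of a base amount, that grace days are subtracted from the delay before the rate applies, and that the result is rounded to cents. The function name `linear_penalty` and the `base_amount` parameter are hypothetical, and the `tiered` structure is deliberately left out because its semantics cannot be read off the data alone.

```python
from typing import Any, Dict


def linear_penalty(base_amount: float, delay_days: int, structure: Dict[str, Any]) -> float:
    """Hypothetical reading of a 'linear' SLA penalty structure."""
    if structure["type"] != "linear":
        raise ValueError("this sketch only covers linear structures")
    # Assumption: grace days are not charged.
    chargeable_days = max(0, delay_days - structure["grace_days"])
    # Assumption: the cap limits the total fraction, not the per-day rate.
    fraction = min(chargeable_days * structure["rate_per_day"], structure["cap"])
    return round(base_amount * fraction, 2)


# First entry of SLA_PENALTY_STRUCTURES from the diff.
structure = {"type": "linear", "rate_per_day": 0.02, "cap": 0.10, "grace_days": 0}

print(linear_penalty(1000.0, 3, structure))   # 3 days at 2% per day
print(linear_penalty(1000.0, 10, structure))  # long delay, limited by the 10% cap
```

Under these assumptions the cap takes effect after five chargeable days for this entry, since 5 × 0.02 reaches the 0.10 ceiling.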
| 81 |
|
| 82 |
+
VENDOR_EXCUSES = [
|
| 83 |
+
"Our records indicate the receiving warehouse rejected the initial delivery attempt due to dock unavailability.",
|
| 84 |
+
"The delay was caused by a force majeure weather event that affected our shipping lane.",
|
| 85 |
+
"We believe the shipment arrived on time but was misrouted by your internal receiving department.",
|
| 86 |
+
"Our carrier has confirmed timely delivery; any apparent delay is a systems error on your end.",
|
| 87 |
+
"The contract clearly states penalties apply only to manufacturing delays, not logistics delays.",
|
| 88 |
+
]
|
| 89 |
|
| 90 |
+
SETTLEMENT_OFFERS = [
|
| 91 |
+
"We are prepared to offer a goodwill credit of {pct}% of the penalty amount to resolve this matter.",
|
| 92 |
+
"In the interest of maintaining our business relationship, we propose settling at {pct}% of the claimed penalty.",
|
| 93 |
+
"Our legal team has reviewed the claim. We can offer {pct}% as a final settlement.",
|
| 94 |
]
|
| 95 |
|
| 96 |
|
| 97 |
+
# ---------------------------------------------------------------------------
|
| 98 |
+
# Data classes for generated scenarios
|
| 99 |
+
# ---------------------------------------------------------------------------
|
| 100 |
+
|
| 101 |
+
@dataclass
|
| 102 |
+
class Company:
|
| 103 |
+
name: str
|
| 104 |
+
address: str
|
| 105 |
+
city: str
|
| 106 |
+
state: str
|
| 107 |
+
zip_code: str
|
| 108 |
+
tax_id: str
|
| 109 |
+
|
| 110 |
+
@dataclass
|
| 111 |
+
class LineItem:
|
| 112 |
+
item_id: str
|
| 113 |
+
description: str
|
| 114 |
+
category: str
|
| 115 |
+
quantity: int
|
| 116 |
+
contracted_unit_price: float
|
| 117 |
+
invoiced_unit_price: float
|
| 118 |
+
contracted_total: float
|
| 119 |
+
invoiced_total: float
|
| 120 |
+
has_discrepancy: bool = False
|
| 121 |
+
|
| 122 |
+
@dataclass
|
| 123 |
+
class PurchaseOrder:
|
| 124 |
+
po_number: str
|
| 125 |
+
date: str
|
| 126 |
+
vendor: Company
|
| 127 |
+
buyer: Company
|
| 128 |
+
line_items: List[LineItem]
|
| 129 |
+
total_amount: float
|
| 130 |
+
approved_budget: float
|
| 131 |
+
|
| 132 |
+
@dataclass
|
| 133 |
+
class Invoice:
|
| 134 |
+
invoice_number: str
|
| 135 |
+
date: str
|
| 136 |
+
po_reference: str
|
| 137 |
+
vendor: Company
|
| 138 |
+
buyer: Company
|
| 139 |
+
line_items: List[LineItem]
|
| 140 |
+
subtotal: float
|
| 141 |
+
tax_rate: float
|
| 142 |
+
tax_amount: float
|
| 143 |
+
total: float
|
| 144 |
+
|
| 145 |
+
@dataclass
|
| 146 |
+
class SLAContract:
|
| 147 |
+
contract_id: str
|
| 148 |
+
vendor: str
|
| 149 |
+
buyer: str
|
| 150 |
+
effective_date: str
|
| 151 |
+
penalty_structure: Dict[str, Any]
|
| 152 |
+
delivery_terms: str
|
| 153 |
+
|
| 154 |
+
@dataclass
|
| 155 |
+
class ShippingLog:
|
| 156 |
+
tracking_id: str
|
| 157 |
+
po_reference: str
|
| 158 |
+
carrier: str
|
| 159 |
+
ship_date: str
|
| 160 |
+
expected_delivery: str
|
| 161 |
+
actual_delivery: str
|
| 162 |
+
delay_days: int
|
| 163 |
+
status: str
|
| 164 |
+
|
| 165 |
+
@dataclass
|
| 166 |
+
class WarehouseLog:
|
| 167 |
+
date: str
|
| 168 |
+
dock_id: str
|
| 169 |
+
status: str # "open", "closed", "maintenance"
|
| 170 |
+
staff_on_duty: int
|
| 171 |
+
shipments_received: int
|
| 172 |
+
notes: str
|
| 173 |
+
|
| 174 |
+
@dataclass
|
| 175 |
+
class Scenario:
|
| 176 |
+
"""Complete scenario for one ESCTR episode."""
|
| 177 |
+
task_name: str
|
| 178 |
+
seed: int
|
| 179 |
+
vendor: Company
|
| 180 |
+
buyer: Company
|
| 181 |
+
purchase_order: PurchaseOrder
|
| 182 |
+
invoice: Invoice
|
| 183 |
+
sla_contract: Optional[SLAContract] = None
|
| 184 |
+
shipping_log: Optional[ShippingLog] = None
|
| 185 |
+
warehouse_logs: Optional[List[WarehouseLog]] = None
|
| 186 |
+
# Ground truth for grading
|
| 187 |
+
correct_adjustment: float = 0.0
|
| 188 |
+
discrepant_line_item_id: Optional[str] = None
|
| 189 |
+
correct_line_item_price: Optional[float] = None
|
| 190 |
+
penalty_amount: Optional[float] = None
|
| 191 |
+
vendor_claim_valid: Optional[bool] = None
|
| 192 |
+
|
| 193 |
+
|
| 194 |
+
# ---------------------------------------------------------------------------
|
| 195 |
+
# Procedural Engine
|
| 196 |
+
# ---------------------------------------------------------------------------
|
| 197 |
+
|
| 198 |
class ProceduralEngine:
|
| 199 |
+
"""Seed-deterministic corporate scenario generator."""
|
| 200 |
|
| 201 |
def __init__(self, seed: int = 0):
|
| 202 |
self.rng = random.Random(seed)
|
| 203 |
+
self.seed = seed
|
| 204 |
|
| 205 |
def _pick(self, pool: list) -> Any:
|
| 206 |
return self.rng.choice(pool)
|
| 207 |
|
| 208 |
+
def _gen_company(self, names: list) -> Company:
|
| 209 |
+
name = self._pick(names)
|
| 210 |
+
city, state = self._pick(CITIES)
|
| 211 |
+
return Company(
|
| 212 |
+
name=name,
|
| 213 |
+
address=f"{self.rng.randint(100, 9999)} {self._pick(['Industrial', 'Commerce', 'Innovation', 'Enterprise', 'Technology'])} {self._pick(['Drive', 'Avenue', 'Parkway', 'Boulevard', 'Street'])}",
|
| 214 |
+
city=city,
|
| 215 |
+
state=state,
|
| 216 |
+
zip_code=f"{self.rng.randint(10000, 99999)}",
|
| 217 |
+
tax_id=f"{self.rng.randint(10, 99)}-{self.rng.randint(1000000, 9999999)}",
|
| 218 |
+
)
|
| 219 |
|
| 220 |
+
def _gen_date(self, year: int = 2024, month_range: Tuple[int, int] = (1, 12)) -> str:
|
| 221 |
+
month = self.rng.randint(*month_range)
|
| 222 |
day = self.rng.randint(1, 28)
|
| 223 |
+
return f"{year}-{month:02d}-{day:02d}"
|
| 224 |
+
|
| 225 |
+
def _gen_id(self, prefix: str) -> str:
|
| 226 |
+
return f"{prefix}-{self.rng.randint(2024, 2025)}-{self.rng.randint(1000, 9999)}"
|
| 227 |
+
|
| 228 |
+
def _gen_tracking_id(self) -> str:
|
| 229 |
+
return f"TRK-{self.rng.randint(10000, 99999)}"
|
| 230 |
+
|
| 231 |
+
# ------------------------------------------------------------------
|
| 232 |
+
# Task 1: Easy — Procurement Reconciliation
|
| 233 |
+
# ------------------------------------------------------------------
|
| 234 |
+
def generate_task1(self) -> Scenario:
|
| 235 |
+
"""Generate a simple PO vs Invoice overcharge scenario."""
|
| 236 |
+
vendor = self._gen_company(VENDOR_NAMES)
|
| 237 |
+
buyer = self._gen_company(BUYER_NAMES)
|
| 238 |
+
po_date = self._gen_date(month_range=(1, 6))
|
| 239 |
+
inv_date = self._gen_date(month_range=(2, 7))
|
| 240 |
+
|
| 241 |
+
# Generate 3-5 line items
|
| 242 |
+
num_items = self.rng.randint(3, 5)
|
| 243 |
+
products = self.rng.sample(PRODUCT_CATALOG, num_items)
|
| 244 |
+
discrepant_idx = self.rng.randint(0, num_items - 1)
|
| 245 |
+
|
| 246 |
+
line_items = []
|
| 247 |
+
for i, (name, cat, min_p, max_p) in enumerate(products):
|
| 248 |
+
qty = self.rng.randint(5, 100)
|
| 249 |
+
contracted_price = round(self.rng.uniform(min_p, max_p), 2)
|
| 250 |
+
|
| 251 |
+
if i == discrepant_idx:
|
| 252 |
+
# Overcharge: invoice price higher than contracted
|
| 253 |
+
markup = round(self.rng.uniform(2.00, 15.00), 2)
|
| 254 |
+
invoiced_price = round(contracted_price + markup, 2)
|
| 255 |
+
has_discrepancy = True
|
| 256 |
+
else:
|
| 257 |
+
invoiced_price = contracted_price
|
| 258 |
+
has_discrepancy = False
|
| 259 |
+
|
| 260 |
+
item_id = f"LI-{self.rng.randint(1000, 9999)}"
|
| 261 |
+
line_items.append(LineItem(
|
| 262 |
+
item_id=item_id,
|
| 263 |
+
description=name,
|
| 264 |
+
category=cat,
|
| 265 |
+
quantity=qty,
|
| 266 |
+
contracted_unit_price=contracted_price,
|
| 267 |
+
invoiced_unit_price=invoiced_price,
|
| 268 |
+
contracted_total=round(qty * contracted_price, 2),
|
| 269 |
+
invoiced_total=round(qty * invoiced_price, 2),
|
| 270 |
+
has_discrepancy=has_discrepancy,
|
| 271 |
+
))
|
| 272 |
+
|
| 273 |
+
po_total = round(sum(li.contracted_total for li in line_items), 2)
|
| 274 |
+
inv_subtotal = round(sum(li.invoiced_total for li in line_items), 2)
|
| 275 |
+
tax_rate = self._pick([0.05, 0.06, 0.07, 0.08, 0.09, 0.10])
|
| 276 |
+
tax_amount = round(inv_subtotal * tax_rate, 2)
|
| 277 |
+
inv_total = round(inv_subtotal + tax_amount, 2)
|
| 278 |
+
|
| 279 |
+
po_number = self._gen_id("PO")
|
| 280 |
+
inv_number = self._gen_id("INV")
|
| 281 |
+
|
| 282 |
+
po = PurchaseOrder(
|
| 283 |
+
po_number=po_number, date=po_date, vendor=vendor, buyer=buyer,
|
| 284 |
+
line_items=line_items, total_amount=po_total, approved_budget=round(po_total * 1.05, 2),
|
| 285 |
+
)
|
| 286 |
|
| 287 |
+
invoice = Invoice(
|
| 288 |
+
invoice_number=inv_number, date=inv_date, po_reference=po_number,
|
| 289 |
+
vendor=vendor, buyer=buyer, line_items=line_items,
|
| 290 |
+
subtotal=inv_subtotal, tax_rate=tax_rate, tax_amount=tax_amount, total=inv_total,
|
| 291 |
+
)
|
| 292 |
+
|
| 293 |
+
discrepant = line_items[discrepant_idx]
|
| 294 |
+
correct_total = discrepant.contracted_total
|
| 295 |
+
overcharge = round(discrepant.invoiced_total - correct_total, 2)
|
| 296 |
+
|
| 297 |
+
return Scenario(
|
| 298 |
+
task_name="procurement_reconciliation",
|
| 299 |
+
seed=self.seed,
|
| 300 |
+
vendor=vendor, buyer=buyer,
|
| 301 |
+
purchase_order=po, invoice=invoice,
|
| 302 |
+
correct_adjustment=-overcharge, # negative = reduce invoice
|
| 303 |
+
discrepant_line_item_id=discrepant.item_id,
|
| 304 |
+
correct_line_item_price=correct_total,
|
| 305 |
)
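The overcharge arithmetic in `generate_task1` can be checked by hand. The following is a minimal sketch, not the repository's code: `Line` is a hypothetical stand-in for `LineItem`, and the numbers are made up for illustration. As in the diff above, the adjustment is computed on line totals before tax.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical stand-in for LineItem (only the fields needed here)."""
    quantity: int
    contracted_unit_price: float
    invoiced_unit_price: float


def overcharge_adjustment(line: Line) -> float:
    """Return the pre-tax adjustment; negative means 'reduce the invoice'."""
    contracted_total = round(line.quantity * line.contracted_unit_price, 2)
    invoiced_total = round(line.quantity * line.invoiced_unit_price, 2)
    overcharge = round(invoiced_total - contracted_total, 2)
    return -overcharge


# 10 units contracted at 20.00 but invoiced at 25.50 each
example = Line(quantity=10, contracted_unit_price=20.00, invoiced_unit_price=25.50)
adjustment = overcharge_adjustment(example)
print(adjustment)
```

With these illustrative numbers the contracted total is 200.00, the invoiced total is 255.00, and the adjustment is -55.00.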
|
| 306 |
|
| 307 |
+
# ------------------------------------------------------------------
|
| 308 |
+
# Task 2: Medium — SLA Enforcement
|
| 309 |
+
# ------------------------------------------------------------------
|
| 310 |
+
def generate_task2(self) -> Scenario:
|
| 311 |
+
"""Generate a delayed shipment + SLA penalty scenario."""
|
| 312 |
+
scenario = self.generate_task1() # base PO/invoice
|
| 313 |
+
# Remove the pricing discrepancy for task2 (focus is on shipping)
|
| 314 |
+
for li in scenario.purchase_order.line_items:
|
| 315 |
+
li.invoiced_unit_price = li.contracted_unit_price
|
| 316 |
+
li.invoiced_total = li.contracted_total
|
| 317 |
+
li.has_discrepancy = False
|
| 318 |
+
|
| 319 |
+
# Recalculate invoice
|
| 320 |
+
inv = scenario.invoice
|
| 321 |
+
inv_subtotal = round(sum(li.contracted_total for li in inv.line_items), 2)
|
| 322 |
+
inv.subtotal = inv_subtotal
|
| 323 |
+
inv.tax_amount = round(inv_subtotal * inv.tax_rate, 2)
|
| 324 |
+
inv.total = round(inv_subtotal + inv.tax_amount, 2)
|
| 325 |
+
|
| 326 |
+
# Generate SLA
|
| 327 |
+
sla_struct = self._pick(SLA_PENALTY_STRUCTURES).copy()
|
| 328 |
+
contract_id = self._gen_id("SLA")
|
| 329 |
+
sla = SLAContract(
|
| 330 |
+
contract_id=contract_id,
|
| 331 |
+
vendor=scenario.vendor.name,
|
| 332 |
+
buyer=scenario.buyer.name,
|
| 333 |
+
effective_date=self._gen_date(month_range=(1, 3)),
|
| 334 |
+
penalty_structure=sla_struct,
|
| 335 |
+
delivery_terms=f"Delivery within 14 business days of PO issuance. Penalties per SLA clause {contract_id}-SEC4.",
|
| 336 |
+
)
|
| 337 |
+
|
| 338 |
+
# Generate shipping delay
|
| 339 |
+
delay_days = self.rng.randint(2, 12)
|
| 340 |
+
grace = sla_struct.get("grace_days", 0)
|
| 341 |
+
tracking_id = self._gen_tracking_id()
|
| 342 |
+
|
| 343 |
+
ship_log = ShippingLog(
|
| 344 |
+
tracking_id=tracking_id,
|
| 345 |
+
po_reference=scenario.purchase_order.po_number,
|
| 346 |
+
carrier=self._pick(["FedEx Freight", "UPS Supply Chain", "XPO Logistics", "USPS Priority", "DHL Express"]),
|
| 347 |
+
ship_date=scenario.purchase_order.date,
|
| 348 |
+
expected_delivery=self._gen_date(month_range=(3, 5)),
|
| 349 |
+
actual_delivery=self._gen_date(month_range=(4, 6)),
|
| 350 |
+
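delay_days=delay_days,  # used for the penalty; the two dates above come from separate _gen_date calls and are not derived from this value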
|
| 351 |
+
status="delivered_late",
|
| 352 |
+
)
|
| 353 |
+
|
| 354 |
+
# Calculate penalty
|
| 355 |
+
penalizable_days = max(0, delay_days - grace)
|
| 356 |
+
if sla_struct["type"] == "linear":
|
| 357 |
+
rate = sla_struct["rate_per_day"]
|
| 358 |
+
cap = sla_struct["cap"]
|
| 359 |
+
penalty_pct = min(penalizable_days * rate, cap)
|
| 360 |
+
elif sla_struct["type"] == "tiered":
|
| 361 |
+
penalty_pct = 0.0
|
| 362 |
+
remaining = penalizable_days
|
| 363 |
+
for threshold, rate in sla_struct["tiers"]:
|
| 364 |
+
if remaining <= 0:
|
| 365 |
+
break
|
| 366 |
+
days_in_tier = min(remaining, threshold)  # threshold acts as the tier's width here; render_sla prints it as a cumulative day boundary
|
| 367 |
+
penalty_pct += days_in_tier * rate
|
| 368 |
+
remaining -= days_in_tier
|
| 369 |
+
penalty_pct = min(penalty_pct, sla_struct["cap"])
|
| 370 |
+
else:
|
| 371 |
+
penalty_pct = 0.0
|
| 372 |
+
|
| 373 |
+
penalty_amount = round(inv.subtotal * penalty_pct, 2)
|
| 374 |
+
|
| 375 |
+
scenario.task_name = "sla_enforcement"
|
| 376 |
+
scenario.sla_contract = sla
|
| 377 |
+
scenario.shipping_log = ship_log
|
| 378 |
+
scenario.correct_adjustment = -penalty_amount # deduction from invoice
|
| 379 |
+
scenario.penalty_amount = penalty_amount
|
| 380 |
+
scenario.discrepant_line_item_id = None
|
| 381 |
+
scenario.correct_line_item_price = None
|
| 382 |
+
|
| 383 |
+
return scenario
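The penalty calculation in `generate_task2` can be isolated into a small function for testing. This is a sketch under stated assumptions: the two structure dictionaries below are hypothetical (the real `SLA_PENALTY_STRUCTURES` is not shown in this diff), and `penalty_fraction` is an illustrative name. The tier loop mirrors the logic above, treating each threshold as the number of days a tier can absorb.

```python
def penalty_fraction(structure: dict, delay_days: int) -> float:
    """Fraction of the invoice subtotal owed as a late-delivery penalty."""
    penalizable = max(0, delay_days - structure.get("grace_days", 0))
    if structure["type"] == "linear":
        return min(penalizable * structure["rate_per_day"], structure["cap"])
    if structure["type"] == "tiered":
        total, remaining = 0.0, penalizable
        for threshold, rate in structure["tiers"]:
            if remaining <= 0:
                break
            days_in_tier = min(remaining, threshold)
            total += days_in_tier * rate
            remaining -= days_in_tier
        return min(total, structure["cap"])
    return 0.0  # unknown structure type: no penalty


# Hypothetical structures, for illustration only
linear = {"type": "linear", "rate_per_day": 0.01, "cap": 0.10, "grace_days": 1}
tiered = {"type": "tiered", "tiers": [(3, 0.01), (999, 0.02)], "cap": 0.15, "grace_days": 0}

linear_pct = penalty_fraction(linear, delay_days=5)    # 4 penalizable days at 1%
tiered_pct = penalty_fraction(tiered, delay_days=5)    # 3 days at 1% + 2 days at 2%
capped_pct = penalty_fraction(linear, delay_days=20)   # hits the 10% cap
penalty_amount = round(1000.00 * linear_pct, 2)        # on a 1000.00 subtotal
```

Keeping the calculation in a pure function like this makes it straightforward to check the grace period and cap independently of scenario generation.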
|
| 384 |
+
|
| 385 |
+
# ------------------------------------------------------------------
|
| 386 |
+
# Task 3: Hard — Adversarial Auditing
|
| 387 |
+
# ------------------------------------------------------------------
|
| 388 |
+
def generate_task3(self) -> Scenario:
|
| 389 |
+
"""Generate adversarial vendor dispute scenario."""
|
| 390 |
+
scenario = self.generate_task2() # has SLA + shipping
|
| 391 |
+
|
| 392 |
+
# Generate warehouse logs proving dock was open during disputed window
|
| 393 |
+
delivery_date = scenario.shipping_log.actual_delivery
|
| 394 |
+
warehouse_logs = []
|
| 395 |
+
for _ in range(-1, 3): # four entries, nominally the day before through 2 days after
|
| 396 |
+
# Parse date for log entries
|
| 397 |
+
log_date = delivery_date # simplified: use actual delivery date
|
| 398 |
+
warehouse_logs.append(WarehouseLog(
|
| 399 |
+
date=log_date,
|
| 400 |
+
dock_id=f"DOCK-{self._pick(['A', 'B', 'C'])}{self.rng.randint(1, 5)}",
|
| 401 |
+
status="open",
|
| 402 |
+
staff_on_duty=self.rng.randint(3, 8),
|
| 403 |
+
shipments_received=self.rng.randint(5, 20),
|
| 404 |
+
notes=f"Normal operations. {self.rng.randint(5, 20)} deliveries processed.",
|
| 405 |
+
))
|
| 406 |
+
|
| 407 |
+
scenario.task_name = "adversarial_auditing"
|
| 408 |
+
scenario.warehouse_logs = warehouse_logs
|
| 409 |
+
scenario.vendor_claim_valid = False # vendor's claim is always invalid in this task
|
| 410 |
+
|
| 411 |
+
return scenario
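`generate_task3` currently stamps every warehouse log with the delivery date itself. If distinct dates were wanted, one option is to offset the delivery date with `datetime.timedelta`. The sketch below assumes dates are ISO-8601 strings such as `2024-04-30`; the actual format returned by `_gen_date` is not shown in this diff, and `log_dates_around` is a hypothetical helper.

```python
from datetime import date, timedelta
from typing import List


def log_dates_around(delivery_iso: str, before: int = 1, after: int = 2) -> List[str]:
    """Return ISO dates from `before` days prior to `after` days past delivery."""
    delivery = date.fromisoformat(delivery_iso)
    return [
        (delivery + timedelta(days=offset)).isoformat()
        for offset in range(-before, after + 1)
    ]


# Same window as range(-1, 3) in the loop above; crosses a month boundary
dates = log_dates_around("2024-04-30")
print(dates)
```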
|
| 412 |
+
|
| 413 |
+
|
| 414 |
+
# ---------------------------------------------------------------------------
|
| 415 |
+
# Document renderers — produce human-readable text from data structures
|
| 416 |
+
# ---------------------------------------------------------------------------
|
| 417 |
+
|
| 418 |
+
def render_purchase_order(po: PurchaseOrder) -> str:
|
| 419 |
+
lines = [
|
| 420 |
+
"═══════════════════════════════════════════",
|
| 421 |
+
" PURCHASE ORDER",
|
| 422 |
+
"═══════════════════════════════════════════",
|
| 423 |
+
f"PO Number: {po.po_number}",
|
| 424 |
+
f"Date: {po.date}",
|
| 425 |
+
f"Approved Budget: ${po.approved_budget:,.2f}",
|
| 426 |
+
"",
|
| 427 |
+
f"Vendor: {po.vendor.name}",
|
| 428 |
+
f" {po.vendor.address}",
|
| 429 |
+
f" {po.vendor.city}, {po.vendor.state} {po.vendor.zip_code}",
|
| 430 |
+
"",
|
| 431 |
+
f"Buyer: {po.buyer.name}",
|
| 432 |
+
f" {po.buyer.address}",
|
| 433 |
+
f" {po.buyer.city}, {po.buyer.state} {po.buyer.zip_code}",
|
| 434 |
+
"",
|
| 435 |
+
"Line Items:",
|
| 436 |
+
f"{'ID':<12} {'Description':<40} {'Qty':>5} {'Unit Price':>12} {'Total':>12}",
|
| 437 |
+
"─" * 85,
|
| 438 |
+
]
|
| 439 |
+
for li in po.line_items:
|
| 440 |
+
lines.append(
|
| 441 |
+
f"{li.item_id:<12} {li.description:<40} {li.quantity:>5} "
|
| 442 |
+
f"${li.contracted_unit_price:>10,.2f} ${li.contracted_total:>10,.2f}"
|
| 443 |
+
)
|
| 444 |
+
lines.extend([
|
| 445 |
+
"─" * 85,
|
| 446 |
+
f"{'PO Total:':>71} ${po.total_amount:>10,.2f}",
|
| 447 |
+
"═══════════════════════════════════════════",
|
| 448 |
+
])
|
| 449 |
+
return "\n".join(lines)
|
| 450 |
+
|
| 451 |
+
|
| 452 |
+
def render_invoice(inv: Invoice) -> str:
|
| 453 |
+
tax_pct = f"{inv.tax_rate * 100:.1f}"
|
| 454 |
+
lines = [
|
| 455 |
+
"═══════════════════════════════════════════",
|
| 456 |
+
" INVOICE",
|
| 457 |
+
"═══════════════════════════════════════════",
|
| 458 |
+
f"Invoice Number: {inv.invoice_number}",
|
| 459 |
+
f"Date: {inv.date}",
|
| 460 |
+
f"PO Reference: {inv.po_reference}",
|
| 461 |
+
"",
|
| 462 |
+
f"From: {inv.vendor.name}",
|
| 463 |
+
f" {inv.vendor.address}",
|
| 464 |
+
f" {inv.vendor.city}, {inv.vendor.state} {inv.vendor.zip_code}",
|
| 465 |
+
f" Tax ID: {inv.vendor.tax_id}",
|
| 466 |
+
"",
|
| 467 |
+
f"Bill To: {inv.buyer.name}",
|
| 468 |
+
f" {inv.buyer.address}",
|
| 469 |
+
f" {inv.buyer.city}, {inv.buyer.state} {inv.buyer.zip_code}",
|
| 470 |
+
"",
|
| 471 |
+
f"{'ID':<12} {'Description':<40} {'Qty':>5} {'Unit Price':>12} {'Amount':>12}",
|
| 472 |
+
"─" * 85,
|
| 473 |
+
]
|
| 474 |
+
for li in inv.line_items:
|
| 475 |
+
lines.append(
|
| 476 |
+
f"{li.item_id:<12} {li.description:<40} {li.quantity:>5} "
|
| 477 |
+
f"${li.invoiced_unit_price:>10,.2f} ${li.invoiced_total:>10,.2f}"
|
| 478 |
+
)
|
| 479 |
+
lines.extend([
|
| 480 |
+
"─" * 85,
|
| 481 |
+
f"{'Subtotal:':>71} ${inv.subtotal:>10,.2f}",
|
| 482 |
+
f"{'Tax (' + tax_pct + '%):':>71} ${inv.tax_amount:>10,.2f}",
|
| 483 |
+
f"{'TOTAL DUE:':>71} ${inv.total:>10,.2f}",
|
| 484 |
+
"═══════════════════════════════════════════",
|
| 485 |
+
])
|
| 486 |
+
return "\n".join(lines)
|
| 487 |
+
|
| 488 |
+
|
| 489 |
+
def render_sla(sla: SLAContract) -> str:
|
| 490 |
+
ps = sla.penalty_structure
|
| 491 |
+
lines = [
|
| 492 |
+
"═══════════════════════════════════════════",
|
| 493 |
+
" SERVICE LEVEL AGREEMENT",
|
| 494 |
+
"═══════════════════════════════════════════",
|
| 495 |
+
f"Contract ID: {sla.contract_id}",
|
| 496 |
+
f"Effective Date: {sla.effective_date}",
|
| 497 |
+
f"Vendor: {sla.vendor}",
|
| 498 |
+
f"Buyer: {sla.buyer}",
|
| 499 |
+
"",
|
| 500 |
+
f"Delivery Terms: {sla.delivery_terms}",
|
| 501 |
+
"",
|
| 502 |
+
"LATE DELIVERY PENALTY CLAUSE:",
|
| 503 |
+
]
|
| 504 |
+
if ps["type"] == "linear":
|
| 505 |
+
lines.append(f" - Penalty rate: {ps['rate_per_day'] * 100:.1f}% of invoice subtotal per day late")
|
| 506 |
+
lines.append(f" - Maximum penalty cap: {ps['cap'] * 100:.0f}% of invoice subtotal")
|
| 507 |
+
if ps.get("grace_days", 0) > 0:
|
| 508 |
+
lines.append(f" - Grace period: {ps['grace_days']} business day(s)")
|
| 509 |
+
elif ps["type"] == "tiered":
|
| 510 |
+
lines.append(" - Tiered penalty structure:")
|
| 511 |
+
prev = 0
|
| 512 |
+
for threshold, rate in ps["tiers"]:
|
| 513 |
+
if threshold >= 999:
|
| 514 |
+
lines.append(f" Day {prev + 1}+: {rate * 100:.1f}% per day")
|
| 515 |
+
else:
|
| 516 |
+
lines.append(f" Days {prev + 1}-{threshold}: {rate * 100:.1f}% per day")
|
| 517 |
+
prev = threshold
|
| 518 |
+
lines.append(f" - Maximum penalty cap: {ps['cap'] * 100:.0f}% of invoice subtotal")
|
| 519 |
+
lines.append("═══════════════════════════════════════════")
|
| 520 |
+
return "\n".join(lines)
|
| 521 |
+
|
| 522 |
+
|
| 523 |
+
def render_shipping_log(log: ShippingLog) -> str:
|
| 524 |
+
return "\n".join([
|
| 525 |
+
"═══════════════════════════════════════════",
|
| 526 |
+
" SHIPPING LOG",
|
| 527 |
+
"═══════════════════════════════════════════",
|
| 528 |
+
f"Tracking ID: {log.tracking_id}",
|
| 529 |
+
f"PO Reference: {log.po_reference}",
|
| 530 |
+
f"Carrier: {log.carrier}",
|
| 531 |
+
f"Ship Date: {log.ship_date}",
|
| 532 |
+
f"Expected Delivery: {log.expected_delivery}",
|
| 533 |
+
f"Actual Delivery: {log.actual_delivery}",
|
| 534 |
+
f"Delay: {log.delay_days} day(s)",
|
| 535 |
+
f"Status: {log.status}",
|
| 536 |
+
"═══════════════════════════════════════════",
|
| 537 |
+
])
|
| 538 |
+
|
| 539 |
+
|
| 540 |
+
def render_warehouse_logs(logs: List[WarehouseLog]) -> str:
|
| 541 |
+
lines = [
|
| 542 |
+
"═══════════════════════════════════════════",
|
| 543 |
+
" WAREHOUSE ACCESS LOGS",
|
| 544 |
+
"═══════════════════════════════════════════",
|
| 545 |
+
]
|
| 546 |
+
for wl in logs:
|
| 547 |
+
lines.extend([
|
| 548 |
+
f"Date: {wl.date} | Dock: {wl.dock_id} | Status: {wl.status.upper()}",
|
| 549 |
+
f" Staff on duty: {wl.staff_on_duty} | Shipments received: {wl.shipments_received}",
|
| 550 |
+
f" Notes: {wl.notes}",
|
| 551 |
+
"",
|
| 552 |
+
])
|
| 553 |
+
lines.append("═══════════════════════════════════════════")
|
| 554 |
+
return "\n".join(lines)
|
| 555 |
|
| 556 |
|
| 557 |
# ---------------------------------------------------------------------------
|
| 558 |
# Public API
|
| 559 |
# ---------------------------------------------------------------------------
|
| 560 |
|
| 561 |
+
TASK_GENERATORS = {
|
| 562 |
+
"procurement_reconciliation": "generate_task1",
|
| 563 |
+
"sla_enforcement": "generate_task2",
|
| 564 |
+
"adversarial_auditing": "generate_task3",
|
| 565 |
+
}
|
| 566 |
+
|
| 567 |
+
VALID_TASKS = list(TASK_GENERATORS.keys())
|
| 568 |
+
|
| 569 |
+
MAX_STEPS = {
|
| 570 |
+
"procurement_reconciliation": 10,
|
| 571 |
+
"sla_enforcement": 15,
|
| 572 |
+
"adversarial_auditing": 20,
|
| 573 |
}
|
| 574 |
|
| 575 |
|
| 576 |
+
def generate_scenario(task_name: str, seed: int = 0) -> Scenario:
|
| 577 |
+
"""Generate a complete ESCTR scenario for the given task and seed."""
|
| 578 |
engine = ProceduralEngine(seed)
|
| 579 |
+
method = TASK_GENERATORS.get(task_name, "generate_task1")
|
| 580 |
return getattr(engine, method)()
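The public API combines two ideas: name-based dispatch through `getattr`, and determinism from a seeded random generator. The sketch below illustrates both with a hypothetical `TinyEngine` and a trimmed mapping; it is not the repository's `ProceduralEngine`, whose constructor and helpers are not shown in this diff.

```python
import random


class TinyEngine:
    """Hypothetical miniature of a seeded scenario engine."""

    def __init__(self, seed: int) -> None:
        self.rng = random.Random(seed)  # private RNG: no global state involved

    def generate_task1(self) -> dict:
        return {"task": "procurement_reconciliation", "qty": self.rng.randint(5, 100)}

    def generate_task2(self) -> dict:
        scenario = self.generate_task1()  # build on the simpler task
        scenario.update(task="sla_enforcement", delay_days=self.rng.randint(2, 12))
        return scenario


GENERATORS = {
    "procurement_reconciliation": "generate_task1",
    "sla_enforcement": "generate_task2",
}


def generate(task_name: str, seed: int = 0) -> dict:
    engine = TinyEngine(seed)
    method = GENERATORS.get(task_name, "generate_task1")  # unknown names fall back
    return getattr(engine, method)()


first = generate("sla_enforcement", seed=7)
second = generate("sla_enforcement", seed=7)
fallback = generate("unknown_task", seed=7)
```

Because each call builds a fresh engine from the seed, the same task and seed give the same scenario. As in `generate_scenario` above, an unrecognized task name silently falls back to the first generator; callers that want an error should check the name against `VALID_TASKS` first.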
|