Musharraf committed on
Commit
a2ae67c
·
1 Parent(s): 7de3176

Add 2 new frontier-challenging tasks + reward shaping system


- corrupted_scan: OCR-corrupted documents with char substitutions (0/O, 1/l, 5/S)
- adversarial_invoice: decoy fields, contradictions, hidden calculations, discrepancy detection
- Reward shaping: consistency bonus (+0.03), efficiency signal (+0.01-0.02), improvement tracking (+0.02)
- 15 total documents across 5 difficulty tiers (easy -> expert)
- Updated inference.py with task-specific prompts for all 5 tasks
- Comprehensive README rewrite with reward design documentation

README.md CHANGED
@@ -12,7 +12,7 @@ tags:
12
 
13
  # Invoice Extraction Environment
14
 
15
- An OpenEnv-compliant environment where AI agents extract structured data from unstructured invoice and receipt documents.
16
 
17
  **Space URL:** `https://huggingface.co/spaces/musharraf7/invoice-extraction-env`
18
 
@@ -25,14 +25,14 @@ r = requests.post(f"{url}/reset", json={"task_name": "simple_invoice"})
25
  print(r.json())
26
  ```
27
 
28
- ## Environment Description
29
 
30
- This environment simulates real-world document data extraction, a task faced daily by businesses processing invoices, receipts, and purchase orders. The agent receives unstructured text documents and must extract specific structured fields (invoice numbers, dates, vendor names, line items, totals, etc.).
31
 
32
- ### Why This Matters
33
- - **$5B+ industry:** Automated document processing is one of the largest business process automation markets
34
- - **Real RL training signal:** Partial-credit rewards on a per-field basis provide rich gradient
35
- - **Difficulty progression:** Three task levels test increasingly complex reasoning
36
 
37
  ## Action Space
38
 
@@ -65,35 +65,63 @@ Each step returns an `InvoiceObservation`:
65
  | `attempts_remaining` | int | Remaining extraction attempts |
66
  | `required_fields` | list | Fields to extract |
67
  | `done` | bool | Whether the episode has ended |
68
- | `reward` | float | Reward signal [0.0–1.0] |
69
 
70
- ## Tasks
71
 
72
- ### 1. `simple_invoice` (Easy)
73
- Clean, well-formatted invoices with clear field labels. The agent must extract 8 fields including invoice number, date, vendor/customer names, financial totals, and itemized line items.
74
 
75
  **Required fields:** `invoice_number`, `date`, `vendor_name`, `customer_name`, `subtotal`, `tax`, `total`, `line_items`
76
 
77
- ### 2. `messy_invoice` (Medium)
78
  Same fields but from messy, inconsistently formatted documents with abbreviations, typos, and non-standard layouts.
79
 
80
  **Required fields:** Same as simple_invoice
81
 
82
- ### 3. `multi_document` (Hard)
83
- Complex multi-section documents containing a purchase order, invoice, and credit memo/payment receipt. The agent must cross-reference sections and extract 11 fields including the adjusted total.
84
 
85
- **Required fields:** All of the above + `po_number`, `adjustment_reason`, `adjusted_total`
86
 
87
  ## Reward Design
88
 
89
- - **Per-field scoring:** Each field is scored independently (0.0–1.0)
90
- - Text fields: Fuzzy matching with SequenceMatcher
91
- - Numeric fields: Exact match (1.0), within 1% (0.9), within 5% (0.5)
92
- - Date fields: Normalized comparison (YYYY-MM-DD)
93
- - Line items: Matched by best-fit comparison of description, qty, price, amount
94
- - **Overall score:** Weighted average of all field scores
95
- - **Episode rewards:** Best score across all extraction attempts
96
- - **Partial progress:** Feedback identifies weak fields for refinement
97
 
98
  ## Setup Instructions
99
 
@@ -109,6 +137,11 @@ pip install -r requirements.txt
109
  uvicorn server.app:app --host 0.0.0.0 --port 7860
110
  ```
111
 
 
 
 
 
 
112
  ### Run inference
113
  ```bash
114
  export ENV_URL="http://localhost:7860"
@@ -135,12 +168,12 @@ python inference.py
135
  ├── server/
136
  │ ├── __init__.py
137
  │ ├── app.py # FastAPI application
138
- │ ├── environment.py # Core environment logic
139
- │ ├── documents.py # Document corpus
140
- │ ├── graders.py # Scoring/grading logic
141
- │ └── models.py # Pydantic Action/Observation types
142
  ├── __init__.py # Package declaration
143
- ├── inference.py # Baseline inference script
144
  ├── openenv.yaml # OpenEnv manifest
145
  ├── pyproject.toml # Package configuration
146
  ├── requirements.txt # Dependencies
 
12
 
13
  # Invoice Extraction Environment
14
 
15
+ An OpenEnv-compliant environment where AI agents extract structured data from unstructured invoice and receipt documents. Features **5 difficulty tiers** — from clean invoices to adversarial documents with decoy fields, OCR corruption, and hidden calculations.
16
 
17
  **Space URL:** `https://huggingface.co/spaces/musharraf7/invoice-extraction-env`
18
 
 
25
  print(r.json())
26
  ```
27
 
28
+ ## Why This Environment?
29
 
30
+ Invoice data extraction is a **$5B+ industry** problem that businesses face daily. This environment provides:
31
 
32
+ - **Real RL training signal**: Per-field partial-credit scoring gives dense reward gradients
33
+ - **Genuine difficulty progression**: From clean invoices to adversarial traps that challenge frontier models
34
+ - **Reward shaping**: Consistency bonuses, efficiency signals, and improvement tracking provide rich learning signals beyond simple field matching
35
+ - **Production relevance**: The task directly models what commercial document processing systems must solve
36
 
37
  ## Action Space
38
 
 
65
  | `attempts_remaining` | int | Remaining extraction attempts |
66
  | `required_fields` | list | Fields to extract |
67
  | `done` | bool | Whether the episode has ended |
68
+ | `reward` | float | Reward signal (0.01–0.99) |
69
 
70
+ ## Tasks (5 Difficulty Tiers)
71
 
72
+ ### 1. `simple_invoice` (Easy) — 3 attempts
73
+ Clean, well-formatted invoices with clear field labels.
74
 
75
  **Required fields:** `invoice_number`, `date`, `vendor_name`, `customer_name`, `subtotal`, `tax`, `total`, `line_items`
76
 
77
+ ### 2. `messy_invoice` (Medium) — 3 attempts
78
  Same fields but from messy, inconsistently formatted documents with abbreviations, typos, and non-standard layouts.
79
 
80
  **Required fields:** Same as simple_invoice
81
 
82
+ ### 3. `multi_document` (Hard) — 5 attempts
83
+ Complex multi-section documents containing a purchase order, invoice, and credit memo/payment receipt. The agent must cross-reference sections.
84
 
85
+ **Required fields:** All basic fields + `po_number`, `adjustment_reason`, `adjusted_total`
86
+
87
+ ### 4. `corrupted_scan` (Very Hard) — 4 attempts
88
+ Simulates OCR-scanned/faxed invoices with systematic character errors:
89
+ - Character substitutions: `0`↔`O`, `1`↔`l`↔`I`, `5`↔`S`, `8`↔`B`
90
+ - Garbled sections and scan artifacts
91
+ - The agent must **reason through noise** to recover the true values
92
+
93
+ **Required fields:** Same as simple_invoice
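The character substitutions listed for `corrupted_scan` can be undone mechanically for tokens that are known to be numeric. The sketch below is illustrative only: `repair_amount` and `OCR_DIGIT_FIXES` are hypothetical names that do not appear in this commit, and the rule "strip everything except digits and the decimal point" is an assumption about how an agent might post-process amounts.

```python
import re

# Letter-to-digit repairs taken from the task description (O->0, l/I->1, S->5, B->8).
# Hypothetical helper: only safe for tokens that are expected to be numeric.
OCR_DIGIT_FIXES = str.maketrans({"O": "0", "l": "1", "I": "1", "S": "5", "B": "8"})


def repair_amount(token):
    """Turn an OCR-damaged money token such as '$1,OSO.OO' into a float."""
    fixed = token.translate(OCR_DIGIT_FIXES)
    # Drop the currency symbol, thousands separators, and stray scan marks.
    cleaned = re.sub(r"[^0-9.]", "", fixed)
    return float(cleaned)  # raises ValueError if nothing numeric survives


if __name__ == "__main__":
    for raw in ["$1,OSO.OO", "$l,l4O.OO", "$2,98S.3O"]:
        print(raw, "->", repair_amount(raw))
```

Applying the same table to names or descriptions would corrupt legitimate letters, so text fields need context-aware correction instead.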
94
+
95
+ ### 5. `adversarial_invoice` (Expert) — 6 attempts
96
+ Adversarial documents designed to trap and challenge frontier models:
97
+ - **Decoy fields**: Multiple invoice numbers — only one is current
98
+ - **Hidden calculations**: Discounts the agent must compute
99
+ - **Contradictory sections**: PO vs invoice disagreements
100
+ - **Budget variance alerts**: Agent must identify and explain discrepancies
101
+
102
+ **Required fields:** All basic fields + `po_number`, `discount_amount`, `original_total`, `discrepancy_notes`
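As a worked instance of a "hidden calculation", the figures of document `adversarial_001` added in this commit (5% off a $600.00 consulting line, 8% tax) can be reproduced with decimal arithmetic. The function name and signature below are assumptions made for illustration; the environment itself may compute or store these values differently.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def money(value):
    """Round a value to whole cents."""
    return Decimal(value).quantize(CENT, rounding=ROUND_HALF_UP)


def apply_line_discount(subtotal, discounted_line, discount_rate, tax_rate):
    """Hypothetical helper: discount one line, then tax the adjusted subtotal."""
    discount = money(Decimal(discounted_line) * Decimal(discount_rate))
    adjusted = money(Decimal(subtotal) - discount)
    tax = money(adjusted * Decimal(tax_rate))
    return {
        "discount_amount": discount,
        "subtotal": adjusted,
        "tax": tax,
        "total": adjusted + tax,
    }


if __name__ == "__main__":
    # Inputs are passed as strings so no binary floating-point error creeps in.
    print(apply_line_discount("1050.00", "600.00", "0.05", "0.08"))
```

With those inputs the result matches the ground truth for that document: discount 30.00, adjusted subtotal 1020.00, tax 81.60, total 1101.60.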
103
 
104
  ## Reward Design
105
 
106
+ ### Per-Field Scoring (Base Score)
107
+ - **Text fields**: Fuzzy matching with SequenceMatcher (0.0–1.0)
108
+ - **Numeric fields**: Exact match (1.0), within 1% (0.9), within 5% (0.5)
109
+ - **Date fields**: Normalized comparison (YYYY-MM-DD)
110
+ - **Line items**: Best-fit matching of description, qty, price, amount
111
+ - **Reasoning fields** (discrepancy_notes): Fuzzy matching with lower threshold
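The text and numeric rules above can be sketched as two small functions. This is not the code in `graders.py`: the names `score_text` and `score_numeric` are hypothetical, and three details are assumptions the README does not state (case and whitespace folding before comparison, relative error measured against the expected value, and a score of 0.0 beyond 5%).

```python
from difflib import SequenceMatcher


def score_text(predicted, expected):
    """Fuzzy similarity in [0.0, 1.0]; case/whitespace folding is an assumption."""
    if predicted is None or expected is None:
        return 0.0
    a = " ".join(str(predicted).lower().split())
    b = " ".join(str(expected).lower().split())
    return SequenceMatcher(None, a, b).ratio()


def score_numeric(predicted, expected):
    """Tiered score: exact 1.0, within 1% 0.9, within 5% 0.5, else 0.0 (assumed)."""
    try:
        p, e = float(predicted), float(expected)
    except (TypeError, ValueError):
        return 0.0  # null or non-numeric answers earn nothing
    if p == e:
        return 1.0
    if e == 0:
        return 0.0  # relative error is undefined against a zero target
    rel_err = abs(p - e) / abs(e)
    if rel_err <= 0.01:
        return 0.9
    if rel_err <= 0.05:
        return 0.5
    return 0.0


if __name__ == "__main__":
    print(score_numeric(1130, 1134))          # small error, within 1%
    print(score_text("Acme Corp", "Acme Corporation"))
```

The tiers give partial credit for near misses such as rounding differences, while the fuzzy ratio does the same for abbreviated or slightly misspelled names.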
112
+
113
+ ### Reward Shaping Bonuses
114
+ | Bonus | Value | Trigger |
115
+ |-------|-------|---------|
116
+ | **Consistency** | +0.03 | Agent's subtotal + tax = total |
117
+ | **Efficiency** | +0.01–0.02 | Solution found in ≤5 steps |
118
+ | **Improvement** | up to +0.02 | Score improves on retry |
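The consistency trigger in the table is a simple arithmetic check on the agent's own answer. A minimal sketch follows, assuming a one-cent tolerance and a hypothetical function name; the actual check in `environment.py` is not shown in this diff and may differ.

```python
def consistency_bonus(fields, bonus=0.03, tolerance=0.01):
    """Return `bonus` when subtotal + tax matches total within `tolerance`.

    `tolerance` is an assumed allowance for floating-point and rounding noise.
    """
    try:
        subtotal = float(fields["subtotal"])
        tax = float(fields["tax"])
        total = float(fields["total"])
    except (KeyError, TypeError, ValueError):
        return 0.0  # missing or null fields cannot be checked
    return bonus if abs(subtotal + tax - total) <= tolerance else 0.0


if __name__ == "__main__":
    print(consistency_bonus({"subtotal": 1050.0, "tax": 84.0, "total": 1134.0}))
```

Note that the check rewards internal agreement, not correctness: an answer can be self-consistent and still differ from the ground truth, which is why the bonus is small relative to the base score.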
119
+
120
+ ### Episode Mechanics
121
+ - **Best score tracked** across all extraction attempts
122
+ - **Partial progress** feedback identifies weak fields for refinement
123
+ - **Early termination** at score ≥ 0.95
124
+ - **All scores** clamped to strict (0.01, 0.99) range
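The episode rules above (clamping, best-score tracking, early termination) can be combined in a small tracker. The class below is a sketch under assumptions: `EpisodeTracker` and `clamp_score` are hypothetical names, and it assumes the 0.95 threshold is compared against the clamped score of the current attempt.

```python
def clamp_score(score, low=0.01, high=0.99):
    """Keep a score inside the documented 0.01 to 0.99 range."""
    return max(low, min(high, score))


class EpisodeTracker:
    """Hypothetical bookkeeping for one episode of extraction attempts."""

    def __init__(self, max_attempts, success_threshold=0.95):
        self.max_attempts = max_attempts
        self.success_threshold = success_threshold
        self.attempts_used = 0
        self.best_score = clamp_score(0.0)

    def record(self, raw_score):
        """Register one attempt and return (best_score, done)."""
        score = clamp_score(raw_score)
        self.attempts_used += 1
        self.best_score = max(self.best_score, score)
        done = (
            score >= self.success_threshold
            or self.attempts_used >= self.max_attempts
        )
        return self.best_score, done


if __name__ == "__main__":
    tracker = EpisodeTracker(max_attempts=3)
    for raw in (0.4, 0.2, 1.2):
        print(tracker.record(raw))
```

Because the best score is kept, a worse retry never lowers the episode reward, and the clamp guarantees that even a perfect raw score is reported as 0.99.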
125
 
126
  ## Setup Instructions
127
 
 
137
  uvicorn server.app:app --host 0.0.0.0 --port 7860
138
  ```
139
 
140
+ ### Run with uv
141
+ ```bash
142
+ uv run server
143
+ ```
144
+
145
  ### Run inference
146
  ```bash
147
  export ENV_URL="http://localhost:7860"
 
168
  ├── server/
169
  │ ├── __init__.py
170
  │ ├── app.py # FastAPI application
171
+ │ ├── environment.py # Core environment logic + reward shaping
172
+ │ ├── documents.py # 15-document corpus across 5 difficulty tiers
173
+ │ ├── graders.py # Field-level scoring with fuzzy matching
174
+ │ └── models.py # Pydantic Action/Observation/State types
175
  ├── __init__.py # Package declaration
176
+ ├── inference.py # Baseline inference script (all 5 tasks)
177
  ├── openenv.yaml # OpenEnv manifest
178
  ├── pyproject.toml # Package configuration
179
  ├── requirements.txt # Dependencies
inference.py CHANGED
@@ -3,9 +3,9 @@
3
  Baseline inference script for the Invoice Extraction Environment.
4
 
5
  This script demonstrates how an LLM agent interacts with the environment
6
- to extract structured data from invoice documents. It runs all three tasks
7
- (simple_invoice, messy_invoice, multi_document) and logs results in the
8
- mandatory OpenEnv [START]/[STEP]/[END] format.
9
 
10
  Required environment variables:
11
  API_BASE_URL — OpenAI-compatible API endpoint
@@ -35,7 +35,7 @@ ENV_URL = os.getenv("ENV_URL", "http://localhost:7860")
35
  # Optional — if you use from_docker_image():
36
  LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME")
37
 
38
- TASKS = ["simple_invoice", "messy_invoice", "multi_document"]
39
 
40
  # ---------------------------------------------------------------------------
41
  # LLM Client (OpenAI-compatible, configured via env vars)
@@ -100,12 +100,46 @@ RULES:
100
  - For monetary amounts, use plain numbers without currency symbols (e.g. 1134.00)
101
  - For line_items, use an array of objects with keys: description, quantity, unit_price, amount
102
  - If a field cannot be found, use null
103
- - For the multi_document task, look across all document sections (invoice, credit memo, PO, etc.)
104
- - adjusted_total is the final amount after credits/payments
105
- - po_number is the purchase order reference number
106
 
107
  JSON:"""
108
 
109
  REFINE_PROMPT = """You previously extracted data from an invoice but some fields were incorrect.
110
 
111
  DOCUMENT:
@@ -128,6 +162,8 @@ RULES:
128
  - For monetary amounts, use plain numbers without currency symbols
129
  - For line_items, use an array of objects with keys: description, quantity, unit_price, amount
130
  - If a field cannot be determined, use null
 
 
131
 
132
  JSON:"""
133
 
@@ -217,7 +253,12 @@ def run_task(env_url: str, task_name: str, seed: int = 0) -> float:
217
 
218
  # Step 3: Use LLM to extract fields
219
  fields_str = "\n".join(f"- {f}" for f in required_fields)
220
- prompt = EXTRACT_PROMPT.format(document=document_text, fields=fields_str)
 
 
 
 
 
221
  llm_response = call_llm(prompt)
222
  extracted_json = extract_json_from_response(llm_response)
223
 
@@ -253,11 +294,13 @@ def run_task(env_url: str, task_name: str, seed: int = 0) -> float:
253
  weak_fields = [f for f, s in field_scores.items() if s < 0.8]
254
 
255
  # Refine with LLM
 
256
  refine_prompt = REFINE_PROMPT.format(
257
  document=document_text,
258
  previous=extracted_json,
259
  weak_fields=", ".join(weak_fields) if weak_fields else "all fields",
260
  feedback=feedback_text,
 
261
  )
262
  refined_response = call_llm(refine_prompt)
263
  refined_json = extract_json_from_response(refined_response)
 
3
  Baseline inference script for the Invoice Extraction Environment.
4
 
5
  This script demonstrates how an LLM agent interacts with the environment
6
+ to extract structured data from invoice documents. It runs all five tasks
7
+ (simple_invoice, messy_invoice, multi_document, corrupted_scan, adversarial_invoice)
8
+ and logs results in the mandatory OpenEnv [START]/[STEP]/[END] format.
9
 
10
  Required environment variables:
11
  API_BASE_URL — OpenAI-compatible API endpoint
 
35
  # Optional — if you use from_docker_image():
36
  LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME")
37
 
38
+ TASKS = ["simple_invoice", "messy_invoice", "multi_document", "corrupted_scan", "adversarial_invoice"]
39
 
40
  # ---------------------------------------------------------------------------
41
  # LLM Client (OpenAI-compatible, configured via env vars)
 
100
  - For monetary amounts, use plain numbers without currency symbols (e.g. 1134.00)
101
  - For line_items, use an array of objects with keys: description, quantity, unit_price, amount
102
  - If a field cannot be found, use null
103
+ {task_specific_rules}
104
+
105
+ IMPORTANT: Ensure your extracted subtotal + tax = total. Verify math consistency.
106
 
107
  JSON:"""
108
 
109
+ TASK_RULES = {
110
+ "simple_invoice": "",
111
+ "messy_invoice": (
112
+ "- This document uses informal formatting, abbreviations, and shorthand\n"
113
+ "- Look past formatting irregularities to find the actual values\n"
114
+ "- 'subtot', 's/t', 'sub' = subtotal; 'tx' = tax; 'amt due' = total"
115
+ ),
116
+ "multi_document": (
117
+ "- This contains MULTIPLE document sections (PO, Invoice, Credit Memo, etc.)\n"
118
+ "- Extract from the INVOICE section primarily\n"
119
+ "- adjusted_total is the final amount after credits/payments\n"
120
+ "- po_number is the purchase order reference number\n"
121
+ "- adjustment_reason describes why the total was adjusted"
122
+ ),
123
+ "corrupted_scan": (
124
+ "- WARNING: This is an OCR-scanned document with character errors\n"
125
+ "- Common OCR substitutions: 0<->O, 1<->l<->I, 5<->S, 8<->B\n"
126
+ "- Mentally correct OCR errors to recover the true values\n"
127
+ "- 'lNV' = 'INV', 'S' in numbers = '5', 'O' in numbers = '0'\n"
128
+ "- Verify all numbers by cross-checking (qty * unit_price = amount)"
129
+ ),
130
+ "adversarial_invoice": (
131
+ "- CAUTION: This document contains DECOY fields and contradictions\n"
132
+ "- Multiple invoice numbers may appear — use the CURRENT/ACTIVE one, not voided/draft ones\n"
133
+ "- If there is a reissue date, use that as the date (not the original date)\n"
134
+ "- subtotal is the ADJUSTED subtotal after any discounts\n"
135
+ "- discount_amount is the monetary discount value\n"
136
+ "- original_total is what the total WOULD have been without adjustments\n"
137
+ "- discrepancy_notes: describe ALL discrepancies, adjustments, and calculations\n"
138
+ "- po_number: the purchase order reference if present, else null\n"
139
+ "- Cross-reference different sections to find contradictions"
140
+ ),
141
+ }
142
+
143
  REFINE_PROMPT = """You previously extracted data from an invoice but some fields were incorrect.
144
 
145
  DOCUMENT:
 
162
  - For monetary amounts, use plain numbers without currency symbols
163
  - For line_items, use an array of objects with keys: description, quantity, unit_price, amount
164
  - If a field cannot be determined, use null
165
+ - VERIFY: subtotal + tax should equal total
166
+ {task_specific_rules}
167
 
168
  JSON:"""
169
 
 
253
 
254
  # Step 3: Use LLM to extract fields
255
  fields_str = "\n".join(f"- {f}" for f in required_fields)
256
+ task_rules = TASK_RULES.get(task_name, "")
257
+ prompt = EXTRACT_PROMPT.format(
258
+ document=document_text,
259
+ fields=fields_str,
260
+ task_specific_rules=task_rules,
261
+ )
262
  llm_response = call_llm(prompt)
263
  extracted_json = extract_json_from_response(llm_response)
264
 
 
294
  weak_fields = [f for f, s in field_scores.items() if s < 0.8]
295
 
296
  # Refine with LLM
297
+ task_rules = TASK_RULES.get(task_name, "")
298
  refine_prompt = REFINE_PROMPT.format(
299
  document=document_text,
300
  previous=extracted_json,
301
  weak_fields=", ".join(weak_fields) if weak_fields else "all fields",
302
  feedback=feedback_text,
303
+ task_specific_rules=task_rules,
304
  )
305
  refined_response = call_llm(refine_prompt)
306
  refined_json = extract_json_from_response(refined_response)
project_juiding_criterion.txt ADDED
@@ -0,0 +1,145 @@
1
+ Real-world task simulation
2
+
3
+ The environment must simulate a task humans actually do. Not games, not toys. Examples: email triage, code review, data cleaning, scheduling, customer support, content moderation.
4
+
5
+ OpenEnv spec compliance
6
+
7
+ Implement the full OpenEnv interface: typed Observation, Action, and Reward Pydantic models. step(action) → returns observation, reward, done, info. reset() → returns initial observation. state() → returns current state. openenv.yaml with metadata. Tested via openenv validate.
8
+
9
+ Minimum 3 tasks with agent graders
10
+
11
+ Each task defines a concrete objective an agent must accomplish, with a programmatic grader that scores performance (0.0–1.0). Tasks should range: easy → medium → hard. Graders must have clear, deterministic success/failure criteria.
12
+
13
+ Meaningful reward function
14
+
15
+ Provides signal over the full trajectory (not just binary end-of-episode). Rewards partial progress toward task completion. Penalizes clearly undesirable behavior (e.g. infinite loops, destructive actions).
16
+
17
+ Baseline inference script
18
+
19
+ Uses the OpenAI API client to run a model against the environment. Reads API credentials from environment variables (OPENAI_API_KEY). Produces a reproducible baseline score on all 3 tasks.
20
+
21
+ Detailed Requirements
22
+
23
+ Non-Functional Requirements
24
+
25
+ Deploys to a Hugging Face Space
26
+
27
+ Environment must run as a containerized HF Space tagged with openenv.
28
+
29
+ Containerized execution
30
+
31
+ Must include a working Dockerfile. The environment should start cleanly with docker build + docker run.
32
+
33
+ Documentation
34
+
35
+ README must include: environment description and motivation, action and observation space definitions, task descriptions with expected difficulty, setup and usage instructions, baseline scores.
36
+
37
+ Parameter
38
+
39
+ Weight
40
+
41
+ Description
42
+
43
+ Real-world utility
44
+
45
+ 30%
46
+
47
+ Does the environment model a genuine task? Would someone actually use this to train or evaluate agents?
48
+
49
+ Task & grader quality
50
+
51
+ 25%
52
+
53
+ Are tasks well-defined with clear objectives? Do graders accurately and fairly measure success? Meaningful difficulty progression?
54
+
55
+ Environment design
56
+
57
+ 20%
58
+
59
+ Clean state management, sensible action/observation spaces, good reward shaping, proper episode boundaries.
60
+
61
+ Code quality & spec compliance
62
+
63
+ 15%
64
+
65
+ Follows OpenEnv spec, clean project structure, typed models, documented, tested, Dockerfile works.
66
+
67
+ Creativity & novelty
68
+
69
+ 10%
70
+
71
+ Novel problem domain, interesting mechanics, clever reward design, original approach.
72
+
73
+ Scoring Breakdown
74
+
75
+ Real-world utility (30%)
76
+
77
+ • 0–5: Toy/artificial problem with no practical application
78
+
79
+ • 6–15: Valid domain but shallow modeling of the real task
80
+
81
+ • 16–25: Good domain modeling, would be useful for agent evaluation
82
+
83
+ • 26–30: Excellent — fills a real gap, immediate value for the RL/agent community
84
+
85
+ Task & grader quality (25%)
86
+
87
+ • 3+ tasks with difficulty range?
88
+
89
+ • Graders produce scores between 0.0–1.0?
90
+
91
+ • Graders deterministic and reproducible?
92
+
93
+ • Hard task genuinely challenges frontier models?
94
+
95
+ Environment design (20%)
96
+
97
+ • reset() produces clean state?
98
+
99
+ • Action/observation types well-designed and documented?
100
+
101
+ • Reward function provides useful varying signal (not just sparse)?
102
+
103
+ • Episode boundaries sensible?
104
+
105
+ Code quality & spec compliance (15%)
106
+
107
+ • openenv validate passes?
108
+
109
+ • docker build && docker run works?
110
+
111
+ • HF Space deploys and responds?
112
+
113
+ • Baseline script runs and reproduces scores?
114
+
115
+ Creativity & novelty (10%)
116
+
117
+ • Domain we haven’t seen in OpenEnv before?
118
+
119
+ • Reward design has interesting properties?
120
+
121
+ • Clever mechanics that make the environment engaging?
122
+
123
+ Evaluation Criteria
124
+
125
+ Phase 1: Automated Validation
126
+
127
+ Pass/fail gate — HF Space deploys, OpenEnv spec compliance, Dockerfile builds, baseline reproduces, 3+ tasks with graders.
128
+
129
+ Phase 2: Agentic Evaluation
130
+
131
+ Scored — baseline agent re-run, standard Open LLM agent (e.g. Nemotron 3 Super) run against all environments, score variance check.
132
+
133
+ Phase 3: Human Review
134
+
135
+ Top submissions reviewed by Meta and Hugging Face engineers for real-world utility, creativity, and exploit checks.
136
+
137
+ Disqualification Criteria
138
+
139
+ Environment does not deploy or respond
140
+
141
+ Plagiarized or trivially modified existing environments
142
+
143
+ Graders that always return the same score
144
+
145
+ No baseline inference script
server/app.py CHANGED
@@ -131,11 +131,19 @@ def create_invoice_app() -> FastAPI:
131
  "name": "invoice_extraction_env",
132
  "description": (
133
  "An environment for extracting structured data from unstructured "
134
- "invoice and receipt documents. The agent must identify and extract "
135
- "fields like invoice number, dates, vendor name, line items, and totals."
 
 
136
  ),
137
- "version": "0.1.0",
138
- "tasks": ["simple_invoice", "messy_invoice", "multi_document"],
 
 
 
 
 
 
139
  }
140
 
141
  # === WebSocket (for persistent sessions) ===
 
131
  "name": "invoice_extraction_env",
132
  "description": (
133
  "An environment for extracting structured data from unstructured "
134
+ "invoice and receipt documents. Features 5 difficulty tiers from "
135
+ "clean invoices to adversarial documents with decoy fields, OCR "
136
+ "corruption, and hidden calculations. Reward shaping includes "
137
+ "consistency bonuses, efficiency signals, and improvement tracking."
138
  ),
139
+ "version": "0.2.0",
140
+ "tasks": [
141
+ "simple_invoice",
142
+ "messy_invoice",
143
+ "multi_document",
144
+ "corrupted_scan",
145
+ "adversarial_invoice",
146
+ ],
147
  }
148
 
149
  # === WebSocket (for persistent sessions) ===
server/documents.py CHANGED
@@ -465,6 +465,357 @@ Total with Backorder: $2,498.45 + $170.00 = $2,668.45
465
  },
466
  },
467
  ],
468
  }
469
 
470
 
@@ -483,6 +834,16 @@ TASK_REQUIRED_FIELDS = {
483
  "subtotal", "tax", "total", "line_items",
484
  "po_number", "adjustment_reason", "adjusted_total",
485
  ],
 
 
 
 
 
 
 
 
 
 
486
  }
487
 
488
 
@@ -490,7 +851,8 @@ def get_document(task_name: str, doc_index: int = 0) -> dict:
490
  """Get a document and its metadata for a given task.
491
 
492
  Args:
493
- task_name: One of 'simple_invoice', 'messy_invoice', 'multi_document'
 
494
  doc_index: Index into the document pool (will wrap around)
495
 
496
  Returns:
@@ -504,3 +866,4 @@ def get_document(task_name: str, doc_index: int = 0) -> dict:
504
  "ground_truth": doc["ground_truth"],
505
  "required_fields": TASK_REQUIRED_FIELDS.get(task_name, TASK_REQUIRED_FIELDS["simple_invoice"]),
506
  }
 
 
465
  },
466
  },
467
  ],
468
+
469
+ # =========================================================================
470
+ # CORRUPTED SCAN — OCR-like artifacts, character substitutions, garbled text
471
+ # These simulate real scanned/faxed invoices with OCR errors.
472
+ # =========================================================================
473
+ "corrupted_scan": [
474
+ {
475
+ "id": "corrupt_001",
476
+ "text": """SC4NNED D0CUMENT - Page 1 of 1
477
+
478
+ lNVOlCE
479
+
480
+ lnvoice Nurnber: lNV-2O24-OO1
481
+ Dat.e: Januery 1S, 2O24
482
+
483
+ Frorn:
484
+ Acrne Corporati0n
485
+ l23 Business Avenue
486
+ New Y0rk, NY 1OOO1
487
+
488
+ BilI To:
489
+ Widget C0.
490
+ 4S6 Cornmerce Street
491
+ Chicag0, lL 6O6O1
492
+
493
+ Descripti0n Qty Unit Price Arnount
494
+ ---------------------------------------------------------
495
+ Widget Type A 1O $2S.OO $2SO.OO
496
+ Widget Type 8 S $4O.OO $2OO.OO
497
+ ConsuIting Service 8 $7S.OO $6OO.OO
498
+
499
+ Subtotal: $1,OSO.OO
500
+ Tax (8%): $84.OO
501
+ T0tal: $1,l34.OO
502
+
503
+ Payrnent Terrns: Net 3O
504
+
505
+ --- END 0F SCAN ---
506
+ """,
507
+ "ground_truth": {
508
+ "invoice_number": "INV-2024-001",
509
+ "date": "2024-01-15",
510
+ "vendor_name": "Acme Corporation",
511
+ "customer_name": "Widget Co.",
512
+ "subtotal": 1050.00,
513
+ "tax": 84.00,
514
+ "total": 1134.00,
515
+ "line_items": [
516
+ {"description": "Widget Type A", "quantity": 10, "unit_price": 25.00, "amount": 250.00},
517
+ {"description": "Widget Type B", "quantity": 5, "unit_price": 40.00, "amount": 200.00},
518
+ {"description": "Consulting Service", "quantity": 8, "unit_price": 75.00, "amount": 600.00},
519
+ ],
520
+ },
521
+ },
522
+ {
523
+ "id": "corrupt_002",
524
+ "text": """[SCAN QUALITY: P00R - SOME CHARACTERS MAY BE lNCORRECT]
525
+
526
+ TECHSTART S0LUTl0NS LLC
527
+ 89O lnnovation Dr, Suite 2OO
528
+ San Francisc0, CA 941OS
529
+
530
+ lNV0lCE #: TS~S892
531
+ DATE: O3/O3/2O24
532
+
533
+ CUSTOMERr DataFIow lnc.
534
+ 321 AnaIytics BIvd
535
+ Austin, TX 787O1
536
+
537
+ Servicc Qty Unit Pricc Total
538
+ ----------------------------------------------------------
539
+ CIoud Hosting (MonthIy) l $4SO.OO $4SO.OO
540
+ APl lntegration Setup l $l,2OO.OO $l,2OO.OO
541
+ TechnicaI Support (hours) l2 $9S.OO $l,l4O.OO
542
+
543
+ SubtotaI: $2,79O.OO
544
+ Tax (7%)): $l9S.3O
545
+ TotaI: $2,98S.3O
546
+
547
+ Due Date: ApriI 2, 2O24
548
+
549
+ [PAGE 1/1 - SCAN C0MPLETE]
550
+ """,
551
+ "ground_truth": {
552
+ "invoice_number": "TS-5892",
553
+ "date": "2024-03-03",
554
+ "vendor_name": "TechStart Solutions LLC",
555
+ "customer_name": "DataFlow Inc.",
556
+ "subtotal": 2790.00,
557
+ "tax": 195.30,
558
+ "total": 2985.30,
559
+ "line_items": [
560
+ {"description": "Cloud Hosting (Monthly)", "quantity": 1, "unit_price": 450.00, "amount": 450.00},
561
+ {"description": "API Integration Setup", "quantity": 1, "unit_price": 1200.00, "amount": 1200.00},
562
+ {"description": "Technical Support (hours)", "quantity": 12, "unit_price": 95.00, "amount": 1140.00},
563
+ ],
564
+ },
565
+ },
566
+ {
567
+ "id": "corrupt_003",
568
+ "text": """---FAXED DOCUMENT---
569
+ RECEIVED: 02/20/2024 14:32
570
+ QUALITY: [####===---] 40%
571
+
572
+ GL0BAL SUPPLlES lNC.
573
+ 25OO lndustriaI Parkway
574
+ Detr0it, Ml 482Ol
575
+
576
+ lNVOlCE
577
+
578
+ lnvoice Number: GS-2O24-Ol47
579
+ Date: February 2O, 2024
580
+
581
+ T0:
582
+ Riverside Manufactur1ng
583
+ 78O Factory R0ad
584
+ CIeveIand, 0H 44l0l
585
+
586
+ Product Qty Price Each Line Total
587
+ -----------------------------------------------------------
588
+ SteeI BoIts (Box/lOO) SO $l2.SO $62S.OO
589
+ Copper Wire (SOOft RoII) 8 $8S.OO $68O.OO
590
+ Safety GoggIes (Pack/lO) 2O $3S.OO $7OO.OO
591
+ WeIding Rods (BundIe) lS $22.OO $33O.OO
592
+
593
+ [iIIegibIe]
594
+ SubtotaI: $2,33S.OO
595
+ SaIes Tax: $l63.4S
596
+ lnvoice T0tal: $2,498.4S
597
+
598
+ Terrns: Net 4S
599
+ ---END FAX---
600
+ """,
601
+ "ground_truth": {
602
+ "invoice_number": "GS-2024-0147",
603
+ "date": "2024-02-20",
604
+ "vendor_name": "Global Supplies Inc.",
605
+ "customer_name": "Riverside Manufacturing",
606
+ "subtotal": 2335.00,
607
+ "tax": 163.45,
608
+ "total": 2498.45,
609
+ "line_items": [
610
+ {"description": "Steel Bolts (Box/100)", "quantity": 50, "unit_price": 12.50, "amount": 625.00},
611
+ {"description": "Copper Wire (500ft Roll)", "quantity": 8, "unit_price": 85.00, "amount": 680.00},
612
+ {"description": "Safety Goggles (Pack/10)", "quantity": 20, "unit_price": 35.00, "amount": 700.00},
613
+ {"description": "Welding Rods (Bundle)", "quantity": 15, "unit_price": 22.00, "amount": 330.00},
614
+ ],
615
+ },
616
+ },
617
+ ],
618
+
619
+ # =========================================================================
620
+ # ADVERSARIAL INVOICE — Decoy fields, contradictions, hidden calculations
621
+ # Designed to genuinely challenge frontier models with traps.
622
+ # =========================================================================
623
+ "adversarial_invoice": [
624
+ {
625
+ "id": "adversarial_001",
626
+ "text": """INVOICE
627
+
628
+ *** IMPORTANT: This replaces previous invoice DRAFT-INV-999 which was voided ***
629
+
630
+ Invoice Number: INV-2024-001-R2
631
+ Previous Reference: DRAFT-INV-999 (VOIDED — DO NOT USE)
632
+ Date: January 15, 2024
633
+ Reissue Date: January 20, 2024
634
+
635
+ From:
636
+ Acme Corporation
637
+ 123 Business Avenue, New York, NY 10001
638
+ Tax ID: 12-3456789
639
+
640
+ Bill To:
641
+ Widget Co. (DBA "WidgetCorp International")
642
+ 456 Commerce Street, Chicago, IL 60601
643
+ Customer Account: WC-0042
644
+
645
+ Description Qty Unit Price Amount
646
+ ---------------------------------------------------------
647
+ Widget Type A 10 $25.00 $250.00
648
+ Widget Type B 5 $40.00 $200.00
649
+ Consulting Service 8 $75.00 $600.00
650
+ ** EARLY PAYMENT DISCOUNT: -5% on consulting **
651
+
652
+ Subtotal: $1,050.00
653
+ Discount (5%): -$30.00
654
+ Adjusted Subtotal: $1,020.00
655
+ Tax (8%): $81.60
656
+ Total: $1,101.60
657
+
658
+ NOTE: Original quote (QT-2024-555) was $1,134.00 but discount applied.
659
+ Per agreement dated Jan 12, if paid within 10 days.
660
+
661
+ Payment Terms: Net 10 (discounted) / Net 30 (full price $1,134.00)
662
+ """,
663
+ "ground_truth": {
664
+ "invoice_number": "INV-2024-001-R2",
665
+ "date": "2024-01-20",
666
+ "vendor_name": "Acme Corporation",
667
+ "customer_name": "Widget Co.",
668
+ "subtotal": 1020.00,
669
+ "tax": 81.60,
670
+ "total": 1101.60,
671
+ "discount_amount": 30.00,
672
+ "original_total": 1134.00,
673
+ "line_items": [
674
+ {"description": "Widget Type A", "quantity": 10, "unit_price": 25.00, "amount": 250.00},
675
+ {"description": "Widget Type B", "quantity": 5, "unit_price": 40.00, "amount": 200.00},
676
+ {"description": "Consulting Service", "quantity": 8, "unit_price": 75.00, "amount": 600.00},
677
+ ],
678
+ "discrepancy_notes": "5% early payment discount applied to consulting. Reissued invoice replaces voided DRAFT-INV-999. Adjusted subtotal $1,020 vs original $1,050.",
679
+ },
680
+ },
681
+ {
682
+ "id": "adversarial_002",
683
+ "text": """--- PURCHASE ORDER ---
684
+ PO#: PO-DF-2024-112
685
+ Date: February 28, 2024
686
+ Vendor: TechStart Solutions LLC
687
+ Buyer: DataFlow Inc.
688
+ Authorized Budget: $2,600.00 (pre-tax)
689
+
690
+ Items:
691
+ 1. Cloud Hosting - 1 unit @ $450.00 = $450.00
692
+ 2. API Integration - 1 unit @ $1,200.00 = $1,200.00
693
+ 3. Tech Support - 10 hours @ $95.00/hr = $950.00
694
+ PO Total: $2,600.00
695
+
696
+ --- INVOICE ---
697
+ Invoice: TS-5892-FINAL
698
+ Date: March 3, 2024
699
+ PO Reference: PO-DF-2024-112
700
+
701
+ From: TechStart Solutions LLC
702
+ To: DataFlow Inc.
703
+
704
+ Service Qty Rate Amount
705
+ Cloud Hosting (Monthly) 1 $450.00 $450.00
706
+ API Integration Setup 1 $1,200.00 $1,200.00
707
+ Technical Support (actual) 12 $95.00 $1,140.00
708
+ >> 2 hrs over PO estimate, approved by J. Smith 03/01/2024
709
+ Rush Processing Fee 1 $150.00 $150.00
710
+ >> Added per emergency request ER-2024-033
711
+
712
+ Subtotal: $2,940.00
713
+ Tax (7%): $205.80
714
+ Total: $3,145.80
715
+
716
+ !!! BUDGET VARIANCE ALERT !!!
717
+ PO Authorized: $2,600.00
718
+ Actual (pre-tax): $2,940.00
719
+ Variance: $340.00 OVER BUDGET (13.1%)
720
+ Causes: Support overage ($190), Rush fee ($150)
721
+
722
+ --- PAYMENT SCHEDULE ---
723
+ Payment 1 (due 03/15): $1,500.00
724
+ Payment 2 (due 04/02): $1,645.80
725
+ """,
726
+ "ground_truth": {
727
+ "invoice_number": "TS-5892-FINAL",
728
+ "date": "2024-03-03",
729
+ "vendor_name": "TechStart Solutions LLC",
730
+ "customer_name": "DataFlow Inc.",
731
+ "subtotal": 2940.00,
732
+ "tax": 205.80,
733
+ "total": 3145.80,
734
+ "po_number": "PO-DF-2024-112",
735
+ "discount_amount": 0.00,
736
+ "original_total": 2600.00,
737
+ "line_items": [
738
+ {"description": "Cloud Hosting (Monthly)", "quantity": 1, "unit_price": 450.00, "amount": 450.00},
739
+ {"description": "API Integration Setup", "quantity": 1, "unit_price": 1200.00, "amount": 1200.00},
740
+ {"description": "Technical Support (actual)", "quantity": 12, "unit_price": 95.00, "amount": 1140.00},
741
+ {"description": "Rush Processing Fee", "quantity": 1, "unit_price": 150.00, "amount": 150.00},
742
+ ],
743
+ "discrepancy_notes": "Invoice exceeds PO by $340 (13.1%). 2 extra support hours ($190) and rush processing fee ($150) added. PO authorized $2,600 but actual pre-tax is $2,940.",
744
+ },
745
+ },
746
+ {
747
+ "id": "adversarial_003",
748
+ "text": """CONSOLIDATED STATEMENT
749
+
750
+ Account: Riverside Manufacturing
751
+ Statement Period: February 2024
752
+ Prepared by: Global Supplies Inc., Accounts Receivable
753
+
754
+ === TRANSACTION 1: ORIGINAL INVOICE ===
755
+ Invoice: GS-2024-0147
756
+ Date: February 20, 2024
757
+ PO: PO-RM-2024-033
758
+
759
+ Steel Bolts (Box/100) 50 @ $12.50 = $625.00
760
+ Copper Wire (500ft Roll) 10 @ $85.00 = $850.00
761
+ Safety Goggles (Pack/10) 20 @ $35.00 = $700.00
762
+ Welding Rods (Bundle) 15 @ $22.00 = $330.00
763
+
764
+ Invoice Subtotal: $2,505.00
765
+ Tax (7%): $175.35
766
+ Invoice Total: $2,680.35
767
+
768
+ === TRANSACTION 2: ADJUSTMENT ===
769
+ Credit Memo: CM-2024-0201
770
+ Date: February 25, 2024
771
+ Reference: GS-2024-0147
772
+
773
+ Issue: Copper Wire — only 8 of 10 rolls delivered.
774
+ 2 rolls backordered (BO-2024-0089).
775
+ Credit for undelivered: 2 x $85.00 = $170.00
776
+ Tax adjustment: -$11.90
777
+ Total Credit: -$181.90
778
+
779
+ === TRANSACTION 3: PRICE CORRECTION ===
780
+ Debit Memo: DM-2024-0055
781
+ Date: February 27, 2024
782
+
783
+ Steel Bolts price was quoted at $12.50 but contract
784
+ rate is $13.00. Underbilled on 50 boxes.
785
+ Price difference: 50 x $0.50 = $25.00
786
+ Tax on adjustment: $1.75
787
+ Total Debit: $26.75
788
+
789
+ === ACCOUNT SUMMARY ===
790
+ Original Invoice: $2,680.35
791
+ Credit (undelivered wire): -$181.90
792
+ Debit (price correction): +$26.75
793
+ ================================
794
+ Net Amount Due: $2,525.20
795
+
796
+ Payment due by: March 20, 2024
797
+ """,
798
+ "ground_truth": {
799
+ "invoice_number": "GS-2024-0147",
800
+ "date": "2024-02-20",
801
+ "vendor_name": "Global Supplies Inc.",
802
+ "customer_name": "Riverside Manufacturing",
803
+ "subtotal": 2505.00,
804
+ "tax": 175.35,
805
+ "total": 2680.35,
806
+ "po_number": "PO-RM-2024-033",
807
+ "discount_amount": 0.00,
808
+ "original_total": 2680.35,
809
+ "line_items": [
810
+ {"description": "Steel Bolts (Box/100)", "quantity": 50, "unit_price": 12.50, "amount": 625.00},
811
+ {"description": "Copper Wire (500ft Roll)", "quantity": 10, "unit_price": 85.00, "amount": 850.00},
812
+ {"description": "Safety Goggles (Pack/10)", "quantity": 20, "unit_price": 35.00, "amount": 700.00},
813
+ {"description": "Welding Rods (Bundle)", "quantity": 15, "unit_price": 22.00, "amount": 330.00},
814
+ ],
815
+ "discrepancy_notes": "Credit memo CM-2024-0201 for 2 undelivered Copper Wire rolls (-$181.90). Debit memo DM-2024-0055 for Steel Bolts price correction (+$26.75). Net adjustment: -$155.15. Final amount due: $2,525.20.",
816
+ },
817
+ },
818
+ ],
819
  }
820
 
821
 
 
834
  "subtotal", "tax", "total", "line_items",
835
  "po_number", "adjustment_reason", "adjusted_total",
836
  ],
837
+ "corrupted_scan": [
838
+ "invoice_number", "date", "vendor_name", "customer_name",
839
+ "subtotal", "tax", "total", "line_items",
840
+ ],
841
+ "adversarial_invoice": [
842
+ "invoice_number", "date", "vendor_name", "customer_name",
843
+ "subtotal", "tax", "total", "line_items",
844
+ "po_number", "discount_amount", "original_total",
845
+ "discrepancy_notes",
846
+ ],
847
  }
848
 
849
 
 
851
  """Get a document and its metadata for a given task.
852
 
853
  Args:
854
+ task_name: One of 'simple_invoice', 'messy_invoice', 'multi_document',
855
+ 'corrupted_scan', 'adversarial_invoice'
856
  doc_index: Index into the document pool (will wrap around)
857
 
858
  Returns:
 
866
  "ground_truth": doc["ground_truth"],
867
  "required_fields": TASK_REQUIRED_FIELDS.get(task_name, TASK_REQUIRED_FIELDS["simple_invoice"]),
868
  }
869
+
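The three adversarial ground-truth records above can be sanity-checked arithmetically. The following is a minimal sketch and is not part of this commit: the names `RECORDS`, `is_consistent`, and `net_amount_due` are illustrative assumptions, and the numbers are copied from the `ground_truth` blocks and the `adversarial_003` account summary.

```python
# Hypothetical sanity check; values copied from the ground truth in this diff.
RECORDS = {
    "adversarial_001": {"amounts": [250.00, 200.00, 600.00], "discount_amount": 30.00,
                        "subtotal": 1020.00, "tax": 81.60, "total": 1101.60},
    "adversarial_002": {"amounts": [450.00, 1200.00, 1140.00, 150.00], "discount_amount": 0.00,
                        "subtotal": 2940.00, "tax": 205.80, "total": 3145.80},
    "adversarial_003": {"amounts": [625.00, 850.00, 700.00, 330.00], "discount_amount": 0.00,
                        "subtotal": 2505.00, "tax": 175.35, "total": 2680.35},
}


def is_consistent(rec, tol=0.005):
    """Check line items - discount == subtotal and subtotal + tax == total."""
    line_sum = sum(rec["amounts"])
    subtotal_ok = abs(line_sum - rec["discount_amount"] - rec["subtotal"]) < tol
    total_ok = abs(rec["subtotal"] + rec["tax"] - rec["total"]) < tol
    return subtotal_ok and total_ok


def net_amount_due(original_total, credit, debit):
    """Apply a credit memo and a debit memo to an original invoice total."""
    return round(original_total - credit + debit, 2)
```

All three records pass `is_consistent`, and `net_amount_due(2680.35, 181.90, 26.75)` reproduces the $2,525.20 shown in the `adversarial_003` account summary. Note that `subtotal` in `adversarial_001` is the post-discount figure, which is why the discount is subtracted before comparing.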
server/environment.py CHANGED
@@ -20,6 +20,8 @@ MAX_ATTEMPTS = {
20
  "simple_invoice": 3,
21
  "messy_invoice": 3,
22
  "multi_document": 5,
 
 
23
  }
24
 
25
  VALID_TASKS = list(TASK_REQUIRED_FIELDS.keys())
@@ -174,16 +176,19 @@ class InvoiceExtractionEnvironment:
174
  """Return the list of required fields with descriptions."""
175
  field_descriptions = {
176
  "invoice_number": "The invoice/document number (string)",
177
- "date": "Invoice date in YYYY-MM-DD format",
178
  "vendor_name": "Name of the vendor/seller/supplier",
179
  "customer_name": "Name of the customer/buyer/bill-to party",
180
- "subtotal": "Subtotal before tax (number)",
181
  "tax": "Tax amount (number)",
182
  "total": "Total amount due (number)",
183
  "line_items": "Array of items: [{description, quantity, unit_price, amount}]",
184
  "po_number": "Purchase order reference number (string)",
185
  "adjustment_reason": "Reason for any adjustments/credits (string)",
186
  "adjusted_total": "Final adjusted total after credits/payments (number)",
 
 
 
187
  }
188
 
189
  lines = ["Required fields to extract:\n"]
@@ -243,10 +248,43 @@ class InvoiceExtractionEnvironment:
243
 
244
  # Grade the extraction
245
  self._state.attempts_used += 1
246
- score, feedback = grade_extraction(
247
  extracted, self._ground_truth, self._required_fields
248
  )
249
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
250
  # Track best score
251
  self._state.best_score = max(self._state.best_score, score)
252
  self._last_feedback = feedback
@@ -259,12 +297,15 @@ class InvoiceExtractionEnvironment:
259
  matched = sum(1 for f in feedback.values() if f.get("matched", False))
260
  total = len(feedback)
261
  feedback_text = (
262
- f"Extraction scored: {score:.2f}\n"
263
  f"Fields matched: {matched}/{total}\n"
264
- f"Best score so far: {self._state.best_score:.2f}\n"
265
  f"Attempts remaining: {attempts_remaining}\n"
266
  )
267
 
 
 
 
268
  if not done and score < 0.95:
269
  weak_fields = [
270
  name for name, data in feedback.items()
@@ -275,7 +316,7 @@ class InvoiceExtractionEnvironment:
275
  feedback_text += "\nUse 'get_feedback' for detailed per-field scores."
276
 
277
  if done:
278
- feedback_text += f"\n\nEpisode complete. Final score: {self._state.best_score:.2f}"
279
 
280
  return InvoiceObservation(
281
  done=done,
@@ -287,6 +328,9 @@ class InvoiceExtractionEnvironment:
287
  required_fields=self._required_fields,
288
  metadata={
289
  "score": score,
 
 
 
290
  "best_score": self._state.best_score,
291
  "field_scores": {k: v["score"] for k, v in feedback.items()},
292
  },
@@ -334,3 +378,19 @@ class InvoiceExtractionEnvironment:
334
  def close(self) -> None:
335
  """Clean up resources."""
336
  self._initialized = False
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
20
  "simple_invoice": 3,
21
  "messy_invoice": 3,
22
  "multi_document": 5,
23
+ "corrupted_scan": 4,
24
+ "adversarial_invoice": 6,
25
  }
26
 
27
  VALID_TASKS = list(TASK_REQUIRED_FIELDS.keys())
 
176
  """Return the list of required fields with descriptions."""
177
  field_descriptions = {
178
  "invoice_number": "The invoice/document number (string)",
179
+ "date": "Invoice date in YYYY-MM-DD format (use reissue date if applicable)",
180
  "vendor_name": "Name of the vendor/seller/supplier",
181
  "customer_name": "Name of the customer/buyer/bill-to party",
182
+ "subtotal": "Subtotal before tax, after discounts (number)",
183
  "tax": "Tax amount (number)",
184
  "total": "Total amount due (number)",
185
  "line_items": "Array of items: [{description, quantity, unit_price, amount}]",
186
  "po_number": "Purchase order reference number (string)",
187
  "adjustment_reason": "Reason for any adjustments/credits (string)",
188
  "adjusted_total": "Final adjusted total after credits/payments (number)",
189
+ "discount_amount": "Monetary discount value applied (number, 0 if none)",
190
+ "original_total": "What the total would have been without adjustments (number)",
191
+ "discrepancy_notes": "Free-text description of all discrepancies, adjustments, and anomalies found",
192
  }
193
 
194
  lines = ["Required fields to extract:\n"]
 
248
 
249
  # Grade the extraction
250
  self._state.attempts_used += 1
251
+ base_score, feedback = grade_extraction(
252
  extracted, self._ground_truth, self._required_fields
253
  )
254
 
255
+ # === REWARD SHAPING BONUSES ===
256
+ bonus = 0.0
257
+ bonus_details = []
258
+
259
+ # 1. Mathematical consistency bonus: subtotal + tax ≈ total
260
+ ext_sub = _safe_float(extracted.get("subtotal"))
261
+ ext_tax = _safe_float(extracted.get("tax"))
262
+ ext_total = _safe_float(extracted.get("total"))
263
+ if ext_sub is not None and ext_tax is not None and ext_total is not None:
264
+ computed = round(ext_sub + ext_tax, 2)
265
+ if abs(computed - ext_total) < 0.02:
266
+ bonus += 0.03
267
+ bonus_details.append("consistency_check: +0.03")
268
+
269
+ # 2. Improvement tracking: reward learning from feedback (new base score vs. previous best shaped score)
270
+ prev_score = self._state.best_score
271
+ if self._state.attempts_used > 1 and base_score > prev_score:
272
+ improvement = min(base_score - prev_score, 0.02)
273
+ bonus += improvement
274
+ bonus_details.append(f"improvement: +{improvement:.3f}")
275
+
276
+ # 3. Step efficiency signal: fewer steps = small bonus
277
+ steps_used = self._state.step_count
278
+ if steps_used <= 3:
279
+ bonus += 0.02 # Very efficient
280
+ bonus_details.append("efficiency: +0.02")
281
+ elif steps_used <= 5:
282
+ bonus += 0.01 # Moderately efficient
283
+ bonus_details.append("efficiency: +0.01")
284
+
285
+ # Apply bonus, clamped to [0.01, 0.99] so the score stays strictly inside (0, 1)
286
+ score = round(max(0.01, min(0.99, base_score + bonus)), 4)
287
+
288
  # Track best score
289
  self._state.best_score = max(self._state.best_score, score)
290
  self._last_feedback = feedback
 
297
  matched = sum(1 for f in feedback.values() if f.get("matched", False))
298
  total = len(feedback)
299
  feedback_text = (
300
+ f"Extraction scored: {score:.4f} (base: {base_score:.4f}, bonus: {bonus:.3f})\n"
301
  f"Fields matched: {matched}/{total}\n"
302
+ f"Best score so far: {self._state.best_score:.4f}\n"
303
  f"Attempts remaining: {attempts_remaining}\n"
304
  )
305
 
306
+ if bonus_details:
307
+ feedback_text += f"Reward bonuses: {', '.join(bonus_details)}\n"
308
+
309
  if not done and score < 0.95:
310
  weak_fields = [
311
  name for name, data in feedback.items()
 
316
  feedback_text += "\nUse 'get_feedback' for detailed per-field scores."
317
 
318
  if done:
319
+ feedback_text += f"\n\nEpisode complete. Final score: {self._state.best_score:.4f}"
320
 
321
  return InvoiceObservation(
322
  done=done,
 
328
  required_fields=self._required_fields,
329
  metadata={
330
  "score": score,
331
+ "base_score": base_score,
332
+ "bonus": bonus,
333
+ "bonus_details": bonus_details,
334
  "best_score": self._state.best_score,
335
  "field_scores": {k: v["score"] for k, v in feedback.items()},
336
  },
 
378
  def close(self) -> None:
379
  """Clean up resources."""
380
  self._initialized = False
381
+
382
+
383
+ def _safe_float(value) -> "float | None":
384
+ """Safely convert a value to float, returning None on failure."""
385
+ if value is None:
386
+ return None
387
+ if isinstance(value, (int, float)):
388
+ return float(value)
389
+ if isinstance(value, str):
390
+ import re
391
+ cleaned = re.sub(r"[$ ,]", "", value.strip())
392
+ try:
393
+ return float(cleaned)
394
+ except (ValueError, TypeError):
395
+ return None
396
+ return None
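The reward shaping added to `environment.py` can be read in isolation as a pure function. The sketch below restates the three bonuses and the clamp from this diff. The function name `shape_reward` and its parameter list are assumptions made for illustration, and the numeric inputs are assumed to be already parsed (the commit uses `_safe_float` for that step).

```python
from typing import List, Optional, Tuple


def shape_reward(
    base_score: float,
    subtotal: Optional[float],
    tax: Optional[float],
    total: Optional[float],
    prev_best: float,
    attempts_used: int,
    steps_used: int,
) -> Tuple[float, float, List[str]]:
    """Return (shaped_score, bonus, bonus_details) following the diff's rules."""
    bonus = 0.0
    details: List[str] = []

    # 1. Consistency: subtotal + tax must match total to within 2 cents.
    if subtotal is not None and tax is not None and total is not None:
        if abs(round(subtotal + tax, 2) - total) < 0.02:
            bonus += 0.03
            details.append("consistency_check: +0.03")

    # 2. Improvement: only after the first attempt, capped at 0.02.
    if attempts_used > 1 and base_score > prev_best:
        improvement = min(base_score - prev_best, 0.02)
        bonus += improvement
        details.append(f"improvement: +{improvement:.3f}")

    # 3. Efficiency: small bonus for finishing in few steps.
    if steps_used <= 3:
        bonus += 0.02
        details.append("efficiency: +0.02")
    elif steps_used <= 5:
        bonus += 0.01
        details.append("efficiency: +0.01")

    # Clamp to [0.01, 0.99] so the result stays strictly inside (0, 1).
    score = round(max(0.01, min(0.99, base_score + bonus)), 4)
    return score, bonus, details
```

For example, a first attempt with a base score of 0.80, a consistent subtotal/tax/total, and two steps used yields a shaped score of 0.85. Because of the clamp, the combined bonuses (at most 0.07) can never lift a score above 0.99.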
server/graders.py CHANGED
@@ -236,9 +236,12 @@ def grade_extraction(
236
  field_scores = {}
237
  feedback = {}
238
 
239
- numeric_fields = {"total", "subtotal", "tax", "adjusted_total"}
 
240
  date_fields = {"date", "due_date"}
241
  list_fields = {"line_items"}
 
 
242
 
243
  for field in required_fields:
244
  expected = ground_truth.get(field)
@@ -250,6 +253,9 @@ def grade_extraction(
250
  score = grade_numeric(actual, expected)
251
  elif field in date_fields:
252
  score = grade_date(actual, expected)
 
 
 
253
  else:
254
  score = grade_text(actual, expected)
255
 
@@ -258,8 +264,9 @@ def grade_extraction(
258
  "score": score,
259
  "expected_type": "list" if field in list_fields else
260
  "number" if field in numeric_fields else
261
- "date" if field in date_fields else "text",
262
- "matched": score >= 0.8,
 
263
  }
264
 
265
  # Overall score = weighted average
 
236
  field_scores = {}
237
  feedback = {}
238
 
239
+ numeric_fields = {"total", "subtotal", "tax", "adjusted_total",
240
+ "discount_amount", "original_total"}
241
  date_fields = {"date", "due_date"}
242
  list_fields = {"line_items"}
243
+ # Free-text reasoning fields: same fuzzy grader as text, but a lower "matched" threshold (0.5 instead of 0.8)
244
+ reasoning_fields = {"discrepancy_notes", "adjustment_reason"}
245
 
246
  for field in required_fields:
247
  expected = ground_truth.get(field)
 
253
  score = grade_numeric(actual, expected)
254
  elif field in date_fields:
255
  score = grade_date(actual, expected)
256
+ elif field in reasoning_fields:
257
+ # Free-text reasoning: currently the same grade_text fuzzy matcher as plain text; only the "matched" threshold below is relaxed
258
+ score = grade_text(actual, expected)
259
  else:
260
  score = grade_text(actual, expected)
261
 
 
264
  "score": score,
265
  "expected_type": "list" if field in list_fields else
266
  "number" if field in numeric_fields else
267
+ "date" if field in date_fields else
268
+ "reasoning" if field in reasoning_fields else "text",
269
+ "matched": score >= 0.5 if field in reasoning_fields else score >= 0.8,
270
  }
271
 
272
  # Overall score = weighted average