Musharraf committed on
Commit ·
a2ae67c
1
Parent(s): 7de3176
Add 2 new frontier-challenging tasks + reward shaping system
- corrupted_scan: OCR-corrupted documents with char substitutions (0/O, 1/l, 5/S)
- adversarial_invoice: decoy fields, contradictions, hidden calculations, discrepancy detection
- Reward shaping: consistency bonus (+0.03), efficiency signal (+0.01-0.02), improvement tracking (+0.02)
- 15 total documents across 5 difficulty tiers (easy -> expert)
- Updated inference.py with task-specific prompts for all 5 tasks
- Comprehensive README rewrite with reward design documentation
- README.md +61 -28
- inference.py +51 -8
- project_juiding_criterion.txt +145 -0
- server/app.py +12 -4
- server/documents.py +364 -1
- server/environment.py +66 -6
- server/graders.py +10 -3
README.md
CHANGED
|
@@ -12,7 +12,7 @@ tags:
|
|
| 12 |
|
| 13 |
# Invoice Extraction Environment
|
| 14 |
|
| 15 |
-
An OpenEnv-compliant environment where AI agents extract structured data from unstructured invoice and receipt documents.
|
| 16 |
|
| 17 |
**Space URL:** `https://huggingface.co/spaces/musharraf7/invoice-extraction-env`
|
| 18 |
|
|
@@ -25,14 +25,14 @@ r = requests.post(f"{url}/reset", json={"task_name": "simple_invoice"})
|
|
| 25 |
print(r.json())
|
| 26 |
```
|
| 27 |
|
| 28 |
-
##
|
| 29 |
|
| 30 |
-
|
| 31 |
|
| 32 |
-
|
| 33 |
-
- **
|
| 34 |
-
- **
|
| 35 |
-
- **
|
| 36 |
|
| 37 |
## Action Space
|
| 38 |
|
|
@@ -65,35 +65,63 @@ Each step returns an `InvoiceObservation`:
|
|
| 65 |
| `attempts_remaining` | int | Remaining extraction attempts |
|
| 66 |
| `required_fields` | list | Fields to extract |
|
| 67 |
| `done` | bool | Whether the episode has ended |
|
| 68 |
-
| `reward` | float | Reward signal
|
| 69 |
|
| 70 |
-
## Tasks
|
| 71 |
|
| 72 |
-
### 1. `simple_invoice` (Easy)
|
| 73 |
-
Clean, well-formatted invoices with clear field labels.
|
| 74 |
|
| 75 |
**Required fields:** `invoice_number`, `date`, `vendor_name`, `customer_name`, `subtotal`, `tax`, `total`, `line_items`
|
| 76 |
|
| 77 |
-
### 2. `messy_invoice` (Medium)
|
| 78 |
Same fields but from messy, inconsistently formatted documents with abbreviations, typos, and non-standard layouts.
|
| 79 |
|
| 80 |
**Required fields:** Same as simple_invoice
|
| 81 |
|
| 82 |
-
### 3. `multi_document` (Hard)
|
| 83 |
-
Complex multi-section documents containing a purchase order, invoice, and credit memo/payment receipt. The agent must cross-reference sections
|
| 84 |
|
| 85 |
-
**Required fields:** All
|
| 86 |
|
| 87 |
## Reward Design
|
| 88 |
|
| 89 |
-
|
| 90 |
-
|
| 91 |
-
|
| 92 |
-
|
| 93 |
-
|
| 94 |
-
- **
|
| 95 |
-
|
| 96 |
-
|
| 97 |
|
| 98 |
## Setup Instructions
|
| 99 |
|
|
@@ -109,6 +137,11 @@ pip install -r requirements.txt
|
|
| 109 |
uvicorn server.app:app --host 0.0.0.0 --port 7860
|
| 110 |
```
|
| 111 |
|
| 112 |
### Run inference
|
| 113 |
```bash
|
| 114 |
export ENV_URL="http://localhost:7860"
|
|
@@ -135,12 +168,12 @@ python inference.py
|
|
| 135 |
├── server/
|
| 136 |
│ ├── __init__.py
|
| 137 |
│ ├── app.py # FastAPI application
|
| 138 |
-
│ ├── environment.py # Core environment logic
|
| 139 |
-
│ ├── documents.py #
|
| 140 |
-
│ ├── graders.py #
|
| 141 |
-
│ └── models.py # Pydantic Action/Observation types
|
| 142 |
├── __init__.py # Package declaration
|
| 143 |
-
├── inference.py # Baseline inference script
|
| 144 |
├── openenv.yaml # OpenEnv manifest
|
| 145 |
├── pyproject.toml # Package configuration
|
| 146 |
├── requirements.txt # Dependencies
|
|
|
|
| 12 |
|
| 13 |
# Invoice Extraction Environment
|
| 14 |
|
| 15 |
+
An OpenEnv-compliant environment where AI agents extract structured data from unstructured invoice and receipt documents. Features **5 difficulty tiers** — from clean invoices to adversarial documents with decoy fields, OCR corruption, and hidden calculations.
|
| 16 |
|
| 17 |
**Space URL:** `https://huggingface.co/spaces/musharraf7/invoice-extraction-env`
|
| 18 |
|
|
|
|
| 25 |
print(r.json())
|
| 26 |
```
|
| 27 |
|
| 28 |
+
## Why This Environment?
|
| 29 |
|
| 30 |
+
Invoice data extraction is a **$5B+ industry** problem faced daily by every business. This environment provides:
|
| 31 |
|
| 32 |
+
- **Real RL training signal**: Per-field partial-credit scoring gives dense reward gradients
|
| 33 |
+
- **Genuine difficulty progression**: From clean invoices to adversarial traps that challenge frontier models
|
| 34 |
+
- **Reward shaping**: Consistency bonuses, efficiency signals, and improvement tracking provide rich learning signals beyond simple field matching
|
| 35 |
+
- **Production relevance**: The task directly models what commercial document processing systems must solve
|
| 36 |
|
| 37 |
## Action Space
|
| 38 |
|
|
|
|
| 65 |
| `attempts_remaining` | int | Remaining extraction attempts |
|
| 66 |
| `required_fields` | list | Fields to extract |
|
| 67 |
| `done` | bool | Whether the episode has ended |
|
| 68 |
+
| `reward` | float | Reward signal (0.01–0.99) |
|
| 69 |
|
| 70 |
+
## Tasks (5 Difficulty Tiers)
|
| 71 |
|
| 72 |
+
### 1. `simple_invoice` (Easy) — 3 attempts
|
| 73 |
+
Clean, well-formatted invoices with clear field labels.
|
| 74 |
|
| 75 |
**Required fields:** `invoice_number`, `date`, `vendor_name`, `customer_name`, `subtotal`, `tax`, `total`, `line_items`
|
| 76 |
|
| 77 |
+
### 2. `messy_invoice` (Medium) — 3 attempts
|
| 78 |
Same fields but from messy, inconsistently formatted documents with abbreviations, typos, and non-standard layouts.
|
| 79 |
|
| 80 |
**Required fields:** Same as simple_invoice
|
| 81 |
|
| 82 |
+
### 3. `multi_document` (Hard) — 5 attempts
|
| 83 |
+
Complex multi-section documents containing a purchase order, invoice, and credit memo/payment receipt. The agent must cross-reference sections.
|
| 84 |
|
| 85 |
+
**Required fields:** All basic fields + `po_number`, `adjustment_reason`, `adjusted_total`
|
| 86 |
+
|
| 87 |
+
### 4. `corrupted_scan` (Very Hard) — 4 attempts
|
| 88 |
+
Simulates OCR-scanned/faxed invoices with systematic character errors:
|
| 89 |
+
- Character substitutions: `0`↔`O`, `1`↔`l`↔`I`, `5`↔`S`, `8`↔`B`
|
| 90 |
+
- Garbled sections and scan artifacts
|
| 91 |
+
- The agent must **reason through noise** to recover the true values
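
The substitutions listed above can be partly undone in numeric contexts. The following is a minimal sketch, not code from this repository: `OCR_DIGIT_FIXES` and `parse_ocr_amount` are hypothetical names, and the mapping assumes the token is known to be a monetary amount, so letters are always mapped toward digits.

```python
# Hypothetical helper, not part of the environment: map letters that OCR
# commonly produces in place of digits back to the digits they resemble.
OCR_DIGIT_FIXES = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)


def parse_ocr_amount(token):
    """Parse a monetary token such as '$1,l34.OO' into a float, or None."""
    # Drop the currency symbol and thousands separators first.
    cleaned = token.strip().lstrip("$").replace(",", "")
    # Then apply the digit-direction substitutions.
    cleaned = cleaned.translate(OCR_DIGIT_FIXES)
    try:
        return float(cleaned)
    except ValueError:
        # Still not a number after correction (e.g. an illegible marker).
        return None
```

This only covers the letter-to-digit direction. Text fields such as vendor names need the opposite mapping (`0` to `O`, `1` to `l` or `I`), which depends on context and is left to the agent.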
|
| 92 |
+
|
| 93 |
+
**Required fields:** Same as simple_invoice
|
| 94 |
+
|
| 95 |
+
### 5. `adversarial_invoice` (Expert) — 6 attempts
|
| 96 |
+
Adversarial documents designed to trap and challenge frontier models:
|
| 97 |
+
- **Decoy fields**: Multiple invoice numbers — only one is current
|
| 98 |
+
- **Hidden calculations**: Discounts the agent must compute
|
| 99 |
+
- **Contradictory sections**: PO vs invoice disagreements
|
| 100 |
+
- **Budget variance alerts**: Agent must identify and explain discrepancies
|
| 101 |
+
|
| 102 |
+
**Required fields:** All basic fields + `po_number`, `discount_amount`, `original_total`, `discrepancy_notes`
|
| 103 |
|
| 104 |
## Reward Design
|
| 105 |
|
| 106 |
+
### Per-Field Scoring (Base Score)
|
| 107 |
+
- **Text fields**: Fuzzy matching with SequenceMatcher (0.0–1.0)
|
| 108 |
+
- **Numeric fields**: Exact match (1.0), within 1% (0.9), within 5% (0.5)
|
| 109 |
+
- **Date fields**: Normalized comparison (YYYY-MM-DD)
|
| 110 |
+
- **Line items**: Best-fit matching of description, qty, price, amount
|
| 111 |
+
- **Reasoning fields** (discrepancy_notes): Fuzzy matching with lower threshold
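
The text and numeric rules above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `server/graders.py`: the function names are hypothetical, the case and whitespace normalization is assumed, and the score of 0.0 for numbers more than 5% off is assumed because the list does not state it.

```python
from difflib import SequenceMatcher


def score_text(predicted, expected):
    """Fuzzy similarity in [0.0, 1.0] using difflib.SequenceMatcher."""
    if predicted is None or expected is None:
        return 0.0
    # Assumption: compare case-insensitively with surrounding whitespace removed.
    a = str(predicted).strip().lower()
    b = str(expected).strip().lower()
    return SequenceMatcher(None, a, b).ratio()


def score_number(predicted, expected):
    """Tiered score: exact 1.0, within 1% 0.9, within 5% 0.5, otherwise 0.0."""
    try:
        p, e = float(predicted), float(expected)
    except (TypeError, ValueError):
        return 0.0
    if p == e:
        return 1.0
    if e == 0:
        # Relative error is undefined against zero; treat as a miss (assumption).
        return 0.0
    rel_err = abs(p - e) / abs(e)
    if rel_err <= 0.01:
        return 0.9
    if rel_err <= 0.05:
        return 0.5
    return 0.0
```

With these definitions, `score_number(1140, 1134)` falls in the 1% tier, and a partly OCR-damaged name such as `"Acrne Corporati0n"` still earns partial credit against `"Acme Corporation"`.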
|
| 112 |
+
|
| 113 |
+
### Reward Shaping Bonuses
|
| 114 |
+
| Bonus | Value | Trigger |
|
| 115 |
+
|-------|-------|---------|
|
| 116 |
+
| **Consistency** | +0.03 | Agent's subtotal + tax = total |
|
| 117 |
+
| **Efficiency** | +0.01–0.02 | Solution found in ≤5 steps |
|
| 118 |
+
| **Improvement** | up to +0.02 | Score improves on retry |
|
| 119 |
+
|
| 120 |
+
### Episode Mechanics
|
| 121 |
+
- **Best score tracked** across all extraction attempts
|
| 122 |
+
- **Partial progress** feedback identifies weak fields for refinement
|
| 123 |
+
- **Early termination** at score ≥ 0.95
|
| 124 |
+
- **All scores** clamped to strict (0.01, 0.99) range
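
As a rough illustration of how the consistency bonus, improvement bonus, and clamping described above might combine, here is a sketch under assumptions. It is not the logic in `server/environment.py`: `shaped_reward` is a hypothetical name, the 0.01 arithmetic tolerance is assumed, the improvement bonus is assumed to scale with the size of the improvement up to its 0.02 cap, and the efficiency bonus is omitted because its exact schedule is not specified here.

```python
def shaped_reward(base_score, fields, previous_best=None, tolerance=0.01):
    """Add shaping bonuses to a base score and clamp to [0.01, 0.99]."""
    reward = base_score

    # Consistency bonus (+0.03): the agent's own subtotal + tax equals its total.
    try:
        subtotal = float(fields["subtotal"])
        tax = float(fields["tax"])
        total = float(fields["total"])
        if abs(subtotal + tax - total) <= tolerance:
            reward += 0.03
    except (KeyError, TypeError, ValueError):
        pass  # Missing or non-numeric fields earn no bonus.

    # Improvement bonus (up to +0.02): only when the score beats the prior best.
    # Scaling with the size of the improvement is an assumption.
    if previous_best is not None and base_score > previous_best:
        reward += min(0.02, base_score - previous_best)

    # Keep the final value inside the documented range.
    return min(0.99, max(0.01, reward))
```

For example, a base score of 0.98 with consistent totals is capped at 0.99 rather than reaching 1.01, and an empty extraction with base score 0.0 still returns 0.01.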
|
| 125 |
|
| 126 |
## Setup Instructions
|
| 127 |
|
|
|
|
| 137 |
uvicorn server.app:app --host 0.0.0.0 --port 7860
|
| 138 |
```
|
| 139 |
|
| 140 |
+
### Run with uv
|
| 141 |
+
```bash
|
| 142 |
+
uv run server
|
| 143 |
+
```
|
| 144 |
+
|
| 145 |
### Run inference
|
| 146 |
```bash
|
| 147 |
export ENV_URL="http://localhost:7860"
|
|
|
|
| 168 |
├── server/
|
| 169 |
│ ├── __init__.py
|
| 170 |
│ ├── app.py # FastAPI application
|
| 171 |
+
│ ├── environment.py # Core environment logic + reward shaping
|
| 172 |
+
│ ├── documents.py # 15-document corpus across 5 difficulty tiers
|
| 173 |
+
│ ├── graders.py # Field-level scoring with fuzzy matching
|
| 174 |
+
│ └── models.py # Pydantic Action/Observation/State types
|
| 175 |
├── __init__.py # Package declaration
|
| 176 |
+
├── inference.py # Baseline inference script (all 5 tasks)
|
| 177 |
├── openenv.yaml # OpenEnv manifest
|
| 178 |
├── pyproject.toml # Package configuration
|
| 179 |
├── requirements.txt # Dependencies
|
inference.py
CHANGED
|
@@ -3,9 +3,9 @@
|
|
| 3 |
Baseline inference script for the Invoice Extraction Environment.
|
| 4 |
|
| 5 |
This script demonstrates how an LLM agent interacts with the environment
|
| 6 |
-
to extract structured data from invoice documents. It runs all
|
| 7 |
-
(simple_invoice, messy_invoice, multi_document
|
| 8 |
-
mandatory OpenEnv [START]/[STEP]/[END] format.
|
| 9 |
|
| 10 |
Required environment variables:
|
| 11 |
API_BASE_URL — OpenAI-compatible API endpoint
|
|
@@ -35,7 +35,7 @@ ENV_URL = os.getenv("ENV_URL", "http://localhost:7860")
|
|
| 35 |
# Optional — if you use from_docker_image():
|
| 36 |
LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME")
|
| 37 |
|
| 38 |
-
TASKS = ["simple_invoice", "messy_invoice", "multi_document"]
|
| 39 |
|
| 40 |
# ---------------------------------------------------------------------------
|
| 41 |
# LLM Client (OpenAI-compatible, configured via env vars)
|
|
@@ -100,12 +100,46 @@ RULES:
|
|
| 100 |
- For monetary amounts, use plain numbers without currency symbols (e.g. 1134.00)
|
| 101 |
- For line_items, use an array of objects with keys: description, quantity, unit_price, amount
|
| 102 |
- If a field cannot be found, use null
|
| 103 |
-
|
| 104 |
-
|
| 105 |
-
|
| 106 |
|
| 107 |
JSON:"""
|
| 108 |
|
| 109 |
REFINE_PROMPT = """You previously extracted data from an invoice but some fields were incorrect.
|
| 110 |
|
| 111 |
DOCUMENT:
|
|
@@ -128,6 +162,8 @@ RULES:
|
|
| 128 |
- For monetary amounts, use plain numbers without currency symbols
|
| 129 |
- For line_items, use an array of objects with keys: description, quantity, unit_price, amount
|
| 130 |
- If a field cannot be determined, use null
|
| 131 |
|
| 132 |
JSON:"""
|
| 133 |
|
|
@@ -217,7 +253,12 @@ def run_task(env_url: str, task_name: str, seed: int = 0) -> float:
|
|
| 217 |
|
| 218 |
# Step 3: Use LLM to extract fields
|
| 219 |
fields_str = "\n".join(f"- {f}" for f in required_fields)
|
| 220 |
-
|
| 221 |
llm_response = call_llm(prompt)
|
| 222 |
extracted_json = extract_json_from_response(llm_response)
|
| 223 |
|
|
@@ -253,11 +294,13 @@ def run_task(env_url: str, task_name: str, seed: int = 0) -> float:
|
|
| 253 |
weak_fields = [f for f, s in field_scores.items() if s < 0.8]
|
| 254 |
|
| 255 |
# Refine with LLM
|
|
|
|
| 256 |
refine_prompt = REFINE_PROMPT.format(
|
| 257 |
document=document_text,
|
| 258 |
previous=extracted_json,
|
| 259 |
weak_fields=", ".join(weak_fields) if weak_fields else "all fields",
|
| 260 |
feedback=feedback_text,
|
|
|
|
| 261 |
)
|
| 262 |
refined_response = call_llm(refine_prompt)
|
| 263 |
refined_json = extract_json_from_response(refined_response)
|
|
|
|
| 3 |
Baseline inference script for the Invoice Extraction Environment.
|
| 4 |
|
| 5 |
This script demonstrates how an LLM agent interacts with the environment
|
| 6 |
+
to extract structured data from invoice documents. It runs all five tasks
|
| 7 |
+
(simple_invoice, messy_invoice, multi_document, corrupted_scan, adversarial_invoice)
|
| 8 |
+
and logs results in the mandatory OpenEnv [START]/[STEP]/[END] format.
|
| 9 |
|
| 10 |
Required environment variables:
|
| 11 |
API_BASE_URL — OpenAI-compatible API endpoint
|
|
|
|
| 35 |
# Optional — if you use from_docker_image():
|
| 36 |
LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME")
|
| 37 |
|
| 38 |
+
TASKS = ["simple_invoice", "messy_invoice", "multi_document", "corrupted_scan", "adversarial_invoice"]
|
| 39 |
|
| 40 |
# ---------------------------------------------------------------------------
|
| 41 |
# LLM Client (OpenAI-compatible, configured via env vars)
|
|
|
|
| 100 |
- For monetary amounts, use plain numbers without currency symbols (e.g. 1134.00)
|
| 101 |
- For line_items, use an array of objects with keys: description, quantity, unit_price, amount
|
| 102 |
- If a field cannot be found, use null
|
| 103 |
+
{task_specific_rules}
|
| 104 |
+
|
| 105 |
+
IMPORTANT: Ensure your extracted subtotal + tax = total. Verify math consistency.
|
| 106 |
|
| 107 |
JSON:"""
|
| 108 |
|
| 109 |
+
TASK_RULES = {
|
| 110 |
+
"simple_invoice": "",
|
| 111 |
+
"messy_invoice": (
|
| 112 |
+
"- This document uses informal formatting, abbreviations, and shorthand\n"
|
| 113 |
+
"- Look past formatting irregularities to find the actual values\n"
|
| 114 |
+
"- 'subtot', 's/t', 'sub' = subtotal; 'tx' = tax; 'amt due' = total"
|
| 115 |
+
),
|
| 116 |
+
"multi_document": (
|
| 117 |
+
"- This contains MULTIPLE document sections (PO, Invoice, Credit Memo, etc.)\n"
|
| 118 |
+
"- Extract from the INVOICE section primarily\n"
|
| 119 |
+
"- adjusted_total is the final amount after credits/payments\n"
|
| 120 |
+
"- po_number is the purchase order reference number\n"
|
| 121 |
+
"- adjustment_reason describes why the total was adjusted"
|
| 122 |
+
),
|
| 123 |
+
"corrupted_scan": (
|
| 124 |
+
"- WARNING: This is an OCR-scanned document with character errors\n"
|
| 125 |
+
"- Common OCR substitutions: 0<->O, 1<->l<->I, 5<->S, 8<->B\n"
|
| 126 |
+
"- Mentally correct OCR errors to recover the true values\n"
|
| 127 |
+
"- 'lNV' = 'INV', 'S' in numbers = '5', 'O' in numbers = '0'\n"
|
| 128 |
+
"- Verify all numbers by cross-checking (qty * unit_price = amount)"
|
| 129 |
+
),
|
| 130 |
+
"adversarial_invoice": (
|
| 131 |
+
"- CAUTION: This document contains DECOY fields and contradictions\n"
|
| 132 |
+
"- Multiple invoice numbers may appear — use the CURRENT/ACTIVE one, not voided/draft ones\n"
|
| 133 |
+
"- If there is a reissue date, use that as the date (not the original date)\n"
|
| 134 |
+
"- subtotal is the ADJUSTED subtotal after any discounts\n"
|
| 135 |
+
"- discount_amount is the monetary discount value\n"
|
| 136 |
+
"- original_total is what the total WOULD have been without adjustments\n"
|
| 137 |
+
"- discrepancy_notes: describe ALL discrepancies, adjustments, and calculations\n"
|
| 138 |
+
"- po_number: the purchase order reference if present, else null\n"
|
| 139 |
+
"- Cross-reference different sections to find contradictions"
|
| 140 |
+
),
|
| 141 |
+
}
|
| 142 |
+
|
| 143 |
REFINE_PROMPT = """You previously extracted data from an invoice but some fields were incorrect.
|
| 144 |
|
| 145 |
DOCUMENT:
|
|
|
|
| 162 |
- For monetary amounts, use plain numbers without currency symbols
|
| 163 |
- For line_items, use an array of objects with keys: description, quantity, unit_price, amount
|
| 164 |
- If a field cannot be determined, use null
|
| 165 |
+
- VERIFY: subtotal + tax should equal total
|
| 166 |
+
{task_specific_rules}
|
| 167 |
|
| 168 |
JSON:"""
|
| 169 |
|
|
|
|
| 253 |
|
| 254 |
# Step 3: Use LLM to extract fields
|
| 255 |
fields_str = "\n".join(f"- {f}" for f in required_fields)
|
| 256 |
+
task_rules = TASK_RULES.get(task_name, "")
|
| 257 |
+
prompt = EXTRACT_PROMPT.format(
|
| 258 |
+
document=document_text,
|
| 259 |
+
fields=fields_str,
|
| 260 |
+
task_specific_rules=task_rules,
|
| 261 |
+
)
|
| 262 |
llm_response = call_llm(prompt)
|
| 263 |
extracted_json = extract_json_from_response(llm_response)
|
| 264 |
|
|
|
|
| 294 |
weak_fields = [f for f, s in field_scores.items() if s < 0.8]
|
| 295 |
|
| 296 |
# Refine with LLM
|
| 297 |
+
task_rules = TASK_RULES.get(task_name, "")
|
| 298 |
refine_prompt = REFINE_PROMPT.format(
|
| 299 |
document=document_text,
|
| 300 |
previous=extracted_json,
|
| 301 |
weak_fields=", ".join(weak_fields) if weak_fields else "all fields",
|
| 302 |
feedback=feedback_text,
|
| 303 |
+
task_specific_rules=task_rules,
|
| 304 |
)
|
| 305 |
refined_response = call_llm(refine_prompt)
|
| 306 |
refined_json = extract_json_from_response(refined_response)
|
project_juiding_criterion.txt
ADDED
|
@@ -0,0 +1,145 @@
|
|
| 1 |
+
Real-world task simulation
|
| 2 |
+
|
| 3 |
+
The environment must simulate a task humans actually do. Not games, not toys. Examples: email triage, code review, data cleaning, scheduling, customer support, content moderation.
|
| 4 |
+
|
| 5 |
+
OpenEnv spec compliance
|
| 6 |
+
|
| 7 |
+
Implement the full OpenEnv interface: typed Observation, Action, and Reward Pydantic models. step(action) → returns observation, reward, done, info. reset() → returns initial observation. state() → returns current state. openenv.yaml with metadata. Tested via openenv validate.
|
| 8 |
+
|
| 9 |
+
Minimum 3 tasks with agent graders
|
| 10 |
+
|
| 11 |
+
Each task defines a concrete objective an agent must accomplish, with a programmatic grader that scores performance (0.0–1.0). Tasks should range: easy → medium → hard. Graders must have clear, deterministic success/failure criteria.
|
| 12 |
+
|
| 13 |
+
Meaningful reward function
|
| 14 |
+
|
| 15 |
+
Provides signal over the full trajectory (not just binary end-of-episode). Rewards partial progress toward task completion. Penalizes clearly undesirable behavior (e.g. infinite loops, destructive actions).
|
| 16 |
+
|
| 17 |
+
Baseline inference script
|
| 18 |
+
|
| 19 |
+
Uses the OpenAI API client to run a model against the environment. Reads API credentials from environment variables (OPENAI_API_KEY). Produces a reproducible baseline score on all 3 tasks.
|
| 20 |
+
|
| 21 |
+
Detailed Requirements
|
| 22 |
+
|
| 23 |
+
Non-Functional Requirements
|
| 24 |
+
|
| 25 |
+
Deploys to a Hugging Face Space
|
| 26 |
+
|
| 27 |
+
Environment must run as a containerized HF Space tagged with openenv.
|
| 28 |
+
|
| 29 |
+
Containerized execution
|
| 30 |
+
|
| 31 |
+
Must include a working Dockerfile. The environment should start cleanly with docker build + docker run.
|
| 32 |
+
|
| 33 |
+
Documentation
|
| 34 |
+
|
| 35 |
+
README must include: environment description and motivation, action and observation space definitions, task descriptions with expected difficulty, setup and usage instructions, baseline scores.
|
| 36 |
+
|
| 37 |
+
Parameter
|
| 38 |
+
|
| 39 |
+
Weight
|
| 40 |
+
|
| 41 |
+
Description
|
| 42 |
+
|
| 43 |
+
Real-world utility
|
| 44 |
+
|
| 45 |
+
30%
|
| 46 |
+
|
| 47 |
+
Does the environment model a genuine task? Would someone actually use this to train or evaluate agents?
|
| 48 |
+
|
| 49 |
+
Task & grader quality
|
| 50 |
+
|
| 51 |
+
25%
|
| 52 |
+
|
| 53 |
+
Are tasks well-defined with clear objectives? Do graders accurately and fairly measure success? Meaningful difficulty progression?
|
| 54 |
+
|
| 55 |
+
Environment design
|
| 56 |
+
|
| 57 |
+
20%
|
| 58 |
+
|
| 59 |
+
Clean state management, sensible action/observation spaces, good reward shaping, proper episode boundaries.
|
| 60 |
+
|
| 61 |
+
Code quality & spec compliance
|
| 62 |
+
|
| 63 |
+
15%
|
| 64 |
+
|
| 65 |
+
Follows OpenEnv spec, clean project structure, typed models, documented, tested, Dockerfile works.
|
| 66 |
+
|
| 67 |
+
Creativity & novelty
|
| 68 |
+
|
| 69 |
+
10%
|
| 70 |
+
|
| 71 |
+
Novel problem domain, interesting mechanics, clever reward design, original approach.
|
| 72 |
+
|
| 73 |
+
Scoring Breakdown
|
| 74 |
+
|
| 75 |
+
Real-world utility (30%)
|
| 76 |
+
|
| 77 |
+
• 0–5: Toy/artificial problem with no practical application
|
| 78 |
+
|
| 79 |
+
• 6–15: Valid domain but shallow modeling of the real task
|
| 80 |
+
|
| 81 |
+
• 16–25: Good domain modeling, would be useful for agent evaluation
|
| 82 |
+
|
| 83 |
+
• 26–30: Excellent — fills a real gap, immediate value for the RL/agent community
|
| 84 |
+
|
| 85 |
+
Task & grader quality (25%)
|
| 86 |
+
|
| 87 |
+
• 3+ tasks with difficulty range?
|
| 88 |
+
|
| 89 |
+
• Graders produce scores between 0.0–1.0?
|
| 90 |
+
|
| 91 |
+
• Graders deterministic and reproducible?
|
| 92 |
+
|
| 93 |
+
• Hard task genuinely challenges frontier models?
|
| 94 |
+
|
| 95 |
+
Environment design (20%)
|
| 96 |
+
|
| 97 |
+
• reset() produces clean state?
|
| 98 |
+
|
| 99 |
+
• Action/observation types well-designed and documented?
|
| 100 |
+
|
| 101 |
+
• Reward function provides useful varying signal (not just sparse)?
|
| 102 |
+
|
| 103 |
+
• Episode boundaries sensible?
|
| 104 |
+
|
| 105 |
+
Code quality & spec compliance (15%)
|
| 106 |
+
|
| 107 |
+
• openenv validate passes?
|
| 108 |
+
|
| 109 |
+
• docker build && docker run works?
|
| 110 |
+
|
| 111 |
+
• HF Space deploys and responds?
|
| 112 |
+
|
| 113 |
+
• Baseline script runs and reproduces scores?
|
| 114 |
+
|
| 115 |
+
Creativity & novelty (10%)
|
| 116 |
+
|
| 117 |
+
• Domain we haven’t seen in OpenEnv before?
|
| 118 |
+
|
| 119 |
+
• Reward design has interesting properties?
|
| 120 |
+
|
| 121 |
+
• Clever mechanics that make the environment engaging?
|
| 122 |
+
|
| 123 |
+
Evaluation Criteria
|
| 124 |
+
|
| 125 |
+
Phase 1: Automated Validation
|
| 126 |
+
|
| 127 |
+
Pass/fail gate — HF Space deploys, OpenEnv spec compliance, Dockerfile builds, baseline reproduces, 3+ tasks with graders.
|
| 128 |
+
|
| 129 |
+
Phase 2: Agentic Evaluation
|
| 130 |
+
|
| 131 |
+
Scored — baseline agent re-run, standard Open LLM agent (e.g. Nemotron 3 Super) run against all environments, score variance check.
|
| 132 |
+
|
| 133 |
+
Phase 3: Human Review
|
| 134 |
+
|
| 135 |
+
Top submissions reviewed by Meta and Hugging Face engineers for real-world utility, creativity, and exploit checks.
|
| 136 |
+
|
| 137 |
+
Disqualification Criteria
|
| 138 |
+
|
| 139 |
+
Environment does not deploy or respond
|
| 140 |
+
|
| 141 |
+
Plagiarized or trivially modified existing environments
|
| 142 |
+
|
| 143 |
+
Graders that always return the same score
|
| 144 |
+
|
| 145 |
+
No baseline inference script
|
server/app.py
CHANGED
|
@@ -131,11 +131,19 @@ def create_invoice_app() -> FastAPI:
|
|
| 131 |
"name": "invoice_extraction_env",
|
| 132 |
"description": (
|
| 133 |
"An environment for extracting structured data from unstructured "
|
| 134 |
-
"invoice and receipt documents.
|
| 135 |
-
"
|
| 136 |
),
|
| 137 |
-
"version": "0.
|
| 138 |
-
"tasks": [
|
| 139 |
}
|
| 140 |
|
| 141 |
# === WebSocket (for persistent sessions) ===
|
|
|
|
| 131 |
"name": "invoice_extraction_env",
|
| 132 |
"description": (
|
| 133 |
"An environment for extracting structured data from unstructured "
|
| 134 |
+
"invoice and receipt documents. Features 5 difficulty tiers from "
|
| 135 |
+
"clean invoices to adversarial documents with decoy fields, OCR "
|
| 136 |
+
"corruption, and hidden calculations. Reward shaping includes "
|
| 137 |
+
"consistency bonuses, efficiency signals, and improvement tracking."
|
| 138 |
),
|
| 139 |
+
"version": "0.2.0",
|
| 140 |
+
"tasks": [
|
| 141 |
+
"simple_invoice",
|
| 142 |
+
"messy_invoice",
|
| 143 |
+
"multi_document",
|
| 144 |
+
"corrupted_scan",
|
| 145 |
+
"adversarial_invoice",
|
| 146 |
+
],
|
| 147 |
}
|
| 148 |
|
| 149 |
# === WebSocket (for persistent sessions) ===
|
server/documents.py
CHANGED
|
@@ -465,6 +465,357 @@ Total with Backorder: $2,498.45 + $170.00 = $2,668.45
|
|
| 465 |
},
|
| 466 |
},
|
| 467 |
],
|
| 468 |
}
|
| 469 |
|
| 470 |
|
|
@@ -483,6 +834,16 @@ TASK_REQUIRED_FIELDS = {
|
|
| 483 |
"subtotal", "tax", "total", "line_items",
|
| 484 |
"po_number", "adjustment_reason", "adjusted_total",
|
| 485 |
],
|
| 486 |
}
|
| 487 |
|
| 488 |
|
|
@@ -490,7 +851,8 @@ def get_document(task_name: str, doc_index: int = 0) -> dict:
|
|
| 490 |
"""Get a document and its metadata for a given task.
|
| 491 |
|
| 492 |
Args:
|
| 493 |
-
task_name: One of 'simple_invoice', 'messy_invoice', 'multi_document'
|
|
|
|
| 494 |
doc_index: Index into the document pool (will wrap around)
|
| 495 |
|
| 496 |
Returns:
|
|
@@ -504,3 +866,4 @@ def get_document(task_name: str, doc_index: int = 0) -> dict:
|
|
| 504 |
"ground_truth": doc["ground_truth"],
|
| 505 |
"required_fields": TASK_REQUIRED_FIELDS.get(task_name, TASK_REQUIRED_FIELDS["simple_invoice"]),
|
| 506 |
}
|
|
|
|
|
|
| 465 |
},
|
| 466 |
},
|
| 467 |
],
|
| 468 |
+
|
| 469 |
+
# =========================================================================
|
| 470 |
+
# CORRUPTED SCAN — OCR-like artifacts, character substitutions, garbled text
|
| 471 |
+
# These simulate real scanned/faxed invoices with OCR errors.
|
| 472 |
+
# =========================================================================
|
| 473 |
+
"corrupted_scan": [
|
| 474 |
+
{
|
| 475 |
+
"id": "corrupt_001",
|
| 476 |
+
"text": """SC4NNED D0CUMENT - Page 1 of 1
|
| 477 |
+
|
| 478 |
+
lNVOlCE
|
| 479 |
+
|
| 480 |
+
lnvoice Nurnber: lNV-2O24-OO1
|
| 481 |
+
Dat.e: Januery 1S, 2O24
|
| 482 |
+
|
| 483 |
+
Frorn:
|
| 484 |
+
Acrne Corporati0n
|
| 485 |
+
l23 Business Avenue
|
| 486 |
+
New Y0rk, NY 1OOO1
|
| 487 |
+
|
| 488 |
+
BilI To:
|
| 489 |
+
Widget C0.
|
| 490 |
+
4S6 Cornmerce Street
|
| 491 |
+
Chicag0, lL 6O6O1
|
| 492 |
+
|
| 493 |
+
Descripti0n Qty Unit Price Arnount
|
| 494 |
+
---------------------------------------------------------
|
| 495 |
+
Widget Type A 1O $2S.OO $2SO.OO
|
| 496 |
+
Widget Type 8 S $4O.OO $2OO.OO
|
| 497 |
+
ConsuIting Service 8 $7S.OO $6OO.OO
|
| 498 |
+
|
| 499 |
+
Subtotal: $1,OSO.OO
|
| 500 |
+
Tax (8%): $84.OO
|
| 501 |
+
T0tal: $1,l34.OO
|
| 502 |
+
|
| 503 |
+
Payrnent Terrns: Net 3O
|
| 504 |
+
|
| 505 |
+
--- END 0F SCAN ---
|
| 506 |
+
""",
|
| 507 |
+
"ground_truth": {
|
| 508 |
+
"invoice_number": "INV-2024-001",
|
| 509 |
+
"date": "2024-01-15",
|
| 510 |
+
"vendor_name": "Acme Corporation",
|
| 511 |
+
"customer_name": "Widget Co.",
|
| 512 |
+
"subtotal": 1050.00,
|
| 513 |
+
"tax": 84.00,
|
| 514 |
+
"total": 1134.00,
|
| 515 |
+
"line_items": [
|
| 516 |
+
{"description": "Widget Type A", "quantity": 10, "unit_price": 25.00, "amount": 250.00},
|
| 517 |
+
{"description": "Widget Type B", "quantity": 5, "unit_price": 40.00, "amount": 200.00},
|
| 518 |
+
{"description": "Consulting Service", "quantity": 8, "unit_price": 75.00, "amount": 600.00},
|
| 519 |
+
],
|
| 520 |
+
},
|
| 521 |
+
},
|
| 522 |
+
{
|
| 523 |
+
"id": "corrupt_002",
|
| 524 |
+
"text": """[SCAN QUALITY: P00R - SOME CHARACTERS MAY BE lNCORRECT]
|
| 525 |
+
|
| 526 |
+
TECHSTART S0LUTl0NS LLC
|
| 527 |
+
89O lnnovation Dr, Suite 2OO
|
| 528 |
+
San Francisc0, CA 941OS
|
| 529 |
+
|
| 530 |
+
lNV0lCE #: TS~S892
|
| 531 |
+
DATE: O3/O3/2O24
|
| 532 |
+
|
| 533 |
+
CUSTOMERr DataFIow lnc.
|
| 534 |
+
321 AnaIytics BIvd
|
| 535 |
+
Austin, TX 787O1
|
| 536 |
+
|
| 537 |
+
Servicc Qty Unit Pricc Total
|
| 538 |
+
----------------------------------------------------------
|
| 539 |
+
CIoud Hosting (MonthIy) l $4SO.OO $4SO.OO
|
| 540 |
+
APl lntegration Setup l $l,2OO.OO $l,2OO.OO
|
| 541 |
+
TechnicaI Support (hours) l2 $9S.OO $l,l4O.OO
|
| 542 |
+
|
| 543 |
+
SubtotaI: $2,79O.OO
|
| 544 |
+
Tax (7%)): $l9S.3O
|
| 545 |
+
TotaI: $2,98S.3O
|
| 546 |
+
|
| 547 |
+
Due Date: ApriI 2, 2O24
|
| 548 |
+
|
| 549 |
+
[PAGE 1/1 - SCAN C0MPLETE]
|
| 550 |
+
""",
|
| 551 |
+
"ground_truth": {
|
| 552 |
+
"invoice_number": "TS-5892",
|
| 553 |
+
"date": "2024-03-03",
|
| 554 |
+
"vendor_name": "TechStart Solutions LLC",
|
| 555 |
+
"customer_name": "DataFlow Inc.",
|
| 556 |
+
"subtotal": 2790.00,
|
| 557 |
+
"tax": 195.30,
|
| 558 |
+
"total": 2985.30,
|
| 559 |
+
"line_items": [
|
| 560 |
+
{"description": "Cloud Hosting (Monthly)", "quantity": 1, "unit_price": 450.00, "amount": 450.00},
|
| 561 |
+
{"description": "API Integration Setup", "quantity": 1, "unit_price": 1200.00, "amount": 1200.00},
|
| 562 |
+
{"description": "Technical Support (hours)", "quantity": 12, "unit_price": 95.00, "amount": 1140.00},
|
| 563 |
+
],
|
| 564 |
+
},
|
| 565 |
+
},
|
| 566 |
+
{
|
| 567 |
+
"id": "corrupt_003",
|
| 568 |
+
"text": """---FAXED DOCUMENT---
|
| 569 |
+
RECEIVED: 02/20/2024 14:32
|
| 570 |
+
QUALITY: [####===---] 40%
|
| 571 |
+
|
| 572 |
+
GL0BAL SUPPLlES lNC.
|
| 573 |
+
25OO lndustriaI Parkway
|
| 574 |
+
Detr0it, Ml 482Ol
|
| 575 |
+
|
| 576 |
+
lNVOlCE
|
| 577 |
+
|
| 578 |
+
lnvoice Number: GS-2O24-Ol47
|
| 579 |
+
Date: February 2O, 2024
|
| 580 |
+
|
| 581 |
+
T0:
|
| 582 |
+
Riverside Manufactur1ng
|
| 583 |
+
78O Factory R0ad
|
| 584 |
+
CIeveIand, 0H 44l0l
|
| 585 |
+
|
| 586 |
+
Product Qty Price Each Line Total
|
| 587 |
+
-----------------------------------------------------------
|
| 588 |
+
SteeI BoIts (Box/lOO) SO $l2.SO $62S.OO
|
| 589 |
+
Copper Wire (SOOft RoII) 8 $8S.OO $68O.OO
|
| 590 |
+
Safety GoggIes (Pack/lO) 2O $3S.OO $7OO.OO
|
| 591 |
+
WeIding Rods (BundIe) lS $22.OO $33O.OO
|
| 592 |
+
|
| 593 |
+
[iIIegibIe]
|
| 594 |
+
SubtotaI: $2,33S.OO
|
| 595 |
+
SaIes Tax: $l63.4S
|
| 596 |
+
lnvoice T0tal: $2,498.4S
|
| 597 |
+
|
| 598 |
+
Terrns: Net 4S
|
| 599 |
+
---END FAX---
|
| 600 |
+
""",
|
| 601 |
+
"ground_truth": {
|
| 602 |
+
"invoice_number": "GS-2024-0147",
|
| 603 |
+
"date": "2024-02-20",
|
| 604 |
+
"vendor_name": "Global Supplies Inc.",
|
| 605 |
+
"customer_name": "Riverside Manufacturing",
|
| 606 |
+
"subtotal": 2335.00,
|
| 607 |
+
"tax": 163.45,
|
| 608 |
+
"total": 2498.45,
|
| 609 |
+
"line_items": [
|
| 610 |
+
{"description": "Steel Bolts (Box/100)", "quantity": 50, "unit_price": 12.50, "amount": 625.00},
|
| 611 |
+
{"description": "Copper Wire (500ft Roll)", "quantity": 8, "unit_price": 85.00, "amount": 680.00},
|
| 612 |
+
{"description": "Safety Goggles (Pack/10)", "quantity": 20, "unit_price": 35.00, "amount": 700.00},
|
| 613 |
+
{"description": "Welding Rods (Bundle)", "quantity": 15, "unit_price": 22.00, "amount": 330.00},
|
| 614 |
+
],
|
| 615 |
+
},
|
| 616 |
+
},
|
| 617 |
+
],
|
| 618 |
+
|
| 619 |
+
# =========================================================================
|
| 620 |
+
# ADVERSARIAL INVOICE — Decoy fields, contradictions, hidden calculations
|
| 621 |
+
# Designed to genuinely challenge frontier models with traps.
|
| 622 |
+
# =========================================================================
|
| 623 |
+
"adversarial_invoice": [
|
| 624 |
+
{
|
| 625 |
+
"id": "adversarial_001",
|
| 626 |
+
"text": """INVOICE
|
| 627 |
+
|
| 628 |
+
*** IMPORTANT: This replaces previous invoice DRAFT-INV-999 which was voided ***
|
| 629 |
+
|
| 630 |
+
Invoice Number: INV-2024-001-R2
|
| 631 |
+
Previous Reference: DRAFT-INV-999 (VOIDED — DO NOT USE)
|
| 632 |
+
Date: January 15, 2024
|
| 633 |
+
Reissue Date: January 20, 2024
|
| 634 |
+
|
| 635 |
+
From:
|
| 636 |
+
Acme Corporation
|
| 637 |
+
123 Business Avenue, New York, NY 10001
|
| 638 |
+
Tax ID: 12-3456789
|
| 639 |
+
|
| 640 |
+
Bill To:
|
| 641 |
+
Widget Co. (DBA "WidgetCorp International")
|
| 642 |
+
456 Commerce Street, Chicago, IL 60601
|
| 643 |
+
Customer Account: WC-0042
|
| 644 |
+
|
| 645 |
+
Description Qty Unit Price Amount
|
| 646 |
+
---------------------------------------------------------
|
| 647 |
+
Widget Type A 10 $25.00 $250.00
|
| 648 |
+
Widget Type B 5 $40.00 $200.00
|
| 649 |
+
Consulting Service 8 $75.00 $600.00
|
| 650 |
+
** EARLY PAYMENT DISCOUNT: -5% on consulting **
|
| 651 |
+
|
| 652 |
+
Subtotal: $1,050.00
|
| 653 |
+
Discount (5%): -$30.00
|
| 654 |
+
Adjusted Subtotal: $1,020.00
|
| 655 |
+
Tax (8%): $81.60
|
| 656 |
+
Total: $1,101.60
|
| 657 |
+
|
| 658 |
+
NOTE: Original quote (QT-2024-555) was $1,134.00 but discount applied.
|
| 659 |
+
Per agreement dated Jan 12, if paid within 10 days.
|
| 660 |
+
|
| 661 |
+
Payment Terms: Net 10 (discounted) / Net 30 (full price $1,134.00)
|
| 662 |
+
""",
|
| 663 |
+
"ground_truth": {
|
| 664 |
+
"invoice_number": "INV-2024-001-R2",
|
| 665 |
+
"date": "2024-01-20",
|
| 666 |
+
"vendor_name": "Acme Corporation",
|
| 667 |
+
"customer_name": "Widget Co.",
|
| 668 |
+
"subtotal": 1020.00,
|
| 669 |
+
"tax": 81.60,
|
| 670 |
+
"total": 1101.60,
|
| 671 |
+
"discount_amount": 30.00,
|
| 672 |
+
"original_total": 1134.00,
|
| 673 |
+
"line_items": [
|
| 674 |
+
{"description": "Widget Type A", "quantity": 10, "unit_price": 25.00, "amount": 250.00},
|
| 675 |
+
{"description": "Widget Type B", "quantity": 5, "unit_price": 40.00, "amount": 200.00},
|
| 676 |
+
{"description": "Consulting Service", "quantity": 8, "unit_price": 75.00, "amount": 600.00},
|
| 677 |
+
],
|
| 678 |
+
"discrepancy_notes": "5% early payment discount applied to consulting. Reissued invoice replaces voided DRAFT-INV-999. Adjusted subtotal $1,020 vs original $1,050.",
|
| 679 |
+
},
|
| 680 |
+
},
|
| 681 |
+
{
|
| 682 |
+
"id": "adversarial_002",
|
| 683 |
+
"text": """--- PURCHASE ORDER ---
|
| 684 |
+
PO#: PO-DF-2024-112
|
| 685 |
+
Date: February 28, 2024
|
| 686 |
+
Vendor: TechStart Solutions LLC
|
| 687 |
+
Buyer: DataFlow Inc.
|
| 688 |
+
Authorized Budget: $2,600.00 (pre-tax)
|
| 689 |
+
|
| 690 |
+
Items:
|
| 691 |
+
1. Cloud Hosting - 1 unit @ $450.00 = $450.00
|
| 692 |
+
2. API Integration - 1 unit @ $1,200.00 = $1,200.00
|
| 693 |
+
3. Tech Support - 10 hours @ $95.00/hr = $950.00
|
| 694 |
+
PO Total: $2,600.00
|
| 695 |
+
|
| 696 |
+
--- INVOICE ---
|
| 697 |
+
Invoice: TS-5892-FINAL
|
| 698 |
+
Date: March 3, 2024
|
| 699 |
+
PO Reference: PO-DF-2024-112
|
| 700 |
+
|
| 701 |
+
From: TechStart Solutions LLC
|
| 702 |
+
To: DataFlow Inc.
|
| 703 |
+
|
| 704 |
+
Service Qty Rate Amount
|
| 705 |
+
Cloud Hosting (Monthly) 1 $450.00 $450.00
|
| 706 |
+
API Integration Setup 1 $1,200.00 $1,200.00
|
| 707 |
+
Technical Support (actual) 12 $95.00 $1,140.00
|
| 708 |
+
>> 2 hrs over PO estimate, approved by J. Smith 03/01/2024
|
| 709 |
+
Rush Processing Fee 1 $150.00 $150.00
|
| 710 |
+
>> Added per emergency request ER-2024-033
|
| 711 |
+
|
| 712 |
+
Subtotal: $2,940.00
|
| 713 |
+
Tax (7%): $205.80
|
| 714 |
+
Total: $3,145.80
|
| 715 |
+
|
| 716 |
+
!!! BUDGET VARIANCE ALERT !!!
|
| 717 |
+
PO Authorized: $2,600.00
|
| 718 |
+
Actual (pre-tax): $2,940.00
|
| 719 |
+
Variance: $340.00 OVER BUDGET (13.1%)
|
| 720 |
+
Causes: Support overage ($190), Rush fee ($150)
|
| 721 |
+
|
| 722 |
+
--- PAYMENT SCHEDULE ---
|
| 723 |
+
Payment 1 (due 03/15): $1,500.00
|
| 724 |
+
Payment 2 (due 04/02): $1,645.80
|
| 725 |
+
""",
|
| 726 |
+
"ground_truth": {
|
| 727 |
+
"invoice_number": "TS-5892-FINAL",
|
| 728 |
+
"date": "2024-03-03",
|
| 729 |
+
"vendor_name": "TechStart Solutions LLC",
|
| 730 |
+
"customer_name": "DataFlow Inc.",
|
| 731 |
+
"subtotal": 2940.00,
|
| 732 |
+
"tax": 205.80,
|
| 733 |
+
"total": 3145.80,
|
| 734 |
+
"po_number": "PO-DF-2024-112",
|
| 735 |
+
"discount_amount": 0.00,
|
| 736 |
+
"original_total": 2600.00,
|
| 737 |
+
"line_items": [
|
| 738 |
+
{"description": "Cloud Hosting (Monthly)", "quantity": 1, "unit_price": 450.00, "amount": 450.00},
|
| 739 |
+
{"description": "API Integration Setup", "quantity": 1, "unit_price": 1200.00, "amount": 1200.00},
|
| 740 |
+
{"description": "Technical Support (actual)", "quantity": 12, "unit_price": 95.00, "amount": 1140.00},
|
| 741 |
+
{"description": "Rush Processing Fee", "quantity": 1, "unit_price": 150.00, "amount": 150.00},
|
| 742 |
+
],
|
| 743 |
+
"discrepancy_notes": "Invoice exceeds PO by $340 (13.1%). 2 extra support hours ($190) and rush processing fee ($150) added. PO authorized $2,600 but actual pre-tax is $2,940.",
|
| 744 |
+
},
|
| 745 |
+
},
|
| 746 |
+
{
|
| 747 |
+
"id": "adversarial_003",
|
| 748 |
+
"text": """CONSOLIDATED STATEMENT
|
| 749 |
+
|
| 750 |
+
Account: Riverside Manufacturing
|
| 751 |
+
Statement Period: February 2024
|
| 752 |
+
Prepared by: Global Supplies Inc., Accounts Receivable
|
| 753 |
+
|
| 754 |
+
=== TRANSACTION 1: ORIGINAL INVOICE ===
|
| 755 |
+
Invoice: GS-2024-0147
|
| 756 |
+
Date: February 20, 2024
|
| 757 |
+
PO: PO-RM-2024-033
|
| 758 |
+
|
| 759 |
+
Steel Bolts (Box/100) 50 @ $12.50 = $625.00
|
| 760 |
+
Copper Wire (500ft Roll) 10 @ $85.00 = $850.00
|
| 761 |
+
Safety Goggles (Pack/10) 20 @ $35.00 = $700.00
|
| 762 |
+
Welding Rods (Bundle) 15 @ $22.00 = $330.00
|
| 763 |
+
|
| 764 |
+
Invoice Subtotal: $2,505.00
|
| 765 |
+
Tax (7%): $175.35
|
| 766 |
+
Invoice Total: $2,680.35
|
| 767 |
+
|
| 768 |
+
=== TRANSACTION 2: ADJUSTMENT ===
|
| 769 |
+
Credit Memo: CM-2024-0201
|
| 770 |
+
Date: February 25, 2024
|
| 771 |
+
Reference: GS-2024-0147
|
| 772 |
+
|
| 773 |
+
Issue: Copper Wire — only 8 of 10 rolls delivered.
|
| 774 |
+
2 rolls backordered (BO-2024-0089).
|
| 775 |
+
Credit for undelivered: 2 x $85.00 = $170.00
|
| 776 |
+
Tax adjustment: -$11.90
|
| 777 |
+
Total Credit: -$181.90
|
| 778 |
+
|
| 779 |
+
=== TRANSACTION 3: PRICE CORRECTION ===
|
| 780 |
+
Debit Memo: DM-2024-0055
|
| 781 |
+
Date: February 27, 2024
|
| 782 |
+
|
| 783 |
+
Steel Bolts price was quoted at $12.50 but contract
|
| 784 |
+
rate is $13.00. Underbilled on 50 boxes.
|
| 785 |
+
Price difference: 50 x $0.50 = $25.00
|
| 786 |
+
Tax on adjustment: $1.75
|
| 787 |
+
Total Debit: $26.75
|
| 788 |
+
|
| 789 |
+
=== ACCOUNT SUMMARY ===
|
| 790 |
+
Original Invoice: $2,680.35
|
| 791 |
+
Credit (undelivered wire): -$181.90
|
| 792 |
+
Debit (price correction): +$26.75
|
| 793 |
+
================================
|
| 794 |
+
Net Amount Due: $2,525.20
|
| 795 |
+
|
| 796 |
+
Payment due by: March 20, 2024
|
| 797 |
+
""",
|
| 798 |
+
"ground_truth": {
|
| 799 |
+
"invoice_number": "GS-2024-0147",
|
| 800 |
+
"date": "2024-02-20",
|
| 801 |
+
"vendor_name": "Global Supplies Inc.",
|
| 802 |
+
"customer_name": "Riverside Manufacturing",
|
| 803 |
+
"subtotal": 2505.00,
|
| 804 |
+
"tax": 175.35,
|
| 805 |
+
"total": 2680.35,
|
| 806 |
+
"po_number": "PO-RM-2024-033",
|
| 807 |
+
"discount_amount": 0.00,
|
| 808 |
+
"original_total": 2680.35,
|
| 809 |
+
"line_items": [
|
| 810 |
+
{"description": "Steel Bolts (Box/100)", "quantity": 50, "unit_price": 12.50, "amount": 625.00},
|
| 811 |
+
{"description": "Copper Wire (500ft Roll)", "quantity": 10, "unit_price": 85.00, "amount": 850.00},
|
| 812 |
+
{"description": "Safety Goggles (Pack/10)", "quantity": 20, "unit_price": 35.00, "amount": 700.00},
|
| 813 |
+
{"description": "Welding Rods (Bundle)", "quantity": 15, "unit_price": 22.00, "amount": 330.00},
|
| 814 |
+
],
|
| 815 |
+
"discrepancy_notes": "Credit memo CM-2024-0201 for 2 undelivered Copper Wire rolls (-$181.90). Debit memo DM-2024-0055 for Steel Bolts price correction (+$26.75). Net adjustment: -$155.15. Final amount due: $2,525.20.",
|
| 816 |
+
},
|
| 817 |
+
},
|
| 818 |
+
],
|
| 819 |
}
|
| 820 |
|
| 821 |
|
|
|
|
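The `corrupted_scan` samples above replace digits inside amounts with look-alike letters (`O` for `0`, `l` for `1`, `S` for `5`). The following is a minimal sketch of how an agent might undo those swaps on a token it already knows to be an amount. `OCR_DIGIT_FIXES` and `parse_ocr_amount` are hypothetical names, not part of this commit, and the mapping covers only the substitutions visible in these samples.

```python
# Hypothetical helper, not part of the repository: repairs the digit
# look-alikes seen in the corrupted_scan samples (O -> 0, l/I -> 1, S -> 5).
OCR_DIGIT_FIXES = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5"}
)


def parse_ocr_amount(token: str) -> float:
    """Convert an OCR-damaged currency token such as '$l,2OO.OO' to a float."""
    repaired = token.translate(OCR_DIGIT_FIXES)
    # Drop the currency symbol, thousands separators, and stray spaces.
    cleaned = repaired.replace("$", "").replace(",", "").strip()
    return float(cleaned)


if __name__ == "__main__":
    for raw in ("$l,2OO.OO", "$2,98S.3O", "$l9S.3O"):
        print(raw, "->", parse_ocr_amount(raw))
```

The translation should only be applied to tokens in numeric columns; running it over names such as `DataFIow lnc.` would turn letters into digits.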
| 834 |
"subtotal", "tax", "total", "line_items",
|
| 835 |
"po_number", "adjustment_reason", "adjusted_total",
|
| 836 |
],
|
| 837 |
+
"corrupted_scan": [
|
| 838 |
+
"invoice_number", "date", "vendor_name", "customer_name",
|
| 839 |
+
"subtotal", "tax", "total", "line_items",
|
| 840 |
+
],
|
| 841 |
+
"adversarial_invoice": [
|
| 842 |
+
"invoice_number", "date", "vendor_name", "customer_name",
|
| 843 |
+
"subtotal", "tax", "total", "line_items",
|
| 844 |
+
"po_number", "discount_amount", "original_total",
|
| 845 |
+
"discrepancy_notes",
|
| 846 |
+
],
|
| 847 |
}
|
| 848 |
|
| 849 |
|
|
|
|
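Because `adversarial_invoice` requires twelve fields, it can help to confirm that each document's ground truth defines all of them. The sketch below is an assumption-laden check, not code from this commit: `missing_required` is a hypothetical helper, and the key list for `adversarial_001` is copied from the ground truth as it appears in this diff, where no `po_number` key is listed.

```python
from typing import Dict, List

# Required fields for adversarial_invoice, as listed in this diff.
ADVERSARIAL_REQUIRED: List[str] = [
    "invoice_number", "date", "vendor_name", "customer_name",
    "subtotal", "tax", "total", "line_items",
    "po_number", "discount_amount", "original_total",
    "discrepancy_notes",
]

# Keys present in adversarial_001's ground truth as shown in this diff
# (values replaced by None because only the key set matters here).
ADVERSARIAL_001_TRUTH: Dict[str, object] = dict.fromkeys([
    "invoice_number", "date", "vendor_name", "customer_name",
    "subtotal", "tax", "total", "discount_amount", "original_total",
    "line_items", "discrepancy_notes",
])


def missing_required(truth: Dict[str, object], required: List[str]) -> List[str]:
    """Hypothetical helper: required fields that the ground truth never defines."""
    return [field for field in required if field not in truth]


if __name__ == "__main__":
    print(missing_required(ADVERSARIAL_001_TRUTH, ADVERSARIAL_REQUIRED))
```

A field reported here would reach the grader with an expected value of `None`, since `grade_extraction` reads it with `ground_truth.get(field)`.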
| 851 |
"""Get a document and its metadata for a given task.
|
| 852 |
|
| 853 |
Args:
|
| 854 |
+
task_name: One of 'simple_invoice', 'messy_invoice', 'multi_document',
|
| 855 |
+
'corrupted_scan', 'adversarial_invoice'
|
| 856 |
doc_index: Index into the document pool (will wrap around)
|
| 857 |
|
| 858 |
Returns:
|
|
|
|
| 866 |
"ground_truth": doc["ground_truth"],
|
| 867 |
"required_fields": TASK_REQUIRED_FIELDS.get(task_name, TASK_REQUIRED_FIELDS["simple_invoice"]),
|
| 868 |
}
|
| 869 |
+
|
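The docstring above says `doc_index` wraps around the document pool, and the return value falls back to the `simple_invoice` field list for unknown tasks. The sketch below illustrates both behaviours on toy data. `TOY_DOCUMENTS`, `TOY_REQUIRED_FIELDS`, and `pick_document` are hypothetical stand-ins, and the modulo lookup is an assumption about how the wrap-around is done, since the function body is not shown in this diff.

```python
from typing import Any, Dict, List

# Toy stand-ins for the real document pool and field table.
TOY_DOCUMENTS: Dict[str, List[Dict[str, Any]]] = {
    "corrupted_scan": [
        {"id": "corrupt_001"},
        {"id": "corrupt_002"},
        {"id": "corrupt_003"},
    ],
}
TOY_REQUIRED_FIELDS: Dict[str, List[str]] = {
    "simple_invoice": ["invoice_number", "total"],
}


def pick_document(task_name: str, doc_index: int) -> Dict[str, Any]:
    """Select a document, wrapping the index around the pool size."""
    pool = TOY_DOCUMENTS[task_name]
    doc = pool[doc_index % len(pool)]  # assumed wrap-around via modulo
    return {
        "id": doc["id"],
        # Unknown tasks fall back to the simple_invoice field list.
        "required_fields": TOY_REQUIRED_FIELDS.get(
            task_name, TOY_REQUIRED_FIELDS["simple_invoice"]
        ),
    }


if __name__ == "__main__":
    print(pick_document("corrupted_scan", 4))
```

With three documents in the pool, index 4 selects the second one, so repeated resets can cycle through a small pool without raising `IndexError`.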
server/environment.py
CHANGED
|
@@ -20,6 +20,8 @@ MAX_ATTEMPTS = {
|
|
| 20 |
"simple_invoice": 3,
|
| 21 |
"messy_invoice": 3,
|
| 22 |
"multi_document": 5,
|
| 23 |
}
|
| 24 |
|
| 25 |
VALID_TASKS = list(TASK_REQUIRED_FIELDS.keys())
|
|
@@ -174,16 +176,19 @@ class InvoiceExtractionEnvironment:
|
|
| 174 |
"""Return the list of required fields with descriptions."""
|
| 175 |
field_descriptions = {
|
| 176 |
"invoice_number": "The invoice/document number (string)",
|
| 177 |
-
"date": "Invoice date in YYYY-MM-DD format",
|
| 178 |
"vendor_name": "Name of the vendor/seller/supplier",
|
| 179 |
"customer_name": "Name of the customer/buyer/bill-to party",
|
| 180 |
-
"subtotal": "Subtotal before tax (number)",
|
| 181 |
"tax": "Tax amount (number)",
|
| 182 |
"total": "Total amount due (number)",
|
| 183 |
"line_items": "Array of items: [{description, quantity, unit_price, amount}]",
|
| 184 |
"po_number": "Purchase order reference number (string)",
|
| 185 |
"adjustment_reason": "Reason for any adjustments/credits (string)",
|
| 186 |
"adjusted_total": "Final adjusted total after credits/payments (number)",
|
| 187 |
}
|
| 188 |
|
| 189 |
lines = ["Required fields to extract:\n"]
|
|
@@ -243,10 +248,43 @@ class InvoiceExtractionEnvironment:
|
|
| 243 |
|
| 244 |
# Grade the extraction
|
| 245 |
self._state.attempts_used += 1
|
| 246 |
-
|
| 247 |
extracted, self._ground_truth, self._required_fields
|
| 248 |
)
|
| 249 |
|
| 250 |
# Track best score
|
| 251 |
self._state.best_score = max(self._state.best_score, score)
|
| 252 |
self._last_feedback = feedback
|
|
@@ -259,12 +297,15 @@ class InvoiceExtractionEnvironment:
|
|
| 259 |
matched = sum(1 for f in feedback.values() if f.get("matched", False))
|
| 260 |
total = len(feedback)
|
| 261 |
feedback_text = (
|
| 262 |
-
f"Extraction scored: {score:.
|
| 263 |
f"Fields matched: {matched}/{total}\n"
|
| 264 |
-
f"Best score so far: {self._state.best_score:.
|
| 265 |
f"Attempts remaining: {attempts_remaining}\n"
|
| 266 |
)
|
| 267 |
|
| 268 |
if not done and score < 0.95:
|
| 269 |
weak_fields = [
|
| 270 |
name for name, data in feedback.items()
|
|
@@ -275,7 +316,7 @@ class InvoiceExtractionEnvironment:
|
|
| 275 |
feedback_text += "\nUse 'get_feedback' for detailed per-field scores."
|
| 276 |
|
| 277 |
if done:
|
| 278 |
-
feedback_text += f"\n\nEpisode complete. Final score: {self._state.best_score:.
|
| 279 |
|
| 280 |
return InvoiceObservation(
|
| 281 |
done=done,
|
|
@@ -287,6 +328,9 @@ class InvoiceExtractionEnvironment:
|
|
| 287 |
required_fields=self._required_fields,
|
| 288 |
metadata={
|
| 289 |
"score": score,
|
|
| 290 |
"best_score": self._state.best_score,
|
| 291 |
"field_scores": {k: v["score"] for k, v in feedback.items()},
|
| 292 |
},
|
|
@@ -334,3 +378,19 @@ class InvoiceExtractionEnvironment:
|
|
| 334 |
def close(self) -> None:
|
| 335 |
"""Clean up resources."""
|
| 336 |
self._initialized = False
|
| 20 |
"simple_invoice": 3,
|
| 21 |
"messy_invoice": 3,
|
| 22 |
"multi_document": 5,
|
| 23 |
+
"corrupted_scan": 4,
|
| 24 |
+
"adversarial_invoice": 6,
|
| 25 |
}
|
| 26 |
|
| 27 |
VALID_TASKS = list(TASK_REQUIRED_FIELDS.keys())
|
|
|
|
| 176 |
"""Return the list of required fields with descriptions."""
|
| 177 |
field_descriptions = {
|
| 178 |
"invoice_number": "The invoice/document number (string)",
|
| 179 |
+
"date": "Invoice date in YYYY-MM-DD format (use reissue date if applicable)",
|
| 180 |
"vendor_name": "Name of the vendor/seller/supplier",
|
| 181 |
"customer_name": "Name of the customer/buyer/bill-to party",
|
| 182 |
+
"subtotal": "Subtotal before tax, after discounts (number)",
|
| 183 |
"tax": "Tax amount (number)",
|
| 184 |
"total": "Total amount due (number)",
|
| 185 |
"line_items": "Array of items: [{description, quantity, unit_price, amount}]",
|
| 186 |
"po_number": "Purchase order reference number (string)",
|
| 187 |
"adjustment_reason": "Reason for any adjustments/credits (string)",
|
| 188 |
"adjusted_total": "Final adjusted total after credits/payments (number)",
|
| 189 |
+
"discount_amount": "Monetary discount value applied (number, 0 if none)",
|
| 190 |
+
"original_total": "What the total would have been without adjustments (number)",
|
| 191 |
+
"discrepancy_notes": "Free-text description of all discrepancies, adjustments, and anomalies found",
|
| 192 |
}
|
| 193 |
|
| 194 |
lines = ["Required fields to extract:\n"]
|
|
|
|
| 248 |
|
| 249 |
# Grade the extraction
|
| 250 |
self._state.attempts_used += 1
|
| 251 |
+
base_score, feedback = grade_extraction(
|
| 252 |
extracted, self._ground_truth, self._required_fields
|
| 253 |
)
|
| 254 |
|
| 255 |
+
# === REWARD SHAPING BONUSES ===
|
| 256 |
+
bonus = 0.0
|
| 257 |
+
bonus_details = []
|
| 258 |
+
|
| 259 |
+
# 1. Mathematical consistency bonus: subtotal + tax ≈ total
|
| 260 |
+
ext_sub = _safe_float(extracted.get("subtotal"))
|
| 261 |
+
ext_tax = _safe_float(extracted.get("tax"))
|
| 262 |
+
ext_total = _safe_float(extracted.get("total"))
|
| 263 |
+
if ext_sub is not None and ext_tax is not None and ext_total is not None:
|
| 264 |
+
computed = round(ext_sub + ext_tax, 2)
|
| 265 |
+
if abs(computed - ext_total) < 0.02:
|
| 266 |
+
bonus += 0.03
|
| 267 |
+
bonus_details.append("consistency_check: +0.03")
|
| 268 |
+
|
| 269 |
+
# 2. Improvement tracking: reward beating the previous best score (bonus capped at +0.02)
|
| 270 |
+
prev_score = self._state.best_score
|
| 271 |
+
if self._state.attempts_used > 1 and base_score > prev_score:
|
| 272 |
+
improvement = min(base_score - prev_score, 0.02)
|
| 273 |
+
bonus += improvement
|
| 274 |
+
bonus_details.append(f"improvement: +{improvement:.3f}")
|
| 275 |
+
|
| 276 |
+
# 3. Step efficiency signal: fewer steps = small bonus
|
| 277 |
+
steps_used = self._state.step_count
|
| 278 |
+
if steps_used <= 3:
|
| 279 |
+
bonus += 0.02 # Very efficient
|
| 280 |
+
bonus_details.append("efficiency: +0.02")
|
| 281 |
+
elif steps_used <= 5:
|
| 282 |
+
bonus += 0.01 # Moderately efficient
|
| 283 |
+
bonus_details.append("efficiency: +0.01")
|
| 284 |
+
|
| 285 |
+
# Apply bonus, clamping the shaped score to [0.01, 0.99] so it stays strictly inside (0, 1)
|
| 286 |
+
score = round(max(0.01, min(0.99, base_score + bonus)), 4)
|
| 287 |
+
|
| 288 |
# Track best score
|
| 289 |
self._state.best_score = max(self._state.best_score, score)
|
| 290 |
self._last_feedback = feedback
|
|
|
|
| 297 |
matched = sum(1 for f in feedback.values() if f.get("matched", False))
|
| 298 |
total = len(feedback)
|
| 299 |
feedback_text = (
|
| 300 |
+
f"Extraction scored: {score:.4f} (base: {base_score:.4f}, bonus: {bonus:.3f})\n"
|
| 301 |
f"Fields matched: {matched}/{total}\n"
|
| 302 |
+
f"Best score so far: {self._state.best_score:.4f}\n"
|
| 303 |
f"Attempts remaining: {attempts_remaining}\n"
|
| 304 |
)
|
| 305 |
|
| 306 |
+
if bonus_details:
|
| 307 |
+
feedback_text += f"Reward bonuses: {', '.join(bonus_details)}\n"
|
| 308 |
+
|
| 309 |
if not done and score < 0.95:
|
| 310 |
weak_fields = [
|
| 311 |
name for name, data in feedback.items()
|
|
|
|
| 316 |
feedback_text += "\nUse 'get_feedback' for detailed per-field scores."
|
| 317 |
|
| 318 |
if done:
|
| 319 |
+
feedback_text += f"\n\nEpisode complete. Final score: {self._state.best_score:.4f}"
|
| 320 |
|
| 321 |
return InvoiceObservation(
|
| 322 |
done=done,
|
|
|
|
| 328 |
required_fields=self._required_fields,
|
| 329 |
metadata={
|
| 330 |
"score": score,
|
| 331 |
+
"base_score": base_score,
|
| 332 |
+
"bonus": bonus,
|
| 333 |
+
"bonus_details": bonus_details,
|
| 334 |
"best_score": self._state.best_score,
|
| 335 |
"field_scores": {k: v["score"] for k, v in feedback.items()},
|
| 336 |
},
|
|
|
|
| 378 |
def close(self) -> None:
|
| 379 |
"""Clean up resources."""
|
| 380 |
self._initialized = False
|
| 381 |
+
|
| 382 |
+
|
| 383 |
+
def _safe_float(value) -> "float | None":
|
| 384 |
+
"""Safely convert a value to float, returning None on failure."""
|
| 385 |
+
if value is None:
|
| 386 |
+
return None
|
| 387 |
+
if isinstance(value, (int, float)):
|
| 388 |
+
return float(value)
|
| 389 |
+
if isinstance(value, str):
|
| 390 |
+
import re
|
| 391 |
+
cleaned = re.sub(r"[$ ,]", "", value.strip())
|
| 392 |
+
try:
|
| 393 |
+
return float(cleaned)
|
| 394 |
+
except (ValueError, TypeError):
|
| 395 |
+
return None
|
| 396 |
+
return None
|
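The three shaping bonuses added in this file can be read as one pure function. The sketch below restates them outside the environment class so the arithmetic is easy to test. `shape_reward` and its parameter names are hypothetical, and it takes plain numbers instead of reading `self._state`, so it is an illustration of the logic shown in the diff, not the code the server runs.

```python
from typing import List, Optional, Tuple


def shape_reward(
    base_score: float,
    subtotal: Optional[float],
    tax: Optional[float],
    total: Optional[float],
    prev_best: float,
    attempts_used: int,
    step_count: int,
) -> Tuple[float, float, List[str]]:
    """Return (shaped_score, bonus, bonus_details) for one extraction attempt."""
    bonus = 0.0
    details: List[str] = []

    # 1. Consistency: subtotal + tax must land within 2 cents of total.
    if subtotal is not None and tax is not None and total is not None:
        if abs(round(subtotal + tax, 2) - total) < 0.02:
            bonus += 0.03
            details.append("consistency_check: +0.03")

    # 2. Improvement: only after the first attempt, capped at +0.02.
    if attempts_used > 1 and base_score > prev_best:
        improvement = min(base_score - prev_best, 0.02)
        bonus += improvement
        details.append(f"improvement: +{improvement:.3f}")

    # 3. Efficiency: small bonus for finishing in few steps.
    if step_count <= 3:
        bonus += 0.02
        details.append("efficiency: +0.02")
    elif step_count <= 5:
        bonus += 0.01
        details.append("efficiency: +0.01")

    # Clamp so the shaped score stays inside [0.01, 0.99].
    score = round(max(0.01, min(0.99, base_score + bonus)), 4)
    return score, bonus, details


if __name__ == "__main__":
    print(shape_reward(0.80, 1020.00, 81.60, 1101.60, 0.0, 1, 2))
```

Since the clamp caps the result at 0.99, a near-perfect base score gains little from the bonuses; they mainly separate mid-range attempts.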
server/graders.py
CHANGED
|
@@ -236,9 +236,12 @@ def grade_extraction(
|
|
| 236 |
field_scores = {}
|
| 237 |
feedback = {}
|
| 238 |
|
| 239 |
-
numeric_fields = {"total", "subtotal", "tax", "adjusted_total"
|
|
|
|
| 240 |
date_fields = {"date", "due_date"}
|
| 241 |
list_fields = {"line_items"}
|
| 242 |
|
| 243 |
for field in required_fields:
|
| 244 |
expected = ground_truth.get(field)
|
|
@@ -250,6 +253,9 @@ def grade_extraction(
|
|
| 250 |
score = grade_numeric(actual, expected)
|
| 251 |
elif field in date_fields:
|
| 252 |
score = grade_date(actual, expected)
|
| 253 |
else:
|
| 254 |
score = grade_text(actual, expected)
|
| 255 |
|
|
@@ -258,8 +264,9 @@ def grade_extraction(
|
|
| 258 |
"score": score,
|
| 259 |
"expected_type": "list" if field in list_fields else
|
| 260 |
"number" if field in numeric_fields else
|
| 261 |
-
"date" if field in date_fields else
|
| 262 |
-
|
|
|
|
| 263 |
}
|
| 264 |
|
| 265 |
# Overall score = weighted average
|
|
|
|
| 236 |
field_scores = {}
|
| 237 |
feedback = {}
|
| 238 |
|
| 239 |
+
numeric_fields = {"total", "subtotal", "tax", "adjusted_total",
|
| 240 |
+
"discount_amount", "original_total"}
|
| 241 |
date_fields = {"date", "due_date"}
|
| 242 |
list_fields = {"line_items"}
|
| 243 |
+
# Free-text reasoning fields — graded with lower threshold
|
| 244 |
+
reasoning_fields = {"discrepancy_notes", "adjustment_reason"}
|
| 245 |
|
| 246 |
for field in required_fields:
|
| 247 |
expected = ground_truth.get(field)
|
|
|
|
| 253 |
score = grade_numeric(actual, expected)
|
| 254 |
elif field in date_fields:
|
| 255 |
score = grade_date(actual, expected)
|
| 256 |
+
elif field in reasoning_fields:
|
| 257 |
+
# Free-text reasoning: same fuzzy text grader as other text fields; the leniency comes from the lower "matched" threshold (0.5) below
|
| 258 |
+
score = grade_text(actual, expected)
|
| 259 |
else:
|
| 260 |
score = grade_text(actual, expected)
|
| 261 |
|
|
|
|
| 264 |
"score": score,
|
| 265 |
"expected_type": "list" if field in list_fields else
|
| 266 |
"number" if field in numeric_fields else
|
| 267 |
+
"date" if field in date_fields else
|
| 268 |
+
"reasoning" if field in reasoning_fields else "text",
|
| 269 |
+
"matched": score >= 0.5 if field in reasoning_fields else score >= 0.8,
|
| 270 |
}
|
| 271 |
|
| 272 |
# Overall score = weighted average
|
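In the grader change above, reasoning fields are scored by the same `grade_text` call as other text fields; what differs is the threshold at which a field counts as matched. The sketch below isolates that rule. `is_matched` and `expected_type` are hypothetical helper names, with the field sets and thresholds copied from the diff.

```python
# Field sets as defined in the updated grade_extraction.
NUMERIC_FIELDS = {"total", "subtotal", "tax", "adjusted_total",
                  "discount_amount", "original_total"}
DATE_FIELDS = {"date", "due_date"}
LIST_FIELDS = {"line_items"}
REASONING_FIELDS = {"discrepancy_notes", "adjustment_reason"}


def expected_type(field: str) -> str:
    """Label reported in the per-field feedback."""
    if field in LIST_FIELDS:
        return "list"
    if field in NUMERIC_FIELDS:
        return "number"
    if field in DATE_FIELDS:
        return "date"
    if field in REASONING_FIELDS:
        return "reasoning"
    return "text"


def is_matched(field: str, score: float) -> bool:
    """Reasoning fields match at 0.5; every other field needs 0.8."""
    threshold = 0.5 if field in REASONING_FIELDS else 0.8
    return score >= threshold


if __name__ == "__main__":
    print(expected_type("discrepancy_notes"), is_matched("discrepancy_notes", 0.6))
    print(expected_type("vendor_name"), is_matched("vendor_name", 0.6))
```

A free-text note that scores 0.6 therefore counts toward "Fields matched", while a vendor name with the same score does not.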