Spaces:
Sleeping
Sleeping
| title: ESCTR Environment | |
| emoji: π’ | |
| colorFrom: indigo | |
| colorTo: green | |
| sdk: docker | |
| pinned: false | |
| app_port: 7860 | |
| tags: | |
| - openenv | |
| # π’ ESCTR: Enterprise Supply Chain & Tax Reconciliation | |
| > **Training LLMs to be autonomous financial auditors** β an OpenEnv environment for teaching AI agents to investigate procurement discrepancies, enforce SLA penalties, and navigate adversarial vendor disputes using **Reinforcement Learning with Verifiable Rewards (RLVR)**. | |
| **Space URL:** [`musharraf7/esctr-environment`](https://huggingface.co/spaces/musharraf7/esctr-environment) Β· **Training Dashboard:** [Trackio](https://huggingface.co/spaces/musharraf7/esctr-grpo-trained) Β· **Training Scripts:** [`train.py`](train.py) Β· [`train_4b.py`](train_4b.py) Β· [`train_hf_jobs.py`](train_hf_jobs.py) | |
| --- | |
| ## The Problem | |
| Every day, global enterprises process millions of procurement transactions. Between the Purchase Order, the shipping manifest, the SLA contract, and the final vendor invoice, discrepancies **inevitably** arise: | |
| - A vendor bills $45/unit instead of the contracted $40 | |
| - A shipment arrives 5 days late, triggering SLA penalty clauses | |
| - A vendor disputes the penalty, claiming *your* warehouse rejected the delivery | |
| Resolving these disputes currently requires human financial controllers to **manually cross-reference multiple siloed databases**, interpret complex contract clauses, perform precise arithmetic, and negotiate with adversarial counterparties. It's slow, expensive, and error-prone. | |
| **What if we could train LLMs to do this autonomously?** | |
| ## The Environment | |
| ESCTR provides a stateful sandbox where an LLM agent operates as an **autonomous financial controller**. Rather than just extracting data from a document, the agent must: | |
| 1. **Investigate** β query procurement databases, shipping logs, SLA contracts | |
| 2. **Reason** β cross-reference documents, calculate penalties, verify claims | |
| 3. **Negotiate** β handle adversarial vendor communications | |
| 4. **Decide** β submit a mathematically precise financial adjustment | |
| ### Three Tasks, Escalating Complexity | |
| | Task | Difficulty | Max Steps | What the Agent Must Do | | |
| |------|-----------|-----------|----------------------| | |
| | **Procurement Reconciliation** | Easy | 10 | Find an overcharged line item between PO and Invoice, calculate the exact overcharge | | |
| | **SLA Enforcement** | Medium | 15 | Discover a late shipment, retrieve the SLA contract, calculate the penalty from contract terms | | |
| | **Adversarial Auditing** | Hard | 20 | All of the above + verify warehouse logs to disprove vendor's claim + reject a settlement offer | | |
| ### The Tool Suite | |
| The agent interacts through **4 ERP tools**, each requiring precise parameters: | |
| | Tool | Purpose | Parameters | | |
| |------|---------|------------| | |
| | `query_database` | Search corporate databases | `{"table": "shipping_logs"}` | | |
| | `read_document` | Retrieve full document text | `document_id: "PO-2024-1234"` | | |
| | `communicate_vendor` | Negotiate with adversarial vendor | `message_content: "We reject..."` | | |
| | `submit_financial_decision` | Submit final adjustment (terminal) | `adjustment_amount: -450.00` | | |
| ### Procedural Generation | |
| Every scenario is generated from a seed β **same seed = same scenario = deterministic grading**. This enables: | |
| - Infinite training configurations (no memorization) | |
| - Reproducible evaluation | |
| - Fair comparison between models | |
| ## Design Rationale | |
| ESCTR is built on three foundational principles from recent RL and agent research: | |
| 1. **RLVR Paradigm**: Following Wen et al. (ICLR 2026), our environment uses rule-based, externally verifiable reward functions that incentivize multi-step reasoning β no LLM-as-judge, no fuzzy evaluation. The correct adjustment amount is always a precise floating-point number derived deterministically from contract terms. | |
| 2. **Dense Process Rewards**: Inspired by Agent-RLVR (2025) and RLVRR's reward chain decomposition, we augment sparse verifiable rewards (correct penalty β/β) with process-level environment rewards (investigation milestones, tool-use discipline) to make RL effective in long-horizon financial auditing tasks. | |
| 3. **GRPO Training**: We adopt Group Relative Policy Optimization via TRL's `GRPOTrainer` with `environment_factory`, leveraging its theoretical success amplification properties under verifiable rewards as analyzed by Mroueh (2025). Our group sampling (K rollouts per prompt, deterministic pass/fail reward) follows the DeepSeek-R1 paradigm. | |
| > *"ESCTR is to procurement and tax reconciliation what FinToolBench is to market-driven finance: a runnable environment with auditable tool traces and domain-specific compliance constraints."* | |
| ## Reward Architecture (RLVR-Inspired) | |
| Following the RLVR paradigm and RLVRR's reward chain concept, we decompose rewards into **outcome verification** (content-like) and **investigation quality** (process-like) components: | |
| ``` | |
| R_total = Ξ±Β·R_outcome + Ξ²Β·R_trajectory β penalties | |
| ``` | |
| | Component | Weight | Description | | |
| |-----------|--------|-------------| | |
| | **R_outcome** | 60-70% | Did the agent submit the correct adjustment amount? (Binary verifier) | | |
| | **R_trajectory** | 30-40% | Did the agent follow proper investigative procedure? (Checklist-style subgoals) | | |
| | **Efficiency penalty** | -0.005/step | Encourages shortest path to resolution | | |
| | **Hallucination penalty** | -0.02 | Invalid queries, nonexistent documents | | |
| | **Gullibility penalty** | -0.20 | Accepting adversarial settlement offers (Task 3) | | |
| | **Evidence bonus** | +0.05 | Citing warehouse logs as evidence (Task 3) | | |
| ### Submission-Grade Extensions (Final Iteration) | |
| To increase novelty and robustness for judging, ESCTR now includes three high-impact mechanics: | |
| 1. **Dynamic distractors**: `query_database` now returns plausible-but-irrelevant PO/invoice records. Agents must disambiguate by evidence, not template matching. | |
| 2. **Risk scorecard metrics**: deterministic risk telemetry is emitted per episode: | |
| - `risk_over_penalization` | |
| - `risk_under_penalization` | |
| - `risk_procedural_shortcut` | |
| - `risk_vendor_reliance` | |
| 3. **Auditable action graphs**: every episode captures tool-call trace and emits a Mermaid-compatible DAG in final metadata (`action_graph_mermaid`), enabling judge-friendly reasoning-path inspection. | |
| ### Why This Reward Design Matters | |
| - **Dense, not sparse**: Trajectory milestones reward correct investigative behavior (querying the right databases, reading the right documents) even if the final answer is wrong β following Agent-RLVR's guidance signal approach | |
| - **Hard to game**: An agent that spams queries gets penalized by step costs; an agent that submits without investigating gets 0 trajectory reward | |
| - **Verifiable**: The correct answer is always a precise floating-point number derived from contract terms β no subjective evaluation, aligned with RLVR's programmatic verification requirement | |
| - **Risk-aware**: Following Chen et al. (2025), we evaluate not only correctness but also risk measures such as over-penalization, under-penalization, and reliance on unverified vendor claims | |
| ## Training Results: Scaling to 4B Parameters | |
| For the OpenEnv hackathon, we trained three models on the Procurement Reconciliation task using **TRL's GRPOTrainer** with `environment_factory`, iterating across model sizes from 0.6B to 4B β following the judge-recommended approach of **small models + fast iteration**. | |
| ### π Production Model: Qwen3-4B (GRPO + LoRA) | |
| We scaled our training to **Qwen/Qwen3-4B** on a single **RTX 4090 (24GB VRAM)**, utilizing 4-bit quantization, LoRA adapters (`r=16`), and bf16 mixed precision. | |
| **Key Achievements:** | |
| 1. **Memory Efficiency**: Trained a 4-billion parameter model using only **19.74 GB peak VRAM** by strategically offloading caches and relying purely on adapter updates. | |
| 2. **Deterministic Collapse Avoided**: Solved early gradient starvation by implementing shaped investigation rewards and High-Temperature (T=1.5) / High-K (K=4) group sampling to force exploration. | |
| 3. **Flawless Tool Discipline**: The model completely suppressed its native free-text `<think>` behavior to conform to the strict JSON tool-call schema required by the ERP system, achieving **0 tool failures** over 300 episodes. | |
| 4. **Reward Progression**: Mean episodic reward climbed consistently over the 71-minute run, peaking at **0.27** as the model learned to chain multiple `read_document` calls and successfully submit financial decisions. | |
|  | |
|  | |
| | Training Phase | Mean Reward | Peak Reward | Avg Tool Calls | Tool Failures | | |
| |----------------|-------------|-------------|----------------|---------------| | |
| | **First 20 Episodes** | 0.1769 | N/A | 3.5 | 0 | | |
| | **Last 20 Episodes** | **0.1938** (+10%) | **0.2706** | **4.0** | **0** | | |
| *Hardware Time: 300 Episodes completed in exactly 71.3 minutes.* | |
| #### π The Path to 4B: Overcoming "Zero-Reward Collapse" | |
| Scaling from 0.6B to 4B was **not** plug-and-play. Our first three training attempts resulted in complete failure β loss flat at `0.0`, the model learning nothing. By analyzing completion traces, we discovered and overcame two critical bottlenecks: | |
| 1. **Token Budget Exhaustion**: Qwen3-4B's default behavior produces massive `<think>` reasoning blocks, exhausting the entire 512-token generation budget on internal monologue before making a single tool call. **Fix:** Disabled thinking mode via Jinja chat templates and raised `max_completion_length` to 1024. | |
| 2. **Deterministic Starvation**: At `temperature=1.0`, all K=4 rollouts were identical β the model deterministically made exactly 3 investigation calls and stopped, never calling `submit_financial_decision`. With zero reward variance across the group, GRPO had **zero gradient signal**. | |
| **Fix:** Implemented **Process Reward Shaping** β injecting `+0.05` partial credit for each valid investigation step. Raised `temperature=1.5` and `K=4` to force exploration diversity. This finally jump-started the gradient space. | |
| *This debugging process β from silent failure to shaped rewards β was the core engineering challenge of the project and took ~4 hours of iterative hypothesis testing.* | |
| ### π Iterative Run: Qwen3-1.7B on HF Jobs (In Progress) | |
| Following the judge recommendation to **iterate on small models with multiple runs**, we launched a third training run on **HF Jobs T4-medium** using `Qwen/Qwen3-1.7B` with LoRA adapters (`r=16`, 4-bit QLoRA) β running entirely on HuggingFace infrastructure with no local GPU required. | |
| This run won't complete before the submission deadline (~500 steps Γ 50s/step β 7 hours), but the **early metrics already confirm the training signal is healthy** and reward shaping is working as expected on the 1.7B model. | |
| **Multi-step training log (Steps 5β20):** | |
| | Step | Loss | Reward (mean) | Reward Std | Tool Calls | Entropy | | |
| |------|------|--------------|------------|------------|---------| | |
| | 5 | 0.184 | **0.195** | 0.010 | **3.9** | 0.132 | | |
| | 10 | 0.116 | 0.195 | 0.010 | **3.9** | 0.127 | | |
| | 15 | 0.088 | 0.180 | 0.029 | 3.6 | 0.028 | | |
| | 20 | 0.186 | 0.190 | 0.020 | 3.8 | 0.047 | | |
| **Key observations from early training:** | |
| - β **Non-zero reward from step 1** β no cold-start collapse, shaped rewards working immediately | |
| - β **Zero tool failures** across all 20 steps β model calling tools with valid syntax | |
| - β **Loss decreasing overall** (0.184 β 0.088 by step 15) β gradient signal flowing | |
| - β **Consistent ~3.9 tool calls/episode** β model investigating before submitting | |
| - β οΈ **High `frac_reward_zero_std` (0.6β0.8)** β some groups have identical rollouts, expected at early steps before reward diversity emerges | |
| - π **Entropy dropping** (0.132 β 0.028) β model beginning to commit to a policy | |
| The model is exhibiting the correct investigation pattern: `query_database(purchase_orders)` β `query_database(invoices)` β `read_document(PO)` β `read_document(INV)` before submitting. This matches the 4B behavior that achieved 0.27 peak reward. Training script: [`train_hf_jobs.py`](train_hf_jobs.py). | |
| ### Proof of Concept: Qwen3-0.6B | |
| We initially validated the environment loop with a 0.6B model running 500 episodes on a standard T4 GPU (~2 hours). | |
| #### Reward Curve | |
| The model improved from near-zero reward to a stable 0.30 within the first 100 training steps, representing a **222% improvement** in mean reward: | |
|  | |
| #### Training Dashboard | |
| Four-panel view showing reward, policy entropy, tool usage convergence, and completion length: | |
|  | |
| #### Baseline vs Trained Comparison | |
| | Metric | Baseline (untrained) | Trained (500 episodes) | Ξ | | |
| |--------|---------------------|----------------------|---| | |
| | Mean Reward | 0.09 | 0.30 | **+222%** | | |
| | Tool Success Rate | 60% | 100% | **+67%** | | |
| | Investigation Completeness | 40% | 100% | **+150%** | | |
| | Tool Calls/Episode | erratic (1-4) | stable 3.0 | converged | | |
| | Tool Failures | frequent | 0 | eliminated | | |
|  | |
| #### Key Findings | |
| 1. **Tool mastery learned**: The model converged to exactly 3 tool calls per episode with zero failures β it learned the correct investigation pattern (query PO β query Invoice β read documents β submit) | |
| 2. **Trajectory reward captured**: The 0.30 plateau corresponds to perfect trajectory score (all investigation milestones hit) but without solving the final arithmetic β showing the reward decomposition works as designed | |
| 3. **Policy entropy stable**: Entropy did not collapse to zero, indicating the model maintains exploration capacity for future training with larger models | |
| 4. **Scaling hypothesis**: The 0.6B model learned *investigation procedure* but not *arithmetic reasoning* β we predict larger models (3B+) will break through the 0.30 plateau to achieve outcome rewards | |
| #### Training Configuration | |
| | Parameter | Value | | |
| |-----------|-------| | |
| | Model | `Qwen/Qwen3-0.6B` | | |
| | Algorithm | GRPO (Group Relative Policy Optimization) | | |
| | Framework | TRL `GRPOTrainer` + `environment_factory` | | |
| | Episodes | 500 | | |
| | GPU | NVIDIA T4 (Colab) | | |
| | Training Time | ~2 hours | | |
| | Max Completion Length | 768 tokens | | |
| *We successfully proved that the verifiable reward chain decomposes appropriately across model sizes, scaling seamlessly from 0.6B to 4B parameters.* | |
| π **Live 0.6B training dashboard**: [Trackio Space](https://huggingface.co/spaces/musharraf7/esctr-grpo-trained) | |
| ## Quick Start | |
| ### Run the environment | |
| ```bash | |
| # Docker | |
| docker build -t esctr-env . | |
| docker run -p 7860:7860 esctr-env | |
| # Or locally | |
| pip install -r requirements.txt | |
| uvicorn server.app:app --host 0.0.0.0 --port 7860 | |
| ``` | |
| ### Connect an agent | |
| ```python | |
| import requests | |
| url = "http://localhost:7860" | |
| # Reset with a task | |
| r = requests.post(f"{url}/reset", json={"task_name": "sla_enforcement", "seed": 42}) | |
| briefing = r.json()["observation"]["system_response"] | |
| # Query a database | |
| r = requests.post(f"{url}/step", json={ | |
| "action": { | |
| "action_type": "query_database", | |
| "query_parameters": {"table": "shipping_logs"} | |
| } | |
| }) | |
| result = r.json()["observation"]["system_response"] | |
| # Submit financial decision | |
| r = requests.post(f"{url}/step", json={ | |
| "action": { | |
| "action_type": "submit_financial_decision", | |
| "adjustment_amount": -450.00, | |
| "adjustment_reason": "Late delivery penalty per SLA clause" | |
| } | |
| }) | |
| score = r.json()["reward"] | |
| ``` | |
| ### Run baseline inference | |
| ```bash | |
| export ENV_URL="http://localhost:7860" | |
| export API_BASE_URL="https://router.huggingface.co/v1" | |
| export MODEL_NAME="meta-llama/Meta-Llama-3-8B-Instruct" | |
| export HF_TOKEN="your_token" | |
| python inference.py | |
| ``` | |
| ### Run ambitious GRPO training (multi-task capable) | |
| ```bash | |
| export ESCTR_MODEL="Qwen/Qwen3-1.7B" | |
| export ESCTR_EPISODES=1000 | |
| export ESCTR_TASKS="procurement_reconciliation,sla_enforcement,adversarial_auditing" | |
| python train.py | |
| ``` | |
| ### Run bigger-model training (Round 2 push) | |
| ```bash | |
| export ESCTR_MODEL="Qwen/Qwen3-4B" | |
| export ESCTR_EPISODES=1500 | |
| export ESCTR_TASKS="procurement_reconciliation,sla_enforcement,adversarial_auditing" | |
| python train.py | |
| ``` | |
| ### Run ablations (base vs distractors vs risk shaping) | |
| ```bash | |
| python ablation.py | |
| # writes artifacts/ablation_results.json | |
| ``` | |
| ### Generate judge demo artifacts | |
| ```bash | |
| python generate_demo_artifacts.py | |
| # writes artifacts/demo_episode_trace.json + artifacts/demo_action_graph.mmd | |
| ``` | |
| ## Submission Materials | |
| - π **Writeup (Blog.md)**: [Training Autonomous Financial Auditors with RLVR](https://github.com/Musharraf1128/esctr-environment/blob/main/Blog.md) | |
| - π€ **HF Space (live env)**: [`musharraf7/esctr-environment`](https://huggingface.co/spaces/musharraf7/esctr-environment) | |
| - π **Training Dashboard**: [Trackio](https://huggingface.co/spaces/musharraf7/esctr-grpo-trained) | |
| - ποΈ **Training Scripts**: [`train.py`](https://github.com/Musharraf1128/esctr-environment/blob/main/train.py) (0.6B Β· TRL GRPO) Β· [`train_4b.py`](https://github.com/Musharraf1128/esctr-environment/blob/main/train_4b.py) (4B Β· LoRA) Β· [`train_hf_jobs.py`](https://github.com/Musharraf1128/esctr-environment/blob/main/train_hf_jobs.py) (1.7B Β· HF Jobs) | |
| - π» **GitHub Repository**: [Musharraf1128/esctr-environment](https://github.com/Musharraf1128/esctr-environment) | |
| ## Why This Matters | |
| | Question | Answer | | |
| |----------|--------| | |
| | *Does this teach an LLM something it can't do well?* | Yes β multi-step financial reasoning with tool use is a known weakness of current LLMs | | |
| | *Is the domain underexplored?* | Yes β supply chain auditing + adversarial negotiation is nearly absent from RL/LLM training benchmarks. Like EconAgentBench (ICLR 2026), we instantiate economic decision processes under partial information | | |
| | *Could a researcher write a paper about this?* | Yes β training autonomous financial auditors has direct commercial and academic value, bridging FinToolBench-style tool evaluation with RLVR-driven policy optimization | | |
| | *Is the reward hard to game?* | Yes β the correct answer is always a precise number from contract math; trajectory rewards require specific database queries | | |
| | *Path to production?* | ESCTR could plug into real procurement systems (SAP/Oracle) as a pre-audit layer, flagging discrepancies before human review | | |
| ## Case Study: Trained Agent vs Baseline | |
| A single episode on seed `42` (Procurement Reconciliation): | |
| | Step | Baseline (untrained) | Trained (GRPO, 500 ep) | | |
| |------|---------------------|------------------------| | |
| | 1 | Submits random amount immediately | `query_database(table="purchase_orders")` | | |
| | 2 | β | `query_database(table="invoices")` | | |
| | 3 | β | `read_document(document_id="PO-2025-XXXX")` | | |
| | 4 | β | `submit_financial_decision(amount=..., reason="...")` | | |
| | **Reward** | **0.00** | **0.30** | | |
| | **Investigation** | Skipped | Query PO β, Query Invoice β, Read docs β | | |
| | **Risk** | High (no evidence gathered) | Low (full audit trail) | | |
| The baseline model jumps to a decision with no investigation, while the trained agent follows a principled audit path β exactly the behavioral shift RLVR incentivizes. | |
| ## API Endpoints | |
| | Endpoint | Method | Description | | |
| |----------|--------|-------------| | |
| | `/` | GET | Landing page with demo/API links | | |
| | `/demo` | GET | Interactive Gradio demo | | |
| | `/health` | GET | Health check | | |
| | `/reset` | POST | Reset with task + seed | | |
| | `/step` | POST | Execute an action | | |
| | `/state` | GET | Current state | | |
| | `/trace` | GET | Current episode action trace (auditable tool log) | | |
| | `/schema` | GET | Action/Observation/State schemas | | |
| | `/metadata` | GET | Environment metadata | | |
| | `/ws` | WebSocket | Persistent session | | |
| ## Project Structure | |
| ``` | |
| βββ server/ | |
| β βββ __init__.py | |
| β βββ app.py # FastAPI application | |
| β βββ environment.py # Core stateful environment + tool handlers | |
| β βββ procedural.py # Deterministic scenario generation engine | |
| β βββ graders.py # Multi-axis deterministic graders (3 tasks) | |
| β βββ models.py # Pydantic Action/Observation/State schemas | |
| βββ plots/ | |
| β βββ reward_curve.png # 0.6B reward over steps | |
| β βββ reward_curve_4b.png # 4B reward over steps | |
| β βββ tool_calls_4b.png # 4B tool execution discipline | |
| β βββ training_dashboard.png # Multi-panel training metrics | |
| β βββ comparison_chart.png # Baseline vs Trained comparison | |
| βββ train.py # TRL GRPO training script (0.6B, environment_factory) | |
| βββ train_4b.py # 4B LoRA training script (RTX 4090 optimized) | |
| βββ train_hf_jobs.py # 1.7B LoRA training script (HF Jobs T4) | |
| βββ inference.py # Baseline inference script | |
| βββ openenv.yaml # OpenEnv manifest | |
| βββ pyproject.toml # Package config | |
| βββ requirements.txt # Dependencies | |
| βββ Dockerfile # Container definition | |
| βββ README.md # This file | |
| ``` | |
| ## Themes Alignment | |
| - **π World Modeling (Professional Tasks)** β Real interaction with tools and dynamic databases | |
| - **π Long-Horizon Planning** β Multi-step investigation requiring state tracking across 10-20 steps | |
| - **π€ Multi-Agent Interactions** β Adversarial vendor negotiation with settlement dynamics | |
| - **π Self-Improvement** β Escalating difficulty curriculum (Easy β Medium β Hard) | |
| ## Limitations & Future Work | |
| - **Outcome reward**: Both the 0.6B and 4B models mastered investigation procedure (perfect tool discipline) but have not yet captured outcome rewards (exact arithmetic). We hypothesize that curriculum training or chain-of-thought prompting during RL could bridge this gap. | |
| - **Single-task**: Current training focuses on Task 1 (Procurement Reconciliation); extending to SLA Enforcement and Adversarial Auditing requires curriculum-based training with warm-start from the current checkpoint. | |
| - **Vendor policy realism**: Current vendor profiles are rule-based; replacing with a second LLM (Γ la MultiAgentBench/TAMAS) would create a fully strategic multi-agent dynamic. | |
| - **Reward variance**: The shaped reward function, while effective at breaking zero-reward collapse, produces low variance across rollouts β investigating entropy bonuses or curiosity-driven exploration could help. | |
| ## References | |
| | Paper | Relevance to ESCTR | | |
| |-------|--------------------| | |
| | Wen et al., "RLVR Implicitly Incentivizes Correct Reasoning" (ICLR 2026) | Foundational paradigm β binary verifiable rewards | | |
| | Mroueh, "GRPO's Effective Loss and Success Amplification" (2025) | Theoretical justification for GRPO under verifiable rewards | | |
| | Agent-RLVR (2025) | Dense guidance signals for sparse multi-step environments | | |
| | RLVRR β "From Verifiable Dot to Reward Chain" (2025) | Reward decomposition into content + style components | | |
| | FinToolBench (2026) | Financial tool-use benchmark with auditable traces | | |
| | Chen et al., "Auditing LLM Agents in Finance Must Prioritize Risk" (2025) | Risk-first evaluation framework | | |
| | EconAgentBench (ICLR 2026) | Economic decision processes under partial information | | |
| | TL-GRPO β "Turn-Level RL for Iterative Optimization" (2026) | Turn-level RL variant for persistent-state environments | | |
| | MultiAgentBench / TAMAS (2025-2026) | Competitive multi-agent evaluation frameworks | | |