"""Light theme CSS, SVG diagrams, and HTML content for the ESCTR Gradio UI.""" INJECT_CSS = """""" HEADER_HTML = """

ESCTR

Enterprise Supply Chain & Tax Reconciliation

Blog · GitHub · Training Dashboard

""" ARCH_SVG = """
ESCTR Environment Agent (Qwen3 LLM) GRPO-trained action obs Tool Engine ▸ query_database ▸ read_document ▸ communicate_vendor ▸ submit_financial_decision (terminal action) Procedurally generated from seed — deterministic Reward Verifier R_outcome 60-70% R_trajectory 30-40% - penalties R ∈ (0.01, 0.99)
""" EPISODE_SVG = """
Typical Episode Flow ① query_database(POs) ② query_database(invoices) ③ read_document(PO-XXXX) ④ read_document(INV-XXXX) ⑤ submit_financial_decision Agent Reasoning ① Discover relevant PO IDs ② Discover invoice IDs ③ Cross-reference prices ④ Calculate discrepancy ⑤ Submit exact adjustment → Reward computed → R = f(accuracy, procedure, steps)
""" LEADERBOARD_HTML = """

Model Leaderboard

All models trained on the ESCTR environment using TRL's GRPOTrainer with environment_factory.

#ModelParamsMethodGPUPeak RewardTool CallsFailuresTime
1Qwen3-0.6B0.6BGRPOT40.304.00~2h
2Qwen3-4B (LoRA)4BGRPO + ShapedRTX 40900.274.0071m
3Qwen3-1.7B (LoRA)1.7BGRPO + ShapedT4 (HF)0.195*3.90~7h
Baseline (untrained)0.091-4frequent

* In-progress run on HF Jobs. Peak reward at step 20. Zero tool failures across all logged steps.

Key Findings

MetricUntrainedTrained (best)
Mean Reward0.090.30 (+233%)
Tool Success Rate60%100%
Investigation Completeness40%100%
Tool Calls / EpisodeErratic (1-4)Stable 4.0
""" PLOT_BASE = "https://raw.githubusercontent.com/Musharraf1128/esctr-environment/main/plots" BLOG_HTML = f"""
Training LLMs to investigate procurement fraud, enforce SLA penalties, and reject bad vendor settlements — autonomously.

The Problem

Every day, enterprises process millions of procurement transactions. Between Purchase Orders, shipping manifests, SLA contracts, and vendor invoices — discrepancies are inevitable. A vendor bills $45/unit instead of the contracted $40. A shipment arrives 5 days late, triggering penalty clauses. The vendor disputes the penalty.

Resolving these disputes means humans manually cross-referencing siloed databases, interpreting contract clauses, and performing precise arithmetic under pressure. Current LLMs can't solve this reliably — not because the individual steps are hard, but because the combination is: multi-step tool use, precise arithmetic, adversarial reasoning, and state tracking across 10-20 interaction steps.

This is the capability gap that Reinforcement Learning with Verifiable Rewards (RLVR) was designed to close.

The Environment

{ARCH_SVG}

ESCTR gives the agent three scenarios of increasing complexity:

TaskDifficultyWhat the Agent Must Do
Procurement Reconciliation🟢 EasyIdentify overcharged line items, calculate exact overcharge
SLA Enforcement🟡 MediumDiscover late shipments, retrieve SLA contract, compute penalty
Adversarial Auditing🔴 HardAll above + disprove vendor counter-claims using warehouse logs

Every scenario is procedurally generated from a seed — infinite training configurations with deterministic, reproducible grading. No memorization possible.

Reward Design

Rtotal = α · Routcome + β · Rtrajectory − penalties

Routcome (60-70%): Did the agent submit the exact correct adjustment? Rtrajectory (30-40%): Did the agent follow proper investigative procedure? Penalties: step costs (−0.005/step), hallucination (−0.02), accepting bad settlements (−0.20).

The correct answer is always a precise floating-point number derived from contract terms. No LLM-as-judge, no fuzzy rubric — pure programmatic verification.

Training Journey

Phase 1 — Proof of Concept (0.6B)

Validated the training loop with Qwen3-0.6B on a T4 GPU. Reward improved from 0.09 → 0.30 (+222%) in 500 episodes. The model learned the canonical investigation procedure with zero tool failures.

0.6B reward curve Training dashboard

Phase 2 — Scaling to 4B, and Hitting a Wall

Scaled to Qwen3-4B on an RTX 4090 with LoRA. First three attempts completely failed — loss flat at 0.0.

Problem 1: Token Budget Exhaustion. The model consumed its entire 512-token budget on <think> blocks before making a single tool call.

Problem 2: Deterministic Starvation. At temperature=1.0, all K=4 rollouts were identical. Zero reward variance → zero gradient signal.

Phase 2.5 — The Fix

1. Shaped Rewards — +0.05 partial credit per valid investigation step.
2. High Temperature — T=1.5 with K=4 rollouts forced exploration diversity.

Phase 3 — Success: 4B in 71 Minutes

4B reward curve 4B tool discipline

The tool graph tells the story: early chaos (2-4.25 calls/episode) collapses into rigid discipline — exactly 4.0 tool calls, the optimal investigate-investigate-investigate-submit pipeline.

Phase 4 — Iterating on 1.7B (HF Jobs)

Launched on HuggingFace's T4-medium. Early metrics confirm the shaped reward architecture transfers cleanly to a different model size with zero modifications.

StepLossRewardTool CallsEntropy
50.1840.1953.90.132
100.1160.1953.90.127
150.0880.1803.60.028
200.1860.1903.80.047

Technical Summary

Param0.6B4B1.7B
ModelQwen3-0.6BQwen3-4BQwen3-1.7B
GPUT4 (Colab)RTX 4090T4 (HF Jobs)
QuantNone4-bit QLoRA4-bit QLoRA
AdapterFullLoRA r=16LoRA r=16
Episodes500300500
Time~2h71m~7h
"""