"""Light theme CSS, SVG diagrams, and HTML content for the ESCTR Gradio UI.""" INJECT_CSS = """""" HEADER_HTML = """
""" ARCH_SVG = """""" EPISODE_SVG = """""" LEADERBOARD_HTML = """All models trained on the ESCTR environment using TRL's GRPOTrainer with environment_factory.
| # | Model | Params | Method | GPU | Peak Reward | Tool Calls | Failures | Time |
|---|---|---|---|---|---|---|---|---|
| 1 | Qwen3-0.6B | 0.6B | GRPO | T4 | 0.30 | 4.0 | 0 | ~2h |
| 2 | Qwen3-4B (LoRA) | 4B | GRPO + Shaped | RTX 4090 | 0.27 | 4.0 | 0 | 71m |
| 3 | Qwen3-1.7B (LoRA) | 1.7B | GRPO + Shaped | T4 (HF) | 0.195* | 3.9 | 0 | ~7h |
| — | Baseline (untrained) | — | — | — | 0.09 | 1-4 | frequent | — |
* In-progress run on HF Jobs. Peak reward at step 20. Zero tool failures across all logged steps.
| Metric | Untrained | Trained (best) |
|---|---|---|
| Mean Reward | 0.09 | 0.30 (+233%) |
| Tool Success Rate | 60% | 100% |
| Investigation Completeness | 40% | 100% |
| Tool Calls / Episode | Erratic (1-4) | Stable 4.0 |
Training LLMs to investigate procurement fraud, enforce SLA penalties, and reject bad vendor settlements — autonomously.
Every day, enterprises process millions of procurement transactions. Between Purchase Orders, shipping manifests, SLA contracts, and vendor invoices — discrepancies are inevitable. A vendor bills $45/unit instead of the contracted $40. A shipment arrives 5 days late, triggering penalty clauses. The vendor disputes the penalty.
Resolving these disputes means humans manually cross-referencing siloed databases, interpreting contract clauses, and performing precise arithmetic under pressure. Current LLMs can't solve this reliably — not because the individual steps are hard, but because the combination is: multi-step tool use, precise arithmetic, adversarial reasoning, and state tracking across 10-20 interaction steps.
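The individual arithmetic steps really are simple in isolation. A minimal worked sketch of the two calculations described above, with an assumed quantity and an assumed penalty clause (1% of line value per late day; actual contract terms vary per scenario):

```python
# Hypothetical numbers mirroring the $45-vs-$40 overcharge and the
# 5-days-late shipment above. Quantity and penalty rate are assumptions.

CONTRACT_UNIT_PRICE = 40.00   # contracted price per unit
BILLED_UNIT_PRICE = 45.00     # what the vendor actually invoiced
QUANTITY = 120                # units on the PO line item (assumed)

overcharge = (BILLED_UNIT_PRICE - CONTRACT_UNIT_PRICE) * QUANTITY
print(overcharge)  # 600.0

# SLA penalty, assuming a clause of 1% of line value per late day.
DAYS_LATE = 5
PENALTY_RATE_PER_DAY = 0.01
line_value = CONTRACT_UNIT_PRICE * QUANTITY
penalty = line_value * PENALTY_RATE_PER_DAY * DAYS_LATE
print(penalty)  # 240.0
```

The hard part is not either calculation; it is chaining them across tool calls while a vendor disputes the result.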
This is the capability gap that Reinforcement Learning with Verifiable Rewards (RLVR) was designed to close.
ESCTR gives the agent three scenarios of increasing complexity:
| Task | Difficulty | What the Agent Must Do |
|---|---|---|
| Procurement Reconciliation | 🟢 Easy | Identify overcharged line items, calculate exact overcharge |
| SLA Enforcement | 🟡 Medium | Discover late shipments, retrieve SLA contract, compute penalty |
| Adversarial Auditing | 🔴 Hard | All above + disprove vendor counter-claims using warehouse logs |
Every scenario is procedurally generated from a seed — infinite training configurations with deterministic, reproducible grading. No memorization possible.
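Seed-deterministic generation might be sketched as below; the field names, price tables, and ranges are illustrative, not ESCTR's actual generator:

```python
import random

def generate_scenario(seed: int) -> dict:
    """Deterministically generate a toy procurement scenario from a seed.
    A sketch of the idea, not ESCTR's actual generator."""
    rng = random.Random(seed)
    contracted = rng.choice([38.0, 40.0, 42.0])
    markup = rng.choice([0.0, 2.5, 5.0])   # 0.0 means no discrepancy
    qty = rng.randint(50, 500)
    return {
        "contract_price": contracted,
        "billed_price": contracted + markup,
        "quantity": qty,
        # The ground-truth adjustment is known at generation time,
        # so grading is pure arithmetic; no judge model needed.
        "correct_adjustment": round(markup * qty, 2),
    }

# Same seed, identical scenario: grading stays reproducible.
assert generate_scenario(7) == generate_scenario(7)
```

Because the correct adjustment is derived at generation time, any number of fresh scenarios can be graded exactly.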
- **R_outcome** (60-70%): Did the agent submit the exact correct adjustment?
- **R_trajectory** (30-40%): Did the agent follow proper investigative procedure?
- **Penalties**: step cost (−0.005/step), hallucination (−0.02), accepting a bad settlement (−0.20)
The correct answer is always a precise floating-point number derived from contract terms. No LLM-as-judge, no fuzzy rubric — pure programmatic verification.
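Putting the pieces together, the grader might be sketched as below. The 65/35 weight split, the tolerance, and the sample values are assumptions for illustration; only the penalty constants come from the description above:

```python
import math

def episode_reward(submitted: float, correct: float,
                   valid_steps: int, total_steps: int,
                   hallucinations: int, accepted_bad_settlement: bool) -> float:
    # Outcome: exact-match check on the floating-point adjustment.
    r_outcome = 1.0 if math.isclose(submitted, correct, abs_tol=0.005) else 0.0
    # Trajectory: fraction of steps that followed the investigative procedure.
    r_trajectory = valid_steps / max(total_steps, 1)
    reward = 0.65 * r_outcome + 0.35 * r_trajectory   # assumed 65/35 split
    reward -= 0.005 * total_steps                     # step cost
    reward -= 0.02 * hallucinations                   # hallucination penalty
    if accepted_bad_settlement:
        reward -= 0.20
    return round(reward, 4)

# Perfect 4-step episode: 0.65 + 0.35 - 0.005*4 = 0.98
print(episode_reward(600.0, 600.0, 4, 4, 0, False))  # 0.98
```

Everything here is a pure function of the episode log, which is what makes the reward verifiable.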
Validated the training loop with Qwen3-0.6B on a T4 GPU. Reward improved from 0.09 → 0.30 (+233%) in 500 episodes. The model learned the canonical investigation procedure with zero tool failures.
Scaled to Qwen3-4B on an RTX 4090 with LoRA. First three attempts completely failed — loss flat at 0.0.
Problem 1: Token Budget Exhaustion. The model consumed its entire 512-token budget on <think> blocks before making a single tool call.
Problem 2: Deterministic Starvation. At temperature=1.0, all K=4 rollouts were identical. Zero reward variance → zero gradient signal.
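The failure mode is visible directly in GRPO's advantage computation: advantages are group-normalized rewards, so a group with zero reward variance yields a zero advantage for every rollout. A toy numeric illustration (not TRL's implementation):

```python
def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-normalized advantages as GRPO computes them per prompt group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# K=4 identical rollouts: zero variance, zero advantage, zero gradient.
print(group_advantages([0.25, 0.25, 0.25, 0.25]))  # [0.0, 0.0, 0.0, 0.0]

# Diverse rollouts: nonzero advantages, so there is something to learn from.
print(group_advantages([0.05, 0.10, 0.20, 0.30]))
```

This is why reward diversity within each rollout group, not just average reward, matters for GRPO.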
1. Shaped Rewards — +0.05 partial credit per valid investigation step.
2. High Temperature — T=1.5 with K=4 rollouts forced exploration diversity.
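Fix 1 can be sketched as a small bonus term layered onto the base reward. The tool names below are hypothetical; only the +0.05-per-step figure comes from the text:

```python
# Shaped-reward sketch: +0.05 partial credit per distinct valid investigation
# step, so even episodes that never reach a correct submission produce reward
# variance for GRPO to exploit. Tool names are illustrative.

VALID_STEPS = {"lookup_po", "lookup_shipments", "lookup_contract"}

def shaped_bonus(tool_calls: list[str]) -> float:
    credited = set()
    bonus = 0.0
    for call in tool_calls:
        if call in VALID_STEPS and call not in credited:
            credited.add(call)   # credit each investigation step only once
            bonus += 0.05
    return round(bonus, 2)

print(shaped_bonus(["lookup_po", "lookup_contract"]))   # 0.1
print(shaped_bonus(["lookup_po", "lookup_po",
                    "lookup_shipments", "lookup_contract"]))  # 0.15
```

Crediting each step at most once keeps the agent from farming the bonus by repeating the same call.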
The tool graph tells the story: early chaos (2-4.25 calls/episode) collapses into rigid discipline — exactly 4.0 tool calls, the optimal investigate-investigate-investigate-submit pipeline.
Launched on Hugging Face Jobs (T4-medium). Early metrics confirm the shaped-reward architecture transfers cleanly to a different model size with zero modifications.
| Step | Loss | Reward | Tool Calls | Entropy |
|---|---|---|---|---|
| 5 | 0.184 | 0.195 | 3.9 | 0.132 |
| 10 | 0.116 | 0.195 | 3.9 | 0.127 |
| 15 | 0.088 | 0.180 | 3.6 | 0.028 |
| 20 | 0.186 | 0.190 | 3.8 | 0.047 |
| Param | Qwen3-0.6B | Qwen3-4B | Qwen3-1.7B |
|---|---|---|---|
| GPU | T4 (Colab) | RTX 4090 | T4 (HF Jobs) |
| Quant | None | 4-bit QLoRA | 4-bit QLoRA |
| Adapter | Full | LoRA r=16 | LoRA r=16 |
| Episodes | 500 | 300 | 500 |
| Time | ~2h | 71m | ~7h |