"""Light theme CSS, SVG diagrams, and HTML content for the ESCTR Gradio UI.""" INJECT_CSS = """""" HEADER_HTML = """

ESCTR

Enterprise Supply Chain & Tax Reconciliation

Blog · GitHub · Training Dashboard

""" ARCH_SVG = """
ESCTR Environment
Agent (Qwen3 LLM, GRPO-trained) ⇄ action/obs ⇄ Tool Engine:
▸ query_database ▸ read_document ▸ communicate_vendor ▸ submit_financial_decision (terminal action)
Procedurally generated from seed — deterministic
Reward Verifier: R_outcome 60-70% + R_trajectory 30-40% − penalties, R ∈ (0.01, 0.99)
""" EPISODE_SVG = """
Typical Episode Flow
① query_database(POs): discover relevant PO IDs
② query_database(invoices): discover invoice IDs
③ read_document(PO-XXXX): cross-reference prices
④ read_document(INV-XXXX): calculate discrepancy
⑤ submit_financial_decision: submit exact adjustment
→ Reward computed → R = f(accuracy, procedure, steps)
""" LEADERBOARD_HTML = """

Model Leaderboard

All models trained on the ESCTR environment using TRL's GRPOTrainer with environment_factory.

<table>
  <tr><th>#</th><th>Model</th><th>Params</th><th>Method</th><th>GPU</th><th>Peak Reward</th><th>Tool Calls</th><th>Failures</th><th>Time</th></tr>
  <tr><td>1</td><td>Qwen3-0.6B</td><td>0.6B</td><td>GRPO</td><td>T4</td><td>0.30</td><td>4.0</td><td>0</td><td>~2h</td></tr>
  <tr><td>2</td><td>Qwen3-4B (LoRA)</td><td>4B</td><td>GRPO + Shaped</td><td>RTX 4090</td><td>0.27</td><td>4.0</td><td>0</td><td>71m</td></tr>
  <tr><td>3</td><td>Qwen3-1.7B (LoRA)</td><td>1.7B</td><td>GRPO + Shaped</td><td>T4 (HF)</td><td>0.195*</td><td>3.9</td><td>0</td><td>~7h</td></tr>
  <tr><td></td><td>Baseline (untrained)</td><td></td><td></td><td></td><td>0.09</td><td>1-4</td><td>frequent</td><td></td></tr>
</table>

* In-progress run on HF Jobs. Peak reward at step 20. Zero tool failures across all logged steps.
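For readers who want to reproduce a run, here is a minimal sketch of the wiring. GRPOConfig/GRPOTrainer and their fields are real TRL API; the esctr import, the ESCTREnvironment class, the placeholder prompt, and the exact environment_factory signature are illustrative assumptions built on the caption above, not the project's actual training script.

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Hypothetical import: the environment's module and class name are assumptions.
from esctr import ESCTREnvironment

def environment_factory(seed: int):
    # One procedurally generated, seed-deterministic scenario per rollout.
    return ESCTREnvironment(seed=seed)

# Placeholder prompt; the real runs presumably pair task prompts with generation seeds.
train_dataset = Dataset.from_dict({"prompt": ["Reconcile PO-1042 against its invoices."]})

config = GRPOConfig(
    output_dir="esctr-grpo",
    num_generations=4,          # K=4 rollouts per prompt
    max_completion_length=512,  # the completion budget discussed in the blog
)

# `environment_factory` is the hook named above; its exact signature is assumed.
# Rewards come from the environment's programmatic verifier rather than a
# reward_funcs list.
trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=config,
    train_dataset=train_dataset,
    environment_factory=environment_factory,
)
trainer.train()
```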

Key Findings

<table>
  <tr><th>Metric</th><th>Untrained</th><th>Trained (best)</th></tr>
  <tr><td>Mean Reward</td><td>0.09</td><td>0.30 (+233%)</td></tr>
  <tr><td>Tool Success Rate</td><td>60%</td><td>100%</td></tr>
  <tr><td>Investigation Completeness</td><td>40%</td><td>100%</td></tr>
  <tr><td>Tool Calls / Episode</td><td>Erratic (1-4)</td><td>Stable 4.0</td></tr>
</table>
""" PLOT_BASE = "https://raw.githubusercontent.com/Musharraf1128/esctr-environment/main/plots" BLOG_HTML = f"""
Training LLMs to investigate procurement fraud, enforce SLA penalties, and reject bad vendor settlements — autonomously.

The Problem

Every day, enterprises process millions of procurement transactions. Between Purchase Orders, shipping manifests, SLA contracts, and vendor invoices — discrepancies are inevitable. A vendor bills $45/unit instead of the contracted $40. A shipment arrives 5 days late, triggering penalty clauses. The vendor disputes the penalty.

Resolving these disputes means humans manually cross-referencing siloed databases, interpreting contract clauses, and performing precise arithmetic under pressure. Current LLMs can't solve this reliably — not because the individual steps are hard, but because the combination is: multi-step tool use, precise arithmetic, adversarial reasoning, and state tracking across 10-20 interaction steps.

This is the capability gap that Reinforcement Learning with Verifiable Rewards (RLVR) was designed to close.

The Environment

{ARCH_SVG}

ESCTR gives the agent three scenarios of increasing complexity:

<table>
  <tr><th>Task</th><th>Difficulty</th><th>What the Agent Must Do</th></tr>
  <tr><td>Procurement Reconciliation</td><td>🟢 Easy</td><td>Identify overcharged line items, calculate exact overcharge</td></tr>
  <tr><td>SLA Enforcement</td><td>🟡 Medium</td><td>Discover late shipments, retrieve SLA contract, compute penalty</td></tr>
  <tr><td>Adversarial Auditing</td><td>🔴 Hard</td><td>All above + disprove vendor counter-claims using warehouse logs</td></tr>
</table>

Every scenario is procedurally generated from a seed — infinite training configurations with deterministic, reproducible grading. No memorization possible.
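To make "deterministic from a seed" concrete, here is a minimal sketch of that pattern. The field names, value ranges, and ground-truth formula are illustrative assumptions, not the actual generator:

```python
import random

def generate_procurement_scenario(seed):
    """Illustrative generator: same seed, same scenario, same ground truth."""
    rng = random.Random(seed)  # all randomness flows through one seeded RNG
    contracted_price = rng.choice([38.0, 40.0, 42.5])
    billed_price = contracted_price + rng.choice([0.0, 2.5, 5.0])
    quantity = rng.randint(50, 500)
    # The verifier's expected answer is derived arithmetically, never judged:
    expected_overcharge = round((billed_price - contracted_price) * quantity, 2)
    return contracted_price, billed_price, quantity, expected_overcharge

# Same seed always reproduces the same scenario and grading key.
assert generate_procurement_scenario(7) == generate_procurement_scenario(7)
```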

Reward Design

R<sub>total</sub> = α · R<sub>outcome</sub> + β · R<sub>trajectory</sub> − penalties

R<sub>outcome</sub> (60-70%): Did the agent submit the exact correct adjustment?
R<sub>trajectory</sub> (30-40%): Did the agent follow proper investigative procedure?
Penalties: step costs (−0.005/step), hallucination (−0.02), accepting bad settlements (−0.20).

The correct answer is always a precise floating-point number derived from contract terms. No LLM-as-judge, no fuzzy rubric — pure programmatic verification.
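A compact sketch of what that programmatic verification can look like. The weights, penalty values, and clipping bounds follow the numbers quoted above; the function name and the exact bookkeeping are assumptions:

```python
def total_reward(r_outcome, r_trajectory, steps, hallucinations,
                 accepted_bad_settlement, alpha=0.65, beta=0.35):
    # Weighted blend per the formula above (alpha in 0.60-0.70, beta in 0.30-0.40).
    reward = alpha * r_outcome + beta * r_trajectory
    # Penalties quoted in the text: per-step cost, hallucination, bad settlement.
    reward -= 0.005 * steps
    reward -= 0.02 * hallucinations
    if accepted_bad_settlement:
        reward -= 0.20
    # The architecture diagram bounds the final reward to R in (0.01, 0.99).
    return min(0.99, max(0.01, reward))
```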

Training Journey

Phase 1 — Proof of Concept (0.6B)

Validated the training loop with Qwen3-0.6B on a T4 GPU. Reward improved from 0.09 → 0.30 (+233%) in 500 episodes. The model learned the canonical investigation procedure with zero tool failures.

[Plots from {PLOT_BASE}: 0.6B reward curve · training dashboard]

Phase 2 — Scaling to 4B, and Hitting a Wall

Scaled to Qwen3-4B on an RTX 4090 with LoRA. The first three attempts failed outright: loss flat at 0.0.

Problem 1: Token Budget Exhaustion. The model consumed its entire 512-token budget on <think> blocks before making a single tool call.
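To see why that is fatal, consider a hedged illustration of the parsing side: if the budget runs out inside the reasoning block, no tool call ever reaches the environment. The parser below is illustrative, not the project's actual one:

```python
def extract_tool_call(completion):
    # Failure mode: the model opens <think> and the 512-token budget expires
    # before the block ever closes, so nothing below this check ever fires.
    if "<think>" in completion and "</think>" not in completion:
        return None  # truncated mid-reasoning: zero tool calls this episode
    tail = completion.split("</think>")[-1]
    # Illustrative convention only: a tool call looks like `query_database(...)`.
    for line in tail.splitlines():
        line = line.strip()
        if "(" in line and line.endswith(")"):
            return line
    return None
```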

Problem 2: Deterministic Starvation. At temperature=1.0, all K=4 rollouts were identical. Zero reward variance → zero gradient signal.
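The mechanism, in one snippet: GRPO scores each rollout relative to its group, so identical rewards normalize to identical (zero) advantages. This is standard group-relative normalization, not the project's code:

```python
def group_advantages(rewards, eps=1e-4):
    # Group-relative normalization used by GRPO-style objectives.
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

print(group_advantages([0.3, 0.3, 0.3, 0.3]))  # all zeros -> no gradient signal
print(group_advantages([0.1, 0.3, 0.0, 0.2]))  # spread -> usable signal
```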

Phase 2.5 — The Fix

1. Shaped Rewards — +0.05 partial credit per valid investigation step.
2. High Temperature — T=1.5 with K=4 rollouts forced exploration diversity.
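A sketch of both fixes together. The +0.05 increment, T=1.5, and K=4 come straight from the list above; the bonus function's interface and the config field choices are assumptions:

```python
# Fix 1 - shaped reward: +0.05 partial credit per valid investigation step,
# so early policies earn small, varying rewards instead of a flat 0.0.
def shaped_bonus(valid_investigation_steps):
    return 0.05 * valid_investigation_steps

# Fix 2 - hotter sampling, so the K=4 rollouts actually diverge.
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="esctr-grpo-4b",
    num_generations=4,  # K=4
    temperature=1.5,    # nonzero reward variance -> nonzero gradient signal
)
```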

Phase 3 — Success: 4B in 71 Minutes

[Plots from {PLOT_BASE}: 4B reward curve · 4B tool discipline]

The tool graph tells the story: early chaos (2-4.25 calls/episode) collapses into rigid discipline — exactly 4.0 tool calls, the optimal investigate-investigate-investigate-submit pipeline.

Phase 4 — Iterating on 1.7B (HF Jobs)

Launched on Hugging Face Jobs (T4-medium). Early metrics confirm the shaped reward architecture transfers cleanly to a different model size with zero modifications.

<table>
  <tr><th>Step</th><th>Loss</th><th>Reward</th><th>Tool Calls</th><th>Entropy</th></tr>
  <tr><td>5</td><td>0.184</td><td>0.195</td><td>3.9</td><td>0.132</td></tr>
  <tr><td>10</td><td>0.116</td><td>0.195</td><td>3.9</td><td>0.127</td></tr>
  <tr><td>15</td><td>0.088</td><td>0.180</td><td>3.6</td><td>0.028</td></tr>
  <tr><td>20</td><td>0.186</td><td>0.190</td><td>3.8</td><td>0.047</td></tr>
</table>

Technical Summary

<table>
  <tr><th>Param</th><th>0.6B</th><th>4B</th><th>1.7B</th></tr>
  <tr><td>Model</td><td>Qwen3-0.6B</td><td>Qwen3-4B</td><td>Qwen3-1.7B</td></tr>
  <tr><td>GPU</td><td>T4 (Colab)</td><td>RTX 4090</td><td>T4 (HF Jobs)</td></tr>
  <tr><td>Quant</td><td>None</td><td>4-bit QLoRA</td><td>4-bit QLoRA</td></tr>
  <tr><td>Adapter</td><td>Full</td><td>LoRA r=16</td><td>LoRA r=16</td></tr>
  <tr><td>Episodes</td><td>500</td><td>300</td><td>500</td></tr>
  <tr><td>Time</td><td>~2h</td><td>71m</td><td>~7h</td></tr>
</table>
"""