--- title: "Training Autonomous Financial Auditors with RLVR" thumbnail: https://raw.githubusercontent.com/Musharraf1128/esctr-environment/main/plots/reward_curve_4b.png authors: - user: musharraf7 date: 2026-04-26 tags: - reinforcement-learning - openenv - grpo - tool-use - finance --- # Training Autonomous Financial Auditors with RLVR > *What if we could train an LLM to investigate procurement fraud, enforce SLA penalties, and reject bad vendor settlements โ€” autonomously?* That's the question we set out to answer for the [OpenEnv Hackathon](https://github.com/meta-pytorch/OpenEnv). The result is **ESCTR** โ€” *Enterprise Supply Chain & Tax Reconciliation* โ€” a stateful RL environment where an LLM agent operates as a **financial controller**. It navigates a multi-step audit pipeline armed with 4 ERP tools, faces adversarial vendors, and is graded against mathematically precise reward verification. ๐Ÿข **Live Environment**: [musharraf7/esctr-environment](https://huggingface.co/spaces/musharraf7/esctr-environment) ๐Ÿ“Š **Training Dashboard**: [Trackio](https://huggingface.co/spaces/musharraf7/esctr-grpo-trained) ๐Ÿ’ป **Source Code**: [GitHub](https://github.com/Musharraf1128/esctr-environment) --- ## The Problem: Why Financial Auditing Needs RL Every day, enterprises process millions of procurement transactions. Between Purchase Orders, shipping manifests, SLA contracts, and vendor invoices โ€” discrepancies are inevitable: - A vendor bills **$45/unit** instead of the contracted **$40** - A shipment arrives **5 days late**, triggering penalty clauses - The vendor disputes the penalty, claiming *your warehouse rejected the delivery* Resolving these disputes today means humans manually cross-referencing siloed databases, interpreting contract clauses, and performing precise arithmetic under pressure. It's slow, expensive, and deeply error-prone. **Current LLMs can't solve this reliably.** Not because the individual steps are hard, but because the *combination* is: 1. **Multi-step tool use** โ€” querying databases, reading documents, communicating with vendors 2. **Precise arithmetic** under contract constraints 3. **Adversarial reasoning** โ€” rejecting manipulative settlement offers 4. **State tracking** across 10โ€“20 interaction steps This is exactly the capability gap that **Reinforcement Learning with Verifiable Rewards (RLVR)** was designed to close. So we built the environment to prove it. 
---

## The Environment: Three Tasks, Escalating Stakes

ESCTR gives the agent three scenarios of increasing complexity — each one a realistic slice of enterprise financial operations:

| Task | Difficulty | What the Agent Must Do |
|------|-----------|----------------------|
| **Procurement Reconciliation** | Easy | Identify overcharged line items, calculate the exact overcharge |
| **SLA Enforcement** | Medium | Discover late shipments, retrieve the SLA contract, compute the penalty |
| **Adversarial Auditing** | Hard | All of the above *plus* disprove vendor counter-claims using warehouse logs |

The agent has four ERP tools at its disposal:

- `query_database` — search shipping logs, purchase orders, and invoices
- `read_document` — retrieve the full text of a contract or manifest
- `communicate_vendor` — negotiate with an adversarial vendor that will lie, deflect, and offer bad settlements
- `submit_financial_decision` — submit the final adjustment amount (the terminal, point-of-no-return action)

Every scenario is **procedurally generated from a seed**, enabling infinite training configurations with deterministic, reproducible grading. There is no memorizing the answer — the agent must investigate.

---

## Reward Design: Dense, Verifiable, Impossible to Fake

Following the RLVR paradigm (Wen et al., ICLR 2026), our reward function is:

```
R_total = α · R_outcome + β · R_trajectory − penalties
```

- **R_outcome** (60–70%): Binary — did the agent submit the *exact* correct adjustment amount?
- **R_trajectory** (30–40%): Did the agent follow proper investigative procedure?
- **Penalties**: Step costs (−0.005/step), hallucination (−0.02), gullibility (−0.20 for accepting bad settlements)

The correct answer is always a **precise floating-point number** derived from contract terms. There is no LLM-as-judge, no fuzzy rubric — just pure programmatic verification. Either you found the fraud, or you didn't.

---

## Training: Three Models, Three GPUs, One Reward Signal

### Phase 1 — Proof of Concept (Qwen3-0.6B)

We first validated the training loop with a 0.6B model on a T4 GPU using TRL's `GRPOTrainer` with `environment_factory`.

**The result spoke for itself:** the model went from a mean reward of **0.09 → 0.30** (more than a 3× improvement) in just 500 episodes. It perfectly learned the canonical investigation procedure — query PO → query invoice → read documents → submit — with zero tool failures.

![0.6B Reward Curve](https://raw.githubusercontent.com/Musharraf1128/esctr-environment/main/plots/reward_curve.png)

![Training Dashboard](https://raw.githubusercontent.com/Musharraf1128/esctr-environment/main/plots/training_dashboard.png)

The proof of concept worked. Time to scale.

---

### Phase 2 — Scaling to 4B, and Hitting a Wall

We scaled to **Qwen3-4B** on an RTX 4090 (24 GB VRAM) with LoRA adapters. The first three attempts **completely failed** — loss flat at 0.0, zero learning whatsoever. Four hours of debugging later, we found two distinct root causes:

**Problem 1: Token Budget Exhaustion**

Qwen3-4B produces large `<think>` reasoning blocks by default. The model was consuming its entire 512-token generation budget on internal monologue — before making a single tool call. No actions, no reward, no gradient.

**Problem 2: Deterministic Starvation**

Even after addressing the thinking issue, at `temperature=1.0` all K=4 rollouts in each GRPO batch were *identical*. The model had learned to deterministically make exactly 3 investigation calls and stop — never reaching `submit_financial_decision`. With zero reward variance across the group, GRPO had **zero gradient signal**. The math simply didn't work.
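Why identical rollouts kill learning is easiest to see in the advantage term GRPO optimizes against. A minimal sketch, assuming the standard group-relative formulation (reward minus group mean, divided by group standard deviation); this is schematic, not TRL's exact implementation:

```python
import statistics


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Schematic group-relative advantage: (reward - group mean) / group std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# All K=4 rollouts deterministic and identical -> identical rewards -> zero advantage everywhere.
print(group_advantages([0.15, 0.15, 0.15, 0.15]))  # [0.0, 0.0, 0.0, 0.0]

# Diverse rollouts give the policy something to push toward and away from.
print(group_advantages([0.05, 0.15, 0.30, 0.10]))  # approx. [-1.07, 0.0, 1.60, -0.53]
```

When every rollout in a group earns the same reward, every advantage is zero and the policy gradient vanishes, no matter how wrong the behavior is.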
This was the core engineering challenge of the project. The model wasn't broken — the training setup was starving it of the variance it needed to learn.

---

### Phase 2.5 — The Fix: Shaped Rewards + Forced Exploration

Two targeted changes broke the deadlock:

1. **Process Reward Shaping** — Instead of only rewarding the final submission, we injected `+0.05` partial credit for each valid investigation step. This gave GRPO the gradient signal it needed to even begin learning the terminal action.
2. **High-Temperature Exploration** — Raising `temperature=1.5` with K=4 rollouts forced diversity in group sampling. The model was finally exploring, failing, and learning from the contrast.
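Concretely, the shaping term sits on top of the verifiable outcome reward described earlier. A simplified sketch of the idea, using the constants quoted in this post; the function signature, the bookkeeping of hallucinations, and the exact weights are illustrative, not the environment's actual implementation:

```python
def shaped_reward(
    submitted_amount: float | None,
    correct_amount: float,
    valid_investigation_steps: int,
    total_steps: int,
    hallucinations: int,
    accepted_bad_settlement: bool,
    alpha: float = 0.65,  # outcome weight: illustrative midpoint of the 60-70% range
    beta: float = 0.35,   # trajectory weight: illustrative midpoint of the 30-40% range
) -> float:
    """Outcome reward plus process shaping, minus the penalties described in the post."""
    # Outcome: binary credit only for submitting the exact correct adjustment amount.
    outcome = 1.0 if submitted_amount is not None and abs(submitted_amount - correct_amount) < 1e-6 else 0.0

    # Process shaping: +0.05 partial credit per valid investigation step, so GRPO groups
    # show reward variance even before the model learns to reach the terminal action.
    trajectory = 0.05 * valid_investigation_steps

    # Penalties: step cost, hallucination, and gullibility (accepting a bad settlement).
    penalties = 0.005 * total_steps + 0.02 * hallucinations + (0.20 if accepted_bad_settlement else 0.0)

    return alpha * outcome + beta * trajectory - penalties
```

Paired with `temperature=1.5` and K=4 rollouts per prompt, this gave every GRPO group non-degenerate reward variance, which is exactly what the advantage computation above needs.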
---

### Phase 3 — Success: 4B Training in 71 Minutes

With shaped rewards and forced exploration, the 4B model finally learned — and the results were clean:

![4B Reward Curve](https://raw.githubusercontent.com/Musharraf1128/esctr-environment/main/plots/reward_curve_4b.png)

![4B Tool Discipline](https://raw.githubusercontent.com/Musharraf1128/esctr-environment/main/plots/tool_calls_4b.png)

**Key results:**

| Metric | Value |
|--------|-------|
| Peak Reward | **0.27** (vs 0.09 baseline) |
| Tool Calls/Episode | Converged to exactly **4.0** |
| Tool Failure Rate | **0** across 300 episodes |
| Peak VRAM | **19.74 GB** on a 24 GB GPU |
| Total Training Time | **71.3 minutes** |

The tool execution graph tells the most compelling story. Early in training, the model varies wildly — 2 to 4.25 tool calls per episode, chaotic and unreliable. By the end, it locks rigidly onto **exactly 4.0** — having learned the optimal investigate → investigate → investigate → submit pipeline. The chaos collapses into discipline.

---

### Phase 4 — Iterative Run: Qwen3-1.7B on HF Jobs (In Progress)

We didn't stop at 4B. Following the principle of **fast iteration on small models**, we launched a third training run on **HF Jobs T4-medium** using `Qwen/Qwen3-1.7B` with LoRA adapters — this time running entirely on Hugging Face's infrastructure. No local GPU, no RunPod — just `hf jobs run` and a self-contained training script.

This run won't complete before the submission deadline (~500 steps × 50 s/step ≈ 7 hours), but the early metrics already tell the most important story of this project: **the shaped reward architecture we debugged on 4B transfers cleanly to a completely different model size with zero modifications.**

**Observed training progression (Steps 5–20):**

| Step | Loss | Reward (mean) | Reward Std | Tool Calls/ep | Entropy |
|------|------|--------------|------------|---------------|---------|
| 5 | 0.184 | **0.195** | 0.010 | **3.9** | 0.132 |
| 10 | 0.116 | 0.195 | 0.010 | **3.9** | 0.127 |
| 15 | 0.088 | 0.180 | 0.029 | 3.6 | 0.028 |
| 20 | 0.186 | 0.190 | 0.020 | 3.8 | 0.047 |

What this tells us:

- **No cold-start collapse** — reward is non-zero from the very first logged step. The shaped investigation bonus is doing exactly what it was designed to do.
- **Zero tool failures** at every step — the 1.7B model calls tools with valid JSON syntax just as reliably as the 4B model.
- **Loss is decreasing**, confirming gradient signal is flowing through the LoRA adapter.
- **Entropy is dropping** (0.132 → 0.028) — the model is committing to a policy, not just wandering. It has learned that the `query_database → read_document → submit` pipeline is the winning trajectory.

The high `frac_reward_zero_std` (0.6–0.8) at early steps is expected — it means some GRPO groups have identical rollouts, which is normal before the model diversifies its exploration. This resolved naturally in the 4B run around step ~30.

---

## What the Agent Actually Learned

| Metric | Baseline (untrained) | Trained (4B, 300 ep) |
|--------|---------------------|---------------------|
| Mean Reward | 0.09 | 0.20 (peak 0.27) |
| Tool Success Rate | 60% | **100%** |
| Investigation Completeness | 40% | **100%** |
| Tool Calls/Episode | Erratic (1–4) | Stable **4.0** |
| Tool Failures | Frequent | **0** |

The untrained model jumps straight to a decision with no evidence. The trained agent follows a principled audit path: gather evidence, read the contract, then — and only then — submit with conviction.

Critically, the 1.7B model — running on completely different hardware and at a different parameter scale — exhibits the *exact same investigation pattern* from its very first training step, confirming that our reward design is robust and transferable.

---

## Technical Summary

| Parameter | 0.6B Run | 4B Run | 1.7B Run (in progress) |
|-----------|----------|--------|------------------------|
| Model | Qwen/Qwen3-0.6B | Qwen/Qwen3-4B | Qwen/Qwen3-1.7B |
| GPU | T4 (Colab) | RTX 4090 (RunPod) | T4 (HF Jobs) |
| Quantization | None | 4-bit (BitsAndBytes) | 4-bit (BitsAndBytes) |
| Adapter | Full model | LoRA (r=16) | LoRA (r=16) |
| Episodes | 500 | 300 | 500 (planned) |
| Training Time | ~2 hours | ~71 minutes | ~7 hours (ongoing) |
| Framework | TRL GRPOTrainer | TRL GRPOTrainer | TRL GRPOTrainer |
| Script | [`train.py`](https://huggingface.co/spaces/musharraf7/esctr-environment/blob/main/train.py) | [`train_4b.py`](https://huggingface.co/spaces/musharraf7/esctr-environment/blob/main/train_4b.py) | [`train_hf_jobs.py`](https://huggingface.co/spaces/musharraf7/esctr-environment/blob/main/train_hf_jobs.py) |
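For readers who want to reproduce the 4B or 1.7B setup, this is roughly what the quantization and adapter configuration summarized in the table looks like in code. It is a minimal sketch assuming standard `transformers`, `peft`, and `bitsandbytes` APIs; `r=16` comes from the table above, while the other hyperparameters and the target-module list are common defaults, not necessarily what the training scripts use.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "Qwen/Qwen3-1.7B"

# 4-bit quantization so the base model fits comfortably in T4 / RTX 4090 VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # float16 compute keeps the sketch T4-friendly
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# LoRA adapter (r=16, as in the runs above); target modules are typical for Qwen-style blocks.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

The trainer wiring itself is covered by the scripts linked in the table; the point here is simply that nothing exotic is needed to fit a 1.7B or 4B policy plus adapters onto a single mid-range GPU.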
---

## Why This Matters

ESCTR demonstrates that **RLVR can teach LLMs enterprise-grade financial reasoning** — a domain nearly absent from existing RL training benchmarks. Unlike game environments (chess, Snake, tic-tac-toe), our environment tests capabilities that actually exist in production systems:

- **Real-world professional skills** — procurement auditing, SLA enforcement, dispute resolution
- **Adversarial reasoning** — vendor negotiation where the counterpart is actively trying to deceive you
- **Verifiable, precise rewards** — exact floating-point answers derived from contract mathematics
- **Production integration potential** — the same tool interface could plug directly into SAP or Oracle as a pre-audit layer

The broader point: this is the kind of environment that pushes the frontier of *what we can train LLMs to do*. Not playing games — performing the complex, multi-step reasoning that enterprises actually need and pay billions of dollars for humans to do today.

---

## Links

- 🏢 **Environment Space**: [musharraf7/esctr-environment](https://huggingface.co/spaces/musharraf7/esctr-environment)
- 📊 **Training Dashboard**: [Trackio Space](https://huggingface.co/spaces/musharraf7/esctr-grpo-trained)
- 🏋️ **Training Scripts**: [`train.py`](https://github.com/Musharraf1128/esctr-environment/blob/main/train.py) · [`train_4b.py`](https://github.com/Musharraf1128/esctr-environment/blob/main/train_4b.py) · [`train_hf_jobs.py`](https://github.com/Musharraf1128/esctr-environment/blob/main/train_hf_jobs.py)
- 💻 **Source Code**: [GitHub](https://github.com/Musharraf1128/esctr-environment)

---

*Built for the [OpenEnv Hackathon](https://github.com/meta-pytorch/OpenEnv) by Musharraf.*