---
title: "Training Autonomous Financial Auditors with RLVR"
thumbnail: https://raw.githubusercontent.com/Musharraf1128/esctr-environment/main/plots/reward_curve_4b.png
authors:
- user: musharraf7
date: 2026-04-26
tags:
- reinforcement-learning
- openenv
- grpo
- tool-use
- finance
---
# Training Autonomous Financial Auditors with RLVR
> *What if we could train an LLM to investigate procurement fraud, enforce SLA penalties, and reject bad vendor settlements, all autonomously?*

That's the question we set out to answer for the [OpenEnv Hackathon](https://github.com/meta-pytorch/OpenEnv).
The result is **ESCTR** (*Enterprise Supply Chain & Tax Reconciliation*), a stateful RL environment where an LLM agent operates as a **financial controller**. It navigates a multi-step audit pipeline armed with four ERP tools, faces adversarial vendors, and is graded by mathematically precise reward verification.
- **Live Environment**: [musharraf7/esctr-environment](https://huggingface.co/spaces/musharraf7/esctr-environment)
- **Training Dashboard**: [Trackio](https://huggingface.co/spaces/musharraf7/esctr-grpo-trained)
- **Source Code**: [GitHub](https://github.com/Musharraf1128/esctr-environment)
---
## The Problem: Why Financial Auditing Needs RL
Every day, enterprises process millions of procurement transactions. Between purchase orders, shipping manifests, SLA contracts, and vendor invoices, discrepancies are inevitable:
- A vendor bills **$45/unit** instead of the contracted **$40**
- A shipment arrives **5 days late**, triggering penalty clauses
- The vendor disputes the penalty, claiming *your warehouse rejected the delivery*
Resolving these disputes today means humans manually cross-referencing siloed databases, interpreting contract clauses, and performing precise arithmetic under pressure. It's slow, expensive, and deeply error-prone.
**Current LLMs can't solve this reliably.** Not because the individual steps are hard, but because the *combination* is:
1. **Multi-step tool use**: querying databases, reading documents, communicating with vendors
2. **Precise arithmetic** under contract constraints
3. **Adversarial reasoning**: rejecting manipulative settlement offers
4. **State tracking** across 10–20 interaction steps
This is exactly the capability gap that **Reinforcement Learning with Verifiable Rewards (RLVR)** was designed to close. So we built the environment to prove it.
---
## The Environment: Three Tasks, Escalating Stakes
ESCTR gives the agent three scenarios of increasing complexity, each one a realistic slice of enterprise financial operations:

| Task | Difficulty | What the Agent Must Do |
|------|-----------|----------------------|
| **Procurement Reconciliation** | Easy | Identify overcharged line items, calculate the exact overcharge |
| **SLA Enforcement** | Medium | Discover late shipments, retrieve the SLA contract, compute the penalty |
| **Adversarial Auditing** | Hard | All of the above *plus* disprove vendor counter-claims using warehouse logs |
The agent has four ERP tools at its disposal:
- `query_database`: search shipping logs, purchase orders, and invoices
- `read_document`: retrieve the full text of a contract or manifest
- `communicate_vendor`: negotiate with an adversarial vendor that will lie, deflect, and offer bad settlements
- `submit_financial_decision`: submit the final adjustment amount (the terminal, point-of-no-return action)
Every scenario is **procedurally generated from a seed**, enabling infinite training configurations with deterministic, reproducible grading. There is no memorizing the answer; the agent must investigate.
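
To make the loop concrete, here is a minimal sketch of what one audit episode looks like from the agent's side. The client object, method names, and action schema below are illustrative assumptions rather than the actual ESCTR/OpenEnv interface; the point is simply the shape of a seeded, tool-driven episode.

```python
from dataclasses import dataclass

# Hypothetical action type and env interface, for illustration only.
@dataclass
class ToolCall:
    tool: str        # one of the four ERP tools
    arguments: dict  # JSON arguments for that tool

def run_episode(env, seed: int = 42) -> float:
    """Walk one procedurally generated audit scenario and return its reward."""
    env.reset(seed=seed)  # same seed -> same POs, invoices, SLA contract, ground truth
    env.step(ToolCall("query_database", {"table": "purchase_orders"}))
    env.step(ToolCall("query_database", {"table": "invoices"}))
    env.step(ToolCall("read_document", {"doc_id": "sla_contract"}))
    # Terminal, point-of-no-return action: graded against the seed-derived answer.
    result = env.step(ToolCall("submit_financial_decision", {"adjustment": 1250.00}))
    return result.reward
```
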
---
## Reward Design: Dense, Verifiable, Impossible to Fake
Following the RLVR paradigm (Wen et al., ICLR 2026), our reward function is:
```
R_total = α · R_outcome + β · R_trajectory − penalties
```
- **R_outcome** (60–70%): Binary. Did the agent submit the *exact* correct adjustment amount?
- **R_trajectory** (30–40%): Did the agent follow proper investigative procedure?
- **Penalties**: Step costs (-0.005/step), hallucination (-0.02), gullibility (-0.20 for accepting bad settlements)
The correct answer is always a **precise floating-point number** derived from contract terms. There is no LLM-as-judge and no fuzzy rubric, just pure programmatic verification. Either you found the fraud, or you didn't.
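
Written out as code, the reward is only a few lines. In the sketch below, the weights α = 0.7 and β = 0.3 are placeholder values inside the 60–70% / 30–40% ranges quoted above; the penalty constants are the ones from the list.

```python
def total_reward(outcome_correct: bool,
                 trajectory_score: float,       # 0..1: fraction of proper investigative steps taken
                 n_steps: int,
                 n_hallucinations: int,
                 accepted_bad_settlement: bool,
                 alpha: float = 0.7,            # outcome weight (placeholder in the 0.6-0.7 range)
                 beta: float = 0.3) -> float:   # trajectory weight (placeholder in the 0.3-0.4 range)
    r_outcome = 1.0 if outcome_correct else 0.0  # exact adjustment amount, or nothing
    r_trajectory = trajectory_score
    penalties = (0.005 * n_steps                    # step cost
                 + 0.02 * n_hallucinations          # hallucination penalty
                 + (0.20 if accepted_bad_settlement else 0.0))  # gullibility penalty
    return alpha * r_outcome + beta * r_trajectory - penalties
```
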
---
## Training: Three Models, Three GPUs, One Reward Signal
### Phase 1 – Proof of Concept (Qwen3-0.6B)
We first validated the training loop with a 0.6B model on a T4 GPU using TRL's `GRPOTrainer` with `environment_factory`.
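
In outline, the wiring looks like the snippet below. The `environment_factory` keyword is the hook named above; the other fields are standard `GRPOConfig` options, and the exact signature depends on the TRL version used, so treat this as a sketch rather than a copy of `train.py`.

```python
from trl import GRPOConfig, GRPOTrainer

config = GRPOConfig(
    output_dir="esctr-grpo-0.6b",
    num_generations=4,          # K rollouts per prompt (value assumed for this run)
    max_completion_length=512,  # per-turn generation budget
)

# make_esctr_env is a placeholder: a callable returning a fresh, seeded ESCTR instance.
trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=config,
    environment_factory=make_esctr_env,
)
trainer.train()
```
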
**The result spoke for itself:** the model went from a mean reward of **0.09 to 0.30** (+222%) in just 500 episodes. It perfectly learned the canonical investigation procedure (query PO → query invoice → read documents → submit) with zero tool failures.


The proof of concept worked. Time to scale.
---
### Phase 2 – Scaling to 4B, and Hitting a Wall
We scaled to **Qwen3-4B** on an RTX 4090 (24 GB VRAM) with LoRA adapters. The first three attempts **completely failed**: loss flat at 0.0, zero learning whatsoever.
Four hours of debugging later, we found two distinct root causes:
**Problem 1: Token Budget Exhaustion**

Qwen3-4B produces large `<think>` reasoning blocks by default. The model was consuming its entire 512-token generation budget on internal monologue before making a single tool call. No actions, no reward, no gradient.
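
The post doesn't detail the exact mitigation, but for context, Qwen3 exposes a switch for this behaviour: passing `enable_thinking=False` through `apply_chat_template` suppresses the `<think>` block so the token budget goes to tool calls instead. This is a general Qwen3 mechanism, not necessarily the change made in `train_4b.py`.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Audit invoice INV-1042 against PO-7731."}],  # toy message
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # Qwen3: skip the <think> reasoning block
)
```
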
**Problem 2: Deterministic Starvation**

Even after addressing the thinking issue, at `temperature=1.0` all K=4 rollouts in each GRPO batch were *identical*. The model had learned to deterministically make exactly 3 investigation calls and stop, never reaching `submit_financial_decision`. With zero reward variance across the group, GRPO had **zero gradient signal**. The math simply didn't work.
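
The failure is visible directly in the group-relative advantage GRPO uses: each rollout's advantage is its reward minus the group mean, scaled by the group standard deviation. If all K rewards in a group are identical, every advantage is exactly zero and no token receives a gradient. A two-print illustration:

```python
import numpy as np

def group_advantages(rewards, eps=1e-4):
    """Group-relative advantages as used by GRPO: (r - mean) / (std + eps)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

print(group_advantages([0.12, 0.12, 0.12, 0.12]))  # [0. 0. 0. 0.]  -> zero gradient signal
print(group_advantages([0.05, 0.30, 0.00, 0.12]))  # non-zero advantages -> learning happens
```
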
This was the core engineering challenge of the project. The model wasn't broken; the training setup was starving it of the variance it needed to learn.
---
### Phase 2.5 – The Fix: Shaped Rewards + Forced Exploration
Two targeted changes broke the deadlock:
1. **Process Reward Shaping**: Instead of only rewarding the final submission, we injected `+0.05` partial credit for each valid investigation step. This gave GRPO the gradient signal it needed to even begin learning the terminal action.
2. **High-Temperature Exploration**: Raising the sampling temperature to `1.5` with K=4 rollouts forced diversity in group sampling. The model was finally exploring, failing, and learning from the contrast. (Both changes are sketched below.)
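
Both changes are small. The sketch below shows their shape: a +0.05 bonus per valid investigation step layered onto the base reward (the value quoted above), plus hotter sampling so the K=4 rollouts in a GRPO group actually diverge. The function and trajectory keys are placeholders, and the config fields assume a recent TRL `GRPOConfig`.

```python
from trl import GRPOConfig

INVESTIGATION_TOOLS = {"query_database", "read_document", "communicate_vendor"}

def shaped_reward(base_reward: float, trajectory: list) -> float:
    """Add +0.05 partial credit for each valid (non-failing) investigation step."""
    bonus = 0.05 * sum(1 for call in trajectory
                       if call["tool"] in INVESTIGATION_TOOLS and call["ok"])
    return base_reward + bonus

# Forced exploration: higher temperature so rollouts within a group stop being identical.
config = GRPOConfig(output_dir="esctr-grpo-4b", temperature=1.5, num_generations=4)
```
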
---
### Phase 3 – Success: 4B Training in 71 Minutes
With shaped rewards and forced exploration, the 4B model finally learned, and the results were clean:


**Key results:**

| Metric | Value |
|--------|-------|
| Peak Reward | **0.27** (vs 0.09 baseline) |
| Tool Calls/Episode | Converged to exactly **4.0** |
| Tool Failure Rate | **0** across 300 episodes |
| Peak VRAM | **19.74 GB** on 24GB GPU |
| Total Training Time | **71.3 minutes** |
The tool execution graph tells the most compelling story. Early in training, the model varies wildly: 2 to 4.25 tool calls per episode, chaotic and unreliable. By the end, it locks rigidly onto **exactly 4.0**, having learned the optimal investigate → investigate → investigate → submit pipeline. The chaos collapses into discipline.
---
### Phase 4 – Iterative Run: Qwen3-1.7B on HF Jobs (In Progress)
We didn't stop at 4B. Following the principle of **fast iteration on small models**, we launched a third training run on **HF Jobs T4-medium** using `Qwen/Qwen3-1.7B` with LoRA adapters, this time running entirely on Hugging Face's infrastructure. No local GPU, no RunPod, just `hf jobs run` and a self-contained training script.
This run won't complete before the submission deadline (~500 steps × 50 s/step ≈ 7 hours), but the early metrics already tell the most important story of this project: **the shaped reward architecture we debugged on 4B transfers cleanly to a completely different model size with zero modifications.**
**Observed training progression (Steps 5–20):**

| Step | Loss | Reward (mean) | Reward Std | Tool Calls/ep | Entropy |
|------|------|--------------|------------|---------------|---------|
| 5 | 0.184 | **0.195** | 0.010 | **3.9** | 0.132 |
| 10 | 0.116 | 0.195 | 0.010 | **3.9** | 0.127 |
| 15 | 0.088 | 0.180 | 0.029 | 3.6 | 0.028 |
| 20 | 0.186 | 0.190 | 0.020 | 3.8 | 0.047 |
What this tells us:
- **No cold-start collapse**: reward is non-zero from the very first logged step. The shaped investigation bonus is doing exactly what it was designed to do.
- **Zero tool failures** at every step: the 1.7B model calls tools with valid JSON syntax just as reliably as the 4B model.
- **Loss is decreasing**, confirming gradient signal is flowing through the LoRA adapter.
- **Entropy is dropping** (0.132 → 0.028): the model is committing to a policy, not just wandering. It has learned that the `query_database → read_document → submit` pipeline is the winning trajectory.
The high `frac_reward_zero_std` (0.6–0.8) at early steps is expected: it means some GRPO groups have identical rollouts, which is normal before the model diversifies its exploration. This resolved naturally in the 4B run around step 30.
---
## What the Agent Actually Learned
| Metric | Baseline (untrained) | Trained (4B, 300 ep) |
|--------|---------------------|---------------------|
| Mean Reward | 0.09 | 0.20 (peak 0.27) |
| Tool Success Rate | 60% | **100%** |
| Investigation Completeness | 40% | **100%** |
| Tool Calls/Episode | Erratic (1–4) | Stable **4.0** |
| Tool Failures | Frequent | **0** |
The untrained model jumps straight to a decision with no evidence. The trained agent follows a principled audit path: gather evidence, read the contract, then (and only then) submit with conviction.
Critically, the 1.7B model, running on completely different hardware and at a different parameter scale, exhibits the *exact same investigation pattern* from its very first training step, confirming that our reward design is robust and transferable.
---
## Technical Summary
| Parameter | 0.6B Run | 4B Run | 1.7B Run (in progress) |
|-----------|----------|--------|------------------------|
| Model | Qwen/Qwen3-0.6B | Qwen/Qwen3-4B | Qwen/Qwen3-1.7B |
| GPU | T4 (Colab) | RTX 4090 (RunPod) | T4 (HF Jobs) |
| Quantization | None | 4-bit (BitsAndBytes) | 4-bit (BitsAndBytes) |
| Adapter | Full model | LoRA (r=16) | LoRA (r=16) |
| Episodes | 500 | 300 | 500 (planned) |
| Training Time | ~2 hours | ~71 minutes | ~7 hours (ongoing) |
| Framework | TRL GRPOTrainer | TRL GRPOTrainer | TRL GRPOTrainer |
| Script | [`train.py`](https://huggingface.co/spaces/musharraf7/esctr-environment/blob/main/train.py) | [`train_4b.py`](https://huggingface.co/spaces/musharraf7/esctr-environment/blob/main/train_4b.py) | [`train_hf_jobs.py`](https://huggingface.co/spaces/musharraf7/esctr-environment/blob/main/train_hf_jobs.py) |
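
For reference, the 4-bit + LoRA columns above correspond to a standard BitsAndBytes + PEFT setup. The sketch below uses the r=16 from the table; the other hyperparameters (alpha, target modules, compute dtype) are common defaults and are not confirmed by the training scripts.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # fp16 compute works on both the T4 and the RTX 4090
)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B", quantization_config=bnb)

lora = LoraConfig(
    r=16,                    # rank from the table above
    lora_alpha=32,           # assumed, not stated in the post
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention-only targets
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```
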
---
## Why This Matters
ESCTR demonstrates that **RLVR can teach LLMs enterprise-grade financial reasoning**, a domain nearly absent from existing RL training benchmarks.
Unlike game environments (chess, Snake, tic-tac-toe), our environment tests capabilities that actually exist in production systems:
- **Real-world professional skills**: procurement auditing, SLA enforcement, dispute resolution
- **Adversarial reasoning**: vendor negotiation where the counterpart is actively trying to deceive you
- **Verifiable, precise rewards**: exact floating-point answers derived from contract mathematics
- **Production integration potential**: the same tool interface could plug directly into SAP or Oracle as a pre-audit layer
The broader point: this is the kind of environment that pushes the frontier of *what we can train LLMs to do*. Not playing games, but performing the complex, multi-step reasoning that enterprises actually need and pay billions of dollars for humans to do today.
---
## Links
- **Environment Space**: [musharraf7/esctr-environment](https://huggingface.co/spaces/musharraf7/esctr-environment)
- **Training Dashboard**: [Trackio Space](https://huggingface.co/spaces/musharraf7/esctr-grpo-trained)
- **Training Scripts**: [`train.py`](https://github.com/Musharraf1128/esctr-environment/blob/main/train.py) · [`train_4b.py`](https://github.com/Musharraf1128/esctr-environment/blob/main/train_4b.py) · [`train_hf_jobs.py`](https://github.com/Musharraf1128/esctr-environment/blob/main/train_hf_jobs.py)
- **Source Code**: [GitHub](https://github.com/Musharraf1128/esctr-environment)
---
*Built for the [OpenEnv Hackathon](https://github.com/meta-pytorch/OpenEnv) by Musharraf.*