File size: 13,346 Bytes
08a3b81
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
671b402
08a3b81
671b402
08a3b81
671b402
 
 
08a3b81
671b402
08a3b81
 
 
 
 
671b402
 
 
 
 
08a3b81
671b402
08a3b81
671b402
08a3b81
671b402
08a3b81
671b402
 
08a3b81
671b402
08a3b81
 
 
671b402
08a3b81
671b402
08a3b81
 
 
671b402
 
 
08a3b81
671b402
08a3b81
671b402
 
 
 
 
 
08a3b81
 
 
671b402
08a3b81
671b402
08a3b81
 
671b402
08a3b81
 
671b402
 
 
08a3b81
671b402
08a3b81
 
 
6ba8d64
08a3b81
671b402
08a3b81
 
 
671b402
08a3b81
 
 
 
 
671b402
 
 
 
 
 
 
 
 
08a3b81
671b402
08a3b81
671b402
08a3b81
671b402
08a3b81
671b402
08a3b81
671b402
08a3b81
671b402
 
 
08a3b81
671b402
08a3b81
671b402
08a3b81
671b402
 
 
08a3b81
671b402
08a3b81
671b402
08a3b81
 
 
 
 
671b402
08a3b81
671b402
 
 
 
 
 
 
 
 
08a3b81
 
 
68287c5
 
6ba8d64
68287c5
6ba8d64
68287c5
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
08a3b81
 
 
 
 
671b402
 
 
 
08a3b81
671b402
08a3b81
6ba8d64
 
08a3b81
 
671b402
08a3b81
68287c5
 
 
 
 
 
 
 
 
6ba8d64
08a3b81
 
 
 
 
671b402
 
 
08a3b81
671b402
 
 
 
08a3b81
671b402
08a3b81
 
 
 
 
 
 
68287c5
 
08a3b81
671b402
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
---
title: "Training Autonomous Financial Auditors with RLVR"
thumbnail: https://raw.githubusercontent.com/Musharraf1128/esctr-environment/main/plots/reward_curve_4b.png
authors:
  - user: musharraf7
date: 2026-04-26
tags:
  - reinforcement-learning
  - openenv
  - grpo
  - tool-use
  - finance
---

# Training Autonomous Financial Auditors with RLVR

> *What if we could train an LLM to investigate procurement fraud, enforce SLA penalties, and reject bad vendor settlements β€” autonomously?*

That's the question we set out to answer for the [OpenEnv Hackathon](https://github.com/meta-pytorch/OpenEnv).

The result is **ESCTR** β€” *Enterprise Supply Chain & Tax Reconciliation* β€” a stateful RL environment where an LLM agent operates as a **financial controller**. It navigates a multi-step audit pipeline armed with 4 ERP tools, faces adversarial vendors, and is graded against mathematically precise reward verification.

🏒 **Live Environment**: [musharraf7/esctr-environment](https://huggingface.co/spaces/musharraf7/esctr-environment)
πŸ“Š **Training Dashboard**: [Trackio](https://huggingface.co/spaces/musharraf7/esctr-grpo-trained)
πŸ’» **Source Code**: [GitHub](https://github.com/Musharraf1128/esctr-environment)

---

## The Problem: Why Financial Auditing Needs RL

Every day, enterprises process millions of procurement transactions. Between Purchase Orders, shipping manifests, SLA contracts, and vendor invoices β€” discrepancies are inevitable:

- A vendor bills **$45/unit** instead of the contracted **$40**
- A shipment arrives **5 days late**, triggering penalty clauses
- The vendor disputes the penalty, claiming *your warehouse rejected the delivery*

Resolving these disputes today means humans manually cross-referencing siloed databases, interpreting contract clauses, and performing precise arithmetic under pressure. It's slow, expensive, and deeply error-prone.

**Current LLMs can't solve this reliably.** Not because the individual steps are hard, but because the *combination* is:

1. **Multi-step tool use** β€” querying databases, reading documents, communicating with vendors
2. **Precise arithmetic** under contract constraints
3. **Adversarial reasoning** β€” rejecting manipulative settlement offers
4. **State tracking** across 10–20 interaction steps

This is exactly the capability gap that **Reinforcement Learning with Verifiable Rewards (RLVR)** was designed to close. So we built the environment to prove it.

---

## The Environment: Three Tasks, Escalating Stakes

ESCTR gives the agent three scenarios of increasing complexity β€” each one a realistic slice of enterprise financial operations:

| Task | Difficulty | What the Agent Must Do |
|------|-----------|----------------------|
| **Procurement Reconciliation** | Easy | Identify overcharged line items, calculate the exact overcharge |
| **SLA Enforcement** | Medium | Discover late shipments, retrieve the SLA contract, compute the penalty |
| **Adversarial Auditing** | Hard | All of the above *plus* disprove vendor counter-claims using warehouse logs |

The agent has four ERP tools at its disposal:

- `query_database` β€” search shipping logs, purchase orders, and invoices
- `read_document` β€” retrieve the full text of a contract or manifest
- `communicate_vendor` β€” negotiate with an adversarial vendor that will lie, deflect, and offer bad settlements
- `submit_financial_decision` β€” submit the final adjustment amount (the terminal, point-of-no-return action)

Every scenario is **procedurally generated from a seed**, enabling infinite training configurations with deterministic, reproducible grading. There is no memorizing the answer β€” the agent must investigate.

---

## Reward Design: Dense, Verifiable, Impossible to Fake

Following the RLVR paradigm (Wen et al., ICLR 2026), our reward function is:

```
R_total = Ξ± Β· R_outcome + Ξ² Β· R_trajectory βˆ’ penalties
```

- **R_outcome** (60–70%): Binary β€” did the agent submit the *exact* correct adjustment amount?
- **R_trajectory** (30–40%): Did the agent follow proper investigative procedure?
- **Penalties**: Step costs (βˆ’0.005/step), hallucination (βˆ’0.02), gullibility (βˆ’0.20 for accepting bad settlements)

The correct answer is always a **precise floating-point number** derived from contract terms. There is no LLM-as-judge, no fuzzy rubric β€” just pure programmatic verification. Either you found the fraud, or you didn't.

---

## Training: Three Models, Three GPUs, One Reward Signal

### Phase 1 β€” Proof of Concept (Qwen3-0.6B)

We first validated the training loop with a 0.6B model on a T4 GPU using TRL's `GRPOTrainer` with `environment_factory`.

**The result spoke for itself:** the model went from a mean reward of **0.09 β†’ 0.30** (+222%) in just 500 episodes. It perfectly learned the canonical investigation procedure β€” query PO β†’ query Invoice β†’ read documents β†’ submit β€” with zero tool failures.

![0.6B Reward Curve](https://raw.githubusercontent.com/Musharraf1128/esctr-environment/main/plots/reward_curve.png)

![Training Dashboard](https://raw.githubusercontent.com/Musharraf1128/esctr-environment/main/plots/training_dashboard.png)

The proof of concept worked. Time to scale.

---

### Phase 2 β€” Scaling to 4B, and Hitting a Wall

We scaled to **Qwen3-4B** on an RTX 4090 (24GB VRAM) with LoRA adapters. The first three attempts **completely failed** β€” loss flat at 0.0, zero learning whatsoever.

Four hours of debugging later, we found two distinct root causes:

**Problem 1: Token Budget Exhaustion**

Qwen3-4B produces large `<think>` reasoning blocks by default. The model was consuming its entire 512-token generation budget on internal monologue β€” before making a single tool call. No actions, no reward, no gradient.

**Problem 2: Deterministic Starvation**

Even after addressing the thinking issue, at `temperature=1.0` all K=4 rollouts in each GRPO batch were *identical*. The model had learned to deterministically make exactly 3 investigation calls and stop β€” never reaching `submit_financial_decision`. With zero reward variance across the group, GRPO had **zero gradient signal**. The math simply didn't work.

This was the core engineering challenge of the project. The model wasn't broken β€” the training setup was starving it of the variance it needed to learn.

---

### Phase 2.5 β€” The Fix: Shaped Rewards + Forced Exploration

Two targeted changes broke the deadlock:

1. **Process Reward Shaping** β€” Instead of only rewarding the final submission, we injected `+0.05` partial credit for each valid investigation step. This gave GRPO the gradient signal it needed to even begin learning the terminal action.

2. **High-Temperature Exploration** β€” Raising `temperature=1.5` with K=4 rollouts forced diversity in group sampling. The model was finally exploring, failing, and learning from the contrast.

---

### Phase 3 β€” Success: 4B Training in 71 Minutes

With shaped rewards and forced exploration, the 4B model finally learned β€” and the results were clean:

![4B Reward Curve](https://raw.githubusercontent.com/Musharraf1128/esctr-environment/main/plots/reward_curve_4b.png)

![4B Tool Discipline](https://raw.githubusercontent.com/Musharraf1128/esctr-environment/main/plots/tool_calls_4b.png)

**Key results:**

| Metric | Value |
|--------|-------|
| Peak Reward | **0.27** (vs 0.09 baseline) |
| Tool Calls/Episode | Converged to exactly **4.0** |
| Tool Failure Rate | **0** across 300 episodes |
| Peak VRAM | **19.74 GB** on 24GB GPU |
| Total Training Time | **71.3 minutes** |

The tool execution graph tells the most compelling story. Early in training, the model varies wildly β€” 2 to 4.25 tool calls per episode, chaotic and unreliable. By the end, it locks rigidly onto **exactly 4.0** β€” having learned the optimal investigate β†’ investigate β†’ investigate β†’ submit pipeline. The chaos collapses into discipline.

---

### Phase 4 β€” Iterative Run: Qwen3-1.7B on HF Jobs (In Progress)

We didn't stop at 4B. Following the principle of **fast iteration on small models**, we launched a third training run on **HF Jobs T4-medium** using `Qwen/Qwen3-1.7B` with LoRA adapters β€” this time running entirely on HuggingFace's infrastructure. No local GPU, no RunPod β€” just `hf jobs run` and a self-contained training script.

This run won't complete before the submission deadline (~500 steps Γ— 50s/step β‰ˆ 7 hours), but the early metrics already tell the most important story of this project: **the shaped reward architecture we debugged on 4B transfers cleanly to a completely different model size with zero modifications.**

**Observed training progression (Steps 5–20):**

| Step | Loss | Reward (mean) | Reward Std | Tool Calls/ep | Entropy |
|------|------|--------------|------------|---------------|---------|
| 5    | 0.184 | **0.195** | 0.010 | **3.9** | 0.132 |
| 10   | 0.116 | 0.195 | 0.010 | **3.9** | 0.127 |
| 15   | 0.088 | 0.180 | 0.029 | 3.6 | 0.028 |
| 20   | 0.186 | 0.190 | 0.020 | 3.8 | 0.047 |

What this tells us:
- **No cold-start collapse** β€” reward is non-zero from the very first logged step. The shaped investigation bonus is doing exactly what it was designed to do.
- **Zero tool failures** at every step β€” the 1.7B model calls tools with valid JSON syntax just as reliably as the 4B model.
- **Loss is decreasing**, confirming gradient signal is flowing through the LoRA adapter.
- **Entropy is dropping** (0.132 β†’ 0.028) β€” the model is committing to a policy, not just wandering. It has learned that the `query_database β†’ read_document β†’ submit` pipeline is the winning trajectory.

The high `frac_reward_zero_std` (0.6–0.8) at early steps is expected β€” it means some GRPO groups have identical rollouts, which is normal before the model diversifies its exploration. This resolved naturally in the 4B run around step ~30.

---

## What the Agent Actually Learned

| Metric | Baseline (untrained) | Trained (4B, 300 ep) |
|--------|---------------------|---------------------|
| Mean Reward | 0.09 | 0.20 (peak 0.27) |
| Tool Success Rate | 60% | **100%** |
| Investigation Completeness | 40% | **100%** |
| Tool Calls/Episode | Erratic (1–4) | Stable **4.0** |
| Tool Failures | Frequent | **0** |

The untrained model jumps straight to a decision with no evidence. The trained agent follows a principled audit path: gather evidence, read the contract, then β€” and only then β€” submit with conviction.

Critically, the 1.7B model β€” running on completely different hardware and at a different parameter scale β€” exhibits the *exact same investigation pattern* from its very first training step, confirming that our reward design is robust and transferable.

---

## Technical Summary

| Parameter | 0.6B Run | 4B Run | 1.7B Run (in progress) |
|-----------|----------|--------|------------------------|
| Model | Qwen/Qwen3-0.6B | Qwen/Qwen3-4B | Qwen/Qwen3-1.7B |
| GPU | T4 (Colab) | RTX 4090 (RunPod) | T4 (HF Jobs) |
| Quantization | None | 4-bit (BitsAndBytes) | 4-bit (BitsAndBytes) |
| Adapter | Full model | LoRA (r=16) | LoRA (r=16) |
| Episodes | 500 | 300 | 500 (planned) |
| Training Time | ~2 hours | ~71 minutes | ~7 hours (ongoing) |
| Framework | TRL GRPOTrainer | TRL GRPOTrainer | TRL GRPOTrainer |
| Script | [`train.py`](https://huggingface.co/spaces/musharraf7/esctr-environment/blob/main/train.py) | [`train_4b.py`](https://huggingface.co/spaces/musharraf7/esctr-environment/blob/main/train_4b.py) | [`train_hf_jobs.py`](https://huggingface.co/spaces/musharraf7/esctr-environment/blob/main/train_hf_jobs.py) |

---

## Why This Matters

ESCTR demonstrates that **RLVR can teach LLMs enterprise-grade financial reasoning** β€” a domain nearly absent from existing RL training benchmarks.

Unlike game environments (chess, Snake, tic-tac-toe), our environment tests capabilities that actually exist in production systems:

- **Real-world professional skills** β€” procurement auditing, SLA enforcement, dispute resolution
- **Adversarial reasoning** β€” vendor negotiation where the counterpart is actively trying to deceive you
- **Verifiable, precise rewards** β€” exact floating-point answers derived from contract mathematics
- **Production integration potential** β€” the same tool interface could plug directly into SAP or Oracle as a pre-audit layer

The broader point: this is the kind of environment that pushes the frontier of *what we can train LLMs to do*. Not playing games β€” performing the complex, multi-step reasoning that enterprises actually need and pay billions of dollars for humans to do today.

---

## Links

- 🏒 **Environment Space**: [musharraf7/esctr-environment](https://huggingface.co/spaces/musharraf7/esctr-environment)
- πŸ“Š **Training Dashboard**: [Trackio Space](https://huggingface.co/spaces/musharraf7/esctr-grpo-trained)
- πŸ‹οΈ **Training Scripts**: [`train.py`](https://github.com/Musharraf1128/esctr-environment/blob/main/train.py) Β· [`train_4b.py`](https://github.com/Musharraf1128/esctr-environment/blob/main/train_4b.py) Β· [`train_hf_jobs.py`](https://github.com/Musharraf1128/esctr-environment/blob/main/train_hf_jobs.py)
- πŸ’» **Source Code**: [GitHub](https://github.com/Musharraf1128/esctr-environment)

---

*Built for the [OpenEnv Hackathon](https://github.com/meta-pytorch/OpenEnv) by Musharraf.*