ps2181 committed on
Commit 0ed8b20 · verified · 1 Parent(s): 65e1955

Update README.md

Files changed (1)
  1. README.md +3 -0
README.md CHANGED
@@ -178,16 +178,19 @@ All 3 agents trained with **TRL GRPOTrainer + Unsloth** using the deployed HF Sp
178
  ### Extractor Reward Curve
179
 
180
  ![Extractor Training](https://raw.githubusercontent.com/ps2181/invoice-processing-pipeline/main/assets/reward_curve.png)
181
+ *X-axis: training step (1–20) · Y-axis: reward (0–1). Left: total GRPO reward across 4 independent signals (format 0.10 + field accuracy 0.40 + math 0.25 + completeness 0.25). Right: live `/grader` score peaking at **0.914** – above Qwen 72B baseline (0.67) and untrained 1.5B (0.46).*
182
  *Left: Total GRPO reward across 4 signals (format + field + math + completeness) over 20 training steps. Right: Live environment grader score peaking at **0.914** – above Qwen 72B baseline (0.67) and untrained 1.5B baseline (0.46).*
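
The added caption lists four signal weights that sum to 1.0. Below is a minimal sketch of how such a total could be computed, assuming each signal is scored in [0, 1] and the listed numbers act as linear weights. The names `WEIGHTS` and `total_reward` are hypothetical and do not come from the repository; the actual training code may combine the signals differently.

```python
from typing import Dict

# Hypothetical weights read off the caption (assumption: plain linear weights).
WEIGHTS: Dict[str, float] = {
    "format": 0.10,
    "field_accuracy": 0.40,
    "math": 0.25,
    "completeness": 0.25,
}


def total_reward(signals: Dict[str, float]) -> float:
    """Weighted sum of per-signal scores, with each score clamped to [0, 1]."""
    missing = WEIGHTS.keys() - signals.keys()
    if missing:
        raise KeyError(f"missing signals: {sorted(missing)}")
    return sum(
        weight * min(max(signals[name], 0.0), 1.0)
        for name, weight in WEIGHTS.items()
    )
```

Under these assumptions, a completion with perfect format and completeness, half of the fields correct, and failed arithmetic checks would score 0.10 + 0.20 + 0.00 + 0.25.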
183
 
184
  ### Auditor Reward Curve (Run 2 – Bug Fixed)
185
 
186
  ![Auditor Training Run 2](https://raw.githubusercontent.com/ps2181/invoice-processing-pipeline/main/assets/auditor_reward_curve_run2.png)
187
+ *X-axis: training step (1–30) · Y-axis: reward (0–1). Total reward (blue) and live env reward (orange) with ±1 std band. Best total: **0.719** at step 10. Live env reward climbed from 0.01 (dead signal, Run 1) to **0.52** after fixing the TRL episode_id list indexing bug.*
188
  *Total reward (blue) and live env reward (orange) over 30 steps with ±1 std band. Best total reward: **0.719**. Live env reward rose from 0.01 (dead signal in Run 1) to **0.52** after fixing the episode_id list bug.*
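
The captions attribute the dead Run 1 signal to an `episode_id` list indexing bug but do not show the fix. TRL's GRPOTrainer passes extra dataset columns to custom reward functions as keyword arguments holding one value per completion, so a reward function has to pair each completion with its own entry. The sketch below illustrates that pairing only; `make_live_env_reward` and `score_episode` are hypothetical names, the live environment call is replaced by an injected callable, and completions are assumed to be plain strings. It is not the repository's actual fix.

```python
from typing import Callable, List


def make_live_env_reward(score_episode: Callable[[str, str], float]):
    """Build a TRL-style reward function around a hypothetical scoring callable."""

    def live_env_reward(
        prompts: List[str],
        completions: List[str],
        episode_id: List[str],
        **kwargs,
    ) -> List[float]:
        # episode_id arrives as a list aligned with completions, so index it
        # per completion rather than passing the whole list to the scorer.
        if len(episode_id) != len(completions):
            raise ValueError("episode_id and completions must have equal length")
        return [
            float(score_episode(ep, text))
            for ep, text in zip(episode_id, completions)
        ]

    return live_env_reward
```

A function built this way could be supplied through the trainer's `reward_funcs` argument, with `score_episode` wrapping whatever call the deployed grader expects.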
189
 
190
  ### Generator Reward Curve
191
 
192
  ![Generator Training](https://raw.githubusercontent.com/ps2181/invoice-processing-pipeline/main/assets/generator_reward_curve.png)
193
+ *X-axis: training step (1–30) · Y-axis: reward (0–1). Live evasion reward (red) flat near 0 – Auditor+Approver caught all fraud attempts. Fraud plausibility reward (orange dashed) stable at ~0.20 – Generator learned realistic invoice structure even without successful evasion.*
194
  *Live evasion reward (red) flat near 0 – Auditor+Approver caught all fraud attempts. Fraud plausibility reward (orange dashed) learned and stable at ~0.20, showing the Generator learned to produce realistic-looking invoices even without successful evasion.*
195
 
196
  ### Reward Hacking Caught at Step 10