Jayant2304 committed on
Commit 2c07089 · verified · 1 Parent(s): ca276e9

Update Blog.md

Files changed (1)
  1. Blog.md +3 -2
Blog.md CHANGED
@@ -106,7 +106,7 @@ Rather than abstract descriptions, here's what the agent actually faces. These a
  ---

  ### Scenario 1: The Email That Breaks Everything
- *(easy_008 — medium difficulty)*
+ *(med_008 — medium difficulty)*

  It's 2:45 PM. You're on a live client call with Client_Jones that ends at 3:15.

@@ -265,8 +265,9 @@ The training loop connects directly to the live CommitmentOS API — not a stati
  | | Pre-RL | Post-RL |
  |--|--------|---------|
  | Success rate (reward ≥ 0.6) | 46.7% | **60.0%** |
+ | Hard task mean reward | 0.560 | **0.612** |

- Gains concentrated on hard tasks, exactly where long commitment chains matter most.
+ With 30 GRPO steps on a 1.5B model, mean reward is essentially flat, which is expected at this compute scale. The success rate improvement is real: two additional tasks cross the threshold after training, with the clearest gains on hard scenarios where commitment tracking across 8–15 turns matters most. Longer training would likely amplify these results.

  Full weights + artifacts: [Google Drive bundle](https://drive.google.com/drive/folders/1yexZBSqyH7gWlTzYN5DlX3tXfPMmeVAK?usp=sharing)

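
The success-rate row in the results table can be sanity-checked with a few lines of Python. This is only a sketch: the `success_rate` helper and the per-task reward lists are hypothetical placeholders, not the actual evaluation data. The 15-task count is an inference from the reported figures (46.7% ≈ 7/15, 60.0% = 9/15, a difference of two tasks), not something the commit states.

```python
def success_rate(rewards, threshold=0.6):
    """Fraction of tasks whose reward is at least `threshold`."""
    if not rewards:
        raise ValueError("rewards must be non-empty")
    passed = sum(1 for r in rewards if r >= threshold)
    return passed / len(rewards)


# Placeholder rewards (NOT real results): 15 tasks, 7 at or above 0.6.
pre_rl = [0.90, 0.80, 0.75, 0.70, 0.65, 0.62, 0.60,
          0.58, 0.55, 0.50, 0.45, 0.40, 0.30, 0.20, 0.10]

# Same placeholder list with two borderline tasks nudged over the threshold.
post_rl = list(pre_rl)
post_rl[7] = 0.61
post_rl[8] = 0.63

pre_label = f"{success_rate(pre_rl):.1%}"
post_label = f"{success_rate(post_rl):.1%}"
print(pre_label, post_label)
```

Because the metric is a hard threshold, two tasks moving from just under 0.6 to just over it shift the success rate by about 13 points while barely changing the mean reward, which matches the pattern the table describes.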