Update Blog.md
```diff
--- a/Blog.md
+++ b/Blog.md
@@ -106,7 +106,7 @@ Rather than abstract descriptions, here's what the agent actually faces. These a
 ---
 
 ### Scenario 1: The Email That Breaks Everything
-*(
+*(med_008 — medium difficulty)*
 
 It's 2:45 PM. You're on a live client call with Client_Jones that ends at 3:15.
 
@@ -265,8 +265,9 @@ The training loop connects directly to the live CommitmentOS API — not a stati
 | | Pre-RL | Post-RL |
 |--|--------|---------|
 | Success rate (reward ≥ 0.6) | 46.7% | **60.0%** |
+| Hard task mean reward | 0.560 | **0.612** |
 
-
+With 30 GRPO steps on a 1.5B model, mean reward is essentially flat — expected at this compute scale. The success rate improvement is real: 2 additional tasks cross the threshold after training, with the clearest gains on hard scenarios where commitment tracking across 8–15 turns matters most. Longer training would amplify these results.
 
 Full weights + artifacts: [Google Drive bundle](https://drive.google.com/drive/folders/1yexZBSqyH7gWlTzYN5DlX3tXfPMmeVAK?usp=sharing)
 
```
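The paragraph added in this commit depends on the difference between a threshold metric and a mean. The following minimal Python sketch illustrates that distinction, assuming that "success rate" is simply the fraction of evaluation tasks with reward ≥ 0.6, as the table defines it. The per-task rewards are invented for illustration and are not the actual CommitmentOS results. They are chosen so that 7 of 15 tasks pass before training and 9 of 15 after. That split is one split consistent with the reported 46.7% and 60.0% and with "2 additional tasks". The 15-task count is inferred from those figures, not stated in the post. `success_rate`, `pre_rl` and `post_rl` are hypothetical names.

```python
# Illustrative only: these per-task rewards are HYPOTHETICAL, not the real
# evaluation data. They give 7/15 passing before RL and 9/15 after, which
# matches the reported 46.7% -> 60.0% success rates.
from statistics import mean

THRESHOLD = 0.6  # a task counts as a success when reward >= 0.6


def success_rate(rewards, threshold=THRESHOLD):
    """Fraction of tasks whose reward meets or exceeds the threshold."""
    return sum(r >= threshold for r in rewards) / len(rewards)


pre_rl = [0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.62,  # 7 tasks pass
          0.58, 0.57,                                # 2 tasks just below 0.6
          0.40, 0.35, 0.30, 0.25, 0.20, 0.15]

post_rl = [0.88, 0.83, 0.78, 0.73, 0.68, 0.63, 0.61,  # same 7, slightly lower
           0.61, 0.60,                                # the 2 near-misses cross 0.6
           0.40, 0.35, 0.30, 0.25, 0.20, 0.15]

if __name__ == "__main__":
    print(f"success rate: {success_rate(pre_rl):.1%} -> {success_rate(post_rl):.1%}")
    print(f"mean reward:  {mean(pre_rl):.3f} -> {mean(post_rl):.3f}")
```

In this example, two tasks move just over 0.6 and raise the success rate by about 13 points, while the mean reward dips slightly. This shows how a flat mean and a genuine gain in success rate can coexist.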