Commit e5da154
Parent(s): 34a93bb
docs: update image links in BLOG.md to point to raw GitHub URLs for better accessibility
BLOG.md CHANGED
@@ -131,7 +131,7 @@ That is a complete evidence-backed operational path.
131
132   ## Architecture
133
134 - (image link)
135
136   The environment has five main layers:
137
@@ -145,7 +145,7 @@ The environment has five main layers:
145
146   ## Multi-Round Dispute Lifecycle
147
148 - (image link)
149
150   Arbitration is where the environment becomes especially interesting.
151
@@ -201,7 +201,7 @@ The agent has to decide not only what to do, but what to do **now**.
201
202   ChargebackOps uses a composable OpenEnv rubric instead of one monolithic reward.
203
204 - (image link)
135
136   The environment has five main layers:
137
145
146   ## Multi-Round Dispute Lifecycle
147
148 + (image link)
149
150   Arbitration is where the environment becomes especially interesting.
151
201
202   ChargebackOps uses a composable OpenEnv rubric instead of one monolithic reward.
203
204 + (image link)
205
206   | Dimension | Weight | What it measures |
207   |---|---:|---|
255
256   I tested four scripted policies across the headline catalog and multi-seed grid.
257
258 + (image link)
259
260   | Policy | Headline avg | Multi-seed avg | Behavior |
261   |---|---:|---:|---|
328
329   ## Training Results
330
331 + (image link)
332
333   The clearest legitimate learning signal is the SFT checkpoint.
334
355
356   ## Per-Difficulty Behavior
357
358 + (image link)
359
360   The easy and medium cases improve most clearly after SFT.
361
393
394   The invalid action parsed as JSON but failed action validation. Because the evaluation helper fell back to the heuristic on invalid model output, the final score reflected heuristic behavior rather than trained-model behavior.
395
396 + (image link)
397
398   This produced a clear rule for typed-action RL environments:
399