mitudrudutta committed on
Commit e5da154 · 1 Parent(s): 34a93bb

docs: update image links in BLOG.md to point to raw GitHub URLs for better accessibility
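
The link rewrite described in the commit message can be sketched as a small script. This is only an illustration under stated assumptions: the commit does not say how the links were edited, and the names `to_raw_github_url` and `rewrite_image_links` and the regular expression are hypothetical. The two URL prefixes are copied from the removed and added lines of the diff.

```python
import re

# Target prefix, copied from the URLs added in this commit.
RAW_BASE = "https://raw.githubusercontent.com/MitudruDutta/chargebackops/main/"
# Old Hugging Face prefix, copied from the removed lines.
HF_PREFIX = "https://huggingface.co/spaces/mitudrudutta/ChargeBackOps/resolve/main/"

# Matches a Markdown image of the form ![alt](target) (hypothetical helper regex).
IMAGE_RE = re.compile(r"!\[(?P<alt>[^\]]*)\]\((?P<target>[^)\s]+)\)")
# Matches any absolute URL scheme such as https://
SCHEME_RE = re.compile(r"^[a-z][a-z0-9+.-]*://", re.IGNORECASE)


def to_raw_github_url(target: str) -> str:
    """Map a repo-relative path or a Hugging Face resolve URL to a raw GitHub URL."""
    if target.startswith(HF_PREFIX):
        return RAW_BASE + target[len(HF_PREFIX):]
    if SCHEME_RE.match(target):
        return target  # some other absolute URL: leave it untouched
    if target.startswith("./"):
        target = target[2:]
    return RAW_BASE + target


def rewrite_image_links(markdown: str) -> str:
    """Rewrite the target of every Markdown image link; alt text is kept as-is."""
    def _sub(match: re.Match) -> str:
        return f"![{match.group('alt')}]({to_raw_github_url(match.group('target'))})"
    return IMAGE_RE.sub(_sub, markdown)


if __name__ == "__main__":
    print(rewrite_image_links("![Architecture diagram](docs/figures/architecture.png)"))
```

The sketch only changes link targets. The commit also expands the alt text of the first two images, which would still need a manual edit.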

Files changed (1)
  1. BLOG.md +7 -7
BLOG.md CHANGED
@@ -131,7 +131,7 @@ That is a complete evidence-backed operational path.
 
 ## Architecture
 
-![Architecture diagram](docs/figures/architecture.png)
+![Architecture diagram — five layers from Pydantic models through grading](https://raw.githubusercontent.com/MitudruDutta/chargebackops/main/docs/figures/architecture.png)
 
 The environment has five main layers:
 
@@ -145,7 +145,7 @@ The environment has five main layers:
 
 ## Multi-Round Dispute Lifecycle
 
-![Multi-round dispute lifecycle](docs/figures/multi_round_dispute_lifecycle.png)
+![Multi-round dispute lifecycle — representment, pre-arbitration, arbitration, and terminal P&L](https://raw.githubusercontent.com/MitudruDutta/chargebackops/main/docs/figures/multi_round_dispute_lifecycle.png)
 
 Arbitration is where the environment becomes especially interesting.
 
@@ -201,7 +201,7 @@ The agent has to decide not only what to do, but what to do **now**.
 
 ChargebackOps uses a composable OpenEnv rubric instead of one monolithic reward.
 
-![8-dimensional rubric weights](https://huggingface.co/spaces/mitudrudutta/ChargeBackOps/resolve/main/docs/figures/rubric_weights.png)
+![8-dimensional rubric weights](https://raw.githubusercontent.com/MitudruDutta/chargebackops/main/docs/figures/rubric_weights.png)
 
 | Dimension | Weight | What it measures |
 |---|---:|---|
@@ -255,7 +255,7 @@ Yes.
 
 I tested four scripted policies across the headline catalog and multi-seed grid.
 
-![Policy discrimination benchmark](https://huggingface.co/spaces/mitudrudutta/ChargeBackOps/resolve/main/docs/figures/discrimination_gradient.png)
+![Policy discrimination benchmark](https://raw.githubusercontent.com/MitudruDutta/chargebackops/main/docs/figures/discrimination_gradient.png)
 
 | Policy | Headline avg | Multi-seed avg | Behavior |
 |---|---:|---:|---|
@@ -328,7 +328,7 @@ The reason for using outcome reward is simple: the goal is not just to imitate a
 
 ## Training Results
 
-![Training curve](https://huggingface.co/spaces/mitudrudutta/ChargeBackOps/resolve/main/docs/figures/training_curve.png)
+![Training curve](https://raw.githubusercontent.com/MitudruDutta/chargebackops/main/docs/figures/training_curve.png)
 
 The clearest legitimate learning signal is the SFT checkpoint.
 
@@ -355,7 +355,7 @@ The SFT model learned the interface and improved over the base model.
 
 ## Per-Difficulty Behavior
 
-![Training curve by family](https://huggingface.co/spaces/mitudrudutta/ChargeBackOps/resolve/main/docs/figures/training_curve_by_family.png)
+![Training curve by family](https://raw.githubusercontent.com/MitudruDutta/chargebackops/main/docs/figures/training_curve_by_family.png)
 
 The easy and medium cases improve most clearly after SFT.
 
@@ -393,7 +393,7 @@ The closest valid actions are:
 
 The invalid action parsed as JSON but failed action validation. Because the evaluation helper fell back to the heuristic on invalid model output, the final score reflected heuristic behavior rather than trained-model behavior.
 
-![Gaming attribution](https://huggingface.co/spaces/mitudrudutta/ChargeBackOps/resolve/main/docs/figures/gaming_attribution.png)
+![Gaming attribution](https://raw.githubusercontent.com/MitudruDutta/chargebackops/main/docs/figures/gaming_attribution.png)
 
 This produced a clear rule for typed-action RL environments:
 