srinjoyd committed
Commit dca255f · Parent: 8c26ecf

fix issues with blog

BLOG.md CHANGED
@@ -6,9 +6,7 @@
 
 > **TL;DR.** We built `incident_env` — an OpenEnv POMDP where an LLM agent has to diagnose a live, evolving production incident and then attribute it to a specific commit in a small repo. Then we trained **Qwen2.5-7B-Instruct** through a four-stage curriculum (baseline rollouts → LoRA SFT → online GRPO with `r_cross` → merge). The post-trained model reaches a **mean cumulative reward of ≈1.59 vs ≈0.49** for the base, **at less than half the steps**, with tighter variance and dominant CDF across the operating range.
 
-<p align="center">
-  <img src="docs/diagrams/pipeline.svg" alt="SRE Triage Bot training pipeline" width="100%"/>
-</p>
+![SRE Triage Bot training pipeline](./assets/pipeline.svg)
 
 > 🧭 **One-line pitch.** *Most agent benchmarks freeze a repo and ask the model to fix it. Our environment refuses to sit still — memory climbs, alerts cascade, and the obvious symptom is almost never the cause.*
 
@@ -51,9 +49,7 @@ Each service has live metric history (CPU, memory, p50/p95/p99 latency, error ra
 
 ### The agent loop
 
-<p align="center">
-  <img src="docs/diagrams/agent_loop.svg" alt="Agent loop POMDP" width="92%"/>
-</p>
+![Agent loop POMDP](./assets/agent_loop.svg)
 
 Per-step execution is `validate → mutate → tick → observe → reward`. Two facts make the loop interesting:
 
@@ -257,9 +253,7 @@ flowchart TD
 
 Before the stage-by-stage detail, here is the architectural view: a **three-level hierarchy** with an orchestrator routing policy on top, two specialised subagents below it, and segment-level GRPO with cross-phase reward propagation underneath both.
 
-<p align="center">
-  <img src="docs/diagrams/hierarchical_rl_architecture.svg" alt="Hierarchical RL architecture — orchestrator + specialized subagents + segment-level GRPO with r_cross" width="78%"/>
-</p>
+![Hierarchical RL architecture — orchestrator + specialized subagents + segment-level GRPO with r_cross](./assets/hierarchical_rl_architecture.svg)
 
 Three things to notice in this picture:
 
@@ -386,7 +380,7 @@ The output is a vanilla causal LM that vLLM, TGI, or plain `transformers` can lo
 
 ### Figure 1 — Reward distribution (CDF)
 
-<p align="center"><img src="docs/img/cdf.png" alt="reward CDF per source" width="80%"/></p>
+![Reward CDF per source — Baseline, SFT, Posttrained RL](./assets/cdf.png)
 
 > *Empirical CDF of cumulative reward — lower curve = better (more probability mass at high reward).*
 
@@ -396,7 +390,7 @@ The output is a vanilla causal LM that vLLM, TGI, or plain `transformers` can lo
 
 ### Figure 2 — Efficiency curve (reward vs. steps)
 
-<p align="center"><img src="docs/img/efficiency.png" alt="efficiency curve" width="80%"/></p>
+![Efficiency curve — mean cumulative reward vs total steps (P1 + P2)](./assets/efficiency.png)
 
 | Model | Mean reward by ~30 steps | Steps to plateau | σ at plateau |
 | --- | --- | --- | --- |
@@ -604,25 +598,36 @@ Every mathematical symbol used above, gathered for reference.
 
 ### Appendix B · Diagram source files in this repo
 
+All images live in **`./assets/`** at the repo root — the canonical HF Spaces convention. Paths in this blog use plain markdown image syntax (`![alt](./assets/file.svg)`) so they render the same way in:
+
+- the **HF Space README/blog** (relative-path resolution),
+- the **Hugging Face blog** (`huggingface.co/blog/...`),
+- a **GitHub mirror** (no path changes needed),
+- and a **local Markdown preview**.
+
 | File | Used in | Notes |
 | --- | --- | --- |
-| `docs/diagrams/pipeline.svg` | §0 hero, §6 | Five-stage horizontal pipeline (data flow Base → SFT → GRPO → Merge). Edit text/colors directly in the file. |
-| `docs/diagrams/agent_loop.svg` | §2 | Agent ↔ env loop with the partial-observation card. |
-| `docs/diagrams/hierarchical_rl_architecture.svg` | §6 | Three-level hierarchy — orchestrator + subagents + segment-level GRPO with `r_cross`. The *gradient* view that complements pipeline.svg's *data* view. |
-| Mermaid blocks (inline) | §2, §3, §5 | Render natively on GitHub / HF blogs. |
+| `./assets/pipeline.svg` | §0 hero, §6 | Five-stage horizontal pipeline (data flow Base → SFT → GRPO → Merge). |
+| `./assets/agent_loop.svg` | §2 | Agent ↔ env loop with the partial-observation card. |
+| `./assets/hierarchical_rl_architecture.svg` | §6 | Three-level hierarchy — orchestrator + subagents + segment-level GRPO with `r_cross`. The *gradient* view that complements pipeline.svg's *data* view. |
+| `./assets/cdf.png` | §7 Figure 1 | Reward CDF per source — drop your chart at this path. |
+| `./assets/efficiency.png` | §7 Figure 2 | Efficiency curve — drop your chart at this path. |
+| Mermaid blocks (inline) | §2, §3, §5 | Render natively on GitHub and HF Space markdown. |
 
-**To render the SVGs as PNGs (e.g. for Twitter / slide decks):**
+**To render the SVGs as PNGs (for Twitter / slide decks):**
 
 ```bash
 # Either:
-npx svgexport docs/diagrams/pipeline.svg docs/img/pipeline.png 2x
+npx svgexport ./assets/pipeline.svg ./assets/pipeline.png 2x
 # or:
-rsvg-convert -z 2 -o docs/img/pipeline.png docs/diagrams/pipeline.svg
+rsvg-convert -z 2 -o ./assets/pipeline.png ./assets/pipeline.svg
 ```
 
 **To replace the result figures**, drop your two charts at:
 
-- `docs/img/cdf.png` — Figure 1 (reward distribution per source)
-- `docs/img/efficiency.png` — Figure 2 (reward vs. steps)
+- `./assets/cdf.png` — Figure 1 (reward distribution per source)
+- `./assets/efficiency.png` — Figure 2 (reward vs. steps)
 
 The blog already links to those paths.
+
+> **Why `./assets/...`?** HF Spaces resolve relative paths from the rendered file's directory. Putting `BLOG.md` at the repo root and all images under `./assets/` means every link works without ever rewriting a URL — no `https://huggingface.co/spaces/<owner>/<name>/resolve/main/...` boilerplate, no broken paths if the Space is forked.
{docs/diagrams → assets}/agent_loop.svg RENAMED
File without changes
{docs/diagrams → assets}/hierarchical_rl_architecture.svg RENAMED
File without changes
{docs/diagrams → assets}/pipeline.svg RENAMED
File without changes
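Since this commit both moves every diagram from `docs/diagrams/` to `./assets/` and rewrites the embeds to plain markdown image syntax, a quick sanity check is to scan the blog for `![alt](path)` references and confirm each path exists on disk. A minimal sketch — the `check_image_links` helper is illustrative, not part of the repo:

```python
import re
from pathlib import Path

def check_image_links(md_text: str, root: Path) -> list[str]:
    """Return markdown image paths in md_text that do not resolve under root."""
    # Match the plain ![alt](path) image syntax this commit switches to.
    paths = re.findall(r"!\[[^\]]*\]\(([^)\s]+)\)", md_text)
    # Strip a leading "./" so relative paths join cleanly onto the root dir.
    return [p for p in paths if not (root / p.lstrip("./")).exists()]
```

Running `check_image_links(Path("BLOG.md").read_text(), Path("."))` from the Space root should come back empty after this commit; any leftover `docs/diagrams/...` or `docs/img/...` reference would show up in the returned list.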