fix issues with blog
BLOG.md CHANGED
@@ -6,9 +6,7 @@

> **TL;DR.** We built `incident_env` – an OpenEnv POMDP where an LLM agent has to diagnose a live, evolving production incident and then attribute it to a specific commit in a small repo. Then we trained **Qwen2.5-7B-Instruct** through a four-stage curriculum (baseline rollouts → LoRA SFT → online GRPO with `r_cross` → merge). The post-trained model reaches a **mean cumulative reward of ≈1.59 vs ≈0.49** for the base, **at less than half the steps**, with tighter variance and a dominant CDF across the operating range.

-
- <img src="docs/diagrams/pipeline.svg" alt="SRE Triage Bot training pipeline" width="100%"/>
- </p>

> **One-line pitch.** *Most agent benchmarks freeze a repo and ask the model to fix it. Our environment refuses to sit still – memory climbs, alerts cascade, and the obvious symptom is almost never the cause.*
@@ -51,9 +49,7 @@ Each service has live metric history (CPU, memory, p50/p95/p99 latency, error ra

### The agent loop

-
- <img src="docs/diagrams/agent_loop.svg" alt="Agent loop POMDP" width="92%"/>
- </p>

Per-step execution is `validate → mutate → tick → observe → reward`. Two facts make the loop interesting:
@@ -257,9 +253,7 @@ flowchart TD

Before the stage-by-stage detail, here is the architectural view: a **three-level hierarchy** with an orchestrator routing policy on top, two specialised subagents below it, and segment-level GRPO with cross-phase reward propagation underneath both.

-
- <img src="docs/diagrams/hierarchical_rl_architecture.svg" alt="Hierarchical RL architecture – orchestrator + specialized subagents + segment-level GRPO with r_cross" width="78%"/>
- </p>

Three things to notice in this picture:
@@ -386,7 +380,7 @@ The output is a vanilla causal LM that vLLM, TGI, or plain `transformers` can lo

### Figure 1 – Reward distribution (CDF)

-

> *Empirical CDF of cumulative reward – lower curve = better (more probability mass at high reward).*
@@ -396,7 +390,7 @@ The output is a vanilla causal LM that vLLM, TGI, or plain `transformers` can lo

### Figure 2 – Efficiency curve (reward vs. steps)

-

| Model | Mean reward by ~30 steps | Steps to plateau | σ at plateau |
| --- | --- | --- | --- |
@@ -604,25 +598,36 @@ Every mathematical symbol used above, gathered for reference.

### Appendix B · Diagram source files in this repo

| File | Used in | Notes |
| --- | --- | --- |
- | `
- | `
- | `
-

- **To render the SVGs as PNGs (

```bash
# Either:
- npx svgexport
# or:
- rsvg-convert -z 2 -o
```

**To replace the result figures**, drop your two charts at:

- - `
- - `

The blog already links to those paths.
> **TL;DR.** We built `incident_env` – an OpenEnv POMDP where an LLM agent has to diagnose a live, evolving production incident and then attribute it to a specific commit in a small repo. Then we trained **Qwen2.5-7B-Instruct** through a four-stage curriculum (baseline rollouts → LoRA SFT → online GRPO with `r_cross` → merge). The post-trained model reaches a **mean cumulative reward of ≈1.59 vs ≈0.49** for the base, **at less than half the steps**, with tighter variance and a dominant CDF across the operating range.

+ ![SRE Triage Bot training pipeline](./assets/pipeline.svg)

> **One-line pitch.** *Most agent benchmarks freeze a repo and ask the model to fix it. Our environment refuses to sit still – memory climbs, alerts cascade, and the obvious symptom is almost never the cause.*
### The agent loop

+ ![Agent loop POMDP](./assets/agent_loop.svg)

Per-step execution is `validate → mutate → tick → observe → reward`. Two facts make the loop interesting:
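The `validate → mutate → tick → observe → reward` contract from this hunk can be sketched as a toy environment. Everything below is illustrative – the class, tool names, and reward values are invented for this sketch, not the real `incident_env` API:

```python
# Toy sketch of the validate -> mutate -> tick -> observe -> reward contract.
# All names and numbers are illustrative; the real incident_env API may differ.
from dataclasses import dataclass


@dataclass
class ToyIncidentEnv:
    memory_mb: float = 512.0
    step_count: int = 0

    def step(self, action: dict) -> tuple[dict, float]:
        # validate: reject malformed actions instead of crashing the episode
        if action.get("tool") not in {"query_metrics", "read_logs", "blame_commit"}:
            return self._observe(), -1.0
        # mutate: apply the action's side effects (read-only tools mutate nothing)
        # tick: the world evolves even while the agent "thinks" -- memory climbs
        self.memory_mb *= 1.05
        self.step_count += 1
        # observe + reward: partial observation plus a small per-step cost
        return self._observe(), -0.1

    def _observe(self) -> dict:
        # Partial observability: the agent sees metrics, never the injected fault
        return {"memory_mb": round(self.memory_mb, 1), "step": self.step_count}


env = ToyIncidentEnv()
obs, reward = env.step({"tool": "query_metrics"})  # memory has already climbed
```

The point of the sketch is the ordering: the tick happens inside `step`, so even a "do nothing" observation advances the incident.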
Before the stage-by-stage detail, here is the architectural view: a **three-level hierarchy** with an orchestrator routing policy on top, two specialised subagents below it, and segment-level GRPO with cross-phase reward propagation underneath both.

+ ![Hierarchical RL architecture – orchestrator + specialized subagents + segment-level GRPO with r_cross](./assets/hierarchical_rl_architecture.svg)

Three things to notice in this picture:
### Figure 1 – Reward distribution (CDF)

+ ![Reward CDF per source](./assets/cdf.png)

> *Empirical CDF of cumulative reward – lower curve = better (more probability mass at high reward).*
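For readers reproducing a figure like this, an empirical CDF is cheap to compute by hand. A stdlib-only sketch, with invented reward samples rather than the post's data:

```python
# Empirical CDF in the style of Figure 1; the reward samples are invented.
def empirical_cdf(samples: list[float]) -> list[tuple[float, float]]:
    xs = sorted(samples)
    n = len(xs)
    # F(x) = fraction of samples <= x, evaluated at each sorted sample
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]


rewards = [0.2, 1.4, 0.9, 1.7, 0.5]
print(empirical_cdf(rewards))
# [(0.2, 0.2), (0.5, 0.4), (0.9, 0.6), (1.4, 0.8), (1.7, 1.0)]
```

In this view, a curve that sits lower at a given reward value has more probability mass above it – which is why the caption reads "lower curve = better".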
### Figure 2 – Efficiency curve (reward vs. steps)

+ ![Efficiency curve – reward vs. steps](./assets/efficiency.png)

| Model | Mean reward by ~30 steps | Steps to plateau | σ at plateau |
| --- | --- | --- | --- |
### Appendix B · Diagram source files in this repo

+ All images live in **`./assets/`** at the repo root – the canonical HF Spaces convention. Paths in this blog use plain markdown image syntax (`![alt](./assets/file.svg)`) so they render the same way in:
+
+ - the **HF Space README/blog** (relative-path resolution),
+ - the **Hugging Face blog** (`huggingface.co/blog/...`),
+ - a **GitHub mirror** (no path changes needed),
+ - and a **local Markdown preview**.
+
| File | Used in | Notes |
| --- | --- | --- |
+ | `./assets/pipeline.svg` | §0 hero, §6 | Five-stage horizontal pipeline (data flow Base → SFT → GRPO → Merge). |
+ | `./assets/agent_loop.svg` | §2 | Agent ↔ env loop with the partial-observation card. |
+ | `./assets/hierarchical_rl_architecture.svg` | §6 | Three-level hierarchy – orchestrator + subagents + segment-level GRPO with `r_cross`. The *gradient* view that complements pipeline.svg's *data* view. |
+ | `./assets/cdf.png` | §7 Figure 1 | Reward CDF per source – drop your chart at this path. |
+ | `./assets/efficiency.png` | §7 Figure 2 | Efficiency curve – drop your chart at this path. |
+ | Mermaid blocks (inline) | §2, §3, §5 | Render natively on GitHub and HF Space markdown. |

+ **To render the SVGs as PNGs (for Twitter / slide decks):**

```bash
# Either:
+ npx svgexport ./assets/pipeline.svg ./assets/pipeline.png 2x
# or:
+ rsvg-convert -z 2 -o ./assets/pipeline.png ./assets/pipeline.svg
```
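With more than a handful of diagrams, the same conversion can be scripted. A hypothetical helper (not part of the repo) that shells out to the `rsvg-convert` invocation above for every SVG under `./assets/`, assuming the tool is on `PATH`:

```python
# Hypothetical batch wrapper around the rsvg-convert command shown above.
from pathlib import Path
import subprocess


def render_pngs(assets_dir: str = "./assets", zoom: int = 2) -> list[Path]:
    outputs = []
    for svg in sorted(Path(assets_dir).glob("*.svg")):
        png = svg.with_suffix(".png")  # ./assets/pipeline.svg -> ./assets/pipeline.png
        subprocess.run(
            ["rsvg-convert", "-z", str(zoom), "-o", str(png), str(svg)],
            check=True,  # fail loudly if a conversion breaks
        )
        outputs.append(png)
    return outputs
```

Each PNG lands next to its source SVG, so the appendix table's paths stay valid for both formats.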
**To replace the result figures**, drop your two charts at:

+ - `./assets/cdf.png` – Figure 1 (reward distribution per source)
+ - `./assets/efficiency.png` – Figure 2 (reward vs. steps)

The blog already links to those paths.
+
+ > **Why `./assets/...`?** HF Spaces resolve relative paths from the rendered file's directory. Putting `BLOG.md` at the repo root and all images under `./assets/` means every link works without ever rewriting a URL – no `https://huggingface.co/spaces/<owner>/<name>/resolve/main/...` boilerplate, no broken paths if the Space is forked.
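The resolution rule in that note can be demonstrated with a few lines of stdlib path arithmetic. `resolve_link` is an illustrative function written for this post, not part of the repo:

```python
# Why ./assets works from a root-level BLOG.md: relative links resolve against
# the rendered file's directory. Illustrative helper, stdlib only.
from posixpath import join, normpath


def resolve_link(doc_path: str, link: str) -> str:
    base = doc_path.rsplit("/", 1)[0] if "/" in doc_path else "."
    return normpath(join(base, link))


print(resolve_link("BLOG.md", "./assets/pipeline.svg"))       # assets/pipeline.svg
print(resolve_link("docs/BLOG.md", "./assets/pipeline.svg"))  # docs/assets/pipeline.svg
```

The second call shows exactly the broken path you would get if the blog moved into a subdirectory without its links being rewritten.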
{docs/diagrams → assets}/agent_loop.svg
RENAMED (file without changes)

{docs/diagrams → assets}/hierarchical_rl_architecture.svg
RENAMED (file without changes)

{docs/diagrams → assets}/pipeline.svg
RENAMED (file without changes)