fix issues with blog
BLOG.md CHANGED
@@ -6,9 +6,7 @@

> **TL;DR.** We built `incident_env` – an OpenEnv POMDP where an LLM agent has to diagnose a live, evolving production incident and then attribute it to a specific commit in a small repo. Then we trained **Qwen2.5-7B-Instruct** through a four-stage curriculum (baseline rollouts → LoRA SFT → online GRPO with `r_cross` → merge). The post-trained model reaches a **mean cumulative reward of ≈1.59 vs ≈0.49** for the base, **at less than half the steps**, with tighter variance and a dominant CDF across the operating range.

-
- <img src="docs/diagrams/pipeline.svg" alt="SRE Triage Bot training pipeline" width="100%"/>
- </p>

> **One-line pitch.** *Most agent benchmarks freeze a repo and ask the model to fix it. Our environment refuses to sit still – memory climbs, alerts cascade, and the obvious symptom is almost never the cause.*
@@ -51,9 +49,7 @@ Each service has live metric history (CPU, memory, p50/p95/p99 latency, error ra

### The agent loop

-
- <img src="docs/diagrams/agent_loop.svg" alt="Agent loop POMDP" width="92%"/>
- </p>

Per-step execution is `validate → mutate → tick → observe → reward`. Two facts make the loop interesting:
@@ -257,9 +253,7 @@ flowchart TD

Before the stage-by-stage detail, here is the architectural view: a **three-level hierarchy** with an orchestrator routing policy on top, two specialised subagents below it, and segment-level GRPO with cross-phase reward propagation underneath both.

-
- <img src="docs/diagrams/hierarchical_rl_architecture.svg" alt="Hierarchical RL architecture – orchestrator + specialized subagents + segment-level GRPO with r_cross" width="78%"/>
- </p>

Three things to notice in this picture:
@@ -386,7 +380,7 @@ The output is a vanilla causal LM that vLLM, TGI, or plain `transformers` can lo

### Figure 1 – Reward distribution (CDF)

-

> *Empirical CDF of cumulative reward – lower curve = better (more probability mass at high reward).*
@@ -396,7 +390,7 @@ The output is a vanilla causal LM that vLLM, TGI, or plain `transformers` can lo

### Figure 2 – Efficiency curve (reward vs. steps)

-

| Model | Mean reward by ~30 steps | Steps to plateau | σ at plateau |
| --- | --- | --- | --- |
@@ -604,25 +598,36 @@ Every mathematical symbol used above, gathered for reference.

### Appendix B · Diagram source files in this repo

| File | Used in | Notes |
| --- | --- | --- |
- | `
- | `
- | `
-

- **To render the SVGs as PNGs (

```bash
# Either:
- npx svgexport
# or:
- rsvg-convert -z 2 -o
```

**To replace the result figures**, drop your two charts at:

- - `
- - `

The blog already links to those paths.
> **TL;DR.** We built `incident_env` – an OpenEnv POMDP where an LLM agent has to diagnose a live, evolving production incident and then attribute it to a specific commit in a small repo. Then we trained **Qwen2.5-7B-Instruct** through a four-stage curriculum (baseline rollouts → LoRA SFT → online GRPO with `r_cross` → merge). The post-trained model reaches a **mean cumulative reward of ≈1.59 vs ≈0.49** for the base, **at less than half the steps**, with tighter variance and a dominant CDF across the operating range.

+ ![SRE Triage Bot training pipeline](./assets/pipeline.svg)

> **One-line pitch.** *Most agent benchmarks freeze a repo and ask the model to fix it. Our environment refuses to sit still – memory climbs, alerts cascade, and the obvious symptom is almost never the cause.*
### The agent loop

+ ![Agent loop POMDP](./assets/agent_loop.svg)

Per-step execution is `validate → mutate → tick → observe → reward`. Two facts make the loop interesting:
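The `validate → mutate → tick → observe → reward` contract from this hunk can be sketched as a toy environment. Everything below is illustrative – the class, tool names, and reward values are invented for this sketch, not the real `incident_env` API:

```python
# Toy sketch of the validate -> mutate -> tick -> observe -> reward contract.
# All names and numbers are illustrative; the real incident_env API may differ.
from dataclasses import dataclass


@dataclass
class ToyIncidentEnv:
    memory_mb: float = 512.0
    step_count: int = 0

    def step(self, action: dict) -> tuple[dict, float]:
        # validate: reject malformed actions instead of crashing the episode
        if action.get("tool") not in {"query_metrics", "read_logs", "blame_commit"}:
            return self._observe(), -1.0
        # mutate: apply the action's side effects (read-only tools mutate nothing)
        # tick: the world evolves even while the agent "thinks" -- memory climbs
        self.memory_mb *= 1.05
        self.step_count += 1
        # observe + reward: partial observation plus a small per-step cost
        return self._observe(), -0.1

    def _observe(self) -> dict:
        # Partial observability: the agent sees metrics, never the injected fault
        return {"memory_mb": round(self.memory_mb, 1), "step": self.step_count}


env = ToyIncidentEnv()
obs, reward = env.step({"tool": "query_metrics"})  # memory has already climbed
```

The point of the sketch is the ordering: the tick happens inside `step`, so even a "do nothing" observation advances the incident.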
Before the stage-by-stage detail, here is the architectural view: a **three-level hierarchy** with an orchestrator routing policy on top, two specialised subagents below it, and segment-level GRPO with cross-phase reward propagation underneath both.

+ ![Hierarchical RL architecture – orchestrator + specialized subagents + segment-level GRPO with r_cross](./assets/hierarchical_rl_architecture.svg)

Three things to notice in this picture:
### Figure 1 – Reward distribution (CDF)

+ ![Reward CDF per source](./assets/cdf.png)

> *Empirical CDF of cumulative reward – lower curve = better (more probability mass at high reward).*
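For readers reproducing a figure like this, an empirical CDF is cheap to compute by hand. A stdlib-only sketch, with invented reward samples rather than the post's data:

```python
# Empirical CDF in the style of Figure 1; the reward samples are invented.
def empirical_cdf(samples: list[float]) -> list[tuple[float, float]]:
    xs = sorted(samples)
    n = len(xs)
    # F(x) = fraction of samples <= x, evaluated at each sorted sample
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]


rewards = [0.2, 1.4, 0.9, 1.7, 0.5]
print(empirical_cdf(rewards))
# [(0.2, 0.2), (0.5, 0.4), (0.9, 0.6), (1.4, 0.8), (1.7, 1.0)]
```

In this view, a curve that sits lower at a given reward value has more probability mass above it – which is why the caption reads "lower curve = better".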
### Figure 2 – Efficiency curve (reward vs. steps)

+ ![Efficiency curve – reward vs. steps](./assets/efficiency.png)

| Model | Mean reward by ~30 steps | Steps to plateau | σ at plateau |
| --- | --- | --- | --- |
### Appendix B · Diagram source files in this repo

+ All images live in **`./assets/`** at the repo root – the canonical HF Spaces convention. Paths in this blog use plain markdown image syntax (`![alt](./assets/file.svg)`) so they render the same way in:
+
+ - the **HF Space README/blog** (relative-path resolution),
+ - the **Hugging Face blog** (`huggingface.co/blog/...`),
+ - a **GitHub mirror** (no path changes needed),
+ - and a **local Markdown preview**.
+
| File | Used in | Notes |
| --- | --- | --- |
+ | `./assets/pipeline.svg` | §0 hero, §6 | Five-stage horizontal pipeline (data flow Base → SFT → GRPO → Merge). |
+ | `./assets/agent_loop.svg` | §2 | Agent ↔ env loop with the partial-observation card. |
+ | `./assets/hierarchical_rl_architecture.svg` | §6 | Three-level hierarchy – orchestrator + subagents + segment-level GRPO with `r_cross`. The *gradient* view that complements pipeline.svg's *data* view. |
+ | `./assets/cdf.png` | §7 Figure 1 | Reward CDF per source – drop your chart at this path. |
+ | `./assets/efficiency.png` | §7 Figure 2 | Efficiency curve – drop your chart at this path. |
+ | Mermaid blocks (inline) | §2, §3, §5 | Render natively on GitHub and HF Space markdown. |

+ **To render the SVGs as PNGs (for Twitter / slide decks):**

```bash
# Either:
+ npx svgexport ./assets/pipeline.svg ./assets/pipeline.png 2x
# or:
+ rsvg-convert -z 2 -o ./assets/pipeline.png ./assets/pipeline.svg
```
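With more than a handful of diagrams, the same conversion can be scripted. A hypothetical helper (not part of the repo) that shells out to the `rsvg-convert` invocation above for every SVG under `./assets/`, assuming the tool is on `PATH`:

```python
# Hypothetical batch wrapper around the rsvg-convert command shown above.
from pathlib import Path
import subprocess


def render_pngs(assets_dir: str = "./assets", zoom: int = 2) -> list[Path]:
    outputs = []
    for svg in sorted(Path(assets_dir).glob("*.svg")):
        png = svg.with_suffix(".png")  # ./assets/pipeline.svg -> ./assets/pipeline.png
        subprocess.run(
            ["rsvg-convert", "-z", str(zoom), "-o", str(png), str(svg)],
            check=True,  # fail loudly if a conversion breaks
        )
        outputs.append(png)
    return outputs
```

Each PNG lands next to its source SVG, so the appendix table's paths stay valid for both formats.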
**To replace the result figures**, drop your two charts at:

+ - `./assets/cdf.png` – Figure 1 (reward distribution per source)
+ - `./assets/efficiency.png` – Figure 2 (reward vs. steps)

The blog already links to those paths.
+
+ > **Why `./assets/...`?** HF Spaces resolve relative paths from the rendered file's directory. Putting `BLOG.md` at the repo root and all images under `./assets/` means every link works without ever rewriting a URL – no `https://huggingface.co/spaces/<owner>/<name>/resolve/main/...` boilerplate, no broken paths if the Space is forked.
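The resolution rule in that note can be demonstrated with a few lines of stdlib path arithmetic. `resolve_link` is an illustrative function written for this post, not part of the repo:

```python
# Why ./assets works from a root-level BLOG.md: relative links resolve against
# the rendered file's directory. Illustrative helper, stdlib only.
from posixpath import join, normpath


def resolve_link(doc_path: str, link: str) -> str:
    base = doc_path.rsplit("/", 1)[0] if "/" in doc_path else "."
    return normpath(join(base, link))


print(resolve_link("BLOG.md", "./assets/pipeline.svg"))       # assets/pipeline.svg
print(resolve_link("docs/BLOG.md", "./assets/pipeline.svg"))  # docs/assets/pipeline.svg
```

The second call shows exactly the broken path you would get if the blog moved into a subdirectory without its links being rewritten.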
{docs/diagrams → assets}/agent_loop.svg
RENAMED (file without changes)

{docs/diagrams → assets}/hierarchical_rl_architecture.svg
RENAMED (file without changes)

{docs/diagrams → assets}/pipeline.svg
RENAMED (file without changes)