Meta-HF-hackathon/updated-policy

srinjoyd committed on Apr 26

Commit 5e99fd1 (verified) · 1 parent: 59d92e0

Update BLOG.md

Files changed (1)

BLOG.md +10 -40

@@ -6,7 +6,8 @@
 > **TL;DR.** We built `incident_env` — an OpenEnv POMDP where an LLM agent has to diagnose a live, evolving production incident and then attribute it to a specific commit in a small repo. Then we trained **Qwen2.5-7B-Instruct** through a four-stage curriculum (baseline rollouts → LoRA SFT → online GRPO with `r_cross` → merge). The post-trained model reaches a **mean cumulative reward of ≈1.59 vs ≈0.49** for the base, **at less than half the steps**, with tighter variance and dominant CDF across the operating range.
-![SRE Triage Bot training pipeline](https://huggingface.co/spaces/Meta-HF-hackathon/updated-policy/blob/main/assets/pipeline.svg)
 > 🧭 **One-line pitch.** *Most agent benchmarks freeze a repo and ask the model to fix it. Our environment refuses to sit still — memory climbs, alerts cascade, and the obvious symptom is almost never the cause.*
@@ -49,7 +50,8 @@ Each service has live metric history (CPU, memory, p50/p95/p99 latency, error ra
 ### The agent loop
-![Agent loop POMDP](./assets/agent_loop.svg)
 Per-step execution is `validate → mutate → tick → observe → reward`. Two facts make the loop interesting:
@@ -196,42 +198,7 @@ In English: *how much did Phase 1's investigation actually help the code agent v
 ---
-## 5 · Tasks and pools
-```mermaid
-flowchart TD
-    Tasks["10 scenarios"] --> Easy["memory_leak"]
-    Tasks --> Med["cascading_failure"]
-    Tasks --> Hard["distributed_deadlock"]
-    Tasks --> A1["aliased_fault"]
-    Tasks --> A2["severity_inversion"]
-    Tasks --> A3["confidence_inversion"]
-    Tasks --> A4["info_ordering"]
-    Tasks --> A5["circuit_breaker_noop · no-change"]
-    Tasks --> H1["heldout_aliased_severity · compound"]
-    Tasks --> H2["heldout_confidence_ordering · compound"]
-    H1 --> PoolD
-    H2 --> PoolD
-    Easy --> PoolA & PoolB & PoolC
-    Med  --> PoolA & PoolB & PoolC
-    Hard --> PoolA & PoolB & PoolC
-    A1   --> PoolA & PoolB & PoolC
-    A2   --> PoolA & PoolB & PoolC
-    A3   --> PoolA & PoolB & PoolC
-    A4   --> PoolA & PoolB & PoolC
-    A5   --> PoolA & PoolC
-    PoolA["Pool A · p1_only · ops bootstrap"]
-    PoolB["Pool B · p2_only · code bootstrap with oracle handoff"]
-    PoolC["Pool C · joint · full P1→P2 with r_cross"]
-    PoolD["Pool D · joint · held-out compounds"]
-```
-> ✦ **Pool D is the integrity check.** Each component fault family appears during training, but the *combinations* never do. This is what answers "did the agent learn a strategy or memorise scenario fingerprints?"
-### Scenario flavours
 | Task | Hidden lesson |
 | --- | --- |
@@ -380,7 +347,9 @@ The output is a vanilla causal LM that vLLM, TGI, or plain `transformers` can lo
 ### Figure 1 — Reward distribution (CDF)
-![Reward CDF per source — Baseline, SFT, Posttrained RL](./assets/cdf.png)
 > *Empirical CDF of cumulative reward — lower curve = better (more probability mass at high reward).*
@@ -390,7 +359,8 @@ The output is a vanilla causal LM that vLLM, TGI, or plain `transformers` can lo
 ### Figure 2 — Efficiency curve (reward vs. steps)
-![Efficiency curve — mean cumulative reward vs total steps (P1 + P2)](./assets/efficiency.png)
 | Model | Mean reward by ~30 steps | Steps to plateau | σ at plateau |
 | --- | --- | --- | --- |

 > **TL;DR.** We built `incident_env` — an OpenEnv POMDP where an LLM agent has to diagnose a live, evolving production incident and then attribute it to a specific commit in a small repo. Then we trained **Qwen2.5-7B-Instruct** through a four-stage curriculum (baseline rollouts → LoRA SFT → online GRPO with `r_cross` → merge). The post-trained model reaches a **mean cumulative reward of ≈1.59 vs ≈0.49** for the base, **at less than half the steps**, with tighter variance and dominant CDF across the operating range.
+![image](https://cdn-uploads.huggingface.co/production/uploads/66e56109975df8fffc75f3c7/bNlv5ywRBRCj3Al1BKi8R.png)
 > 🧭 **One-line pitch.** *Most agent benchmarks freeze a repo and ask the model to fix it. Our environment refuses to sit still — memory climbs, alerts cascade, and the obvious symptom is almost never the cause.*
 ### The agent loop
+![image](https://cdn-uploads.huggingface.co/production/uploads/66e56109975df8fffc75f3c7/sLaYeQmysnBDQw-VcmYsp.png)
 Per-step execution is `validate → mutate → tick → observe → reward`. Two facts make the loop interesting:
 ---
+## 5 ·Scenario flavours
 | Task | Hidden lesson |
 | --- | --- |
 ### Figure 1 — Reward distribution (CDF)
+![image](https://cdn-uploads.huggingface.co/production/uploads/66e56109975df8fffc75f3c7/N79LUO_eo8nExgK5xhArc.png)
 > *Empirical CDF of cumulative reward — lower curve = better (more probability mass at high reward).*
 ### Figure 2 — Efficiency curve (reward vs. steps)
+![image](https://cdn-uploads.huggingface.co/production/uploads/66e56109975df8fffc75f3c7/zTQyShUBp5jZ76Z_6rA_-.png)
 | Model | Mean reward by ~30 steps | Steps to plateau | σ at plateau |
 | --- | --- | --- | --- |
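
The Figure 1 caption in this diff reads the CDF as "lower curve = better". A minimal sketch of that reading, assuming nothing about the authors' plotting code: it builds an empirical CDF from a list of episode rewards and checks whether one sample's curve sits at or below another's at every observed reward. The function names `ecdf` and `dominates` and the reward values are hypothetical and chosen only for illustration; they are not results from the blog post.

```python
from bisect import bisect_right


def ecdf(samples):
    """Return F, where F(x) is the fraction of samples <= x."""
    ordered = sorted(samples)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one sample")

    def F(x):
        # bisect_right counts how many sorted values are <= x
        return bisect_right(ordered, x) / n

    return F


def dominates(better, worse):
    """True if the ECDF of `better` is at or below that of `worse` everywhere.

    Both ECDFs are step functions that only change at observed values,
    so comparing them on the union of the two samples is sufficient.
    """
    f_better, f_worse = ecdf(better), ecdf(worse)
    grid = sorted(set(better) | set(worse))
    return all(f_better(x) <= f_worse(x) for x in grid)


# Hypothetical cumulative rewards, invented for this example only.
baseline_rewards = [0.1, 0.4, 0.5, 0.9]
posttrained_rewards = [1.2, 1.5, 1.7, 1.9]

print(dominates(posttrained_rewards, baseline_rewards))
print(dominates(baseline_rewards, posttrained_rewards))
```

A lower curve at a given reward `x` means a smaller share of episodes ended at or below `x`, which is why the lower curve is the better one when higher reward is the goal.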