# Submission Checklist: OpenEnv India 2026 Round 2

Status against every hard gate in the official judging rules, plus every polish item that moves the judging needle. **Last verified: all 21 tests passing, HF Space live, all artifacts committed.**

---

## Hard gates (from the official rules)

| # | Rule | Status | Evidence |
|---|---|---|---|
| 1 | **Use OpenEnv (latest release). Build on top of the framework; don't reinvent the wheel.** | ✅ | `requirements.txt` pins `openenv-core>=0.2.2`, `openenv.yaml` has `version: "3.0"`, `server/environment.py` extends `openenv.core.environment.Environment`, app built via `openenv.core.env_server.create_fastapi_app`. |
| 2 | **Working training script (Unsloth / HF TRL / any RL framework), ideally as a Colab notebook so judges can re-run it.** | ✅ | [`train_trl.py`](../train_trl.py) uses HF TRL `SFTTrainer`. **[One-click Colab notebook ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)** runs the whole pipeline end-to-end on a T4 in ~1 h 15 min. |
| 3 | **Evidence that you actually trained: at minimum, loss and reward plots from a real run.** | ✅ | Four plots committed to [`artifacts/`](../artifacts): `training_curve.png` (loss + token accuracy), `reward_curve.png` (4-policy reward by tier), `reward_components.png` (per-component breakdown), plus the 0.5B ablation `reward_curve_qwen0p5b.png`. Full `training_log.json` + `summary_metrics.json` committed alongside. |
| 4 | **Short writeup or video: mini-blog on Hugging Face OR <2-min YouTube video, linked from README.** | ✅ | Mini-blog lives at [`docs/BLOG_POST.md`](./BLOG_POST.md), shipped as part of the HF Space (rule 4 says "mini-blog on Hugging Face"; the Space is on HF and contains this file, so it renders at `huggingface.co/spaces/.../blob/main/docs/BLOG_POST.md`). All four training plots render inline via raw GitHub URLs. README and dashboard both link to it. (No separate video submission.) |
| 5 | **Push your environment to a Hugging Face Space so it's discoverable and runnable.** | ✅ | **Live at [`swapnilpatil28-multi-agent-incident-command-center.hf.space`](https://swapnilpatil28-multi-agent-incident-command-center.hf.space)** · Space page: [`huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center`](https://huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center). |
| 6 | **README motivates the problem, explains how the env works, and shows results.** | ✅ | [`README.md`](../README.md): Part 1 ("Story in 2 minutes") opens with the problem in plain English, walks through the environment via role-permission tables, and shows all four plots + headline numbers. Part 2 is the full technical deep-dive (architecture, action/observation spaces, reward rubric, training pipeline, 0.5B ablation, ops/observability, testing, repo layout). |
| 7 | **README links to the HF Space + all additional materials (blog, slides, etc.).** | ✅ | "Live links" table inside Part 2 of the README lists every resource. Part 1 also has a "Try it in 30 seconds" CTA table. The dashboard header plus "Resources & documentation" grid surface the same links from the live Space itself. |
| 8 | **Do not include big video files in the HF submission, only public URLs.** | ✅ | No video files committed. All assets in [`artifacts/`](../artifacts) are PNG plots (≤ 162 KB each) + JSON. Repo weight is dominated by text and small images. |

---

## Judging-rubric alignment

### Environment Innovation (40%)

- [x] Multi-role, multi-agent: `triage_agent`, `investigator_agent`, `ops_manager_agent` with **non-overlapping permissions** (`server/domain/roles.py`).
- [x] Long-horizon: 3–5 sequential incidents per episode, 20–60 steps each, shared SLA + budget counters.
- [x] Professional / enterprise task simulation: realistic logs, metrics, KB articles, customer-tier revenue impact, SLA timers.
- [x] 13 unique incident templates across easy / medium / hard (`server/domain/incidents.py`).
- [x] Rich observation schema: customer tier, revenue impact, allowed actors per action, investigation targets grouped by tool, playbook hints, `reward_components`, `last_action_notes`.
- [x] Composable reward rubric with **14+ named components** and anti-gaming safeguards (`server/domain/reward.py`).
- [x] Tier-weighted business impact (`free ×0.6 · standard ×1.0 · premium ×1.4 · enterprise ×1.8`).
- [x] Role-based permissions + handoff scoring (`wrong_actor_penalty`, `handoff_correct`/`handoff_wrong`).
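
The tier weighting and wrong-actor penalty listed above can be illustrated with a minimal sketch. Only the three role names, the four tier multipliers, and the component name `wrong_actor_penalty` come from this checklist; the action names, the penalty magnitude, the `business_impact` key, and the `score_step` helper are hypothetical and are not taken from `server/domain/reward.py` or `server/domain/roles.py`.

```python
from itertools import combinations

# Tier multipliers as stated in the checklist.
TIER_WEIGHTS = {"free": 0.6, "standard": 1.0, "premium": 1.4, "enterprise": 1.8}

# Hypothetical action names; only the role names come from the checklist.
ROLE_ACTIONS = {
    "triage_agent": {"classify", "handoff"},
    "investigator_agent": {"query_logs", "query_metrics"},
    "ops_manager_agent": {"apply_fix", "close_incident"},
}

WRONG_ACTOR_PENALTY = -1.0  # assumed magnitude, for illustration only

# Non-overlapping permissions: no action may belong to two roles.
for (_, a), (_, b) in combinations(ROLE_ACTIONS.items(), 2):
    assert a.isdisjoint(b)


def score_step(actor: str, action: str, base_reward: float, tier: str) -> dict:
    """Return named reward components for one step (illustrative only)."""
    if action not in ROLE_ACTIONS.get(actor, set()):
        # The actor is not permitted to take this action.
        return {"wrong_actor_penalty": WRONG_ACTOR_PENALTY}
    # Scale the base reward by the customer tier's business weight.
    return {"business_impact": base_reward * TIER_WEIGHTS[tier]}
```

Under these assumptions, `score_step("triage_agent", "classify", 2.0, "enterprise")` scales the base reward by 1.8, while the same agent attempting `apply_fix` receives only the penalty component. Returning a dict of named components mirrors the `reward_components` field in the observation schema, though the real rubric may be structured differently.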

### Storytelling (30%)

- [x] README **Part 1: The story in 2 minutes** written in plain English, readable by a non-technical judge in under 3 minutes.
- [x] Every plot has a one-line caption explaining what it shows.
- [x] Blog post [`docs/BLOG_POST.md`](./BLOG_POST.md): eight labelled sections, four plots inline via raw GitHub URLs (render everywhere), 0.5B-vs-1.5B ablation narrative, explicit hackathon-theme mapping.
- [x] Live HF Space dashboard has a **"Story in 2 minutes"** hero panel at the top, a role-permission table, a three-card theme mapping, and a "Resources & documentation" grid with click-through links (README, blog, checklist, Colab, Space, etc.).
- [x] All documentation cross-links cleanly: README ↔ dashboard ↔ blog post ↔ checklist.

### Improvement in Rewards (20%)

- [x] 4-policy reward curve (`reward_curve.png`) across easy / medium / hard.
- [x] Training loss + token-accuracy curve (`training_curve.png`).
- [x] Reward-components stacked bar chart (`reward_components.png`), showing *where* the improvement came from.
- [x] Ablation plot (`reward_curve_qwen0p5b.png`) for Qwen2.5-0.5B-Instruct backbone.
- [x] Per-task `improvement_sft_over_base` numbers in `summary_metrics.json`: **−1.80 / +3.13 / +10.17** (easy / medium / hard).
- [x] Final headline run: Qwen2.5-1.5B-Instruct, 8 episodes/task, 3 epochs, 680 rows; full `training_log.json` committed.

### Reward & Training Pipeline (10%)

- [x] Reward logic is coherent: rubric engine with module-level constants and unit tests (`tests/test_reward.py`).
- [x] Training pipeline genuinely connects to the running environment (no static dataset; rollouts are collected from a live `IncidentCommandCenterEnvironment`).
- [x] SFT checkpoint is saved to `artifacts/sft_model/` and reloaded for 4-policy evaluation, which closes the loop.
- [x] 21 unit + integration tests passing (`tests/test_reward.py`, `tests/test_incidents.py`, `tests/test_environment.py`).

---

## Engineering table-stakes

- [x] Uses OpenEnv `Environment` base class properly.
- [x] Clean client/server separation: the client only uses Pydantic models + HTTP (`client.py`).
- [x] Gym-style `reset / step / state` + OpenEnv `/close`.
- [x] Valid `openenv.yaml` manifest (version 3.0).
- [x] No reserved MCP tool names.
- [x] Structured JSON logging with per-episode seeded RNG (`server/logging_utils.py`).
- [x] Health / version / env-info / metrics endpoints (`/healthz`, `/version`, `/env-info`, `/metrics`).
- [x] Static `/artifacts` mount so the Space serves its own plots, with no external hotlinking.
- [x] Dockerfile with `HEALTHCHECK` (`Dockerfile`, `server/Dockerfile`).
- [x] `pytest` passes cleanly: 21 / 21.
- [x] `.dockerignore` keeps image slim (excludes `sft_model/` checkpoint, keeps evidence plots).
- [x] `pre_validate.sh` + `validate-submission.sh` for one-command pre-submission smoke tests.
- [x] LICENSE (MIT) in repo root.
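
The four operational endpoints listed above can be probed in one pass. The following is a minimal sketch using only the Python standard library; the helper names `http_status` and `check_endpoints` are hypothetical and not part of the repository, and treating HTTP 200 as "healthy" for every endpoint is an assumption.

```python
from urllib.error import HTTPError
from urllib.request import urlopen

# Endpoint paths as listed in the checklist.
ENDPOINTS = ("/healthz", "/version", "/env-info", "/metrics")


def http_status(url: str, timeout: float = 10.0):
    """Return the HTTP status code for a GET request, or None if unreachable."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:
        # The server answered, but with an error status (4xx / 5xx).
        return err.code
    except OSError:
        # Covers URLError, DNS failures, refused connections, and timeouts.
        return None


def check_endpoints(base_url: str, fetch=http_status) -> dict:
    """Map each endpoint path to True when it answers with HTTP 200."""
    base = base_url.rstrip("/")
    return {path: fetch(base + path) == 200 for path in ENDPOINTS}
```

Calling `check_endpoints("https://swapnilpatil28-multi-agent-incident-command-center.hf.space")` would return one boolean per endpoint. The `fetch` parameter exists so the check can be exercised offline with a stub instead of a live Space.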

---

## Final submission steps

| # | Step | Status |
|---|---|---|
| 1 | Final training run (Qwen2.5-1.5B, 8 eps/task, 3 epochs) → all artifacts committed | ✅ |
| 2 | Commit artifacts (`reward_curve.png`, `training_curve.png`, `reward_components.png`, `reward_curve_qwen0p5b.png`, `training_log.json`, `summary_metrics.json`, `summary_metrics_qwen0p5b.json`) | ✅ |
| 3 | Update README with real numbers + real Space / Colab / GitHub / blog links | ✅ |
| 4 | Deploy HF Space from the same commit | ✅ |
| 5 | Dashboard upgraded: hero story panel, 4 stacked plots, resources grid with README / blog / checklist links | ✅ |
| 6 | Blog post updated (`docs/BLOG_POST.md`) with fixed image paths (raw GitHub URLs) and 0.5B ablation section | ✅ |
| 7 | All 21 tests passing on latest commit | ✅ |
| 8 | Run `openenv validate` remotely against the Space: `./validate-submission.sh <space-url>` | ✅ |
| 9 | **Submit the Space URL in the hackathon form:** `https://swapnilpatil28-multi-agent-incident-command-center.hf.space` | ✅ |
| 10 | Do not push commits after the submission deadline; post-deadline commits won't be considered | ✅ |

---

## Pre-submission smoke test (copy-paste)

```bash
# 1. HF Space is serving
curl -fsS https://swapnilpatil28-multi-agent-incident-command-center.hf.space/healthz

# 2. Env-info endpoint advertises metadata
curl -s https://swapnilpatil28-multi-agent-incident-command-center.hf.space/env-info

# 3. OpenEnv validator passes remotely
./validate-submission.sh https://swapnilpatil28-multi-agent-incident-command-center.hf.space

# 4. A remote episode works
ENV_URL=https://swapnilpatil28-multi-agent-incident-command-center.hf.space python inference.py
```

## Where the judges will find each artefact

| Artefact | Primary URL |
|---|---|
| Live environment (OpenEnv-compatible) | [`swapnilpatil28-multi-agent-incident-command-center.hf.space`](https://swapnilpatil28-multi-agent-incident-command-center.hf.space) |
| Hugging Face Space page | [Space page ↗](https://huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center) |
| GitHub repository | [GitHub ↗](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center) |
| README (Part 1 story + Part 2 deep-dive) | [`README.md`](../README.md) |
| Mini blog post (MD file in the repo, renders on both HF Space and GitHub) | [`docs/BLOG_POST.md`](./BLOG_POST.md) |
| Reproducible training notebook | [Colab ↗](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing) |
| Training evidence (all 4 plots + JSON metrics) | [`artifacts/`](../artifacts) folder |