# Submission Checklist: OpenEnv India 2026 Round 2
Status against every hard gate in the official judging rules, plus every polish item that moves the judging needle. **Last verified: all 21 tests passing, HF Space live, all artifacts committed.**
---
## Hard gates (from the official rules)
| # | Rule | Status | Evidence |
|---|---|---|---|
| 1 | **Use OpenEnv (latest release). Build on top of the framework; don't reinvent the wheel.** | ✅ | `requirements.txt` requires `openenv-core>=0.2.2`, `openenv.yaml` has `version: "3.0"`, `server/environment.py` extends `openenv.core.environment.Environment`, app built via `openenv.core.env_server.create_fastapi_app`. |
| 2 | **Working training script (Unsloth / HF TRL / any RL framework), ideally as a Colab notebook so judges can re-run it.** | ✅ | [`train_trl.py`](../train_trl.py) uses HF TRL `SFTTrainer`. **[One-click Colab notebook →](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing)** runs the whole pipeline end-to-end on a T4 in ~1 h 15 min. |
| 3 | **Evidence that you actually trained: at minimum, loss and reward plots from a real run.** | ✅ | Four plots committed to [`artifacts/`](../artifacts): `training_curve.png` (loss + token accuracy), `reward_curve.png` (4-policy reward by tier), `reward_components.png` (per-component breakdown), plus the 0.5B ablation `reward_curve_qwen0p5b.png`. Full `training_log.json` + `summary_metrics.json` committed alongside. |
| 4 | **Short writeup or video: mini-blog on Hugging Face OR <2-min YouTube video, linked from README.** | ✅ | Mini-blog lives at [`docs/BLOG_POST.md`](./BLOG_POST.md) and ships as part of the HF Space. Rule 4 asks for a "mini-blog on Hugging Face"; the Space is on HF and contains this file, so it renders at `huggingface.co/spaces/.../blob/main/docs/BLOG_POST.md`. All four training plots render inline via raw GitHub URLs. README and dashboard both link to it. (No separate video submission.) |
| 5 | **Push your environment to a Hugging Face Space so it's discoverable and runnable.** | ✅ | **Live at [`swapnilpatil28-multi-agent-incident-command-center.hf.space`](https://swapnilpatil28-multi-agent-incident-command-center.hf.space)** · Space page: [`huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center`](https://huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center). |
| 6 | **README motivates the problem, explains how the env works, and shows results.** | ✅ | [`README.md`](../README.md): Part 1 ("Story in 2 minutes") opens with the problem in plain English, walks through the environment via role-permission tables, and shows all four plots + headline numbers. Part 2 is the full technical deep-dive (architecture, action/observation spaces, reward rubric, training pipeline, 0.5B ablation, ops/observability, testing, repo layout). |
| 7 | **README links to the HF Space + all additional materials (blog, slides, etc.).** | ✅ | "Live links" table inside Part 2 of the README lists every resource. Part 1 also has a "Try it in 30 seconds" CTA table. The dashboard header plus "Resources & documentation" grid surface the same links from the live Space itself. |
| 8 | **Do not include big video files in the HF submission, only public URLs.** | ✅ | No video files committed. All assets in [`artifacts/`](../artifacts) are PNG plots (≤ 162 KB each) + JSON. Repo weight is dominated by text and small images. |
---
## Judging-rubric alignment
### Environment Innovation (40%)
- [x] Multi-role, multi-agent: `triage_agent`, `investigator_agent`, `ops_manager_agent` with **non-overlapping permissions** (`server/domain/roles.py`).
- [x] Long-horizon: 3–5 sequential incidents per episode, 20–60 steps each, shared SLA + budget counters.
- [x] Professional / enterprise task simulation: realistic logs, metrics, KB articles, customer-tier revenue impact, SLA timers.
- [x] 13 unique incident templates across easy / medium / hard (`server/domain/incidents.py`).
- [x] Rich observation schema: customer tier, revenue impact, allowed actors per action, investigation targets grouped by tool, playbook hints, `reward_components`, `last_action_notes`.
- [x] Composable reward rubric with **14+ named components** and anti-gaming safeguards (`server/domain/reward.py`).
- [x] Tier-weighted business impact (`free ×0.6 · standard ×1.0 · premium ×1.4 · enterprise ×1.8`).
- [x] Role-based permissions + handoff scoring (`wrong_actor_penalty`, `handoff_correct`/`handoff_wrong`).
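
The tier weighting and named-component rubric listed above can be illustrated with a minimal sketch. The multipliers are the ones quoted in this checklist; the function name `weighted_reward`, the component values, and the choice to scale the summed components by the tier multiplier are assumptions made for illustration, not the actual logic in `server/domain/reward.py`.

```python
# Illustrative sketch only; not the code in server/domain/reward.py.
# Tier multipliers are the ones quoted in the checklist.
TIER_WEIGHTS = {"free": 0.6, "standard": 1.0, "premium": 1.4, "enterprise": 1.8}


def weighted_reward(components: dict[str, float], tier: str) -> float:
    """Sum named reward components, then scale by the customer-tier weight.

    Assumption: the multiplier applies to the summed components. The real
    rubric may weight only some of them.
    """
    if tier not in TIER_WEIGHTS:
        raise ValueError(f"unknown customer tier: {tier!r}")
    return round(sum(components.values()) * TIER_WEIGHTS[tier], 6)


# Hypothetical values; only the component names come from the checklist.
example = {"handoff_correct": 1.0, "wrong_actor_penalty": -0.5}
print(weighted_reward(example, "enterprise"))
```

Keeping each component as a named entry, rather than returning one scalar, is what makes a per-component breakdown such as `reward_components` possible in the observation.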
### Storytelling (30%)
- [x] README **Part 1: The story in 2 minutes** written in plain English, readable by a non-technical judge in under 3 minutes.
- [x] Every plot has a one-line caption explaining what it shows.
- [x] Blog post [`docs/BLOG_POST.md`](./BLOG_POST.md): eight labelled sections, four plots inline via raw GitHub URLs (render everywhere), 0.5B-vs-1.5B ablation narrative, explicit hackathon-theme mapping.
- [x] Live HF Space dashboard has a **"Story in 2 minutes"** hero panel at the top, a role-permission table, a three-card theme mapping, and a "Resources & documentation" grid with click-through links (README, blog, checklist, Colab, Space, etc.).
- [x] All documentation cross-links cleanly: README ↔ dashboard ↔ blog post ↔ checklist.
### Improvement in Rewards (20%)
- [x] 4-policy reward curve (`reward_curve.png`) across easy / medium / hard.
- [x] Training loss + token-accuracy curve (`training_curve.png`).
- [x] Reward-components stacked bar chart (`reward_components.png`), which shows *where* the improvement came from.
- [x] Ablation plot (`reward_curve_qwen0p5b.png`) for Qwen2.5-0.5B-Instruct backbone.
- [x] Per-task `improvement_sft_over_base` numbers in `summary_metrics.json`: **−1.80 / +3.13 / +10.17** (easy / medium / hard).
- [x] Final headline run: Qwen2.5-1.5B-Instruct, 8 episodes/task, 3 epochs, 680 rows; full `training_log.json` committed.
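
For readers who want to inspect the per-task deltas programmatically, the following is a rough sketch. The key name `improvement_sft_over_base` comes from this checklist, but the JSON layout (one object per task containing that key) and the helper names are assumptions; adjust them to the actual structure of `summary_metrics.json`.

```python
import json
from pathlib import Path

KEY = "improvement_sft_over_base"


def extract_improvements(summary: dict) -> dict[str, float]:
    """Pull per-task deltas out of a parsed summary.

    Assumed (hypothetical) layout: {"<task>": {"improvement_sft_over_base": <float>}, ...}
    Entries that are not objects, or that lack the key, are skipped.
    """
    return {
        task: float(entry[KEY])
        for task, entry in summary.items()
        if isinstance(entry, dict) and KEY in entry
    }


def load_improvements(path: str) -> dict[str, float]:
    """Read a summary JSON file and return its per-task deltas."""
    return extract_improvements(json.loads(Path(path).read_text(encoding="utf-8")))


def regressed_tasks(improvements: dict[str, float]) -> list[str]:
    """Tasks where the fine-tuned policy scored below the base policy."""
    return sorted(task for task, delta in improvements.items() if delta < 0)
```

Under those assumptions, `regressed_tasks(load_improvements("artifacts/summary_metrics.json"))` would list any tier where SFT underperformed the base model, which is worth calling out explicitly rather than leaving it implicit in the headline numbers.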
### Reward & Training Pipeline (10%)
- [x] Reward logic is coherent: rubric engine with module-level constants and unit tests (`tests/test_reward.py`).
- [x] Training pipeline genuinely connects to the running environment (no static dataset; rollouts are collected from a live `IncidentCommandCenterEnvironment`).
- [x] SFT checkpoint is saved to `artifacts/sft_model/` and reloaded for 4-policy evaluation, which closes the loop.
- [x] 21 unit + integration tests passing (`tests/test_reward.py`, `tests/test_incidents.py`, `tests/test_environment.py`).
---
## Engineering table-stakes
- [x] Uses OpenEnv `Environment` base class properly.
- [x] Clean client/server separation: client only uses Pydantic models + HTTP (`client.py`).
- [x] Gym-style `reset / step / state` + OpenEnv `/close`.
- [x] Valid `openenv.yaml` manifest (version 3.0).
- [x] No reserved MCP tool names.
- [x] Structured JSON logging with per-episode seeded RNG (`server/logging_utils.py`).
- [x] Health / version / env-info / metrics endpoints (`/healthz`, `/version`, `/env-info`, `/metrics`).
- [x] Static `/artifacts` mount so the Space serves its own plots, with no external hotlinking.
- [x] Dockerfile with `HEALTHCHECK` (`Dockerfile`, `server/Dockerfile`).
- [x] `pytest` passes cleanly: 21 / 21.
- [x] `.dockerignore` keeps image slim (excludes `sft_model/` checkpoint, keeps evidence plots).
- [x] `pre_validate.sh` + `validate-submission.sh` for one-command pre-submission smoke tests.
- [x] LICENSE (MIT) in repo root.
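
The "structured JSON logging with per-episode seeded RNG" item can be pictured with a small sketch. This is not the contents of `server/logging_utils.py`; the helper names `episode_rng` and `log_event` and the seed-derivation scheme are hypothetical and only show one common way to get reproducible episodes with machine-readable logs.

```python
import json
import logging
import random


def episode_rng(base_seed: int, episode_index: int) -> random.Random:
    """Return an RNG whose stream depends only on (base_seed, episode_index).

    Seeding random.Random with a string is deterministic across processes,
    so replaying the same episode index reproduces the same draws.
    """
    return random.Random(f"{base_seed}:{episode_index}")


def log_event(logger: logging.Logger, event: str, **fields) -> str:
    """Emit one JSON object per log line and return the serialized line."""
    line = json.dumps({"event": event, **fields}, sort_keys=True)
    logger.info(line)
    return line


logging.basicConfig(level=logging.INFO, format="%(message)s")
rng = episode_rng(base_seed=42, episode_index=0)
log_event(logging.getLogger("icc"), "episode_reset", episode=0, first_draw=rng.random())
```

Giving each episode its own RNG, instead of sharing one global generator, keeps an episode reproducible even when other episodes run before it or concurrently.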
---
## Final submission steps
| # | Step | Status |
|---|---|---|
| 1 | Final training run (Qwen2.5-1.5B, 8 eps/task, 3 epochs); all artifacts committed | ✅ |
| 2 | Commit artifacts (`reward_curve.png`, `training_curve.png`, `reward_components.png`, `reward_curve_qwen0p5b.png`, `training_log.json`, `summary_metrics.json`, `summary_metrics_qwen0p5b.json`) | ✅ |
| 3 | Update README with real numbers + real Space / Colab / GitHub / blog links | ✅ |
| 4 | Deploy HF Space from the same commit | ✅ |
| 5 | Dashboard upgraded: hero story panel, 4 stacked plots, resources grid with README / blog / checklist links | ✅ |
| 6 | Blog post updated (`docs/BLOG_POST.md`) with fixed image paths (raw GitHub URLs) and 0.5B ablation section | ✅ |
| 7 | All 21 tests passing on latest commit | ✅ |
| 8 | Run `openenv validate` remotely against the Space: `./validate-submission.sh <space-url>` | ✅ |
| 9 | **Submit the Space URL in the hackathon form:** `https://swapnilpatil28-multi-agent-incident-command-center.hf.space` | ✅ |
| 10 | Do not push commits after the submission deadline; post-deadline commits won't be considered | ✅ |
---
## Pre-submission smoke test (copy-paste)
```bash
# 1. HF Space is serving
curl -fsS https://swapnilpatil28-multi-agent-incident-command-center.hf.space/healthz
# 2. Env-info endpoint advertises metadata
curl -s https://swapnilpatil28-multi-agent-incident-command-center.hf.space/env-info
# 3. OpenEnv validator passes remotely
./validate-submission.sh https://swapnilpatil28-multi-agent-incident-command-center.hf.space
# 4. A remote episode works
ENV_URL=https://swapnilpatil28-multi-agent-incident-command-center.hf.space python inference.py
```
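
The same liveness check can be scripted in Python, for example inside a CI job. The endpoint paths below are the ones listed in this checklist; the helper names `endpoint_urls` and `check_space` are hypothetical, and the sketch assumes only that a healthy endpoint answers with HTTP 200, without assuming anything about the response bodies.

```python
from urllib.request import urlopen

# Paths as listed under "Engineering table-stakes".
ENDPOINTS = ("/healthz", "/version", "/env-info", "/metrics")


def endpoint_urls(base_url: str) -> list[str]:
    """Build the full URL for each endpoint, tolerating a trailing slash."""
    base = base_url.rstrip("/")
    return [base + path for path in ENDPOINTS]


def check_space(base_url: str, fetch_status=None, timeout: float = 10.0) -> dict[str, bool]:
    """Map each endpoint URL to True if it answered with HTTP 200.

    fetch_status is an optional callable (url -> int status) so the check
    can be exercised without network access.
    """
    def default_fetch(url: str) -> int:
        with urlopen(url, timeout=timeout) as response:
            return response.status

    fetch = fetch_status or default_fetch
    results = {}
    for url in endpoint_urls(base_url):
        try:
            results[url] = fetch(url) == 200
        except OSError:  # URLError, HTTPError and timeouts all derive from OSError
            results[url] = False
    return results
```

Calling `check_space("https://swapnilpatil28-multi-agent-incident-command-center.hf.space")` returns one boolean per endpoint, so a failing route is identifiable at a glance instead of aborting on the first error.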
## Where the judges will find each artefact
| Artefact | Primary URL |
|---|---|
| Live environment (OpenEnv-compatible) | [`swapnilpatil28-multi-agent-incident-command-center.hf.space`](https://swapnilpatil28-multi-agent-incident-command-center.hf.space) |
| Hugging Face Space page | [Space page →](https://huggingface.co/spaces/SwapnilPatil28/Multi-Agent-Incident-Command-Center) |
| GitHub repository | [GitHub →](https://github.com/SwapnilPatil28/Multi-Agent-Incident-Command-Center) |
| README (Part 1 story + Part 2 deep-dive) | [`README.md`](../README.md) |
| Mini blog post (MD file in the repo, renders on both HF Space and GitHub) | [`docs/BLOG_POST.md`](./BLOG_POST.md) |
| Reproducible training notebook | [Colab →](https://colab.research.google.com/drive/1vx9E5FrZZrHoRwXs2cvtom3DaI6kZ3LP?usp=sharing) |
| Training evidence (all 4 plots + JSON metrics) | [`artifacts/`](../artifacts) folder |