| title: Incident Commander RL Arena | |
| sdk: docker | |
| app_port: 8000 | |
| pinned: false | |
| license: bsd-3-clause | |
| # Incident Commander RL Arena | |
| An OpenEnv-style reinforcement learning environment for training LLM agents to act as on-call incident commanders. The agent investigates a production incident through tools, validates a root cause, applies a safe mitigation, and writes a concise stakeholder update. | |
| The environment is designed for the OpenEnv hackathon rubric: a novel environment, a clear story, measurable reward improvement, and a training loop small enough to run on T4-class GPUs. | |
| ## Why This Environment | |
| LLMs are often asked to operate software systems, but most benchmarks reward static Q&A instead of stateful investigation. This arena forces multi-turn behavior: inspect evidence, avoid misleading clues, choose a mitigation, and communicate clearly. | |
| Five incident families are included: | |
| - cache saturation causing API latency | |
| - bad deploy causing auth failures | |
| - payment provider degradation | |
| - database connection leak | |
| - queue backlog from worker crash | |
| ## Tools | |
| The agent can call these tools: | |
| - `query_logs(service, query, minutes)` | |
| - `inspect_metric(service, metric, minutes)` | |
| - `read_runbook(topic)` | |
| - `test_hypothesis(root_cause)` | |
| - `apply_mitigation(mitigation)` | |
| - `final_report(root_cause, mitigation, customer_update)` | |
| Rewards prioritize correct root cause and mitigation, then evidence support, update quality, safety, and efficient tool use. | |
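The priority order above can be pictured as a weighted sum. The following is a minimal sketch only: the component names and every weight are hypothetical placeholders chosen to respect the stated ordering, and the real logic lives in the package's reward code.

```python
# Hypothetical weights: the README states only the priority order, not the values.
WEIGHTS = {
    "root_cause": 0.35,      # highest priority
    "mitigation": 0.30,
    "evidence": 0.15,
    "update_quality": 0.10,
    "safety": 0.05,          # 1.0 means no unsafe action was taken
    "efficiency": 0.05,      # 1.0 means tool calls stayed within budget
}


def score_episode(components: dict) -> float:
    """Return a weighted sum of component scores, each clamped to [0, 1]."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = min(1.0, max(0.0, components.get(name, 0.0)))
        total += weight * value
    return round(total, 4)


if __name__ == "__main__":
    # An episode that finds the root cause but picks the wrong mitigation.
    print(score_episode({"root_cause": 1.0, "evidence": 0.5, "safety": 1.0}))
```

Missing components default to zero, so an agent that skips the final report cannot collect the two dominant terms.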
| ## Quickstart | |
| ```bash | |
| python -m venv .venv | |
| # Activate the environment (Windows shown; on macOS/Linux run: source .venv/bin/activate) | |
| .venv\Scripts\activate | |
| pip install -e . | |
| python -m unittest discover -s tests | |
| python scripts/manual_episode.py --scenario cache_saturation | |
| python scripts/evaluate_baselines.py --write-assets | |
| ``` | |
| Run the OpenEnv/FastAPI server: | |
| ```bash | |
| pip install -e ".[server]" | |
| uvicorn server.app:app --host 0.0.0.0 --port 8000 | |
| ``` | |
| Health check: | |
| ```bash | |
| curl http://127.0.0.1:8000/health | |
| ``` | |
| ## Training | |
| Start with the smallest Qwen model for proof of learning: | |
| ```bash | |
| pip install -e ".[train,vllm]" | |
| python scripts/train_grpo.py \ | |
| --model Qwen/Qwen3-0.6B \ | |
| --output-dir outputs/grpo-qwen3-0.6b \ | |
| --max-steps 80 | |
| ``` | |
| Capture training logs for submission: | |
| ```bash | |
| python scripts/run_with_logs.py --name grpo_qwen3_0_6b -- python scripts/train_grpo.py --model Qwen/Qwen3-0.6B --output-dir outputs/grpo-qwen3-0.6b --max-steps 80 | |
| ``` | |
| For Hugging Face Jobs: | |
| ```bash | |
| hf jobs uv run \ | |
| --flavor t4-small \ | |
| --timeout 4h \ | |
| --with "incident-commander-env[train,vllm] @ git+https://huggingface.co/spaces/YOUR_USERNAME/incident_commander_env" \ | |
| --secrets HF_TOKEN \ | |
| -- incident-commander-train --model Qwen/Qwen3-0.6B --max-steps 80 | |
| ``` | |
| If the first run shows a rising reward curve, rerun with `Qwen/Qwen3-1.7B`. | |
| If Colab T4 fails on `vllm`, use the slower fallback: | |
| ```bash | |
| pip install -e ".[train]" | |
| python scripts/run_with_logs.py --name grpo_qwen3_0_6b_no_vllm -- python scripts/train_grpo.py --model Qwen/Qwen3-0.6B --output-dir outputs/grpo-qwen3-0.6b --max-steps 40 --num-generations 2 --gradient-accumulation-steps 8 --no-vllm | |
| ``` | |
| The training wrapper uses curriculum reward shaping for GRPO: intermediate evidence | |
| collection, supported hypothesis checks, and safe mitigations receive small capped | |
| rewards, while the environment's final evaluation score still requires the correct | |
| root cause, mitigation, and stakeholder report. GRPO scores each rollout relative to the | |
| others in its group, so a group in which every rollout earns zero carries no learning | |
| signal. The capped shaping rewards avoid such all-zero batches in tool-use RL while | |
| preserving the harder final incident score. | |
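A minimal sketch of the capped-shaping idea, not the wrapper's actual code: `shaped_reward`, `centered_advantages`, and the `bonus_cap` value of 0.2 are hypothetical. The advantage helper only mean-centers rewards within a group; GRPO implementations typically also divide by the group's standard deviation, which is omitted here.

```python
def shaped_reward(step_bonuses, final_score, bonus_cap=0.2):
    """Add capped intermediate bonuses to the environment's final score."""
    # Negative bonuses are ignored and the total bonus can never exceed the cap.
    shaping = min(bonus_cap, sum(max(0.0, b) for b in step_bonuses))
    return shaping + final_score


def centered_advantages(rewards):
    """Mean-center rewards within one group of rollouts."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


if __name__ == "__main__":
    # Three rollouts that all fail the final evaluation (final score 0.0).
    unshaped = [0.0, 0.0, 0.0]
    shaped = [
        shaped_reward([0.05, 0.05], 0.0),  # gathered evidence twice
        shaped_reward([], 0.0),            # did nothing useful
        shaped_reward([0.05, 0.3], 0.0),   # bonuses exceed the cap
    ]
    print(centered_advantages(unshaped))  # all zeros: no learning signal
    print(centered_advantages(shaped))    # non-zero: rollouts can be ranked
```

Keeping the cap well below the final score means a rollout cannot beat a correct diagnosis by collecting bonuses alone.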
| ## Evaluation | |
| `scripts/evaluate_baselines.py` writes: | |
| - `outputs/evals/baseline_eval.json` | |
| - `assets/baseline_vs_oracle.png` | |
| - `assets/reward_curve_template.png` | |
| Use the log wrapper for evidence: | |
| ```bash | |
| python scripts/run_with_logs.py --name unit_tests -- python -m unittest discover -s tests | |
| python scripts/run_with_logs.py --name manual_episode_cache -- python scripts/manual_episode.py --scenario cache_saturation | |
| python scripts/run_with_logs.py --name baseline_eval -- python scripts/evaluate_baselines.py --write-assets | |
| ``` | |
| Current local smoke evaluation: | |
| | Policy | Mean reward | Root cause acc | Mitigation acc | Unsafe rate | Avg tool calls | | |
| | --- | ---: | ---: | ---: | ---: | ---: | | |
| | Weak baseline | 0.172 | 0.000 | 0.000 | 0.000 | 2.00 | | |
| | Oracle trace | 1.000 | 1.000 | 1.000 | 0.000 | 6.00 | | |
| | Trained model | fill from GRPO run | fill | fill | fill | fill | | |
|  | |
| Replace the template below with the real GRPO reward curve after training: | |
|  | |
| ## Multi-Agent Incident Command | |
| The repo also includes a deterministic multi-agent orchestration layer for demos and | |
| future training curricula: | |
| - `api_auth_specialist`: checks API logs, auth-like metrics, deploy and login signals. | |
| - `data_store_specialist`: checks database or ledger logs, connection metrics, and storage runbooks. | |
| - `dependency_worker_specialist`: checks cache, queue, worker, gateway, and provider signals. | |
| - Parent commander: synthesizes specialist findings, tests the strongest hypothesis, applies the safest mitigation, and submits the final stakeholder report. | |
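The parent commander's synthesis step can be illustrated with a small deterministic sketch. The `Finding` type, the confidence scale, and the tie-breaking rule are all hypothetical; the repo's orchestration layer may combine specialist output differently.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    specialist: str    # e.g. "data_store_specialist"
    hypothesis: str    # candidate root cause
    confidence: float  # hypothetical evidence strength in [0, 1]


def strongest_hypothesis(findings):
    """Pick the hypothesis with the highest summed confidence.

    Ties are broken alphabetically so the result is reproducible.
    Raises ValueError when no findings are supplied.
    """
    totals = {}
    for f in findings:
        totals[f.hypothesis] = totals.get(f.hypothesis, 0.0) + f.confidence
    return min(totals, key=lambda h: (-totals[h], h))


if __name__ == "__main__":
    findings = [
        Finding("api_auth_specialist", "bad_deploy", 0.3),
        Finding("data_store_specialist", "database_connection_leak", 0.6),
        Finding("dependency_worker_specialist", "database_connection_leak", 0.2),
    ]
    print(strongest_hypothesis(findings))
```

In this picture the commander would pass the selected hypothesis to `test_hypothesis` before applying any mitigation.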
| Run a local multi-agent episode: | |
| ```bash | |
| python scripts/multi_agent_demo.py --scenario payment_provider_degradation | |
| python scripts/multi_agent_demo.py --scenario database_connection_leak --json | |
| ``` | |
| The Space exposes the same flow at: | |
| ```text | |
| /orchestrate?scenario_id=payment_provider_degradation | |
| ``` | |
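For scripted checks, the route above can be called from Python with the standard library. This sketch assumes the server from the Quickstart is running at `http://127.0.0.1:8000`, that the route accepts GET, and that it returns JSON. None of these is confirmed by this README, so adjust as needed.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8000"  # assumption: local server from the Quickstart


def orchestrate_url(scenario_id, base_url=BASE_URL):
    """Build the /orchestrate URL with a safely encoded scenario_id."""
    return f"{base_url}/orchestrate?{urlencode({'scenario_id': scenario_id})}"


def run_orchestration(scenario_id, base_url=BASE_URL, timeout=30.0):
    """Call the endpoint and parse the body as JSON (assumed response format)."""
    with urlopen(orchestrate_url(scenario_id, base_url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    print(orchestrate_url("payment_provider_degradation"))
```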
| This flow is intentionally reproducible and uses generated incidents instead of live cloud | |
| credentials. A production extension can connect specialists to AWS CloudWatch, Azure | |
| Monitor, GCP Cloud Logging, Prometheus, or a read-only demo database, but the judged | |
| environment remains deterministic so rewards and logs are repeatable. | |
| ## Deploy to Hugging Face Space | |
| Recommended path: | |
| ```bash | |
| pip install -e ".[server]" | |
| openenv push --repo-id YOUR_USERNAME/incident_commander_env | |
| ``` | |
| Fallback Docker path: | |
| ```bash | |
| docker build -f server/Dockerfile -t incident-commander-env . | |
| docker run --rm -p 8000:8000 incident-commander-env | |
| ``` | |
| ## Submission Checklist | |
| - Push this repo to a Hugging Face Space. | |
| - Link the Space URL in this README. | |
| - Run `scripts/train_grpo.py` and commit reward/loss plots. | |
| - Add logs from `outputs/logs/` and one before/after transcript from `outputs/`. | |
| - Record a demo under two minutes showing an untrained failure and a trained improvement. | |
| ## Local Package Layout | |
| - `incident_commander_env/`: package, models, scenario engine, reward logic | |
| - `server/`: OpenEnv/FastAPI entrypoint and Dockerfile | |
| - `scripts/`: manual episode smoke test, baseline evaluation, GRPO training, log capture, multi-agent demo | |
| - `tests/`: deterministic unit tests | |
| - `notebooks/`: Colab-oriented training notebook | |
| ## HF Training Result | |
| A GRPO training run completed on Hugging Face Jobs. | |
| - Job URL: https://huggingface.co/jobs/Dar3devil/69ed03a7d2c8bd8662bce22d | |
| - Hardware: `a10g-large` | |
| - Model: `Qwen/Qwen3-0.6B` | |
| - Steps: `180/180 completed` | |
| - First logged reward: `0.0375` | |
| - Best logged reward: `0.2300` | |
| - Final logged reward: `0.2125` | |
| - Final train loss: `-0.005505` | |
|  | |
| Compact training evidence: | |
| - `outputs/evals/hf_training_summary.md` | |
| - `outputs/evals/hf_training_metrics.csv` | |
| - `outputs/logs/hf_job_69ed03a7_inspect.txt` | |
| The full raw training log is saved locally and is also available from the HF Jobs page. | |