title: Incident Commander RL Arena
sdk: docker
app_port: 8000
pinned: false
license: bsd-3-clause
Incident Commander RL Arena
An OpenEnv-style reinforcement learning environment for training LLM agents to act as on-call incident commanders. The agent investigates a production incident through tools, validates a root cause, applies a safe mitigation, and writes a concise stakeholder update.
The environment is designed for the OpenEnv hackathon rubric: a novel environment, a clear story, measurable reward improvement, and a training loop small enough to run on T4-class GPUs.
Why This Environment
LLMs are often asked to operate software systems, but most benchmarks reward static Q&A instead of stateful investigation. This arena forces multi-turn behavior: inspect evidence, avoid misleading clues, choose a mitigation, and communicate clearly.
Five incident families are included:
- cache saturation causing API latency
- bad deploy causing auth failures
- payment provider degradation
- database connection leak
- queue backlog from worker crash
Tools
The agent can call these tools:
- query_logs(service, query, minutes)
- inspect_metric(service, metric, minutes)
- read_runbook(topic)
- test_hypothesis(root_cause)
- apply_mitigation(mitigation)
- final_report(root_cause, mitigation, customer_update)
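As a rough illustration of how a tool-call trace could be sanity-checked, the sketch below uses only the six tool names listed above. The ordering rules (end with final_report, test a hypothesis before mitigating) and the `check_trace` helper are assumptions made for this example, not the environment's actual validation logic.

```python
# The six tool names listed in this README.
TOOLS = {
    "query_logs",
    "inspect_metric",
    "read_runbook",
    "test_hypothesis",
    "apply_mitigation",
    "final_report",
}


def check_trace(trace):
    """Return a list of problems found in a sequence of tool names.

    The ordering rules here are assumptions for illustration only.
    """
    problems = []
    for name in trace:
        if name not in TOOLS:
            problems.append(f"unknown tool: {name}")
    if not trace or trace[-1] != "final_report":
        problems.append("episode should end with final_report")
    if "apply_mitigation" in trace and "test_hypothesis" in trace:
        if trace.index("apply_mitigation") < trace.index("test_hypothesis"):
            problems.append("mitigation applied before hypothesis was tested")
    return problems


good = [
    "query_logs", "inspect_metric", "read_runbook",
    "test_hypothesis", "apply_mitigation", "final_report",
]
print(check_trace(good))  # []
print(check_trace(["apply_mitigation", "test_hypothesis"]))
```

The second call reports two problems: the trace does not end with final_report, and the mitigation precedes the hypothesis test.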
Rewards prioritize correct root cause and mitigation, then evidence support, update quality, safety, and efficient tool use.
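One way to picture this priority order is a weighted sum in which root cause and mitigation carry the most weight. The weights and the `total_reward` helper below are hypothetical; the README states only the ordering, so the real reward logic in incident_commander_env/ may differ.

```python
# Hypothetical weights: the README gives only the priority order, not values.
WEIGHTS = {
    "root_cause": 0.30,
    "mitigation": 0.30,
    "evidence": 0.15,
    "update_quality": 0.10,
    "safety": 0.10,
    "efficiency": 0.05,
}


def total_reward(components):
    """Weighted sum of per-component scores, each clamped to [0, 1]."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        score = min(1.0, max(0.0, components.get(name, 0.0)))
        total += weight * score
    return total


# An agent that is safe and efficient but names the wrong root cause
# and mitigation can earn at most the remaining 0.40 under these weights.
partial = {"evidence": 1.0, "update_quality": 1.0, "safety": 1.0, "efficiency": 1.0}
print(round(total_reward(partial), 2))
```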
Quickstart
python -m venv .venv
.venv\Scripts\activate
pip install -e .
python -m unittest discover -s tests
python scripts/manual_episode.py --scenario cache_saturation
python scripts/evaluate_baselines.py --write-assets
Run the OpenEnv/FastAPI server:
pip install -e ".[server]"
uvicorn server.app:app --host 0.0.0.0 --port 8000
Health check:
curl http://127.0.0.1:8000/health
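For scripted smoke checks, the same request can be made from Python with the standard library. This is a minimal sketch that assumes only that /health answers with HTTP 200 when the server is up; the response body is not inspected because its format is not documented here, and the helper names are illustrative.

```python
import urllib.request


def health_url(base_url):
    """Build the /health URL from a base such as http://127.0.0.1:8000."""
    return base_url.rstrip("/") + "/health"


def is_healthy(base_url, timeout=5.0):
    """Return True if GET /health answers with HTTP 200, else False."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError and HTTPError are OSError subclasses, so this covers
        # refused connections, timeouts, and non-2xx responses.
        return False


if __name__ == "__main__":
    print(is_healthy("http://127.0.0.1:8000"))
```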
Training
Start with the smallest Qwen model for proof of learning:
pip install -e ".[train,vllm]"
python scripts/train_grpo.py \
--model Qwen/Qwen3-0.6B \
--output-dir outputs/grpo-qwen3-0.6b \
--max-steps 80
Capture training logs for submission:
python scripts/run_with_logs.py --name grpo_qwen3_0_6b -- python scripts/train_grpo.py --model Qwen/Qwen3-0.6B --output-dir outputs/grpo-qwen3-0.6b --max-steps 80
For Hugging Face Jobs:
hf jobs uv run \
--flavor t4-small \
--timeout 4h \
--with "incident-commander-env[train,vllm] @ git+https://huggingface.co/spaces/YOUR_USERNAME/incident_commander_env" \
--secrets HF_TOKEN \
-- incident-commander-train --model Qwen/Qwen3-0.6B --max-steps 80
If the first run shows an improving reward curve, rerun with Qwen/Qwen3-1.7B.
If Colab T4 fails on vllm, use the slower fallback:
pip install -e ".[train]"
python scripts/run_with_logs.py --name grpo_qwen3_0_6b_no_vllm -- python scripts/train_grpo.py --model Qwen/Qwen3-0.6B --output-dir outputs/grpo-qwen3-0.6b --max-steps 40 --num-generations 2 --gradient-accumulation-steps 8 --no-vllm
The training wrapper uses curriculum reward shaping for GRPO: intermediate evidence collection, supported hypothesis checks, and safe mitigations receive small capped rewards, while the environment's final evaluation score still requires the correct root cause, mitigation, and stakeholder report. This avoids all-zero batches in tool-use RL while preserving the harder final incident score.
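The shaping idea can be sketched as capped intermediate bonuses added to the final score. The event names, bonus values, and cap below are hypothetical, since the README describes the rewards only as small and capped; the actual wrapper in scripts/train_grpo.py may be structured differently.

```python
# Hypothetical values: the README says the intermediate rewards are
# small and capped but does not give the numbers.
STEP_BONUS = {
    "evidence_collected": 0.02,
    "hypothesis_supported": 0.05,
    "safe_mitigation": 0.05,
}
SHAPING_CAP = 0.15


def shaped_reward(events, final_score):
    """Training reward: capped intermediate bonuses plus the final score."""
    bonus = sum(STEP_BONUS.get(event, 0.0) for event in events)
    return min(bonus, SHAPING_CAP) + final_score


# A rollout that gathers evidence but never solves the incident still
# earns a small non-zero reward, so a GRPO group has some variance.
events = ["evidence_collected"] * 3 + ["hypothesis_supported"]
print(round(shaped_reward(events, final_score=0.0), 2))

# Repeating cheap actions cannot exceed the cap.
print(round(shaped_reward(["evidence_collected"] * 20, final_score=0.0), 2))
```

Because GRPO computes advantages relative to the other completions in a group, a batch where every completion scores zero yields no learning signal. The cap keeps the bonuses from outweighing the final evaluation score.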
Evaluation
scripts/evaluate_baselines.py writes:
- outputs/evals/baseline_eval.json
- assets/baseline_vs_oracle.png
- assets/reward_curve_template.png
Use the log wrapper for evidence:
python scripts/run_with_logs.py --name unit_tests -- python -m unittest discover -s tests
python scripts/run_with_logs.py --name manual_episode_cache -- python scripts/manual_episode.py --scenario cache_saturation
python scripts/run_with_logs.py --name baseline_eval -- python scripts/evaluate_baselines.py --write-assets
Current local smoke evaluation:
| Policy | Mean reward | Root cause acc | Mitigation acc | Unsafe rate | Avg tool calls |
|---|---|---|---|---|---|
| Weak baseline | 0.172 | 0.000 | 0.000 | 0.000 | 2.00 |
| Oracle trace | 1.000 | 1.000 | 1.000 | 0.000 | 6.00 |
| Trained model | fill from GRPO run | fill | fill | fill | fill |
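The columns in this table are simple per-episode aggregates. The sketch below shows one way to compute them; the episode field names are assumptions for illustration and may not match the schema of outputs/evals/baseline_eval.json.

```python
from statistics import mean


def summarize(episodes):
    """Aggregate per-episode results into the table's five columns.

    Field names are hypothetical and chosen for this example.
    """
    return {
        "mean_reward": mean(e["reward"] for e in episodes),
        "root_cause_acc": mean(float(e["root_cause_correct"]) for e in episodes),
        "mitigation_acc": mean(float(e["mitigation_correct"]) for e in episodes),
        "unsafe_rate": mean(float(e["unsafe"]) for e in episodes),
        "avg_tool_calls": mean(e["tool_calls"] for e in episodes),
    }


episodes = [
    {"reward": 0.2, "root_cause_correct": False, "mitigation_correct": False,
     "unsafe": False, "tool_calls": 2},
    {"reward": 1.0, "root_cause_correct": True, "mitigation_correct": True,
     "unsafe": False, "tool_calls": 6},
]
for name, value in summarize(episodes).items():
    print(f"{name}: {value:.3f}")
```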
Replace the template below with the real GRPO reward curve after training:
Multi-Agent Incident Command
The repo also includes a deterministic multi-agent orchestration layer for demos and future training curriculum:
- api_auth_specialist: checks API logs, auth-like metrics, deploy and login signals.
- data_store_specialist: checks database or ledger logs, connection metrics, and storage runbooks.
- dependency_worker_specialist: checks cache, queue, worker, gateway, and provider signals.
- Parent commander: synthesizes specialist findings, tests the strongest hypothesis, applies the safest mitigation, and submits the final stakeholder report.
Run a local multi-agent episode:
python scripts/multi_agent_demo.py --scenario payment_provider_degradation
python scripts/multi_agent_demo.py --scenario database_connection_leak --json
The Space exposes the same flow at:
/orchestrate?scenario_id=payment_provider_degradation
This is intentionally reproducible and uses generated incidents instead of live cloud credentials. A production extension can connect specialists to AWS CloudWatch, Azure Monitor, GCP Cloud Logging, Prometheus, or a read-only demo database, but the judged environment remains deterministic so rewards and logs are repeatable.
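To make the commander's synthesis step concrete, here is a minimal sketch of picking the strongest hypothesis from specialist findings in a deterministic way. The `Finding` type, the confidence scores, and the tie-breaking rule are all assumptions for illustration; the repo's orchestration layer may combine findings differently.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    specialist: str    # e.g. "api_auth_specialist"
    root_cause: str    # hypothesis proposed by the specialist
    confidence: float  # hypothetical score in [0, 1]


def strongest_hypothesis(findings):
    """Pick the root cause with the highest summed confidence.

    Ties are broken alphabetically so repeated runs give the same answer.
    """
    if not findings:
        raise ValueError("no specialist findings to synthesize")
    totals = {}
    for f in findings:
        totals[f.root_cause] = totals.get(f.root_cause, 0.0) + f.confidence
    return min(totals, key=lambda cause: (-totals[cause], cause))


findings = [
    Finding("api_auth_specialist", "payment_provider_degradation", 0.4),
    Finding("data_store_specialist", "database_connection_leak", 0.6),
    Finding("dependency_worker_specialist", "payment_provider_degradation", 0.5),
]
print(strongest_hypothesis(findings))  # payment_provider_degradation
```

Summing confidence across specialists lets two moderately confident agreeing specialists outvote one more confident dissenter, which is one plausible reading of "strongest hypothesis".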
Deploy to Hugging Face Space
Recommended path:
pip install -e ".[server]"
openenv push --repo-id YOUR_USERNAME/incident_commander_env
Fallback Docker path:
docker build -f server/Dockerfile -t incident-commander-env .
docker run --rm -p 8000:8000 incident-commander-env
Submission Checklist
- Push this repo to a Hugging Face Space.
- Link the Space URL in this README.
- Run scripts/train_grpo.py and commit reward/loss plots.
- Add logs from outputs/logs/ and one before/after transcript from outputs/.
- Record a demo under two minutes showing an untrained failure and a trained improvement.
Local Package Layout
- incident_commander_env/: package, models, scenario engine, reward logic
- server/: OpenEnv/FastAPI entrypoint and Dockerfile
- scripts/: smoke test, baseline evaluation, GRPO training
- tests/: deterministic unit tests
- notebooks/: Colab-oriented training notebook
HF Training Result
A GRPO training run completed on Hugging Face Jobs.
- Job URL: https://huggingface.co/jobs/Dar3devil/69ed03a7d2c8bd8662bce22d
- Hardware: a10g-large
- Model: Qwen/Qwen3-0.6B
- Steps: 180/180 completed
- First logged reward: 0.0375
- Best logged reward: 0.2300
- Final logged reward: 0.2125
- Final train loss: -0.005505
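The size of the improvement follows directly from the logged values above. The short calculation below uses only those three numbers; since each is a single logged point, it is a rough indicator rather than a smoothed estimate.

```python
# Values copied from the training result listed above.
first, best, final = 0.0375, 0.2300, 0.2125

abs_gain = final - first       # absolute improvement in logged reward
rel_gain = final / first       # final reward as a multiple of the first
best_to_final = best - final   # how far the final value sits below the peak

print(f"absolute gain: {abs_gain:.4f}")        # absolute gain: 0.1750
print(f"final / first: {rel_gain:.2f}x")       # final / first: 5.67x
print(f"drop from best: {best_to_final:.4f}")  # drop from best: 0.0175
```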
Compact training evidence:
- outputs/evals/hf_training_summary.md
- outputs/evals/hf_training_metrics.csv
- outputs/logs/hf_job_69ed03a7_inspect.txt
Full raw training log is saved locally and available from the HF Jobs page.


