	---
	title: Incident Commander RL Arena
	sdk: docker
	app_port: 8000
	pinned: false
	license: bsd-3-clause
	---

	# Incident Commander RL Arena

	An OpenEnv-style reinforcement learning environment for training LLM agents to act as on-call incident commanders. The agent investigates a production incident through tools, validates a root cause, applies a safe mitigation, and writes a concise stakeholder update.

	The environment is designed for the OpenEnv hackathon rubric: a novel environment, a clear story, measurable reward improvement, and a training loop small enough to run on T4-class GPUs.

	## Why This Environment

	LLMs are often asked to operate software systems, but most benchmarks reward static Q&A instead of stateful investigation. This arena forces multi-turn behavior: inspect evidence, avoid misleading clues, choose a mitigation, and communicate clearly.

	Five incident families are included:

	- cache saturation causing API latency
	- bad deploy causing auth failures
	- payment provider degradation
	- database connection leak
	- queue backlog from worker crash

	## Tools

	The agent can call these tools:

	- `query_logs(service, query, minutes)`
	- `inspect_metric(service, metric, minutes)`
	- `read_runbook(topic)`
	- `test_hypothesis(root_cause)`
	- `apply_mitigation(mitigation)`
	- `final_report(root_cause, mitigation, customer_update)`
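
To illustrate how the six tools might be sequenced in a single episode, here is a minimal sketch. The `RecordingEnv` stub, the `call` method, and every argument value (service, metric, topic, and mitigation strings) are illustrative assumptions and not the package's actual API; only the tool names and parameter names come from the list above.

```python
from typing import Any

# Tool names taken from the list above.
TOOL_NAMES = (
    "query_logs",
    "inspect_metric",
    "read_runbook",
    "test_hypothesis",
    "apply_mitigation",
    "final_report",
)


class RecordingEnv:
    """Hypothetical stand-in for the environment: it only records tool calls."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, dict[str, Any]]] = []

    def call(self, tool: str, **kwargs: Any) -> None:
        if tool not in TOOL_NAMES:
            raise ValueError(f"unknown tool: {tool}")
        self.calls.append((tool, kwargs))


def run_example_episode(env: RecordingEnv) -> None:
    """Investigate, validate, mitigate, then report. All values are made up."""
    env.call("query_logs", service="api", query="timeout", minutes=30)
    env.call("inspect_metric", service="cache", metric="hit_rate", minutes=30)
    env.call("read_runbook", topic="cache")
    env.call("test_hypothesis", root_cause="cache_saturation")
    env.call("apply_mitigation", mitigation="scale_cache")
    env.call(
        "final_report",
        root_cause="cache_saturation",
        mitigation="scale_cache",
        customer_update="API latency was elevated; a mitigation has been applied.",
    )


env = RecordingEnv()
run_example_episode(env)
for tool, kwargs in env.calls:
    print(tool, kwargs)
```

The ordering mirrors the workflow described earlier: gather evidence, test a hypothesis, mitigate, and only then submit the final report.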

	Rewards prioritize correct root cause and mitigation, then evidence support, update quality, safety, and efficient tool use.
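
One way to picture this priority order is a weighted sum of component scores. The sketch below is an assumption throughout: the README states only the ordering, so the component names and the weight values are hypothetical and are not taken from the package's reward logic.

```python
# Hypothetical weights: only the priority ORDER comes from the README.
WEIGHTS = {
    "root_cause": 0.30,
    "mitigation": 0.30,
    "evidence": 0.15,
    "update_quality": 0.10,
    "safety": 0.10,
    "efficiency": 0.05,
}


def combine_reward(components: dict[str, float]) -> float:
    """Weighted sum of component scores, each clamped to [0, 1].

    Missing components count as 0.
    """
    total = 0.0
    for name, weight in WEIGHTS.items():
        score = min(1.0, max(0.0, components.get(name, 0.0)))
        total += weight * score
    return total


# A run that is safe and efficient but wrong about the root cause and
# mitigation stays far below a fully correct run.
partial = combine_reward({"safety": 1.0, "efficiency": 1.0})
perfect = combine_reward({name: 1.0 for name in WEIGHTS})
print(partial < perfect)
```

With weights shaped like these, an agent cannot reach a high score through safe, efficient behavior alone; the root cause and mitigation terms dominate.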

	## Quickstart

	```bash
	python -m venv .venv
	source .venv/bin/activate  # Windows: .venv\Scripts\activate
	pip install -e .
	python -m unittest discover -s tests
	python scripts/manual_episode.py --scenario cache_saturation
	python scripts/evaluate_baselines.py --write-assets
	```

	Run the OpenEnv/FastAPI server:

	```bash
	pip install -e ".[server]"
	uvicorn server.app:app --host 0.0.0.0 --port 8000
	```

	Health check:

	```bash
	curl http://127.0.0.1:8000/health
	```
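
The same check can be scripted, for example to wait for the server before starting an episode. This is a minimal sketch using only the standard library; it assumes the server is reachable at the host and port shown above and that a healthy server answers `/health` with HTTP 200 (the response body is not specified here, so it is ignored).

```python
import urllib.error
import urllib.request


def health_url(host: str = "127.0.0.1", port: int = 8000) -> str:
    """Build the health-check URL used in the curl example."""
    return f"http://{host}:{port}/health"


def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True on HTTP 200, False on any connection or HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    print("healthy" if is_healthy(health_url()) else "not reachable")
```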

	## Training

	Start with the smallest Qwen model for proof of learning:

	```bash
	pip install -e ".[train,vllm]"
	python scripts/train_grpo.py \
	  --model Qwen/Qwen3-0.6B \
	  --output-dir outputs/grpo-qwen3-0.6b \
	  --max-steps 80
	```

	Capture training logs for submission:

	```bash
	python scripts/run_with_logs.py --name grpo_qwen3_0_6b -- python scripts/train_grpo.py --model Qwen/Qwen3-0.6B --output-dir outputs/grpo-qwen3-0.6b --max-steps 80
	```

	For Hugging Face Jobs:

	```bash
	hf jobs uv run \
	  --flavor t4-small \
	  --timeout 4h \
	  --with "incident-commander-env[train,vllm] @ git+https://huggingface.co/spaces/YOUR_USERNAME/incident_commander_env" \
	  --secrets HF_TOKEN \
	  -- incident-commander-train --model Qwen/Qwen3-0.6B --max-steps 80
	```

	If the first run shows a rising reward curve, rerun with `Qwen/Qwen3-1.7B`.

	If Colab T4 fails on `vllm`, use the slower fallback:

	```bash
	pip install -e ".[train]"
	python scripts/run_with_logs.py --name grpo_qwen3_0_6b_no_vllm -- python scripts/train_grpo.py --model Qwen/Qwen3-0.6B --output-dir outputs/grpo-qwen3-0.6b --max-steps 40 --num-generations 2 --gradient-accumulation-steps 8 --no-vllm
	```

	The training wrapper uses curriculum reward shaping for GRPO: intermediate evidence
	collection, supported hypothesis checks, and safe mitigations receive small capped
	rewards, while the environment's final evaluation score still requires the correct
	root cause, mitigation, and stakeholder report. This avoids all-zero batches in
	tool-use RL while preserving the harder final incident score.
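
The following sketch illustrates why capped shaping helps GRPO, under stated assumptions: the event names, bonus values, and cap are hypothetical, adding the bonus to the final score is one possible design, and the advantage formula is the standard group-relative normalization (reward minus group mean, divided by group standard deviation) rather than code copied from the training wrapper.

```python
from statistics import mean, pstdev

# Hypothetical per-event bonuses and cap; the real values are not given here.
STEP_BONUS = {
    "evidence": 0.02,
    "supported_hypothesis": 0.05,
    "safe_mitigation": 0.05,
}
SHAPING_CAP = 0.15


def shaped_reward(events: list[str], final_score: float) -> float:
    """Final score plus a small, capped bonus for useful intermediate steps."""
    bonus = sum(STEP_BONUS.get(event, 0.0) for event in events)
    return min(bonus, SHAPING_CAP) + final_score


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Without shaping, a group in which every rollout fails has identical rewards,
# so every advantage is zero and the batch carries no learning signal.
print(group_advantages([0.0, 0.0, 0.0, 0.0]))

# With shaping, rollouts that gathered evidence are ranked above ones that
# did nothing, even though every final score is still zero.
rollouts = [["evidence", "supported_hypothesis"], ["evidence"], [], []]
print(group_advantages([shaped_reward(events, 0.0) for events in rollouts]))
```

Because the bonus is capped well below the final score, a rollout cannot substitute evidence collection for a correct root cause, mitigation, and report.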

	## Evaluation

	`scripts/evaluate_baselines.py` writes:

	- `outputs/evals/baseline_eval.json`
	- `assets/baseline_vs_oracle.png`
	- `assets/reward_curve_template.png`

	Use the log wrapper for evidence:

	```bash
	python scripts/run_with_logs.py --name unit_tests -- python -m unittest discover -s tests
	python scripts/run_with_logs.py --name manual_episode_cache -- python scripts/manual_episode.py --scenario cache_saturation
	python scripts/run_with_logs.py --name baseline_eval -- python scripts/evaluate_baselines.py --write-assets
	```

	Current local smoke evaluation:

	| Policy | Mean reward | Root cause acc | Mitigation acc | Unsafe rate | Avg tool calls |
	| --- | ---: | ---: | ---: | ---: | ---: |
	| Weak baseline | 0.172 | 0.000 | 0.000 | 0.000 | 2.00 |
	| Oracle trace | 1.000 | 1.000 | 1.000 | 0.000 | 6.00 |
	| Trained model | fill from GRPO run | fill | fill | fill | fill |

	![Baseline vs oracle](assets/baseline_vs_oracle.png)

	Replace the template below with the real GRPO reward curve after training:

	![Reward curve template](assets/reward_curve_template.png)
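
For reference, the columns of the smoke evaluation table can be read as simple per-policy averages over episodes. The sketch below shows one way to compute them; the `EpisodeResult` record and its field names are assumptions for illustration and may not match what `scripts/evaluate_baselines.py` actually stores.

```python
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    """Hypothetical per-episode record; field names are assumptions."""

    reward: float
    root_cause_correct: bool
    mitigation_correct: bool
    unsafe_action: bool
    tool_calls: int


def summarize(episodes: list[EpisodeResult]) -> dict[str, float]:
    """Average each per-episode field into the table's columns."""
    n = len(episodes)
    if n == 0:
        raise ValueError("need at least one episode")
    return {
        "mean_reward": sum(e.reward for e in episodes) / n,
        "root_cause_acc": sum(e.root_cause_correct for e in episodes) / n,
        "mitigation_acc": sum(e.mitigation_correct for e in episodes) / n,
        "unsafe_rate": sum(e.unsafe_action for e in episodes) / n,
        "avg_tool_calls": sum(e.tool_calls for e in episodes) / n,
    }


# Two made-up episodes: one fully correct, one that fails after two tool calls.
episodes = [
    EpisodeResult(1.0, True, True, False, 6),
    EpisodeResult(0.0, False, False, False, 2),
]
print(summarize(episodes))
```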

	## Multi-Agent Incident Command

	The repo also includes a deterministic multi-agent orchestration layer for demos and
	as a basis for a future training curriculum:

	- `api_auth_specialist`: checks API logs, auth-like metrics, deploy and login signals.
	- `data_store_specialist`: checks database or ledger logs, connection metrics, and storage runbooks.
	- `dependency_worker_specialist`: checks cache, queue, worker, gateway, and provider signals.
	- Parent commander: synthesizes specialist findings, tests the strongest hypothesis, applies the safest mitigation, and submits the final stakeholder report.
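
The commander's synthesis step can be sketched as a deterministic aggregation over specialist findings. Everything below is an assumption made for illustration: the `Finding` record, the confidence scores, and the sum-then-pick rule are hypothetical, and the repo's orchestration layer may combine findings differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical specialist output: a candidate root cause with a confidence."""

    specialist: str
    root_cause: str
    confidence: float


def strongest_hypothesis(findings: list[Finding]) -> str:
    """Sum confidence per root cause and return the best-supported one.

    Ties are broken alphabetically so the result is deterministic.
    """
    totals: dict[str, float] = {}
    for finding in findings:
        totals[finding.root_cause] = (
            totals.get(finding.root_cause, 0.0) + finding.confidence
        )
    if not totals:
        raise ValueError("no findings to synthesize")
    return min(totals, key=lambda cause: (-totals[cause], cause))


findings = [
    Finding("api_auth_specialist", "payment_provider_degradation", 0.3),
    Finding("data_store_specialist", "database_connection_leak", 0.6),
    Finding("dependency_worker_specialist", "database_connection_leak", 0.2),
]
print(strongest_hypothesis(findings))
```

In this sketch the commander would then pass the selected root cause to `test_hypothesis` before applying a mitigation, matching the order described in the list above.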

	Run a local multi-agent episode:

	```bash
	python scripts/multi_agent_demo.py --scenario payment_provider_degradation
	python scripts/multi_agent_demo.py --scenario database_connection_leak --json
	```

	The Space exposes the same flow at:

	```text
	/orchestrate?scenario_id=payment_provider_degradation
	```
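
A small helper can build this URL for any scenario. This is a sketch with the standard library only; the base URL is an assumption (a locally running server on port 8000), and the response format is not documented here, so the example stops at constructing the request URL.

```python
from urllib.parse import urlencode


def orchestrate_url(base_url: str, scenario_id: str) -> str:
    """Build the /orchestrate URL for a scenario, URL-encoding the query."""
    query = urlencode({"scenario_id": scenario_id})
    return f"{base_url.rstrip('/')}/orchestrate?{query}"


# Assumed base URL: a server started locally with uvicorn on port 8000.
print(orchestrate_url("http://127.0.0.1:8000", "payment_provider_degradation"))
```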

	The orchestration flow is intentionally reproducible: it uses generated incidents
	rather than live cloud credentials. A production extension could connect specialists
	to AWS CloudWatch, Azure Monitor, GCP Cloud Logging, Prometheus, or a read-only demo
	database, but the judged environment remains deterministic so rewards and logs are repeatable.

	## Deploy to Hugging Face Space

	Recommended path:

	```bash
	pip install -e ".[server]"
	openenv push --repo-id YOUR_USERNAME/incident_commander_env
	```

	Fallback Docker path:

	```bash
	docker build -f server/Dockerfile -t incident-commander-env .
	docker run --rm -p 8000:8000 incident-commander-env
	```

	## Submission Checklist

	- Push this repo to a Hugging Face Space.
	- Link the Space URL in this README.
	- Run `scripts/train_grpo.py` and commit reward/loss plots.
	- Add logs from `outputs/logs/` and one before/after transcript from `outputs/`.
	- Record a demo under two minutes showing an untrained failure and a trained improvement.

	## Local Package Layout

	- `incident_commander_env/`: package, models, scenario engine, reward logic
	- `server/`: OpenEnv/FastAPI entrypoint and Dockerfile
	- `scripts/`: smoke test, baseline evaluation, GRPO training
	- `tests/`: deterministic unit tests
	- `notebooks/`: Colab-oriented training notebook

	## HF Training Result

	A GRPO training run completed on Hugging Face Jobs.

	- Job URL: https://huggingface.co/jobs/Dar3devil/69ed03a7d2c8bd8662bce22d
	- Hardware: `a10g-large`
	- Model: `Qwen/Qwen3-0.6B`
	- Steps: `180/180 completed`
	- First logged reward: `0.0375`
	- Best logged reward: `0.2300`
	- Final logged reward: `0.2125`
	- Final train loss: `-0.005505`
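
As a quick worked calculation on the logged values above (the numbers are copied from the list; nothing else is assumed):

```python
first_reward = 0.0375
best_reward = 0.2300
final_reward = 0.2125

absolute_gain = final_reward - first_reward   # improvement over the run
relative_gain = final_reward / first_reward   # final as a multiple of first
drop_from_best = best_reward - final_reward   # gap between best and final

print(f"absolute gain:  {absolute_gain:.4f}")    # 0.1750
print(f"relative gain:  {relative_gain:.2f}x")   # 5.67x
print(f"drop from best: {drop_from_best:.4f}")   # 0.0175
```

These are training rewards as logged. Whether they are directly comparable to the mean reward in the evaluation table depends on whether the logged value includes the shaping terms, which is not stated here.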

	![HF training reward curve](assets/hf_training_reward_curve.png)

	Compact training evidence:

	- `outputs/evals/hf_training_summary.md`
	- `outputs/evals/hf_training_metrics.csv`
	- `outputs/logs/hf_job_69ed03a7_inspect.txt`

	The full raw training log is saved locally and is also available from the HF Jobs page.