title: Incident Commander RL Arena
sdk: docker
app_port: 8000
pinned: false
license: bsd-3-clause

Incident Commander RL Arena

An OpenEnv-style reinforcement learning environment for training LLM agents to act as on-call incident commanders. The agent investigates a production incident through tools, validates a root cause, applies a safe mitigation, and writes a concise stakeholder update.

The environment is designed for the OpenEnv hackathon rubric: a novel environment, a clear story, measurable reward improvement, and a training loop small enough to run on T4-class GPUs.

Why This Environment

LLMs are often asked to operate software systems, but most benchmarks reward static Q&A instead of stateful investigation. This arena forces multi-turn behavior: inspect evidence, avoid misleading clues, choose a mitigation, and communicate clearly.

Five incident families are included:

cache saturation causing API latency
bad deploy causing auth failures
payment provider degradation
database connection leak
queue backlog from worker crash

Tools

The agent can call these tools:

query_logs(service, query, minutes)
inspect_metric(service, metric, minutes)
read_runbook(topic)
test_hypothesis(root_cause)
apply_mitigation(mitigation)
final_report(root_cause, mitigation, customer_update)

Rewards prioritize correct root cause and mitigation, then evidence support, update quality, safety, and efficient tool use.
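The priority ordering above could be expressed as a weighted sum of per-component scores. The sketch below is illustrative only: the weights, the `EpisodeOutcome` fields, and `score_episode` are assumptions made for this example, not the environment's actual reward logic.

```python
from dataclasses import dataclass

# Hypothetical weights, chosen only to mirror the stated priority order.
# The real values live in the package's reward logic and may differ.
WEIGHTS = {
    "root_cause": 0.35,
    "mitigation": 0.30,
    "evidence": 0.15,
    "update_quality": 0.10,
    "safety": 0.05,
    "efficiency": 0.05,
}


@dataclass
class EpisodeOutcome:
    """Per-component scores in [0, 1] for one finished episode (assumed shape)."""
    root_cause: float
    mitigation: float
    evidence: float
    update_quality: float
    safety: float
    efficiency: float


def score_episode(outcome: EpisodeOutcome) -> float:
    """Return the weighted sum of component scores, clamped to [0, 1]."""
    total = sum(weight * getattr(outcome, name) for name, weight in WEIGHTS.items())
    return max(0.0, min(1.0, total))
```

With weights like these, an episode that only gathers evidence scores well below one that identifies the correct root cause, which is the ordering the environment describes.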

Quickstart

python -m venv .venv
.venv\Scripts\activate        # Windows
# source .venv/bin/activate   # macOS/Linux equivalent
pip install -e .
python -m unittest discover -s tests
python scripts/manual_episode.py --scenario cache_saturation
python scripts/evaluate_baselines.py --write-assets

Run the OpenEnv/FastAPI server:

pip install -e ".[server]"
uvicorn server.app:app --host 0.0.0.0 --port 8000

Health check:

curl http://127.0.0.1:8000/health

Training

Start with the smallest Qwen3 model as a proof of learning:

pip install -e ".[train,vllm]"
python scripts/train_grpo.py \
  --model Qwen/Qwen3-0.6B \
  --output-dir outputs/grpo-qwen3-0.6b \
  --max-steps 80

Capture training logs for submission:

python scripts/run_with_logs.py --name grpo_qwen3_0_6b -- python scripts/train_grpo.py --model Qwen/Qwen3-0.6B --output-dir outputs/grpo-qwen3-0.6b --max-steps 80

For Hugging Face Jobs:

hf jobs uv run \
  --flavor t4-small \
  --timeout 4h \
  --with "incident-commander-env[train,vllm] @ git+https://huggingface.co/spaces/YOUR_USERNAME/incident_commander_env" \
  --secrets HF_TOKEN \
  -- incident-commander-train --model Qwen/Qwen3-0.6B --max-steps 80

If the first run shows an improving reward curve, rerun with the larger Qwen/Qwen3-1.7B.

If vLLM fails on a Colab T4, use the slower fallback that runs without it:

pip install -e ".[train]"
python scripts/run_with_logs.py --name grpo_qwen3_0_6b_no_vllm -- python scripts/train_grpo.py --model Qwen/Qwen3-0.6B --output-dir outputs/grpo-qwen3-0.6b --max-steps 40 --num-generations 2 --gradient-accumulation-steps 8 --no-vllm

The training wrapper uses curriculum reward shaping for GRPO: intermediate evidence collection, supported hypothesis checks, and safe mitigations receive small capped rewards, while the environment's final evaluation score still requires the correct root cause, mitigation, and stakeholder report. This avoids all-zero batches in tool-use RL while preserving the harder final incident score.
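To make the shaping idea concrete, here is a minimal sketch under stated assumptions. The bonus sizes, the cap, and the function names are hypothetical and are not taken from `scripts/train_grpo.py`. It also shows why all-zero batches are a problem: GRPO-style advantages are computed relative to the group, so a group with identical rewards yields zero advantage for every sample.

```python
from statistics import fmean, pstdev
from typing import List


def shaped_reward(
    final_score: float,
    evidence_steps: int,
    supported_hypothesis: bool,
    safe_mitigation: bool,
    step_bonus: float = 0.02,    # assumed per-evidence-step bonus
    event_bonus: float = 0.05,   # assumed bonus per milestone
    shaping_cap: float = 0.2,    # assumed cap on total shaping
) -> float:
    """Training reward = final environment score + small capped shaping bonus."""
    bonus = step_bonus * evidence_steps
    if supported_hypothesis:
        bonus += event_bonus
    if safe_mitigation:
        bonus += event_bonus
    bonus = min(bonus, shaping_cap)  # shaping can never dominate the final score
    return min(1.0, final_score + bonus)


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages: (reward - group mean) / (group std + eps)."""
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

For example, `group_advantages([0.0, 0.0, 0.0, 0.0])` returns all zeros, so the policy gets no learning signal, while a group in which one rollout earned a small shaping bonus produces non-zero advantages.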

Evaluation

scripts/evaluate_baselines.py writes:

outputs/evals/baseline_eval.json
assets/baseline_vs_oracle.png
assets/reward_curve_template.png

Use the log wrapper to capture evidence for the submission:

python scripts/run_with_logs.py --name unit_tests -- python -m unittest discover -s tests
python scripts/run_with_logs.py --name manual_episode_cache -- python scripts/manual_episode.py --scenario cache_saturation
python scripts/run_with_logs.py --name baseline_eval -- python scripts/evaluate_baselines.py --write-assets

Current local smoke evaluation:

Policy	Mean reward	Root cause acc	Mitigation acc	Unsafe rate	Avg tool calls
Weak baseline	0.172	0.000	0.000	0.000	2.00
Oracle trace	1.000	1.000	1.000	0.000	6.00
Trained model	fill from GRPO run	fill	fill	fill	fill
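The columns in this table can be derived from per-episode records. The sketch below shows one way to aggregate them. The `EpisodeResult` fields and the `summarize` helper are assumptions for illustration and may not match the structure of `outputs/evals/baseline_eval.json`.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Dict, List


@dataclass
class EpisodeResult:
    """One evaluated episode (hypothetical record shape)."""
    reward: float
    root_cause_correct: bool
    mitigation_correct: bool
    unsafe_action: bool
    tool_calls: int


def summarize(results: List[EpisodeResult]) -> Dict[str, float]:
    """Aggregate episode records into the five table columns."""
    if not results:
        raise ValueError("need at least one episode to summarize")
    return {
        "mean_reward": fmean([r.reward for r in results]),
        "root_cause_acc": fmean([float(r.root_cause_correct) for r in results]),
        "mitigation_acc": fmean([float(r.mitigation_correct) for r in results]),
        "unsafe_rate": fmean([float(r.unsafe_action) for r in results]),
        "avg_tool_calls": fmean([float(r.tool_calls) for r in results]),
    }
```

Running `summarize` over the trained model's evaluation episodes would produce the values needed for the last table row.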

Replace the template reward curve (assets/reward_curve_template.png) with the real GRPO reward curve after training.

Multi-Agent Incident Command

The repo also includes a deterministic multi-agent orchestration layer for demos and future training curricula:

api_auth_specialist: checks API logs, auth-like metrics, deploy and login signals.
data_store_specialist: checks database or ledger logs, connection metrics, and storage runbooks.
dependency_worker_specialist: checks cache, queue, worker, gateway, and provider signals.
Parent commander: synthesizes specialist findings, tests the strongest hypothesis, applies the safest mitigation, and submits the final stakeholder report.
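The division of labor above can be sketched as a set of pure functions plus a deterministic selection step. This is a simplified illustration, not the repo's orchestration code: the `Finding` type, the signal names (`auth_error_rate`, `db_connection_usage`, `provider_error_rate`), and the one-hypothesis-per-specialist mapping are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Signals = Dict[str, float]


@dataclass(frozen=True)
class Finding:
    specialist: str
    hypothesis: str
    confidence: float  # 0..1, higher means stronger evidence


def api_auth_specialist(signals: Signals) -> Finding:
    # Hypothetical rule: auth error rate points at a bad deploy.
    return Finding("api_auth_specialist", "bad_deploy",
                   signals.get("auth_error_rate", 0.0))


def data_store_specialist(signals: Signals) -> Finding:
    return Finding("data_store_specialist", "database_connection_leak",
                   signals.get("db_connection_usage", 0.0))


def dependency_worker_specialist(signals: Signals) -> Finding:
    return Finding("dependency_worker_specialist", "payment_provider_degradation",
                   signals.get("provider_error_rate", 0.0))


SPECIALISTS: List[Callable[[Signals], Finding]] = [
    api_auth_specialist,
    data_store_specialist,
    dependency_worker_specialist,
]


def parent_commander(signals: Signals) -> Finding:
    """Pick the strongest finding; ties break by specialist name for determinism."""
    findings = [specialist(signals) for specialist in SPECIALISTS]
    return min(findings, key=lambda f: (-f.confidence, f.specialist))
```

Because every step is a pure function of the input signals, the same scenario always yields the same chosen hypothesis, which is what keeps rewards and logs repeatable.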

Run a local multi-agent episode:

python scripts/multi_agent_demo.py --scenario payment_provider_degradation
python scripts/multi_agent_demo.py --scenario database_connection_leak --json

The Space exposes the same flow at:

/orchestrate?scenario_id=payment_provider_degradation
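A small client for this endpoint might look like the sketch below. It assumes the server from the Quickstart is running on 127.0.0.1:8000 and that the route answers plain GET requests. The response format is not documented here, so the helper returns the raw body without parsing it. Both function names are hypothetical.

```python
from urllib.parse import urlencode, urlunsplit
from urllib.request import urlopen


def orchestrate_url(host: str, scenario_id: str, scheme: str = "http") -> str:
    """Build the /orchestrate URL with a safely encoded scenario_id."""
    query = urlencode({"scenario_id": scenario_id})
    return urlunsplit((scheme, host, "/orchestrate", query, ""))


def fetch_orchestration(host: str, scenario_id: str, timeout: float = 30.0) -> str:
    """Call the endpoint (assumed GET) and return the raw response text."""
    with urlopen(orchestrate_url(host, scenario_id), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the local server up, `fetch_orchestration("127.0.0.1:8000", "payment_provider_degradation")` would request the same flow as the demo script.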

This is intentionally reproducible and uses generated incidents instead of live cloud credentials. A production extension can connect specialists to AWS CloudWatch, Azure Monitor, GCP Cloud Logging, Prometheus, or a read-only demo database, but the judged environment remains deterministic so rewards and logs are repeatable.

Deploy to Hugging Face Space

Recommended path:

pip install -e ".[server]"
openenv push --repo-id YOUR_USERNAME/incident_commander_env

Fallback Docker path:

docker build -f server/Dockerfile -t incident-commander-env .
docker run --rm -p 8000:8000 incident-commander-env

Submission Checklist

Push this repo to a Hugging Face Space.
Link the Space URL in this README.
Run scripts/train_grpo.py and commit reward/loss plots.
Add logs from outputs/logs/ and one before/after transcript from outputs/.
Record a demo under two minutes showing an untrained failure and a trained improvement.

Local Package Layout

incident_commander_env/: package, models, scenario engine, reward logic
server/: OpenEnv/FastAPI entrypoint and Dockerfile
scripts/: manual episode smoke test, baseline evaluation, GRPO training, multi-agent demo, log wrapper
tests/: deterministic unit tests
notebooks/: Colab-oriented training notebook

HF Training Result

A GRPO training run completed on Hugging Face Jobs.

Job URL: https://huggingface.co/jobs/Dar3devil/69ed03a7d2c8bd8662bce22d
Hardware: a10g-large
Model: Qwen/Qwen3-0.6B
Steps: 180/180 completed
First logged reward: 0.0375
Best logged reward: 0.2300
Final logged reward: 0.2125
Final train loss: -0.005505
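As a quick sanity check on the logged rewards above, the improvement can be computed directly. The snippet only does arithmetic on the three reported values; the variable names are illustrative.

```python
# Logged rewards copied from the run summary above.
first_reward = 0.0375
best_reward = 0.2300
final_reward = 0.2125

absolute_gain = final_reward - first_reward   # first-to-final improvement
relative_gain = final_reward / first_reward   # final as a multiple of first
drop_from_best = best_reward - final_reward   # how far final sits below the peak

print(f"absolute gain:  {absolute_gain:.4f}")    # absolute gain:  0.1750
print(f"relative gain:  {relative_gain:.2f}x")   # relative gain:  5.67x
print(f"drop from best: {drop_from_best:.4f}")   # drop from best: 0.0175
```

The final logged reward is roughly 5.7 times the first logged reward and sits slightly below the best logged value.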

Compact training evidence:

outputs/evals/hf_training_summary.md
outputs/evals/hf_training_metrics.csv
outputs/logs/hf_job_69ed03a7_inspect.txt

The full raw training log is saved locally and is also available from the HF Jobs page.