title: Incident Commander RL Arena
sdk: docker
app_port: 8000
pinned: false
license: bsd-3-clause

Incident Commander RL Arena

An OpenEnv-style reinforcement learning environment for training LLM agents to act as on-call incident commanders. The agent investigates a production incident through tools, validates a root cause, applies a safe mitigation, and writes a concise stakeholder update.

The environment is designed for the OpenEnv hackathon rubric: a novel environment, a clear story, measurable reward improvement, and a training loop small enough to run on T4-class GPUs.

Why This Environment

LLMs are often asked to operate software systems, but most benchmarks reward static Q&A instead of stateful investigation. This arena forces multi-turn behavior: inspect evidence, avoid misleading clues, choose a mitigation, and communicate clearly.

Five incident families are included:

  • cache saturation causing API latency
  • bad deploy causing auth failures
  • payment provider degradation
  • database connection leak
  • queue backlog from worker crash

Tools

The agent can call these tools:

  • query_logs(service, query, minutes)
  • inspect_metric(service, metric, minutes)
  • read_runbook(topic)
  • test_hypothesis(root_cause)
  • apply_mitigation(mitigation)
  • final_report(root_cause, mitigation, customer_update)

Rewards prioritize correct root cause and mitigation, then evidence support, update quality, safety, and efficient tool use.
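
To make the tool interface concrete, here is a minimal sketch of how an episode trace could be represented and sanity-checked. The tool names come from the list above; the ToolCall dataclass, the validate_trace helper, and every argument value are hypothetical and are not taken from the package.

```python
from dataclasses import dataclass, field

# Tool names as listed in this README; everything else here is illustrative.
TOOL_NAMES = {
    "query_logs",
    "inspect_metric",
    "read_runbook",
    "test_hypothesis",
    "apply_mitigation",
    "final_report",
}


@dataclass
class ToolCall:
    """One agent action: a tool name plus keyword arguments (hypothetical shape)."""
    name: str
    args: dict = field(default_factory=dict)


def validate_trace(trace):
    """Return a list of problems; an empty list means the trace looks well formed."""
    problems = []
    for i, call in enumerate(trace):
        if call.name not in TOOL_NAMES:
            problems.append(f"step {i}: unknown tool {call.name!r}")
    if not trace or trace[-1].name != "final_report":
        problems.append("trace must end with final_report")
    if sum(c.name == "final_report" for c in trace) > 1:
        problems.append("final_report may appear only once")
    return problems


# Argument values below are made up for illustration.
example_trace = [
    ToolCall("query_logs", {"service": "api", "query": "timeout", "minutes": 30}),
    ToolCall("inspect_metric", {"service": "cache", "metric": "memory_used", "minutes": 30}),
    ToolCall("read_runbook", {"topic": "cache"}),
    ToolCall("test_hypothesis", {"root_cause": "cache_saturation"}),
    ToolCall("apply_mitigation", {"mitigation": "scale_cache"}),
    ToolCall("final_report", {
        "root_cause": "cache_saturation",
        "mitigation": "scale_cache",
        "customer_update": "API latency is recovering after a cache capacity fix.",
    }),
]

print(validate_trace(example_trace))
```

A check like this is only a structural guard; whether a trace earns reward is decided by the environment, not by this helper.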

Quickstart

The activation command below uses the Windows path; on Linux or macOS, activate with source .venv/bin/activate instead.

python -m venv .venv
.venv\Scripts\activate
pip install -e .
python -m unittest discover -s tests
python scripts/manual_episode.py --scenario cache_saturation
python scripts/evaluate_baselines.py --write-assets

Run the OpenEnv/FastAPI server:

pip install -e ".[server]"
uvicorn server.app:app --host 0.0.0.0 --port 8000

Health check:

curl http://127.0.0.1:8000/health
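
The same check can be scripted in Python, for example before starting an evaluation run. This is a minimal sketch that assumes only that /health answers with HTTP 200 when the server is up; the shape of the response body is not assumed, and check_health is a hypothetical helper rather than part of the package.

```python
import urllib.request


def check_health(base_url="http://127.0.0.1:8000", opener=urllib.request.urlopen, timeout=5.0):
    """Return True if GET {base_url}/health answers with HTTP 200, else False."""
    url = base_url.rstrip("/") + "/health"
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection refused, timeouts, and urllib's URLError/HTTPError.
        return False
```

Calling check_health() with no arguments probes the local server started by the uvicorn command above. The opener parameter exists only so the function can be exercised without a running server.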

Training

Start with the smallest Qwen model for proof of learning:

pip install -e ".[train,vllm]"
python scripts/train_grpo.py \
  --model Qwen/Qwen3-0.6B \
  --output-dir outputs/grpo-qwen3-0.6b \
  --max-steps 80

Capture training logs for submission:

python scripts/run_with_logs.py --name grpo_qwen3_0_6b -- python scripts/train_grpo.py --model Qwen/Qwen3-0.6B --output-dir outputs/grpo-qwen3-0.6b --max-steps 80

For Hugging Face Jobs:

hf jobs uv run \
  --flavor t4-small \
  --timeout 4h \
  --with "incident-commander-env[train,vllm] @ git+https://huggingface.co/spaces/YOUR_USERNAME/incident_commander_env" \
  --secrets HF_TOKEN \
  -- incident-commander-train --model Qwen/Qwen3-0.6B --max-steps 80

If the first run shows a rising reward curve, rerun with Qwen/Qwen3-1.7B.

If Colab T4 fails on vllm, use the slower fallback:

pip install -e ".[train]"
python scripts/run_with_logs.py --name grpo_qwen3_0_6b_no_vllm -- python scripts/train_grpo.py --model Qwen/Qwen3-0.6B --output-dir outputs/grpo-qwen3-0.6b --max-steps 40 --num-generations 2 --gradient-accumulation-steps 8 --no-vllm

The training wrapper uses curriculum reward shaping for GRPO: intermediate evidence collection, supported hypothesis checks, and safe mitigations receive small capped rewards, while the environment's final evaluation score still requires the correct root cause, mitigation, and stakeholder report. This avoids all-zero batches in tool-use RL while preserving the harder final incident score.
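
The shaping idea can be sketched as follows. The event names, bonus values, and cap are assumptions chosen for illustration; the actual values live in the training wrapper and are not reproduced here.

```python
# Hypothetical per-event bonuses; the real wrapper may use different events and values.
STEP_BONUS = {
    "relevant_evidence": 0.02,
    "supported_hypothesis": 0.05,
    "safe_mitigation": 0.05,
}
SHAPING_CAP = 0.2  # assumed upper bound on the total intermediate bonus


def shaped_reward(events, final_score):
    """Combine capped intermediate bonuses with the environment's final score.

    events: list of event names observed during the episode.
    final_score: the environment's final evaluation score.
    """
    bonus = sum(STEP_BONUS.get(event, 0.0) for event in events)
    return min(bonus, SHAPING_CAP) + final_score


# An episode that gathers evidence but never reaches the right answer still
# gets a small non-zero signal.
partial = shaped_reward(["relevant_evidence", "relevant_evidence"], final_score=0.0)
solved = shaped_reward(
    ["relevant_evidence", "supported_hypothesis", "safe_mitigation"], final_score=1.0
)
print(round(partial, 3), round(solved, 3))
```

The cap matters because GRPO computes each completion's advantage relative to the other completions in its group: a group whose rewards are all identical contributes no learning signal, while an uncapped bonus could let an agent profit from collecting evidence without ever solving the incident.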

Evaluation

scripts/evaluate_baselines.py writes:

  • outputs/evals/baseline_eval.json
  • assets/baseline_vs_oracle.png
  • assets/reward_curve_template.png

Use the log wrapper for evidence:

python scripts/run_with_logs.py --name unit_tests -- python -m unittest discover -s tests
python scripts/run_with_logs.py --name manual_episode_cache -- python scripts/manual_episode.py --scenario cache_saturation
python scripts/run_with_logs.py --name baseline_eval -- python scripts/evaluate_baselines.py --write-assets

Current local smoke evaluation:

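| Policy        | Mean reward        | Root cause acc | Mitigation acc | Unsafe rate | Avg tool calls |
|---------------|--------------------|----------------|----------------|-------------|----------------|
| Weak baseline | 0.172              | 0.000          | 0.000          | 0.000       | 2.00           |
| Oracle trace  | 1.000              | 1.000          | 1.000          | 0.000       | 6.00           |
| Trained model | fill from GRPO run | fill           | fill           | fill        | fill           |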
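
The columns of this table can be computed from per-episode records along these lines. The record fields (reward, root_cause_correct, and so on) are hypothetical and do not necessarily match the schema of outputs/evals/baseline_eval.json, and the sample records are made up.

```python
from statistics import mean


def summarize(episodes):
    """Aggregate per-episode records into the five columns of the table."""
    if not episodes:
        raise ValueError("need at least one episode")
    return {
        "mean_reward": mean(e["reward"] for e in episodes),
        "root_cause_acc": mean(1.0 if e["root_cause_correct"] else 0.0 for e in episodes),
        "mitigation_acc": mean(1.0 if e["mitigation_correct"] else 0.0 for e in episodes),
        "unsafe_rate": mean(1.0 if e["unsafe_action"] else 0.0 for e in episodes),
        "avg_tool_calls": mean(e["tool_calls"] for e in episodes),
    }


# Made-up records for illustration only; these are not results from this repo.
episodes = [
    {"reward": 1.0, "root_cause_correct": True, "mitigation_correct": True,
     "unsafe_action": False, "tool_calls": 6},
    {"reward": 0.5, "root_cause_correct": True, "mitigation_correct": False,
     "unsafe_action": False, "tool_calls": 4},
]
print(summarize(episodes))
```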

[Figure: Baseline vs oracle]

Replace the template below with the real GRPO reward curve after training:

[Figure: Reward curve template]

Multi-Agent Incident Command

The repo also includes a deterministic multi-agent orchestration layer for demos and a future training curriculum:

  • api_auth_specialist: checks API logs, auth-like metrics, deploy and login signals.
  • data_store_specialist: checks database or ledger logs, connection metrics, and storage runbooks.
  • dependency_worker_specialist: checks cache, queue, worker, gateway, and provider signals.
  • Parent commander: synthesizes specialist findings, tests the strongest hypothesis, applies the safest mitigation, and submits the final stakeholder report.

Run a local multi-agent episode:

python scripts/multi_agent_demo.py --scenario payment_provider_degradation
python scripts/multi_agent_demo.py --scenario database_connection_leak --json

The Space exposes the same flow at:

/orchestrate?scenario_id=payment_provider_degradation

This flow is intentionally reproducible and uses generated incidents instead of live cloud credentials. A production extension could connect specialists to AWS CloudWatch, Azure Monitor, GCP Cloud Logging, Prometheus, or a read-only demo database, but the judged environment remains deterministic so rewards and logs are repeatable.
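
The parent commander's synthesis step can be sketched roughly as below. The specialist names and two of the scenario IDs appear in this README, but the Finding dataclass, the confidence field, the bad_deploy label, and the pick-the-highest-confidence rule are assumptions for illustration, not the repo's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """A specialist's best guess (hypothetical structure)."""
    specialist: str
    root_cause: str
    confidence: float  # assumed to lie in [0, 1]


def strongest_hypothesis(findings):
    """Pick the finding with the highest confidence.

    Ties are broken by specialist name so the result stays deterministic.
    """
    if not findings:
        raise ValueError("no specialist findings to synthesize")
    return max(findings, key=lambda f: (f.confidence, f.specialist))


# Illustrative values only.
findings = [
    Finding("api_auth_specialist", "bad_deploy", 0.2),
    Finding("data_store_specialist", "database_connection_leak", 0.3),
    Finding("dependency_worker_specialist", "payment_provider_degradation", 0.9),
]
best = strongest_hypothesis(findings)
print(best.specialist, best.root_cause)
```

A deterministic tie-break is used here because the README stresses repeatable rewards and logs; any fixed ordering would serve the same purpose.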

Deploy to Hugging Face Space

Recommended path:

pip install -e ".[server]"
openenv push --repo-id YOUR_USERNAME/incident_commander_env

Fallback Docker path:

docker build -f server/Dockerfile -t incident-commander-env .
docker run --rm -p 8000:8000 incident-commander-env

Submission Checklist

  • Push this repo to a Hugging Face Space.
  • Link the Space URL in this README.
  • Run scripts/train_grpo.py and commit reward/loss plots.
  • Add logs from outputs/logs/ and one before/after transcript from outputs/.
  • Record a demo under two minutes showing an untrained failure and a trained improvement.

Local Package Layout

  • incident_commander_env/: package, models, scenario engine, reward logic
  • server/: OpenEnv/FastAPI entrypoint and Dockerfile
  • scripts/: smoke test, baseline evaluation, GRPO training
  • tests/: deterministic unit tests
  • notebooks/: Colab-oriented training notebook

HF Training Result

A GRPO training run completed on Hugging Face Jobs.

[Figure: HF training reward curve]

Compact training evidence:

  • outputs/evals/hf_training_summary.md
  • outputs/evals/hf_training_metrics.csv
  • outputs/logs/hf_job_69ed03a7_inspect.txt

The full raw training log is saved locally and is also available from the HF Jobs page.