---
title: Incident Commander RL Arena
sdk: docker
app_port: 8000
pinned: false
license: bsd-3-clause
---
# Incident Commander RL Arena
An OpenEnv-style reinforcement learning environment for training LLM agents to act as on-call incident commanders. The agent investigates a production incident through tools, validates a root cause, applies a safe mitigation, and writes a concise stakeholder update.
The environment is designed for the OpenEnv hackathon rubric: a novel environment, a clear story, measurable reward improvement, and a training loop small enough to run on T4-class GPUs.
## Why This Environment
LLMs are often asked to operate software systems, but most benchmarks reward static Q&A instead of stateful investigation. This arena forces multi-turn behavior: inspect evidence, avoid misleading clues, choose a mitigation, and communicate clearly.
Five incident families are included:
- cache saturation causing API latency
- bad deploy causing auth failures
- payment provider degradation
- database connection leak
- queue backlog from worker crash
## Tools
The agent can call these tools:
- `query_logs(service, query, minutes)`
- `inspect_metric(service, metric, minutes)`
- `read_runbook(topic)`
- `test_hypothesis(root_cause)`
- `apply_mitigation(mitigation)`
- `final_report(root_cause, mitigation, customer_update)`
Rewards prioritize correct root cause and mitigation, then evidence support, update quality, safety, and efficient tool use.
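The priority order above can be pictured as a weighted sum of component scores. The following is a minimal sketch, not the package's reward logic: the `WEIGHTS` values, the component names, and `score_episode` are all hypothetical and only chosen so that the ordering matches the sentence above.

```python
# Hypothetical weights that only mirror the stated priority order.
# The real values live in the package's reward logic and may differ.
WEIGHTS = {
    "root_cause": 0.35,
    "mitigation": 0.30,
    "evidence": 0.15,
    "update_quality": 0.10,
    "safety": 0.05,
    "efficiency": 0.05,
}


def score_episode(components: dict) -> float:
    """Weighted sum of component scores, each clamped to [0, 1]."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = min(1.0, max(0.0, components.get(name, 0.0)))
        total += weight * value
    return round(total, 4)


if __name__ == "__main__":
    # A correct diagnosis with no supporting evidence still outranks
    # well-evidenced work that reaches the wrong conclusion.
    print(score_episode({"root_cause": 1.0, "mitigation": 1.0}))
    print(score_episode({"evidence": 1.0, "update_quality": 1.0}))
```

With weights like these, an agent cannot reach a high score through tidy tool use alone; most of the reward is gated on the root cause and mitigation being right.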
## Quickstart
```bash
python -m venv .venv
source .venv/bin/activate  # on Windows: .venv\Scripts\activate
pip install -e .
python -m unittest discover -s tests
python scripts/manual_episode.py --scenario cache_saturation
python scripts/evaluate_baselines.py --write-assets
```
Run the OpenEnv/FastAPI server:
```bash
pip install -e ".[server]"
uvicorn server.app:app --host 0.0.0.0 --port 8000
```
Health check:
```bash
curl http://127.0.0.1:8000/health
```
## Training
Start with the smallest Qwen model for proof of learning:
```bash
pip install -e ".[train,vllm]"
python scripts/train_grpo.py \
--model Qwen/Qwen3-0.6B \
--output-dir outputs/grpo-qwen3-0.6b \
--max-steps 80
```
Capture training logs for submission:
```bash
python scripts/run_with_logs.py --name grpo_qwen3_0_6b -- python scripts/train_grpo.py --model Qwen/Qwen3-0.6B --output-dir outputs/grpo-qwen3-0.6b --max-steps 80
```
For Hugging Face Jobs:
```bash
hf jobs uv run \
--flavor t4-small \
--timeout 4h \
--with "incident-commander-env[train,vllm] @ git+https://huggingface.co/spaces/YOUR_USERNAME/incident_commander_env" \
--secrets HF_TOKEN \
-- incident-commander-train --model Qwen/Qwen3-0.6B --max-steps 80
```
If the first run shows a rising reward curve, rerun with `Qwen/Qwen3-1.7B`.
If Colab T4 fails on `vllm`, use the slower fallback:
```bash
pip install -e ".[train]"
python scripts/run_with_logs.py --name grpo_qwen3_0_6b_no_vllm -- python scripts/train_grpo.py --model Qwen/Qwen3-0.6B --output-dir outputs/grpo-qwen3-0.6b --max-steps 40 --num-generations 2 --gradient-accumulation-steps 8 --no-vllm
```
The training wrapper uses curriculum reward shaping for GRPO: intermediate evidence
collection, supported hypothesis checks, and safe mitigations receive small capped
rewards, while the environment's final evaluation score still requires the correct
root cause, mitigation, and stakeholder report. This avoids all-zero batches in
tool-use RL while preserving the harder final incident score.
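To see why all-zero batches matter, recall that GRPO scores each completion relative to the other completions in its group. If every reward in a group is identical, every advantage is zero and the batch carries no learning signal. The sketch below illustrates this under stated assumptions: the `STEP_BONUS` values, `BONUS_CAP`, the event names, and the additive form of `shaped_reward` are hypothetical, and `group_advantages` is a simplified mean/std normalization rather than the trainer's exact code.

```python
from statistics import mean, pstdev

# Hypothetical per-event bonuses and cap; the training wrapper's real values may differ.
STEP_BONUS = {"evidence": 0.02, "supported_hypothesis": 0.05, "safe_mitigation": 0.05}
BONUS_CAP = 0.15


def shaped_reward(final_score, events):
    """Final environment score plus a capped sum of small intermediate bonuses."""
    bonus = sum(STEP_BONUS.get(event, 0.0) for event in events)
    return final_score + min(BONUS_CAP, bonus)


def group_advantages(rewards, eps=1e-6):
    """Simplified group-relative advantage: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four rollouts that all fail the final task: the sparse reward gives no signal.
    print(group_advantages([0.0, 0.0, 0.0, 0.0]))

    # The same failures, but two rollouts gathered evidence along the way.
    rollouts = [[], ["evidence"], ["evidence", "supported_hypothesis"], []]
    shaped = [shaped_reward(0.0, events) for events in rollouts]
    print(group_advantages(shaped))
```

The cap keeps the intermediate bonuses from competing with the final score: a rollout cannot earn more than `BONUS_CAP` without actually solving the incident.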
## Evaluation
`scripts/evaluate_baselines.py` writes:
- `outputs/evals/baseline_eval.json`
- `assets/baseline_vs_oracle.png`
- `assets/reward_curve_template.png`
Use the log wrapper to capture evidence for the submission:
```bash
python scripts/run_with_logs.py --name unit_tests -- python -m unittest discover -s tests
python scripts/run_with_logs.py --name manual_episode_cache -- python scripts/manual_episode.py --scenario cache_saturation
python scripts/run_with_logs.py --name baseline_eval -- python scripts/evaluate_baselines.py --write-assets
```
Current local smoke evaluation:
| Policy | Mean reward | Root cause acc | Mitigation acc | Unsafe rate | Avg tool calls |
| --- | ---: | ---: | ---: | ---: | ---: |
| Weak baseline | 0.172 | 0.000 | 0.000 | 0.000 | 2.00 |
| Oracle trace | 1.000 | 1.000 | 1.000 | 0.000 | 6.00 |
| Trained model | fill from GRPO run | fill | fill | fill | fill |
![Baseline vs oracle](assets/baseline_vs_oracle.png)
Replace the template below with the real GRPO reward curve after training:
![Reward curve template](assets/reward_curve_template.png)
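The columns in the evaluation table are simple per-episode averages. As a rough sketch of how such a summary could be computed, assuming each episode is a dict with the hypothetical keys `reward`, `root_cause_correct`, `mitigation_correct`, `unsafe`, and `tool_calls` (the actual schema of `baseline_eval.json` is not shown here and may differ):

```python
from statistics import mean


def summarize(episodes):
    """Average per-episode records into the evaluation table's columns."""
    return {
        "mean_reward": mean(float(e["reward"]) for e in episodes),
        "root_cause_acc": mean(float(e["root_cause_correct"]) for e in episodes),
        "mitigation_acc": mean(float(e["mitigation_correct"]) for e in episodes),
        "unsafe_rate": mean(float(e["unsafe"]) for e in episodes),
        "avg_tool_calls": mean(float(e["tool_calls"]) for e in episodes),
    }


if __name__ == "__main__":
    # Two made-up episodes, for illustration only.
    episodes = [
        {"reward": 1.0, "root_cause_correct": True, "mitigation_correct": True,
         "unsafe": False, "tool_calls": 6},
        {"reward": 0.0, "root_cause_correct": False, "mitigation_correct": False,
         "unsafe": False, "tool_calls": 2},
    ]
    for column, value in summarize(episodes).items():
        print(f"{column}: {value:.3f}")
```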
## Multi-Agent Incident Command
The repo also includes a deterministic multi-agent orchestration layer for demos and
future training curricula:
- `api_auth_specialist`: checks API logs, auth-like metrics, deploy and login signals.
- `data_store_specialist`: checks database or ledger logs, connection metrics, and storage runbooks.
- `dependency_worker_specialist`: checks cache, queue, worker, gateway, and provider signals.
- Parent commander: synthesizes specialist findings, tests the strongest hypothesis, applies the safest mitigation, and submits the final stakeholder report.
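The synthesis step of the parent commander can be sketched as a deterministic selection over specialist findings. This is an illustration only: the `Finding` type, its `confidence` field, and `pick_hypothesis` are hypothetical names, not the repo's API. The point is that ties are broken by a fixed rule, so the same findings always yield the same hypothesis regardless of the order in which specialists report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    specialist: str    # e.g. "data_store_specialist"
    hypothesis: str    # candidate root cause proposed by the specialist
    confidence: float  # hypothetical 0..1 strength of the supporting evidence


def pick_hypothesis(findings):
    """Return the strongest finding; ties are broken by specialist name."""
    if not findings:
        raise ValueError("no specialist findings to synthesize")
    return min(findings, key=lambda f: (-f.confidence, f.specialist))


if __name__ == "__main__":
    findings = [
        Finding("api_auth_specialist", "bad_deploy", 0.40),
        Finding("data_store_specialist", "database_connection_leak", 0.85),
        Finding("dependency_worker_specialist", "queue_backlog", 0.30),
    ]
    chosen = pick_hypothesis(findings)
    print(chosen.specialist, chosen.hypothesis)
```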
Run a local multi-agent episode:
```bash
python scripts/multi_agent_demo.py --scenario payment_provider_degradation
python scripts/multi_agent_demo.py --scenario database_connection_leak --json
```
The Space exposes the same flow at:
```text
/orchestrate?scenario_id=payment_provider_degradation
```
This flow is intentionally reproducible and uses generated incidents instead of live cloud
credentials. A production extension can connect specialists to AWS CloudWatch, Azure
Monitor, GCP Cloud Logging, Prometheus, or a read-only demo database, but the judged
environment remains deterministic so rewards and logs are repeatable.
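When scripting against the `/orchestrate` path, it helps to build the URL with proper query encoding rather than string concatenation. A small sketch, assuming the server runs locally on port 8000 as in the Quickstart; `build_orchestrate_url` is a hypothetical helper, and the HTTP method and response format of the endpoint are not specified here:

```python
from urllib.parse import urlencode

# Assumed local base URL, matching the uvicorn command in the Quickstart.
BASE_URL = "http://127.0.0.1:8000"


def build_orchestrate_url(scenario_id, base_url=BASE_URL):
    """Build the /orchestrate URL with a safely encoded scenario_id."""
    query = urlencode({"scenario_id": scenario_id})
    return f"{base_url.rstrip('/')}/orchestrate?{query}"


if __name__ == "__main__":
    print(build_orchestrate_url("payment_provider_degradation"))
```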
## Deploy to Hugging Face Space
Recommended path:
```bash
pip install -e ".[server]"
openenv push --repo-id YOUR_USERNAME/incident_commander_env
```
Fallback Docker path:
```bash
docker build -f server/Dockerfile -t incident-commander-env .
docker run --rm -p 8000:8000 incident-commander-env
```
## Submission Checklist
- Push this repo to a Hugging Face Space.
- Link the Space URL in this README.
- Run `scripts/train_grpo.py` and commit reward/loss plots.
- Add logs from `outputs/logs/` and one before/after transcript from `outputs/`.
- Record a demo under two minutes showing an untrained failure and a trained improvement.
## Local Package Layout
- `incident_commander_env/`: package, models, scenario engine, reward logic
- `server/`: OpenEnv/FastAPI entrypoint and Dockerfile
- `scripts/`: manual episode smoke test, baseline evaluation, GRPO training, log wrapper, multi-agent demo
- `tests/`: deterministic unit tests
- `notebooks/`: Colab-oriented training notebook
## HF Training Result
A GRPO training run completed on Hugging Face Jobs.
- Job URL: https://huggingface.co/jobs/Dar3devil/69ed03a7d2c8bd8662bce22d
- Hardware: `a10g-large`
- Model: `Qwen/Qwen3-0.6B`
- Steps: `180/180 completed`
- First logged reward: `0.0375`
- Best logged reward: `0.2300`
- Final logged reward: `0.2125`
- Final train loss: `-0.005505`
![HF training reward curve](assets/hf_training_reward_curve.png)
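The logged rewards above can be turned into a quick improvement summary. The snippet only does arithmetic on the three reported values; the variable names are illustrative.

```python
# Logged rewards reported for the HF Jobs run.
first, best, final = 0.0375, 0.2300, 0.2125

abs_gain = round(final - first, 4)     # absolute gain, first to final
final_ratio = round(final / first, 2)  # final reward relative to first
best_ratio = round(best / first, 2)    # best reward relative to first

print(f"absolute gain: {abs_gain}")
print(f"final / first: {final_ratio}x")
print(f"best / first:  {best_ratio}x")
```

The final logged reward is about 5.7 times the first logged reward, an absolute gain of 0.175.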
Compact training evidence:
- `outputs/evals/hf_training_summary.md`
- `outputs/evals/hf_training_metrics.csv`
- `outputs/logs/hf_job_69ed03a7_inspect.txt`
The full raw training log is saved locally and is also available from the HF Jobs page.