title: OpenOps Incident Commander
emoji: 🚨
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
OpenOps: AI Incident Commander Environment
A production incident management environment where AI agents learn to handle real-world outages.
Overview
OpenOps simulates production incidents requiring an AI Incident Commander to:
- Investigate alerts and logs
- Identify root causes
- Execute mitigation actions (restart/rollback/scale)
- Communicate with teams and users
- Resolve incidents quickly to minimize revenue loss
Environment Specification
Observation Space
{
"active_alerts": List[str],
"service_status": Dict[str, str],
"recent_logs": Dict[str, List[str]],
"metrics_summary": Dict[str, float],
"customer_complaints": int,
"time_elapsed": int,
"revenue_loss": float,
"teams_notified": bool,
"status_page_updated": bool,
"user_communication_sent": bool
}
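As a rough illustration, the observation fields above could be mirrored in a `TypedDict` on the client side. This is only a sketch: the class name `Observation`, the helper `pending_communications`, and all example values are assumptions made for illustration, not part of the OpenOps code.

```python
from typing import Dict, List, TypedDict


class Observation(TypedDict):
    """Typed view of the observation fields listed above."""
    active_alerts: List[str]
    service_status: Dict[str, str]
    recent_logs: Dict[str, List[str]]
    metrics_summary: Dict[str, float]
    customer_complaints: int
    time_elapsed: int
    revenue_loss: float
    teams_notified: bool
    status_page_updated: bool
    user_communication_sent: bool


def pending_communications(obs: Observation) -> List[str]:
    """Return the communication flags that are still False (hypothetical helper)."""
    flags = ["teams_notified", "status_page_updated", "user_communication_sent"]
    return [name for name in flags if not obs[name]]


# Illustrative values only; real observations come from the environment.
example: Observation = {
    "active_alerts": ["api: high error rate"],
    "service_status": {"api": "down"},
    "recent_logs": {"api": ["OutOfMemoryError"]},
    "metrics_summary": {"api_error_rate": 0.9},
    "customer_complaints": 3,
    "time_elapsed": 5,
    "revenue_loss": 120.0,
    "teams_notified": True,
    "status_page_updated": False,
    "user_communication_sent": False,
}

print(pending_communications(example))
```

A helper like this makes it easy for an agent loop to check which communication steps remain before resolving an incident.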
Action Space (21 actions)
- 0: read_alerts
- 1-4: inspect_logs_{service}
- 5-8: check_metrics_{service}
- 9-12: restart_{service}
- 13-14: rollback_{service}
- 15-16: scale_{service}
- 17-19: Communication actions
- 20: resolve_incident
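The index ranges above can be turned into a small lookup. The sketch below maps an action id to its action family and its position within that family. It deliberately does not name the services or the individual communication actions, since their order is not stated here; `ACTION_RANGES` and `describe_action` are hypothetical names.

```python
from typing import List, Tuple

# (first_id, last_id, family), copied from the action list above.
ACTION_RANGES: List[Tuple[int, int, str]] = [
    (0, 0, "read_alerts"),
    (1, 4, "inspect_logs"),
    (5, 8, "check_metrics"),
    (9, 12, "restart"),
    (13, 14, "rollback"),
    (15, 16, "scale"),
    (17, 19, "communication"),
    (20, 20, "resolve_incident"),
]

# Total number of discrete actions covered by the ranges.
NUM_ACTIONS = sum(last - first + 1 for first, last, _ in ACTION_RANGES)


def describe_action(action_id: int) -> Tuple[str, int]:
    """Return (family, slot), where slot is the offset within the family."""
    for first, last, family in ACTION_RANGES:
        if first <= action_id <= last:
            return family, action_id - first
    raise ValueError(f"action_id must be in 0..{NUM_ACTIONS - 1}, got {action_id}")


print(NUM_ACTIONS)
print(describe_action(14))
```

Summing the range sizes is a quick sanity check that the ranges cover exactly the 21 actions in the heading.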
Three Tasks
Task 1 (Easy): Simple API Crash
- API service down due to OOM
- Solution: Inspect logs → Restart API
Task 2 (Medium): Bad Deployment
- Database deployment broke queries
- Solution: Inspect logs → Rollback deployment → Notify team
Task 3 (Hard): Cascading Failure
- Database overload → API timeouts → Customer impact
- Solution: Inspect both services → Scale database → Restart API → Communicate
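One way to reason about these solution paths is to check whether an agent's trajectory contains the expected steps in order, allowing extra actions in between. The sketch below does this for Tasks 1 and 2. The action names (`inspect_logs_api`, `notify_team`, and so on) are assumptions that follow the `{action}_{service}` pattern from the action list, and Task 3 is left out because the order in which the two services are inspected is not specified.

```python
from typing import Dict, Iterable, List

# Hypothetical action names; the real identifiers may differ.
SOLUTIONS: Dict[str, List[str]] = {
    "task_1": ["inspect_logs_api", "restart_api"],
    "task_2": ["inspect_logs_database", "rollback_database", "notify_team"],
}


def follows_solution(actions_taken: Iterable[str], solution: List[str]) -> bool:
    """True if `solution` appears in `actions_taken` as an ordered subsequence."""
    remaining = iter(actions_taken)
    # `step in remaining` consumes the iterator up to the first match,
    # so later steps can only match later actions.
    return all(step in remaining for step in solution)


trajectory = ["read_alerts", "inspect_logs_api", "check_metrics_api", "restart_api"]
print(follows_solution(trajectory, SOLUTIONS["task_1"]))
```

This only checks ordering; it says nothing about how the environment itself scores a trajectory.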
Installation
pip install -r server/requirements.txt
Usage
Run Locally
cd server
uvicorn app:app --reload
Run Inference
export OPENAI_API_KEY="your-key"
python inference.py
Grading
Each task is scored from 0.0 to 1.0 based on:
- Investigation quality
- Correct mitigation actions
- Communication
- Successful resolution
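How the four criteria are combined is not stated here. Purely as an illustration, the sketch below assumes each criterion yields a value in [0, 1] and averages them with equal weights; the criterion keys, the equal weighting, and the function name `task_score` are all assumptions, not the actual grader.

```python
from typing import Dict

# Hypothetical keys for the four criteria listed above.
CRITERIA = ("investigation", "mitigation", "communication", "resolution")


def task_score(components: Dict[str, float]) -> float:
    """Equal-weight average of the criteria, each clamped to [0.0, 1.0]."""
    total = 0.0
    for name in CRITERIA:
        value = components.get(name, 0.0)  # a missing criterion counts as 0
        total += min(1.0, max(0.0, value))
    return total / len(CRITERIA)


print(task_score({"investigation": 1.0, "mitigation": 1.0}))
```

Clamping each component keeps the final score inside the 0.0 to 1.0 range regardless of the inputs.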
Deployment
Deploy to Hugging Face Spaces:
openenv push
