---

title: OpenOps Incident Commander
emoji: 🚨
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
---

# OpenOps: AI Incident Commander Environment

OpenOps is a production incident management environment in which AI agents learn to handle simulated outages modeled on real-world incidents.

## Overview

OpenOps simulates production incidents requiring an AI Incident Commander to:
- Investigate alerts and logs
- Identify root causes  
- Execute mitigation actions (restart/rollback/scale)
- Communicate with teams and users
- Resolve incidents quickly to minimize revenue loss

## Environment Specification

### Observation Space
```python
{
    "active_alerts": List[str],
    "service_status": Dict[str, str],
    "recent_logs": Dict[str, List[str]],
    "metrics_summary": Dict[str, float],
    "customer_complaints": int,
    "time_elapsed": int,
    "revenue_loss": float,
    "teams_notified": bool,
    "status_page_updated": bool,
    "user_communication_sent": bool
}
```
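
The schema above can be written as a typed dictionary so that observations can be checked before an agent consumes them. This is a minimal sketch, not the environment's own code: the `Observation` class name, the `missing_fields` helper, and every value in `example` (service names, status strings, log lines, numbers) are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, TypedDict


class Observation(TypedDict):
    """Field names and types copied from the observation space above."""
    active_alerts: List[str]
    service_status: Dict[str, str]
    recent_logs: Dict[str, List[str]]
    metrics_summary: Dict[str, float]
    customer_complaints: int
    time_elapsed: int
    revenue_loss: float
    teams_notified: bool
    status_page_updated: bool
    user_communication_sent: bool


# Hypothetical sample values; the real environment may use different
# service names, status strings, and units.
example: Observation = {
    "active_alerts": ["api: health check failing"],
    "service_status": {"api": "down", "database": "healthy"},
    "recent_logs": {"api": ["OutOfMemoryError"]},
    "metrics_summary": {"api_error_rate": 1.0},
    "customer_complaints": 3,
    "time_elapsed": 2,
    "revenue_loss": 150.0,
    "teams_notified": False,
    "status_page_updated": False,
    "user_communication_sent": False,
}


def missing_fields(obs: dict) -> List[str]:
    """Return the schema keys that are absent from obs, in schema order."""
    return [key for key in Observation.__annotations__ if key not in obs]
```

Calling `missing_fields` on a raw dictionary gives a quick sanity check that all ten fields are present before an agent acts on the observation.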

### Action Space (21 actions)
- 0: read_alerts
- 1-4: inspect_logs_{service}
- 5-8: check_metrics_{service}
- 9-12: restart_{service}
- 13-14: rollback_{service}
- 15-16: scale_{service}
- 17-19: Communication actions
- 20: resolve_incident
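
The index ranges above can be decoded with a small lookup table. The sketch below uses only the ranges listed here; the `ACTION_RANGES` table and the `decode_action` helper are hypothetical names, and because the list does not say which service or communication action each offset refers to, the function returns a numeric offset rather than a concrete name.

```python
from typing import List, Tuple

# (first_index, last_index, action_family), taken from the list above.
ACTION_RANGES: List[Tuple[int, int, str]] = [
    (0, 0, "read_alerts"),
    (1, 4, "inspect_logs"),
    (5, 8, "check_metrics"),
    (9, 12, "restart"),
    (13, 14, "rollback"),
    (15, 16, "scale"),
    (17, 19, "communication"),
    (20, 20, "resolve_incident"),
]

# Total number of discrete actions covered by the ranges.
NUM_ACTIONS = sum(last - first + 1 for first, last, _ in ACTION_RANGES)


def decode_action(action: int) -> Tuple[str, int]:
    """Map an action index to (family, offset within that family)."""
    for first, last, family in ACTION_RANGES:
        if first <= action <= last:
            return family, action - first
    raise ValueError(f"action must be in 0..{NUM_ACTIONS - 1}, got {action}")
```

For example, `decode_action(10)` returns `("restart", 1)`, the second restart action.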



### Three Tasks



**Task 1 (Easy): Simple API Crash**
- API service down due to OOM
- Solution: Inspect logs → Restart API

**Task 2 (Medium): Bad Deployment**
- Database deployment broke queries
- Solution: Inspect logs → Rollback deployment → Notify team

**Task 3 (Hard): Cascading Failure**
- Database overload → API timeouts → Customer impact
- Solution: Inspect both services → Scale database → Restart API → Communicate
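
One way to check whether an episode followed a task's intended solution is to test that the required steps appear in order within the agent's action history, with other actions allowed in between. The sketch below rests on several assumptions: the action names (`inspect_logs_api`, `rollback_database`, `notify_team`, and so on) are hypothetical spellings of the patterns in the action space, the hard task is assumed to inspect the database before the API, and "Communicate" is assumed to mean notifying the team.

```python
from typing import Dict, Iterable, List

# Hypothetical step names derived from the three task descriptions above.
TASK_PLANS: Dict[str, List[str]] = {
    "easy": ["inspect_logs_api", "restart_api"],
    "medium": ["inspect_logs_database", "rollback_database", "notify_team"],
    "hard": [
        "inspect_logs_database",
        "inspect_logs_api",
        "scale_database",
        "restart_api",
        "notify_team",
    ],
}


def follows_plan(history: Iterable[str], plan: List[str]) -> bool:
    """Return True if plan is an in-order subsequence of history."""
    remaining = iter(history)
    # `step in remaining` consumes the iterator up to the first match,
    # so each later step must occur after the previous one.
    return all(step in remaining for step in plan)
```

With this check, a history of `["read_alerts", "inspect_logs_api", "restart_api"]` satisfies the easy plan, while restarting before inspecting does not.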



## Installation

```bash
pip install -r server/requirements.txt
```



## Usage



### Run Locally

```bash
cd server
uvicorn app:app --reload
```



### Run Inference

```bash
export OPENAI_API_KEY="your-key"
python inference.py
```



## Grading



Each task is scored from 0.0 to 1.0 based on:
- Investigation quality
- Correct mitigation actions
- Communication
- Successful resolution
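
The four criteria could be combined into a single score as follows. This is an illustrative sketch only: the README does not state how the criteria are weighted, so equal weights are an assumption, and the `COMPONENTS` names and the `task_score` function are hypothetical.

```python
from typing import Dict

# Hypothetical keys for the four grading criteria listed above.
COMPONENTS = ("investigation", "mitigation", "communication", "resolution")


def task_score(parts: Dict[str, float]) -> float:
    """Average the four criteria with equal weights (an assumption).

    Each criterion is clamped to [0.0, 1.0]; a missing one counts as 0.0,
    so the result always lies in [0.0, 1.0].
    """
    clamped = [min(1.0, max(0.0, parts.get(name, 0.0))) for name in COMPONENTS]
    return sum(clamped) / len(COMPONENTS)
```

Under these assumptions, an agent that investigates and mitigates perfectly but never communicates or resolves the incident would score 0.5.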



## Deployment



Deploy to Hugging Face Spaces:

```bash
openenv push
```



## 📊 Sample Output

![local inference output](https://github.com/arya892004/OpenOps/blob/main/assets/output.png?raw=true)