---
title: DevOpsEnv
emoji: 🛠️
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
tags:
  - openenv
  - devops
  - sre
  - troubleshooting
  - agent-evaluation
pinned: false
---

# DevOpsEnv

DevOpsEnv is an OpenEnv-compliant environment for training and evaluating AI agents on realistic DevOps/SRE incident response workflows.

## Motivation

This environment models an operational workflow that engineers follow in production:

- inspect system state
- run diagnostic commands
- apply targeted config/code fixes
- verify impact
- submit a final resolution

It is intentionally designed around common SRE failure classes (service outage, deployment misconfiguration, runtime memory issue) instead of toy interactions.

## OpenEnv Compliance

The project implements the required OpenEnv interface:

- typed Pydantic models for `Observation`, `Action`, `Reward`, `StepResult`, and `State`
- `POST /reset` returns the initial observation
- `POST /step` returns `observation`, `reward`, `done`, `info`
- `GET /state` returns current episode state
- `POST /grader` returns a deterministic final score and breakdown
- `openenv.yaml` with the environment metadata/spec is included
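
As an illustration of how a client might exercise these endpoints, the sketch below builds the HTTP requests with Python's standard library. The request bodies are assumptions, not taken from the project: the `task_id` field in the `/reset` body, the action fields sent as the top-level `/step` body, and the example command are all hypothetical, so check `openenv.yaml` and the Pydantic models for the actual schema.

```python
import json
import urllib.request
from typing import Any, Optional

BASE_URL = "http://127.0.0.1:7860"  # local server address used elsewhere in this README


def build_request(
    path: str, payload: Optional[dict] = None, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build a GET request when payload is None, otherwise a JSON POST."""
    url = base_url.rstrip("/") + path
    if payload is None:
        return urllib.request.Request(url, method="GET")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> Any:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Assumed payload shapes (hypothetical; the real schema may differ).
reset_request = build_request("/reset", {"task_id": "task1"})
step_request = build_request(
    "/step", {"action_type": "bash_cmd", "command": "systemctl status nginx"}
)
state_request = build_request("/state")
```

With the server running, `send(reset_request)` should return the initial observation, and `send(step_request)` a result containing `observation`, `reward`, `done`, and `info`.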

## Observation Space

`Observation` includes:

- task metadata (`task_id`, `task_description`)
- episode controls (`episode_id`, `step_number`, `max_steps`)
- `system_state`:
  - running processes
  - service status
  - open HTTP ports
  - docker containers
  - logs
  - filesystem snapshot
  - cpu and memory metrics
- interaction history and current `available_actions`
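
A minimal sketch of how an agent loop might read the episode controls, assuming the observation arrives as a JSON-decoded dict with `step_number`, `max_steps`, and `available_actions` as top-level keys, and that `available_actions` is a list of action-type strings. The helper names and the example values are made up for illustration.

```python
from typing import Any, Dict


def steps_remaining(observation: Dict[str, Any]) -> int:
    """Number of steps left before the episode reaches max_steps."""
    return max(0, observation["max_steps"] - observation["step_number"])


def can_take(observation: Dict[str, Any], action_type: str) -> bool:
    """True if the action type is currently offered by the environment."""
    return action_type in observation["available_actions"]


# Hand-written example observation (hypothetical values, partial fields only).
example_observation = {
    "task_id": "task1",
    "step_number": 3,
    "max_steps": 20,
    "available_actions": ["bash_cmd", "file_edit", "submit"],
}
```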

## Action Space

`Action.action_type` is one of:

- `bash_cmd`: execute a simulated shell command (`command`)
- `file_edit`: overwrite a known config/source file (`file_path`, `file_content`)
- `submit`: terminate and grade the current episode (`summary` is optional)
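
The three action types can be sketched as a small data structure. The project itself uses Pydantic models; the dataclass below is a dependency-free stand-in, and its validation rules (a non-empty `command` for `bash_cmd`, both `file_path` and `file_content` for `file_edit`) are assumptions inferred from the field list above rather than the project's actual checks.

```python
from dataclasses import asdict, dataclass
from typing import Optional

ACTION_TYPES = ("bash_cmd", "file_edit", "submit")


@dataclass
class Action:
    action_type: str
    command: Optional[str] = None
    file_path: Optional[str] = None
    file_content: Optional[str] = None
    summary: Optional[str] = None

    def validate(self) -> None:
        """Raise ValueError if the fields do not match the action type (assumed rules)."""
        if self.action_type not in ACTION_TYPES:
            raise ValueError(f"unknown action_type: {self.action_type!r}")
        if self.action_type == "bash_cmd" and not self.command:
            raise ValueError("bash_cmd requires a non-empty command")
        if self.action_type == "file_edit" and (
            self.file_path is None or self.file_content is None
        ):
            raise ValueError("file_edit requires file_path and file_content")

    def to_payload(self) -> dict:
        """Validate, then return a dict containing only the fields that are set."""
        self.validate()
        return {key: value for key, value in asdict(self).items() if value is not None}


def is_valid(action: Action) -> bool:
    """Convenience wrapper that reports validity instead of raising."""
    try:
        action.validate()
    except ValueError:
        return False
    return True
```

For example, `Action("submit").to_payload()` yields a payload with only `action_type`, since `summary` is optional.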

## Tasks and Difficulty

The environment ships with 3 graded tasks:

1. `task1` (easy): recover crashed Nginx and verify HTTP health.
2. `task2` (medium): correct docker-compose port mapping and redeploy.
3. `task3` (hard): diagnose memory leak behavior, patch service code, restart cleanly.

Each task is graded deterministically, with a score in `[0.0, 1.0]` and a criterion-level breakdown.
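
One way such a grader could work is a weighted checklist: each criterion either passes or fails, and the score is the normalized sum of the weights that passed. This is a sketch under that assumption, not the project's grader; the criterion names and weights are hypothetical.

```python
from typing import Dict, Tuple


def grade(
    passed: Dict[str, bool], weights: Dict[str, float]
) -> Tuple[float, Dict[str, float]]:
    """Return (score, breakdown) where breakdown maps criterion -> earned share."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    breakdown = {
        name: (weight / total if passed.get(name, False) else 0.0)
        for name, weight in weights.items()
    }
    # Clamp defensively so the score always stays within [0.0, 1.0].
    score = min(1.0, max(0.0, sum(breakdown.values())))
    return score, breakdown


# Hypothetical criteria for a task like task1.
TASK1_WEIGHTS = {"nginx_running": 0.5, "http_healthy": 0.5}
score, breakdown = grade({"nginx_running": True, "http_healthy": False}, TASK1_WEIGHTS)
```

Because the result depends only on the final pass/fail facts, the same end state always produces the same score.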

## Reward Design

Rewards are dense and shaped so that the agent gets a signal at every step of the trajectory, not only at the end:

- per-step cost discourages long loops
- action-type reward for useful commands/edits
- progress bonuses for key milestones (validation, successful restart, verified outputs)
- penalties for repeated identical actions and invalid edits
- terminal bonus from grader score on episode completion
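
The components above could combine into a per-step reward roughly as follows. This is a sketch only: the function name, the flags it takes, and every numeric value are assumptions chosen for illustration, not the environment's actual constants.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardConfig:
    # All magnitudes below are hypothetical.
    step_cost: float = 0.01
    useful_action: float = 0.05
    milestone_bonus: float = 0.2
    repeat_penalty: float = 0.05
    invalid_edit_penalty: float = 0.1


def step_reward(
    cfg: RewardConfig,
    *,
    useful: bool,
    milestones_hit: int,
    repeated: bool,
    invalid_edit: bool,
    done: bool,
    grader_score: float = 0.0,
) -> float:
    """Sum the shaped components for a single step."""
    reward = -cfg.step_cost  # every step costs a little, discouraging long loops
    if useful:
        reward += cfg.useful_action
    reward += cfg.milestone_bonus * milestones_hit
    if repeated:
        reward -= cfg.repeat_penalty
    if invalid_edit:
        reward -= cfg.invalid_edit_penalty
    if done:
        reward += grader_score  # terminal bonus taken from the grader
    return reward
```

Keeping the magnitudes in one config object makes it easy to check that the terminal bonus dominates the shaping terms, so an agent cannot score well by collecting small bonuses without resolving the incident.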

## Local Setup

### 1) Install dependencies

```bash
pip install -r requirements.txt
```

### 2) Run API server

```bash
uvicorn app:app --host 0.0.0.0 --port 7860
```

### 3) Check health

```bash
curl http://127.0.0.1:7860/health
```

### 4) Validate OpenEnv package

```bash
openenv validate
```

## Baseline Inference Script

The required baseline script is at project root: `inference.py`.

It:

- uses the OpenAI Python client
- reads mandatory LLM variables:
  - `API_BASE_URL`
  - `MODEL_NAME`
  - `HF_TOKEN`
- runs all three tasks by default
- emits strict structured stdout lines:
  - `[START] ...`
  - `[STEP] ...`
  - `[END] ...`
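
A small checker for this line structure might look like the sketch below. It only inspects the leading tag and treats everything after it as opaque text, because the exact field layout of each line is not specified here. The function names are hypothetical, and the assumed ordering (one `[START]`, any number of `[STEP]` lines, one `[END]`) applies to a single task's output.

```python
import re
from typing import Iterable, Optional, Tuple

_TAG_LINE = re.compile(r"^\[(START|STEP|END)\](?: (.*))?$")


def parse_line(line: str) -> Optional[Tuple[str, str]]:
    """Return (tag, rest) for a structured line, or None for any other output."""
    match = _TAG_LINE.match(line.rstrip("\n"))
    if match is None:
        return None
    return match.group(1), match.group(2) or ""


def is_well_formed_episode(lines: Iterable[str]) -> bool:
    """Check the assumed order: START first, END last, only STEP in between."""
    tags = [parsed[0] for parsed in map(parse_line, lines) if parsed is not None]
    return (
        len(tags) >= 2
        and tags[0] == "START"
        and tags[-1] == "END"
        and all(tag == "STEP" for tag in tags[1:-1])
    )
```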

### Inference environment variables

```bash
export OPENENV_BASE_URL="http://127.0.0.1:7860"
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="<your_token>"
```
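
A script that depends on these variables would typically fail fast when one is missing. The sketch below shows one way to do that; the function names and the fallback value for `OPENENV_BASE_URL` are assumptions, and `inference.py` may handle this differently.

```python
import os
from typing import Dict, List, Mapping

REQUIRED_VARS = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")
DEFAULT_OPENENV_BASE_URL = "http://127.0.0.1:7860"  # assumed fallback for local runs


def missing_vars(env: Mapping[str, str]) -> List[str]:
    """List the required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect settings from the environment, raising if a required one is missing."""
    missing = missing_vars(env)
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    settings = {name: env[name] for name in REQUIRED_VARS}
    settings["OPENENV_BASE_URL"] = env.get("OPENENV_BASE_URL", DEFAULT_OPENENV_BASE_URL)
    return settings
```

Passing the mapping as a parameter keeps the function easy to test without touching the real process environment.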

### Run baseline

```bash
python inference.py
```

Run a single task:

```bash
python inference.py --task task2
```

## Docker

Build:

```bash
docker build -t devopsenv:latest .
```

Run:

```bash
docker run --rm -p 7860:7860 devopsenv:latest
```

## Hugging Face Spaces Deployment

This repository is configured for Docker Spaces:

- README frontmatter sets `sdk: docker`
- container exposes and serves on port `7860`
- README frontmatter includes the `openenv` tag

After pushing to a Space, verify:

- `POST /reset` returns 200
- `openenv validate` passes
- `python inference.py` completes within runtime constraints

## Pre-Submission Checklist

- HF Space endpoint responds to `/reset`
- `openenv validate` passes
- `docker build` succeeds
- `inference.py` runs and logs in the strict `[START]`/`[STEP]`/`[END]` format
- all 3 tasks produce valid grader scores in `[0.0, 1.0]`