---
title: DevOpsEnv
emoji: 🛠️
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
tags:
  - openenv
  - devops
  - sre
  - troubleshooting
  - agent-evaluation
pinned: false
---

# DevOpsEnv

DevOpsEnv is an OpenEnv-compliant environment for training and evaluating AI agents on realistic DevOps/SRE incident response workflows.

## Motivation

The environment models the workflow engineers follow when responding to production incidents:

- inspect system state
- run diagnostic commands
- apply targeted config/code fixes
- verify impact
- submit a final resolution

The tasks are built around common SRE failure classes (service outage, deployment misconfiguration, runtime memory issue) rather than toy interactions.

## OpenEnv Compliance

The project implements the required OpenEnv interface:

- typed Pydantic models for `Observation`, `Action`, `Reward`, `StepResult`, `State`
- `POST /reset` returns the initial observation
- `POST /step` returns `observation`, `reward`, `done`, `info`
- `GET /state` returns current episode state
- `POST /grader` returns deterministic final score and breakdown
- `openenv.yaml` metadata/spec included
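
As a rough illustration of how a client might drive these endpoints, here is a minimal sketch using only the Python standard library. The endpoint paths come from the list above; the request bodies (a `task_id` field for `/reset`, the action fields as the `/step` body) and the helper names `build_request`, `call`, and `run_episode` are assumptions, so check them against the actual models before relying on them.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:7860"  # local server port used elsewhere in this README


def build_request(method, path, payload=None, base_url=BASE_URL):
    """Build a JSON request for one of the environment endpoints."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    if method == "POST" and data is None:
        data = b"{}"  # send an empty JSON object when a POST has no payload
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )


def call(method, path, payload=None):
    """Send the request and decode the JSON response body."""
    request = build_request(method, path, payload)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def run_episode():
    """Walk through reset, one step, state, and grader (requires a running server)."""
    # ASSUMPTION: /reset accepts a task_id in its JSON body.
    observation = call("POST", "/reset", {"task_id": "task1"})
    # ASSUMPTION: /step takes the Action fields directly as its JSON body.
    # The command string is illustrative; the simulated shell may not support it.
    result = call("POST", "/step", {"action_type": "bash_cmd", "command": "systemctl status nginx"})
    state = call("GET", "/state")
    grade = call("POST", "/grader")
    return observation, result, state, grade
```

With the server from the Local Setup section running, calling `run_episode()` should return the four decoded JSON responses, provided the assumed request bodies match the real schema.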

## Observation Space

`Observation` includes:

- task metadata (`task_id`, `task_description`)
- episode controls (`episode_id`, `step_number`, `max_steps`)
- `system_state`:
  - running processes
  - service status
  - open HTTP ports
  - docker containers
  - logs
  - filesystem snapshot
  - cpu and memory metrics
- interaction history and current `available_actions`
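
The shape described above could be sketched roughly as follows. The project itself uses Pydantic models; this sketch uses standard-library dataclasses to stay dependency-free. The top-level names `task_id`, `task_description`, `episode_id`, `step_number`, `max_steps`, `system_state`, and `available_actions` come from the list above, while `history`, every field inside `SystemState`, and the helper `steps_remaining` are hypothetical.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class SystemState:
    # All field names and types in this class are assumptions for illustration.
    processes: List[str] = field(default_factory=list)
    services: Dict[str, str] = field(default_factory=dict)
    open_ports: List[int] = field(default_factory=list)
    containers: List[Dict[str, Any]] = field(default_factory=list)
    logs: Dict[str, str] = field(default_factory=dict)
    filesystem: Dict[str, str] = field(default_factory=dict)
    cpu_percent: float = 0.0
    memory_percent: float = 0.0


@dataclass
class Observation:
    task_id: str
    task_description: str
    episode_id: str
    step_number: int
    max_steps: int
    system_state: SystemState
    history: List[Dict[str, Any]] = field(default_factory=list)  # hypothetical name
    available_actions: List[str] = field(default_factory=list)


def steps_remaining(observation: Observation) -> int:
    """Number of steps left before the episode hits max_steps."""
    return max(observation.max_steps - observation.step_number, 0)
```

An agent loop might use `steps_remaining` to decide when to stop diagnosing and submit, and `asdict(observation)` to serialize the observation into a prompt.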

## Action Space

`Action.action_type` is one of:

- `bash_cmd`: execute simulated shell command (`command`)
- `file_edit`: overwrite known config/source file (`file_path`, `file_content`)
- `submit`: terminate and grade current episode (`summary` optional)
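
A small builder can catch malformed actions before they reach the server. The sketch below encodes the three action types and their fields as listed above; the helper name `make_action` and the choice to reject unexpected fields are assumptions, not the environment's own validation logic.

```python
from typing import Any, Dict

# Required fields per action type, taken from the list above.
REQUIRED_FIELDS = {
    "bash_cmd": ("command",),
    "file_edit": ("file_path", "file_content"),
    "submit": (),
}
# Optional fields per action type.
OPTIONAL_FIELDS = {"submit": ("summary",)}


def make_action(action_type: str, **fields: Any) -> Dict[str, Any]:
    """Return an action payload, raising ValueError if it is malformed."""
    if action_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action_type: {action_type!r}")
    required = REQUIRED_FIELDS[action_type]
    allowed = set(required) | set(OPTIONAL_FIELDS.get(action_type, ()))
    missing = [name for name in required if name not in fields]
    unexpected = sorted(set(fields) - allowed)
    if missing or unexpected:
        raise ValueError(
            f"{action_type}: missing={missing}, unexpected={unexpected}"
        )
    return {"action_type": action_type, **fields}
```

For example, `make_action("submit", summary="restarted nginx")` yields a payload with the optional summary, while `make_action("file_edit", file_path="...")` raises because `file_content` is missing.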

## Tasks and Difficulty

The environment ships with 3 graded tasks:

1. `task1` (easy): recover crashed Nginx and verify HTTP health.
2. `task2` (medium): correct docker-compose port mapping and redeploy.
3. `task3` (hard): diagnose memory leak behavior, patch service code, restart cleanly.

Each task is graded deterministically, producing a score in `[0.0, 1.0]` and a criterion-level breakdown.
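
One plausible way to produce a bounded score with a per-criterion breakdown is a normalized weighted sum over boolean checks. The sketch below is an illustration only: the criterion names, the weights, and the function name `grade` are hypothetical and do not reflect the project's actual grader.

```python
from typing import Dict


def grade(passed: Dict[str, bool], weights: Dict[str, float]) -> Dict[str, object]:
    """Combine pass/fail criteria into a score in [0.0, 1.0] plus a breakdown."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Each criterion contributes its normalized weight if it passed, else 0.
    breakdown = {
        name: (weight / total if passed.get(name, False) else 0.0)
        for name, weight in weights.items()
    }
    score = min(max(sum(breakdown.values()), 0.0), 1.0)
    return {"score": score, "breakdown": breakdown}


# Hypothetical criteria for an Nginx recovery task.
example = grade(
    passed={"service_running": True, "http_ok": True, "config_valid": False},
    weights={"service_running": 2.0, "http_ok": 1.0, "config_valid": 1.0},
)
```

Because the function depends only on its inputs, the same final state always yields the same score, which is the property deterministic grading requires.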

## Reward Design

Rewards are dense and shaped so that each step of a trajectory carries signal, not just the final outcome:

- per-step cost discourages long loops
- action-type reward for useful commands/edits
- progress bonuses for key milestones (validation, successful restart, verified outputs)
- penalties for repeated identical actions and invalid edits
- terminal bonus from grader score on episode completion
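
The components above could be combined along these lines. This is a sketch under stated assumptions: every coefficient in `RewardConfig`, and the names `RewardConfig` and `step_reward`, are hypothetical placeholders rather than the values the environment uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardConfig:
    # All magnitudes below are illustrative assumptions.
    step_cost: float = 0.01
    useful_action: float = 0.05
    milestone_bonus: float = 0.2
    repeat_penalty: float = 0.05
    invalid_edit_penalty: float = 0.1


def step_reward(
    cfg: RewardConfig,
    *,
    useful: bool,
    milestones_hit: int,
    repeated: bool,
    invalid_edit: bool,
    done: bool,
    grader_score: float = 0.0,
) -> float:
    """Sum the shaped reward components for a single step."""
    reward = -cfg.step_cost                      # per-step cost
    if useful:
        reward += cfg.useful_action              # useful command or edit
    reward += cfg.milestone_bonus * milestones_hit  # progress bonuses
    if repeated:
        reward -= cfg.repeat_penalty             # identical action repeated
    if invalid_edit:
        reward -= cfg.invalid_edit_penalty       # edit rejected as invalid
    if done:
        reward += grader_score                   # terminal bonus from the grader
    return reward
```

Keeping the per-step cost small relative to the milestone bonus means an agent is nudged toward shorter trajectories without being discouraged from running a diagnostic that unlocks progress.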

## Local Setup

### 1) Install dependencies

```bash
pip install -r requirements.txt
```

### 2) Run API server

```bash
uvicorn app:app --host 0.0.0.0 --port 7860
```

### 3) Check health

```bash
curl http://127.0.0.1:7860/health
```

### 4) Validate OpenEnv package

```bash
openenv validate
```

## Baseline Inference Script

The required baseline script is at project root: `inference.py`.

It:

- uses the OpenAI Python client
- reads mandatory LLM variables:
  - `API_BASE_URL`
  - `MODEL_NAME`
  - `HF_TOKEN`
- runs all three tasks by default
- emits strict structured stdout lines:
  - `[START] ...`
  - `[STEP] ...`
  - `[END] ...`
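
A quick way to sanity-check captured output is to confirm that every line starts with one of the three tags. The sketch below deliberately ignores whatever follows each tag, since the field layout is not specified here; the helper name `count_tags` and the assumption that blank lines are tolerated while any other untagged line is an error are both hypothetical.

```python
from collections import Counter
from typing import Iterable

TAGS = ("[START]", "[STEP]", "[END]")


def count_tags(lines: Iterable[str]) -> Counter:
    """Count lines per tag, raising ValueError on any untagged non-blank line."""
    counts: Counter = Counter()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ASSUMPTION: blank lines are ignored
        tag = next((t for t in TAGS if line.startswith(t)), None)
        if tag is None:
            raise ValueError(f"line {number} has no recognized tag: {line!r}")
        counts[tag] += 1
    return counts
```

Saving the script's stdout to a file and passing its lines to `count_tags` gives a fast check that no stray debug output slipped into the log.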

### Inference environment variables

```bash
export OPENENV_BASE_URL="http://127.0.0.1:7860"
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="<your_token>"
```

### Run baseline

```bash
python inference.py
```

Run a single task:

```bash
python inference.py --task task2
```

## Docker

Build:

```bash
docker build -t devopsenv:latest .
```

Run:

```bash
docker run --rm -p 7860:7860 devopsenv:latest
```

## Hugging Face Spaces Deployment

This repository is configured for Docker Spaces:

- README frontmatter sets `sdk: docker`
- container exposes and serves on port `7860`
- includes `openenv` tag

After pushing to a Space, verify:

- `POST /reset` returns 200
- `openenv validate` passes
- `python inference.py` completes within runtime constraints

## Pre-Submission Checklist

- HF Space endpoint responds to `/reset`
- `openenv validate` passes
- `docker build` succeeds
- `inference.py` runs and logs strict `[START]/[STEP]/[END]` format
- all 3 tasks produce valid grader scores in `[0.0, 1.0]`