title: DevOpsEnv
emoji: 🛠️
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
tags:
- openenv
- devops
- sre
- troubleshooting
- agent-evaluation
pinned: false
DevOpsEnv
DevOpsEnv is an OpenEnv-compliant environment for training and evaluating AI agents on realistic DevOps/SRE incident response workflows.
Motivation
This environment models an operational workflow that engineers routinely perform in production:
- inspect system state
- run diagnostic commands
- apply targeted config/code fixes
- verify impact
- submit a final resolution
It is intentionally designed around common SRE failure classes (service outage, deployment misconfiguration, runtime memory issue) instead of toy interactions.
OpenEnv Compliance
The project implements the required OpenEnv interface:
- typed Pydantic models for Observation, Action, Reward, StepResult, State
- POST /reset returns the initial observation
- POST /step returns observation, reward, done, info
- GET /state returns current episode state
- POST /grader returns deterministic final score and breakdown
- openenv.yaml metadata/spec included
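The endpoints above can be exercised with a small standard-library client. This is a minimal sketch, not the project's own client: the request bodies (a task_id field for /reset, Action fields sent directly to /step, an empty object for /grader) and the example shell command are assumptions to check against the server's actual schema.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:7860"  # matches app_port in the Space config


def build_request(path, payload=None, base_url=BASE_URL):
    """Build a JSON request: POST when a payload is given, GET otherwise."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    method = "GET" if payload is None else "POST"
    return urllib.request.Request(
        base_url + path,
        data=data,
        headers={"Content-Type": "application/json"},
        method=method,
    )


def call(path, payload=None, base_url=BASE_URL):
    """Send the request and decode the JSON response body."""
    request = build_request(path, payload, base_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def run_minimal_episode(task_id="task1"):
    """Reset, take one diagnostic step, submit, then fetch state and grade."""
    # ASSUMPTION: /reset accepts {"task_id": ...}.
    call("/reset", {"task_id": task_id})
    # ASSUMPTION: /step accepts the Action fields directly as the JSON body,
    # and the simulated shell understands this example command.
    call("/step", {"action_type": "bash_cmd", "command": "systemctl status nginx"})
    call("/step", {"action_type": "submit", "summary": "inspected nginx"})
    state = call("/state")
    # ASSUMPTION: /grader takes an empty JSON object.
    grade = call("/grader", {})
    return state, grade
```

With the server running locally, calling run_minimal_episode() should return the final state and the grader response as decoded JSON.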
Observation Space
Observation includes:
- task metadata (task_id, task_description)
- episode controls (episode_id, step_number, max_steps)
- system_state:
  - running processes
  - service status
  - open HTTP ports
  - docker containers
  - logs
  - filesystem snapshot
  - cpu and memory metrics
- interaction history and current available_actions
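The shape of an observation can be pictured roughly as follows. The project itself uses Pydantic models; this sketch uses standard-library dataclasses to stay dependency-free. The names task_id, task_description, episode_id, step_number, max_steps, system_state, and available_actions come from the list above, while every field inside SystemState, the history field, and all types are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SystemState:
    # ASSUMPTION: field names and types are guesses for the items listed above.
    processes: List[str] = field(default_factory=list)
    services: Dict[str, str] = field(default_factory=dict)    # name -> status
    open_http_ports: List[int] = field(default_factory=list)
    containers: List[Dict[str, Any]] = field(default_factory=list)
    logs: List[str] = field(default_factory=list)
    filesystem: Dict[str, str] = field(default_factory=dict)  # path -> content
    cpu_percent: float = 0.0
    memory_percent: float = 0.0


@dataclass
class Observation:
    task_id: str
    task_description: str
    episode_id: str
    step_number: int
    max_steps: int
    system_state: SystemState
    # ASSUMPTION: history is a list of past action/result records.
    history: List[Dict[str, Any]] = field(default_factory=list)
    available_actions: List[str] = field(default_factory=list)


def steps_remaining(obs: Observation) -> int:
    """Number of steps the agent can still take in this episode."""
    return max(obs.max_steps - obs.step_number, 0)
```

An agent loop can use a helper such as steps_remaining to decide when to stop diagnosing and submit.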
Action Space
Action.action_type is one of:
- bash_cmd: execute simulated shell command (command)
- file_edit: overwrite known config/source file (file_path, file_content)
- submit: terminate and grade current episode (summary optional)
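A client can check an action payload before sending it. The sketch below encodes only the fields named above; make_action is a hypothetical helper, and rejecting unexpected fields is an assumption about how strict the server is.

```python
# Required and optional fields per action type, taken from the list above.
REQUIRED_FIELDS = {
    "bash_cmd": ("command",),
    "file_edit": ("file_path", "file_content"),
    "submit": (),
}
OPTIONAL_FIELDS = {"submit": ("summary",)}


def make_action(action_type, **fields):
    """Return an action dict, raising ValueError on malformed input."""
    if action_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action_type: {action_type!r}")
    required = REQUIRED_FIELDS[action_type]
    allowed = set(required) | set(OPTIONAL_FIELDS.get(action_type, ()))
    missing = [name for name in required if name not in fields]
    if missing:
        raise ValueError(f"{action_type} is missing fields: {missing}")
    # ASSUMPTION: fields outside the documented set are treated as errors.
    unexpected = sorted(set(fields) - allowed)
    if unexpected:
        raise ValueError(f"{action_type} got unexpected fields: {unexpected}")
    return {"action_type": action_type, **fields}
```

For example, make_action("bash_cmd", command="ls") yields a dict ready to serialize as JSON, while make_action("file_edit", file_path="x") raises because file_content is missing.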
Tasks and Difficulty
The environment ships with 3 graded tasks:
- task1 (easy): recover crashed Nginx and verify HTTP health.
- task2 (medium): correct docker-compose port mapping and redeploy.
- task3 (hard): diagnose memory leak behavior, patch service code, restart cleanly.
Each task has deterministic grading with score in [0.0, 1.0] and criterion-level breakdown.
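One way to obtain a deterministic score in [0.0, 1.0] with a criterion-level breakdown is a weighted checklist. This is an illustrative sketch only; the actual criteria and weights used by the graders are not documented here, and the criterion names in the usage note are hypothetical.

```python
def grade(checks, weights):
    """Combine boolean criteria into a score in [0.0, 1.0].

    checks:  criterion name -> bool (did the final state satisfy it?)
    weights: criterion name -> positive weight
    """
    total = sum(weights.values())
    # Each criterion contributes its normalized weight when satisfied.
    breakdown = {
        name: (weight / total if checks.get(name, False) else 0.0)
        for name, weight in weights.items()
    }
    # Clamp to guard against floating-point drift outside the interval.
    score = min(max(sum(breakdown.values()), 0.0), 1.0)
    return {"score": score, "breakdown": breakdown}
```

For a task such as task1, hypothetical criteria might be service_running and http_health_ok; the same inputs always produce the same score because no randomness or timing is involved.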
Reward Design
Rewards are dense and shaped so that the agent receives a signal at every step of the trajectory, not only at the end:
- per-step cost discourages long loops
- action-type reward for useful commands/edits
- progress bonuses for key milestones (validation, successful restart, verified outputs)
- penalties for repeated identical actions and invalid edits
- terminal bonus from grader score on episode completion
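The components above could combine along these lines. All constants are hypothetical placeholders chosen for illustration, not the values used by the environment, and step_reward is not the project's function. The sketch also simplifies "useful" to a fixed reward per action type.

```python
# ASSUMPTION: every constant below is a placeholder, not the real value.
STEP_COST = -0.01
ACTION_REWARD = {"bash_cmd": 0.02, "file_edit": 0.05, "submit": 0.0}
MILESTONE_BONUS = 0.2
REPEAT_PENALTY = -0.05
INVALID_EDIT_PENALTY = -0.1


def step_reward(action, history, new_milestones=0, invalid_edit=False,
                grader_score=None):
    """Shaped reward for one step.

    action:         dict with at least an "action_type" key
    history:        list of previous action dicts in this episode
    new_milestones: number of milestones first reached on this step
    grader_score:   final score in [0.0, 1.0] when the episode ends, else None
    """
    reward = STEP_COST                      # per-step cost
    if invalid_edit:
        reward += INVALID_EDIT_PENALTY      # penalty for an invalid edit
    else:
        reward += ACTION_REWARD.get(action["action_type"], 0.0)
    if action in history:
        reward += REPEAT_PENALTY            # penalty for an identical repeat
    reward += MILESTONE_BONUS * new_milestones
    if grader_score is not None:
        reward += grader_score              # terminal bonus from the grader
    return round(reward, 4)
```

Under these placeholder values, a repeated command nets a negative reward, which is the intended pressure against looping on the same diagnostic.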
Local Setup
1) Install dependencies
pip install -r requirements.txt
2) Run API server
uvicorn app:app --host 0.0.0.0 --port 7860
3) Check health
curl http://127.0.0.1:7860/health
4) Validate OpenEnv package
openenv validate
Baseline Inference Script
The required baseline script is at project root: inference.py.
It:
- uses the OpenAI Python client
- reads mandatory LLM variables: API_BASE_URL, MODEL_NAME, HF_TOKEN
- runs all three tasks by default
- emits strict structured stdout lines: [START] ..., [STEP] ..., [END] ...
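Captured stdout can be sanity-checked by classifying each line by its leading tag. The payload after each tag is not specified in this README, so the sketch below inspects only the tag itself; tag_of and count_tags are hypothetical helper names.

```python
import re

# Matches a line that begins with one of the three documented tags.
TAG_PATTERN = re.compile(r"^\[(START|STEP|END)\](?: |$)")


def tag_of(line):
    """Return 'START', 'STEP', or 'END', or None for any other line."""
    match = TAG_PATTERN.match(line)
    return match.group(1) if match else None


def count_tags(lines):
    """Count lines per tag; anything unrecognized is counted as OTHER."""
    counts = {"START": 0, "STEP": 0, "END": 0, "OTHER": 0}
    for line in lines:
        counts[tag_of(line.rstrip("\n")) or "OTHER"] += 1
    return counts
```

A nonzero OTHER count indicates stray output, such as debug prints, mixed into the structured lines.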
Inference environment variables
export OPENENV_BASE_URL="http://127.0.0.1:7860"
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="<your_token>"
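A script can fail fast when one of the mandatory variables is missing. This is a sketch of one way to read the configuration, not the code in inference.py; load_config is a hypothetical helper, and the fallback to the local server address for OPENENV_BASE_URL is an assumption.

```python
import os

# The three variables the README lists as mandatory.
REQUIRED_VARS = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")


def load_config(env=None):
    """Read settings from the environment, raising if a required one is unset."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(
            "missing required environment variables: " + ", ".join(missing)
        )
    config = {name: env[name] for name in REQUIRED_VARS}
    # ASSUMPTION: default to the local server when OPENENV_BASE_URL is unset.
    config["OPENENV_BASE_URL"] = env.get(
        "OPENENV_BASE_URL", "http://127.0.0.1:7860"
    )
    return config
```

Passing a plain dict as env makes the helper easy to exercise without touching the real process environment.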
Run baseline
python inference.py
Run a single task:
python inference.py --task task2
Docker
Build:
docker build -t devopsenv:latest .
Run:
docker run --rm -p 7860:7860 devopsenv:latest
Hugging Face Spaces Deployment
This repository is configured for Docker Spaces:
- README frontmatter sets sdk: docker
- container exposes and serves on port 7860
- includes openenv tag
After pushing to a Space, verify:
- POST /reset returns 200
- openenv validate passes
- python inference.py completes within runtime constraints
Pre-Submission Checklist
- HF Space endpoint responds to /reset
- openenv validate passes
- docker build succeeds
- inference.py runs and logs strict [START]/[STEP]/[END] format
- all 3 tasks produce valid grader scores in [0.0, 1.0]