---
title: DevOpsEnv
emoji: 🛠️
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
tags:
  - openenv
  - devops
  - sre
  - troubleshooting
  - agent-evaluation
pinned: false
---

# DevOpsEnv

DevOpsEnv is an OpenEnv-compliant environment for training and evaluating AI agents on realistic DevOps/SRE incident response workflows.

## Motivation

This environment models a real operational workflow that engineers do in production:

- inspect system state
- run diagnostic commands
- apply targeted config/code fixes
- verify impact
- submit a final resolution

It is intentionally designed around common SRE failure classes (service outage, deployment misconfiguration, runtime memory issue) instead of toy interactions.

## OpenEnv Compliance

The project implements the required OpenEnv interface:

- typed Pydantic models for `Observation`, `Action`, `Reward`, `StepResult`, `State`
- `POST /reset` returns the initial observation
- `POST /step` returns `observation`, `reward`, `done`, `info`
- `GET /state` returns current episode state
- `POST /grader` returns deterministic final score and breakdown
- `openenv.yaml` metadata/spec included

## Observation Space

`Observation` includes:

- task metadata (`task_id`, `task_description`)
- episode controls (`episode_id`, `step_number`, `max_steps`)
- `system_state`:
  - running processes
  - service status
  - open HTTP ports
  - docker containers
  - logs
  - filesystem snapshot
  - cpu and memory metrics
- interaction history and current `available_actions`

## Action Space

`Action.action_type` is one of:

- `bash_cmd`: execute simulated shell command (`command`)
- `file_edit`: overwrite known config/source file (`file_path`, `file_content`)
- `submit`: terminate and grade current episode (`summary` optional)

## Tasks and Difficulty

The environment ships with 3 graded tasks:

1. `task1` (easy): recover crashed Nginx and verify HTTP health.
2. `task2` (medium): correct docker-compose port mapping and redeploy.
3. `task3` (hard): diagnose memory leak behavior, patch service code, restart cleanly.

Each task has deterministic grading with score in `[0.0, 1.0]` and criterion-level breakdown.

## Reward Design

Rewards are dense and shaped to provide trajectory signal:

- per-step cost discourages long loops
- action-type reward for useful commands/edits
- progress bonuses for key milestones (validation, successful restart, verified outputs)
- penalties for repeated identical actions and invalid edits
- terminal bonus from grader score on episode completion

## Local Setup

### 1) Install dependencies

```bash
pip install -r requirements.txt
```

### 2) Run API server

```bash
uvicorn app:app --host 0.0.0.0 --port 7860
```

### 3) Check health

```bash
curl http://127.0.0.1:7860/health
```

### 4) Validate OpenEnv package

```bash
openenv validate
```

## Baseline Inference Script

The required baseline script is at project root: `inference.py`.

It:

- uses the OpenAI Python client
- reads mandatory LLM variables:
  - `API_BASE_URL`
  - `MODEL_NAME`
  - `HF_TOKEN`
- runs all three tasks by default
- emits strict structured stdout lines:
  - `[START] ...`
  - `[STEP] ...`
  - `[END] ...`

### Inference environment variables

```bash
export OPENENV_BASE_URL="http://127.0.0.1:7860"
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN=""
```

### Run baseline

```bash
python inference.py
```

Run a single task:

```bash
python inference.py --task task2
```

## Docker

Build:

```bash
docker build -t devopsenv:latest .
```

Run:

```bash
docker run --rm -p 7860:7860 devopsenv:latest
```

## Hugging Face Spaces Deployment

This repository is configured for Docker Spaces:

- README frontmatter sets `sdk: docker`
- container exposes and serves on port `7860`
- includes `openenv` tag

After pushing to a Space, verify:

- `POST /reset` returns 200
- `openenv validate` passes
- `python inference.py` completes within runtime constraints

## Pre-Submission Checklist

- HF Space endpoint responds to `/reset`
- `openenv validate` passes
- `docker build` succeeds
- `inference.py` runs and logs strict `[START]/[STEP]/[END]` format
- all 3 tasks produce valid grader scores in `[0.0, 1.0]`
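
## Example Client Sketch

The `POST /reset` and `POST /step` flow and the three action types described above can be illustrated with a small Python client. This is a minimal sketch and not the project's own client: the helper names (`bash_cmd`, `file_edit`, `submit`, `http_post`, `run_episode`) are hypothetical, and the request bodies are assumptions. In particular, it assumes `/reset` accepts a JSON body with a `task_id` field and that `/step` accepts the action fields as a flat JSON object; check `openenv.yaml` and the Pydantic models for the actual schema. Only the response keys `reward` and `done` are taken from the interface listed under OpenEnv Compliance.

```python
import json
import urllib.request
from typing import Any, Callable, Dict, List, Optional

JsonDict = Dict[str, Any]
# A transport is any callable taking (path, payload) and returning parsed JSON.
PostFn = Callable[[str, JsonDict], JsonDict]


def bash_cmd(command: str) -> JsonDict:
    """Build a `bash_cmd` action with its `command` field."""
    return {"action_type": "bash_cmd", "command": command}


def file_edit(file_path: str, file_content: str) -> JsonDict:
    """Build a `file_edit` action that overwrites a known file."""
    return {
        "action_type": "file_edit",
        "file_path": file_path,
        "file_content": file_content,
    }


def submit(summary: Optional[str] = None) -> JsonDict:
    """Build a `submit` action; `summary` is optional and omitted when None."""
    action: JsonDict = {"action_type": "submit"}
    if summary is not None:
        action["summary"] = summary
    return action


def http_post(base_url: str, path: str, payload: JsonDict) -> JsonDict:
    """POST a JSON payload with the standard library and parse the JSON reply."""
    request = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def run_episode(post: PostFn, task_id: str, actions: List[JsonDict]) -> List[JsonDict]:
    """Reset the environment, then send actions until one reports `done`.

    ASSUMPTION: `/reset` takes {"task_id": ...} and `/step` takes the action
    dict directly. Neither request shape is confirmed by the README.
    """
    post("/reset", {"task_id": task_id})
    results: List[JsonDict] = []
    for action in actions:
        result = post("/step", action)
        results.append(result)
        # `done` is one of the documented keys in the /step response.
        if result["done"]:
            break
    return results
```

Against a locally running server, the transport could be supplied as `lambda path, body: http_post("http://127.0.0.1:7860", path, body)` and passed to `run_episode` together with a task ID such as `task1` and a list of actions ending in `submit()`. Passing the transport in as a callable keeps the episode loop testable without a live server.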