---
title: DevOpsEnv
emoji: 🛠️
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
tags:
  - openenv
  - devops
  - sre
  - troubleshooting
  - agent-evaluation
pinned: false
---
# DevOpsEnv

DevOpsEnv is an OpenEnv-compliant environment for training and evaluating AI agents on realistic DevOps/SRE incident-response workflows.

## Motivation

The environment models the workflow engineers follow when handling a production incident:

- inspect system state
- run diagnostic commands
- apply targeted config or code fixes
- verify the impact
- submit a final resolution

It is deliberately built around common SRE failure classes (service outage, deployment misconfiguration, runtime memory issue) rather than toy interactions.
## OpenEnv Compliance

The project implements the required OpenEnv interface:

- typed Pydantic models for `Observation`, `Action`, `Reward`, `StepResult`, and `State`
- `POST /reset` returns the initial observation
- `POST /step` returns `observation`, `reward`, `done`, and `info`
- `GET /state` returns the current episode state
- `POST /grader` returns a deterministic final score with a per-criterion breakdown
- `openenv.yaml` provides the environment metadata and spec
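A minimal episode-loop sketch against these endpoints, using only the Python standard library. It assumes JSON bodies, that `/reset` accepts a `{"task_id": ...}` object, that `/step` takes the action object as the whole request body, and that `/grader` accepts an empty JSON object; `build_request`, `call`, `run_episode`, and the `choose_action` callback are hypothetical names, not part of the project. Check `openenv.yaml` and the server code for the real request schemas.

```python
import json
import urllib.request
from typing import Any, Callable, Dict, Optional

JSON = Dict[str, Any]


def build_request(base_url: str, method: str, path: str,
                  payload: Optional[JSON] = None) -> urllib.request.Request:
    """Build a JSON request; the body is omitted when payload is None."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )


def call(base_url: str, method: str, path: str, payload: Optional[JSON] = None) -> JSON:
    """Send a request and decode the JSON response."""
    req = build_request(base_url, method, path, payload)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_episode(base_url: str, task_id: str, choose_action: Callable[[JSON], JSON]) -> JSON:
    # ASSUMPTION: /reset takes {"task_id": ...} and returns the observation directly
    obs = call(base_url, "POST", "/reset", {"task_id": task_id})
    done = False
    while not done:  # relies on the server ending the episode at max_steps or on submit
        action = choose_action(obs)
        # ASSUMPTION: the action object is sent as the entire /step body
        result = call(base_url, "POST", "/step", action)
        obs, done = result["observation"], result["done"]
    # ASSUMPTION: /grader accepts an empty JSON object and returns score + breakdown
    return call(base_url, "POST", "/grader", {})
```

Keeping request construction in `build_request` separates the wire format from the network call, so the payload shapes can be checked without a running server.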
## Observation Space

`Observation` includes:

- task metadata (`task_id`, `task_description`)
- episode controls (`episode_id`, `step_number`, `max_steps`)
- `system_state`:
  - running processes
  - service status
  - open HTTP ports
  - Docker containers
  - logs
  - filesystem snapshot
  - CPU and memory metrics
- interaction history and the current `available_actions`
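The project defines `Observation` as a Pydantic model; the sketch below mirrors the listed fields with a standard-library dataclass instead, so it runs without dependencies. The top-level names come from the list above except `history`, which is a hypothetical name for the interaction history, and the keys inside `system_state` (such as `services`) are hypothetical placeholders.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict, List


@dataclass
class Observation:
    # Task metadata
    task_id: str
    task_description: str
    # Episode controls
    episode_id: str
    step_number: int
    max_steps: int
    # Host snapshot; inner keys are HYPOTHETICAL (e.g. "services")
    system_state: Dict[str, Any] = field(default_factory=dict)
    # HYPOTHETICAL field name for the interaction history
    history: List[Any] = field(default_factory=list)
    available_actions: List[Any] = field(default_factory=list)

    @property
    def steps_remaining(self) -> int:
        return max(self.max_steps - self.step_number, 0)

    def unhealthy_services(self) -> List[str]:
        # ASSUMPTION: system_state["services"] maps service name -> status string
        services = self.system_state.get("services", {})
        return sorted(name for name, status in services.items() if status != "running")

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "Observation":
        # Drop keys this sketch does not model instead of failing on them
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in known})
```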
## Action Space

`Action.action_type` is one of:

- `bash_cmd`: execute a simulated shell command (`command`)
- `file_edit`: overwrite a known config or source file (`file_path`, `file_content`)
- `submit`: end the episode and grade it (optional `summary`)
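A sketch of building and checking the three action shapes as plain dicts before sending them. The field names come from the list above; the helper names (`bash_cmd`, `file_edit`, `submit`, `validate`) are hypothetical, and the assumption that the server accepts exactly this flat JSON shape should be checked against the `Action` model.

```python
from typing import Any, Dict, Optional

# Fields each action type must carry, per the Action Space list
REQUIRED_FIELDS = {
    "bash_cmd": ("command",),
    "file_edit": ("file_path", "file_content"),
    "submit": (),  # `summary` is optional
}


def bash_cmd(command: str) -> Dict[str, Any]:
    return {"action_type": "bash_cmd", "command": command}


def file_edit(file_path: str, file_content: str) -> Dict[str, Any]:
    # file_edit overwrites the whole file, so pass the complete new content
    return {"action_type": "file_edit", "file_path": file_path, "file_content": file_content}


def submit(summary: Optional[str] = None) -> Dict[str, Any]:
    action: Dict[str, Any] = {"action_type": "submit"}
    if summary is not None:
        action["summary"] = summary
    return action


def validate(action: Dict[str, Any]) -> None:
    """Raise ValueError for an unknown type or a missing required field."""
    kind = action.get("action_type")
    if kind not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action_type: {kind!r}")
    missing = [name for name in REQUIRED_FIELDS[kind] if name not in action]
    if missing:
        raise ValueError(f"{kind} is missing {missing}")
```

Catching a malformed action locally avoids spending a step (and its per-step cost) on a request the environment would reject.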
## Tasks and Difficulty

The environment ships with three graded tasks:

1. `task1` (easy): recover a crashed Nginx and verify HTTP health.
2. `task2` (medium): correct the docker-compose port mapping and redeploy.
3. `task3` (hard): diagnose memory-leak behavior, patch the service code, and restart cleanly.

Each task is graded deterministically, yielding a score in `[0.0, 1.0]` and a per-criterion breakdown.
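A minimal sketch of deterministic, criterion-level grading: each criterion is a pure boolean check on the final state, and the score is the weighted fraction of criteria passed, so it stays in `[0.0, 1.0]` by construction. The `task1` criteria, their weights, and the `final_state` keys are hypothetical illustrations; the real criteria live in the project's grader.

```python
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
Criterion = Tuple[str, float, Callable[[State], bool]]

# HYPOTHETICAL criteria for task1 (recover Nginx, verify HTTP health)
TASK1_CRITERIA: List[Criterion] = [
    ("nginx_running", 0.4, lambda s: s.get("services", {}).get("nginx") == "running"),
    ("port_80_open", 0.3, lambda s: 80 in s.get("open_ports", [])),
    ("http_health_verified", 0.3, lambda s: bool(s.get("health_checked"))),
]


def grade(final_state: State, criteria: List[Criterion]) -> Tuple[float, Dict[str, float]]:
    """Return (score, breakdown); pure checks make the result deterministic."""
    total = sum(weight for _, weight, _ in criteria)
    breakdown = {name: (weight if check(final_state) else 0.0) for name, weight, check in criteria}
    score = sum(breakdown.values()) / total if total > 0 else 0.0
    return round(min(max(score, 0.0), 1.0), 4), breakdown
```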
## Reward Design

Rewards are dense and shaped to give a learning signal along the whole trajectory:

- a per-step cost discourages long loops
- an action-type reward favors useful commands and edits
- progress bonuses mark key milestones (validation, successful restart, verified outputs)
- penalties apply to repeated identical actions and invalid edits
- a terminal bonus derived from the grader score is added when the episode completes
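A sketch of how these terms could combine into a single per-step reward. Only the kinds of terms come from the list above; every constant, the `action_key` convention, and the function signature are hypothetical.

```python
from typing import Optional, Set

# HYPOTHETICAL constants; the environment's real values live in its source
STEP_COST = -0.01
USEFUL_ACTION_REWARD = 0.02
MILESTONE_BONUS = 0.1
REPEAT_PENALTY = -0.05
INVALID_EDIT_PENALTY = -0.1
TERMINAL_WEIGHT = 1.0


def shaped_reward(
    action_key: str,
    seen_actions: Set[str],
    useful: bool,
    invalid_edit: bool,
    new_milestones: int,
    grader_score: Optional[float] = None,
) -> float:
    """Combine the shaping terms for one step.

    action_key identifies the action (e.g. "bash_cmd:<command>"); seen_actions
    holds the keys of earlier steps in this episode; grader_score is passed
    only on the terminal step.
    """
    reward = STEP_COST                           # every step costs a little
    if useful:
        reward += USEFUL_ACTION_REWARD           # e.g. a diagnostic that produced output
    reward += MILESTONE_BONUS * new_milestones   # milestones reached for the first time
    if action_key in seen_actions:
        reward += REPEAT_PENALTY                 # identical action already taken
    if invalid_edit:
        reward += INVALID_EDIT_PENALTY
    if grader_score is not None:
        reward += TERMINAL_WEIGHT * grader_score  # terminal bonus from the grader
    return round(reward, 4)
```

Counting only first-time milestones keeps an agent from farming the bonus by, say, restarting a healthy service repeatedly.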
## Local Setup

### 1) Install dependencies

```bash
pip install -r requirements.txt
```

### 2) Run API server

```bash
uvicorn app:app --host 0.0.0.0 --port 7860
```

### 3) Check health

```bash
curl http://127.0.0.1:7860/health
```

### 4) Validate OpenEnv package

```bash
openenv validate
```
## Baseline Inference Script

The required baseline script is `inference.py` at the project root. It:

- uses the OpenAI Python client
- reads the mandatory LLM variables:
  - `API_BASE_URL`
  - `MODEL_NAME`
  - `HF_TOKEN`
- runs all three tasks by default
- emits strictly structured stdout lines:
  - `[START] ...`
  - `[STEP] ...`
  - `[END] ...`
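The fields after each tag are not spelled out here, so this sketch checks only the tag structure: every episode opens with `[START]`, contains only `[STEP]` lines, and closes with `[END]`, and several episodes may follow one another. It treats any untagged stdout line as an error, which is an assumption about how strict the format is; `count_episodes` is a hypothetical helper name.

```python
from typing import Iterable

TAGS = ("[START]", "[STEP]", "[END]")


def count_episodes(lines: Iterable[str]) -> int:
    """Return the number of complete START..END blocks; raise ValueError on bad ordering."""
    in_episode = False
    episodes = 0
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        tag = next((t for t in TAGS if line.startswith(t)), None)
        if tag is None:
            raise ValueError(f"line {lineno}: untagged stdout line {line!r}")
        if tag == "[START]":
            if in_episode:
                raise ValueError(f"line {lineno}: [START] before the previous [END]")
            in_episode = True
        elif not in_episode:
            raise ValueError(f"line {lineno}: {tag} outside an episode")
        elif tag == "[END]":
            in_episode = False
            episodes += 1
    if in_episode:
        raise ValueError("log ended inside an episode (missing [END])")
    return episodes
```

Because it accepts any iterable of lines, a captured log can be checked with `count_episodes(open("run.log"))`; a default run over all three tasks would be expected to return 3.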
### Inference environment variables

`OPENENV_BASE_URL` points at the DevOpsEnv server; the other three configure the LLM endpoint.

```bash
export OPENENV_BASE_URL="http://127.0.0.1:7860"
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="<your_token>"
```
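A sketch of reading these variables and failing fast when a mandatory one is missing. `load_config` is a hypothetical helper, and falling back to `http://127.0.0.1:7860` when `OPENENV_BASE_URL` is unset is an assumption; `inference.py` may handle defaults differently.

```python
import os
from typing import Dict, Mapping, Optional

REQUIRED = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")
DEFAULT_ENV_URL = "http://127.0.0.1:7860"  # ASSUMPTION: fallback for OPENENV_BASE_URL


def load_config(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Read the inference settings, raising if a mandatory variable is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing required environment variables: {', '.join(missing)}")
    config = {name: env[name] for name in REQUIRED}
    config["OPENENV_BASE_URL"] = env.get("OPENENV_BASE_URL", DEFAULT_ENV_URL)
    return config
```

With the `openai` package (v1 or later), these values map onto `OpenAI(base_url=config["API_BASE_URL"], api_key=config["HF_TOKEN"])`, with `config["MODEL_NAME"]` passed as the `model` of each chat completion; whether `inference.py` wires them exactly this way is an assumption.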
### Run baseline

```bash
python inference.py
```

Run a single task:

```bash
python inference.py --task task2
```
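A sketch of the command-line behavior described above (all three tasks by default, one with `--task`) using `argparse`. How `inference.py` actually parses its arguments is an assumption; `parse_tasks` is a hypothetical name.

```python
import argparse
from typing import List, Optional

TASKS = ["task1", "task2", "task3"]


def parse_tasks(argv: Optional[List[str]] = None) -> List[str]:
    """Return the tasks to run: the one named by --task, or all of them."""
    parser = argparse.ArgumentParser(description="Run the DevOpsEnv baseline agent.")
    parser.add_argument("--task", choices=TASKS, help="run a single task (default: all three)")
    args = parser.parse_args(argv)
    return [args.task] if args.task else list(TASKS)
```

Restricting `--task` with `choices` makes a typo such as `--task task4` exit with a usage error instead of starting a run against an unknown task.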
## Docker

Build:

```bash
docker build -t devopsenv:latest .
```

Run:

```bash
docker run --rm -p 7860:7860 devopsenv:latest
```
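Before running `inference.py` against a freshly started container, it helps to wait until the server answers. This sketch polls the `/health` endpoint from Local Setup and assumes it returns HTTP 200 once the app is ready; `http_ok` and `wait_until_healthy` are hypothetical names, and the probe is injectable so the loop can be exercised without a server.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_ok(url: str) -> bool:
    """True if GET url returns HTTP 200; connection errors and HTTP errors mean not ready."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):  # URLError/HTTPError are OSError subclasses
        return False


def wait_until_healthy(
    url: str = "http://127.0.0.1:7860/health",
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    probe: Callable[[str], bool] = http_ok,
) -> bool:
    """Poll until probe(url) succeeds or timeout_s elapses; return whether it succeeded."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```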
## Hugging Face Spaces Deployment

This repository is configured for Docker Spaces:

- the README frontmatter sets `sdk: docker`
- the container exposes and serves on port `7860`, matching `app_port` in the frontmatter
- the frontmatter includes the `openenv` tag

After pushing to a Space, verify that:

- `POST /reset` returns 200
- `openenv validate` passes
- `python inference.py` completes within the runtime constraints
## Pre-Submission Checklist

- the HF Space endpoint responds to `/reset`
- `openenv validate` passes
- `docker build` succeeds
- `inference.py` runs and logs in the strict `[START]`/`[STEP]`/`[END]` format
- all three tasks produce valid grader scores in `[0.0, 1.0]`