---
title: DevOpsEnv
emoji: 🛠️
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
tags:
- openenv
- devops
- sre
- troubleshooting
- agent-evaluation
pinned: false
---
# DevOpsEnv
DevOpsEnv is an OpenEnv-compliant environment for training and evaluating AI agents on realistic DevOps/SRE incident response workflows.
## Motivation
This environment models the operational workflow engineers follow when handling production incidents:
- inspect system state
- run diagnostic commands
- apply targeted config/code fixes
- verify impact
- submit a final resolution
It is intentionally designed around common SRE failure classes (service outage, deployment misconfiguration, runtime memory issue) instead of toy interactions.
## OpenEnv Compliance
The project implements the required OpenEnv interface:
- typed Pydantic models for `Observation`, `Action`, `Reward`, `StepResult`, `State`
- `POST /reset` returns the initial observation
- `POST /step` returns `observation`, `reward`, `done`, `info`
- `GET /state` returns the current episode state
- `POST /grader` returns a deterministic final score and breakdown
- `openenv.yaml` metadata/spec is included
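The endpoints above can be exercised with a small client. The following is a minimal sketch using only the Python standard library, not the project's own client. The request bodies are assumptions, since this README does not specify them: `/reset` is assumed to accept a `task_id`, `/step` is assumed to take the action fields as a flat JSON object, and `/grader` is assumed to accept an empty JSON object. The shell command is illustrative only. Check `openenv.yaml` or the server code for the actual schema.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://127.0.0.1:7860"  # local server address used elsewhere in this README


def build_request(method: str, path: str, payload: Optional[dict] = None) -> urllib.request.Request:
    """Build a JSON request for one environment endpoint (no network I/O)."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    headers = {} if data is None else {"Content-Type": "application/json"}
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)


def call(method: str, path: str, payload: Optional[dict] = None) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_request(method, path, payload), timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_episode(task_id: str = "task1") -> dict:
    """Reset, take one step, inspect state, then ask for the final grade."""
    # ASSUMPTION: /reset takes {"task_id": ...}; the real body may differ.
    call("POST", "/reset", {"task_id": task_id})
    # ASSUMPTION: /step takes the Action fields as a flat JSON object.
    result = call("POST", "/step", {"action_type": "bash_cmd", "command": "systemctl status nginx"})
    print("reward:", result["reward"], "done:", result["done"])
    print("state:", call("GET", "/state"))
    # ASSUMPTION: /grader accepts an empty JSON object.
    return call("POST", "/grader", {})
```

With the API server running locally, calling `run_episode()` walks through one reset, step, state, and grader round trip.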
## Observation Space
`Observation` includes:
- task metadata (`task_id`, `task_description`)
- episode controls (`episode_id`, `step_number`, `max_steps`)
- `system_state`:
  - running processes
  - service status
  - open HTTP ports
  - Docker containers
  - logs
  - filesystem snapshot
  - CPU and memory metrics
- interaction history and current `available_actions`
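As an illustration of how an agent might consume these fields, here is a small sketch that condenses an observation into a one-line summary. It assumes the observation arrives as a plain dict keyed by the field names listed above and that `step_number` and `max_steps` are integers. The keys inside `system_state` in the sample are hypothetical, since the README describes their content but not their names.

```python
def summarize_observation(obs: dict) -> str:
    """Condense an observation dict into a short status line for prompting or logging."""
    remaining = obs["max_steps"] - obs["step_number"]
    # Only the key names of system_state are listed; their real names are not documented here.
    state_keys = ", ".join(sorted(obs.get("system_state", {})))
    actions = ", ".join(obs.get("available_actions", []))
    return (
        f"task={obs['task_id']} step={obs['step_number']}/{obs['max_steps']} "
        f"(remaining={remaining}) state=[{state_keys}] actions=[{actions}]"
    )


# Sample observation; the system_state keys are HYPOTHETICAL placeholders.
sample = {
    "task_id": "task1",
    "task_description": "recover crashed Nginx and verify HTTP health",
    "episode_id": "ep-1",
    "step_number": 2,
    "max_steps": 10,
    "system_state": {"logs": [], "services": {}},
    "available_actions": ["bash_cmd", "file_edit", "submit"],
}
print(summarize_observation(sample))
```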
## Action Space
`Action.action_type` is one of:
- `bash_cmd`: execute a simulated shell command (`command`)
- `file_edit`: overwrite a known config or source file (`file_path`, `file_content`)
- `submit`: terminate and grade the current episode (`summary` optional)
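The three action types can be built and checked client-side before they are sent. The helper below is a hypothetical sketch, not part of the project: it uses only the field names listed above and assumes each action is serialized as a flat JSON object.

```python
from typing import Optional

ACTION_TYPES = ("bash_cmd", "file_edit", "submit")


def make_action(
    action_type: str,
    *,
    command: Optional[str] = None,
    file_path: Optional[str] = None,
    file_content: Optional[str] = None,
    summary: Optional[str] = None,
) -> dict:
    """Build an action payload, rejecting missing required fields early."""
    if action_type not in ACTION_TYPES:
        raise ValueError(f"unknown action_type: {action_type!r}")
    if action_type == "bash_cmd":
        if not command:
            raise ValueError("bash_cmd requires a non-empty command")
        return {"action_type": "bash_cmd", "command": command}
    if action_type == "file_edit":
        if file_path is None or file_content is None:
            raise ValueError("file_edit requires file_path and file_content")
        return {"action_type": "file_edit", "file_path": file_path, "file_content": file_content}
    # submit: the summary is optional, so include it only when provided.
    action = {"action_type": "submit"}
    if summary is not None:
        action["summary"] = summary
    return action
```

For example, `make_action("submit", summary="restarted nginx")` yields a payload with `action_type` and `summary`, while `make_action("bash_cmd")` raises `ValueError` because no command was given.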
## Tasks and Difficulty
The environment ships with 3 graded tasks:
1. `task1` (easy): recover crashed Nginx and verify HTTP health.
2. `task2` (medium): correct docker-compose port mapping and redeploy.
3. `task3` (hard): diagnose memory leak behavior, patch service code, restart cleanly.
Each task is graded deterministically, producing a score in `[0.0, 1.0]` and a criterion-level breakdown.
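One way such a grader could be structured is a weighted checklist. This is a sketch under assumptions, not the project's grader: the criterion names and weights below are hypothetical, and the real scoring rules may differ.

```python
from typing import Dict, Tuple


def grade(criteria: Dict[str, Tuple[bool, float]]) -> dict:
    """Return a score in [0.0, 1.0] plus a per-criterion breakdown.

    Each criterion maps to (passed, weight). The score is the weight of the
    passed criteria divided by the total weight, so identical inputs always
    produce identical output.
    """
    total = sum(weight for _, weight in criteria.values())
    if total <= 0:
        raise ValueError("criteria weights must sum to a positive value")
    breakdown = {
        name: (weight / total if passed else 0.0)
        for name, (passed, weight) in criteria.items()
    }
    score = min(1.0, max(0.0, sum(breakdown.values())))
    return {"score": score, "breakdown": breakdown}


# HYPOTHETICAL criteria for task1; the names and weights are illustrative only.
result = grade({
    "nginx_running": (True, 0.5),
    "http_health_ok": (True, 0.25),
    "config_valid": (False, 0.25),
})
print(result["score"])  # 0.75
```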
## Reward Design
Rewards are dense and shaped to provide trajectory signal:
- per-step cost discourages long loops
- action-type reward for useful commands/edits
- progress bonuses for key milestones (validation, successful restart, verified outputs)
- penalties for repeated identical actions and invalid edits
- terminal bonus from grader score on episode completion
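The components listed above could combine into a per-step reward roughly as follows. This is a sketch, not the environment's actual reward function: every magnitude in `RewardWeights` is a hypothetical placeholder, and the flags passed to `step_reward` stand in for whatever signals the environment computes internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # All magnitudes are HYPOTHETICAL placeholders, not the project's values.
    step_cost: float = 0.01
    useful_action: float = 0.05
    milestone: float = 0.2
    repeat_penalty: float = 0.05
    invalid_edit_penalty: float = 0.1


def step_reward(
    weights: RewardWeights,
    *,
    useful: bool,
    milestones_hit: int = 0,
    repeated: bool = False,
    invalid_edit: bool = False,
    done: bool = False,
    grader_score: float = 0.0,
) -> float:
    """Sum the shaped reward terms for a single step."""
    reward = -weights.step_cost                    # per-step cost discourages long loops
    if useful:
        reward += weights.useful_action            # useful command or edit
    reward += weights.milestone * milestones_hit   # progress bonuses
    if repeated:
        reward -= weights.repeat_penalty           # repeated identical action
    if invalid_edit:
        reward -= weights.invalid_edit_penalty     # invalid file edit
    if done:
        reward += grader_score                     # terminal bonus from the grader
    return reward
```

Under these placeholder weights, a useful step that reaches one milestone earns about `0.24`, while a repeated, non-useful step costs about `0.06`.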
## Local Setup
### 1) Install dependencies
```bash
pip install -r requirements.txt
```
### 2) Run API server
```bash
uvicorn app:app --host 0.0.0.0 --port 7860
```
### 3) Check health
```bash
curl http://127.0.0.1:7860/health
```
### 4) Validate OpenEnv package
```bash
openenv validate
```
## Baseline Inference Script
The required baseline script, `inference.py`, is at the project root.
It:
- uses the OpenAI Python client
- reads mandatory LLM variables:
  - `API_BASE_URL`
  - `MODEL_NAME`
  - `HF_TOKEN`
- runs all three tasks by default
- emits strict structured stdout lines:
  - `[START] ...`
  - `[STEP] ...`
  - `[END] ...`
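Because the stdout format is strict, it can be useful to check captured output before submitting. The checker below is a hypothetical helper, not part of the project. It assumes every stdout line carries one of the three tags and that each task produces one `[START]`, any number of `[STEP]` lines, and one `[END]`. It does not validate the text after the tag, since that format is not specified here.

```python
import re
from typing import Iterable

TAG = re.compile(r"^\[(START|STEP|END)\]")


def check_log(lines: Iterable[str]) -> int:
    """Validate tag ordering and return the number of completed episodes."""
    in_episode = False
    episodes = 0
    for number, line in enumerate(lines, start=1):
        match = TAG.match(line)
        if match is None:
            raise ValueError(f"line {number}: missing [START]/[STEP]/[END] tag")
        tag = match.group(1)
        if tag == "START":
            if in_episode:
                raise ValueError(f"line {number}: [START] before previous [END]")
            in_episode = True
        elif not in_episode:
            raise ValueError(f"line {number}: [{tag}] outside an episode")
        elif tag == "END":
            in_episode = False
            episodes += 1
    if in_episode:
        raise ValueError("log ended without a closing [END]")
    return episodes
```

A typical use would be to capture the script's stdout to a file and pass the file's lines (with trailing newlines stripped) to `check_log`.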
### Inference environment variables
```bash
export OPENENV_BASE_URL="http://127.0.0.1:7860"
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="<your_token>"
```
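On the Python side, a script could read and validate these variables as sketched below. This is an illustration rather than the code in `inference.py`: the `load_settings` helper is hypothetical, and falling back to the local server address for `OPENENV_BASE_URL` is an assumption.

```python
import os
from typing import Mapping, Optional

REQUIRED = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect inference settings, failing fast if a mandatory variable is unset."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    settings = {name: env[name] for name in REQUIRED}
    # ASSUMPTION: default to the local server when OPENENV_BASE_URL is not set.
    settings["OPENENV_BASE_URL"] = env.get("OPENENV_BASE_URL", "http://127.0.0.1:7860")
    return settings
```

The resulting values would then typically be passed to the OpenAI client, for example `OpenAI(base_url=settings["API_BASE_URL"], api_key=settings["HF_TOKEN"])`, though the exact wiring in `inference.py` may differ.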
### Run baseline
```bash
python inference.py
```
Run a single task:
```bash
python inference.py --task task2
```
## Docker
Build:
```bash
docker build -t devopsenv:latest .
```
Run:
```bash
docker run --rm -p 7860:7860 devopsenv:latest
```
## Hugging Face Spaces Deployment
This repository is configured for Docker Spaces:
- README frontmatter sets `sdk: docker`
- container exposes and serves on port `7860`
- README frontmatter includes the `openenv` tag
After pushing to a Space, verify:
- `POST /reset` returns 200
- `openenv validate` passes
- `python inference.py` completes within runtime constraints
## Pre-Submission Checklist
- HF Space endpoint responds to `/reset`
- `openenv validate` passes
- `docker build` succeeds
- `inference.py` runs and logs strict `[START]/[STEP]/[END]` format
- all 3 tasks produce valid grader scores in `[0.0, 1.0]`