metadata

title: DevOpsEnv
emoji: 🛠️
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
tags:
  - openenv
  - devops
  - sre
  - troubleshooting
  - agent-evaluation
pinned: false

DevOpsEnv

DevOpsEnv is an OpenEnv-compliant environment for training and evaluating AI agents on realistic DevOps/SRE incident response workflows.

Motivation

This environment models the operational workflow that engineers follow during a production incident:

inspect system state
run diagnostic commands
apply targeted config/code fixes
verify impact
submit a final resolution

It is intentionally designed around common SRE failure classes (service outage, deployment misconfiguration, runtime memory issue) instead of toy interactions.

OpenEnv Compliance

The project implements the required OpenEnv interface:

typed Pydantic models for Observation, Action, Reward, StepResult, State
POST /reset returns the initial observation
POST /step returns observation, reward, done, info
GET /state returns current episode state
POST /grader returns deterministic final score and breakdown
openenv.yaml metadata/spec included
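The four endpoints above can be exercised with a small client. The following is a minimal sketch, not the project's own code: the `EnvClient` class, the `transport` hook, and the request bodies (for example `{"task_id": ...}` on reset and an empty body on `/grader`) are assumptions made for illustration. Only the HTTP methods and paths come from the list above.

```python
import json
from typing import Callable, Optional
from urllib import request

# A transport takes (method, url, payload) and returns the decoded JSON reply.
Transport = Callable[[str, str, Optional[dict]], dict]


def http_transport(method: str, url: str, payload: Optional[dict]) -> dict:
    """Send a JSON request with the standard library and decode the JSON reply."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    req = request.Request(
        url,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


class EnvClient:
    """Hypothetical thin wrapper around the documented endpoints."""

    def __init__(self, base_url: str, transport: Transport = http_transport):
        self.base_url = base_url.rstrip("/")
        self.transport = transport

    def reset(self, task_id: Optional[str] = None) -> dict:
        # ASSUMPTION: the task is selected with a "task_id" field in the body.
        body = {"task_id": task_id} if task_id else {}
        return self.transport("POST", f"{self.base_url}/reset", body)

    def step(self, action: dict) -> dict:
        # The reply is documented to carry observation, reward, done, info.
        return self.transport("POST", f"{self.base_url}/step", action)

    def state(self) -> dict:
        return self.transport("GET", f"{self.base_url}/state", None)

    def grade(self) -> dict:
        # ASSUMPTION: the grader accepts an empty JSON body.
        return self.transport("POST", f"{self.base_url}/grader", {})
```

The injectable `transport` keeps the client testable without a running server. Against a local instance, `EnvClient("http://127.0.0.1:7860").reset()` would issue the POST /reset call.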

Observation Space

Observation includes:

task metadata (task_id, task_description)
episode controls (episode_id, step_number, max_steps)
system_state:
- running processes
- service status
- open HTTP ports
- docker containers
- logs
- filesystem snapshot
- cpu and memory metrics
interaction history and current available_actions
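To make the shape of an observation concrete, here is a rough sketch using standard-library dataclasses so that it runs without extra dependencies (the project itself uses Pydantic models). The top-level names `task_id`, `task_description`, `episode_id`, `step_number`, `max_steps`, `system_state`, and `available_actions` come from the list above. Every field name inside `SystemState`, the `history` field, and the `steps_remaining` helper are guesses for illustration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SystemState:
    # ASSUMPTION: all field names and types here are illustrative,
    # chosen to mirror the bullet list above.
    processes: list[str] = field(default_factory=list)
    services: dict[str, str] = field(default_factory=dict)
    open_ports: list[int] = field(default_factory=list)
    containers: list[dict[str, Any]] = field(default_factory=list)
    logs: dict[str, str] = field(default_factory=dict)
    filesystem: dict[str, str] = field(default_factory=dict)
    cpu_percent: float = 0.0
    memory_percent: float = 0.0


@dataclass
class Observation:
    task_id: str
    task_description: str
    episode_id: str
    step_number: int
    max_steps: int
    system_state: SystemState
    # ASSUMPTION: "history" is a hypothetical name for the interaction history.
    history: list[dict[str, Any]] = field(default_factory=list)
    available_actions: list[str] = field(default_factory=list)

    @property
    def steps_remaining(self) -> int:
        """Hypothetical helper: steps left before the episode hits max_steps."""
        return max(self.max_steps - self.step_number, 0)
```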

Action Space

Action.action_type is one of:

bash_cmd: execute simulated shell command (command)
file_edit: overwrite known config/source file (file_path, file_content)
submit: terminate and grade current episode (summary optional)
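A small helper can catch malformed actions before they are sent. This sketch assumes a flat JSON layout in which the listed parameters sit next to `action_type`; the actual wire format may differ. The `make_action` function and the `ACTION_FIELDS` table are hypothetical names.

```python
# Required and optional fields per action type, taken from the list above.
ACTION_FIELDS = {
    "bash_cmd": {"required": ("command",), "optional": ()},
    "file_edit": {"required": ("file_path", "file_content"), "optional": ()},
    "submit": {"required": (), "optional": ("summary",)},
}


def make_action(action_type: str, **fields: str) -> dict:
    """Build an action payload, rejecting unknown types and missing fields."""
    if action_type not in ACTION_FIELDS:
        raise ValueError(f"unknown action_type: {action_type!r}")
    spec = ACTION_FIELDS[action_type]
    # An empty string is allowed (e.g. truncating a file); only absence is an error.
    missing = [name for name in spec["required"] if fields.get(name) is None]
    if missing:
        raise ValueError(f"{action_type} is missing fields: {missing}")
    allowed = set(spec["required"]) | set(spec["optional"])
    unexpected = sorted(set(fields) - allowed)
    if unexpected:
        raise ValueError(f"{action_type} does not accept fields: {unexpected}")
    # ASSUMPTION: fields are sent flat alongside action_type.
    return {"action_type": action_type, **fields}
```

For example, `make_action("bash_cmd", command="systemctl status nginx")` yields a payload that could be posted to /step, while `make_action("file_edit", file_path="nginx.conf")` raises because `file_content` is absent.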

Tasks and Difficulty

The environment ships with 3 graded tasks:

task1 (easy): recover crashed Nginx and verify HTTP health.
task2 (medium): correct docker-compose port mapping and redeploy.
task3 (hard): diagnose memory leak behavior, patch service code, restart cleanly.

Each task has deterministic grading with score in [0.0, 1.0] and criterion-level breakdown.
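One plausible way to get a deterministic score in [0.0, 1.0] with a criterion-level breakdown is a weighted checklist. The sketch below is an illustration under that assumption, not the environment's actual grader; the criterion names and weights in the usage note are hypothetical.

```python
def grade(criteria: dict[str, bool], weights: dict[str, float]) -> dict:
    """Combine pass/fail criteria into a score in [0.0, 1.0] plus a breakdown.

    ASSUMPTION: a weighted-checklist grader; the real grading rules are not
    described in this README.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Each passed criterion contributes its normalized weight; failures add 0.
    breakdown = {
        name: (weight / total if criteria.get(name, False) else 0.0)
        for name, weight in weights.items()
    }
    score = min(max(sum(breakdown.values()), 0.0), 1.0)
    return {"score": round(score, 4), "breakdown": breakdown}
```

With hypothetical criteria for task1, `grade({"service_running": True, "http_ok": False}, {"service_running": 0.5, "http_ok": 0.5})` gives a score of 0.5 and shows which criterion failed. The function has no randomness or hidden state, so the same inputs always yield the same result.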

Reward Design

Rewards are dense and shaped so that every step of a trajectory carries a signal:

per-step cost discourages long loops
action-type reward for useful commands/edits
progress bonuses for key milestones (validation, successful restart, verified outputs)
penalties for repeated identical actions and invalid edits
terminal bonus from grader score on episode completion
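The components above could combine into a per-step reward roughly as follows. All coefficients and the `step_reward` signature are invented for illustration; the environment's real values are not stated in this README.

```python
# ASSUMPTION: every constant below is a placeholder, not the project's value.
STEP_COST = -0.01
USEFUL_ACTION_REWARD = 0.05
MILESTONE_BONUS = 0.2
REPEAT_PENALTY = -0.05
INVALID_EDIT_PENALTY = -0.1


def step_reward(
    *,
    useful: bool,
    new_milestones: int,
    repeated: bool,
    invalid_edit: bool,
    done: bool,
    grader_score: float = 0.0,
) -> float:
    """Sum the shaped reward terms for a single step."""
    reward = STEP_COST  # every step costs something, discouraging long loops
    if useful:
        reward += USEFUL_ACTION_REWARD
    reward += MILESTONE_BONUS * new_milestones  # newly reached milestones only
    if repeated:
        reward += REPEAT_PENALTY
    if invalid_edit:
        reward += INVALID_EDIT_PENALTY
    if done:
        reward += grader_score  # terminal bonus from the final grade
    return reward
```

Under these placeholder values, a repeated, unhelpful command nets a negative reward, while a useful command that reaches one new milestone nets a clearly positive one.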

Local Setup

1) Install dependencies

pip install -r requirements.txt

2) Run API server

uvicorn app:app --host 0.0.0.0 --port 7860

3) Check health

curl http://127.0.0.1:7860/health

4) Validate OpenEnv package

openenv validate

Baseline Inference Script

The required baseline script, inference.py, lives at the project root.

It:

uses the OpenAI Python client
reads mandatory LLM variables:
- API_BASE_URL
- MODEL_NAME
- HF_TOKEN
runs all three tasks by default
emits strict structured stdout lines:
- [START] ...
- [STEP] ...
- [END] ...
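Because the stdout format is strict, it can help to check captured output before submitting. The sketch below only inspects the leading tag and leaves the rest of each line alone, since the payload format is not specified here. The ordering rule (one START, any number of STEP lines, one END per task run) and the rejection of untagged non-blank lines are assumptions, and `line_tag` and `check_sequence` are hypothetical helpers.

```python
import re
from typing import Iterable, Optional

# Matches a line that begins with one of the three documented tags.
TAG_RE = re.compile(r"^\[(START|STEP|END)\](?:\s|$)")


def line_tag(line: str) -> Optional[str]:
    """Return "START", "STEP", or "END" if the line begins with that tag."""
    match = TAG_RE.match(line)
    return match.group(1) if match else None


def check_sequence(lines: Iterable[str]) -> bool:
    """Check that lines form one or more START, STEP*, END runs."""
    running = False
    seen_any = False
    for line in lines:
        if not line.strip():
            continue  # blank lines are ignored
        tag = line_tag(line)
        if tag is None:
            return False  # ASSUMPTION: untagged output violates the format
        seen_any = True
        if tag == "START":
            if running:
                return False  # previous run never ended
            running = True
        elif not running:
            return False  # STEP or END outside a run
        elif tag == "END":
            running = False
    return seen_any and not running
```

One way to use it is to capture the script's stdout to a file and pass the lines to `check_sequence`.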

Inference environment variables

export OPENENV_BASE_URL="http://127.0.0.1:7860"
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="<your_token>"
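A script consuming these variables might fail fast when a mandatory one is missing. This is a sketch of that idea rather than the code in inference.py; `load_llm_config` is a hypothetical helper, and the fallback value for `OPENENV_BASE_URL` is an assumption based on the local server address used elsewhere in this README.

```python
import os
from typing import Mapping, Optional

REQUIRED_VARS = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")


def load_llm_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the inference settings, raising if a mandatory variable is unset."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    config = {name: env[name] for name in REQUIRED_VARS}
    # ASSUMPTION: default to the local server when OPENENV_BASE_URL is unset.
    config["OPENENV_BASE_URL"] = env.get("OPENENV_BASE_URL", "http://127.0.0.1:7860")
    return config
```

The resulting values map naturally onto the OpenAI Python client, whose constructor accepts `base_url` and `api_key`; passing `HF_TOKEN` as the API key is the usual pattern for the Hugging Face router, though the exact wiring in inference.py is not shown here.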

Run baseline

python inference.py

Run a single task:

python inference.py --task task2

Docker

Build:

docker build -t devopsenv:latest .

Run:

docker run --rm -p 7860:7860 devopsenv:latest

Hugging Face Spaces Deployment

This repository is configured for Docker Spaces:

README frontmatter sets sdk: docker
container exposes and serves on port 7860
README frontmatter tags include openenv

After pushing to a Space, verify:

POST /reset returns 200
openenv validate passes
python inference.py completes within runtime constraints

Pre-Submission Checklist

HF Space endpoint responds to /reset
openenv validate passes
docker build succeeds
inference.py runs and logs strict [START]/[STEP]/[END] format
all 3 tasks produce valid grader scores in [0.0, 1.0]