metadata
title: DevOpsEnv
emoji: 🛠️
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
tags:
  - openenv
  - devops
  - sre
  - troubleshooting
  - agent-evaluation
pinned: false

DevOpsEnv

DevOpsEnv is an OpenEnv-compliant environment for training and evaluating AI agents on realistic DevOps/SRE incident response workflows.

Motivation

This environment models the operational workflow that engineers follow when responding to production incidents:

  • inspect system state
  • run diagnostic commands
  • apply targeted config/code fixes
  • verify impact
  • submit a final resolution

It is deliberately built around common SRE failure classes (service outage, deployment misconfiguration, runtime memory leak) rather than toy interactions.

OpenEnv Compliance

The project implements the required OpenEnv interface:

  • typed Pydantic models for Observation, Action, Reward, StepResult, State
  • POST /reset returns the initial observation
  • POST /step returns observation, reward, done, info
  • GET /state returns current episode state
  • POST /grader returns deterministic final score and breakdown
  • openenv.yaml metadata/spec included
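
The endpoints above can be exercised with a small standard-library client. This is a minimal sketch, not the project's own client: the request body shapes (a `task_id` field for `/reset`, a flat action object for `/step`, an empty body for `/grader`) and the example shell command are assumptions, so check them against the Pydantic models in the repository before relying on them.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://127.0.0.1:7860"  # local server address used in Local Setup


def build_request(method: str, path: str, payload: Optional[dict] = None) -> urllib.request.Request:
    """Build (without sending) a request for one of the environment endpoints."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    headers = {} if data is None else {"Content-Type": "application/json"}
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)


def call(method: str, path: str, payload: Optional[dict] = None) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_request(method, path, payload), timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_minimal_episode() -> dict:
    """Reset, take one step, read the state, then ask for the final grade."""
    # ASSUMPTION: /reset selects the task through a "task_id" field.
    call("POST", "/reset", {"task_id": "task1"})
    # ASSUMPTION: /step takes the action fields as a flat JSON object,
    # and the simulated shell understands this command.
    call("POST", "/step", {"action_type": "bash_cmd", "command": "systemctl status nginx"})
    call("GET", "/state")
    # ASSUMPTION: /grader accepts an empty JSON body.
    return call("POST", "/grader", {})
```

Calling `run_minimal_episode()` requires the API server to be running; `build_request` can be inspected offline.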

Observation Space

Observation includes:

  • task metadata (task_id, task_description)
  • episode controls (episode_id, step_number, max_steps)
  • system_state:
    • running processes
    • service status
    • open HTTP ports
    • docker containers
    • logs
    • filesystem snapshot
    • cpu and memory metrics
  • interaction history and current available_actions
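
As a rough picture of this structure, the sketch below mirrors the observation with standard-library dataclasses. The project itself uses Pydantic models; dataclasses are used here only to keep the example dependency-free. The top-level names `task_id`, `task_description`, `episode_id`, `step_number`, `max_steps`, `system_state`, and `available_actions` come from the list above, while every field name inside `SystemState`, the `history` name, and all field types are assumptions.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class SystemState:
    # All field names and types in this class are hypothetical.
    processes: List[str] = field(default_factory=list)
    services: Dict[str, str] = field(default_factory=dict)    # service name -> status
    open_ports: List[int] = field(default_factory=list)
    containers: List[str] = field(default_factory=list)
    logs: List[str] = field(default_factory=list)
    filesystem: Dict[str, str] = field(default_factory=dict)  # path -> file content
    cpu_percent: float = 0.0
    memory_percent: float = 0.0


@dataclass
class Observation:
    task_id: str
    task_description: str
    episode_id: str
    step_number: int
    max_steps: int
    system_state: SystemState = field(default_factory=SystemState)
    history: List[Dict[str, Any]] = field(default_factory=list)  # hypothetical name
    available_actions: List[str] = field(default_factory=list)
```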

Action Space

Action.action_type is one of:

  • bash_cmd: execute a simulated shell command (command)
  • file_edit: overwrite a known config or source file (file_path, file_content)
  • submit: end the current episode and grade it (summary optional)
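
A client can check an action before sending it. The helper below is a sketch under the assumption that actions are serialized as flat JSON objects keyed by `action_type`; `make_action` is a hypothetical name, not a function from the repository.

```python
# Required fields per action type, taken from the list above.
REQUIRED_FIELDS = {
    "bash_cmd": ("command",),
    "file_edit": ("file_path", "file_content"),
    "submit": (),  # summary is optional
}


def make_action(action_type: str, **fields) -> dict:
    """Return an action payload, raising ValueError if it is malformed."""
    if action_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action_type: {action_type!r}")
    missing = [name for name in REQUIRED_FIELDS[action_type] if fields.get(name) is None]
    if missing:
        raise ValueError(f"{action_type} is missing fields: {', '.join(missing)}")
    return {"action_type": action_type, **fields}
```

For example, `make_action("submit", summary="restarted nginx")` is accepted, while `make_action("file_edit", file_path="docker-compose.yml")` raises because `file_content` is missing.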

Tasks and Difficulty

The environment ships with 3 graded tasks:

  1. task1 (easy): recover crashed Nginx and verify HTTP health.
  2. task2 (medium): correct docker-compose port mapping and redeploy.
  3. task3 (hard): diagnose memory leak behavior, patch service code, restart cleanly.

Each task is graded deterministically, producing a score in [0.0, 1.0] and a criterion-level breakdown.
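
One common way to get a deterministic score with a per-criterion breakdown is a weighted checklist. The sketch below illustrates that idea only; the actual criteria, weights, and response shape of the `/grader` endpoint are not documented here, and the criterion names in the usage note are invented.

```python
from typing import Dict, Tuple


def grade(criteria: Dict[str, Tuple[float, bool]]) -> dict:
    """Score a set of (weight, passed) criteria.

    Weights are normalized so a fully passing episode scores 1.0.
    The same inputs always produce the same output.
    """
    total = sum(weight for weight, _ in criteria.values())
    if total <= 0:
        raise ValueError("criteria weights must sum to a positive value")
    breakdown = {
        name: (weight / total if passed else 0.0)
        for name, (weight, passed) in criteria.items()
    }
    # Clamp to guard against floating-point drift outside [0.0, 1.0].
    score = min(1.0, max(0.0, sum(breakdown.values())))
    return {"score": round(score, 4), "breakdown": breakdown}
```

With two equally weighted hypothetical criteria, `grade({"service_running": (0.5, True), "health_check_ok": (0.5, False)})` gives a score of 0.5 and shows which criterion contributed.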

Reward Design

Rewards are dense and shaped so that every step in a trajectory carries a learning signal:

  • per-step cost discourages long loops
  • action-type reward for useful commands/edits
  • progress bonuses for key milestones (validation, successful restart, verified outputs)
  • penalties for repeated identical actions and invalid edits
  • terminal bonus from grader score on episode completion
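
The components above combine additively in the sketch below. Every constant is a placeholder chosen for illustration, and the additive form itself is an assumption; the environment's real magnitudes and rules may differ.

```python
from typing import Optional

# Hypothetical magnitudes, not the environment's actual values.
STEP_COST = -0.01
USEFUL_ACTION_REWARD = 0.05
MILESTONE_BONUS = 0.2
REPEAT_PENALTY = -0.05
INVALID_EDIT_PENALTY = -0.1


def step_reward(
    action: dict,
    previous_action: Optional[dict],
    useful: bool,
    milestones_reached: int,
    invalid_edit: bool,
    done: bool,
    grader_score: float = 0.0,
) -> float:
    """Sum the shaped reward components for a single step."""
    reward = STEP_COST                                # discourages long loops
    if useful:
        reward += USEFUL_ACTION_REWARD                # useful command or edit
    reward += MILESTONE_BONUS * milestones_reached    # newly reached milestones
    if previous_action is not None and action == previous_action:
        reward += REPEAT_PENALTY                      # identical repeated action
    if invalid_edit:
        reward += INVALID_EDIT_PENALTY
    if done:
        reward += grader_score                        # terminal bonus from the grader
    return reward
```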

Local Setup

1) Install dependencies

pip install -r requirements.txt

2) Run API server

uvicorn app:app --host 0.0.0.0 --port 7860

3) Check health

curl http://127.0.0.1:7860/health

4) Validate OpenEnv package

openenv validate

Baseline Inference Script

The required baseline script, inference.py, is at the project root.

It:

  • uses the OpenAI Python client
  • reads mandatory LLM variables:
    • API_BASE_URL
    • MODEL_NAME
    • HF_TOKEN
  • runs all three tasks by default
  • emits strictly structured stdout lines:
    • [START] ...
    • [STEP] ...
    • [END] ...
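
A quick way to sanity-check captured output is to verify the order of the line tags. The sketch below looks only at the `[START]`, `[STEP]`, and `[END]` prefixes and makes no claim about the fields that follow them. Two assumptions are built in: each task run forms one `[START]` ... `[END]` group, and lines without a tag are ignored.

```python
from typing import Iterable

TAGS = ("[START]", "[STEP]", "[END]")


def check_log_structure(lines: Iterable[str]) -> bool:
    """Return True if tagged lines form one or more START, STEP*, END groups."""
    in_episode = False
    seen_episode = False
    for line in lines:
        tag = next((t for t in TAGS if line.startswith(t)), None)
        if tag is None:
            continue  # ASSUMPTION: untagged lines are not part of the protocol
        if tag == "[START]":
            if in_episode:
                return False  # previous episode never ended
            in_episode = True
            seen_episode = True
        elif not in_episode:
            return False      # [STEP] or [END] outside an episode
        elif tag == "[END]":
            in_episode = False
    return seen_episode and not in_episode
```

It can be applied to saved output, for example `check_log_structure(open("run.log"))`, where `run.log` is a hypothetical file holding the script's stdout.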

Inference environment variables

export OPENENV_BASE_URL="http://127.0.0.1:7860"
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="<your_token>"
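
A script that depends on these variables can fail fast when one is missing. The loader below is a sketch, not the code in inference.py: `load_config` is a hypothetical helper, and falling back to the local server address for `OPENENV_BASE_URL` is an assumption.

```python
import os
from typing import Mapping, Optional

REQUIRED_VARS = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")
DEFAULT_ENV_URL = "http://127.0.0.1:7860"  # ASSUMPTION: local server as fallback


def load_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the inference settings, raising if a mandatory variable is unset."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    config = {name: env[name] for name in REQUIRED_VARS}
    config["OPENENV_BASE_URL"] = env.get("OPENENV_BASE_URL", DEFAULT_ENV_URL)
    return config

# With the OpenAI Python client, the values would typically be passed as
# OpenAI(base_url=config["API_BASE_URL"], api_key=config["HF_TOKEN"]).
# Using HF_TOKEN as the API key is an assumption about inference.py.
```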

Run baseline

python inference.py

Run a single task:

python inference.py --task task2

Docker

Build:

docker build -t devopsenv:latest .

Run:

docker run --rm -p 7860:7860 devopsenv:latest

Hugging Face Spaces Deployment

This repository is configured for Docker Spaces:

  • README frontmatter sets sdk: docker
  • container exposes and serves on port 7860
  • includes openenv tag

After pushing to a Space, verify:

  • POST /reset returns 200
  • openenv validate passes
  • python inference.py completes within runtime constraints

Pre-Submission Checklist

  • HF Space endpoint responds to /reset
  • openenv validate passes
  • docker build succeeds
  • inference.py runs and logs in the strict [START]/[STEP]/[END] format
  • all 3 tasks produce valid grader scores in [0.0, 1.0]