title: DevOps Pipeline Environment
emoji: 🚀
colorFrom: blue
colorTo: green
sdk: docker
pinned: true
app_port: 8000
base_path: /web
tags:
  - openenv

DevOps Pipeline Environment

Overview

This environment enables training AI agents for automated DevOps incident management: an AI SRE agent that can diagnose failures, manage deployments, and make judgment calls under production pressure. It simulates a realistic microservice architecture where services have interdependent health metrics, cascading failures propagate through dependency chains, and every action has trade-off consequences.

CI/CD deployment management is one of the most common engineering workflows at companies like Meta, Google, and Amazon. This environment captures the real decision-making complexity of production deployments: flaky tests vs. real bugs, config errors that only surface in staging, cascading failures that spiral through dependency chains, and production incidents where every minute of downtime costs revenue. The agent must investigate before acting, fix root causes before symptoms, and accept that every intervention has side effects.

This environment is useful for training RL agents to assist with deployment workflows, evaluating LLM reasoning under ambiguity and time pressure, and benchmarking investigation-before-action behavior. Novel for OpenEnv: No CI/CD pipeline environment exists on the Hub.

Why This Environment Matters

CI/CD pipeline management is one of the most common engineering workflows at Meta, Google, and Amazon, and SRE teams perform incident response daily. This environment fills a gap: no CI/CD pipeline environment exists on the OpenEnv Hub.

This environment trains agents to:

  • Investigate before acting: partial observability forces information gathering
  • Diagnose root causes: cascading failures require tracing through dependency chains
  • Make trade-off decisions: every action has side effects (deploy spikes CPU, rollback risks regression)
  • Act under time pressure: health degrades each step in incident tasks
  • Choose between valid strategies: judgment_call has 3 resolution paths with different risk/reward profiles

The large gap between LLM baseline (0.18–0.70) and optimal (0.63–0.98) demonstrates significant room for RL training improvement across the full skill spectrum.

Service Dependency Graph

database-primary (PostgreSQL, root, no dependencies)
├── auth-service (OAuth/JWT provider, depends on database-primary)
│   ├── api-gateway (router/load balancer, depends on database-primary + auth-service)
│   │   └── web-frontend (UI app, depends on api-gateway + auth-service)
│   └── web-frontend
└── cache-service (Redis cache, depends on database-primary)

Dependency chain: database-primary → auth-service → api-gateway → web-frontend and database-primary → cache-service. When an upstream service degrades, its dependents accumulate errors and latency each step.
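
The propagation rule can be sketched in a few lines of Python. This is a minimal illustration, not the environment's implementation: the `DEPENDS_ON` mapping mirrors the graph above, while `propagate`, `ERR_STEP`, and `LAT_STEP` are hypothetical names with assumed values chosen only for the example.

```python
# Dependency graph from this section: service -> list of upstream services.
DEPENDS_ON = {
    "database-primary": [],
    "auth-service": ["database-primary"],
    "cache-service": ["database-primary"],
    "api-gateway": ["database-primary", "auth-service"],
    "web-frontend": ["api-gateway", "auth-service"],
}

# Assumed per-step penalties; the simulator's real values are not documented here.
ERR_STEP = 0.5   # errors/sec added per unhealthy upstream
LAT_STEP = 50.0  # milliseconds added per unhealthy upstream


def propagate(metrics, unhealthy):
    """Apply one simulation step of cascade from the unhealthy upstream services.

    metrics maps service name -> (error_rate, latency_ms).
    """
    updated = {}
    for name, (err, lat) in metrics.items():
        bad = sum(1 for upstream in DEPENDS_ON[name] if upstream in unhealthy)
        updated[name] = (err + ERR_STEP * bad, lat + LAT_STEP * bad)
    return updated


metrics = {name: (0.0, 100.0) for name in DEPENDS_ON}
after = propagate(metrics, unhealthy={"auth-service"})
```

Calling `propagate` repeatedly while the upstream stays unhealthy makes the dependents' metrics grow step after step, which is why fixing the root cause first matters.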

Tasks

Task 1: Clean Deploy (Easy)

Deploy 2 services (api-gateway v2.3.1, web-frontend v1.9.0) with all tests passing. No complications: this task tests basic pipeline execution and deployment sequencing.

  • Max steps: 15
  • Services: database-primary, auth-service, api-gateway, web-frontend
  • Key challenge: Execute the staging → production deployment flow without breaking healthy services

Task 2: Broken Pipeline (Medium)

Diagnose test failures, fix a config error, run a migration, and deploy 3 services. Not all test failures are blocking: the agent must distinguish flaky tests from real bugs.

  • Max steps: 20
  • Services: database-primary, auth-service, api-gateway, web-frontend, cache-service
  • Key challenge: Wrong Redis host in cache-service config, pending migration blocks api-gateway deploy, 3 test failures (2 flaky, 1 deprecated)

Task 3: The Judgment Call (Hard)

Production incident: api-gateway is at 1500ms latency and 12 errors/sec. A partially tested hotfix (v2.3.2) is available. There are multiple valid resolution paths with different risk/reward trade-offs. Health degrades every step (time pressure).

  • Max steps: 12
  • Services: database-primary (under load), auth-service, api-gateway (degraded), web-frontend
  • Key challenge: Three valid paths: deploy hotfix + fix auth config (expert, highest score), rollback (safe but loses features), or hotfix only (partial fix). Each has cascading consequences on web-frontend.

Task 4: Cascading Failure (Medium-Hard)

Root cause analysis across a dependency chain. cache-service is down due to a config error (max_connections: 5), dragging api-gateway and web-frontend down via cascading failures. Fixing downstream services while root cause persists is futile.

  • Max steps: 15
  • Services: database-primary, auth-service, cache-service (root cause), api-gateway (degraded), web-frontend (degrading)
  • Key challenge: Identify root cause in cache-service config, fix it, then recover downstream services in dependency order

Task 5: Capacity Crisis (Medium-Hard)

database-primary is approaching capacity limits under a traffic surge. CPU is climbing and the connection pool is near saturation. The agent must act proactively before cascading failures begin: once the database goes down, recovery is extremely difficult.

  • Max steps: 15
  • Services: database-primary (stressed), auth-service, api-gateway, cache-service, web-frontend
  • Key challenge: Proactive intervention. Increase max_connections and shared_buffers before tipping points trigger cascading collapse

Task 6: Random Incident (Variable, Procedural Generation)

Procedurally generated incident from a seed. The failing service (api-gateway, cache-service, auth-service, or web-frontend), failure type (config_error, degraded_performance, capacity_limit, memory_leak, or certificate_expiry), and severity (moderate or severe) are all randomized. There is a 30% chance of a compound incident (two services failing simultaneously). Different seeds produce different scenarios, giving broad variation for curriculum learning.

  • Max steps: 15
  • Services: All 5 (one randomly failing)
  • Key challenge: Read the task description to identify the failing service and failure type, then investigate, diagnose, and fix, with no prior knowledge of what's broken

Procedural Generation

The random_incident task generates unique scenarios from a seed, enabling:

  • Curriculum learning: Start with easy seeds, progressively increase difficulty
  • Generalization testing: Verify agents handle novel failure combinations
  • Infinite training data: Every seed produces a different incident

Failure space: 5 failure types × 4 services × 2 severities = 40 primary configurations. With 30% compound incidents and randomized initial conditions, this yields hundreds of unique scenarios.
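
The count above can be checked by enumerating the primary configurations. The sketch below is illustrative only: the lists come from the task description, but `sample_incident` is a hypothetical sampler and does not reproduce the environment's actual seed-to-scenario mapping.

```python
import itertools
import random

FAILURE_TYPES = ["config_error", "degraded_performance", "capacity_limit",
                 "memory_leak", "certificate_expiry"]
SERVICES = ["api-gateway", "cache-service", "auth-service", "web-frontend"]
SEVERITIES = ["moderate", "severe"]

# Every primary configuration: 5 failure types x 4 services x 2 severities.
PRIMARY = list(itertools.product(FAILURE_TYPES, SERVICES, SEVERITIES))


def sample_incident(seed):
    """Hypothetical sampler: one primary incident, plus a second failing
    service 30% of the time (the documented compound-incident rate)."""
    rng = random.Random(seed)
    failure, service, severity = rng.choice(PRIMARY)
    incident = {"service": service, "failure": failure,
                "severity": severity, "secondary": None}
    if rng.random() < 0.30:
        others = [s for s in SERVICES if s != service]
        incident["secondary"] = rng.choice(others)
    return incident
```

Because the sampler builds its own `random.Random(seed)`, calling it twice with the same seed returns the same incident.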

Action Space

9 typed action types via PipelineAction:

| Action | Description | Required Fields |
|---|---|---|
| view_pipeline | View overall pipeline status and service summary | none |
| view_logs | View recent logs for a service (reveals CPU/memory) | service_name |
| view_config | View current config key-value pairs | service_name |
| edit_config | Modify config key-value pairs (causes restart latency spike) | service_name, config_edits |
| run_migration | Execute a pending database migration | migration_name |
| deploy | Deploy service version to staging, then promote to production | service_name, target_version |
| rollback | Rollback service to previous version (25% regression risk) | service_name |
| approve | Approve current state and end episode | reason |
| abort | Abort deployment and end episode | reason |
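
As a rough illustration of the required-field rules, the sketch below checks an action before it is sent. `ActionSketch` is a hypothetical stand-in, not the real `PipelineAction` class; only the action names and field names are taken from the action list above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Required fields per action type, as listed in the Action Space section.
REQUIRED_FIELDS = {
    "view_pipeline": [],
    "view_logs": ["service_name"],
    "view_config": ["service_name"],
    "edit_config": ["service_name", "config_edits"],
    "run_migration": ["migration_name"],
    "deploy": ["service_name", "target_version"],
    "rollback": ["service_name"],
    "approve": ["reason"],
    "abort": ["reason"],
}


@dataclass
class ActionSketch:
    """Illustrative stand-in for PipelineAction (assumed shape)."""
    action_type: str
    service_name: Optional[str] = None
    target_version: Optional[str] = None
    migration_name: Optional[str] = None
    reason: Optional[str] = None
    config_edits: Dict[str, str] = field(default_factory=dict)

    def missing_fields(self):
        """Return the required fields that are unset or empty."""
        if self.action_type not in REQUIRED_FIELDS:
            raise ValueError(f"unknown action_type: {self.action_type}")
        return [name for name in REQUIRED_FIELDS[self.action_type]
                if not getattr(self, name)]
```

For example, `ActionSketch("deploy", service_name="api-gateway").missing_fields()` reports that `target_version` still needs to be set.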

Observation Space

PipelineObservation provides the agent's view of the system:

  • summary: One-line status that highlights degraded/down services at a glance (e.g., "WARNING: api-gateway degraded (lat=1500ms, err=12.0/s)" or "All services nominal.")
  • services: List of ServiceStatus entries with name, health, version, error_rate, latency, active_connections, last_deploy_timestamp. Partial observability: CPU and memory are hidden (show 0.0) until the agent runs view_logs for that service.
  • task_description and goal: Natural language context for the current task
  • available_actions: Context-sensitive list of valid action types
  • last_action_result / last_action_error: Feedback from the previous step
  • pipeline: Current stage, commit SHA, test pass/fail counts, build logs snippet
  • migrations: Pending and applied migrations
  • active_alerts: Critical/warning/info alerts with timestamps
  • config_snapshot: Config key-value pairs (populated after view_config or edit_config)
  • step_number / max_steps: Current progress
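
The partial-observability rule for `services` can be pictured as a masking step. This is a sketch under assumptions: `mask_service` is a hypothetical helper, statuses are plain dicts rather than `ServiceStatus` objects, and the CPU/memory numbers are made up for the example.

```python
def mask_service(status, investigated):
    """Return the agent-visible copy of a service status.

    CPU and memory read 0.0 until the service name is in `investigated`,
    i.e. until view_logs has been run for that service.
    """
    visible = dict(status)
    if status["name"] not in investigated:
        visible["cpu"] = 0.0
        visible["memory"] = 0.0
    return visible


# Illustrative values only; cpu and memory here are invented.
full = {"name": "api-gateway", "health": "degraded",
        "error_rate": 12.0, "latency": 1500.0, "cpu": 91.0, "memory": 64.0}

hidden = mask_service(full, investigated=set())
revealed = mask_service(full, investigated={"api-gateway"})
```

An agent should therefore treat a CPU or memory reading of 0.0 on an uninvestigated service as unknown, not as idle.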

Reward Design

The reward is dense and per-step, creating a learnable gradient for RL training. Investigation rewards use diminishing-returns exploration: the first investigation of an unhealthy service gives +0.04, with decay as more services are investigated. Health improvements give proportional reward via the system health delta (+0.005 per 1% improvement). Sub-goal milestones reward intermediate progress: config fixes (+0.08), migrations (+0.06), and alert resolution (+0.03). Breaking healthy services is heavily penalized (-0.30). All grading is outcome-based, with no procedure-based criteria. Rewards are task-adaptive: harder tasks with time pressure get steeper gradients (1.0×–1.5× urgency scaling), creating a curriculum-aware reward landscape. Rewards are bounded to [-0.35, +0.30] per step to prevent training instability.

| Signal | Reward | Condition |
|---|---|---|
| Service deployed to production | +0.15 | Service reaches prod successfully |
| Service verified in staging | +0.05 | Staging health check passes |
| Config error fixed | +0.08 | Service health improved after config change |
| Migration completed | +0.06 | Pending migration count decreased |
| Alert resolved | +0.03 | Alert count decreased |
| Investigation (degraded svc) | +0.04× | First-time view on unhealthy service (with decay) |
| Investigation (healthy svc) | +0.01× | First-time view on healthy service (with decay) |
| Health improvement | +0.005 per 1% | System health delta |
| Broke healthy service | -0.30 | Service went from healthy to degraded/down |
| Repeated investigation | -0.01 / -0.03 | Same view on same target (-0.03 if consecutive) |
| Repeated exact action | -0.02 | Same action_type + service as last step |

Reward Shaping Theory

The health delta component (+0.005 per 1% system health improvement) approximates potential-based reward shaping (Ng et al., 1999), where Φ(s) = system_health/100. This preserves the optimal policy while providing a continuous gradient signal.

Investigation bonuses use count-based exploration decay: reward = base × 1/(1 + 0.3n), consistent with Bellemare et al. (2016). This incentivizes initial exploration while preventing reward hacking through repeated view actions.

Task urgency scaling (1.0×–1.5×) implements curriculum-aware reward calibration: harder tasks receive steeper gradients to maintain learning signal despite longer optimal trajectories.
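
The three pieces described in this section can be combined in a small sketch. The constants (0.3 decay, 0.005 per 1%, the [-0.35, +0.30] bounds) come from this README; the function names, the meaning of `n` as a count of prior investigations, and applying urgency before clipping are assumptions made for illustration.

```python
def investigation_bonus(base, n):
    """Count-based decay: base * 1 / (1 + 0.3 * n).

    n is assumed to be the number of earlier investigations.
    """
    return base / (1.0 + 0.3 * n)


def health_delta_reward(health_before, health_after):
    """+0.005 per 1% change in system health (both on a 0-100 scale)."""
    return 0.005 * (health_after - health_before)


def step_reward(components, urgency=1.0):
    """Sum the components, scale by task urgency (1.0 to 1.5), then clip
    to the documented per-step bounds. The ordering is an assumption."""
    total = urgency * sum(components)
    return max(-0.35, min(0.30, total))
```

With these definitions the first look at a degraded service is worth 0.04, the next about 0.031, and a step that stacks several milestones is still capped at +0.30.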

Exploit Resistance

| Attack Vector | Defense |
|---|---|
| Repeated view_pipeline spam | Diminishing returns decay: reward = base/(1+0.3n), consecutive repeat -0.03 |
| Break-then-fix exploit | -0.30 penalty exceeds +0.15 deploy + health recovery gains |
| Step-stalling for investigation bonuses | Capped by diminishing returns + max_steps + efficiency grader component |
| Config-grep pattern matching | Hard tasks removed prescriptive log messages; agent must diagnose from symptoms |
| Ignoring secondary incidents | Compound incident grader awards 0.10 bonus for fixing secondary service |

Baseline Scores

Model: Qwen/Qwen2.5-72B-Instruct via HuggingFace Router

| Task | Difficulty | LLM Baseline | Optimal | Gap |
|---|---|---|---|---|
| clean_deploy | Easy | 0.700 | 0.947 | +0.247 |
| broken_pipeline | Medium | 0.482 | 0.890 | +0.408 |
| judgment_call | Hard | 0.184 | 0.935 | +0.751 |
| cascading_failure | Med-Hard | 0.280 | 0.883 | +0.603 |
| capacity_crisis | Med-Hard | 0.250 | 0.634 | +0.384 |
| random_incident (seed 6006) | Variable | 0.350 | 0.982 | +0.632 |

LLM baselines were re-calibrated after environment tuning (v2). Optimal scores come from scripted expert trajectories. The large gap between LLM baseline and optimal demonstrates significant room for RL training improvement: the environment produces meaningful reward signal across the full skill spectrum. The random_incident task generates unique scenarios from each seed, enabling curriculum learning.

Seed Curriculum for RL Training

The random_incident task generates unique scenarios from each seed via DEVOPS_SEED env var. With 5 failure types, compound incidents, and randomized initial conditions, the configuration space produces hundreds of distinct scenarios.

Recommended curriculum:

  • Seeds 1–20: Single-service failures, moderate severity (warm-up)
  • Seeds 21–60: Mix of single and compound incidents (core training)
  • Seeds 61–100: Severe failures with compound incidents (advanced)

Set DEVOPS_SEED before each reset; the environment reads the variable at the start of every episode.
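
A training loop might pick seeds by curriculum stage along these lines. The seed ranges and the DEVOPS_SEED variable come from this section; `CURRICULUM`, `stage_for_seed`, and `set_seed_for_next_episode` are hypothetical helpers, and setting the variable this way only has an effect if the environment runs in (or inherits its environment from) the same process.

```python
import os

# Stage name -> seed range, following the recommended curriculum.
CURRICULUM = [
    ("warm-up", range(1, 21)),     # seeds 1-20
    ("core", range(21, 61)),       # seeds 21-60
    ("advanced", range(61, 101)),  # seeds 61-100
]


def stage_for_seed(seed):
    """Return the curriculum stage that a seed belongs to."""
    for name, seeds in CURRICULUM:
        if seed in seeds:
            return name
    raise ValueError(f"seed {seed} is outside the recommended curriculum")


def set_seed_for_next_episode(seed):
    """Export DEVOPS_SEED so the next reset can read it."""
    os.environ["DEVOPS_SEED"] = str(seed)
```

When the server runs in a separate process or container, the variable has to be set in that process's environment instead.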

Difficulty Analysis

| Task | Decision Depth | Info Asymmetry | Time Pressure | Optimal Steps |
|---|---|---|---|---|
| clean_deploy | Low (1) | None | None | 4–6 |
| broken_pipeline | Medium (3) | Medium | Low | 8–12 |
| judgment_call | High (5) | High | High | 5–8 |
| cascading_failure | High (4) | High | Medium | 6–10 |
| capacity_crisis | Medium (3) | Medium | Medium | 6–10 |
| random_incident | Variable | Variable | Variable | 5–12 |

Example Episode Trajectory

Task: broken_pipeline (diagnose and fix a broken deployment pipeline).

Step 1: view_logs("cache-service")       → reward +0.02  (investigation bonus, reveals Redis config error)
Step 2: edit_config("cache-service",
          redis.host → "redis-prod...")   → reward +0.10  (health improvement from fixing config)
Step 3: deploy("api-gateway", "v2.3.1")  → reward +0.05  (staging verified)
Step 4: deploy("api-gateway", "v2.3.1")  → reward +0.15  (promoted to production)
Step 5: approve("All services healthy")  → reward +0.03  (episode complete)
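
For reference, the per-step rewards in this trajectory can be totalled as follows. This is plain arithmetic over the numbers shown above, using the recommended discount factor of 0.99 from this README; it is the sum of step rewards, not the grader's episode score, and the step labels are shorthand.

```python
GAMMA = 0.99  # recommended discount factor

# (step label, per-step reward) taken from the example trajectory.
trajectory = [
    ("view_logs", 0.02),
    ("edit_config", 0.10),
    ("deploy (staging)", 0.05),
    ("deploy (production)", 0.15),
    ("approve", 0.03),
]

# Undiscounted return, rounded to hide floating-point noise.
total = round(sum(reward for _, reward in trajectory), 2)

# Discounted return: sum of GAMMA**t * r_t with t starting at 0.
discounted = sum(GAMMA ** t * reward for t, (_, reward) in enumerate(trajectory))

print(total)
```

Every individual reward here stays inside the documented [-0.35, +0.30] per-step bounds.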

Environment Features

  • 6 tasks (5 hand-crafted + 1 procedurally generated) for curriculum learning
  • 5 microservices with realistic dependency graph
  • Stochastic simulation with seeded RNG for full reproducibility
  • Realistic production logs (Java/Node stack traces, timestamps, red herrings)
  • Partial observability (CPU/memory hidden until investigated via view_logs)
  • Cascading failures propagate through dependency chain each step
  • Cross-metric compounding (error → CPU → latency spirals, and reverse recovery)
  • Non-linear tipping points (CPU cliff at 85%, latency cliff at 2000ms)
  • Trade-off effects on every action (deploy → CPU spike, rollback → 25% regression risk, config edit → restart latency)
  • Time pressure on incident tasks (health degrades each step in judgment_call)
  • Multi-path task design (judgment_call has 3 valid resolution paths with different scores)
  • Dense per-step reward with anti-reward-hacking safeguards (bounded, no procedure bonuses)
  • Observation summary field for quick triage

Formal MDP Description

| Property | Description |
|---|---|
| State space S | 5 services × (health ∈ {healthy, degraded, down}, cpu ∈ [0,100], memory ∈ [0,100], error_rate ∈ [0,50], latency ∈ [0,5000], config: Dict) + pipeline status + migration status + alerts. Partially observable: CPU and memory are hidden until the service is investigated with view_logs. |
| Action space A | 9 discrete action types × parameterized service targets. ~45 effective actions. |
| Transition T | Deterministic core with stochastic elements (8% transient staging failure, 25% rollback regression, deploy quality variance). Seeded RNG ensures reproducibility. |
| Reward R | Dense, bounded [-0.35, +0.30] per step. Potential-based health delta + milestone rewards + exploration bonuses with diminishing returns. Task-adaptive urgency scaling (1.0×–1.5×). |
| Episode length | 12–20 steps depending on task. Terminates on approve/abort/max_steps/catastrophic failure (health < 20%). |
| Discount factor | Recommended γ = 0.99 (short episodes, dense rewards). |

Stochastic Elements (All Seeded)

All randomness uses random.Random(seed): the same seed and the same actions produce identical outcomes.

| Element | Probability | Location |
|---|---|---|
| Transient staging failure | 8% per deploy | deploy_to_staging() |
| Rollback regression | 25% per rollback | rollback() |
| Deploy quality | 70% clean / 20% minor / 10% unstable | deploy_to_production() |
| Compound incident | 30% in random_incident | RandomIncidentScenario.setup() |
| Initial health variance | ±10 CPU, ±15 latency | RandomIncidentScenario.setup() |

Determinism guarantee: reset() re-seeds the RNG from fixed task seeds. Two resets with the same task produce identical initial states.
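
The effect of seeding can be demonstrated with the standard library alone. The probabilities below are the ones documented in this section; `deploy_quality` and `rollout` are hypothetical helpers, not the environment's `deploy_to_staging()` or `deploy_to_production()`.

```python
import random

QUALITIES = ["clean", "minor", "unstable"]


def deploy_quality(rng):
    """Sample a deploy quality with the documented 70/20/10 split."""
    return rng.choices(QUALITIES, weights=[70, 20, 10], k=1)[0]


def rollout(seed, n_deploys=5):
    """Simulate n deploys from a fresh RNG seeded with `seed`."""
    rng = random.Random(seed)
    outcomes = []
    for _ in range(n_deploys):
        staging_failed = rng.random() < 0.08  # 8% transient staging failure
        outcomes.append((staging_failed, deploy_quality(rng)))
    return outcomes
```

Two calls such as `rollout(123)` return identical lists because each call builds its own `random.Random` instance and never touches the global generator.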

Setup

# Install dependencies
uv sync

# Run locally (without Docker)
uv run python -m uvicorn server.app:app --host 0.0.0.0 --port 8000

# Build and run with Docker
docker build -t devops-pipeline-env .
docker run -p 8000:8000 devops-pipeline-env

# Test reset endpoint
curl -X POST -H "Content-Type: application/json" -d '{}' http://localhost:8000/reset

# Open web UI
# http://localhost:8000/web

# Run inference
export HF_TOKEN=your_token_here
uv run inference.py

# Validate and deploy
openenv validate
openenv push --repo-id your-username/devops-pipeline-env

API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| /reset | POST | Reset environment (new episode) |
| /step | POST | Execute an action, returns observation |
| /state | GET | Get current environment state |
| /tasks | GET | List available tasks and action schema |
| /health | GET | Health check |
| /baseline | POST | Pre-recorded LLM baseline scores |
| /grader | POST | Score the current active episode |
| /ws | WS | WebSocket for persistent sessions |
| /web | GET | Gradio web interface |
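
The reset call from the Setup section can also be issued from Python with the standard library. This is a minimal sketch: it mirrors the `curl` example (empty JSON body, port 8000), while `build_post` and `send` are hypothetical helper names. The `/step` payload schema is not documented here, so it is left out; `/tasks` is described as listing the action schema.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # matches the local run command


def build_post(path, payload):
    """Build (but do not send) a JSON POST request for an endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send a request and decode the JSON response.

    Requires the server to be running; not called at import time.
    """
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


# Same request as the curl example: POST /reset with an empty JSON object.
reset_request = build_post("/reset", {})
```

With the server running, `send(reset_request)` would return whatever JSON the `/reset` endpoint produces.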