| title: DevOps Pipeline Environment | |
| emoji: π | |
| colorFrom: blue | |
| colorTo: green | |
| sdk: docker | |
| pinned: true | |
| app_port: 8000 | |
| base_path: /web | |
| tags: | |
| - openenv | |
| # DevOps Pipeline Environment | |
| ## Overview | |
| This environment enables training AI agents for automated DevOps incident management: an AI SRE agent that can diagnose failures, manage deployments, and make judgment calls under production pressure. It simulates a realistic microservice architecture where services have interdependent health metrics, cascading failures propagate through dependency chains, and every action has trade-off consequences. | |
| CI/CD deployment management is the most common engineering workflow at companies like Meta, Google, and Amazon. This environment captures the real decision-making complexity of production deployments: flaky tests vs real bugs, config errors that only surface in staging, cascading failures that spiral through dependency chains, and production incidents where every minute of downtime costs revenue. The agent must investigate before acting, fix root causes before symptoms, and accept that every intervention has side effects. | |
| This environment is useful for training RL agents to assist with deployment workflows, evaluating LLM reasoning under ambiguity and time pressure, and benchmarking investigation-before-action behavior. **Novel for OpenEnv**: No CI/CD pipeline environment exists on the Hub. | |
| ## Why This Environment Matters | |
| CI/CD pipeline management is the most common engineering workflow at Meta, Google, and Amazon; every SRE team performs incident response daily. This environment fills a gap: no CI/CD pipeline environment exists on the OpenEnv Hub. | |
| This environment trains agents to: | |
| - **Investigate before acting**: partial observability forces information gathering | |
| - **Diagnose root causes**: cascading failures require tracing through dependency chains | |
| - **Make trade-off decisions**: every action has side effects (deploy spikes CPU, rollback risks regression) | |
| - **Act under time pressure**: health degrades each step in incident tasks | |
| - **Choose between valid strategies**: judgment_call has 3 resolution paths with different risk/reward profiles | |
| The large gap between the LLM baseline (0.18 to 0.70) and optimal (0.63 to 0.98) scores demonstrates significant room for RL training improvement across the full skill spectrum. | |
| ## Service Dependency Graph | |
| ``` | |
| database-primary (PostgreSQL, root, no dependencies) | |
| ├── auth-service (OAuth/JWT provider, depends on database-primary) | |
| │   ├── api-gateway (router/load balancer, depends on database-primary + auth-service) | |
| │   │   └── web-frontend (UI app, depends on api-gateway + auth-service) | |
| │   └── web-frontend | |
| └── cache-service (Redis cache, depends on database-primary) | |
| ``` | |
| Dependency chain: `database-primary → auth-service → api-gateway → web-frontend` and `database-primary → cache-service`. When an upstream service degrades, its dependents accumulate errors and latency each step. | |
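The propagation order described above can be sketched as a small graph walk. This is a minimal illustration, not the environment's code: `DEPENDS_ON` and `downstream_of` are hypothetical names, and the edges are transcribed from the graph exactly as drawn.

```python
from collections import deque

# Direct dependencies, transcribed from the dependency graph above.
DEPENDS_ON = {
    "database-primary": [],
    "auth-service": ["database-primary"],
    "api-gateway": ["database-primary", "auth-service"],
    "web-frontend": ["api-gateway", "auth-service"],
    "cache-service": ["database-primary"],
}


def downstream_of(service: str) -> list[str]:
    """Return every service that directly or transitively depends on `service`."""
    # Invert the edges so we can walk from a service to its dependents.
    dependents = {name: [] for name in DEPENDS_ON}
    for name, deps in DEPENDS_ON.items():
        for dep in deps:
            dependents[dep].append(name)

    # Breadth-first walk; `seen` keeps discovery order.
    seen, queue = [], deque([service])
    while queue:
        current = queue.popleft()
        for child in dependents[current]:
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen
```

Calling `downstream_of("auth-service")` lists the services that would accumulate errors if `auth-service` degraded, which is also a reasonable order in which to check them after fixing the root cause.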
| ## Tasks | |
| ### Task 1: Clean Deploy (Easy) | |
| Deploy 2 services (api-gateway v2.3.1, web-frontend v1.9.0) with all tests passing. No complications: this task tests basic pipeline execution and deployment sequencing. | |
| - **Max steps**: 15 | |
| - **Services**: database-primary, auth-service, api-gateway, web-frontend | |
| - **Key challenge**: Execute staging → production deployment flow without breaking healthy services | |
| ### Task 2: Broken Pipeline (Medium) | |
| Diagnose test failures, fix a config error, run a migration, and deploy 3 services. Not all test failures are blocking, so the agent must distinguish flaky tests from real bugs. | |
| - **Max steps**: 20 | |
| - **Services**: database-primary, auth-service, api-gateway, web-frontend, cache-service | |
| - **Key challenge**: Wrong Redis host in cache-service config, pending migration blocks api-gateway deploy, 3 test failures (2 flaky, 1 deprecated) | |
| ### Task 3: The Judgment Call (Hard) | |
| Production incident: api-gateway is at 1500ms latency and 12 errors/sec. A partially-tested hotfix (v2.3.2) is available. Multiple valid resolution paths exist with different risk/reward tradeoffs. Health degrades every step (time pressure). | |
| - **Max steps**: 12 | |
| - **Services**: database-primary (under load), auth-service, api-gateway (degraded), web-frontend | |
| - **Key challenge**: Three valid paths: deploy hotfix + fix auth config (expert, highest score), rollback (safe but loses features), hotfix only (partial fix). Each has cascading consequences on web-frontend. | |
| ### Task 4: Cascading Failure (Medium-Hard) | |
| Root cause analysis across a dependency chain. cache-service is down due to a config error (max_connections: 5), dragging api-gateway and web-frontend down via cascading failures. Fixing downstream services while root cause persists is futile. | |
| - **Max steps**: 15 | |
| - **Services**: database-primary, auth-service, cache-service (root cause), api-gateway (degraded), web-frontend (degrading) | |
| - **Key challenge**: Identify root cause in cache-service config, fix it, then recover downstream services in dependency order | |
| ### Task 5: Capacity Crisis (Medium-Hard) | |
| database-primary is approaching capacity limits under a traffic surge. CPU is climbing and the connection pool is near saturation. The agent must act proactively before cascading failures begin; once the database goes down, recovery is extremely difficult. | |
| - **Max steps**: 15 | |
| - **Services**: database-primary (stressed), auth-service, api-gateway, cache-service, web-frontend | |
| - **Key challenge**: Proactive intervention, i.e. increase max_connections and shared_buffers before tipping points trigger cascading collapse | |
| ### Task 6: Random Incident (Variable, Procedural Generation) | |
| Procedurally generated incident from a seed. The failing service (api-gateway, cache-service, auth-service, or web-frontend), failure type (config_error, degraded_performance, capacity_limit, memory_leak, or certificate_expiry), and severity (moderate or severe) are all randomized. There is a 30% chance of a compound incident (two services failing simultaneously). Different seeds produce different scenarios, giving broad variation for curriculum learning. | |
| - **Max steps**: 15 | |
| - **Services**: All 5 (one randomly failing) | |
| - **Key challenge**: Read the task description to identify the failing service and failure type, then investigate, diagnose, and fix, all with no prior knowledge of what's broken | |
| ## Procedural Generation | |
| The `random_incident` task generates unique scenarios from a seed, enabling: | |
| - **Curriculum learning**: Start with easy seeds, progressively increase difficulty | |
| - **Generalization testing**: Verify agents handle novel failure combinations | |
| - **Infinite training data**: Every seed produces a different incident | |
| Failure space: 5 failure types × 4 services × 2 severities = 40 primary configurations. With 30% compound incidents and randomized initial conditions, this yields hundreds of unique scenarios. | |
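The arithmetic above can be checked by enumerating the primary configurations. The sketch below also shows one way a seed could map to an incident; `sample_incident` and its dictionary layout are hypothetical and do not reflect how `RandomIncidentScenario` actually draws its values.

```python
import itertools
import random

FAILURE_TYPES = [
    "config_error",
    "degraded_performance",
    "capacity_limit",
    "memory_leak",
    "certificate_expiry",
]
SERVICES = ["api-gateway", "cache-service", "auth-service", "web-frontend"]
SEVERITIES = ["moderate", "severe"]

# 5 failure types x 4 services x 2 severities.
PRIMARY_CONFIGS = list(itertools.product(FAILURE_TYPES, SERVICES, SEVERITIES))


def sample_incident(seed: int) -> dict:
    """Hypothetical seed-to-incident mapping (illustration only)."""
    rng = random.Random(seed)
    failure, service, severity = rng.choice(PRIMARY_CONFIGS)
    incident = {
        "service": service,
        "failure": failure,
        "severity": severity,
        "secondary": None,
    }
    # 30% of incidents are compound: a second, different service also fails.
    if rng.random() < 0.30:
        others = [s for s in SERVICES if s != service]
        incident["secondary"] = rng.choice(others)
    return incident
```

Because the generator is rebuilt from the seed on every call, the same seed always returns the same incident, which is the property curriculum schedules rely on.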
| ## Action Space | |
| `PipelineAction` defines 9 typed actions: | |
| | Action | Description | Required Fields | | |
| |--------|-------------|-----------------| | |
| | `view_pipeline` | View overall pipeline status and service summary | (none) | |
| | `view_logs` | View recent logs for a service (reveals CPU/memory) | `service_name` | | |
| | `view_config` | View current config key-value pairs | `service_name` | | |
| | `edit_config` | Modify config key-value pairs (causes restart latency spike) | `service_name`, `config_edits` | | |
| | `run_migration` | Execute a pending database migration | `migration_name` | | |
| | `deploy` | Deploy service version to staging, then promote to production | `service_name`, `target_version` | | |
| | `rollback` | Rollback service to previous version (25% regression risk) | `service_name` | | |
| | `approve` | Approve current state and end episode | `reason` | | |
| | `abort` | Abort deployment and end episode | `reason` | | |
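The required-field rules in the table can be expressed as a small validation helper. This is a sketch under stated assumptions: `ActionSketch` is a hypothetical stand-in, not the real `PipelineAction` class, and only the action names and field names come from the table above.

```python
from dataclasses import dataclass
from typing import Optional

# Required fields per action type, copied from the action table.
REQUIRED_FIELDS = {
    "view_pipeline": (),
    "view_logs": ("service_name",),
    "view_config": ("service_name",),
    "edit_config": ("service_name", "config_edits"),
    "run_migration": ("migration_name",),
    "deploy": ("service_name", "target_version"),
    "rollback": ("service_name",),
    "approve": ("reason",),
    "abort": ("reason",),
}


@dataclass
class ActionSketch:
    """Hypothetical stand-in for PipelineAction, for illustration only."""

    action_type: str
    service_name: Optional[str] = None
    config_edits: Optional[dict] = None
    migration_name: Optional[str] = None
    target_version: Optional[str] = None
    reason: Optional[str] = None

    def missing_fields(self) -> list[str]:
        """Return the required fields that are still unset for this action."""
        if self.action_type not in REQUIRED_FIELDS:
            raise ValueError(f"unknown action_type: {self.action_type}")
        required = REQUIRED_FIELDS[self.action_type]
        return [name for name in required if getattr(self, name) is None]
```

Checking `missing_fields()` before sending an action lets an agent wrapper catch malformed actions locally instead of spending an environment step on them.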
| ## Observation Space | |
| `PipelineObservation` provides the agent's view of the system: | |
| - **summary**: One-line status that highlights degraded/down services at a glance (e.g., `"WARNING: api-gateway degraded (lat=1500ms, err=12.0/s)"` or `"All services nominal."`) | |
| - **services**: List of `ServiceStatus` (name, health, version, error_rate, latency, active_connections, last_deploy_timestamp). **Partial observability**: CPU and memory are hidden (show 0.0) until the agent runs `view_logs` for that service. | |
| - **task_description** and **goal**: Natural language context for the current task | |
| - **available_actions**: Context-sensitive list of valid action types | |
| - **last_action_result / last_action_error**: Feedback from the previous step | |
| - **pipeline**: Current stage, commit SHA, test pass/fail counts, build logs snippet | |
| - **migrations**: Pending and applied migrations | |
| - **active_alerts**: Critical/warning/info alerts with timestamps | |
| - **config_snapshot**: Config key-value pairs (populated after `view_config` or `edit_config`) | |
| - **step_number / max_steps**: Current progress | |
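The partial-observability rule for `services` can be illustrated with a small masking function. This is only a sketch: the dictionary keys `cpu` and `memory` and the `investigated` set are assumptions made for the example, not the actual `ServiceStatus` schema.

```python
def mask_metrics(service: dict, investigated: set) -> dict:
    """Return the agent-visible view of one service.

    CPU and memory read 0.0 until the service has been inspected with
    view_logs; all other fields pass through unchanged.
    """
    visible = dict(service)  # copy so the true state is not mutated
    if service["name"] not in investigated:
        visible["cpu"] = 0.0
        visible["memory"] = 0.0
    return visible


# Example: the true state of one service, before and after investigation.
true_state = {"name": "api-gateway", "health": "degraded", "cpu": 91.0, "memory": 72.5}
hidden_view = mask_metrics(true_state, investigated=set())
revealed_view = mask_metrics(true_state, investigated={"api-gateway"})
```

One practical consequence is that an agent cannot treat a 0.0 CPU reading as real data; it has to track which services it has already inspected.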
| ## Reward Design | |
| Dense per-step rewards create a learnable gradient for RL training. Investigation rewards use **diminishing-returns exploration**: the first investigation of an unhealthy service gives +0.04, with decay as more services are investigated. Health improvements give proportional reward via the system health delta (+0.005 per 1% improvement). **Sub-goal milestones** reward intermediate progress: config fixes (+0.08), migrations (+0.06), and alert resolution (+0.03). Breaking healthy services is heavily penalized (-0.30). All grading is outcome-based, with no procedure-based criteria. Rewards are **task-adaptive**: harder tasks with time pressure get steeper gradients (1.0x to 1.5x urgency scaling), creating a curriculum-aware reward landscape. Rewards are bounded to [-0.35, +0.30] per step to prevent training instability. | |
| | Signal | Reward | Condition | | |
| |--------|--------|-----------| | |
| | Service deployed to production | +0.15 | Service reaches prod successfully | | |
| | Service verified in staging | +0.05 | Staging health check passes | | |
| | Config error fixed | +0.08 | Service health improved after config change | | |
| | Migration completed | +0.06 | Pending migration count decreased | | |
| | Alert resolved | +0.03 | Alert count decreased | | |
| | Investigation (degraded svc) | +0.04 × decay | First-time view on unhealthy service (with decay) | |
| | Investigation (healthy svc) | +0.01 × decay | First-time view on healthy service (with decay) | |
| | Health improvement | +0.005/1% | System health delta | | |
| | Broke healthy service | -0.30 | Service went from healthy to degraded/down | | |
| | Repeated investigation | -0.01/-0.03 | Same view on same target (-0.03 if consecutive) | | |
| | Repeated exact action | -0.02 | Same action_type + service as last step | | |
| ### Reward Shaping Theory | |
| The health delta component (+0.005 per 1% system health improvement) approximates potential-based reward shaping (Ng et al., 1999), where Φ(s) = system_health/100. Exact potential-based shaping preserves the optimal policy; the health delta omits the discount term, so that guarantee holds only approximately, while the signal still provides a continuous gradient. | |
| Investigation bonuses use count-based exploration decay: reward = base × 1/(1 + 0.3n), in the spirit of count-based exploration bonuses (Bellemare et al., 2016). This incentivizes initial exploration while preventing reward hacking through repeated view actions. | |
| Task urgency scaling (1.0× to 1.5×) implements curriculum-aware reward calibration: harder tasks receive steeper gradients to maintain learning signal despite longer optimal trajectories. | |
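The three formulas in this section can be combined into a short numeric sketch. The function names are hypothetical. Two points are assumptions rather than facts from the environment: `n` is read as the number of earlier investigations, and urgency scaling is applied before clipping.

```python
def investigation_bonus(base: float, n: int) -> float:
    """Count-based decay: base * 1 / (1 + 0.3 * n).

    `n` is assumed to be the number of earlier investigations.
    """
    return base / (1 + 0.3 * n)


def health_delta_reward(prev_health: float, new_health: float) -> float:
    """+0.005 per 1% of system health gained (negative if health drops)."""
    return 0.005 * (new_health - prev_health)


def step_reward(raw: float, urgency: float = 1.0) -> float:
    """Scale by task urgency (1.0 to 1.5), then clip to [-0.35, +0.30].

    Applying the scale before the clip is an assumption of this sketch.
    """
    scaled = raw * urgency
    return max(-0.35, min(0.30, scaled))


# First and fourth investigation of unhealthy services (base +0.04).
first = investigation_bonus(0.04, 0)
fourth = investigation_bonus(0.04, 3)
# System health rising from 60% to 70% in one step.
recovery = health_delta_reward(60.0, 70.0)
```

Under these assumptions the fourth investigation earns a little over half of the first, and a 10-point health recovery is worth about as much as a staging verification.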
| ### Exploit Resistance | |
| | Attack Vector | Defense | | |
| |--------------|---------| | |
| | Repeated view_pipeline spam | Diminishing returns decay: reward = base/(1+0.3n), consecutive repeat -0.03 | | |
| | Break-then-fix exploit | -0.30 penalty exceeds +0.15 deploy + health recovery gains | | |
| | Step-stalling for investigation bonuses | Capped by diminishing returns + max_steps + efficiency grader component | | |
| | Config-grep pattern matching | Hard tasks removed prescriptive log messages; agent must diagnose from symptoms | | |
| | Ignoring secondary incidents | Compound incident grader awards 0.10 bonus for fixing secondary service | | |
| ## Baseline Scores | |
| Model: `Qwen/Qwen2.5-72B-Instruct` via HuggingFace Router | |
| | Task | Difficulty | LLM Baseline | Optimal | Gap | | |
| |------|-----------|-------------|---------|-----| | |
| | clean_deploy | Easy | 0.700 | 0.947 | +0.247 | | |
| | broken_pipeline | Medium | 0.482 | 0.890 | +0.408 | | |
| | judgment_call | Hard | 0.184 | 0.935 | +0.751 | | |
| | cascading_failure | Med-Hard | 0.280 | 0.883 | +0.603 | | |
| | capacity_crisis | Med-Hard | 0.250 | 0.634 | +0.384 | | |
| | random_incident (seed 6006) | Variable | 0.350 | 0.982 | +0.632 | | |
| LLM baselines were re-calibrated after environment tuning (v2). Optimal scores come from scripted expert trajectories. The large gap between LLM baseline and optimal demonstrates significant room for RL training improvement: the environment produces meaningful reward signal across the full skill spectrum. The `random_incident` task generates unique scenarios from each seed, enabling curriculum learning. | |
| ### Seed Curriculum for RL Training | |
| The `random_incident` task generates unique scenarios from each seed via `DEVOPS_SEED` env var. With 5 failure types, compound incidents, and randomized initial conditions, the configuration space produces hundreds of distinct scenarios. | |
| **Recommended curriculum:** | |
| - Seeds 1-20: Single-service failures, moderate severity (warm-up) | |
| - Seeds 21-60: Mix of single and compound incidents (core training) | |
| - Seeds 61-100: Severe failures with compound incidents (advanced) | |
| Set `DEVOPS_SEED` at reset time (reads from env var each episode). | |
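A training loop might apply the recommended curriculum along these lines. The sketch assumes the environment runs in the same process, so that setting `DEVOPS_SEED` before a reset is visible to it; `curriculum_stage` and `set_seed_for_next_reset` are hypothetical helper names.

```python
import os


def curriculum_stage(seed: int) -> str:
    """Map a seed to the recommended curriculum stage."""
    if 1 <= seed <= 20:
        return "warm-up"
    if 21 <= seed <= 60:
        return "core"
    if 61 <= seed <= 100:
        return "advanced"
    raise ValueError(f"seed {seed} is outside the recommended range 1-100")


def set_seed_for_next_reset(seed: int) -> None:
    """Export DEVOPS_SEED so the next reset reads it (env vars are strings)."""
    os.environ["DEVOPS_SEED"] = str(seed)


# Walk the warm-up seeds in order; a real loop would reset and roll out here.
for seed in range(1, 21):
    set_seed_for_next_reset(seed)
```

If the environment runs in a separate container, the variable has to be set in that container's environment instead of the trainer's.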
| ### Difficulty Analysis | |
| | Task | Decision Depth | Info Asymmetry | Time Pressure | Optimal Steps | | |
| |------|---------------|----------------|---------------|---------------| | |
| | clean_deploy | Low (1) | None | None | 4-6 | |
| | broken_pipeline | Medium (3) | Medium | Low | 8-12 | |
| | judgment_call | High (5) | High | High | 5-8 | |
| | cascading_failure | High (4) | High | Medium | 6-10 | |
| | capacity_crisis | Medium (3) | Medium | Medium | 6-10 | |
| | random_incident | Variable | Variable | Variable | 5-12 | |
| ## Example Episode Trajectory | |
| **Task: broken_pipeline** (diagnose and fix a broken deployment pipeline). | |
| ``` | |
| Step 1: view_logs("cache-service") → reward +0.02 (investigation bonus, reveals Redis config error) | |
| Step 2: edit_config("cache-service", | |
| redis.host → "redis-prod...") → reward +0.10 (health improvement from fixing config) | |
| Step 3: deploy("api-gateway", "v2.3.1") → reward +0.05 (staging verified) | |
| Step 4: deploy("api-gateway", "v2.3.1") → reward +0.15 (promoted to production) | |
| Step 5: approve("All services healthy") → reward +0.03 (episode complete) | |
| ``` | |
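The per-step rewards in the trajectory above sum to 0.35. The sketch below computes both that total and the discounted return an RL algorithm would see, using γ = 0.99 as recommended in the MDP description; `discounted_return` is a hypothetical helper name.

```python
# Per-step rewards copied from the example trajectory.
EPISODE_REWARDS = [0.02, 0.10, 0.05, 0.15, 0.03]


def discounted_return(rewards: list[float], gamma: float = 0.99) -> float:
    """Sum of gamma**t * r_t over the episode, with t starting at 0."""
    return sum(reward * gamma**t for t, reward in enumerate(rewards))


undiscounted = discounted_return(EPISODE_REWARDS, gamma=1.0)
discounted = discounted_return(EPISODE_REWARDS, gamma=0.99)
```

With episodes this short the two values are close, which is why a discount factor near 1 is a reasonable default here.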
| ## Environment Features | |
| - 6 tasks (5 hand-crafted + 1 procedurally generated) for curriculum learning | |
| - 5 microservices with realistic dependency graph | |
| - Stochastic simulation with seeded RNG for full reproducibility | |
| - Realistic production logs (Java/Node stack traces, timestamps, red herrings) | |
| - Partial observability (CPU/memory hidden until investigated via view_logs) | |
| - Cascading failures propagate through dependency chain each step | |
| - Cross-metric compounding (error → CPU → latency spirals, and reverse recovery) | |
| - Non-linear tipping points (CPU cliff at 85%, latency cliff at 2000ms) | |
| - Trade-off effects on every action (deploy → CPU spike, rollback → 25% regression risk, config edit → restart latency) | |
| - Time pressure on incident tasks (health degrades each step in judgment_call) | |
| - Multi-path task design (judgment_call has 3 valid resolution paths with different scores) | |
| - Dense per-step reward with anti-reward-hacking safeguards (bounded, no procedure bonuses) | |
| - Observation summary field for quick triage | |
| ## Formal MDP Description | |
| | Property | Description | | |
| |----------|-------------| | |
| | **State space S** | 5 services × (health ∈ {healthy, degraded, down}, cpu ∈ [0,100], memory ∈ [0,100], error_rate ∈ [0,50], latency ∈ [0,5000], config: Dict) + pipeline status + migration status + alerts. Partially observable: CPU and memory are hidden until investigated via `view_logs`. | |
| | **Action space A** | 9 discrete action types × parameterized service targets. ~45 effective actions. | |
| | **Transition T** | Deterministic core with stochastic elements (8% transient staging failure, 25% rollback regression, deploy quality variance). Seeded RNG ensures reproducibility. | | |
| | **Reward R** | Dense, bounded [-0.35, +0.30] per step. Potential-based health delta + milestone rewards + exploration bonuses with diminishing returns. Task-adaptive urgency scaling (1.0× to 1.5×). | |
| | **Episode length** | 12-20 steps depending on task. Terminates on approve/abort/max_steps/catastrophic failure (health < 20%). | |
| | **Discount factor** | Recommended γ = 0.99 (short episodes, dense rewards). | |
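The termination rule in the table can be written as a single predicate. This is a sketch: `is_terminal` and its argument names are hypothetical, and treating `step_number >= max_steps` as the step limit is an assumption about how steps are counted.

```python
TERMINAL_ACTIONS = {"approve", "abort"}
CATASTROPHIC_HEALTH = 20.0  # percent; below this the episode ends


def is_terminal(action_type: str, step_number: int, max_steps: int, system_health: float) -> bool:
    """True if the episode ends after this step.

    Ends on approve/abort, on reaching the step limit, or when system
    health falls below the catastrophic threshold.
    """
    return (
        action_type in TERMINAL_ACTIONS
        or step_number >= max_steps
        or system_health < CATASTROPHIC_HEALTH
    )
```

A rollout loop can call this after every step to decide whether to request a final score from the grader.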
| ### Stochastic Elements (All Seeded) | |
| All randomness uses `random.Random(seed)`, so the same seed plus the same actions yields identical outcomes. | |
| | Element | Probability | Location | | |
| |---------|------------|----------| | |
| | Transient staging failure | 8% per deploy | `deploy_to_staging()` | | |
| | Rollback regression | 25% per rollback | `rollback()` | | |
| | Deploy quality | 70% clean / 20% minor / 10% unstable | `deploy_to_production()` | | |
| | Compound incident | 30% in random_incident | `RandomIncidentScenario.setup()` | | |
| | Initial health variance | ±10 CPU, ±15 latency | `RandomIncidentScenario.setup()` | |
| Determinism guarantee: `reset()` re-seeds the RNG from fixed task seeds. Two resets with the same task produce identical initial states. | |
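The reproducibility property follows directly from how `random.Random` works: a generator built from the same seed yields the same sequence of draws. The sketch below uses the probabilities from the table, but `simulate_rolls` is a hypothetical function and the order of draws does not mirror the environment's internals.

```python
import random


def simulate_rolls(seed: int, n_deploys: int = 5) -> list[tuple[bool, str]]:
    """Draw staging-failure and deploy-quality outcomes from a seeded RNG."""
    rng = random.Random(seed)  # private generator; global state is untouched
    outcomes = []
    for _ in range(n_deploys):
        # 8% transient staging failure per deploy.
        staging_failed = rng.random() < 0.08
        # 70% clean / 20% minor / 10% unstable deploy quality.
        quality = rng.choices(["clean", "minor", "unstable"], weights=[70, 20, 10])[0]
        outcomes.append((staging_failed, quality))
    return outcomes


# Two runs with the same seed are identical draw for draw.
run_a = simulate_rolls(42)
run_b = simulate_rolls(42)
```

Using a dedicated `random.Random` instance rather than the module-level functions also keeps other code that calls `random` from disturbing the sequence.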
| ## Setup | |
| ```bash | |
| # Install dependencies | |
| uv sync | |
| # Run locally (without Docker) | |
| uv run python -m uvicorn server.app:app --host 0.0.0.0 --port 8000 | |
| # Build and run with Docker | |
| docker build -t devops-pipeline-env . | |
| docker run -p 8000:8000 devops-pipeline-env | |
| # Test reset endpoint | |
| curl -X POST -H "Content-Type: application/json" -d '{}' http://localhost:8000/reset | |
| # Open web UI | |
| # http://localhost:8000/web | |
| # Run inference | |
| export HF_TOKEN=your_token_here | |
| uv run inference.py | |
| # Validate and deploy | |
| openenv validate | |
| openenv push --repo-id your-username/devops-pipeline-env | |
| ``` | |
| ## API Endpoints | |
| | Endpoint | Method | Description | | |
| |----------|--------|-------------| | |
| | `/reset` | POST | Reset environment (new episode) | | |
| | `/step` | POST | Execute an action, returns observation | | |
| | `/state` | GET | Get current environment state | | |
| | `/tasks` | GET | List available tasks and action schema | | |
| | `/health` | GET | Health check | | |
| | `/baseline` | POST | Pre-recorded LLM baseline scores | | |
| | `/grader` | POST | Score the current active episode | | |
| | `/ws` | WS | WebSocket for persistent sessions | | |
| | `/web` | GET | Gradio web interface | | |
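A minimal Python client for the `/reset` and `/step` endpoints might look like the sketch below, using only the standard library. The empty `/reset` body matches the `curl` example in Setup. The `/step` payload shape (`{"action": {...}}`) is an assumption and should be checked against the schema returned by `/tasks`.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # local server started as in Setup


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for one of the environment endpoints."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path: str, payload: dict) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(build_request(path, payload), timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def run_demo() -> dict:
    """Reset, then take one step. The /step payload shape is an assumption."""
    post_json("/reset", {})
    return post_json("/step", {"action": {"action_type": "view_pipeline"}})
```

`run_demo()` is left uncalled so that importing the module has no side effects; call it once the server is listening on port 8000.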