title: Cloud-Native DevOps Debug Environment
emoji: 🔧
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
Cloud-Native DevOps Debug Environment
An OpenEnv-compatible environment where AI agents learn to debug broken GitHub Actions workflows, Dockerfiles, and Kubernetes manifests. Built for the OpenEnv Hackathon by Scaler School of Technology (partners: Meta, HuggingFace, PyTorch).
Why Cloud-Native Debugging?
Every developer who ships code hits deployment pipeline failures. A misconfigured Dockerfile, a broken GitHub Actions workflow, a missing secret, a Kubernetes selector mismatch: these are the bugs that waste hours of developer time every week. They're hard to debug because:
- Error messages are cryptic ("unable to prepare context: unable to evaluate symlinks")
- The feedback loop is slow (push, wait for CI, read logs, fix, repeat)
- Multiple config files interact in non-obvious ways (Dockerfile + workflow + secrets + K8s manifests)
- Kubernetes errors require cross-resource reasoning (Deployment labels must match Service selectors)
This environment teaches AI agents to do what senior DevOps engineers do: read the error, trace it to the root cause across multiple files, and fix it.
How It Works: The Complete Flow
1. RESET
   Agent receives:
   - Broken config files (Dockerfile / workflow / K8s YAML)
   - Error message from the failed build/deploy
   - Available secrets list
   - Number of issues to find

2. OBSERVE → THINK → ACT (repeat up to 10 steps)
   Agent reads the error, analyzes the files, then takes one action:
   - edit_file: replace broken content with fixed content
   - replace_line: fix a specific line number
   - add_line / add_block: insert missing content
   - delete_line / delete_block: remove bad content
   - request_hint: get a clue (3-4% score penalty, depending on difficulty)
   - submit: "I'm done fixing"

   After each action, agent gets:
   - Updated file contents
   - Reward signal (+0.3 per fix, -0.02 for failed edits)
   - How many issues are now fixed

3. GRADE
   Deterministic scoring based on:
   - What fraction of issues were fixed
   - Whether ALL issues were fixed (bonus)
   - How many steps it took (efficiency)
   - How many hints were used (penalty)
   Score range: (0, 1) exclusive, never exactly 0 or 1
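The reset, act, and grade flow above can be sketched in Python. This is a minimal illustration rather than the project's client: `StubEnv`, `run_episode`, `fix_then_submit`, and the observation keys (`issues_fixed`, `issues_total`, `done`) are hypothetical names. Only the action names, the 10-step budget, and the +0.3 reward per fix come from the description above.

```python
from typing import Callable

MAX_STEPS = 10  # step budget taken from the flow description


class StubEnv:
    """Hypothetical in-memory stand-in for the real environment."""

    def __init__(self, issues_total: int = 1) -> None:
        self.issues_total = issues_total
        self.issues_fixed = 0

    def reset(self) -> dict:
        self.issues_fixed = 0
        return self._observation(done=False)

    def step(self, action: dict) -> tuple[dict, float]:
        # Assumed reward shape: +0.3 for an edit that fixes an issue, else 0.
        reward = 0.0
        done = action["action_type"] == "submit"
        if action["action_type"] == "edit_file" and self.issues_fixed < self.issues_total:
            self.issues_fixed += 1
            reward = 0.3
        return self._observation(done), reward

    def _observation(self, done: bool) -> dict:
        return {
            "issues_fixed": self.issues_fixed,
            "issues_total": self.issues_total,
            "done": done,
        }


def run_episode(env: StubEnv, policy: Callable[[dict], dict]) -> tuple[float, int]:
    """Run reset, then observe/act until submit or the step budget runs out."""
    obs = env.reset()
    total_reward = 0.0
    steps = 0
    while steps < MAX_STEPS and not obs["done"]:
        obs, reward = env.step(policy(obs))
        total_reward += reward
        steps += 1
    return total_reward, steps


def fix_then_submit(obs: dict) -> dict:
    """Toy policy: edit until every issue is fixed, then submit."""
    if obs["issues_fixed"] < obs["issues_total"]:
        return {"action_type": "edit_file"}
    return {"action_type": "submit"}
```

Calling `run_episode(StubEnv(issues_total=2), fix_then_submit)` takes three steps: two edits followed by a submit. A real agent would replace the toy policy with one that reads the files and error message from the observation.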
The 10 Tasks (50 Scenarios)
Evaluation runs all 50 scenarios deterministically across all 10 tasks for reproducible scoring.
Task 1: Dockerfile Syntax Errors (Easy)
Simple typos and instruction errors that break docker build.
| # | Scenario | What's Broken | Real-World Context |
|---|---|---|---|
| 1 | typo_filename | COPY requirments.txt . (misspelled filename) | Most common Docker build error on Stack Overflow |
| 2 | invalid_base_image | FROM python:3.9-slimm (extra 'm' in tag) | Happens when copy-pasting image tags |
| 3 | invalid_run_syntax | RUN pip install ... \n && python setup.py (broken line continuation) | Formatting multi-line RUN commands is tricky |
| 4 | copy_missing_source | COPY dist/ but build output is in build/ | Source directory doesn't exist in build context |
| 5 | missing_from_instruction | No FROM instruction at all | Dockerfile must start with FROM |
Task 2: Dockerfile Runtime Errors (Medium)
The Dockerfile builds successfully, but the container crashes at runtime.
| # | Scenario | What's Broken | Real-World Context |
|---|---|---|---|
| 1 | missing_workdir | No WORKDIR, so files scatter to / | Container runs but npm start can't find package.json |
| 2 | cmd_entrypoint_conflict | Both ENTRYPOINT and CMD defined as full commands | Process starts incorrectly |
| 3 | entrypoint_not_executable | Shell script lacks execute permission | chmod +x missing → "permission denied" |
| 4 | missing_required_env | App needs DATABASE_URL but it's not set | Container crashes: "DATABASE_URL is not defined" |
| 5 | non_root_privileged_port | Non-root user tries to bind port 80 | Security best practice conflicts with port < 1024 |
Task 3: Workflow Syntax & Structure (Easy)
GitHub Actions YAML has structural problems that GitHub rejects before any job runs.
| # | Scenario | What's Broken | Real-World Context |
|---|---|---|---|
| 1 | checkout_after_build | docker build before actions/checkout | No source code → "Dockerfile not found" |
| 2 | missing_runs_on | Job has no runs-on field | Every job needs a runner |
| 3 | invalid_trigger_syntax | branches: main instead of branches: [main] | Must be a YAML list |
| 4 | missing_step_uses_or_run | Step has a name but no uses: or run: | Invalid step |
| 5 | missing_on_trigger | No on: block at all | Workflow never triggers |
Task 4: Workflow Secrets & Permissions (Medium)
Secrets exist but aren't wired correctly to the workflow steps.
| # | Scenario | What's Broken | Real-World Context |
|---|---|---|---|
| 1 | missing_env_secrets | $DOCKER_PASSWORD without env: mapping | Secrets must be passed via env: block |
| 2 | wrong_secret_syntax | ${ secrets.TOKEN } instead of ${{ secrets.TOKEN }} | Single vs double braces |
| 3 | missing_token_permissions | Pushing to GHCR without permissions: packages: write | GITHUB_TOKEN is read-only by default |
| 4 | secret_not_in_env | $SLACK_WEBHOOK_URL not in env: | Very common mistake |
| 5 | ghcr_wrong_credentials | Using DOCKER_PASSWORD for GHCR login | GHCR uses GITHUB_TOKEN |
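The wrong_secret_syntax row lends itself to a mechanical check. The sketch below shows one way to spot single-brace secret references with a regular expression. `find_single_brace_secrets` is a hypothetical helper written for illustration, not the project's WorkflowSimulator.

```python
import re

# Matches "${ secrets.NAME }" written with single braces.
# The lookaheads reject the valid double-brace form "${{ secrets.NAME }}".
SINGLE_BRACE_SECRET = re.compile(
    r"\$\{(?!\{)\s*secrets\.([A-Za-z_][A-Za-z0-9_]*)\s*\}(?!\})"
)


def find_single_brace_secrets(workflow_text: str) -> list[str]:
    """Return the names of secrets referenced with single-brace syntax."""
    return SINGLE_BRACE_SECRET.findall(workflow_text)
```

For example, `find_single_brace_secrets("run: echo ${ secrets.TOKEN }")` reports `TOKEN`, while the corrected `${{ secrets.TOKEN }}` form yields an empty list.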
Task 5: CI + Docker Integration (Medium)
The workflow AND the Dockerfile interact. Fixing one file alone isn't enough.
| # | Scenario | What's Broken | Real-World Context |
|---|---|---|---|
| 1 | missing_buildx_for_platforms | Multi-platform build without setup-buildx-action | Need BuildKit for cross-compile |
| 2 | missing_load_true | build-push-action without load: true, so the next step can't find the image | Buildx doesn't load into local daemon by default |
| 3 | wrong_build_context | Context is ./backend but Dockerfile path is ./Dockerfile | Path mismatch |
| 4 | cache_without_mode_max | GHA cache export missing mode=max | Cache doesn't persist |
| 5 | push_without_login | docker push without docker login first | "denied: requested access" |
Task 6: Multi-Stage Pipeline & Matrix (Hard)
Complex pipelines with multiple interacting bugs. Agent must find 2-3 issues across files.
| # | Scenario | What's Broken | Real-World Context |
|---|---|---|---|
| 1 | artifact_path_mismatch | COPY --from=builder /app/dist but React outputs to /app/build | CRA uses build/, Vite uses dist/ |
| 2 | matrix_platform_arg | $BUILDPLATFORM without ARG BUILDPLATFORM | Multi-arch needs platform ARGs |
| 3 | cross_job_artifact | Test job downloads artifact but missing needs: build | Jobs run in parallel by default |
| 4 | multiple_issues | Dockerfile typo + workflow secrets not wired (2 bugs) | Problems compound across files |
| 5 | matrix_version_failure | Matrix includes Node 14 but code needs >= 16 + missing needs: | 2 bugs to find |
Task 7: Kubernetes Pod Failures (Medium)
Pod crashes and scheduling failures in Kubernetes deployments.
| # | Scenario | What's Broken | Real-World Context |
|---|---|---|---|
| 1 | oom_killed | Memory limit 64Mi too low → CrashLoopBackOff/OOMKilled | Most common K8s production issue |
| 2 | image_pull_backoff | Image tag typo nginx:latset → ImagePullBackOff | Copy-paste tag errors |
| 3 | wrong_command | command: ["python", "workers.py"] but file is worker.py | File name mismatch |
| 4 | missing_configmap | envFrom: configMapRef: app-config but ConfigMap doesn't exist | CreateContainerConfigError |
| 5 | liveness_probe_failing | Liveness probe port 3000 but app listens on 8080 | Probe misconfiguration causes restarts |
Task 8: Kubernetes Service & Ingress Issues (Hard)
Networking issues where pods run fine but traffic doesn't reach them. Error messages are intentionally vague; the agent must diagnose from kubectl output.
| # | Scenario | What's Broken | Real-World Context |
|---|---|---|---|
| 1 | selector_mismatch | Service selector app: api but pod label is app: api-server | No endpoints; most common K8s networking bug |
| 2 | port_mismatch | Service targetPort 8080 but container listens on 3000 | Connection refused |
| 3 | ingress_wrong_service | Ingress references api-svc but service name is api-service | Ingress 404 |
| 4 | network_policy_blocking | NetworkPolicy with empty ingress rules blocks all traffic | Database unreachable |
| 5 | missing_ingress_class | No ingressClassName: nginx specified | Ingress controller doesn't pick it up |
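The selector_mismatch scenario comes down to a label-subset test: a Service only gets endpoints from pods whose labels contain every key/value pair in its selector. Here is a minimal sketch of that rule. `selector_matches` and `count_endpoints` are hypothetical helpers for illustration, not part of the project's KubernetesSimulator.

```python
def selector_matches(selector: dict[str, str], pod_labels: dict[str, str]) -> bool:
    """True when every selector key/value pair appears in the pod's labels.

    An empty selector is treated as "no match" here, since a Service without
    a selector does not get endpoints created for it automatically.
    """
    return bool(selector) and all(
        pod_labels.get(key) == value for key, value in selector.items()
    )


def count_endpoints(selector: dict[str, str], pods: list[dict[str, str]]) -> int:
    """Count pods (given as label dicts) that the Service would select."""
    return sum(1 for labels in pods if selector_matches(selector, labels))
```

With the values from the table, the selector `{"app": "api"}` selects no pod labeled `{"app": "api-server"}`, so the Service has zero endpoints. Changing either side so the two agree fixes it, which is why more than one edit can be a valid fix.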
Task 9: CI/CD Build & Push Pipeline (Hard)
GHA-to-Docker-to-Registry pipeline failures spanning multiple files.
| # | Scenario | What's Broken | Real-World Context |
|---|---|---|---|
| 1 | registry_mismatch | Build tags ghcr.io/... but push targets docker.io/... | Registry URL mismatch between steps |
| 2 | image_tag_mismatch | Build uses github.ref_name but push uses github.sha | "image not found locally" |
| 3 | inconsistent_tagging | docker tag myuser/api:latest but image was built as myuser/api:${{ github.sha }} | Tag source doesn't exist |
| 4 | build_arg_not_passed | Dockerfile ARG APP_VERSION but no --build-arg in workflow | Version file is empty |
| 5 | dockerfile_path_in_subdirectory | Workflow points to ./Dockerfile but it's at ./services/api/Dockerfile | Monorepo path mismatch |
Task 10: Full Stack Deployment Pipeline (Expert)
Multi-error scenarios spanning the entire stack: GHA + Dockerfile + K8s manifests. 2-4 bugs per scenario requiring cross-file reasoning. Error messages are intentionally vague; the agent must trace root causes from symptoms.
| # | Scenario | What's Broken | Real-World Context |
|---|---|---|---|
| 1 | full_pipeline_ghcr_and_selector | GHCR token not mapped + K8s Service selector mismatch | 2 bugs across workflow + K8s |
| 2 | full_pipeline_three_bugs | Missing checkout + no WORKDIR + wrong container/service port | 4 bugs across 4 files |
| 3 | full_pipeline_ghcr_dockerfile_k8s | Wrong GHCR secret + base image typo + OOM memory limit | 3 bugs across all layers |
| 4 | full_pipeline_permissions_image_ingress | Missing packages:write + hardcoded image placeholder + no ingressClassName | 3 bugs |
| 5 | full_pipeline_secrets_build_probe | Docker secrets not wired + wrong build output dir + probe port mismatch | 4 bugs across all layers |
Fix Validation: Simulator-Based
Fixes are validated using structural simulators, not string matching. This means:
- Alternative valid fixes are accepted. Raising the memory limit to either 512Mi or 256Mi resolves the OOM, and the simulator accepts both.
- Three independent simulators run after every edit:
  - DockerSimulator: validates Dockerfile syntax (FROM, COPY, EXPOSE, RUN) and runtime behavior (WORKDIR, CMD/ENTRYPOINT, permissions, ENV)
  - WorkflowSimulator: parses YAML, checks triggers, runs-on, step ordering, secrets wiring, permissions, buildx requirements, registry consistency
  - KubernetesSimulator: validates manifests, cross-resource dependencies (Service selector ↔ Deployment labels), pod status simulation (OOM, ImagePullBackOff), service endpoint reachability
- 7 granular checks are tracked: docker_build, docker_run, workflow_parse, workflow_exec, k8s_valid, k8s_pod_running, k8s_service_active
- Progress = how many checks flip from fail → pass compared to the initial broken state
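The progress rule in the last bullet can be written as a comparison of two check snapshots. The sketch below assumes each snapshot is a dict mapping check name to pass/fail. That representation, the `count_progress` name, and the choice to ignore checks missing from a snapshot are assumptions made for illustration. Only the seven check names come from the list above.

```python
CHECKS = (
    "docker_build",
    "docker_run",
    "workflow_parse",
    "workflow_exec",
    "k8s_valid",
    "k8s_pod_running",
    "k8s_service_active",
)


def count_progress(initial: dict[str, bool], current: dict[str, bool]) -> int:
    """Count checks that failed in the initial broken state and pass now.

    A check absent from the initial snapshot is treated as not applicable
    (assumed passing), so it can never count as progress.
    """
    return sum(
        1
        for name in CHECKS
        if not initial.get(name, True) and current.get(name, False)
    )
```

A scenario that starts with `k8s_pod_running` and `k8s_service_active` failing and ends with only the first one passing has a progress of 1. Because the comparison looks at check outcomes rather than file text, any edit that makes the check pass counts equally.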
Available Actions
Each step, the agent chooses exactly one action:
| Action | What It Does | When to Use |
|---|---|---|
| edit_file | Replace old_content with new_content in a file | Most common: fix a broken line or block |
| replace_line | Replace content at a specific line number | When you know exactly which line is wrong |
| add_line | Insert a new line into a file | Adding missing instructions (e.g., missing WORKDIR) |
| delete_line | Remove a specific line | Removing a bad instruction |
| add_block | Insert a multi-line block | Adding entire sections (e.g., env: block with secrets) |
| delete_block | Remove a multi-line block | Removing incorrect sections |
| request_hint | Get a clue about what's wrong | Costs 3-4% of the final score depending on difficulty; use sparingly |
| submit | Declare "I'm done" and trigger final evaluation | When all fixes are applied |
Important: edit_file requires old_content to match exactly (including whitespace). If it doesn't match, the edit fails and the agent gets a -0.02 reward penalty.
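The exact-match rule can be illustrated with a few lines of Python. This is a sketch of the behavior described above, not the environment's implementation. `apply_edit` is a hypothetical name, replacing only the first occurrence is an assumption, and the reward returned for a successful edit is left at 0.0 because the +0.3 is tied to fixing an issue, not to the edit itself.

```python
FAILED_EDIT_REWARD = -0.02  # penalty for an edit whose old_content does not match


def apply_edit(content: str, old_content: str, new_content: str) -> tuple[str, float]:
    """Apply an edit_file-style replacement with exact, whitespace-sensitive matching.

    Returns the (possibly unchanged) file content and the immediate reward.
    """
    if not old_content or old_content not in content:
        # No exact match: the file is left untouched and the edit is penalized.
        return content, FAILED_EDIT_REWARD
    # Assumption: only the first occurrence is replaced.
    return content.replace(old_content, new_content, 1), 0.0
```

Given a file containing `memory: "64Mi"` with a single space after the colon, an edit whose old_content has two spaces there fails and leaves the file unchanged. Copying old_content verbatim from the observation avoids this.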
Grading System
Scoring is deterministic (the same actions always produce the same score) and difficulty-aware (harder tasks are graded more generously). Scores lie strictly inside (0, 1): never exactly 0 or 1.
The Formula
FINAL SCORE = Base + Partial Fixes + Complete Bonus + Difficulty Bonus + Efficiency - Hint Penalty - Failed Edit Penalty
Clamped to (0.01, 0.99).
Component Breakdown
| Component | Weight | Description |
|---|---|---|
| Base score | 5% | Participation credit (guarantees score > 0) |
| Partial fixes | 35% | Proportional to issues_fixed / issues_total |
| Complete bonus | 25% | All issues fixed |
| Difficulty bonus | 0-3% | Extra reward for fully solving hard/expert tasks |
| Efficiency | 25% | Decays with extra steps; slower decay for harder tasks |
| Hint penalty | -3% to -4% each | Per request_hint action (cheaper for hard/expert) |
| Failed edit penalty | -2% each | Per edit with no valid file path |
Difficulty Modifiers
| Difficulty | Max Score | Efficiency Decay | Hint Cost |
|---|---|---|---|
| Easy | 0.90 | 0.03/step (strict) | 4% each |
| Medium | 0.90 | 0.027/step | 4% each |
| Hard/Expert | 0.93 | 0.021/step (forgiving) | 3% each |
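To see how the components combine, here is a worked sketch of the formula in Python. The weights, hint costs, decay rates, and clamp bounds are taken from the tables above. Several details are assumptions and may differ from the real grader: that efficiency is `0.25 - decay * extra_steps` floored at zero, that the caller supplies `extra_steps`, that the difficulty bonus is the full 3% for hard/expert and 0 otherwise, and the names `DifficultyProfile`, `PROFILES`, and `final_score`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DifficultyProfile:
    efficiency_decay: float  # efficiency lost per extra step
    hint_cost: float         # penalty per request_hint
    difficulty_bonus: float  # awarded only when every issue is fixed (assumed)


PROFILES = {
    "easy": DifficultyProfile(0.03, 0.04, 0.0),
    "medium": DifficultyProfile(0.027, 0.04, 0.0),
    "hard": DifficultyProfile(0.021, 0.03, 0.03),
    "expert": DifficultyProfile(0.021, 0.03, 0.03),
}


def final_score(
    difficulty: str,
    issues_fixed: int,
    issues_total: int,
    extra_steps: int,
    hints: int = 0,
    failed_edits: int = 0,
) -> float:
    """Combine the score components and clamp the result to [0.01, 0.99]."""
    profile = PROFILES[difficulty]
    score = 0.05                                  # base score
    score += 0.35 * issues_fixed / issues_total   # partial fixes
    if issues_fixed == issues_total:
        score += 0.25 + profile.difficulty_bonus  # complete + difficulty bonus
    # Assumed efficiency shape: full 25% with no extra steps, linear decay after.
    score += max(0.0, 0.25 - profile.efficiency_decay * extra_steps)
    score -= profile.hint_cost * hints
    score -= 0.02 * failed_edits
    return min(0.99, max(0.01, score))
```

Under these assumptions, a perfect easy run (`final_score("easy", 1, 1, 0)`) comes out at the 0.90 maximum in the table, and a perfect hard run at 0.93. Each hint on an easy task then subtracts 0.04. The text above does not say whether the real grader gates the efficiency term on progress, so this sketch applies it unconditionally.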
Evaluation
The evaluation pipeline runs all 50 scenarios across all 10 tasks deterministically:
# Runs all 10 tasks × 5 scenarios = 50 episodes
results = run_baseline_episodes() # num_episodes=None runs all
# Per-episode scores in (0, 1)
# Aggregate = mean of all 50 scores
aggregate = sum(r.score for r in results) / len(results)
This ensures:
- Reproducibility: same agent produces same score every time
- Complete coverage: every error pattern is tested
- Fair comparison: all agents face the same 50 scenarios
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| / | GET | Root page |
| /health | GET | Health check: returns {"status": "healthy"} |
| /metadata | GET | Environment name, description, version, tags |
| /schema | GET | Action, observation, and state JSON schemas |
| /reset | POST | Start a new episode (optional: task_id, scenario_id, seed) |
| /step | POST | Take an action and receive observation + reward |
| /state | GET | Get current observation without taking an action |
| /info | GET | Task list with metadata |
| /tasks | GET | List all tasks with difficulty levels |
| /grader | POST | Grade a trajectory (list of step dicts) |
| /baseline | POST | Run baseline across all scenarios (optional: task_id, num_episodes) |
| /mcp | POST | JSON-RPC 2.0 MCP endpoint (initialize, tools/list) |
Example: Full Episode via API
# 1. Start an episode
curl -X POST http://localhost:7860/reset \
-H "Content-Type: application/json" \
-d '{"task_id": "k8s_pod_failures", "scenario_id": "oom_killed"}'
# 2. Fix the memory limit (any reasonable value works; the simulator validates structurally)
curl -X POST http://localhost:7860/step \
-H "Content-Type: application/json" \
-d '{
"action": {
"action_type": "edit_file",
"edits": [{
"file_path": "k8s/deployment.yaml",
"old_content": "memory: \"64Mi\"",
"new_content": "memory: \"512Mi\""
}]
}
}'
# Response: reward=0.3, issues_fixed=1/1, done=true
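The same two requests can be issued from Python with only the standard library. The sketch below mirrors the payloads of the curl commands above. The helper names (`build_reset_payload`, `build_edit_action`, `post_json`) are hypothetical, and the shape of the JSON response is not assumed beyond it being a JSON object.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # matches the port used in the curl example


def build_reset_payload(task_id: str, scenario_id: str) -> dict:
    """Body for POST /reset, selecting a specific task and scenario."""
    return {"task_id": task_id, "scenario_id": scenario_id}


def build_edit_action(file_path: str, old_content: str, new_content: str) -> dict:
    """Body for POST /step carrying a single edit_file action."""
    return {
        "action": {
            "action_type": "edit_file",
            "edits": [
                {
                    "file_path": file_path,
                    "old_content": old_content,
                    "new_content": new_content,
                }
            ],
        }
    }


def post_json(path: str, payload: dict) -> dict:
    """POST a JSON body to the running server and decode the JSON response."""
    request = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server running locally, `post_json("/reset", build_reset_payload("k8s_pod_failures", "oom_killed"))` starts the episode, and `post_json("/step", build_edit_action("k8s/deployment.yaml", 'memory: "64Mi"', 'memory: "512Mi"'))` submits the fix. Building payloads as dicts and serializing them with `json.dumps` avoids the manual quote escaping needed in the shell version.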
Quick Start
Local Development
pip install -r requirements.txt
python -m uvicorn server.app:app --host 0.0.0.0 --port 7860
Run Tests
pytest tests/ -v
Docker
docker build -t cloud-native-devops-env .
docker run -p 7860:7860 cloud-native-devops-env
Baseline Inference (with LLM)
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=meta-llama/Llama-3.1-70B-Instruct
export HF_TOKEN=your_token_here
python inference.py
Project Structure
cloud-native-devops-env/
├── openenv.yaml                                    # OpenEnv environment specification
├── inference.py                                    # LLM baseline (OpenAI client + HF router)
├── baseline_runner.py                              # Heuristic baseline: runs all 50 scenarios
├── Dockerfile                                      # Production container
├── requirements.txt                                # Python dependencies
│
├── server/
│   ├── app.py                                      # FastAPI with 12 endpoints
│   ├── models.py                                   # Pydantic models (type-safe API)
│   ├── environment.py                              # Core environment loop (reset/step/state)
│   ├── tasks/
│   │   ├── base.py                                 # BaseTask with scenario loading
│   │   ├── task_registry.py                        # Maps task_id → task class (10 tasks)
│   │   ├── task_1_build_errors.py                  # 5 Dockerfile syntax scenarios
│   │   ├── task_2_docker_runtime.py                # 5 Dockerfile runtime scenarios
│   │   ├── task_3_workflow_syntax.py               # 5 workflow structure scenarios
│   │   ├── task_4_workflow_secrets_permissions.py  # 5 secrets scenarios
│   │   ├── task_5_ci_docker_integration.py         # 5 integration scenarios
│   │   ├── task_6_multi_stage_matrix.py            # 5 multi-issue scenarios
│   │   ├── k8s_pod.py                              # 5 Kubernetes pod failure scenarios
│   │   ├── k8s_networking.py                       # 5 K8s networking scenarios
│   │   ├── pipeline_build_deploy.py                # 5 GHA→Docker→Registry scenarios
│   │   └── pipeline_full.py                        # 5 full-stack multi-error scenarios
│   ├── graders/
│   │   └── __init__.py                             # Deterministic trajectory grader
│   └── simulators/
│       ├── docker_simulator.py                     # Dockerfile build + runtime validation
│       ├── workflow_simulator.py                   # GHA workflow parse + execution validation
│       └── k8s_simulator.py                        # K8s manifest + cross-resource validation
│
└── tests/
    ├── test_endpoints.py                           # API endpoint tests
    ├── test_determinism.py                         # Grader determinism + score range tests
    ├── test_baseline.py                            # Heuristic baseline tests
    ├── test_environment_flow.py                    # Episode flow tests
    └── test_simulators.py                          # Simulator unit tests
Design Decisions
- Full cloud-native stack: Docker + GitHub Actions + Kubernetes, the three pillars of modern deployment pipelines.
- Simulator-based validation: Structural rule-based simulators validate fixes instead of string matching. Alternative valid fixes are accepted (e.g., 512Mi and 256Mi both fix an OOM). Deterministic, fast, no security concerns.
- Dense rewards: Partial credit at every step (+0.3 per fix, -0.02 per failed edit) rather than sparse pass/fail.
- Difficulty progression: Easy tasks are single-file, single-issue. Expert tasks are multi-file, multi-issue with interacting bugs across all three layers.
- Vague error messages in harder tasks: Easy tasks have explicit error messages. Hard/Expert tasks have realistic, vague messages that require the agent to actually diagnose the issue from context.
- Deterministic evaluation: All 50 scenarios run every time for reproducible, comparable scores in (0, 1) exclusive.
- 50 scenarios from real bugs: Every scenario is based on actual developer mistakes documented on Stack Overflow, GitHub Issues, and official documentation.
License
MIT