title: Cloud-Native DevOps Debug Environment
emoji: 🔧
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false

Cloud-Native DevOps Debug Environment

An OpenEnv-compatible environment where AI agents learn to debug broken GitHub Actions workflows, Dockerfiles, and Kubernetes manifests. Built for the OpenEnv Hackathon by Scaler School of Technology (partners: Meta, HuggingFace, PyTorch).

Why Cloud-Native Debugging?

Every developer who ships code hits deployment pipeline failures. A misconfigured Dockerfile, a broken GitHub Actions workflow, a missing secret, a Kubernetes selector mismatch: these are the bugs that waste hours of developer time every week. They're hard to debug because:

  • Error messages are cryptic ("unable to prepare context: unable to evaluate symlinks")
  • The feedback loop is slow (push, wait for CI, read logs, fix, repeat)
  • Multiple config files interact in non-obvious ways (Dockerfile + workflow + secrets + K8s manifests)
  • Kubernetes errors require cross-resource reasoning (Deployment labels must match Service selectors)

This environment teaches AI agents to do what senior DevOps engineers do: read the error, trace it to the root cause across multiple files, and fix it.


How It Works: The Complete Flow

┌───────────────────────────────────────────────────────────────┐
│  1. RESET                                                     │
│     Agent receives:                                           │
│     - Broken config files (Dockerfile / workflow / K8s YAML)  │
│     - Error message from the failed build/deploy              │
│     - Available secrets list                                  │
│     - Number of issues to find                                │
├───────────────────────────────────────────────────────────────┤
│  2. OBSERVE → THINK → ACT  (repeat up to 10 steps)            │
│     Agent reads the error, analyzes the files, then:          │
│     - edit_file: replace broken content with fixed content    │
│     - replace_line: fix a specific line number                │
│     - add_line / add_block: insert missing content            │
│     - delete_line / delete_block: remove bad content          │
│     - request_hint: get a clue (-4% score penalty)            │
│     - submit: "I'm done fixing"                               │
│                                                               │
│     After each action, agent gets:                            │
│     - Updated file contents                                   │
│     - Reward signal (+0.3 per fix, -0.02 for failed edits)    │
│     - How many issues are now fixed                           │
├───────────────────────────────────────────────────────────────┤
│  3. GRADE                                                     │
│     Deterministic scoring based on:                           │
│     - What fraction of issues were fixed                      │
│     - Whether ALL issues were fixed (bonus)                   │
│     - How many steps it took (efficiency)                     │
│     - How many hints were used (penalty)                      │
│     Score range: (0, 1) exclusive, never exactly 0 or 1       │
└───────────────────────────────────────────────────────────────┘

The 10 Tasks (50 Scenarios)

Evaluation runs all 50 scenarios deterministically across all 10 tasks for reproducible scoring.

Task 1: Dockerfile Syntax Errors (Easy)

Simple typos and instruction errors that break docker build.

| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|--------------------|
| 1 | typo_filename | COPY requirments.txt . (misspelled filename) | Most common Docker build error on Stack Overflow |
| 2 | invalid_base_image | FROM python:3.9-slimm (extra 'm' in tag) | Happens when copy-pasting image tags |
| 3 | invalid_run_syntax | RUN pip install ... \n && python setup.py (broken line continuation) | Formatting multi-line RUN commands is tricky |
| 4 | copy_missing_source | COPY dist/ but build output is in build/ | Source directory doesn't exist in build context |
| 5 | missing_from_instruction | No FROM instruction at all | Dockerfile must start with FROM |

Task 2: Dockerfile Runtime Errors (Medium)

The Dockerfile builds successfully, but the container crashes at runtime.

| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|--------------------|
| 1 | missing_workdir | No WORKDIR, so files scatter to / | Container runs but npm start can't find package.json |
| 2 | cmd_entrypoint_conflict | Both ENTRYPOINT and CMD defined as full commands | Process starts incorrectly |
| 3 | entrypoint_not_executable | Shell script lacks execute permission | chmod +x missing: "permission denied" |
| 4 | missing_required_env | App needs DATABASE_URL but it's not set | Container crashes: "DATABASE_URL is not defined" |
| 5 | non_root_privileged_port | Non-root user tries to bind port 80 | Security best practice conflicts with port < 1024 |

Task 3: Workflow Syntax & Structure (Easy)

GitHub Actions YAML has structural problems that GitHub rejects before any job runs.

| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|--------------------|
| 1 | checkout_after_build | docker build before actions/checkout | No source code: "Dockerfile not found" |
| 2 | missing_runs_on | Job has no runs-on field | Every job needs a runner |
| 3 | invalid_trigger_syntax | branches: main instead of branches: [main] | Must be a YAML list |
| 4 | missing_step_uses_or_run | Step has a name but no uses: or run: | Invalid step |
| 5 | missing_on_trigger | No on: block at all | Workflow never triggers |

Task 4: Workflow Secrets & Permissions (Medium)

Secrets exist but aren't wired correctly to the workflow steps.

| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|--------------------|
| 1 | missing_env_secrets | $DOCKER_PASSWORD without env: mapping | Secrets must be passed via env: block |
| 2 | wrong_secret_syntax | ${ secrets.TOKEN } instead of ${{ secrets.TOKEN }} | Single vs double braces |
| 3 | missing_token_permissions | Pushing to GHCR without permissions: packages: write | GITHUB_TOKEN is read-only by default |
| 4 | secret_not_in_env | $SLACK_WEBHOOK_URL not in env: | Very common mistake |
| 5 | ghcr_wrong_credentials | Using DOCKER_PASSWORD for GHCR login | GHCR uses GITHUB_TOKEN |
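
The single-brace mistake in the wrong_secret_syntax scenario is easy to spot mechanically. The following is a minimal sketch, not the environment's WorkflowSimulator: the function name and the regular expression are illustrative assumptions, and the check only covers the `${ secrets.NAME }` pattern described above.

```python
import re
from typing import List

# Hypothetical pattern: "${" NOT followed by a second "{", then "secrets.NAME }".
# It matches "${ secrets.TOKEN }" but not the valid "${{ secrets.TOKEN }}".
SINGLE_BRACE_SECRET = re.compile(r"\$\{(?!\{)\s*secrets\.(\w+)\s*\}")


def find_single_brace_secrets(workflow_text: str) -> List[str]:
    """Return the names of secrets referenced with single braces."""
    return SINGLE_BRACE_SECRET.findall(workflow_text)


broken = "run: echo ${ secrets.TOKEN }"
fixed = "run: echo ${{ secrets.TOKEN }}"

print(find_single_brace_secrets(broken))  # ['TOKEN']
print(find_single_brace_secrets(fixed))   # []
```

A text-level check like this only flags the syntax; whether the referenced secret is actually available to the step is a separate question.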

Task 5: CI + Docker Integration (Medium)

The workflow AND the Dockerfile interact. Fixing one file alone isn't enough.

| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|--------------------|
| 1 | missing_buildx_for_platforms | Multi-platform build without setup-buildx-action | Need BuildKit for cross-compile |
| 2 | missing_load_true | build-push-action without load: true, so the next step can't find the image | Buildx doesn't load into local daemon by default |
| 3 | wrong_build_context | Context is ./backend but Dockerfile path is ./Dockerfile | Path mismatch |
| 4 | cache_without_mode_max | GHA cache export missing mode=max | Cache doesn't persist |
| 5 | push_without_login | docker push without docker login first | "denied: requested access" |

Task 6: Multi-Stage Pipeline & Matrix (Hard)

Complex pipelines with multiple interacting bugs. Agent must find 2-3 issues across files.

| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|--------------------|
| 1 | artifact_path_mismatch | COPY --from=builder /app/dist but React outputs to /app/build | CRA uses build/, Vite uses dist/ |
| 2 | matrix_platform_arg | $BUILDPLATFORM without ARG BUILDPLATFORM | Multi-arch needs platform ARGs |
| 3 | cross_job_artifact | Test job downloads artifact but missing needs: build | Jobs run in parallel by default |
| 4 | multiple_issues | Dockerfile typo + workflow secrets not wired (2 bugs) | Problems compound across files |
| 5 | matrix_version_failure | Matrix includes Node 14 but code needs >= 16 + missing needs: | 2 bugs to find |

Task 7: Kubernetes Pod Failures (Medium)

Pod crashes and scheduling failures in Kubernetes deployments.

| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|--------------------|
| 1 | oom_killed | Memory limit 64Mi too low: CrashLoopBackOff/OOMKilled | Most common K8s production issue |
| 2 | image_pull_backoff | Image tag typo nginx:latset → ImagePullBackOff | Copy-paste tag errors |
| 3 | wrong_command | command: ["python", "workers.py"] but file is worker.py | File name mismatch |
| 4 | missing_configmap | envFrom: configMapRef: app-config but ConfigMap doesn't exist | CreateContainerConfigError |
| 5 | liveness_probe_failing | Liveness probe port 3000 but app listens on 8080 | Probe misconfiguration causes restarts |

Task 8: Kubernetes Service & Ingress Issues (Hard)

Networking issues where pods run fine but traffic doesn't reach them. Error messages are intentionally vague: the agent must diagnose from kubectl output.

| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|--------------------|
| 1 | selector_mismatch | Service selector app: api but pod label is app: api-server | No endpoints; most common K8s networking bug |
| 2 | port_mismatch | Service targetPort 8080 but container listens on 3000 | Connection refused |
| 3 | ingress_wrong_service | Ingress references api-svc but service name is api-service | Ingress 404 |
| 4 | network_policy_blocking | NetworkPolicy with empty ingress rules blocks all traffic | Database unreachable |
| 5 | missing_ingress_class | No ingressClassName: nginx specified | Ingress controller doesn't pick it up |
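
To make the selector_mismatch scenario concrete, here is a small sketch of equality-based label matching, the rule a Service uses to pick its pods. The function name is hypothetical and this is not the KubernetesSimulator's code; treating an empty selector as "no match" is a simplifying assumption.

```python
from typing import Mapping


def selector_matches(selector: Mapping[str, str], pod_labels: Mapping[str, str]) -> bool:
    """Return True when every selector key/value pair is present in the pod labels."""
    # Assumption: an empty selector selects nothing in this sketch.
    if not selector:
        return False
    return all(pod_labels.get(key) == value for key, value in selector.items())


pod_labels = {"app": "api-server", "tier": "backend"}

# Broken state from the scenario: the Service looks for app=api.
print(selector_matches({"app": "api"}, pod_labels))         # False
# After the fix the selector and the pod label agree.
print(selector_matches({"app": "api-server"}, pod_labels))  # True
```

Either side can be edited to resolve the mismatch: changing the Service selector or changing the pod label both make the check pass.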

Task 9: CI/CD Build & Push Pipeline (Hard)

GHA-to-Docker-to-Registry pipeline failures spanning multiple files.

| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|--------------------|
| 1 | registry_mismatch | Build tags ghcr.io/... but push targets docker.io/... | Registry URL mismatch between steps |
| 2 | image_tag_mismatch | Build uses github.ref_name but push uses github.sha | "image not found locally" |
| 3 | inconsistent_tagging | docker tag myuser/api:latest but image was built as myuser/api:${{ github.sha }} | Tag source doesn't exist |
| 4 | build_arg_not_passed | Dockerfile ARG APP_VERSION but no --build-arg in workflow | Version file is empty |
| 5 | dockerfile_path_in_subdirectory | Workflow points to ./Dockerfile but it's at ./services/api/Dockerfile | Monorepo path mismatch |

Task 10: Full Stack Deployment Pipeline (Expert)

Multi-error scenarios spanning the entire stack: GHA + Dockerfile + K8s manifests. 2-4 bugs per scenario requiring cross-file reasoning. Error messages are intentionally vague: the agent must trace root causes from symptoms.

| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|--------------------|
| 1 | full_pipeline_ghcr_and_selector | GHCR token not mapped + K8s Service selector mismatch | 2 bugs across workflow + K8s |
| 2 | full_pipeline_three_bugs | Missing checkout + no WORKDIR + wrong container/service port | 4 bugs across 4 files |
| 3 | full_pipeline_ghcr_dockerfile_k8s | Wrong GHCR secret + base image typo + OOM memory limit | 3 bugs across all layers |
| 4 | full_pipeline_permissions_image_ingress | Missing packages:write + hardcoded image placeholder + no ingressClassName | 3 bugs |
| 5 | full_pipeline_secrets_build_probe | Docker secrets not wired + wrong build output dir + probe port mismatch | 4 bugs across all layers |

Fix Validation: Simulator-Based

Fixes are validated using structural simulators, not string matching. This means:

  • Alternative valid fixes are accepted. Setting the memory limit to either 256Mi or 512Mi resolves the OOM, and the simulator accepts both.
  • Three independent simulators run after every edit:
    • DockerSimulator: validates Dockerfile syntax (FROM, COPY, EXPOSE, RUN) and runtime behavior (WORKDIR, CMD/ENTRYPOINT, permissions, ENV)
    • WorkflowSimulator: parses YAML, checks triggers, runs-on, step ordering, secrets wiring, permissions, buildx requirements, registry consistency
    • KubernetesSimulator: validates manifests, cross-resource dependencies (Service selector ↔ Deployment labels), pod status simulation (OOM, ImagePullBackOff), service endpoint reachability
  • 7 granular checks are tracked: docker_build, docker_run, workflow_parse, workflow_exec, k8s_valid, k8s_pod_running, k8s_service_active
  • Progress = how many checks flip from fail → pass compared to the initial broken state
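
The progress rule above can be sketched in a few lines. This is an illustration under stated assumptions, not the environment's implementation: the function name is hypothetical, check results are modelled as plain booleans, and a check missing from either snapshot is never counted.

```python
from typing import Dict

# The seven check names listed above.
CHECKS = (
    "docker_build", "docker_run", "workflow_parse", "workflow_exec",
    "k8s_valid", "k8s_pod_running", "k8s_service_active",
)


def count_flipped(initial: Dict[str, bool], current: Dict[str, bool]) -> int:
    """Count checks that failed in the initial broken state and pass now."""
    return sum(
        1
        for name in CHECKS
        # Missing entries default so that they are never counted as progress.
        if not initial.get(name, True) and current.get(name, False)
    )


initial = {name: True for name in CHECKS}
initial.update({"k8s_pod_running": False, "k8s_service_active": False})

current = dict(initial, k8s_pod_running=True)  # one fix applied so far

print(count_flipped(initial, current))  # 1
```

Comparing against the initial state rather than the previous step means a check that was already passing at reset never counts as progress.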

Available Actions

Each step, the agent chooses exactly one action:

| Action | What It Does | When to Use |
|--------|--------------|-------------|
| edit_file | Replace old_content with new_content in a file | Most common: fix a broken line or block |
| replace_line | Replace content at a specific line number | When you know exactly which line is wrong |
| add_line | Insert a new line into a file | Adding missing instructions (e.g., missing WORKDIR) |
| delete_line | Remove a specific line | Removing a bad instruction |
| add_block | Insert a multi-line block | Adding entire sections (e.g., env: block with secrets) |
| delete_block | Remove a multi-line block | Removing incorrect sections |
| request_hint | Get a clue about what's wrong | Costs 3-4% of the final score; use sparingly |
| submit | Declare "I'm done" and trigger final evaluation | When all fixes are applied |

Important: edit_file requires old_content to match exactly (including whitespace). If it doesn't match, the edit fails and the agent gets a -0.02 reward penalty.
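
The exact-match rule can be illustrated with a short sketch. The helper below is hypothetical and only mirrors the behaviour described above; replacing just the first occurrence and returning 0.0 for a successful edit (before any fix reward is added) are assumptions.

```python
from typing import Tuple

FAILED_EDIT_REWARD = -0.02  # penalty quoted above for an edit that does not apply


def apply_edit(content: str, old_content: str, new_content: str) -> Tuple[str, float]:
    """Apply an exact-match replacement and return (new_content, step_reward)."""
    # The match is literal: whitespace and quoting must be identical.
    if not old_content or old_content not in content:
        return content, FAILED_EDIT_REWARD
    # Assumption: only the first occurrence is replaced.
    return content.replace(old_content, new_content, 1), 0.0


manifest = 'limits:\n  memory: "64Mi"\n'

# An extra space after the colon means the text is not found, so the edit fails.
print(apply_edit(manifest, 'memory:  "64Mi"', 'memory: "512Mi"')[1])  # -0.02
# An exact match succeeds and the file content changes.
print(apply_edit(manifest, 'memory: "64Mi"', 'memory: "512Mi"')[0])
```

Copying old_content verbatim from the latest observation, rather than retyping it, is the simplest way for an agent to avoid the penalty.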


Grading System

Scoring is deterministic (same actions always produce the same score) and difficulty-aware (harder tasks are graded more generously). Scores are strictly in (0, 1) exclusive, never exactly 0 or 1.

The Formula

FINAL SCORE = Base + Partial Fixes + Complete Bonus + Difficulty Bonus + Efficiency - Hint Penalty - Failed Edit Penalty

The result is clamped to the range 0.01 to 0.99, which keeps every score strictly inside (0, 1).

Component Breakdown

| Component | Weight | Description |
|-----------|--------|-------------|
| Base score | 5% | Participation credit (guarantees score > 0) |
| Partial fixes | 35% | Proportional to issues_fixed / issues_total |
| Complete bonus | 25% | All issues fixed |
| Difficulty bonus | 0-3% | Extra reward for fully solving hard/expert tasks |
| Efficiency | 25% | Decays with extra steps, with slower decay for harder tasks |
| Hint penalty | -3% to -4% each | Per request_hint action (cheaper for hard/expert) |
| Failed edit penalty | -2% each | Per edit with no valid file path |

Difficulty Modifiers

| Difficulty | Max Score | Efficiency Decay | Hint Cost |
|------------|-----------|------------------|-----------|
| Easy | 0.90 | 0.03/step (strict) | 4% each |
| Medium | 0.90 | 0.027/step | 4% each |
| Hard/Expert | 0.93 | 0.021/step (forgiving) | 3% each |
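
The weights and modifiers above can be combined into a rough scoring sketch. The weights, decay rates, hint costs and clamp come from the tables; everything else is an assumption made for illustration and may differ from the real grader: the efficiency term is modelled as linear decay per step beyond issues_total, the difficulty bonus as a flat 3% for fully solved hard/expert tasks, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Modifiers:
    difficulty_bonus: float  # added only when every issue is fixed (assumed flat)
    efficiency_decay: float  # efficiency lost per extra step
    hint_cost: float         # penalty per request_hint


MODIFIERS = {
    "easy": Modifiers(0.0, 0.03, 0.04),
    "medium": Modifiers(0.0, 0.027, 0.04),
    "hard": Modifiers(0.03, 0.021, 0.03),
    "expert": Modifiers(0.03, 0.021, 0.03),
}


def final_score(difficulty: str, issues_fixed: int, issues_total: int,
                steps: int, hints: int = 0, failed_edits: int = 0) -> float:
    m = MODIFIERS[difficulty]
    score = 0.05                                   # base score
    score += 0.35 * issues_fixed / issues_total    # partial fixes
    if issues_fixed == issues_total:
        score += 0.25 + m.difficulty_bonus         # complete + difficulty bonus
    # Assumption: one step per issue is "free"; each extra step decays efficiency.
    extra_steps = max(0, steps - issues_total)
    score += max(0.0, 0.25 - m.efficiency_decay * extra_steps)
    score -= m.hint_cost * hints + 0.02 * failed_edits
    return min(0.99, max(0.01, score))             # clamp to 0.01..0.99


print(f"{final_score('easy', 1, 1, steps=1):.2f}")           # 0.90
print(f"{final_score('hard', 2, 2, steps=2):.2f}")           # 0.93
print(f"{final_score('easy', 0, 1, steps=10, hints=5):.2f}")  # 0.01
```

Under these assumptions a perfect run reproduces the Max Score column (0.90 for easy and medium, 0.93 for hard/expert), which is a useful sanity check on the component weights.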

Evaluation

The evaluation pipeline runs all 50 scenarios across all 10 tasks deterministically:

# Runs all 10 tasks × 5 scenarios = 50 episodes
results = run_baseline_episodes()  # num_episodes=None runs all

# Per-episode scores in (0, 1)
# Aggregate = mean of all 50 scores
aggregate = sum(r.score for r in results) / len(results)

This ensures:

  • Reproducibility: same agent produces same score every time
  • Complete coverage: every error pattern is tested
  • Fair comparison: all agents face the same 50 scenarios

API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| / | GET | Root page |
| /health | GET | Health check, returns {"status": "healthy"} |
| /metadata | GET | Environment name, description, version, tags |
| /schema | GET | Action, observation, and state JSON schemas |
| /reset | POST | Start a new episode (optional: task_id, scenario_id, seed) |
| /step | POST | Take an action and receive observation + reward |
| /state | GET | Get current observation without taking an action |
| /info | GET | Task list with metadata |
| /tasks | GET | List all tasks with difficulty levels |
| /grader | POST | Grade a trajectory (list of step dicts) |
| /baseline | POST | Run baseline across all scenarios (optional: task_id, num_episodes) |
| /mcp | POST | JSON-RPC 2.0 MCP endpoint (initialize, tools/list) |

Example: Full Episode via API

# 1. Start an episode
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "k8s_pod_failures", "scenario_id": "oom_killed"}'

# 2. Fix the memory limit (any reasonable value works; the simulator validates structurally)
curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{
    "action": {
      "action_type": "edit_file",
      "edits": [{
        "file_path": "k8s/deployment.yaml",
        "old_content": "memory: \"64Mi\"",
        "new_content": "memory: \"512Mi\""
      }]
    }
  }'

# Response: reward=0.3, issues_fixed=1/1, done=true
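
The same two requests can be issued from Python with only the standard library. This is a sketch assuming the server is reachable at http://localhost:7860 and accepts the payload shapes shown in the curl commands; the helper names are hypothetical, and the response is returned as parsed JSON without assuming its fields.

```python
import json
import urllib.request
from typing import Any, Dict

BASE_URL = "http://localhost:7860"  # assumption: local server on the default port


def build_edit_action(file_path: str, old_content: str, new_content: str) -> Dict[str, Any]:
    """Build the /step body for a single edit_file action, as in the curl example."""
    return {
        "action": {
            "action_type": "edit_file",
            "edits": [{
                "file_path": file_path,
                "old_content": old_content,
                "new_content": new_content,
            }],
        }
    }


def post_json(path: str, payload: Dict[str, Any]) -> Any:
    """POST a JSON body to the environment and return the decoded JSON response."""
    request = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


step_body = build_edit_action("k8s/deployment.yaml", 'memory: "64Mi"', 'memory: "512Mi"')
print(json.dumps(step_body, indent=2))

# With the server running, the episode would be driven like this:
# post_json("/reset", {"task_id": "k8s_pod_failures", "scenario_id": "oom_killed"})
# post_json("/step", step_body)
```

Building the payload with json.dumps avoids the manual quote escaping that the shell version needs.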

Quick Start

Local Development

pip install -r requirements.txt
python -m uvicorn server.app:app --host 0.0.0.0 --port 7860

Run Tests

pytest tests/ -v

Docker

docker build -t cloud-native-devops-env .
docker run -p 7860:7860 cloud-native-devops-env

Baseline Inference (with LLM)

export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=meta-llama/Llama-3.1-70B-Instruct
export HF_TOKEN=your_token_here
python inference.py

Project Structure

cloud-native-devops-env/
├── openenv.yaml              # OpenEnv environment specification
├── inference.py              # LLM baseline (OpenAI client + HF router)
├── baseline_runner.py        # Heuristic baseline, runs all 50 scenarios
├── Dockerfile                # Production container
├── requirements.txt          # Python dependencies
│
├── server/
│   ├── app.py                # FastAPI with 12 endpoints
│   ├── models.py             # Pydantic models (type-safe API)
│   ├── environment.py        # Core environment loop (reset/step/state)
│   ├── tasks/
│   │   ├── base.py           # BaseTask with scenario loading
│   │   ├── task_registry.py  # Maps task_id → task class (10 tasks)
│   │   ├── task_1_build_errors.py        # 5 Dockerfile syntax scenarios
│   │   ├── task_2_docker_runtime.py      # 5 Dockerfile runtime scenarios
│   │   ├── task_3_workflow_syntax.py     # 5 workflow structure scenarios
│   │   ├── task_4_workflow_secrets_permissions.py  # 5 secrets scenarios
│   │   ├── task_5_ci_docker_integration.py        # 5 integration scenarios
│   │   ├── task_6_multi_stage_matrix.py           # 5 multi-issue scenarios
│   │   ├── k8s_pod.py                   # 5 Kubernetes pod failure scenarios
│   │   ├── k8s_networking.py            # 5 K8s networking scenarios
│   │   ├── pipeline_build_deploy.py     # 5 GHA→Docker→Registry scenarios
│   │   └── pipeline_full.py             # 5 full-stack multi-error scenarios
│   ├── graders/
│   │   └── __init__.py       # Deterministic trajectory grader
│   └── simulators/
│       ├── docker_simulator.py   # Dockerfile build + runtime validation
│       ├── workflow_simulator.py # GHA workflow parse + execution validation
│       └── k8s_simulator.py     # K8s manifest + cross-resource validation
│
└── tests/
    ├── test_endpoints.py     # API endpoint tests
    ├── test_determinism.py   # Grader determinism + score range tests
    ├── test_baseline.py      # Heuristic baseline tests
    ├── test_environment_flow.py  # Episode flow tests
    └── test_simulators.py    # Simulator unit tests

Design Decisions

  1. Full cloud-native stack: Docker, GitHub Actions, and Kubernetes, the three pillars of modern deployment pipelines.
  2. Simulator-based validation: Structural rule-based simulators validate fixes instead of string matching. Alternative valid fixes are accepted (e.g., 512Mi and 256Mi both fix an OOM). Deterministic, fast, no security concerns.
  3. Dense rewards: Partial credit at every step (+0.3 per fix, -0.02 per failed edit) rather than sparse pass/fail.
  4. Difficulty progression: Easy tasks are single-file, single-issue. Expert tasks are multi-file, multi-issue with interacting bugs across all three layers.
  5. Vague error messages in harder tasks: Easy tasks have explicit error messages. Hard/Expert tasks have realistic, vague messages that require the agent to actually diagnose the issue from context.
  6. Deterministic evaluation: All 50 scenarios run every time for reproducible, comparable scores in (0, 1) exclusive.
  7. 50 scenarios from real bugs: Every scenario is based on actual developer mistakes documented on Stack Overflow, GitHub Issues, and official documentation.

License

MIT