title: Cloud-Native DevOps Debug Environment
emoji: 🔧
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false

Cloud-Native DevOps Debug Environment

An OpenEnv-compatible environment where AI agents learn to debug broken GitHub Actions workflows, Dockerfiles, and Kubernetes manifests. Built for the OpenEnv Hackathon by Scaler School of Technology (partners: Meta, Hugging Face, PyTorch).

Why Cloud-Native Debugging?

Every developer who ships code hits deployment pipeline failures. A misconfigured Dockerfile, a broken GitHub Actions workflow, a missing secret, a Kubernetes selector mismatch — these are the bugs that waste hours of developer time every week. They're hard to debug because:

Error messages are cryptic ("unable to prepare context: unable to evaluate symlinks")
The feedback loop is slow (push, wait for CI, read logs, fix, repeat)
Multiple config files interact in non-obvious ways (Dockerfile + workflow + secrets + K8s manifests)
Kubernetes errors require cross-resource reasoning (Deployment labels must match Service selectors)

This environment teaches AI agents to do what senior DevOps engineers do: read the error, trace it to the root cause across multiple files, and fix it.

How It Works: The Complete Flow

┌──────────────────────────────────────────────────────────────┐
│  1. RESET                                                     │
│     Agent receives:                                           │
│     - Broken config files (Dockerfile / workflow / K8s YAML)  │
│     - Error message from the failed build/deploy              │
│     - Available secrets list                                  │
│     - Number of issues to find                                │
├──────────────────────────────────────────────────────────────┤
│  2. OBSERVE → THINK → ACT  (repeat up to 10 steps)           │
│     Agent reads the error, analyzes the files, then:          │
│     - edit_file: replace broken content with fixed content    │
│     - replace_line: fix a specific line number                │
│     - add_line / add_block: insert missing content            │
│     - delete_line / delete_block: remove bad content          │
│     - request_hint: get a clue (-3% to -4% score)             │
│     - submit: "I'm done fixing"                               │
│                                                               │
│     After each action, agent gets:                            │
│     - Updated file contents                                   │
│     - Reward signal (+0.3 per fix, -0.02 for failed edits)   │
│     - How many issues are now fixed                           │
├──────────────────────────────────────────────────────────────┤
│  3. GRADE                                                     │
│     Deterministic scoring based on:                           │
│     - What fraction of issues were fixed                      │
│     - Whether ALL issues were fixed (bonus)                   │
│     - How many steps it took (efficiency)                     │
│     - How many hints were used (penalty)                      │
│     Score range: (0, 1) exclusive — never exactly 0 or 1     │
└──────────────────────────────────────────────────────────────┘
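The flow above can be sketched as a small driver loop. This is a minimal illustration, not the project's own client: `reset_fn`, `step_fn`, and `policy` are hypothetical callables standing in for the `/reset` endpoint, the `/step` endpoint, and the agent, and the `done` key is assumed from the example response shown later in this README.

```python
from typing import Any, Callable, Dict, List

Observation = Dict[str, Any]
Action = Dict[str, Any]

MAX_STEPS = 10  # step budget stated in the flow diagram


def run_episode(
    reset_fn: Callable[[], Observation],
    step_fn: Callable[[Action], Observation],
    policy: Callable[[Observation], Action],
    max_steps: int = MAX_STEPS,
) -> List[Action]:
    """Drive one RESET -> (OBSERVE, THINK, ACT)* loop and return the actions taken."""
    observation = reset_fn()
    trajectory: List[Action] = []
    for _ in range(max_steps):
        action = policy(observation)       # THINK: choose exactly one action
        trajectory.append(action)
        observation = step_fn(action)      # ACT: apply it, OBSERVE the result
        # Stop when the agent submits or the environment reports completion.
        if action.get("action_type") == "submit" or observation.get("done"):
            break
    return trajectory
```

Passing the environment in as callables keeps the loop independent of the transport, so the same function can be exercised against the HTTP API or against an in-process stub.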

The 10 Tasks (50 Scenarios)

Evaluation runs all 50 scenarios deterministically across all 10 tasks for reproducible scoring.

Task 1: Dockerfile Syntax Errors — Easy

Simple typos and instruction errors that break docker build.

#	Scenario	What's Broken	Real-World Context
1	`typo_filename`	`COPY requirments.txt .` — misspelled filename	Most common Docker build error on Stack Overflow
2	`invalid_base_image`	`FROM python:3.9-slimm` — extra 'm' in tag	Happens when copy-pasting image tags
3	`invalid_run_syntax`	`RUN pip install ... \n && python setup.py` — broken line continuation	Formatting multi-line RUN commands is tricky
4	`copy_missing_source`	`COPY dist/` but build output is in `build/`	Source directory doesn't exist in build context
5	`missing_from_instruction`	No `FROM` instruction at all	Dockerfile must start with FROM
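As an illustration of the kind of structural rule behind scenario 5, the check below tests whether a Dockerfile's first instruction is `FROM`. It is a simplified sketch, not the project's DockerSimulator: the function name `has_valid_from` is hypothetical, and line continuations are not handled. It relies on the Dockerfile rule that `ARG` is the only instruction allowed before `FROM`.

```python
def has_valid_from(dockerfile_text: str) -> bool:
    """Return True if the first real instruction is FROM (leading ARGs allowed)."""
    for raw_line in dockerfile_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments, and parser directives
        keyword = line.split(maxsplit=1)[0].upper()
        if keyword == "ARG":
            continue  # ARG is the only instruction that may precede FROM
        return keyword == "FROM"
    return False  # no instructions at all
```

A rule like this accepts any valid base image line, which is what makes it structural rather than a string comparison against one expected fix.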

Task 2: Dockerfile Runtime Errors — Medium

The Dockerfile builds successfully, but the container crashes at runtime.

#	Scenario	What's Broken	Real-World Context
1	`missing_workdir`	No WORKDIR — files scatter to `/`	Container runs but `npm start` can't find `package.json`
2	`cmd_entrypoint_conflict`	Both ENTRYPOINT and CMD defined as full commands	Process starts incorrectly
3	`entrypoint_not_executable`	Shell script lacks execute permission	`chmod +x` missing — "permission denied"
4	`missing_required_env`	App needs `DATABASE_URL` but it's not set	Container crashes: "DATABASE_URL is not defined"
5	`non_root_privileged_port`	Non-root user tries to bind port 80	Security best practice conflicts with port < 1024

Task 3: Workflow Syntax & Structure — Easy

GitHub Actions YAML has structural problems that GitHub rejects before any job runs.

#	Scenario	What's Broken	Real-World Context
1	`checkout_after_build`	`docker build` before `actions/checkout`	No source code — "Dockerfile not found"
2	`missing_runs_on`	Job has no `runs-on` field	Every job needs a runner
3	`invalid_trigger_syntax`	`branches: main` instead of `branches: [main]`	Must be a YAML list
4	`missing_step_uses_or_run`	Step has a name but no `uses:` or `run:`	Invalid step
5	`missing_on_trigger`	No `on:` block at all	Workflow never triggers

Task 4: Workflow Secrets & Permissions — Medium

Secrets exist but aren't wired correctly to the workflow steps.

#	Scenario	What's Broken	Real-World Context
1	`missing_env_secrets`	`$DOCKER_PASSWORD` without `env:` mapping	Secrets must be passed via `env:` block
2	`wrong_secret_syntax`	`${ secrets.TOKEN }` instead of `${{ secrets.TOKEN }}`	Single vs double braces
3	`missing_token_permissions`	Pushing to GHCR without `permissions: packages: write`	GITHUB_TOKEN is read-only by default
4	`secret_not_in_env`	`$SLACK_WEBHOOK_URL` not in `env:`	Very common mistake
5	`ghcr_wrong_credentials`	Using `DOCKER_PASSWORD` for GHCR login	GHCR uses `GITHUB_TOKEN`
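The single-brace mistake in scenario 2 is easy to detect mechanically. The sketch below is an assumption about how such a check could look, not the WorkflowSimulator's actual logic; `find_single_brace_secrets` is a hypothetical helper name.

```python
import re
from typing import List

# Matches "${ secrets.NAME }" written with single braces. The negative
# lookahead skips the valid "${{ secrets.NAME }}" expression form.
SINGLE_BRACE_SECRET = re.compile(r"\$\{(?!\{)\s*secrets\.(\w+)\s*\}")


def find_single_brace_secrets(workflow_text: str) -> List[str]:
    """Return the names of secrets referenced with the wrong brace syntax."""
    return SINGLE_BRACE_SECRET.findall(workflow_text)
```

An empty result means no single-brace references were found; it does not prove the secrets are wired correctly, which still requires checking `env:` mappings and permissions.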

Task 5: CI + Docker Integration — Medium

The workflow AND the Dockerfile interact. Fixing one file alone isn't enough.

#	Scenario	What's Broken	Real-World Context
1	`missing_buildx_for_platforms`	Multi-platform build without `setup-buildx-action`	Need BuildKit for cross-compile
2	`missing_load_true`	`build-push-action` without `load: true` — next step can't find image	Buildx doesn't load into local daemon by default
3	`wrong_build_context`	Context is `./backend` but Dockerfile path is `./Dockerfile`	Path mismatch
4	`cache_without_mode_max`	GHA cache export missing `mode=max`	Cache doesn't persist
5	`push_without_login`	`docker push` without `docker login` first	"denied: requested access"

Task 6: Multi-Stage Pipeline & Matrix — Hard

Complex pipelines with multiple interacting bugs. Agent must find 2-3 issues across files.

#	Scenario	What's Broken	Real-World Context
1	`artifact_path_mismatch`	`COPY --from=builder /app/dist` but React outputs to `/app/build`	CRA uses `build/`, Vite uses `dist/`
2	`matrix_platform_arg`	`$BUILDPLATFORM` without `ARG BUILDPLATFORM`	Multi-arch needs platform ARGs
3	`cross_job_artifact`	Test job downloads artifact but missing `needs: build`	Jobs run in parallel by default
4	`multiple_issues`	Dockerfile typo + workflow secrets not wired (2 bugs)	Problems compound across files
5	`matrix_version_failure`	Matrix includes Node 14 but code needs >= 16 + missing `needs:`	2 bugs to find

Task 7: Kubernetes Pod Failures — Medium

Pod crashes and scheduling failures in Kubernetes deployments.

#	Scenario	What's Broken	Real-World Context
1	`oom_killed`	Memory limit 64Mi too low — CrashLoopBackOff/OOMKilled	Most common K8s production issue
2	`image_pull_backoff`	Image tag typo `nginx:latset` → ImagePullBackOff	Copy-paste tag errors
3	`wrong_command`	`command: ["python", "workers.py"]` but file is `worker.py`	File name mismatch
4	`missing_configmap`	`envFrom: configMapRef: app-config` but ConfigMap doesn't exist	CreateContainerConfigError
5	`liveness_probe_failing`	Liveness probe port 3000 but app listens on 8080	Probe misconfiguration causes restarts

Task 8: Kubernetes Service & Ingress Issues — Hard

Networking issues where pods run fine but traffic doesn't reach them. Error messages are intentionally vague — the agent must diagnose from kubectl output.

#	Scenario	What's Broken	Real-World Context
1	`selector_mismatch`	Service selector `app: api` but pod label is `app: api-server`	No endpoints — most common K8s networking bug
2	`port_mismatch`	Service targetPort 8080 but container listens on 3000	Connection refused
3	`ingress_wrong_service`	Ingress references `api-svc` but service name is `api-service`	Ingress 404
4	`network_policy_blocking`	NetworkPolicy with empty ingress rules blocks all traffic	Database unreachable
5	`missing_ingress_class`	No `ingressClassName: nginx` specified	Ingress controller doesn't pick it up
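The selector mismatch in scenario 1 comes down to a subset test: a Service selects a pod only when every key/value pair in its selector also appears in the pod's labels. A minimal sketch of that rule follows; `selector_matches` is a hypothetical name, and treating an empty selector as matching nothing is a simplification made for this check.

```python
from typing import Dict


def selector_matches(selector: Dict[str, str], pod_labels: Dict[str, str]) -> bool:
    """Return True if every selector key/value pair is present in the pod labels."""
    if not selector:
        return False  # simplification: an empty selector yields no endpoints here
    return all(pod_labels.get(key) == value for key, value in selector.items())
```

For the scenario above, `selector_matches({"app": "api"}, {"app": "api-server"})` is false, so the Service would have no endpoints even though the pods run fine.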

Task 9: CI/CD Build & Push Pipeline — Hard

GHA-to-Docker-to-Registry pipeline failures spanning multiple files.

#	Scenario	What's Broken	Real-World Context
1	`registry_mismatch`	Build tags `ghcr.io/...` but push targets `docker.io/...`	Registry URL mismatch between steps
2	`image_tag_mismatch`	Build uses `github.ref_name` but push uses `github.sha`	"image not found locally"
3	`inconsistent_tagging`	`docker tag myuser/api:latest` but image was built as `myuser/api:${{ github.sha }}`	Tag source doesn't exist
4	`build_arg_not_passed`	Dockerfile `ARG APP_VERSION` but no `--build-arg` in workflow	Version file is empty
5	`dockerfile_path_in_subdirectory`	Workflow points to `./Dockerfile` but it's at `./services/api/Dockerfile`	Monorepo path mismatch

Task 10: Full Stack Deployment Pipeline — Expert

Multi-error scenarios spanning the entire stack: GHA + Dockerfile + K8s manifests. 2-4 bugs per scenario requiring cross-file reasoning. Error messages are intentionally vague — the agent must trace root causes from symptoms.

#	Scenario	What's Broken	Real-World Context
1	`full_pipeline_ghcr_and_selector`	GHCR token not mapped + K8s Service selector mismatch	2 bugs across workflow + K8s
2	`full_pipeline_three_bugs`	Missing checkout + no WORKDIR + wrong container/service port	4 bugs across 4 files
3	`full_pipeline_ghcr_dockerfile_k8s`	Wrong GHCR secret + base image typo + OOM memory limit	3 bugs across all layers
4	`full_pipeline_permissions_image_ingress`	Missing packages:write + hardcoded image placeholder + no ingressClassName	3 bugs
5	`full_pipeline_secrets_build_probe`	Docker secrets not wired + wrong build output dir + probe port mismatch	4 bugs across all layers

Fix Validation: Simulator-Based

Fixes are validated using structural simulators, not string matching. This means:

Alternative valid fixes are accepted. Raising the memory limit to either 256Mi or 512Mi resolves the OOM, and the simulator accepts both.
Three independent simulators run after every edit:
- DockerSimulator: validates Dockerfile syntax (FROM, COPY, EXPOSE, RUN) and runtime behavior (WORKDIR, CMD/ENTRYPOINT, permissions, ENV)
- WorkflowSimulator: parses YAML, checks triggers, runs-on, step ordering, secrets wiring, permissions, buildx requirements, registry consistency
- KubernetesSimulator: validates manifests, cross-resource dependencies (Service selector ↔ Deployment labels), pod status simulation (OOM, ImagePullBackOff), service endpoint reachability
7 granular checks are tracked: docker_build, docker_run, workflow_parse, workflow_exec, k8s_valid, k8s_pod_running, k8s_service_active
Progress = how many checks flip from fail → pass compared to the initial broken state
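The progress rule can be expressed as a count over the seven checks. This is a sketch under assumptions: each simulator run is represented as a dict mapping check name to pass/fail, and `count_progress` is a hypothetical function name; only the check names come from this README.

```python
from typing import Dict

CHECKS = (
    "docker_build", "docker_run",
    "workflow_parse", "workflow_exec",
    "k8s_valid", "k8s_pod_running", "k8s_service_active",
)


def count_progress(initial: Dict[str, bool], current: Dict[str, bool]) -> int:
    """Count checks that failed in the initial broken state and pass now."""
    return sum(
        1
        for name in CHECKS
        if not initial.get(name, False) and current.get(name, False)
    )
```

Checks that passed initially are ignored, so an edit that breaks something else does not add progress, and it does not subtract any in this simplified version either.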

Available Actions

Each step, the agent chooses exactly one action:

Action	What It Does	When to Use
`edit_file`	Replace `old_content` with `new_content` in a file	Most common — fix a broken line or block
`replace_line`	Replace content at a specific line number	When you know exactly which line is wrong
`add_line`	Insert a new line into a file	Adding missing instructions (e.g., missing `WORKDIR`)
`delete_line`	Remove a specific line	Removing a bad instruction
`add_block`	Insert a multi-line block	Adding entire sections (e.g., `env:` block with secrets)
`delete_block`	Remove a multi-line block	Removing incorrect sections
`request_hint`	Get a clue about what's wrong	Costs 3-4% of the final score per hint, depending on difficulty; use sparingly
`submit`	Declare "I'm done" — triggers final evaluation	When all fixes are applied

Important: edit_file requires old_content to match exactly (including whitespace). If it doesn't match, the edit fails and the agent gets a -0.02 reward penalty.
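The exact-match rule can be illustrated with a plain substring replacement. This is a hedged sketch rather than the environment's implementation: `apply_edit` is a hypothetical helper, and replacing only the first occurrence is an assumption.

```python
from typing import Tuple

FAILED_EDIT_REWARD = -0.02  # per-step penalty quoted in this README


def apply_edit(content: str, old_content: str, new_content: str) -> Tuple[str, bool]:
    """Replace the first exact occurrence of old_content; report whether it matched."""
    if not old_content or old_content not in content:
        return content, False  # no exact match: file unchanged, edit fails
    return content.replace(old_content, new_content, 1), True
```

Because the match is character-for-character, `memory:  "64Mi"` with two spaces does not match `memory: "64Mi"`. An agent is safest copying `old_content` verbatim from the latest observation.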

Grading System

Scoring is deterministic (the same actions always produce the same score) and difficulty-aware (harder tasks are graded more generously). Scores lie strictly inside (0, 1): never exactly 0 or 1.

The Formula

FINAL SCORE = Base + Partial Fixes + Complete Bonus + Difficulty Bonus + Efficiency - Hint Penalty - Failed Edit Penalty

The result is clamped to the closed interval [0.01, 0.99], which keeps every score strictly inside (0, 1).

Component Breakdown

Component	Weight	Description
Base score	5%	Participation credit (guarantees score > 0)
Partial fixes	35%	Proportional to `issues_fixed / issues_total`
Complete bonus	25%	All issues fixed
Difficulty bonus	0-3%	Extra reward for fully solving hard/expert tasks
Efficiency	25%	Decays with extra steps — slower decay for harder tasks
Hint penalty	-3% to -4% each	Per `request_hint` action (cheaper for hard/expert)
Failed edit penalty	-2% each	Per edit with no valid file path

Difficulty Modifiers

Difficulty	Max Score	Efficiency Decay	Hint Cost
Easy	0.90	0.03/step (strict)	4% each
Medium	0.90	0.027/step	4% each
Hard/Expert	0.93	0.021/step (forgiving)	3% each
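The formula and tables above can be combined into a rough scoring sketch. The weights, decay rates, and hint costs come from this README. Everything else is an assumption made for illustration: the function and class names, the idea that one step per issue is free before efficiency decays, and the flat 3% difficulty bonus for fully solved hard/expert tasks. The real grader may differ in these details.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DifficultyProfile:
    efficiency_decay: float   # efficiency lost per extra step
    hint_cost: float          # penalty per request_hint
    difficulty_bonus: float   # awarded only when all issues are fixed


PROFILES = {
    "easy": DifficultyProfile(0.03, 0.04, 0.0),
    "medium": DifficultyProfile(0.027, 0.04, 0.0),
    "hard": DifficultyProfile(0.021, 0.03, 0.03),
    "expert": DifficultyProfile(0.021, 0.03, 0.03),
}


def final_score(difficulty: str, issues_fixed: int, issues_total: int,
                steps: int, hints: int = 0, failed_edits: int = 0) -> float:
    """Approximate the README's scoring formula under the stated assumptions."""
    if issues_total <= 0:
        raise ValueError("issues_total must be positive")
    profile = PROFILES[difficulty]
    solved = issues_fixed == issues_total
    # Assumption: one step per issue is free; each extra step decays efficiency.
    extra_steps = max(0, steps - issues_total)
    efficiency = max(0.0, 0.25 - profile.efficiency_decay * extra_steps)
    score = (
        0.05                                         # base score
        + 0.35 * (issues_fixed / issues_total)       # partial fixes
        + (0.25 if solved else 0.0)                  # complete bonus
        + (profile.difficulty_bonus if solved else 0.0)
        + efficiency
        - profile.hint_cost * hints
        - 0.02 * failed_edits
    )
    return min(0.99, max(0.01, score))               # clamp to [0.01, 0.99]
```

Under these assumptions a perfect one-step easy episode scores 0.90 and a perfect hard episode scores 0.93, matching the Max Score column above.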

Evaluation

The evaluation pipeline runs all 50 scenarios across all 10 tasks deterministically:

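# Assumption: run_baseline_episodes is importable from baseline_runner.py
# (see Project Structure); adjust the import if it lives elsewhere.
from baseline_runner import run_baseline_episodes

# Runs all 10 tasks × 5 scenarios = 50 episodes
results = run_baseline_episodes()  # num_episodes=None runs all

# Per-episode scores lie in (0, 1)
# Aggregate = mean of all 50 scores
aggregate = sum(r.score for r in results) / len(results)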

This ensures:

Reproducibility: same agent produces same score every time
Complete coverage: every error pattern is tested
Fair comparison: all agents face the same 50 scenarios
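Alongside the overall mean, a per-task breakdown helps show where an agent struggles. The sketch below assumes each result carries a `task_id` next to its `score`; `EpisodeResult` is a hypothetical stand-in for whatever the baseline runner actually returns.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Iterable, List


@dataclass
class EpisodeResult:
    """Hypothetical result record; only `score` is confirmed by this README."""
    task_id: str
    score: float


def mean_score_by_task(results: Iterable[EpisodeResult]) -> Dict[str, float]:
    """Group episode scores by task and average each group."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for result in results:
        grouped[result.task_id].append(result.score)
    return {task_id: mean(scores) for task_id, scores in grouped.items()}
```

Because every task has the same number of scenarios, the mean of these per-task means equals the aggregate over all 50 episodes.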

API Endpoints

Endpoint	Method	Description
`/`	GET	Root page
`/health`	GET	Health check — returns `{"status": "healthy"}`
`/metadata`	GET	Environment name, description, version, tags
`/schema`	GET	Action, observation, and state JSON schemas
`/reset`	POST	Start a new episode (optional: `task_id`, `scenario_id`, `seed`)
`/step`	POST	Take an action and receive observation + reward
`/state`	GET	Get current observation without taking an action
`/info`	GET	Task list with metadata
`/tasks`	GET	List all tasks with difficulty levels
`/grader`	POST	Grade a trajectory (list of step dicts)
`/baseline`	POST	Run baseline across all scenarios (optional: `task_id`, `num_episodes`)
`/mcp`	POST	JSON-RPC 2.0 MCP endpoint (initialize, tools/list)
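For the `/mcp` endpoint, a request body follows the JSON-RPC 2.0 envelope. The helper below is a minimal sketch: `jsonrpc_request` is a hypothetical name, and any `params` the server expects for `initialize` are not specified in this README, so they are left to the caller.

```python
import json
from typing import Any, Dict, Optional


def jsonrpc_request(method: str, request_id: int,
                    params: Optional[Dict[str, Any]] = None) -> str:
    """Serialize a JSON-RPC 2.0 request such as 'initialize' or 'tools/list'."""
    message: Dict[str, Any] = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        message["params"] = params  # omitted entirely when there are no params
    return json.dumps(message)
```

For example, `jsonrpc_request("tools/list", 1)` produces a body that can be sent to `/mcp` with `Content-Type: application/json`.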

Example: Full Episode via API

# 1. Start an episode
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "k8s_pod_failures", "scenario_id": "oom_killed"}'

# 2. Fix the memory limit (any reasonable value works — simulator validates structurally)
curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{
    "action": {
      "action_type": "edit_file",
      "edits": [{
        "file_path": "k8s/deployment.yaml",
        "old_content": "memory: \"64Mi\"",
        "new_content": "memory: \"512Mi\""
      }]
    }
  }'

# Response: reward=0.3, issues_fixed=1/1, done=true
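The same two calls can be made from Python with only the standard library. This is a sketch mirroring the curl commands above; `build_post` and `send` are hypothetical helper names, and the response is assumed to be a JSON object.

```python
import json
import urllib.request
from typing import Any, Dict

BASE_URL = "http://localhost:7860"  # port from the Quick Start section

RESET_PAYLOAD = {"task_id": "k8s_pod_failures", "scenario_id": "oom_killed"}
STEP_PAYLOAD = {
    "action": {
        "action_type": "edit_file",
        "edits": [{
            "file_path": "k8s/deployment.yaml",
            "old_content": 'memory: "64Mi"',
            "new_content": 'memory: "512Mi"',
        }],
    }
}


def build_post(path: str, payload: Dict[str, Any],
               base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a JSON POST request for an environment endpoint."""
    return urllib.request.Request(
        url=f"{base_url}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> Dict[str, Any]:
    """Send the request and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running locally, `send(build_post("/reset", RESET_PAYLOAD))` followed by `send(build_post("/step", STEP_PAYLOAD))` reproduces the episode shown above.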

Quick Start

Local Development

pip install -r requirements.txt
python -m uvicorn server.app:app --host 0.0.0.0 --port 7860

Run Tests

pytest tests/ -v

Docker

docker build -t cloud-native-devops-env .
docker run -p 7860:7860 cloud-native-devops-env

Baseline Inference (with LLM)

export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=meta-llama/Llama-3.1-70B-Instruct
export HF_TOKEN=your_token_here
python inference.py

Project Structure

cloud-native-devops-env/
├── openenv.yaml              # OpenEnv environment specification
├── inference.py              # LLM baseline (OpenAI client + HF router)
├── baseline_runner.py        # Heuristic baseline — runs all 50 scenarios
├── Dockerfile                # Production container
├── requirements.txt          # Python dependencies
│
├── server/
│   ├── app.py                # FastAPI with 12 endpoints
│   ├── models.py             # Pydantic models (type-safe API)
│   ├── environment.py        # Core environment loop (reset/step/state)
│   ├── tasks/
│   │   ├── base.py           # BaseTask with scenario loading
│   │   ├── task_registry.py  # Maps task_id → task class (10 tasks)
│   │   ├── task_1_build_errors.py        # 5 Dockerfile syntax scenarios
│   │   ├── task_2_docker_runtime.py      # 5 Dockerfile runtime scenarios
│   │   ├── task_3_workflow_syntax.py     # 5 workflow structure scenarios
│   │   ├── task_4_workflow_secrets_permissions.py  # 5 secrets scenarios
│   │   ├── task_5_ci_docker_integration.py        # 5 integration scenarios
│   │   ├── task_6_multi_stage_matrix.py           # 5 multi-issue scenarios
│   │   ├── k8s_pod.py                   # 5 Kubernetes pod failure scenarios
│   │   ├── k8s_networking.py            # 5 K8s networking scenarios
│   │   ├── pipeline_build_deploy.py     # 5 GHA→Docker→Registry scenarios
│   │   └── pipeline_full.py             # 5 full-stack multi-error scenarios
│   ├── graders/
│   │   └── __init__.py       # Deterministic trajectory grader
│   └── simulators/
│       ├── docker_simulator.py   # Dockerfile build + runtime validation
│       ├── workflow_simulator.py # GHA workflow parse + execution validation
│       └── k8s_simulator.py     # K8s manifest + cross-resource validation
│
└── tests/
    ├── test_endpoints.py     # API endpoint tests
    ├── test_determinism.py   # Grader determinism + score range tests
    ├── test_baseline.py      # Heuristic baseline tests
    ├── test_environment_flow.py  # Episode flow tests
    └── test_simulators.py    # Simulator unit tests

Design Decisions

Full cloud-native stack: Docker + GitHub Actions + Kubernetes — the three pillars of modern deployment pipelines.
Simulator-based validation: Structural, rule-based simulators validate fixes instead of string matching, so alternative valid fixes are accepted (e.g., 256Mi and 512Mi both fix an OOM). Validation is deterministic and fast, and it raises no security concerns.
Dense rewards: Partial credit at every step (+0.3 per fix, -0.02 per failed edit) rather than sparse pass/fail.
Difficulty progression: Easy tasks are single-file, single-issue. Expert tasks are multi-file, multi-issue with interacting bugs across all three layers.
Vague error messages in harder tasks: Easy tasks have explicit error messages. Hard/Expert tasks have realistic, vague messages that require the agent to actually diagnose the issue from context.
Deterministic evaluation: All 50 scenarios run every time for reproducible, comparable scores in (0, 1) exclusive.
50 scenarios from real bugs: Every scenario is based on actual developer mistakes documented on Stack Overflow, GitHub Issues, and official documentation.

License

MIT