---
title: Cloud-Native DevOps Debug Environment
emoji: πŸ”§
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
---
# Cloud-Native DevOps Debug Environment
An OpenEnv-compatible environment where AI agents learn to debug broken GitHub Actions workflows, Dockerfiles, and Kubernetes manifests. Built for the OpenEnv Hackathon by Scaler School of Technology (partners: Meta, HuggingFace, PyTorch).
## Why Cloud-Native Debugging?
Every developer who ships code hits deployment pipeline failures. A misconfigured Dockerfile, a broken GitHub Actions workflow, a missing secret, a Kubernetes selector mismatch β€” these are the bugs that waste hours of developer time every week. They're hard to debug because:
- Error messages are cryptic ("unable to prepare context: unable to evaluate symlinks")
- The feedback loop is slow (push, wait for CI, read logs, fix, repeat)
- Multiple config files interact in non-obvious ways (Dockerfile + workflow + secrets + K8s manifests)
- Kubernetes errors require cross-resource reasoning (Deployment labels must match Service selectors)
This environment teaches AI agents to do what senior DevOps engineers do: read the error, trace it to the root cause across multiple files, and fix it.
---
## How It Works: The Complete Flow
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ 1. RESET β”‚
β”‚ Agent receives: β”‚
β”‚ - Broken config files (Dockerfile / workflow / K8s YAML) β”‚
β”‚ - Error message from the failed build/deploy β”‚
β”‚ - Available secrets list β”‚
β”‚ - Number of issues to find β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 2. OBSERVE β†’ THINK β†’ ACT (repeat up to 10 steps) β”‚
β”‚ Agent reads the error, analyzes the files, then: β”‚
β”‚ - edit_file: replace broken content with fixed content β”‚
β”‚ - replace_line: fix a specific line number β”‚
β”‚ - add_line / add_block: insert missing content β”‚
β”‚ - delete_line / delete_block: remove bad content β”‚
β”‚ - request_hint: get a clue (-4% score penalty) β”‚
β”‚ - submit: "I'm done fixing" β”‚
β”‚ β”‚
β”‚ After each action, agent gets: β”‚
β”‚ - Updated file contents β”‚
β”‚ - Reward signal (+0.3 per fix, -0.02 for failed edits) β”‚
β”‚ - How many issues are now fixed β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 3. GRADE β”‚
β”‚ Deterministic scoring based on: β”‚
β”‚ - What fraction of issues were fixed β”‚
β”‚ - Whether ALL issues were fixed (bonus) β”‚
β”‚ - How many steps it took (efficiency) β”‚
β”‚ - How many hints were used (penalty) β”‚
β”‚ Score range: (0, 1) exclusive β€” never exactly 0 or 1 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
---
## The 10 Tasks (50 Scenarios)
Evaluation runs **all 50 scenarios deterministically** across all 10 tasks for reproducible scoring.
### Task 1: Dockerfile Syntax Errors β€” Easy
Simple typos and instruction errors that break `docker build`.
| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|-------------------|
| 1 | `typo_filename` | `COPY requirments.txt .` β€” misspelled filename | Most common Docker build error on Stack Overflow |
| 2 | `invalid_base_image` | `FROM python:3.9-slimm` β€” extra 'm' in tag | Happens when copy-pasting image tags |
| 3 | `invalid_run_syntax` | `RUN pip install ... \n && python setup.py` β€” broken line continuation | Formatting multi-line RUN commands is tricky |
| 4 | `copy_missing_source` | `COPY dist/` but build output is in `build/` | Source directory doesn't exist in build context |
| 5 | `missing_from_instruction` | No `FROM` instruction at all | Dockerfile must start with FROM |
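The `missing_from_instruction` check comes down to one structural rule. A hedged sketch in the spirit of the DockerSimulator's syntax pass (the function name and internals are illustrative assumptions, not the simulator's actual code):

```python
def first_instruction_is_from(dockerfile: str) -> bool:
    """A Dockerfile must begin with FROM (ARG may legally precede it)."""
    for line in dockerfile.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments don't count as instructions
        upper = stripped.upper()
        return upper.startswith("FROM ") or upper.startswith("ARG ")
    return False  # empty file: no FROM at all
```

A file starting with `COPY app .` fails this check, which is exactly scenario 5 above.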
### Task 2: Dockerfile Runtime Errors β€” Medium
The Dockerfile builds successfully, but the container crashes at runtime.
| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|-------------------|
| 1 | `missing_workdir` | No WORKDIR β€” files scatter to `/` | Container runs but `npm start` can't find `package.json` |
| 2 | `cmd_entrypoint_conflict` | Both ENTRYPOINT and CMD defined as full commands | Process starts incorrectly |
| 3 | `entrypoint_not_executable` | Shell script lacks execute permission | `chmod +x` missing β€” "permission denied" |
| 4 | `missing_required_env` | App needs `DATABASE_URL` but it's not set | Container crashes: "DATABASE_URL is not defined" |
| 5 | `non_root_privileged_port` | Non-root user tries to bind port 80 | Security best practice conflicts with port < 1024 |
### Task 3: Workflow Syntax & Structure β€” Easy
GitHub Actions YAML has structural problems that GitHub rejects before any job runs.
| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|-------------------|
| 1 | `checkout_after_build` | `docker build` before `actions/checkout` | No source code β€” "Dockerfile not found" |
| 2 | `missing_runs_on` | Job has no `runs-on` field | Every job needs a runner |
| 3 | `invalid_trigger_syntax` | `branches: main` instead of `branches: [main]` | Must be a YAML list |
| 4 | `missing_step_uses_or_run` | Step has a name but no `uses:` or `run:` | Invalid step |
| 5 | `missing_on_trigger` | No `on:` block at all | Workflow never triggers |
### Task 4: Workflow Secrets & Permissions β€” Medium
Secrets exist but aren't wired correctly to the workflow steps.
| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|-------------------|
| 1 | `missing_env_secrets` | `$DOCKER_PASSWORD` without `env:` mapping | Secrets must be passed via `env:` block |
| 2 | `wrong_secret_syntax` | `${ secrets.TOKEN }` instead of `${{ secrets.TOKEN }}` | Single vs double braces |
| 3 | `missing_token_permissions` | Pushing to GHCR without `permissions: packages: write` | GITHUB_TOKEN is read-only by default |
| 4 | `secret_not_in_env` | `$SLACK_WEBHOOK_URL` not in `env:` | Very common mistake |
| 5 | `ghcr_wrong_credentials` | Using `DOCKER_PASSWORD` for GHCR login | GHCR uses `GITHUB_TOKEN` |
### Task 5: CI + Docker Integration β€” Medium
The workflow AND the Dockerfile interact. Fixing one file alone isn't enough.
| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|-------------------|
| 1 | `missing_buildx_for_platforms` | Multi-platform build without `setup-buildx-action` | Need BuildKit for cross-compile |
| 2 | `missing_load_true` | `build-push-action` without `load: true` β€” next step can't find image | Buildx doesn't load into local daemon by default |
| 3 | `wrong_build_context` | Context is `./backend` but Dockerfile path is `./Dockerfile` | Path mismatch |
| 4 | `cache_without_mode_max` | GHA cache export missing `mode=max` | Cache doesn't persist |
| 5 | `push_without_login` | `docker push` without `docker login` first | "denied: requested access" |
### Task 6: Multi-Stage Pipeline & Matrix β€” Hard
Complex pipelines with multiple interacting bugs. Agent must find 2-3 issues across files.
| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|-------------------|
| 1 | `artifact_path_mismatch` | `COPY --from=builder /app/dist` but React outputs to `/app/build` | CRA uses `build/`, Vite uses `dist/` |
| 2 | `matrix_platform_arg` | `$BUILDPLATFORM` without `ARG BUILDPLATFORM` | Multi-arch needs platform ARGs |
| 3 | `cross_job_artifact` | Test job downloads artifact but missing `needs: build` | Jobs run in parallel by default |
| 4 | `multiple_issues` | Dockerfile typo + workflow secrets not wired (2 bugs) | Problems compound across files |
| 5 | `matrix_version_failure` | Matrix includes Node 14 but code needs >= 16 + missing `needs:` | 2 bugs to find |
### Task 7: Kubernetes Pod Failures β€” Medium
Pod crashes and scheduling failures in Kubernetes deployments.
| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|-------------------|
| 1 | `oom_killed` | Memory limit 64Mi too low β€” CrashLoopBackOff/OOMKilled | Most common K8s production issue |
| 2 | `image_pull_backoff` | Image tag typo `nginx:latset` β†’ ImagePullBackOff | Copy-paste tag errors |
| 3 | `wrong_command` | `command: ["python", "workers.py"]` but file is `worker.py` | File name mismatch |
| 4 | `missing_configmap` | `envFrom: configMapRef: app-config` but ConfigMap doesn't exist | CreateContainerConfigError |
| 5 | `liveness_probe_failing` | Liveness probe port 3000 but app listens on 8080 | Probe misconfiguration causes restarts |
### Task 8: Kubernetes Service & Ingress Issues β€” Hard
Networking issues where pods run fine but traffic doesn't reach them. Error messages are intentionally vague β€” the agent must diagnose from kubectl output.
| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|-------------------|
| 1 | `selector_mismatch` | Service selector `app: api` but pod label is `app: api-server` | No endpoints β€” most common K8s networking bug |
| 2 | `port_mismatch` | Service targetPort 8080 but container listens on 3000 | Connection refused |
| 3 | `ingress_wrong_service` | Ingress references `api-svc` but service name is `api-service` | Ingress 404 |
| 4 | `network_policy_blocking` | NetworkPolicy with empty ingress rules blocks all traffic | Database unreachable |
| 5 | `missing_ingress_class` | No `ingressClassName: nginx` specified | Ingress controller doesn't pick it up |
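The `selector_mismatch` class of bug reduces to one cross-resource rule, sketched here for illustration (the KubernetesSimulator's actual internals are an assumption):

```python
def selector_matches(service: dict, deployment: dict) -> bool:
    """A Service only gets endpoints when every selector key/value pair
    appears in the Deployment's pod-template labels."""
    selector = service["spec"].get("selector", {})
    labels = deployment["spec"]["template"]["metadata"].get("labels", {})
    return bool(selector) and all(labels.get(k) == v for k, v in selector.items())

service = {"spec": {"selector": {"app": "api"}}}
deployment = {"spec": {"template": {"metadata": {"labels": {"app": "api-server"}}}}}
# selector_matches(service, deployment) is False: no endpoints, traffic never arrives
```

Fixing either side (relabel the pods or rewrite the selector) makes the check pass, which is why simulator-based validation accepts both fixes.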
### Task 9: CI/CD Build & Push Pipeline β€” Hard
GHA-to-Docker-to-Registry pipeline failures spanning multiple files.
| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|-------------------|
| 1 | `registry_mismatch` | Build tags `ghcr.io/...` but push targets `docker.io/...` | Registry URL mismatch between steps |
| 2 | `image_tag_mismatch` | Build uses `github.ref_name` but push uses `github.sha` | "image not found locally" |
| 3 | `inconsistent_tagging` | `docker tag myuser/api:latest` but image was built as `myuser/api:${{ github.sha }}` | Tag source doesn't exist |
| 4 | `build_arg_not_passed` | Dockerfile `ARG APP_VERSION` but no `--build-arg` in workflow | Version file is empty |
| 5 | `dockerfile_path_in_subdirectory` | Workflow points to `./Dockerfile` but it's at `./services/api/Dockerfile` | Monorepo path mismatch |
### Task 10: Full Stack Deployment Pipeline β€” Expert
Multi-error scenarios spanning the entire stack: GHA + Dockerfile + K8s manifests. 2-4 bugs per scenario requiring cross-file reasoning. Error messages are intentionally vague β€” the agent must trace root causes from symptoms.
| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|-------------------|
| 1 | `full_pipeline_ghcr_and_selector` | GHCR token not mapped + K8s Service selector mismatch | 2 bugs across workflow + K8s |
| 2 | `full_pipeline_three_bugs` | Missing checkout + no WORKDIR + wrong container/service port | 4 bugs across 4 files |
| 3 | `full_pipeline_ghcr_dockerfile_k8s` | Wrong GHCR secret + base image typo + OOM memory limit | 3 bugs across all layers |
| 4 | `full_pipeline_permissions_image_ingress` | Missing packages:write + hardcoded image placeholder + no ingressClassName | 3 bugs |
| 5 | `full_pipeline_secrets_build_probe` | Docker secrets not wired + wrong build output dir + probe port mismatch | 4 bugs across all layers |
---
## Fix Validation: Simulator-Based
Fixes are validated using **structural simulators**, not string matching. This means:
- **Alternative valid fixes are accepted.** Setting memory to either `512Mi` or `256Mi` resolves the OOM — the simulator accepts both.
- **Three independent simulators** run after every edit:
- **DockerSimulator**: validates Dockerfile syntax (FROM, COPY, EXPOSE, RUN) and runtime behavior (WORKDIR, CMD/ENTRYPOINT, permissions, ENV)
- **WorkflowSimulator**: parses YAML, checks triggers, runs-on, step ordering, secrets wiring, permissions, buildx requirements, registry consistency
- **KubernetesSimulator**: validates manifests, cross-resource dependencies (Service selector ↔ Deployment labels), pod status simulation (OOM, ImagePullBackOff), service endpoint reachability
- **7 granular checks** are tracked: `docker_build`, `docker_run`, `workflow_parse`, `workflow_exec`, `k8s_valid`, `k8s_pod_running`, `k8s_service_active`
- Progress = how many checks flip from fail β†’ pass compared to the initial broken state
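The progress metric above can be sketched in a few lines (check names mirror the list in this README; the grader's actual accounting may differ):

```python
# The seven granular checks tracked by the simulators.
CHECKS = [
    "docker_build", "docker_run", "workflow_parse", "workflow_exec",
    "k8s_valid", "k8s_pod_running", "k8s_service_active",
]

def progress(initial: dict, current: dict) -> int:
    """Count checks that failed in the initial broken state and pass now."""
    return sum(
        1 for check in CHECKS
        if not initial.get(check, False) and current.get(check, False)
    )
```

For example, an `oom_killed` scenario starts with `k8s_pod_running` failing; after the memory-limit fix it passes, so progress increases by one.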
---
## Available Actions
Each step, the agent chooses exactly one action:
| Action | What It Does | When to Use |
|--------|-------------|-------------|
| `edit_file` | Replace `old_content` with `new_content` in a file | Most common β€” fix a broken line or block |
| `replace_line` | Replace content at a specific line number | When you know exactly which line is wrong |
| `add_line` | Insert a new line into a file | Adding missing instructions (e.g., missing `WORKDIR`) |
| `delete_line` | Remove a specific line | Removing a bad instruction |
| `add_block` | Insert a multi-line block | Adding entire sections (e.g., `env:` block with secrets) |
| `delete_block` | Remove a multi-line block | Removing incorrect sections |
| `request_hint` | Get a clue about what's wrong | Costs -4% on final score β€” use sparingly |
| `submit` | Declare "I'm done" β€” triggers final evaluation | When all fixes are applied |
**Important:** `edit_file` requires `old_content` to match **exactly** (including whitespace). If it doesn't match, the edit fails and the agent gets a -0.02 reward penalty.
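The exact-match requirement behaves like a plain string replacement. A minimal model of that semantics (illustrative only — not the environment's actual implementation):

```python
def apply_edit(file_text: str, old_content: str, new_content: str):
    """Apply an edit_file-style replacement.

    Returns (new_text, succeeded). old_content must appear verbatim,
    including whitespace; a failed match leaves the file unchanged
    (the -0.02 penalty case described above).
    """
    if old_content not in file_text:
        return file_text, False
    return file_text.replace(old_content, new_content, 1), True

dockerfile = "FROM python:3.9-slim\nCOPY requirments.txt .\n"
fixed, ok = apply_edit(dockerfile, "COPY requirments.txt .",
                       "COPY requirements.txt .")
# ok is True; passing "copy requirments.txt ." (wrong case) would fail instead
```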
---
## Grading System
Scoring is **deterministic** (same actions always produce the same score), **difficulty-aware** (harder tasks are graded more generously), and scores are strictly in **(0, 1) exclusive** β€” never exactly 0 or 1.
### The Formula
```
FINAL SCORE = Base + Partial Fixes + Complete Bonus + Difficulty Bonus + Efficiency - Hint Penalty - Failed Edit Penalty
```
Clamped to `(0.01, 0.99)`.
### Component Breakdown
| Component | Weight | Description |
|-----------|--------|-------------|
| Base score | 5% | Participation credit (guarantees score > 0) |
| Partial fixes | 35% | Proportional to `issues_fixed / issues_total` |
| Complete bonus | 25% | All issues fixed |
| Difficulty bonus | 0-3% | Extra reward for fully solving hard/expert tasks |
| Efficiency | 25% | Decays with extra steps β€” slower decay for harder tasks |
| Hint penalty | -3% to -4% each | Per `request_hint` action (cheaper for hard/expert) |
| Failed edit penalty | -2% each | Per failed edit — e.g., `old_content` that doesn't match, or an invalid file path |
### Difficulty Modifiers
| Difficulty | Max Score | Efficiency Decay | Hint Cost |
|------------|-----------|------------------|-----------|
| Easy | 0.90 | 0.03/step (strict) | 4% each |
| Medium | 0.90 | 0.027/step | 4% each |
| Hard/Expert | 0.93 | 0.021/step (forgiving) | 3% each |
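Putting the two tables together, the scoring arithmetic can be sketched as below. The weights come from the component and modifier tables above; the grader's exact step/efficiency accounting is an assumption of this sketch.

```python
def final_score(issues_fixed: int, issues_total: int, steps: int,
                hints: int, failed_edits: int, difficulty: str = "easy") -> float:
    hard = difficulty in ("hard", "expert")
    decay = {"easy": 0.03, "medium": 0.027}.get(difficulty, 0.021)

    score = 0.05                                    # base participation credit
    score += 0.35 * (issues_fixed / issues_total)   # partial fixes
    if issues_fixed == issues_total:
        score += 0.25                               # complete bonus
        if hard:
            score += 0.03                           # difficulty bonus
    score += max(0.0, 0.25 - decay * steps)         # efficiency decay
    score -= (0.03 if hard else 0.04) * hints       # hint penalty
    score -= 0.02 * failed_edits                    # failed-edit penalty

    max_score = 0.93 if hard else 0.90              # difficulty cap
    return min(max(score, 0.01), min(0.99, max_score))
```

Under these assumptions a perfect one-step easy run scores 0.05 + 0.35 + 0.25 + 0.22 = 0.87, and even a run with zero fixes stays above 0 thanks to the clamp.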
---
## Evaluation
The evaluation pipeline runs **all 50 scenarios across all 10 tasks** deterministically:
```python
# Runs all 10 tasks Γ— 5 scenarios = 50 episodes
results = run_baseline_episodes() # num_episodes=None runs all
# Per-episode scores in (0, 1)
# Aggregate = mean of all 50 scores
aggregate = sum(r.score for r in results) / len(results)
```
This ensures:
- **Reproducibility**: same agent produces same score every time
- **Complete coverage**: every error pattern is tested
- **Fair comparison**: all agents face the same 50 scenarios
---
## API Endpoints
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/` | GET | Root page |
| `/health` | GET | Health check β€” returns `{"status": "healthy"}` |
| `/metadata` | GET | Environment name, description, version, tags |
| `/schema` | GET | Action, observation, and state JSON schemas |
| `/reset` | POST | Start a new episode (optional: `task_id`, `scenario_id`, `seed`) |
| `/step` | POST | Take an action and receive observation + reward |
| `/state` | GET | Get current observation without taking an action |
| `/info` | GET | Task list with metadata |
| `/tasks` | GET | List all tasks with difficulty levels |
| `/grader` | POST | Grade a trajectory (list of step dicts) |
| `/baseline` | POST | Run baseline across all scenarios (optional: `task_id`, `num_episodes`) |
| `/mcp` | POST | JSON-RPC 2.0 MCP endpoint (initialize, tools/list) |
### Example: Full Episode via API
```bash
# 1. Start an episode
curl -X POST http://localhost:7860/reset \
-H "Content-Type: application/json" \
-d '{"task_id": "k8s_pod_failures", "scenario_id": "oom_killed"}'
# 2. Fix the memory limit (any reasonable value works β€” simulator validates structurally)
curl -X POST http://localhost:7860/step \
-H "Content-Type: application/json" \
-d '{
"action": {
"action_type": "edit_file",
"edits": [{
"file_path": "k8s/deployment.yaml",
"old_content": "memory: \"64Mi\"",
"new_content": "memory: \"512Mi\""
}]
}
}'
# Response: reward=0.3, issues_fixed=1/1, done=true
```
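For agents written in Python, the same `/step` call can be built as a plain dict. Field names mirror the JSON shown in the curl example; the commented request line assumes the third-party `requests` package.

```python
import json

step_payload = {
    "action": {
        "action_type": "edit_file",
        "edits": [{
            "file_path": "k8s/deployment.yaml",
            "old_content": 'memory: "64Mi"',
            "new_content": 'memory: "512Mi"',
        }],
    }
}

# import requests
# r = requests.post("http://localhost:7860/step", json=step_payload)

body = json.dumps(step_payload)  # what goes on the wire
```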
---
## Quick Start
### Local Development
```bash
pip install -r requirements.txt
python -m uvicorn server.app:app --host 0.0.0.0 --port 7860
```
### Run Tests
```bash
pytest tests/ -v
```
### Docker
```bash
docker build -t cloud-native-devops-env .
docker run -p 7860:7860 cloud-native-devops-env
```
### Baseline Inference (with LLM)
```bash
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=meta-llama/Llama-3.1-70B-Instruct
export HF_TOKEN=your_token_here
python inference.py
```
---
## Project Structure
```
cloud-native-devops-env/
β”œβ”€β”€ openenv.yaml # OpenEnv environment specification
β”œβ”€β”€ inference.py # LLM baseline (OpenAI client + HF router)
β”œβ”€β”€ baseline_runner.py # Heuristic baseline β€” runs all 50 scenarios
β”œβ”€β”€ Dockerfile # Production container
β”œβ”€β”€ requirements.txt # Python dependencies
β”‚
β”œβ”€β”€ server/
β”‚ β”œβ”€β”€ app.py # FastAPI with 12 endpoints
β”‚ β”œβ”€β”€ models.py # Pydantic models (type-safe API)
β”‚ β”œβ”€β”€ environment.py # Core environment loop (reset/step/state)
β”‚ β”œβ”€β”€ tasks/
β”‚ β”‚ β”œβ”€β”€ base.py # BaseTask with scenario loading
β”‚ β”‚ β”œβ”€β”€ task_registry.py # Maps task_id β†’ task class (10 tasks)
β”‚ β”‚ β”œβ”€β”€ task_1_build_errors.py # 5 Dockerfile syntax scenarios
β”‚ β”‚ β”œβ”€β”€ task_2_docker_runtime.py # 5 Dockerfile runtime scenarios
β”‚ β”‚ β”œβ”€β”€ task_3_workflow_syntax.py # 5 workflow structure scenarios
β”‚ β”‚ β”œβ”€β”€ task_4_workflow_secrets_permissions.py # 5 secrets scenarios
β”‚ β”‚ β”œβ”€β”€ task_5_ci_docker_integration.py # 5 integration scenarios
β”‚ β”‚ β”œβ”€β”€ task_6_multi_stage_matrix.py # 5 multi-issue scenarios
β”‚ β”‚ β”œβ”€β”€ k8s_pod.py # 5 Kubernetes pod failure scenarios
β”‚ β”‚ β”œβ”€β”€ k8s_networking.py # 5 K8s networking scenarios
β”‚ β”‚ β”œβ”€β”€ pipeline_build_deploy.py # 5 GHAβ†’Dockerβ†’Registry scenarios
β”‚ β”‚ └── pipeline_full.py # 5 full-stack multi-error scenarios
β”‚ β”œβ”€β”€ graders/
β”‚ β”‚ └── __init__.py # Deterministic trajectory grader
β”‚ └── simulators/
β”‚ β”œβ”€β”€ docker_simulator.py # Dockerfile build + runtime validation
β”‚ β”œβ”€β”€ workflow_simulator.py # GHA workflow parse + execution validation
β”‚ └── k8s_simulator.py # K8s manifest + cross-resource validation
β”‚
└── tests/
β”œβ”€β”€ test_endpoints.py # API endpoint tests
β”œβ”€β”€ test_determinism.py # Grader determinism + score range tests
β”œβ”€β”€ test_baseline.py # Heuristic baseline tests
β”œβ”€β”€ test_environment_flow.py # Episode flow tests
└── test_simulators.py # Simulator unit tests
```
## Design Decisions
1. **Full cloud-native stack**: Docker + GitHub Actions + Kubernetes β€” the three pillars of modern deployment pipelines.
2. **Simulator-based validation**: Structural rule-based simulators validate fixes instead of string matching. Alternative valid fixes are accepted (e.g., `512Mi` and `256Mi` both fix an OOM). Deterministic, fast, no security concerns.
3. **Dense rewards**: Partial credit at every step (+0.3 per fix, -0.02 per failed edit) rather than sparse pass/fail.
4. **Difficulty progression**: Easy tasks are single-file, single-issue. Expert tasks are multi-file, multi-issue with interacting bugs across all three layers.
5. **Vague error messages in harder tasks**: Easy tasks have explicit error messages. Hard/Expert tasks have realistic, vague messages that require the agent to actually diagnose the issue from context.
6. **Deterministic evaluation**: All 50 scenarios run every time for reproducible, comparable scores in (0, 1) exclusive.
7. **50 scenarios from real bugs**: Every scenario is based on actual developer mistakes documented on Stack Overflow, GitHub Issues, and official documentation.
## License
MIT