---
title: Cloud-Native DevOps Debug Environment
emoji: πŸ”§
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
---
# Cloud-Native DevOps Debug Environment
An OpenEnv-compatible environment where AI agents learn to debug broken GitHub Actions workflows, Dockerfiles, and Kubernetes manifests. Built for the OpenEnv Hackathon by Scaler School of Technology (partners: Meta, HuggingFace, PyTorch).
## Why Cloud-Native Debugging?
Every developer who ships code hits deployment pipeline failures. A misconfigured Dockerfile, a broken GitHub Actions workflow, a missing secret, a Kubernetes selector mismatch β€” these are the bugs that waste hours of developer time every week. They're hard to debug because:
- Error messages are cryptic ("unable to prepare context: unable to evaluate symlinks")
- The feedback loop is slow (push, wait for CI, read logs, fix, repeat)
- Multiple config files interact in non-obvious ways (Dockerfile + workflow + secrets + K8s manifests)
- Kubernetes errors require cross-resource reasoning (Deployment labels must match Service selectors)
This environment teaches AI agents to do what senior DevOps engineers do: read the error, trace it to the root cause across multiple files, and fix it.
---
## How It Works: The Complete Flow
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ 1. RESET β”‚
β”‚ Agent receives: β”‚
β”‚ - Broken config files (Dockerfile / workflow / K8s YAML) β”‚
β”‚ - Error message from the failed build/deploy β”‚
β”‚ - Available secrets list β”‚
β”‚ - Number of issues to find β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 2. OBSERVE β†’ THINK β†’ ACT (repeat up to 10 steps) β”‚
β”‚ Agent reads the error, analyzes the files, then: β”‚
β”‚ - edit_file: replace broken content with fixed content β”‚
β”‚ - replace_line: fix a specific line number β”‚
β”‚ - add_line / add_block: insert missing content β”‚
β”‚ - delete_line / delete_block: remove bad content β”‚
β”‚ - request_hint: get a clue (-4% score penalty) β”‚
β”‚ - submit: "I'm done fixing" β”‚
β”‚ β”‚
β”‚ After each action, agent gets: β”‚
β”‚ - Updated file contents β”‚
β”‚ - Reward signal (+0.3 per fix, -0.02 for failed edits) β”‚
β”‚ - How many issues are now fixed β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 3. GRADE β”‚
β”‚ Deterministic scoring based on: β”‚
β”‚ - What fraction of issues were fixed β”‚
β”‚ - Whether ALL issues were fixed (bonus) β”‚
β”‚ - How many steps it took (efficiency) β”‚
β”‚ - How many hints were used (penalty) β”‚
β”‚ Score range: (0, 1) exclusive β€” never exactly 0 or 1 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
---
## The 10 Tasks (50 Scenarios)
Evaluation runs **all 50 scenarios deterministically** across all 10 tasks for reproducible scoring.
### Task 1: Dockerfile Syntax Errors β€” Easy
Simple typos and instruction errors that break `docker build`.
| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|-------------------|
| 1 | `typo_filename` | `COPY requirments.txt .` β€” misspelled filename | Most common Docker build error on Stack Overflow |
| 2 | `invalid_base_image` | `FROM python:3.9-slimm` β€” extra 'm' in tag | Happens when copy-pasting image tags |
| 3 | `invalid_run_syntax` | `RUN pip install ... \n && python setup.py` β€” broken line continuation | Formatting multi-line RUN commands is tricky |
| 4 | `copy_missing_source` | `COPY dist/` but build output is in `build/` | Source directory doesn't exist in build context |
| 5 | `missing_from_instruction` | No `FROM` instruction at all | Dockerfile must start with FROM |
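The `missing_from_instruction` check comes down to one structural rule. A hedged sketch in the spirit of the DockerSimulator's syntax pass (the function name and internals are illustrative assumptions, not the simulator's actual code):

```python
def first_instruction_is_from(dockerfile: str) -> bool:
    """A Dockerfile must begin with FROM (ARG may legally precede it)."""
    for line in dockerfile.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments don't count as instructions
        upper = stripped.upper()
        return upper.startswith("FROM ") or upper.startswith("ARG ")
    return False  # empty file: no FROM at all
```

A file starting with `COPY app .` fails this check, which is exactly scenario 5 above.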
### Task 2: Dockerfile Runtime Errors β€” Medium
The Dockerfile builds successfully, but the container crashes at runtime.
| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|-------------------|
| 1 | `missing_workdir` | No WORKDIR β€” files scatter to `/` | Container runs but `npm start` can't find `package.json` |
| 2 | `cmd_entrypoint_conflict` | Both ENTRYPOINT and CMD defined as full commands | Process starts incorrectly |
| 3 | `entrypoint_not_executable` | Shell script lacks execute permission | `chmod +x` missing β€” "permission denied" |
| 4 | `missing_required_env` | App needs `DATABASE_URL` but it's not set | Container crashes: "DATABASE_URL is not defined" |
| 5 | `non_root_privileged_port` | Non-root user tries to bind port 80 | Security best practice conflicts with port < 1024 |
### Task 3: Workflow Syntax & Structure β€” Easy
GitHub Actions YAML has structural problems that GitHub rejects before any job runs.
| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|-------------------|
| 1 | `checkout_after_build` | `docker build` before `actions/checkout` | No source code β€” "Dockerfile not found" |
| 2 | `missing_runs_on` | Job has no `runs-on` field | Every job needs a runner |
| 3 | `invalid_trigger_syntax` | `branches: main` instead of `branches: [main]` | Must be a YAML list |
| 4 | `missing_step_uses_or_run` | Step has a name but no `uses:` or `run:` | Invalid step |
| 5 | `missing_on_trigger` | No `on:` block at all | Workflow never triggers |
### Task 4: Workflow Secrets & Permissions β€” Medium
Secrets exist but aren't wired correctly to the workflow steps.
| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|-------------------|
| 1 | `missing_env_secrets` | `$DOCKER_PASSWORD` without `env:` mapping | Secrets must be passed via `env:` block |
| 2 | `wrong_secret_syntax` | `${ secrets.TOKEN }` instead of `${{ secrets.TOKEN }}` | Single vs double braces |
| 3 | `missing_token_permissions` | Pushing to GHCR without `permissions: packages: write` | GITHUB_TOKEN is read-only by default |
| 4 | `secret_not_in_env` | `$SLACK_WEBHOOK_URL` not in `env:` | Very common mistake |
| 5 | `ghcr_wrong_credentials` | Using `DOCKER_PASSWORD` for GHCR login | GHCR uses `GITHUB_TOKEN` |
### Task 5: CI + Docker Integration β€” Medium
The workflow AND the Dockerfile interact. Fixing one file alone isn't enough.
| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|-------------------|
| 1 | `missing_buildx_for_platforms` | Multi-platform build without `setup-buildx-action` | Need BuildKit for cross-compile |
| 2 | `missing_load_true` | `build-push-action` without `load: true` β€” next step can't find image | Buildx doesn't load into local daemon by default |
| 3 | `wrong_build_context` | Context is `./backend` but Dockerfile path is `./Dockerfile` | Path mismatch |
| 4 | `cache_without_mode_max` | GHA cache export missing `mode=max` | Cache doesn't persist |
| 5 | `push_without_login` | `docker push` without `docker login` first | "denied: requested access" |
### Task 6: Multi-Stage Pipeline & Matrix β€” Hard
Complex pipelines with multiple interacting bugs. Agent must find 2-3 issues across files.
| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|-------------------|
| 1 | `artifact_path_mismatch` | `COPY --from=builder /app/dist` but React outputs to `/app/build` | CRA uses `build/`, Vite uses `dist/` |
| 2 | `matrix_platform_arg` | `$BUILDPLATFORM` without `ARG BUILDPLATFORM` | Multi-arch needs platform ARGs |
| 3 | `cross_job_artifact` | Test job downloads artifact but missing `needs: build` | Jobs run in parallel by default |
| 4 | `multiple_issues` | Dockerfile typo + workflow secrets not wired (2 bugs) | Problems compound across files |
| 5 | `matrix_version_failure` | Matrix includes Node 14 but code needs >= 16 + missing `needs:` | 2 bugs to find |
### Task 7: Kubernetes Pod Failures β€” Medium
Pod crashes and scheduling failures in Kubernetes deployments.
| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|-------------------|
| 1 | `oom_killed` | Memory limit 64Mi too low β€” CrashLoopBackOff/OOMKilled | Most common K8s production issue |
| 2 | `image_pull_backoff` | Image tag typo `nginx:latset` β†’ ImagePullBackOff | Copy-paste tag errors |
| 3 | `wrong_command` | `command: ["python", "workers.py"]` but file is `worker.py` | File name mismatch |
| 4 | `missing_configmap` | `envFrom: configMapRef: app-config` but ConfigMap doesn't exist | CreateContainerConfigError |
| 5 | `liveness_probe_failing` | Liveness probe port 3000 but app listens on 8080 | Probe misconfiguration causes restarts |
### Task 8: Kubernetes Service & Ingress Issues β€” Hard
Networking issues where pods run fine but traffic doesn't reach them. Error messages are intentionally vague β€” the agent must diagnose from kubectl output.
| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|-------------------|
| 1 | `selector_mismatch` | Service selector `app: api` but pod label is `app: api-server` | No endpoints β€” most common K8s networking bug |
| 2 | `port_mismatch` | Service targetPort 8080 but container listens on 3000 | Connection refused |
| 3 | `ingress_wrong_service` | Ingress references `api-svc` but service name is `api-service` | Ingress 404 |
| 4 | `network_policy_blocking` | NetworkPolicy with empty ingress rules blocks all traffic | Database unreachable |
| 5 | `missing_ingress_class` | No `ingressClassName: nginx` specified | Ingress controller doesn't pick it up |
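The `selector_mismatch` class of bug reduces to one cross-resource rule, sketched here for illustration (the KubernetesSimulator's actual internals are an assumption):

```python
def selector_matches(service: dict, deployment: dict) -> bool:
    """A Service only gets endpoints when every selector key/value pair
    appears in the Deployment's pod-template labels."""
    selector = service["spec"].get("selector", {})
    labels = deployment["spec"]["template"]["metadata"].get("labels", {})
    return bool(selector) and all(labels.get(k) == v for k, v in selector.items())

service = {"spec": {"selector": {"app": "api"}}}
deployment = {"spec": {"template": {"metadata": {"labels": {"app": "api-server"}}}}}
# selector_matches(service, deployment) is False: no endpoints, traffic never arrives
```

Fixing either side (relabel the pods or rewrite the selector) makes the check pass, which is why simulator-based validation accepts both fixes.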
### Task 9: CI/CD Build & Push Pipeline β€” Hard
GHA-to-Docker-to-Registry pipeline failures spanning multiple files.
| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|-------------------|
| 1 | `registry_mismatch` | Build tags `ghcr.io/...` but push targets `docker.io/...` | Registry URL mismatch between steps |
| 2 | `image_tag_mismatch` | Build uses `github.ref_name` but push uses `github.sha` | "image not found locally" |
| 3 | `inconsistent_tagging` | `docker tag myuser/api:latest` but image was built as `myuser/api:${{ github.sha }}` | Tag source doesn't exist |
| 4 | `build_arg_not_passed` | Dockerfile `ARG APP_VERSION` but no `--build-arg` in workflow | Version file is empty |
| 5 | `dockerfile_path_in_subdirectory` | Workflow points to `./Dockerfile` but it's at `./services/api/Dockerfile` | Monorepo path mismatch |
### Task 10: Full Stack Deployment Pipeline β€” Expert
Multi-error scenarios spanning the entire stack: GHA + Dockerfile + K8s manifests. 2-4 bugs per scenario requiring cross-file reasoning. Error messages are intentionally vague β€” the agent must trace root causes from symptoms.
| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|-------------------|
| 1 | `full_pipeline_ghcr_and_selector` | GHCR token not mapped + K8s Service selector mismatch | 2 bugs across workflow + K8s |
| 2 | `full_pipeline_three_bugs` | Missing checkout + no WORKDIR + wrong container/service port | 4 bugs across 4 files |
| 3 | `full_pipeline_ghcr_dockerfile_k8s` | Wrong GHCR secret + base image typo + OOM memory limit | 3 bugs across all layers |
| 4 | `full_pipeline_permissions_image_ingress` | Missing packages:write + hardcoded image placeholder + no ingressClassName | 3 bugs |
| 5 | `full_pipeline_secrets_build_probe` | Docker secrets not wired + wrong build output dir + probe port mismatch | 4 bugs across all layers |
---
## Fix Validation: Simulator-Based
Fixes are validated using **structural simulators**, not string matching. This means:
- **Alternative valid fixes are accepted.** Setting memory to either `512Mi` or `256Mi` resolves the OOM — the simulator accepts both.
- **Three independent simulators** run after every edit:
- **DockerSimulator**: validates Dockerfile syntax (FROM, COPY, EXPOSE, RUN) and runtime behavior (WORKDIR, CMD/ENTRYPOINT, permissions, ENV)
- **WorkflowSimulator**: parses YAML, checks triggers, runs-on, step ordering, secrets wiring, permissions, buildx requirements, registry consistency
- **KubernetesSimulator**: validates manifests, cross-resource dependencies (Service selector ↔ Deployment labels), pod status simulation (OOM, ImagePullBackOff), service endpoint reachability
- **7 granular checks** are tracked: `docker_build`, `docker_run`, `workflow_parse`, `workflow_exec`, `k8s_valid`, `k8s_pod_running`, `k8s_service_active`
- Progress = how many checks flip from fail β†’ pass compared to the initial broken state
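The progress metric above can be sketched in a few lines (check names mirror the list in this README; the grader's actual accounting may differ):

```python
# The seven granular checks tracked by the simulators.
CHECKS = [
    "docker_build", "docker_run", "workflow_parse", "workflow_exec",
    "k8s_valid", "k8s_pod_running", "k8s_service_active",
]

def progress(initial: dict, current: dict) -> int:
    """Count checks that failed in the initial broken state and pass now."""
    return sum(
        1 for check in CHECKS
        if not initial.get(check, False) and current.get(check, False)
    )
```

For example, an `oom_killed` scenario starts with `k8s_pod_running` failing; after the memory-limit fix it passes, so progress increases by one.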
---
## Available Actions
Each step, the agent chooses exactly one action:
| Action | What It Does | When to Use |
|--------|-------------|-------------|
| `edit_file` | Replace `old_content` with `new_content` in a file | Most common β€” fix a broken line or block |
| `replace_line` | Replace content at a specific line number | When you know exactly which line is wrong |
| `add_line` | Insert a new line into a file | Adding missing instructions (e.g., missing `WORKDIR`) |
| `delete_line` | Remove a specific line | Removing a bad instruction |
| `add_block` | Insert a multi-line block | Adding entire sections (e.g., `env:` block with secrets) |
| `delete_block` | Remove a multi-line block | Removing incorrect sections |
| `request_hint` | Get a clue about what's wrong | Costs -4% on final score β€” use sparingly |
| `submit` | Declare "I'm done" β€” triggers final evaluation | When all fixes are applied |
**Important:** `edit_file` requires `old_content` to match **exactly** (including whitespace). If it doesn't match, the edit fails and the agent gets a -0.02 reward penalty.
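The exact-match requirement behaves like a plain string replacement. A minimal model of that semantics (illustrative only — not the environment's actual implementation):

```python
def apply_edit(file_text: str, old_content: str, new_content: str):
    """Apply an edit_file-style replacement.

    Returns (new_text, succeeded). old_content must appear verbatim,
    including whitespace; a failed match leaves the file unchanged
    (the -0.02 penalty case described above).
    """
    if old_content not in file_text:
        return file_text, False
    return file_text.replace(old_content, new_content, 1), True

dockerfile = "FROM python:3.9-slim\nCOPY requirments.txt .\n"
fixed, ok = apply_edit(dockerfile, "COPY requirments.txt .",
                       "COPY requirements.txt .")
# ok is True; passing "copy requirments.txt ." (wrong case) would fail instead
```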
---
## Grading System
Scoring is **deterministic** (same actions always produce the same score), **difficulty-aware** (harder tasks are graded more generously), and scores are strictly in **(0, 1) exclusive** β€” never exactly 0 or 1.
### The Formula
```
FINAL SCORE = Base + Partial Fixes + Complete Bonus + Difficulty Bonus + Efficiency - Hint Penalty - Failed Edit Penalty
```
Clamped to `(0.01, 0.99)`.
### Component Breakdown
| Component | Weight | Description |
|-----------|--------|-------------|
| Base score | 5% | Participation credit (guarantees score > 0) |
| Partial fixes | 35% | Proportional to `issues_fixed / issues_total` |
| Complete bonus | 25% | All issues fixed |
| Difficulty bonus | 0-3% | Extra reward for fully solving hard/expert tasks |
| Efficiency | 25% | Decays with extra steps β€” slower decay for harder tasks |
| Hint penalty | -3% to -4% each | Per `request_hint` action (cheaper for hard/expert) |
| Failed edit penalty | -2% each | Per failed edit — e.g., `old_content` that doesn't match, or an invalid file path |
### Difficulty Modifiers
| Difficulty | Max Score | Efficiency Decay | Hint Cost |
|------------|-----------|------------------|-----------|
| Easy | 0.90 | 0.03/step (strict) | 4% each |
| Medium | 0.90 | 0.027/step | 4% each |
| Hard/Expert | 0.93 | 0.021/step (forgiving) | 3% each |
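Putting the two tables together, the scoring arithmetic can be sketched as below. The weights come from the component and modifier tables above; the grader's exact step/efficiency accounting is an assumption of this sketch.

```python
def final_score(issues_fixed: int, issues_total: int, steps: int,
                hints: int, failed_edits: int, difficulty: str = "easy") -> float:
    hard = difficulty in ("hard", "expert")
    decay = {"easy": 0.03, "medium": 0.027}.get(difficulty, 0.021)

    score = 0.05                                    # base participation credit
    score += 0.35 * (issues_fixed / issues_total)   # partial fixes
    if issues_fixed == issues_total:
        score += 0.25                               # complete bonus
        if hard:
            score += 0.03                           # difficulty bonus
    score += max(0.0, 0.25 - decay * steps)         # efficiency decay
    score -= (0.03 if hard else 0.04) * hints       # hint penalty
    score -= 0.02 * failed_edits                    # failed-edit penalty

    max_score = 0.93 if hard else 0.90              # difficulty cap
    return min(max(score, 0.01), min(0.99, max_score))
```

Under these assumptions a perfect one-step easy run scores 0.05 + 0.35 + 0.25 + 0.22 = 0.87, and even a run with zero fixes stays above 0 thanks to the clamp.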
---
## Evaluation
The evaluation pipeline runs **all 50 scenarios across all 10 tasks** deterministically:
```python
# Runs all 10 tasks Γ— 5 scenarios = 50 episodes
results = run_baseline_episodes() # num_episodes=None runs all
# Per-episode scores in (0, 1)
# Aggregate = mean of all 50 scores
aggregate = sum(r.score for r in results) / len(results)
```
This ensures:
- **Reproducibility**: same agent produces same score every time
- **Complete coverage**: every error pattern is tested
- **Fair comparison**: all agents face the same 50 scenarios
---
## API Endpoints
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/` | GET | Root page |
| `/health` | GET | Health check β€” returns `{"status": "healthy"}` |
| `/metadata` | GET | Environment name, description, version, tags |
| `/schema` | GET | Action, observation, and state JSON schemas |
| `/reset` | POST | Start a new episode (optional: `task_id`, `scenario_id`, `seed`) |
| `/step` | POST | Take an action and receive observation + reward |
| `/state` | GET | Get current observation without taking an action |
| `/info` | GET | Task list with metadata |
| `/tasks` | GET | List all tasks with difficulty levels |
| `/grader` | POST | Grade a trajectory (list of step dicts) |
| `/baseline` | POST | Run baseline across all scenarios (optional: `task_id`, `num_episodes`) |
| `/mcp` | POST | JSON-RPC 2.0 MCP endpoint (initialize, tools/list) |
### Example: Full Episode via API
```bash
# 1. Start an episode
curl -X POST http://localhost:7860/reset \
-H "Content-Type: application/json" \
-d '{"task_id": "k8s_pod_failures", "scenario_id": "oom_killed"}'
# 2. Fix the memory limit (any reasonable value works β€” simulator validates structurally)
curl -X POST http://localhost:7860/step \
-H "Content-Type: application/json" \
-d '{
"action": {
"action_type": "edit_file",
"edits": [{
"file_path": "k8s/deployment.yaml",
"old_content": "memory: \"64Mi\"",
"new_content": "memory: \"512Mi\""
}]
}
}'
# Response: reward=0.3, issues_fixed=1/1, done=true
```
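For agents written in Python, the same `/step` call can be built as a plain dict. Field names mirror the JSON shown in the curl example; the commented request line assumes the third-party `requests` package.

```python
import json

step_payload = {
    "action": {
        "action_type": "edit_file",
        "edits": [{
            "file_path": "k8s/deployment.yaml",
            "old_content": 'memory: "64Mi"',
            "new_content": 'memory: "512Mi"',
        }],
    }
}

# import requests
# r = requests.post("http://localhost:7860/step", json=step_payload)

body = json.dumps(step_payload)  # what goes on the wire
```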
---
## Quick Start
### Local Development
```bash
pip install -r requirements.txt
python -m uvicorn server.app:app --host 0.0.0.0 --port 7860
```
### Run Tests
```bash
pytest tests/ -v
```
### Docker
```bash
docker build -t cloud-native-devops-env .
docker run -p 7860:7860 cloud-native-devops-env
```
### Baseline Inference (with LLM)
```bash
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=meta-llama/Llama-3.1-70B-Instruct
export HF_TOKEN=your_token_here
python inference.py
```
---
## Project Structure
```
cloud-native-devops-env/
β”œβ”€β”€ openenv.yaml # OpenEnv environment specification
β”œβ”€β”€ inference.py # LLM baseline (OpenAI client + HF router)
β”œβ”€β”€ baseline_runner.py # Heuristic baseline β€” runs all 50 scenarios
β”œβ”€β”€ Dockerfile # Production container
β”œβ”€β”€ requirements.txt # Python dependencies
β”‚
β”œβ”€β”€ server/
β”‚ β”œβ”€β”€ app.py # FastAPI with 12 endpoints
β”‚ β”œβ”€β”€ models.py # Pydantic models (type-safe API)
β”‚ β”œβ”€β”€ environment.py # Core environment loop (reset/step/state)
β”‚ β”œβ”€β”€ tasks/
β”‚ β”‚ β”œβ”€β”€ base.py # BaseTask with scenario loading
β”‚ β”‚ β”œβ”€β”€ task_registry.py # Maps task_id β†’ task class (10 tasks)
β”‚ β”‚ β”œβ”€β”€ task_1_build_errors.py # 5 Dockerfile syntax scenarios
β”‚ β”‚ β”œβ”€β”€ task_2_docker_runtime.py # 5 Dockerfile runtime scenarios
β”‚ β”‚ β”œβ”€β”€ task_3_workflow_syntax.py # 5 workflow structure scenarios
β”‚ β”‚ β”œβ”€β”€ task_4_workflow_secrets_permissions.py # 5 secrets scenarios
β”‚ β”‚ β”œβ”€β”€ task_5_ci_docker_integration.py # 5 integration scenarios
β”‚ β”‚ β”œβ”€β”€ task_6_multi_stage_matrix.py # 5 multi-issue scenarios
β”‚ β”‚ β”œβ”€β”€ k8s_pod.py # 5 Kubernetes pod failure scenarios
β”‚ β”‚ β”œβ”€β”€ k8s_networking.py # 5 K8s networking scenarios
β”‚ β”‚ β”œβ”€β”€ pipeline_build_deploy.py # 5 GHAβ†’Dockerβ†’Registry scenarios
β”‚ β”‚ └── pipeline_full.py # 5 full-stack multi-error scenarios
β”‚ β”œβ”€β”€ graders/
β”‚ β”‚ └── __init__.py # Deterministic trajectory grader
β”‚ └── simulators/
β”‚ β”œβ”€β”€ docker_simulator.py # Dockerfile build + runtime validation
β”‚ β”œβ”€β”€ workflow_simulator.py # GHA workflow parse + execution validation
β”‚ └── k8s_simulator.py # K8s manifest + cross-resource validation
β”‚
└── tests/
β”œβ”€β”€ test_endpoints.py # API endpoint tests
β”œβ”€β”€ test_determinism.py # Grader determinism + score range tests
β”œβ”€β”€ test_baseline.py # Heuristic baseline tests
β”œβ”€β”€ test_environment_flow.py # Episode flow tests
└── test_simulators.py # Simulator unit tests
```
## Design Decisions
1. **Full cloud-native stack**: Docker + GitHub Actions + Kubernetes β€” the three pillars of modern deployment pipelines.
2. **Simulator-based validation**: Structural rule-based simulators validate fixes instead of string matching. Alternative valid fixes are accepted (e.g., `512Mi` and `256Mi` both fix an OOM). Deterministic, fast, no security concerns.
3. **Dense rewards**: Partial credit at every step (+0.3 per fix, -0.02 per failed edit) rather than sparse pass/fail.
4. **Difficulty progression**: Easy tasks are single-file, single-issue. Expert tasks are multi-file, multi-issue with interacting bugs across all three layers.
5. **Vague error messages in harder tasks**: Easy tasks have explicit error messages. Hard/Expert tasks have realistic, vague messages that require the agent to actually diagnose the issue from context.
6. **Deterministic evaluation**: All 50 scenarios run every time for reproducible, comparable scores in (0, 1) exclusive.
7. **50 scenarios from real bugs**: Every scenario is based on actual developer mistakes documented on Stack Overflow, GitHub Issues, and official documentation.
## License
MIT