---
title: Cloud-Native DevOps Debug Environment
emoji: 🔧
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
---

# Cloud-Native DevOps Debug Environment

An OpenEnv-compatible environment where AI agents learn to debug broken GitHub Actions workflows, Dockerfiles, and Kubernetes manifests. Built for the OpenEnv Hackathon by Scaler School of Technology (partners: Meta, HuggingFace, PyTorch).

## Why Cloud-Native Debugging?

Every developer who ships code hits deployment pipeline failures. A misconfigured Dockerfile, a broken GitHub Actions workflow, a missing secret, a Kubernetes selector mismatch: these are the bugs that waste hours of developer time every week. They're hard to debug because:

- Error messages are cryptic ("unable to prepare context: unable to evaluate symlinks")
- The feedback loop is slow (push, wait for CI, read logs, fix, repeat)
- Multiple config files interact in non-obvious ways (Dockerfile + workflow + secrets + K8s manifests)
- Kubernetes errors require cross-resource reasoning (Deployment labels must match Service selectors)

This environment teaches AI agents to do what senior DevOps engineers do: read the error, trace it to the root cause across multiple files, and fix it.

---

## How It Works: The Complete Flow

```
┌───────────────────────────────────────────────────────────────┐
│  1. RESET                                                     │
│     Agent receives:                                           │
│     - Broken config files (Dockerfile / workflow / K8s YAML)  │
│     - Error message from the failed build/deploy              │
│     - Available secrets list                                  │
│     - Number of issues to find                                │
├───────────────────────────────────────────────────────────────┤
│  2. OBSERVE → THINK → ACT  (repeat up to 10 steps)            │
│     Agent reads the error, analyzes the files, then:          │
│     - edit_file: replace broken content with fixed content    │
│     - replace_line: fix a specific line number                │
│     - add_line / add_block: insert missing content            │
│     - delete_line / delete_block: remove bad content          │
│     - request_hint: get a clue (-4% score penalty)            │
│     - submit: "I'm done fixing"                               │
│                                                               │
│     After each action, agent gets:                            │
│     - Updated file contents                                   │
│     - Reward signal (+0.3 per fix, -0.02 for failed edits)    │
│     - How many issues are now fixed                           │
├───────────────────────────────────────────────────────────────┤
│  3. GRADE                                                     │
│     Deterministic scoring based on:                           │
│     - What fraction of issues were fixed                      │
│     - Whether ALL issues were fixed (bonus)                   │
│     - How many steps it took (efficiency)                     │
│     - How many hints were used (penalty)                      │
│     Score range: (0, 1) exclusive, never exactly 0 or 1       │
└───────────────────────────────────────────────────────────────┘
```

---

## The 10 Tasks (50 Scenarios)

Evaluation runs **all 50 scenarios deterministically** across all 10 tasks for reproducible scoring.

### Task 1: Dockerfile Syntax Errors (Easy)

Simple typos and instruction errors that break `docker build`.

| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|-------------------|
| 1 | `typo_filename` | `COPY requirments.txt .` (misspelled filename) | Most common Docker build error on Stack Overflow |
| 2 | `invalid_base_image` | `FROM python:3.9-slimm` (extra 'm' in tag) | Happens when copy-pasting image tags |
| 3 | `invalid_run_syntax` | `RUN pip install ... \n && python setup.py` (broken line continuation) | Formatting multi-line RUN commands is tricky |
| 4 | `copy_missing_source` | `COPY dist/` but build output is in `build/` | Source directory doesn't exist in build context |
| 5 | `missing_from_instruction` | No `FROM` instruction at all | Dockerfile must start with FROM |

### Task 2: Dockerfile Runtime Errors (Medium)

The Dockerfile builds successfully, but the container crashes at runtime.

| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|-------------------|
| 1 | `missing_workdir` | No WORKDIR, so files scatter to `/` | Container runs but `npm start` can't find `package.json` |
| 2 | `cmd_entrypoint_conflict` | Both ENTRYPOINT and CMD defined as full commands | Process starts incorrectly |
| 3 | `entrypoint_not_executable` | Shell script lacks execute permission | `chmod +x` missing, causing "permission denied" |
| 4 | `missing_required_env` | App needs `DATABASE_URL` but it's not set | Container crashes: "DATABASE_URL is not defined" |
| 5 | `non_root_privileged_port` | Non-root user tries to bind port 80 | Security best practice conflicts with port < 1024 |

### Task 3: Workflow Syntax & Structure (Easy)

GitHub Actions YAML has structural problems that GitHub rejects before any job runs.

| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|-------------------|
| 1 | `checkout_after_build` | `docker build` before `actions/checkout` | No source code: "Dockerfile not found" |
| 2 | `missing_runs_on` | Job has no `runs-on` field | Every job needs a runner |
| 3 | `invalid_trigger_syntax` | `branches: main` instead of `branches: [main]` | Must be a YAML list |
| 4 | `missing_step_uses_or_run` | Step has a name but no `uses:` or `run:` | Invalid step |
| 5 | `missing_on_trigger` | No `on:` block at all | Workflow never triggers |

### Task 4: Workflow Secrets & Permissions (Medium)

Secrets exist but aren't wired correctly to the workflow steps.

| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|-------------------|
| 1 | `missing_env_secrets` | `$DOCKER_PASSWORD` without `env:` mapping | Secrets must be passed via `env:` block |
| 2 | `wrong_secret_syntax` | `${ secrets.TOKEN }` instead of `${{ secrets.TOKEN }}` | Single vs double braces |
| 3 | `missing_token_permissions` | Pushing to GHCR without `permissions: packages: write` | GITHUB_TOKEN is read-only by default |
| 4 | `secret_not_in_env` | `$SLACK_WEBHOOK_URL` not in `env:` | Very common mistake |
| 5 | `ghcr_wrong_credentials` | Using `DOCKER_PASSWORD` for GHCR login | GHCR uses `GITHUB_TOKEN` |
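
The `wrong_secret_syntax` scenario above can be illustrated with a small check. This is a minimal sketch, not the project's `WorkflowSimulator`; the function name `find_single_brace_secrets` and the regular expression are assumptions introduced here for illustration.

```python
import re

# Hypothetical pattern: a secrets reference wrapped in a single pair of braces,
# e.g. "${ secrets.TOKEN }". The lookaheads exclude the valid "${{ ... }}" form.
SINGLE_BRACE_SECRET = re.compile(
    r"\$\{(?!\{)\s*secrets\.[A-Za-z_][A-Za-z0-9_]*\s*\}(?!\})"
)


def find_single_brace_secrets(workflow_text: str) -> list[str]:
    """Return every single-brace secrets reference found in the workflow text."""
    return SINGLE_BRACE_SECRET.findall(workflow_text)


broken = "run: echo ${ secrets.TOKEN }"
fixed = "run: echo ${{ secrets.TOKEN }}"
print(find_single_brace_secrets(broken))
print(find_single_brace_secrets(fixed))
```

A text-level check like this only catches the brace typo; the other scenarios in this task (missing `env:` mapping, missing `permissions:`) need the parsed YAML structure.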

### Task 5: CI + Docker Integration (Medium)

The workflow AND the Dockerfile interact. Fixing one file alone isn't enough.

| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|-------------------|
| 1 | `missing_buildx_for_platforms` | Multi-platform build without `setup-buildx-action` | Need BuildKit for cross-compile |
| 2 | `missing_load_true` | `build-push-action` without `load: true`, so the next step can't find the image | Buildx doesn't load into local daemon by default |
| 3 | `wrong_build_context` | Context is `./backend` but Dockerfile path is `./Dockerfile` | Path mismatch |
| 4 | `cache_without_mode_max` | GHA cache export missing `mode=max` | Cache doesn't persist |
| 5 | `push_without_login` | `docker push` without `docker login` first | "denied: requested access" |

### Task 6: Multi-Stage Pipeline & Matrix (Hard)

Complex pipelines with multiple interacting bugs. Agent must find 2-3 issues across files.

| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|-------------------|
| 1 | `artifact_path_mismatch` | `COPY --from=builder /app/dist` but React outputs to `/app/build` | CRA uses `build/`, Vite uses `dist/` |
| 2 | `matrix_platform_arg` | `$BUILDPLATFORM` without `ARG BUILDPLATFORM` | Multi-arch needs platform ARGs |
| 3 | `cross_job_artifact` | Test job downloads artifact but missing `needs: build` | Jobs run in parallel by default |
| 4 | `multiple_issues` | Dockerfile typo + workflow secrets not wired (2 bugs) | Problems compound across files |
| 5 | `matrix_version_failure` | Matrix includes Node 14 but code needs >= 16 + missing `needs:` | 2 bugs to find |

### Task 7: Kubernetes Pod Failures (Medium)

Pod crashes and scheduling failures in Kubernetes deployments.

| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|-------------------|
| 1 | `oom_killed` | Memory limit 64Mi too low, causing CrashLoopBackOff/OOMKilled | Most common K8s production issue |
| 2 | `image_pull_backoff` | Image tag typo `nginx:latset` → ImagePullBackOff | Copy-paste tag errors |
| 3 | `wrong_command` | `command: ["python", "workers.py"]` but file is `worker.py` | File name mismatch |
| 4 | `missing_configmap` | `envFrom: configMapRef: app-config` but ConfigMap doesn't exist | CreateContainerConfigError |
| 5 | `liveness_probe_failing` | Liveness probe port 3000 but app listens on 8080 | Probe misconfiguration causes restarts |

### Task 8: Kubernetes Service & Ingress Issues (Hard)

Networking issues where pods run fine but traffic doesn't reach them. Error messages are intentionally vague, so the agent must diagnose from kubectl output.

| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|-------------------|
| 1 | `selector_mismatch` | Service selector `app: api` but pod label is `app: api-server` | No endpoints; most common K8s networking bug |
| 2 | `port_mismatch` | Service targetPort 8080 but container listens on 3000 | Connection refused |
| 3 | `ingress_wrong_service` | Ingress references `api-svc` but service name is `api-service` | Ingress 404 |
| 4 | `network_policy_blocking` | NetworkPolicy with empty ingress rules blocks all traffic | Database unreachable |
| 5 | `missing_ingress_class` | No `ingressClassName: nginx` specified | Ingress controller doesn't pick it up |
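
The `selector_mismatch` scenario rests on a simple Kubernetes rule: a Service routes to a pod only if every key/value pair in the Service selector appears in the pod's labels. The sketch below illustrates that rule with plain dictionaries; `selector_matches` is a hypothetical helper and not part of the project's `KubernetesSimulator`.

```python
def selector_matches(selector: dict[str, str], pod_labels: dict[str, str]) -> bool:
    """Return True when every selector key/value pair is present in the pod labels.

    An empty selector is treated as "selects nothing" here, because a Service
    without a selector does not get endpoints created for it automatically.
    """
    return bool(selector) and all(
        pod_labels.get(key) == value for key, value in selector.items()
    )


# Values taken from the selector_mismatch scenario in the table above.
service_selector = {"app": "api"}
broken_pod_labels = {"app": "api-server"}
fixed_pod_labels = {"app": "api"}

print(selector_matches(service_selector, broken_pod_labels))
print(selector_matches(service_selector, fixed_pod_labels))
```

Either side can be edited to resolve the mismatch: changing the selector to `app: api-server` satisfies the same rule as changing the pod label to `app: api`.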

### Task 9: CI/CD Build & Push Pipeline (Hard)

GHA-to-Docker-to-Registry pipeline failures spanning multiple files.

| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|-------------------|
| 1 | `registry_mismatch` | Build tags `ghcr.io/...` but push targets `docker.io/...` | Registry URL mismatch between steps |
| 2 | `image_tag_mismatch` | Build uses `github.ref_name` but push uses `github.sha` | "image not found locally" |
| 3 | `inconsistent_tagging` | `docker tag myuser/api:latest` but image was built as `myuser/api:${{ github.sha }}` | Tag source doesn't exist |
| 4 | `build_arg_not_passed` | Dockerfile `ARG APP_VERSION` but no `--build-arg` in workflow | Version file is empty |
| 5 | `dockerfile_path_in_subdirectory` | Workflow points to `./Dockerfile` but it's at `./services/api/Dockerfile` | Monorepo path mismatch |

### Task 10: Full Stack Deployment Pipeline (Expert)

Multi-error scenarios spanning the entire stack: GHA + Dockerfile + K8s manifests. 2-4 bugs per scenario requiring cross-file reasoning. Error messages are intentionally vague, so the agent must trace root causes from symptoms.

| # | Scenario | What's Broken | Real-World Context |
|---|----------|---------------|-------------------|
| 1 | `full_pipeline_ghcr_and_selector` | GHCR token not mapped + K8s Service selector mismatch | 2 bugs across workflow + K8s |
| 2 | `full_pipeline_three_bugs` | Missing checkout + no WORKDIR + wrong container/service port | 4 bugs across 4 files |
| 3 | `full_pipeline_ghcr_dockerfile_k8s` | Wrong GHCR secret + base image typo + OOM memory limit | 3 bugs across all layers |
| 4 | `full_pipeline_permissions_image_ingress` | Missing packages:write + hardcoded image placeholder + no ingressClassName | 3 bugs |
| 5 | `full_pipeline_secrets_build_probe` | Docker secrets not wired + wrong build output dir + probe port mismatch | 4 bugs across all layers |

---

## Fix Validation: Simulator-Based

Fixes are validated using **structural simulators**, not string matching. This means:

- **Alternative valid fixes are accepted.** Raising the memory limit to either `256Mi` or `512Mi` resolves the OOM, and the simulator accepts both.
- **Three independent simulators** run after every edit:
  - **DockerSimulator**: validates Dockerfile syntax (FROM, COPY, EXPOSE, RUN) and runtime behavior (WORKDIR, CMD/ENTRYPOINT, permissions, ENV)
  - **WorkflowSimulator**: parses YAML, checks triggers, runs-on, step ordering, secrets wiring, permissions, buildx requirements, registry consistency
  - **KubernetesSimulator**: validates manifests, cross-resource dependencies (Service selector ↔ Deployment labels), pod status simulation (OOM, ImagePullBackOff), service endpoint reachability
- **7 granular checks** are tracked: `docker_build`, `docker_run`, `workflow_parse`, `workflow_exec`, `k8s_valid`, `k8s_pod_running`, `k8s_service_active`
- Progress = how many checks flip from fail → pass compared to the initial broken state
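
The progress rule in the last bullet can be sketched as follows. The seven check names come from the list above; the function `count_fixed` and the choice to treat a check that is absent from a scenario as initially passing are assumptions made for this illustration, not the project's actual implementation.

```python
# The seven granular checks named in this section.
CHECKS = (
    "docker_build",
    "docker_run",
    "workflow_parse",
    "workflow_exec",
    "k8s_valid",
    "k8s_pod_running",
    "k8s_service_active",
)


def count_fixed(initial: dict[str, bool], current: dict[str, bool]) -> int:
    """Count checks that failed in the initial broken state and pass now.

    Assumption: a check missing from `initial` was not failing, so it can
    never count as progress.
    """
    return sum(
        1
        for name in CHECKS
        if not initial.get(name, True) and current.get(name, False)
    )


initial_state = {"k8s_valid": True, "k8s_pod_running": False, "k8s_service_active": False}
after_edit = {"k8s_valid": True, "k8s_pod_running": True, "k8s_service_active": False}
print(count_fixed(initial_state, after_edit))
```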

---

## Available Actions

Each step, the agent chooses exactly one action:

| Action | What It Does | When to Use |
|--------|-------------|-------------|
| `edit_file` | Replace `old_content` with `new_content` in a file | Most common; fix a broken line or block |
| `replace_line` | Replace content at a specific line number | When you know exactly which line is wrong |
| `add_line` | Insert a new line into a file | Adding missing instructions (e.g., missing `WORKDIR`) |
| `delete_line` | Remove a specific line | Removing a bad instruction |
| `add_block` | Insert a multi-line block | Adding entire sections (e.g., `env:` block with secrets) |
| `delete_block` | Remove a multi-line block | Removing incorrect sections |
| `request_hint` | Get a clue about what's wrong | Costs 3-4% of the final score per hint; use sparingly |
| `submit` | Declare "I'm done", which triggers final evaluation | When all fixes are applied |

**Important:** `edit_file` requires `old_content` to match **exactly** (including whitespace). If it doesn't match, the edit fails and the agent gets a -0.02 reward penalty.
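
The exact-match rule can be illustrated with a short sketch. `apply_edit` is a hypothetical helper, and replacing only the first occurrence is an assumption; the environment's own handling of repeated matches may differ.

```python
def apply_edit(content: str, old_content: str, new_content: str) -> tuple[str, bool]:
    """Apply an exact-match edit and report whether it succeeded.

    The match is a plain substring test, so whitespace and quoting must be
    identical. Assumption: only the first occurrence is replaced.
    """
    if not old_content or old_content not in content:
        return content, False
    return content.replace(old_content, new_content, 1), True


manifest = 'resources:\n  limits:\n    memory: "64Mi"\n'

# Exact match: the edit succeeds.
updated, ok = apply_edit(manifest, 'memory: "64Mi"', 'memory: "512Mi"')
print(ok)

# Two spaces after the colon: no match, so the file is left unchanged.
unchanged, ok = apply_edit(manifest, 'memory:  "64Mi"', 'memory: "512Mi"')
print(ok)
```

When the exact text is uncertain, `replace_line` avoids the matching problem because it addresses the line by number instead of by content.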

---

## Grading System

Scoring is **deterministic** (the same actions always produce the same score) and **difficulty-aware** (harder tasks are graded more generously). Scores lie strictly in the open interval **(0, 1)**, never exactly 0 or 1.

### The Formula

```
FINAL SCORE = Base + Partial Fixes + Complete Bonus + Difficulty Bonus + Efficiency - Hint Penalty - Failed Edit Penalty
```

Clamped to `(0.01, 0.99)`.

### Component Breakdown

| Component | Weight | Description |
|-----------|--------|-------------|
| Base score | 5% | Participation credit (guarantees score > 0) |
| Partial fixes | 35% | Proportional to `issues_fixed / issues_total` |
| Complete bonus | 25% | All issues fixed |
| Difficulty bonus | 0-3% | Extra reward for fully solving hard/expert tasks |
| Efficiency | 25% | Decays with extra steps, with slower decay for harder tasks |
| Hint penalty | -3% to -4% each | Per `request_hint` action (cheaper for hard/expert) |
| Failed edit penalty | -2% each | Per edit with no valid file path |

### Difficulty Modifiers

| Difficulty | Max Score | Efficiency Decay | Hint Cost |
|------------|-----------|------------------|-----------|
| Easy | 0.90 | 0.03/step (strict) | 4% each |
| Medium | 0.90 | 0.027/step | 4% each |
| Hard/Expert | 0.93 | 0.021/step (forgiving) | 3% each |
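
The tables above can be combined into a rough scoring sketch. The weights, decay rates, and hint costs are taken from the tables, but several details are assumptions made only for illustration: that efficiency decays linearly from 25% per extra step, that the efficiency term applies even to partial solutions, and that the 3% difficulty bonus applies only to hard/expert tasks. The names `Difficulty` and `score` are hypothetical; the real grader lives in `server/graders/` and may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Difficulty:
    efficiency_decay: float  # efficiency lost per extra step
    hint_cost: float         # penalty per request_hint
    difficulty_bonus: float  # extra credit when fully solved


EASY = Difficulty(efficiency_decay=0.03, hint_cost=0.04, difficulty_bonus=0.0)
MEDIUM = Difficulty(efficiency_decay=0.027, hint_cost=0.04, difficulty_bonus=0.0)
HARD = Difficulty(efficiency_decay=0.021, hint_cost=0.03, difficulty_bonus=0.03)


def score(issues_fixed: int, issues_total: int, extra_steps: int,
          hints: int, failed_edits: int, difficulty: Difficulty) -> float:
    """Combine the components from the breakdown table and clamp the result."""
    fraction = issues_fixed / issues_total if issues_total else 0.0
    complete = issues_total > 0 and issues_fixed == issues_total

    total = 0.05                                   # base score
    total += 0.35 * fraction                       # partial fixes
    total += 0.25 if complete else 0.0             # complete bonus
    total += difficulty.difficulty_bonus if complete else 0.0
    # Assumed linear decay; the real decay curve is not specified here.
    total += max(0.0, 0.25 - difficulty.efficiency_decay * extra_steps)
    total -= difficulty.hint_cost * hints          # hint penalty
    total -= 0.02 * failed_edits                   # failed edit penalty
    return min(0.99, max(0.01, total))             # clamp to (0.01, 0.99)


print(round(score(1, 1, 0, 0, 0, EASY), 2))
print(round(score(3, 3, 0, 0, 0, HARD), 2))
```

Under these assumptions a perfect run reproduces the Max Score column: 0.05 + 0.35 + 0.25 + 0.25 = 0.90 for easy and medium tasks, plus the 0.03 bonus for hard/expert.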

---

## Evaluation

The evaluation pipeline runs **all 50 scenarios across all 10 tasks** deterministically:

```python
# Runs all 10 tasks × 5 scenarios = 50 episodes
results = run_baseline_episodes()  # num_episodes=None runs all

# Per-episode scores in (0, 1)
# Aggregate = mean of all 50 scores
aggregate = sum(r.score for r in results) / len(results)
```

This ensures:
- **Reproducibility**: same agent produces same score every time
- **Complete coverage**: every error pattern is tested
- **Fair comparison**: all agents face the same 50 scenarios

---

## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/` | GET | Root page |
| `/health` | GET | Health check; returns `{"status": "healthy"}` |
| `/metadata` | GET | Environment name, description, version, tags |
| `/schema` | GET | Action, observation, and state JSON schemas |
| `/reset` | POST | Start a new episode (optional: `task_id`, `scenario_id`, `seed`) |
| `/step` | POST | Take an action and receive observation + reward |
| `/state` | GET | Get current observation without taking an action |
| `/info` | GET | Task list with metadata |
| `/tasks` | GET | List all tasks with difficulty levels |
| `/grader` | POST | Grade a trajectory (list of step dicts) |
| `/baseline` | POST | Run baseline across all scenarios (optional: `task_id`, `num_episodes`) |
| `/mcp` | POST | JSON-RPC 2.0 MCP endpoint (initialize, tools/list) |

### Example: Full Episode via API

```bash
# 1. Start an episode
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "k8s_pod_failures", "scenario_id": "oom_killed"}'

# 2. Fix the memory limit (any reasonable value works; the simulator validates structurally)
curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{
    "action": {
      "action_type": "edit_file",
      "edits": [{
        "file_path": "k8s/deployment.yaml",
        "old_content": "memory: \"64Mi\"",
        "new_content": "memory: \"512Mi\""
      }]
    }
  }'

# Response: reward=0.3, issues_fixed=1/1, done=true
```
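
The same episode can be driven from Python using only the standard library. This is a sketch that mirrors the curl commands above; the helper names (`build_request`, `post_json`, `run_episode`) are introduced here for illustration, and it assumes the server is running locally on port 7860. The response is printed as returned, since its exact field layout is not shown in this README.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # assumption: local server from Quick Start

RESET_PAYLOAD = {"task_id": "k8s_pod_failures", "scenario_id": "oom_killed"}
STEP_PAYLOAD = {
    "action": {
        "action_type": "edit_file",
        "edits": [{
            "file_path": "k8s/deployment.yaml",
            "old_content": 'memory: "64Mi"',
            "new_content": 'memory: "512Mi"',
        }],
    }
}


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for the given endpoint path."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path: str, payload: dict) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_request(path, payload)) as response:
        return json.loads(response.read().decode("utf-8"))


def run_episode() -> None:
    """Reset to the oom_killed scenario, apply one fix, and print both responses."""
    print(post_json("/reset", RESET_PAYLOAD))
    print(post_json("/step", STEP_PAYLOAD))
```

Call `run_episode()` once the server is up; nothing is sent over the network until then.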

---

## Quick Start

### Local Development

```bash
pip install -r requirements.txt
python -m uvicorn server.app:app --host 0.0.0.0 --port 7860
```

### Run Tests

```bash
pytest tests/ -v
```

### Docker

```bash
docker build -t cloud-native-devops-env .
docker run -p 7860:7860 cloud-native-devops-env
```

### Baseline Inference (with LLM)

```bash
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=meta-llama/Llama-3.1-70B-Instruct
export HF_TOKEN=your_token_here
python inference.py
```

---

## Project Structure

```
cloud-native-devops-env/
├── openenv.yaml              # OpenEnv environment specification
├── inference.py              # LLM baseline (OpenAI client + HF router)
├── baseline_runner.py        # Heuristic baseline; runs all 50 scenarios
├── Dockerfile                # Production container
├── requirements.txt          # Python dependencies
│
├── server/
│   ├── app.py                # FastAPI with 12 endpoints
│   ├── models.py             # Pydantic models (type-safe API)
│   ├── environment.py        # Core environment loop (reset/step/state)
│   ├── tasks/
│   │   ├── base.py           # BaseTask with scenario loading
│   │   ├── task_registry.py  # Maps task_id → task class (10 tasks)
│   │   ├── task_1_build_errors.py        # 5 Dockerfile syntax scenarios
│   │   ├── task_2_docker_runtime.py      # 5 Dockerfile runtime scenarios
│   │   ├── task_3_workflow_syntax.py     # 5 workflow structure scenarios
│   │   ├── task_4_workflow_secrets_permissions.py  # 5 secrets scenarios
│   │   ├── task_5_ci_docker_integration.py        # 5 integration scenarios
│   │   ├── task_6_multi_stage_matrix.py           # 5 multi-issue scenarios
│   │   ├── k8s_pod.py                   # 5 Kubernetes pod failure scenarios
│   │   ├── k8s_networking.py            # 5 K8s networking scenarios
│   │   ├── pipeline_build_deploy.py     # 5 GHA→Docker→Registry scenarios
│   │   └── pipeline_full.py             # 5 full-stack multi-error scenarios
│   ├── graders/
│   │   └── __init__.py       # Deterministic trajectory grader
│   └── simulators/
│       ├── docker_simulator.py   # Dockerfile build + runtime validation
│       ├── workflow_simulator.py # GHA workflow parse + execution validation
│       └── k8s_simulator.py     # K8s manifest + cross-resource validation
│
└── tests/
    ├── test_endpoints.py     # API endpoint tests
    ├── test_determinism.py   # Grader determinism + score range tests
    ├── test_baseline.py      # Heuristic baseline tests
    ├── test_environment_flow.py  # Episode flow tests
    └── test_simulators.py    # Simulator unit tests
```

## Design Decisions

1. **Full cloud-native stack**: Docker + GitHub Actions + Kubernetes, the three pillars of modern deployment pipelines.
2. **Simulator-based validation**: Structural rule-based simulators validate fixes instead of string matching. Alternative valid fixes are accepted (e.g., `512Mi` and `256Mi` both fix an OOM). Deterministic, fast, no security concerns.
3. **Dense rewards**: Partial credit at every step (+0.3 per fix, -0.02 per failed edit) rather than sparse pass/fail.
4. **Difficulty progression**: Easy tasks are single-file, single-issue. Expert tasks are multi-file, multi-issue with interacting bugs across all three layers.
5. **Vague error messages in harder tasks**: Easy tasks have explicit error messages. Hard/Expert tasks have realistic, vague messages that require the agent to actually diagnose the issue from context.
6. **Deterministic evaluation**: All 50 scenarios run every time for reproducible, comparable scores in (0, 1) exclusive.
7. **50 scenarios from real bugs**: Every scenario is based on actual developer mistakes documented on Stack Overflow, GitHub Issues, and official documentation.

## License

MIT