Broken Dockerfiles. Misconfigured workflows. K8s pods stuck in CrashLoopBackOff. This environment throws real-world deployment failures at AI agents and measures how well they can track down the root cause and fix it.
Same loop every DevOps engineer runs through, just faster.
The agent gets broken config files (a Dockerfile, a workflow YAML, some K8s manifests) along with whatever error the pipeline spit out.
Read the error, find the bug, edit the file. Could be a typo, a wrong port, a missing secret. Up to 10 steps to get it right.
Deterministic scoring: how many issues got fixed, how quickly, and whether hints were needed. Harder tasks are graded more generously.
Ten tasks, ranging from single-typo Dockerfile fixes to multi-bug pipeline debugging across GHA + Docker + K8s:
- The classic stuff: misspelled filenames, bad base image tags, broken `RUN` continuations. Things that make `docker build` fail immediately.
- It builds fine, then crashes at runtime: missing `WORKDIR`, `CMD`/`ENTRYPOINT` conflicts, permission issues, missing env vars.
- GitHub Actions YAML that GitHub refuses to even parse: missing `runs-on`, wrong trigger format, steps without actions.
- The secret is right there in the repo settings, but the workflow can't see it: missing `env:` blocks, wrong `${{ }}` syntax, token permission gaps.
- The workflow and Dockerfile depend on each other: build context mismatches, missing buildx setup, login without secrets.
- Multi-stage builds, matrix strategies, cross-job artifacts: two or three bugs that only make sense when you look at the files together.
- Pods stuck in CrashLoopBackOff or ImagePullBackOff: OOM kills, wrong commands, missing ConfigMaps, misconfigured probes.
- Pods are running, but nobody can reach them: selector mismatches, wrong `targetPort`s, NetworkPolicies blocking traffic, missing ingress classes.
- End-to-end GHA-to-Docker-to-registry failures: GHCR tokens not wired up, image tag mismatches between build and push, missing permissions.
- The real deal: 2 to 4 bugs scattered across a GHA workflow, Dockerfile, and K8s manifests at the same time. Requires cross-file reasoning.
Deterministic, difficulty-aware scoring. Same actions, same score. Harder tasks get more room to breathe.
- Credit for each issue you fix, even if you don't get them all
- Extra credit when every single issue is resolved
- Fewer steps is better; the per-step decay is gentler on hard tasks
- Solving hard/expert tasks perfectly earns extra points
- A penalty per hint used; hints cost less on harder tasks, where leaning on them is fair
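The rules above can be sketched as a single scoring function. This is an illustrative sketch, not the environment's actual implementation: every constant, weight, and argument name below is an assumption.

```python
def score(fixed, total, steps, hints, difficulty):
    """Illustrative difficulty-aware score in [0, 1]. All constants are
    assumptions, not the environment's real weights.

    difficulty: 0 = easy .. 3 = expert. Harder tasks decay more gently
    per step and are charged less per hint.
    """
    per_issue = fixed / total                       # partial credit per fixed issue
    completion = 0.2 if fixed == total else 0.0     # bonus for resolving everything
    step_decay = 0.02 / (1 + difficulty)            # gentler decay on hard tasks
    efficiency = max(0.0, 1.0 - step_decay * steps)
    hint_cost = (0.10 / (1 + difficulty)) * hints   # hints are cheaper where they're fair
    # perfection bonus only on hard/expert tasks solved cleanly without hints
    bonus = 0.15 if (difficulty >= 2 and fixed == total and hints == 0) else 0.0
    raw = per_issue * efficiency + completion + bonus - hint_cost
    return max(0.0, min(1.0, raw))                  # same actions, same score
```

The shape matters more than the numbers: partial credit is monotone in issues fixed, step decay and hint penalties shrink as difficulty rises, and the whole thing is pure arithmetic, so identical trajectories always score identically.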
Everything you need to run episodes, grade trajectories, and inspect the environment.
| Endpoint | Method | Description |
|---|---|---|
| /health | GET | Returns `{"status": "healthy"}` |
| /metadata | GET | Environment name, version, tags |
| /tasks | GET | All 10 tasks with difficulty levels |
| /info | GET | Full task list with schemas |
| /reset | POST | Start a new episode (pick a task or get a random one) |
| /step | POST | Take an action, get back observation + reward |
| /state | GET | Current observation without acting |
| /grader | POST | Score a trajectory after the episode |
| /baseline | POST | Run the built-in heuristic baseline |
| /schema | GET | Action and observation JSON schemas |
| /mcp | POST | JSON-RPC 2.0 MCP endpoint |
| /docs | GET | Interactive Swagger docs |
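Wiring an episode loop against the endpoints above mostly means building the right JSON bodies. A minimal sketch, assuming field names (`task_id`, `action`, `tool`, `args`) that are guesses from the table, not the documented schema; check `/schema` for the real one:

```python
def reset_payload(task_id=None):
    """Body for POST /reset; omit task_id to get a random task.
    The 'task_id' field name is an assumption."""
    return {} if task_id is None else {"task_id": task_id}

def step_payload(tool, **args):
    """Body for POST /step; the 'action'/'tool'/'args' shape is an assumption."""
    return {"action": {"tool": tool, "args": args}}

def mcp_payload(method, params, req_id=1):
    """Standard JSON-RPC 2.0 envelope for POST /mcp."""
    return {"jsonrpc": "2.0", "id": req_id, "method": method, "params": params}

# Sending one with only the standard library (host/port assumed):
#   import json, urllib.request
#   body = json.dumps(step_payload("edit_file", path="Dockerfile")).encode()
#   req = urllib.request.Request("http://localhost:8000/step", data=body,
#                                headers={"Content-Type": "application/json"})
#   print(urllib.request.urlopen(req).read())
```

The payload builders are kept pure so they can be unit-tested without a running server; only the commented-out `urllib` call touches the network.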
Fix a K8s OOMKilled pod in 3 commands.
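In practice that loop is: `kubectl describe pod` to spot the `OOMKilled` reason, raise the container's memory limit in the manifest, then `kubectl apply`. A hypothetical fragment of the fix (names and values are illustrative, not from a real task):

```yaml
# deployment.yaml (hypothetical): the pod was OOMKilled because the limit
# sat below the app's working set; raise the limit and redeploy.
resources:
  requests:
    memory: "128Mi"
  limits:
    memory: "512Mi"   # was 64Mi, too small, so the kernel killed the container
```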