Commit d129f63
Parent(s): 8886ce5
updated README with details
- .gitignore +1 -0
- README.md +303 -100
.gitignore
CHANGED
|
@@ -2,6 +2,7 @@
|
|
| 2 |
__pycache__/
|
| 3 |
*.py[cod]
|
| 4 |
*$py.class
|
|
|
|
| 5 |
|
| 6 |
# Virtual environments
|
| 7 |
.venv/
|
|
|
|
| 2 |
__pycache__/
|
| 3 |
*.py[cod]
|
| 4 |
*$py.class
|
| 5 |
+
.claude/
|
| 6 |
|
| 7 |
# Virtual environments
|
| 8 |
.venv/
|
README.md
CHANGED
|
@@ -12,78 +12,267 @@ pinned: false
|
|
| 12 |
|
| 13 |
An OpenEnv-compatible environment where AI agents learn to debug broken GitHub Actions workflows and Dockerfiles. Built for the OpenEnv Hackathon by Scaler School of Technology (partners: Meta, HuggingFace, PyTorch).
|
| 14 |
|
| 15 |
-
##
|
| 16 |
|
| 17 |
-
|
| 18 |
-
- Broken configuration files (Dockerfile, GitHub Actions YAML)
|
| 19 |
-
- Error messages from failed builds/workflows
|
| 20 |
-
- Context about available secrets and runner environment
|
| 21 |
|
| 22 |
-
|
|
|
|
|
|
|
| 23 |
|
| 24 |
-
|
| 25 |
|
| 26 |
-
|
| 27 |
-
|---|---------|-------------|------------|-----------|
|
| 28 |
-
| 1 | `dockerfile_syntax` | Fix Dockerfile instruction/syntax errors | Easy | 5 |
|
| 29 |
-
| 2 | `dockerfile_runtime` | Fix Dockerfile runtime/execution issues | Medium | 5 |
|
| 30 |
-
| 3 | `workflow_syntax_structure` | Fix GitHub Actions YAML structure | Easy | 5 |
|
| 31 |
-
| 4 | `workflow_secrets_permissions` | Fix secret wiring and permissions | Medium | 5 |
|
| 32 |
-
| 5 | `ci_docker_integration` | Debug combined CI + Docker failures | Medium-Hard | 5 |
|
| 33 |
-
| 6 | `multi_stage_pipeline_matrix` | Debug multi-stage and matrix pipelines | Hard | 5 |
|
| 34 |
|
| 35 |
-
|
| 36 |
|
| 37 |
-
|
| 38 |
|
| 39 |
-
|
| 40 |
-
|----------|--------|-------------|
|
| 41 |
-
| `/` | GET | Health check |
|
| 42 |
-
| `/reset` | POST | Start a new episode (optional: `task_id`, `scenario_id`, `seed`) |
|
| 43 |
-
| `/step` | POST | Take an action (`edit_file`, `replace_line`, `add_line`, `delete_line`, `submit`, `request_hint`) |
|
| 44 |
-
| `/state` | GET | Get current observation |
|
| 45 |
-
| `/info` | GET | Environment metadata and schemas |
|
| 46 |
-
| `/tasks` | GET | List all tasks |
|
| 47 |
-
| `/grader` | POST | Grade a trajectory |
|
| 48 |
-
| `/baseline` | POST | Run built-in heuristic baseline |
|
| 49 |
|
| 50 |
-
##
|
| 51 |
|
| 52 |
-
|
| 53 |
|
| 54 |
-
|
| 55 |
-
|-----------|--------|-------------|
|
| 56 |
-
| Partial fixes | 40% | Proportional to issues fixed |
|
| 57 |
-
| Complete solution | 30% | Bonus when ALL issues fixed |
|
| 58 |
-
| Efficiency | 30% | Bonus for minimal steps (decays with extra steps) |
|
| 59 |
-
| Hint penalty | -5% each | Per hint requested |
|
| 60 |
|
| 61 |
-
|
| 62 |
|
| 63 |
-
##
|
| 64 |
|
| 65 |
-
|
| 66 |
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
|
| 70 |
```
|
| 71 |
|
| 72 |
-
|
| 73 |
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
|
| 77 |
|
| 78 |
-
|
| 79 |
-
|
| 80 |
|
| 81 |
-
|
| 82 |
curl -X POST http://localhost:7860/reset \
|
| 83 |
-H "Content-Type: application/json" \
|
| 84 |
-
-d '{"task_id": "dockerfile_syntax"}'
|
| 85 |
|
| 86 |
-
#
|
|
|
|
|
|
|
| 87 |
curl -X POST http://localhost:7860/step \
|
| 88 |
-H "Content-Type: application/json" \
|
| 89 |
-d '{
|
|
@@ -97,10 +286,43 @@ curl -X POST http://localhost:7860/step \
|
|
| 97 |
}
|
| 98 |
}'
|
| 99 |
|
| 100 |
-
#
|
|
|
|
|
|
|
| 101 |
curl -X POST http://localhost:7860/step \
|
| 102 |
-H "Content-Type: application/json" \
|
| 103 |
-d '{"action": {"action_type": "submit"}}'
|
| 104 |
```
|
| 105 |
|
| 106 |
### Run Tests
|
|
@@ -125,73 +347,54 @@ export HF_TOKEN=your_token_here
|
|
| 125 |
python inference.py
|
| 126 |
```
|
| 127 |
|
| 128 |
-
|
| 129 |
-
```bash
|
| 130 |
-
python inference.py dockerfile_syntax
|
| 131 |
-
```
|
| 132 |
|
| 133 |
## Project Structure
|
| 134 |
|
| 135 |
```
|
| 136 |
cicd-debug-env/
|
| 137 |
-
├── openenv.yaml # OpenEnv
|
| 138 |
-
├── inference.py # LLM baseline
|
| 139 |
├── baseline_runner.py # Heuristic baseline for /baseline endpoint
|
| 140 |
├── Dockerfile # Production container
|
| 141 |
├── requirements.txt # Python dependencies
|
| 142 |
-
├── README.md
|
| 143 |
│
|
| 144 |
├── server/
|
| 145 |
-
│   ├──
|
| 146 |
-
│   ├──
|
| 147 |
-
│   ├──
|
| 148 |
-
│   ├── environment.py # Core environment logic
|
| 149 |
-
│   │
|
| 150 |
│   ├── tasks/
|
| 151 |
-
│   │   ├── base.py # BaseTask
|
| 152 |
-
│   │   ├── task_registry.py #
|
| 153 |
-
│   │   ├── task_1_build_errors.py
|
| 154 |
-
│   │   ├── task_2_docker_runtime.py
|
| 155 |
-
│   │   ├── task_3_workflow_syntax.py
|
| 156 |
-
│   │   ├── task_4_workflow_secrets_permissions.py
|
| 157 |
-
│   │   ├── task_5_ci_docker_integration.py
|
| 158 |
-
│   │   └── task_6_multi_stage_matrix.py
|
| 159 |
-
│   │
|
| 160 |
│   ├── graders/
|
| 161 |
-
│   │   ├── __init__.py # Deterministic grader
|
| 162 |
-
│   │   └── base.py # Base grader
|
| 163 |
-
│
|
| 164 |
-
│
|
| 165 |
-
│
|
| 166 |
-
│   │   └── workflow_simulator.py # Workflow validation (15+ rules)
|
| 167 |
-
│   │
|
| 168 |
-
│   └── utils/
|
| 169 |
-
│       └── yaml_parser.py
|
| 170 |
│
|
| 171 |
└── tests/
|
| 172 |
-
    ├──
|
| 173 |
-
    ├──
|
| 174 |
-
|
|
|
|
|
|
|
| 175 |
```
|
| 176 |
|
| 177 |
-
## Expected Baseline Scores
|
| 178 |
-
|
| 179 |
-
| Task | Expected |
|
| 180 |
-
|------|----------|
|
| 181 |
-
| dockerfile_syntax | 0.70 |
|
| 182 |
-
| dockerfile_runtime | 0.55 |
|
| 183 |
-
| workflow_syntax_structure | 0.65 |
|
| 184 |
-
| workflow_secrets_permissions | 0.50 |
|
| 185 |
-
| ci_docker_integration | 0.45 |
|
| 186 |
-
| multi_stage_pipeline_matrix | 0.30 |
|
| 187 |
-
|
| 188 |
## Design Decisions
|
| 189 |
|
| 190 |
-
1. **
|
| 191 |
-
2. **Simulated validation**: Static analysis instead of
|
| 192 |
-
3. **Dense rewards**: Partial credit at every step rather than sparse pass/fail
|
| 193 |
-
4. **
|
| 194 |
-
5. **
|
|
|
|
| 195 |
|
| 196 |
## License
|
| 197 |
|
|
|
|
| 12 |
|
| 13 |
An OpenEnv-compatible environment where AI agents learn to debug broken GitHub Actions workflows and Dockerfiles. Built for the OpenEnv Hackathon by Scaler School of Technology (partners: Meta, HuggingFace, PyTorch).
|
| 14 |
|
| 15 |
+
## Why CI/CD Debugging?
|
| 16 |
|
| 17 |
+
Every developer who ships code hits CI/CD failures. A misconfigured Dockerfile, a broken GitHub Actions workflow, a missing secret: these are the bugs that waste hours of developer time every week. They're hard to debug because:
|
| 18 |
|
| 19 |
+
- Error messages are cryptic ("unable to prepare context: unable to evaluate symlinks")
|
| 20 |
+
- The feedback loop is slow (push, wait for CI, read logs, fix, repeat)
|
| 21 |
+
- Multiple config files interact in non-obvious ways (Dockerfile + workflow + secrets)
|
| 22 |
|
| 23 |
+
This environment teaches AI agents to do what senior DevOps engineers do: read the error, trace it to the root cause, and fix it.
|
| 24 |
|
| 25 |
+
---
|
| 26 |
|
| 27 |
+
## How It Works: The Complete Flow
|
| 28 |
|
| 29 |
+
```
|
| 30 |
+
┌────────────────────────────────────────────────────────────┐
|
| 31 |
+
│  1. RESET                                                  │
|
| 32 |
+
│     Agent receives:                                        │
|
| 33 |
+
│     - Broken config files (Dockerfile / workflow YAML)     │
|
| 34 |
+
│     - Error message from the failed build/deploy           │
|
| 35 |
+
│     - Available secrets list                               │
|
| 36 |
+
│     - Number of issues to find                             │
|
| 37 |
+
├────────────────────────────────────────────────────────────┤
|
| 38 |
+
│  2. OBSERVE → THINK → ACT (repeat up to 10 steps)          │
|
| 39 |
+
│     Agent reads the error, analyzes the files, then:       │
|
| 40 |
+
│     - edit_file: replace broken content with fixed content │
|
| 41 |
+
│     - replace_line: fix a specific line number             │
|
| 42 |
+
│     - add_line / add_block: insert missing content         │
|
| 43 |
+
│     - delete_line / delete_block: remove bad content       │
|
| 44 |
+
│     - request_hint: get a clue (-5% score penalty)         │
|
| 45 |
+
│     - submit: "I'm done fixing"                            │
|
| 46 |
+
│                                                            │
|
| 47 |
+
│     After each action, agent gets:                         │
|
| 48 |
+
│     - Updated file contents                                │
|
| 49 |
+
│     - Reward signal (+0.3 per fix, -0.02 for failed edits) │
|
| 50 |
+
│     - How many issues are now fixed                        │
|
| 51 |
+
├────────────────────────────────────────────────────────────┤
|
| 52 |
+
│  3. GRADE                                                  │
|
| 53 |
+
│     Deterministic scoring based on:                        │
|
| 54 |
+
│     - What fraction of issues were fixed                   │
|
| 55 |
+
│     - Whether ALL issues were fixed (bonus)                │
|
| 56 |
+
│     - How many steps it took (efficiency)                  │
|
| 57 |
+
│     - How many hints were used (penalty)                   │
|
| 58 |
+
└────────────────────────────────────────────────────────────┘
|
| 59 |
+
```
|
| 60 |
|
| 61 |
+
---
|
| 62 |
|
| 63 |
+
## The 6 Tasks (30 Scenarios)
|
| 64 |
|
| 65 |
+
### Task 1: Dockerfile Syntax Errors (Easy)
|
| 66 |
|
| 67 |
+
Simple typos and instruction errors that break `docker build`. These are the bugs every developer makes on day one.
|
| 68 |
|
| 69 |
+
| # | Scenario | What's Broken | Real-World Context |
|
| 70 |
+
|---|----------|---------------|-------------------|
|
| 71 |
+
| 1 | `typo_filename` | `COPY requirments.txt .`: misspelled filename | A frequently reported Docker build error on Stack Overflow |
|
| 72 |
+
| 2 | `invalid_base_image` | `FROM python:3.9-slimm`: extra 'm' in tag | Happens when copy-pasting image tags |
|
| 73 |
+
| 3 | `invalid_run_syntax` | `RUN pip install ... \n && python setup.py`: broken line continuation | Formatting multi-line RUN commands is tricky |
|
| 74 |
+
| 4 | `invalid_expose` | `EXPOSE "eighty"`: string instead of port number | EXPOSE expects a numeric port, optionally with `/tcp` or `/udp` |
|
| 75 |
+
| 5 | `missing_from_instruction` | No `FROM` instruction at all | Dockerfile must start with FROM (or ARG before FROM) |
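
Two of the Task 1 failure modes can be caught with simple static rules. The following is a minimal sketch, not the project's `docker_simulator.py`: the function name `check_dockerfile` and both rules are assumptions written for illustration, and they cover only the `missing_from_instruction` and `invalid_expose` scenarios.

```python
import re


def check_dockerfile(text):
    """Return scenario-style labels for problems found by two static rules."""
    problems = []
    # Keep only instruction lines: drop blank lines and comments.
    lines = [ln.strip() for ln in text.splitlines()]
    instructions = [ln for ln in lines if ln and not ln.startswith("#")]

    # Rule 1: the first instruction must be FROM (ARG lines may precede it).
    non_arg = [ln for ln in instructions if not ln.upper().startswith("ARG ")]
    if not non_arg or not non_arg[0].upper().startswith("FROM "):
        problems.append("missing_from_instruction")

    # Rule 2 (simplified): each EXPOSE argument is a number, optionally
    # followed by /tcp or /udp. Variables and port ranges are not handled.
    for ln in instructions:
        if ln.upper().startswith("EXPOSE "):
            for port in ln.split()[1:]:
                if not re.fullmatch(r"\d+(/(tcp|udp))?", port):
                    problems.append("invalid_expose")
    return problems


print(check_dockerfile('FROM python:3.9-slim\nEXPOSE "eighty"\n'))
print(check_dockerfile("COPY . .\n"))
```

Rules like these run without a Docker daemon, which is what keeps the results deterministic.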
|
| 76 |
|
| 77 |
+
### Task 2: Dockerfile Runtime Errors (Medium)
|
| 78 |
|
| 79 |
+
The Dockerfile builds successfully, but the container crashes when you run it. These are harder because the error appears at runtime, not build time.
|
| 80 |
|
| 81 |
+
| # | Scenario | What's Broken | Real-World Context |
|
| 82 |
+
|---|----------|---------------|-------------------|
|
| 83 |
+
| 1 | `missing_workdir` | No WORKDIR, so files scatter to `/` | Container runs but `npm start` can't find `package.json` |
|
| 84 |
+
| 2 | `cmd_entrypoint_conflict` | Both ENTRYPOINT and CMD defined as full commands | Process starts incorrectly; CMD should be args-only when ENTRYPOINT exists |
|
| 85 |
+
| 3 | `entrypoint_not_executable` | Shell script lacks execute permission | `chmod +x` is missing, causing "permission denied" at container start |
|
| 86 |
+
| 4 | `missing_required_env` | App needs `DATABASE_URL` but it's not set | Container starts then crashes: "DATABASE_URL is not defined" |
|
| 87 |
+
| 5 | `non_root_privileged_port` | Non-root user tries to bind port 80 | Security best practice (non-root) conflicts with port < 1024 |
|
| 88 |
+
|
| 89 |
+
### Task 3: Workflow Syntax & Structure (Easy)
|
| 90 |
+
|
| 91 |
+
GitHub Actions YAML has structural problems. GitHub rejects most of these before any job runs.
|
| 92 |
+
|
| 93 |
+
| # | Scenario | What's Broken | Real-World Context |
|
| 94 |
+
|---|----------|---------------|-------------------|
|
| 95 |
+
| 1 | `checkout_after_build` | `docker build` runs before `actions/checkout` | No source code is checked out, so the build fails with "Dockerfile not found" |
|
| 96 |
+
| 2 | `missing_runs_on` | Job has no `runs-on` field | GitHub Actions rejects: every job needs a runner |
|
| 97 |
+
| 3 | `invalid_trigger_syntax` | `branches: main` instead of `branches: [main]` | Must be a YAML list, not a scalar string |
|
| 98 |
+
| 4 | `missing_step_uses_or_run` | Step has a name but no `uses:` or `run:` | Invalid step: every step must do something |
|
| 99 |
+
| 5 | `missing_on_trigger` | No `on:` block at all | Workflow never triggers because GitHub doesn't know when to run it |
|
| 100 |
+
|
| 101 |
+
### Task 4: Workflow Secrets & Permissions (Medium)
|
| 102 |
+
|
| 103 |
+
Secrets exist in the repository but aren't wired correctly to the workflow steps. These are the bugs that make you say "but the secret is right there!"
|
| 104 |
+
|
| 105 |
+
| # | Scenario | What's Broken | Real-World Context |
|
| 106 |
+
|---|----------|---------------|-------------------|
|
| 107 |
+
| 1 | `missing_env_secrets` | `$DOCKER_PASSWORD` in `run:` but no `env:` mapping | Secrets must be explicitly passed via `env:` block |
|
| 108 |
+
| 2 | `wrong_secret_syntax` | `${ secrets.TOKEN }` instead of `${{ secrets.TOKEN }}` | Single braces vs double braces: a subtle syntax difference |
|
| 109 |
+
| 3 | `missing_token_permissions` | Pushing to GHCR without `permissions: packages: write` | `GITHUB_TOKEN` defaults to read-only for new repositories since February 2023 |
|
| 110 |
+
| 4 | `secret_not_in_env` | `curl` uses `$SLACK_WEBHOOK_URL` but it's not in `env:` | Same pattern as #1, a very common mistake |
|
| 111 |
+
| 5 | `ghcr_wrong_credentials` | Using `DOCKER_PASSWORD` for GHCR login | GHCR uses `GITHUB_TOKEN`, not Docker Hub credentials |
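
The `wrong_secret_syntax` scenario is the kind of bug a regular expression can flag. This is a hedged sketch rather than the project's `workflow_simulator.py`: the pattern and the helper name `find_bad_secret_refs` are assumptions for illustration.

```python
import re

# Matches a single-brace reference such as ${ secrets.TOKEN } but not the
# valid double-brace form ${{ secrets.TOKEN }}.
SINGLE_BRACE_SECRET = re.compile(r"\$\{(?!\{)\s*secrets\.\w+\s*\}")


def find_bad_secret_refs(workflow_text):
    """Return (line_number, matched_text) for each malformed secret reference."""
    hits = []
    for number, line in enumerate(workflow_text.splitlines(), start=1):
        for match in SINGLE_BRACE_SECRET.finditer(line):
            hits.append((number, match.group(0)))
    return hits


workflow = """\
steps:
  - run: echo "${ secrets.TOKEN }"
    env:
      TOKEN: ${{ secrets.TOKEN }}
"""
print(find_bad_secret_refs(workflow))
```

Working on raw text instead of parsed YAML keeps the line number available for the error message.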
|
| 112 |
+
|
| 113 |
+
### Task 5: CI + Docker Integration (Medium-Hard)
|
| 114 |
+
|
| 115 |
+
The workflow AND the Dockerfile interact. Fixing one file alone isn't enough; you need to understand how they work together.
|
| 116 |
+
|
| 117 |
+
| # | Scenario | What's Broken | Real-World Context |
|
| 118 |
+
|---|----------|---------------|-------------------|
|
| 119 |
+
| 1 | `missing_buildx_for_platforms` | Multi-platform build without `setup-buildx-action` | Standard Docker builder can't cross-compile; need BuildKit |
|
| 120 |
+
| 2 | `login_secrets_not_wired` | `docker login` step missing `env:` for secrets | Auth fails with "unauthorized: authentication required" |
|
| 121 |
+
| 3 | `wrong_build_context` | Context is `./backend` but Dockerfile path is `./Dockerfile` | Path mismatch: the build can't find the Dockerfile |
|
| 122 |
+
| 4 | `cache_without_mode_max` | GHA cache export missing `mode=max` | Cache doesn't persist intermediate layers; slow rebuilds |
|
| 123 |
+
| 5 | `push_without_login` | `docker push` without `docker login` first | "denied: requested access to the resource is denied" |
|
| 124 |
+
|
| 125 |
+
### Task 6: Multi-Stage Pipeline & Matrix (Hard)
|
| 126 |
+
|
| 127 |
+
Complex pipelines with multiple interacting bugs. The agent must find and fix 2-3 issues across multiple files.
|
| 128 |
+
|
| 129 |
+
| # | Scenario | What's Broken | Real-World Context |
|
| 130 |
+
|---|----------|---------------|-------------------|
|
| 131 |
+
| 1 | `artifact_path_mismatch` | `COPY --from=builder /app/dist` but React outputs to `/app/build` | Framework output directories vary: CRA uses `build/`, Vite uses `dist/` |
|
| 132 |
+
| 2 | `matrix_platform_arg` | Uses `$BUILDPLATFORM` without `ARG BUILDPLATFORM` declaration | Multi-arch builds need platform ARGs declared before FROM |
|
| 133 |
+
| 3 | `cross_job_artifact` | Test job downloads artifact but missing `needs: build` | Jobs run in parallel by default, so the artifact doesn't exist yet |
|
| 134 |
+
| 4 | `multiple_issues` | Dockerfile typo + workflow secrets not wired (2 bugs) | Real debugging: problems compound across files |
|
| 135 |
+
| 5 | `matrix_version_failure` | Matrix includes Node 14 but code needs >= 16 + missing `needs:` | Version compatibility + job ordering: 2 bugs to find |
|
| 136 |
+
|
| 137 |
+
---
|
| 138 |
+
|
| 139 |
+
## Available Actions
|
| 140 |
+
|
| 141 |
+
Each step, the agent chooses exactly one action:
|
| 142 |
+
|
| 143 |
+
| Action | What It Does | When to Use |
|
| 144 |
+
|--------|-------------|-------------|
|
| 145 |
+
| `edit_file` | Replace `old_content` with `new_content` in a file | Most common: fix a broken line or block |
|
| 146 |
+
| `replace_line` | Replace content at a specific line number | When you know exactly which line is wrong |
|
| 147 |
+
| `add_line` | Insert a new line into a file | Adding missing instructions (e.g., missing `WORKDIR`) |
|
| 148 |
+
| `delete_line` | Remove a specific line | Removing a bad instruction |
|
| 149 |
+
| `add_block` | Insert a multi-line block | Adding entire sections (e.g., `env:` block with secrets) |
|
| 150 |
+
| `delete_block` | Remove a multi-line block | Removing incorrect sections |
|
| 151 |
+
| `request_hint` | Get a clue about what's wrong | Costs 5% of the final score; use sparingly |
|
| 152 |
+
| `submit` | Declare "I'm done" and trigger final evaluation | When all fixes are applied |
|
| 153 |
+
|
| 154 |
+
**Important:** `edit_file` requires `old_content` to match **exactly** (including whitespace). If it doesn't match, the edit fails and the agent gets a -0.02 reward penalty.
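
The exact-match rule can be sketched as follows. This is an illustration under assumptions, not the environment's actual code: the helper name `apply_edit` is hypothetical, and replacing only the first occurrence is a guess about the intended behavior.

```python
FAILED_EDIT_PENALTY = -0.02  # per-step penalty stated in this README


def apply_edit(content, old_content, new_content):
    """Replace the first exact occurrence of old_content.

    Returns (updated_content, reward). A miss leaves the file untouched
    and returns the failed-edit penalty.
    """
    if not old_content or old_content not in content:
        return content, FAILED_EDIT_PENALTY
    return content.replace(old_content, new_content, 1), 0.0


dockerfile = "FROM python:3.9-slim\nCOPY requirments.txt .\n"

# Exact match: the edit applies.
fixed, reward = apply_edit(
    dockerfile, "COPY requirments.txt .", "COPY requirements.txt ."
)

# One extra space in old_content: the edit fails and is penalized.
unchanged, penalty = apply_edit(
    dockerfile, "COPY  requirments.txt .", "COPY requirements.txt ."
)
print(reward, penalty)
```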
|
| 155 |
+
|
| 156 |
+
---
|
| 157 |
+
|
| 158 |
+
## Grading System: How Scores Work
|
| 159 |
+
|
| 160 |
+
Scoring is **deterministic** (same actions always produce the same score) and **dynamic** (different strategies get different scores).
|
| 161 |
+
|
| 162 |
+
### The Formula
|
| 163 |
+
|
| 164 |
+
```
|
| 165 |
+
FINAL SCORE = Partial Fixes + Complete Bonus + Efficiency - Hint Penalty
|
| 166 |
```
|
| 167 |
|
| 168 |
+
Clamped to `[0.0, 1.0]`.
|
| 169 |
|
| 170 |
+
### Component Breakdown
|
| 171 |
+
|
| 172 |
+
#### 1. Partial Fix Credit (40% max)
|
| 173 |
+
|
| 174 |
+
```
|
| 175 |
+
partial = 0.40 x (issues_fixed / issues_total)
|
| 176 |
+
```
|
| 177 |
+
|
| 178 |
+
| Fixed | Total | Partial Score |
|
| 179 |
+
|-------|-------|---------------|
|
| 180 |
+
| 0/2 | 2 | 0.00 |
|
| 181 |
+
| 1/2 | 2 | 0.20 |
|
| 182 |
+
| 2/2 | 2 | 0.40 |
|
| 183 |
+
| 1/3 | 3 | 0.133 |
|
| 184 |
+
|
| 185 |
+
#### 2. Complete Solution Bonus (30% max)
|
| 186 |
+
|
| 187 |
+
```
|
| 188 |
+
complete = 0.30 if ALL issues fixed
|
| 189 |
+
complete = 0.00 otherwise
|
| 190 |
+
```
|
| 191 |
+
|
| 192 |
+
All-or-nothing. Fix 2/3 issues? You get 0. Fix 3/3? You get 0.30.
|
| 193 |
+
|
| 194 |
+
#### 3. Efficiency Bonus (30% max)
|
| 195 |
|
| 196 |
+
```
|
| 197 |
+
if issues_fixed == 0: efficiency = 0.00 (no credit for doing nothing)
|
| 198 |
+
if steps <= issues_total: efficiency = 0.30 (optimal: full bonus)
|
| 199 |
+
if steps > issues_total: efficiency = 0.30 - 0.03 per extra step
|
| 200 |
+
```
|
| 201 |
+
|
| 202 |
+
Rewards agents that fix issues quickly. The "optimal" number of steps equals the number of issues (one fix per step).
|
| 203 |
+
|
| 204 |
+
| Issues | Steps Taken | Efficiency Score |
|
| 205 |
+
|--------|-------------|-----------------|
|
| 206 |
+
| 1 | 1 | 0.30 (optimal) |
|
| 207 |
+
| 1 | 3 | 0.24 |
|
| 208 |
+
| 1 | 8 | 0.09 |
|
| 209 |
+
| 2 | 2 | 0.30 (optimal) |
|
| 210 |
+
| 2 | 5 | 0.21 |
|
| 211 |
+
| 0 fixed | any | 0.00 |
|
| 212 |
+
|
| 213 |
+
#### 4. Hint Penalty (-5% each)
|
| 214 |
+
|
| 215 |
+
```
|
| 216 |
+
penalty = 0.05 x hints_used
|
| 217 |
+
```
|
| 218 |
+
|
| 219 |
+
Each `request_hint` action costs 5% off the final score.
|
| 220 |
+
|
| 221 |
+
### Score Examples
|
| 222 |
+
|
| 223 |
+
| Scenario | Partial | Complete | Efficiency | Hints | **Final Score** |
|
| 224 |
+
|----------|---------|----------|------------|-------|-----------------|
|
| 225 |
+
| Fixed 0/2 issues | 0.00 | 0.00 | 0.00 | 0 | **0.000** |
|
| 226 |
+
| Fixed 1/2 in 3 steps | 0.20 | 0.00 | 0.27 | 0 | **~0.470** |
|
| 227 |
+
| Fixed 2/2 in 5 steps | 0.40 | 0.30 | 0.21 | 0 | **~0.910** |
|
| 228 |
+
| Fixed 1/1 in 1 step | 0.40 | 0.30 | 0.30 | 0 | **1.000** |
|
| 229 |
+
| Fixed 1/1 + 2 hints | 0.40 | 0.30 | 0.30 | -0.10 | **0.900** |
|
| 230 |
+
| Submitted immediately | 0.00 | 0.00 | 0.00 | 0 | **0.000** |
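
The rows above can be reproduced with a few lines of Python. This is a sketch of the formula as this README describes it, not the project's grader; the name `final_score` is hypothetical, and flooring the efficiency term at zero is an assumption (the text only gives the decay rate and the final clamp).

```python
def final_score(issues_fixed, issues_total, steps, hints_used=0):
    """Combine the four grading components into a score in [0.0, 1.0]."""
    partial = 0.40 * issues_fixed / issues_total
    complete = 0.30 if issues_fixed == issues_total else 0.0
    if issues_fixed == 0:
        efficiency = 0.0  # no credit for doing nothing
    else:
        extra_steps = max(0, steps - issues_total)
        # Assumption: the bonus never goes negative.
        efficiency = max(0.0, 0.30 - 0.03 * extra_steps)
    penalty = 0.05 * hints_used
    return min(1.0, max(0.0, partial + complete + efficiency - penalty))


# (issues_fixed, issues_total, steps, hints_used) for rows of the table.
for args in [(0, 2, 1, 0), (1, 2, 3, 0), (2, 2, 5, 0), (1, 1, 1, 0), (1, 1, 1, 2)]:
    print(args, round(final_score(*args), 3))
```

The "2 hints" row only comes out at 0.900 when `steps` is 1, which suggests hint requests are not counted as steps for the efficiency term; the README does not state this explicitly.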
|
| 231 |
+
|
| 232 |
+
### Per-Step Rewards (Dense Feedback)
|
| 233 |
+
|
| 234 |
+
The agent also gets **immediate rewards** after each action (not just at the end):
|
| 235 |
|
| 236 |
+
| Event | Reward |
|
| 237 |
+
|-------|--------|
|
| 238 |
+
| Fix validated (issue resolved) | +0.3 per issue fixed |
|
| 239 |
+
| Successful validation improvement | +0.1 |
|
| 240 |
+
| Failed edit (old_content didn't match) | -0.02 |
|
| 241 |
+
| Request hint | -0.05 |
|
| 242 |
+
| Submit (terminal) | 0.0 |
|
| 243 |
+
|
| 244 |
+
This dense reward signal helps RL agents learn faster than sparse pass/fail grading.
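
As a rough illustration of how the per-step values add up over an episode, the sketch below sums rewards from a lookup table. The event names are hypothetical labels for the table rows, and treating the rewards as simply additive is an assumption.

```python
# Reward values from the table above; the keys are illustrative names.
STEP_REWARDS = {
    "issue_fixed": 0.3,
    "validation_improved": 0.1,
    "failed_edit": -0.02,
    "request_hint": -0.05,
    "submit": 0.0,
}


def episode_return(events):
    """Sum the immediate rewards for a sequence of step events."""
    return sum(STEP_REWARDS[event] for event in events)


events = ["failed_edit", "request_hint", "issue_fixed", "validation_improved", "submit"]
print(round(episode_return(events), 2))
```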
|
| 245 |
+
|
| 246 |
+
---
|
| 247 |
+
|
| 248 |
+
## API Endpoints
|
| 249 |
+
|
| 250 |
+
| Endpoint | Method | Description |
|
| 251 |
+
|----------|--------|-------------|
|
| 252 |
+
| `/` | GET | Root health check |
|
| 253 |
+
| `/health` | GET | OpenEnv health endpoint; returns `{"status": "healthy"}` |
|
| 254 |
+
| `/metadata` | GET | Environment name, description, version, tags |
|
| 255 |
+
| `/schema` | GET | Action, observation, and state JSON schemas |
|
| 256 |
+
| `/reset` | POST | Start a new episode (optional: `task_id`, `scenario_id`, `seed`) |
|
| 257 |
+
| `/step` | POST | Take an action and receive observation + reward |
|
| 258 |
+
| `/state` | GET | Get current observation without taking an action |
|
| 259 |
+
| `/info` | GET | Task list with metadata |
|
| 260 |
+
| `/tasks` | GET | List all tasks with difficulty levels |
|
| 261 |
+
| `/grader` | POST | Grade a trajectory (list of step dicts) |
|
| 262 |
+
| `/baseline` | POST | Run built-in heuristic baseline |
|
| 263 |
+
| `/mcp` | POST | JSON-RPC 2.0 MCP endpoint (initialize, tools/list) |
|
| 264 |
+
|
| 265 |
+
### Example: Full Episode via API
|
| 266 |
+
|
| 267 |
+
```bash
|
| 268 |
+
# 1. Start an episode
|
| 269 |
curl -X POST http://localhost:7860/reset \
|
| 270 |
-H "Content-Type: application/json" \
|
| 271 |
+
-d '{"task_id": "dockerfile_syntax", "scenario_id": "typo_filename"}'
|
| 272 |
|
| 273 |
+
# Response: observation with broken Dockerfile + error message
|
| 274 |
+
|
| 275 |
+
# 2. Fix the typo
|
| 276 |
curl -X POST http://localhost:7860/step \
|
| 277 |
-H "Content-Type: application/json" \
|
| 278 |
-d '{
|
|
|
|
| 286 |
}
|
| 287 |
}'
|
| 288 |
|
| 289 |
+
# Response: reward=0.4, issues_fixed=1/1
|
| 290 |
+
|
| 291 |
+
# 3. Submit
|
| 292 |
curl -X POST http://localhost:7860/step \
|
| 293 |
-H "Content-Type: application/json" \
|
| 294 |
-d '{"action": {"action_type": "submit"}}'
|
| 295 |
+
|
| 296 |
+
# Response: done=true, episode complete
|
| 297 |
+
```
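
The same requests can be prepared from Python with only the standard library. The sketch below builds the `/reset` and `submit` requests without sending them; the helper name `post` is hypothetical, and the base URL simply mirrors the curl examples.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:7860"  # same host and port as the curl examples


def post(path, payload):
    """Build (but do not send) a JSON POST request for the given endpoint."""
    return Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


reset_req = post(
    "/reset", {"task_id": "dockerfile_syntax", "scenario_id": "typo_filename"}
)
submit_req = post("/step", {"action": {"action_type": "submit"}})
print(reset_req.get_method(), reset_req.full_url)
```

With the server running, passing either request to `urllib.request.urlopen` sends it.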
|
| 298 |
+
|
| 299 |
+
---
|
| 300 |
+
|
| 301 |
+
## Baseline Results (Llama 3.1 70B)
|
| 302 |
+
|
| 303 |
+
Tested with `meta-llama/Llama-3.1-70B-Instruct` via HuggingFace router:
|
| 304 |
+
|
| 305 |
+
| Task | Score | Notes |
|
| 306 |
+
|------|-------|-------|
|
| 307 |
+
| dockerfile_syntax | 1.000 | Solved perfectly in 1 step |
|
| 308 |
+
| dockerfile_runtime | 1.000 | Solved perfectly in 1 step |
|
| 309 |
+
| workflow_syntax_structure | 0.000 | LLM struggled with exact whitespace matching |
|
| 310 |
+
| workflow_secrets_permissions | 1.000 | Solved perfectly in 1 step |
|
| 311 |
+
| ci_docker_integration | 0.000 | Multi-step fix needed; LLM edits didn't match exactly |
|
| 312 |
+
| multi_stage_pipeline_matrix | 0.283 | Fixed 1/3 issues |
|
| 313 |
+
| **OVERALL** | **0.547** | |
|
| 314 |
+
|
| 315 |
+
This shows the environment is both **solvable** (3 perfect scores) and **challenging** (2 zero scores, 1 partial). The main difficulty is exact string matching for edits, a realistic constraint that mirrors real file editing.
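
The overall figure is consistent with an unweighted mean of the six task scores. That reading is an assumption, since the README does not say how the overall score is aggregated, but the arithmetic can be checked:

```python
from statistics import mean

# Per-task scores copied from the baseline table.
scores = {
    "dockerfile_syntax": 1.000,
    "dockerfile_runtime": 1.000,
    "workflow_syntax_structure": 0.000,
    "workflow_secrets_permissions": 1.000,
    "ci_docker_integration": 0.000,
    "multi_stage_pipeline_matrix": 0.283,
}

overall = round(mean(list(scores.values())), 3)
print(overall)
```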
|
| 316 |
+
|
| 317 |
+
---
|
| 318 |
+
|
| 319 |
+
## Quick Start
|
| 320 |
+
|
| 321 |
+
### Local Development
|
| 322 |
+
|
| 323 |
+
```bash
|
| 324 |
+
pip install -r requirements.txt
|
| 325 |
+
python -m uvicorn server.main:app --host 0.0.0.0 --port 7860
|
| 326 |
```
|
| 327 |
|
| 328 |
### Run Tests
|
|
|
|
| 347 |
python inference.py
|
| 348 |
```
|
| 349 |
|
| 350 |
+
---
|
|
|
|
|
|
|
|
|
|
| 351 |
|
| 352 |
## Project Structure
|
| 353 |
|
| 354 |
```
|
| 355 |
cicd-debug-env/
|
| 356 |
+
├── openenv.yaml # OpenEnv environment specification
|
| 357 |
+
├── inference.py # LLM baseline (OpenAI client + HF router)
|
| 358 |
├── baseline_runner.py # Heuristic baseline for /baseline endpoint
|
| 359 |
├── Dockerfile # Production container
|
| 360 |
├── requirements.txt # Python dependencies
|
|
|
|
| 361 |
│
|
| 362 |
├── server/
|
| 363 |
+
│   ├── main.py # FastAPI with 12 endpoints
|
| 364 |
+
│   ├── models.py # Pydantic models (type-safe API)
|
| 365 |
+
│   ├── environment.py # Core environment loop (reset/step/state)
|
|
|
|
|
|
|
| 366 |
│   ├── tasks/
|
| 367 |
+
│   │   ├── base.py # BaseTask with scenario loading
|
| 368 |
+
│   │   ├── task_registry.py # Maps task_id → task class
|
| 369 |
+
│   │   ├── task_1_build_errors.py # 5 Dockerfile syntax scenarios
|
| 370 |
+
│   │   ├── task_2_docker_runtime.py # 5 Dockerfile runtime scenarios
|
| 371 |
+
│   │   ├── task_3_workflow_syntax.py # 5 workflow structure scenarios
|
| 372 |
+
│   │   ├── task_4_workflow_secrets_permissions.py # 5 secrets scenarios
|
| 373 |
+
│   │   ├── task_5_ci_docker_integration.py # 5 integration scenarios
|
| 374 |
+
│   │   └── task_6_multi_stage_matrix.py # 5 multi-issue scenarios
|
|
|
|
| 375 |
│   ├── graders/
|
| 376 |
+
│   │   ├── __init__.py # Deterministic trajectory grader
|
| 377 |
+
│   │   └── base.py # Base grader with weight constants
|
| 378 |
+
│   └── simulators/
|
| 379 |
+
│       ├── docker_simulator.py # 15+ Dockerfile validation rules
|
| 380 |
+
│       └── workflow_simulator.py # 15+ workflow validation rules
|
| 381 |
│
|
| 382 |
└── tests/
|
| 383 |
+
    ├── test_endpoints.py # API endpoint tests
|
| 384 |
+
    ├── test_determinism.py # Grader determinism + score range tests
|
| 385 |
+
    ├── test_baseline.py # Heuristic baseline tests
|
| 386 |
+
    ├── test_environment_flow.py # Episode flow tests
|
| 387 |
+
    └── test_simulators.py # Simulator unit tests
|
| 388 |
```
|
| 389 |
|
| 390 |
## Design Decisions
|
| 391 |
|
| 392 |
+
1. **Docker + GitHub Actions combined**: These two tools intersect in every modern deployment pipeline. Debugging their interaction is the hardest part of DevOps.
|
| 393 |
+
2. **Simulated validation (no real Docker)**: Static analysis rules instead of running actual containers. This gives deterministic results, fast execution, and no security concerns.
|
| 394 |
+
3. **Dense rewards**: Partial credit at every step (+0.3 per fix, -0.02 per failed edit) rather than sparse pass/fail. Helps RL agents learn faster.
|
| 395 |
+
4. **Difficulty progression**: Easy tasks are single-file, single-issue. Hard tasks are multi-file, multi-issue with interacting bugs.
|
| 396 |
+
5. **Exact string matching for edits**: Mirrors real file editing, where whitespace matters. This is intentionally challenging for LLMs.
|
| 397 |
+
6. **30 scenarios from real bugs**: Every scenario is based on actual developer mistakes documented on Stack Overflow, GitHub Issues, and Docker/GitHub Actions documentation.
|
| 398 |
|
| 399 |
## License
|
| 400 |
|