Krishna1107 committed on
Commit
d129f63
·
1 Parent(s): 8886ce5

updated README with details

Files changed (2)
  1. .gitignore +1 -0
  2. README.md +303 -100
.gitignore CHANGED
@@ -2,6 +2,7 @@
2
  __pycache__/
3
  *.py[cod]
4
  *$py.class
 
5
 
6
  # Virtual environments
7
  .venv/
 
2
  __pycache__/
3
  *.py[cod]
4
  *$py.class
5
+ .claude/
6
 
7
  # Virtual environments
8
  .venv/
README.md CHANGED
@@ -12,78 +12,267 @@ pinned: false
12
 
13
  An OpenEnv-compatible environment where AI agents learn to debug broken GitHub Actions workflows and Dockerfiles. Built for the OpenEnv Hackathon by Scaler School of Technology (partners: Meta, HuggingFace, PyTorch).
14
 
15
- ## What It Does
16
 
17
- Agents receive:
18
- - Broken configuration files (Dockerfile, GitHub Actions YAML)
19
- - Error messages from failed builds/workflows
20
- - Context about available secrets and runner environment
21
 
22
- Agents must analyze errors, identify root causes, edit files to fix issues, and submit solutions. The environment provides dense reward feedback at every step.
 
 
23
 
24
- ## Tasks
25
 
26
- | # | Task ID | Description | Difficulty | Scenarios |
27
- |---|---------|-------------|------------|-----------|
28
- | 1 | `dockerfile_syntax` | Fix Dockerfile instruction/syntax errors | Easy | 5 |
29
- | 2 | `dockerfile_runtime` | Fix Dockerfile runtime/execution issues | Medium | 5 |
30
- | 3 | `workflow_syntax_structure` | Fix GitHub Actions YAML structure | Easy | 5 |
31
- | 4 | `workflow_secrets_permissions` | Fix secret wiring and permissions | Medium | 5 |
32
- | 5 | `ci_docker_integration` | Debug combined CI + Docker failures | Medium-Hard | 5 |
33
- | 6 | `multi_stage_pipeline_matrix` | Debug multi-stage and matrix pipelines | Hard | 5 |
34
 
35
- 30 total scenarios across 6 tasks with clear difficulty progression.
36
 
37
- ## API Endpoints
38
 
39
- | Endpoint | Method | Description |
40
- |----------|--------|-------------|
41
- | `/` | GET | Health check |
42
- | `/reset` | POST | Start a new episode (optional: `task_id`, `scenario_id`, `seed`) |
43
- | `/step` | POST | Take an action (`edit_file`, `replace_line`, `add_line`, `delete_line`, `submit`, `request_hint`) |
44
- | `/state` | GET | Get current observation |
45
- | `/info` | GET | Environment metadata and schemas |
46
- | `/tasks` | GET | List all tasks |
47
- | `/grader` | POST | Grade a trajectory |
48
- | `/baseline` | POST | Run built-in heuristic baseline |
49
 
50
- ## Grading
51
 
52
- Scoring is **deterministic** and **dynamic** (same actions = same score, different actions = different scores).
53
 
54
- | Component | Weight | Description |
55
- |-----------|--------|-------------|
56
- | Partial fixes | 40% | Proportional to issues fixed |
57
- | Complete solution | 30% | Bonus when ALL issues fixed |
58
- | Efficiency | 30% | Bonus for minimal steps (decays with extra steps) |
59
- | Hint penalty | -5% each | Per hint requested |
60
 
61
- Score range: `0.0` (no progress) to `1.0` (all fixed efficiently).
62
 
63
- ## Quick Start
64
 
65
- ### Local Development
66
 
67
- ```bash
68
- pip install -r requirements.txt
69
- python -m uvicorn server.main:app --host 0.0.0.0 --port 7860
70
  ```
71
 
72
- ### Test Endpoints
73
 
74
- ```bash
75
- # Health check
76
- curl http://localhost:7860/
77
 
78
- # List tasks
79
- curl http://localhost:7860/tasks
80
 
81
- # Start an episode
82
  curl -X POST http://localhost:7860/reset \
83
  -H "Content-Type: application/json" \
84
- -d '{"task_id": "dockerfile_syntax"}'
85
 
86
- # Take an action
 
 
87
  curl -X POST http://localhost:7860/step \
88
  -H "Content-Type: application/json" \
89
  -d '{
@@ -97,10 +286,43 @@ curl -X POST http://localhost:7860/step \
97
  }
98
  }'
99
 
100
- # Submit solution
 
 
101
  curl -X POST http://localhost:7860/step \
102
  -H "Content-Type: application/json" \
103
  -d '{"action": {"action_type": "submit"}}'
104
  ```
105
 
106
  ### Run Tests
@@ -125,73 +347,54 @@ export HF_TOKEN=your_token_here
125
  python inference.py
126
  ```
127
 
128
- Run on a specific task:
129
- ```bash
130
- python inference.py dockerfile_syntax
131
- ```
132
 
133
  ## Project Structure
134
 
135
  ```
136
  cicd-debug-env/
137
- ├── openenv.yaml # OpenEnv metadata
138
- ├── inference.py # LLM baseline script
139
  ├── baseline_runner.py # Heuristic baseline for /baseline endpoint
140
  ├── Dockerfile # Production container
141
  ├── requirements.txt # Python dependencies
142
- ├── README.md
143
  │
144
  ├── server/
145
- │   ├── __init__.py
146
- │   ├── main.py # FastAPI with all 8 endpoints
147
- │   ├── models.py # Pydantic models
148
- │   ├── environment.py # Core environment logic
149
- │   │
150
  │   ├── tasks/
151
- │   │   ├── base.py # BaseTask class
152
- │   │   ├── task_registry.py # Task registry
153
- │   │   ├── task_1_build_errors.py
154
- │   │   ├── task_2_docker_runtime.py
155
- │   │   ├── task_3_workflow_syntax.py
156
- │   │   ├── task_4_workflow_secrets_permissions.py
157
- │   │   ├── task_5_ci_docker_integration.py
158
- │   │   └── task_6_multi_stage_matrix.py
159
- │   │
160
  │   ├── graders/
161
- │   │   ├── __init__.py # Deterministic grader
162
- │   │   └── base.py # Base grader class
163
- │   │
164
- │   ├── simulators/
165
- │   │   ├── docker_simulator.py # Dockerfile validation (15+ rules)
166
- │   │   └── workflow_simulator.py # Workflow validation (15+ rules)
167
- │   │
168
- │   └── utils/
169
- │       └── yaml_parser.py
170
  │
171
  └── tests/
172
-     ├── conftest.py
173
-     ├── test_endpoints.py
174
-     └── test_determinism.py
175
  ```
176
 
177
- ## Expected Baseline Scores
178
-
179
- | Task | Expected |
180
- |------|----------|
181
- | dockerfile_syntax | 0.70 |
182
- | dockerfile_runtime | 0.55 |
183
- | workflow_syntax_structure | 0.65 |
184
- | workflow_secrets_permissions | 0.50 |
185
- | ci_docker_integration | 0.45 |
186
- | multi_stage_pipeline_matrix | 0.30 |
187
-
188
  ## Design Decisions
189
 
190
- 1. **Combined Docker + GitHub Actions**: The intersection of these tools is the most painful real-world failure mode
191
- 2. **Simulated validation**: Static analysis instead of real Docker containers for speed and determinism
192
- 3. **Dense rewards**: Partial credit at every step rather than sparse pass/fail
193
- 4. **6 tasks (2+2+2)**: 2 Docker-only + 2 Workflow-only + 2 Combined with clear difficulty progression
194
- 5. **OpenAI client for baseline**: Required by hackathon specification
 
195
 
196
  ## License
197
 
 
12
 
13
  An OpenEnv-compatible environment where AI agents learn to debug broken GitHub Actions workflows and Dockerfiles. Built for the OpenEnv Hackathon by Scaler School of Technology (partners: Meta, HuggingFace, PyTorch).
14
 
15
+ ## Why CI/CD Debugging?
16
 
17
+ Every developer who ships code hits CI/CD failures. A misconfigured Dockerfile, a broken GitHub Actions workflow, a missing secret: these are the bugs that waste hours of developer time every week. They're hard to debug because:
18
 
19
+ - Error messages are cryptic ("unable to prepare context: unable to evaluate symlinks")
20
+ - The feedback loop is slow (push, wait for CI, read logs, fix, repeat)
21
+ - Multiple config files interact in non-obvious ways (Dockerfile + workflow + secrets)
22
 
23
+ This environment teaches AI agents to do what senior DevOps engineers do: read the error, trace it to the root cause, and fix it.
24
 
25
+ ---
26
 
27
+ ## How It Works: The Complete Flow
28
 
29
+ ```
30
+ ┌─────────────────────────────────────────────────────────────┐
31
+ │  1. RESET                                                   │
32
+ │     Agent receives:                                         │
33
+ │     - Broken config files (Dockerfile / workflow YAML)      │
34
+ │     - Error message from the failed build/deploy            │
35
+ │     - Available secrets list                                │
36
+ │     - Number of issues to find                              │
37
+ ├─────────────────────────────────────────────────────────────┤
38
+ │  2. OBSERVE → THINK → ACT (repeat up to 10 steps)           │
39
+ │     Agent reads the error, analyzes the files, then:        │
40
+ │     - edit_file: replace broken content with fixed content  │
41
+ │     - replace_line: fix a specific line number              │
42
+ │     - add_line / add_block: insert missing content          │
43
+ │     - delete_line / delete_block: remove bad content        │
44
+ │     - request_hint: get a clue (-5% score penalty)          │
45
+ │     - submit: "I'm done fixing"                             │
46
+ │                                                             │
47
+ │     After each action, agent gets:                          │
48
+ │     - Updated file contents                                 │
49
+ │     - Reward signal (+0.3 per fix, -0.02 for failed edits)  │
50
+ │     - How many issues are now fixed                         │
51
+ ├─────────────────────────────────────────────────────────────┤
52
+ │  3. GRADE                                                   │
53
+ │     Deterministic scoring based on:                         │
54
+ │     - What fraction of issues were fixed                    │
55
+ │     - Whether ALL issues were fixed (bonus)                 │
56
+ │     - How many steps it took (efficiency)                   │
57
+ │     - How many hints were used (penalty)                    │
58
+ └─────────────────────────────────────────────────────────────┘
59
+ ```
60
 
61
+ ---
62
 
63
+ ## The 6 Tasks (30 Scenarios)
64
 
65
+ ### Task 1: Dockerfile Syntax Errors (Easy)
66
 
67
+ Simple typos and instruction errors that break `docker build`. These are the bugs every developer makes on day one.
 
 
 
 
 
68
 
69
+ | # | Scenario | What's Broken | Real-World Context |
70
+ |---|----------|---------------|-------------------|
71
+ | 1 | `typo_filename` | `COPY requirments.txt .` (misspelled filename) | A frequently reported Docker build error on Stack Overflow |
72
+ | 2 | `invalid_base_image` | `FROM python:3.9-slimm` (extra 'm' in tag) | Happens when copy-pasting image tags |
73
+ | 3 | `invalid_run_syntax` | `RUN pip install ... \n && python setup.py` (broken line continuation) | Formatting multi-line RUN commands is tricky |
74
+ | 4 | `invalid_expose` | `EXPOSE "eighty"` (string instead of port number) | EXPOSE expects a numeric port, optionally with a `/tcp` or `/udp` suffix |
75
+ | 5 | `missing_from_instruction` | No `FROM` instruction at all | Dockerfile must start with FROM (or ARG before FROM) |
76
 
77
+ ### Task 2: Dockerfile Runtime Errors (Medium)
78
 
79
+ The Dockerfile builds successfully, but the container crashes when you run it. These are harder because the error appears at runtime, not build time.
80
 
81
+ | # | Scenario | What's Broken | Real-World Context |
82
+ |---|----------|---------------|-------------------|
83
+ | 1 | `missing_workdir` | No WORKDIR, so files scatter to `/` | Container runs but `npm start` can't find `package.json` |
84
+ | 2 | `cmd_entrypoint_conflict` | Both ENTRYPOINT and CMD defined as full commands | Process starts incorrectly; CMD should be args-only when ENTRYPOINT exists |
85
+ | 3 | `entrypoint_not_executable` | Shell script lacks execute permission | `chmod +x` missing: "permission denied" at container start |
86
+ | 4 | `missing_required_env` | App needs `DATABASE_URL` but it's not set | Container starts then crashes: "DATABASE_URL is not defined" |
87
+ | 5 | `non_root_privileged_port` | Non-root user tries to bind port 80 | Security best practice (non-root) conflicts with port < 1024 |
88
+
89
+ ### Task 3: Workflow Syntax & Structure (Easy)
90
+
91
+ GitHub Actions YAML has structural problems. GitHub rejects these before any job runs.
92
+
93
+ | # | Scenario | What's Broken | Real-World Context |
94
+ |---|----------|---------------|-------------------|
95
+ | 1 | `checkout_after_build` | `docker build` runs before `actions/checkout` | No source code checked out: "Dockerfile not found" |
96
+ | 2 | `missing_runs_on` | Job has no `runs-on` field | GitHub Actions rejects the workflow: every job needs a runner |
97
+ | 3 | `invalid_trigger_syntax` | `branches: main` instead of `branches: [main]` | Must be a YAML list, not a scalar string |
98
+ | 4 | `missing_step_uses_or_run` | Step has a name but no `uses:` or `run:` | Invalid step: it must do something |
99
+ | 5 | `missing_on_trigger` | No `on:` block at all | Workflow never triggers because GitHub doesn't know when to run it |
100
+
101
+ ### Task 4: Workflow Secrets & Permissions (Medium)
102
+
103
+ Secrets exist in the repository but aren't wired correctly to the workflow steps. These are the bugs that make you say "but the secret is right there!"
104
+
105
+ | # | Scenario | What's Broken | Real-World Context |
106
+ |---|----------|---------------|-------------------|
107
+ | 1 | `missing_env_secrets` | `$DOCKER_PASSWORD` in `run:` but no `env:` mapping | Secrets must be explicitly passed via `env:` block |
108
+ | 2 | `wrong_secret_syntax` | `${ secrets.TOKEN }` instead of `${{ secrets.TOKEN }}` | Single braces vs. double braces: a subtle syntax difference |
109
+ | 3 | `missing_token_permissions` | Pushing to GHCR without `permissions: packages: write` | GITHUB_TOKEN is read-only by default for new repositories since 2023 |
110
+ | 4 | `secret_not_in_env` | `curl` uses `$SLACK_WEBHOOK_URL` but it's not in `env:` | Same pattern as #1, a very common mistake |
111
+ | 5 | `ghcr_wrong_credentials` | Using `DOCKER_PASSWORD` for GHCR login | GHCR uses `GITHUB_TOKEN`, not Docker Hub credentials |
112
+
113
+ ### Task 5: CI + Docker Integration (Medium-Hard)
114
+
115
+ The workflow AND the Dockerfile interact. Fixing one file alone isn't enough; you need to understand how they work together.
116
+
117
+ | # | Scenario | What's Broken | Real-World Context |
118
+ |---|----------|---------------|-------------------|
119
+ | 1 | `missing_buildx_for_platforms` | Multi-platform build without `setup-buildx-action` | Standard Docker builder can't cross-compile; need BuildKit |
120
+ | 2 | `login_secrets_not_wired` | `docker login` step missing `env:` for secrets | Auth fails: "unauthorized: authentication required" |
121
+ | 3 | `wrong_build_context` | Context is `./backend` but Dockerfile path is `./Dockerfile` | Path mismatch: the build can't find the Dockerfile |
122
+ | 4 | `cache_without_mode_max` | GHA cache export missing `mode=max` | Cache doesn't persist intermediate layers; slow rebuilds |
123
+ | 5 | `push_without_login` | `docker push` without `docker login` first | "denied: requested access to the resource is denied" |
124
+
125
+ ### Task 6: Multi-Stage Pipeline & Matrix (Hard)
126
+
127
+ Complex pipelines with multiple interacting bugs. The agent must find and fix 2-3 issues across multiple files.
128
+
129
+ | # | Scenario | What's Broken | Real-World Context |
130
+ |---|----------|---------------|-------------------|
131
+ | 1 | `artifact_path_mismatch` | `COPY --from=builder /app/dist` but React outputs to `/app/build` | Framework output directories vary: CRA uses `build/`, Vite uses `dist/` |
132
+ | 2 | `matrix_platform_arg` | Uses `$BUILDPLATFORM` without `ARG BUILDPLATFORM` declaration | Multi-arch builds need platform ARGs declared before FROM |
133
+ | 3 | `cross_job_artifact` | Test job downloads artifact but missing `needs: build` | Jobs run in parallel by default, so the artifact doesn't exist yet |
134
+ | 4 | `multiple_issues` | Dockerfile typo + workflow secrets not wired (2 bugs) | Real debugging: problems compound across files |
135
+ | 5 | `matrix_version_failure` | Matrix includes Node 14 but code needs >= 16 + missing `needs:` | Version compatibility + job ordering: 2 bugs to find |
136
+
137
+ ---
138
+
139
+ ## Available Actions
140
+
141
+ Each step, the agent chooses exactly one action:
142
+
143
+ | Action | What It Does | When to Use |
144
+ |--------|-------------|-------------|
145
+ | `edit_file` | Replace `old_content` with `new_content` in a file | Most common: fix a broken line or block |
146
+ | `replace_line` | Replace content at a specific line number | When you know exactly which line is wrong |
147
+ | `add_line` | Insert a new line into a file | Adding missing instructions (e.g., missing `WORKDIR`) |
148
+ | `delete_line` | Remove a specific line | Removing a bad instruction |
149
+ | `add_block` | Insert a multi-line block | Adding entire sections (e.g., `env:` block with secrets) |
150
+ | `delete_block` | Remove a multi-line block | Removing incorrect sections |
151
+ | `request_hint` | Get a clue about what's wrong | Costs -5% on final score, so use sparingly |
152
+ | `submit` | Declare "I'm done" and trigger final evaluation | When all fixes are applied |
153
+
154
+ **Important:** `edit_file` requires `old_content` to match **exactly** (including whitespace). If it doesn't match, the edit fails and the agent gets a -0.02 reward penalty.
155
+
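The exact-match rule above can be illustrated with a minimal Python sketch. It is not the environment's actual implementation: the function name `apply_edit` and the choice to replace only the first occurrence are assumptions made for illustration; only the `-0.02` penalty and the `old_content` / `new_content` names come from the README.

```python
FAILED_EDIT_PENALTY = -0.02  # per-step penalty quoted in the README


def apply_edit(text: str, old_content: str, new_content: str) -> tuple[str, float]:
    """Hypothetical helper: replace one exact occurrence of old_content.

    Returns the (possibly unchanged) text and the penalty for this step.
    Replacing only the first occurrence is an assumption of this sketch.
    """
    if old_content not in text:
        # No exact match (whitespace included): the edit fails.
        return text, FAILED_EDIT_PENALTY
    return text.replace(old_content, new_content, 1), 0.0


dockerfile = "FROM python:3.9-slim\nCOPY requirments.txt .\n"

# Exact match: the typo is fixed and no penalty applies.
fixed, reward = apply_edit(
    dockerfile, "COPY requirments.txt .", "COPY requirements.txt ."
)

# Two spaces after COPY: no match, so the file is unchanged and penalized.
missed, penalty = apply_edit(
    dockerfile, "COPY  requirments.txt .", "COPY requirements.txt ."
)
```

The second call shows why whitespace-sensitive matching is hard for LLM agents: a single extra space turns a correct fix into a failed edit.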
156
+ ---
157
+
158
+ ## Grading System: How Scores Work
159
+
160
+ Scoring is **deterministic** (same actions always produce the same score) and **dynamic** (different strategies get different scores).
161
+
162
+ ### The Formula
163
+
164
+ ```
165
+ FINAL SCORE = Partial Fixes + Complete Bonus + Efficiency - Hint Penalty
166
  ```
167
 
168
+ Clamped to `[0.0, 1.0]`.
169
 
170
+ ### Component Breakdown
171
+
172
+ #### 1. Partial Fix Credit (40% max)
173
+
174
+ ```
175
+ partial = 0.40 x (issues_fixed / issues_total)
176
+ ```
177
+
178
+ | Fixed | Total | Partial Score |
179
+ |-------|-------|---------------|
180
+ | 0/2 | 2 | 0.00 |
181
+ | 1/2 | 2 | 0.20 |
182
+ | 2/2 | 2 | 0.40 |
183
+ | 1/3 | 3 | 0.133 |
184
+
185
+ #### 2. Complete Solution Bonus (30% max)
186
+
187
+ ```
188
+ complete = 0.30 if ALL issues fixed
189
+ complete = 0.00 otherwise
190
+ ```
191
+
192
+ All-or-nothing. Fix 2/3 issues? You get 0. Fix 3/3? You get 0.30.
193
+
194
+ #### 3. Efficiency Bonus (30% max)
195
 
196
+ ```
197
+ if issues_fixed == 0: efficiency = 0.00 (no credit for doing nothing)
198
+ if steps <= issues_total: efficiency = 0.30 (optimal: full bonus)
199
+ if steps > issues_total: efficiency = 0.30 - 0.03 x (steps - issues_total)
200
+ ```
201
+
202
+ Rewards agents that fix issues quickly. The "optimal" number of steps equals the number of issues (one fix per step).
203
+
204
+ | Issues | Steps Taken | Efficiency Score |
205
+ |--------|-------------|-----------------|
206
+ | 1 | 1 | 0.30 (optimal) |
207
+ | 1 | 3 | 0.24 |
208
+ | 1 | 8 | 0.09 |
209
+ | 2 | 2 | 0.30 (optimal) |
210
+ | 2 | 5 | 0.21 |
211
+ | 0 fixed | any | 0.00 |
212
+
213
+ #### 4. Hint Penalty (-5% each)
214
+
215
+ ```
216
+ penalty = 0.05 x hints_used
217
+ ```
218
+
219
+ Each `request_hint` action subtracts 0.05 (5 percentage points) from the final score.
220
+
221
+ ### Score Examples
222
+
223
+ | Scenario | Partial | Complete | Efficiency | Hints | **Final Score** |
224
+ |----------|---------|----------|------------|-------|-----------------|
225
+ | Fixed 0/2 issues | 0.00 | 0.00 | 0.00 | 0 | **0.000** |
226
+ | Fixed 1/2 in 3 steps | 0.20 | 0.00 | 0.27 | 0 | **~0.470** |
227
+ | Fixed 2/2 in 5 steps | 0.40 | 0.30 | 0.21 | 0 | **~0.910** |
228
+ | Fixed 1/1 in 1 step | 0.40 | 0.30 | 0.30 | 0 | **1.000** |
229
+ | Fixed 1/1 + 2 hints | 0.40 | 0.30 | 0.30 | -0.10 | **0.900** |
230
+ | Submitted immediately | 0.00 | 0.00 | 0.00 | 0 | **0.000** |
231
+
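The four components can be combined in a short Python sketch that reproduces the rows of the Score Examples table. This is an illustration written from the formulas in this README, not the grader's source code: the function name `final_score` is hypothetical, it assumes `issues_total >= 1`, and flooring the efficiency term at zero is an assumption of this sketch.

```python
def final_score(issues_fixed: int, issues_total: int, steps: int,
                hints_used: int = 0) -> float:
    """Hypothetical re-statement of the README scoring formula."""
    # 1. Partial credit: up to 0.40, proportional to issues fixed.
    partial = 0.40 * issues_fixed / issues_total

    # 2. Complete bonus: all-or-nothing 0.30.
    complete = 0.30 if issues_fixed == issues_total else 0.0

    # 3. Efficiency: 0.30 minus 0.03 per step beyond issues_total.
    if issues_fixed == 0:
        efficiency = 0.0
    else:
        extra_steps = max(0, steps - issues_total)
        efficiency = max(0.0, 0.30 - 0.03 * extra_steps)  # floor is assumed

    # 4. Hint penalty: 0.05 per hint.
    penalty = 0.05 * hints_used

    # Clamp to [0.0, 1.0] as stated in the README.
    return min(1.0, max(0.0, partial + complete + efficiency - penalty))


# Rows from the Score Examples table (rounded to avoid float noise).
one_of_two = round(final_score(1, 2, steps=3), 3)
two_of_two = round(final_score(2, 2, steps=5), 3)
with_hints = round(final_score(1, 1, steps=1, hints_used=2), 3)
```

Here `steps=1` in the hinted example mirrors the table, which credits the full efficiency bonus alongside two hints; whether hint requests count as steps in the real grader is not stated.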
232
+ ### Per-Step Rewards (Dense Feedback)
233
+
234
+ The agent also gets **immediate rewards** after each action (not just at the end):
235
 
236
+ | Event | Reward |
237
+ |-------|--------|
238
+ | Fix validated (issue resolved) | +0.3 per issue fixed |
239
+ | Successful validation improvement | +0.1 |
240
+ | Failed edit (old_content didn't match) | -0.02 |
241
+ | Request hint | -0.05 |
242
+ | Submit (terminal) | 0.0 |
243
+
244
+ This dense reward signal helps RL agents learn faster than sparse pass/fail grading.
245
+
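As a rough illustration of the dense signal, the sketch below sums per-step rewards over one episode. The reward values are taken from the table above; the event names and the idea that rewards simply add up across steps are assumptions of this sketch, not confirmed behaviour of the environment.

```python
# Reward values quoted from the per-step table; event names are hypothetical.
STEP_REWARDS = {
    "issue_fixed": 0.3,
    "validation_improved": 0.1,
    "failed_edit": -0.02,
    "hint": -0.05,
    "submit": 0.0,
}


def episode_return(events: list[str]) -> float:
    """Sum the per-step rewards for a sequence of events."""
    return sum(STEP_REWARDS[event] for event in events)


# One failed edit, one hint, then a successful fix and a submit.
total = episode_return(["failed_edit", "hint", "issue_fixed", "submit"])
```

Unlike a sparse pass/fail signal, every step in this trajectory contributes feedback, including the two that made no progress.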
246
+ ---
247
+
248
+ ## API Endpoints
249
+
250
+ | Endpoint | Method | Description |
251
+ |----------|--------|-------------|
252
+ | `/` | GET | Root health check |
253
+ | `/health` | GET | OpenEnv health endpoint; returns `{"status": "healthy"}` |
254
+ | `/metadata` | GET | Environment name, description, version, tags |
255
+ | `/schema` | GET | Action, observation, and state JSON schemas |
256
+ | `/reset` | POST | Start a new episode (optional: `task_id`, `scenario_id`, `seed`) |
257
+ | `/step` | POST | Take an action and receive observation + reward |
258
+ | `/state` | GET | Get current observation without taking an action |
259
+ | `/info` | GET | Task list with metadata |
260
+ | `/tasks` | GET | List all tasks with difficulty levels |
261
+ | `/grader` | POST | Grade a trajectory (list of step dicts) |
262
+ | `/baseline` | POST | Run built-in heuristic baseline |
263
+ | `/mcp` | POST | JSON-RPC 2.0 MCP endpoint (initialize, tools/list) |
264
+
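For the `/mcp` endpoint, a request body follows the standard JSON-RPC 2.0 envelope. The sketch below only builds that envelope for `tools/list`; the helper name `jsonrpc_request` is hypothetical, and any parameters the server expects for `initialize` are not documented here, so they are left out.

```python
import json
from typing import Optional


def jsonrpc_request(method: str, request_id: int,
                    params: Optional[dict] = None) -> str:
    """Build a JSON-RPC 2.0 request body as a JSON string."""
    message = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        message["params"] = params
    return json.dumps(message)


# Body that could be POSTed to /mcp to list the available tools.
tools_list_body = jsonrpc_request("tools/list", request_id=1)
```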
265
+ ### Example: Full Episode via API
266
+
267
+ ```bash
268
+ # 1. Start an episode
269
  curl -X POST http://localhost:7860/reset \
270
  -H "Content-Type: application/json" \
271
+ -d '{"task_id": "dockerfile_syntax", "scenario_id": "typo_filename"}'
272
 
273
+ # Response: observation with broken Dockerfile + error message
274
+
275
+ # 2. Fix the typo
276
  curl -X POST http://localhost:7860/step \
277
  -H "Content-Type: application/json" \
278
  -d '{
 
286
  }
287
  }'
288
 
289
+ # Response: reward=0.4, issues_fixed=1/1
290
+
291
+ # 3. Submit
292
  curl -X POST http://localhost:7860/step \
293
  -H "Content-Type: application/json" \
294
  -d '{"action": {"action_type": "submit"}}'
295
+
296
+ # Response: done=true, episode complete
297
+ ```
298
+
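The same episode can be driven from Python with only the standard library. The sketch below is a minimal client under stated assumptions: the helper names `post_request` and `send` are hypothetical, and the `edit_file` payload is omitted because its full field list is not shown in this README (the `/schema` endpoint is the place to check it). The `/reset` and `submit` payloads are copied from the curl example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # port used in the Quick Start section


def post_request(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the given endpoint."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send a request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Payloads taken from the curl example above.
reset_req = post_request(
    "/reset", {"task_id": "dockerfile_syntax", "scenario_id": "typo_filename"}
)
submit_req = post_request("/step", {"action": {"action_type": "submit"}})
```

With the server running locally, `send(reset_req)` would return the initial observation and `send(submit_req)` would end the episode.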
299
+ ---
300
+
301
+ ## Baseline Results (Llama 3.1 70B)
302
+
303
+ Tested with `meta-llama/Llama-3.1-70B-Instruct` via HuggingFace router:
304
+
305
+ | Task | Score | Notes |
306
+ |------|-------|-------|
307
+ | dockerfile_syntax | 1.000 | Solved perfectly in 1 step |
308
+ | dockerfile_runtime | 1.000 | Solved perfectly in 1 step |
309
+ | workflow_syntax_structure | 0.000 | LLM struggled with exact whitespace matching |
310
+ | workflow_secrets_permissions | 1.000 | Solved perfectly in 1 step |
311
+ | ci_docker_integration | 0.000 | Multi-step fix needed; LLM edits didn't match exactly |
312
+ | multi_stage_pipeline_matrix | 0.283 | Fixed 1/3 issues |
313
+ | **OVERALL** | **0.547** | |
314
+
315
+ This shows the environment is both **solvable** (3 perfect scores) and **challenging** (2 zero scores, 1 partial). The main difficulty is exact string matching for edits, a realistic constraint that mirrors real file editing.
316
+
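The OVERALL figure is consistent with an unweighted mean of the six per-task scores, which the following sketch checks. That the README computes it this way is an inference from the numbers, not something the document states.

```python
from statistics import mean

# Per-task scores copied from the baseline table.
scores = {
    "dockerfile_syntax": 1.000,
    "dockerfile_runtime": 1.000,
    "workflow_syntax_structure": 0.000,
    "workflow_secrets_permissions": 1.000,
    "ci_docker_integration": 0.000,
    "multi_stage_pipeline_matrix": 0.283,
}

# Unweighted mean across tasks, rounded to three decimals like the table.
overall = round(mean(list(scores.values())), 3)
```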
317
+ ---
318
+
319
+ ## Quick Start
320
+
321
+ ### Local Development
322
+
323
+ ```bash
324
+ pip install -r requirements.txt
325
+ python -m uvicorn server.main:app --host 0.0.0.0 --port 7860
326
  ```
327
 
328
  ### Run Tests
 
347
  python inference.py
348
  ```
349
 
350
+ ---
351
 
352
  ## Project Structure
353
 
354
  ```
355
  cicd-debug-env/
356
+ ├── openenv.yaml # OpenEnv environment specification
357
+ ├── inference.py # LLM baseline (OpenAI client + HF router)
358
  ├── baseline_runner.py # Heuristic baseline for /baseline endpoint
359
  ├── Dockerfile # Production container
360
  ├── requirements.txt # Python dependencies
361
  │
362
  ├── server/
363
+ │   ├── main.py # FastAPI with 12 endpoints
364
+ │   ├── models.py # Pydantic models (type-safe API)
365
+ │   ├── environment.py # Core environment loop (reset/step/state)
366
  │   ├── tasks/
367
+ │   │   ├── base.py # BaseTask with scenario loading
368
+ │   │   ├── task_registry.py # Maps task_id → task class
369
+ │   │   ├── task_1_build_errors.py # 5 Dockerfile syntax scenarios
370
+ │   │   ├── task_2_docker_runtime.py # 5 Dockerfile runtime scenarios
371
+ │   │   ├── task_3_workflow_syntax.py # 5 workflow structure scenarios
372
+ │   │   ├── task_4_workflow_secrets_permissions.py # 5 secrets scenarios
373
+ │   │   ├── task_5_ci_docker_integration.py # 5 integration scenarios
374
+ │   │   └── task_6_multi_stage_matrix.py # 5 multi-issue scenarios
375
  │   ├── graders/
376
+ │   │   ├── __init__.py # Deterministic trajectory grader
377
+ │   │   └── base.py # Base grader with weight constants
378
+ │   └── simulators/
379
+ │       ├── docker_simulator.py # 15+ Dockerfile validation rules
380
+ │       └── workflow_simulator.py # 15+ workflow validation rules
381
  │
382
  └── tests/
383
+     ├── test_endpoints.py # API endpoint tests
384
+     ├── test_determinism.py # Grader determinism + score range tests
385
+     ├── test_baseline.py # Heuristic baseline tests
386
+     ├── test_environment_flow.py # Episode flow tests
387
+     └── test_simulators.py # Simulator unit tests
388
  ```
389
 
390
  ## Design Decisions
391
 
392
+ 1. **Docker + GitHub Actions combined**: These two tools intersect in many modern deployment pipelines. Debugging their interaction is among the hardest parts of DevOps.
393
+ 2. **Simulated validation (no real Docker)**: Static analysis rules instead of running actual containers. This gives deterministic results, fast execution, and no security concerns.
394
+ 3. **Dense rewards**: Partial credit at every step (+0.3 per fix, -0.02 per failed edit) rather than sparse pass/fail. Helps RL agents learn faster.
395
+ 4. **Difficulty progression**: Easy tasks are single-file, single-issue. Hard tasks are multi-file, multi-issue with interacting bugs.
396
+ 5. **Exact string matching for edits**: Mirrors real file editing, where whitespace matters. This is intentionally challenging for LLMs.
397
+ 6. **30 scenarios from real bugs**: Every scenario is based on actual developer mistakes documented on Stack Overflow, GitHub Issues, and Docker/GitHub Actions documentation.
398
 
399
  ## License
400