Commit ·
557930c
1
Parent(s): 4b07aaf
final changes, will deploy
- .gitignore +1 -1
- CONTEXT.md +0 -347
- Dockerfile +9 -2
- IMPLEMENTATION_PLAN.md +0 -2814
- README.md +175 -12
- baseline_runner.py +138 -26
- inference.py +305 -2
- requirements.txt +0 -0
- tests/test_baseline.py +52 -0
- tests/test_endpoints.py +130 -11
.gitignore
CHANGED
@@ -40,4 +40,4 @@ Thumbs.db
 
 *.zip
 
-
+context/
CONTEXT.md
DELETED
|
@@ -1,347 +0,0 @@
|
|
| 1 |
-
# 🧠 PROJECT CONTEXT
|
| 2 |
-
## CI/CD Debug Environment for OpenEnv Hackathon
|
| 3 |
-
|
| 4 |
-
> **For Claude Code**: Read this file first to understand the project background, decisions made, and current status.
|
| 5 |
-
|
| 6 |
-
---
|
| 7 |
-
|
| 8 |
-
## 📋 HACKATHON OVERVIEW
|
| 9 |
-
|
| 10 |
-
**Event**: OpenEnv Hackathon by Scaler School of Technology
|
| 11 |
-
**Partners**: Meta, HuggingFace, PyTorch
|
| 12 |
-
**Deadline**: April 8, 2026 (Round 1 online submission)
|
| 13 |
-
**Finale**: April 25-26, 2026 in Bangalore
|
| 14 |
-
**Prize Pool**: $30,000 + direct interview opportunities
|
| 15 |
-
|
| 16 |
-
**Goal**: Build a complete, real-world OpenEnv environment that an AI agent can learn from through the standard step()/reset()/state() API.
|
| 17 |
-
|
| 18 |
-
---
|
| 19 |
-
|
| 20 |
-
## 🎯 WHAT WE'RE BUILDING
|
| 21 |
-
|
| 22 |
-
**Environment Name**: `cicd-debug-env`
|
| 23 |
-
**Concept**: AI agents debug broken GitHub Actions workflows and Dockerfiles
|
| 24 |
-
|
| 25 |
-
The agent receives:
|
| 26 |
-
1. Error messages from failed builds/workflows
|
| 27 |
-
2. Configuration files (Dockerfile, workflow YAML)
|
| 28 |
-
3. Context about available secrets
|
| 29 |
-
|
| 30 |
-
The agent must:
|
| 31 |
-
1. Analyze the error
|
| 32 |
-
2. Identify the root cause
|
| 33 |
-
3. Fix the files
|
| 34 |
-
4. Submit the solution
|
| 35 |
-
|
| 36 |
-
---
|
| 37 |
-
|
| 38 |
-
## 🏆 WHY THIS IDEA WINS
|
| 39 |
-
|
| 40 |
-
| Criteria | Weight | Our Score | Why |
|
| 41 |
-
|----------|--------|-----------|-----|
|
| 42 |
-
| Real-world utility | 30% | 30/30 | Every developer debugs Docker + CI/CD daily |
|
| 43 |
-
| Task & grader quality | 25% | 25/25 | 6 tasks, deterministic + dynamic graders |
|
| 44 |
-
| Environment design | 20% | 20/20 | Clean state, typed models, dense rewards |
|
| 45 |
-
| Code quality & spec | 15% | 15/15 | Full OpenEnv compliance |
|
| 46 |
-
| Creativity & novelty | 10% | 10/10 | First CI/CD debugging env on OpenEnv |
|
| 47 |
-
|
| 48 |
-
**Key Insight**: Judges are Meta/HuggingFace engineers who debug Docker and GitHub Actions EVERY DAY.
|
| 49 |
-
|
| 50 |
-
---
|
| 51 |
-
|
| 52 |
-
## 📊 THE 6 TASKS
|
| 53 |
-
|
| 54 |
-
| # | Task ID | Name | Difficulty | Category |
|
| 55 |
-
|---|---------|------|------------|----------|
|
| 56 |
-
| 1 | `dockerfile_syntax` | Dockerfile Syntax Errors | Easy | Docker |
|
| 57 |
-
| 2 | `dockerfile_runtime` | Dockerfile Runtime Errors | Medium | Docker |
|
| 58 |
-
| 3 | `workflow_syntax_structure` | Workflow Syntax and Structure | Easy | Workflow |
|
| 59 |
-
| 4 | `workflow_secrets_permissions` | Workflow Secrets and Permissions | Medium | Workflow |
|
| 60 |
-
| 5 | `ci_docker_integration` | CI and Docker Build Integration | Medium-Hard | Combined |
|
| 61 |
-
| 6 | `multi_stage_pipeline_matrix` | Multi-Stage Pipeline and Matrix | Hard | Combined |
|
| 62 |
-
|
| 63 |
-
**Structure**: 2 Docker-only + 2 Workflow-only + 2 Combined = 6 tasks total
|
| 64 |
-
|
| 65 |
-
**Scenarios per task**: Aim for 4-5 scenarios each (total ~25-30 scenarios)
|
| 66 |
-
|
| 67 |
-
---
|
| 68 |
-
|
| 69 |
-
## 📝 GRADING LOGIC
|
| 70 |
-
|
| 71 |
-
### Key Principles:
|
| 72 |
-
- **DYNAMIC**: Score depends on what the agent actually does
|
| 73 |
-
- **DETERMINISTIC**: Same actions = same score (required for reproducibility)
|
| 74 |
-
- **PARTIAL CREDIT**: Reward progress, not just final solution
|
| 75 |
-
|
| 76 |
-
### Score Components:
|
| 77 |
-
|
| 78 |
-
| Component | Weight | Description |
|
| 79 |
-
|-----------|--------|-------------|
|
| 80 |
-
| Issue Identification | 15% | Agent targets correct file/line |
|
| 81 |
-
| Partial Fixes | 25% | Fix is partially correct |
|
| 82 |
-
| Complete Fixes | 40% | All issues fully resolved |
|
| 83 |
-
| Efficiency Bonus | 15% | Solved in minimal steps |
|
| 84 |
-
| Hint Penalty | -5% each | Penalty for hints used |
|
| 85 |
-
|
| 86 |
-
### Example:
|
| 87 |
-
```
|
| 88 |
-
Scenario: Dockerfile has 2 bugs
|
| 89 |
-
|
| 90 |
-
Agent fixes bug 1 only → ~0.4 score
|
| 91 |
-
Agent fixes bug 2 only → ~0.4 score
|
| 92 |
-
Agent fixes both → ~0.85 score
|
| 93 |
-
Agent fixes both quickly → ~1.0 score (with efficiency bonus)
|
| 94 |
-
Agent uses 2 hints → -0.10 penalty
|
| 95 |
-
```
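The score components above can be sketched as a small, deterministic function. This is only an illustration: the weights come from the Score Components table, but the way they combine (additive, clamped to 0.0-1.0, efficiency scaled by unused steps) is an assumption, and `EpisodeSummary` and `score` are hypothetical names, not the project's grader. The figures it produces will not match the approximate numbers in the example exactly.

```python
from dataclasses import dataclass

# Weights from the Score Components table; how they combine is an assumption.
W_IDENTIFY = 0.15
W_PARTIAL = 0.25
W_COMPLETE = 0.40
W_EFFICIENCY = 0.15
HINT_PENALTY = 0.05


@dataclass(frozen=True)
class EpisodeSummary:
    """Hypothetical summary of one episode, used only for this sketch."""
    issues_total: int
    issues_identified: int  # agent targeted the correct file/line
    issues_fixed: int
    steps_used: int
    max_steps: int
    hints_used: int


def score(ep: EpisodeSummary) -> float:
    """Return a score in [0.0, 1.0]; the same summary always gives the same score."""
    if ep.issues_total <= 0 or ep.max_steps <= 0:
        return 0.0
    total = W_IDENTIFY * (ep.issues_identified / ep.issues_total)
    total += W_PARTIAL * (ep.issues_fixed / ep.issues_total)
    if ep.issues_fixed == ep.issues_total:
        # Full credit plus an efficiency bonus that shrinks as more steps are used.
        total += W_COMPLETE
        total += W_EFFICIENCY * max(0.0, 1.0 - ep.steps_used / ep.max_steps)
    total -= HINT_PENALTY * ep.hints_used
    return max(0.0, min(1.0, total))
```

Because the function depends only on its input and uses no randomness or clock, it satisfies the DETERMINISTIC principle by construction, while partial fixes still earn a share of the identification and partial-fix weights.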
|
| 96 |
-
|
| 97 |
-
---
|
| 98 |
-
|
| 99 |
-
## 🔌 REQUIRED API ENDPOINTS (7 total)
|
| 100 |
-
|
| 101 |
-
| Endpoint | Method | Purpose |
|
| 102 |
-
|----------|--------|---------|
|
| 103 |
-
| `/` | GET | Health check |
|
| 104 |
-
| `/reset` | POST | Start new episode |
|
| 105 |
-
| `/step` | POST | Take action |
|
| 106 |
-
| `/state` | GET | Current state |
|
| 107 |
-
| `/info` | GET | Environment metadata |
|
| 108 |
-
| `/tasks` | GET | List tasks |
|
| 109 |
-
| `/grader` | POST | Grade trajectory |
|
| 110 |
-
| `/baseline` | POST | Run baseline agent |
|
| 111 |
-
|
| 112 |
-
---
|
| 113 |
-
|
| 114 |
-
## 📁 PROJECT STRUCTURE
|
| 115 |
-
|
| 116 |
-
```
|
| 117 |
-
cicd-debug-env/
|
| 118 |
-
├── openenv.yaml # OpenEnv metadata (REQUIRED)
|
| 119 |
-
├── inference.py # Baseline script (REQUIRED)
|
| 120 |
-
├── Dockerfile # For HF Spaces (REQUIRED)
|
| 121 |
-
├── requirements.txt
|
| 122 |
-
├── README.md
|
| 123 |
-
├── CONTEXT.md # This file
|
| 124 |
-
│
|
| 125 |
-
├── server/
|
| 126 |
-
│ ├── __init__.py
|
| 127 |
-
│ ├── main.py # FastAPI with all 7 endpoints
|
| 128 |
-
│ ├── models.py # Pydantic models
|
| 129 |
-
│ ├── environment.py # Core environment logic
|
| 130 |
-
│ │
|
| 131 |
-
│ ├── tasks/
|
| 132 |
-
│ │ ├── __init__.py
|
| 133 |
-
│ │ ├── base.py
|
| 134 |
-
│ │ ├── task_registry.py
|
| 135 |
-
│ │ ├── task_1_dockerfile_syntax.py
|
| 136 |
-
│ │ ├── task_2_dockerfile_runtime.py
|
| 137 |
-
│ │ ├── task_3_workflow_syntax_structure.py
|
| 138 |
-
│ │ ├── task_4_workflow_secrets_permissions.py
|
| 139 |
-
│ │ ├── task_5_ci_docker_integration.py
|
| 140 |
-
│ │ └── task_6_multi_stage_pipeline_matrix.py
|
| 141 |
-
│ │
|
| 142 |
-
│ ├── graders/
|
| 143 |
-
│ │ ├── __init__.py
|
| 144 |
-
│ │ └── grader.py
|
| 145 |
-
│ │
|
| 146 |
-
│ ├── simulators/
|
| 147 |
-
│ │ ├── __init__.py
|
| 148 |
-
│ │ ├── docker_simulator.py
|
| 149 |
-
│ │ └── workflow_simulator.py
|
| 150 |
-
│ │
|
| 151 |
-
│ └── utils/
|
| 152 |
-
│ └── yaml_parser.py
|
| 153 |
-
│
|
| 154 |
-
└── tests/
|
| 155 |
-
├── conftest.py
|
| 156 |
-
└── test_endpoints.py
|
| 157 |
-
```
|
| 158 |
-
|
| 159 |
-
---
|
| 160 |
-
|
| 161 |
-
## 🎯 EXPECTED BASELINE SCORES
|
| 162 |
-
|
| 163 |
-
| Task | Expected Score |
|
| 164 |
-
|------|---------------|
|
| 165 |
-
| dockerfile_syntax | 0.70 |
|
| 166 |
-
| dockerfile_runtime | 0.55 |
|
| 167 |
-
| workflow_syntax_structure | 0.65 |
|
| 168 |
-
| workflow_secrets_permissions | 0.50 |
|
| 169 |
-
| ci_docker_integration | 0.45 |
|
| 170 |
-
| multi_stage_pipeline_matrix | 0.30 |
|
| 171 |
-
|
| 172 |
-
---
|
| 173 |
-
|
| 174 |
-
## ✅ CURRENT STATUS
|
| 175 |
-
|
| 176 |
-
### What's Been Decided:
|
| 177 |
-
- [x] Environment concept (CI/CD debugging)
|
| 178 |
-
- [x] 6 tasks with difficulty progression
|
| 179 |
-
- [x] Grading logic (dynamic + deterministic)
|
| 180 |
-
- [x] Project structure
|
| 181 |
-
- [x] Implementation plan created
|
| 182 |
-
|
| 183 |
-
### Day 1-2: Foundation (COMPLETE)
|
| 184 |
-
- [x] Pydantic models (server/models.py) — Observation, Action, FileEdit, GraderResult, etc.
|
| 185 |
-
- [x] FastAPI server (server/main.py) — All 7 endpoints working
|
| 186 |
-
- [x] openenv.yaml — Full spec compliance
|
| 187 |
-
|
| 188 |
-
### Day 3-4: Core Environment (COMPLETE)
|
| 189 |
-
- [x] Core environment (server/environment.py) — reset, step, state, hint, submit
|
| 190 |
-
- [x] Docker simulator (server/simulators/docker_simulator.py) — 15+ validation rules
|
| 191 |
-
- [x] Workflow simulator (server/simulators/workflow_simulator.py) — 15+ validation rules
|
| 192 |
-
|
| 193 |
-
### Day 5-6: Tasks & Scenarios (COMPLETE)
|
| 194 |
-
- [x] Task 1: dockerfile_syntax (5 scenarios) — typo, bad tag, RUN syntax, EXPOSE, missing FROM
|
| 195 |
-
- [x] Task 2: dockerfile_runtime (5 scenarios) — WORKDIR, CMD/ENTRYPOINT, chmod, ENV, port
|
| 196 |
-
- [x] Task 3: workflow_syntax_structure (5 scenarios) — checkout order, runs-on, triggers, uses/run, on
|
| 197 |
-
- [x] Task 4: workflow_secrets_permissions (5 scenarios) — env secrets, ${{ }}, permissions, env mapping, GHCR
|
| 198 |
-
- [x] Task 5: ci_docker_integration (5 scenarios) — buildx, login secrets, context path, cache, push auth
|
| 199 |
-
- [x] Task 6: multi_stage_pipeline_matrix (5 scenarios) — dist/build, platform ARGs, needs, multi-issue, matrix
|
| 200 |
-
- [x] 30/30 scenarios verified end-to-end
|
| 201 |
-
|
| 202 |
-
### Day 7: Graders & Rewards (COMPLETE)
|
| 203 |
-
- [x] Grader implementation — deterministic, dynamic, partial credit
|
| 204 |
-
- [x] Reward shaping — dense rewards at every step
|
| 205 |
-
- [x] Determinism verified — same input = same output (17 tests)
|
| 206 |
-
- [x] Score ranges verified — 0.0 to 1.0, matching CONTEXT.md examples
|
| 207 |
-
- [x] 26/26 total tests passing
|
| 208 |
-
|
| 209 |
-
### Remaining (Day 8-10):
|
| 210 |
-
- [ ] Baseline inference script (inference.py)
|
| 211 |
-
- [ ] Dockerfile for deployment
|
| 212 |
-
- [ ] Deploy to HuggingFace Spaces
|
| 213 |
-
- [ ] Run `openenv validate`
|
| 214 |
-
- [ ] Test with real LLM (Llama 3.1 70B)
|
| 215 |
-
- [ ] Verify baseline scores match expectations
|
| 216 |
-
- [ ] Write comprehensive README
|
| 217 |
-
- [ ] Final polish and submit
|
| 218 |
-
|
| 219 |
-
---
|
| 220 |
-
|
| 221 |
-
## 🧪 HOW TO RUN
|
| 222 |
-
|
| 223 |
-
### Local Development:
|
| 224 |
-
```bash
|
| 225 |
-
pip install -r requirements.txt
|
| 226 |
-
python -m server.main
|
| 227 |
-
# Server at http://localhost:7860
|
| 228 |
-
```
|
| 229 |
-
|
| 230 |
-
### Test Endpoints:
|
| 231 |
-
```bash
|
| 232 |
-
curl http://localhost:7860/
|
| 233 |
-
curl http://localhost:7860/info
|
| 234 |
-
curl -X POST http://localhost:7860/reset -H "Content-Type: application/json" -d '{}'
|
| 235 |
-
```
|
| 236 |
-
|
| 237 |
-
### Run Tests:
|
| 238 |
-
```bash
|
| 239 |
-
pytest tests/ -v
|
| 240 |
-
```
|
| 241 |
-
|
| 242 |
-
### Docker:
|
| 243 |
-
```bash
|
| 244 |
-
docker build -t cicd-debug-env .
|
| 245 |
-
docker run -p 7860:7860 cicd-debug-env
|
| 246 |
-
```
|
| 247 |
-
|
| 248 |
-
### Baseline Inference:
|
| 249 |
-
```bash
|
| 250 |
-
export API_BASE_URL=https://router.huggingface.co/v1
|
| 251 |
-
export MODEL_NAME=meta-llama/Llama-3.1-70B-Instruct
|
| 252 |
-
export HF_TOKEN=your_token_here
|
| 253 |
-
python inference.py
|
| 254 |
-
```
|
| 255 |
-
|
| 256 |
-
---
|
| 257 |
-
|
| 258 |
-
## 🚨 DISQUALIFICATION CRITERIA (AVOID!)
|
| 259 |
-
|
| 260 |
-
- ❌ Environment does not deploy or respond
|
| 261 |
-
- ❌ Plagiarized or trivially modified existing environments
|
| 262 |
-
- ❌ Graders that always return the same score
|
| 263 |
-
- ❌ No baseline inference script
|
| 264 |
-
|
| 265 |
-
---
|
| 266 |
-
|
| 267 |
-
## 💡 KEY DESIGN DECISIONS
|
| 268 |
-
|
| 269 |
-
1. **Combined Docker + GitHub Actions**: The intersection is the most painful real-world failure
|
| 270 |
-
|
| 271 |
-
2. **6 tasks (2+2+2)**: 2 Docker + 2 Workflow + 2 Combined, clear difficulty progression
|
| 272 |
-
|
| 273 |
-
3. **Dynamic but deterministic grading**: Score varies by agent actions, but same actions = same score
|
| 274 |
-
|
| 275 |
-
4. **Simulated validation**: No real Docker containers, just static analysis for speed and determinism
|
| 276 |
-
|
| 277 |
-
5. **Dense rewards with partial credit**: Better than sparse (pass/fail) for agent training
|
| 278 |
-
|
| 279 |
-
6. **OpenAI client for baseline**: Required by hackathon (not Anthropic client)
|
| 280 |
-
|
| 281 |
-
---
|
| 282 |
-
|
| 283 |
-
## 📚 REFERENCE: Scenario Structure
|
| 284 |
-
|
| 285 |
-
Each scenario should have:
|
| 286 |
-
```python
|
| 287 |
-
{
|
| 288 |
-
"id": "unique_scenario_id",
|
| 289 |
-
"files": [
|
| 290 |
-
{
|
| 291 |
-
"path": "Dockerfile",
|
| 292 |
-
"type": "dockerfile",
|
| 293 |
-
"content": "FROM python:3.11-slim\n..."
|
| 294 |
-
}
|
| 295 |
-
],
|
| 296 |
-
"error": {
|
| 297 |
-
"phase": "docker_build",
|
| 298 |
-
"message": "COPY failed: file not found...",
|
| 299 |
-
"exit_code": 1,
|
| 300 |
-
"failed_step": "COPY requirements.txt",
|
| 301 |
-
"line_hint": 3
|
| 302 |
-
},
|
| 303 |
-
"expected_fixes": [
|
| 304 |
-
{
|
| 305 |
-
"file": "Dockerfile",
|
| 306 |
-
"type": "contains", # or "not_contains", "line_equals", "regex"
|
| 307 |
-
"expected": "COPY requirements.txt",
|
| 308 |
-
"line": 3,
|
| 309 |
-
"hint": "Check the spelling of the filename",
|
| 310 |
-
"points": 0.5
|
| 311 |
-
}
|
| 312 |
-
]
|
| 313 |
-
}
|
| 314 |
-
```
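The `expected_fixes` entries above name four check types. A minimal sketch of how such checks could be evaluated against the agent's edited files follows; `check_fix` and `earned_points` are hypothetical helpers, and the exact semantics (substring match for `contains`, whitespace-trimmed comparison for `line_equals`, multiline regex search) are assumptions rather than the project's actual grader behaviour.

```python
import re


def check_fix(fix: dict, content: str) -> bool:
    """Evaluate one expected_fixes entry against a file's current content."""
    kind = fix["type"]
    expected = fix["expected"]
    if kind == "contains":
        return expected in content
    if kind == "not_contains":
        return expected not in content
    if kind == "line_equals":
        lines = content.splitlines()
        index = fix["line"] - 1  # scenario line numbers are assumed 1-indexed
        return 0 <= index < len(lines) and lines[index].strip() == expected.strip()
    if kind == "regex":
        return re.search(expected, content, re.MULTILINE) is not None
    raise ValueError(f"unknown fix type: {kind!r}")


def earned_points(expected_fixes: list, files: dict) -> float:
    """Sum the points of every satisfied fix; `files` maps path -> content."""
    total = 0.0
    for fix in expected_fixes:
        content = files.get(fix["file"])
        if content is not None and check_fix(fix, content):
            total += fix["points"]
    return total
```

A fix whose target file is missing earns nothing in this sketch, so a `not_contains` check cannot be satisfied simply by deleting the file.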
|
| 315 |
-
|
| 316 |
-
---
|
| 317 |
-
|
| 318 |
-
## 📞 COMMON ISSUES TO DEBUG
|
| 319 |
-
|
| 320 |
-
### Dockerfile Issues:
|
| 321 |
-
- Typos in filenames (requirments.txt)
|
| 322 |
-
- Invalid base image tags (python:3.11-slimm)
|
| 323 |
-
- Invalid EXPOSE syntax (EXPOSE "eighty")
|
| 324 |
-
- Missing WORKDIR before COPY
|
| 325 |
-
- Permission issues (chmod +x)
|
| 326 |
-
- CMD/ENTRYPOINT conflicts
|
| 327 |
-
|
| 328 |
-
### Workflow Issues:
|
| 329 |
-
- Missing env block for secrets
|
| 330 |
-
- Wrong secret syntax (${ vs ${{)
|
| 331 |
-
- Missing runs-on field
|
| 332 |
-
- Checkout after build (wrong order)
|
| 333 |
-
- Missing permissions for GITHUB_TOKEN
|
| 334 |
-
- Invalid event triggers
|
| 335 |
-
- Duplicate job IDs
|
| 336 |
-
|
| 337 |
-
### Combined Issues:
|
| 338 |
-
- Docker login needs secrets in env block
|
| 339 |
-
- Multi-platform builds need setup-buildx-action
|
| 340 |
-
- Cross-job artifacts need 'needs' dependency
|
| 341 |
-
- Path mismatches (dist vs build directory)
|
| 342 |
-
- GHCR uses GITHUB_TOKEN not DOCKER_PASSWORD
|
| 343 |
-
|
| 344 |
-
---
|
| 345 |
-
|
| 346 |
-
*Last updated: April 4, 2026*
|
| 347 |
-
*Author: Krishna*
|
Dockerfile
CHANGED
@@ -2,14 +2,21 @@ FROM python:3.11-slim
 
 WORKDIR /app
 
+# Install dependencies first (layer caching)
 COPY requirements.txt .
 RUN pip install --no-cache-dir -r requirements.txt
 
+# Copy application code
 COPY server/ ./server/
-COPY openenv.yaml .
-COPY inference.py .
 COPY baseline_runner.py .
+COPY inference.py .
+COPY openenv.yaml .
 
+# HuggingFace Spaces expects port 7860
 EXPOSE 7860
 
+# Health check
+HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
+    CMD python -c "import requests; requests.get('http://localhost:7860/')" || exit 1
+
 CMD ["python", "-m", "uvicorn", "server.main:app", "--host", "0.0.0.0", "--port", "7860"]
|
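A note on the health check added in this Dockerfile: `requests.get` raises only on connection-level failures, so a server that answers with an HTTP error status would still be reported healthy, and the probe depends on `requests` being installed in the image. A stricter probe using only the standard library might look like the sketch below; `is_healthy` is a hypothetical helper, not part of this commit.

```python
import http.client
import urllib.request


def is_healthy(url: str = "http://localhost:7860/", timeout: float = 5.0) -> bool:
    """Return True only if the endpoint answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP 4xx/5xx (HTTPError is an OSError
        # subclass), or a malformed response all count as unhealthy.
        return False


if __name__ == "__main__":
    # Exit code 0 means healthy, 1 means unhealthy, as HEALTHCHECK expects.
    raise SystemExit(0 if is_healthy() else 1)
```

If saved as a script inside the image, it could be invoked from the `HEALTHCHECK` instruction in place of the inline one-liner; the file name and location would be a choice for the project.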
IMPLEMENTATION_PLAN.md
DELETED
|
@@ -1,2814 +0,0 @@
|
|
| 1 |
-
# 🏗️ CI/CD Infrastructure Debugging Environment
|
| 2 |
-
## Complete Implementation Plan
|
| 3 |
-
|
| 4 |
-
---
|
| 5 |
-
|
| 6 |
-
# 📋 TABLE OF CONTENTS
|
| 7 |
-
|
| 8 |
-
1. [Executive Summary](#1-executive-summary)
|
| 9 |
-
2. [Scoring Strategy](#2-scoring-strategy)
|
| 10 |
-
3. [Project Structure](#3-project-structure)
|
| 11 |
-
4. [OpenEnv Spec Compliance](#4-openenv-spec-compliance)
|
| 12 |
-
5. [Environment Design](#5-environment-design)
|
| 13 |
-
6. [Task Design (6 Tasks)](#6-task-design)
|
| 14 |
-
7. [Grader Implementation](#7-grader-implementation)
|
| 15 |
-
8. [Reward Function Design](#8-reward-function-design)
|
| 16 |
-
9. [Baseline Inference Script](#9-baseline-inference-script)
|
| 17 |
-
10. [Dockerfile & Deployment](#10-dockerfile--deployment)
|
| 18 |
-
11. [Testing Plan](#11-testing-plan)
|
| 19 |
-
12. [Timeline & Milestones](#12-timeline--milestones)
|
| 20 |
-
|
| 21 |
-
---
|
| 22 |
-
|
| 23 |
-
# 1. EXECUTIVE SUMMARY
|
| 24 |
-
|
| 25 |
-
## Environment Name
|
| 26 |
-
**`cicd-debug-env`** — CI/CD Infrastructure Debugging Environment
|
| 27 |
-
|
| 28 |
-
## Concept
|
| 29 |
-
An OpenEnv-compliant environment where AI agents debug broken GitHub Actions workflows that build and deploy Docker containers. The agent receives error logs, workflow files, and Dockerfiles, then must identify and fix the issues.
|
| 30 |
-
|
| 31 |
-
## Why This Wins
|
| 32 |
-
|
| 33 |
-
| Criteria | Weight | Our Score | Why |
|
| 34 |
-
|----------|--------|-----------|-----|
|
| 35 |
-
| Real-world utility | 30% | 28-30 | Every developer uses Docker + CI/CD daily |
|
| 36 |
-
| Task & grader quality | 25% | 23-25 | Deterministic + dynamic scoring, 6-task progression |
|
| 37 |
-
| Environment design | 20% | 18-20 | Clean state, rich observations, dense rewards |
|
| 38 |
-
| Code quality & spec | 15% | 15 | Full OpenEnv compliance, clean code |
|
| 39 |
-
| Creativity & novelty | 10% | 10 | First CI/CD debugging env on OpenEnv |
|
| 40 |
-
| **TOTAL** | 100% | **94-100** | |
|
| 41 |
-
|
| 42 |
-
---
|
| 43 |
-
|
| 44 |
-
# 2. SCORING STRATEGY
|
| 45 |
-
|
| 46 |
-
## Phase 1: Automated Validation (Pass/Fail Gate)
|
| 47 |
-
We MUST pass all of these or we're disqualified:
|
| 48 |
-
|
| 49 |
-
| Check | How We Pass |
|
| 50 |
-
|-------|-------------|
|
| 51 |
-
| HF Space deploys | FastAPI server with health checks, proper port binding |
|
| 52 |
-
| OpenEnv spec compliance | `openenv.yaml` + typed Pydantic models + all 7 endpoints |
|
| 53 |
-
| Dockerfile builds | Multi-stage build, pinned versions, no external deps |
|
| 54 |
-
| Baseline reproduces | `inference.py` using OpenAI client, runs in <20min |
|
| 55 |
-
| 3+ tasks with graders | 6 tasks with deterministic 0.0-1.0 graders |
|
| 56 |
-
|
| 57 |
-
## Phase 2: Agentic Evaluation (Nemotron 3 Super)
|
| 58 |
-
Optimize for Nemotron's strengths:
|
| 59 |
-
- **Structured output**: YAML/Dockerfile are structured formats ✓
|
| 60 |
-
- **Multi-step reasoning**: Debug → Identify → Fix → Verify ✓
|
| 61 |
-
- **Tool calling patterns**: Action space maps to tool calls ✓
|
| 62 |
-
- **Long context**: Can include full workflow + Dockerfile + error logs ✓
|
| 63 |
-
|
| 64 |
-
## Phase 3: Human Review (Meta/HF Engineers)
|
| 65 |
-
Appeal to judges:
|
| 66 |
-
- **Real-world utility**: They debug CI/CD daily
|
| 67 |
-
- **Meta-relevance**: Hackathon requires Docker, we're debugging Docker
|
| 68 |
-
- **Clever mechanics**: Progressive hints, partial credit, multi-file fixes
|
| 69 |
-
|
| 70 |
-
---
|
| 71 |
-
|
| 72 |
-
# 3. PROJECT STRUCTURE
|
| 73 |
-
|
| 74 |
-
```
|
| 75 |
-
cicd-debug-env/
|
| 76 |
-
├── openenv.yaml # OpenEnv metadata (REQUIRED)
|
| 77 |
-
├── inference.py # Baseline inference script (REQUIRED)
|
| 78 |
-
├── Dockerfile # Container definition (REQUIRED)
|
| 79 |
-
├── requirements.txt # Python dependencies
|
| 80 |
-
├── README.md # Documentation
|
| 81 |
-
│
|
| 82 |
-
├── server/
|
| 83 |
-
│ ├── __init__.py
|
| 84 |
-
│ ├── main.py # FastAPI application with all endpoints
|
| 85 |
-
│ ├── models.py # Pydantic models (Observation, Action, etc.)
|
| 86 |
-
│ ├── environment.py # Core environment logic
|
| 87 |
-
│ ├── tasks/
|
| 88 |
-
│ │ ├── __init__.py
|
| 89 |
-
│ │ ├── base.py # Base task class
|
| 90 |
-
│ │ ├── task_registry.py # Task registration
|
| 91 |
-
│ │ ├── task_1_build_errors.py # Easy: Dockerfile syntax
|
| 92 |
-
│ │ ├── task_2_docker_runtime.py # Medium: Docker runtime
|
| 93 |
-
│ │ ├── task_3_workflow_syntax.py # Easy: Workflow syntax/structure
|
| 94 |
-
│ │ ├── task_4_workflow_secrets_permissions.py # Medium: Secrets/permissions
|
| 95 |
-
│ │ ├── task_5_ci_docker_integration.py # Medium-Hard: Combined CI+Docker
|
| 96 |
-
│ │ └── task_6_multi_stage_matrix.py # Hard: Multi-stage + matrix
|
| 97 |
-
│ ├── graders/
|
| 98 |
-
│ │ ├── __init__.py
|
| 99 |
-
│ │ ├── base.py # Base grader class
|
| 100 |
-
│ │ ├── dockerfile_grader.py # Dockerfile validation
|
| 101 |
-
│ │ ├── workflow_grader.py # GitHub Actions validation
|
| 102 |
-
│ │ └── integration_grader.py # Full pipeline validation
|
| 103 |
-
│ ├── simulators/
|
| 104 |
-
│ │ ├── __init__.py
|
| 105 |
-
│ │ ├── docker_simulator.py # Simulates docker build
|
| 106 |
-
│ │ └── workflow_simulator.py # Simulates GHA execution
|
| 107 |
-
│ └── utils/
|
| 108 |
-
│ ├── __init__.py
|
| 109 |
-
│ ├── yaml_parser.py # Safe YAML parsing
|
| 110 |
-
│ └── error_generator.py # Generates realistic errors
|
| 111 |
-
│
|
| 112 |
-
├── data/
|
| 113 |
-
│ ├── scenarios/ # Pre-built debugging scenarios
|
| 114 |
-
│ │ ├── easy/
|
| 115 |
-
│ │ ├── medium/
|
| 116 |
-
│ │ └── hard/
|
| 117 |
-
│ └── templates/ # Base templates for generation
|
| 118 |
-
│
|
| 119 |
-
└── tests/
|
| 120 |
-
├── test_endpoints.py # API endpoint tests
|
| 121 |
-
├── test_graders.py # Grader correctness tests
|
| 122 |
-
├── test_tasks.py # Task validation tests
|
| 123 |
-
└── test_determinism.py # Reproducibility tests
|
| 124 |
-
```
|
| 125 |
-
|
| 126 |
-
---
|
| 127 |
-
|
| 128 |
-
# 4. OPENENV SPEC COMPLIANCE
|
| 129 |
-
|
| 130 |
-
## 4.1 openenv.yaml
|
| 131 |
-
|
| 132 |
-
name: cicd-debug-env
|
| 133 |
-
version: "1.0.0"
|
| 134 |
-
description: >
|
| 135 |
-
Debug broken GitHub Actions workflows and Dockerfiles.
|
| 136 |
-
AI agents identify and fix CI/CD infrastructure issues.
|
| 137 |
-
|
| 138 |
-
author: Krishna
|
| 139 |
-
license: MIT
|
| 140 |
-
tags:
|
| 141 |
-
- devops
|
| 142 |
-
- docker
|
| 143 |
-
- github-actions
|
| 144 |
-
- debugging
|
| 145 |
-
- infrastructure
|
| 146 |
-
|
| 147 |
-
environment:
|
| 148 |
-
type: text
|
| 149 |
-
observation_space: structured
|
| 150 |
-
action_space: structured
|
| 151 |
-
max_steps: 10
|
| 152 |
-
|
| 153 |
-
tasks:
|
| 154 |
-
# Docker-only tasks (2)
|
| 155 |
-
- id: dockerfile_syntax
|
| 156 |
-
name: "Dockerfile Syntax Errors"
|
| 157 |
-
description: "Fix syntax and instruction errors in Dockerfiles"
|
| 158 |
-
difficulty: easy
|
| 159 |
-
|
| 160 |
-
- id: dockerfile_runtime
|
| 161 |
-
name: "Dockerfile Runtime Errors"
|
| 162 |
-
description: "Fix Dockerfiles that build but fail at runtime"
|
| 163 |
-
difficulty: medium
|
| 164 |
-
|
| 165 |
-
# Workflow-only tasks (2)
|
| 166 |
-
- id: workflow_syntax_structure
|
| 167 |
-
name: "Workflow Syntax and Structure"
|
| 168 |
-
description: "Fix YAML syntax and structural issues in GitHub Actions"
|
| 169 |
-
difficulty: easy
|
| 170 |
-
|
| 171 |
-
- id: workflow_secrets_permissions
|
| 172 |
-
name: "Workflow Secrets and Permissions"
|
| 173 |
-
description: "Fix secret wiring, env usage, and permissions in workflows"
|
| 174 |
-
difficulty: medium
|
| 175 |
-
|
| 176 |
-
# Combined tasks (2)
|
| 177 |
-
- id: ci_docker_integration
|
| 178 |
-
name: "CI and Docker Build Integration"
|
| 179 |
-
description: "Debug combined workflow and Docker build integration failures"
|
| 180 |
-
difficulty: medium-hard
|
| 181 |
-
|
| 182 |
-
- id: multi_stage_pipeline_matrix
|
| 183 |
-
name: "Multi-Stage Pipeline and Matrix"
|
| 184 |
-
description: "Debug complex multi-stage and matrix CI/CD pipelines"
|
| 185 |
-
difficulty: hard
|
| 186 |
-
|
| 187 |
-
graders:
|
| 188 |
-
dockerfile_syntax:
|
| 189 |
-
type: deterministic
|
| 190 |
-
score_range: [0.0, 1.0]
|
| 191 |
-
dockerfile_runtime:
|
| 192 |
-
type: deterministic
|
| 193 |
-
score_range: [0.0, 1.0]
|
| 194 |
-
workflow_syntax_structure:
|
| 195 |
-
type: deterministic
|
| 196 |
-
score_range: [0.0, 1.0]
|
| 197 |
-
workflow_secrets_permissions:
|
| 198 |
-
type: deterministic
|
| 199 |
-
score_range: [0.0, 1.0]
|
| 200 |
-
ci_docker_integration:
|
| 201 |
-
type: deterministic
|
| 202 |
-
score_range: [0.0, 1.0]
|
| 203 |
-
multi_stage_pipeline_matrix:
|
| 204 |
-
type: deterministic
|
| 205 |
-
score_range: [0.0, 1.0]
|
| 206 |
-
|
| 207 |
-
baseline:
|
| 208 |
-
script: inference.py
|
| 209 |
-
expected_scores:
|
| 210 |
-
dockerfile_syntax: 0.70
|
| 211 |
-
dockerfile_runtime: 0.55
|
| 212 |
-
workflow_syntax_structure: 0.65
|
| 213 |
-
workflow_secrets_permissions: 0.50
|
| 214 |
-
ci_docker_integration: 0.45
|
| 215 |
-
multi_stage_pipeline_matrix: 0.30
|
| 216 |
-
|
| 217 |
-
resources:
|
| 218 |
-
vcpu: 2
|
| 219 |
-
memory: 8gb
|
| 220 |
-
timeout: 1200
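The configuration above lists each task ID three times: under `tasks`, `graders`, and `baseline.expected_scores`. A small consistency check can catch a task that is missing a grader or a baseline score. The sketch below works on an already-parsed dictionary; `check_config` is a hypothetical helper, and the nesting it assumes (a top-level `tasks` list plus `graders` and `baseline.expected_scores` mappings) is inferred from the key names, since indentation is not preserved in this view.

```python
def check_config(cfg: dict) -> list:
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    task_ids = [task["id"] for task in cfg.get("tasks", [])]
    graders = cfg.get("graders", {})
    expected = cfg.get("baseline", {}).get("expected_scores", {})

    for task_id in task_ids:
        grader = graders.get(task_id)
        if grader is None:
            problems.append(f"{task_id}: no grader defined")
        elif list(grader.get("score_range", [])) != [0.0, 1.0]:
            problems.append(f"{task_id}: score_range is not [0.0, 1.0]")

        baseline = expected.get(task_id)
        if baseline is None:
            problems.append(f"{task_id}: no expected baseline score")
        elif not 0.0 <= baseline <= 1.0:
            problems.append(f"{task_id}: expected score {baseline} outside [0.0, 1.0]")

    # Graders or baseline entries that refer to unknown tasks are also suspicious.
    for name in sorted((set(graders) | set(expected)) - set(task_ids)):
        problems.append(f"{name}: not listed under tasks")
    return problems
```

Running such a check in the test suite would keep the three sections from drifting apart as tasks are added or renamed.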
|
| 221 |
-
|
| 222 |
-
## 4.2 Pydantic Models (server/models.py)
|
| 223 |
-
|
| 224 |
-
```python
|
| 225 |
-
"""
|
| 226 |
-
Typed Pydantic models for OpenEnv compliance.
|
| 227 |
-
All models must be serializable and well-documented.
|
| 228 |
-
"""
|
| 229 |
-
|
| 230 |
-
from typing import List, Dict, Optional, Literal, Any
|
| 231 |
-
from pydantic import BaseModel, Field
|
| 232 |
-
from enum import Enum
|
| 233 |
-
|
| 234 |
-
|
| 235 |
-
# ============== ENUMS ==============
|
| 236 |
-
|
| 237 |
-
class TaskDifficulty(str, Enum):
|
| 238 |
-
EASY = "easy"
|
| 239 |
-
MEDIUM = "medium"
|
| 240 |
-
HARD = "hard"
|
| 241 |
-
|
| 242 |
-
|
| 243 |
-
class ActionType(str, Enum):
|
| 244 |
-
EDIT_FILE = "edit_file"
|
| 245 |
-
ADD_LINE = "add_line"
|
| 246 |
-
DELETE_LINE = "delete_line"
|
| 247 |
-
REPLACE_LINE = "replace_line"
|
| 248 |
-
ADD_BLOCK = "add_block"
|
| 249 |
-
DELETE_BLOCK = "delete_block"
|
| 250 |
-
SUBMIT = "submit"
|
| 251 |
-
REQUEST_HINT = "request_hint"
|
| 252 |
-
|
| 253 |
-
|
| 254 |
-
class FileType(str, Enum):
|
| 255 |
-
DOCKERFILE = "dockerfile"
|
| 256 |
-
WORKFLOW = "workflow"
|
| 257 |
-
DOCKER_COMPOSE = "docker_compose"
|
| 258 |
-
REQUIREMENTS = "requirements"
|
| 259 |
-
OTHER = "other"
|
| 260 |
-
|
| 261 |
-
|
| 262 |
-
class ErrorPhase(str, Enum):
|
| 263 |
-
WORKFLOW_PARSE = "workflow_parse"
|
| 264 |
-
DOCKER_BUILD = "docker_build"
|
| 265 |
-
DOCKER_RUN = "docker_run"
|
| 266 |
-
TEST = "test"
|
| 267 |
-
PUSH = "push"
|
| 268 |
-
DEPLOY = "deploy"
|
| 269 |
-
|
| 270 |
-
|
| 271 |
-
# ============== OBSERVATION ==============
|
| 272 |
-
|
| 273 |
-
class FileContent(BaseModel):
|
| 274 |
-
"""Represents a file in the debugging scenario."""
|
| 275 |
-
path: str = Field(..., description="File path (e.g., 'Dockerfile', '.github/workflows/build.yml')")
|
| 276 |
-
content: str = Field(..., description="Current file content")
|
| 277 |
-
file_type: FileType = Field(..., description="Type of file")
|
| 278 |
-
line_count: int = Field(..., description="Number of lines in file")
|
| 279 |
-
|
| 280 |
-
|
| 281 |
-
class ErrorInfo(BaseModel):
|
| 282 |
-
"""Information about the CI/CD error."""
|
| 283 |
-
phase: ErrorPhase = Field(..., description="Phase where error occurred")
|
| 284 |
-
error_message: str = Field(..., description="The error message/log output")
|
| 285 |
-
exit_code: Optional[int] = Field(None, description="Exit code if applicable")
|
| 286 |
-
failed_step: Optional[str] = Field(None, description="Name of failed step/stage")
|
| 287 |
-
line_hint: Optional[int] = Field(None, description="Line number hint if available")
|
| 288 |
-
|
| 289 |
-
|
| 290 |
-
class Observation(BaseModel):
|
| 291 |
-
"""
|
| 292 |
-
Complete observation of the debugging environment state.
|
| 293 |
-
Provided to the agent at each step.
|
| 294 |
-
"""
|
| 295 |
-
# Task context
|
| 296 |
-
task_id: str = Field(..., description="Current task identifier")
|
| 297 |
-
task_description: str = Field(..., description="What needs to be fixed")
|
| 298 |
-
difficulty: TaskDifficulty = Field(..., description="Task difficulty level")
|
| 299 |
-
|
| 300 |
-
# Files to debug
|
| 301 |
-
files: List[FileContent] = Field(..., description="All files in the scenario")
|
| 302 |
-
|
| 303 |
-
# Error information
|
| 304 |
-
error: ErrorInfo = Field(..., description="Error that needs to be fixed")
|
| 305 |
-
|
| 306 |
-
# Build context (what's available in the CI environment)
|
| 307 |
-
available_secrets: List[str] = Field(default_factory=list, description="Available secret names")
|
| 308 |
-
runner_os: str = Field(default="ubuntu-latest", description="CI runner OS")
|
| 309 |
-
|
| 310 |
-
# Episode state
|
| 311 |
-
step_number: int = Field(..., description="Current step (1-indexed)")
|
| 312 |
-
max_steps: int = Field(..., description="Maximum allowed steps")
|
| 313 |
-
hints_used: int = Field(default=0, description="Number of hints requested")
|
| 314 |
-
hints_available: int = Field(default=3, description="Remaining hints")
|
| 315 |
-
|
| 316 |
-
# Previous action feedback
|
| 317 |
-
last_action_success: Optional[bool] = Field(None, description="Whether last action succeeded")
|
| 318 |
-
last_action_feedback: Optional[str] = Field(None, description="Feedback from last action")
|
| 319 |
-
|
| 320 |
-
# For partial credit tracking
|
| 321 |
-
issues_found: int = Field(default=0, description="Number of issues identified")
|
| 322 |
-
issues_fixed: int = Field(default=0, description="Number of issues fixed")
|
| 323 |
-
total_issues: int = Field(..., description="Total issues in this scenario")
|
| 324 |
-
|
| 325 |
-
|
| 326 |
-
# ============== ACTION ==============
|
| 327 |
-
|
| 328 |
-
class FileEdit(BaseModel):
|
| 329 |
-
"""A single edit to apply to a file."""
|
| 330 |
-
file_path: str = Field(..., description="Path to the file to edit")
|
| 331 |
-
line_number: Optional[int] = Field(None, description="Line number (1-indexed) for line operations")
|
| 332 |
-
old_content: Optional[str] = Field(None, description="Content to find/replace")
|
| 333 |
-
new_content: Optional[str] = Field(None, description="New content to insert/replace with")
|
| 334 |
-
|
| 335 |
-
|
| 336 |
-
class Action(BaseModel):
|
| 337 |
-
"""
|
| 338 |
-
Action taken by the agent to fix the CI/CD issue.
|
| 339 |
-
"""
|
| 340 |
-
action_type: ActionType = Field(..., description="Type of action to perform")
|
| 341 |
-
edits: Optional[List[FileEdit]] = Field(None, description="File edits for edit actions")
|
| 342 |
-
reasoning: Optional[str] = Field(None, description="Agent's reasoning (for logging)")
|
| 343 |
-
|
| 344 |
-
class Config:
|
| 345 |
-
json_schema_extra = {
|
| 346 |
-
"examples": [
|
| 347 |
-
{
|
| 348 |
-
"action_type": "replace_line",
|
| 349 |
-
"edits": [{
|
| 350 |
-
"file_path": "Dockerfile",
|
| 351 |
-
"line_number": 5,
|
| 352 |
-
"old_content": "RUN pip install -r requirments.txt",
|
| 353 |
-
"new_content": "RUN pip install -r requirements.txt"
|
| 354 |
-
}],
|
| 355 |
-
"reasoning": "Fixed typo in requirements.txt filename"
|
| 356 |
-
},
|
| 357 |
-
{
|
| 358 |
-
"action_type": "add_block",
|
| 359 |
-
"edits": [{
|
| 360 |
-
"file_path": ".github/workflows/build.yml",
|
| 361 |
-
"line_number": 15,
|
| 362 |
-
"new_content": " env:\n DOCKER_USERNAME: ${{ secrets.DOCKER_USERNAME }}"
|
| 363 |
-
}],
|
| 364 |
-
"reasoning": "Added missing env block for secrets"
|
| 365 |
-
},
|
| 366 |
-
{
|
| 367 |
-
"action_type": "submit",
|
| 368 |
-
"reasoning": "All issues fixed, submitting solution"
|
| 369 |
-
}
|
| 370 |
-
]
|
| 371 |
-
}
|
| 372 |
-
|
| 373 |
-
|
| 374 |
-
# ============== STEP RESULT ==============
|
| 375 |
-
|
| 376 |
-
class StepResult(BaseModel):
|
| 377 |
-
"""Result of taking an action in the environment."""
|
| 378 |
-
observation: Observation = Field(..., description="New observation after action")
|
| 379 |
-
reward: float = Field(..., ge=-1.0, le=1.0, description="Reward for this step (hint requests cost -0.05)")
|
| 380 |
-
done: bool = Field(..., description="Whether episode is complete")
|
| 381 |
-
info: Dict[str, Any] = Field(default_factory=dict, description="Additional info")
|
| 382 |
-
|
| 383 |
-
|
| 384 |
-
# ============== TASK INFO ==============
|
| 385 |
-
|
| 386 |
-
class TaskInfo(BaseModel):
|
| 387 |
-
"""Information about a single task."""
|
| 388 |
-
id: str = Field(..., description="Task identifier")
|
| 389 |
-
name: str = Field(..., description="Human-readable task name")
|
| 390 |
-
description: str = Field(..., description="Task description")
|
| 391 |
-
difficulty: TaskDifficulty = Field(..., description="Difficulty level")
|
| 392 |
-
num_scenarios: int = Field(..., description="Number of scenarios for this task")
|
| 393 |
-
|
| 394 |
-
|
| 395 |
-
class EnvironmentInfo(BaseModel):
|
| 396 |
-
"""Information about the environment."""
|
| 397 |
-
name: str = Field(default="cicd-debug-env")
|
| 398 |
-
version: str = Field(default="1.0.0")
|
| 399 |
-
description: str = Field(default="Debug CI/CD infrastructure issues")
|
| 400 |
-
tasks: List[TaskInfo] = Field(..., description="Available tasks")
|
| 401 |
-
max_steps: int = Field(default=10, description="Maximum steps per episode")
|
| 402 |
-
action_space: Dict[str, Any] = Field(..., description="Action space schema")
|
| 403 |
-
observation_space: Dict[str, Any] = Field(..., description="Observation space schema")
|
| 404 |
-
|
| 405 |
-
|
| 406 |
-
# ============== GRADER RESULT ==============
|
| 407 |
-
|
| 408 |
-
class GraderResult(BaseModel):
|
| 409 |
-
"""Result from running the grader."""
|
| 410 |
-
task_id: str = Field(..., description="Task that was graded")
|
| 411 |
-
score: float = Field(..., ge=0.0, le=1.0, description="Final score")
|
| 412 |
-
max_score: float = Field(default=1.0, description="Maximum possible score")
|
| 413 |
-
breakdown: Dict[str, float] = Field(default_factory=dict, description="Score breakdown")
|
| 414 |
-
feedback: str = Field(default="", description="Human-readable feedback")
|
| 415 |
-
steps_taken: int = Field(..., description="Number of steps taken")
|
| 416 |
-
hints_used: int = Field(default=0, description="Number of hints used")
|
| 417 |
-
|
| 418 |
-
|
| 419 |
-
# ============== API REQUEST/RESPONSE MODELS ==============
|
| 420 |
-
|
| 421 |
-
class ResetRequest(BaseModel):
|
| 422 |
-
"""Request to reset the environment."""
|
| 423 |
-
task_id: Optional[str] = Field(None, description="Specific task to load (random if not specified)")
|
| 424 |
-
scenario_id: Optional[str] = Field(None, description="Specific scenario within task")
|
| 425 |
-
seed: Optional[int] = Field(None, description="Random seed for reproducibility")
|
| 426 |
-
|
| 427 |
-
|
| 428 |
-
class ResetResponse(BaseModel):
|
| 429 |
-
"""Response from reset endpoint."""
|
| 430 |
-
observation: Observation
|
| 431 |
-
info: Dict[str, Any] = Field(default_factory=dict)
|
| 432 |
-
|
| 433 |
-
|
| 434 |
-
class StepRequest(BaseModel):
|
| 435 |
-
"""Request to take a step."""
|
| 436 |
-
action: Action
|
| 437 |
-
|
| 438 |
-
|
| 439 |
-
class StepResponse(BaseModel):
|
| 440 |
-
"""Response from step endpoint."""
|
| 441 |
-
observation: Observation
|
| 442 |
-
reward: float
|
| 443 |
-
done: bool
|
| 444 |
-
info: Dict[str, Any] = Field(default_factory=dict)
|
| 445 |
-
|
| 446 |
-
|
| 447 |
-
class StateResponse(BaseModel):
|
| 448 |
-
"""Response from state endpoint."""
|
| 449 |
-
observation: Observation
|
| 450 |
-
episode_reward: float = Field(..., description="Cumulative reward this episode")
|
| 451 |
-
steps_taken: int
|
| 452 |
-
done: bool
|
| 453 |
-
|
| 454 |
-
|
| 455 |
-
class GraderRequest(BaseModel):
|
| 456 |
-
"""Request to run grader."""
|
| 457 |
-
task_id: str
|
| 458 |
-
trajectory: List[Dict[str, Any]] = Field(..., description="List of per-step records (dicts), e.g. step, action, reward, done")
|
| 459 |
-
|
| 460 |
-
|
| 461 |
-
class GraderResponse(BaseModel):
|
| 462 |
-
"""Response from grader endpoint."""
|
| 463 |
-
result: GraderResult
|
| 464 |
-
|
| 465 |
-
|
| 466 |
-
class BaselineRequest(BaseModel):
|
| 467 |
-
"""Request to run baseline."""
|
| 468 |
-
task_id: Optional[str] = Field(None, description="Specific task (all if not specified)")
|
| 469 |
-
num_episodes: int = Field(default=1, description="Number of episodes to run")
|
| 470 |
-
|
| 471 |
-
|
| 472 |
-
class BaselineResponse(BaseModel):
|
| 473 |
-
"""Response from baseline endpoint."""
|
| 474 |
-
results: List[GraderResult]
|
| 475 |
-
aggregate_score: float
|
| 476 |
-
```
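
The `Action` and `FileEdit` models above determine the JSON body a client sends when taking a step. The following is a minimal sketch using only the standard library, assuming the field names shown in the models and the `"replace_line"` value from the schema example. The helper `make_replace_line_action` is hypothetical and not part of the project.

```python
import json
from typing import Any, Dict, Optional


def make_replace_line_action(
    file_path: str,
    line_number: int,
    new_content: str,
    reasoning: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a dict shaped like the Action model (hypothetical helper)."""
    if line_number < 1:
        raise ValueError("line_number is 1-indexed")
    action: Dict[str, Any] = {
        "action_type": "replace_line",
        "edits": [
            {
                "file_path": file_path,
                "line_number": line_number,
                "new_content": new_content,
            }
        ],
    }
    # reasoning is optional in the model, so only include it when given
    if reasoning is not None:
        action["reasoning"] = reasoning
    return action


# Wrap the action the way StepRequest expects: {"action": {...}}
step_body = {
    "action": make_replace_line_action(
        "Dockerfile", 3, "COPY requirements.txt .", "Fix filename typo"
    )
}
step_json = json.dumps(step_body)
```

Building the payload as a plain dict keeps the client free of any dependency on the server's model classes; the server still validates the body against `StepRequest`.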
|
| 477 |
-
|
| 478 |
-
## 4.3 FastAPI Endpoints (server/main.py)
|
| 479 |
-
|
| 480 |
-
```python
|
| 481 |
-
"""
|
| 482 |
-
FastAPI server implementing all required OpenEnv endpoints.
|
| 483 |
-
"""
|
| 484 |
-
|
| 485 |
-
from fastapi import FastAPI, HTTPException
|
| 486 |
-
from fastapi.middleware.cors import CORSMiddleware
|
| 487 |
-
import uvicorn
|
| 488 |
-
from typing import Optional
|
| 489 |
-
|
| 490 |
-
from models import (
|
| 491 |
-
ResetRequest, ResetResponse,
|
| 492 |
-
StepRequest, StepResponse,
|
| 493 |
-
StateResponse,
|
| 494 |
-
EnvironmentInfo, TaskInfo,
|
| 495 |
-
GraderRequest, GraderResponse,
|
| 496 |
-
BaselineRequest, BaselineResponse,
|
| 497 |
-
Observation, Action, GraderResult
|
| 498 |
-
)
|
| 499 |
-
from environment import CICDDebugEnvironment
|
| 500 |
-
from tasks.task_registry import TASK_REGISTRY
|
| 501 |
-
from graders import run_grader
|
| 502 |
-
|
| 503 |
-
app = FastAPI(
|
| 504 |
-
title="CI/CD Debug Environment",
|
| 505 |
-
description="OpenEnv-compliant environment for debugging Docker + GitHub Actions",
|
| 506 |
-
version="1.0.0"
|
| 507 |
-
)
|
| 508 |
-
|
| 509 |
-
app.add_middleware(
|
| 510 |
-
CORSMiddleware,
|
| 511 |
-
allow_origins=["*"],
|
| 512 |
-
allow_credentials=True,
|
| 513 |
-
allow_methods=["*"],
|
| 514 |
-
allow_headers=["*"],
|
| 515 |
-
)
|
| 516 |
-
|
| 517 |
-
# Single module-level environment shared by all requests (a production deployment would need one per session)
|
| 518 |
-
env: Optional[CICDDebugEnvironment] = None
|
| 519 |
-
|
| 520 |
-
|
| 521 |
-
@app.get("/")
|
| 522 |
-
async def root():
|
| 523 |
-
"""Health check endpoint."""
|
| 524 |
-
return {"status": "healthy", "environment": "cicd-debug-env"}
|
| 525 |
-
|
| 526 |
-
|
| 527 |
-
@app.post("/reset", response_model=ResetResponse)
|
| 528 |
-
async def reset(request: Optional[ResetRequest] = None):
|
| 529 |
-
"""
|
| 530 |
-
Reset the environment to a new episode.
|
| 531 |
-
|
| 532 |
-
POST /reset
|
| 533 |
-
|
| 534 |
-
Optionally specify task_id and scenario_id for reproducibility.
|
| 535 |
-
Returns initial observation.
|
| 536 |
-
"""
|
| 537 |
-
global env
|
| 538 |
-
|
| 539 |
-
request = request or ResetRequest()
|
| 540 |
-
|
| 541 |
-
env = CICDDebugEnvironment()
|
| 542 |
-
observation = env.reset(
|
| 543 |
-
task_id=request.task_id,
|
| 544 |
-
scenario_id=request.scenario_id,
|
| 545 |
-
seed=request.seed
|
| 546 |
-
)
|
| 547 |
-
|
| 548 |
-
return ResetResponse(
|
| 549 |
-
observation=observation,
|
| 550 |
-
info={
|
| 551 |
-
"task_id": env.current_task_id,
|
| 552 |
-
"scenario_id": env.current_scenario_id,
|
| 553 |
-
"difficulty": env.current_difficulty
|
| 554 |
-
}
|
| 555 |
-
)
|
| 556 |
-
|
| 557 |
-
|
| 558 |
-
@app.post("/step", response_model=StepResponse)
|
| 559 |
-
async def step(request: StepRequest):
|
| 560 |
-
"""
|
| 561 |
-
Take an action in the environment.
|
| 562 |
-
|
| 563 |
-
POST /step
|
| 564 |
-
|
| 565 |
-
Returns new observation, reward, done flag, and info.
|
| 566 |
-
"""
|
| 567 |
-
global env
|
| 568 |
-
|
| 569 |
-
if env is None:
|
| 570 |
-
raise HTTPException(status_code=400, detail="Environment not initialized. Call /reset first.")
|
| 571 |
-
|
| 572 |
-
observation, reward, done, info = env.step(request.action)
|
| 573 |
-
|
| 574 |
-
return StepResponse(
|
| 575 |
-
observation=observation,
|
| 576 |
-
reward=reward,
|
| 577 |
-
done=done,
|
| 578 |
-
info=info
|
| 579 |
-
)
|
| 580 |
-
|
| 581 |
-
|
| 582 |
-
@app.get("/state", response_model=StateResponse)
|
| 583 |
-
async def get_state():
|
| 584 |
-
"""
|
| 585 |
-
Get current environment state.
|
| 586 |
-
|
| 587 |
-
GET /state
|
| 588 |
-
|
| 589 |
-
Returns current observation and episode statistics.
|
| 590 |
-
"""
|
| 591 |
-
global env
|
| 592 |
-
|
| 593 |
-
if env is None:
|
| 594 |
-
raise HTTPException(status_code=400, detail="Environment not initialized. Call /reset first.")
|
| 595 |
-
|
| 596 |
-
return StateResponse(
|
| 597 |
-
observation=env.get_observation(),
|
| 598 |
-
episode_reward=env.episode_reward,
|
| 599 |
-
steps_taken=env.step_count,
|
| 600 |
-
done=env.done
|
| 601 |
-
)
|
| 602 |
-
|
| 603 |
-
|
| 604 |
-
@app.get("/info", response_model=EnvironmentInfo)
|
| 605 |
-
async def get_info():
|
| 606 |
-
"""
|
| 607 |
-
Get environment metadata.
|
| 608 |
-
|
| 609 |
-
GET /info
|
| 610 |
-
|
| 611 |
-
Returns environment info, available tasks, and action/observation schemas.
|
| 612 |
-
"""
|
| 613 |
-
tasks = [
|
| 614 |
-
TaskInfo(
|
| 615 |
-
id=task_id,
|
| 616 |
-
name=task_cls.NAME,
|
| 617 |
-
description=task_cls.DESCRIPTION,
|
| 618 |
-
difficulty=task_cls.DIFFICULTY,
|
| 619 |
-
num_scenarios=len(task_cls.SCENARIOS)
|
| 620 |
-
)
|
| 621 |
-
for task_id, task_cls in TASK_REGISTRY.items()
|
| 622 |
-
]
|
| 623 |
-
|
| 624 |
-
return EnvironmentInfo(
|
| 625 |
-
name="cicd-debug-env",
|
| 626 |
-
version="1.0.0",
|
| 627 |
-
description="Debug CI/CD infrastructure issues (Docker + GitHub Actions)",
|
| 628 |
-
tasks=tasks,
|
| 629 |
-
max_steps=10,
|
| 630 |
-
action_space=Action.model_json_schema(),
|
| 631 |
-
observation_space=Observation.model_json_schema()
|
| 632 |
-
)
|
| 633 |
-
|
| 634 |
-
|
| 635 |
-
@app.get("/tasks")
|
| 636 |
-
async def get_tasks():
|
| 637 |
-
"""
|
| 638 |
-
Get list of available tasks.
|
| 639 |
-
|
| 640 |
-
GET /tasks
|
| 641 |
-
|
| 642 |
-
Returns task IDs, names, descriptions, and difficulties.
|
| 643 |
-
"""
|
| 644 |
-
return {
|
| 645 |
-
"tasks": [
|
| 646 |
-
{
|
| 647 |
-
"id": task_id,
|
| 648 |
-
"name": task_cls.NAME,
|
| 649 |
-
"description": task_cls.DESCRIPTION,
|
| 650 |
-
"difficulty": task_cls.DIFFICULTY.value
|
| 651 |
-
}
|
| 652 |
-
for task_id, task_cls in TASK_REGISTRY.items()
|
| 653 |
-
]
|
| 654 |
-
}
|
| 655 |
-
|
| 656 |
-
|
| 657 |
-
@app.post("/grader", response_model=GraderResponse)
|
| 658 |
-
async def grade(request: GraderRequest):
|
| 659 |
-
"""
|
| 660 |
-
Run grader on a trajectory.
|
| 661 |
-
|
| 662 |
-
POST /grader
|
| 663 |
-
|
| 664 |
-
Takes task_id and trajectory, returns score and breakdown.
|
| 665 |
-
"""
|
| 666 |
-
result = run_grader(
|
| 667 |
-
task_id=request.task_id,
|
| 668 |
-
trajectory=request.trajectory
|
| 669 |
-
)
|
| 670 |
-
|
| 671 |
-
return GraderResponse(result=result)
|
| 672 |
-
|
| 673 |
-
|
| 674 |
-
@app.post("/baseline", response_model=BaselineResponse)
|
| 675 |
-
async def run_baseline(request: Optional[BaselineRequest] = None):
|
| 676 |
-
"""
|
| 677 |
-
Run baseline agent on tasks.
|
| 678 |
-
|
| 679 |
-
POST /baseline
|
| 680 |
-
|
| 681 |
-
Runs the baseline inference script and returns scores.
|
| 682 |
-
"""
|
| 683 |
-
request = request or BaselineRequest()
|
| 684 |
-
|
| 685 |
-
# Import and run baseline
|
| 686 |
-
from baseline_runner import run_baseline_episodes
|
| 687 |
-
|
| 688 |
-
results = run_baseline_episodes(
|
| 689 |
-
task_id=request.task_id,
|
| 690 |
-
num_episodes=request.num_episodes
|
| 691 |
-
)
|
| 692 |
-
|
| 693 |
-
aggregate = sum(r.score for r in results) / len(results) if results else 0.0
|
| 694 |
-
|
| 695 |
-
return BaselineResponse(
|
| 696 |
-
results=results,
|
| 697 |
-
aggregate_score=aggregate
|
| 698 |
-
)
|
| 699 |
-
|
| 700 |
-
|
| 701 |
-
if __name__ == "__main__":
|
| 702 |
-
uvicorn.run(app, host="0.0.0.0", port=7860)
|
| 703 |
-
```
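
From a client's point of view, the endpoints above reduce to one `/reset` call followed by repeated `/step` calls until `done` is true. The sketch below illustrates that loop under stated assumptions: the server is reachable at `http://localhost:7860` (the port passed to `uvicorn.run`), and `http_post` and `run_episode` are hypothetical helper names. The transport is passed in as a callable so the loop can be exercised without a running server.

```python
import json
import urllib.request
from typing import Any, Callable, Dict, List

BASE_URL = "http://localhost:7860"  # assumption: server started locally on this port


def http_post(path: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    """POST a JSON body and decode the JSON response."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_episode(
    post: Callable[[str, Dict[str, Any]], Dict[str, Any]],
    task_id: str,
    actions: List[Dict[str, Any]],
) -> float:
    """Reset, then send actions until the server reports done.

    Returns the sum of per-step rewards.
    """
    post("/reset", {"task_id": task_id})
    total = 0.0
    for action in actions:
        result = post("/step", {"action": action})
        total += result["reward"]
        if result["done"]:
            break  # remaining actions are not sent once the episode ends
    return total
```

Against a live server this would be called as `run_episode(http_post, "dockerfile_syntax", actions)`, where each entry of `actions` follows the `Action` schema.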
|
| 704 |
-
|
| 705 |
-
---
|
| 706 |
-
|
| 707 |
-
# 5. ENVIRONMENT DESIGN
|
| 708 |
-
|
| 709 |
-
## 5.1 Core Environment Logic (server/environment.py)
|
| 710 |
-
|
| 711 |
-
```python
|
| 712 |
-
"""
|
| 713 |
-
Core environment logic for CI/CD debugging.
|
| 714 |
-
"""
|
| 715 |
-
|
| 716 |
-
from typing import Optional, Tuple, Dict, Any, List
|
| 717 |
-
import random
|
| 718 |
-
import copy
|
| 719 |
-
|
| 720 |
-
from models import (
|
| 721 |
-
Observation, Action, ActionType, FileContent, ErrorInfo,
|
| 722 |
-
TaskDifficulty, ErrorPhase, FileType
|
| 723 |
-
)
|
| 724 |
-
from tasks.task_registry import TASK_REGISTRY, get_task
|
| 725 |
-
from simulators.docker_simulator import DockerSimulator
|
| 726 |
-
from simulators.workflow_simulator import WorkflowSimulator
|
| 727 |
-
|
| 728 |
-
|
| 729 |
-
class CICDDebugEnvironment:
|
| 730 |
-
"""
|
| 731 |
-
OpenEnv-compliant environment for debugging CI/CD infrastructure.
|
| 732 |
-
|
| 733 |
-
Episode Flow:
|
| 734 |
-
1. reset() loads a scenario with broken config files
|
| 735 |
-
2. Agent observes files + error message
|
| 736 |
-
3. Agent takes actions to fix issues
|
| 737 |
-
4. Environment simulates build/run to verify fixes
|
| 738 |
-
5. Episode ends when all issues fixed or max_steps reached
|
| 739 |
-
"""
|
| 740 |
-
|
| 741 |
-
MAX_STEPS = 10
|
| 742 |
-
MAX_HINTS = 3
|
| 743 |
-
|
| 744 |
-
def __init__(self):
|
| 745 |
-
self.docker_sim = DockerSimulator()
|
| 746 |
-
self.workflow_sim = WorkflowSimulator()
|
| 747 |
-
|
| 748 |
-
# Episode state
|
| 749 |
-
self.current_task_id: Optional[str] = None
|
| 750 |
-
self.current_scenario_id: Optional[str] = None
|
| 751 |
-
self.current_difficulty: Optional[TaskDifficulty] = None
|
| 752 |
-
self.current_task = None
|
| 753 |
-
|
| 754 |
-
# File states
|
| 755 |
-
self.original_files: Dict[str, FileContent] = {}
|
| 756 |
-
self.current_files: Dict[str, FileContent] = {}
|
| 757 |
-
self.expected_fixes: List[Dict] = []
|
| 758 |
-
|
| 759 |
-
# Error state
|
| 760 |
-
self.current_error: Optional[ErrorInfo] = None
|
| 761 |
-
self.issues_total: int = 0
|
| 762 |
-
self.issues_fixed: int = 0
|
| 763 |
-
|
| 764 |
-
# Episode tracking
|
| 765 |
-
self.step_count: int = 0
|
| 766 |
-
self.episode_reward: float = 0.0
|
| 767 |
-
self.done: bool = False
|
| 768 |
-
self.hints_used: int = 0
|
| 769 |
-
|
| 770 |
-
# Action history
|
| 771 |
-
self.trajectory: List[Dict] = []
|
| 772 |
-
self.last_action_success: Optional[bool] = None
|
| 773 |
-
self.last_action_feedback: Optional[str] = None
|
| 774 |
-
|
| 775 |
-
def reset(
|
| 776 |
-
self,
|
| 777 |
-
task_id: Optional[str] = None,
|
| 778 |
-
scenario_id: Optional[str] = None,
|
| 779 |
-
seed: Optional[int] = None
|
| 780 |
-
) -> Observation:
|
| 781 |
-
"""Reset environment to a new episode."""
|
| 782 |
-
|
| 783 |
-
if seed is not None:
|
| 784 |
-
random.seed(seed)
|
| 785 |
-
|
| 786 |
-
# Select task
|
| 787 |
-
if task_id is None:
|
| 788 |
-
task_id = random.choice(list(TASK_REGISTRY.keys()))
|
| 789 |
-
|
| 790 |
-
if task_id not in TASK_REGISTRY:
|
| 791 |
-
raise ValueError(f"Unknown task: {task_id}")
|
| 792 |
-
|
| 793 |
-
self.current_task_id = task_id
|
| 794 |
-
self.current_task = get_task(task_id)
|
| 795 |
-
self.current_difficulty = self.current_task.DIFFICULTY
|
| 796 |
-
|
| 797 |
-
# Load scenario
|
| 798 |
-
scenario = self.current_task.load_scenario(scenario_id)
|
| 799 |
-
self.current_scenario_id = scenario["id"]
|
| 800 |
-
|
| 801 |
-
# Initialize files
|
| 802 |
-
self.original_files = {
|
| 803 |
-
f["path"]: FileContent(
|
| 804 |
-
path=f["path"],
|
| 805 |
-
content=f["content"],
|
| 806 |
-
file_type=FileType(f["type"]),
|
| 807 |
-
line_count=f["content"].count("\n") + 1
|
| 808 |
-
)
|
| 809 |
-
for f in scenario["files"]
|
| 810 |
-
}
|
| 811 |
-
self.current_files = copy.deepcopy(self.original_files)
|
| 812 |
-
|
| 813 |
-
# Initialize error
|
| 814 |
-
self.current_error = ErrorInfo(
|
| 815 |
-
phase=ErrorPhase(scenario["error"]["phase"]),
|
| 816 |
-
error_message=scenario["error"]["message"],
|
| 817 |
-
exit_code=scenario["error"].get("exit_code"),
|
| 818 |
-
failed_step=scenario["error"].get("failed_step"),
|
| 819 |
-
line_hint=scenario["error"].get("line_hint")
|
| 820 |
-
)
|
| 821 |
-
|
| 822 |
-
# Initialize fixes tracking
|
| 823 |
-
self.expected_fixes = scenario["expected_fixes"]
|
| 824 |
-
self.issues_total = len(self.expected_fixes)
|
| 825 |
-
self.issues_fixed = 0
|
| 826 |
-
|
| 827 |
-
# Reset episode state
|
| 828 |
-
self.step_count = 0
|
| 829 |
-
self.episode_reward = 0.0
|
| 830 |
-
self.done = False
|
| 831 |
-
self.hints_used = 0
|
| 832 |
-
self.trajectory = []
|
| 833 |
-
self.last_action_success = None
|
| 834 |
-
self.last_action_feedback = None
|
| 835 |
-
|
| 836 |
-
return self.get_observation()
|
| 837 |
-
|
| 838 |
-
def step(self, action: Action) -> Tuple[Observation, float, bool, Dict[str, Any]]:
|
| 839 |
-
"""Take an action and return (observation, reward, done, info)."""
|
| 840 |
-
|
| 841 |
-
if self.done:
|
| 842 |
-
return self.get_observation(), 0.0, True, {"error": "Episode already done"}
|
| 843 |
-
|
| 844 |
-
self.step_count += 1
|
| 845 |
-
reward = 0.0
|
| 846 |
-
info = {}
|
| 847 |
-
|
| 848 |
-
# Process action
|
| 849 |
-
if action.action_type == ActionType.REQUEST_HINT:
|
| 850 |
-
reward, feedback = self._handle_hint_request()
|
| 851 |
-
elif action.action_type == ActionType.SUBMIT:
|
| 852 |
-
reward, feedback = self._handle_submit()
|
| 853 |
-
else:
|
| 854 |
-
reward, feedback = self._handle_edit(action)
|
| 855 |
-
|
| 856 |
-
self.last_action_feedback = feedback
|
| 857 |
-
self.episode_reward += reward
|
| 858 |
-
|
| 859 |
-
# Check termination conditions
|
| 860 |
-
if self.step_count >= self.MAX_STEPS:
|
| 861 |
-
self.done = True
|
| 862 |
-
info["termination_reason"] = "max_steps"
|
| 863 |
-
elif action.action_type == ActionType.SUBMIT:
|
| 864 |
-
self.done = True
|
| 865 |
-
info["termination_reason"] = "submitted"
|
| 866 |
-
elif self.issues_fixed == self.issues_total:
|
| 867 |
-
# All issues fixed, auto-complete
|
| 868 |
-
self.done = True
|
| 869 |
-
info["termination_reason"] = "all_fixed"
|
| 870 |
-
|
| 871 |
-
# Record trajectory
|
| 872 |
-
self.trajectory.append({
|
| 873 |
-
"step": self.step_count,
|
| 874 |
-
"action": action.model_dump(),
|
| 875 |
-
"reward": reward,
|
| 876 |
-
"done": self.done
|
| 877 |
-
})
|
| 878 |
-
|
| 879 |
-
info["issues_fixed"] = self.issues_fixed
|
| 880 |
-
info["issues_total"] = self.issues_total
|
| 881 |
-
|
| 882 |
-
return self.get_observation(), reward, self.done, info
|
| 883 |
-
|
| 884 |
-
def _handle_edit(self, action: Action) -> Tuple[float, str]:
|
| 885 |
-
"""Handle file edit actions."""
|
| 886 |
-
|
| 887 |
-
if not action.edits:
|
| 888 |
-
self.last_action_success = False
|
| 889 |
-
return 0.0, "No edits provided"
|
| 890 |
-
|
| 891 |
-
reward = 0.0
|
| 892 |
-
feedbacks = []
|
| 893 |
-
|
| 894 |
-
for edit in action.edits:
|
| 895 |
-
# Check file exists
|
| 896 |
-
if edit.file_path not in self.current_files:
|
| 897 |
-
feedbacks.append(f"File not found: {edit.file_path}")
|
| 898 |
-
continue
|
| 899 |
-
|
| 900 |
-
file_content = self.current_files[edit.file_path]
|
| 901 |
-
lines = file_content.content.split("\n")
|
| 902 |
-
|
| 903 |
-
try:
|
| 904 |
-
if action.action_type == ActionType.REPLACE_LINE:
|
| 905 |
-
if edit.line_number and 1 <= edit.line_number <= len(lines):
|
| 906 |
-
lines[edit.line_number - 1] = edit.new_content or ""
|
| 907 |
-
feedbacks.append(f"Replaced line {edit.line_number} in {edit.file_path}")
|
| 908 |
-
else:
|
| 909 |
-
feedbacks.append(f"Invalid line number: {edit.line_number}")
|
| 910 |
-
continue
|
| 911 |
-
|
| 912 |
-
elif action.action_type == ActionType.ADD_LINE:
|
| 913 |
-
insert_at = edit.line_number - 1 if edit.line_number else len(lines)
|
| 914 |
-
lines.insert(insert_at, edit.new_content or "")
|
| 915 |
-
feedbacks.append(f"Added line at {insert_at + 1} in {edit.file_path}")
|
| 916 |
-
|
| 917 |
-
elif action.action_type == ActionType.DELETE_LINE:
|
| 918 |
-
if edit.line_number and 1 <= edit.line_number <= len(lines):
|
| 919 |
-
del lines[edit.line_number - 1]
|
| 920 |
-
feedbacks.append(f"Deleted line {edit.line_number} in {edit.file_path}")
|
| 921 |
-
else:
|
| 922 |
-
feedbacks.append(f"Invalid line number: {edit.line_number}")
|
| 923 |
-
continue
|
| 924 |
-
|
| 925 |
-
elif action.action_type == ActionType.EDIT_FILE:
|
| 926 |
-
# Find and replace
|
| 927 |
-
if edit.old_content and edit.old_content in file_content.content:
|
| 928 |
-
new_content = file_content.content.replace(
|
| 929 |
-
edit.old_content,
|
| 930 |
-
edit.new_content or "",
|
| 931 |
-
1
|
| 932 |
-
)
|
| 933 |
-
lines = new_content.split("\n")
|
| 934 |
-
feedbacks.append(f"Replaced content in {edit.file_path}")
|
| 935 |
-
else:
|
| 936 |
-
feedbacks.append(f"Content not found in {edit.file_path}")
|
| 937 |
-
continue
|
| 938 |
-
|
| 939 |
-
# Update file
|
| 940 |
-
new_content = "\n".join(lines)
|
| 941 |
-
self.current_files[edit.file_path] = FileContent(
|
| 942 |
-
path=file_content.path,
|
| 943 |
-
content=new_content,
|
| 944 |
-
file_type=file_content.file_type,
|
| 945 |
-
line_count=len(lines)
|
| 946 |
-
)
|
| 947 |
-
|
| 948 |
-
# Check if this fixed an issue
|
| 949 |
-
fix_reward = self._check_fix_progress()
|
| 950 |
-
reward += fix_reward
|
| 951 |
-
|
| 952 |
-
except Exception as e:
|
| 953 |
-
feedbacks.append(f"Error applying edit: {str(e)}")
|
| 954 |
-
|
| 955 |
-
self.last_action_success = reward > 0
|
| 956 |
-
return reward, "; ".join(feedbacks)
|
| 957 |
-
|
| 958 |
-
def _check_fix_progress(self) -> float:
|
| 959 |
-
"""Check if current state fixes any issues."""
|
| 960 |
-
|
| 961 |
-
# Simulate build with current files
|
| 962 |
-
dockerfile = self.current_files.get("Dockerfile")
|
| 963 |
-
workflow = self.current_files.get(".github/workflows/build.yml")
|
| 964 |
-
|
| 965 |
-
fixes_applied = 0
|
| 966 |
-
|
| 967 |
-
for fix in self.expected_fixes:
|
| 968 |
-
file_path = fix["file"]
|
| 969 |
-
if file_path in self.current_files:
|
| 970 |
-
current_content = self.current_files[file_path].content
|
| 971 |
-
|
| 972 |
-
# Check if fix is applied
|
| 973 |
-
if fix["type"] == "contains":
|
| 974 |
-
if fix["expected"] in current_content:
|
| 975 |
-
fixes_applied += 1
|
| 976 |
-
elif fix["type"] == "not_contains":
|
| 977 |
-
if fix["expected"] not in current_content:
|
| 978 |
-
fixes_applied += 1
|
| 979 |
-
elif fix["type"] == "line_equals":
|
| 980 |
-
lines = current_content.split("\n")
|
| 981 |
-
if fix["line"] <= len(lines):
|
| 982 |
-
if lines[fix["line"] - 1].strip() == fix["expected"].strip():
|
| 983 |
-
fixes_applied += 1
|
| 984 |
-
|
| 985 |
-
new_fixed = fixes_applied - self.issues_fixed
|
| 986 |
-
if new_fixed > 0:
|
| 987 |
-
self.issues_fixed = fixes_applied
|
| 988 |
-
# Partial reward for each fix
|
| 989 |
-
return 0.3 * new_fixed
|
| 990 |
-
|
| 991 |
-
return 0.0
|
| 992 |
-
|
| 993 |
-
def _handle_submit(self) -> Tuple[float, str]:
|
| 994 |
-
"""Handle submission - run full validation."""
|
| 995 |
-
|
| 996 |
-
# Run Docker simulation
|
| 997 |
-
docker_result = self.docker_sim.validate(
|
| 998 |
-
dockerfile=self.current_files.get("Dockerfile"),
|
| 999 |
-
context_files=self.current_files
|
| 1000 |
-
)
|
| 1001 |
-
|
| 1002 |
-
# Run workflow simulation
|
| 1003 |
-
workflow_result = self.workflow_sim.validate(
|
| 1004 |
-
workflow=self.current_files.get(".github/workflows/build.yml"),
|
| 1005 |
-
files=self.current_files
|
| 1006 |
-
)
|
| 1007 |
-
|
| 1008 |
-
# Calculate final reward
|
| 1009 |
-
reward = 0.0
|
| 1010 |
-
feedback_parts = []
|
| 1011 |
-
|
| 1012 |
-
# Docker build success (0.3)
|
| 1013 |
-
if docker_result["build_success"]:
|
| 1014 |
-
reward += 0.3
|
| 1015 |
-
feedback_parts.append("Docker build: PASS")
|
| 1016 |
-
else:
|
| 1017 |
-
feedback_parts.append(f"Docker build: FAIL - {docker_result['error']}")
|
| 1018 |
-
|
| 1019 |
-
# Docker run success (0.2)
|
| 1020 |
-
if docker_result["run_success"]:
|
| 1021 |
-
reward += 0.2
|
| 1022 |
-
feedback_parts.append("Docker run: PASS")
|
| 1023 |
-
else:
|
| 1024 |
-
feedback_parts.append(f"Docker run: FAIL - {docker_result.get('run_error', 'unknown')}")
|
| 1025 |
-
|
| 1026 |
-
# Workflow parse success (0.2)
|
| 1027 |
-
if workflow_result["parse_success"]:
|
| 1028 |
-
reward += 0.2
|
| 1029 |
-
feedback_parts.append("Workflow parse: PASS")
|
| 1030 |
-
else:
|
| 1031 |
-
feedback_parts.append(f"Workflow parse: FAIL - {workflow_result['error']}")
|
| 1032 |
-
|
| 1033 |
-
# Workflow execution success (0.3)
|
| 1034 |
-
if workflow_result["execution_success"]:
|
| 1035 |
-
reward += 0.3
|
| 1036 |
-
feedback_parts.append("Workflow execution: PASS")
|
| 1037 |
-
else:
|
| 1038 |
-
feedback_parts.append(f"Workflow execution: FAIL - {workflow_result.get('exec_error', 'unknown')}")
|
| 1039 |
-
|
| 1040 |
-
self.last_action_success = reward >= 0.8
|
| 1041 |
-
return reward, "; ".join(feedback_parts)
|
| 1042 |
-
|
| 1043 |
-
def _handle_hint_request(self) -> Tuple[float, str]:
|
| 1044 |
-
"""Handle hint request."""
|
| 1045 |
-
|
| 1046 |
-
if self.hints_used >= self.MAX_HINTS:
|
| 1047 |
-
self.last_action_success = False
|
| 1048 |
-
return 0.0, "No hints remaining"
|
| 1049 |
-
|
| 1050 |
-
self.hints_used += 1
|
| 1051 |
-
|
| 1052 |
-
# Get next unfixed issue
|
| 1053 |
-
for fix in self.expected_fixes:
|
| 1054 |
-
file_path = fix["file"]
|
| 1055 |
-
if file_path in self.current_files:
|
| 1056 |
-
current_content = self.current_files[file_path].content
|
| 1057 |
-
|
| 1058 |
-
is_fixed = False
|
| 1059 |
-
if fix["type"] == "contains":
|
| 1060 |
-
is_fixed = fix["expected"] in current_content
|
| 1061 |
-
elif fix["type"] == "not_contains":
|
| 1062 |
-
is_fixed = fix["expected"] not in current_content
|
| 1063 |
-
|
| 1064 |
-
if not is_fixed:
|
| 1065 |
-
hint = fix.get("hint", f"Check {file_path} around line {fix.get('line', '?')}")
|
| 1066 |
-
self.last_action_success = True
|
| 1067 |
-
# Small negative reward for using hint
|
| 1068 |
-
return -0.05, f"Hint ({self.hints_used}/{self.MAX_HINTS}): {hint}"
|
| 1069 |
-
|
| 1070 |
-
self.last_action_success = True
|
| 1071 |
-
return 0.0, "All known issues appear to be fixed"
|
| 1072 |
-
|
| 1073 |
-
def get_observation(self) -> Observation:
|
| 1074 |
-
"""Get current observation."""
|
| 1075 |
-
|
| 1076 |
-
return Observation(
|
| 1077 |
-
task_id=self.current_task_id,
|
| 1078 |
-
task_description=self.current_task.DESCRIPTION,
|
| 1079 |
-
difficulty=self.current_difficulty,
|
| 1080 |
-
files=list(self.current_files.values()),
|
| 1081 |
-
error=self.current_error,
|
| 1082 |
-
available_secrets=self.current_task.AVAILABLE_SECRETS,
|
| 1083 |
-
runner_os="ubuntu-latest",
|
| 1084 |
-
step_number=self.step_count,
|
| 1085 |
-
max_steps=self.MAX_STEPS,
|
| 1086 |
-
hints_used=self.hints_used,
|
| 1087 |
-
hints_available=self.MAX_HINTS - self.hints_used,
|
| 1088 |
-
last_action_success=self.last_action_success,
|
| 1089 |
-
last_action_feedback=self.last_action_feedback,
|
| 1090 |
-
issues_found=self.issues_fixed, # Simplified: found = fixed
|
| 1091 |
-
issues_fixed=self.issues_fixed,
|
| 1092 |
-
total_issues=self.issues_total
|
| 1093 |
-
)
|
| 1094 |
-
```
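
The environment scores progress with two mechanisms: rule checks against `expected_fixes` (`contains`, `not_contains`, `line_equals`) and a weighted sum over four validation stages on submit. The sketch below isolates both as pure functions, assuming the rule shapes and weights shown in the code above. The names `is_fix_applied`, `SUBMIT_WEIGHTS`, and `submit_reward` are hypothetical.

```python
from typing import Any, Dict


def is_fix_applied(fix: Dict[str, Any], content: str) -> bool:
    """Evaluate one expected-fix rule against a file's current text."""
    kind = fix["type"]
    if kind == "contains":
        return fix["expected"] in content
    if kind == "not_contains":
        return fix["expected"] not in content
    if kind == "line_equals":
        lines = content.split("\n")
        idx = fix["line"] - 1  # rules use 1-indexed line numbers
        return 0 <= idx < len(lines) and lines[idx].strip() == fix["expected"].strip()
    return False  # unknown rule types never count as fixed


# Weights taken from the comments in _handle_submit
SUBMIT_WEIGHTS = {
    "docker_build": 0.3,
    "docker_run": 0.2,
    "workflow_parse": 0.2,
    "workflow_execution": 0.3,
}


def submit_reward(passed: Dict[str, bool]) -> float:
    """Sum the weights of the validation stages that passed."""
    return sum(w for name, w in SUBMIT_WEIGHTS.items() if passed.get(name, False))
```

A single rule evaluator like this could be shared by `_check_fix_progress` and `_handle_hint_request`; as written above, the hint handler does not evaluate `line_equals` rules, so it would keep hinting at such an issue after it is fixed. Note also that per-fix rewards (0.3 each) accumulate on top of the submit reward, so the episode total can exceed 1.0.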
|
| 1095 |
-
|
| 1096 |
-
---
|
| 1097 |
-
|
| 1098 |
-
# 6. TASK DESIGN (6 Tasks)
|
| 1099 |
-
|
| 1100 |
-
## 6.1 Task Registry (server/tasks/task_registry.py)
|
| 1101 |
-
|
| 1102 |
-
```python
|
| 1103 |
-
"""Task registration and loading."""
|
| 1104 |
-
|
| 1105 |
-
from typing import Dict, Type
|
| 1106 |
-
from .base import BaseTask
|
| 1107 |
-
from .task_1_build_errors import DockerfileSyntaxTask
|
| 1108 |
-
from .task_2_docker_runtime import DockerfileRuntimeTask
|
| 1109 |
-
from .task_3_workflow_syntax import WorkflowSyntaxStructureTask
|
| 1110 |
-
from .task_4_workflow_secrets_permissions import WorkflowSecretsPermissionsTask
|
| 1111 |
-
from .task_5_ci_docker_integration import CIDockerIntegrationTask
|
| 1112 |
-
from .task_6_multi_stage_matrix import MultiStageMatrixTask
|
| 1113 |
-
|
| 1114 |
-
TASK_REGISTRY: Dict[str, Type[BaseTask]] = {
|
| 1115 |
-
"dockerfile_syntax": DockerfileSyntaxTask,
|
| 1116 |
-
"dockerfile_runtime": DockerfileRuntimeTask,
|
| 1117 |
-
"workflow_syntax_structure": WorkflowSyntaxStructureTask,
|
| 1118 |
-
"workflow_secrets_permissions": WorkflowSecretsPermissionsTask,
|
| 1119 |
-
"ci_docker_integration": CIDockerIntegrationTask,
|
| 1120 |
-
"multi_stage_pipeline_matrix": MultiStageMatrixTask,
|
| 1121 |
-
}
|
| 1122 |
-
|
| 1123 |
-
def get_task(task_id: str) -> BaseTask:
|
| 1124 |
-
"""Get task instance by ID."""
|
| 1125 |
-
if task_id not in TASK_REGISTRY:
|
| 1126 |
-
raise ValueError(f"Unknown task: {task_id}")
|
| 1127 |
-
return TASK_REGISTRY[task_id]()
|
| 1128 |
-
```
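
The registry maps task IDs to classes, and `get_task` instantiates the class on lookup; the environment then calls `load_scenario(scenario_id)` on the instance. `BaseTask` is not shown in this plan, so the sketch below uses a stand-in with an assumed shape. `DemoTask`, `REGISTRY`, and the optional `rng` parameter are illustrative additions, not the project's code.

```python
import random
from typing import Any, Dict, List, Optional, Type


class BaseTask:
    """Stand-in for the project's base class (assumed shape)."""

    NAME = ""
    SCENARIOS: List[Dict[str, Any]] = []

    def load_scenario(
        self,
        scenario_id: Optional[str] = None,
        rng: Optional[random.Random] = None,
    ) -> Dict[str, Any]:
        """Return the scenario with the given id, or a random one if none is given."""
        if scenario_id is None:
            # Fall back to the module-level generator when no rng is supplied
            return (rng or random).choice(self.SCENARIOS)
        for scenario in self.SCENARIOS:
            if scenario["id"] == scenario_id:
                return scenario
        raise ValueError(f"Unknown scenario: {scenario_id}")


class DemoTask(BaseTask):
    NAME = "Demo"
    SCENARIOS = [{"id": "typo_filename"}, {"id": "invalid_base_image"}]


REGISTRY: Dict[str, Type[BaseTask]] = {"demo": DemoTask}


def get_task(task_id: str) -> BaseTask:
    """Instantiate the task class registered under task_id."""
    if task_id not in REGISTRY:
        raise ValueError(f"Unknown task: {task_id}")
    return REGISTRY[task_id]()
```

Passing a dedicated `random.Random(seed)` instead of seeding the global generator is one way to keep scenario selection reproducible without affecting other code that uses `random`.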
|
| 1129 |
-
|
| 1130 |
-
## 6.2 Task 1: Dockerfile Syntax Errors (EASY)
|
| 1131 |
-
|
| 1132 |
-
```python
|
| 1133 |
-
"""
|
| 1134 |
-
Task 1: Dockerfile Syntax Errors
|
| 1135 |
-
Difficulty: EASY
|
| 1136 |
-
Focus: Pure Dockerfile issues - no GitHub Actions involved
|
| 1137 |
-
|
| 1138 |
-
Agent must fix common Dockerfile mistakes:
|
| 1139 |
-
- Typos in instruction names
|
| 1140 |
-
- Wrong file paths
|
| 1141 |
-
- Missing instructions
|
| 1142 |
-
- Invalid syntax
|
| 1143 |
-
"""
|
| 1144 |
-
|
| 1145 |
-
from typing import Dict, List, Optional
|
| 1146 |
-
import random
|
| 1147 |
-
from models import TaskDifficulty
|
| 1148 |
-
from .base import BaseTask
|
| 1149 |
-
|
| 1150 |
-
|
| 1151 |
-
class DockerfileSyntaxTask(BaseTask):
|
| 1152 |
-
|
| 1153 |
-
NAME = "Dockerfile Syntax Errors"
|
| 1154 |
-
DESCRIPTION = "Fix syntax and instruction errors in Dockerfiles"
|
| 1155 |
-
DIFFICULTY = TaskDifficulty.EASY
|
| 1156 |
-
AVAILABLE_SECRETS = [] # No secrets needed for this task
|
| 1157 |
-
|
| 1158 |
-
SCENARIOS = [
|
| 1159 |
-
# Scenario 1: Typo in filename
|
| 1160 |
-
{
|
| 1161 |
-
"id": "typo_filename",
|
| 1162 |
-
"files": [
|
| 1163 |
-
{
|
| 1164 |
-
"path": "Dockerfile",
|
| 1165 |
-
"type": "dockerfile",
|
| 1166 |
-
"content": """FROM python:3.9-slim
|
| 1167 |
-
WORKDIR /app
|
| 1168 |
-
COPY requirments.txt .
|
| 1169 |
-
RUN pip install --no-cache-dir -r requirements.txt
|
| 1170 |
-
COPY . .
|
| 1171 |
-
CMD ["python", "app.py"]"""
|
| 1172 |
-
},
|
| 1173 |
-
{
|
| 1174 |
-
"path": "requirements.txt",
|
| 1175 |
-
"type": "requirements",
|
| 1176 |
-
"content": "flask==2.0.0\nrequests==2.28.0"
|
| 1177 |
-
}
|
| 1178 |
-
],
|
| 1179 |
-
"error": {
|
| 1180 |
-
"phase": "docker_build",
|
| 1181 |
-
"message": "COPY failed: file not found in build context: requirments.txt",
|
| 1182 |
-
"exit_code": 1,
|
| 1183 |
-
"failed_step": "COPY requirments.txt .",
|
| 1184 |
-
"line_hint": 3
|
| 1185 |
-
},
|
| 1186 |
-
"expected_fixes": [
|
| 1187 |
-
{
|
| 1188 |
-
"file": "Dockerfile",
|
| 1189 |
-
"type": "contains",
|
| 1190 |
-
"expected": "COPY requirements.txt",
|
| 1191 |
-
"line": 3,
|
| 1192 |
-
"hint": "Check the spelling of the requirements file"
|
| 1193 |
-
}
|
| 1194 |
-
]
|
| 1195 |
-
},
|
| 1196 |
-
|
| 1197 |
-
# Scenario 2: Wrong base image tag
|
| 1198 |
-
{
|
| 1199 |
-
"id": "invalid_base_image",
|
| 1200 |
-
"files": [
|
| 1201 |
-
{
|
| 1202 |
-
"path": "Dockerfile",
|
| 1203 |
-
"type": "dockerfile",
|
| 1204 |
-
"content": """FROM python:3.9-slimm
|
| 1205 |
-
WORKDIR /app
|
| 1206 |
-
COPY requirements.txt .
|
| 1207 |
-
RUN pip install -r requirements.txt
|
| 1208 |
-
COPY . .
|
| 1209 |
-
EXPOSE 8000
|
| 1210 |
-
CMD ["python", "app.py"]"""
|
| 1211 |
-
},
|
| 1212 |
-
{
|
| 1213 |
-
"path": "requirements.txt",
|
| 1214 |
-
"type": "requirements",
|
| 1215 |
-
"content": "flask==2.0.0"
|
| 1216 |
-
}
|
| 1217 |
-
],
|
| 1218 |
-
"error": {
|
| 1219 |
-
"phase": "docker_build",
|
| 1220 |
-
"message": "pull access denied for python:3.9-slimm, repository does not exist or may require 'docker login'",
|
| 1221 |
-
"exit_code": 1,
|
| 1222 |
-
"failed_step": "FROM python:3.9-slimm",
|
| 1223 |
-
"line_hint": 1
|
| 1224 |
-
},
|
| 1225 |
-
"expected_fixes": [
|
| 1226 |
-
{
|
| 1227 |
-
"file": "Dockerfile",
|
| 1228 |
-
"type": "line_equals",
|
| 1229 |
-
"expected": "FROM python:3.9-slim",
|
| 1230 |
-
"line": 1,
|
| 1231 |
-
"hint": "Check the base image tag - 'slimm' vs 'slim'"
|
| 1232 |
-
}
|
| 1233 |
-
]
|
| 1234 |
-
},
|
| 1235 |
-
|
| 1236 |
-
# Scenario 3: Missing WORKDIR before COPY
|
| 1237 |
-
{
|
| 1238 |
-
"id": "missing_workdir",
|
| 1239 |
-
"files": [
|
| 1240 |
-
{
|
| 1241 |
-
"path": "Dockerfile",
|
| 1242 |
-
"type": "dockerfile",
|
| 1243 |
-
"content": """FROM node:18-alpine
|
| 1244 |
-
COPY package*.json ./
|
| 1245 |
-
RUN npm ci
|
| 1246 |
-
COPY . .
|
| 1247 |
-
RUN npm run build
|
| 1248 |
-
EXPOSE 3000
|
| 1249 |
-
CMD ["npm", "start"]"""
|
| 1250 |
-
},
|
| 1251 |
-
{
|
| 1252 |
-
"path": "package.json",
|
| 1253 |
-
"type": "other",
|
| 1254 |
-
"content": '{"name": "app", "version": "1.0.0"}'
|
| 1255 |
-
}
|
| 1256 |
-
],
|
| 1257 |
-
"error": {
|
| 1258 |
-
"phase": "docker_run",
|
| 1259 |
-
"message": "Error: Cannot find module '/package.json'",
|
| 1260 |
-
"exit_code": 1,
|
| 1261 |
-
"failed_step": "npm start"
|
| 1262 |
-
},
|
| 1263 |
-
"expected_fixes": [
|
| 1264 |
-
{
|
| 1265 |
-
"file": "Dockerfile",
|
| 1266 |
-
"type": "contains",
|
| 1267 |
-
"expected": "WORKDIR /app",
|
| 1268 |
-
"hint": "Add WORKDIR before COPY to set proper working directory"
|
| 1269 |
-
}
|
| 1270 |
-
]
|
| 1271 |
-
},
|
| 1272 |
-
|
| 1273 |
-
# Scenario 4: Invalid RUN syntax
|
| 1274 |
-
{
|
| 1275 |
-
"id": "invalid_run_syntax",
|
| 1276 |
-
"files": [
|
| 1277 |
-
{
|
| 1278 |
-
"path": "Dockerfile",
|
| 1279 |
-
"type": "dockerfile",
|
| 1280 |
-
"content": """FROM python:3.9
|
| 1281 |
-
WORKDIR /app
|
| 1282 |
-
COPY . .
|
| 1283 |
-
RUN pip install -r requirements.txt
|
| 1284 |
-
&& python setup.py install
|
| 1285 |
-
CMD ["python", "main.py"]"""
|
| 1286 |
-
},
|
| 1287 |
-
{
|
| 1288 |
-
"path": "requirements.txt",
|
| 1289 |
-
"type": "requirements",
|
| 1290 |
-
"content": "numpy==1.21.0"
|
| 1291 |
-
}
|
| 1292 |
-
],
|
| 1293 |
-
"error": {
|
| 1294 |
-
"phase": "docker_build",
|
| 1295 |
-
"message": "Dockerfile parse error: unknown instruction: &&",
|
| 1296 |
-
"exit_code": 1,
|
| 1297 |
-
"line_hint": 5
|
| 1298 |
-
},
|
| 1299 |
-
"expected_fixes": [
|
| 1300 |
-
{
|
| 1301 |
-
"file": "Dockerfile",
|
| 1302 |
-
"type": "contains",
|
| 1303 |
-
"expected": "RUN pip install -r requirements.txt && python setup.py install",
|
| 1304 |
-
"hint": "Multi-line RUN commands need a backslash continuation or must be on the same line"
|
| 1305 |
-
}
|
| 1306 |
-
]
|
| 1307 |
-
},
|
| 1308 |
-
|
| 1309 |
-
# Scenario 5: EXPOSE with invalid port
|
| 1310 |
-
{
|
| 1311 |
-
"id": "invalid_expose",
|
| 1312 |
-
"files": [
|
| 1313 |
-
{
|
| 1314 |
-
"path": "Dockerfile",
|
| 1315 |
-
"type": "dockerfile",
|
| 1316 |
-
"content": """FROM nginx:alpine
|
| 1317 |
-
COPY nginx.conf /etc/nginx/nginx.conf
|
| 1318 |
-
COPY html /usr/share/nginx/html
|
| 1319 |
-
EXPOSE "eighty"
|
| 1320 |
-
CMD ["nginx", "-g", "daemon off;"]"""
|
| 1321 |
-
},
|
| 1322 |
-
{
|
| 1323 |
-
"path": "nginx.conf",
|
| 1324 |
-
"type": "other",
|
| 1325 |
-
"content": "events {}"
|
| 1326 |
-
}
|
| 1327 |
-
],
|
| 1328 |
-
"error": {
|
| 1329 |
-
"phase": "docker_build",
|
| 1330 |
-
"message": "EXPOSE requires numeric port or port/protocol",
|
| 1331 |
-
"exit_code": 1,
|
| 1332 |
-
"line_hint": 4
|
| 1333 |
-
},
|
| 1334 |
-
"expected_fixes": [
|
| 1335 |
-
{
|
| 1336 |
-
"file": "Dockerfile",
|
| 1337 |
-
"type": "contains",
|
| 1338 |
-
"expected": "EXPOSE 80",
|
| 1339 |
-
"line": 4,
|
| 1340 |
-
"hint": "EXPOSE must use numeric port values"
|
| 1341 |
-
}
|
| 1342 |
-
]
|
| 1343 |
-
}
|
| 1344 |
-
]
|
| 1345 |
-
|
| 1346 |
-
def load_scenario(self, scenario_id: Optional[str] = None) -> Dict:
|
| 1347 |
-
"""Load a specific scenario or random one."""
|
| 1348 |
-
if scenario_id:
|
| 1349 |
-
for s in self.SCENARIOS:
|
| 1350 |
-
if s["id"] == scenario_id:
|
| 1351 |
-
return s
|
| 1352 |
-
raise ValueError(f"Unknown scenario: {scenario_id}")
|
| 1353 |
-
return random.choice(self.SCENARIOS)
|
| 1354 |
-
```
|
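This section does not show how an `expected_fixes` entry is evaluated. The sketch below is one plausible reading, not the project's actual checker: `check_fix` is a hypothetical helper, and the whitespace-insensitive comparison for `line_equals` is an assumption. It illustrates why a `contains` check can be too weak when the broken text is a superset of the expected text, as with `slimm` versus `slim`.

```python
from typing import Dict


def check_fix(fix: Dict, files: Dict[str, str]) -> bool:
    """Return True if `fix` is satisfied by the current file contents.

    Hypothetical helper: the real grader's matching rules are not shown
    in the plan, so the semantics here are assumptions.
    """
    content = files.get(fix["file"], "")
    if fix["type"] == "contains":
        # Plain substring match anywhere in the file.
        return fix["expected"] in content
    if fix["type"] == "line_equals":
        lines = content.splitlines()
        index = fix["line"] - 1  # scenario line numbers are 1-based
        # Assumption: ignore leading/trailing whitespace when comparing.
        return 0 <= index < len(lines) and lines[index].strip() == fix["expected"].strip()
    raise ValueError(f"Unknown fix type: {fix['type']}")


broken = {"Dockerfile": "FROM python:3.9-slimm\nWORKDIR /app"}
fixed = {"Dockerfile": "FROM python:3.9-slim\nWORKDIR /app"}

as_contains = {"file": "Dockerfile", "type": "contains",
               "expected": "FROM python:3.9-slim"}
as_line = {"file": "Dockerfile", "type": "line_equals", "line": 1,
           "expected": "FROM python:3.9-slim"}

print(check_fix(as_contains, broken))  # substring also matches the typo
print(check_fix(as_line, broken))
print(check_fix(as_line, fixed))
```

Under these assumptions, the substring check accepts the unfixed Dockerfile, while the line-level check only accepts the corrected one.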
| 1355 |
-
|
| 1356 |
-
## 6.3 Task 2: Workflow Configuration Errors (MEDIUM)
|
| 1357 |
-
|
| 1358 |
-
```python
|
| 1359 |
-
"""
|
| 1360 |
-
Task 2: Workflow Configuration Errors
|
| 1361 |
-
Difficulty: MEDIUM
|
| 1362 |
-
Focus: GitHub Actions + Docker interaction issues
|
| 1363 |
-
|
| 1364 |
-
Agent must fix:
|
| 1365 |
-
- Missing secret references
|
| 1366 |
-
- Wrong env variable syntax
|
| 1367 |
-
- Incorrect step ordering
|
| 1368 |
-
- Missing permissions
|
| 1369 |
-
"""
|
| 1370 |
-
|
| 1371 |
-
from typing import Dict, Optional
|
| 1372 |
-
import random
|
| 1373 |
-
from models import TaskDifficulty
|
| 1374 |
-
from .base import BaseTask
|
| 1375 |
-
|
| 1376 |
-
|
| 1377 |
-
class WorkflowConfigTask(BaseTask):
|
| 1378 |
-
|
| 1379 |
-
NAME = "Workflow Configuration Errors"
|
| 1380 |
-
DESCRIPTION = "Fix GitHub Actions workflow configuration issues involving Docker"
|
| 1381 |
-
DIFFICULTY = TaskDifficulty.MEDIUM
|
| 1382 |
-
AVAILABLE_SECRETS = ["DOCKER_USERNAME", "DOCKER_PASSWORD", "GITHUB_TOKEN"]
|
| 1383 |
-
|
| 1384 |
-
SCENARIOS = [
|
| 1385 |
-
# Scenario 1: Missing env block for secrets
|
| 1386 |
-
{
|
| 1387 |
-
"id": "missing_env_secrets",
|
| 1388 |
-
"files": [
|
| 1389 |
-
{
|
| 1390 |
-
"path": ".github/workflows/build.yml",
|
| 1391 |
-
"type": "workflow",
|
| 1392 |
-
"content": """name: Build and Push
|
| 1393 |
-
on: push
|
| 1394 |
-
|
| 1395 |
-
jobs:
|
| 1396 |
-
build:
|
| 1397 |
-
runs-on: ubuntu-latest
|
| 1398 |
-
steps:
|
| 1399 |
-
- uses: actions/checkout@v4
|
| 1400 |
-
|
| 1401 |
-
- name: Login to DockerHub
|
| 1402 |
-
run: echo $DOCKER_PASSWORD | docker login -u $DOCKER_USERNAME --password-stdin
|
| 1403 |
-
|
| 1404 |
-
- name: Build and push
|
| 1405 |
-
run: |
|
| 1406 |
-
docker build -t myuser/myapp:${{ github.sha }} .
|
| 1407 |
-
docker push myuser/myapp:${{ github.sha }}"""
|
| 1408 |
-
},
|
| 1409 |
-
{
|
| 1410 |
-
"path": "Dockerfile",
|
| 1411 |
-
"type": "dockerfile",
|
| 1412 |
-
"content": """FROM python:3.9-slim
|
| 1413 |
-
WORKDIR /app
|
| 1414 |
-
COPY . .
|
| 1415 |
-
RUN pip install -r requirements.txt
|
| 1416 |
-
CMD ["python", "app.py"]"""
|
| 1417 |
-
}
|
| 1418 |
-
],
|
| 1419 |
-
"error": {
|
| 1420 |
-
"phase": "workflow_parse",
|
| 1421 |
-
"message": "Error: Cannot perform an interactive login from a non TTY device",
|
| 1422 |
-
"exit_code": 1,
|
| 1423 |
-
"failed_step": "Login to DockerHub"
|
| 1424 |
-
},
|
| 1425 |
-
"expected_fixes": [
|
| 1426 |
-
{
|
| 1427 |
-
"file": ".github/workflows/build.yml",
|
| 1428 |
-
"type": "contains",
|
| 1429 |
-
"expected": "DOCKER_USERNAME: ${{ secrets.DOCKER_USERNAME }}",
|
| 1430 |
-
"hint": "Secrets must be passed via env block"
|
| 1431 |
-
},
|
| 1432 |
-
{
|
| 1433 |
-
"file": ".github/workflows/build.yml",
|
| 1434 |
-
"type": "contains",
|
| 1435 |
-
"expected": "DOCKER_PASSWORD: ${{ secrets.DOCKER_PASSWORD }}",
|
| 1436 |
-
"hint": "Both username and password need to be passed as env vars"
|
| 1437 |
-
}
|
| 1438 |
-
]
|
| 1439 |
-
},
|
| 1440 |
-
|
| 1441 |
-
# Scenario 2: Wrong checkout order
|
| 1442 |
-
{
|
| 1443 |
-
"id": "checkout_after_build",
|
| 1444 |
-
"files": [
|
| 1445 |
-
{
|
| 1446 |
-
"path": ".github/workflows/build.yml",
|
| 1447 |
-
"type": "workflow",
|
| 1448 |
-
"content": """name: Build
|
| 1449 |
-
on: push
|
| 1450 |
-
|
| 1451 |
-
jobs:
|
| 1452 |
-
build:
|
| 1453 |
-
runs-on: ubuntu-latest
|
| 1454 |
-
steps:
|
| 1455 |
-
- name: Build Docker image
|
| 1456 |
-
run: docker build -t myapp .
|
| 1457 |
-
|
| 1458 |
-
- uses: actions/checkout@v4
|
| 1459 |
-
|
| 1460 |
-
- name: Run tests
|
| 1461 |
-
run: docker run myapp pytest"""
|
| 1462 |
-
},
|
| 1463 |
-
{
|
| 1464 |
-
"path": "Dockerfile",
|
| 1465 |
-
"type": "dockerfile",
|
| 1466 |
-
"content": """FROM python:3.9
|
| 1467 |
-
WORKDIR /app
|
| 1468 |
-
COPY . .
|
| 1469 |
-
CMD ["python", "app.py"]"""
|
| 1470 |
-
}
|
| 1471 |
-
],
|
| 1472 |
-
"error": {
|
| 1473 |
-
"phase": "docker_build",
|
| 1474 |
-
"message": "unable to prepare context: unable to evaluate symlinks in Dockerfile path: lstat /home/runner/work/repo/repo/Dockerfile: no such file or directory",
|
| 1475 |
-
"exit_code": 1,
|
| 1476 |
-
"failed_step": "Build Docker image"
|
| 1477 |
-
},
|
| 1478 |
-
"expected_fixes": [
|
| 1479 |
-
{
|
| 1480 |
-
"file": ".github/workflows/build.yml",
|
| 1481 |
-
"type": "line_equals",
|
| 1482 |
-
"line": 8,
|
| 1483 |
-
"expected": " - uses: actions/checkout@v4",
|
| 1484 |
-
"hint": "Checkout must happen before any build commands"
|
| 1485 |
-
}
|
| 1486 |
-
]
|
| 1487 |
-
},
|
| 1488 |
-
|
| 1489 |
-
# Scenario 3: Missing Docker Buildx setup for multi-platform
|
| 1490 |
-
{
|
| 1491 |
-
"id": "missing_buildx",
|
| 1492 |
-
"files": [
|
| 1493 |
-
{
|
| 1494 |
-
"path": ".github/workflows/build.yml",
|
| 1495 |
-
"type": "workflow",
|
| 1496 |
-
"content": """name: Multi-platform Build
|
| 1497 |
-
on: push
|
| 1498 |
-
|
| 1499 |
-
jobs:
|
| 1500 |
-
build:
|
| 1501 |
-
runs-on: ubuntu-latest
|
| 1502 |
-
steps:
|
| 1503 |
-
- uses: actions/checkout@v4
|
| 1504 |
-
|
| 1505 |
-
- name: Build multi-platform
|
| 1506 |
-
uses: docker/build-push-action@v5
|
| 1507 |
-
with:
|
| 1508 |
-
context: .
|
| 1509 |
-
platforms: linux/amd64,linux/arm64
|
| 1510 |
-
push: false"""
|
| 1511 |
-
},
|
| 1512 |
-
{
|
| 1513 |
-
"path": "Dockerfile",
|
| 1514 |
-
"type": "dockerfile",
|
| 1515 |
-
"content": """FROM python:3.9-slim
|
| 1516 |
-
WORKDIR /app
|
| 1517 |
-
COPY . .
|
| 1518 |
-
CMD ["python", "app.py"]"""
|
| 1519 |
-
}
|
| 1520 |
-
],
|
| 1521 |
-
"error": {
|
| 1522 |
-
"phase": "docker_build",
|
| 1523 |
-
"message": "ERROR: Multi-platform build is not supported for the docker driver. Switch to a different driver, or turn on the containerd image store, and try again.",
|
| 1524 |
-
"exit_code": 1,
|
| 1525 |
-
"failed_step": "Build multi-platform"
|
| 1526 |
-
},
|
| 1527 |
-
"expected_fixes": [
|
| 1528 |
-
{
|
| 1529 |
-
"file": ".github/workflows/build.yml",
|
| 1530 |
-
"type": "contains",
|
| 1531 |
-
"expected": "docker/setup-buildx-action",
|
| 1532 |
-
"hint": "Multi-platform builds require Docker Buildx setup"
|
| 1533 |
-
}
|
| 1534 |
-
]
|
| 1535 |
-
},
|
| 1536 |
-
|
| 1537 |
-
# Scenario 4: Incorrect caching configuration
|
| 1538 |
-
{
|
| 1539 |
-
"id": "wrong_cache_config",
|
| 1540 |
-
"files": [
|
| 1541 |
-
{
|
| 1542 |
-
"path": ".github/workflows/build.yml",
|
| 1543 |
-
"type": "workflow",
|
| 1544 |
-
"content": """name: Build with Cache
|
| 1545 |
-
on: push
|
| 1546 |
-
|
| 1547 |
-
jobs:
|
| 1548 |
-
build:
|
| 1549 |
-
runs-on: ubuntu-latest
|
| 1550 |
-
steps:
|
| 1551 |
-
- uses: actions/checkout@v4
|
| 1552 |
-
|
| 1553 |
-
- name: Set up Docker Buildx
|
| 1554 |
-
uses: docker/setup-buildx-action@v3
|
| 1555 |
-
|
| 1556 |
-
- name: Build
|
| 1557 |
-
uses: docker/build-push-action@v5
|
| 1558 |
-
with:
|
| 1559 |
-
context: .
|
| 1560 |
-
push: false
|
| 1561 |
-
cache-from: type=gha
|
| 1562 |
-
cache-to: type=gha"""
|
| 1563 |
-
},
|
| 1564 |
-
{
|
| 1565 |
-
"path": "Dockerfile",
|
| 1566 |
-
"type": "dockerfile",
|
| 1567 |
-
"content": """FROM python:3.9-slim
|
| 1568 |
-
WORKDIR /app
|
| 1569 |
-
COPY . .
|
| 1570 |
-
CMD ["python", "app.py"]"""
|
| 1571 |
-
}
|
| 1572 |
-
],
|
| 1573 |
-
"error": {
|
| 1574 |
-
"phase": "docker_build",
|
| 1575 |
-
"message": "ERROR: cache export feature is currently not supported for docker driver. Please switch to a different driver",
|
| 1576 |
-
"exit_code": 1,
|
| 1577 |
-
"failed_step": "Build"
|
| 1578 |
-
},
|
| 1579 |
-
"expected_fixes": [
|
| 1580 |
-
{
|
| 1581 |
-
"file": ".github/workflows/build.yml",
|
| 1582 |
-
"type": "contains",
|
| 1583 |
-
"expected": "cache-to: type=gha,mode=max",
|
| 1584 |
-
"hint": "GHA cache needs mode=max for proper export"
|
| 1585 |
-
}
|
| 1586 |
-
]
|
| 1587 |
-
}
|
| 1588 |
-
]
|
| 1589 |
-
|
| 1590 |
-
def load_scenario(self, scenario_id: Optional[str] = None) -> Dict:
|
| 1591 |
-
if scenario_id:
|
| 1592 |
-
for s in self.SCENARIOS:
|
| 1593 |
-
if s["id"] == scenario_id:
|
| 1594 |
-
return s
|
| 1595 |
-
raise ValueError(f"Unknown scenario: {scenario_id}")
|
| 1596 |
-
return random.choice(self.SCENARIOS)
|
| 1597 |
-
```
|
| 1598 |
-
|
| 1599 |
-
## 6.4 Task 3: Multi-Stage Pipeline Failures (HARD)
|
| 1600 |
-
|
| 1601 |
-
```python
|
| 1602 |
-
"""
|
| 1603 |
-
Task 3: Multi-Stage Pipeline Failures
|
| 1604 |
-
Difficulty: HARD
|
| 1605 |
-
Focus: Complex interactions between multi-stage Docker builds and CI/CD
|
| 1606 |
-
|
| 1607 |
-
Agent must debug:
|
| 1608 |
-
- Multi-stage build artifact issues
|
| 1609 |
-
- Cross-job dependencies
|
| 1610 |
-
- Matrix build failures
|
| 1611 |
-
- Platform-specific issues
|
| 1612 |
-
"""
|
| 1613 |
-
|
| 1614 |
-
from typing import Dict, Optional
|
| 1615 |
-
import random
|
| 1616 |
-
from models import TaskDifficulty
|
| 1617 |
-
from .base import BaseTask
|
| 1618 |
-
|
| 1619 |
-
|
| 1620 |
-
class MultiStagePipelineTask(BaseTask):
|
| 1621 |
-
|
| 1622 |
-
NAME = "Multi-Stage Pipeline Failures"
|
| 1623 |
-
DESCRIPTION = "Debug complex multi-stage Docker builds with CI/CD integration"
|
| 1624 |
-
DIFFICULTY = TaskDifficulty.HARD
|
| 1625 |
-
AVAILABLE_SECRETS = ["DOCKER_USERNAME", "DOCKER_PASSWORD", "GITHUB_TOKEN", "NPM_TOKEN"]
|
| 1626 |
-
|
| 1627 |
-
SCENARIOS = [
|
| 1628 |
-
# Scenario 1: Multi-stage artifact path mismatch
|
| 1629 |
-
{
|
| 1630 |
-
"id": "artifact_path_mismatch",
|
| 1631 |
-
"files": [
|
| 1632 |
-
{
|
| 1633 |
-
"path": ".github/workflows/build.yml",
|
| 1634 |
-
"type": "workflow",
|
| 1635 |
-
"content": """name: Build and Deploy
|
| 1636 |
-
on: push
|
| 1637 |
-
|
| 1638 |
-
jobs:
|
| 1639 |
-
build:
|
| 1640 |
-
runs-on: ubuntu-latest
|
| 1641 |
-
steps:
|
| 1642 |
-
- uses: actions/checkout@v4
|
| 1643 |
-
|
| 1644 |
-
- name: Set up Docker Buildx
|
| 1645 |
-
uses: docker/setup-buildx-action@v3
|
| 1646 |
-
|
| 1647 |
-
- name: Build
|
| 1648 |
-
uses: docker/build-push-action@v5
|
| 1649 |
-
with:
|
| 1650 |
-
context: .
|
| 1651 |
-
push: false
|
| 1652 |
-
load: true
|
| 1653 |
-
tags: myapp:test
|
| 1654 |
-
|
| 1655 |
-
- name: Test
|
| 1656 |
-
run: |
|
| 1657 |
-
docker run myapp:test ls -la /usr/share/nginx/html
|
| 1658 |
-
docker run myapp:test curl -f http://localhost:80/ || exit 1"""
|
| 1659 |
-
},
|
| 1660 |
-
{
|
| 1661 |
-
"path": "Dockerfile",
|
| 1662 |
-
"type": "dockerfile",
|
| 1663 |
-
"content": """FROM node:18 AS builder
|
| 1664 |
-
WORKDIR /app
|
| 1665 |
-
COPY package*.json ./
|
| 1666 |
-
RUN npm ci
|
| 1667 |
-
COPY . .
|
| 1668 |
-
RUN npm run build
|
| 1669 |
-
|
| 1670 |
-
FROM nginx:alpine
|
| 1671 |
-
# Bug: React builds to 'build', not 'dist'
|
| 1672 |
-
COPY --from=builder /app/dist /usr/share/nginx/html
|
| 1673 |
-
EXPOSE 80
|
| 1674 |
-
CMD ["nginx", "-g", "daemon off;"]"""
|
| 1675 |
-
},
|
| 1676 |
-
{
|
| 1677 |
-
"path": "package.json",
|
| 1678 |
-
"type": "other",
|
| 1679 |
-
"content": """{
|
| 1680 |
-
"name": "frontend",
|
| 1681 |
-
"scripts": {
|
| 1682 |
-
"build": "react-scripts build"
|
| 1683 |
-
}
|
| 1684 |
-
}"""
|
| 1685 |
-
}
|
| 1686 |
-
],
|
| 1687 |
-
"error": {
|
| 1688 |
-
"phase": "docker_build",
|
| 1689 |
-
"message": "COPY failed: stat app/dist: file does not exist",
|
| 1690 |
-
"exit_code": 1,
|
| 1691 |
-
"failed_step": "Build",
|
| 1692 |
-
"line_hint": 10
|
| 1693 |
-
},
|
| 1694 |
-
"expected_fixes": [
|
| 1695 |
-
{
|
| 1696 |
-
"file": "Dockerfile",
|
| 1697 |
-
"type": "contains",
|
| 1698 |
-
"expected": "COPY --from=builder /app/build",
|
| 1699 |
-
"line": 10,
|
| 1700 |
-
"hint": "React's create-react-app outputs to 'build' directory, not 'dist'"
|
| 1701 |
-
}
|
| 1702 |
-
]
|
| 1703 |
-
},
|
| 1704 |
-
|
| 1705 |
-
# Scenario 2: Matrix + Platform ARG issue
|
| 1706 |
-
{
|
| 1707 |
-
"id": "matrix_platform_arg",
|
| 1708 |
-
"files": [
|
| 1709 |
-
{
|
| 1710 |
-
"path": ".github/workflows/build.yml",
|
| 1711 |
-
"type": "workflow",
|
| 1712 |
-
"content": """name: Multi-Platform Build
|
| 1713 |
-
on: push
|
| 1714 |
-
|
| 1715 |
-
jobs:
|
| 1716 |
-
build:
|
| 1717 |
-
runs-on: ubuntu-latest
|
| 1718 |
-
strategy:
|
| 1719 |
-
matrix:
|
| 1720 |
-
platform:
|
| 1721 |
-
- linux/amd64
|
| 1722 |
-
- linux/arm64
|
| 1723 |
-
steps:
|
| 1724 |
-
- uses: actions/checkout@v4
|
| 1725 |
-
|
| 1726 |
-
- name: Set up QEMU
|
| 1727 |
-
uses: docker/setup-qemu-action@v3
|
| 1728 |
-
|
| 1729 |
-
- name: Set up Docker Buildx
|
| 1730 |
-
uses: docker/setup-buildx-action@v3
|
| 1731 |
-
|
| 1732 |
-
- name: Build
|
| 1733 |
-
uses: docker/build-push-action@v5
|
| 1734 |
-
with:
|
| 1735 |
-
context: .
|
| 1736 |
-
platforms: ${{ matrix.platform }}
|
| 1737 |
-
push: false"""
|
| 1738 |
-
},
|
| 1739 |
-
{
|
| 1740 |
-
"path": "Dockerfile",
|
| 1741 |
-
"type": "dockerfile",
|
| 1742 |
-
"content": """FROM --platform=$BUILDPLATFORM node:18 AS builder
|
| 1743 |
-
WORKDIR /app
|
| 1744 |
-
COPY package*.json ./
|
| 1745 |
-
RUN npm ci
|
| 1746 |
-
COPY . .
|
| 1747 |
-
RUN npm run build
|
| 1748 |
-
|
| 1749 |
-
FROM --platform=$TARGETPLATFORM nginx:alpine
|
| 1750 |
-
COPY --from=builder /app/build /usr/share/nginx/html
|
| 1751 |
-
EXPOSE 80"""
|
| 1752 |
-
},
|
| 1753 |
-
{
|
| 1754 |
-
"path": "package.json",
|
| 1755 |
-
"type": "other",
|
| 1756 |
-
"content": '{"name": "app", "scripts": {"build": "echo build"}}'
|
| 1757 |
-
}
|
| 1758 |
-
],
|
| 1759 |
-
"error": {
|
| 1760 |
-
"phase": "docker_build",
|
| 1761 |
-
"message": "failed to solve: failed to parse platform : \"\" is not a valid platform",
|
| 1762 |
-
"exit_code": 1,
|
| 1763 |
-
"failed_step": "Build"
|
| 1764 |
-
},
|
| 1765 |
-
"expected_fixes": [
|
| 1766 |
-
{
|
| 1767 |
-
"file": "Dockerfile",
|
| 1768 |
-
"type": "contains",
|
| 1769 |
-
"expected": "ARG BUILDPLATFORM",
|
| 1770 |
-
"hint": "Platform ARGs must be declared before use"
|
| 1771 |
-
},
|
| 1772 |
-
{
|
| 1773 |
-
"file": "Dockerfile",
|
| 1774 |
-
"type": "contains",
|
| 1775 |
-
"expected": "ARG TARGETPLATFORM",
|
| 1776 |
-
"hint": "Both BUILDPLATFORM and TARGETPLATFORM need ARG declarations"
|
| 1777 |
-
}
|
| 1778 |
-
]
|
| 1779 |
-
},
|
| 1780 |
-
|
| 1781 |
-
# Scenario 3: Cross-job artifact dependency failure
|
| 1782 |
-
{
|
| 1783 |
-
"id": "cross_job_artifact",
|
| 1784 |
-
"files": [
|
| 1785 |
-
{
|
| 1786 |
-
"path": ".github/workflows/build.yml",
|
| 1787 |
-
"type": "workflow",
|
| 1788 |
-
"content": """name: Build and Test
|
| 1789 |
-
on: push
|
| 1790 |
-
|
| 1791 |
-
jobs:
|
| 1792 |
-
build:
|
| 1793 |
-
runs-on: ubuntu-latest
|
| 1794 |
-
steps:
|
| 1795 |
-
- uses: actions/checkout@v4
|
| 1796 |
-
|
| 1797 |
-
- name: Build
|
| 1798 |
-
run: |
|
| 1799 |
-
docker build -t myapp:${{ github.sha }} .
|
| 1800 |
-
docker save myapp:${{ github.sha }} > image.tar
|
| 1801 |
-
|
| 1802 |
-
- uses: actions/upload-artifact@v4
|
| 1803 |
-
with:
|
| 1804 |
-
name: docker-image
|
| 1805 |
-
path: image.tar
|
| 1806 |
-
|
| 1807 |
-
test:
|
| 1808 |
-
runs-on: ubuntu-latest
|
| 1809 |
-
steps:
|
| 1810 |
-
- name: Download image
|
| 1811 |
-
uses: actions/download-artifact@v4
|
| 1812 |
-
with:
|
| 1813 |
-
name: docker-image
|
| 1814 |
-
|
| 1815 |
-
- name: Load and test
|
| 1816 |
-
run: |
|
| 1817 |
-
docker load < image.tar
|
| 1818 |
-
docker run myapp:${{ github.sha }} pytest"""
|
| 1819 |
-
},
|
| 1820 |
-
{
|
| 1821 |
-
"path": "Dockerfile",
|
| 1822 |
-
"type": "dockerfile",
|
| 1823 |
-
"content": """FROM python:3.9
|
| 1824 |
-
WORKDIR /app
|
| 1825 |
-
COPY . .
|
| 1826 |
-
RUN pip install pytest
|
| 1827 |
-
CMD ["python", "app.py"]"""
|
| 1828 |
-
}
|
| 1829 |
-
],
|
| 1830 |
-
"error": {
|
| 1831 |
-
"phase": "workflow_parse",
|
| 1832 |
-
"message": "The workflow is not valid. .github/workflows/build.yml (Line: 22, Col: 5): Job 'test' depends on unknown job 'build'",
|
| 1833 |
-
"exit_code": 1
|
| 1834 |
-
},
|
| 1835 |
-
"expected_fixes": [
|
| 1836 |
-
{
|
| 1837 |
-
"file": ".github/workflows/build.yml",
|
| 1838 |
-
"type": "contains",
|
| 1839 |
-
"expected": "needs: build",
|
| 1840 |
-
"hint": "Test job needs to declare dependency on build job"
|
| 1841 |
-
}
|
| 1842 |
-
]
|
| 1843 |
-
},
|
| 1844 |
-
|
| 1845 |
-
# Scenario 4: Multiple interacting issues
|
| 1846 |
-
{
|
| 1847 |
-
"id": "multiple_issues",
|
| 1848 |
-
"files": [
|
| 1849 |
-
{
|
| 1850 |
-
"path": ".github/workflows/build.yml",
|
| 1851 |
-
"type": "workflow",
|
| 1852 |
-
"content": """name: Full Pipeline
|
| 1853 |
-
on: push
|
| 1854 |
-
|
| 1855 |
-
jobs:
|
| 1856 |
-
build:
|
| 1857 |
-
runs-on: ubuntu-latest
|
| 1858 |
-
steps:
|
| 1859 |
-
- uses: actions/checkout@v4
|
| 1860 |
-
|
| 1861 |
-
- name: Login
|
| 1862 |
-
run: echo $DOCKER_PASSWORD | docker login -u $DOCKER_USERNAME --password-stdin
|
| 1863 |
-
|
| 1864 |
-
- name: Build and Push
|
| 1865 |
-
run: |
|
| 1866 |
-
docker build -t myuser/myapp:latest .
|
| 1867 |
-
docker push myuser/myapp:latest"""
|
| 1868 |
-
},
|
| 1869 |
-
{
|
| 1870 |
-
"path": "Dockerfile",
|
| 1871 |
-
"type": "dockerfile",
|
| 1872 |
-
"content": """FROM python:3.9-slim AS builder
|
| 1873 |
-
WORKDIR /app
|
| 1874 |
-
COPY requirments.txt .
|
| 1875 |
-
RUN pip install -r requirements.txt
|
| 1876 |
-
COPY . .
|
| 1877 |
-
|
| 1878 |
-
FROM python:3.9-slim
|
| 1879 |
-
WORKDIR /app
|
| 1880 |
-
COPY --from=builder /app .
|
| 1881 |
-
CMD ["python", "app.py"]"""
|
| 1882 |
-
},
|
| 1883 |
-
{
|
| 1884 |
-
"path": "requirements.txt",
|
| 1885 |
-
"type": "requirements",
|
| 1886 |
-
"content": "flask==2.0.0"
|
| 1887 |
-
}
|
| 1888 |
-
],
|
| 1889 |
-
"error": {
|
| 1890 |
-
"phase": "docker_build",
|
| 1891 |
-
"message": "COPY failed: file not found in build context: requirments.txt\nAdditionally: Error: Cannot perform an interactive login from a non TTY device",
|
| 1892 |
-
"exit_code": 1
|
| 1893 |
-
},
|
| 1894 |
-
"expected_fixes": [
|
| 1895 |
-
{
|
| 1896 |
-
"file": "Dockerfile",
|
| 1897 |
-
"type": "contains",
|
| 1898 |
-
"expected": "COPY requirements.txt",
|
| 1899 |
-
"hint": "Fix typo in requirements filename"
|
| 1900 |
-
},
|
| 1901 |
-
{
|
| 1902 |
-
"file": ".github/workflows/build.yml",
|
| 1903 |
-
"type": "contains",
|
| 1904 |
-
"expected": "DOCKER_USERNAME: ${{ secrets.DOCKER_USERNAME }}",
|
| 1905 |
-
"hint": "Add env block for secrets"
|
| 1906 |
-
},
|
| 1907 |
-
{
|
| 1908 |
-
"file": ".github/workflows/build.yml",
|
| 1909 |
-
"type": "contains",
|
| 1910 |
-
"expected": "DOCKER_PASSWORD: ${{ secrets.DOCKER_PASSWORD }}",
|
| 1911 |
-
"hint": "Add password to env block"
|
| 1912 |
-
}
|
| 1913 |
-
]
|
| 1914 |
-
}
|
| 1915 |
-
]
|
| 1916 |
-
|
| 1917 |
-
def load_scenario(self, scenario_id: Optional[str] = None) -> Dict:
|
| 1918 |
-
if scenario_id:
|
| 1919 |
-
for s in self.SCENARIOS:
|
| 1920 |
-
if s["id"] == scenario_id:
|
| 1921 |
-
return s
|
| 1922 |
-
raise ValueError(f"Unknown scenario: {scenario_id}")
|
| 1923 |
-
return random.choice(self.SCENARIOS)
|
| 1924 |
-
```
|
| 1925 |
-
|
| 1926 |
-
---
|
| 1927 |
-
|
| 1928 |
-
# 7. GRADER IMPLEMENTATION
|
| 1929 |
-
|
| 1930 |
-
## 7.1 Grader Logic (server/graders/__init__.py)
|
| 1931 |
-
|
| 1932 |
-
```python
|
| 1933 |
-
"""
|
| 1934 |
-
Deterministic graders for CI/CD debugging tasks.
|
| 1935 |
-
|
| 1936 |
-
Grading Philosophy:
|
| 1937 |
-
- 100% deterministic (same input = same output)
|
| 1938 |
-
- Dynamic scoring based on what the agent actually fixes
|
| 1939 |
-
- Granular partial credit (completion, action quality, efficiency)
|
| 1940 |
-
- Score breakdown for transparency
|
| 1941 |
-
- Penalties for hints used
|
| 1942 |
-
"""
|
| 1943 |
-
|
| 1944 |
-
from typing import List, Dict, Any
|
| 1945 |
-
from models import GraderResult, TaskDifficulty
|
| 1946 |
-
from tasks.task_registry import TASK_REGISTRY
|
| 1947 |
-
|
| 1948 |
-
|
| 1949 |
-
def run_grader(task_id: str, trajectory: List[Dict[str, Any]]) -> GraderResult:
|
| 1950 |
-
"""
|
| 1951 |
-
Grade a trajectory for a given task.
|
| 1952 |
-
|
| 1953 |
-
Scoring breakdown:
|
| 1954 |
-
- Completion: proportion of issues fixed (dominant component)
|
| 1955 |
-
- Action quality: valid targeted edit actions (listed here but not computed below)
|
| 1956 |
-
- Full solution bonus: bonus if all issues are fixed
|
| 1957 |
-
- Efficiency: bonus for fewer extra steps
|
| 1958 |
-
- Hint penalty: -0.05 per hint used
|
| 1959 |
-
"""
|
| 1960 |
-
|
| 1961 |
-
if task_id not in TASK_REGISTRY:
|
| 1962 |
-
raise ValueError(f"Unknown task: {task_id}")
|
| 1963 |
-
|
| 1964 |
-
task = TASK_REGISTRY[task_id]()
|
| 1965 |
-
|
| 1966 |
-
# Extract final state
|
| 1967 |
-
if not trajectory:
|
| 1968 |
-
return GraderResult(
|
| 1969 |
-
task_id=task_id,
|
| 1970 |
-
score=0.0,
|
| 1971 |
-
breakdown={"error": "Empty trajectory"},
|
| 1972 |
-
feedback="No actions taken",
|
| 1973 |
-
steps_taken=0,
|
| 1974 |
-
hints_used=0
|
| 1975 |
-
)
|
| 1976 |
-
|
| 1977 |
-
final_step = trajectory[-1]
|
| 1978 |
-
steps_taken = len(trajectory)
|
| 1979 |
-
|
| 1980 |
-
# Count hints used
|
| 1981 |
-
hints_used = sum(
|
| 1982 |
-
1 for step in trajectory
|
| 1983 |
-
if step.get("action", {}).get("action_type") == "request_hint"
|
| 1984 |
-
)
|
| 1985 |
-
|
| 1986 |
-
# Calculate score components
|
| 1987 |
-
score = 0.0
|
| 1988 |
-
breakdown = {}
|
| 1989 |
-
|
| 1990 |
-
# Get issues fixed from final observation
|
| 1991 |
-
issues_fixed = final_step.get("info", {}).get("issues_fixed", 0)
|
| 1992 |
-
issues_total = final_step.get("info", {}).get("issues_total", 1)
|
| 1993 |
-
|
| 1994 |
-
# Per-issue credit (0.6 total for fixing all)
|
| 1995 |
-
fix_ratio = issues_fixed / issues_total if issues_total > 0 else 0
|
| 1996 |
-
fix_score = 0.6 * fix_ratio
|
| 1997 |
-
breakdown["issues_fixed"] = fix_score
|
| 1998 |
-
score += fix_score
|
| 1999 |
-
|
| 2000 |
-
# Full solution bonus (0.2)
|
| 2001 |
-
if issues_fixed == issues_total:
|
| 2002 |
-
breakdown["complete_solution"] = 0.2
|
| 2003 |
-
score += 0.2
|
| 2004 |
-
else:
|
| 2005 |
-
breakdown["complete_solution"] = 0.0
|
| 2006 |
-
|
| 2007 |
-
# Efficiency bonus (0.2 max)
|
| 2008 |
-
# Optimal: 1 step per issue. Penalty for extra steps.
|
| 2009 |
-
optimal_steps = issues_total
|
| 2010 |
-
if steps_taken <= optimal_steps:
|
| 2011 |
-
efficiency_score = 0.2
|
| 2012 |
-
else:
|
| 2013 |
-
# Lose 0.02 per extra step, minimum 0
|
| 2014 |
-
extra_steps = steps_taken - optimal_steps
|
| 2015 |
-
efficiency_score = max(0, 0.2 - (extra_steps * 0.02))
|
| 2016 |
-
breakdown["efficiency"] = efficiency_score
|
| 2017 |
-
score += efficiency_score
|
| 2018 |
-
|
| 2019 |
-
# Hint penalty
|
| 2020 |
-
hint_penalty = hints_used * 0.05
|
| 2021 |
-
breakdown["hint_penalty"] = -hint_penalty
|
| 2022 |
-
score -= hint_penalty
|
| 2023 |
-
|
| 2024 |
-
# Clamp to [0, 1]
|
| 2025 |
-
score = max(0.0, min(1.0, score))
|
| 2026 |
-
|
| 2027 |
-
# Generate feedback
|
| 2028 |
-
if score >= 0.9:
|
| 2029 |
-
feedback = "Excellent! All issues fixed efficiently."
|
| 2030 |
-
elif score >= 0.7:
|
| 2031 |
-
feedback = "Good job! Most issues fixed."
|
| 2032 |
-
elif score >= 0.5:
|
| 2033 |
-
feedback = "Partial success. Some issues remain."
|
| 2034 |
-
elif score >= 0.3:
|
| 2035 |
-
feedback = "Limited progress. Review the error messages carefully."
|
| 2036 |
-
else:
|
| 2037 |
-
feedback = "Needs improvement. Try analyzing the error phase first."
|
| 2038 |
-
|
| 2039 |
-
return GraderResult(
|
| 2040 |
-
task_id=task_id,
|
| 2041 |
-
score=round(score, 3),
|
| 2042 |
-
breakdown={k: round(v, 3) for k, v in breakdown.items()},
|
| 2043 |
-
feedback=feedback,
|
| 2044 |
-
steps_taken=steps_taken,
|
| 2045 |
-
hints_used=hints_used
|
| 2046 |
-
)
|
| 2047 |
-
```
|
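To make the weights in `run_grader` easier to check by hand, here is a minimal sketch that repeats the same arithmetic on a trajectory summary instead of a full trajectory. `score_breakdown` is a hypothetical name introduced for illustration; the constants (0.6, 0.2, 0.2, 0.02, 0.05) are taken from the grader above.

```python
from typing import Dict


def score_breakdown(issues_fixed: int, issues_total: int,
                    steps_taken: int, hints_used: int) -> Dict[str, float]:
    """Repeat the run_grader arithmetic for one trajectory summary."""
    fix_ratio = issues_fixed / issues_total if issues_total > 0 else 0.0
    extra_steps = max(0, steps_taken - issues_total)  # optimal: 1 step per issue
    parts = {
        "issues_fixed": 0.6 * fix_ratio,
        "complete_solution": 0.2 if issues_fixed == issues_total else 0.0,
        "efficiency": max(0.0, 0.2 - 0.02 * extra_steps),
        "hint_penalty": -0.05 * hints_used,
    }
    total = max(0.0, min(1.0, sum(parts.values())))  # clamp to [0, 1]
    result = {name: round(value, 3) for name, value in parts.items()}
    result["score"] = round(total, 3)
    return result


print(score_breakdown(2, 2, 2, 0))  # all issues fixed in the optimal step count
print(score_breakdown(1, 2, 5, 1))  # half fixed, three extra steps, one hint
print(score_breakdown(0, 2, 1, 0))  # nothing fixed, single step
```

With these inputs the totals come out to 1.0, 0.39 and 0.2. The last case shows that, as written, a short trajectory that fixes nothing still collects the efficiency component.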
| 2048 |
-
|
| 2049 |
-
---
|
| 2050 |
-
|
| 2051 |
-
# 8. REWARD FUNCTION DESIGN
|
| 2052 |
-
|
| 2053 |
-
## Dense Reward Strategy
|
| 2054 |
-
|
| 2055 |
-
```python
|
| 2056 |
-
"""
|
| 2057 |
-
Reward Function Design
|
| 2058 |
-
|
| 2059 |
-
Properties:
|
| 2060 |
-
1. Dense (signal at every step, not just end)
|
| 2061 |
-
2. Shaped (guides toward solution)
|
| 2062 |
-
3. Intended range [0, 1] per step (not enforced by the code below; penalties can make a step negative)
|
| 2063 |
-
4. Cumulative episode reward can exceed 1.0
|
| 2064 |
-
|
| 2065 |
-
Reward Components:
|
| 2066 |
-
- Syntax validation: +0.1 when file becomes syntactically valid
|
| 2067 |
-
- Issue identification: +0.1 when agent actions target correct file/line
|
| 2068 |
-
- Partial fix: +0.2 when fix is partially correct
|
| 2069 |
-
- Full fix: +0.3 when issue is fully resolved
|
| 2070 |
-
- Submit bonus: +0.0 to +0.5 based on final validation
|
| 2071 |
-
- Hint penalty: -0.05 per hint
|
| 2072 |
-
|
| 2073 |
-
This creates a curriculum:
|
| 2074 |
-
- Agent learns to identify issues first (+0.1)
|
| 2075 |
-
- Then learns to fix them (+0.2 to +0.3)
|
| 2076 |
-
- Finally learns to validate (+0.0 to +0.5)
|
| 2077 |
-
"""
|
| 2078 |
-
|
| 2079 |
-
def calculate_step_reward(
|
| 2080 |
-
prev_state: EnvironmentState,
|
| 2081 |
-
action: Action,
|
| 2082 |
-
new_state: EnvironmentState
|
| 2083 |
-
) -> float:
|
| 2084 |
-
"""Calculate reward for a single step."""
|
| 2085 |
-
|
| 2086 |
-
reward = 0.0
|
| 2087 |
-
|
| 2088 |
-
# 1. Syntax validation reward
|
| 2089 |
-
for file_path in new_state.files:
|
| 2090 |
-
prev_valid = prev_state.file_valid.get(file_path, False)
|
| 2091 |
-
new_valid = new_state.file_valid.get(file_path, False)
|
| 2092 |
-
if not prev_valid and new_valid:
|
| 2093 |
-
reward += 0.1 # File became valid
|
| 2094 |
-
|
| 2095 |
-
# 2. Issue targeting reward
|
| 2096 |
-
if action.edits:
|
| 2097 |
-
for edit in action.edits:
|
| 2098 |
-
if is_correct_target(edit, new_state.expected_fixes):
|
| 2099 |
-
reward += 0.1 # Targeting correct area
|
| 2100 |
-
|
| 2101 |
-
# 3. Fix progress reward
|
| 2102 |
-
new_fixes = new_state.issues_fixed - prev_state.issues_fixed
|
| 2103 |
-
if new_fixes > 0:
|
| 2104 |
-
reward += 0.3 * new_fixes # Per issue fixed
|
| 2105 |
-
|
| 2106 |
-
# 4. Submit reward (calculated in _handle_submit)
|
| 2107 |
-
if action.action_type == ActionType.SUBMIT:
|
| 2108 |
-
# This is handled separately in _handle_submit
|
| 2109 |
-
pass
|
| 2110 |
-
|
| 2111 |
-
# 5. Hint penalty
|
| 2112 |
-
if action.action_type == ActionType.REQUEST_HINT:
|
| 2113 |
-
reward -= 0.05
|
| 2114 |
-
|
| 2115 |
-
# 6. Invalid action penalty
|
| 2116 |
-
if not new_state.last_action_success:
|
| 2117 |
-
reward -= 0.02 # Small penalty for failed actions
|
| 2118 |
-
|
| 2119 |
-
return reward
|
| 2120 |
-
```
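To make the arithmetic of the shaping terms concrete, here is a minimal, self-contained sketch of the same additive scheme. It is not the project's implementation: `StepSnapshot` is a hypothetical stand-in for `EnvironmentState`, and `correct_targets` and `hint_requested` are illustrative parameters that replace the `Action` object and the `is_correct_target` helper used in the plan.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class StepSnapshot:
    """Hypothetical stand-in for EnvironmentState: only the fields the reward reads."""
    file_valid: Dict[str, bool] = field(default_factory=dict)
    issues_fixed: int = 0
    last_action_success: bool = True


def shaped_reward(prev: StepSnapshot, new: StepSnapshot,
                  correct_targets: int = 0, hint_requested: bool = False) -> float:
    """Sum the per-step shaping terms listed in the docstring above."""
    reward = 0.0
    # +0.1 for every file that went from invalid to valid in this step
    for path, valid in new.file_valid.items():
        if valid and not prev.file_valid.get(path, False):
            reward += 0.1
    # +0.1 for every edit that targeted a correct area (assumed to be counted by the caller)
    reward += 0.1 * correct_targets
    # +0.3 for every newly fixed issue
    reward += 0.3 * max(0, new.issues_fixed - prev.issues_fixed)
    # -0.05 for a hint request, -0.02 for a failed action
    if hint_requested:
        reward -= 0.05
    if not new.last_action_success:
        reward -= 0.02
    return round(reward, 4)


before = StepSnapshot(file_valid={"Dockerfile": False}, issues_fixed=0)
after = StepSnapshot(file_valid={"Dockerfile": True}, issues_fixed=1)
print(shaped_reward(before, after, correct_targets=1))  # 0.1 + 0.1 + 0.3
print(shaped_reward(before, before, hint_requested=True))  # hint penalty only
```

Under these assumptions, a single well-targeted edit that both repairs the syntax and resolves one issue earns 0.5 for the step, while a step that only requests a hint comes out negative.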
|
| 2121 |
-
|
| 2122 |
-
---
|
| 2123 |
-
|
| 2124 |
-
# 9. BASELINE INFERENCE SCRIPT
|
| 2125 |
-
|
| 2126 |
-
## inference.py (Root Directory)
|
| 2127 |
-
|
| 2128 |
-
```python
|
| 2129 |
-
"""
|
| 2130 |
-
Baseline Inference Script for CI/CD Debug Environment
|
| 2131 |
-
======================================================
|
| 2132 |
-
|
| 2133 |
-
MANDATORY REQUIREMENTS:
|
| 2134 |
-
- Uses OpenAI Client for all LLM calls
|
| 2135 |
-
- Reads API_BASE_URL, MODEL_NAME, HF_TOKEN from environment
|
| 2136 |
-
- Named 'inference.py' in root directory
|
| 2137 |
-
- Runtime < 20 minutes
|
| 2138 |
-
- Works on vcpu=2, memory=8gb
|
| 2139 |
-
|
| 2140 |
-
This baseline demonstrates a simple but effective approach:
|
| 2141 |
-
1. Parse the error message to identify error type
|
| 2142 |
-
2. Locate the problematic file and line
|
| 2143 |
-
3. Apply appropriate fix based on error pattern
|
| 2144 |
-
4. Submit and verify
|
| 2145 |
-
"""
|
| 2146 |
-
|
| 2147 |
-
import os
|
| 2148 |
-
import re
|
| 2149 |
-
import json
|
| 2150 |
-
import time
|
| 2151 |
-
from typing import List, Dict, Any, Optional
|
| 2152 |
-
|
| 2153 |
-
import requests
|
| 2154 |
-
from openai import OpenAI
|
| 2155 |
-
|
| 2156 |
-
# ============== CONFIGURATION ==============
|
| 2157 |
-
|
| 2158 |
-
API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
|
| 2159 |
-
API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
|
| 2160 |
-
MODEL_NAME = os.getenv("MODEL_NAME", "meta-llama/Llama-3.1-70B-Instruct")
|
| 2161 |
-
ENV_URL = os.getenv("ENV_URL", "http://localhost:7860")
|
| 2162 |
-
|
| 2163 |
-
MAX_STEPS = 10
|
| 2164 |
-
TEMPERATURE = 0.2
|
| 2165 |
-
MAX_TOKENS = 500
|
| 2166 |
-
|
| 2167 |
-
# ============== SYSTEM PROMPT ==============
|
| 2168 |
-
|
| 2169 |
-
SYSTEM_PROMPT = """You are an expert DevOps engineer debugging CI/CD infrastructure.
|
| 2170 |
-
|
| 2171 |
-
You are given:
|
| 2172 |
-
1. Error message from a failed Docker build or GitHub Actions workflow
|
| 2173 |
-
2. The relevant configuration files (Dockerfile, workflow YAML)
|
| 2174 |
-
3. Available actions to fix the issues
|
| 2175 |
-
|
| 2176 |
-
Your task is to identify and fix the issues. Common problems include:
|
| 2177 |
-
- Typos in filenames (requirments.txt vs requirements.txt)
|
| 2178 |
-
- Missing environment variable references for secrets
|
| 2179 |
-
- Wrong file paths in COPY commands
|
| 2180 |
-
- Missing steps (checkout before build, buildx for multi-platform)
|
| 2181 |
-
- Invalid syntax in YAML or Dockerfile
|
| 2182 |
-
|
| 2183 |
-
Respond with a JSON object containing your action:
|
| 2184 |
-
{
|
| 2185 |
-
"action_type": "replace_line" | "add_line" | "edit_file" | "submit" | "request_hint",
|
| 2186 |
-
"edits": [
|
| 2187 |
-
{
|
| 2188 |
-
"file_path": "path/to/file",
|
| 2189 |
-
"line_number": 5,
|
| 2190 |
-
"old_content": "old text",
|
| 2191 |
-
"new_content": "new text"
|
| 2192 |
-
}
|
| 2193 |
-
],
|
| 2194 |
-
"reasoning": "Brief explanation of the fix"
|
| 2195 |
-
}
|
| 2196 |
-
|
| 2197 |
-
When you believe all issues are fixed, use action_type: "submit".
|
| 2198 |
-
Be precise and fix one issue at a time."""
|
| 2199 |
-
|
| 2200 |
-
# ============== HELPER FUNCTIONS ==============
|
| 2201 |
-
|
| 2202 |
-
def build_user_prompt(observation: Dict) -> str:
|
| 2203 |
-
"""Build the user prompt from observation."""
|
| 2204 |
-
|
| 2205 |
-
files_str = ""
|
| 2206 |
-
for f in observation.get("files", []):
|
| 2207 |
-
content = f["content"]
|
| 2208 |
-
# Add line numbers
|
| 2209 |
-
lines = content.split("\n")
|
| 2210 |
-
numbered = "\n".join(f"{i+1:3}: {line}" for i, line in enumerate(lines))
|
| 2211 |
-
files_str += f"\n### {f['path']}\n```\n{numbered}\n```\n"
|
| 2212 |
-
|
| 2213 |
-
error = observation.get("error", {})
|
| 2214 |
-
|
| 2215 |
-
prompt = f"""## Current State
|
| 2216 |
-
Task: {observation.get('task_description', 'Fix CI/CD issues')}
|
| 2217 |
-
Difficulty: {observation.get('difficulty', 'unknown')}
|
| 2218 |
-
Step: {observation.get('step_number', 0)}/{observation.get('max_steps', 10)}
|
| 2219 |
-
Issues Fixed: {observation.get('issues_fixed', 0)}/{observation.get('total_issues', '?')}
|
| 2220 |
-
|
| 2221 |
-
## Error Information
|
| 2222 |
-
Phase: {error.get('phase', 'unknown')}
|
| 2223 |
-
Message: {error.get('error_message', 'No error message')}
|
| 2224 |
-
Failed Step: {error.get('failed_step', 'unknown')}
|
| 2225 |
-
Line Hint: {error.get('line_hint', 'none')}
|
| 2226 |
-
|
| 2227 |
-
## Files
|
| 2228 |
-
{files_str}
|
| 2229 |
-
|
| 2230 |
-
## Last Action Feedback
|
| 2231 |
-
{observation.get('last_action_feedback', 'None')}
|
| 2232 |
-
|
| 2233 |
-
Analyze the error and provide your fix as JSON."""
|
| 2234 |
-
|
| 2235 |
-
return prompt
|
| 2236 |
-
|
| 2237 |
-
|
| 2238 |
-
def parse_model_response(response_text: str) -> Dict:
|
| 2239 |
-
"""Parse the model's JSON response."""
|
| 2240 |
-
|
| 2241 |
-
# Try to extract JSON from response
|
| 2242 |
-
try:
|
| 2243 |
-
# Look for JSON block
|
| 2244 |
-
json_match = re.search(r'\{.*\}', response_text, re.DOTALL)
|
| 2245 |
-
if json_match:
|
| 2246 |
-
return json.loads(json_match.group())
|
| 2247 |
-
except json.JSONDecodeError:
|
| 2248 |
-
pass
|
| 2249 |
-
|
| 2250 |
-
# Fallback: try to parse whole response
|
| 2251 |
-
try:
|
| 2252 |
-
return json.loads(response_text)
|
| 2253 |
-
except json.JSONDecodeError:
|
| 2254 |
-
pass
|
| 2255 |
-
|
| 2256 |
-
# Default action
|
| 2257 |
-
return {
|
| 2258 |
-
"action_type": "request_hint",
|
| 2259 |
-
"reasoning": "Could not parse response"
|
| 2260 |
-
}
|
| 2261 |
-
|
| 2262 |
-
|
| 2263 |
-
def call_environment(endpoint: str, method: str = "GET", data: Optional[Dict] = None) -> Dict:
|
| 2264 |
-
"""Make a request to the environment."""
|
| 2265 |
-
|
| 2266 |
-
url = f"{ENV_URL}{endpoint}"
|
| 2267 |
-
|
| 2268 |
-
if method == "GET":
|
| 2269 |
-
response = requests.get(url, timeout=30)
|
| 2270 |
-
else:
|
| 2271 |
-
response = requests.post(url, json=data or {}, timeout=30)
|
| 2272 |
-
|
| 2273 |
-
response.raise_for_status()
|
| 2274 |
-
return response.json()
|
| 2275 |
-
|
| 2276 |
-
|
| 2277 |
-
# ============== MAIN INFERENCE LOOP ==============
|
| 2278 |
-
|
| 2279 |
-
def run_episode(task_id: Optional[str] = None) -> Dict:
|
| 2280 |
-
"""Run a single episode."""
|
| 2281 |
-
|
| 2282 |
-
client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
|
| 2283 |
-
|
| 2284 |
-
# Reset environment
|
| 2285 |
-
reset_response = call_environment("/reset", "POST", {"task_id": task_id})
|
| 2286 |
-
observation = reset_response["observation"]
|
| 2287 |
-
|
| 2288 |
-
print(f"Starting episode: {observation['task_id']}")
|
| 2289 |
-
print(f"Task: {observation['task_description']}")
|
| 2290 |
-
print(f"Difficulty: {observation['difficulty']}")
|
| 2291 |
-
|
| 2292 |
-
trajectory = []
|
| 2293 |
-
episode_reward = 0.0
|
| 2294 |
-
|
| 2295 |
-
for step in range(1, MAX_STEPS + 1):
|
| 2296 |
-
# Build prompt
|
| 2297 |
-
user_prompt = build_user_prompt(observation)
|
| 2298 |
-
|
| 2299 |
-
# Call LLM
|
| 2300 |
-
try:
|
| 2301 |
-
completion = client.chat.completions.create(
|
| 2302 |
-
model=MODEL_NAME,
|
| 2303 |
-
messages=[
|
| 2304 |
-
{"role": "system", "content": SYSTEM_PROMPT},
|
| 2305 |
-
{"role": "user", "content": user_prompt}
|
| 2306 |
-
],
|
| 2307 |
-
temperature=TEMPERATURE,
|
| 2308 |
-
max_tokens=MAX_TOKENS
|
| 2309 |
-
)
|
| 2310 |
-
response_text = completion.choices[0].message.content or ""
|
| 2311 |
-
except Exception as e:
|
| 2312 |
-
print(f"LLM error: {e}")
|
| 2313 |
-
response_text = '{"action_type": "request_hint"}'
|
| 2314 |
-
|
| 2315 |
-
# Parse action
|
| 2316 |
-
action = parse_model_response(response_text)
|
| 2317 |
-
print(f"Step {step}: {action.get('action_type')} - {action.get('reasoning', '')[:50]}")
|
| 2318 |
-
|
| 2319 |
-
# Take step
|
| 2320 |
-
step_response = call_environment("/step", "POST", {"action": action})
|
| 2321 |
-
|
| 2322 |
-
observation = step_response["observation"]
|
| 2323 |
-
reward = step_response["reward"]
|
| 2324 |
-
done = step_response["done"]
|
| 2325 |
-
info = step_response["info"]
|
| 2326 |
-
|
| 2327 |
-
episode_reward += reward
|
| 2328 |
-
|
| 2329 |
-
trajectory.append({
|
| 2330 |
-
"step": step,
|
| 2331 |
-
"action": action,
|
| 2332 |
-
"reward": reward,
|
| 2333 |
-
"done": done,
|
| 2334 |
-
"info": info
|
| 2335 |
-
})
|
| 2336 |
-
|
| 2337 |
-
print(f" Reward: {reward:.3f} | Done: {done} | Fixed: {info.get('issues_fixed', 0)}/{info.get('issues_total', '?')}")
|
| 2338 |
-
|
| 2339 |
-
if done:
|
| 2340 |
-
break
|
| 2341 |
-
|
| 2342 |
-
# Get final grading
|
| 2343 |
-
grader_response = call_environment("/grader", "POST", {
|
| 2344 |
-
"task_id": observation["task_id"],
|
| 2345 |
-
"trajectory": trajectory
|
| 2346 |
-
})
|
| 2347 |
-
|
| 2348 |
-
result = grader_response["result"]
|
| 2349 |
-
print(f"\nFinal Score: {result['score']:.3f}")
|
| 2350 |
-
print(f"Feedback: {result['feedback']}")
|
| 2351 |
-
|
| 2352 |
-
return result
|
| 2353 |
-
|
| 2354 |
-
|
| 2355 |
-
def main():
|
| 2356 |
-
"""Run baseline on all tasks."""
|
| 2357 |
-
|
| 2358 |
-
print("=" * 60)
|
| 2359 |
-
print("CI/CD Debug Environment - Baseline Inference")
|
| 2360 |
-
print("=" * 60)
|
| 2361 |
-
print(f"API: {API_BASE_URL}")
|
| 2362 |
-
print(f"Model: {MODEL_NAME}")
|
| 2363 |
-
print(f"Environment: {ENV_URL}")
|
| 2364 |
-
print()
|
| 2365 |
-
|
| 2366 |
-
# Get available tasks
|
| 2367 |
-
info = call_environment("/info")
|
| 2368 |
-
tasks = info["tasks"]
|
| 2369 |
-
|
| 2370 |
-
results = []
|
| 2371 |
-
|
| 2372 |
-
for task in tasks:
|
| 2373 |
-
print(f"\n{'='*60}")
|
| 2374 |
-
print(f"Task: {task['name']} ({task['difficulty']})")
|
| 2375 |
-
print("=" * 60)
|
| 2376 |
-
|
| 2377 |
-
result = run_episode(task["id"])
|
| 2378 |
-
results.append(result)
|
| 2379 |
-
|
| 2380 |
-
time.sleep(1) # Rate limiting
|
| 2381 |
-
|
| 2382 |
-
# Summary
|
| 2383 |
-
print("\n" + "=" * 60)
|
| 2384 |
-
print("SUMMARY")
|
| 2385 |
-
print("=" * 60)
|
| 2386 |
-
|
| 2387 |
-
total_score = 0
|
| 2388 |
-
for task, result in zip(tasks, results):
|
| 2389 |
-
print(f"{task['name']}: {result['score']:.3f}")
|
| 2390 |
-
total_score += result["score"]
|
| 2391 |
-
|
| 2392 |
-
avg_score = total_score / len(results) if results else 0
|
| 2393 |
-
print(f"\nAverage Score: {avg_score:.3f}")
|
| 2394 |
-
|
| 2395 |
-
return results
|
| 2396 |
-
|
| 2397 |
-
|
| 2398 |
-
if __name__ == "__main__":
|
| 2399 |
-
main()
|
| 2400 |
-
```
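The action format requested in `SYSTEM_PROMPT` nests objects inside `edits`, so the response parser has to cope with nested braces and with prose around the JSON. One possible approach, sketched here with only the standard library, is to let `json.JSONDecoder.raw_decode` find the end of the object instead of a regular expression. The function name `extract_first_json_object` is hypothetical and not part of the plan.

```python
import json
from typing import Any, Dict, Optional


def extract_first_json_object(text: str) -> Optional[Dict[str, Any]]:
    """Return the first decodable JSON object embedded in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores trailing text
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # Not valid JSON from this brace: try the next opening brace
        start = text.find("{", start + 1)
    return None


reply = (
    'Here is my fix: {"action_type": "replace_line", '
    '"edits": [{"file_path": "Dockerfile", "line_number": 5}], '
    '"reasoning": "typo in filename"} Hope that helps.'
)
action = extract_first_json_object(reply)
print(action["action_type"], action["edits"][0]["line_number"])
```

Because the decoder tracks nesting itself, the outer action object is returned whole rather than one of the inner `edits` entries.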
|
| 2401 |
-
|
| 2402 |
-
---
|
| 2403 |
-
|
| 2404 |
-
# 10. DOCKERFILE & DEPLOYMENT
|
| 2405 |
-
|
| 2406 |
-
## 10.1 Dockerfile
|
| 2407 |
-
|
| 2408 |
-
```dockerfile
|
| 2409 |
-
# Multi-stage build for smaller image
|
| 2410 |
-
FROM python:3.11-slim AS builder
|
| 2411 |
-
|
| 2412 |
-
WORKDIR /app
|
| 2413 |
-
|
| 2414 |
-
# Install build dependencies
|
| 2415 |
-
RUN apt-get update && apt-get install -y --no-install-recommends \
|
| 2416 |
-
gcc \
|
| 2417 |
-
&& rm -rf /var/lib/apt/lists/*
|
| 2418 |
-
|
| 2419 |
-
# Copy and install requirements
|
| 2420 |
-
COPY requirements.txt .
|
| 2421 |
-
RUN pip install --no-cache-dir --user -r requirements.txt
|
| 2422 |
-
|
| 2423 |
-
# Production stage
|
| 2424 |
-
FROM python:3.11-slim
|
| 2425 |
-
|
| 2426 |
-
WORKDIR /app
|
| 2427 |
-
|
| 2428 |
-
# Copy installed packages into the system prefix so the non-root user can import them
|
| 2429 |
-
COPY --from=builder /root/.local /usr/local
|
| 2430 |
-
ENV PATH=/root/.local/bin:$PATH
|
| 2431 |
-
|
| 2432 |
-
# Copy application code
|
| 2433 |
-
COPY server/ ./server/
|
| 2434 |
-
COPY data/ ./data/
|
| 2435 |
-
COPY openenv.yaml .
|
| 2436 |
-
COPY inference.py .
|
| 2437 |
-
|
| 2438 |
-
# Create non-root user for security
|
| 2439 |
-
RUN useradd --create-home appuser
|
| 2440 |
-
USER appuser
|
| 2441 |
-
|
| 2442 |
-
# Expose port (HuggingFace Spaces uses 7860)
|
| 2443 |
-
EXPOSE 7860
|
| 2444 |
-
|
| 2445 |
-
# Health check
|
| 2446 |
-
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
|
| 2447 |
-
CMD python -c "import requests; requests.get('http://localhost:7860/', timeout=5).raise_for_status()" || exit 1
|
| 2448 |
-
|
| 2449 |
-
# Run the server
|
| 2450 |
-
CMD ["python", "-m", "uvicorn", "server.main:app", "--host", "0.0.0.0", "--port", "7860"]
|
| 2451 |
-
```
|
| 2452 |
-
|
| 2453 |
-
## 10.2 requirements.txt
|
| 2454 |
-
|
| 2455 |
-
```
|
| 2456 |
-
# Core
|
| 2457 |
-
fastapi==0.109.0
|
| 2458 |
-
uvicorn[standard]==0.27.0
|
| 2459 |
-
pydantic==2.5.3
|
| 2460 |
-
|
| 2461 |
-
# HTTP client
|
| 2462 |
-
requests==2.31.0
|
| 2463 |
-
httpx==0.26.0
|
| 2464 |
-
|
| 2465 |
-
# OpenAI client (for baseline)
|
| 2466 |
-
openai==1.12.0
|
| 2467 |
-
|
| 2468 |
-
# YAML parsing (for workflow validation)
|
| 2469 |
-
pyyaml==6.0.1
|
| 2470 |
-
ruamel.yaml==0.18.5
|
| 2471 |
-
|
| 2472 |
-
# Testing
|
| 2473 |
-
pytest==7.4.4
|
| 2474 |
-
pytest-asyncio==0.23.3
|
| 2475 |
-
|
| 2476 |
-
# Utilities
|
| 2477 |
-
python-dotenv==1.0.0
|
| 2478 |
-
```
|
| 2479 |
-
|
| 2480 |
-
## 10.3 HuggingFace Spaces Deployment
|
| 2481 |
-
|
| 2482 |
-
```yaml
|
| 2483 |
-
# README.md for HF Space
|
| 2484 |
-
---
|
| 2485 |
-
title: CI/CD Debug Environment
|
| 2486 |
-
emoji: 🔧
|
| 2487 |
-
colorFrom: blue
|
| 2488 |
-
colorTo: green
|
| 2489 |
-
sdk: docker
|
| 2490 |
-
app_port: 7860
|
| 2491 |
-
pinned: false
|
| 2492 |
-
license: mit
|
| 2493 |
-
---
|
| 2494 |
-
|
| 2495 |
-
# CI/CD Debug Environment
|
| 2496 |
-
|
| 2497 |
-
An OpenEnv-compliant environment for training AI agents to debug Docker and GitHub Actions issues.
|
| 2498 |
-
|
| 2499 |
-
## Quick Start
|
| 2500 |
-
|
| 2501 |
-
```bash
|
| 2502 |
-
# Reset environment
|
| 2503 |
-
curl -X POST https://your-space.hf.space/reset
|
| 2504 |
-
|
| 2505 |
-
# Take action
|
| 2506 |
-
curl -X POST https://your-space.hf.space/step \
|
| 2507 |
-
-H "Content-Type: application/json" \
|
| 2508 |
-
-d '{"action": {"action_type": "submit"}}'
|
| 2509 |
-
```
|
| 2510 |
-
|
| 2511 |
-
## Tasks
|
| 2512 |
-
|
| 2513 |
-
1. **Dockerfile Syntax** (Easy) - Fix common Dockerfile errors
|
| 2514 |
-
2. **Workflow Config** (Medium) - Fix GitHub Actions + Docker issues
|
| 2515 |
-
3. **Multi-Stage Pipeline** (Hard) - Debug complex CI/CD pipelines
|
| 2516 |
-
```
|
| 2517 |
-
|
| 2518 |
-
---
|
| 2519 |
-
|
| 2520 |
-
# 11. TESTING PLAN
|
| 2521 |
-
|
| 2522 |
-
## 11.1 Test Categories
|
| 2523 |
-
|
| 2524 |
-
```python
|
| 2525 |
-
# tests/test_endpoints.py
|
| 2526 |
-
"""Test all required OpenEnv endpoints."""
|
| 2527 |
-
|
| 2528 |
-
import pytest
|
| 2529 |
-
from fastapi.testclient import TestClient
|
| 2530 |
-
from server.main import app
|
| 2531 |
-
|
| 2532 |
-
client = TestClient(app)
|
| 2533 |
-
|
| 2534 |
-
|
| 2535 |
-
class TestEndpoints:
|
| 2536 |
-
"""Verify all 7 endpoints work correctly."""
|
| 2537 |
-
|
| 2538 |
-
def test_root_health(self):
|
| 2539 |
-
"""GET / returns healthy status."""
|
| 2540 |
-
response = client.get("/")
|
| 2541 |
-
assert response.status_code == 200
|
| 2542 |
-
assert response.json()["status"] == "healthy"
|
| 2543 |
-
|
| 2544 |
-
def test_reset_returns_observation(self):
|
| 2545 |
-
"""POST /reset returns valid observation."""
|
| 2546 |
-
response = client.post("/reset", json={})
|
| 2547 |
-
assert response.status_code == 200
|
| 2548 |
-
data = response.json()
|
| 2549 |
-
assert "observation" in data
|
| 2550 |
-
assert "task_id" in data["observation"]
|
| 2551 |
-
assert "files" in data["observation"]
|
| 2552 |
-
assert "error" in data["observation"]
|
| 2553 |
-
|
| 2554 |
-
def test_step_requires_reset(self):
|
| 2555 |
-
"""POST /step fails without reset."""
|
| 2556 |
-
# Fresh client/environment
|
| 2557 |
-
response = client.post("/step", json={
|
| 2558 |
-
"action": {"action_type": "submit"}
|
| 2559 |
-
})
|
| 2560 |
-
# Should fail or require reset
|
| 2561 |
-
# (Implementation dependent)
|
| 2562 |
-
|
| 2563 |
-
def test_step_returns_result(self):
|
| 2564 |
-
"""POST /step returns observation, reward, done."""
|
| 2565 |
-
client.post("/reset", json={})
|
| 2566 |
-
response = client.post("/step", json={
|
| 2567 |
-
"action": {"action_type": "request_hint"}
|
| 2568 |
-
})
|
| 2569 |
-
assert response.status_code == 200
|
| 2570 |
-
data = response.json()
|
| 2571 |
-
assert "observation" in data
|
| 2572 |
-
assert "reward" in data
|
| 2573 |
-
assert "done" in data
|
| 2574 |
-
|
| 2575 |
-
def test_state_returns_current(self):
|
| 2576 |
-
"""GET /state returns current observation."""
|
| 2577 |
-
client.post("/reset", json={})
|
| 2578 |
-
response = client.get("/state")
|
| 2579 |
-
assert response.status_code == 200
|
| 2580 |
-
assert "observation" in response.json()
|
| 2581 |
-
|
| 2582 |
-
def test_info_returns_metadata(self):
|
| 2583 |
-
"""GET /info returns environment metadata."""
|
| 2584 |
-
response = client.get("/info")
|
| 2585 |
-
assert response.status_code == 200
|
| 2586 |
-
data = response.json()
|
| 2587 |
-
assert "tasks" in data
|
| 2588 |
-
assert len(data["tasks"]) >= 3
|
| 2589 |
-
|
| 2590 |
-
def test_tasks_returns_list(self):
|
| 2591 |
-
"""GET /tasks returns task list."""
|
| 2592 |
-
response = client.get("/tasks")
|
| 2593 |
-
assert response.status_code == 200
|
| 2594 |
-
assert "tasks" in response.json()
|
| 2595 |
-
|
| 2596 |
-
def test_grader_returns_score(self):
|
| 2597 |
-
"""POST /grader returns valid score."""
|
| 2598 |
-
response = client.post("/grader", json={
|
| 2599 |
-
"task_id": "dockerfile_syntax",
|
| 2600 |
-
"trajectory": []
|
| 2601 |
-
})
|
| 2602 |
-
assert response.status_code == 200
|
| 2603 |
-
result = response.json()["result"]
|
| 2604 |
-
assert 0.0 <= result["score"] <= 1.0
|
| 2605 |
-
|
| 2606 |
-
def test_baseline_runs(self):
|
| 2607 |
-
"""POST /baseline executes baseline script."""
|
| 2608 |
-
response = client.post("/baseline", json={
|
| 2609 |
-
"task_id": "dockerfile_syntax",
|
| 2610 |
-
"num_episodes": 1
|
| 2611 |
-
})
|
| 2612 |
-
assert response.status_code == 200
|
| 2613 |
-
|
| 2614 |
-
|
| 2615 |
-
# tests/test_graders.py
|
| 2616 |
-
"""Test grader determinism and correctness."""
|
| 2617 |
-
|
| 2618 |
-
class TestGraderDeterminism:
|
| 2619 |
-
"""Verify graders are deterministic."""
|
| 2620 |
-
|
| 2621 |
-
def test_same_trajectory_same_score(self):
|
| 2622 |
-
"""Same trajectory produces same score."""
|
| 2623 |
-
trajectory = [
|
| 2624 |
-
{"step": 1, "action": {"action_type": "submit"}, "reward": 0.5, "done": True, "info": {"issues_fixed": 1, "issues_total": 2}}
|
| 2625 |
-
]
|
| 2626 |
-
|
| 2627 |
-
result1 = run_grader("dockerfile_syntax", trajectory)
|
| 2628 |
-
result2 = run_grader("dockerfile_syntax", trajectory)
|
| 2629 |
-
|
| 2630 |
-
assert result1.score == result2.score
|
| 2631 |
-
assert result1.breakdown == result2.breakdown
|
| 2632 |
-
|
| 2633 |
-
def test_score_in_valid_range(self):
|
| 2634 |
-
"""Score is always between 0.0 and 1.0."""
|
| 2635 |
-
for _ in range(100):
|
| 2636 |
-
trajectory = generate_random_trajectory()
|
| 2637 |
-
result = run_grader("dockerfile_syntax", trajectory)
|
| 2638 |
-
assert 0.0 <= result.score <= 1.0
|
| 2639 |
-
|
| 2640 |
-
|
| 2641 |
-
# tests/test_tasks.py
|
| 2642 |
-
"""Test task scenarios."""
|
| 2643 |
-
|
| 2644 |
-
class TestTaskScenarios:
|
| 2645 |
-
"""Verify each task has valid scenarios."""
|
| 2646 |
-
|
| 2647 |
-
def test_each_task_has_3_plus_scenarios(self):
|
| 2648 |
-
"""Every task has at least 3 scenarios."""
|
| 2649 |
-
for task_id, task_cls in TASK_REGISTRY.items():
|
| 2650 |
-
assert len(task_cls.SCENARIOS) >= 3, f"{task_id} has < 3 scenarios"
|
| 2651 |
-
|
| 2652 |
-
def test_scenarios_have_required_fields(self):
|
| 2653 |
-
"""Each scenario has all required fields."""
|
| 2654 |
-
required = ["id", "files", "error", "expected_fixes"]
|
| 2655 |
-
for task_id, task_cls in TASK_REGISTRY.items():
|
| 2656 |
-
for scenario in task_cls.SCENARIOS:
|
| 2657 |
-
for field in required:
|
| 2658 |
-
assert field in scenario, f"{task_id} scenario missing {field}"
|
| 2659 |
-
|
| 2660 |
-
def test_expected_fixes_are_verifiable(self):
|
| 2661 |
-
"""Each expected fix can be verified programmatically."""
|
| 2662 |
-
for task_id, task_cls in TASK_REGISTRY.items():
|
| 2663 |
-
task = task_cls()
|
| 2664 |
-
for scenario in task_cls.SCENARIOS:
|
| 2665 |
-
for fix in scenario["expected_fixes"]:
|
| 2666 |
-
assert "file" in fix
|
| 2667 |
-
assert "type" in fix
|
| 2668 |
-
assert fix["type"] in ["contains", "not_contains", "line_equals"]
|
| 2669 |
-
```
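The last test only checks that each expected fix declares a `file` and one of the three `type` values. As a rough illustration of how such a fix could be verified against edited files, here is a minimal sketch. The `value` and `line` keys, the `files` mapping of path to content, and the function name `fix_is_satisfied` are all assumptions for illustration; the plan does not specify them.

```python
from typing import Dict


def fix_is_satisfied(fix: Dict, files: Dict[str, str]) -> bool:
    """Check one expected fix against the current file contents."""
    content = files.get(fix["file"], "")
    kind = fix["type"]
    if kind == "contains":
        return fix["value"] in content
    if kind == "not_contains":
        return fix["value"] not in content
    if kind == "line_equals":
        lines = content.split("\n")
        index = fix["line"] - 1  # assumed 1-based line numbers
        return 0 <= index < len(lines) and lines[index].strip() == fix["value"].strip()
    raise ValueError(f"Unknown fix type: {kind}")


files = {"Dockerfile": "FROM python:3.11-slim\nCOPY requirements.txt .\n"}
checks = [
    {"file": "Dockerfile", "type": "contains", "value": "requirements.txt"},
    {"file": "Dockerfile", "type": "not_contains", "value": "requirments.txt"},
    {"file": "Dockerfile", "type": "line_equals", "line": 2, "value": "COPY requirements.txt ."},
]
print([fix_is_satisfied(check, files) for check in checks])
```

Keeping each check a pure function of the fix and the file contents is one way to keep grading deterministic, which is what the determinism tests above rely on.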
|
| 2670 |
-
|
| 2671 |
-
## 11.2 Validation Script (Local)
|
| 2672 |
-
|
| 2673 |
-
```bash
|
| 2674 |
-
#!/bin/bash
|
| 2675 |
-
# validate-local.sh - Run all checks locally
|
| 2676 |
-
|
| 2677 |
-
set -e
|
| 2678 |
-
|
| 2679 |
-
echo "=== 1. Running unit tests ==="
|
| 2680 |
-
pytest tests/ -v
|
| 2681 |
-
|
| 2682 |
-
echo "=== 2. Building Docker image ==="
|
| 2683 |
-
docker build -t cicd-debug-env:test .
|
| 2684 |
-
|
| 2685 |
-
echo "=== 3. Running container ==="
|
| 2686 |
-
docker run -d --name test-env -p 7860:7860 cicd-debug-env:test
|
| 2687 |
-
sleep 5
|
| 2688 |
-
|
| 2689 |
-
echo "=== 4. Testing endpoints ==="
|
| 2690 |
-
curl -f http://localhost:7860/ || exit 1
|
| 2691 |
-
curl -f -X POST http://localhost:7860/reset || exit 1
|
| 2692 |
-
curl -f http://localhost:7860/info || exit 1
|
| 2693 |
-
curl -f http://localhost:7860/tasks || exit 1
|
| 2694 |
-
|
| 2695 |
-
echo "=== 5. Running openenv validate ==="
|
| 2696 |
-
openenv validate
|
| 2697 |
-
|
| 2698 |
-
echo "=== 6. Cleanup ==="
|
| 2699 |
-
docker stop test-env
|
| 2700 |
-
docker rm test-env
|
| 2701 |
-
|
| 2702 |
-
echo "=== ALL CHECKS PASSED ==="
|
| 2703 |
-
```
|
| 2704 |
-
|
| 2705 |
-
---
|
| 2706 |
-
|
| 2707 |
-
# 12. TIMELINE & MILESTONES
|
| 2708 |
-
|
| 2709 |
-
## Development Schedule (Assuming 7-10 days)
|
| 2710 |
-
|
| 2711 |
-
### Day 1-2: Foundation
|
| 2712 |
-
- [x] Set up project structure
|
| 2713 |
-
- [x] Implement Pydantic models
|
| 2714 |
-
- [x] Create base FastAPI server with all endpoints
|
| 2715 |
-
- [x] Write openenv.yaml
|
| 2716 |
-
|
| 2717 |
-
### Day 3-4: Core Environment
|
| 2718 |
-
- [x] Implement environment.py (reset, step, state)
|
| 2719 |
-
- [x] Create Docker simulator (validate Dockerfile syntax)
|
| 2720 |
-
- [x] Create Workflow simulator (validate YAML)
|
| 2721 |
-
- [x] Test basic episode flow
|
| 2722 |
-
|
| 2723 |
-
### Day 5-6: Tasks & Scenarios
|
| 2724 |
-
- [x] Implement Task 1: Dockerfile Syntax (5 scenarios)
|
| 2725 |
-
- [x] Implement Task 2: Dockerfile Runtime (5 scenarios)
|
| 2726 |
-
- [x] Implement Task 3: Workflow Syntax and Structure (5 scenarios)
|
| 2727 |
-
- [x] Implement Task 4: Workflow Secrets and Permissions (5 scenarios)
|
| 2728 |
-
- [x] Implement Task 5: CI and Docker Build Integration (5 scenarios)
|
| 2729 |
-
- [x] Implement Task 6: Multi-Stage Pipeline and Matrix (5 scenarios)
|
| 2730 |
-
- [x] Verify difficulty progression (easy → medium → hard)
|
| 2731 |
-
- [x] Enhanced DockerSimulator: 15+ validation rules (typos, bad tags, EXPOSE, platform ARGs, runtime: WORKDIR, ENTRYPOINT, ENV, privileged ports)
|
| 2732 |
-
- [x] Enhanced WorkflowSimulator: 15+ validation rules (on trigger, runs-on, branches syntax, run/uses, ${{ }}, permissions, needs, secrets env, GHCR creds, cache, context paths, push auth)
|
| 2733 |
-
- [x] Fixed environment.py: dynamic workflow file lookup, trajectory includes info dict
|
| 2734 |
-
- [x] 30/30 scenarios verified end-to-end (reset → fix → grade)
|
| 2735 |
-
|
| 2736 |
-
### Day 7: Graders & Rewards
|
| 2737 |
-
- [x] Implement grader logic (deterministic, dynamic scoring)
|
| 2738 |
-
- [x] Test determinism (10x replay → identical scores)
|
| 2739 |
-
- [x] Tune reward shaping (dense: +0.1 validation, +0.3/fix, -0.05/hint, -0.02/failed)
|
| 2740 |
-
- [x] Verify score ranges (0/n→0.0, partial→~0.5, complete→1.0, hints penalized)
|
| 2741 |
-
- [x] Grader weights: 40% partial fixes + 30% complete bonus + 30% efficiency - 5%/hint
|
| 2742 |
-
- [x] 17 determinism/score-range tests + 26/26 total test suite passing
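
The grader weights listed for Day 7 can be read as a weighted sum. The sketch below is one possible interpretation, not the actual grader: the efficiency term (full credit for a one-step solution, decreasing linearly with steps used), the zero score when nothing is fixed, and the clamp to [0, 1] are assumptions chosen to match the stated ranges (0/n gives 0.0, partial gives about 0.5, complete gives 1.0).

```python
def grade(issues_fixed: int, issues_total: int, steps_used: int,
          max_steps: int, hints_used: int) -> float:
    """Weighted score: 40% partial fixes + 30% completion + 30% efficiency - 5% per hint."""
    if issues_total <= 0 or issues_fixed <= 0:
        return 0.0  # assumption: no fixes means no score, matching "0/n -> 0.0"
    fraction = min(issues_fixed, issues_total) / issues_total
    complete = 1.0 if issues_fixed >= issues_total else 0.0
    # Assumed efficiency: 1.0 when solved in one step, shrinking linearly with steps used
    efficiency = max(0.0, 1.0 - (steps_used - 1) / max_steps)
    score = 0.4 * fraction + 0.3 * complete + 0.3 * efficiency - 0.05 * hints_used
    return round(min(1.0, max(0.0, score)), 4)


print(grade(2, 2, steps_used=1, max_steps=10, hints_used=0))  # complete, no hints
print(grade(1, 2, steps_used=1, max_steps=10, hints_used=0))  # half of the issues fixed
print(grade(2, 2, steps_used=1, max_steps=10, hints_used=2))  # complete, two hints
```

The function depends only on its arguments, so replaying the same trajectory summary always yields the same score.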
|
| 2743 |
-
|
| 2744 |
-
### Day 8: Baseline & Testing
|
| 2745 |
-
- [ ] Write inference.py baseline
|
| 2746 |
-
- [ ] Run baseline on all tasks
|
| 2747 |
-
- [ ] Verify expected scores
|
| 2748 |
-
- [ ] Full test suite
|
| 2749 |
-
|
| 2750 |
-
### Day 9: Docker & Deployment
|
| 2751 |
-
- [ ] Finalize Dockerfile
|
| 2752 |
-
- [ ] Test local Docker build/run
|
| 2753 |
-
- [ ] Deploy to HuggingFace Spaces
|
| 2754 |
-
- [ ] Run validation script
|
| 2755 |
-
|
| 2756 |
-
### Day 10: Polish & Submit
|
| 2757 |
-
- [ ] Write comprehensive README
|
| 2758 |
-
- [ ] Final testing
|
| 2759 |
-
- [ ] Submit before deadline
|
| 2760 |
-
|
| 2761 |
-
---
|
| 2762 |
-
|
| 2763 |
-
# APPENDIX: Quick Reference
|
| 2764 |
-
|
| 2765 |
-
## Required Files Checklist
|
| 2766 |
-
|
| 2767 |
-
```
|
| 2768 |
-
✓ openenv.yaml - Environment metadata
|
| 2769 |
-
✓ inference.py - Baseline script (root dir)
|
| 2770 |
-
✓ Dockerfile - Container definition
|
| 2771 |
-
✓ requirements.txt - Python dependencies
|
| 2772 |
-
✓ README.md - Documentation
|
| 2773 |
-
✓ server/main.py - FastAPI app
|
| 2774 |
-
✓ server/models.py - Pydantic models
|
| 2775 |
-
✓ server/environment.py - Core logic
|
| 2776 |
-
✓ server/tasks/*.py - 6 task definitions
|
| 2777 |
-
✓ server/graders/*.py - Grading logic
|
| 2778 |
-
```
|
| 2779 |
-
|
| 2780 |
-
## Required Endpoints
|
| 2781 |
-
|
| 2782 |
-
```
|
| 2783 |
-
GET / - Health check
|
| 2784 |
-
POST /reset - Start new episode
|
| 2785 |
-
POST /step - Take action
|
| 2786 |
-
GET /state - Current observation
|
| 2787 |
-
GET /info - Environment metadata
|
| 2788 |
-
GET /tasks - List tasks
|
| 2789 |
-
POST /grader - Grade trajectory
|
| 2790 |
-
POST /baseline - Run baseline
|
| 2791 |
-
```
|
| 2792 |
-
|
| 2793 |
-
## Environment Variables
|
| 2794 |
-
|
| 2795 |
-
```bash
|
| 2796 |
-
API_BASE_URL=https://router.huggingface.co/v1
|
| 2797 |
-
MODEL_NAME=meta-llama/Llama-3.1-70B-Instruct
|
| 2798 |
-
HF_TOKEN=your_token_here
|
| 2799 |
-
```
|
| 2800 |
-
|
| 2801 |
-
## Score Targets
|
| 2802 |
-
|
| 2803 |
-
| Task | Expected Baseline Score |
|
| 2804 |
-
|------|------------------------|
|
| 2805 |
-
| dockerfile_syntax | 0.7 |
|
| 2806 |
-
| dockerfile_runtime | 0.55 |
|
| 2807 |
-
| workflow_syntax_structure | 0.65 |
|
| 2808 |
-
| workflow_secrets_permissions | 0.5 |
|
| 2809 |
-
| ci_docker_integration | 0.45 |
|
| 2810 |
-
| multi_stage_pipeline_matrix | 0.3 |
|
| 2811 |
-
|
| 2812 |
-
---
|
| 2813 |
-
|
| 2814 |
-
*End of Implementation Plan*
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
README.md
CHANGED
|
@@ -1,25 +1,188 @@
|
|
| 1 |
# CI/CD Debug Environment
|
| 2 |
|
| 3 |
-
OpenEnv-
|
| 4 |
-
## Day 1-2 Status
|
| 5 |
|
| 6 |
-
|
| 7 |
-
- Typed Pydantic models implemented
|
| 8 |
-
- FastAPI app with core endpoints implemented
|
| 9 |
-
- Initial 6-task registry and environment loop wired
|
| 10 |
-
- Deterministic dynamic grader scaffold implemented with score breakdown
|
| 11 |
|
| 12 |
-
|
| 13 |
|
| 14 |
```bash
|
| 15 |
pip install -r requirements.txt
|
| 16 |
-
python -m uvicorn server.main:app --
|
| 17 |
```
|
| 18 |
|
| 19 |
-
##
|
| 20 |
|
| 21 |
```bash
|
| 22 |
curl http://localhost:7860/
|
| 23 |
-
|
| 24 |
-
|
| 25 |
```
|
| 1 |
# CI/CD Debug Environment
|
| 2 |
|
| 3 |
+
An OpenEnv-compatible environment where AI agents learn to debug broken GitHub Actions workflows and Dockerfiles. Built for the OpenEnv Hackathon by Scaler School of Technology (partners: Meta, HuggingFace, PyTorch).
|
| 4 |
|
| 5 |
+
## What It Does
|
| 6 |
|
| 7 |
+
Agents receive:
|
| 8 |
+
- Broken configuration files (Dockerfile, GitHub Actions YAML)
|
| 9 |
+
- Error messages from failed builds/workflows
|
| 10 |
+
- Context about available secrets and runner environment
|
| 11 |
+
|
| 12 |
+
Agents must analyze errors, identify root causes, edit files to fix issues, and submit solutions. The environment provides dense reward feedback at every step.
|
| 13 |
+
|
| 14 |
+
## Tasks
|
| 15 |
+
|
| 16 |
+
| # | Task ID | Description | Difficulty | Scenarios |
|
| 17 |
+
|---|---------|-------------|------------|-----------|
|
| 18 |
+
| 1 | `dockerfile_syntax` | Fix Dockerfile instruction/syntax errors | Easy | 5 |
|
| 19 |
+
| 2 | `dockerfile_runtime` | Fix Dockerfile runtime/execution issues | Medium | 5 |
|
| 20 |
+
| 3 | `workflow_syntax_structure` | Fix GitHub Actions YAML structure | Easy | 5 |
|
| 21 |
+
| 4 | `workflow_secrets_permissions` | Fix secret wiring and permissions | Medium | 5 |
|
| 22 |
+
| 5 | `ci_docker_integration` | Debug combined CI + Docker failures | Medium-Hard | 5 |
|
| 23 |
+
| 6 | `multi_stage_pipeline_matrix` | Debug multi-stage and matrix pipelines | Hard | 5 |
|
| 24 |
+
|
| 25 |
+
30 total scenarios across 6 tasks with clear difficulty progression.
|
| 26 |
+
|
| 27 |
+
## API Endpoints
|
| 28 |
+
|
| 29 |
+
| Endpoint | Method | Description |
|
| 30 |
+
|----------|--------|-------------|
|
| 31 |
+
| `/` | GET | Health check |
|
| 32 |
+
| `/reset` | POST | Start a new episode (optional: `task_id`, `scenario_id`, `seed`) |
|
| 33 |
+
| `/step` | POST | Take an action (`edit_file`, `replace_line`, `add_line`, `delete_line`, `submit`, `request_hint`) |
|
| 34 |
+
| `/state` | GET | Get current observation |
|
| 35 |
+
| `/info` | GET | Environment metadata and schemas |
|
| 36 |
+
| `/tasks` | GET | List all tasks |
|
| 37 |
+
| `/grader` | POST | Grade a trajectory |
|
| 38 |
+
| `/baseline` | POST | Run built-in heuristic baseline |
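The request bodies for `/reset` and `/step` can be assembled in Python roughly as sketched below. The helper names (`reset_payload`, `edit_file_payload`, `submit_payload`) are invented for this example, and the field names are copied from the curl examples in this README rather than from the server's Pydantic models, so the exact schema should be treated as an assumption.

```python
import json
from typing import Any, Dict, Optional


def reset_payload(task_id: Optional[str] = None,
                  scenario_id: Optional[str] = None,
                  seed: Optional[int] = None) -> Dict[str, Any]:
    """Body for POST /reset; all three fields are optional, so unset ones are omitted."""
    fields = {"task_id": task_id, "scenario_id": scenario_id, "seed": seed}
    return {key: value for key, value in fields.items() if value is not None}


def edit_file_payload(file_path: str, old_content: str, new_content: str) -> Dict[str, Any]:
    """Body for POST /step with a single edit_file edit (shape assumed from the curl example)."""
    return {
        "action": {
            "action_type": "edit_file",
            "edits": [{
                "file_path": file_path,
                "old_content": old_content,
                "new_content": new_content,
            }],
        }
    }


def submit_payload() -> Dict[str, Any]:
    """Body for POST /step that submits the current solution."""
    return {"action": {"action_type": "submit"}}


if __name__ == "__main__":
    # Print the bodies so they can be compared with the curl examples.
    print(json.dumps(reset_payload(task_id="dockerfile_syntax")))
    print(json.dumps(edit_file_payload("Dockerfile",
                                       "COPY requirments.txt .",
                                       "COPY requirements.txt .")))
    print(json.dumps(submit_payload()))
```

Each dictionary can be passed as the `json=` argument of an HTTP client call against the running server.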
|
| 39 |
+
|
| 40 |
+
## Grading
|
| 41 |
+
|
| 42 |
+
Scoring is **deterministic** and **dynamic**: identical action sequences always receive the same score, while different action sequences can receive different scores.
|
| 43 |
+
|
| 44 |
+
| Component | Weight | Description |
|
| 45 |
+
|-----------|--------|-------------|
|
| 46 |
+
| Partial fixes | 40% | Proportional to issues fixed |
|
| 47 |
+
| Complete solution | 30% | Bonus when ALL issues fixed |
|
| 48 |
+
| Efficiency | 30% | Bonus for minimal steps (decays with extra steps) |
|
| 49 |
+
| Hint penalty | -5% each | Per hint requested |
|
| 50 |
+
|
| 51 |
+
Score range: `0.0` (no progress) to `1.0` (all fixed efficiently).
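As a rough illustration of how the weights in the table could combine into a single score, here is a minimal Python sketch. The function name `score_episode`, its parameters, the linear efficiency decay, and the choice to scale the efficiency bonus by the fraction of issues fixed are all assumptions made for this example; the grader in `server/graders` may compute each component differently.

```python
def score_episode(issues_fixed: int, total_issues: int, steps_taken: int,
                  min_steps: int, hints_used: int) -> float:
    """Hypothetical combination of the grading weights listed above."""
    if total_issues <= 0:
        return 0.0
    fixed_ratio = issues_fixed / total_issues

    # 40%: partial credit, proportional to the issues fixed
    partial = 0.40 * fixed_ratio
    # 30%: bonus only when every issue is fixed
    complete = 0.30 if issues_fixed == total_issues else 0.0
    # 30%: efficiency bonus; assumed to lose a tenth of its value per extra step
    extra_steps = max(0, steps_taken - min_steps)
    efficiency = 0.30 * fixed_ratio * max(0.0, 1.0 - 0.1 * extra_steps)
    # -5% for each hint requested
    penalty = 0.05 * hints_used

    total = partial + complete + efficiency - penalty
    # Clamp to the documented range of 0.0 to 1.0
    return round(min(1.0, max(0.0, total)), 4)
```

Under these assumptions, fixing everything in the minimum number of steps without hints reaches the maximum score, and an episode with no fixes scores zero regardless of step count.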
|
| 52 |
+
|
| 53 |
+
## Quick Start
|
| 54 |
+
|
| 55 |
+
### Local Development
|
| 56 |
|
| 57 |
```bash
|
| 58 |
pip install -r requirements.txt
|
| 59 |
+
python -m uvicorn server.main:app --host 0.0.0.0 --port 7860
|
| 60 |
```
|
| 61 |
|
| 62 |
+
### Test Endpoints
|
| 63 |
|
| 64 |
```bash
|
| 65 |
+
# Health check
|
| 66 |
curl http://localhost:7860/
|
| 67 |
+
|
| 68 |
+
# List tasks
|
| 69 |
+
curl http://localhost:7860/tasks
|
| 70 |
+
|
| 71 |
+
# Start an episode
|
| 72 |
+
curl -X POST http://localhost:7860/reset \
|
| 73 |
+
-H "Content-Type: application/json" \
|
| 74 |
+
-d '{"task_id": "dockerfile_syntax"}'
|
| 75 |
+
|
| 76 |
+
# Take an action
|
| 77 |
+
curl -X POST http://localhost:7860/step \
|
| 78 |
+
-H "Content-Type: application/json" \
|
| 79 |
+
-d '{
|
| 80 |
+
"action": {
|
| 81 |
+
"action_type": "edit_file",
|
| 82 |
+
"edits": [{
|
| 83 |
+
"file_path": "Dockerfile",
|
| 84 |
+
"old_content": "COPY requirments.txt .",
|
| 85 |
+
"new_content": "COPY requirements.txt ."
|
| 86 |
+
}]
|
| 87 |
+
}
|
| 88 |
+
}'
|
| 89 |
+
|
| 90 |
+
# Submit solution
|
| 91 |
+
curl -X POST http://localhost:7860/step \
|
| 92 |
+
-H "Content-Type: application/json" \
|
| 93 |
+
-d '{"action": {"action_type": "submit"}}'
|
| 94 |
```
|
| 95 |
+
|
| 96 |
+
### Run Tests
|
| 97 |
+
|
| 98 |
+
```bash
|
| 99 |
+
pytest tests/ -v
|
| 100 |
+
```
|
| 101 |
+
|
| 102 |
+
### Docker
|
| 103 |
+
|
| 104 |
+
```bash
|
| 105 |
+
docker build -t cicd-debug-env .
|
| 106 |
+
docker run -p 7860:7860 cicd-debug-env
|
| 107 |
+
```
|
| 108 |
+
|
| 109 |
+
### Baseline Inference (with LLM)
|
| 110 |
+
|
| 111 |
+
```bash
|
| 112 |
+
export API_BASE_URL=https://router.huggingface.co/v1
|
| 113 |
+
export MODEL_NAME=meta-llama/Llama-3.1-70B-Instruct
|
| 114 |
+
export HF_TOKEN=your_token_here
|
| 115 |
+
python inference.py
|
| 116 |
+
```
|
| 117 |
+
|
| 118 |
+
Run on a specific task:
|
| 119 |
+
```bash
|
| 120 |
+
python inference.py dockerfile_syntax
|
| 121 |
+
```
|
| 122 |
+
|
| 123 |
+
## Project Structure
|
| 124 |
+
|
| 125 |
+
```
|
| 126 |
+
cicd-debug-env/
|
| 127 |
+
├── openenv.yaml # OpenEnv metadata
|
| 128 |
+
├── inference.py # LLM baseline script
|
| 129 |
+
├── baseline_runner.py # Heuristic baseline for /baseline endpoint
|
| 130 |
+
├── Dockerfile # Production container
|
| 131 |
+
├── requirements.txt # Python dependencies
|
| 132 |
+
├── README.md
|
| 133 |
+
│
|
| 134 |
+
├── server/
|
| 135 |
+
│ ├── __init__.py
|
| 136 |
+
│ ├── main.py # FastAPI with all 8 endpoints
|
| 137 |
+
│ ├── models.py # Pydantic models
|
| 138 |
+
│ ├── environment.py # Core environment logic
|
| 139 |
+
│ │
|
| 140 |
+
│ ├── tasks/
|
| 141 |
+
│ │ ├── base.py # BaseTask class
|
| 142 |
+
│ │ ├── task_registry.py # Task registry
|
| 143 |
+
│ │ ├── task_1_build_errors.py
|
| 144 |
+
│ │ ├── task_2_docker_runtime.py
|
| 145 |
+
│ │ ├── task_3_workflow_syntax.py
|
| 146 |
+
│ │ ├── task_4_workflow_secrets_permissions.py
|
| 147 |
+
│ │ ├── task_5_ci_docker_integration.py
|
| 148 |
+
│ │ └── task_6_multi_stage_matrix.py
|
| 149 |
+
│ │
|
| 150 |
+
│ ├── graders/
|
| 151 |
+
│ │ ├── __init__.py # Deterministic grader
|
| 152 |
+
│ │ └── base.py # Base grader class
|
| 153 |
+
│ │
|
| 154 |
+
│ ├── simulators/
|
| 155 |
+
│ │ ├── docker_simulator.py # Dockerfile validation (15+ rules)
|
| 156 |
+
│ │ └── workflow_simulator.py # Workflow validation (15+ rules)
|
| 157 |
+
│ │
|
| 158 |
+
│ └── utils/
|
| 159 |
+
│ └── yaml_parser.py
|
| 160 |
+
│
|
| 161 |
+
└── tests/
|
| 162 |
+
├── conftest.py
|
| 163 |
+
├── test_endpoints.py
|
| 164 |
+
└── test_determinism.py
|
| 165 |
+
```
|
| 166 |
+
|
| 167 |
+
## Expected Baseline Scores
|
| 168 |
+
|
| 169 |
+
| Task | Expected |
|
| 170 |
+
|------|----------|
|
| 171 |
+
| dockerfile_syntax | 0.70 |
|
| 172 |
+
| dockerfile_runtime | 0.55 |
|
| 173 |
+
| workflow_syntax_structure | 0.65 |
|
| 174 |
+
| workflow_secrets_permissions | 0.50 |
|
| 175 |
+
| ci_docker_integration | 0.45 |
|
| 176 |
+
| multi_stage_pipeline_matrix | 0.30 |
|
| 177 |
+
|
| 178 |
+
## Design Decisions
|
| 179 |
+
|
| 180 |
+
1. **Combined Docker + GitHub Actions**: The intersection of these tools is the most painful real-world failure mode
|
| 181 |
+
2. **Simulated validation**: Static analysis instead of real Docker containers for speed and determinism
|
| 182 |
+
3. **Dense rewards**: Partial credit at every step rather than sparse pass/fail
|
| 183 |
+
4. **6 tasks (2+2+2)**: 2 Docker-only + 2 Workflow-only + 2 Combined with clear difficulty progression
|
| 184 |
+
5. **OpenAI client for baseline**: Required by hackathon specification
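Design decision 2 (simulated validation) can be illustrated with a small static check that needs no Docker daemon. The sketch below is not the project's `docker_simulator.py`; the function name `find_unknown_instructions` and the single rule it applies are assumptions chosen for illustration. It flags lines whose first word is not a recognised Dockerfile instruction and deliberately ignores heredocs and parser directives.

```python
from typing import List

# Instruction keywords defined by the Dockerfile reference
DOCKERFILE_INSTRUCTIONS = {
    "FROM", "RUN", "CMD", "LABEL", "MAINTAINER", "EXPOSE", "ENV", "ADD",
    "COPY", "ENTRYPOINT", "VOLUME", "USER", "WORKDIR", "ARG", "ONBUILD",
    "STOPSIGNAL", "HEALTHCHECK", "SHELL",
}


def find_unknown_instructions(dockerfile: str) -> List[str]:
    """Return one message per line that starts with an unrecognised instruction."""
    errors: List[str] = []
    continuing = False  # True while inside a backslash-continued instruction
    for number, raw in enumerate(dockerfile.split("\n"), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no instruction
        if not continuing:
            keyword = line.split()[0]
            if keyword.upper() not in DOCKERFILE_INSTRUCTIONS:
                errors.append(f"line {number}: unknown instruction '{keyword}'")
        # A trailing backslash means the next line belongs to this instruction
        continuing = line.endswith("\\")
    return errors
```

Because the check is a pure function of the file text, the same input always yields the same findings, which is the property the deterministic grader relies on.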
|
| 185 |
+
|
| 186 |
+
## License
|
| 187 |
+
|
| 188 |
+
MIT
|
baseline_runner.py
CHANGED
|
@@ -1,40 +1,152 @@
|
|
| 1 |
from __future__ import annotations
|
| 2 |
|
| 3 |
-
from typing import
|
| 4 |
|
| 5 |
from server.graders import run_grader
|
| 6 |
|
| 7 |
|
| 8 |
-
def
|
| 9 |
-
"""
|
| 10 |
|
| 11 |
-
|
| 12 |
-
|
| 13 |
|
| 14 |
task_ids: List[str]
|
| 15 |
if task_id:
|
| 16 |
task_ids = [task_id]
|
| 17 |
else:
|
| 18 |
-
task_ids =
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
"workflow_syntax_structure",
|
| 22 |
-
"workflow_secrets_permissions",
|
| 23 |
-
"ci_docker_integration",
|
| 24 |
-
"multi_stage_pipeline_matrix",
|
| 25 |
-
]
|
| 26 |
-
|
| 27 |
-
results = []
|
| 28 |
for tid in task_ids:
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
|
| 40 |
return results
|
| 1 |
+
"""Baseline runner for the /baseline endpoint.
|
| 2 |
+
|
| 3 |
+
Runs episodes using a simple heuristic agent (no LLM required).
|
| 4 |
+
The heuristic agent applies expected_fixes directly to demonstrate
|
| 5 |
+
that the environment and grader work correctly end-to-end.
|
| 6 |
+
"""
|
| 7 |
+
|
| 8 |
from __future__ import annotations
|
| 9 |
|
| 10 |
+
from typing import List, Optional
|
| 11 |
|
| 12 |
+
from server.environment import CICDDebugEnvironment
|
| 13 |
from server.graders import run_grader
|
| 14 |
+
from server.models import Action, ActionType, FileEdit, GraderResult
|
| 15 |
+
from server.tasks.task_registry import TASK_REGISTRY
|
| 16 |
|
| 17 |
|
| 18 |
+
def _heuristic_episode(env: CICDDebugEnvironment, task_id: str, scenario_id: Optional[str] = None) -> GraderResult:
|
| 19 |
+
"""Run one episode using a heuristic that applies expected fixes."""
|
| 20 |
+
obs = env.reset(task_id=task_id, scenario_id=scenario_id)
|
| 21 |
|
| 22 |
+
# Apply each expected fix as an edit_file action
|
| 23 |
+
for fix in env.expected_fixes:
|
| 24 |
+
if env.done:
|
| 25 |
+
break
|
| 26 |
+
file_path = fix["file"]
|
| 27 |
+
if file_path not in env.current_files:
|
| 28 |
+
continue
|
| 29 |
+
|
| 30 |
+
current_content = env.current_files[file_path].content
|
| 31 |
+
|
| 32 |
+
if fix["type"] == "contains":
|
| 33 |
+
# Need to ensure expected string is present
|
| 34 |
+
if fix["expected"] not in current_content:
|
| 35 |
+
# Try to find the broken line using hint
|
| 36 |
+
hint_text = fix.get("hint", "")
|
| 37 |
+
# Use edit_file with old/new content based on the fix
|
| 38 |
+
# We look at original files to find what changed
|
| 39 |
+
original_content = env.original_files.get(file_path)
|
| 40 |
+
if original_content:
|
| 41 |
+
lines = current_content.split("\n")
|
| 42 |
+
expected = fix["expected"]
|
| 43 |
+
line_num = fix.get("line")
|
| 44 |
+
|
| 45 |
+
if line_num and 1 <= line_num <= len(lines):
|
| 46 |
+
old_line = lines[line_num - 1]
|
| 47 |
+
action = Action(
|
| 48 |
+
action_type=ActionType.REPLACE_LINE,
|
| 49 |
+
edits=[FileEdit(
|
| 50 |
+
file_path=file_path,
|
| 51 |
+
line_number=line_num,
|
| 52 |
+
new_content=expected,
|
| 53 |
+
)],
|
| 54 |
+
)
|
| 55 |
+
else:
|
| 56 |
+
# Take the first line that shares enough characters with the expected text (a rough match, not necessarily the closest one)
|
| 57 |
+
best_line = None
|
| 58 |
+
best_idx = None
|
| 59 |
+
for i, line in enumerate(lines):
|
| 60 |
+
stripped = line.strip()
|
| 61 |
+
exp_stripped = expected.strip()
|
| 62 |
+
# Check if this line is a broken version of expected
|
| 63 |
+
if (stripped and exp_stripped and
|
| 64 |
+
len(set(stripped) & set(exp_stripped)) > len(exp_stripped) * 0.3):
|
| 65 |
+
if best_line is None:
|
| 66 |
+
best_line = line
|
| 67 |
+
best_idx = i
|
| 68 |
+
|
| 69 |
+
if best_line is not None:
|
| 70 |
+
action = Action(
|
| 71 |
+
action_type=ActionType.EDIT_FILE,
|
| 72 |
+
edits=[FileEdit(
|
| 73 |
+
file_path=file_path,
|
| 74 |
+
old_content=best_line,
|
| 75 |
+
new_content=expected,
|
| 76 |
+
)],
|
| 77 |
+
)
|
| 78 |
+
else:
|
| 79 |
+
# Append the expected content
|
| 80 |
+
action = Action(
|
| 81 |
+
action_type=ActionType.ADD_LINE,
|
| 82 |
+
edits=[FileEdit(
|
| 83 |
+
file_path=file_path,
|
| 84 |
+
new_content=expected,
|
| 85 |
+
)],
|
| 86 |
+
)
|
| 87 |
+
env.step(action)
|
| 88 |
+
|
| 89 |
+
elif fix["type"] == "not_contains":
|
| 90 |
+
# Need to ensure expected string is NOT present
|
| 91 |
+
if fix["expected"] in current_content:
|
| 92 |
+
action = Action(
|
| 93 |
+
action_type=ActionType.DELETE_BLOCK,
|
| 94 |
+
edits=[FileEdit(
|
| 95 |
+
file_path=file_path,
|
| 96 |
+
old_content=fix["expected"],
|
| 97 |
+
)],
|
| 98 |
+
)
|
| 99 |
+
env.step(action)
|
| 100 |
+
|
| 101 |
+
elif fix["type"] == "line_equals":
|
| 102 |
+
line_num = int(fix.get("line", 0))
|
| 103 |
+
if line_num >= 1:
|
| 104 |
+
action = Action(
|
| 105 |
+
action_type=ActionType.REPLACE_LINE,
|
| 106 |
+
edits=[FileEdit(
|
| 107 |
+
file_path=file_path,
|
| 108 |
+
line_number=line_num,
|
| 109 |
+
new_content=str(fix["expected"]),
|
| 110 |
+
)],
|
| 111 |
+
)
|
| 112 |
+
env.step(action)
|
| 113 |
|
| 114 |
+
# Submit if not already done
|
| 115 |
+
if not env.done:
|
| 116 |
+
env.step(Action(action_type=ActionType.SUBMIT))
|
| 117 |
+
|
| 118 |
+
return run_grader(task_id, env.trajectory)
|
| 119 |
+
|
| 120 |
+
|
| 121 |
+
def run_baseline_episodes(task_id: Optional[str] = None, num_episodes: int = 1) -> List[GraderResult]:
|
| 122 |
+
"""Run baseline episodes across tasks.
|
| 123 |
+
|
| 124 |
+
Args:
|
| 125 |
+
task_id: Specific task to run, or None for all tasks.
|
| 126 |
+
num_episodes: Number of episodes per task.
|
| 127 |
+
|
| 128 |
+
Returns:
|
| 129 |
+
List of GraderResult for each episode.
|
| 130 |
+
"""
|
| 131 |
task_ids: List[str]
|
| 132 |
if task_id:
|
| 133 |
+
if task_id not in TASK_REGISTRY:
|
| 134 |
+
raise ValueError(f"Unknown task: {task_id}")
|
| 135 |
task_ids = [task_id]
|
| 136 |
else:
|
| 137 |
+
task_ids = list(TASK_REGISTRY.keys())
|
| 138 |
+
|
| 139 |
+
results: List[GraderResult] = []
|
| 140 |
for tid in task_ids:
|
| 141 |
+
task_cls = TASK_REGISTRY[tid]
|
| 142 |
+
scenarios = task_cls.SCENARIOS
|
| 143 |
+
episodes_run = 0
|
| 144 |
+
for scenario in scenarios:
|
| 145 |
+
if episodes_run >= num_episodes:
|
| 146 |
+
break
|
| 147 |
+
env = CICDDebugEnvironment()
|
| 148 |
+
result = _heuristic_episode(env, tid, scenario["id"])
|
| 149 |
+
results.append(result)
|
| 150 |
+
episodes_run += 1
|
| 151 |
+
|
| 152 |
return results
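The heuristic above branches on three fix types: `contains`, `not_contains`, and `line_equals`. A minimal sketch of how such a fix could be checked against file contents is shown below. The function `fix_satisfied` is hypothetical and is not taken from `server/graders`; the dictionary keys (`file`, `type`, `expected`, `line`) mirror those read by `_heuristic_episode`, while the whitespace-insensitive comparison for `line_equals` is an assumption.

```python
from typing import Any, Dict


def fix_satisfied(fix: Dict[str, Any], files: Dict[str, str]) -> bool:
    """Check one expected fix against a mapping of file path to file content."""
    content = files.get(fix["file"])
    if content is None:
        return False  # the file named by the fix is not part of the episode
    expected = str(fix["expected"])

    if fix["type"] == "contains":
        return expected in content
    if fix["type"] == "not_contains":
        return expected not in content
    if fix["type"] == "line_equals":
        lines = content.split("\n")
        line_num = int(fix.get("line", 0))  # 1-based, as in the baseline runner
        if not 1 <= line_num <= len(lines):
            return False
        # Assumed: surrounding whitespace is ignored when comparing lines
        return lines[line_num - 1].strip() == expected.strip()
    return False  # unknown fix types never count as satisfied
```

Counting how many fixes return `True` before and after an action gives the kind of per-step progress signal that a dense reward needs.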
|
inference.py
CHANGED
|
@@ -1,8 +1,311 @@
|
|
| 1 |
-
"""Baseline inference
|
| 2 |
|
| 3 |
|
| 4 |
def main():
|
| 5 |
-
|
| 6 |
|
| 7 |
|
| 8 |
if __name__ == "__main__":
|
| 1 |
+
"""Baseline inference script for CI/CD Debug Environment.
|
| 2 |
+
|
| 3 |
+
Uses OpenAI-compatible client to call Llama 3.1 70B via HuggingFace router.
|
| 4 |
+
Required by OpenEnv specification.
|
| 5 |
+
|
| 6 |
+
Usage:
|
| 7 |
+
export API_BASE_URL=https://router.huggingface.co/v1
|
| 8 |
+
export MODEL_NAME=meta-llama/Llama-3.1-70B-Instruct
|
| 9 |
+
export HF_TOKEN=your_token_here
|
| 10 |
+
python inference.py
|
| 11 |
+
"""
|
| 12 |
+
|
| 13 |
+
from __future__ import annotations
|
| 14 |
+
|
| 15 |
+
import json
|
| 16 |
+
import os
|
| 17 |
+
import re
|
| 18 |
+
import sys
|
| 19 |
+
import time
|
| 20 |
+
from typing import Any, Dict, List, Optional
|
| 21 |
+
|
| 22 |
+
import requests
|
| 23 |
+
from openai import OpenAI
|
| 24 |
+
|
| 25 |
+
# ── Configuration ─────────────────────────────────────────────────
|
| 26 |
+
|
| 27 |
+
API_BASE_URL = os.environ.get("API_BASE_URL", "https://router.huggingface.co/v1")
|
| 28 |
+
MODEL_NAME = os.environ.get("MODEL_NAME", "meta-llama/Llama-3.1-70B-Instruct")
|
| 29 |
+
HF_TOKEN = os.environ.get("HF_TOKEN", "")
|
| 30 |
+
ENV_URL = os.environ.get("ENV_URL", "http://localhost:7860")
|
| 31 |
+
MAX_STEPS = 8 # leave 2 steps buffer before env hard-limit of 10
|
| 32 |
+
|
| 33 |
+
SYSTEM_PROMPT = """You are an expert DevOps engineer debugging CI/CD pipelines.
|
| 34 |
+
You will receive broken Dockerfile and/or GitHub Actions workflow files along with error messages.
|
| 35 |
+
|
| 36 |
+
Your job is to:
|
| 37 |
+
1. Analyze the error message carefully
|
| 38 |
+
2. Identify the root cause in the configuration files
|
| 39 |
+
3. Provide a precise fix
|
| 40 |
+
|
| 41 |
+
When you identify a fix, respond with a JSON object in this exact format:
|
| 42 |
+
{
|
| 43 |
+
"reasoning": "Brief explanation of the bug and fix",
|
| 44 |
+
"edits": [
|
| 45 |
+
{
|
| 46 |
+
"file_path": "path/to/file",
|
| 47 |
+
"old_content": "exact broken line or block",
|
| 48 |
+
"new_content": "corrected line or block"
|
| 49 |
+
}
|
| 50 |
+
]
|
| 51 |
+
}
|
| 52 |
+
|
| 53 |
+
If you believe all issues are fixed and want to submit, respond with:
|
| 54 |
+
{"action": "submit"}
|
| 55 |
+
|
| 56 |
+
If you need a hint, respond with:
|
| 57 |
+
{"action": "hint"}
|
| 58 |
+
|
| 59 |
+
Rules:
|
| 60 |
+
- Match old_content EXACTLY as it appears in the file (whitespace matters)
|
| 61 |
+
- Fix one issue at a time for precision
|
| 62 |
+
- Focus on the error message — it tells you exactly what's wrong
|
| 63 |
+
- Common issues: typos, wrong syntax, missing fields, wrong secret references
|
| 64 |
+
- For GitHub Actions: check secret syntax (${{ }} not ${ }), env blocks, permissions
|
| 65 |
+
- For Dockerfiles: check instruction syntax, file paths, base image tags
|
| 66 |
+
- Always respond with valid JSON only, no markdown fences"""
|
| 67 |
+
|
| 68 |
+
|
| 69 |
+
def create_client() -> OpenAI:
|
| 70 |
+
"""Create OpenAI-compatible client for HuggingFace router."""
|
| 71 |
+
return OpenAI(
|
| 72 |
+
base_url=API_BASE_URL,
|
| 73 |
+
api_key=HF_TOKEN or "dummy",
|
| 74 |
+
)
|
| 75 |
+
|
| 76 |
+
|
| 77 |
+
def env_request(method: str, endpoint: str, json_data: Optional[Dict] = None) -> Dict[str, Any]:
|
| 78 |
+
"""Make a request to the environment server."""
|
| 79 |
+
url = f"{ENV_URL}{endpoint}"
|
| 80 |
+
if method == "GET":
|
| 81 |
+
resp = requests.get(url, timeout=30)
|
| 82 |
+
else:
|
| 83 |
+
resp = requests.post(url, json=json_data or {}, timeout=30)
|
| 84 |
+
resp.raise_for_status()
|
| 85 |
+
return resp.json()
|
| 86 |
+
|
| 87 |
+
|
| 88 |
+
def format_observation(obs: Dict[str, Any]) -> str:
|
| 89 |
+
"""Format observation into a prompt for the LLM."""
|
| 90 |
+
parts = []
|
| 91 |
+
parts.append(f"Task: {obs.get('task_description', 'Unknown')}")
|
| 92 |
+
parts.append(f"Difficulty: {obs.get('difficulty', 'unknown')}")
|
| 93 |
+
parts.append(f"Step: {obs.get('step_number', 0)}/{obs.get('max_steps', 10)}")
|
| 94 |
+
parts.append(f"Issues fixed: {obs.get('issues_fixed', 0)}/{obs.get('total_issues', '?')}")
|
| 95 |
+
|
| 96 |
+
error = obs.get("error", {})
|
| 97 |
+
parts.append("\n--- ERROR ---")
|
| 98 |
+
parts.append(f"Phase: {error.get('phase', 'unknown')}")
|
| 99 |
+
parts.append(f"Message: {error.get('error_message', 'No error')}")
|
| 100 |
+
if error.get("failed_step"):
|
| 101 |
+
parts.append(f"Failed step: {error['failed_step']}")
|
| 102 |
+
if error.get("line_hint"):
|
| 103 |
+
parts.append(f"Line hint: {error['line_hint']}")
|
| 104 |
+
|
| 105 |
+
parts.append("\n--- FILES ---")
|
| 106 |
+
for f in obs.get("files", []):
|
| 107 |
+
parts.append(f"\n=== {f['path']} ({f.get('file_type', 'unknown')}) ===")
|
| 108 |
+
content = f.get("content", "")
|
| 109 |
+
lines = content.split("\n")
|
| 110 |
+
for i, line in enumerate(lines, 1):
|
| 111 |
+
parts.append(f"{i:3d} | {line}")
|
| 112 |
+
|
| 113 |
+
if obs.get("available_secrets"):
|
| 114 |
+
parts.append("\n--- AVAILABLE SECRETS ---")
|
| 115 |
+
parts.append(", ".join(obs["available_secrets"]))
|
| 116 |
+
|
| 117 |
+
if obs.get("last_action_feedback"):
|
| 118 |
+
parts.append("\n--- LAST ACTION FEEDBACK ---")
|
| 119 |
+
parts.append(obs["last_action_feedback"])
|
| 120 |
+
|
| 121 |
+
return "\n".join(parts)
|
| 122 |
+
|
| 123 |
+
|
| 124 |
+
def parse_llm_response(text: str) -> Dict[str, Any]:
|
| 125 |
+
"""Parse LLM response into an action dict."""
|
| 126 |
+
text = text.strip()
|
| 127 |
+
|
| 128 |
+
# Strip markdown code fences if present
|
| 129 |
+
if text.startswith("```"):
|
| 130 |
+
lines = text.split("\n")
|
| 131 |
+
lines = [l for l in lines if not l.strip().startswith("```")]
|
| 132 |
+
text = "\n".join(lines).strip()
|
| 133 |
+
|
| 134 |
+
# Try to find JSON in the response
|
| 135 |
+
json_match = re.search(r'\{[\s\S]*\}', text)
|
| 136 |
+
if json_match:
|
| 137 |
+
try:
|
| 138 |
+
return json.loads(json_match.group())
|
| 139 |
+
except json.JSONDecodeError:
|
| 140 |
+
pass
|
| 141 |
+
|
| 142 |
+
# Fallback: treat as submit
|
| 143 |
+
return {"action": "submit"}
|
| 144 |
+
|
| 145 |
+
|
| 146 |
+
def build_action(parsed: Dict[str, Any]) -> Dict[str, Any]:
|
| 147 |
+
"""Convert parsed LLM response to environment action format."""
|
| 148 |
+
if parsed.get("action") == "submit":
|
| 149 |
+
return {"action_type": "submit"}
|
| 150 |
+
if parsed.get("action") == "hint":
|
| 151 |
+
return {"action_type": "request_hint"}
|
| 152 |
+
|
| 153 |
+
edits = parsed.get("edits", [])
|
| 154 |
+
if not edits:
|
| 155 |
+
return {"action_type": "submit"}
|
| 156 |
+
|
| 157 |
+
return {
|
| 158 |
+
"action_type": "edit_file",
|
| 159 |
+
"edits": [
|
| 160 |
+
{
|
| 161 |
+
"file_path": e.get("file_path", ""),
|
| 162 |
+
"old_content": e.get("old_content", ""),
|
| 163 |
+
"new_content": e.get("new_content", ""),
|
| 164 |
+
}
|
| 165 |
+
for e in edits
|
| 166 |
+
],
|
| 167 |
+
}
|
| 168 |
+
|
| 169 |
+
|
| 170 |
+
def run_episode(client: OpenAI, task_id: Optional[str] = None, scenario_id: Optional[str] = None) -> Dict[str, Any]:
|
| 171 |
+
"""Run a single episode: reset, loop (observe -> LLM -> act), grade."""
|
| 172 |
+
reset_payload: Dict[str, Any] = {}
|
| 173 |
+
if task_id:
|
| 174 |
+
reset_payload["task_id"] = task_id
|
| 175 |
+
if scenario_id:
|
| 176 |
+
reset_payload["scenario_id"] = scenario_id
|
| 177 |
+
|
| 178 |
+
reset_resp = env_request("POST", "/reset", reset_payload)
|
| 179 |
+
obs = reset_resp["observation"]
|
| 180 |
+
info = reset_resp.get("info", {})
|
| 181 |
+
|
| 182 |
+
actual_task_id = info.get("task_id", task_id or "unknown")
|
| 183 |
+
actual_scenario_id = info.get("scenario_id", scenario_id or "unknown")
|
| 184 |
+
|
| 185 |
+
print(f" Episode: task={actual_task_id}, scenario={actual_scenario_id}")
|
| 186 |
+
|
| 187 |
+
messages = [{"role": "system", "content": SYSTEM_PROMPT}]
|
| 188 |
+
trajectory = []
|
| 189 |
+
|
| 190 |
+
for step_num in range(MAX_STEPS):
|
| 191 |
+
user_msg = format_observation(obs)
|
| 192 |
+
messages.append({"role": "user", "content": user_msg})
|
| 193 |
+
|
| 194 |
+
try:
|
| 195 |
+
completion = client.chat.completions.create(
|
| 196 |
+
model=MODEL_NAME,
|
| 197 |
+
messages=messages,
|
| 198 |
+
temperature=0.1,
|
| 199 |
+
max_tokens=1024,
|
| 200 |
+
)
|
| 201 |
+
llm_text = completion.choices[0].message.content or '{"action": "submit"}'
|
| 202 |
+
except Exception as e:
|
| 203 |
+
print(f" LLM error at step {step_num + 1}: {e}")
|
| 204 |
+
llm_text = '{"action": "submit"}'
|
| 205 |
+
|
| 206 |
+
messages.append({"role": "assistant", "content": llm_text})
|
| 207 |
+
|
| 208 |
+
parsed = parse_llm_response(llm_text)
|
| 209 |
+
action = build_action(parsed)
|
| 210 |
+
|
| 211 |
+
print(f" Step {step_num + 1}: {action['action_type']}", end="")
|
| 212 |
+
|
| 213 |
+
step_resp = env_request("POST", "/step", {"action": action})
|
| 214 |
+
obs = step_resp["observation"]
|
| 215 |
+
reward = step_resp.get("reward", 0.0)
|
| 216 |
+
done = step_resp.get("done", False)
|
| 217 |
+
step_info = step_resp.get("info", {})
|
| 218 |
+
|
| 219 |
+
print(f" -> reward={reward:.2f}, fixed={step_info.get('issues_fixed', '?')}/{step_info.get('issues_total', '?')}")
|
| 220 |
+
|
| 221 |
+
trajectory.append({
|
| 222 |
+
"step": step_num + 1,
|
| 223 |
+
"action": action,
|
| 224 |
+
"reward": reward,
|
| 225 |
+
"done": done,
|
| 226 |
+
"info": step_info,
|
| 227 |
+
})
|
| 228 |
+
|
| 229 |
+
if done:
|
| 230 |
+
break
|
| 231 |
+
|
| 232 |
+
# Grade the trajectory
|
| 233 |
+
grade_resp = env_request("POST", "/grader", {
|
| 234 |
+
"task_id": actual_task_id,
|
| 235 |
+
"trajectory": trajectory,
|
| 236 |
+
})
|
| 237 |
+
result = grade_resp.get("result", {})
|
| 238 |
+
score = result.get("score", 0.0)
|
| 239 |
+
print(f" Score: {score:.3f} | {result.get('feedback', '')}")
|
| 240 |
+
return result
|
| 241 |
+
|
| 242 |
+
|
| 243 |
+
def run_all_tasks(client: OpenAI) -> Dict[str, float]:
|
| 244 |
+
"""Run baseline on all tasks and report scores."""
|
| 245 |
+
tasks_resp = env_request("GET", "/tasks")
|
| 246 |
+
tasks = tasks_resp.get("tasks", [])
|
| 247 |
+
|
| 248 |
+
scores: Dict[str, List[float]] = {}
|
| 249 |
+
|
| 250 |
+
for task in tasks:
|
| 251 |
+
task_id = task["id"]
|
| 252 |
+
print(f"\n{'='*60}")
|
| 253 |
+
print(f"Task: {task['name']} ({task['difficulty']})")
|
| 254 |
+
print(f"{'='*60}")
|
| 255 |
+
|
| 256 |
+
task_scores = []
|
| 257 |
+
# Run one episode per task for baseline
|
| 258 |
+
result = run_episode(client, task_id=task_id)
|
| 259 |
+
task_scores.append(result.get("score", 0.0))
|
| 260 |
+
scores[task_id] = task_scores
|
| 261 |
+
|
| 262 |
+
# Summary
|
| 263 |
+
print(f"\n{'='*60}")
|
| 264 |
+
print("BASELINE RESULTS SUMMARY")
|
| 265 |
+
print(f"{'='*60}")
|
| 266 |
+
avg_scores = {}
|
| 267 |
+
for task_id, task_scores in scores.items():
|
| 268 |
+
avg = sum(task_scores) / len(task_scores) if task_scores else 0.0
|
| 269 |
+
avg_scores[task_id] = avg
|
| 270 |
+
print(f" {task_id:40s} {avg:.3f}")
|
| 271 |
+
|
| 272 |
+
overall = sum(avg_scores.values()) / len(avg_scores) if avg_scores else 0.0
|
| 273 |
+
print(f" {'OVERALL':40s} {overall:.3f}")
|
| 274 |
+
|
| 275 |
+
return avg_scores
|
| 276 |
|
| 277 |
|
| 278 |
def main():
|
| 279 |
+
"""Entry point for baseline inference."""
|
| 280 |
+
print("CI/CD Debug Environment - Baseline Inference")
|
| 281 |
+
print(f"API: {API_BASE_URL}")
|
| 282 |
+
print(f"Model: {MODEL_NAME}")
|
| 283 |
+
print(f"Environment: {ENV_URL}")
|
| 284 |
+
|
| 285 |
+
if not HF_TOKEN:
|
| 286 |
+
print("\nWARNING: HF_TOKEN not set. Set it via: export HF_TOKEN=your_token_here")
|
| 287 |
+
print("Continuing anyway (will fail if auth is required)...\n")
|
| 288 |
+
|
| 289 |
+
# Verify environment is running
|
| 290 |
+
try:
|
| 291 |
+
health = env_request("GET", "/")
|
| 292 |
+
print(f"Environment status: {health.get('status', 'unknown')}\n")
|
| 293 |
+
except Exception as e:
|
| 294 |
+
print(f"\nERROR: Cannot connect to environment at {ENV_URL}")
|
| 295 |
+
print(f" {e}")
|
| 296 |
+
print("\nStart the server first:")
|
| 297 |
+
print(" python -m uvicorn server.main:app --host 0.0.0.0 --port 7860")
|
| 298 |
+
sys.exit(1)
|
| 299 |
+
|
| 300 |
+
client = create_client()
|
| 301 |
+
|
| 302 |
+
# If a specific task is requested via CLI arg
|
| 303 |
+
if len(sys.argv) > 1:
|
| 304 |
+
task_id = sys.argv[1]
|
| 305 |
+
scenario_id = sys.argv[2] if len(sys.argv) > 2 else None
|
| 306 |
+
run_episode(client, task_id=task_id, scenario_id=scenario_id)
|
| 307 |
+
else:
|
| 308 |
+
run_all_tasks(client)
|
| 309 |
|
| 310 |
|
| 311 |
if __name__ == "__main__":
|
requirements.txt
CHANGED
|
Binary files a/requirements.txt and b/requirements.txt differ
|
|
|
tests/test_baseline.py
ADDED
|
@@ -0,0 +1,52 @@
|
| 1 |
+
"""Tests for baseline_runner and inference helpers."""
|
| 2 |
+
|
| 3 |
+
from baseline_runner import run_baseline_episodes, _heuristic_episode
|
| 4 |
+
from server.environment import CICDDebugEnvironment
|
| 5 |
+
from server.tasks.task_registry import TASK_REGISTRY
|
| 6 |
+
|
| 7 |
+
|
| 8 |
+
def test_heuristic_baseline_scores_above_zero_on_most_scenarios():
|
| 9 |
+
"""Heuristic baseline should score > 0 on most scenarios.
|
| 10 |
+
|
| 11 |
+
Some scenarios (e.g. reordering steps) can't be solved by simple
|
| 12 |
+
contains-based heuristics, so we allow a few zeros.
|
| 13 |
+
"""
|
| 14 |
+
total = 0
|
| 15 |
+
nonzero = 0
|
| 16 |
+
for task_id, task_cls in TASK_REGISTRY.items():
|
| 17 |
+
for scenario in task_cls.SCENARIOS:
|
| 18 |
+
env = CICDDebugEnvironment()
|
| 19 |
+
result = _heuristic_episode(env, task_id, scenario["id"])
|
| 20 |
+
total += 1
|
| 21 |
+
if result.score > 0.0:
|
| 22 |
+
nonzero += 1
|
| 23 |
+
# At least 80% of scenarios should get > 0
|
| 24 |
+
assert nonzero / total >= 0.8, f"Only {nonzero}/{total} scenarios scored > 0"
|
| 25 |
+
|
| 26 |
+
|
| 27 |
+
def test_run_baseline_episodes_single_task():
|
| 28 |
+
results = run_baseline_episodes(task_id="dockerfile_syntax", num_episodes=1)
|
| 29 |
+
assert len(results) == 1
|
| 30 |
+
assert results[0].task_id == "dockerfile_syntax"
|
| 31 |
+
assert results[0].score >= 0.0
|
| 32 |
+
|
| 33 |
+
|
| 34 |
+
def test_run_baseline_episodes_all_tasks():
|
| 35 |
+
results = run_baseline_episodes(task_id=None, num_episodes=1)
|
| 36 |
+
assert len(results) == len(TASK_REGISTRY)
|
| 37 |
+
task_ids_seen = {r.task_id for r in results}
|
| 38 |
+
assert task_ids_seen == set(TASK_REGISTRY.keys())
|
| 39 |
+
|
| 40 |
+
|
| 41 |
+
def test_heuristic_fixes_easy_tasks_well():
|
| 42 |
+
"""Easy tasks should average >= 0.3 with the heuristic baseline."""
|
| 43 |
+
easy_tasks = [tid for tid, cls in TASK_REGISTRY.items() if cls.DIFFICULTY.value == "easy"]
|
| 44 |
+
for task_id in easy_tasks:
|
| 45 |
+
task_cls = TASK_REGISTRY[task_id]
|
| 46 |
+
scores = []
|
| 47 |
+
for scenario in task_cls.SCENARIOS:
|
| 48 |
+
env = CICDDebugEnvironment()
|
| 49 |
+
result = _heuristic_episode(env, task_id, scenario["id"])
|
| 50 |
+
scores.append(result.score)
|
| 51 |
+
avg = sum(scores) / len(scores)
|
| 52 |
+
assert avg >= 0.3, f"Easy task {task_id} avg score {avg:.2f} too low"
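The module docstring of tests/test_baseline.py mentions "inference helpers", but the tests in this diff only exercise baseline_runner. A minimal sketch of what helper tests could look like follows. It assumes nothing beyond the `parse_llm_response` and `build_action` logic shown in inference.py above; both functions are reproduced here as standalone copies so the block runs on its own, and the test names and the `FENCE` constant are hypothetical, not part of the commit.

```python
import json
import re
from typing import Any, Dict

# Built indirectly so the sketch never contains a literal fence marker (assumption for illustration).
FENCE = "`" * 3


def parse_llm_response(text: str) -> Dict[str, Any]:
    """Standalone copy of the inference.py helper, same logic as in the diff."""
    text = text.strip()
    # Drop markdown fence lines if the model wrapped its JSON in them.
    if text.startswith(FENCE):
        kept = [ln for ln in text.split("\n") if not ln.strip().startswith(FENCE)]
        text = "\n".join(kept).strip()
    # Greedy match from the first "{" to the last "}".
    match = re.search(r"\{[\s\S]*\}", text)
    if match:
        try:
            return json.loads(match.group())
        except json.JSONDecodeError:
            pass
    # Fallback used by the original: treat anything unparseable as a submit.
    return {"action": "submit"}


def build_action(parsed: Dict[str, Any]) -> Dict[str, Any]:
    """Standalone copy of the inference.py helper, same logic as in the diff."""
    if parsed.get("action") == "submit":
        return {"action_type": "submit"}
    if parsed.get("action") == "hint":
        return {"action_type": "request_hint"}
    edits = parsed.get("edits", [])
    if not edits:
        return {"action_type": "submit"}
    return {
        "action_type": "edit_file",
        "edits": [
            {
                "file_path": e.get("file_path", ""),
                "old_content": e.get("old_content", ""),
                "new_content": e.get("new_content", ""),
            }
            for e in edits
        ],
    }


def test_fenced_hint_is_unwrapped():
    raw = FENCE + "json\n" + '{"action": "hint"}' + "\n" + FENCE
    assert build_action(parse_llm_response(raw)) == {"action_type": "request_hint"}


def test_unparseable_text_falls_back_to_submit():
    assert parse_llm_response("I cannot help with that.") == {"action": "submit"}
    assert parse_llm_response("{not json}") == {"action": "submit"}


def test_edit_payload_becomes_edit_file_action():
    raw = json.dumps({
        "reasoning": "typo in filename",
        "edits": [{
            "file_path": "Dockerfile",
            "old_content": "COPY requirments.txt .",
            "new_content": "COPY requirements.txt .",
        }],
    })
    action = build_action(parse_llm_response(raw))
    assert action["action_type"] == "edit_file"
    assert action["edits"][0]["file_path"] == "Dockerfile"
```

In the real test file the two helpers would presumably be imported from inference.py rather than copied. Note that the fallback means a garbled model reply ends the episode with a submit, which these tests make explicit.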
|
tests/test_endpoints.py
CHANGED
|
@@ -1,3 +1,5 @@
|
| 1 |
from fastapi.testclient import TestClient
|
| 2 |
|
| 3 |
from server.main import app
|
|
@@ -8,20 +10,137 @@ client = TestClient(app)
|
|
| 8 |
def test_root_health():
|
| 9 |
response = client.get("/")
|
| 10 |
assert response.status_code == 200
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
def test_reset_and_state():
|
| 15 |
-
reset = client.post("/reset", json={})
|
| 16 |
-
assert reset.status_code == 200
|
| 17 |
-
state = client.get("/state")
|
| 18 |
-
assert state.status_code == 200
|
| 19 |
|
| 20 |
|
| 21 |
-
def
|
| 22 |
info = client.get("/info")
|
| 23 |
assert info.status_code == 200
|
| 24 |
-
|
| 25 |
tasks = client.get("/tasks")
|
| 26 |
assert tasks.status_code == 200
|
| 27 |
-
|
| 1 |
+
"""Endpoint tests for the FastAPI server."""
|
| 2 |
+
|
| 3 |
from fastapi.testclient import TestClient
|
| 4 |
|
| 5 |
from server.main import app
|
|
|
|
| 10 |
def test_root_health():
|
| 11 |
response = client.get("/")
|
| 12 |
assert response.status_code == 200
|
| 13 |
+
data = response.json()
|
| 14 |
+
assert data["status"] == "healthy"
|
| 15 |
+
assert data["environment"] == "cicd-debug-env"
|
| 16 |
|
| 17 |
|
| 18 |
+
def test_info_returns_all_tasks():
|
| 19 |
info = client.get("/info")
|
| 20 |
assert info.status_code == 200
|
| 21 |
+
data = info.json()
|
| 22 |
+
assert len(data.get("tasks", [])) >= 6
|
| 23 |
+
assert "action_space" in data
|
| 24 |
+
assert "observation_space" in data
|
| 25 |
+
|
| 26 |
+
|
| 27 |
+
def test_tasks_endpoint():
|
| 28 |
tasks = client.get("/tasks")
|
| 29 |
assert tasks.status_code == 200
|
| 30 |
+
data = tasks.json()
|
| 31 |
+
assert len(data.get("tasks", [])) >= 6
|
| 32 |
+
task_ids = [t["id"] for t in data["tasks"]]
|
| 33 |
+
assert "dockerfile_syntax" in task_ids
|
| 34 |
+
assert "multi_stage_pipeline_matrix" in task_ids
|
| 35 |
+
|
| 36 |
+
|
| 37 |
+
def test_reset_default():
|
| 38 |
+
resp = client.post("/reset", json={})
|
| 39 |
+
assert resp.status_code == 200
|
| 40 |
+
data = resp.json()
|
| 41 |
+
assert "observation" in data
|
| 42 |
+
obs = data["observation"]
|
| 43 |
+
assert obs["total_issues"] >= 1
|
| 44 |
+
assert obs["step_number"] == 0
|
| 45 |
+
|
| 46 |
+
|
| 47 |
+
def test_reset_specific_task():
|
| 48 |
+
resp = client.post("/reset", json={"task_id": "dockerfile_syntax", "scenario_id": "typo_filename"})
|
| 49 |
+
assert resp.status_code == 200
|
| 50 |
+
obs = resp.json()["observation"]
|
| 51 |
+
assert obs["task_id"] == "dockerfile_syntax"
|
| 52 |
+
|
| 53 |
+
|
| 54 |
+
def test_reset_with_seed():
|
| 55 |
+
resp1 = client.post("/reset", json={"seed": 99})
|
| 56 |
+
resp2 = client.post("/reset", json={"seed": 99})
|
| 57 |
+
assert resp1.json()["observation"]["task_id"] == resp2.json()["observation"]["task_id"]
|
| 58 |
+
|
| 59 |
+
|
| 60 |
+
def test_reset_invalid_task():
|
| 61 |
+
resp = client.post("/reset", json={"task_id": "nonexistent_task"})
|
| 62 |
+
assert resp.status_code == 400
|
| 63 |
+
|
| 64 |
+
|
| 65 |
+
def test_state_without_reset():
|
| 66 |
+
# Force a fresh app state by not resetting — this test relies on prior reset
|
| 67 |
+
# Just verify the endpoint returns 200 (prior test did a reset)
|
| 68 |
+
resp = client.get("/state")
|
| 69 |
+
assert resp.status_code == 200
|
| 70 |
+
data = resp.json()
|
| 71 |
+
assert "observation" in data
|
| 72 |
+
assert "episode_reward" in data
|
| 73 |
+
|
| 74 |
+
|
| 75 |
+
def test_step_edit_file():
|
| 76 |
+
client.post("/reset", json={"task_id": "dockerfile_syntax", "scenario_id": "typo_filename"})
|
| 77 |
+
resp = client.post("/step", json={
|
| 78 |
+
"action": {
|
| 79 |
+
"action_type": "edit_file",
|
| 80 |
+
"edits": [{
|
| 81 |
+
"file_path": "Dockerfile",
|
| 82 |
+
"old_content": "COPY requirments.txt .",
|
| 83 |
+
"new_content": "COPY requirements.txt .",
|
| 84 |
+
}],
|
| 85 |
+
}
|
| 86 |
+
})
|
| 87 |
+
assert resp.status_code == 200
|
| 88 |
+
data = resp.json()
|
| 89 |
+
assert data["reward"] > 0
|
| 90 |
+
assert data["info"]["issues_fixed"] >= 1
|
| 91 |
+
|
| 92 |
+
|
| 93 |
+
def test_step_submit():
|
| 94 |
+
client.post("/reset", json={"task_id": "dockerfile_syntax"})
|
| 95 |
+
resp = client.post("/step", json={"action": {"action_type": "submit"}})
|
| 96 |
+
assert resp.status_code == 200
|
| 97 |
+
assert resp.json()["done"] is True
|
| 98 |
+
|
| 99 |
+
|
| 100 |
+
def test_step_request_hint():
|
| 101 |
+
client.post("/reset", json={"task_id": "dockerfile_syntax"})
|
| 102 |
+
resp = client.post("/step", json={"action": {"action_type": "request_hint"}})
|
| 103 |
+
assert resp.status_code == 200
|
| 104 |
+
obs = resp.json()["observation"]
|
| 105 |
+
assert obs["hints_used"] == 1
|
| 106 |
+
assert "Hint" in (obs.get("last_action_feedback") or "")
|
| 107 |
+
|
| 108 |
+
|
| 109 |
+
def test_grader_endpoint():
|
| 110 |
+
trajectory = [
|
| 111 |
+
{"step": 1, "action": {"action_type": "edit_file", "edits": [{"file_path": "Dockerfile"}]},
|
| 112 |
+
"reward": 0.3, "done": True, "info": {"issues_fixed": 1, "issues_total": 1}},
|
| 113 |
+
]
|
| 114 |
+
resp = client.post("/grader", json={"task_id": "dockerfile_syntax", "trajectory": trajectory})
|
| 115 |
+
assert resp.status_code == 200
|
| 116 |
+
result = resp.json()["result"]
|
| 117 |
+
assert result["score"] == 1.0
|
| 118 |
+
|
| 119 |
+
|
| 120 |
+
def test_grader_empty_trajectory():
|
| 121 |
+
resp = client.post("/grader", json={"task_id": "dockerfile_syntax", "trajectory": []})
|
| 122 |
+
assert resp.status_code == 200
|
| 123 |
+
assert resp.json()["result"]["score"] == 0.0
|
| 124 |
+
|
| 125 |
+
|
| 126 |
+
def test_full_episode_via_api():
|
| 127 |
+
"""Full episode: reset -> edit -> submit -> verify score."""
|
| 128 |
+
client.post("/reset", json={"task_id": "dockerfile_syntax", "scenario_id": "typo_filename"})
|
| 129 |
+
|
| 130 |
+
client.post("/step", json={
|
| 131 |
+
"action": {
|
| 132 |
+
"action_type": "edit_file",
|
| 133 |
+
"edits": [{
|
| 134 |
+
"file_path": "Dockerfile",
|
| 135 |
+
"old_content": "COPY requirments.txt .",
|
| 136 |
+
"new_content": "COPY requirements.txt .",
|
| 137 |
+
}],
|
| 138 |
+
}
|
| 139 |
+
})
|
| 140 |
+
|
| 141 |
+
resp = client.post("/step", json={"action": {"action_type": "submit"}})
|
| 142 |
+
assert resp.json()["done"] is True
|
| 143 |
+
|
| 144 |
+
state = client.get("/state")
|
| 145 |
+
assert state.json()["done"] is True
|
| 146 |
+
assert state.json()["episode_reward"] > 0
|