Spaces:

jester1177
/

cloudnative-devops-debug-env


Krishna1107 committed on Apr 5

Commit

d129f63

1 Parent(s): 8886ce5

updated README with details

Files changed (2)

.gitignore +1 -0
README.md +303 -100

.gitignore CHANGED

@@ -2,6 +2,7 @@
 __pycache__/
 *.py[cod]
 *$py.class
 # Virtual environments
 .venv/

 __pycache__/
 *.py[cod]
 *$py.class
+.claude/
 # Virtual environments
 .venv/

README.md CHANGED

@@ -12,78 +12,267 @@ pinned: false
 An OpenEnv-compatible environment where AI agents learn to debug broken GitHub Actions workflows and Dockerfiles. Built for the OpenEnv Hackathon by Scaler School of Technology (partners: Meta, HuggingFace, PyTorch).
-## What It Does
-Agents receive:
-- Broken configuration files (Dockerfile, GitHub Actions YAML)
-- Error messages from failed builds/workflows
-- Context about available secrets and runner environment
-Agents must analyze errors, identify root causes, edit files to fix issues, and submit solutions. The environment provides dense reward feedback at every step.
-## Tasks
-| # | Task ID | Description | Difficulty | Scenarios |
-|---|---------|-------------|------------|-----------|
-| 1 | `dockerfile_syntax` | Fix Dockerfile instruction/syntax errors | Easy | 5 |
-| 2 | `dockerfile_runtime` | Fix Dockerfile runtime/execution issues | Medium | 5 |
-| 3 | `workflow_syntax_structure` | Fix GitHub Actions YAML structure | Easy | 5 |
-| 4 | `workflow_secrets_permissions` | Fix secret wiring and permissions | Medium | 5 |
-| 5 | `ci_docker_integration` | Debug combined CI + Docker failures | Medium-Hard | 5 |
-| 6 | `multi_stage_pipeline_matrix` | Debug multi-stage and matrix pipelines | Hard | 5 |
-30 total scenarios across 6 tasks with clear difficulty progression.
-## API Endpoints
-| Endpoint | Method | Description |
-|----------|--------|-------------|
-| `/` | GET | Health check |
-| `/reset` | POST | Start a new episode (optional: `task_id`, `scenario_id`, `seed`) |
-| `/step` | POST | Take an action (`edit_file`, `replace_line`, `add_line`, `delete_line`, `submit`, `request_hint`) |
-| `/state` | GET | Get current observation |
-| `/info` | GET | Environment metadata and schemas |
-| `/tasks` | GET | List all tasks |
-| `/grader` | POST | Grade a trajectory |
-| `/baseline` | POST | Run built-in heuristic baseline |
-## Grading
-Scoring is **deterministic** and **dynamic** (same actions = same score, different actions = different scores).
-| Component | Weight | Description |
-|-----------|--------|-------------|
-| Partial fixes | 40% | Proportional to issues fixed |
-| Complete solution | 30% | Bonus when ALL issues fixed |
-| Efficiency | 30% | Bonus for minimal steps (decays with extra steps) |
-| Hint penalty | -5% each | Per hint requested |
-Score range: `0.0` (no progress) to `1.0` (all fixed efficiently).
-## Quick Start
-### Local Development
-```bash
-pip install -r requirements.txt
-python -m uvicorn server.main:app --host 0.0.0.0 --port 7860
 ```
-### Test Endpoints
-```bash
-# Health check
-curl http://localhost:7860/
-# List tasks
-curl http://localhost:7860/tasks
-# Start an episode
 curl -X POST http://localhost:7860/reset \
   -H "Content-Type: application/json" \
-  -d '{"task_id": "dockerfile_syntax"}'
-# Take an action
 curl -X POST http://localhost:7860/step \
   -H "Content-Type: application/json" \
   -d '{
@@ -97,10 +286,43 @@ curl -X POST http://localhost:7860/step \
     }
   }'
-# Submit solution
 curl -X POST http://localhost:7860/step \
   -H "Content-Type: application/json" \
   -d '{"action": {"action_type": "submit"}}'
 ```
 ### Run Tests
@@ -125,73 +347,54 @@ export HF_TOKEN=your_token_here
 python inference.py
 ```
-Run on a specific task:
-```bash
-python inference.py dockerfile_syntax
-```
 ## Project Structure
 ```
 cicd-debug-env/
-├── openenv.yaml              # OpenEnv metadata
-├── inference.py              # LLM baseline script
 ├── baseline_runner.py        # Heuristic baseline for /baseline endpoint
 ├── Dockerfile                # Production container
 ├── requirements.txt          # Python dependencies
-├── README.md
 │
 ├── server/
-│   ├── __init__.py
-│   ├── main.py               # FastAPI with all 8 endpoints
-│   ├── models.py             # Pydantic models
-│   ├── environment.py        # Core environment logic
-│   │
 │   ├── tasks/
-│   │   ├── base.py           # BaseTask class
-│   │   ├── task_registry.py  # Task registry
-│   │   ├── task_1_build_errors.py
-│   │   ├── task_2_docker_runtime.py
-│   │   ├── task_3_workflow_syntax.py
-│   │   ├── task_4_workflow_secrets_permissions.py
-│   │   ├── task_5_ci_docker_integration.py
-│   │   └── task_6_multi_stage_matrix.py
-│   │
 │   ├── graders/
-│   │   ├── __init__.py       # Deterministic grader
-│   │   └── base.py           # Base grader class
-│   │
-│   ├── simulators/
-│   │   ├── docker_simulator.py   # Dockerfile validation (15+ rules)
-│   │   └── workflow_simulator.py # Workflow validation (15+ rules)
-│   │
-│   └── utils/
-│       └── yaml_parser.py
 │
 └── tests/
-    ├── conftest.py
-    ├── test_endpoints.py
-    └── test_determinism.py
 ```
-## Expected Baseline Scores
-| Task | Expected |
-|------|----------|
-| dockerfile_syntax | 0.70 |
-| dockerfile_runtime | 0.55 |
-| workflow_syntax_structure | 0.65 |
-| workflow_secrets_permissions | 0.50 |
-| ci_docker_integration | 0.45 |
-| multi_stage_pipeline_matrix | 0.30 |
 ## Design Decisions
-1. **Combined Docker + GitHub Actions**: The intersection of these tools is the most painful real-world failure mode
-2. **Simulated validation**: Static analysis instead of real Docker containers for speed and determinism
-3. **Dense rewards**: Partial credit at every step rather than sparse pass/fail
-4. **6 tasks (2+2+2)**: 2 Docker-only + 2 Workflow-only + 2 Combined with clear difficulty progression
-5. **OpenAI client for baseline**: Required by hackathon specification
 ## License

 An OpenEnv-compatible environment where AI agents learn to debug broken GitHub Actions workflows and Dockerfiles. Built for the OpenEnv Hackathon by Scaler School of Technology (partners: Meta, HuggingFace, PyTorch).
+## Why CI/CD Debugging?
+Every developer who ships code hits CI/CD failures: a misconfigured Dockerfile, a broken GitHub Actions workflow, a missing secret. These bugs can eat hours of developer time, and they are hard to debug because:
+- Error messages are cryptic ("unable to prepare context: unable to evaluate symlinks")
+- The feedback loop is slow (push, wait for CI, read logs, fix, repeat)
+- Multiple config files interact in non-obvious ways (Dockerfile + workflow + secrets)
+This environment teaches AI agents to do what senior DevOps engineers do: read the error, trace it to the root cause, and fix it.
+---
+## How It Works: The Complete Flow
+```
+┌──────────────────────────────────────────────────────────────┐
+│  1. RESET                                                     │
+│     Agent receives:                                           │
+│     - Broken config files (Dockerfile / workflow YAML)        │
+│     - Error message from the failed build/deploy              │
+│     - Available secrets list                                  │
+│     - Number of issues to find                                │
+├──────────────────────────────────────────────────────────────┤
+│  2. OBSERVE → THINK → ACT  (repeat up to 10 steps)           │
+│     Agent reads the error, analyzes the files, then:          │
+│     - edit_file: replace broken content with fixed content    │
+│     - replace_line: fix a specific line number                │
+│     - add_line / add_block: insert missing content            │
+│     - delete_line / delete_block: remove bad content          │
+│     - request_hint: get a clue (-5% score penalty)            │
+│     - submit: "I'm done fixing"                               │
+│                                                               │
+│     After each action, agent gets:                            │
+│     - Updated file contents                                   │
+│     - Reward signal (+0.3 per fix, -0.02 for failed edits)   │
+│     - How many issues are now fixed                           │
+├──────────────────────────────────────────────────────────────┤
+│  3. GRADE                                                     │
+│     Deterministic scoring based on:                           │
+│     - What fraction of issues were fixed                      │
+│     - Whether ALL issues were fixed (bonus)                   │
+│     - How many steps it took (efficiency)                     │
+│     - How many hints were used (penalty)                      │
+└──────────────────────────────────────────────────────────────┘
+```
+---
+## The 6 Tasks (30 Scenarios)
+### Task 1: Dockerfile Syntax Errors — Easy
+Simple typos and instruction errors that break `docker build`. These are the bugs every developer makes on day one.
+| # | Scenario | What's Broken | Real-World Context |
+|---|----------|---------------|-------------------|
+| 1 | `typo_filename` | `COPY requirments.txt .` (misspelled filename) | A frequently reported Docker build error on Stack Overflow |
+| 2 | `invalid_base_image` | `FROM python:3.9-slimm` — extra 'm' in tag | Happens when copy-pasting image tags |
+| 3 | `invalid_run_syntax` | `RUN pip install ... \n && python setup.py` — broken line continuation | Formatting multi-line RUN commands is tricky |
+| 4 | `invalid_expose` | `EXPOSE "eighty"` (a string instead of a port number) | EXPOSE expects a numeric port, optionally followed by `/tcp` or `/udp` |
+| 5 | `missing_from_instruction` | No `FROM` instruction at all | Dockerfile must start with FROM (or ARG before FROM) |
+### Task 2: Dockerfile Runtime Errors — Medium
+The Dockerfile builds successfully, but the container crashes when you run it. These are harder because the error appears at runtime, not build time.
+| # | Scenario | What's Broken | Real-World Context |
+|---|----------|---------------|-------------------|
+| 1 | `missing_workdir` | No WORKDIR — files scatter to `/` | Container runs but `npm start` can't find `package.json` |
+| 2 | `cmd_entrypoint_conflict` | Both ENTRYPOINT and CMD defined as full commands | Process starts incorrectly; CMD should be args-only when ENTRYPOINT exists |
+| 3 | `entrypoint_not_executable` | Shell script lacks execute permission | `chmod +x` missing — "permission denied" at container start |
+| 4 | `missing_required_env` | App needs `DATABASE_URL` but it's not set | Container starts then crashes: "DATABASE_URL is not defined" |
+| 5 | `non_root_privileged_port` | Non-root user tries to bind port 80 | Security best practice (non-root) conflicts with port < 1024 |
+### Task 3: Workflow Syntax & Structure — Easy
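+GitHub Actions YAML has structural problems. In most of these scenarios GitHub rejects the workflow before any job runs; `checkout_after_build` instead fails at run time because the source code was never checked out.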
+| # | Scenario | What's Broken | Real-World Context |
+|---|----------|---------------|-------------------|
+| 1 | `checkout_after_build` | `docker build` runs before `actions/checkout` | No source code checked out — "Dockerfile not found" |
+| 2 | `missing_runs_on` | Job has no `runs-on` field | GitHub Actions rejects: every job needs a runner |
+| 3 | `invalid_trigger_syntax` | `branches: main` instead of `branches: [main]` | Must be a YAML list, not a scalar string |
+| 4 | `missing_step_uses_or_run` | Step has a name but no `uses:` or `run:` | Invalid step — must do something |
+| 5 | `missing_on_trigger` | No `on:` block at all | Workflow never triggers — GitHub doesn't know when to run it |
+### Task 4: Workflow Secrets & Permissions — Medium
+Secrets exist in the repository but aren't wired correctly to the workflow steps. These are the bugs that make you say "but the secret is right there!"
+| # | Scenario | What's Broken | Real-World Context |
+|---|----------|---------------|-------------------|
+| 1 | `missing_env_secrets` | `$DOCKER_PASSWORD` in `run:` but no `env:` mapping | Secrets must be explicitly passed via `env:` block |
+| 2 | `wrong_secret_syntax` | `${ secrets.TOKEN }` instead of `${{ secrets.TOKEN }}` | Single braces vs double braces — subtle syntax difference |
+| 3 | `missing_token_permissions` | Pushing to GHCR without `permissions: packages: write` | `GITHUB_TOKEN` defaults to read-only in new repositories since 2023 |
+| 4 | `secret_not_in_env` | `curl` uses `$SLACK_WEBHOOK_URL` but it's not in `env:` | Same pattern as #1 — very common mistake |
+| 5 | `ghcr_wrong_credentials` | Using `DOCKER_PASSWORD` for GHCR login | GHCR uses `GITHUB_TOKEN`, not Docker Hub credentials |
+### Task 5: CI + Docker Integration — Medium-Hard
+The workflow AND the Dockerfile interact. Fixing one file alone isn't enough — you need to understand how they work together.
+| # | Scenario | What's Broken | Real-World Context |
+|---|----------|---------------|-------------------|
+| 1 | `missing_buildx_for_platforms` | Multi-platform build without `setup-buildx-action` | The default builder typically can't produce multi-platform images; a Buildx (BuildKit) builder is needed |
+| 2 | `login_secrets_not_wired` | `docker login` step missing `env:` for secrets | Auth fails — "unauthorized: authentication required" |
+| 3 | `wrong_build_context` | Context is `./backend` but Dockerfile path is `./Dockerfile` | Path mismatch — build can't find the Dockerfile |
+| 4 | `cache_without_mode_max` | GHA cache export missing `mode=max` | Cache doesn't persist intermediate layers; slow rebuilds |
+| 5 | `push_without_login` | `docker push` without `docker login` first | "denied: requested access to the resource is denied" |
+### Task 6: Multi-Stage Pipeline & Matrix — Hard
+Complex pipelines with multiple interacting bugs. The agent must find and fix 2-3 issues across multiple files.
+| # | Scenario | What's Broken | Real-World Context |
+|---|----------|---------------|-------------------|
+| 1 | `artifact_path_mismatch` | `COPY --from=builder /app/dist` but React outputs to `/app/build` | Framework output directories vary — CRA uses `build/`, Vite uses `dist/` |
+| 2 | `matrix_platform_arg` | Uses `$BUILDPLATFORM` without `ARG BUILDPLATFORM` declaration | Multi-arch builds need platform ARGs declared before FROM |
+| 3 | `cross_job_artifact` | Test job downloads artifact but missing `needs: build` | Jobs run in parallel by default — artifact doesn't exist yet |
+| 4 | `multiple_issues` | Dockerfile typo + workflow secrets not wired (2 bugs) | Real debugging: problems compound across files |
+| 5 | `matrix_version_failure` | Matrix includes Node 14 but code needs >= 16 + missing `needs:` | Version compatibility + job ordering — 2 bugs to find |
+---
+## Available Actions
+Each step, the agent chooses exactly one action:
+| Action | What It Does | When to Use |
+|--------|-------------|-------------|
+| `edit_file` | Replace `old_content` with `new_content` in a file | Most common — fix a broken line or block |
+| `replace_line` | Replace content at a specific line number | When you know exactly which line is wrong |
+| `add_line` | Insert a new line into a file | Adding missing instructions (e.g., missing `WORKDIR`) |
+| `delete_line` | Remove a specific line | Removing a bad instruction |
+| `add_block` | Insert a multi-line block | Adding entire sections (e.g., `env:` block with secrets) |
+| `delete_block` | Remove a multi-line block | Removing incorrect sections |
+| `request_hint` | Get a clue about what's wrong | Costs -5% on final score — use sparingly |
+| `submit` | Declare "I'm done" — triggers final evaluation | When all fixes are applied |
+**Important:** `edit_file` requires `old_content` to match **exactly** (including whitespace). If it doesn't match, the edit fails and the agent gets a -0.02 reward penalty.
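The exact-match rule can be illustrated with a minimal Python sketch. This is not the environment's actual implementation: `apply_edit` and `EditResult` are hypothetical names, and replacing only the first occurrence is an assumption. The -0.02 penalty is the value quoted in this README.

```python
from dataclasses import dataclass

FAILED_EDIT_PENALTY = -0.02  # per-step penalty quoted in this README


@dataclass
class EditResult:
    content: str   # file content after the attempted edit
    applied: bool  # True if old_content was found and replaced
    reward: float  # penalty on failure; 0.0 here on success (validation not modeled)


def apply_edit(content: str, old_content: str, new_content: str) -> EditResult:
    """Replace the first exact occurrence of old_content, or report a failed edit.

    Hypothetical helper: whitespace is significant, so a single extra space
    in old_content is enough to make the edit fail.
    """
    if not old_content or old_content not in content:
        return EditResult(content, False, FAILED_EDIT_PENALTY)
    return EditResult(content.replace(old_content, new_content, 1), True, 0.0)


dockerfile = "FROM python:3.9-slim\nCOPY requirments.txt .\n"

# Exact match: the typo is fixed.
ok = apply_edit(dockerfile, "COPY requirments.txt .", "COPY requirements.txt .")

# Two spaces after COPY: no match, file unchanged, penalty applied.
miss = apply_edit(dockerfile, "COPY  requirments.txt .", "COPY requirements.txt .")
```

In this sketch `ok.applied` is true and `miss.applied` is false, which is the failure mode the baseline notes attribute to whitespace mismatches.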
+---
+## Grading System — How Scores Work
+Scoring is **deterministic** (same actions always produce the same score) and **dynamic** (different strategies get different scores).
+### The Formula
+```
+FINAL SCORE = Partial Fixes + Complete Bonus + Efficiency - Hint Penalty
 ```
+Clamped to `[0.0, 1.0]`.
+### Component Breakdown
+#### 1. Partial Fix Credit (40% max)
+```
+partial = 0.40 x (issues_fixed / issues_total)
+```
+| Fixed | Total | Partial Score |
+|-------|-------|---------------|
+| 0/2 | 2 | 0.00 |
+| 1/2 | 2 | 0.20 |
+| 2/2 | 2 | 0.40 |
+| 1/3 | 3 | 0.133 |
+#### 2. Complete Solution Bonus (30% max)
+```
+complete = 0.30  if ALL issues fixed
+complete = 0.00  otherwise
+```
+All-or-nothing. Fix 2/3 issues? You get 0. Fix 3/3? You get 0.30.
+#### 3. Efficiency Bonus (30% max)
+```
+if issues_fixed == 0:       efficiency = 0.00  (no credit for doing nothing)
+elif steps <= issues_total: efficiency = 0.30  (optimal: full bonus)
+else:                       efficiency = 0.30 - 0.03 x (steps - issues_total)
+```
+Rewards agents that fix issues quickly. The "optimal" number of steps equals the number of issues (one fix per step).
+| Issues | Steps Taken | Efficiency Score |
+|--------|-------------|-----------------|
+| 1 | 1 | 0.30 (optimal) |
+| 1 | 3 | 0.24 |
+| 1 | 8 | 0.09 |
+| 2 | 2 | 0.30 (optimal) |
+| 2 | 5 | 0.21 |
+| 0 fixed | any | 0.00 |
+#### 4. Hint Penalty (-5% each)
+```
+penalty = 0.05 x hints_used
+```
+Each `request_hint` action subtracts 0.05 (five percentage points) from the final score.
+### Score Examples
+| Scenario | Partial | Complete | Efficiency | Hint Penalty | **Final Score** |
+|----------|---------|----------|------------|--------------|-----------------|
+| Fixed 0/2 issues | 0.00 | 0.00 | 0.00 | 0.00 | **0.000** |
+| Fixed 1/2 in 3 steps | 0.20 | 0.00 | 0.27 | 0.00 | **0.470** |
+| Fixed 2/2 in 5 steps | 0.40 | 0.30 | 0.21 | 0.00 | **0.910** |
+| Fixed 1/1 in 1 step | 0.40 | 0.30 | 0.30 | 0.00 | **1.000** |
+| Fixed 1/1 + 2 hints | 0.40 | 0.30 | 0.30 | -0.10 | **0.900** |
+| Submitted immediately | 0.00 | 0.00 | 0.00 | 0.00 | **0.000** |
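The formula and the examples above can be reproduced with a short Python sketch. It follows the component definitions in this README; the function name `final_score` is hypothetical, and the floor of 0.0 on the efficiency term is an assumption (the README only states that the final score is clamped). How the real grader counts steps, for example whether hint requests count, is not specified here.

```python
def final_score(issues_fixed: int, issues_total: int, steps: int, hints_used: int = 0) -> float:
    """Sketch of the README scoring formula (hypothetical helper, not the real grader)."""
    if issues_total <= 0:
        raise ValueError("issues_total must be positive")

    # 1. Partial fix credit: up to 0.40, proportional to issues fixed.
    partial = 0.40 * issues_fixed / issues_total

    # 2. Complete solution bonus: all-or-nothing 0.30.
    complete = 0.30 if issues_fixed == issues_total else 0.0

    # 3. Efficiency bonus: 0.30 minus 0.03 per step beyond issues_total.
    if issues_fixed == 0:
        efficiency = 0.0
    else:
        extra_steps = max(0, steps - issues_total)
        efficiency = max(0.0, 0.30 - 0.03 * extra_steps)  # floor at 0 is an assumption

    # 4. Hint penalty: 0.05 per hint.
    penalty = 0.05 * hints_used

    # Clamp to [0.0, 1.0] as stated in the README.
    return min(1.0, max(0.0, partial + complete + efficiency - penalty))


# Rows from the Score Examples table.
examples = {
    "1/2 in 3 steps": round(final_score(1, 2, 3), 3),
    "2/2 in 5 steps": round(final_score(2, 2, 5), 3),
    "1/1 in 1 step": round(final_score(1, 1, 1), 3),
    "1/1 + 2 hints": round(final_score(1, 1, 1, hints_used=2), 3),
}
```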
+### Per-Step Rewards (Dense Feedback)
+The agent also gets **immediate rewards** after each action (not just at the end):
+| Event | Reward |
+|-------|--------|
+| Fix validated (issue resolved) | +0.3 per issue fixed |
+| Successful validation improvement | +0.1 |
+| Failed edit (old_content didn't match) | -0.02 |
+| Request hint | -0.05 |
+| Submit (terminal) | 0.0 |
+This dense reward signal is intended to help RL agents learn faster than they would from sparse pass/fail grading.
+---
+## API Endpoints
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/` | GET | Root health check |
+| `/health` | GET | OpenEnv health endpoint — returns `{"status": "healthy"}` |
+| `/metadata` | GET | Environment name, description, version, tags |
+| `/schema` | GET | Action, observation, and state JSON schemas |
+| `/reset` | POST | Start a new episode (optional: `task_id`, `scenario_id`, `seed`) |
+| `/step` | POST | Take an action and receive observation + reward |
+| `/state` | GET | Get current observation without taking an action |
+| `/info` | GET | Task list with metadata |
+| `/tasks` | GET | List all tasks with difficulty levels |
+| `/grader` | POST | Grade a trajectory (list of step dicts) |
+| `/baseline` | POST | Run built-in heuristic baseline |
+| `/mcp` | POST | JSON-RPC 2.0 MCP endpoint (initialize, tools/list) |
+### Example: Full Episode via API
+```bash
+# 1. Start an episode
 curl -X POST http://localhost:7860/reset \
   -H "Content-Type: application/json" \
+  -d '{"task_id": "dockerfile_syntax", "scenario_id": "typo_filename"}'
+# Response: observation with broken Dockerfile + error message
+# 2. Fix the typo
 curl -X POST http://localhost:7860/step \
   -H "Content-Type: application/json" \
   -d '{
     }
   }'
+# Response: reward=0.4, issues_fixed=1/1
+# 3. Submit
 curl -X POST http://localhost:7860/step \
   -H "Content-Type: application/json" \
   -d '{"action": {"action_type": "submit"}}'
+# Response: done=true, episode complete
+```
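The same episode could be driven from Python with only the standard library. This is a sketch under assumptions: the `action_type`, `old_content`, and `new_content` names come from this README, while `file_path` is a hypothetical field name (check `/schema` for the real one), and `build_request`, `post_json`, and `run_episode` are illustrative helpers.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # local dev server from the Quick Start


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the environment API."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path: str, payload: dict) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(build_request(path, payload), timeout=10) as resp:
        return json.load(resp)


reset_payload = {"task_id": "dockerfile_syntax", "scenario_id": "typo_filename"}
edit_payload = {
    "action": {
        "action_type": "edit_file",
        "file_path": "Dockerfile",  # hypothetical field name, not confirmed by the README
        "old_content": "COPY requirments.txt .",
        "new_content": "COPY requirements.txt .",
    }
}
submit_payload = {"action": {"action_type": "submit"}}


def run_episode() -> dict:
    """Reset, apply one fix, then submit; returns the final /step response."""
    post_json("/reset", reset_payload)
    post_json("/step", edit_payload)
    return post_json("/step", submit_payload)
```

Calling `run_episode()` requires the server to be running locally; nothing is sent at import time.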
+---
+## Baseline Results (Llama 3.1 70B)
+Tested with `meta-llama/Llama-3.1-70B-Instruct` via HuggingFace router:
+| Task | Score | Notes |
+|------|-------|-------|
+| dockerfile_syntax | 1.000 | Solved perfectly in 1 step |
+| dockerfile_runtime | 1.000 | Solved perfectly in 1 step |
+| workflow_syntax_structure | 0.000 | LLM struggled with exact whitespace matching |
+| workflow_secrets_permissions | 1.000 | Solved perfectly in 1 step |
+| ci_docker_integration | 0.000 | Multi-step fix needed; LLM edits didn't match exactly |
+| multi_stage_pipeline_matrix | 0.283 | Fixed 1/3 issues |
+| **OVERALL** | **0.547** | |
+This shows the environment is both **solvable** (3 perfect scores) and **challenging** (2 zero scores, 1 partial). The main difficulty is exact string matching for edits — a realistic constraint that mirrors real file editing.
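As a quick check, the OVERALL figure matches an unweighted mean of the six task scores. That the overall score is computed this way is an assumption; the sketch below only confirms the arithmetic.

```python
# Per-task baseline scores copied from the table above.
baseline_scores = {
    "dockerfile_syntax": 1.000,
    "dockerfile_runtime": 1.000,
    "workflow_syntax_structure": 0.000,
    "workflow_secrets_permissions": 1.000,
    "ci_docker_integration": 0.000,
    "multi_stage_pipeline_matrix": 0.283,
}

# Unweighted mean across tasks (assumed aggregation).
overall = sum(baseline_scores.values()) / len(baseline_scores)
overall_rounded = round(overall, 3)
```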
+---
+## Quick Start
+### Local Development
+```bash
+pip install -r requirements.txt
+python -m uvicorn server.main:app --host 0.0.0.0 --port 7860
 ```
 ### Run Tests
 python inference.py
 ```
+---
 ## Project Structure
 ```
 cicd-debug-env/
+├── openenv.yaml              # OpenEnv environment specification
+├── inference.py              # LLM baseline (OpenAI client + HF router)
 ├── baseline_runner.py        # Heuristic baseline for /baseline endpoint
 ├── Dockerfile                # Production container
 ├── requirements.txt          # Python dependencies
 │
 ├── server/
+│   ├── main.py               # FastAPI with 12 endpoints
+│   ├── models.py             # Pydantic models (type-safe API)
+│   ├── environment.py        # Core environment loop (reset/step/state)
 │   ├── tasks/
+│   │   ├── base.py           # BaseTask with scenario loading
+│   │   ├── task_registry.py  # Maps task_id → task class
+│   │   ├── task_1_build_errors.py        # 5 Dockerfile syntax scenarios
+│   │   ├── task_2_docker_runtime.py      # 5 Dockerfile runtime scenarios
+│   │   ├── task_3_workflow_syntax.py     # 5 workflow structure scenarios
+│   │   ├── task_4_workflow_secrets_permissions.py  # 5 secrets scenarios
+│   │   ├── task_5_ci_docker_integration.py        # 5 integration scenarios
+│   │   └── task_6_multi_stage_matrix.py           # 5 multi-issue scenarios
 │   ├── graders/
+│   │   ├── __init__.py       # Deterministic trajectory grader
+│   │   └── base.py           # Base grader with weight constants
+│   └── simulators/
+│       ├── docker_simulator.py   # 15+ Dockerfile validation rules
+│       └── workflow_simulator.py # 15+ workflow validation rules
 │
 └── tests/
+    ├── test_endpoints.py     # API endpoint tests
+    ├── test_determinism.py   # Grader determinism + score range tests
+    ├── test_baseline.py      # Heuristic baseline tests
+    ├── test_environment_flow.py  # Episode flow tests
+    └── test_simulators.py    # Simulator unit tests
 ```
 ## Design Decisions
+1. **Docker + GitHub Actions combined**: These two tools meet in many modern deployment pipelines, and debugging their interaction is among the hardest parts of DevOps.
+2. **Simulated validation (no real Docker)**: Static analysis rules instead of running actual containers. This gives deterministic results, fast execution, and no security concerns.
+3. **Dense rewards**: Partial credit at every step (+0.3 per fix, -0.02 per failed edit) rather than sparse pass/fail. Helps RL agents learn faster.
+4. **Difficulty progression**: Easy tasks are single-file, single-issue. Hard tasks are multi-file, multi-issue with interacting bugs.
+5. **Exact string matching for edits**: Mirrors real file editing — whitespace matters. This is intentionally challenging for LLMs.
+6. **30 scenarios from real bugs**: Every scenario is based on actual developer mistakes documented on Stack Overflow, GitHub Issues, and Docker/GitHub Actions documentation.
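To make the simulated-validation idea in decision 2 concrete, here is a minimal sketch of one static rule, modeled on the `wrong_secret_syntax` scenario. It is not taken from `workflow_simulator.py`; the regex and the function name `find_single_brace_secrets` are assumptions for illustration.

```python
import re

# Matches `${ secrets.NAME }` but not the correct `${{ secrets.NAME }}`.
SINGLE_BRACE_SECRET = re.compile(
    r"\$\{(?!\{)\s*secrets\.([A-Za-z_][A-Za-z0-9_]*)\s*\}(?!\})"
)


def find_single_brace_secrets(workflow_text: str) -> list[tuple[int, str]]:
    """Return (line_number, secret_name) for each single-brace secret reference."""
    findings = []
    for lineno, line in enumerate(workflow_text.splitlines(), start=1):
        for match in SINGLE_BRACE_SECRET.finditer(line):
            findings.append((lineno, match.group(1)))
    return findings


broken_workflow = """\
steps:
  - run: echo "deploying"
    env:
      TOKEN: ${ secrets.TOKEN }
      OTHER: ${{ secrets.OTHER }}
"""

findings = find_single_brace_secrets(broken_workflow)
```

A rule like this needs no container or runner, so the same input always yields the same findings.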
 ## License