# CI/CD Doctor — Advanced Reference

Deep dive into the environment internals: architecture, I/O contracts, task variants, reward shaping, grader semantics, and layout. If you are only trying to run the env, start with the [root README](../README.md).

---

## 1. Environment Overview

> **Highlight:** The environment is a pure in-memory simulation. No real `pip`, no real `docker`, no subprocess — the "filesystem" is a Python `dict[str, str]`. Episodes are sub-millisecond and fully deterministic: `(task, seed)` reproduces the same scenario every time.

```
Agent issues a command string ─► parser.py
                                     │
                                     ▼
                   environment/server/environment.py
                                     │
          ┌──────────────────────────┼──────────────────────────┐
          ▼                          ▼                          ▼
  in-memory filesystem        stage_runner.py               grader.py
  (mutated by edits)        (simulated stages)          (reward + tiers)
          │                          │                          │
          └──────────────────────────┴──────────────────────────┘
                                     │
                                     ▼
                    PipelineObservation back to agent
```

**Episode lifecycle.** `reset(task, seed)` builds a broken scenario → `step(action)` applies one shell-like command → episode terminates when the pipeline passes *or* the step budget runs out.

---

## 2. Action & Observation Spaces

> **Highlight:** All I/O is typed with Pydantic v2 models in [environment/models.py](../environment/models.py). The agent's entire interface is a single free-form `command` string per turn; seven command shapes are recognised.

### `PipelineAction`

```python
class PipelineAction(BaseModel):
    command: str  # raw shell-like string, e.g. 'cat requirements.txt'
```

Seven command shapes are recognised by [environment/parser.py](../environment/parser.py):

| Command | Example | Effect |
|---|---|---|
| `cat <file>` | `cat requirements.txt` | Read a file from the in-memory FS |
| `echo "<text>" >> <file>` | `echo "pandas" >> requirements.txt` | Append a line to a file |
| `sed -i 's/old/new/' <file>` | `sed -i 's/3.10/3.11/' Dockerfile` | Replace all occurrences of text in a file |
| `pipeline run` | `pipeline run` | Execute the full pipeline and return logs |
| `pipeline logs [stage]` | `pipeline logs install` | Show last pipeline logs (optionally filtered by stage) |
| `pipeline status` | `pipeline status` | Show current pipeline state (`not_run` / `failed` / `passed`) |
| `diagnose "<text>"` | `diagnose "Missing env var SECRET_KEY"` | Record the agent's diagnosis (used for reward bonuses) |

Anything else returns `Command not recognized` with `exit_code=1`.

### `PipelineObservation`

```python
class PipelineObservation(BaseModel):
    stdout: str           # what the agent sees this turn
    exit_code: int        # 0 = success, 1 = error
    pipeline_status: str  # 'not_run' | 'failed' | 'passed'
    steps_remaining: int
    done: bool = False
    reward: float = 0.0
```

### `PipelineState` (server-side only)

```python
class PipelineState(BaseModel):
    episode_id: str
    task: str                   # "easy" | "medium" | "hard"
    filesystem: Dict[str, str]
    pipeline_status: str
    step_count: int
    done: bool
    total_reward: float
    answer_key: Dict[str, Any]  # never sent to agent, used by grader
    milestones: List[str] = Field(default_factory=list)  # grader-only, tracks unlocked reward tiers
```

Tracks full episode state inside the server, including filesystem mutations, progress, and reward accumulation.

- `answer_key` is hidden from the agent and used only for structural validation in the grader.
- `milestones` track progression through the debugging lifecycle (investigated → diagnosed → fixed → verified).

---

## 3. Task Generation & Logic (Procedural Complexity)

**Design Philosophy**

Tasks are not static templates.
They are programmatically synthesized scenarios generated by `core/scenarios/generator.py`. Each episode is a unique composition of:

- a pipeline graph
- injected faults
- a deterministic seed

This makes the environment **non-memorizable**, forcing agents to rely on **generalized diagnostic reasoning** instead of string matching.

---

### Difficulty Tiers & Behavioral Intent

Tasks are categorized by the **depth of reasoning** required.

| Tier | Max Steps | Ideal Steps | Faults | Strategic Complexity |
|---|---|---|---|---|
| Easy | 10 | 3 | 1 | Linear: single-file lookup → direct fix |
| Medium | 15 | 6 | 2 | Relational: cross-file reasoning |
| Hard | 25 | 10 | 3 | Sequential: cascading failures |

---

### How the Generator Synthesizes an Episode

Each episode is constructed in four stages:

1. **Base Filesystem**
   A clean project snapshot is initialized.
2. **Pipeline Definition**
   CI/CD stages are constructed (e.g., `install → test → build`).
3. **Fault Injection**
   Files are mutated with **typed faults**, such as:
   - `package_present` / `package_version`
   - `dockerfile_base`
   - `env_var_present`
   - `config_value`
   - `ci_stage_order`
   - `port_value`
4. **Answer Key Generation**
   A hidden ground-truth spec used by the grader for **structural validation**.

---

### Scenario Breakdown

#### Easy — Localized Debugging

Focus: **Information retrieval**

- Failure is confined to a single file
- Example: `app.py` imports a missing dependency

**Agent goal:** Map runtime error → specific file → apply fix

---

#### Medium — Cross-Subsystem Reasoning

Focus: **Iterative discovery**

- Two faults across different subsystems
- Only the *first failing stage* is visible initially

**Key concept: Shadowing**

> Fixing one issue reveals the next.
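The shadowing mechanic can be sketched in a few lines. This is illustrative only: the function name and the result format below are assumptions, not the actual `core/pipeline/stage_runner.py` API.

```python
# Hypothetical sketch of "shadowing": only the FIRST failing stage is
# reported, so fixing one fault reveals the next one downstream.
# (Names and return format are illustrative, not the real stage_runner.py.)

def run_pipeline(stages, faults):
    """Run stages in order; stop and report at the first injected fault."""
    for stage in stages:
        if stage in faults:
            return {"status": "failed", "stage": stage, "log": faults[stage]}
    return {"status": "passed", "stage": None, "log": ""}

stages = ["install", "env_check", "build"]
faults = {
    "env_check": "Missing env var SECRET_KEY",
    "build": "Docker base image mismatch",
}

# Only the first fault is visible at first:
first = run_pipeline(stages, faults)
assert first["stage"] == "env_check"

# Fixing it "unshadows" the second fault:
del faults["env_check"]
assert run_pipeline(stages, faults)["stage"] == "build"

# Only after both fixes does the pipeline pass:
del faults["build"]
assert run_pipeline(stages, faults)["status"] == "passed"
```

The agent therefore cannot enumerate all faults up front; it must interleave fixing and re-running.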
| Variant | Pipeline | Faults |
|---|---|---|
| A | install → env_check → build | missing env var + Docker mismatch |
| B | install → config → smoke_test | dependency + config gate |
| C | install → port_check → build | port mismatch + Docker issue |

**Agent requirement:**

- Prioritize fixes correctly
- Maintain state across iterations

---

#### Hard — Cascading Failures

Focus: **Causal + temporal reasoning**

- Three faults chained across stages
- Each fix changes future observations

Example chain:

CI stage order incorrect → build executes prematurely → dependency resolution fails

**Key property: Temporal dependency**

- Fixing earlier stages alters downstream failures

---

### Why This Design Works

#### 1. Partial Observability

The agent never sees all failures at once.

#### 2. Structural Validation

Correctness is semantic:

- not "does the file match?"
- but "is the system now valid?"

#### 3. Anti-Shortcut Mechanics

- **File Integrity Check**
  Prevents appending junk to pass tests
- **Blind Edit Penalty**
  Forces reading before editing
- **Edit Spam Penalty**
  Discourages brute-force iteration

---

### Optimal Agent Policy

The correct strategy is not:

`try random fixes → rerun`

It is:

`observe → localize → read → diagnose → fix → verify → repeat`

Each difficulty level increases pressure on:

- localisation accuracy
- causal reasoning
- sequencing of fixes

### Why hard is genuinely hard

- **Docker base reasoning (`alpine` vs `slim`)**
  Errors like `gcc: command not found` require understanding that `alpine` lacks build tools/glibc. The correct fix is switching to `python:3.11-slim`, not just bumping versions.
- **Dependency compatibility (not presence)**
  Failures like `numpy==1.21` are not about missing packages but about **version conflicts** with transitive dependencies. The agent must reason about compatibility, not just add lines.
- **Sequential error revelation**
  Only one failure is visible per pipeline run.
Fixing one stage reveals the next, forcing **multi-step reasoning loops**.
- **Exploration vs efficiency trade-off**
  Reading everything wastes steps (efficiency penalty), but blind edits are penalized. The agent must act **surgically**, not exhaustively.

---

## 4. Grader Logic & Reward Shaping

> The grader rewards *process quality*, not just success. Agents are guided through a realistic debugging flow: investigate → diagnose → fix → verify.

Each step reward is composed of:

**grade(state) delta + balance_score(state, ctx)**

---

### Core Score (Structural Progress)

- **Fix Credit (max +0.20)**
  Proportional to the fraction of correctly applied fixes.
- **Pipeline Passed (+0.50)**
  Awarded only when `pipeline_status == "passed"`.
- **File Integrity (−0.10 → 0.0)**
  Penalizes excessive edits (e.g., appending large amounts of code).

---

### Milestone-Based Progression

| Stage | Description | Reward |
|---|---|---|
| Investigated | First pipeline run to observe failure | +0.10 |
| Diagnosed | Reads relevant diagnostic/source files | +0.10 |
| Fix Applied | Valid structural fix detected | +0.15 |
| Verified | Pipeline successfully passes | +0.50 |

Progress is **state-driven**, not command-driven.

---

### Behavioral Shaping (Per-Step)

#### Rewards

- **Correct Diagnosis**: +0.10
- **Cross-File Reasoning**: +0.05

#### Penalties

- **Blind Edits** (edit without reading): −0.10
- **Edit Spam** (>2 edits per file): −0.05 each
- **Idle Pipeline Runs** (no FS changes): −0.05
- **Stalling** (no progress): −0.05
- **Regression** (breaking a prior fix): −0.15
- **Inefficiency**: −0.02 per step beyond ideal (6 steps)

---

### Key Design Insight

The grader differentiates:

- **Structured debugging** → rewarded
- **Brute-force / guesswork** → penalized

Partial fixes receive proportional credit, enabling meaningful learning even in multi-error environments.

---

## 5. Project Structure

```
CI_CD_Doctor/
├── Dockerfile            ← container setup
├── README.md             ← main project overview
├── __init__.py
├── client.py             ← environment client interface
├── models.py             ← core data models (Action / State / Observation)
├── inference.py          ← baseline agent runner
├── openenv.yaml          ← OpenEnv task + grader config
├── pyproject.toml
├── uv.lock               ← dependency lockfile
│
├── core/                 ← modularized environment logic
│   ├── __init__.py
│   ├── grading/
│   │   └── grader.py        ← scoring + reward shaping logic
│   ├── pipeline/
│   │   └── stage_runner.py  ← simulated CI/CD stages
│   ├── scenarios/
│   │   └── generator.py     ← task + variant generation
│   ├── utils/
│   │   └── packages.py      ← dependency definitions
│   └── validation/
│       ├── parser.py        ← command parsing logic
│       └── validator.py     ← structural validation (CI rules, configs)
│
├── server/               ← execution backend
│   ├── __init__.py
│   ├── app.py            ← FastAPI entrypoint
│   ├── app_2.py          ← alternate server setup
│   └── environment.py    ← main env loop (reset/step/state)
│
└── docs/
    ├── README.md            ← HF space readme
    └── advanced_readme.md   ← detailed system design
```

---

## 6. Development

### Run the server locally

```bash
uvicorn server.app:app --reload
```
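For quick local sanity checks without the server, the command grammar from section 2 can be re-implemented in a few lines. This is a sketch only; the real logic lives in `parser.py`, and the `parse` name and return format here are assumptions.

```python
import re

# Illustrative re-implementation of the command shapes from section 2.
# NOT the actual core/validation/parser.py; names and the dict return
# format are assumptions made for this sketch.

def parse(command: str):
    patterns = [
        (r"^cat (\S+)$", "cat"),
        (r'^echo "(.*)" >> (\S+)$', "append"),
        (r"^sed -i 's/(.+?)/(.*?)/' (\S+)$", "replace"),
        (r"^pipeline run$", "run"),
        (r"^pipeline logs(?: (\S+))?$", "logs"),
        (r"^pipeline status$", "status"),
        (r'^diagnose "(.*)"$', "diagnose"),
    ]
    for pattern, kind in patterns:
        m = re.match(pattern, command)
        if m:
            return {"kind": kind, "args": [g for g in m.groups() if g is not None]}
    return None  # maps to "Command not recognized" with exit_code=1

assert parse("cat requirements.txt") == {"kind": "cat", "args": ["requirements.txt"]}
assert parse('echo "pandas" >> requirements.txt')["kind"] == "append"
assert parse("sed -i 's/3.10/3.11/' Dockerfile")["args"] == ["3.10", "3.11", "Dockerfile"]
assert parse("pipeline logs install")["args"] == ["install"]
assert parse("rm -rf /") is None
```

Keeping a table of `(regex, kind)` pairs like this makes the grammar easy to audit against the table in section 2.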