# CI/CD Doctor — Advanced Reference

Deep dive into the environment internals: architecture, I/O contracts, task variants, reward shaping, grader semantics, and layout. If you are only trying to run the env, start with the [root README](../README.md).

---

## 1. Environment Overview

> **Highlight:** The environment is a pure in-memory simulation. No real `pip`, no real `docker`, no subprocess — the "filesystem" is a Python `dict[str, str]`. Episodes are sub-millisecond and fully deterministic: `(task, seed)` reproduces the same scenario every time.

```
Agent issues a command string ─► parser.py
                                     │
                                     ▼
                   environment/server/environment.py
                                     │
          ┌──────────────────────────┼──────────────────────────┐
          ▼                          ▼                          ▼
  in-memory filesystem        stage_runner.py               grader.py
  (mutated by edits)        (simulated stages)          (reward + tiers)
          │                          │                          │
          └──────────────────────────┴──────────────────────────┘
                                     │
                                     ▼
                    PipelineObservation back to agent
```

**Episode lifecycle.** `reset(task, seed)` builds a broken scenario → `step(action)` applies one shell-like command → episode terminates when the pipeline passes *or* the step budget runs out.

---

## 2. Action & Observation Spaces

> **Highlight:** All I/O is typed with Pydantic v2 models in [environment/models.py](../environment/models.py). The agent's entire interface is a single free-form `command` string per turn; seven command shapes are recognised.

### `PipelineAction`

```python
class PipelineAction(BaseModel):
    command: str  # raw shell-like string, e.g. 'cat requirements.txt'
```

Seven command shapes are recognised by [environment/parser.py](../environment/parser.py):

| Command | Example | Effect |
|---|---|---|
| `cat <file>` | `cat requirements.txt` | Read a file from the in-memory FS |
| `echo "<text>" >> <file>` | `echo "pandas" >> requirements.txt` | Append a line to a file |
| `sed -i 's/old/new/' <file>` | `sed -i 's/3.10/3.11/' Dockerfile` | Replace all occurrences of text in a file |
| `pipeline run` | `pipeline run` | Execute the full pipeline and return logs |
| `pipeline logs [stage]` | `pipeline logs install` | Show last pipeline logs (optionally filtered by stage) |
| `pipeline status` | `pipeline status` | Show current pipeline state (`not_run` / `failed` / `passed`) |
| `diagnose "<text>"` | `diagnose "Missing env var SECRET_KEY"` | Record the agent's diagnosis (used for reward bonuses) |

Anything else returns `Command not recognized` with `exit_code=1`.

### `PipelineObservation`

```python
class PipelineObservation(BaseModel):
    stdout: str           # what the agent sees this turn
    exit_code: int        # 0 = success, 1 = error
    pipeline_status: str  # 'not_run' | 'failed' | 'passed'
    steps_remaining: int
    done: bool = False
    reward: float = 0.0
```

### `PipelineState` (server-side only)

```python
class PipelineState(BaseModel):
    episode_id: str
    task: str                   # "easy" | "medium" | "hard"
    filesystem: Dict[str, str]
    pipeline_status: str
    step_count: int
    done: bool
    total_reward: float
    answer_key: Dict[str, Any]  # never sent to agent, used by grader
    milestones: List[str] = Field(default_factory=list)  # grader-only, tracks unlocked reward tiers
```

Tracks full episode state inside the server, including filesystem mutations, progress, and reward accumulation.

- `answer_key` is hidden from the agent and used only for structural validation in the grader.
- `milestones` track progression through the debugging lifecycle (investigated → diagnosed → fixed → verified).

---

## 3. Task Generation & Logic (Procedural Complexity)

**Design Philosophy**

Tasks are not static templates.
They are programmatically synthesized scenarios generated by `core/scenarios/generator.py`. Each episode is a unique composition of:

- a pipeline graph
- injected faults
- a deterministic seed

This makes the environment **non-memorizable**, forcing agents to rely on **generalized diagnostic reasoning** instead of string matching.

---

### Difficulty Tiers & Behavioral Intent

Tasks are categorized by the **depth of reasoning** required.

| Tier | Max Steps | Ideal Steps | Faults | Strategic Complexity |
|---|---|---|---|---|
| Easy | 10 | 3 | 1 | Linear: single-file lookup → direct fix |
| Medium | 15 | 6 | 2 | Relational: cross-file reasoning |
| Hard | 25 | 10 | 3 | Sequential: cascading failures |

---

### How the Generator Synthesizes an Episode

Each episode is constructed in four stages:

1. **Base Filesystem**
   A clean project snapshot is initialized.
2. **Pipeline Definition**
   CI/CD stages are constructed (e.g., `install → test → build`).
3. **Fault Injection**
   Files are mutated with **typed faults**, such as:
   - `package_present` / `package_version`
   - `dockerfile_base`
   - `env_var_present`
   - `config_value`
   - `ci_stage_order`
   - `port_value`
4. **Answer Key Generation**
   A hidden ground-truth spec used by the grader for **structural validation**.

---

### Scenario Breakdown

#### Easy — Localized Debugging

Focus: **Information retrieval**

- Failure is confined to a single file
- Example: `app.py` imports a missing dependency

**Agent goal:** Map runtime error → specific file → apply fix

---

#### Medium — Cross-Subsystem Reasoning

Focus: **Iterative discovery**

- Two faults across different subsystems
- Only the *first failing stage* is visible initially

**Key concept: Shadowing**

> Fixing one issue reveals the next.
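The shadowing mechanic can be sketched in a few lines. This is illustrative only: the function name and the result format below are assumptions, not the actual `core/pipeline/stage_runner.py` API.

```python
# Hypothetical sketch of "shadowing": only the FIRST failing stage is
# reported, so fixing one fault reveals the next one downstream.
# (Names and return format are illustrative, not the real stage_runner.py.)

def run_pipeline(stages, faults):
    """Run stages in order; stop and report at the first injected fault."""
    for stage in stages:
        if stage in faults:
            return {"status": "failed", "stage": stage, "log": faults[stage]}
    return {"status": "passed", "stage": None, "log": ""}

stages = ["install", "env_check", "build"]
faults = {
    "env_check": "Missing env var SECRET_KEY",
    "build": "Docker base image mismatch",
}

# Only the first fault is visible at first:
first = run_pipeline(stages, faults)
assert first["stage"] == "env_check"

# Fixing it "unshadows" the second fault:
del faults["env_check"]
assert run_pipeline(stages, faults)["stage"] == "build"

# Only after both fixes does the pipeline pass:
del faults["build"]
assert run_pipeline(stages, faults)["status"] == "passed"
```

The agent therefore cannot enumerate all faults up front; it must interleave fixing and re-running.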
| Variant | Pipeline | Faults |
|---|---|---|
| A | install → env_check → build | missing env var + Docker mismatch |
| B | install → config → smoke_test | dependency + config gate |
| C | install → port_check → build | port mismatch + Docker issue |

**Agent requirement:**

- Prioritize fixes correctly
- Maintain state across iterations

---

#### Hard — Cascading Failures

Focus: **Causal + temporal reasoning**

- Three faults chained across stages
- Each fix changes future observations

Example chain:

CI stage order incorrect → build executes prematurely → dependency resolution fails

**Key property: Temporal dependency**

- Fixing earlier stages alters downstream failures

---

### Why This Design Works

#### 1. Partial Observability

The agent never sees all failures at once.

#### 2. Structural Validation

Correctness is semantic:

- not "does the file match?"
- but "is the system now valid?"

#### 3. Anti-Shortcut Mechanics

- **File Integrity Check**
  Prevents appending junk to pass tests
- **Blind Edit Penalty**
  Forces reading before editing
- **Edit Spam Penalty**
  Discourages brute-force iteration

---

### Optimal Agent Policy

The correct strategy is not:

`try random fixes → rerun`

It is:

`observe → localize → read → diagnose → fix → verify → repeat`

Each difficulty level increases pressure on:

- localisation accuracy
- causal reasoning
- sequencing of fixes

### Why hard is genuinely hard

- **Docker base reasoning (`alpine` vs `slim`)**
  Errors like `gcc: command not found` require understanding that `alpine` lacks build tools/glibc. The correct fix is switching to `python:3.11-slim`, not just bumping versions.
- **Dependency compatibility (not presence)**
  Failures like `numpy==1.21` are not about missing packages but about **version conflicts** with transitive dependencies. The agent must reason about compatibility, not just add lines.
- **Sequential error revelation**
  Only one failure is visible per pipeline run.
Fixing one stage reveals the next, forcing **multi-step reasoning loops**.
- **Exploration vs efficiency trade-off**
  Reading everything wastes steps (efficiency penalty), but blind edits are penalized. The agent must act **surgically**, not exhaustively.

---

## 4. Grader Logic & Reward Shaping

> The grader rewards *process quality*, not just success. Agents are guided through a realistic debugging flow: investigate → diagnose → fix → verify.

Each step reward is composed of:

**grade(state) delta + balance_score(state, ctx)**

---

### Core Score (Structural Progress)

- **Fix Credit (max +0.20)**
  Proportional to the fraction of correctly applied fixes.
- **Pipeline Passed (+0.50)**
  Awarded only when `pipeline_status == "passed"`.
- **File Integrity (−0.10 → 0.0)**
  Penalizes excessive edits (e.g., appending large amounts of code).

---

### Milestone-Based Progression

| Stage | Description | Reward |
|---|---|---|
| Investigated | First pipeline run to observe failure | +0.10 |
| Diagnosed | Reads relevant diagnostic/source files | +0.10 |
| Fix Applied | Valid structural fix detected | +0.15 |
| Verified | Pipeline successfully passes | +0.50 |

Progress is **state-driven**, not command-driven.

---

### Behavioral Shaping (Per-Step)

#### Rewards

- **Correct Diagnosis**: +0.10
- **Cross-File Reasoning**: +0.05

#### Penalties

- **Blind Edits** (edit without reading): −0.10
- **Edit Spam** (>2 edits per file): −0.05 each
- **Idle Pipeline Runs** (no FS changes): −0.05
- **Stalling** (no progress): −0.05
- **Regression** (breaking a prior fix): −0.15
- **Inefficiency**: −0.02 per step beyond ideal (6 steps)

---

### Key Design Insight

The grader differentiates:

- **Structured debugging** → rewarded
- **Brute-force / guesswork** → penalized

Partial fixes receive proportional credit, enabling meaningful learning even in multi-error environments.

---

## 5. Project Structure

```
CI_CD_Doctor/
├── Dockerfile            ← container setup
├── README.md             ← main project overview
├── __init__.py
├── client.py             ← environment client interface
├── models.py             ← core data models (Action / State / Observation)
├── inference.py          ← baseline agent runner
├── openenv.yaml          ← OpenEnv task + grader config
├── pyproject.toml
├── uv.lock               ← dependency lockfile
│
├── core/                 ← modularized environment logic
│   ├── __init__.py
│   ├── grading/
│   │   └── grader.py        ← scoring + reward shaping logic
│   ├── pipeline/
│   │   └── stage_runner.py  ← simulated CI/CD stages
│   ├── scenarios/
│   │   └── generator.py     ← task + variant generation
│   ├── utils/
│   │   └── packages.py      ← dependency definitions
│   └── validation/
│       ├── parser.py        ← command parsing logic
│       └── validator.py     ← structural validation (CI rules, configs)
│
├── server/               ← execution backend
│   ├── __init__.py
│   ├── app.py            ← FastAPI entrypoint
│   ├── app_2.py          ← alternate server setup
│   └── environment.py    ← main env loop (reset/step/state)
│
└── docs/
    ├── README.md            ← HF space readme
    └── advanced_readme.md   ← detailed system design
```

---

## 6. Development

### Run the server locally

```bash
uvicorn server.app:app --reload
```
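For quick local sanity checks without the server, the command grammar from section 2 can be re-implemented in a few lines. This is a sketch only; the real logic lives in `parser.py`, and the `parse` name and return format here are assumptions.

```python
import re

# Illustrative re-implementation of the command shapes from section 2.
# NOT the actual core/validation/parser.py; names and the dict return
# format are assumptions made for this sketch.

def parse(command: str):
    patterns = [
        (r"^cat (\S+)$", "cat"),
        (r'^echo "(.*)" >> (\S+)$', "append"),
        (r"^sed -i 's/(.+?)/(.*?)/' (\S+)$", "replace"),
        (r"^pipeline run$", "run"),
        (r"^pipeline logs(?: (\S+))?$", "logs"),
        (r"^pipeline status$", "status"),
        (r'^diagnose "(.*)"$', "diagnose"),
    ]
    for pattern, kind in patterns:
        m = re.match(pattern, command)
        if m:
            return {"kind": kind, "args": [g for g in m.groups() if g is not None]}
    return None  # maps to "Command not recognized" with exit_code=1

assert parse("cat requirements.txt") == {"kind": "cat", "args": ["requirements.txt"]}
assert parse('echo "pandas" >> requirements.txt')["kind"] == "append"
assert parse("sed -i 's/3.10/3.11/' Dockerfile")["args"] == ["3.10", "3.11", "Dockerfile"]
assert parse("pipeline logs install")["args"] == ["install"]
assert parse("rm -rf /") is None
```

Keeping a table of `(regex, kind)` pairs like this makes the grammar easy to audit against the table in section 2.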