# CI/CD Doctor: Advanced Reference
Deep dive into the environment internals: architecture, I/O contracts, task variants, reward shaping, grader semantics, and layout. If you are only trying to run the env, start with the [root README](../README.md).
---
## 1. Environment Overview
> **Highlight:** The environment is a pure in-memory simulation. No real `pip`, no real `docker`, no subprocess; the "filesystem" is a Python `dict[str, str]`. Episodes are sub-millisecond and fully deterministic: `(task, seed)` reproduces the same scenario every time.
```
Agent issues a command string ──► parser.py
                                      │
                                      ▼
                     environment/server/environment.py
                                      │
           ┌──────────────────────────┼──────────────────────────┐
           ▼                          ▼                          ▼
  in-memory filesystem         stage_runner.py              grader.py
  (mutated by edits)          (simulated stages)          (reward + tiers)
           │                          │                          │
           └──────────────────────────┴──────────────────────────┘
                                      │
                                      ▼
                      PipelineObservation back to agent
```
**Episode lifecycle.** `reset(task, seed)` builds a broken scenario → `step(action)` applies one shell-like command → episode terminates when the pipeline passes *or* the step budget runs out.
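The lifecycle above can be sketched with a toy stand-in (illustrative only; the class name, the single hard-coded fault, and the `dict` observation shape are assumptions, not the real `environment.py` API):

```python
# Toy sketch of the reset/step lifecycle. All names here are
# illustrative stand-ins, not the real environment interface.
from dataclasses import dataclass, field

@dataclass
class ToyEnv:
    max_steps: int = 10
    filesystem: dict = field(default_factory=dict)
    step_count: int = 0
    pipeline_status: str = "not_run"

    def reset(self, seed: int = 0) -> dict:
        # Build a deterministic broken scenario from the seed.
        self.filesystem = {"requirements.txt": f"flask\n# seed={seed}"}
        self.step_count = 0
        self.pipeline_status = "not_run"
        return {"stdout": "episode started", "done": False}

    def step(self, command: str) -> dict:
        self.step_count += 1
        if command == "pipeline run":
            # The pipeline passes once the missing package is present.
            self.pipeline_status = (
                "passed" if "pandas" in self.filesystem["requirements.txt"]
                else "failed"
            )
        elif command.startswith("echo "):
            # Crude append handling for: echo "pandas" >> requirements.txt
            text = command.split('"')[1]
            self.filesystem["requirements.txt"] += "\n" + text
        done = self.pipeline_status == "passed" or self.step_count >= self.max_steps
        return {"stdout": self.pipeline_status, "done": done}

env = ToyEnv()
env.reset(seed=42)
env.step("pipeline run")                         # fails: pandas missing
env.step('echo "pandas" >> requirements.txt')    # apply the fix
env.step("pipeline run")                         # passes, episode terminates
```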
---
## 2. Action & Observation Spaces
> **Highlight:** All I/O is typed with Pydantic v2 models in [environment/models.py](../environment/models.py). The agent's entire interface is a single free-form `command` string per turn; seven command shapes are recognised.
### `PipelineAction`
```python
class PipelineAction(BaseModel):
command: str # raw shell-like string, e.g. 'cat requirements.txt'
```
Seven command shapes are recognised by [environment/parser.py](../environment/parser.py):
| Command | Example | Effect |
|---|---|---|
| `cat <file>` | `cat requirements.txt` | Read a file from the in-memory FS |
| `echo "<text>" >> <file>` | `echo "pandas" >> requirements.txt` | Append a line to a file |
| `sed -i 's/old/new/' <file>` | `sed -i 's/3.10/3.11/' Dockerfile` | Replace all occurrences of text in a file |
| `pipeline run` | `pipeline run` | Execute full pipeline and return logs |
| `pipeline logs [stage]` | `pipeline logs install` | Show last pipeline logs (optionally filtered by stage) |
| `pipeline status` | `pipeline status` | Show current pipeline state (`not_run` / `failed` / `passed`) |
| `diagnose "<reason>"` | `diagnose "Missing env var SECRET_KEY"` | Record agent diagnosis (used for reward bonuses) |
Anything else returns `Command not recognized` with `exit_code=1`.
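A regex-based recognizer for these shapes might look like the sketch below (illustrative; the real `parser.py` may structure this differently and handle more edge cases):

```python
import re

# Illustrative command recognizer for the seven shapes in the table.
PATTERNS = {
    "cat":      re.compile(r'^cat\s+(\S+)$'),
    "echo":     re.compile(r'^echo\s+"([^"]*)"\s*>>\s*(\S+)$'),
    "sed":      re.compile(r"^sed -i 's/([^/]+)/([^/]*)/'\s+(\S+)$"),
    "run":      re.compile(r'^pipeline run$'),
    "logs":     re.compile(r'^pipeline logs(?:\s+(\S+))?$'),
    "status":   re.compile(r'^pipeline status$'),
    "diagnose": re.compile(r'^diagnose\s+"([^"]*)"$'),
}

def parse(command: str):
    """Return (command_name, captured_args) or None if unrecognized."""
    for name, pattern in PATTERNS.items():
        m = pattern.match(command.strip())
        if m:
            return name, m.groups()
    return None  # caller responds: "Command not recognized", exit_code=1

parse("sed -i 's/3.10/3.11/' Dockerfile")  # ("sed", ("3.10", "3.11", "Dockerfile"))
```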
### `PipelineObservation`
```python
class PipelineObservation(BaseModel):
stdout: str # what the agent sees this turn
exit_code: int # 0 = success, 1 = error
pipeline_status: str # 'not_run' | 'failed' | 'passed'
steps_remaining: int
done: bool = False
reward: float = 0.0
```
### `PipelineState` (server-side only)
```python
class PipelineState(BaseModel):
episode_id: str
task: str # "easy" | "medium" | "hard"
filesystem: Dict[str, str]
pipeline_status: str
step_count: int
done: bool
total_reward: float
answer_key: Dict[str, Any] # never sent to agent, used by grader
milestones: List[str] = Field(default_factory=list) # grader-only, tracks unlocked reward tiers
```
Tracks full episode state inside the server, including filesystem mutations, progress, and reward accumulation.
- `answer_key` is hidden from the agent and used only for structural validation in the grader.
- `milestones` track progression through the debugging lifecycle (investigated β†’ diagnosed β†’ fixed β†’ verified).
---
## 3. Task Generation & Logic (Procedural Complexity)
**Design Philosophy**
Tasks are not static templates. They are programmatically synthesized scenarios generated by `core/scenarios/generator.py`.
Each episode is a unique composition of:
- a pipeline graph
- injected faults
- a deterministic seed
This makes the environment **non-memorizable**, forcing agents to rely on **generalized diagnostic reasoning** instead of string matching.
---
### Difficulty Tiers & Behavioral Intent
Tasks are categorized by the **depth of reasoning** required.
| Tier | Max Steps | Ideal Steps | Faults | Strategic Complexity |
|---|---|---|---|---|
| Easy | 10 | 3 | 1 | Linear: single-file lookup β†’ direct fix |
| Medium | 15 | 6 | 2 | Relational: cross-file reasoning |
| Hard | 25 | 10 | 3 | Sequential: cascading failures |
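The tier parameters above can be expressed as a small config mapping (illustrative; the real generator may store these differently):

```python
# Tier parameters from the table above, as a config mapping.
TIERS = {
    "easy":   {"max_steps": 10, "ideal_steps": 3,  "faults": 1},
    "medium": {"max_steps": 15, "ideal_steps": 6,  "faults": 2},
    "hard":   {"max_steps": 25, "ideal_steps": 10, "faults": 3},
}

def step_budget(task: str) -> int:
    """Look up the hard step cap for a task tier."""
    return TIERS[task]["max_steps"]
```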
---
### How the Generator Synthesizes an Episode
Each episode is constructed in four stages:
1. **Base Filesystem**
A clean project snapshot is initialized.
2. **Pipeline Definition**
CI/CD stages are constructed (e.g., `install → test → build`).
3. **Fault Injection**
Files are mutated with **typed faults**, such as:
- `package_present` / `package_version`
- `dockerfile_base`
- `env_var_present`
- `config_value`
- `ci_stage_order`
- `port_value`
4. **Answer Key Generation**
A hidden ground-truth spec used by the grader for **structural validation**.
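The four stages can be sketched end-to-end as follows (a minimal sketch: the file contents, fault pool, and answer-key shape are assumptions for illustration, not the real `generator.py` logic):

```python
import random

# Sketch of the four generation stages described above.
def generate_episode(task: str, seed: int):
    rng = random.Random(seed)                      # deterministic per (task, seed)
    fs = {"requirements.txt": "flask==2.3\n",      # 1. base filesystem snapshot
          "Dockerfile": "FROM python:3.11-slim\n"}
    stages = ["install", "test", "build"]          # 2. pipeline definition
    fault_pool = ["package_present", "dockerfile_base", "env_var_present"]
    n_faults = {"easy": 1, "medium": 2, "hard": 3}[task]
    faults = rng.sample(fault_pool, n_faults)      # 3. typed fault injection
    if "package_present" in faults:
        fs["requirements.txt"] = ""                # drop the needed dependency
    if "dockerfile_base" in faults:
        fs["Dockerfile"] = "FROM python:3.11-alpine\n"
    answer_key = {"faults": faults,                # 4. hidden ground truth
                  "expected": {"requirements.txt": "flask==2.3"}}
    return fs, stages, answer_key

# Same (task, seed) → identical scenario, so episodes are reproducible.
fs_a, _, _ = generate_episode("medium", seed=7)
fs_b, _, _ = generate_episode("medium", seed=7)
```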
---
### Scenario Breakdown
#### Easy: Localized Debugging
Focus: **Information retrieval**
- Failure is confined to a single file
- Example: `app.py` imports a missing dependency
**Agent goal:**
Map runtime error → specific file → apply fix
---
#### Medium: Cross-Subsystem Reasoning
Focus: **Iterative discovery**
- Two faults across different subsystems
- Only the *first failing stage* is visible initially
**Key concept: Shadowing**
> Fixing one issue reveals the next.
| Variant | Pipeline | Faults |
|---|---|---|
| A | install → env_check → build | missing env var + Docker mismatch |
| B | install → config → smoke_test | dependency + config gate |
| C | install → port_check → build | port mismatch + Docker issue |
**Agent requirement:**
- Prioritize fixes correctly
- Maintain state across iterations
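Shadowing can be sketched as a runner that stops at the first failing stage (a minimal sketch; the stage names and checks are assumptions, and the real `stage_runner.py` is richer):

```python
# Sketch of "shadowing": the runner reports only the first failing stage,
# so later faults stay hidden until earlier ones are fixed.
def run_pipeline(filesystem: dict) -> tuple:
    checks = [
        ("install",   lambda fs: "flask" in fs.get("requirements.txt", "")),
        ("env_check", lambda fs: "SECRET_KEY" in fs.get(".env", "")),
        ("build",     lambda fs: "slim" in fs.get("Dockerfile", "")),
    ]
    for stage, ok in checks:
        if not ok(filesystem):
            return "failed", f"{stage}: FAIL"   # only the first failure is shown
    return "passed", "all stages passed"

fs = {"requirements.txt": "", ".env": "", "Dockerfile": "FROM python:3.11-alpine"}
run_pipeline(fs)                    # install fails; env_check/build faults hidden
fs["requirements.txt"] = "flask"
run_pipeline(fs)                    # now the env_check failure is revealed
```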
---
#### Hard: Cascading Failures
Focus: **Causal + temporal reasoning**
- Three faults chained across stages
- Each fix changes future observations
Example chain:
CI stage order incorrect
→ build executes prematurely
→ dependency resolution fails
**Key property: Temporal dependency**
- Fixing earlier stages alters downstream failures
---
### Why This Design Works
#### 1. Partial Observability
The agent never sees all failures at once.
#### 2. Structural Validation
Correctness is semantic:
- not "does file match?"
- but "is the system now valid?"
#### 3. Anti-Shortcut Mechanics
- **File Integrity Check**
Prevents appending junk to pass tests
- **Blind Edit Penalty**
Forces reading before editing
- **Edit Spam Penalty**
Discourages brute-force iteration
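The blind-edit and edit-spam checks can be sketched like this (illustrative function and field names; penalty magnitudes match the grader section below, but the real bookkeeping in `grader.py` may differ):

```python
# Sketch of anti-shortcut shaping: edits to files the agent never read,
# and repeated edits to the same file, draw penalties.
def shaping_delta(action: str, target_file: str,
                  files_read: set, edits_per_file: dict) -> float:
    delta = 0.0
    if action in ("echo", "sed"):
        if target_file not in files_read:
            delta -= 0.10                     # blind edit: read before editing
        edits_per_file[target_file] = edits_per_file.get(target_file, 0) + 1
        if edits_per_file[target_file] > 2:
            delta -= 0.05                     # edit spam: >2 edits per file
    return delta

reads, edits = set(), {}
shaping_delta("sed", "Dockerfile", reads, edits)   # penalized: never read it
reads.add("Dockerfile")
shaping_delta("sed", "Dockerfile", reads, edits)   # no penalty
```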
---
### Optimal Agent Policy
The correct strategy is not:
`try random fixes → rerun`
It is:
`observe → localize → read → diagnose → fix → verify → repeat`
Each difficulty level increases pressure on:
- localisation accuracy
- causal reasoning
- sequencing of fixes
### Why hard is genuinely hard
- **Docker base reasoning (`alpine` vs `slim`)**
Errors like `gcc: command not found` require understanding that `alpine` lacks build tools/glibc. The correct fix is switching to `python:3.11-slim`, not just bumping versions.
- **Dependency compatibility (not presence)**
Failures like `numpy==1.21` are not about missing packages, but **version conflicts** with transitive dependencies. The agent must reason about compatibility, not just add lines.
- **Sequential error revelation**
Only one failure is visible per pipeline run. Fixing one stage reveals the next, forcing **multi-step reasoning loops**.
- **Exploration vs efficiency trade-off**
Reading everything wastes steps (efficiency penalty), but blind edits are penalized. The agent must act **surgically**, not exhaustively.
---
## 4. Grader Logic & Reward Shaping
> The grader rewards *process quality*, not just success. Agents are guided through a realistic debugging flow: investigate → diagnose → fix → verify.
Each step reward is composed of:
`grade(state)` delta + `balance_score(state, ctx)`
---
### Core Score (Structural Progress)
- **Fix Credit (max +0.20)**
Proportional to fraction of correctly applied fixes.
- **Pipeline Passed (+0.50)**
Awarded only when `pipeline_status == "passed"`.
- **File Integrity (−0.10 → 0.0)**
Penalizes excessive edits (e.g., appending large amounts of code).
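The three core terms compose roughly as follows (a sketch only: the function signature, the 200-byte integrity threshold, and the rounding are assumptions; the composition in `grader.py` may differ):

```python
# Sketch of the core structural score: fix credit is proportional,
# the pass bonus is gated on pipeline status, and excessive appended
# content draws the file-integrity penalty.
def core_score(fixes_applied: int, fixes_total: int,
               pipeline_status: str, excess_bytes: int) -> float:
    score = 0.20 * (fixes_applied / fixes_total)   # fix credit (max +0.20)
    if pipeline_status == "passed":
        score += 0.50                              # pipeline passed
    if excess_bytes > 200:                         # illustrative threshold
        score -= 0.10                              # file-integrity penalty
    return round(score, 2)

core_score(1, 2, "failed", 0)     # partial credit for one of two fixes
core_score(2, 2, "passed", 0)     # full credit plus pass bonus
```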
---
### Milestone-Based Progression
| Stage | Description | Reward |
|------|------------|--------|
| Investigated | First pipeline run to observe failure | +0.10 |
| Diagnosed | Reads relevant diagnostic/source files | +0.10 |
| Fix Applied | Valid structural fix detected | +0.15 |
| Verified | Pipeline successfully passes | +0.50 |
Progress is **state-driven**, not command-driven.
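State-driven unlocking can be sketched as follows (the state fields and condition checks are illustrative assumptions; the rewards mirror the table above):

```python
# Sketch of milestone progression: each tier pays out once, based on
# what the episode state shows rather than which command was typed.
MILESTONE_REWARDS = {"investigated": 0.10, "diagnosed": 0.10,
                     "fix_applied": 0.15, "verified": 0.50}

def unlock_milestones(state: dict, unlocked: list) -> float:
    reward = 0.0
    conditions = {
        "investigated": state["pipeline_runs"] >= 1,
        "diagnosed":    state["relevant_files_read"],
        "fix_applied":  state["fixes_applied"] >= 1,
        "verified":     state["pipeline_status"] == "passed",
    }
    for name, met in conditions.items():
        if met and name not in unlocked:
            unlocked.append(name)                 # one-time unlock
            reward += MILESTONE_REWARDS[name]
    return reward

state = {"pipeline_runs": 1, "relevant_files_read": False,
         "fixes_applied": 0, "pipeline_status": "failed"}
unlocked = []
unlock_milestones(state, unlocked)   # investigated unlocks
unlock_milestones(state, unlocked)   # nothing new: no further reward
```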
---
### Behavioral Shaping (Per-Step)
#### Rewards
- **Correct Diagnosis**: +0.10
- **Cross-File Reasoning**: +0.05
#### Penalties
- **Blind Edits** (edit without reading): −0.10
- **Edit Spam** (>2 edits per file): −0.05 each
- **Idle Pipeline Runs** (no FS changes): −0.05
- **Stalling** (no progress): −0.05
- **Regression** (breaking prior fix): −0.15
- **Inefficiency**: −0.02 per step beyond ideal (6 steps)
---
### Key Design Insight
The grader differentiates:
- **Structured debugging** β†’ rewarded
- **Brute-force / guesswork** β†’ penalized
Partial fixes receive proportional credit, enabling meaningful learning even in multi-error environments.
---
## 5. Project Structure
```
CI_CD_Doctor/
├── Dockerfile            ← container setup
├── README.md             ← main project overview
├── __init__.py
├── client.py             ← environment client interface
├── models.py             ← core data models (Action / State / Observation)
├── inference.py          ← baseline agent runner
├── openenv.yaml          ← OpenEnv task + grader config
├── pyproject.toml
├── uv.lock               ← dependency lockfile
│
├── core/                 ← modularized environment logic
│   ├── __init__.py
│   ├── grading/
│   │   └── grader.py         ← scoring + reward shaping logic
│   ├── pipeline/
│   │   └── stage_runner.py   ← simulated CI/CD stages
│   ├── scenarios/
│   │   └── generator.py      ← task + variant generation
│   ├── utils/
│   │   └── packages.py       ← dependency definitions
│   └── validation/
│       ├── parser.py         ← command parsing logic
│       └── validator.py      ← structural validation (CI rules, configs)
│
├── server/               ← execution backend
│   ├── __init__.py
│   ├── app.py            ← FastAPI entrypoint
│   ├── app_2.py          ← alternate server setup
│   └── environment.py    ← main env loop (reset/step/state)
│
└── docs/
    ├── README.md             ← HF Space readme
    └── advanced_readme.md    ← detailed system design
```
---
## 6. Development
### Run the server locally
```bash
uvicorn server.app:app --reload
```