# CI/CD Doctor – Advanced Reference

Deep dive into the environment internals: architecture, I/O contracts, task variants, reward shaping, grader semantics, and layout. If you are only trying to run the env, start with the [root README](../README.md).

---

## 1. Environment Overview

> **Highlight:** The environment is a pure in-memory simulation. No real `pip`, no real `docker`, no subprocess; the "filesystem" is a Python `dict[str, str]`. Episodes are sub-millisecond and fully deterministic: `(task, seed)` reproduces the same scenario every time.
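As a mental model, that dict-backed filesystem boils down to a few lines of Python. The sketch below is illustrative only (the names and file contents are invented, not the actual `environment.py` code):

```python
# Illustrative sketch of the dict-backed "filesystem"; not the real environment.py.
fs: dict[str, str] = {"requirements.txt": "flask==2.3\n"}

def cat(path: str) -> str:
    # Mirrors the `cat <file>` command: a plain dict lookup.
    return fs.get(path, f"cat: {path}: No such file")

def append(path: str, line: str) -> None:
    # Mirrors `echo "<text>" >> <file>`: an in-place dict mutation.
    fs[path] = fs.get(path, "") + line + "\n"

append("requirements.txt", "pandas")  # one "edit" mutates the dict directly
```

Because every operation is a dict read or write, an entire episode costs microseconds and carries no process or network state.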
```
Agent issues a command string ──▶ parser.py
                                      │
                                      ▼
                   environment/server/environment.py
                                      │
          ┌───────────────────────────┼──────────────────────────┐
          ▼                           ▼                          ▼
  in-memory filesystem          stage_runner.py              grader.py
  (mutated by edits)          (simulated stages)         (reward + tiers)
          │                           │                          │
          └───────────────────────────┴──────────────────────────┘
                                      │
                                      ▼
                      PipelineObservation back to agent
```
**Episode lifecycle.** `reset(task, seed)` builds a broken scenario → `step(action)` applies one shell-like command → the episode terminates when the pipeline passes *or* the step budget runs out.
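That lifecycle can be sketched as a driver loop. Everything below is hypothetical scaffolding (`Obs`, `StubEnv`, and `run_episode` are invented stand-ins; the real interface lives in `client.py` and `models.py`):

```python
from dataclasses import dataclass

@dataclass
class Obs:          # invented stand-in for PipelineObservation
    done: bool
    reward: float

class StubEnv:      # toy env that "passes" after two steps
    def reset(self, task, seed):
        self.n = 0
        return Obs(done=False, reward=0.0)

    def step(self, command):
        self.n += 1
        return Obs(done=self.n >= 2, reward=0.5)

def run_episode(env, policy, task="easy", seed=0):
    obs = env.reset(task=task, seed=seed)   # build a broken scenario
    total = obs.reward
    while not obs.done:                     # one shell-like command per turn
        obs = env.step(policy(obs))
        total += obs.reward
    return total

total = run_episode(StubEnv(), lambda obs: "pipeline run")
```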
---

## 2. Action & Observation Spaces

> **Highlight:** All I/O is typed with Pydantic v2 models in [environment/models.py](../environment/models.py). The agent's entire interface is a single free-form `command` string per turn; seven command shapes are recognised.
### `PipelineAction`

```python
class PipelineAction(BaseModel):
    command: str  # raw shell-like string, e.g. 'cat requirements.txt'
```
Seven command shapes are recognised by [environment/parser.py](../environment/parser.py):

| Command | Example | Effect |
|---|---|---|
| `cat <file>` | `cat requirements.txt` | Read a file from the in-memory FS |
| `echo "<text>" >> <file>` | `echo "pandas" >> requirements.txt` | Append a line to a file |
| `sed -i 's/old/new/' <file>` | `sed -i 's/3.10/3.11/' Dockerfile` | Replace all occurrences of text in a file |
| `pipeline run` | `pipeline run` | Execute the full pipeline and return logs |
| `pipeline logs [stage]` | `pipeline logs install` | Show the last pipeline logs (optionally filtered by stage) |
| `pipeline status` | `pipeline status` | Show current pipeline state (`not_run` / `failed` / `passed`) |
| `diagnose "<reason>"` | `diagnose "Missing env var SECRET_KEY"` | Record the agent's diagnosis (used for reward bonuses) |

Anything else returns `Command not recognized` with `exit_code=1`.
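A minimal dispatcher over these shapes could be written with regular expressions. This is a sketch only; the actual logic lives in `environment/parser.py` and may differ:

```python
import re

# Sketch of command dispatch over the seven recognised shapes.
PATTERNS = [
    (re.compile(r"^cat (\S+)$"), "cat"),
    (re.compile(r'^echo "(.*)" >> (\S+)$'), "append"),
    (re.compile(r"^sed -i 's/(.*)/(.*)/' (\S+)$"), "sed"),
    (re.compile(r"^pipeline run$"), "run"),
    (re.compile(r"^pipeline logs(?: (\S+))?$"), "logs"),
    (re.compile(r"^pipeline status$"), "status"),
    (re.compile(r'^diagnose "(.*)"$'), "diagnose"),
]

def parse(command: str):
    for pattern, name in PATTERNS:
        m = pattern.match(command)
        if m:
            return name, m.groups()
    # Caller turns this into "Command not recognized" with exit_code=1.
    return None, ()
```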
### `PipelineObservation`

```python
class PipelineObservation(BaseModel):
    stdout: str           # what the agent sees this turn
    exit_code: int        # 0 = success, 1 = error
    pipeline_status: str  # 'not_run' | 'failed' | 'passed'
    steps_remaining: int
    done: bool = False
    reward: float = 0.0
```
### `PipelineState` (server-side only)

```python
class PipelineState(BaseModel):
    episode_id: str
    task: str                   # "easy" | "medium" | "hard"
    filesystem: Dict[str, str]
    pipeline_status: str
    step_count: int
    done: bool
    total_reward: float
    answer_key: Dict[str, Any]  # never sent to agent, used by grader
    milestones: List[str] = Field(default_factory=list)  # grader-only, tracks unlocked reward tiers
```

Tracks the full episode state inside the server, including filesystem mutations, progress, and reward accumulation.

- `answer_key` is hidden from the agent and used only for structural validation in the grader.
- `milestones` track progression through the debugging lifecycle (investigated → diagnosed → fixed → verified).
---

## 3. Task Generation & Logic (Procedural Complexity)

**Design Philosophy**

Tasks are not static templates. They are programmatically synthesized scenarios generated by `core/scenarios/generator.py`.

Each episode is a unique composition of:

- a pipeline graph
- injected faults
- a deterministic seed

This makes the environment **non-memorizable**, forcing agents to rely on **generalized diagnostic reasoning** instead of string matching.

---

### Difficulty Tiers & Behavioral Intent

Tasks are categorized by the **depth of reasoning** required.

| Tier | Max Steps | Ideal Steps | Faults | Strategic Complexity |
|---|---|---|---|---|
| Easy | 10 | 3 | 1 | Linear: single-file lookup → direct fix |
| Medium | 15 | 6 | 2 | Relational: cross-file reasoning |
| Hard | 25 | 10 | 3 | Sequential: cascading failures |
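The budgets above can be encoded as plain data; the sketch below simply mirrors the table (an illustrative encoding, not the generator's real representation):

```python
# Tier budgets mirroring the difficulty table (illustrative encoding).
TIERS = {
    "easy":   {"max_steps": 10, "ideal_steps": 3,  "faults": 1},
    "medium": {"max_steps": 15, "ideal_steps": 6,  "faults": 2},
    "hard":   {"max_steps": 25, "ideal_steps": 10, "faults": 3},
}

def step_budget(task: str) -> int:
    # Episode terminates when this many steps have been spent.
    return TIERS[task]["max_steps"]
```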
---

### How the Generator Synthesizes an Episode

Each episode is constructed in four stages:

1. **Base Filesystem**: a clean project snapshot is initialized.
2. **Pipeline Definition**: CI/CD stages are constructed (e.g., `install → test → build`).
3. **Fault Injection**: files are mutated with **typed faults**, such as:
   - `package_present` / `package_version`
   - `dockerfile_base`
   - `env_var_present`
   - `config_value`
   - `ci_stage_order`
   - `port_value`
4. **Answer Key Generation**: a hidden ground-truth spec used by the grader for **structural validation**.
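The four stages can be sketched with a seeded `random.Random`, which is what makes `(task, seed)` reproducible. Everything here (`FAULTS`, `generate`, the file contents) is an invented stand-in for `core/scenarios/generator.py`:

```python
import random

# Hypothetical sketch of the four-stage synthesis; not the real generator.
FAULTS = ["package_present", "package_version", "dockerfile_base",
          "env_var_present", "config_value", "ci_stage_order", "port_value"]

def generate(task: str, seed: int, n_faults: int) -> dict:
    rng = random.Random(f"{task}-{seed}")                  # deterministic per (task, seed)
    filesystem = {"Dockerfile": "FROM python:3.11-slim\n"}  # 1. base snapshot
    stages = ["install", "test", "build"]                   # 2. pipeline definition
    injected = rng.sample(FAULTS, n_faults)                 # 3. typed fault injection
    answer_key = {f: "expected-value" for f in injected}    # 4. hidden ground truth
    return {"filesystem": filesystem, "stages": stages,
            "faults": injected, "answer_key": answer_key}

# Same (task, seed) -> identical scenario, every time.
a, b = generate("medium", 7, 2), generate("medium", 7, 2)
```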
---

### Scenario Breakdown

#### Easy – Localized Debugging

Focus: **Information retrieval**

- Failure is confined to a single file
- Example: `app.py` imports a missing dependency

**Agent goal:** map runtime error → specific file → apply fix

---

#### Medium – Cross-Subsystem Reasoning

Focus: **Iterative discovery**

- Two faults across different subsystems
- Only the *first failing stage* is visible initially

**Key concept: shadowing**

> Fixing one issue reveals the next.

| Variant | Pipeline | Faults |
|---|---|---|
| A | install → env_check → build | missing env var + Docker mismatch |
| B | install → config → smoke_test | dependency + config gate |
| C | install → port_check → build | port mismatch + Docker issue |

**Agent requirements:**

- Prioritize fixes correctly
- Maintain state across iterations
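Shadowing falls out of one property: stages run in order and the pipeline halts at the first failure. A toy sketch (invented names, not the real `stage_runner.py`):

```python
# Sketch: the pipeline stops at the first failing stage, so later faults
# stay hidden ("shadowed") until earlier ones are fixed.
def run_pipeline(stages, broken):
    for stage in stages:
        if stage in broken:
            return ("failed", stage)  # only this failure is visible
    return ("passed", None)

stages = ["install", "env_check", "build"]
broken = {"env_check", "build"}

status1, stage1 = run_pipeline(stages, broken)  # env_check shadows build
broken.discard(stage1)                          # fix the visible fault
status2, stage2 = run_pipeline(stages, broken)  # now build is revealed
broken.discard(stage2)
status3, _ = run_pipeline(stages, broken)       # all fixed: pipeline passes
```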
---

#### Hard – Cascading Failures

Focus: **Causal + temporal reasoning**

- Three faults chained across stages
- Each fix changes future observations

Example chain: CI stage order incorrect → build executes prematurely → dependency resolution fails

**Key property: temporal dependency**

- Fixing earlier stages alters downstream failures
| ### Why This Design Works | |
| #### 1. Partial Observability | |
| The agent never sees all failures at once. | |
| #### 2. Structural Validation | |
| Correctness is semantic: | |
| - not "does file match?" | |
| - but "is the system now valid?" | |
| #### 3. Anti-Shortcut Mechanics | |
| - **File Integrity Check** | |
| Prevents appending junk to pass tests | |
| - **Blind Edit Penalty** | |
| Forces reading before editing | |
| - **Edit Spam Penalty** | |
| Discourages brute-force iteration | |
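The last two mechanics reduce to simple bookkeeping: track which files were read and how often each was edited. A sketch using the penalty values quoted in the grader section (illustrative code; the real logic lives in `core/grading/grader.py`):

```python
# Sketch of anti-shortcut bookkeeping; names are invented.
read_files = set()
edit_counts = {}

def penalty_for_edit(path: str) -> float:
    edit_counts[path] = edit_counts.get(path, 0) + 1
    p = 0.0
    if path not in read_files:
        p -= 0.10  # blind edit: file was never read first
    if edit_counts[path] > 2:
        p -= 0.05  # edit spam on the same file
    return p

read_files.add("Dockerfile")
p1 = penalty_for_edit("Dockerfile")  # read first: no penalty
p2 = penalty_for_edit("app.py")      # blind edit: penalized
```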
---

### Optimal Agent Policy

The correct strategy is not:

`try random fixes → rerun`

It is:

`observe → localize → read → diagnose → fix → verify → repeat`

Each difficulty level increases pressure on:

- localization accuracy
- causal reasoning
- sequencing of fixes
### Why hard is genuinely hard

- **Docker base reasoning (`alpine` vs `slim`)**
  Errors like `gcc: command not found` require understanding that `alpine` lacks build tools/glibc. The correct fix is switching to `python:3.11-slim`, not just bumping versions.
- **Dependency compatibility (not presence)**
  Failures like `numpy==1.21` are not about missing packages, but **version conflicts** with transitive dependencies. The agent must reason about compatibility, not just add lines.
- **Sequential error revelation**
  Only one failure is visible per pipeline run. Fixing one stage reveals the next, forcing **multi-step reasoning loops**.
- **Exploration vs. efficiency trade-off**
  Reading everything wastes steps (efficiency penalty), but blind edits are penalized. The agent must act **surgically**, not exhaustively.

---
## 4. Grader Logic & Reward Shaping

> The grader rewards *process quality*, not just success. Agents are guided through a realistic debugging flow: investigate → diagnose → fix → verify.

Each step reward is composed of:

**grade(state) delta + balance_score(state, ctx)**
---

### Core Score (Structural Progress)

- **Fix Credit (max +0.20)**
  Proportional to the fraction of correctly applied fixes.
- **Pipeline Passed (+0.50)**
  Awarded only when `pipeline_status == "passed"`.
- **File Integrity (−0.10 → 0.0)**
  Penalizes excessive edits (e.g., appending large amounts of code).
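Fix credit is strictly proportional: fixing one of two faults under the +0.20 cap yields +0.10. A one-liner makes the arithmetic concrete (function name is invented):

```python
# Sketch: partial credit proportional to the fraction of faults fixed (cap +0.20).
def fix_credit(fixed: int, total_faults: int, cap: float = 0.20) -> float:
    return cap * (fixed / total_faults)

credit = fix_credit(1, 2)  # one of two faults fixed
```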
---

### Milestone-Based Progression

| Stage | Description | Reward |
|------|------------|--------|
| Investigated | First pipeline run to observe the failure | +0.10 |
| Diagnosed | Reads relevant diagnostic/source files | +0.10 |
| Fix Applied | Valid structural fix detected | +0.15 |
| Verified | Pipeline successfully passes | +0.50 |

Progress is **state-driven**, not command-driven.
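A one-shot payout per milestone could be implemented like this (a sketch; `MILESTONE_REWARDS` and `award` are invented names, with values taken from the table above):

```python
# Sketch: each milestone pays out once per episode; repeats earn nothing.
MILESTONE_REWARDS = {"investigated": 0.10, "diagnosed": 0.10,
                     "fix_applied": 0.15, "verified": 0.50}

def award(state_milestones: list, reached: str) -> float:
    if reached in state_milestones:
        return 0.0
    state_milestones.append(reached)  # unlocked tiers live in PipelineState.milestones
    return MILESTONE_REWARDS[reached]

ms = []
r1 = award(ms, "investigated")  # first time: rewarded
r2 = award(ms, "investigated")  # repeat: nothing
```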
---

### Behavioral Shaping (Per-Step)

#### Rewards

- **Correct Diagnosis**: +0.10
- **Cross-File Reasoning**: +0.05

#### Penalties

- **Blind Edits** (edit without reading): −0.10
- **Edit Spam** (>2 edits per file): −0.05 each
- **Idle Pipeline Runs** (no FS changes): −0.05
- **Stalling** (no progress): −0.05
- **Regression** (breaking a prior fix): −0.15
- **Inefficiency**: −0.02 per step beyond the ideal (6 steps)
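The inefficiency term is simple arithmetic (sketch; the function name is invented):

```python
# Sketch: -0.02 per step beyond the ideal budget, never a bonus for being under.
def inefficiency_penalty(step_count: int, ideal_steps: int = 6) -> float:
    return -0.02 * max(0, step_count - ideal_steps)

p = inefficiency_penalty(9)  # 3 steps over the ideal
```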
---

### Key Design Insight

The grader differentiates:

- **Structured debugging** → rewarded
- **Brute-force / guesswork** → penalized

Partial fixes receive proportional credit, enabling meaningful learning even in multi-error environments.

---
## 5. Project Structure

```
CI_CD_Doctor/
├── Dockerfile            # container setup
├── README.md             # main project overview
├── __init__.py
├── client.py             # environment client interface
├── models.py             # core data models (Action / State / Observation)
├── inference.py          # baseline agent runner
├── openenv.yaml          # OpenEnv task + grader config
├── pyproject.toml
├── uv.lock               # dependency lockfile
│
├── core/                 # modularized environment logic
│   ├── __init__.py
│   ├── grading/
│   │   └── grader.py           # scoring + reward shaping logic
│   ├── pipeline/
│   │   └── stage_runner.py     # simulated CI/CD stages
│   ├── scenarios/
│   │   └── generator.py        # task + variant generation
│   ├── utils/
│   │   └── packages.py         # dependency definitions
│   └── validation/
│       ├── parser.py           # command parsing logic
│       └── validator.py        # structural validation (CI rules, configs)
│
├── server/               # execution backend
│   ├── __init__.py
│   ├── app.py            # FastAPI entrypoint
│   ├── app_2.py          # alternate server setup
│   └── environment.py    # main env loop (reset/step/state)
│
└── docs/
    ├── README.md           # HF Space readme
    └── advanced_readme.md  # detailed system design
```
---

## 6. Development

### Run the server locally

```bash
uvicorn server.app:app --reload
```