CI/CD Doctor: Advanced Reference
Deep dive into the environment internals: architecture, I/O contracts, task variants, reward shaping, grader semantics, and layout. If you are only trying to run the env, start with the root README.
1. Environment Overview
Highlight: The environment is a pure in-memory simulation. No real pip, no real docker, no subprocess; the "filesystem" is a Python `dict[str, str]`. Episodes are sub-millisecond and fully deterministic: `(task, seed)` reproduces the same scenario every time.
```
Agent issues a command string ──► parser.py
                                      │
                                      ▼
                      environment/server/environment.py
                                      │
          ┌───────────────────────────┼───────────────────────────┐
          ▼                           ▼                           ▼
  in-memory filesystem          stage_runner.py                grader.py
  (mutated by edits)          (simulated stages)           (reward + tiers)
          │                           │                           │
          └───────────────────────────┴───────────────────────────┘
                                      │
                                      ▼
                      PipelineObservation back to agent
```
Episode lifecycle. `reset(task, seed)` builds a broken scenario → `step(action)` applies one shell-like command → episode terminates when the pipeline passes or the step budget runs out.
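A driver loop over this lifecycle might look like the sketch below. `StubEnv` is a minimal stand-in for the real server environment in `environment/server/environment.py`; its names, fields, and trivial "always passes after `pipeline run`" behaviour are illustrative assumptions, not the real implementation.

```python
from dataclasses import dataclass

# Minimal stub mirroring the documented reset/step contract.
# The real logic lives in environment/server/environment.py.
@dataclass
class StubEnv:
    max_steps: int = 10
    step_count: int = 0
    pipeline_status: str = "not_run"

    def reset(self, task: str, seed: int) -> dict:
        self.step_count = 0
        self.pipeline_status = "not_run"
        return {"stdout": f"task={task} seed={seed}", "done": False}

    def step(self, command: str) -> dict:
        self.step_count += 1
        if command == "pipeline run":
            self.pipeline_status = "passed"  # stub: pretend the fix already landed
        done = self.pipeline_status == "passed" or self.step_count >= self.max_steps
        return {
            "stdout": self.pipeline_status,
            "pipeline_status": self.pipeline_status,
            "steps_remaining": self.max_steps - self.step_count,
            "done": done,
        }

env = StubEnv()
obs = env.reset("easy", seed=0)
obs = env.step("cat requirements.txt")
obs = env.step("pipeline run")
```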
2. Action & Observation Spaces
Highlight: All I/O is typed with Pydantic v2 models in environment/models.py. The agent's entire interface is a single free-form `command` string per turn; seven command shapes are recognised.
PipelineAction
```python
class PipelineAction(BaseModel):
    command: str  # raw shell-like string, e.g. 'cat requirements.txt'
```
Seven command shapes are recognised by environment/parser.py:

| Command | Example | Effect |
|---|---|---|
| `cat <file>` | `cat requirements.txt` | Read a file from the in-memory FS |
| `echo "<text>" >> <file>` | `echo "pandas" >> requirements.txt` | Append a line to a file |
| `sed -i 's/old/new/' <file>` | `sed -i 's/3.10/3.11/' Dockerfile` | Replace all occurrences of text in a file |
| `pipeline run` | `pipeline run` | Execute full pipeline and return logs |
| `pipeline logs [stage]` | `pipeline logs install` | Show last pipeline logs (optionally filtered by stage) |
| `pipeline status` | `pipeline status` | Show current pipeline state (not_run / failed / passed) |
| `diagnose "<reason>"` | `diagnose "Missing env var SECRET_KEY"` | Record agent diagnosis (used for reward bonuses) |
Anything else returns `Command not recognized` with `exit_code=1`.
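A dispatcher over these command shapes could be sketched with one regex per shape; the patterns below are illustrative approximations, not the real ones in environment/parser.py.

```python
import re

# One illustrative pattern per documented command shape.
COMMANDS = [
    ("cat",      re.compile(r"^cat\s+(\S+)$")),
    ("append",   re.compile(r'^echo\s+"(.*)"\s+>>\s+(\S+)$')),
    ("sed",      re.compile(r"^sed -i 's/([^/]+)/([^/]*)/'\s+(\S+)$")),
    ("run",      re.compile(r"^pipeline run$")),
    ("logs",     re.compile(r"^pipeline logs(?:\s+(\S+))?$")),
    ("status",   re.compile(r"^pipeline status$")),
    ("diagnose", re.compile(r'^diagnose\s+"(.*)"$')),
]

def parse(command: str):
    """Return (command_name, captured_args); (None, ()) maps to exit_code=1."""
    for name, pattern in COMMANDS:
        m = pattern.match(command.strip())
        if m:
            return name, m.groups()
    return None, ()  # "Command not recognized"
```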
PipelineObservation
```python
class PipelineObservation(BaseModel):
    stdout: str            # what the agent sees this turn
    exit_code: int         # 0 = success, 1 = error
    pipeline_status: str   # 'not_run' | 'failed' | 'passed'
    steps_remaining: int
    done: bool = False
    reward: float = 0.0
```
PipelineState (server-side only)
```python
class PipelineState(BaseModel):
    episode_id: str
    task: str                   # "easy" | "medium" | "hard"
    filesystem: Dict[str, str]
    pipeline_status: str
    step_count: int
    done: bool
    total_reward: float
    answer_key: Dict[str, Any]  # never sent to agent, used by grader
    milestones: List[str] = Field(default_factory=list)  # grader-only, tracks unlocked reward tiers
```
Tracks full episode state inside the server, including filesystem mutations, progress, and reward accumulation.
`answer_key` is hidden from the agent and used only for structural validation in the grader. `milestones` tracks progression through the debugging lifecycle (investigated → diagnosed → fixed → verified).
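One way the milestone list could advance is sketched below; the assumption that milestones unlock strictly in lifecycle order, and the helper name `unlock`, are mine, not taken from the grader code.

```python
# Assumed lifecycle order, per the documented progression.
LIFECYCLE = ["investigated", "diagnosed", "fixed", "verified"]

def unlock(milestones: list, event: str) -> list:
    """Append `event` only if it is the next milestone in the lifecycle (assumption)."""
    nxt = LIFECYCLE[len(milestones)] if len(milestones) < len(LIFECYCLE) else None
    return milestones + [event] if event == nxt else milestones
```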
3. Task Generation & Logic (Procedural Complexity)
Design Philosophy
Tasks are not static templates. They are programmatically synthesized scenarios generated by core/scenarios/generator.py.
Each episode is a unique composition of:
- a pipeline graph
- injected faults
- a deterministic seed
This makes the environment non-memorizable, forcing agents to rely on generalized diagnostic reasoning instead of string matching.
Difficulty Tiers & Behavioral Intent
Tasks are categorized by the depth of reasoning required.
| Tier | Max Steps | Ideal Steps | Faults | Strategic Complexity |
|---|---|---|---|---|
| Easy | 10 | 3 | 1 | Linear: single-file lookup → direct fix |
| Medium | 15 | 6 | 2 | Relational: cross-file reasoning |
| Hard | 25 | 10 | 3 | Sequential: cascading failures |
How the Generator Synthesizes an Episode
Each episode is constructed in four stages:
1. Base Filesystem: a clean project snapshot is initialized.
2. Pipeline Definition: CI/CD stages are constructed (e.g., install → test → build).
3. Fault Injection: files are mutated with typed faults, such as `package_present`/`package_version`, `dockerfile_base`, `env_var_present`, `config_value`, `ci_stage_order`, `port_value`.
4. Answer Key Generation: a hidden ground-truth spec used by the grader for structural validation.
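The four stages above can be sketched as a seeded generator. This is a hypothetical reduction of core/scenarios/generator.py: the fault names come from the list above, but the file contents, stage list, and sampling logic are invented for illustration.

```python
import random

# Typed fault names from the documentation; everything else is illustrative.
FAULTS = ["package_version", "dockerfile_base", "env_var_present",
          "config_value", "ci_stage_order", "port_value"]

def generate_scenario(task: str, seed: int) -> dict:
    rng = random.Random(f"{task}:{seed}")        # deterministic per (task, seed)
    n_faults = {"easy": 1, "medium": 2, "hard": 3}[task]
    filesystem = {                               # stage 1: base filesystem
        "requirements.txt": "flask==2.0\n",
        "Dockerfile": "FROM python:3.11-slim\n",
    }
    stages = ["install", "test", "build"]        # stage 2: pipeline definition
    injected = rng.sample(FAULTS, n_faults)      # stage 3: fault injection
    answer_key = {f: "expected-value" for f in injected}  # stage 4: hidden key
    return {"filesystem": filesystem, "stages": stages,
            "faults": injected, "answer_key": answer_key}
```

Seeding the RNG from the `(task, seed)` pair is what makes every episode reproducible while keeping episodes non-memorizable across seeds.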
Scenario Breakdown
Easy: Localized Debugging
Focus: Information retrieval
- Failure is confined to a single file
- Example: `app.py` imports a missing dependency
Agent goal:
Map runtime error → specific file → apply fix
Medium: Cross-Subsystem Reasoning
Focus: Iterative discovery
- Two faults across different subsystems
- Only the first failing stage is visible initially
Key concept: Shadowing
Fixing one issue reveals the next.
| Variant | Pipeline | Faults |
|---|---|---|
| A | install → env_check → build | missing env var + Docker mismatch |
| B | install → config → smoke_test | dependency + config gate |
| C | install → port_check → build | port mismatch + Docker issue |
Agent requirement:
- Prioritize fixes correctly
- Maintain state across iterations
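The shadowing concept can be shown with a toy pipeline runner: stages execute in order and only the first failure is reported, so a later fault stays invisible until the earlier one is fixed. Everything here is illustrative, not the real stage_runner.py.

```python
# Toy shadowing demo: only the first failing stage is ever visible.
def run_pipeline(stages, faults):
    for stage in stages:
        if stage in faults:
            return f"FAILED at {stage}"   # later stages never execute
    return "passed"

stages = ["install", "env_check", "build"]
faults = {"install", "build"}             # two injected faults, one visible
first = run_pipeline(stages, faults)      # reports the install failure only
faults.discard("install")                 # apply the first fix
second = run_pipeline(stages, faults)     # now the build fault surfaces
```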
Hard: Cascading Failures
Focus: Causal + temporal reasoning
- Three faults chained across stages
- Each fix changes future observations
Example chain:
CI stage order incorrect → build executes prematurely → dependency resolution fails
Key property: Temporal dependency
- Fixing earlier stages alters downstream failures
Why This Design Works
1. Partial Observability
The agent never sees all failures at once.
2. Structural Validation
Correctness is semantic:
- not "does file match?"
- but "is the system now valid?"
3. Anti-Shortcut Mechanics
- File Integrity Check: prevents appending junk to pass tests
- Blind Edit Penalty: forces reading before editing
- Edit Spam Penalty: discourages brute-force iteration
Optimal Agent Policy
The correct strategy is not:
try random fixes → rerun
It is:
observe → localize → read → diagnose → fix → verify → repeat
Each difficulty level increases pressure on:
- localisation accuracy
- causal reasoning
- sequencing of fixes
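For an easy-tier dependency fault, the observe-to-verify loop might reduce to a command trace like the following; the file name and diagnosis text are illustrative, only the command shapes come from section 2.

```python
# Hypothetical optimal-policy trace for an easy-tier episode.
trace = [
    'pipeline run',                                      # observe the first failure
    'pipeline logs install',                             # localize the failing stage
    'cat requirements.txt',                              # read before editing
    'diagnose "pandas missing from requirements.txt"',   # record the diagnosis
    'echo "pandas" >> requirements.txt',                 # fix
    'pipeline run',                                      # verify
]
```

Note the trace reads the file before editing and diagnoses before fixing, which is exactly what the shaping rewards.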
Why hard is genuinely hard
- Docker base reasoning (`alpine` vs `slim`): errors like `gcc: command not found` require understanding that `alpine` lacks build tools/glibc. The correct fix is switching to `python:3.11-slim`, not just bumping versions.
- Dependency compatibility (not presence): failures like `numpy==1.21` are not about missing packages, but version conflicts with transitive dependencies. The agent must reason about compatibility, not just add lines.
- Sequential error revelation: only one failure is visible per pipeline run. Fixing one stage reveals the next, forcing multi-step reasoning loops.
- Exploration vs efficiency trade-off: reading everything wastes steps (efficiency penalty), but blind edits are penalized. The agent must act surgically, not exhaustively.
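The compatibility point can be made concrete with a toy check: the package is pinned and present, yet installation still fails because the pin is below an (invented) floor required by a transitive dependency. The constraint value and function names are illustrative.

```python
# Toy compatibility check: the fault is a version conflict, not absence.
def parse_version(v: str) -> tuple:
    return tuple(int(x) for x in v.split("."))

MIN_NUMPY = "1.22.0"   # hypothetical floor demanded by a transitive dependency

def install_ok(pins: dict) -> bool:
    if "numpy" not in pins:
        return False   # easy-tier failure mode: package simply missing
    # hard-tier failure mode: package present but pinned too low
    return parse_version(pins["numpy"]) >= parse_version(MIN_NUMPY)
```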
4. Grader Logic & Reward Shaping
The grader rewards process quality, not just success. Agents are guided through a realistic debugging flow: investigate → diagnose → fix → verify.
Each step reward is composed of: `grade(state)` delta + `balance_score(state, ctx)`
Core Score (Structural Progress)
- Fix Credit (max +0.20): proportional to the fraction of correctly applied fixes.
- Pipeline Passed (+0.50): awarded only when `pipeline_status == "passed"`.
- File Integrity (−0.10 → 0.0): penalizes excessive edits (e.g., appending large amounts of code).
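The core score might be computed as below. The weights follow the documented values; the linear scaling of the integrity penalty and the parameter names are assumptions, not the real core/grading/grader.py.

```python
# Illustrative core-score computation (weights from the documentation;
# integrity-penalty scaling is an assumption).
def core_score(fixed: int, total_faults: int, passed: bool, junk_lines: int) -> float:
    fix_credit = 0.20 * (fixed / total_faults)   # partial credit, max +0.20
    pass_bonus = 0.50 if passed else 0.0         # only on a passing pipeline
    integrity = -min(0.10, 0.01 * junk_lines)    # capped at -0.10
    return fix_credit + pass_bonus + integrity
```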
Milestone-Based Progression
| Stage | Description | Reward |
|---|---|---|
| Investigated | First pipeline run to observe failure | +0.10 |
| Diagnosed | Reads relevant diagnostic/source files | +0.10 |
| Fix Applied | Valid structural fix detected | +0.15 |
| Verified | Pipeline successfully passes | +0.50 |
Progress is state-driven, not command-driven.
Behavioral Shaping (Per-Step)
Rewards
- Correct Diagnosis: +0.10
- Cross-File Reasoning: +0.05
Penalties
- Blind Edits (edit without reading): −0.10
- Edit Spam (>2 edits per file): −0.05 each
- Idle Pipeline Runs (no FS changes): −0.05
- Stalling (no progress): −0.05
- Regression (breaking prior fix): −0.15
- Inefficiency: −0.02 per step beyond ideal (6 steps)
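Aggregating these per-step adjustments could look like the sketch below; the event names and how events are detected are assumptions, only the magnitudes follow the lists above.

```python
# Sketch of per-step behavioural shaping (event detection not shown).
SHAPING = {
    "correct_diagnosis": +0.10, "cross_file_reasoning": +0.05,
    "blind_edit": -0.10, "edit_spam": -0.05, "idle_run": -0.05,
    "stall": -0.05, "regression": -0.15,
}

def balance_score(events: set, steps_over_ideal: int = 0) -> float:
    score = sum(SHAPING.get(e, 0.0) for e in events)
    return score - 0.02 * max(0, steps_over_ideal)   # inefficiency drag
```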
Key Design Insight
The grader differentiates:
- Structured debugging β rewarded
- Brute-force / guesswork β penalized
Partial fixes receive proportional credit, enabling meaningful learning even in multi-error environments.
5. Project Structure
```
CI_CD_Doctor/
├── Dockerfile          # container setup
├── README.md           # main project overview
├── __init__.py
├── client.py           # environment client interface
├── models.py           # core data models (Action / State / Observation)
├── inference.py        # baseline agent runner
├── openenv.yaml        # OpenEnv task + grader config
├── pyproject.toml
├── uv.lock             # dependency lockfile
│
├── core/               # modularized environment logic
│   ├── __init__.py
│   ├── grading/
│   │   └── grader.py          # scoring + reward shaping logic
│   ├── pipeline/
│   │   └── stage_runner.py    # simulated CI/CD stages
│   ├── scenarios/
│   │   └── generator.py       # task + variant generation
│   ├── utils/
│   │   └── packages.py        # dependency definitions
│   └── validation/
│       ├── parser.py          # command parsing logic
│       └── validator.py       # structural validation (CI rules, configs)
│
├── server/             # execution backend
│   ├── __init__.py
│   ├── app.py          # FastAPI entrypoint
│   ├── app_2.py        # alternate server setup
│   └── environment.py  # main env loop (reset/step/state)
│
└── docs/
    ├── README.md            # HF space readme
    └── advanced_readme.md   # detailed system design
```
6. Development
Run the server locally:
```
uvicorn server.app:app --reload
```