
CI/CD Doctor – Advanced Reference

Deep dive into the environment internals: architecture, I/O contracts, task variants, reward shaping, grader semantics, and layout. If you are only trying to run the env, start with the root README.


1. Environment Overview

Highlight: The environment is a pure in-memory simulation. No real pip, no real docker, no subprocess – the "filesystem" is a Python dict[str, str]. Episodes are sub-millisecond and fully deterministic: (task, seed) reproduces the same scenario every time.

```
Agent issues a command string  ─►  parser.py
                                       │
                                       ▼
                          environment/server/environment.py
                                       │
              ┌────────────────────────┼────────────────────────┐
              ▼                        ▼                        ▼
       in-memory filesystem      stage_runner.py            grader.py
        (mutated by edits)      (simulated stages)       (reward + tiers)
              │                        │                        │
              └────────────────────────┴────────────────────────┘
                                       │
                                       ▼
                         PipelineObservation back to agent
```

Episode lifecycle. reset(task, seed) builds a broken scenario → step(action) applies one shell-like command → the episode terminates when the pipeline passes or the step budget runs out.
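This lifecycle can be sketched with a toy stand-in. Everything below is hypothetical (the class, the injected fault, the log strings); the real loop lives in the server's environment module:

```python
# Illustrative sketch of the reset -> step -> done loop. The class and its
# internals are hypothetical stand-ins, not the environment's real API.
import random

class ToyEnv:
    def __init__(self, task="easy", seed=0, max_steps=10):
        self.task, self.seed, self.max_steps = task, seed, max_steps

    def reset(self):
        # Deterministic broken scenario: same (task, seed) -> same missing package.
        rng = random.Random(f"{self.task}:{self.seed}")
        self.fs = {"requirements.txt": "flask==2.0\n"}
        self.missing = rng.choice(["pandas", "numpy"])
        self.steps, self.status = 0, "not_run"
        return {"stdout": "episode ready", "steps_remaining": self.max_steps}

    def step(self, command):
        self.steps += 1
        if command == "pipeline run":
            ok = self.missing in self.fs["requirements.txt"]
            self.status = "passed" if ok else "failed"
            out = "all stages passed" if ok else f"install failed: {self.missing}"
        elif command.startswith('echo "'):
            # echo "<text>" >> requirements.txt  (append, as in the real env)
            self.fs["requirements.txt"] += command.split('"')[1] + "\n"
            out = ""
        else:
            out = "Command not recognized"
        done = self.status == "passed" or self.steps >= self.max_steps
        return {"stdout": out, "pipeline_status": self.status,
                "steps_remaining": self.max_steps - self.steps, "done": done}
```

Because seeding is explicit, re-running reset with the same (task, seed) always injects the same fault, which is the property the real environment guarantees.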


2. Action & Observation Spaces

Highlight: All I/O is typed with Pydantic v2 models in environment/models.py. The agent's entire interface is a single free-form command string per turn; seven command shapes are recognised.

PipelineAction

```python
class PipelineAction(BaseModel):
    command: str   # raw shell-like string, e.g. 'cat requirements.txt'
```

Seven command shapes are recognised by the parser:

| Command | Example | Effect |
|---|---|---|
| `cat <file>` | `cat requirements.txt` | Read a file from the in-memory FS |
| `echo "<text>" >> <file>` | `echo "pandas" >> requirements.txt` | Append a line to a file |
| `sed -i 's/old/new/' <file>` | `sed -i 's/3.10/3.11/' Dockerfile` | Replace all occurrences of text in a file |
| `pipeline run` | `pipeline run` | Execute the full pipeline and return logs |
| `pipeline logs [stage]` | `pipeline logs install` | Show the last pipeline logs (optionally filtered by stage) |
| `pipeline status` | `pipeline status` | Show the current pipeline state (not_run / failed / passed) |
| `diagnose "<reason>"` | `diagnose "Missing env var SECRET_KEY"` | Record the agent's diagnosis (used for reward bonuses) |

Anything else returns `Command not recognized` with `exit_code=1`.
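A dispatcher for these shapes might look like the following sketch (the regexes and pattern names are illustrative, not copied from the real parser):

```python
# Sketch of recognising the seven command shapes. Regexes are illustrative
# approximations of the grammar described in the table above.
import re

PATTERNS = [
    ("cat",      re.compile(r"^cat\s+(\S+)$")),
    ("append",   re.compile(r'^echo\s+"([^"]*)"\s*>>\s*(\S+)$')),
    ("sed",      re.compile(r"^sed\s+-i\s+'s/([^/]+)/([^/]*)/'\s+(\S+)$")),
    ("run",      re.compile(r"^pipeline\s+run$")),
    ("logs",     re.compile(r"^pipeline\s+logs(?:\s+(\S+))?$")),
    ("status",   re.compile(r"^pipeline\s+status$")),
    ("diagnose", re.compile(r'^diagnose\s+"([^"]*)"$')),
]

def parse(command: str):
    """Return (name, captured groups) for a recognised command, else None."""
    for name, pattern in PATTERNS:
        m = pattern.match(command.strip())
        if m:
            return name, m.groups()
    return None  # caller replies "Command not recognized" with exit_code=1
```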

PipelineObservation

```python
class PipelineObservation(BaseModel):
    stdout: str              # what the agent sees this turn
    exit_code: int           # 0 = success, 1 = error
    pipeline_status: str     # 'not_run' | 'failed' | 'passed'
    steps_remaining: int
    done: bool = False
    reward: float = 0.0
```

PipelineState (server-side only)

```python
class PipelineState(BaseModel):
    episode_id: str
    task: str  # "easy" | "medium" | "hard"
    filesystem: Dict[str, str]
    pipeline_status: str
    step_count: int
    done: bool
    total_reward: float
    answer_key: Dict[str, Any]  # never sent to agent, used by grader
    milestones: List[str] = Field(default_factory=list)  # grader-only, tracks unlocked reward tiers
```

Tracks full episode state inside the server, including filesystem mutations, progress, and reward accumulation.

  • answer_key is hidden from the agent and used only for structural validation in the grader.
  • milestones track progression through the debugging lifecycle (investigated → diagnosed → fixed → verified).

3. Task Generation & Logic (Procedural Complexity)

Design Philosophy
Tasks are not static templates. They are programmatically synthesized scenarios generated by core/scenarios/generator.py.

Each episode is a unique composition of:

  • a pipeline graph
  • injected faults
  • a deterministic seed

This makes the environment non-memorizable, forcing agents to rely on generalized diagnostic reasoning instead of string matching.


Difficulty Tiers & Behavioral Intent

Tasks are categorized by the depth of reasoning required.

| Tier | Max Steps | Ideal Steps | Faults | Strategic Complexity |
|---|---|---|---|---|
| Easy | 10 | 3 | 1 | Linear: single-file lookup → direct fix |
| Medium | 15 | 6 | 2 | Relational: cross-file reasoning |
| Hard | 25 | 10 | 3 | Sequential: cascading failures |

How the Generator Synthesizes an Episode

Each episode is constructed in four stages:

  1. Base Filesystem
    A clean project snapshot is initialized.

  2. Pipeline Definition
    CI/CD stages are constructed (e.g., install → test → build).

  3. Fault Injection
    Files are mutated with typed faults, such as:

    • package_present / package_version
    • dockerfile_base
    • env_var_present
    • config_value
    • ci_stage_order
    • port_value
  4. Answer Key Generation
    A hidden ground-truth spec used by the grader for structural validation.
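Roughly, the four stages compose like this. Every name, file, and fault value below is illustrative; the real logic lives in core/scenarios/generator.py:

```python
# Sketch of the four generation stages. Function name, base snapshot, and
# answer-key contents are hypothetical stand-ins for the real generator.
import random

FAULT_TYPES = ["package_present", "dockerfile_base", "env_var_present",
               "config_value", "ci_stage_order", "port_value"]

FAULTS_PER_TIER = {"easy": 1, "medium": 2, "hard": 3}

def generate_episode(task: str, seed: int):
    rng = random.Random(f"{task}:{seed}")  # deterministic per (task, seed)
    # 1. Base filesystem: a clean project snapshot
    fs = {"app.py": "import flask\n",
          "requirements.txt": "flask==2.0\n",
          "Dockerfile": "FROM python:3.11-slim\n"}
    # 2. Pipeline definition
    stages = ["install", "test", "build"]
    # 3. Fault injection: typed faults, count scales with difficulty
    faults = rng.sample(FAULT_TYPES, FAULTS_PER_TIER[task])
    # 4. Answer key: hidden ground truth the grader validates against
    answer_key = {fault: f"expected fix for {fault}" for fault in faults}
    return {"filesystem": fs, "stages": stages,
            "faults": faults, "answer_key": answer_key}
```

Seeding the RNG from (task, seed) is what makes episodes reproducible while keeping the fault combination different across seeds.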


Scenario Breakdown

Easy β€” Localized Debugging

Focus: Information retrieval

  • Failure is confined to a single file
  • Example: app.py imports a missing dependency

Agent goal:
Map runtime error → specific file → apply fix


Medium β€” Cross-Subsystem Reasoning

Focus: Iterative discovery

  • Two faults across different subsystems
  • Only the first failing stage is visible initially

Key concept: Shadowing

Fixing one issue reveals the next.

| Variant | Pipeline | Faults |
|---|---|---|
| A | install → env_check → build | missing env var + Docker mismatch |
| B | install → config → smoke_test | dependency + config gate |
| C | install → port_check → build | port mismatch + Docker issue |

Agent requirement:

  • Prioritize fixes correctly
  • Maintain state across iterations

Hard β€” Cascading Failures

Focus: Causal + temporal reasoning

  • Three faults chained across stages
  • Each fix changes future observations

Example chain:

CI stage order incorrect → build executes prematurely → dependency resolution fails

Key property: Temporal dependency

  • Fixing earlier stages alters downstream failures

Why This Design Works

1. Partial Observability

The agent never sees all failures at once.

2. Structural Validation

Correctness is semantic:

  • not "does file match?"
  • but "is the system now valid?"

3. Anti-Shortcut Mechanics

  • File Integrity Check
    Prevents appending junk to pass tests

  • Blind Edit Penalty
    Forces reading before editing

  • Edit Spam Penalty
    Discourages brute-force iteration


Optimal Agent Policy

The correct strategy is not:

try random fixes → rerun

It is:

observe → localize → read → diagnose → fix → verify → repeat
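On an easy episode, that loop might produce a trace like the following (the specific file, diagnosis, and fix are illustrative, not drawn from a real scenario):

```python
# Hypothetical ideal command trace: one pass through the
# observe -> localize -> read -> diagnose -> fix -> verify loop.
ideal_trace = [
    "pipeline run",                                 # observe the first failure
    "cat requirements.txt",                         # localize: read the suspect file
    'diagnose "pandas missing from requirements"',  # diagnose before editing
    'echo "pandas" >> requirements.txt',            # fix
    "pipeline run",                                 # verify
]
```

Each command consumes one step against the tier's budget, so reading before editing and diagnosing before fixing both pay for themselves through the shaping bonuses described below.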

Each difficulty level increases pressure on:

  • localisation accuracy
  • causal reasoning
  • sequencing of fixes

Why hard is genuinely hard

  • Docker base reasoning (alpine vs slim)
    Errors like gcc: command not found require understanding that alpine lacks build tools/glibc. The correct fix is switching to python:3.11-slim, not just bumping versions.

  • Dependency compatibility (not presence)
    Failures like numpy==1.21 are not about missing packages, but version conflicts with transitive dependencies. The agent must reason about compatibility, not just add lines.

  • Sequential error revelation
    Only one failure is visible per pipeline run. Fixing one stage reveals the next, forcing multi-step reasoning loops.

  • Exploration vs efficiency trade-off
    Reading everything wastes steps (efficiency penalty), but blind edits are penalized. The agent must act surgically, not exhaustively.


4. Grader Logic & Reward Shaping

The grader rewards process quality, not just success. Agents are guided through a realistic debugging flow: investigate → diagnose → fix → verify.

Each step reward is composed of: the `grade(state)` delta plus `balance_score(state, ctx)`


Core Score (Structural Progress)

  • Fix Credit (max +0.20)
    Proportional to fraction of correctly applied fixes.

  • Pipeline Passed (+0.50)
    Awarded only when pipeline_status == "passed".

  • File Integrity (−0.10 → 0.0)
    Penalizes excessive edits (e.g., appending large amounts of code).


Milestone-Based Progression

| Stage | Description | Reward |
|---|---|---|
| Investigated | First pipeline run to observe failure | +0.10 |
| Diagnosed | Reads relevant diagnostic/source files | +0.10 |
| Fix Applied | Valid structural fix detected | +0.15 |
| Verified | Pipeline successfully passes | +0.50 |

Progress is state-driven, not command-driven.
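State-driven unlocking can be sketched as follows. The reward values mirror the milestone table; the predicates are illustrative guesses, not the grader's real checks:

```python
# Sketch of milestone unlocking keyed off episode state, not off which
# command was typed. Predicate fields in `state` are hypothetical.
MILESTONE_REWARDS = {
    "investigated": 0.10,  # first pipeline run observed
    "diagnosed":    0.10,  # relevant files were read
    "fix_applied":  0.15,  # a valid structural fix detected
    "verified":     0.50,  # pipeline passed
}

def unlock_milestones(state: dict, unlocked: list) -> float:
    """Award each milestone at most once; return the reward delta."""
    reached = []
    if state.get("pipeline_ran"):
        reached.append("investigated")
    if state.get("read_relevant_files"):
        reached.append("diagnosed")
    if state.get("fixes_applied", 0) > 0:
        reached.append("fix_applied")
    if state.get("pipeline_status") == "passed":
        reached.append("verified")
    reward = 0.0
    for milestone in reached:
        if milestone not in unlocked:
            unlocked.append(milestone)       # persists in PipelineState.milestones
            reward += MILESTONE_REWARDS[milestone]
    return reward
```

Because the check runs against state, an agent cannot farm a milestone by repeating the same command; each tier pays out exactly once per episode.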


Behavioral Shaping (Per-Step)

Rewards

  • Correct Diagnosis: +0.10
  • Cross-File Reasoning: +0.05

Penalties

  • Blind Edits (edit without reading): −0.10
  • Edit Spam (>2 edits per file): −0.05 each
  • Idle Pipeline Runs (no FS changes): −0.05
  • Stalling (no progress): −0.05
  • Regression (breaking prior fix): −0.15
  • Inefficiency: −0.02 per step beyond ideal (6 steps)
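Some of these penalties can be sketched as a per-step function of command history. The constants mirror the list above; the detection logic is a guess at the mechanism, not the grader's code:

```python
def shaping_penalty(cmd_type: str, target: str, history: dict) -> float:
    """Illustrative per-step penalties; `history` tracks which files were
    read, per-file edit counts, and whether the FS changed since the last run."""
    penalty = 0.0
    if cmd_type == "edit":
        if target not in history["files_read"]:
            penalty -= 0.10                  # blind edit: editing before reading
        history["edit_counts"][target] = history["edit_counts"].get(target, 0) + 1
        if history["edit_counts"][target] > 2:
            penalty -= 0.05                  # edit spam: >2 edits to one file
    elif cmd_type == "run" and not history.get("fs_changed_since_run", True):
        penalty -= 0.05                      # idle pipeline run: nothing changed
    return penalty
```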

Key Design Insight

The grader differentiates:

  • Structured debugging → rewarded
  • Brute-force / guesswork → penalized

Partial fixes receive proportional credit, enabling meaningful learning even in multi-error environments.


6. Project Structure

```
CI_CD_Doctor/
├── Dockerfile                    ← container setup
├── README.md                     ← main project overview
├── __init__.py
├── client.py                     ← environment client interface
├── models.py                     ← core data models (Action / State / Observation)
├── inference.py                  ← baseline agent runner
├── openenv.yaml                  ← OpenEnv task + grader config
├── pyproject.toml
├── uv.lock                       ← dependency lockfile
│
├── core/                         ← modularized environment logic
│   ├── __init__.py
│   ├── grading/
│   │   └── grader.py             ← scoring + reward shaping logic
│   ├── pipeline/
│   │   └── stage_runner.py       ← simulated CI/CD stages
│   ├── scenarios/
│   │   └── generator.py          ← task + variant generation
│   ├── utils/
│   │   └── packages.py           ← dependency definitions
│   └── validation/
│       ├── parser.py             ← command parsing logic
│       └── validator.py          ← structural validation (CI rules, configs)
│
├── server/                       ← execution backend
│   ├── __init__.py
│   ├── app.py                    ← FastAPI entrypoint
│   ├── app_2.py                  ← alternate server setup
│   └── environment.py            ← main env loop (reset/step/state)
│
└── docs/
    ├── README.md                 ← HF space readme
    └── advanced_readme.md        ← detailed system design
```

7. Development

Run the server locally

```
uvicorn server.app:app --reload
```