# CI/CD Doctor – Advanced Reference

Deep dive into the environment internals: architecture, I/O contracts, task variants, reward shaping, grader semantics, and layout. If you are only trying to run the env, start with the [root README](../README.md).

---

## 1. Environment Overview

> **Highlight:** The environment is a pure in-memory simulation. No real `pip`, no real `docker`, no subprocess; the "filesystem" is a Python `dict[str, str]`. Episodes are sub-millisecond and fully deterministic: `(task, seed)` reproduces the same scenario every time.
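As a mental model, that dict-backed filesystem boils down to a few lines of Python. The sketch below is illustrative only (the names and file contents are invented, not the actual `environment.py` code):

```python
# Illustrative sketch of the dict-backed "filesystem"; not the real environment.py.
fs: dict[str, str] = {"requirements.txt": "flask==2.3\n"}

def cat(path: str) -> str:
    # Mirrors the `cat <file>` command: a plain dict lookup.
    return fs.get(path, f"cat: {path}: No such file")

def append(path: str, line: str) -> None:
    # Mirrors `echo "<text>" >> <file>`: an in-place dict mutation.
    fs[path] = fs.get(path, "") + line + "\n"

append("requirements.txt", "pandas")  # one "edit" mutates the dict directly
```

Because every operation is a dict read or write, an entire episode costs microseconds and carries no process or network state.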
```
Agent issues a command string ──▶ parser.py
                                      │
                                      ▼
                   environment/server/environment.py
                                      │
          ┌───────────────────────────┼──────────────────────────┐
          ▼                           ▼                          ▼
  in-memory filesystem          stage_runner.py              grader.py
  (mutated by edits)          (simulated stages)         (reward + tiers)
          │                           │                          │
          └───────────────────────────┴──────────────────────────┘
                                      │
                                      ▼
                      PipelineObservation back to agent
```
**Episode lifecycle.** `reset(task, seed)` builds a broken scenario → `step(action)` applies one shell-like command → the episode terminates when the pipeline passes *or* the step budget runs out.
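That lifecycle can be sketched as a driver loop. Everything below is hypothetical scaffolding (`Obs`, `StubEnv`, and `run_episode` are invented stand-ins; the real interface lives in `client.py` and `models.py`):

```python
from dataclasses import dataclass

@dataclass
class Obs:          # invented stand-in for PipelineObservation
    done: bool
    reward: float

class StubEnv:      # toy env that "passes" after two steps
    def reset(self, task, seed):
        self.n = 0
        return Obs(done=False, reward=0.0)

    def step(self, command):
        self.n += 1
        return Obs(done=self.n >= 2, reward=0.5)

def run_episode(env, policy, task="easy", seed=0):
    obs = env.reset(task=task, seed=seed)   # build a broken scenario
    total = obs.reward
    while not obs.done:                     # one shell-like command per turn
        obs = env.step(policy(obs))
        total += obs.reward
    return total

total = run_episode(StubEnv(), lambda obs: "pipeline run")
```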
---

## 2. Action & Observation Spaces

> **Highlight:** All I/O is typed with Pydantic v2 models in [environment/models.py](../environment/models.py). The agent's entire interface is a single free-form `command` string per turn; seven command shapes are recognised.
### `PipelineAction`

```python
class PipelineAction(BaseModel):
    command: str  # raw shell-like string, e.g. 'cat requirements.txt'
```
Seven command shapes are recognised by [environment/parser.py](../environment/parser.py):

| Command | Example | Effect |
|---|---|---|
| `cat <file>` | `cat requirements.txt` | Read a file from the in-memory FS |
| `echo "<text>" >> <file>` | `echo "pandas" >> requirements.txt` | Append a line to a file |
| `sed -i 's/old/new/' <file>` | `sed -i 's/3.10/3.11/' Dockerfile` | Replace all occurrences of text in a file |
| `pipeline run` | `pipeline run` | Execute the full pipeline and return logs |
| `pipeline logs [stage]` | `pipeline logs install` | Show the last pipeline logs (optionally filtered by stage) |
| `pipeline status` | `pipeline status` | Show current pipeline state (`not_run` / `failed` / `passed`) |
| `diagnose "<reason>"` | `diagnose "Missing env var SECRET_KEY"` | Record the agent's diagnosis (used for reward bonuses) |

Anything else returns `Command not recognized` with `exit_code=1`.
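A minimal dispatcher over these shapes could be written with regular expressions. This is a sketch only; the actual logic lives in `environment/parser.py` and may differ:

```python
import re

# Sketch of command dispatch over the seven recognised shapes.
PATTERNS = [
    (re.compile(r"^cat (\S+)$"), "cat"),
    (re.compile(r'^echo "(.*)" >> (\S+)$'), "append"),
    (re.compile(r"^sed -i 's/(.*)/(.*)/' (\S+)$"), "sed"),
    (re.compile(r"^pipeline run$"), "run"),
    (re.compile(r"^pipeline logs(?: (\S+))?$"), "logs"),
    (re.compile(r"^pipeline status$"), "status"),
    (re.compile(r'^diagnose "(.*)"$'), "diagnose"),
]

def parse(command: str):
    for pattern, name in PATTERNS:
        m = pattern.match(command)
        if m:
            return name, m.groups()
    # Caller turns this into "Command not recognized" with exit_code=1.
    return None, ()
```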
### `PipelineObservation`

```python
class PipelineObservation(BaseModel):
    stdout: str           # what the agent sees this turn
    exit_code: int        # 0 = success, 1 = error
    pipeline_status: str  # 'not_run' | 'failed' | 'passed'
    steps_remaining: int
    done: bool = False
    reward: float = 0.0
```
### `PipelineState` (server-side only)

```python
class PipelineState(BaseModel):
    episode_id: str
    task: str                   # "easy" | "medium" | "hard"
    filesystem: Dict[str, str]
    pipeline_status: str
    step_count: int
    done: bool
    total_reward: float
    answer_key: Dict[str, Any]  # never sent to agent, used by grader
    milestones: List[str] = Field(default_factory=list)  # grader-only, tracks unlocked reward tiers
```

Tracks the full episode state inside the server, including filesystem mutations, progress, and reward accumulation.

- `answer_key` is hidden from the agent and used only for structural validation in the grader.
- `milestones` track progression through the debugging lifecycle (investigated → diagnosed → fixed → verified).
---

## 3. Task Generation & Logic (Procedural Complexity)

**Design Philosophy**

Tasks are not static templates. They are programmatically synthesized scenarios generated by `core/scenarios/generator.py`.

Each episode is a unique composition of:

- a pipeline graph
- injected faults
- a deterministic seed

This makes the environment **non-memorizable**, forcing agents to rely on **generalized diagnostic reasoning** instead of string matching.

---

### Difficulty Tiers & Behavioral Intent

Tasks are categorized by the **depth of reasoning** required.

| Tier | Max Steps | Ideal Steps | Faults | Strategic Complexity |
|---|---|---|---|---|
| Easy | 10 | 3 | 1 | Linear: single-file lookup → direct fix |
| Medium | 15 | 6 | 2 | Relational: cross-file reasoning |
| Hard | 25 | 10 | 3 | Sequential: cascading failures |
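The budgets above can be encoded as plain data; the sketch below simply mirrors the table (an illustrative encoding, not the generator's real representation):

```python
# Tier budgets mirroring the difficulty table (illustrative encoding).
TIERS = {
    "easy":   {"max_steps": 10, "ideal_steps": 3,  "faults": 1},
    "medium": {"max_steps": 15, "ideal_steps": 6,  "faults": 2},
    "hard":   {"max_steps": 25, "ideal_steps": 10, "faults": 3},
}

def step_budget(task: str) -> int:
    # Episode terminates when this many steps have been spent.
    return TIERS[task]["max_steps"]
```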
---

### How the Generator Synthesizes an Episode

Each episode is constructed in four stages:

1. **Base Filesystem**: a clean project snapshot is initialized.
2. **Pipeline Definition**: CI/CD stages are constructed (e.g., `install → test → build`).
3. **Fault Injection**: files are mutated with **typed faults**, such as:
   - `package_present` / `package_version`
   - `dockerfile_base`
   - `env_var_present`
   - `config_value`
   - `ci_stage_order`
   - `port_value`
4. **Answer Key Generation**: a hidden ground-truth spec used by the grader for **structural validation**.
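The four stages can be sketched with a seeded `random.Random`, which is what makes `(task, seed)` reproducible. Everything here (`FAULTS`, `generate`, the file contents) is an invented stand-in for `core/scenarios/generator.py`:

```python
import random

# Hypothetical sketch of the four-stage synthesis; not the real generator.
FAULTS = ["package_present", "package_version", "dockerfile_base",
          "env_var_present", "config_value", "ci_stage_order", "port_value"]

def generate(task: str, seed: int, n_faults: int) -> dict:
    rng = random.Random(f"{task}-{seed}")                  # deterministic per (task, seed)
    filesystem = {"Dockerfile": "FROM python:3.11-slim\n"}  # 1. base snapshot
    stages = ["install", "test", "build"]                   # 2. pipeline definition
    injected = rng.sample(FAULTS, n_faults)                 # 3. typed fault injection
    answer_key = {f: "expected-value" for f in injected}    # 4. hidden ground truth
    return {"filesystem": filesystem, "stages": stages,
            "faults": injected, "answer_key": answer_key}

# Same (task, seed) -> identical scenario, every time.
a, b = generate("medium", 7, 2), generate("medium", 7, 2)
```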
---

### Scenario Breakdown

#### Easy – Localized Debugging

Focus: **Information retrieval**

- Failure is confined to a single file
- Example: `app.py` imports a missing dependency

**Agent goal:** map runtime error → specific file → apply fix

---

#### Medium – Cross-Subsystem Reasoning

Focus: **Iterative discovery**

- Two faults across different subsystems
- Only the *first failing stage* is visible initially

**Key concept: shadowing**

> Fixing one issue reveals the next.

| Variant | Pipeline | Faults |
|---|---|---|
| A | install → env_check → build | missing env var + Docker mismatch |
| B | install → config → smoke_test | dependency + config gate |
| C | install → port_check → build | port mismatch + Docker issue |

**Agent requirements:**

- Prioritize fixes correctly
- Maintain state across iterations
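Shadowing falls out of one property: stages run in order and the pipeline halts at the first failure. A toy sketch (invented names, not the real `stage_runner.py`):

```python
# Sketch: the pipeline stops at the first failing stage, so later faults
# stay hidden ("shadowed") until earlier ones are fixed.
def run_pipeline(stages, broken):
    for stage in stages:
        if stage in broken:
            return ("failed", stage)  # only this failure is visible
    return ("passed", None)

stages = ["install", "env_check", "build"]
broken = {"env_check", "build"}

status1, stage1 = run_pipeline(stages, broken)  # env_check shadows build
broken.discard(stage1)                          # fix the visible fault
status2, stage2 = run_pipeline(stages, broken)  # now build is revealed
broken.discard(stage2)
status3, _ = run_pipeline(stages, broken)       # all fixed: pipeline passes
```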
---

#### Hard – Cascading Failures

Focus: **Causal + temporal reasoning**

- Three faults chained across stages
- Each fix changes future observations

Example chain: CI stage order incorrect → build executes prematurely → dependency resolution fails

**Key property: temporal dependency**

- Fixing earlier stages alters downstream failures
| ### Why This Design Works | |
| #### 1. Partial Observability | |
| The agent never sees all failures at once. | |
| #### 2. Structural Validation | |
| Correctness is semantic: | |
| - not "does file match?" | |
| - but "is the system now valid?" | |
| #### 3. Anti-Shortcut Mechanics | |
| - **File Integrity Check** | |
| Prevents appending junk to pass tests | |
| - **Blind Edit Penalty** | |
| Forces reading before editing | |
| - **Edit Spam Penalty** | |
| Discourages brute-force iteration | |
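The last two mechanics reduce to simple bookkeeping: track which files were read and how often each was edited. A sketch using the penalty values quoted in the grader section (illustrative code; the real logic lives in `core/grading/grader.py`):

```python
# Sketch of anti-shortcut bookkeeping; names are invented.
read_files = set()
edit_counts = {}

def penalty_for_edit(path: str) -> float:
    edit_counts[path] = edit_counts.get(path, 0) + 1
    p = 0.0
    if path not in read_files:
        p -= 0.10  # blind edit: file was never read first
    if edit_counts[path] > 2:
        p -= 0.05  # edit spam on the same file
    return p

read_files.add("Dockerfile")
p1 = penalty_for_edit("Dockerfile")  # read first: no penalty
p2 = penalty_for_edit("app.py")      # blind edit: penalized
```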
---

### Optimal Agent Policy

The correct strategy is not:

`try random fixes → rerun`

It is:

`observe → localize → read → diagnose → fix → verify → repeat`

Each difficulty level increases pressure on:

- localization accuracy
- causal reasoning
- sequencing of fixes
### Why hard is genuinely hard

- **Docker base reasoning (`alpine` vs `slim`)**
  Errors like `gcc: command not found` require understanding that `alpine` lacks build tools/glibc. The correct fix is switching to `python:3.11-slim`, not just bumping versions.
- **Dependency compatibility (not presence)**
  Failures like `numpy==1.21` are not about missing packages, but **version conflicts** with transitive dependencies. The agent must reason about compatibility, not just add lines.
- **Sequential error revelation**
  Only one failure is visible per pipeline run. Fixing one stage reveals the next, forcing **multi-step reasoning loops**.
- **Exploration vs. efficiency trade-off**
  Reading everything wastes steps (efficiency penalty), but blind edits are penalized. The agent must act **surgically**, not exhaustively.

---
## 4. Grader Logic & Reward Shaping

> The grader rewards *process quality*, not just success. Agents are guided through a realistic debugging flow: investigate → diagnose → fix → verify.

Each step reward is composed of:

**grade(state) delta + balance_score(state, ctx)**
---

### Core Score (Structural Progress)

- **Fix Credit (max +0.20)**
  Proportional to the fraction of correctly applied fixes.
- **Pipeline Passed (+0.50)**
  Awarded only when `pipeline_status == "passed"`.
- **File Integrity (−0.10 → 0.0)**
  Penalizes excessive edits (e.g., appending large amounts of code).
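Fix credit is strictly proportional: fixing one of two faults under the +0.20 cap yields +0.10. A one-liner makes the arithmetic concrete (function name is invented):

```python
# Sketch: partial credit proportional to the fraction of faults fixed (cap +0.20).
def fix_credit(fixed: int, total_faults: int, cap: float = 0.20) -> float:
    return cap * (fixed / total_faults)

credit = fix_credit(1, 2)  # one of two faults fixed
```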
---

### Milestone-Based Progression

| Stage | Description | Reward |
|------|------------|--------|
| Investigated | First pipeline run to observe the failure | +0.10 |
| Diagnosed | Reads relevant diagnostic/source files | +0.10 |
| Fix Applied | Valid structural fix detected | +0.15 |
| Verified | Pipeline successfully passes | +0.50 |

Progress is **state-driven**, not command-driven.
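A one-shot payout per milestone could be implemented like this (a sketch; `MILESTONE_REWARDS` and `award` are invented names, with values taken from the table above):

```python
# Sketch: each milestone pays out once per episode; repeats earn nothing.
MILESTONE_REWARDS = {"investigated": 0.10, "diagnosed": 0.10,
                     "fix_applied": 0.15, "verified": 0.50}

def award(state_milestones: list, reached: str) -> float:
    if reached in state_milestones:
        return 0.0
    state_milestones.append(reached)  # unlocked tiers live in PipelineState.milestones
    return MILESTONE_REWARDS[reached]

ms = []
r1 = award(ms, "investigated")  # first time: rewarded
r2 = award(ms, "investigated")  # repeat: nothing
```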
---

### Behavioral Shaping (Per-Step)

#### Rewards

- **Correct Diagnosis**: +0.10
- **Cross-File Reasoning**: +0.05

#### Penalties

- **Blind Edits** (edit without reading): −0.10
- **Edit Spam** (>2 edits per file): −0.05 each
- **Idle Pipeline Runs** (no FS changes): −0.05
- **Stalling** (no progress): −0.05
- **Regression** (breaking a prior fix): −0.15
- **Inefficiency**: −0.02 per step beyond the ideal (6 steps)
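The inefficiency term is simple arithmetic (sketch; the function name is invented):

```python
# Sketch: -0.02 per step beyond the ideal budget, never a bonus for being under.
def inefficiency_penalty(step_count: int, ideal_steps: int = 6) -> float:
    return -0.02 * max(0, step_count - ideal_steps)

p = inefficiency_penalty(9)  # 3 steps over the ideal
```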
---

### Key Design Insight

The grader differentiates:

- **Structured debugging** → rewarded
- **Brute-force / guesswork** → penalized

Partial fixes receive proportional credit, enabling meaningful learning even in multi-error environments.

---
## 5. Project Structure

```
CI_CD_Doctor/
├── Dockerfile            # container setup
├── README.md             # main project overview
├── __init__.py
├── client.py             # environment client interface
├── models.py             # core data models (Action / State / Observation)
├── inference.py          # baseline agent runner
├── openenv.yaml          # OpenEnv task + grader config
├── pyproject.toml
├── uv.lock               # dependency lockfile
│
├── core/                 # modularized environment logic
│   ├── __init__.py
│   ├── grading/
│   │   └── grader.py           # scoring + reward shaping logic
│   ├── pipeline/
│   │   └── stage_runner.py     # simulated CI/CD stages
│   ├── scenarios/
│   │   └── generator.py        # task + variant generation
│   ├── utils/
│   │   └── packages.py         # dependency definitions
│   └── validation/
│       ├── parser.py           # command parsing logic
│       └── validator.py        # structural validation (CI rules, configs)
│
├── server/               # execution backend
│   ├── __init__.py
│   ├── app.py            # FastAPI entrypoint
│   ├── app_2.py          # alternate server setup
│   └── environment.py    # main env loop (reset/step/state)
│
└── docs/
    ├── README.md           # HF Space readme
    └── advanced_readme.md  # detailed system design
```
---

## 6. Development

### Run the server locally

```bash
uvicorn server.app:app --reload
```