CI/CD Doctor: Advanced Reference
Deep dive into the environment internals: architecture, I/O contracts, task variants, reward shaping, grader semantics, and layout. If you are only trying to run the env, start with the root README.
1. Environment Overview
Highlight: The environment is a pure in-memory simulation. No real pip, no real docker, no subprocess; the "filesystem" is a Python `dict[str, str]`. Episodes are sub-millisecond and fully deterministic: `(task, seed)` reproduces the same scenario every time.
```
Agent issues a command string ──► parser.py
                                      │
                                      ▼
                      environment/server/environment.py
                                      │
          ┌───────────────────────────┼───────────────────────────┐
          ▼                           ▼                           ▼
  in-memory filesystem          stage_runner.py                grader.py
  (mutated by edits)          (simulated stages)           (reward + tiers)
          │                           │                           │
          └───────────────────────────┴───────────────────────────┘
                                      │
                                      ▼
                      PipelineObservation back to agent
```
Episode lifecycle. `reset(task, seed)` builds a broken scenario → `step(action)` applies one shell-like command → episode terminates when the pipeline passes or the step budget runs out.
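A driver loop over this lifecycle might look like the sketch below. `StubEnv` is a minimal stand-in for the real server environment in `environment/server/environment.py`; its names, fields, and trivial "always passes after `pipeline run`" behaviour are illustrative assumptions, not the real implementation.

```python
from dataclasses import dataclass

# Minimal stub mirroring the documented reset/step contract.
# The real logic lives in environment/server/environment.py.
@dataclass
class StubEnv:
    max_steps: int = 10
    step_count: int = 0
    pipeline_status: str = "not_run"

    def reset(self, task: str, seed: int) -> dict:
        self.step_count = 0
        self.pipeline_status = "not_run"
        return {"stdout": f"task={task} seed={seed}", "done": False}

    def step(self, command: str) -> dict:
        self.step_count += 1
        if command == "pipeline run":
            self.pipeline_status = "passed"  # stub: pretend the fix already landed
        done = self.pipeline_status == "passed" or self.step_count >= self.max_steps
        return {
            "stdout": self.pipeline_status,
            "pipeline_status": self.pipeline_status,
            "steps_remaining": self.max_steps - self.step_count,
            "done": done,
        }

env = StubEnv()
obs = env.reset("easy", seed=0)
obs = env.step("cat requirements.txt")
obs = env.step("pipeline run")
```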
2. Action & Observation Spaces
Highlight: All I/O is typed with Pydantic v2 models in environment/models.py. The agent's entire interface is a single free-form `command` string per turn; seven command shapes are recognised.
PipelineAction
```python
class PipelineAction(BaseModel):
    command: str  # raw shell-like string, e.g. 'cat requirements.txt'
```
Seven command shapes are recognised by environment/parser.py:

| Command | Example | Effect |
|---|---|---|
| `cat <file>` | `cat requirements.txt` | Read a file from the in-memory FS |
| `echo "<text>" >> <file>` | `echo "pandas" >> requirements.txt` | Append a line to a file |
| `sed -i 's/old/new/' <file>` | `sed -i 's/3.10/3.11/' Dockerfile` | Replace all occurrences of text in a file |
| `pipeline run` | `pipeline run` | Execute full pipeline and return logs |
| `pipeline logs [stage]` | `pipeline logs install` | Show last pipeline logs (optionally filtered by stage) |
| `pipeline status` | `pipeline status` | Show current pipeline state (not_run / failed / passed) |
| `diagnose "<reason>"` | `diagnose "Missing env var SECRET_KEY"` | Record agent diagnosis (used for reward bonuses) |
Anything else returns `Command not recognized` with `exit_code=1`.
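A dispatcher over these command shapes could be sketched with one regex per shape; the patterns below are illustrative approximations, not the real ones in environment/parser.py.

```python
import re

# One illustrative pattern per documented command shape.
COMMANDS = [
    ("cat",      re.compile(r"^cat\s+(\S+)$")),
    ("append",   re.compile(r'^echo\s+"(.*)"\s+>>\s+(\S+)$')),
    ("sed",      re.compile(r"^sed -i 's/([^/]+)/([^/]*)/'\s+(\S+)$")),
    ("run",      re.compile(r"^pipeline run$")),
    ("logs",     re.compile(r"^pipeline logs(?:\s+(\S+))?$")),
    ("status",   re.compile(r"^pipeline status$")),
    ("diagnose", re.compile(r'^diagnose\s+"(.*)"$')),
]

def parse(command: str):
    """Return (command_name, captured_args); (None, ()) maps to exit_code=1."""
    for name, pattern in COMMANDS:
        m = pattern.match(command.strip())
        if m:
            return name, m.groups()
    return None, ()  # "Command not recognized"
```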
PipelineObservation
```python
class PipelineObservation(BaseModel):
    stdout: str            # what the agent sees this turn
    exit_code: int         # 0 = success, 1 = error
    pipeline_status: str   # 'not_run' | 'failed' | 'passed'
    steps_remaining: int
    done: bool = False
    reward: float = 0.0
```
PipelineState (server-side only)
```python
class PipelineState(BaseModel):
    episode_id: str
    task: str                   # "easy" | "medium" | "hard"
    filesystem: Dict[str, str]
    pipeline_status: str
    step_count: int
    done: bool
    total_reward: float
    answer_key: Dict[str, Any]  # never sent to agent, used by grader
    milestones: List[str] = Field(default_factory=list)  # grader-only, tracks unlocked reward tiers
```
Tracks full episode state inside the server, including filesystem mutations, progress, and reward accumulation.
`answer_key` is hidden from the agent and used only for structural validation in the grader. `milestones` tracks progression through the debugging lifecycle (investigated → diagnosed → fixed → verified).
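One way the milestone list could advance is sketched below; the assumption that milestones unlock strictly in lifecycle order, and the helper name `unlock`, are mine, not taken from the grader code.

```python
# Assumed lifecycle order, per the documented progression.
LIFECYCLE = ["investigated", "diagnosed", "fixed", "verified"]

def unlock(milestones: list, event: str) -> list:
    """Append `event` only if it is the next milestone in the lifecycle (assumption)."""
    nxt = LIFECYCLE[len(milestones)] if len(milestones) < len(LIFECYCLE) else None
    return milestones + [event] if event == nxt else milestones
```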
3. Task Generation & Logic (Procedural Complexity)
Design Philosophy
Tasks are not static templates. They are programmatically synthesized scenarios generated by core/scenarios/generator.py.
Each episode is a unique composition of:
- a pipeline graph
- injected faults
- a deterministic seed
This makes the environment non-memorizable, forcing agents to rely on generalized diagnostic reasoning instead of string matching.
Difficulty Tiers & Behavioral Intent
Tasks are categorized by the depth of reasoning required.
| Tier | Max Steps | Ideal Steps | Faults | Strategic Complexity |
|---|---|---|---|---|
| Easy | 10 | 3 | 1 | Linear: single-file lookup → direct fix |
| Medium | 15 | 6 | 2 | Relational: cross-file reasoning |
| Hard | 25 | 10 | 3 | Sequential: cascading failures |
How the Generator Synthesizes an Episode
Each episode is constructed in four stages:
1. Base Filesystem: a clean project snapshot is initialized.
2. Pipeline Definition: CI/CD stages are constructed (e.g., install → test → build).
3. Fault Injection: files are mutated with typed faults, such as `package_present`/`package_version`, `dockerfile_base`, `env_var_present`, `config_value`, `ci_stage_order`, `port_value`.
4. Answer Key Generation: a hidden ground-truth spec used by the grader for structural validation.
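The four stages above can be sketched as a seeded generator. This is a hypothetical reduction of core/scenarios/generator.py: the fault names come from the list above, but the file contents, stage list, and sampling logic are invented for illustration.

```python
import random

# Typed fault names from the documentation; everything else is illustrative.
FAULTS = ["package_version", "dockerfile_base", "env_var_present",
          "config_value", "ci_stage_order", "port_value"]

def generate_scenario(task: str, seed: int) -> dict:
    rng = random.Random(f"{task}:{seed}")        # deterministic per (task, seed)
    n_faults = {"easy": 1, "medium": 2, "hard": 3}[task]
    filesystem = {                               # stage 1: base filesystem
        "requirements.txt": "flask==2.0\n",
        "Dockerfile": "FROM python:3.11-slim\n",
    }
    stages = ["install", "test", "build"]        # stage 2: pipeline definition
    injected = rng.sample(FAULTS, n_faults)      # stage 3: fault injection
    answer_key = {f: "expected-value" for f in injected}  # stage 4: hidden key
    return {"filesystem": filesystem, "stages": stages,
            "faults": injected, "answer_key": answer_key}
```

Seeding the RNG from the `(task, seed)` pair is what makes every episode reproducible while keeping episodes non-memorizable across seeds.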
Scenario Breakdown
Easy: Localized Debugging
Focus: Information retrieval
- Failure is confined to a single file
- Example: `app.py` imports a missing dependency
Agent goal:
Map runtime error → specific file → apply fix
Medium: Cross-Subsystem Reasoning
Focus: Iterative discovery
- Two faults across different subsystems
- Only the first failing stage is visible initially
Key concept: Shadowing
Fixing one issue reveals the next.
| Variant | Pipeline | Faults |
|---|---|---|
| A | install → env_check → build | missing env var + Docker mismatch |
| B | install → config → smoke_test | dependency + config gate |
| C | install → port_check → build | port mismatch + Docker issue |
Agent requirement:
- Prioritize fixes correctly
- Maintain state across iterations
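The shadowing concept can be shown with a toy pipeline runner: stages execute in order and only the first failure is reported, so a later fault stays invisible until the earlier one is fixed. Everything here is illustrative, not the real stage_runner.py.

```python
# Toy shadowing demo: only the first failing stage is ever visible.
def run_pipeline(stages, faults):
    for stage in stages:
        if stage in faults:
            return f"FAILED at {stage}"   # later stages never execute
    return "passed"

stages = ["install", "env_check", "build"]
faults = {"install", "build"}             # two injected faults, one visible
first = run_pipeline(stages, faults)      # reports the install failure only
faults.discard("install")                 # apply the first fix
second = run_pipeline(stages, faults)     # now the build fault surfaces
```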
Hard: Cascading Failures
Focus: Causal + temporal reasoning
- Three faults chained across stages
- Each fix changes future observations
Example chain:
CI stage order incorrect → build executes prematurely → dependency resolution fails
Key property: Temporal dependency
- Fixing earlier stages alters downstream failures
Why This Design Works
1. Partial Observability
The agent never sees all failures at once.
2. Structural Validation
Correctness is semantic:
- not "does file match?"
- but "is the system now valid?"
3. Anti-Shortcut Mechanics
- File Integrity Check: prevents appending junk to pass tests
- Blind Edit Penalty: forces reading before editing
- Edit Spam Penalty: discourages brute-force iteration
Optimal Agent Policy
The correct strategy is not:
try random fixes → rerun
It is:
observe → localize → read → diagnose → fix → verify → repeat
Each difficulty level increases pressure on:
- localisation accuracy
- causal reasoning
- sequencing of fixes
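For an easy-tier dependency fault, the observe-to-verify loop might reduce to a command trace like the following; the file name and diagnosis text are illustrative, only the command shapes come from section 2.

```python
# Hypothetical optimal-policy trace for an easy-tier episode.
trace = [
    'pipeline run',                                      # observe the first failure
    'pipeline logs install',                             # localize the failing stage
    'cat requirements.txt',                              # read before editing
    'diagnose "pandas missing from requirements.txt"',   # record the diagnosis
    'echo "pandas" >> requirements.txt',                 # fix
    'pipeline run',                                      # verify
]
```

Note the trace reads the file before editing and diagnoses before fixing, which is exactly what the shaping rewards.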
Why hard is genuinely hard
- Docker base reasoning (`alpine` vs `slim`): errors like `gcc: command not found` require understanding that `alpine` lacks build tools/glibc. The correct fix is switching to `python:3.11-slim`, not just bumping versions.
- Dependency compatibility (not presence): failures like `numpy==1.21` are not about missing packages, but version conflicts with transitive dependencies. The agent must reason about compatibility, not just add lines.
- Sequential error revelation: only one failure is visible per pipeline run. Fixing one stage reveals the next, forcing multi-step reasoning loops.
- Exploration vs efficiency trade-off: reading everything wastes steps (efficiency penalty), but blind edits are penalized. The agent must act surgically, not exhaustively.
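The compatibility point can be made concrete with a toy check: the package is pinned and present, yet installation still fails because the pin is below an (invented) floor required by a transitive dependency. The constraint value and function names are illustrative.

```python
# Toy compatibility check: the fault is a version conflict, not absence.
def parse_version(v: str) -> tuple:
    return tuple(int(x) for x in v.split("."))

MIN_NUMPY = "1.22.0"   # hypothetical floor demanded by a transitive dependency

def install_ok(pins: dict) -> bool:
    if "numpy" not in pins:
        return False   # easy-tier failure mode: package simply missing
    # hard-tier failure mode: package present but pinned too low
    return parse_version(pins["numpy"]) >= parse_version(MIN_NUMPY)
```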
4. Grader Logic & Reward Shaping
The grader rewards process quality, not just success. Agents are guided through a realistic debugging flow: investigate → diagnose → fix → verify.
Each step reward is composed of: `grade(state)` delta + `balance_score(state, ctx)`
Core Score (Structural Progress)
- Fix Credit (max +0.20): proportional to the fraction of correctly applied fixes.
- Pipeline Passed (+0.50): awarded only when `pipeline_status == "passed"`.
- File Integrity (−0.10 → 0.0): penalizes excessive edits (e.g., appending large amounts of code).
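The core score might be computed as below. The weights follow the documented values; the linear scaling of the integrity penalty and the parameter names are assumptions, not the real core/grading/grader.py.

```python
# Illustrative core-score computation (weights from the documentation;
# integrity-penalty scaling is an assumption).
def core_score(fixed: int, total_faults: int, passed: bool, junk_lines: int) -> float:
    fix_credit = 0.20 * (fixed / total_faults)   # partial credit, max +0.20
    pass_bonus = 0.50 if passed else 0.0         # only on a passing pipeline
    integrity = -min(0.10, 0.01 * junk_lines)    # capped at -0.10
    return fix_credit + pass_bonus + integrity
```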
Milestone-Based Progression
| Stage | Description | Reward |
|---|---|---|
| Investigated | First pipeline run to observe failure | +0.10 |
| Diagnosed | Reads relevant diagnostic/source files | +0.10 |
| Fix Applied | Valid structural fix detected | +0.15 |
| Verified | Pipeline successfully passes | +0.50 |
Progress is state-driven, not command-driven.
Behavioral Shaping (Per-Step)
Rewards
- Correct Diagnosis: +0.10
- Cross-File Reasoning: +0.05
Penalties
- Blind Edits (edit without reading): −0.10
- Edit Spam (>2 edits per file): −0.05 each
- Idle Pipeline Runs (no FS changes): −0.05
- Stalling (no progress): −0.05
- Regression (breaking prior fix): −0.15
- Inefficiency: −0.02 per step beyond ideal (6 steps)
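Aggregating these per-step adjustments could look like the sketch below; the event names and how events are detected are assumptions, only the magnitudes follow the lists above.

```python
# Sketch of per-step behavioural shaping (event detection not shown).
SHAPING = {
    "correct_diagnosis": +0.10, "cross_file_reasoning": +0.05,
    "blind_edit": -0.10, "edit_spam": -0.05, "idle_run": -0.05,
    "stall": -0.05, "regression": -0.15,
}

def balance_score(events: set, steps_over_ideal: int = 0) -> float:
    score = sum(SHAPING.get(e, 0.0) for e in events)
    return score - 0.02 * max(0, steps_over_ideal)   # inefficiency drag
```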
Key Design Insight
The grader differentiates:
- Structured debugging β rewarded
- Brute-force / guesswork β penalized
Partial fixes receive proportional credit, enabling meaningful learning even in multi-error environments.
5. Project Structure
```
CI_CD_Doctor/
├── Dockerfile          # container setup
├── README.md           # main project overview
├── __init__.py
├── client.py           # environment client interface
├── models.py           # core data models (Action / State / Observation)
├── inference.py        # baseline agent runner
├── openenv.yaml        # OpenEnv task + grader config
├── pyproject.toml
├── uv.lock             # dependency lockfile
│
├── core/               # modularized environment logic
│   ├── __init__.py
│   ├── grading/
│   │   └── grader.py          # scoring + reward shaping logic
│   ├── pipeline/
│   │   └── stage_runner.py    # simulated CI/CD stages
│   ├── scenarios/
│   │   └── generator.py       # task + variant generation
│   ├── utils/
│   │   └── packages.py        # dependency definitions
│   └── validation/
│       ├── parser.py          # command parsing logic
│       └── validator.py       # structural validation (CI rules, configs)
│
├── server/             # execution backend
│   ├── __init__.py
│   ├── app.py          # FastAPI entrypoint
│   ├── app_2.py        # alternate server setup
│   └── environment.py  # main env loop (reset/step/state)
│
└── docs/
    ├── README.md            # HF space readme
    └── advanced_readme.md   # detailed system design
```
6. Development
Run the server locally:
```
uvicorn server.app:app --reload
```