---
title: Ci Cd Doctor Environment Server
emoji: 🩺
colorFrom: indigo
colorTo: indigo
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---
# CI/CD Doctor

**An OpenEnv RL environment where the agent plays a DevOps engineer fixing a broken CI/CD pipeline.**

Each episode boots a procedurally generated, structurally broken project. The agent reads pipeline error logs, inspects config files, applies targeted edits with `sed` / `echo` commands, and re-runs the pipeline until it goes green, all under a strict step budget. Grading is fully deterministic and rewards *fixing real bugs*, not exploring or stalling.

[Hugging Face Playground (with instructions)](https://huggingface.co/spaces/samrat-rm/CI_CD_Doctor)

---
## 1. Why This Environment

CI/CD failure triage is one of the highest-leverage chores in modern software engineering. Every team that ships code spends real engineer-hours staring at red builds asking:

> *Which stage failed? Which file is wrong? What does it expect? What do I change?*

That loop of **discover → investigate → diagnose → fix → verify** is exactly what this environment trains. Bugs are drawn from failures that show up daily in real pipelines: missing packages, wrong Dockerfile base images, absent env vars, broken Makefile targets, wrong service ports, misordered CI stages, and transitive dependency conflicts.

### Research Context

Soni et al. (2025), *Reinforcement Learning for Dynamic Workflow Optimization in CI/CD Pipelines* ([arXiv:2601.11647](https://doi.org/10.48550/arXiv.2601.11647)), validate RL for pipeline automation but explicitly leave *failure diagnosis and repair* as future work. That is the gap CI/CD Doctor fills.

---
## 2. Design Principles

- **No mocked rewards.** Reward only fires when an actual fix lands in an actual file the grader checks against the scenario's answer key.
- **Logs describe the symptom, not the cure.** Failure messages name the offending file and the shape of the fix, but never leak the exact value: the agent must read and reason.
- **Cascading failures on hard.** Hard scenarios chain three independent bugs across multiple files. Each pipeline run only reveals the *next* failing stage.
- **Anti-exploit shaping.** Idle re-runs, redundant reads, and "knows-the-fix-but-stalls" patterns are penalized so agents cannot farm reward by spamming the pipeline.
- **Pure simulation.** No real `pip`, no real `docker`, no real subprocess. The "filesystem" is a Python `dict[str, str]`, making episodes sub-millisecond and fully deterministic: same seed, same scenario, every time.
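The dict-backed filesystem idea above can be sketched in a few lines. `SimFS` and `apply_sed` are illustrative names for this sketch, not the project's actual API:

```python
import re

# Minimal sketch: the "disk" is just a dict mapping paths to file contents,
# so every edit is a dictionary write and episodes stay fully deterministic.
class SimFS:
    def __init__(self, files: dict[str, str]):
        self.files = dict(files)  # copy, so the scenario template is never mutated

    def read(self, path: str) -> str:
        return self.files[path]

    def apply_sed(self, path: str, pattern: str, replacement: str) -> None:
        # Rough analogue of `sed -i 's/pattern/replacement/' path`
        self.files[path] = re.sub(pattern, replacement, self.files[path])

fs = SimFS({"Dockerfile": "FROM python:3.9-slim\n"})
fs.apply_sed("Dockerfile", r"python:3\.9-slim", "python:3.11-slim")
# fs.read("Dockerfile") is now "FROM python:3.11-slim\n"
```

Because no subprocess or container runtime is involved, an entire edit-and-rerun cycle is just dictionary operations plus a regex substitution.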
---
## 3. Tasks

Tasks are categorized by the **depth of reasoning** required.

| Tier | Max Steps | Ideal Steps | Faults | Strategic Complexity |
|---|---|---|---|---|
| Easy | 10 | 3 | 1 | Linear: single-file lookup → direct fix |
| Medium | 15 | 6 | 2 | Relational: cross-file reasoning |
| Hard | 25 | 10 | 3 | Sequential: cascading failures |
**Notes**:

- Faults are typed (e.g., `package_present`, `dockerfile_base`, `env_var_present`, `config_value`, `ci_stage_order`, `port_value`).
- Only the first failing stage is exposed per run; later faults are revealed after earlier ones are fixed.
- Validation is structural, not string-based: the grader checks the shape of the fix, not an exact text match.
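One way a typed fault and its structural check could fit together, as a hedged sketch (the `Fault` encoding and the answer-key format here are assumptions, not the project's real schema):

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative fault record: a type tag, the file the grader inspects,
# and a structural predicate over that file's contents.
@dataclass
class Fault:
    kind: str                      # e.g. "package_present", "port_value"
    file: str                      # file checked against the answer key
    check: Callable[[str], bool]   # structural validation, not string match

def package_present(pkg: str) -> Callable[[str], bool]:
    # Structural check: the package appears as a requirement line,
    # with or without a version pin, regardless of surrounding lines.
    return lambda text: any(
        line.split("==")[0].strip() == pkg for line in text.splitlines()
    )

fault = Fault("package_present", "requirements.txt", package_present("flask"))
fixed = "requests==2.31.0\nflask==3.0.0\n"
```

A predicate like this is why adding `flask==3.0.0` or a bare `flask` line both pass, while merely echoing the word "flask" into a comment elsewhere would not.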
See [docs/advanced_readme.md](advanced_readme.md) for the full variant breakdown, pipeline shapes, and reasoning about why hard is genuinely hard.

---
## 4. Quick Start

### Install

```bash
git clone https://github.com/<your-handle>/CI_CD_Doctor.git
cd CI_CD_Doctor
uv sync  # or: pip install -e .
```
### Build the Docker image & run inference

```bash
docker build -t ci-cd-doctor-env:latest -f Dockerfile .
docker run -p 8000:8000 ci-cd-doctor-env:latest
uv run python inference.py
```
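For orientation, the inference loop has roughly this shape. This sketch assumes a Gym-style `reset()`/`step()` client and uses a toy in-process environment so it runs standalone; the actual client in `inference.py` may differ:

```python
# Hypothetical episode loop: observe, act, re-run until the pipeline is green
# or the step budget is exhausted. Interface names are assumptions.
def run_episode(env, policy, max_steps: int = 25) -> float:
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(obs)              # e.g. a `cat`, `sed`, or pipeline re-run command
        obs, reward, done = env.step(action)
        total_reward += reward
        if done:                          # pipeline went green
            break
    return total_reward

# Tiny fake environment so the loop above is runnable without the server.
class FakeEnv:
    def reset(self):
        self.green = False
        return "stage test: FAILED"

    def step(self, action):
        if action == "fix":
            self.green = True
        obs = "green" if self.green else "red"
        return obs, (1.0 if self.green else 0.0), self.green

total = run_episode(FakeEnv(), policy=lambda obs: "fix")
```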
---
## 5. Baseline Performance

Results from 50 episodes per (model, task) cell, seeds `0–1000`, temperature `0.5`, 4k-token context per step. Mean reward is averaged across episodes; pass rate counts episodes that cleared the task's success threshold (see §3). Avg steps is measured on passing episodes only.

| Model | Task | Mean reward | Pass rate | Avg steps (passed) |
|---|---|---|---|---|
| `Qwen/Qwen2.5-72B-Instruct` | easy | 0.99 | ~90% | 5.5 |
| `Qwen/Qwen2.5-72B-Instruct` | medium | 0.62 | ~50% | 11.5 |
| `Qwen/Qwen2.5-72B-Instruct` | hard | 0.38 | ~20% | 22.5 |
**Observations.**

- **Easy is near-ceiling for frontier models** but not trivial: failures come from hallucinated filenames, malformed `sed` patterns, or forgetting to re-run the pipeline after the fix.
- **Medium halves the pass rate.** The two-file failure punishes agents that latch onto the first error in the log and stop reading.
- **Hard is the real benchmark.** Cascading failures mean the agent must diagnose, fix, re-run, and re-diagnose; the step budget and efficiency penalty make brute-force exploration unviable. No evaluated model clears a 25% pass rate.
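The table's summary statistics can be reproduced from raw episode records with a few lines; the record field names here are illustrative, not the project's logging schema:

```python
# Aggregate per-(model, task) episode records into the table's three columns.
def summarize(episodes: list[dict]) -> dict:
    passed = [e for e in episodes if e["passed"]]
    return {
        "mean_reward": sum(e["reward"] for e in episodes) / len(episodes),
        "pass_rate": len(passed) / len(episodes),
        # Avg steps is computed on passing episodes only, matching the table.
        "avg_steps_passed": (
            sum(e["steps"] for e in passed) / len(passed) if passed else None
        ),
    }

stats = summarize([
    {"reward": 1.0, "passed": True, "steps": 4},
    {"reward": 0.2, "passed": False, "steps": 10},
])
```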
---
## 6. Grader: the heart of the environment ❤️

The grader rewards state transitions in a debugging workflow rather than surface-level actions, in line with OpenEnv principles. It combines deterministic structural validation of fixes with trajectory-based shaping that encourages investigation, diagnosis, and verification.

The result is a dense reward signal that penalizes inefficient or uninformed behavior and ties evaluation to true task completion, causal reasoning, and system correctness rather than pattern matching.
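A minimal sketch of that combination: a deterministic structural score minus shaping penalties for wasted steps. The `Check` helper, the weights, and the penalty values are illustrative assumptions, not the real grader:

```python
# Illustrative grader: reward = fraction of answer-key checks that pass
# structurally, minus trajectory-shaping penalties for inefficient behavior.
class Check:
    def __init__(self, file: str, predicate):
        self.file = file
        self.check = predicate

def grade(files: dict, answer_key: list, idle_runs: int, redundant_reads: int) -> float:
    fixed = sum(1 for c in answer_key if c.check(files.get(c.file, "")))
    base = fixed / len(answer_key)                        # structural component
    penalty = 0.05 * idle_runs + 0.02 * redundant_reads   # anti-exploit shaping
    return max(0.0, base - penalty)

key = [Check("requirements.txt", lambda text: "flask" in text)]
score = grade(
    {"requirements.txt": "flask==3.0.0\n"}, key, idle_runs=2, redundant_reads=1
)
```

Note how spamming the pipeline (idle re-runs) eats into an otherwise perfect structural score, which is the behavior §2's anti-exploit shaping describes.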
## 7. Task / Problem Description

The scenario generator creates procedurally diverse CI/CD debugging tasks that emphasize causal reasoning over pattern matching. Each scenario introduces realistic, multi-file failures with symptom-based signals, requiring agents to investigate, diagnose, and apply structurally valid fixes. Because each scenario encodes its ground-truth fixes, diagnostic files, and interdependent errors, evaluation captures cross-file reasoning and end-to-end pipeline correctness rather than memorized patterns.
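Seed determinism, the property that makes every scenario reproducible, can be sketched like so (the fault pool and scenario shape are illustrative, not the generator's real internals):

```python
import random

# Illustrative fault pool, mirroring the typed faults listed in §3.
FAULT_POOL = [
    "package_present", "dockerfile_base", "env_var_present",
    "config_value", "ci_stage_order", "port_value",
]

def generate_scenario(seed: int, n_faults: int) -> list[str]:
    # A local Random instance keeps generation free of global RNG state,
    # so the same seed always yields the same scenario.
    rng = random.Random(seed)
    return sorted(rng.sample(FAULT_POOL, n_faults))

a = generate_scenario(seed=42, n_faults=3)
b = generate_scenario(seed=42, n_faults=3)
# a == b: same seed, same scenario, every time
```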
## 8. Documentation

- **[advanced_readme.md](advanced_readme.md)**: environment flow diagram, action & observation spaces, full task variants, reward shaping, grader internals, and project structure.

---
## 9. License

MIT.

---

<img width="510" height="572" alt="ci_cd_doc_meme" src="https://github.com/user-attachments/assets/802c5c70-fea6-40a4-b702-91eecbffd3fd" />