---
title: Ci Cd Doctor Environment Server
emoji: 🩺
colorFrom: indigo
colorTo: indigo
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---

# CI/CD Doctor

An OpenEnv RL environment where the agent plays a DevOps engineer fixing a broken CI/CD pipeline.

Each episode boots a procedurally generated, structurally broken project. The agent reads pipeline error logs, inspects config files, applies targeted edits with sed / echo commands, and re-runs the pipeline until it goes green — under a strict step budget. Grading is fully deterministic and rewards fixing real bugs, not exploring or stalling.
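The edit loop above can be sketched with the same shell primitives the agent uses. The file contents and package names here are purely illustrative, not taken from an actual scenario:

```bash
# Hypothetical broken state: the pipeline log says 'requests' is missing.
printf 'flask==3.0.0\n' > requirements.txt

# Investigate: read the file the log named.
cat requirements.txt

# Apply targeted fixes with echo and sed (values are illustrative).
echo 'requests==2.31.0' >> requirements.txt
sed -i 's/flask==3.0.0/flask==3.0.3/' requirements.txt
```

In the environment these commands act on the simulated filesystem, not a real shell, but the syntax the agent emits is the same.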

Hugging Face Playground (with instructions)


## 1. Why This Environment

CI/CD failure triage is one of the highest-leverage chores in modern software engineering. Every team that ships code spends real engineer-hours staring at red builds asking:

Which stage failed? Which file is wrong? What does it expect? What do I change?

That loop — discover → investigate → diagnose → fix → verify — is exactly what this environment trains. Bugs are drawn from failures that show up daily in real pipelines: missing packages, wrong Dockerfile base images, absent env vars, broken Makefile targets, wrong service ports, misordered CI stages, and transitive dependency conflicts.

### Research Context

Soni et al. (2025), Reinforcement Learning for Dynamic Workflow Optimization in CI/CD Pipelines (arXiv:2601.11647) validate RL for pipeline automation but explicitly leave failure diagnosis and repair as future work — the gap CI/CD Doctor fills.


## 2. Design Principles

  • No mocked rewards. Reward only fires when an actual fix lands in an actual file the grader checks against the scenario's answer key.
  • Logs describe the symptom, not the cure. Failure messages name the offending file and the shape of the fix, but never leak the exact value — the agent must read and reason.
  • Cascading failures on hard. Hard scenarios chain three independent bugs across multiple files. Each pipeline run only reveals the next failing stage.
  • Anti-exploit shaping. Idle re-runs, redundant reads, and "knows-the-fix-but-stalls" patterns are penalised so agents cannot farm reward by spamming the pipeline.
  • Pure simulation. No real pip, no real docker, no real subprocess. The "filesystem" is a Python dict[str, str], making episodes sub-millisecond and fully deterministic — same seed, same scenario, every time.
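The last principle is easy to see in code. A minimal sketch (all names and fault values here are assumptions, not the environment's actual generator) of a dict-backed filesystem seeded for determinism:

```python
import random

# Minimal sketch: the episode "filesystem" is just a dict[str, str],
# so reads and writes are plain dict operations, with no subprocesses.
def make_scenario(seed: int) -> dict[str, str]:
    rng = random.Random(seed)                  # same seed -> same scenario
    base = rng.choice(["python:3.10", "python:3.11"])
    missing_pkg = rng.choice(["requests", "pyyaml"])
    return {
        "Dockerfile": f"FROM {base}-wrong\n",  # injected fault: bad base image
        "requirements.txt": "flask==3.0.0\n",  # injected fault: package absent
        ".answer_key/missing_pkg": missing_pkg,  # hidden target for the grader
    }

# Fully deterministic: identical seeds produce identical scenarios.
assert make_scenario(7) == make_scenario(7)
```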

## 3. Tasks

Tasks are categorized by the depth of reasoning required.

| Tier | Max Steps | Ideal Steps | Faults | Strategic Complexity |
|------|-----------|-------------|--------|----------------------|
| Easy | 10 | 3 | 1 | Linear: single-file lookup → direct fix |
| Medium | 15 | 6 | 2 | Relational: cross-file reasoning |
| Hard | 25 | 10 | 3 | Sequential: cascading failures |

Notes:

  • Faults are typed (e.g., package_present, dockerfile_base, env_var_present, config_value, ci_stage_order, port_value).

  • Only the first failing stage is exposed per run; later faults are revealed after fixes.

  • Validation is structural, not string-based.

See docs/advanced_readme.md for the full variant breakdown, pipeline shapes, and reasoning about why hard is genuinely hard.
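"Structural, not string-based" is worth unpacking. A hedged sketch of what a structural check for a `package_present` fault could look like (the function name and parsing rules are assumptions, not the grader's actual code): the fix passes if the package appears as a requirement at all, regardless of pinning, casing, or line order.

```python
def package_present(requirements_txt: str, package: str) -> bool:
    """Structural check: is `package` declared, however it is pinned?"""
    for line in requirements_txt.splitlines():
        # Strip version specifiers so "Requests>=2.0" matches "requests".
        name = line.split("==")[0].split(">=")[0].strip().lower()
        if name == package.lower():
            return True
    return False

assert package_present("flask==3.0.0\nRequests>=2.0\n", "requests")
assert not package_present("flask==3.0.0\n", "requests")
```

An exact-string grader would reject `Requests>=2.0` as a fix for a missing `requests`; a structural one accepts any valid declaration.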


## 4. Quick Start

### Install

```bash
git clone https://github.com/<your-handle>/CI_CD_Doctor.git
cd CI_CD_Doctor
uv sync   # or: pip install -e .
```

### Build the Docker image & run inference

```bash
docker build -t ci-cd-doctor-env:latest -f Dockerfile .
docker run -p 8000:8000 ci-cd-doctor-env:latest
uv run python inference.py
```

## 5. Baseline Performance

Results from 50 episodes per (model, task) cell, seeds 0–1000, temperature 0.5, 4k-token context per step. Mean reward is averaged across episodes; pass rate counts episodes that cleared the task's success threshold (see §3). Avg steps is measured on passing episodes only.

| Model | Task | Mean reward | Pass rate | Avg steps (passed) |
|-------|------|-------------|-----------|--------------------|
| Qwen/Qwen2.5-72B-Instruct | easy | 0.99 | ~90% | 5.5 |
| Qwen/Qwen2.5-72B-Instruct | medium | 0.62 | ~50% | 11.5 |
| Qwen/Qwen2.5-72B-Instruct | hard | 0.38 | ~20% | 22.5 |

**Observations.**

  • Easy is near-ceiling for frontier models but not trivial: failures come from hallucinated filenames, malformed sed patterns, or forgetting to re-run the pipeline after the fix.
  • Medium halves the pass rate. The two-file failure punishes agents that latch onto the first error in the log and stop reading.
  • Hard is the real benchmark. Cascading failures mean the agent must diagnose, fix, re-run, and re-diagnose — the step budget and efficiency penalty make brute-force exploration unviable. No evaluated model clears 25% pass rate.

## 6. Grader - The Heart of the Environment ❤️

This grader implements a structured, real-world evaluation aligned with OpenEnv principles by rewarding state transitions in a debugging workflow rather than surface-level actions. It combines deterministic structural validation of fixes with trajectory-based shaping, encouraging investigation, diagnosis, and verification.

The design provides dense reward signals, penalizes inefficient or uninformed behavior, and ensures evaluation reflects true task completion, causal reasoning, and system correctness rather than pattern matching.
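As a rough sketch of what trajectory-based shaping means here (the constants and function shape are illustrative assumptions, not the grader's actual values): reward newly fixed faults as state transitions, add a terminal bonus, and penalize idle re-runs that change nothing.

```python
def shaped_reward(fixed_before: int, fixed_after: int, total_faults: int,
                  action: str) -> float:
    """Illustrative shaping: dense progress signal minus anti-exploit penalty."""
    progress = (fixed_after - fixed_before) / total_faults  # reward per fixed fault
    # Penalize re-running the pipeline when no new fault was fixed.
    idle_penalty = 0.05 if action == "run_pipeline" and progress == 0 else 0.0
    bonus = 0.5 if fixed_after == total_faults else 0.0     # terminal success bonus
    return progress + bonus - idle_penalty

assert shaped_reward(0, 1, 2, "edit") == 0.5        # one of two faults fixed
assert shaped_reward(1, 1, 2, "run_pipeline") < 0   # idle re-run is penalized
```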

## 7. Task / Problem Description

This scenario generator creates procedurally diverse CI/CD debugging tasks that emphasize causal reasoning over pattern matching, aligned with OpenEnv evaluation principles. Each scenario introduces realistic, multi-file failures with symptom-based signals, requiring agents to investigate, diagnose, and apply structurally valid fixes. By encoding ground-truth fixes, diagnostic files, and interdependent errors, it ensures evaluation captures true system understanding, cross-file reasoning, and end-to-end pipeline correctness.
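The shape of a scenario can be sketched as follows. All field names here are assumptions made for illustration, not the environment's actual schema; the point is that the broken files, the symptom signals, and the ground-truth answer key travel together.

```python
from dataclasses import dataclass, field

@dataclass
class Fault:
    kind: str      # e.g. "package_present", "dockerfile_base", "port_value"
    file: str      # the file a valid fix must land in
    expected: str  # structural target value; never leaked into the logs

@dataclass
class Scenario:
    files: dict[str, str]                      # the simulated filesystem
    faults: list[Fault]                        # ground-truth answer key
    diagnostic_files: list[str] = field(default_factory=list)  # worth reading

demo = Scenario(
    files={"Dockerfile": "FROM python:2.7\n"},
    faults=[Fault("dockerfile_base", "Dockerfile", "python:3.11")],
    diagnostic_files=["Dockerfile", "ci.log"],
)
```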

## 8. Documentation

  • advanced_readme.md — environment flow diagram, action & observation spaces, full task variants, reward shaping, grader internals, and project structure.

## 9. License

MIT.

