---
title: CICD_DEBUGGER
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
tags:
- openenv
---

# CI/CD Pipeline Debugger Environment (OpenEnv)

## 1. Project Goal

This repository implements an AI training and evaluation environment where an agent learns to debug broken CI/CD pipelines automatically.

The environment targets real-world DevOps failure patterns, including:

- YAML syntax and structure issues
- Incorrect build/test commands (for example, `npm tset` -> `npm test`)
- Dependency and setup failures
- Multi-stage pipeline execution errors

This is designed as an RL-style interaction loop:

Observe -> Think -> Act -> Get Reward -> Repeat

## 2. Why This Matters

CI/CD failures are common, repetitive, and often take multiple steps to resolve. This project turns that workflow into a structured learning environment where agents:

- Read failure context
- Reason about root causes
- Propose and apply fixes
- Receive shaped rewards for robust behavior

## 3. System Architecture

High-level flow:

Agent (LLM) -> Action -> Environment.step() -> Reward/Evaluation -> Next step

Core integration path:

Model -> Action -> Environment.step() -> RewardCalculator

RewardCalculator integrates:

- DeterministicGrader
- LLMJudge
- HiddenTestRunner
- AntiHackingDetector

### 3.1 OpenEnv Interface (Typed)

Typed Pydantic models are defined in `env/models.py`:

- `Observation`: strict schema for environment observations
- `Action`: normalized tool + payload action schema
- `Reward`: bounded reward model with components

Environment contract:

- `reset()` returns the initial `Observation` payload
- `step(action)` returns `(observation, reward, done, info)`
- `state()` returns the current environment state snapshot

Server/API contract models are exposed in `server/app.py` and use the same typed observation/action/reward structures.

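The contract above can be exercised with a short driver loop. The following is a minimal sketch against a stub environment: only the `reset()`/`step()`/`state()` shapes follow the documented contract, and everything else (the stub's internals, plain dataclasses standing in for the Pydantic models) is invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative stand-ins for the typed models in env/models.py;
# plain dataclasses are used so the sketch runs without dependencies.
@dataclass
class Action:
    tool: str
    payload: Optional[dict] = None

@dataclass
class Reward:
    value: float                      # bounded in [0.0, 1.0]
    components: dict = field(default_factory=dict)

class StubEnv:
    """Toy environment following the documented reset/step/state contract.
    Its internal logic is invented for this example."""
    def __init__(self) -> None:
        self.steps = 0

    def reset(self) -> dict:
        self.steps = 0
        return {"task_id": "easy-command-typo", "step_count": 0}

    def step(self, action: Action):
        self.steps += 1
        observation = {"task_id": "easy-command-typo", "step_count": self.steps}
        done = action.tool == "submit_solution"
        reward = Reward(value=1.0 if done else 0.1, components={"progress": 0.1})
        return observation, reward, done, {}

    def state(self) -> dict:
        return {"step_count": self.steps}

env = StubEnv()
obs = env.reset()
obs, reward, done, info = env.step(Action(tool="read_logs"))
print(obs["step_count"], reward.value, done)
```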
### 3.2 Action and Observation Spaces

Observation fields include:

- `task_id`, `difficulty`, `failure_stage`, `actual_bug`
- `config`, `logs`, `error_message`
- `available_tools`, `progress_flags`
- `file_modification_count`, `hidden_test_pass_rate`, `step_count`, `last_action_error`

Action schema:

- `tool`: one of `read_file`, `read_logs`, `analyze_error`, `edit_config`, `run_pipeline_stage`, `run_tests`, `validate_fix`, `submit_solution`
- `payload`: optional dict (for example `{ "raw": "replace npm tset with npm test" }`)

Reward schema:

- `value`: bounded float in `[0.0, 1.0]`
- `components`: reward breakdown dictionary

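For illustration, here is how an action and a bounded reward matching the schemas above might be built by hand. The `bounded_reward` helper is an assumption about how the `[0.0, 1.0]` bound could be enforced, not the repo's actual code.

```python
import json

# An edit_config action as it might be serialized over the wire.
action = {
    "tool": "edit_config",
    "payload": {"raw": "replace npm tset with npm test"},
}

def bounded_reward(components: dict) -> dict:
    """Sum component scores and clamp the total into [0.0, 1.0]."""
    total = sum(components.values())
    return {"value": max(0.0, min(1.0, total)), "components": components}

reward = bounded_reward({"progress": 0.3, "quality": 0.5, "penalty": -0.1})
print(json.dumps(action))
print(round(reward["value"], 2))
```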
## 4. Core Modules

### 4.1 Quality Judge

- File: `env/graders/llm_judge.py`
- Purpose: quality-aware scoring of fixes
- Output keys: `correctness`, `minimalism`, `quality` (all in `[0, 1]`)
- Guarantees:
  - strict JSON parsing attempt
  - robust fallback parsing for messy output
  - no-crash behavior (safe zero scores on failure)

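The no-crash parsing behavior can be sketched as strict `json.loads` first, then a regex fallback, then safe zeros. This is a simplified illustration, not the code in `env/graders/llm_judge.py`.

```python
import json
import re

SAFE_ZERO = {"correctness": 0.0, "minimalism": 0.0, "quality": 0.0}

def parse_judge_output(text: str) -> dict:
    """Parse judge scores without ever raising; zeros on any failure."""
    try:
        data = json.loads(text)                             # strict attempt first
    except (json.JSONDecodeError, TypeError):
        blob = re.search(r"\{.*\}", text or "", re.DOTALL)  # fallback: first {...} span
        if blob is None:
            return dict(SAFE_ZERO)
        try:
            data = json.loads(blob.group(0))
        except json.JSONDecodeError:
            return dict(SAFE_ZERO)
    if not isinstance(data, dict):
        return dict(SAFE_ZERO)
    # Keep only the expected keys, clamped into [0, 1]; zero anything malformed.
    out = {}
    for key in SAFE_ZERO:
        try:
            out[key] = max(0.0, min(1.0, float(data.get(key, 0.0))))
        except (TypeError, ValueError):
            out[key] = 0.0
    return out

print(parse_judge_output('noise {"correctness": 0.9, "minimalism": 0.5, "quality": 1.4}'))
```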
### 4.2 Deterministic Grader

- File: `env/graders/deterministic.py`
- Purpose: reproducible correctness scoring (0-1)
- Checks:
  - YAML validity
  - command and fix correctness
  - similarity and issue resolution
- Rules:
  - deterministic only
  - same input, same score

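A flavor of the deterministic checks can be given with stdlib-only code. This sketch covers only the command-correctness and similarity pieces (the real grader also validates YAML), and the weights and function name are invented.

```python
from difflib import SequenceMatcher

def grade_fix(fixed_config: str, expected_config: str) -> float:
    """Reproducible 0-1 score: known command fix plus textual similarity.
    Weights are illustrative placeholders."""
    command_ok = 1.0 if "npm test" in fixed_config and "npm tset" not in fixed_config else 0.0
    similarity = SequenceMatcher(None, fixed_config, expected_config).ratio()
    return round(0.5 * command_ok + 0.5 * similarity, 3)

broken = "steps:\n  - run: npm tset"
expected = "steps:\n  - run: npm test"
print(grade_fix(expected, expected))  # identical input -> perfect score
print(grade_fix(broken, expected) < grade_fix(expected, expected))
```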
### 4.3 Anti-Hacking Detector

- File: `env/anti_hacking.py`
- Purpose: detect reward-hacking and shortcut behavior
- Penalty detectors:
  - stage skipping (`if: false`, `when: never`)
  - fake success (`echo tests passed`, unsafe `exit 0` patterns)
  - pipeline breakage between versions
  - excessive edits
  - timeout abuse via too many steps

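The detectors can be pictured as pattern checks over the proposed config. The patterns below are simplified stand-ins for two of the documented detectors; the real module also tracks edits, versions, and step counts.

```python
import re

# Simplified penalty patterns; the real detector set is richer.
PATTERNS = {
    "stage_skipping": re.compile(r"if:\s*false|when:\s*never"),
    "fake_success": re.compile(r"echo\s+['\"]?tests passed|(^|\s)exit\s+0\b"),
}

def detect_hacks(config_text: str) -> list:
    """Names of all penalty detectors triggered by a config."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(config_text)]

print(detect_hacks("steps:\n  - run: echo tests passed\n  - if: false"))
```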
### 4.4 Hidden Tests

- File: `env/hidden_tests.py`
- Purpose: test fix robustness, not just exact-match overfitting
- Method:
  - deterministic variant generation (OS, versions, env shifts)
  - evaluate pass rate across variants

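Deterministic variant generation can be sketched as enumerating a fixed grid of environment shifts and deriving a per-variant seed from the task id. The axes, values, and function names here are invented for illustration.

```python
import hashlib

# Illustrative variant axes; the real generator's axes may differ.
OS_LIST = ["ubuntu-22.04", "ubuntu-20.04", "macos-13"]
NODE_VERSIONS = ["18", "20"]

def variants(task_id: str) -> list:
    """Deterministically enumerate hidden-test variants for a task."""
    out = []
    for os_name in OS_LIST:
        for node in NODE_VERSIONS:
            seed = hashlib.sha256(f"{task_id}:{os_name}:{node}".encode()).hexdigest()[:8]
            out.append({"os": os_name, "node": node, "seed": seed})
    return out

def pass_rate(results: list) -> float:
    """Fraction of variants whose tests passed."""
    return round(sum(results) / len(results), 3) if results else 0.0

vs = variants("easy-command-typo")
print(len(vs), pass_rate([True, True, True, True, True, False]))
```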
### 4.5 Reward Shaping

- File: `env/rewards.py`
- Purpose: step-level learning signal
- Components:
  - progress rewards (logs, analysis, fix proposal)
  - execution rewards (pipeline run, tests pass)
  - quality rewards (deterministic + hidden tests + LLM judge)
  - anti-hacking penalties

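Putting the components together, shaping can be sketched as a weighted sum with a clamp. The weights and function name below are placeholders, not the tuned values in `env/rewards.py`.

```python
def shaped_reward(progress: float, execution: float, quality: float, penalty: float) -> dict:
    """Weighted mix of the documented components, clamped into [0.0, 1.0]."""
    components = {
        "progress": 0.2 * progress,    # logs read, error analyzed, fix proposed
        "execution": 0.3 * execution,  # pipeline ran, tests passed
        "quality": 0.5 * quality,      # deterministic + hidden tests + LLM judge
        "penalty": -penalty,           # anti-hacking deductions
    }
    value = max(0.0, min(1.0, sum(components.values())))
    return {"value": round(value, 2), "components": components}

print(shaped_reward(progress=1.0, execution=1.0, quality=0.8, penalty=0.1)["value"])
```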
## 5. Inference and Evaluation

### 5.1 Prompt and Model Layers

- `inference/prompts.py`: stable prompt templates and fallback action heuristics
- `inference/model_wrapper.py`: OpenAI client action generation, candidate generation, and safe fallback

Canonical action tools used by the environment and inference:

- `read_file`
- `read_logs`
- `analyze_error`
- `edit_config`
- `run_pipeline_stage`
- `run_tests`
- `validate_fix`
- `submit_solution`

### 5.2 Metrics and Artifacts

- `inference/metrics.py`: reward, success-rate, and failure-reason tracking
- `inference/visualize.py`: reward curve and metrics artifact export

### 5.3 Submission-Critical Runtime

- File: `inference.py` (root)
- Responsibilities:
  - initialize model and environment
  - run the step loop
  - calculate rewards
  - emit the strict stdout contract
  - always emit the END line

Required output format:

- `[START] task=... env=... model=...`
- `[STEP] step=<n> action=... reward=0.00 done=<true|false> error=<msg|null>`
- `[END] success=<true|false> steps=<n> score=<0.000> rewards=<r1,r2,...>`

Rules enforced:

- single-line logs only
- reward values with 2 decimals
- lowercase booleans
- no extra runtime log noise

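The formatting rules can be made concrete with small helpers; the function names here are illustrative, not the ones in `inference.py`.

```python
from typing import Optional

def fmt_bool(flag: bool) -> str:
    """Lowercase booleans, as the contract requires."""
    return "true" if flag else "false"

def step_line(step: int, action: str, reward: float, done: bool,
              error: Optional[str]) -> str:
    """Single-line [STEP] record with a 2-decimal reward."""
    return (f"[STEP] step={step} action={action} reward={reward:.2f} "
            f"done={fmt_bool(done)} error={error or 'null'}")

def end_line(success: bool, steps: int, score: float, rewards: list) -> str:
    """[END] record: 3-decimal score, comma-joined 2-decimal rewards."""
    return (f"[END] success={fmt_bool(success)} steps={steps} "
            f"score={score:.3f} rewards={','.join(f'{r:.2f}' for r in rewards)}")

print(step_line(1, "read_logs", 0.1, False, None))
print(end_line(False, 3, 0.578, [0.1, 0.2, 0.3]))
```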
## 6. Task Coverage

The project includes 9 CI-fix tasks spanning:

- easy: syntax and typo fixes
- medium: dependency/env/cache/permissions issues
- hard: matrix logic, conditional flow, orchestration-level failures

Representative baseline tasks (one per difficulty):

- easy: `easy-command-typo` (fix invalid `npm tset` command)
- medium: `medium-python-version` (align workflow Python version)
- hard: `hard-needs-order` (repair deploy job dependency ordering)

## 7. Setup

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Environment variables:

```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="<your_openai_compatible_api_key>"
# Optional alias; if set, this takes precedence over HF_TOKEN in inference.py
export OPENAI_API_KEY="<same_token_optional>"
# Optional, only if your inference spins up environments from local images.
export LOCAL_IMAGE_NAME="<local_env_image_name>"
```

If you want to use an OpenAI access token directly:

```bash
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o-mini"
export HF_TOKEN="<your_openai_access_token>"
# Optional alias:
export OPENAI_API_KEY="<same_token_optional>"
```

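The documented key precedence (`OPENAI_API_KEY` over `HF_TOKEN`) amounts to logic like the following; the function name is illustrative.

```python
from typing import Optional

def resolve_api_key(env: dict) -> Optional[str]:
    """OPENAI_API_KEY wins when both are set; HF_TOKEN is the fallback."""
    return env.get("OPENAI_API_KEY") or env.get("HF_TOKEN")

print(resolve_api_key({"HF_TOKEN": "hf_xxx"}))
print(resolve_api_key({"HF_TOKEN": "hf_xxx", "OPENAI_API_KEY": "sk_yyy"}))
```

In real usage the mapping passed in would be `os.environ`.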
## 8. Run Inference

Offline/local mode:

```bash
python inference.py --offline --force-local-env --max-steps 8 --policy-mode imp --trajectories 4
```

Model-backed mode:

```bash
python inference.py --max-steps 8 --policy-mode imp --trajectories 4
```

To run the baseline across easy/medium/hard tasks, use one of the following.

OpenAI client mode:

```bash
OPENAI_API_KEY="<your_openai_compatible_api_key>" python baseline_inference.py --max-steps 5 --policy-mode imp --trajectories 3 --force-local-env
```

Offline reproducible mode:

```bash
python baseline_inference.py --max-steps 5 --policy-mode imp --trajectories 3 --offline --force-local-env
```

Policy modes:

- `sft`: deterministic heuristic policy
- `direct`: single model action per step
- `imp`: multi-candidate generation and ranking

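The `imp` mode's generate-and-rank step can be pictured like this; the scorer and candidates below are invented for illustration, and the real ranking criteria live in `inference/model_wrapper.py`.

```python
def rank_candidates(candidates: list, score_fn) -> dict:
    """Keep the highest-scoring candidate action."""
    return max(candidates, key=score_fn)

candidates = [
    {"tool": "read_logs", "payload": None},
    {"tool": "edit_config", "payload": {"raw": "replace npm tset with npm test"}},
]

def heuristic_score(action: dict) -> float:
    # Toy heuristic: prefer a concrete edit over another passive read.
    return 1.0 if action["tool"] == "edit_config" else 0.5

best = rank_candidates(candidates, heuristic_score)
print(best["tool"])
```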
## 9. Baseline Scores

Reproducible baseline artifact:

- `artifacts/baseline_scores.json`

Latest baseline run (`max_steps=5`, `policy_mode=imp`, `trajectories=3`):

| Task ID | Difficulty | Score | Success |
|---|---|---:|---:|
| easy-command-typo | easy | 0.541 | false |
| medium-python-version | medium | 0.679 | false |
| hard-needs-order | hard | 0.513 | false |

Aggregate:

- average score: `0.578`
- success rate: `0.000`

When `OPENAI_API_KEY` is provided, the same script runs through the OpenAI API client path in `inference.py`.

## 10. Tests

Run all tests:

```bash
python -m unittest discover -s tests -v
```

Coverage includes:

- LLM judge
- deterministic grader
- anti-hacking detectors
- hidden tests
- reward system
- end-to-end inference output format

## 11. Validation and Submission

OpenEnv validation:

```bash
python -m openenv.cli.__main__ validate
```

Pre-submission script:

```bash
./validate-submission.sh <your_hf_space_url>
```

Required environment variables:

```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export OPENAI_API_KEY="<your_openai_compatible_api_key>"
# Optional fallback:
export HF_TOKEN="<your_token>"
```

Docker run (Space/API mode):

```bash
docker build -t cicd-debugger-env .
docker run --rm -p 7860:7860 cicd-debugger-env
```

Server endpoints used by validators:

- `POST /reset`
- `POST /step`
- `GET /state`
- `GET /health`

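For reference, a `POST /step` request body would carry an action in the documented tool + payload shape. The exact wrapper key is an assumption here; the typed models in `server/app.py` define the authoritative schema.

```json
{
  "action": {
    "tool": "read_logs",
    "payload": null
  }
}
```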
## 12. Deploy to Hugging Face Space (OpenAI Token)

This repository is already configured for Docker Spaces (`sdk: docker` in this README front matter).

1. Create a new Hugging Face Space with SDK set to `Docker`.
2. Push this repository to the Space git remote.
3. In Space Settings -> Variables and secrets, add these Secrets:

```text
OPENAI_API_KEY=<your_openai_access_token>
API_BASE_URL=https://api.openai.com/v1
MODEL_NAME=gpt-4o-mini
```

4. Optional Secrets:

```text
HF_TOKEN=<optional_fallback_token>
OFFLINE_INFERENCE=0
MAX_STEPS=8
TEMPERATURE=0.2
MAX_TOKENS=120
```

5. Keep the app port as `7860` (already configured).
6. Wait for build completion, then verify:

```bash
curl -sS https://<your-space-name>.hf.space/health
curl -sS -X POST https://<your-space-name>.hf.space/reset -H 'Content-Type: application/json' -d '{}'
```

Notes:

- `.env.example` is for local development reference only. Hugging Face Spaces use Secrets/Variables from Space Settings.
- Runtime code reads `OPENAI_API_KEY` first and falls back to `HF_TOKEN` when `OPENAI_API_KEY` is not provided.

## 13. One-line Presentation Summary

We built an OpenEnv-compliant reinforcement learning environment where AI agents learn to debug real CI/CD pipelines using multi-step reasoning, hybrid grading, anti-hacking safeguards, and robust reward shaping.