---
title: CICD_DEBUGGER
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
tags:
- openenv
---
# CI/CD Pipeline Debugger Environment (OpenEnv)
## 1. Project Goal
This repository implements an AI training and evaluation environment where an agent learns to debug broken CI/CD pipelines automatically.
The environment targets real-world DevOps failure patterns, including:
- YAML syntax and structure issues
- Incorrect build/test commands (for example, npm tset -> npm test)
- Dependency and setup failures
- Multi-stage pipeline execution errors
The environment is designed as an RL-style interaction loop:
Observe -> Think -> Act -> Get Reward -> Repeat
## 2. Why This Matters
CI/CD failures are common and repetitive, and resolving them often takes several steps. This project turns that workflow into a structured learning environment where agents:
- Read failure context
- Reason about root causes
- Propose and apply fixes
- Get shaped rewards for robust behavior
## 3. System Architecture
High-level flow:
Agent (LLM) -> Action -> Environment.step() -> Reward/Evaluation -> Next step
Core integration path:
Model -> Action -> Environment.step() -> RewardCalculator
RewardCalculator integrates:
- DeterministicGrader
- LLMJudge
- HiddenTestRunner
- AntiHackingDetector
### 3.1 OpenEnv Interface (Typed)
Typed Pydantic models are defined in `env/models.py`:
- `Observation`: strict schema for environment observations
- `Action`: normalized tool + payload action schema
- `Reward`: bounded reward model with components
Environment contract:
- `reset()` returns the initial `Observation` payload
- `step(action)` returns `(observation, reward, done, info)`
- `state()` returns a snapshot of the current environment state
Server/API contract models are exposed in `server/app.py` and use the same typed observation/action/reward structures.
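
The contract above can be exercised with a loop like the one below. This is a minimal sketch, not the repository's code: `ToyEnv` is a hypothetical stand-in that only mimics the `reset()` / `step(action)` / `state()` signatures, and its reward numbers are invented for illustration.

```python
from typing import Any, Dict, List, Tuple


class ToyEnv:
    """Hypothetical stand-in that mimics the reset/step/state contract."""

    def __init__(self, max_steps: int = 8) -> None:
        self.max_steps = max_steps
        self.step_count = 0
        self.fixed = False

    def reset(self) -> Dict[str, Any]:
        self.step_count = 0
        self.fixed = False
        return self.state()

    def state(self) -> Dict[str, Any]:
        return {"step_count": self.step_count, "fixed": self.fixed}

    def step(self, action: Dict[str, Any]) -> Tuple[Dict[str, Any], float, bool, Dict[str, Any]]:
        self.step_count += 1
        if action["tool"] == "edit_config":
            self.fixed = True
        # Illustrative reward only: full credit once a fixed config is submitted.
        submitted = action["tool"] == "submit_solution"
        reward = 1.0 if (submitted and self.fixed) else 0.1
        done = submitted or self.step_count >= self.max_steps
        return self.state(), reward, done, {}


def run_episode(env: ToyEnv, plan: List[Dict[str, Any]]) -> List[float]:
    """Act, collect the reward, and stop as soon as the environment reports done."""
    env.reset()
    rewards = []
    for action in plan:
        _obs, reward, done, _info = env.step(action)
        rewards.append(reward)
        if done:
            break
    return rewards


PLAN = [
    {"tool": "read_logs"},
    {"tool": "edit_config", "payload": {"raw": "replace npm tset with npm test"}},
    {"tool": "submit_solution"},
]
```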
### 3.2 Action and Observation Spaces
Observation fields include:
- `task_id`, `difficulty`, `failure_stage`, `actual_bug`
- `config`, `logs`, `error_message`
- `available_tools`, `progress_flags`
- `file_modification_count`, `hidden_test_pass_rate`, `step_count`, `last_action_error`
Action schema:
- `tool`: one of `read_file`, `read_logs`, `analyze_error`, `edit_config`, `run_pipeline_stage`, `run_tests`, `validate_fix`, `submit_solution`
- `payload`: optional dict (for example `{ "raw": "replace npm tset with npm test" }`)
Reward schema:
- `value`: bounded float in `[0.0, 1.0]`
- `components`: reward breakdown dictionary
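
To make the action and reward schemas concrete, here is a rough sketch of the two smaller models. The real models are Pydantic classes in `env/models.py`; this version uses standard-library dataclasses only to stay dependency-free, and the validation logic is an assumption rather than the project's implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

# Tool names copied from the action schema above.
TOOLS = (
    "read_file", "read_logs", "analyze_error", "edit_config",
    "run_pipeline_stage", "run_tests", "validate_fix", "submit_solution",
)


@dataclass(frozen=True)
class Action:
    tool: str
    payload: Optional[Dict[str, Any]] = None

    def __post_init__(self) -> None:
        # Reject anything outside the documented tool set.
        if self.tool not in TOOLS:
            raise ValueError(f"unknown tool: {self.tool!r}")


@dataclass(frozen=True)
class Reward:
    value: float
    components: Dict[str, float] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Enforce the documented bound on the scalar reward.
        if not 0.0 <= self.value <= 1.0:
            raise ValueError("reward value must lie in [0.0, 1.0]")
```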
## 4. Core Modules
### 4.1 Quality Judge
- File: `env/graders/llm_judge.py`
- Purpose: quality-aware scoring of fixes
- Output keys: `correctness`, `minimalism`, `quality` (all in [0, 1])
- Guarantees:
  - strict JSON parsing attempt
  - robust fallback parsing for messy output
  - no-crash behavior (safe zero scores on failure)
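
The three guarantees can be sketched as a parser that tries strict JSON first, then a regex fallback, and finally returns zeros. This is an illustration under assumptions: `parse_judge_output` and the fallback pattern are hypothetical and not taken from `env/graders/llm_judge.py`.

```python
import json
import re
from typing import Dict

KEYS = ("correctness", "minimalism", "quality")
ZERO_SCORES = {key: 0.0 for key in KEYS}


def _clamp(value: float) -> float:
    return max(0.0, min(1.0, value))


def parse_judge_output(text: str) -> Dict[str, float]:
    """Return the three scores in [0, 1]; never raise on bad input."""
    if not isinstance(text, str):
        return dict(ZERO_SCORES)
    # 1) Strict attempt: the whole reply is a JSON object with all keys.
    try:
        data = json.loads(text)
        return {key: _clamp(float(data[key])) for key in KEYS}
    except (ValueError, TypeError, KeyError, OverflowError):
        pass
    # 2) Fallback: pull `key: number` or `key = number` pairs out of messy text.
    scores = {}
    for key in KEYS:
        match = re.search(rf'{key}"?\s*[:=]\s*(-?\d+(?:\.\d+)?)', text, re.IGNORECASE)
        if match is None:
            # 3) Safe zero scores when any key cannot be recovered.
            return dict(ZERO_SCORES)
        scores[key] = _clamp(float(match.group(1)))
    return scores
```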
### 4.2 Deterministic Grader
- File: `env/graders/deterministic.py`
- Purpose: reproducible correctness scoring (0-1)
- Checks:
  - YAML validity
  - command and fix correctness
  - similarity and issue resolution
- Rules:
  - deterministic only
  - same input, same score
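
As a rough illustration of the "same input, same score" rule, the sketch below scores a fix with pure string operations. The function name, the 50/50 weighting, and the `broken_token` argument are hypothetical; the YAML validity check is left out so the example needs no third-party parser.

```python
from difflib import SequenceMatcher


def grade_fix(fixed_config: str, expected_config: str, broken_token: str) -> float:
    """Score a fix in [0, 1] using only deterministic string operations."""
    # Issue resolution: the known-bad token must no longer appear.
    resolved = 1.0 if broken_token not in fixed_config else 0.0
    # Similarity to the reference config (difflib has no randomness).
    similarity = SequenceMatcher(None, fixed_config, expected_config).ratio()
    # Hypothetical equal weighting of the two checks.
    return round(0.5 * resolved + 0.5 * similarity, 4)
```

Because nothing here depends on time, randomness, or a model call, repeated calls with the same arguments return the same score.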
### 4.3 Anti-Hacking Detector
- File: `env/anti_hacking.py`
- Purpose: detect reward-hacking and shortcut behavior
- Penalty detectors:
  - stage skipping (`if: false`, `when: never`)
  - fake success (`echo tests passed`, unsafe `exit 0` patterns)
  - pipeline breakage between versions
  - excessive edits
  - timeout abuse via too many steps
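
A few of these detectors can be approximated with regular expressions and simple counters. The sketch below is an assumption about how such checks might look, not the contents of `env/anti_hacking.py`: the patterns, flag names, and thresholds are hypothetical, and the pipeline-breakage detector is omitted because it needs two config versions.

```python
import re
from typing import List

# Hypothetical patterns for the first two detectors named above.
PATTERNS = {
    "stage_skipping": re.compile(
        r"^\s*(if:\s*false|when:\s*never)\s*$", re.MULTILINE | re.IGNORECASE
    ),
    "fake_success": re.compile(
        r"echo\s+[\"']?tests? passed|\|\|\s*exit\s+0|\|\|\s*true", re.IGNORECASE
    ),
}


def detect_hacks(config: str, edit_count: int, step_count: int,
                 max_edits: int = 5, max_steps: int = 8) -> List[str]:
    """Return the names of every detector that fires for this config and episode."""
    flags = [name for name, pattern in PATTERNS.items() if pattern.search(config)]
    if edit_count > max_edits:
        flags.append("excessive_edits")
    if step_count > max_steps:
        flags.append("timeout_abuse")
    return flags
```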
### 4.4 Hidden Tests
- File: `env/hidden_tests.py`
- Purpose: test fix robustness, not just exact-match overfitting
- Method:
  - deterministic variant generation (OS, versions, env shifts)
  - evaluate pass rate across variants
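
One way to get deterministic variants is to enumerate a fixed grid rather than sample from it. The axes and values below are hypothetical placeholders, and `check` stands in for whatever the real runner does to evaluate a fix against one variant.

```python
from itertools import product
from typing import Callable, Dict, List

# Hypothetical variant axes; the real ones live in env/hidden_tests.py.
AXES = {
    "os": ["ubuntu-latest", "macos-latest"],
    "node": ["18", "20"],
    "ci_env": ["true", "false"],
}


def generate_variants() -> List[Dict[str, str]]:
    """Enumerate the full grid in a fixed order, so every run sees the same variants."""
    names = list(AXES)
    return [dict(zip(names, combo)) for combo in product(*(AXES[n] for n in names))]


def pass_rate(check: Callable[[Dict[str, str]], bool]) -> float:
    """Fraction of variants for which the supplied check succeeds."""
    variants = generate_variants()
    return sum(1 for variant in variants if check(variant)) / len(variants)
```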
### 4.5 Reward Shaping
- File: `env/rewards.py`
- Purpose: step-level learning signal
- Components:
  - progress rewards (logs, analysis, fix proposal)
  - execution rewards (pipeline run, tests pass)
  - quality rewards (deterministic + hidden tests + LLM judge)
  - anti-hacking penalties
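
The components can be combined into one bounded value along these lines. The weights and the function name are hypothetical; the only property taken from this README is that the final value stays in `[0.0, 1.0]` and is reported together with a component breakdown.

```python
from typing import Dict

# Hypothetical weights; the real values are defined in env/rewards.py.
WEIGHTS = {"progress": 0.2, "execution": 0.3, "quality": 0.5}


def shaped_reward(components: Dict[str, float], penalty: float = 0.0) -> Dict[str, object]:
    """Weighted sum of components minus the anti-hacking penalty, clamped to [0, 1]."""
    raw = sum(WEIGHTS[name] * components.get(name, 0.0) for name in WEIGHTS)
    value = max(0.0, min(1.0, raw - penalty))
    return {"value": round(value, 4), "components": {**components, "penalty": penalty}}
```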
## 5. Inference and Evaluation
### 5.1 Prompt and Model Layers
- `inference/prompts.py`: stable prompt templates and fallback action heuristics
- `inference/model_wrapper.py`: OpenAI client action generation, candidate generation, and safe fallback
Canonical action tools used by environment and inference:
- read_file
- read_logs
- analyze_error
- edit_config
- run_pipeline_stage
- run_tests
- validate_fix
- submit_solution
### 5.2 Metrics and Artifacts
- `inference/metrics.py`: reward, success-rate, and failure reason tracking
- `inference/visualize.py`: reward curve and metrics artifact export
### 5.3 Submission-Critical Runtime
- File: `inference.py` (root)
- Responsibilities:
  - initialize model and environment
  - run step loop
  - calculate rewards
  - emit strict stdout contract
  - always emit END line
Required output format:
- `[START] task=... env=... model=...`
- `[STEP] step=<n> action=... reward=0.00 done=<true|false> error=<msg|null>`
- `[END] success=<true|false> steps=<n> score=<0.000> rewards=<r1,r2,...>`
Rules enforced:
- single-line logs only
- reward values with 2 decimals
- lowercase booleans
- no extra runtime log noise
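
The format rules above translate into small formatting helpers such as the following. This is a sketch with hypothetical function names; it assumes the entries in `rewards=` use the same two-decimal format as the per-step reward, which the format line does not spell out.

```python
from typing import List, Optional


def _flat(text: object) -> str:
    """Collapse all whitespace so a record can never span several lines."""
    return " ".join(str(text).split())


def start_line(task: str, env: str, model: str) -> str:
    return f"[START] task={_flat(task)} env={_flat(env)} model={_flat(model)}"


def step_line(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> str:
    err = _flat(error) if error else "null"
    return (f"[STEP] step={step} action={_flat(action)} reward={reward:.2f} "
            f"done={str(done).lower()} error={err}")


def end_line(success: bool, steps: int, score: float, rewards: List[float]) -> str:
    joined = ",".join(f"{r:.2f}" for r in rewards)
    return f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={joined}"
```

Calling `end_line` from a `finally` block around the step loop is one way to make sure the END line is printed even when a step raises.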
## 6. Task Coverage
The project includes 9 CI-fix tasks spanning:
- easy: syntax and typo fixes
- medium: dependency/env/cache/permissions issues
- hard: matrix logic, conditional flow, orchestration-level failures
Representative baseline tasks (one per difficulty):
- easy: `easy-command-typo` (fix invalid `npm tset` command)
- medium: `medium-python-version` (align workflow Python version)
- hard: `hard-needs-order` (repair deploy job dependency ordering)
## 7. Setup
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
Environment variables:
```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="<your_openai_compatible_api_key>"
# Optional alias; if set, this takes precedence over HF_TOKEN in inference.py
export OPENAI_API_KEY="<same_token_optional>"
# Optional, only if your inference spins environments from local images.
export LOCAL_IMAGE_NAME="<local_env_image_name>"
```
If you want to use an OpenAI access token directly:
```bash
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o-mini"
export HF_TOKEN="<your_openai_access_token>"
# Optional alias:
export OPENAI_API_KEY="<same_token_optional>"
```
## 8. Run Inference
Offline/local mode:
```bash
python inference.py --offline --force-local-env --max-steps 8 --policy-mode imp --trajectories 4
```
Model-backed mode:
```bash
python inference.py --max-steps 8 --policy-mode imp --trajectories 4
```
Run the baseline across the easy, medium, and hard tasks in one of two modes.
OpenAI client mode:
```bash
OPENAI_API_KEY="<your_openai_compatible_api_key>" python baseline_inference.py --max-steps 5 --policy-mode imp --trajectories 3 --force-local-env
```
Offline reproducible mode:
```bash
python baseline_inference.py --max-steps 5 --policy-mode imp --trajectories 3 --offline --force-local-env
```
Policy modes:
- sft: deterministic heuristic policy
- direct: single model action per step
- imp: multi-candidate generation and ranking
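
The `imp` mode's "generate several candidates, then rank" idea can be sketched as a single selection step. How the repository actually scores candidates is not described here, so `pick_action`, `prefer_edits`, and the fallback tool are hypothetical.

```python
from typing import Callable, Sequence


def pick_action(candidates: Sequence[str], score: Callable[[str], float],
                fallback: str = "read_logs") -> str:
    """Return the highest-scoring candidate, or a safe fallback when there are none."""
    if not candidates:
        return fallback
    # max() keeps the first of equally scored candidates, so ties resolve deterministically.
    return max(candidates, key=score)


def prefer_edits(action: str) -> float:
    """Toy scorer that ranks edit_config above every other tool."""
    return 1.0 if action == "edit_config" else 0.0
```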
## 9. Baseline Scores
Reproducible baseline artifact:
- `artifacts/baseline_scores.json`
Latest baseline run (`max_steps=5`, `policy_mode=imp`, `trajectories=3`):
| Task ID | Difficulty | Score | Success |
|---|---|---:|---:|
| easy-command-typo | easy | 0.541 | false |
| medium-python-version | medium | 0.679 | false |
| hard-needs-order | hard | 0.513 | false |
Aggregate:
- average score: `0.578`
- success rate: `0.000`
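
The aggregate figures follow directly from the table; the short check below recomputes them from the three rows (variable names are illustrative).

```python
# Scores and success flags copied from the baseline table above.
scores = {
    "easy-command-typo": 0.541,
    "medium-python-version": 0.679,
    "hard-needs-order": 0.513,
}
successes = {task: False for task in scores}

average_score = round(sum(scores.values()) / len(scores), 3)
success_rate = round(sum(successes.values()) / len(successes), 3)
```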
When `OPENAI_API_KEY` is set, the same baseline script runs through the OpenAI API client path in `inference.py`.
## 10. Tests
Run all tests:
```bash
python -m unittest discover -s tests -v
```
Coverage includes:
- LLM judge
- deterministic grader
- anti-hacking detectors
- hidden tests
- reward system
- end-to-end inference output format
## 11. Validation and Submission
OpenEnv validation:
```bash
python -m openenv.cli.__main__ validate
```
Pre-submission script:
```bash
./validate-submission.sh <your_hf_space_url>
```
Required environment variables:
```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export OPENAI_API_KEY="<your_openai_compatible_api_key>"
# Optional fallback:
export HF_TOKEN="<your_token>"
```
Docker run (Space/API mode):
```bash
docker build -t cicd-debugger-env .
docker run --rm -p 7860:7860 cicd-debugger-env
```
Server endpoints used by validators:
- `POST /reset`
- `POST /step`
- `GET /state`
- `GET /health`
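
For a quick local check against the Docker container, the endpoints can be called from Python's standard library. This is a sketch under assumptions: the base URL assumes the container from the `docker run` command above, responses are assumed to be JSON, and the `/step` body shape (the action object sent directly) is a guess that should be checked against `server/app.py`.

```python
import json
from typing import Any, Dict, Optional
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:7860"  # port taken from the Docker run command


def build_request(method: str, path: str, body: Optional[Dict[str, Any]] = None) -> Request:
    """Build a request; a JSON body is attached only when one is given."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return Request(BASE_URL + path, data=data, headers=headers, method=method)


def call(request: Request, timeout: float = 10.0) -> Any:
    """Send the request and decode the response as JSON (needs a running server)."""
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


health_req = build_request("GET", "/health")
reset_req = build_request("POST", "/reset", {})
# Assumption: /step accepts the action object directly as its JSON body.
step_req = build_request("POST", "/step", {"tool": "read_logs", "payload": {}})
```

With the server running, `call(health_req)` followed by `call(reset_req)` mirrors the two `curl` checks used for a deployed Space.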
## 12. Deploy to Hugging Face Space (OpenAI Token)
This repository is already configured for Docker Spaces (`sdk: docker` in this README front matter).
1. Create a new Hugging Face Space with SDK set to `Docker`.
2. Push this repository to the Space git remote.
3. In Space Settings -> Variables and secrets, add these Secrets:
```text
OPENAI_API_KEY=<your_openai_access_token>
API_BASE_URL=https://api.openai.com/v1
MODEL_NAME=gpt-4o-mini
```
4. Optional Secrets:
```text
HF_TOKEN=<optional_fallback_token>
OFFLINE_INFERENCE=0
MAX_STEPS=8
TEMPERATURE=0.2
MAX_TOKENS=120
```
5. Keep the app port as `7860` (already configured).
6. Wait for build completion, then verify:
```bash
curl -sS https://<your-space-name>.hf.space/health
curl -sS -X POST https://<your-space-name>.hf.space/reset -H 'Content-Type: application/json' -d '{}'
```
Notes:
- `.env.example` is for local development reference only. Hugging Face Spaces use Secrets/Variables from Space Settings.
- Runtime code reads `OPENAI_API_KEY` first and falls back to `HF_TOKEN` when `OPENAI_API_KEY` is not provided.
## 13. One-line Presentation Summary
We built an OpenEnv-compliant reinforcement learning environment where AI agents learn to debug real CI/CD pipelines using multi-step reasoning, hybrid grading, anti-hacking safeguards, and robust reward shaping.