---
title: CICD_DEBUGGER
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
tags:
  - openenv
---

# CI/CD Pipeline Debugger Environment (OpenEnv)

## 1. Project Goal

This repository implements an AI training and evaluation environment where an agent learns to debug broken CI/CD pipelines automatically.

The environment targets real-world DevOps failure patterns, including:

- YAML syntax and structure issues
- Incorrect build/test commands (for example, `npm tset` -> `npm test`)
- Dependency and setup failures
- Multi-stage pipeline execution errors

This is designed as an RL-style interaction loop:

Observe -> Think -> Act -> Get Reward -> Repeat

## 2. Why This Matters

CI/CD failures are common, repetitive, and often multi-step to resolve. This project turns that workflow into a structured learning environment where agents:

- Read failure context
- Reason about root causes
- Propose and apply fixes
- Get shaped rewards for robust behavior

## 3. System Architecture

High-level flow:

Agent (LLM) -> Action -> Environment.step() -> Reward/Evaluation -> Next step

Core integration path:

Model -> Action -> Environment.step() -> RewardCalculator

RewardCalculator integrates:

- DeterministicGrader
- LLMJudge
- HiddenTestRunner
- AntiHackingDetector

### 3.1 OpenEnv Interface (Typed)

Typed Pydantic models are defined in `env/models.py`:

- `Observation`: strict schema for environment observations
- `Action`: normalized tool + payload action schema
- `Reward`: bounded reward model with components

Environment contract:

- `reset()` returns the initial `Observation` payload
- `step(action)` returns `(observation, reward, done, info)`
- `state()` returns the current environment state snapshot

Server/API contract models are exposed in `server/app.py` and use the same typed observation/action/reward structures.
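The typed contract above can be sketched as follows. This is an illustration only — the real models in `env/models.py` use Pydantic, and the dataclass stand-ins here are assumptions chosen to keep the sketch dependency-free.

```python
# Dependency-free sketch of the typed contract described above.
# The real definitions in env/models.py use Pydantic; these dataclass
# stand-ins are illustrative assumptions, not the project's actual code.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Action:
    tool: str                       # e.g. "edit_config"
    payload: Optional[dict] = None  # e.g. {"raw": "replace npm tset with npm test"}

@dataclass
class Reward:
    value: float = 0.0              # bounded reward in [0.0, 1.0]
    components: dict = field(default_factory=dict)

    def __post_init__(self):
        # Enforce the bounded-reward guarantee at construction time.
        if not 0.0 <= self.value <= 1.0:
            raise ValueError("reward value must lie in [0.0, 1.0]")

fix = Action(tool="edit_config", payload={"raw": "replace npm tset with npm test"})
reward = Reward(value=0.75, components={"progress": 0.25, "execution": 0.50})
print(fix.tool, reward.value)
```

The bound check in `__post_init__` mirrors the "bounded float in `[0.0, 1.0]`" guarantee that the reward schema states below.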
### 3.2 Action and Observation Spaces

Observation fields include:

- `task_id`, `difficulty`, `failure_stage`, `actual_bug`
- `config`, `logs`, `error_message`
- `available_tools`, `progress_flags`
- `file_modification_count`, `hidden_test_pass_rate`, `step_count`, `last_action_error`

Action schema:

- `tool`: one of `read_file`, `read_logs`, `analyze_error`, `edit_config`, `run_pipeline_stage`, `run_tests`, `validate_fix`, `submit_solution`
- `payload`: optional dict (for example `{ "raw": "replace npm tset with npm test" }`)

Reward schema:

- `value`: bounded float in `[0.0, 1.0]`
- `components`: reward breakdown dictionary

## 4. Core Modules

### 4.1 Quality Judge

- File: `env/graders/llm_judge.py`
- Purpose: quality-aware scoring of fixes
- Output keys: `correctness`, `minimalism`, `quality` (all in `[0, 1]`)
- Guarantees:
  - strict JSON parsing attempt
  - robust fallback parsing for messy output
  - no-crash behavior (safe zero scores on failure)

### 4.2 Deterministic Grader

- File: `env/graders/deterministic.py`
- Purpose: reproducible correctness scoring (0-1)
- Checks:
  - YAML validity
  - command and fix correctness
  - similarity and issue resolution
- Rules:
  - deterministic only
  - same input, same score

### 4.3 Anti-Hacking Detector

- File: `env/anti_hacking.py`
- Purpose: detect reward-hacking and shortcut behavior
- Penalty detectors:
  - stage skipping (`if: false`, `when: never`)
  - fake success (`echo tests passed`, unsafe `exit 0` patterns)
  - pipeline breakage between versions
  - excessive edits
  - timeout abuse via too many steps

### 4.4 Hidden Tests

- File: `env/hidden_tests.py`
- Purpose: test fix robustness, not just exact-match overfitting
- Method:
  - deterministic variant generation (OS, versions, env shifts)
  - evaluate pass rate across variants

### 4.5 Reward Shaping

- File: `env/rewards.py`
- Purpose: step-level learning signal
- Components:
  - progress rewards (logs, analysis, fix proposal)
  - execution rewards (pipeline run, tests pass)
  - quality rewards (deterministic + hidden tests + LLM judge)
  - anti-hacking penalties
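To make the anti-hacking idea concrete, here is a minimal sketch of pattern-based shortcut detection. The pattern list and function name are assumptions for illustration; the real detectors in `env/anti_hacking.py` also track edit counts, step budgets, and cross-version pipeline breakage.

```python
import re

# Minimal sketch of pattern-based shortcut detection. Illustrative only:
# the patterns and function name are assumptions, not env/anti_hacking.py.
SHORTCUT_PATTERNS = {
    "stage_skipping": r"if:\s*false|when:\s*never",
    "fake_success":   r"echo\s+['\"]?tests?\s+passed",
    "unsafe_exit":    r"\|\|\s*exit\s+0",
}

def detect_shortcuts(config_text: str) -> list[str]:
    """Return the names of shortcut patterns found in a proposed config."""
    return [name for name, pattern in SHORTCUT_PATTERNS.items()
            if re.search(pattern, config_text, re.IGNORECASE)]

print(detect_shortcuts("run: npm test || exit 0"))  # flags the unsafe exit
print(detect_shortcuts("run: npm test"))            # a clean fix: no flags
```

Any hit would translate into a penalty term in the reward components, which is how shortcut behavior stays unprofitable for the agent.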
## 5. Inference and Evaluation

### 5.1 Prompt and Model Layers

- `inference/prompts.py`: stable prompt templates and fallback action heuristics
- `inference/model_wrapper.py`: OpenAI client action generation, candidate generation, and safe fallback

Canonical action tools used by environment and inference:

- `read_file`
- `read_logs`
- `analyze_error`
- `edit_config`
- `run_pipeline_stage`
- `run_tests`
- `validate_fix`
- `submit_solution`

### 5.2 Metrics and Artifacts

- `inference/metrics.py`: reward, success-rate, and failure reason tracking
- `inference/visualize.py`: reward curve and metrics artifact export

### 5.3 Submission-Critical Runtime

- File: `inference.py` (root)
- Responsibilities:
  - initialize model and environment
  - run the step loop
  - calculate rewards
  - emit the strict stdout contract
  - always emit the END line

Required output format:

- `[START] task=... env=... model=...`
- `[STEP] step= action=... reward=0.00 done= error=`
- `[END] success= steps= score=<0.000> rewards=`

Rules enforced:

- single-line logs only
- reward values with 2 decimals
- lowercase booleans
- no extra runtime log noise

## 6. Task Coverage

The project includes 9 CI-fix tasks spanning:

- easy: syntax and typo fixes
- medium: dependency/env/cache/permissions issues
- hard: matrix logic, conditional flow, orchestration-level failures

Representative baseline tasks (one per difficulty):

- easy: `easy-command-typo` (fix the invalid `npm tset` command)
- medium: `medium-python-version` (align the workflow Python version)
- hard: `hard-needs-order` (repair deploy job dependency ordering)

## 7. Setup
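The log-format rules above can be sketched as a small formatting helper. The function is hypothetical — the canonical emitter lives in `inference.py` — but it shows the single-line, 2-decimal-reward, lowercase-boolean conventions in one place.

```python
# Hypothetical helper illustrating the stdout contract above; the canonical
# emitter is inference.py, and this sketch only mirrors its formatting rules.
def step_line(step: int, action: str, reward: float, done: bool, error: str = "") -> str:
    # Single line, reward with 2 decimals, lowercase booleans.
    return (f"[STEP] step={step} action={action} "
            f"reward={reward:.2f} done={str(done).lower()} error={error}")

line = step_line(1, "read_logs", 0.1, False)
print(line)  # [STEP] step=1 action=read_logs reward=0.10 done=false error=
```

Keeping the contract in one helper makes it easy to unit-test the format independently of any model call.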
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Environment variables:

```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN=""          # Hugging Face token
export OPENAI_API_KEY=""    # Optional alias; if set, this takes precedence over HF_TOKEN in inference.py
export LOCAL_IMAGE_NAME=""  # Optional, only if your inference spins up environments from local images
```

If you want to use an OpenAI access token directly:

```bash
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o-mini"
export OPENAI_API_KEY=""
export HF_TOKEN=""          # Optional alias
```

## 8. Run Inference

Offline/local mode:

```bash
python inference.py --offline --force-local-env --max-steps 8 --policy-mode imp --trajectories 4
```

Model-backed mode:

```bash
python inference.py --max-steps 8 --policy-mode imp --trajectories 4
```

Run the baseline across easy/medium/hard tasks.

OpenAI client mode:

```bash
OPENAI_API_KEY="" python baseline_inference.py --max-steps 5 --policy-mode imp --trajectories 3 --force-local-env
```

Offline reproducible mode:

```bash
python baseline_inference.py --max-steps 5 --policy-mode imp --trajectories 3 --offline --force-local-env
```

Policy modes:

- `sft`: deterministic heuristic policy
- `direct`: single model action per step
- `imp`: multi-candidate generation and ranking

## 9. Baseline Scores

Reproducible baseline artifact:

- `artifacts/baseline_scores.json`

Latest baseline run (`max_steps=5`, `policy_mode=imp`, `trajectories=3`):

| Task ID | Difficulty | Score | Success |
|---|---|---:|---:|
| easy-command-typo | easy | 0.541 | false |
| medium-python-version | medium | 0.679 | false |
| hard-needs-order | hard | 0.513 | false |

Aggregate:

- average score: `0.578`
- success rate: `0.000`

When `OPENAI_API_KEY` is provided, the same script runs with the OpenAI API client path in `inference.py`.

## 10. Tests
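As a sanity check, the aggregate numbers can be recomputed from the per-task scores. The JSON layout below is an assumption made for illustration — check `artifacts/baseline_scores.json` for the real schema.

```python
import json

# Recompute the aggregate from the per-task baseline scores shown above.
# The record layout here is an assumption; the real artifact is
# artifacts/baseline_scores.json and may use a different schema.
records = json.loads("""
[
  {"task_id": "easy-command-typo",     "difficulty": "easy",   "score": 0.541, "success": false},
  {"task_id": "medium-python-version", "difficulty": "medium", "score": 0.679, "success": false},
  {"task_id": "hard-needs-order",      "difficulty": "hard",   "score": 0.513, "success": false}
]
""")

average_score = sum(r["score"] for r in records) / len(records)
success_rate = sum(r["success"] for r in records) / len(records)
print(f"average score: {average_score:.3f}")  # average score: 0.578
print(f"success rate: {success_rate:.3f}")    # success rate: 0.000
```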
Run all tests:

```bash
python -m unittest discover -s tests -v
```

Coverage includes:

- LLM judge
- deterministic grader
- anti-hacking detectors
- hidden tests
- reward system
- end-to-end inference output format

## 11. Validation and Submission

OpenEnv validation:

```bash
python -m openenv.cli.__main__ validate
```

Pre-submission script:

```bash
./validate-submission.sh
```

Required environment variables:

```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export OPENAI_API_KEY=""
export HF_TOKEN=""          # Optional fallback
```

Docker run (Space/API mode):

```bash
docker build -t cicd-debugger-env .
docker run --rm -p 7860:7860 cicd-debugger-env
```

Server endpoints used by validators:

- `POST /reset`
- `POST /step`
- `GET /state`
- `GET /health`

## 12. Deploy to Hugging Face Space (OpenAI Token)

This repository is already configured for Docker Spaces (`sdk: docker` in this README's front matter).

1. Create a new Hugging Face Space with the SDK set to `Docker`.
2. Push this repository to the Space git remote.
3. In Space Settings -> Variables and secrets, add these Secrets:

   ```text
   OPENAI_API_KEY=
   API_BASE_URL=https://api.openai.com/v1
   MODEL_NAME=gpt-4o-mini
   ```

4. Optional Secrets:

   ```text
   HF_TOKEN=
   OFFLINE_INFERENCE=0
   MAX_STEPS=8
   TEMPERATURE=0.2
   MAX_TOKENS=120
   ```

5. Keep the app port as `7860` (already configured).
6. Wait for the build to complete, then verify:

   ```bash
   curl -sS https://<your-space>.hf.space/health
   curl -sS -X POST https://<your-space>.hf.space/reset -H 'Content-Type: application/json' -d '{}'
   ```

Notes:

- `.env.example` is for local development reference only. Hugging Face Spaces read Secrets/Variables from Space Settings.
- Runtime code reads `OPENAI_API_KEY` first and falls back to `HF_TOKEN` when `OPENAI_API_KEY` is not provided.
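A pre-submission check of the kind `validate-submission.sh` might run can be sketched as a regex over the final `[END]` line. The exact regex is an assumption derived from the output contract in section 5.3 (lowercase booleans, 3-decimal score), not the script's actual implementation.

```python
import re

# Sketch of an END-line contract check; the regex is an assumption derived
# from the stdout contract (lowercase booleans, 3-decimal score).
END_LINE = re.compile(
    r"^\[END\] success=(true|false) steps=\d+ score=\d\.\d{3} rewards=.*$"
)

def valid_end_line(line: str) -> bool:
    """Return True if a line satisfies the [END] output contract."""
    return END_LINE.match(line) is not None

print(valid_end_line("[END] success=false steps=5 score=0.578 rewards=[0.10, 0.25]"))  # True
print(valid_end_line("[END] success=False steps=5 score=0.58 rewards=[]"))             # False
```

Running a check like this locally before pushing catches format drift (uppercase booleans, wrong decimal counts) without waiting for the validator.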
## 13. One-line Presentation Summary

We built an OpenEnv-compliant reinforcement learning environment where AI agents learn to debug real CI/CD pipelines using multi-step reasoning, hybrid grading, anti-hacking safeguards, and robust reward shaping.