Spaces:

Lishika
/

cicd-debugger-env

Sleeping

App Files Files Community

cicd-debugger-env / README.md

Lishika

clean final submission

30bf68a 9 days ago

preview code

raw

history blame contribute delete

9.89 kB

	---
	title: CICD_DEBUGGER
	colorFrom: blue
	colorTo: green
	sdk: docker
	app_port: 7860
	pinned: false
	tags:
	- openenv
	---

	# CI/CD Pipeline Debugger Environment (OpenEnv)

	## 1. Project Goal

	This repository implements an AI training and evaluation environment where an agent learns to debug broken CI/CD pipelines automatically.

	The environment targets real-world DevOps failure patterns, including:

	- YAML syntax and structure issues
	- Incorrect build/test commands (for example, npm tset -> npm test)
	- Dependency and setup failures
	- Multi-stage pipeline execution errors

	This is designed as an RL-style interaction loop:

	Observe -> Think -> Act -> Get Reward -> Repeat

	## 2. Why This Matters

	CI/CD failures are common, repetitive, and often multi-step to resolve. This project turns that workflow into a structured learning environment where agents:

	- Read failure context
	- Reason about root causes
	- Propose and apply fixes
	- Get shaped rewards for robust behavior

	## 3. System Architecture

	High-level flow:

	Agent (LLM) -> Action -> Environment.step() -> Reward/Evaluation -> Next step

	Core integration path:

	Model -> Action -> Environment.step() -> RewardCalculator

	RewardCalculator integrates:

	- DeterministicGrader
	- LLMJudge
	- HiddenTestRunner
	- AntiHackingDetector

	### 3.1 OpenEnv Interface (Typed)

	Typed Pydantic models are defined in `env/models.py`:

	- `Observation`: strict schema for environment observations
	- `Action`: normalized tool + payload action schema
	- `Reward`: bounded reward model with components

	Environment contract:

	- `reset()` returns the initial `Observation` payload
	- `step(action)` returns `(observation, reward, done, info)`
	- `state()` returns current environment state snapshot

	Server/API contract models are exposed in `server/app.py` and use the same typed observation/action/reward structures.

	### 3.2 Action and Observation Spaces

	Observation fields include:

	- `task_id`, `difficulty`, `failure_stage`, `actual_bug`
	- `config`, `logs`, `error_message`
	- `available_tools`, `progress_flags`
	- `file_modification_count`, `hidden_test_pass_rate`, `step_count`, `last_action_error`

	Action schema:

	- `tool`: one of `read_file`, `read_logs`, `analyze_error`, `edit_config`, `run_pipeline_stage`, `run_tests`, `validate_fix`, `submit_solution`
	- `payload`: optional dict (for example `{ "raw": "replace npm tset with npm test" }`)

	Reward schema:

	- `value`: bounded float in `[0.0, 1.0]`
	- `components`: reward breakdown dictionary

	## 4. Core Modules

	### 4.1 Quality Judge

	- File: env/graders/llm_judge.py
	- Purpose: quality-aware scoring of fixes
	- Output keys: correctness, minimalism, quality (all in [0,1])
	- Guarantees:
	- strict JSON parsing attempt
	- robust fallback parsing for messy output
	- no-crash behavior (safe zero scores on failure)

	### 4.2 Deterministic Grader

	- File: env/graders/deterministic.py
	- Purpose: reproducible correctness scoring (0-1)
	- Checks:
	- YAML validity
	- command and fix correctness
	- similarity and issue resolution
	- Rules:
	- deterministic only
	- same input, same score

	### 4.3 Anti-Hacking Detector

	- File: env/anti_hacking.py
	- Purpose: detect reward-hacking and shortcut behavior
	- Penalty detectors:
	- stage skipping (if: false, when: never)
	- fake success (echo tests passed, unsafe exit 0 patterns)
	- pipeline breakage between versions
	- excessive edits
	- timeout abuse via too many steps

	### 4.4 Hidden Tests

	- File: env/hidden_tests.py
	- Purpose: test fix robustness, not just exact-match overfitting
	- Method:
	- deterministic variant generation (OS, versions, env shifts)
	- evaluate pass rate across variants

	### 4.5 Reward Shaping

	- File: env/rewards.py
	- Purpose: step-level learning signal
	- Components:
	- progress rewards (logs, analysis, fix proposal)
	- execution rewards (pipeline run, tests pass)
	- quality rewards (deterministic + hidden tests + LLM judge)
	- anti-hacking penalties

	## 5. Inference and Evaluation

	### 5.1 Prompt and Model Layers

	- inference/prompts.py: stable prompt templates and fallback action heuristics
	- inference/model_wrapper.py: OpenAI client action generation, candidate generation, and safe fallback

	Canonical action tools used by environment and inference:

	- read_file
	- read_logs
	- analyze_error
	- edit_config
	- run_pipeline_stage
	- run_tests
	- validate_fix
	- submit_solution

	### 5.2 Metrics and Artifacts

	- inference/metrics.py: reward, success-rate, and failure reason tracking
	- inference/visualize.py: reward curve and metrics artifact export

	### 5.3 Submission-Critical Runtime

	- File: inference.py (root)
	- Responsibilities:
	- initialize model and environment
	- run step loop
	- calculate rewards
	- emit strict stdout contract
	- always emit END line

	Required output format:

	- [START] task=... env=... model=...
	- [STEP] step=<n> action=... reward=0.00 done=<true\|false> error=<msg\|null>
	- [END] success=<true\|false> steps=<n> score=<0.000> rewards=<r1,r2,...>

	Rules enforced:

	- single-line logs only
	- reward values with 2 decimals
	- lowercase booleans
	- no extra runtime log noise

	## 6. Task Coverage

	The project includes 9 CI-fix tasks spanning:

	- easy: syntax and typo fixes
	- medium: dependency/env/cache/permissions issues
	- hard: matrix logic, conditional flow, orchestration-level failures

	Representative baseline tasks (one per difficulty):

	- easy: `easy-command-typo` (fix invalid `npm tset` command)
	- medium: `medium-python-version` (align workflow Python version)
	- hard: `hard-needs-order` (repair deploy job dependency ordering)

	## 7. Setup

	```bash
	python3 -m venv .venv
	source .venv/bin/activate
	pip install -r requirements.txt
	```

	Environment variables:

	```bash
	export API_BASE_URL="https://router.huggingface.co/v1"
	export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
	export HF_TOKEN="<your_openai_compatible_api_key>"
	# Optional alias; if set, this takes precedence over HF_TOKEN in inference.py
	export OPENAI_API_KEY="<same_token_optional>"
	# Optional, only if your inference spins environments from local images.
	export LOCAL_IMAGE_NAME="<local_env_image_name>"
	```

	If you want to use an OpenAI access token directly:

	```bash
	export API_BASE_URL="https://api.openai.com/v1"
	export MODEL_NAME="gpt-4o-mini"
	export HF_TOKEN="<your_openai_access_token>"
	# Optional alias:
	export OPENAI_API_KEY="<same_token_optional>"
	```

	## 8. Run Inference

	Offline/local mode:

	```bash
	python inference.py --offline --force-local-env --max-steps 8 --policy-mode imp --trajectories 4
	```

	Model-backed mode:

	```bash
	python inference.py --max-steps 8 --policy-mode imp --trajectories 4
	```

	Run baseline across easy/medium/hard tasks:

	OpenAI client mode:

	```bash
	OPENAI_API_KEY="<your_openai_compatible_api_key>" python baseline_inference.py --max-steps 5 --policy-mode imp --trajectories 3 --force-local-env
	```

	Offline reproducible mode:

	```bash
	python baseline_inference.py --max-steps 5 --policy-mode imp --trajectories 3 --offline --force-local-env
	```

	Policy modes:

	- sft: deterministic heuristic policy
	- direct: single model action per step
	- imp: multi-candidate generation and ranking

	## 9. Baseline Scores

	Reproducible baseline artifact:

	- `artifacts/baseline_scores.json`

	Latest baseline run (`max_steps=5`, `policy_mode=imp`, `trajectories=3`):

	\| Task ID \| Difficulty \| Score \| Success \|
	\|---\|---\|---:\|---:\|
	\| easy-command-typo \| easy \| 0.541 \| false \|
	\| medium-python-version \| medium \| 0.679 \| false \|
	\| hard-needs-order \| hard \| 0.513 \| false \|

	Aggregate:

	- average score: `0.578`
	- success rate: `0.000`

	When `OPENAI_API_KEY` is provided, the same script runs with the OpenAI API client path in `inference.py`.

	## 10. Tests

	Run all tests:

	```bash
	python -m unittest discover -s tests -v
	```

	Coverage includes:

	- LLM judge
	- deterministic grader
	- anti-hacking detectors
	- hidden tests
	- reward system
	- end-to-end inference output format

	## 11. Validation and Submission

	OpenEnv validation:

	```bash
	python -m openenv.cli.__main__ validate
	```

	Pre-submission script:

	```bash
	./validate-submission.sh <your_hf_space_url>
	```

	Required environment variables:

	```bash
	export API_BASE_URL="https://router.huggingface.co/v1"
	export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
	export OPENAI_API_KEY="<your_openai_compatible_api_key>"
	# Optional fallback:
	export HF_TOKEN="<your_token>"
	```

	Docker run (Space/API mode):

	```bash
	docker build -t cicd-debugger-env .
	docker run --rm -p 7860:7860 cicd-debugger-env
	```

	Server endpoints used by validators:

	- `POST /reset`
	- `POST /step`
	- `GET /state`
	- `GET /health`

	## 12. Deploy to Hugging Face Space (OpenAI Token)

	This repository is already configured for Docker Spaces (`sdk: docker` in this README front matter).

	1. Create a new Hugging Face Space with SDK set to `Docker`.
	2. Push this repository to the Space git remote.
	3. In Space Settings -> Variables and secrets, add these Secrets:

	```text
	OPENAI_API_KEY=<your_openai_access_token>
	API_BASE_URL=https://api.openai.com/v1
	MODEL_NAME=gpt-4o-mini
	```

	4. Optional Secrets:

	```text
	HF_TOKEN=<optional_fallback_token>
	OFFLINE_INFERENCE=0
	MAX_STEPS=8
	TEMPERATURE=0.2
	MAX_TOKENS=120
	```

	5. Keep the app port as `7860` (already configured).
	6. Wait for build completion, then verify:

	```bash
	curl -sS https://<your-space-name>.hf.space/health
	curl -sS -X POST https://<your-space-name>.hf.space/reset -H 'Content-Type: application/json' -d '{}'
	```

	Notes:

	- `.env.example` is for local development reference only. Hugging Face Spaces use Secrets/Variables from Space Settings.
	- Runtime code reads `OPENAI_API_KEY` first and falls back to `HF_TOKEN` when `OPENAI_API_KEY` is not provided.

	## 13. One-line Presentation Summary

	We built an OpenEnv-compliant reinforcement learning environment where AI agents learn to debug real CI/CD pipelines using multi-step reasoning, hybrid grading, anti-hacking safeguards, and robust reward shaping.