spec_version: 1
name: CI_CD_Doctor
type: space
runtime: fastapi
app: server.app:app
port: 8000
tasks:
  - id: task_easy
    difficulty: easy
    max_steps: 10
    description: "Single-file failure: one required package is missing from requirements.txt. The error shows a ModuleNotFoundError; the agent must cross-reference app.py imports with requirements.txt to identify and add the missing package."
    grader:
      type: programmatic
      module: core.grading.grader:grade
      success_threshold: 0.70
      prompt_template: |
        Score the agent's CI/CD debugging session from 0.01 to 0.99.
        Task: fix a single missing package in requirements.txt so the install stage passes.
        Full credit (0.99) requires the pipeline to reach `passed` state.
        Partial credit is awarded per correct fix landed in the target file.
        Penalize blind edits (editing without reading), redundant reads,
        idle pipeline re-runs, and exceeding the ideal step budget.
        Bonus for correct diagnosis and cross-file reasoning.
  - id: task_medium
    difficulty: medium
    max_steps: 15
    description: "Two-file cascading failure (one of four randomized variants). Each variant pairs two bugs across files such as Dockerfile (wrong Python version or port), .env.ci (missing required env var), requirements.txt (missing package), deploy_config.yml (deploy_enabled false), Makefile (wrong test command), or service.yaml (wrong port). Bugs surface one at a time across the pipeline stages; fix both to make it pass."
    grader:
      type: programmatic
      module: core.grading.grader:grade
      success_threshold: 0.60
      prompt_template: |
        Score the agent's CI/CD debugging session from 0.01 to 0.99.
        Task: diagnose and fix TWO bugs spread across two files in one of four
        structurally distinct medium variants. Error messages show symptoms, not
        solutions; the agent must reason about root causes. Each pipeline run
        only reveals the next failing stage.
        Full credit (0.99) requires the pipeline to reach `passed` state.
        Award partial credit per fix landed in the correct file.
        Penalize blind edits, redundant reads, idle pipeline re-runs,
        and exceeding the ideal step budget.
        Bonus for correct diagnosis and cross-file reasoning.
  - id: task_hard
    difficulty: hard
    max_steps: 25
    description: "Three-file cascading failure (one of four randomized variants). Each variant chains three bugs across files such as ci.yml (wrong stage order), Dockerfile (alpine base incompatible with binary wheels), requirements.txt (missing package or wrong version pin like numpy==1.21), .env.ci (missing env var), Makefile (wrong test command), deploy_config.yml (deploy_enabled false), or service.yaml (wrong port). Each pipeline run stops at the first failing stage, so bugs reveal themselves one at a time; fix all three in sequence."
    grader:
      type: programmatic
      module: core.grading.grader:grade
      success_threshold: 0.45
      prompt_template: |
        Score the agent's CI/CD debugging session from 0.01 to 0.99.
        Task: diagnose and fix THREE cascading bugs across three files in one of
        four hard variants. Bugs surface one at a time as earlier stages pass.
        Error messages describe symptoms (e.g. gcc not found, dependency
        resolver conflict); the agent must reason about causes.
        Full credit (0.99) requires the pipeline to reach `passed` state.
        Award partial credit per fix landed in the correct file.
        Penalize blind edits, redundant reads, idle pipeline re-runs,
        and exceeding the ideal step budget.
        Bonus for correct diagnosis and cross-file reasoning.
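
The scoring rules stated in the three prompt templates (full credit only when the pipeline passes, partial credit per landed fix, penalties for wasteful actions, a score range of 0.01 to 0.99, and a per-task success threshold) can be sketched roughly as follows. This is an illustration only and not the code behind `core.grading.grader:grade`: the `SessionSummary` fields, the 0.6 cap on partial credit, and the 0.05 penalty weight are assumptions chosen for the example, and the bonus for diagnosis is left out.

```python
from dataclasses import dataclass

# Thresholds copied from the task definitions in this spec.
SUCCESS_THRESHOLDS = {"task_easy": 0.70, "task_medium": 0.60, "task_hard": 0.45}

# Assumed values for illustration; the spec does not state them.
PARTIAL_CREDIT_CAP = 0.6
PENALTY_PER_EVENT = 0.05


@dataclass
class SessionSummary:
    """Hypothetical summary of one debugging session (not the real grader input)."""
    task_id: str
    pipeline_passed: bool
    fixes_landed: int
    total_fixes: int
    blind_edits: int = 0
    redundant_reads: int = 0
    idle_reruns: int = 0
    steps_over_budget: int = 0


def sketch_score(s: SessionSummary) -> float:
    """Return a score clamped to the 0.01..0.99 range used by the templates."""
    if s.pipeline_passed:
        base = 0.99  # full credit requires the `passed` state
    else:
        # Partial credit per fix landed, scaled by the assumed cap.
        base = PARTIAL_CREDIT_CAP * s.fixes_landed / s.total_fixes
    wasteful = s.blind_edits + s.redundant_reads + s.idle_reruns + s.steps_over_budget
    raw = base - PENALTY_PER_EVENT * wasteful
    return round(min(0.99, max(0.01, raw)), 4)


def is_success(s: SessionSummary) -> bool:
    """Compare the sketched score against the task's success_threshold."""
    return sketch_score(s) >= SUCCESS_THRESHOLDS[s.task_id]
```

Under these assumed weights, a `task_hard` session that lands two of three fixes without reaching `passed` scores 0.4 and falls below the 0.45 threshold, while any session that passes with no penalties scores 0.99.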