spec_version: 1 name: CI_CD_Doctor type: space runtime: fastapi app: server.app:app port: 8000 tasks: - id: task_easy difficulty: easy max_steps: 10 description: "Single-file failure: one required package is missing from requirements.txt. The error shows a ModuleNotFoundError — the agent must cross-reference app.py imports with requirements.txt to identify and add the missing package." grader: type: programmatic module: core.grading.grader:grade success_threshold: 0.70 prompt_template: | Score the agent's CI/CD debugging session from 0.01 to 0.99 Task: fix a single missing package in requirements.txt so the install stage passes. Full credit (0.99) requires the pipeline to reach `passed` state. Partial credit is awarded per correct fix landed in the target file. Penalize blind edits (editing without reading), redundant reads, idle pipeline re-runs, and exceeding the ideal step budget. Bonus for correct diagnosis and cross-file reasoning. - id: task_medium difficulty: medium max_steps: 15 description: "Two-file cascading failure (one of four randomized variants). Each variant pairs two bugs across files such as Dockerfile (wrong Python version or port), .env.ci (missing required env var), requirements.txt (missing package), deploy_config.yml (deploy_enabled false), Makefile (wrong test command), or service.yaml (wrong port). Bugs surface one at a time across the pipeline stages — fix both to make it pass." grader: type: programmatic module: core.grading.grader:grade success_threshold: 0.60 prompt_template: | Score the agent's CI/CD debugging session from 0.01 to 0.99 Task: diagnose and fix TWO bugs spread across two files in one of four structurally distinct medium variants. Error messages show symptoms, not solutions — the agent must reason about root causes. Each pipeline run only reveals the next failing stage. Full credit (0.99) requires the pipeline to reach `passed` state. Award partial credit per fix landed in the correct file. Penalize blind edits, redundant reads, idle pipeline re-runs, and exceeding the ideal step budget. Bonus for correct diagnosis and cross-file reasoning. - id: task_hard difficulty: hard max_steps: 25 description: "Three-file cascading failure (one of four randomized variants). Each variant chains three bugs across files such as ci.yml (wrong stage order), Dockerfile (alpine base incompatible with binary wheels), requirements.txt (missing package or wrong version pin like numpy==1.21), .env.ci (missing env var), Makefile (wrong test command), deploy_config.yml (deploy_enabled false), or service.yaml (wrong port). Each pipeline run stops at the first failing stage, so bugs reveal themselves one at a time — fix all three in sequence." grader: type: programmatic module: core.grading.grader:grade success_threshold: 0.45 prompt_template: | Score the agent's CI/CD debugging session from 0.01 to 0.99 Task: diagnose and fix THREE cascading bugs across three files in one of four hard variants. Bugs surface one at a time as earlier stages pass. Error messages describe symptoms (e.g. gcc not found, dependency resolver conflict) — the agent must reason about causes. Full credit (0.99) requires the pipeline to reach `passed` state. Award partial credit per fix landed in the correct file. Penalize blind edits, redundant reads, idle pipeline re-runs, and exceeding the ideal step budget. Bonus for correct diagnosis and cross-file reasoning.