File size: 3,770 Bytes
45c94ac
 
 
 
 
 
 
 
 
 
 
e365f21
45c94ac
 
f4dcf31
45c94ac
 
 
 
 
 
e365f21
 
 
45c94ac
 
 
 
 
 
 
f4dcf31
45c94ac
 
 
 
e365f21
 
 
45c94ac
 
e365f21
 
 
45c94ac
 
 
 
 
 
 
f4dcf31
45c94ac
 
 
 
 
e365f21
 
45c94ac
 
e365f21
45c94ac
e365f21
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
spec_version: 1
name: CI_CD_Doctor
type: space
runtime: fastapi
app: server.app:app
port: 8000

tasks:
  - id: task_easy
    difficulty: easy
    max_steps: 10
    description: "Single-file failure: one required package is missing from requirements.txt. The error shows a ModuleNotFoundError — the agent must cross-reference app.py imports with requirements.txt to identify and add the missing package."
    grader:
      type: programmatic
      module: core.grading.grader:grade
      success_threshold: 0.70
      prompt_template: |
        Score the agent's CI/CD debugging session from 0.01 to 0.99
        Task: fix a single missing package in requirements.txt so the install stage passes.
        Full credit (0.99) requires the pipeline to reach `passed` state.
        Partial credit is awarded per correct fix landed in the target file.
        Penalize blind edits (editing without reading), redundant reads,
        idle pipeline re-runs, and exceeding the ideal step budget.
        Bonus for correct diagnosis and cross-file reasoning.

  - id: task_medium
    difficulty: medium
    max_steps: 15
    description: "Two-file cascading failure (one of four randomized variants). Each variant pairs two bugs across files such as Dockerfile (wrong Python version or port), .env.ci (missing required env var), requirements.txt (missing package), deploy_config.yml (deploy_enabled false), Makefile (wrong test command), or service.yaml (wrong port). Bugs surface one at a time across the pipeline stages — fix both to make it pass."
    grader:
      type: programmatic
      module: core.grading.grader:grade
      success_threshold: 0.60
      prompt_template: |
        Score the agent's CI/CD debugging session from 0.01 to 0.99
        Task: diagnose and fix TWO bugs spread across two files in one of four
        structurally distinct medium variants. Error messages show symptoms, not
        solutions — the agent must reason about root causes. Each pipeline run
        only reveals the next failing stage.
        Full credit (0.99) requires the pipeline to reach `passed` state.
        Award partial credit per fix landed in the correct file.
        Penalize blind edits, redundant reads, idle pipeline re-runs,
        and exceeding the ideal step budget.
        Bonus for correct diagnosis and cross-file reasoning.

  - id: task_hard
    difficulty: hard
    max_steps: 25
    description: "Three-file cascading failure (one of four randomized variants). Each variant chains three bugs across files such as ci.yml (wrong stage order), Dockerfile (alpine base incompatible with binary wheels), requirements.txt (missing package or wrong version pin like numpy==1.21), .env.ci (missing env var), Makefile (wrong test command), deploy_config.yml (deploy_enabled false), or service.yaml (wrong port). Each pipeline run stops at the first failing stage, so bugs reveal themselves one at a time — fix all three in sequence."
    grader:
      type: programmatic
      module: core.grading.grader:grade
      success_threshold: 0.45
      prompt_template: |
        Score the agent's CI/CD debugging session from 0.01 to 0.99
        Task: diagnose and fix THREE cascading bugs across three files in one of
        four hard variants. Bugs surface one at a time as earlier stages pass.
        Error messages describe symptoms (e.g. gcc not found, dependency
        resolver conflict) — the agent must reason about causes.
        Full credit (0.99) requires the pipeline to reach `passed` state.
        Award partial credit per fix landed in the correct file.
        Penalize blind edits, redundant reads, idle pipeline re-runs,
        and exceeding the ideal step budget.
        Bonus for correct diagnosis and cross-file reasoning.