File size: 4,626 Bytes
b37875f
 
 
 
 
 
 
c6913d5
 
 
 
 
 
f58e721
c6913d5
 
 
 
 
 
 
 
 
 
 
 
 
7a56cd1
d3b224f
 
7a56cd1
d3b224f
7a56cd1
 
d3b224f
 
a583a04
d3b224f
f58e721
89b370c
c6913d5
 
 
 
 
 
f58e721
c6913d5
 
 
 
 
 
 
 
 
 
 
 
 
7a56cd1
d3b224f
 
7a56cd1
d3b224f
7a56cd1
 
d3b224f
 
a583a04
d3b224f
f58e721
89b370c
c6913d5
 
 
 
 
 
f58e721
c6913d5
 
 
 
 
 
 
 
 
 
 
 
 
 
7a56cd1
d3b224f
 
7a56cd1
 
d3b224f
7a56cd1
 
d3b224f
 
a583a04
 
d3b224f
89b370c
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
spec_version: 1
name: WhyDidItFail
type: space
runtime: fastapi
app: server.app:app
port: 8000

tasks:
  - id: task_easy
    difficulty: easy
    max_steps: 10
    grader:
      type: llm
      default_score: 0.15
      prompt_template: |
        You are evaluating an ML engineer's diagnosis of a failed training run.
        The agent was given training logs only (no config or gradient data) and must identify the failure mode.

        Valid failure mode labels:
        - "exploding gradients": loss becomes NaN/inf, gradient norms spike massively
        - "learning rate too high": loss oscillates wildly epoch-to-epoch (not NaN)
        - "overfitting": train loss low, val loss rising, regularization already present in config
        - "underfitting": both train and val loss stay high near random baseline, no gap

        Agent response:
        {response}

        You MUST reply with exactly one of these four numbers and nothing else:
        0.85
        0.65
        0.30
        0.15

        Rules:
        - 0.85: Correct failure mode with reasoning that cites specific numeric values from the logs
        - 0.65: Correct failure mode but reasoning is vague or missing specific numbers
        - 0.30: Wrong label but description matches a related concept
        - 0.15: Wrong failure mode, no diagnosis submitted, or empty response
        - If in doubt, return 0.15. Only return one of the four values listed above.
        - CRITICAL: Never return 0, 0.0, 1, or 1.0. These are invalid scores.

  - id: task_medium
    difficulty: medium
    max_steps: 15
    grader:
      type: llm
      default_score: 0.15
      prompt_template: |
        You are evaluating an ML engineer's diagnosis of a failed training run.
        The agent was given training logs AND hyperparameter config and must identify the failure mode.

        Valid failure mode labels:
        - "learning rate too low": loss decreases extremely slowly, lr in config is very small (e.g. 1e-6)
        - "missing regularization": train loss low, val loss rising, config shows weight_decay=0 and dropout=0
        - "batch size too small": training loss is highly noisy, config shows batch_size <= 4
        - "optimizer misconfiguration": slow convergence, config shows SGD with momentum=0.0

        Agent response:
        {response}

        You MUST reply with exactly one of these four numbers and nothing else:
        0.85
        0.65
        0.30
        0.15

        Rules:
        - 0.85: Correct failure mode with reasoning citing both log values AND config parameters
        - 0.65: Correct failure mode but reasoning only references logs or config, not both
        - 0.30: Wrong label but description matches a related concept
        - 0.15: Wrong failure mode, no diagnosis submitted, or empty response
        - If in doubt, return 0.15. Only return one of the four values listed above.
        - CRITICAL: Never return 0, 0.0, 1, or 1.0. These are invalid scores.

  - id: task_hard
    difficulty: hard
    max_steps: 20
    grader:
      type: llm
      default_score: 0.15
      prompt_template: |
        You are evaluating an ML engineer's diagnosis of a failed training run.
        The agent was given training logs, hyperparameter config, AND gradient norm data.
        It must identify the failure mode AND provide a concrete, actionable fix.

        Valid failure mode labels:
        - "vanishing gradients": gradient norms decay exponentially toward input layers, sigmoid/tanh activation
        - "dying relu": gradient norms are exactly 0.0 in hidden layers, relu activation
        - "bad weight initialization": loss is NaN from epoch 1, extreme gradient norms (>10000), bad weight_init config
        - "lr scheduler misconfiguration": loss spikes when scheduler fires, gamma > 1.0 (lr increases instead of decreasing)

        Agent response:
        {response}

        You MUST reply with exactly one of these five numbers and nothing else:
        0.85
        0.75
        0.50
        0.20
        0.15

        Rules:
        - 0.85: Correct failure mode AND a specific actionable fix addressing the root cause
        - 0.75: Correct failure mode with a reasonable fix that lacks specifics
        - 0.50: Correct failure mode but fix is vague, wrong, or missing
        - 0.20: Wrong failure mode but fix is incidentally relevant
        - 0.15: Wrong failure mode, no useful fix, no diagnosis submitted, or empty response
        - If in doubt, return 0.15. Only return one of the five values listed above.
        - CRITICAL: Never return 0, 0.0, 1, or 1.0. These are invalid scores.