Spaces:
Sleeping
Sleeping
File size: 4,626 Bytes
b37875f c6913d5 f58e721 c6913d5 7a56cd1 d3b224f 7a56cd1 d3b224f 7a56cd1 d3b224f a583a04 d3b224f f58e721 89b370c c6913d5 f58e721 c6913d5 7a56cd1 d3b224f 7a56cd1 d3b224f 7a56cd1 d3b224f a583a04 d3b224f f58e721 89b370c c6913d5 f58e721 c6913d5 7a56cd1 d3b224f 7a56cd1 d3b224f 7a56cd1 d3b224f a583a04 d3b224f 89b370c | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 | spec_version: 1
name: WhyDidItFail
type: space
runtime: fastapi
app: server.app:app
port: 8000
tasks:
- id: task_easy
difficulty: easy
max_steps: 10
grader:
type: llm
default_score: 0.15
prompt_template: |
You are evaluating an ML engineer's diagnosis of a failed training run.
The agent was given training logs only (no config or gradient data) and must identify the failure mode.
Valid failure mode labels:
- "exploding gradients": loss becomes NaN/inf, gradient norms spike massively
- "learning rate too high": loss oscillates wildly epoch-to-epoch (not NaN)
- "overfitting": train loss low, val loss rising, regularization already present in config
- "underfitting": both train and val loss stay high near random baseline, no gap
Agent response:
{response}
You MUST reply with exactly one of these four numbers and nothing else:
0.85
0.65
0.30
0.15
Rules:
- 0.85: Correct failure mode with reasoning that cites specific numeric values from the logs
- 0.65: Correct failure mode but reasoning is vague or missing specific numbers
- 0.30: Wrong label but description matches a related concept
- 0.15: Wrong failure mode, no diagnosis submitted, or empty response
- If in doubt, return 0.15. Only return one of the four values listed above.
- CRITICAL: Never return 0, 0.0, 1, or 1.0. These are invalid scores.
- id: task_medium
difficulty: medium
max_steps: 15
grader:
type: llm
default_score: 0.15
prompt_template: |
You are evaluating an ML engineer's diagnosis of a failed training run.
The agent was given training logs AND hyperparameter config and must identify the failure mode.
Valid failure mode labels:
- "learning rate too low": loss decreases extremely slowly, lr in config is very small (e.g. 1e-6)
- "missing regularization": train loss low, val loss rising, config shows weight_decay=0 and dropout=0
- "batch size too small": training loss is highly noisy, config shows batch_size <= 4
- "optimizer misconfiguration": slow convergence, config shows SGD with momentum=0.0
Agent response:
{response}
You MUST reply with exactly one of these four numbers and nothing else:
0.85
0.65
0.30
0.15
Rules:
- 0.85: Correct failure mode with reasoning citing both log values AND config parameters
- 0.65: Correct failure mode but reasoning only references logs or config, not both
- 0.30: Wrong label but description matches a related concept
- 0.15: Wrong failure mode, no diagnosis submitted, or empty response
- If in doubt, return 0.15. Only return one of the four values listed above.
- CRITICAL: Never return 0, 0.0, 1, or 1.0. These are invalid scores.
- id: task_hard
difficulty: hard
max_steps: 20
grader:
type: llm
default_score: 0.15
prompt_template: |
You are evaluating an ML engineer's diagnosis of a failed training run.
The agent was given training logs, hyperparameter config, AND gradient norm data.
It must identify the failure mode AND provide a concrete, actionable fix.
Valid failure mode labels:
- "vanishing gradients": gradient norms decay exponentially toward input layers, sigmoid/tanh activation
- "dying relu": gradient norms are exactly 0.0 in hidden layers, relu activation
- "bad weight initialization": loss is NaN from epoch 1, extreme gradient norms (>10000), bad weight_init config
- "lr scheduler misconfiguration": loss spikes when scheduler fires, gamma > 1.0 (lr increases instead of decreasing)
Agent response:
{response}
You MUST reply with exactly one of these five numbers and nothing else:
0.85
0.75
0.50
0.20
0.15
Rules:
- 0.85: Correct failure mode AND a specific actionable fix addressing the root cause
- 0.75: Correct failure mode with a reasonable fix that lacks specifics
- 0.50: Correct failure mode but fix is vague, wrong, or missing
- 0.20: Wrong failure mode but fix is incidentally relevant
- 0.15: Wrong failure mode, no useful fix, no diagnosis submitted, or empty response
- If in doubt, return 0.15. Only return one of the five values listed above.
- CRITICAL: Never return 0, 0.0, 1, or 1.0. These are invalid scores. |