Spaces:
Sleeping
Sleeping
| spec_version: 1 | |
| name: WhyDidItFail | |
| type: space | |
| runtime: fastapi | |
| app: server.app:app | |
| port: 8000 | |
| tasks: | |
| - id: task_easy | |
| difficulty: easy | |
| max_steps: 10 | |
| grader: | |
| type: llm | |
| default_score: 0.15 | |
| prompt_template: | | |
| You are evaluating an ML engineer's diagnosis of a failed training run. | |
| The agent was given training logs only (no config or gradient data) and must identify the failure mode. | |
| Valid failure mode labels: | |
| - "exploding gradients": loss becomes NaN/inf, gradient norms spike massively | |
| - "learning rate too high": loss oscillates wildly epoch-to-epoch (not NaN) | |
| - "overfitting": train loss low, val loss rising, regularization already present in config | |
| - "underfitting": both train and val loss stay high near random baseline, no gap | |
| Agent response: | |
| {response} | |
| You MUST reply with exactly one of these four numbers and nothing else: | |
| 0.85 | |
| 0.65 | |
| 0.30 | |
| 0.15 | |
| Rules: | |
| - 0.85: Correct failure mode with reasoning that cites specific numeric values from the logs | |
| - 0.65: Correct failure mode but reasoning is vague or missing specific numbers | |
| - 0.30: Wrong label but description matches a related concept | |
| - 0.15: Wrong failure mode, no diagnosis submitted, or empty response | |
| - If in doubt, return 0.15. Only return one of the four values listed above. | |
| - CRITICAL: Never return 0, 0.0, 1, or 1.0. These are invalid scores. | |
| - id: task_medium | |
| difficulty: medium | |
| max_steps: 15 | |
| grader: | |
| type: llm | |
| default_score: 0.15 | |
| prompt_template: | | |
| You are evaluating an ML engineer's diagnosis of a failed training run. | |
| The agent was given training logs AND hyperparameter config and must identify the failure mode. | |
| Valid failure mode labels: | |
| - "learning rate too low": loss decreases extremely slowly, lr in config is very small (e.g. 1e-6) | |
| - "missing regularization": train loss low, val loss rising, config shows weight_decay=0 and dropout=0 | |
| - "batch size too small": training loss is highly noisy, config shows batch_size <= 4 | |
| - "optimizer misconfiguration": slow convergence, config shows SGD with momentum=0.0 | |
| Agent response: | |
| {response} | |
| You MUST reply with exactly one of these four numbers and nothing else: | |
| 0.85 | |
| 0.65 | |
| 0.30 | |
| 0.15 | |
| Rules: | |
| - 0.85: Correct failure mode with reasoning citing both log values AND config parameters | |
| - 0.65: Correct failure mode but reasoning only references logs or config, not both | |
| - 0.30: Wrong label but description matches a related concept | |
| - 0.15: Wrong failure mode, no diagnosis submitted, or empty response | |
| - If in doubt, return 0.15. Only return one of the four values listed above. | |
| - CRITICAL: Never return 0, 0.0, 1, or 1.0. These are invalid scores. | |
| - id: task_hard | |
| difficulty: hard | |
| max_steps: 20 | |
| grader: | |
| type: llm | |
| default_score: 0.15 | |
| prompt_template: | | |
| You are evaluating an ML engineer's diagnosis of a failed training run. | |
| The agent was given training logs, hyperparameter config, AND gradient norm data. | |
| It must identify the failure mode AND provide a concrete, actionable fix. | |
| Valid failure mode labels: | |
| - "vanishing gradients": gradient norms decay exponentially toward input layers, sigmoid/tanh activation | |
| - "dying relu": gradient norms are exactly 0.0 in hidden layers, relu activation | |
| - "bad weight initialization": loss is NaN from epoch 1, extreme gradient norms (>10000), bad weight_init config | |
| - "lr scheduler misconfiguration": loss spikes when scheduler fires, gamma > 1.0 (lr increases instead of decreasing) | |
| Agent response: | |
| {response} | |
| You MUST reply with exactly one of these five numbers and nothing else: | |
| 0.85 | |
| 0.75 | |
| 0.50 | |
| 0.20 | |
| 0.15 | |
| Rules: | |
| - 0.85: Correct failure mode AND a specific actionable fix addressing the root cause | |
| - 0.75: Correct failure mode with a reasonable fix that lacks specifics | |
| - 0.50: Correct failure mode but fix is vague, wrong, or missing | |
| - 0.20: Wrong failure mode but fix is incidentally relevant | |
| - 0.15: Wrong failure mode, no useful fix, no diagnosis submitted, or empty response | |
| - If in doubt, return 0.15. Only return one of the five values listed above. | |
| - CRITICAL: Never return 0, 0.0, 1, or 1.0. These are invalid scores. |