spec_version: 1 name: WhyDidItFail type: space runtime: fastapi app: server.app:app port: 8000 tasks: - id: task_easy difficulty: easy max_steps: 10 grader: type: llm default_score: 0.15 prompt_template: | You are evaluating an ML engineer's diagnosis of a failed training run. The agent was given training logs only (no config or gradient data) and must identify the failure mode. Valid failure mode labels: - "exploding gradients": loss becomes NaN/inf, gradient norms spike massively - "learning rate too high": loss oscillates wildly epoch-to-epoch (not NaN) - "overfitting": train loss low, val loss rising, regularization already present in config - "underfitting": both train and val loss stay high near random baseline, no gap Agent response: {response} You MUST reply with exactly one of these four numbers and nothing else: 0.85 0.65 0.30 0.15 Rules: - 0.85: Correct failure mode with reasoning that cites specific numeric values from the logs - 0.65: Correct failure mode but reasoning is vague or missing specific numbers - 0.30: Wrong label but description matches a related concept - 0.15: Wrong failure mode, no diagnosis submitted, or empty response - If in doubt, return 0.15. Only return one of the four values listed above. - CRITICAL: Never return 0, 0.0, 1, or 1.0. These are invalid scores. - id: task_medium difficulty: medium max_steps: 15 grader: type: llm default_score: 0.15 prompt_template: | You are evaluating an ML engineer's diagnosis of a failed training run. The agent was given training logs AND hyperparameter config and must identify the failure mode. Valid failure mode labels: - "learning rate too low": loss decreases extremely slowly, lr in config is very small (e.g. 1e-6) - "missing regularization": train loss low, val loss rising, config shows weight_decay=0 and dropout=0 - "batch size too small": training loss is highly noisy, config shows batch_size <= 4 - "optimizer misconfiguration": slow convergence, config shows SGD with momentum=0.0 Agent response: {response} You MUST reply with exactly one of these four numbers and nothing else: 0.85 0.65 0.30 0.15 Rules: - 0.85: Correct failure mode with reasoning citing both log values AND config parameters - 0.65: Correct failure mode but reasoning only references logs or config, not both - 0.30: Wrong label but description matches a related concept - 0.15: Wrong failure mode, no diagnosis submitted, or empty response - If in doubt, return 0.15. Only return one of the four values listed above. - CRITICAL: Never return 0, 0.0, 1, or 1.0. These are invalid scores. - id: task_hard difficulty: hard max_steps: 20 grader: type: llm default_score: 0.15 prompt_template: | You are evaluating an ML engineer's diagnosis of a failed training run. The agent was given training logs, hyperparameter config, AND gradient norm data. It must identify the failure mode AND provide a concrete, actionable fix. Valid failure mode labels: - "vanishing gradients": gradient norms decay exponentially toward input layers, sigmoid/tanh activation - "dying relu": gradient norms are exactly 0.0 in hidden layers, relu activation - "bad weight initialization": loss is NaN from epoch 1, extreme gradient norms (>10000), bad weight_init config - "lr scheduler misconfiguration": loss spikes when scheduler fires, gamma > 1.0 (lr increases instead of decreasing) Agent response: {response} You MUST reply with exactly one of these five numbers and nothing else: 0.85 0.75 0.50 0.20 0.15 Rules: - 0.85: Correct failure mode AND a specific actionable fix addressing the root cause - 0.75: Correct failure mode with a reasonable fix that lacks specifics - 0.50: Correct failure mode but fix is vague, wrong, or missing - 0.20: Wrong failure mode but fix is incidentally relevant - 0.15: Wrong failure mode, no useful fix, no diagnosis submitted, or empty response - If in doubt, return 0.15. Only return one of the five values listed above. - CRITICAL: Never return 0, 0.0, 1, or 1.0. These are invalid scores.