spec_version: 1
name: WhyDidItFail
type: space
runtime: fastapi
app: server.app:app
port: 8000

tasks:
  - id: task_easy
    difficulty: easy
    max_steps: 10
    grader:
      type: llm
      default_score: 0.15
      prompt_template: |
        You are evaluating an ML engineer's diagnosis of a failed training run.
        The agent was given training logs only (no config or gradient data) and must identify the failure mode.

        Valid failure mode labels:
        - "exploding gradients": loss becomes NaN/inf, gradient norms spike massively
        - "learning rate too high": loss oscillates wildly epoch-to-epoch (not NaN)
        - "overfitting": train loss low, val loss rising, regularization already present in config
        - "underfitting": both train and val loss stay high near random baseline, no gap

        Agent response:
        {response}

        You MUST reply with exactly one of these four numbers and nothing else:
        0.85
        0.65
        0.30
        0.15

        Rules:
        - 0.85: Correct failure mode with reasoning that cites specific numeric values from the logs
        - 0.65: Correct failure mode but reasoning is vague or missing specific numbers
        - 0.30: Wrong label but description matches a related concept
        - 0.15: Wrong failure mode, no diagnosis submitted, or empty response
        - If in doubt, return 0.15. Only return one of the four values listed above.
        - CRITICAL: Never return 0, 0.0, 1, or 1.0. These are invalid scores.

  - id: task_medium
    difficulty: medium
    max_steps: 15
    grader:
      type: llm
      default_score: 0.15
      prompt_template: |
        You are evaluating an ML engineer's diagnosis of a failed training run.
        The agent was given training logs AND hyperparameter config and must identify the failure mode.

        Valid failure mode labels:
        - "learning rate too low": loss decreases extremely slowly, lr in config is very small (e.g. 1e-6)
        - "missing regularization": train loss low, val loss rising, config shows weight_decay=0 and dropout=0
        - "batch size too small": training loss is highly noisy, config shows batch_size <= 4
        - "optimizer misconfiguration": slow convergence, config shows SGD with momentum=0.0

        Agent response:
        {response}

        You MUST reply with exactly one of these four numbers and nothing else:
        0.85
        0.65
        0.30
        0.15

        Rules:
        - 0.85: Correct failure mode with reasoning citing both log values AND config parameters
        - 0.65: Correct failure mode but reasoning only references logs or config, not both
        - 0.30: Wrong label but description matches a related concept
        - 0.15: Wrong failure mode, no diagnosis submitted, or empty response
        - If in doubt, return 0.15. Only return one of the four values listed above.
        - CRITICAL: Never return 0, 0.0, 1, or 1.0. These are invalid scores.

  - id: task_hard
    difficulty: hard
    max_steps: 20
    grader:
      type: llm
      default_score: 0.15
      prompt_template: |
        You are evaluating an ML engineer's diagnosis of a failed training run.
        The agent was given training logs, hyperparameter config, AND gradient norm data.
        It must identify the failure mode AND provide a concrete, actionable fix.

        Valid failure mode labels:
        - "vanishing gradients": gradient norms decay exponentially toward input layers, sigmoid/tanh activation
        - "dying relu": gradient norms are exactly 0.0 in hidden layers, relu activation
        - "bad weight initialization": loss is NaN from epoch 1, extreme gradient norms (>10000), bad weight_init config
        - "lr scheduler misconfiguration": loss spikes when scheduler fires, gamma > 1.0 (lr increases instead of decreasing)

        Agent response:
        {response}

        You MUST reply with exactly one of these five numbers and nothing else:
        0.85
        0.75
        0.50
        0.20
        0.15

        Rules:
        - 0.85: Correct failure mode AND a specific actionable fix addressing the root cause
        - 0.75: Correct failure mode with a reasonable fix that lacks specifics
        - 0.50: Correct failure mode but fix is vague, wrong, or missing
        - 0.20: Wrong failure mode but fix is incidentally relevant
        - 0.15: Wrong failure mode, no useful fix, no diagnosis submitted, or empty response
        - If in doubt, return 0.15. Only return one of the five values listed above.
        - CRITICAL: Never return 0, 0.0, 1, or 1.0. These are invalid scores.