spec_version: 1
name: WhyDidItFail
type: space
runtime: fastapi
app: server.app:app
port: 8000
tasks:
  - id: task_easy
    difficulty: easy
    max_steps: 10
    grader:
      type: llm
      default_score: 0.15
      prompt_template: |
        You are evaluating an ML engineer's diagnosis of a failed training run.
        The agent was given training logs only (no config or gradient data) and must identify the failure mode.
        Valid failure mode labels:
        - "exploding gradients": loss becomes NaN/inf, gradient norms spike massively
        - "learning rate too high": loss oscillates wildly epoch-to-epoch (not NaN)
        - "overfitting": train loss low, val loss rising, regularization already present in config
        - "underfitting": both train and val loss stay high near random baseline, no gap
        Agent response:
        {response}
        You MUST reply with exactly one of these four numbers and nothing else:
        0.85
        0.65
        0.30
        0.15
        Rules:
        - 0.85: Correct failure mode with reasoning that cites specific numeric values from the logs
        - 0.65: Correct failure mode but reasoning is vague or missing specific numbers
        - 0.30: Wrong label but description matches a related concept
        - 0.15: Wrong failure mode, no diagnosis submitted, or empty response
        - If in doubt, return 0.15. Only return one of the four values listed above.
        - CRITICAL: Never return 0, 0.0, 1, or 1.0. These are invalid scores.
  - id: task_medium
    difficulty: medium
    max_steps: 15
    grader:
      type: llm
      default_score: 0.15
      prompt_template: |
        You are evaluating an ML engineer's diagnosis of a failed training run.
        The agent was given training logs AND hyperparameter config and must identify the failure mode.
        Valid failure mode labels:
        - "learning rate too low": loss decreases extremely slowly, lr in config is very small (e.g. 1e-6)
        - "missing regularization": train loss low, val loss rising, config shows weight_decay=0 and dropout=0
        - "batch size too small": training loss is highly noisy, config shows batch_size <= 4
        - "optimizer misconfiguration": slow convergence, config shows SGD with momentum=0.0
        Agent response:
        {response}
        You MUST reply with exactly one of these four numbers and nothing else:
        0.85
        0.65
        0.30
        0.15
        Rules:
        - 0.85: Correct failure mode with reasoning citing both log values AND config parameters
        - 0.65: Correct failure mode but reasoning only references logs or config, not both
        - 0.30: Wrong label but description matches a related concept
        - 0.15: Wrong failure mode, no diagnosis submitted, or empty response
        - If in doubt, return 0.15. Only return one of the four values listed above.
        - CRITICAL: Never return 0, 0.0, 1, or 1.0. These are invalid scores.
  - id: task_hard
    difficulty: hard
    max_steps: 20
    grader:
      type: llm
      default_score: 0.15
      prompt_template: |
        You are evaluating an ML engineer's diagnosis of a failed training run.
        The agent was given training logs, hyperparameter config, AND gradient norm data.
        It must identify the failure mode AND provide a concrete, actionable fix.
        Valid failure mode labels:
        - "vanishing gradients": gradient norms decay exponentially toward input layers, sigmoid/tanh activation
        - "dying relu": gradient norms are exactly 0.0 in hidden layers, relu activation
        - "bad weight initialization": loss is NaN from epoch 1, extreme gradient norms (>10000), bad weight_init config
        - "lr scheduler misconfiguration": loss spikes when scheduler fires, gamma > 1.0 (lr increases instead of decreasing)
        Agent response:
        {response}
        You MUST reply with exactly one of these five numbers and nothing else:
        0.85
        0.75
        0.50
        0.20
        0.15
        Rules:
        - 0.85: Correct failure mode AND a specific actionable fix addressing the root cause
        - 0.75: Correct failure mode with a reasonable fix that lacks specifics
        - 0.50: Correct failure mode but fix is vague, wrong, or missing
        - 0.20: Wrong failure mode but fix is incidentally relevant
        - 0.15: Wrong failure mode, no useful fix, no diagnosis submitted, or empty response
        - If in doubt, return 0.15. Only return one of the five values listed above.
        - CRITICAL: Never return 0, 0.0, 1, or 1.0. These are invalid scores.