samrat-rm committed on
Commit c6913d5 · 1 Parent(s): 77f9568

feat: updated the yaml file with tasks for evaluation

Files changed (1)
  1. openenv.yaml +80 -0
openenv.yaml CHANGED
@@ -5,3 +5,83 @@ runtime: fastapi
 app: server.app:app
 port: 8000
 
+tasks:
+  - id: task_easy
+    difficulty: easy
+    max_steps: 10
+    grader:
+      type: llm
+      prompt_template: |
+        You are evaluating an ML engineer's diagnosis of a failed training run.
+        The agent was given training logs only (no config or gradient data) and must identify the failure mode.
+
+        Valid failure mode labels:
+        - "exploding gradients": loss becomes NaN/inf, gradient norms spike massively
+        - "learning rate too high": loss oscillates wildly epoch-to-epoch (not NaN)
+        - "overfitting": train loss low, val loss rising, regularization already present in config
+        - "underfitting": both train and val loss stay high near random baseline, no gap
+
+        Agent response:
+        {response}
+
+        Score 0.0 to 1.0:
+        - 1.0: Correct failure mode with reasoning that cites specific numeric values from the logs
+        - 0.7: Correct failure mode but reasoning is vague or missing specific numbers
+        - 0.3: Wrong label but description matches a related concept
+        - 0.0: Wrong failure mode or no diagnosis submitted
+
+        Reply with a single float between 0.0 and 1.0. No explanation.
+
+  - id: task_medium
+    difficulty: medium
+    max_steps: 15
+    grader:
+      type: llm
+      prompt_template: |
+        You are evaluating an ML engineer's diagnosis of a failed training run.
+        The agent was given training logs AND hyperparameter config and must identify the failure mode.
+
+        Valid failure mode labels:
+        - "learning rate too low": loss decreases extremely slowly, lr in config is very small (e.g. 1e-6)
+        - "missing regularization": train loss low, val loss rising, config shows weight_decay=0 and dropout=0
+        - "batch size too small": training loss is highly noisy, config shows batch_size <= 4
+        - "optimizer misconfiguration": slow convergence, config shows SGD with momentum=0.0
+
+        Agent response:
+        {response}
+
+        Score 0.0 to 1.0:
+        - 1.0: Correct failure mode with reasoning citing both log values AND config parameters
+        - 0.7: Correct failure mode but reasoning only references logs or config, not both
+        - 0.3: Wrong label but description matches a related concept
+        - 0.0: Wrong failure mode or no diagnosis submitted
+
+        Reply with a single float between 0.0 and 1.0. No explanation.
+
+  - id: task_hard
+    difficulty: hard
+    max_steps: 20
+    grader:
+      type: llm
+      prompt_template: |
+        You are evaluating an ML engineer's diagnosis of a failed training run.
+        The agent was given training logs, hyperparameter config, AND gradient norm data.
+        It must identify the failure mode AND provide a concrete, actionable fix.
+
+        Valid failure mode labels:
+        - "vanishing gradients": gradient norms decay exponentially toward input layers, sigmoid/tanh activation
+        - "dying relu": gradient norms are exactly 0.0 in hidden layers, relu activation
+        - "bad weight initialization": loss is NaN from epoch 1, extreme gradient norms (>10000), bad weight_init config
+        - "lr scheduler misconfiguration": loss spikes when scheduler fires, gamma > 1.0 (lr increases instead of decreasing)
+
+        Agent response:
+        {response}
+
+        Score 0.0 to 1.0:
+        - 1.0: Correct failure mode AND a specific actionable fix addressing the root cause
+        - 0.8: Correct failure mode with a reasonable fix that lacks specifics
+        - 0.5: Correct failure mode but fix is vague, wrong, or missing
+        - 0.2: Wrong failure mode but fix is incidentally relevant
+        - 0.0: Wrong failure mode and no useful fix
+
+        Reply with a single float between 0.0 and 1.0. No explanation.
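Each grader above fills a `{response}` placeholder and expects the judge model to reply with a bare float in [0.0, 1.0]. A minimal sketch of how an evaluation harness might render one of these `prompt_template` entries and parse the judge's reply — the `render_grader_prompt` and `parse_score` helpers are hypothetical names for illustration, not part of this repo:

```python
# Hypothetical harness-side helpers for the llm graders in openenv.yaml.
# Assumes each task dict mirrors the YAML structure above.
import re

def render_grader_prompt(task: dict, response: str) -> str:
    """Fill the {response} placeholder in the task's prompt_template."""
    return task["grader"]["prompt_template"].format(response=response)

def parse_score(reply: str) -> float:
    """Extract the first float from the judge's reply, clamped to [0, 1].

    An unparseable reply is treated as a failed grade (0.0), since the
    prompts instruct the judge to reply with a single float only.
    """
    match = re.search(r"\d*\.?\d+", reply)
    if match is None:
        return 0.0
    return max(0.0, min(1.0, float(match.group())))

task = {
    "id": "task_easy",
    "grader": {
        "type": "llm",
        "prompt_template": "Agent response:\n{response}\n\nReply with a single float between 0.0 and 1.0.",
    },
}
prompt = render_grader_prompt(task, "Loss went NaN at step 120: exploding gradients.")
print(parse_score("0.7"))  # -> 0.7
print(parse_score("1.5"))  # out-of-range reply clamps to 1.0
```

Clamping rather than rejecting out-of-range replies is one reasonable choice here; a stricter harness might instead re-query the judge when the reply is not a clean float.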