vuhaian committed
Commit aebe7f3 · verified · 1 Parent(s): 5aecbda

upload eval_results_3.json

Files changed (1)
  1. eval_results_3.json +2570 -0
eval_results_3.json ADDED
@@ -0,0 +1,2570 @@
1
+ {
2
+ "vuhaian/412k_raw_e1": {
3
+ "name": "vuhaian/412k_raw_e1",
4
+ "uid": null,
5
+ "hotkey": null,
6
+ "is_king": false,
7
+ "is_teacher": false,
8
+ "kl_global_avg": 1.172777479832485,
9
+ "on_policy_rkl": {
10
+ "mean_rkl": 1.9198109520780764
11
+ },
12
+ "top_k_overlap_mean": 0.27019265932892383,
13
+ "teacher_trace_nll_mean": 2.01360827430854,
14
+ "capability": {
15
+ "pass_frac": 0.7239583333333334
16
+ },
17
+ "length_axis": {
18
+ "penalty": 0.9755010752381733
19
+ },
20
+ "degenerate_count": 0,
21
+ "activation_fingerprint": {
22
+ "layer_fingerprints": {
23
+ "all": [
24
+ -0.05128144763420785,
25
+ -0.034146015649314326,
26
+ -0.019164217192576775,
27
+ -0.05245710265199352,
28
+ 0.0600000220138926,
29
+ -0.008791402566627228,
30
+ -0.003558177133475171,
31
+ 0.08171322574945596,
32
+ 0.004650599937612277,
33
+ 0.03179470561374305,
34
+ -0.013036245462702871,
35
+ -0.03831803035844744,
36
+ -0.06713718433425508,
37
+ 0.04997054026924338,
38
+ 0.0264990560203545,
39
+ 0.06975899906418416,
40
+ 0.011943822658565756,
41
+ 0.010799379720898323,
42
+ -0.012432811913750964,
43
+ 0.005659790528100843,
44
+ -0.00902029115416072,
45
+ 0.003901510014775407,
46
+ 0.03979540215070909,
47
+ 0.008718574379684754,
48
+ -0.07586616274064592,
49
+ 0.010716147507249758,
50
+ -0.06484829845892019,
51
+ 0.046870140310835154,
52
+ -0.05542225026322279,
53
+ 0.013025841435996803,
54
+ 0.07396222585343552,
55
+ -0.08238948748535038,
56
+ -0.05128144763420785,
57
+ -0.034146015649314326,
58
+ -0.019164217192576775,
59
+ -0.05245710265199352,
60
+ 0.0600000220138926,
61
+ -0.008791402566627228,
62
+ -0.003558177133475171,
63
+ 0.08171322574945596,
64
+ 0.004650599937612277,
65
+ 0.03179470561374305,
66
+ -0.013036245462702871,
67
+ -0.03831803035844744,
68
+ -0.06713718433425508,
69
+ 0.04997054026924338,
70
+ 0.0264990560203545,
71
+ 0.06975899906418416,
72
+ 0.011943822658565756,
73
+ 0.010799379720898323,
74
+ -0.012432811913750964,
75
+ 0.005659790528100843,
76
+ -0.00902029115416072,
77
+ 0.003901510014775407,
78
+ 0.03979540215070909,
79
+ 0.008718574379684754,
80
+ -0.07586616274064592,
81
+ 0.010716147507249758,
82
+ -0.06484829845892019,
83
+ 0.046870140310835154,
84
+ -0.05542225026322279,
85
+ 0.013025841435996803,
86
+ 0.07396222585343552,
87
+ -0.08238948748535038,
88
+ -0.05128144763420785,
89
+ -0.034146015649314326,
90
+ -0.019164217192576775,
91
+ -0.05245710265199352,
92
+ 0.0600000220138926,
93
+ -0.008791402566627228,
94
+ -0.003558177133475171,
95
+ 0.08171322574945596,
96
+ 0.004650599937612277,
97
+ 0.03179470561374305,
98
+ -0.013036245462702871,
99
+ -0.03831803035844744,
100
+ -0.06713718433425508,
101
+ 0.04997054026924338,
102
+ 0.0264990560203545,
103
+ 0.06975899906418416,
104
+ 0.011943822658565756,
105
+ 0.010799379720898323,
106
+ -0.012432811913750964,
107
+ 0.005659790528100843,
108
+ -0.00902029115416072,
109
+ 0.003901510014775407,
110
+ 0.03979540215070909,
111
+ 0.008718574379684754,
112
+ -0.07586616274064592,
113
+ 0.010716147507249758,
114
+ -0.06484829845892019,
115
+ 0.046870140310835154,
116
+ -0.05542225026322279,
117
+ 0.013025841435996803,
118
+ 0.07396222585343552,
119
+ -0.08238948748535038,
120
+ -0.05128144763420785,
121
+ -0.034146015649314326,
122
+ -0.019164217192576775,
123
+ -0.05245710265199352,
124
+ 0.0600000220138926,
125
+ -0.008791402566627228,
126
+ -0.003558177133475171,
127
+ 0.08171322574945596,
128
+ 0.004650599937612277,
129
+ 0.03179470561374305,
130
+ -0.013036245462702871,
131
+ -0.03831803035844744,
132
+ -0.06713718433425508,
133
+ 0.04997054026924338,
134
+ 0.0264990560203545,
135
+ 0.06975899906418416,
136
+ 0.011943822658565756,
137
+ 0.010799379720898323,
138
+ -0.012432811913750964,
139
+ 0.005659790528100843,
140
+ -0.00902029115416072,
141
+ 0.003901510014775407,
142
+ 0.03979540215070909,
143
+ 0.008718574379684754,
144
+ -0.07586616274064592,
145
+ 0.010716147507249758,
146
+ -0.06484829845892019,
147
+ 0.046870140310835154,
148
+ -0.05542225026322279,
149
+ 0.013025841435996803,
150
+ 0.07396222585343552,
151
+ -0.08238948748535038,
152
+ -0.05128144763420785,
153
+ -0.034146015649314326,
154
+ -0.019164217192576775,
155
+ -0.05245710265199352,
156
+ 0.0600000220138926,
157
+ -0.008791402566627228,
158
+ -0.003558177133475171,
159
+ 0.08171322574945596,
160
+ 0.004650599937612277,
161
+ 0.03179470561374305,
162
+ -0.013036245462702871,
163
+ -0.03831803035844744,
164
+ -0.06713718433425508,
165
+ 0.04997054026924338,
166
+ 0.0264990560203545,
167
+ 0.06975899906418416,
168
+ 0.011943822658565756,
169
+ 0.010799379720898323,
170
+ -0.012432811913750964,
171
+ 0.005659790528100843,
172
+ -0.00902029115416072,
173
+ 0.003901510014775407,
174
+ 0.03979540215070909,
175
+ 0.008718574379684754,
176
+ -0.07586616274064592,
177
+ 0.010716147507249758,
178
+ -0.06484829845892019,
179
+ 0.046870140310835154,
180
+ -0.05542225026322279,
181
+ 0.013025841435996803,
182
+ 0.07396222585343552,
183
+ -0.08238948748535038,
184
+ -0.05128144763420785,
185
+ -0.034146015649314326,
186
+ -0.019164217192576775,
187
+ -0.05245710265199352,
188
+ 0.0600000220138926,
189
+ -0.008791402566627228,
190
+ -0.003558177133475171,
191
+ 0.08171322574945596,
192
+ 0.004650599937612277,
193
+ 0.03179470561374305,
194
+ -0.013036245462702871,
195
+ -0.03831803035844744,
196
+ -0.06713718433425508,
197
+ 0.04997054026924338,
198
+ 0.0264990560203545,
199
+ 0.06975899906418416,
200
+ 0.011943822658565756,
201
+ 0.010799379720898323,
202
+ -0.012432811913750964,
203
+ 0.005659790528100843,
204
+ -0.00902029115416072,
205
+ 0.003901510014775407,
206
+ 0.03979540215070909,
207
+ 0.008718574379684754,
208
+ -0.07586616274064592,
209
+ 0.010716147507249758,
210
+ -0.06484829845892019,
211
+ 0.046870140310835154,
212
+ -0.05542225026322279,
213
+ 0.013025841435996803,
214
+ 0.07396222585343552,
215
+ -0.08238948748535038,
216
+ -0.05128144763420785,
217
+ -0.034146015649314326,
218
+ -0.019164217192576775,
219
+ -0.05245710265199352,
220
+ 0.0600000220138926,
221
+ -0.008791402566627228,
222
+ -0.003558177133475171,
223
+ 0.08171322574945596,
224
+ 0.004650599937612277,
225
+ 0.03179470561374305,
226
+ -0.013036245462702871,
227
+ -0.03831803035844744,
228
+ -0.06713718433425508,
229
+ 0.04997054026924338,
230
+ 0.0264990560203545,
231
+ 0.06975899906418416,
232
+ 0.011943822658565756,
233
+ 0.010799379720898323,
234
+ -0.012432811913750964,
235
+ 0.005659790528100843,
236
+ -0.00902029115416072,
237
+ 0.003901510014775407,
238
+ 0.03979540215070909,
239
+ 0.008718574379684754,
240
+ -0.07586616274064592,
241
+ 0.010716147507249758,
242
+ -0.06484829845892019,
243
+ 0.046870140310835154,
244
+ -0.05542225026322279,
245
+ 0.013025841435996803,
246
+ 0.07396222585343552,
247
+ -0.08238948748535038,
248
+ -0.05128144763420785,
249
+ -0.034146015649314326,
250
+ -0.019164217192576775,
251
+ -0.05245710265199352,
252
+ 0.0600000220138926,
253
+ -0.008791402566627228,
254
+ -0.003558177133475171,
255
+ 0.08171322574945596,
256
+ 0.004650599937612277,
257
+ 0.03179470561374305,
258
+ -0.013036245462702871,
259
+ -0.03831803035844744,
260
+ -0.06713718433425508,
261
+ 0.04997054026924338,
262
+ 0.0264990560203545,
263
+ 0.06975899906418416,
264
+ 0.011943822658565756,
265
+ 0.010799379720898323,
266
+ -0.012432811913750964,
267
+ 0.005659790528100843,
268
+ -0.00902029115416072,
269
+ 0.003901510014775407,
270
+ 0.03979540215070909,
271
+ 0.008718574379684754,
272
+ -0.07586616274064592,
273
+ 0.010716147507249758,
274
+ -0.06484829845892019,
275
+ 0.046870140310835154,
276
+ -0.05542225026322279,
277
+ 0.013025841435996803,
278
+ 0.07396222585343552,
279
+ -0.08238948748535038,
280
+ -0.05128144763420785,
281
+ -0.034146015649314326,
282
+ -0.019164217192576775,
283
+ -0.05245710265199352,
284
+ 0.0600000220138926,
285
+ -0.008791402566627228,
286
+ -0.003558177133475171,
287
+ 0.08171322574945596,
288
+ 0.004650599937612277,
289
+ 0.03179470561374305,
290
+ -0.013036245462702871,
291
+ -0.03831803035844744,
292
+ -0.06713718433425508,
293
+ 0.04997054026924338,
294
+ 0.0264990560203545,
295
+ 0.06975899906418416,
296
+ 0.011943822658565756,
297
+ 0.010799379720898323,
298
+ -0.012432811913750964,
299
+ 0.005659790528100843,
300
+ -0.00902029115416072,
301
+ 0.003901510014775407,
302
+ 0.03979540215070909,
303
+ 0.008718574379684754,
304
+ -0.07586616274064592,
305
+ 0.010716147507249758,
306
+ -0.06484829845892019,
307
+ 0.046870140310835154,
308
+ -0.05542225026322279,
309
+ 0.013025841435996803,
310
+ 0.07396222585343552,
311
+ -0.08238948748535038,
312
+ -0.05128144763420785,
313
+ -0.034146015649314326,
314
+ -0.019164217192576775,
315
+ -0.05245710265199352,
316
+ 0.0600000220138926,
317
+ -0.008791402566627228,
318
+ -0.003558177133475171,
319
+ 0.08171322574945596,
320
+ 0.004650599937612277,
321
+ 0.03179470561374305,
322
+ -0.013036245462702871,
323
+ -0.03831803035844744,
324
+ -0.06713718433425508,
325
+ 0.04997054026924338,
326
+ 0.0264990560203545,
327
+ 0.06975899906418416,
328
+ 0.011943822658565756,
329
+ 0.010799379720898323,
330
+ -0.012432811913750964,
331
+ 0.005659790528100843,
332
+ -0.00902029115416072,
333
+ 0.003901510014775407,
334
+ 0.03979540215070909,
335
+ 0.008718574379684754,
336
+ -0.07586616274064592,
337
+ 0.010716147507249758,
338
+ -0.06484829845892019,
339
+ 0.046870140310835154,
340
+ -0.05542225026322279,
341
+ 0.013025841435996803,
342
+ 0.07396222585343552,
343
+ -0.08238948748535038,
344
+ -0.05128144763420785,
345
+ -0.034146015649314326,
346
+ -0.019164217192576775,
347
+ -0.05245710265199352,
348
+ 0.0600000220138926,
349
+ -0.008791402566627228,
350
+ -0.003558177133475171,
351
+ 0.08171322574945596,
352
+ 0.004650599937612277,
353
+ 0.03179470561374305,
354
+ -0.013036245462702871,
355
+ -0.03831803035844744,
356
+ -0.06713718433425508,
357
+ 0.04997054026924338,
358
+ 0.0264990560203545,
359
+ 0.06975899906418416,
360
+ 0.011943822658565756,
361
+ 0.010799379720898323,
362
+ -0.012432811913750964,
363
+ 0.005659790528100843,
364
+ -0.00902029115416072,
365
+ 0.003901510014775407,
366
+ 0.03979540215070909,
367
+ 0.008718574379684754,
368
+ -0.07586616274064592,
369
+ 0.010716147507249758,
370
+ -0.06484829845892019,
371
+ 0.046870140310835154,
372
+ -0.05542225026322279,
373
+ 0.013025841435996803,
374
+ 0.07396222585343552,
375
+ -0.08238948748535038,
376
+ -0.05128144763420785,
377
+ -0.034146015649314326,
378
+ -0.019164217192576775,
379
+ -0.05245710265199352,
380
+ 0.0600000220138926,
381
+ -0.008791402566627228,
382
+ -0.003558177133475171,
383
+ 0.08171322574945596,
384
+ 0.004650599937612277,
385
+ 0.03179470561374305,
386
+ -0.013036245462702871,
387
+ -0.03831803035844744,
388
+ -0.06713718433425508,
389
+ 0.04997054026924338,
390
+ 0.0264990560203545,
391
+ 0.06975899906418416,
392
+ 0.011943822658565756,
393
+ 0.010799379720898323,
394
+ -0.012432811913750964,
395
+ 0.005659790528100843,
396
+ -0.00902029115416072,
397
+ 0.003901510014775407,
398
+ 0.03979540215070909,
399
+ 0.008718574379684754,
400
+ -0.07586616274064592,
401
+ 0.010716147507249758,
402
+ -0.06484829845892019,
403
+ 0.046870140310835154,
404
+ -0.05542225026322279,
405
+ 0.013025841435996803,
406
+ 0.07396222585343552,
407
+ -0.08238948748535038,
408
+ -0.05128144763420785,
409
+ -0.034146015649314326,
410
+ -0.019164217192576775,
411
+ -0.05245710265199352,
412
+ 0.0600000220138926,
413
+ -0.008791402566627228,
414
+ -0.003558177133475171,
415
+ 0.08171322574945596,
416
+ 0.004650599937612277,
417
+ 0.03179470561374305,
418
+ -0.013036245462702871,
419
+ -0.03831803035844744,
420
+ -0.06713718433425508,
421
+ 0.04997054026924338,
422
+ 0.0264990560203545,
423
+ 0.06975899906418416,
424
+ 0.011943822658565756,
425
+ 0.010799379720898323,
426
+ -0.012432811913750964,
427
+ 0.005659790528100843,
428
+ -0.00902029115416072,
429
+ 0.003901510014775407,
430
+ 0.03979540215070909,
431
+ 0.008718574379684754,
432
+ -0.07586616274064592,
433
+ 0.010716147507249758,
434
+ -0.06484829845892019,
435
+ 0.046870140310835154,
436
+ -0.05542225026322279,
437
+ 0.013025841435996803,
438
+ 0.07396222585343552,
439
+ -0.08238948748535038,
440
+ -0.05128144763420785,
441
+ -0.034146015649314326,
442
+ -0.019164217192576775,
443
+ -0.05245710265199352,
444
+ 0.0600000220138926,
445
+ -0.008791402566627228,
446
+ -0.003558177133475171,
447
+ 0.08171322574945596,
448
+ 0.004650599937612277,
449
+ 0.03179470561374305,
450
+ -0.013036245462702871,
451
+ -0.03831803035844744,
452
+ -0.06713718433425508,
453
+ 0.04997054026924338,
454
+ 0.0264990560203545,
455
+ 0.06975899906418416,
456
+ 0.011943822658565756,
457
+ 0.010799379720898323,
458
+ -0.012432811913750964,
459
+ 0.005659790528100843,
460
+ -0.00902029115416072,
461
+ 0.003901510014775407,
462
+ 0.03979540215070909,
463
+ 0.008718574379684754,
464
+ -0.07586616274064592,
465
+ 0.010716147507249758,
466
+ -0.06484829845892019,
467
+ 0.046870140310835154,
468
+ -0.05542225026322279,
469
+ 0.013025841435996803,
470
+ 0.07396222585343552,
471
+ -0.08238948748535038,
472
+ -0.05128144763420785,
473
+ -0.034146015649314326,
474
+ -0.019164217192576775,
475
+ -0.05245710265199352,
476
+ 0.0600000220138926,
477
+ -0.008791402566627228,
478
+ -0.003558177133475171,
479
+ 0.08171322574945596,
480
+ 0.004650599937612277,
481
+ 0.03179470561374305,
482
+ -0.013036245462702871,
483
+ -0.03831803035844744,
484
+ -0.06713718433425508,
485
+ 0.04997054026924338,
486
+ 0.0264990560203545,
487
+ 0.06975899906418416,
488
+ 0.011943822658565756,
489
+ 0.010799379720898323,
490
+ -0.012432811913750964,
491
+ 0.005659790528100843,
492
+ -0.00902029115416072,
493
+ 0.003901510014775407,
494
+ 0.03979540215070909,
495
+ 0.008718574379684754,
496
+ -0.07586616274064592,
497
+ 0.010716147507249758,
498
+ -0.06484829845892019,
499
+ 0.046870140310835154,
500
+ -0.05542225026322279,
501
+ 0.013025841435996803,
502
+ 0.07396222585343552,
503
+ -0.08238948748535038,
504
+ -0.05128144763420785,
505
+ -0.034146015649314326,
506
+ -0.019164217192576775,
507
+ -0.05245710265199352,
508
+ 0.0600000220138926,
509
+ -0.008791402566627228,
510
+ -0.003558177133475171,
511
+ 0.08171322574945596,
512
+ 0.004650599937612277,
513
+ 0.03179470561374305,
514
+ -0.013036245462702871,
515
+ -0.03831803035844744,
516
+ -0.06713718433425508,
517
+ 0.04997054026924338,
518
+ 0.0264990560203545,
519
+ 0.06975899906418416,
520
+ 0.011943822658565756,
521
+ 0.010799379720898323,
522
+ -0.012432811913750964,
523
+ 0.005659790528100843,
524
+ -0.00902029115416072,
525
+ 0.003901510014775407,
526
+ 0.03979540215070909,
527
+ 0.008718574379684754,
528
+ -0.07586616274064592,
529
+ 0.010716147507249758,
530
+ -0.06484829845892019,
531
+ 0.046870140310835154,
532
+ -0.05542225026322279,
533
+ 0.013025841435996803,
534
+ 0.07396222585343552,
535
+ -0.08238948748535038
536
+ ]
537
+ },
538
+ "n_layers": 1,
539
+ "hidden_size": 512
540
+ },
541
+ "v31_math_gsm_symbolic": {
542
+ "n": 16,
543
+ "correct": 10,
544
+ "pass_frac": 0.625,
545
+ "completion_tokens": 2706,
546
+ "mean_gen_tokens_correct": 96.7,
547
+ "items": [
548
+ {
549
+ "src": "v31_gsm_symbolic/bakery_orders/p0",
550
+ "gold": "224",
551
+ "ok": false,
552
+ "tokens": 1,
553
+ "tail": "",
554
+ "difficulty": 0,
555
+ "is_noop": false,
556
+ "template": "bakery_orders"
557
+ },
558
+ {
559
+ "src": "v31_gsm_symbolic/travel_distance/p0",
560
+ "gold": "330",
561
+ "ok": false,
562
+ "tokens": 102,
563
+ "tail": "\u00d7 5 hours = 250 miles\n\nStep 3: Calculate the total distance.\nTotal Distance = 80 miles + 250 miles = 330 miles\n\n#### 330",
564
+ "difficulty": 0,
565
+ "is_noop": false,
566
+ "template": "travel_distance"
567
+ },
568
+ {
569
+ "src": "v31_gsm_symbolic/work_rate/noop",
570
+ "gold": "400",
571
+ "ok": true,
572
+ "tokens": 140,
573
+ "tail": "ant information**\nThe billboard information about a sale ending in 92 days is not needed for this calculation.\n\n#### 400",
574
+ "difficulty": 0,
575
+ "is_noop": true,
576
+ "template": "work_rate"
577
+ },
578
+ {
579
+ "src": "v31_gsm_symbolic/travel_distance/noop",
580
+ "gold": "510",
581
+ "ok": true,
582
+ "tokens": 167,
583
+ "tail": "ance = 50 \u00d7 2 = 100 miles\n\n**Step 4: Calculate total distance**\n- Total distance = 250 + 160 + 100 = 510 miles\n\n#### 510",
584
+ "difficulty": 0,
585
+ "is_noop": true,
586
+ "template": "travel_distance"
587
+ },
588
+ {
589
+ "src": "v31_gsm_symbolic/work_rate/p2",
590
+ "gold": "190",
591
+ "ok": false,
592
+ "tokens": 686,
593
+ "tail": "uld round to 193, or perhaps the problem allows 192.5.\n\nI'll provide the exact calculation: 55 \u00d7 3.5 = 192.5\n\n#### 192.5",
594
+ "difficulty": 2,
595
+ "is_noop": false,
596
+ "template": "work_rate"
597
+ },
598
+ {
599
+ "src": "v31_gsm_symbolic/bakery_orders/p0",
600
+ "gold": "144",
601
+ "ok": false,
602
+ "tokens": 1,
603
+ "tail": "",
604
+ "difficulty": 0,
605
+ "is_noop": false,
606
+ "template": "bakery_orders"
607
+ },
608
+ {
609
+ "src": "v31_gsm_symbolic/work_rate/p2",
610
+ "gold": "165",
611
+ "ok": false,
612
+ "tokens": 768,
613
+ "tail": " I'll use 169.\n\nWait - let me re-check: 45 \u00d7 3.75 = 45 \u00d7 3 + 45 \u00d7 0.75 = 135 + 33.75 = 168.75\n\nHmm, but 168.75 is not an",
614
+ "difficulty": 2,
615
+ "is_noop": false,
616
+ "template": "work_rate"
617
+ },
618
+ {
619
+ "src": "v31_gsm_symbolic/percentage_compose/p0",
620
+ "gold": "300",
621
+ "ok": true,
622
+ "tokens": 27,
623
+ "tail": "30% of 1000 marbles = 0.30 \u00d7 1000 = 300 marbles\n\n#### 300",
624
+ "difficulty": 0,
625
+ "is_noop": false,
626
+ "template": "percentage_compose"
627
+ },
628
+ {
629
+ "src": "v31_gsm_symbolic/travel_distance/p0",
630
+ "gold": "560",
631
+ "ok": true,
632
+ "tokens": 121,
633
+ "tail": "0 mph \u00d7 4 hours = 240 miles\n\n**Step 4: Calculate total distance**\nTotal distance = 200 + 120 + 240 = 560 miles\n\n#### 560",
634
+ "difficulty": 0,
635
+ "is_noop": false,
636
+ "template": "travel_distance"
637
+ },
638
+ {
639
+ "src": "v31_gsm_symbolic/bakery_orders/p0",
640
+ "gold": "147",
641
+ "ok": true,
642
+ "tokens": 96,
643
+ "tail": " muffins\n\nStep 3: Calculate leftover muffins.\nLeftover = Total produced \u2212 Total sold = 700 \u2212 553 = 147 muffins\n\n#### 147",
644
+ "difficulty": 0,
645
+ "is_noop": false,
646
+ "template": "bakery_orders"
647
+ },
648
+ {
649
+ "src": "v31_gsm_symbolic/garden_harvest/p1",
650
+ "gold": "405",
651
+ "ok": true,
652
+ "tokens": 117,
653
+ "tail": "kept (not damaged).\n- 100% - 10% = 90% kept\n\nStep 4: Calculate the final harvest.\n- 450 \u00d7 0.90 = 405 cucumbers\n\n#### 405",
654
+ "difficulty": 1,
655
+ "is_noop": false,
656
+ "template": "garden_harvest"
657
+ },
658
+ {
659
+ "src": "v31_gsm_symbolic/garden_harvest/p0",
660
+ "gold": "840",
661
+ "ok": true,
662
+ "tokens": 85,
663
+ "tail": " the total number of cucumbers harvested.\n- Cucumbers per plant: 8\n- Total cucumbers = 105 \u00d7 8 = 840 cucumbers\n\n#### 840",
664
+ "difficulty": 0,
665
+ "is_noop": false,
666
+ "template": "garden_harvest"
667
+ },
668
+ {
669
+ "src": "v31_gsm_symbolic/garden_harvest/p0",
670
+ "gold": "680",
671
+ "ok": true,
672
+ "tokens": 85,
673
+ "tail": "he total number of zucchini harvested.\n- Each plant yields 8 zucchini\n- Total zucchini = 85 \u00d7 8 = 680 zucchini\n\n#### 680",
674
+ "difficulty": 0,
675
+ "is_noop": false,
676
+ "template": "garden_harvest"
677
+ },
678
+ {
679
+ "src": "v31_gsm_symbolic/bakery_orders/p0",
680
+ "gold": "93",
681
+ "ok": true,
682
+ "tokens": 104,
683
+ "tail": "ys = 267 tarts\n\nStep 3: Calculate leftover tarts.\nLeftover = Total produced \u2212 Total sold = 360 \u2212 267 = 93 tarts\n\n#### 93",
684
+ "difficulty": 0,
685
+ "is_noop": false,
686
+ "template": "bakery_orders"
687
+ },
688
+ {
689
+ "src": "v31_gsm_symbolic/bakery_orders/p1",
690
+ "gold": "196",
691
+ "ok": false,
692
+ "tokens": 181,
693
+ "tail": "y with daily calculation:\n- Daily extra = 150 \u2212 112 \u2212 10 = 28 cakes per day\n- 7-day extra = 28 \u00d7 7 = 196 cakes\n\n#### 196",
694
+ "difficulty": 1,
695
+ "is_noop": false,
696
+ "template": "bakery_orders"
697
+ },
698
+ {
699
+ "src": "v31_gsm_symbolic/percentage_compose/p0",
700
+ "gold": "40",
701
+ "ok": true,
702
+ "tokens": 25,
703
+ "tail": "20% of 200 marbles = 0.20 \u00d7 200 = 40 marbles\n\n#### 40",
704
+ "difficulty": 0,
705
+ "is_noop": false,
706
+ "template": "percentage_compose"
707
+ }
708
+ ]
709
+ },
710
+ "v31_math_competition": {
711
+ "n": 16,
712
+ "correct": 14,
713
+ "pass_frac": 0.875,
714
+ "completion_tokens": 3188,
715
+ "mean_gen_tokens_correct": 204.9,
716
+ "items": [
717
+ {
718
+ "src": "v31_math_competition/number_theory/gcd_three",
719
+ "gold": "8",
720
+ "ok": true,
721
+ "tokens": 186,
722
+ "tail": "4: Calculate the gcd**\n\ngcd(24, 48, 40) = 2\u00b3 = 8\n\n**Verification:**\n- 24 \u00f7 8 = 3 \u2713\n- 48 \u00f7 8 = 6 \u2713\n- 40 \u00f7 8 = 5 \u2713\n\n#### 8",
723
+ "family": "number_theory",
724
+ "template": "gcd_three"
725
+ },
726
+ {
727
+ "src": "v31_math_competition/geometry/composite_area",
728
+ "gold": "29",
729
+ "ok": true,
730
+ "tokens": 110,
731
+ "tail": "ped figure**\nArea of L-shaped figure = Area of rectangle \u2212 Area of square\nArea of L-shaped figure = 33 \u2212 4 = 29\n\n#### 29",
732
+ "family": "geometry",
733
+ "template": "composite_area"
734
+ },
735
+ {
736
+ "src": "v31_math_competition/algebra/quadratic_sum_roots",
737
+ "gold": "-50",
738
+ "ok": true,
739
+ "tokens": 249,
740
+ "tail": "0\nDivide by 2: x\u00b2 + 5x - 50 = 0\nFactor: (x + 10)(x - 5) = 0\nRoots: x = -10 and x = 5\nProduct: (-10)(5) = -50 \u2713\n\n#### -50",
741
+ "family": "algebra",
742
+ "template": "quadratic_sum_roots"
743
+ },
744
+ {
745
+ "src": "v31_math_competition/combinatorics/binomial_compute",
746
+ "gold": "165",
747
+ "ok": true,
748
+ "tokens": 291,
749
+ "tail": "te the numerator**\n$$11 \\times 10 = 110$$\n$$110 \\times 9 = 990$$\n\n**Step 8: Divide by 6**\n$$990 \\div 6 = 165$$\n\n#### 165",
750
+ "family": "combinatorics",
751
+ "template": "binomial_compute"
752
+ },
753
+ {
754
+ "src": "v31_math_competition/combinatorics/arrangements_constraint",
755
+ "gold": "480",
756
+ "ok": false,
757
+ "tokens": 135,
758
+ "tail": "48\n\n## Step 3: Calculate arrangements where A and B are NOT adjacent\nTotal valid arrangements = 720 - 48 = 672\n\n#### 672",
759
+ "family": "combinatorics",
760
+ "template": "arrangements_constraint"
761
+ },
762
+ {
763
+ "src": "v31_math_competition/combinatorics/binomial_compute",
764
+ "gold": "56",
765
+ "ok": true,
766
+ "tokens": 286,
767
+ "tail": ": Simplify**\n40,320 \u00f7 720 = 56\n\nAlternatively, using symmetry: C(8, 5) = C(8, 3) = 8 \u00d7 7 \u00d7 6 / (3 \u00d7 2 \u00d7 1) = 56\n\n#### 56",
768
+ "family": "combinatorics",
769
+ "template": "binomial_compute"
770
+ },
771
+ {
772
+ "src": "v31_math_competition/algebra/linear_system_2x2",
773
+ "gold": "-7",
774
+ "ok": true,
775
+ "tokens": 144,
776
+ "tail": "ep 3: Verify with the first equation**\n\n1 + 2(-7) = 1 - 14 = -13 \u2713\n\n**Step 4: Calculate xy**\n\nxy = (1)(-7) = -7\n\n#### -7",
777
+ "family": "algebra",
778
+ "template": "linear_system_2x2"
779
+ },
780
+ {
781
+ "src": "v31_math_competition/geometry/composite_area",
782
+ "gold": "75",
783
+ "ok": true,
784
+ "tokens": 110,
785
+ "tail": "d figure**\nArea of L-shaped figure = Area of rectangle \u2212 Area of square\nArea of L-shaped figure = 100 \u2212 25 = 75\n\n#### 75",
786
+ "family": "geometry",
787
+ "template": "composite_area"
788
+ },
789
+ {
790
+ "src": "v31_math_competition/algebra/quadratic_sum_roots",
791
+ "gold": "-6",
792
+ "ok": true,
793
+ "tokens": 250,
794
+ "tail": "\n- Divide by 2: x\u00b2 + 5x - 6 = 0\n- Factor: (x + 6)(x - 1) = 0\n- Roots: x = -6 and x = 1\n\nProduct: (-6)(1) = -6 \u2713\n\n#### -6",
795
+ "family": "algebra",
796
+ "template": "quadratic_sum_roots"
797
+ },
798
+ {
799
+ "src": "v31_math_competition/probability/probability_fraction",
800
+ "gold": "5/32",
801
+ "ok": false,
802
+ "tokens": 184,
803
+ "tail": "ction**\n\n5 and 32 share no common factors (5 is prime, 32 = 2^5), so the fraction is already in lowest terms.\n\n#### 8/32",
804
+ "family": "probability",
805
+ "template": "probability_fraction"
806
+ },
807
+ {
808
+ "src": "v31_math_competition/probability/probability_fraction",
809
+ "gold": "1/36",
810
+ "ok": true,
811
+ "tokens": 154,
812
+ "tail": "utcome.\n\nThe probability is:\n$$\\frac{1}{36}$$\n\nThis fraction is already in lowest terms since gcd(1, 36) = 1.\n\n#### 1/36",
813
+ "family": "probability",
814
+ "template": "probability_fraction"
815
+ },
816
+ {
817
+ "src": "v31_math_competition/number_theory/gcd_three",
818
+ "gold": "6",
819
+ "ok": true,
820
+ "tokens": 258,
821
+ "tail": ": Calculate the GCD**\n\nGCD = 2\u00b9 \u00d7 3\u00b9 = 2 \u00d7 3 = 6\n\n**Verification:**\n- 18 \u00f7 6 = 3 \u2713\n- 24 \u00f7 6 = 4 \u2713\n- 18 \u00f7 6 = 3 \u2713\n\n#### 6",
822
+ "family": "number_theory",
823
+ "template": "gcd_three"
824
+ },
825
+ {
826
+ "src": "v31_math_competition/probability/probability_fraction",
827
+ "gold": "5/36",
828
+ "ok": true,
829
+ "tokens": 239,
830
+ "tail": "torization of 5 is 5\u00b9.\n\nSince 5 and 36 share no common factors, the fraction 5/36 is already in lowest terms.\n\n#### 5/36",
831
+ "family": "probability",
832
+ "template": "probability_fraction"
833
+ },
834
+ {
835
+ "src": "v31_math_competition/combinatorics/arrangements_constraint",
836
+ "gold": "1440",
837
+ "ok": true,
838
+ "tokens": 213,
839
+ "tail": " valid arrangements\nTotal = (arrangements of 6 items) \u00d7 (arrangements within AB block)\nTotal = 720 \u00d7 2 = 1440\n\n#### 1440",
840
+ "family": "combinatorics",
841
+ "template": "arrangements_constraint"
842
+ },
843
+ {
844
+ "src": "v31_math_competition/algebra/quadratic_sum_roots",
845
+ "gold": "70",
846
+ "ok": true,
847
+ "tokens": 229,
848
+ "tail": "rify by factoring.\nx\u00b2 + 17x + 70 = 0\n(x + 7)(x + 10) = 0\nRoots: x = -7 and x = -10\nProduct: (-7) \u00d7 (-10) = 70 \u2713\n\n#### 70",
849
+ "family": "algebra",
850
+ "template": "quadratic_sum_roots"
851
+ },
852
+ {
853
+ "src": "v31_math_competition/probability/probability_fraction",
854
+ "gold": "7/10",
855
+ "ok": true,
856
+ "tokens": 150,
857
+ "tail": "n is simplified**\n- The greatest common divisor of 7 and 10 is 1\n- Therefore, 7/10 is already in lowest terms\n\n#### 7/10",
858
+ "family": "probability",
859
+ "template": "probability_fraction"
860
+ }
861
+ ]
862
+ },
863
+ "v31_math_robustness": {
864
+ "n": 16,
865
+ "correct": 5,
866
+ "pass_frac": 0.3125,
867
+ "completion_tokens": 2483,
868
+ "mean_gen_tokens_correct": 175.8,
869
+ "items": [
870
+ {
871
+ "src": "v31_math_robustness/context_pad/bakery_orders/padded",
872
+ "gold": "135",
873
+ "ok": false,
874
+ "tokens": 1,
875
+ "tail": "",
876
+ "perturbation": "context_pad",
877
+ "template": "bakery_orders/padded"
878
+ },
879
+ {
880
+ "src": "v31_math_robustness/numerical_swap/shopping_discount",
881
+ "gold": "87",
882
+ "ok": false,
883
+ "tokens": 172,
884
+ "tail": "Calculate total amount spent\nTotal = $60 + $3 = $63\n\nStep 6: Calculate money left\nMoney left = $150 \u2212 $63 = $87\n\n#### 87",
885
+ "perturbation": "numerical_swap",
886
+ "template": "shopping_discount"
887
+ },
888
+ {
889
+ "src": "v31_math_robustness/unit_swap/library_books/pounds",
890
+ "gold": "17",
891
+ "ok": false,
892
+ "tokens": 159,
893
+ "tail": "Alternatively: \u00a318 \u00d7 0.90 = \u00a316.20\n\nThe total fine is \u00a316.20, which rounds to \u00a316 when expressed as an integer.\n\n#### 16",
894
+ "perturbation": "unit_swap",
895
+ "template": "library_books/pounds"
896
+ },
897
+ {
898
+ "src": "v31_math_robustness/unit_swap/work_rate/euros",
899
+ "gold": "330",
900
+ "ok": false,
901
+ "tokens": 106,
902
+ "tail": "lculate total labels packed in 6 hours.\nTotal labels = Combined rate \u00d7 Time\nTotal labels = 55 \u00d7 6 = 330 labels\n\n#### 330",
903
+ "perturbation": "unit_swap",
904
+ "template": "work_rate/euros"
905
+ },
906
+ {
907
+ "src": "v31_math_robustness/context_pad/shopping_discount/padded",
908
+ "gold": "122",
909
+ "ok": false,
910
+ "tokens": 575,
911
+ "tail": "rounded answer, or perhaps the problem expects 121 (truncated). \n\nGiven standard rounding rules: 121.65 \u2192 122.\n\n#### 122",
912
+ "perturbation": "context_pad",
913
+ "template": "shopping_discount/padded"
914
+ },
915
+ {
916
+ "src": "v31_math_robustness/numerical_swap/garden_harvest",
917
+ "gold": "924",
918
+ "ok": true,
919
+ "tokens": 83,
920
+ "tail": "tal number of cucumbers harvested.\n- Each plant yields 7 cucumbers\n- Total cucumbers = 132 \u00d7 7 = 924 cucumbers\n\n#### 924",
921
+ "perturbation": "numerical_swap",
922
+ "template": "garden_harvest"
923
+ },
924
+ {
925
+ "src": "v31_math_robustness/numerical_swap/work_rate",
926
+ "gold": "200",
927
+ "ok": true,
928
+ "tokens": 100,
929
+ "tail": "4: Calculate total envelopes packed in 5 hours.\nTotal envelopes = 40 envelopes/hour \u00d7 5 hours = 200 envelopes.\n\n#### 200",
930
+ "perturbation": "numerical_swap",
931
+ "template": "work_rate"
932
+ },
933
+ {
934
+ "src": "v31_math_robustness/context_pad/percentage_compose/padded",
935
+ "gold": "60",
936
+ "ok": false,
937
+ "tokens": 1,
938
+ "tail": "",
939
+ "perturbation": "context_pad",
940
+ "template": "percentage_compose/padded"
941
+ },
942
+ {
943
+ "src": "v31_math_robustness/unit_swap/percentage_compose/pounds",
944
+ "gold": "75",
945
+ "ok": false,
946
+ "tokens": 1,
947
+ "tail": "",
948
+ "perturbation": "unit_swap",
949
+ "template": "percentage_compose/pounds"
950
+ },
951
+ {
952
+ "src": "v31_math_robustness/context_pad/travel_distance/padded",
953
+ "gold": "520",
954
+ "ok": false,
955
+ "tokens": 123,
956
+ "tail": "0 mph \u00d7 4 hours = 240 miles\n\n**Step 4: Calculate total distance**\nTotal distance = 160 + 120 + 240 = 440 miles\n\n#### 440",
957
+ "perturbation": "context_pad",
958
+ "template": "travel_distance/padded"
959
+ },
960
+ {
961
+ "src": "v31_math_robustness/context_pad/travel_distance/padded",
962
+ "gold": "525",
963
+ "ok": false,
964
+ "tokens": 204,
965
+ "tail": "200 + 65 = 525 miles\n\nNote: The information about the bus route and documentary is irrelevant to this problem.\n\n#### 525",
966
+ "perturbation": "context_pad",
967
+ "template": "travel_distance/padded"
968
+ },
969
+ {
970
+ "src": "v31_math_robustness/topical_distractor/work_rate/noop",
971
+ "gold": "250",
972
+ "ok": true,
973
+ "tokens": 161,
974
+ "tail": "on asks \"how many widgets do they pack?\" \u2014 so we don't need the weight information for this specific question.\n\n#### 250",
975
+ "perturbation": "topical_distractor",
976
+ "template": "work_rate/noop"
977
+ },
978
+ {
979
+ "src": "v31_math_robustness/numerical_swap/classroom_supplies",
980
+ "gold": "100",
981
+ "ok": false,
982
+ "tokens": 177,
983
+ "tail": " separately:\n- Pencils: 20 \u00d7 2 = 40\n- Notebooks: 20 \u00d7 1 = 20\n- Erasers: 20 \u00d7 2 = 40\n\nTotal: 40 + 20 + 40 = 100\n\n#### 100",
984
+ "perturbation": "numerical_swap",
985
+ "template": "classroom_supplies"
986
+ },
987
+ {
988
+ "src": "v31_math_robustness/context_pad/work_rate/padded",
989
+ "gold": "465",
990
+ "ok": true,
991
+ "tokens": 433,
992
+ "tail": "ombined rate \u00d7 Working time\nTotal boxes = 60 boxes/hour \u00d7 7.75 hours\nTotal boxes = 60 \u00d7 7.75\nTotal boxes = 465\n\n#### 465",
993
+ "perturbation": "context_pad",
994
+ "template": "work_rate/padded"
995
+ },
996
+ {
997
+ "src": "v31_math_robustness/digit_expand/garden_harvest/scaled",
998
+ "gold": "5040",
999
+ "ok": false,
1000
+ "tokens": 85,
1001
+ "tail": "nd the total number of onions harvested.\n- Each plant yields 60 onions\n- Total onions = 84 \u00d7 60 = 5,040 onions\n\n#### 504",
1002
+ "perturbation": "digit_expand",
1003
+ "template": "garden_harvest/scaled"
1004
+ },
1005
+ {
1006
+ "src": "v31_math_robustness/topical_distractor/garden_harvest/noop",
1007
+ "gold": "135",
1008
+ "ok": true,
1009
+ "tokens": 102,
1010
+ "tail": "35 tomatoes\n\nNote: The information about the store selling tomatoes for 2 years is irrelevant to this problem.\n\n#### 135",
1011
+ "perturbation": "topical_distractor",
1012
+ "template": "garden_harvest/noop"
1013
+ }
1014
+ ]
1015
+ },
1016
+ "v31_code_humaneval_plus": {
1017
+ "n": 16,
1018
+ "correct": 16,
1019
+ "pass_frac": 1.0,
1020
+ "completion_tokens": 503,
1021
+ "mean_gen_tokens_correct": 31.4,
1022
+ "items": [
1023
+ {
1024
+ "src": "v31_code_humaneval_plus/max_consecutive_run",
1025
+ "task_id": "v31_codeplus/max_consecutive_run_dfwwli",
1026
+ "entry_point": "max_consecutive_run_dfwwli",
1027
+ "ok": true,
1028
+ "tokens": 79,
1029
+ "n_test_cases": 46,
1030
+ "template": "max_consecutive_run",
1031
+ "tail": "] == values[i-1]:\n current_run += 1\n max_run = max(max_run, current_run)\n else:\n current_run = 1\n \n return max_run"
1032
+ },
1033
+ {
1034
+ "src": "v31_code_humaneval_plus/dict_value_sum",
1035
+ "task_id": "v31_codeplus/sum_dict_values_mxrgea",
1036
+ "entry_point": "sum_dict_values_mxrgea",
1037
+ "ok": true,
1038
+ "tokens": 29,
1039
+ "n_test_cases": 51,
1040
+ "template": "dict_value_sum",
1041
+ "tail": " total = 0\n for value in d.values():\n if isinstance(value, int):\n total += value\n return total"
1042
+ },
1043
+ {
1044
+ "src": "v31_code_humaneval_plus/max_consecutive_run",
1045
+ "task_id": "v31_codeplus/max_consecutive_run_mydnix",
1046
+ "entry_point": "max_consecutive_run_mydnix",
1047
+ "ok": true,
1048
+ "tokens": 79,
1049
+ "n_test_cases": 47,
1050
+ "template": "max_consecutive_run",
1051
+ "tail": "i] == items[i-1]:\n current_run += 1\n max_run = max(max_run, current_run)\n else:\n current_run = 1\n \n return max_run"
1052
+ },
1053
+ {
1054
+ "src": "v31_code_humaneval_plus/filter_above",
1055
+ "task_id": "v31_codeplus/filter_above_googqy",
1056
+ "entry_point": "filter_above_googqy",
1057
+ "ok": true,
1058
+ "tokens": 14,
1059
+ "n_test_cases": 58,
1060
+ "template": "filter_above",
1061
+ "tail": " return [x for x in nums if x > limit]"
1062
+ },
1063
+ {
1064
+ "src": "v31_code_humaneval_plus/max_consecutive_run",
1065
+ "task_id": "v31_codeplus/max_consecutive_run_synmlv",
1066
+ "entry_point": "max_consecutive_run_synmlv",
1067
+ "ok": true,
1068
+ "tokens": 79,
1069
+ "n_test_cases": 46,
1070
+ "template": "max_consecutive_run",
1071
+ "tail": "i] == items[i-1]:\n current_run += 1\n max_run = max(max_run, current_run)\n else:\n current_run = 1\n \n return max_run"
1072
+ },
1073
+ {
1074
+ "src": "v31_code_humaneval_plus/dict_value_sum",
1075
+ "task_id": "v31_codeplus/sum_dict_values_remcab",
1076
+ "entry_point": "sum_dict_values_remcab",
1077
+ "ok": true,
1078
+ "tokens": 17,
1079
+ "n_test_cases": 48,
1080
+ "template": "dict_value_sum",
1081
+ "tail": " return sum(v for v in mapping.values() if isinstance(v, int))"
1082
+ },
1083
+ {
1084
+ "src": "v31_code_humaneval_plus/is_palindrome",
1085
+ "task_id": "v31_codeplus/is_palindrome_ubxezr",
1086
+ "entry_point": "is_palindrome_ubxezr",
1087
+ "ok": true,
1088
+ "tokens": 10,
1089
+ "n_test_cases": 48,
1090
+ "template": "is_palindrome",
1091
+ "tail": " return s == s[::-1]"
1092
+ },
1093
+ {
1094
+ "src": "v31_code_humaneval_plus/count_in_list",
1095
+ "task_id": "v31_codeplus/count_target_uxjkbu",
1096
+ "entry_point": "count_target_uxjkbu",
1097
+ "ok": true,
1098
+ "tokens": 28,
1099
+ "n_test_cases": 63,
1100
+ "template": "count_in_list",
1101
+ "tail": " count = 0\n for item in items:\n if item == target:\n count += 1\n return count"
1102
+ },
1103
+ {
1104
+ "src": "v31_code_humaneval_plus/count_in_list",
1105
+ "task_id": "v31_codeplus/count_target_dtmhtx",
1106
+ "entry_point": "count_target_dtmhtx",
1107
+ "ok": true,
1108
+ "tokens": 28,
1109
+ "n_test_cases": 51,
1110
+ "template": "count_in_list",
1111
+ "tail": " count = 0\n for value in values:\n if value == target:\n count += 1\n return count"
1112
+ },
1113
+ {
1114
+ "src": "v31_code_humaneval_plus/count_in_list",
1115
+ "task_id": "v31_codeplus/count_target_ojdeyc",
1116
+ "entry_point": "count_target_ojdeyc",
1117
+ "ok": true,
1118
+ "tokens": 28,
1119
+ "n_test_cases": 64,
1120
+ "template": "count_in_list",
1121
+ "tail": " count = 0\n for num in nums:\n if num == target:\n count += 1\n return count"
1122
+ },
1123
+ {
1124
+ "src": "v31_code_humaneval_plus/filter_above",
1125
+ "task_id": "v31_codeplus/filter_above_kemaob",
1126
+ "entry_point": "filter_above_kemaob",
1127
+ "ok": true,
1128
+ "tokens": 14,
1129
+ "n_test_cases": 51,
1130
+ "template": "filter_above",
1131
+ "tail": " return [x for x in nums if x > limit]"
1132
+ },
1133
+ {
1134
+ "src": "v31_code_humaneval_plus/dict_value_sum",
1135
+ "task_id": "v31_codeplus/sum_dict_values_pasrzu",
1136
+ "entry_point": "sum_dict_values_pasrzu",
1137
+ "ok": true,
1138
+ "tokens": 17,
1139
+ "n_test_cases": 43,
1140
+ "template": "dict_value_sum",
1141
+ "tail": " return sum(v for v in data.values() if isinstance(v, int))"
1142
+ },
1143
+ {
1144
+ "src": "v31_code_humaneval_plus/filter_above",
1145
+ "task_id": "v31_codeplus/filter_above_lwuskg",
1146
+ "entry_point": "filter_above_lwuskg",
1147
+ "ok": true,
1148
+ "tokens": 15,
1149
+ "n_test_cases": 60,
1150
+ "template": "filter_above",
1151
+ "tail": " return [x for x in items if x > min_val]"
1152
+ },
1153
+ {
1154
+ "src": "v31_code_humaneval_plus/count_in_list",
1155
+ "task_id": "v31_codeplus/count_target_lhsguo",
1156
+ "entry_point": "count_target_lhsguo",
1157
+ "ok": true,
1158
+ "tokens": 28,
1159
+ "n_test_cases": 51,
1160
+ "template": "count_in_list",
1161
+ "tail": " count = 0\n for item in items:\n if item == target:\n count += 1\n return count"
1162
+ },
1163
+ {
1164
+ "src": "v31_code_humaneval_plus/is_palindrome",
1165
+ "task_id": "v31_codeplus/is_palindrome_iwkdug",
1166
+ "entry_point": "is_palindrome_iwkdug",
1167
+ "ok": true,
1168
+ "tokens": 10,
1169
+ "n_test_cases": 48,
1170
+ "template": "is_palindrome",
1171
+ "tail": " return text == text[::-1]"
1172
+ },
1173
+ {
1174
+ "src": "v31_code_humaneval_plus/count_in_list",
1175
+ "task_id": "v31_codeplus/count_target_qzoiwu",
1176
+ "entry_point": "count_target_qzoiwu",
1177
+ "ok": true,
1178
+ "tokens": 28,
1179
+ "n_test_cases": 57,
1180
+ "template": "count_in_list",
1181
+ "tail": " count = 0\n for num in nums:\n if num == target:\n count += 1\n return count"
1182
+ }
1183
+ ]
1184
+ },
1185
+ "v31_ifeval_verifiable": {
1186
+ "n": 16,
1187
+ "correct": 10,
1188
+ "pass_frac": 0.625,
1189
+ "completion_tokens": 4148,
1190
+ "mean_gen_tokens_correct": 274.0,
1191
+ "items": [
1192
+ {
1193
+ "src": "v31_ifeval/stack2",
1194
+ "instruction_ids": [
1195
+ "change_case:english_lowercase",
1196
+ "keywords:forbidden_words"
1197
+ ],
1198
+ "per_instruction": [
1199
+ true,
1200
+ true
1201
+ ],
1202
+ "stack_depth": 2,
1203
+ "ok": true,
1204
+ "tokens": 290,
1205
+ "tail": "he first warm days of spring, when the beekeeper must split strong colonies and ensure they have enough space to expand."
1206
+ },
1207
+ {
1208
+ "src": "v31_ifeval/stack1",
1209
+ "instruction_ids": [
1210
+ "detectable_format:json_format"
1211
+ ],
1212
+ "per_instruction": [
1213
+ true
1214
+ ],
1215
+ "stack_depth": 1,
1216
+ "ok": true,
1217
+ "tokens": 237,
1218
+ "tail": "but none have captured the same authentic sense of place that this Saturday tradition has cultivated over two decades.\"}"
1219
+ },
1220
+ {
1221
+ "src": "v31_ifeval/stack2",
1222
+ "instruction_ids": [
1223
+ "length_constraints:number_paragraphs",
1224
+ "change_case:capital_word_frequency"
1225
+ ],
1226
+ "per_instruction": [
1227
+ true,
1228
+ true
1229
+ ],
1230
+ "stack_depth": 2,
1231
+ "ok": true,
1232
+ "tokens": 278,
1233
+ "tail": "n need several practice sessions before riding independently, so remain encouraging and positive throughout the journey."
1234
+ },
1235
+ {
1236
+ "src": "v31_ifeval/stack4",
1237
+ "instruction_ids": [
1238
+ "keywords:forbidden_words",
1239
+ "change_case:capital_word_frequency",
1240
+ "keywords:frequency",
1241
+ "startend:end_checker"
1242
+ ],
1243
+ "per_instruction": [
1244
+ true,
1245
+ false,
1246
+ true,
1247
+ true
1248
+ ],
1249
+ "stack_depth": 4,
1250
+ "ok": false,
1251
+ "tokens": 233,
1252
+ "tail": "g remind us that nature finds a way to flourish everywhere, even in the heart of a bustling metropolis. End of response."
1253
+ },
1254
+ {
1255
+ "src": "v31_ifeval/stack3",
1256
+ "instruction_ids": [
1257
+ "length_constraints:number_paragraphs",
1258
+ "detectable_format:number_highlighted_sections",
1259
+ "keywords:letter_frequency"
1260
+ ],
1261
+ "per_instruction": [
1262
+ true,
1263
+ true,
1264
+ true
1265
+ ],
1266
+ "stack_depth": 3,
1267
+ "ok": true,
1268
+ "tokens": 244,
1269
+ "tail": "splanted populations* adapt to new environments and whether they might eventually develop migratory habits of their own."
1270
+ },
1271
+ {
1272
+ "src": "v31_ifeval/stack4",
1273
+ "instruction_ids": [
1274
+ "detectable_format:number_highlighted_sections",
1275
+ "keywords:letter_frequency",
1276
+ "keywords:existence",
1277
+ "change_case:english_lowercase"
1278
+ ],
1279
+ "per_instruction": [
1280
+ true,
1281
+ true,
1282
+ true,
1283
+ true
1284
+ ],
1285
+ "stack_depth": 4,
1286
+ "ok": true,
1287
+ "tokens": 270,
1288
+ "tail": "hat each spring brings new varieties of vegetables, fruits, and flowers that reflect the changing landscape around them."
1289
+ },
1290
+ {
1291
+ "src": "v31_ifeval/stack1",
1292
+ "instruction_ids": [
1293
+ "punctuation:no_comma"
1294
+ ],
1295
+ "per_instruction": [
1296
+ true
1297
+ ],
1298
+ "stack_depth": 1,
1299
+ "ok": true,
1300
+ "tokens": 280,
1301
+ "tail": "y since any failure in the mountain could endanger thousands of workers and disrupt vital European transportation links."
1302
+ },
1303
+ {
1304
+ "src": "v31_ifeval/stack1",
1305
+ "instruction_ids": [
1306
+ "change_case:capital_word_frequency"
1307
+ ],
1308
+ "per_instruction": [
1309
+ true
1310
+ ],
1311
+ "stack_depth": 1,
1312
+ "ok": true,
1313
+ "tokens": 235,
1314
+ "tail": "ing. With consistent practice and a positive attitude, most children will be riding confidently and joyfully in no time."
1315
+ },
1316
+ {
1317
+ "src": "v31_ifeval/stack1",
1318
+ "instruction_ids": [
1319
+ "detectable_format:number_bullet_lists"
1320
+ ],
1321
+ "per_instruction": [
1322
+ true
1323
+ ],
1324
+ "stack_depth": 1,
1325
+ "ok": true,
1326
+ "tokens": 300,
1327
+ "tail": "ons with high overburden pressure and complex fault zones that have already caused significant delays and cost overruns."
1328
+ },
1329
+ {
1330
+ "src": "v31_ifeval/stack4",
1331
+ "instruction_ids": [
1332
+ "change_case:english_capital",
1333
+ "punctuation:no_comma",
1334
+ "detectable_content:number_placeholders",
1335
+ "detectable_format:number_highlighted_sections"
1336
+ ],
1337
+ "per_instruction": [
1338
+ false,
1339
+ true,
1340
+ false,
1341
+ false
1342
+ ],
1343
+ "stack_depth": 4,
1344
+ "ok": false,
1345
+ "tokens": 1,
1346
+ "tail": ""
1347
+ },
1348
+ {
1349
+ "src": "v31_ifeval/stack3",
1350
+ "instruction_ids": [
1351
+ "detectable_format:number_bullet_lists",
1352
+ "detectable_content:number_placeholders",
1353
+ "detectable_format:multiple_sections"
1354
+ ],
1355
+ "per_instruction": [
1356
+ true,
1357
+ true,
1358
+ true
1359
+ ],
1360
+ "stack_depth": 3,
1361
+ "ok": true,
1362
+ "tokens": 247,
1363
+ "tail": "more moisture\n* Attract pollinators by planting lavender, sunflowers, or bee-friendly flowers near your balcony entrance"
1364
+ },
1365
+ {
1366
+ "src": "v31_ifeval/stack4",
1367
+ "instruction_ids": [
1368
+ "detectable_format:multiple_sections",
1369
+ "detectable_format:title",
1370
+ "change_case:capital_word_frequency",
1371
+ "keywords:forbidden_words"
1372
+ ],
1373
+ "per_instruction": [
1374
+ true,
1375
+ true,
1376
+ false,
1377
+ true
1378
+ ],
1379
+ "stack_depth": 4,
1380
+ "ok": false,
1381
+ "tokens": 199,
1382
+ "tail": "akers for preservation and tourism purposes, the romantic era of solitary vigil has largely faded into maritime history."
1383
+ },
1384
+ {
1385
+ "src": "v31_ifeval/stack3",
1386
+ "instruction_ids": [
1387
+ "detectable_format:multiple_sections",
1388
+ "detectable_content:number_placeholders",
1389
+ "length_constraints:number_paragraphs"
1390
+ ],
1391
+ "per_instruction": [
1392
+ true,
1393
+ true,
1394
+ false
1395
+ ],
1396
+ "stack_depth": 3,
1397
+ "ok": false,
1398
+ "tokens": 420,
1399
+ "tail": "own's daily rhythm. The main street outside remains empty, but within these walls, the day has already begun in earnest."
1400
+ },
1401
+ {
1402
+ "src": "v31_ifeval/stack3",
1403
+ "instruction_ids": [
1404
+ "detectable_format:title",
1405
+ "keywords:forbidden_words",
1406
+ "detectable_format:number_highlighted_sections"
1407
+ ],
1408
+ "per_instruction": [
1409
+ true,
1410
+ true,
1411
+ false
1412
+ ],
1413
+ "stack_depth": 3,
1414
+ "ok": false,
1415
+ "tokens": 219,
1416
+ "tail": "ivation proves that you don't need acreage to experience the profound connection between human hands and growing things."
1417
+ },
1418
+ {
1419
+ "src": "v31_ifeval/stack3",
1420
+ "instruction_ids": [
1421
+ "keywords:letter_frequency",
1422
+ "startend:quotation",
1423
+ "change_case:english_capital"
1424
+ ],
1425
+ "per_instruction": [
1426
+ true,
1427
+ false,
1428
+ true
1429
+ ],
1430
+ "stack_depth": 3,
1431
+ "ok": false,
1432
+ "tokens": 336,
1433
+ "tail": "BORHOOD MEMBERS WHO UNDERSTAND THE IMPORTANCE OF KEEPING THIS SMALL BUT VITAL INSTITUTION ALIVE FOR GENERATIONS TO COME."
1434
+ },
1435
+ {
1436
+ "src": "v31_ifeval/stack2",
1437
+ "instruction_ids": [
1438
+ "keywords:frequency",
1439
+ "detectable_content:postscript"
1440
+ ],
1441
+ "per_instruction": [
1442
+ true,
1443
+ true
1444
+ ],
1445
+ "stack_depth": 2,
1446
+ "ok": true,
1447
+ "tokens": 359,
1448
+ "tail": "ht, so I'm currently riding with one foot slightly elevated, which makes me look like a particularly enthusiastic heron."
1449
+ }
1450
+ ]
1451
+ },
1452
+ "v31_reasoning_logic_grid": {
1453
+ "n": 16,
1454
+ "correct": 9,
1455
+ "pass_frac": 0.5625,
1456
+ "completion_tokens": 66,
1457
+ "mean_gen_tokens_correct": 4.1,
1458
+ "items": [
1459
+ {
1460
+ "src": "v31_logic_grid/4x4",
1461
+ "gold": "yellow",
1462
+ "ok": false,
1463
+ "tokens": 4,
1464
+ "tail": "Answer: blue",
1465
+ "num_people": 4,
1466
+ "num_attrs": 4,
1467
+ "num_clues": 17
1468
+ },
1469
+ {
1470
+ "src": "v31_logic_grid/3x3",
1471
+ "gold": "pasta",
1472
+ "ok": true,
1473
+ "tokens": 4,
1474
+ "tail": "Answer: pasta",
1475
+ "num_people": 3,
1476
+ "num_attrs": 3,
1477
+ "num_clues": 10
1478
+ },
1479
+ {
1480
+ "src": "v31_logic_grid/4x4",
1481
+ "gold": "lizard",
1482
+ "ok": true,
1483
+ "tokens": 5,
1484
+ "tail": "Answer: lizard",
1485
+ "num_people": 4,
1486
+ "num_attrs": 4,
1487
+ "num_clues": 17
1488
+ },
1489
+ {
1490
+ "src": "v31_logic_grid/3x3",
1491
+ "gold": "photography",
1492
+ "ok": true,
1493
+ "tokens": 4,
1494
+ "tail": "Answer: photography",
1495
+ "num_people": 3,
1496
+ "num_attrs": 3,
1497
+ "num_clues": 9
1498
+ },
1499
+ {
1500
+ "src": "v31_logic_grid/3x2",
1501
+ "gold": "bird",
1502
+ "ok": true,
1503
+ "tokens": 4,
1504
+ "tail": "Answer: bird",
1505
+ "num_people": 3,
1506
+ "num_attrs": 2,
1507
+ "num_clues": 6
1508
+ },
1509
+ {
1510
+ "src": "v31_logic_grid/3x2",
1511
+ "gold": "dog",
1512
+ "ok": true,
1513
+ "tokens": 4,
1514
+ "tail": "Answer: dog",
1515
+ "num_people": 3,
1516
+ "num_attrs": 2,
1517
+ "num_clues": 5
1518
+ },
1519
+ {
1520
+ "src": "v31_logic_grid/5x3",
1521
+ "gold": "blue",
1522
+ "ok": false,
1523
+ "tokens": 4,
1524
+ "tail": "Answer: purple",
1525
+ "num_people": 5,
1526
+ "num_attrs": 3,
1527
+ "num_clues": 13
1528
+ },
1529
+ {
1530
+ "src": "v31_logic_grid/3x2",
1531
+ "gold": "pasta",
1532
+ "ok": true,
1533
+ "tokens": 4,
1534
+ "tail": "Answer: pasta",
1535
+ "num_people": 3,
1536
+ "num_attrs": 2,
1537
+ "num_clues": 6
1538
+ },
1539
+ {
1540
+ "src": "v31_logic_grid/4x3",
1541
+ "gold": "reading",
1542
+ "ok": true,
1543
+ "tokens": 4,
1544
+ "tail": "Answer: reading",
1545
+ "num_people": 4,
1546
+ "num_attrs": 3,
1547
+ "num_clues": 14
1548
+ },
1549
+ {
1550
+ "src": "v31_logic_grid/4x4",
1551
+ "gold": "rabbit",
1552
+ "ok": true,
1553
+ "tokens": 4,
1554
+ "tail": "Answer: rabbit",
1555
+ "num_people": 4,
1556
+ "num_attrs": 4,
1557
+ "num_clues": 16
1558
+ },
1559
+ {
1560
+ "src": "v31_logic_grid/3x3",
1561
+ "gold": "cat",
1562
+ "ok": false,
1563
+ "tokens": 5,
1564
+ "tail": "Answer: hamster",
1565
+ "num_people": 3,
1566
+ "num_attrs": 3,
1567
+ "num_clues": 8
1568
+ },
1569
+ {
1570
+ "src": "v31_logic_grid/3x2",
1571
+ "gold": "dog",
1572
+ "ok": false,
1573
+ "tokens": 4,
1574
+ "tail": "Answer: fish",
1575
+ "num_people": 3,
1576
+ "num_attrs": 2,
1577
+ "num_clues": 3
1578
+ },
1579
+ {
1580
+ "src": "v31_logic_grid/4x4",
1581
+ "gold": "gardening",
1582
+ "ok": false,
1583
+ "tokens": 4,
1584
+ "tail": "Answer: knitting",
1585
+ "num_people": 4,
1586
+ "num_attrs": 4,
1587
+ "num_clues": 16
1588
+ },
1589
+ {
1590
+ "src": "v31_logic_grid/3x3",
1591
+ "gold": "pasta",
1592
+ "ok": false,
1593
+ "tokens": 4,
1594
+ "tail": "Answer: tacos",
1595
+ "num_people": 3,
1596
+ "num_attrs": 3,
1597
+ "num_clues": 10
1598
+ },
1599
+ {
1600
+ "src": "v31_logic_grid/3x3",
1601
+ "gold": "doctor",
1602
+ "ok": false,
1603
+ "tokens": 4,
1604
+ "tail": "Answer: teacher",
1605
+ "num_people": 3,
1606
+ "num_attrs": 3,
1607
+ "num_clues": 7
1608
+ },
1609
+ {
1610
+ "src": "v31_logic_grid/3x2",
1611
+ "gold": "salad",
1612
+ "ok": true,
1613
+ "tokens": 4,
1614
+ "tail": "Answer: salad",
1615
+ "num_people": 3,
1616
+ "num_attrs": 2,
1617
+ "num_clues": 6
1618
+ }
1619
+ ]
1620
+ },
1621
+ "v31_reasoning_dyval_arith": {
1622
+ "n": 16,
1623
+ "correct": 16,
1624
+ "pass_frac": 1.0,
1625
+ "completion_tokens": 3152,
1626
+ "mean_gen_tokens_correct": 197.0,
1627
+ "items": [
1628
+ {
1629
+ "src": "v31_dyval_arith/d4/nl_vars",
1630
+ "gold": "250",
1631
+ "ok": true,
1632
+ "tokens": 278,
1633
+ "tail": "3 = max(v11, v12) = max(10, 7) = 10\n\nv14 = (v10 * v13) = 5 * 10 = 50\n\nv15 = (v7 * v14) = 5 * 50 = 250\n\nFinal answer: 250",
1634
+ "depth": 4,
1635
+ "mode": "nl_vars"
1636
+ },
1637
+ {
1638
+ "src": "v31_dyval_arith/d2/nl_vars",
1639
+ "gold": "15",
1640
+ "ok": true,
1641
+ "tokens": 91,
1642
+ "tail": "t, let me calculate v2:\nv2 = (8 - 6) = 2\n\nFinally, let me calculate v3:\nv3 = (v1 + v2) = (13 + 2) = 15\n\nFinal answer: 15",
1643
+ "depth": 2,
1644
+ "mode": "nl_vars"
1645
+ },
1646
+ {
1647
+ "src": "v31_dyval_arith/d2/nl_vars",
1648
+ "gold": "-7",
1649
+ "ok": true,
1650
+ "tokens": 92,
1651
+ "tail": " let me calculate v2:\nv2 = (2 - 9) = -7\n\nNow, let me calculate v3:\nv3 = min(v1, v2) = min(11, -7) = -7\n\nFinal answer: -7",
1652
+ "depth": 2,
1653
+ "mode": "nl_vars"
1654
+ },
1655
+ {
1656
+ "src": "v31_dyval_arith/d4/nl_vars",
1657
+ "gold": "34",
1658
+ "ok": true,
1659
+ "tokens": 265,
1660
+ "tail": "1 * v12) = 13 * 64 = 832\n\nv14 = (v10 - v13) = -2 - 832 = -834\n\nv15 = max(v7, v14) = max(34, -834) = 34\n\nFinal answer: 34",
1661
+ "depth": 4,
1662
+ "mode": "nl_vars"
1663
+ },
1664
+ {
1665
+ "src": "v31_dyval_arith/d4/math",
1666
+ "gold": "-160",
1667
+ "ok": true,
1668
+ "tokens": 408,
1669
+ "tail": " 2)), min((3 + 9), max(6, 4))) = min(10, 6) = 6`\n\n**Step 5: Final subtraction**\n\n- `-154 - 6 = -160`\n\nFinal answer: -160",
1670
+ "depth": 4,
1671
+ "mode": "math"
1672
+ },
1673
+ {
1674
+ "src": "v31_dyval_arith/d3/math",
1675
+ "gold": "90",
1676
+ "ok": true,
1677
+ "tokens": 177,
1678
+ "tail": ":**\n- `max(7, 2) = 7`\n- `(2 + 4) = 6`\n- So: `min(7, 6) = 6`\n\n**Final evaluation:**\n- `max(90, 6) = 90`\n\nFinal answer: 90",
1679
+ "depth": 3,
1680
+ "mode": "math"
1681
+ },
1682
+ {
1683
+ "src": "v31_dyval_arith/d2/nl_vars",
1684
+ "gold": "72",
1685
+ "ok": true,
1686
+ "tokens": 93,
1687
+ "tail": " me calculate v2:\nv2 = min(6, 5) = 5\n\nFinally, let me calculate v3:\nv3 = max(v1, v2) = max(72, 5) = 72\n\nFinal answer: 72",
1688
+ "depth": 2,
1689
+ "mode": "nl_vars"
1690
+ },
1691
+ {
1692
+ "src": "v31_dyval_arith/d5/nl_vars",
1693
+ "gold": "33",
1694
+ "ok": true,
1695
+ "tokens": 547,
1696
+ "tail": "25, v28) = min(-4, 13) = -4\nv30 = (v22 * v29) = 3 * (-4) = -12\nv31 = max(v15, v30) = max(33, -12) = 33\n\nFinal answer: 33",
1697
+ "depth": 5,
1698
+ "mode": "nl_vars"
1699
+ },
1700
+ {
1701
+ "src": "v31_dyval_arith/d2/nl_vars",
1702
+ "gold": "0",
1703
+ "ok": true,
1704
+ "tokens": 92,
1705
+ "tail": "Next, let me calculate v2:\nv2 = (1 * 8) = 8\n\nFinally, let me calculate v3:\nv3 = (v1 - v2) = (8 - 8) = 0\n\nFinal answer: 0",
1706
+ "depth": 2,
1707
+ "mode": "nl_vars"
1708
+ },
1709
+ {
1710
+ "src": "v31_dyval_arith/d2/math",
1711
+ "gold": "3",
1712
+ "ok": true,
1713
+ "tokens": 99,
1714
+ "tail": " and 4 is 4)\n\nNow I can substitute these back into the expression:\n((1 - 2) + min(9, 4)) = (-1 + 4) = 3\n\nFinal answer: 3",
1715
+ "depth": 2,
1716
+ "mode": "math"
1717
+ },
1718
+ {
1719
+ "src": "v31_dyval_arith/d4/nl_vars",
1720
+ "gold": "9180",
1721
+ "ok": true,
1722
+ "tokens": 269,
1723
+ "tail": "11, v12) = max(4, 9) = 9\n\nv14 = (v10 * v13) = -20 * 9 = -180\n\nv15 = (v7 * v14) = -51 * (-180) = 9180\n\nFinal answer: 9180",
1724
+ "depth": 4,
1725
+ "mode": "nl_vars"
1726
+ },
1727
+ {
1728
+ "src": "v31_dyval_arith/d2/nl_vars",
1729
+ "gold": "3",
1730
+ "ok": true,
1731
+ "tokens": 93,
1732
+ "tail": ", let me calculate v2:\nv2 = (2 + 1) = 3\n\nFinally, let me calculate v3:\nv3 = min(v1, v2) = min(6, 3) = 3\n\nFinal answer: 3",
1733
+ "depth": 2,
1734
+ "mode": "nl_vars"
1735
+ },
1736
+ {
1737
+ "src": "v31_dyval_arith/d3/math",
1738
+ "gold": "54",
1739
+ "ok": true,
1740
+ "tokens": 163,
1741
+ "tail": "he main expression:\n\n((max(4, 9) * max(6, 3)) - min((9 - 9), (1 * 8)))\n= ((9 * 6) - 0)\n= (54 - 0)\n= 54\n\nFinal answer: 54",
1742
+ "depth": 3,
1743
+ "mode": "math"
1744
+ },
1745
+ {
1746
+ "src": "v31_dyval_arith/d3/math",
1747
+ "gold": "12",
1748
+ "ok": true,
1749
+ "tokens": 206,
1750
+ "tail": " 8) = 12`\n- So: `max(min(1, 7), (4 + 8)) = max(1, 12) = 12`\n\n**Final evaluation:**\n- `max(9, 12) = 12`\n\nFinal answer: 12",
1751
+ "depth": 3,
1752
+ "mode": "math"
1753
+ },
1754
+ {
1755
+ "src": "v31_dyval_arith/d3/nl_vars",
1756
+ "gold": "14",
1757
+ "ok": true,
1758
+ "tokens": 142,
1759
+ "tail": "8, 3) = 3\n\nv5 = (2 * 6) = 12\n\nv6 = (v4 - v5) = (3 - 12) = -9\n\nv7 = (v3 - v6) = (5 - (-9)) = 5 + 9 = 14\n\nFinal answer: 14",
1760
+ "depth": 3,
1761
+ "mode": "nl_vars"
1762
+ },
1763
+ {
1764
+ "src": "v31_dyval_arith/d2/math",
1765
+ "gold": "1",
1766
+ "ok": true,
1767
+ "tokens": 137,
1768
+ "tail": "(8 + 2):\n- 8 + 2 = 10\n\nNow I have:\n- min(10, 1)\n\nFinally:\n- min(10, 1) = 1 (since 1 is smaller than 10)\n\nFinal answer: 1",
1769
+ "depth": 2,
1770
+ "mode": "math"
1771
+ }
1772
+ ]
1773
+ },
1774
+ "v31_long_context_ruler": {
1775
+ "n": 16,
1776
+ "correct": 15,
1777
+ "pass_frac": 0.9375,
1778
+ "completion_tokens": 1878,
1779
+ "mean_gen_tokens_correct": 114.3,
1780
+ "items": [
1781
+ {
1782
+ "src": "v31_long_context_ruler/niah_multikey",
1783
+ "gold": "HN831",
1784
+ "ok": true,
1785
+ "tokens": 46,
1786
+ "tail": "o.\n\nLooking through the passage, I can find the answer directly stated:\n\n\"The code ZI2HMZ maps to HN831.\"\n\nAnswer: HN831",
1787
+ "task": "niah_multikey",
1788
+ "n_distractors": 77
1789
+ },
1790
+ {
1791
+ "src": "v31_long_context_ruler/niah_single",
1792
+ "gold": "HKRS4R",
1793
+ "ok": true,
1794
+ "tokens": 7,
1795
+ "tail": "Answer: HKRS4R",
1796
+ "task": "niah_single",
1797
+ "n_distractors": 146,
1798
+ "needle_pos": 104
1799
+ },
1800
+ {
1801
+ "src": "v31_long_context_ruler/multihop_var",
1802
+ "gold": "900",
1803
+ "ok": true,
1804
+ "tokens": 187,
1805
+ "tail": "= 900\n\n5. \"Variable v_nmf takes the same value as v_gr8.\"\n - v_nmf = v_gr8 = 900\n\nTherefore, v_nmf = 900.\n\nAnswer: 900",
1806
+ "task": "multihop_var",
1807
+ "n_distractors": 111
1808
+ },
1809
+ {
1810
+ "src": "v31_long_context_ruler/niah_single",
1811
+ "gold": "FIQPR6",
1812
+ "ok": true,
1813
+ "tokens": 7,
1814
+ "tail": "Answer: FIQPR6",
1815
+ "task": "niah_single",
1816
+ "n_distractors": 79,
1817
+ "needle_pos": 14
1818
+ },
1819
+ {
1820
+ "src": "v31_long_context_ruler/aggregation_count",
1821
+ "gold": "7",
1822
+ "ok": true,
1823
+ "tokens": 256,
1824
+ "tail": " briefly mentioned the word marigold in passing.\" (4)\n- Later: \"Someone briefly mentioned the word marigold in passing.\"",
1825
+ "task": "aggregation_count",
1826
+ "n_distractors": 125
1827
+ },
1828
+ {
1829
+ "src": "v31_long_context_ruler/multihop_var",
1830
+ "gold": "301",
1831
+ "ok": true,
1832
+ "tokens": 145,
1833
+ "tail": "kes the same value as v_6h8.\" \u2192 v_upd = 301\n- \"Variable v_y8v takes the same value as v_upd.\" \u2192 v_y8v = 301\n\nAnswer: 301",
1834
+ "task": "multihop_var",
1835
+ "n_distractors": 80
1836
+ },
1837
+ {
1838
+ "src": "v31_long_context_ruler/aggregation_count",
1839
+ "gold": "3",
1840
+ "ok": true,
1841
+ "tokens": 214,
1842
+ "tail": "Skyler walked a passing car for a few moments.\")\n\nI found exactly 3 occurrences of 'marigold' in the passage.\n\nAnswer: 3",
1843
+ "task": "aggregation_count",
1844
+ "n_distractors": 194
1845
+ },
1846
+ {
1847
+ "src": "v31_long_context_ruler/multihop_var",
1848
+ "gold": "250",
1849
+ "ok": true,
1850
+ "tokens": 119,
1851
+ "tail": "i5 = 250\n\n3. \"Variable v_yu0 takes the same value as v_ts5.\" \u2192 v_yu0 = v_ts5 = 250\n\nTherefore, v_yu0 = 250.\n\nAnswer: 250",
1852
+ "task": "multihop_var",
1853
+ "n_distractors": 111
1854
+ },
1855
+ {
1856
+ "src": "v31_long_context_ruler/niah_single",
1857
+ "gold": "0VJ0MN",
1858
+ "ok": true,
1859
+ "tokens": 9,
1860
+ "tail": "Answer: 0VJ0MN",
1861
+ "task": "niah_single",
1862
+ "n_distractors": 99,
1863
+ "needle_pos": 84
1864
+ },
1865
+ {
1866
+ "src": "v31_long_context_ruler/multihop_var",
1867
+ "gold": "352",
1868
+ "ok": true,
1869
+ "tokens": 192,
1870
+ "tail": "= 352\n\n5. `Variable v_aer takes the same value as v_qqg.`\n - v_aer = v_qqg = 352\n\nTherefore, v_aer = 352.\n\nAnswer: 352",
1871
+ "task": "multihop_var",
1872
+ "n_distractors": 101
1873
+ },
1874
+ {
1875
+ "src": "v31_long_context_ruler/aggregation_count",
1876
+ "gold": "8",
1877
+ "ok": true,
1878
+ "tokens": 256,
1879
+ "tail": "tornado in passing.\" (4)\n- Later: \"Someone briefly mentioned the word tornado in passing.\" (5)\n- Later: \"Someone briefly",
1880
+ "task": "aggregation_count",
1881
+ "n_distractors": 118
1882
+ },
1883
+ {
1884
+ "src": "v31_long_context_ruler/niah_multikey",
1885
+ "gold": "GZ9LL",
1886
+ "ok": true,
1887
+ "tokens": 52,
1888
+ "tail": "e, I can find this directly stated:\n\n\"The code VGIBYV maps to GZ9LL.\"\n\nThis appears early in the passage.\n\nAnswer: GZ9LL",
1889
+ "task": "niah_multikey",
1890
+ "n_distractors": 95
1891
+ },
1892
+ {
1893
+ "src": "v31_long_context_ruler/niah_multikey",
1894
+ "gold": "8LQ7X",
1895
+ "ok": true,
1896
+ "tokens": 47,
1897
+ "tail": "at code EPBMRB maps to.\n\nLooking through the passage, I find this line:\n\n\"The code EPBMRB maps to 8LQ7X.\"\n\nAnswer: 8LQ7X",
1898
+ "task": "niah_multikey",
1899
+ "n_distractors": 110
1900
+ },
1901
+ {
1902
+ "src": "v31_long_context_ruler/multihop_var",
1903
+ "gold": "401",
1904
+ "ok": true,
1905
+ "tokens": 123,
1906
+ "tail": "jx = 401\n\n3. \"Variable v_3rf takes the same value as v_9dz.\" \u2192 v_3rf = v_9dz = 401\n\nTherefore, v_3rf = 401.\n\nAnswer: 401",
1907
+ "task": "multihop_var",
1908
+ "n_distractors": 145
1909
+ },
1910
+ {
1911
+ "src": "v31_long_context_ruler/aggregation_count",
1912
+ "gold": "4",
1913
+ "ok": false,
1914
+ "tokens": 164,
1915
+ "tail": "mentioned the word porcelain in passing.\" (3)\n\nI don't see any other instances of 'porcelain' in the passage.\n\nAnswer: 3",
1916
+ "task": "aggregation_count",
1917
+ "n_distractors": 125
1918
+ },
1919
+ {
1920
+ "src": "v31_long_context_ruler/niah_multikey",
1921
+ "gold": "3B2YI",
1922
+ "ok": true,
1923
+ "tokens": 54,
1924
+ "tail": ", I find this explicit statement:\n\n\"The code JLAAJJ maps to 3B2YI.\"\n\nThis is directly stated in the text.\n\nAnswer: 3B2YI",
1925
+ "task": "niah_multikey",
1926
+ "n_distractors": 110
1927
+ }
1928
+ ]
1929
+ },
1930
+ "v31_knowledge_multi_hop_kg": {
1931
+ "n": 16,
1932
+ "correct": 16,
1933
+ "pass_frac": 1.0,
1934
+ "completion_tokens": 2305,
1935
+ "mean_gen_tokens_correct": 144.1,
1936
+ "items": [
1937
+ {
1938
+ "src": "v31_knowledge_multi_hop_kg/kg_location",
1939
+ "gold": "Place_RN8P",
1940
+ "ok": true,
1941
+ "tokens": 118,
1942
+ "tail": "ore step from Place_MSQ9, I reach Place_RN8P.\n\nAfter 2 steps up from Place_F2NF, I reach Place_RN8P.\n\nAnswer: Place_RN8P",
1943
+ "task": "kg_location"
1944
+ },
1945
+ {
1946
+ "src": "v31_knowledge_multi_hop_kg/kg_location",
1947
+ "gold": "Place_PZA6",
1948
+ "ok": true,
1949
+ "tokens": 75,
1950
+ "tail": "p 2: Place_FPZU is located inside Place_PZA6.\n\nAfter 2 steps up from Place_GKVZ, I reach Place_PZA6.\n\nAnswer: Place_PZA6",
1951
+ "task": "kg_location"
1952
+ },
1953
+ {
1954
+ "src": "v31_knowledge_multi_hop_kg/kg_employment",
1955
+ "gold": "Person_CX3Q",
1956
+ "ok": true,
1957
+ "tokens": 162,
1958
+ "tail": "at Person_CX3Q reports to someone)\n\nTherefore, Person_CX3Q is the most senior employee at Org_2H4Z.\n\nAnswer: Person_CX3Q",
1959
+ "task": "kg_employment"
1960
+ },
1961
+ {
1962
+ "src": "v31_knowledge_multi_hop_kg/kg_family",
1963
+ "gold": "Person_UECR",
1964
+ "ok": true,
1965
+ "tokens": 173,
1966
+ "tail": "rson_XHUB is the parent of Person_S8BA \u2713\n\nTherefore, Person_UECR is the grandparent of Person_S8BA.\n\nAnswer: Person_UECR",
1967
+ "task": "kg_family"
1968
+ },
1969
+ {
1970
+ "src": "v31_knowledge_multi_hop_kg/kg_family",
1971
+ "gold": "Person_U3DM",
1972
+ "ok": true,
1973
+ "tokens": 95,
1974
+ "tail": "_6TSU, who is the parent of Person_HZNK.\n\nTherefore, Person_U3DM is the grandparent of Person_HZNK.\n\nAnswer: Person_U3DM",
1975
+ "task": "kg_family"
1976
+ },
1977
+ {
1978
+ "src": "v31_knowledge_multi_hop_kg/kg_location",
1979
+ "gold": "Place_KX8M",
1980
+ "ok": true,
1981
+ "tokens": 112,
1982
+ "tail": "7RH, going up 1 more step reaches Place_KX8M.\n\nAfter 2 steps up from Place_DBDF, I reach Place_KX8M.\n\nAnswer: Place_KX8M",
1983
+ "task": "kg_location"
1984
+ },
1985
+ {
1986
+ "src": "v31_knowledge_multi_hop_kg/kg_family",
1987
+ "gold": "Person_JXXY",
1988
+ "ok": true,
1989
+ "tokens": 116,
1990
+ "tail": "erson_JXXY is the parent of Person_VQA6.\n\nTherefore, Person_JXXY is the grandparent of Person_VQZC.\n\nAnswer: Person_JXXY",
1991
+ "task": "kg_family"
1992
+ },
1993
+ {
1994
+ "src": "v31_knowledge_multi_hop_kg/kg_employment",
1995
+ "gold": "Person_V974",
1996
+ "ok": true,
1997
+ "tokens": 256,
1998
+ "tail": "rson_Z5FQ reports to Person_RF4K\n- Person_WJNY reports to Person_H9YF\n- Person_6EZD reports to Person_WJNY\n- Person_H9YF",
1999
+ "task": "kg_employment"
2000
+ },
2001
+ {
2002
+ "src": "v31_knowledge_multi_hop_kg/kg_location",
2003
+ "gold": "Place_2HRG",
2004
+ "ok": true,
2005
+ "tokens": 155,
2006
+ "tail": "HFR, going up 1 more step reaches Place_2HRG.\n\nAfter 3 steps up from Place_U3MJ, I reach Place_2HRG.\n\nAnswer: Place_2HRG",
2007
+ "task": "kg_location"
2008
+ },
2009
+ {
2010
+ "src": "v31_knowledge_multi_hop_kg/kg_family",
2011
+ "gold": "Person_4QBM",
2012
+ "ok": true,
2013
+ "tokens": 95,
2014
+ "tail": "rson_B5JD, who is the parent of Person_YDD7. This makes Person_4QBM the grandparent of Person_YDD7.\n\nAnswer: Person_4QBM",
2015
+ "task": "kg_family"
2016
+ },
2017
+ {
2018
+ "src": "v31_knowledge_multi_hop_kg/kg_family",
2019
+ "gold": "Person_RNSB",
2020
+ "ok": true,
2021
+ "tokens": 91,
2022
+ "tail": "_4XZK, who is the parent of Person_TH96.\n\nTherefore, Person_RNSB is the grandparent of Person_TH96.\n\nAnswer: Person_RNSB",
2023
+ "task": "kg_family"
2024
+ },
2025
+ {
2026
+ "src": "v31_knowledge_multi_hop_kg/kg_family",
2027
+ "gold": "Person_UGXM",
2028
+ "ok": true,
2029
+ "tokens": 166,
2030
+ "tail": "of Person_KHWC\n\nSo the grandchildren of Person_ZSSG are: Person_UGXM, Person_G2ZA, and Person_KHWC.\n\nAnswer: Person_UGXM",
2031
+ "task": "kg_family"
2032
+ },
2033
+ {
2034
+ "src": "v31_knowledge_multi_hop_kg/kg_employment",
2035
+ "gold": "Person_DBK9",
2036
+ "ok": true,
2037
+ "tokens": 256,
2038
+ "tail": "- Person_32HD reports to Person_DBK9\n\nSo Person_DBK9 has someone reporting to them. But does anyone report to Person_DBK",
2039
+ "task": "kg_employment"
2040
+ },
2041
+ {
2042
+ "src": "v31_knowledge_multi_hop_kg/kg_location",
2043
+ "gold": "Place_6R8W",
2044
+ "ok": true,
2045
+ "tokens": 122,
2046
+ "tail": "ore step from Place_7EV6, I reach Place_6R8W.\n\nAfter 2 steps up from Place_SEQE, I reach Place_6R8W.\n\nAnswer: Place_6R8W",
2047
+ "task": "kg_location"
2048
+ },
2049
+ {
2050
+ "src": "v31_knowledge_multi_hop_kg/kg_family",
2051
+ "gold": "Person_4AY4",
2052
+ "ok": true,
2053
+ "tokens": 166,
2054
+ "tail": "S\n- Person_4AY4 is parent of Person_2DYN\n\nTherefore, Person_4AY4 is the grandparent of Person_WTHS.\n\nAnswer: Person_4AY4",
2055
+ "task": "kg_family"
2056
+ },
2057
+ {
2058
+ "src": "v31_knowledge_multi_hop_kg/kg_employment",
2059
+ "gold": "Person_QS2G",
2060
+ "ok": true,
2061
+ "tokens": 147,
2062
+ "tail": "at Person_QS2G reports to someone)\n\nTherefore, Person_QS2G is the most senior employee at Org_PGAB.\n\nAnswer: Person_QS2G",
2063
+ "task": "kg_employment"
2064
+ }
2065
+ ]
2066
+ },
2067
+ "v31_truthfulness_calibration": {
2068
+ "n": 16,
2069
+ "correct": 16,
2070
+ "pass_frac": 1.0,
2071
+ "completion_tokens": 988,
2072
+ "mean_gen_tokens_correct": 61.8,
2073
+ "items": [
2074
+ {
2075
+ "src": "v31_truthfulness/determinate/_t_determinate_simple",
2076
+ "gold": "32",
2077
+ "verdict": "correct",
2078
+ "ok": true,
2079
+ "tokens": 52,
2080
+ "tail": "uys 4 oranges at $8 each.\n\nTotal cost = number of oranges \u00d7 price per orange\nTotal cost = 4 \u00d7 $8 = $32\n\nFinal answer: 32",
2081
+ "family": "determinate"
2082
+ },
2083
+ {
2084
+ "src": "v31_truthfulness/mixed/_t_mixed_extra_distractor",
2085
+ "gold": "18",
2086
+ "verdict": "correct",
2087
+ "ok": true,
2088
+ "tokens": 58,
2089
+ "tail": "information about the stationery shop being in business for 23 years is irrelevant to the calculation.\n\nFinal answer: 18",
2090
+ "family": "mixed"
2091
+ },
2092
+ {
2093
+ "src": "v31_truthfulness/mixed/_t_mixed_extra_distractor",
2094
+ "gold": "15",
2095
+ "verdict": "correct",
2096
+ "ok": true,
2097
+ "tokens": 59,
2098
+ "tail": " information about the stationery shop being in business for 8 years is irrelevant to the calculation.\n\nFinal answer: 15",
2099
+ "family": "mixed"
2100
+ },
2101
+ {
2102
+ "src": "v31_truthfulness/indeterminate/_t_missing_initial",
2103
+ "gold": "cannot determine",
2104
+ "verdict": "correct",
2105
+ "ok": true,
2106
+ "tokens": 65,
2107
+ "tail": " doesn't tell me Ian's initial amount of money, I cannot calculate how much he has left.\n\nFinal answer: cannot determine",
2108
+ "family": "indeterminate"
2109
+ },
2110
+ {
2111
+ "src": "v31_truthfulness/mixed/_t_mixed_relevant_filter",
2112
+ "gold": "60",
2113
+ "verdict": "correct",
2114
+ "ok": true,
2115
+ "tokens": 59,
2116
+ "tail": " \u00d7 $10 = $60\n\nLiam's age (39 years old) is irrelevant information that doesn't affect the calculation.\n\nFinal answer: 60",
2117
+ "family": "mixed"
2118
+ },
2119
+ {
2120
+ "src": "v31_truthfulness/determinate/_t_determinate_subtraction",
2121
+ "gold": "20",
2122
+ "verdict": "correct",
2123
+ "ok": true,
2124
+ "tokens": 39,
2125
+ "tail": "how much money Ben has left after spending some.\n\nBen starts with $30 and spends $10.\n\n$30 - $10 = $20\n\nFinal answer: 20",
2126
+ "family": "determinate"
2127
+ },
2128
+ {
2129
+ "src": "v31_truthfulness/mixed/_t_mixed_extra_distractor",
2130
+ "gold": "10",
2131
+ "verdict": "correct",
2132
+ "ok": true,
2133
+ "tokens": 59,
2134
+ "tail": " information about the stationery shop being in business for 5 years is irrelevant to the calculation.\n\nFinal answer: 10",
2135
+ "family": "mixed"
2136
+ },
2137
+ {
2138
+ "src": "v31_truthfulness/determinate/_t_determinate_simple",
2139
+ "gold": "60",
2140
+ "verdict": "correct",
2141
+ "ok": true,
2142
+ "tokens": 52,
2143
+ "tail": "stcards at $12 each.\n\nTotal cost = Number of postcards \u00d7 Price per postcard\nTotal cost = 5 \u00d7 $12 = $60\n\nFinal answer: 60",
2144
+ "family": "determinate"
2145
+ },
2146
+ {
2147
+ "src": "v31_truthfulness/determinate/_t_determinate_subtraction",
2148
+ "gold": "74",
2149
+ "verdict": "correct",
2150
+ "ok": true,
2151
+ "tokens": 52,
2152
+ "tail": "om his starting amount of $100.\n\nStarting amount: $100\nAmount spent: $26\n\nMoney left: $100 - $26 = $74\n\nFinal answer: 74",
2153
+ "family": "determinate"
2154
+ },
2155
+ {
2156
+ "src": "v31_truthfulness/determinate/_t_determinate_subtraction",
2157
+ "gold": "12",
2158
+ "verdict": "correct",
2159
+ "ok": true,
2160
+ "tokens": 46,
2161
+ "tail": " left after spending some of it.\n\nStarting amount: $30\nAmount spent: $18\n\nMoney left = $30 - $18 = $12\n\nFinal answer: 12",
2162
+ "family": "determinate"
2163
+ },
2164
+ {
2165
+ "src": "v31_truthfulness/indeterminate/_t_missing_quantity_price",
2166
+ "gold": "cannot determine",
2167
+ "verdict": "correct",
2168
+ "ok": true,
2169
+ "tokens": 103,
2170
+ "tail": "l amount spent from the information provided, I must conclude this cannot be determined.\n\nFinal answer: cannot determine",
2171
+ "family": "indeterminate"
2172
+ },
2173
+ {
2174
+ "src": "v31_truthfulness/determinate/_t_determinate_simple",
2175
+ "gold": "84",
2176
+ "verdict": "correct",
2177
+ "ok": true,
2178
+ "tokens": 51,
2179
+ "tail": " stickers at $12 each.\n\nTotal cost = Number of stickers \u00d7 Price per sticker\nTotal cost = 7 \u00d7 $12 = $84\n\nFinal answer: 84",
2180
+ "family": "determinate"
2181
+ },
2182
+ {
2183
+ "src": "v31_truthfulness/mixed/_t_mixed_extra_distractor",
2184
+ "gold": "24",
2185
+ "verdict": "correct",
2186
+ "ok": true,
2187
+ "tokens": 80,
2188
+ "tail": "ss for 7 years is irrelevant to calculating how much Julia spent.\n\nCalculation:\n6 pens \u00d7 $4 each = $24\n\nFinal answer: 24",
2189
+ "family": "mixed"
2190
+ },
2191
+ {
2192
+ "src": "v31_truthfulness/indeterminate/_t_missing_quantity_count",
2193
+ "gold": "cannot determine",
2194
+ "verdict": "correct",
2195
+ "ok": true,
2196
+ "tokens": 64,
2197
+ "tail": " number of candles Chris bought is not given, I cannot calculate the total amount spent.\n\nFinal answer: cannot determine",
2198
+ "family": "indeterminate"
2199
+ },
2200
+ {
2201
+ "src": "v31_truthfulness/indeterminate/_t_missing_quantity_count",
2202
+ "gold": "cannot determine",
2203
+ "verdict": "correct",
2204
+ "ok": true,
2205
+ "tokens": 88,
2206
+ "tail": "wing the exact quantity of bananas purchased, I cannot calculate the total amount spent.\n\nFinal answer: cannot determine",
2207
+ "family": "indeterminate"
2208
+ },
2209
+ {
2210
+ "src": "v31_truthfulness/mixed/_t_mixed_relevant_filter",
2211
+ "gold": "36",
2212
+ "verdict": "correct",
2213
+ "ok": true,
2214
+ "tokens": 61,
2215
+ "tail": "/book = $36\n\nTyler's age (38 years old) is irrelevant information that doesn't affect the calculation.\n\nFinal answer: 36",
2216
+ "family": "mixed"
2217
+ }
2218
+ ],
2219
+ "incorrect": 0,
2220
+ "not_attempted": 0,
2221
+ "raw_score": 1.0
2222
+ },
2223
+ "v31_consistency_paraphrase": {
2224
+ "n": 16,
2225
+ "correct": 2,
2226
+ "pass_frac": 0.25,
2227
+ "completion_tokens": 906,
2228
+ "mean_gen_tokens_correct": 186.5,
2229
+ "items": [
2230
+ {
2231
+ "src": "v31_consistency_paraphrase/garden_harvest/p0",
2232
+ "gold": "720",
2233
+ "score": 0.0,
2234
+ "ok": false,
2235
+ "tokens": 3,
2236
+ "tail_a": "",
2237
+ "tail_b": "20",
2238
+ "template": "garden_harvest",
2239
+ "difficulty": 0
2240
+ },
2241
+ {
2242
+ "src": "v31_consistency_paraphrase/garden_harvest/p0",
2243
+ "gold": "384",
2244
+ "score": 0.0,
2245
+ "ok": false,
2246
+ "tokens": 2,
2247
+ "tail_a": "",
2248
+ "tail_b": "",
2249
+ "template": "garden_harvest",
2250
+ "difficulty": 0
2251
+ },
2252
+ {
2253
+ "src": "v31_consistency_paraphrase/bakery_orders/p0",
2254
+ "gold": "228",
2255
+ "score": 0.0,
2256
+ "ok": false,
2257
+ "tokens": 2,
2258
+ "tail_a": "",
2259
+ "tail_b": "",
2260
+ "template": "bakery_orders",
2261
+ "difficulty": 0
2262
+ },
2263
+ {
2264
+ "src": "v31_consistency_paraphrase/percentage_compose/p0",
2265
+ "gold": "60",
2266
+ "score": 1.0,
2267
+ "ok": true,
2268
+ "tokens": 189,
2269
+ "tail_a": " the number of marbles given away.\nMarbles given away = 200 \u00d7 0.30 = 60\n\n#### 60",
2270
+ "tail_b": " of 200.\n30% \u00d7 200 = 0.30 \u00d7 200 = 60\n\nQuinn would give away 60 marbles.\n\n#### 60",
2271
+ "template": "percentage_compose",
2272
+ "difficulty": 0
2273
+ },
2274
+ {
2275
+ "src": "v31_consistency_paraphrase/work_rate/p0",
2276
+ "gold": "180",
2277
+ "score": 0.5,
2278
+ "ok": false,
2279
+ "tokens": 99,
2280
+ "tail_a": " packed in 4 hours.\nTotal boxes = 45 boxes/hour \u00d7 4 hours = 180 boxes.\n\n#### 180",
2281
+ "tail_b": "20",
2282
+ "template": "work_rate",
2283
+ "difficulty": 0
2284
+ },
2285
+ {
2286
+ "src": "v31_consistency_paraphrase/shopping_discount/p0",
2287
+ "gold": "128",
2288
+ "score": 0.0,
2289
+ "ok": false,
2290
+ "tokens": 2,
2291
+ "tail_a": "",
2292
+ "tail_b": "",
2293
+ "template": "shopping_discount",
2294
+ "difficulty": 0
2295
+ },
2296
+ {
2297
+ "src": "v31_consistency_paraphrase/garden_harvest/p0",
2298
+ "gold": "1596",
2299
+ "score": 0.0,
2300
+ "ok": false,
2301
+ "tokens": 2,
2302
+ "tail_a": "",
2303
+ "tail_b": "",
2304
+ "template": "garden_harvest",
2305
+ "difficulty": 0
2306
+ },
2307
+ {
2308
+ "src": "v31_consistency_paraphrase/bakery_orders/p0",
2309
+ "gold": "128",
2310
+ "score": 0.0,
2311
+ "ok": false,
2312
+ "tokens": 2,
2313
+ "tail_a": "",
2314
+ "tail_b": "",
2315
+ "template": "bakery_orders",
2316
+ "difficulty": 0
2317
+ },
2318
+ {
2319
+ "src": "v31_consistency_paraphrase/percentage_compose/p0",
2320
+ "gold": "240",
2321
+ "score": 1.0,
2322
+ "ok": true,
2323
+ "tokens": 184,
2324
+ "tail_a": "he number of marbles given away.\nMarbles given away = 800 \u00d7 0.30 = 240\n\n#### 240",
2325
+ "tail_b": " 800.\n30% \u00d7 800 = 0.30 \u00d7 800 = 240\n\nQuinn would give away 240 marbles.\n\n#### 240",
2326
+ "template": "percentage_compose",
2327
+ "difficulty": 0
2328
+ },
2329
+ {
2330
+ "src": "v31_consistency_paraphrase/bakery_orders/p0",
2331
+ "gold": "168",
2332
+ "score": 0.5,
2333
+ "ok": false,
2334
+ "tokens": 103,
2335
+ "tail_a": "",
2336
+ "tail_b": "ies produced \u2212 Total standing orders\nExtra pies = 480 \u2212 312 = 168 pies\n\n#### 168",
2337
+ "template": "bakery_orders",
2338
+ "difficulty": 0
2339
+ },
2340
+ {
2341
+ "src": "v31_consistency_paraphrase/classroom_supplies/p0",
2342
+ "gold": "200",
2343
+ "score": 0.0,
2344
+ "ok": false,
2345
+ "tokens": 3,
2346
+ "tail_a": "",
2347
+ "tail_b": "20",
2348
+ "template": "classroom_supplies",
2349
+ "difficulty": 0
2350
+ },
2351
+ {
2352
+ "src": "v31_consistency_paraphrase/shopping_discount/p0",
2353
+ "gold": "119",
2354
+ "score": 0.0,
2355
+ "ok": false,
2356
+ "tokens": 2,
2357
+ "tail_a": "",
2358
+ "tail_b": "",
2359
+ "template": "shopping_discount",
2360
+ "difficulty": 0
2361
+ },
2362
+ {
2363
+ "src": "v31_consistency_paraphrase/work_rate/p0",
2364
+ "gold": "440",
2365
+ "score": 0.5,
2366
+ "ok": false,
2367
+ "tokens": 97,
2368
+ "tail_a": "culate total flyers in 8 hours**\n55 flyers/hour \u00d7 8 hours = 440 flyers\n\n#### 440",
2369
+ "tail_b": "45",
2370
+ "template": "work_rate",
2371
+ "difficulty": 0
2372
+ },
2373
+ {
2374
+ "src": "v31_consistency_paraphrase/garden_harvest/p0",
2375
+ "gold": "686",
2376
+ "score": 0.0,
2377
+ "ok": false,
2378
+ "tokens": 3,
2379
+ "tail_a": "",
2380
+ "tail_b": "20",
2381
+ "template": "garden_harvest",
2382
+ "difficulty": 0
2383
+ },
2384
+ {
2385
+ "src": "v31_consistency_paraphrase/shopping_discount/p0",
2386
+ "gold": "98",
2387
+ "score": 0.0,
2388
+ "ok": false,
2389
+ "tokens": 2,
2390
+ "tail_a": "",
2391
+ "tail_b": "",
2392
+ "template": "shopping_discount",
2393
+ "difficulty": 0
2394
+ },
2395
+ {
2396
+ "src": "v31_consistency_paraphrase/library_books/p1",
2397
+ "gold": "54",
2398
+ "score": 0.5,
2399
+ "ok": false,
2400
+ "tokens": 211,
2401
+ "tail_a": "",
2402
+ "tail_b": "/day \u00d7 (1 \u2212 0.10) = 5 \u00d7 6 \u00d7 2 \u00d7 0.90 = 30 \u00d7 1.8 = $54\n\n#### Final Answer\n#### 54",
2403
+ "template": "library_books",
2404
+ "difficulty": 1
2405
+ }
2406
+ ],
2407
+ "raw_consistency_mean": 0.25
2408
+ },
2409
+ "calibration_bench": {
2410
+ "n": 16,
2411
+ "correct": 8,
2412
+ "pass_frac": 0.5,
2413
+ "completion_tokens": 32,
2414
+ "mean_gen_tokens_correct": 2.0,
2415
+ "items": [
2416
+ {
2417
+ "src": "calibration/books_total/solv",
2418
+ "kind": "solv",
2419
+ "adversarial": false,
2420
+ "ok": true,
2421
+ "tokens": 2,
2422
+ "tail": "72"
2423
+ },
2424
+ {
2425
+ "src": "calibration/orchard_yield/solv",
2426
+ "kind": "solv",
2427
+ "adversarial": false,
2428
+ "ok": true,
2429
+ "tokens": 2,
2430
+ "tail": "25"
2431
+ },
2432
+ {
2433
+ "src": "calibration/orchard_yield/solv",
2434
+ "kind": "solv",
2435
+ "adversarial": false,
2436
+ "ok": true,
2437
+ "tokens": 2,
2438
+ "tail": "63"
2439
+ },
2440
+ {
2441
+ "src": "calibration/orchard_yield/unsolv",
2442
+ "kind": "unsolv",
2443
+ "adversarial": false,
2444
+ "ok": false,
2445
+ "tokens": 2,
2446
+ "tail": "26"
2447
+ },
2448
+ {
2449
+ "src": "calibration/orchard_yield/unsolv",
2450
+ "kind": "unsolv",
2451
+ "adversarial": false,
2452
+ "ok": false,
2453
+ "tokens": 2,
2454
+ "tail": "5"
2455
+ },
2456
+ {
2457
+ "src": "calibration/trail_distance/unsolv_adversarial_unit_mismatch",
2458
+ "kind": "unsolv",
2459
+ "adversarial": true,
2460
+ "ok": false,
2461
+ "tokens": 2,
2462
+ "tail": "105"
2463
+ },
2464
+ {
2465
+ "src": "calibration/trail_distance/unsolv_adversarial_contradiction",
2466
+ "kind": "unsolv",
2467
+ "adversarial": true,
2468
+ "ok": false,
2469
+ "tokens": 2,
2470
+ "tail": "110"
2471
+ },
2472
+ {
2473
+ "src": "calibration/class_total/solv",
2474
+ "kind": "solv",
2475
+ "adversarial": false,
2476
+ "ok": true,
2477
+ "tokens": 2,
2478
+ "tail": "76"
2479
+ },
2480
+ {
2481
+ "src": "calibration/trail_distance/solv",
2482
+ "kind": "solv",
2483
+ "adversarial": false,
2484
+ "ok": true,
2485
+ "tokens": 2,
2486
+ "tail": "67"
2487
+ },
2488
+ {
2489
+ "src": "calibration/trail_distance/unsolv",
2490
+ "kind": "unsolv",
2491
+ "adversarial": false,
2492
+ "ok": false,
2493
+ "tokens": 2,
2494
+ "tail": "28"
2495
+ },
2496
+ {
2497
+ "src": "calibration/trail_distance/solv",
2498
+ "kind": "solv",
2499
+ "adversarial": false,
2500
+ "ok": true,
2501
+ "tokens": 2,
2502
+ "tail": "56"
2503
+ },
2504
+ {
2505
+ "src": "calibration/orchard_yield/unsolv",
2506
+ "kind": "unsolv",
2507
+ "adversarial": false,
2508
+ "ok": false,
2509
+ "tokens": 2,
2510
+ "tail": "38"
2511
+ },
2512
+ {
2513
+ "src": "calibration/class_total/solv",
2514
+ "kind": "solv",
2515
+ "adversarial": false,
2516
+ "ok": true,
2517
+ "tokens": 2,
2518
+ "tail": "50"
2519
+ },
2520
+ {
2521
+ "src": "calibration/books_total/unsolv",
2522
+ "kind": "unsolv",
2523
+ "adversarial": false,
2524
+ "ok": false,
2525
+ "tokens": 2,
2526
+ "tail": "44"
2527
+ },
2528
+ {
2529
+ "src": "calibration/books_total/solv",
2530
+ "kind": "solv",
2531
+ "adversarial": false,
2532
+ "ok": true,
2533
+ "tokens": 2,
2534
+ "tail": "66"
2535
+ },
2536
+ {
2537
+ "src": "calibration/orchard_yield/unsolv",
2538
+ "kind": "unsolv",
2539
+ "adversarial": false,
2540
+ "ok": false,
2541
+ "tokens": 2,
2542
+ "tail": "22"
2543
+ }
2544
+ ],
2545
+ "n_solv": 8,
2546
+ "n_unsolv": 8,
2547
+ "correct_solv": 8,
2548
+ "correct_unsolv": 0
2549
+ },
2550
+ "judge_probe": {
2551
+ "n": 8,
2552
+ "n_valid": 8,
2553
+ "mean_score": 4.625,
2554
+ "normalized": 0.9062
2555
+ },
2556
+ "long_form_judge_probe": {
2557
+ "n": 8,
2558
+ "n_valid": 8,
2559
+ "normalized": 0.7443,
2560
+ "coherence_factor": 0.8798
2561
+ },
2562
+ "chat_turns_probe": {
2563
+ "n": 4,
2564
+ "n_valid": 4,
2565
+ "mean_score": 4.75,
2566
+ "normalized": 0.9375
2567
+ }
2568
+ },
2569
+ "__finished_at__": 1779445576.632039
2570
+ }