vuhaian committed
Commit 3efda21 · verified · 1 Parent(s): 4cadd2d

Upload eval_results_run8_3.json with huggingface_hub

Files changed (1)
  1. eval_results_run8_3.json +2573 -0
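
Each benchmark block in the added file carries summary fields (`n`, `correct`, `pass_frac`, `completion_tokens`, `mean_gen_tokens_correct`) next to its per-item records. The sketch below shows one way these summaries could be recomputed from the items as a consistency check. The `(ok, tokens)` pairs are copied from the `v31_math_gsm_symbolic` items in this commit. The helper name `summarize` and the field semantics are assumptions: it is inferred, not documented, that `completion_tokens` is the sum of per-item `tokens` and that `mean_gen_tokens_correct` is the mean over items with `"ok": true`, rounded to one decimal.

```python
# Minimal sketch: recompute the summary fields of one benchmark block.
# Assumption (inferred from the numbers, not from any documentation):
#   completion_tokens       = sum of per-item "tokens"
#   mean_gen_tokens_correct = mean "tokens" over items with "ok": true,
#                             rounded to one decimal place


def summarize(items):
    """Return summary statistics for a list of (ok, tokens) pairs."""
    n = len(items)
    correct_tokens = [tokens for ok, tokens in items if ok]
    correct = len(correct_tokens)
    return {
        "n": n,
        "correct": correct,
        "pass_frac": correct / n if n else 0.0,
        "completion_tokens": sum(tokens for _, tokens in items),
        "mean_gen_tokens_correct": (
            round(sum(correct_tokens) / correct, 1) if correct else 0.0
        ),
    }


# (ok, tokens) for the 16 "v31_math_gsm_symbolic" items, in file order.
GSM_SYMBOLIC_ITEMS = [
    (True, 113), (True, 181), (True, 123), (True, 75),
    (True, 203), (False, 227), (False, 237), (True, 166),
    (True, 167), (True, 117), (True, 94), (True, 120),
    (True, 91), (True, 160), (True, 126), (True, 130),
]

summary = summarize(GSM_SYMBOLIC_ITEMS)
print(summary)
```

For this block the recomputed values agree with the stored ones (`n` 16, `correct` 14, `pass_frac` 0.875, `completion_tokens` 2330, `mean_gen_tokens_correct` 133.3), which is consistent with the assumed field semantics but does not confirm how the evaluation harness actually derives them.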
eval_results_run8_3.json ADDED
@@ -0,0 +1,2573 @@
1
+ {
2
+ "vuhaian/kimis_721k_e3": {
3
+ "name": "vuhaian/kimis_721k_e3",
4
+ "uid": null,
5
+ "hotkey": null,
6
+ "is_king": false,
7
+ "is_teacher": false,
8
+ "kl_global_avg": 1.3288932885127474,
9
+ "on_policy_rkl": {
10
+ "mean_rkl": 1.9634060132446616
11
+ },
12
+ "top_k_overlap_mean": 0.23187229061388642,
13
+ "teacher_trace_nll_mean": 2.4387999986159534,
14
+ "capability": {
15
+ "pass_frac": 0.875
16
+ },
17
+ "length_axis": {
18
+ "penalty": 0.9314234841365243
19
+ },
20
+ "degenerate_count": 0,
21
+ "activation_fingerprint": {
22
+ "layer_fingerprints": {
23
+ "all": [
24
+ -0.051259692914363324,
25
+ -0.06267719142459637,
26
+ -0.012063772010812332,
27
+ -0.07475230156699784,
28
+ 0.05392415383780403,
29
+ -0.035363632426432014,
30
+ 0.023277184152441465,
31
+ 0.08131707975709215,
32
+ 0.010839253799188537,
33
+ 0.043572439696947206,
34
+ -0.03292593413477352,
35
+ -0.015000348092391654,
36
+ -0.05376541999555643,
37
+ 0.04153157601090752,
38
+ 0.008707685060435968,
39
+ 0.06787005569240842,
40
+ 0.010952635115079622,
41
+ 0.011077354562559821,
42
+ -0.019150104254005675,
43
+ -0.0050001160307972204,
44
+ -0.02479649378538213,
45
+ -0.0060205478738170545,
46
+ 0.07335771138153742,
47
+ 0.0352729273737192,
48
+ -0.05974061534301707,
49
+ 0.005601037005020011,
50
+ -0.028061875683045613,
51
+ 0.04720064180546216,
52
+ -0.042302568958966905,
53
+ 0.03189416416016454,
54
+ 0.05264294496823462,
55
+ -0.07855057564934947,
56
+ -0.051259692914363324,
57
+ -0.06267719142459637,
58
+ -0.012063772010812332,
59
+ -0.07475230156699784,
60
+ 0.05392415383780403,
61
+ -0.035363632426432014,
62
+ 0.023277184152441465,
63
+ 0.08131707975709215,
64
+ 0.010839253799188537,
65
+ 0.043572439696947206,
66
+ -0.03292593413477352,
67
+ -0.015000348092391654,
68
+ -0.05376541999555643,
69
+ 0.04153157601090752,
70
+ 0.008707685060435968,
71
+ 0.06787005569240842,
72
+ 0.010952635115079622,
73
+ 0.011077354562559821,
74
+ -0.019150104254005675,
75
+ -0.0050001160307972204,
76
+ -0.02479649378538213,
77
+ -0.0060205478738170545,
78
+ 0.07335771138153742,
79
+ 0.0352729273737192,
80
+ -0.05974061534301707,
81
+ 0.005601037005020011,
82
+ -0.028061875683045613,
83
+ 0.04720064180546216,
84
+ -0.042302568958966905,
85
+ 0.03189416416016454,
86
+ 0.05264294496823462,
87
+ -0.07855057564934947,
88
+ -0.051259692914363324,
89
+ -0.06267719142459637,
90
+ -0.012063772010812332,
91
+ -0.07475230156699784,
92
+ 0.05392415383780403,
93
+ -0.035363632426432014,
94
+ 0.023277184152441465,
95
+ 0.08131707975709215,
96
+ 0.010839253799188537,
97
+ 0.043572439696947206,
98
+ -0.03292593413477352,
99
+ -0.015000348092391654,
100
+ -0.05376541999555643,
101
+ 0.04153157601090752,
102
+ 0.008707685060435968,
103
+ 0.06787005569240842,
104
+ 0.010952635115079622,
105
+ 0.011077354562559821,
106
+ -0.019150104254005675,
107
+ -0.0050001160307972204,
108
+ -0.02479649378538213,
109
+ -0.0060205478738170545,
110
+ 0.07335771138153742,
111
+ 0.0352729273737192,
112
+ -0.05974061534301707,
113
+ 0.005601037005020011,
114
+ -0.028061875683045613,
115
+ 0.04720064180546216,
116
+ -0.042302568958966905,
117
+ 0.03189416416016454,
118
+ 0.05264294496823462,
119
+ -0.07855057564934947,
120
+ -0.051259692914363324,
121
+ -0.06267719142459637,
122
+ -0.012063772010812332,
123
+ -0.07475230156699784,
124
+ 0.05392415383780403,
125
+ -0.035363632426432014,
126
+ 0.023277184152441465,
127
+ 0.08131707975709215,
128
+ 0.010839253799188537,
129
+ 0.043572439696947206,
130
+ -0.03292593413477352,
131
+ -0.015000348092391654,
132
+ -0.05376541999555643,
133
+ 0.04153157601090752,
134
+ 0.008707685060435968,
135
+ 0.06787005569240842,
136
+ 0.010952635115079622,
137
+ 0.011077354562559821,
138
+ -0.019150104254005675,
139
+ -0.0050001160307972204,
140
+ -0.02479649378538213,
141
+ -0.0060205478738170545,
142
+ 0.07335771138153742,
143
+ 0.0352729273737192,
144
+ -0.05974061534301707,
145
+ 0.005601037005020011,
146
+ -0.028061875683045613,
147
+ 0.04720064180546216,
148
+ -0.042302568958966905,
149
+ 0.03189416416016454,
150
+ 0.05264294496823462,
151
+ -0.07855057564934947,
152
+ -0.051259692914363324,
153
+ -0.06267719142459637,
154
+ -0.012063772010812332,
155
+ -0.07475230156699784,
156
+ 0.05392415383780403,
157
+ -0.035363632426432014,
158
+ 0.023277184152441465,
159
+ 0.08131707975709215,
160
+ 0.010839253799188537,
161
+ 0.043572439696947206,
162
+ -0.03292593413477352,
163
+ -0.015000348092391654,
164
+ -0.05376541999555643,
165
+ 0.04153157601090752,
166
+ 0.008707685060435968,
167
+ 0.06787005569240842,
168
+ 0.010952635115079622,
169
+ 0.011077354562559821,
170
+ -0.019150104254005675,
171
+ -0.0050001160307972204,
172
+ -0.02479649378538213,
173
+ -0.0060205478738170545,
174
+ 0.07335771138153742,
175
+ 0.0352729273737192,
176
+ -0.05974061534301707,
177
+ 0.005601037005020011,
178
+ -0.028061875683045613,
179
+ 0.04720064180546216,
180
+ -0.042302568958966905,
181
+ 0.03189416416016454,
182
+ 0.05264294496823462,
183
+ -0.07855057564934947,
184
+ -0.051259692914363324,
185
+ -0.06267719142459637,
186
+ -0.012063772010812332,
187
+ -0.07475230156699784,
188
+ 0.05392415383780403,
189
+ -0.035363632426432014,
190
+ 0.023277184152441465,
191
+ 0.08131707975709215,
192
+ 0.010839253799188537,
193
+ 0.043572439696947206,
194
+ -0.03292593413477352,
195
+ -0.015000348092391654,
196
+ -0.05376541999555643,
197
+ 0.04153157601090752,
198
+ 0.008707685060435968,
199
+ 0.06787005569240842,
200
+ 0.010952635115079622,
201
+ 0.011077354562559821,
202
+ -0.019150104254005675,
203
+ -0.0050001160307972204,
204
+ -0.02479649378538213,
205
+ -0.0060205478738170545,
206
+ 0.07335771138153742,
207
+ 0.0352729273737192,
208
+ -0.05974061534301707,
209
+ 0.005601037005020011,
210
+ -0.028061875683045613,
211
+ 0.04720064180546216,
212
+ -0.042302568958966905,
213
+ 0.03189416416016454,
214
+ 0.05264294496823462,
215
+ -0.07855057564934947,
216
+ -0.051259692914363324,
217
+ -0.06267719142459637,
218
+ -0.012063772010812332,
219
+ -0.07475230156699784,
220
+ 0.05392415383780403,
221
+ -0.035363632426432014,
222
+ 0.023277184152441465,
223
+ 0.08131707975709215,
224
+ 0.010839253799188537,
225
+ 0.043572439696947206,
226
+ -0.03292593413477352,
227
+ -0.015000348092391654,
228
+ -0.05376541999555643,
229
+ 0.04153157601090752,
230
+ 0.008707685060435968,
231
+ 0.06787005569240842,
232
+ 0.010952635115079622,
233
+ 0.011077354562559821,
234
+ -0.019150104254005675,
235
+ -0.0050001160307972204,
236
+ -0.02479649378538213,
237
+ -0.0060205478738170545,
238
+ 0.07335771138153742,
239
+ 0.0352729273737192,
240
+ -0.05974061534301707,
241
+ 0.005601037005020011,
242
+ -0.028061875683045613,
243
+ 0.04720064180546216,
244
+ -0.042302568958966905,
245
+ 0.03189416416016454,
246
+ 0.05264294496823462,
247
+ -0.07855057564934947,
248
+ -0.051259692914363324,
249
+ -0.06267719142459637,
250
+ -0.012063772010812332,
251
+ -0.07475230156699784,
252
+ 0.05392415383780403,
253
+ -0.035363632426432014,
254
+ 0.023277184152441465,
255
+ 0.08131707975709215,
256
+ 0.010839253799188537,
257
+ 0.043572439696947206,
258
+ -0.03292593413477352,
259
+ -0.015000348092391654,
260
+ -0.05376541999555643,
261
+ 0.04153157601090752,
262
+ 0.008707685060435968,
263
+ 0.06787005569240842,
264
+ 0.010952635115079622,
265
+ 0.011077354562559821,
266
+ -0.019150104254005675,
267
+ -0.0050001160307972204,
268
+ -0.02479649378538213,
269
+ -0.0060205478738170545,
270
+ 0.07335771138153742,
271
+ 0.0352729273737192,
272
+ -0.05974061534301707,
273
+ 0.005601037005020011,
274
+ -0.028061875683045613,
275
+ 0.04720064180546216,
276
+ -0.042302568958966905,
277
+ 0.03189416416016454,
278
+ 0.05264294496823462,
279
+ -0.07855057564934947,
280
+ -0.051259692914363324,
281
+ -0.06267719142459637,
282
+ -0.012063772010812332,
283
+ -0.07475230156699784,
284
+ 0.05392415383780403,
285
+ -0.035363632426432014,
286
+ 0.023277184152441465,
287
+ 0.08131707975709215,
288
+ 0.010839253799188537,
289
+ 0.043572439696947206,
290
+ -0.03292593413477352,
291
+ -0.015000348092391654,
292
+ -0.05376541999555643,
293
+ 0.04153157601090752,
294
+ 0.008707685060435968,
295
+ 0.06787005569240842,
296
+ 0.010952635115079622,
297
+ 0.011077354562559821,
298
+ -0.019150104254005675,
299
+ -0.0050001160307972204,
300
+ -0.02479649378538213,
301
+ -0.0060205478738170545,
302
+ 0.07335771138153742,
303
+ 0.0352729273737192,
304
+ -0.05974061534301707,
305
+ 0.005601037005020011,
306
+ -0.028061875683045613,
307
+ 0.04720064180546216,
308
+ -0.042302568958966905,
309
+ 0.03189416416016454,
310
+ 0.05264294496823462,
311
+ -0.07855057564934947,
312
+ -0.051259692914363324,
313
+ -0.06267719142459637,
314
+ -0.012063772010812332,
315
+ -0.07475230156699784,
316
+ 0.05392415383780403,
317
+ -0.035363632426432014,
318
+ 0.023277184152441465,
319
+ 0.08131707975709215,
320
+ 0.010839253799188537,
321
+ 0.043572439696947206,
322
+ -0.03292593413477352,
323
+ -0.015000348092391654,
324
+ -0.05376541999555643,
325
+ 0.04153157601090752,
326
+ 0.008707685060435968,
327
+ 0.06787005569240842,
328
+ 0.010952635115079622,
329
+ 0.011077354562559821,
330
+ -0.019150104254005675,
331
+ -0.0050001160307972204,
332
+ -0.02479649378538213,
333
+ -0.0060205478738170545,
334
+ 0.07335771138153742,
335
+ 0.0352729273737192,
336
+ -0.05974061534301707,
337
+ 0.005601037005020011,
338
+ -0.028061875683045613,
339
+ 0.04720064180546216,
340
+ -0.042302568958966905,
341
+ 0.03189416416016454,
342
+ 0.05264294496823462,
343
+ -0.07855057564934947,
344
+ -0.051259692914363324,
345
+ -0.06267719142459637,
346
+ -0.012063772010812332,
347
+ -0.07475230156699784,
348
+ 0.05392415383780403,
349
+ -0.035363632426432014,
350
+ 0.023277184152441465,
351
+ 0.08131707975709215,
352
+ 0.010839253799188537,
353
+ 0.043572439696947206,
354
+ -0.03292593413477352,
355
+ -0.015000348092391654,
356
+ -0.05376541999555643,
357
+ 0.04153157601090752,
358
+ 0.008707685060435968,
359
+ 0.06787005569240842,
360
+ 0.010952635115079622,
361
+ 0.011077354562559821,
362
+ -0.019150104254005675,
363
+ -0.0050001160307972204,
364
+ -0.02479649378538213,
365
+ -0.0060205478738170545,
366
+ 0.07335771138153742,
367
+ 0.0352729273737192,
368
+ -0.05974061534301707,
369
+ 0.005601037005020011,
370
+ -0.028061875683045613,
371
+ 0.04720064180546216,
372
+ -0.042302568958966905,
373
+ 0.03189416416016454,
374
+ 0.05264294496823462,
375
+ -0.07855057564934947,
376
+ -0.051259692914363324,
377
+ -0.06267719142459637,
378
+ -0.012063772010812332,
379
+ -0.07475230156699784,
380
+ 0.05392415383780403,
381
+ -0.035363632426432014,
382
+ 0.023277184152441465,
383
+ 0.08131707975709215,
384
+ 0.010839253799188537,
385
+ 0.043572439696947206,
386
+ -0.03292593413477352,
387
+ -0.015000348092391654,
388
+ -0.05376541999555643,
389
+ 0.04153157601090752,
390
+ 0.008707685060435968,
391
+ 0.06787005569240842,
392
+ 0.010952635115079622,
393
+ 0.011077354562559821,
394
+ -0.019150104254005675,
395
+ -0.0050001160307972204,
396
+ -0.02479649378538213,
397
+ -0.0060205478738170545,
398
+ 0.07335771138153742,
399
+ 0.0352729273737192,
400
+ -0.05974061534301707,
401
+ 0.005601037005020011,
402
+ -0.028061875683045613,
403
+ 0.04720064180546216,
404
+ -0.042302568958966905,
405
+ 0.03189416416016454,
406
+ 0.05264294496823462,
407
+ -0.07855057564934947,
408
+ -0.051259692914363324,
409
+ -0.06267719142459637,
410
+ -0.012063772010812332,
411
+ -0.07475230156699784,
412
+ 0.05392415383780403,
413
+ -0.035363632426432014,
414
+ 0.023277184152441465,
415
+ 0.08131707975709215,
416
+ 0.010839253799188537,
417
+ 0.043572439696947206,
418
+ -0.03292593413477352,
419
+ -0.015000348092391654,
420
+ -0.05376541999555643,
421
+ 0.04153157601090752,
422
+ 0.008707685060435968,
423
+ 0.06787005569240842,
424
+ 0.010952635115079622,
425
+ 0.011077354562559821,
426
+ -0.019150104254005675,
427
+ -0.0050001160307972204,
428
+ -0.02479649378538213,
429
+ -0.0060205478738170545,
430
+ 0.07335771138153742,
431
+ 0.0352729273737192,
432
+ -0.05974061534301707,
433
+ 0.005601037005020011,
434
+ -0.028061875683045613,
435
+ 0.04720064180546216,
436
+ -0.042302568958966905,
437
+ 0.03189416416016454,
438
+ 0.05264294496823462,
439
+ -0.07855057564934947,
440
+ -0.051259692914363324,
441
+ -0.06267719142459637,
442
+ -0.012063772010812332,
443
+ -0.07475230156699784,
444
+ 0.05392415383780403,
445
+ -0.035363632426432014,
446
+ 0.023277184152441465,
447
+ 0.08131707975709215,
448
+ 0.010839253799188537,
449
+ 0.043572439696947206,
450
+ -0.03292593413477352,
451
+ -0.015000348092391654,
452
+ -0.05376541999555643,
453
+ 0.04153157601090752,
454
+ 0.008707685060435968,
455
+ 0.06787005569240842,
456
+ 0.010952635115079622,
457
+ 0.011077354562559821,
458
+ -0.019150104254005675,
459
+ -0.0050001160307972204,
460
+ -0.02479649378538213,
461
+ -0.0060205478738170545,
462
+ 0.07335771138153742,
463
+ 0.0352729273737192,
464
+ -0.05974061534301707,
465
+ 0.005601037005020011,
466
+ -0.028061875683045613,
467
+ 0.04720064180546216,
468
+ -0.042302568958966905,
469
+ 0.03189416416016454,
470
+ 0.05264294496823462,
471
+ -0.07855057564934947,
472
+ -0.051259692914363324,
473
+ -0.06267719142459637,
474
+ -0.012063772010812332,
475
+ -0.07475230156699784,
476
+ 0.05392415383780403,
477
+ -0.035363632426432014,
478
+ 0.023277184152441465,
479
+ 0.08131707975709215,
480
+ 0.010839253799188537,
481
+ 0.043572439696947206,
482
+ -0.03292593413477352,
483
+ -0.015000348092391654,
484
+ -0.05376541999555643,
485
+ 0.04153157601090752,
486
+ 0.008707685060435968,
487
+ 0.06787005569240842,
488
+ 0.010952635115079622,
489
+ 0.011077354562559821,
490
+ -0.019150104254005675,
491
+ -0.0050001160307972204,
492
+ -0.02479649378538213,
493
+ -0.0060205478738170545,
494
+ 0.07335771138153742,
495
+ 0.0352729273737192,
496
+ -0.05974061534301707,
497
+ 0.005601037005020011,
498
+ -0.028061875683045613,
499
+ 0.04720064180546216,
500
+ -0.042302568958966905,
501
+ 0.03189416416016454,
502
+ 0.05264294496823462,
503
+ -0.07855057564934947,
504
+ -0.051259692914363324,
505
+ -0.06267719142459637,
506
+ -0.012063772010812332,
507
+ -0.07475230156699784,
508
+ 0.05392415383780403,
509
+ -0.035363632426432014,
510
+ 0.023277184152441465,
511
+ 0.08131707975709215,
512
+ 0.010839253799188537,
513
+ 0.043572439696947206,
514
+ -0.03292593413477352,
515
+ -0.015000348092391654,
516
+ -0.05376541999555643,
517
+ 0.04153157601090752,
518
+ 0.008707685060435968,
519
+ 0.06787005569240842,
520
+ 0.010952635115079622,
521
+ 0.011077354562559821,
522
+ -0.019150104254005675,
523
+ -0.0050001160307972204,
524
+ -0.02479649378538213,
525
+ -0.0060205478738170545,
526
+ 0.07335771138153742,
527
+ 0.0352729273737192,
528
+ -0.05974061534301707,
529
+ 0.005601037005020011,
530
+ -0.028061875683045613,
531
+ 0.04720064180546216,
532
+ -0.042302568958966905,
533
+ 0.03189416416016454,
534
+ 0.05264294496823462,
535
+ -0.07855057564934947
536
+ ]
537
+ },
538
+ "n_layers": 1,
539
+ "hidden_size": 512
540
+ },
541
+ "v31_math_gsm_symbolic": {
542
+ "n": 16,
543
+ "correct": 14,
544
+ "pass_frac": 0.875,
545
+ "completion_tokens": 2330,
546
+ "mean_gen_tokens_correct": 133.3,
547
+ "items": [
548
+ {
549
+ "src": "v31_gsm_symbolic/bakery_orders/p0",
550
+ "gold": "204",
551
+ "ok": true,
552
+ "tokens": 113,
553
+ "tail": "ulate leftover cakes**\nLeftover cakes = Total production \u2212 Total orders\nLeftover cakes = 480 \u2212 276 = 204 cakes\n\n#### 204",
554
+ "difficulty": 0,
555
+ "is_noop": false,
556
+ "template": "bakery_orders"
557
+ },
558
+ {
559
+ "src": "v31_gsm_symbolic/travel_distance/p2",
560
+ "gold": "667",
561
+ "ok": true,
562
+ "tokens": 181,
563
+ "tail": "our distance: 52 miles\n\n**Step 5: Calculate total distance**\nTotal distance = 180 + 300 + 135 + 52 = 667 miles\n\n#### 667",
564
+ "difficulty": 2,
565
+ "is_noop": false,
566
+ "template": "travel_distance"
567
+ },
568
+ {
569
+ "src": "v31_gsm_symbolic/work_rate/p1",
570
+ "gold": "220",
571
+ "ok": true,
572
+ "tokens": 123,
573
+ "tail": "our\n\n**Step 3: Calculate total labels packed in 4 hours**\nTotal labels = 55 labels/hour \u00d7 4 hours = 220 labels\n\n#### 220",
574
+ "difficulty": 1,
575
+ "is_noop": false,
576
+ "template": "work_rate"
577
+ },
578
+ {
579
+ "src": "v31_gsm_symbolic/percentage_compose/p0",
580
+ "gold": "120",
581
+ "ok": true,
582
+ "tokens": 75,
583
+ "tail": ".\n40% = 40/100 = 0.4\n\nStep 2: Multiply the total marbles by 0.4.\n300 \u00d7 0.4 = 120\n\nPaul gives away 120 marbles.\n\n#### 120",
584
+ "difficulty": 0,
585
+ "is_noop": false,
586
+ "template": "percentage_compose"
587
+ },
588
+ {
589
+ "src": "v31_gsm_symbolic/travel_distance/p1",
590
+ "gold": "895",
591
+ "ok": true,
592
+ "tokens": 203,
593
+ "tail": " 55 \u00d7 3 = 165 miles\n\n**Step 5: Calculate total distance**\n- Total distance = 300 + 150 + 280 + 165 = 895 miles\n\n#### 895",
594
+ "difficulty": 1,
595
+ "is_noop": false,
596
+ "template": "travel_distance"
597
+ },
598
+ {
599
+ "src": "v31_gsm_symbolic/shopping_discount/p0",
600
+ "gold": "99",
601
+ "ok": false,
602
+ "tokens": 227,
603
+ "tail": "ney left = $150 \u2212 $51.84 = $98.16\n\nSince the answer should be an integer, I'll round to the nearest dollar: $98\n\n#### 98",
604
+ "difficulty": 0,
605
+ "is_noop": false,
606
+ "template": "shopping_discount"
607
+ },
608
+ {
609
+ "src": "v31_gsm_symbolic/shopping_discount/p0",
610
+ "gold": "46",
611
+ "ok": false,
612
+ "tokens": 237,
613
+ "tail": "ney left = $100 \u2212 $54.60 = $45.40\n\nSince the answer should be an integer, I'll round to the nearest dollar: $45\n\n#### 45",
614
+ "difficulty": 0,
615
+ "is_noop": false,
616
+ "template": "shopping_discount"
617
+ },
618
+ {
619
+ "src": "v31_gsm_symbolic/bakery_orders/p1",
620
+ "gold": "258",
621
+ "ok": true,
622
+ "tokens": 166,
623
+ "tail": "loaves per day\n\n**Step 5: Calculate total leftover loaves over 6 days**\n- Total leftover = 43 \u00d7 6 = 258 loaves\n\n#### 258",
624
+ "difficulty": 1,
625
+ "is_noop": false,
626
+ "template": "bakery_orders"
627
+ },
628
+ {
629
+ "src": "v31_gsm_symbolic/travel_distance/p0",
630
+ "gold": "430",
631
+ "ok": true,
632
+ "tokens": 167,
633
+ "tail": "ance = 60 \u00d7 2 = 120 miles\n\n**Step 4: Calculate total distance**\n- Total distance = 100 + 210 + 120 = 430 miles\n\n#### 430",
634
+ "difficulty": 0,
635
+ "is_noop": false,
636
+ "template": "travel_distance"
637
+ },
638
+ {
639
+ "src": "v31_gsm_symbolic/work_rate/p0",
640
+ "gold": "375",
641
+ "ok": true,
642
+ "tokens": 117,
643
+ "tail": "our\n\n**Step 4: Calculate total labels packed in 5 hours**\nTotal labels = 75 labels/hour \u00d7 5 hours = 375 labels\n\n#### 375",
644
+ "difficulty": 0,
645
+ "is_noop": false,
646
+ "template": "work_rate"
647
+ },
648
+ {
649
+ "src": "v31_gsm_symbolic/garden_harvest/p0",
650
+ "gold": "396",
651
+ "ok": true,
652
+ "tokens": 94,
653
+ "tail": "s\n\n**Step 2: Find the total number of peppers**\n- Peppers per plant: 3\n- Total peppers = 132 \u00d7 3 = 396 peppers\n\n#### 396",
654
+ "difficulty": 0,
655
+ "is_noop": false,
656
+ "template": "garden_harvest"
657
+ },
658
+ {
659
+ "src": "v31_gsm_symbolic/bakery_orders/p0",
660
+ "gold": "141",
661
+ "ok": true,
662
+ "tokens": 120,
663
+ "tail": "ra pies left over**\nExtra pies = Total pies produced \u2212 Total standing orders\nExtra pies = 360 \u2212 219 = 141 pies\n\n#### 141",
664
+ "difficulty": 0,
665
+ "is_noop": false,
666
+ "template": "bakery_orders"
667
+ },
668
+ {
669
+ "src": "v31_gsm_symbolic/garden_harvest/p0",
670
+ "gold": "320",
671
+ "ok": true,
672
+ "tokens": 91,
673
+ "tail": "Step 2: Find the total number of peppers.\n- Each plant yields 4 peppers\n- Total peppers = 80 \u00d7 4 = 320 peppers\n\n#### 320",
674
+ "difficulty": 0,
675
+ "is_noop": false,
676
+ "template": "garden_harvest"
677
+ },
678
+ {
679
+ "src": "v31_gsm_symbolic/percentage_compose/noop",
680
+ "gold": "60",
681
+ "ok": true,
682
+ "tokens": 160,
683
+ "tail": "5 = 400 \u00d7 15/100 = 6000/100 = 60\n\n**Step 3: Verify**\n10% of 400 = 40\n5% of 400 = 20\n15% of 400 = 40 + 20 = 60 \u2713\n\n#### 60",
684
+ "difficulty": 0,
685
+ "is_noop": true,
686
+ "template": "percentage_compose"
687
+ },
688
+ {
689
+ "src": "v31_gsm_symbolic/library_books/noop",
690
+ "gold": "220",
691
+ "ok": true,
692
+ "tokens": 126,
693
+ "tail": "0\n\nNote: The information about the recipe being originally written 12 years ago is irrelevant to this problem.\n\n#### 220",
694
+ "difficulty": 0,
695
+ "is_noop": true,
696
+ "template": "library_books"
697
+ },
698
+ {
699
+ "src": "v31_gsm_symbolic/library_books/p0",
700
+ "gold": "24",
701
+ "ok": true,
702
+ "tokens": 130,
703
+ "tail": "l books**\nTotal fine = 3 books \u00d7 $8 per book = $24\n\nAlternatively: Total fine = 3 books \u00d7 8 days \u00d7 $1/day = $24\n\n#### 24",
704
+ "difficulty": 0,
705
+ "is_noop": false,
706
+ "template": "library_books"
707
+ }
708
+ ]
709
+ },
710
+ "v31_math_competition": {
711
+ "n": 16,
712
+ "correct": 12,
713
+ "pass_frac": 0.75,
714
+ "completion_tokens": 5131,
715
+ "mean_gen_tokens_correct": 249.5,
716
+ "items": [
717
+ {
718
+ "src": "v31_math_competition/number_theory/smallest_k_divisible",
719
+ "gold": "7",
720
+ "ok": true,
721
+ "tokens": 220,
722
+ "tail": " k**\n\nThe smallest positive integer divisible by 7 is k = 7.\n\n**Verification:**\n- 497 + 7 = 504\n- 504 \u00f7 7 = 72 \u2713\n\n#### 7",
723
+ "family": "number_theory",
724
+ "template": "smallest_k_divisible"
725
+ },
726
+ {
727
+ "src": "v31_math_competition/algebra/linear_system_2x2",
728
+ "gold": "11",
729
+ "ok": true,
730
+ "tokens": 143,
731
+ "tail": " 40 + 37\n\n7x + 7y = 77\n\n**Step 2: Factor out 7**\n\n7(x + y) = 77\n\n**Step 3: Solve for x + y**\n\nx + y = 77/7 = 11\n\n#### 11",
732
+ "family": "algebra",
733
+ "template": "linear_system_2x2"
734
+ },
735
+ {
736
+ "src": "v31_math_competition/geometry/composite_area",
737
+ "gold": "14",
738
+ "ok": true,
739
+ "tokens": 142,
740
+ "tail": "figure**\n- Area of L-shaped figure = Area of rectangle \u2212 Area of square\n- Area of L-shaped figure = 18 \u2212 4 = 14\n\n#### 14",
741
+ "family": "geometry",
742
+ "template": "composite_area"
743
+ },
744
+ {
745
+ "src": "v31_math_competition/algebra/quadratic_sum_roots",
746
+ "gold": "40",
747
+ "ok": true,
748
+ "tokens": 260,
749
+ "tail": "to -14.\nThose numbers are -10 and -4.\n\nSo: (x - 10)(x - 4) = 0\n\nRoots: x = 10 and x = 4\n\nProduct: 10 \u00d7 4 = 40 \u2713\n\n#### 40",
750
+ "family": "algebra",
751
+ "template": "quadratic_sum_roots"
752
+ },
753
+ {
754
+ "src": "v31_math_competition/combinatorics/arrangements_constraint",
755
+ "gold": "240",
756
+ "ok": false,
757
+ "tokens": 449,
758
+ "tail": " block position, the remaining letters (C, D, E, F) can be arranged in 4! = 24 ways.\n\nSo total = 5 \u00d7 24 = 120.\n\n#### 120",
759
+ "family": "combinatorics",
760
+ "template": "arrangements_constraint"
761
+ },
762
+ {
763
+ "src": "v31_math_competition/algebra/linear_system_2x2",
764
+ "gold": "15",
765
+ "ok": true,
766
+ "tokens": 273,
767
+ "tail": "tep 4: Verify with Equation 2**\n\n3(-5) + 5(-3) = -15 - 15 = -30 \u2713\n\n**Step 5: Calculate xy**\n\nxy = (-5)(-3) = 15\n\n#### 15",
768
+ "family": "algebra",
769
+ "template": "linear_system_2x2"
770
+ },
771
+ {
772
+ "src": "v31_math_competition/probability/probability_fraction",
773
+ "gold": "1/18",
774
+ "ok": true,
775
+ "tokens": 230,
776
+ "tail": "{2}{36} = \\frac{1}{18}$$\n\n## Verification\n\nThe fraction 1/18 is already in lowest terms since gcd(1, 18) = 1.\n\n#### 1/18",
777
+ "family": "probability",
778
+ "template": "probability_fraction"
779
+ },
780
+ {
781
+ "src": "v31_math_competition/combinatorics/arrangements_constraint",
782
+ "gold": "240",
783
+ "ok": false,
784
+ "tokens": 441,
785
+ "tail": " block position, the remaining letters (C, D, E, F) can be arranged in 4! = 24 ways.\n\nSo total = 5 \u00d7 24 = 120.\n\n#### 120",
786
+ "family": "combinatorics",
787
+ "template": "arrangements_constraint"
788
+ },
789
+ {
790
+ "src": "v31_math_competition/number_theory/gcd_three",
791
+ "gold": "216",
792
+ "ok": true,
793
+ "tokens": 260,
794
+ "tail": "the LCM.**\n\nLCM = 2\u00b3 \u00d7 3\u00b3 = 8 \u00d7 27 = 216\n\n**Verification:**\n- 216 \u00f7 72 = 3 \u2713\n- 216 \u00f7 27 = 8 \u2713\n- 216 \u00f7 36 = 6 \u2713\n\n#### 216",
795
+ "family": "number_theory",
796
+ "template": "gcd_three"
797
+ },
798
+ {
799
+ "src": "v31_math_competition/combinatorics/arrangements_constraint",
800
+ "gold": "480",
801
+ "ok": false,
802
+ "tokens": 1024,
803
+ "tail": "jacent, and they can be arranged in 2 ways. So 20 invalid.\n\nValid: 720 - 20 = 700? No wait, let me recheck.\n\nTotal: 720.",
804
+ "family": "combinatorics",
805
+ "template": "arrangements_constraint"
806
+ },
807
+ {
808
+ "src": "v31_math_competition/algebra/quadratic_sum_roots",
809
+ "gold": "70",
810
+ "ok": true,
811
+ "tokens": 247,
812
+ "tail": "+ 140 = 0\n2(x\u00b2 - 17x + 70) = 0\n2(x - 7)(x - 10) = 0\n\nSo the roots are x = 7 and x = 10.\n\nProduct: 7 \u00d7 10 = 70 \u2713\n\n#### 70",
813
+ "family": "algebra",
814
+ "template": "quadratic_sum_roots"
815
+ },
816
+ {
817
+ "src": "v31_math_competition/algebra/linear_system_2x2",
818
+ "gold": "-7",
819
+ "ok": true,
820
+ "tokens": 286,
821
+ "tail": "uation 1:\n2x + 4(-7) = -40\n2x - 28 = -40\n2x = -12\nx = -6\n\nCheck in Equation 2:\n4(-6) + 5(-7) = -24 - 35 = -59 \u2713\n\n#### -7",
822
+ "family": "algebra",
823
+ "template": "linear_system_2x2"
824
+ },
825
+ {
826
+ "src": "v31_math_competition/number_theory/smallest_k_divisible",
827
+ "gold": "12",
828
+ "ok": true,
829
+ "tokens": 244,
830
+ "tail": " positive integer k satisfying k \u2261 12 (mod 13) is k = 12.\n\n**Verification:**\n- 807 + 12 = 819\n- 819 \u00f7 13 = 63 \u2713\n\n#### 12",
831
+ "family": "number_theory",
832
+ "template": "smallest_k_divisible"
833
+ },
834
+ {
835
+ "src": "v31_math_competition/geometry/composite_area",
836
+ "gold": "61",
837
+ "ok": true,
838
+ "tokens": 139,
839
+ "tail": "re**\nArea of L-shaped figure = Area of rectangle \u2212 Area of square cut out\nArea of L-shaped figure = 70 \u2212 9 = 61\n\n#### 61",
840
+ "family": "geometry",
841
+ "template": "composite_area"
842
+ },
843
+ {
844
+ "src": "v31_math_competition/combinatorics/arrangements_constraint",
845
+ "gold": "48",
846
+ "ok": false,
847
+ "tokens": 223,
848
+ "tail": "re 2 ways to place A and B.\n\n**Step 4: Calculate total arrangements**\n\nTotal arrangements = 3! \u00d7 2 = 6 \u00d7 2 = 12\n\n#### 12",
849
+ "family": "combinatorics",
850
+ "template": "arrangements_constraint"
851
+ },
852
+ {
853
+ "src": "v31_math_competition/probability/probability_fraction",
854
+ "gold": "5/72",
855
+ "ok": true,
856
+ "tokens": 550,
857
+ "tail": "ations: 3\n- (3, 5, 6): permutations: 3\n- (4, 4, 6): permutations: 3\n- (4, 5, 5): permutations: 3\n\nTotal: 15 \u2713\n\n#### 5/72",
858
+ "family": "probability",
859
+ "template": "probability_fraction"
860
+ }
861
+ ]
862
+ },
863
+ "v31_math_robustness": {
864
+ "n": 16,
865
+ "correct": 13,
866
+ "pass_frac": 0.8125,
867
+ "completion_tokens": 3549,
868
+ "mean_gen_tokens_correct": 159.4,
869
+ "items": [
870
+ {
871
+ "src": "v31_math_robustness/numerical_swap/travel_distance",
872
+ "gold": "460",
873
+ "ok": true,
874
+ "tokens": 167,
875
+ "tail": "ance = 60 \u00d7 4 = 240 miles\n\n**Step 4: Calculate total distance**\n- Total distance = 120 + 100 + 240 = 460 miles\n\n#### 460",
876
+ "perturbation": "numerical_swap",
877
+ "template": "travel_distance"
878
+ },
879
+ {
880
+ "src": "v31_math_robustness/unit_swap/garden_harvest/metric",
881
+ "gold": "1633",
882
+ "ok": false,
883
+ "tokens": 441,
884
+ "tail": "ion: (1,632 + 408) \u00d7 0.80 = 2,040 \u00d7 0.80 = 1,632\n\nLet me verify: 1,632 + 408 = 2,040, and 2,040 \u00d7 0.8 = 1,632\n\n#### 1632",
885
+ "perturbation": "unit_swap",
886
+ "template": "garden_harvest/metric"
887
+ },
888
+ {
889
+ "src": "v31_math_robustness/numerical_swap/travel_distance",
890
+ "gold": "530",
891
+ "ok": true,
892
+ "tokens": 167,
893
+ "tail": "tance = 50 \u00d7 5 = 250 miles\n\n**Step 4: Calculate total distance**\n- Total distance = 80 + 200 + 250 = 530 miles\n\n#### 530",
894
+ "perturbation": "numerical_swap",
895
+ "template": "travel_distance"
896
+ },
897
+ {
898
+ "src": "v31_math_robustness/numerical_swap/library_books",
899
+ "gold": "120",
900
+ "ok": true,
901
+ "tokens": 142,
902
+ "tail": "ook = $120\n\nAlternatively, I can calculate:\nTotal fine = 2 books \u00d7 12 days \u00d7 $5/day = 24 book-days \u00d7 $5 = $120\n\n#### 120",
903
+ "perturbation": "numerical_swap",
904
+ "template": "library_books"
905
+ },
906
+ {
907
+ "src": "v31_math_robustness/digit_expand/bakery_orders/scaled",
908
+ "gold": "1680",
909
+ "ok": true,
910
+ "tokens": 134,
911
+ "tail": "ver loaves**\nLeftover loaves = Total production \u2212 Total orders\nLeftover loaves = 5,600 \u2212 3,920 = 1,680 loaves\n\n#### 1680",
912
+ "perturbation": "digit_expand",
913
+ "template": "bakery_orders/scaled"
914
+ },
915
+ {
916
+ "src": "v31_math_robustness/topical_distractor/library_books/noop",
917
+ "gold": "80",
918
+ "ok": false,
919
+ "tokens": 562,
920
+ "tail": "9.\n\nGiven typical problem conventions, I'll provide 79 (truncated) or check if I should round. Most likely: 79.\n\n#### 79",
921
+ "perturbation": "topical_distractor",
922
+ "template": "library_books/noop"
923
+ },
924
+ {
925
+ "src": "v31_math_robustness/unit_swap/classroom_supplies/pounds",
926
+ "gold": "160",
927
+ "ok": true,
928
+ "tokens": 99,
929
+ "tail": " = 8\n\n**Step 2: Calculate total supplies for all students**\nTotal students = 20\n\nTotal supplies = 20 \u00d7 8 = 160\n\n#### 160",
930
+ "perturbation": "unit_swap",
931
+ "template": "classroom_supplies/pounds"
932
+ },
933
+ {
934
+ "src": "v31_math_robustness/topical_distractor/percentage_compose/noop",
935
+ "gold": "400",
936
+ "ok": true,
937
+ "tokens": 126,
938
+ "tail": " store has been selling marbles for 4 years\" is irrelevant to this calculation.\n\nTyler gives away 400 marbles.\n\n#### 400",
939
+ "perturbation": "topical_distractor",
940
+ "template": "percentage_compose/noop"
941
+ },
942
+ {
943
+ "src": "v31_math_robustness/context_pad/library_books/padded",
944
+ "gold": "113",
945
+ "ok": true,
946
+ "tokens": 424,
947
+ "tail": "e 113 (rounded) or perhaps the problem expects 112 or 113. Given standard rounding rules, 112.5 rounds to 113.\n\n#### 113",
948
+ "perturbation": "context_pad",
949
+ "template": "library_books/padded"
950
+ },
951
+ {
952
+ "src": "v31_math_robustness/unit_swap/percentage_compose/euros",
953
+ "gold": "60",
954
+ "ok": true,
955
+ "tokens": 75,
956
+ "tail": " 20/100 = 0.20\n\nStep 2: Multiply the total marbles by the decimal.\n300 \u00d7 0.20 = 60\n\nRyan gives away 60 marbles.\n\n#### 60",
957
+ "perturbation": "unit_swap",
958
+ "template": "percentage_compose/euros"
959
+ },
960
+ {
961
+ "src": "v31_math_robustness/unit_swap/garden_harvest/pounds",
962
+ "gold": "1076",
963
+ "ok": false,
964
+ "tokens": 474,
965
+ "tail": "teger answer. Let me verify my calculations are correct and round to nearest integer.\n\n1075.2 rounds to 1075.\n\n#### 1075",
966
+ "perturbation": "unit_swap",
967
+ "template": "garden_harvest/pounds"
968
+ },
969
+ {
970
+ "src": "v31_math_robustness/context_pad/travel_distance/padded",
971
+ "gold": "650",
972
+ "ok": true,
973
+ "tokens": 171,
974
+ "tail": "stance = 45 \u00d7 2 = 90 miles\n\n**Step 4: Calculate total distance**\n- Total distance = 350 + 210 + 90 = 650 miles\n\n#### 650",
975
+ "perturbation": "context_pad",
976
+ "template": "travel_distance/padded"
977
+ },
978
+ {
979
+ "src": "v31_math_robustness/context_pad/classroom_supplies/padded",
980
+ "gold": "144",
981
+ "ok": true,
982
+ "tokens": 175,
983
+ "tail": "Note: The information about the bus route, sunny Saturday, and crowded streets is irrelevant to this problem.)\n\n#### 144",
984
+ "perturbation": "context_pad",
985
+ "template": "classroom_supplies/padded"
986
+ },
987
+ {
988
+ "src": "v31_math_robustness/context_pad/travel_distance/padded",
989
+ "gold": "520",
990
+ "ok": true,
991
+ "tokens": 163,
992
+ "tail": "ance = 40 \u00d7 3 = 120 miles\n\n**Step 4: Calculate total distance**\n- Total distance = 280 + 120 + 120 = 520 miles\n\n#### 520",
993
+ "perturbation": "context_pad",
994
+ "template": "travel_distance/padded"
995
+ },
996
+ {
997
+ "src": "v31_math_robustness/context_pad/garden_harvest/padded",
998
+ "gold": "693",
999
+ "ok": true,
1000
+ "tokens": 93,
1001
+ "tail": "ants\n\n**Step 2: Find the total number of cucumbers**\n- Cucumbers per plant: 7\n- Total cucumbers = 99 \u00d7 7 = 693\n\n#### 693",
1002
+ "perturbation": "context_pad",
1003
+ "template": "garden_harvest/padded"
1004
+ },
1005
+ {
1006
+ "src": "v31_math_robustness/context_pad/library_books/padded",
1007
+ "gold": "56",
1008
+ "ok": true,
1009
+ "tokens": 136,
1010
+ "tail": " books = $56\n\nAlternatively, I can calculate it as:\nTotal fine = 4 books \u00d7 7 days \u00d7 $2/day/book = 28 \u00d7 $2 = $56\n\n#### 56",
1011
+ "perturbation": "context_pad",
1012
+ "template": "library_books/padded"
1013
+ }
1014
+ ]
1015
+ },
1016
+ "v31_code_humaneval_plus": {
1017
+ "n": 16,
1018
+ "correct": 16,
1019
+ "pass_frac": 1.0,
1020
+ "completion_tokens": 517,
1021
+ "mean_gen_tokens_correct": 32.3,
1022
+ "items": [
1023
+ {
1024
+ "src": "v31_code_humaneval_plus/count_in_list",
1025
+ "task_id": "v31_codeplus/count_target_vhemwk",
1026
+ "entry_point": "count_target_vhemwk",
1027
+ "ok": true,
1028
+ "tokens": 28,
1029
+ "n_test_cases": 58,
1030
+ "template": "count_in_list",
1031
+ "tail": " count = 0\n for item in items:\n if item == target:\n count += 1\n return count"
1032
+ },
1033
+ {
1034
+ "src": "v31_code_humaneval_plus/reverse_words",
1035
+ "task_id": "v31_codeplus/reverse_each_word_pxuilw",
1036
+ "entry_point": "reverse_each_word_pxuilw",
1037
+ "ok": true,
1038
+ "tokens": 19,
1039
+ "n_test_cases": 39,
1040
+ "template": "reverse_words",
1041
+ "tail": " return ' '.join(word[::-1] for word in phrase.split(' '))"
1042
+ },
1043
+ {
1044
+ "src": "v31_code_humaneval_plus/is_palindrome",
1045
+ "task_id": "v31_codeplus/is_palindrome_niugjt",
1046
+ "entry_point": "is_palindrome_niugjt",
1047
+ "ok": true,
1048
+ "tokens": 10,
1049
+ "n_test_cases": 50,
1050
+ "template": "is_palindrome",
1051
+ "tail": " return word == word[::-1]"
1052
+ },
1053
+ {
1054
+ "src": "v31_code_humaneval_plus/dict_value_sum",
1055
+ "task_id": "v31_codeplus/sum_dict_values_rsnegh",
1056
+ "entry_point": "sum_dict_values_rsnegh",
1057
+ "ok": true,
1058
+ "tokens": 29,
1059
+ "n_test_cases": 40,
1060
+ "template": "dict_value_sum",
1061
+ "tail": " total = 0\n for value in d.values():\n if isinstance(value, int):\n total += value\n return total"
1062
+ },
1063
+ {
1064
+ "src": "v31_code_humaneval_plus/max_consecutive_run",
1065
+ "task_id": "v31_codeplus/max_consecutive_run_ooglkq",
1066
+ "entry_point": "max_consecutive_run_ooglkq",
1067
+ "ok": true,
1068
+ "tokens": 79,
1069
+ "n_test_cases": 54,
1070
+ "template": "max_consecutive_run",
1071
+ "tail": "[i] == nums[i-1]:\n current_run += 1\n max_run = max(max_run, current_run)\n else:\n current_run = 1\n \n return max_run"
1072
+ },
1073
+ {
1074
+ "src": "v31_code_humaneval_plus/max_consecutive_run",
1075
+ "task_id": "v31_codeplus/max_consecutive_run_dzrref",
1076
+ "entry_point": "max_consecutive_run_dzrref",
1077
+ "ok": true,
1078
+ "tokens": 79,
1079
+ "n_test_cases": 46,
1080
+ "template": "max_consecutive_run",
1081
+ "tail": "i] == items[i-1]:\n current_run += 1\n max_run = max(max_run, current_run)\n else:\n current_run = 1\n \n return max_run"
1082
+ },
1083
+ {
1084
+ "src": "v31_code_humaneval_plus/dict_value_sum",
1085
+ "task_id": "v31_codeplus/sum_dict_values_atttza",
1086
+ "entry_point": "sum_dict_values_atttza",
1087
+ "ok": true,
1088
+ "tokens": 17,
1089
+ "n_test_cases": 38,
1090
+ "template": "dict_value_sum",
1091
+ "tail": " return sum(value for value in mapping.values() if isinstance(value, int))"
1092
+ },
1093
+ {
1094
+ "src": "v31_code_humaneval_plus/count_in_list",
1095
+ "task_id": "v31_codeplus/count_target_uophih",
1096
+ "entry_point": "count_target_uophih",
1097
+ "ok": true,
1098
+ "tokens": 28,
1099
+ "n_test_cases": 53,
1100
+ "template": "count_in_list",
1101
+ "tail": " count = 0\n for num in nums:\n if num == target:\n count += 1\n return count"
1102
+ },
1103
+ {
1104
+ "src": "v31_code_humaneval_plus/filter_above",
1105
+ "task_id": "v31_codeplus/filter_above_tbyxwb",
1106
+ "entry_point": "filter_above_tbyxwb",
1107
+ "ok": true,
1108
+ "tokens": 15,
1109
+ "n_test_cases": 62,
1110
+ "template": "filter_above",
1111
+ "tail": " return [x for x in values if x > min_val]"
1112
+ },
1113
+ {
1114
+ "src": "v31_code_humaneval_plus/filter_above",
1115
+ "task_id": "v31_codeplus/filter_above_jlkmrb",
1116
+ "entry_point": "filter_above_jlkmrb",
1117
+ "ok": true,
1118
+ "tokens": 15,
1119
+ "n_test_cases": 61,
1120
+ "template": "filter_above",
1121
+ "tail": " return [x for x in values if x > min_val]"
1122
+ },
1123
+ {
1124
+ "src": "v31_code_humaneval_plus/dict_value_sum",
1125
+ "task_id": "v31_codeplus/sum_dict_values_pzdvhu",
1126
+ "entry_point": "sum_dict_values_pzdvhu",
1127
+ "ok": true,
1128
+ "tokens": 17,
1129
+ "n_test_cases": 48,
1130
+ "template": "dict_value_sum",
1131
+ "tail": " return sum(value for value in mapping.values() if isinstance(value, int))"
1132
+ },
1133
+ {
1134
+ "src": "v31_code_humaneval_plus/max_consecutive_run",
1135
+ "task_id": "v31_codeplus/max_consecutive_run_aqpbmg",
1136
+ "entry_point": "max_consecutive_run_aqpbmg",
1137
+ "ok": true,
1138
+ "tokens": 79,
1139
+ "n_test_cases": 48,
1140
+ "template": "max_consecutive_run",
1141
+ "tail": "[i] == nums[i-1]:\n current_run += 1\n max_run = max(max_run, current_run)\n else:\n current_run = 1\n \n return max_run"
1142
+ },
1143
+ {
1144
+ "src": "v31_code_humaneval_plus/filter_above",
1145
+ "task_id": "v31_codeplus/filter_above_zhivwj",
1146
+ "entry_point": "filter_above_zhivwj",
1147
+ "ok": true,
1148
+ "tokens": 14,
1149
+ "n_test_cases": 47,
1150
+ "template": "filter_above",
1151
+ "tail": " return [x for x in nums if x > threshold]"
1152
+ },
1153
+ {
1154
+ "src": "v31_code_humaneval_plus/count_in_list",
1155
+ "task_id": "v31_codeplus/count_target_himtcr",
1156
+ "entry_point": "count_target_himtcr",
1157
+ "ok": true,
1158
+ "tokens": 28,
1159
+ "n_test_cases": 53,
1160
+ "template": "count_in_list",
1161
+ "tail": " count = 0\n for item in data:\n if item == target:\n count += 1\n return count"
1162
+ },
1163
+ {
1164
+ "src": "v31_code_humaneval_plus/count_in_list",
1165
+ "task_id": "v31_codeplus/count_target_anvjff",
1166
+ "entry_point": "count_target_anvjff",
1167
+ "ok": true,
1168
+ "tokens": 28,
1169
+ "n_test_cases": 44,
1170
+ "template": "count_in_list",
1171
+ "tail": " count = 0\n for num in nums:\n if num == target:\n count += 1\n return count"
1172
+ },
1173
+ {
1174
+ "src": "v31_code_humaneval_plus/reverse_words",
1175
+ "task_id": "v31_codeplus/reverse_each_word_pcilgf",
1176
+ "entry_point": "reverse_each_word_pcilgf",
1177
+ "ok": true,
1178
+ "tokens": 32,
1179
+ "n_test_cases": 47,
1180
+ "template": "reverse_words",
1181
+ "tail": " words = text.split(' ')\n reversed_words = [word[::-1] for word in words]\n return ' '.join(reversed_words)"
1182
+ }
1183
+ ]
1184
+ },
1185
+ "v31_ifeval_verifiable": {
1186
+ "n": 16,
1187
+ "correct": 11,
1188
+ "pass_frac": 0.6875,
1189
+ "completion_tokens": 3202,
1190
+ "mean_gen_tokens_correct": 244.5,
1191
+ "items": [
1192
+ {
1193
+ "src": "v31_ifeval/stack4",
1194
+ "instruction_ids": [
1195
+ "detectable_content:postscript",
1196
+ "detectable_format:multiple_sections",
1197
+ "length_constraints:number_sentences",
1198
+ "detectable_format:number_highlighted_sections"
1199
+ ],
1200
+ "per_instruction": [
1201
+ true,
1202
+ true,
1203
+ true,
1204
+ true
1205
+ ],
1206
+ "stack_depth": 4,
1207
+ "ok": true,
1208
+ "tokens": 169,
1209
+ "tail": "overage and transforming cartography from an art into a precise science.\n\nP.S. I still prefer paper maps for road trips."
1210
+ },
1211
+ {
1212
+ "src": "v31_ifeval/stack2",
1213
+ "instruction_ids": [
1214
+ "detectable_content:number_placeholders",
1215
+ "startend:quotation"
1216
+ ],
1217
+ "per_instruction": [
1218
+ true,
1219
+ true
1220
+ ],
1221
+ "stack_depth": 2,
1222
+ "ok": true,
1223
+ "tokens": 249,
1224
+ "tail": "this art could navigate treacherous waters with greater confidence than those who depended solely on luck or guesswork.\""
1225
+ },
1226
+ {
1227
+ "src": "v31_ifeval/stack2",
1228
+ "instruction_ids": [
1229
+ "length_constraints:number_words",
1230
+ "detectable_content:postscript"
1231
+ ],
1232
+ "per_instruction": [
1233
+ false,
1234
+ true
1235
+ ],
1236
+ "stack_depth": 2,
1237
+ "ok": false,
1238
+ "tokens": 52,
1239
+ "tail": "ide, required boring through gneiss and granite at depths reaching 2,300 meters.\n\nP.S. Swiss precision made it possible."
1240
+ },
1241
+ {
1242
+ "src": "v31_ifeval/stack2",
1243
+ "instruction_ids": [
1244
+ "length_constraints:number_words",
1245
+ "detectable_format:title"
1246
+ ],
1247
+ "per_instruction": [
1248
+ true,
1249
+ true
1250
+ ],
1251
+ "stack_depth": 2,
1252
+ "ok": true,
1253
+ "tokens": 238,
1254
+ "tail": "as the calendar turns, strengthening the bond between local producers and their dedicated customers throughout the year."
1255
+ },
1256
+ {
1257
+ "src": "v31_ifeval/stack4",
1258
+ "instruction_ids": [
1259
+ "startend:quotation",
1260
+ "detectable_format:number_bullet_lists",
1261
+ "punctuation:no_comma",
1262
+ "keywords:letter_frequency"
1263
+ ],
1264
+ "per_instruction": [
1265
+ true,
1266
+ true,
1267
+ true,
1268
+ true
1269
+ ],
1270
+ "stack_depth": 4,
1271
+ "ok": true,
1272
+ "tokens": 184,
1273
+ "tail": " and the town's small library\n* My final stretch follows the seawall where waves crash and spray my face with cool mist\""
1274
+ },
1275
+ {
1276
+ "src": "v31_ifeval/stack2",
1277
+ "instruction_ids": [
1278
+ "length_constraints:number_words",
1279
+ "change_case:capital_word_frequency"
1280
+ ],
1281
+ "per_instruction": [
1282
+ false,
1283
+ true
1284
+ ],
1285
+ "stack_depth": 2,
1286
+ "ok": false,
1287
+ "tokens": 107,
1288
+ "tail": " fitted helmet for SAFETY. Most children master riding within a few sessions when learning feels FUN rather than forced."
1289
+ },
1290
+ {
1291
+ "src": "v31_ifeval/stack4",
1292
+ "instruction_ids": [
1293
+ "change_case:english_capital",
1294
+ "punctuation:no_comma",
1295
+ "length_constraints:number_paragraphs",
1296
+ "length_constraints:number_words"
1297
+ ],
1298
+ "per_instruction": [
1299
+ true,
1300
+ true,
1301
+ true,
1302
+ true
1303
+ ],
1304
+ "stack_depth": 4,
1305
+ "ok": true,
1306
+ "tokens": 241,
1307
+ "tail": "TIME IMAGING AND GLOBAL POSITIONING. WHAT ONCE TOOK YEARS OF DANGEROUS EXPEDITION NOW OCCURS IN MILLISECONDS FROM SPACE."
1308
+ },
1309
+ {
1310
+ "src": "v31_ifeval/stack1",
1311
+ "instruction_ids": [
1312
+ "detectable_content:postscript"
1313
+ ],
1314
+ "per_instruction": [
1315
+ true
1316
+ ],
1317
+ "stack_depth": 1,
1318
+ "ok": true,
1319
+ "tokens": 299,
1320
+ "tail": "iding\u2014this positive reinforcement makes the learning process even more memorable and encourages them to keep practicing!"
1321
+ },
1322
+ {
1323
+ "src": "v31_ifeval/stack1",
1324
+ "instruction_ids": [
1325
+ "startend:end_checker"
1326
+ ],
1327
+ "per_instruction": [
1328
+ true
1329
+ ],
1330
+ "stack_depth": 1,
1331
+ "ok": true,
1332
+ "tokens": 254,
1333
+ "tail": "et offers something essential: proof that community can thrive when people gather with intention and care.\n\nBest wishes."
1334
+ },
1335
+ {
1336
+ "src": "v31_ifeval/stack2",
1337
+ "instruction_ids": [
1338
+ "detectable_format:title",
1339
+ "detectable_content:number_placeholders"
1340
+ ],
1341
+ "per_instruction": [
1342
+ true,
1343
+ true
1344
+ ],
1345
+ "stack_depth": 2,
1346
+ "ok": true,
1347
+ "tokens": 346,
1348
+ "tail": "unprecedented accuracy, rendering the painstaking field surveys of previous centuries almost unimaginable by comparison."
1349
+ },
1350
+ {
1351
+ "src": "v31_ifeval/stack3",
1352
+ "instruction_ids": [
1353
+ "keywords:letter_frequency",
1354
+ "length_constraints:number_sentences",
1355
+ "keywords:forbidden_words"
1356
+ ],
1357
+ "per_instruction": [
1358
+ false,
1359
+ true,
1360
+ true
1361
+ ],
1362
+ "stack_depth": 3,
1363
+ "ok": false,
1364
+ "tokens": 90,
1365
+ "tail": "ve gives a proud, happy feel. Even a tiny plot lets you nurture life. No yard? No problem. A balcony works just as well."
1366
+ },
1367
+ {
1368
+ "src": "v31_ifeval/stack4",
1369
+ "instruction_ids": [
1370
+ "length_constraints:number_words",
1371
+ "detectable_format:title",
1372
+ "detectable_content:number_placeholders",
1373
+ "length_constraints:number_sentences"
1374
+ ],
1375
+ "per_instruction": [
1376
+ true,
1377
+ true,
1378
+ true,
1379
+ true
1380
+ ],
1381
+ "stack_depth": 4,
1382
+ "ok": true,
1383
+ "tokens": 84,
1384
+ "tail": " cartography in the [year]. Satellites finally ended guesswork, delivering exact geospatial data to anyone with a phone."
1385
+ },
1386
+ {
1387
+ "src": "v31_ifeval/stack4",
1388
+ "instruction_ids": [
1389
+ "length_constraints:number_words",
1390
+ "length_constraints:number_paragraphs",
1391
+ "change_case:capital_word_frequency",
1392
+ "keywords:letter_frequency"
1393
+ ],
1394
+ "per_instruction": [
1395
+ true,
1396
+ true,
1397
+ true,
1398
+ false
1399
+ ],
1400
+ "stack_depth": 4,
1401
+ "ok": false,
1402
+ "tokens": 170,
1403
+ "tail": "This small journey grounds me, connects me to tides and time. I arrive home with sand in my tires and peace in my chest."
1404
+ },
1405
+ {
1406
+ "src": "v31_ifeval/stack1",
1407
+ "instruction_ids": [
1408
+ "length_constraints:number_paragraphs"
1409
+ ],
1410
+ "per_instruction": [
1411
+ true
1412
+ ],
1413
+ "stack_depth": 1,
1414
+ "ok": true,
1415
+ "tokens": 277,
1416
+ "tail": "sfaction of having moved through a place rather than merely past it, my panniers heavy with the day's small discoveries."
1417
+ },
1418
+ {
1419
+ "src": "v31_ifeval/stack4",
1420
+ "instruction_ids": [
1421
+ "detectable_format:number_bullet_lists",
1422
+ "detectable_content:postscript",
1423
+ "length_constraints:number_sentences",
1424
+ "startend:quotation"
1425
+ ],
1426
+ "per_instruction": [
1427
+ true,
1428
+ true,
1429
+ false,
1430
+ true
1431
+ ],
1432
+ "stack_depth": 4,
1433
+ "ok": false,
1434
+ "tokens": 93,
1435
+ "tail": "atches from the marsh edge\n* You arrive with wind-tangled hair and peace\n\nP.S. The heron has started following me home.\""
1436
+ },
1437
+ {
1438
+ "src": "v31_ifeval/stack3",
1439
+ "instruction_ids": [
1440
+ "length_constraints:number_paragraphs",
1441
+ "change_case:english_capital",
1442
+ "punctuation:no_comma"
1443
+ ],
1444
+ "per_instruction": [
1445
+ true,
1446
+ true,
1447
+ true
1448
+ ],
1449
+ "stack_depth": 3,
1450
+ "ok": true,
1451
+ "tokens": 349,
1452
+ "tail": "TH ELEVATED PLATFORMS AND WATERPROOF ELECTRICAL SYSTEMS PROTECTING THE NETWORK FROM TIDAL FLOODING AND HURRICANE EVENTS."
1453
+ }
1454
+ ]
1455
+ },
1456
+ "v31_reasoning_logic_grid": {
1457
+ "n": 16,
1458
+ "correct": 11,
1459
+ "pass_frac": 0.6875,
1460
+ "completion_tokens": 66,
1461
+ "mean_gen_tokens_correct": 4.2,
1462
+ "items": [
1463
+ {
1464
+ "src": "v31_logic_grid/3x2",
1465
+ "gold": "biologist",
1466
+ "ok": true,
1467
+ "tokens": 5,
1468
+ "tail": " Answer: biologist",
1469
+ "num_people": 3,
1470
+ "num_attrs": 2,
1471
+ "num_clues": 6
1472
+ },
1473
+ {
1474
+ "src": "v31_logic_grid/3x3",
1475
+ "gold": "photography",
1476
+ "ok": false,
1477
+ "tokens": 4,
1478
+ "tail": " Answer: reading",
1479
+ "num_people": 3,
1480
+ "num_attrs": 3,
1481
+ "num_clues": 6
1482
+ },
1483
+ {
1484
+ "src": "v31_logic_grid/4x3",
1485
+ "gold": "engineer",
1486
+ "ok": true,
1487
+ "tokens": 4,
1488
+ "tail": " Answer: engineer",
1489
+ "num_people": 4,
1490
+ "num_attrs": 3,
1491
+ "num_clues": 11
1492
+ },
1493
+ {
1494
+ "src": "v31_logic_grid/4x3",
1495
+ "gold": "salad",
1496
+ "ok": false,
1497
+ "tokens": 4,
1498
+ "tail": " Answer: sushi",
1499
+ "num_people": 4,
1500
+ "num_attrs": 3,
1501
+ "num_clues": 14
1502
+ },
1503
+ {
1504
+ "src": "v31_logic_grid/3x3",
1505
+ "gold": "cycling",
1506
+ "ok": true,
1507
+ "tokens": 4,
1508
+ "tail": " Answer: cycling",
1509
+ "num_people": 3,
1510
+ "num_attrs": 3,
1511
+ "num_clues": 13
1512
+ },
1513
+ {
1514
+ "src": "v31_logic_grid/4x4",
1515
+ "gold": "doctor",
1516
+ "ok": true,
1517
+ "tokens": 4,
1518
+ "tail": " Answer: doctor",
1519
+ "num_people": 4,
1520
+ "num_attrs": 4,
1521
+ "num_clues": 16
1522
+ },
1523
+ {
1524
+ "src": "v31_logic_grid/3x2",
1525
+ "gold": "dog",
1526
+ "ok": true,
1527
+ "tokens": 4,
1528
+ "tail": " Answer: dog",
1529
+ "num_people": 3,
1530
+ "num_attrs": 2,
1531
+ "num_clues": 7
1532
+ },
1533
+ {
1534
+ "src": "v31_logic_grid/3x3",
1535
+ "gold": "musician",
1536
+ "ok": false,
1537
+ "tokens": 4,
1538
+ "tail": " Answer: artist",
1539
+ "num_people": 3,
1540
+ "num_attrs": 3,
1541
+ "num_clues": 8
1542
+ },
1543
+ {
1544
+ "src": "v31_logic_grid/4x4",
1545
+ "gold": "engineer",
1546
+ "ok": true,
1547
+ "tokens": 4,
1548
+ "tail": " Answer: engineer",
1549
+ "num_people": 4,
1550
+ "num_attrs": 4,
1551
+ "num_clues": 16
1552
+ },
1553
+ {
1554
+ "src": "v31_logic_grid/4x3",
1555
+ "gold": "cat",
1556
+ "ok": true,
1557
+ "tokens": 4,
1558
+ "tail": " Answer: cat",
1559
+ "num_people": 4,
1560
+ "num_attrs": 3,
1561
+ "num_clues": 11
1562
+ },
1563
+ {
1564
+ "src": "v31_logic_grid/4x3",
1565
+ "gold": "smoothie",
1566
+ "ok": true,
1567
+ "tokens": 4,
1568
+ "tail": " Answer: smoothie",
1569
+ "num_people": 4,
1570
+ "num_attrs": 3,
1571
+ "num_clues": 15
1572
+ },
1573
+ {
1574
+ "src": "v31_logic_grid/3x3",
1575
+ "gold": "orange",
1576
+ "ok": false,
1577
+ "tokens": 4,
1578
+ "tail": " Answer: white",
1579
+ "num_people": 3,
1580
+ "num_attrs": 3,
1581
+ "num_clues": 8
1582
+ },
1583
+ {
1584
+ "src": "v31_logic_grid/3x3",
1585
+ "gold": "hamster",
1586
+ "ok": true,
1587
+ "tokens": 5,
1588
+ "tail": " Answer: hamster",
1589
+ "num_people": 3,
1590
+ "num_attrs": 3,
1591
+ "num_clues": 10
1592
+ },
1593
+ {
1594
+ "src": "v31_logic_grid/4x4",
1595
+ "gold": "salad",
1596
+ "ok": true,
1597
+ "tokens": 4,
1598
+ "tail": " Answer: salad",
1599
+ "num_people": 4,
1600
+ "num_attrs": 4,
1601
+ "num_clues": 17
1602
+ },
1603
+ {
1604
+ "src": "v31_logic_grid/3x2",
1605
+ "gold": "artist",
1606
+ "ok": true,
1607
+ "tokens": 4,
1608
+ "tail": " Answer: artist",
1609
+ "num_people": 3,
1610
+ "num_attrs": 2,
1611
+ "num_clues": 8
1612
+ },
1613
+ {
1614
+ "src": "v31_logic_grid/5x3",
1615
+ "gold": "ramen",
1616
+ "ok": false,
1617
+ "tokens": 4,
1618
+ "tail": " Answer: sushi",
1619
+ "num_people": 5,
1620
+ "num_attrs": 3,
1621
+ "num_clues": 15
1622
+ }
1623
+ ]
1624
+ },
1625
+ "v31_reasoning_dyval_arith": {
1626
+ "n": 16,
1627
+ "correct": 13,
1628
+ "pass_frac": 0.8125,
1629
+ "completion_tokens": 5610,
1630
+ "mean_gen_tokens_correct": 254.3,
1631
+ "items": [
1632
+ {
1633
+ "src": "v31_dyval_arith/d4/nl_vars",
1634
+ "gold": "-8",
1635
+ "ok": true,
1636
+ "tokens": 273,
1637
+ "tail": "(v11 - v12) = 2 - 7 = -5\n\nv14 = min(v10, v13) = min(6, -5) = -5\n\nv15 = min(v7, v14) = min(-8, -5) = -8\n\nFinal answer: -8",
1638
+ "depth": 4,
1639
+ "mode": "nl_vars"
1640
+ },
1641
+ {
1642
+ "src": "v31_dyval_arith/d4/math",
1643
+ "gold": "-140",
1644
+ "ok": true,
1645
+ "tokens": 419,
1646
+ "tail": "*\n\n**Now compute the main expression:**\n((A) + (B)) * (C)\n= (8 + (-18)) * 14\n= (-10) * 14\n= **-140**\n\nFinal answer: -140",
1647
+ "depth": 4,
1648
+ "mode": "math"
1649
+ },
1650
+ {
1651
+ "src": "v31_dyval_arith/d3/math",
1652
+ "gold": "12",
1653
+ "ok": true,
1654
+ "tokens": 217,
1655
+ "tail": ", 1) = 6`\n- `min(6, 9) = 6`\n- `6 + 6 = 12`\n\n**Step 4: Evaluate the outer `min`**\n\n- `min(24, 12) = 12`\n\nFinal answer: 12",
1656
+ "depth": 3,
1657
+ "mode": "math"
1658
+ },
1659
+ {
1660
+ "src": "v31_dyval_arith/d4/nl_vars",
1661
+ "gold": "4032",
1662
+ "ok": true,
1663
+ "tokens": 275,
1664
+ "tail": "= (v11 * v12) = 28 * 8 = 224\n\nv14 = (v10 * v13) = 6 * 224 = 1344\n\nv15 = (v7 * v14) = 3 * 1344 = 4032\n\nFinal answer: 4032",
1665
+ "depth": 4,
1666
+ "mode": "nl_vars"
1667
+ },
1668
+ {
1669
+ "src": "v31_dyval_arith/d6/nl_vars",
1670
+ "gold": "-78",
1671
+ "ok": false,
1672
+ "tokens": 768,
1673
+ "tail": "= (v39 * v40) = 1 * 3 = 3\nv42 = (6 + 9) = 15\nv43 = max(7, 6) = 7\nv44 = (v42 + v43) = 15 + 7 = 22\nv45 = (v41 - v44) = 3 -",
1674
+ "depth": 6,
1675
+ "mode": "nl_vars"
1676
+ },
1677
+ {
1678
+ "src": "v31_dyval_arith/d3/math",
1679
+ "gold": "72",
1680
+ "ok": true,
1681
+ "tokens": 240,
1682
+ "tail": "p 4: Evaluate the max function**\n\n`max(min((8 * 3), (2 - 6)), ((1 * 4) * (3 * 6))) = max(-4, 72) = 72`\n\nFinal answer: 72",
1683
+ "depth": 3,
1684
+ "mode": "math"
1685
+ },
1686
+ {
1687
+ "src": "v31_dyval_arith/d4/nl_vars",
1688
+ "gold": "396",
1689
+ "ok": true,
1690
+ "tokens": 272,
1691
+ "tail": ", v12) = min(11, 18) = 11\n\nv14 = (v10 * v13) = 36 * 11 = 396\n\nv15 = max(v7, v14) = max(40, 396) = 396\n\nFinal answer: 396",
1692
+ "depth": 4,
1693
+ "mode": "nl_vars"
1694
+ },
1695
+ {
1696
+ "src": "v31_dyval_arith/d6/math",
1697
+ "gold": "345",
1698
+ "ok": false,
1699
+ "tokens": 768,
1700
+ "tail": "7,1), min(3,9)) = max(7, 3) = 7\n- min(4, 7) = 4\n\n- min((5*8), (3*3)) = min(40, 9) = 9\n- max(min(7,3), max(9,2)) = max(3,",
1701
+ "depth": 6,
1702
+ "mode": "math"
1703
+ },
1704
+ {
1705
+ "src": "v31_dyval_arith/d3/math",
1706
+ "gold": "580",
1707
+ "ok": true,
1708
+ "tokens": 169,
1709
+ "tail": " Continue computing**\n\n(11 - 1) = 10\n\n(40 + 18) = 58\n\n**Step 4: Final multiplication**\n\n10 * 58 = 580\n\nFinal answer: 580",
1710
+ "depth": 3,
1711
+ "mode": "math"
1712
+ },
1713
+ {
1714
+ "src": "v31_dyval_arith/d3/nl_vars",
1715
+ "gold": "15",
1716
+ "ok": true,
1717
+ "tokens": 133,
1718
+ "tail": "1 - 1) = 0\n\nv5 = (6 + 9) = 15\n\nv6 = max(v4, v5) = max(0, 15) = 15\n\nv7 = min(v3, v6) = min(87, 15) = 15\n\nFinal answer: 15",
1719
+ "depth": 3,
1720
+ "mode": "nl_vars"
1721
+ },
1722
+ {
1723
+ "src": "v31_dyval_arith/d4/nl_vars",
1724
+ "gold": "17",
1725
+ "ok": true,
1726
+ "tokens": 270,
1727
+ "tail": "3 = min(v11, v12) = min(3, 6) = 3\n\nv14 = (v10 * v13) = (8 * 3) = 24\n\nv15 = (v7 + v14) = (-7 + 24) = 17\n\nFinal answer: 17",
1728
+ "depth": 4,
1729
+ "mode": "nl_vars"
1730
+ },
1731
+ {
1732
+ "src": "v31_dyval_arith/d6/math",
1733
+ "gold": "82",
1734
+ "ok": false,
1735
+ "tokens": 768,
1736
+ "tail": " -396 * (-3) = 1188\n\n**Part 2: Right side of outer min**\n\n`min((((5 - 9) + (9 + 7)) + ((5 - 6) * max(5, 1))), min(((1 - ",
1737
+ "depth": 6,
1738
+ "mode": "math"
1739
+ },
1740
+ {
1741
+ "src": "v31_dyval_arith/d4/math",
1742
+ "gold": "1400",
1743
+ "ok": true,
1744
+ "tokens": 462,
1745
+ "tail": " 3)), max((2 + 5), (9 + 5)))` = `min(5, 14)` = 5\n\n**Step 4: Final multiplication**\n\n`280 * 5` = 1400\n\nFinal answer: 1400",
1746
+ "depth": 4,
1747
+ "mode": "math"
1748
+ },
1749
+ {
1750
+ "src": "v31_dyval_arith/d3/math",
1751
+ "gold": "-315",
1752
+ "ok": true,
1753
+ "tokens": 145,
1754
+ "tail": " (2 - 9))**\n- 5 - 1 = 4\n- 2 - 9 = -7\n- min(4, -7) = -7\n\n**Final multiplication:**\n- 45 * (-7) = -315\n\nFinal answer: -315",
1755
+ "depth": 3,
1756
+ "mode": "math"
1757
+ },
1758
+ {
1759
+ "src": "v31_dyval_arith/d3/nl_vars",
1760
+ "gold": "18",
1761
+ "ok": true,
1762
+ "tokens": 131,
1763
+ "tail": "= 3\n\nv4 = min(2, 7) = 2\n\nv5 = (8 + 5) = 13\n\nv6 = (v4 + v5) = 2 + 13 = 15\n\nv7 = (v3 + v6) = 3 + 15 = 18\n\nFinal answer: 18",
1764
+ "depth": 3,
1765
+ "mode": "nl_vars"
1766
+ },
1767
+ {
1768
+ "src": "v31_dyval_arith/d4/math",
1769
+ "gold": "2",
1770
+ "ok": true,
1771
+ "tokens": 300,
1772
+ "tail": " 6`\n- `max(5, 6) = 6`\n\n- `max(6, 6) = 6`\n\n**Final computation:**\n- Left part - Right part = `8 - 6 = 2`\n\nFinal answer: 2",
1773
+ "depth": 4,
1774
+ "mode": "math"
1775
+ }
1776
+ ]
1777
+ },
1778
+ "v31_long_context_ruler": {
1779
+ "n": 16,
1780
+ "correct": 16,
1781
+ "pass_frac": 1.0,
1782
+ "completion_tokens": 1878,
1783
+ "mean_gen_tokens_correct": 117.4,
1784
+ "items": [
1785
+ {
1786
+ "src": "v31_long_context_ruler/multihop_var",
1787
+ "gold": "806",
1788
+ "ok": true,
1789
+ "tokens": 141,
1790
+ "tail": "zvs = 806\n4. \"Variable v_zcd takes the same value as v_mud.\" \u2192 v_zcd = v_mud = 806\n\nTherefore, v_zcd = 806.\n\nAnswer: 806",
1791
+ "task": "multihop_var",
1792
+ "n_distractors": 112
1793
+ },
1794
+ {
1795
+ "src": "v31_long_context_ruler/multihop_var",
1796
+ "gold": "221",
1797
+ "ok": true,
1798
+ "tokens": 172,
1799
+ "tail": "_z5n = 221\n- \"Variable v_ewd takes the same value as v_kzk.\" \u2192 v_ewd = v_kzk = 221\n\nTherefore, v_ewd = 221.\n\nAnswer: 221",
1800
+ "task": "multihop_var",
1801
+ "n_distractors": 174
1802
+ },
1803
+ {
1804
+ "src": "v31_long_context_ruler/multihop_var",
1805
+ "gold": "667",
1806
+ "ok": true,
1807
+ "tokens": 138,
1808
+ "tail": "Later: \"Variable v_cpa takes the same value as v_3y7.\"\n - So v_cpa = v_3y7 = 667\n\nTherefore, v_cpa = 667.\n\nAnswer: 667",
1809
+ "task": "multihop_var",
1810
+ "n_distractors": 83
1811
+ },
1812
+ {
1813
+ "src": "v31_long_context_ruler/niah_multikey",
1814
+ "gold": "9K2AJ",
1815
+ "ok": true,
1816
+ "tokens": 53,
1817
+ "tail": "ssage, I find this sentence early on:\n\"The code MDGHL7 maps to 9K2AJ.\"\n\nThis directly states the mapping.\n\nAnswer: 9K2AJ",
1818
+ "task": "niah_multikey",
1819
+ "n_distractors": 80
1820
+ },
1821
+ {
1822
+ "src": "v31_long_context_ruler/multihop_var",
1823
+ "gold": "794",
1824
+ "ok": true,
1825
+ "tokens": 172,
1826
+ "tail": "_dws = 794\n- \"Variable v_aj7 takes the same value as v_6ax.\" \u2192 v_aj7 = v_6ax = 794\n\nTherefore, v_aj7 = 794.\n\nAnswer: 794",
1827
+ "task": "multihop_var",
1828
+ "n_distractors": 200
1829
+ },
1830
+ {
1831
+ "src": "v31_long_context_ruler/multihop_var",
1832
+ "gold": "504",
1833
+ "ok": true,
1834
+ "tokens": 172,
1835
+ "tail": "_7pi = 504\n- \"Variable v_6bj takes the same value as v_uoh.\" \u2192 v_6bj = v_uoh = 504\n\nTherefore, v_6bj = 504.\n\nAnswer: 504",
1836
+ "task": "multihop_var",
1837
+ "n_distractors": 196
1838
+ },
1839
+ {
1840
+ "src": "v31_long_context_ruler/niah_single",
+ "gold": "B7XSGS",
+ "ok": true,
+ "tokens": 8,
+ "tail": " Answer: B7XSGS",
+ "task": "niah_single",
+ "n_distractors": 163,
+ "needle_pos": 10
+ },
+ {
+ "src": "v31_long_context_ruler/aggregation_count",
+ "gold": "3",
+ "ok": true,
+ "tokens": 186,
+ "tail": "I found exactly 3 occurrences of the word 'octopus' (all lowercase, matching the case-sensitive requirement).\n\nAnswer: 3",
+ "task": "aggregation_count",
+ "n_distractors": 136
+ },
+ {
+ "src": "v31_long_context_ruler/multihop_var",
+ "gold": "561",
+ "ok": true,
+ "tokens": 115,
+ "tail": "dj = 561\n\n3. \"Variable v_bfc takes the same value as v_z0a.\" \u2192 v_bfc = v_z0a = 561\n\nTherefore, v_bfc = 561.\n\nAnswer: 561",
+ "task": "multihop_var",
+ "n_distractors": 142
+ },
+ {
+ "src": "v31_long_context_ruler/niah_multikey",
+ "gold": "2NRAU",
+ "ok": true,
+ "tokens": 53,
+ "tail": "gh the passage, I find this line:\n\n\"The code SP9JAV maps to 2NRAU.\"\n\nThis is stated directly in the text.\n\nAnswer: 2NRAU",
+ "task": "niah_multikey",
+ "n_distractors": 143
+ },
+ {
+ "src": "v31_long_context_ruler/aggregation_count",
+ "gold": "6",
+ "ok": true,
+ "tokens": 218,
+ "tail": " one: after Jamie noticed the window\n- Sixth one: after Taylor noticed the table\n\nThat's 6 occurrences total.\n\nAnswer: 6",
+ "task": "aggregation_count",
+ "n_distractors": 67
+ },
+ {
+ "src": "v31_long_context_ruler/niah_multikey",
+ "gold": "VEKXE",
+ "ok": true,
+ "tokens": 46,
+ "tail": "o.\n\nLooking through the passage, I can find the answer directly stated:\n\n\"The code BIS8B6 maps to VEKXE.\"\n\nAnswer: VEKXE",
+ "task": "niah_multikey",
+ "n_distractors": 188
+ },
+ {
+ "src": "v31_long_context_ruler/multihop_var",
+ "gold": "643",
+ "ok": true,
+ "tokens": 176,
+ "tail": "_lpx = 643\n- \"Variable v_ope takes the same value as v_kz3.\" \u2192 v_ope = v_kz3 = 643\n\nTherefore, v_ope = 643.\n\nAnswer: 643",
+ "task": "multihop_var",
+ "n_distractors": 136
+ },
+ {
+ "src": "v31_long_context_ruler/niah_multikey",
+ "gold": "4R4SI",
+ "ok": true,
+ "tokens": 48,
+ "tail": "o.\n\nLooking through the passage, I can find the answer directly stated:\n\n\"The code NS8XEG maps to 4R4SI.\"\n\nAnswer: 4R4SI",
+ "task": "niah_multikey",
+ "n_distractors": 168
+ },
+ {
+ "src": "v31_long_context_ruler/niah_single",
+ "gold": "XQG8CC",
+ "ok": true,
+ "tokens": 8,
+ "tail": " Answer: XQG8CC",
+ "task": "niah_single",
+ "n_distractors": 74,
+ "needle_pos": 58
+ },
+ {
+ "src": "v31_long_context_ruler/multihop_var",
+ "gold": "874",
+ "ok": true,
+ "tokens": 172,
+ "tail": "_h78 = 874\n- \"Variable v_kf7 takes the same value as v_s8f.\" \u2192 v_kf7 = v_s8f = 874\n\nTherefore, v_kf7 = 874.\n\nAnswer: 874",
+ "task": "multihop_var",
+ "n_distractors": 192
+ }
+ ]
+ },
+ "v31_knowledge_multi_hop_kg": {
+ "n": 16,
+ "correct": 16,
+ "pass_frac": 1.0,
+ "completion_tokens": 2141,
+ "mean_gen_tokens_correct": 133.8,
+ "items": [
+ {
+ "src": "v31_knowledge_multi_hop_kg/kg_family",
+ "gold": "Person_ENHP",
+ "ok": true,
+ "tokens": 92,
+ "tail": "_FHUT, who is the parent of Person_H9X9.\n\nTherefore, Person_ENHP is the grandparent of Person_H9X9.\n\nAnswer: Person_ENHP",
+ "task": "kg_family"
+ },
+ {
+ "src": "v31_knowledge_multi_hop_kg/kg_family",
+ "gold": "Person_BJ43",
+ "ok": true,
+ "tokens": 143,
+ "tail": ".\"\n\nSo Person_BJ43 is the parent of Person_R6TH, making Person_BJ43 the grandparent of Person_HER5.\n\nAnswer: Person_BJ43",
+ "task": "kg_family"
+ },
+ {
+ "src": "v31_knowledge_multi_hop_kg/kg_family",
+ "gold": "Person_4KBJ",
+ "ok": true,
+ "tokens": 105,
+ "tail": "rson_4C2Z, who is the parent of Person_AQ4D. This makes Person_4KBJ the grandparent of Person_AQ4D.\n\nAnswer: Person_4KBJ",
+ "task": "kg_family"
+ },
+ {
+ "src": "v31_knowledge_multi_hop_kg/kg_employment",
+ "gold": "Person_ZCJ4",
+ "ok": true,
+ "tokens": 249,
+ "tail": "hat Person_ZCJ4 reports to anyone.\n\nTherefore, Person_ZCJ4 is the most senior employee at Org_WF4J.\n\nAnswer: Person_ZCJ4",
+ "task": "kg_employment"
+ },
+ {
+ "src": "v31_knowledge_multi_hop_kg/kg_location",
+ "gold": "Place_GMG3",
+ "ok": true,
+ "tokens": 75,
+ "tail": "p 2: Place_53FD is located inside Place_GMG3.\n\nAfter 2 steps up from Place_VXET, I reach Place_GMG3.\n\nAnswer: Place_GMG3",
+ "task": "kg_location"
+ },
+ {
+ "src": "v31_knowledge_multi_hop_kg/kg_location",
+ "gold": "Place_Q5B9",
+ "ok": true,
+ "tokens": 78,
+ "tail": "p 2: Place_QFZM is located inside Place_Q5B9.\n\nAfter 2 steps up from Place_WVVK, I reach Place_Q5B9.\n\nAnswer: Place_Q5B9",
+ "task": "kg_location"
+ },
+ {
+ "src": "v31_knowledge_multi_hop_kg/kg_family",
+ "gold": "Person_ZFMM",
+ "ok": true,
+ "tokens": 116,
+ "tail": "erson_ZFMM is the parent of Person_KWB9.\n\nTherefore, Person_ZFMM is the grandparent of Person_VNKW.\n\nAnswer: Person_ZFMM",
+ "task": "kg_family"
+ },
+ {
+ "src": "v31_knowledge_multi_hop_kg/kg_family",
+ "gold": "Person_3JHE",
+ "ok": true,
+ "tokens": 130,
+ "tail": "rson_3JHE is the parent of Person_W732.\"\n\nTherefore, Person_3JHE is the grandparent of Person_BHP5.\n\nAnswer: Person_3JHE",
+ "task": "kg_family"
+ },
+ {
+ "src": "v31_knowledge_multi_hop_kg/kg_family",
+ "gold": "Person_KFJ2",
+ "ok": true,
+ "tokens": 88,
+ "tail": "_MK6T, who is the parent of Person_HC98.\n\nTherefore, Person_KFJ2 is the grandparent of Person_HC98.\n\nAnswer: Person_KFJ2",
+ "task": "kg_family"
+ },
+ {
+ "src": "v31_knowledge_multi_hop_kg/kg_employment",
+ "gold": "Person_A3S6",
+ "ok": true,
+ "tokens": 162,
+ "tail": "at Person_A3S6 reports to someone)\n\nTherefore, Person_A3S6 is the most senior employee at Org_4EQF.\n\nAnswer: Person_A3S6",
+ "task": "kg_employment"
+ },
+ {
+ "src": "v31_knowledge_multi_hop_kg/kg_employment",
+ "gold": "Person_55UV",
+ "ok": true,
+ "tokens": 208,
+ "tail": "hat Person_55UV reports to anyone.\n\nTherefore, Person_55UV is the most senior employee at Org_WXMF.\n\nAnswer: Person_55UV",
+ "task": "kg_employment"
+ },
+ {
+ "src": "v31_knowledge_multi_hop_kg/kg_location",
+ "gold": "Place_MB2D",
+ "ok": true,
+ "tokens": 93,
+ "tail": "ep 3: Place_WZTX is located inside Place_MB2D\n\nAfter 3 steps up from Place_FU53, I reach Place_MB2D.\n\nAnswer: Place_MB2D",
+ "task": "kg_location"
+ },
+ {
+ "src": "v31_knowledge_multi_hop_kg/kg_location",
+ "gold": "Place_8H8C",
+ "ok": true,
+ "tokens": 122,
+ "tail": "ore step from Place_W9QM, I reach Place_8H8C.\n\nAfter 2 steps up from Place_JZMB, I reach Place_8H8C.\n\nAnswer: Place_8H8C",
+ "task": "kg_location"
+ },
+ {
+ "src": "v31_knowledge_multi_hop_kg/kg_family",
+ "gold": "Person_MPRZ",
+ "ok": true,
+ "tokens": 142,
+ "tail": "ain:\n- Person_MPRZ \u2192 Person_FKKC \u2192 Person_3FKA\n\nYes, Person_MPRZ is the grandparent of Person_3FKA.\n\nAnswer: Person_MPRZ",
+ "task": "kg_family"
+ },
+ {
+ "src": "v31_knowledge_multi_hop_kg/kg_employment",
+ "gold": "Person_G962",
+ "ok": true,
+ "tokens": 192,
+ "tail": "on_G962 does not report to anyone.\n\nTherefore, Person_G962 is the most senior employee at Org_J72M.\n\nAnswer: Person_G962",
+ "task": "kg_employment"
+ },
+ {
+ "src": "v31_knowledge_multi_hop_kg/kg_employment",
+ "gold": "Person_JJAK",
+ "ok": true,
+ "tokens": 146,
+ "tail": "at Person_JJAK reports to someone)\n\nTherefore, Person_JJAK is the most senior employee at Org_HVVJ.\n\nAnswer: Person_JJAK",
+ "task": "kg_employment"
+ }
+ ]
+ },
+ "v31_truthfulness_calibration": {
+ "n": 16,
+ "correct": 16,
+ "pass_frac": 1.0,
+ "completion_tokens": 887,
+ "mean_gen_tokens_correct": 55.4,
+ "items": [
+ {
+ "src": "v31_truthfulness/indeterminate/_t_missing_initial",
+ "gold": "cannot determine",
+ "verdict": "correct",
+ "ok": true,
+ "tokens": 68,
+ "tail": "oesn't tell me Kara's initial amount of money, I cannot calculate how much she has left.\n\nFinal answer: cannot determine",
+ "family": "indeterminate"
+ },
+ {
+ "src": "v31_truthfulness/indeterminate/_t_missing_quantity_price",
+ "gold": "cannot determine",
+ "verdict": "correct",
+ "ok": true,
+ "tokens": 99,
+ "tail": "ctual amount spent from the information provided, I must conclude this cannot be solved.\n\nFinal answer: cannot determine",
+ "family": "indeterminate"
+ },
+ {
+ "src": "v31_truthfulness/determinate/_t_determinate_subtraction",
+ "gold": "34",
+ "verdict": "correct",
+ "ok": true,
+ "tokens": 52,
+ "tail": " from his starting amount of $50.\n\nStarting amount: $50\nAmount spent: $16\n\nMoney left: $50 - $16 = $34\n\nFinal answer: 34",
+ "family": "determinate"
+ },
+ {
+ "src": "v31_truthfulness/determinate/_t_determinate_subtraction",
+ "gold": "40",
+ "verdict": "correct",
+ "ok": true,
+ "tokens": 46,
+ "tail": " left after spending some of it.\n\nStarting amount: $80\nAmount spent: $40\n\nMoney left = $80 - $40 = $40\n\nFinal answer: 40",
+ "family": "determinate"
+ },
+ {
+ "src": "v31_truthfulness/determinate/_t_determinate_subtraction",
+ "gold": "47",
+ "verdict": "correct",
+ "ok": true,
+ "tokens": 48,
+ "tail": "t after spending $53 from $100.\n\nStarting amount: $100\nAmount spent: $53\n\nMoney left: $100 - $53 = $47\n\nFinal answer: 47",
+ "family": "determinate"
+ },
+ {
+ "src": "v31_truthfulness/determinate/_t_determinate_subtraction",
+ "gold": "5",
+ "verdict": "correct",
+ "ok": true,
+ "tokens": 46,
+ "tail": "as left after spending some of it.\n\nStarting amount: $50\nAmount spent: $45\n\nMoney left = $50 - $45 = $5\n\nFinal answer: 5",
+ "family": "determinate"
+ },
+ {
+ "src": "v31_truthfulness/determinate/_t_determinate_simple",
+ "gold": "72",
+ "verdict": "correct",
+ "ok": true,
+ "tokens": 54,
+ "tail": "a buys 6 toy cars at $12 each.\n\nTotal cost = Number of cars \u00d7 Price per car\nTotal cost = 6 \u00d7 $12 = $72\n\nFinal answer: 72",
+ "family": "determinate"
+ },
+ {
+ "src": "v31_truthfulness/determinate/_t_determinate_simple",
+ "gold": "33",
+ "verdict": "correct",
+ "ok": true,
+ "tokens": 41,
+ "tail": "e total amount Quinn spent on cookies.\n\nQuinn buys 3 cookies at $11 each.\n\nTotal spent = 3 \u00d7 $11 = $33\n\nFinal answer: 33",
+ "family": "determinate"
+ },
+ {
+ "src": "v31_truthfulness/indeterminate/_t_missing_initial",
+ "gold": "cannot determine",
+ "verdict": "correct",
+ "ok": true,
+ "tokens": 60,
+ "tail": "he problem doesn't state Oscar's initial amount of money, I cannot determine the answer.\n\nFinal answer: cannot determine",
+ "family": "indeterminate"
+ },
+ {
+ "src": "v31_truthfulness/determinate/_t_determinate_simple",
+ "gold": "15",
+ "verdict": "correct",
+ "ok": true,
+ "tokens": 39,
+ "tail": "eed to find how much Grace spent in total.\n\nGrace buys 3 books at $5 each.\n\nTotal spent = 3 \u00d7 $5 = $15\n\nFinal answer: 15",
+ "family": "determinate"
+ },
+ {
+ "src": "v31_truthfulness/determinate/_t_determinate_subtraction",
+ "gold": "41",
+ "verdict": "correct",
+ "ok": true,
+ "tokens": 49,
+ "tail": " left after spending some of it.\n\nTara starts with: $80\nTara spends: $39\n\nMoney left = $80 - $39 = $41\n\nFinal answer: 41",
+ "family": "determinate"
+ },
+ {
+ "src": "v31_truthfulness/indeterminate/_t_missing_initial",
+ "gold": "cannot determine",
+ "verdict": "correct",
+ "ok": true,
+ "tokens": 60,
+ "tail": "he problem doesn't state David's initial amount of money, I cannot determine the answer.\n\nFinal answer: cannot determine",
+ "family": "indeterminate"
+ },
+ {
+ "src": "v31_truthfulness/mixed/_t_mixed_extra_distractor",
+ "gold": "18",
+ "verdict": "correct",
+ "ok": true,
+ "tokens": 57,
+ "tail": "= $18\n\nThe information about the shop being in business for 10 years is irrelevant to the calculation.\n\nFinal answer: 18",
+ "family": "mixed"
+ },
+ {
+ "src": "v31_truthfulness/determinate/_t_determinate_simple",
+ "gold": "24",
+ "verdict": "correct",
+ "ok": true,
+ "tokens": 52,
+ "tail": "notebooks at $6 each.\n\nTotal cost = Number of notebooks \u00d7 Price per notebook\nTotal cost = 4 \u00d7 $6 = $24\n\nFinal answer: 24",
+ "family": "determinate"
+ },
+ {
+ "src": "v31_truthfulness/mixed/_t_mixed_extra_distractor",
+ "gold": "21",
+ "verdict": "correct",
+ "ok": true,
+ "tokens": 58,
+ "tail": " information about the stationery shop being in business for 9 years is irrelevant to the calculation.\n\nFinal answer: 21",
+ "family": "mixed"
+ },
+ {
+ "src": "v31_truthfulness/mixed/_t_mixed_extra_distractor",
+ "gold": "28",
+ "verdict": "correct",
+ "ok": true,
+ "tokens": 58,
+ "tail": "information about the stationery shop being in business for 14 years is irrelevant to the calculation.\n\nFinal answer: 28",
+ "family": "mixed"
+ }
+ ],
+ "incorrect": 0,
+ "not_attempted": 0,
+ "raw_score": 1.0
+ },
+ "v31_consistency_paraphrase": {
+ "n": 16,
+ "correct": 13,
+ "pass_frac": 0.875,
+ "completion_tokens": 5734,
+ "mean_gen_tokens_correct": 259.4,
+ "items": [
+ {
+ "src": "v31_consistency_paraphrase/garden_harvest/p0",
+ "gold": "640",
+ "score": 1.0,
+ "ok": true,
+ "tokens": 188,
+ "tail_a": "f onions**\n- Onions per plant: 4\n- Total onions = 160 \u00d7 4 = 640 onions\n\n#### 640",
+ "tail_b": "f onions**\n- Onions per plant: 4\n- Total onions = 160 \u00d7 4 = 640 onions\n\n#### 640",
+ "template": "garden_harvest",
+ "difficulty": 0
+ },
+ {
+ "src": "v31_consistency_paraphrase/work_rate/p0",
+ "gold": "150",
+ "score": 1.0,
+ "ok": true,
+ "tokens": 233,
+ "tail_a": "otal labels in 3 hours**\nTotal = 50 labels/hour \u00d7 3 hours = 150 labels\n\n#### 150",
+ "tail_b": "cked in 3 hours**\nTotal labels = 50 labels/hour \u00d7 3 hours = 150 labels\n\n#### 150",
+ "template": "work_rate",
+ "difficulty": 0
+ },
+ {
+ "src": "v31_consistency_paraphrase/bakery_orders/p0",
+ "gold": "120",
+ "score": 1.0,
+ "ok": true,
+ "tokens": 230,
+ "tail_a": "l production \u2212 Total orders\nLeftover muffins = 500 \u2212 380 = 120 muffins\n\n#### 120",
+ "tail_b": "total extra muffins over 5 days**\n- Total extra muffins = 24 \u00d7 5 = 120\n\n#### 120",
+ "template": "bakery_orders",
+ "difficulty": 0
+ },
+ {
+ "src": "v31_consistency_paraphrase/bakery_orders/p0",
+ "gold": "120",
+ "score": 1.0,
+ "ok": true,
+ "tokens": 239,
+ "tail_a": "ftover cakes = Total production \u2212 Total orders = 300 \u2212 180 = 120 cakes\n\n#### 120",
+ "tail_b": "tal production \u2212 Total orders\n- Leftover cakes = 300 \u2212 180 = 120 cakes\n\n#### 120",
+ "template": "bakery_orders",
+ "difficulty": 0
+ },
+ {
+ "src": "v31_consistency_paraphrase/bakery_orders/p1",
+ "gold": "160",
+ "score": 1.0,
+ "ok": true,
+ "tokens": 378,
+ "tail_a": "ep 5: Calculate leftover muffins**\n- Leftover: 500 \u2212 340 = 160 muffins\n\n#### 160",
+ "tail_b": "s**\n- Leftover = Total produced \u2212 Total used = 500 \u2212 340 = 160 muffins\n\n#### 160",
+ "template": "bakery_orders",
+ "difficulty": 1
+ },
+ {
+ "src": "v31_consistency_paraphrase/travel_distance/p1",
+ "gold": "410",
+ "score": 1.0,
+ "ok": true,
+ "tokens": 347,
+ "tail_a": "culate total distance**\n- Total distance = 160 + 140 + 110 = 410 miles\n\n#### 410",
+ "tail_b": "Calculate total distance**\n- Total distance = 160 + 140 + 110 = 410 km\n\n#### 410",
+ "template": "travel_distance",
+ "difficulty": 1
+ },
+ {
+ "src": "v31_consistency_paraphrase/percentage_compose/p0",
+ "gold": "80",
+ "score": 1.0,
+ "ok": true,
+ "tokens": 228,
+ "tail_a": " total marbles by 0.20.\n400 \u00d7 0.20 = 80\n\nSo Nora gives away 80 marbles.\n\n#### 80",
+ "tail_b": " = 80\n\n**Verification:**\n- 10% of 400 = 40\n- 20% of 400 = 40 \u00d7 2 = 80 \u2713\n\n#### 80",
+ "template": "percentage_compose",
+ "difficulty": 0
+ },
+ {
+ "src": "v31_consistency_paraphrase/classroom_supplies/p1",
+ "gold": "118",
+ "score": 0.0,
+ "ok": false,
+ "tokens": 920,
+ "tail_a": "is type expect you to just compute 108 \u00d7 1.1 = 118.8 and round to 119.\n\n#### 119",
+ "tail_b": "tra on the total: 108 \u00d7 1.1 = 118.8\n\nMost standard rounding gives 119.\n\n#### 119",
+ "template": "classroom_supplies",
+ "difficulty": 1
+ },
+ {
+ "src": "v31_consistency_paraphrase/travel_distance/p0",
+ "gold": "320",
+ "score": 0.5,
+ "ok": false,
+ "tokens": 461,
+ "tail_a": "3: Calculate total distance**\n- Total distance = 120 + 200 = 320 miles\n\n#### 320",
+ "tail_b": "nteger, let me verify with standard rounding: 198.83872 rounds to 199.\n\n#### 199",
+ "template": "travel_distance",
+ "difficulty": 0
+ },
+ {
+ "src": "v31_consistency_paraphrase/travel_distance/p0",
+ "gold": "470",
+ "score": 1.0,
+ "ok": true,
+ "tokens": 272,
+ "tail_a": "3: Calculate total distance**\n- Total distance = 350 + 120 = 470 miles\n\n#### 470",
+ "tail_b": "3: Calculate total distance**\n- Total distance = 350 + 120 = 470 miles\n\n#### 470",
+ "template": "travel_distance",
+ "difficulty": 0
+ },
+ {
+ "src": "v31_consistency_paraphrase/garden_harvest/p1",
+ "gold": "634",
+ "score": 0.5,
+ "ok": false,
+ "tokens": 981,
+ "tail_a": " I should truncate to 633. In many contexts, 633.6 would round to 634.\n\n#### 634",
+ "tail_b": "ctually, re-reading once more - perhaps I should just compute exactly and round:",
+ "template": "garden_harvest",
+ "difficulty": 1
+ },
+ {
+ "src": "v31_consistency_paraphrase/travel_distance/p0",
+ "gold": "400",
+ "score": 1.0,
+ "ok": true,
+ "tokens": 337,
+ "tail_a": "alculate total distance**\nTotal distance = 120 + 160 + 120 = 400 miles\n\n#### 400",
+ "tail_b": "culate total distance**\n- Total distance = 120 + 160 + 120 = 400 miles\n\n#### 400",
+ "template": "travel_distance",
+ "difficulty": 0
+ },
+ {
+ "src": "v31_consistency_paraphrase/classroom_supplies/p0",
+ "gold": "120",
+ "score": 1.0,
+ "ok": true,
+ "tokens": 242,
+ "tail_a": " for all students**\nTotal students = 20\n\nTotal supplies = 6 \u00d7 20 = 120\n\n#### 120",
+ "tail_b": "tebooks: 20 \u00d7 3 = 60\n- Erasers: 20 \u00d7 1 = 20\n\nTotal: 40 + 60 + 20 = 120\n\n#### 120",
+ "template": "classroom_supplies",
+ "difficulty": 0
+ },
+ {
+ "src": "v31_consistency_paraphrase/travel_distance/p0",
+ "gold": "510",
+ "score": 1.0,
+ "ok": true,
+ "tokens": 272,
+ "tail_a": "3: Calculate total distance**\n- Total distance = 160 + 350 = 510 miles\n\n#### 510",
+ "tail_b": "3: Calculate total distance**\n- Total Distance = 160 + 350 = 510 miles\n\n#### 510",
+ "template": "travel_distance",
+ "difficulty": 0
+ },
+ {
+ "src": "v31_consistency_paraphrase/percentage_compose/p1",
+ "gold": "50",
+ "score": 1.0,
+ "ok": true,
+ "tokens": 254,
+ "tail_a": "f those 100 marbles to a sibling.\n\n50% of 100 = 0.50 \u00d7 100 = 50 marbles\n\n#### 50",
+ "tail_b": " sibling.\n\nMarbles to sibling = 100 \u00d7 50% = 100 \u00d7 0.50 = **50 marbles**\n\n#### 50",
+ "template": "percentage_compose",
+ "difficulty": 1
+ },
+ {
+ "src": "v31_consistency_paraphrase/percentage_compose/p0",
+ "gold": "300",
+ "score": 1.0,
+ "ok": true,
+ "tokens": 152,
+ "tail_a": "rbles by the decimal.\n1000 \u00d7 0.30 = 300\n\nDiana gives away 300 marbles.\n\n#### 300",
+ "tail_b": "les by the decimal.\n1000 \u00d7 0.30 = 300\n\nSo Tara gives away 300 marbles.\n\n#### 300",
+ "template": "percentage_compose",
+ "difficulty": 0
+ }
+ ],
+ "raw_consistency_mean": 0.875
+ },
+ "calibration_bench": {
+ "n": 16,
+ "correct": 16,
+ "pass_frac": 1.0,
+ "completion_tokens": 72,
+ "mean_gen_tokens_correct": 4.5,
+ "items": [
+ {
+ "src": "calibration/books_total/unsolv_adversarial_unit_mismatch",
+ "kind": "unsolv",
+ "adversarial": true,
+ "ok": true,
+ "tokens": 6,
+ "tail": " I don't know."
+ },
+ {
+ "src": "calibration/books_total/solv",
+ "kind": "solv",
+ "adversarial": false,
+ "ok": true,
+ "tokens": 3,
+ "tail": " 68"
+ },
+ {
+ "src": "calibration/trail_distance/unsolv",
+ "kind": "unsolv",
+ "adversarial": false,
+ "ok": true,
+ "tokens": 6,
+ "tail": " I don't know."
+ },
+ {
+ "src": "calibration/orchard_yield/solv",
+ "kind": "solv",
+ "adversarial": false,
+ "ok": true,
+ "tokens": 3,
+ "tail": " 65"
+ },
+ {
+ "src": "calibration/trail_distance/unsolv",
+ "kind": "unsolv",
+ "adversarial": false,
+ "ok": true,
+ "tokens": 6,
+ "tail": " I don't know."
+ },
+ {
+ "src": "calibration/orchard_yield/unsolv_adversarial_contradiction",
+ "kind": "unsolv",
+ "adversarial": true,
+ "ok": true,
+ "tokens": 6,
+ "tail": " I don't know."
+ },
+ {
+ "src": "calibration/orchard_yield/solv",
+ "kind": "solv",
+ "adversarial": false,
+ "ok": true,
+ "tokens": 3,
+ "tail": " 76"
+ },
+ {
+ "src": "calibration/books_total/unsolv",
+ "kind": "unsolv",
+ "adversarial": false,
+ "ok": true,
+ "tokens": 6,
+ "tail": " I don't know."
+ },
+ {
+ "src": "calibration/books_total/unsolv",
+ "kind": "unsolv",
+ "adversarial": false,
+ "ok": true,
+ "tokens": 6,
+ "tail": " I don't know."
+ },
+ {
+ "src": "calibration/orchard_yield/solv",
+ "kind": "solv",
+ "adversarial": false,
+ "ok": true,
+ "tokens": 3,
+ "tail": " 85"
+ },
+ {
+ "src": "calibration/books_total/solv",
+ "kind": "solv",
+ "adversarial": false,
+ "ok": true,
+ "tokens": 3,
+ "tail": " 30"
+ },
+ {
+ "src": "calibration/trail_distance/unsolv",
+ "kind": "unsolv",
+ "adversarial": false,
+ "ok": true,
+ "tokens": 6,
+ "tail": " I don't know."
+ },
+ {
+ "src": "calibration/trail_distance/solv",
+ "kind": "solv",
+ "adversarial": false,
+ "ok": true,
+ "tokens": 3,
+ "tail": " 66"
+ },
+ {
+ "src": "calibration/trail_distance/solv",
+ "kind": "solv",
+ "adversarial": false,
+ "ok": true,
+ "tokens": 3,
+ "tail": " 67"
+ },
+ {
+ "src": "calibration/trail_distance/unsolv",
+ "kind": "unsolv",
+ "adversarial": false,
+ "ok": true,
+ "tokens": 6,
+ "tail": " I don't know."
+ },
+ {
+ "src": "calibration/books_total/solv",
+ "kind": "solv",
+ "adversarial": false,
+ "ok": true,
+ "tokens": 3,
+ "tail": " 58"
+ }
+ ],
+ "n_solv": 8,
+ "n_unsolv": 8,
+ "correct_solv": 8,
+ "correct_unsolv": 8
+ },
+ "judge_probe": {
+ "n": 8,
+ "n_valid": 8,
+ "mean_score": 4.875,
+ "normalized": 0.9688
+ },
+ "long_form_judge_probe": {
+ "n": 8,
+ "n_valid": 8,
+ "normalized": 0.8673,
+ "coherence_factor": 0.8957
+ },
+ "chat_turns_probe": {
+ "n": 4,
+ "n_valid": 4,
+ "mean_score": 4.75,
+ "normalized": 0.9375
+ }
+ },
+ "__finished_at__": 1779529660.1690044
+ }