Shushant committed on
Commit 271d9b7 · verified · 1 Parent(s): 5f049d6

Training logs (final)

Files changed (1)
  1. training_logs/per_generator_auroc.tsv +575 -0
training_logs/per_generator_auroc.tsv CHANGED
@@ -183,3 +183,578 @@ step generator AUROC
183
  35 qwen1.5-72b-chat-8bit 0.9995
184
  35 text-bison-002 0.9992
185
  35 MACRO_AVG 0.9960
186
+ 40 deepseek-r1-distill-qwen-32b 0.9998
187
+ 40 falcon3-10b-instruct 0.9992
188
+ 40 gemini-1.5-pro 0.9689
189
+ 40 gemini-2.0-flash 0.9998
190
+ 40 gemini-pro 0.9851
191
+ 40 gemini-pro-paraphrase 0.9884
192
+ 40 gpt-3.5-turbo 0.9999
193
+ 40 gpt-4-turbo 1.0000
194
+ 40 gpt-4-turbo-paraphrase 0.9963
195
+ 40 gpt-4.5-preview 0.9980
196
+ 40 gpt-4o 0.9999
197
+ 40 gpt-4o-mini 0.9999
198
+ 40 llama-2-70b-chat 0.9974
199
+ 40 llama-2-7b-chat 0.9967
200
+ 40 llama-3.1-8b-instruct 0.9929
201
+ 40 llama-3.3-70b-instruct 0.9993
202
+ 40 ministral-8b-instruct-2410 0.9997
203
+ 40 mistral-7b-instruct-v0.2 0.9965
204
+ 40 mixtral-8x7b-instruct-v0.1 0.9947
205
+ 40 o3-mini 0.9999
206
+ 40 qwen1.5-72b-chat-8bit 0.9995
207
+ 40 text-bison-002 0.9992
208
+ 40 MACRO_AVG 0.9960
209
+ 45 deepseek-r1-distill-qwen-32b 0.9998
210
+ 45 falcon3-10b-instruct 0.9992
211
+ 45 gemini-1.5-pro 0.9689
212
+ 45 gemini-2.0-flash 0.9998
213
+ 45 gemini-pro 0.9851
214
+ 45 gemini-pro-paraphrase 0.9884
215
+ 45 gpt-3.5-turbo 0.9999
216
+ 45 gpt-4-turbo 1.0000
217
+ 45 gpt-4-turbo-paraphrase 0.9963
218
+ 45 gpt-4.5-preview 0.9980
219
+ 45 gpt-4o 0.9999
220
+ 45 gpt-4o-mini 0.9999
221
+ 45 llama-2-70b-chat 0.9974
222
+ 45 llama-2-7b-chat 0.9967
223
+ 45 llama-3.1-8b-instruct 0.9929
224
+ 45 llama-3.3-70b-instruct 0.9993
225
+ 45 ministral-8b-instruct-2410 0.9997
226
+ 45 mistral-7b-instruct-v0.2 0.9965
227
+ 45 mixtral-8x7b-instruct-v0.1 0.9947
228
+ 45 o3-mini 0.9999
229
+ 45 qwen1.5-72b-chat-8bit 0.9995
230
+ 45 text-bison-002 0.9992
231
+ 45 MACRO_AVG 0.9960
232
+ 50 deepseek-r1-distill-qwen-32b 0.9998
233
+ 50 falcon3-10b-instruct 0.9992
234
+ 50 gemini-1.5-pro 0.9689
235
+ 50 gemini-2.0-flash 0.9998
236
+ 50 gemini-pro 0.9851
237
+ 50 gemini-pro-paraphrase 0.9884
238
+ 50 gpt-3.5-turbo 0.9999
239
+ 50 gpt-4-turbo 1.0000
240
+ 50 gpt-4-turbo-paraphrase 0.9963
241
+ 50 gpt-4.5-preview 0.9980
242
+ 50 gpt-4o 0.9999
243
+ 50 gpt-4o-mini 0.9999
244
+ 50 llama-2-70b-chat 0.9974
245
+ 50 llama-2-7b-chat 0.9967
246
+ 50 llama-3.1-8b-instruct 0.9929
247
+ 50 llama-3.3-70b-instruct 0.9993
248
+ 50 ministral-8b-instruct-2410 0.9997
249
+ 50 mistral-7b-instruct-v0.2 0.9965
250
+ 50 mixtral-8x7b-instruct-v0.1 0.9947
251
+ 50 o3-mini 0.9999
252
+ 50 qwen1.5-72b-chat-8bit 0.9995
253
+ 50 text-bison-002 0.9992
254
+ 50 MACRO_AVG 0.9960
255
+ 55 deepseek-r1-distill-qwen-32b 0.9998
256
+ 55 falcon3-10b-instruct 0.9992
257
+ 55 gemini-1.5-pro 0.9689
258
+ 55 gemini-2.0-flash 0.9998
259
+ 55 gemini-pro 0.9851
260
+ 55 gemini-pro-paraphrase 0.9884
261
+ 55 gpt-3.5-turbo 0.9999
262
+ 55 gpt-4-turbo 1.0000
263
+ 55 gpt-4-turbo-paraphrase 0.9963
264
+ 55 gpt-4.5-preview 0.9980
265
+ 55 gpt-4o 0.9999
266
+ 55 gpt-4o-mini 0.9999
267
+ 55 llama-2-70b-chat 0.9974
268
+ 55 llama-2-7b-chat 0.9967
269
+ 55 llama-3.1-8b-instruct 0.9929
270
+ 55 llama-3.3-70b-instruct 0.9993
271
+ 55 ministral-8b-instruct-2410 0.9997
272
+ 55 mistral-7b-instruct-v0.2 0.9965
273
+ 55 mixtral-8x7b-instruct-v0.1 0.9947
274
+ 55 o3-mini 0.9999
275
+ 55 qwen1.5-72b-chat-8bit 0.9995
276
+ 55 text-bison-002 0.9992
277
+ 55 MACRO_AVG 0.9960
278
+ 60 deepseek-r1-distill-qwen-32b 0.9998
279
+ 60 falcon3-10b-instruct 0.9992
280
+ 60 gemini-1.5-pro 0.9689
281
+ 60 gemini-2.0-flash 0.9998
282
+ 60 gemini-pro 0.9851
283
+ 60 gemini-pro-paraphrase 0.9884
284
+ 60 gpt-3.5-turbo 0.9999
285
+ 60 gpt-4-turbo 1.0000
286
+ 60 gpt-4-turbo-paraphrase 0.9963
287
+ 60 gpt-4.5-preview 0.9980
288
+ 60 gpt-4o 0.9999
289
+ 60 gpt-4o-mini 0.9999
290
+ 60 llama-2-70b-chat 0.9974
291
+ 60 llama-2-7b-chat 0.9967
292
+ 60 llama-3.1-8b-instruct 0.9929
293
+ 60 llama-3.3-70b-instruct 0.9993
294
+ 60 ministral-8b-instruct-2410 0.9997
295
+ 60 mistral-7b-instruct-v0.2 0.9965
296
+ 60 mixtral-8x7b-instruct-v0.1 0.9947
297
+ 60 o3-mini 0.9999
298
+ 60 qwen1.5-72b-chat-8bit 0.9995
299
+ 60 text-bison-002 0.9992
300
+ 60 MACRO_AVG 0.9960
301
+ 65 deepseek-r1-distill-qwen-32b 0.9998
302
+ 65 falcon3-10b-instruct 0.9992
303
+ 65 gemini-1.5-pro 0.9689
304
+ 65 gemini-2.0-flash 0.9998
305
+ 65 gemini-pro 0.9851
306
+ 65 gemini-pro-paraphrase 0.9884
307
+ 65 gpt-3.5-turbo 0.9999
308
+ 65 gpt-4-turbo 1.0000
309
+ 65 gpt-4-turbo-paraphrase 0.9963
310
+ 65 gpt-4.5-preview 0.9980
311
+ 65 gpt-4o 0.9999
312
+ 65 gpt-4o-mini 0.9999
313
+ 65 llama-2-70b-chat 0.9974
314
+ 65 llama-2-7b-chat 0.9967
315
+ 65 llama-3.1-8b-instruct 0.9929
316
+ 65 llama-3.3-70b-instruct 0.9993
317
+ 65 ministral-8b-instruct-2410 0.9997
318
+ 65 mistral-7b-instruct-v0.2 0.9965
319
+ 65 mixtral-8x7b-instruct-v0.1 0.9947
320
+ 65 o3-mini 0.9999
321
+ 65 qwen1.5-72b-chat-8bit 0.9995
322
+ 65 text-bison-002 0.9992
323
+ 65 MACRO_AVG 0.9960
324
+ 70 deepseek-r1-distill-qwen-32b 0.9998
325
+ 70 falcon3-10b-instruct 0.9992
326
+ 70 gemini-1.5-pro 0.9689
327
+ 70 gemini-2.0-flash 0.9998
328
+ 70 gemini-pro 0.9851
329
+ 70 gemini-pro-paraphrase 0.9884
330
+ 70 gpt-3.5-turbo 0.9999
331
+ 70 gpt-4-turbo 1.0000
332
+ 70 gpt-4-turbo-paraphrase 0.9963
333
+ 70 gpt-4.5-preview 0.9980
334
+ 70 gpt-4o 0.9999
335
+ 70 gpt-4o-mini 0.9999
336
+ 70 llama-2-70b-chat 0.9974
337
+ 70 llama-2-7b-chat 0.9967
338
+ 70 llama-3.1-8b-instruct 0.9929
339
+ 70 llama-3.3-70b-instruct 0.9993
340
+ 70 ministral-8b-instruct-2410 0.9997
341
+ 70 mistral-7b-instruct-v0.2 0.9965
342
+ 70 mixtral-8x7b-instruct-v0.1 0.9947
343
+ 70 o3-mini 0.9999
344
+ 70 qwen1.5-72b-chat-8bit 0.9995
345
+ 70 text-bison-002 0.9992
346
+ 70 MACRO_AVG 0.9960
347
+ 75 deepseek-r1-distill-qwen-32b 0.9998
348
+ 75 falcon3-10b-instruct 0.9992
349
+ 75 gemini-1.5-pro 0.9689
350
+ 75 gemini-2.0-flash 0.9998
351
+ 75 gemini-pro 0.9851
352
+ 75 gemini-pro-paraphrase 0.9884
353
+ 75 gpt-3.5-turbo 0.9999
354
+ 75 gpt-4-turbo 1.0000
355
+ 75 gpt-4-turbo-paraphrase 0.9963
356
+ 75 gpt-4.5-preview 0.9980
357
+ 75 gpt-4o 0.9999
358
+ 75 gpt-4o-mini 0.9999
359
+ 75 llama-2-70b-chat 0.9974
360
+ 75 llama-2-7b-chat 0.9967
361
+ 75 llama-3.1-8b-instruct 0.9929
362
+ 75 llama-3.3-70b-instruct 0.9993
363
+ 75 ministral-8b-instruct-2410 0.9997
364
+ 75 mistral-7b-instruct-v0.2 0.9965
365
+ 75 mixtral-8x7b-instruct-v0.1 0.9947
366
+ 75 o3-mini 0.9999
367
+ 75 qwen1.5-72b-chat-8bit 0.9995
368
+ 75 text-bison-002 0.9992
369
+ 75 MACRO_AVG 0.9960
370
+ 80 deepseek-r1-distill-qwen-32b 0.9998
371
+ 80 falcon3-10b-instruct 0.9992
372
+ 80 gemini-1.5-pro 0.9689
373
+ 80 gemini-2.0-flash 0.9998
374
+ 80 gemini-pro 0.9851
375
+ 80 gemini-pro-paraphrase 0.9884
376
+ 80 gpt-3.5-turbo 0.9999
377
+ 80 gpt-4-turbo 1.0000
378
+ 80 gpt-4-turbo-paraphrase 0.9963
379
+ 80 gpt-4.5-preview 0.9980
380
+ 80 gpt-4o 0.9999
381
+ 80 gpt-4o-mini 0.9999
382
+ 80 llama-2-70b-chat 0.9974
383
+ 80 llama-2-7b-chat 0.9967
384
+ 80 llama-3.1-8b-instruct 0.9929
385
+ 80 llama-3.3-70b-instruct 0.9993
386
+ 80 ministral-8b-instruct-2410 0.9997
387
+ 80 mistral-7b-instruct-v0.2 0.9965
388
+ 80 mixtral-8x7b-instruct-v0.1 0.9947
389
+ 80 o3-mini 0.9999
390
+ 80 qwen1.5-72b-chat-8bit 0.9995
391
+ 80 text-bison-002 0.9992
392
+ 80 MACRO_AVG 0.9960
393
+ 85 deepseek-r1-distill-qwen-32b 0.9998
394
+ 85 falcon3-10b-instruct 0.9992
395
+ 85 gemini-1.5-pro 0.9689
396
+ 85 gemini-2.0-flash 0.9998
397
+ 85 gemini-pro 0.9851
398
+ 85 gemini-pro-paraphrase 0.9884
399
+ 85 gpt-3.5-turbo 0.9999
400
+ 85 gpt-4-turbo 1.0000
401
+ 85 gpt-4-turbo-paraphrase 0.9963
402
+ 85 gpt-4.5-preview 0.9980
403
+ 85 gpt-4o 0.9999
404
+ 85 gpt-4o-mini 0.9999
405
+ 85 llama-2-70b-chat 0.9974
406
+ 85 llama-2-7b-chat 0.9967
407
+ 85 llama-3.1-8b-instruct 0.9929
408
+ 85 llama-3.3-70b-instruct 0.9993
409
+ 85 ministral-8b-instruct-2410 0.9997
410
+ 85 mistral-7b-instruct-v0.2 0.9965
411
+ 85 mixtral-8x7b-instruct-v0.1 0.9947
412
+ 85 o3-mini 0.9999
413
+ 85 qwen1.5-72b-chat-8bit 0.9995
414
+ 85 text-bison-002 0.9992
415
+ 85 MACRO_AVG 0.9960
416
+ 90 deepseek-r1-distill-qwen-32b 0.9998
417
+ 90 falcon3-10b-instruct 0.9992
418
+ 90 gemini-1.5-pro 0.9689
419
+ 90 gemini-2.0-flash 0.9998
420
+ 90 gemini-pro 0.9851
421
+ 90 gemini-pro-paraphrase 0.9884
422
+ 90 gpt-3.5-turbo 0.9999
423
+ 90 gpt-4-turbo 1.0000
424
+ 90 gpt-4-turbo-paraphrase 0.9963
425
+ 90 gpt-4.5-preview 0.9980
426
+ 90 gpt-4o 0.9999
427
+ 90 gpt-4o-mini 0.9999
428
+ 90 llama-2-70b-chat 0.9974
429
+ 90 llama-2-7b-chat 0.9967
430
+ 90 llama-3.1-8b-instruct 0.9929
431
+ 90 llama-3.3-70b-instruct 0.9993
432
+ 90 ministral-8b-instruct-2410 0.9997
433
+ 90 mistral-7b-instruct-v0.2 0.9965
434
+ 90 mixtral-8x7b-instruct-v0.1 0.9947
435
+ 90 o3-mini 0.9999
436
+ 90 qwen1.5-72b-chat-8bit 0.9995
437
+ 90 text-bison-002 0.9992
438
+ 90 MACRO_AVG 0.9960
439
+ 95 deepseek-r1-distill-qwen-32b 0.9998
440
+ 95 falcon3-10b-instruct 0.9992
441
+ 95 gemini-1.5-pro 0.9689
442
+ 95 gemini-2.0-flash 0.9998
443
+ 95 gemini-pro 0.9851
444
+ 95 gemini-pro-paraphrase 0.9884
445
+ 95 gpt-3.5-turbo 0.9999
446
+ 95 gpt-4-turbo 1.0000
447
+ 95 gpt-4-turbo-paraphrase 0.9963
448
+ 95 gpt-4.5-preview 0.9980
449
+ 95 gpt-4o 0.9999
450
+ 95 gpt-4o-mini 0.9999
451
+ 95 llama-2-70b-chat 0.9974
452
+ 95 llama-2-7b-chat 0.9967
453
+ 95 llama-3.1-8b-instruct 0.9929
454
+ 95 llama-3.3-70b-instruct 0.9993
455
+ 95 ministral-8b-instruct-2410 0.9997
456
+ 95 mistral-7b-instruct-v0.2 0.9965
457
+ 95 mixtral-8x7b-instruct-v0.1 0.9947
458
+ 95 o3-mini 0.9999
459
+ 95 qwen1.5-72b-chat-8bit 0.9995
460
+ 95 text-bison-002 0.9992
461
+ 95 MACRO_AVG 0.9960
462
+ 100 deepseek-r1-distill-qwen-32b 0.9998
463
+ 100 falcon3-10b-instruct 0.9992
464
+ 100 gemini-1.5-pro 0.9689
465
+ 100 gemini-2.0-flash 0.9998
466
+ 100 gemini-pro 0.9851
467
+ 100 gemini-pro-paraphrase 0.9884
468
+ 100 gpt-3.5-turbo 0.9999
469
+ 100 gpt-4-turbo 1.0000
470
+ 100 gpt-4-turbo-paraphrase 0.9963
471
+ 100 gpt-4.5-preview 0.9980
472
+ 100 gpt-4o 0.9999
473
+ 100 gpt-4o-mini 0.9999
474
+ 100 llama-2-70b-chat 0.9974
475
+ 100 llama-2-7b-chat 0.9967
476
+ 100 llama-3.1-8b-instruct 0.9929
477
+ 100 llama-3.3-70b-instruct 0.9993
478
+ 100 ministral-8b-instruct-2410 0.9997
479
+ 100 mistral-7b-instruct-v0.2 0.9965
480
+ 100 mixtral-8x7b-instruct-v0.1 0.9947
481
+ 100 o3-mini 0.9999
482
+ 100 qwen1.5-72b-chat-8bit 0.9995
483
+ 100 text-bison-002 0.9992
484
+ 100 MACRO_AVG 0.9960
485
+ 105 deepseek-r1-distill-qwen-32b 0.9998
486
+ 105 falcon3-10b-instruct 0.9992
487
+ 105 gemini-1.5-pro 0.9689
488
+ 105 gemini-2.0-flash 0.9998
489
+ 105 gemini-pro 0.9851
490
+ 105 gemini-pro-paraphrase 0.9884
491
+ 105 gpt-3.5-turbo 0.9999
492
+ 105 gpt-4-turbo 1.0000
493
+ 105 gpt-4-turbo-paraphrase 0.9963
494
+ 105 gpt-4.5-preview 0.9980
495
+ 105 gpt-4o 0.9999
496
+ 105 gpt-4o-mini 0.9999
497
+ 105 llama-2-70b-chat 0.9974
498
+ 105 llama-2-7b-chat 0.9967
499
+ 105 llama-3.1-8b-instruct 0.9929
500
+ 105 llama-3.3-70b-instruct 0.9993
501
+ 105 ministral-8b-instruct-2410 0.9997
502
+ 105 mistral-7b-instruct-v0.2 0.9965
503
+ 105 mixtral-8x7b-instruct-v0.1 0.9947
504
+ 105 o3-mini 0.9999
505
+ 105 qwen1.5-72b-chat-8bit 0.9995
506
+ 105 text-bison-002 0.9992
507
+ 105 MACRO_AVG 0.9960
508
+ 110 deepseek-r1-distill-qwen-32b 0.9998
509
+ 110 falcon3-10b-instruct 0.9992
510
+ 110 gemini-1.5-pro 0.9689
511
+ 110 gemini-2.0-flash 0.9998
512
+ 110 gemini-pro 0.9851
513
+ 110 gemini-pro-paraphrase 0.9884
514
+ 110 gpt-3.5-turbo 0.9999
515
+ 110 gpt-4-turbo 1.0000
516
+ 110 gpt-4-turbo-paraphrase 0.9963
517
+ 110 gpt-4.5-preview 0.9980
518
+ 110 gpt-4o 0.9999
519
+ 110 gpt-4o-mini 0.9999
520
+ 110 llama-2-70b-chat 0.9974
521
+ 110 llama-2-7b-chat 0.9967
522
+ 110 llama-3.1-8b-instruct 0.9929
523
+ 110 llama-3.3-70b-instruct 0.9993
524
+ 110 ministral-8b-instruct-2410 0.9997
525
+ 110 mistral-7b-instruct-v0.2 0.9965
526
+ 110 mixtral-8x7b-instruct-v0.1 0.9947
527
+ 110 o3-mini 0.9999
528
+ 110 qwen1.5-72b-chat-8bit 0.9995
529
+ 110 text-bison-002 0.9992
530
+ 110 MACRO_AVG 0.9960
531
+ 115 deepseek-r1-distill-qwen-32b 0.9998
532
+ 115 falcon3-10b-instruct 0.9992
533
+ 115 gemini-1.5-pro 0.9689
534
+ 115 gemini-2.0-flash 0.9998
535
+ 115 gemini-pro 0.9851
536
+ 115 gemini-pro-paraphrase 0.9884
537
+ 115 gpt-3.5-turbo 0.9999
538
+ 115 gpt-4-turbo 1.0000
539
+ 115 gpt-4-turbo-paraphrase 0.9963
540
+ 115 gpt-4.5-preview 0.9980
541
+ 115 gpt-4o 0.9999
542
+ 115 gpt-4o-mini 0.9999
543
+ 115 llama-2-70b-chat 0.9974
544
+ 115 llama-2-7b-chat 0.9967
545
+ 115 llama-3.1-8b-instruct 0.9929
546
+ 115 llama-3.3-70b-instruct 0.9993
547
+ 115 ministral-8b-instruct-2410 0.9997
548
+ 115 mistral-7b-instruct-v0.2 0.9965
549
+ 115 mixtral-8x7b-instruct-v0.1 0.9947
550
+ 115 o3-mini 0.9999
551
+ 115 qwen1.5-72b-chat-8bit 0.9995
552
+ 115 text-bison-002 0.9992
553
+ 115 MACRO_AVG 0.9960
554
+ 120 deepseek-r1-distill-qwen-32b 0.9998
555
+ 120 falcon3-10b-instruct 0.9992
556
+ 120 gemini-1.5-pro 0.9689
557
+ 120 gemini-2.0-flash 0.9998
558
+ 120 gemini-pro 0.9851
559
+ 120 gemini-pro-paraphrase 0.9884
560
+ 120 gpt-3.5-turbo 0.9999
561
+ 120 gpt-4-turbo 1.0000
562
+ 120 gpt-4-turbo-paraphrase 0.9963
563
+ 120 gpt-4.5-preview 0.9980
564
+ 120 gpt-4o 0.9999
565
+ 120 gpt-4o-mini 0.9999
566
+ 120 llama-2-70b-chat 0.9974
567
+ 120 llama-2-7b-chat 0.9967
568
+ 120 llama-3.1-8b-instruct 0.9929
569
+ 120 llama-3.3-70b-instruct 0.9993
570
+ 120 ministral-8b-instruct-2410 0.9997
571
+ 120 mistral-7b-instruct-v0.2 0.9965
572
+ 120 mixtral-8x7b-instruct-v0.1 0.9947
573
+ 120 o3-mini 0.9999
574
+ 120 qwen1.5-72b-chat-8bit 0.9995
575
+ 120 text-bison-002 0.9992
576
+ 120 MACRO_AVG 0.9960
577
+ 125 deepseek-r1-distill-qwen-32b 0.9998
578
+ 125 falcon3-10b-instruct 0.9992
579
+ 125 gemini-1.5-pro 0.9689
580
+ 125 gemini-2.0-flash 0.9998
581
+ 125 gemini-pro 0.9851
582
+ 125 gemini-pro-paraphrase 0.9884
583
+ 125 gpt-3.5-turbo 0.9999
584
+ 125 gpt-4-turbo 1.0000
585
+ 125 gpt-4-turbo-paraphrase 0.9963
586
+ 125 gpt-4.5-preview 0.9980
587
+ 125 gpt-4o 0.9999
588
+ 125 gpt-4o-mini 0.9999
589
+ 125 llama-2-70b-chat 0.9974
590
+ 125 llama-2-7b-chat 0.9967
591
+ 125 llama-3.1-8b-instruct 0.9929
592
+ 125 llama-3.3-70b-instruct 0.9993
593
+ 125 ministral-8b-instruct-2410 0.9997
594
+ 125 mistral-7b-instruct-v0.2 0.9965
595
+ 125 mixtral-8x7b-instruct-v0.1 0.9947
596
+ 125 o3-mini 0.9999
597
+ 125 qwen1.5-72b-chat-8bit 0.9995
598
+ 125 text-bison-002 0.9992
599
+ 125 MACRO_AVG 0.9960
600
+ 130 deepseek-r1-distill-qwen-32b 0.9998
601
+ 130 falcon3-10b-instruct 0.9992
602
+ 130 gemini-1.5-pro 0.9689
603
+ 130 gemini-2.0-flash 0.9998
604
+ 130 gemini-pro 0.9851
605
+ 130 gemini-pro-paraphrase 0.9884
606
+ 130 gpt-3.5-turbo 0.9999
607
+ 130 gpt-4-turbo 1.0000
608
+ 130 gpt-4-turbo-paraphrase 0.9963
609
+ 130 gpt-4.5-preview 0.9980
610
+ 130 gpt-4o 0.9999
611
+ 130 gpt-4o-mini 0.9999
612
+ 130 llama-2-70b-chat 0.9974
613
+ 130 llama-2-7b-chat 0.9967
614
+ 130 llama-3.1-8b-instruct 0.9929
615
+ 130 llama-3.3-70b-instruct 0.9993
616
+ 130 ministral-8b-instruct-2410 0.9997
617
+ 130 mistral-7b-instruct-v0.2 0.9965
618
+ 130 mixtral-8x7b-instruct-v0.1 0.9947
619
+ 130 o3-mini 0.9999
620
+ 130 qwen1.5-72b-chat-8bit 0.9995
621
+ 130 text-bison-002 0.9992
622
+ 130 MACRO_AVG 0.9960
623
+ 135 deepseek-r1-distill-qwen-32b 0.9998
624
+ 135 falcon3-10b-instruct 0.9992
625
+ 135 gemini-1.5-pro 0.9689
626
+ 135 gemini-2.0-flash 0.9998
627
+ 135 gemini-pro 0.9851
628
+ 135 gemini-pro-paraphrase 0.9884
629
+ 135 gpt-3.5-turbo 0.9999
630
+ 135 gpt-4-turbo 1.0000
631
+ 135 gpt-4-turbo-paraphrase 0.9963
632
+ 135 gpt-4.5-preview 0.9980
633
+ 135 gpt-4o 0.9999
634
+ 135 gpt-4o-mini 0.9999
635
+ 135 llama-2-70b-chat 0.9974
636
+ 135 llama-2-7b-chat 0.9967
637
+ 135 llama-3.1-8b-instruct 0.9929
638
+ 135 llama-3.3-70b-instruct 0.9993
639
+ 135 ministral-8b-instruct-2410 0.9997
640
+ 135 mistral-7b-instruct-v0.2 0.9965
641
+ 135 mixtral-8x7b-instruct-v0.1 0.9947
642
+ 135 o3-mini 0.9999
643
+ 135 qwen1.5-72b-chat-8bit 0.9995
644
+ 135 text-bison-002 0.9992
645
+ 135 MACRO_AVG 0.9960
646
+ 140 deepseek-r1-distill-qwen-32b 0.9998
647
+ 140 falcon3-10b-instruct 0.9992
648
+ 140 gemini-1.5-pro 0.9689
649
+ 140 gemini-2.0-flash 0.9998
650
+ 140 gemini-pro 0.9851
651
+ 140 gemini-pro-paraphrase 0.9884
652
+ 140 gpt-3.5-turbo 0.9999
653
+ 140 gpt-4-turbo 1.0000
654
+ 140 gpt-4-turbo-paraphrase 0.9963
655
+ 140 gpt-4.5-preview 0.9980
656
+ 140 gpt-4o 0.9999
657
+ 140 gpt-4o-mini 0.9999
658
+ 140 llama-2-70b-chat 0.9974
659
+ 140 llama-2-7b-chat 0.9967
660
+ 140 llama-3.1-8b-instruct 0.9929
661
+ 140 llama-3.3-70b-instruct 0.9993
662
+ 140 ministral-8b-instruct-2410 0.9997
663
+ 140 mistral-7b-instruct-v0.2 0.9965
664
+ 140 mixtral-8x7b-instruct-v0.1 0.9947
665
+ 140 o3-mini 0.9999
666
+ 140 qwen1.5-72b-chat-8bit 0.9995
667
+ 140 text-bison-002 0.9992
668
+ 140 MACRO_AVG 0.9960
669
+ 145 deepseek-r1-distill-qwen-32b 0.9998
670
+ 145 falcon3-10b-instruct 0.9992
671
+ 145 gemini-1.5-pro 0.9689
672
+ 145 gemini-2.0-flash 0.9998
673
+ 145 gemini-pro 0.9851
674
+ 145 gemini-pro-paraphrase 0.9884
675
+ 145 gpt-3.5-turbo 0.9999
676
+ 145 gpt-4-turbo 1.0000
677
+ 145 gpt-4-turbo-paraphrase 0.9963
678
+ 145 gpt-4.5-preview 0.9980
679
+ 145 gpt-4o 0.9999
680
+ 145 gpt-4o-mini 0.9999
681
+ 145 llama-2-70b-chat 0.9974
682
+ 145 llama-2-7b-chat 0.9967
683
+ 145 llama-3.1-8b-instruct 0.9929
684
+ 145 llama-3.3-70b-instruct 0.9993
685
+ 145 ministral-8b-instruct-2410 0.9997
686
+ 145 mistral-7b-instruct-v0.2 0.9965
687
+ 145 mixtral-8x7b-instruct-v0.1 0.9947
688
+ 145 o3-mini 0.9999
689
+ 145 qwen1.5-72b-chat-8bit 0.9995
690
+ 145 text-bison-002 0.9992
691
+ 145 MACRO_AVG 0.9960
692
+ 150 deepseek-r1-distill-qwen-32b 0.9998
693
+ 150 falcon3-10b-instruct 0.9992
694
+ 150 gemini-1.5-pro 0.9689
695
+ 150 gemini-2.0-flash 0.9998
696
+ 150 gemini-pro 0.9851
697
+ 150 gemini-pro-paraphrase 0.9884
698
+ 150 gpt-3.5-turbo 0.9999
699
+ 150 gpt-4-turbo 1.0000
700
+ 150 gpt-4-turbo-paraphrase 0.9963
701
+ 150 gpt-4.5-preview 0.9980
702
+ 150 gpt-4o 0.9999
703
+ 150 gpt-4o-mini 0.9999
704
+ 150 llama-2-70b-chat 0.9974
705
+ 150 llama-2-7b-chat 0.9967
706
+ 150 llama-3.1-8b-instruct 0.9929
707
+ 150 llama-3.3-70b-instruct 0.9993
708
+ 150 ministral-8b-instruct-2410 0.9997
709
+ 150 mistral-7b-instruct-v0.2 0.9965
710
+ 150 mixtral-8x7b-instruct-v0.1 0.9947
711
+ 150 o3-mini 0.9999
712
+ 150 qwen1.5-72b-chat-8bit 0.9995
713
+ 150 text-bison-002 0.9992
714
+ 150 MACRO_AVG 0.9960
715
+ 155 deepseek-r1-distill-qwen-32b 0.9998
716
+ 155 falcon3-10b-instruct 0.9992
717
+ 155 gemini-1.5-pro 0.9689
718
+ 155 gemini-2.0-flash 0.9998
719
+ 155 gemini-pro 0.9851
720
+ 155 gemini-pro-paraphrase 0.9884
721
+ 155 gpt-3.5-turbo 0.9999
722
+ 155 gpt-4-turbo 1.0000
723
+ 155 gpt-4-turbo-paraphrase 0.9963
724
+ 155 gpt-4.5-preview 0.9980
725
+ 155 gpt-4o 0.9999
726
+ 155 gpt-4o-mini 0.9999
727
+ 155 llama-2-70b-chat 0.9974
728
+ 155 llama-2-7b-chat 0.9967
729
+ 155 llama-3.1-8b-instruct 0.9929
730
+ 155 llama-3.3-70b-instruct 0.9993
731
+ 155 ministral-8b-instruct-2410 0.9997
732
+ 155 mistral-7b-instruct-v0.2 0.9965
733
+ 155 mixtral-8x7b-instruct-v0.1 0.9947
734
+ 155 o3-mini 0.9999
735
+ 155 qwen1.5-72b-chat-8bit 0.9995
736
+ 155 text-bison-002 0.9992
737
+ 155 MACRO_AVG 0.9960
738
+ 160 deepseek-r1-distill-qwen-32b 0.9998
739
+ 160 falcon3-10b-instruct 0.9992
740
+ 160 gemini-1.5-pro 0.9689
741
+ 160 gemini-2.0-flash 0.9998
742
+ 160 gemini-pro 0.9851
743
+ 160 gemini-pro-paraphrase 0.9884
744
+ 160 gpt-3.5-turbo 0.9999
745
+ 160 gpt-4-turbo 1.0000
746
+ 160 gpt-4-turbo-paraphrase 0.9963
747
+ 160 gpt-4.5-preview 0.9980
748
+ 160 gpt-4o 0.9999
749
+ 160 gpt-4o-mini 0.9999
750
+ 160 llama-2-70b-chat 0.9974
751
+ 160 llama-2-7b-chat 0.9967
752
+ 160 llama-3.1-8b-instruct 0.9929
753
+ 160 llama-3.3-70b-instruct 0.9993
754
+ 160 ministral-8b-instruct-2410 0.9997
755
+ 160 mistral-7b-instruct-v0.2 0.9965
756
+ 160 mixtral-8x7b-instruct-v0.1 0.9947
757
+ 160 o3-mini 0.9999
758
+ 160 qwen1.5-72b-chat-8bit 0.9995
759
+ 160 text-bison-002 0.9992
760
+ 160 MACRO_AVG 0.9960
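
The MACRO_AVG rows can be sanity-checked against the per-generator rows of the same step. The following is a minimal sketch, not the script that produced this log. It assumes the raw file is whitespace-delimited with the columns `step generator AUROC` shown in the hunk header, and that MACRO_AVG is the unweighted mean of the per-generator AUROC values for that step (the log itself does not state this). The function name `check_macro_avg` and the tolerance are illustrative choices.

```python
from collections import defaultdict
from statistics import mean


def check_macro_avg(tsv_text, tol=1e-4):
    """Return {step: (logged, recomputed)} for steps whose MACRO_AVG row
    differs from the unweighted mean of that step's generator rows.

    Assumption: MACRO_AVG is a plain mean over generators. The default
    tolerance allows for both the logged average and the per-generator
    values being rounded to four decimals.
    """
    per_step = defaultdict(list)  # step -> list of per-generator AUROC values
    logged = {}                   # step -> logged MACRO_AVG value

    for line in tsv_text.splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a 3-column data row.
        if len(fields) != 3 or fields[0] == "step":
            continue
        step, generator, auroc = int(fields[0]), fields[1], float(fields[2])
        if generator == "MACRO_AVG":
            logged[step] = auroc
        else:
            per_step[step].append(auroc)

    mismatches = {}
    for step, value in logged.items():
        if not per_step[step]:
            continue  # no generator rows to compare against
        recomputed = mean(per_step[step])
        if abs(recomputed - value) > tol:
            mismatches[step] = (value, recomputed)
    return mismatches
```

Calling `check_macro_avg(open("training_logs/per_generator_auroc.tsv").read())` on the raw file (not on this diff view, whose rows carry `+` markers and line numbers) returns an empty dict when every logged average agrees with the recomputed one within the tolerance.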