ARC-Challenge (test set): normalized accuracy 51.880 (self-reported)
HellaSwag (validation set): normalized accuracy 69.530 (self-reported)
PIQA (validation set): normalized accuracy 77.530 (self-reported)
GSM8K (test set): exact match (flexible) 39.270 (self-reported)