|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.8029|± |0.0110|
| | |strict-match | 5|exact_match|↑ |0.7961|± |0.0111|


| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|----------------|------:|------|-----:|--------|---|-----:|---|-----:|
|kobest_boolq | 1|none | 5|acc |↑ |0.9167|± |0.0074|
| | |none | 5|f1 |↑ |0.9167|± | N/A|
|kobest_copa | 1|none | 5|acc |↑ |0.7130|± |0.0143|
| | |none | 5|f1 |↑ |0.7125|± | N/A|
|kobest_hellaswag| 1|none | 5|acc |↑ |0.4540|± |0.0223|
| | |none | 5|acc_norm|↑ |0.5700|± |0.0222|
| | |none | 5|f1 |↑ |0.4505|± | N/A|
|kobest_sentineg | 1|none | 5|acc |↑ |0.9496|± |0.0110|
| | |none | 5|f1 |↑ |0.9496|± | N/A|
|kobest_wic | 1|none | 5|acc |↑ |0.7111|± |0.0128|
| | |none | 5|f1 |↑ |0.7025|± | N/A|


| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|-------------------------------------------------------|------:|------|-----:|-----------|---|-----:|---|-----:|
|kmmlu_direct_accounting | 2|none | 5|exact_match|↑ |0.5500|± |0.0500|
|kmmlu_direct_agricultural_sciences | 2|none | 5|exact_match|↑ |0.3680|± |0.0153|
|kmmlu_direct_aviation_engineering_and_maintenance | 2|none | 5|exact_match|↑ |0.4670|± |0.0158|
|kmmlu_direct_biology | 2|none | 5|exact_match|↑ |0.3740|± |0.0153|
|kmmlu_direct_chemical_engineering | 2|none | 5|exact_match|↑ |0.4650|± |0.0158|
|kmmlu_direct_chemistry | 2|none | 5|exact_match|↑ |0.4900|± |0.0204|
|kmmlu_direct_civil_engineering | 2|none | 5|exact_match|↑ |0.3540|± |0.0151|
|kmmlu_direct_computer_science | 2|none | 5|exact_match|↑ |0.7320|± |0.0140|
|kmmlu_direct_construction | 2|none | 5|exact_match|↑ |0.3590|± |0.0152|
|kmmlu_direct_criminal_law | 2|none | 5|exact_match|↑ |0.4250|± |0.0350|
|kmmlu_direct_ecology | 2|none | 5|exact_match|↑ |0.4900|± |0.0158|
|kmmlu_direct_economics | 2|none | 5|exact_match|↑ |0.6154|± |0.0428|
|kmmlu_direct_education | 2|none | 5|exact_match|↑ |0.6900|± |0.0465|
|kmmlu_direct_electrical_engineering | 2|none | 5|exact_match|↑ |0.3170|± |0.0147|
|kmmlu_direct_electronics_engineering | 2|none | 5|exact_match|↑ |0.5440|± |0.0158|
|kmmlu_direct_energy_management | 2|none | 5|exact_match|↑ |0.3960|± |0.0155|
|kmmlu_direct_environmental_science | 2|none | 5|exact_match|↑ |0.2950|± |0.0144|
|kmmlu_direct_fashion | 2|none | 5|exact_match|↑ |0.4660|± |0.0158|
|kmmlu_direct_food_processing | 2|none | 5|exact_match|↑ |0.4370|± |0.0157|
|kmmlu_direct_gas_technology_and_engineering | 2|none | 5|exact_match|↑ |0.3650|± |0.0152|
|kmmlu_direct_geomatics | 2|none | 5|exact_match|↑ |0.3770|± |0.0153|
|kmmlu_direct_health | 2|none | 5|exact_match|↑ |0.6200|± |0.0488|
|kmmlu_direct_industrial_engineer | 2|none | 5|exact_match|↑ |0.4730|± |0.0158|
|kmmlu_direct_information_technology | 2|none | 5|exact_match|↑ |0.7080|± |0.0144|
|kmmlu_direct_interior_architecture_and_design | 2|none | 5|exact_match|↑ |0.6080|± |0.0154|
|kmmlu_direct_korean_history | 2|none | 5|exact_match|↑ |0.3200|± |0.0469|
|kmmlu_direct_law | 2|none | 5|exact_match|↑ |0.4730|± |0.0158|
|kmmlu_direct_machine_design_and_manufacturing | 2|none | 5|exact_match|↑ |0.4750|± |0.0158|
|kmmlu_direct_management | 2|none | 5|exact_match|↑ |0.6160|± |0.0154|
|kmmlu_direct_maritime_engineering | 2|none | 5|exact_match|↑ |0.4817|± |0.0204|
|kmmlu_direct_marketing | 2|none | 5|exact_match|↑ |0.8010|± |0.0126|
|kmmlu_direct_materials_engineering | 2|none | 5|exact_match|↑ |0.4970|± |0.0158|
|kmmlu_direct_math | 2|none | 5|exact_match|↑ |0.3500|± |0.0276|
|kmmlu_direct_mechanical_engineering | 2|none | 5|exact_match|↑ |0.4040|± |0.0155|
|kmmlu_direct_nondestructive_testing | 2|none | 5|exact_match|↑ |0.4580|± |0.0158|
|kmmlu_direct_patent | 2|none | 5|exact_match|↑ |0.4100|± |0.0494|
|kmmlu_direct_political_science_and_sociology | 2|none | 5|exact_match|↑ |0.5500|± |0.0288|
|kmmlu_direct_psychology | 2|none | 5|exact_match|↑ |0.4700|± |0.0158|
|kmmlu_direct_public_safety | 2|none | 5|exact_match|↑ |0.3680|± |0.0153|
|kmmlu_direct_railway_and_automotive_engineering | 2|none | 5|exact_match|↑ |0.3550|± |0.0151|
|kmmlu_direct_real_estate | 2|none | 5|exact_match|↑ |0.4650|± |0.0354|
|kmmlu_direct_refrigerating_machinery | 2|none | 5|exact_match|↑ |0.3730|± |0.0153|
|kmmlu_direct_social_welfare | 2|none | 5|exact_match|↑ |0.6140|± |0.0154|
|kmmlu_direct_taxation | 2|none | 5|exact_match|↑ |0.4050|± |0.0348|
|kmmlu_direct_telecommunications_and_wireless_technology| 2|none | 5|exact_match|↑ |0.6080|± |0.0154|


| Groups |Version|Filter|n-shot|Metric| |Value | |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu | 2|none | |acc |↑ |0.6755|± |0.0038|
| - humanities | 2|none | |acc |↑ |0.6140|± |0.0067|
| - other | 2|none | |acc |↑ |0.7271|± |0.0077|
| - social sciences| 2|none | |acc |↑ |0.7793|± |0.0073|
| - stem | 2|none | |acc |↑ |0.6153|± |0.0084|