Question about benchmark results #5
opened by tarruda
Are the benchmarks in the README for the final 3.5 Flash?
The reason I ask is that I ran some benchmarks on an IQ4_XS quant of Step 3.5 Flash (https://huggingface.co/AesSedai/Step-3.5-Flash-GGUF), and in some of them it seems to have gotten better results than what you published.
If these benchmarks are just for the midtrain checkpoint, did you publish the final benchmarks anywhere? I'd love to compare them with the quantized version.

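For reference, the tables below are lm-evaluation-harness output. A minimal sketch of how such a run can be pointed at a local llama.cpp server via the harness's `local-completions` backend (the GGUF file name, model name, and port below are placeholders, not the exact setup used here):

```shell
# Serve the quant with llama.cpp's OpenAI-compatible server first
# (file name and port are assumptions):
#   llama-server -m Step-3.5-Flash-IQ4_XS.gguf --port 8080

# Then run the harness against that endpoint:
lm_eval \
  --model local-completions \
  --model_args model=step-3.5-flash,base_url=http://localhost:8080/v1/completions,num_concurrent=1 \
  --tasks mmlu,gpqa_diamond_zeroshot,gpqa_diamond_cot_zeroshot,gsm8k_cot \
  --batch_size 1
```

The CoT tasks (`gpqa_diamond_cot_zeroshot`, `gsm8k_cot`) report both `flexible-extract` and `strict-match` filters, which is why those tables have two rows per task.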
| Tasks |Version|Filter|n-shot|Metric| |Value | |Stderr|
|---------------------------------------|------:|------|-----:|------|---|-----:|---|-----:|
|mmlu | 2|none | |acc | |0.8238|± |0.0031|
| - humanities | 2|none | 0|acc |↑ |0.7543|± |0.0060|
| - formal_logic | 1|none | 0|acc |↑ |0.7063|± |0.0407|
| - high_school_european_history | 1|none | 0|acc |↑ |0.8788|± |0.0255|
| - high_school_us_history | 1|none | 0|acc |↑ |0.9314|± |0.0177|
| - high_school_world_history | 1|none | 0|acc |↑ |0.9283|± |0.0168|
| - international_law | 1|none | 0|acc |↑ |0.9008|± |0.0273|
| - jurisprudence | 1|none | 0|acc |↑ |0.8796|± |0.0315|
| - logical_fallacies | 1|none | 0|acc |↑ |0.8466|± |0.0283|
| - moral_disputes | 1|none | 0|acc |↑ |0.8295|± |0.0202|
| - moral_scenarios | 1|none | 0|acc |↑ |0.6011|± |0.0164|
| - philosophy | 1|none | 0|acc |↑ |0.8778|± |0.0186|
| - prehistory | 1|none | 0|acc |↑ |0.8889|± |0.0175|
| - professional_law | 1|none | 0|acc |↑ |0.6656|± |0.0120|
| - world_religions | 1|none | 0|acc |↑ |0.9123|± |0.0217|
| - other | 2|none | 0|acc |↑ |0.8626|± |0.0060|
| - business_ethics | 1|none | 0|acc |↑ |0.8100|± |0.0394|
| - clinical_knowledge | 1|none | 0|acc |↑ |0.8943|± |0.0189|
| - college_medicine | 1|none | 0|acc |↑ |0.7919|± |0.0310|
| - global_facts | 1|none | 0|acc |↑ |0.6900|± |0.0465|
| - human_aging | 1|none | 0|acc |↑ |0.8251|± |0.0255|
| - management | 1|none | 0|acc |↑ |0.8641|± |0.0339|
| - marketing | 1|none | 0|acc |↑ |0.9573|± |0.0133|
| - medical_genetics | 1|none | 0|acc |↑ |0.8900|± |0.0314|
| - miscellaneous | 1|none | 0|acc |↑ |0.9298|± |0.0091|
| - nutrition | 1|none | 0|acc |↑ |0.9052|± |0.0168|
| - professional_accounting | 1|none | 0|acc |↑ |0.7979|± |0.0240|
| - professional_medicine | 1|none | 0|acc |↑ |0.8897|± |0.0190|
| - virology | 1|none | 0|acc |↑ |0.5904|± |0.0383|
| - social sciences | 2|none | 0|acc |↑ |0.9012|± |0.0053|
| - econometrics | 1|none | 0|acc |↑ |0.7544|± |0.0405|
| - high_school_geography | 1|none | 0|acc |↑ |0.9242|± |0.0189|
| - high_school_government_and_politics| 1|none | 0|acc |↑ |0.9793|± |0.0103|
| - high_school_macroeconomics | 1|none | 0|acc |↑ |0.8923|± |0.0157|
| - high_school_microeconomics | 1|none | 0|acc |↑ |0.9328|± |0.0163|
| - high_school_psychology | 1|none | 0|acc |↑ |0.9541|± |0.0090|
| - human_sexuality | 1|none | 0|acc |↑ |0.8855|± |0.0279|
| - professional_psychology | 1|none | 0|acc |↑ |0.8709|± |0.0136|
| - public_relations | 1|none | 0|acc |↑ |0.8182|± |0.0369|
| - security_studies | 1|none | 0|acc |↑ |0.8490|± |0.0229|
| - sociology | 1|none | 0|acc |↑ |0.9055|± |0.0207|
| - us_foreign_policy | 1|none | 0|acc |↑ |0.9600|± |0.0197|
| - stem | 2|none | 0|acc |↑ |0.8138|± |0.0067|
| - abstract_algebra | 1|none | 0|acc |↑ |0.7000|± |0.0461|
| - anatomy | 1|none | 0|acc |↑ |0.8444|± |0.0313|
| - astronomy | 1|none | 0|acc |↑ |0.9211|± |0.0219|
| - college_biology | 1|none | 0|acc |↑ |0.9514|± |0.0180|
| - college_chemistry | 1|none | 0|acc |↑ |0.6000|± |0.0492|
| - college_computer_science | 1|none | 0|acc |↑ |0.8300|± |0.0378|
| - college_mathematics | 1|none | 0|acc |↑ |0.6600|± |0.0476|
| - college_physics | 1|none | 0|acc |↑ |0.7549|± |0.0428|
| - computer_security | 1|none | 0|acc |↑ |0.8200|± |0.0386|
| - conceptual_physics | 1|none | 0|acc |↑ |0.8766|± |0.0215|
| - electrical_engineering | 1|none | 0|acc |↑ |0.8345|± |0.0310|
| - elementary_mathematics | 1|none | 0|acc |↑ |0.8836|± |0.0165|
| - high_school_biology | 1|none | 0|acc |↑ |0.9258|± |0.0149|
| - high_school_chemistry | 1|none | 0|acc |↑ |0.8079|± |0.0277|
| - high_school_computer_science | 1|none | 0|acc |↑ |0.8900|± |0.0314|
| - high_school_mathematics | 1|none | 0|acc |↑ |0.6481|± |0.0291|
| - high_school_physics | 1|none | 0|acc |↑ |0.7550|± |0.0351|
| - high_school_statistics | 1|none | 0|acc |↑ |0.8333|± |0.0254|
| - machine_learning | 1|none | 0|acc |↑ |0.5982|± |0.0465|

| Groups |Version|Filter|n-shot|Metric| |Value | |Stderr|
|------------------|------:|------|-----:|------|---|-----:|---|-----:|
|mmlu | 2|none | |acc | |0.8238|± |0.0031|
| - humanities | 2|none | 0|acc |↑ |0.7543|± |0.0060|
| - other | 2|none | 0|acc |↑ |0.8626|± |0.0060|
| - social sciences| 2|none | 0|acc |↑ |0.9012|± |0.0053|
| - stem | 2|none | 0|acc |↑ |0.8138|± |0.0067|

| Tasks |Version|Filter|n-shot| Metric | |Value| |Stderr|
|---------------------|------:|------|-----:|--------|---|----:|---|-----:|
|gpqa_diamond_zeroshot| 1|none | 0|acc |↑ |0.399|± |0.0349|
| | |none | 0|acc_norm|↑ |0.399|± |0.0349|

| Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
|-------------------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gpqa_diamond_cot_zeroshot| 1|flexible-extract| 0|exact_match|↑ |0.7525|± |0.0307|
| | |strict-match | 0|exact_match|↑ |0.6667|± |0.0336|

| Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
|---------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot| 3|flexible-extract| 0|exact_match|↑ |0.9310|± |0.0070|
| | |strict-match | 0|exact_match|↑ |0.8567|± |0.0097|