Question about benchmark results

#5
by tarruda - opened

Are the benchmarks in the README for the final Step 3.5 Flash?

The reason I ask is that I ran some benchmarks on an IQ4_XS quant (https://huggingface.co/AesSedai/Step-3.5-Flash-GGUF) of Step 3.5 Flash, and on some of them I seem to have gotten better results than what you published.

If these benchmarks are just for midtrain, did you publish the final benchmarks anywhere? I'd love to compare with the quantized version.

|                 Tasks                 |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|---------------------------------------|------:|------|-----:|------|---|-----:|---|-----:|
|mmlu                                   |      2|none  |      |acc   |   |0.8238|±  |0.0031|
| - humanities                          |      2|none  |     0|acc   |↑  |0.7543|±  |0.0060|
|  - formal_logic                       |      1|none  |     0|acc   |↑  |0.7063|±  |0.0407|
|  - high_school_european_history       |      1|none  |     0|acc   |↑  |0.8788|±  |0.0255|
|  - high_school_us_history             |      1|none  |     0|acc   |↑  |0.9314|±  |0.0177|
|  - high_school_world_history          |      1|none  |     0|acc   |↑  |0.9283|±  |0.0168|
|  - international_law                  |      1|none  |     0|acc   |↑  |0.9008|±  |0.0273|
|  - jurisprudence                      |      1|none  |     0|acc   |↑  |0.8796|±  |0.0315|
|  - logical_fallacies                  |      1|none  |     0|acc   |↑  |0.8466|±  |0.0283|
|  - moral_disputes                     |      1|none  |     0|acc   |↑  |0.8295|±  |0.0202|
|  - moral_scenarios                    |      1|none  |     0|acc   |↑  |0.6011|±  |0.0164|
|  - philosophy                         |      1|none  |     0|acc   |↑  |0.8778|±  |0.0186|
|  - prehistory                         |      1|none  |     0|acc   |↑  |0.8889|±  |0.0175|
|  - professional_law                   |      1|none  |     0|acc   |↑  |0.6656|±  |0.0120|
|  - world_religions                    |      1|none  |     0|acc   |↑  |0.9123|±  |0.0217|
| - other                               |      2|none  |     0|acc   |↑  |0.8626|±  |0.0060|
|  - business_ethics                    |      1|none  |     0|acc   |↑  |0.8100|±  |0.0394|
|  - clinical_knowledge                 |      1|none  |     0|acc   |↑  |0.8943|±  |0.0189|
|  - college_medicine                   |      1|none  |     0|acc   |↑  |0.7919|±  |0.0310|
|  - global_facts                       |      1|none  |     0|acc   |↑  |0.6900|±  |0.0465|
|  - human_aging                        |      1|none  |     0|acc   |↑  |0.8251|±  |0.0255|
|  - management                         |      1|none  |     0|acc   |↑  |0.8641|±  |0.0339|
|  - marketing                          |      1|none  |     0|acc   |↑  |0.9573|±  |0.0133|
|  - medical_genetics                   |      1|none  |     0|acc   |↑  |0.8900|±  |0.0314|
|  - miscellaneous                      |      1|none  |     0|acc   |↑  |0.9298|±  |0.0091|
|  - nutrition                          |      1|none  |     0|acc   |↑  |0.9052|±  |0.0168|
|  - professional_accounting            |      1|none  |     0|acc   |↑  |0.7979|±  |0.0240|
|  - professional_medicine              |      1|none  |     0|acc   |↑  |0.8897|±  |0.0190|
|  - virology                           |      1|none  |     0|acc   |↑  |0.5904|±  |0.0383|
| - social sciences                     |      2|none  |     0|acc   |↑  |0.9012|±  |0.0053|
|  - econometrics                       |      1|none  |     0|acc   |↑  |0.7544|±  |0.0405|
|  - high_school_geography              |      1|none  |     0|acc   |↑  |0.9242|±  |0.0189|
|  - high_school_government_and_politics|      1|none  |     0|acc   |↑  |0.9793|±  |0.0103|
|  - high_school_macroeconomics         |      1|none  |     0|acc   |↑  |0.8923|±  |0.0157|
|  - high_school_microeconomics         |      1|none  |     0|acc   |↑  |0.9328|±  |0.0163|
|  - high_school_psychology             |      1|none  |     0|acc   |↑  |0.9541|±  |0.0090|
|  - human_sexuality                    |      1|none  |     0|acc   |↑  |0.8855|±  |0.0279|
|  - professional_psychology            |      1|none  |     0|acc   |↑  |0.8709|±  |0.0136|
|  - public_relations                   |      1|none  |     0|acc   |↑  |0.8182|±  |0.0369|
|  - security_studies                   |      1|none  |     0|acc   |↑  |0.8490|±  |0.0229|
|  - sociology                          |      1|none  |     0|acc   |↑  |0.9055|±  |0.0207|
|  - us_foreign_policy                  |      1|none  |     0|acc   |↑  |0.9600|±  |0.0197|
| - stem                                |      2|none  |     0|acc   |↑  |0.8138|±  |0.0067|
|  - abstract_algebra                   |      1|none  |     0|acc   |↑  |0.7000|±  |0.0461|
|  - anatomy                            |      1|none  |     0|acc   |↑  |0.8444|±  |0.0313|
|  - astronomy                          |      1|none  |     0|acc   |↑  |0.9211|±  |0.0219|
|  - college_biology                    |      1|none  |     0|acc   |↑  |0.9514|±  |0.0180|
|  - college_chemistry                  |      1|none  |     0|acc   |↑  |0.6000|±  |0.0492|
|  - college_computer_science           |      1|none  |     0|acc   |↑  |0.8300|±  |0.0378|
|  - college_mathematics                |      1|none  |     0|acc   |↑  |0.6600|±  |0.0476|
|  - college_physics                    |      1|none  |     0|acc   |↑  |0.7549|±  |0.0428|
|  - computer_security                  |      1|none  |     0|acc   |↑  |0.8200|±  |0.0386|
|  - conceptual_physics                 |      1|none  |     0|acc   |↑  |0.8766|±  |0.0215|
|  - electrical_engineering             |      1|none  |     0|acc   |↑  |0.8345|±  |0.0310|
|  - elementary_mathematics             |      1|none  |     0|acc   |↑  |0.8836|±  |0.0165|
|  - high_school_biology                |      1|none  |     0|acc   |↑  |0.9258|±  |0.0149|
|  - high_school_chemistry              |      1|none  |     0|acc   |↑  |0.8079|±  |0.0277|
|  - high_school_computer_science       |      1|none  |     0|acc   |↑  |0.8900|±  |0.0314|
|  - high_school_mathematics            |      1|none  |     0|acc   |↑  |0.6481|±  |0.0291|
|  - high_school_physics                |      1|none  |     0|acc   |↑  |0.7550|±  |0.0351|
|  - high_school_statistics             |      1|none  |     0|acc   |↑  |0.8333|±  |0.0254|
|  - machine_learning                   |      1|none  |     0|acc   |↑  |0.5982|±  |0.0465|

|      Groups      |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|-----:|------|---|-----:|---|-----:|
|mmlu              |      2|none  |      |acc   |   |0.8238|±  |0.0031|
| - humanities     |      2|none  |     0|acc   |↑  |0.7543|±  |0.0060|
| - other          |      2|none  |     0|acc   |↑  |0.8626|±  |0.0060|
| - social sciences|      2|none  |     0|acc   |↑  |0.9012|±  |0.0053|
| - stem           |      2|none  |     0|acc   |↑  |0.8138|±  |0.0067|

|        Tasks        |Version|Filter|n-shot| Metric |   |Value|   |Stderr|
|---------------------|------:|------|-----:|--------|---|----:|---|-----:|
|gpqa_diamond_zeroshot|      1|none  |     0|acc     |↑  |0.399|±  |0.0349|
|                     |       |none  |     0|acc_norm|↑  |0.399|±  |0.0349|

|          Tasks          |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-------------------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gpqa_diamond_cot_zeroshot|      1|flexible-extract|     0|exact_match|↑  |0.7525|±  |0.0307|
|                         |       |strict-match    |     0|exact_match|↑  |0.6667|±  |0.0336|

|  Tasks  |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|---------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot|      3|flexible-extract|     0|exact_match|↑  |0.9310|±  |0.0070|
|         |       |strict-match    |     0|exact_match|↑  |0.8567|±  |0.0097|
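
For anyone comparing these numbers against a published score, here is a rough sketch of how the Stderr column can be used to judge whether a gap is larger than sampling noise. The values in `QUANT_RESULTS` are copied from the tables above; the helper names and the `reference` argument are hypothetical, and treating the published score as an exact number with no error of its own is a simplifying assumption.

```python
# (value, stderr) pairs copied from the lm-eval tables above.
QUANT_RESULTS = {
    "mmlu": (0.8238, 0.0031),
    "gpqa_diamond_cot_zeroshot (flexible-extract)": (0.7525, 0.0307),
    "gsm8k_cot (flexible-extract)": (0.9310, 0.0070),
}


def confidence_interval(value, stderr, z=1.96):
    """Normal-approximation interval: value +/- z * stderr (z=1.96 is ~95%)."""
    return (value - z * stderr, value + z * stderr)


def differs_from(reference, value, stderr, z=1.96):
    """True if `reference` falls outside the interval around `value`.

    Assumption: `reference` (e.g. a README score) is treated as exact.
    """
    low, high = confidence_interval(value, stderr, z)
    return not (low <= reference <= high)


if __name__ == "__main__":
    for task, (value, stderr) in QUANT_RESULTS.items():
        low, high = confidence_interval(value, stderr)
        print(f"{task}: {value:.4f} (approx. 95% interval {low:.4f} to {high:.4f})")
```

A published score that lands inside the printed interval is compatible with the quantized result under this approximation. Differences in prompt format, n-shot setting, or answer-extraction filter can shift scores by more than the interval width, so this only addresses sampling noise.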
