---
title: Benchlm
colorFrom: gray
colorTo: indigo
sdk: static
pinned: false
license: apache-2.0
short_description: llm benchmarks
---

To see results on github go to https://github.com/steampunque/benchlm
To see results on hf go to https://huggingface.co/spaces/steampunque/benchlm

```
Independent LLM benchmarks for a wide range of open weight models using custom
prompts, including category and discipline summaries. The model list is
actively updated with the latest model releases. Older, obsoleted model
results are not kept.

The primary model families being tracked as of 6/2/25 are:
Meta (Llama), Mistral, Google (Gemma), Qwen (Qwen2.5, Qwen2.5 Coder, Qwen3, QwQ)

The secondary model families being tracked as of 6/2/25 are:
Microsoft (Phi), Deepseek (Qwen R1 distills), Falcon family, internlm family,
GLM family, ultravox family.

Models being tracked are selected based on general high popularity, high
performance or other innovation, with non-restrictive open source license
terms. Tests are run using a modified llama.cpp server (supporting logprob
completion mode).

MODEL CATEGORIES:

CHAT   : instruct tuned, text in / text out
THINK  : RL tuned reasoning models with a think block or equivalent
CODE   : coding optimized models
MATH   : math optimized models applied to the Hendrycks MATH500 set
VISION : image + text in, text out
AUDIO  : audio + text in, text out
MT     : machine translation

METHODOLOGY:

-The majority of the benchmarks use zero shot instruct prompts, as opposed to
 other benchmark suites such as lm-evaluation-harness, which treats any model,
 whether instruct tuned or not, as a completion model and determines results
 from completion logprobs and/or multi-shot prompting to "teach" the model
 what it is supposed to do on a prompt-by-prompt basis. As a result the
 majority of the benchmarks are run on instruct tuned models, with some
 exceptions such as coding benchmarks and a few native completion based
 benchmarks such as lambada and winogrande.
 This strategy is intentional and is explicitly designed to evaluate the
 capability of the models' instruction tuning. It also means results from the
 same benchmark can differ significantly from completion logprob / multi-shot
 prompting evals such as lm-evaluation-harness. Since the majority use case of
 instruct models is zero shot by design, this benchmark is more relevant to
 the actual use case of the instruct tuned models.
-All CoT, code, and math tests are zero shot. A few BBH tests use few-shot
 examples.
-Math CoT tests such as GSM8K, APPLE, MATH, etc. are self graded against the
 correct answer using the LLM under test. If self grading does not work
 reliably (such as with a very small model) the result is zeroed to mark the
 test invalid.
-All non-CoT MC tests do two queries: one with answers in test order and a
 second with answers circularly shifted by 1. To score a correct answer in MC,
 both queries must be answered correctly.
-All non-CoT tests (non-CoT MC, etc.) disable the think block prefix for
 thinking models. The think block is also disabled for all CODE tests with
 thinking models.
-CoT MC tests (e.g. MMLUPRO, GPQA, etc.) do one query only.
-Winogrande uses logprob completion (it evaluates the probability of a common
 completion for the two possible cases).
-The new SQA test is not run on all models. When it is run, the result is only
 added to the SQA test; it is not added to the knowledge and composite
 averages. The test is not meant for small models and results are given for
 information only.
-MMMU uses the validation split (30 questions from 30 categories) and CoT
 prompting.
-MMMUPRO uses the 10-question split and CoT prompting.
-BBA provides reference BBH results (BBHA) so dropoff from audio prompts can
 be seen. BBHA results are not part of the final tabulated average for the
 audio test.
-MT results are based on averaged sentence BLEU over the test sets using
 languages de, es, fr, ru, ja, and zh to/from en.
 Unique tokenizers are used for ja, zh, and ru for BLEU eval.
-All new tests are run with a maximum of 250 questions per test category on
 CoT tests. This is necessary to contain test time with new thinking models,
 which can generate very lengthy responses. The result is printed in italics
 if there were more than 10 skipped questions in a test category. Note some
 very old runs had skips due to JSON errors in questions, but these will not
 significantly impact averages.

TESTS:

KNOWLEDGE:
TQA - TruthfulQA
SQA - SimpleQA, 4333 question arcane knowledge quiz
JEOPARDY - 100 question JEOPARDY quiz

LANGUAGE:
LAMBADA - Language Modeling Broadened to Account for Discourse Aspects

UNDERSTANDING:
WG - Winogrande
BOOLQ - Boolean questions
STORYCLOZE - Story questions
OBQA - Open Book Question Answering
SIQA - Social IQ
RACE - Reading comprehension dataset from examinations
MMLU - Massive Multitask Language Understanding
MEDQA - Medical QA

REASONING:
CSQA - Common Sense Question Answering
COPA - Choice of Plausible Alternatives
HELLASWAG - Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations
PIQA - Physical Interaction: Question Answering
ARC - AI2 Reasoning Challenge
AGIEVAL - AGIEval logiqa, lsat, sat
AGIEVALC - Gaokao SAT, logiqa, jec (Chinese)
MUSR - Multistep Soft Reasoning

COT:
GSM8K - Grade School Math CoT
BBH - Beyond the Imitation Game Bench Hard CoT
GPQA - Google-Proof QA science CoT
MMLUPRO - Massive Multitask Language Understanding Pro CoT
AGIEVAL - satmath, aquarat
AGIEVALC - mathcloze, mathqa (Chinese)
MUSR - Multistep Soft Reasoning
APPLE - 100 custom Apple questions

MATH:
MATH1..MATH5 - MATH datasets, levels 1 through 5 (Hendrycks et al.)

CODE:
HUMANEVAL - Python
HUMANEVALP - Python, extended test
HUMANEVALX - Python, Java, JavaScript, C++
MBPP - Python
MBPPP - Python, extended test
CRUXEVAL - Python
USE {TEST}FIM FOR FIM TEST, i.e.
HUMANEVAL->HUMANEVALFIM

VISION:
CHARTQA - Chart Question Answering
DOCVQA - Document Visual QA
REALWORLDQA - RealWorld QA
MMMU - Massive Multi-discipline Multimodal Understanding (CoT)
MMMUPRO - Massive Multi-discipline Multimodal Understanding Pro (CoT)

AUDIO:
BBA - Big Bench Audio
BBHA - Big Bench Hard Audio subset with original text prompts

MT:
OPUS - Open Parallel Corpora
FLORES200 - Facebook Low Resource
```

CHAT MODELS:

MODEL | Falcon3-1B-Instruct | Falcon3-7B-Instruct | Falcon3-10B-Instruct | gemma-2-9b-it | gemma-2-27b-it | gemma-3-1b-it | gemma-3-4b-it | gemma-3-12b-it | gemma-3-12b-it | gemma-3-27b-it | glm-4-9b-chat | glm-4-9b-chat | internlm3-8b-instruct | Ling-mini-2.0 | Ling-mini-2.0 | Llama-3.1-8B-Instruct | Llama-3.2-3B-Instruct | Llama-4-Scout-17B-16E-Instruct | Llama-4-Scout-17B-16E-Instruct | Llama-4-Scout-17B-16E-Instruct | Mistral-7B-Instruct-v0.3 | Mistral-Small-3.1-24B-Instruct-2503 | Mistral-Small-3.1-24B-Instruct-2503 | Mistral-Small-3.1-24B-Instruct-2503 | Mistral-Small-3.2-24B-Instruct-2506 | Phi-3.5-mini-8k-instruct | Phi-3.5-mini-128k-instruct | Phi-4-mini-instruct | Phi-4 | phi-4 | Qwen2.5-3B-32k-Instruct | Qwen2.5-3B-32k-Instruct | Qwen2.5-7B-32k-Instruct | Qwen2.5-7B-32k-Instruct | Qwen2.5-14B-32k-Instruct | Qwen2.5-32B-Instruct | Qwen3-4B-Instruct-2507 |
---------------------------------------------|---------------------|---------------------|----------------------|---------------|----------------|---------------|---------------|----------------|----------------|----------------|---------------|---------------|-----------------------|---------------|---------------|-----------------------|-----------------------|--------------------------------|--------------------------------|--------------------------------|--------------------------|-------------------------------------|-------------------------------------|-------------------------------------|-------------------------------------|--------------------------|----------------------------|---------------------|-------|-------|-------------------------|-------------------------|-------------------------|-------------------------|--------------------------|----------------------|------------------------|
params | 1.67B | 7.46B | 10.31B | 9.24B | 27.23B | 0.99989B | 3.88B | 11.77B | 11.77B | 27.01B | 9.40B | 9.40B | 8.80B | 16.26B | 16.26B | 8.03B | 3.21B | 107.77B | 107.77B | 107.77B | 7.25B | 23.57B | 23.57B | 23.57B | 23.57B | 3.82B | 3.82B | 3.84B | 14.66B| 14.66B| 3.09B | 3.09B | 7.62B | 7.62B | 14.77B | 32.76B | 4.02B |
quant | IQ4_XS | Q6_K | IQ4_XS | Q6_K | IQ4_XS | Q8_0 | Q6_K | IQ4_XS | Q4_K_H | Q4_K_H | IQ4_XS | Q6_K | IQ4_XS | Q6_K | Q6_K_H | Q6_K | Q6_K | Q2_K_H | Q3_K_H | Q4_K_H | Q8_0 | Q2_K_H | Q3_K_H | Q4_K_H | Q4_K_H | Q6_K | Q6_K | Q6_K | IQ4_XS| Q4_K_H| IQ4_XS | Q6_K | IQ4_XS | Q6_K | IQ4_XS | IQ4_XS | Q6_K_H |
engine | llama.cpp version: 4341 | llama.cpp version: 4341 | llama.cpp version: 4341 | llama.cpp version: 3266 | llama.cpp version: 3389 | llama.cpp version: 4877 | llama.cpp version: 4888 | llama.cpp version: 4938 | llama.cpp version: 5572 | llama.cpp version: 5586 | llama.cpp version: 3496 | llama.cpp version: 3334 | llama.cpp version: 4488 | llama.cpp version: 6827 | llama.cpp version: 6827 | llama.cpp version: 3428 | llama.cpp version: 3825 | llama.cpp version: 5236 | llama.cpp version: 5279 | llama.cpp version: 5335 | llama.cpp version: 3262 | llama.cpp version: 5509 | llama.cpp version: 5509 | llama.cpp version: 5509 | llama.cpp version: 5742 | llama.cpp version: 3609 | llama.cpp version: 3600 | llama.cpp version: 4792 | llama.cpp version: 4295 | llama.cpp version: 7562 | llama.cpp version: 4038 | llama.cpp version: 4038 | llama.cpp version: 3943 | llama.cpp version: 3870 | llama.cpp version: 3821 | llama.cpp version: 3821 | llama.cpp version: 6628 |
**TEST** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** |
WG | 0.600 | 0.670 | 0.700 | 0.762 | 0.772 | 0.576 | 0.692 | 0.743 | 0.741 | 0.748 | 0.759 | 0.753 | 0.708 | 0.634 | 0.637 | 0.741 | 0.685 | - | - | - | 0.751 | 0.775 | 0.772 | 0.784 | 0.780 | 0.744 | 0.734 | 0.707 | 0.708 | 0.699 | 0.687 | 0.695 | 0.709 | 0.709 | 0.754 | 0.746 | 0.677 |
LAMBADA | 0.524 | 0.688 | 0.692 | 0.735 | 0.755 | 0.504 | 0.635 | 0.724 | 0.721 | 0.742 | 0.786 | 0.783 | 0.662 | 0.621 | 0.617 | 0.747 | 0.705 | - | - | - | 0.766 | 0.786 | 0.789 | 0.798 | 0.792 | 0.677 | 0.613 | 0.653 | 0.750 | 0.751 | 0.685 | 0.682 | 0.722 | 0.724 | 0.769 | 0.781 | 0.670 |
HELLASWAG | 0.308 | 0.684 | 0.716 | 0.775 | 0.810 | 0.307 | 0.527 | 0.779 | 0.767 | 0.802 | 0.834 | 0.840 | 0.846 | 0.673 | 0.700 | 0.696 | 0.559 | - | - | - | 0.591 | 0.866 | 0.877 | 0.899 | 0.872 | 0.716 | 0.669 | 0.542 | 0.801 | 0.819 | 0.670 | 0.713 | 0.820 | 0.822 | 0.863 | 0.894 | 0.775 |
BOOLQ | 0.364 | 0.591 | 0.621 | 0.687 | 0.739 | 0.521 | 0.603 | 0.669 | - | 0.701 | 0.633 | 0.625 | 0.562 | 0.544 | 0.585 | 0.610 | 0.478 | - | - | - | 0.658 | - | - | 0.646 | 0.684 | 0.562 | 0.573 | 0.453 | 0.653 | 0.649 | 0.517 | 0.533 | 0.617 | 0.623 | 0.647 | 0.701 | 0.606 |
STORYCLOZE | 0.774 | 0.949 | 0.947 | 0.958 | 0.973 | 0.685 | 0.900 | 0.948 | - | 0.964 | 0.967 | 0.976 | 0.982 | 0.893 | 0.924 | 0.895 | 0.870 | - | - | - | 0.917 | - | - | 0.968 | 0.969 | 0.531 | 0.921 | 0.889 | 0.754 | 0.964 | 0.913 | 0.896 | 0.920 | 0.915 | 0.938 | 0.981 | 0.928 |
CSQA | 0.488 | 0.725 | 0.746 | 0.751 | 0.763 | 0.339 | 0.614 | 0.716 | - | 0.741 | 0.727 | 0.733 | 0.730 | 0.717 | 0.746 | 0.686 | 0.642 | - | - | - | 0.627 | - | - | 0.756 | 0.751 | 0.669 | 0.660 | 0.633 | 0.740 | 0.769 | 0.701 | 0.717 | 0.768 | 0.781 | 0.795 | 0.823 | 0.737 |
OBQA | 0.380 | 0.761 | 0.745 | 0.846 | 0.860 | 0.334 | 0.648 | 0.807 | - | 0.855 | 0.821 | 0.802 | 0.801 | 0.787 | 0.806 | 0.765 | 0.709 | - | - | - | 0.676 | - | - | 0.866 | 0.880 | 0.751 | 0.720 | 0.719 | 0.857 | 0.859 | 0.700 | 0.731 | 0.802 | 0.804 | 0.863 | 0.904 | 0.804 |
COPA | 0.612 | 0.870 | 0.903 | 0.925 | 0.949 | 0.415 | 0.785 | 0.932 | - | 0.944 | 0.955 | 0.944 | 0.927 | 0.863 | 0.884 | 0.889 | 0.749 | - | - | - | 0.812 | - | - | 0.924 | 0.932 | 0.884 | 0.870 | 0.834 | 0.934 | 0.944 | 0.841 | 0.858 | 0.925 | 0.919 | 0.935 | 0.958 | 0.887 |
PIQA | 0.233 | 0.696 | 0.732 | 0.801 | 0.841 | 0.386 | 0.653 | 0.784 | - | 0.818 | 0.773 | 0.779 | 0.777 | 0.749 | 0.776 | 0.725 | 0.637 | - | - | - | 0.708 | - | - | 0.826 | 0.831 | 0.733 | 0.677 | 0.674 | 0.832 | 0.849 | 0.695 | 0.713 | 0.794 | 0.807 | 0.848 | 0.870 | 0.761 |
SIQA | 0.425 | 0.658 | 0.688 | 0.693 | 0.731 | 0.385 | 0.588 | 0.699 | - | 0.716 | 0.664 | 0.665 | 0.706 | 0.638 | 0.653 | 0.648 | 0.622 | - | - | - | 0.620 | - | - | 0.737 | 0.710 | 0.667 | 0.661 | 0.645 | 0.639 | 0.721 | 0.656 | 0.663 | 0.721 | 0.712 | 0.746 | 0.742 | 0.692 |
MEDQA | 0.141 | 0.420 | 0.430 | 0.501 | 0.549 | 0.073 | 0.292 | 0.503 | - | 0.553 | 0.436 | 0.445 | 0.457 | 0.443 | 0.472 | 0.500 | 0.413 | - | - | - | 0.334 | - | - | 0.593 | 0.597 | 0.423 | 0.395 | 0.361 | 0.560 | 0.610 | 0.344 | 0.363 | 0.453 | 0.458 | 0.542 | 0.610 | 0.494 |
SQA | - | 0.033 | - | - | 0.117 | - | 0.052 | 0.092 | - | 0.092 | - | - | 0.039 | 0.059 | 0.079 | 0.073 | - | - | - | - | - | - | - | 0.066 | 0.073 | - | - | 0.039 | - | 0.072 | - | - | - | - | - | - | 0.058 |
JEOPARDY | 0.010 | 0.400 | 0.310 | 0.580 | 0.760 | - | 0.350 | 0.550 | 0.560 | 0.830 | 0.370 | 0.420 | 0.210 | 0.680 | 0.490 | 0.510 | 0.350 | 0.680 | 0.580 | 0.540 | 0.490 | 0.680 | - | 0.740 | 0.640 | 0.320 | 0.250 | 0.280 | 0.390 | 0.520 | 0.120 | 0.120 | 0.300 | 0.290 | 0.540 | 0.600 | 0.310 |
GSM8K | 0.485 | 0.890 | 0.918 | 0.890 | 0.899 | - | 0.843 | 0.928 | _0.928_ | _0.964_ | 0.855 | 0.839 | 0.890 | _0.956_ | _0.964_ | 0.872 | 0.822 | - | - | - | 0.611 | - | - | _0.940_ | _0.968_ | 0.855 | 0.714 | 0.868 | 0.946 | _0.944_| 0.829 | 0.856 | 0.909 | 0.880 | 0.938 | 0.950 | _0.960_ |
APPLE | 0.150 | 0.810 | 0.740 | 0.750 | 0.730 | - | 0.630 | 0.740 | 0.770 | 0.850 | 0.630 | 0.610 | 0.670 | 0.860 | 0.890 | 0.690 | 0.610 | 0.840 | 0.860 | 0.860 | 0.390 | 0.830 | 0.780 | 0.820 | 0.890 | 0.560 | 0.560 | 0.640 | 0.910 | 0.850 | 0.640 | 0.560 | 0.740 | 0.750 | 0.830 | 0.860 | 0.850 |
HUMANEVAL | 0.115 | 0.737 | 0.774 | 0.658 | 0.743 | 0.408 | 0.701 | 0.859 | 0.829 | 0.890 | 0.737 | 0.731 | 0.804 | 0.829 | 0.841 | 0.652 | 0.585 | - | - | - | 0.390 | 0.841 | 0.823 | 0.853 | 0.871 | 0.682 | 0.621 | 0.646 | 0.847 | 0.829 | 0.695 | 0.780 | 0.798 | 0.817 | 0.804 | 0.884 | 0.841 |
HUMANEVALP | 0.073 | 0.628 | 0.664 | 0.548 | 0.615 | 0.317 | 0.597 | 0.713 | - | 0.719 | 0.615 | 0.634 | 0.713 | 0.731 | 0.713 | 0.536 | 0.475 | - | - | - | 0.329 | - | - | 0.731 | 0.750 | 0.591 | 0.524 | 0.554 | 0.725 | 0.713 | 0.615 | 0.682 | 0.670 | 0.658 | 0.676 | 0.768 | 0.713 |
HUMANEVALFIM | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
MBPP | 0.334 | 0.677 | 0.653 | 0.595 | 0.642 | 0.536 | 0.614 | 0.692 | - | 0.677 | 0.579 | 0.591 | 0.552 | 0.696 | 0.665 | 0.564 | 0.498 | - | - | - | 0.451 | - | - | 0.618 | 0.642 | 0.610 | 0.498 | 0.501 | 0.673 | 0.673 | 0.595 | 0.599 | 0.669 | 0.661 | 0.669 | 0.684 | 0.665 |
MBPPP | 0.312 | 0.629 | 0.611 | 0.584 | 0.638 | 0.531 | 0.598 | 0.642 | - | 0.625 | 0.562 | 0.575 | 0.477 | - | 0.660 | 0.540 | 0.482 | - | - | - | 0.397 | - | - | 0.593 | 0.647 | 0.575 | 0.477 | 0.504 | 0.651 | 0.638 | 0.540 | 0.584 | 0.633 | 0.651 | 0.633 | 0.700 | 0.629 |
HUMANEVALX_cpp | 0.054 | 0.506 | 0.603 | 0.512 | 0.579 | 0.158 | 0.585 | 0.756 | - | 0.780 | 0.439 | 0.432 | 0.402 | - | 0.719 | 0.457 | 0.323 | - | - | - | 0.225 | - | - | 0.292 | 0.713 | 0.280 | 0.219 | 0.445 | 0.676 | 0.670 | 0.420 | 0.237 | 0.475 | 0.554 | 0.323 | 0.701 | 0.652 |
HUMANEVALX_java | 0.042 | 0.640 | 0.719 | 0.640 | 0.768 | 0.317 | 0.658 | 0.804 | - | 0.810 | 0.207 | 0.628 | 0.597 | - | 0.798 | 0.487 | 0.439 | - | - | - | 0.256 | - | - | 0.804 | 0.829 | 0.079 | 0.060 | 0.536 | 0.634 | 0.524 | 0.640 | 0.615 | 0.695 | 0.737 | 0.780 | 0.865 | 0.823 |
HUMANEVALX_js | 0.115 | 0.676 | 0.652 | 0.579 | 0.743 | 0.359 | 0.664 | 0.835 | - | 0.841 | 0.628 | 0.628 | 0.670 | - | 0.829 | 0.560 | 0.067 | - | - | - | 0.402 | - | - | 0.786 | 0.786 | 0.560 | 0.451 | 0.548 | 0.786 | 0.804 | 0.646 | 0.689 | 0.719 | 0.750 | 0.798 | 0.847 | 0.841 |
HUMANEVALX | 0.071 | 0.607 | 0.658 | 0.577 | 0.697 | 0.278 | 0.636 | 0.798 | - | 0.810 | 0.424 | 0.563 | 0.556 | - | 0.782 | 0.502 | 0.276 | - | - | - | 0.294 | - | - | 0.628 | 0.776 | 0.306 | 0.243 | 0.510 | 0.699 | 0.666 | 0.569 | 0.514 | 0.630 | 0.680 | 0.634 | 0.804 | 0.772 |
CRUXEVAL_input | 0.210 | 0.411 | 0.448 | 0.462 | 0.485 | 0.038 | 0.388 | 0.440 | - | 0.528 | 0.416 | 0.406 | 0.477 | - | 0.472 | 0.435 | 0.353 | - | - | - | 0.276 | - | - | 0.547 | 0.550 | 0.398 | 0.388 | 0.336 | 0.447 | 0.461 | 0.350 | 0.331 | 0.387 | 0.412 | 0.541 | 0.517 | 0.518 |
CRUXEVAL_output | 0.152 | 0.355 | 0.410 | 0.375 | 0.482 | 0.196 | 0.348 | 0.457 | - | 0.491 | 0.356 | 0.338 | 0.372 | - | 0.458 | 0.360 | 0.291 | - | - | - | 0.303 | - | - | 0.516 | 0.498 | 0.342 | 0.296 | 0.317 | 0.463 | 0.475 | 0.275 | 0.311 | 0.382 | 0.386 | 0.471 | 0.455 | 0.463 |
CRUXEVAL | 0.181 | 0.383 | 0.429 | 0.418 | 0.483 | 0.117 | 0.368 | 0.448 | - | 0.510 | 0.386 | 0.372 | 0.425 | - | 0.465 | 0.397 | 0.322 | - | - | - | 0.290 | - | - | 0.531 | 0.524 | 0.370 | 0.342 | 0.326 | 0.455 | 0.468 | 0.312 | 0.321 | 0.385 | 0.399 | 0.506 | 0.486 | 0.491 |
CRUXEVALFIM_input | - | 0.418 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
CRUXEVALFIM_output | - | 0.356 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
CRUXEVALFIM | - | 0.387 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
TQA_mc | 0.146 | 0.523 | 0.510 | 0.701 | 0.767 | 0.115 | 0.468 | 0.663 | - | 0.696 | 0.636 | 0.640 | 0.637 | - | 0.621 | 0.564 | 0.555 | - | - | - | 0.549 | - | - | 0.767 | 0.713 | 0.621 | 0.581 | 0.477 | 0.725 | 0.740 | 0.516 | 0.548 | 0.654 | 0.657 | 0.747 | 0.804 | 0.707 |
TQA_tf | 0.381 | 0.410 | 0.431 | 0.692 | 0.725 | 0.390 | 0.491 | 0.677 | - | 0.634 | 0.484 | 0.457 | 0.593 | - | 0.592 | 0.512 | 0.566 | - | - | - | 0.548 | - | - | 0.735 | 0.670 | 0.483 | 0.487 | 0.588 | 0.686 | 0.700 | 0.414 | 0.300 | 0.574 | 0.568 | 0.706 | 0.731 | 0.640 |
TQA | 0.354 | 0.423 | 0.440 | 0.693 | 0.730 | 0.358 | 0.488 | 0.676 | - | 0.641 | 0.502 | 0.478 | 0.598 | - | 0.596 | 0.518 | 0.565 | - | - | - | 0.548 | - | - | 0.739 | 0.675 | 0.499 | 0.498 | 0.575 | 0.691 | 0.705 | 0.426 | 0.329 | 0.583 | 0.578 | 0.711 | 0.740 | 0.648 |
ARC_challenge | 0.374 | 0.809 | 0.819 | 0.882 | 0.897 | 0.319 | 0.699 | 0.869 | - | 0.899 | 0.835 | 0.853 | 0.871 | - | 0.877 | 0.776 | 0.706 | - | - | - | 0.688 | - | - | 0.912 | 0.907 | 0.813 | 0.802 | 0.759 | 0.911 | 0.912 | 0.750 | 0.777 | 0.843 | 0.851 | 0.911 | 0.934 | 0.875 |
ARC_easy | 0.598 | 0.925 | 0.933 | 0.952 | 0.963 | 0.563 | 0.875 | 0.955 | - | 0.971 | 0.933 | 0.940 | 0.945 | - | 0.956 | 0.906 | 0.843 | - | - | - | 0.843 | - | - | 0.970 | 0.968 | 0.934 | 0.932 | 0.908 | 0.970 | 0.969 | 0.895 | 0.904 | 0.945 | 0.946 | 0.969 | 0.978 | 0.952 |
ARC | 0.524 | 0.886 | 0.895 | 0.929 | 0.941 | 0.482 | 0.817 | 0.927 | - | 0.947 | 0.901 | 0.911 | 0.920 | - | 0.930 | 0.863 | 0.798 | - | - | - | 0.792 | - | - | 0.951 | 0.948 | 0.894 | 0.889 | 0.859 | 0.950 | 0.950 | 0.847 | 0.862 | 0.911 | 0.915 | 0.950 | 0.963 | 0.927 |
RACE_high | 0.431 | 0.698 | 0.730 | 0.802 | 0.833 | 0.338 | 0.633 | 0.795 | - | 0.829 | 0.788 | 0.787 | 0.830 | - | 0.773 | 0.679 | 0.589 | - | - | - | 0.607 | - | - | 0.853 | 0.853 | 0.613 | 0.625 | 0.690 | 0.819 | 0.830 | 0.698 | 0.712 | 0.779 | 0.788 | 0.852 | 0.882 | 0.802 |
RACE_middle | 0.463 | 0.777 | 0.793 | 0.849 | 0.883 | 0.398 | 0.713 | 0.858 | - | 0.883 | 0.816 | 0.825 | 0.866 | - | 0.839 | 0.734 | 0.680 | - | - | - | 0.696 | - | - | 0.894 | 0.885 | 0.706 | 0.692 | 0.745 | 0.861 | 0.868 | 0.775 | 0.776 | 0.841 | 0.853 | 0.887 | 0.923 | 0.850 |
RACE | 0.440 | 0.721 | 0.748 | _0.816_ | 0.847 | 0.355 | 0.656 | 0.813 | - | 0.844 | 0.796 | 0.798 | 0.840 | - | 0.793 | 0.695 | 0.615 | - | - | - | _0.633_ | - | - | 0.865 | 0.862 | 0.640 | 0.645 | 0.706 | 0.831 | 0.841 | 0.720 | 0.730 | 0.797 | 0.807 | 0.862 | 0.894 | 0.816 |
MMLU abstract_algebra | 0.180 | 0.410 | 0.450 | 0.330 | 0.310 | 0.110 | 0.160 | 0.350 | - | 0.410 | 0.220 | 0.210 | 0.300 | - | 0.340 | 0.200 | 0.270 | - | - | - | 0.190 | - | - | 0.370 | 0.430 | 0.300 | 0.210 | 0.210 | 0.410 | 0.510 | 0.240 | 0.250 | 0.440 | 0.430 | 0.570 | 0.600 | 0.510 |
anatomy | 0.318 | 0.577 | 0.592 | 0.626 | 0.607 | 0.296 | 0.459 | 0.577 | - | 0.629 | 0.503 | 0.511 | 0.637 | - | 0.592 | 0.555 | 0.540 | - | - | - | 0.447 | - | - | 0.718 | 0.725 | 0.570 | 0.585 | 0.503 | 0.703 | 0.711 | 0.525 | 0.562 | 0.622 | 0.622 | 0.644 | 0.733 | 0.585 |
astronomy | 0.263 | 0.736 | 0.756 | 0.760 | 0.828 | 0.296 | 0.559 | 0.769 | - | 0.848 | 0.644 | 0.651 | 0.802 | - | 0.743 | 0.677 | 0.565 | - | - | - | 0.573 | - | - | 0.888 | 0.888 | 0.703 | 0.703 | 0.611 | 0.776 | 0.835 | 0.618 | 0.657 | 0.763 | 0.769 | 0.868 | 0.875 | 0.789 |
business_ethics | 0.260 | 0.570 | 0.560 | 0.620 | 0.670 | 0.270 | 0.480 | 0.630 | - | 0.710 | 0.570 | 0.610 | 0.670 | - | 0.700 | 0.550 | 0.480 | - | - | - | 0.520 | - | - | 0.730 | 0.720 | 0.620 | 0.620 | 0.560 | 0.740 | 0.750 | 0.630 | 0.590 | 0.680 | 0.710 | 0.750 | 0.800 | 0.680 |
clinical_knowledge | 0.373 | 0.652 | 0.683 | 0.743 | 0.788 | 0.316 | 0.554 | 0.716 | - | 0.784 | 0.618 | 0.622 | 0.716 | - | 0.716 | 0.675 | 0.592 | - | - | - | 0.581 | - | - | 0.811 | 0.777 | 0.713 | 0.698 | 0.637 | 0.781 | 0.781 | 0.633 | 0.645 | 0.709 | 0.713 | 0.803 | 0.815 | 0.732 |
college_biology | 0.340 | 0.763 | 0.777 | 0.854 | 0.895 | 0.256 | 0.618 | 0.826 | - | 0.847 | 0.687 | 0.715 | 0.777 | - | 0.777 | 0.722 | 0.625 | - | - | - | 0.625 | - | - | 0.888 | 0.902 | 0.805 | 0.763 | 0.694 | 0.868 | 0.895 | 0.694 | 0.694 | 0.784 | 0.784 | 0.854 | 0.923 | 0.840 |
college_chemistry | 0.180 | 0.470 | 0.430 | 0.470 | 0.430 | 0.180 | 0.260 | 0.400 | - | 0.470 | 0.380 | 0.380 | 0.440 | - | 0.440 | 0.400 | 0.310 | - | - | - | 0.350 | - | - | 0.480 | 0.500 | 0.460 | 0.430 | 0.420 | 0.520 | 0.540 | 0.310 | 0.370 | 0.480 | 0.490 | 0.460 | 0.530 | 0.490 |
college_computer_science | 0.110 | 0.540 | 0.590 | 0.460 | 0.580 | 0.200 | 0.360 | 0.500 | - | 0.560 | 0.470 | 0.480 | 0.640 | - | 0.600 | 0.400 | 0.350 | - | - | - | 0.320 | - | - | 0.650 | 0.580 | 0.480 | 0.410 | 0.420 | 0.600 | 0.640 | 0.390 | 0.460 | 0.620 | 0.590 | 0.630 | 0.720 | 0.620 |
college_mathematics | 0.090 | 0.320 | 0.320 | 0.260 | 0.300 | 0.080 | 0.170 | 0.300 | - | 0.400 | 0.240 | 0.280 | 0.280 | - | 0.460 | 0.260 | 0.210 | - | - | - | 0.180 | - | - | 0.340 | 0.380 | 0.270 | 0.170 | 0.160 | 0.340 | 0.410 | 0.200 | 0.180 | 0.380 | 0.350 | 0.490 | 0.540 | 0.380 |
college_medicine | 0.283 | 0.566 | 0.612 | 0.658 | 0.716 | 0.260 | 0.462 | 0.624 | - | 0.682 | 0.572 | 0.589 | 0.676 | - | 0.658 | 0.589 | 0.491 | - | - | - | 0.456 | - | - | 0.722 | 0.728 | 0.612 | 0.566 | 0.531 | 0.728 | 0.757 | 0.560 | 0.606 | 0.606 | 0.624 | 0.710 | 0.739 | 0.647 |
college_physics | 0.186 | 0.372 | 0.411 | 0.352 | 0.421 | 0.098 | 0.196 | 0.411 | - | 0.470 | 0.313 | 0.323 | 0.362 | - | 0.490 | 0.313 | 0.303 | - | - | - | 0.254 | - | - | 0.529 | 0.539 | 0.333 | 0.294 | 0.284 | 0.529 | 0.509 | 0.382 | 0.392 | 0.401 | 0.372 | 0.519 | 0.656 | 0.529 |
computer_security | 0.370 | 0.710 | 0.690 | 0.730 | 0.710 | 0.350 | 0.590 | 0.740 | - | 0.760 | 0.710 | 0.730 | 0.720 | - | 0.690 | 0.690 | 0.620 | - | - | - | 0.600 | - | - | 0.710 | 0.720 | 0.700 | 0.650 | 0.700 | 0.730 | 0.760 | 0.650 | 0.690 | 0.720 | 0.710 | 0.730 | 0.800 | 0.710 |
conceptual_physics | 0.234 | 0.680 | 0.680 | 0.638 | 0.727 | 0.174 | 0.404 | 0.634 | - | 0.748 | 0.561 | 0.587 | 0.646 | - | 0.697 | 0.463 | 0.361 | - | - | - | 0.365 | - | - | 0.744 | 0.731 | 0.565 | 0.553 | 0.544 | 0.748 | 0.765 | 0.485 | 0.519 | 0.642 | 0.642 | 0.800 | 0.834 | 0.753 |
econometrics | 0.122 | 0.649 | 0.587 | 0.557 | 0.587 | 0.140 | 0.315 | 0.535 | - | 0.570 | 0.456 | 0.464 | 0.578 | - | 0.578 | 0.482 | 0.359 | - | - | - | 0.318 | - | - | 0.587 | 0.614 | 0.456 | 0.421 | 0.368 | 0.596 | 0.561 | 0.421 | 0.438 | 0.605 | 0.596 | 0.649 | 0.675 | 0.622 |
electrical_engineering | 0.220 | 0.641 | 0.648 | 0.558 | 0.593 | 0.296 | 0.393 | 0.558 | - | 0.627 | 0.544 | 0.572 | 0.655 | - | 0.662 | 0.524 | 0.462 | - | - | - | 0.393 | - | - | 0.703 | 0.662 | 0.496 | 0.475 | 0.572 | 0.634 | 0.675 | 0.441 | 0.434 | 0.606 | 0.606 | 0.648 | 0.703 | 0.634 |
elementary_mathematics | 0.113 | 0.505 | 0.497 | 0.476 | 0.476 | 0.058 | 0.288 | 0.502 | - | 0.719 | 0.367 | 0.373 | 0.481 | - | 0.626 | 0.357 | 0.280 | - | - | - | 0.222 | - | - | 0.653 | 0.621 | 0.423 | 0.388 | 0.335 | 0.544 | 0.579 | 0.407 | 0.417 | 0.560 | 0.568 | 0.791 | 0.838 | 0.616 |
formal_logic | 0.182 | 0.444 | 0.484 | 0.293 | 0.468 | 0.142 | 0.269 | 0.452 | - | 0.507 | 0.325 | 0.357 | 0.420 | - | 0.380 | 0.420 | 0.253 | - | - | - | 0.277 | - | - | 0.563 | 0.484 | 0.452 | 0.380 | 0.380 | 0.531 | 0.587 | 0.325 | 0.341 | 0.452 | 0.428 | 0.539 | 0.626 | 0.579 |
global_facts | 0.120 | 0.190 | 0.290 | 0.330 | 0.370 | 0.070 | 0.110 | 0.300 | - | 0.420 | 0.200 | 0.240 | 0.330 | - | 0.220 | 0.150 | 0.110 | - | - | - | 0.160 | - | - | 0.520 | 0.450 | 0.240 | 0.130 | 0.120 | 0.320 | 0.340 | 0.140 | 0.200 | 0.260 | 0.260 | 0.470 | 0.430 | 0.270 |
high_school_biology | 0.348 | 0.764 | 0.774 | 0.851 | 0.890 | 0.358 | 0.645 | 0.816 | - | 0.854 | 0.800 | 0.809 | 0.825 | - | 0.838 | 0.729 | 0.677 | - | - | - | 0.654 | - | - | 0.874 | 0.870 | 0.793 | 0.774 | 0.729 | 0.887 | 0.883 | 0.722 | 0.754 | 0.803 | 0.806 | 0.845 | 0.896 | 0.858 |
high_school_chemistry | 0.216 | 0.522 | 0.507 | 0.586 | 0.600 | 0.167 | 0.359 | 0.517 | - | 0.610 | 0.546 | 0.517 | 0.527 | - | 0.566 | 0.467 | 0.433 | - | - | - | 0.310 | - | - | 0.650 | 0.640 | 0.512 | 0.492 | 0.482 | 0.655 | 0.699 | 0.413 | 0.463 | 0.532 | 0.536 | 0.596 | 0.724 | 0.684 |
high_school_computer_science | 0.250 | 0.740 | 0.740 | 0.710 | 0.770 | 0.270 | 0.570 | 0.750 | - | 0.830 | 0.660 | 0.660 | 0.760 | - | 0.830 | 0.610 | 0.540 | - | - | - | 0.490 | - | - | 0.810 | 0.790 | 0.610 | 0.580 | 0.590 | 0.870 | 0.820 | 0.600 | 0.660 | 0.770 | 0.770 | 0.830 | 0.870 | 0.780 |
high_school_european_history | 0.490 | 0.757 | 0.745 | 0.806 | 0.830 | 0.363 | 0.678 | 0.818 | - | 0.824 | 0.812 | 0.830 | 0.787 | - | 0.775 | 0.709 | 0.672 | - | - | - | 0.678 | - | - | 0.830 | 0.830 | 0.727 | 0.672 | 0.678 | 0.812 | 0.848 | 0.733 | 0.733 | 0.787 | 0.800 | 0.824 | 0.818 | 0.787 |
high_school_geography | 0.393 | 0.717 | 0.747 | 0.878 | 0.888 | 0.424 | 0.646 | 0.818 | - | 0.858 | 0.792 | 0.818 | 0.792 | - | 0.818 | 0.757 | 0.671 | - | - | - | 0.671 | - | - | 0.853 | 0.853 | 0.792 | 0.737 | 0.722 | 0.888 | 0.888 | 0.712 | 0.732 | 0.833 | 0.833 | 0.868 | 0.883 | 0.797 |
high_school_government_and_politics | 0.487 | 0.875 | 0.875 | 0.926 | 0.963 | 0.450 | 0.772 | 0.911 | - | 0.937 | 0.875 | 0.870 | 0.880 | - | 0.891 | 0.818 | 0.725 | - | - | - | 0.805 | - | - | 0.937 | 0.948 | 0.849 | 0.834 | 0.839 | 0.937 | 0.937 | 0.772 | 0.797 | 0.917 | 0.917 | 0.958 | 0.968 | 0.860 |
high_school_macroeconomics | 0.235 | 0.653 | 0.687 | 0.717 | 0.758 | 0.238 | 0.474 | 0.682 | - | 0.771 | 0.651 | 0.653 | 0.733 | - | 0.682 | 0.556 | 0.497 | - | - | - | 0.478 | - | - | 0.766 | 0.743 | 0.646 | 0.635 | 0.617 | 0.807 | 0.805 | 0.564 | 0.592 | 0.684 | 0.684 | 0.802 | 0.825 | 0.715 |
high_school_mathematics | 0.088 | 0.344 | 0.337 | 0.277 | 0.325 | 0.033 | 0.211 | 0.337 | - | 0.422 | 0.237 | 0.240 | 0.285 | - | 0.429 | 0.255 | 0.233 | - | - | - | 0.162 | - | - | 0.362 | 0.348 | 0.214 | 0.203 | 0.185 | 0.274 | 0.411 | 0.270 | 0.244 | 0.440 | 0.422 | 0.500 | 0.537 | 0.396 |
high_school_microeconomics | 0.268 | 0.823 | 0.827 | 0.801 | 0.852 | 0.315 | 0.533 | 0.798 | - | 0.831 | 0.760 | 0.773 | 0.798 | - | 0.810 | 0.684 | 0.575 | - | - | - | 0.540 | - | - | 0.873 | 0.878 | 0.794 | 0.743 | 0.760 | 0.861 | 0.886 | 0.672 | 0.697 | 0.827 | 0.827 | 0.857 | 0.907 | 0.840 |
high_school_physics | 0.099 | 0.509 | 0.496 | 0.423 | 0.496 | 0.112 | 0.198 | 0.403 | - | 0.562 | 0.344 | 0.364 | 0.443 | - | 0.589 | 0.317 | 0.211 | - | - | - | 0.165 | - | - | 0.596 | 0.582 | 0.377 | 0.384 | 0.324 | 0.569 | 0.589 | 0.317 | 0.311 | 0.470 | 0.456 | 0.635 | 0.695 | 0.602 |
high_school_psychology | 0.445 | 0.827 | 0.853 | 0.896 | 0.910 | 0.445 | 0.750 | 0.882 | - | 0.900 | 0.840 | 0.858 | 0.862 | - | 0.867 | 0.834 | 0.761 | - | - | - | 0.764 | - | - | 0.913 | 0.900 | 0.855 | 0.844 | 0.823 | 0.904 | 0.913 | 0.803 | 0.796 | 0.858 | 0.856 | 0.882 | 0.902 | 0.860 |
high_school_statistics | 0.185 | 0.564 | 0.625 | 0.574 | 0.615 | 0.129 | 0.337 | 0.574 | - | 0.555 | 0.509 | 0.500 | 0.638 | - | 0.625 | 0.462 | 0.342 | - | - | - | 0.361 | - | - | 0.648 | 0.648 | 0.569 | 0.523 | 0.458 | 0.643 | 0.722 | 0.481 | 0.518 | 0.615 | 0.648 | 0.717 | 0.782 | 0.671 |
high_school_us_history | 0.436 | 0.764 | 0.794 | _0.829_ | 0.867 | 0.348 | 0.705 | 0.843 | - | 0.872 | 0.833 | 0.867 | 0.774 | - | 0.764 | 0.784 | 0.696 | - | - | - | _0.699_ | - | - | 0.897 | 0.921 | 0.759 | 0.735 | 0.754 | 0.877 | 0.872 | 0.715 | 0.759 | 0.843 | 0.852 | 0.882 | 0.906 | 0.803 |
high_school_world_history | 0.535 | 0.759 | 0.818 | 0.872 | 0.881 | 0.392 | 0.696 | 0.864 | - | 0.907 | 0.810 | 0.827 | 0.801 | - | 0.805 | 0.789 | 0.725 | - | - | - | 0.720 | - | - | 0.869 | 0.873 | 0.746 | 0.742 | 0.772 | 0.869 | 0.898 | 0.776 | 0.793 | 0.818 | 0.827 | 0.869 | 0.877 | 0.822 |
human_aging | 0.309 | 0.596 | 0.627 | 0.690 | 0.739 | 0.345 | 0.524 | 0.645 | - | 0.699 | 0.582 | 0.591 | 0.641 | - | 0.663 | 0.618 | 0.569 | - | - | - | 0.542 | - | - | 0.713 | 0.704 | 0.582 | 0.547 | 0.587 | 0.726 | 0.730 | 0.569 | 0.587 | 0.681 | 0.690 | 0.717 | 0.771 | 0.641 |
human_sexuality | 0.351 | 0.648 | 0.694 | 0.746 | 0.755 | 0.358 | 0.519 | 0.740 | - | 0.763 | 0.648 | 0.633 | 0.648 | - | 0.694 | 0.671 | 0.587 | - | - | - | 0.569 | - | - | 0.839 | 0.816 | 0.664 | 0.587 | 0.603 | 0.740 | 0.816 | 0.625 | 0.625 | 0.740 | 0.717 | 0.786 | 0.839 | 0.664 |
international_law | 0.404 | 0.727 | 0.776 | 0.801 | 0.760 | 0.495 | 0.677 | 0.801 | - | 0.809 | 0.735 | 0.752 | 0.743 | - | 0.760 | 0.776 | 0.710 | - | - | - | 0.710 | - | - | 0.834 | 0.818 | 0.735 | 0.727 | 0.727 | 0.892 | 0.867 | 0.710 | 0.685 | 0.768 | 0.785 | 0.834 | 0.867 | 0.752 |
jurisprudence | 0.444 | 0.740 | 0.768 | 0.785 | 0.833 | 0.379 | 0.648 | 0.740 | - | 0.796 | 0.675 | 0.722 | 0.777 | - | 0.777 | 0.731 | 0.574 | - | - | - | 0.626 | - | - | 0.833 | 0.805 | 0.722 | 0.750 | 0.722 | 0.787 | 0.879 | 0.694 | 0.712 | 0.759 | 0.750 | 0.824 | 0.824 | 0.759 |
logical_fallacies | 0.380 | 0.711 | 0.730 | 0.811 | 0.797 | 0.300 | 0.644 | 0.779 | - | 0.871 | 0.730 | 0.754 | 0.717 | - | 0.797 | 0.736 | 0.687 | - | - | - | 0.660 | - | - | 0.797 | 0.785 | 0.785 | 0.754 | 0.766 | 0.779 | 0.846 | 0.705 | 0.723 | 0.773 | 0.766 | 0.834 | 0.877 | 0.803 |
machine_learning | 0.196 | 0.508 | 0.491 | 0.437 | 0.571 | 0.169 | 0.285 | 0.464 | - | 0.482 | 0.419 | 0.401 | 0.500 | - | 0.553 | 0.366 | 0.285 | - | - | - | 0.321 | - | - | 0.571 | 0.571 | 0.437 | 0.375 | 0.375 | 0.544 | 0.598 | 0.339 | 0.321 | 0.437 | 0.410 | 0.526 | 0.642 | 0.526 |
management | 0.417 | 0.825 | 0.786 | 0.825 | 0.844 | 0.475 | 0.708 | 0.864 | - | 0.825 | 0.737 | 0.766 | 0.834 | - | 0.834 | 0.737 | 0.669 | - | - | - | 0.708 | - | - | 0.844 | 0.864 | 0.786 | 0.776 | 0.747 | 0.854 | 0.834 | 0.689 | 0.718 | 0.805 | 0.825 | 0.825 | 0.864 | 0.786 |
marketing | 0.517 | 0.820 | 0.854 | 0.863 | 0.893 | 0.508 | 0.782 | 0.858 | - | 0.897 | 0.850 | 0.858 | 0.888 | - | 0.829 | 0.837 | 0.799 | - | - | - | 0.756 | - | - | 0.893 | 0.888 | 0.820 | 0.803 | 0.837 | 0.914 | 0.910 | 0.811 | 0.816 | 0.888 | 0.893 | 0.897 | 0.901 | 0.833 |
medical_genetics | 0.340 | 0.720 | 0.750 | 0.780 | 0.810 | 0.240 | 0.510 | 0.720 | - | 0.790 | 0.630 | 0.640 | 0.710 | - | 0.750 | 0.720 | 0.660 | - | - | - | 0.600 | - | - | 0.850 | 0.880 | 0.710 | 0.700 | 0.700 | 0.860 | 0.830 | 0.660 | 0.690 | 0.770 | 0.770 | 0.820 | 0.900 | 0.740 |
miscellaneous | 0.420 | 0.749 | 0.768 | 0.830 | 0.854 | 0.401 | 0.687 | 0.825 | - | 0.879 | 0.775 | 0.796 | 0.768 | - | 0.790 | 0.773 | 0.736 | - | - | - | 0.727 | - | - | 0.872 | 0.872 | 0.777 | 0.759 | 0.734 | 0.864 | 0.869 | 0.724 | 0.726 | 0.807 | 0.814 | 0.871 | 0.885 | 0.782 |
moral_disputes | 0.323 | 0.609 | 0.618 | 0.680 | 0.736 | 0.332 | 0.511 | 0.664 | - | 0.719 | 0.604 | 0.612 | 0.635 | - | 0.604 | 0.621 | 0.560 | - | - | - | 0.524 | - | - | 0.748 | 0.754 | 0.615 | 0.621 | 0.653 | 0.748 | 0.777 | 0.537 | 0.566 | 0.664 | 0.676 | 0.725 | 0.760 | 0.606 |
moral_scenarios | 0.115 | 0.165 | 0.411 | 0.325 | 0.366 | 0.117 | 0.143 | 0.207 | - | 0.489 | 0.307 | 0.360 | 0.188 | - | 0.241 | 0.205 | 0.410 | - | - | - | 0.122 | - | - | 0.482 | 0.377 | 0.366 | 0.404 | 0.270 | 0.582 | 0.655 | 0.130 | 0.058 | 0.318 | 0.368 | 0.546 | 0.565 | 0.268 |
nutrition | 0.313 | 0.650 | 0.666 | 0.683 | 0.758 | 0.333 | 0.565 | 0.676 | - | 0.764 | 0.643 | 0.653 | 0.751 | - | 0.722 | 0.689 | 0.620 | - | - | - | 0.555 | - | - | 0.843 | 0.826 | 0.669 | 0.620 | 0.630 | 0.771 | 0.764 | 0.647 | 0.630 | 0.745 | 0.745 | 0.790 | 0.797 | 0.686 |
philosophy | 0.327 | 0.681 | 0.675 | 0.658 | 0.713 | 0.363 | 0.536 | 0.726 | - | 0.742 | 0.652 | 0.659 | 0.688 | - | 0.630 | 0.617 | 0.578 | - | - | - | 0.587 | - | - | 0.736 | 0.781 | 0.630 | 0.588 | 0.630 | 0.784 | 0.778 | 0.562 | 0.565 | 0.675 | 0.688 | 0.774 | 0.778 | 0.649 |
prehistory | 0.308 | 0.660 | 0.697 | 0.728 | 0.783 | 0.342 | 0.577 | 0.759 | - | 0.827 | 0.635 | 0.663 | 0.641 | - | 0.722 | 0.700 | 0.604 | - | - | - | 0.580 | - | - | 0.805 | 0.805 | 0.697 | 0.663 | 0.617 | 0.805 | 0.824 | 0.641 | 0.666 | 0.762 | 0.756 | 0.836 | 0.861 | 0.703 |
professional_accounting | 0.184 | 0.418 | 0.432 | 0.496 | 0.514 | 0.152 | 0.280 | 0.436 | - | 0.531 | 0.404 | 0.425 | 0.429 | - | 0.457 | 0.393 | 0.336 | - | - | - | 0.336 | - | - | 0.531 | 0.517 | 0.418 | 0.386 | 0.421 | 0.510 | 0.592 | 0.386 | 0.414 | 0.457 | 0.460 | 0.560 | 0.631 | 0.471 |
professional_law | 0.202 | 0.397 | 0.417 | 0.478 | 0.528 | 0.177 | 0.323 | 0.441 | - | 0.489 | 0.404 | 0.408 | 0.417 | - | 0.391 | 0.397 | 0.369 | - | - | - | 0.333 | - | - | 0.518 | 0.505 | 0.410 | 0.401 | 0.394 | 0.492 | 0.554 | 0.340 | 0.337 | 0.401 | 0.402 | 0.477 | 0.541 | 0.404 |
professional_medicine | 0.235 | 0.639 | 0.636 | 0.756 | 0.794 | 0.113 | 0.481 | 0.761 | - | 0.783 | 0.654 | 0.680 | 0.680 | - | 0.724 | 0.724 | 0.713 | - | - | - | 0.564 | - | - | 0.827 | 0.827 | 0.687 | 0.658 | 0.588 | 0.823 | 0.812 | 0.573 | 0.580 | 0.680 | 0.683 | 0.812 | 0.845 | 0.709 |
professional_psychology | 0.300 | 0.647 | 0.665 | 0.728 | 0.805 | 0.272 | 0.495 | 0.718 | - | 0.753 | 0.598
| 0.609 | 0.684 | - | 0.689 | 0.642 | 0.509 | - | - | - | 0.521 | - | - | 0.790 | 0.777 | 0.655 | 0.617 | 0.619 | 0.799 | 0.805 | 0.586 | 0.591 | 0.707 | 0.702 | 0.776 | 0.810 | 0.668 | public_relations | 0.409 | 0.563 | 0.600 | 0.700 | 0.672 | 0.354 | 0.509 | 0.690 | - | 0.681 | 0.572 | 0.627 | 0.618 | - | 0.590 | 0.518 | 0.545 | - | - | - | 0.554 | - | - | 0.736 | 0.736 | 0.554 | 0.572 | 0.600 | 0.727 | 0.754 | 0.563 | 0.572 | 0.627 | 0.645 | 0.736 | 0.663 | 0.636 | security_studies | 0.240 | 0.608 | 0.644 | 0.746 | 0.763 | 0.285 | 0.632 | 0.661 | - | 0.755 | 0.624 | 0.632 | 0.718 | - | 0.673 | 0.665 | 0.616 | - | - | - | 0.600 | - | - | 0.787 | 0.767 | 0.669 | 0.673 | 0.661 | 0.730 | 0.759 | 0.620 | 0.653 | 0.718 | 0.718 | 0.767 | 0.775 | 0.714 | sociology | 0.412 | 0.781 | 0.791 | 0.815 | 0.860 | 0.517 | 0.666 | 0.820 | - | 0.850 | 0.736 | 0.741 | 0.810 | - | 0.791 | 0.786 | 0.741 | - | - | - | 0.716 | - | - | 0.860 | 0.875 | 0.820 | 0.781 | 0.766 | 0.870 | 0.865 | 0.716 | 0.736 | 0.815 | 0.825 | 0.855 | 0.860 | 0.791 | us_foreign_policy | 0.510 | 0.780 | 0.790 | 0.868 | 0.840 | 0.460 | 0.740 | 0.890 | - | 0.860 | 0.780 | 0.800 | 0.840 | - | 0.800 | 0.800 | 0.800 | - | - | - | 0.757 | - | - | 0.920 | 0.910 | 0.760 | 0.770 | 0.770 | 0.890 | 0.880 | 0.750 | 0.780 | 0.820 | 0.820 | 0.890 | 0.880 | 0.770 | virology | 0.246 | 0.433 | 0.445 | 0.472 | 0.506 | 0.283 | 0.415 | 0.469 | - | 0.481 | 0.415 | 0.439 | 0.469 | - | 0.427 | 0.439 | 0.415 | - | - | - | 0.387 | - | - | 0.512 | 0.512 | 0.403 | 0.367 | 0.367 | 0.500 | 0.487 | 0.373 | 0.427 | 0.463 | 0.457 | 0.487 | 0.518 | 0.475 | world_religions | 0.403 | 0.748 | 0.801 | 0.800 | 0.847 | 0.350 | 0.684 | 0.795 | - | 0.836 | 0.766 | 0.766 | 0.748 | - | 0.789 | 0.789 | 0.742 | - | - | - | 0.747 | - | - | 0.871 | 0.853 | 0.742 | 0.725 | 0.707 | 0.836 | 0.818 | 0.783 | 0.760 | 0.818 | 0.818 | 0.859 | 0.871 | 0.771 | MMLU | 0.285 | 0.591 | 0.623 | _0.647_ | 0.687 | 0.269 | 0.477 | 0.631 | - | 0.701 | 0.580 | 0.595 | 0.617 
| - | 0.629 | 0.570 | 0.525 | - | - | - | _0.486_ | - | - | 0.717 | 0.704 | 0.599 | 0.578 | 0.560 | 0.710 | 0.737 | 0.532 | 0.540 | 0.639 | 0.643 | 0.721 | 0.757 | 0.639 | AGIEVAL aquarat | 0.374 | 0.602 | 0.562 | 0.665 | 0.602 | 0.409 | 0.763 | 0.846 | - | 0.844 | 0.653 | 0.637 | 0.783 | - | 0.884 | 0.598 | 0.633 | - | - | - | 0.279 | - | - | 0.516 | 0.764 | 0.409 | 0.574 | 0.724 | 0.834 | 0.580 | 0.732 | 0.728 | 0.799 | 0.830 | 0.822 | 0.870 | 0.896 | logiqa | 0.208 | 0.356 | 0.337 | 0.447 | 0.477 | 0.145 | 0.342 | 0.479 | - | 0.509 | 0.399 | 0.416 | 0.433 | - | 0.387 | 0.328 | 0.265 | - | - | - | 0.264 | - | - | 0.468 | 0.447 | 0.281 | 0.267 | 0.267 | 0.445 | 0.466 | 0.316 | 0.342 | 0.427 | 0.436 | 0.493 | 0.554 | 0.465 | lsatar | 0.213 | 0.213 | 0.282 | 0.208 | 0.260 | 0.217 | 0.213 | 0.365 | - | 0.317 | 0.073 | 0.217 | 0.308 | - | 0.673 | 0.295 | 0.239 | - | - | - | 0.186 | - | - | 0.269 | 0.639 | 0.256 | 0.247 | 0.234 | 0.369 | 0.269 | 0.230 | 0.226 | 0.260 | 0.300 | 0.321 | 0.400 | 0.791 | lsatlr | 0.203 | 0.486 | 0.537 | 0.635 | 0.654 | 0.115 | 0.374 | 0.596 | - | 0.686 | 0.505 | 0.515 | 0.592 | - | 0.662 | 0.441 | 0.327 | - | - | - | 0.366 | - | - | 0.709 | 0.686 | 0.415 | 0.386 | 0.401 | 0.621 | 0.660 | 0.452 | 0.449 | 0.598 | 0.603 | 0.729 | 0.811 | 0.686 | lsatrc | 0.312 | 0.594 | 0.646 | 0.750 | 0.754 | 0.208 | 0.475 | 0.702 | - | 0.717 | 0.635 | 0.643 | 0.706 | - | 0.669 | 0.624 | 0.486 | - | - | - | 0.520 | - | - | 0.814 | 0.806 | 0.531 | 0.524 | 0.557 | 0.762 | 0.758 | 0.553 | 0.617 | 0.661 | 0.687 | 0.810 | 0.836 | 0.706 | saten | 0.470 | 0.791 | 0.810 | 0.834 | 0.868 | 0.305 | 0.728 | 0.854 | - | 0.893 | 0.815 | 0.820 | 0.844 | - | 0.781 | 0.781 | 0.689 | - | - | - | 0.679 | - | - | 0.893 | 0.873 | 0.713 | 0.708 | 0.723 | 0.830 | 0.854 | 0.733 | 0.776 | 0.810 | 0.844 | 0.888 | 0.922 | 0.839 | satmath | 0.559 | 0.790 | 0.822 | 0.886 | 0.768 | 0.468 | 0.945 | 0.981 | - | 0.936 | 0.863 | 0.868 | 0.968 | - | 0.990 | 0.618 | 0.845 | - | - | - | 0.400 | 
- | - | 0.804 | 0.813 | 0.713 | 0.754 | 0.890 | 0.977 | 0.613 | 0.900 | 0.922 | 0.963 | 0.963 | 0.990 | 0.981 | 0.995 | AGIEVAL | 0.294 | 0.503 | 0.523 | 0.598 | 0.602 | 0.226 | 0.488 | 0.639 | - | 0.663 | 0.525 | 0.546 | 0.611 | - | 0.652 | 0.480 | 0.433 | - | - | - | 0.359 | - | - | 0.615 | 0.665 | 0.429 | 0.438 | 0.475 | 0.638 | 0.583 | 0.501 | 0.520 | 0.599 | 0.616 | 0.681 | 0.734 | 0.702 | AGIEVALC_biology | - | 0.365 | - | - | - | 0.104 | 0.334 | 0.595 | - | 0.665 | 0.756 | 0.778 | 0.869 | - | 0.856 | - | - | - | - | - | - | - | - | 0.721 | 0.739 | - | - | - | - | - | 0.660 | 0.700 | 0.804 | 0.813 | 0.834 | 0.582 | 0.826 | AGIEVALC_chemistry | - | 0.269 | - | - | - | 0.078 | 0.289 | 0.446 | - | 0.480 | 0.642 | 0.691 | 0.715 | - | 0.710 | - | - | - | - | - | - | - | - | 0.509 | 0.509 | - | - | - | - | - | 0.441 | 0.470 | 0.583 | 0.627 | 0.696 | 0.789 | 0.691 | AGIEVALC_chinese | - | 0.247 | - | - | - | 0.048 | 0.231 | 0.373 | - | 0.439 | 0.642 | 0.650 | 0.723 | - | 0.654 | - | - | - | - | - | - | - | - | 0.569 | 0.577 | - | - | - | - | - | 0.508 | 0.504 | 0.585 | 0.593 | 0.760 | 0.735 | 0.573 | AGIEVALC_english | - | 0.774 | - | - | - | 0.444 | 0.728 | 0.862 | - | 0.866 | 0.823 | 0.833 | 0.905 | - | 0.843 | - | - | - | - | - | - | - | - | 0.892 | 0.892 | - | - | - | - | - | 0.794 | 0.839 | 0.856 | 0.849 | 0.915 | 0.924 | 0.849 | AGIEVALC_geography | - | 0.407 | - | - | - | 0.246 | 0.396 | 0.608 | - | 0.678 | 0.728 | 0.728 | 0.814 | - | 0.693 | - | - | - | - | - | - | - | - | 0.718 | 0.718 | - | - | - | - | - | 0.643 | 0.633 | 0.753 | 0.778 | 0.804 | 0.839 | 0.693 | AGIEVALC_history | - | 0.374 | - | - | - | 0.225 | 0.421 | 0.642 | - | 0.689 | 0.829 | 0.834 | 0.872 | - | 0.838 | - | - | - | - | - | - | - | - | 0.736 | 0.736 | - | - | - | - | - | 0.740 | 0.744 | 0.774 | 0.800 | 0.842 | 0.923 | 0.770 | AGIEVALC_jecqaca | - | 0.221 | - | - | - | 0.142 | 0.258 | 0.292 | - | 0.348 | 0.414 | 0.440 | 0.660 | - | 0.454 | - | - | - | - | - | - | - | - | 0.416 | 0.410 | 
- | - | - | - | - | 0.425 | 0.424 | 0.482 | 0.487 | 0.564 | 0.622 | 0.409 | AGIEVALC_jecqakd | - | 0.223 | - | - | - | 0.118 | 0.229 | 0.356 | - | 0.400 | 0.549 | 0.559 | 0.759 | - | 0.561 | - | - | - | - | - | - | - | - | 0.465 | 0.461 | - | - | - | - | - | 0.498 | 0.526 | 0.592 | 0.605 | 0.732 | 0.747 | 0.521 | AGIEVALC_logiqa | - | 0.310 | - | - | - | 0.193 | 0.328 | 0.488 | - | 0.523 | 0.479 | 0.490 | 0.556 | - | 0.454 | - | - | - | - | - | - | - | - | 0.525 | 0.525 | - | - | - | - | - | 0.399 | 0.405 | 0.497 | 0.500 | 0.565 | 0.588 | 0.499 | AGIEVALC_mathcloze | - | 0.508 | - | - | - | - | 0.567 | 0.779 | - | 0.855 | 0.491 | 0.542 | 0.508 | - | 0.957 | - | - | - | - | - | - | - | - | 0.754 | 0.915 | - | - | - | - | - | 0.508 | 0.440 | 0.694 | 0.686 | 0.737 | 0.805 | 0.949 | AGIEVALC_mathqa | - | 0.569 | - | - | - | 0.322 | 0.616 | 0.779 | - | _0.744_ | 0.621 | 0.648 | 0.845 | - | _0.892_ | - | - | - | - | - | - | - | - | _0.664_ | _0.844_ | - | - | - | - | - | 0.595 | 0.683 | 0.779 | 0.755 | 0.808 | 0.834 | _0.932_ | AGIEVALC_physics | - | 0.327 | - | - | - | 0.091 | 0.206 | 0.304 | - | 0.471 | 0.396 | 0.425 | 0.563 | - | 0.609 | - | - | - | - | - | - | - | - | 0.431 | 0.477 | - | - | - | - | - | 0.390 | 0.413 | 0.431 | 0.500 | 0.683 | 0.770 | 0.465 | AGIEVALC | - | 0.361 | - | - | - | 0.187 | 0.368 | 0.514 | - | _0.554_ | 0.589 | 0.607 | 0.724 | - | _0.646_ | - | - | - | - | - | - | - | - | _0.583_ | _0.603_ | - | - | - | - | - | 0.529 | 0.548 | 0.627 | 0.636 | 0.716 | 0.734 | _0.626_ | BBH boolean_expressions | 0.544 | 0.860 | 0.876 | 0.768 | 0.460 | 0.632 | 0.880 | 0.880 | - | 0.732 | 0.848 | 0.868 | 0.800 | - | 0.760 | 0.844 | 0.480 | - | - | - | 0.764 | - | - | 0.872 | 0.860 | 0.852 | 0.832 | 0.776 | 0.936 | 0.932 | 0.756 | 0.796 | 0.864 | 0.880 | 0.888 | 0.808 | 0.700 | causal_judgement | 0.550 | 0.577 | 0.582 | 0.598 | 0.604 | 0.550 | 0.582 | 0.652 | - | 0.620 | 0.550 | 0.550 | 0.641 | - | 1.000 | 0.540 | 0.518 | - | - | - | 0.588 | - | - | 0.631 | 
0.582 | 0.588 | 0.593 | 0.577 | 0.647 | 0.614 | 0.497 | 0.529 | 0.508 | 0.513 | 0.647 | 0.700 | 0.625 | date_understanding | 0.324 | 0.668 | 0.748 | 0.748 | 0.788 | 0.408 | 0.868 | 0.920 | - | 0.760 | 0.580 | 0.572 | 0.832 | - | 0.888 | 0.716 | 0.664 | - | - | - | 0.548 | - | - | 0.728 | 0.920 | 0.696 | 0.576 | 0.648 | 0.932 | 0.876 | 0.616 | 0.648 | 0.764 | 0.740 | 0.856 | 0.872 | 0.936 | disambiguation_qa | 0.400 | 0.712 | 0.668 | 0.660 | 0.720 | 0.284 | 0.432 | 0.448 | - | 0.612 | 0.584 | 0.636 | 0.716 | - | 0.616 | 0.516 | 0.472 | - | - | - | 0.600 | - | - | 0.388 | 0.516 | 0.720 | 0.752 | 0.608 | 0.768 | 0.828 | 0.544 | 0.556 | 0.656 | 0.636 | 0.764 | 0.780 | 0.720 | dyck_languages | 0.424 | 0.704 | 0.712 | 0.728 | 0.600 | 0.344 | 0.636 | 0.824 | - | 0.892 | 0.516 | 0.544 | 0.592 | - | 0.756 | 0.796 | 0.680 | - | - | - | 0.744 | - | - | 0.792 | 0.684 | 0.580 | 0.468 | 0.656 | 0.776 | 0.736 | 0.596 | 0.628 | 0.868 | 0.836 | 0.648 | 0.820 | 0.540 | formal_fallacies | 0.624 | 0.740 | 0.660 | 0.832 | 0.760 | 0.612 | 0.876 | 0.832 | - | 0.820 | 0.568 | 0.660 | 0.984 | - | 0.984 | 0.984 | 0.816 | - | - | - | 0.852 | - | - | 0.964 | 0.692 | 0.808 | 0.808 | 0.592 | 0.804 | 0.796 | 0.928 | 0.852 | 0.628 | 0.628 | 0.784 | 0.812 | 0.980 | geometric_shapes | 0.056 | 0.544 | 0.456 | 0.436 | 0.420 | 0.128 | 0.376 | 0.456 | - | 0.544 | 0.392 | 0.400 | 0.812 | - | 0.804 | 0.440 | 0.416 | - | - | - | 0.288 | - | - | 0.280 | 0.716 | 0.416 | 0.292 | 0.316 | 0.648 | 0.676 | 0.204 | 0.212 | 0.544 | 0.604 | 0.584 | 0.640 | 0.812 | hyperbaton | 0.512 | 0.572 | 0.680 | 0.884 | 0.836 | 0.108 | 0.940 | 0.976 | - | 0.932 | 0.740 | 0.824 | 0.884 | - | 0.940 | 0.880 | 0.624 | - | - | - | 0.656 | - | - | 0.884 | 0.892 | 0.936 | 0.936 | 0.860 | 0.996 | 0.988 | 0.636 | 0.676 | 0.832 | 0.792 | 0.868 | 0.956 | 0.968 | logical_deduction_five_objects | 0.176 | 0.700 | 0.532 | 0.568 | 0.608 | 0.284 | 0.604 | 0.840 | - | 0.784 | 0.528 | 0.516 | 0.784 | - | 0.980 | 0.568 | 0.484 | - | - | - | 0.352 
| - | - | 0.600 | 0.968 | 0.632 | 0.532 | 0.536 | 0.940 | 0.928 | 0.468 | 0.528 | 0.752 | 0.728 | 0.876 | 0.924 | 0.960 | logical_deduction_seven_objects | 0.152 | 0.556 | 0.492 | 0.560 | 0.552 | 0.212 | 0.640 | 0.740 | - | 0.776 | 0.444 | 0.500 | 0.756 | - | 0.972 | 0.488 | 0.408 | - | - | - | 0.296 | - | - | 0.616 | 0.944 | 0.568 | 0.500 | 0.472 | 0.920 | 0.880 | 0.420 | 0.436 | 0.668 | 0.656 | 0.792 | 0.864 | 0.928 | logical_deduction_three_objects | 0.376 | 0.868 | 0.820 | 0.844 | 0.892 | 0.428 | 0.860 | 0.992 | - | 0.912 | 0.836 | 0.840 | 0.960 | - | 0.996 | 0.804 | 0.652 | - | - | - | 0.608 | - | - | 0.840 | 0.988 | 0.844 | 0.804 | 0.796 | 0.992 | 0.988 | 0.696 | 0.720 | 0.940 | 0.956 | 0.980 | 0.992 | 0.988 | movie_recommendation | 0.424 | 0.652 | 0.676 | 0.552 | 0.508 | 0.372 | 0.536 | 0.664 | - | 0.632 | 0.604 | 0.648 | 0.740 | - | 0.600 | 0.536 | 0.456 | - | - | - | 0.508 | - | - | 0.684 | 0.672 | 0.520 | 0.508 | 0.528 | 0.992 | 0.632 | 0.604 | 0.568 | 0.556 | 0.536 | 0.672 | 0.648 | 0.496 | multistep_arithmetic_two | 0.136 | 0.944 | 0.968 | 0.488 | 0.472 | - | 0.868 | 0.888 | - | 0.972 | 0.580 | 0.524 | 0.508 | - | 0.992 | 0.700 | 0.532 | - | - | - | 0.108 | - | - | 0.832 | 0.956 | 0.836 | 0.420 | 0.408 | 0.984 | 0.976 | 0.852 | 0.876 | 0.896 | 0.948 | 0.964 | 0.976 | 0.992 | navigate | 0.540 | 0.580 | 0.588 | 0.596 | 0.648 | 0.592 | 0.648 | 0.724 | - | 0.744 | 0.420 | 0.420 | 0.580 | - | 1.000 | 0.580 | 0.580 | - | - | - | 0.600 | - | - | 0.680 | 0.464 | 0.588 | 0.584 | 0.612 | 0.640 | 0.956 | 0.576 | 0.572 | 0.596 | 0.596 | 0.624 | 0.684 | 0.992 | object_counting | 0.464 | 0.764 | 0.820 | 0.848 | 0.856 | - | 0.908 | 0.908 | - | 0.976 | 0.616 | 0.660 | 0.892 | - | 0.868 | 0.864 | 0.808 | - | - | - | 0.608 | - | - | 0.832 | 0.984 | 0.836 | 0.344 | 0.596 | 0.996 | 0.992 | 0.740 | 0.764 | 0.848 | 0.804 | 0.892 | 0.896 | 0.960 | penguins_in_a_table | 0.369 | 0.842 | 0.746 | 0.890 | 0.842 | 0.267 | 0.876 | 0.986 | - | 0.739 | 0.917 | 0.917 | 0.958 | - | 
0.993 | 0.856 | 0.801 | - | - | - | 0.623 | - | - | 0.616 | 0.993 | 0.883 | 0.712 | 0.801 | 1.000 | 0.876 | 0.821 | 0.849 | 0.945 | 0.924 | 0.958 | 0.986 | 1.000 | reasoning_about_colored_objects | 0.276 | 0.860 | 0.800 | 0.744 | 0.900 | 0.180 | 0.752 | 0.888 | - | 0.844 | 0.876 | 0.796 | 0.940 | - | 0.976 | 0.824 | 0.568 | - | - | - | 0.608 | - | - | 0.804 | 0.992 | 0.808 | 0.656 | 0.776 | 0.968 | 0.956 | 0.700 | 0.764 | 0.904 | 0.868 | 0.944 | 0.984 | 0.996 | ruin_names | 0.176 | 0.484 | 0.636 | 0.716 | 0.760 | 0.172 | 0.468 | 0.696 | - | 0.816 | 0.696 | 0.652 | 0.716 | - | 0.760 | 0.744 | 0.532 | - | - | - | 0.400 | - | - | 0.764 | 0.748 | 0.612 | 0.600 | 0.588 | 0.816 | 0.784 | 0.396 | 0.324 | 0.440 | 0.544 | 0.692 | 0.760 | 0.720 | salient_translation_error_detection | 0.212 | 0.448 | 0.508 | 0.548 | 0.568 | 0.172 | 0.560 | 0.640 | - | 0.564 | 0.476 | 0.488 | 0.580 | - | 0.652 | 0.512 | 0.464 | - | - | - | 0.444 | - | - | 0.600 | 0.656 | 0.520 | 0.532 | 0.540 | 0.636 | 0.372 | 0.452 | 0.432 | 0.560 | 0.572 | 0.612 | 0.700 | 0.716 | snarks | 0.483 | 0.685 | 0.707 | 0.691 | 0.719 | 0.033 | 0.724 | 0.803 | - | 0.634 | 0.702 | 0.707 | 0.769 | - | 0.747 | 0.651 | 0.657 | - | - | - | 0.606 | - | - | 0.634 | 0.786 | 0.747 | 0.786 | 0.814 | 0.882 | 0.876 | 0.662 | 0.623 | 0.747 | 0.780 | 0.831 | 0.865 | 0.842 | sports_understanding | 0.584 | 0.672 | 0.692 | 0.788 | 0.816 | 0.488 | 0.696 | 0.804 | - | 0.844 | 0.472 | 0.468 | 0.668 | - | 0.544 | 0.720 | 0.644 | - | - | - | 0.716 | - | - | 0.796 | 0.708 | 0.596 | 0.600 | 0.544 | 0.740 | 0.736 | 0.620 | 0.616 | 0.676 | 0.684 | 0.680 | 0.748 | 0.680 | temporal_sequences | 0.164 | 0.528 | 0.540 | 0.708 | 0.748 | 0.436 | 0.988 | 0.996 | - | 0.940 | 0.756 | 0.840 | 0.956 | - | 1.000 | 0.856 | 0.712 | - | - | - | 0.404 | - | - | 0.844 | 0.992 | 0.784 | 0.508 | 0.768 | 1.000 | 0.996 | 0.324 | 0.388 | 0.800 | 0.820 | 0.988 | 0.992 | 0.992 | tracking_shuffled_objects_five_objects | 0.208 | 0.560 | 0.616 | 0.600 | 0.692 | 0.508 | 
0.924 | 1.000 | - | 0.648 | 0.544 | 0.536 | 0.864 | - | 0.988 | 0.656 | 0.500 | - | - | - | 0.344 | - | - | 0.716 | 0.992 | 0.940 | 0.712 | 0.852 | 1.000 | 0.988 | 0.420 | 0.452 | 0.840 | 0.908 | 0.924 | 0.972 | 0.988 | tracking_shuffled_objects_seven_objects | 0.140 | 0.324 | 0.524 | 0.572 | 0.640 | 0.228 | 0.884 | 0.988 | - | 0.660 | 0.512 | 0.436 | 0.764 | - | 0.980 | 0.592 | 0.420 | - | - | - | 0.296 | - | - | 0.744 | 0.984 | 0.896 | 0.612 | 0.848 | 0.984 | 0.880 | 0.292 | 0.312 | 0.800 | 0.868 | 0.848 | 0.980 | 0.948 | tracking_shuffled_objects_three_objects | 0.288 | 0.696 | 0.732 | 0.732 | 0.848 | 0.808 | 0.972 | 0.992 | - | 0.548 | 0.620 | 0.696 | 0.956 | - | 0.996 | 0.728 | 0.608 | - | - | - | 0.436 | - | - | 0.880 | 0.996 | 0.960 | 0.788 | 0.884 | 1.000 | 0.996 | 0.604 | 0.664 | 0.832 | 0.872 | 0.856 | 0.996 | 0.992 | web_of_lies | 0.476 | 0.576 | 0.520 | 0.520 | 0.488 | 0.488 | 0.516 | 0.540 | - | 0.532 | 0.476 | 0.488 | 0.512 | - | 1.000 | 0.512 | 0.544 | - | - | - | 0.488 | - | - | 0.560 | 0.504 | 0.488 | 0.492 | 0.512 | 0.512 | 0.964 | 0.512 | 0.512 | 0.528 | 0.532 | 0.544 | 0.624 | 1.000 | word_sorting | 0.056 | 0.204 | 0.292 | 0.404 | 0.540 | 0.080 | 0.236 | 0.424 | - | 0.536 | 0.404 | 0.392 | 0.144 | - | 0.316 | 0.512 | 0.360 | - | - | - | 0.280 | - | - | 0.632 | 0.592 | 0.204 | 0.152 | 0.140 | 0.360 | 0.392 | 0.156 | 0.156 | 0.212 | 0.220 | 0.292 | 0.400 | 0.272 | BBH | 0.334 | 0.638 | 0.650 | 0.664 | 0.674 | 0.355 | 0.711 | 0.794 | - | 0.743 | 0.596 | 0.608 | 0.749 | - | 0.853 | 0.681 | 0.566 | - | - | - | 0.506 | - | - | 0.714 | 0.806 | 0.696 | 0.592 | 0.627 | 0.846 | 0.838 | 0.554 | 0.567 | 0.709 | 0.718 | 0.775 | 0.827 | 0.841 | MUSR murder_mystery | 0.552 | 0.640 | 0.592 | 0.668 | 0.576 | 0.528 | 0.592 | 0.608 | - | 0.552 | 0.616 | 0.584 | 0.620 | - | 0.640 | 0.584 | 0.576 | - | - | - | 0.516 | - | - | 0.712 | 0.680 | 0.636 | 0.620 | 0.588 | 0.708 | 0.680 | 0.544 | 0.612 | 0.604 | 0.584 | 0.652 | 0.640 | 0.636 | object_placements | 0.429 | 
0.535 | 0.578 | 0.519 | 0.542 | 0.296 | 0.480 | 0.542 | - | 0.448 | 0.492 | 0.531 | 0.460 | - | 0.580 | 0.546 | 0.523 | - | - | - | 0.453 | - | - | 0.516 | 0.532 | 0.503 | 0.457 | 0.453 | 0.464 | 0.404 | 0.472 | 0.476 | 0.531 | 0.554 | 0.519 | 0.265 | 0.532 | team_allocation | 0.436 | 0.512 | 0.496 | 0.460 | 0.476 | 0.328 | 0.400 | 0.560 | - | 0.572 | 0.572 | 0.588 | 0.448 | - | 0.640 | 0.460 | 0.396 | - | - | - | 0.356 | - | - | 0.612 | 0.576 | 0.536 | 0.480 | 0.508 | 0.628 | 0.556 | 0.444 | 0.384 | 0.512 | 0.476 | 0.556 | 0.592 | 0.708 | MUSR | 0.472 | 0.562 | 0.555 | 0.548 | 0.531 | 0.383 | 0.490 | 0.570 | - | 0.524 | 0.559 | 0.567 | 0.509 | - | 0.620 | 0.530 | 0.498 | - | - | - | 0.441 | - | - | 0.613 | 0.596 | 0.558 | 0.518 | 0.515 | 0.599 | 0.546 | 0.486 | 0.490 | 0.548 | 0.538 | 0.575 | 0.497 | 0.625 | GPQA_diamond | - | - | - | - | - | - | - | - | - | 0.388 | - | - | - | - | 0.520 | - | - | 0.479 | 0.469 | - | - | - | - | 0.358 | 0.540 | - | - | - | - | 0.545 | - | - | - | - | - | - | 0.570 | GPQA | - | - | - | - | - | - | - | - | - | 0.388 | - | - | - | - | 0.520 | - | - | 0.479 | 0.469 | - | - | - | - | 0.358 | 0.540 | - | - | - | - | 0.545 | - | - | - | - | - | - | 0.570 | MMLUPRO biology | 0.324 | 0.708 | 0.702 | 0.747 | 0.772 | 0.361 | 0.640 | 0.794 | - | _0.752_ | 0.676 | 0.695 | 0.750 | - | _0.804_ | 0.686 | 0.623 | - | - | - | 0.582 | - | - | _0.776_ | _0.584_ | 0.702 | 0.662 | 0.682 | 0.835 | _0.740_| 0.610 | 0.638 | 0.709 | 0.729 | 0.797 | 0.764 | _0.852_ | business | 0.190 | 0.624 | 0.525 | 0.583 | 0.626 | 0.173 | 0.518 | 0.659 | - | _0.616_ | 0.522 | 0.562 | 0.628 | - | _0.760_ | 0.558 | 0.458 | - | - | - | 0.335 | - | - | _0.612_ | _0.756_ | 0.571 | 0.509 | 0.588 | 0.785 | _0.756_| 0.504 | 0.558 | 0.647 | 0.661 | 0.718 | 0.755 | _0.768_ | chemistry | 0.166 | 0.639 | 0.500 | 0.503 | 0.546 | 0.115 | 0.380 | 0.574 | - | _0.536_ | 0.465 | 0.467 | 0.589 | - | _0.844_ | 0.467 | 0.390 | - | - | - | - | - | - | _0.488_ | _0.728_ | 0.463 | 0.296 | 0.513 
| 0.765 | _0.700_| 0.387 | 0.451 | 0.559 | 0.580 | 0.684 | 0.701 | _0.824_ | computer_science | 0.197 | 0.602 | 0.590 | 0.482 | 0.560 | 0.170 | 0.421 | 0.643 | - | _0.560_ | 0.497 | 0.502 | 0.585 | - | _0.756_ | 0.485 | 0.414 | - | - | - | - | - | - | _0.556_ | _0.680_ | 0.475 | 0.448 | 0.456 | 0.734 | _0.708_| 0.434 | 0.402 | 0.590 | 0.604 | 0.663 | 0.734 | _0.736_ | economics | 0.236 | 0.663 | 0.662 | 0.668 | 0.678 | 0.206 | 0.534 | 0.699 | - | _0.660_ | 0.617 | 0.610 | 0.662 | - | _0.744_ | 0.568 | 0.492 | - | - | - | - | - | - | _0.648_ | _0.612_ | 0.609 | 0.587 | 0.629 | 0.792 | _0.716_| 0.521 | 0.550 | 0.674 | 0.687 | 0.721 | 0.787 | _0.756_ | engineering | 0.157 | 0.437 | 0.424 | 0.406 | 0.414 | 0.138 | 0.253 | 0.373 | - | _0.420_ | 0.303 | 0.298 | 0.454 | - | _0.612_ | 0.378 | 0.302 | - | - | - | - | - | - | _0.488_ | _0.544_ | 0.297 | 0.283 | 0.361 | 0.589 | _0.604_| 0.296 | 0.309 | 0.418 | 0.420 | 0.512 | 0.573 | _0.668_ | health | 0.158 | 0.503 | 0.517 | 0.545 | 0.621 | 0.156 | 0.399 | 0.596 | - | _0.548_ | 0.492 | 0.496 | 0.544 | - | _0.660_ | 0.558 | 0.437 | - | - | - | - | - | - | _0.544_ | _0.616_ | 0.515 | 0.466 | 0.506 | 0.700 | _0.632_| 0.388 | 0.416 | 0.556 | 0.569 | 0.643 | 0.690 | _0.644_ | history | 0.149 | 0.406 | 0.467 | 0.493 | 0.490 | 0.152 | 0.354 | 0.540 | - | _0.588_ | 0.425 | 0.438 | 0.459 | - | _0.532_ | 0.451 | 0.380 | - | - | - | - | - | - | _0.592_ | _0.568_ | 0.380 | 0.380 | 0.409 | 0.627 | _0.560_| 0.333 | 0.367 | 0.459 | 0.464 | 0.566 | 0.624 | _0.568_ | law | 0.123 | 0.268 | 0.295 | 0.343 | 0.405 | 0.158 | 0.263 | 0.372 | - | _0.400_ | 0.299 | 0.284 | 0.307 | - | _0.404_ | 0.303 | 0.243 | - | - | - | - | - | - | _0.384_ | _0.328_ | 0.276 | 0 | 0.294 | 0.500 | _0.408_| 0.220 | 0.237 | 0.300 | 0.292 | 0.366 | 0.455 | _0.412_ | math | 0.203 | 0.694 | 0.564 | 0.538 | 0.570 | 0.180 | 0.586 | 0.739 | - | _0.664_ | 0.490 | 0.523 | 0.617 | - | _0.900_ | 0.555 | 0.511 | - | - | - | - | - | - | _0.536_ | _0.812_ | 0.522 | 0.458 | 0.578 | 
0.816 | _0.672_| 0.581 | 0.603 | 0.712 | 0.723 | 0.775 | 0.814 | _0.884_ | other | 0.164 | 0.450 | 0.496 | 0.551 | 0.574 | 0.173 | 0.428 | 0.580 | - | _0.552_ | 0.464 | 0.458 | 0.536 | - | _0.568_ | 0.487 | 0.389 | - | - | - | - | - | - | _0.484_ | _0.592_ | 0.500 | 0.433 | 0.493 | 0.706 | _0.648_| 0.410 | 0.405 | 0.529 | 0.551 | 0.611 | 0.664 | _0.604_ | philosophy | 0.148 | 0.442 | 0.462 | 0.448 | 0.488 | 0.176 | 0.356 | 0.555 | - | _0.560_ | 0.408 | 0.412 | 0.424 | - | _0.576_ | 0.382 | 0.326 | - | - | - | - | - | - | _0.476_ | _0.580_ | 0.406 | 0.390 | 0.394 | 0.633 | _0.604_| 0.376 | 0.364 | 0.480 | 0.464 | 0.557 | 0.599 | _0.544_ | physics | 0.159 | 0.583 | 0.493 | 0.501 | 0.559 | 0.125 | 0.397 | 0.595 | - | _0.512_ | 0.441 | 0.461 | 0.587 | - | _0.808_ | 0.488 | 0.397 | - | - | - | - | - | - | _0.488_ | _0.724_ | 0.455 | 0.425 | 0.500 | 0.765 | _0.748_| 0.419 | 0.456 | 0.589 | 0.602 | 0.702 | 0.543 | _0.872_ | psychology | 0.258 | 0.621 | 0.645 | 0.647 | 0.692 | 0.273 | 0.567 | 0.685 | - | _0.632_ | 0.586 | 0.602 | 0.665 | - | _0.664_ | 0.637 | 0.518 | - | - | - | - | - | - | _0.680_ | _0.544_ | 0.621 | 0.572 | 0.583 | 0.759 | _0.652_| 0.526 | 0.563 | 0.636 | 0.644 | 0.721 | 0.749 | _0.684_ | MMLUPRO | 0.186 | 0.552 | 0.517 | 0.528 | 0.568 | 0.177 | 0.436 | 0.597 | - | _0.571_ | 0.471 | 0.480 | 0.559 | - | _0.688_ | 0.499 | 0.419 | - | - | - | 0.453 | - | - | _0.553_ | _0.619_ | 0.482 | 0.408 | 0.502 | 0.719 | _0.653_| 0.430 | 0.457 | 0.564 | 0.575 | 0.649 | 0.671 | _0.701_ | CATEGORIES REASONING | 0.367 | 0.713 | 0.738 | 0.788 | 0.814 | 0.344 | 0.598 | 0.787 | 0.767 | 0.809 | 0.804 | 0.811 | 0.815 | - | 0.746 | 0.713 | 0.606 | - | - | - | 0.628 | 0.866 | 0.877 | 0.863 | 0.848 | 0.724 | 0.691 | 0.619 | 0.809 | 0.822 | 0.689 | 0.719 | 0.805 | 0.809 | 0.850 | 0.874 | 0.785 | UNDERSTANDING | 0.366 | 0.644 | 0.670 | _0.707_ | 0.742 | 0.327 | 0.552 | 0.695 | 0.741 | 0.746 | 0.661 | 0.670 | 0.691 | - | 0.679 | 0.631 | 0.579 | - | - | - | _0.563_ | 0.775 | 0.772 | 
0.764 | 0.756 | 0.614 | 0.622 | 0.617 | 0.728 | 0.766 | 0.605 | 0.613 | 0.692 | 0.696 | 0.761 | 0.793 | 0.695 | LANGUAGE | 0.524 | 0.688 | 0.692 | 0.735 | 0.755 | 0.504 | 0.635 | 0.724 | 0.721 | 0.742 | 0.786 | 0.783 | 0.662 | - | 0.617 | 0.747 | 0.705 | - | - | - | 0.766 | 0.786 | 0.789 | 0.798 | 0.792 | 0.677 | 0.613 | 0.653 | 0.750 | 0.751 | 0.685 | 0.682 | 0.722 | 0.724 | 0.769 | 0.781 | 0.670 | KNOWLEDGE | 0.354 | 0.442 | 0.496 | 0.690 | 0.733 | 0.353 | 0.478 | 0.626 | 0.560 | 0.630 | 0.553 | 0.543 | 0.615 | - | 0.597 | 0.547 | 0.536 | 0.680 | 0.580 | 0.540 | 0.582 | 0.680 | - | 0.676 | 0.653 | 0.517 | 0.519 | 0.534 | 0.676 | 0.686 | 0.469 | 0.426 | 0.595 | 0.597 | 0.693 | 0.725 | 0.622 | COT | 0.220 | 0.552 | 0.530 | 0.550 | 0.582 | 0.201 | 0.470 | 0.616 | - | _0.630_ | 0.485 | 0.500 | 0.586 | - | _0.706_ | 0.530 | 0.446 | 0.479 | 0.469 | - | 0.498 | - | - | _0.600_ | _0.651_ | 0.506 | 0.440 | 0.513 | 0.725 | _0.669_| 0.443 | 0.462 | 0.570 | 0.581 | 0.653 | 0.684 | _0.708_ | MATHCOT | 0.369 | 0.730 | 0.752 | 0.735 | 0.740 | 0.417 | 0.793 | 0.879 | _0.882_ | _0.790_ | 0.682 | 0.679 | 0.813 | - | _0.945_ | 0.728 | 0.647 | 0.840 | 0.860 | 0.860 | 0.493 | 0.830 | 0.780 | _0.743_ | _0.908_ | 0.767 | 0.638 | 0.745 | 0.919 | _0.899_| 0.667 | 0.694 | 0.823 | 0.821 | 0.869 | 0.903 | _0.946_ | CODE | 0.176 | 0.460 | 0.534 | 0.495 | 0.568 | 0.241 | 0.485 | 0.582 | 0.829 | 0.618 | 0.456 | 0.475 | 0.500 | - | 0.587 | 0.463 | 0.366 | - | - | - | 0.321 | 0.841 | 0.823 | 0.409 | 0.619 | 0.427 | 0.376 | 0.418 | 0.568 | 0.567 | 0.437 | 0.445 | 0.510 | 0.528 | 0.578 | 0.612 | 0.597 | DISCIPLINES NLP | 0.408 | 0.647 | 0.670 | _0.755_ | 0.786 | 0.392 | 0.595 | 0.748 | 0.751 | 0.761 | 0.729 | 0.728 | 0.737 | - | 0.692 | 0.677 | 0.609 | - | - | - | _0.642_ | 0.834 | 0.841 | 0.808 | 0.792 | 0.647 | 0.637 | 0.614 | 0.755 | 0.776 | 0.632 | 0.630 | 0.731 | 0.734 | 0.791 | 0.818 | 0.733 | MATH | 0.294 | 0.669 | 0.659 | 0.637 | 0.653 | 0.298 | 0.659 | 0.775 | _0.882_ | _0.727_ | 0.590 | 
0.597 | 0.720 | - | _0.844_ | 0.629 | 0.556 | 0.840 | 0.860 | 0.860 | 0.451 | 0.830 | 0.780 | _0.678_ | _0.789_ | 0.646 | 0.543 | 0.625 | 0.817 | _0.795_| 0.576 | 0.599 | 0.741 | 0.742 | 0.799 | 0.843 | _0.831_ | SCIENCE | 0.350 | 0.706 | 0.713 | 0.739 | 0.769 | 0.304 | 0.580 | 0.737 | - | _0.797_ | 0.686 | 0.698 | 0.756 | - | _0.823_ | 0.676 | 0.605 | 0.479 | 0.469 | - | 0.673 | - | - | _0.806_ | _0.821_ | 0.696 | 0.660 | 0.681 | 0.845 | _0.858_| 0.629 | 0.657 | 0.738 | 0.748 | 0.815 | 0.806 | _0.830_ | ENGINEERING | 0.166 | 0.464 | 0.453 | 0.426 | 0.438 | 0.158 | 0.271 | 0.397 | - | _0.496_ | 0.334 | 0.333 | 0.480 | - | _0.630_ | 0.397 | 0.323 | - | - | - | 0.393 | - | - | _0.567_ | _0.587_ | 0.323 | 0.308 | 0.388 | 0.595 | _0.630_| 0.315 | 0.325 | 0.443 | 0.444 | 0.530 | 0.590 | _0.655_ | MEDICINE | 0.216 | 0.524 | 0.540 | 0.595 | 0.648 | 0.182 | 0.411 | 0.598 | - | _0.642_ | 0.521 | 0.530 | 0.570 | - | _0.593_ | 0.577 | 0.496 | - | - | - | 0.447 | - | - | _0.681_ | _0.684_ | 0.537 | 0.501 | 0.490 | 0.672 | _0.684_| 0.459 | 0.478 | 0.574 | 0.580 | 0.655 | 0.702 | _0.595_ | HUMANITIES | 0.291 | 0.550 | 0.615 | _0.645_ | 0.679 | 0.272 | 0.495 | 0.641 | 0.560 | _0.705_ | 0.593 | 0.610 | 0.622 | - | _0.628_ | 0.578 | 0.529 | 0.680 | 0.580 | 0.540 | _0.536_ | 0.680 | - | _0.710_ | _0.698_ | 0.588 | 0.567 | 0.562 | 0.739 | _0.745_| 0.527 | 0.533 | 0.629 | 0.638 | 0.716 | 0.742 | _0.634_ | BUSINESS | 0.252 | 0.679 | 0.655 | 0.678 | 0.709 | 0.245 | 0.537 | 0.704 | - | _0.743_ | 0.623 | 0.637 | 0.696 | - | _0.745_ | 0.598 | 0.517 | - | - | - | 0.466 | - | - | _0.749_ | _0.762_ | 0.637 | 0.604 | 0.635 | 0.801 | _0.792_| 0.565 | 0.596 | 0.701 | 0.710 | 0.759 | 0.802 | _0.759_ | LAW | 0.200 | 0.362 | 0.427 | _0.483_ | 0.524 | 0.172 | 0.316 | 0.443 | - | _0.504_ | 0.417 | 0.429 | 0.494 | - | _0.507_ | 0.406 | 0.344 | - | - | - | _0.370_ | - | - | _0.537_ | _0.543_ | 0.392 | 0.310 | 0.390 | 0.541 | _0.582_| 0.374 | 0.383 | 0.451 | 0.456 | 0.541 | 0.604 | _0.514_ | COMPOSITE 
AVERAGE AVG | 0.342 | 0.612 | 0.641 | _0.692_ | 0.724 | 0.324 | 0.555 | 0.701 | _0.753_ | _0.729_ | 0.648 | 0.654 | 0.689 | - | _0.692_ | 0.629 | 0.561 | 0.620 | 0.595 | 0.700 | _0.578_ | 0.833 | 0.841 | _0.740_ | _0.757_ | 0.616 | 0.585 | 0.591 | 0.748 | _0.760_ | 0.578 | 0.586 | 0.686 | 0.691 | 0.754 | 0.783 | _0.716_ |

THINKING MODELS:

MODEL | Qwen3-0.6B | Qwen3-1.7B | Qwen3-4B | Qwen3-4B | Qwen3-4B-Thinking-2507 | Qwen3-8B | Qwen3-8B | Qwen3-8B | Qwen3-14B | Qwen3-14B | Qwen3-30B-A3B | Qwen3-30B-A3B | Qwen3-32B | Qwen3-32B | QwQ-32B-Preview | QwQ-32B | Ring-mini-2.0 |
---------------------------------------------|------------|------------|----------|----------|------------------------|----------|----------|----------|-----------|-----------|---------------|---------------|-----------|-----------|-----------------|---------|---------------|
params | 0.75163B | 2.03B | 4.02B | 4.02B | 4.02B | 8.19B | 8.19B | 8.19B | 14.77B | 14.77B | 30.53B | 30.53B | 32.8B | 32.8B | 32.76B | 32.76B | 16.26B |
quant | Q8_0 | Q8_0 | Q8_0 | Q8_0_H | Q6_K_H | Q4_K_H | Q6_K_H | Q6_K | IQ4_XS | Q4_K_H | IQ4_XS | Q4_K_H | IQ4_XS | Q4_K_H | IQ4_XS | Q4_K_H | Q6_K_H |
engine | llama.cpp version: 5679 | llama.cpp version: 5415 | llama.cpp version: 5242 | llama.cpp version: 5509 | llama.cpp version: 6653 | llama.cpp version: 5279 | llama.cpp version: 5223 | llama.cpp version: 5153 | llama.cpp version: 5223 | llama.cpp version: 5379 | llama.cpp version: 5279 | llama.cpp version: 5353 | llama.cpp version: 5466 | llama.cpp version: 5466 | llama.cpp version: 4273 | llama.cpp version: 6118 | llama.cpp version: 6815 |
**TEST** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** |
WG | 0.564 | 0.610 | 0.662 | 0.642 | 0.662 | 0.651 | 0.689 | 0.678 | 0.722 | 0.726 | 0.699 | 0.700 | 0.712 | 0.731 | 0.750 | 0.689 | 0.606 |
LAMBADA | 0.471 | 0.590 | 0.644 | 0.638 | 0.662 | 0.660 | 0.700
| 0.700 | 0.729 | 0.714 | 0.698 | 0.725 | 0.692 | 0.701 | 0.780 | 0.651 | 0.522 | HELLASWAG | 0.277 | 0.553 | 0.721 | 0.726 | 0.722 | 0.718 | 0.787 | 0.787 | 0.827 | 0.792 | 0.815 | 0.832 | 0.838 | 0.812 | 0.875 | 0.883 | 0.689 | BOOLQ | 0.449 | 0.531 | 0.626 | 0.626 | 0.485 | 0.641 | 0.608 | 0.611 | 0.662 | 0.632 | - | 0.502 | 0.603 | 0.574 | 0.629 | 0.658 | 0.556 | STORYCLOZE | 0.764 | 0.833 | 0.849 | 0.843 | 0.809 | 0.809 | 0.852 | 0.873 | - | 0.905 | - | 0.917 | 0.960 | 0.959 | 0.964 | 0.935 | 0.937 | CSQA | 0.377 | 0.567 | 0.705 | 0.705 | 0.704 | 0.685 | 0.740 | 0.748 | - | 0.749 | - | 0.742 | - | 0.778 | 0.796 | 0.680 | 0.724 | OBQA | 0.394 | 0.584 | 0.756 | 0.754 | 0.729 | 0.719 | 0.767 | 0.774 | - | 0.787 | - | 0.836 | - | 0.869 | 0.882 | 0.815 | 0.764 | COPA | 0.569 | 0.765 | 0.872 | 0.865 | 0.801 | 0.829 | 0.828 | 0.864 | - | 0.919 | - | 0.919 | - | 0.946 | 0.936 | 0.962 | 0.877 | PIQA | 0.393 | 0.574 | 0.710 | 0.710 | 0.685 | 0.744 | 0.769 | 0.781 | - | 0.798 | - | 0.845 | - | 0.815 | 0.829 | 0.871 | 0.755 | SIQA | 0.363 | 0.569 | 0.664 | 0.664 | 0.640 | 0.637 | 0.671 | 0.679 | - | 0.689 | - | 0.693 | - | 0.714 | 0.714 | 0.747 | 0.664 | MEDQA | 0.135 | 0.278 | 0.435 | 0.428 | 0.450 | 0.448 | 0.499 | 0.509 | - | 0.531 | - | 0.597 | - | 0.553 | 0.598 | 0.518 | 0.450 | SQA | 0.249 | 0.032 | 0.039 | 0.040 | 0.147 | 0.036 | 0.039 | 0.042 | - | 0.045 | - | 0.055 | - | 0.047 | - | 0.060 | 0.022 | JEOPARDY | 0.640 | 0.270 | 0.280 | 0.220 | 0.400 | 0.410 | 0.280 | 0.240 | 0.480 | 0.490 | 0.520 | 0.470 | 0.470 | 0.470 | 0.600 | 0.440 | 0.030 | GSM8K | _0.748_ | _0.920_ | 0.946 | _0.960_ | _0.960_ | 0.946 | 0.953 | 0.956 | - | 0.948 | - | 0.962 | - | _0.972_ | 0.962 | _0.964_ | _0.952_ | APPLE | 0.460 | 0.790 | 0.850 | 0.840 | 0.880 | 0.790 | 0.880 | 0.890 | 0.910 | 0.920 | 0.820 | 0.850 | 0.910 | 0.910 | 0.870 | 0.880 | 0.850 | HUMANEVAL | 0.445 | 0.682 | 0.817 | 0.804 | 0.786 | 0.835 | 0.865 | 0.859 | - | 0.859 | - | 0.884 | - | 0.890 | 0.414 | 0.512 | 0.853 | 
HUMANEVALP | 0.335 | 0.591 | 0.682 | 0.676 | 0.670 | 0.725 | 0.713 | 0.731 | - | 0.737 | - | 0.750 | - | 0.780 | 0.359 | 0.432 | 0.750 | HUMANEVALFIM | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | MBPP | 0.408 | 0.544 | 0.645 | 0.642 | 0.536 | 0.571 | 0.618 | 0.630 | - | 0.700 | - | 0.677 | - | 0.684 | 0.404 | 0.568 | 0.665 | MBPPP | 0.388 | 0.482 | 0.598 | 0.602 | 0.517 | 0.580 | 0.611 | 0.566 | - | 0.651 | - | - | - | 0.678 | 0.392 | 0.584 | 0.642 | HUMANEVALX_cpp | 0.231 | 0.359 | 0.463 | 0.353 | 0.646 | 0.524 | 0.615 | 0.554 | - | 0.652 | - | - | - | 0.737 | 0.378 | 0.603 | 0.750 | HUMANEVALX_java | 0.274 | 0.548 | 0.731 | 0.737 | 0.774 | 0.737 | 0.780 | 0.798 | - | 0.841 | - | - | - | 0.847 | 0.097 | 0.280 | 0.835 | HUMANEVALX_js | 0.256 | 0.518 | 0.719 | 0.695 | 0.439 | 0.762 | 0.774 | 0.774 | - | 0.786 | - | - | - | 0.817 | 0.493 | 0.493 | 0.786 | HUMANEVALX | 0.254 | 0.475 | 0.638 | 0.595 | 0.619 | 0.674 | 0.723 | 0.709 | - | 0.760 | - | - | - | 0.800 | 0.323 | 0.459 | 0.790 | CRUXEVAL_input | 0.353 | 0.406 | 0.457 | 0.453 | 0.500 | 0.445 | 0.528 | 0.510 | - | 0.537 | - | - | - | 0.450 | 0.200 | 0.498 | 0.398 | CRUXEVAL_output | 0.241 | 0.338 | 0.420 | 0.403 | 0.440 | 0.405 | 0.446 | 0.447 | - | 0.501 | - | - | - | 0.431 | 0.368 | 0.513 | 0.391 | CRUXEVAL | 0.297 | 0.372 | 0.438 | 0.428 | 0.470 | 0.425 | 0.487 | 0.478 | - | 0.519 | - | - | - | 0.440 | 0.284 | 0.506 | 0.395 | CRUXEVALFIM_input | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | CRUXEVALFIM_output | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | CRUXEVALFIM | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | TQA_mc | 0.261 | 0.406 | 0.600 | 0.598 | 0.656 | 0.592 | 0.641 | 0.635 | - | 0.676 | - | - | - | 0.742 | 0.795 | 0.701 | 0.608 | TQA_tf | 0.429 | 0.445 | 0.502 | 0.500 | 0.519 | 0.513 | 0.430 | 0.458 | - | 0.614 | - | - | - | 0.456 | 0.523 | 0.628 | 0.513 | TQA | 0.409 | 0.441 | 0.514 | 0.511 | 0.535 | 0.523 | 
0.455 | 0.479 | - | 0.621 | - | - | - | 0.490 | 0.554 | 0.637 | 0.524 | ARC_challenge | 0.275 | 0.686 | 0.854 | 0.852 | 0.850 | 0.833 | 0.882 | 0.882 | - | 0.896 | - | - | - | 0.910 | 0.917 | 0.843 | 0.824 | ARC_easy | 0.502 | 0.850 | 0.937 | 0.933 | 0.948 | 0.934 | 0.952 | 0.955 | - | 0.964 | - | - | - | 0.974 | 0.975 | 0.906 | 0.937 | ARC | 0.427 | 0.796 | 0.910 | 0.906 | 0.916 | 0.901 | 0.929 | 0.931 | - | 0.942 | - | - | - | 0.953 | 0.956 | 0.886 | 0.900 | RACE_high | 0.359 | 0.594 | 0.759 | 0.756 | 0.784 | 0.747 | 0.794 | 0.798 | - | 0.826 | - | - | - | 0.822 | 0.871 | 0.862 | 0.748 | RACE_middle | 0.397 | 0.652 | 0.808 | 0.808 | 0.830 | 0.818 | 0.842 | 0.844 | - | 0.873 | - | - | - | 0.881 | - | 0.889 | 0.802 | RACE | 0.370 | 0.611 | 0.774 | 0.771 | 0.797 | 0.768 | 0.808 | 0.811 | - | 0.839 | - | - | - | 0.839 | 0.871 | 0.870 | 0.764 | MMLU abstract_algebra | 0.100 | 0.240 | 0.410 | 0.420 | 0.460 | 0.360 | 0.430 | 0.470 | - | 0.500 | - | - | - | 0.470 | - | 0.470 | 0.380 | anatomy | 0.274 | 0.437 | 0.540 | 0.555 | 0.614 | 0.592 | 0.622 | 0.607 | - | 0.651 | - | - | - | 0.644 | - | 0.681 | 0.622 | astronomy | 0.328 | 0.611 | 0.723 | 0.730 | 0.750 | 0.796 | 0.822 | 0.828 | - | 0.861 | - | - | - | 0.868 | - | 0.802 | 0.750 | business_ethics | 0.290 | 0.460 | 0.670 | 0.670 | 0.630 | 0.610 | 0.650 | 0.650 | - | 0.720 | - | - | - | 0.730 | - | 0.720 | 0.630 | clinical_knowledge | 0.350 | 0.528 | 0.690 | 0.709 | 0.686 | 0.728 | 0.758 | 0.743 | - | 0.762 | - | - | - | 0.781 | - | 0.766 | 0.694 | college_biology | 0.388 | 0.604 | 0.770 | 0.784 | 0.805 | 0.784 | 0.805 | 0.812 | - | 0.847 | - | - | - | 0.881 | - | 0.819 | 0.694 | college_chemistry | 0.230 | 0.340 | 0.420 | 0.420 | 0.460 | 0.420 | 0.480 | 0.490 | - | 0.550 | - | - | - | 0.500 | - | 0.460 | 0.490 | college_computer_science | 0.230 | 0.380 | 0.580 | 0.580 | 0.620 | 0.610 | 0.650 | 0.700 | - | 0.630 | - | - | - | 0.700 | - | 0.580 | 0.470 | college_mathematics | 0.160 | 0.300 | 0.370 | 0.390 | 0.360 | 0.390 
| 0.450 | 0.500 | - | 0.450 | - | - | - | 0.400 | - | 0.350 | 0.390 | college_medicine | 0.312 | 0.537 | 0.676 | 0.664 | 0.664 | 0.676 | 0.716 | 0.722 | - | 0.739 | - | - | - | 0.722 | - | 0.722 | 0.595 | college_physics | 0.137 | 0.254 | 0.519 | 0.509 | 0.460 | 0.500 | 0.529 | 0.578 | - | 0.578 | - | - | - | 0.558 | - | 0.421 | 0.441 | computer_security | 0.430 | 0.650 | 0.690 | 0.700 | 0.710 | 0.710 | 0.750 | 0.740 | - | 0.770 | - | - | - | 0.770 | - | 0.630 | 0.640 | conceptual_physics | 0.255 | 0.514 | 0.685 | 0.685 | 0.702 | 0.697 | 0.736 | 0.761 | - | 0.821 | - | - | - | 0.821 | - | 0.668 | 0.668 | econometrics | 0.140 | 0.394 | 0.587 | 0.614 | 0.622 | 0.552 | 0.614 | 0.631 | - | 0.622 | - | - | - | 0.605 | - | 0.543 | 0.447 | electrical_engineering | 0.324 | 0.475 | 0.586 | 0.579 | 0.579 | 0.586 | 0.648 | 0.648 | - | 0.717 | - | - | - | 0.724 | - | 0.517 | 0.558 | elementary_mathematics | 0.148 | 0.391 | 0.582 | 0.574 | 0.589 | 0.584 | 0.640 | 0.650 | - | 0.701 | - | - | - | 0.679 | - | 0.547 | 0.476 | formal_logic | 0.238 | 0.357 | 0.547 | 0.531 | 0.484 | 0.460 | 0.476 | 0.476 | - | 0.515 | - | - | - | 0.579 | - | 0.611 | 0.436 | global_facts | 0.140 | 0.110 | 0.180 | 0.220 | 0.180 | 0.240 | 0.260 | 0.280 | - | 0.300 | - | - | - | 0.310 | - | 0.430 | 0.200 | high_school_biology | 0.432 | 0.600 | 0.822 | 0.822 | 0.832 | 0.822 | 0.848 | 0.858 | - | 0.883 | - | - | - | 0.906 | - | 0.783 | 0.790 | high_school_chemistry | 0.147 | 0.423 | 0.600 | 0.591 | 0.625 | 0.551 | 0.635 | 0.650 | - | 0.709 | - | - | - | 0.665 | - | 0.596 | 0.581 | high_school_computer_science | 0.340 | 0.580 | 0.750 | 0.750 | 0.760 | 0.730 | 0.820 | 0.830 | - | 0.830 | - | - | - | 0.810 | - | 0.800 | 0.700 | high_school_european_history | 0.387 | 0.600 | 0.703 | 0.690 | 0.781 | 0.727 | 0.818 | 0.812 | - | 0.787 | - | - | - | 0.806 | - | 0.824 | 0.696 | high_school_geography | 0.393 | 0.631 | 0.803 | 0.808 | 0.792 | 0.752 | 0.787 | 0.808 | - | 0.853 | - | - | - | 0.878 | - | 0.858 | 0.797 | 
high_school_government_and_politics | 0.284 | 0.621 | 0.849 | 0.854 | 0.834 | 0.834 | 0.891 | 0.906 | - | 0.901 | - | - | - | 0.958 | - | 0.886 | 0.823 | high_school_macroeconomics | 0.292 | 0.474 | 0.661 | 0.656 | 0.653 | 0.669 | 0.712 | 0.712 | - | 0.787 | - | - | - | 0.810 | - | 0.692 | 0.623 | high_school_mathematics | 0.166 | 0.292 | 0.348 | 0.355 | 0.403 | 0.351 | 0.418 | 0.396 | - | 0.474 | - | - | - | 0.381 | - | 0.296 | 0.355 | high_school_microeconomics | 0.390 | 0.609 | 0.773 | 0.768 | 0.798 | 0.802 | 0.882 | 0.886 | - | 0.911 | - | - | - | 0.894 | - | 0.676 | 0.735 | high_school_physics | 0.119 | 0.350 | 0.556 | 0.549 | 0.529 | 0.549 | 0.602 | 0.602 | - | 0.649 | - | - | - | 0.662 | - | 0.456 | 0.437 | high_school_psychology | 0.526 | 0.746 | 0.842 | 0.844 | 0.849 | 0.849 | 0.877 | 0.877 | - | 0.891 | - | - | - | 0.921 | - | 0.790 | 0.836 | high_school_statistics | 0.333 | 0.462 | 0.648 | 0.657 | 0.657 | 0.625 | 0.694 | 0.689 | - | 0.703 | - | - | - | 0.736 | - | 0.537 | 0.550 | high_school_us_history | 0.338 | 0.553 | 0.784 | 0.764 | 0.759 | 0.710 | 0.823 | 0.848 | - | 0.862 | - | - | - | 0.897 | - | 0.872 | 0.754 | high_school_world_history | 0.459 | 0.645 | 0.793 | 0.776 | 0.789 | 0.797 | 0.839 | 0.831 | - | 0.827 | - | - | - | 0.864 | - | 0.877 | 0.725 | human_aging | 0.331 | 0.439 | 0.587 | 0.578 | 0.596 | 0.600 | 0.609 | 0.623 | - | 0.668 | - | - | - | 0.771 | - | 0.721 | 0.551 | human_sexuality | 0.374 | 0.549 | 0.641 | 0.664 | 0.679 | 0.664 | 0.740 | 0.755 | - | 0.770 | - | - | - | 0.809 | - | 0.770 | 0.656 | international_law | 0.429 | 0.537 | 0.669 | 0.652 | 0.694 | 0.628 | 0.694 | 0.710 | - | 0.826 | - | - | - | 0.801 | - | 0.809 | 0.743 | jurisprudence | 0.398 | 0.527 | 0.675 | 0.694 | 0.759 | 0.675 | 0.731 | 0.731 | - | 0.805 | - | - | - | 0.777 | - | 0.787 | 0.675 | logical_fallacies | 0.319 | 0.613 | 0.791 | 0.785 | 0.791 | 0.717 | 0.779 | 0.803 | - | 0.822 | - | - | - | 0.803 | - | 0.644 | 0.723 | machine_learning | 0.276 | 0.339 | 0.526 
| 0.508 | 0.500 | 0.392 | 0.491 | 0.455 | - | 0.562 | - | - | - | 0.455 | - | 0.616 | 0.553 | management | 0.514 | 0.640 | 0.786 | 0.805 | 0.805 | 0.834 | 0.844 | 0.873 | - | 0.825 | - | - | - | 0.796 | - | 0.718 | 0.834 | marketing | 0.602 | 0.739 | 0.816 | 0.807 | 0.837 | 0.820 | 0.884 | 0.876 | - | 0.876 | - | - | - | 0.876 | - | 0.773 | 0.794 | medical_genetics | 0.370 | 0.580 | 0.750 | 0.710 | 0.810 | 0.750 | 0.780 | 0.750 | - | 0.790 | - | - | - | 0.840 | - | 0.750 | 0.750 | miscellaneous | 0.390 | 0.597 | 0.752 | 0.744 | 0.761 | 0.757 | 0.789 | 0.795 | - | 0.831 | - | - | - | 0.854 | - | 0.786 | 0.740 | moral_disputes | 0.289 | 0.473 | 0.580 | 0.583 | 0.578 | 0.540 | 0.609 | 0.615 | - | 0.638 | - | - | - | 0.699 | - | 0.760 | 0.589 | moral_scenarios | 0.109 | 0 | 0.145 | 0.140 | 0.230 | 0.234 | 0.322 | 0.269 | - | 0.330 | - | - | - | 0.292 | - | 0.518 | 0.082 | nutrition | 0.316 | 0.509 | 0.660 | 0.669 | 0.699 | 0.660 | 0.702 | 0.702 | - | 0.771 | - | - | - | 0.771 | - | 0.771 | 0.656 | philosophy | 0.225 | 0.501 | 0.623 | 0.617 | 0.636 | 0.575 | 0.636 | 0.643 | - | 0.675 | - | - | - | 0.710 | - | 0.755 | 0.639 | prehistory | 0.345 | 0.530 | 0.688 | 0.688 | 0.703 | 0.688 | 0.753 | 0.743 | - | 0.774 | - | - | - | 0.796 | - | 0.833 | 0.663 | professional_accounting | 0.237 | 0.308 | 0.439 | 0.443 | 0.443 | 0.404 | 0.482 | 0.482 | - | 0.510 | - | - | - | 0.546 | - | 0.588 | 0.393 | professional_law | 0.201 | 0.288 | 0.378 | 0.382 | 0.370 | 0.369 | 0.414 | 0.424 | - | 0.431 | - | - | - | 0.475 | - | 0.505 | 0.393 | professional_medicine | 0.209 | 0.463 | 0.698 | 0.705 | 0.720 | 0.676 | 0.757 | 0.775 | - | 0.786 | - | - | - | 0.830 | - | 0.790 | 0.632 | professional_psychology | 0.303 | 0.459 | 0.651 | 0.655 | 0.620 | 0.601 | 0.676 | 0.679 | - | 0.733 | - | - | - | 0.766 | - | 0.712 | 0.616 | public_relations | 0.345 | 0.500 | 0.572 | 0.563 | 0.654 | 0.581 | 0.600 | 0.636 | - | 0.645 | - | - | - | 0.709 | - | 0.672 | 0.600 | security_studies | 0.412 | 0.604 | 
0.636 | 0.636 | 0.685 | 0.677 | 0.730 | 0.730 | - | 0.746 | - | - | - | 0.755 | - | 0.804 | 0.640 | sociology | 0.427 | 0.656 | 0.731 | 0.746 | 0.800 | 0.766 | 0.800 | 0.815 | - | 0.781 | - | - | - | 0.855 | - | 0.840 | 0.736 | us_foreign_policy | 0.470 | 0.610 | 0.720 | 0.710 | 0.740 | 0.780 | 0.830 | 0.830 | - | 0.830 | - | - | - | 0.840 | - | 0.860 | 0.800 | virology | 0.319 | 0.379 | 0.433 | 0.433 | 0.457 | 0.409 | 0.463 | 0.475 | - | 0.487 | - | - | - | 0.487 | - | 0.542 | 0.415 | world_religions | 0.362 | 0.637 | 0.719 | 0.719 | 0.760 | 0.783 | 0.783 | 0.777 | - | 0.818 | - | - | - | 0.807 | - | 0.865 | 0.748 | MMLU | 0.298 | 0.457 | 0.598 | 0.598 | 0.613 | 0.598 | 0.651 | 0.654 | - | 0.684 | - | - | - | 0.698 | - | 0.674 | 0.577 | AGIEVAL aquarat | 0.572 | 0.760 | 0.866 | 0.840 | 0.860 | 0.834 | 0.866 | 0.897 | - | 0.885 | - | - | - | 0.860 | - | 0.848 | 0.864 | logiqa | 0.062 | 0.230 | 0.451 | 0.453 | 0.462 | 0.393 | 0.420 | 0.431 | - | 0.465 | - | - | - | 0.520 | - | 0.586 | 0.414 | lsatar | 0.208 | 0.313 | 0.486 | 0.500 | 0.904 | 0.430 | 0.486 | 0.517 | - | 0.469 | - | - | - | 0.495 | - | 0.678 | 0.808 | lsatlr | 0.164 | 0.372 | 0.601 | 0.594 | 0.672 | 0.574 | 0.641 | 0.658 | - | 0.725 | - | - | - | 0.768 | - | 0.813 | 0.703 | lsatrc | 0.327 | 0.464 | 0.669 | 0.665 | 0.710 | 0.657 | 0.687 | 0.713 | - | 0.713 | - | - | - | 0.806 | - | 0.828 | 0.695 | saten | 0.412 | 0.655 | 0.830 | 0.825 | 0.830 | 0.825 | 0.820 | 0.820 | - | 0.834 | - | - | - | 0.873 | - | 0.898 | 0.737 | satmath | 0.772 | 0.950 | 0.990 | 0.981 | 0.977 | 0.986 | 0.990 | 0.990 | - | 0.990 | - | - | - | 0.995 | - | 0.972 | 0.954 | AGIEVAL | 0.282 | 0.458 | 0.641 | 0.636 | 0.703 | 0.608 | 0.643 | 0.659 | - | 0.678 | - | - | - | 0.717 | - | 0.764 | 0.676 | AGIEVALC_biology | 0.152 | 0.539 | 0.765 | 0.760 | 0.773 | 0.769 | 0.834 | 0.847 | - | 0.856 | - | - | - | 0.878 | - | 0.708 | 0.695 | AGIEVALC_chemistry | 0.117 | 0.397 | 0.622 | 0.602 | 0.661 | 0.568 | 0.647 | 0.656 | - | 0.720 | - | - | - 
| 0.803 | - | 0.754 | 0.558 | AGIEVALC_chinese | 0.081 | 0.365 | 0.516 | 0.504 | 0.573 | 0.581 | 0.609 | 0.634 | - | 0.678 | - | - | - | 0.739 | - | 0.707 | 0.536 | AGIEVALC_english | 0.477 | 0.728 | 0.856 | 0.849 | 0.866 | 0.820 | 0.856 | 0.856 | - | 0.866 | - | - | - | 0.872 | - | 0.888 | 0.813 | AGIEVALC_geography | 0.281 | 0.547 | 0.708 | 0.683 | 0.628 | 0.648 | 0.758 | 0.743 | - | 0.768 | - | - | - | 0.829 | - | 0.819 | 0.628 | AGIEVALC_history | 0.319 | 0.612 | 0.736 | 0.727 | 0.774 | 0.702 | 0.761 | 0.770 | - | 0.821 | - | - | - | 0.872 | - | 0.889 | 0.702 | AGIEVALC_jecqaca | 0.183 | 0.303 | 0.397 | 0.392 | 0.369 | 0.378 | 0.425 | 0.439 | - | 0.482 | - | - | - | 0.566 | - | 0.652 | 0.367 | AGIEVALC_jecqakd | 0.123 | 0.359 | 0.480 | 0.484 | 0.492 | 0.513 | 0.561 | 0.574 | - | 0.613 | - | - | - | 0.676 | - | 0.701 | 0.473 | AGIEVALC_logiqa | 0.122 | 0.317 | 0.496 | 0.483 | 0.508 | 0.471 | 0.499 | 0.497 | - | 0.562 | - | - | - | 0.599 | - | 0.642 | 0.465 | AGIEVALC_mathcloze | 0.728 | 0.669 | 0.923 | 0.957 | 0.957 | 0.838 | 0.830 | 0.923 | - | 0.881 | - | - | - | 0.932 | 0.864 | 0.889 | 0.983 | AGIEVALC_mathqa | _0.500_ | _0.704_ | 0.828 | _0.812_ | _0.928_ | _0.764_ | 0.863 | 0.851 | - | 0.813 | - | - | - | _0.852_ | 0.828 | _0.904_ | _0.892_ | AGIEVALC_physics | 0.080 | 0.333 | 0.436 | 0.431 | 0.471 | 0.454 | 0.563 | 0.545 | - | 0.626 | - | - | - | 0.701 | 0.741 | 0.528 | 0.419 | AGIEVALC | _0.225_ | _0.448_ | 0.602 | _0.589_ | _0.611_ | _0.580_ | 0.639 | 0.646 | - | 0.680 | - | - | - | _0.729_ | 0.811 | _0.733_ | _0.574_ | BBH boolean_expressions | 0.724 | 0.728 | 0.820 | 0.812 | 0.648 | 0.612 | 0.560 | 0.620 | - | 0.900 | - | - | - | 0.832 | - | 0.768 | 0.840 | causal_judgement | 0.491 | 0.561 | 0.593 | 0.582 | 0.604 | 0.540 | 0.588 | 0.572 | - | 0.604 | - | - | - | 0.631 | - | 0.636 | 0.636 | date_understanding | 0.504 | 0.752 | 0.880 | 0.888 | 0.928 | 0.852 | 0.936 | 0.912 | - | 0.916 | - | - | - | 0.940 | - | 0.884 | 0.904 | disambiguation_qa | 0.448 | 
0.464 | 0.648 | 0.588 | 0.680 | 0.464 | 0.544 | 0.520 | - | 0.636 | - | - | - | 0.448 | - | 0.436 | 0.568 | dyck_languages | 0.412 | 0.524 | 0.580 | 0.572 | 0.500 | 0.672 | 0.696 | 0.688 | - | 0.772 | - | - | - | 0.816 | - | 0.848 | 0.448 | formal_fallacies | 0.800 | 0.800 | 0.748 | 0.776 | 1.000 | 0.568 | 0.992 | 0.604 | - | 0.768 | - | - | - | 0.728 | - | 0.976 | 0.520 | geometric_shapes | 0.228 | 0.572 | 0.536 | 0.556 | 0.804 | 0.692 | 0.716 | 0.676 | - | 0.688 | - | - | - | 0.728 | - | 0.780 | 0.764 | hyperbaton | 0.576 | 0.692 | 0.872 | 0.856 | 0.960 | 0.912 | 0.952 | 0.960 | - | 0.976 | - | - | - | 0.948 | - | 0.940 | 0.984 | logical_deduction_five_objects | 0.416 | 0.772 | 0.884 | 0.868 | 1.000 | 0.856 | 0.872 | 0.928 | - | 0.936 | - | - | - | 0.972 | - | 0.988 | 0.988 | logical_deduction_seven_objects | 0.360 | 0.664 | 0.856 | 0.840 | 1.000 | 0.816 | 0.860 | 0.880 | - | 0.888 | - | - | - | 0.924 | - | 0.968 | 0.996 | logical_deduction_three_objects | 0.612 | 0.932 | 0.988 | 0.988 | 0.996 | 0.988 | 0.984 | 0.980 | - | 0.996 | - | - | - | 1.000 | - | 0.988 | 0.980 | movie_recommendation | 0.360 | 0.416 | 0.528 | 0.504 | 0.604 | 0.492 | 0.520 | 0.544 | - | 0.572 | - | - | - | 0.616 | - | 0.668 | 0.684 | multistep_arithmetic_two | 0.896 | 0.988 | 0.996 | 0.988 | 1.000 | 0.984 | 1.000 | 1.000 | - | _0.572_ | - | - | - | 0.996 | - | 1.000 | 0.984 | navigate | 0.516 | 0.576 | 0.580 | 0.580 | 1.000 | 0.508 | 0.992 | 0.608 | - | 0.680 | - | - | - | 0.728 | - | 0.996 | 0.996 | object_counting | 0.664 | 0.872 | 0.992 | 0.996 | 1.000 | 0.996 | 0.992 | 0.996 | - | 0.996 | - | - | - | 1.000 | - | 0.996 | 0.952 | penguins_in_a_table | 0.602 | 0.897 | 0.945 | 0.958 | 0.986 | 0.993 | 1.000 | 0.993 | - | 1.000 | - | - | - | 1.000 | - | 1.000 | 0.986 | reasoning_about_colored_objects | 0.520 | 0.792 | 0.952 | 0.960 | 0.996 | 0.928 | 0.940 | 0.960 | - | 0.948 | - | - | - | 0.984 | - | 0.964 | 0.968 | ruin_names | 0.164 | 0.512 | 0.508 | 0.516 | 0.632 | 0.604 | 0.656 | 0.652 | 
- | 0.772 | - | - | - | 0.776 | - | 0.768 | 0.776 | salient_translation_error_detection | 0.316 | 0.488 | 0.612 | 0.632 | 0.696 | 0.604 | 0.628 | 0.572 | - | 0.660 | - | - | - | 0.680 | - | 0.576 | 0.652 | snarks | 0.471 | 0.573 | 0.730 | 0.685 | 0.780 | 0.735 | 0.792 | 0.735 | - | 0.780 | - | - | - | 0.837 | - | 0.831 | 0.764 | sports_understanding | 0.472 | 0.524 | 0.624 | 0.596 | 0.708 | 0.540 | 0.644 | 0.636 | - | 0.560 | - | - | - | 0.776 | - | 0.464 | 0.600 | temporal_sequences | 0.136 | 0.400 | 0.912 | 0.892 | 1.000 | 0.940 | 0.992 | 0.992 | - | 0.980 | - | - | - | 0.992 | - | 0.948 | 0.996 | tracking_shuffled_objects_five_objects | 0.280 | 0.648 | 0.940 | 0.964 | 1.000 | 0.968 | 0.956 | 0.936 | - | 0.996 | - | - | - | 0.996 | - | 0.988 | 0.988 | tracking_shuffled_objects_seven_objects | 0.232 | 0.564 | 0.852 | 0.884 | 0.996 | 0.924 | 0.952 | 0.972 | - | 0.944 | - | - | - | 0.948 | - | 0.980 | 0.952 | tracking_shuffled_objects_three_objects | 0.408 | 0.736 | 0.896 | 0.884 | 1.000 | 0.920 | 0.920 | 0.952 | - | 0.996 | - | - | - | 0.996 | - | 0.992 | 0.992 | web_of_lies | 0.456 | 0.460 | 0.552 | 0.544 | 1.000 | 0.488 | - | 0.488 | - | 0.540 | - | - | - | 0.492 | - | 1.000 | 1.000 | word_sorting | 0.080 | 0.136 | 0.220 | 0.228 | 0.280 | 0.292 | 0.292 | 0.288 | - | 0.324 | - | - | - | 0.324 | - | 0.400 | 0.220 | BBH | 0.446 | 0.628 | 0.748 | 0.744 | 0.845 | 0.734 | 0.763 | 0.763 | - | _0.791_ | - | - | - | 0.817 | - | 0.825 | 0.819 | MUSR murder_mystery | 0.524 | 0.560 | 0.640 | 0.668 | 0.588 | 0.584 | 0.636 | 0.652 | - | 0.672 | - | - | - | 0.636 | - | 0.560 | 0.620 | object_placements | 0.480 | 0.512 | 0.566 | 0.556 | 0.476 | 0.536 | 0.582 | 0.578 | - | 0.528 | - | - | - | 0.516 | - | 0.436 | 0.476 | team_allocation | 0.280 | 0.468 | 0.628 | 0.612 | 0.708 | 0.648 | 0.656 | 0.668 | - | 0.564 | - | - | - | 0.632 | - | 0.728 | 0.464 | MUSR | 0.428 | 0.513 | 0.611 | 0.612 | 0.590 | 0.589 | 0.624 | 0.632 | - | 0.588 | - | - | - | 0.594 | - | 0.574 | 0.520 | 
GPQA_diamond | 0.262 | 0.282 | 0.434 | 0.489 | 0.616 | 0.398 | 0.530 | - | - | 0.439 | - | - | - | 0.555 | - | 0.555 | 0.540 | GPQA | 0.262 | 0.282 | 0.434 | 0.489 | 0.616 | 0.398 | 0.530 | - | - | 0.439 | - | - | - | 0.555 | - | 0.555 | 0.540 | MMLUPRO biology | _0.268_ | _0.596_ | 0.799 | _0.784_ | _0.852_ | _0.804_ | 0.822 | 0.831 | - | _0.824_ | - | - | - | _0.852_ | - | _0.812_ | _0.716_ | business | _0.348_ | _0.548_ | 0.717 | _0.692_ | _0.784_ | _0.724_ | 0.740 | 0.738 | - | _0.784_ | - | - | - | _0.812_ | - | _0.788_ | _0.676_ | chemistry | _0.300_ | _0.564_ | 0.720 | _0.700_ | _0.848_ | _0.700_ | 0.746 | 0.747 | - | _0.796_ | - | - | - | _0.780_ | - | _0.740_ | _0.704_ | computer_science | _0.232_ | _0.532_ | 0.680 | _0.664_ | _0.744_ | _0.676_ | 0.704 | 0.707 | - | _0.704_ | - | - | - | _0.804_ | - | _0.780_ | _0.676_ | economics | _0.276_ | _0.544_ | 0.716 | _0.732_ | _0.784_ | _0.732_ | 0.741 | 0.759 | - | _0.808_ | - | - | - | _0.832_ | - | _0.796_ | _0.700_ | engineering | _0.232_ | _0.456_ | 0.557 | _0.596_ | _0.624_ | _0.536_ | 0.587 | 0.600 | - | _0.620_ | - | - | - | _0.676_ | - | _0.476_ | _0.496_ | health | _0.192_ | _0.312_ | 0.559 | _0.564_ | _0.668_ | _0.632_ | 0.630 | 0.639 | - | _0.652_ | - | - | - | _0.680_ | - | _0.652_ | _0.576_ | history | _0.200_ | _0.300_ | 0.488 | _0.524_ | _0.604_ | _0.536_ | 0.561 | 0.556 | - | _0.600_ | - | - | - | _0.680_ | - | _0.588_ | _0.492_ | law | _0.076_ | _0.208_ | 0.295 | _0.304_ | _0.336_ | _0.348_ | 0.353 | 0.379 | - | _0.384_ | - | - | - | _0.468_ | - | _0.392_ | _0.320_ | math | _0.444_ | _0.676_ | _0.832_ | _0.800_ | _0.884_ | _0.780_ | 0.824 | 0.827 | - | _0.844_ | - | - | - | _0.872_ | - | _0.860_ | _0.796_ | other | _0.188_ | _0.356_ | _0.484_ | _0.528_ | _0.600_ | _0.560_ | 0.590 | 0.593 | - | _0.656_ | - | - | - | _0.704_ | - | _0.636_ | _0.544_ | philosophy | _0.168_ | _0.336_ | _0.504_ | _0.520_ | _0.600_ | _0.492_ | 0.559 | 0.567 | - | _0.580_ | - | - | - | _0.664_ | - | _0.612_ | _0.484_ | 
physics | _0.232_ | _0.580_ | _0.708_ | _0.740_ | _0.852_ | _0.720_ | 0.753 | 0.752 | - | _0.756_ | - | - | - | _0.824_ | - | _0.736_ | _0.724_ |
psychology | _0.212_ | _0.512_ | _0.672_ | _0.632_ | _0.700_ | _0.668_ | 0.725 | 0.692 | - | _0.728_ | - | - | - | _0.748_ | - | _0.672_ | _0.624_ |
MMLUPRO | _0.240_ | _0.465_ | _0.622_ | _0.627_ | _0.705_ | _0.636_ | 0.674 | 0.679 | - | _0.695_ | - | - | - | _0.742_ | - | _0.681_ | _0.609_ |

CATEGORIES
REASONING | 0.332 | 0.593 | 0.744 | 0.746 | 0.742 | 0.738 | 0.787 | 0.792 | 0.827 | 0.802 | 0.815 | 0.832 | 0.838 | 0.824 | 0.885 | 0.850 | 0.730 |
UNDERSTANDING | 0.353 | 0.521 | 0.653 | 0.651 | 0.661 | 0.646 | 0.693 | 0.697 | 0.722 | 0.725 | 0.699 | 0.747 | 0.849 | 0.743 | 0.809 | 0.731 | 0.643 |
LANGUAGE | 0.471 | 0.590 | 0.644 | 0.638 | 0.662 | 0.660 | 0.700 | 0.700 | 0.729 | 0.714 | 0.698 | 0.725 | 0.692 | 0.701 | 0.780 | 0.651 | 0.522 |
KNOWLEDGE | 0.367 | 0.456 | 0.552 | 0.548 | 0.531 | 0.558 | 0.527 | 0.541 | 0.657 | 0.632 | 0.520 | 0.501 | 0.599 | 0.563 | 0.581 | 0.659 | 0.531 |
COT | _0.307_ | _0.478_ | _0.616_ | _0.611_ | _0.712_ | _0.611_ | 0.667 | 0.670 | - | _0.682_ | - | - | - | _0.717_ | - | _0.674_ | _0.636_ |
MATHCOT | _0.545_ | _0.766_ | 0.900 | _0.890_ | _0.954_ | _0.882_ | 0.884 | 0.895 | 0.910 | _0.904_ | 0.820 | 0.954 | 0.910 | _0.924_ | 0.927 | _0.948_ | _0.949_ |
CODE | 0.317 | 0.443 | 0.538 | 0.524 | 0.534 | 0.532 | 0.582 | 0.573 | - | 0.618 | - | 0.755 | - | 0.586 | 0.321 | 0.506 | 0.551 |

DISCIPLINES
NLP | 0.402 | 0.562 | 0.674 | 0.672 | 0.673 | 0.671 | 0.694 | 0.700 | 0.767 | 0.739 | 0.769 | 0.764 | 0.768 | 0.723 | 0.772 | 0.765 | 0.654 |
MATH | _0.433_ | _0.638_ | _0.789_ | _0.767_ | _0.829_ | _0.779_ | 0.808 | 0.815 | 0.910 | _0.824_ | 0.820 | 0.954 | 0.910 | _0.828_ | 0.927 | _0.829_ | _0.803_ |
SCIENCE | _0.340_ | _0.648_ | _0.783_ | _0.789_ | _0.813_ | _0.782_ | 0.808 | 0.818 | - | _0.847_ | - | - | - | _0.866_ | 0.946 | _0.793_ | _0.774_ |
ENGINEERING | _0.265_ | _0.463_ | 0.561 | _0.589_ | _0.607_ | _0.554_ | 0.595 | 0.606 | - | _0.655_ | - | - | - | _0.693_ | - | _0.491_ | _0.518_ |
MEDICINE | _0.232_ | _0.392_ | 0.554 | _0.552_ | _0.578_ | _0.566_ | 0.613 | 0.620 | - | _0.642_ | - | 0.597 | - | _0.668_ | 0.598 | _0.642_ | _0.547_ |
HUMANITIES | _0.283_ | _0.449_ | _0.595_ | _0.593_ | _0.627_ | _0.598_ | 0.642 | 0.639 | 0.480 | _0.677_ | 0.520 | 0.470 | 0.470 | _0.707_ | 0.600 | _0.717_ | _0.577_ |
BUSINESS | _0.358_ | _0.555_ | 0.717 | _0.717_ | _0.744_ | _0.725_ | 0.756 | 0.762 | - | _0.807_ | - | - | - | _0.815_ | - | _0.724_ | _0.683_ |
LAW | _0.200_ | _0.329_ | 0.438 | _0.458_ | _0.490_ | _0.448_ | 0.473 | 0.488 | - | _0.535_ | - | - | - | _0.586_ | - | _0.625_ | _0.489_ |

COMPOSITE AVERAGE
AVG | _0.363_ | _0.539_ | _0.664_ | _0.661_ | _0.676_ | _0.663_ | 0.693 | 0.699 | 0.767 | _0.732_ | 0.768 | 0.764 | 0.767 | _0.731_ | 0.759 | _0.743_ | _0.652_ |

CODE MODELS:

MODEL | Codestral-22B-v0.1 | Codestral-22B-Instruct-v0.1 | Deepseek-Coder-V2-Lite-Instruct | Qwen2.5-Coder-0.5B-32k-Instruct | Qwen2.5-Coder-1.5B-Instruct | Qwen2.5-Coder-3B-Instruct | Qwen2.5-Coder-3B-Instruct | Qwen2.5-3B-32k-Instruct | Qwen2.5-Coder-7B | Qwen2.5-Coder-7B | Qwen2.5-Coder-7B-Instruct | Qwen2.5-Coder-7B-Instruct | Qwen2.5-Coder-7B-Instruct | Qwen2.5-Coder-14B-Instruct | Qwen2.5-Coder-14B-Instruct | Qwen2.5-Coder-14B | Qwen2.5-Coder-32B-Instruct | Qwen2.5-Coder-32B-Instruct | Qwen3-Coder-30B-A3B-Instruct |
---------------------------------------------|--------------------|-----------------------------|---------------------------------|---------------------------------|-----------------------------|---------------------------|---------------------------|-------------------------|------------------|------------------|---------------------------|---------------------------|---------------------------|----------------------------|----------------------------|-------------------|----------------------------|----------------------------|------------------------------| params | 22B | 22B | 14.77B | 0.49403B | 1.54B | 3.09B | 3.09B | 3.09B | 7.62B | 7.62B | 7.62B | 7.62B | 7.62B | 14.77B | 14.77B | 14.77B | 32.76B | 32.76B | 30.53B | quant | IQ4_XS | IQ4_XS | IQ4_XS | Q6_K | Q6_K | Q6_K | Q6_K_H | Q6_K | IQ4_XS | Q6_K | IQ4_XS | Q4_K_H | Q6_K_H | IQ4_XS | Q4_K_H | IQ4_XS | IQ4_XS | Q4_K_H | Q4_K_H | engine | llama.cpp version: 4132 | llama.cpp version: 4191 | llama.cpp version: 4488 | llama.cpp version: 4150 | llama.cpp version: 4150 | llama.cpp version: 4150 | llama.cpp version: 7021 | llama.cpp version: 4150 | llama.cpp version: 4295 | llama.cpp version: 4132 | llama.cpp version: 4094 | llama.cpp version: 7003 | llama.cpp version: 7003 | llama.cpp version: 4120 | llama.cpp version: 7048 | llama.cpp version: 4150 | llama.cpp version: 4150 | llama.cpp version: 7192 | llama.cpp version: 5935 | **TEST** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | HUMANEVAL | 0.664 | 0.810 | 0.847 | 0.518 | 0.676 | 0.835 | 0.847 | 0.780 | 0.640 | 0.713 | 0.829 | 0.841 | 0.841 | 0.878 | 0.871 | 0.676 | 0.884 | 0.896 | 0.939 | HUMANEVALP | 0.554 | 0.682 | - | 0.432 | 0.567 | 0.719 | 0.731 | 0.682 | 0.530 | 0.579 | 0.707 | 0.713 | 0.713 | 0.756 | 0.731 | 0.536 | 0.756 | 0.762 | 0.810 | HUMANEVALFIM | 0.719 | 0.719 | 0.621 | 0.518 | 0.524 | 0.634 | 
0.408 | - | 0.713 | 0.756 | 0.493 | 0.786 | 0.737 | 0.829 | 0.829 | 0.518 | 0.890 | 0.829 | 0.713 | MBPP | 0.630 | 0.653 | - | 0.408 | 0.560 | 0.618 | 0.638 | 0.599 | 0.614 | 0.571 | 0.735 | 0.731 | 0.708 | 0.727 | 0.723 | 0.661 | 0.715 | 0.708 | 0.696 | MBPPP | 0.558 | 0.593 | - | 0.352 | 0.504 | 0.589 | 0.598 | 0.584 | 0.540 | 0.513 | 0.687 | 0.687 | 0.660 | 0.665 | 0.656 | 0.558 | 0.669 | 0.678 | 0.669 | HUMANEVALX_cpp | 0.640 | 0.621 | - | 0.286 | 0.426 | 0.567 | 0.554 | 0.237 | 0.548 | 0.475 | 0.676 | 0.701 | 0.719 | 0.506 | 0.463 | 0.573 | 0.689 | 0.682 | 0.817 | HUMANEVALX_java | 0.756 | 0.670 | - | 0.512 | 0.609 | 0.743 | 0.737 | 0.615 | 0.725 | 0.652 | 0.798 | 0.792 | 0.810 | 0.201 | 0.262 | 0.762 | 0.841 | 0.859 | 0.884 | HUMANEVALX_js | 0.658 | 0.621 | - | 0.493 | 0.615 | 0.670 | 0.756 | 0.682 | 0.628 | 0.658 | 0.798 | 0.768 | 0.774 | 0.817 | 0.810 | 0.695 | 0.835 | 0.823 | 0.871 | HUMANEVALX | 0.684 | 0.638 | - | 0.430 | 0.550 | 0.660 | 0.682 | 0.512 | 0.634 | 0.595 | 0.758 | 0.754 | 0.768 | 0.508 | 0.512 | 0.676 | 0.788 | 0.788 | 0.857 | CRUXEVAL_input | 0.438 | 0.351 | - | 0.435 | 0.416 | 0.481 | 0.477 | 0.347 | 0.255 | 0.267 | 0.578 | 0.580 | 0.578 | 0.677 | 0.666 | 0.281 | 0.676 | 0.673 | 0.577 | CRUXEVAL_output | 0.465 | 0.447 | - | 0.278 | 0.332 | 0.413 | 0.415 | 0.311 | 0.381 | 0.435 | 0.507 | 0.511 | 0.505 | 0.577 | 0.580 | 0.422 | 0.610 | 0.616 | 0.558 | CRUXEVAL | 0.451 | 0.399 | - | 0.356 | 0.374 | 0.447 | 0.446 | 0.329 | 0.318 | 0.351 | 0.543 | 0.545 | 0.541 | 0.627 | 0.623 | 0.351 | 0.643 | 0.645 | 0.568 | CRUXEVALFIM_input | 0.295 | 0.351 | - | 0.017 | 0.155 | 0.208 | 0.271 | - | 0.296 | 0.313 | 0.322 | 0.432 | 0.447 | 0.421 | 0.542 | 0.346 | 0.592 | 0.585 | 0.440 | CRUXEVALFIM_output | 0.441 | 0.355 | - | 0.098 | 0.222 | 0.323 | 0.335 | - | 0.352 | 0.365 | 0.481 | 0.436 | 0.446 | 0.546 | 0.395 | 0.481 | 0.483 | 0.506 | 0.340 | CRUXEVALFIM | 0.368 | 0.353 | - | 0.058 | 0.188 | 0.266 | 0.303 | - | 0.324 | 0.339 | 0.401 | 0.434 | 0.446 | 
0.483 | 0.468 | 0.413 | 0.536 | 0.545 | 0.390 |
CODE | 0.483 | 0.467 | 0.734 | 0.278 | 0.368 | 0.453 | 0.462 | 0.449 | 0.413 | 0.427 | 0.548 | 0.571 | 0.571 | 0.593 | 0.585 | 0.458 | 0.648 | 0.650 | 0.576 |

MATH MODELS:

MODEL | Deepseek-R1-Distill-Llama-8B | Deepseek-R1-Distill-Llama-8B | Deepseek-R1-Distill-Qwen-1.5B | Deepseek-R1-Distill-Qwen-7B | Deepseek-R1-Distill-Qwen-7B | Deepseek-R1-Distill-Qwen-14B | Deepseek-R1-Distill-Qwen-14B | Deepseek-R1-Distill-Qwen-32B | Deepseek-R1-Distill-Qwen-32B | GLM-4.7-Flash | GLM-4.7-Flash | GLM-Z1-9B-0414 | GLM-Z1-9B-0414 | GLM-Z1-9B-0414 | GLM-Z1-32B-0414 | Qwen2.5-Math-1.5B-Instruct | Qwen2.5-Math-7B-Instruct | Qwen3-32B | QwQ-32B | QwQ-32B |
---------------------------------------------|------------------------------|------------------------------|-------------------------------|-----------------------------|-----------------------------|------------------------------|------------------------------|------------------------------|------------------------------|---------------|---------------|----------------|----------------|----------------|-----------------|----------------------------|--------------------------|-----------|---------|---------|
params | 8.03B | 8.03B | 1.78B | 7.62B | 7.62B | 14.77B | 14.77B | 32.76B | 32.76B | 29.94B | 29.94B | 9.40B | 9.40B | 9.40B | 32.57B | 1.54B | 7.62B | 32.8B | 32.76B | 32.76B |
quant | Q6_K | Q6_K_H | Q8_0 | IQ4_XS | Q6_K_H | IQ4_XS | Q4_K_H | IQ4_XS | Q4_K_H | Q4_K_H | Q6_K_H | Q4_K_H | Q4_P_H | Q6_K_H | Q4_K_H | IQ4_XS | Q6_K | Q4_K_H | IQ4_XS | Q4_K_H |
engine | llama.cpp version: 4707 | llama.cpp version: 5898 | llama.cpp version: 4763 | llama.cpp version: 4644 | llama.cpp version: 7699 | llama.cpp version: 4657 | llama.cpp version: 7710 | llama.cpp version: 4559 | llama.cpp version: 7719 | llama.cpp version: 7885 | llama.cpp version: 7845 | llama.cpp version: 7230 | llama.cpp version: 7268 | llama.cpp version: 5935 | llama.cpp version: 7607 | llama.cpp version: 4406 | 
llama.cpp version: 4394 | llama.cpp version: 5633 | llama.cpp version: 4820 | llama.cpp version: 6026 | **TEST** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | GSM8K | - | _0.888_ | - | - | _0.888_ | - | _0.944_ | - | _0.956_ | _0.992_ | _0.984_ | _0.968_ | _0.968_ | _0.964_ | _0.988_ | - | - | - | - | _0.964_ | APPLE | - | 0.870 | - | - | 0.810 | - | 0.790 | - | 0.810 | 0.960 | 1.000 | 0.880 | 0.920 | 0.880 | 0.910 | - | - | - | - | 0.880 | GPQA_diamond | - | 0.308 | - | - | 0.323 | - | 0.489 | - | 0.494 | 0.444 | 0.444 | 0.555 | 0.540 | 0.434 | 0.585 | - | - | - | - | 0.555 | GPQA | - | 0.308 | - | - | 0.323 | - | 0.489 | - | 0.494 | 0.444 | 0.444 | 0.555 | 0.540 | 0.434 | 0.585 | - | - | - | - | 0.555 | MATH1_algebra | 0.933 | 0.977 | 0.918 | 0.962 | 1.000 | 0.925 | 0.970 | 0.962 | 0.985 | 1.000 | 1.000 | 0.985 | 0.992 | 0.992 | 0.992 | 0.859 | 0.955 | 0.992 | 0.992 | 1.000 | MATH1_counting_and_probability | 0.820 | 0.948 | 0.794 | 0.948 | 1.000 | 0.923 | 0.948 | 0.948 | 0.948 | 1.000 | 1.000 | - | 0.974 | 1.000 | 1.000 | 0.897 | 0.974 | 1.000 | 0.974 | 1.000 | MATH1_geometry | 0.842 | 0.868 | 0.710 | 0.736 | 0.921 | 0.868 | 0.921 | 0.921 | 0.894 | 1.000 | 0.973 | - | - | 0.947 | 0.947 | 0.710 | 0.842 | 0.842 | 0.921 | 0.921 | MATH1_intermediate_algebra | 0.923 | 0.980 | 0.730 | 0.903 | 0.961 | 0.865 | 0.980 | 0.961 | 0.942 | 1.000 | 1.000 | - | - | 0.961 | 0.980 | 0.730 | 0.711 | 0.923 | 1.000 | 0.980 | MATH1_number_theory | 0.700 | 0.900 | 0.866 | 0.800 | 1.000 | 0.700 | 0.966 | 0.933 | 0.966 | 1.000 | 1.000 | - | - | 0.900 | 1.000 | 0.766 | 1.000 | 0.666 | 0.800 | 0.833 | MATH1_prealgebra | 0.813 | 0.941 | 0.883 | 0.965 | 0.976 | 0.883 | 0.988 | 0.953 | 0.988 | 1.000 | 1.000 | - | - | 0.988 | 1.000 | 0.837 | 0.883 | 0.930 | 0.953 | 0.976 | MATH1_precalculus | 0.684 | 1.000 | 0.596 | 0.859 | 1.000 | 0.842 | 
0.947 | 1.000 | 1.000 | 1.000 | 1.000 | - | - | 0.947 | 0.929 | 0.631 | 0.789 | 0.947 | 0.982 | 0.929 | MATH1 | 0.842 | 0.956 | 0.814 | 0.910 | 0.983 | 0.878 | 0.965 | 0.958 | 0.970 | 1.000 | 0.997 | - | - | 0.972 | 0.981 | 0.794 | 0.885 | 0.931 | 0.963 | 0.965 | MATH2_algebra | 0.845 | 0.975 | 0.825 | 0.930 | 0.985 | 0.900 | 0.965 | 0.995 | 0.985 | 1.000 | 1.000 | - | - | 0.980 | 0.995 | 0.910 | 0.860 | 0.970 | 0.975 | 0.990 | MATH2_counting_and_probability | 0.831 | 0.930 | 0.782 | 0.851 | 0.891 | 0.841 | 0.881 | 0.950 | 0.900 | 1.000 | 1.000 | - | - | 0.980 | 0.980 | 0.683 | 0.861 | 0.970 | 0.990 | 0.970 | MATH2_geometry | 0.841 | 0.963 | 0.743 | 0.914 | 0.951 | 0.792 | 0.939 | 0.914 | 0.987 | 1.000 | 1.000 | - | - | 0.987 | 0.939 | 0.621 | 0.743 | 0.792 | 0.963 | 0.963 | MATH2_intermediate_algebra | 0.859 | 0.953 | 0.664 | 0.875 | 0.992 | 0.835 | 0.960 | 0.968 | 0.976 | 1.000 | 1.000 | - | - | 0.968 | 0.984 | 0.671 | 0.710 | 0.953 | 0.960 | 0.984 | MATH2_number_theory | 0.826 | 0.913 | 0.782 | 0.826 | 0.967 | 0.891 | 0.934 | 0.934 | 0.913 | 0.989 | 0.989 | - | - | 0.956 | 0.989 | 0.695 | 0.880 | 0.891 | 0.945 | 0.967 | MATH2_prealgebra | 0.898 | 0.966 | 0.887 | 0.909 | 0.977 | 0.875 | 0.949 | 0.971 | 0.971 | 1.000 | 1.000 | - | - | 0.988 | 0.988 | 0.836 | 0.881 | 0.932 | 0.971 | 0.960 | MATH2_precalculus | 0.787 | 0.964 | 0.663 | 0.902 | 1.000 | 0.805 | 0.964 | 0.955 | 0.991 | 1.000 | 1.000 | - | - | 0.955 | 0.823 | 0.557 | 0.725 | 0.858 | 0.964 | 0.964 | MATH2 | 0.846 | 0.956 | 0.777 | 0.893 | 0.970 | 0.856 | 0.946 | 0.963 | 0.965 | 0.998 | 0.998 | - | - | 0.975 | 0.963 | 0.742 | 0.817 | 0.921 | 0.968 | 0.973 | MATH3_algebra | 0.873 | 0.938 | 0.854 | 0.934 | 0.984 | 0.911 | 0.961 | 0.992 | 0.980 | 0.988 | 1.000 | - | - | 0.980 | 0.988 | 0.881 | 0.850 | 0.969 | 0.996 | 0.984 | MATH3_counting_and_probability | 0.800 | 0.890 | 0.730 | 0.770 | 0.920 | 0.830 | 0.850 | 0.930 | 0.940 | 1.000 | 1.000 | - | - | 0.970 | 0.980 | 0.710 | 0.880 | 0.950 | 1.000 | 1.000 | 
MATH3_geometry | 0.794 | 0.901 | 0.627 | 0.911 | 0.960 | 0.794 | 0.911 | 0.901 | 0.941 | 0.970 | 0.990 | - | - | 0.970 | 0.950 | 0.696 | 0.764 | 0.833 | 0.970 | 0.931 |
MATH3_intermediate_algebra | 0.825 | 0.969 | 0.635 | 0.902 | 0.964 | 0.882 | 0.953 | 0.964 | 0.984 | 0.989 | 1.000 | - | - | 0.969 | 0.938 | 0.574 | 0.738 | 0.933 | 0.969 | 0.938 |
MATH3_number_theory | 0.819 | 0.934 | 0.696 | 0.754 | 0.942 | 0.770 | 0.844 | 0.926 | 0.909 | 0.991 | 1.000 | - | - | 0.926 | 0.942 | 0.655 | 0.819 | 0.811 | 0.942 | 0.918 |
MATH3_prealgebra | 0.875 | 0.950 | 0.763 | 0.883 | 0.959 | 0.892 | 0.941 | 0.946 | 0.950 | 0.995 | 1.000 | - | - | 0.986 | 0.986 | 0.816 | 0.883 | 0.946 | 0.982 | 0.977 |
MATH3_precalculus | 0.661 | 0.929 | 0.582 | 0.874 | 0.952 | 0.818 | 0.960 | 0.968 | 0.929 | 1.000 | 1.000 | - | - | 0.968 | 0.850 | 0.480 | 0.685 | 0.858 | 0.897 | 0.905 |
MATH3 | 0.822 | 0.937 | 0.719 | 0.876 | 0.960 | 0.859 | 0.929 | 0.954 | 0.954 | 0.991 | 0.999 | - | - | 0.970 | 0.954 | 0.714 | 0.810 | 0.915 | 0.969 | 0.955 |
MATH4_algebra | 0.848 | 0.950 | 0.805 | 0.897 | 0.982 | 0.922 | 0.964 | 0.957 | 0.985 | 0.992 | 1.000 | - | - | 0.989 | 0.978 | 0.851 | 0.865 | 0.968 | 0.992 | 0.985 |
MATH4_counting_and_probability | 0.729 | 0.882 | 0.639 | 0.738 | 0.900 | 0.711 | 0.864 | 0.945 | 0.945 | 0.990 | 0.990 | - | - | 0.963 | 0.945 | 0.558 | 0.783 | 0.945 | 0.981 | 0.981 |
MATH4_geometry | 0.792 | 0.896 | 0.576 | 0.776 | 0.896 | 0.768 | 0.832 | 0.832 | 0.856 | 0.976 | 0.976 | - | - | 0.920 | 0.888 | 0.432 | 0.616 | 0.712 | 0.872 | 0.840 |
MATH4_intermediate_algebra | 0.778 | 0.947 | 0.588 | 0.858 | 0.987 | 0.850 | 0.931 | 0.935 | 0.915 | 0.987 | 0.995 | - | - | 0.939 | 0.895 | 0.512 | 0.649 | 0.911 | 0.947 | 0.907 |
MATH4_number_theory | 0.795 | 0.950 | 0.697 | 0.809 | 0.915 | 0.725 | 0.859 | 0.894 | 0.929 | 0.992 | 0.992 | - | - | 0.929 | 0.964 | 0.619 | 0.823 | 0.823 | 0.943 | 0.936 |
MATH4_prealgebra | 0.806 | 0.931 | 0.785 | 0.874 | 0.958 | 0.827 | 0.942 | 0.921 | 0.931 | 0.989 | 0.989 | - | - | 0.958 | 0.963 | 0.748 | 0.801 | 0.879 | 0.926 | 0.942 |
MATH4_precalculus | 0.719 | 0.956 | 0.570 | 0.868 | 0.956 | 0.728 | 0.921 | 0.947 | 0.912 | 0.991 | 0.991 | - | - | 0.947 | 0.807 | 0.333 | 0.578 | 0.859 | 0.973 | 0.868 |
MATH4 | 0.792 | 0.935 | 0.684 | 0.845 | 0.953 | 0.816 | 0.915 | 0.925 | 0.932 | 0.989 | 0.992 | - | - | 0.953 | 0.929 | 0.620 | 0.746 | 0.887 | 0.952 | 0.930 |
MATH5_algebra | 0.768 | 0.947 | 0.752 | 0.899 | 0.970 | 0.853 | 0.934 | 0.970 | 0.967 | 0.986 | 0.996 | - | - | 0.960 | 0.967 | 0.674 | 0.762 | 0.964 | 0.964 | 0.977 |
MATH5_counting_and_probability | 0.699 | 0.910 | 0.569 | 0.756 | 0.829 | 0.699 | 0.788 | 0.910 | 0.861 | 0.959 | 0.983 | - | - | 0.934 | 0.967 | 0.495 | 0.642 | 0.910 | 0.934 | 0.902 |
MATH5_geometry | 0.712 | 0.886 | 0.545 | 0.810 | 0.840 | 0.727 | 0.818 | 0.840 | 0.810 | 0.946 | 0.984 | - | - | 0.878 | 0.810 | 0.348 | 0.507 | 0.734 | 0.833 | 0.742 |
MATH5_intermediate_algebra | 0.682 | 0.900 | 0.453 | 0.821 | 0.875 | 0.778 | 0.810 | 0.810 | 0.846 | 0.932 | 0.964 | - | - | 0.889 | 0.832 | 0.253 | 0.389 | 0.807 | 0.860 | 0.800 |
MATH5_number_theory | 0.811 | 0.909 | 0.707 | 0.727 | 0.935 | 0.792 | 0.915 | 0.935 | 0.935 | 0.967 | 0.993 | - | - | 0.961 | 0.954 | 0.525 | 0.753 | 0.870 | 0.941 | 0.935 |
MATH5_prealgebra | 0.777 | 0.849 | 0.720 | 0.808 | 0.896 | 0.782 | 0.823 | 0.875 | 0.891 | 0.958 | 0.984 | - | - | 0.953 | 0.937 | 0.580 | 0.797 | 0.911 | 0.927 | 0.948 |
MATH5_precalculus | 0.562 | 0.903 | 0.437 | 0.851 | 0.851 | 0.792 | 0.800 | 0.814 | 0.851 | 0.918 | 0.955 | - | - | 0.888 | 0.666 | 0.259 | 0.429 | 0.777 | 0.851 | 0.770 |
MATH5 | 0.723 | 0.904 | 0.609 | 0.822 | 0.897 | 0.787 | 0.851 | 0.884 | 0.889 | 0.955 | 0.981 | - | - | 0.926 | 0.886 | 0.462 | 0.617 | 0.865 | 0.907 | 0.879 |
MATHCOT | 0.795 | _0.930_ | 0.700 | 0.860 | _0.940_ | 0.831 | _0.910_ | 0.930 | _0.934_ | _0.983_ | _0.992_ | - | - | _0.954_ | _0.936_ | 0.637 | 0.751 | 0.897 | 0.948 | _0.933_ |
COMPOSITE AVERAGE
AVG | 0.795 | _0.907_ | 0.700 | 0.860 | _0.918_ | 0.831 | _0.895_ | 0.930 | _0.918_ | _0.964_ | _0.972_ | - | - | _0.936_ | _0.923_ | 0.637 | 0.751 | 0.897 | 0.948 | _0.920_ |

VISION MODELS:

MODEL | gemma-3-4b-it | gemma-3-4b-it | gemma-3-12b-it | gemma-3-27b-it | GLM-4.6V-Flash | LFM2-VL-1.6B | Llama-4-Scout-17B-16E-Instruct | MiniCPM-V-4_5 | MiniCPM-V-4_5 | Ministral-3-3B-Instruct-2512 | Ministral-3-3B-Instruct-2512 | Ministral-3-8B-Instruct-2512 | Ministral-3-14B-Instruct-2512 | Mistral-Small-3.1-24B-Instruct-2503 | Mistral-Small-3.2-24B-Instruct-2506 | Qwen2.5-Omni-3B | Qwen2.5-Omni-3B | Qwen2.5-Omni-3B | Qwen2.5-Omni-7B | Qwen2.5-Omni-7B | Qwen2.5-VL-3B-Instruct | Qwen2.5-VL-3B-Instruct | Qwen2.5-VL-3B-Instruct | Qwen2.5-VL-7B-Instruct | Qwen2.5-VL-7B-Instruct | Qwen2.5-VL-32B-Instruct | Qwen3-VL-2B-Instruct | Qwen3-VL-4B-Instruct | Qwen3-VL-8B-Instruct | Qwen3-VL-8B-Instruct | Qwen3-VL-30B-A3B-Instruct | Qwen3-VL-32B-Instruct | Qwen3-VL-8B-Thinking |
---------------------------------------------|---------------|---------------|----------------|----------------|----------------|--------------|--------------------------------|---------------|---------------|------------------------------|------------------------------|------------------------------|-------------------------------|-------------------------------------|-------------------------------------|-----------------|-----------------|-----------------|-----------------|-----------------|------------------------|------------------------|------------------------|------------------------|------------------------|-------------------------|----------------------|----------------------|----------------------|----------------------|---------------------------|-----------------------|----------------------|
params | 3.88B | 3.88B | 11.77B | 27.01B | 9.46B | 1.17B | 107.77B | 8.19B | 8.19B | 3.43B | 3.43B | 8.49B | 13.51B | 23.57B | 23.57B | 3.09B | 3.09B | 3.09B | 7.62B | 7.62B | 3.09B | 3.09B | 3.09B | 7.62B | 7.62B | 32.76B | 1.72B | 4.02B | 8.19B | 8.19B | 30.53B | 30.53B | 8.19B |
quant | Q6_K | Q6_K_H | Q4_K_H | Q4_K_H | Q4_P_H | Q8_0_H | Q2_K_H | Q4_K_H | Q6_K_H | Q4_K_H | Q6_K_H | Q4_K_H | Q4_K_H | Q4_K_H | Q4_K_H | Q4_K_H | Q6_K_H | Q8_0_H | Q4_K_H | Q6_K_H | Q4_K_H | Q6_K_H | Q8_0_H | Q4_K_H | Q6_K_H | Q4_K_H | Q8_0_H | Q6_K_H | Q4_K_H | Q6_K_H | Q4_K_H | Q4_K_H | Q6_K_H |
engine | llama.cpp version: 5706 | llama.cpp version: 5819 | llama.cpp version: 5819 | llama.cpp version: 5780 | llama.cpp version: 7445 | llama.cpp version: 7211 | llama.cpp version: 5935 | llama.cpp version: 7188 | llama.cpp version: 7188 | llama.cpp version: 7310 | llama.cpp version: 7310 | llama.cpp version: 7278 | llama.cpp version: 7278 | llama.cpp version: 5662 | llama.cpp version: 5780 | llama.cpp version: 7154 | llama.cpp version: 7154 | llama.cpp version: 7003 | llama.cpp version: 7130 | llama.cpp version: 6937 | llama.cpp version: 7154 | llama.cpp version: 7154 | llama.cpp version: 6937 | llama.cpp version: 7130 | llama.cpp version: 6915 | llama.cpp version: 7091 | llama.cpp version: 6937 | llama.cpp version: 6924 | llama.cpp version: 7215 | llama.cpp version: 6915 | llama.cpp version: 6915 | llama.cpp version: 7002 | llama.cpp version: 6937 |
**TEST** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** |
CHARTQA | 0.464 | 0.456 | 0.558 | 0.662 | 0.801 | 0.715 | 0.719 | 0.782 | 0.784 | 0.702 | 0.737 | 0.787 | 0.734 | 0.743 | 0.716 | 0.825 | 0.831 | 0.816 | 0.794 | 0.794 | 0.843 | 0.884 | 0.882 | 0.802 | 0.800 | 0.789 | 0.720 | 0.755 | 0.754 | 0.762 | 0.787 | 0.799 | 0.734 |
DOCVQA | 0.567 | 0.563 | 0.711 | 0.795 | 0.933 | 0.728 | 0.862 | 0.918 | 0.911 | 0.833 | 0.874 | 0.926 | 0.899 | 0.892 | 0.866 | 0.908 | 0.908 | 0.908 | 0.936 | 0.938 | 0.874 | 0.894 | 0.895 | 0.939 | 0.935 | 0.914 | 0.863 | 0.923 | 0.918 | 0.921 | 0.923 | 0.909 | 0.894 |
REALWORLDQA | - | - | - | - | 0.700 | 0.464 | - | 0.693 | 0.686 | 0.579 | 0.574 | 0.599 | 0.599 | - | - | 0.811 | 0.812 | 0.781 | 0.680 | 0.681 | - | - | - | 0.682 | 0.676 | 0.682 | 0.646 | 0.717 | 0.686 | 0.676 | 0.713 | 0.750 | 0.673 |
MMMU_Accounting | 0.366 | 0.400 | 0.566 | 0.700 | 0.600 | 0.333 | 0.866 | 0.833 | 0.766 | 0.533 | 0.566 | 0.600 | 0.600 | 0.466 | 0.733 | 0.433 | 0.366 | 0.333 | 0.500 | 0.533 | 0.533 | 0.400 | 0.466 | 0.566 | 0.600 | 0.566 | 0.466 | 0.566 | 0.566 | 0.533 | 0.466 | 0.733 | 0.733 |
MMMU_Agriculture | 0.400 | 0.400 | 0.500 | 0.533 | 0.533 | 0.266 | 0.600 | 0.633 | 0.600 | 0.400 | 0.466 | 0.500 | 0.600 | 0.500 | 0.533 | 0.533 | 0.400 | 0.333 | 0.466 | 0.366 | 0.200 | 0.433 | 0.500 | 0.400 | 0.433 | 0.566 | 0.466 | 0.533 | 0.566 | 0.600 | 0.633 | 0.633 | 0.566 |
MMMU_Architecture_and_Engineering | 0.200 | 0.166 | 0.400 | 0.333 | 0.500 | 0.300 | 0.366 | 0.400 | 0.466 | 0.333 | 0.400 | 0.300 | 0.366 | 0.400 | 0.400 | 0.400 | 0.266 | 0.300 | 0.433 | 0.300 | 0.433 | 0.200 | 0.333 | 0.366 | 0.433 | 0.466 | 0.066 | 0.266 | 0.266 | 0.300 | 0.366 | 0.500 | 0.533 |
MMMU_Art_Theory | 0.533 | 0.666 | 0.833 | 0.866 | 0.733 | 0.300 | 0.800 | 0.866 | 0.833 | 0.433 | 0.500 | 0.766 | 0.633 | 0.866 | 0.700 | 0.633 | 0.633 | 0.600 | 0.766 | 0.733 | 0.500 | 0.566 | 0.500 | 0.733 | 0.700 | 0.766 | 0.733 | 0.733 | 0.800 | 0.833 | 0.900 | 0.933 | 0.800 |
MMMU_Art | 0.566 | 0.566 | 0.700 | 0.766 | 0.666 | 0.433 | 0.766 | 0.633 | 0.633 | 0.533 | 0.466 | 0.700 | 0.533 | 0.633 | 0.666 | 0.533 | 0.600 | 0.666 | 0.666 | 0.666 | 0.366 | 0.400 | 0.500 | 0.566 | 0.533 | 0.700 | 0.500 | 0.466 | 0.633 | 0.666 | 0.633 | 0.666 | 0.666 |
MMMU_Basic_Medical_Science | 0.333 | 0.533 | 0.633 | 0.566 | 0.666 | 0.400 | 0.700 | 0.666 | 0.666 | 0.400 | 0.433 | 0.700 | 0.600 | 0.733 | 0.600 | 0.433 | 0.433 | 0.533 | 0.533 | 0.500 | 0.300 | 0.500 | 0.500 | 0.500 | 0.633 | 0.600 | 0.566 | 0.633 | 0.666 | 0.733 | 0.666 | 0.666 | 0.700 |
MMMU_Biology | 0.300 | 0.166 | 0.300 | 0.366 | 0.500 | 0.266 | 0.500 | 0.500 | 0.466 | 0.333 | 0.233 | 0.300 | 0.400 | 0.400 | 0.433 | 0.366 | 0.333 | 0.233 | 0.366 | 0.433 | 0.200 | 0.233 | 0.233 | 0.400 | 0.500 | 0.533 | 0.366 | 0.433 | 0.400 | 0.500 | 0.500 | 0.500 | 0.466 |
MMMU_Chemistry | 0.033 | 0.266 | 0.333 | 0.333 | 0.433 | 0.233 | 0.433 | 0.366 | 0.433 | 0.233 | 0.233 | 0.400 | 0.333 | 0.366 | 0.366 | 0.133 | 0.233 | 0.100 | 0.300 | 0.266 | 0.333 | 0.266 | 0.333 | 0.300 | 0.333 | 0.366 | 0.266 | 0.366 | 0.300 | 0.266 | 0.366 | 0.466 | 0.433 |
MMMU_Clinical_Medicine | 0.066 | 0.466 | 0.533 | 0.600 | 0.466 | 0.300 | 0.633 | 0.633 | 0.633 | 0.533 | 0.500 | 0.600 | 0.633 | 0.633 | 0.733 | 0.500 | 0.466 | 0.466 | 0.533 | 0.566 | 0.266 | 0.466 | 0.433 | 0.733 | 0.700 | 0.566 | 0.533 | 0.533 | 0.666 | 0.666 | 0.566 | 0.866 | 0.733 |
MMMU_Computer_Science | 0.400 | 0.466 | 0.466 | 0.600 | 0.533 | 0.233 | 0.533 | 0.500 | 0.500 | 0.366 | 0.300 | 0.333 | 0.333 | 0.400 | 0.433 | 0.433 | 0.366 | 0.433 | 0.566 | 0.533 | 0.300 | 0.300 | 0.300 | 0.500 | 0.466 | 0.466 | 0.266 | 0.466 | 0.400 | 0.500 | 0.566 | 0.666 | 0.533 |
MMMU_Design | 0.633 | 0.766 | 0.733 | 0.766 | 0.800 | 0.466 | 0.866 | 0.766 | 0.833 | 0.633 | 0.600 | 0.733 | 0.566 | 0.666 | 0.800 | 0.666 | 0.633 | 0.533 | 0.633 | 0.800 | 0.400 | 0.666 | 0.633 | 0.766 | 0.700 | 0.766 | 0.533 | 0.766 | 0.766 | 0.800 | 0.800 | 0.833 | 0.766 |
MMMU_Diagnostics_and_Laboratory_Medicine | 0.100 | 0.200 | 0.300 | 0.233 | 0.400 | 0.200 | 0.433 | 0.466 | 0.500 | 0.300 | 0.233 | 0.366 | 0.433 | 0.400 | 0.433 | 0.300 | 0.133 | 0.433 | 0.333 | 0.366 | 0.233 | 0.366 | 0.266 | 0.300 | 0.300 | 0.466 | 0.200 | 0.366 | 0.400 | 0.366 | 0.400 | 0.433 | 0.466 |
MMMU_Economics | 0.466 | 0.533 | 0.500 | 0.600 | 0.733 | 0.366 | 0.766 | 0.766 | 0.700 | 0.666 | 0.566 | 0.600 | 0.700 | 0.666 | 0.766 | 0.433 | 0.700 | 0.566 | 0.500 | 0.500 | 0.466 | 0.500 | 0.433 | 0.666 | 0.633 | 0.700 | 0.600 | 0.800 | 0.533 | 0.766 | 0.733 | 0.700 | 0.766 |
MMMU_Electronics | 0.066 | 0.133 | 0.233 | 0.400 | 0.333 | 0.166 | 0.466 | 0.466 | 0.466 | 0.400 | 0.200 | 0.433 | 0.366 | 0.400 | 0.400 | 0.266 | 0.333 | 0.233 | 0.366 | 0.300 | 0.166 | 0.100 | 0.266 | 0.366 | 0.333 | 0.366 | 0.066 | 0.166 | 0.233 | 0.266 | 0.266 | 0.366 | 0.266 |
MMMU_Energy_and_Power | 0.333 | 0.233 | 0.400 | 0.500 | 0.600 | 0.066 | 0.600 | 0.633 | 0.733 | 0.366 | 0.333 | 0.566 | 0.533 | 0.500 | 0.400 | 0.466 | 0.300 | 0.466 | 0.366 | 0.500 | 0.266 | 0.333 | 0.300 | 0.333 | 0.300 | 0.633 | 0.100 | 0.366 | 0.366 | 0.433 | 0.366 | 0.633 | 0.533 |
MMMU_Finance | 0.333 | 0.333 | 0.466 | 0.533 | 0.566 | 0.100 | 0.500 | 0.433 | 0.500 | 0.433 | 0.366 | 0.533 | 0.533 | 0.500 | 0.466 | 0.366 | 0.400 | 0.366 | 0.466 | 0.500 | 0.366 | 0.233 | 0.333 | 0.433 | 0.533 | 0.466 | 0.300 | 0.500 | 0.500 | 0.500 | 0.500 | 0.566 | 0.566 |
MMMU_Geography | 0.200 | 0.266 | 0.333 | 0.366 | 0.500 | 0.200 | 0.533 | 0.566 | 0.500 | 0.400 | 0.366 | 0.600 | 0.466 | 0.533 | 0.433 | 0.433 | 0.433 | 0.400 | 0.333 | 0.466 | 0.200 | 0.233 | 0.233 | 0.366 | 0.400 | 0.566 | 0.266 | 0.433 | 0.433 | 0.333 | 0.400 | 0.400 | 0.500 |
MMMU_History | 0.566 | 0.633 | 0.733 | 0.800 | 0.733 | 0.433 | 0.800 | 0.666 | 0.633 | 0.466 | 0.400 | 0.633 | 0.666 | 0.500 | 0.700 | 0.566 | 0.500 | 0.666 | 0.600 | 0.733 | 0.466 | 0.433 | 0.700 | 0.633 | 0.566 | 0.833 | 0.566 | 0.600 | 0.666 | 0.700 | 0.766 | 0.833 | 0.700 |
MMMU_Literature | 0.666 | 0.766 | 0.866 | 0.900 | 0.733 | 0.666 | 0.800 | 0.866 | 0.833 | 0.766 | 0.733 | 0.866 | 0.800 | 0.866 | 0.866 | 0.600 | 0.733 | 0.533 | 0.800 | 0.800 | 0.766 | 0.766 | 0.800 | 0.766 | 0.800 | 0.766 | 0.733 | 0.700 | 0.833 | 0.800 | 0.833 | 0.866 | 0.833 |
MMMU_Manage | 0.233 | 0.333 | 0.333 | 0.500 | 0.500 | 0.233 | 0.500 | 0.466 | 0.500 | 0.433 | 0.466 | 0.533 | 0.566 | 0.466 | 0.566 | 0.400 | 0.400 | 0.466 | 0.400 | 0.466 | 0.400 | 0.300 | 0.366 | 0.333 | 0.400 | 0.600 | 0.200 | 0.400 | 0.466 | 0.433 | 0.366 | 0.600 | 0.466 |
MMMU_Marketing | 0.333 | 0.400 | 0.466 | 0.666 | 0.766 | 0.266 | 0.800 | 0.766 | 0.700 | 0.666 | 0.566 | 0.633 | 0.733 | 0.633 | 0.700 | 0.466 | 0.466 | 0.433 | 0.666 | 0.500 | 0.266 | 0.233 | 0.266 | 0.566 | 0.533 | 0.800 | 0.533 | 0.733 | 0.733 | 0.833 | 0.900 | 0.833 | 0.900 |
MMMU_Materials | 0.133 | 0.233 | 0.133 | 0.300 | 0.433 | 0.066 | 0.533 | 0.466 | 0.466 | 0.333 | 0.266 | 0.300 | 0.366 | 0.300 | 0.366 | 0.333 | 0.266 | 0.233 | 0.333 | 0.333 | 0.300 | 0.166 | 0.200 | 0.400 | 0.400 | 0.400 | 0.100 | 0.266 | 0.233 | 0.233 | 0.300 | 0.533 | 0.400 |
MMMU_Math | 0.300 | 0.400 | 0.533 | 0.566 | 0.700 | 0.433 | 0.566 | 0.566 | 0.666 | 0.500 | 0.400 | 0.566 | 0.600 | 0.466 | 0.566 | 0.366 | 0.333 | 0.433 | 0.433 | 0.433 | 0.366 | 0.266 | 0.366 | 0.500 | 0.366 | 0.366 | 0.300 | 0.500 | 0.500 | 0.566 | 0.566 | 0.600 | 0.533 |
MMMU_Mechanical_Engineering | 0.166 | 0.166 | 0.300 | 0.333 | 0.366 | 0.300 | 0.733 | 0.600 | 0.600 | 0.366 | 0.366 | 0.566 | 0.466 | 0.433 | 0.466 | 0.233 | 0.266 | 0.366 | 0.300 | 0.333 | 0.233 | 0.366 | 0.400 | 0.466 | 0.366 | 0.366 | 0.100 | 0.200 | 0.366 | 0.333 | 0.400 | 0.600 | 0.433 |
MMMU_Music | 0.166 | 0.333 | 0.200 | 0.333 | 0.500 | 0.300 | 0.233 | 0.433 | 0.466 | 0.333 | 0.433 | 0.333 | 0.366 | 0.400 | 0.433 | 0.266 | 0.300 | 0.233 | 0.300 | 0.266 | 0.566 | 0.500 | 0.466 | 0.433 | 0.333 | 0.333 | 0.200 | 0.166 | 0.300 | 0.166 | 0.300 | 0.366 | 0.266 |
MMMU_Pharmacy | 0.333 | 0.366 | 0.600 | 0.633 | 0.700 | 0.266 | 0.766 | 0.733 | 0.700 | 0.500 | 0.533 | 0.633 | 0.733 | 0.566 | 0.700 | 0.433 | 0.433 | 0.500 | 0.500 | 0.466 | 0.433 | 0.366 | 0.400 | 0.633 | 0.600 | 0.600 | 0.366 | 0.500 | 0.666 | 0.700 | 0.733 | 0.866 | 0.766 |
MMMU_Physics | 0.166 | 0.300 | 0.433 | 0.600 | 0.733 | 0.266 | 0.666 | 0.666 | 0.700 | 0.500 | 0.600 | 0.533 | 0.600 | 0.433 | 0.600 | 0.400 | 0.466 | 0.400 | 0.500 | 0.466 | 0.333 | 0.433 | 0.500 | 0.466 | 0.400 | 0.666 | 0.333 | 0.566 | 0.800 | 0.800 | 0.766 | 0.866 | 0.833 |
MMMU_Psychology | 0.366 | 0.433 | 0.500 | 0.566 | 0.600 | 0.266 | 0.566 | 0.633 | 0.566 | 0.433 | 0.366 | 0.666 | 0.533 | 0.633 | 0.466 | 0.533 | 0.500 | 0.433 | 0.600 | 0.600 | 0.400 | 0.433 | 0.500 | 0.566 | 0.500 | 0.566 | 0.466 | 0.533 | 0.666 | 0.633 | 0.566 | 0.666 | 0.666 |
MMMU_Public_Health | 0.433 | 0.700 | 0.700 | 0.800 | 0.766 | 0.200 | 0.866 | 0.766 | 0.800 | 0.733 | 0.600 | 0.800 | 0.800 | 0.766 | 0.800 | 0.666 | 0.666 | 0.600 | 0.733 | 0.666 | 0.433 | 0.400 | 0.400 | 0.700 | 0.733 | 0.866 | 0.500 | 0.800 | 0.866 | 0.733 | 0.866 | 0.900 | 0.866 |
MMMU_Sociology | 0.366 | 0.600 | 0.600 | 0.700 | 0.600 | 0.300 | 0.666 | 0.566 | 0.600 | 0.566 | 0.566 | 0.633 | 0.700 | 0.533 | 0.733 | 0.566 | 0.533 | 0.466 | 0.500 | 0.500 | 0.466 | 0.566 | 0.466 | 0.500 | 0.500 | 0.666 | 0.566 | 0.566 | 0.533 | 0.600 | 0.633 | 0.700 | 0.600 |
MMMU | 0.318 | 0.407 | 0.487 | 0.558 | 0.590 | 0.287 | 0.628 | 0.611 | 0.615 | 0.463 | 0.435 | 0.557 | 0.552 | 0.535 | 0.575 | 0.438 | 0.430 | 0.425 | 0.493 | 0.496 | 0.365 | 0.381 | 0.413 | 0.508 | 0.501 | 0.580 | 0.375 | 0.497 | 0.537 | 0.553 | 0.571 | 0.660 | 0.610 |
MMMUPRO_Accounting | 0.224 | 0.310 | 0.534 | 0.603 | 0.689 | 0.103 | 0.741 | 0.655 | 0.672 | 0.500 | 0.517 | 0.517 | 0.637 | 0.551 | 0.586 | 0.224 | 0.293 | 0.344 | 0.396 | 0.413 | 0.379 | 0.327 | 0.310 | 0.362 | 0.465 | 0.568 | 0.362 | 0.482 | 0.586 | 0.637 | 0.706 | 0.827 | 0.689 |
MMMUPRO_Agriculture | 0.200 | 0.200 | 0.350 | 0.450 | 0.183 | 0.183 | 0.383 | 0.266 | 0.316 | 0.250 | 0.183 | 0.300 | 0.333 | 0.283 | 0.266 | 0.216 | 0.200 | 0.216 | 0.216 | 0.166 | 0.116 | 0.200 | 0.283 | 0.150 | 0.216 | 0.266 | 0.183 | 0.283 | 0.316 | 0.350 | 0.383 | 0.433 | 0.333 |
MMMUPRO_Architecture_and_Engineering | 0.100 | 0.133 | 0.216 | 0.333 | 0.266 | 0.100 | 0.433 | 0.350 | 0.283 | 0.250 | 0.150 | 0.400 | 0.383 | 0.316 | 0.366 | 0.200 | 0.250 | 0.233 | 0.266 | 0.350 | 0.183 | 0.100 | 0.183 | 0.383 | 0.216 | 0.300 | 0.066 | 0.183 | 0.166 | 0.266 | 0.233 | 0.383 | 0.433 |
MMMUPRO_Art_Theory | 0.472 | 0.490 | 0.636 | 0.709 | 0.563 | 0.163 | 0.672 | 0.709 | 0.709 | 0.345 | 0.418 | 0.654 | 0.654 | 0.618 | 0.527 | 0.509 | 0.527 | 0.509 | 0.545 | 0.618 | 0.345 | 0.436 | 0.400 | 0.600 | 0.654 | 0.636 | 0.527 | 0.436 | 0.672 | 0.636 | 0.672 | 0.727 | 0.618 |
MMMUPRO_Art | 0.396 | 0.452 | 0.547 | 0.622 | 0.509 | 0.207 | 0.603 | 0.622 | 0.622 | 0.415 | 0.415 | 0.452 | 0.490 | 0.471 | 0.528 | 0.452 | 0.528 | 0.433 | 0.528 | 0.566 | 0.283 | 0.415 | 0.339 | 0.433 | 0.528 | 0.490 | 0.471 | 0.471 | 0.509 | 0.528 | 0.547 | 0.641 | 0.509 |
MMMUPRO_Basic_Medical_Science | 0.269 | 0.250 | 0.384 | 0.442 | 0.480 | 0.230 | 0.596 | 0.326 | 0.442 | 0.192 | 0.173 | 0.307 | 0.423 | 0.384 | 0.403 | 0.173 | 0.173 | 0.192 | 0.307 | 0.307 | 0.192 | 0.134 | 0.211 | 0.365 | 0.346 | 0.442 | 0.269 | 0.480 | 0.403 | 0.461 | 0.461 | 0.557 | 0.365 |
MMMUPRO_Biology | 0.169 | 0.237 | 0.288 | 0.322 | 0.322 | 0.152 | 0.423 | 0.389 | 0.406 | 0.118 | 0.186 | 0.406 | 0.338 | 0.355 | 0.372 | 0.254 | 0.271 | 0.203 | 0.254 | 0.254 | 0.118 | 0.084 | 0.152 | 0.288 | 0.355 | 0.474 | 0.288 | 0.355 | 0.322 | 0.389 | 0.423 | 0.457 | 0.288 |
MMMUPRO_Chemistry | 0.200 | 0.266 | 0.333 | 0.350 | 0.383 | 0.200 | 0.366 | 0.350 | 0.466 | 0.266 | 0.216 | 0.300 | 0.433 | 0.383 | 0.450 | 0.150 | 0.300 | 0.300 | 0.283 | 0.250 | 0.316 | 0.250 | 0.283 | 0.333 | 0.333 | 0.300 | 0.266 | 0.216 | 0.300 | 0.350 | 0.366 | 0.583 | 0.383 |
MMMUPRO_Clinical_Medicine | 0.118 | 0.135 | 0.237 | 0.372 | 0.237 | 0.101 | 0.322 | 0.372 | 0.423 | 0.203 | 0.203 | 0.322 | 0.389 | 0.474 | 0.389 | 0.237 | 0.237 | 0.254 | 0.220 | 0.186 | 0.101 | 0.169 | 0.152 | 0.237 | 0.254 | 0.491 | 0.118 | 0.271 | 0.423 | 0.457 | 0.440 | 0.559 | 0.440 |
MMMUPRO_Computer_Science | 0.283 | 0.350 | 0.383 | 0.300 | 0.433 | 0.133 | 0.483 | 0.466 | 0.500 | 0.250 | 0.283 | 0.433 | 0.350 | 0.333 | 0.383 | 0.400 | 0.283 | 0.250 | 0.316 | 0.350 | 0.266 | 0.200 | 0.200 | 0.416 | 0.350 | 0.416 | 0.150 | 0.366 | 0.416 | 0.450 | 0.383 | 0.583 | 0.533 |
MMMUPRO_Design | 0.433 | 0.500 | 0.533 | 0.616 | 0.500 | 0.150 | 0.616 | 0.700 | 0.666 | 0.400 | 0.350 | 0.583 | 0.566 | 0.533 | 0.616 | 0.383 | 0.383 | 0.500 | 0.616 | 0.616 | 0.350 | 0.533 | 0.416 | 0.550 | 0.533 | 0.683 | 0.450 | 0.516 | 0.633 | 0.650 | 0.733 | 0.700 | 0.616 |
MMMUPRO_Diagnostics_and_Laboratory_Medicine | 0.116 | 0.200 | 0.200 | 0.233 | 0.316 | 0.150 | 0.383 | 0.333 | 0.350 | 0.250 | 0.166 | 0.283 | 0.333 | 0.200 | 0.300 | 0.233 | 0.166 | 0.100 | 0.166 | 0.166 | 0.083 | 0.116 | 0.133 | 0.150 | 0.216 | 0.200 | 0.150 | 0.216 | 0.300 | 0.366 | 0.266 | 0.316 | 0.300 |
MMMUPRO_Economics | 0.423 | 0.457 | 0.559 | 0.644 | 0.711 | 0.118 | 0.677 | 0.610 | 0.627 | 0.491 | 0.372 | 0.508 | 0.677 | 0.661 | 0.627 | 0.440 | 0.355 | 0.406 | 0.491 | 0.440 | 0.288 | 0.338 | 0.288 | 0.457 | 0.508 | 0.627 | 0.457 | 0.694 | 0.728 | 0.762 | 0.694 | 0.847 | 0.762 |
MMMUPRO_Electronics | 0.233 | 0.316 | 0.350 | 0.316 | 0.500 | 0.050 | 0.616 | 0.516 | 0.483 | 0.316 | 0.316 | 0.516 | 0.650 | 0.600 | 0.600 | 0.366 | 0.366 | 0.300 | 0.516 | 0.483 | 0.200 | 0.200 | 0.200 | 0.450 | 0.433 | 0.533 | 0.033 | 0.250 | 0.250 | 0.366 | 0.316 | 0.633 | 0.400 |
MMMUPRO_Energy_and_Power | 0.172 | 0.172 | 0.120 | 0.327 | 0.431 | 0.051 | 0.568 | 0.448 | 0.500 | 0.224 | 0.224 | 0.327 | 0.500 | 0.224 | 0.275 | 0.086 | 0.172 | 0.137 | 0.206 | 0.189 | 0.137 | 0.086 | 0.155 | 0.224 | 0.293 | 0.310 | 0.086 | 0.206 | 0.310 | 0.258 | 0.344 | 0.517 | 0.448 |
MMMUPRO_Finance | 0.283 | 0.366 | 0.533 | 0.516 | 0.766 | 0.133 | 0.633 | 0.633 | 0.716 | 0.516 | 0.450 | 0.666 | 0.683 | 0.600 | 0.650 | 0.333 | 0.416 | 0.350 | 0.533 | 0.500 | 0.283 | 0.250 | 0.333 | 0.450 | 0.450 | 0.666 | 0.300 | 0.500 | 0.616 | 0.666 | 0.616 | 0.833 | 0.750 |
MMMUPRO_Geography | 0.346 | 0.307 | 0.346 | 0.403 | 0.461 | 0.115 | 0.480 | 0.403 | 0.403 | 0.250 | 0.211 | 0.384 | 0.365 | 0.384 | 0.384 | 0.269 | 0.269 | 0.307 | 0.365 | 0.250 | 0.211 | 0.192 | 0.192 | 0.326 | 0.346 | 0.365 | 0.250 | 0.269 | 0.365 | 0.365 | 0.365 | 0.403 | 0.384 |
MMMUPRO_History | 0.375 | 0.392 | 0.535 | 0.553 | 0.517 | 0.125 | 0.607 | 0.571 | 0.589 | 0.464 | 0.392 | 0.500 | 0.535 | 0.428 | 0.553 | 0.375 | 0.375 | 0.339 | 0.500 | 0.553 | 0.214 | 0.428 | 0.428 | 0.482 | 0.482 | 0.607 | 0.267 | 0.482 | 0.589 | 0.446 | 0.589 | 0.678 | 0.571 |
MMMUPRO_Literature | 0.500 | 0.461 | 0.634 | 0.692 | 0.557 | 0.365 | 0.730 | 0.673 | 0.673 | 0.461 | 0.480 | 0.673 | 0.615 | 0.615 | 0.634 | 0.557 | 0.519 | 0.519 | 0.615 | 0.576 | 0.500 | 0.634 | 0.596 | 0.615 | 0.673 | 0.673 | 0.519 | 0.576 | 0.692 | 0.692 | 0.615 | 0.788 | 0.711 |
MMMUPRO_Manage | 0.220 | 0.240 | 0.320 | 0.400 | 0.600 | 0.160 | 0.480 | 0.460 | 0.500 | 0.320 | 0.280 | 0.460 | 0.460 | 0.420 | 0.420 | 0.280 | 0.260 | 0.240 | 0.300 | 0.380 | 0.260 | 0.340 | 0.260 | 0.320 | 0.300 | 0.380 | 0.220 | 0.360 | 0.480 | 0.420 | 0.380 | 0.580 | 0.580 |
MMMUPRO_Marketing | 0.288 | 0.305 | 0.440 | 0.440 | 0.745 | 0.101 | 0.627 | 0.627 | 0.677 | 0.491 | 0.423 | 0.508 | 0.593 | 0.525 | 0.593 | 0.254 | 0.288 | 0.322 | 0.406 | 0.389 | 0.338 | 0.271 | 0.322 | 0.474 | 0.389 | 0.593 | 0.423 | 0.627 | 0.661 | 0.711 | 0.711 | 0.711 | 0.745 |
MMMUPRO_Materials | 0.083 | 0.133 | 0.150 | 0.250 | 0.383 | 0.083 | 0.316 | 0.333 | 0.266 | 0.166 | 0.150 | 0.266 | 0.300 | 0.166 | 0.300 | 0.216 | 0.166 | 0.116 | 0.183 | 0.200 | 0.100 | 0.083 | 0.100 | 0.216 | 0.250 | 0.216 | 0.083 | 0.216 | 0.250 | 0.316 | 0.266 | 0.483 | 0.333 |
MMMUPRO_Math | 0.283 | 0.233 | 0.316 | 0.466 | 0.483 | 0.150 | 0.483 | 0.516 | 0.533 | 0.350 | 0.316 | 0.350 | 0.533 | 0.416 | 0.466 | 0.350 | 0.300 | 0.266 | 0.250 | 0.266 | 0.216 | 0.200 | 0.133 | 0.400 | 0.366 | 0.333 | 0.216 | 0.483 | 0.550 | 0.483 | 0.450 | 0.700 | 0.583 |
MMMUPRO_Mechanical_Engineering | 0.152 | 0.186 | 0.271 | 0.305 | 0.338 | 0.152 | 0.474 | 0.355 | 0.271 | 0.237 | 0.271 | 0.406 | 0.322 | 0.440 | 0.372 | 0.186 | 0.203 | 0.203 | 0.237 | 0.322 | 0.152 | 0.169 | 0.169 | 0.237 | 0.355 | 0.372 | 0.135 | 0.254 | 0.169 | 0.288 | 0.338 | 0.423 | 0.372 |
MMMUPRO_Music | 0.216 | 0.250 | 0.266 | 0.233 | 0.250 | 0.183 | 0.183 | 0.266 | 0.233 | 0.266 | 0.266 | 0.200 | 0.216 | 0.183 | 0.233 | 0.283 | 0.300 | 0.266 | 0.316 | 0.233 | 0.200 | 0.150 | 0.200 | 0.200 | 0.283 | 0.266 | 0.233 | 0.150 | 0.216 | 0.150 | 0.300 | 0.233 | 0.250 |
MMMUPRO_Pharmacy | 0.298 | 0.298 | 0.456 | 0.491 | 0.543 | 0.298 | 0.596 | 0.631 | 0.596 | 0.315 | 0.385 | 0.543 | 0.614 | 0.508 | 0.684 | 0.315 | 0.368 | 0.263 | 0.333 | 0.421 | 0.333 | 0.280 | 0.333 | 0.403 | 0.421 | 0.631 | 0.315 | 0.614 | 0.578 | 0.561 | 0.578 | 0.736 | 0.543 |
MMMUPRO_Physics | 0.166 | 0.116 | 0.416 | 0.400 | 0.550 | 0.133 | 0.533 | 0.450 | 0.533 | 0.350 | 0.316 | 0.483 | 0.450 | 0.466 | 0.466 | 0.283 | 0.250 | 0.266 | 0.366 | 0.333 | 0.200 | 0.183 | 0.216 | 0.266 | 0.283 | 0.466 | 0.416 | 0.483 | 0.516 | 0.566 | 0.516 | 0.650 | 0.600 |
MMMUPRO_Psychology | 0.366 | 0.333 | 0.300 | 0.383 | 0.450 | 0.200 | 0.500 | 0.450 | 0.400 | 0.316 | 0.200 | 0.483 | 0.450 | 0.350 | 0.416 | 0.283 | 0.283 | 0.300 | 0.316 | 0.333 | 0.200 | 0.200 | 0.266 | 0.316 | 0.300 | 0.383 | 0.216 | 0.450 | 0.483 | 0.516 | 0.450 | 0.483 | 0.466 |
MMMUPRO_Public_Health | 0.241 | 0.293 | 0.396 | 0.551 | 0.603 | 0.086 | 0.758 | 0.620 | 0.655 | 0.327 | 0.258 | 0.568 | 0.637 | 0.482 | 0.551 | 0.172 | 0.327 | 0.327 | 0.517 | 0.431 | 0.120 | 0.086 | 0.189 | 0.551 | 0.534 | 0.741 | 0.310 | 0.517 | 0.586 | 0.603 | 0.672 | 0.810 | 0.741 |
MMMUPRO_Sociology | 0.333 | 0.462 | 0.574 | 0.629 | 0.500 | 0.314 | 0.592 | 0.555 | 0.537 | 0.592 | 0.462 | 0.444 | 0.537 | 0.574 | 0.518 | 0.407 | 0.425 | 0.407 | 0.500 | 0.407 | 0.333 | 0.425 | 0.314 | 0.481 | 0.407 | 0.611 | 0.407 | 0.611 | 0.462 | 0.518 | 0.629 | 0.648 | 0.611 |
MMMUPRO | 0.263 | 0.293 | 0.384 | 0.442 | 0.473 | 0.154 | 0.527 | 0.487 | 0.500 | 0.328 | 0.300 | 0.440 | 0.481 | 0.430 | 0.463 | 0.294 | 0.306 | 0.294 | 0.367 | 0.363 | 0.232 | 0.246 | 0.256 | 0.369 | 0.382 | 0.466 | 0.270 | 0.398 | 0.449 | 0.473 | 0.480 | 0.596 | 0.510 |
DISCIPLINES
NLP | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
MATH | 0.305 | 0.338 | 0.400 | 0.450 | 0.511 | 0.205 | 0.505 | 0.505 | 0.538 | 0.344 | 0.316 | 0.411 | 0.450 | 0.394 | 0.450 | 0.383 | 0.311 | 0.316 | 0.355 | 0.366 | 0.272 | 0.227 | 0.222 | 0.438 | 0.377 | 0.388 | 0.216 | 0.444 | 0.472 | 0.488 | 0.466 | 0.638 | 0.550 |
SCIENCE | 0.178 | 0.218 | 0.318 | 0.378 | 0.418 | 0.173 | 0.452 | 0.414 | 0.443 | 0.273 | 0.260 | 0.369 | 0.400 | 0.354 | 0.400 | 0.267 | 0.271 | 0.233 | 0.305 | 0.285 | 0.204 | 0.209 | 0.256 | 0.298 | 0.329 | 0.398 | 0.267 | 0.351 | 0.380 | 0.423 | 0.432 | 0.547 | 0.438 |
ENGINEERING | 0.239 | 0.272 | 0.337 | 0.409 | 0.445 | 0.154 | 0.563 | 0.507 | 0.501 | 0.331 | 0.302 | 0.472 | 0.476 | 0.442 | 0.463 | 0.299 | 0.304 | 0.310 | 0.387 | 0.411 | 0.237 | 0.257 | 0.279 | 0.400 | 0.387 | 0.467 | 0.161 | 0.306 | 0.337 | 0.387 | 0.409 | 0.550 | 0.472 |
MEDICINE | 0.222 | 0.309 | 0.408 | 0.467 | 0.490 | 0.206 | 0.580 | 0.525 | 0.550 | 0.339 | 0.314 | 0.479 | 0.534 | 0.481 | 0.529 | 0.309 | 0.314 | 0.323 | 0.383 | 0.373 | 0.222 | 0.247 | 0.270 | 0.419 | 0.435 | 0.541 | 0.300 | 0.467 | 0.525 | 0.541 | 0.538 | 0.646 | 0.557 |
HUMANITIES | 0.392 | 0.441 | 0.517 | 0.571 | 0.529 | 0.262 | 0.577 | 0.571 | 0.557 | 0.423 | 0.397 | 0.533 | 0.517 | 0.508 | 0.524 | 0.434 | 0.445 | 0.423 | 0.497 | 0.494 | 0.347 | 0.401 | 0.403 | 0.478 | 0.485 | 0.552 | 0.409 | 0.461 | 0.535 | 0.517 | 0.557 | 0.608 | 0.552 |
BUSINESS | 0.309 | 0.360 | 0.477 | 0.550 | 0.681 | 0.169 | 0.653 | 0.619 | 0.639 | 0.495 | 0.444 | 0.550 | 0.619 | 0.552 | 0.603 | 0.346 | 0.373 | 0.369 | 0.456 | 0.451 | 0.344 | 0.314 | 0.327 | 0.449 | 0.465 | 0.591 | 0.378 | 0.559 | 0.598 | 0.635 | 0.616 | 0.738 | 0.701 |
LAW | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
VISION | 0.498 | 0.500 | 0.628 | 0.714 | 0.826 | 0.635 | 0.780 | 0.815 | 0.813 | 0.720 | 0.746 | 0.806 | 0.782 | 0.790 | 0.773 | 0.805 | 0.807 | 0.800 | 0.811 | 0.811 | 0.780 | 0.805 | 0.808 | 0.816 | 0.858 | 0.810 | 0.736 | 0.800 | 0.801 | 0.848 | 0.818 | 0.831 | 0.790 |
COMPOSITE AVERAGE
AVG | 0.471 | 0.479 | 0.602 | 0.685 | 0.790 | 0.584 | 0.752 | 0.782 | 0.781 | 0.679 | 0.700 | 0.768 | 0.750 | 0.749 | 0.739 | 0.751 | 0.753 | 0.746 | 0.764 | 0.764 | 0.717 | 0.741 | 0.745 | 0.769 | 0.824 | 0.775 | 0.686 | 0.757 | 0.763 | 0.821 | 0.782 | 0.806 | 0.761 |

AUDIO MODELS:

MODEL | Qwen2.5-Omni-3B | Qwen2.5-Omni-3B | Qwen2.5-Omni-3B | Qwen2.5-Omni-7B | Qwen2.5-Omni-7B | ultravox-v0_5-llama-3_1-8b | ultravox-v0_5-llama-3_1-8b | ultravox-v0_5-deepseek-r1-llama-3_1-8b | ultravox-v0_5-deepseek-r1-llama-3_1-8b | ultravox-v0_6-gemma-3-27b | ultravox-v0_6-qwen-3-32b | Voxtral-Mini-3B-2507 | Voxtral-Mini-3B-2507 | Voxtral-Small-24B-2507 |
---------------------------------------------|-----------------|-----------------|-----------------|-----------------|-----------------|----------------------------|----------------------------|----------------------------------------|----------------------------------------|---------------------------|--------------------------|----------------------|----------------------|------------------------|
params | 3.09B | 3.09B | 3.09B | 7.62B | 7.62B | 8.03B | 8.03B | 8.03B | 8.03B | 27.01B | 32.8B | 4.01B | 4.01B | 23.57B |
quant | Q4_K_H | Q6_K_H | Q8_0_H | Q4_K_H | Q6_K_H | Q4_K_H | Q6_K_H | Q6_K | Q6_K_H | Q4_K_H | Q4_K_H | Q6_K | Q6_K_H | Q4_K_H |
engine | llama.cpp version: 7154 | llama.cpp version: 7154 | llama.cpp version: 7003 | llama.cpp version: 7130 | llama.cpp version: 6937 | llama.cpp version: 7310 | llama.cpp version: 7310 | llama.cpp version: 5869 | llama.cpp version: 7484 | llama.cpp version: 5853 | llama.cpp version: 5853 | llama.cpp version: 6014 | llama.cpp version: 7410 | llama.cpp version: 7330 |
**TEST** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** |
BBA_formal_fallacies | 0.564 | 0.580 | 0.624 | 0.516 | 0.516 | 0.548 | 0.596 | 0.768 | 0.872 | 0.640 | 0.996 | 0.544 | 0.528 | 0.616 |
BBA_navigate | 0.660 | 0.636 | 0.676 | 0.760 | 0.748 | 0.728 | 0.688 | 0.988 | 0.992 | 0.716 | 0.976 | 0.664 | 0.636 | 0.688 |
BBA_object_counting | 0.652 | 0.656 | 0.692 | 0.676 | 0.648 | 0.736 | 0.844 | 0.924 | 1.000 | 0.800 | 0.984 | 0.596 | 0.560 | 0.524 |
BBA_web_of_lies | 0.648 | 0.636 | 0.608 | 0.536 | 0.592 | 0.716 | 0.756 | 0.932 | 0.880 | 0.464 | 0.784 | 0.576 | 0.636 | 0.668 |
BBA | 0.631 | 0.627 | 0.650 | 0.622 | 0.626 | 0.682 | 0.721 | 0.903 | 0.936 | 0.655 | 0.935 | 0.595 | 0.590 | 0.624 |
BBHA_formal_fallacies | - | - | - | - | - | 0.628 | - | - | 0.868 | - | - | - | 0.540 | - |
BBHA_navigate | - | - | - | - | - | 0.652 | - | - | 0.992 | - | - | - | 0.688 | - |
BBHA_object_counting | - | - | - | - | - | 0.772 | - | - | 1.000 | - | - | - | 0.700 | - |
BBHA_web_of_lies | - | - | - | - | - | 0.844 | - | - | 0.952 | - | - | - | 0.760 | - |
BBHA | - | - | - | - | - | 0.724 | - | - | 0.953 | - | - | - | 0.672 | - |
DISCIPLINES
NLP | 0.606 | 0.608 | 0.616 | 0.526 | 0.554 | 0.632 | 0.676 | 0.850 | 0.876 | 0.552 | 0.890 | 0.560 | 0.582 | 0.642 |
MATH | 0.656 | 0.646 | 0.684 | 0.718 | 0.698 | 0.732 | 0.766 | 0.956 | 0.996 | 0.758 | 0.980 | 0.630 | 0.598 | 0.606 |
SCIENCE | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
ENGINEERING | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
MEDICINE | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
HUMANITIES | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
BUSINESS | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
LAW | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
AUDIO | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
COMPOSITE AVERAGE
AVG | 0.631 | 0.627 | 0.650 | 0.622 | 0.626 | 0.682 | 0.721 | 0.903 | 0.936 | 0.655 | 0.935 | 0.595 | 0.590 | 0.624 |

MT MODELS:

MODEL | HY-MT1.5-7B | madlad400-7b-mt | madlad400-10b-mt | madlad400-10b-mt | plamo-2-translate | translategemma-4b-it | translategemma-12b-it | translategemma-12b-it | translategemma-27b-it |
---------------------------------------------|-------------|-----------------|------------------|------------------|-------------------|----------------------|-----------------------|-----------------------|-----------------------|
params | 7.50B | 8.30B | 10.71B | 10.71B | 9.53B | 3.88B | 11.77B | 11.77B | 27.01B |
quant | Q6_K_H | Q4_K_H | Q4_K_H | Q6_K_H | Q6_K_H | Q6_K_H | Q4_K_H | Q6_K_H | Q4_K_H |
engine | llama.cpp version: 7789 | llama.cpp version: 7885 | llama.cpp version: 7830 | llama.cpp version: 7772 | llama.cpp version: 7762 | llama.cpp version: 7760 | llama.cpp version: 7779 | llama.cpp version: 7779 | llama.cpp version: 7789 |
**TEST** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** | **acc** |
FLORES200_de_en | 32.2 | 42.8 | 42.3 | 42.3 | 35.5 | 37.8 | 40.3 | 40.1 | 42.2 |
FLORES200_en_de | 3.0 | 37.8 | 36.6 | 36.9 | 24.1 | 31.0 | 34.5 | 34.2 | 35.3 |
FLORES200_en_es | 25.5 | 26.8 | 26.5 | 26.5 | 21.3 | 25.1 | 27.0 | 27.3 | 27.2 |
FLORES200_en_fr | 37.2 | 50.0 | 49.7 | 49.8 | 36.5 | 40.9 | 44.7 | 44.2 | 45.2 |
FLORES200_en_ja | 31.0 | 22.8 | 22.2 | 22.3 | 28.2 | 27.7 | 30.7 | 30.8 | 31.0 |
FLORES200_en_ru | 22.6 | 29.5 | 28.7 | 28.8 | 18.2 | 25.4 | 27.5 | 28.1 | 28.2 |
FLORES200_en_zh | 37.0 | 37.2 | 37.2 | 37.3 | 33.6 | 36.6 | 39.7 | 39.5 | 41.9 |
FLORES200_es_en | 23.9 | 29.4 | 29.1 | 29.2 | 24.7 | 28.0 | 28.7 | 29.0 | 29.9 |
FLORES200_fr_en | 32.6 | 44.3 | 44.1 | 44.3 | 36.4 | 38.8 | 41.2 | 41.7 | 43.4 |
FLORES200_ja_en | 20.5 | 26.8 | 25.9 | 25.9 | 23.4 | 21.5 | 24.1 | 23.9 | 26.6 |
FLORES200_ru_en | 26.7 | 34.9 | 34.7 | 34.9 | 29.0 | 30.7 | 33.0 | 32.7 | 34.8 |
FLORES200_zh_en | 22.9 | 28.3 | 27.3 | 27.3 | 23.4 | 23.5 | 25.7 | 25.6 | 27.8 |
FLORES200 | 26.3 | 34.2 | 33.7 | 33.8 | 27.9 | 30.6 | 33.1 | 33.1 | 34.4 |
OPUS_de_en | 25.9 | 35.6 | 35.6 | 27.1 | 21.3 | 28.2 | 29.8 | 29.2 | 30.5 |
OPUS_en_de | 11.0 | 33.2 | 32.9 | 30.7 | 21.1 | 25.5 | 26.5 | 25.5 | 27.1 |
OPUS_en_es | 29.0 | 37.7 | 37.1 | 37.1 | 27.9 | 31.1 | 32.0 | 31.3 | 32.4 |
OPUS_en_fr | 24.7 | 34.3 | 34.2 | 34.3 | 26.7 | 27.0 | 28.8 | 28.8 | 29.0 |
OPUS_en_ja | 10.2 | 16.0 | 16.0 | 15.9 | 11.9 | 9.8 | 10.5 | 10.5 | 11.7 |
OPUS_en_ru | 22.0 | 31.4 | 31.7 | 31.6 | 19.7 | 23.1 | 24.9 | 24.4 | 25.5 |
OPUS_en_zh | 28.8 | 41.2 | 41.1 | 41.3 | 26.7 | 26.4 | 28.9 | 28.8 | 29.9 |
OPUS_es_en | 28.4 | 40.6 | 40.2 | 40.3 | 26.7 | 33.2 | 35.0 | 34.5 | 36.4 |
OPUS_fr_en | 25.8 | 36.4 | 35.8 | 35.8 | 29.6 | 29.2 | 30.9 | 30.6 | 32.3 |
OPUS_ja_en | 14.8 | 19.4 | 18.9 | 18.8 | 16.0 | 15.1 | 17.0 | 16.2 | 16.8 |
OPUS_ru_en | 24.5 | 35.2 | 34.7 | 34.8 | 27.1 | 27.1 | 29.0 | 28.3 | 30.2 |
OPUS_zh_en | 26.1 | 39.4 | 38.5 | 38.8 | 23.8 | 22.1 | 25.5 | 24.9 | 27.6 |
OPUS | 22.6 | 33.4 | 33.1 | 32.2 | 23.2 | 24.8 | 26.6 | 26.1 | 27.4 |
DE_EN | 28.0 | 38.0 | 37.8 | 32.1 | 26.0 | 31.4 | 33.3 | 32.8 | 34.4 |
EN_DE | 8.3 | 34.7 | 34.1 | 32.7 | 22.1 | 27.3 | 29.2 | 28.3 | 29.8 |
ES_EN | 26.8 | 36.8 | 36.4 | 36.5 | 26.0 | 31.4 | 32.8 | 32.6 | 34.2 |
EN_ES | 27.8 | 34.0 | 33.5 | 33.5 | 25.6 | 29.0 | 30.2 | 29.9 | 30.6 |
FR_EN | 28.0 | 39.0 | 38.6 | 38.6 | 31.9 | 32.4 | 34.3 | 34.3 | 36.0 |
EN_FR | 28.9 | 39.5 | 39.4 | 39.5 | 30.0 | 31.6 | 34.1 | 33.9 | 34.4 |
RU_EN | 25.2 | 35.1 | 34.7 | 34.8 | 27.7 | 28.2 | 30.3 | 29.8 | 31.7 |
EN_RU | 22.2 | 30.7 | 30.7 | 30.6 | 19.1 | 23.8 | 25.7 | 25.6 | 26.3 |
JA_EN | 16.6 | 21.8 | 21.2 | 21.2 | 18.4 | 17.2 | 19.3 | 18.8 | 20.1 |
EN_JA | 17.2 | 18.2 | 18.0 | 18.0 | 17.3 | 15.8 | 17.2 | 17.3 | 18.1 |
ZH_EN | 25.0 | 35.6 | 34.7 | 34.9 | 23.6 | 22.5 | 25.5 | 25.1 | 27.6 |
EN_ZH | 31.5 | 39.8 | 39.8 | 39.9 | 29.0 | 29.8 | 32.4 | 32.3 | 33.9 |
COMPOSITE AVERAGE
AVG | 23.8 | 33.6 | 33.2 | 32.7 | 24.7 | 26.7 | 28.7 | 28.4 | 29.8 |
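The MT COMPOSITE AVERAGE row is consistent with a plain arithmetic mean over the twelve per-direction averages (DE_EN through EN_ZH). A minimal sketch of that check, with values copied from the HY-MT1.5-7B column above:

```python
# Sanity check: mean of the twelve per-direction BLEU averages for the
# HY-MT1.5-7B column reproduces the COMPOSITE AVERAGE AVG entry (23.8).
directions = {
    "DE_EN": 28.0, "EN_DE": 8.3, "ES_EN": 26.8, "EN_ES": 27.8,
    "FR_EN": 28.0, "EN_FR": 28.9, "RU_EN": 25.2, "EN_RU": 22.2,
    "JA_EN": 16.6, "EN_JA": 17.2, "ZH_EN": 25.0, "EN_ZH": 31.5,
}
composite = round(sum(directions.values()) / len(directions), 1)
print(composite)  # 23.8
```

The same mean applied to the madlad400-7b-mt column reproduces its composite entry of 33.6 as well; how FLORES200 and OPUS sentence scores are weighted into each per-direction average is not shown here.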