BubbleQ committed
Commit 4336d6c · verified · 1 Parent(s): 3ae963a

Update README.md

Files changed (1):
  1. README.md (+28 -28)
README.md CHANGED
@@ -88,44 +88,44 @@ The base and instruction tuned + DPO models have the following architecture:
  | | GPQA (0-shot) | 35.3 | 31.03 | 33.9 | 35.7 | 30.1 | 35.5 |
  | | AGIEval (0-shot) | 52.3 | 48.3* | 51.7 | 55.7 | 54.3 | 56 |
  | | BBH (3-shot, cot) | 77.9 | 75.6 | 78.1 | 80.1 | 75.4 | 81.2 |
- | **Others** | HellaSwag (0-shot) | 80.5 | 80* | 78.7 | 81.5 | 80 | 81.2 |
+ | | HellaSwag (0-shot) | 80.5 | 80* | 78.7 | 81.5 | 80 | 81.2 |
  | | Triviaqa (5-shot) | 69.6 | 60.8* | 56.3 | 62.1 | 60.9 | 65.6 |
  | | Naturalqs (5-shot) | 37.5 | 23.46 | 25.7 | 29.1 | 28 | 30.7 |
  | | PIQA (0-shot) | 81.6 | 80.14 | 79.5 | 81.9 | 82 | 80.7 |
  | | OpenBookQA (0-shot) | 37.8 | 34.2 | 35 | 35.6 | 38.2 | 34.6 |
- | | Average | 69.66 | - | 66.14 | 68.86 | 65.43 | 69.65 |
+ | | | 69.66 | - | 66.14 | 68.86 | 65.43 | 69.65 |

  Note:
  1. `*`During pretraining, we found that the HumanEval metric fluctuated significantly and was extremely sensitive to formatting. Therefore, we referred to the prompt from Ling-series paper to modify the original HumanEval. The results in the table are the evaluation metrics after this modification.
  2. For Mimo-base-7B, the results marked with `*` are sourced from their public report, other evaluations are conducted based on internal evaluation frameworks.

  ### Klear-46B-A2.5B-Inst. Evaluation Results
- | Ability | Benchmark | Klear-46B-A2.5B-inst. | InternLM3-8B-Inst. | MiniCPM4-8B | Qwen3-8B (NoThink) | gemma3-12b-it | Phi4-14B | Qwen3-30B-A3B-2507 |
- | ------------------------- | --------------------------- | --------------- | --------------------- | ----------- | ------------------ | ------------- | -------- | ------------------ |
- | | # Total Params | 46B | 8B | 8B | 8B | 12B | 14B | 30B |
- | | # Activated Params | 2.5B | 8B | 8B | 8B | 12B | 14B | 3B |
- | **English Understanding** | MMLU-Redux | 81.61 | 74.65 | 77.63 | 79.32 | 78.39 | 83.09 | 88.11 |
- | | MMLU-Pro | 63.47 | 50.87 | 54.69 | 63.8 | 60.69 | 67.25 | 78.22 |
- | | GPQA-Diamoind | 47.85 | 38.76 | 38.51 | 51.77 | 39.02 | 59.47 | 71.21 |
- | | SimpleQA | 6.52 | 4.44 | 3.51 | 5.5 | 6.22 | 3.28 | 23.39 |
- | **Chinese Understanding** | CLUEWSC | 88.16 | 77.63 | 81.91 | 82.89 | 91.12 | 88.16 | 92.11 |
- | | CEval | 83.99 | 84.26 | 81.78 | 81.66 | 60.81 | 64.79 | 88.57 |
- | | C-SimpleQA | 42.3 | 25.87 | 23.13 | 37.07 | 28.97 | 24.77 | 75.37 |
- | **Math & Reasoning** | MATH500 | 82.8 | 68.4 | 79.8 | 85 | 86.8 | 80.6 | 97.2 |
- | | AIME24 | 25.62 | 11.25 | 22.92 | 28.33 | 23.96 | 15.83 | 75 |
- | | AIME25 | 18.12 | 8.12 | 15.21 | 20.62 | 18.33 | 18.75 | 61.88 |
- | **Code** | HumanEval | 87.8 | 82.3* | 74.39 | 83.54 | 82.32 | 85.37 | 81.71 |
- | | HumanEval+ | 81.1 | - | 70.12 | 76.83 | 75.61 | 83.54 | 76.83 |
- | | MBPPEvalplus | 83.1 | 62.4 | 82 | 76.2 | 85.7 | 77.5 | 89.4 |
- | | MBPPEvalplus++ | 70.4 | 50.4 | 69.3 | 66.1 | 74.1 | 66.7 | 75.1 |
- | | LiveCodeBench v5(2408-2501) | 28.67 | 14.7 | 12.19 | 27.24 | 24.73 | 23.66 | 41.22 |
- | **Instruction Following** | IF-Eval | 80.04 | 79.3 | 73.01 | 84.47 | 81.52 | 59.33 | 83.92 |
- | | Multi-IF(en+zh) | 78.73 | 62.53 | 61.79 | 78.95 | 76.56 | 62.7 | 77.75 |
- | **Comprehensive Ability** | MTBench | 8.23 | 7.86 | 6.875 | 8.21 | 8.675 | 8.625 | 9.33 |
- | | MT-Eval | 8.11 | 7.36 | 6.7 | 8.18 | 8.45 | 8.12 | - |
- | | AlignBench v1.1 | 6.85 | 6.13 | 5.99 | 6.95 | 6.3 | 6.33 | 7.06 |
- | | LiveBench 1125 | 50.1 | 26.3 | 25.5 | 52.1 | 43.1 | 40 | 68.4 |
- | | Average | 53.50 | - | 46.05 | 52.61 | 50.54 | 48.95 | - |
+ | Ability | Benchmark | Klear-46B-A2.5B | InternLM3-8B-Instruct | MiniCPM4-8B | Qwen3-8B (NoThink) | gemma3-12b-it | Phi4-14B | Qwen3-30B-A3B-2507 |
+ | ------------- | --------------------------- | --------------- | --------------------- | ----------- | ------------------ | ------------- | -------- | ------------------ |
+ | | # Total Params | 46B | 8B | 8B | 8B | 12B | 14B | 30B |
+ | | # Activated Params | 2.5B | 8B | 8B | 8B | 12B | 14B | 3B |
+ | **General** | MMLU-Redux | 81.61 | 74.65 | 77.63 | 79.32 | 78.39 | 83.09 | 88.11 |
+ | | MMLU-Pro | 63.47 | 50.87 | 54.69 | 63.8 | 60.69 | 67.25 | 78.22 |
+ | | GPQA-Diamoind | 47.85 | 38.76 | 38.51 | 51.77 | 39.02 | 59.47 | 71.21 |
+ | | SimpleQA | 6.52 | 4.44 | 3.51 | 5.5 | 6.22 | 3.28 | 23.39 |
+ | | CLUEWSC | 88.16 | 77.63 | 81.91 | 82.89 | 91.12 | 88.16 | 92.11 |
+ | | CEval | 83.99 | 84.26 | 81.78 | 81.66 | 60.81 | 64.79 | 88.57 |
+ | | C-SimpleQA | 42.3 | 25.87 | 23.13 | 37.07 | 28.97 | 24.77 | 75.37 |
+ | | LiveBench 1125 | 50.1 | 26.3 | 25.5 | 52.1 | 43.1 | 40 | 68.4 |
+ | **Math** | MATH500 | 82.8 | 68.4 | 79.8 | 85 | 86.8 | 80.6 | 97.2 |
+ | | AIME24 | 25.62 | 11.25 | 22.92 | 28.33 | 23.96 | 15.83 | 75 |
+ | | AIME25 | 18.12 | 8.12 | 15.21 | 20.62 | 18.33 | 18.75 | 61.88 |
+ | **Code** | HumanEval | 87.8 | 82.3* | 74.39 | 83.54 | 82.32 | 85.37 | 81.71 |
+ | | HumanEval+ | 81.1 | - | 70.12 | 76.83 | 75.61 | 83.54 | 76.83 |
+ | | MBPPEvalplus | 83.1 | 62.4 | 82 | 76.2 | 85.7 | 77.5 | 89.4 |
+ | | MBPPEvalplus++ | 70.4 | 50.4 | 69.3 | 66.1 | 74.1 | 66.7 | 75.1 |
+ | | LiveCodeBench v5(2408-2501) | 28.67 | 14.7 | 12.19 | 27.24 | 24.73 | 23.66 | 41.22 |
+ | **Alignment** | IF-Eval | 80.04 | 79.3 | 73.01 | 84.47 | 81.52 | 59.33 | 83.92 |
+ | | Multi-IF(en+zh) | 78.73 | 62.53 | 61.79 | 78.95 | 76.56 | 62.7 | 77.75 |
+ | | MTBench | 8.23 | 7.86 | 6.875 | 8.21 | 8.675 | 8.625 | 9.33 |
+ | | MT-Eval | 8.11 | 7.36 | 6.7 | 8.18 | 8.45 | 8.12 | - |
+ | | AlignBench v1.1 | 6.85 | 6.13 | 5.99 | 6.95 | 6.3 | 6.33 | 7.06 |
+ | | Average | 53.50 | - | 46.05 | 52.61 | 50.54 | 48.95 | - |

  Note:
  1. For InternLM3-8B-Instruct, the results marked with `*` are sourced from their public report, other evaluations are conducted based on internal evaluation frameworks.
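
For reference, the "Average" row of the instruct table matches the unweighted arithmetic mean of the 21 benchmark rows above it, with MTBench, MT-Eval, and AlignBench kept on their 0-10 scale rather than rescaled. A minimal sketch of that computation (an assumption inferred from the table values, not code from this repository):

```python
# Sketch: reproduce the "Average" cell for the Klear-46B-A2.5B-Inst. column as the
# plain mean of its 21 benchmark scores (0-10 scale rows are not rescaled).
klear_inst_scores = {
    "MMLU-Redux": 81.61, "MMLU-Pro": 63.47, "GPQA-Diamond": 47.85, "SimpleQA": 6.52,
    "CLUEWSC": 88.16, "CEval": 83.99, "C-SimpleQA": 42.3, "LiveBench 1125": 50.1,
    "MATH500": 82.8, "AIME24": 25.62, "AIME25": 18.12,
    "HumanEval": 87.8, "HumanEval+": 81.1, "MBPPEvalplus": 83.1, "MBPPEvalplus++": 70.4,
    "LiveCodeBench v5": 28.67,
    "IF-Eval": 80.04, "Multi-IF": 78.73, "MTBench": 8.23, "MT-Eval": 8.11,
    "AlignBench v1.1": 6.85,
}

average = sum(klear_inst_scores.values()) / len(klear_inst_scores)
print(f"{average:.2f}")  # 53.50, matching the "Average" cell in the new table
```

The same rule also reproduces the MiniCPM4-8B column average of 46.05, which is consistent with a straight per-column mean over all listed benchmarks.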