BubbleQ committed
Commit fa963d2 · verified · 1 Parent(s): 5441d7d

Update README.md

Files changed (1)
  1. README.md +33 -51
README.md CHANGED
@@ -1,13 +1,3 @@
- ---
- license: apache-2.0
- language:
- - zh
- - en
- base_model:
- - Kwai-Klear/Klear-46B-A2.5B-Base
- pipeline_tag: text-generation
- library_name: transformers
- ---
  # Klear

  <div align="center">
@@ -20,7 +10,7 @@ library_name: transformers

  ## 🔥News

- - 2025.09.05: We released `Klear-46B-A2.5B` series. Currently, Klear-46B-A2.5B offers two versions: `a base model` and an advanced version that includes `instruction tuned` model. Additionally, `an reasoning version is currently in training`. Please stay tuned for more updates.
+ - 2025.09.05: We released the `Klear-46B-A2.5B` series. Klear-46B-A2.5B is currently available in two versions: a base model and an instruction-tuned + DPO model. A reasoning version is in training; please stay tuned for updates.


  ## 1. Introduction
@@ -44,7 +34,7 @@ As a result, Klear-46B-A2.5B-Base matches or surpasses the performance of dense

  ## Model Summary

- The base and instruction-tuned models have the following architecture:
+ The base and instruction-tuned + DPO models have the following architecture:

  | **Property**              | **Value**                                                              |
  |---------------------------|------------------------------------------------------------------------|
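The architecture values listed in this table can be read directly from the released checkpoint's configuration. A minimal sketch, assuming the `Kwai-Klear/Klear-46B-A2.5B-Base` repo id from the front matter above and that the custom MoE modeling code requires `trust_remote_code=True`:

```python
# Minimal sketch: print the checkpoint config so its fields can be compared
# against the Model Summary table. The repo id comes from the model card's
# front matter; trust_remote_code is an assumption for the custom MoE code.
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "Kwai-Klear/Klear-46B-A2.5B-Base",
    trust_remote_code=True,
)
print(config)
```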
@@ -82,7 +72,7 @@ The base and instruction-tuned models have the following architecture:
  | **Code** | HumanEval (0-shot*) | 89 | - | 84.1 | 87.8 | 83.5 | 90.9 |
  | | MBPP (3-shot) | 76 | 55.2 | 69 | 74 | 66.6 | 75.6 |
  | **Math** | MATH (4-shot, cot) | 55.7 | 36.78 | 58.4 | 57.1 | 56.98 | 57.6 |
- | | CMATH (3-shot) | 87.8 | 78.5 | 88.3 | 90.7 | 85.7 | 89.7 |
+ | | CMATH (3-shot) | 87.83 | 78.5 | 88.3 | 90.7 | 85.7 | 89.7 |
  | | GSM8K (4-shot, cot) | 87.3 | 78.47 | 89.4 | 90.3 | 87.6 | 91.1 |
  | **General** | MMLU-Pro (5-shot, cot) | 57.6 | 43.1 | 55.2 | 58.1 | 49.9 | 58.8 |
  | | MMLU (5-shot) | 80.5 | 69.24 | 77.1 | 80.6 | 73.7 | 80.4 |
@@ -92,46 +82,45 @@ The base and instruction-tuned models have the following architecture:
  | | AGIEval (0-shot) | 52.3 | 48.3* | 51.7 | 55.7 | 54.3 | 56 |
  | | BBH (3-shot, cot) | 77.9 | 75.6 | 78.1 | 80.1 | 75.4 | 81.2 |
  | **Others** | HellaSwag (0-shot) | 80.5 | 80* | 78.7 | 81.5 | 80 | 81.2 |
- | | Winogrande (3-shot) | 78.8 | 78* | 73.6 | 78.5 | 72.1 | 77.9 |
  | | Triviaqa (5-shot) | 69.6 | 60.8* | 56.3 | 62.1 | 60.9 | 65.6 |
  | | Naturalqs (5-shot) | 37.5 | 23.46 | 25.7 | 29.1 | 28 | 30.7 |
  | | PIQA (0-shot) | 81.6 | 80.14 | 79.5 | 81.9 | 82 | 80.7 |
- | | SIQA (0-shot) | 67.9 | 51.74 | 56.2 | 58.4 | 56.3 | 56.3 |
  | | OpenBookQA (0-shot) | 37.8 | 34.2 | 35 | 35.6 | 38.2 | 34.6 |
+ | | Average | 69.66 | - | 66.14 | 68.86 | 65.43 | 69.65 |

  Note:
  1. `*` During pretraining, we found that the HumanEval metric fluctuated significantly and was extremely sensitive to formatting. We therefore used the prompt from the Ling-series paper to modify the original HumanEval; the results in the table are the evaluation metrics after this modification.
- 2. For Mimo-base-7B, the results marked with `*` are sourced from other public reports.
+ 2. For Mimo-base-7B, the results marked with `*` are sourced from their public report; the remaining results were obtained with our internal evaluation framework.

  ### Klear-46B-A2.5B-Inst. Evaluation Results
- | Ability | Benchmark | Klear-46B-A2.5B-Inst. | MiniCPM4-8B | Qwen3-8B (NoThink) | gemma3-12b-it | Phi4-14B |
- | ------------------------- | --------------------------- | --------------- | ----------- | ------------------ | ------------- | -------- |
- | | # Total Params | 46B | 8B | 8B | 12B | 14B |
- | | # Activated Params | 2.5B | 8B | 8B | 12B | 14B |
- | **English Understanding** | MMLU-Redux | 82.23 | 77.63 | 79.32 | 78.39 | 83.09 |
- | | MMLU-Pro | 64.82 | 54.69 | 63.8 | 60.69 | 67.25 |
- | | GPQA-Diamoind | 49.49 | 38.51 | 51.77 | 39.02 | 59.47 |
- | | SimpleQA | 5.94 | 3.51 | 5.5 | 6.22 | 3.28 |
- | **Chinese Understanding** | CLUEWSC | 88.82 | 81.91 | 82.89 | 91.12 | 88.16 |
- | | CEval | 84.29 | 81.78 | 81.66 | 60.81 | 64.79 |
- | | C-SimpleQA | 42.03 | 23.13 | 37.07 | 28.97 | 24.77 |
- | **Math & Reasoning** | MATH500 | 86.4 | 79.8 | 85 | 86.8 | 80.6 |
- | | AIME24 | 30.42 | 22.92 | 28.33 | 23.96 | 15.83 |
- | | AIME25 | 21.04 | 15.21 | 20.62 | 18.33 | 18.75 |
- | | ZebraLogic | 46.4 | 8.5 | 25.7 | 18 | 30.3 |
- | **Code** | HumanEval | 89.63 | 74.39 | 83.54 | 82.32 | 85.37 |
- | | HumanEval+ | 87.2 | 70.12 | 76.83 | 75.61 | 83.54 |
- | | MBPPEvalplus | 79.6 | 82 | 76.2 | 85.7 | 77.5 |
- | | MBPPEvalplus++ | 68.5 | 69.3 | 66.1 | 74.1 | 66.7 |
- | | LiveCodeBench v5(2408-2501) | 29.75 | 12.19 | 27.24 | 24.73 | 23.66 |
- | **Instruction Following** | IF-Eval | 80.41 | 73.01 | 84.47 | 81.52 | 59.33 |
- | | Multi-IF(en+zh) | 78.25 | 61.79 | 78.95 | 76.56 | 62.7 |
- | **Comprehensive Ability** | MTBench | 8.03 | 6.875 | 8.21 | 8.675 | 8.625 |
- | | MT-Eval | 8.1 | 6.7 | 8.18 | 8.45 | 8.12 |
- | | Arena-Hard v2 | 19.8 | 2.2 | 19.8 | 50 | 9.6 |
- | | AlignBench v1.1 | 6.8 | 5.99 | 6.95 | 6.3 | 6.33 |
- | | LiveBench 1125 | 48.7 | 25.5 | 52.1 | 43.1 | 40 |
-
+ | Ability | Benchmark | Klear-46B-A2.5B-Inst. | InternLM3-8B-Instruct | MiniCPM4-8B | Qwen3-8B (NoThink) | gemma3-12b-it | Phi4-14B | Qwen3-30B-A3B-2507 |
+ | ------------------------- | --------------------------- | --------------- | --------------------- | ----------- | ------------------ | ------------- | -------- | ------------------ |
+ | | # Total Params | 46B | 8B | 8B | 8B | 12B | 14B | 30B |
+ | | # Activated Params | 2.5B | 8B | 8B | 8B | 12B | 14B | 3B |
+ | **English Understanding** | MMLU-Redux | 82.16 | 74.65 | 77.63 | 79.32 | 78.39 | 83.09 | 88.11 |
+ | | MMLU-Pro | 63.86 | 50.87 | 54.69 | 63.8 | 60.69 | 67.25 | 78.22 |
+ | | GPQA-Diamond | 49.24 | 38.76 | 38.51 | 51.77 | 39.02 | 59.47 | 71.21 |
+ | | SimpleQA | 6.52 | 4.44 | 3.51 | 5.5 | 6.22 | 3.28 | 23.39 |
+ | **Chinese Understanding** | CLUEWSC | 88.16 | 77.63 | 81.91 | 82.89 | 91.12 | 88.16 | 92.11 |
+ | | CEval | 83.99 | 84.26 | 81.78 | 81.66 | 60.81 | 64.79 | 88.57 |
+ | | C-SimpleQA | 42.3 | 25.87 | 23.13 | 37.07 | 28.97 | 24.77 | 75.37 |
+ | **Math & Reasoning** | MATH500 | 82.8 | 68.4 | 79.8 | 85 | 86.8 | 80.6 | 97.2 |
+ | | AIME24 | 25.62 | 11.25 | 22.92 | 28.33 | 23.96 | 15.83 | 75 |
+ | | AIME25 | 18.12 | 8.12 | 15.21 | 20.62 | 18.33 | 18.75 | 61.88 |
+ | **Code** | HumanEval | 87.8 | 82.3* | 74.39 | 83.54 | 82.32 | 85.37 | 81.71 |
+ | | HumanEval+ | 81.1 | - | 70.12 | 76.83 | 75.61 | 83.54 | 76.83 |
+ | | MBPPEvalplus | 83.1 | 62.4 | 82 | 76.2 | 85.7 | 77.5 | 89.4 |
+ | | MBPPEvalplus++ | 70.4 | 50.4 | 69.3 | 66.1 | 74.1 | 66.7 | 75.1 |
+ | | LiveCodeBench v5 (2408-2501) | 28.67 | 14.7 | 12.19 | 27.24 | 24.73 | 23.66 | 41.22 |
+ | **Instruction Following** | IF-Eval | 80.04 | 79.3 | 73.01 | 84.47 | 81.52 | 59.33 | 83.92 |
+ | | Multi-IF (en+zh) | 78.73 | 62.53 | 61.79 | 78.95 | 76.56 | 62.7 | 77.75 |
+ | **Comprehensive Ability** | MTBench | 8.23 | 7.86 | 6.875 | 8.21 | 8.675 | 8.625 | 9.33 |
+ | | MT-Eval | 8.11 | 7.36 | 6.7 | 8.18 | 8.45 | 8.12 | - |
+ | | AlignBench v1.1 | 6.85 | 6.13 | 5.99 | 6.95 | 6.3 | 6.33 | 7.06 |
+ | | LiveBench 1125 | 50.1 | 26.3 | 25.5 | 52.1 | 43.1 | 40 | 68.4 |
+ | | Average | 53.61 | - | 46.05 | 52.61 | 50.54 | 48.95 | - |
+ Note:
+ 1. For InternLM3-8B-Instruct, the results marked with `*` are sourced from their public report; the remaining results were obtained with our internal evaluation framework.

  ## 3. Quick start

@@ -223,11 +212,4 @@ outputs = llm.generate([prompt], sampling_params)

  print(outputs[0].outputs[0].text)

- ```
-
- ## Citation
-
- If you find `Klear-46B-A2.5B` is useful or want to use in your projects, please kindly cite our paper:
-
- ```
  ```
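The `llm.generate([prompt], sampling_params)` / `print(outputs[0].outputs[0].text)` tail kept as context in this last hunk follows vLLM's offline-inference API. A minimal runnable sketch consistent with it, assuming the instruct checkpoint is published under a repo id like `Kwai-Klear/Klear-46B-A2.5B-Instruct` (hypothetical) and loads with `trust_remote_code=True`:

```python
# Minimal vLLM sketch matching the snippet tail shown in the Quick start hunk.
# The repo id below is an assumption; substitute the actual instruct checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Kwai-Klear/Klear-46B-A2.5B-Instruct",  # hypothetical repo id
    trust_remote_code=True,
)
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=512)

# For the instruct model the prompt would normally be built with the tokenizer's
# chat template; a raw string is used here to keep the sketch self-contained.
prompt = "Give me a short introduction to large language models."
outputs = llm.generate([prompt], sampling_params)

print(outputs[0].outputs[0].text)
```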
 
 
 
 
 
 
 
 
 
 
 