Update README.md
README.md CHANGED
@@ -1,13 +1,3 @@
----
-license: apache-2.0
-language:
-- zh
-- en
-base_model:
-- Kwai-Klear/Klear-46B-A2.5B-Base
-pipeline_tag: text-generation
-library_name: transformers
----
 # Klear
 
 <div align="center">
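Although this commit drops the YAML frontmatter, that metadata still records how the checkpoint is meant to be consumed: `library_name: transformers`, `pipeline_tag: text-generation`, and `base_model: Kwai-Klear/Klear-46B-A2.5B-Base`. A minimal loading sketch under those assumptions (not part of this commit; the dtype, device, and prompt choices are illustrative):

```python
# Minimal loading sketch based on the removed frontmatter above; not part of
# this commit. dtype/device/prompt choices are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Kwai-Klear/Klear-46B-A2.5B-Base"  # base_model from the frontmatter
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",      # let transformers pick the checkpoint's dtype
    device_map="auto",       # shard the 46B-parameter MoE across available GPUs
    trust_remote_code=True,  # in case the MoE architecture ships custom modeling code
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```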
@@ -20,7 +10,7 @@ library_name: transformers
 
 ## 🔥News
 
-- 2025.09.05: We released the `Klear-46B-A2.5B` series. Klear-46B-A2.5B currently offers two versions: a `base` model and an advanced `instruction-tuned` model. Additionally, a `reasoning` version is currently in training. Please stay tuned for more updates.
+- 2025.09.05: We released the `Klear-46B-A2.5B` series. Klear-46B-A2.5B currently offers two versions: a `base` model and an advanced `instruction-tuned + DPO` model. Additionally, a `reasoning` version is currently in training. Please stay tuned for more updates.
 
 
 ## 1. Introduction
@@ -44,7 +34,7 @@ As a result, Klear-46B-A2.5B-Base matches or surpasses the performance of dense
 
 ## Model Summary
 
-The base and instruction-tuned models have the following architecture:
+The base and instruction-tuned + DPO models have the following architecture:
 
 | **property** | **value** |
 |---------------------------|------------------------------------------------------------------------|
@@ -82,7 +72,7 @@ The base and instruction-tuned models have the following architecture:
 | **Code** | HumanEval (0-shot*) | 89 | - | 84.1 | 87.8 | 83.5 | 90.9 |
 | | MBPP (3-shot) | 76 | 55.2 | 69 | 74 | 66.6 | 75.6 |
 | **Math** | MATH (4-shot, cot) | 55.7 | 36.78 | 58.4 | 57.1 | 56.98 | 57.6 |
-| | CMATH (3-shot) | 87.
+| | CMATH (3-shot) | 87.83 | 78.5 | 88.3 | 90.7 | 85.7 | 89.7 |
 | | GSM8K (4-shot, cot) | 87.3 | 78.47 | 89.4 | 90.3 | 87.6 | 91.1 |
 | **General** | MMLU-Pro (5-shot, cot) | 57.6 | 43.1 | 55.2 | 58.1 | 49.9 | 58.8 |
 | | MMLU (5-shot) | 80.5 | 69.24 | 77.1 | 80.6 | 73.7 | 80.4 |
@@ -92,46 +82,45 @@ The base and instruction-tuned models have the following architecture:
 | | AGIEval (0-shot) | 52.3 | 48.3* | 51.7 | 55.7 | 54.3 | 56 |
 | | BBH (3-shot, cot) | 77.9 | 75.6 | 78.1 | 80.1 | 75.4 | 81.2 |
 | **Others** | HellaSwag (0-shot) | 80.5 | 80* | 78.7 | 81.5 | 80 | 81.2 |
-| | Winogrande (3-shot) | 78.8 | 78* | 73.6 | 78.5 | 72.1 | 77.9 |
 | | TriviaQA (5-shot) | 69.6 | 60.8* | 56.3 | 62.1 | 60.9 | 65.6 |
 | | NaturalQuestions (5-shot) | 37.5 | 23.46 | 25.7 | 29.1 | 28 | 30.7 |
 | | PIQA (0-shot) | 81.6 | 80.14 | 79.5 | 81.9 | 82 | 80.7 |
-| | SIQA (0-shot) | 67.9 | 51.74 | 56.2 | 58.4 | 56.3 | 56.3 |
 | | OpenBookQA (0-shot) | 37.8 | 34.2 | 35 | 35.6 | 38.2 | 34.6 |
+| | Average | 69.66 | - | 66.14 | 68.86 | 65.43 | 69.65 |
 
 Note:
 1. `*` During pretraining, we found that the HumanEval metric fluctuated significantly and was extremely sensitive to formatting, so we adopted the prompt from the Ling-series paper to modify the original HumanEval. The results in the table are the metrics after this modification.
-2. For Mimo-base-7B, the results marked with `*` are sourced from
+2. For Mimo-base-7B, the results marked with `*` are sourced from their public report; all other results were obtained with our internal evaluation framework.
 
 ### Klear-46B-A2.5B-Inst. Evaluation Results
-[previous Inst. results table; its contents are truncated in the page capture]
+| Ability | Benchmark | Klear-46B-A2.5B-Inst. | InternLM3-8B-Instruct | MiniCPM4-8B | Qwen3-8B (NoThink) | gemma3-12b-it | Phi4-14B | Qwen3-30B-A3B-2507 |
+| ------------------------- | --------------------------- | --------------------- | --------------------- | ----------- | ------------------ | ------------- | -------- | ------------------ |
+| | # Total Params | 46B | 8B | 8B | 8B | 12B | 14B | 30B |
+| | # Activated Params | 2.5B | 8B | 8B | 8B | 12B | 14B | 3B |
+| **English Understanding** | MMLU-Redux | 82.16 | 74.65 | 77.63 | 79.32 | 78.39 | 83.09 | 88.11 |
+| | MMLU-Pro | 63.86 | 50.87 | 54.69 | 63.8 | 60.69 | 67.25 | 78.22 |
+| | GPQA-Diamond | 49.24 | 38.76 | 38.51 | 51.77 | 39.02 | 59.47 | 71.21 |
+| | SimpleQA | 6.52 | 4.44 | 3.51 | 5.5 | 6.22 | 3.28 | 23.39 |
+| **Chinese Understanding** | CLUEWSC | 88.16 | 77.63 | 81.91 | 82.89 | 91.12 | 88.16 | 92.11 |
+| | CEval | 83.99 | 84.26 | 81.78 | 81.66 | 60.81 | 64.79 | 88.57 |
+| | C-SimpleQA | 42.3 | 25.87 | 23.13 | 37.07 | 28.97 | 24.77 | 75.37 |
+| **Math & Reasoning** | MATH500 | 82.8 | 68.4 | 79.8 | 85 | 86.8 | 80.6 | 97.2 |
+| | AIME24 | 25.62 | 11.25 | 22.92 | 28.33 | 23.96 | 15.83 | 75 |
+| | AIME25 | 18.12 | 8.12 | 15.21 | 20.62 | 18.33 | 18.75 | 61.88 |
+| **Code** | HumanEval | 87.8 | 82.3* | 74.39 | 83.54 | 82.32 | 85.37 | 81.71 |
+| | HumanEval+ | 81.1 | - | 70.12 | 76.83 | 75.61 | 83.54 | 76.83 |
+| | MBPPEvalplus | 83.1 | 62.4 | 82 | 76.2 | 85.7 | 77.5 | 89.4 |
+| | MBPPEvalplus++ | 70.4 | 50.4 | 69.3 | 66.1 | 74.1 | 66.7 | 75.1 |
+| | LiveCodeBench v5 (2408-2501) | 28.67 | 14.7 | 12.19 | 27.24 | 24.73 | 23.66 | 41.22 |
+| **Instruction Following** | IF-Eval | 80.04 | 79.3 | 73.01 | 84.47 | 81.52 | 59.33 | 83.92 |
+| | Multi-IF (en+zh) | 78.73 | 62.53 | 61.79 | 78.95 | 76.56 | 62.7 | 77.75 |
+| **Comprehensive Ability** | MTBench | 8.23 | 7.86 | 6.875 | 8.21 | 8.675 | 8.625 | 9.33 |
+| | MT-Eval | 8.11 | 7.36 | 6.7 | 8.18 | 8.45 | 8.12 | - |
+| | AlignBench v1.1 | 6.85 | 6.13 | 5.99 | 6.95 | 6.3 | 6.33 | 7.06 |
+| | LiveBench 1125 | 50.1 | 26.3 | 25.5 | 52.1 | 43.1 | 40 | 68.4 |
+| | Average | 53.61 | - | 46.05 | 52.61 | 50.54 | 48.95 | - |
+Note:
+1. For InternLM3-8B-Instruct, the results marked with `*` are sourced from their public report; all other results were obtained with our internal evaluation framework.
 
 ## 3. Quick start
 
@@ -223,11 +212,4 @@ outputs = llm.generate([prompt], sampling_params)
 
 print(outputs[0].outputs[0].text)
 
-```
-
-## Citation
-
-If you find `Klear-46B-A2.5B` useful or want to use it in your projects, please kindly cite our paper:
-
-```
 ```
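The final hunk shows only the tail of the Quick start's vLLM snippet (the `llm.generate` call and the closing `print`). For reference, a self-contained sketch consistent with those two lines might look like the following; the prompt text and sampling values are illustrative assumptions, not taken from the README:

```python
# Self-contained sketch around the two Quick start lines visible in the diff.
# The prompt and sampling values below are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(model="Kwai-Klear/Klear-46B-A2.5B-Base", trust_remote_code=True)
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)

prompt = "Briefly explain mixture-of-experts language models."
outputs = llm.generate([prompt], sampling_params)

# llm.generate returns one RequestOutput per prompt; print the first completion.
print(outputs[0].outputs[0].text)
```

For the instruction-tuned + DPO checkpoint, the prompt would typically be built with the tokenizer's chat template before being passed to `llm.generate`.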