Update README.md
README.md
CHANGED
@@ -79,7 +79,7 @@ The base and instruction tuned + DPO models have the following architecture:
 | ----------- | ---------------------- | -------------------- | ------------ | ------------- | -------------- | ------------------ | ------------------ |
 | | # Total Params | 46B | 7B | 8B | 14B | 16.8B | 30B |
 | | # Activated Params | 2.5B | 7B | 8B | 14B | 2.75B | 3B |
-| **Code** | HumanEval (0-shot
+| **Code** | HumanEval† (0-shot) | 89 | - | 84.1 | 87.8 | 83.5 | 90.9 |
 | | MBPP (3-shot) | 76 | 69.2* | 69 | 74 | 66.6 | 75.6 |
 | **Math** | MATH (4-shot, cot) | 55.7 | 38.8 | 60.8* | 62.02* | 59.9 | 59.04* |
 | | CMATH (3-shot) | 87.83 | 78.5 | 88.3 | 90.7 | 85.7 | 89.7 |
@@ -99,8 +99,9 @@ The base and instruction tuned + DPO models have the following architecture:
 | | Average | 69.66 | - | 66.62 | 69.60 | 65.60 | 70.41 |
 
 Note:
-1.
-2.
+1. Results marked with `*` are taken from the models' public reports; all other evaluations were conducted with internal evaluation frameworks.
+2. `†` During pretraining, we found that the HumanEval metric fluctuated significantly and was extremely sensitive to formatting. We therefore modified the original HumanEval prompt, following the prompt from the Ling-series paper. The results in the table were measured with this modified version.
+
 
 ### Klear-46B-A2.5B-Instruct Evaluation Results
 | Ability | Benchmark | Klear-46B-A2.5B-Instruct | InternLM3-8B-Instruct | MiniCPM4-8B | Qwen3-8B (NoThink) | gemma3-12b-it | Phi4-14B | Qwen3-30B-A3B-2507 |