Suu committed
Commit ec01c1d · verified · Parent(s): facdd30

Update README.md

Files changed (1):
  1. README.md (+4, -3)
README.md CHANGED
@@ -79,7 +79,7 @@ The base and instruction tuned + DPO models have the following architecture:
  | ----------- | ---------------------- | -------------------- | ------------ | ------------- | -------------- | ------------------ | ------------------ |
  | | # Total Params | 46B | 7B | 8B | 14B | 16.8B | 30B |
  | | # Activated Params | 2.5B | 7B | 8B | 14B | 2.75B | 3B |
- | **Code** | HumanEval (0-shot*) | 89 | - | 84.1 | 87.8 | 83.5 | 90.9 |
+ | **Code** | HumanEval (0-shot) | 89 | - | 84.1 | 87.8 | 83.5 | 90.9 |
  | | MBPP (3-shot) | 76 | 69.2* | 69 | 74 | 66.6 | 75.6 |
  | **Math** | MATH (4-shot, cot) | 55.7 | 38.8 | 60.8* | 62.02* | 59.9 | 59.04* |
  | | CMATH (3-shot) | 87.83 | 78.5 | 88.3 | 90.7 | 85.7 | 89.7 |
@@ -99,8 +99,9 @@ The base and instruction tuned + DPO models have the following architecture:
  | | Average | 69.66 | - | 66.62 | 69.60 | 65.60 | 70.41 |

  Note:
- 1. `*`During pretraining, we found that the HumanEval metric fluctuated significantly and was extremely sensitive to formatting. Therefore, we referred to the prompt from Ling-series paper to modify the original HumanEval. The results in the table are the evaluation metrics after this modification.
- 2. Results marked with `*` are sourced from their public report, other evaluations are conducted based on internal evaluation frameworks.
+ 1. Results marked with `*` are sourced from their public report, other evaluations are conducted based on internal evaluation frameworks.
+ 2. `†`During pretraining, we found that the HumanEval metric fluctuated significantly and was extremely sensitive to formatting. Therefore, we referred to the prompt from Ling-series paper to modify the original HumanEval. The results in the table are the evaluation metrics after this modification.
+

  ### Klear-46B-A2.5B-Instruct Evaluation Results
  | Ability | Benchmark | Klear-46B-A2.5B--Instruct | InternLM3-8B-Instruct | MiniCPM4-8B | Qwen3-8B (NoThink) | gemma3-12b-it | Phi4-14B | Qwen3-30B-A3B-2507 |