BubbleQ committed
Commit 4336d6c · verified · 1 Parent(s): 3ae963a

Update README.md

Files changed (1):
  1. README.md (+28 -28)
README.md CHANGED
@@ -88,44 +88,44 @@ The base and instruction tuned + DPO models have the following architecture:
  | | GPQA (0-shot) | 35.3 | 31.03 | 33.9 | 35.7 | 30.1 | 35.5 |
  | | AGIEval (0-shot) | 52.3 | 48.3* | 51.7 | 55.7 | 54.3 | 56 |
  | | BBH (3-shot, cot) | 77.9 | 75.6 | 78.1 | 80.1 | 75.4 | 81.2 |
- | **Others** | HellaSwag (0-shot) | 80.5 | 80* | 78.7 | 81.5 | 80 | 81.2 |
+ | | HellaSwag (0-shot) | 80.5 | 80* | 78.7 | 81.5 | 80 | 81.2 |
  | | Triviaqa (5-shot) | 69.6 | 60.8* | 56.3 | 62.1 | 60.9 | 65.6 |
  | | Naturalqs (5-shot) | 37.5 | 23.46 | 25.7 | 29.1 | 28 | 30.7 |
  | | PIQA (0-shot) | 81.6 | 80.14 | 79.5 | 81.9 | 82 | 80.7 |
  | | OpenBookQA (0-shot) | 37.8 | 34.2 | 35 | 35.6 | 38.2 | 34.6 |
- | | Average | 69.66 | - | 66.14 | 68.86 | 65.43 | 69.65 |
+ | | | 69.66 | - | 66.14 | 68.86 | 65.43 | 69.65 |

  Note:
  1. `*`During pretraining, we found that the HumanEval metric fluctuated significantly and was extremely sensitive to formatting. Therefore, we referred to the prompt from Ling-series paper to modify the original HumanEval. The results in the table are the evaluation metrics after this modification.
  2. For Mimo-base-7B, the results marked with `*` are sourced from their public report, other evaluations are conducted based on internal evaluation frameworks.

  ### Klear-46B-A2.5B-Inst. Evaluation Results
- | Ability | Benchmark | Klear-46B-A2.5B-inst. | InternLM3-8B-Inst. | MiniCPM4-8B | Qwen3-8B (NoThink) | gemma3-12b-it | Phi4-14B | Qwen3-30B-A3B-2507 |
- | ------------------------- | --------------------------- | --------------- | --------------------- | ----------- | ------------------ | ------------- | -------- | ------------------ |
- | | # Total Params | 46B | 8B | 8B | 8B | 12B | 14B | 30B |
- | | # Activated Params | 2.5B | 8B | 8B | 8B | 12B | 14B | 3B |
- | **English Understanding** | MMLU-Redux | 81.61 | 74.65 | 77.63 | 79.32 | 78.39 | 83.09 | 88.11 |
- | | MMLU-Pro | 63.47 | 50.87 | 54.69 | 63.8 | 60.69 | 67.25 | 78.22 |
- | | GPQA-Diamoind | 47.85 | 38.76 | 38.51 | 51.77 | 39.02 | 59.47 | 71.21 |
- | | SimpleQA | 6.52 | 4.44 | 3.51 | 5.5 | 6.22 | 3.28 | 23.39 |
- | **Chinese Understanding** | CLUEWSC | 88.16 | 77.63 | 81.91 | 82.89 | 91.12 | 88.16 | 92.11 |
- | | CEval | 83.99 | 84.26 | 81.78 | 81.66 | 60.81 | 64.79 | 88.57 |
- | | C-SimpleQA | 42.3 | 25.87 | 23.13 | 37.07 | 28.97 | 24.77 | 75.37 |
- | **Math & Reasoning** | MATH500 | 82.8 | 68.4 | 79.8 | 85 | 86.8 | 80.6 | 97.2 |
- | | AIME24 | 25.62 | 11.25 | 22.92 | 28.33 | 23.96 | 15.83 | 75 |
- | | AIME25 | 18.12 | 8.12 | 15.21 | 20.62 | 18.33 | 18.75 | 61.88 |
- | **Code** | HumanEval | 87.8 | 82.3* | 74.39 | 83.54 | 82.32 | 85.37 | 81.71 |
- | | HumanEval+ | 81.1 | - | 70.12 | 76.83 | 75.61 | 83.54 | 76.83 |
- | | MBPPEvalplus | 83.1 | 62.4 | 82 | 76.2 | 85.7 | 77.5 | 89.4 |
- | | MBPPEvalplus++ | 70.4 | 50.4 | 69.3 | 66.1 | 74.1 | 66.7 | 75.1 |
- | | LiveCodeBench v5(2408-2501) | 28.67 | 14.7 | 12.19 | 27.24 | 24.73 | 23.66 | 41.22 |
- | **Instruction Following** | IF-Eval | 80.04 | 79.3 | 73.01 | 84.47 | 81.52 | 59.33 | 83.92 |
- | | Multi-IF(en+zh) | 78.73 | 62.53 | 61.79 | 78.95 | 76.56 | 62.7 | 77.75 |
- | **Comprehensive Ability** | MTBench | 8.23 | 7.86 | 6.875 | 8.21 | 8.675 | 8.625 | 9.33 |
- | | MT-Eval | 8.11 | 7.36 | 6.7 | 8.18 | 8.45 | 8.12 | - |
- | | AlignBench v1.1 | 6.85 | 6.13 | 5.99 | 6.95 | 6.3 | 6.33 | 7.06 |
- | | LiveBench 1125 | 50.1 | 26.3 | 25.5 | 52.1 | 43.1 | 40 | 68.4 |
- | | Average | 53.50 | - | 46.05 | 52.61 | 50.54 | 48.95 | - |
+ | Ability | Benchmark | Klear-46B-A2.5B | InternLM3-8B-Instruct | MiniCPM4-8B | Qwen3-8B (NoThink) | gemma3-12b-it | Phi4-14B | Qwen3-30B-A3B-2507 |
+ | ------------- | --------------------------- | --------------- | --------------------- | ----------- | ------------------ | ------------- | -------- | ------------------ |
+ | | # Total Params | 46B | 8B | 8B | 8B | 12B | 14B | 30B |
+ | | # Activated Params | 2.5B | 8B | 8B | 8B | 12B | 14B | 3B |
+ | **General** | MMLU-Redux | 81.61 | 74.65 | 77.63 | 79.32 | 78.39 | 83.09 | 88.11 |
+ | | MMLU-Pro | 63.47 | 50.87 | 54.69 | 63.8 | 60.69 | 67.25 | 78.22 |
+ | | GPQA-Diamoind | 47.85 | 38.76 | 38.51 | 51.77 | 39.02 | 59.47 | 71.21 |
+ | | SimpleQA | 6.52 | 4.44 | 3.51 | 5.5 | 6.22 | 3.28 | 23.39 |
+ | | CLUEWSC | 88.16 | 77.63 | 81.91 | 82.89 | 91.12 | 88.16 | 92.11 |
+ | | CEval | 83.99 | 84.26 | 81.78 | 81.66 | 60.81 | 64.79 | 88.57 |
+ | | C-SimpleQA | 42.3 | 25.87 | 23.13 | 37.07 | 28.97 | 24.77 | 75.37 |
+ | | LiveBench 1125 | 50.1 | 26.3 | 25.5 | 52.1 | 43.1 | 40 | 68.4 |
+ | **Math** | MATH500 | 82.8 | 68.4 | 79.8 | 85 | 86.8 | 80.6 | 97.2 |
+ | | AIME24 | 25.62 | 11.25 | 22.92 | 28.33 | 23.96 | 15.83 | 75 |
+ | | AIME25 | 18.12 | 8.12 | 15.21 | 20.62 | 18.33 | 18.75 | 61.88 |
+ | **Code** | HumanEval | 87.8 | 82.3* | 74.39 | 83.54 | 82.32 | 85.37 | 81.71 |
+ | | HumanEval+ | 81.1 | - | 70.12 | 76.83 | 75.61 | 83.54 | 76.83 |
+ | | MBPPEvalplus | 83.1 | 62.4 | 82 | 76.2 | 85.7 | 77.5 | 89.4 |
+ | | MBPPEvalplus++ | 70.4 | 50.4 | 69.3 | 66.1 | 74.1 | 66.7 | 75.1 |
+ | | LiveCodeBench v5(2408-2501) | 28.67 | 14.7 | 12.19 | 27.24 | 24.73 | 23.66 | 41.22 |
+ | **Alignment** | IF-Eval | 80.04 | 79.3 | 73.01 | 84.47 | 81.52 | 59.33 | 83.92 |
+ | | Multi-IF(en+zh) | 78.73 | 62.53 | 61.79 | 78.95 | 76.56 | 62.7 | 77.75 |
+ | | MTBench | 8.23 | 7.86 | 6.875 | 8.21 | 8.675 | 8.625 | 9.33 |
+ | | MT-Eval | 8.11 | 7.36 | 6.7 | 8.18 | 8.45 | 8.12 | - |
+ | | AlignBench v1.1 | 6.85 | 6.13 | 5.99 | 6.95 | 6.3 | 6.33 | 7.06 |
+ | | Average | 53.50 | - | 46.05 | 52.61 | 50.54 | 48.95 | - |

  Note:
  1. For InternLM3-8B-Instruct, the results marked with `*` are sourced from their public report, other evaluations are conducted based on internal evaluation frameworks.
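
For reference, the "Average" row of the instruct table matches the unweighted arithmetic mean of the 21 benchmark rows above it, with MTBench, MT-Eval, and AlignBench kept on their 0-10 scale rather than rescaled. A minimal sketch of that computation (an assumption inferred from the table values, not code from this repository):

```python
# Sketch: reproduce the "Average" cell for the Klear-46B-A2.5B-Inst. column as the
# plain mean of its 21 benchmark scores (0-10 scale rows are not rescaled).
klear_inst_scores = {
    "MMLU-Redux": 81.61, "MMLU-Pro": 63.47, "GPQA-Diamond": 47.85, "SimpleQA": 6.52,
    "CLUEWSC": 88.16, "CEval": 83.99, "C-SimpleQA": 42.3, "LiveBench 1125": 50.1,
    "MATH500": 82.8, "AIME24": 25.62, "AIME25": 18.12,
    "HumanEval": 87.8, "HumanEval+": 81.1, "MBPPEvalplus": 83.1, "MBPPEvalplus++": 70.4,
    "LiveCodeBench v5": 28.67,
    "IF-Eval": 80.04, "Multi-IF": 78.73, "MTBench": 8.23, "MT-Eval": 8.11,
    "AlignBench v1.1": 6.85,
}

average = sum(klear_inst_scores.values()) / len(klear_inst_scores)
print(f"{average:.2f}")  # 53.50, matching the "Average" cell in the new table
```

The same rule also reproduces the MiniCPM4-8B column average of 46.05, which is consistent with a straight per-column mean over all listed benchmarks.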