Update README.md (#2)
- Update README.md (d9d949cbed11369f43fd1c5eb0a50194e86ed1e7)
README.md
CHANGED
@@ -51,37 +51,39 @@ Performance of Step 3.5 Flash measured across **Reasoning**, **Coding**, and **A

 ### Detailed Benchmarks

-| Benchmark |
-| --- | --- | --- | --- | --- | --- | --- |
-| # Activated Params | 11B |
-| # Total Params
-|
-| | | |
-|
-|
-|
-|
-|
-|
-|
-|
-|
-| | | |
-|
-|
-|
-|
-| | | |
-|
-|
-|
-
-
-
-
-
-
-
+| Benchmark | # Shots | Step3.5 Flash (Base) | MiMo‑V2 Flash (Base) | GLM‑4.5 (Base) | DeepSeek V3.1 (Base) | DeepSeek V3.2 (Exp Base) | Kimi‑K2 (Base) |
+| --- | --- | --- | --- | --- | --- | --- | --- |
+| # Activated Params | - | 11B | 15B | 32B | 37B | 37B | 32B |
+| # Total Params | - | 196B | 309B | 355B | 671B | 671B | 1043B |
+| General | | | | | | | |
+| BBH | 3-shot | 88.2 | 88.5 | 86.2 | 88.2† | 88.7† | 88.7 |
+| MMLU | 5-shot | 85.8 | 86.7 | 86.1 | 87.4† | 87.8† | 87.8 |
+| MMLU‑Redux | 5-shot | 89.2 | 90.6 | - | 90.0† | 90.4† | 90.2 |
+| MMLU‑Pro | 5-shot | 62.3 | 73.2 | - | 58.8† | 62.1† | 69.2 |
+| HellaSwag | 10-shot | 90.2 | 88.5 | 87.1 | 89.2† | 89.4† | 94.6 |
+| WinoGrande | 5-shot | 79.1 | 83.8 | - | 85.9† | 85.6† | 85.3 |
+| GPQA | 5-shot | 41.7 | 43.5* | 33.5* | 43.1* | 37.3* | 43.1* |
+| SuperGPQA | 5-shot | 41.0 | 41.1 | - | 42.3† | 43.6† | 44.7 |
+| SimpleQA | 5-shot | 31.6 | 20.6 | 30.0 | 26.3† | 27.0† | 35.3 |
+| Mathematics | | | | | | | |
+| GSM8K | 8-shot | 88.2 | 92.3 | 87.6 | 91.4† | 91.1† | 92.1 |
+| MATH | 4-shot | 66.8 | 71.0 | 62.6 | 62.6† | 62.5† | 70.2 |
+| Code | | | | | | | |
+| HumanEval | 3-shot | 81.1 | 77.4* | 79.8* | 72.5* | 67.7* | 84.8* |
+| MBPP | 3-shot | 79.4 | 81.0* | 81.6* | 74.6* | 75.6* | 89.0* |
+| HumanEval+ | 0-shot | 72.0 | 70.7 | - | 64.6† | 67.7† | - |
+| MBPP+ | 0-shot | 70.6 | 71.4 | - | 72.2† | 69.8† | - |
+| MultiPL‑E HumanEval | 0-shot | 67.7 | 59.5 | - | 45.9† | 45.7† | 60.5 |
+| MultiPL‑E MBPP | 0-shot | 58.0 | 56.7 | - | 52.5† | 50.6† | 58.8 |
+| Chinese | | | | | | | |
+| C‑EVAL | 5-shot | 89.6 | 87.9 | 86.9 | 90.0† | 91.0† | 92.5 |
+| CMMLU | 5-shot | 88.9 | 87.4 | - | 88.8† | 88.9† | 90.9 |
+| C‑SimpleQA | 5-shot | 63.2 | 61.5 | 70.1 | 70.9† | 68.0† | 77.6 |
+
+1. “*” denotes cases where the original score was unavailable; we report results evaluated under the same test conditions as Step3.5 Flash for fair
+comparison.
+2. “†” indicates DeepSeek scores quoted from the MiMo‑V2‑Flash report.
+
 ### Recommended Inference Parameters
 1. For general chat domain, we suggest: `temperature=0.6, top_p=0.95`
 2. For reasoning / agent scenario, we recommend: `temperature=1.0, top_p=0.95`.
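
The two recommended settings in the context lines above could be kept in one place and selected by scenario. The following is a minimal sketch, not part of the commit: the `SamplingParams` class, the `RECOMMENDED` table, the `sampling_kwargs` helper, and the scenario labels `"chat"`, `"reasoning"`, and `"agent"` are hypothetical names chosen for illustration. Only the numeric values come from the README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingParams:
    """Hypothetical container for the two sampling knobs the README mentions."""
    temperature: float
    top_p: float


# Numeric values are copied from the README's recommendations.
# The scenario keys are illustrative labels, not names defined by the project.
RECOMMENDED = {
    "chat": SamplingParams(temperature=0.6, top_p=0.95),
    "reasoning": SamplingParams(temperature=1.0, top_p=0.95),
    "agent": SamplingParams(temperature=1.0, top_p=0.95),
}


def sampling_kwargs(scenario: str) -> dict:
    """Return the recommended sampling settings for a scenario as a plain dict."""
    try:
        params = RECOMMENDED[scenario]
    except KeyError:
        raise ValueError(
            f"unknown scenario {scenario!r}; expected one of {sorted(RECOMMENDED)}"
        ) from None
    return {"temperature": params.temperature, "top_p": params.top_p}


if __name__ == "__main__":
    for name in RECOMMENDED:
        print(name, sampling_kwargs(name))
```

The returned dict is meant to be unpacked into whatever generation call is in use, on the assumption that the call accepts `temperature` and `top_p` keyword arguments. Check the signature of the client you actually use before relying on that.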