Update README.md (#6)
Commit: 26bae147850e57c3753a521335f4af72459090ff

README.md (CHANGED)
## **Evaluation Summary**

Below we report performance across general, reasoning, mathematical, and coding benchmarks. Scores for the K2-V2 checkpoints (base → mid-4) demonstrate the impact of staged mid-training on reasoning quality.

| Task / Model | base | mid-1 | mid-2 | mid-3 | mid-4 | Qwen2.5-72B | Llama3.0-70B | Llama3.1-70B | Olmo3-32B |
|--------------|------|-------|-------|-------|-------|-------------|--------------|--------------|-----------|
| **General Tasks** | | | | | | | | | |
| **MMLU** | 74.3 | 74.4 | 73.5 | 75.0 | 75.2 | **86.1** | <u>79.5</u> | 79.3 | 75.2 |
| **MMLU-Pro** | 43.7 | 46.8 | 48.1 | **59.8** | 57.0 | <u>58.1</u> | 52.8 | 53.8 | 49.6 |
| **BBH** | 68.4 | 79.8 | 81.1 | 82.2 | <u>83.2</u> | **86.3** | 82.2 | 82.1 | 77.6 |
| **HellaSwag** | <u>87.8</u> | 86.9 | 86.6 | 86.6 | 86.0 | 87.6 | **88.0** | 85.0 | 84.8 |
| **WinoGrande** | 82.6 | 83.7 | 83.7 | 83.7 | 83.0 | 83.9 | <u>85.3</u> | 79.8 | **90.3** |
| **PIQA** | 84.2 | 84.0 | 83.3 | 82.9 | 83.1 | 83.5 | <u>84.6</u> | 84.3 | **85.6** |
| **TruthfulQA** | 54.0 | 54.9 | 55.1 | <u>55.8</u> | 53.9 | **60.5** | 45.6 | 49.7 | 54.9 |
| **Math & STEM Tasks** | | | | | | | | | |
| **GPQA-Diamond** | 26.3 | 31.3 | 27.8 | <u>43.9</u> | **55.1** | 34.9 | 21.2 | 27.3 | 30.3 |
| **GSM8K** | 68.0 | 76.4 | 82.1 | **93.6** | <u>92.5</u> | 91.2 | 83.2 | 81.1 | 80.5 |
| **MATH** | 27.8 | 38.2 | 41.1 | **94.7** | <u>91.4</u> | 58.5 | 41.9 | 41.6 | 43.4 |
| **AIME 2025** | 0.0 | 17.6 | 25.1 | **53.2** | <u>46.9</u> | 1.7 | 0.1 | 0.2 | 14.7 |
| **ARC-Challenge** | 64.9 | 66.4 | 66.4 | 66.0 | 66.3 | **72.4** | <u>69.2</u> | 64.9 | 65.4 |
| **Coding Tasks** | | | | | | | | | |
| **MBPP** | 57.6 | 57.8 | 58.2 | 59.8 | 61.8 | **75.4** | <u>69.2</u> | 64.4 | 60.2 |
| **HumanEval** | 50.0 | 51.2 | <u>53.7</u> | **54.3** | **54.3** | **54.3** | 42.1 | 50.6 | 36.0 |
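As a quick arithmetic check of the staged-training trend, the snippet below averages the four Math & STEM rows (GPQA-Diamond, GSM8K, MATH, AIME 2025) from the table for each checkpoint. This is our own illustrative aggregation, not an official metric; the variable names are hypothetical.

```python
# Mean Math & STEM score (GPQA-Diamond, GSM8K, MATH, AIME 2025)
# per K2-V2 checkpoint, using the values from the table above.
math_stem = {
    "base":  [26.3, 68.0, 27.8, 0.0],
    "mid-1": [31.3, 76.4, 38.2, 17.6],
    "mid-2": [27.8, 82.1, 41.1, 25.1],
    "mid-3": [43.9, 93.6, 94.7, 53.2],
    "mid-4": [55.1, 92.5, 91.4, 46.9],
}

# Average the four benchmark scores for each checkpoint.
means = {ckpt: sum(scores) / len(scores) for ckpt, scores in math_stem.items()}
for ckpt, mean in means.items():
    print(f"{ckpt}: {mean:.1f}")
```

The mean rises monotonically from about 30.5 (base) to about 71.5 (mid-4), consistent with the claim that staged mid-training improves reasoning quality.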
Below we report the evaluation results for K2-V2 after supervised fine-tuning (SFT). These variants correspond to three levels of reasoning effort (Low < Medium < High).

| Model Specifications | LongBench V2 | AIME25 | HMMT25 | GSM8K | Minerva | GPQA-D | MBPP | HumanEval | LCBv6 |
|----------------------|--------------|--------|--------|-------|---------|--------|------|-----------|-------|
| **K2 Low**<br><sub>Dense · 70B</sub> | 40.7 | 27.3 | 19.0 | 92.4 | 85.0 | 48.5 | 71.0 | 82.3 | 39.9 |
| **K2 Medium**<br><sub>Dense · 70B</sub> | 41.3 | 62.0 | 45.6 | 92.0 | 90.6 | 60.6 | 75.8 | 84.2 | 51.3 |
| **K2 High**<br><sub>Dense · 70B</sub> | 42.6 | 80.2 | 71.4 | 94.8 | 94.5 | 69.3 | 84.8 | 91.5 | 67.0 |
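As a rough summary of the effort-level ordering, the snippet below averages the nine benchmark columns for each SFT variant. The unweighted mean is our own illustrative aggregate, not an official score.

```python
# Mean of the nine benchmark scores per reasoning-effort level,
# taken from the SFT table above (LongBench V2 through LCBv6).
sft_scores = {
    "K2 Low":    [40.7, 27.3, 19.0, 92.4, 85.0, 48.5, 71.0, 82.3, 39.9],
    "K2 Medium": [41.3, 62.0, 45.6, 92.0, 90.6, 60.6, 75.8, 84.2, 51.3],
    "K2 High":   [42.6, 80.2, 71.4, 94.8, 94.5, 69.3, 84.8, 91.5, 67.0],
}

# Unweighted mean across all nine benchmarks for each variant.
avg = {name: sum(scores) / len(scores) for name, scores in sft_scores.items()}
for name, mean in avg.items():
    print(f"{name}: {mean:.1f}")
```

The average climbs from about 56.2 (Low) through 67.0 (Medium) to 77.3 (High), matching the Low < Medium < High ordering of reasoning effort.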
Please refer to our [Tech Report](https://www.llm360.ai/reports/K2_V2_report.pdf) for detailed evaluation results.
---