---
license: apache-2.0
language:
- en
---
# **K2-V2**
<img src="figures/K2.LOGO.PRIMARY.RGB.png" width="100" alt="K2-V2 model logo"/>
📚 [Tech Report](https://www.llm360.ai/reports/K2_V2_report.pdf) - 📝 [Code](https://github.com/llm360/k2v2_train) - 🏢 [Project Page](https://huggingface.co/LLM360/K2-V2)
K2-V2 is our most capable fully open model to date, and one of the strongest open-weight models in its class. It uses a 70B-parameter dense transformer architecture and represents the latest advancement in the LLM360 model family.
<img src="figures/sft-models.png" width="400" alt="K2-V2 SFT results"/>
Beyond standard competencies such as factual knowledge and conversational ability, K2-V2 demonstrates strong long-context consistency, deep mathematical understanding, and robust reasoning skills. These capabilities serve as building blocks for sophisticated downstream applications, such as solving complex math problems and executing agentic workflows.
<img src="figures/base-models.png" width="400" alt="K2-V2 GPQA results"/>
---
## **Quick Start**
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and tokenizer; device_map="auto" (requires the accelerate
# package) spreads the weights across the available devices.
model = AutoModelForCausalLM.from_pretrained("LLM360/K2-V2", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("LLM360/K2-V2")

# K2-V2 is a base model, so the prompt is continued as plain text.
prompt = "Explain why the derivative of sin(x) is cos(x)."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
---
## **Evaluation Summary**
Below we report performance across general, reasoning, mathematical, and coding benchmarks. Scores for K2-V2 checkpoints (base → mid-4) demonstrate the impact of staged mid-training on reasoning quality. In each row, **bold** marks the best score and <u>underline</u> marks the second best.
| Task / Model | base | mid-1 | mid-2 | mid-3 | mid-4 | Qwen2.5-72B | Llama3.0-70B | Llama3.1-70B | Olmo3-32B |
|--------------|------|-------|-------|-------|-------|--------------|---------------|---------------|------------|
| **General Tasks** | | | | | | | | | |
| **MMLU** | 74.3 | 74.4 | 73.5 | 75.0 | 75.2 | **86.1** | <u>79.5</u> | 79.3 | 75.2 |
| **MMLU-Pro** | 43.7 | 46.8 | 48.1 | **59.8** | 57.0 | <u>58.1</u> | 52.8 | 53.8 | 49.6 |
| **BBH** | 68.4 | 79.8 | 81.1 | 82.2 | <u>83.2</u> | **86.3** | 82.2 | 82.1 | 77.6 |
| **HELLASWAG** | <u>87.8</u> | 86.9 | 86.6 | 86.6 | 86.0 | 87.6 | **88.0** | 85.0 | 84.8 |
| **WINOGRANDE** | 82.6 | 83.7 | 83.7 | 83.7 | 83.0 | 83.9 | <u>85.3</u> | 79.8 | **90.3** |
| **PIQA** | 84.2 | 84.0 | 83.3 | 82.9 | 83.1 | 83.5 | <u>84.6</u> | 84.3 | **85.6** |
| **TRUTHFULQA** | 54.0 | 54.9 | 55.1 | <u>55.8</u> | 53.9 | **60.5** | 45.6 | 49.7 | 54.9 |
| **Math & STEM Tasks** | | | | | | | | | |
| **GPQA-DIAMOND** | 26.3 | 31.3 | 27.8 | <u>43.9</u> | **55.1** | 34.9 | 21.2 | 27.3 | 30.3 |
| **GSM8K** | 68.0 | 76.4 | 82.1 | **93.6** | <u>92.5</u> | 91.2 | 83.2 | 81.1 | 80.5 |
| **MATH** | 27.8 | 38.2 | 41.1 | **94.7** | <u>91.4</u> | 58.5 | 41.9 | 41.6 | 43.4 |
| **AIME 2025** | 0.0 | 17.6 | 25.1 | **53.2** | <u>46.9</u> | 1.7 | 0.1 | 0.2 | 14.7 |
| **ARC-CHALLENGE** | 64.9 | 66.4 | 66.4 | 66.0 | 66.3 | **72.4** | <u>69.2</u> | 64.9 | 65.4 |
| **Coding Tasks** | | | | | | | | | |
| **MBPP** | 57.6 | 57.8 | 58.2 | 59.8 | 61.8 | **75.4** | <u>69.2</u> | 64.4 | 60.2 |
| **HUMANEVAL** | 50.0 | 51.2 | <u>53.7</u> | **54.3** | **54.3** | **54.3** | 42.1 | 50.6 | 36.0 |
Please refer to our [Tech Report](https://www.llm360.ai/reports/K2_V2_report.pdf) for detailed evaluation results.
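To make the effect of mid-training easier to see, the short sketch below computes the change from the base checkpoint to mid-4 for a handful of benchmarks. The scores are copied from the table above; the `scores` dictionary and the `gains` helper are illustrative names, not part of any released tooling.

```python
# (base, mid-4) scores copied from the evaluation table above.
scores = {
    "MMLU": (74.3, 75.2),
    "MMLU-Pro": (43.7, 57.0),
    "BBH": (68.4, 83.2),
    "GPQA-Diamond": (26.3, 55.1),
    "GSM8K": (68.0, 92.5),
    "MATH": (27.8, 91.4),
    "AIME 2025": (0.0, 46.9),
    "MBPP": (57.6, 61.8),
    "HumanEval": (50.0, 54.3),
}


def gains(table):
    """Return {benchmark: mid-4 minus base}, rounded to one decimal place."""
    return {name: round(final - base, 1) for name, (base, final) in table.items()}


deltas = gains(scores)

# Print benchmarks from largest to smallest improvement.
for name, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<13} {delta:+.1f}")
```

On these numbers, the largest gains fall on the math and reasoning benchmarks (MATH, AIME 2025, GPQA-Diamond, GSM8K), while general-knowledge and coding scores move by only a few points.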
---
## **Datasets & Mixtures**
K2-V2 training is organized into three stages, each using a transparent, publicly released mixture:
### **Pretraining Mix**
* Large-scale natural text corpus spanning web content, books, code, and multilingual sources
* Mixture designed for stable scaling and broad general-knowledge coverage
* ~12T tokens
### **Mid-Training Mix**
* **TxT360-Midas**: reasoning-oriented + long-context extensions
* Domain-focused sources: math, programming, scientific literature
* Synthetic expansions where natural data is scarce
### **SFT Mix**
* See the [K2-V2-Instruct](https://huggingface.co/LLM360/K2-V2-Instruct) model card for the SFT mixture
All mixtures, filtering rules, and data sources are fully released for reproducibility.
Please refer to our [Tech Report](https://www.llm360.ai/reports/K2_V2_report.pdf) for details on the datasets and mixtures.
---
## **Model Description**
- **Model type:** K2-V2 is a standard decoder-only transformer with grouped-query attention and RMSNorm.
- **Training stage:** Pre-training
- **Language(s) (NLP):** English
- **License:** Apache 2.0
| Model Hyperparameter | Value |
| ----------- | ----------- |
| Total Parameters | 70B |
| Hidden Size | 8,192 |
| Intermediate Size (FFN) | 28,672 |
| Number of Attention Heads | 64 |
| Number of Layers | 80 |
| RMSNorm ɛ | 1e-5 |
| Pre-training Seq Length | 8,192 |
| Max Mid-training Seq Length | 524,288 |
| Vocab Size | 250,000 |
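As a rough sanity check on the table above, the sketch below derives the per-head dimension and a back-of-the-envelope parameter count. Several inputs are assumptions that this card does not state: the number of key/value heads (`N_KV_HEADS = 8`), a gated FFN with three weight matrices, and untied input/output embeddings. Treat the result as an order-of-magnitude estimate, not the official parameter count.

```python
# Values taken from the hyperparameter table above.
HIDDEN = 8192
FFN = 28672
N_HEADS = 64
N_LAYERS = 80
VOCAB = 250_000

# ASSUMPTIONS (not stated in this card): 8 key/value heads for grouped-query
# attention, a gated FFN with three weight matrices, untied embeddings.
N_KV_HEADS = 8
FFN_MATRICES = 3
TIED_EMBEDDINGS = False


def estimate_params():
    """Estimate the total weight count under the assumptions above."""
    head_dim = HIDDEN // N_HEADS
    kv_dim = N_KV_HEADS * head_dim
    attn = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * kv_dim  # Q and O, plus K and V
    ffn = FFN_MATRICES * HIDDEN * FFN
    norms = 2 * HIDDEN  # two RMSNorm weight vectors per layer
    embed = VOCAB * HIDDEN * (1 if TIED_EMBEDDINGS else 2)
    return N_LAYERS * (attn + ffn + norms) + embed + HIDDEN  # + final norm


total = estimate_params()
print(f"head_dim = {HIDDEN // N_HEADS}")
print(f"estimated parameters = {total / 1e9:.1f}B")
```

Under these assumptions the estimate comes out slightly above the nominal 70B, which is the usual gap between a rounded model name and an exact weight count; changing any of the assumed values shifts the result.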
---
## **Intended Use**
K2-V2 is designed for:
* research on large language models and reasoning
* downstream fine-tuning (e.g., instruction following, agents, domain models)
* experimentation with long-context architectures
* open, transparent benchmarking of LLM scaling
K2-V2 is **not** instruction-tuned. For aligned conversational use, please see **K2-V2-Instruct**.
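Because a base model continues text instead of following instructions, few-shot prompts tend to work better than bare questions. The sketch below shows one way to build such a prompt; the `Question:`/`Answer:` layout and the `build_few_shot_prompt` helper are illustrative choices, not a format prescribed by K2-V2.

```python
def build_few_shot_prompt(examples, question):
    """Format (question, answer) pairs, then leave the final answer open."""
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in examples]
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


# Hypothetical worked examples that show the model the expected pattern.
examples = [
    ("What is 7 * 8?", "56"),
    ("What is the derivative of x^2?", "2x"),
]

prompt = build_few_shot_prompt(examples, "What is the derivative of sin(x)?")
print(prompt)
```

The resulting string can be passed to the tokenizer in place of the `prompt` used in the Quick Start snippet.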
---
## **Limitations**
* May generate incorrect or hallucinated content, especially when asked about facts not seen during training
* Not optimized for safety, moderation, or refusal behavior (base model)
* Long-context performance depends on prompt quality and retrieval structure
* Primarily trained on English; multilingual capabilities are limited
* Inference cost is high due to the 70B parameter size
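To put the inference cost in concrete terms, the sketch below estimates the memory needed just to hold the weights at a few common precisions. It uses the nominal 70B figure from this card and ignores activations, the KV cache, and framework overhead, so real requirements will be higher. The helper name `weight_memory_gb` is illustrative.

```python
# Bytes used per parameter at common storage precisions.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(n_params, dtype):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


N_PARAMS = 70_000_000_000  # nominal parameter count from this card

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(N_PARAMS, dtype):.0f} GB")
```

At 16-bit precision the weights alone come to about 140 GB, which is why multi-GPU sharding or quantization is typically needed to serve a model of this size.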
---
## **Citation**
If you use K2-V2 in your research, please cite the following:
```bibtex
@misc{llm360_k2v2_2025,
title = {K2-V2: A 360-Open, Reasoning-Enhanced Open Foundation Model},
author = {K2 Team},
year = {2025},
archivePrefix = {arXiv},
eprint = {XXXX.XXXXX},
primaryClass = {cs.CL}
}
```