|
|
---
license: apache-2.0
language:
- en
---
|
|
|
|
|
# **K2-V2** |
|
|
|
|
|
<img src="figures/K2.LOGO.PRIMARY.RGB.png" width="100" alt="K2-V2 model logo"/> |
|
|
|
|
|
📚 [Tech Report](https://www.llm360.ai/reports/K2_V2_report.pdf) - 📝 [Code](https://github.com/llm360/k2v2_train) - 🏢 [Project Page](https://huggingface.co/LLM360/K2-V2) |
|
|
|
|
|
K2-V2 is our most capable fully open model to date, and one of the strongest open-weight models in its class. It uses a 70B-parameter dense transformer architecture and represents the latest advancement in the LLM360 model family. |
|
|
|
|
|
<img src="figures/sft-models.png" width="400" alt="K2-V2 SFT results"/> |
|
|
|
|
|
Beyond standard competencies such as factual knowledge and conversational ability, K2-V2 demonstrates strong long-context consistency, deep mathematical understanding, and robust reasoning skills. These capabilities serve as building blocks for sophisticated downstream applications, such as solving complex math problems and executing agentic workflows. |
|
|
|
|
|
<img src="figures/base-models.png" width="400" alt="K2-V2 GPQA results"/> |
|
|
|
|
|
--- |
|
|
|
|
|
## **Quick Start** |
|
|
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# torch_dtype="auto" keeps the checkpoint's native precision (instead of fp32);
# device_map="auto" shards the 70B model across the available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "LLM360/K2-V2", torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("LLM360/K2-V2")

# K2-V2 is a base model, so prompt it with plain text rather than a chat template.
prompt = "Explain why the derivative of sin(x) is cos(x)."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
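
Given the 70B parameter count, full-precision inference typically needs multiple high-memory GPUs. If memory is tight, quantized loading is one option. The following is a minimal sketch assuming the `bitsandbytes` package is installed; 4-bit weights reduce memory substantially but will not exactly reproduce the reported scores.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 weight storage with bf16 compute; a memory-saving option, not the
# configuration used for the benchmark numbers reported below.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "LLM360/K2-V2", quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("LLM360/K2-V2")
```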
|
|
|
|
|
--- |
|
|
|
|
|
## **Evaluation Summary** |
|
|
|
|
|
Below we report performance on general, math & STEM, and coding benchmarks. Scores for the K2-V2 checkpoints (base → mid-4) show the impact of staged mid-training on reasoning quality. In each row, **bold** marks the best score and <u>underline</u> the second best.
|
|
|
|
|
| Task / Model | base | mid-1 | mid-2 | mid-3 | mid-4 | Qwen2.5-72B | Llama3.0-70B | Llama3.1-70B | Olmo3-32B | |
|
|
|--------------|------|-------|-------|-------|-------|--------------|---------------|---------------|------------| |
|
|
| **General Tasks** | | | | | | | | | | |
|
|
| **MMLU** | 74.3 | 74.4 | 73.5 | 75.0 | 75.2 | **86.1** | <u>79.5</u> | 79.3 | 75.2 | |
|
|
| **MMLU-Pro** | 43.7 | 46.8 | 48.1 | **59.8** | 57.0 | <u>58.1</u> | 52.8 | 53.8 | 49.6 | |
|
|
| **BBH** | 68.4 | 79.8 | 81.1 | 82.2 | <u>83.2</u> | **86.3** | 82.2 | 82.1 | 77.6 | |
|
|
| **HELLASWAG** | <u>87.8</u> | 86.9 | 86.6 | 86.6 | 86.0 | 87.6 | **88.0** | 85.0 | 84.8 | |
|
|
| **WINOGRANDE** | 82.6 | 83.7 | 83.7 | 83.7 | 83.0 | 83.9 | <u>85.3</u> | 79.8 | **90.3** | |
|
|
| **PIQA** | 84.2 | 84.0 | 83.3 | 82.9 | 83.1 | 83.5 | <u>84.6</u> | 84.3 | **85.6** | |
|
|
| **TRUTHFULQA** | 54.0 | 54.9 | 55.1 | <u>55.8</u> | 53.9 | **60.5** | 45.6 | 49.7 | 54.9 | |
|
|
| **Math & STEM Tasks** | | | | | | | | | | |
|
|
| **GPQA-DIAMOND** | 26.3 | 31.3 | 27.8 | <u>43.9</u> | **55.1** | 34.9 | 21.2 | 27.3 | 30.3 | |
|
|
| **GSM8K** | 68.0 | 76.4 | 82.1 | **93.6** | <u>92.5</u> | 91.2 | 83.2 | 81.1 | 80.5 | |
|
|
| **MATH** | 27.8 | 38.2 | 41.1 | **94.7** | <u>91.4</u> | 58.5 | 41.9 | 41.6 | 43.4 | |
|
|
| **AIME 2025** | 0.0 | 17.6 | 25.1 | **53.2** | <u>46.9</u> | 1.7 | 0.1 | 0.2 | 14.7 | |
|
|
| **ARC-CHALLENGE** | 64.9 | 66.4 | 66.4 | 66.0 | 66.3 | **72.4** | <u>69.2</u> | 64.9 | 65.4 | |
|
|
| **Coding Tasks** | | | | | | | | | | |
|
|
| **MBPP** | 57.6 | 57.8 | 58.2 | 59.8 | 61.8 | **75.4** | <u>69.2</u> | 64.4 | 60.2 | |
|
|
| **HUMANEVAL** | 50.0 | 51.2 | <u>53.7</u> | **54.3** | **54.3** | **54.3** | 42.1 | 50.6 | 36.0 | |
|
|
|
|
|
|
|
|
Please refer to our [Tech Report](https://www.llm360.ai/reports/K2_V2_report.pdf) for detailed evaluation results. |
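
The table can be approximately reproduced with standard evaluation tooling. Below is a hedged sketch using the EleutherAI `lm-evaluation-harness` (`pip install lm-eval`); the task names, few-shot settings, and harness version behind the reported numbers are documented in the tech report and may differ from this example.

```python
# Illustrative only: scores one task with lm-evaluation-harness; the settings
# here are assumptions, not the exact configuration used for the table above.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=LLM360/K2-V2,dtype=bfloat16",
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size=8,
)
print(results["results"]["gsm8k"])
```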
|
|
|
|
|
--- |
|
|
|
|
|
## **Datasets & Mixtures** |
|
|
|
|
|
K2-V2 training is organized into three stages, each using a transparent, publicly released mixture: |
|
|
|
|
|
### **Pretraining Mix** |
|
|
|
|
|
* Large-scale natural text corpus spanning web content, books, code, and multilingual sources |
|
|
* Mixture designed for stable scaling and broad general-knowledge coverage |
|
|
* ~12T tokens |
|
|
|
|
|
### **Mid-Training Mix** |
|
|
|
|
|
* **TxT360-Midas**: reasoning-oriented + long-context extensions |
|
|
* Domain-focused sources: math, programming, scientific literature |
|
|
* Synthetic expansions where natural data is scarce |
|
|
|
|
|
### **SFT Mix** |
|
|
|
|
|
* See the [K2-V2-Instruct](https://huggingface.co/LLM360/K2-V2-Instruct) model card for the supervised fine-tuning mixture
|
|
|
|
|
All mixtures, filtering rules, and data sources are fully released for reproducibility. |
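
As a starting point for inspecting the released data, the sketch below streams a few records from the TxT360 corpus on the Hugging Face Hub. The repository name, config, and split are assumptions for illustration; the exact components of each K2-V2 mixture are listed in the tech report.

```python
# Stream a small sample without downloading the corpus; the repository, config,
# and split used here are assumptions made for this illustration.
from datasets import get_dataset_config_names, load_dataset

repo = "LLM360/TxT360"
configs = get_dataset_config_names(repo)     # discover the available subsets
print(configs)

stream = load_dataset(repo, configs[0], split="train", streaming=True)
for i, example in enumerate(stream):
    print(example)
    if i >= 2:
        break
```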
|
|
|
|
|
Please refer to our [Tech Report](https://www.llm360.ai/reports/K2_V2_report.pdf) for detailed datasets and mixtures information. |
|
|
|
|
|
--- |
|
|
|
|
|
## **Model Description** |
|
|
- **Model type:** K2-V2 follows a standard decoder-only transformer with grouped-query attention and RMSNorm. |
|
|
- **Training stage:** Pre-training |
|
|
- **Language(s) (NLP):** English |
|
|
- **License:** Apache 2.0 |
|
|
|
|
|
|
|
|
| Model Hyperparameter | Value | |
|
|
| ----------- | ----------- | |
|
|
| Total Parameters | 70B | |
|
|
| Hidden Size | 8,192 | |
|
|
| Intermediate Size (FFN) | 28,672 | |
|
|
| Number of Attention Heads | 64 | |
|
|
| Number of Layers | 80 | |
|
|
| RMSNorm ε | 1e-5 |
|
|
| Pre-training Seq Length | 8,192 | |
|
|
| Max Mid-training Seq Length | 524,288 | |
|
|
| Vocab Size | 250,000 | |
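
As a rough consistency check, the hyperparameters above approximately account for the 70B total. The estimate below additionally assumes a grouped-query attention layout with 8 KV heads, a gated MLP, and tied input/output embeddings; none of these details appear in the table, so treat it as illustrative arithmetic only.

```python
# Back-of-the-envelope parameter count from the table above; the KV-head count,
# gated MLP, and tied embeddings are assumptions made only for this estimate.
hidden, ffn, layers, vocab = 8192, 28672, 80, 250_000
n_heads, n_kv_heads = 64, 8                  # n_kv_heads: assumed GQA setting
head_dim = hidden // n_heads

attn = 2 * hidden * hidden + 2 * hidden * (n_kv_heads * head_dim)  # Q, O, K, V
mlp = 3 * hidden * ffn                                             # gate, up, down
embeddings = vocab * hidden                                        # assumed tied

total = layers * (attn + mlp) + embeddings
print(f"~{total / 1e9:.1f}B parameters")     # ~70.5B, close to the listed 70B
```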
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
## **Intended Use** |
|
|
|
|
|
K2-V2 is designed for: |
|
|
|
|
|
* research on large language models and reasoning |
|
|
* downstream fine-tuning (e.g., instruction following, agents, domain models) |
|
|
* experimentation with long-context architectures |
|
|
* open, transparent benchmarking of LLM scaling |
|
|
|
|
|
K2-V2 is **not** instruction-tuned. For aligned conversational use, please see **K2-V2-Instruct**. |
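
For the downstream fine-tuning use case listed above, the sketch below attaches LoRA adapters with the `peft` library. The target module names are Llama-style guesses rather than confirmed names for this checkpoint, and this is not the recipe used to produce K2-V2-Instruct.

```python
# Parameter-efficient fine-tuning setup with LoRA; the target_modules names are
# assumptions. Inspect model.named_modules() to confirm them for K2-V2.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "LLM360/K2-V2", torch_dtype="auto", device_map="auto"
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices train
```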
|
|
|
|
|
--- |
|
|
|
|
|
## **Limitations** |
|
|
|
|
|
* May generate incorrect or hallucinated content, especially when asked about facts not seen during training |
|
|
* Not optimized for safety, moderation, or refusal behavior (base model) |
|
|
* Long-context performance depends on prompt quality and retrieval structure |
|
|
* Primarily trained on English; multilingual capabilities are limited |
|
|
* Inference cost is high due to the 70B parameter size |
|
|
|
|
|
--- |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use K2-V2 in your research, please cite the following: |
|
|
|
|
|
```bibtex
|
|
@misc{llm360_k2v2_2025, |
|
|
title = {K2-V2: A 360-Open, Reasoning-Enhanced Open Foundation Model}, |
|
|
author = {K2 Team}, |
|
|
year = {2025}, |
|
|
archivePrefix = {arXiv}, |
|
|
eprint = {XXXX.XXXXX}, |
|
|
primaryClass = {cs.CL} |
|
|
} |
|
|
``` |
|
|
|