---
license: apache-2.0
language:
- en
base_model:
- LLM360/K2-V2
---

# **K2-V2-Instruct**

📚 [Tech Report](https://www.llm360.ai/reports/K2_V2_report.pdf) - 📝 [Training Code](https://github.com/llm360/k2v2_train) - 🏢 [Evaluation Code](https://github.com/llm360/eval360)

🗂️ [Pretraining Data: TxT360](https://huggingface.co/datasets/LLM360/TxT360) - 🗂️ [Midtraining Data: TxT360-Midas](https://huggingface.co/datasets/LLM360/TxT360-Midas) - 🗂️ [SFT Data: TxT360-3efforts](https://huggingface.co/datasets/LLM360/TxT360-3efforts)

K2-V2 is our most capable fully open model to date, and one of the strongest open-weight models in its class. It uses a 70B-parameter dense transformer architecture and represents the latest advancement in the LLM360 model family.

*[Figure: K2-V2 SFT results]*

Beyond standard competencies such as factual knowledge and conversational ability, K2-V2 demonstrates strong long-context consistency, deep mathematical understanding, and robust reasoning skills. These capabilities serve as building blocks for sophisticated downstream applications, such as solving complex math problems and executing agentic workflows.

*[Figure: K2-V2 GPQA results]*

---

## **Quick Start**

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer; device_map="auto" spreads weights across available devices.
model = AutoModelForCausalLM.from_pretrained("llm360/k2-v2-instruct", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("llm360/k2-v2-instruct")

prompt = "Explain why the derivative of sin(x) is cos(x)."
messages = [
    {"role": "system", "content": "You are K2, a helpful assistant created by Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) Institute of Foundation Models (IFM)."},
    {"role": "user", "content": prompt},
]

# Render the chat template to a prompt string; reasoning_effort is passed through to the template.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    reasoning_effort="high",  # Or "medium"/"low"
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Alternatively, you may serve the model using vLLM:

```
vllm serve LLM360/K2-V2-Instruct --tensor-parallel-size 8 --port 8000
```

K2-V2-Instruct reads `reasoning_effort="low"|"medium"|"high"` from the chat template arguments to determine how much reasoning to perform. If you cannot call `tokenizer.apply_chat_template` directly, for example when querying a served model, you can pass the same argument through `extra_body` and `chat_template_kwargs`:

```
from openai import OpenAI

# Point the OpenAI client at the local vLLM server started above.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="key",
)

completion = client.chat.completions.create(
    model="LLM360/K2-V2-Instruct",
    messages=[
        {"role": "system", "content": "You are K2, a helpful assistant created by Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) Institute of Foundation Models (IFM)."},
        {"role": "user", "content": "Explain why the derivative of sin(x) is cos(x)."},
    ],
    # Forwarded by the server to the chat template.
    extra_body={
        "chat_template_kwargs": {"reasoning_effort": "high"},
    },
)
print(completion.choices[0].message.content)
```

---

## **Evaluation Summary**

Below we report performance across general, reasoning, mathematical, and coding benchmarks. Scores for K2-V2 checkpoints (base → mid-4) demonstrate the impact of staged mid-training on reasoning quality.

| Task / Model | base | mid-1 | mid-2 | mid-3 | mid-4 | Qwen2.5-72B | Llama3.0-70B | Llama3.1-70B | Olmo3-32B |
|--------------|------|-------|-------|-------|-------|--------------|---------------|---------------|------------|
| **General Tasks** | | | | | | | | | |
| **MMLU** | 74.3 | 74.4 | 73.5 | 75.0 | 75.2 | **86.1** | 79.5 | 79.3 | 75.2 |
| **MMLU-Pro** | 43.7 | 46.8 | 48.1 | **59.8** | 57.0 | 58.1 | 52.8 | 53.8 | 49.6 |
| **BBH** | 68.4 | 79.8 | 81.1 | 82.2 | 83.2 | **86.3** | 82.2 | 82.1 | 77.6 |
| **HELLASWAG** | 87.8 | 86.9 | 86.6 | 86.6 | 86.0 | 87.6 | **88.0** | 85.0 | 84.8 |
| **WINOGRANDE** | 82.6 | 83.7 | 83.7 | 83.7 | 83.0 | 83.9 | 85.3 | 79.8 | **90.3** |
| **PIQA** | 84.2 | 84.0 | 83.3 | 82.9 | 83.1 | 83.5 | 84.6 | 84.3 | **85.6** |
| **TRUTHFULQA** | 54.0 | 54.9 | 55.1 | 55.8 | 53.9 | **60.5** | 45.6 | 49.7 | 54.9 |
| **Math & STEM Tasks** | | | | | | | | | |
| **GPQA-DIAMOND** | 26.3 | 31.3 | 27.8 | 43.9 | **55.1** | 34.9 | 21.2 | 27.3 | 30.3 |
| **GSM8K** | 68.0 | 76.4 | 82.1 | **93.6** | 92.5 | 91.2 | 83.2 | 81.1 | 80.5 |
| **MATH** | 27.8 | 38.2 | 41.1 | **94.7** | 91.4 | 58.5 | 41.9 | 41.6 | 43.4 |
| **AIME 2025** | 0.0 | 17.6 | 25.1 | **53.2** | 46.9 | 1.7 | 0.1 | 0.2 | 14.7 |
| **ARC-CHALLENGE** | 64.9 | 66.4 | 66.4 | 66.0 | 66.3 | **72.4** | 69.2 | 64.9 | 65.4 |
| **Coding Tasks** | | | | | | | | | |
| **MBPP** | 57.6 | 57.8 | 58.2 | 59.8 | 61.8 | **75.4** | 69.2 | 64.4 | 60.2 |
| **HUMANEVAL** | 50.0 | 51.2 | 53.7 | **54.3** | **54.3** | **54.3** | 42.1 | 50.6 | 36.0 |
| **Logic Puzzles** | | | | | | | | | |
| **COUNTDOWN** | 1.3 | 53.3 | 53.1 | 35.9 | **75.6** | 6.0 | 1.0 | 0.5 | 23.2 |
| **KK-4 PEOPLE** | 4.8 | 44.9 | 68.0 | 64.5 | **92.9** | 26.1 | 4.2 | 7.6 | 42.4 |
| **KK-8 PEOPLE** | 0.5 | 23.2 | 41.3 | 51.6 | **82.8** | 5.7 | 1.1 | 1.3 | 13.0 |
| **ORDER-15 ITEMS** | 4.7 | 30.7 | 47.2 | 55.8 | **87.6** | 37.0 | 3.5 | 4.5 | 25.0 |
| **ORDER-30 ITEMS** | 0.0 | 0.3 | 3.0 | 34.1 | **40.3** | 0.7 | 0.2 | 0.1 | 0.6 |
| **Instruction Following** | | | | | | | | | |
| **IFEVAL** | 17.4 | 26.2 | 28.5 | 34.5 | 26.7 | **40.3** | 15.1 | 17.4 | 13.2 |
| **Arabic** | | | | | | | | | |
| **MMLU-Arabic** | 65.4 | 66.1 | 64.5 | 66.6 | 65.5 | **74.1** | 65.0 | 66.8 | 47.8 |

Below we report the evaluation results for K2-V2 after supervised fine-tuning (SFT). These variants correspond to three levels of reasoning effort (Low < Medium < High).

| Metric / Model | **K2 Low**<br>Dense · 70B | **K2 Medium**<br>Dense · 70B | **K2 High**<br>Dense · 70B | **Olmo3 Think SFT**<br>Dense · 32B · No RL | **Olmo3 Think**<br>Dense · 32B · RL | **GLM-4.5 Air**<br>MoE · 106B A12B | **MiniMax-M2**<br>MoE · 230B A10B | **Qwen3 235B**<br>MoE · 235B A22B · Reasoning | **Qwen 2.5 72B**<br>Dense · 72B |
|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| **LongBench V2** | 40.7 | 41.3 | 42.6 | 42.8 | 47.1 | 49.4 | 55.8 | 60.9 | 47.2 |
| **AIME25** | 27.3 | 62.0 | 80.2 | 68.3 | 73.3 | 81.3 | 75.8 | 88.8 | 15.2 |
| **HMMT25** | 19.0 | 45.6 | 71.4 | 43.3 | 50.83 | 73.3 | 63.5 | 84.2 | 9.79 |
| **GSM8K** | 92.4 | 92.0 | 94.8 | 96.1 | 95.7 | 96.1 | 95.4 | 93.5 | 85.8 |
| **Minerva** | 85.0 | 90.6 | 94.5 | 96.9 | 97.3 | 94.9 | 85.3 | 98.0 | 82.1 |
| **GPQA-D** | 48.5 | 60.6 | 69.3 | 58.0 | 59.8 | 75.3 | 76.2 | 80.7 | 50.5 |
| **MBPP** | 71.0 | 75.8 | 84.8 | 87.6 | 91.6 | 82.8 | 83.8 | 96.2 | 80.0 |
| **HumanEval** | 82.3 | 91.5 | 91.5 | 96.3 | 96.3 | 97.6 | 89.6 | 94.5 | 85.4 |
| **LCBv6** | 39.9 | 51.3 | 67.0 | 67.9 | 67.6 | 67.8 | 79.2 | 72.8 | 36.7 |

Please refer to our [Tech Report](https://www.llm360.ai/reports/K2_V2_report.pdf) for detailed evaluation results.

---

## **Datasets & Mixtures**

### **SFT Mix**

* **TxT360-3efforts**: curated instruction + mixed-difficulty reasoning traces
* Tool-calling demonstrations
* Small but high-value corpus to showcase model potential

All mixtures, filtering rules, and data sources are fully released for reproducibility.

Please refer to our [Tech Report](https://www.llm360.ai/reports/K2_V2_report.pdf) for details on the datasets and mixtures.

---

## **Model Description**

- **Model type:** K2-V2 follows a standard decoder-only transformer with grouped-query attention and RMSNorm.
- **Training stage:** Pre-training & Post-training
- **Language(s) (NLP):** English
- **License:** Apache 2.0

| Model Hyperparameter | Value |
| ----------- | ----------- |
| Total Parameters | 70B |
| Hidden Size | 8,192 |
| Intermediate Size (FFN) | 28,672 |
| Number of Attention Heads | 64 |
| Number of Layers | 80 |
| RMSNorm ɛ | 1e-5 |
| Pre-training Seq Length | 8,192 |
| Post-training Seq Length | 524,288 |
| Vocab Size | 250,000 |

---

## Citation

If you use K2-V2-Instruct in your research, please cite the following:

```
@misc{llm360_k2v2_2025,
  title = {K2-V2: A 360-Open, Reasoning-Enhanced LLM},
  author = {K2 Team},
  year = {2025},
}
```
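
As a rough sanity check on the hyperparameter table in the Model Description section, the sketch below estimates the total parameter count from the listed sizes. It is only an approximation under stated assumptions: this card does not give the number of key/value heads, the FFN variant, or whether input and output embeddings are tied, so the value `N_KV_HEADS = 8`, the three-matrix gated FFN, and the bias-free layers are all hypothetical choices made for illustration. The function name `estimate_params` is likewise illustrative and not part of any released code.

```python
# Rough parameter-count estimate from the hyperparameter table.
# ASSUMPTIONS (not stated in this card): 8 key/value heads for grouped-query
# attention, a gated FFN with three weight matrices, and no bias terms.

HIDDEN = 8_192      # Hidden Size (from the table)
FFN = 28_672        # Intermediate Size (from the table)
N_HEADS = 64        # Number of Attention Heads (from the table)
N_LAYERS = 80       # Number of Layers (from the table)
VOCAB = 250_000     # Vocab Size (from the table)
N_KV_HEADS = 8      # ASSUMPTION: not given in the card


def estimate_params(tied_embeddings: bool) -> int:
    """Return an approximate parameter count under the assumptions above."""
    head_dim = HIDDEN // N_HEADS
    kv_dim = N_KV_HEADS * head_dim
    # Query and output projections are HIDDEN x HIDDEN; key and value
    # projections are HIDDEN x kv_dim under grouped-query attention.
    attn = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * kv_dim
    # Gated FFN: gate, up, and down projections (assumed).
    mlp = 3 * HIDDEN * FFN
    # Two RMSNorm weight vectors per layer.
    norms = 2 * HIDDEN
    per_layer = attn + mlp + norms
    embed = VOCAB * HIDDEN
    embed_total = embed if tied_embeddings else 2 * embed
    # The trailing HIDDEN term accounts for the final RMSNorm.
    return N_LAYERS * per_layer + embed_total + HIDDEN


if __name__ == "__main__":
    for tied in (True, False):
        print(f"tied={tied}: {estimate_params(tied) / 1e9:.1f}B parameters")
```

Under these assumptions the estimate comes out at roughly 70.5B parameters with tied embeddings and 72.5B with untied embeddings, which is in line with the 70B figure in the table. A different number of key/value heads or a different FFN layout would shift the result somewhat.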