
license: apache-2.0
language:
  - en

K2-V2

📚 Tech Report - 📝 Code - 🏢 Project Page

K2-V2 is our most capable fully open model to date, and one of the strongest open-weight models in its class. It uses a 70B-parameter dense transformer architecture and represents the latest advancement in the LLM360 model family.

Beyond standard competencies such as factual knowledge and conversational ability, K2-V2 demonstrates strong long-context consistency, deep mathematical understanding, and robust reasoning skills. These capabilities serve as building blocks for sophisticated downstream applications, such as solving complex math problems and executing agentic workflows.

Quick Start

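from transformers import AutoModelForCausalLM, AutoTokenizer

# torch_dtype="auto" loads the weights in the dtype stored in the checkpoint;
# device_map="auto" (requires the accelerate package) spreads them across available devices.
model = AutoModelForCausalLM.from_pretrained("LLM360/K2-V2", torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("LLM360/K2-V2")

# K2-V2 is a base model, so it continues the prompt as plain text.
prompt = "Explain why the derivative of sin(x) is cos(x)."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
# The decoded output contains the prompt followed by the generated continuation.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))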

Evaluation Summary

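Below we report performance on general, mathematical and STEM, and coding benchmarks. Scores for the K2-V2 checkpoints (base through mid-4) show the effect of staged mid-training on reasoning quality: between base and mid-4, MATH rises from 27.8 to 91.4 and AIME 2025 from 0.0 to 46.9, while general benchmarks such as MMLU and HELLASWAG change by less than two points.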

Task / Model	base	mid-1	mid-2	mid-3	mid-4	Qwen2.5-72B	Llama3.0-70B	Llama3.1-70B	Olmo3-32B
General Tasks
MMLU	74.3	74.4	73.5	75.0	75.2	86.1	79.5	79.3	75.2
MMLU-Pro	43.7	46.8	48.1	59.8	57.0	58.1	52.8	53.8	49.6
BBH	68.4	79.8	81.1	82.2	83.2	86.3	82.2	82.1	77.6
HELLASWAG	87.8	86.9	86.6	86.6	86.0	87.6	88.0	85.0	84.8
WINOGRANDE	82.6	83.7	83.7	83.7	83.0	83.9	85.3	79.8	90.3
PIQA	84.2	84.0	83.3	82.9	83.1	83.5	84.6	84.3	85.6
TRUTHFULQA	54.0	54.9	55.1	55.8	53.9	60.5	45.6	49.7	54.9
Math & STEM Tasks
GPQA-DIAMOND	26.3	31.3	27.8	43.9	55.1	34.9	21.2	27.3	30.3
GSM8K	68.0	76.4	82.1	93.6	92.5	91.2	83.2	81.1	80.5
MATH	27.8	38.2	41.1	94.7	91.4	58.5	41.9	41.6	43.4
AIME 2025	0.0	17.6	25.1	53.2	46.9	1.7	0.1	0.2	14.7
ARC-CHALLENGE	64.9	66.4	66.4	66.0	66.3	72.4	69.2	64.9	65.4
Coding Tasks
MBPP	57.6	57.8	58.2	59.8	61.8	75.4	69.2	64.4	60.2
HUMANEVAL	50.0	51.2	53.7	54.3	54.3	54.3	42.1	50.6	36.0

Please refer to our Tech Report for detailed evaluation results.
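As a rough illustration of how to read the table, the sketch below computes the change from the base checkpoint to mid-4 for a handful of rows. The scores are copied from the table above; the `SCORES` dictionary and the `gains` helper are hypothetical names introduced for this example only.

```python
# (base, mid-4) scores copied from the evaluation table above.
SCORES = {
    "MMLU-Pro": (43.7, 57.0),
    "GPQA-DIAMOND": (26.3, 55.1),
    "GSM8K": (68.0, 92.5),
    "MATH": (27.8, 91.4),
    "AIME 2025": (0.0, 46.9),
    "HELLASWAG": (87.8, 86.0),
}


def gains(scores):
    """Return the mid-4 minus base difference per task, rounded to one decimal."""
    return {task: round(final - base, 1) for task, (base, final) in scores.items()}


if __name__ == "__main__":
    # Print tasks from largest to smallest gain.
    for task, delta in sorted(gains(SCORES).items(), key=lambda kv: -kv[1]):
        print(f"{task:<14} {delta:+.1f}")
```

The largest gains in this subset are on the math and reasoning rows, while HELLASWAG moves slightly in the other direction.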

Datasets & Mixtures

K2-V2 training is organized into three stages, each using a transparent, publicly released mixture:

Pretraining Mix

Large-scale natural text corpus spanning web content, books, code, and multilingual sources
Mixture designed for stable scaling and broad general-knowledge coverage
~12T tokens

Mid-Training Mix

TxT360-Midas: reasoning-oriented + long-context extensions
Domain-focused sources: math, programming, scientific literature
Synthetic expansions where natural data is scarce

SFT Mix

See the K2-V2-Instruct model card: https://huggingface.co/LLM360/K2-V2-Instruct

All mixtures, filtering rules, and data sources are fully released for reproducibility.

Please refer to our Tech Report for details on the datasets and mixtures.

Model Description

Model type: K2-V2 is a standard decoder-only transformer with grouped-query attention and RMSNorm.
Training stage: Pre-training
Language(s) (NLP): English
License: Apache 2.0

Model Hyperparameter	Value
Total Parameters	70B
Hidden Size	8,192
Intermediate Size (FFN)	28,672
Number of Attention Heads	64
Number of Layers	80
RMSNorm ɛ	1e-5
Pre-training Seq Length	8,192
Max Mid-training Seq Length	524,288
Vocab Size	250,000
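The hyperparameters above can be sanity-checked against the stated 70B total with a back-of-the-envelope parameter count. This is only a sketch under assumptions that the table does not state: the number of key/value heads (`num_kv_heads=8`), a gated three-matrix FFN, no bias terms, and untied input and output embeddings are all guesses made for illustration. The Tech Report is the authoritative source for the actual architecture.

```python
def estimate_params(hidden=8192, intermediate=28672, heads=64, layers=80,
                    vocab=250_000, num_kv_heads=8, tied_embeddings=False):
    """Rough parameter count for a decoder-only transformer with grouped-query attention.

    num_kv_heads, the gated FFN layout, the absence of biases, and
    tied_embeddings are ASSUMPTIONS, not values from the model card.
    """
    head_dim = hidden // heads                      # 8192 / 64 = 128
    kv_dim = num_kv_heads * head_dim
    attention = 2 * hidden * hidden + 2 * hidden * kv_dim   # Q and O, plus K and V
    ffn = 3 * hidden * intermediate                 # gate, up, and down projections (assumed)
    norms = 2 * hidden                              # two RMSNorm weight vectors per layer
    per_layer = attention + ffn + norms
    embeddings = vocab * hidden * (1 if tied_embeddings else 2)
    return layers * per_layer + embeddings + hidden  # + final RMSNorm


if __name__ == "__main__":
    for tied in (False, True):
        total = estimate_params(tied_embeddings=tied)
        print(f"tied_embeddings={tied}: {total / 1e9:.1f}B parameters")
```

Under these assumptions the estimate lands within a few percent of 70B; a different number of key/value heads or embedding layout shifts the total slightly.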

Intended Use

K2-V2 is designed for:

research on large language models and reasoning
downstream fine-tuning (e.g., instruction following, agents, domain models)
experimentation with long-context architectures
open, transparent benchmarking of LLM scaling

K2-V2 is not instruction-tuned. For aligned conversational use, please see K2-V2-Instruct.
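Because the base model continues text rather than following instructions, a completion-style few-shot prompt is a common way to steer it. The sketch below shows one possible format. The `build_few_shot_prompt` helper and the `Question:`/`Answer:` template are hypothetical choices for illustration, not a format prescribed by the K2-V2 authors.

```python
def build_few_shot_prompt(examples, query):
    """Format (question, answer) pairs plus a final open question as plain text.

    The prompt ends with "Answer:" so that the model's continuation is the answer.
    """
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in examples]
    blocks.append(f"Question: {query}\nAnswer:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    examples = [
        ("What is 7 * 8?", "56"),
        ("What is the derivative of x^2?", "2x"),
    ]
    print(build_few_shot_prompt(examples, "What is the derivative of sin(x)?"))
```

The resulting string can be passed as `prompt` in the Quick Start snippet. Limiting `max_new_tokens` helps keep the model from continuing with further invented question and answer pairs.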

Limitations

May generate incorrect or hallucinated content, especially when asked about facts not seen during training
Not optimized for safety, moderation, or refusal behavior (base model)
Long-context performance depends on prompt quality and retrieval structure
Primarily trained on English; multilingual capabilities are limited
Inference cost is high due to the 70B parameter size
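To give a sense of the inference cost, the sketch below estimates the memory needed just to hold the weights at several precisions. It assumes exactly 70e9 parameters and ignores activations, the KV cache, and framework overhead, so actual requirements will be higher. The `weight_memory_gb` helper is a hypothetical name used only for this example.

```python
NUM_PARAMS = 70e9  # nominal parameter count from the model card

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "bfloat16": 2.0,
    "int8": 1.0,
    "4-bit": 0.5,
}


def weight_memory_gb(num_params, bytes_per_param):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    for name, nbytes in BYTES_PER_PARAM.items():
        print(f"{name:<9} {weight_memory_gb(NUM_PARAMS, nbytes):6.0f} GB")
```

At 16-bit precision the weights alone take about 140 GB, which is why the Quick Start snippet relies on `device_map="auto"` to spread the model across multiple devices.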

Citation

If you use K2-V2 in your research, please cite the following:

@misc{llm360_k2v2_2025,
  title         = {K2-V2: A 360-Open, Reasoning-Enhanced Open Foundation Model},
  author        = {K2 Team},
  year          = {2025},
  archivePrefix = {arXiv},
  eprint        = {XXXX.XXXXX},
  primaryClass  = {cs.CL}
}