---
license: apache-2.0
language:
- zh
- en
pipeline_tag: text-generation
library_name: transformers
---
# Klear
<div align="center">
<img src="figures/klear-logo-02.png" width="500"/>
<p>
🤗 <a href="https://huggingface.co/Kwai-Klear">Hugging Face</a> | 💻 <a href="https://github.com/Kwai-Klear/Klear1.0/tree/main">Github Repository</a> | 📑 <a href="https://huggingface.co/Kwai-Klear/Klear-46B-A2.5B-Base">Technical Report</a> | 💬 <a href="https://github.com/Kwai-Klear/Klear1.0/issues">Issues & Discussions</a>
</p>
</div>
## 🔥News
- 2025.09.05: We’ve released the `Klear-46B-A2.5B` series, which currently includes a base model and an instruction-tuned model with DPO. A reasoning-enhanced variant is also in training; stay tuned for upcoming updates!
## 1. Introduction
`Klear-46B-A2.5B` is a sparse Mixture-of-Experts (MoE) large language model developed by **the Kwai-Klear Team at Kuaishou**, designed to deliver both **high performance** and **inference efficiency**. Each layer features **256 routed experts**, of which only **8 are activated per token alongside 1 shared expert**, resulting in **46 billion total parameters** but just **2.5 billion active**, achieving dense-level performance at a fraction of the computational cost.
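To make the routing pattern concrete, below is a minimal, illustrative PyTorch sketch of top-k MoE routing with one shared expert, using the sizes above. This is not the actual Klear implementation; the softmax top-k gating, the two-layer SiLU expert MLPs, and the naive per-token dispatch loop are simplifying assumptions.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Illustrative sparse-MoE feed-forward layer (not the Klear implementation)."""
    def __init__(self, hidden_size=2048, moe_intermediate_size=896,
                 num_experts=256, num_experts_per_tok=8):
        super().__init__()
        self.top_k = num_experts_per_tok
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        mlp = lambda: nn.Sequential(
            nn.Linear(hidden_size, moe_intermediate_size),
            nn.SiLU(),
            nn.Linear(moe_intermediate_size, hidden_size),
        )
        self.experts = nn.ModuleList(mlp() for _ in range(num_experts))
        self.shared_expert = mlp()  # always active, in addition to the top-k

    def forward(self, x):  # x: (num_tokens, hidden_size)
        scores = F.softmax(self.router(x), dim=-1)             # (T, num_experts)
        weights, idx = torch.topk(scores, self.top_k, dim=-1)  # (T, top_k)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize gates
        out = self.shared_expert(x)
        for t in range(x.shape[0]):  # naive per-token dispatch, for clarity only
            for w, e in zip(weights[t], idx[t]):
                out[t] = out[t] + w * self.experts[int(e)](x[t])
        return out

layer = SparseMoELayer()
y = layer(torch.randn(4, 2048))  # per token, only 8 of 256 routed experts run
```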
The model was trained on over **22 trillion tokens** using a **three-stage progressive curriculum**:
**1. Foundational Knowledge Learning (12T tokens):**
General-purpose datasets such as CommonCrawl were processed with stratified quality filters, following a curriculum learning strategy that progresses from lower to higher data quality.
**2. Data Complexity Enhancement (8T tokens):**
The proportion of mathematical, coding, and STEM-related data was gradually increased to strengthen the model's reasoning and problem-solving capabilities.
**3. Reasoning Enhancement and Long-Context Stage (2T tokens):**
Training focused on synthetic and reasoning-intensive data, combined with a fast learning rate annealing strategy to maximize data efficiency and optimize final performance.
As a result, Klear-46B-A2.5B-Base matches or surpasses the performance of dense models with several times more active parameters, while offering significantly better efficiency and cost-effectiveness for real-world deployment.
### Model Summary
The base and instruction-tuned (+ DPO) models share the following architecture:
| **key** | **value** |
|---------------------------|------------------------------------------------------------------------|
| hidden_size | 2048 |
| moe_intermediate_size | 896 |
| n_shared_experts | 1 |
| num_attention_heads | 32 |
| num_experts | 256 |
| num_experts_per_tok | 8 |
| num_hidden_layers | 32 |
| num_key_value_heads | 4 |
| vocab_size | 151936 |
| tie_word_embeddings | false |
| context length | 65536 |
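A back-of-envelope check that this configuration is consistent with the 46B-total / 2.5B-active figures (assumptions: SwiGLU-style gated experts with three weight matrices each, head_dim = 2048 / 32 = 64, and router/normalization parameters ignored):
```python
hidden, inter = 2048, 896
layers, experts, top_k, shared = 32, 256, 8, 1
kv_heads, head_dim, vocab = 4, 64, 151936

per_expert = 3 * hidden * inter                                # gate/up/down projections
attn = 2 * hidden * hidden + 2 * hidden * kv_heads * head_dim  # q/o + GQA k/v
embed = 2 * vocab * hidden                                     # untied in/out embeddings

total = layers * ((experts + shared) * per_expert + attn) + embed
active = layers * ((top_k + shared) * per_expert + attn) + embed
print(f"total ≈ {total / 1e9:.1f}B, active ≈ {active / 1e9:.2f}B")  # ≈ 46.2B, 2.51B
```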
### Model Downloads
<div align="center">
| **Model** | **#Total Params** | **#Activated Params** | **Context Length** | **Download Link** |
| :------------: | :------------: | :------------: | :------------: | :------------: |
| Klear-46B-A2.5B-Base | 46B | 2.5B | 64K | [🤗 Hugging Face](https://huggingface.co/Kwai-Klear/Klear-46B-A2.5B-Base) |
| Klear-46B-A2.5B-Instruct | 46B | 2.5B | 64K | [🤗 Hugging Face](https://huggingface.co/Kwai-Klear/Klear-46B-A2.5B-Instruct) |
</div>
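To fetch the weights programmatically, you can use `snapshot_download` from `huggingface_hub` (a minimal sketch; the local directory is an arbitrary choice):
```python
from huggingface_hub import snapshot_download

# Download the full instruct checkpoint; swap in Klear-46B-A2.5B-Base as needed.
local_dir = snapshot_download(
    repo_id="Kwai-Klear/Klear-46B-A2.5B-Instruct",
    local_dir="/path/to/Klear-Instruct",
)
print(local_dir)
```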
## 2. Benchmark Evaluation
### Klear-46B-A2.5B-Base Evaluation Results
| Ability | Benchmark | Klear-46B-A2.5B-Base | MiMo-7B-Base | Qwen3-8B-Base | Qwen3-14B-Base | Ling-lite-1.5-Base | Qwen3-30B-A3B-Base |
| ----------- | ---------------------- | -------------------- | ------------ | ------------- | -------------- | ------------------ | ------------------ |
| | # Total Params | 46B | 7B | 8B | 14B | 16.8B | 30B |
| | # Activated Params | 2.5B | 7B | 8B | 14B | 2.75B | 3B |
| **Code** | HumanEval† (0-shot) | 89 | - | 84.1 | 87.8 | 83.5 | 90.9 |
| | MBPP (3-shot) | 76 | 69.2* | 69 | 74 | 66.6 | 75.6 |
| **Math** | MATH (4-shot, cot) | 55.7 | 38.8 | 60.8* | 62.02* | 59.9 | 59.04* |
| | CMATH (3-shot) | 87.83 | 78.5 | 88.3 | 90.7 | 85.7 | 89.7 |
| | GSM8K (4-shot, cot) | 87.3 | 78.47 | 89.4 | 90.3 | 87.6 | 91.1 |
| **General** | MMLU-Pro (5-shot, cot) | 57.6 | 43.1 | 55.2 | 58.1 | 49.9 | 58.8 |
| | MMLU (5-shot) | 80.5 | 69.24 | 77.1 | 80.6 | 73.7 | 80.4 |
| | CEval (5-shot) | 89.8 | 67.98 | 81.9 | 84.8 | 78.2 | 87.4 |
| | CMMLU (5-shot) | 88 | 70.79 | 82 | 85.6 | 81.2 | 87.1 |
| | GPQA (0-shot) | 35.3 | 31.03 | 33.9 | 35.7 | 30.1 | 35.5 |
| | AGIEval (0-shot) | 52.3 | 48.3* | 51.7 | 55.7 | 54.3 | 56 |
| | BBH (3-shot, cot) | 77.9 | 75.6 | 78.1 | 80.1 | 75.4 | 81.2 |
| | HellaSwag (0-shot) | 80.5 | 80* | 78.7 | 81.5 | 80 | 81.2 |
| | TriviaQA (5-shot) | 69.6 | 60.8* | 56.3 | 62.1 | 60.9 | 65.6 |
| | NaturalQuestions (5-shot) | 37.5 | 23.46 | 25.7 | 29.1 | 28 | 30.7 |
| | PIQA (0-shot) | 81.6 | 80.14 | 79.5 | 81.9 | 82 | 80.7 |
| | OpenBookQA (0-shot) | 37.8 | 34.2 | 35 | 35.6 | 38.2 | 34.6 |
| | Average | 69.66 | - | 66.62 | 69.60 | 65.60 | 70.41 |
Note:
1. Results marked with `*` are taken from the corresponding models' public reports; all other numbers were produced with our internal evaluation frameworks.
2. `†` During pretraining we found that the HumanEval metric fluctuated significantly and was extremely sensitive to formatting, so we adapted the prompt from the Ling-series paper to modify the original HumanEval setup. The table reports results under this modified setup.
### Klear-46B-A2.5B-Instruct Evaluation Results
| Ability | Benchmark | Klear-46B-A2.5B-Instruct | InternLM3-8B-Instruct | MiniCPM4-8B | Qwen3-8B (NoThink) | gemma3-12b-it | Phi4-14B | Qwen3-30B-A3B-2507 |
| ------------- | --------------------------- | ------------------------- | --------------------- | ----------- | ------------------ | ------------- | -------- | ------------------ |
| | # Total Params | 46B | 8B | 8B | 8B | 12B | 14B | 30B |
| | # Activated Params | 2.5B | 8B | 8B | 8B | 12B | 14B | 3B |
| **General** | MMLU-Redux | 81.95 | 74.65 | 77.63 | 79.32 | 78.39 | 83.09 | 88.11 |
| | MMLU-Pro | 63.61 | 50.87 | 54.69 | 63.8 | 60.69 | 67.25 | 78.22 |
| | GPQA-Diamond | 49.12 | 38.76 | 38.51 | 51.77 | 39.02 | 59.47 | 71.21 |
| | SimpleQA | 6.2 | 4.44 | 3.51 | 5.5 | 6.22 | 3.28 | 23.39 |
| | CLUEWSC | 88.49 | 77.63 | 81.91 | 82.89 | 91.12 | 88.16 | 92.11 |
| | CEval | 85.98 | 84.26 | 81.78 | 81.66 | 60.81 | 64.79 | 88.57 |
| | C-SimpleQA | 42.8 | 25.87 | 23.13 | 37.07 | 28.97 | 24.77 | 75.37 |
| | LiveBench 1125 | 50 | 26.3 | 25.5 | 52.1 | 43.1 | 40 | 68.4 |
| **Math** | MATH500 | 86.4 | 68.4 | 79.8 | 85 | 86.8 | 80.6 | 97.2 |
| | AIME24 | 28.33 | 11.25 | 22.92 | 28.33 | 23.96 | 15.83 | 75 |
| | AIME25 | 19.17 | 8.12 | 15.21 | 20.62 | 18.33 | 18.75 | 61.88 |
| **Code** | HumanEval | 86.59 | 82.3* | 78.05 | 83.54 | 82.32 | 85.37 | 81.71 |
| | HumanEval+ | 79.27 | - | 73.17 | 76.83 | 75.61 | 83.54 | 76.83 |
| | MBPP (EvalPlus) | 79.9 | 62.4 | 83.3 | 76.2 | 85.7 | 77.5 | 89.4 |
| | MBPP+ (EvalPlus) | 68.8 | 50.4 | 71.7 | 66.1 | 74.1 | 66.7 | 75.1 |
| | LiveCodeBench v5 (2408-2501) | 27.96 | 14.7 | 12.19 | 27.24 | 24.73 | 23.66 | 41.22 |
| **Alignment** | IF-Eval | 81.89 | 79.3 | 73.01 | 84.47 | 81.52 | 59.33 | 83.92 |
| | Multi-IF(en+zh) | 78.46 | 61.83 | 61.79 | 78.95 | 76.56 | 62.7 | 77.75 |
| | MTBench | 8.42 | 7.86 | 6.875 | 8.21 | 8.68 | 8.62 | 9.33 |
| | MT-Eval | 8.13 | 7.36 | 6.7 | 8.18 | 8.45 | 8.12 | - |
| | AlignBench v1.1 | 7 | 6.13 | 5.99 | 6.95 | 6.3 | 6.33 | 7.06 |
| | Average | 53.74 | - | 46.54 | 52.61 | 50.54 | 48.95 | - |
Note:
1. For InternLM3-8B-Instruct, results marked with `*` are taken from the official website; all other numbers were produced with our internal evaluation frameworks.
2. For Multi-IF, we report the overall average computed across all three rounds, pooling the Chinese and English metrics.
## 3. Quick Start
### Inference with Hugging Face Transformers
Inference is supported in Transformers starting from version `4.56.0`; make sure to pass `trust_remote_code=True` when loading the tokenizer and model.
#### Klear-46B-A2.5B-Base
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "/path/to/Klear-Base"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Load in bfloat16 and shard across the available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", dtype=torch.bfloat16, trust_remote_code=True
)

text = "世界上最大的湖是"  # "The largest lake in the world is"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=256)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```
#### Klear-46B-A2.5B-Instruct
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "/path/to/Klear-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", dtype=torch.bfloat16, trust_remote_code=True
)

messages = [
    {"role": "user", "content": "帮我用 python 写一个计算器的代码吧。"}  # "Write me a calculator in Python."
]
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_tensor.to(model.device), max_new_tokens=1024)
# Decode only the newly generated tokens, skipping the prompt.
result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
print(result)
```
### Inference with vLLM
[vLLM](https://github.com/vllm-project/vllm) is a high-throughput, memory-efficient inference framework. We provide **our own forked version of [vLLM](https://github.com/Kwai-Klear/vllm) here**.
```shell
git clone https://github.com/Kwai-Klear/vllm.git
cd vllm
# Reuse precompiled kernels to speed up the editable install.
VLLM_USE_PRECOMPILED=1 pip install --editable .
# Serve the instruct model with 8-way tensor parallelism on port 8000.
vllm serve /path/to/Klear-Instruct --port 8000 --tensor-parallel-size 8 --trust-remote-code
```
An OpenAI-compatible API will be available at `http://localhost:8000/v1`.
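You can then query the endpoint with the official `openai` Python client, for example (a minimal sketch; the `model` field must match the path you served, or the value passed via `--served-model-name`):
```python
from openai import OpenAI

# vLLM's OpenAI-compatible server accepts any API key by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="/path/to/Klear-Instruct",  # must match the served model name
    messages=[{"role": "user", "content": "帮我用 python 写一个计算器的代码吧。"}],  # "Write me a calculator in Python."
    temperature=0.6,
    max_tokens=1024,
)
print(response.choices[0].message.content)
```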
Alternatively, you can use the following Python script for offline inference:
```python
import torch
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = "/path/to/Klear-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Shard the model across all visible GPUs.
llm = LLM(
    model=model_path,
    trust_remote_code=True,
    tensor_parallel_size=torch.cuda.device_count(),
    gpu_memory_utilization=0.7,
)

messages = [
    {"role": "user", "content": "帮我用 python 写一个计算器的代码吧。"}  # "Write me a calculator in Python."
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Note: vLLM's SamplingParams uses `max_tokens`, not `max_new_tokens`.
sampling_params = SamplingParams(
    temperature=0.6, top_p=0.95, top_k=40, max_tokens=1024
)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```