---
license: apache-2.0
language:
- zh
- en
pipeline_tag: text-generation
library_name: transformers
---
# Klear
<div align="center">
<img src="figures/klear-logo-02.png" width="500"/>
<p>
🤗 <a href="https://huggingface.co/Kwai-Klear">Hugging Face</a> | 💻 <a href="https://github.com/Kwai-Klear/Klear1.0/tree/main">Github Repository</a> | 📑 <a href="https://huggingface.co/Kwai-Klear/Klear-46B-A2.5B-Base">Technical Report</a> | 💬 <a href="https://github.com/Kwai-Klear/Klear1.0/issues">Issues & Discussions</a>
</p>
</div>
## 🔥News
- 2025.09.05: We’ve released the `Klear-46B-A2.5B` series, which currently includes a base model and an instruction-tuned model with DPO. A reasoning-enhanced variant is also in training; stay tuned for upcoming updates!
## 1. Introduction
`Klear-46B-A2.5B` is a sparse Mixture-of-Experts (MoE) large language model developed by **the Kwai-Klear Team at Kuaishou**, designed to deliver both **high performance** and **inference efficiency**. Each layer features **256 routed experts**, of which only **8 are activated per token alongside 1 shared expert**, resulting in **46 billion total parameters** but just **2.5 billion active**, achieving dense-level performance at a fraction of the computational cost.
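To make the routing pattern concrete, below is a minimal, illustrative PyTorch sketch of top-k MoE routing with one shared expert, using the sizes above. This is not the actual Klear implementation; the softmax top-k gating, the two-layer SiLU expert MLPs, and the naive per-token dispatch loop are simplifying assumptions.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Illustrative sparse-MoE feed-forward layer (not the Klear implementation)."""
    def __init__(self, hidden_size=2048, moe_intermediate_size=896,
                 num_experts=256, num_experts_per_tok=8):
        super().__init__()
        self.top_k = num_experts_per_tok
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        mlp = lambda: nn.Sequential(
            nn.Linear(hidden_size, moe_intermediate_size),
            nn.SiLU(),
            nn.Linear(moe_intermediate_size, hidden_size),
        )
        self.experts = nn.ModuleList(mlp() for _ in range(num_experts))
        self.shared_expert = mlp()  # always active, in addition to the top-k

    def forward(self, x):  # x: (num_tokens, hidden_size)
        scores = F.softmax(self.router(x), dim=-1)             # (T, num_experts)
        weights, idx = torch.topk(scores, self.top_k, dim=-1)  # (T, top_k)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize gates
        out = self.shared_expert(x)
        for t in range(x.shape[0]):  # naive per-token dispatch, for clarity only
            for w, e in zip(weights[t], idx[t]):
                out[t] = out[t] + w * self.experts[int(e)](x[t])
        return out

layer = SparseMoELayer()
y = layer(torch.randn(4, 2048))  # per token, only 8 of 256 routed experts run
```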
The model was trained on over **22 trillion tokens** using a **three-stage progressive curriculum**:
**1. Foundational Knowledge Learning (12T tokens):**
General-purpose datasets such as CommonCrawl were processed with stratified quality filters, following a curriculum learning strategy that progresses from lower to higher data quality.
**2. Data Complexity Enhancement (8T tokens):**
The proportion of mathematical, coding, and STEM-related data was gradually increased to strengthen the model's reasoning and problem-solving capabilities.
**3. Reasoning Enhancement and Long-Context Stage (2T tokens):**
Training focused on synthetic and reasoning-intensive data, combined with a fast learning rate annealing strategy to maximize data efficiency and optimize final performance.
As a result, Klear-46B-A2.5B-Base matches or surpasses the performance of dense models with several times more active parameters, while offering significantly better efficiency and cost-effectiveness for real-world deployment.
### Model Summary
The base and instruction-tuned (+ DPO) models share the following architecture:
| **key** | **value** |
|---------------------------|------------------------------------------------------------------------|
| hidden_size | 2048 |
| moe_intermediate_size | 896 |
| n_shared_experts | 1 |
| num_attention_heads | 32 |
| num_experts | 256 |
| num_experts_per_tok | 8 |
| num_hidden_layers | 32 |
| num_key_value_heads | 4 |
| vocab_size | 151936 |
| tie_word_embeddings | false |
| context length | 65536 |
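A back-of-envelope check that this configuration is consistent with the 46B-total / 2.5B-active figures (assumptions: SwiGLU-style gated experts with three weight matrices each, head_dim = 2048 / 32 = 64, and router/normalization parameters ignored):
```python
hidden, inter = 2048, 896
layers, experts, top_k, shared = 32, 256, 8, 1
kv_heads, head_dim, vocab = 4, 64, 151936

per_expert = 3 * hidden * inter                                # gate/up/down projections
attn = 2 * hidden * hidden + 2 * hidden * kv_heads * head_dim  # q/o + GQA k/v
embed = 2 * vocab * hidden                                     # untied in/out embeddings

total = layers * ((experts + shared) * per_expert + attn) + embed
active = layers * ((top_k + shared) * per_expert + attn) + embed
print(f"total ≈ {total / 1e9:.1f}B, active ≈ {active / 1e9:.2f}B")  # ≈ 46.2B, 2.51B
```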
### Model Downloads
<div align="center">
| **Model** | **#Total Params** | **#Activated Params** | **Context Length** | **Download Link** |
| :------------: | :------------: | :------------: | :------------: | :------------: |
| Klear-46B-A2.5B-Base | 46B | 2.5B | 64K | [🤗 Hugging Face](https://huggingface.co/Kwai-Klear/Klear-46B-A2.5B-Base) |
| Klear-46B-A2.5B-Instruct | 46B | 2.5B | 64K | [🤗 Hugging Face](https://huggingface.co/Kwai-Klear/Klear-46B-A2.5B-Instruct) |
</div>
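To fetch the weights programmatically, you can use `snapshot_download` from `huggingface_hub` (a minimal sketch; the local directory is an arbitrary choice):
```python
from huggingface_hub import snapshot_download

# Download the full instruct checkpoint; swap in Klear-46B-A2.5B-Base as needed.
local_dir = snapshot_download(
    repo_id="Kwai-Klear/Klear-46B-A2.5B-Instruct",
    local_dir="/path/to/Klear-Instruct",
)
print(local_dir)
```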
## 2. Benchmark Evaluation
### Klear-46B-A2.5B-Base Evaluation Results
| Ability | Benchmark | Klear-46B-A2.5B-Base | MiMo-7B-Base | Qwen3-8B-Base | Qwen3-14B-Base | Ling-lite-1.5-Base | Qwen3-30B-A3B-Base |
| ----------- | ---------------------- | -------------------- | ------------ | ------------- | -------------- | ------------------ | ------------------ |
| | # Total Params | 46B | 7B | 8B | 14B | 16.8B | 30B |
| | # Activated Params | 2.5B | 7B | 8B | 14B | 2.75B | 3B |
| **Code** | HumanEval† (0-shot) | 89 | - | 84.1 | 87.8 | 83.5 | 90.9 |
| | MBPP (3-shot) | 76 | 69.2* | 69 | 74 | 66.6 | 75.6 |
| **Math** | MATH (4-shot, cot) | 55.7 | 38.8 | 60.8* | 62.02* | 59.9 | 59.04* |
| | CMATH (3-shot) | 87.83 | 78.5 | 88.3 | 90.7 | 85.7 | 89.7 |
| | GSM8K (4-shot, cot) | 87.3 | 78.47 | 89.4 | 90.3 | 87.6 | 91.1 |
| **General** | MMLU-Pro (5-shot, cot) | 57.6 | 43.1 | 55.2 | 58.1 | 49.9 | 58.8 |
| | MMLU (5-shot) | 80.5 | 69.24 | 77.1 | 80.6 | 73.7 | 80.4 |
| | CEval (5-shot) | 89.8 | 67.98 | 81.9 | 84.8 | 78.2 | 87.4 |
| | CMMLU (5-shot) | 88 | 70.79 | 82 | 85.6 | 81.2 | 87.1 |
| | GPQA (0-shot) | 35.3 | 31.03 | 33.9 | 35.7 | 30.1 | 35.5 |
| | AGIEval (0-shot) | 52.3 | 48.3* | 51.7 | 55.7 | 54.3 | 56 |
| | BBH (3-shot, cot) | 77.9 | 75.6 | 78.1 | 80.1 | 75.4 | 81.2 |
| | HellaSwag (0-shot) | 80.5 | 80* | 78.7 | 81.5 | 80 | 81.2 |
| | TriviaQA (5-shot) | 69.6 | 60.8* | 56.3 | 62.1 | 60.9 | 65.6 |
| | NaturalQuestions (5-shot) | 37.5 | 23.46 | 25.7 | 29.1 | 28 | 30.7 |
| | PIQA (0-shot) | 81.6 | 80.14 | 79.5 | 81.9 | 82 | 80.7 |
| | OpenBookQA (0-shot) | 37.8 | 34.2 | 35 | 35.6 | 38.2 | 34.6 |
| | Average | 69.66 | - | 66.62 | 69.60 | 65.60 | 70.41 |
Note:
1. Results marked with `*` are taken from the corresponding models' public reports; all other numbers were produced with our internal evaluation frameworks.
2. `†` During pretraining we found that the HumanEval metric fluctuated significantly and was extremely sensitive to formatting, so we adapted the prompt from the Ling-series paper to modify the original HumanEval setup. The table reports results under this modified setup.
### Klear-46B-A2.5B-Instruct Evaluation Results
| Ability | Benchmark | Klear-46B-A2.5B-Instruct | InternLM3-8B-Instruct | MiniCPM4-8B | Qwen3-8B (NoThink) | gemma3-12b-it | Phi4-14B | Qwen3-30B-A3B-2507 |
| ------------- | --------------------------- | ------------------------- | --------------------- | ----------- | ------------------ | ------------- | -------- | ------------------ |
| | # Total Params | 46B | 8B | 8B | 8B | 12B | 14B | 30B |
| | # Activated Params | 2.5B | 8B | 8B | 8B | 12B | 14B | 3B |
| **General** | MMLU-Redux | 81.95 | 74.65 | 77.63 | 79.32 | 78.39 | 83.09 | 88.11 |
| | MMLU-Pro | 63.61 | 50.87 | 54.69 | 63.8 | 60.69 | 67.25 | 78.22 |
| | GPQA-Diamond | 49.12 | 38.76 | 38.51 | 51.77 | 39.02 | 59.47 | 71.21 |
| | SimpleQA | 6.2 | 4.44 | 3.51 | 5.5 | 6.22 | 3.28 | 23.39 |
| | CLUEWSC | 88.49 | 77.63 | 81.91 | 82.89 | 91.12 | 88.16 | 92.11 |
| | CEval | 85.98 | 84.26 | 81.78 | 81.66 | 60.81 | 64.79 | 88.57 |
| | C-SimpleQA | 42.8 | 25.87 | 23.13 | 37.07 | 28.97 | 24.77 | 75.37 |
| | LiveBench 1125 | 50 | 26.3 | 25.5 | 52.1 | 43.1 | 40 | 68.4 |
| **Math** | MATH500 | 86.4 | 68.4 | 79.8 | 85 | 86.8 | 80.6 | 97.2 |
| | AIME24 | 28.33 | 11.25 | 22.92 | 28.33 | 23.96 | 15.83 | 75 |
| | AIME25 | 19.17 | 8.12 | 15.21 | 20.62 | 18.33 | 18.75 | 61.88 |
| **Code** | HumanEval | 86.59 | 82.3* | 78.05 | 83.54 | 82.32 | 85.37 | 81.71 |
| | HumanEval+ | 79.27 | - | 73.17 | 76.83 | 75.61 | 83.54 | 76.83 |
| | MBPP (EvalPlus) | 79.9 | 62.4 | 83.3 | 76.2 | 85.7 | 77.5 | 89.4 |
| | MBPP+ (EvalPlus) | 68.8 | 50.4 | 71.7 | 66.1 | 74.1 | 66.7 | 75.1 |
| | LiveCodeBench v5 (2408-2501) | 27.96 | 14.7 | 12.19 | 27.24 | 24.73 | 23.66 | 41.22 |
| **Alignment** | IF-Eval | 81.89 | 79.3 | 73.01 | 84.47 | 81.52 | 59.33 | 83.92 |
| | Multi-IF(en+zh) | 78.46 | 61.83 | 61.79 | 78.95 | 76.56 | 62.7 | 77.75 |
| | MTBench | 8.42 | 7.86 | 6.875 | 8.21 | 8.68 | 8.62 | 9.33 |
| | MT-Eval | 8.13 | 7.36 | 6.7 | 8.18 | 8.45 | 8.12 | - |
| | AlignBench v1.1 | 7 | 6.13 | 5.99 | 6.95 | 6.3 | 6.33 | 7.06 |
| | Average | 53.74 | - | 46.54 | 52.61 | 50.54 | 48.95 | - |
Note:
1. For InternLM3-8B-Instruct, results marked with `*` are taken from the official website; all other numbers were produced with our internal evaluation frameworks.
2. For Multi-IF, we report the overall average computed across all three rounds, pooling the Chinese and English metrics.
## 3. Quick Start
### Inference with Hugging Face Transformers
Inference is supported in Transformers starting from version `4.56.0`; make sure to pass `trust_remote_code=True` when loading the tokenizer and model.
#### Klear-46B-A2.5B-Base
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "/path/to/Klear-Base"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Load in bfloat16 and shard across the available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", dtype=torch.bfloat16, trust_remote_code=True
)

text = "世界上最大的湖是"  # "The largest lake in the world is"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=256)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```
#### Klear-46B-A2.5B-Instruct
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "/path/to/Klear-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", dtype=torch.bfloat16, trust_remote_code=True
)

messages = [
    {"role": "user", "content": "帮我用 python 写一个计算器的代码吧。"}  # "Write me a calculator in Python."
]
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_tensor.to(model.device), max_new_tokens=1024)
# Decode only the newly generated tokens, skipping the prompt.
result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
print(result)
```
### Inference with vLLM
[vLLM](https://github.com/vllm-project/vllm) is a high-throughput, memory-efficient inference framework. We provide **our own forked version of [vLLM](https://github.com/Kwai-Klear/vllm) here**.
```shell
git clone https://github.com/Kwai-Klear/vllm.git
cd vllm
# Reuse precompiled kernels to speed up the editable install.
VLLM_USE_PRECOMPILED=1 pip install --editable .
# Serve the instruct model with 8-way tensor parallelism on port 8000.
vllm serve /path/to/Klear-Instruct --port 8000 --tensor-parallel-size 8 --trust-remote-code
```
An OpenAI-compatible API will be available at `http://localhost:8000/v1`.
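You can then query the endpoint with the official `openai` Python client, for example (a minimal sketch; the `model` field must match the path you served, or the value passed via `--served-model-name`):
```python
from openai import OpenAI

# vLLM's OpenAI-compatible server accepts any API key by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="/path/to/Klear-Instruct",  # must match the served model name
    messages=[{"role": "user", "content": "帮我用 python 写一个计算器的代码吧。"}],  # "Write me a calculator in Python."
    temperature=0.6,
    max_tokens=1024,
)
print(response.choices[0].message.content)
```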
Alternatively, you can use the following Python script for offline inference:
```python
import torch
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = "/path/to/Klear-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Shard the model across all visible GPUs.
llm = LLM(
    model=model_path,
    trust_remote_code=True,
    tensor_parallel_size=torch.cuda.device_count(),
    gpu_memory_utilization=0.7,
)

messages = [
    {"role": "user", "content": "帮我用 python 写一个计算器的代码吧。"}  # "Write me a calculator in Python."
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Note: vLLM's SamplingParams uses `max_tokens`, not `max_new_tokens`.
sampling_params = SamplingParams(
    temperature=0.6, top_p=0.95, top_k=40, max_tokens=1024
)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```