# Klear

<div align="center">
<img src="figures/klear-logo-02.png" width="500"/>
<p>
🤗 <a href="https://huggingface.co/Kwai-Klear">Hugging Face</a> | 📑 <a href="">Technical Report</a>
<br>
🖥️ <a href="https://kml-dtmachine-15498-prod-1.kmlhb2az1l3-2.corp.kuaishou.com">Chat with Klear</a> | 💬 <a href="https://github.com/Kwai-Klear">Issues & Discussions</a>
</p>
</div>


## 🔥News

- 2025.09.05: We released the `Klear-46B-A2.5B` series. It currently offers two versions: a `base` model and an `instruction-tuned` model. A `reasoning` version is currently in training. Please stay tuned for more updates.

## 1. Introduction

`Klear-46B-A2.5B` is a sparse Mixture-of-Experts (MoE) large language model developed by **the Kwai-Klear Team at Kuaishou**, designed to deliver both **high performance** and **inference efficiency**. It features **256 experts**, with only **8 activated** per forward pass, resulting in **46 billion total parameters** but just **2.5 billion active**, achieving dense-level performance at a fraction of the computational cost.

The model was trained on over **22 trillion tokens** using a **three-stage progressive curriculum**:

**1. Foundational Knowledge Learning (12T tokens):**
General-purpose datasets such as CommonCrawl were processed with stratified quality filters, following a curriculum learning strategy that progresses from lower to higher data quality.

**2. Data Complexity Enhancement (8T tokens):**
The proportion of mathematical, coding, and STEM-related data was gradually increased to strengthen the model's reasoning and problem-solving capabilities.

**3. Reasoning Enhancement and Long-Context Stage (2T tokens):**
Training focused on synthetic and reasoning-intensive data, combined with a fast learning-rate annealing strategy to maximize data efficiency and optimize final performance.

As a result, Klear-46B-A2.5B-Base matches or surpasses the performance of dense models with several times more active parameters, while offering significantly better efficiency and cost-effectiveness for real-world deployment.

## Model Summary

This repo contains the **base and instruction-tuned models**, which share the following architecture:

| **Property**           | **Value** |
|------------------------|-----------|
| hidden_size            | 2048      |
| moe_intermediate_size  | 896       |
| n_shared_experts       | 1         |
| num_attention_heads    | 32        |
| num_experts            | 256       |
| num_experts_per_tok    | 8         |
| num_hidden_layers      | 32        |
| num_key_value_heads    | 4         |
| vocab_size             | 151936    |
| tie_word_embeddings    | false     |
| context length         | 65536     |
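
You can sanity-check these values against a downloaded checkpoint by reading its config; a minimal sketch, assuming the attribute names in the table match the released config fields (the path is a placeholder, as in the Quick start examples below):

```python
from transformers import AutoConfig

# Load only the config, no weights are downloaded.
config = AutoConfig.from_pretrained("/path/to/Klear-Base", trust_remote_code=True)

# Attribute names follow the table above; `getattr` with a default hedges
# against fields that may be named differently in the released config.
for key in ("hidden_size", "moe_intermediate_size", "n_shared_experts",
            "num_attention_heads", "num_experts", "num_experts_per_tok",
            "num_hidden_layers", "num_key_value_heads", "vocab_size",
            "tie_word_embeddings"):
    print(key, getattr(config, key, "n/a"))
```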

### Model Downloads

<div align="center">

| **Model** | **#Total Params** | **#Activated Params** | **Context Length** | **Download Link** |
| :------------: | :------------: | :------------: | :------------: | :------------: |
| Klear-46B-A2.5B-Base | 46B | 2.5B | 64K | [🤗 Hugging Face](https://huggingface.co/Kwai-Klear) |
| Klear-46B-A2.5B-Inst. | 46B | 2.5B | 64K | [🤗 Hugging Face](https://huggingface.co/Kwai-Klear) |

</div>
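
To fetch a checkpoint for local use, `huggingface_hub` works as usual; a minimal sketch (the exact repo id under the Kwai-Klear org is an assumption here, so verify it against the links above):

```python
from huggingface_hub import snapshot_download

# Hypothetical repo id; check https://huggingface.co/Kwai-Klear for the exact name.
local_dir = snapshot_download(repo_id="Kwai-Klear/Klear-46B-A2.5B-Base")
print("Downloaded to:", local_dir)
```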

## 2. Benchmark Evaluation

### Klear-46B-A2.5B-Base Evaluation Results

| Ability | Benchmark | Klear-46B-A2.5B-Base | MiMO-7B-Base | Qwen3-8B-BASE | Qwen3-14B-BASE | Ling-lite-1.5-Base | Qwen3-30B-A3B-BASE |
| ----------- | ---------------------- | -------------------- | ------------ | ------------- | -------------- | ------------------ | ------------------ |
| | # Total Params | 46B | 7B | 8B | 14B | 16.8B | 30B |
| | # Activated Params | 2.5B | 7B | 8B | 14B | 2.75B | 3B |
| **Code** | HumanEval (0-shot*) | 89 | - | 84.1 | 87.8 | 83.5 | 90.9 |
| | MBPP (3-shot) | 76 | 55.2 | 69 | 74 | 66.6 | 75.6 |
| **Math** | MATH (4-shot, cot) | 55.7 | 36.78 | 58.4 | 57.1 | 56.98 | 57.6 |
| | CMATH (3-shot) | 87.8 | 78.5 | 88.3 | 90.7 | 85.7 | 89.7 |
| | GSM8K (4-shot, cot) | 87.3 | 78.47 | 89.4 | 90.3 | 87.6 | 91.1 |
| **General** | MMLU-Pro (5-shot, cot) | 57.6 | 43.1 | 55.2 | 58.1 | 49.9 | 58.8 |
| | MMLU (5-shot) | 80.5 | 69.24 | 77.1 | 80.6 | 73.7 | 80.4 |
| | CEval (5-shot) | 89.8 | 67.98 | 81.9 | 84.8 | 78.2 | 87.4 |
| | CMMLU (5-shot) | 88 | 70.79 | 82 | 85.6 | 81.2 | 87.1 |
| | GPQA (0-shot) | 35.3 | 31.03 | 33.9 | 35.7 | 30.1 | 35.5 |
| | AGIEval (0-shot) | 52.3 | 48.3* | 51.7 | 55.7 | 54.3 | 56 |
| | BBH (3-shot, cot) | 77.9 | 75.6 | 78.1 | 80.1 | 75.4 | 81.2 |
| **Others** | HellaSwag (0-shot) | 80.5 | 80* | 78.7 | 81.5 | 80 | 81.2 |
| | Winogrande (3-shot) | 78.8 | 78* | 73.6 | 78.5 | 72.1 | 77.9 |
| | TriviaQA (5-shot) | 69.6 | 60.8* | 56.3 | 62.1 | 60.9 | 65.6 |
| | NaturalQs (5-shot) | 37.5 | 23.46 | 25.7 | 29.1 | 28 | 30.7 |
| | PIQA (0-shot) | 81.6 | 80.14 | 79.5 | 81.9 | 82 | 80.7 |
| | SIQA (0-shot) | 67.9 | 51.74 | 56.2 | 58.4 | 56.3 | 56.3 |
| | OpenBookQA (0-shot) | 37.8 | 34.2 | 35 | 35.6 | 38.2 | 34.6 |

Note:
1. `*` During pretraining, we found that the HumanEval metric fluctuated significantly and was extremely sensitive to formatting. We therefore adopted the prompt from the Ling-series paper to modify the original HumanEval; the table reports results after this modification.
2. For MiMo-7B-Base, the results marked with `*` are sourced from other public reports.

### Klear-46B-A2.5B-Inst. Evaluation Results

| Ability | Benchmark | Klear-46B-A2.5B-Inst. | MiniCPM4-8B | Qwen3-8B (NoThink) | gemma3-12b-it | Phi4-14B |
| ------------------------- | --------------------------- | --------------- | ----------- | ------------------ | ------------- | -------- |
| | # Total Params | 46B | 8B | 8B | 12B | 14B |
| | # Activated Params | 2.5B | 8B | 8B | 12B | 14B |
| **English Understanding** | MMLU-Redux | 82.23 | 77.63 | 79.32 | 78.39 | 83.09 |
| | MMLU-Pro | 64.82 | 54.69 | 63.8 | 60.69 | 67.25 |
| | GPQA-Diamond | 49.49 | 38.51 | 51.77 | 39.02 | 59.47 |
| | SimpleQA | 5.94 | 3.51 | 5.5 | 6.22 | 3.28 |
| **Chinese Understanding** | CLUEWSC | 88.82 | 81.91 | 82.89 | 91.12 | 88.16 |
| | CEval | 84.29 | 81.78 | 81.66 | 60.81 | 64.79 |
| | C-SimpleQA | 42.03 | 23.13 | 37.07 | 28.97 | 24.77 |
| **Math & Reasoning** | MATH500 | 86.4 | 79.8 | 85 | 86.8 | 80.6 |
| | AIME24 | 30.42 | 22.92 | 28.33 | 23.96 | 15.83 |
| | AIME25 | 21.04 | 15.21 | 20.62 | 18.33 | 18.75 |
| | ZebraLogic | 46.4 | 8.5 | 25.7 | 18 | 30.3 |
| **Code** | HumanEval | 89.63 | 74.39 | 83.54 | 82.32 | 85.37 |
| | HumanEval+ | 87.2 | 70.12 | 76.83 | 75.61 | 83.54 |
| | MBPPEvalplus | 79.6 | 82 | 76.2 | 85.7 | 77.5 |
| | MBPPEvalplus++ | 68.5 | 69.3 | 66.1 | 74.1 | 66.7 |
| | LiveCodeBench v5 (2408-2501) | 29.75 | 12.19 | 27.24 | 24.73 | 23.66 |
| **Instruction Following** | IF-Eval | 80.41 | 73.01 | 84.47 | 81.52 | 59.33 |
| | Multi-IF (en+zh) | 78.25 | 61.79 | 78.95 | 76.56 | 62.7 |
| **Comprehensive Ability** | MTBench | 8.03 | 6.875 | 8.21 | 8.675 | 8.625 |
| | MT-Eval | 8.1 | 6.7 | 8.18 | 8.45 | 8.12 |
| | Arena-Hard v2 | 19.8 | 2.2 | 19.8 | 50 | 9.6 |
| | AlignBench v1.1 | 6.8 | 5.99 | 6.95 | 6.3 | 6.33 |
| | LiveBench 1125 | 48.7 | 25.5 | 52.1 | 43.1 | 40 |

## 3. Quick start

### Inference with Hugging Face Transformers

You can run inference with Transformers starting from version `4.56.0`.

#### Klear-46B-A2.5B-Base

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "/path/to/Klear-Base"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", dtype=torch.bfloat16, trust_remote_code=True
)

# Plain text completion: the base model simply continues the prompt.
text = "世界上最大的湖是"  # "The largest lake in the world is"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=256)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```

#### Klear-46B-A2.5B-Inst.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "/path/to/Klear-Inst."
tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.bfloat16)

messages = [
    {"role": "user", "content": "帮我用 python 写一个计算器的代码吧。"}  # "Please help me write calculator code in Python."
]
# Apply the chat template and append the generation prompt for the assistant turn.
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_tensor.to(model.device), max_new_tokens=1024)

# Decode only the newly generated tokens, skipping the prompt.
result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
print(result)
```

### Inference with vLLM

[vLLM](https://github.com/vllm-project/vllm) is a high-speed and memory-efficient inference framework. We provide our own forked version of [vLLM](https://github.com/vllm-project/vllm) here.

```shell
git clone   # URL of the vLLM fork (see above)
cd vllm
pip install   # install the forked vLLM from source
vllm serve Klear-46B-A2.5B-inst --port 8000 --tensor-parallel-size 8 --trust-remote-code
```

An OpenAI-compatible API will be available at `http://localhost:8000/v1`.
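
Once the server is running, any OpenAI-compatible client can query it. A minimal sketch using the official `openai` Python SDK; the placeholder API key and the model name (which must match what was passed to `vllm serve`) are assumptions:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Klear-46B-A2.5B-inst",  # must match the name passed to `vllm serve`
    messages=[{"role": "user", "content": "Hello, who are you?"}],
    temperature=0.2,
)
print(response.choices[0].message.content)
```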

Alternatively, you can refer to the following Python script for offline inference:

```python
from vllm import LLM, SamplingParams

model_path = "/path/to/Klear"
llm = LLM(
    model=model_path,
    trust_remote_code=True,
    num_speculative_tokens=1,  # draft tokens for speculative decoding
    disable_log_stats=False,
)
sampling_params = SamplingParams(temperature=0.2)

conversation = [
    {
        "role": "system",
        "content": "",
    },
    {
        "role": "user",
        "content": "Please help me write a snake game code.",
    },
]

outputs = llm.chat(conversation,
                   sampling_params=sampling_params,
                   use_tqdm=False)

for idx, output in enumerate(outputs):
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"==== Response #{idx} ====")
    print(f"Prompt: {prompt}, Generated text: {generated_text}")
```

## Citation

If you find `Klear-46B-A2.5B` useful or use it in your projects, please kindly cite our paper:

```
```