---
license: apache-2.0
language:
- zh
- en
pipeline_tag: text-generation
library_name: transformers
---
<div align="center">
<img src="https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm_logo.png?raw=true" width="500em" ></img>
</div>
<p align="center">
<a href="https://github.com/OpenBMB/MiniCPM/" target="_blank">GitHub Repo</a> |
<a href="https://github.com/OpenBMB/MiniCPM/blob/main/docs/MiniCPM_SALA.pdf" target="_blank">Technical Report</a> |
<a href="https://mp.weixin.qq.com/s/KIhH2nCURBXuFXAtYRpuXg?poc_token=HBIsUWijxino8oJ5s6HcjcfXFRi0Xj2LJlxPYD9c">Join Us</a>
</p>
<p align="center">
Contact us in <a href="https://discord.gg/3cGQn9b3YM" target="_blank">Discord</a> and <a href="https://github.com/OpenBMB/MiniCPM/blob/main/assets/wechat.jpg" target="_blank">WeChat</a>
</p>
> [!NOTE]
> ### 2026 Sparse Operator Acceleration & Race (SOAR) is Now Live!
>
> **"The MiniCPM-SALA architecture is just the beginning. Realizing its full potential requires deep system-level synergy and cross-layer compilation optimization."**
>
> In collaboration with **SGLang** and **NVIDIA**, OpenBMB invites global geeks to push the boundaries of 9B-scale, 1M-token inference on **NVIDIA 6000D**.
>
> **Prize Pool: >$100,000 USD** (Top Prize: **$89,000**) | **Challenge:** Single & Multi-batch Optimization
>
> **[Click Here to Join the Race @ soar.openbmb.cn](https://soar.openbmb.cn/)**
## What's New
- [2026.02.11] **MiniCPM-SALA** is released! This is the first large-scale hybrid model to effectively integrate sparse and linear attention for million-token context modeling. You can find the technical report [here](https://github.com/OpenBMB/MiniCPM/blob/main/docs/MiniCPM_SALA.pdf).
### Highlights
MiniCPM-SALA (Sparse Attention and Linear Attention) is the first large-scale hybrid model effectively integrating sparse and linear attention for million-token context modeling.

- ✅ **Innovative Hybrid Architecture:** Synergizes 25% Sparse Attention (InfLLM-V2) for high-fidelity long-context modeling with 75% Linear Attention (Lightning Attention) for global efficiency.
- ✅ **Shattering Efficiency Walls:** Breaks both the "Compute Wall" and the "Memory Wall," achieving 3.5× inference speed and significantly lower KV-cache overhead compared to dense baselines.
- ✅ **Million-Token Context:** Empowered by HyPE (Hybrid Positional Encoding), it scales to 1M+ tokens while maintaining strong length generalization.
- ✅ **HALO Adaptation:** Utilizes Hybrid Attention via Layer Optimization (HALO), a novel distillation recipe that effectively transfers dense-attention capabilities to the hybrid architecture, avoiding the severe performance degradation typical of pure linear models.
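As a back-of-envelope illustration of the memory claim, the sketch below compares KV-cache size for a fully dense model against a 25%-sparse hybrid in which the linear-attention layers keep only a constant-size recurrent state. All dimensions here (36 layers, 8 KV heads, head dim 128, bf16) are illustrative assumptions, not the released model's actual configuration.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    # K and V each occupy seq_len x n_kv_heads x head_dim per layer;
    # bytes_per_value=2 corresponds to fp16/bf16 storage.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Illustrative dimensions (NOT the released model's real config).
L, H, D, T = 36, 8, 128, 1_000_000

dense = kv_cache_bytes(L, H, D, T)
# Hybrid: only the 25% sparse-attention layers store a token-length KV
# cache; the 75% linear layers carry an O(1) state, negligible at 1M tokens.
hybrid = kv_cache_bytes(L // 4, H, D, T)

print(f"dense : {dense / 2**30:.1f} GiB")
print(f"hybrid: {hybrid / 2**30:.1f} GiB ({hybrid / dense:.0%} of dense)")
```

Whatever the exact dimensions, the ratio is fixed by the layer split: caching full-length KV in only a quarter of the layers cuts KV-cache memory to roughly 25% of the dense baseline.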
## Introduction
MiniCPM-SALA is an efficient hybrid model in which 25% of the layers adopt [InfLLM-V2](https://arxiv.org/abs/2509.24663) and the remaining 75% utilize Lightning Attention. This architecture enables inference of one million tokens on consumer GPUs such as the NVIDIA RTX 5090.
- **SALA Hybrid Attention Mechanism**
  - Integrates 25% InfLLM-V2 and 75% Lightning Attention, effectively leveraging the granular focus of sparse attention for local details and the high efficiency of linear attention for broad context.
- **Transformer-to-Hybrid Continued Training**
  - Circumvents the inefficiencies of cold-start training by performing an architectural transformation on the pre-trained weights, thereby reducing the total training budget to approximately 25% of that required to train a comparable model from scratch.
- **[HyPE](https://arxiv.org/abs/2601.22156) (Hybrid Positional Encoding)**
  - Harmonizes performance across both short and long contexts, maintaining general capabilities (e.g., knowledge, mathematics, and coding) comparable to modern full-attention models such as Qwen3-8B while achieving substantial advantages across multiple long-context benchmarks.
- **Efficient Inference on Long Sequences**
  - Achieves up to 3.5× the inference speed of Qwen3-8B at a sequence length of 256K tokens on the A6000D, and supports context lengths of up to 1M tokens on both NVIDIA A6000D and 5090 GPUs, where Qwen3-8B fails with out-of-memory (OOM) errors.
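The 25%/75% split can be pictured as a per-layer attention plan. The sketch below shows one plausible interleaving (every fourth layer sparse); the released model's actual layer order is defined by its configuration files, not by this sketch.

```python
def hybrid_layer_plan(n_layers: int, sparse_every: int = 4) -> list[str]:
    """One plausible 25%/75% interleaving: every fourth layer uses sparse
    attention (InfLLM-V2), the rest use linear attention (Lightning
    Attention). Illustrative only; the real pattern comes from the model
    config."""
    return [
        "sparse" if (i + 1) % sparse_every == 0 else "linear"
        for i in range(n_layers)
    ]

plan = hybrid_layer_plan(32)
print(plan.count("sparse") / len(plan))  # 0.25
```

Keeping the sparse layers evenly interleaved is one natural choice: it gives every linear-attention span a nearby layer with token-precise retrieval over the full context.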
## Inference
To achieve optimal performance, we recommend using `Temperature=0.9`.
### HuggingFace
Our model is readily compatible with 🤗 Hugging Face Transformers. You can perform inference with our model as follows:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "openbmb/MiniCPM-SALA"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

# Batched prompts must be padded to a common length.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
prompts = ["My name is", "The capital of China is"]

with torch.no_grad():
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=64)
    output_texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)

print(output_texts)
```
### SGLang
#### Requirements
- CUDA 12.x or higher
- `gcc` / `g++` compiler
- `uv` package manager (script will check)
#### Installation
```bash
# Clone repository
git clone -b minicpm_sala https://github.com/OpenBMB/sglang.git
cd sglang
# One-click installation (creates venv and compiles all dependencies)
bash install_minicpm_sala.sh
# Or specify PyPI mirror
bash install_minicpm_sala.sh https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
```
The installation script performs the following steps:
1. Creates `sglang_minicpm_sala_env` virtual environment (Python 3.12)
2. Clones dependencies to `3rdparty/` (infllmv2) and initializes submodules (sparse_kernel)
3. Installs MiniCPM-SALA (current repo)
4. Compiles and installs `infllmv2_cuda_impl`
5. Compiles and installs `sparse_kernel`
6. Installs `tilelang` & `flash-linear-attention`
#### Usage
```bash
# Activate environment
source sglang_minicpm_sala_env/bin/activate
# Launch Inference Server (Replace MODEL_PATH with actual path)
MODEL_PATH=/path/to/your/MiniCPM-SALA
python3 -m sglang.launch_server \
--model ${MODEL_PATH} \
--trust-remote-code \
--disable-radix-cache \
--attention-backend minicpm_flashinfer \
--chunked-prefill-size 8192 \
--max-running-requests 32 \
--skip-server-warmup \
--port 31111 \
--dense-as-sparse
```
| Parameter | Description |
|-----------|-------------|
| `--trust-remote-code` | Allow custom code in model |
| `--disable-radix-cache` | Disable RadixAttention prefix cache |
| `--attention-backend minicpm_flashinfer` | Use MiniCPM FlashInfer backend |
| `--chunked-prefill-size 8192` | Chunked prefill size |
| `--max-running-requests 32` | Max concurrent requests |
| `--skip-server-warmup` | Skip server warmup |
| `--port 31111` | Server port |
| `--dense-as-sparse` | Use dense-as-sparse mode |
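Once the server above is running, you can query it over SGLang's OpenAI-compatible HTTP API using only the standard library. The port and endpoint below mirror the launch command; `build_chat_payload` and `query_server` are helper names introduced here, and the `model` field can be any string when the server hosts a single model.

```python
import json
from urllib import request

def build_chat_payload(prompt: str, temperature: float = 0.9) -> dict:
    # Standard chat-completions request body for /v1/chat/completions.
    return {
        "model": "MiniCPM-SALA",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,  # recommended sampling temperature
        "max_tokens": 256,
    }

def query_server(prompt: str, port: int = 31111) -> str:
    req = request.Request(
        f"http://localhost:{port}/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Example (requires the server launched above to be running):
#   print(query_server("Summarize MiniCPM-SALA in one sentence."))
```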
#### Manual Installation
If the script doesn't work for you, follow these steps:
```bash
# 0. Ensure uv is installed
pip install uv
# 1. Create venv
uv venv --python 3.12 sglang_minicpm_sala_env
source sglang_minicpm_sala_env/bin/activate
# 2. Install SGLang
uv pip install --upgrade pip setuptools wheel
uv pip install -e ./python[all]
# 3. Compile CUDA Extensions
# (Ensure dependencies are cloned to 3rdparty/)
cd 3rdparty/infllmv2_cuda_impl && python setup.py install && cd ../..
cd 3rdparty/sparse_kernel && python setup.py install && cd ../..
# 4. Install extra deps
uv pip install tilelang flash-linear-attention
```
#### Q&A
**Q: CUDA extension compilation failed?**
- Ensure CUDA 12+ is installed (`nvcc --version`).
- Ensure `gcc` / `g++` are available.
- If `CXX` is set to `clang++ -pthread`, manually `export CXX=g++`.
## Evaluation Results
### Efficiency Evaluation


### Long-Context Evaluation

### Ultra-long Context Evaluation

### Standard Evaluation

## Statement
- As a language model, MiniCPM-SALA generates content by learning from a vast amount of text.
- However, it does not possess the ability to comprehend or express personal opinions or value judgments.
- Any content generated by MiniCPM-SALA does not represent the viewpoints or positions of the model developers.
- Therefore, when using content generated by MiniCPM-SALA, users should take full responsibility for evaluating and verifying it on their own.
## LICENSE
- This repository and MiniCPM models are released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
## Citation
- Please cite our [paper](https://github.com/OpenBMB/MiniCPM/blob/main/docs/MiniCPM_SALA.pdf) if you find our work valuable.
```bibtex
@article{minicpmsala,
  title={{MiniCPM-SALA}: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling},
  author={MiniCPM Team},
  year={2026}
}
```