BigDong commited on
Commit
b4d7b0e
Β·
1 Parent(s): 6c071d2

update readme

Browse files
Files changed (1) hide show
  1. README.md +211 -0
README.md ADDED
@@ -0,0 +1,211 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - zh
5
+ - en
6
+ pipeline_tag: text-generation
7
+ library_name: transformers
8
+ ---
9
+ <div align="center">
10
+ <img src="https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm_logo.png?raw=true" width="500em" ></img>
11
+ </div>
12
+
13
+ <p align="center">
14
+ <a href="https://github.com/OpenBMB/MiniCPM/" target="_blank">GitHub Repo</a> |
15
+ <a href="https://github.com/OpenBMB/MiniCPM/blob/main/docs/MiniCPM_SALA.pdf" target="_blank">Technical Report</a> |
16
+ <a href="https://mp.weixin.qq.com/s/KIhH2nCURBXuFXAtYRpuXg?poc_token=HBIsUWijxino8oJ5s6HcjcfXFRi0Xj2LJlxPYD9c">Join Us</a>
17
+ </p>
18
+ <p align="center">
19
+ πŸ‘‹ Contact us in <a href="https://discord.gg/3cGQn9b3YM" target="_blank">Discord</a> and <a href="https://github.com/OpenBMB/MiniCPM/blob/main/assets/wechat.jpg" target="_blank">WeChat</a>
20
+ </p>
21
+
22
+ ## What's New
23
+ - [2026.02.11] **[MiniCPM-SALA](https://huggingface.co/openbmb/MiniCPM-SALA)** is released! This is the first large-scale hybrid model effectively integrating sparse and linear attention for million-token context modeling. You can find technical report [here](https://github.com/OpenBMB/MiniCPM/tree/main/report/MiniCPM_4_Technical_Report.pdf).πŸ”₯πŸ”₯πŸ”₯
24
+
25
+ ### Highlights
26
+
27
+ MiniCPM-SALA (Sparse Attention and Linear Attention) is the first large-scale hybrid model effectively integrating sparse and linear attention for million-token context modeling
28
+
29
+ βœ… Innovative Hybrid Architecture: Synergizes 25% Sparse Attention (InfLLM-v2) for high-fidelity long context modeling with 75% Linear Attention (Lightning Attention) for global efficiency.
30
+
31
+ βœ… Shattering Efficiency Walls: Breaks the "Compute Wall" and the "Memory Wall," achieving 3.5Γ— inference speed and significantly lower KV-cache overhead compared to dense baselines.
32
+
33
+ βœ… Million-Token Context: Empowered by HyPE (Hybrid Positional Embedding), it scales to 1M+ tokens while maintaining strong length generalization.
34
+
35
+ βœ… HALO Adaptation: Utilizes Hybrid Attention via Layer Optimization (HALO), a novel distillation recipe that effectively transfers dense attention capabilities to the hybrid architecture, avoiding the severe performance degradation typical of pure linear models.
36
+
37
+ ## Introduction
38
+
39
+ MiniCPM-SALA is an efficient hybrid model in which 25% of the layers adopt [InfLLM-V2](https://arxiv.org/abs/2509.24663) and the remaining 75% utilize Lightning Attention. This architecture enables inference of one million tokens on consumer GPUs such as the NVIDIA RTX 5090.
40
+
41
+ - **SALA Hybrid Attention Mechanism**
42
+ - Integrates 25% InfLLM-V2 and 75% Lightning Attention, effectively leveraging the granular focus of sparse attention for local details and the high efficiency of linear attention for broad context.
43
+
44
+ - **Transformer-to-Hybrid Continue Training**
45
+ - Circumvents the inefficiencies of cold-start training by performing an architectural transformation on the pre-trained weights, thereby reducing the total training budget to approximately 25% relative to training a comparable model from scratch.
46
+
47
+ - **[HyPE](https://arxiv.org/abs/2601.22156) (Hybrid Positional Encoding)**
48
+ - Harmonizes the performance across both short and long contexts, which can maintain general capabilities (e.g., knowledge, mathematics, and coding) comparable to modern full-attention models like Qwen3-8B and achieve substantial advantages across multiple long-context benchmarks.
49
+
50
+ - **Efficient Inference on Long Sequences**
51
+ - Achieves up to 3.5x the inference speed of Qwen3-8B at a sequence length of 256K tokens on A6000D, supports inference at context lengths of up to 1M tokens on both NVIDIA A6000D and 5090 GPUs, whereas Qwen3-8B fails at this length due to out-of-memory (OOM) errors.
52
+
53
+ ## Usage
54
+
55
+ ### HuggingFace
56
+
57
+ Our model is readily compatible with πŸ€— Hugging Face transformers. You can perform inference with our model as follows:
58
+
59
+ ```python
60
+ import torch
61
+ from transformers import AutoModelForCausalLM, AutoTokenizer
62
+
63
+ model_path = "openbmb/MiniCPM-SALA"
64
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
65
+ model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, device_map="auto")
66
+ model.eval()
67
+
68
+ prompts = ["My name is", "The capital of China is"]
69
+ with torch.no_grad():
70
+ inputs = tokenizer(prompts, return_tensors="pt").to(model.device)
71
+ outputs = model.generate(**inputs)
72
+ output_texts = tokenizer.batch_decode(outputs)
73
+ print(output_texts)
74
+ ```
75
+
76
+ ### SGLang
77
+
78
+ #### Requirements
79
+
80
+ - CUDA 12.x or higher
81
+ - `gcc` / `g++` compiler
82
+ - `uv` package manager (script will check)
83
+
84
+ #### Installation
85
+
86
+ ```bash
87
+ # Clone repository
88
+ git clone -b minicpm_sala https://github.com/OpenBMB/sglang.git
89
+ cd sglang
90
+
91
+ # One-click installation (creates venv and compiles all dependencies)
92
+ bash install_minicpm_sala.sh
93
+
94
+ # Or specify PyPI mirror
95
+ bash install_minicpm_sala.sh https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
96
+ ```
97
+
98
+ The installation script performs the following steps:
99
+
100
+ 1. Creates `sglang_minicpm_sala_env` virtual environment (Python 3.12)
101
+ 2. Clones dependencies to `3rdparty/` (infllmv2) and initializes submodules (sparse_kernel)
102
+ 3. Installs MiniCPM-SALA (current repo)
103
+ 4. Compiles and installs `infllmv2_cuda_impl`
104
+ 5. Compiles and installs `sparse_kernel`
105
+ 6. Installs `tilelang` & `flash-linear-attention`
106
+
107
+ #### Usage
108
+
109
+ ```bash
110
+ # Activate environment
111
+ source sglang_minicpm_sala_env/bin/activate
112
+
113
+ # Launch Inference Server (Replace MODEL_PATH with actual path)
114
+ MODEL_PATH=/path/to/your/MiniCPM-SALA
115
+
116
+ python3 -m sglang.launch_server \
117
+ --model ${MODEL_PATH} \
118
+ --trust-remote-code \
119
+ --disable-radix-cache \
120
+ --attention-backend minicpm_flashinfer \
121
+ --chunked-prefill-size 8192 \
122
+ --max-running-requests 32 \
123
+ --skip-server-warmup \
124
+ --port 31111 \
125
+ --dense-as-sparse
126
+ ```
127
+
128
+ | Parameter | Description |
129
+ |-----------|-------------|
130
+ | `--trust-remote-code` | Allow custom code in model |
131
+ | `--disable-radix-cache` | Disable RadixAttention prefix cache |
132
+ | `--attention-backend minicpm_flashinfer` | Use MiniCPM FlashInfer backend |
133
+ | `--chunked-prefill-size 8192` | Chunked prefill size |
134
+ | `--max-running-requests 32` | Max concurrent requests |
135
+ | `--skip-server-warmup` | Skip server warmup |
136
+ | `--port 31111` | Server port |
137
+ | `--dense-as-sparse` | Use dense-as-sparse mode |
138
+
139
+ #### Manual Installation
140
+
141
+ If the script doesn't work for you, follow these steps:
142
+
143
+ ```bash
144
+ # 0. Ensure uv is installed
145
+ pip install uv
146
+
147
+ # 1. Create venv
148
+ uv venv --python 3.12 sglang_minicpm_sala_env
149
+ source sglang_minicpm_sala_env/bin/activate
150
+
151
+ # 2. Install SGLang
152
+ uv pip install --upgrade pip setuptools wheel
153
+ uv pip install -e ./python[all]
154
+
155
+ # 3. Compile CUDA Extensions
156
+ # (Ensure dependencies are cloned to 3rdparty/)
157
+ cd 3rdparty/infllmv2_cuda_impl && python setup.py install && cd ../..
158
+ cd 3rdparty/sparse_kernel && python setup.py install && cd ../..
159
+
160
+ # 4. Install extra deps
161
+ uv pip install tilelang flash-linear-attention
162
+ ```
163
+
164
+ #### Q&A
165
+
166
+ **Q: CUDA extension compilation failed?**
167
+
168
+ - Ensure CUDA 12+ is installed (`nvcc --version`).
169
+ - Ensure `gcc` / `g++` are available.
170
+ - If `CXX` is set to `clang++ -pthread`, manually `export CXX=g++`.
171
+
172
+
173
+ ## Evaluation Results
174
+
175
+ ### Efficiency Evaluation
176
+
177
+ ![inference_speed_a6000d](https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm_sala/inference_speed_a600d.png?raw=true)
178
+
179
+ ![inference_speed_5090](https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm_sala/inference_speed_5090.png?raw=true)
180
+
181
+ ### Long-Context Evaluation
182
+
183
+ ![long_text_evaluation](https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm_sala/long_text_evaluation.png?raw=true)
184
+
185
+ ### Ultra-long Context Evaluation
186
+
187
+ ![ultra_long_text_evaluation](https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm_sala/ultra_long_text_evaluation.png?raw=true)
188
+
189
+ ### Standard Evaluation
190
+
191
+ ![benchmark](https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm_sala/benchmark.png?raw=true)
192
+
193
+ ## Statement
194
+ - As a language model, MiniCPM-SALA generates content by learning from a vast amount of text.
195
+ - However, it does not possess the ability to comprehend or express personal opinions or value judgments.
196
+ - Any content generated by MiniCPM-SALA does not represent the viewpoints or positions of the model developers.
197
+ - Therefore, when using content generated by MiniCPM-SALA, users should take full responsibility for evaluating and verifying it on their own.
198
+
199
+ ## LICENSE
200
+ - This repository and MiniCPM models are released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
201
+
202
+ ## Citation
203
+ - Please cite our [paper](https://github.com/OpenBMB/MiniCPM/blob/main/docs/MiniCPM_SALA.pdf) if you find our work valuable.
204
+
205
+ ```bibtex
206
+ @article{minicpm4,
207
+ title={{MiniCPM4}: Ultra-Efficient LLMs on End Devices},
208
+ author={MiniCPM Team},
209
+ year={2025}
210
+ }
211
+ ```