---
license: apache-2.0
language:
- zh
- en
pipeline_tag: text-generation
library_name: transformers
---
<div align="center">
<img src="https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm_logo.png?raw=true" width="500em" ></img>
</div>

<p align="center">
<a href="https://github.com/OpenBMB/MiniCPM/" target="_blank">GitHub Repo</a> |
<a href="https://github.com/OpenBMB/MiniCPM/blob/main/docs/MiniCPM_SALA.pdf" target="_blank">Technical Report</a> |
<a href="https://mp.weixin.qq.com/s/KIhH2nCURBXuFXAtYRpuXg?poc_token=HBIsUWijxino8oJ5s6HcjcfXFRi0Xj2LJlxPYD9c">Join Us</a>
</p>

<p align="center">
👋 Contact us in <a href="https://discord.gg/3cGQn9b3YM" target="_blank">Discord</a> and <a href="https://github.com/OpenBMB/MiniCPM/blob/main/assets/wechat.jpg" target="_blank">WeChat</a>
</p>

> [!NOTE]
> ### 🏆 2026 Sparse Operator Acceleration & Race (SOAR) is Now Live!
>
> **"The MiniCPM-SALA architecture is just the beginning. Realizing its full potential requires deep system-level synergy and cross-layer compilation optimization."**
>
> In collaboration with **SGLang** and **NVIDIA**, OpenBMB invites global geeks to push the boundaries of 9B-scale, 1M-token inference on **NVIDIA 6000D**.
>
> 💰 **Prize Pool: >$100,000 USD** (🥇 Top Prize: **$89,000**) | 🚀 **Challenge:** Single & Multi-batch Optimization
>
> 👉 **[Click Here to Join the Race @ soar.openbmb.cn](https://soar.openbmb.cn/)**

## What's New
- [2026.02.11] **MiniCPM-SALA** is released! It is the first large-scale hybrid model to effectively integrate sparse and linear attention for million-token context modeling. The technical report is available [here](https://github.com/OpenBMB/MiniCPM/blob/main/docs/MiniCPM_SALA.pdf).🔥🔥🔥

### Highlights
MiniCPM-SALA (Sparse Attention and Linear Attention) is the first large-scale hybrid model to effectively integrate sparse and linear attention for million-token context modeling.

✅ Innovative Hybrid Architecture: Combines sparse attention (InfLLM-V2) in 25% of the layers, for high-fidelity long-context modeling, with linear attention (Lightning Attention) in the remaining 75%, for global efficiency.

✅ Shattering Efficiency Walls: Breaks the "Compute Wall" and the "Memory Wall," achieving up to 3.5× the inference speed of a dense baseline (Qwen3-8B at 256K tokens) with significantly lower KV-cache overhead.

✅ Million-Token Context: Empowered by HyPE (Hybrid Positional Encoding), it scales to 1M+ tokens while maintaining strong length generalization.

✅ HALO Adaptation: Uses Hybrid Attention via Layer Optimization (HALO), a novel distillation recipe that effectively transfers dense-attention capabilities to the hybrid architecture, avoiding the severe performance degradation typical of pure linear models.

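To make the efficiency argument for the linear-attention layers concrete, here is a minimal, generic sketch of causal linear attention in plain Python. It is an illustration only and not the Lightning Attention kernel used by the model: it omits normalization, decay, and any blockwise GPU tiling, and the function names are hypothetical. The point it shows is that the recurrent form carries a fixed-size state, so its memory does not grow with sequence length, while producing the same outputs as the explicit quadratic form.

```python
def linear_attention(queries, keys, values):
    """Causal, unnormalized linear attention computed recurrently.

    Carries a d_k x d_v state S = sum over past steps of outer(k, v),
    so memory use is independent of the sequence length.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in zip(queries, keys, values):
        # Fold the current key/value pair into the running state.
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k[i] * v[j]
        # Read out with the query: o = q @ S.
        outputs.append([sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)])
    return outputs


def quadratic_reference(queries, keys, values):
    """Same outputs via explicit causal (q_t . k_s) weights, O(n^2) in length."""
    d_v = len(values[0])
    outputs = []
    for t, q in enumerate(queries):
        out = [0.0] * d_v
        for s in range(t + 1):
            weight = sum(qi * ki for qi, ki in zip(q, keys[s]))
            for j in range(d_v):
                out[j] += weight * values[s][j]
        outputs.append(out)
    return outputs
```

Calling both functions on the same small inputs returns identical outputs; only the recurrent version avoids revisiting earlier tokens at every step.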
## Introduction
MiniCPM-SALA is an efficient hybrid model in which 25% of the layers adopt [InfLLM-V2](https://arxiv.org/abs/2509.24663) and the remaining 75% use Lightning Attention. This architecture enables inference over one million tokens on consumer GPUs such as the NVIDIA RTX 5090.

- **SALA Hybrid Attention Mechanism**
  - Integrates 25% InfLLM-V2 and 75% Lightning Attention, leveraging the fine-grained focus of sparse attention for local details and the high efficiency of linear attention for broad context.
- **Transformer-to-Hybrid Continued Training**
  - Avoids the inefficiency of cold-start training by applying an architectural transformation to the pre-trained weights, reducing the total training budget to approximately 25% of that needed to train a comparable model from scratch.
- **[HyPE](https://arxiv.org/abs/2601.22156) (Hybrid Positional Encoding)**
  - Balances performance across short and long contexts: it maintains general capabilities (e.g., knowledge, mathematics, and coding) comparable to modern full-attention models such as Qwen3-8B while achieving substantial advantages on multiple long-context benchmarks.
- **Efficient Inference on Long Sequences**
  - Achieves up to 3.5× the inference speed of Qwen3-8B at a sequence length of 256K tokens on the A6000D, and supports context lengths of up to 1M tokens on both the NVIDIA A6000D and RTX 5090 GPUs, a length at which Qwen3-8B fails with out-of-memory (OOM) errors.

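The OOM behavior of a dense model at 1M tokens can be sanity-checked with back-of-the-envelope KV-cache arithmetic. The sketch below is an estimate under stated assumptions, not a measurement: the helper name is hypothetical, the layer count, KV-head count, and head dimension are assumed to match the public Qwen3-8B configuration, and values are assumed to be stored in 16-bit precision. In a hybrid model, linear-attention layers keep a fixed-size state instead of a per-token cache, so only the sparse-attention layers contribute a term that grows with sequence length.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for a single sequence.

    The leading factor 2 covers keys plus values; bytes_per_value=2
    assumes fp16/bf16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed dense-baseline shape (Qwen3-8B-like): 36 layers, 8 KV heads, head dim 128.
per_token = kv_cache_bytes(num_layers=36, num_kv_heads=8, head_dim=128, seq_len=1)
dense_1m = kv_cache_bytes(num_layers=36, num_kv_heads=8, head_dim=128, seq_len=1_000_000)

print(f"per token:  {per_token / 1024:.0f} KiB")
print(f"1M tokens:  {dense_1m / 1e9:.1f} GB")
```

Under these assumptions the cache alone is far larger than the 32 GB of memory on an RTX 5090, before counting model weights or activations.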
## Inference
To achieve optimal performance, we recommend a sampling temperature of 0.9 (`temperature=0.9`).

### HuggingFace
Our model is readily compatible with 🤗 Hugging Face transformers. You can perform inference with our model as follows:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "openbmb/MiniCPM-SALA"
# Left-pad so that prompts of different lengths can be generated in one batch.
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, padding_side="left")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, device_map="auto")
model.eval()

prompts = ["My name is", "The capital of China is"]
with torch.no_grad():
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
    # Sample with the recommended temperature; max_new_tokens is an illustrative value.
    outputs = model.generate(**inputs, do_sample=True, temperature=0.9, max_new_tokens=64)
output_texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(output_texts)
```

### SGLang
#### Requirements
- CUDA 12.x or higher
- `gcc` / `g++` compiler
- `uv` package manager (the installation script checks for it)

#### Installation
```bash
# Clone the repository
git clone -b minicpm_sala https://github.com/OpenBMB/sglang.git
cd sglang

# One-click installation (creates a venv and compiles all dependencies)
bash install_minicpm_sala.sh

# Or specify a PyPI mirror
bash install_minicpm_sala.sh https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
```
The installation script performs the following steps:
1. Creates the `sglang_minicpm_sala_env` virtual environment (Python 3.12)
2. Clones dependencies to `3rdparty/` (infllmv2) and initializes submodules (sparse_kernel)
3. Installs SGLang with MiniCPM-SALA support (the current repo)
4. Compiles and installs `infllmv2_cuda_impl`
5. Compiles and installs `sparse_kernel`
6. Installs `tilelang` and `flash-linear-attention`

#### Usage
```bash
# Activate the environment
source sglang_minicpm_sala_env/bin/activate

# Launch the inference server (replace MODEL_PATH with the actual path)
MODEL_PATH=/path/to/your/MiniCPM-SALA
python3 -m sglang.launch_server \
    --model "${MODEL_PATH}" \
    --trust-remote-code \
    --disable-radix-cache \
    --attention-backend minicpm_flashinfer \
    --chunked-prefill-size 8192 \
    --max-running-requests 32 \
    --skip-server-warmup \
    --port 31111 \
    --dense-as-sparse
```
| Parameter | Description |
|-----------|-------------|
| `--trust-remote-code` | Allow the custom code shipped with the model to run |
| `--disable-radix-cache` | Disable the RadixAttention prefix cache |
| `--attention-backend minicpm_flashinfer` | Use the MiniCPM FlashInfer attention backend |
| `--chunked-prefill-size 8192` | Maximum number of tokens per chunk during chunked prefill |
| `--max-running-requests 32` | Maximum number of concurrently running requests |
| `--skip-server-warmup` | Skip the server warmup step |
| `--port 31111` | Server port |
| `--dense-as-sparse` | Use dense-as-sparse mode |

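Once the server is running, it can be queried over HTTP. The sketch below uses only the Python standard library and assumes the server exposes SGLang's OpenAI-compatible `/v1/chat/completions` route on the port from the launch command above. The helper names, the `MODEL_NAME` value, and the `max_tokens` default are illustrative assumptions; if the server rejects the model name, use the one reported by its `/v1/models` route.

```python
import json
import urllib.request

# Assumptions: the server launched above listens on localhost:31111 and
# accepts this served-model name. Adjust both to match your deployment.
BASE_URL = "http://localhost:31111"
MODEL_NAME = "openbmb/MiniCPM-SALA"


def build_chat_request(prompt, temperature=0.9, max_tokens=256):
    """Build an OpenAI-style chat completion request without sending it."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,  # recommended sampling temperature
        "max_tokens": max_tokens,    # illustrative cap on the reply length
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, timeout=600):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(chat("Who are you?"))` prints the reply. The generous default timeout reflects that long-context prompts can take a while to prefill.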
#### Manual Installation
If the script doesn't work for you, follow these steps:
```bash
# 0. Ensure uv is installed
pip install uv

# 1. Create and activate the venv
uv venv --python 3.12 sglang_minicpm_sala_env
source sglang_minicpm_sala_env/bin/activate

# 2. Install SGLang (the path is quoted so the shell does not expand the brackets)
uv pip install --upgrade pip setuptools wheel
uv pip install -e "./python[all]"

# 3. Compile CUDA extensions
# (Ensure dependencies are cloned to 3rdparty/)
cd 3rdparty/infllmv2_cuda_impl && python setup.py install && cd ../..
cd 3rdparty/sparse_kernel && python setup.py install && cd ../..

# 4. Install extra dependencies
uv pip install tilelang flash-linear-attention
```
#### Q&A
**Q: What should I check if CUDA extension compilation fails?**
- Ensure CUDA 12+ is installed (`nvcc --version`).
- Ensure `gcc` / `g++` are available.
- If `CXX` is set to `clang++ -pthread`, run `export CXX=g++` before rebuilding.

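The checks above can be automated before starting a long build. The following is a small preflight sketch using only the Python standard library; the function name and the tool list are our own choices for illustration and are not part of the installation script. It only checks that the tools are on `PATH` and that `CXX` does not point at clang; it does not verify the CUDA version, so `nvcc --version` is still worth running by hand.

```python
import os
import shutil

# Tools named in the requirements and Q&A sections.
REQUIRED_TOOLS = ("nvcc", "gcc", "g++", "uv")


def find_build_problems(which=shutil.which, env=None):
    """Return human-readable problems that would likely break the CUDA build.

    `which` and `env` are injectable so the check can be exercised without
    touching the real system.
    """
    env = os.environ if env is None else env
    problems = [f"{tool} not found on PATH" for tool in REQUIRED_TOOLS if which(tool) is None]
    cxx = env.get("CXX", "")
    if "clang" in cxx:
        problems.append(f"CXX is set to {cxx!r}; try: export CXX=g++")
    return problems


if __name__ == "__main__":
    for problem in find_build_problems() or ["no obvious problems found"]:
        print(problem)
```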
## Evaluation Results
### Efficiency Evaluation


### Long-Context Evaluation

### Ultra-long Context Evaluation

### Standard Evaluation

## Statement
- As a language model, MiniCPM-SALA generates content by learning from a vast amount of text.
- However, it does not possess the ability to comprehend or express personal opinions or value judgments.
- Any content generated by MiniCPM-SALA does not represent the viewpoints or positions of the model developers.
- Therefore, when using content generated by MiniCPM-SALA, users should take full responsibility for evaluating and verifying it on their own.
## LICENSE
- This repository and MiniCPM models are released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
## Citation
- Please cite our [paper](https://github.com/OpenBMB/MiniCPM/blob/main/docs/MiniCPM_SALA.pdf) if you find our work valuable.
```bibtex
@article{minicpm4,
  title={{MiniCPM-SALA}: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling},
  author={MiniCPM Team},
  year={2026}
}
```