xcjthu committed 8ab8cc2 (parent: ef9ecae)

update usage for sparse attention

Files changed (1): README.md (+9, -0)

## Usage

MiniCPM4.1 can be used with the following frameworks: Hugging Face Transformers, SGLang, vLLM, and CPM.cu. For the best inference speed, we highly recommend CPM.cu.

MiniCPM4/MiniCPM4.1 supports both dense attention and sparse attention inference modes; vLLM and SGLang currently support only the dense mode. If you want to use sparse attention inference, use Hugging Face Transformers or CPM.cu.

- Dense attention inference: vLLM, SGLang, Hugging Face Transformers
- Sparse attention inference: Hugging Face Transformers, CPM.cu

**To facilitate research on sparse attention, we provide [InfLLM-V2 Training Kernels](https://github.com/OpenBMB/infllmv2_cuda_impl) and [InfLLM-V2 Inference Kernels](https://github.com/openbmb/cpm.cu).**

### Inference with Transformers

MiniCPM4.1-8B requires `transformers>=4.56`.
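
If you are unsure which version is installed, a quick check can help; this is a minimal sketch, using `packaging`, which ships as a Transformers dependency:

```python
# Guard for the `transformers>=4.56` requirement stated above.
from packaging.version import Version

import transformers

if Version(transformers.__version__) < Version("4.56"):
    raise RuntimeError(
        f"transformers {transformers.__version__} found; MiniCPM4.1-8B needs >= 4.56"
    )
```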

- **Inference with Dense Attention**

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# ... (middle of the snippet not shown in this excerpt) ...

responses = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
print(responses)
```
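
Since this excerpt shows only the head and tail of the example, here is a self-contained sketch of one plausible dense-attention generation flow. Only the imports and the final two lines come from the excerpt; the Hub id `openbmb/MiniCPM4.1-8B`, the prompt, and the generation settings are assumptions, not the README's exact code.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Assumed checkpoint id; substitute a local path if needed.
model_path = "openbmb/MiniCPM4.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    trust_remote_code=True,
)

# Format a single-turn chat prompt with the model's built-in chat template.
messages = [{"role": "user", "content": "Write an article about Artificial Intelligence."}]
prompt_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
model_inputs = tokenizer([prompt_text], return_tensors="pt").to(model.device)

# Generate, then strip the prompt tokens so only the completion is decoded.
generated = model.generate(**model_inputs, max_new_tokens=1024)
output_token_ids = [
    generated[i][len(model_inputs.input_ids[i]):] for i in range(len(generated))
]
responses = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
print(responses)
```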

- **Inference with Sparse Attention**

MiniCPM4.1-8B supports `InfLLM v2`, a sparse attention mechanism designed for efficient long-sequence inference. It requires the [infllmv2_cuda_impl](https://github.com/OpenBMB/infllmv2_cuda_impl) library.

You can install it by running the following command:

[...]

These parameters control the behavior of InfLLM v2:

* `use_nope` (default: false): Whether to use the NOPE technique in block selection for improved performance.
* `dense_len` (default: 8192): Since sparse attention offers limited benefits for short sequences, the model can use standard (dense) attention for shorter texts: dense attention is used for sequences shorter than `dense_len` tokens, and sparse attention beyond that length. Set this to `-1` to always use sparse attention regardless of sequence length (see the sketch after this list).
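
To make the `dense_len` switch concrete, here is a small illustrative sketch; it mirrors the rule described above, not the library's actual dispatch code:

```python
def attention_mode(seq_len: int, dense_len: int = 8192) -> str:
    """Illustrative only: mirror the documented dense/sparse switch.

    dense_len = -1 means sparse attention is always used.
    """
    if dense_len != -1 and seq_len < dense_len:
        return "dense"
    return "sparse"

assert attention_mode(4_096) == "dense"               # short sequence: standard attention
assert attention_mode(100_000) == "sparse"            # long sequence: InfLLM v2 sparse attention
assert attention_mode(128, dense_len=-1) == "sparse"  # forced sparse
```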

- **Long Context Extension**

MiniCPM4.1 natively supports context lengths of up to 65,536 (64K) tokens. For conversations where the total length (including both input and output) significantly exceeds this limit, we recommend using RoPE scaling techniques to handle long texts effectively. We have validated the model's performance on context lengths of up to 131,072 tokens by modifying the LongRoPE factor.

You can apply the LongRoPE factor modification by editing the model files. Specifically, in the `config.json` file, adjust the `rope_scaling` fields, as sketched below.
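
A sketch of that edit follows. The `rope_type: "longrope"` field name follows the Transformers RoPE-scaling convention and is an assumption here; the concrete `long_factor`/`short_factor` arrays are checkpoint-specific and intentionally not reproduced.

```python
import json

config_path = "MiniCPM4.1-8B/config.json"  # assumed local checkpoint directory

with open(config_path) as f:
    config = json.load(f)

# Adjust the rope_scaling fields as described above. The long/short factor
# arrays must come from the validated values published with the model; the
# commented placeholders below only mark where they go.
config["rope_scaling"] = {
    "rope_type": "longrope",
    # "long_factor": [...],   # per-dimension factors for >64K contexts
    # "short_factor": [...],  # per-dimension factors for short contexts
    "original_max_position_embeddings": 65536,  # the native 64K window
}

with open(config_path, "w") as f:
    json.dump(config, f, indent=2, ensure_ascii=False)
```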