Files changed (3)
  1. README.md +2 -51
  2. config.json +1 -1
  3. modeling_minicpm.py +72 -134
README.md CHANGED
@@ -20,7 +20,6 @@ library_name: transformers
20
  </p>
21
 
22
  ## What's New
23
- - [2025.09.29] **[InfLLM-V2 paper](https://arxiv.org/abs/2509.24663) is released!** We can train a sparse attention model with only 5B long-text tokens. 🔥🔥🔥
24
  - [2025.09.05] **MiniCPM4.1** series are released! This series is a hybrid reasoning model with trainable sparse attention, which can be used in both deep reasoning mode and non-reasoning mode. 🔥🔥🔥
25
  - [2025.06.06] **MiniCPM4** series are released! This model achieves ultimate efficiency improvements while maintaining optimal performance at the same scale! It can achieve over 5x generation acceleration on typical end-side chips! You can find the technical report [here](https://arxiv.org/abs/2506.07900). 🔥🔥🔥
26
 
@@ -64,11 +63,6 @@ MiniCPM4.1 launches end-side versions with 8B parameter scale, both achieving be
64
 
65
  ![benchmark](https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm4/benchmark4.1.png?raw=true)
66
 
67
- ### Best Practices
68
- 1. It is advisable to use temperature=0.9, topp=0.95. And we suggest setting max_output_token to 65,536 tokens.
69
- 2. For math problems, we recommend using "Please reason step by step, and put your final answer within \boxed{}."
70
- 3. And for English multiple-choice questions, we recommend starting with "Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step before answering." And "你回答的最后一行必须是以下格式 '答案:$选项' (不带引号), 其中选项是ABCD之一。请在回答之前一步步思考" for Chinese MCQ.
71
-
72
  ### Efficiency Evaluation
73
  MiniCPM4.1 adopts sparse attention and speculative decoding to improve inference efficiency. On an RTX 4090, MiniCPM4.1 achieves a 3x decoding speed improvement in reasoning mode.
74
 
@@ -84,17 +78,8 @@ MiniCPM4.1 adopts sparse attention and speculative decoding to improve the infer
84
  ## Usage
85
  MiniCPM4.1 can be used with the following frameworks: Huggingface Transformers, SGLang, vLLM, and CPM.cu. For the best inference speed, we highly recommend CPM.cu.
86
 
87
- MiniCPM4/MiniCPM4.1 supports both dense attention inference and sparse attention inference modes, where vLLM and SGLang currently only support dense inference mode. If you want to use sparse inference mode, please use Huggingface Transformers and CPM.cu.
88
-
89
- - Dense attention inference: vLLM, SGLang, Huggingface Transformers
90
- - Sparse attention inference: Huggingface Transformers, CPM.cu
91
-
92
- **To facilitate researches in sparse attention, we provide [InfLLM-V2 Training Kernels](https://github.com/OpenBMB/infllmv2_cuda_impl) and [InfLLM-V2 Inference Kernels](https://github.com/openbmb/cpm.cu).**
93
 
94
  ### Inference with Transformers
95
- MiniCPM4.1-8B requires `transformers>=4.56`.
96
-
97
- - **Inference with Dense Attention**
98
  ```python
99
  from transformers import AutoModelForCausalLM, AutoTokenizer
100
  import torch
@@ -134,7 +119,6 @@ responses = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0
134
  print(responses)
135
  ```
136
 
137
- - **Inference with Sparse Attention**
138
  MiniCPM4.1-8B supports `InfLLM v2`, a sparse attention mechanism designed for efficient long-sequence inference. It requires the [infllmv2_cuda_impl](https://github.com/OpenBMB/infllmv2_cuda_impl) library.
139
 
140
  You can install it by running the following command:
@@ -172,7 +156,6 @@ These parameters control the behavior of InfLLM v2:
172
  * `use_nope` (default: false): Whether to use the NOPE technique in block selection for improved performance.
173
  * `dense_len` (default: 8192): Since Sparse Attention offers limited benefits for short sequences, the model can use standard (dense) attention for shorter texts. The model will use dense attention for sequences with a token length below `dense_len` and switch to sparse attention for sequences exceeding this length. Set this to `-1` to always use sparse attention regardless of sequence length.
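  For reference, a minimal sketch of adjusting these switches by editing `config.json` in the model directory (the `sparse_config` block name follows `self.config.sparse_config` in `modeling_minicpm.py`; treat it as an assumption and keep any other keys from the shipped config):

  ```python
  # Minimal sketch: toggling the InfLLM v2 switches described above by editing
  # config.json directly. Only `use_nope` and `dense_len` are documented here;
  # the `sparse_config` block name is assumed from modeling_minicpm.py.
  import json

  with open("config.json") as f:
      cfg = json.load(f)

  sparse_cfg = cfg.setdefault("sparse_config", {})
  sparse_cfg["use_nope"] = False   # NOPE technique in block selection (default: false)
  sparse_cfg["dense_len"] = 8192   # dense attention below this length; -1 = always sparse

  with open("config.json", "w") as f:
      json.dump(cfg, f, indent=2)
  ```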
174
 
175
- - **Long Context Extension**
176
  MiniCPM4.1 natively supports context lengths of up to 65,536 (64K) tokens. For conversations where the total length (including both input and output) significantly exceeds this limit, we recommend using RoPE scaling techniques to handle long texts effectively. We have validated the model's performance on context lengths of up to 131,072 tokens by modifying the LongRoPE factor.
177
 
178
  You can apply the LongRoPE factor modification by editing the model files. Specifically, adjust the `rope_scaling` fields in the `config.json` file.
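  As a hedged sketch (the field names and the 131,072 target below are illustrative assumptions, not the validated LongRoPE factors), such an edit could look like:

  ```python
  # Hedged sketch: editing the rope_scaling block of config.json to extend the
  # context window. Field names and the target length are illustrative; use the
  # officially validated LongRoPE factors instead of the placeholder below.
  import json

  with open("config.json") as f:
      cfg = json.load(f)

  rope_scaling = cfg.get("rope_scaling", {})
  rope_scaling["original_max_position_embeddings"] = 65536  # native context length
  # rope_scaling["long_factor"] = [...]  # per-dimension LongRoPE factors (placeholder)
  cfg["rope_scaling"] = rope_scaling
  cfg["max_position_embeddings"] = 131072  # assumed target length

  with open("config.json", "w") as f:
      json.dump(cfg, f, indent=2)
  ```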
@@ -484,37 +467,6 @@ python3 -m cpmcu.cli \
484
 
485
  For more details about CPM.cu, please refer to [the CPM.cu repo](https://github.com/OpenBMB/cpm.cu).
486
 
487
- ### Inference with llama.cpp and Ollama
488
-
489
- We also support inference with [llama.cpp](https://github.com/ggml-org/llama.cpp) and [Ollama](https://ollama.com/).
490
-
491
- ##### llama.cpp
492
-
493
- You can download the GGUF format of MiniCPM4.1-8B model from [huggingface](https://huggingface.co/openbmb/MiniCPM4.1-8B-GGUF) and run it with llama.cpp for efficient CPU or GPU inference.
494
- ```
495
- # case 1: main-cli
496
- ./build/bin/llama-cli -m MiniCPM4.1-8B-Q4_K_M.gguf -p "Write an article about Artificial Intelligence." -n 1500
497
-
498
- # case 2: server
499
- ## launch server
500
- ./build/bin/llama-server -m MiniCPM4.1-8B-Q4_K_M.gguf --host 127.0.0.1 --port 8080 -c 4096 -fa on &
501
-
502
- ## send request
503
- curl -X POST http://127.0.0.1:8080/v1/chat/completions \
504
- -H "Content-Type: application/json" \
505
- -d '{
506
- "model": "gpt-3.5-turbo",
507
- "messages": [{"role": "user", "content": "Write an article about Artificial Intelligence."}],
508
- "max_tokens": 1500
509
- }'
510
- ```
511
-
512
- ##### Ollama
513
- Please refer to [model hub](https://ollama.com/openbmb/minicpm4.1) for model download. After installing ollama package, you can use MiniCPM4.1 with following commands:
514
- ```
515
- ollama run openbmb/minicpm4.1
516
- ```
517
-
518
  ### Hybrid Reasoning Mode
519
 
520
  MiniCPM4.1 supports a hybrid reasoning mode and can be used in both deep reasoning mode and non-reasoning mode. Users can set `enable_thinking=True` in `tokenizer.apply_chat_template` to enable reasoning mode and `enable_thinking=False` to enable non-reasoning mode. Similarly, users can append `/no_think` to the end of the query to enable non-reasoning mode. If no special token is appended, or `/think` is appended to the end of the query, the model will use reasoning mode.
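  For example, a minimal sketch of switching modes through the chat template (the model id and `trust_remote_code` flag are assumptions; the `enable_thinking` switch is as described above):

  ```python
  from transformers import AutoTokenizer

  # Model id assumed for illustration.
  tokenizer = AutoTokenizer.from_pretrained("openbmb/MiniCPM4.1-8B", trust_remote_code=True)

  messages = [{"role": "user", "content": "Write an article about Artificial Intelligence."}]

  # Deep reasoning mode (the default; appending "/think" to the query behaves the same).
  prompt_think = tokenizer.apply_chat_template(
      messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
  )

  # Non-reasoning mode (appending "/no_think" to the query behaves the same).
  prompt_no_think = tokenizer.apply_chat_template(
      messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
  )
  ```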
@@ -550,9 +502,8 @@ prompt_text = tokenizer.apply_chat_template(
550
 
551
  ```bibtex
552
  @article{minicpm4,
553
- title={Minicpm4: Ultra-efficient llms on end devices},
554
- author={MiniCPM, Team},
555
- journal={arXiv preprint arXiv:2506.07900},
505
+ title={{MiniCPM4}: Ultra-Efficient LLMs on End Devices},
506
+ author={MiniCPM Team},
556
  year={2025}
557
  }
558
  ```
 
config.json CHANGED
@@ -30,7 +30,7 @@
30
  "original_max_position_embeddings": 65536
31
  },
32
  "torch_dtype": "bfloat16",
33
- "transformers_version": "4.56.1",
33
+ "transformers_version": "4.46.3",
34
  "use_cache": true,
35
  "vocab_size": 73448,
36
  "rope_theta": 10000.0,
 
modeling_minicpm.py CHANGED
@@ -21,7 +21,7 @@ from typing import Any, Dict, List, Optional, Tuple, Union
21
  import torch
22
  import torch.nn.functional as F
23
  import torch.utils.checkpoint
24
- from torch import nn
25
  from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
26
  from transformers.activations import ACT2FN
27
  from transformers.cache_utils import Cache, DynamicCache, CacheLayerMixin, DynamicLayer
@@ -47,9 +47,7 @@ from transformers.utils import (
47
  )
48
  from transformers.utils.import_utils import is_torch_fx_available
49
 
50
-
51
-
52
- from .configuration_minicpm import MiniCPMConfig #!一定要改
53
 
54
  try:
55
  from flash_attn import flash_attn_func, flash_attn_varlen_func
@@ -70,28 +68,50 @@ from functools import lru_cache
70
  def compressed_attention(
71
  q: torch.Tensor,
72
  k: torch.Tensor,
73
- k2: torch.Tensor,
74
  kernel_size: int,
75
  kernel_stride: int,
76
  block_size: int,
77
  topk: int,
78
  cu_seqlens_q: torch.Tensor,
79
  cu_seqlens_k: torch.Tensor,
80
- cu_seqlens_k2: torch.Tensor,
81
  max_seqlen_q: int,
82
  max_seqlen_k: int,
83
  sm_scale: float = None,
84
  init_blocks: int = 1,
85
  local_blocks: int = 2,
86
- cache_lens=None,
87
  ) -> Tuple[torch.Tensor, torch.Tensor]:
88
  with torch.no_grad():
89
  batch_size = cu_seqlens_q.shape[0] - 1
90
 
91
  # Check if it's prefilling stage
92
  is_prefilling = cache_lens is None or (cache_lens == 0).all().item()
93
-
94
- if is_prefilling: # prefilling stage
 
95
  # Calculate q_idx for each query position in each batch
96
  cache_lens = torch.zeros(batch_size, dtype=torch.int32, device=q.device)
97
  q_idx = torch.cat([
@@ -99,24 +119,25 @@ def compressed_attention(
99
  max_seqlen_q - (cu_seqlens_q[i + 1] - cu_seqlens_q[i])) // block_size
100
  for i in range(batch_size)
101
  ], dim=0) # shape: [total_q_len]
102
- else: # decoding stage
103
- # Each batch has only one query (last position)
104
- q_idx = cache_lens // block_size # shape: [batch_size] = [total_q_len] in decoding
 
105
 
106
- # 计算attention score
107
  score = infllmv2_attn_stage1(
108
  q.contiguous(),
109
  k.contiguous(),
110
- k2.contiguous(),
111
  cu_seqlens_q=cu_seqlens_q,
112
  cu_seqlens_k=cu_seqlens_k,
113
- cu_seqlens_v=cu_seqlens_k2,
114
  max_seqlen_q=max_seqlen_q,
115
  max_seqlen_k=max_seqlen_k,
116
- causal=is_prefilling
117
- )
118
- score = score[:, :q_idx.shape[0], :] # [num_heads, total_q_len, num_blocks]
119
-
 
120
  block_score = max_pooling_1d_varlen(
121
  score.contiguous(),
122
  cu_seqlens_q,
@@ -127,9 +148,7 @@ def compressed_attention(
127
  local_blocks=local_blocks,
128
  init_blocks=init_blocks,
129
  block_size=block_size,
130
- stride=kernel_stride
131
- ) # shape: [num_heads, total_q_len, num_blocks]
132
-
133
 
134
  # get topk
135
  topk = min(topk, block_score.shape[-1])
@@ -243,11 +262,6 @@ class InfLLMv2CacheLayer(DynamicLayer):
243
  self.no_compress_k_cache = []
244
  self.cached_compressed_cu_seqlens = torch.tensor([], dtype=torch.int32)
245
  self.compress_k_cache_varlen = torch.tensor([], dtype=torch.float32)
246
- # Add support for compress_k2
247
- self.compress_k2_cache = []
248
- self.cached_compressed_cu_seqlens2 = torch.tensor([], dtype=torch.int32)
249
- self.compress_k2_cache_varlen = torch.tensor([], dtype=torch.float32)
250
- self.no_compress_k2_cache = []
251
 
252
  def update_no_rope_key(self, key_states):
253
  if self.no_rope_keys.numel() == 0:
@@ -289,45 +303,12 @@ class InfLLMv2CacheLayer(DynamicLayer):
289
  k_chunk_list.append(None)
290
  return k_chunk_list
291
 
292
- def update_compress_k2(self, key_states, cu_seqlens=None):
293
- if len(self.compress_k2_cache) == 0:
294
- if cu_seqlens is not None:
295
- self.cached_compressed_cu_seqlens2 = cu_seqlens.clone()
296
- self.compress_k2_cache_varlen = key_states
297
- split_sizes = (cu_seqlens[1:] - cu_seqlens[:-1]).tolist()
298
- self.compress_k2_cache = list(torch.split(key_states, split_sizes))
299
- else:
300
- for index, k in enumerate(key_states):
301
- if k is not None:
302
- self.compress_k2_cache[index] = torch.cat([self.compress_k2_cache[index], k], dim=0)
303
- new_seq_lens = torch.tensor([tensor.shape[0] for tensor in self.compress_k2_cache], dtype=torch.int32)
304
- new_cumsum = torch.cumsum(new_seq_lens, dim=0, dtype=torch.int32)
305
-
306
- self.compress_k2_cache_varlen = torch.cat(self.compress_k2_cache, dim=0)
307
- self.cached_compressed_cu_seqlens2 = torch.cat([torch.tensor([0], dtype=torch.int32), new_cumsum]).to(self.compress_k2_cache_varlen.device)
308
- return self.compress_k2_cache_varlen, self.cached_compressed_cu_seqlens2
309
-
310
- def update_no_compress_k2(self, key_states, kernel_size=128, kernel_stride=64):
311
- k_chunk_list = []
312
- for index, k in enumerate(key_states):
313
- if len(self.no_compress_k2_cache) <= index:
314
- self.no_compress_k2_cache.append(k)
315
- else:
316
- self.no_compress_k2_cache[index] = torch.cat([self.no_compress_k2_cache[index], k], dim=0)
317
- current_len = self.no_compress_k2_cache[index].shape[0]
318
- if current_len >= kernel_size:
319
- k_chunk_list.append(self.no_compress_k2_cache[index][:kernel_size])
320
- self.no_compress_k2_cache[index] = self.no_compress_k2_cache[index][kernel_stride:]
321
- else:
322
- k_chunk_list.append(None)
323
- return k_chunk_list
324
-
325
  class InfLLMv2Cache(DynamicCache):
326
- def __init__(self, config,num_hidden_layers: Optional[int] = None) -> None:
 
327
  super().__init__(config=config)
328
  self.layers = [InfLLMv2CacheLayer() for _ in range(num_hidden_layers)] if num_hidden_layers else []
329
  self._seen_tokens = 0
330
-
331
 
332
  def update(self, key_states, value_states, layer_idx, cache_kwargs=None):
333
  if layer_idx == 0:
@@ -343,12 +324,6 @@ class InfLLMv2Cache(DynamicCache):
343
  def update_no_compress_k(self, key_states, layer_idx, kernel_size=32, kernel_stride=16, cache_kwargs=None):
344
  return self.layers[layer_idx].update_no_compress_k(key_states, kernel_size, kernel_stride)
345
 
346
- def update_compress_k2(self, key_states, layer_idx, cu_seqlens=None, cache_kwargs=None):
347
- return self.layers[layer_idx].update_compress_k2(key_states, cu_seqlens)
348
-
349
- def update_no_compress_k2(self, key_states, layer_idx, kernel_size=128, kernel_stride=64, cache_kwargs=None):
350
- return self.layers[layer_idx].update_no_compress_k2(key_states, kernel_size, kernel_stride)
351
-
352
  def crop(self, max_length):
353
  for layer in self.layers:
354
  layer.crop(max_length)
@@ -616,6 +591,7 @@ def _unpad_one_tensor(hidden_states, attention_mask):
616
  unpadded_states = index_first_axis(reshaped_states, indices)
617
 
618
  return unpadded_states, indices, cu_seqlens, max_seqlen_in_batch
 
619
  def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
620
  """
621
  This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
@@ -1022,9 +998,7 @@ class MiniCPMInfLLMv2Attention(MiniCPMAttention):
1022
  self.local_blocks = self.window_size // self.block_size # local_blocks
1023
  self.topk = self.config.sparse_config.get('topk', 64) + (self.window_size//self.block_size)
1024
  self.use_nope = self.config.sparse_config.get('use_nope', False)
1025
-
1026
  self.compress_k = CompressK(self.num_key_value_heads, self.head_dim, kernel_size=self.kernel_size, kernel_stride=self.kernel_stride)
1027
- self.compress_k2 = CompressK(self.num_key_value_heads, self.head_dim, kernel_size=self.kernel_size*4, kernel_stride=self.kernel_stride*4)
1028
 
1029
  def forward(
1030
  self,
@@ -1049,7 +1023,6 @@ class MiniCPMInfLLMv2Attention(MiniCPMAttention):
1049
 
1050
  bsz, q_len, _ = hidden_states.size()
1051
 
1052
-
1053
  query_states = self.q_proj(hidden_states)
1054
  key_states = self.k_proj(hidden_states)
1055
  value_states = self.v_proj(hidden_states)
@@ -1080,12 +1053,11 @@ class MiniCPMInfLLMv2Attention(MiniCPMAttention):
1080
  key_states = key_states.transpose(1, 2)
1081
  value_states = value_states.transpose(1, 2)
1082
  if self.use_nope:
1083
- key_states_no_rope =past_key_value.update_no_rope_key(key_states_no_rope, self.layer_idx)
1084
  no_rope_param = {
1085
  'key_states_no_rope': key_states_no_rope,
1086
  'query_states_no_rope': query_states_no_rope,
1087
  }
1088
-
1089
  else:
1090
  no_rope_param = None
1091
 
@@ -1131,8 +1103,16 @@ class MiniCPMInfLLMv2Attention(MiniCPMAttention):
1131
  return attn_output, attn_weights, past_key_value
1132
 
1133
  def _sparse_attention_forward(
1134
- self, query_states, key_states, value_states, attention_mask, query_length, dropout=0.0, softmax_scale=None, no_rope_param=None, past_key_value=None
1135
- ):
1136
  """
1137
  Calls the forward method of Flash Attention - if the input hidden states contain at least one padding token
1138
  first unpad the input, then computes the attention scores and pad the final attention scores.
@@ -1162,17 +1142,15 @@ class MiniCPMInfLLMv2Attention(MiniCPMAttention):
1162
  batch_size = query_states.shape[0]
1163
  # assert batch_size == 1, 'Only batch_size=1 is supported at the moment.'
1164
  if past_key_value!=None:
1165
- compressed_k, compressed_cu_seqlens, compressed_k2, compressed_cu_seqlens2 = self.get_compress_k(
1166
  key_states=key_states if self.use_nope ==False else no_rope_param['key_states_no_rope'], # This can be optimized a bit;
1167
  attention_mask=attention_mask,
1168
- past_key_value=past_key_value,
1169
-
1170
- )
1171
 
1172
  query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(
1173
  query_states, key_states, value_states, attention_mask, query_length
1174
  )
1175
-
1176
  cu_seqlens_q, cu_seqlens_k = cu_seq_lens
1177
  max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
1178
  if no_rope_param != None:
@@ -1183,12 +1161,7 @@ class MiniCPMInfLLMv2Attention(MiniCPMAttention):
1183
  if past_key_value==None:
1184
  # compress_k use varlen form
1185
  compressed_k, compressed_cu_seqlens = self.compress_k(key_states,cu_seqlens_k)
1186
- compressed_k2, compressed_cu_seqlens2 = self.compress_k2(key_states,cu_seqlens_k)
1187
- else:
1188
- # compressed_k and compressed_k2 already retrieved from get_compress_k above
1189
- pass
1190
 
1191
-
1192
  attn_output_unpad = self.sparse_forward(
1193
  query_states,
1194
  key_states,
@@ -1198,16 +1171,15 @@ class MiniCPMInfLLMv2Attention(MiniCPMAttention):
1198
  max_seqlen_in_batch_q,
1199
  max_seqlen_in_batch_k,
1200
  no_rope_param=no_rope_param,
1201
- compressed_k=compressed_k, compressed_cu_seqlens=compressed_cu_seqlens,
1202
- compressed_k2=compressed_k2, compressed_cu_seqlens2=compressed_cu_seqlens2
1203
- )
1204
 
1205
  attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length)
1206
-
1207
  else:
1208
  raise ValueError('Need attention mask')
1209
 
1210
  return attn_output
 
1211
  def get_compress_k(self, key_states, attention_mask, past_key_value):
1212
  """
1213
  Get compressed key states and corresponding cumulative sequence lengths.
@@ -1219,51 +1191,34 @@ class MiniCPMInfLLMv2Attention(MiniCPMAttention):
1219
  no_rope_param: Optional parameter containing key states without rope
1220
 
1221
  Returns:
1222
- Tuple of (compressed_k, compressed_cu_seqlens, compressed_k2, compressed_cu_seqlens2)
1223
  """
1224
-
1225
  # Check if this is prefilling or initial compression condition
1226
-
1227
  is_prefilling = (
1228
  key_states.shape[1] >= self.dense_len and
1229
  (
1230
  not past_key_value.layers[self.layer_idx].compress_k_cache
1231
  )
1232
  )
1233
-
1234
  if is_prefilling:
1235
  unpadded_key_states, indices, cu_seqlens, max_seqlen_in_batch = _unpad_one_tensor(key_states,attention_mask=attention_mask)
1236
  # Compress the keys
1237
  compressed_k, compressed_cu_seqlens = self.compress_k(unpadded_key_states, cu_seqlens)
1238
- compressed_k2, compressed_cu_seqlens2 = self.compress_k2(unpadded_key_states, cu_seqlens)
1239
-
1240
  past_key_value.update_compress_k(
1241
  compressed_k, self.layer_idx, compressed_cu_seqlens)
1242
- past_key_value.update_compress_k2(
1243
- compressed_k2, self.layer_idx, compressed_cu_seqlens2)
1244
-
1245
  no_compress_k_list = []
1246
  # Compute and update no_compress_k
1247
  for i in range(len(compressed_cu_seqlens)-1):
1248
  no_compress_k_start = (compressed_cu_seqlens[i+1]- compressed_cu_seqlens[i]) * self.kernel_stride
1249
-
1250
  no_compress_k_list.append(unpadded_key_states[cu_seqlens[i]+no_compress_k_start:cu_seqlens[i+1]].clone())
1251
 
1252
  past_key_value.update_no_compress_k(
1253
  no_compress_k_list, self.layer_idx,kernel_stride=self.kernel_stride,
1254
  kernel_size=self.kernel_size)
1255
-
1256
- # Also update no_compress_k2
1257
- no_compress_k2_list = []
1258
- for i in range(len(compressed_cu_seqlens2)-1):
1259
- no_compress_k2_start = (compressed_cu_seqlens2[i+1]- compressed_cu_seqlens2[i]) * self.kernel_stride * 4
1260
-
1261
- no_compress_k2_list.append(unpadded_key_states[cu_seqlens[i]+no_compress_k2_start:cu_seqlens[i+1]].clone())
1262
-
1263
- past_key_value.update_no_compress_k2(
1264
- no_compress_k2_list, self.layer_idx,kernel_stride=self.kernel_stride*4,
1265
- kernel_size=self.kernel_size*4)
1266
-
1267
  else:
1268
  # Decode case: incremental update
1269
  batch_size = key_states.shape[0] # key_states.shape = [batch_size, seq, k_head_num, head_dim]
@@ -1278,32 +1233,16 @@ class MiniCPMInfLLMv2Attention(MiniCPMAttention):
1278
  kernel_size=self.kernel_size)
1279
  new_compressed_k_list = []
1280
  for no_compress_k in no_compress_k_list:
1281
-
1282
  if no_compress_k is not None:
1283
  # We have enough tokens to compress
1284
  new_compressed_k = no_compress_k.mean(dim=0, keepdim=True) # [1, n_heads_k, head_dim]
1285
-
1286
  new_compressed_k_list.append(new_compressed_k)
1287
  else:
1288
  new_compressed_k_list.append(None)
1289
  compressed_k, compressed_cu_seqlens = past_key_value.update_compress_k(new_compressed_k_list, self.layer_idx,)
1290
-
1291
- # For compress_k2, update no_compress_k2 buffer and compress when ready
1292
- no_compress_k2_list = past_key_value.update_no_compress_k2(
1293
- key_states_split, self.layer_idx,
1294
- kernel_stride=self.kernel_stride*4,
1295
- kernel_size=self.kernel_size*4)
1296
- new_compressed_k2_list = []
1297
- for no_compress_k2 in no_compress_k2_list:
1298
- if no_compress_k2 is not None:
1299
- # We have enough tokens to compress for k2
1300
- new_compressed_k2 = no_compress_k2.mean(dim=0, keepdim=True) # [1, n_heads_k, head_dim]
1301
- new_compressed_k2_list.append(new_compressed_k2)
1302
- else:
1303
- new_compressed_k2_list.append(None)
1304
- compressed_k2, compressed_cu_seqlens2 = past_key_value.update_compress_k2(new_compressed_k2_list, self.layer_idx,)
1305
-
1306
- return compressed_k, compressed_cu_seqlens, compressed_k2, compressed_cu_seqlens2
1307
  def sparse_forward(self,
1308
  query_layer,
1309
  key_layer,
@@ -1313,8 +1252,8 @@ class MiniCPMInfLLMv2Attention(MiniCPMAttention):
1313
  max_seqlen_in_batch_q,
1314
  max_seqlen_in_batch_k,
1315
  no_rope_param=None,
1316
- compressed_k=None, compressed_cu_seqlens=None,
1317
- compressed_k2=None, compressed_cu_seqlens2=None):
1318
  compressed_seqlens = compressed_cu_seqlens[1:] - compressed_cu_seqlens[:-1]
1319
  cache_lens = None
1320
  if max_seqlen_in_batch_q==1 and max_seqlen_in_batch_k>1: #decoding
@@ -1324,14 +1263,13 @@ class MiniCPMInfLLMv2Attention(MiniCPMAttention):
1324
  topk_idx = compressed_attention(
1325
  query_layer if no_rope_param is None else no_rope_param['query_states_no_rope'],
1326
  compressed_k,
1327
- compressed_k2,
1328
  self.kernel_size,
1329
  self.kernel_stride,
1330
  self.block_size,
1331
  self.topk,
1332
  cu_seqlens_q,
1333
  compressed_cu_seqlens,
1334
- compressed_cu_seqlens2,
1335
  max_seqlen_in_batch_q,
1336
  compressed_seqlens.max().item(),
1337
  None,
 
21
  import torch
22
  import torch.nn.functional as F
23
  import torch.utils.checkpoint
24
+ from torch import nn
25
  from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
26
  from transformers.activations import ACT2FN
27
  from transformers.cache_utils import Cache, DynamicCache, CacheLayerMixin, DynamicLayer
 
47
  )
48
  from transformers.utils.import_utils import is_torch_fx_available
49
 
50
+ from .configuration_minicpm import MiniCPMConfig
 
 
51
 
52
  try:
53
  from flash_attn import flash_attn_func, flash_attn_varlen_func
 
68
  def compressed_attention(
69
  q: torch.Tensor,
70
  k: torch.Tensor,
71
+ v: torch.Tensor,
72
  kernel_size: int,
73
  kernel_stride: int,
74
  block_size: int,
75
  topk: int,
76
  cu_seqlens_q: torch.Tensor,
77
  cu_seqlens_k: torch.Tensor,
 
78
  max_seqlen_q: int,
79
  max_seqlen_k: int,
80
  sm_scale: float = None,
81
  init_blocks: int = 1,
82
  local_blocks: int = 2,
83
+ cache_lens: torch.Tensor = None,
84
  ) -> Tuple[torch.Tensor, torch.Tensor]:
85
+ """Attention between query and compressed key and value. Compute attention output and topk block idx used in topk_sparse_attention.
86
+
87
+ Args:
88
+ q (torch.Tensor): shape [total_q_len, num_q_heads, head_dim]
89
+ k (torch.Tensor): shape [total_kv_len, num_kv_heads, head_dim]
90
+ v (torch.Tensor): shape [total_kv_len, num_kv_heads, head_dim]
91
+ kernel_size (int): kernel size in compress_key_value
92
+ kernel_stride (int): stride of compress_key_value
93
+ block_size (int): key value block size for topk sparse attention.
94
+ topk (int): number of blocks for each query.
95
+ cu_seqlens_q (torch.Tensor): shape [batch_size + 1], similar to cu_seqlens_q in flash_attn_func_varlen.
96
+ cu_seqlens_k (torch.Tensor): shape [batch_size + 1], similar to cu_seqlens_k in flash_attn_func_varlen.
97
+ max_seqlen_q (int): max q len of the batch.
98
+ max_seqlen_k (int): max k len of the batch.
99
+ sm_scale (float, optional): softmax scale. Defaults to None, means 1/sqrt(head_dim).
100
+ init_blocks (int, optional): Number of init blocks for each query. Defaults to 1.
101
+ local_blocks (int, optional): Number of local blocks for each query. Defaults to 2.
102
+ cache_lens (torch.Tensor, optional): shape [batch_size], used to record the cache length of each query. Defaults to None.
103
+
104
+ Returns:
105
+ Tuple[torch.Tensor, torch.Tensor]: attention output and topk_idx used in topk_sparse_attention
106
+ """
107
  with torch.no_grad():
108
  batch_size = cu_seqlens_q.shape[0] - 1
109
 
110
  # Check if it's prefilling stage
111
  is_prefilling = cache_lens is None or (cache_lens == 0).all().item()
112
+
113
+ # prefilling stage
114
+ if is_prefilling:
115
  # Calculate q_idx for each query position in each batch
116
  cache_lens = torch.zeros(batch_size, dtype=torch.int32, device=q.device)
117
  q_idx = torch.cat([
 
119
  max_seqlen_q - (cu_seqlens_q[i + 1] - cu_seqlens_q[i])) // block_size
120
  for i in range(batch_size)
121
  ], dim=0) # shape: [total_q_len]
122
+ # decoding stage
123
+ else:
124
+ # Each batch has only one query (last position). Shape: [batch_size] = [total_q_len] in decoding
125
+ q_idx = cache_lens // block_size
126
 
127
+ # compute attention score
128
  score = infllmv2_attn_stage1(
129
  q.contiguous(),
130
  k.contiguous(),
131
+ v.contiguous(),
132
  cu_seqlens_q=cu_seqlens_q,
133
  cu_seqlens_k=cu_seqlens_k,
 
134
  max_seqlen_q=max_seqlen_q,
135
  max_seqlen_k=max_seqlen_k,
136
+ causal=is_prefilling)
137
+ # Shape: [num_heads, total_q_len, num_blocks]
138
+ score = score[:, :q_idx.shape[0], :]
139
+
140
+ # Shape: [num_heads, total_q_len, num_blocks]
141
  block_score = max_pooling_1d_varlen(
142
  score.contiguous(),
143
  cu_seqlens_q,
 
148
  local_blocks=local_blocks,
149
  init_blocks=init_blocks,
150
  block_size=block_size,
151
+ stride=kernel_stride)
 
 
152
 
153
  # get topk
154
  topk = min(topk, block_score.shape[-1])
 
262
  self.no_compress_k_cache = []
263
  self.cached_compressed_cu_seqlens = torch.tensor([], dtype=torch.int32)
264
  self.compress_k_cache_varlen = torch.tensor([], dtype=torch.float32)
265
 
266
  def update_no_rope_key(self, key_states):
267
  if self.no_rope_keys.numel() == 0:
 
303
  k_chunk_list.append(None)
304
  return k_chunk_list
305

306
  class InfLLMv2Cache(DynamicCache):
307
+ def __init__(self,
308
+ config,num_hidden_layers: Optional[int] = None) -> None:
309
  super().__init__(config=config)
310
  self.layers = [InfLLMv2CacheLayer() for _ in range(num_hidden_layers)] if num_hidden_layers else []
311
  self._seen_tokens = 0
 
312
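A hedged sketch of constructing the cache under this new signature (the `num_hidden_layers` attribute name on the config is assumed from standard transformers configs):

```python
# Hedged sketch: building the InfLLM v2 cache with the (config, num_hidden_layers)
# signature shown above, then passing it to generate/forward as past_key_values.
past_key_values = InfLLMv2Cache(config, num_hidden_layers=config.num_hidden_layers)
```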
 
313
  def update(self, key_states, value_states, layer_idx, cache_kwargs=None):
314
  if layer_idx == 0:
 
324
  def update_no_compress_k(self, key_states, layer_idx, kernel_size=32, kernel_stride=16, cache_kwargs=None):
325
  return self.layers[layer_idx].update_no_compress_k(key_states, kernel_size, kernel_stride)
326

327
  def crop(self, max_length):
328
  for layer in self.layers:
329
  layer.crop(max_length)
 
591
  unpadded_states = index_first_axis(reshaped_states, indices)
592
 
593
  return unpadded_states, indices, cu_seqlens, max_seqlen_in_batch
594
+
595
  def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
596
  """
597
  This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
 
998
  self.local_blocks = self.window_size // self.block_size # local_blocks
999
  self.topk = self.config.sparse_config.get('topk', 64) + (self.window_size//self.block_size)
1000
  self.use_nope = self.config.sparse_config.get('use_nope', False)
 
1001
  self.compress_k = CompressK(self.num_key_value_heads, self.head_dim, kernel_size=self.kernel_size, kernel_stride=self.kernel_stride)
 
1002
 
1003
  def forward(
1004
  self,
 
1023
 
1024
  bsz, q_len, _ = hidden_states.size()
1025
 
 
1026
  query_states = self.q_proj(hidden_states)
1027
  key_states = self.k_proj(hidden_states)
1028
  value_states = self.v_proj(hidden_states)
 
1053
  key_states = key_states.transpose(1, 2)
1054
  value_states = value_states.transpose(1, 2)
1055
  if self.use_nope:
1056
+ key_states_no_rope = past_key_value.update_no_rope_key(key_states_no_rope, self.layer_idx)
1057
  no_rope_param = {
1058
  'key_states_no_rope': key_states_no_rope,
1059
  'query_states_no_rope': query_states_no_rope,
1060
  }
 
1061
  else:
1062
  no_rope_param = None
1063
 
 
1103
  return attn_output, attn_weights, past_key_value
1104
 
1105
  def _sparse_attention_forward(
1106
+ self,
1107
+ query_states,
1108
+ key_states,
1109
+ value_states,
1110
+ attention_mask,
1111
+ query_length,
1112
+ dropout=0.0,
1113
+ softmax_scale=None,
1114
+ no_rope_param=None,
1115
+ past_key_value=None):
1116
  """
1117
  Calls the forward method of Flash Attention - if the input hidden states contain at least one padding token
1118
  first unpad the input, then computes the attention scores and pad the final attention scores.
 
1142
  batch_size = query_states.shape[0]
1143
  # assert batch_size == 1, 'Only batch_size=1 is supported at the moment.'
1144
  if past_key_value!=None:
1145
+ compressed_k, compressed_cu_seqlens = self.get_compress_k(
1146
  key_states=key_states if self.use_nope ==False else no_rope_param['key_states_no_rope'], # This can be optimized a bit;
1147
  attention_mask=attention_mask,
1148
+ past_key_value=past_key_value)
 
 
1149
 
1150
  query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(
1151
  query_states, key_states, value_states, attention_mask, query_length
1152
  )
1153
+
1154
  cu_seqlens_q, cu_seqlens_k = cu_seq_lens
1155
  max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
1156
  if no_rope_param != None:
 
1161
  if past_key_value==None:
1162
  # compress_k use varlen form
1163
  compressed_k, compressed_cu_seqlens = self.compress_k(key_states,cu_seqlens_k)
1164

1165
  attn_output_unpad = self.sparse_forward(
1166
  query_states,
1167
  key_states,
 
1171
  max_seqlen_in_batch_q,
1172
  max_seqlen_in_batch_k,
1173
  no_rope_param=no_rope_param,
1174
+ compressed_k=compressed_k,
1175
+ compressed_cu_seqlens=compressed_cu_seqlens)
 
1176
 
1177
  attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length)
 
1178
  else:
1179
  raise ValueError('Need attention mask')
1180
 
1181
  return attn_output
1182
+
1183
  def get_compress_k(self, key_states, attention_mask, past_key_value):
1184
  """
1185
  Get compressed key states and corresponding cumulative sequence lengths.
 
1191
  no_rope_param: Optional parameter containing key states without rope
1192
 
1193
  Returns:
1194
+ Tuple of (compressed_k, compressed_cu_seqlens)
1195
  """
 
1196
  # Check if this is prefilling or initial compression condition
 
1197
  is_prefilling = (
1198
  key_states.shape[1] >= self.dense_len and
1199
  (
1200
  not past_key_value.layers[self.layer_idx].compress_k_cache
1201
  )
1202
  )
1203
+
1204
  if is_prefilling:
1205
  unpadded_key_states, indices, cu_seqlens, max_seqlen_in_batch = _unpad_one_tensor(key_states,attention_mask=attention_mask)
1206
  # Compress the keys
1207
  compressed_k, compressed_cu_seqlens = self.compress_k(unpadded_key_states, cu_seqlens)
1208
+
 
1209
  past_key_value.update_compress_k(
1210
  compressed_k, self.layer_idx, compressed_cu_seqlens)
1211
+
 
 
1212
  no_compress_k_list = []
1213
  # Compute and update no_compress_k
1214
  for i in range(len(compressed_cu_seqlens)-1):
1215
  no_compress_k_start = (compressed_cu_seqlens[i+1]- compressed_cu_seqlens[i]) * self.kernel_stride
1216
+
1217
  no_compress_k_list.append(unpadded_key_states[cu_seqlens[i]+no_compress_k_start:cu_seqlens[i+1]].clone())
1218
 
1219
  past_key_value.update_no_compress_k(
1220
  no_compress_k_list, self.layer_idx,kernel_stride=self.kernel_stride,
1221
  kernel_size=self.kernel_size)

1222
  else:
1223
  # Decode case: incremental update
1224
  batch_size = key_states.shape[0] # key_states.shape = [batch_size, seq, k_head_num, head_dim]
 
1233
  kernel_size=self.kernel_size)
1234
  new_compressed_k_list = []
1235
  for no_compress_k in no_compress_k_list:
 
1236
  if no_compress_k is not None:
1237
  # We have enough tokens to compress
1238
  new_compressed_k = no_compress_k.mean(dim=0, keepdim=True) # [1, n_heads_k, head_dim]
 
1239
  new_compressed_k_list.append(new_compressed_k)
1240
  else:
1241
  new_compressed_k_list.append(None)
1242
  compressed_k, compressed_cu_seqlens = past_key_value.update_compress_k(new_compressed_k_list, self.layer_idx,)
1243
+
1244
+ return compressed_k, compressed_cu_seqlens
1245
+
1246
  def sparse_forward(self,
1247
  query_layer,
1248
  key_layer,
 
1252
  max_seqlen_in_batch_q,
1253
  max_seqlen_in_batch_k,
1254
  no_rope_param=None,
1255
+ compressed_k=None,
1256
+ compressed_cu_seqlens=None):
1257
  compressed_seqlens = compressed_cu_seqlens[1:] - compressed_cu_seqlens[:-1]
1258
  cache_lens = None
1259
  if max_seqlen_in_batch_q==1 and max_seqlen_in_batch_k>1: #decoding
 
1263
  topk_idx = compressed_attention(
1264
  query_layer if no_rope_param is None else no_rope_param['query_states_no_rope'],
1265
  compressed_k,
1266
+ compressed_k.clone(),
1267
  self.kernel_size,
1268
  self.kernel_stride,
1269
  self.block_size,
1270
  self.topk,
1271
  cu_seqlens_q,
1272
  compressed_cu_seqlens,
 
1273
  max_seqlen_in_batch_q,
1274
  compressed_seqlens.max().item(),
1275
  None,