guanwenyu1995 committed on
Commit
6066ea1
·
verified ·
1 Parent(s): 0289755

Upload folder using huggingface_hub

README.md CHANGED
@@ -1,3 +1,226 @@
1
  ---
2
  license: apache-2.0
3
+ language:
4
+ - zh
5
+ - en
6
+ pipeline_tag: text-generation
7
+ library_name: transformers
8
  ---
9
+ <div align="center">
10
+ <img src="https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm_logo.png?raw=true" width="500em" ></img>
11
+ </div>
12
+
13
+ <p align="center">
14
+ <a href="https://github.com/OpenBMB/MiniCPM/" target="_blank">GitHub Repo</a> |
15
+ <a href="https://github.com/OpenBMB/MiniCPM/tree/main/report/MiniCPM_4_Technical_Report.pdf" target="_blank">Technical Report</a>
16
+ </p>
17
+ <p align="center">
18
+ 👋 Join us on <a href="https://discord.gg/3cGQn9b3YM" target="_blank">Discord</a> and <a href="https://github.com/OpenBMB/MiniCPM/blob/main/assets/wechat.jpg" target="_blank">WeChat</a>
19
+ </p>
20
+
21
+ ## What's New
22
+ - [2025.06.06] The **MiniCPM4** series is released! The models deliver large efficiency gains while keeping strong performance for their scale, with over 5x generation acceleration on typical end-side chips. The technical report is available [here](https://github.com/OpenBMB/MiniCPM/tree/main/report/MiniCPM_4_Technical_Report.pdf).🔥🔥🔥
23
+
24
+ ## MiniCPM4 Series
25
+ The MiniCPM4 series consists of highly efficient large language models (LLMs) designed explicitly for end-side devices. The efficiency comes from systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems.
26
+ - [MiniCPM4-8B](https://huggingface.co/openbmb/MiniCPM4-8B): The flagship of MiniCPM4, with 8B parameters, trained on 8T tokens.
27
+ - [MiniCPM4-0.5B](https://huggingface.co/openbmb/MiniCPM4-0.5B): The small version of MiniCPM4, with 0.5B parameters, trained on 1T tokens. (**<-- you are here**)
28
+ - [MiniCPM4-8B-Eagle-FRSpec](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-FRSpec): Eagle head for FRSpec, accelerating speculative inference for MiniCPM4-8B.
29
+ - [MiniCPM4-8B-Eagle-FRSpec-QAT-cpmcu](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-FRSpec-QAT-cpmcu): Eagle head trained with QAT for FRSpec, efficiently integrating speculation and quantization to further accelerate MiniCPM4-8B.
30
+ - [MiniCPM4-8B-Eagle-vLLM](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-vLLM): Eagle head in vLLM format, accelerating speculative inference for MiniCPM4-8B.
31
+ - [MiniCPM4-8B-marlin-Eagle-vLLM](https://huggingface.co/openbmb/MiniCPM4-8B-marlin-Eagle-vLLM): Quantized Eagle head for vLLM format, accelerating speculative inference for MiniCPM4-8B.
32
+ - [BitCPM4-0.5B](https://huggingface.co/openbmb/BitCPM4-0.5B): Extreme ternary quantization applied to MiniCPM4-0.5B compresses model parameters into ternary values, achieving a 90% reduction in bit width.
33
+ - [BitCPM4-1B](https://huggingface.co/openbmb/BitCPM4-1B): Extreme ternary quantization applied to MiniCPM3-1B compresses model parameters into ternary values, achieving a 90% reduction in bit width.
34
+ - [MiniCPM4-Survey](https://huggingface.co/openbmb/MiniCPM4-Survey): Based on MiniCPM4-8B, accepts users' queries as input and autonomously generates trustworthy, long-form survey papers.
35
+ - [MiniCPM4-MCP](https://huggingface.co/openbmb/MiniCPM4-MCP): Based on MiniCPM4-8B, accepts users' queries and available MCP tools as input and autonomously calls relevant MCP tools to satisfy users' requirements.
36
+
37
+ ## Introduction
38
+ MiniCPM4 is a highly efficient large language model for edge-side devices, optimized across four dimensions: model architecture, learning algorithms, training data, and inference systems.
39
+
40
+ - 🏗️ **Efficient Model Architecture:**
41
+ - InfLLM v2 -- Trainable Sparse Attention Mechanism: Adopts a trainable sparse attention architecture in which each token computes relevance with fewer than 5% of the tokens when processing 128K-token texts, significantly reducing computational overhead for long texts
42
+
43
+ - 🧠 **Efficient Learning Algorithms:**
44
+ - Model Wind Tunnel 2.0 -- Efficient Predictable Scaling: Introduces scaling prediction methods for performance of downstream tasks, enabling more precise model training configuration search
45
+ - BitCPM -- Ultimate Ternary Quantization: Compresses each model parameter to one of three values (ternary), achieving a 90% reduction in bit-width
46
+ - Efficient Training Engineering Optimization: Adopts FP8 low-precision computing technology combined with Multi-token Prediction training strategy
47
+
48
+ - 📚 **High-Quality Training Data:**
49
+ - UltraClean -- High-quality Pre-training Data Filtering and Generation: Builds iterative data cleaning strategies based on efficient data verification, and open-sources the high-quality Chinese and English pre-training dataset [Ultra-FineWeb](https://huggingface.co/datasets/openbmb/Ultra-FineWeb)
50
+ - UltraChat v2 -- High-quality Supervised Fine-tuning Data Generation: Constructs large-scale high-quality supervised fine-tuning datasets covering multiple dimensions including knowledge-intensive data, reasoning-intensive data, instruction-following data, long text understanding data, and tool calling data
51
+
52
+ - ⚡ **Efficient Inference System:**
53
+ - CPM.cu -- Lightweight and Efficient CUDA Inference Framework: Integrates sparse attention, model quantization, and speculative sampling to achieve efficient prefilling and decoding
54
+ - ArkInfer -- Cross-platform Deployment System: Supports efficient deployment across multiple backend environments, providing flexible cross-platform adaptation capabilities
55
+
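The 90% figure quoted for ternary quantization can be sanity-checked with a little arithmetic. The sketch below assumes a 16-bit baseline (the `bfloat16` dtype used elsewhere in this README) and counts the information-theoretic cost of a three-valued weight, log2(3) bits. Both are assumptions made for illustration: the README does not say how the figure was derived, and real storage formats add overhead for scales and packing.

```python
import math


def bit_width_reduction(levels: int, baseline_bits: int = 16) -> float:
    """Fractional reduction in bits per weight when each weight takes one of `levels` values."""
    bits_per_weight = math.log2(levels)
    return 1.0 - bits_per_weight / baseline_bits


ternary_bits = math.log2(3)          # bits needed to encode one of three values
reduction = bit_width_reduction(3)   # compared with an assumed 16-bit baseline

print(f"{ternary_bits:.3f} bits per weight, {reduction:.1%} reduction vs 16-bit")
```

Under these assumptions the result is roughly 1.585 bits per weight, or about a 90% reduction, which is consistent with the claim in the bullet above.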
56
+ ## Usage
57
+ ### Inference with Transformers
58
+ ```python
59
+ from transformers import AutoModelForCausalLM, AutoTokenizer
60
+ import torch
61
+ torch.manual_seed(0)
62
+
63
+ path = 'openbmb/MiniCPM4-0.5B'
64
+ device = "cuda"
65
+ tokenizer = AutoTokenizer.from_pretrained(path)
66
+ model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True)
67
+
68
+ # Use the chat interface directly
69
+ response, history = model.chat(tokenizer, "Write an article about Artificial Intelligence.", temperature=0.7, top_p=0.7)
70
+ print(response)
71
+
72
+ # User can also use the generate interface
73
+ # messages = [
74
+ # {"role": "user", "content": "Write an article about Artificial Intelligence."},
75
+ # ]
76
+ # prompt_text = tokenizer.apply_chat_template(
77
+ # messages,
78
+ # tokenize=False,
79
+ # add_generation_prompt=True,
80
+ # )
81
+ # model_inputs = tokenizer([prompt_text], return_tensors="pt").to(device)
82
+
83
+ # model_outputs = model.generate(
84
+ # **model_inputs,
85
+ # max_new_tokens=1024,
86
+ # top_p=0.7,
87
+ # temperature=0.7
88
+ # )
89
+ # output_token_ids = [
90
+ # model_outputs[i][len(model_inputs['input_ids'][i]):] for i in range(len(model_inputs['input_ids']))
91
+ # ]
92
+
93
+ # responses = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
94
+ # print(responses)
95
+ ```
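The commented-out `generate` path above slices each output sequence so that only newly generated tokens are decoded, because `generate` returns the prompt tokens followed by the continuation. A minimal sketch of that slicing step, using plain Python lists and made-up token IDs instead of real tensors (the helper name `strip_prompt_tokens` is hypothetical, not part of the model code):

```python
from typing import List


def strip_prompt_tokens(output_ids: List[List[int]], input_ids: List[List[int]]) -> List[List[int]]:
    """Drop the echoed prompt prefix from each generated sequence."""
    return [out[len(prompt):] for out, prompt in zip(output_ids, input_ids)]


# Illustrative IDs only: one prompt of four tokens and its generated continuation.
prompt_batch = [[1, 73441, 42, 43]]
generated_batch = [[1, 73441, 42, 43, 900, 901, 2]]

new_tokens = strip_prompt_tokens(generated_batch, prompt_batch)
print(new_tokens)
```

With real tensors the same idea applies row by row to `model_inputs['input_ids']` and the tensor returned by `model.generate`.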
96
+
97
+ ### Inference with [SGLang](https://github.com/sgl-project/sglang)
98
+
99
+ For now, you need to install our forked version of SGLang.
100
+ ```bash
101
+ git clone -b openbmb https://github.com/OpenBMB/sglang.git
102
+ cd sglang
103
+
104
+ pip install --upgrade pip
105
+ pip install -e "python[all]"
106
+ ```
107
+
108
+ You can start the inference server by running the following command:
109
+ ```bash
110
+ python -m sglang.launch_server --model openbmb/MiniCPM4-0.5B --trust-remote-code --port 30000 --chat-template chatml
111
+ ```
112
+
113
+ Then you can use the chat interface by running the following command:
114
+ ```python
115
+ import openai
116
+
117
+ client = openai.Client(base_url="http://localhost:30000/v1", api_key="None")
118
+
119
+ response = client.chat.completions.create(
120
+ model="openbmb/MiniCPM4-0.5B",
121
+ messages=[
122
+ {"role": "user", "content": "Write an article about Artificial Intelligence."},
123
+ ],
124
+ temperature=0.7,
125
+ max_tokens=1024,
126
+ )
127
+
128
+ print(response.choices[0].message.content)
129
+ ```
130
+
131
+ ### Inference with [vLLM](https://github.com/vllm-project/vllm)
132
+ For now, you need to install the latest version of vLLM.
133
+ ```bash
134
+ pip install -U vllm \
135
+ --pre \
136
+ --extra-index-url https://wheels.vllm.ai/nightly
137
+ ```
138
+
139
+ Then you can run inference on MiniCPM4-0.5B with vLLM:
140
+ ```python
141
+ from transformers import AutoTokenizer
142
+ from vllm import LLM, SamplingParams
143
+
144
+ model_name = "openbmb/MiniCPM4-0.5B"
145
+ prompt = [{"role": "user", "content": "Please recommend 5 tourist attractions in Beijing. "}]
146
+
147
+ tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
148
+ input_text = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
149
+
150
+ llm = LLM(
151
+ model=model_name,
152
+ trust_remote_code=True,
153
+ max_num_batched_tokens=32768,
154
+ dtype="bfloat16",
155
+ gpu_memory_utilization=0.8,
156
+ )
157
+ sampling_params = SamplingParams(top_p=0.7, temperature=0.7, max_tokens=1024, repetition_penalty=1.02)
158
+
159
+ outputs = llm.generate(prompts=input_text, sampling_params=sampling_params)
160
+
161
+ print(outputs[0].outputs[0].text)
162
+ ```
163
+
164
+ Also, you can start the inference server by running the following command:
165
+ > **Note**: In vLLM's chat API, `add_special_tokens` is `False` by default. This means important special tokens—such as the beginning-of-sequence (BOS) token—will not be added automatically. To ensure the input prompt is correctly formatted for the model, you should explicitly set `extra_body={"add_special_tokens": True}`.
166
+
167
+ ```bash
168
+ vllm serve openbmb/MiniCPM4-0.5B
169
+ ```
170
+
171
+ Then you can use the chat interface by running the following code:
172
+
173
+ ```python
174
+ import openai
175
+
176
+ client = openai.Client(base_url="http://localhost:8000/v1", api_key="EMPTY")
177
+
178
+ response = client.chat.completions.create(
179
+ model="openbmb/MiniCPM4-0.5B",
180
+ messages=[
181
+ {"role": "user", "content": "Write an article about Artificial Intelligence."},
182
+ ],
183
+ temperature=0.7,
184
+ max_tokens=1024,
185
+ extra_body=dict(add_special_tokens=True), # Ensures special tokens are added for chat template
186
+
187
+ )
188
+
189
+ print(response.choices[0].message.content)
190
+ ```
191
+
192
+
193
+ ## Evaluation Results
194
+ On two typical end-side chips, Jetson AGX Orin and RTX 4090, MiniCPM4 demonstrates significantly faster processing speed compared to similar-size models in long text processing tasks. As text length increases, MiniCPM4's efficiency advantage becomes more pronounced. On the Jetson AGX Orin platform, compared to Qwen3-8B, MiniCPM4 achieves approximately 7x decoding speed improvement.
195
+
196
+ ![benchmark](https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm4/efficiency.png?raw=true)
197
+
198
+ #### Comprehensive Evaluation
199
+ MiniCPM4 is released in 8B and 0.5B end-side versions, both achieving best-in-class performance in their respective categories.
200
+
201
+ ![benchmark](https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm4/benchmark.png?raw=true)
202
+
203
+ #### Long Text Evaluation
204
+ MiniCPM4 is pre-trained on 32K long texts and achieves length extension through YaRN technology. In the 128K long text needle-in-a-haystack task, MiniCPM4 demonstrates outstanding performance.
205
+
206
+ ![long-niah](https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm4/128k-niah.png?raw=true)
207
+
208
+ ## Statement
209
+ - As a language model, MiniCPM generates content by learning from a vast amount of text.
210
+ - However, it does not possess the ability to comprehend or express personal opinions or value judgments.
211
+ - Any content generated by MiniCPM does not represent the viewpoints or positions of the model developers.
212
+ - Therefore, when using content generated by MiniCPM, users should take full responsibility for evaluating and verifying it on their own.
213
+
214
+ ## LICENSE
215
+ - This repository and MiniCPM models are released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
216
+
217
+ ## Citation
218
+ - Please cite our [paper](https://github.com/OpenBMB/MiniCPM/tree/main/report/MiniCPM_4_Technical_Report.pdf) if you find our work valuable.
219
+
220
+ ```bibtex
221
+ @article{minicpm4,
222
+ title={{MiniCPM4}: Ultra-Efficient LLMs on End Devices},
223
+ author={MiniCPM Team},
224
+ year={2025}
225
+ }
226
+ ```
added_tokens.json ADDED
@@ -0,0 +1,10 @@
1
+ {
2
+ "<|execute_end|>": 73444,
3
+ "<|execute_start|>": 73443,
4
+ "<|fim_middle|>": 73446,
5
+ "<|fim_prefix|>": 73445,
6
+ "<|fim_suffix|>": 73447,
7
+ "<|im_end|>": 73440,
8
+ "<|im_start|>": 73441,
9
+ "<|tool_call|>": 73442
10
+ }
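The added token IDs can be cross-checked against the model configuration in the same commit. The sketch below copies the mapping from `added_tokens.json` and the `eos_token_id` and `vocab_size` values from `config.json`, then checks that the added IDs form one contiguous block at the end of the vocabulary and that `<|im_end|>` is one of the end-of-sequence IDs. It only illustrates how the two files relate and is not a check the repository itself performs.

```python
# Values copied from added_tokens.json and config.json in this commit.
added_tokens = {
    "<|execute_end|>": 73444,
    "<|execute_start|>": 73443,
    "<|fim_middle|>": 73446,
    "<|fim_prefix|>": 73445,
    "<|fim_suffix|>": 73447,
    "<|im_end|>": 73440,
    "<|im_start|>": 73441,
    "<|tool_call|>": 73442,
}
eos_token_ids = [2, 73440]
vocab_size = 73448

ids = sorted(added_tokens.values())
is_contiguous = ids == list(range(ids[0], ids[0] + len(ids)))
fills_vocab_tail = ids[-1] + 1 == vocab_size
im_end_is_eos = added_tokens["<|im_end|>"] in eos_token_ids

print(is_contiguous, fills_vocab_tail, im_end_is_eos)
```

Because `<|im_end|>` is listed as an end-of-sequence ID, generation stops at the end of an assistant turn as well as at the base EOS token (ID 2).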
config.json ADDED
@@ -0,0 +1,37 @@
1
+ {
2
+ "_name_or_path": "openbmb/MiniCPM4-0.5B",
3
+ "architectures": [
4
+ "MiniCPMForCausalLM"
5
+ ],
6
+ "auto_map": {
7
+ "AutoConfig": "configuration_minicpm.MiniCPMConfig",
8
+ "AutoModel": "modeling_minicpm.MiniCPMModel",
9
+ "AutoModelForCausalLM": "modeling_minicpm.MiniCPMForCausalLM",
10
+ "AutoModelForSeq2SeqLM": "modeling_minicpm.MiniCPMForCausalLM",
11
+ "AutoModelForSequenceClassification": "modeling_minicpm.MiniCPMForSequenceClassification"
12
+ },
13
+ "bos_token_id": 1,
14
+ "eos_token_id": [2, 73440],
15
+ "hidden_act": "silu",
16
+ "hidden_size": 1024,
17
+ "initializer_range": 0.1,
18
+ "intermediate_size": 4096,
19
+ "max_position_embeddings": 32768,
20
+ "num_attention_heads": 16,
21
+ "num_hidden_layers": 24,
22
+ "num_key_value_heads": 2,
23
+ "rms_norm_eps": 1e-05,
24
+ "rope_scaling": {
25
+ "rope_type": "longrope",
26
+ "long_factor": [1.0004360675811768, 1.0668443441390991, 1.1631425619125366, 1.3025742769241333, 1.5040205717086792, 1.7941505908966064, 2.2101221084594727, 2.802666664123535, 3.6389970779418945, 4.804192543029785, 6.39855432510376, 8.527148246765137, 11.277542114257812, 14.684998512268066, 18.69317054748535, 23.13019371032715, 27.72362518310547, 32.1606559753418, 36.168827056884766, 39.57627868652344, 42.32667541503906, 44.45526885986328, 46.04962921142578, 47.21482849121094, 48.05115509033203, 48.64370346069336, 49.05967712402344, 49.34980392456055, 49.551246643066406, 49.69068145751953, 49.78697967529297, 49.85338592529297],
27
+ "short_factor": [1.0004360675811768, 1.0668443441390991, 1.1631425619125366, 1.3025742769241333, 1.5040205717086792, 1.7941505908966064, 2.2101221084594727, 2.802666664123535, 3.6389970779418945, 4.804192543029785, 6.39855432510376, 8.527148246765137, 11.277542114257812, 14.684998512268066, 18.69317054748535, 23.13019371032715, 27.72362518310547, 32.1606559753418, 36.168827056884766, 39.57627868652344, 42.32667541503906, 44.45526885986328, 46.04962921142578, 47.21482849121094, 48.05115509033203, 48.64370346069336, 49.05967712402344, 49.34980392456055, 49.551246643066406, 49.69068145751953, 49.78697967529297, 49.85338592529297],
28
+ "original_max_position_embeddings": 32768
29
+ },
30
+ "torch_dtype": "bfloat16",
31
+ "transformers_version": "4.46.3",
32
+ "use_cache": true,
33
+ "vocab_size": 73448,
34
+ "scale_emb": 12,
35
+ "dim_model_base": 256,
36
+ "scale_depth": 1.4
37
+ }
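A few quantities follow directly from this configuration. The sketch below derives the per-head dimension, the grouped-query attention ratio, and a rough KV-cache size. It assumes the conventional `head_dim = hidden_size / num_attention_heads` (the config has no explicit `head_dim` field), a dense `bfloat16` cache, and batch size 1. These are estimates for illustration, not measurements.

```python
# Values copied from the config.json added in this commit.
config = {
    "hidden_size": 1024,
    "num_attention_heads": 16,
    "num_key_value_heads": 2,
    "num_hidden_layers": 24,
    "max_position_embeddings": 32768,
}

BYTES_PER_BF16 = 2  # assumption: the KV cache is stored in bfloat16

head_dim = config["hidden_size"] // config["num_attention_heads"]
queries_per_kv_head = config["num_attention_heads"] // config["num_key_value_heads"]

# Keys and values (factor 2) for every layer and every KV head, per token.
kv_bytes_per_token = (
    2 * config["num_hidden_layers"] * config["num_key_value_heads"] * head_dim * BYTES_PER_BF16
)
kv_mib_full_context = kv_bytes_per_token * config["max_position_embeddings"] / 2**20

print(head_dim, queries_per_kv_head, kv_bytes_per_token, kv_mib_full_context)
```

Under these assumptions the head dimension is 64, each KV head serves 8 query heads, and a full 32768-token context needs about 384 MiB of cache. A head dimension of 64 is also consistent with the 32 entries in `long_factor` and `short_factor`, one per rotary frequency pair.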
configuration_minicpm.py ADDED
@@ -0,0 +1,203 @@
1
+ # coding=utf-8
2
+ # Copyright 2025 The OpenBMB Team. All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+ """ MiniCPM model configuration"""
16
+
17
+ from transformers.configuration_utils import PretrainedConfig
18
+ from transformers.utils import logging
19
+
20
+ logger = logging.get_logger(__name__)
21
+
22
+ MINICPM_PRETRAINED_CONFIG_ARCHIVE_MAP = {}
23
+
24
+
25
+ class MiniCPMConfig(PretrainedConfig):
26
+ r"""
27
+ This is the configuration class to store the configuration of a [`MiniCPMModel`]. It is used to instantiate a MiniCPM
28
+ model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
29
+ defaults will yield a similar configuration to that of the MiniCPM-7B.
30
+
31
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
32
+ documentation from [`PretrainedConfig`] for more information.
33
+
34
+
35
+ Args:
36
+ vocab_size (`int`, *optional*, defaults to 32000):
37
+ Vocabulary size of the MiniCPM model. Defines the number of different tokens that can be represented by the
38
+ `inputs_ids` passed when calling [`MiniCPMModel`]
39
+ hidden_size (`int`, *optional*, defaults to 4096):
40
+ Dimension of the hidden representations.
41
+ intermediate_size (`int`, *optional*, defaults to 11008):
42
+ Dimension of the MLP representations.
43
+ num_hidden_layers (`int`, *optional*, defaults to 32):
44
+ Number of hidden layers in the Transformer decoder.
45
+ num_attention_heads (`int`, *optional*, defaults to 32):
46
+ Number of attention heads for each attention layer in the Transformer decoder.
47
+ num_key_value_heads (`int`, *optional*):
48
+ This is the number of key_value heads that should be used to implement Grouped Query Attention. If
49
+ `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
50
+ `num_key_value_heads=1`, the model will use Multi Query Attention (MQA); otherwise GQA is used. When
51
+ converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
52
+ by mean-pooling all the original heads within that group. For more details, check out [this
53
+ paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
54
+ `num_attention_heads`.
55
+ hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
56
+ The non-linear activation function (function or string) in the decoder.
57
+ max_position_embeddings (`int`, *optional*, defaults to 2048):
58
+ The maximum sequence length that this model might ever be used with. MiniCPM 1 supports up to 2048 tokens,
59
+ MiniCPM 2 up to 4096, CodeMiniCPM up to 16384.
60
+ initializer_range (`float`, *optional*, defaults to 0.02):
61
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
62
+ rms_norm_eps (`float`, *optional*, defaults to 1e-06):
63
+ The epsilon used by the rms normalization layers.
64
+ use_cache (`bool`, *optional*, defaults to `True`):
65
+ Whether or not the model should return the last key/values attentions (not used by all models). Only
66
+ relevant if `config.is_decoder=True`.
67
+ pad_token_id (`int`, *optional*):
68
+ Padding token id.
69
+ bos_token_id (`int`, *optional*, defaults to 1):
70
+ Beginning of stream token id.
71
+ eos_token_id (`int`, *optional*, defaults to 2):
72
+ End of stream token id.
73
+ pretraining_tp (`int`, *optional*, defaults to 1):
74
+ Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
75
+ document](https://huggingface.co/docs/transformers/parallelism) to understand more about it. This value is
76
+ necessary to ensure exact reproducibility of the pretraining results. Please refer to [this
77
+ issue](https://github.com/pytorch/pytorch/issues/76232).
78
+ tie_word_embeddings (`bool`, *optional*, defaults to `True`):
79
+ Whether to tie weight embeddings
80
+ rope_theta (`float`, *optional*, defaults to 10000.0):
81
+ The base period of the RoPE embeddings.
82
+ rope_scaling (`Dict`, *optional*):
83
+ Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
84
+ strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is
85
+ `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
86
+ `max_position_embeddings` to the expected new maximum. See the following thread for more information on how
87
+ these scaling strategies behave:
88
+ https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an
89
+ experimental feature, subject to breaking API changes in future versions.
90
+ attention_bias (`bool`, *optional*, defaults to `False`):
91
+ Whether to use a bias in the query, key, value and output projection layers during self-attention.
92
+ attention_dropout (`float`, *optional*, defaults to 0.0):
93
+ The dropout ratio for the attention probabilities.
94
+
95
+ ```python
96
+ >>> from transformers import MiniCPMModel, MiniCPMConfig
97
+
98
+ >>> # Initializing a MiniCPM minicpm-7b style configuration
99
+ >>> configuration = MiniCPMConfig()
100
+
101
+ >>> # Initializing a model from the minicpm-7b style configuration
102
+ >>> model = MiniCPMModel(configuration)
103
+
104
+ >>> # Accessing the model configuration
105
+ >>> configuration = model.config
106
+ ```"""
107
+
108
+ model_type = 'minicpm'
109
+ keys_to_ignore_at_inference = ['past_key_values']
110
+
111
+ def __init__(
112
+ self,
113
+ vocab_size=32000,
114
+ hidden_size=4096,
115
+ intermediate_size=11008,
116
+ num_hidden_layers=32,
117
+ num_attention_heads=32,
118
+ num_key_value_heads=None,
119
+ hidden_act='silu',
120
+ max_position_embeddings=2048,
121
+ initializer_range=0.02,
122
+ rms_norm_eps=1e-6,
123
+ use_cache=True,
124
+ pad_token_id=None,
125
+ bos_token_id=1,
126
+ eos_token_id=2,
127
+ pretraining_tp=1,
128
+ tie_word_embeddings=True,
129
+ rope_theta=10000.0,
130
+ rope_scaling=None,
131
+ attention_bias=False,
132
+ attention_dropout=0.0,
133
+ scale_emb=1,
134
+ dim_model_base=1,
135
+ scale_depth=1,
136
+ mup_denominator=None,
137
+ sparse_config=None,
138
+ **kwargs):
139
+
140
+ self.vocab_size = vocab_size
141
+ self.max_position_embeddings = max_position_embeddings
142
+ self.hidden_size = hidden_size
143
+ self.intermediate_size = intermediate_size
144
+ self.num_hidden_layers = num_hidden_layers
145
+ self.num_attention_heads = num_attention_heads
146
+
147
+ # for backward compatibility
148
+ if num_key_value_heads is None:
149
+ num_key_value_heads = num_attention_heads
150
+
151
+ self.num_key_value_heads = num_key_value_heads
152
+ self.hidden_act = hidden_act
153
+ self.initializer_range = initializer_range
154
+ self.rms_norm_eps = rms_norm_eps
155
+ self.pretraining_tp = pretraining_tp
156
+ self.use_cache = use_cache
157
+ self.rope_theta = rope_theta
158
+ self.rope_scaling = rope_scaling
159
+ # self._rope_scaling_validation()
160
+ self.attention_bias = attention_bias
161
+ self.attention_dropout = attention_dropout
162
+ self.scale_emb = scale_emb
163
+ self.dim_model_base = dim_model_base
164
+ self.scale_depth = scale_depth
165
+ # only used for Eagle Head
166
+ self.mup_denominator = mup_denominator
167
+
168
+ # sparse config
169
+ self.sparse_config = sparse_config
170
+
171
+ super().__init__(
172
+ pad_token_id=pad_token_id,
173
+ bos_token_id=bos_token_id,
174
+ eos_token_id=eos_token_id,
175
+ tie_word_embeddings=tie_word_embeddings,
176
+ **kwargs,
177
+ )
178
+ try:
179
+ import flash_attn
180
+ self._attn_implementation = 'flash_attention_2'
181
+ except Exception:
182
+ pass
183
+
184
+ def _rope_scaling_validation(self):
185
+ """
186
+ Validate the `rope_scaling` configuration.
187
+ """
188
+ if self.rope_scaling is None:
189
+ return
190
+
191
+ if not isinstance(self.rope_scaling, dict) or len(self.rope_scaling) != 2:
192
+ raise ValueError(
193
+ '`rope_scaling` must be a dictionary with two fields, `type` and `factor`, '
194
+ f'got {self.rope_scaling}'
195
+ )
196
+ rope_scaling_type = self.rope_scaling.get('type', None)
197
+ rope_scaling_factor = self.rope_scaling.get('factor', None)
198
+ if rope_scaling_type is None or rope_scaling_type not in ['linear', 'dynamic']:
199
+ raise ValueError(
200
+ f"`rope_scaling`'s type field must be one of ['linear', 'dynamic'], got {rope_scaling_type}"
201
+ )
202
+ if rope_scaling_factor is None or not isinstance(rope_scaling_factor, float) or rope_scaling_factor <= 1.0:
203
+ raise ValueError(f"`rope_scaling`'s factor field must be a float > 1, got {rope_scaling_factor}")
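The validator above accepts only two-field dictionaries whose `type` is `linear` or `dynamic`, while the `config.json` in this commit uses `rope_type: longrope` with `long_factor` and `short_factor` lists. That mismatch is presumably why the call in `__init__` is commented out. As a sketch of what a longrope-aware check could look like, the hypothetical helper below (the name, the `head_dim` argument, and the rule of one factor per rotary frequency pair are all assumptions, not part of this commit) validates the fields that the shipped config actually contains.

```python
from typing import Any, Dict


def validate_longrope_scaling(rope_scaling: Dict[str, Any], head_dim: int) -> None:
    """Hypothetical check for a longrope-style `rope_scaling` dict (not part of this commit)."""
    rope_type = rope_scaling.get("rope_type")
    if rope_type != "longrope":
        raise ValueError(f"expected rope_type 'longrope', got {rope_type!r}")

    expected_len = head_dim // 2  # assumption: one factor per rotary frequency pair
    for key in ("long_factor", "short_factor"):
        factors = rope_scaling.get(key)
        if not isinstance(factors, list) or len(factors) != expected_len:
            raise ValueError(f"`{key}` must be a list of {expected_len} numbers")
        if any(not isinstance(f, (int, float)) or f <= 0 for f in factors):
            raise ValueError(f"`{key}` must contain only positive numbers")


# Toy example with head_dim=4, so two factors per list.
example = {
    "rope_type": "longrope",
    "long_factor": [1.0, 2.0],
    "short_factor": [1.0, 2.0],
    "original_max_position_embeddings": 32768,
}
validate_longrope_scaling(example, head_dim=4)
```

For the shipped config, the corresponding call would use `head_dim=64`, assuming `hidden_size / num_attention_heads`, which matches the 32 factors in each list.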
generation_config.json ADDED
@@ -0,0 +1,12 @@
1
+ {
2
+ "bos_token_id": 1,
3
+ "do_sample": true,
4
+ "eos_token_id": [
5
+ 2,
6
+ 73440
7
+ ],
8
+ "pad_token_id": 2,
9
+ "temperature": 0.8,
10
+ "top_p": 0.8,
11
+ "transformers_version": "4.46.1"
12
+ }
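The defaults above enable sampling with `temperature=0.8` and `top_p=0.8`. The sketch below shows, in plain Python, what these two settings do to a distribution over next tokens: logits are divided by the temperature, and only the smallest set of most likely tokens whose cumulative probability reaches `top_p` is kept. The function name and the toy logits are made up for illustration, and this is not the exact implementation used by `transformers`.

```python
import math
from typing import List


def nucleus_probs(logits: List[float], temperature: float = 0.8, top_p: float = 0.8) -> List[float]:
    """Return renormalised probabilities after temperature scaling and top-p filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = []
    cumulative = 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    kept_mass = sum(probs[i] for i in kept)
    return [probs[i] / kept_mass if i in kept else 0.0 for i in range(len(probs))]


filtered = nucleus_probs([2.0, 1.0, 0.0, -1.0])
print([round(p, 3) for p in filtered])
```

For these toy logits only the two most likely tokens survive the filter; a sampler would then draw the next token from the renormalised probabilities.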
modeling_minicpm.py ADDED
@@ -0,0 +1,1615 @@
1
+ # coding=utf-8
2
+ # Copyright 2025 The OpenBMB Team. All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+ """ PyTorch MiniCPM model."""
16
+ import math
17
+ import re
18
+ import warnings
19
+ from typing import Any, Dict, List, Optional, Tuple, Union
20
+
21
+ import torch
22
+ import torch.nn.functional as F
23
+ import torch.utils.checkpoint
24
+ from torch import nn
25
+ from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
26
+ from transformers.activations import ACT2FN
27
+ from transformers.cache_utils import Cache, DynamicCache, CacheLayerMixin, DynamicLayer
28
+ from transformers.modeling_attn_mask_utils import (
29
+ AttentionMaskConverter,
30
+ _prepare_4d_attention_mask,
31
+ _prepare_4d_causal_attention_mask,
32
+ _prepare_4d_causal_attention_mask_for_sdpa,
33
+ )
34
+ from transformers.modeling_outputs import (
35
+ BaseModelOutputWithPast,
36
+ CausalLMOutputWithPast,
37
+ SequenceClassifierOutputWithPast,
38
+ )
39
+ from transformers.modeling_utils import PreTrainedModel
40
+ from transformers.pytorch_utils import ALL_LAYERNORM_LAYERS, is_torch_greater_or_equal_than_1_13
41
+ from transformers.utils import (
42
+ add_start_docstrings,
43
+ add_start_docstrings_to_model_forward,
44
+ is_flash_attn_greater_or_equal_2_10,
45
+ logging,
46
+ replace_return_docstrings,
47
+ )
48
+ from transformers.utils.import_utils import is_torch_fx_available
49
+
50
+ from .configuration_minicpm import MiniCPMConfig
51
+
52
+ try:
53
+ from flash_attn import flash_attn_func, flash_attn_varlen_func
54
+ from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input # noqa
55
+ except Exception:
56
+ pass
57
+
58
+
59
+
60
+ # This makes `_prepare_4d_causal_attention_mask` a leaf function in the FX graph.
61
+ # It means that the function will not be traced through and simply appear as a node in the graph.
62
+ if is_torch_fx_available():
63
+ if not is_torch_greater_or_equal_than_1_13:
64
+ import torch.fx
65
+
66
+ _prepare_4d_causal_attention_mask = torch.fx.wrap(_prepare_4d_causal_attention_mask)
67
+
68
+
69
+ logger = logging.get_logger(__name__)
70
+
71
+ _CONFIG_FOR_DOC = 'MiniCPMConfig'
72
+
73
+
74
+
75
+ def get_quantizer(quant_type="none", bit=4, group_size=128):
76
+ if quant_type == "intsym":
77
+ return SteIntSymQuantizerGPTQ(bit, group_size)
78
+ elif quant_type == "ternary":
79
+ return SteTernaryQuantizer(group_size)
80
+ elif quant_type == "none":
81
+ return NoQuantizer()
82
+ else:
83
+ raise ValueError(f"Unsupported quantization type: {quant_type}")
84
+
85
+ class SteIntSymQuantizerGPTQ(nn.Module):
86
+ def __init__(self, bit=4, group_size=-1):
87
+ super().__init__()
88
+ self.bit = bit
89
+ self.group_size = group_size
90
+
91
+ def forward(self, x):
92
+ org_w_shape = x.shape
93
+
94
+ if self.group_size > 0:
95
+ assert org_w_shape[-1] % self.group_size == 0
96
+ x = x.reshape(-1, self.group_size)
97
+ elif self.group_size == -1:
98
+ # group_size == -1: one quantization group per row, so no divisibility check is needed
99
+ x = x.reshape(-1, x.shape[-1])
100
+ elif self.group_size == 0:
101
+ x = x.reshape(1, -1)
102
+
103
+ assert x.dim() == 2
104
+
105
+ xmax = x.max(dim=1, keepdim=True)[0]
106
+ xmin = x.min(dim=1, keepdim=True)[0]
107
+ abs_max_val = torch.maximum(torch.abs(xmin), xmax) # consistent with how Quantizer computes xmax
108
+ scales = abs_max_val * 2 / (2 ** self.bit - 1) # numerator and denominator are both aligned
109
+
110
+ max_int = 2 ** (self.bit - 1) - 1
111
+ min_int = - (2 ** (self.bit - 1))
112
+
113
+ assert torch.isnan(scales).sum() == 0
114
+
115
+ x_q = (torch.clamp(torch.round(x / scales), min_int, max_int)) * scales
116
+
117
+ assert torch.isnan(x_q).sum() == 0
118
+
119
+ x = x.reshape(org_w_shape)
120
+ x_q = x_q.reshape(org_w_shape)
121
+
122
+ return x + (x_q - x).detach()
123
+
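The symmetric integer quantizer above can be followed on a single group with plain Python. This is only a sketch of the arithmetic in `SteIntSymQuantizerGPTQ.forward`, assuming a non-zero group; the function name `fake_quantize_intsym` is hypothetical, and the straight-through estimator (`x + (x_q - x).detach()`) is not modelled because it only affects gradients.

```python
def fake_quantize_intsym(values, bit=4):
    """Quantize one group to signed integers, then map back to floats."""
    # Largest magnitude in the group, as torch.maximum(abs(xmin), xmax) does.
    abs_max = max(abs(min(values)), max(values))
    # Step size: the full range 2 * abs_max is spread over 2**bit - 1 steps.
    scale = abs_max * 2 / (2 ** bit - 1)
    max_int = 2 ** (bit - 1) - 1
    min_int = -(2 ** (bit - 1))
    dequantized = []
    for v in values:
        # Python's round() uses round-half-to-even, like torch.round.
        q = max(min_int, min(max_int, round(v / scale)))
        dequantized.append(q * scale)
    return dequantized, scale


group = [-1.0, -0.4, 0.0, 0.3, 0.9]
dequantized, scale = fake_quantize_intsym(group, bit=4)
for original, approx in zip(group, dequantized):
    print(f"{original:+.3f} -> {approx:+.3f}")
```

Every output is an integer multiple of `scale`, so a 4-bit group can take at most 16 distinct values.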
124
+ class SteTernaryQuantizer(nn.Module):
125
+ def __init__(self, group_size):
126
+ super().__init__()
127
+ self.group_size = group_size
128
+
129
+ def forward(self, x):
130
+ org_w_shape = x.shape
131
+ if self.group_size > 0:
132
+ assert x.shape[-1] % self.group_size == 0
133
+ x = x.reshape(-1, self.group_size)
134
+ elif self.group_size == -1:
135
+ x = x.reshape(-1, x.shape[-1])
136
+
137
+ assert x.dim() == 2
138
+
139
+ scales = 1.0 / (x.abs().mean(dim=1, keepdim=True).clamp_(min=1e-5))
140
+ x_q = (torch.clamp(torch.round(x * scales),-1,1) / scales)
141
+
142
+ assert torch.isnan(x_q).sum() == 0
143
+
144
+ x = x.reshape(org_w_shape)
145
+ x_q = x_q.reshape(org_w_shape)
146
+
147
+ return x + (x_q - x).detach()
148
+
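For intuition, the ternary path can be reproduced on one group in plain Python. This is a sketch of the arithmetic only, under the assumption that the group is a flat list of floats; `fake_quantize_ternary` is a hypothetical name, not part of this file.

```python
def fake_quantize_ternary(values, eps=1e-5):
    """Map each value to {-1, 0, +1} times the group's mean absolute value."""
    # scales = 1 / mean(|x|), with the mean clamped away from zero.
    mean_abs = max(sum(abs(v) for v in values) / len(values), eps)
    scale = 1.0 / mean_abs
    # Round to the nearest integer, clamp to [-1, 1], then undo the scaling.
    return [max(-1, min(1, round(v * scale))) / scale for v in values]


group = [0.1, -0.2, 0.9, -1.2]
quantized = fake_quantize_ternary(group)
print(quantized)
```

Small entries collapse to zero and large ones saturate at plus or minus the mean absolute value, which is why each weight needs only one of three states plus a shared per-group scale.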
149
+ class NoQuantizer(nn.Module):
150
+ def __init__(self):
151
+ super().__init__()
152
+
153
+ def forward(self, x):
154
+ return x
155
+
156
+ class LinearQuantizer(nn.Linear):
157
+ def __init__(self, in_features, out_features, bias=False, quant_type="ternary", bit=4, group_size=-1):
158
+ super().__init__(in_features, out_features, bias)
159
+ self.quantizer = get_quantizer(quant_type, bit, group_size)
160
+
161
+ def forward(self, x):
162
+ weight_tensor = self.quantizer(self.weight)
163
+ x = torch.nn.functional.linear(x, weight_tensor)
164
+ if self.bias is not None:
165
+ x = x + self.bias
166
+ return x
167
+
168
+ def _get_unpad_data(attention_mask):
169
+ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
170
+ indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
171
+ max_seqlen_in_batch = seqlens_in_batch.max().item()
172
+ cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
173
+ return (
174
+ indices,
175
+ cu_seqlens,
176
+ max_seqlen_in_batch,
177
+ )
178
+
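The three values returned by `_get_unpad_data` are easiest to read on a tiny mask. The sketch below mirrors the computation with Python lists instead of tensors; `unpad_metadata` is a hypothetical helper used only for illustration.

```python
def unpad_metadata(attention_mask):
    """attention_mask: list of rows holding 1 for real tokens and 0 for padding."""
    seqlens = [sum(row) for row in attention_mask]
    # Positions of real tokens in the flattened (batch * seq_len) layout.
    flat = [m for row in attention_mask for m in row]
    indices = [i for i, m in enumerate(flat) if m]
    # Cumulative sequence lengths with a leading zero, as F.pad(cumsum, (1, 0)) gives.
    cu_seqlens = [0]
    for n in seqlens:
        cu_seqlens.append(cu_seqlens[-1] + n)
    return indices, cu_seqlens, max(seqlens)


mask = [
    [0, 1, 1],  # left-padded sequence with 2 real tokens
    [1, 1, 1],  # full sequence with 3 real tokens
]
indices, cu_seqlens, max_seqlen = unpad_metadata(mask)
print(indices, cu_seqlens, max_seqlen)
```

Sequence `b` occupies rows `cu_seqlens[b]:cu_seqlens[b + 1]` of the packed tensor, which is the layout the variable-length flash attention kernel expects.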
179
+
180
+
181
+
182
+ # @torch.jit.script # type: ignore
183
+ def rms_layernorm(hidden: torch.Tensor, weight: torch.Tensor, eps: float):
184
+ old_dtype = hidden.dtype
185
+ variance = hidden.to(torch.float32).pow(2).mean(dim=-1, keepdim=True)
186
+ hidden = (hidden * torch.rsqrt(variance + eps)).to(old_dtype)
187
+ return hidden * weight
188
+
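As a quick check of what `rms_layernorm` computes, here is the same formula on a plain list. It is a sketch that ignores the float32 upcast and dtype restore done above; `rms_norm` is a hypothetical name.

```python
import math


def rms_norm(hidden, weight, eps=1e-6):
    """Scale a vector so its root mean square is about 1, then apply a learned gain."""
    mean_square = sum(h * h for h in hidden) / len(hidden)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [h * inv_rms * w for h, w in zip(hidden, weight)]


normalized = rms_norm([3.0, 4.0], [1.0, 1.0], eps=0.0)
mean_square_after = sum(v * v for v in normalized) / len(normalized)
print(normalized, mean_square_after)
```

Unlike standard layer normalization, no mean is subtracted and there is no bias term; only the scale is normalized.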
189
+
190
+ class MiniCPMRMSNorm(nn.Module):
191
+ def __init__(self, hidden_size, eps=1e-6):
192
+ """
193
+ MiniCPMRMSNorm is equivalent to T5LayerNorm
194
+ """
195
+ super().__init__()
196
+ self.weight = nn.Parameter(torch.ones(hidden_size))
197
+ self.variance_epsilon = eps
198
+
199
+ def forward(self, hidden_states):
200
+ return rms_layernorm(hidden_states, self.weight, self.variance_epsilon)
201
+
202
+
203
+ ALL_LAYERNORM_LAYERS.append(MiniCPMRMSNorm)
204
+
205
+
206
+ class MiniCPMRotaryEmbedding(nn.Module):
207
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
208
+ super().__init__()
209
+
210
+ self.dim = dim
211
+ self.max_position_embeddings = max_position_embeddings
212
+ self.base = base
213
+ inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
214
+ self.register_buffer('inv_freq', inv_freq, persistent=False)
215
+
216
+ # Build here to make `torch.jit.trace` work.
217
+ self._set_cos_sin_cache(
218
+ # seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
219
+ seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.float32
220
+ )
221
+
222
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
223
+ self.max_seq_len_cached = seq_len
224
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
225
+ freqs = torch.outer(t, self.inv_freq)
226
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
227
+ emb = torch.cat((freqs, freqs), dim=-1)
228
+
229
+ self.register_buffer('cos_cached', emb.cos().to(dtype), persistent=False)
230
+ self.register_buffer('sin_cached', emb.sin().to(dtype), persistent=False)
231
+
232
+ def forward(self, x, seq_len=None):
233
+ # x: [bs, num_attention_heads, seq_len, head_size]
234
+ if seq_len > self.max_seq_len_cached:
235
+ self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
236
+
237
+ return (
238
+ self.cos_cached[:seq_len].to(dtype=x.dtype),
239
+ self.sin_cached[:seq_len].to(dtype=x.dtype),
240
+ )
241
+
242
+
243
+ class MiniCPMLongRoPE(MiniCPMRotaryEmbedding):
244
+ """MiniCPMRotaryEmbedding extended with LongRoPE scaling: per-dimension short/long rescale factors plus an attention scaling factor."""
245
+
246
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, short_factor=None, long_factor=None, original_max_position_embeddings=None):
247
+ self.short_factor = short_factor
248
+ self.long_factor = long_factor
249
+ self.original_max_position_embeddings = original_max_position_embeddings
250
+ scale = (max_position_embeddings / self.original_max_position_embeddings)
251
+ self.scaling_factor = math.sqrt(1 + math.log(scale) / math.log(self.original_max_position_embeddings))
252
+ super().__init__(dim, max_position_embeddings, base, device)
253
+
254
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
255
+ self.max_seq_len_cached = seq_len
256
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
257
+ if seq_len > self.original_max_position_embeddings:
258
+ ext_factors = torch.tensor(self.long_factor, dtype=torch.float32, device=device)
259
+ else:
260
+ ext_factors = torch.tensor(self.short_factor, dtype=torch.float32, device=device)
261
+
262
+ freqs = torch.mul(
263
+ torch.outer(t, 1.0 / ext_factors).to(device=device),
264
+ self.inv_freq.to(device=device).to(dtype)
265
+ )
266
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
267
+ emb = torch.cat((freqs, freqs), dim=-1)
268
+ self.register_buffer('cos_cached', emb.cos().to(dtype) * self.scaling_factor, persistent=False)
269
+ self.register_buffer('sin_cached', emb.sin().to(dtype) * self.scaling_factor, persistent=False)
270
+
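The two ingredients of `MiniCPMLongRoPE` are the scalar applied to the cos/sin tables and the per-dimension division of the rotation angles. The sketch below restates both with the `math` module; the function names and the example sizes (4096 and 32768) are assumptions chosen for illustration, not values taken from a config.

```python
import math


def longrope_attention_scale(max_position_embeddings, original_max_position_embeddings):
    """Scalar multiplied into the cached cos and sin tables."""
    scale = max_position_embeddings / original_max_position_embeddings
    return math.sqrt(1 + math.log(scale) / math.log(original_max_position_embeddings))


def longrope_angles(position, dim, base, ext_factors):
    """Rotation angles for one position; each frequency is divided by its own factor."""
    inv_freq = [1.0 / (base ** (i / dim)) for i in range(0, dim, 2)]
    return [(position / f) * w for f, w in zip(ext_factors, inv_freq)]


print(longrope_attention_scale(4096, 4096))
print(longrope_attention_scale(32768, 4096))
print(longrope_angles(10, dim=4, base=10000.0, ext_factors=[1.0, 2.0]))
```

When the extended window equals the original one the scalar is exactly 1, so the tables reduce to ordinary RoPE with rescaled frequencies.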
271
+
272
+ class MiniCPMLinearScalingRotaryEmbedding(MiniCPMRotaryEmbedding):
273
+ """MiniCPMRotaryEmbedding extended with linear scaling. Credits to the Reddit user /u/kaiokendev"""
274
+
275
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
276
+ self.scaling_factor = scaling_factor
277
+ super().__init__(dim, max_position_embeddings, base, device)
278
+
279
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
280
+ self.max_seq_len_cached = seq_len
281
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
282
+ t = t / self.scaling_factor
283
+
284
+ freqs = torch.outer(t, self.inv_freq)
285
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
286
+ emb = torch.cat((freqs, freqs), dim=-1)
287
+ self.register_buffer('cos_cached', emb.cos().to(dtype), persistent=False)
288
+ self.register_buffer('sin_cached', emb.sin().to(dtype), persistent=False)
289
+
290
+
291
+ class MiniCPMDynamicNTKScalingRotaryEmbedding(MiniCPMRotaryEmbedding):
292
+ """MiniCPMRotaryEmbedding extended with Dynamic NTK scaling. Credits to the Reddit users /u/bloc97 and /u/emozilla"""
293
+
294
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
295
+ self.scaling_factor = scaling_factor
296
+ super().__init__(dim, max_position_embeddings, base, device)
297
+
298
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
299
+ self.max_seq_len_cached = seq_len
300
+
301
+ if seq_len > self.max_position_embeddings:
302
+ base = self.base * (
303
+ (self.scaling_factor * seq_len / self.max_position_embeddings) - (self.scaling_factor - 1)
304
+ ) ** (self.dim / (self.dim - 2))
305
+ inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
306
+ self.register_buffer('inv_freq', inv_freq, persistent=False)
307
+
308
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
309
+
310
+ freqs = torch.outer(t, self.inv_freq)
311
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
312
+ emb = torch.cat((freqs, freqs), dim=-1)
313
+
314
+ self.register_buffer('cos_cached', emb.cos().to(dtype), persistent=False)
315
+ self.register_buffer('sin_cached', emb.sin().to(dtype), persistent=False)
316
+
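The base adjustment in `MiniCPMDynamicNTKScalingRotaryEmbedding` can be isolated as a single formula. The sketch below repeats it outside the module; `dynamic_ntk_base` is a hypothetical name and the numbers are illustrative only.

```python
def dynamic_ntk_base(base, dim, seq_len, max_position_embeddings, scaling_factor):
    """Return the RoPE base to use for a given sequence length."""
    if seq_len <= max_position_embeddings:
        # Within the trained window the base is left unchanged.
        return base
    growth = (scaling_factor * seq_len / max_position_embeddings) - (scaling_factor - 1)
    return base * growth ** (dim / (dim - 2))


for seq_len in (2048, 4096, 8192):
    print(seq_len, dynamic_ntk_base(10000.0, 64, seq_len, 2048, 2.0))
```

The growth term equals 1 exactly at the trained length, so the base grows continuously once the sequence exceeds `max_position_embeddings` and all rotation frequencies are lowered accordingly.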
317
+
318
+ def rotate_half(x):
319
+ """Rotates half the hidden dims of the input."""
320
+ x1 = x[..., : x.shape[-1] // 2]
321
+ x2 = x[..., x.shape[-1] // 2:]
322
+ return torch.cat((-x2, x1), dim=-1)
323
+
324
+
325
+ def apply_rotary_pos_emb(q, k, cos, sin, position_ids, unsqueeze_dim=1):
326
+ """Applies Rotary Position Embedding to the query and key tensors.
327
+
328
+ Args:
329
+ q (`torch.Tensor`): The query tensor.
330
+ k (`torch.Tensor`): The key tensor.
331
+ cos (`torch.Tensor`): The cosine part of the rotary embedding.
332
+ sin (`torch.Tensor`): The sine part of the rotary embedding.
333
+ position_ids (`torch.Tensor`):
334
+ The position indices of the tokens corresponding to the query and key tensors. For example, this can be
335
+ used to pass offsetted position ids when working with a KV-cache.
336
+ unsqueeze_dim (`int`, *optional*, defaults to 1):
337
+ The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
338
+ sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
339
+ that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
340
+ k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
341
+ cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
342
+ the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
343
+ Returns:
344
+ `tuple(torch.Tensor)` comprising the query and key tensors rotated using the Rotary Position Embedding.
345
+ """
346
+ # cos = cos[position_ids].unsqueeze(unsqueeze_dim)
347
+ # sin = sin[position_ids].unsqueeze(unsqueeze_dim)
348
+ # q_embed = (q * cos) + (rotate_half(q) * sin)
349
+ # k_embed = (k * cos) + (rotate_half(k) * sin)
350
+ orig_dtype = k.dtype
351
+ cos = cos[position_ids].unsqueeze(unsqueeze_dim) # [bs, 1, seq_len, dim]
352
+ sin = sin[position_ids].unsqueeze(unsqueeze_dim) # [bs, 1, seq_len, dim]
353
+ q_fp32 = q.to(dtype=torch.float32, device=q.device)
354
+ k_fp32 = k.to(dtype=torch.float32, device=k.device)
355
+ q_embed = (q_fp32 * cos) + (rotate_half(q_fp32) * sin)
356
+ k_embed = (k_fp32 * cos) + (rotate_half(k_fp32) * sin)
357
+ return q_embed.to(dtype=orig_dtype), k_embed.to(dtype=orig_dtype)
358
+
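The property that makes this rotation useful is that the query-key dot product depends only on the distance between the two positions. The pure-Python sketch below uses the same half-split layout as `rotate_half` and the duplicated `cat((freqs, freqs))` angles; the helper names `rope` and `dot` are hypothetical.

```python
import math


def rotate_half(x):
    """[x1, x2] -> [-x2, x1], where x1 and x2 are the two halves of the vector."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def rope(x, position, base=10000.0):
    dim = len(x)
    inv_freq = [1.0 / (base ** (i / dim)) for i in range(0, dim, 2)]
    angles = [position * w for w in inv_freq]
    angles = angles + angles  # same angle for element i and element i + dim/2
    rotated = rotate_half(x)
    return [v * math.cos(a) + r * math.sin(a) for v, r, a in zip(x, rotated, angles)]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]
near = dot(rope(q, 5), rope(k, 3))      # positions 5 and 3, distance 2
shifted = dot(rope(q, 12), rope(k, 10))  # positions 12 and 10, distance 2
print(near, shifted)
```

Both scores agree up to floating-point error, which is why cached keys can be stored already rotated and reused as new query positions arrive.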
359
+
360
+ class MiniCPMMLP(nn.Module):
361
+ def __init__(self, config):
362
+ super().__init__()
363
+ self.config = config
364
+ self.hidden_size = config.hidden_size
365
+ self.intermediate_size = config.intermediate_size
366
+ # self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
367
+ # self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
368
+ # self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
369
+ self.gate_proj = LinearQuantizer(self.hidden_size, self.intermediate_size, bias=False, quant_type="ternary", bit=4, group_size=-1)
370
+ self.up_proj = LinearQuantizer(self.hidden_size, self.intermediate_size, bias=False, quant_type="ternary", bit=4, group_size=-1)
371
+ self.down_proj = LinearQuantizer(self.intermediate_size, self.hidden_size, bias=False, quant_type="ternary", bit=4, group_size=-1)
372
+ self.act_fn = ACT2FN[config.hidden_act]
373
+
374
+ def forward(self, x):
375
+ if self.config.pretraining_tp > 1:
376
+ slice = self.intermediate_size // self.config.pretraining_tp
377
+ gate_proj_slices = self.gate_proj.weight.split(slice, dim=0)
378
+ up_proj_slices = self.up_proj.weight.split(slice, dim=0)
379
+ down_proj_slices = self.down_proj.weight.split(slice, dim=1)
380
+
381
+ gate_proj = torch.cat(
382
+ [F.linear(x, gate_proj_slices[i]) for i in range(self.config.pretraining_tp)], dim=-1
383
+ )
384
+ up_proj = torch.cat([F.linear(x, up_proj_slices[i]) for i in range(self.config.pretraining_tp)], dim=-1)
385
+
386
+ intermediate_states = (self.act_fn(gate_proj) * up_proj).split(slice, dim=2)
387
+ down_proj = [
388
+ F.linear(intermediate_states[i], down_proj_slices[i]) for i in range(self.config.pretraining_tp)
389
+ ]
390
+ down_proj = sum(down_proj)
391
+ else:
392
+ down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
393
+
394
+ return down_proj
395
+
396
+ def _unpad_one_tensor(hidden_states, attention_mask):
397
+ # Unpad the hidden states using the indices
398
+ indices, cu_seqlens, max_seqlen_in_batch = _get_unpad_data(attention_mask)
399
+ batch_size, seq_len = hidden_states.shape[:2]
400
+
401
+ # Get the remaining dimensions
402
+ remaining_dims = hidden_states.shape[2:]
403
+
404
+ # Reshape to (batch_size * seq_len, *remaining_dims)
405
+ reshaped_states = hidden_states.reshape(batch_size * seq_len, *remaining_dims)
406
+
407
+ # Apply unpadding using indices
408
+ unpadded_states = index_first_axis(reshaped_states, indices)
409
+
410
+ return unpadded_states, indices, cu_seqlens, max_seqlen_in_batch
411
+
412
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
413
+ """
414
+ This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
415
+ num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
416
+ """
417
+ batch, num_key_value_heads, slen, head_dim = hidden_states.shape
418
+ if n_rep == 1:
419
+ return hidden_states
420
+ hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
421
+ return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
422
+
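The head ordering produced by `repeat_kv` matters for grouped-query attention: each key/value head is repeated consecutively rather than the whole set being tiled. A list-based sketch, with the hypothetical name `repeat_kv_heads`:

```python
def repeat_kv_heads(kv_heads, n_rep):
    """Repeat every key/value head n_rep times, keeping copies adjacent."""
    return [head for head in kv_heads for _ in range(n_rep)]


expanded = repeat_kv_heads(["kv0", "kv1"], n_rep=3)
print(expanded)
```

Query head `j` therefore attends with key/value head `j // n_rep`.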
423
+
424
+ class MiniCPMAttention(nn.Module):
425
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
426
+
427
+ def __init__(self, config: MiniCPMConfig, layer_idx: Optional[int] = None):
428
+ super().__init__()
429
+ self.config = config
430
+ self.layer_idx = layer_idx
431
+ if layer_idx is None:
432
+ logger.warning_once(
433
+ f'Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will '
434
+ 'lead to errors during the forward call if caching is used. Please make sure to provide a `layer_idx` '
435
+ 'when creating this class.'
436
+ )
437
+
438
+ self.attention_dropout = config.attention_dropout
439
+ self.hidden_size = config.hidden_size
440
+ self.num_heads = config.num_attention_heads
441
+ self.head_dim = self.hidden_size // self.num_heads
442
+ self.num_key_value_heads = config.num_key_value_heads
443
+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads
444
+ self.max_position_embeddings = config.max_position_embeddings
445
+ self.rope_theta = config.rope_theta
446
+ self.is_causal = True
447
+
448
+ if (self.head_dim * self.num_heads) != self.hidden_size:
449
+ raise ValueError(
450
+ f'hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}'
451
+ f' and `num_heads`: {self.num_heads}).'
452
+ )
453
+
454
+ # self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias)
455
+ # self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias)
456
+ # self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias)
457
+ # self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias)
458
+ self.q_proj = LinearQuantizer(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias, quant_type="ternary", bit=4, group_size=-1)
459
+ self.k_proj = LinearQuantizer(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias, quant_type="ternary", bit=4, group_size=-1)
460
+ self.v_proj = LinearQuantizer(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias, quant_type="ternary", bit=4, group_size=-1)
461
+ self.o_proj = LinearQuantizer(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias, quant_type="ternary", bit=4, group_size=-1)
462
+ self._init_rope()
463
+
464
+ def _init_rope(self):
465
+ if self.config.rope_scaling is None:
466
+ self.rotary_emb = MiniCPMRotaryEmbedding(
467
+ self.head_dim,
468
+ max_position_embeddings=self.max_position_embeddings,
469
+ base=self.rope_theta,
470
+ )
471
+ else:
472
+ scaling_type = self.config.rope_scaling['rope_type']
473
+ scaling_factor = self.config.rope_scaling.get('factor', None)
474
+ if scaling_type == 'linear':
475
+ self.rotary_emb = MiniCPMLinearScalingRotaryEmbedding(
476
+ self.head_dim,
477
+ max_position_embeddings=self.max_position_embeddings,
478
+ scaling_factor=scaling_factor,
479
+ base=self.rope_theta,
480
+ )
481
+ elif scaling_type == 'dynamic':
482
+ self.rotary_emb = MiniCPMDynamicNTKScalingRotaryEmbedding(
483
+ self.head_dim,
484
+ max_position_embeddings=self.max_position_embeddings,
485
+ scaling_factor=scaling_factor,
486
+ base=self.rope_theta,
487
+ )
488
+ elif scaling_type == 'longrope':
489
+ self.rotary_emb = MiniCPMLongRoPE(
490
+ self.head_dim,
491
+ max_position_embeddings=self.max_position_embeddings,
492
+ short_factor=self.config.rope_scaling['short_factor'],
493
+ long_factor=self.config.rope_scaling['long_factor'],
494
+ base=self.rope_theta,
495
+ original_max_position_embeddings=self.config.rope_scaling['original_max_position_embeddings']
496
+ )
497
+ else:
498
+ raise ValueError(f'Unknown RoPE scaling type {scaling_type}')
499
+
500
+ def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
501
+ return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
502
+
503
+ def forward(
504
+ self,
505
+ hidden_states: torch.Tensor,
506
+ attention_mask: Optional[torch.Tensor] = None,
507
+ position_ids: Optional[torch.LongTensor] = None,
508
+ past_key_value: Optional[Cache] = None,
509
+ output_attentions: bool = False,
510
+ use_cache: bool = False,
511
+ **kwargs,
512
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
513
+ if 'padding_mask' in kwargs:
514
+ warnings.warn(
515
+ 'Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure to use `attention_mask` instead.'
516
+ )
517
+
518
+ bsz, q_len, _ = hidden_states.size()
519
+
520
+ if self.config.pretraining_tp > 1:
521
+ key_value_slicing = (self.num_key_value_heads * self.head_dim) // self.config.pretraining_tp
522
+ query_slices = self.q_proj.weight.split(
523
+ (self.num_heads * self.head_dim) // self.config.pretraining_tp, dim=0
524
+ )
525
+ key_slices = self.k_proj.weight.split(key_value_slicing, dim=0)
526
+ value_slices = self.v_proj.weight.split(key_value_slicing, dim=0)
527
+
528
+ query_states = [F.linear(hidden_states, query_slices[i]) for i in range(self.config.pretraining_tp)]
529
+ query_states = torch.cat(query_states, dim=-1)
530
+
531
+ key_states = [F.linear(hidden_states, key_slices[i]) for i in range(self.config.pretraining_tp)]
532
+ key_states = torch.cat(key_states, dim=-1)
533
+
534
+ value_states = [F.linear(hidden_states, value_slices[i]) for i in range(self.config.pretraining_tp)]
535
+ value_states = torch.cat(value_states, dim=-1)
536
+
537
+ else:
538
+ query_states = self.q_proj(hidden_states)
539
+ key_states = self.k_proj(hidden_states)
540
+ value_states = self.v_proj(hidden_states)
541
+
542
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
543
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
544
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
545
+
546
+ kv_seq_len = position_ids.max().item() + 1
547
+ cos, sin = self.rotary_emb(value_states.to(torch.float32), seq_len=kv_seq_len)
548
+
549
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
550
+
551
+ if past_key_value is not None:
552
+ cache_kwargs = {'sin': sin, 'cos': cos} # Specific to RoPE models
553
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
554
+
555
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
556
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
557
+
558
+ attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
559
+ if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
560
+ raise ValueError(
561
+ f'Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is'
562
+ f' {attn_weights.size()}'
563
+ )
564
+
565
+ if attention_mask is not None:
566
+ if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
567
+ raise ValueError(
568
+ f'Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}'
569
+ )
570
+ attn_weights = attn_weights + attention_mask
571
+
572
+ # upcast attention to fp32
573
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
574
+ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training)
575
+ attn_output = torch.matmul(attn_weights, value_states)
576
+
577
+ if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
578
+ raise ValueError(
579
+ f'`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is'
580
+ f' {attn_output.size()}'
581
+ )
582
+
583
+ attn_output = attn_output.transpose(1, 2).contiguous()
584
+
585
+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
586
+
587
+ if self.config.pretraining_tp > 1:
588
+ attn_output = attn_output.split(self.hidden_size // self.config.pretraining_tp, dim=2)
589
+ o_proj_slices = self.o_proj.weight.split(self.hidden_size // self.config.pretraining_tp, dim=1)
590
+ attn_output = sum([F.linear(attn_output[i], o_proj_slices[i]) for i in range(self.config.pretraining_tp)])
591
+ else:
592
+ attn_output = self.o_proj(attn_output)
593
+
594
+ if not output_attentions:
595
+ attn_weights = None
596
+
597
+ return attn_output, attn_weights, past_key_value
598
+
599
+
600
+ class MiniCPMFlashAttention2(MiniCPMAttention):
601
+ """
602
+ MiniCPM flash attention module. This module inherits from `MiniCPMAttention` as the weights of the module stay
603
+ untouched. The only required change would be on the forward pass where it needs to correctly call the public API of
604
+ flash attention and deal with padding tokens in case the input contains any of them.
605
+ """
606
+
607
+ def __init__(self, *args, **kwargs):
608
+ super().__init__(*args, **kwargs)
609
+ # TODO: Should be removed once Flash Attention for ROCm is bumped to 2.1.
610
+ # flash_attn<2.1 generates top-left aligned causal mask, while what is needed here is bottom-right alignment, that was made default for flash_attn>=2.1. This attribute is used to handle this difference. Reference: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.1.0.
611
+ # Beware that with flash_attn<2.1, using q_seqlen != k_seqlen (except for the case q_seqlen == 1) produces a wrong mask (top-left).
612
+ self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10()
613
+
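The difference between the two causal-mask alignments mentioned in the comments above shows up as soon as the query is shorter than the key sequence, for example when decoding one token against a cache. The sketch below builds both masks as nested lists; `causal_mask` is a hypothetical helper, not a flash-attn function.

```python
def causal_mask(q_len, k_len, bottom_right=True):
    """mask[i][j] is True when query i may attend to key j."""
    # Bottom-right alignment shifts the diagonal so the last query sees every key.
    offset = k_len - q_len if bottom_right else 0
    return [[j <= i + offset for j in range(k_len)] for i in range(q_len)]


print(causal_mask(1, 3, bottom_right=True))
print(causal_mask(1, 3, bottom_right=False))
```

With top-left alignment a single new query would see only the first cached key, which is why the code disables `causal` when `query_length == 1` on older flash-attn versions.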
614
+ def forward(
615
+ self,
616
+ hidden_states: torch.Tensor,
617
+ attention_mask: Optional[torch.LongTensor] = None,
618
+ position_ids: Optional[torch.LongTensor] = None,
619
+ past_key_value: Optional[Cache] = None,
620
+ output_attentions: bool = False,
621
+ use_cache: bool = False,
622
+ **kwargs,
623
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
624
+ # MiniCPMFlashAttention2 attention does not support output_attentions
625
+ if 'padding_mask' in kwargs:
626
+ warnings.warn(
627
+ 'Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure to use `attention_mask` instead.'
628
+ )
629
+
630
+ # overwrite attention_mask with padding_mask
631
+ attention_mask = kwargs.pop('padding_mask')
632
+
633
+ output_attentions = False
634
+
635
+ bsz, q_len, _ = hidden_states.size()
636
+
637
+ query_states = self.q_proj(hidden_states)
638
+ key_states = self.k_proj(hidden_states)
639
+ value_states = self.v_proj(hidden_states)
640
+
641
+ # Flash attention requires the input to have the shape
642
+ # batch_size x seq_length x num_heads x head_dim
643
+ # therefore we just need to keep the original shape
644
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
645
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
646
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
647
+
648
+ kv_seq_len = position_ids.max().item() + 1
649
+ cos, sin = self.rotary_emb(value_states.to(torch.float32), seq_len=kv_seq_len)
650
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
651
+
652
+ if past_key_value is not None:
653
+ cache_kwargs = {'sin': sin, 'cos': cos} # Specific to RoPE models
654
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
655
+
656
+ # TODO: These transposes are quite inefficient, but Flash Attention requires the layout [batch_size, sequence_length, num_heads, head_dim]. We would need to refactor the KV cache
657
+ # to be able to avoid many of these transpose/reshape/view.
658
+ query_states = query_states.transpose(1, 2)
659
+ key_states = key_states.transpose(1, 2)
660
+ value_states = value_states.transpose(1, 2)
661
+
662
+ dropout_rate = self.attention_dropout if self.training else 0.0
663
+
664
+ # In PEFT, usually we cast the layer norms in float32 for training stability reasons
665
+ # therefore the input hidden states get silently cast to float32. Hence, we need to
666
+ # cast them back in the correct dtype just to be sure everything works as expected.
667
+ # This might slowdown training & inference so it is recommended to not cast the LayerNorms
668
+ # in fp32. (MiniCPMRMSNorm handles it correctly)
669
+
670
+ input_dtype = query_states.dtype
671
+ if input_dtype == torch.float32:
672
+ # Handle the case where the model is quantized
673
+ if hasattr(self.config, '_pre_quantization_dtype'):
674
+ target_dtype = self.config._pre_quantization_dtype
675
+ else:
676
+ target_dtype = self.q_proj.weight.dtype
677
+
678
+ logger.warning_once(
679
+ f'The input hidden states seem to have been silently cast to float32; this might be related to'
680
+ f' the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in'
681
+ f' {target_dtype}.'
682
+ )
683
+
684
+ query_states = query_states.to(target_dtype)
685
+ key_states = key_states.to(target_dtype)
686
+ value_states = value_states.to(target_dtype)
687
+
688
+ attn_output = self._flash_attention_forward(
689
+ query_states, key_states, value_states, attention_mask, q_len, dropout=dropout_rate
690
+ )
691
+
692
+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size).contiguous()
693
+ attn_output = self.o_proj(attn_output)
694
+
695
+ if not output_attentions:
696
+ attn_weights = None
697
+
698
+ return attn_output, attn_weights, past_key_value
699
+
700
+ def _flash_attention_forward(
701
+ self, query_states, key_states, value_states, attention_mask, query_length, dropout=0.0, softmax_scale=None
702
+ ):
703
+ """
704
+ Calls the forward method of Flash Attention - if the input hidden states contain at least one padding token
705
+ first unpads the input, then computes the attention scores and pads the final attention scores.
706
+
707
+ Args:
708
+ query_states (`torch.Tensor`):
709
+ Input query states to be passed to Flash Attention API
710
+ key_states (`torch.Tensor`):
711
+ Input key states to be passed to Flash Attention API
712
+ value_states (`torch.Tensor`):
713
+ Input value states to be passed to Flash Attention API
714
+ attention_mask (`torch.Tensor`):
715
+ The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
716
+ position of padding tokens and 1 for the position of non-padding tokens.
717
+ dropout (`float`, *optional*):
718
+ Attention dropout
719
+ softmax_scale (`float`, *optional*):
720
+ The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
721
+ """
722
+ if not self._flash_attn_uses_top_left_mask:
723
+ causal = self.is_causal
724
+ else:
725
+ # TODO: Remove the `query_length != 1` check once Flash Attention for ROCm is bumped to 2.1. For details, please see the comment in MiniCPMFlashAttention2 __init__.
726
+ causal = self.is_causal and query_length != 1
727
+ # Contains at least one padding token in the sequence
728
+ if attention_mask is not None:
729
+ batch_size = query_states.shape[0]
730
+ query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(
731
+ query_states, key_states, value_states, attention_mask, query_length
732
+ )
733
+
734
+ cu_seqlens_q, cu_seqlens_k = cu_seq_lens
735
+ max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
736
+ attn_output_unpad = flash_attn_varlen_func(
737
+ query_states,
738
+ key_states,
739
+ value_states,
740
+ cu_seqlens_q=cu_seqlens_q,
741
+ cu_seqlens_k=cu_seqlens_k,
742
+ max_seqlen_q=max_seqlen_in_batch_q,
743
+ max_seqlen_k=max_seqlen_in_batch_k,
744
+ dropout_p=dropout,
745
+ softmax_scale=softmax_scale,
746
+ causal=causal,
747
+ )
748
+
749
+ attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length)
750
+ else:
751
+ attn_output = flash_attn_func(
752
+ query_states, key_states, value_states, dropout, softmax_scale=softmax_scale, causal=causal
753
+ )
754
+
755
+ return attn_output
756
+
757
+ def _upad_input(self, query_layer, key_layer, value_layer, attention_mask, query_length):
758
+ indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(attention_mask)
759
+ batch_size, kv_seq_len, num_key_value_heads, head_dim = key_layer.shape
760
+
761
+ key_layer = index_first_axis(
762
+ key_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
763
+ )
764
+ value_layer = index_first_axis(
765
+ value_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
766
+ )
767
+ if query_length == kv_seq_len:
768
+ query_layer = index_first_axis(
769
+ query_layer.reshape(batch_size * kv_seq_len, self.num_heads, head_dim), indices_k
770
+ )
771
+ cu_seqlens_q = cu_seqlens_k
772
+ max_seqlen_in_batch_q = max_seqlen_in_batch_k
773
+ indices_q = indices_k
774
+ elif query_length == 1:
775
+ max_seqlen_in_batch_q = 1
776
+ cu_seqlens_q = torch.arange(
777
+ batch_size + 1, dtype=torch.int32, device=query_layer.device
778
+ ) # There is a memcpy here, that is very bad.
779
+ indices_q = cu_seqlens_q[:-1]
780
+ query_layer = query_layer.squeeze(1)
781
+ else:
782
+ # The -q_len: slice assumes left padding.
783
+ attention_mask = attention_mask[:, -query_length:]
784
+ query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = unpad_input(query_layer, attention_mask)
785
+
786
+ return (
787
+ query_layer,
788
+ key_layer,
789
+ value_layer,
790
+ indices_q,
791
+ (cu_seqlens_q, cu_seqlens_k),
792
+ (max_seqlen_in_batch_q, max_seqlen_in_batch_k),
793
+ )
794
+
795
+
796
+ class MiniCPMSdpaAttention(MiniCPMAttention):
797
+ """
798
+ MiniCPM attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from
799
+ `MiniCPMAttention` as the weights of the module stay untouched. The only changes are on the forward pass to adapt to
800
+ SDPA API.
801
+ """
802
+
803
+ # Adapted from MiniCPMAttention.forward
804
+ def forward(
805
+ self,
806
+ hidden_states: torch.Tensor,
807
+ attention_mask: Optional[torch.Tensor] = None,
808
+ position_ids: Optional[torch.LongTensor] = None,
809
+ past_key_value: Optional[Cache] = None,
810
+ output_attentions: bool = False,
811
+ use_cache: bool = False,
812
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
813
+ if output_attentions:
814
+ # TODO: Improve this warning with e.g. `model.config.attn_implementation = "manual"` once this is implemented.
815
+ logger.warning_once(
816
+ 'MiniCPMModel is using MiniCPMSdpaAttention, but `torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to the manual attention implementation, '
817
+ 'but specifying the manual implementation will be required from Transformers version v5.0.0 onwards. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
818
+ )
819
+ return super().forward(
820
+ hidden_states=hidden_states,
821
+ attention_mask=attention_mask,
822
+ position_ids=position_ids,
823
+ past_key_value=past_key_value,
824
+ output_attentions=output_attentions,
825
+ use_cache=use_cache,
826
+ )
827
+
828
+ bsz, q_len, _ = hidden_states.size()
829
+
830
+ query_states = self.q_proj(hidden_states)
831
+ key_states = self.k_proj(hidden_states)
832
+ value_states = self.v_proj(hidden_states)
833
+
834
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
835
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
836
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
837
+
838
+ kv_seq_len = position_ids.max().item() + 1
839
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
840
+
841
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
842
+
843
+ if past_key_value is not None:
844
+ cache_kwargs = {'sin': sin, 'cos': cos} # Specific to RoPE models
845
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
846
+
847
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
848
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
849
+
850
+ if attention_mask is not None:
851
+ if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
852
+ raise ValueError(
853
+ f'Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}'
854
+ )
855
+
856
+ # SDPA with memory-efficient backend is currently (torch==2.1.2) bugged with non-contiguous inputs with custom attn_mask,
857
+ # Reference: https://github.com/pytorch/pytorch/issues/112577.
858
+ if query_states.device.type == 'cuda' and attention_mask is not None:
859
+ query_states = query_states.contiguous()
860
+ key_states = key_states.contiguous()
861
+ value_states = value_states.contiguous()
862
+
863
+ attn_output = torch.nn.functional.scaled_dot_product_attention(
864
+ query_states,
865
+ key_states,
866
+ value_states,
867
+ attn_mask=attention_mask,
868
+ dropout_p=self.attention_dropout if self.training else 0.0,
869
+ # The q_len > 1 is necessary to match with AttentionMaskConverter.to_causal_4d that does not create a causal mask in case q_len == 1.
870
+ is_causal=self.is_causal and attention_mask is None and q_len > 1,
871
+ )
872
+
873
+ attn_output = attn_output.transpose(1, 2).contiguous()
874
+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
875
+
876
+ attn_output = self.o_proj(attn_output)
877
+
878
+ return attn_output, None, past_key_value
879
+
880
+
881
+ MINICPM_ATTENTION_CLASSES = {
882
+ 'eager': MiniCPMAttention,
883
+ 'flash_attention_2': MiniCPMFlashAttention2,
884
+ 'sdpa': MiniCPMSdpaAttention,
885
+ }
886
+
887
+
888
+ class MiniCPMDecoderLayer(nn.Module):
889
+ def __init__(self, config: MiniCPMConfig, layer_idx: int):
890
+ super().__init__()
891
+ self.hidden_size = config.hidden_size
892
+ self.self_attn = MINICPM_ATTENTION_CLASSES[config._attn_implementation](config=config, layer_idx=layer_idx)
893
+
894
+ self.mlp = MiniCPMMLP(config)
895
+ self.input_layernorm = MiniCPMRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
896
+ self.post_attention_layernorm = MiniCPMRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
897
+
898
+ self.scale_depth = config.scale_depth
899
+ self.num_hidden_layers = config.num_hidden_layers
900
+
901
+ def forward(
902
+ self,
903
+ hidden_states: torch.Tensor,
904
+ attention_mask: Optional[torch.Tensor] = None,
905
+ position_ids: Optional[torch.LongTensor] = None,
906
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
907
+ output_attentions: Optional[bool] = False,
908
+ use_cache: Optional[bool] = False,
909
+ **kwargs,
910
+ ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
911
+ """
912
+ Args:
913
+ hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
914
+ attention_mask (`torch.FloatTensor`, *optional*):
915
+ attention mask of size `(batch_size, sequence_length)` if flash attention is used or `(batch_size, 1,
916
+ query_sequence_length, key_sequence_length)` if default attention is used.
917
+ output_attentions (`bool`, *optional*):
918
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
919
+ returned tensors for more detail.
920
+ use_cache (`bool`, *optional*):
921
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
922
+ (see `past_key_values`).
923
+ past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
924
+ """
925
+ if 'padding_mask' in kwargs:
926
+ warnings.warn(
927
+ 'Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure to use `attention_mask` instead.'
928
+ )
929
+
930
+ residual = hidden_states
931
+ hidden_states = self.input_layernorm(hidden_states)
932
+ # Self Attention
933
+ hidden_states, self_attn_weights, present_key_value = self.self_attn(
934
+ hidden_states=hidden_states,
935
+ attention_mask=attention_mask,
936
+ position_ids=position_ids,
937
+ past_key_value=past_key_value,
938
+ output_attentions=output_attentions,
939
+ use_cache=use_cache,
940
+ **kwargs,
941
+ )
942
+
943
+ hidden_states = residual + hidden_states * (self.scale_depth / math.sqrt(self.num_hidden_layers))
944
+
945
+ # Fully Connected
946
+ residual = hidden_states
947
+ hidden_states = self.post_attention_layernorm(hidden_states)
948
+
949
+ hidden_states = self.mlp(hidden_states)
950
+ hidden_states = residual + hidden_states * (self.scale_depth / math.sqrt(self.num_hidden_layers))
951
+
952
+ outputs = (hidden_states,)
953
+
954
+ if output_attentions:
955
+ outputs += (self_attn_weights,)
956
+
957
+ if use_cache:
958
+ outputs += (present_key_value,)
959
+
960
+ return outputs
961
+
962
+
963
+ MINICPM_START_DOCSTRING = r"""
964
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
965
+ library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
966
+ etc.)
967
+
968
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
969
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
970
+ and behavior.
971
+
972
+ Parameters:
973
+ config ([`MiniCPMConfig`]):
974
+ Model configuration class with all the parameters of the model. Initializing with a config file does not
975
+ load the weights associated with the model, only the configuration. Check out the
976
+ [`~PreTrainedModel.from_pretrained`] method to load the model weights.
977
+ """
978
+
979
+
980
+ @add_start_docstrings(
981
+ 'The bare MiniCPM Model outputting raw hidden-states without any specific head on top.',
982
+ MINICPM_START_DOCSTRING,
983
+ )
984
+ class MiniCPMPreTrainedModel(PreTrainedModel):
985
+ config_class = MiniCPMConfig
986
+ base_model_prefix = 'model'
987
+ supports_gradient_checkpointing = True
988
+ _no_split_modules = ['MiniCPMDecoderLayer']
989
+ _skip_keys_device_placement = 'past_key_values'
990
+ _supports_flash_attn_2 = True
991
+ _supports_sdpa = True
992
+ _supports_cache_class = True
993
+
994
+ def _init_weights(self, module):
995
+ std = self.config.initializer_range
996
+ if isinstance(module, nn.Linear):
997
+ module.weight.data.normal_(mean=0.0, std=std)
998
+ if module.bias is not None:
999
+ module.bias.data.zero_()
1000
+ elif isinstance(module, nn.Embedding):
1001
+ module.weight.data.normal_(mean=0.0, std=std)
1002
+ if module.padding_idx is not None:
1003
+ module.weight.data[module.padding_idx].zero_()
1004
+
1005
+
1006
+ MINICPM_INPUTS_DOCSTRING = r"""
1007
+ Args:
1008
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
1009
+ Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
1010
+ it.
1011
+
1012
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
1013
+ [`PreTrainedTokenizer.__call__`] for details.
1014
+
1015
+ [What are input IDs?](../glossary#input-ids)
1016
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
1017
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
1018
+
1019
+ - 1 for tokens that are **not masked**,
1020
+ - 0 for tokens that are **masked**.
1021
+
1022
+ [What are attention masks?](../glossary#attention-mask)
1023
+
1024
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
1025
+ [`PreTrainedTokenizer.__call__`] for details.
1026
+
1027
+ If `past_key_values` is used, optionally only the last `input_ids` have to be input (see
1028
+ `past_key_values`).
1029
+
1030
+ If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
1031
+ and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
1032
+ information on the default strategy.
1033
+
1034
+ - 1 indicates the head is **not masked**,
1035
+ - 0 indicates the head is **masked**.
1036
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1037
+ Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0,
1038
+ config.n_positions - 1]`.
1039
+
1040
+ [What are position IDs?](../glossary#position-ids)
1041
+ past_key_values (`Cache` or `tuple(tuple(torch.FloatTensor))`, *optional*):
1042
+ Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
1043
+ blocks) that can be used to speed up sequential decoding. This typically consists of the `past_key_values`
1044
+ returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.
1045
+
1046
+ Two formats are allowed:
1047
+ - a [`~cache_utils.Cache`] instance;
1048
+ - Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of
1049
+ shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`. This is also known as the legacy
1050
+ cache format.
1051
+
1052
+ The model will output the same cache format that is fed as input. If no `past_key_values` are passed, the
1053
+ legacy cache format will be returned.
1054
+
1055
+ If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don't
1056
+ have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids`
1057
+ of shape `(batch_size, sequence_length)`.
1058
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
1059
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
1060
+ is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
1061
+ model's internal embedding lookup matrix.
1062
+ use_cache (`bool`, *optional*):
1063
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
1064
+ `past_key_values`).
1065
+ output_attentions (`bool`, *optional*):
1066
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
1067
+ tensors for more detail.
1068
+ output_hidden_states (`bool`, *optional*):
1069
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
1070
+ more detail.
1071
+ return_dict (`bool`, *optional*):
1072
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
1073
+ """
1074
+
1075
+
1076
+ @add_start_docstrings(
1077
+ 'The bare MiniCPM Model outputting raw hidden-states without any specific head on top.',
1078
+ MINICPM_START_DOCSTRING,
1079
+ )
1080
+ class MiniCPMModel(MiniCPMPreTrainedModel):
1081
+ """
1082
+ Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`MiniCPMDecoderLayer`]
1083
+
1084
+ Args:
1085
+ config: MiniCPMConfig
1086
+ """
1087
+
1088
+ def __init__(self, config: MiniCPMConfig):
1089
+ super().__init__(config)
1090
+ self.padding_idx = config.pad_token_id
1091
+ self.vocab_size = config.vocab_size
1092
+
1093
+ self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
1094
+ self.layers = nn.ModuleList(
1095
+ [MiniCPMDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
1096
+ )
1097
+ self._use_sdpa = config._attn_implementation == 'sdpa'
1098
+ self._use_flash_attention_2 = config._attn_implementation == 'flash_attention_2'
1099
+
1100
+ self.norm = MiniCPMRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
1101
+
1102
+ self.gradient_checkpointing = False
1103
+ # Initialize weights and apply final processing
1104
+ self.post_init()
1105
+
1106
+ def get_input_embeddings(self):
1107
+ return self.embed_tokens
1108
+
1109
+ def set_input_embeddings(self, value):
1110
+ self.embed_tokens = value
1111
+
1112
+ @add_start_docstrings_to_model_forward(MINICPM_INPUTS_DOCSTRING)
1113
+ def forward(
1114
+ self,
1115
+ input_ids: torch.LongTensor = None,
1116
+ attention_mask: Optional[torch.Tensor] = None,
1117
+ position_ids: Optional[torch.LongTensor] = None,
1118
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1119
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1120
+ use_cache: Optional[bool] = None,
1121
+ output_attentions: Optional[bool] = None,
1122
+ output_hidden_states: Optional[bool] = None,
1123
+ return_dict: Optional[bool] = None,
1124
+ ) -> Union[Tuple, BaseModelOutputWithPast]:
1125
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1126
+ output_hidden_states = (
1127
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1128
+ )
1129
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
1130
+
1131
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1132
+
1133
+ # retrieve input_ids and inputs_embeds
1134
+ if input_ids is not None and inputs_embeds is not None:
1135
+ raise ValueError('You cannot specify both input_ids and inputs_embeds at the same time')
1136
+ elif input_ids is not None:
1137
+ batch_size, seq_length = input_ids.shape[:2]
1138
+ elif inputs_embeds is not None:
1139
+ batch_size, seq_length = inputs_embeds.shape[:2]
1140
+ else:
1141
+ raise ValueError('You have to specify either input_ids or inputs_embeds')
1142
+
1143
+ if self.gradient_checkpointing and self.training:
1144
+ if use_cache:
1145
+ logger.warning_once(
1146
+ '`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...'
1147
+ )
1148
+ use_cache = False
1149
+
1150
+ past_key_values_length = 0
1151
+
1152
+ if use_cache:
1153
+ use_legacy_cache = not isinstance(past_key_values, Cache)
1154
+ if use_legacy_cache:
1155
+ raise ValueError(
1156
+ 'You must use the new past_key_values format, such as the Cache class, instead of the old tuple format.'
1157
+ )
1158
+
1159
+ # Calculate the usable length of past key values
1160
+ past_key_values_length = past_key_values.get_seq_length() if isinstance(past_key_values, Cache) else 0
1161
+
1162
+
1163
+ if position_ids is None:
1164
+ device = input_ids.device if input_ids is not None else inputs_embeds.device
1165
+ position_ids = torch.arange(
1166
+ past_key_values_length, seq_length + past_key_values_length, dtype=torch.long, device=device
1167
+ )
1168
+ position_ids = position_ids.unsqueeze(0)
1169
+
1170
+ if inputs_embeds is None:
1171
+ inputs_embeds = self.embed_tokens(input_ids) * self.config.scale_emb
1172
+
1173
+ if self._use_flash_attention_2:
1174
+ # 2d mask is passed through the layers
1175
+ # attention_mask = attention_mask if (attention_mask is not None and 0 in attention_mask) else None
1176
+ if attention_mask is None:
1177
+ raise ValueError(
1178
+ '`attention_mask` is required when using flash_attention_2, but got None.'
1179
+ )
1180
+ elif self._use_sdpa and not output_attentions:
1181
+ # output_attentions=True can not be supported when using SDPA, and we fall back on
1182
+ # the manual implementation that requires a 4D causal mask in all cases.
1183
+ attention_mask = _prepare_4d_causal_attention_mask_for_sdpa(
1184
+ attention_mask,
1185
+ (batch_size, seq_length),
1186
+ inputs_embeds,
1187
+ past_key_values_length,
1188
+ )
1189
+ else:
1190
+ # 4d mask is passed through the layers
1191
+ attention_mask = _prepare_4d_causal_attention_mask(
1192
+ attention_mask, (batch_size, seq_length), inputs_embeds, past_key_values_length
1193
+ )
1194
+
1195
+ # embed positions
1196
+ hidden_states = inputs_embeds
1197
+
1198
+ # decoder layers
1199
+ all_hidden_states = () if output_hidden_states else None
1200
+ all_self_attns = () if output_attentions else None
1201
+ next_decoder_cache = None
1202
+
1203
+ for decoder_layer in self.layers:
1204
+ if output_hidden_states:
1205
+ all_hidden_states += (hidden_states,)
1206
+
1207
+ if self.gradient_checkpointing and self.training:
1208
+ layer_outputs = self._gradient_checkpointing_func(
1209
+ decoder_layer.__call__,
1210
+ hidden_states,
1211
+ attention_mask,
1212
+ position_ids,
1213
+ past_key_values,
1214
+ output_attentions,
1215
+ use_cache,
1216
+ )
1217
+ else:
1218
+ layer_outputs = decoder_layer(
1219
+ hidden_states,
1220
+ attention_mask=attention_mask,
1221
+ position_ids=position_ids,
1222
+ past_key_value=past_key_values,
1223
+ output_attentions=output_attentions,
1224
+ use_cache=use_cache,
1225
+ )
1226
+
1227
+ hidden_states = layer_outputs[0]
1228
+
1229
+ if use_cache:
1230
+ next_decoder_cache = layer_outputs[2 if output_attentions else 1]
1231
+
1232
+ if output_attentions:
1233
+ all_self_attns += (layer_outputs[1],)
1234
+
1235
+ hidden_states = self.norm(hidden_states)
1236
+
1237
+ # add hidden states from the last decoder layer
1238
+ if output_hidden_states:
1239
+ all_hidden_states += (hidden_states,)
1240
+
1241
+ next_cache = None
1242
+ if use_cache:
1243
+ next_cache = next_decoder_cache.to_legacy_cache() if use_legacy_cache else next_decoder_cache
1244
+ if not return_dict:
1245
+ return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
1246
+ return BaseModelOutputWithPast(
1247
+ last_hidden_state=hidden_states,
1248
+ past_key_values=next_cache,
1249
+ hidden_states=all_hidden_states,
1250
+ attentions=all_self_attns,
1251
+ )
1252
+
1253
+
1254
+ class MiniCPMForCausalLM(MiniCPMPreTrainedModel):
1255
+ _tied_weights_keys = ['lm_head.weight']
1256
+
1257
+ def __init__(self, config):
1258
+ super().__init__(config)
1259
+ self.model = MiniCPMModel(config)
1260
+ self.vocab_size = config.vocab_size
1261
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
1262
+
1263
+ # Initialize weights and apply final processing
1264
+ self.post_init()
1265
+
1266
+ def get_input_embeddings(self):
1267
+ return self.model.embed_tokens
1268
+
1269
+ def set_input_embeddings(self, value):
1270
+ self.model.embed_tokens = value
1271
+
1272
+ def get_output_embeddings(self):
1273
+ return self.lm_head
1274
+
1275
+ def set_output_embeddings(self, new_embeddings):
1276
+ self.lm_head = new_embeddings
1277
+
1278
+ def set_decoder(self, decoder):
1279
+ self.model = decoder
1280
+
1281
+ def get_decoder(self):
1282
+ return self.model
1283
+
1284
+ @add_start_docstrings_to_model_forward(MINICPM_INPUTS_DOCSTRING)
1285
+ @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
1286
+ def forward(
1287
+ self,
1288
+ input_ids: torch.LongTensor = None,
1289
+ attention_mask: Optional[torch.Tensor] = None,
1290
+ position_ids: Optional[torch.LongTensor] = None,
1291
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1292
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1293
+ labels: Optional[torch.LongTensor] = None,
1294
+ use_cache: Optional[bool] = None,
1295
+ output_attentions: Optional[bool] = None,
1296
+ output_hidden_states: Optional[bool] = None,
1297
+ return_dict: Optional[bool] = None,
1298
+ logits_to_keep: Union[int, torch.Tensor] = 0,
1299
+ **kwargs,
1300
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
1301
+ r"""
1302
+ Args:
1303
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1304
+ Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
1305
+ config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
1306
+ (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
1307
+
1308
+ Returns:
1309
+
1310
+ Example:
1311
+
1312
+ ```python
1313
+ >>> from transformers import AutoModelForCausalLM, AutoTokenizer
1314
+
1315
+ >>> model = AutoModelForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS, trust_remote_code=True)
1316
+ >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)
1317
+
1318
+ >>> prompt = "Hey, are you conscious? Can you talk to me?"
1319
+ >>> inputs = tokenizer(prompt, return_tensors="pt")
1320
+
1321
+ >>> # Generate
1322
+ >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
1323
+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
1324
+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
1325
+ ```"""
1326
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1327
+ output_hidden_states = (
1328
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1329
+ )
1330
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1331
+
1332
+ # decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
1333
+ outputs = self.model(
1334
+ input_ids=input_ids,
1335
+ attention_mask=attention_mask,
1336
+ position_ids=position_ids,
1337
+ past_key_values=past_key_values,
1338
+ inputs_embeds=inputs_embeds,
1339
+ use_cache=use_cache,
1340
+ output_attentions=output_attentions,
1341
+ output_hidden_states=output_hidden_states,
1342
+ return_dict=return_dict,
1343
+ )
1344
+
1345
+ hidden_states = outputs[0]
1346
+ # Only compute necessary logits, and do not upcast them to float if we are not computing the loss
1347
+ slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
1348
+ hidden_states = hidden_states[:, slice_indices, :].contiguous()
1349
+ if self.config.pretraining_tp > 1:
1350
+ lm_head_slices = self.lm_head.weight.split(self.vocab_size // self.config.pretraining_tp, dim=0)
1351
+ logits = [F.linear(hidden_states, lm_head_slices[i]) for i in range(self.config.pretraining_tp)]
1352
+ logits = torch.cat(logits, dim=-1)
1353
+ else:
1354
+ logits = self.lm_head(hidden_states / (self.config.hidden_size / self.config.dim_model_base))
1355
+ logits = logits.float()
1356
+
1357
+ loss = None
1358
+ if labels is not None:
1359
+ # Shift so that tokens < n predict n
1360
+ shift_logits = logits[..., :-1, :].contiguous()
1361
+ shift_labels = labels[..., 1:].contiguous()
1362
+ # Flatten the tokens
1363
+ loss_fct = CrossEntropyLoss()
1364
+ shift_logits = shift_logits.view(-1, self.config.vocab_size)
1365
+ shift_labels = shift_labels.view(-1)
1366
+ # Enable model parallelism
1367
+ shift_labels = shift_labels.to(shift_logits.device)
1368
+ loss = loss_fct(shift_logits, shift_labels)
1369
+
1370
+ if not return_dict:
1371
+ output = (logits,) + outputs[1:]
1372
+ return (loss,) + output if loss is not None else output
1373
+
1374
+ return CausalLMOutputWithPast(
1375
+ loss=loss,
1376
+ logits=logits,
1377
+ past_key_values=outputs.past_key_values,
1378
+ hidden_states=outputs.hidden_states,
1379
+ attentions=outputs.attentions,
1380
+ )
1381
+
1382
+ def prepare_inputs_for_generation(
1383
+ self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
1384
+ ):
1385
+ if past_key_values is not None:
1386
+ if isinstance(past_key_values, Cache):
1387
+ # Use the new Cache class methods
1388
+ cache_length = past_key_values.get_seq_length()
1389
+
1390
+
1391
+ past_length = cache_length
1392
+ max_cache_length = None
1393
+ else:
1394
+ raise ValueError(
1395
+ 'You must use the new past_key_values format, such as the Cache class, instead of the old tuple format.'
1396
+ )
1397
+
1398
+ # Keep only the unprocessed tokens:
1399
+ # 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
1400
+ # some of the inputs are exclusively passed as part of the cache (e.g. when passing `inputs_embeds` as
1401
+ # input)
1402
+ if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
1403
+ input_ids = input_ids[:, -(attention_mask.shape[1] - past_length):]
1404
+ # 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
1405
+ # input_ids based on the past_length.
1406
+ elif past_length < input_ids.shape[1]:
1407
+ input_ids = input_ids[:, past_length:]
1408
+ # 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.
1409
+
1410
+ # If we are about to go beyond the maximum cache length, we need to crop the input attention mask.
1411
+ if (
1412
+ max_cache_length is not None
1413
+ and attention_mask is not None
1414
+ and cache_length + input_ids.shape[1] > max_cache_length
1415
+ ):
1416
+ attention_mask = attention_mask[:, -max_cache_length:]
1417
+
1418
+ position_ids = kwargs.get('position_ids', None)
1419
+ if attention_mask is not None and position_ids is None:
1420
+ # create position_ids on the fly for batch generation
1421
+ position_ids = attention_mask.long().cumsum(-1) - 1
1422
+ position_ids.masked_fill_(attention_mask == 0, 1)
1423
+ if past_key_values:
1424
+ position_ids = position_ids[:, -input_ids.shape[1]:]
1425
+
1426
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
1427
+ if inputs_embeds is not None and past_key_values is None:
1428
+ model_inputs = {'inputs_embeds': inputs_embeds}
1429
+ else:
1430
+ model_inputs = {'input_ids': input_ids}
1431
+
1432
+ model_inputs.update(
1433
+ {
1434
+ 'position_ids': position_ids,
1435
+ 'past_key_values': past_key_values,
1436
+ 'use_cache': kwargs.get('use_cache'),
1437
+ 'attention_mask': attention_mask,
1438
+ }
1439
+ )
1440
+ # Forward ALL kwargs that are uninitialized (e.g. `use_cache`).
1441
+ for key, value in kwargs.items():
1442
+ if key not in model_inputs:
1443
+ model_inputs[key] = value
1444
+ return model_inputs
1445
+
1446
+ @staticmethod
1447
+ def _reorder_cache(past_key_values, beam_idx):
1448
+ reordered_past = ()
1449
+ for layer_past in past_key_values:
1450
+ reordered_past += (
1451
+ tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
1452
+ )
1453
+ return reordered_past
1454
+
1455
+ @torch.inference_mode()
1456
+ def chat(self, tokenizer, query: str, history: List[Dict] = None, role: str = 'user',
1457
+ max_length: int = 4096, num_beams=1, do_sample=True, top_p=0.8, temperature=0.3, logits_processor=None,
1458
+ **kwargs):
1459
+ if history is None:
1460
+ history = []
1461
+ if logits_processor:
1462
+ gen_kwargs = {
1463
+ 'max_length': max_length,
1464
+ 'num_beams': num_beams,
1465
+ 'do_sample': do_sample,
1466
+ 'top_p': top_p,
1467
+ 'temperature': temperature,
1468
+ 'logits_processor': logits_processor,
1469
+ **kwargs
1470
+ }
1471
+ else:
1472
+ gen_kwargs = {
1473
+ 'max_length': max_length,
1474
+ 'num_beams': num_beams,
1475
+ 'do_sample': do_sample,
1476
+ 'top_p': top_p,
1477
+ 'temperature': temperature,
1478
+ 'logits_processor': logits_processor,
1479
+ **kwargs
1480
+ }
1481
+
1482
+ history.append({'role': role, 'content': query})
1483
+ history_str = tokenizer.apply_chat_template(history, tokenize=False, add_generation_prompt=False)
1484
+ inputs = tokenizer(history_str, return_tensors='pt').to(self.device)
1485
+ outputs = self.generate(**inputs, **gen_kwargs)
1486
+ outputs = outputs.tolist()[0][len(inputs['input_ids'][0]):-1]
1487
+ response = tokenizer.decode(outputs)
1488
+ pattern = re.compile(r'.*?(?=<AI>|<用户>)', re.DOTALL)
1489
+ matches = pattern.findall(response)
1490
+ if len(matches) > 0:
1491
+ response = matches[0]
1492
+ history.append({'role': 'assistant', 'content': response})
1493
+ return response, history
1494
+
1495
+
+@add_start_docstrings(
+    """
+    The MiniCPM Model transformer with a sequence classification head on top (linear layer).
+
+    [`MiniCPMForSequenceClassification`] uses the last token in order to do the classification, as other causal models
+    (e.g. GPT-2) do.
+
+    Since it does classification on the last token, it requires to know the position of the last token. If a
+    `pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If
+    no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
+    padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in
+    each row of the batch).
+    """,
+    MINICPM_START_DOCSTRING,
+)
+class MiniCPMForSequenceClassification(MiniCPMPreTrainedModel):
+    def __init__(self, config):
+        super().__init__(config)
+        self.num_labels = config.num_labels
+        self.model = MiniCPMModel(config)
+        self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def get_input_embeddings(self):
+        return self.model.embed_tokens
+
+    def set_input_embeddings(self, value):
+        self.model.embed_tokens = value
+
+    @add_start_docstrings_to_model_forward(MINICPM_INPUTS_DOCSTRING)
+    def forward(
+        self,
+        input_ids: torch.LongTensor = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[List[torch.FloatTensor]] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        labels: Optional[torch.LongTensor] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
+        r"""
+        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+            Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
+            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
+            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
+        """
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        transformer_outputs = self.model(
+            input_ids,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_values=past_key_values,
+            inputs_embeds=inputs_embeds,
+            use_cache=use_cache,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+        hidden_states = transformer_outputs[0]
+        logits = self.score(hidden_states)
+
+        if input_ids is not None:
+            batch_size = input_ids.shape[0]
+        else:
+            batch_size = inputs_embeds.shape[0]
+
+        if self.config.pad_token_id is None and batch_size != 1:
+            raise ValueError('Cannot handle batch sizes > 1 if no padding token is defined.')
+        if self.config.pad_token_id is None:
+            sequence_lengths = -1
+        else:
+            if input_ids is not None:
+                sequence_lengths = (torch.eq(input_ids, self.config.pad_token_id).int().argmax(-1) - 1).to(
+                    logits.device
+                )
+            else:
+                sequence_lengths = -1
+
+        pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]
+
+        loss = None
+        if labels is not None:
+            labels = labels.to(logits.device)
+            if self.config.problem_type is None:
+                if self.num_labels == 1:
+                    self.config.problem_type = 'regression'
+                elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
+                    self.config.problem_type = 'single_label_classification'
+                else:
+                    self.config.problem_type = 'multi_label_classification'
+
+            if self.config.problem_type == 'regression':
+                loss_fct = MSELoss()
+                if self.num_labels == 1:
+                    loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
+                else:
+                    loss = loss_fct(pooled_logits, labels)
+            elif self.config.problem_type == 'single_label_classification':
+                loss_fct = CrossEntropyLoss()
+                loss = loss_fct(pooled_logits.view(-1, self.num_labels), labels.view(-1))
+            elif self.config.problem_type == 'multi_label_classification':
+                loss_fct = BCEWithLogitsLoss()
+                loss = loss_fct(pooled_logits, labels)
+        if not return_dict:
+            output = (pooled_logits,) + transformer_outputs[1:]
+            return ((loss,) + output) if loss is not None else output
+
+        return SequenceClassifierOutputWithPast(
+            loss=loss,
+            logits=pooled_logits,
+            past_key_values=transformer_outputs.past_key_values,
+            hidden_states=transformer_outputs.hidden_states,
+            attentions=transformer_outputs.attentions,
+        )
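The pooling step in `MiniCPMForSequenceClassification.forward` picks one position per row: the token just before the first padding token, or the last position when no padding information is available. The sketch below mirrors that index computation on plain Python lists so it can be checked without a model. The function name `last_token_index` is hypothetical, and the list-based search stands in for the `torch.eq(...).int().argmax(-1) - 1` expression in the diff.

```python
from typing import List, Optional


def last_token_index(row: List[int], pad_token_id: Optional[int]) -> int:
    """Index used to pool one row, following the forward() logic in the diff."""
    if pad_token_id is None:
        return -1  # no padding information: take the last position
    # Stand-in for torch.eq(row, pad).int().argmax(-1):
    # the first pad position, or 0 when the row contains no pad token.
    first_pad = next((i for i, tok in enumerate(row) if tok == pad_token_id), 0)
    # Subtracting 1 yields -1 when there is no pad, which indexes the last position.
    return first_pad - 1


if __name__ == "__main__":
    print(last_token_index([5, 6, 7, 0, 0], pad_token_id=0))  # right-padded row
    print(last_token_index([5, 6, 7, 8, 9], pad_token_id=0))  # no padding in this row
```

The computation relies on right padding. Since `tokenizer_config.json` in this commit sets `pad_token` to `null`, a padding token would presumably have to be configured before batched classification is possible.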
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:fe236eb8d3fd7e6bea58f8e44529318687d6be0921df0c1e9cfd8050d01e6808
+size 867818482
special_tokens_map.json ADDED
@@ -0,0 +1,33 @@
+{
+  "additional_special_tokens": [
+    "<|im_end|>",
+    "<|im_start|>",
+    "<|tool_call|>",
+    "<|execute_start|>",
+    "<|execute_end|>",
+    "<|fim_prefix|>",
+    "<|fim_middle|>",
+    "<|fim_suffix|>"
+  ],
+  "bos_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "<|im_end|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer.model ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:bb74d51116831c3bf65db812c553f94ab0c88dcf97a5bbb37e3504f6d359c530
+size 1181204
tokenizer_config.json ADDED
@@ -0,0 +1,117 @@
+{
+  "add_bos_token": true,
+  "add_eos_token": false,
+  "add_prefix_space": null,
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "73440": {
+      "content": "<|im_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "73441": {
+      "content": "<|im_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "73442": {
+      "content": "<|tool_call|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "73443": {
+      "content": "<|execute_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "73444": {
+      "content": "<|execute_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "73445": {
+      "content": "<|fim_prefix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "73446": {
+      "content": "<|fim_middle|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "73447": {
+      "content": "<|fim_suffix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "additional_special_tokens": [
+    "<|im_end|>",
+    "<|im_start|>",
+    "<|tool_call|>",
+    "<|execute_start|>",
+    "<|execute_end|>",
+    "<|fim_prefix|>",
+    "<|fim_middle|>",
+    "<|fim_suffix|>"
+  ],
+  "bos_token": "<s>",
+  "chat_template": "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|im_end|>",
+  "legacy": true,
+  "model_max_length": 1000000000000000019884624838656,
+  "pad_token": null,
+  "sp_model_kwargs": {},
+  "spaces_between_special_tokens": false,
+  "tokenizer_class": "LlamaTokenizer",
+  "unk_token": "<unk>",
+  "use_default_system_prompt": false
+}
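The `chat_template` entry above is a Jinja template that wraps each message in `<|im_start|>` and `<|im_end|>` markers. As a rough illustration of the string it produces, the sketch below re-implements the same concatenation in plain Python. The function name `render_chatml` is hypothetical; in practice the rendering is done by `tokenizer.apply_chat_template`, and this sketch does not add the `<s>` token, which `add_bos_token: true` suggests is prepended at tokenization time.

```python
from typing import Dict, List


def render_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = False) -> str:
    """Plain-Python equivalent of the chat_template string in tokenizer_config.json."""
    # Each message becomes: <|im_start|>{role}\n{content}<|im_end|>\n
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


if __name__ == "__main__":
    history = [{"role": "user", "content": "Hi"}]
    print(render_chatml(history, add_generation_prompt=True))
```

With `add_generation_prompt=False`, which is the value the `chat` method in this commit passes, the rendered prompt ends after the user turn and no assistant header is opened.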