anas LongCat0830 committed on
Commit 60982a0 · 0 Parent(s)

Duplicate from meituan-longcat/LongCat-Flash-Lite

Co-authored-by: LongCat <LongCat0830@users.noreply.huggingface.co>

.gitattributes ADDED
@@ -0,0 +1,36 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
tech_report.pdf filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
File without changes
LICENSE ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Meituan

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
README.md ADDED
@@ -0,0 +1,260 @@
---
license: mit
library_name: LongCat-Flash-Lite
pipeline_tag: text-generation
tags:
- transformers
---

# LongCat-Flash-Lite

<div align="center">
  <img src="https://raw.githubusercontent.com/meituan-longcat/LongCat-Flash-Chat/main/figures/longcat_logo.svg"
       width="300"
       alt="LongCat Logo"/>
</div>

<hr>

<div align="center" style="line-height: 1;">
  <a href="https://huggingface.co/meituan-longcat" target="_blank" style="margin: 2px;">
    <img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-LongCat-ffc107?color=ffc107&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
  </a>
</div>

<div align="center" style="line-height: 1;">
  <a href="https://github.com/meituan-longcat/LongCat-Flash-Chat/blob/main/figures/wechat_official_accounts.png" target="_blank" style="margin: 2px;">
    <img alt="WeChat" src="https://img.shields.io/badge/WeChat-LongCat-brightgreen?logo=wechat&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
  </a>
  <a href="https://x.com/Meituan_LongCat" target="_blank" style="margin: 2px;">
    <img alt="Twitter Follow" src="https://img.shields.io/badge/Twitter-LongCat-white?logo=x&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
  </a>
</div>

<div align="center" style="line-height: 1;">
  <a href="https://huggingface.co/meituan-longcat/LongCat-Flash-Chat/blob/main/LICENSE" style="margin: 2px;">
    <img alt="License" src="https://img.shields.io/badge/License-MIT-f5de53?&color=f5de53" style="display: inline-block; vertical-align: middle;"/>
  </a>
</div>

<p align="center">
  <a href="https://arxiv.org/abs/2601.21204"><b>Tech Report</b>&nbsp;📄</a>
</p>

## Model Introduction
We introduce LongCat-Flash-Lite, a non-thinking 68.5B-parameter Mixture-of-Experts (MoE) model with approximately 3B activated parameters, supporting a 256k context length through the YaRN method. Building upon the LongCat-Flash architecture, LongCat-Flash-Lite distinguishes itself through the integration of an **N-gram embedding** table designed to enhance both model performance and inference speed. Despite allocating over 30B parameters to embeddings, LongCat-Flash-Lite not only outperforms parameter-equivalent MoE baselines but is also highly competitive with existing models of comparable scale, particularly in the agentic and coding domains.
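To build intuition for the N-gram embedding table, the sketch below shows one way hashed N-gram ids can be derived from a token sequence and then used as extra rows in a large shared embedding table added to the ordinary token embedding. The hash function, the constant `prime`, and the function name are illustrative assumptions for this card, not the released implementation:

```python
def ngram_ids(token_ids, ngram_vocab_size, max_n=3, prime=1000003):
    """Map each position to hashed N-gram ids over its left context.

    Illustrative only: the real hashing scheme and table layout are not
    published here. Each returned id would index a row of a large shared
    N-gram embedding table, summed with the usual token embedding.
    """
    out = []
    for i, tok in enumerate(token_ids):
        ids = [tok]  # the ordinary unigram (token) id
        h = tok
        for n in range(1, max_n):
            if i - n < 0:
                break  # not enough left context at the sequence start
            # rolling hash over the n-gram ending at position i
            h = (h * prime + token_ids[i - n]) % ngram_vocab_size
            ids.append(h)
        out.append(ids)
    return out
```

Unlike routed FFN experts, such a lookup is a single sparse memory read per N-gram, which is what makes the scheme attractive for inference.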

### Key Features

#### 🌟 Superior Scaling Efficiency: A Better Alternative to MoE
Through comprehensive scaling experiments across diverse scenarios, we identify regimes in which embedding scaling achieves a superior Pareto frontier compared to increasing the number of experts, offering a highly efficient alternative for model scaling. We further delineate the architectural factors that determine embedding-scaling efficacy: integration timing, parameter budgeting, hash-collision mitigation, hyperparameter configuration, and embedding initialization, alongside the impact of model width and depth.

#### 🌟 Superior Inference Efficiency with Specialized System Optimization
In contrast to FFN-based experts, the N-gram embedding table inherently mitigates I/O bottlenecks within MoE layers, yielding substantial improvements in inference latency. Furthermore, we introduce a specialized N-gram Cache and develop synchronized kernels, which together deliver a significant boost to inference efficiency.
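The N-gram Cache is only described at a high level here; one plausible reading, sketched below with hypothetical names, is memoization of table lookups during decoding, since consecutive steps share most of their N-grams:

```python
class NgramLookupCache:
    """Illustrative sketch only: the released cache design is not specified
    in this card. The idea shown is to memoize N-gram table fetches so each
    decode step pays only for the few N-grams introduced by the newest token."""

    def __init__(self, lookup_fn):
        self.lookup_fn = lookup_fn  # underlying (expensive) table fetch
        self.cache = {}

    def get(self, ngram):
        # ngram is a tuple of token ids, so it is hashable
        if ngram not in self.cache:
            self.cache[ngram] = self.lookup_fn(ngram)
        return self.cache[ngram]
```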

#### 🌟 Strong Agentic and Coding Performance
LongCat-Flash-Lite demonstrates robust agentic tool-use and coding capabilities that are highly competitive for its model scale.

Please refer to our [technical report](https://arxiv.org/abs/2601.21204) for details!

## Evaluation Results

| Benchmark | Kimi-Linear-48B-A3B | Qwen3-Next-80B-A3B-Instruct | Gemini 2.5 Flash-Lite | LongCat-Flash-Lite |
|----------|---------------------|----------------------------|----------------------|---------|
| **Architecture** | MoE | MoE | - | MoE + NE |
| **# Total Params** | 48B | 80B | - | 68.5B |
| **# Activated Params** | 3B | 3B | - | 2.9B~4.5B |
| **Agentic Tool Use** | | | | |
| Tau2-Airline (avg@8) | 44.00 | 45.5* | 35.00 | 58.00 |
| Tau2-Retail (avg@8) | 18.86 | 57.3* | 37.50 | 73.10 |
| Tau2-Telecom (avg@8) | 15.68 | 13.2* | 21.93 | 72.80 |
| **Agentic Coding** | | | | |
| SWE-Bench (acc) | 32.80 | 37.60 | 41.3* | 54.40 |
| TerminalBench (acc) | 20.00 | 15.19 | 20.00 | 33.75 |
| SWE-Bench Multilingual | 37.20 | 31.30 | - | 38.10 |
| PRDBench | - | 15.36 | - | 39.63 |
| **General Domains** | | | | |
| GPQA-Diamond (avg@16) | 69.89 | 74.33 | 70.20* | 66.78 |
| MMLU (acc) | 79.91 | 89.28 | 84.68 | 85.52 |
| MMLU-Pro (acc) | 67.22 | 82.93 | 78.95 | 78.29 |
| CEval (acc) | 78.48 | 90.91 | 75.16 | 86.55 |
| CMMLU (acc) | 76.26 | 86.50 | 72.06 | 82.48 |
| **Mathematical Reasoning** | | | | |
| MATH500 (acc) | 94.20 | 98.00 | 95.20 | 96.80 |
| AIME24 (avg@32) | 70.52 | 81.35 | 63.33 | 72.19 |
| AIME25 (avg@32) | 59.58 | 68.44 | 50.1* | 63.23 |

> **Note:** Values marked with * are sourced from public reports. NE is an abbreviation of N-gram Embedding.

## Quick Start
To use LongCat-Flash-Lite with transformers, you need at least 2 GPUs (80GB VRAM each, e.g., H100/A100 80GB). We recommend the following environment:

* `python` >= 3.10
* `torch` >= 2.6
* `transformers` >= 4.57.6
* `accelerate` >= 1.10.0

```shell
pip install -U transformers==4.57.6 accelerate==1.10.0
```

Basic Usage Example:
```py
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meituan-longcat/LongCat-Flash-Lite"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me a brief introduction to large language models."}
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)
generated_ids = model.generate(inputs=input_ids, max_new_tokens=256)
output_ids = generated_ids[0][len(input_ids[0]):].tolist()
response = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")
print(response)
```

Tool Calling Example:
```py
tools = [
    {
        "type": "function",
        "function": {
            "name": "func_add",
            "description": "Calculate the sum of two numbers",
            "parameters": {
                "type": "object",
                "properties": {
                    "x1": {"type": "number", "description": "The first addend"},
                    "x2": {"type": "number", "description": "The second addend"}
                },
                "required": ["x1", "x2"]
            }
        }
    }
]
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Please tell me what is $$125679 + 234519$$?"},
    {
        "role": "assistant",
        "content": "I'll calculate the sum of 125679 and 234519 for you.",
        "tool_calls": [{"type": "function", "function": {"name": "func_add", "arguments": {"x1": 125679, "x2": 234519}}}]
    },
    {"role": "tool", "name": "func_add", "content": '{"ans": 360198}'}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)
generated_ids = model.generate(inputs=input_ids, max_new_tokens=256)
output_ids = generated_ids[0][len(input_ids[0]):].tolist()
response = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")
print(response)
```

Response Parsing:

```python
from parse_model_response import parse_model_response

response = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")
parsed_message = parse_model_response(response, tools)
```
See [`parse_model_response.py`](./parse_model_response.py) for detailed implementation and examples.

<br>

Recommended Sampling Settings:
```json
{ "repetition_penalty": 1.06, "temperature": 0.7, "top_p": 0.95, "top_k": 4 }
```
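These values also ship in `generation_config.json`, so the defaults already apply. When passing them explicitly to `generate()`, note that `do_sample=True` is required; without it, `generate()` falls back to greedy decoding and the temperature/top-p/top-k settings have no effect. A minimal sketch:

```python
# Recommended sampling settings as generate() keyword arguments.
# do_sample=True is what actually enables sampling in transformers.
sampling_kwargs = {
    "do_sample": True,
    "repetition_penalty": 1.06,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 4,
}
# Usage with the examples above:
# generated_ids = model.generate(inputs=input_ids, max_new_tokens=256, **sampling_kwargs)
```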

## Deployment

We have implemented basic adaptations in SGLang ([PR](https://github.com/sgl-project/sglang/pull/17838)) to support the deployment of LongCat-Flash-Lite.

LongCat-Flash-Lite can be served on a single node (e.g., 8xH20-141G) using a combination of Tensor Parallelism and Expert Parallelism.

First, compile and install the updated sgl-kernel:

```shell
cd sgl-kernel
python3 -m uv build --wheel --color=always --no-build-isolation \
    -Ccmake.define.SGL_KERNEL_ENABLE_SM90A=1 \
    -Ccmake.define.CMAKE_POLICY_VERSION_MINIMUM=3.5 \
    -Cbuild-dir=build .
pip3 install dist/sgl_kernel-0.3.21-cp310-abi3-linux_x86_64.whl --force-reinstall
```

Then launch the server:
```shell
python3 -m sglang.launch_server \
    --model meituan-longcat/LongCat-Flash-Lite \
    --port 8080 \
    --host 0.0.0.0 \
    --mem-fraction-static 0.9 \
    --max-running-requests 64 \
    --trust-remote-code \
    --skip-server-warmup \
    --attention-backend flashinfer \
    --ep 8 \
    --tp 8 \
    --disable-cuda-graph
```

## License Agreement

This repository, including both the model weights and the source code, is released under the **MIT License**.

Any contributions to this repository are licensed under the MIT License, unless otherwise stated. This license does not grant any rights to use Meituan trademarks or patents.

For details, see the [LICENSE](./LICENSE) file.

## Usage Considerations
This model has not been specifically designed or comprehensively evaluated for every possible downstream application.

Developers should take into account the known limitations of large language models, including performance variations across different languages, and carefully assess accuracy, safety, and fairness before deploying the model in sensitive or high-risk scenarios.
It is the responsibility of developers and downstream users to understand and comply with all applicable laws and regulations relevant to their use case, including but not limited to data protection, privacy, and content safety requirements.

Nothing in this Model Card should be interpreted as altering or restricting the terms of the MIT License under which the model is released.

## Citation

We kindly encourage citation of our work if you find it useful.

```bibtex
@misc{liu2026scalingembeddingsoutperformsscaling,
      title={Scaling Embeddings Outperforms Scaling Experts in Language Models},
      author={Hong Liu and Jiaqi Zhang and Chao Wang and Xing Hu and Linkun Lyu and Jiaqi Sun and Xurui Yang and Bo Wang and Fengcun Li and Yulei Qian and Lingtong Si and Yerui Sun and Rumei Li and Peng Pei and Yuchen Xie and Xunliang Cai},
      year={2026},
      eprint={2601.21204},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2601.21204},
}
```

## Contact
Please contact us at <a href="mailto:longcat-team@meituan.com">longcat-team@meituan.com</a> or open an issue if you have any questions.
config.json ADDED
@@ -0,0 +1,51 @@
{
  "architectures": [
    "LongcatFlashNgramForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoConfig": "configuration_longcat_ngram.LongcatFlashNgramConfig",
    "AutoModel": "modeling_longcat_ngram.LongcatFlashNgramModel",
    "AutoModelForCausalLM": "modeling_longcat_ngram.LongcatFlashNgramForCausalLM"
  },
  "vocab_size": 131072,
  "hidden_size": 3072,
  "ffn_hidden_size": 6144,
  "expert_ffn_hidden_size": 1024,
  "num_layers": 14,
  "num_attention_heads": 32,
  "kv_lora_rank": 512,
  "q_lora_rank": 1536,
  "qk_rope_head_dim": 64,
  "v_head_dim": 128,
  "qk_nope_head_dim": 128,
  "mla_scale_q_lora": true,
  "mla_scale_kv_lora": true,
  "routed_scaling_factor": 6.0,
  "n_routed_experts": 256,
  "rms_norm_eps": 1e-5,
  "use_cache": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "rope_theta": 5000000.0,
  "max_position_embeddings": 327680,
  "rope_scaling": {
    "original_max_position_embeddings": 32768,
    "rope_type": "yarn",
    "factor": 10,
    "beta_fast": 32,
    "beta_slow": 1,
    "mscale": 1,
    "mscale_all_dim": 1
  },
  "zero_expert_num": 128,
  "zero_expert_type": "identity",
  "moe_topk": 12,
  "ngram_vocab_size_ratio": 78,
  "emb_neighbor_num": 4,
  "emb_split_num": 4,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.57.6"
}
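A quick sanity check on these values: the configuration documentation states that the N-gram vocabulary size is `vocab_size * ngram_vocab_size_ratio`. With the numbers above, and assuming (purely for illustration; the checkpoint layout is not stated here) one `hidden_size`-dimensional vector per slot, the table size lines up with the "over 30B parameters in embeddings" figure from the model card:

```python
# Values from config.json
vocab_size = 131072
ngram_vocab_size_ratio = 78
hidden_size = 3072

# Documented rule: N-gram vocab = vocab_size * ngram_vocab_size_ratio
ngram_vocab_size = vocab_size * ngram_vocab_size_ratio
print(ngram_vocab_size)  # 10223616 hashed N-gram slots

# Illustrative estimate (per-slot vector layout is an assumption):
approx_params = ngram_vocab_size * hidden_size
print(round(approx_params / 1e9, 1))  # 31.4 (billions of parameters)
```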
configuration_longcat_ngram.py ADDED
@@ -0,0 +1,216 @@
from transformers.models.longcat_flash import LongcatFlashConfig


class LongcatFlashNgramConfig(LongcatFlashConfig):
    r"""
    This is the configuration class to store the configuration of a [`LongcatFlashNgramModel`]. It is used to instantiate
    a LongCat Flash model with N-gram enhanced embeddings according to the specified arguments, defining the model architecture.
    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        vocab_size (`int`, *optional*, defaults to 131072):
            Vocabulary size of the LongCat Flash model. Defines the number of different tokens that can be
            represented by the `input_ids` passed when calling [`LongcatFlashNgramModel`].
        hidden_size (`int`, *optional*, defaults to 6144):
            Dimension of the hidden representations.
        num_hidden_layers (`int`, *optional*, defaults to 56):
            Number of hidden layers in the Transformer decoder.
        num_layers (`int`, *optional*, defaults to 28):
            Number of layers, each with 2 sublayers.
        num_attention_heads (`int`, *optional*, defaults to 64):
            Number of attention heads for each attention layer in the Transformer decoder.
        num_key_value_heads (`int`, *optional*):
            This is the number of key_value heads that should be used to implement Grouped Query Attention. If
            `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
            `num_key_value_heads=1` the model will use Multi Query Attention (MQA), otherwise GQA is used. When
            converting a multi-head checkpoint to a GQA checkpoint, each group's key and value head should be
            constructed by mean-pooling all the original heads within that group. For more details, check out [this
            paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
            `num_attention_heads`.
        hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
            The non-linear activation function (function or string) in the decoder.
        max_position_embeddings (`int`, *optional*, defaults to 131072):
            The maximum sequence length that this model might ever be used with.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        rms_norm_eps (`float`, *optional*, defaults to 1e-05):
            The epsilon value used by the RMS normalization layers.
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models). Only
            relevant if `config.is_decoder=True`.
        pad_token_id (`int`, *optional*):
            Padding token id.
        bos_token_id (`int`, *optional*, defaults to 1):
            Beginning of stream token id.
        eos_token_id (`int`, *optional*, defaults to 2):
            End of stream token id.
        tie_word_embeddings (`bool`, *optional*, defaults to `False`):
            Whether to tie input and output embeddings.
        rope_theta (`float`, *optional*, defaults to 10000000.0):
            The base period of the RoPE embeddings.
        rope_scaling (`Dict`, *optional*):
            Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
            strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format
            is `{"type": strategy name, "factor": scaling factor}`.
        attention_bias (`bool`, *optional*, defaults to `False`):
            Whether to use a bias in the query, key, value and output projection layers during self-attention.
        attention_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.
        ffn_hidden_size (`int`, *optional*, defaults to 12288):
            Dimension of the MLP representations.
        q_lora_rank (`int`, *optional*, defaults to 1536):
            The rank of the query LoRA projection in MLA (Multi-head Latent Attention).
        kv_lora_rank (`int`, *optional*, defaults to 512):
            The rank of the key-value LoRA projection in MLA.
        qk_nope_head_dim (`int`, *optional*, defaults to 128):
            The dimension of the non-position-encoding part of query/key heads.
        qk_rope_head_dim (`int`, *optional*, defaults to 64):
            The dimension of the RoPE part of query/key heads.
        head_dim (`int`, *optional*, defaults to 64):
            Standard dimension of qk heads, unused except for CI.
        v_head_dim (`int`, *optional*, defaults to 128):
            The dimension of value heads.
        qk_head_dim (`int`, *optional*):
            The total dimension of query/key heads. If not specified, set to `qk_nope_head_dim + qk_rope_head_dim`.
        moe_topk (`int`, *optional*, defaults to 12):
            Number of experts to route to for each token in the MoE layer.
        n_routed_experts (`int`, *optional*, defaults to 512):
            Number of routed experts in the MoE layer.
        zero_expert_num (`int`, *optional*, defaults to 256):
            Number of zero experts (identity function) to add to the expert pool.
        expert_ffn_hidden_size (`int`, *optional*, defaults to 2048):
            Hidden size of individual expert FFN layers.
        routed_scaling_factor (`float`, *optional*, defaults to 6.0):
            Scaling factor applied to the routing weights.
        emb_neighbor_num (`int`, *optional*):
            Maximum N-gram length for N-gram embeddings. This parameter determines the context window size for
            N-gram computation. Higher values capture longer-range lexical patterns but increase memory usage.
        emb_split_num (`int`, *optional*):
            Number of hash functions (or splits) to use for N-gram embeddings. Multiple hash functions help improve
            the quality of N-gram representations.
        ngram_vocab_size_ratio (`float`, *optional*):
            Ratio multiplier for the N-gram vocabulary size relative to the base vocabulary size. The N-gram
            vocabulary size is calculated as `vocab_size * ngram_vocab_size_ratio`.

    Example:
    ```python
    >>> from transformers import LongcatFlashNgramModel, LongcatFlashNgramConfig

    >>> # Initializing a LongCat Flash N-gram style configuration
    >>> configuration = LongcatFlashNgramConfig(
    ...     emb_neighbor_num=3,
    ...     emb_split_num=4,
    ...     ngram_vocab_size_ratio=1.5
    ... )

    >>> # Initializing a model from the configuration
    >>> model = LongcatFlashNgramModel(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""

    model_type = "longcat_flash_ngram"
    keys_to_ignore_at_inference = ["past_key_values"]
    base_model_tp_plan = {
        "layers.*.self_attn.*.q_b_proj": "colwise",
        "layers.*.self_attn.*.kv_b_proj": "colwise",
        "layers.*.self_attn.*.o_proj": "rowwise",
        "layers.*.mlps.*.gate_proj": "colwise",
        "layers.*.mlps.*.up_proj": "colwise",
        "layers.*.mlps.*.down_proj": "rowwise",
        "layers.*.mlp.experts.*.gate_proj": "colwise",
        "layers.*.mlp.experts.*.up_proj": "colwise",
        "layers.*.mlp.experts.*.down_proj": "rowwise",
    }

    base_model_pp_plan = {
        "embed_tokens": (["input_ids"], ["inputs_embeds"]),
        "layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
        "norm": (["hidden_states"], ["hidden_states"]),
    }

    def __init__(
        self,
        vocab_size=131072,
        hidden_size=6144,
        num_hidden_layers=56,
        num_layers=28,
        num_attention_heads=64,
        num_key_value_heads=None,
        hidden_act="silu",
        max_position_embeddings=131072,
        initializer_range=0.02,
        rms_norm_eps=1e-5,
        use_cache=True,
        pad_token_id=None,
        bos_token_id=1,
        eos_token_id=2,
        tie_word_embeddings=False,
        rope_theta=10000000.0,
        rope_scaling=None,
        attention_bias=False,
        attention_dropout=0.0,
        ffn_hidden_size=12288,
        q_lora_rank=1536,
        kv_lora_rank=512,
        qk_nope_head_dim=128,
        qk_rope_head_dim=64,
        head_dim=64,
        v_head_dim=128,
        qk_head_dim=None,
        moe_topk=12,
        n_routed_experts=512,
        zero_expert_num=256,
        expert_ffn_hidden_size=2048,
        routed_scaling_factor=6.0,
        emb_neighbor_num=None,
        emb_split_num=None,
        ngram_vocab_size_ratio=None,
        **kwargs,
    ):
        # N-gram embedding specific parameters
        self.emb_neighbor_num = emb_neighbor_num
        self.emb_split_num = emb_split_num
        self.ngram_vocab_size_ratio = ngram_vocab_size_ratio

        super().__init__(
            vocab_size=vocab_size,
            hidden_size=hidden_size,
            num_hidden_layers=num_hidden_layers,
            num_layers=num_layers,
            num_attention_heads=num_attention_heads,
            num_key_value_heads=num_key_value_heads,
            hidden_act=hidden_act,
            max_position_embeddings=max_position_embeddings,
            initializer_range=initializer_range,
            rms_norm_eps=rms_norm_eps,
            use_cache=use_cache,
            pad_token_id=pad_token_id,
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            tie_word_embeddings=tie_word_embeddings,
            rope_theta=rope_theta,
            rope_scaling=rope_scaling,
            attention_bias=attention_bias,
            attention_dropout=attention_dropout,
            ffn_hidden_size=ffn_hidden_size,
            q_lora_rank=q_lora_rank,
            kv_lora_rank=kv_lora_rank,
            qk_nope_head_dim=qk_nope_head_dim,
            qk_rope_head_dim=qk_rope_head_dim,
            head_dim=head_dim,
            v_head_dim=v_head_dim,
            qk_head_dim=qk_head_dim,
            moe_topk=moe_topk,
            n_routed_experts=n_routed_experts,
            zero_expert_num=zero_expert_num,
            expert_ffn_hidden_size=expert_ffn_hidden_size,
            routed_scaling_factor=routed_scaling_factor,
            **kwargs,
        )


__all__ = ["LongcatFlashNgramConfig"]
generation_config.json ADDED
@@ -0,0 +1,11 @@
{
  "_from_model_config": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 3,
  "transformers_version": "4.57.6",
  "repetition_penalty": 1.06,
  "temperature": 0.7,
  "top_p": 0.95,
  "top_k": 4
}
model-00000-of-00026.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:19412c9358bc484c8c9af94bb2e135a6939a39441488c6c042b16261bfe746ad
size 5313604160
model-00001-of-00026.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bff23594192d36756bdac96c7ee67dd641b8969f7d37c12d2a68a5e3ed9b8d87
size 5313604768
model-00002-of-00026.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3dcfc95fd273d4a8a57674916db168d977c10cddb87e0cc5b1fd5fc682ad121f
size 5313604016
model-00003-of-00026.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b465510bc231ab86c9f7403bd82e03a12b72b323cbe319547453b786577791d1
size 5313604136
model-00004-of-00026.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bd0ca68e274a10a89944da1984cf583adb7b51a9f54782f4e72ba6e6305aa7c8
size 5315503392
model-00005-of-00026.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:edfcfed05027a48b2cab645e44b7dc8ec0544c1a64a0277bb0b76de703e85913
size 5313604136
model-00006-of-00026.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:29959ee6a47ed392e05966031f46a589d9c41cfce369ee0be0f32a83246bc3ba
size 5313604136
model-00007-of-00026.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ba85ee575bdafb34270bbab3f7256c87178c3a2c05a5e637bb9ee83e565fdf2b
size 5313604760
model-00008-of-00026.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0533404a0b7ee44b0ade17bf9b9c00969e37cf864ff5d5f7818b030c702abb06
size 5313604216
model-00009-of-00026.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b65ca6876f65b690af110232c4d460c912beab21c195c4e0170e9f01ca1d8bb2
size 5315105048
model-00010-of-00026.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8d1032fedf9f1029544ac164405278774fd2f9e3b8399d8466b797d08f756fc6
size 5315104024
model-00011-of-00026.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1e13ef0f626c36a424cbed2e4181b853384de4cb8ad0660b65084c7408cfc729
size 5315103008
model-00012-of-00026.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d1c91f60ac661af3c195337d4ca771e274947ef8e7f258607b79e0e19a02c30d
size 5315190600
model-00013-of-00026.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2b6994f58b462df2107a899881309116d8b668e78588786e33fdcda5deee7308
size 5315190584
model-00014-of-00026.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a646727dce5f710bb23f38b01faae5a97e22966c0152a7f538ba711ec730ea9e
size 5315190384
model-00015-of-00026.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2fe62f582173fe86bac2a206b5b0d358a745355136c968401203305456b3e220
size 5315199112
model-00016-of-00026.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:97d63ab634d0d6ee498b3e97db332c283b7d4c2c4aa65f2d5d949d7451a6c75c
size 5315199120
model-00017-of-00026.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d3cc16384a3b881669a1616091877297f08c6e6fd1a212b79a991f1adc7d3192
size 5315199120
model-00018-of-00026.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6130194e56cd991ffc0bcaab6f2ca2198e2a38db30b9cf3eef1833da86b97c80
3
+ size 5315199168
model-00019-of-00026.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:19b89d889c49b935ba4c6579c2b5f9fd3a5005b97cd719059ac9ee3237f99e7c
3
+ size 5315199192
model-00020-of-00026.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2d274ec0a1f29bd9fa060c39de83e8dafd62e5b3882d6f907bf6f7fc40535a4b
3
+ size 5314674968
model-00021-of-00026.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:701ed226f316cd6d608234d2f77714b54fc071445beba1fd9da9375d2115b831
3
+ size 5314674952
model-00022-of-00026.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6c194582d2c6af1bcf698927aeff47128401c0302047c8465b5ab5882e7e697d
3
+ size 5315068040
model-00023-of-00026.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4a084318a109e2776edd5a0fe71f87db3a513843c1eda6c374eed9469ad2c965
3
+ size 5315068048
model-00024-of-00026.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b2b2af8fa51ccc188f5c55afac3e3e0b59fb4c713dda44f9ec7820749cd851d6
3
+ size 5315199128
model-00025-of-00026.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c7643ca1f768712f319b01e552071b67f8550fb42c633bcee98b5fc40ab5501a
3
+ size 5315199232
model.safetensors.index.json ADDED
The diff for this file is too large to render. See raw diff
 
model_hashes.txt ADDED
@@ -0,0 +1,38 @@
+ 022ef6dfbf2f526e0ef9a66d636ab246675815f9877f6654f45d6cda522ec9e9 ./model-00035-of-00038.safetensors
+ 0c9d09758a0617c2b3ecc401f55141537114cf484b549a161628751ee9b699fa ./model-00037-of-00038.safetensors
+ 0cb681b76b8cf18fb5a8ae136923bcb1ff79b2d706b8e73c8677decad7ff5648 ./model-00029-of-00038.safetensors
+ 139cc01fc2445267a046da2939d589367d17cc146be8259f69bfabc9e23eca63 ./model-00025-of-00038.safetensors
+ 15b59cf9ecdb92b277d5b7a10f25f35db7373daa5050d55a65f8be2a1f283624 ./model-00011-of-00038.safetensors
+ 1a66b7d59d3bb94d1874dfa7fc7b91afa2632af4d2dd19b14ee503596b38f4e2 ./model-00007-of-00038.safetensors
+ 1ccc5afa878158ddab54d0e8381e4e91ecc524ebd2f99aa28cdb8e71fc6ed155 ./model-00034-of-00038.safetensors
+ 2151fa0ae1c16356c7606f7fdd0778aa49f2727f5341b7daa19ac3b8afbf6e0e ./model-00031-of-00038.safetensors
+ 239b6c3f708e901b273711c71f7b85619d462dcca9130e3506d6151f75b5387c ./model-00026-of-00038.safetensors
+ 244ba1cc3f7da33f1b3bfce6eec2a38c1b1410eb1aea22c7796bd0e2bec53dd5 ./model-00002-of-00038.safetensors
+ 3647fdc2158e2201da25a6e6b93ebc77d52e085f28995920c94346d79f5da845 ./model-00016-of-00038.safetensors
+ 367a91bc240583a8a701b41ded917297cebd04113cdd13c3b1703be396a1fdab ./model-00006-of-00038.safetensors
+ 3c91728af1195b4b7172a2190606fd3cfb53604aef9f0a27f951af509824e740 ./model-00028-of-00038.safetensors
+ 3ecba8f4358c5fa7a84c4e2a38b32e1c579e0601d661822288cd4ac629b58f71 ./model-00014-of-00038.safetensors
+ 411ac9baf753252971f9b7072ac08c950ff66c2acd410207a5c48480d898f532 ./model-00005-of-00038.safetensors
+ 4862c4cc80ac36701781ed02a84822a152e7a285b57cf1260f8fd23209d99cce ./model-00001-of-00038.safetensors
+ 4c5828059c2f5e3fb067e07232856190a018863fae38a1e1858fc2dd1fb86ca1 ./model-00019-of-00038.safetensors
+ 58fa5258626248582925148ffaec79f752045d76b3073eff9035f78244174e27 ./model-00000-of-00038.safetensors
+ 590f6a7b7577a82bd7edf6434462316190aea55ccf2529053bd3465d0c232e52 ./model-00032-of-00038.safetensors
+ 5f964d48cf6d02f34fc4ba242bbe1a11d83c4df5fa660695f9fee53b53b84eab ./model-00010-of-00038.safetensors
+ 70511491f956b62c2418781e3d9db56df42510e807735ec6d8f395ccb3340e82 ./model-00012-of-00038.safetensors
+ 7298c81efffeee772390b48e182e7d033be69c1d5df86c24a54d4f68f93b726a ./model-00030-of-00038.safetensors
+ 73aebcb63519a519befe68cb1c26b05bcc310ee14dfac44ca49b1f18d82c69bf ./model-00003-of-00038.safetensors
+ 7d14972288a2e73168f1fde0536b53bd9cd4e83daa3902cd8679fce38b8f677e ./model-00036-of-00038.safetensors
+ 9372820581b918bfa6866b6628013f331d91946846c65062374750cd5ca16fe9 ./model-00018-of-00038.safetensors
+ 94f4f1b53f863bf2e4fb7f817bfd722c4e3f712180a47b4c173efe86f9f708ff ./model-00020-of-00038.safetensors
+ 978483a120263813f2f0d1f0a4f741390202d0364e5c449b79e8f0fc05819a1c ./model-00017-of-00038.safetensors
+ 982852a74d080f1567e6a02aee0b60ef56bf37f5073cf2ae904bd36561f5f395 ./model-00027-of-00038.safetensors
+ 99d1e0f997b7f6e4cc7c2245649f0facbcef377bf88f883a31f2faa557cd8ac9 ./model-00009-of-00038.safetensors
+ a8fee6ecd8a561d353cd135a681252f923dc247d27c624d083b45c21354d4080 ./model-00023-of-00038.safetensors
+ ad229f37d38cd12675367219d16f803f2a07d5fb71bc84c9d9c62598692a4100 ./model-00024-of-00038.safetensors
+ ae5f05ffeeeceaa0d09a4703340fd7e7834d87d23d8a7b5f928a3a767a4d29cf ./model-00021-of-00038.safetensors
+ c549541ab83b5013190c191de4db8ff7dcabc5fa78cb68796970dabda4a14636 ./model-00015-of-00038.safetensors
+ e10c54ec49ab87bb7abf1b9df07b2767a37a26ddf1c62014d464144c01ff8ea9 ./model-00008-of-00038.safetensors
+ e614af5d55350bda182d21e2e39a69c4436f416f301bc536a8280d6e5fbbc465 ./model-00013-of-00038.safetensors
+ ebae9b3f3ff94b0b36b352ec72a0cc228fbe611dc59b3026a44d932961654791 ./model-00022-of-00038.safetensors
+ ed6f0578b36b47680355b736a48ab732380b77d29d29f8c0e18386397523ca59 ./model-00004-of-00038.safetensors
+ f95d2a591d0ca4cb93db00a13c691f006ac8aaf0a4aeb09fc84447f0176aab10 ./model-00033-of-00038.safetensors
modeling_longcat_ngram.py ADDED
@@ -0,0 +1,338 @@
+ # -*- coding: utf-8 -*-
+ # Copyright (c) 2025 Meituan
+ # This code is licensed under the MIT License, for details, see the ./LICENSE file.
+
+ from typing import Optional, Tuple, Dict, List
+
+ import torch
+ from torch import nn
+
+ from transformers.cache_utils import Cache, DynamicCache
+ from transformers.masking_utils import create_causal_mask
+ from transformers.modeling_outputs import BaseModelOutputWithPast
+ from transformers.processing_utils import Unpack
+ from transformers.utils import auto_docstring, logging
+ from transformers.models.longcat_flash.modeling_longcat_flash import (
+     LongcatFlashForCausalLM,
+     LongcatFlashModel,
+     LongcatFlashRMSNorm,
+     LongcatFlashRotaryEmbedding,
+     LongcatFlashDecoderLayer,
+     LongcatFlashPreTrainedModel,
+ )
+ from .configuration_longcat_ngram import LongcatFlashNgramConfig
+
+ logger = logging.get_logger(__name__)
+
+
+ @auto_docstring
+ class LongcatFlashNgramPreTrainedModel(LongcatFlashPreTrainedModel):
+     pass
+
+
+ class NgramCache(DynamicCache):
+     """
+     Extended DynamicCache for storing N-gram context alongside KV cache.
+     """
+     def __init__(self, config=None):
+         super().__init__()
+         self.ngram_context = None
+         # Keep only n-1 tokens (minimum needed for N-gram computation)
+         self.max_context_len = config.emb_neighbor_num - 1
+
+     def update_ngram_context(self, new_tokens: torch.Tensor) -> None:
+         """
+         Update N-gram context with window management.
+
+         Args:
+             new_tokens: New tokens to append, shape (batch_size, seq_len)
+         """
+         if self.ngram_context is None:
+             self.ngram_context = new_tokens.clone()
+         else:
+             self.ngram_context = torch.cat([self.ngram_context, new_tokens], dim=-1)
+
+         # Truncate to maintain constant memory footprint
+         if self.ngram_context.size(-1) > self.max_context_len:
+             self.ngram_context = self.ngram_context[..., -self.max_context_len:]
+
+     def reorder_cache(self, beam_idx: torch.LongTensor) -> "Cache":
+         """Reorder cache for beam search."""
+         # Reorder parent's KV cache
+         super().reorder_cache(beam_idx)
+
+         # Reorder N-gram context
+         if self.ngram_context is not None:
+             self.ngram_context = self.ngram_context.index_select(0, beam_idx.to(self.ngram_context.device))
+
+         return self
+
+
+ class NgramEmbedding(nn.Module):
+     """
+     Computes embeddings enriched with N-gram features without maintaining internal state.
+     """
+     def __init__(self, config, base_embeddings):
+         super().__init__()
+         self.config = config
+         self.word_embeddings = base_embeddings
+
+         self.m = config.ngram_vocab_size_ratio * config.vocab_size
+         self.k = config.emb_split_num
+         self.n = config.emb_neighbor_num
+
+         self._init_ngram_embeddings()
+         self._vocab_mods_cache = None
+
+     def _init_ngram_embeddings(self) -> None:
+         """Initialize N-gram embedding and projection layers."""
+         num_embedders = self.k * (self.n - 1)
+         emb_dim = self.config.hidden_size // num_embedders
+
+         embedders = []
+         post_projs = []
+
+         for i in range(num_embedders):
+             vocab_size = int(self.m + i * 2 + 1)
+             emb = nn.Embedding(vocab_size, emb_dim, padding_idx=self.config.pad_token_id)
+             proj = nn.Linear(emb_dim, self.config.hidden_size, bias=False)
+             embedders.append(emb)
+             post_projs.append(proj)
+
+         self.embedders = nn.ModuleList(embedders)
+         self.post_projs = nn.ModuleList(post_projs)
+
+     def _shift_right_ignore_eos(self, tensor: torch.Tensor, n: int, eos_token_id: int = 2) -> torch.Tensor:
+         """Shift tensor right by n positions, resetting at EOS tokens."""
+         batch_size, seq_len = tensor.shape
+         result = torch.zeros_like(tensor)
+         eos_mask = (tensor == eos_token_id)
+
+         for i in range(batch_size):
+             eos_positions = eos_mask[i].nonzero(as_tuple=True)[0]
+             prev_idx = 0
+
+             for eos_idx in eos_positions:
+                 end_idx = eos_idx.item() + 1
+                 if end_idx - prev_idx > n:
+                     result[i, prev_idx+n:end_idx] = tensor[i, prev_idx:end_idx-n]
+                 prev_idx = end_idx
+
+             if prev_idx < seq_len and seq_len - prev_idx > n:
+                 result[i, prev_idx+n:seq_len] = tensor[i, prev_idx:seq_len-n]
+
+         return result
+
+     def _precompute_vocab_mods(self) -> Dict[Tuple[int, int], List[int]]:
+         """Precompute modular arithmetic values for vocabulary."""
+         if self._vocab_mods_cache is not None:
+             return self._vocab_mods_cache
+
+         vocab_mods = {}
+         vocab_size = self.config.vocab_size
+
+         for i in range(2, self.n + 1):
+             for j in range(self.k):
+                 index = (i - 2) * self.k + j
+                 emb_vocab_dim = int(self.m + index * 2 + 1)
+
+                 mods = []
+                 power_mod = 1
+                 for _ in range(i - 1):
+                     power_mod = (power_mod * vocab_size) % emb_vocab_dim
+                     mods.append(power_mod)
+
+                 vocab_mods[(i, j)] = mods
+
+         self._vocab_mods_cache = vocab_mods
+         return vocab_mods
+
+     def _get_ngram_ids(
+         self,
+         input_ids: torch.Tensor,
+         shifted_ids: Dict[int, torch.Tensor],
+         vocab_mods: List[int],
+         ngram: int
+     ) -> torch.Tensor:
+         """Compute N-gram hash IDs using polynomial rolling hash."""
+         ngram_ids = input_ids.clone()
+         for k in range(2, ngram + 1):
+             ngram_ids = ngram_ids + shifted_ids[k] * vocab_mods[k - 2]
+         return ngram_ids
+
+     def forward(
+         self,
+         input_ids: torch.Tensor,
+         ngram_context: Optional[torch.Tensor] = None
+     ) -> torch.Tensor:
+         """
+         Stateless forward pass.
+
+         Args:
+             input_ids: Current input token IDs of shape (batch_size, seq_len)
+             ngram_context: Optional historical context of shape (batch_size, context_len)
+
+         Returns:
+             Embedding tensor of shape (batch_size, seq_len, hidden_size)
+         """
+         seq_len = input_ids.size(-1)
+
+         # Determine complete context
+         if ngram_context is not None:
+             context = torch.cat([ngram_context[..., -(self.n-1):], input_ids], dim=-1)
+         else:
+             context = input_ids
+
+         # Base word embeddings
+         device = self.word_embeddings.weight.device
+         x = self.word_embeddings(input_ids.to(device)).clone()
+
+         # Precompute modular values
+         vocab_mods = self._precompute_vocab_mods()
+
+         # Compute shifted IDs
+         shifted_ids = {}
+         for i in range(2, self.n + 1):
+             shifted_ids[i] = self._shift_right_ignore_eos(
+                 context, i - 1, eos_token_id=self.config.eos_token_id
+             )
+
+         # Add N-gram embeddings
+         for i in range(2, self.n + 1):
+             for j in range(self.k):
+                 index = (i - 2) * self.k + j
+                 emb_vocab_dim = int(self.m + index * 2 + 1)
+
+                 ngram_ids = self._get_ngram_ids(context, shifted_ids, vocab_mods[(i, j)], ngram=i)
+                 new_ids = (ngram_ids % emb_vocab_dim)[..., -seq_len:]
+
+                 embedder_device = self.embedders[index].weight.device
+                 x_ngram = self.embedders[index](new_ids.to(embedder_device))
+
+                 proj_device = self.post_projs[index].weight.device
+                 x_proj = self.post_projs[index](x_ngram.to(proj_device))
+                 x = x + x_proj.to(x.device)
+
+         # Normalize
+         x = x / (1 + self.k * (self.n - 1))
+
+         return x
+
+
+ class LongcatFlashNgramModel(LongcatFlashModel):
+     """LongcatFlash model with N-gram enhanced embeddings."""
+     _keys_to_ignore_on_load_unexpected = [r"model\.mtp.*"]
+     config_class = LongcatFlashNgramConfig
+
+     def __init__(self, config):
+         super().__init__(config)
+
+         self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
+         self.ngram_embeddings = NgramEmbedding(config, self.embed_tokens)
+
+         self.layers = nn.ModuleList(
+             [LongcatFlashDecoderLayer(config, layer_idx) for layer_idx in range(config.num_layers)]
+         )
+
+         self.head_dim = config.head_dim
+         self.config.num_hidden_layers = 2 * config.num_layers
+         self.norm = LongcatFlashRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+         self.rotary_emb = LongcatFlashRotaryEmbedding(config=config)
+         self.gradient_checkpointing = False
+
+         self.post_init()
+
+     def forward(
+         self,
+         input_ids: Optional[torch.LongTensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         past_key_values: Optional[Cache] = None,
+         inputs_embeds: Optional[torch.FloatTensor] = None,
+         cache_position: Optional[torch.LongTensor] = None,
+         use_cache: Optional[bool] = None,
+         **kwargs
+     ) -> BaseModelOutputWithPast:
+         if (input_ids is None) ^ (inputs_embeds is not None):
+             raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
+
+         # Extract N-gram context if available
+         ngram_context = None
+         if isinstance(past_key_values, NgramCache) and past_key_values.ngram_context is not None:
+             ngram_context = past_key_values.ngram_context
+
+         if inputs_embeds is None:
+             inputs_embeds = self.ngram_embeddings(input_ids, ngram_context=ngram_context)
+
+         # Initialize NgramCache if needed
+         if use_cache and past_key_values is None:
+             past_key_values = NgramCache(config=self.config)
+
+         # Update N-gram context
+         if use_cache and isinstance(past_key_values, NgramCache):
+             past_key_values.update_ngram_context(input_ids)
+
+         # Prepare cache position
+         if cache_position is None:
+             past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
+             cache_position = torch.arange(
+                 inputs_embeds.shape[1], device=inputs_embeds.device
+             ) + past_seen_tokens
+
+         if position_ids is None:
+             position_ids = cache_position.unsqueeze(0)
+
+         # Create causal mask
+         causal_mask = create_causal_mask(
+             config=self.config,
+             input_embeds=inputs_embeds,
+             attention_mask=attention_mask,
+             cache_position=cache_position,
+             past_key_values=past_key_values,
+             position_ids=position_ids,
+         )
+
+         # Forward through decoder layers
+         hidden_states = inputs_embeds
+         position_embeddings = self.rotary_emb(hidden_states, position_ids)
+
+         for decoder_layer in self.layers[: self.config.num_layers]:
+             hidden_states = decoder_layer(
+                 hidden_states,
+                 attention_mask=causal_mask,
+                 position_ids=position_ids,
+                 past_key_values=past_key_values,
+                 cache_position=cache_position,
+                 position_embeddings=position_embeddings,
+                 **kwargs,
+             )
+
+         hidden_states = self.norm(hidden_states)
+
+         return BaseModelOutputWithPast(
+             last_hidden_state=hidden_states,
+             past_key_values=past_key_values,
+             hidden_states=None,
+             attentions=None,
+         )
+
+
+ class LongcatFlashNgramForCausalLM(LongcatFlashForCausalLM):
+     """LongcatFlash model for causal language modeling with N-gram embeddings."""
+     _keys_to_ignore_on_load_unexpected = [r"model\.mtp.*"]
+     config_class = LongcatFlashNgramConfig
+
+     def __init__(self, config):
+         super().__init__(config)
+         self.model = LongcatFlashNgramModel(config)
+
+     @torch.no_grad()
+     def generate(self, inputs=None, generation_config=None, **kwargs):
+         """Override to ensure NgramCache is used."""
+
+         if "past_key_values" not in kwargs or kwargs["past_key_values"] is None:
+             kwargs["past_key_values"] = NgramCache(config=self.config)
+
+         return super().generate(inputs=inputs, generation_config=generation_config, **kwargs)
+
+ __all__ = ["LongcatFlashNgramPreTrainedModel", "LongcatFlashNgramModel", "LongcatFlashNgramForCausalLM"]
parse_model_response.py ADDED
@@ -0,0 +1,236 @@
+ import re
+ import json
+ import uuid
+
+ def parse_arguments(json_value):
+     """
+     Attempt to parse a string as JSON
+
+     Args:
+         json_value: String to parse
+
+     Returns:
+         tuple: (parsed_value, is_valid_json)
+     """
+     try:
+         parsed_value = json.loads(json_value)
+         return parsed_value, True
+     except (json.JSONDecodeError, TypeError):
+         return json_value, False
+
+ def get_argument_type(func_name: str, arg_key: str, defined_tools: list):
+     """
+     Get the type definition of a tool parameter
+
+     Args:
+         func_name: Name of the function/tool
+         arg_key: Parameter key name
+         defined_tools: List of tool definitions
+
+     Returns:
+         str or None: Type of the parameter ('string', 'object', 'array', 'integer', 'number', 'boolean')
+     """
+     name2tool = {tool["name"]: tool for tool in defined_tools}
+     if func_name not in name2tool:
+         return None
+     tool = name2tool[func_name]
+     if "parameters" not in tool or "properties" not in tool["parameters"]:
+         return None
+     if arg_key not in tool["parameters"]["properties"]:
+         return None
+     return tool["parameters"]["properties"][arg_key].get("type")
+
+ def parse_model_response(response: str, defined_tools: list = []):
+     """
+     Parse model response to extract reasoning_content, content, and tool_calls
+
+     Args:
+         response: Raw response text from the model
+         defined_tools: List of tool definitions
+
+     Returns:
+         dict: Message containing role, reasoning_content (optional), content (optional),
+               and tool_calls (optional)
+     """
+     text = response
+     reasoning_content = None
+     content = None
+     tool_calls = []
+
+     formatted_tools = []
+     for tool in defined_tools:
+         if "function" in tool:
+             formatted_tools.append(tool['function'])
+         else:
+             formatted_tools.append(tool)
+
+     if '</longcat_think>' in text:
+         text = text.replace('<longcat_think>', '')
+         thinking_end = text.find('</longcat_think>')
+         reasoning_content = text[: thinking_end].strip()
+         text = text[thinking_end + len('</longcat_think>'):].lstrip()
+
+     assert '<longcat_think>' not in text, "Unclosed <longcat_think> tag found in remaining text"
+     assert '</longcat_think>' not in text, "Unexpected </longcat_think> tag found without opening tag"
+
+     if '<longcat_tool_call>' in text:
+         index = text.find('<longcat_tool_call>')
+         content = text[:index]
+         text = text[index:].strip()
+     else:
+         content = text
+         text = ""
+
+     open_tags = text.count('<longcat_tool_call>')
+     close_tags = text.count('</longcat_tool_call>')
+     assert open_tags == close_tags, \
+         f"Mismatched tool_call tags: {open_tags} opening tags, {close_tags} closing tags"
+
+     tool_call_strs = re.findall(
+         r'<longcat_tool_call>(.*?)</longcat_tool_call>',
+         text,
+         re.DOTALL
+     )
+
+     for call in tool_call_strs:
+         func_name_match = re.match(r'([^\n<]+)', call.strip())
+         assert func_name_match, f"Missing function name in tool call: {call[:100]}"
+
+         func_name = func_name_match.group(1).strip()
+         assert func_name, "Empty function name in tool call"
+
+         # Verify argument tags are properly paired
+         arg_key_count = call.count('<longcat_arg_key>')
+         arg_key_close_count = call.count('</longcat_arg_key>')
+         arg_value_count = call.count('<longcat_arg_value>')
+         arg_value_close_count = call.count('</longcat_arg_value>')
+
+         assert arg_key_count == arg_key_close_count, \
+             f"Mismatched arg_key tags in function {func_name}: {arg_key_count} opening, {arg_key_close_count} closing"
+         assert arg_value_count == arg_value_close_count, \
+             f"Mismatched arg_value tags in function {func_name}: {arg_value_count} opening, {arg_value_close_count} closing"
+         assert arg_key_count == arg_value_count, \
+             f"Mismatched arg_key and arg_value count in function {func_name}: {arg_key_count} keys, {arg_value_count} values"
+
+         pairs = re.findall(
+             r'<longcat_arg_key>(.*?)</longcat_arg_key>\s*<longcat_arg_value>(.*?)</longcat_arg_value>',
+             call,
+             re.DOTALL
+         )
+
+         assert len(pairs) == arg_key_count, \
+             f"Failed to parse all arguments in function {func_name}: expected {arg_key_count}, got {len(pairs)}"
+
+         arguments = {}
+         for arg_key, arg_value in pairs:
+             arg_key = arg_key.strip()
+             arg_value = arg_value.strip()
+
+             assert arg_key, f"Empty argument key in function {func_name}"
+             assert arg_key not in arguments, \
+                 f"Duplicate argument key '{arg_key}' in function {func_name}"
+
+             arg_type = get_argument_type(func_name, arg_key, formatted_tools)
+
+             if arg_type and arg_type != 'string':
+                 parsed_value, is_good_json = parse_arguments(arg_value)
+                 arg_value = parsed_value
+
+             arguments[arg_key] = arg_value
+
+         tool_calls.append({
+             'id': "tool-call-" + str(uuid.uuid4()),
+             'type': "function",
+             'function': {
+                 'name': func_name,
+                 'arguments': arguments
+             }
+         })
+
+     message = {'role': 'assistant'}
+
+     if reasoning_content:
+         message['reasoning_content'] = reasoning_content
+     message['content'] = content
+     if tool_calls:
+         message['tool_calls'] = tool_calls
+
+     return message
+
+ if __name__ == "__main__":
+     from transformers import AutoModelForCausalLM, AutoTokenizer
+     from parse_model_response import parse_model_response
+
+     model_name = "meituan-longcat/LongCat-Flash-Lite"
+     model = AutoModelForCausalLM.from_pretrained(
+         model_name,
+         torch_dtype="auto",
+         device_map="auto",
+         trust_remote_code=True
+     )
+     tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+
+     messages = [
+         {"role": "system", "content": "You are a helpful assistant."},
+         {"role": "user", "content": "Give me a brief introduction to large language models."}
+     ]
+     input_ids = tokenizer.apply_chat_template(
+         messages,
+         add_generation_prompt=True,
+         return_tensors="pt"
+     ).to(model.device)
+     generated_ids = model.generate(inputs=input_ids, max_new_tokens=256)
+     output_ids = generated_ids[0][len(input_ids[0]):].tolist()
+     response = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")
+     print("Example 1: sample response.")
+     print("\nRaw response:")
+     print(response)
+     print("\nParsed result:")
+     parsed_message = parse_model_response(response)
+     print(json.dumps(parsed_message, indent=2, ensure_ascii=False))
+
+     tools = [
+         {
+             "type": "function",
+             "function": {
+                 "name": "func_add",
+                 "description": "Calculate the sum of two numbers",
+                 "parameters": {
+                     "type": "object",
+                     "properties": {
+                         "x1": {"type": "number", "description": "The first addend"},
+                         "x2": {"type": "number", "description": "The second addend"}
+                     },
+                     "required": ["x1", "x2"]
+                 }
+             }
+         }
+     ]
+     messages = [
+         {"role": "system", "content": "You are a helpful assistant."},
+         {"role": "user", "content": "Please tell me what is $$125679 + 234519$$?"},
+         # {
+         #     "role": "assistant",
+         #     "content": "I'll calculate the sum of 125679 and 234519 for you.",
+         #     "tool_calls": [{"type": "function", "function": {"name": "func_add", "arguments": {"x1": 125679, "x2": 234519}}}]
+         # },
+         # {"role": "tool", "name": "func_add", "content": '{"ans": 360198}'}
+     ]
+
+     input_ids = tokenizer.apply_chat_template(
+         messages,
+         tools=tools,
+         add_generation_prompt=True,
+         return_tensors="pt"
+     ).to(model.device)
+     generated_ids = model.generate(inputs=input_ids, max_new_tokens=256)
+     output_ids = generated_ids[0][len(input_ids[0]):].tolist()
+     response = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")
+     print("Example 2: tool call response.")
+     print("\nRaw response:")
+     print(response)
+     print("\nParsed result:")
+     parsed_message = parse_model_response(response, tools)
+     print(json.dumps(parsed_message, indent=2, ensure_ascii=False))
special_tokens_map.json ADDED
@@ -0,0 +1,30 @@
+ {
+   "bos_token": {
+     "content": "<longcat_s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</longcat_s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<longcat_pad>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<longcat_unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tech_report.pdf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cae927e0ba049046b7872dba03467158db9579914f534143db56a17c5f1bcd02
+ size 736397
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,42 @@
+ {
+   "add_bos_token": false,
+   "add_eos_token": true,
+   "add_prefix_space": false,
+   "bos_token": {
+     "__type": "AddedToken",
+     "content": "<longcat_s>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "clean_up_tokenization_spaces": false,
+   "eos_token": {
+     "__type": "AddedToken",
+     "content": "</longcat_s>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "model_max_length": 131072,
+   "pad_token": {
+     "__type": "AddedToken",
+     "content": "<longcat_pad>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sp_model_kwargs": {},
+   "tokenizer_class": "BloomTokenizer",
+   "unk_token": {
+     "__type": "AddedToken",
+     "content": "<longcat_unk>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "chat_template": "{%- set tool_choice = tool_choice | default('auto') %}\n{%- set ns = namespace(tool_types = [], last_query_index = -1) %}\n\n{%- if tools and tool_choice != 'none' %}\n    {{- \"<longcat_tool_declare>\\n\"-}}\n    {{- \"# Tools\\n\" }}\n    {{- \"You have access to the following tools:\\n\\n\" }}\n    {%- for tool in tools %}\n        {%- if tool.type not in ns.tool_types %}\n            {%- set ns.tool_types = ns.tool_types + [tool.type] %}\n            {{- \"## Tool namespace: \" ~ tool.type ~ \"\\n\\n\" }}\n        {%- endif %}\n        {%- if tool.type == 'code_interpreter' %}\n            {%- set tool = {\"type\":\"code_interpreter\",\"function\":{\"name\":\"code_interpreter_preview\",\"description\":\"The code will be executed in a stateful Jupyter notebook sandbox environment, only supports local computation, data processing, and file operations.\\nCode sandbox environment (network isolated) Any external network requests or online API calls are prohibited.\\nIf online functionality is needed, please use other permitted tools.\\nCode will respond with the output of the execution or time out after 60.0 seconds. \",\"parameters\":{\"type\":\"object\",\"properties\":{\"language\":{\"type\":\"string\",\"description\":\"The programming language of the code to be executed. Available values: python (Default), java, go, js, ts, c, c++.\"},\"code\":{\"type\":\"string\",\"description\":\"Python code to be executed must not include the following:\\n- Importing network libraries such as requests, httplib, etc.\\n- Any form of HTTP requests.\\n- External API calls.\\n- Network port operations. Example: ```python\\nimport pandas as pd\\npd.DataFrame({'A':[1,2]})\\n```\"},\"timeout\":{\"type\":\"number\",\"description\":\"The maximum execution time of the code, in seconds. Default is 60.0.\"}}},\"required\":[\"code\"]}} %}\n        {%- endif %}\n        {{- \"### Tool name: \" + tool.function.name + \"\\n\" }}\n        {{- \"Description: \" + tool.function.description + \"\\n\\n\" }}\n        {{- \"InputSchema: \" + tool.function.parameters | tojson(ensure_ascii=False) + \"\\n\\n\" }}\n    {%- endfor %}\n    {{- '**Note**: For each function call, output the function name and arguments within the following XML format:\\n<longcat_tool_call>{function-name}\\n<longcat_arg_key>{arg-key-1}</longcat_arg_key>\\n<longcat_arg_value>{arg-value-1}</longcat_arg_value>\\n<longcat_arg_key>{arg-key-2}</longcat_arg_key>\\n<longcat_arg_value>{arg-value-2}</longcat_arg_value>\\n...\\n</longcat_tool_call>\\n' }}\n    {{- \"</longcat_tool_declare>\"-}}\n    {%- for idx in range(messages|length - 1) %}\n        {%- set msg = messages[idx] %}\n        {%- if msg.role == 'assistant' and not msg.tool_calls %}\n            {%- set ns.last_query_index = idx %}\n        {%- endif %}\n    {%- endfor%}\n{%- endif %}\n\n{%- for msg in messages %}\n    {%- if msg.role == \"system\" %}\n        {{- \"<longcat_system>\" + msg.content }}\n    {%- elif msg.role == \"user\" %}\n        {{- \"<longcat_user>\" }}\n        {%- if msg[\"files\"] %}\n            {{- '<longcat_files>\\n' ~ msg.files | tojson(indent=2) ~ '\\n</longcat_files>' }}\n        {%- endif %}\n        {{- msg.content }}\n    {%- elif msg.role == \"assistant\" %}\n        {{- \"<longcat_assistant>\" }}\n        {%- if enable_thinking == true and msg.reasoning_content and ns.tool_types != [] and loop.index0 > ns.last_query_index %}\n            {{- \"\\n<longcat_think>\\n\" ~ msg.reasoning_content ~ \"\\n</longcat_think>\\n\" }}\n        {%- endif %}\n        {%- if msg.content%}\n            {{- msg.content }}\n        {%- endif %}\n        {%- if msg.tool_calls %}\n            {%- for tool_call in msg.tool_calls -%}\n                {{- \"<longcat_tool_call>\" ~ tool_call.function.name ~ \"\\n\" -}}\n                {% set _args = tool_call.function.arguments %}\n                {% for k, v in _args.items() %}\n                    {{- \"<longcat_arg_key>\" ~ k ~ \"</longcat_arg_key>\\n\" -}}\n                    {{- \"<longcat_arg_value>\" ~ (v if v is string else v | tojson(ensure_ascii=False)) ~ \"</longcat_arg_value>\\n\" -}}\n                {% endfor %}\n                {{- \"</longcat_tool_call>\\n\" }}\n            {%- endfor %}\n        {%- endif %}\n        {{- \"</longcat_s>\" -}}\n    {%- elif msg.role == \"tool\" %}\n        {%- if messages[loop.index0 - 1].role != \"tool\"%}\n            {{- \"<longcat_user>\" -}}\n        {%- endif %}\n        {{- \"<longcat_tool_response>\" ~ msg.content ~ \"</longcat_tool_response>\"-}}\n    {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n    {%- if enable_thinking == true %}\n        {{- \" /think_on\" }}\n        {%- if thinking_budget %}\n            {%- if thinking_budget < 1024 %}\n                {%- set thinking_budget = 1024 %}\n            {%- endif%}\n            {{- \"\\nthinking_budget: < \" ~ thinking_budget ~ \".\"}}\n        {%- endif %}\n        {{- \" <longcat_assistant><longcat_think>\\n\"}}\n    {%- elif enable_thinking == false %}\n        {{- \" /think_off <longcat_assistant><longcat_think>\\n\\n</longcat_think>\\n\" }}\n    {%- else %}\n        {{- \"<longcat_assistant>\" }}\n    {%- endif %}\n{%- endif %}"
+ }
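The polynomial rolling-hash scheme that `NgramEmbedding` in `modeling_longcat_ngram.py` uses to map token N-grams into its auxiliary embedding tables can be sketched standalone. This is a minimal illustration only, not the repository's code: it assumes toy sizes (`vocab_size=10`, `table_size=13`) instead of the config-driven values, and it mirrors the roles of `_precompute_vocab_mods` (powers of the vocabulary size modulo the table size) and `_get_ngram_ids` (current token plus shifted tokens weighted by those powers).

```python
# Toy sketch of the N-gram rolling hash; all sizes here are illustrative,
# not the model's real config values.
def ngram_hash_ids(tokens, vocab_size, ngram, table_size):
    """For each position, hash the current token together with the
    (ngram - 1) preceding tokens into an ID in [0, table_size)."""
    # Powers of vocab_size modulo table_size, as in _precompute_vocab_mods.
    mods = []
    power = 1
    for _ in range(ngram - 1):
        power = (power * vocab_size) % table_size
        mods.append(power)

    ids = []
    for pos, tok in enumerate(tokens):
        h = tok
        # Add shifted (preceding) tokens weighted by the precomputed powers,
        # as in _get_ngram_ids; positions before the start contribute 0.
        for k in range(1, ngram):
            prev = tokens[pos - k] if pos - k >= 0 else 0
            h += prev * mods[k - 1]
        ids.append(h % table_size)
    return ids

# Identical bigrams hash to identical IDs: positions 2 and 4 both see (7, 3).
print(ngram_hash_ids([5, 7, 3, 7, 3], vocab_size=10, ngram=2, table_size=13))
# → [5, 5, 8, 11, 8]
```

In the model itself, each of the `k * (n - 1)` embedding tables uses a slightly different modulus (`m + index * 2 + 1`), so a collision in one table is unlikely to repeat in the others.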