Mingke977 committed on
Commit
f2121c6
·
verified ·
1 Parent(s): f633f94

Add files using upload-large-folder tool

.gitignore ADDED
@@ -0,0 +1,2 @@
1
+ .joycode/
2
+ venv/
README.md ADDED
@@ -0,0 +1,381 @@
1
+ ---
2
+ language:
3
+ - zh
4
+ - en
5
+ pipeline_tag: text-generation
6
+ library_name: transformers
7
+ ---
8
+ <div align="center">
9
+ <picture>
10
+ <img src="figures/joyai-logo.png" width="30%" alt="JoyAI-LLM Flash">
11
+ </picture>
12
+ </div>
13
+ <hr>
14
+
15
+ <div align="center" style="line-height: 1;">
16
+ <a href="https://huggingface.co/jdopensource" target="_blank"><img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-JD-ffc107?color=ffc107&logoColor=white"/></a>
17
+ <a href="https://huggingface.co/jdopensource/JoyAI-LLM-Flash/blob/main/LICENSE"><img alt="License" src="https://img.shields.io/badge/License-Modified_MIT-f5de53?&color=f5de53"/></a>
18
+ </div>
19
+
20
+
21
+
22
+
23
+ ## 1. Model Introduction
24
+
25
+ JoyAI-LLM-Flash is a state-of-the-art medium-sized instruct language model with 3 billion activated parameters and 48 billion total parameters. It was pretrained on 20 trillion text tokens using the Muon optimizer, followed by large-scale supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning (RL) across diverse environments. It achieves strong performance on frontier knowledge, reasoning, coding, and agentic tasks.
26
+
27
+ ### Key Features
28
+
29
+ - Fiber Bundle RL: Introduces fiber bundle theory into reinforcement learning, proposing a novel optimization framework, FiberPO. This method is specifically designed to handle the challenges of large-scale and heterogeneous agent training, improving stability and robustness under complex data distributions.
30
+ - Training-Inference Collaboration: applies the Muon optimizer with dense multi-token prediction (MTP) and introduces optimization techniques that resolve instabilities when scaling up, delivering 1.3× to 1.7× the throughput of the non-MTP version.
31
+ - Agentic Intelligence: designed for tool use, reasoning, and autonomous problem-solving.
32
+
33
+ ## 2. Model Summary
34
+
35
+ | Property | Value |
36
+ | :-----------------------------------------: | :----------------------: |
37
+ | **Architecture** | Mixture-of-Experts (MoE) |
38
+ | **Total Parameters** | 48B |
39
+ | **Activated Parameters** | 3B |
40
+ | **Number of Layers** (Dense layer included) | 40 |
41
+ | **Number of Dense Layers** | 1 |
42
+ | **Attention Hidden Dimension** | 2048 |
43
+ | **MoE Hidden Dimension** (per Expert) | 768 |
44
+ | **Number of Attention Heads** | 32 |
45
+ | **Number of Experts** | 256 |
46
+ | **Selected Experts per Token** | 8 |
47
+ | **Number of Shared Experts** | 1 |
48
+ | **Vocabulary Size** | 129K |
49
+ | **Context Length** | 128K |
50
+ | **Attention Mechanism** | MLA |
51
+ | **Activation Function** | SwiGLU |
53
+
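As a rough consistency check on the table above, the routed-expert weights can be counted from the listed dimensions alone. The sketch below assumes each SwiGLU expert holds three bias-free weight matrices (gate, up, down) of shape hidden × expert-hidden, and that the 2048 "Attention Hidden Dimension" is the model's hidden size; attention, embedding, dense-layer, and MTP weights are deliberately left out, so the figures are lower bounds rather than official numbers.

```python
# Dimensions taken from the Model Summary table.
HIDDEN = 2048            # hidden size (assumed equal to "Attention Hidden Dimension")
EXPERT_HIDDEN = 768      # MoE hidden dimension per expert
LAYERS = 40              # total layers, dense layer included
DENSE_LAYERS = 1
ROUTED_EXPERTS = 256
SELECTED_EXPERTS = 8     # routed experts chosen per token
SHARED_EXPERTS = 1


def swiglu_params(hidden: int, inner: int) -> int:
    """Weights in one SwiGLU MLP: gate, up, and down projections, no biases."""
    return 3 * hidden * inner


def expert_param_counts():
    """Return (per_expert, total_routed, active_per_token) weight counts."""
    moe_layers = LAYERS - DENSE_LAYERS
    per_expert = swiglu_params(HIDDEN, EXPERT_HIDDEN)
    total_routed = moe_layers * ROUTED_EXPERTS * per_expert
    active = moe_layers * (SELECTED_EXPERTS + SHARED_EXPERTS) * per_expert
    return per_expert, total_routed, active


if __name__ == "__main__":
    per_expert, total_routed, active = expert_param_counts()
    print(f"per expert:            {per_expert / 1e6:.2f}M")
    print(f"all routed experts:    {total_routed / 1e9:.2f}B")
    print(f"expert weights/token:  {active / 1e9:.2f}B")
```

Under these assumptions the routed experts account for roughly 47.1B weights, which is in line with the 48B total in the table, and the experts active for one token come to roughly 1.66B; the remainder of the 3B activated figure would presumably come from attention, the dense layer, and embeddings.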
54
+
55
+ ## 3. Evaluation Results
56
+
57
+ <table>
58
+ <thead>
59
+ <tr>
60
+ <th align="center">Benchmark</th>
61
+ <th align="center"><sup>JoyAI-LLM Flash</sup></th>
62
+ <th align="center"><sup>Qwen3-30B-A3B-Instruct-2507</sup></th>
63
+ <th align="center"><sup>GLM-4.7-Flash<br>(Non-thinking)</sup></th>
64
+ </tr>
65
+ </thead>
66
+ <tbody>
67
+
68
+
69
+ <tr>
70
+ <td align="center" colspan="4"><strong>Knowledge &amp; Alignment</strong></td>
71
+ </tr>
72
+ <tr>
73
+ <td align="center" style="vertical-align: middle">MMLU</td>
74
+ <td align="center" style="vertical-align: middle"><strong>89.50</strong></td>
75
+ <td align="center" style="vertical-align: middle">86.87</td>
76
+ <td align="center" style="vertical-align: middle">80.53</td>
77
+ </tr>
78
+ <tr>
79
+ <td align="center" style="vertical-align: middle">MMLU-Pro</td>
80
+ <td align="center" style="vertical-align: middle"><strong>81.02</strong></td>
81
+ <td align="center" style="vertical-align: middle">73.88</td>
82
+ <td align="center" style="vertical-align: middle">63.62</td>
83
+ </tr>
84
+ <tr>
85
+ <td align="center" style="vertical-align: middle">CMMLU</td>
86
+ <td align="center" style="vertical-align: middle"><strong>87.03</strong></td>
87
+ <td align="center" style="vertical-align: middle">85.88</td>
88
+ <td align="center" style="vertical-align: middle">75.85</td>
89
+ </tr>
90
+ <tr>
91
+ <td align="center" style="vertical-align: middle">GPQA-Diamond</td>
92
+ <td align="center" style="vertical-align: middle"><strong>74.43</strong></td>
93
+ <td align="center" style="vertical-align: middle">68.69</td>
94
+ <td align="center" style="vertical-align: middle">39.90</td>
95
+ </tr>
96
+ <tr>
97
+ <td align="center" style="vertical-align: middle">SuperGPQA</td>
98
+ <td align="center" style="vertical-align: middle"><strong>55.00</strong></td>
99
+ <td align="center" style="vertical-align: middle">52.00</td>
100
+ <td align="center" style="vertical-align: middle">32.00</td>
101
+ </tr>
102
+ <tr>
103
+ <td align="center" style="vertical-align: middle">LiveBench</td>
104
+ <td align="center" style="vertical-align: middle"><strong>72.90</strong></td>
105
+ <td align="center" style="vertical-align: middle">59.70</td>
106
+ <td align="center" style="vertical-align: middle">43.10</td>
107
+ </tr>
108
+ <tr>
109
+ <td align="center" style="vertical-align: middle">IFEval</td>
110
+ <td align="center" style="vertical-align: middle"><strong>86.69</strong></td>
111
+ <td align="center" style="vertical-align: middle">83.18</td>
112
+ <td align="center" style="vertical-align: middle">82.44</td>
113
+ </tr>
114
+ <tr>
115
+ <td align="center" style="vertical-align: middle">AlignBench</td>
116
+ <td align="center" style="vertical-align: middle"><strong>8.24</strong></td>
117
+ <td align="center" style="vertical-align: middle">8.07</td>
118
+ <td align="center" style="vertical-align: middle">6.85</td>
119
+ </tr>
120
+ <tr>
121
+ <td align="center" style="vertical-align: middle">HellaSwag</td>
122
+ <td align="center" style="vertical-align: middle"><strong>91.79</strong></td>
123
+ <td align="center" style="vertical-align: middle">89.90</td>
124
+ <td align="center" style="vertical-align: middle">60.84</td>
125
+ </tr>
126
+
127
+ <tr>
128
+ <td align="center" colspan="4"><strong>Coding</strong></td>
129
+ </tr>
130
+ <tr>
131
+ <td align="center" style="vertical-align: middle">HumanEval</td>
132
+ <td align="center" style="vertical-align: middle"><strong>96.34</strong></td>
133
+ <td align="center" style="vertical-align: middle">95.12</td>
134
+ <td align="center" style="vertical-align: middle">74.39</td>
135
+ </tr>
136
+ <tr>
137
+ <td align="center" style="vertical-align: middle">LiveCodeBench</td>
138
+ <td align="center" style="vertical-align: middle"><strong>65.60</strong></td>
139
+ <td align="center" style="vertical-align: middle">39.71</td>
140
+ <td align="center" style="vertical-align: middle">27.43</td>
141
+ </tr>
142
+ <tr>
143
+ <td align="center" style="vertical-align: middle">SciCode</td>
144
+ <td align="center" style="vertical-align: middle"><strong>3.08/22.92</strong></td>
145
+ <td align="center" style="vertical-align: middle"><strong>3.08/22.92</strong></td>
146
+ <td align="center" style="vertical-align: middle">3.08/15.11</td>
147
+ </tr>
148
+ <tr>
149
+ <td align="center" colspan="4"><strong>Mathematics</strong></td>
150
+ </tr>
151
+ <tr>
152
+ <td align="center" style="vertical-align: middle">GSM8K</td>
153
+ <td align="center" style="vertical-align: middle"><strong>95.83</strong></td>
154
+ <td align="center" style="vertical-align: middle">79.83</td>
155
+ <td align="center" style="vertical-align: middle">81.88</td>
156
+ </tr>
157
+ <tr>
158
+ <td align="center" style="vertical-align: middle">AIME2025</td>
159
+ <td align="center" style="vertical-align: middle"><strong>65.83</strong></td>
160
+ <td align="center" style="vertical-align: middle">62.08</td>
161
+ <td align="center" style="vertical-align: middle">24.17</td>
162
+ </tr>
163
+ <tr>
164
+ <td align="center" style="vertical-align: middle">MATH 500</td>
165
+ <td align="center" style="vertical-align: middle"><strong>97.10</strong></td>
166
+ <td align="center" style="vertical-align: middle">89.80</td>
167
+ <td align="center" style="vertical-align: middle">90.90</td>
168
+ </tr>
169
+
170
+ <tr>
171
+ <td align="center" colspan="4"><strong>Agentic</strong></td>
172
+ </tr>
173
+ <tr>
174
+ <td align="center" style="vertical-align: middle">SWE-bench Verified</td>
175
+ <td align="center" style="vertical-align: middle"><strong>60.60</strong></td>
176
+ <td align="center" style="vertical-align: middle">24.44</td>
177
+ <td align="center" style="vertical-align: middle">51.60</td>
178
+ </tr>
179
+ <tr>
180
+ <td align="center" style="vertical-align: middle">Tau2-Retail</td>
181
+ <td align="center" style="vertical-align: middle"><strong>67.55</strong></td>
182
+ <td align="center" style="vertical-align: middle">53.51</td>
183
+ <td align="center" style="vertical-align: middle">62.28</td>
184
+ </tr>
185
+ <tr>
186
+ <td align="center" style="vertical-align: middle">Tau2-Airline</td>
187
+ <td align="center" style="vertical-align: middle"><strong>54.00</strong></td>
188
+ <td align="center" style="vertical-align: middle">32.00</td>
189
+ <td align="center" style="vertical-align: middle">52.00</td>
190
+ </tr>
191
+ <tr>
192
+ <td align="center" style="vertical-align: middle">Tau2-Telecom</td>
193
+ <td align="center" style="vertical-align: middle">79.83</td>
194
+ <td align="center" style="vertical-align: middle">4.39</td>
195
+ <td align="center" style="vertical-align: middle"><strong>88.60</strong></td>
196
+ </tr>
197
+
198
+ <tr>
199
+ <td align="center" colspan="4"><strong>Long Context</strong></td>
200
+ </tr>
201
+ <tr>
202
+ <td align="center" style="vertical-align: middle">RULER</td>
203
+ <td align="center" style="vertical-align: middle"><strong>95.60</strong></td>
204
+ <td align="center" style="vertical-align: middle">89.66</td>
205
+ <td align="center" style="vertical-align: middle">56.12</td>
206
+ </tr>
207
+ </tbody>
208
+ </table>
209
+
210
+
211
+ ## 4. Deployment
212
+
213
+ > [!Note]
214
+ > The JoyAI-LLM Flash API is available at https://docs.jdcloud.com/cn/jdaip/chat, with OpenAI- and Anthropic-compatible endpoints.
215
+ > Currently, we recommend running JoyAI-LLM-Flash-INT4 on the following inference engines:
216
+
217
+ * vLLM
218
+ * SGLang
219
+
220
+ The minimum version requirement for `transformers` is `4.57.1`.
221
+
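The minimum `transformers` version can be checked before loading the model. The following is a minimal sketch using only the standard library; the helper names `parse_release` and `meets_minimum` are illustrative and not part of any package, and pre-release suffixes such as `.dev0` or `rc1` are simply ignored.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum version stated in this README.
MIN_TRANSFORMERS = (4, 57, 1)


def parse_release(text: str) -> tuple:
    """Return the leading numeric (major, minor, patch) part of a version string."""
    numbers = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def meets_minimum(installed: str, minimum: tuple = MIN_TRANSFORMERS) -> bool:
    """True if the installed version is at least the required one."""
    return parse_release(installed) >= minimum


if __name__ == "__main__":
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        print("transformers is not installed")
    else:
        status = "OK" if meets_minimum(installed) else "too old"
        print(f"transformers {installed}: {status}")
```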
222
+ Deployment examples can be found in the [Model Deployment Guide](docs/deploy_guidance.md).
223
+
224
+
225
+
226
+ ## 5. Model Usage
227
+
228
+ The usage demos below demonstrate how to call our official API.
229
+
230
+ For third-party APIs deployed with vLLM or SGLang, please note that:
231
+
232
+ > [!Note] Recommended sampling parameters: `temperature=0.6`, `top_p=1.0`
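One way to apply the recommended sampling parameters consistently is to build the request arguments in a single place. The sketch below is only an illustration: `build_chat_request` is a hypothetical helper, and the resulting dictionary is meant to be passed to `client.chat.completions.create(**request)` from the OpenAI Python client.

```python
# Sampling parameters recommended in this README.
RECOMMENDED_SAMPLING = {"temperature": 0.6, "top_p": 1.0}


def build_chat_request(model: str, messages: list, **overrides) -> dict:
    """Return keyword arguments for a chat completion request.

    The recommended sampling parameters are used unless explicitly overridden.
    """
    request = {"model": model, "messages": messages, **RECOMMENDED_SAMPLING}
    request.update(overrides)
    return request


if __name__ == "__main__":
    request = build_chat_request(
        "MODEL_NAME",  # placeholder; use the model id reported by your server
        [{"role": "user", "content": "Hello"}],
        max_tokens=256,
    )
    print(request)
```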
233
+
234
+ ### Chat Completion
235
+
236
+ The following script shows a minimal chat completion call to the JoyAI-LLM Flash API.
237
+
238
+ ```python
+ from openai import OpenAI
+
+ # Replace IP:PORT with the address of your deployment.
+ client = OpenAI(base_url="http://IP:PORT/v1", api_key="EMPTY")
+
+
+ def simple_chat(client: OpenAI):
+     messages = [
+         {
+             "role": "user",
+             "content": [
+                 {
+                     "type": "text",
+                     "text": "which one is bigger, 9.11 or 9.9? think carefully.",
+                 }
+             ],
+         },
+     ]
+     # Use the first model the server reports.
+     model_name = client.models.list().data[0].id
+     response = client.chat.completions.create(
+         model=model_name, messages=messages, stream=False, max_tokens=4096
+     )
+     print(f"response: {response.choices[0].message.content}")
+
+
+ if __name__ == "__main__":
+     simple_chat(client)
+ ```
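The script above requests the whole answer at once (`stream=False`). With `stream=True`, the OpenAI Python client instead yields chunks whose `choices[0].delta.content` holds a text fragment (or `None`). The sketch below shows one way to reassemble those fragments; `collect_stream` and `fake_chunk` are hypothetical helpers, and the fake chunks only stand in for a real server response so the logic can be run offline.

```python
from types import SimpleNamespace


def collect_stream(chunks) -> str:
    """Concatenate the text fragments of a streamed chat completion."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks carry no choices (e.g. usage-only chunks)
        delta = chunk.choices[0].delta
        if delta.content:
            parts.append(delta.content)
    return "".join(parts)


def fake_chunk(text):
    """Build an object shaped like a streamed chunk, for offline testing only."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


if __name__ == "__main__":
    demo = [fake_chunk("9.9 "), fake_chunk(None), fake_chunk("is bigger.")]
    print(collect_stream(demo))
```

Against a live deployment, the same function could be applied to the iterator returned by `client.chat.completions.create(..., stream=True)`.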
266
+
267
+
268
+ ### Tool Call Completion
269
+
270
+ The following script shows a minimal tool call completion with the JoyAI-LLM Flash API.
271
+
272
+ ```python
+ import json
+
+ from openai import OpenAI
+
+ client = OpenAI(base_url="http://IP:PORT/v1", api_key="EMPTY")
+
+
+ def my_calculator(expression: str) -> str:
+     # eval() is used for demonstration only; never call it on untrusted input.
+     return str(eval(expression))
+
+
+ def rewrite(text: str) -> str:
+     # Placeholder implementation: returns the text unchanged.
+     return str(text)
+
+
+ # Map each tool name in the schema to the local function that implements it.
+ TOOL_FUNCTIONS = {"my_calculator": my_calculator, "rewrite": rewrite}
+
+
+ def simple_tool_call(client: OpenAI):
+     messages = [
+         {
+             "role": "user",
+             "content": [
+                 {
+                     "type": "text",
+                     "text": "use my functions to compute the results for the equations: 6+1",
+                 },
+             ],
+         },
+     ]
+     tools = [
+         {
+             "type": "function",
+             "function": {
+                 "name": "my_calculator",
+                 "description": "A calculator that can evaluate a mathematical equation and compute its results.",
+                 "parameters": {
+                     "type": "object",
+                     "properties": {
+                         "expression": {
+                             "type": "string",
+                             "description": "The mathematical expression to evaluate.",
+                         },
+                     },
+                     "required": ["expression"],
+                 },
+             },
+         },
+         {
+             "type": "function",
+             "function": {
+                 "name": "rewrite",
+                 "description": "Rewrite a given text for improved clarity",
+                 "parameters": {
+                     "type": "object",
+                     "properties": {
+                         "text": {
+                             "type": "string",
+                             "description": "The input text to rewrite",
+                         }
+                     },
+                 },
+             },
+         },
+     ]
+     model_name = client.models.list().data[0].id
+     response = client.chat.completions.create(
+         model=model_name,
+         messages=messages,
+         temperature=1.0,
+         max_tokens=1024,
+         tools=tools,
+         tool_choice="auto",
+     )
+     tool_calls = response.choices[0].message.tool_calls
+     if not tool_calls:
+         # The model answered directly without calling a tool.
+         print(response.choices[0].message.content)
+         return
+
+     # Echo the assistant's tool calls, then append one result per call.
+     messages.append({"role": "assistant", "tool_calls": tool_calls})
+     for tool_call in tool_calls:
+         function = TOOL_FUNCTIONS[tool_call.function.name]
+         result = function(**json.loads(tool_call.function.arguments))
+         messages.append(
+             {
+                 "role": "tool",
+                 "tool_call_id": tool_call.id,
+                 "name": tool_call.function.name,
+                 "content": result,
+             }
+         )
+     response = client.chat.completions.create(
+         model=model_name,
+         messages=messages,
+         temperature=1.0,
+         max_tokens=1024,
+     )
+     print(response.choices[0].message.content)
+
+
+ if __name__ == "__main__":
+     simple_tool_call(client)
+ ```
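The `my_calculator` tool in the script above relies on `eval`, which executes arbitrary Python and is risky when the expression comes from a model. The following is a sketch of a different technique, not the one used in this README: it parses the expression with the standard `ast` module and evaluates only numeric literals combined with `+`, `-`, `*`, and `/`. The name `safe_calculator` is hypothetical.

```python
import ast
import operator

# Whitelisted operators; anything else is rejected.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _evaluate(node):
    """Recursively evaluate a restricted arithmetic expression tree."""
    if isinstance(node, ast.Constant):
        value = node.value
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            return value
    elif isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
        return _BINARY[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    elif isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
        return _UNARY[type(node.op)](_evaluate(node.operand))
    raise ValueError("unsupported expression")


def safe_calculator(expression: str) -> str:
    """Evaluate a basic arithmetic expression without calling eval()."""
    tree = ast.parse(expression, mode="eval")
    return str(_evaluate(tree.body))


if __name__ == "__main__":
    print(safe_calculator("6+1"))
```

Function calls, names, and attribute access raise `ValueError` instead of being executed, so a function like this could be registered under the `my_calculator` tool name in place of the `eval`-based version.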
376
+
377
+ ---
378
+
379
+ ## 6. License
380
+
381
+ Both the code repository and the model weights are released under the [Modified MIT License](LICENSE).
chat_template.jinja ADDED
@@ -0,0 +1,103 @@
1
+ {%- macro render_extra_keys(json_dict, handled_keys) -%}
2
+ {%- if json_dict is mapping -%}
3
+ {%- for json_key in json_dict if json_key not in handled_keys -%}
4
+ {%- if json_dict[json_key] is mapping or (json_dict[json_key] is sequence and json_dict[json_key] is not string) -%}
5
+ {{- '\n<' ~ json_key ~ '>' ~ (json_dict[json_key] | tojson | safe) ~ '</' ~ json_key ~ '>' -}}
6
+ {%- else -%}
7
+ {{- '\n<' ~ json_key ~ '>' ~ (json_dict[json_key] | string) ~ '</' ~ json_key ~ '>' -}}
8
+ {%- endif -%}
9
+ {%- endfor -%}
10
+ {%- endif -%}
11
+ {%- endmacro -%}
12
+
13
+ {%- if not add_generation_prompt is defined -%}{%- set add_generation_prompt = false -%}{%- endif -%}
14
+
15
+ {%- set ns = namespace(system_prompt='', is_first_sp=true, is_last_user=false) -%}
16
+ {%- set default_system = "You are JoyAI , a large language model trained by JD(京东)that can interact with a computer to solve tasks. Answer as concisely as possible." -%}
17
+ {%- set ns.system_prompt = default_system -%}
18
+
19
+ {%- for message in messages -%}
20
+ {%- if message['role'] == 'system' -%}
21
+ {%- if ns.is_first_sp -%}
22
+ {%- set ns.system_prompt = message['content'] -%}
23
+ {%- set ns.is_first_sp = false -%}
24
+ {%- else -%}
25
+ {%- set ns.system_prompt = ns.system_prompt + '\n\n' + message['content'] -%}
26
+ {%- endif -%}
27
+ {%- endif -%}
28
+ {%- endfor -%}
29
+
30
+ {{- bos_token -}}{{- ns.system_prompt -}}
31
+ {%- if tools is iterable and tools | length > 0 -%}
32
+ {{- "\n\n# Tools\n\nYou have access to the following functions:\n\n" }}
33
+ {{- "<tools>" }}
34
+ {%- for tool in tools %}
35
+ {%- if tool.function is defined %}
36
+ {%- set tool = tool.function %}
37
+ {%- endif %}
38
+ {{- "\n<function>\n<name>" ~ tool.name ~ "</name>" }}
39
+ {%- if tool.description is defined %}
40
+ {{- '\n<description>' ~ (tool.description | trim) ~ '</description>' }}
41
+ {%- endif %}
42
+ {{- '\n<parameters>' }}
43
+ {%- if tool.parameters is defined and tool.parameters is mapping and tool.parameters.properties is defined and tool.parameters.properties is mapping %}
44
+ {%- for param_name, param_fields in tool.parameters.properties|items %}
45
+ {{- '\n<parameter>' }}
46
+ {{- '\n<name>' ~ param_name ~ '</name>' }}
47
+ {%- if param_fields.type is defined %}
48
+ {{- '\n<type>' ~ (param_fields.type | string) ~ '</type>' }}
49
+ {%- endif %}
50
+ {%- if param_fields.description is defined %}
51
+ {{- '\n<description>' ~ (param_fields.description | trim) ~ '</description>' }}
52
+ {%- endif %}
53
+ {%- set handled_keys = ['name', 'type', 'description'] %}
54
+ {{- render_extra_keys(param_fields, handled_keys) }}
55
+ {{- '\n</parameter>' }}
56
+ {%- endfor %}
57
+ {%- endif %}
58
+ {% set handled_keys = ['type', 'properties'] %}
59
+ {{- render_extra_keys(tool.parameters, handled_keys) }}
60
+ {{- '\n</parameters>' }}
61
+ {%- set handled_keys = ['type', 'name', 'description', 'parameters'] %}
62
+ {{- render_extra_keys(tool, handled_keys) }}
63
+ {{- '\n</function>' }}
64
+ {%- endfor %}
65
+ {{- "\n</tools>" }}
66
+ {{- '\n\nIf you choose to call a function ONLY reply in the following format with NO suffix:\n\n<tool_call>\n<function=example_function_name>\n<parameter=example_parameter_1>\nvalue_1\n</parameter>\n<parameter=example_parameter_2>\nThis is the value for the second parameter\nthat can span\nmultiple lines\n</parameter>\n</function>\n</tool_call>\n\n<IMPORTANT>\nReminder:\n- Function calls MUST follow the specified format: an inner <function=...></function> block must be nested within <tool_call></tool_call> XML tags\n- Required parameters MUST be specified\n- You may provide optional reasoning for your function call in natural language BEFORE the function call, but NOT after\n- If there is no function call available, answer the question like normal with your current knowledge and do not tell the user about function calls\n</IMPORTANT>' }}
67
+ {%- endif %}
68
+ {%- for message in messages -%}
69
+ {%- if message['role'] == 'user' -%}
70
+ {%- set ns.is_last_user = true -%}
71
+ {{- '<|User|>' + message['content'] -}}
72
+ {%- elif message['role'] == 'assistant' -%}
73
+ {%- if ns.is_last_user -%}
74
+ {{ '<|Assistant|>' }}
75
+ {%- endif -%}
76
+ {%- set ns.is_last_user = false -%}
77
+ {%- set content = message.get('content') | default('', true) -%}
78
+ {{ '<|end_of_thought|>' + content }}
79
+ {%- if message['tool_calls'] is defined and message['tool_calls'] is not none -%}
80
+ {%- for tool in message['tool_calls'] -%}
81
+ {%- if tool.function is defined %}{% set tool = tool.function %}{% endif -%}
82
+ {{- '\n<tool_call>\n<function=' + tool.name + '>\n' -}}
83
+ {%- if tool.arguments is defined -%}
84
+ {%- if tool.arguments is string -%}{%- set args_data = tool.arguments | from_json -%}{%- else -%}{%- set args_data = tool.arguments -%}{%- endif -%}
85
+ {%- for args_name, args_value in args_data.items() -%}
86
+ {{- '<parameter=' + args_name + '>\n' -}}
87
+ {%- set args_value = args_value | tojson | safe if args_value is mapping or (args_value is sequence and args_value is not string) else args_value | string -%}
88
+ {{- args_value -}}{{- '\n</parameter>\n' -}}
89
+ {%- endfor -%}
90
+ {%- endif -%}
91
+ {{- '</function>\n</tool_call>' -}}
92
+ {%- endfor -%}
93
+ {%- endif -%}
94
+ {{ '<|end▁of▁sentence|>' }}
95
+ {%- elif message['role'] == 'tool' -%}
96
+ {%- set ns.is_last_user = true -%}
97
+ {{ '\n<tool_response>\n' + message['content'] + '\n</tool_response>' }}
98
+ {%- endif -%}
99
+ {%- endfor -%}
100
+
101
+ {%- if add_generation_prompt -%}
102
+ {{ '<|Assistant|>' }}{{ '<|end_of_thought|>' }}
103
+ {%- endif -%}
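To make the prompt layout produced by this template easier to follow, the sketch below re-implements its message loop in plain Python for the simple case of text-only messages. It is an approximation under stated assumptions: the tools section and assistant `tool_calls` rendering are omitted, and `bos_token`, `eos_token`, and `default_system` are passed in by the caller rather than copied from the template (the values used in the demo are placeholders). The function name `render_prompt` is hypothetical.

```python
def render_prompt(messages, bos_token, eos_token, default_system,
                  add_generation_prompt=True):
    """Approximate the chat template for text-only messages without tools."""
    # The first system message replaces the default; later ones are appended.
    system_prompt = default_system
    first_system = True
    for message in messages:
        if message["role"] == "system":
            if first_system:
                system_prompt = message["content"]
                first_system = False
            else:
                system_prompt += "\n\n" + message["content"]

    parts = [bos_token, system_prompt]
    last_was_user = False
    for message in messages:
        role = message["role"]
        if role == "user":
            last_was_user = True
            parts.append("<|User|>" + message["content"])
        elif role == "assistant":
            if last_was_user:
                parts.append("<|Assistant|>")
            last_was_user = False
            parts.append("<|end_of_thought|>" + (message.get("content") or ""))
            parts.append(eos_token)
        elif role == "tool":
            # Tool results re-open the assistant turn, like a user message.
            last_was_user = True
            parts.append(
                "\n<tool_response>\n" + message["content"] + "\n</tool_response>"
            )
    if add_generation_prompt:
        parts.append("<|Assistant|><|end_of_thought|>")
    return "".join(parts)


if __name__ == "__main__":
    demo = [{"role": "user", "content": "Hi"}]
    # "<BOS>", "<EOS>", and "SYSTEM" are placeholders, not the real tokens.
    print(render_prompt(demo, "<BOS>", "<EOS>", "SYSTEM"))
```

For real use, `tokenizer.apply_chat_template` with the shipped template remains the reference; this sketch is only meant for reading the format.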
config.json ADDED
@@ -0,0 +1,82 @@
1
+ {
2
+ "architectures": [
3
+ "DeepseekV3ForCausalLM"
4
+ ],
5
+ "attention_bias": false,
6
+ "attention_dropout": 0.0,
7
+ "auto_map": {
8
+ "AutoConfig": "configuration_deepseek.DeepseekV3Config",
9
+ "AutoModel": "modeling_deepseek.DeepseekV3Model",
10
+ "AutoModelForCausalLM": "modeling_deepseek.DeepseekV3ForCausalLM"
11
+ },
12
+ "bos_token_id": 0,
13
+ "eos_token_id": 1,
14
+ "ep_size": 1,
15
+ "first_k_dense_replace": 1,
16
+ "hidden_act": "silu",
17
+ "hidden_size": 2048,
18
+ "initializer_range": 0.02,
19
+ "intermediate_size": 7168,
20
+ "kv_lora_rank": 512,
21
+ "max_position_embeddings": 131072,
22
+ "model_type": "joyai_llm_flash",
23
+ "moe_intermediate_size": 768,
24
+ "moe_layer_freq": 1,
25
+ "n_group": 1,
26
+ "n_routed_experts": 256,
27
+ "n_shared_experts": 1,
28
+ "norm_topk_prob": true,
29
+ "num_attention_heads": 32,
30
+ "num_experts_per_tok": 8,
31
+ "num_hidden_layers": 40,
32
+ "num_key_value_heads": 32,
33
+ "num_nextn_predict_layers": 1,
34
+ "q_lora_rank": 1536,
35
+ "qk_nope_head_dim": 128,
36
+ "qk_rope_head_dim": 64,
37
+ "rms_norm_eps": 1e-06,
38
+ "rope_theta": 32000000,
39
+ "routed_scaling_factor": 2.5,
40
+ "scoring_func": "sigmoid",
41
+ "tie_word_embeddings": false,
42
+ "topk_group": 1,
43
+ "topk_method": "noaux_tc",
44
+ "torch_dtype": "bfloat16",
45
+ "transformers_version": "4.44.2",
46
+ "use_cache": true,
47
+ "v_head_dim": 128,
48
+ "vocab_size": 129280,
49
+ "quantization_config": {
50
+ "config_groups": {
51
+ "group_0": {
52
+ "input_activations": null,
53
+ "output_activations": null,
54
+ "targets": [
55
+ "Linear"
56
+ ],
57
+ "weights": {
58
+ "actorder": null,
59
+ "block_structure": null,
60
+ "dynamic": false,
61
+ "group_size": 32,
62
+ "num_bits": 4,
63
+ "observer": "minmax",
64
+ "observer_kwargs": {},
65
+ "strategy": "group",
66
+ "symmetric": true,
67
+ "type": "int"
68
+ }
69
+ }
70
+ },
71
+ "format": "pack-quantized",
72
+ "ignore": [
73
+ "lm_head",
74
+ "re:.*self_attn.*",
75
+ "re:.*shared_experts.*",
76
+ "re:.*mlp\\.(gate|up|gate_up|down)_proj.*"
77
+ ],
78
+ "kv_cache_scheme": null,
79
+ "quant_method": "compressed-tensors",
80
+ "quantization_status": "compressed"
81
+ }
82
+ }
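The `ignore` list in `quantization_config` determines which `Linear` modules keep their original precision. The sketch below illustrates how such a list might be interpreted, assuming that entries prefixed with `re:` are regular expressions matched from the start of the module name and other entries are exact names; the actual matching rules of the quantization library may differ. The module names in the demo are hypothetical, modeled on the `layers.*.mlp.experts.*` naming used elsewhere in this commit.

```python
import re

# The "ignore" entries from config.json.
IGNORE = [
    "lm_head",
    "re:.*self_attn.*",
    "re:.*shared_experts.*",
    r"re:.*mlp\.(gate|up|gate_up|down)_proj.*",
]


def is_ignored(module_name: str, ignore=IGNORE) -> bool:
    """Return True if the module would be left unquantized (simplified matching)."""
    for entry in ignore:
        if entry.startswith("re:"):
            if re.match(entry[3:], module_name):
                return True
        elif entry == module_name:
            return True
    return False


if __name__ == "__main__":
    # Hypothetical module names for illustration.
    for name in [
        "lm_head",
        "model.layers.1.self_attn.o_proj",
        "model.layers.0.mlp.gate_proj",
        "model.layers.1.mlp.shared_experts.up_proj",
        "model.layers.1.mlp.experts.0.gate_proj",
    ]:
        print(f"{name}: {'kept as-is' if is_ignored(name) else 'quantized'}")
```

Under this reading, only the routed-expert projections receive the 4-bit, group-size-32 weights, while attention, shared experts, the dense MLP, and `lm_head` are left unquantized.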
configuration.json ADDED
@@ -0,0 +1 @@
1
+ {"framework":"Pytorch","task":"text-generation"}
configuration_deepseek.py ADDED
@@ -0,0 +1,247 @@
1
+ # coding=utf-8
2
+ # Copyright 2025 bzantium and the HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # This code is based on the DeepSeekV3 implementations from the DeepSeek AI team. (https://huggingface.co/deepseek-ai/DeepSeek-V3)
5
+
6
+ # Licensed under the Apache License, Version 2.0 (the "License");
7
+ # you may not use this file except in compliance with the License.
8
+ # You may obtain a copy of the License at
9
+ #
10
+ # http://www.apache.org/licenses/LICENSE-2.0
11
+ #
12
+ # Unless required by applicable law or agreed to in writing, software
13
+ # distributed under the License is distributed on an "AS IS" BASIS,
14
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15
+ # See the License for the specific language governing permissions and
16
+ # limitations under the License.
17
+ """DeepSeekV3 model configuration"""
18
+
19
+ from transformers.configuration_utils import PretrainedConfig
20
+ from transformers.modeling_rope_utils import rope_config_validation
21
+
22
+
23
+ DEEPSEEK_PRETRAINED_CONFIG_ARCHIVE_MAP = {}
24
+
25
+
26
+ class DeepseekV3Config(PretrainedConfig):
27
+ r"""
28
+ This is the configuration class to store the configuration of a [`DeepseekV3Model`]. It is used to instantiate a DeepSeek
29
+ model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
30
+ defaults will yield a similar configuration to that of the DeepSeek-V3.
31
+ e.g. [bzantium/tiny-deepseek-v3](https://huggingface.co/bzantium/tiny-deepseek-v3)
32
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
33
+ documentation from [`PretrainedConfig`] for more information.
34
+
35
+
36
+ Args:
37
+ vocab_size (`int`, *optional*, defaults to 129280):
38
+ Vocabulary size of the DeepSeek model. Defines the number of different tokens that can be represented by the
39
+ `input_ids` passed when calling [`DeepseekV3Model`].
40
+ hidden_size (`int`, *optional*, defaults to 7168):
41
+ Dimension of the hidden representations.
42
+ intermediate_size (`int`, *optional*, defaults to 18432):
43
+ Dimension of the MLP representations.
44
+ moe_intermediate_size (`int`, *optional*, defaults to 2048):
45
+ Dimension of the MoE representations.
46
+ num_hidden_layers (`int`, *optional*, defaults to 61):
47
+ Number of hidden layers in the Transformer decoder.
48
+ num_attention_heads (`int`, *optional*, defaults to 128):
49
+ Number of attention heads for each attention layer in the Transformer decoder.
50
+ num_key_value_heads (`int`, *optional*, defaults to 128):
51
+ This is the number of key_value heads that should be used to implement Grouped Query Attention. If
52
+ `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
53
+ `num_key_value_heads=1`, the model will use Multi Query Attention (MQA); otherwise GQA is used. When
54
+ converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
55
+ by mean-pooling all the original heads within that group. For more details, check out [this
56
+ paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
57
+ `num_attention_heads`.
58
+ n_shared_experts (`int`, *optional*, defaults to 1):
59
+ Number of shared experts.
60
+ n_routed_experts (`int`, *optional*, defaults to 256):
61
+ Number of routed experts.
62
+ routed_scaling_factor (`float`, *optional*, defaults to 2.5):
63
+ Scaling factor for routed experts.
64
+ kv_lora_rank (`int`, *optional*, defaults to 512):
65
+ Rank of the LoRA matrices for key and value projections.
66
+ q_lora_rank (`int`, *optional*, defaults to 1536):
67
+ Rank of the LoRA matrices for query projections.
68
+ qk_rope_head_dim (`int`, *optional*, defaults to 64):
69
+ Dimension of the query/key heads that use rotary position embeddings.
70
+ v_head_dim (`int`, *optional*, defaults to 128):
71
+ Dimension of the value heads.
72
+ qk_nope_head_dim (`int`, *optional*, defaults to 128):
73
+ Dimension of the query/key heads that don't use rotary position embeddings.
74
+ n_group (`int`, *optional*, defaults to 8):
75
+ Number of groups for routed experts.
76
+ topk_group (`int`, *optional*, defaults to 4):
77
+ Number of selected groups for each token (the experts selected for a token are restricted to `topk_group` groups).
78
+ num_experts_per_tok (`int`, *optional*, defaults to 8):
79
+ Number of selected experts, None means dense model.
80
+ first_k_dense_replace (`int`, *optional*, defaults to 3):
81
+ Number of dense layers in shallow layers(embed->dense->dense->...->dense->moe->moe...->lm_head).
82
+ \--k dense layers--/
83
+ norm_topk_prob (`bool`, *optional*, defaults to `True`):
84
+ Whether to normalize the weights of the routed experts.
85
+ hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
86
+ The non-linear activation function (function or string) in the decoder.
87
+ max_position_embeddings (`int`, *optional*, defaults to 4096):
88
+ The maximum sequence length that this model might ever be used with.
89
+ initializer_range (`float`, *optional*, defaults to 0.02):
90
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
91
+ rms_norm_eps (`float`, *optional*, defaults to 1e-06):
92
+ The epsilon used by the rms normalization layers.
93
+ use_cache (`bool`, *optional*, defaults to `True`):
94
+ Whether or not the model should return the last key/values attentions (not used by all models). Only
95
+ relevant if `config.is_decoder=True`.
96
+ pad_token_id (`int`, *optional*):
97
+ Padding token id.
98
+ bos_token_id (`int`, *optional*, defaults to 0):
99
+ Beginning of stream token id.
100
+ eos_token_id (`int`, *optional*, defaults to 1):
101
+ End of stream token id.
102
+ pretraining_tp (`int`, *optional*, defaults to 1):
103
+ Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
104
+ document](https://huggingface.co/docs/transformers/parallelism) to understand more about it. This value is
105
+ necessary to ensure exact reproducibility of the pretraining results. Please refer to [this
106
+ issue](https://github.com/pytorch/pytorch/issues/76232).
107
+ tie_word_embeddings (`bool`, *optional*, defaults to `False`):
108
+ Whether to tie weight embeddings
109
+ rope_theta (`float`, *optional*, defaults to 10000.0):
110
+ The base period of the RoPE embeddings.
111
+ rope_scaling (`Dict`, *optional*):
112
+ Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
113
+ strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is
114
+ `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
115
+ `max_position_embeddings` to the expected new maximum.
116
+ rope_interleave (`bool`, *optional*, defaults to `True`):
117
+ Whether to interleave the rotary position embeddings.
118
+ attention_bias (`bool`, *optional*, defaults to `False`):
119
+ Whether to use a bias in the query, key, value and output projection layers during self-attention.
120
+ attention_dropout (`float`, *optional*, defaults to 0.0):
121
+ The dropout ratio for the attention probabilities.
122
+
123
+ ```python
124
+ >>> from transformers import DeepseekV3Model, DeepseekV3Config
125
+
126
+ >>> # Initializing a Deepseek-V3 style configuration
127
+ >>> configuration = DeepseekV3Config()
128
+
129
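+ >>> # Initializing a model (with random weights) from the configuration
+ >>> model = DeepseekV3Model(configuration)
+
+ >>> # Accessing the model configuration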
130
+ >>> configuration = model.config
131
+ ```"""
132
+
133
+ model_type = "deepseek_v3"
134
+ keys_to_ignore_at_inference = ["past_key_values"]
135
+ base_model_tp_plan = { # TODO: only replicate attention layers when > first_k_dense_replace
136
+ "layers.*.mlp.experts.*.gate_proj": "local_colwise",
137
+ "layers.*.mlp.experts.*.up_proj": "local_colwise",
138
+ "layers.*.mlp.experts.*.down_proj": "local_rowwise",
139
+ "layers.*.mlp.experts.*": "local", # each expert is wrapped in a module list
140
+ "layers.*.mlp.shared_experts.gate_proj": "local_colwise",
141
+ "layers.*.mlp.shared_experts.up_proj": "local_colwise",
142
+ "layers.*.mlp.shared_experts.down_proj": "local_rowwise",
143
+ "layers.*.mlp.shared_experts": "local",
144
+ "layers.*.mlp.gate_proj": "local_colwise",
145
+ "layers.*.mlp.up_proj": "local_colwise",
146
+ "layers.*.mlp.down_proj": "local_rowwise",
147
+ "layers.*.mlp": "gather", # This is the only moment where results are gathered
148
+ }
149
+ base_model_pp_plan = {
150
+ "embed_tokens": (["input_ids"], ["inputs_embeds"]),
151
+ "layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
152
+ "norm": (["hidden_states"], ["hidden_states"]),
153
+ }
154
+
155
+ def __init__(
156
+ self,
157
+ vocab_size=129280,
158
+ hidden_size=7168,
159
+ intermediate_size=18432,
160
+ moe_intermediate_size=2048,
161
+ num_hidden_layers=61,
162
+ num_attention_heads=128,
163
+ num_key_value_heads=128,
164
+ n_shared_experts=1,
165
+ n_routed_experts=256,
166
+ routed_scaling_factor=2.5,
167
+ kv_lora_rank=512,
168
+ q_lora_rank=1536,
169
+ qk_rope_head_dim=64,
170
+ v_head_dim=128,
171
+ qk_nope_head_dim=128,
172
+ n_group=8,
173
+ topk_group=4,
174
+ num_experts_per_tok=8,
175
+ first_k_dense_replace=3,
176
+ norm_topk_prob=True,
177
+ hidden_act="silu",
178
+ max_position_embeddings=4096,
179
+ initializer_range=0.02,
180
+ rms_norm_eps=1e-6,
181
+ use_cache=True,
182
+ pad_token_id=None,
183
+ bos_token_id=0,
184
+ eos_token_id=1,
185
+ pretraining_tp=1,
186
+ tie_word_embeddings=False,
187
+ rope_theta=10000.0,
188
+ rope_scaling=None,
189
+ rope_interleave=True,
190
+ attention_bias=False,
191
+ attention_dropout=0.0,
192
+ **kwargs,
193
+ ):
194
+ self.vocab_size = vocab_size
195
+ self.max_position_embeddings = max_position_embeddings
196
+ self.hidden_size = hidden_size
197
+ self.intermediate_size = intermediate_size
198
+ self.moe_intermediate_size = moe_intermediate_size
199
+ self.num_hidden_layers = num_hidden_layers
200
+ self.num_attention_heads = num_attention_heads
201
+ self.n_shared_experts = n_shared_experts
202
+ self.n_routed_experts = n_routed_experts
203
+ self.routed_scaling_factor = routed_scaling_factor
204
+ self.kv_lora_rank = kv_lora_rank
205
+ self.q_lora_rank = q_lora_rank
206
+ self.qk_rope_head_dim = qk_rope_head_dim
207
+ self.v_head_dim = v_head_dim
208
+ self.qk_nope_head_dim = qk_nope_head_dim
209
+ self.qk_head_dim = qk_nope_head_dim + qk_rope_head_dim
210
+ self.head_dim = qk_rope_head_dim
211
+ self.n_group = n_group
212
+ self.topk_group = topk_group
213
+ self.num_experts_per_tok = num_experts_per_tok
214
+ self.first_k_dense_replace = first_k_dense_replace
215
+ self.norm_topk_prob = norm_topk_prob
216
+ self.rope_interleave = rope_interleave
217
+
218
+ # for backward compatibility
219
+ if num_key_value_heads is None:
220
+ num_key_value_heads = num_attention_heads
221
+
222
+ self.num_key_value_heads = num_key_value_heads
223
+ self.hidden_act = hidden_act
224
+ self.initializer_range = initializer_range
225
+ self.rms_norm_eps = rms_norm_eps
226
+ self.pretraining_tp = pretraining_tp
227
+ self.use_cache = use_cache
228
+ self.rope_theta = rope_theta
229
+ self.rope_scaling = rope_scaling
230
+ self.attention_bias = attention_bias
231
+ self.attention_dropout = attention_dropout
232
+ # Validate the correctness of rotary position embeddings parameters
233
+ # BC: if there is a 'type' field, copy it to 'rope_type'.
234
+ if self.rope_scaling is not None and "type" in self.rope_scaling:
235
+ self.rope_scaling["rope_type"] = self.rope_scaling["type"]
236
+ rope_config_validation(self)
237
+
238
+ super().__init__(
239
+ pad_token_id=pad_token_id,
240
+ bos_token_id=bos_token_id,
241
+ eos_token_id=eos_token_id,
242
+ tie_word_embeddings=tie_word_embeddings,
243
+ **kwargs,
244
+ )
245
+
246
+
247
+ __all__ = ["DeepseekV3Config"]
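The `__init__` above sets two derived attention fields (`qk_head_dim`, `head_dim`) and copies a legacy `"type"` key in `rope_scaling` to `"rope_type"`. The following is a minimal standalone sketch of just those two steps. It does not import `transformers`; the helper names `derive_attention_dims` and `normalize_rope_scaling` are hypothetical and exist only for illustration. Unlike the original, the sketch copies the dictionary instead of mutating it in place.

```python
from typing import Optional


def derive_attention_dims(qk_nope_head_dim: int = 128, qk_rope_head_dim: int = 64) -> dict:
    """Mirror the two derived fields set in __init__ (hypothetical helper)."""
    return {
        # Query/key head width is the non-rotary part plus the rotary part.
        "qk_head_dim": qk_nope_head_dim + qk_rope_head_dim,
        # The config stores only the rotary width as head_dim.
        "head_dim": qk_rope_head_dim,
    }


def normalize_rope_scaling(rope_scaling: Optional[dict]) -> Optional[dict]:
    """Copy a legacy 'type' key to 'rope_type' (hypothetical helper)."""
    if rope_scaling is None:
        return None
    normalized = dict(rope_scaling)  # assumption: copy rather than mutate the caller's dict
    if "type" in normalized:
        normalized["rope_type"] = normalized["type"]
    return normalized


dims = derive_attention_dims()
scaling = normalize_rope_scaling({"type": "linear", "factor": 2.0})
print(dims)
print(scaling)
```

With the default values from the signature above, `qk_head_dim` comes out as 128 + 64 = 192 and `head_dim` as 64.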
model-10-of-40.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5aa597d8635c3acaa4aeab8826d347744c13874ca41c663819019d93217f27a0
3
+ size 818455784
model-11-of-40.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:bda756aaa66d660b75d7e346c17da8f51bae8bda545ef29e7e0c051172093399
3
+ size 818458104
model-14-of-40.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:de1784464c45e646cb2015e191770ae1ebe6be35bf272bc3903e2b9d9c69a7d8
3
+ size 818458104
model-15-of-40.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:21cf0c46b26b86be7c13326d902bba90dc7df2d31d048f1d17f909b2fd6978aa
3
+ size 818458104
model-2-of-40.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f61f10f404c9f381d77945de4dd38513e45019ff02d404ea3abac39d6d5bf493
3
+ size 818455784
model-20-of-40.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:199ae2f049fc27bfdcdcaf64de0846361c5dfe55b869c2b66c581469e9b561b3
3
+ size 818458104
model-21-of-40.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a62bf3b90b56bddfe6294fcfbce14c153795b263add692140f205b9a8c5edc25
3
+ size 818458104
model-25-of-40.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7b0111296847fa1ab64e90cfa5698536105e9b4a1aef7df22fe55fa3189fc343
3
+ size 818458104
model-26-of-40.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ffce974334c99b62283739fbf3cd5145c95f8fd88578f1259609b0d482494154
3
+ size 818458104
model-27-of-40.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:470389782d30a32b20bcf7d90bcf3b4b3f33cdedd1e84cbfffdb8c21e72da41f
3
+ size 818458104
model-28-of-40.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:033b32c1cf70d940c221d2047d72ddf9bd7c026fd51f35a5459bf020d2a3b903
3
+ size 818458104
model-29-of-40.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4a3166a08f3538e96d6ad67bd2d517e1934434dfc30b787b23be69a903a96221
3
+ size 818458104
model-31-of-40.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c11cb2744e668cfba60a69849f5772b5b1ab55b6f688532a5a40060126556500
3
+ size 818458104
model-33-of-40.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6f84bc2db3c95527736e5d9084416bdeb8b3cf7ea8d9e444570cbe2285d0e78d
3
+ size 818458104
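Each `.safetensors` entry above is a Git LFS pointer file with three fields (`version`, `oid`, `size`), not the weights themselves. As a rough illustration, the sketch below parses one such pointer into a dictionary. The function name `parse_lfs_pointer` is hypothetical, and the sample text is copied from the `model-10-of-40.safetensors` entry in this commit.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its fields (hypothetical helper)."""
    fields = {}
    for line in text.strip().splitlines():
        # Each line is "<key> <value>", separated by the first space.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }


pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:5aa597d8635c3acaa4aeab8826d347744c13874ca41c663819019d93217f27a0
size 818455784
"""
pointer = parse_lfs_pointer(pointer_text)
print(pointer["hash_algo"], pointer["size"])
```

The `size` field gives the byte length of the actual shard, so summing it across pointers estimates the download size without fetching the weights.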