exdysa chengs18 committed on
Commit fc2ebc7 · verified · 0 Parent(s)

Duplicate from JetLM/SDAR-1.7B-Chat

Co-authored-by: Cheng Shuang <chengs18@users.noreply.huggingface.co>

.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
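The patterns in this `.gitattributes` route matching files through Git LFS instead of storing them directly in the repository. As a rough illustration, the sketch below checks a few filenames against a subset of those patterns using Python's `fnmatch`. This only approximates gitattributes matching (Git applies its own rules for `**` and directory patterns), and the helper name `is_lfs_tracked` is hypothetical.

```python
from fnmatch import fnmatch

# A subset of the patterns listed in the .gitattributes of this commit.
LFS_PATTERNS = ["*.bin", "*.safetensors", "*.pt", "*.tar.*", "*tfevents*"]


def is_lfs_tracked(filename: str) -> bool:
    """Return True if the bare filename matches any LFS pattern (approximation)."""
    return any(fnmatch(filename, pattern) for pattern in LFS_PATTERNS)


for name in ["model.safetensors", "config.json", "merges.txt"]:
    print(name, is_lfs_tracked(name))
```

Under this approximation only `model.safetensors` matches, which agrees with the commit: that file appears as an LFS pointer while the JSON and text files are stored inline.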
README.md ADDED
@@ -0,0 +1,110 @@
+ ---
+ license: apache-2.0
+ library_name: transformers
+ ---
+
+ # SDAR
+
+ <div align="center">
+ <img src="https://raw.githubusercontent.com/JetAstra/SDAR/main/assets/SDAR_doc_head.png">
+
+
+ <div>&nbsp;</div>
+
+ [arXiv](https://arxiv.org/abs/2510.06303) • [💻Github Repo](https://github.com/JetAstra/SDAR) • [🤗Model Collections](https://huggingface.co/collections/JetLM/sdar-689b1b6d392a4eeb2664f8ff)
+
+ </div>
+
+ # Introduction
+
+ **SDAR** (**S**ynergy of **D**iffusion and **A**uto**R**egression) is a new large language model that integrates autoregressive (AR) and discrete diffusion modeling strategies. It combines the efficient training paradigm of AR models with the highly parallel inference capability of diffusion models, while delivering performance on par with SOTA open-source AR models. At the same time, SDAR sets a new benchmark as the most powerful diffusion language model to date. We highlight three major conclusions from our study:
+
+ > [!IMPORTANT]
+ > Take-home message
+ >
+ > - **Balanced Efficiency:** SDAR unifies the **efficient training** of AR models with the **parallel inference** of diffusion, achieving both fast training and inference.
+ > - **Fair Comparisons:** In rigorously controlled experiments, SDAR achieves **on-par general task performance** with strong AR baselines, ensuring credibility and reproducibility.
+ > - **Superior Learning Efficiency:** On complex scientific reasoning tasks (e.g., GPQA, ChemBench, Physics), SDAR shows **clear gains over AR models** of the same scale, approaching or even exceeding leading closed-source systems.
+
+ # Inference
+
+ ## Using the tailored inference engine [JetEngine](https://github.com/Labman42/JetEngine)
+
+ JetEngine enables more efficient inference than the built-in implementation.
+
+ ```bash
+ git clone https://github.com/Labman42/JetEngine.git
+ cd JetEngine
+ pip install .
+ ```
+
+ The following example shows how to load a model with JetEngine and run a prompt end-to-end.
+
+ ```python
+ import os
+ from jetengine import LLM, SamplingParams
+ from transformers import AutoTokenizer
+
+ model_path = os.path.expanduser("/path/to/your/sdar-model")
+ tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+ # Initialize the LLM
+ llm = LLM(
+     model_path,
+     enforce_eager=True,
+     tensor_parallel_size=1,
+     mask_token_id=151669,  # Optional: only needed for masked/diffusion models
+     block_length=4
+ )
+
+ # Set sampling/generation parameters
+ sampling_params = SamplingParams(
+     temperature=1.0,
+     topk=0,
+     topp=1.0,
+     max_tokens=256,
+     remasking_strategy="low_confidence_dynamic",
+     block_length=4,
+     denoising_steps=4,
+     dynamic_threshold=0.9
+ )
+
+ # Prepare a simple chat-style prompt
+ prompt = tokenizer.apply_chat_template(
+     [{"role": "user", "content": "Explain what reinforcement learning is in simple terms."}],
+     tokenize=False,
+     add_generation_prompt=True
+ )
+
+ # Generate text
+ outputs = llm.generate_streaming([prompt], sampling_params)
+ ```
+
+ # Performance
+
+ ### SDAR vs. Qwen
+
+ For **SDAR** models, inference hyperparameters are set to `block_length = 4` and `denoising_steps = 4`, with greedy decoding.
+
+ For **Qwen3-1.7B-AR-SFT** and **Qwen3-30B-AR-SFT**, we use *greedy decoding*. Results for the base models **Qwen3-1.7B-Base** and **Qwen3-30B-Base** are taken from the [Qwen3 Technical Report](https://arxiv.org/abs/2505.09388).
+
+ <p align="center">
+ <img src="https://raw.githubusercontent.com/JetAstra/SDAR/main/assets/table1.png" style="max-width:100%; height:auto;">
+ </p>
+
+ ### SDAR-Sci vs. AR Baseline
+
+ This table presents a **controlled comparison** between AR and SDAR under the same backbone and dataset settings.
+ The results are averaged over 8 runs for GPQA, and over 32 runs each for AIME 2024, AIME 2025, and LiveMathBench.
+
+ <p align="center">
+ <img src="https://raw.githubusercontent.com/JetAstra/SDAR/main/assets/table2.png" style="max-width:100%; height:auto;">
+ </p>
+
+ #### SDAR-Sci vs. Other Models
+
+ This table positions **SDAR-30B-A3B-Sci(sample)** against leading open-source and closed-source LLMs.
+ Scores for external models are sourced from the [InternLM/Intern-S1](https://github.com/InternLM/Intern-S1) repository.
+
+ <p align="center">
+ <img src="https://raw.githubusercontent.com/JetAstra/SDAR/main/assets/table3.png" style="max-width:100%; height:auto;">
+ </p>
added_tokens.json ADDED
@@ -0,0 +1,29 @@
+ {
+   "</think>": 151668,
+   "</tool_call>": 151658,
+   "</tool_response>": 151666,
+   "<MASK>": 151669,
+   "<think>": 151667,
+   "<tool_call>": 151657,
+   "<tool_response>": 151665,
+   "<|box_end|>": 151649,
+   "<|box_start|>": 151648,
+   "<|endoftext|>": 151643,
+   "<|file_sep|>": 151664,
+   "<|fim_middle|>": 151660,
+   "<|fim_pad|>": 151662,
+   "<|fim_prefix|>": 151659,
+   "<|fim_suffix|>": 151661,
+   "<|im_end|>": 151645,
+   "<|im_start|>": 151644,
+   "<|image_pad|>": 151655,
+   "<|object_ref_end|>": 151647,
+   "<|object_ref_start|>": 151646,
+   "<|quad_end|>": 151651,
+   "<|quad_start|>": 151650,
+   "<|repo_name|>": 151663,
+   "<|video_pad|>": 151656,
+   "<|vision_end|>": 151653,
+   "<|vision_pad|>": 151654,
+   "<|vision_start|>": 151652
+ }
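The mapping in `added_tokens.json` is where the `<MASK>` ID used elsewhere in this commit comes from: the README's JetEngine example passes `mask_token_id=151669`. A minimal sketch of looking that ID up and sanity-checking it against the `vocab_size` of 151936 from `config.json` follows. Only a few entries are copied here rather than the full file, and the variable names are illustrative.

```python
import json

# A few entries copied from added_tokens.json in this commit (not the full file).
added_tokens = json.loads("""
{
  "<|endoftext|>": 151643,
  "<|im_start|>": 151644,
  "<|im_end|>": 151645,
  "<think>": 151667,
  "</think>": 151668,
  "<MASK>": 151669
}
""")

VOCAB_SIZE = 151936  # "vocab_size" from config.json

mask_token_id = added_tokens["<MASK>"]

# Every added token must fit inside the embedding table.
assert all(token_id < VOCAB_SIZE for token_id in added_tokens.values())

# Reverse lookup, handy when inspecting raw token IDs in generated output.
id_to_token = {token_id: token for token, token_id in added_tokens.items()}
print(mask_token_id, id_to_token[151645])
```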
chat_template.jinja ADDED
@@ -0,0 +1,85 @@
+ {%- if tools %}
+     {{- '<|im_start|>system\n' }}
+     {%- if messages[0].role == 'system' %}
+         {{- messages[0].content + '\n\n' }}
+     {%- endif %}
+     {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
+     {%- for tool in tools %}
+         {{- "\n" }}
+         {{- tool | tojson }}
+     {%- endfor %}
+     {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
+ {%- else %}
+     {%- if messages[0].role == 'system' %}
+         {{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
+     {%- endif %}
+ {%- endif %}
+ {%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
+ {%- for message in messages[::-1] %}
+     {%- set index = (messages|length - 1) - loop.index0 %}
+     {%- if ns.multi_step_tool and message.role == "user" and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}
+         {%- set ns.multi_step_tool = false %}
+         {%- set ns.last_query_index = index %}
+     {%- endif %}
+ {%- endfor %}
+ {%- for message in messages %}
+     {%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
+         {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
+     {%- elif message.role == "assistant" %}
+         {%- set content = message.content %}
+         {%- set reasoning_content = '' %}
+         {%- if message.reasoning_content is defined and message.reasoning_content is not none %}
+             {%- set reasoning_content = message.reasoning_content %}
+         {%- else %}
+             {%- if '</think>' in message.content %}
+                 {%- set content = message.content.split('</think>')[-1].lstrip('\n') %}
+                 {%- set reasoning_content = message.content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
+             {%- endif %}
+         {%- endif %}
+         {%- if loop.index0 > ns.last_query_index %}
+             {%- if loop.last or (not loop.last and reasoning_content) %}
+                 {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
+             {%- else %}
+                 {{- '<|im_start|>' + message.role + '\n' + content }}
+             {%- endif %}
+         {%- else %}
+             {{- '<|im_start|>' + message.role + '\n' + content }}
+         {%- endif %}
+         {%- if message.tool_calls %}
+             {%- for tool_call in message.tool_calls %}
+                 {%- if (loop.first and content) or (not loop.first) %}
+                     {{- '\n' }}
+                 {%- endif %}
+                 {%- if tool_call.function %}
+                     {%- set tool_call = tool_call.function %}
+                 {%- endif %}
+                 {{- '<tool_call>\n{"name": "' }}
+                 {{- tool_call.name }}
+                 {{- '", "arguments": ' }}
+                 {%- if tool_call.arguments is string %}
+                     {{- tool_call.arguments }}
+                 {%- else %}
+                     {{- tool_call.arguments | tojson }}
+                 {%- endif %}
+                 {{- '}\n</tool_call>' }}
+             {%- endfor %}
+         {%- endif %}
+         {{- '<|im_end|>\n' }}
+     {%- elif message.role == "tool" %}
+         {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
+             {{- '<|im_start|>user' }}
+         {%- endif %}
+         {{- '\n<tool_response>\n' }}
+         {{- message.content }}
+         {{- '\n</tool_response>' }}
+         {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
+             {{- '<|im_end|>\n' }}
+         {%- endif %}
+     {%- endif %}
+ {%- endfor %}
+ {%- if add_generation_prompt %}
+     {{- '<|im_start|>assistant\n' }}
+     {%- if enable_thinking is defined and enable_thinking is false %}
+         {{- '<think>\n\n</think>\n\n' }}
+     {%- endif %}
+ {%- endif %}
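For the simplest case (no tools, no tool responses, no reasoning content), the chat template above reduces to wrapping each turn in `<|im_start|>` / `<|im_end|>` markers and optionally opening an assistant turn. The sketch below reproduces only that reduced behaviour in plain Python. `render_chatml` is a hypothetical helper, not part of the repository, and it deliberately skips the `<think>` handling the template applies to assistant turns that follow the last user query.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render system/user turns in the ChatML layout used by the template.

    Simplification: no tools, no tool responses, and no <think> handling,
    so this is only faithful for histories without trailing assistant turns.
    """
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Opens the assistant turn that the model is expected to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hi"},
    ]
)
print(prompt)
```

In practice `tokenizer.apply_chat_template`, as used in the README example, should be preferred, since it runs the real template including the tool and reasoning branches.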
config.json ADDED
@@ -0,0 +1,36 @@
+ {
+   "architectures": [
+     "SDARForCausalLM"
+   ],
+   "auto_map": {
+     "AutoConfig": "configuration_sdar.SDARConfig",
+     "AutoModel": "modeling_sdar.SDARModel",
+     "AutoModelForCausalLM": "modeling_sdar.SDARForCausalLM"
+   },
+   "attention_bias": false,
+   "attention_dropout": 0.0,
+   "bos_token_id": 151643,
+   "eos_token_id": 151643,
+   "fuse_cross_entropy": true,
+   "head_dim": 128,
+   "hidden_act": "silu",
+   "hidden_size": 2048,
+   "initializer_range": 0.02,
+   "intermediate_size": 6144,
+   "max_position_embeddings": 32768,
+   "max_window_layers": 28,
+   "model_type": "sdar",
+   "num_attention_heads": 16,
+   "num_hidden_layers": 28,
+   "num_key_value_heads": 8,
+   "rms_norm_eps": 1e-06,
+   "rope_scaling": null,
+   "rope_theta": 1000000,
+   "sliding_window": null,
+   "tie_word_embeddings": false,
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.52.4",
+   "use_cache": true,
+   "use_sliding_window": false,
+   "vocab_size": 151936
+ }
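The values in `config.json` are enough for a back-of-the-envelope size check. The sketch below counts only the large weight matrices, assuming bias-free `q_proj`/`k_proj`/`v_proj`/`o_proj` and `gate_proj`/`up_proj`/`down_proj` layers (the names listed in `base_model_tp_plan`, with `attention_bias` set to false) plus separate input and output embeddings (`tie_word_embeddings` is false). Normalization weights are left out, so the figure is an estimate rather than an exact parameter count.

```python
# Values copied from config.json in this commit.
hidden_size = 2048
intermediate_size = 6144
num_hidden_layers = 28
num_attention_heads = 16
num_key_value_heads = 8
head_dim = 128
vocab_size = 151936

q_out = num_attention_heads * head_dim    # query projection width
kv_out = num_key_value_heads * head_dim   # key/value projection width
gqa_group_size = num_attention_heads // num_key_value_heads  # query heads per KV head

# q_proj + k_proj + v_proj + o_proj (assumed bias-free)
attention = hidden_size * q_out + 2 * hidden_size * kv_out + q_out * hidden_size
# gate_proj + up_proj + down_proj
mlp = 3 * hidden_size * intermediate_size
# untied input embedding and output head
embeddings = 2 * vocab_size * hidden_size

total = num_hidden_layers * (attention + mlp) + embeddings
bf16_bytes = total * 2  # two bytes per bfloat16 weight

print(gqa_group_size, total, bf16_bytes)
```

Under these assumptions the estimate comes to roughly 2.03 billion weights, or about 4.06 GB in bfloat16. That is slightly below the 4063515640-byte size recorded in the `model.safetensors` LFS pointer, which is what one would expect with the norm weights and file header omitted.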
configuration_sdar.py ADDED
@@ -0,0 +1,212 @@
+ # coding=utf-8
+ # Copyright 2024 The Qwen team, Alibaba Group and the HuggingFace Inc. team. All rights reserved.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """SDAR model configuration"""
+
+ from transformers.configuration_utils import PretrainedConfig
+ from transformers.modeling_rope_utils import rope_config_validation
+ from transformers.utils import logging
+
+
+ logger = logging.get_logger(__name__)
+
+
+ class SDARConfig(PretrainedConfig):
+     r"""
+     This is the configuration class to store the configuration of a [`SDARModel`]. It is used to instantiate a
+     SDAR model according to the specified arguments, defining the model architecture. Instantiating a configuration
+     with the defaults will yield a similar configuration to that of
+     SDAR-1.7B [DiffuOpen/SDAR-1.7B-Chat](https://huggingface.co/DiffuOpen/SDAR-1.7B-Chat/).
+
+     Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+     documentation from [`PretrainedConfig`] for more information.
+
+
+     Args:
+         vocab_size (`int`, *optional*, defaults to 151936):
+             Vocabulary size of the SDAR model. Defines the number of different tokens that can be represented by the
+             `input_ids` passed when calling [`SDARModel`]
+         hidden_size (`int`, *optional*, defaults to 4096):
+             Dimension of the hidden representations.
+         intermediate_size (`int`, *optional*, defaults to 22016):
+             Dimension of the MLP representations.
+         num_hidden_layers (`int`, *optional*, defaults to 32):
+             Number of hidden layers in the Transformer decoder.
+         num_attention_heads (`int`, *optional*, defaults to 32):
+             Number of attention heads for each attention layer in the Transformer decoder.
+         num_key_value_heads (`int`, *optional*, defaults to 32):
+             This is the number of key_value heads that should be used to implement Grouped Query Attention. If
+             `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
+             `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
+             converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
+             by mean-pooling all the original heads within that group. For more details check out [this
+             paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to `32`.
+         head_dim (`int`, *optional*, defaults to 128):
+             The attention head dimension.
+         hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
+             The non-linear activation function (function or string) in the decoder.
+         max_position_embeddings (`int`, *optional*, defaults to 32768):
+             The maximum sequence length that this model might ever be used with.
+         initializer_range (`float`, *optional*, defaults to 0.02):
+             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+         rms_norm_eps (`float`, *optional*, defaults to 1e-06):
+             The epsilon used by the rms normalization layers.
+         use_cache (`bool`, *optional*, defaults to `True`):
+             Whether or not the model should return the last key/values attentions (not used by all models). Only
+             relevant if `config.is_decoder=True`.
+         tie_word_embeddings (`bool`, *optional*, defaults to `False`):
+             Whether the model's input and output word embeddings should be tied.
+         rope_theta (`float`, *optional*, defaults to 10000.0):
+             The base period of the RoPE embeddings.
+         rope_scaling (`Dict`, *optional*):
+             Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply new rope type
+             and you expect the model to work on longer `max_position_embeddings`, we recommend you to update this value
+             accordingly.
+             Expected contents:
+                 `rope_type` (`str`):
+                     The sub-variant of RoPE to use. Can be one of ['default', 'linear', 'dynamic', 'yarn', 'longrope',
+                     'llama3'], with 'default' being the original RoPE implementation.
+                 `factor` (`float`, *optional*):
+                     Used with all rope types except 'default'. The scaling factor to apply to the RoPE embeddings. In
+                     most scaling types, a `factor` of x will enable the model to handle sequences of length x *
+                     original maximum pre-trained length.
+                 `original_max_position_embeddings` (`int`, *optional*):
+                     Used with 'dynamic', 'longrope' and 'llama3'. The original max position embeddings used during
+                     pretraining.
+                 `attention_factor` (`float`, *optional*):
+                     Used with 'yarn' and 'longrope'. The scaling factor to be applied on the attention
+                     computation. If unspecified, it defaults to value recommended by the implementation, using the
+                     `factor` field to infer the suggested value.
+                 `beta_fast` (`float`, *optional*):
+                     Only used with 'yarn'. Parameter to set the boundary for extrapolation (only) in the linear
+                     ramp function. If unspecified, it defaults to 32.
+                 `beta_slow` (`float`, *optional*):
+                     Only used with 'yarn'. Parameter to set the boundary for interpolation (only) in the linear
+                     ramp function. If unspecified, it defaults to 1.
+                 `short_factor` (`List[float]`, *optional*):
+                     Only used with 'longrope'. The scaling factor to be applied to short contexts (<
+                     `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
+                     size divided by the number of attention heads divided by 2
+                 `long_factor` (`List[float]`, *optional*):
+                     Only used with 'longrope'. The scaling factor to be applied to long contexts (>
+                     `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
+                     size divided by the number of attention heads divided by 2
+                 `low_freq_factor` (`float`, *optional*):
+                     Only used with 'llama3'. Scaling factor applied to low frequency components of the RoPE
+                 `high_freq_factor` (`float`, *optional*):
+                     Only used with 'llama3'. Scaling factor applied to high frequency components of the RoPE
+         attention_bias (`bool`, *optional*, defaults to `False`):
+             Whether to use a bias in the query, key, value and output projection layers during self-attention.
+         use_sliding_window (`bool`, *optional*, defaults to `False`):
+             Whether to use sliding window attention.
+         sliding_window (`int`, *optional*, defaults to 4096):
+             Sliding window attention (SWA) window size. If not specified, will default to `4096`.
+         max_window_layers (`int`, *optional*, defaults to 28):
+             The number of layers that use SWA (Sliding Window Attention). The bottom layers use SWA while the top use full attention.
+         attention_dropout (`float`, *optional*, defaults to 0.0):
+             The dropout ratio for the attention probabilities.
+
+     ```python
+     >>> from transformers import SDARModel, SDARConfig
+
+     >>> # Initializing a SDAR style configuration
+     >>> configuration = SDARConfig()
+
+     >>> # Initializing a model from the SDAR-8B style configuration
+     >>> model = SDARModel(configuration)
+
+     >>> # Accessing the model configuration
+     >>> configuration = model.config
+     ```"""
+
+     model_type = "sdar"
+     keys_to_ignore_at_inference = ["past_key_values"]
+
+     # Default tensor parallel plan for base model `SDAR`
+     base_model_tp_plan = {
+         "layers.*.self_attn.q_proj": "colwise",
+         "layers.*.self_attn.k_proj": "colwise",
+         "layers.*.self_attn.v_proj": "colwise",
+         "layers.*.self_attn.o_proj": "rowwise",
+         "layers.*.mlp.gate_proj": "colwise",
+         "layers.*.mlp.up_proj": "colwise",
+         "layers.*.mlp.down_proj": "rowwise",
+     }
+     base_model_pp_plan = {
+         "embed_tokens": (["input_ids"], ["inputs_embeds"]),
+         "layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
+         "norm": (["hidden_states"], ["hidden_states"]),
+     }
+
+     def __init__(
+         self,
+         vocab_size=151936,
+         hidden_size=4096,
+         intermediate_size=22016,
+         num_hidden_layers=32,
+         num_attention_heads=32,
+         num_key_value_heads=32,
+         head_dim=128,
+         hidden_act="silu",
+         max_position_embeddings=32768,
+         initializer_range=0.02,
+         rms_norm_eps=1e-6,
+         use_cache=True,
+         tie_word_embeddings=False,
+         rope_theta=10000.0,
+         rope_scaling=None,
+         attention_bias=False,
+         use_sliding_window=False,
+         sliding_window=4096,
+         max_window_layers=28,
+         attention_dropout=0.0,
+         **kwargs,
+     ):
+         self.vocab_size = vocab_size
+         self.max_position_embeddings = max_position_embeddings
+         self.hidden_size = hidden_size
+         self.intermediate_size = intermediate_size
+         self.num_hidden_layers = num_hidden_layers
+         self.num_attention_heads = num_attention_heads
+         self.use_sliding_window = use_sliding_window
+         self.sliding_window = sliding_window  # we check `use_sliding_window` in the modeling code
+         self.max_window_layers = max_window_layers
+
+         # for backward compatibility
+         if num_key_value_heads is None:
+             num_key_value_heads = num_attention_heads
+
+         self.num_key_value_heads = num_key_value_heads
+         self.head_dim = head_dim
+         self.hidden_act = hidden_act
+         self.initializer_range = initializer_range
+         self.rms_norm_eps = rms_norm_eps
+         self.use_cache = use_cache
+         self.rope_theta = rope_theta
+         self.rope_scaling = rope_scaling
+         self.attention_bias = attention_bias
+         self.attention_dropout = attention_dropout
+         # Validate the correctness of rotary position embeddings parameters
+         # BC: if there is a 'type' field, move it to 'rope_type'.
+         if self.rope_scaling is not None and "type" in self.rope_scaling:
+             self.rope_scaling["rope_type"] = self.rope_scaling["type"]
+         rope_config_validation(self)
+
+         super().__init__(
+             tie_word_embeddings=tie_word_embeddings,
+             **kwargs,
+         )
+
+
+ __all__ = ["SDARConfig"]
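The constructor defaults in `SDARConfig.__init__` describe a larger model than the one shipped in this commit, so the `config.json` values are what actually determine the 1.7B architecture when the checkpoint is loaded. The sketch below makes that concrete by diffing a handful of defaults against the shipped values. Both dictionaries are copied by hand from the files in this commit, and nothing here imports the real class.

```python
# Constructor defaults from SDARConfig.__init__ in configuration_sdar.py.
class_defaults = {
    "vocab_size": 151936,
    "hidden_size": 4096,
    "intermediate_size": 22016,
    "num_hidden_layers": 32,
    "num_attention_heads": 32,
    "num_key_value_heads": 32,
    "head_dim": 128,
    "rope_theta": 10000.0,
    "sliding_window": 4096,
}

# Corresponding values from the config.json in this commit.
shipped = {
    "vocab_size": 151936,
    "hidden_size": 2048,
    "intermediate_size": 6144,
    "num_hidden_layers": 28,
    "num_attention_heads": 16,
    "num_key_value_heads": 8,
    "head_dim": 128,
    "rope_theta": 1000000,
    "sliding_window": None,
}

# Keys whose shipped value differs from the class default: (default, shipped).
overridden = {
    key: (class_defaults[key], shipped[key])
    for key in class_defaults
    if class_defaults[key] != shipped[key]
}

for key, (default, actual) in overridden.items():
    print(f"{key}: default={default} shipped={actual}")
```

Of the keys compared here, only `vocab_size` and `head_dim` keep their defaults, so calling `SDARConfig()` with no arguments would not reproduce this checkpoint's shape.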
generation_config.json ADDED
@@ -0,0 +1,13 @@
+ {
+   "bos_token_id": 151643,
+   "do_sample": true,
+   "eos_token_id": [
+     151645,
+     151643
+   ],
+   "pad_token_id": 151643,
+   "temperature": 0.6,
+   "top_k": 20,
+   "top_p": 0.95,
+   "transformers_version": "4.51.0"
+ }
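The `temperature`, `top_k`, and `top_p` values in `generation_config.json` shape the distribution that tokens are sampled from when `do_sample` is true. The following is a simplified, pure-Python illustration of how these three knobs interact. It is not the Transformers implementation, and `filter_distribution` is a hypothetical helper written for this note.

```python
import math


def filter_distribution(logits, temperature=0.6, top_k=20, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k, and top-p filtering."""
    # Temperature below 1 sharpens the distribution before the softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # subtract the max for stability

    # Top-k: keep only the k highest-weighted tokens, then renormalize.
    ranked = sorted(((w, i) for i, w in enumerate(weights)), reverse=True)[:top_k]
    mass = sum(w for w, _ in ranked)
    ranked = [(w / mass, i) for w, i in ranked]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for p, i in ranked:
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:
            break

    norm = sum(p for p, _ in kept)
    return {i: p / norm for p, i in kept}


dist = filter_distribution([5.0, 1.0, 0.0, -1.0])
print(dist)
```

With one clearly dominant logit, as in the usage line, the nucleus collapses to a single token. Flatter logits leave more candidates in play, up to the `top_k` limit.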
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1737775176591d7c7f39b884b98d620d87646f8220b9b6b39431b6f6467e3e0f
+ size 4063515640
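What the diff shows for `model.safetensors` is a Git LFS pointer rather than the weights themselves: three `key value` lines giving the spec version, the SHA-256 object ID, and the size in bytes. A minimal sketch of parsing such a pointer is shown below. `parse_lfs_pointer` is a hypothetical helper, and it assumes the simple three-field layout seen in this commit.

```python
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:1737775176591d7c7f39b884b98d620d87646f8220b9b6b39431b6f6467e3e0f
size 4063515640
"""


def parse_lfs_pointer(text):
    """Parse 'key value' lines of a Git LFS pointer into a dictionary."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line)
    algorithm, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algorithm"], round(pointer["size"] / 1e9, 2), "GB")
```

The parsed `size` and `digest` can be compared against a downloaded file to confirm that the full weights, not the pointer, were fetched.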
modeling_sdar.py ADDED
@@ -0,0 +1,866 @@
1
+ # This file is modified based on https://github.com/huggingface/transformers/blob/v4.52.4/src/transformers/models/qwen3/modeling_qwen3.py.
2
+ #
3
+ # 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
4
+ # This file was automatically generated from src/transformers/models/qwen3/modular_qwen3.py.
5
+ # Do NOT edit this file manually as any edits will be overwritten by the generation of
6
+ # the file from the modular. If any change should be done, please apply the change to the
7
+ # modular_qwen3.py file directly. One of our CI enforces this.
8
+ # 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
9
+ # coding=utf-8
10
+ # Copyright 2025 The Qwen team, Alibaba Group and the HuggingFace Inc. team. All rights reserved.
11
+ #
12
+ # Licensed under the Apache License, Version 2.0 (the "License");
13
+ # you may not use this file except in compliance with the License.
14
+ # You may obtain a copy of the License at
15
+ #
16
+ # http://www.apache.org/licenses/LICENSE-2.0
17
+ #
18
+ # Unless required by applicable law or agreed to in writing, software
19
+ # distributed under the License is distributed on an "AS IS" BASIS,
20
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
21
+ # See the License for the specific language governing permissions and
22
+ # limitations under the License.
23
+
24
+ from typing import Callable, Optional, Tuple, Union
25
+
26
+ import torch
27
+ from torch import nn
28
+
29
+ from transformers.activations import ACT2FN
30
+ from transformers.cache_utils import Cache, DynamicCache, SlidingWindowCache, StaticCache
31
+ from transformers.generation import GenerationMixin
32
+ from transformers.integrations import use_kernel_forward_from_hub
33
+ from transformers.modeling_attn_mask_utils import AttentionMaskConverter
34
+ from transformers.modeling_flash_attention_utils import FlashAttentionKwargs
35
+ from transformers.modeling_layers import GradientCheckpointingLayer
36
+ from transformers.modeling_outputs import (
37
+ BaseModelOutputWithPast,
38
+ CausalLMOutputWithPast,
39
+ QuestionAnsweringModelOutput,
40
+ SequenceClassifierOutputWithPast,
41
+ TokenClassifierOutput,
42
+ )
43
+ from transformers.modeling_rope_utils import ROPE_INIT_FUNCTIONS, dynamic_rope_update
44
+ from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel
45
+ from transformers.processing_utils import Unpack
46
+ from transformers.utils import LossKwargs, auto_docstring, can_return_tuple, is_torch_flex_attn_available, logging
47
+ from .configuration_sdar import SDARConfig
48
+
49
+ from flash_attn.ops.triton.layer_norm import rms_norm_fn as flash_rms_norm
50
+
51
+ import torch.nn.functional as F
52
+ try:
53
+ from flash_attn import flash_attn_func, flash_attn_varlen_func
54
+ from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input
55
+ except:
56
+ pass
57
+
58
+ try:
59
+ from liger_kernel.ops.swiglu import LigerSiLUMulFunction # noqa: F401
60
+ liger_kernel_is_available = True
61
+ except ImportError:
62
+ liger_kernel_is_available = False
63
+
64
+
65
+ if is_torch_flex_attn_available():
66
+ from torch.nn.attention.flex_attention import BlockMask, create_block_mask, flex_attention
67
+ from transformers.integrations.flex_attention import make_flex_block_causal_mask
68
+
69
+
70
+ logger = logging.get_logger(__name__)
71
+
72
+
73
+
74
+ @use_kernel_forward_from_hub("RMSNorm")
75
+ class SDARRMSNorm(nn.Module):
76
+ def __init__(self, hidden_size, eps=1e-6):
77
+ """
78
+ SDARRMSNorm is equivalent to T5LayerNorm
79
+ """
80
+ super().__init__()
81
+ self.weight = nn.Parameter(torch.ones(hidden_size))
82
+ self.variance_epsilon = eps
83
+
84
+ def forward(self, hidden_states):
85
+ return flash_rms_norm(
86
+ hidden_states, weight=self.weight, bias=None, eps=self.variance_epsilon)
87
+ '''
88
+ input_dtype = hidden_states.dtype
89
+ hidden_states = hidden_states.to(torch.float32)
90
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
91
+ hidden_states = hidden_states * \
92
+ torch.rsqrt(variance + self.variance_epsilon)
93
+ return self.weight * hidden_states.to(input_dtype)
94
+ '''
95
+
96
+
97
+ def extra_repr(self):
98
+ return f"{tuple(self.weight.shape)}, eps={self.variance_epsilon}"
99
+
100
+
101
+ class SDARMLP(nn.Module):
102
+ def __init__(self, config):
103
+ super().__init__()
104
+ self.config = config
105
+ self.hidden_size = config.hidden_size
106
+ self.intermediate_size = config.intermediate_size
107
+ self.gate_proj = nn.Linear(
108
+ self.hidden_size, self.intermediate_size, bias=False)
109
+ self.up_proj = nn.Linear(
110
+ self.hidden_size, self.intermediate_size, bias=False)
111
+ self.down_proj = nn.Linear(
112
+ self.intermediate_size, self.hidden_size, bias=False)
113
+ self.act_fn = ACT2FN[config.hidden_act]
114
+
115
+ def forward(self, x):
116
+ if liger_kernel_is_available:
117
+ return self.down_proj(LigerSiLUMulFunction.apply(self.gate_proj(x), self.up_proj(x)))
118
+ else:
119
+ down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
120
+ return down_proj
121
+
122
+
123
+ def rotate_half(x):
124
+ """Rotates half the hidden dims of the input."""
125
+ x1 = x[..., : x.shape[-1] // 2]
126
+ x2 = x[..., x.shape[-1] // 2:]
127
+ return torch.cat((-x2, x1), dim=-1)
128
+
129
+
130
+ def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
131
+ """Applies Rotary Position Embedding to the query and key tensors.
132
+
133
+ Args:
134
+ q (`torch.Tensor`): The query tensor.
135
+ k (`torch.Tensor`): The key tensor.
136
+ cos (`torch.Tensor`): The cosine part of the rotary embedding.
137
+ sin (`torch.Tensor`): The sine part of the rotary embedding.
138
+ position_ids (`torch.Tensor`, *optional*):
139
+ Deprecated and unused.
140
+ unsqueeze_dim (`int`, *optional*, defaults to 1):
141
+ The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
142
+ sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
143
+ that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
144
+ k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
145
+ cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
146
+ the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
147
+ Returns:
148
+ `tuple(torch.Tensor)` comprising the query and key tensors rotated using the Rotary Position Embedding.
149
+ """
150
+ cos = cos.unsqueeze(unsqueeze_dim)
151
+ sin = sin.unsqueeze(unsqueeze_dim)
152
+ q_embed = (q * cos) + (rotate_half(q) * sin)
153
+ k_embed = (k * cos) + (rotate_half(k) * sin)
154
+ return q_embed, k_embed
155
+
156
+
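`rotate_half` and `apply_rotary_pos_emb` together rotate each pair `(x[i], x[i + head_dim/2])` by a position-dependent angle. The sketch below re-implements the same arithmetic on plain Python lists to show that attention scores then depend only on the relative offset between query and key positions. The frequency base of 10000 is an assumption for illustration; the real value comes from the model config through `ROPE_INIT_FUNCTIONS`.

```python
import math


def rotate_half(x):
    """List version of rotate_half: (x1, x2) -> (-x2, x1)."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def rope_cos_sin(position, head_dim, base=10000.0):
    # base=10000.0 is an assumed default, not read from this repository's config.
    inv_freq = [1.0 / base ** (2 * i / head_dim) for i in range(head_dim // 2)]
    angles = [position * f for f in inv_freq]
    emb = angles + angles  # mirrors torch.cat((freqs, freqs), dim=-1)
    return [math.cos(a) for a in emb], [math.sin(a) for a in emb]


def apply_rope(x, position):
    cos, sin = rope_cos_sin(position, len(x))
    rotated = rotate_half(x)
    return [v * c + r * s for v, c, r, s in zip(x, cos, rotated, sin)]


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]
score_near = dot(apply_rope(q, 5), apply_rope(k, 2))      # offset 3
score_far = dot(apply_rope(q, 105), apply_rope(k, 102))   # same offset, shifted by 100
```

Both scores agree up to floating-point error, and the rotation leaves vector norms unchanged.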
157
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
158
+ """
159
+ This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
160
+ num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
161
+ """
162
+ batch, num_key_value_heads, slen, head_dim = hidden_states.shape
163
+ if n_rep == 1:
164
+ return hidden_states
165
+ hidden_states = hidden_states[:, :, None, :, :].expand(
166
+ batch, num_key_value_heads, n_rep, slen, head_dim)
167
+ return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
168
+
169
+
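The `expand` followed by `reshape` in `repeat_kv` yields an interleaved repetition along the head dimension, so query head `i` ends up paired with key/value head `i // n_rep`. A small list-based sketch of that ordering (the strings are placeholders for per-head tensors, and `repeat_kv_heads` is a hypothetical helper name):

```python
def repeat_kv_heads(kv_heads, n_rep):
    """Repeat each key/value head n_rep times, keeping copies adjacent."""
    if n_rep == 1:
        return kv_heads
    return [head for head in kv_heads for _ in range(n_rep)]


kv = ["kv0", "kv1"]                  # num_key_value_heads = 2
expanded = repeat_kv_heads(kv, 2)    # num_attention_heads = 4
```

Here `expanded` is `["kv0", "kv0", "kv1", "kv1"]`, not `["kv0", "kv1", "kv0", "kv1"]`, which is the difference between `repeat_interleave` and a tiled repeat.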
170
+ def eager_attention_forward(
171
+ module: nn.Module,
172
+ query: torch.Tensor,
173
+ key: torch.Tensor,
174
+ value: torch.Tensor,
175
+ attention_mask: Optional[torch.Tensor],
176
+ scaling: float,
177
+ dropout: float = 0.0,
178
+ **kwargs,
179
+ ):
180
+ key_states = repeat_kv(key, module.num_key_value_groups)
181
+ value_states = repeat_kv(value, module.num_key_value_groups)
182
+
183
+ attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
184
+ if attention_mask is not None:
185
+ causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
186
+ attn_weights = attn_weights + causal_mask
187
+
188
+ attn_weights = nn.functional.softmax(
189
+ attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
190
+ attn_weights = nn.functional.dropout(
191
+ attn_weights, p=dropout, training=module.training)
192
+ attn_output = torch.matmul(attn_weights, value_states)
193
+ attn_output = attn_output.transpose(1, 2).contiguous()
194
+
195
+ return attn_output, attn_weights
196
+
197
+
198
+ class SDARAttention(nn.Module):
199
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
200
+
201
+ def __init__(self, config: SDARConfig, layer_idx: int):
202
+ super().__init__()
203
+ self.config = config
204
+ self.layer_idx = layer_idx
205
+ self.head_dim = getattr(
206
+ config, "head_dim", config.hidden_size // config.num_attention_heads)
207
+ self.num_key_value_groups = config.num_attention_heads // config.num_key_value_heads
208
+ self.scaling = self.head_dim**-0.5
209
+ self.attention_dropout = config.attention_dropout
210
+ self.is_causal = True
211
+
212
+ self.hidden_size = config.hidden_size
213
+ self.num_attention_heads = config.num_attention_heads
214
+ self.num_key_value_heads = config.num_key_value_heads
215
+
216
+ self.q_proj = nn.Linear(
217
+ config.hidden_size, config.num_attention_heads * self.head_dim, bias=config.attention_bias
218
+ )
219
+ self.k_proj = nn.Linear(
220
+ config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
221
+ )
222
+ self.v_proj = nn.Linear(
223
+ config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
224
+ )
225
+ self.o_proj = nn.Linear(
226
+ config.num_attention_heads * self.head_dim, config.hidden_size, bias=config.attention_bias
227
+ )
228
+ # unlike olmo, only on the head dim!
229
+ self.q_norm = SDARRMSNorm(self.head_dim, eps=config.rms_norm_eps)
230
+ # thus post q_norm does not need reshape
231
+ self.k_norm = SDARRMSNorm(self.head_dim, eps=config.rms_norm_eps)
232
+ self.sliding_window = config.sliding_window
233
+ if not (
234
+ self.config.use_sliding_window
235
+ and getattr(self.config, "sliding_window", None) is not None
236
+ and self.layer_idx >= self.config.max_window_layers
237
+ ):
238
+ self.sliding_window = None
239
+
240
+ def forward(
241
+ self,
242
+ hidden_states: torch.Tensor,
243
+ position_embeddings: Tuple[torch.Tensor, torch.Tensor],
244
+ attention_mask: Optional[torch.Tensor],
245
+ past_key_value: Optional[Cache] = None,
246
+ cache_position: Optional[torch.LongTensor] = None,
247
+ **kwargs: Unpack[FlashAttentionKwargs],
248
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
249
+ input_shape = hidden_states.shape[:-1]
250
+ bsz, q_len = input_shape
251
+ hidden_shape = (*input_shape, -1, self.head_dim)
252
+
253
+ query_states = self.q_norm(self.q_proj(
254
+ hidden_states).view(hidden_shape)).transpose(1, 2)
255
+ key_states = self.k_norm(self.k_proj(
256
+ hidden_states).view(hidden_shape)).transpose(1, 2)
257
+ value_states = self.v_proj(hidden_states).view(
258
+ hidden_shape).transpose(1, 2)
259
+
260
+ cos, sin = position_embeddings
261
+ query_states, key_states = apply_rotary_pos_emb(
262
+ query_states, key_states, cos, sin)
263
+
264
+ if past_key_value is not None and kwargs.get("store_kv", False):
265
+ # sin and cos are specific to RoPE models; cache_position needed for the static cache
266
+ key_states, value_states = past_key_value.update(
267
+ key_states, value_states, self.layer_idx)
268
+ elif past_key_value is not None and not kwargs.get("store_kv", False) and len(past_key_value) > self.layer_idx:
269
+ # only retrieve, do not store kv
270
+ past_key_states, past_value_states = past_key_value[self.layer_idx]
271
+ key_states = torch.cat(
272
+ [past_key_states, key_states], dim=-2
273
+ )
274
+ value_states = torch.cat(
275
+ [past_value_states, value_states], dim=-2
276
+ )
277
+
278
+ attention_mask = attention_mask.bool() if attention_mask is not None else None
279
+ if attention_mask is None or torch.all(attention_mask): # nothing is masked (decoding); guard against a missing mask
280
+ query_states = query_states.transpose(1, 2)
281
+ key_states = key_states.transpose(1, 2)
282
+ value_states = value_states.transpose(1, 2)
283
+ attn_output = flash_attn_func(
284
+ query_states,
285
+ key_states,
286
+ value_states,
287
+ causal=False,
288
+ softmax_scale=self.scaling)
289
+
290
+ else: # prefilling
291
+ attn_output = F.scaled_dot_product_attention(
292
+ query=query_states,
293
+ key=key_states,
294
+ value=value_states,
295
+ attn_mask=attention_mask,
296
+ is_causal=False,
297
+ scale=self.scaling,
298
+ enable_gqa=True)
299
+ attn_output = attn_output.transpose(1, 2).contiguous()
300
+
301
+ attn_output = attn_output.reshape(*input_shape, -1).contiguous()
302
+ attn_output = self.o_proj(attn_output)
303
+ return attn_output, None #, attn_weights
304
+
305
+
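The cache handling in `SDARAttention.forward` has two modes: with `store_kv=True` the new keys and values are appended to the cache, and otherwise the cached entries are only read and concatenated with the current ones, leaving the cache untouched. The toy sketch below models just that branching with Python lists; `ToyCache` and `keys_for_attention` are hypothetical names and do not reproduce the `transformers` `Cache` API.

```python
class ToyCache:
    """Per-layer list of cached key entries (values would be handled the same way)."""

    def __init__(self):
        self.keys = []

    def __len__(self):
        return len(self.keys)

    def update(self, new_keys, layer_idx):
        # Append to the layer's entries and return everything cached so far.
        if len(self.keys) <= layer_idx:
            self.keys.append([])
        self.keys[layer_idx] = self.keys[layer_idx] + list(new_keys)
        return self.keys[layer_idx]


def keys_for_attention(cache, layer_idx, new_keys, store_kv):
    if cache is not None and store_kv:
        return cache.update(new_keys, layer_idx)         # commit, then attend
    if cache is not None and len(cache) > layer_idx:
        return cache.keys[layer_idx] + list(new_keys)    # read-only concatenation
    return list(new_keys)                                # nothing cached for this layer


cache = ToyCache()
committed = keys_for_attention(cache, 0, ["p0", "p1"], store_kv=True)
draft = keys_for_attention(cache, 0, ["d0"], store_kv=False)
```

After the second call the attention still sees `p0`, `p1`, and `d0`, but the cache holds only the two committed entries, so a later pass can recompute the draft positions.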
306
+ class SDARDecoderLayer(GradientCheckpointingLayer):
307
+ def __init__(self, config: SDARConfig, layer_idx: int):
308
+ super().__init__()
309
+ self.hidden_size = config.hidden_size
310
+ self.self_attn = SDARAttention(config=config, layer_idx=layer_idx)
311
+ self.mlp = SDARMLP(config)
312
+ self.input_layernorm = SDARRMSNorm(
313
+ config.hidden_size, eps=config.rms_norm_eps)
314
+ self.post_attention_layernorm = SDARRMSNorm(
315
+ config.hidden_size, eps=config.rms_norm_eps)
316
+ if (
317
+ config.sliding_window and config._attn_implementation != "flash_attention_2"
318
+ ): # diff with Llama is this warning
319
+ logger.warning_once(
320
+ f"Sliding Window Attention is enabled but not implemented for `{config._attn_implementation}`; "
321
+ "unexpected results may be encountered."
322
+ )
323
+
324
+ def forward(
325
+ self,
326
+ hidden_states: torch.Tensor,
327
+ attention_mask: Optional[torch.Tensor] = None,
328
+ position_ids: Optional[torch.LongTensor] = None,
329
+ past_key_value: Optional[Cache] = None,
330
+ output_attentions: Optional[bool] = False,
331
+ use_cache: Optional[bool] = False,
332
+ store_kv: Optional[bool] = False,
333
+ cache_position: Optional[torch.LongTensor] = None,
334
+ # necessary, but kept here for BC
335
+ position_embeddings: Optional[Tuple[torch.Tensor,
336
+ torch.Tensor]] = None,
337
+ **kwargs: Unpack[FlashAttentionKwargs],
338
+ ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
339
+ residual = hidden_states
340
+ hidden_states = self.input_layernorm(hidden_states)
341
+
342
+ # Self Attention
343
+ hidden_states, self_attn_weights = self.self_attn(
344
+ hidden_states=hidden_states,
345
+ attention_mask=attention_mask,
346
+ position_ids=position_ids,
347
+ past_key_value=past_key_value,
348
+ output_attentions=output_attentions,
349
+ use_cache=use_cache,
350
+ store_kv=store_kv,
351
+ cache_position=cache_position,
352
+ position_embeddings=position_embeddings,
353
+ **kwargs,
354
+ )
355
+ hidden_states = residual + hidden_states
356
+
357
+ # Fully Connected
358
+ residual = hidden_states
359
+ hidden_states = self.post_attention_layernorm(hidden_states)
360
+ hidden_states = self.mlp(hidden_states)
361
+ hidden_states = residual + hidden_states
362
+
363
+ outputs = (hidden_states,)
364
+ if output_attentions:
365
+ outputs += (self_attn_weights,)
366
+
367
+ return outputs
368
+
369
+
370
+ @auto_docstring
371
+ class SDARPreTrainedModel(PreTrainedModel):
372
+ config_class = SDARConfig
373
+ base_model_prefix = "model"
374
+ supports_gradient_checkpointing = True
375
+ _no_split_modules = ["SDARDecoderLayer"]
376
+ _skip_keys_device_placement = ["past_key_values"]
377
+ _supports_flash_attn_2 = True
378
+ _supports_sdpa = True
379
+ _supports_flex_attn = True
380
+ _supports_cache_class = True
381
+ _supports_quantized_cache = True
382
+ _supports_static_cache = True
383
+ _supports_attention_backend = True
384
+
385
+ def _init_weights(self, module):
386
+ std = self.config.initializer_range
387
+ if isinstance(module, nn.Linear):
388
+ module.weight.data.normal_(mean=0.0, std=std)
389
+ if module.bias is not None:
390
+ module.bias.data.zero_()
391
+ elif isinstance(module, nn.Embedding):
392
+ module.weight.data.normal_(mean=0.0, std=std)
393
+ if module.padding_idx is not None:
394
+ module.weight.data[module.padding_idx].zero_()
395
+ elif isinstance(module, SDARRMSNorm):
396
+ module.weight.data.fill_(1.0)
397
+
398
+
399
+ class SDARRotaryEmbedding(nn.Module):
400
+ def __init__(self, config: SDARConfig, device=None):
401
+ super().__init__()
402
+ # BC: "rope_type" was originally "type"
403
+ if hasattr(config, "rope_scaling") and config.rope_scaling is not None:
404
+ self.rope_type = config.rope_scaling.get(
405
+ "rope_type", config.rope_scaling.get("type"))
406
+ else:
407
+ self.rope_type = "default"
408
+ self.max_seq_len_cached = config.max_position_embeddings
409
+ self.original_max_seq_len = config.max_position_embeddings
410
+
411
+ self.config = config
412
+ self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type]
413
+
414
+ inv_freq, self.attention_scaling = self.rope_init_fn(
415
+ self.config, device)
416
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
417
+ self.original_inv_freq = self.inv_freq
418
+
419
+ @torch.no_grad()
420
+ # power user: used with advanced RoPE types (e.g. dynamic rope)
421
+ @dynamic_rope_update
422
+ def forward(self, x, position_ids):
423
+ inv_freq_expanded = self.inv_freq[None, :, None].float().expand(
424
+ position_ids.shape[0], -1, 1).to(x.device)
425
+ position_ids_expanded = position_ids[:, None, :].float()
426
+
427
+ device_type = x.device.type if isinstance(
428
+ x.device.type, str) and x.device.type != "mps" else "cpu"
429
+ with torch.autocast(device_type=device_type, enabled=False): # Force float32
430
+ freqs = (inv_freq_expanded.float() @
431
+ position_ids_expanded.float()).transpose(1, 2)
432
+ emb = torch.cat((freqs, freqs), dim=-1)
433
+ cos = emb.cos() * self.attention_scaling
434
+ sin = emb.sin() * self.attention_scaling
435
+
436
+ return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
437
+
438
+
439
+ @auto_docstring
440
+ class SDARModel(SDARPreTrainedModel):
441
+ def __init__(self, config: SDARConfig):
442
+ super().__init__(config)
443
+ self.padding_idx = config.pad_token_id
444
+ self.vocab_size = config.vocab_size
445
+
446
+ self.embed_tokens = nn.Embedding(
447
+ config.vocab_size, config.hidden_size, self.padding_idx)
448
+ self.layers = nn.ModuleList(
449
+ [SDARDecoderLayer(config, layer_idx)
450
+ for layer_idx in range(config.num_hidden_layers)]
451
+ )
452
+ self.norm = SDARRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
453
+ self.rotary_emb = SDARRotaryEmbedding(config=config)
454
+ self.gradient_checkpointing = False
455
+
456
+ # Initialize weights and apply final processing
457
+ self.post_init()
458
+
459
+ def get_input_embeddings(self):
460
+ return self.embed_tokens
461
+
462
+ def set_input_embeddings(self, value):
463
+ self.embed_tokens = value
464
+
465
+ @can_return_tuple
466
+ @auto_docstring
467
+ def forward(
468
+ self,
469
+ input_ids: Optional[torch.LongTensor] = None,
470
+ attention_mask: Optional[torch.Tensor] = None,
471
+ position_ids: Optional[torch.LongTensor] = None,
472
+ past_key_values: Optional[Cache] = None,
473
+ inputs_embeds: Optional[torch.FloatTensor] = None,
474
+ use_cache: Optional[bool] = None,
475
+ store_kv: Optional[bool] = None,
476
+ output_attentions: Optional[bool] = None,
477
+ output_hidden_states: Optional[bool] = None,
478
+ cache_position: Optional[torch.LongTensor] = None,
479
+ **flash_attn_kwargs: Unpack[FlashAttentionKwargs],
480
+ ) -> BaseModelOutputWithPast:
481
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
482
+ output_hidden_states = (
483
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
484
+ )
485
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
486
+
487
+ if (input_ids is None) ^ (inputs_embeds is not None):
488
+ raise ValueError(
489
+ "You must specify exactly one of input_ids or inputs_embeds")
490
+
491
+ if self.gradient_checkpointing and self.training and use_cache:
492
+ logger.warning_once(
493
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`."
494
+ )
495
+ use_cache = False
496
+
497
+ # TODO (joao): remove this exception in v4.56 -- it exists for users that try to pass a legacy cache
498
+ if not isinstance(past_key_values, (type(None), Cache)):
499
+ raise ValueError(
500
+ "The `past_key_values` should be either a `Cache` object or `None`.")
501
+
502
+ if inputs_embeds is None:
503
+ inputs_embeds = self.embed_tokens(input_ids)
504
+
505
+ if use_cache and past_key_values is None:
506
+ past_key_values = DynamicCache()
507
+
508
+ if cache_position is None:
509
+ past_seen_tokens = past_key_values.get_seq_length(
510
+ ) if past_key_values is not None else 0
511
+ cache_position = torch.arange(
512
+ past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
513
+ )
514
+
515
+ if position_ids is None:
516
+ position_ids = cache_position.unsqueeze(0)
517
+
518
+ # causal_mask = self._update_causal_mask(
519
+ # attention_mask, inputs_embeds, cache_position, past_key_values, output_attentions
520
+ # )
521
+
522
+ hidden_states = inputs_embeds
523
+
524
+ # create position embeddings to be shared across the decoder layers
525
+ position_embeddings = self.rotary_emb(hidden_states, position_ids)
526
+
527
+ # decoder layers
528
+ all_hidden_states = () if output_hidden_states else None
529
+ all_self_attns = () if output_attentions else None
530
+
531
+ for decoder_layer in self.layers[: self.config.num_hidden_layers]:
532
+ if output_hidden_states:
533
+ all_hidden_states += (hidden_states,)
534
+
535
+ layer_outputs = decoder_layer(
536
+ hidden_states,
537
+ attention_mask=attention_mask,
538
+ position_ids=position_ids,
539
+ past_key_value=past_key_values,
540
+ output_attentions=output_attentions,
541
+ use_cache=use_cache,
542
+ store_kv=store_kv,
543
+ cache_position=cache_position,
544
+ position_embeddings=position_embeddings,
545
+ **flash_attn_kwargs,
546
+ )
547
+
548
+ hidden_states = layer_outputs[0]
549
+
550
+ if output_attentions:
551
+ all_self_attns += (layer_outputs[1],)
552
+
553
+ hidden_states = self.norm(hidden_states)
554
+
555
+ # add hidden states from the last decoder layer
556
+ if output_hidden_states:
557
+ all_hidden_states += (hidden_states,)
558
+
559
+ return BaseModelOutputWithPast(
560
+ last_hidden_state=hidden_states,
561
+ past_key_values=past_key_values if use_cache else None,
562
+ hidden_states=all_hidden_states,
563
+ attentions=all_self_attns,
564
+ )
565
+
566
+ def _update_causal_mask(
567
+ self,
568
+ attention_mask: Union[torch.Tensor, "BlockMask"],
569
+ input_tensor: torch.Tensor,
570
+ cache_position: torch.Tensor,
571
+ past_key_values: Cache,
572
+ output_attentions: bool = False,
573
+ ):
574
+ if self.config._attn_implementation == "flash_attention_2":
575
+ if attention_mask is not None and past_key_values is not None:
576
+ is_padding_right = attention_mask[:, -
577
+ 1].sum().item() != input_tensor.size()[0]
578
+ if is_padding_right:
579
+ raise ValueError(
580
+ "You are attempting to perform batched generation with padding_side='right'"
581
+ " this may lead to unexpected behaviour for Flash Attention version of Qwen3. Make sure to "
582
+ " call `tokenizer.padding_side = 'left'` before tokenizing the input. "
583
+ )
584
+ if attention_mask is not None and 0.0 in attention_mask:
585
+ return attention_mask
586
+ return None
587
+ if self.config._attn_implementation == "flex_attention":
588
+ if isinstance(attention_mask, torch.Tensor):
589
+ seq_len_q, seq_len_kv = attention_mask.shape
590
+ assert seq_len_q == seq_len_kv, f"got {attention_mask.shape=}"
591
+ attention_mask = create_block_mask(
592
+ # 2d bool tensor, shape: [2*seqlen, 2*seqlen]
593
+ lambda b, h, q_idx, kv_idx: attention_mask[q_idx, kv_idx],
594
+ B=None, H=None, Q_LEN=seq_len_q, KV_LEN=seq_len_kv,
595
+ )
596
+ else:
597
+ # Here we pass in flex mask computed externally
598
+ assert isinstance(attention_mask, BlockMask)
599
+ return attention_mask
600
+
601
+ # For SDPA, when possible, we will rely on its `is_causal` argument instead of its `attn_mask` argument, in
602
+ # order to dispatch on Flash Attention 2. This feature is not compatible with static cache, as SDPA will fail
603
+ # to infer the attention mask.
604
+ past_seen_tokens = past_key_values.get_seq_length(
605
+ ) if past_key_values is not None else 0
606
+ using_static_cache = isinstance(past_key_values, StaticCache)
607
+ using_sliding_window_cache = isinstance(
608
+ past_key_values, SlidingWindowCache)
609
+
610
+ # When output attentions is True, sdpa implementation's forward method calls the eager implementation's forward
611
+ if (
612
+ self.config._attn_implementation == "sdpa"
613
+ and not (using_static_cache or using_sliding_window_cache)
614
+ and not output_attentions
615
+ ):
616
+ if AttentionMaskConverter._ignore_causal_mask_sdpa(
617
+ attention_mask,
618
+ inputs_embeds=input_tensor,
619
+ past_key_values_length=past_seen_tokens,
620
+ sliding_window=self.config.sliding_window,
621
+ is_training=self.training,
622
+ ):
623
+ return None
624
+
625
+ dtype = input_tensor.dtype
626
+ min_dtype = torch.finfo(dtype).min
627
+ sequence_length = input_tensor.shape[1]
628
+ # SlidingWindowCache or StaticCache
629
+ if using_sliding_window_cache or using_static_cache:
630
+ target_length = past_key_values.get_max_cache_shape()
631
+ # DynamicCache or no cache
632
+ else:
633
+ target_length = (
634
+ attention_mask.shape[-1]
635
+ if isinstance(attention_mask, torch.Tensor)
636
+ else past_seen_tokens + sequence_length + 1
637
+ )
638
+
639
+ # In case the provided `attention` mask is 2D, we generate a causal mask here (4D).
640
+ causal_mask = self._prepare_4d_causal_attention_mask_with_cache_position(
641
+ attention_mask,
642
+ sequence_length=sequence_length,
643
+ target_length=target_length,
644
+ dtype=dtype,
645
+ cache_position=cache_position,
646
+ batch_size=input_tensor.shape[0],
647
+ config=self.config,
648
+ past_key_values=past_key_values,
649
+ )
650
+
651
+ if (
652
+ self.config._attn_implementation == "sdpa"
653
+ and attention_mask is not None
654
+ and attention_mask.device.type in ["cuda", "xpu", "npu"]
655
+ and not output_attentions
656
+ ):
657
+ # Attend to all tokens in fully masked rows in the causal_mask, for example the relevant first rows when
658
+ # using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
659
+ # Details: https://github.com/pytorch/pytorch/issues/110213
660
+ causal_mask = AttentionMaskConverter._unmask_unattended(
661
+ causal_mask, min_dtype)
662
+
663
+ return causal_mask
664
+
665
+ @staticmethod
666
+ def _prepare_4d_causal_attention_mask_with_cache_position(
667
+ attention_mask: torch.Tensor,
668
+ sequence_length: int,
669
+ target_length: int,
670
+ dtype: torch.dtype,
671
+ cache_position: torch.Tensor,
672
+ batch_size: int,
673
+ config: SDARConfig,
674
+ past_key_values: Cache,
675
+ ):
676
+ """
677
+ Creates a causal 4D mask of shape `(batch_size, 1, query_length, key_value_length)` from a 2D mask of shape
678
+ `(batch_size, key_value_length)`, or if the input `attention_mask` is already 4D, do nothing.
679
+
680
+ Args:
681
+ attention_mask (`torch.Tensor`):
682
+ A 2D attention mask of shape `(batch_size, key_value_length)` or a 4D attention mask of shape `(batch_size, 1, query_length, key_value_length)`.
683
+ sequence_length (`int`):
684
+ The sequence length being processed.
685
+ target_length (`int`):
686
+ The target length: when generating with static cache, the mask should be as long as the static cache, to account for the 0 padding, the part of the cache that is not filled yet.
687
+ dtype (`torch.dtype`):
688
+ The dtype to use for the 4D attention mask.
689
+ cache_position (`torch.Tensor`):
690
+ Indices depicting the position of the input sequence tokens in the sequence.
691
+ batch_size (`int`):
692
+ Batch size.
693
+ config (`SDARConfig`):
694
+ The model's configuration class
695
+ past_key_values (`Cache`):
696
+ The cache class that is being used currently to generate
697
+ """
698
+ if attention_mask is not None and attention_mask.dim() == 4:
699
+ # In this case we assume that the mask comes already in inverted form and requires no inversion or slicing.
700
+ causal_mask = attention_mask
701
+ else:
702
+ min_dtype = torch.finfo(dtype).min
703
+ causal_mask = torch.full(
704
+ (sequence_length, target_length), fill_value=min_dtype, dtype=dtype, device=cache_position.device
705
+ )
706
+ diagonal_attend_mask = torch.arange(target_length, device=cache_position.device) > cache_position.reshape(
707
+ -1, 1
708
+ )
709
+ text_config = config.get_text_config()
710
+ if getattr(text_config, "use_sliding_window", True) and text_config.sliding_window is not None:
711
+ # if we have sliding window, we should not attend to tokens beyond sliding window length, so we mask them out also
712
+ # the check is needed to verify whether the current checkpoint was trained with sliding window or not
713
+ if not isinstance(past_key_values, SlidingWindowCache) or sequence_length > target_length:
714
+ sliding_attend_mask = torch.arange(target_length, device=cache_position.device) <= (
715
+ cache_position.reshape(-1, 1) -
716
+ text_config.sliding_window
717
+ )
718
+ diagonal_attend_mask.bitwise_or_(sliding_attend_mask)
719
+ causal_mask *= diagonal_attend_mask
720
+ causal_mask = causal_mask[None, None,
721
+ :, :].expand(batch_size, 1, -1, -1)
722
+ if attention_mask is not None:
723
+ causal_mask = causal_mask.clone() # copy to contiguous memory for in-place edit
724
+ if attention_mask.shape[-1] > target_length:
725
+ attention_mask = attention_mask[:, :target_length]
726
+ mask_length = attention_mask.shape[-1]
727
+ padding_mask = causal_mask[:, :, :, :mask_length] + attention_mask[:, None, None, :].to(
728
+ causal_mask.device
729
+ )
730
+ padding_mask = padding_mask == 0
731
+ causal_mask[:, :, :, :mask_length] = causal_mask[:, :, :, :mask_length].masked_fill(
732
+ padding_mask, min_dtype
733
+ )
734
+ return causal_mask
735
+
736
+
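The core of `_prepare_4d_causal_attention_mask_with_cache_position` is a comparison between key indices and each query's `cache_position`: a key is blocked when it lies in the future, or when it has fallen out of the sliding window. The following list-based sketch shows only that rule (it ignores padding, dtypes, and the `SlidingWindowCache` special case, and `causal_allowed` is an illustrative name):

```python
def causal_allowed(cache_position, target_length, sliding_window=None):
    """Return rows of booleans: True where a query may attend to a key index."""
    rows = []
    for pos in cache_position:
        row = []
        for kv_idx in range(target_length):
            blocked = kv_idx > pos                                   # future token
            if sliding_window is not None:
                blocked = blocked or kv_idx <= pos - sliding_window  # too far in the past
            row.append(not blocked)
        rows.append(row)
    return rows


full = causal_allowed([0, 1, 2], target_length=3)
windowed = causal_allowed([0, 1, 2], target_length=3, sliding_window=2)
decode_step = causal_allowed([3], target_length=4)
```

`full` is lower triangular, `windowed` additionally hides key 0 from query 2, and a single decoding step at position 3 may attend to all four cached positions.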
737
+ class KwargsForCausalLM(FlashAttentionKwargs, LossKwargs):
738
+ ...
739
+
740
+
741
+ @auto_docstring
742
+ class SDARForCausalLM(SDARPreTrainedModel, GenerationMixin):
743
+ _tied_weights_keys = ["lm_head.weight"]
744
+ _tp_plan = {"lm_head": "colwise_rep"}
745
+ _pp_plan = {"lm_head": (["hidden_states"], ["logits"])}
746
+
747
+ def __init__(self, config):
748
+ super().__init__(config)
749
+ self.model = SDARModel(config)
750
+ self.vocab_size = config.vocab_size
751
+ self.lm_head = nn.Linear(
752
+ config.hidden_size, config.vocab_size, bias=False)
753
+
754
+ # Initialize weights and apply final processing
755
+ self.post_init()
756
+
757
+ def get_input_embeddings(self):
758
+ return self.model.embed_tokens
759
+
760
+ def set_input_embeddings(self, value):
761
+ self.model.embed_tokens = value
762
+
763
+ def get_output_embeddings(self):
764
+ return self.lm_head
765
+
766
+ def set_output_embeddings(self, new_embeddings):
767
+ self.lm_head = new_embeddings
768
+
769
+ def set_decoder(self, decoder):
770
+ self.model = decoder
771
+
772
+ def get_decoder(self):
773
+ return self.model
774
+
775
+ @can_return_tuple
776
+ @auto_docstring
777
+ def forward(
778
+ self,
779
+ input_ids: Optional[torch.LongTensor] = None,
780
+ attention_mask: Optional[torch.Tensor] = None,
781
+ position_ids: Optional[torch.LongTensor] = None,
782
+ past_key_values: Optional[Cache] = None,
783
+ inputs_embeds: Optional[torch.FloatTensor] = None,
784
+ labels: Optional[torch.LongTensor] = None,
785
+ use_cache: Optional[bool] = None,
786
+ output_attentions: Optional[bool] = None,
787
+ output_hidden_states: Optional[bool] = None,
788
+ cache_position: Optional[torch.LongTensor] = None,
789
+ logits_to_keep: Union[int, torch.Tensor] = 0,
790
+ **kwargs: Unpack[KwargsForCausalLM],
791
+ ) -> CausalLMOutputWithPast:
792
+ r"""
793
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
794
+ Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
795
+ config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
796
+ (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
797
+
798
+ Example:
799
+
800
+ ```python
801
+ >>> from transformers import AutoTokenizer, SDARForCausalLM
802
+
803
+ >>> model = SDARForCausalLM.from_pretrained("DiffuOpen/SDAR-1.7B-Chat")
804
+ >>> tokenizer = AutoTokenizer.from_pretrained("DiffuOpen/SDAR-1.7B-Chat")
805
+
806
+ >>> prompt = "Hey, are you conscious? Can you talk to me?"
807
+ >>> inputs = tokenizer(prompt, return_tensors="pt")
808
+
809
+ >>> # Generate
810
+ >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
811
+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
812
+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
813
+ ```"""
814
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
815
+ output_hidden_states = (
816
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
817
+ )
818
+
819
+ # decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
820
+ outputs: BaseModelOutputWithPast = self.model(
821
+ input_ids=input_ids,
822
+ attention_mask=attention_mask,
823
+ position_ids=position_ids,
824
+ past_key_values=past_key_values,
825
+ inputs_embeds=inputs_embeds,
826
+ use_cache=use_cache,
827
+ output_attentions=output_attentions,
828
+ output_hidden_states=output_hidden_states,
829
+ cache_position=cache_position,
830
+ **kwargs,
831
+ )
832
+
833
+ hidden_states = outputs.last_hidden_state
834
+ # Only compute necessary logits, and do not upcast them to float if we are not computing the loss
835
+ slice_indices = slice(-logits_to_keep,
836
+ None) if isinstance(logits_to_keep, int) else logits_to_keep
837
+ hidden_states = hidden_states[:, slice_indices, :].contiguous()
838
+ fuse_linear_and_cross_entropy = self.config.fuse_cross_entropy and self.training
839
+ if fuse_linear_and_cross_entropy:
840
+ # When using fused_linear_ce_loss, we do not compute the whole logits on HBM
841
+ logits = None
842
+ else:
843
+ logits = self.lm_head(hidden_states)
844
+
845
+ loss = None
846
+ if labels is not None:
847
+ # FusedLinearCrossEntropyLoss will be implemented by monkey patch when training
848
+ # It is not used during inference
849
+ loss_fct = nn.CrossEntropyLoss() # nn.CE
850
+ loss = loss_fct(
851
+ logits.view(-1, self.config.vocab_size), labels.view(-1))
852
+
853
+ return CausalLMOutputWithPast(
854
+ loss=loss,
855
+ logits=logits,
856
+ past_key_values=outputs.past_key_values,
857
+ hidden_states=outputs.hidden_states,
858
+ attentions=outputs.attentions,
859
+ )
860
+
861
+
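The `logits_to_keep` handling in `SDARForCausalLM.forward` relies on a small slicing detail: `slice(-0, None)` equals `slice(0, None)`, so the default of `0` keeps every position, while a positive integer keeps only the trailing positions. A list-based sketch of the same selection (the function name `select_positions` is hypothetical; a list of indices stands in for the tensor case):

```python
def select_positions(hidden_rows, logits_to_keep=0):
    """Pick the sequence positions for which logits would be computed."""
    if isinstance(logits_to_keep, int):
        # -0 == 0, so the default selects the whole sequence.
        return hidden_rows[slice(-logits_to_keep, None)]
    # Non-integer input is treated as explicit position indices.
    return [hidden_rows[i] for i in logits_to_keep]


rows = ["h0", "h1", "h2", "h3"]
everything = select_positions(rows)
last_only = select_positions(rows, 1)
picked = select_positions(rows, [0, 2])
```

Passing `logits_to_keep=1` during generation avoids projecting the full sequence through `lm_head` when only the final position is needed.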
862
+ __all__ = [
863
+ "SDARForCausalLM",
864
+ "SDARModel",
865
+ "SDARPreTrainedModel",
866
+ ]
special_tokens_map.json ADDED
@@ -0,0 +1,32 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<|im_start|>",
4
+ "<|im_end|>",
5
+ "<|object_ref_start|>",
6
+ "<|object_ref_end|>",
7
+ "<|box_start|>",
8
+ "<|box_end|>",
9
+ "<|quad_start|>",
10
+ "<|quad_end|>",
11
+ "<|vision_start|>",
12
+ "<|vision_end|>",
13
+ "<|vision_pad|>",
14
+ "<|image_pad|>",
15
+ "<|video_pad|>",
16
+ "<MASK>"
17
+ ],
18
+ "eos_token": {
19
+ "content": "<|endoftext|>",
20
+ "lstrip": false,
21
+ "normalized": false,
22
+ "rstrip": false,
23
+ "single_word": false
24
+ },
25
+ "pad_token": {
26
+ "content": "<|endoftext|>",
27
+ "lstrip": false,
28
+ "normalized": false,
29
+ "rstrip": false,
30
+ "single_word": false
31
+ }
32
+ }
tokenization_qwen2.py ADDED
@@ -0,0 +1,342 @@
1
+ # coding=utf-8
2
+ # Copyright 2024 The Qwen team, Alibaba Group and The HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+ """Tokenization classes for Qwen2."""
16
+
17
+ import json
18
+ import os
19
+ import unicodedata
20
+ from functools import lru_cache
21
+ from typing import Optional, Tuple
22
+
23
+ import regex as re
24
+
25
+ from transformers.tokenization_utils import AddedToken, PreTrainedTokenizer
26
+ from transformers.utils import logging
27
+
28
+
29
+ logger = logging.get_logger(__name__)
30
+
31
+ VOCAB_FILES_NAMES = {
32
+ "vocab_file": "vocab.json",
33
+ "merges_file": "merges.txt",
34
+ }
35
+
36
+
37
+ MAX_MODEL_INPUT_SIZES = {"qwen/qwen-tokenizer": 32768}
38
+
39
+ PRETOKENIZE_REGEX = r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"""
40
+
41
+
42
+ @lru_cache()
43
+ # Copied from transformers.models.gpt2.tokenization_gpt2.bytes_to_unicode
44
+ def bytes_to_unicode():
45
+ """
46
+ Returns a mapping from utf-8 bytes to unicode strings. We specifically avoid mapping to whitespace/control
47
+ characters that the bpe code barfs on.
48
+
49
+ The reversible bpe codes work on unicode strings. This means you need a large # of unicode characters in your vocab
50
+ if you want to avoid UNKs. When you're at something like a 10B token dataset you end up needing around 5K for
51
+ decent coverage. This is a significant percentage of your normal, say, 32K bpe vocab. To avoid that, we want lookup
52
+ tables between utf-8 bytes and unicode strings.
53
+ """
54
+ bs = (
55
+ list(range(ord("!"), ord("~") + 1)) + list(range(ord("¡"), ord("¬") + 1)) + list(range(ord("®"), ord("ÿ") + 1))
56
+ )
57
+ cs = bs[:]
58
+ n = 0
59
+ for b in range(2**8):
60
+ if b not in bs:
61
+ bs.append(b)
62
+ cs.append(2**8 + n)
63
+ n += 1
64
+ cs = [chr(n) for n in cs]
65
+ return dict(zip(bs, cs))
66
+
67
+
68
+ # Copied from transformers.models.gpt2.tokenization_gpt2.get_pairs
69
+ def get_pairs(word):
70
+ """
71
+ Return set of symbol pairs in a word.
72
+
73
+ Word is represented as tuple of symbols (symbols being variable-length strings).
74
+ """
75
+ pairs = set()
76
+ prev_char = word[0]
77
+ for char in word[1:]:
78
+ pairs.add((prev_char, char))
79
+ prev_char = char
80
+ return pairs
81
+
82
+
83
+ class Qwen2Tokenizer(PreTrainedTokenizer):
84
+ """
85
+ Construct a Qwen2 tokenizer. Based on byte-level Byte-Pair-Encoding.
86
+
87
+ Like GPT2Tokenizer, this tokenizer has been trained to treat spaces as parts of the tokens, so a word will
88
+ be encoded differently depending on whether it is at the beginning of the sentence (without space) or not:
89
+
90
+ ```python
91
+ >>> from transformers import Qwen2Tokenizer
92
+
93
+ >>> tokenizer = Qwen2Tokenizer.from_pretrained("Qwen/Qwen-tokenizer")
94
+ >>> tokenizer("Hello world")["input_ids"]
95
+ [9707, 1879]
96
+
97
+ >>> tokenizer(" Hello world")["input_ids"]
98
+ [21927, 1879]
99
+ ```
100
+ This is expected.
101
+
102
+ You should not use GPT2Tokenizer instead, because of the different pretokenization rules.
103
+
104
+ This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
105
+ this superclass for more information regarding those methods.
106
+
107
+ Args:
108
+ vocab_file (`str`):
109
+ Path to the vocabulary file.
110
+ merges_file (`str`):
111
+ Path to the merges file.
112
+ errors (`str`, *optional*, defaults to `"replace"`):
113
+ Paradigm to follow when decoding bytes to UTF-8. See
114
+ [bytes.decode](https://docs.python.org/3/library/stdtypes.html#bytes.decode) for more information.
115
+ unk_token (`str`, *optional*, defaults to `"<|endoftext|>"`):
116
+ The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
117
+ token instead.
118
+ bos_token (`str`, *optional*):
119
+ The beginning of sequence token. Not applicable for this tokenizer.
120
+ eos_token (`str`, *optional*, defaults to `"<|endoftext|>"`):
121
+ The end of sequence token.
122
+ pad_token (`str`, *optional*, defaults to `"<|endoftext|>"`):
123
+ The token used for padding, for example when batching sequences of different lengths.
124
+ clean_up_tokenization_spaces (`bool`, *optional*, defaults to `False`):
125
+ Whether or not the model should cleanup the spaces that were added when splitting the input text during the
126
+ tokenization process. Not applicable to this tokenizer, since tokenization does not add spaces.
127
+ split_special_tokens (`bool`, *optional*, defaults to `False`):
128
+ Whether or not the special tokens should be split during the tokenization process. The default behavior is
129
+ to not split special tokens. This means that if `<|endoftext|>` is the `eos_token`, then `tokenizer.tokenize("<|endoftext|>")`
130
+ gives `['<|endoftext|>']`. Otherwise, if `split_special_tokens=True`, then `tokenizer.tokenize("<|endoftext|>")` will give `['<',
131
+ '|', 'endo', 'ft', 'ext', '|', '>']`. This argument is only supported for `slow` tokenizers for the moment.
132
+ """
133
+
134
+ vocab_files_names = VOCAB_FILES_NAMES
135
+ model_input_names = ["input_ids", "attention_mask"]
136
+
137
+ def __init__(
138
+ self,
139
+ vocab_file,
140
+ merges_file,
141
+ errors="replace",
142
+ unk_token="<|endoftext|>",
143
+ bos_token=None,
144
+ eos_token="<|endoftext|>",
145
+ pad_token="<|endoftext|>",
146
+ clean_up_tokenization_spaces=False,
147
+ split_special_tokens=False,
148
+ **kwargs,
149
+ ):
150
+ # Qwen vocab does not contain control tokens; added tokens need to be special
151
+ bos_token = (
152
+ AddedToken(bos_token, lstrip=False, rstrip=False, special=True, normalized=False)
153
+ if isinstance(bos_token, str)
154
+ else bos_token
155
+ )
156
+ eos_token = (
157
+ AddedToken(eos_token, lstrip=False, rstrip=False, special=True, normalized=False)
158
+ if isinstance(eos_token, str)
159
+ else eos_token
160
+ )
161
+ unk_token = (
162
+ AddedToken(unk_token, lstrip=False, rstrip=False, special=True, normalized=False)
163
+ if isinstance(unk_token, str)
164
+ else unk_token
165
+ )
166
+ pad_token = (
167
+ AddedToken(pad_token, lstrip=False, rstrip=False, special=True, normalized=False)
168
+ if isinstance(pad_token, str)
169
+ else pad_token
170
+ )
171
+
172
+ with open(vocab_file, encoding="utf-8") as vocab_handle:
173
+ self.encoder = json.load(vocab_handle)
174
+ self.decoder = {v: k for k, v in self.encoder.items()}
175
+ self.errors = errors # how to handle errors in decoding
176
+ self.byte_encoder = bytes_to_unicode()
177
+ self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}
178
+ bpe_merges = []
179
+ with open(merges_file, encoding="utf-8") as merges_handle:
180
+ for i, line in enumerate(merges_handle):
181
+ line = line.strip()
182
+ if (i == 0 and line.startswith("#version:")) or not line:
183
+ continue
184
+ bpe_merges.append(tuple(line.split()))
185
+ self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))
186
+ # NOTE: the cache can grow without bound and will get really large for long running processes
187
+ # (esp. for text in languages that do not put spaces between words, e.g. Chinese); technically
188
+ # not a memory leak but appears as one.
189
+ # GPT2Tokenizer has the same problem, so let's be consistent.
190
+ self.cache = {}
191
+
192
+ self.pat = re.compile(PRETOKENIZE_REGEX)
193
+
194
+ if kwargs.get("add_prefix_space", False):
195
+ logger.warning_once(
196
+ f"{self.__class__.__name__} does not support `add_prefix_space`, setting it to True has no effect."
197
+ )
198
+
199
+ super().__init__(
200
+ errors=errors,
201
+ bos_token=bos_token,
202
+ eos_token=eos_token,
203
+ pad_token=pad_token,
204
+ unk_token=unk_token,
205
+ clean_up_tokenization_spaces=clean_up_tokenization_spaces,
206
+ split_special_tokens=split_special_tokens,
207
+ **kwargs,
208
+ )
209
+
210
+ @property
211
+ def vocab_size(self) -> int:
212
+ return len(self.encoder)
213
+
214
+ # Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer.get_vocab
215
+ def get_vocab(self):
216
+ return dict(self.encoder, **self.added_tokens_encoder)
217
+
218
+ # Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer.bpe
219
+ def bpe(self, token):
220
+ if token in self.cache:
221
+ return self.cache[token]
222
+ word = tuple(token)
223
+ pairs = get_pairs(word)
224
+
225
+ if not pairs:
226
+ return token
227
+
228
+ while True:
229
+ bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float("inf")))
230
+ if bigram not in self.bpe_ranks:
231
+ break
232
+ first, second = bigram
233
+ new_word = []
234
+ i = 0
235
+ while i < len(word):
236
+ try:
237
+ j = word.index(first, i)
238
+ except ValueError:
239
+ new_word.extend(word[i:])
240
+ break
241
+ else:
242
+ new_word.extend(word[i:j])
243
+ i = j
244
+
245
+ if word[i] == first and i < len(word) - 1 and word[i + 1] == second:
246
+ new_word.append(first + second)
247
+ i += 2
248
+ else:
249
+ new_word.append(word[i])
250
+ i += 1
251
+ new_word = tuple(new_word)
252
+ word = new_word
253
+ if len(word) == 1:
254
+ break
255
+ else:
256
+ pairs = get_pairs(word)
257
+ word = " ".join(word)
258
+ self.cache[token] = word
259
+ return word
260
+
261
+ # Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer._tokenize
262
+ def _tokenize(self, text):
263
+ """Tokenize a string."""
264
+ bpe_tokens = []
265
+ for token in re.findall(self.pat, text):
266
+ token = "".join(
267
+ self.byte_encoder[b] for b in token.encode("utf-8")
268
+ ) # Maps all our bytes to unicode strings, avoiding control tokens of the BPE (spaces in our case)
269
+ bpe_tokens.extend(bpe_token for bpe_token in self.bpe(token).split(" "))
270
+ return bpe_tokens
271
+
272
+ # Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer._convert_token_to_id
273
+ def _convert_token_to_id(self, token):
274
+ """Converts a token (str) to an id using the vocab."""
275
+ return self.encoder.get(token, self.encoder.get(self.unk_token))
276
+
277
+ # Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer._convert_id_to_token
278
+ def _convert_id_to_token(self, index):
279
+ """Converts an index (integer) to a token (str) using the vocab."""
280
+ return self.decoder.get(index)
281
+
282
+ # Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer.convert_tokens_to_string
283
+ def convert_tokens_to_string(self, tokens):
284
+ """Converts a sequence of tokens (strings) into a single string."""
285
+ text = "".join(tokens)
286
+ text = bytearray([self.byte_decoder[c] for c in text]).decode("utf-8", errors=self.errors)
287
+ return text
288
+
289
+ def decode(
290
+ self,
291
+ token_ids,
292
+ skip_special_tokens: bool = False,
293
+ clean_up_tokenization_spaces: Optional[bool] = False,
294
+ spaces_between_special_tokens: bool = False,
295
+ **kwargs,
296
+ ) -> str:
297
+ # `spaces_between_special_tokens` defaults to True for _decode in slow tokenizers
298
+ # and cannot be configured elsewhere, but it should default to False for Qwen2Tokenizer
299
+ return super().decode(
300
+ token_ids,
301
+ skip_special_tokens=skip_special_tokens,
302
+ clean_up_tokenization_spaces=clean_up_tokenization_spaces,
303
+ spaces_between_special_tokens=spaces_between_special_tokens,
304
+ **kwargs,
305
+ )
306
+
307
+ # Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer.save_vocabulary
308
+ def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
309
+ if not os.path.isdir(save_directory):
310
+ logger.error(f"Vocabulary path ({save_directory}) should be a directory")
311
+ return
312
+ vocab_file = os.path.join(
313
+ save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
314
+ )
315
+ merge_file = os.path.join(
316
+ save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["merges_file"]
317
+ )
318
+
319
+ with open(vocab_file, "w", encoding="utf-8") as f:
320
+ f.write(json.dumps(self.encoder, indent=2, sort_keys=True, ensure_ascii=False) + "\n")
321
+
322
+ index = 0
323
+ with open(merge_file, "w", encoding="utf-8") as writer:
324
+ writer.write("#version: 0.2\n")
325
+ for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]):
326
+ if index != token_index:
327
+ logger.warning(
328
+ f"Saving vocabulary to {merge_file}: BPE merge indices are not consecutive."
329
+ " Please check that the tokenizer is not corrupted!"
330
+ )
331
+ index = token_index
332
+ writer.write(" ".join(bpe_tokens) + "\n")
333
+ index += 1
334
+
335
+ return vocab_file, merge_file
336
+
337
+ def prepare_for_tokenization(self, text, **kwargs):
338
+ text = unicodedata.normalize("NFC", text)
339
+ return (text, kwargs)
340
+
341
+
342
+ __all__ = ["Qwen2Tokenizer"]
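The byte-to-unicode mapping and the merge loop in `bpe()` above can be tried in isolation. The following is a minimal sketch, not the tokenizer itself: the byte table is rebuilt the same way as `bytes_to_unicode()`, while `TOY_RANKS`, `to_symbols`, `from_symbols` and `toy_bpe` are hypothetical names and data chosen for illustration. Real merge ranks come from `merges.txt`, and this sketch skips the regex pre-tokenization step and the cache.

```python
def bytes_to_unicode():
    """Map every byte 0..255 to a printable unicode character (same scheme as above)."""
    # Printable ranges kept as-is: '!'..'~', 0xA1..0xAC, 0xAE..0xFF.
    bs = list(range(ord("!"), ord("~") + 1)) + list(range(0xA1, 0xAC + 1)) + list(range(0xAE, 0xFF + 1))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Remaining bytes (whitespace, control, ...) are shifted to code points >= 256.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


BYTE_ENCODER = bytes_to_unicode()
BYTE_DECODER = {v: k for k, v in BYTE_ENCODER.items()}


def to_symbols(text):
    """Encode text to UTF-8 and replace each byte by its printable stand-in."""
    return "".join(BYTE_ENCODER[b] for b in text.encode("utf-8"))


def from_symbols(symbols):
    """Inverse of to_symbols."""
    return bytearray(BYTE_DECODER[c] for c in symbols).decode("utf-8")


def toy_bpe(token, ranks):
    """Repeatedly merge the lowest-ranked adjacent pair until no known pair is left."""
    word = tuple(token)
    while len(word) > 1:
        pairs = set(zip(word, word[1:]))
        best = min(pairs, key=lambda pair: ranks.get(pair, float("inf")))
        if best not in ranks:
            break
        first, second = best
        merged, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and word[i] == first and word[i + 1] == second:
                merged.append(first + second)
                i += 2
            else:
                merged.append(word[i])
                i += 1
        word = tuple(merged)
    return list(word)


# Hypothetical merge table; lower rank means the merge is applied earlier.
TOY_RANKS = {("l", "l"): 0, ("h", "e"): 1, ("he", "ll"): 2}

print(to_symbols(" Hello"))
print(toy_bpe("hello", TOY_RANKS))
```

The leading space in `" Hello"` becomes a visible stand-in character, which is why a word is tokenized differently with and without a preceding space.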
tokenization_qwen2_fast.py ADDED
@@ -0,0 +1,137 @@
1
+ # coding=utf-8
2
+ # Copyright 2024 The Qwen team, Alibaba Group and The HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+ """Tokenization classes for Qwen2."""
16
+
17
+ from typing import Optional, Tuple
18
+
19
+ from transformers.tokenization_utils import AddedToken
20
+ from transformers.tokenization_utils_fast import PreTrainedTokenizerFast
21
+ from transformers.utils import logging
22
+ from .tokenization_qwen2 import Qwen2Tokenizer
23
+
24
+
25
+ logger = logging.get_logger(__name__)
26
+
27
+ VOCAB_FILES_NAMES = {
28
+ "vocab_file": "vocab.json",
29
+ "merges_file": "merges.txt",
30
+ "tokenizer_file": "tokenizer.json",
31
+ }
32
+
33
+
34
+ MAX_MODEL_INPUT_SIZES = {"qwen/qwen-tokenizer": 32768}
35
+
36
+
37
+ class Qwen2TokenizerFast(PreTrainedTokenizerFast):
38
+ """
39
+ Construct a "fast" Qwen2 tokenizer (backed by HuggingFace's *tokenizers* library). Based on byte-level
40
+ Byte-Pair-Encoding.
41
+
42
+ Like GPT2Tokenizer, this tokenizer has been trained to treat spaces as parts of the tokens, so a word will
43
+ be encoded differently depending on whether it is at the beginning of the sentence (without space) or not:
44
+
45
+ ```python
46
+ >>> from transformers import Qwen2TokenizerFast
47
+
48
+ >>> tokenizer = Qwen2TokenizerFast.from_pretrained("Qwen/Qwen-tokenizer")
49
+ >>> tokenizer("Hello world")["input_ids"]
50
+ [9707, 1879]
51
+
52
+ >>> tokenizer(" Hello world")["input_ids"]
53
+ [21927, 1879]
54
+ ```
55
+ This is expected.
56
+
57
+ This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should
58
+ refer to this superclass for more information regarding those methods.
59
+
60
+ Args:
61
+ vocab_file (`str`, *optional*):
62
+ Path to the vocabulary file.
63
+ merges_file (`str`, *optional*):
64
+ Path to the merges file.
65
+ tokenizer_file (`str`, *optional*):
66
+ Path to [tokenizers](https://github.com/huggingface/tokenizers) file (generally has a .json extension) that
67
+ contains everything needed to load the tokenizer.
68
+ unk_token (`str`, *optional*, defaults to `"<|endoftext|>"`):
69
+ The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
70
+ token instead. Not applicable to this tokenizer.
71
+ bos_token (`str`, *optional*):
72
+ The beginning of sequence token. Not applicable for this tokenizer.
73
+ eos_token (`str`, *optional*, defaults to `"<|endoftext|>"`):
74
+ The end of sequence token.
75
+ pad_token (`str`, *optional*, defaults to `"<|endoftext|>"`):
76
+ The token used for padding, for example when batching sequences of different lengths.
77
+ """
78
+
79
+ vocab_files_names = VOCAB_FILES_NAMES
80
+ model_input_names = ["input_ids", "attention_mask"]
81
+ slow_tokenizer_class = Qwen2Tokenizer
82
+
83
+ def __init__(
84
+ self,
85
+ vocab_file=None,
86
+ merges_file=None,
87
+ tokenizer_file=None,
88
+ unk_token="<|endoftext|>",
89
+ bos_token=None,
90
+ eos_token="<|endoftext|>",
91
+ pad_token="<|endoftext|>",
92
+ **kwargs,
93
+ ):
94
+ # We need to at least pass vocab_file and merges_file to base class
95
+ # in case a slow tokenizer needs to be initialized; others can be
96
+ # configured through files.
97
+ # following GPT2TokenizerFast, also adding unk_token, bos_token, and eos_token
98
+
99
+ bos_token = (
100
+ AddedToken(bos_token, lstrip=False, rstrip=False, special=True, normalized=False)
101
+ if isinstance(bos_token, str)
102
+ else bos_token
103
+ )
104
+ eos_token = (
105
+ AddedToken(eos_token, lstrip=False, rstrip=False, special=True, normalized=False)
106
+ if isinstance(eos_token, str)
107
+ else eos_token
108
+ )
109
+ unk_token = (
110
+ AddedToken(unk_token, lstrip=False, rstrip=False, special=True, normalized=False)
111
+ if isinstance(unk_token, str)
112
+ else unk_token
113
+ )
114
+ pad_token = (
115
+ AddedToken(pad_token, lstrip=False, rstrip=False, special=True, normalized=False)
116
+ if isinstance(pad_token, str)
117
+ else pad_token
118
+ )
119
+
120
+ super().__init__(
121
+ vocab_file=vocab_file,
122
+ merges_file=merges_file,
123
+ tokenizer_file=tokenizer_file,
124
+ unk_token=unk_token,
125
+ bos_token=bos_token,
126
+ eos_token=eos_token,
127
+ pad_token=pad_token,
128
+ **kwargs,
129
+ )
130
+
131
+ # Copied from transformers.models.gpt2.tokenization_gpt2_fast.GPT2TokenizerFast.save_vocabulary
132
+ def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
133
+ files = self._tokenizer.model.save(save_directory, name=filename_prefix)
134
+ return tuple(files)
135
+
136
+
137
+ __all__ = ["Qwen2TokenizerFast"]
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,256 @@
1
+ {
2
+ "add_bos_token": false,
3
+ "add_prefix_space": false,
4
+
5
+ "added_tokens_decoder": {
6
+ "151643": {
7
+ "content": "<|endoftext|>",
8
+ "lstrip": false,
9
+ "normalized": false,
10
+ "rstrip": false,
11
+ "single_word": false,
12
+ "special": true
13
+ },
14
+ "151644": {
15
+ "content": "<|im_start|>",
16
+ "lstrip": false,
17
+ "normalized": false,
18
+ "rstrip": false,
19
+ "single_word": false,
20
+ "special": true
21
+ },
22
+ "151645": {
23
+ "content": "<|im_end|>",
24
+ "lstrip": false,
25
+ "normalized": false,
26
+ "rstrip": false,
27
+ "single_word": false,
28
+ "special": true
29
+ },
30
+ "151646": {
31
+ "content": "<|object_ref_start|>",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false,
36
+ "special": true
37
+ },
38
+ "151647": {
39
+ "content": "<|object_ref_end|>",
40
+ "lstrip": false,
41
+ "normalized": false,
42
+ "rstrip": false,
43
+ "single_word": false,
44
+ "special": true
45
+ },
46
+ "151648": {
47
+ "content": "<|box_start|>",
48
+ "lstrip": false,
49
+ "normalized": false,
50
+ "rstrip": false,
51
+ "single_word": false,
52
+ "special": true
53
+ },
54
+ "151649": {
55
+ "content": "<|box_end|>",
56
+ "lstrip": false,
57
+ "normalized": false,
58
+ "rstrip": false,
59
+ "single_word": false,
60
+ "special": true
61
+ },
62
+ "151650": {
63
+ "content": "<|quad_start|>",
64
+ "lstrip": false,
65
+ "normalized": false,
66
+ "rstrip": false,
67
+ "single_word": false,
68
+ "special": true
69
+ },
70
+ "151651": {
71
+ "content": "<|quad_end|>",
72
+ "lstrip": false,
73
+ "normalized": false,
74
+ "rstrip": false,
75
+ "single_word": false,
76
+ "special": true
77
+ },
78
+ "151652": {
79
+ "content": "<|vision_start|>",
80
+ "lstrip": false,
81
+ "normalized": false,
82
+ "rstrip": false,
83
+ "single_word": false,
84
+ "special": true
85
+ },
86
+ "151653": {
87
+ "content": "<|vision_end|>",
88
+ "lstrip": false,
89
+ "normalized": false,
90
+ "rstrip": false,
91
+ "single_word": false,
92
+ "special": true
93
+ },
94
+ "151654": {
95
+ "content": "<|vision_pad|>",
96
+ "lstrip": false,
97
+ "normalized": false,
98
+ "rstrip": false,
99
+ "single_word": false,
100
+ "special": true
101
+ },
102
+ "151655": {
103
+ "content": "<|image_pad|>",
104
+ "lstrip": false,
105
+ "normalized": false,
106
+ "rstrip": false,
107
+ "single_word": false,
108
+ "special": true
109
+ },
110
+ "151656": {
111
+ "content": "<|video_pad|>",
112
+ "lstrip": false,
113
+ "normalized": false,
114
+ "rstrip": false,
115
+ "single_word": false,
116
+ "special": true
117
+ },
118
+ "151657": {
119
+ "content": "<tool_call>",
120
+ "lstrip": false,
121
+ "normalized": false,
122
+ "rstrip": false,
123
+ "single_word": false,
124
+ "special": false
125
+ },
126
+ "151658": {
127
+ "content": "</tool_call>",
128
+ "lstrip": false,
129
+ "normalized": false,
130
+ "rstrip": false,
131
+ "single_word": false,
132
+ "special": false
133
+ },
134
+ "151659": {
135
+ "content": "<|fim_prefix|>",
136
+ "lstrip": false,
137
+ "normalized": false,
138
+ "rstrip": false,
139
+ "single_word": false,
140
+ "special": false
141
+ },
142
+ "151660": {
143
+ "content": "<|fim_middle|>",
144
+ "lstrip": false,
145
+ "normalized": false,
146
+ "rstrip": false,
147
+ "single_word": false,
148
+ "special": false
149
+ },
150
+ "151661": {
151
+ "content": "<|fim_suffix|>",
152
+ "lstrip": false,
153
+ "normalized": false,
154
+ "rstrip": false,
155
+ "single_word": false,
156
+ "special": false
157
+ },
158
+ "151662": {
159
+ "content": "<|fim_pad|>",
160
+ "lstrip": false,
161
+ "normalized": false,
162
+ "rstrip": false,
163
+ "single_word": false,
164
+ "special": false
165
+ },
166
+ "151663": {
167
+ "content": "<|repo_name|>",
168
+ "lstrip": false,
169
+ "normalized": false,
170
+ "rstrip": false,
171
+ "single_word": false,
172
+ "special": false
173
+ },
174
+ "151664": {
175
+ "content": "<|file_sep|>",
176
+ "lstrip": false,
177
+ "normalized": false,
178
+ "rstrip": false,
179
+ "single_word": false,
180
+ "special": false
181
+ },
182
+ "151665": {
183
+ "content": "<tool_response>",
184
+ "lstrip": false,
185
+ "normalized": false,
186
+ "rstrip": false,
187
+ "single_word": false,
188
+ "special": false
189
+ },
190
+ "151666": {
191
+ "content": "</tool_response>",
192
+ "lstrip": false,
193
+ "normalized": false,
194
+ "rstrip": false,
195
+ "single_word": false,
196
+ "special": false
197
+ },
198
+ "151667": {
199
+ "content": "<think>",
200
+ "lstrip": false,
201
+ "normalized": false,
202
+ "rstrip": false,
203
+ "single_word": false,
204
+ "special": false
205
+ },
206
+ "151668": {
207
+ "content": "</think>",
208
+ "lstrip": false,
209
+ "normalized": false,
210
+ "rstrip": false,
211
+ "single_word": false,
212
+ "special": false
213
+ },
214
+ "151669": {
215
+ "content": "<|MASK|>",
216
+ "lstrip": false,
217
+ "normalized": false,
218
+ "rstrip": false,
219
+ "single_word": false,
220
+ "special": false
221
+ }
222
+ },
223
+ "additional_special_tokens": [
224
+ "<|im_start|>",
225
+ "<|im_end|>",
226
+ "<|object_ref_start|>",
227
+ "<|object_ref_end|>",
228
+ "<|box_start|>",
229
+ "<|box_end|>",
230
+ "<|quad_start|>",
231
+ "<|quad_end|>",
232
+ "<|vision_start|>",
233
+ "<|vision_end|>",
234
+ "<|vision_pad|>",
235
+ "<|image_pad|>",
236
+ "<|video_pad|>",
237
+ "<|MASK|>"
238
+ ],
239
+ "auto_map": {
240
+ "AutoTokenizer": [
241
+ "tokenization_qwen2.Qwen2Tokenizer",
242
+ null
243
+ ]
244
+ },
245
+ "bos_token": null,
246
+ "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0].role == 'system' %}\n {{- messages[0].content + '\\n\\n' }}\n {%- endif %}\n {{- \"# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0].role == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0].content + '<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}\n{%- for message in messages[::-1] %}\n {%- set index = (messages|length - 1) - loop.index0 %}\n {%- if ns.multi_step_tool and message.role == \"user\" and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}\n {%- set ns.multi_step_tool = false %}\n {%- set ns.last_query_index = index %}\n {%- endif %}\n{%- endfor %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {%- set content = message.content %}\n {%- set reasoning_content = '' %}\n {%- if message.reasoning_content is defined and message.reasoning_content is not none %}\n {%- set reasoning_content = message.reasoning_content %}\n {%- else %}\n {%- if '</think>' in message.content %}\n {%- set content = message.content.split('</think>')[-1].lstrip('\\n') %}\n {%- set reasoning_content = message.content.split('</think>')[0].rstrip('\\n').split('<think>')[-1].lstrip('\\n') %}\n {%- endif %}\n {%- endif %}\n {%- if loop.index0 > ns.last_query_index %}\n {%- if loop.last or (not loop.last and reasoning_content) %}\n {{- '<|im_start|>' + message.role + '\\n<think>\\n' + reasoning_content.strip('\\n') + '\\n</think>\\n\\n' + content.lstrip('\\n') }}\n {%- else %}\n {{- '<|im_start|>' + message.role + '\\n' + content }}\n {%- endif %}\n {%- else %}\n {{- '<|im_start|>' + message.role + '\\n' + content }}\n {%- endif %}\n {%- if message.tool_calls %}\n {%- for tool_call in message.tool_calls %}\n {%- if (loop.first and content) or (not loop.first) %}\n {{- '\\n' }}\n {%- endif %}\n {%- if tool_call.function %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '<tool_call>\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {%- if tool_call.arguments is string %}\n {{- tool_call.arguments }}\n {%- else %}\n {{- tool_call.arguments | tojson }}\n {%- endif %}\n {{- '}\\n</tool_call>' }}\n {%- endfor %}\n {%- endif %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if loop.first or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n<tool_response>\\n' }}\n {{- message.content }}\n {{- '\\n</tool_response>' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n {%- if enable_thinking is defined and enable_thinking is false %}\n {{- '<think>\\n\\n</think>\\n\\n' }}\n {%- endif %}\n{%- endif %}",
247
+ "clean_up_tokenization_spaces": false,
248
+ "eos_token": "<|endoftext|>",
249
+ "mask_token": "<|MASK|>",
250
+ "errors": "replace",
251
+ "model_max_length": 131072,
252
+ "pad_token": "<|endoftext|>",
253
+ "split_special_tokens": false,
254
+ "tokenizer_class": "Qwen2Tokenizer",
255
+ "unk_token": null
256
+ }
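To make the `chat_template` above easier to read, here is a rough Python sketch of the prompt it appears to produce in the simplest case: no tools, only `system` and `user` turns, followed by the generation prompt. The function name `render_chatml` is an assumption for illustration; it is a simplified stand-in written from a reading of the template, not the Jinja template itself, and it deliberately rejects assistant and tool turns, whose handling (reasoning content, tool calls) is more involved.

```python
def render_chatml(messages, add_generation_prompt=True, enable_thinking=True):
    """Simplified stand-in for the chat template: system/user turns only, no tools."""
    parts = []
    for message in messages:
        role = message["role"]
        if role not in ("system", "user"):
            # Assistant and tool turns need the reasoning/tool-call logic of the real template.
            raise ValueError(f"role {role!r} is outside this simplified sketch")
        parts.append(f"<|im_start|>{role}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
        if enable_thinking is False:
            # The template pre-fills an empty think block when thinking is disabled.
            parts.append("<think>\n\n</think>\n\n")
    return "".join(parts)


prompt = render_chatml(
    [
        {"role": "system", "content": "You are helpful."},
        {"role": "user", "content": "Hi"},
    ],
    add_generation_prompt=True,
    enable_thinking=False,
)
print(prompt)
```

For real use, `tokenizer.apply_chat_template(...)` from `transformers` renders the actual template; the sketch is only meant as a reading aid.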
vocab.json ADDED
The diff for this file is too large to render. See raw diff