qikp committed
Commit 1bec310 · 0 Parent(s)

Initial commit
Files changed (6)
  1. .gitattributes +35 -0
  2. AMLR.txt +88 -0
  3. README.md +6 -0
  4. config.json +76 -0
  5. configuration_openelm.py +316 -0
  6. model.safetensors +3 -0
.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
AMLR.txt ADDED
@@ -0,0 +1,88 @@
+ Disclaimer: IMPORTANT: This Apple Machine Learning Research Model is
+ specifically developed and released by Apple Inc. ("Apple") for the sole purpose
+ of scientific research of artificial intelligence and machine-learning
+ technology. “Apple Machine Learning Research Model” means the model, including
+ but not limited to algorithms, formulas, trained model weights, parameters,
+ configurations, checkpoints, and any related materials (including
+ documentation).
+
+ This Apple Machine Learning Research Model is provided to You by
+ Apple in consideration of your agreement to the following terms, and your use,
+ modification, creation of Model Derivatives, and or redistribution of the Apple
+ Machine Learning Research Model constitutes acceptance of this Agreement. If You
+ do not agree with these terms, please do not use, modify, create Model
+ Derivatives of, or distribute this Apple Machine Learning Research Model or
+ Model Derivatives.
+
+ * License Scope: In consideration of your agreement to abide by the following
+   terms, and subject to these terms, Apple hereby grants you a personal,
+   non-exclusive, worldwide, non-transferable, royalty-free, revocable, and
+   limited license, to use, copy, modify, distribute, and create Model
+   Derivatives (defined below) of the Apple Machine Learning Research Model
+   exclusively for Research Purposes. You agree that any Model Derivatives You
+   may create or that may be created for You will be limited to Research Purposes
+   as well. “Research Purposes” means non-commercial scientific research and
+   academic development activities, such as experimentation, analysis, testing
+   conducted by You with the sole intent to advance scientific knowledge and
+   research. “Research Purposes” does not include any commercial exploitation,
+   product development or use in any commercial product or service.
+
+ * Distribution of Apple Machine Learning Research Model and Model Derivatives:
+   If you choose to redistribute Apple Machine Learning Research Model or its
+   Model Derivatives, you must provide a copy of this Agreement to such third
+   party, and ensure that the following attribution notice be provided: “Apple
+   Machine Learning Research Model is licensed under the Apple Machine Learning
+   Research Model License Agreement.” Additionally, all Model Derivatives must
+   clearly be identified as such, including disclosure of modifications and
+   changes made to the Apple Machine Learning Research Model. The name,
+   trademarks, service marks or logos of Apple may not be used to endorse or
+   promote Model Derivatives or the relationship between You and Apple. “Model
+   Derivatives” means any models or any other artifacts created by modifications,
+   improvements, adaptations, alterations to the architecture, algorithm or
+   training processes of the Apple Machine Learning Research Model, or by any
+   retraining, fine-tuning of the Apple Machine Learning Research Model.
+
+ * No Other License: Except as expressly stated in this notice, no other rights
+   or licenses, express or implied, are granted by Apple herein, including but
+   not limited to any patent, trademark, and similar intellectual property rights
+   worldwide that may be infringed by the Apple Machine Learning Research Model,
+   the Model Derivatives or by other works in which the Apple Machine Learning
+   Research Model may be incorporated.
+
+ * Compliance with Laws: Your use of Apple Machine Learning Research Model must
+   be in compliance with all applicable laws and regulations.
+
+ * Term and Termination: The term of this Agreement will begin upon your
+   acceptance of this Agreement or use of the Apple Machine Learning Research
+   Model and will continue until terminated in accordance with the following
+   terms. Apple may terminate this Agreement at any time if You are in breach of
+   any term or condition of this Agreement. Upon termination of this Agreement,
+   You must cease to use all Apple Machine Learning Research Models and Model
+   Derivatives and permanently delete any copy thereof. Sections 3, 6 and 7 will
+   survive termination.
+
+ * Disclaimer and Limitation of Liability: This Apple Machine Learning Research
+   Model and any outputs generated by the Apple Machine Learning Research Model
+   are provided on an “AS IS” basis. APPLE MAKES NO WARRANTIES, EXPRESS OR
+   IMPLIED, INCLUDING WITHOUT LIMITATION THE IMPLIED WARRANTIES OF
+   NON-INFRINGEMENT, MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE,
+   REGARDING THE APPLE MACHINE LEARNING RESEARCH MODEL OR OUTPUTS GENERATED BY
+   THE APPLE MACHINE LEARNING RESEARCH MODEL. You are solely responsible for
+   determining the appropriateness of using or redistributing the Apple Machine
+   Learning Research Model and any outputs of the Apple Machine Learning Research
+   Model and assume any risks associated with Your use of the Apple Machine
+   Learning Research Model and any output and results. IN NO EVENT SHALL APPLE BE
+   LIABLE FOR ANY SPECIAL, INDIRECT, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
+   IN ANY WAY OUT OF THE USE, REPRODUCTION, MODIFICATION AND/OR DISTRIBUTION OF
+   THE APPLE MACHINE LEARNING RESEARCH MODEL AND ANY OUTPUTS OF THE APPLE MACHINE
+   LEARNING RESEARCH MODEL, HOWEVER CAUSED AND WHETHER UNDER THEORY OF CONTRACT,
+   TORT (INCLUDING NEGLIGENCE), STRICT LIABILITY OR OTHERWISE, EVEN IF APPLE HAS
+   BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+ * Governing Law: This Agreement will be governed by and construed under the laws
+   of the State of California without regard to its choice of law principles. The
+   Convention on Contracts for the International Sale of Goods shall not apply to
+   the Agreement except that the arbitration clause and any arbitration hereunder
+   shall be governed by the Federal Arbitration Act, Chapters 1 and 2.
+
+ Copyright (C) 2025 Apple Inc. All Rights Reserved.
README.md ADDED
@@ -0,0 +1,6 @@
+ ---
+ license: apple-amlr
+ ---
+ This is a randomly initialized model built from a modified config of [apple/OpenELM-270M](https://huggingface.co/apple/OpenELM-270M). It is for debugging or testing only. Use Apple's definitions for interfacing with this model. The configuration class was modified to remove an assertion that would prevent building this model at its modified size.
+
+ Apple Machine Learning Research Model is licensed under the Apple Machine Learning Research Model License Agreement.
config.json ADDED
@@ -0,0 +1,76 @@
+ {
+   "activation_fn_name": "swish",
+   "architectures": [
+     "OpenELMModel"
+   ],
+   "auto_map": {
+     "AutoConfig": "configuration_openelm.OpenELMConfig",
+     "AutoModelForCausalLM": "modeling_openelm.OpenELMForCausalLM"
+   },
+   "bos_token_id": 1,
+   "dtype": "float32",
+   "eos_token_id": 2,
+   "ffn_dim_divisor": 256,
+   "ffn_multipliers": [
+     0.5,
+     0.73,
+     0.97,
+     1.2,
+     1.43,
+     1.67,
+     1.9,
+     2.13,
+     2.37,
+     2.6,
+     2.83,
+     3.07,
+     3.3,
+     3.53,
+     3.77,
+     4.0
+   ],
+   "ffn_with_glu": true,
+   "head_dim": 32,
+   "initializer_range": 0.02,
+   "max_context_length": 2048,
+   "model_dim": 640,
+   "model_type": "openelm",
+   "normalization_layer_name": "rms_norm",
+   "normalize_qk_projections": true,
+   "num_gqa_groups": 4,
+   "num_kv_heads": [
+     3,
+     3,
+     3,
+     3,
+     3,
+     4,
+     4,
+     4,
+     4,
+     4,
+     4,
+     4,
+     5,
+     5,
+     5,
+     5
+   ],
+   "num_query_heads": [
+     12,
+     16,
+     16,
+     20
+   ],
+   "num_transformer_layers": 4,
+   "qkv_multipliers": [
+     0.5,
+     1.0
+   ],
+   "rope_freq_constant": 10000,
+   "rope_max_length": 4096,
+   "share_input_output_layers": true,
+   "transformers_version": "4.57.3",
+   "use_cache": true,
+   "vocab_size": 16000
+ }
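The `ffn_multipliers` list above is not arbitrary: it is a linear ramp from 0.5 to 4.0 rounded to two decimals, the block-wise scaling schedule that `configuration_openelm.py` builds with `np.linspace` (here carrying the 16 steps of the original 16-layer OpenELM-270M, even though this debug config sets `num_transformer_layers` to 4). A minimal pure-Python sketch (`ramp` is a hypothetical helper standing in for `np.linspace` plus rounding) reproduces it:

```python
# Reproduce the ffn_multipliers schedule in config.json: a linear ramp from
# 0.5 to 4.0 over 16 steps, rounded to 2 decimals (block-wise scaling).
def ramp(start: float, stop: float, num: int) -> list:
    step = (stop - start) / (num - 1)
    return [round(start + i * step, 2) for i in range(num)]

ffn_multipliers = ramp(0.5, 4.0, num=16)
print(ffn_multipliers)
# [0.5, 0.73, 0.97, 1.2, 1.43, 1.67, 1.9, 2.13, 2.37, 2.6, 2.83, 3.07, 3.3, 3.53, 3.77, 4.0]
```

The output matches the stored list entry for entry, which is how the per-layer FFN widths grow smoothly with depth.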
configuration_openelm.py ADDED
@@ -0,0 +1,316 @@
+ #
+ # For licensing see accompanying LICENSE file.
+ # Copyright (C) 2024 Apple Inc. All Rights Reserved.
+ #
+
+ """Implements HF OpenELMConfig based on PretrainedConfig"""
+ from numbers import Number
+ from typing import List, Optional, Union
+
+ import numpy as np
+ from transformers import PretrainedConfig
+
+
+ def make_divisible(
+     v: Union[float, int],
+     divisor: Optional[int] = 8,
+     min_value: Optional[Union[float, int]] = None,
+ ) -> Union[float, int]:
+     """
+     This function is taken from the original tf repo.
+     It ensures that all layers have a channel number that is divisible by the divisor.
+     It can be seen at:
+     https://github.com/tensorflow/models/blob/2cfc99eff5e5eb729c6793d2f3d03aa1c9be2b15/research/slim/nets/mobilenet/mobilenet.py#L62
+
+     Args:
+         v: input value
+         divisor: defaults to 8
+         min_value: minimum divisor value
+     Returns:
+         new_v: new divisible value
+     """
+     if min_value is None:
+         min_value = divisor
+     new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
+     # Make sure that rounding down does not go down by more than 10%.
+     if new_v < 0.9 * v:
+         new_v += divisor
+     return new_v
+
+
+ def compute_heads(model_dim: int, head_dim: int) -> int:
+     """Compute the number of heads.
+
+     Args:
+         model_dim: Model dimension.
+         head_dim: Head dimension.
+
+     Returns:
+         The number of heads in multi-head attention.
+
+     Raises:
+         ValueError: if model dimension is not divisible by head dimension.
+     """
+     if model_dim % head_dim == 0:
+         return model_dim // head_dim
+     else:
+         raise ValueError(
+             f"Model dimension should be divisible by head dimension. Got: {model_dim} and {head_dim}."
+         )
+
+
+ OpenELM_CONFIGS = {
+     "OpenELM-270M": dict(
+         num_transformer_layers=16,
+         model_dim=1280,
+         head_dim=64,
+         num_gqa_groups=4,
+         normalize_qk_projections=True,
+         share_input_output_layers=True,
+         # Vary the FFN and QKV multipliers to create variable FFN and attention layers respectively.
+         ffn_multipliers=(0.5, 4.0),
+         qkv_multipliers=(0.5, 1.0),
+     ),
+     "OpenELM-450M": dict(
+         num_transformer_layers=20,
+         model_dim=1536,
+         head_dim=64,
+         num_gqa_groups=4,
+         normalize_qk_projections=True,
+         share_input_output_layers=True,
+         # Vary the FFN and QKV multipliers to create variable FFN and attention layers respectively.
+         ffn_multipliers=(0.5, 4.0),
+         qkv_multipliers=(0.5, 1.0),
+     ),
+     "OpenELM-1_1B": dict(
+         num_transformer_layers=28,
+         model_dim=2048,
+         head_dim=64,
+         num_gqa_groups=4,
+         normalize_qk_projections=True,
+         share_input_output_layers=True,
+         # Vary the FFN and QKV multipliers to create variable FFN and attention layers respectively.
+         ffn_multipliers=(0.5, 4.0),
+         qkv_multipliers=(0.5, 1.0),
+     ),
+     "OpenELM-3B": dict(
+         num_transformer_layers=36,
+         model_dim=3072,
+         head_dim=128,
+         num_gqa_groups=4,
+         normalize_qk_projections=True,
+         share_input_output_layers=True,
+         # Vary the FFN and QKV multipliers to create variable FFN and attention layers respectively.
+         ffn_multipliers=(0.5, 4.0),
+         qkv_multipliers=(0.5, 1.0),
+     ),
+ }
+
+
+ class OpenELMConfig(PretrainedConfig):
+     r"""
+     This is the configuration class to store the configuration of a [`OpenELMModel`]. It is used to instantiate an OpenELM model according to the specified arguments, defining the model architecture.
+
+     Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+     documentation from [`PretrainedConfig`] for more information.
+
+     Args:
+         vocab_size (`int`, *optional*, defaults to 32000):
+             Vocabulary size of the OpenELM model.
+         max_context_length (`int`, *optional*, defaults to 2048):
+             Maximum number of input tokens.
+         num_transformer_layers (`int`, *optional*, defaults to 12):
+             Number of hidden layers in the Transformer decoder.
+         model_dim (`int`, *optional*, defaults to 2048):
+             Dimension of the hidden representations.
+         head_dim (`int`, *optional*, defaults to 128):
+             The attention head dimension.
+         qkv_multipliers (`Union[Number, List[Number]]`, *optional*, defaults to 1.0):
+             If qkv_multipliers is a Number, all attention layers have the same latent dimension,
+             resulting in a uniform allocation of parameters.
+             If qkv_multipliers is a list of two numbers, each attention layer has a different latent dimension
+             (assuming qkv_multipliers[0] != qkv_multipliers[1]), resulting in a variable allocation of parameters across attention layers.
+             This scaling is known as layer-wise or block-wise scaling: https://arxiv.org/abs/2008.00623
+         num_query_heads (`Union[int, None]`, *optional*, defaults to None):
+             The number of query heads; if None, it is computed as `compute_heads(model_dim=model_dim, head_dim=head_dim)`.
+         num_gqa_groups (`int`, *optional*, defaults to 1):
+             This variable allows switching between multi-head attention, grouped-query attention, and multi-query attention.
+             When num_gqa_groups == 1, it is multi-head attention.
+             When 1 < num_gqa_groups < num_heads and num_heads is divisible by num_gqa_groups, it is grouped-query attention.
+             When num_gqa_groups == num_heads, it is multi-query attention.
+         ffn_multipliers (`Union[Number, List[Number]]`, *optional*, defaults to 4.0):
+             Feed-forward network (FFN) multipliers.
+             If ffn_multipliers is a Number, all FFN layers have the same latent dimension,
+             resulting in a uniform allocation of parameters.
+             If ffn_multipliers is a list of two numbers, each FFN layer has a different latent dimension
+             (assuming ffn_multipliers[0] != ffn_multipliers[1]), resulting in a variable allocation of parameters across FFN layers.
+             This scaling is known as layer-wise or block-wise scaling: https://arxiv.org/abs/2008.00623
+         ffn_with_glu (`bool`, *optional*, defaults to True):
+             Whether to use an FFN with a Gated Linear Unit (GLU).
+         ffn_dim_divisor (`int`, *optional*, defaults to 256):
+             The FFN layer dimension divisor.
+         activation_fn_name (`str` or `function`, *optional*, defaults to `"swish"`):
+             The non-linear activation function (function or string) in the decoder.
+         normalization_layer_name (`str` or `function`, *optional*, defaults to `"rms_norm"`):
+             Type of normalization layer.
+         normalize_qk_projections (`bool`, *optional*, defaults to False):
+             Whether to normalize queries and keys after projections.
+         share_input_output_layers (`bool`, *optional*, defaults to False):
+             Whether to share the embedding between the input and output linear layers.
+         rope_freq_constant (`int`, *optional*, defaults to 10000):
+             The base period of the RoPE embeddings.
+         rope_max_length (`int`, *optional*, defaults to 4096):
+             rope_max_length is set to twice max_context_length.
+             This allows flexibility in token lengths during training or fine-tuning.
+         initializer_range (`float`, *optional*, defaults to 0.02):
+             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+         use_cache (`bool`, *optional*, defaults to `True`):
+             Whether or not the model should return the last key/values attentions (not used by all models). Only
+             relevant if `config.is_decoder=True`.
+         bos_token_id (`int`, *optional*, defaults to 1):
+             Beginning-of-stream token id.
+         eos_token_id (`int`, *optional*, defaults to 2):
+             End-of-stream token id.
+     """
+
+     model_type = "openelm"
+
+     def __init__(
+         self,
+         vocab_size: int = 32000,
+         max_context_length: int = 2048,
+         num_transformer_layers: int = 12,
+         model_dim: int = 2048,
+         head_dim: int = 128,
+         qkv_multipliers: Union[Number, List[Number]] = 1.0,
+         num_query_heads: Union[int, None] = None,
+         num_gqa_groups: int = 1,
+         ffn_multipliers: Union[Number, List[Number]] = 4.0,
+         ffn_with_glu: bool = True,
+         ffn_dim_divisor: int = 256,
+         activation_fn_name: str = "swish",
+         normalization_layer_name: str = "rms_norm",
+         normalize_qk_projections: bool = False,
+         share_input_output_layers: bool = False,
+         rope_freq_constant: int = 10000,
+         rope_max_length: int = 4096,
+         initializer_range: float = 0.02,
+         use_cache: bool = True,
+         bos_token_id: int = 1,
+         eos_token_id: int = 2,
+         **kwargs,
+     ) -> None:
+         self.vocab_size = vocab_size
+         self.max_context_length = max_context_length
+         self.num_transformer_layers = num_transformer_layers
+         self.model_dim = model_dim
+         self.head_dim = head_dim
+         self.qkv_multipliers = qkv_multipliers
+         self.num_query_heads = num_query_heads
+         self.num_gqa_groups = num_gqa_groups
+         self.ffn_multipliers = ffn_multipliers
+         self.ffn_with_glu = ffn_with_glu
+         self.ffn_dim_divisor = ffn_dim_divisor
+         self.activation_fn_name = activation_fn_name
+         self.normalization_layer_name = normalization_layer_name
+         self.normalize_qk_projections = normalize_qk_projections
+         self.share_input_output_layers = share_input_output_layers
+         self.rope_freq_constant = rope_freq_constant
+         self.rope_max_length = rope_max_length
+         self.num_query_heads = (
+             compute_heads(model_dim=model_dim, head_dim=head_dim)
+             if num_query_heads is None
+             else num_query_heads
+         )
+         self.initializer_range = initializer_range
+
+         self.__post_init__()
+         super().__init__(
+             use_cache=use_cache,
+             bos_token_id=bos_token_id,
+             eos_token_id=eos_token_id,
+             **kwargs,
+         )
+
+     def __post_init__(self) -> None:
+         if self.num_gqa_groups is not None:
+             head_multiple_of = self.num_gqa_groups
+         else:
+             head_multiple_of = 2
+
+         if isinstance(self.qkv_multipliers, Number):
+             # All attention layers have the same latent dimension, resulting in a uniform allocation of parameters.
+             qkv_dim = make_divisible(
+                 self.model_dim * self.qkv_multipliers,
+                 divisor=self.head_dim * head_multiple_of,
+             )
+             query_dims = [int(qkv_dim)] * self.num_transformer_layers
+
+         elif (
+             isinstance(self.qkv_multipliers, (tuple, list))
+             and len(self.qkv_multipliers) == 2
+         ):
+             # Each attention layer has a different latent dimension (assuming qkv_multipliers[0] != qkv_multipliers[1]).
+             # This results in a variable allocation of parameters in the attention layers.
+             # This scaling is known as layer-wise or block-wise scaling: https://arxiv.org/abs/2008.00623
+             qkv_multipliers = [
+                 round(v, 2)
+                 for v in np.linspace(
+                     self.qkv_multipliers[0],
+                     self.qkv_multipliers[1],
+                     num=self.num_transformer_layers,
+                     dtype=float,
+                 )
+             ]
+             # Make sure that the scaled model dimension is divisible by the scaled head dimension.
+             query_dims = [
+                 int(
+                     make_divisible(
+                         self.model_dim * m, divisor=self.head_dim * head_multiple_of
+                     )
+                 )
+                 for m in qkv_multipliers
+             ]
+         else:
+             raise NotImplementedError(
+                 f"QKV multipliers should be a single number or a list containing exactly two numbers. Got: {self.qkv_multipliers}."
+             )
+
+         # Compute the number of query, key, and value heads.
+         # For multi-head and multi-query attention, the numbers of query, key, and value heads are the same.
+         # For grouped-query attention, the numbers of key and value heads are the same.
+         self.num_query_heads = [
+             int(compute_heads(q_dim, self.head_dim)) for q_dim in query_dims
+         ]
+         self.num_kv_heads = [
+             q_heads // self.num_gqa_groups for q_heads in self.num_query_heads
+         ]
+
+         # Feed-forward network (FFN) multipliers
+         if isinstance(self.ffn_multipliers, Number):
+             # All FFN layers have the same latent dimension, resulting in a uniform allocation of parameters.
+             self.ffn_multipliers = [self.ffn_multipliers] * self.num_transformer_layers
+         elif isinstance(self.ffn_multipliers, (tuple, list)):
+             # Each FFN layer has a different latent dimension (assuming ffn_multipliers[0] != ffn_multipliers[1]).
+             # This results in a variable allocation of parameters in the FFN layers.
+             # This scaling is known as layer-wise or block-wise scaling: https://arxiv.org/abs/2008.00623
+             if len(self.ffn_multipliers) == 2:
+                 self.ffn_multipliers = [
+                     round(v, 2)
+                     for v in np.linspace(
+                         self.ffn_multipliers[0],
+                         self.ffn_multipliers[1],
+                         num=self.num_transformer_layers,
+                         dtype=float,
+                     )
+                 ]
+             else:
+                 # Note: the upstream length assertion was removed here (see README).
+                 pass
+         else:
+             raise NotImplementedError(
+                 f"FFN multipliers should be a single number or a list containing exactly two numbers. Got: {self.ffn_multipliers}."
+             )
+
+         # Check that num_query_heads is divisible by num_kv_heads for every layer.
+         for layer_idx in range(len(query_dims)):
+             assert self.num_query_heads[layer_idx] % self.num_kv_heads[layer_idx] == 0
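To see the head bookkeeping of `__post_init__` in action, here is a self-contained sketch that re-states `make_divisible` from the file above and uses a pure-Python stand-in for `np.linspace`. Fed this repo's config.json values (model_dim=640, head_dim=32, num_gqa_groups=4, qkv_multipliers=(0.5, 1.0), num_transformer_layers=4), it derives the per-layer query heads:

```python
# Derive per-layer query/KV head counts the way OpenELMConfig.__post_init__ does,
# using the values stored in this repo's config.json.

def make_divisible(v, divisor=8, min_value=None):
    # Round v to a multiple of divisor, bumping up if rounding lost more than 10%.
    if min_value is None:
        min_value = divisor
    new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
    if new_v < 0.9 * v:
        new_v += divisor
    return new_v

model_dim, head_dim, num_gqa_groups, num_layers = 640, 32, 4, 4
lo, hi = 0.5, 1.0  # qkv_multipliers

# Pure-Python stand-in for np.linspace(lo, hi, num_layers), rounded to 2 decimals.
multipliers = [round(lo + i * (hi - lo) / (num_layers - 1), 2) for i in range(num_layers)]

query_dims = [
    int(make_divisible(model_dim * m, divisor=head_dim * num_gqa_groups))
    for m in multipliers
]
num_query_heads = [d // head_dim for d in query_dims]
num_kv_heads = [q // num_gqa_groups for q in num_query_heads]

print(num_query_heads)  # [12, 16, 16, 20]
print(num_kv_heads)     # [3, 4, 4, 5]
```

The derived `num_query_heads` matches the list stored in config.json. Note that the stored `num_kv_heads` and `ffn_multipliers` lists still have 16 entries, which appear to be carried over from the 16-layer OpenELM-270M; that mismatch is presumably why the upstream length check had to be removed, as the README notes.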
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:263c29e306efb75d845e6c89119f5a46e79586900ffc1c16a6ab9579a1b4345b
+ size 73099968
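The three lines above are a Git LFS pointer file, not the weights themselves; the actual ~73 MB model.safetensors is fetched from LFS storage on checkout. A minimal sketch of parsing such a pointer (the format is simple `key value` pairs per the git-lfs spec referenced on its first line):

```python
# Parse a Git LFS pointer file: one "key value" pair per line.
pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:263c29e306efb75d845e6c89119f5a46e79586900ffc1c16a6ab9579a1b4345b
size 73099968
"""

fields = dict(line.split(" ", 1) for line in pointer.strip().splitlines())
algo, digest = fields["oid"].split(":", 1)

print(algo)                 # sha256
print(int(fields["size"]))  # 73099968
```

The `size` field is the byte count of the real object, and the `oid` digest is what LFS uses to address it in storage.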