kl1 committed · Commit 7b6c98e · verified · Parent: fc08a6a

Upload folder using huggingface_hub

ACKNOWLEDGMENTS ADDED
@@ -0,0 +1,31 @@
+ Acknowledgements
+ Portions of this ml-fs-dfm Software may utilize the following copyrighted
+ material, the use of which is hereby acknowledged.
+
+ _____________________
+
+ Facebook, Inc. (Flow Matching)
+ Attribution-NonCommercial 4.0 International
+
+ Creative Commons Corporation ("Creative Commons") is not a law firm and
+ does not provide legal services or legal advice. Distribution of
+ Creative Commons public licenses does not create a lawyer-client or
+ other relationship. Creative Commons makes its licenses and related
+ information available on an "as-is" basis. Creative Commons gives no
+ warranties regarding its licenses, any material licensed under their
+ terms and conditions, or any related information. Creative Commons
+ disclaims all liability for damages resulting from their use to the
+ fullest extent possible.
+
+ By exercising the Licensed Rights (defined below), You accept and agree
+ to be bound by the terms and conditions of this Creative Commons
+ Attribution-NonCommercial 4.0 International Public License ("Public
+ License"). To the extent this Public License may be interpreted as a
+ contract, You are granted the Licensed Rights in consideration of Your
+ acceptance of these terms and conditions, and the Licensor grants You
+ such rights in consideration of benefits the Licensor receives from
+ making the Licensed Material available under these terms and
+ conditions.
+
+ For the full license text, see:
+ https://creativecommons.org/licenses/by-nc/4.0/legalcode
LICENSE ADDED
@@ -0,0 +1,41 @@
+ Copyright (C) 2025 Apple Inc. All Rights Reserved.
+
+ IMPORTANT: This Apple software is supplied to you by Apple
+ Inc. ("Apple") in consideration of your agreement to the following
+ terms, and your use, installation, modification or redistribution of
+ this Apple software constitutes acceptance of these terms. If you do
+ not agree with these terms, please do not use, install, modify or
+ redistribute this Apple software.
+
+ In consideration of your agreement to abide by the following terms, and
+ subject to these terms, Apple grants you a personal, non-exclusive
+ license, under Apple's copyrights in this original Apple software (the
+ "Apple Software"), to use, reproduce, modify and redistribute the Apple
+ Software, with or without modifications, in source and/or binary forms;
+ provided that if you redistribute the Apple Software in its entirety and
+ without modifications, you must retain this notice and the following
+ text and disclaimers in all such redistributions of the Apple Software.
+ Neither the name, trademarks, service marks or logos of Apple Inc. may
+ be used to endorse or promote products derived from the Apple Software
+ without specific prior written permission from Apple. Except as
+ expressly stated in this notice, no other rights or licenses, express or
+ implied, are granted by Apple herein, including but not limited to any
+ patent rights that may be infringed by your derivative works or by other
+ works in which the Apple Software may be incorporated.
+
+ The Apple Software is provided by Apple on an "AS IS" basis. APPLE
+ MAKES NO WARRANTIES, EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION
+ THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY AND FITNESS
+ FOR A PARTICULAR PURPOSE, REGARDING THE APPLE SOFTWARE OR ITS USE AND
+ OPERATION ALONE OR IN COMBINATION WITH YOUR PRODUCTS.
+
+ IN NO EVENT SHALL APPLE BE LIABLE FOR ANY SPECIAL, INDIRECT, INCIDENTAL
+ OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+ INTERRUPTION) ARISING IN ANY WAY OUT OF THE USE, REPRODUCTION,
+ MODIFICATION AND/OR DISTRIBUTION OF THE APPLE SOFTWARE, HOWEVER CAUSED
+ AND WHETHER UNDER THEORY OF CONTRACT, TORT (INCLUDING NEGLIGENCE),
+ STRICT LIABILITY OR OTHERWISE, EVEN IF APPLE HAS BEEN ADVISED OF THE
+ POSSIBILITY OF SUCH DAMAGE.
+
+ Third-party software acknowledgments are contained in the file named ACKNOWLEDGMENTS.
README.md ADDED
@@ -0,0 +1,54 @@
+ ---
+ license: other
+ tags:
+ - non-commercial
+ - text-generation
+ - flow-matching
+ datasets:
+ - cerebras/SlimPajama-627B
+ ---
+
+ # DFM
+
+ ## Summary
+ `DFM` is a continued-pretraining checkpoint based on Apple's fs-dfm weights. It was trained with the Flow Matching codebase and is released for research and non-commercial use only.
+
+ Base checkpoint (external, not on HF):
+ ```
+ https://ml-site.cdn-apple.com/models/fs-dfm/checkpoint.pth
+ ```
+
+ ## Training
+ - Continued pretraining from Apple's fs-dfm checkpoint
+ - Dataset: SlimPajama-627B
+ - Steps: 250,000
+ - Global batch size: 256
+
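The training hyperparameters above imply a rough token budget. A back-of-the-envelope count, assuming every step consumes full 1024-token sequences (the `sequence_length` in `config.json`; packing details are an assumption, not stated in this card):

```python
# Hypothetical token-budget arithmetic from the training hyperparameters above.
steps = 250_000
global_batch = 256
seq_len = 1024  # sequence_length from config.json

tokens_per_step = global_batch * seq_len   # 262,144 tokens per optimizer step
total_tokens = steps * tokens_per_step
print(f"{total_tokens:,} tokens")          # 65,536,000,000 (~65.5B tokens)
```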
+ ## License
+ Research/non-commercial use only. This repository is governed by the Apple Software License (see `LICENSE`) and includes non-commercial restrictions inherited from Flow Matching (CC BY-NC 4.0). See `ACKNOWLEDGMENTS` for third-party notices.
+
+ ## Intended Use
+ Research and non-commercial use only.
+
+ ## Limitations
+ Commercial use is not permitted. Dataset-specific licensing constraints apply to SlimPajama's underlying sources.
+
+ ## Usage
+ ### Hugging Face (trust_remote_code)
+ This repo provides `configuration_dfm.py` and `modeling_dfm.py` for HF loading with `trust_remote_code=True`.
+
+ Example:
+ ```python
+ from transformers import AutoConfig, AutoModel
+
+ config = AutoConfig.from_pretrained(".", trust_remote_code=True)
+ model = AutoModel.from_pretrained(".", trust_remote_code=True)
+ ```
+
+ Note:
+ - This model expects `x_t` and `time` inputs (flow-matching style), not GPT-style autoregressive inputs.
+
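To make those inputs concrete, here is a sketch of how they might be constructed (shapes taken from `config.json`; the fully-masked starting state and the `[0, 1)` time convention are assumptions about the flow-matching setup, not statements from this card):

```python
import torch

batch_size, seq_len = 2, 1024   # seq_len = sequence_length from config.json
vocab_size = 50257
mask_token_id = 50257           # one id past the GPT-2 vocab (see config.json)

# With source_distribution = "mask", a fully masked source sample is legal input.
x_t = torch.full((batch_size, seq_len), mask_token_id, dtype=torch.long)
time = torch.rand(batch_size)   # one scalar time per batch element

# logits = model(x_t=x_t, time=time)
# Expected output shape: (batch_size, seq_len, vocab_size + 1), since the
# output layer adds one channel for the mask token.
print(x_t.shape, time.shape)
```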
+ This release includes model-only weights (`model.safetensors`) for inference/forward passes. Full training/eval/sampling code is available in the original project: `https://github.com/apple/ml-fs-dfm`.
+
+ ## Acknowledgments
+ This model is derived from Apple's fs-dfm checkpoint and follows the original Apple license terms. The original project is at `https://github.com/apple/ml-fs-dfm`. See `ACKNOWLEDGMENTS` for third-party attributions and licensing.
config.json ADDED
@@ -0,0 +1,31 @@
+ {
+   "model_type": "dfm",
+   "architectures": [
+     "DFMModel"
+   ],
+   "auto_map": {
+     "AutoConfig": "configuration_dfm.DFMConfig",
+     "AutoModel": "modeling_dfm.DFMModel"
+   },
+   "vocab_size": 50257,
+   "hidden_size": 2048,
+   "cond_dim": 256,
+   "num_hidden_layers": 21,
+   "n_blocks": 21,
+   "num_attention_heads": 32,
+   "n_heads": 32,
+   "max_position_embeddings": 1024,
+   "sequence_length": 1024,
+   "dropout": 0.1,
+   "rotary_dim": 64,
+   "source_distribution": "mask",
+   "flow_scheduler_type": "polynomial",
+   "flow_exponent": 1.0,
+   "flow_loss_function": "generalized_kl",
+   "sampling_steps": 1024,
+   "bos_token_id": 50256,
+   "eos_token_id": 50256,
+   "mask_token_id": 50257,
+   "tokenizer_name": "gpt2",
+   "dtype": "bfloat16"
+ }
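Note that `mask_token_id` equals `vocab_size` here: with `source_distribution: "mask"`, the modeling code reserves one extra embedding row, so the mask id indexes that appended row. A plain-Python sketch of that relationship, mirroring the `add_token` logic in `modeling_dfm.py`:

```python
config = {
    "vocab_size": 50257,
    "mask_token_id": 50257,
    "source_distribution": "mask",
}

# One extra embedding row is reserved when the source distribution is "mask",
# matching `add_token = 1 if masked else 0` in modeling_dfm.py.
add_token = 1 if config["source_distribution"] == "mask" else 0
embedding_rows = config["vocab_size"] + add_token

print(embedding_rows)                             # 50258
print(config["mask_token_id"] < embedding_rows)   # mask id indexes the last row
```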
configuration_dfm.py ADDED
@@ -0,0 +1,47 @@
+ from transformers import PretrainedConfig
+
+
+ class DFMConfig(PretrainedConfig):
+     model_type = "dfm"
+
+     def __init__(
+         self,
+         vocab_size=50257,
+         hidden_size=2048,
+         cond_dim=256,
+         n_blocks=21,
+         n_heads=32,
+         dropout=0.1,
+         sequence_length=1024,
+         source_distribution="mask",
+         flow_scheduler_type="polynomial",
+         flow_exponent=1.0,
+         flow_loss_function="generalized_kl",
+         sampling_steps=1024,
+         bos_token_id=50256,
+         eos_token_id=50256,
+         mask_token_id=50257,
+         tokenizer_name="gpt2",
+         dtype="bfloat16",
+         **kwargs,
+     ):
+         super().__init__(
+             bos_token_id=bos_token_id,
+             eos_token_id=eos_token_id,
+             **kwargs,
+         )
+         self.vocab_size = vocab_size
+         self.hidden_size = hidden_size
+         self.cond_dim = cond_dim
+         self.n_blocks = n_blocks
+         self.n_heads = n_heads
+         self.dropout = dropout
+         self.sequence_length = sequence_length
+         self.source_distribution = source_distribution
+         self.flow_scheduler_type = flow_scheduler_type
+         self.flow_exponent = flow_exponent
+         self.flow_loss_function = flow_loss_function
+         self.sampling_steps = sampling_steps
+         self.mask_token_id = mask_token_id
+         self.tokenizer_name = tokenizer_name
+         self.dtype = dtype
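`flow_scheduler_type="polynomial"` and `flow_exponent=1.0` are stored but not interpreted in this file. Assuming the conventional polynomial convex schedule kappa(t) = t**exponent used by common flow-matching implementations (an assumption; the actual scheduler lives in the original training code), an exponent of 1.0 reduces to linear interpolation between source and data:

```python
def polynomial_kappa(t: float, exponent: float = 1.0) -> float:
    """Assumed polynomial mixing schedule: fraction of 'data' at time t."""
    return t ** exponent

# With exponent 1.0 the schedule is linear: 0 at t=0, 1 at t=1.
print([polynomial_kappa(t) for t in (0.0, 0.5, 1.0)])  # [0.0, 0.5, 1.0]
```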
merges.txt ADDED
The diff for this file is too large to render. See raw diff
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d80bdb27307852691fa4eb6eddd880307cb07a12cd30458bc051c8ff23662291
+ size 5322735224
modeling_dfm.py ADDED
+ import math
+ from types import SimpleNamespace
+ from typing import Optional, Tuple
+
+ import torch
+ import torch.nn.functional as F
+ from einops import rearrange, repeat
+ from torch import Tensor, nn
+ from transformers import PreTrainedModel
+
+ try:
+     import flash_attn
+ except ImportError:
+     flash_attn = None
+
+ try:
+     import flash_attn_interface
+ except ImportError:
+     flash_attn_interface = None
+
+ from configuration_dfm import DFMConfig
+
+
+ class Rotary(torch.nn.Module):
+     """
+     From: https://github.com/louaaron/Score-Entropy-Discrete-Diffusion
+     """
+
+     def __init__(self, dim: int, base: int = 10_000):
+         super().__init__()
+         inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
+         self.register_buffer("inv_freq", inv_freq)
+         self.seq_len_cached = None
+         self.cos_cached = None
+         self.sin_cached = None
+
+     def forward(self, x: Tensor, seq_dim: int = 1) -> Tuple[Tensor, Tensor]:
+         seq_len = x.shape[seq_dim]
+         if seq_len != self.seq_len_cached:
+             self.seq_len_cached = seq_len
+             t = torch.arange(x.shape[seq_dim], device=x.device).type_as(self.inv_freq)
+             freqs = torch.einsum("i,j->ij", t, self.inv_freq.clone())
+             emb = torch.cat((freqs, freqs), dim=-1).to(x.device)
+
+             # dims are: batch, seq_len, qkv, head, dim
+             self.cos_cached = emb.cos()[None, :, None, None, :].repeat(1, 1, 3, 1, 1)
+             self.sin_cached = emb.sin()[None, :, None, None, :].repeat(1, 1, 3, 1, 1)
+
+             # This makes the transformation on v an identity.
+             self.cos_cached[:, :, 2, :, :].fill_(1.0)
+             self.sin_cached[:, :, 2, :, :].fill_(0.0)
+
+         return self.cos_cached, self.sin_cached
+
+
+ def rotate_half(x: Tensor) -> Tensor:
+     x1, x2 = x[..., : x.shape[-1] // 2], x[..., x.shape[-1] // 2 :]
+     return torch.cat((-x2, x1), dim=-1)
+
+
+ def apply_rotary_emb_torch(x, cos, sin, interleaved=False):
+     """
+     From: https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/layers/rotary.py#L20
+     """
+     cos = cos[0, :, 0, 0, : cos.shape[-1] // 2]
+     sin = sin[0, :, 0, 0, : sin.shape[-1] // 2]
+
+     ro_dim = cos.shape[-1] * 2
+     assert ro_dim <= x.shape[-1]
+     cos = repeat(
+         cos, "... d -> ... 1 (2 d)" if not interleaved else "... d -> ... 1 (d 2)"
+     )
+     sin = repeat(
+         sin, "... d -> ... 1 (2 d)" if not interleaved else "... d -> ... 1 (d 2)"
+     )
+
+     return x[..., :ro_dim] * cos + rotate_half(x[..., :ro_dim]) * sin
+
+
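For a single two-dimensional pair, `x * cos + rotate_half(x) * sin` is exactly a 2-D rotation, which is the core of rotary embeddings. A stdlib sanity check of that identity, using a list-based stand-in for the tensor `rotate_half` above:

```python
import math

def rotate_half(x):
    # List-based analogue of the tensor rotate_half above: (x1, x2) -> (-x2, x1).
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    return [-v for v in x2] + list(x1)

theta = 0.3
x = [1.0, 2.0]
cos_t, sin_t = math.cos(theta), math.sin(theta)

rotated = [xi * cos_t + ri * sin_t for xi, ri in zip(x, rotate_half(x))]
# Standard 2-D rotation of the pair (x[0], x[1]) by theta:
expected = [x[0] * cos_t - x[1] * sin_t, x[1] * cos_t + x[0] * sin_t]
print(rotated, expected)  # the two lists agree
```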
+ def bias_dropout_add_scale(
+     x: Tensor, scale: Tensor, residual: Optional[Tensor], prob: float, training: bool
+ ) -> Tensor:
+     return residual + scale * F.dropout(x, p=prob, training=training)
+
+
+ def modulate(x: Tensor, shift: Tensor, scale: Tensor) -> Tensor:
+     return x * (1 + scale) + shift
+
+
+ class LayerNorm(nn.Module):
+     def __init__(self, dim: int):
+         super().__init__()
+         self.weight = nn.Parameter(torch.ones([dim]))
+         self.dim = dim
+
+     def forward(self, x: Tensor) -> Tensor:
+         with torch.amp.autocast("cuda", enabled=False):
+             x = F.layer_norm(x.float(), [self.dim])
+         return x * self.weight[None, None, :]
+
+
+ class TimestepEmbedder(nn.Module):
+     """
+     Embeds scalar timesteps into vector representations.
+     """
+
+     def __init__(self, hidden_size: int, frequency_embedding_size: int = 256):
+         super().__init__()
+         self.mlp = nn.Sequential(
+             nn.Linear(frequency_embedding_size, hidden_size, bias=True),
+             nn.SiLU(),
+             nn.Linear(hidden_size, hidden_size, bias=True),
+         )
+         self.frequency_embedding_size = frequency_embedding_size
+
+     @staticmethod
+     def timestep_embedding(time: Tensor, dim: int, max_period: int = 10000) -> Tensor:
+         """
+         Create sinusoidal timestep embeddings.
+         :param time: a 1-D Tensor of N indices, one per batch element.
+             These may be fractional.
+         :param dim: the dimension of the output.
+         :param max_period: controls the minimum frequency of the embeddings.
+         :return: an (N, D) Tensor of positional embeddings.
+         """
+         half = dim // 2
+         freqs = torch.exp(
+             -math.log(max_period)
+             * torch.arange(start=0, end=half, dtype=torch.float32)
+             / half
+         ).to(device=time.device)
+         args = time[:, None].float() * freqs[None]
+         embedding = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
+         if dim % 2:
+             embedding = torch.cat(
+                 [embedding, torch.zeros_like(embedding[:, :1])], dim=-1
+             )
+         return embedding
+
+     def forward(self, time: Tensor) -> Tensor:
+         t_freq = self.timestep_embedding(time=time, dim=self.frequency_embedding_size)
+         t_emb = self.mlp(t_freq)
+         return t_emb
+
+
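The sinusoidal embedding above has a simple scalar form. A stdlib sketch for a single timestep, mirroring the cos-then-sin concatenation of `timestep_embedding` without torch:

```python
import math

def timestep_embedding_scalar(t: float, dim: int, max_period: int = 10000):
    # Mirrors TimestepEmbedder.timestep_embedding for a single scalar t:
    # frequencies decay geometrically from 1 down toward 1/max_period,
    # then cos terms are concatenated before sin terms.
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    args = [t * f for f in freqs]
    return [math.cos(a) for a in args] + [math.sin(a) for a in args]

emb = timestep_embedding_scalar(0.5, dim=8)
print(len(emb))   # 8
print(emb[0])     # cos(0.5), since the first frequency is exactly 1.0
```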
+ class DDiTBlock(nn.Module):
+     def __init__(
+         self,
+         dim: int,
+         n_heads: int,
+         cond_dim: int,
+         mlp_ratio: int = 4,
+         dropout: float = 0.1,
+     ):
+         super().__init__()
+         assert dim % n_heads == 0, "dim must be divisible by n_heads"
+
+         self.n_heads = n_heads
+         self.dim = dim
+         self.dropout = dropout
+
+         self.head_dim = self.dim // self.n_heads
+
+         self.norm1 = LayerNorm(dim=dim)
+
+         self.qw = nn.Linear(dim, dim, bias=False)
+         self.kw = nn.Linear(dim, dim, bias=False)
+         self.vw = nn.Linear(dim, dim, bias=False)
+
+         self.attn_out = nn.Linear(dim, dim, bias=False)
+         self.dropout1 = nn.Dropout(dropout)
+
+         self.norm2 = LayerNorm(dim=dim)
+         self.mlp = nn.Sequential(
+             nn.Linear(dim, mlp_ratio * dim, bias=True),
+             nn.GELU(approximate="tanh"),
+             nn.Linear(mlp_ratio * dim, dim, bias=True),
+         )
+
+         self.adaLN_modulation = nn.Linear(cond_dim, 6 * dim, bias=True)
+         self.adaLN_modulation.weight.data.zero_()
+         self.adaLN_modulation.bias.data.zero_()
+
+     def forward(self, x: Tensor, rotary_cos_sin: Tensor, c: Tensor) -> Tensor:
+         batch_size, seq_len = x.shape[0], x.shape[1]
+
+         (
+             shift_msa,
+             scale_msa,
+             gate_msa,
+             shift_mlp,
+             scale_mlp,
+             gate_mlp,
+         ) = self.adaLN_modulation(c)[:, None].chunk(6, dim=2)
+
+         x_skip = x
+         x = modulate(x=self.norm1(x), shift=shift_msa, scale=scale_msa)
+
+         q = self.qw(x)
+         k = self.kw(x)
+         v = self.vw(x)
+
+         q, k, v = (
+             item.view(batch_size, seq_len, self.n_heads, self.head_dim)
+             for item in (q, k, v)
+         )
+
+         with torch.amp.autocast("cuda", enabled=False):
+             cos, sin = rotary_cos_sin
+             original_dtype = q.dtype
+
+             q = apply_rotary_emb_torch(
+                 x=q.float(), cos=cos.float(), sin=sin.float()
+             ).to(original_dtype)
+             k = apply_rotary_emb_torch(
+                 x=k.float(), cos=cos.float(), sin=sin.float()
+             ).to(original_dtype)
+
+         use_flash_attn = (
+             flash_attn_interface is not None or flash_attn is not None
+         ) and q.is_cuda
+         if use_flash_attn:
+             qkv = torch.stack((q, k, v), dim=2)
+             if flash_attn_interface is not None:
+                 x = flash_attn_interface.flash_attn_qkvpacked_func(qkv, causal=False)
+             else:
+                 x = flash_attn.flash_attn_qkvpacked_func(qkv, 0.0, causal=False)
+             x = rearrange(x, "b s h d -> b s (h d)", b=batch_size)
+         else:
+             q, k, v = (item.transpose(1, 2) for item in (q, k, v))
+             x = F.scaled_dot_product_attention(query=q, key=k, value=v)
+             x = rearrange(x, "b h s d -> b s (h d)", b=batch_size)
+         x = bias_dropout_add_scale(
+             x=self.attn_out(x),
+             scale=gate_msa,
+             residual=x_skip,
+             prob=self.dropout,
+             training=self.training,
+         )
+         x = bias_dropout_add_scale(
+             x=self.mlp(modulate(x=self.norm2(x), shift=shift_mlp, scale=scale_mlp)),
+             scale=gate_mlp,
+             residual=x,
+             prob=self.dropout,
+             training=self.training,
+         )
+
+         return x
+
+
+ class DDitFinalLayer(nn.Module):
+     def __init__(self, hidden_size: int, out_channels: int, cond_dim: int):
+         super().__init__()
+         self.norm_final = LayerNorm(hidden_size)
+         self.linear = nn.Linear(hidden_size, out_channels)
+         self.linear.weight.data.zero_()
+         self.linear.bias.data.zero_()
+
+         self.adaLN_modulation = nn.Linear(cond_dim, 2 * hidden_size, bias=True)
+         self.adaLN_modulation.weight.data.zero_()
+         self.adaLN_modulation.bias.data.zero_()
+
+     def forward(self, x: Tensor, c: Tensor) -> Tensor:
+         shift, scale = self.adaLN_modulation(c)[:, None].chunk(2, dim=2)
+         x = modulate(x=self.norm_final(x), shift=shift, scale=scale)
+         x = self.linear(x)
+         return x
+
+
+ class Transformer(nn.Module):
+     def __init__(self, vocab_size: int, masked: bool, config):
+         super().__init__()
+
+         if isinstance(config, dict):
+             config = SimpleNamespace(**config)
+
+         self.config = config
+         self.vocab_size = vocab_size
+
+         add_token = 1 if masked else 0
+
+         self.vocab_embed = nn.Embedding(self.vocab_size + add_token, config.hidden_size)
+
+         self.time_embedding = TimestepEmbedder(hidden_size=config.cond_dim)
+         self.rotary_emb = Rotary(dim=config.hidden_size // config.n_heads)
+
+         self.blocks = nn.ModuleList(
+             [
+                 DDiTBlock(
+                     dim=config.hidden_size,
+                     n_heads=config.n_heads,
+                     cond_dim=config.cond_dim,
+                     dropout=config.dropout,
+                 )
+                 for _ in range(config.n_blocks)
+             ]
+         )
+
+         self.output_layer = DDitFinalLayer(
+             hidden_size=config.hidden_size,
+             out_channels=vocab_size + add_token,
+             cond_dim=config.cond_dim,
+         )
+
+     def forward(self, x_t: Tensor, time: Tensor) -> Tensor:
+         x = self.vocab_embed(x_t)
+         c = F.silu(self.time_embedding(time=time))
+
+         rotary_cos_sin = self.rotary_emb(x=x)
+
+         with torch.amp.autocast("cuda", dtype=torch.bfloat16):
+             for i in range(len(self.blocks)):
+                 x = self.blocks[i](x=x, rotary_cos_sin=rotary_cos_sin, c=c)
+
+             x = self.output_layer(x=x, c=c)
+
+         return x
+
+
+ class DFMModel(PreTrainedModel):
+     config_class = DFMConfig
+     base_model_prefix = "model"
+
+     def __init__(self, config: DFMConfig):
+         super().__init__(config)
+         masked = config.source_distribution == "mask"
+         self.model = Transformer(
+             vocab_size=config.vocab_size,
+             masked=masked,
+             config={
+                 "hidden_size": config.hidden_size,
+                 "cond_dim": config.cond_dim,
+                 "length": config.sequence_length,
+                 "n_blocks": config.n_blocks,
+                 "n_heads": config.n_heads,
+                 "dropout": config.dropout,
+                 "compile": False,
+             },
+         )
+         self.post_init()
+
+     def forward(
+         self,
+         x_t: torch.Tensor,
+         time: torch.Tensor,
+         **kwargs,
+     ) -> torch.Tensor:
+         return self.model(x_t=x_t, time=time)
+
+     @classmethod
+     def _load_pretrained_model(
+         cls,
+         model,
+         state_dict,
+         *args,
+         **kwargs,
+     ):
+         if state_dict is not None:
+             if "model" in state_dict and isinstance(state_dict["model"], dict):
+                 state_dict = state_dict["model"]
+             if state_dict and not any(
+                 k.startswith("model.") for k in state_dict.keys()
+             ):
+                 state_dict = {f"model.{k}": v for k, v in state_dict.items()}
+         return super()._load_pretrained_model(
+             model,
+             state_dict,
+             *args,
+             **kwargs,
+         )
special_tokens_map.json ADDED
@@ -0,0 +1,5 @@
+ {
+   "bos_token": "<|endoftext|>",
+   "eos_token": "<|endoftext|>",
+   "unk_token": "<|endoftext|>"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,20 @@
+ {
+   "add_prefix_space": false,
+   "added_tokens_decoder": {
+     "50256": {
+       "content": "<|endoftext|>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<|endoftext|>",
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "<|endoftext|>",
+   "extra_special_tokens": {},
+   "model_max_length": 1024,
+   "tokenizer_class": "GPT2Tokenizer",
+   "unk_token": "<|endoftext|>"
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff