BiliSakura committed on
Commit b67e8f3 · verified · 1 Parent(s): b80fa12

Upload folder using huggingface_hub

Files changed (39)
  1. .gitattributes +1 -0
  2. MiniT2I-B-16/demo.png +3 -0
  3. MiniT2I-B-16/model_index.json +26 -0
  4. MiniT2I-B-16/pipeline.py +444 -0
  5. MiniT2I-B-16/scheduler/scheduler_config.json +7 -0
  6. MiniT2I-B-16/text_encoder/README.md +276 -0
  7. MiniT2I-B-16/text_encoder/config.json +28 -0
  8. MiniT2I-B-16/text_encoder/generation_config.json +7 -0
  9. MiniT2I-B-16/text_encoder/model.safetensors +3 -0
  10. MiniT2I-B-16/text_encoder/special_tokens_map.json +107 -0
  11. MiniT2I-B-16/text_encoder/spiece.model +3 -0
  12. MiniT2I-B-16/text_encoder/tokenizer.json +0 -0
  13. MiniT2I-B-16/text_encoder/tokenizer_config.json +113 -0
  14. MiniT2I-B-16/tokenizer/special_tokens_map.json +107 -0
  15. MiniT2I-B-16/tokenizer/spiece.model +3 -0
  16. MiniT2I-B-16/tokenizer/tokenizer.json +0 -0
  17. MiniT2I-B-16/tokenizer/tokenizer_config.json +113 -0
  18. MiniT2I-B-16/transformer/config.json +27 -0
  19. MiniT2I-B-16/transformer/diffusion_pytorch_model.safetensors +3 -0
  20. MiniT2I-B-16/transformer/transformer_minit2i.py +446 -0
  21. MiniT2I-L-16/model_index.json +26 -0
  22. MiniT2I-L-16/pipeline.py +444 -0
  23. MiniT2I-L-16/scheduler/scheduler_config.json +7 -0
  24. MiniT2I-L-16/text_encoder/README.md +276 -0
  25. MiniT2I-L-16/text_encoder/config.json +28 -0
  26. MiniT2I-L-16/text_encoder/generation_config.json +7 -0
  27. MiniT2I-L-16/text_encoder/model.safetensors +3 -0
  28. MiniT2I-L-16/text_encoder/special_tokens_map.json +107 -0
  29. MiniT2I-L-16/text_encoder/spiece.model +3 -0
  30. MiniT2I-L-16/text_encoder/tokenizer.json +0 -0
  31. MiniT2I-L-16/text_encoder/tokenizer_config.json +113 -0
  32. MiniT2I-L-16/tokenizer/special_tokens_map.json +107 -0
  33. MiniT2I-L-16/tokenizer/spiece.model +3 -0
  34. MiniT2I-L-16/tokenizer/tokenizer.json +0 -0
  35. MiniT2I-L-16/tokenizer/tokenizer_config.json +113 -0
  36. MiniT2I-L-16/transformer/config.json +27 -0
  37. MiniT2I-L-16/transformer/diffusion_pytorch_model.safetensors +3 -0
  38. MiniT2I-L-16/transformer/transformer_minit2i.py +446 -0
  39. README.md +156 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ MiniT2I-B-16/demo.png filter=lfs diff=lfs merge=lfs -text
MiniT2I-B-16/demo.png ADDED

Git LFS Details

  • SHA256: 5f7ef1590783708ce7d2ece800ad0d48e76b71260ed5b818cc999e5c2a5e0952
  • Pointer size: 131 Bytes
  • Size of remote file: 489 kB
MiniT2I-B-16/model_index.json ADDED
@@ -0,0 +1,26 @@
1
+ {
2
+ "_class_name": [
3
+ "pipeline",
4
+ "MiniT2ITextToImagePipeline"
5
+ ],
6
+ "_diffusers_version": "0.32.0",
7
+ "default_num_inference_steps": 100,
8
+ "model_type": "b16",
9
+ "recommended_guidance_scale": 2.5,
10
+ "scheduler": [
11
+ "diffusers",
12
+ "FlowMatchEulerDiscreteScheduler"
13
+ ],
14
+ "text_encoder": [
15
+ "transformers",
16
+ "T5EncoderModel"
17
+ ],
18
+ "tokenizer": [
19
+ "transformers",
20
+ "T5Tokenizer"
21
+ ],
22
+ "transformer": [
23
+ "transformer_minit2i",
24
+ "MiniT2IMMJiTModel"
25
+ ]
26
+ }
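As a reading aid for the `model_index.json` above, here is a minimal sketch of how its keys fall into two groups: two-element lists that name a source and a class for each component, and scalar settings such as `model_type`. The helper `split_model_index` is hypothetical and written only for illustration; it is not part of diffusers or of this repository, and the real loader does considerably more.

```python
import json

# Copy of the model_index.json content shown in the diff above.
MODEL_INDEX_JSON = """
{
  "_class_name": ["pipeline", "MiniT2ITextToImagePipeline"],
  "_diffusers_version": "0.32.0",
  "default_num_inference_steps": 100,
  "model_type": "b16",
  "recommended_guidance_scale": 2.5,
  "scheduler": ["diffusers", "FlowMatchEulerDiscreteScheduler"],
  "text_encoder": ["transformers", "T5EncoderModel"],
  "tokenizer": ["transformers", "T5Tokenizer"],
  "transformer": ["transformer_minit2i", "MiniT2IMMJiTModel"]
}
"""


def split_model_index(index):
    """Hypothetical helper: separate [source, class] pairs from scalar settings."""
    components, settings = {}, {}
    for key, value in index.items():
        if key.startswith("_"):
            continue  # metadata such as _class_name and _diffusers_version
        if isinstance(value, list) and len(value) == 2:
            components[key] = tuple(value)  # (library or module name, class name)
        else:
            settings[key] = value
    return components, settings


components, settings = split_model_index(json.loads(MODEL_INDEX_JSON))
for name, (source, class_name) in sorted(components.items()):
    print(f"{name}: {class_name} from {source}")
```

Note that the `transformer` entry points at `transformer_minit2i`, which matches the module file added under `transformer/` in this commit rather than an installed library, which is presumably why the pipeline docstring asks for `trust_remote_code=True`.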
MiniT2I-B-16/pipeline.py ADDED
@@ -0,0 +1,444 @@
1
+ """Hub custom pipeline: MiniT2ITextToImagePipeline.
2
+ Load with native Hugging Face diffusers and trust_remote_code=True.
3
+ """
4
+
5
+ from __future__ import annotations
6
+
7
+ from diffusers.image_processor import VaeImageProcessor
8
+ from diffusers.pipelines.pipeline_utils import DiffusionPipeline, ImagePipelineOutput
9
+ from diffusers.schedulers import FlowMatchEulerDiscreteScheduler
10
+ from diffusers.schedulers.scheduling_utils import KarrasDiffusionSchedulers
11
+ from diffusers.utils import BaseOutput
12
+ from diffusers.utils.torch_utils import randn_tensor
13
+ # Copyright 2025 The HuggingFace Team. All rights reserved.
14
+ #
15
+ # Licensed under the Apache License, Version 2.0 (the "License");
16
+ # you may not use this file except in compliance with the License.
17
+ # You may obtain a copy of the License at
18
+ #
19
+ # http://www.apache.org/licenses/LICENSE-2.0
20
+ #
21
+ # Unless required by applicable law or agreed to in writing, software
22
+ # distributed under the License is distributed on an "AS IS" BASIS,
23
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
24
+ # See the License for the specific language governing permissions and
25
+ # limitations under the License.
26
+ import inspect
27
+ import json
28
+ import os
29
+ from pathlib import Path
30
+ from typing import Any, Dict, List, Optional, Tuple, Union
31
+
32
+ os.environ.setdefault("USE_FLAX", "0")
33
+ os.environ.setdefault("TRANSFORMERS_NO_FLAX", "1")
34
+
35
+ import torch
36
+ from huggingface_hub import snapshot_download
37
+ from PIL import Image
38
+ from transformers import AutoTokenizer, T5EncoderModel
39
+ from transformers import logging as transformers_logging
40
+
41
+ transformers_logging.set_verbosity_error()
42
+
43
+ DEFAULT_NUM_INFERENCE_STEPS = 100
44
+ NOISE_INIT_SCALE = 2.0
45
+
46
+ EXAMPLE_DOC_STRING = """
47
+ Examples:
48
+ ```py
49
+ >>> from pathlib import Path
50
+ >>> import torch
51
+ >>> from diffusers import DiffusionPipeline, FlowMatchEulerDiscreteScheduler
52
+
53
+ >>> model_dir = Path("./minit2i-diffusers").resolve()
54
+ >>> pipe = DiffusionPipeline.from_pretrained(
55
+ ... str(model_dir),
56
+ ... local_files_only=True,
57
+ ... custom_pipeline=str(model_dir / "pipeline.py"),
58
+ ... trust_remote_code=True,
59
+ ... torch_dtype=torch.bfloat16,
60
+ ... )
61
+ >>> pipe.to("cuda")
62
+ >>> pipe.scheduler = FlowMatchEulerDiscreteScheduler.from_config(pipe.scheduler.config)
63
+
64
+ >>> generator = torch.Generator(device="cuda").manual_seed(42)
65
+ >>> image = pipe(
66
+ ... "a cinematic portrait of a robot musician",
67
+ ... num_inference_steps=100,
68
+ ... guidance_scale=6.0,
69
+ ... generator=generator,
70
+ ... ).images[0]
71
+ >>> image.save("demo.png")
72
+ ```
73
+ """
74
+
75
+ MODEL_ALIASES: Dict[str, str] = {
76
+ "b": "minit2i-b-16",
77
+ "b16": "minit2i-b-16",
78
+ "b-16": "minit2i-b-16",
79
+ "base": "minit2i-b-16",
80
+ "minit2i-b16": "minit2i-b-16",
81
+ "minit2i-b-16": "minit2i-b-16",
82
+ "minit2i-b/16": "minit2i-b-16",
83
+ "l": "minit2i-l-16",
84
+ "l16": "minit2i-l-16",
85
+ "l-16": "minit2i-l-16",
86
+ "large": "minit2i-l-16",
87
+ "minit2i-l16": "minit2i-l-16",
88
+ "minit2i-l-16": "minit2i-l-16",
89
+ "minit2i-l/16": "minit2i-l-16",
90
+ }
91
+
92
+ def resolve_model_type(model_type: str) -> str:
93
+ key = model_type.lower().replace("_", "-")
94
+ if key not in MODEL_ALIASES:
95
+ choices = ", ".join(sorted(set(MODEL_ALIASES)))
96
+ raise ValueError(f"Unknown model_type={model_type!r}. Expected one of: {choices}")
97
+ return MODEL_ALIASES[key]
98
+
99
+ class MiniT2ITextToImagePipeline(DiffusionPipeline):
100
+ r"""
101
+ Text-to-image pipeline for MiniT2I pixel-space flow matching.
102
+
103
+ Parameters:
104
+ transformer ([`MiniT2IMMJiTModel`]):
105
+ MiniT2I MM-JiT transformer that predicts flow-matching velocity in pixel space.
106
+ scheduler ([`FlowMatchEulerDiscreteScheduler`]):
107
+ Flow-matching Euler scheduler. Other [`KarrasDiffusionSchedulers`] can be swapped at inference time.
108
+ tokenizer ([`AutoTokenizer`], *optional*):
109
+ Tokenizer for the text encoder.
110
+ text_encoder ([`T5EncoderModel`], *optional*):
111
+ Text encoder used to embed prompts.
112
+ """
113
+
114
+ model_cpu_offload_seq = "text_encoder->transformer"
115
+ _optional_components = ["tokenizer", "text_encoder"]
116
+
117
+ def __init__(
118
+ self,
119
+ transformer,
120
+ scheduler,
121
+ tokenizer=None,
122
+ text_encoder=None,
123
+ text_encoder_name: str = "google/flan-t5-large",
124
+ model_type: str = "b16",
125
+ repo_id_or_path: Optional[str] = None,
126
+ default_num_inference_steps: int = DEFAULT_NUM_INFERENCE_STEPS,
127
+ ):
128
+ super().__init__()
129
+ if scheduler is None:
130
+ scheduler = self._default_inference_scheduler()
131
+ self.register_modules(
132
+ transformer=transformer,
133
+ scheduler=scheduler,
134
+ tokenizer=tokenizer,
135
+ text_encoder=text_encoder,
136
+ )
137
+ self.register_to_config(
138
+ text_encoder_name=text_encoder_name,
139
+ model_type=model_type,
140
+ repo_id_or_path=repo_id_or_path,
141
+ default_num_inference_steps=int(default_num_inference_steps),
142
+ )
143
+ self._variant_transformers: Dict[str, MiniT2IMMJiTModel] = {}
144
+ self._active_model_type = resolve_model_type(model_type)
145
+
146
+ @staticmethod
147
+ def _default_inference_scheduler() -> FlowMatchEulerDiscreteScheduler:
148
+ return FlowMatchEulerDiscreteScheduler(
149
+ num_train_timesteps=1000,
150
+ shift=1.0,
151
+ stochastic_sampling=False,
152
+ )
153
+
154
+ @classmethod
155
+ def _load_scheduler_from_dir(
156
+ cls,
157
+ scheduler_dir: Path,
158
+ model_kwargs: Dict[str, Any],
159
+ ) -> Tuple[KarrasDiffusionSchedulers, int]:
160
+ config_path = scheduler_dir / "scheduler_config.json"
161
+ if not config_path.exists():
162
+ return cls._default_inference_scheduler(), DEFAULT_NUM_INFERENCE_STEPS
163
+
164
+ config = json.loads(config_path.read_text(encoding="utf-8"))
165
+ class_name = config.get("_class_name", "")
166
+ default_steps = int(config.get("num_inference_steps", DEFAULT_NUM_INFERENCE_STEPS))
167
+
168
+ if class_name == "MiniT2IFlowMatchScheduler":
169
+ return cls._default_inference_scheduler(), default_steps
170
+
171
+ from diffusers import schedulers as schedulers_pkg  # look up scheduler classes by name
172
+ if hasattr(schedulers_pkg, class_name):
173
+ scheduler_cls = getattr(schedulers_pkg, class_name)
174
+ return scheduler_cls.from_pretrained(str(scheduler_dir), **model_kwargs), default_steps
175
+
176
+ return cls._default_inference_scheduler(), default_steps
177
+
178
+ @staticmethod
179
+ def _resolve_transformer_path(root: Path, variant_dir: str) -> Path:
180
+ variant_transformer = root / variant_dir / "transformer"
181
+ if variant_transformer.exists():
182
+ return variant_transformer
183
+ root_transformer = root / "transformer"
184
+ if root_transformer.exists():
185
+ return root_transformer
186
+ raise FileNotFoundError(
187
+ f"Could not find transformer weights under {root}. "
188
+ f"Tried {variant_transformer} and {root_transformer}."
189
+ )
190
+
191
+ def _get_transformer(
192
+ self,
193
+ model_type: Optional[str],
194
+ repo_id_or_path: Optional[str],
195
+ torch_dtype: Optional[torch.dtype] = None,
196
+ variant: Optional[str] = None,
197
+ ) -> MiniT2IMMJiTModel:
198
+ active_type = resolve_model_type(model_type or self.config.model_type)
199
+ if active_type == self._active_model_type and self.transformer is not None:
200
+ return self.transformer
201
+ if active_type in self._variant_transformers:
202
+ return self._variant_transformers[active_type]
203
+
204
+ repo = repo_id_or_path or self.config.repo_id_or_path
205
+ if repo is None:
206
+ raise ValueError("model_type switching requires repo_id_or_path to be set on the pipeline.")
207
+
208
+ root = Path(repo)
209
+ if not root.exists():
210
+ root = Path(snapshot_download(repo_id=str(repo)))
211
+ transformer = MiniT2IMMJiTModel.from_pretrained(
212
+ self._resolve_transformer_path(root, active_type),
213
+ torch_dtype=torch_dtype,
214
+ variant=variant,
215
+ )
216
+ self._variant_transformers[active_type] = transformer
217
+ if active_type == resolve_model_type(self.config.model_type):
218
+ self.transformer = transformer
219
+ self._active_model_type = active_type
220
+ return transformer
221
+
222
+ @staticmethod
223
+ def prepare_extra_step_kwargs(
224
+ scheduler,
225
+ generator: Optional[Union[torch.Generator, List[torch.Generator]]],
226
+ ) -> Dict[str, Any]:
227
+ kwargs: Dict[str, Any] = {}
228
+ step_params = set(inspect.signature(scheduler.step).parameters.keys())
229
+ if "generator" in step_params:
230
+ kwargs["generator"] = generator
231
+ return kwargs
232
+
233
+ def check_inputs(
234
+ self,
235
+ prompt: Union[str, List[str]],
236
+ guidance_scale: float,
237
+ num_inference_steps: int,
238
+ output_type: str,
239
+ ) -> None:
240
+ if not isinstance(prompt, str) and not (isinstance(prompt, list) and all(isinstance(p, str) for p in prompt)):
241
+ raise TypeError(f"`prompt` must be a string or list of strings, got {type(prompt)}.")
242
+ if guidance_scale < 0:
243
+ raise ValueError(f"`guidance_scale` must be non-negative, got {guidance_scale}.")
244
+ if num_inference_steps <= 0:
245
+ raise ValueError(f"`num_inference_steps` must be positive, got {num_inference_steps}.")
246
+ if output_type not in {"pil", "np", "pt", "latent"}:
247
+ raise ValueError(f"Unsupported `output_type`: {output_type}")
248
+
249
+ def prepare_latents(
250
+ self,
251
+ batch_size: int,
252
+ image_size: int,
253
+ in_channels: int,
254
+ device: torch.device,
255
+ dtype: torch.dtype,
256
+ generator: Optional[torch.Generator] = None,
257
+ latents: Optional[torch.Tensor] = None,
258
+ ) -> torch.Tensor:
259
+ shape = (batch_size, in_channels, image_size, image_size)
260
+ if latents is None:
261
+ latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
262
+ latents = latents * NOISE_INIT_SCALE
263
+ else:
264
+ latents = latents.to(device=device, dtype=dtype)
265
+ if tuple(latents.shape) != shape:
266
+ raise ValueError(f"Invalid `latents` shape: {tuple(latents.shape)}. Expected {shape}.")
267
+ return latents
268
+
269
+ def _encode_prompt(
270
+ self,
271
+ prompt: Union[str, List[str]],
272
+ device: torch.device,
273
+ transformer = None,
274
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
275
+ if isinstance(prompt, str):
276
+ prompt = [prompt]
277
+ transformer = transformer or self.transformer
278
+ if self.tokenizer is None:
279
+ self.tokenizer = AutoTokenizer.from_pretrained(self.config.text_encoder_name)
280
+ if self.text_encoder is None:
281
+ self.text_encoder = T5EncoderModel.from_pretrained(self.config.text_encoder_name)
282
+ if next(self.text_encoder.parameters()).device != device:
283
+ self.text_encoder.to(device)
284
+ cfg = transformer.mmjit_config
285
+ tokens = self.tokenizer(
286
+ prompt,
287
+ return_tensors="pt",
288
+ padding="max_length",
289
+ truncation=True,
290
+ max_length=cfg.prompt_length,
291
+ )
292
+ input_ids = tokens.input_ids.to(device)
293
+ attn = tokens.attention_mask.to(device)
294
+ text = self.text_encoder(input_ids=input_ids, attention_mask=attn).last_hidden_state
295
+ return text, attn
296
+
297
+ @staticmethod
298
+ def _cfg_velocity(
299
+ transformer,
300
+ x: torch.Tensor,
301
+ t: torch.Tensor,
302
+ text: torch.Tensor,
303
+ mask: torch.Tensor,
304
+ cfg_scale: float,
305
+ ) -> torch.Tensor:
306
+ batch_size = x.shape[0]
307
+ doubled_x = torch.cat([x, x], dim=0)
308
+ doubled_t = torch.cat([t, t], dim=0)
309
+ doubled_text = torch.cat([text, text], dim=0)
310
+ null_mask = torch.zeros_like(mask)
311
+ doubled_mask = torch.cat([mask, null_mask], dim=0)
312
+ velocity = transformer.pred_velocity(doubled_x, doubled_t, doubled_text, doubled_mask)
313
+ cond, uncond = velocity[:batch_size], velocity[batch_size:]
314
+ cfg_interval = transformer.mmjit_config.cfg_interval
315
+ use_cfg = ((t >= cfg_interval[0]) & (t <= cfg_interval[1])).to(velocity.dtype)
316
+ scale = torch.where(
317
+ use_cfg[:, None, None, None] > 0,
318
+ torch.tensor(cfg_scale, device=x.device, dtype=velocity.dtype),
319
+ torch.tensor(1.0, device=x.device, dtype=velocity.dtype),
320
+ )
321
+ return uncond + (cond - uncond) * scale
322
+
323
+ @torch.no_grad()
324
+ def __call__(
325
+ self,
326
+ prompt: Union[str, List[str]],
327
+ num_images_per_prompt: int = 1,
328
+ guidance_scale: float = 6.0,
329
+ num_inference_steps: Optional[int] = None,
330
+ generator: Optional[torch.Generator] = None,
331
+ latents: Optional[torch.Tensor] = None,
332
+ output_type: str = "pil",
333
+ return_dict: bool = True,
334
+ progress: bool = True,
335
+ model_type: Optional[str] = None,
336
+ repo_id_or_path: Optional[str] = None,
337
+ variant: Optional[str] = None,
338
+ torch_dtype: Optional[torch.dtype] = None,
339
+ ) -> Union[ImagePipelineOutput, Tuple]:
340
+ r"""
341
+ Generate images from text prompts with MiniT2I.
342
+
343
+ Args:
344
+ prompt (`str` or `list[str]`):
345
+ Text prompt or batch of prompts.
346
+ num_images_per_prompt (`int`, defaults to `1`):
347
+ Number of images to generate per prompt.
348
+ guidance_scale (`float`, defaults to `6.0`):
349
+ Classifier-free guidance scale. CFG is active when `guidance_scale != 1.0`.
350
+ num_inference_steps (`int`, *optional*):
351
+ Number of denoising steps. Defaults to the pipeline config value.
352
+ generator (`torch.Generator`, *optional*):
353
+ RNG for reproducibility.
354
+ latents (`torch.Tensor`, *optional*):
355
+ Pre-generated pixel latents with shape `(batch, channels, height, width)`.
356
+ output_type (`str`, defaults to `"pil"`):
357
+ `"pil"`, `"np"`, `"pt"`, or `"latent"`.
358
+ return_dict (`bool`, defaults to `True`):
359
+ Return [`ImagePipelineOutput`] if True.
360
+ progress (`bool`, defaults to `True`):
361
+ Whether to show a progress bar during denoising.
362
+ model_type (`str`, *optional*):
363
+ MiniT2I variant alias such as `"b16"` or `"l16"`.
364
+ repo_id_or_path (`str`, *optional*):
365
+ Hub id or local path used when switching `model_type`.
366
+ variant (`str`, *optional*):
367
+ Weight variant passed to `from_pretrained`.
368
+ torch_dtype (`torch.dtype`, *optional*):
369
+ Optional dtype override when loading a different transformer variant.
370
+ """
371
+ num_inference_steps = int(num_inference_steps or self.config.default_num_inference_steps)
372
+ self.check_inputs(prompt, guidance_scale, num_inference_steps, output_type)
373
+
374
+ transformer = self._get_transformer(model_type, repo_id_or_path, torch_dtype=torch_dtype, variant=variant)
375
+ device = self._execution_device
376
+ transformer = transformer.to(device)
377
+
378
+ if isinstance(prompt, str):
379
+ prompt_batch = [prompt] * num_images_per_prompt
380
+ else:
381
+ prompt_batch = []
382
+ for entry in prompt:
383
+ prompt_batch.extend([entry] * num_images_per_prompt)
384
+
385
+ batch_size = len(prompt_batch)
386
+ mmjit_cfg = transformer.mmjit_config
387
+ model_dtype = next(transformer.parameters()).dtype
388
+
389
+ text, attn = self._encode_prompt(prompt_batch, device, transformer=transformer)
390
+ text = text.to(dtype=model_dtype)
391
+ attn = attn.to(dtype=model_dtype)
392
+
393
+ if getattr(self.scheduler.config, "stochastic_sampling", False):
394
+ raise ValueError(
395
+ "MiniT2I expects deterministic FlowMatchEulerDiscreteScheduler stepping "
396
+ "(scheduler.config.stochastic_sampling=False)."
397
+ )
398
+
399
+ extra_step_kwargs = self.prepare_extra_step_kwargs(self.scheduler, generator=generator)
400
+ self.scheduler.set_timesteps(num_inference_steps, device=device)
401
+ num_train_timesteps = self.scheduler.config.num_train_timesteps
402
+
403
+ latents = self.prepare_latents(
404
+ batch_size=batch_size,
405
+ image_size=mmjit_cfg.image_size,
406
+ in_channels=mmjit_cfg.in_channels,
407
+ device=device,
408
+ dtype=model_dtype,
409
+ generator=generator,
410
+ latents=latents,
411
+ )
412
+
413
+ timesteps = self.scheduler.timesteps
414
+ if progress:
415
+ timesteps = self.progress_bar(timesteps)
416
+
417
+ using_cfg = guidance_scale != 1.0
418
+ for timestep in timesteps:
419
+ flow_time = 1.0 - float(timestep) / num_train_timesteps
420
+ t = torch.full((batch_size,), flow_time, device=device, dtype=model_dtype)
421
+ if using_cfg:
422
+ velocity = self._cfg_velocity(transformer, latents, t, text, attn, guidance_scale)
423
+ else:
424
+ velocity = transformer.pred_velocity(latents, t, text, attn)
425
+
426
+ # MiniT2I integrates velocity from noise (t=0) to data (t=1); flip sign for
427
+ # FlowMatchEulerDiscreteScheduler sigma decreasing from 1 to 0.
428
+ latents = self.scheduler.step(-velocity, timestep, latents, **extra_step_kwargs).prev_sample
429
+
430
+ if output_type == "latent":
431
+ images = latents
432
+ else:
433
+ images = (latents.clamp(-1, 1) * 127.5 + 128.0).clamp(0, 255).to(torch.uint8)
434
+ if output_type == "pt":
435
+ images = images.float() / 255.0
436
+ else:
437
+ images = images.permute(0, 2, 3, 1).cpu().numpy()
438
+ if output_type == "pil":
439
+ images = [Image.fromarray(image) for image in images]
440
+
441
+ self.maybe_free_model_hooks()
442
+ if not return_dict:
443
+ return (images,)
444
+ return ImagePipelineOutput(images=images)
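The arithmetic in `_cfg_velocity` and in the sampling loop of `pipeline.py` can be followed on plain scalars. The sketch below mirrors three steps from the file: the timestep-to-flow-time mapping, the classifier-free guidance blend gated by `cfg_interval`, and the final conversion to 8-bit pixel values. The function names and the interval `(0.1, 1.0)` are assumptions made for illustration; the real interval comes from `transformer.mmjit_config.cfg_interval`, which is not shown in this diff, and the pipeline operates on tensors rather than floats.

```python
def flow_time(timestep, num_train_timesteps=1000):
    """Map a scheduler timestep to flow time, where t=0 is noise and t=1 is data."""
    return 1.0 - float(timestep) / num_train_timesteps


def guided_velocity(cond, uncond, cfg_scale, t, cfg_interval):
    """Blend conditional and unconditional velocity; guidance applies only inside the interval."""
    low, high = cfg_interval
    scale = cfg_scale if low <= t <= high else 1.0
    return uncond + (cond - uncond) * scale


def to_uint8(value):
    """Scalar version of the pipeline's pixel conversion from [-1, 1] to [0, 255]."""
    clamped = max(-1.0, min(1.0, value))
    scaled = max(0.0, min(255.0, clamped * 127.5 + 128.0))
    return int(scaled)  # truncation, as with a float-to-uint8 cast


# Hypothetical guidance interval, chosen only for this example.
EXAMPLE_INTERVAL = (0.1, 1.0)

inside = guided_velocity(1.0, 0.0, 6.0, flow_time(500), EXAMPLE_INTERVAL)
outside = guided_velocity(1.0, 0.0, 6.0, flow_time(950), EXAMPLE_INTERVAL)
print(inside, outside, to_uint8(0.0))
```

Outside the interval the scale falls back to `1.0`, so the blend reduces to the conditional prediction alone rather than to the unconditional one.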
MiniT2I-B-16/scheduler/scheduler_config.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "_class_name": "FlowMatchEulerDiscreteScheduler",
3
+ "_diffusers_version": "0.32.0",
4
+ "num_train_timesteps": 1000,
5
+ "shift": 1.0,
6
+ "stochastic_sampling": false
7
+ }
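To see why `pipeline.py` negates the velocity before calling `scheduler.step`, the following toy integrator may help. It assumes a simplified schedule in which sigma falls linearly from 1.0 toward 0 (roughly what `shift=1.0` implies) and an Euler update of the form `x + (sigma_next - sigma) * model_output`. Both are assumptions made for illustration, not a copy of `FlowMatchEulerDiscreteScheduler`, and the names `linear_sigmas` and `euler_sample` are hypothetical.

```python
def linear_sigmas(num_steps, sigma_max=1.0, sigma_min=0.001):
    """Assumed schedule: linearly spaced sigmas from sigma_max to sigma_min, then 0."""
    if num_steps == 1:
        return [sigma_max, 0.0]
    step = (sigma_max - sigma_min) / (num_steps - 1)
    return [sigma_max - i * step for i in range(num_steps)] + [0.0]


def euler_sample(noise, velocity_fn, num_steps):
    """Integrate a noise-to-data velocity field while sigma runs from 1 down to 0."""
    sigmas = linear_sigmas(num_steps)
    x = noise
    for sigma, sigma_next in zip(sigmas, sigmas[1:]):
        t = 1.0 - sigma                 # flow time: 0 at noise, 1 at data
        velocity = velocity_fn(x, t)    # points from noise toward data
        model_output = -velocity        # sign flip, because sigma decreases
        x = x + (sigma_next - sigma) * model_output
    return x


# Toy problem: a straight path from noise=2.0 to data=-0.5 has constant velocity.
NOISE, DATA = 2.0, -0.5
result = euler_sample(NOISE, lambda x, t: DATA - NOISE, num_steps=10)
print(result)
```

Because `sigma_next - sigma` is negative on every step, feeding the velocity in unchanged would move the sample away from the data; negating it restores the intended direction.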
MiniT2I-B-16/text_encoder/README.md ADDED
@@ -0,0 +1,276 @@
1
+ ---
2
+ language:
3
+ - en
4
+ - fr
5
+ - ro
6
+ - de
7
+ - multilingual
8
+
9
+ widget:
10
+ - text: "Translate to German: My name is Arthur"
11
+ example_title: "Translation"
12
+ - text: "Please answer to the following question. Who is going to be the next Ballon d'or?"
13
+ example_title: "Question Answering"
14
+ - text: "Q: Can Geoffrey Hinton have a conversation with George Washington? Give the rationale before answering."
15
+ example_title: "Logical reasoning"
16
+ - text: "Please answer the following question. What is the boiling point of Nitrogen?"
17
+ example_title: "Scientific knowledge"
18
+ - text: "Answer the following yes/no question. Can you write a whole Haiku in a single tweet?"
19
+ example_title: "Yes/no question"
20
+ - text: "Answer the following yes/no question by reasoning step-by-step. Can you write a whole Haiku in a single tweet?"
21
+ example_title: "Reasoning task"
22
+ - text: "Q: ( False or not False or False ) is? A: Let's think step by step"
23
+ example_title: "Boolean Expressions"
24
+ - text: "The square root of x is the cube root of y. What is y to the power of 2, if x = 4?"
25
+ example_title: "Math reasoning"
26
+ - text: "Premise: At my age you will probably have learnt one lesson. Hypothesis: It's not certain how many lessons you'll learn by your thirties. Does the premise entail the hypothesis?"
27
+ example_title: "Premise and hypothesis"
28
+
29
+ tags:
30
+ - text2text-generation
31
+
32
+ datasets:
33
+ - svakulenk0/qrecc
34
+ - taskmaster2
35
+ - djaym7/wiki_dialog
36
+ - deepmind/code_contests
37
+ - lambada
38
+ - gsm8k
39
+ - aqua_rat
40
+ - esnli
41
+ - quasc
42
+ - qed
43
+
44
+
45
+ license: apache-2.0
46
+ ---
47
+
48
+ # Model Card for FLAN-T5 large
49
+
50
+ <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/flan2_architecture.jpg"
51
+ alt="drawing" width="600"/>
52
+
53
+ # Table of Contents
54
+
55
+ 0. [TL;DR](#tldr)
56
+ 1. [Model Details](#model-details)
57
+ 2. [Usage](#usage)
58
+ 3. [Uses](#uses)
59
+ 4. [Bias, Risks, and Limitations](#bias-risks-and-limitations)
60
+ 5. [Training Details](#training-details)
61
+ 6. [Evaluation](#evaluation)
62
+ 7. [Environmental Impact](#environmental-impact)
63
+ 8. [Citation](#citation)
64
+ 9. [Model Card Authors](#model-card-authors)
65
+
66
+ # TL;DR
67
+
68
+ If you already know T5, FLAN-T5 is just better at everything. For the same number of parameters, these models have been fine-tuned on more than 1000 additional tasks that also cover more languages.
69
+ As mentioned in the first few lines of the abstract:
70
+ > Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.
71
+
72
+ **Disclaimer**: Content from **this** model card has been written by the Hugging Face team, and parts of it were copied from the [T5 model card](https://huggingface.co/t5-large).
73
+
74
+ # Model Details
75
+
76
+ ## Model Description
77
+
78
+
79
+ - **Model type:** Language model
80
+ - **Language(s) (NLP):** English, Spanish, Japanese, Persian, Hindi, French, Chinese, Bengali, Gujarati, German, Telugu, Italian, Arabic, Polish, Tamil, Marathi, Malayalam, Oriya, Panjabi, Portuguese, Urdu, Galician, Hebrew, Korean, Catalan, Thai, Dutch, Indonesian, Vietnamese, Bulgarian, Filipino, Central Khmer, Lao, Turkish, Russian, Croatian, Swedish, Yoruba, Kurdish, Burmese, Malay, Czech, Finnish, Somali, Tagalog, Swahili, Sinhala, Kannada, Zhuang, Igbo, Xhosa, Romanian, Haitian, Estonian, Slovak, Lithuanian, Greek, Nepali, Assamese, Norwegian
81
+ - **License:** Apache 2.0
82
+ - **Related Models:** [All FLAN-T5 Checkpoints](https://huggingface.co/models?search=flan-t5)
83
+ - **Original Checkpoints:** [All Original FLAN-T5 Checkpoints](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints)
84
+ - **Resources for more information:**
85
+ - [Research paper](https://arxiv.org/pdf/2210.11416.pdf)
86
+ - [GitHub Repo](https://github.com/google-research/t5x)
87
+ - [Hugging Face FLAN-T5 Docs (Similar to T5) ](https://huggingface.co/docs/transformers/model_doc/t5)
88
+
89
+ # Usage
90
+
91
+ Find below some example scripts on how to use the model in `transformers`:
92
+
93
+ ## Using the PyTorch model
94
+
95
+ ### Running the model on a CPU
96
+
97
+ <details>
98
+ <summary> Click to expand </summary>
99
+
100
+ ```python
101
+
102
+ from transformers import T5Tokenizer, T5ForConditionalGeneration
103
+
104
+ tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")
105
+ model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large")
106
+
107
+ input_text = "translate English to German: How old are you?"
108
+ input_ids = tokenizer(input_text, return_tensors="pt").input_ids
109
+
110
+ outputs = model.generate(input_ids)
111
+ print(tokenizer.decode(outputs[0]))
112
+ ```
113
+
114
+ </details>
115
+
116
+ ### Running the model on a GPU
117
+
118
+ <details>
119
+ <summary> Click to expand </summary>
120
+
121
+ ```python
122
+ # pip install accelerate
123
+ from transformers import T5Tokenizer, T5ForConditionalGeneration
124
+
125
+ tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")
126
+ model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large", device_map="auto")
127
+
128
+ input_text = "translate English to German: How old are you?"
129
+ input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
130
+
131
+ outputs = model.generate(input_ids)
132
+ print(tokenizer.decode(outputs[0]))
133
+ ```
134
+
135
+ </details>
136
+
137
+ ### Running the model on a GPU using different precisions
138
+
139
+ #### FP16
140
+
141
+ <details>
142
+ <summary> Click to expand </summary>
143
+
144
+ ```python
145
+ # pip install accelerate
146
+ import torch
147
+ from transformers import T5Tokenizer, T5ForConditionalGeneration
148
+
149
+ tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")
150
+ model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large", device_map="auto", torch_dtype=torch.float16)
151
+
152
+ input_text = "translate English to German: How old are you?"
153
+ input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
154
+
155
+ outputs = model.generate(input_ids)
156
+ print(tokenizer.decode(outputs[0]))
157
+ ```
158
+
159
+ </details>
160
+
161
+ #### INT8
162
+
163
+ <details>
164
+ <summary> Click to expand </summary>
165
+
166
+ ```python
167
+ # pip install bitsandbytes accelerate
168
+ from transformers import T5Tokenizer, T5ForConditionalGeneration
169
+
170
+ tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")
171
+ model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large", device_map="auto", load_in_8bit=True)
172
+
173
+ input_text = "translate English to German: How old are you?"
174
+ input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
175
+
176
+ outputs = model.generate(input_ids)
177
+ print(tokenizer.decode(outputs[0]))
178
+ ```
179
+
180
+ </details>
181
+
182
+ # Uses
183
+
184
+ ## Direct Use and Downstream Use
185
+
186
+ The authors write in [the original paper's model card](https://arxiv.org/pdf/2210.11416.pdf) that:
187
+
188
+ > The primary use is research on language models, including: research on zero-shot NLP tasks and in-context few-shot learning NLP tasks, such as reasoning, and question answering; advancing fairness and safety research, and understanding limitations of current large language models
189
+
190
+ See the [research paper](https://arxiv.org/pdf/2210.11416.pdf) for further details.
191
+
192
+ ## Out-of-Scope Use
193
+
194
+ More information needed.
195
+
196
+ # Bias, Risks, and Limitations
197
+
198
+ The information in this section is copied from the model's [official model card](https://arxiv.org/pdf/2210.11416.pdf):
199
+
200
+ > Language models, including Flan-T5, can potentially be used for language generation in a harmful way, according to Rae et al. (2021). Flan-T5 should not be used directly in any application, without a prior assessment of safety and fairness concerns specific to the application.
201
+
202
+ ## Ethical considerations and risks
203
+
204
+ > Flan-T5 is fine-tuned on a large corpus of text data that was not filtered for explicit content or assessed for existing biases. As a result the model itself is potentially vulnerable to generating equivalently inappropriate content or replicating inherent biases in the underlying data.
205
+
206
+ ## Known Limitations
207
+
208
+ > Flan-T5 has not been tested in real world applications.
209
+
210
+ ## Sensitive Use:
211
+
212
+ > Flan-T5 should not be applied for any unacceptable use cases, e.g., generation of abusive speech.
213
+
214
+ # Training Details
215
+
216
+ ## Training Data
217
+
218
+ The model was trained on a mixture of tasks that includes the tasks described in the table below (from the original paper, figure 2):
219
+
220
+ ![table.png](https://s3.amazonaws.com/moonup/production/uploads/1666363265279-62441d1d9fdefb55a0b7d12c.png)
221
+
222
+
223
+ ## Training Procedure
224
+
225
+ According to the model card from the [original paper](https://arxiv.org/pdf/2210.11416.pdf):
226
+
227
+ > These models are based on pretrained T5 (Raffel et al., 2020) and fine-tuned with instructions for better zero-shot and few-shot performance. There is one fine-tuned Flan model per T5 model size.
228
+
229
+ The model was trained on TPU v3 or TPU v4 pods, using the [`t5x`](https://github.com/google-research/t5x) codebase together with [`jax`](https://github.com/google/jax).
230
+
231
+
232
+ # Evaluation
233
+
234
+ ## Testing Data, Factors & Metrics
235
+
236
+ The authors evaluated the model on various tasks covering several languages (1836 in total). The table below shows selected quantitative results:
237
+ ![image.png](https://s3.amazonaws.com/moonup/production/uploads/1668072995230-62441d1d9fdefb55a0b7d12c.png)
238
+ For full details, please check the [research paper](https://arxiv.org/pdf/2210.11416.pdf).
239
+
240
+ ## Results
241
+
242
+ For full results for FLAN-T5-Large, see the [research paper](https://arxiv.org/pdf/2210.11416.pdf), Table 3.
243
+
244
+ # Environmental Impact
245
+
246
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
247
+
248
+ - **Hardware Type:** Google Cloud TPU Pods - TPU v3 or TPU v4 | Number of chips ≥ 4.
249
+ - **Hours used:** More information needed
250
+ - **Cloud Provider:** GCP
251
+ - **Compute Region:** More information needed
252
+ - **Carbon Emitted:** More information needed
253
+
254
+ # Citation
255
+
256
+ **BibTeX:**
257
+
258
+ ```bibtex
259
+ @misc{https://doi.org/10.48550/arxiv.2210.11416,
260
+ doi = {10.48550/ARXIV.2210.11416},
261
+
262
+ url = {https://arxiv.org/abs/2210.11416},
263
+
264
+ author = {Chung, Hyung Won and Hou, Le and Longpre, Shayne and Zoph, Barret and Tay, Yi and Fedus, William and Li, Eric and Wang, Xuezhi and Dehghani, Mostafa and Brahma, Siddhartha and Webson, Albert and Gu, Shixiang Shane and Dai, Zhuyun and Suzgun, Mirac and Chen, Xinyun and Chowdhery, Aakanksha and Narang, Sharan and Mishra, Gaurav and Yu, Adams and Zhao, Vincent and Huang, Yanping and Dai, Andrew and Yu, Hongkun and Petrov, Slav and Chi, Ed H. and Dean, Jeff and Devlin, Jacob and Roberts, Adam and Zhou, Denny and Le, Quoc V. and Wei, Jason},
265
+
266
+ keywords = {Machine Learning (cs.LG), Computation and Language (cs.CL), FOS: Computer and information sciences},
267
+
268
+ title = {Scaling Instruction-Finetuned Language Models},
269
+
270
+ publisher = {arXiv},
271
+
272
+ year = {2022},
273
+
274
+ copyright = {Creative Commons Attribution 4.0 International}
275
+ }
276
+ ```
MiniT2I-B-16/text_encoder/config.json ADDED
@@ -0,0 +1,28 @@
1
+ {
2
+ "architectures": [
3
+ "T5ForConditionalGeneration"
4
+ ],
5
+ "d_ff": 2816,
6
+ "d_kv": 64,
7
+ "d_model": 1024,
8
+ "decoder_start_token_id": 0,
9
+ "dropout_rate": 0.1,
10
+ "eos_token_id": 1,
11
+ "feed_forward_proj": "gated-gelu",
12
+ "initializer_factor": 1.0,
13
+ "is_encoder_decoder": true,
14
+ "layer_norm_epsilon": 1e-06,
15
+ "model_type": "t5",
16
+ "n_positions": 512,
17
+ "num_decoder_layers": 24,
18
+ "num_heads": 16,
19
+ "num_layers": 24,
20
+ "output_past": true,
21
+ "pad_token_id": 0,
22
+ "relative_attention_max_distance": 128,
23
+ "relative_attention_num_buckets": 32,
24
+ "tie_word_embeddings": false,
25
+ "transformers_version": "4.23.1",
26
+ "use_cache": true,
27
+ "vocab_size": 32128
28
+ }
MiniT2I-B-16/text_encoder/generation_config.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "decoder_start_token_id": 0,
4
+ "eos_token_id": 1,
5
+ "pad_token_id": 0,
6
+ "transformers_version": "4.27.0.dev0"
7
+ }
MiniT2I-B-16/text_encoder/model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a9dd06ce490f139af36e9eb77dd3758b4fd07a08a73d5a1abe5ff2591e2d388e
3
+ size 3132668804
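The three lines above are a Git LFS pointer rather than the weights themselves: a spec version, a SHA-256 object ID, and the payload size in bytes. As a minimal sketch, the pointer can be parsed and its size turned into a rough parameter count. The helper names `parse_lfs_pointer` and `approx_param_count` are hypothetical, and the estimate assumes float32 storage and ignores the safetensors header, neither of which is stated in this commit.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its version, hash, and size."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid line has the form "<hash-method>:<hex digest>".
    method, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_method": method,
        "digest": digest,
        "size": int(fields["size"]),
    }


def approx_param_count(size_bytes: int, bytes_per_param: int = 4) -> int:
    """Rough parameter count, ASSUMING every byte stores float32 weights."""
    return size_bytes // bytes_per_param


POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:a9dd06ce490f139af36e9eb77dd3758b4fd07a08a73d5a1abe5ff2591e2d388e
size 3132668804
"""

pointer = parse_lfs_pointer(POINTER)
print(pointer["hash_method"], pointer["size"], approx_param_count(pointer["size"]))
```

Under the float32 assumption the file holds on the order of 783 million values, which is in the range expected for a T5-large-sized encoder-decoder; the real count can only be confirmed by reading the safetensors header.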
MiniT2I-B-16/text_encoder/special_tokens_map.json ADDED
@@ -0,0 +1,107 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<extra_id_0>",
4
+ "<extra_id_1>",
5
+ "<extra_id_2>",
6
+ "<extra_id_3>",
7
+ "<extra_id_4>",
8
+ "<extra_id_5>",
9
+ "<extra_id_6>",
10
+ "<extra_id_7>",
11
+ "<extra_id_8>",
12
+ "<extra_id_9>",
13
+ "<extra_id_10>",
14
+ "<extra_id_11>",
15
+ "<extra_id_12>",
16
+ "<extra_id_13>",
17
+ "<extra_id_14>",
18
+ "<extra_id_15>",
19
+ "<extra_id_16>",
20
+ "<extra_id_17>",
21
+ "<extra_id_18>",
22
+ "<extra_id_19>",
23
+ "<extra_id_20>",
24
+ "<extra_id_21>",
25
+ "<extra_id_22>",
26
+ "<extra_id_23>",
27
+ "<extra_id_24>",
28
+ "<extra_id_25>",
29
+ "<extra_id_26>",
30
+ "<extra_id_27>",
31
+ "<extra_id_28>",
32
+ "<extra_id_29>",
33
+ "<extra_id_30>",
34
+ "<extra_id_31>",
35
+ "<extra_id_32>",
36
+ "<extra_id_33>",
37
+ "<extra_id_34>",
38
+ "<extra_id_35>",
39
+ "<extra_id_36>",
40
+ "<extra_id_37>",
41
+ "<extra_id_38>",
42
+ "<extra_id_39>",
43
+ "<extra_id_40>",
44
+ "<extra_id_41>",
45
+ "<extra_id_42>",
46
+ "<extra_id_43>",
47
+ "<extra_id_44>",
48
+ "<extra_id_45>",
49
+ "<extra_id_46>",
50
+ "<extra_id_47>",
51
+ "<extra_id_48>",
52
+ "<extra_id_49>",
53
+ "<extra_id_50>",
54
+ "<extra_id_51>",
55
+ "<extra_id_52>",
56
+ "<extra_id_53>",
57
+ "<extra_id_54>",
58
+ "<extra_id_55>",
59
+ "<extra_id_56>",
60
+ "<extra_id_57>",
61
+ "<extra_id_58>",
62
+ "<extra_id_59>",
63
+ "<extra_id_60>",
64
+ "<extra_id_61>",
65
+ "<extra_id_62>",
66
+ "<extra_id_63>",
67
+ "<extra_id_64>",
68
+ "<extra_id_65>",
69
+ "<extra_id_66>",
70
+ "<extra_id_67>",
71
+ "<extra_id_68>",
72
+ "<extra_id_69>",
73
+ "<extra_id_70>",
74
+ "<extra_id_71>",
75
+ "<extra_id_72>",
76
+ "<extra_id_73>",
77
+ "<extra_id_74>",
78
+ "<extra_id_75>",
79
+ "<extra_id_76>",
80
+ "<extra_id_77>",
81
+ "<extra_id_78>",
82
+ "<extra_id_79>",
83
+ "<extra_id_80>",
84
+ "<extra_id_81>",
85
+ "<extra_id_82>",
86
+ "<extra_id_83>",
87
+ "<extra_id_84>",
88
+ "<extra_id_85>",
89
+ "<extra_id_86>",
90
+ "<extra_id_87>",
91
+ "<extra_id_88>",
92
+ "<extra_id_89>",
93
+ "<extra_id_90>",
94
+ "<extra_id_91>",
95
+ "<extra_id_92>",
96
+ "<extra_id_93>",
97
+ "<extra_id_94>",
98
+ "<extra_id_95>",
99
+ "<extra_id_96>",
100
+ "<extra_id_97>",
101
+ "<extra_id_98>",
102
+ "<extra_id_99>"
103
+ ],
104
+ "eos_token": "</s>",
105
+ "pad_token": "<pad>",
106
+ "unk_token": "<unk>"
107
+ }
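The 100 `<extra_id_N>` entries above follow a regular pattern, so the whole map can be regenerated instead of being maintained by hand. The sketch below is illustrative only; `build_special_tokens_map` is a hypothetical helper and not part of this repository or of the `transformers` library.

```python
import json


def build_special_tokens_map(extra_ids: int = 100) -> dict:
    """Rebuild the map above: N sentinel tokens plus the eos/pad/unk tokens."""
    return {
        "additional_special_tokens": [f"<extra_id_{i}>" for i in range(extra_ids)],
        "eos_token": "</s>",
        "pad_token": "<pad>",
        "unk_token": "<unk>",
    }


tokens_map = build_special_tokens_map()
print(json.dumps(tokens_map, indent=2)[:120])
```

Generating the list this way also makes it easy to check that two copies of the file (for example under `text_encoder/` and `tokenizer/`) stay identical.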
MiniT2I-B-16/text_encoder/spiece.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d60acb128cf7b7f2536e8f38a5b18a05535c9e14c7a355904270e15b0945ea86
3
+ size 791656
MiniT2I-B-16/text_encoder/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
MiniT2I-B-16/text_encoder/tokenizer_config.json ADDED
@@ -0,0 +1,113 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<extra_id_0>",
4
+ "<extra_id_1>",
5
+ "<extra_id_2>",
6
+ "<extra_id_3>",
7
+ "<extra_id_4>",
8
+ "<extra_id_5>",
9
+ "<extra_id_6>",
10
+ "<extra_id_7>",
11
+ "<extra_id_8>",
12
+ "<extra_id_9>",
13
+ "<extra_id_10>",
14
+ "<extra_id_11>",
15
+ "<extra_id_12>",
16
+ "<extra_id_13>",
17
+ "<extra_id_14>",
18
+ "<extra_id_15>",
19
+ "<extra_id_16>",
20
+ "<extra_id_17>",
21
+ "<extra_id_18>",
22
+ "<extra_id_19>",
23
+ "<extra_id_20>",
24
+ "<extra_id_21>",
25
+ "<extra_id_22>",
26
+ "<extra_id_23>",
27
+ "<extra_id_24>",
28
+ "<extra_id_25>",
29
+ "<extra_id_26>",
30
+ "<extra_id_27>",
31
+ "<extra_id_28>",
32
+ "<extra_id_29>",
33
+ "<extra_id_30>",
34
+ "<extra_id_31>",
35
+ "<extra_id_32>",
36
+ "<extra_id_33>",
37
+ "<extra_id_34>",
38
+ "<extra_id_35>",
39
+ "<extra_id_36>",
40
+ "<extra_id_37>",
41
+ "<extra_id_38>",
42
+ "<extra_id_39>",
43
+ "<extra_id_40>",
44
+ "<extra_id_41>",
45
+ "<extra_id_42>",
46
+ "<extra_id_43>",
47
+ "<extra_id_44>",
48
+ "<extra_id_45>",
49
+ "<extra_id_46>",
50
+ "<extra_id_47>",
51
+ "<extra_id_48>",
52
+ "<extra_id_49>",
53
+ "<extra_id_50>",
54
+ "<extra_id_51>",
55
+ "<extra_id_52>",
56
+ "<extra_id_53>",
57
+ "<extra_id_54>",
58
+ "<extra_id_55>",
59
+ "<extra_id_56>",
60
+ "<extra_id_57>",
61
+ "<extra_id_58>",
62
+ "<extra_id_59>",
63
+ "<extra_id_60>",
64
+ "<extra_id_61>",
65
+ "<extra_id_62>",
66
+ "<extra_id_63>",
67
+ "<extra_id_64>",
68
+ "<extra_id_65>",
69
+ "<extra_id_66>",
70
+ "<extra_id_67>",
71
+ "<extra_id_68>",
72
+ "<extra_id_69>",
73
+ "<extra_id_70>",
74
+ "<extra_id_71>",
75
+ "<extra_id_72>",
76
+ "<extra_id_73>",
77
+ "<extra_id_74>",
78
+ "<extra_id_75>",
79
+ "<extra_id_76>",
80
+ "<extra_id_77>",
81
+ "<extra_id_78>",
82
+ "<extra_id_79>",
83
+ "<extra_id_80>",
84
+ "<extra_id_81>",
85
+ "<extra_id_82>",
86
+ "<extra_id_83>",
87
+ "<extra_id_84>",
88
+ "<extra_id_85>",
89
+ "<extra_id_86>",
90
+ "<extra_id_87>",
91
+ "<extra_id_88>",
92
+ "<extra_id_89>",
93
+ "<extra_id_90>",
94
+ "<extra_id_91>",
95
+ "<extra_id_92>",
96
+ "<extra_id_93>",
97
+ "<extra_id_94>",
98
+ "<extra_id_95>",
99
+ "<extra_id_96>",
100
+ "<extra_id_97>",
101
+ "<extra_id_98>",
102
+ "<extra_id_99>"
103
+ ],
104
+ "eos_token": "</s>",
105
+ "extra_ids": 100,
106
+ "model_max_length": 512,
107
+ "name_or_path": "google/t5-v1_1-large",
108
+ "pad_token": "<pad>",
109
+ "sp_model_kwargs": {},
110
+ "special_tokens_map_file": "/home/younes_huggingface_co/.cache/huggingface/hub/models--google--t5-v1_1-large/snapshots/314bc112b191ec17b625ba81438dc73d6c23659d/special_tokens_map.json",
111
+ "tokenizer_class": "T5Tokenizer",
112
+ "unk_token": "<unk>"
113
+ }
MiniT2I-B-16/tokenizer/special_tokens_map.json ADDED
@@ -0,0 +1,107 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<extra_id_0>",
4
+ "<extra_id_1>",
5
+ "<extra_id_2>",
6
+ "<extra_id_3>",
7
+ "<extra_id_4>",
8
+ "<extra_id_5>",
9
+ "<extra_id_6>",
10
+ "<extra_id_7>",
11
+ "<extra_id_8>",
12
+ "<extra_id_9>",
13
+ "<extra_id_10>",
14
+ "<extra_id_11>",
15
+ "<extra_id_12>",
16
+ "<extra_id_13>",
17
+ "<extra_id_14>",
18
+ "<extra_id_15>",
19
+ "<extra_id_16>",
20
+ "<extra_id_17>",
21
+ "<extra_id_18>",
22
+ "<extra_id_19>",
23
+ "<extra_id_20>",
24
+ "<extra_id_21>",
25
+ "<extra_id_22>",
26
+ "<extra_id_23>",
27
+ "<extra_id_24>",
28
+ "<extra_id_25>",
29
+ "<extra_id_26>",
30
+ "<extra_id_27>",
31
+ "<extra_id_28>",
32
+ "<extra_id_29>",
33
+ "<extra_id_30>",
34
+ "<extra_id_31>",
35
+ "<extra_id_32>",
36
+ "<extra_id_33>",
37
+ "<extra_id_34>",
38
+ "<extra_id_35>",
39
+ "<extra_id_36>",
40
+ "<extra_id_37>",
41
+ "<extra_id_38>",
42
+ "<extra_id_39>",
43
+ "<extra_id_40>",
44
+ "<extra_id_41>",
45
+ "<extra_id_42>",
46
+ "<extra_id_43>",
47
+ "<extra_id_44>",
48
+ "<extra_id_45>",
49
+ "<extra_id_46>",
50
+ "<extra_id_47>",
51
+ "<extra_id_48>",
52
+ "<extra_id_49>",
53
+ "<extra_id_50>",
54
+ "<extra_id_51>",
55
+ "<extra_id_52>",
56
+ "<extra_id_53>",
57
+ "<extra_id_54>",
58
+ "<extra_id_55>",
59
+ "<extra_id_56>",
60
+ "<extra_id_57>",
61
+ "<extra_id_58>",
62
+ "<extra_id_59>",
63
+ "<extra_id_60>",
64
+ "<extra_id_61>",
65
+ "<extra_id_62>",
66
+ "<extra_id_63>",
67
+ "<extra_id_64>",
68
+ "<extra_id_65>",
69
+ "<extra_id_66>",
70
+ "<extra_id_67>",
71
+ "<extra_id_68>",
72
+ "<extra_id_69>",
73
+ "<extra_id_70>",
74
+ "<extra_id_71>",
75
+ "<extra_id_72>",
76
+ "<extra_id_73>",
77
+ "<extra_id_74>",
78
+ "<extra_id_75>",
79
+ "<extra_id_76>",
80
+ "<extra_id_77>",
81
+ "<extra_id_78>",
82
+ "<extra_id_79>",
83
+ "<extra_id_80>",
84
+ "<extra_id_81>",
85
+ "<extra_id_82>",
86
+ "<extra_id_83>",
87
+ "<extra_id_84>",
88
+ "<extra_id_85>",
89
+ "<extra_id_86>",
90
+ "<extra_id_87>",
91
+ "<extra_id_88>",
92
+ "<extra_id_89>",
93
+ "<extra_id_90>",
94
+ "<extra_id_91>",
95
+ "<extra_id_92>",
96
+ "<extra_id_93>",
97
+ "<extra_id_94>",
98
+ "<extra_id_95>",
99
+ "<extra_id_96>",
100
+ "<extra_id_97>",
101
+ "<extra_id_98>",
102
+ "<extra_id_99>"
103
+ ],
104
+ "eos_token": "</s>",
105
+ "pad_token": "<pad>",
106
+ "unk_token": "<unk>"
107
+ }
MiniT2I-B-16/tokenizer/spiece.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d60acb128cf7b7f2536e8f38a5b18a05535c9e14c7a355904270e15b0945ea86
3
+ size 791656
MiniT2I-B-16/tokenizer/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
MiniT2I-B-16/tokenizer/tokenizer_config.json ADDED
@@ -0,0 +1,113 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<extra_id_0>",
4
+ "<extra_id_1>",
5
+ "<extra_id_2>",
6
+ "<extra_id_3>",
7
+ "<extra_id_4>",
8
+ "<extra_id_5>",
9
+ "<extra_id_6>",
10
+ "<extra_id_7>",
11
+ "<extra_id_8>",
12
+ "<extra_id_9>",
13
+ "<extra_id_10>",
14
+ "<extra_id_11>",
15
+ "<extra_id_12>",
16
+ "<extra_id_13>",
17
+ "<extra_id_14>",
18
+ "<extra_id_15>",
19
+ "<extra_id_16>",
20
+ "<extra_id_17>",
21
+ "<extra_id_18>",
22
+ "<extra_id_19>",
23
+ "<extra_id_20>",
24
+ "<extra_id_21>",
25
+ "<extra_id_22>",
26
+ "<extra_id_23>",
27
+ "<extra_id_24>",
28
+ "<extra_id_25>",
29
+ "<extra_id_26>",
30
+ "<extra_id_27>",
31
+ "<extra_id_28>",
32
+ "<extra_id_29>",
33
+ "<extra_id_30>",
34
+ "<extra_id_31>",
35
+ "<extra_id_32>",
36
+ "<extra_id_33>",
37
+ "<extra_id_34>",
38
+ "<extra_id_35>",
39
+ "<extra_id_36>",
40
+ "<extra_id_37>",
41
+ "<extra_id_38>",
42
+ "<extra_id_39>",
43
+ "<extra_id_40>",
44
+ "<extra_id_41>",
45
+ "<extra_id_42>",
46
+ "<extra_id_43>",
47
+ "<extra_id_44>",
48
+ "<extra_id_45>",
49
+ "<extra_id_46>",
50
+ "<extra_id_47>",
51
+ "<extra_id_48>",
52
+ "<extra_id_49>",
53
+ "<extra_id_50>",
54
+ "<extra_id_51>",
55
+ "<extra_id_52>",
56
+ "<extra_id_53>",
57
+ "<extra_id_54>",
58
+ "<extra_id_55>",
59
+ "<extra_id_56>",
60
+ "<extra_id_57>",
61
+ "<extra_id_58>",
62
+ "<extra_id_59>",
63
+ "<extra_id_60>",
64
+ "<extra_id_61>",
65
+ "<extra_id_62>",
66
+ "<extra_id_63>",
67
+ "<extra_id_64>",
68
+ "<extra_id_65>",
69
+ "<extra_id_66>",
70
+ "<extra_id_67>",
71
+ "<extra_id_68>",
72
+ "<extra_id_69>",
73
+ "<extra_id_70>",
74
+ "<extra_id_71>",
75
+ "<extra_id_72>",
76
+ "<extra_id_73>",
77
+ "<extra_id_74>",
78
+ "<extra_id_75>",
79
+ "<extra_id_76>",
80
+ "<extra_id_77>",
81
+ "<extra_id_78>",
82
+ "<extra_id_79>",
83
+ "<extra_id_80>",
84
+ "<extra_id_81>",
85
+ "<extra_id_82>",
86
+ "<extra_id_83>",
87
+ "<extra_id_84>",
88
+ "<extra_id_85>",
89
+ "<extra_id_86>",
90
+ "<extra_id_87>",
91
+ "<extra_id_88>",
92
+ "<extra_id_89>",
93
+ "<extra_id_90>",
94
+ "<extra_id_91>",
95
+ "<extra_id_92>",
96
+ "<extra_id_93>",
97
+ "<extra_id_94>",
98
+ "<extra_id_95>",
99
+ "<extra_id_96>",
100
+ "<extra_id_97>",
101
+ "<extra_id_98>",
102
+ "<extra_id_99>"
103
+ ],
104
+ "eos_token": "</s>",
105
+ "extra_ids": 100,
106
+ "model_max_length": 512,
107
+ "name_or_path": "google/t5-v1_1-large",
108
+ "pad_token": "<pad>",
109
+ "sp_model_kwargs": {},
110
+ "special_tokens_map_file": "/home/younes_huggingface_co/.cache/huggingface/hub/models--google--t5-v1_1-large/snapshots/314bc112b191ec17b625ba81438dc73d6c23659d/special_tokens_map.json",
111
+ "tokenizer_class": "T5Tokenizer",
112
+ "unk_token": "<unk>"
113
+ }
MiniT2I-B-16/transformer/config.json ADDED
@@ -0,0 +1,27 @@
1
+ {
2
+ "_class_name": "MiniT2IMMJiTModel",
3
+ "_diffusers_version": "0.35.2",
4
+ "cfg_channels": 3,
5
+ "cfg_interval": [
6
+ 0.0,
7
+ 1.0
8
+ ],
9
+ "cond_vec_size": 768,
10
+ "depth_double": 17,
11
+ "head_dim": 64,
12
+ "hidden_size": 768,
13
+ "image_size": 512,
14
+ "in_channels": 3,
15
+ "llm": "google/flan-t5-large",
16
+ "mlp_ratio": 2.6666666666666665,
17
+ "n_T": 100,
18
+ "num_heads": 12,
19
+ "patch_size": 16,
20
+ "pca_channels": 128,
21
+ "prediction": "x",
22
+ "prompt_length": 256,
23
+ "sampler": "euler",
24
+ "txt_hidden_size": 768,
25
+ "txt_input_size": 1024,
26
+ "txt_preamble_depth": 2
27
+ }
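Several shapes follow directly from the values in this config. The sketch below derives them with plain arithmetic. The function name `derived_shapes` is hypothetical, and the joint sequence length assumes that prompts are padded to exactly `prompt_length` tokens, which this file alone does not confirm.

```python
import json

# Subset of the config above, copied by hand for illustration.
CONFIG = json.loads("""
{
  "image_size": 512, "patch_size": 16, "in_channels": 3,
  "num_heads": 12, "head_dim": 64, "hidden_size": 768,
  "prompt_length": 256, "txt_input_size": 1024
}
""")


def derived_shapes(cfg: dict) -> dict:
    grid = cfg["image_size"] // cfg["patch_size"]  # patches per image side
    image_tokens = grid * grid                      # one token per patch
    return {
        "grid": grid,
        "image_tokens": image_tokens,
        # ASSUMPTION: the text stream always has prompt_length tokens.
        "joint_tokens": cfg["prompt_length"] + image_tokens,
        "attn_inner_dim": cfg["num_heads"] * cfg["head_dim"],
        "values_per_patch": cfg["patch_size"] ** 2 * cfg["in_channels"],
    }


shapes = derived_shapes(CONFIG)
print(shapes)
```

Two coincidences are worth noting: the attention inner width (`num_heads * head_dim`) equals `hidden_size`, and `txt_input_size` matches the `d_model` of 1024 in `text_encoder/config.json` from the same commit.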
MiniT2I-B-16/transformer/diffusion_pytorch_model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5114de36379acd45f810001d9da82d48f96559ce4428cde7a79b1e724983ced1
3
+ size 1032534472
MiniT2I-B-16/transformer/transformer_minit2i.py ADDED
@@ -0,0 +1,446 @@
+ import math
+ from dataclasses import dataclass
+ from typing import Optional
+
+ import torch
+ from torch import nn
+ import torch.nn.functional as F
+
+ from diffusers.configuration_utils import ConfigMixin, register_to_config
+ from diffusers.image_processor import VaeImageProcessor
+ from diffusers.models.modeling_utils import ModelMixin
+ from diffusers.pipelines.pipeline_utils import DiffusionPipeline, ImagePipelineOutput
+ from diffusers.schedulers.scheduling_utils import SchedulerMixin, SchedulerOutput
+ from diffusers.utils import BaseOutput
+ from diffusers.utils.torch_utils import randn_tensor
16
+
17
+
+ def modulate(x, shift, scale):
+     return x * (1 + scale[:, None, :]) + shift[:, None, :]
+
+
+ def rotate_half(x):
+     x1, x2 = x.reshape(*x.shape[:-1], 2, -1).unbind(dim=-2)
+     return torch.cat((-x2, x1), dim=-1)
25
+
26
+
+ class RMSNorm(nn.Module):
+     def __init__(self, dim: int, eps: float = 1e-6):
+         super().__init__()
+         self.weight = nn.Parameter(torch.ones(dim))
+         self.eps = eps
+
+     def forward(self, x):
+         y = x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
+         return y * self.weight
36
+
37
+
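Review note: `RMSNorm` divides each feature vector by its root-mean-square and then applies a learned per-feature gain. The dependency-free sketch below mirrors that computation on a plain list so the arithmetic can be checked by hand; `rms_norm` is a hypothetical name used only for this illustration.

```python
import math


def rms_norm(x, weight=None, eps=1e-6):
    """Scale x by the reciprocal root-mean-square of its entries."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    if weight is None:
        weight = [1.0] * len(x)  # mirrors the torch.ones(dim) initialisation
    return [v * inv_rms * w for v, w in zip(x, weight)]


y = rms_norm([3.0, 4.0])
print(y)
```

With a unit gain the output has a mean square very close to 1, while the ratio between entries is unchanged.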
+ class TimestepEmbedder(nn.Module):
+     def __init__(self, hidden_size: int, frequency_embedding_size: int = 256):
+         super().__init__()
+         self.frequency_embedding_size = frequency_embedding_size
+         self.mlp = nn.Sequential(
+             nn.Linear(frequency_embedding_size, hidden_size),
+             nn.SiLU(),
+             nn.Linear(hidden_size, hidden_size),
+         )
+
+     def forward(self, t):
+         half = self.frequency_embedding_size // 2
+         freqs = torch.exp(
+             -math.log(10000.0)
+             * torch.arange(half, device=t.device, dtype=torch.float32)
+             / half
+         )
+         args = t.float()[:, None] * freqs[None]
+         emb = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
+         return self.mlp(emb.to(dtype=self.mlp[0].weight.dtype))
58
+
59
+
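Review note: before the MLP, `TimestepEmbedder.forward` builds a fixed sinusoidal feature vector, with all cosines first and all sines second. The sketch below reproduces just that step for a single scalar timestep using the standard library; `timestep_frequencies` is a hypothetical name, and the learned MLP is deliberately left out.

```python
import math


def timestep_frequencies(t: float, dim: int = 256, max_period: float = 10000.0):
    """Return [cos(t * f_i) ...] followed by [sin(t * f_i) ...] as one flat list."""
    half = dim // 2
    # Geometric frequency ladder from 1 down towards 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    args = [t * f for f in freqs]
    return [math.cos(a) for a in args] + [math.sin(a) for a in args]


emb = timestep_frequencies(0.5)
print(len(emb), emb[0], emb[128])
```

Because the first frequency is exactly 1, the first cosine and sine entries are simply `cos(t)` and `sin(t)`.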
+ class BottleneckPatchEmbed(nn.Module):
+     def __init__(self, img_size=512, patch_size=16, in_channels=3, pca_channels=128, hidden_size=1248):
+         super().__init__()
+         self.img_size = img_size
+         self.patch_size = patch_size
+         self.proj1 = nn.Conv2d(in_channels, pca_channels, kernel_size=patch_size, stride=patch_size, bias=False)
+         self.proj2 = nn.Conv2d(pca_channels, hidden_size, kernel_size=1, stride=1, bias=True)
+
+     def forward(self, x):
+         x = self.proj2(self.proj1(x))
+         return x.flatten(2).transpose(1, 2)
+
+
+ class SwiGLUMlp(nn.Module):
+     def __init__(self, in_features: int, hidden_features: int):
+         super().__init__()
+         hidden_dim = (hidden_features + 7) // 8 * 8
+         self.w1 = nn.Linear(in_features, hidden_dim, bias=False)
+         self.w3 = nn.Linear(in_features, hidden_dim, bias=False)
+         self.w2 = nn.Linear(hidden_dim, in_features, bias=False)
+
+     def forward(self, x):
+         return self.w2(F.silu(self.w1(x)) * self.w3(x))
83
+
84
+
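Review note: `SwiGLUMlp` rounds its hidden width up to a multiple of 8 and gates one projection with the SiLU of another. The scalar sketch below shows both pieces in isolation; the function names are hypothetical, and the three linear projections are omitted.

```python
import math


def swiglu_hidden_dim(hidden_features: int, multiple: int = 8) -> int:
    """Round up to the next multiple, as in (h + 7) // 8 * 8."""
    return (hidden_features + multiple - 1) // multiple * multiple


def silu(v: float) -> float:
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu_gate(a: float, b: float) -> float:
    """Elementwise core of the block: silu(w1 x) * (w3 x)."""
    return silu(a) * b


# Default block arguments in this file: hidden_size=1248, mlp_ratio=2.7.
print(swiglu_hidden_dim(int(1248 * 2.7)))
```

Widths that are already multiples of 8 pass through unchanged, so the rounding only matters for ratios such as 2.7 that do not divide evenly.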
+ class TextRotaryEmbedding1D(nn.Module):
+     def __init__(self, head_dim: int, theta: float = 10000.0):
+         super().__init__()
+         self.head_dim = head_dim
+         self.theta = theta
+
+     def forward(self, x):
+         b, length, h, d = x.shape
+         inv = 1.0 / (self.theta ** (torch.arange(0, d, 2, device=x.device, dtype=torch.float32) / d))
+         pos = torch.arange(length, device=x.device, dtype=torch.float32)
+         angles = torch.einsum("l,f->lf", pos, inv)
+         angles = torch.cat([angles, angles], dim=-1)
+         cos = angles.cos().to(dtype=x.dtype)
+         sin = angles.sin().to(dtype=x.dtype)
+         return x * cos[None, :, None, :] + rotate_half(x) * sin[None, :, None, :]
100
+
101
+
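Review note: the rotary embedding pairs element `i` with element `i + d/2` (the half-split convention of `rotate_half`) and rotates each pair by a position-dependent angle. The sketch below applies the same rotation to a single vector in pure Python and checks the property that makes rotary embeddings useful: the query-key dot product depends only on the position difference. `rope_1d` and `dot` are hypothetical helper names.

```python
import math


def rope_1d(x, pos, theta=10000.0):
    """Rotate vector x (even length d) for integer position pos."""
    d = len(x)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        angle = pos * theta ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        # Same result as x * cos + rotate_half(x) * sin for this pair.
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i + half] * c + x[i] * s
    return out


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]
score_a = dot(rope_1d(q, 3), rope_1d(k, 1))  # positions 3 and 1
score_b = dot(rope_1d(q, 7), rope_1d(k, 5))  # positions 7 and 5, same offset
print(score_a, score_b)
```

The two scores agree up to floating-point error, and the rotation leaves the vector norm unchanged.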
+ class VisionRotaryEmbeddingFast(nn.Module):
+     def __init__(self, head_dim: int, theta: float = 10000.0):
+         super().__init__()
+         self.dim = head_dim // 2
+         self.theta = theta
+
+     def forward(self, x):
+         length = x.shape[1]
+         side = int(math.sqrt(length))
+         if side * side != length:
+             raise ValueError(f"image token length must be square, got {length}")
+         freqs = 1.0 / (
+             self.theta
+             ** (torch.arange(0, self.dim, 2, device=x.device, dtype=torch.float32)[: self.dim // 2] / self.dim)
+         )
+         t = torch.arange(side, device=x.device, dtype=torch.float32)
+         base = torch.einsum("l,f->lf", t, freqs)
+         f_h, f_w = torch.broadcast_tensors(base[:, None, :], base[None, :, :])
+         angles = torch.cat([f_h, f_w], dim=-1)
+         angles = torch.cat([angles, angles], dim=-1).reshape(length, -1)
+         cos = angles.cos().to(dtype=x.dtype)
+         sin = angles.sin().to(dtype=x.dtype)
+         return x * cos[None, :, None, :] + rotate_half(x) * sin[None, :, None, :]
125
+
126
+
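Review note: for image tokens the rotation angles are built from the row index and the column index separately, then duplicated so that paired elements share an angle. The sketch below rebuilds that angle table with plain loops; `vision_rope_angles` is a hypothetical name, and only the angle layout is shown, not the rotation itself.

```python
def vision_rope_angles(side: int, head_dim: int, theta: float = 10000.0):
    """Angle table with one row per image token, in row-major (h, w) order."""
    dim = head_dim // 2
    freqs = [theta ** (-j / dim) for j in range(0, dim, 2)]  # dim // 2 values
    table = []
    for h in range(side):
        for w in range(side):
            # First the row-index angles, then the column-index angles.
            per_axis = [h * f for f in freqs] + [w * f for f in freqs]
            table.append(per_axis + per_axis)  # duplicated to fill head_dim
    return table


angles = vision_rope_angles(side=4, head_dim=8)
print(angles[6])  # token at row 1, column 2
```

Each row has `head_dim` entries, so the table lines up with the per-head feature dimension of the queries and keys.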
+ class MultiModalRotaryEmbeddingFast(nn.Module):
+     def __init__(self, head_dim: int):
+         super().__init__()
+         self.text_rope = TextRotaryEmbedding1D(head_dim)
+         self.vision_rope = VisionRotaryEmbeddingFast(head_dim)
+
+     def forward(self, x, txt_len: int):
+         txt = self.text_rope(x[:, :txt_len])
+         img = self.vision_rope(x[:, txt_len:])
+         return torch.cat([txt, img], dim=1)
+
+
+ class PlainTextTransformerBlock(nn.Module):
+     def __init__(self, hidden_size=1248, num_heads=24, head_dim=52, mlp_ratio=2.7):
+         super().__init__()
+         self.num_heads = num_heads
+         self.head_dim = head_dim
+         inner_dim = num_heads * head_dim
+         self.norm1 = RMSNorm(hidden_size)
+         self.norm2 = RMSNorm(hidden_size)
+         self.qkv = nn.Linear(hidden_size, inner_dim * 3)
+         self.attn_proj = nn.Linear(inner_dim, hidden_size)
+         self.mlp = SwiGLUMlp(hidden_size, int(hidden_size * mlp_ratio))
+         self.q_norm = RMSNorm(head_dim)
+         self.k_norm = RMSNorm(head_dim)
+         self.rope = TextRotaryEmbedding1D(head_dim)
+
+     def forward(self, txt):
+         b, length, _ = txt.shape
+         qkv = self.qkv(self.norm1(txt)).reshape(b, length, 3, self.num_heads, self.head_dim)
+         q, k, v = qkv[:, :, 0], qkv[:, :, 1], qkv[:, :, 2]
+         q = self.rope(self.q_norm(q))
+         k = self.rope(self.k_norm(k))
+         attn = torch.einsum("bqhd,bkhd->bhqk", q, k) * (self.head_dim ** -0.5)
+         out = torch.einsum("bhqk,bkhd->bqhd", attn.softmax(dim=-1), v).reshape(b, length, -1)
+         txt = txt + self.attn_proj(out)
+         txt = txt + self.mlp(self.norm2(txt))
+         return txt
+
+
+ class DoubleStreamDiTBlock(nn.Module):
+     def __init__(self, hidden_size=1248, txt_hidden_size=1248, num_heads=24, head_dim=52, mlp_ratio=2.7):
+         super().__init__()
+         self.hidden_size = hidden_size
+         self.txt_hidden_size = txt_hidden_size
+         self.num_heads = num_heads
+         self.head_dim = head_dim
+         inner_dim = num_heads * head_dim
+         self.img_norm1 = RMSNorm(hidden_size)
+         self.img_norm2 = RMSNorm(hidden_size)
+         self.txt_norm1 = RMSNorm(txt_hidden_size)
+         self.txt_norm2 = RMSNorm(txt_hidden_size)
+         self.img_qkv = nn.Linear(hidden_size, inner_dim * 3)
+         self.txt_qkv = nn.Linear(txt_hidden_size, inner_dim * 3)
+         self.q_norm = RMSNorm(head_dim)
+         self.k_norm = RMSNorm(head_dim)
+         self.rope = MultiModalRotaryEmbeddingFast(head_dim)
+         self.img_attn_proj = nn.Linear(inner_dim, hidden_size)
+         self.txt_attn_proj = nn.Linear(inner_dim, txt_hidden_size)
+         self.img_mlp = SwiGLUMlp(hidden_size, int(hidden_size * mlp_ratio))
+         self.txt_mlp = SwiGLUMlp(txt_hidden_size, int(txt_hidden_size * mlp_ratio))
+
+     def forward(self, x, txt, vec):
+         b, li, _ = x.shape
+         lt = txt.shape[1]
+         x_norm = self.img_norm1(x)
+         txt_norm = self.txt_norm1(txt)
+         qkv_i = self.img_qkv(x_norm).reshape(b, li, 3, self.num_heads, self.head_dim)
+         qkv_t = self.txt_qkv(txt_norm).reshape(b, lt, 3, self.num_heads, self.head_dim)
+         q_i, k_i, v_i = qkv_i[:, :, 0], qkv_i[:, :, 1], qkv_i[:, :, 2]
+         q_t, k_t, v_t = qkv_t[:, :, 0], qkv_t[:, :, 1], qkv_t[:, :, 2]
+         q_i, k_i = self.q_norm(q_i), self.k_norm(k_i)
+         q_t, k_t = self.q_norm(q_t), self.k_norm(k_t)
+         q = self.rope(torch.cat([q_t, q_i], dim=1), txt_len=lt)
+         k = self.rope(torch.cat([k_t, k_i], dim=1), txt_len=lt)
+         v = torch.cat([v_t, v_i], dim=1)
+         attn = torch.einsum("bqhd,bkhd->bhqk", q, k) * (self.head_dim ** -0.5)
+         out = torch.einsum("bhqk,bkhd->bqhd", attn.softmax(dim=-1), v)
+         x = x + self.img_attn_proj(out[:, lt:].reshape(b, li, -1))
+         txt = txt + self.txt_attn_proj(out[:, :lt].reshape(b, lt, -1))
+         x = x + self.img_mlp(self.img_norm2(x))
+         txt = txt + self.txt_mlp(self.txt_norm2(txt))
+         return x, txt
210
+
211
+
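Review note: the double-stream block keeps separate weights for text and image tokens but runs one attention over the concatenated sequence, text first, and then splits the result back into the two streams. The single-head sketch below illustrates only that concatenate, attend, split pattern. It leaves out the QK normalisation, rotary embedding, multiple heads, and output projections, and `joint_attention` is a hypothetical name.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(q_txt, k_txt, v_txt, q_img, k_img, v_img):
    """Single-head attention over the concatenated [text, image] sequence."""
    q, k, v = q_txt + q_img, k_txt + k_img, v_txt + v_img
    scale = len(q[0]) ** -0.5
    out = []
    for qi in q:
        weights = softmax([scale * sum(a * b for a, b in zip(qi, kj)) for kj in k])
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(len(v[0]))])
    n_txt = len(q_txt)
    return out[:n_txt], out[n_txt:]  # text rows first, then image rows


txt_out, img_out = joint_attention(
    q_txt=[[0.0, 0.0]], k_txt=[[0.0, 0.0]], v_txt=[[1.0, 0.0]],
    q_img=[[0.0, 0.0]], k_img=[[0.0, 0.0]], v_img=[[0.0, 1.0]],
)
print(txt_out, img_out)
```

With all-zero queries the attention weights are uniform, so both streams receive the average of the text value and the image value, which shows that each stream attends to the other.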
+ class FinalLayer(nn.Module):
+     def __init__(self, hidden_size=1248, patch_size=16, out_channels=3):
+         super().__init__()
+         self.patch_size = patch_size
+         self.out_channels = out_channels
+         self.norm_final = RMSNorm(hidden_size)
+         self.linear = nn.Linear(hidden_size, patch_size * patch_size * out_channels)
+
+     def forward(self, x, vec=None):
+         return self.linear(self.norm_final(x))
+
+
+ def get_2d_sincos_pos_embed(embed_dim, grid_size, device, dtype):
+     grid_h = torch.arange(grid_size, device=device, dtype=torch.float32)
+     grid_w = torch.arange(grid_size, device=device, dtype=torch.float32)
+     grid = torch.meshgrid(grid_w, grid_h, indexing="xy")
+     grid = torch.stack(grid, dim=0).reshape(2, 1, grid_size, grid_size)
+     emb_h = get_1d_sincos_pos_embed(embed_dim // 2, grid[0])
+     emb_w = get_1d_sincos_pos_embed(embed_dim // 2, grid[1])
+     return torch.cat([emb_h, emb_w], dim=1).to(dtype=dtype)
+
+
+ def get_1d_sincos_pos_embed(embed_dim, pos):
+     omega = torch.arange(embed_dim // 2, device=pos.device, dtype=torch.float32)
+     omega = 1.0 / (10000 ** (omega / (embed_dim / 2.0)))
+     out = torch.einsum("m,d->md", pos.reshape(-1), omega)
+     return torch.cat([out.sin(), out.cos()], dim=1)
239
+
240
+
+ @dataclass
+ class MMJiTConfig:
+     image_size: int = 512
+     patch_size: int = 16
+     in_channels: int = 3
+     txt_input_size: int = 1024
+     hidden_size: int = 768
+     txt_hidden_size: int = 768
+     cond_vec_size: int = 768
+     depth_double: int = 17
+     txt_preamble_depth: int = 2
+     num_heads: int = 12
+     head_dim: int = 64
+     mlp_ratio: float = 2.6667
+     pca_channels: int = 128
+     prompt_length: int = 256
+     n_T: int = 100
+     prediction: str = "x"
+     sampler: str = "euler"
+     cfg_channels: int = 3
+     cfg_interval: tuple = (0.0, 1.0)
+     llm: str = "google/flan-t5-large"
263
+
264
+
+ class MMJiT(nn.Module):
+     def __init__(self, cfg: MMJiTConfig):
+         super().__init__()
+         self.cfg = cfg
+         self.latent_img_size = cfg.image_size // cfg.patch_size
+         self.img_embedder = BottleneckPatchEmbed(
+             cfg.image_size, cfg.patch_size, cfg.in_channels, cfg.pca_channels, cfg.hidden_size
+         )
+         self.txt_embedder = nn.Linear(cfg.txt_input_size, cfg.txt_hidden_size, bias=False)
+         self.mask_token = nn.Parameter(torch.zeros(1, 1, cfg.txt_input_size))
+         self.t_embedder = TimestepEmbedder(cfg.cond_vec_size)
+         self.pooled_embedder = nn.Linear(cfg.txt_input_size, cfg.cond_vec_size, bias=False)
+         self.txt_preamble_blocks = nn.ModuleList(
+             [
+                 PlainTextTransformerBlock(cfg.txt_hidden_size, cfg.num_heads, cfg.head_dim, cfg.mlp_ratio)
+                 for _ in range(cfg.txt_preamble_depth)
+             ]
+         )
+         self.double_blocks = nn.ModuleList(
+             [
+                 DoubleStreamDiTBlock(
+                     cfg.hidden_size, cfg.txt_hidden_size, cfg.num_heads, cfg.head_dim, cfg.mlp_ratio
+                 )
+                 for _ in range(cfg.depth_double)
+             ]
+         )
+         self.final_layer = FinalLayer(cfg.hidden_size, cfg.patch_size, cfg.in_channels)
+
+     def unpatchify(self, x):
+         b = x.shape[0]
+         p = self.cfg.patch_size
+         c = self.cfg.in_channels
+         h = w = int(math.sqrt(x.shape[1]))
+         x = x.reshape(b, h, w, p, p, c)
+         x = torch.einsum("nhwpqc->nchpwq", x)
+         return x.reshape(b, c, h * p, w * p)
+
+     def forward(self, img, t, context, attn_mask):
+         if img.ndim == 4 and img.shape[1] != self.cfg.in_channels:
+             img = img.permute(0, 3, 1, 2)
+         attn_mask = attn_mask.to(device=context.device)
+         context = torch.where(attn_mask[:, :, None] > 0.5, context, self.mask_token.to(dtype=context.dtype))
+         x = self.img_embedder(img)
+         pos = get_2d_sincos_pos_embed(self.cfg.hidden_size, self.latent_img_size, x.device, x.dtype)
+         x = x + pos[None]
+         t_vec = self.t_embedder(t)
+         txt = self.txt_embedder(context.to(dtype=self.txt_embedder.weight.dtype))
+         pooled_text = context.mean(dim=1)
+         vec = t_vec + self.pooled_embedder(pooled_text.to(dtype=self.pooled_embedder.weight.dtype))
+         for block in self.txt_preamble_blocks:
+             txt = block(txt)
+         for block in self.double_blocks:
+             x, txt = block(x, txt, vec)
+         combined = torch.cat([txt, x], dim=1)
+         out = self.final_layer(combined, vec)
+         img_out = out[:, txt.shape[1] :, :]
+         return self.unpatchify(img_out)
322
+
323
+
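Review note: `unpatchify` turns a sequence of per-patch vectors back into an image. The index mapping is easy to get wrong, so the sketch below spells it out with explicit loops on nested lists. It follows the same ordering as the reshape and einsum above (channel varies fastest inside a patch, patches are laid out row-major), but it is an illustration with a hypothetical standalone signature, not a drop-in replacement.

```python
def unpatchify(tokens, grid: int, patch: int, channels: int):
    """Rearrange grid*grid tokens of patch*patch*channels values into [C][H][W]."""
    size = grid * patch
    image = [[[0.0] * size for _ in range(size)] for _ in range(channels)]
    for n, token in enumerate(tokens):
        h, w = divmod(n, grid)               # row-major patch position
        for i, value in enumerate(token):
            pq, c = divmod(i, channels)      # channel varies fastest
            p, q = divmod(pq, patch)         # then column, then row in the patch
            image[c][h * patch + p][w * patch + q] = value
    return image


# Four 2x2 single-channel patches on a 2x2 grid.
tokens = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
image = unpatchify(tokens, grid=2, patch=2, channels=1)
for row in image[0]:
    print(row)
```

Each token fills one contiguous 2x2 block of the output, which is the behaviour the tensor version obtains with a single einsum.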
+ class DiffusionModel(nn.Module):
+     def __init__(self, cfg: Optional[MMJiTConfig] = None):
+         super().__init__()
+         self.cfg = cfg or MMJiTConfig()
+         self.net = MMJiT(self.cfg)
+
+     def real_t_to_embed_t(self, t):
+         return t
+
+     def pred_velocity(self, x, t, text, mask):
+         # The network output x0 is treated as the predicted endpoint of the
+         # trajectory (reached at t = 1) and converted to a velocity. The
+         # denominator 1 - t is clamped so it never drops below 0.05.
+         x0 = self.net(x, self.real_t_to_embed_t(t), text, mask)
+         return (x0 - x) / torch.clamp(1 - t[:, None, None, None], min=0.05)
+
+     def cfg_velocity(self, x, t, text, mask, cfg_scale: float):
+         # Evaluate the conditional and unconditional branches in one batch;
+         # the unconditional half uses an all-zero text mask.
+         b = x.shape[0]
+         xx = torch.cat([x, x], dim=0)
+         tt = torch.cat([t, t], dim=0)
+         yy = torch.cat([text, text], dim=0)
+         mm = torch.cat([mask, torch.zeros_like(mask)], dim=0)
+         out = self.pred_velocity(xx, tt, yy, mm)
+         cond, uncond = out[:b], out[b:]
+         # Apply cfg_scale only for timesteps inside cfg_interval; use 1.0 elsewhere.
+         use_cfg = ((t >= self.cfg.cfg_interval[0]) & (t <= self.cfg.cfg_interval[1])).to(out.dtype)
+         scale = torch.where(
+             use_cfg[:, None, None, None] > 0,
+             torch.tensor(cfg_scale, device=x.device, dtype=out.dtype),
+             torch.tensor(1.0, device=x.device, dtype=out.dtype),
+         )
+         return uncond + (cond - uncond) * scale
352
+
353
+ @torch.no_grad()
354
+ def sample(self, text, mask, cfg_scale=6.0, generator=None, progress=False):
355
+ b = text.shape[0]
356
+ device = text.device
357
+ dtype = next(self.parameters()).dtype
358
+ x = torch.randn(
359
+ b,
360
+ self.cfg.in_channels,
361
+ self.cfg.image_size,
362
+ self.cfg.image_size,
363
+ generator=generator,
364
+ device=device,
365
+ dtype=dtype,
366
+ ) * 2
367
+ timesteps = torch.linspace(0.0, 1.0, self.cfg.n_T + 1, device=device, dtype=dtype)
368
+ iterator = range(self.cfg.n_T)
369
+ if progress:
370
+ from tqdm.auto import tqdm
371
+
372
+ iterator = tqdm(iterator)
373
+ for i in iterator:
374
+ t_cur = timesteps[i].expand(b)
375
+ t_next = timesteps[i + 1].expand(b)
376
+ v = self.cfg_velocity(x, t_cur, text.to(dtype), mask.to(dtype), cfg_scale)
377
+ x = x + (t_next - t_cur)[:, None, None, None] * v
378
+ return x
379
+
380
+
381
+ class MiniT2IMMJiTModel(ModelMixin, ConfigMixin):
382
+ """MiniT2I MM-JiT transformer for pixel-space flow matching."""
383
+
384
+ config_name = "config.json"
385
+
386
+ @register_to_config
387
+ def __init__(
388
+ self,
389
+ image_size: int = 512,
390
+ patch_size: int = 16,
391
+ in_channels: int = 3,
392
+ txt_input_size: int = 1024,
393
+ hidden_size: int = 768,
394
+ txt_hidden_size: int = 768,
395
+ cond_vec_size: int = 768,
396
+ depth_double: int = 17,
397
+ txt_preamble_depth: int = 2,
398
+ num_heads: int = 12,
399
+ head_dim: int = 64,
400
+ mlp_ratio: float = 2.6666666666666665,
401
+ pca_channels: int = 128,
402
+ prompt_length: int = 256,
403
+ n_T: int = 100,
404
+ prediction: str = "x",
405
+ sampler: str = "euler",
406
+ cfg_channels: int = 3,
407
+ cfg_interval: tuple = (0.0, 1.0),
408
+ llm: str = "google/flan-t5-large",
409
+ ):
410
+ super().__init__()
411
+ cfg = MMJiTConfig(
412
+ image_size=image_size,
413
+ patch_size=patch_size,
414
+ in_channels=in_channels,
415
+ txt_input_size=txt_input_size,
416
+ hidden_size=hidden_size,
417
+ txt_hidden_size=txt_hidden_size,
418
+ cond_vec_size=cond_vec_size,
419
+ depth_double=depth_double,
420
+ txt_preamble_depth=txt_preamble_depth,
421
+ num_heads=num_heads,
422
+ head_dim=head_dim,
423
+ mlp_ratio=mlp_ratio,
424
+ pca_channels=pca_channels,
425
+ prompt_length=prompt_length,
426
+ n_T=n_T,
427
+ prediction=prediction,
428
+ sampler=sampler,
429
+ cfg_channels=cfg_channels,
430
+ cfg_interval=tuple(cfg_interval),
431
+ llm=llm,
432
+ )
433
+ self.model = DiffusionModel(cfg)
434
+
435
+ @property
436
+ def mmjit_config(self) -> MMJiTConfig:
437
+ return self.model.cfg
438
+
439
+ def forward(self, img, t, context, attn_mask):
440
+ return self.model.net(img, t, context, attn_mask)
441
+
442
+ def pred_velocity(self, x, t, text, mask):
443
+ return self.model.pred_velocity(x, t, text, mask)
444
+
445
+ def sample(self, text, mask, cfg_scale=6.0, generator=None, progress=False):
446
+ return self.model.sample(text, mask, cfg_scale=cfg_scale, generator=generator, progress=progress)
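The `pred_velocity` and `sample` methods in the file above turn an x-prediction into a velocity and integrate it with uniform Euler steps from t=0 (noise) to t=1 (data). The scalar sketch below reproduces only that arithmetic. It assumes a hypothetical oracle `predict_x0` that always returns the clean target; the real predictor is the `MMJiT` network, and classifier-free guidance is left out.

```python
def pred_velocity(x, t, predict_x0, min_denom=0.05):
    """Convert an x-prediction into a velocity: (x0 - x) / clamp(1 - t, min=0.05)."""
    x0 = predict_x0(x, t)
    return (x0 - x) / max(1.0 - t, min_denom)


def euler_sample(x_init, predict_x0, n_steps=100):
    """Integrate from t=0 to t=1 with uniform Euler steps, as in `sample`."""
    x = x_init
    for i in range(n_steps):
        t_cur = i / n_steps
        t_next = (i + 1) / n_steps
        x = x + (t_next - t_cur) * pred_velocity(x, t_cur, predict_x0)
    return x


# Hypothetical oracle predictor (illustration only): always predicts 0.5.
target = 0.5
final = euler_sample(x_init=2.0, predict_x0=lambda x, t: target)
residual = final - target  # small but nonzero because of the clamp near t=1
```

With 100 steps and an exact predictor, the offset from the target shrinks linearly to 4% of its initial value by t=0.96. The last four steps use the clamped denominator and each multiply the offset by 0.8, so about 1.6% of the initial offset remains instead of exactly zero.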
MiniT2I-L-16/model_index.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "_class_name": [
+     "pipeline",
+     "MiniT2ITextToImagePipeline"
+   ],
+   "_diffusers_version": "0.32.0",
+   "default_num_inference_steps": 100,
+   "model_type": "l16",
+   "recommended_guidance_scale": 6.0,
+   "scheduler": [
+     "diffusers",
+     "FlowMatchEulerDiscreteScheduler"
+   ],
+   "text_encoder": [
+     "transformers",
+     "T5EncoderModel"
+   ],
+   "tokenizer": [
+     "transformers",
+     "T5Tokenizer"
+   ],
+   "transformer": [
+     "transformer_minit2i",
+     "MiniT2IMMJiTModel"
+   ]
+ }
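In `model_index.json`, each loadable component is stored as a two-element `[library_or_module, class_name]` list, while scalar entries such as `model_type` are plain config values. The sketch below shows one way to separate the two with the standard `json` module. The embedded string is a condensed copy of the file above, and `list_components` is a hypothetical helper, not part of diffusers.

```python
import json

# Condensed copy of the model_index.json entries shown above.
MODEL_INDEX = """
{
  "_class_name": ["pipeline", "MiniT2ITextToImagePipeline"],
  "_diffusers_version": "0.32.0",
  "default_num_inference_steps": 100,
  "model_type": "l16",
  "recommended_guidance_scale": 6.0,
  "scheduler": ["diffusers", "FlowMatchEulerDiscreteScheduler"],
  "text_encoder": ["transformers", "T5EncoderModel"],
  "tokenizer": ["transformers", "T5Tokenizer"],
  "transformer": ["transformer_minit2i", "MiniT2IMMJiTModel"]
}
"""


def list_components(index_text):
    """Return {component_name: (library_or_module, class_name)} for component entries."""
    index = json.loads(index_text)
    return {
        name: tuple(value)
        for name, value in index.items()
        # Skip private keys such as "_class_name" and scalar config values.
        if not name.startswith("_") and isinstance(value, list) and len(value) == 2
    }


components = list_components(MODEL_INDEX)
```

Here the `transformer` entry names `transformer_minit2i` rather than an installed library, which suggests it is resolved from the custom module shipped in the repository's `transformer/` folder.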
MiniT2I-L-16/pipeline.py ADDED
@@ -0,0 +1,444 @@
1
+ """Hub custom pipeline: MiniT2ITextToImagePipeline.
2
+ Load with native Hugging Face diffusers and trust_remote_code=True.
3
+ """
4
+
5
+ from __future__ import annotations
6
+
7
+ from diffusers.image_processor import VaeImageProcessor
8
+ from diffusers.pipelines.pipeline_utils import DiffusionPipeline, ImagePipelineOutput
9
+ from diffusers.schedulers import FlowMatchEulerDiscreteScheduler
10
+ from diffusers.schedulers.scheduling_utils import KarrasDiffusionSchedulers
11
+ from diffusers.utils import BaseOutput
12
+ from diffusers.utils.torch_utils import randn_tensor
13
+ # Copyright 2025 The HuggingFace Team. All rights reserved.
14
+ #
15
+ # Licensed under the Apache License, Version 2.0 (the "License");
16
+ # you may not use this file except in compliance with the License.
17
+ # You may obtain a copy of the License at
18
+ #
19
+ # http://www.apache.org/licenses/LICENSE-2.0
20
+ #
21
+ # Unless required by applicable law or agreed to in writing, software
22
+ # distributed under the License is distributed on an "AS IS" BASIS,
23
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
24
+ # See the License for the specific language governing permissions and
25
+ # limitations under the License.
26
+ import inspect
27
+ import json
28
+ import os
29
+ from pathlib import Path
30
+ from typing import Any, Dict, List, Optional, Tuple, Union
31
+
32
+ os.environ.setdefault("USE_FLAX", "0")
33
+ os.environ.setdefault("TRANSFORMERS_NO_FLAX", "1")
34
+
35
+ import torch
36
+ from huggingface_hub import snapshot_download
37
+ from PIL import Image
38
+ from transformers import AutoTokenizer, T5EncoderModel
39
+ from transformers import logging as transformers_logging
40
+
41
+ transformers_logging.set_verbosity_error()
42
+
43
+ DEFAULT_NUM_INFERENCE_STEPS = 100
44
+ NOISE_INIT_SCALE = 2.0
45
+
46
+ EXAMPLE_DOC_STRING = """
47
+ Examples:
48
+ ```py
49
+ >>> from pathlib import Path
50
+ >>> import torch
51
+ >>> from diffusers import DiffusionPipeline, FlowMatchEulerDiscreteScheduler
52
+
53
+ >>> model_dir = Path("./minit2i-diffusers").resolve()
54
+ >>> pipe = DiffusionPipeline.from_pretrained(
55
+ ... str(model_dir),
56
+ ... local_files_only=True,
57
+ ... custom_pipeline=str(model_dir / "pipeline.py"),
58
+ ... trust_remote_code=True,
59
+ ... torch_dtype=torch.bfloat16,
60
+ ... )
61
+ >>> pipe.to("cuda")
62
+ >>> pipe.scheduler = FlowMatchEulerDiscreteScheduler.from_config(pipe.scheduler.config)
63
+
64
+ >>> generator = torch.Generator(device="cuda").manual_seed(42)
65
+ >>> image = pipe(
66
+ ... "a cinematic portrait of a robot musician",
67
+ ... num_inference_steps=100,
68
+ ... guidance_scale=6.0,
69
+ ... generator=generator,
70
+ ... ).images[0]
71
+ >>> image.save("demo.png")
72
+ ```
73
+ """
74
+
75
+ MODEL_ALIASES: Dict[str, str] = {
76
+ "b": "minit2i-b-16",
77
+ "b16": "minit2i-b-16",
78
+ "b-16": "minit2i-b-16",
79
+ "base": "minit2i-b-16",
80
+ "minit2i-b16": "minit2i-b-16",
81
+ "minit2i-b-16": "minit2i-b-16",
82
+ "minit2i-b/16": "minit2i-b-16",
83
+ "l": "minit2i-l-16",
84
+ "l16": "minit2i-l-16",
85
+ "l-16": "minit2i-l-16",
86
+ "large": "minit2i-l-16",
87
+ "minit2i-l16": "minit2i-l-16",
88
+ "minit2i-l-16": "minit2i-l-16",
89
+ "minit2i-l/16": "minit2i-l-16",
90
+ }
91
+
92
+ def resolve_model_type(model_type: str) -> str:
93
+ key = model_type.lower().replace("_", "-")
94
+ if key not in MODEL_ALIASES:
95
+ choices = ", ".join(sorted(set(MODEL_ALIASES)))
96
+ raise ValueError(f"Unknown model_type={model_type!r}. Expected one of: {choices}")
97
+ return MODEL_ALIASES[key]
98
+
99
+ class MiniT2ITextToImagePipeline(DiffusionPipeline):
100
+ r"""
101
+ Text-to-image pipeline for MiniT2I pixel-space flow matching.
102
+
103
+ Parameters:
104
+ transformer ([`MiniT2IMMJiTModel`]):
105
+ MiniT2I MM-JiT transformer that predicts flow-matching velocity in pixel space.
106
+ scheduler ([`FlowMatchEulerDiscreteScheduler`]):
107
+ Flow-matching Euler scheduler. Other [`KarrasDiffusionSchedulers`] can be swapped at inference time.
108
+ tokenizer ([`AutoTokenizer`], *optional*):
109
+ Tokenizer for the text encoder.
110
+ text_encoder ([`T5EncoderModel`], *optional*):
111
+ Text encoder used to embed prompts.
112
+ """
113
+
114
+ model_cpu_offload_seq = "text_encoder->transformer"
115
+ _optional_components = ["tokenizer", "text_encoder"]
116
+
117
+ def __init__(
118
+ self,
119
+ transformer,
120
+ scheduler,
121
+ tokenizer=None,
122
+ text_encoder=None,
123
+ text_encoder_name: str = "google/flan-t5-large",
124
+ model_type: str = "b16",
125
+ repo_id_or_path: Optional[str] = None,
126
+ default_num_inference_steps: int = DEFAULT_NUM_INFERENCE_STEPS,
127
+ ):
128
+ super().__init__()
129
+ if scheduler is None:
130
+ scheduler = self._default_inference_scheduler()
131
+ self.register_modules(
132
+ transformer=transformer,
133
+ scheduler=scheduler,
134
+ tokenizer=tokenizer,
135
+ text_encoder=text_encoder,
136
+ )
137
+ self.register_to_config(
138
+ text_encoder_name=text_encoder_name,
139
+ model_type=model_type,
140
+ repo_id_or_path=repo_id_or_path,
141
+ default_num_inference_steps=int(default_num_inference_steps),
142
+ )
143
+ self._variant_transformers: Dict[str, MiniT2IMMJiTModel] = {}
144
+ self._active_model_type = resolve_model_type(model_type)
145
+
146
+ @staticmethod
147
+ def _default_inference_scheduler() -> FlowMatchEulerDiscreteScheduler:
148
+ return FlowMatchEulerDiscreteScheduler(
149
+ num_train_timesteps=1000,
150
+ shift=1.0,
151
+ stochastic_sampling=False,
152
+ )
153
+
154
+ @classmethod
155
+ def _load_scheduler_from_dir(
156
+ cls,
157
+ scheduler_dir: Path,
158
+ model_kwargs: Dict[str, Any],
159
+ ) -> Tuple[KarrasDiffusionSchedulers, int]:
160
+ config_path = scheduler_dir / "scheduler_config.json"
161
+ if not config_path.exists():
162
+ return cls._default_inference_scheduler(), DEFAULT_NUM_INFERENCE_STEPS
163
+
164
+ config = json.loads(config_path.read_text(encoding="utf-8"))
165
+ class_name = config.get("_class_name", "")
166
+ default_steps = int(config.get("num_inference_steps", DEFAULT_NUM_INFERENCE_STEPS))
167
+
168
+ if class_name == "MiniT2IFlowMatchScheduler":
169
+ return cls._default_inference_scheduler(), default_steps
170
+
171
+ import diffusers.schedulers as schedulers_pkg  # `_hf` is never defined in this file
172
+ if hasattr(schedulers_pkg, class_name):
173
+ scheduler_cls = getattr(schedulers_pkg, class_name)
174
+ return scheduler_cls.from_pretrained(str(scheduler_dir), **model_kwargs), default_steps
175
+
176
+ return cls._default_inference_scheduler(), default_steps
177
+
178
+ @staticmethod
179
+ def _resolve_transformer_path(root: Path, variant_dir: str) -> Path:
180
+ variant_transformer = root / variant_dir / "transformer"
181
+ if variant_transformer.exists():
182
+ return variant_transformer
183
+ root_transformer = root / "transformer"
184
+ if root_transformer.exists():
185
+ return root_transformer
186
+ raise FileNotFoundError(
187
+ f"Could not find transformer weights under {root}. "
188
+ f"Tried {variant_transformer} and {root_transformer}."
189
+ )
190
+
191
+ def _get_transformer(
192
+ self,
193
+ model_type: Optional[str],
194
+ repo_id_or_path: Optional[str],
195
+ torch_dtype: Optional[torch.dtype] = None,
196
+ variant: Optional[str] = None,
197
+ ) -> MiniT2IMMJiTModel:
198
+ active_type = resolve_model_type(model_type or self.config.model_type)
199
+ if active_type == self._active_model_type and self.transformer is not None:
200
+ return self.transformer
201
+ if active_type in self._variant_transformers:
202
+ return self._variant_transformers[active_type]
203
+
204
+ repo = repo_id_or_path or self.config.repo_id_or_path
205
+ if repo is None:
206
+ raise ValueError("model_type switching requires repo_id_or_path to be set on the pipeline.")
207
+
208
+ root = Path(repo)
209
+ if not root.exists():
210
+ root = Path(snapshot_download(repo_id=str(repo)))
211
+ transformer = MiniT2IMMJiTModel.from_pretrained(
212
+ self._resolve_transformer_path(root, active_type),
213
+ torch_dtype=torch_dtype,
214
+ variant=variant,
215
+ )
216
+ self._variant_transformers[active_type] = transformer
217
+ if active_type == resolve_model_type(self.config.model_type):
218
+ self.transformer = transformer
219
+ self._active_model_type = active_type
220
+ return transformer
221
+
222
+ @staticmethod
223
+ def prepare_extra_step_kwargs(
224
+ scheduler,
225
+ generator: Optional[Union[torch.Generator, List[torch.Generator]]],
226
+ ) -> Dict[str, Any]:
227
+ kwargs: Dict[str, Any] = {}
228
+ step_params = set(inspect.signature(scheduler.step).parameters.keys())
229
+ if "generator" in step_params:
230
+ kwargs["generator"] = generator
231
+ return kwargs
232
+
233
+ def check_inputs(
234
+ self,
235
+ prompt: Union[str, List[str]],
236
+ guidance_scale: float,
237
+ num_inference_steps: int,
238
+ output_type: str,
239
+ ) -> None:
240
+ if not isinstance(prompt, str) and not (isinstance(prompt, list) and all(isinstance(p, str) for p in prompt)):
241
+ raise TypeError(f"`prompt` must be a string or list of strings, got {type(prompt)}.")
242
+ if guidance_scale < 0:
243
+ raise ValueError(f"`guidance_scale` must be non-negative, got {guidance_scale}.")
244
+ if num_inference_steps <= 0:
245
+ raise ValueError(f"`num_inference_steps` must be positive, got {num_inference_steps}.")
246
+ if output_type not in {"pil", "np", "pt", "latent"}:
247
+ raise ValueError(f"Unsupported `output_type`: {output_type}")
248
+
249
+ def prepare_latents(
250
+ self,
251
+ batch_size: int,
252
+ image_size: int,
253
+ in_channels: int,
254
+ device: torch.device,
255
+ dtype: torch.dtype,
256
+ generator: Optional[torch.Generator] = None,
257
+ latents: Optional[torch.Tensor] = None,
258
+ ) -> torch.Tensor:
259
+ shape = (batch_size, in_channels, image_size, image_size)
260
+ if latents is None:
261
+ latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
262
+ latents = latents * NOISE_INIT_SCALE
263
+ else:
264
+ latents = latents.to(device=device, dtype=dtype)
265
+ if tuple(latents.shape) != shape:
266
+ raise ValueError(f"Invalid `latents` shape: {tuple(latents.shape)}. Expected {shape}.")
267
+ return latents
268
+
269
+ def _encode_prompt(
270
+ self,
271
+ prompt: Union[str, List[str]],
272
+ device: torch.device,
273
+ transformer = None,
274
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
275
+ if isinstance(prompt, str):
276
+ prompt = [prompt]
277
+ transformer = transformer or self.transformer
278
+ if self.tokenizer is None:
279
+ self.tokenizer = AutoTokenizer.from_pretrained(self.config.text_encoder_name)
280
+ if self.text_encoder is None:
281
+ self.text_encoder = T5EncoderModel.from_pretrained(self.config.text_encoder_name)
282
+ if next(self.text_encoder.parameters()).device != device:
283
+ self.text_encoder.to(device)
284
+ cfg = transformer.mmjit_config
285
+ tokens = self.tokenizer(
286
+ prompt,
287
+ return_tensors="pt",
288
+ padding="max_length",
289
+ truncation=True,
290
+ max_length=cfg.prompt_length,
291
+ )
292
+ input_ids = tokens.input_ids.to(device)
293
+ attn = tokens.attention_mask.to(device)
294
+ text = self.text_encoder(input_ids=input_ids, attention_mask=attn).last_hidden_state
295
+ return text, attn
296
+
297
+ @staticmethod
298
+ def _cfg_velocity(
299
+ transformer,
300
+ x: torch.Tensor,
301
+ t: torch.Tensor,
302
+ text: torch.Tensor,
303
+ mask: torch.Tensor,
304
+ cfg_scale: float,
305
+ ) -> torch.Tensor:
306
+ batch_size = x.shape[0]
307
+ doubled_x = torch.cat([x, x], dim=0)
308
+ doubled_t = torch.cat([t, t], dim=0)
309
+ doubled_text = torch.cat([text, text], dim=0)
310
+ null_mask = torch.zeros_like(mask)
311
+ doubled_mask = torch.cat([mask, null_mask], dim=0)
312
+ velocity = transformer.pred_velocity(doubled_x, doubled_t, doubled_text, doubled_mask)
313
+ cond, uncond = velocity[:batch_size], velocity[batch_size:]
314
+ cfg_interval = transformer.mmjit_config.cfg_interval
315
+ use_cfg = ((t >= cfg_interval[0]) & (t <= cfg_interval[1])).to(velocity.dtype)
316
+ scale = torch.where(
317
+ use_cfg[:, None, None, None] > 0,
318
+ torch.tensor(cfg_scale, device=x.device, dtype=velocity.dtype),
319
+ torch.tensor(1.0, device=x.device, dtype=velocity.dtype),
320
+ )
321
+ return uncond + (cond - uncond) * scale
322
+
323
+ @torch.no_grad()
324
+ def __call__(
325
+ self,
326
+ prompt: Union[str, List[str]],
327
+ num_images_per_prompt: int = 1,
328
+ guidance_scale: float = 6.0,
329
+ num_inference_steps: Optional[int] = None,
330
+ generator: Optional[torch.Generator] = None,
331
+ latents: Optional[torch.Tensor] = None,
332
+ output_type: str = "pil",
333
+ return_dict: bool = True,
334
+ progress: bool = True,
335
+ model_type: Optional[str] = None,
336
+ repo_id_or_path: Optional[str] = None,
337
+ variant: Optional[str] = None,
338
+ torch_dtype: Optional[torch.dtype] = None,
339
+ ) -> Union[ImagePipelineOutput, Tuple]:
340
+ r"""
341
+ Generate images from text prompts with MiniT2I.
342
+
343
+ Args:
344
+ prompt (`str` or `list[str]`):
345
+ Text prompt or batch of prompts.
346
+ num_images_per_prompt (`int`, defaults to `1`):
347
+ Number of images to generate per prompt.
348
+ guidance_scale (`float`, defaults to `6.0`):
349
+ Classifier-free guidance scale. CFG is active when `guidance_scale != 1.0`.
350
+ num_inference_steps (`int`, *optional*):
351
+ Number of denoising steps. Defaults to the pipeline config value.
352
+ generator (`torch.Generator`, *optional*):
353
+ RNG for reproducibility.
354
+ latents (`torch.Tensor`, *optional*):
355
+ Pre-generated pixel latents with shape `(batch, channels, height, width)`.
356
+ output_type (`str`, defaults to `"pil"`):
357
+ `"pil"`, `"np"`, `"pt"`, or `"latent"`.
358
+ return_dict (`bool`, defaults to `True`):
359
+ Return [`ImagePipelineOutput`] if True.
360
+ progress (`bool`, defaults to `True`):
361
+ Whether to show a progress bar during denoising.
362
+ model_type (`str`, *optional*):
363
+ MiniT2I variant alias such as `"b16"` or `"l16"`.
364
+ repo_id_or_path (`str`, *optional*):
365
+ Hub id or local path used when switching `model_type`.
366
+ variant (`str`, *optional*):
367
+ Weight variant passed to `from_pretrained`.
368
+ torch_dtype (`torch.dtype`, *optional*):
369
+ Optional dtype override when loading a different transformer variant.
370
+ """
371
+ num_inference_steps = int(num_inference_steps or self.config.default_num_inference_steps)
372
+ self.check_inputs(prompt, guidance_scale, num_inference_steps, output_type)
373
+
374
+ transformer = self._get_transformer(model_type, repo_id_or_path, torch_dtype=torch_dtype, variant=variant)
375
+ device = self._execution_device
376
+ transformer = transformer.to(device)
377
+
378
+ if isinstance(prompt, str):
379
+ prompt_batch = [prompt] * num_images_per_prompt
380
+ else:
381
+ prompt_batch = []
382
+ for entry in prompt:
383
+ prompt_batch.extend([entry] * num_images_per_prompt)
384
+
385
+ batch_size = len(prompt_batch)
386
+ mmjit_cfg = transformer.mmjit_config
387
+ model_dtype = next(transformer.parameters()).dtype
388
+
389
+ text, attn = self._encode_prompt(prompt_batch, device, transformer=transformer)
390
+ text = text.to(dtype=model_dtype)
391
+ attn = attn.to(dtype=model_dtype)
392
+
393
+ if getattr(self.scheduler.config, "stochastic_sampling", False):
394
+ raise ValueError(
395
+ "MiniT2I expects deterministic FlowMatchEulerDiscreteScheduler stepping "
396
+ "(scheduler.config.stochastic_sampling=False)."
397
+ )
398
+
399
+ extra_step_kwargs = self.prepare_extra_step_kwargs(self.scheduler, generator=generator)
400
+ self.scheduler.set_timesteps(num_inference_steps, device=device)
401
+ num_train_timesteps = self.scheduler.config.num_train_timesteps
402
+
403
+ latents = self.prepare_latents(
404
+ batch_size=batch_size,
405
+ image_size=mmjit_cfg.image_size,
406
+ in_channels=mmjit_cfg.in_channels,
407
+ device=device,
408
+ dtype=model_dtype,
409
+ generator=generator,
410
+ latents=latents,
411
+ )
412
+
413
+ timesteps = self.scheduler.timesteps
414
+ if progress:
415
+ timesteps = self.progress_bar(timesteps)
416
+
417
+ using_cfg = guidance_scale != 1.0
418
+ for timestep in timesteps:
419
+ flow_time = 1.0 - float(timestep) / num_train_timesteps
420
+ t = torch.full((batch_size,), flow_time, device=device, dtype=model_dtype)
421
+ if using_cfg:
422
+ velocity = self._cfg_velocity(transformer, latents, t, text, attn, guidance_scale)
423
+ else:
424
+ velocity = transformer.pred_velocity(latents, t, text, attn)
425
+
426
+ # MiniT2I integrates velocity from noise (t=0) to data (t=1); flip sign for
427
+ # FlowMatchEulerDiscreteScheduler sigma decreasing from 1 to 0.
428
+ latents = self.scheduler.step(-velocity, timestep, latents, **extra_step_kwargs).prev_sample
429
+
430
+ if output_type == "latent":
431
+ images = latents
432
+ else:
433
+ images = (latents.clamp(-1, 1) * 127.5 + 128.0).clamp(0, 255).to(torch.uint8)
434
+ if output_type == "pt":
435
+ images = images.float() / 255.0
436
+ else:
437
+ images = images.permute(0, 2, 3, 1).cpu().numpy()
438
+ if output_type == "pil":
439
+ images = [Image.fromarray(image) for image in images]
440
+
441
+ self.maybe_free_model_hooks()
442
+ if not return_dict:
443
+ return (images,)
444
+ return ImagePipelineOutput(images=images)
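The `_cfg_velocity` helper in the pipeline above blends conditional and unconditional velocities, and applies the guidance scale only while the flow time lies inside `cfg_interval`. The scalar sketch below shows just that rule. `guided_velocity` is an illustrative name, and the real code works on batched tensors and builds the unconditional branch by zeroing the attention mask.

```python
def guided_velocity(cond, uncond, t, cfg_scale, cfg_interval=(0.0, 1.0)):
    """Classifier-free guidance with an interval gate.

    Inside the interval the difference (cond - uncond) is scaled by cfg_scale;
    outside it the scale falls back to 1.0, which returns `cond` unchanged.
    """
    lo, hi = cfg_interval
    scale = cfg_scale if lo <= t <= hi else 1.0
    return uncond + (cond - uncond) * scale


inside = guided_velocity(cond=1.0, uncond=0.0, t=0.5, cfg_scale=6.0)
outside = guided_velocity(cond=1.0, uncond=0.25, t=0.9, cfg_scale=6.0, cfg_interval=(0.0, 0.5))
```

A scale of 1.0 reduces the formula to the conditional prediction, which is consistent with `__call__` skipping the doubled batch entirely when `guidance_scale == 1.0`.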
MiniT2I-L-16/scheduler/scheduler_config.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "_class_name": "FlowMatchEulerDiscreteScheduler",
+   "_diffusers_version": "0.32.0",
+   "num_train_timesteps": 1000,
+   "shift": 1.0,
+   "stochastic_sampling": false
+ }
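The pipeline maps each scheduler timestep to a flow time with `1.0 - timestep / num_train_timesteps` and passes `-velocity` to `scheduler.step`. The sketch below checks on scalars that this sign flip matches the model's native update `x + (t_next - t_cur) * v`. It assumes the deterministic scheduler step has the Euler form `sample + (sigma_next - sigma) * model_output`; that form is an assumption here, not a quotation of the diffusers source.

```python
NUM_TRAIN_TIMESTEPS = 1000  # value from scheduler_config.json above


def flow_time(timestep, num_train_timesteps=NUM_TRAIN_TIMESTEPS):
    """Map a scheduler timestep (sigma * num_train_timesteps) to MiniT2I flow time."""
    return 1.0 - timestep / num_train_timesteps


def scheduler_style_step(sample, model_output, sigma, sigma_next):
    """Assumed deterministic Euler update, with sigma decreasing from 1 to 0."""
    return sample + (sigma_next - sigma) * model_output


def native_step(sample, velocity, t_cur, t_next):
    """MiniT2I update with flow time increasing from 0 (noise) to 1 (data)."""
    return sample + (t_next - t_cur) * velocity


x, v = 2.0, -1.5
# Timesteps 750 -> 500 correspond to sigma 0.75 -> 0.5 and flow time 0.25 -> 0.5.
via_scheduler = scheduler_style_step(x, -v, sigma=0.75, sigma_next=0.5)
via_native = native_step(x, v, flow_time(750), flow_time(500))
```

Because sigma equals one minus the flow time, the step sizes have opposite signs, and negating the velocity cancels that difference.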
MiniT2I-L-16/text_encoder/README.md ADDED
@@ -0,0 +1,276 @@
1
+ ---
2
+ language:
3
+ - en
4
+ - fr
5
+ - ro
6
+ - de
7
+ - multilingual
8
+
9
+ widget:
10
+ - text: "Translate to German: My name is Arthur"
11
+ example_title: "Translation"
12
+ - text: "Please answer to the following question. Who is going to be the next Ballon d'or?"
13
+ example_title: "Question Answering"
14
+ - text: "Q: Can Geoffrey Hinton have a conversation with George Washington? Give the rationale before answering."
15
+ example_title: "Logical reasoning"
16
+ - text: "Please answer the following question. What is the boiling point of Nitrogen?"
17
+ example_title: "Scientific knowledge"
18
+ - text: "Answer the following yes/no question. Can you write a whole Haiku in a single tweet?"
19
+ example_title: "Yes/no question"
20
+ - text: "Answer the following yes/no question by reasoning step-by-step. Can you write a whole Haiku in a single tweet?"
21
+ example_title: "Reasoning task"
22
+ - text: "Q: ( False or not False or False ) is? A: Let's think step by step"
23
+ example_title: "Boolean Expressions"
24
+ - text: "The square root of x is the cube root of y. What is y to the power of 2, if x = 4?"
25
+ example_title: "Math reasoning"
26
+ - text: "Premise: At my age you will probably have learnt one lesson. Hypothesis: It's not certain how many lessons you'll learn by your thirties. Does the premise entail the hypothesis?"
27
+ example_title: "Premise and hypothesis"
28
+
29
+ tags:
30
+ - text2text-generation
31
+
32
+ datasets:
33
+ - svakulenk0/qrecc
34
+ - taskmaster2
35
+ - djaym7/wiki_dialog
36
+ - deepmind/code_contests
37
+ - lambada
38
+ - gsm8k
39
+ - aqua_rat
40
+ - esnli
41
+ - quasc
42
+ - qed
43
+
44
+
45
+ license: apache-2.0
46
+ ---
47
+
48
+ # Model Card for FLAN-T5 large
49
+
50
+ <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/flan2_architecture.jpg"
51
+ alt="drawing" width="600"/>
52
+
53
+ # Table of Contents
54
+
55
+ 0. [TL;DR](#TL;DR)
56
+ 1. [Model Details](#model-details)
57
+ 2. [Usage](#usage)
58
+ 3. [Uses](#uses)
59
+ 4. [Bias, Risks, and Limitations](#bias-risks-and-limitations)
60
+ 5. [Training Details](#training-details)
61
+ 6. [Evaluation](#evaluation)
62
+ 7. [Environmental Impact](#environmental-impact)
63
+ 8. [Citation](#citation)
64
+ 9. [Model Card Authors](#model-card-authors)
65
+
66
+ # TL;DR
67
+
68
+ If you already know T5, FLAN-T5 is just better at everything. For the same number of parameters, these models have been fine-tuned on more than 1000 additional tasks, also covering more languages.
69
+ As mentioned in the first few lines of the abstract:
70
+ > Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.
71
+
72
+ **Disclaimer**: Content from **this** model card has been written by the Hugging Face team, and parts of it were copied from the [T5 model card](https://huggingface.co/t5-large).
73
+
74
+ # Model Details
75
+
76
+ ## Model Description
77
+
78
+
79
+ - **Model type:** Language model
80
+ - **Language(s) (NLP):** English, Spanish, Japanese, Persian, Hindi, French, Chinese, Bengali, Gujarati, German, Telugu, Italian, Arabic, Polish, Tamil, Marathi, Malayalam, Oriya, Panjabi, Portuguese, Urdu, Galician, Hebrew, Korean, Catalan, Thai, Dutch, Indonesian, Vietnamese, Bulgarian, Filipino, Central Khmer, Lao, Turkish, Russian, Croatian, Swedish, Yoruba, Kurdish, Burmese, Malay, Czech, Finnish, Somali, Tagalog, Swahili, Sinhala, Kannada, Zhuang, Igbo, Xhosa, Romanian, Haitian, Estonian, Slovak, Lithuanian, Greek, Nepali, Assamese, Norwegian
81
+ - **License:** Apache 2.0
82
+ - **Related Models:** [All FLAN-T5 Checkpoints](https://huggingface.co/models?search=flan-t5)
83
+ - **Original Checkpoints:** [All Original FLAN-T5 Checkpoints](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints)
84
+ - **Resources for more information:**
85
+ - [Research paper](https://arxiv.org/pdf/2210.11416.pdf)
86
+ - [GitHub Repo](https://github.com/google-research/t5x)
87
+ - [Hugging Face FLAN-T5 Docs (Similar to T5) ](https://huggingface.co/docs/transformers/model_doc/t5)
88
+
89
+ # Usage
90
+
91
+ Find below some example scripts on how to use the model in `transformers`:
92
+
93
+ ## Using the PyTorch model
94
+
95
+ ### Running the model on a CPU
96
+
97
+ <details>
98
+ <summary> Click to expand </summary>
99
+
100
+ ```python
101
+
102
+ from transformers import T5Tokenizer, T5ForConditionalGeneration
103
+
104
+ tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")
105
+ model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large")
106
+
107
+ input_text = "translate English to German: How old are you?"
108
+ input_ids = tokenizer(input_text, return_tensors="pt").input_ids
109
+
110
+ outputs = model.generate(input_ids)
111
+ print(tokenizer.decode(outputs[0]))
112
+ ```
113
+
114
+ </details>
115
+
116
+ ### Running the model on a GPU
117
+
118
+ <details>
119
+ <summary> Click to expand </summary>
120
+
121
+ ```python
122
+ # pip install accelerate
123
+ from transformers import T5Tokenizer, T5ForConditionalGeneration
124
+
125
+ tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")
126
+ model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large", device_map="auto")
127
+
128
+ input_text = "translate English to German: How old are you?"
129
+ input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
130
+
131
+ outputs = model.generate(input_ids)
132
+ print(tokenizer.decode(outputs[0]))
133
+ ```
134
+
135
+ </details>
136
+
137
+ ### Running the model on a GPU using different precisions
138
+
139
+ #### FP16
140
+
141
+ <details>
142
+ <summary> Click to expand </summary>
143
+
144
+ ```python
145
+ # pip install accelerate
146
+ import torch
147
+ from transformers import T5Tokenizer, T5ForConditionalGeneration
148
+
149
+ tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")
150
+ model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large", device_map="auto", torch_dtype=torch.float16)
151
+
152
+ input_text = "translate English to German: How old are you?"
153
+ input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
154
+
155
+ outputs = model.generate(input_ids)
156
+ print(tokenizer.decode(outputs[0]))
157
+ ```
158
+
159
+ </details>
160
+
161
+ #### INT8
162
+
163
+ <details>
164
+ <summary> Click to expand </summary>
165
+
166
+ ```python
167
+ # pip install bitsandbytes accelerate
168
+ from transformers import T5Tokenizer, T5ForConditionalGeneration
169
+
170
+ tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")
171
+ model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large", device_map="auto", load_in_8bit=True)
172
+
173
+ input_text = "translate English to German: How old are you?"
174
+ input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
175
+
176
+ outputs = model.generate(input_ids)
177
+ print(tokenizer.decode(outputs[0]))
178
+ ```
179
+
180
+ </details>
181
+
182
+ # Uses
183
+
184
+ ## Direct Use and Downstream Use
185
+
186
+ The authors write in [the original paper's model card](https://arxiv.org/pdf/2210.11416.pdf) that:
187
+
188
+ > The primary use is research on language models, including: research on zero-shot NLP tasks and in-context few-shot learning NLP tasks, such as reasoning, and question answering; advancing fairness and safety research, and understanding limitations of current large language models
189
+
190
+ See the [research paper](https://arxiv.org/pdf/2210.11416.pdf) for further details.
191
+
192
+ ## Out-of-Scope Use
193
+
194
+ More information needed.
195
+
196
+ # Bias, Risks, and Limitations
197
+
198
+ The information in this section is copied from the model's [official model card](https://arxiv.org/pdf/2210.11416.pdf):
199
+
200
+ > Language models, including Flan-T5, can potentially be used for language generation in a harmful way, according to Rae et al. (2021). Flan-T5 should not be used directly in any application, without a prior assessment of safety and fairness concerns specific to the application.
201
+
202
+ ## Ethical considerations and risks
203
+
204
+ > Flan-T5 is fine-tuned on a large corpus of text data that was not filtered for explicit content or assessed for existing biases. As a result the model itself is potentially vulnerable to generating equivalently inappropriate content or replicating inherent biases in the underlying data.
205
+
206
+ ## Known Limitations
207
+
208
+ > Flan-T5 has not been tested in real world applications.
209
+
210
+ ## Sensitive Use:
211
+
212
+ > Flan-T5 should not be applied for any unacceptable use cases, e.g., generation of abusive speech.
213
+
214
+ # Training Details
215
+
216
+ ## Training Data
217
+
218
+ The model was trained on a mixture of tasks that includes the tasks described in the table below (from the original paper, figure 2):
219
+
220
+ ![table.png](https://s3.amazonaws.com/moonup/production/uploads/1666363265279-62441d1d9fdefb55a0b7d12c.png)
221
+
222
+
223
+ ## Training Procedure
224
+
225
+ According to the model card from the [original paper](https://arxiv.org/pdf/2210.11416.pdf):
226
+
227
+ > These models are based on pretrained T5 (Raffel et al., 2020) and fine-tuned with instructions for better zero-shot and few-shot performance. There is one fine-tuned Flan model per T5 model size.
228
+
229
+ The model was trained on TPU v3 or TPU v4 pods, using the [`t5x`](https://github.com/google-research/t5x) codebase together with [`jax`](https://github.com/google/jax).
230
+
231
+
232
+ # Evaluation
233
+
234
+ ## Testing Data, Factors & Metrics
235
+
236
+ The authors evaluated the model on various tasks covering several languages (1836 in total). See the table below for some quantitative results:
237
+ ![image.png](https://s3.amazonaws.com/moonup/production/uploads/1668072995230-62441d1d9fdefb55a0b7d12c.png)
238
+ For full details, please check the [research paper](https://arxiv.org/pdf/2210.11416.pdf).
239
+
240
+ ## Results
241
+
242
+ For full results for FLAN-T5-Large, see the [research paper](https://arxiv.org/pdf/2210.11416.pdf), Table 3.
243
+
244
+ # Environmental Impact
245
+
246
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
247
+
248
+ - **Hardware Type:** Google Cloud TPU Pods - TPU v3 or TPU v4 | Number of chips ≥ 4.
249
+ - **Hours used:** More information needed
250
+ - **Cloud Provider:** GCP
251
+ - **Compute Region:** More information needed
252
+ - **Carbon Emitted:** More information needed
253
+
254
+ # Citation
255
+
256
+ **BibTeX:**
257
+
258
+ ```bibtex
259
+ @misc{https://doi.org/10.48550/arxiv.2210.11416,
260
+ doi = {10.48550/ARXIV.2210.11416},
261
+
262
+ url = {https://arxiv.org/abs/2210.11416},
263
+
264
+ author = {Chung, Hyung Won and Hou, Le and Longpre, Shayne and Zoph, Barret and Tay, Yi and Fedus, William and Li, Eric and Wang, Xuezhi and Dehghani, Mostafa and Brahma, Siddhartha and Webson, Albert and Gu, Shixiang Shane and Dai, Zhuyun and Suzgun, Mirac and Chen, Xinyun and Chowdhery, Aakanksha and Narang, Sharan and Mishra, Gaurav and Yu, Adams and Zhao, Vincent and Huang, Yanping and Dai, Andrew and Yu, Hongkun and Petrov, Slav and Chi, Ed H. and Dean, Jeff and Devlin, Jacob and Roberts, Adam and Zhou, Denny and Le, Quoc V. and Wei, Jason},
265
+
266
+ keywords = {Machine Learning (cs.LG), Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
267
+
268
+ title = {Scaling Instruction-Finetuned Language Models},
269
+
270
+ publisher = {arXiv},
271
+
272
+ year = {2022},
273
+
274
+ copyright = {Creative Commons Attribution 4.0 International}
275
+ }
276
+ ```
MiniT2I-L-16/text_encoder/config.json ADDED
@@ -0,0 +1,28 @@
1
+ {
2
+ "architectures": [
3
+ "T5ForConditionalGeneration"
4
+ ],
5
+ "d_ff": 2816,
6
+ "d_kv": 64,
7
+ "d_model": 1024,
8
+ "decoder_start_token_id": 0,
9
+ "dropout_rate": 0.1,
10
+ "eos_token_id": 1,
11
+ "feed_forward_proj": "gated-gelu",
12
+ "initializer_factor": 1.0,
13
+ "is_encoder_decoder": true,
14
+ "layer_norm_epsilon": 1e-06,
15
+ "model_type": "t5",
16
+ "n_positions": 512,
17
+ "num_decoder_layers": 24,
18
+ "num_heads": 16,
19
+ "num_layers": 24,
20
+ "output_past": true,
21
+ "pad_token_id": 0,
22
+ "relative_attention_max_distance": 128,
23
+ "relative_attention_num_buckets": 32,
24
+ "tie_word_embeddings": false,
25
+ "transformers_version": "4.23.1",
26
+ "use_cache": true,
27
+ "vocab_size": 32128
28
+ }
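A quick consistency check on the encoder config above, as a minimal sketch. It uses only fields shown in `text_encoder/config.json`; the helper name `attention_inner_dim` is illustrative and not part of any library. In T5 the attention projection width is `num_heads * d_kv`, which here equals `d_model`. The 1024-wide hidden states also match `txt_input_size` in `transformer/config.json`, although whether the pipeline feeds the encoder states in unchanged is an assumption.

```python
import json

# Subset of the fields from text_encoder/config.json shown in this diff.
t5_config = json.loads("""
{
  "d_ff": 2816,
  "d_kv": 64,
  "d_model": 1024,
  "num_heads": 16,
  "num_layers": 24,
  "num_decoder_layers": 24,
  "vocab_size": 32128
}
""")


def attention_inner_dim(cfg: dict) -> int:
    """Width of the concatenated per-head projections (hypothetical helper)."""
    return cfg["num_heads"] * cfg["d_kv"]


inner_dim = attention_inner_dim(t5_config)
# Assumption: the image transformer consumes encoder states of width d_model.
matches_txt_input_size = t5_config["d_model"] == 1024
print(inner_dim, matches_txt_input_size)
```

If the two widths ever diverge (for example, when swapping in a different T5 size), `txt_input_size` in the transformer config would presumably need to change with it.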
MiniT2I-L-16/text_encoder/generation_config.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "decoder_start_token_id": 0,
4
+ "eos_token_id": 1,
5
+ "pad_token_id": 0,
6
+ "transformers_version": "4.27.0.dev0"
7
+ }
MiniT2I-L-16/text_encoder/model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a9dd06ce490f139af36e9eb77dd3758b4fd07a08a73d5a1abe5ff2591e2d388e
3
+ size 3132668804
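The three lines above are a Git LFS pointer, not the weights themselves. As a rough sketch of how such a pointer could be read, the snippet below splits it into its fields and converts the byte count to GiB. The function name `parse_lfs_pointer` is hypothetical, and the parser handles only the three keys that appear in this diff.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its fields (illustrative helper)."""
    fields = {}
    for line in text.strip().splitlines():
        # Each line is "<key> <value>"; split on the first space only.
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:a9dd06ce490f139af36e9eb77dd3758b4fd07a08a73d5a1abe5ff2591e2d388e
size 3132668804
"""

info = parse_lfs_pointer(pointer)
size_gib = info["size"] / 2**30  # bytes -> GiB
print(info["algo"], f"{size_gib:.2f} GiB")
```

The digest can be compared against a locally computed SHA-256 of the downloaded file to confirm the download is intact.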
MiniT2I-L-16/text_encoder/special_tokens_map.json ADDED
@@ -0,0 +1,107 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<extra_id_0>",
4
+ "<extra_id_1>",
5
+ "<extra_id_2>",
6
+ "<extra_id_3>",
7
+ "<extra_id_4>",
8
+ "<extra_id_5>",
9
+ "<extra_id_6>",
10
+ "<extra_id_7>",
11
+ "<extra_id_8>",
12
+ "<extra_id_9>",
13
+ "<extra_id_10>",
14
+ "<extra_id_11>",
15
+ "<extra_id_12>",
16
+ "<extra_id_13>",
17
+ "<extra_id_14>",
18
+ "<extra_id_15>",
19
+ "<extra_id_16>",
20
+ "<extra_id_17>",
21
+ "<extra_id_18>",
22
+ "<extra_id_19>",
23
+ "<extra_id_20>",
24
+ "<extra_id_21>",
25
+ "<extra_id_22>",
26
+ "<extra_id_23>",
27
+ "<extra_id_24>",
28
+ "<extra_id_25>",
29
+ "<extra_id_26>",
30
+ "<extra_id_27>",
31
+ "<extra_id_28>",
32
+ "<extra_id_29>",
33
+ "<extra_id_30>",
34
+ "<extra_id_31>",
35
+ "<extra_id_32>",
36
+ "<extra_id_33>",
37
+ "<extra_id_34>",
38
+ "<extra_id_35>",
39
+ "<extra_id_36>",
40
+ "<extra_id_37>",
41
+ "<extra_id_38>",
42
+ "<extra_id_39>",
43
+ "<extra_id_40>",
44
+ "<extra_id_41>",
45
+ "<extra_id_42>",
46
+ "<extra_id_43>",
47
+ "<extra_id_44>",
48
+ "<extra_id_45>",
49
+ "<extra_id_46>",
50
+ "<extra_id_47>",
51
+ "<extra_id_48>",
52
+ "<extra_id_49>",
53
+ "<extra_id_50>",
54
+ "<extra_id_51>",
55
+ "<extra_id_52>",
56
+ "<extra_id_53>",
57
+ "<extra_id_54>",
58
+ "<extra_id_55>",
59
+ "<extra_id_56>",
60
+ "<extra_id_57>",
61
+ "<extra_id_58>",
62
+ "<extra_id_59>",
63
+ "<extra_id_60>",
64
+ "<extra_id_61>",
65
+ "<extra_id_62>",
66
+ "<extra_id_63>",
67
+ "<extra_id_64>",
68
+ "<extra_id_65>",
69
+ "<extra_id_66>",
70
+ "<extra_id_67>",
71
+ "<extra_id_68>",
72
+ "<extra_id_69>",
73
+ "<extra_id_70>",
74
+ "<extra_id_71>",
75
+ "<extra_id_72>",
76
+ "<extra_id_73>",
77
+ "<extra_id_74>",
78
+ "<extra_id_75>",
79
+ "<extra_id_76>",
80
+ "<extra_id_77>",
81
+ "<extra_id_78>",
82
+ "<extra_id_79>",
83
+ "<extra_id_80>",
84
+ "<extra_id_81>",
85
+ "<extra_id_82>",
86
+ "<extra_id_83>",
87
+ "<extra_id_84>",
88
+ "<extra_id_85>",
89
+ "<extra_id_86>",
90
+ "<extra_id_87>",
91
+ "<extra_id_88>",
92
+ "<extra_id_89>",
93
+ "<extra_id_90>",
94
+ "<extra_id_91>",
95
+ "<extra_id_92>",
96
+ "<extra_id_93>",
97
+ "<extra_id_94>",
98
+ "<extra_id_95>",
99
+ "<extra_id_96>",
100
+ "<extra_id_97>",
101
+ "<extra_id_98>",
102
+ "<extra_id_99>"
103
+ ],
104
+ "eos_token": "</s>",
105
+ "pad_token": "<pad>",
106
+ "unk_token": "<unk>"
107
+ }
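The 100 sentinel tokens listed above follow a fixed naming pattern, so the map can be regenerated rather than typed out. The following is a small sketch under that assumption; `t5_sentinel_tokens` is an illustrative name, not a tokenizer API.

```python
from typing import List


def t5_sentinel_tokens(extra_ids: int = 100) -> List[str]:
    """Return <extra_id_0> ... <extra_id_{n-1}> in ascending order."""
    return [f"<extra_id_{i}>" for i in range(extra_ids)]


# Rebuild the same structure as special_tokens_map.json in this diff.
special_tokens_map = {
    "additional_special_tokens": t5_sentinel_tokens(),
    "eos_token": "</s>",
    "pad_token": "<pad>",
    "unk_token": "<unk>",
}

print(len(special_tokens_map["additional_special_tokens"]))
```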
MiniT2I-L-16/text_encoder/spiece.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d60acb128cf7b7f2536e8f38a5b18a05535c9e14c7a355904270e15b0945ea86
3
+ size 791656
MiniT2I-L-16/text_encoder/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
MiniT2I-L-16/text_encoder/tokenizer_config.json ADDED
@@ -0,0 +1,113 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<extra_id_0>",
4
+ "<extra_id_1>",
5
+ "<extra_id_2>",
6
+ "<extra_id_3>",
7
+ "<extra_id_4>",
8
+ "<extra_id_5>",
9
+ "<extra_id_6>",
10
+ "<extra_id_7>",
11
+ "<extra_id_8>",
12
+ "<extra_id_9>",
13
+ "<extra_id_10>",
14
+ "<extra_id_11>",
15
+ "<extra_id_12>",
16
+ "<extra_id_13>",
17
+ "<extra_id_14>",
18
+ "<extra_id_15>",
19
+ "<extra_id_16>",
20
+ "<extra_id_17>",
21
+ "<extra_id_18>",
22
+ "<extra_id_19>",
23
+ "<extra_id_20>",
24
+ "<extra_id_21>",
25
+ "<extra_id_22>",
26
+ "<extra_id_23>",
27
+ "<extra_id_24>",
28
+ "<extra_id_25>",
29
+ "<extra_id_26>",
30
+ "<extra_id_27>",
31
+ "<extra_id_28>",
32
+ "<extra_id_29>",
33
+ "<extra_id_30>",
34
+ "<extra_id_31>",
35
+ "<extra_id_32>",
36
+ "<extra_id_33>",
37
+ "<extra_id_34>",
38
+ "<extra_id_35>",
39
+ "<extra_id_36>",
40
+ "<extra_id_37>",
41
+ "<extra_id_38>",
42
+ "<extra_id_39>",
43
+ "<extra_id_40>",
44
+ "<extra_id_41>",
45
+ "<extra_id_42>",
46
+ "<extra_id_43>",
47
+ "<extra_id_44>",
48
+ "<extra_id_45>",
49
+ "<extra_id_46>",
50
+ "<extra_id_47>",
51
+ "<extra_id_48>",
52
+ "<extra_id_49>",
53
+ "<extra_id_50>",
54
+ "<extra_id_51>",
55
+ "<extra_id_52>",
56
+ "<extra_id_53>",
57
+ "<extra_id_54>",
58
+ "<extra_id_55>",
59
+ "<extra_id_56>",
60
+ "<extra_id_57>",
61
+ "<extra_id_58>",
62
+ "<extra_id_59>",
63
+ "<extra_id_60>",
64
+ "<extra_id_61>",
65
+ "<extra_id_62>",
66
+ "<extra_id_63>",
67
+ "<extra_id_64>",
68
+ "<extra_id_65>",
69
+ "<extra_id_66>",
70
+ "<extra_id_67>",
71
+ "<extra_id_68>",
72
+ "<extra_id_69>",
73
+ "<extra_id_70>",
74
+ "<extra_id_71>",
75
+ "<extra_id_72>",
76
+ "<extra_id_73>",
77
+ "<extra_id_74>",
78
+ "<extra_id_75>",
79
+ "<extra_id_76>",
80
+ "<extra_id_77>",
81
+ "<extra_id_78>",
82
+ "<extra_id_79>",
83
+ "<extra_id_80>",
84
+ "<extra_id_81>",
85
+ "<extra_id_82>",
86
+ "<extra_id_83>",
87
+ "<extra_id_84>",
88
+ "<extra_id_85>",
89
+ "<extra_id_86>",
90
+ "<extra_id_87>",
91
+ "<extra_id_88>",
92
+ "<extra_id_89>",
93
+ "<extra_id_90>",
94
+ "<extra_id_91>",
95
+ "<extra_id_92>",
96
+ "<extra_id_93>",
97
+ "<extra_id_94>",
98
+ "<extra_id_95>",
99
+ "<extra_id_96>",
100
+ "<extra_id_97>",
101
+ "<extra_id_98>",
102
+ "<extra_id_99>"
103
+ ],
104
+ "eos_token": "</s>",
105
+ "extra_ids": 100,
106
+ "model_max_length": 512,
107
+ "name_or_path": "google/t5-v1_1-large",
108
+ "pad_token": "<pad>",
109
+ "sp_model_kwargs": {},
110
+ "special_tokens_map_file": "/home/younes_huggingface_co/.cache/huggingface/hub/models--google--t5-v1_1-large/snapshots/314bc112b191ec17b625ba81438dc73d6c23659d/special_tokens_map.json",
111
+ "tokenizer_class": "T5Tokenizer",
112
+ "unk_token": "<unk>"
113
+ }
MiniT2I-L-16/tokenizer/special_tokens_map.json ADDED
@@ -0,0 +1,107 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<extra_id_0>",
4
+ "<extra_id_1>",
5
+ "<extra_id_2>",
6
+ "<extra_id_3>",
7
+ "<extra_id_4>",
8
+ "<extra_id_5>",
9
+ "<extra_id_6>",
10
+ "<extra_id_7>",
11
+ "<extra_id_8>",
12
+ "<extra_id_9>",
13
+ "<extra_id_10>",
14
+ "<extra_id_11>",
15
+ "<extra_id_12>",
16
+ "<extra_id_13>",
17
+ "<extra_id_14>",
18
+ "<extra_id_15>",
19
+ "<extra_id_16>",
20
+ "<extra_id_17>",
21
+ "<extra_id_18>",
22
+ "<extra_id_19>",
23
+ "<extra_id_20>",
24
+ "<extra_id_21>",
25
+ "<extra_id_22>",
26
+ "<extra_id_23>",
27
+ "<extra_id_24>",
28
+ "<extra_id_25>",
29
+ "<extra_id_26>",
30
+ "<extra_id_27>",
31
+ "<extra_id_28>",
32
+ "<extra_id_29>",
33
+ "<extra_id_30>",
34
+ "<extra_id_31>",
35
+ "<extra_id_32>",
36
+ "<extra_id_33>",
37
+ "<extra_id_34>",
38
+ "<extra_id_35>",
39
+ "<extra_id_36>",
40
+ "<extra_id_37>",
41
+ "<extra_id_38>",
42
+ "<extra_id_39>",
43
+ "<extra_id_40>",
44
+ "<extra_id_41>",
45
+ "<extra_id_42>",
46
+ "<extra_id_43>",
47
+ "<extra_id_44>",
48
+ "<extra_id_45>",
49
+ "<extra_id_46>",
50
+ "<extra_id_47>",
51
+ "<extra_id_48>",
52
+ "<extra_id_49>",
53
+ "<extra_id_50>",
54
+ "<extra_id_51>",
55
+ "<extra_id_52>",
56
+ "<extra_id_53>",
57
+ "<extra_id_54>",
58
+ "<extra_id_55>",
59
+ "<extra_id_56>",
60
+ "<extra_id_57>",
61
+ "<extra_id_58>",
62
+ "<extra_id_59>",
63
+ "<extra_id_60>",
64
+ "<extra_id_61>",
65
+ "<extra_id_62>",
66
+ "<extra_id_63>",
67
+ "<extra_id_64>",
68
+ "<extra_id_65>",
69
+ "<extra_id_66>",
70
+ "<extra_id_67>",
71
+ "<extra_id_68>",
72
+ "<extra_id_69>",
73
+ "<extra_id_70>",
74
+ "<extra_id_71>",
75
+ "<extra_id_72>",
76
+ "<extra_id_73>",
77
+ "<extra_id_74>",
78
+ "<extra_id_75>",
79
+ "<extra_id_76>",
80
+ "<extra_id_77>",
81
+ "<extra_id_78>",
82
+ "<extra_id_79>",
83
+ "<extra_id_80>",
84
+ "<extra_id_81>",
85
+ "<extra_id_82>",
86
+ "<extra_id_83>",
87
+ "<extra_id_84>",
88
+ "<extra_id_85>",
89
+ "<extra_id_86>",
90
+ "<extra_id_87>",
91
+ "<extra_id_88>",
92
+ "<extra_id_89>",
93
+ "<extra_id_90>",
94
+ "<extra_id_91>",
95
+ "<extra_id_92>",
96
+ "<extra_id_93>",
97
+ "<extra_id_94>",
98
+ "<extra_id_95>",
99
+ "<extra_id_96>",
100
+ "<extra_id_97>",
101
+ "<extra_id_98>",
102
+ "<extra_id_99>"
103
+ ],
104
+ "eos_token": "</s>",
105
+ "pad_token": "<pad>",
106
+ "unk_token": "<unk>"
107
+ }
MiniT2I-L-16/tokenizer/spiece.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d60acb128cf7b7f2536e8f38a5b18a05535c9e14c7a355904270e15b0945ea86
3
+ size 791656
MiniT2I-L-16/tokenizer/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
MiniT2I-L-16/tokenizer/tokenizer_config.json ADDED
@@ -0,0 +1,113 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<extra_id_0>",
4
+ "<extra_id_1>",
5
+ "<extra_id_2>",
6
+ "<extra_id_3>",
7
+ "<extra_id_4>",
8
+ "<extra_id_5>",
9
+ "<extra_id_6>",
10
+ "<extra_id_7>",
11
+ "<extra_id_8>",
12
+ "<extra_id_9>",
13
+ "<extra_id_10>",
14
+ "<extra_id_11>",
15
+ "<extra_id_12>",
16
+ "<extra_id_13>",
17
+ "<extra_id_14>",
18
+ "<extra_id_15>",
19
+ "<extra_id_16>",
20
+ "<extra_id_17>",
21
+ "<extra_id_18>",
22
+ "<extra_id_19>",
23
+ "<extra_id_20>",
24
+ "<extra_id_21>",
25
+ "<extra_id_22>",
26
+ "<extra_id_23>",
27
+ "<extra_id_24>",
28
+ "<extra_id_25>",
29
+ "<extra_id_26>",
30
+ "<extra_id_27>",
31
+ "<extra_id_28>",
32
+ "<extra_id_29>",
33
+ "<extra_id_30>",
34
+ "<extra_id_31>",
35
+ "<extra_id_32>",
36
+ "<extra_id_33>",
37
+ "<extra_id_34>",
38
+ "<extra_id_35>",
39
+ "<extra_id_36>",
40
+ "<extra_id_37>",
41
+ "<extra_id_38>",
42
+ "<extra_id_39>",
43
+ "<extra_id_40>",
44
+ "<extra_id_41>",
45
+ "<extra_id_42>",
46
+ "<extra_id_43>",
47
+ "<extra_id_44>",
48
+ "<extra_id_45>",
49
+ "<extra_id_46>",
50
+ "<extra_id_47>",
51
+ "<extra_id_48>",
52
+ "<extra_id_49>",
53
+ "<extra_id_50>",
54
+ "<extra_id_51>",
55
+ "<extra_id_52>",
56
+ "<extra_id_53>",
57
+ "<extra_id_54>",
58
+ "<extra_id_55>",
59
+ "<extra_id_56>",
60
+ "<extra_id_57>",
61
+ "<extra_id_58>",
62
+ "<extra_id_59>",
63
+ "<extra_id_60>",
64
+ "<extra_id_61>",
65
+ "<extra_id_62>",
66
+ "<extra_id_63>",
67
+ "<extra_id_64>",
68
+ "<extra_id_65>",
69
+ "<extra_id_66>",
70
+ "<extra_id_67>",
71
+ "<extra_id_68>",
72
+ "<extra_id_69>",
73
+ "<extra_id_70>",
74
+ "<extra_id_71>",
75
+ "<extra_id_72>",
76
+ "<extra_id_73>",
77
+ "<extra_id_74>",
78
+ "<extra_id_75>",
79
+ "<extra_id_76>",
80
+ "<extra_id_77>",
81
+ "<extra_id_78>",
82
+ "<extra_id_79>",
83
+ "<extra_id_80>",
84
+ "<extra_id_81>",
85
+ "<extra_id_82>",
86
+ "<extra_id_83>",
87
+ "<extra_id_84>",
88
+ "<extra_id_85>",
89
+ "<extra_id_86>",
90
+ "<extra_id_87>",
91
+ "<extra_id_88>",
92
+ "<extra_id_89>",
93
+ "<extra_id_90>",
94
+ "<extra_id_91>",
95
+ "<extra_id_92>",
96
+ "<extra_id_93>",
97
+ "<extra_id_94>",
98
+ "<extra_id_95>",
99
+ "<extra_id_96>",
100
+ "<extra_id_97>",
101
+ "<extra_id_98>",
102
+ "<extra_id_99>"
103
+ ],
104
+ "eos_token": "</s>",
105
+ "extra_ids": 100,
106
+ "model_max_length": 512,
107
+ "name_or_path": "google/t5-v1_1-large",
108
+ "pad_token": "<pad>",
109
+ "sp_model_kwargs": {},
110
+ "special_tokens_map_file": "/home/younes_huggingface_co/.cache/huggingface/hub/models--google--t5-v1_1-large/snapshots/314bc112b191ec17b625ba81438dc73d6c23659d/special_tokens_map.json",
111
+ "tokenizer_class": "T5Tokenizer",
112
+ "unk_token": "<unk>"
113
+ }
MiniT2I-L-16/transformer/config.json ADDED
@@ -0,0 +1,27 @@
1
+ {
2
+ "_class_name": "MiniT2IMMJiTModel",
3
+ "_diffusers_version": "0.35.2",
4
+ "cfg_channels": 3,
5
+ "cfg_interval": [
6
+ 0.0,
7
+ 1.0
8
+ ],
9
+ "cond_vec_size": 1248,
10
+ "depth_double": 23,
11
+ "head_dim": 52,
12
+ "hidden_size": 1248,
13
+ "image_size": 512,
14
+ "in_channels": 3,
15
+ "llm": "google/flan-t5-large",
16
+ "mlp_ratio": 2.7051282051282053,
17
+ "n_T": 100,
18
+ "num_heads": 24,
19
+ "patch_size": 16,
20
+ "pca_channels": 128,
21
+ "prediction": "x",
22
+ "prompt_length": 256,
23
+ "sampler": "euler",
24
+ "txt_hidden_size": 1248,
25
+ "txt_input_size": 1024,
26
+ "txt_preamble_depth": 2
27
+ }
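To make the numbers in this config easier to read, here is a hedged sketch that derives a few shapes from it. The rounding of the MLP width to a multiple of 8 mirrors `SwiGLUMlp` in `transformer_minit2i.py`. The joint sequence length assumes prompts are padded to `prompt_length`, which this diff does not confirm. The function name `derived_shapes` is illustrative.

```python
def derived_shapes(cfg: dict) -> dict:
    """Derive token counts and layer widths from a MiniT2I-style config."""
    grid = cfg["image_size"] // cfg["patch_size"]       # patches per side
    image_tokens = grid * grid                          # flattened patch grid
    patch_dim = cfg["patch_size"] ** 2 * cfg["in_channels"]
    inner_dim = cfg["num_heads"] * cfg["head_dim"]      # attention width
    # Same rule as SwiGLUMlp: truncate, then round up to a multiple of 8.
    mlp_hidden = (int(cfg["hidden_size"] * cfg["mlp_ratio"]) + 7) // 8 * 8
    return {
        "grid": grid,
        "image_tokens": image_tokens,
        "patch_dim": patch_dim,
        "inner_dim": inner_dim,
        "mlp_hidden": mlp_hidden,
        # Assumption: text is padded to prompt_length before joint attention.
        "joint_tokens": cfg["prompt_length"] + image_tokens,
    }


# Values copied from MiniT2I-L-16/transformer/config.json.
large_cfg = {
    "image_size": 512,
    "patch_size": 16,
    "in_channels": 3,
    "hidden_size": 1248,
    "num_heads": 24,
    "head_dim": 52,
    "mlp_ratio": 2.7051282051282053,
    "prompt_length": 256,
}

shapes = derived_shapes(large_cfg)
print(shapes)
```

Here `num_heads * head_dim` equals `hidden_size`, and the unusual `mlp_ratio` appears to be chosen so that the MLP width lands on a multiple of 8.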
MiniT2I-L-16/transformer/diffusion_pytorch_model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:290775640b27b7cc743c4f79a244aab2b5f460d285ea42f569144b38cbb03633
3
+ size 3647124768
MiniT2I-L-16/transformer/transformer_minit2i.py ADDED
@@ -0,0 +1,446 @@
1
+ import math
2
+ from dataclasses import dataclass
3
+ from typing import Optional
4
+
5
+ import torch
6
+ from torch import nn
7
+ import torch.nn.functional as F
8
+
9
+ from diffusers.configuration_utils import ConfigMixin, register_to_config
10
+ from diffusers.image_processor import VaeImageProcessor
11
+ from diffusers.models.modeling_utils import ModelMixin
12
+ from diffusers.pipelines.pipeline_utils import DiffusionPipeline, ImagePipelineOutput
13
+ from diffusers.schedulers.scheduling_utils import SchedulerMixin, SchedulerOutput
14
+ from diffusers.utils import BaseOutput
15
+ from diffusers.utils.torch_utils import randn_tensor
16
+
17
+
18
+ def modulate(x, shift, scale):
19
+ return x * (1 + scale[:, None, :]) + shift[:, None, :]
20
+
21
+
22
+ def rotate_half(x):
23
+ x1, x2 = x.reshape(*x.shape[:-1], 2, -1).unbind(dim=-2)
24
+ return torch.cat((-x2, x1), dim=-1)
25
+
26
+
27
+ class RMSNorm(nn.Module):
28
+ def __init__(self, dim: int, eps: float = 1e-6):
29
+ super().__init__()
30
+ self.weight = nn.Parameter(torch.ones(dim))
31
+ self.eps = eps
32
+
33
+ def forward(self, x):
34
+ y = x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
35
+ return y * self.weight
36
+
37
+
38
+ class TimestepEmbedder(nn.Module):
39
+ def __init__(self, hidden_size: int, frequency_embedding_size: int = 256):
40
+ super().__init__()
41
+ self.frequency_embedding_size = frequency_embedding_size
42
+ self.mlp = nn.Sequential(
43
+ nn.Linear(frequency_embedding_size, hidden_size),
44
+ nn.SiLU(),
45
+ nn.Linear(hidden_size, hidden_size),
46
+ )
47
+
48
+ def forward(self, t):
49
+ half = self.frequency_embedding_size // 2
50
+ freqs = torch.exp(
51
+ -math.log(10000.0)
52
+ * torch.arange(half, device=t.device, dtype=torch.float32)
53
+ / half
54
+ )
55
+ args = t.float()[:, None] * freqs[None]
56
+ emb = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
57
+ return self.mlp(emb.to(dtype=self.mlp[0].weight.dtype))
58
+
59
+
60
+ class BottleneckPatchEmbed(nn.Module):
61
+ def __init__(self, img_size=512, patch_size=16, in_channels=3, pca_channels=128, hidden_size=1248):
62
+ super().__init__()
63
+ self.img_size = img_size
64
+ self.patch_size = patch_size
65
+ self.proj1 = nn.Conv2d(in_channels, pca_channels, kernel_size=patch_size, stride=patch_size, bias=False)
66
+ self.proj2 = nn.Conv2d(pca_channels, hidden_size, kernel_size=1, stride=1, bias=True)
67
+
68
+ def forward(self, x):
69
+ x = self.proj2(self.proj1(x))
70
+ return x.flatten(2).transpose(1, 2)
71
+
72
+
73
+ class SwiGLUMlp(nn.Module):
74
+ def __init__(self, in_features: int, hidden_features: int):
75
+ super().__init__()
76
+ hidden_dim = (hidden_features + 7) // 8 * 8
77
+ self.w1 = nn.Linear(in_features, hidden_dim, bias=False)
78
+ self.w3 = nn.Linear(in_features, hidden_dim, bias=False)
79
+ self.w2 = nn.Linear(hidden_dim, in_features, bias=False)
80
+
81
+ def forward(self, x):
82
+ return self.w2(F.silu(self.w1(x)) * self.w3(x))
83
+
84
+
85
+ class TextRotaryEmbedding1D(nn.Module):
86
+ def __init__(self, head_dim: int, theta: float = 10000.0):
87
+ super().__init__()
88
+ self.head_dim = head_dim
89
+ self.theta = theta
90
+
91
+ def forward(self, x):
92
+ b, length, h, d = x.shape
93
+ inv = 1.0 / (self.theta ** (torch.arange(0, d, 2, device=x.device, dtype=torch.float32) / d))
94
+ pos = torch.arange(length, device=x.device, dtype=torch.float32)
95
+ angles = torch.einsum("l,f->lf", pos, inv)
96
+ angles = torch.cat([angles, angles], dim=-1)
97
+ cos = angles.cos().to(dtype=x.dtype)
98
+ sin = angles.sin().to(dtype=x.dtype)
99
+ return x * cos[None, :, None, :] + rotate_half(x) * sin[None, :, None, :]
100
+
101
+
102
+ class VisionRotaryEmbeddingFast(nn.Module):
103
+ def __init__(self, head_dim: int, theta: float = 10000.0):
104
+ super().__init__()
105
+ self.dim = head_dim // 2
106
+ self.theta = theta
107
+
108
+ def forward(self, x):
109
+ length = x.shape[1]
110
+ side = int(math.sqrt(length))
111
+ if side * side != length:
112
+ raise ValueError(f"image token length must be square, got {length}")
113
+ freqs = 1.0 / (
114
+ self.theta
115
+ ** (torch.arange(0, self.dim, 2, device=x.device, dtype=torch.float32)[: self.dim // 2] / self.dim)
116
+ )
117
+ t = torch.arange(side, device=x.device, dtype=torch.float32)
118
+ base = torch.einsum("l,f->lf", t, freqs)
119
+ f_h, f_w = torch.broadcast_tensors(base[:, None, :], base[None, :, :])
120
+ angles = torch.cat([f_h, f_w], dim=-1)
121
+ angles = torch.cat([angles, angles], dim=-1).reshape(length, -1)
122
+ cos = angles.cos().to(dtype=x.dtype)
123
+ sin = angles.sin().to(dtype=x.dtype)
124
+ return x * cos[None, :, None, :] + rotate_half(x) * sin[None, :, None, :]
125
+
126
+
127
+ class MultiModalRotaryEmbeddingFast(nn.Module):
128
+ def __init__(self, head_dim: int):
129
+ super().__init__()
130
+ self.text_rope = TextRotaryEmbedding1D(head_dim)
131
+ self.vision_rope = VisionRotaryEmbeddingFast(head_dim)
132
+
133
+ def forward(self, x, txt_len: int):
134
+ txt = self.text_rope(x[:, :txt_len])
135
+ img = self.vision_rope(x[:, txt_len:])
136
+ return torch.cat([txt, img], dim=1)
137
+
138
+
139
+ class PlainTextTransformerBlock(nn.Module):
140
+ def __init__(self, hidden_size=1248, num_heads=24, head_dim=52, mlp_ratio=2.7):
141
+ super().__init__()
142
+ self.num_heads = num_heads
143
+ self.head_dim = head_dim
144
+ inner_dim = num_heads * head_dim
145
+ self.norm1 = RMSNorm(hidden_size)
146
+ self.norm2 = RMSNorm(hidden_size)
147
+ self.qkv = nn.Linear(hidden_size, inner_dim * 3)
148
+ self.attn_proj = nn.Linear(inner_dim, hidden_size)
149
+ self.mlp = SwiGLUMlp(hidden_size, int(hidden_size * mlp_ratio))
150
+ self.q_norm = RMSNorm(head_dim)
151
+ self.k_norm = RMSNorm(head_dim)
152
+ self.rope = TextRotaryEmbedding1D(head_dim)
153
+
154
+ def forward(self, txt):
155
+ b, length, _ = txt.shape
156
+ qkv = self.qkv(self.norm1(txt)).reshape(b, length, 3, self.num_heads, self.head_dim)
157
+ q, k, v = qkv[:, :, 0], qkv[:, :, 1], qkv[:, :, 2]
158
+ q = self.rope(self.q_norm(q))
159
+ k = self.rope(self.k_norm(k))
160
+ attn = torch.einsum("bqhd,bkhd->bhqk", q, k) * (self.head_dim ** -0.5)
161
+ out = torch.einsum("bhqk,bkhd->bqhd", attn.softmax(dim=-1), v).reshape(b, length, -1)
162
+ txt = txt + self.attn_proj(out)
163
+ txt = txt + self.mlp(self.norm2(txt))
164
+ return txt
165
+
166
+
167
+ class DoubleStreamDiTBlock(nn.Module):
168
+ def __init__(self, hidden_size=1248, txt_hidden_size=1248, num_heads=24, head_dim=52, mlp_ratio=2.7):
169
+ super().__init__()
170
+ self.hidden_size = hidden_size
171
+ self.txt_hidden_size = txt_hidden_size
172
+ self.num_heads = num_heads
173
+ self.head_dim = head_dim
174
+ inner_dim = num_heads * head_dim
175
+ self.img_norm1 = RMSNorm(hidden_size)
176
+ self.img_norm2 = RMSNorm(hidden_size)
177
+ self.txt_norm1 = RMSNorm(txt_hidden_size)
178
+ self.txt_norm2 = RMSNorm(txt_hidden_size)
179
+ self.img_qkv = nn.Linear(hidden_size, inner_dim * 3)
180
+ self.txt_qkv = nn.Linear(txt_hidden_size, inner_dim * 3)
181
+ self.q_norm = RMSNorm(head_dim)
182
+ self.k_norm = RMSNorm(head_dim)
183
+ self.rope = MultiModalRotaryEmbeddingFast(head_dim)
184
+ self.img_attn_proj = nn.Linear(inner_dim, hidden_size)
185
+ self.txt_attn_proj = nn.Linear(inner_dim, txt_hidden_size)
186
+ self.img_mlp = SwiGLUMlp(hidden_size, int(hidden_size * mlp_ratio))
187
+ self.txt_mlp = SwiGLUMlp(txt_hidden_size, int(txt_hidden_size * mlp_ratio))
188
+
189
+ def forward(self, x, txt, vec):
190
+ b, li, _ = x.shape
191
+ lt = txt.shape[1]
192
+ x_norm = self.img_norm1(x)
193
+ txt_norm = self.txt_norm1(txt)
194
+ qkv_i = self.img_qkv(x_norm).reshape(b, li, 3, self.num_heads, self.head_dim)
195
+ qkv_t = self.txt_qkv(txt_norm).reshape(b, lt, 3, self.num_heads, self.head_dim)
196
+ q_i, k_i, v_i = qkv_i[:, :, 0], qkv_i[:, :, 1], qkv_i[:, :, 2]
197
+ q_t, k_t, v_t = qkv_t[:, :, 0], qkv_t[:, :, 1], qkv_t[:, :, 2]
198
+ q_i, k_i = self.q_norm(q_i), self.k_norm(k_i)
199
+ q_t, k_t = self.q_norm(q_t), self.k_norm(k_t)
200
+ q = self.rope(torch.cat([q_t, q_i], dim=1), txt_len=lt)
201
+ k = self.rope(torch.cat([k_t, k_i], dim=1), txt_len=lt)
202
+ v = torch.cat([v_t, v_i], dim=1)
203
+ attn = torch.einsum("bqhd,bkhd->bhqk", q, k) * (self.head_dim ** -0.5)
204
+ out = torch.einsum("bhqk,bkhd->bqhd", attn.softmax(dim=-1), v)
205
+ x = x + self.img_attn_proj(out[:, lt:].reshape(b, li, -1))
206
+ txt = txt + self.txt_attn_proj(out[:, :lt].reshape(b, lt, -1))
207
+ x = x + self.img_mlp(self.img_norm2(x))
208
+ txt = txt + self.txt_mlp(self.txt_norm2(txt))
209
+ return x, txt
210
+
211
+
212
+ class FinalLayer(nn.Module):
213
+ def __init__(self, hidden_size=1248, patch_size=16, out_channels=3):
214
+ super().__init__()
215
+ self.patch_size = patch_size
216
+ self.out_channels = out_channels
217
+ self.norm_final = RMSNorm(hidden_size)
218
+ self.linear = nn.Linear(hidden_size, patch_size * patch_size * out_channels)
219
+
220
+ def forward(self, x, vec=None):
221
+ return self.linear(self.norm_final(x))
222
+
223
+
224
+ def get_2d_sincos_pos_embed(embed_dim, grid_size, device, dtype):
225
+ grid_h = torch.arange(grid_size, device=device, dtype=torch.float32)
226
+ grid_w = torch.arange(grid_size, device=device, dtype=torch.float32)
227
+ grid = torch.meshgrid(grid_w, grid_h, indexing="xy")
228
+ grid = torch.stack(grid, dim=0).reshape(2, 1, grid_size, grid_size)
229
+ emb_h = get_1d_sincos_pos_embed(embed_dim // 2, grid[0])
230
+ emb_w = get_1d_sincos_pos_embed(embed_dim // 2, grid[1])
231
+ return torch.cat([emb_h, emb_w], dim=1).to(dtype=dtype)
232
+
233
+
234
+ def get_1d_sincos_pos_embed(embed_dim, pos):
235
+ omega = torch.arange(embed_dim // 2, device=pos.device, dtype=torch.float32)
236
+ omega = 1.0 / (10000 ** (omega / (embed_dim / 2.0)))
237
+ out = torch.einsum("m,d->md", pos.reshape(-1), omega)
238
+ return torch.cat([out.sin(), out.cos()], dim=1)
239
+
240
+
241
+ @dataclass
242
+ class MMJiTConfig:
243
+ image_size: int = 512
244
+ patch_size: int = 16
245
+ in_channels: int = 3
246
+ txt_input_size: int = 1024
247
+ hidden_size: int = 768
248
+ txt_hidden_size: int = 768
249
+ cond_vec_size: int = 768
250
+ depth_double: int = 17
251
+ txt_preamble_depth: int = 2
252
+ num_heads: int = 12
253
+ head_dim: int = 64
254
+ mlp_ratio: float = 2.6667
255
+ pca_channels: int = 128
256
+ prompt_length: int = 256
257
+ n_T: int = 100
258
+ prediction: str = "x"
259
+ sampler: str = "euler"
260
+ cfg_channels: int = 3
261
+ cfg_interval: tuple = (0.0, 1.0)
262
+ llm: str = "google/flan-t5-large"
263
+
264
+
265
+ class MMJiT(nn.Module):
266
+ def __init__(self, cfg: MMJiTConfig):
267
+ super().__init__()
268
+ self.cfg = cfg
269
+ self.latent_img_size = cfg.image_size // cfg.patch_size
270
+ self.img_embedder = BottleneckPatchEmbed(
271
+ cfg.image_size, cfg.patch_size, cfg.in_channels, cfg.pca_channels, cfg.hidden_size
272
+ )
273
+ self.txt_embedder = nn.Linear(cfg.txt_input_size, cfg.txt_hidden_size, bias=False)
274
+ self.mask_token = nn.Parameter(torch.zeros(1, 1, cfg.txt_input_size))
275
+ self.t_embedder = TimestepEmbedder(cfg.cond_vec_size)
276
+ self.pooled_embedder = nn.Linear(cfg.txt_input_size, cfg.cond_vec_size, bias=False)
277
+ self.txt_preamble_blocks = nn.ModuleList(
278
+ [
279
+ PlainTextTransformerBlock(cfg.txt_hidden_size, cfg.num_heads, cfg.head_dim, cfg.mlp_ratio)
280
+ for _ in range(cfg.txt_preamble_depth)
281
+ ]
282
+ )
283
+ self.double_blocks = nn.ModuleList(
284
+ [
285
+ DoubleStreamDiTBlock(
286
+ cfg.hidden_size, cfg.txt_hidden_size, cfg.num_heads, cfg.head_dim, cfg.mlp_ratio
287
+ )
288
+ for _ in range(cfg.depth_double)
289
+ ]
290
+ )
291
+ self.final_layer = FinalLayer(cfg.hidden_size, cfg.patch_size, cfg.in_channels)
292
+
293
+ def unpatchify(self, x):
294
+ b = x.shape[0]
295
+ p = self.cfg.patch_size
296
+ c = self.cfg.in_channels
297
+ h = w = int(math.sqrt(x.shape[1]))
298
+ x = x.reshape(b, h, w, p, p, c)
299
+ x = torch.einsum("nhwpqc->nchpwq", x)
300
+ return x.reshape(b, c, h * p, w * p)
301
+
302
+ def forward(self, img, t, context, attn_mask):
303
+ if img.ndim == 4 and img.shape[1] != self.cfg.in_channels:
304
+ img = img.permute(0, 3, 1, 2)
305
+ attn_mask = attn_mask.to(device=context.device)
306
+ context = torch.where(attn_mask[:, :, None] > 0.5, context, self.mask_token.to(dtype=context.dtype))
307
+ x = self.img_embedder(img)
308
+ pos = get_2d_sincos_pos_embed(self.cfg.hidden_size, self.latent_img_size, x.device, x.dtype)
309
+ x = x + pos[None]
310
+ t_vec = self.t_embedder(t)
311
+ txt = self.txt_embedder(context.to(dtype=self.txt_embedder.weight.dtype))
312
+ pooled_text = context.mean(dim=1)
313
+ vec = t_vec + self.pooled_embedder(pooled_text.to(dtype=self.pooled_embedder.weight.dtype))
314
+ for block in self.txt_preamble_blocks:
315
+ txt = block(txt)
316
+ for block in self.double_blocks:
317
+ x, txt = block(x, txt, vec)
318
+ combined = torch.cat([txt, x], dim=1)
319
+ out = self.final_layer(combined, vec)
320
+ img_out = out[:, txt.shape[1] :, :]
321
+ return self.unpatchify(img_out)
322
+
323
+
+ class DiffusionModel(nn.Module):
+     def __init__(self, cfg: Optional[MMJiTConfig] = None):
+         super().__init__()
+         self.cfg = cfg or MMJiTConfig()
+         self.net = MMJiT(self.cfg)
+
+     def real_t_to_embed_t(self, t):
+         return t
+
+     def pred_velocity(self, x, t, text, mask):
+         x0 = self.net(x, self.real_t_to_embed_t(t), text, mask)
+         return (x0 - x) / torch.clamp(1 - t[:, None, None, None], min=0.05)
+
+     def cfg_velocity(self, x, t, text, mask, cfg_scale: float):
+         b = x.shape[0]
+         xx = torch.cat([x, x], dim=0)
+         tt = torch.cat([t, t], dim=0)
+         yy = torch.cat([text, text], dim=0)
+         mm = torch.cat([mask, torch.zeros_like(mask)], dim=0)
+         out = self.pred_velocity(xx, tt, yy, mm)
+         cond, uncond = out[:b], out[b:]
+         use_cfg = ((t >= self.cfg.cfg_interval[0]) & (t <= self.cfg.cfg_interval[1])).to(out.dtype)
+         scale = torch.where(
+             use_cfg[:, None, None, None] > 0,
+             torch.tensor(cfg_scale, device=x.device, dtype=out.dtype),
+             torch.tensor(1.0, device=x.device, dtype=out.dtype),
+         )
+         return uncond + (cond - uncond) * scale
+
+     @torch.no_grad()
+     def sample(self, text, mask, cfg_scale=6.0, generator=None, progress=False):
+         b = text.shape[0]
+         device = text.device
+         dtype = next(self.parameters()).dtype
+         x = torch.randn(
+             b,
+             self.cfg.in_channels,
+             self.cfg.image_size,
+             self.cfg.image_size,
+             generator=generator,
+             device=device,
+             dtype=dtype,
+         ) * 2
+         timesteps = torch.linspace(0.0, 1.0, self.cfg.n_T + 1, device=device, dtype=dtype)
+         iterator = range(self.cfg.n_T)
+         if progress:
+             from tqdm.auto import tqdm
+
+             iterator = tqdm(iterator)
+         for i in iterator:
+             t_cur = timesteps[i].expand(b)
+             t_next = timesteps[i + 1].expand(b)
+             v = self.cfg_velocity(x, t_cur, text.to(dtype), mask.to(dtype), cfg_scale)
+             x = x + (t_next - t_cur)[:, None, None, None] * v
+         return x
+
+
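The interplay between `pred_velocity` and the Euler loop in `sample` is easier to see on a single scalar. The sketch below mirrors the two formulas from the class above (velocity `(x_pred - x) / max(1 - t, 0.05)` and update `x + (t_next - t_cur) * v`) without any tensors. It assumes a hypothetical oracle "network" that always predicts the same clean value; the function names are illustrative and not part of the commit.

```python
def pred_velocity(x, t, x_pred, min_denom=0.05):
    """Turn a clean-sample prediction into a velocity, as pred_velocity does."""
    # The denominator is clamped so steps with t close to 1 do not blow up.
    return (x_pred - x) / max(1.0 - t, min_denom)


def euler_sample(x_start, x_pred, n_steps=100):
    """Integrate dx/dt = v from t=0 (noise) to t=1 (data) with uniform steps."""
    x = x_start
    for i in range(n_steps):
        t_cur = i / n_steps
        t_next = (i + 1) / n_steps
        x = x + (t_next - t_cur) * pred_velocity(x, t_cur, x_pred)
    return x


# Hypothetical oracle: the "network" always predicts the clean value 1.0.
result = euler_sample(x_start=2.0, x_pred=1.0)
print(result)
```

Because of the clamp, the last few steps no longer land exactly on the prediction, so a small residual remains even with a perfect predictor. That is the price of keeping the division stable near `t = 1`.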
+ class MiniT2IMMJiTModel(ModelMixin, ConfigMixin):
+     """MiniT2I MM-JiT transformer for pixel-space flow matching."""
+
+     config_name = "config.json"
+
+     @register_to_config
+     def __init__(
+         self,
+         image_size: int = 512,
+         patch_size: int = 16,
+         in_channels: int = 3,
+         txt_input_size: int = 1024,
+         hidden_size: int = 768,
+         txt_hidden_size: int = 768,
+         cond_vec_size: int = 768,
+         depth_double: int = 17,
+         txt_preamble_depth: int = 2,
+         num_heads: int = 12,
+         head_dim: int = 64,
+         mlp_ratio: float = 2.6666666666666665,
+         pca_channels: int = 128,
+         prompt_length: int = 256,
+         n_T: int = 100,
+         prediction: str = "x",
+         sampler: str = "euler",
+         cfg_channels: int = 3,
+         cfg_interval: tuple = (0.0, 1.0),
+         llm: str = "google/flan-t5-large",
+     ):
+         super().__init__()
+         cfg = MMJiTConfig(
+             image_size=image_size,
+             patch_size=patch_size,
+             in_channels=in_channels,
+             txt_input_size=txt_input_size,
+             hidden_size=hidden_size,
+             txt_hidden_size=txt_hidden_size,
+             cond_vec_size=cond_vec_size,
+             depth_double=depth_double,
+             txt_preamble_depth=txt_preamble_depth,
+             num_heads=num_heads,
+             head_dim=head_dim,
+             mlp_ratio=mlp_ratio,
+             pca_channels=pca_channels,
+             prompt_length=prompt_length,
+             n_T=n_T,
+             prediction=prediction,
+             sampler=sampler,
+             cfg_channels=cfg_channels,
+             cfg_interval=tuple(cfg_interval),
+             llm=llm,
+         )
+         self.model = DiffusionModel(cfg)
+
+     @property
+     def mmjit_config(self) -> MMJiTConfig:
+         return self.model.cfg
+
+     def forward(self, img, t, context, attn_mask):
+         return self.model.net(img, t, context, attn_mask)
+
+     def pred_velocity(self, x, t, text, mask):
+         return self.model.pred_velocity(x, t, text, mask)
+
+     def sample(self, text, mask, cfg_scale=6.0, generator=None, progress=False):
+         return self.model.sample(text, mask, cfg_scale=cfg_scale, generator=generator, progress=progress)
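The `cfg_scale` that this wrapper forwards ends up in `DiffusionModel.cfg_velocity`, which runs a conditional and an unconditional branch in one batch (the unconditional half gets an all-zero mask, so `forward` swaps every text token for `mask_token`) and blends them only while `t` lies inside `cfg_interval`. A scalar sketch of that blending rule, with a made-up function name, might look like this:

```python
def guided_velocity(cond, uncond, t, cfg_scale, cfg_interval=(0.0, 1.0)):
    """Classifier-free guidance with interval gating (scalar illustration)."""
    lo, hi = cfg_interval
    # Outside the interval the scale falls back to 1.0, which returns `cond`.
    scale = cfg_scale if lo <= t <= hi else 1.0
    return uncond + (cond - uncond) * scale


inside = guided_velocity(cond=1.0, uncond=0.0, t=0.5, cfg_scale=6.0)
outside = guided_velocity(cond=1.0, uncond=0.0, t=0.5, cfg_scale=6.0, cfg_interval=(0.6, 1.0))
print(inside, outside)
```

With the default `cfg_interval=(0.0, 1.0)` every sampling step is guided, so the gating only matters for configs that narrow the interval.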
README.md ADDED
@@ -0,0 +1,156 @@
+ ---
+ license: mit
+ library_name: diffusers
+ pipeline_tag: text-to-image
+ tags:
+ - diffusers
+ - minit2i
+ - image-generation
+ - text-to-image
+ - flow-matching
+ - pixel-space
+ inference: true
+ widget:
+ - text: A lonely astronaut standing on a quiet beach under two moons.
+   output:
+     url: MiniT2I-B-16/demo.png
+ language:
+ - en
+ ---
+
+ # BiliSakura/MiniT2I-diffusers
+
+ Self-contained MiniT2I text-to-image checkpoints for Hugging Face diffusers. Each variant folder ships its own pipeline code, component modules, bundled FLAN-T5-Large text encoder, and transformer weights.
+
+ Converted from [`MiniT2I/MiniT2I`](https://huggingface.co/MiniT2I/MiniT2I) using [MiniT2I-diffusers](https://github.com/Bili-Sakura/Visual-Generative-Foundation-Model-Collection/tree/main/libs/MiniT2I-diffusers) in [Visual-Generative-Foundation-Model-Collection](https://github.com/Bili-Sakura/Visual-Generative-Foundation-Model-Collection).
+
+ ## Available checkpoints
+
+ | Subfolder | Model | Params (denoiser + text encoder) | Patch | Recommended CFG |
+ | --- | --- | --- | ---: | ---: |
+ | [`MiniT2I-B-16/`](MiniT2I-B-16/) | MiniT2I-B/16 | 258M + 341M | 16 | 2.5 |
+ | [`MiniT2I-L-16/`](MiniT2I-L-16/) | MiniT2I-L/16 | 912M + 341M | 16 | 6.0 |
+
+ ## Repo layout
+
+ ```text
+ BiliSakura/MiniT2I-diffusers/
+ ├── README.md
+ ├── MiniT2I-B-16/
+ │   ├── pipeline.py
+ │   ├── model_index.json
+ │   ├── conversion_metadata.json
+ │   ├── demo.png
+ │   ├── scheduler/
+ │   │   └── scheduler_config.json
+ │   ├── text_encoder/
+ │   ├── tokenizer/
+ │   └── transformer/
+ │       ├── config.json
+ │       ├── diffusion_pytorch_model.safetensors
+ │       └── transformer_minit2i.py
+ └── MiniT2I-L-16/
+     └── ...
+ ```
+
+ Each variant is self-contained: load with `custom_pipeline=.../pipeline.py` and `trust_remote_code=True`. MiniT2I denoises directly in RGB pixel space (no VAE).
+
+ ## Demo
+
+ ![MiniT2I-B-16 demo](MiniT2I-B-16/demo.png)
+
+ Prompt: *"A lonely astronaut standing on a quiet beach under two moons."* Generated with MiniT2I-B/16 at 512×512, 100 steps, `guidance_scale=2.5`, seed 42.
+
+ ## Load from Hugging Face
+
+ ```python
+ import torch
+ from diffusers import DiffusionPipeline
+
+ pipe = DiffusionPipeline.from_pretrained(
+     "BiliSakura/MiniT2I-diffusers/MiniT2I-B-16",
+     trust_remote_code=True,
+     torch_dtype=torch.bfloat16,
+ ).to("cuda")
+
+ generator = torch.Generator(device="cuda").manual_seed(42)
+ image = pipe(
+     "A lonely astronaut standing on a quiet beach under two moons.",
+     num_inference_steps=100,
+     guidance_scale=2.5,
+     generator=generator,
+ ).images[0]
+ image.save("demo.png")
+ ```
+
+ For MiniT2I-L/16, use `MiniT2I-L-16` and `guidance_scale=6.0`.
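Depending on the installed `diffusers` and `huggingface_hub` versions, a three-segment id such as `BiliSakura/MiniT2I-diffusers/MiniT2I-B-16` may be rejected as an invalid repo id. One possible workaround, sketched here under that assumption, is to download only the variant folder with `huggingface_hub.snapshot_download` and then load it the same way as the local-clone example in this README. The helper names `variant_patterns` and `load_variant` are illustrative and not part of the repository.

```python
def variant_patterns(variant):
    """Glob patterns selecting one variant subfolder of the repo."""
    return [f"{variant}/*"]


def load_variant(variant="MiniT2I-B-16", repo_id="BiliSakura/MiniT2I-diffusers"):
    """Fetch one variant folder and load it as a local pipeline (sketch)."""
    # Imports are local so variant_patterns() works without these packages.
    import os

    import torch
    from diffusers import DiffusionPipeline
    from huggingface_hub import snapshot_download

    # Download only the files under the chosen variant folder.
    root = snapshot_download(repo_id=repo_id, allow_patterns=variant_patterns(variant))
    model_dir = os.path.join(root, variant)
    # Mirrors the local-clone example: point custom_pipeline at pipeline.py.
    return DiffusionPipeline.from_pretrained(
        model_dir,
        custom_pipeline=os.path.join(model_dir, "pipeline.py"),
        trust_remote_code=True,
        torch_dtype=torch.bfloat16,
    )


print(variant_patterns("MiniT2I-L-16"))
```

Calling `load_variant("MiniT2I-L-16").to("cuda")` would then return a pipeline that can be used like `pipe` above. Note that `trust_remote_code=True` executes the repository's `pipeline.py`, so it is worth reviewing that file first.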
+
+ ## Load from a local clone
+
+ ```python
+ from pathlib import Path
+ import torch
+ from diffusers import DiffusionPipeline
+
+ model_dir = Path("./MiniT2I-B-16").resolve()
+ pipe = DiffusionPipeline.from_pretrained(
+     str(model_dir),
+     local_files_only=True,
+     custom_pipeline=str(model_dir / "pipeline.py"),
+     trust_remote_code=True,
+     torch_dtype=torch.bfloat16,
+ ).to("cuda")
+
+ generator = torch.Generator(device="cuda").manual_seed(42)
+ image = pipe(
+     "A lonely astronaut standing on a quiet beach under two moons.",
+     num_inference_steps=100,
+     guidance_scale=2.5,
+     generator=generator,
+ ).images[0]
+ image.save("demo.png")
+ ```
+
+ Load a **variant subfolder** (e.g. `./MiniT2I-B-16`), not the repo root.
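A small guard can catch the "repo root instead of variant folder" mistake before `from_pretrained` fails with a less obvious error. The sketch below only checks for the two files that the repo layout lists at the top of each variant folder (`model_index.json` and `pipeline.py`); the function name `is_variant_dir` is hypothetical.

```python
from pathlib import Path

# Files the repo layout shows directly inside every variant folder.
REQUIRED_FILES = ("model_index.json", "pipeline.py")


def is_variant_dir(path):
    """Return True if `path` looks like a MiniT2I variant folder."""
    root = Path(path)
    return all((root / name).is_file() for name in REQUIRED_FILES)


print(is_variant_dir("./MiniT2I-B-16"))
```

The repo root contains only `README.md` and the variant folders, so it fails this check while `./MiniT2I-B-16` passes once the clone is complete.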
+
+ ## Recommended inference settings
+
+ | Variant | Resolution | Steps | CFG scale | `torch_dtype` |
+ | --- | --- | ---: | ---: | --- |
+ | `MiniT2I-B-16` | 512×512 | 100 | 2.5 | `bfloat16` |
+ | `MiniT2I-L-16` | 512×512 | 100 | 6.0 | `bfloat16` |
+
+ For GenEval / DPG-Bench evaluation, upstream configs use `guidance_scale=5.0` for both B/16 and L/16.
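The table and the benchmark note can be folded into a small lookup so that scripts do not hard-code per-variant numbers. This is only a convenience sketch: the dictionary and the `call_kwargs` helper are made up here, and the values are copied from the table above.

```python
# Values taken from the "Recommended inference settings" table.
RECOMMENDED = {
    "MiniT2I-B-16": {"num_inference_steps": 100, "guidance_scale": 2.5},
    "MiniT2I-L-16": {"num_inference_steps": 100, "guidance_scale": 6.0},
}
# Upstream GenEval / DPG-Bench configs use the same guidance for both variants.
BENCHMARK_GUIDANCE = 5.0


def call_kwargs(variant, benchmark=False):
    """Return keyword arguments for the pipeline call of a given variant."""
    kwargs = dict(RECOMMENDED[variant])  # copy so the table stays untouched
    if benchmark:
        kwargs["guidance_scale"] = BENCHMARK_GUIDANCE
    return kwargs


print(call_kwargs("MiniT2I-L-16"))
print(call_kwargs("MiniT2I-B-16", benchmark=True))
```

The result can be splatted into the call from the loading examples, for instance `pipe(prompt, generator=generator, **call_kwargs("MiniT2I-L-16"))`.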
+
+ ## Interface notes
+
+ - Text conditioning uses bundled `google/flan-t5-large` (`T5EncoderModel` + `T5Tokenizer`).
+ - Scheduler is `FlowMatchEulerDiscreteScheduler` with 1000 training timesteps and `shift=1.0`.
+ - `guidance_scale > 1.0` enables classifier-free guidance with an empty-string null prompt.
+ - Output resolution is fixed at 512×512 for these exports.
+
+ ## Regenerate bundles
+
+ From the repository root:
+
+ ```bash
+ conda activate rsgen
+ python scripts/convert_minit2i_to_bilisakura.py
+ ```
+
+ ## Links
+
+ - Blog: [MiniT2I: A Minimalist Baseline for Text-to-Image Generation](https://peppaking8.github.io/#/post/minit2i)
+ - Upstream checkpoints: [MiniT2I/MiniT2I](https://huggingface.co/MiniT2I/MiniT2I)
+ - PyTorch/Diffusers source: [MiniT2I-diffusers](https://github.com/Bili-Sakura/Visual-Generative-Foundation-Model-Collection/tree/main/libs/MiniT2I-diffusers)
+
+ ## Citation
+
+ ```bibtex
+ @misc{minit2i2026,
+   title = {MiniT2I: A Minimalist Baseline for Text-to-Image Generation},
+   author = {Wang, Xianbang and Zhao, Hanhong and Lu, Yiyang and Zhou, Kangyang and Ma, Linrui and He, Kaiming},
+   year = {2026},
+   url = {https://peppaking8.github.io/#/post/minit2i}
+ }
+ ```