| # Hunyuan-DiT | |
|  | |
| [Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding](https://huggingface.co/papers/2405.08748) from Tencent Hunyuan. | |
| The abstract from the paper is: | |
| *We present Hunyuan-DiT, a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. To construct Hunyuan-DiT, we carefully design the transformer structure, text encoder, and positional encoding. We also build from scratch a whole data pipeline to update and evaluate data for iterative model optimization. For fine-grained language understanding, we train a Multimodal Large Language Model to refine the captions of the images. Finally, Hunyuan-DiT can perform multi-turn multimodal dialogue with users, generating and refining images according to the context. Through our holistic human evaluation protocol with more than 50 professional human evaluators, Hunyuan-DiT sets a new state-of-the-art in Chinese-to-image generation compared with other open-source models.* | |
| You can find the original codebase at [Tencent/HunyuanDiT](https://github.com/Tencent/HunyuanDiT) and all the available checkpoints at [Tencent-Hunyuan](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT). | |
| **Highlights**: HunyuanDiT supports Chinese/English-to-image, multi-resolution generation. | |
| HunyuanDiT has the following components: | |
| * It uses a diffusion transformer as the backbone | |
| * It combines two text encoders, a bilingual CLIP and a multilingual T5 encoder | |
| > [!TIP] | |
| > Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. | |
| > [!TIP] | |
| > You can further improve generation quality by passing the generated image from `HunyuanDiTPipeline` to the [SDXL refiner](../../using-diffusers/sdxl#base-to-refiner-model) model. | |
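The refinement step mentioned in the tip can be sketched roughly as follows. This is a minimal sketch, not a recommended recipe: the refiner checkpoint (`stabilityai/stable-diffusion-xl-refiner-1.0`), the separate English `refiner_prompt`, and the `strength` value are assumptions chosen for illustration, and the helper name `generate_and_refine` is hypothetical. The imports sit inside the function because running it needs a CUDA GPU and downloads both models.

```python
def generate_and_refine(prompt: str, refiner_prompt: str, strength: float = 0.3):
    """Generate an image with HunyuanDiT, then pass it through the SDXL refiner."""
    import torch
    from diffusers import HunyuanDiTPipeline, StableDiffusionXLImg2ImgPipeline

    # Base generation with HunyuanDiT (accepts Chinese or English prompts).
    base = HunyuanDiTPipeline.from_pretrained(
        "Tencent-Hunyuan/HunyuanDiT-Diffusers", torch_dtype=torch.float16
    ).to("cuda")
    image = base(prompt=prompt).images[0]

    # Image-to-image pass with the SDXL refiner (checkpoint name is an assumption).
    refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-refiner-1.0",
        torch_dtype=torch.float16,
        variant="fp16",
        use_safetensors=True,
    ).to("cuda")
    return refiner(prompt=refiner_prompt, image=image, strength=strength).images[0]
```

A call such as `generate_and_refine("一个宇航员在骑马", "An astronaut riding a horse")` would keep both pipelines on the GPU at the same time, so on smaller cards it may be necessary to free the base pipeline before loading the refiner.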
| ## Optimization | |
| You can optimize the pipeline's runtime and memory consumption with torch.compile and feed-forward chunking. To learn about other optimization methods, check out the [Speed up inference](../../optimization/fp16) and [Reduce memory usage](../../optimization/memory) guides. | |
| ### Inference | |
| Use [`torch.compile`](https://huggingface.co/docs/diffusers/main/en/tutorials/fast_diffusion#torchcompile) to reduce the inference latency. | |
| First, load the pipeline: | |
| ```python | |
| from diffusers import HunyuanDiTPipeline | |
| import torch | |
| pipeline = HunyuanDiTPipeline.from_pretrained( | |
| "Tencent-Hunyuan/HunyuanDiT-Diffusers", torch_dtype=torch.float16 | |
| ).to("cuda") | |
| ``` | |
| Then change the memory layout of the pipeline's `transformer` and `vae` components to `torch.channels_last`: | |
| ```python | |
| pipeline.transformer.to(memory_format=torch.channels_last) | |
| pipeline.vae.to(memory_format=torch.channels_last) | |
| ``` | |
| Finally, compile the components and run inference: | |
| ```python | |
| pipeline.transformer = torch.compile(pipeline.transformer, mode="max-autotune", fullgraph=True) | |
| pipeline.vae.decode = torch.compile(pipeline.vae.decode, mode="max-autotune", fullgraph=True) | |
| image = pipeline(prompt="一个宇航员在骑马").images[0] | |
| ``` | |
| The [benchmark](https://gist.github.com/sayakpaul/29d3a14905cfcbf611fe71ebd22e9b23) results on an 80GB A100 machine are: | |
| ```bash | |
| With torch.compile(): Average inference time: 12.470 seconds. | |
| Without torch.compile(): Average inference time: 20.570 seconds. | |
| ``` | |
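To put the two timings in perspective, the short calculation below derives the relative speedup and the fraction of latency saved. It uses only the two numbers reported above and is plain arithmetic, not part of the benchmark script.

```python
# Average inference times copied from the benchmark output above (seconds).
compiled_s = 12.470
eager_s = 20.570

speedup = eager_s / compiled_s                 # how many times faster the compiled run is
time_saved = (eager_s - compiled_s) / eager_s  # fraction of the original latency removed

print(f"speedup: {speedup:.2f}x, latency reduction: {time_saved:.1%}")
```

On these numbers, compilation gives roughly a 1.65x speedup, or about 39% less time per image.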
| ### Memory optimization | |
| By loading the T5 text encoder in 8 bits, you can run the pipeline in just under 6 GB of GPU VRAM. Refer to [this script](https://gist.github.com/sayakpaul/3154605f6af05b98a41081aaba5ca43e) for details. | |
| Furthermore, you can use the [enable_forward_chunking()](/docs/diffusers/pr_12595/en/api/models/hunyuan_transformer2d#diffusers.HunyuanDiT2DModel.enable_forward_chunking) method to reduce memory usage. Feed-forward chunking runs the feed-forward layers in a transformer block in a loop instead of all at once. This gives you a trade-off between memory consumption and inference runtime. | |
| ```diff | |
| + pipeline.transformer.enable_forward_chunking(chunk_size=1, dim=1) | |
| ``` | |
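The idea behind feed-forward chunking can be illustrated without any deep learning library. The sketch below is not the Diffusers implementation: `feed_forward` is a hypothetical stand-in for a position-wise layer, and the example assumes that `dim=1` refers to the sequence (token) dimension. Because such a layer treats every token independently, processing the sequence in slices gives the same result while only a slice's worth of intermediate values exists at any one time.

```python
from typing import Callable, List

Token = List[float]


def feed_forward(tokens: List[Token]) -> List[Token]:
    """Hypothetical position-wise layer: transforms each token independently."""
    return [[2.0 * x + 1.0 for x in token] for token in tokens]


def chunked_feed_forward(
    ff: Callable[[List[Token]], List[Token]], tokens: List[Token], chunk_size: int
) -> List[Token]:
    """Apply `ff` to `chunk_size` tokens at a time and concatenate the results."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    output: List[Token] = []
    for start in range(0, len(tokens), chunk_size):
        output.extend(ff(tokens[start:start + chunk_size]))
    return output


tokens = [[float(i), float(i) + 0.5] for i in range(6)]
full = feed_forward(tokens)
chunked = chunked_feed_forward(feed_forward, tokens, chunk_size=1)
print(chunked == full)
```

A smaller `chunk_size` means more loop iterations (slower) but a smaller peak working set, which is the trade-off described above.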
| ## HunyuanDiTPipeline[[diffusers.HunyuanDiTPipeline]] | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>class diffusers.HunyuanDiTPipeline</name><anchor>diffusers.HunyuanDiTPipeline</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12595/src/diffusers/pipelines/hunyuandit/pipeline_hunyuandit.py#L149</source><parameters>[{"name": "vae", "val": ": AutoencoderKL"}, {"name": "text_encoder", "val": ": BertModel"}, {"name": "tokenizer", "val": ": BertTokenizer"}, {"name": "transformer", "val": ": HunyuanDiT2DModel"}, {"name": "scheduler", "val": ": DDPMScheduler"}, {"name": "safety_checker", "val": ": StableDiffusionSafetyChecker"}, {"name": "feature_extractor", "val": ": CLIPImageProcessor"}, {"name": "requires_safety_checker", "val": ": bool = True"}, {"name": "text_encoder_2", "val": ": typing.Optional[transformers.models.t5.modeling_t5.T5EncoderModel] = None"}, {"name": "tokenizer_2", "val": ": typing.Optional[transformers.models.mt5.tokenization_mt5.MT5Tokenizer] = None"}]</parameters><paramsdesc>- **vae** ([AutoencoderKL](/docs/diffusers/pr_12595/en/api/models/autoencoderkl#diffusers.AutoencoderKL)) -- | |
| Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. We use | |
| `sdxl-vae-fp16-fix`. | |
| - **text_encoder** (Optional[`~transformers.BertModel`, `~transformers.CLIPTextModel`]) -- | |
| Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). | |
| HunyuanDiT uses a fine-tuned bilingual CLIP. | |
| - **tokenizer** (Optional[`~transformers.BertTokenizer`, `~transformers.CLIPTokenizer`]) -- | |
| A `BertTokenizer` or `CLIPTokenizer` to tokenize text. | |
| - **transformer** ([HunyuanDiT2DModel](/docs/diffusers/pr_12595/en/api/models/hunyuan_transformer2d#diffusers.HunyuanDiT2DModel)) -- | |
| The HunyuanDiT model designed by Tencent Hunyuan. | |
| - **text_encoder_2** (`T5EncoderModel`) -- | |
| The mT5 embedder. Specifically, it is 't5-v1_1-xxl'. | |
| - **tokenizer_2** (`MT5Tokenizer`) -- | |
| The tokenizer for the mT5 embedder. | |
| - **scheduler** ([DDPMScheduler](/docs/diffusers/pr_12595/en/api/schedulers/ddpm#diffusers.DDPMScheduler)) -- | |
| A scheduler to be used in combination with HunyuanDiT to denoise the encoded image latents.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Pipeline for English/Chinese-to-image generation using HunyuanDiT. | |
| This model inherits from [DiffusionPipeline](/docs/diffusers/pr_12595/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods the | |
| library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.) | |
| HunyuanDiT uses two text encoders: [mT5](https://huggingface.co/google/mt5-base) and a bilingual CLIP fine-tuned by the HunyuanDiT authors. | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>__call__</name><anchor>diffusers.HunyuanDiTPipeline.__call__</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12595/src/diffusers/pipelines/hunyuandit/pipeline_hunyuandit.py#L568</source><parameters>[{"name": "prompt", "val": ": typing.Union[str, typing.List[str]] = None"}, {"name": "height", "val": ": typing.Optional[int] = None"}, {"name": "width", "val": ": typing.Optional[int] = None"}, {"name": "num_inference_steps", "val": ": typing.Optional[int] = 50"}, {"name": "guidance_scale", "val": ": typing.Optional[float] = 5.0"}, {"name": "negative_prompt", "val": ": typing.Union[str, typing.List[str], NoneType] = None"}, {"name": "num_images_per_prompt", "val": ": typing.Optional[int] = 1"}, {"name": "eta", "val": ": typing.Optional[float] = 0.0"}, {"name": "generator", "val": ": typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None"}, {"name": "latents", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "prompt_embeds", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "prompt_embeds_2", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "negative_prompt_embeds", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "negative_prompt_embeds_2", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "prompt_attention_mask", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "prompt_attention_mask_2", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "negative_prompt_attention_mask", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "negative_prompt_attention_mask_2", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "output_type", "val": ": typing.Optional[str] = 'pil'"}, {"name": "return_dict", "val": ": bool = True"}, {"name": "callback_on_step_end", "val": ": typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None"}, {"name": 
"callback_on_step_end_tensor_inputs", "val": ": typing.List[str] = ['latents']"}, {"name": "guidance_rescale", "val": ": float = 0.0"}, {"name": "original_size", "val": ": typing.Optional[typing.Tuple[int, int]] = (1024, 1024)"}, {"name": "target_size", "val": ": typing.Optional[typing.Tuple[int, int]] = None"}, {"name": "crops_coords_top_left", "val": ": typing.Tuple[int, int] = (0, 0)"}, {"name": "use_resolution_binning", "val": ": bool = True"}]</parameters><paramsdesc>- **prompt** (`str` or `List[str]`, *optional*) -- | |
| The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`. | |
| - **height** (`int`) -- | |
| The height in pixels of the generated image. | |
| - **width** (`int`) -- | |
| The width in pixels of the generated image. | |
| - **num_inference_steps** (`int`, *optional*, defaults to 50) -- | |
| The number of denoising steps. More denoising steps usually lead to a higher quality image at the | |
| expense of slower inference. | |
| - **guidance_scale** (`float`, *optional*, defaults to 5.0) -- | |
| A higher guidance scale value encourages the model to generate images closely linked to the text | |
| `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. | |
| - **negative_prompt** (`str` or `List[str]`, *optional*) -- | |
| The prompt or prompts to guide what to not include in image generation. If not defined, you need to | |
| pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale <= 1`). | |
| - **num_images_per_prompt** (`int`, *optional*, defaults to 1) -- | |
| The number of images to generate per prompt. | |
| - **eta** (`float`, *optional*, defaults to 0.0) -- | |
| Corresponds to parameter eta (η) from the [DDIM](https://huggingface.co/papers/2010.02502) paper. Only | |
| applies to the [DDIMScheduler](/docs/diffusers/pr_12595/en/api/schedulers/ddim#diffusers.DDIMScheduler), and is ignored in other schedulers. | |
| - **generator** (`torch.Generator` or `List[torch.Generator]`, *optional*) -- | |
| A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make | |
| generation deterministic. | |
| - **prompt_embeds** (`torch.Tensor`, *optional*) -- | |
| Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not | |
| provided, text embeddings are generated from the `prompt` input argument. | |
| - **prompt_embeds_2** (`torch.Tensor`, *optional*) -- | |
| Pre-generated text embeddings for the second text encoder (T5). Can be used to easily tweak text inputs (prompt weighting). If not | |
| provided, text embeddings are generated from the `prompt` input argument. | |
| - **negative_prompt_embeds** (`torch.Tensor`, *optional*) -- | |
| Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If | |
| not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument. | |
| - **negative_prompt_embeds_2** (`torch.Tensor`, *optional*) -- | |
| Pre-generated negative text embeddings for the second text encoder (T5). Can be used to easily tweak text inputs (prompt weighting). If | |
| not provided, `negative_prompt_embeds_2` are generated from the `negative_prompt` input argument. | |
| - **prompt_attention_mask** (`torch.Tensor`, *optional*) -- | |
| Attention mask for the prompt. Required when `prompt_embeds` is passed directly. | |
| - **prompt_attention_mask_2** (`torch.Tensor`, *optional*) -- | |
| Attention mask for the prompt. Required when `prompt_embeds_2` is passed directly. | |
| - **negative_prompt_attention_mask** (`torch.Tensor`, *optional*) -- | |
| Attention mask for the negative prompt. Required when `negative_prompt_embeds` is passed directly. | |
| - **negative_prompt_attention_mask_2** (`torch.Tensor`, *optional*) -- | |
| Attention mask for the negative prompt. Required when `negative_prompt_embeds_2` is passed directly. | |
| - **output_type** (`str`, *optional*, defaults to `"pil"`) -- | |
| The output format of the generated image. Choose between `PIL.Image` or `np.array`. | |
| - **return_dict** (`bool`, *optional*, defaults to `True`) -- | |
| Whether or not to return a [StableDiffusionPipelineOutput](/docs/diffusers/pr_12595/en/api/pipelines/stable_diffusion/depth2img#diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput) instead of a | |
| plain tuple. | |
| - **callback_on_step_end** (`Callable[[int, int, Dict], None]`, `PipelineCallback`, `MultiPipelineCallbacks`, *optional*) -- | |
| A callback function or a list of callback functions to be called at the end of each denoising step. | |
| - **callback_on_step_end_tensor_inputs** (`List[str]`, *optional*) -- | |
| A list of tensor inputs that should be passed to the callback function. If not defined, all tensor | |
| inputs will be passed. | |
| - **guidance_rescale** (`float`, *optional*, defaults to 0.0) -- | |
| Rescale the noise_cfg according to `guidance_rescale`. Based on findings of [Common Diffusion Noise | |
| Schedules and Sample Steps are Flawed](https://huggingface.co/papers/2305.08891). See Section 3.4. | |
| - **original_size** (`Tuple[int, int]`, *optional*, defaults to `(1024, 1024)`) -- | |
| The original size of the image. Used to calculate the time ids. | |
| - **target_size** (`Tuple[int, int]`, *optional*) -- | |
| The target size of the image. Used to calculate the time ids. | |
| - **crops_coords_top_left** (`Tuple[int, int]`, *optional*, defaults to `(0, 0)`) -- | |
| The top left coordinates of the crop. Used to calculate the time ids. | |
| - **use_resolution_binning** (`bool`, *optional*, defaults to `True`) -- | |
| Whether to use resolution binning or not. If `True`, the input resolution will be mapped to the closest | |
| standard resolution. Supported resolutions are 1024x1024, 1280x1280, 1024x768, 1152x864, 1280x960, | |
| 768x1024, 864x1152, 960x1280, 1280x768, and 768x1280. It is recommended to set this to `True`.</paramsdesc><paramgroups>0</paramgroups><rettype>[StableDiffusionPipelineOutput](/docs/diffusers/pr_12595/en/api/pipelines/stable_diffusion/depth2img#diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput) or `tuple`</rettype><retdesc>If `return_dict` is `True`, [StableDiffusionPipelineOutput](/docs/diffusers/pr_12595/en/api/pipelines/stable_diffusion/depth2img#diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput) is returned, | |
| otherwise a `tuple` is returned where the first element is a list with the generated images and the | |
| second element is a list of `bool`s indicating whether the corresponding generated image contains | |
| "not-safe-for-work" (nsfw) content.</retdesc></docstring> | |
| The call function to the pipeline for generation with HunyuanDiT. | |
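The effect of `guidance_scale` can be made concrete with the standard classifier-free guidance combination, in which the unconditional noise prediction is pushed toward the text-conditioned one. The snippet below is a plain-Python sketch of that formula on toy vectors; the helper name `apply_guidance` and the sample values are illustrative, and the pipeline itself operates on tensors.

```python
from typing import List


def apply_guidance(
    noise_uncond: List[float], noise_text: List[float], guidance_scale: float
) -> List[float]:
    """Classifier-free guidance: uncond + scale * (text - uncond), element-wise."""
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


uncond = [0.0, 1.0, -1.0]
text = [1.0, 1.0, 0.0]

print(apply_guidance(uncond, text, 1.0))  # scale 1 reproduces the text-conditioned prediction
print(apply_guidance(uncond, text, 5.0))  # larger scales amplify the difference between the two
```

With a scale of 1 the result equals the text-conditioned prediction, which is why guidance only has an effect for `guidance_scale > 1`.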
| <ExampleCodeBlock anchor="diffusers.HunyuanDiTPipeline.__call__.example"> | |
| Examples: | |
| ```py | |
| >>> import torch | |
| >>> from diffusers import HunyuanDiTPipeline | |
| >>> pipe = HunyuanDiTPipeline.from_pretrained( | |
| ... "Tencent-Hunyuan/HunyuanDiT-Diffusers", torch_dtype=torch.float16 | |
| ... ) | |
| >>> pipe.to("cuda") | |
| >>> # You may also use an English prompt, as HunyuanDiT supports both English and Chinese | |
| >>> # prompt = "An astronaut riding a horse" | |
| >>> prompt = "一个宇航员在骑马" | |
| >>> image = pipe(prompt).images[0] | |
| ``` | |
| </ExampleCodeBlock> | |
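To give a feel for what `use_resolution_binning` does, the sketch below maps a requested size to one of the supported resolutions listed in the parameter description. The matching rule used here (closest aspect ratio first, then closest pixel area) is an assumption made for illustration and may differ from the rule the pipeline applies; `closest_supported` is a hypothetical helper, and the listed sizes are read as `(width, height)`.

```python
from fractions import Fraction
from typing import List, Tuple

# Resolutions taken from the `use_resolution_binning` description, read as (width, height).
SUPPORTED: List[Tuple[int, int]] = [
    (1024, 1024), (1280, 1280),
    (1024, 768), (1152, 864), (1280, 960),
    (768, 1024), (864, 1152), (960, 1280),
    (1280, 768), (768, 1280),
]


def closest_supported(width: int, height: int) -> Tuple[int, int]:
    """Pick the closest aspect ratio first, then the closest pixel area within that ratio."""
    target_ratio = Fraction(width, height)
    best_ratio = min(
        (Fraction(w, h) for w, h in SUPPORTED), key=lambda r: abs(r - target_ratio)
    )
    same_ratio = [(w, h) for w, h in SUPPORTED if Fraction(w, h) == best_ratio]
    return min(same_ratio, key=lambda wh: abs(wh[0] * wh[1] - width * height))


for size in [(1000, 1000), (1920, 1080), (1200, 900)]:
    print(size, "->", closest_supported(*size))
```

Under this rule a 1920x1080 request would land on 1280x768, the widest supported shape.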
| </div> | |
| <div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8"> | |
| <docstring><name>encode_prompt</name><anchor>diffusers.HunyuanDiTPipeline.encode_prompt</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12595/src/diffusers/pipelines/hunyuandit/pipeline_hunyuandit.py#L248</source><parameters>[{"name": "prompt", "val": ": str"}, {"name": "device", "val": ": device = None"}, {"name": "dtype", "val": ": dtype = None"}, {"name": "num_images_per_prompt", "val": ": int = 1"}, {"name": "do_classifier_free_guidance", "val": ": bool = True"}, {"name": "negative_prompt", "val": ": typing.Optional[str] = None"}, {"name": "prompt_embeds", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "negative_prompt_embeds", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "prompt_attention_mask", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "negative_prompt_attention_mask", "val": ": typing.Optional[torch.Tensor] = None"}, {"name": "max_sequence_length", "val": ": typing.Optional[int] = None"}, {"name": "text_encoder_index", "val": ": int = 0"}]</parameters><paramsdesc>- **prompt** (`str` or `List[str]`, *optional*) -- | |
| prompt to be encoded | |
| - **device** (`torch.device`) -- | |
| torch device | |
| - **dtype** (`torch.dtype`) -- | |
| torch dtype | |
| - **num_images_per_prompt** (`int`) -- | |
| number of images that should be generated per prompt | |
| - **do_classifier_free_guidance** (`bool`) -- | |
| whether to use classifier free guidance or not | |
| - **negative_prompt** (`str` or `List[str]`, *optional*) -- | |
| The prompt or prompts not to guide the image generation. If not defined, one has to pass | |
| `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is | |
| not greater than `1`). | |
| - **prompt_embeds** (`torch.Tensor`, *optional*) -- | |
| Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not | |
| provided, text embeddings will be generated from `prompt` input argument. | |
| - **negative_prompt_embeds** (`torch.Tensor`, *optional*) -- | |
| Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt | |
| weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input | |
| argument. | |
| - **prompt_attention_mask** (`torch.Tensor`, *optional*) -- | |
| Attention mask for the prompt. Required when `prompt_embeds` is passed directly. | |
| - **negative_prompt_attention_mask** (`torch.Tensor`, *optional*) -- | |
| Attention mask for the negative prompt. Required when `negative_prompt_embeds` is passed directly. | |
| - **max_sequence_length** (`int`, *optional*) -- maximum sequence length to use for the prompt. | |
| - **text_encoder_index** (`int`, *optional*) -- | |
| Index of the text encoder to use. `0` for CLIP and `1` for T5.</paramsdesc><paramgroups>0</paramgroups></docstring> | |
| Encodes the prompt into text encoder hidden states. | |
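One way `encode_prompt` might be combined with the embedding arguments of `__call__` is sketched below. It rests on an assumption that is not stated on this page: that `encode_prompt` returns a 4-tuple in the order (prompt embeddings, negative prompt embeddings, prompt attention mask, negative prompt attention mask). The helper name `generate_from_embeddings` is hypothetical, and the imports are placed inside the function because running it needs a CUDA GPU and downloads the checkpoint.

```python
def generate_from_embeddings(prompt: str):
    """Encode the prompt once per text encoder, then call the pipeline with embeddings only."""
    import torch
    from diffusers import HunyuanDiTPipeline

    pipe = HunyuanDiTPipeline.from_pretrained(
        "Tencent-Hunyuan/HunyuanDiT-Diffusers", torch_dtype=torch.float16
    ).to("cuda")

    # Assumed return order: embeds, negative embeds, mask, negative mask.
    # text_encoder_index=0 selects CLIP, text_encoder_index=1 selects T5.
    embeds, neg_embeds, mask, neg_mask = pipe.encode_prompt(
        prompt, device=pipe.device, text_encoder_index=0
    )
    embeds_2, neg_embeds_2, mask_2, neg_mask_2 = pipe.encode_prompt(
        prompt, device=pipe.device, text_encoder_index=1
    )

    # Each embedding tensor is passed together with its attention mask.
    return pipe(
        prompt_embeds=embeds,
        negative_prompt_embeds=neg_embeds,
        prompt_attention_mask=mask,
        negative_prompt_attention_mask=neg_mask,
        prompt_embeds_2=embeds_2,
        negative_prompt_embeds_2=neg_embeds_2,
        prompt_attention_mask_2=mask_2,
        negative_prompt_attention_mask_2=neg_mask_2,
    ).images[0]
```

Precomputing embeddings this way is mainly useful when the same prompt is reused across many generations or when the embeddings are modified before generation.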
| </div></div> | |