QwenImageTransformer2DModel

The model can be loaded with the following code snippet.

import torch
from diffusers import QwenImageTransformer2DModel

transformer = QwenImageTransformer2DModel.from_pretrained("Qwen/QwenImage-20B", subfolder="transformer", torch_dtype=torch.bfloat16)

QwenImageTransformer2DModel

diffusers.QwenImageTransformer2DModel


The Transformer model introduced in Qwen.

forward

diffusers.QwenImageTransformer2DModel.forward(hidden_states: Tensor, encoder_hidden_states: Tensor = None, encoder_hidden_states_mask: Tensor = None, timestep: LongTensor = None, img_shapes: list[tuple[int, int, int]] | None = None, txt_seq_lens: list[int] | None = None, guidance: Tensor = None, attention_kwargs: dict[str, typing.Any] | None = None, controlnet_block_samples = None, additional_t_cond = None, return_dict: bool = True)

Source: https://github.com/huggingface/diffusers/blob/vr_13331/src/diffusers/models/transformers/transformer_qwenimage.py#L835

hidden_states (torch.Tensor of shape (batch_size, image_sequence_length, in_channels)) -- Input hidden_states.

encoder_hidden_states (torch.Tensor of shape (batch_size, text_sequence_length, joint_attention_dim)) -- Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
encoder_hidden_states_mask (torch.Tensor of shape (batch_size, text_sequence_length), optional) -- Mask for the encoder hidden states. Expected to have 1.0 for valid tokens and 0.0 for padding tokens. Used in the attention processor to prevent attending to padding tokens. The mask can have any pattern (not just contiguous valid tokens followed by padding) since it's applied element-wise in attention.
timestep (torch.LongTensor) -- Used to indicate the denoising step.
img_shapes (list[tuple[int, int, int]], optional) -- Image shapes for RoPE computation.
txt_seq_lens (list[int], optional, Deprecated) -- Deprecated parameter. Use encoder_hidden_states_mask instead. If provided, the maximum value will be used to compute RoPE sequence length.
guidance (torch.Tensor, optional) -- Guidance tensor for conditional generation.
attention_kwargs (dict, optional) -- A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
controlnet_block_samples (optional) -- ControlNet block samples to add to the transformer blocks.
return_dict (bool, optional, defaults to True) -- Whether or not to return a Transformer2DModelOutput instead of a plain tuple.

The QwenImageTransformer2DModel forward method.
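The mask convention described for encoder_hidden_states_mask (1.0 for valid tokens, 0.0 for padding) can be sketched as follows. The helper `mask_from_lengths` is hypothetical and not part of diffusers; it builds a right-padded mask from per-sample text lengths, which is one possible way to move away from the deprecated txt_seq_lens argument. Plain Python lists are used to keep the sketch dependency-free.

```python
from typing import Optional


def mask_from_lengths(txt_seq_lens: list, max_len: Optional[int] = None) -> list:
    """Hypothetical helper: build a (batch_size, text_sequence_length) padding mask.

    Valid token positions get 1.0 and padding positions get 0.0,
    matching the convention documented for encoder_hidden_states_mask.
    """
    if max_len is None:
        max_len = max(txt_seq_lens)
    if any(n < 0 or n > max_len for n in txt_seq_lens):
        raise ValueError("each length must be between 0 and max_len")
    # One row per sample: ones for real tokens, zeros for right padding.
    return [[1.0] * n + [0.0] * (max_len - n) for n in txt_seq_lens]


# Two prompts with 3 and 5 valid tokens, padded to length 5.
mask = mask_from_lengths([3, 5])
```

In practice the nested list would be converted with `torch.tensor(mask)` and moved to the same device as encoder_hidden_states. Because the mask is applied element-wise in attention, non-contiguous patterns are also valid; this sketch only covers the right-padded case.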

Parameters:

patch_size (int, defaults to 2) : Patch size to turn the input data into small patches.

in_channels (int, defaults to 64) : The number of channels in the input.

out_channels (int, optional, defaults to None) : The number of channels in the output. If not specified, it defaults to in_channels.

num_layers (int, defaults to 60) : The number of layers of dual stream DiT blocks to use.

attention_head_dim (int, defaults to 128) : The number of dimensions to use for each attention head.

num_attention_heads (int, defaults to 24) : The number of attention heads to use.

joint_attention_dim (int, defaults to 3584) : The number of dimensions to use for the joint attention (embedding/channel dimension of encoder_hidden_states).

guidance_embeds (bool, defaults to False) : Whether to use guidance embeddings for guidance-distilled variant of the model.

axes_dims_rope (tuple[int], defaults to (16, 56, 56)) : The dimensions to use for the rotary positional embeddings.
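A few relationships between the documented defaults can be checked with simple arithmetic. The dataclass below is an illustrative stand-in that only mirrors the values listed above; it is not a diffusers class. Treating `num_attention_heads * attention_head_dim` as the hidden width of the transformer blocks is an assumption based on common transformer conventions, not something this page states.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QwenImageTransformerDefaults:
    """Illustrative stand-in mirroring the documented defaults (not a diffusers class)."""

    patch_size: int = 2
    in_channels: int = 64
    out_channels: Optional[int] = None
    num_layers: int = 60
    attention_head_dim: int = 128
    num_attention_heads: int = 24
    joint_attention_dim: int = 3584
    guidance_embeds: bool = False
    axes_dims_rope: tuple = (16, 56, 56)


cfg = QwenImageTransformerDefaults()

# Documented behaviour: out_channels falls back to in_channels when unset.
out_channels = cfg.out_channels if cfg.out_channels is not None else cfg.in_channels

# Assumption: hidden width is heads times per-head dimension.
inner_dim = cfg.num_attention_heads * cfg.attention_head_dim

# With the defaults, the rotary axes dimensions add up to the per-head dimension.
rope_dim = sum(cfg.axes_dims_rope)
```

With the defaults this gives an assumed hidden width of 3072, and the three rotary axes sum to 128, the same value as attention_head_dim.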

Returns:

If return_dict is True, a Transformer2DModelOutput is returned, otherwise a tuple where the first element is the sample tensor.
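Calling code that has to work with either setting of return_dict can normalize the result before use. The sketch below uses a hypothetical helper, `get_sample`, and stand-in objects instead of a real model call, so it only illustrates the two return shapes described above.

```python
from types import SimpleNamespace


def get_sample(output):
    """Hypothetical helper: return the sample from either return shape.

    A plain tuple (return_dict=False) carries the sample as its first element;
    an output object (return_dict=True) exposes it as the `sample` attribute.
    """
    if isinstance(output, tuple):
        return output[0]
    return output.sample


# Stand-ins for the two return shapes; a real call would hold a torch.Tensor.
as_object = SimpleNamespace(sample="sample-tensor")
as_tuple = ("sample-tensor",)

same = get_sample(as_object) == get_sample(as_tuple)
```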

Transformer2DModelOutput

diffusers.models.modeling_outputs.Transformer2DModelOutput


The output of Transformer2DModel.

Parameters:

sample (torch.Tensor of shape (batch_size, num_channels, height, width) or (batch_size, num_vector_embeds - 1, num_latent_pixels) if Transformer2DModel is discrete) : The hidden states output conditioned on the encoder_hidden_states input. If discrete, returns probability distributions for the unnoised latent pixels.
