
Lumina2Transformer2DModel

A Diffusion Transformer model for 2D image data, introduced in Lumina Image 2.0 by Alpha-VLLM.

The model can be loaded with the following code snippet.

import torch
from diffusers import Lumina2Transformer2DModel

# Load only the transformer weights from the pipeline repository, in bfloat16.
transformer = Lumina2Transformer2DModel.from_pretrained("Alpha-VLLM/Lumina-Image-2.0", subfolder="transformer", torch_dtype=torch.bfloat16)

Lumina2Transformer2DModel[[diffusers.Lumina2Transformer2DModel]]

class diffusers.Lumina2Transformer2DModel(sample_size: int = 128, patch_size: int = 2, in_channels: int = 16, out_channels: typing.Optional[int] = None, hidden_size: int = 2304, num_layers: int = 26, num_refiner_layers: int = 2, num_attention_heads: int = 24, num_kv_heads: int = 8, multiple_of: int = 256, ffn_dim_multiplier: typing.Optional[float] = None, norm_eps: float = 1e-05, scaling_factor: float = 1.0, axes_dim_rope: typing.Tuple[int, int, int] = (32, 32, 32), axes_lens: typing.Tuple[int, int, int] = (300, 512, 512), cap_feat_dim: int = 1024)

Source: https://github.com/huggingface/diffusers/blob/vr_12595/src/diffusers/models/transformers/transformer_lumina2.py#L325

  • sample_size (int) -- The width of the latent images. This is fixed during training since it is used to learn a number of position embeddings.

  • patch_size (int, optional, defaults to 2) -- The size of each patch in the image. This parameter defines the resolution of patches fed into the model.
  • in_channels (int, optional, defaults to 16) -- The number of input channels for the model. Typically, this matches the number of channels in the input images.
  • hidden_size (int, optional, defaults to 2304) -- The dimensionality of the hidden layers in the model. This parameter determines the width of the model's hidden representations.
  • num_layers (int, optional, defaults to 26) -- The number of layers in the model. This defines the depth of the neural network.
  • num_attention_heads (int, optional, defaults to 24) -- The number of attention heads in each attention layer. This parameter specifies how many separate attention mechanisms are used.
  • num_kv_heads (int, optional, defaults to 8) -- The number of key-value heads in the attention mechanism, if different from the number of attention heads. If None, it defaults to num_attention_heads.
  • multiple_of (int, optional, defaults to 256) -- A factor that the hidden size should be a multiple of. This can help optimize certain hardware configurations.
  • ffn_dim_multiplier (float, optional) -- A multiplier for the dimensionality of the feed-forward network. If None, it uses a default value based on the model configuration.
  • norm_eps (float, optional, defaults to 1e-5) -- A small value added to the denominator for numerical stability in normalization layers.
  • scaling_factor (float, optional, defaults to 1.0) -- A scaling factor applied to certain parameters or layers in the model. This can be used for adjusting the overall scale of the model's operations.

Lumina2NextDiT: Diffusion model with a Transformer backbone.
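
The default values in the class signature imply a few derived sizes that can be checked with plain arithmetic. The sketch below copies the defaults from the signature and computes them. The variable names `head_dim`, `queries_per_kv_head`, `num_image_tokens`, and `rope_dim_total` are illustrative and not part of the diffusers API. The sketch also assumes that the rotary embedding dimensions in `axes_dim_rope` are meant to add up to the per-head dimension, and that `num_kv_heads` being smaller than `num_attention_heads` means several query heads share each key-value head.

```python
# Defaults copied from the Lumina2Transformer2DModel signature.
hidden_size = 2304
num_attention_heads = 24
num_kv_heads = 8
sample_size = 128
patch_size = 2
axes_dim_rope = (32, 32, 32)

# Width of each attention head.
head_dim = hidden_size // num_attention_heads

# Assumption: with fewer key-value heads than attention heads,
# this many query heads share one key-value head.
queries_per_kv_head = num_attention_heads // num_kv_heads

# Number of patch tokens for a square latent of side sample_size.
tokens_per_side = sample_size // patch_size
num_image_tokens = tokens_per_side**2

# Assumption: the per-axis rotary dimensions sum to the head dimension.
rope_dim_total = sum(axes_dim_rope)

print(head_dim, queries_per_kv_head, num_image_tokens, rope_dim_total)
```

With the defaults, the head dimension and the sum of `axes_dim_rope` are both 96, which is consistent with the assumption above but does not by itself confirm how the model uses them.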

Transformer2DModelOutput[[diffusers.models.modeling_outputs.Transformer2DModelOutput]]

class diffusers.models.modeling_outputs.Transformer2DModelOutput(sample: torch.Tensor)

Source: https://github.com/huggingface/diffusers/blob/vr_12595/src/diffusers/models/modeling_outputs.py#L21

  • sample (torch.Tensor of shape (batch_size, num_channels, height, width) or (batch_size, num_vector_embeds - 1, num_latent_pixels) if Transformer2DModel is discrete) -- The hidden states output conditioned on the encoder_hidden_states input. If discrete, returns probability distributions for the unnoised latent pixels.

The output of Transformer2DModel.
