
LTX2VideoTransformer3DModel

A Diffusion Transformer model for 3D (video-like) data from LTX, introduced by Lightricks.

The model can be loaded with the following code snippet.

import torch
from diffusers import LTX2VideoTransformer3DModel

transformer = LTX2VideoTransformer3DModel.from_pretrained("Lightricks/LTX-2", subfolder="transformer", torch_dtype=torch.bfloat16).to("cuda")

LTX2VideoTransformer3DModel

class diffusers.LTX2VideoTransformer3DModel

A Transformer model for video-like data used in LTX.

forward

diffusers.LTX2VideoTransformer3DModel.forward(hidden_states: Tensor, audio_hidden_states: Tensor, encoder_hidden_states: Tensor, audio_encoder_hidden_states: Tensor, timestep: LongTensor, audio_timestep: typing.Optional[torch.LongTensor] = None, encoder_attention_mask: typing.Optional[torch.Tensor] = None, audio_encoder_attention_mask: typing.Optional[torch.Tensor] = None, num_frames: typing.Optional[int] = None, height: typing.Optional[int] = None, width: typing.Optional[int] = None, fps: float = 24.0, audio_num_frames: typing.Optional[int] = None, video_coords: typing.Optional[torch.Tensor] = None, audio_coords: typing.Optional[torch.Tensor] = None, attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None, return_dict: bool = True)

Source: https://github.com/huggingface/diffusers/blob/vr_11739/src/diffusers/models/transformers/transformer_ltx2.py#L1104

  • hidden_states (torch.Tensor) -- Input patchified video latents of shape (batch_size, num_video_tokens, in_channels).
  • audio_hidden_states (torch.Tensor) -- Input patchified audio latents of shape (batch_size, num_audio_tokens, audio_in_channels).
  • encoder_hidden_states (torch.Tensor) -- Input video text embeddings of shape (batch_size, text_seq_len, self.config.caption_channels).
  • audio_encoder_hidden_states (torch.Tensor) -- Input audio text embeddings of shape (batch_size, text_seq_len, self.config.caption_channels).
  • timestep (torch.Tensor) -- Input timestep of shape (batch_size, num_video_tokens). These should already be scaled by self.config.timestep_scale_multiplier.
  • audio_timestep (torch.Tensor, optional) -- Input timestep of shape (batch_size,) or (batch_size, num_audio_tokens) for audio modulation params. This is only used by certain pipelines such as the I2V pipeline.
  • encoder_attention_mask (torch.Tensor, optional) -- Optional multiplicative text attention mask of shape (batch_size, text_seq_len).
  • audio_encoder_attention_mask (torch.Tensor, optional) -- Optional multiplicative text attention mask of shape (batch_size, text_seq_len) for audio modeling.
  • num_frames (int, optional) -- The number of latent video frames. Used if calculating the video coordinates for RoPE.
  • height (int, optional) -- The latent video height. Used if calculating the video coordinates for RoPE.
  • width (int, optional) -- The latent video width. Used if calculating the video coordinates for RoPE.
  • fps (float, optional, defaults to 24.0) -- The desired frames per second of the generated video. Used if calculating the video coordinates for RoPE.
  • audio_num_frames (int, optional) -- The number of latent audio frames. Used if calculating the audio coordinates for RoPE.
  • video_coords (torch.Tensor, optional) -- The video coordinates to be used when calculating the rotary positional embeddings (RoPE) of shape (batch_size, 3, num_video_tokens, 2). If not supplied, this will be calculated inside forward.
  • audio_coords (torch.Tensor, optional) -- The audio coordinates to be used when calculating the rotary positional embeddings (RoPE) of shape (batch_size, 1, num_audio_tokens, 2). If not supplied, this will be calculated inside forward.
  • attention_kwargs (Dict[str, Any], optional) -- Optional dict of keyword args to be passed to the attention processor.
  • return_dict (bool, optional, defaults to True) -- Whether to return a dict-like structured output of type AudioVisualModelOutput or a tuple.
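The shape conventions in the parameter list above can be collected into a small helper. This is a minimal sketch, not part of diffusers: `expected_forward_shapes` is a hypothetical function. It assumes that the video token count is the product of the patched latent dimensions and that there is one audio token per latent audio frame. The `caption_channels` and `audio_in_channels` values in the usage lines are placeholders, not documented defaults.

```python
def expected_forward_shapes(
    batch_size,
    num_frames,
    height,
    width,
    audio_num_frames,
    text_seq_len,
    caption_channels,
    audio_in_channels,
    in_channels=128,
    patch_size=1,
    patch_size_t=1,
):
    """Return the documented input shapes for forward() as plain tuples."""
    # Assumption: one video token per (temporal, spatial) latent patch.
    num_video_tokens = (
        (num_frames // patch_size_t) * (height // patch_size) * (width // patch_size)
    )
    # Assumption: one audio token per latent audio frame.
    num_audio_tokens = audio_num_frames
    return {
        "hidden_states": (batch_size, num_video_tokens, in_channels),
        "audio_hidden_states": (batch_size, num_audio_tokens, audio_in_channels),
        "encoder_hidden_states": (batch_size, text_seq_len, caption_channels),
        "audio_encoder_hidden_states": (batch_size, text_seq_len, caption_channels),
        "timestep": (batch_size, num_video_tokens),
        "encoder_attention_mask": (batch_size, text_seq_len),
        "audio_encoder_attention_mask": (batch_size, text_seq_len),
        "video_coords": (batch_size, 3, num_video_tokens, 2),
        "audio_coords": (batch_size, 1, num_audio_tokens, 2),
    }


# Placeholder sizes chosen only for illustration.
shapes = expected_forward_shapes(
    batch_size=1,
    num_frames=8,
    height=16,
    width=24,
    audio_num_frames=50,
    text_seq_len=128,
    caption_channels=3840,   # hypothetical value
    audio_in_channels=128,   # hypothetical value
)
print(shapes["hidden_states"])  # (1, 3072, 128)
```

Comparing these tuples against the `.shape` of the tensors you actually pass in can catch patchification mistakes before the forward call.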

Forward pass for the LTX-2.0 audiovisual video transformer.
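The `timestep` argument is documented as already scaled by `self.config.timestep_scale_multiplier` and shaped (batch_size, num_video_tokens). The framework-free sketch below shows one way that preparation might look. It assumes every video token in a sample shares the same timestep and uses nested lists in place of a `torch.Tensor`. The helper name and the multiplier value of 1000 are hypothetical; in practice the multiplier would be read from the loaded model's config.

```python
def scale_and_expand_timestep(per_sample_timesteps, num_video_tokens, timestep_scale_multiplier):
    """Scale one timestep per sample and repeat it for every video token."""
    return [
        [t * timestep_scale_multiplier] * num_video_tokens
        for t in per_sample_timesteps
    ]


# Two samples, four video tokens each; 1000 is a placeholder multiplier.
rows = scale_and_expand_timestep([0.5, 0.25], num_video_tokens=4, timestep_scale_multiplier=1000)
print(len(rows), len(rows[0]))  # 2 4
```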

Parameters:

in_channels (int, defaults to 128) : The number of channels in the input.

out_channels (int, defaults to 128) : The number of channels in the output.

patch_size (int, defaults to 1) : The size of the spatial patches to use in the patch embedding layer.

patch_size_t (int, defaults to 1) : The size of the temporal patches to use in the patch embedding layer.

num_attention_heads (int, defaults to 32) : The number of heads to use for multi-head attention.

attention_head_dim (int, defaults to 64) : The number of channels in each head.

cross_attention_dim (int, defaults to 2048) : The number of channels for cross attention heads.

num_layers (int, defaults to 28) : The number of layers of Transformer blocks to use.

activation_fn (str, defaults to "gelu-approximate") : Activation function to use in feed-forward.

qk_norm (str, defaults to "rms_norm_across_heads") : The normalization layer to use for the attention queries and keys.
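The defaults listed above fit together arithmetically: 32 heads of 64 channels each give a width of 2048, which matches the default `cross_attention_dim`. The sketch below records the documented defaults in a stand-in dataclass and checks that relationship. `TransformerConfigSketch` and its `inner_dim` property are hypothetical names for illustration, not the diffusers class, and the assumption that the hidden width equals heads times head dimension follows common transformer practice rather than anything stated on this page.

```python
from dataclasses import dataclass


@dataclass
class TransformerConfigSketch:
    """Stand-in holding the documented default values (not the real config class)."""

    in_channels: int = 128
    out_channels: int = 128
    patch_size: int = 1
    patch_size_t: int = 1
    num_attention_heads: int = 32
    attention_head_dim: int = 64
    cross_attention_dim: int = 2048
    num_layers: int = 28
    activation_fn: str = "gelu-approximate"
    qk_norm: str = "rms_norm_across_heads"

    @property
    def inner_dim(self) -> int:
        # Assumption: hidden width = number of heads * channels per head.
        return self.num_attention_heads * self.attention_head_dim


config = TransformerConfigSketch()
print(config.inner_dim)                                # 2048
print(config.inner_dim == config.cross_attention_dim)  # True
```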

Returns:

AudioVisualModelOutput or tuple

If return_dict is True, returns a structured output of type AudioVisualModelOutput, otherwise a tuple is returned where the first element is the denoised video latent patch sequence and the second element is the denoised audio latent patch sequence.
