LTX2VideoTransformer3DModel
LTX2VideoTransformer3DModel is a Diffusion Transformer model for 3D (video-like) data, introduced by Lightricks as part of LTX.
The model can be loaded with the following code snippet.
import torch
from diffusers import LTX2VideoTransformer3DModel
# Load the "transformer" subfolder in bfloat16 and move the model to the GPU.
transformer = LTX2VideoTransformer3DModel.from_pretrained("Lightricks/LTX-2", subfolder="transformer", torch_dtype=torch.bfloat16).to("cuda")
LTX2VideoTransformer3DModel
diffusers.LTX2VideoTransformer3DModel
A Transformer model for video-like data used in LTX.
diffusers.LTX2VideoTransformer3DModel.forward
Source: https://github.com/huggingface/diffusers/blob/vr_11739/src/diffusers/models/transformers/transformer_ltx2.py#L1104
forward(
    hidden_states: Tensor,
    audio_hidden_states: Tensor,
    encoder_hidden_states: Tensor,
    audio_encoder_hidden_states: Tensor,
    timestep: LongTensor,
    audio_timestep: typing.Optional[torch.LongTensor] = None,
    encoder_attention_mask: typing.Optional[torch.Tensor] = None,
    audio_encoder_attention_mask: typing.Optional[torch.Tensor] = None,
    num_frames: typing.Optional[int] = None,
    height: typing.Optional[int] = None,
    width: typing.Optional[int] = None,
    fps: float = 24.0,
    audio_num_frames: typing.Optional[int] = None,
    video_coords: typing.Optional[torch.Tensor] = None,
    audio_coords: typing.Optional[torch.Tensor] = None,
    attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None,
    return_dict: bool = True,
)
Forward pass for LTX-2.0 audiovisual video transformer.
Parameters:
- hidden_states (torch.Tensor) -- Input patchified video latents of shape (batch_size, num_video_tokens, in_channels).
- audio_hidden_states (torch.Tensor) -- Input patchified audio latents of shape (batch_size, num_audio_tokens, audio_in_channels).
- encoder_hidden_states (torch.Tensor) -- Input video text embeddings of shape (batch_size, text_seq_len, self.config.caption_channels).
- audio_encoder_hidden_states (torch.Tensor) -- Input audio text embeddings of shape (batch_size, text_seq_len, self.config.caption_channels).
- timestep (torch.Tensor) -- Input timestep of shape (batch_size, num_video_tokens). These should already be scaled by self.config.timestep_scale_multiplier.
- audio_timestep (torch.Tensor, optional) -- Input timestep of shape (batch_size,) or (batch_size, num_audio_tokens) for audio modulation params. This is only used by certain pipelines such as the I2V pipeline.
- encoder_attention_mask (torch.Tensor, optional) -- Optional multiplicative text attention mask of shape (batch_size, text_seq_len).
- audio_encoder_attention_mask (torch.Tensor, optional) -- Optional multiplicative text attention mask of shape (batch_size, text_seq_len) for audio modeling.
- num_frames (int, optional) -- The number of latent video frames. Used if calculating the video coordinates for RoPE.
- height (int, optional) -- The latent video height. Used if calculating the video coordinates for RoPE.
- width (int, optional) -- The latent video width. Used if calculating the video coordinates for RoPE.
- fps (float, optional, defaults to 24.0) -- The desired frames per second of the generated video. Used if calculating the video coordinates for RoPE.
- audio_num_frames (int, optional) -- The number of latent audio frames. Used if calculating the audio coordinates for RoPE.
- video_coords (torch.Tensor, optional) -- The video coordinates to be used when calculating the rotary positional embeddings (RoPE) of shape (batch_size, 3, num_video_tokens, 2). If not supplied, this will be calculated inside forward.
- audio_coords (torch.Tensor, optional) -- The audio coordinates to be used when calculating the rotary positional embeddings (RoPE) of shape (batch_size, 1, num_audio_tokens, 2). If not supplied, this will be calculated inside forward.
- attention_kwargs (Dict[str, Any], optional) -- Optional dict of keyword args to be passed to the attention processor.
- return_dict (bool, optional, defaults to True) -- Whether to return a dict-like structured output of type AudioVisualModelOutput or a tuple.
Returns:
AudioVisualModelOutput or tuple
If return_dict is True, returns a structured output of type AudioVisualModelOutput, otherwise a tuple is returned where the first element is the denoised video latent patch sequence and the second element is the denoised audio latent patch sequence.
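The input shapes documented for forward can be collected in a small helper, which may be handy for sanity-checking tensors before calling the model. This is a minimal sketch, not part of diffusers: the function name expected_forward_shapes is hypothetical, and it assumes one video token per (patch_size_t x patch_size x patch_size) latent patch and one audio token per latent audio frame. The values used for audio_in_channels and caption_channels in the example are placeholders, since their defaults are not listed here.

```python
def expected_forward_shapes(
    batch_size,
    num_frames,
    height,
    width,
    audio_num_frames,
    text_seq_len,
    in_channels,
    audio_in_channels,
    caption_channels,
    patch_size=1,
    patch_size_t=1,
):
    """Hypothetical helper: map forward() argument names to their documented shapes."""
    # Assumption: the latent video is split into non-overlapping patches,
    # and each patch becomes one token.
    num_video_tokens = (
        (num_frames // patch_size_t) * (height // patch_size) * (width // patch_size)
    )
    # Assumption: one audio token per latent audio frame.
    num_audio_tokens = audio_num_frames
    return {
        "hidden_states": (batch_size, num_video_tokens, in_channels),
        "audio_hidden_states": (batch_size, num_audio_tokens, audio_in_channels),
        "encoder_hidden_states": (batch_size, text_seq_len, caption_channels),
        "audio_encoder_hidden_states": (batch_size, text_seq_len, caption_channels),
        "timestep": (batch_size, num_video_tokens),
        "encoder_attention_mask": (batch_size, text_seq_len),
        "audio_encoder_attention_mask": (batch_size, text_seq_len),
        "video_coords": (batch_size, 3, num_video_tokens, 2),
        "audio_coords": (batch_size, 1, num_audio_tokens, 2),
    }


# Placeholder sizes for illustration only.
shapes = expected_forward_shapes(
    batch_size=1,
    num_frames=8,
    height=16,
    width=24,
    audio_num_frames=50,
    text_seq_len=128,
    in_channels=128,
    audio_in_channels=64,   # placeholder, not a documented default
    caption_channels=256,   # placeholder, not a documented default
)
print(shapes["hidden_states"])  # (1, 3072, 128)
print(shapes["video_coords"])   # (1, 3, 3072, 2)
```

Comparing each tensor's shape against this dictionary before the forward call can surface a wrong token count early, rather than as a shape error deep inside the attention layers.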
Parameters (model constructor):
in_channels (int, defaults to 128) : The number of channels in the input.
out_channels (int, defaults to 128) : The number of channels in the output.
patch_size (int, defaults to 1) : The size of the spatial patches to use in the patch embedding layer.
patch_size_t (int, defaults to 1) : The size of the temporal patches to use in the patch embedding layer.
num_attention_heads (int, defaults to 32) : The number of heads to use for multi-head attention.
attention_head_dim (int, defaults to 64) : The number of channels in each head.
cross_attention_dim (int, defaults to 2048) : The number of channels for cross attention heads.
num_layers (int, defaults to 28) : The number of layers of Transformer blocks to use.
activation_fn (str, defaults to "gelu-approximate") : Activation function to use in feed-forward.
qk_norm (str, defaults to "rms_norm_across_heads") : The normalization layer to apply to the query and key projections in attention.
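As a quick arithmetic check on the defaults listed above, the attention width implied by the head settings can be compared with cross_attention_dim. This is a sketch under an assumption: that the model's hidden width is num_attention_heads * attention_head_dim, which is a common convention in transformer implementations but is not confirmed here for this class.

```python
# Defaults as listed in the parameter descriptions above.
num_attention_heads = 32
attention_head_dim = 64
cross_attention_dim = 2048

# Assumption: hidden width = heads * channels per head.
inner_dim = num_attention_heads * attention_head_dim
print(inner_dim)  # 2048

# With these defaults the implied width matches cross_attention_dim.
widths_match = inner_dim == cross_attention_dim
print(widths_match)  # True
```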