StableAudioDiTModel
A Transformer model for audio waveforms from Stable Audio Open.
StableAudioDiTModel[[diffusers.StableAudioDiTModel]]
diffusers.StableAudioDiTModel[[diffusers.StableAudioDiTModel]]
The Diffusion Transformer model introduced in Stable Audio.
Reference: https://github.com/Stability-AI/stable-audio-tools
forward[[diffusers.StableAudioDiTModel.forward]]
forward(hidden_states: FloatTensor, timestep: LongTensor = None, encoder_hidden_states: FloatTensor = None, global_hidden_states: FloatTensor = None, rotary_embedding: FloatTensor = None, return_dict: bool = True, attention_mask: torch.LongTensor | None = None, encoder_attention_mask: torch.LongTensor | None = None)
Source: https://github.com/huggingface/diffusers/blob/vr_13751/src/diffusers/models/transformers/stable_audio_transformer.py#L282
The StableAudioDiTModel forward method.
Parameters:
hidden_states (torch.FloatTensor of shape (batch_size, in_channels, sequence_len)) : Input hidden_states.
timestep (torch.LongTensor) : Used to indicate denoising step.
encoder_hidden_states (torch.FloatTensor of shape (batch_size, encoder_sequence_len, cross_attention_input_dim)) : Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
global_hidden_states (torch.FloatTensor of shape (batch_size, global_sequence_len, global_states_input_dim)) : Global embeddings that will be prepended to the hidden states.
rotary_embedding (torch.Tensor) : The rotary embeddings to apply on query and key tensors during attention calculation.
return_dict (bool, optional, defaults to True) : Whether or not to return a ~models.transformer_2d.Transformer2DModelOutput instead of a plain tuple.
attention_mask (torch.Tensor of shape (batch_size, sequence_len), optional) : Mask to avoid performing attention on padding token indices, formed by concatenating the attention masks for the two text encoders together. Mask values selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
encoder_attention_mask (torch.Tensor of shape (batch_size, sequence_len), optional) : Mask to avoid performing attention on padding token cross-attention indices, formed by concatenating the attention masks for the two text encoders together. Mask values selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
Returns:
If return_dict is True, an ~models.transformer_2d.Transformer2DModelOutput is returned, otherwise a
tuple where the first element is the sample tensor.
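The mask convention used by attention_mask and encoder_attention_mask (1 for positions that are attended to, 0 for padding) can be illustrated with a minimal sketch in plain Python. The helper name build_padding_mask is hypothetical and not part of diffusers, and the sketch assumes right-padded sequences whose true lengths are known.

```python
def build_padding_mask(lengths, sequence_len):
    """Return one row per batch item: 1 for real positions, 0 for padding.

    Hypothetical helper for illustration only; assumes right-padding.
    """
    if any(n < 0 or n > sequence_len for n in lengths):
        raise ValueError("each length must lie in [0, sequence_len]")
    return [[1] * n + [0] * (sequence_len - n) for n in lengths]


# Two batch items padded to 6 positions; the second has only 4 real tokens.
mask = build_padding_mask([6, 4], sequence_len=6)
print(mask)  # [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 0, 0]]
```

The nested list has shape (batch_size, sequence_len). To pass it to forward, it would typically be converted first, for example with torch.tensor(mask, dtype=torch.long).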
Parameters (model configuration):
sample_size (int, optional, defaults to 1024) : The size of the input sample.
in_channels (int, optional, defaults to 64) : The number of channels in the input.
num_layers (int, optional, defaults to 24) : The number of layers of Transformer blocks to use.
attention_head_dim (int, optional, defaults to 64) : The number of channels in each head.
num_attention_heads (int, optional, defaults to 24) : The number of heads to use for the query states.
num_key_value_attention_heads (int, optional, defaults to 12) : The number of heads to use for the key and value states.
out_channels (int, defaults to 64) : Number of output channels.
cross_attention_dim (int, optional, defaults to 768) : Dimension of the cross-attention projection.
time_proj_dim (int, optional, defaults to 256) : Dimension of the timestep inner projection.
global_states_input_dim (int, optional, defaults to 1536) : Input dimension of the global hidden states projection.
cross_attention_input_dim (int, optional, defaults to 768) : Input dimension of the cross-attention projection.
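To see how the documented defaults relate to each other, the sketch below collects them in a small dataclass and derives two quantities. DiTDefaults is a hypothetical, illustrative container and not a diffusers class. The derived values rest on two assumptions that the text above does not state: that the transformer's hidden width equals num_attention_heads * attention_head_dim, and that query heads are split evenly across key/value heads (grouped-query attention).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiTDefaults:
    """Documented default values; illustrative only, not a diffusers API."""

    sample_size: int = 1024
    in_channels: int = 64
    num_layers: int = 24
    attention_head_dim: int = 64
    num_attention_heads: int = 24
    num_key_value_attention_heads: int = 12
    out_channels: int = 64
    cross_attention_dim: int = 768
    time_proj_dim: int = 256
    global_states_input_dim: int = 1536
    cross_attention_input_dim: int = 768

    @property
    def inner_dim(self) -> int:
        # Assumption: hidden width = query heads * channels per head.
        return self.num_attention_heads * self.attention_head_dim

    @property
    def queries_per_kv_head(self) -> int:
        # Assumption: query heads are shared evenly among key/value heads.
        if self.num_attention_heads % self.num_key_value_attention_heads:
            raise ValueError("query heads must be a multiple of key/value heads")
        return self.num_attention_heads // self.num_key_value_attention_heads


cfg = DiTDefaults()
print(cfg.inner_dim)            # 1536
print(cfg.queries_per_kv_head)  # 2
```

Under these assumptions, the defaults give a hidden width of 1536 and two query heads per key/value head.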
set_default_attn_processor[[diffusers.StableAudioDiTModel.set_default_attn_processor]]
Disables custom attention processors and sets the default attention implementation.