AceStepTransformer1DModel

A 1D Diffusion Transformer for music generation from ACE-Step 1.5. The model operates on the 25 Hz stereo latents produced by AutoencoderOobleck using flow matching, and is trained with a Qwen3-derived backbone (grouped-query attention, rotary position embedding, RMSNorm, AdaLN-Zero timestep conditioning) plus cross-attention to the text / lyric / timbre conditions built by AceStepConditionEncoder.
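At inference, a flow-matching model like this predicts a velocity field that is integrated from noise toward data. The sketch below illustrates that loop with a hypothetical stand-in for the transformer (the real model additionally conditions on text/lyric/timbre embeddings and context latents); the sequence length and the 64 latent channels are assumptions for illustration, not the model's config.

```python
import torch

# Hypothetical stand-in for the transformer's velocity prediction.
# The real AceStepTransformer1DModel conditions on encoder_hidden_states
# and context_latents; here we use placeholder dynamics.
def predict_velocity(x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    return -x_t  # illustration only

# Flow-matching inference: integrate dx/dt = v(x, t) from noise (t = 1)
# toward data (t = 0) with a simple Euler solver.
# 25 Hz latents -> (batch, seq_len, channels); 10 s of audio ~ 250 frames.
x = torch.randn(1, 250, 64)  # 64 latent channels assumed for illustration
num_steps = 4
for i in range(num_steps):
    t = torch.full((x.shape[0],), 1.0 - i / num_steps)
    v = predict_velocity(x, t)
    x = x - v / num_steps  # Euler step from t toward t - 1/num_steps
```

In practice a diffusers scheduler performs this integration; the loop is only meant to show where the predicted velocity field fits.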

AceStepTransformer1DModel[[diffusers.AceStepTransformer1DModel]]

Diffusion Transformer for ACE-Step 1.5 music generation.

Generates audio latents conditioned on text, lyrics, and timbre. Uses 1D patch embedding (Conv1d with stride patch_size) followed by a stack of AceStepTransformerBlocks with alternating sliding-window / full attention on the self-attention branch. Cross-attention consumes the packed encoder_hidden_states produced by AceStepConditionEncoder.
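The 1D patch embedding described above can be sketched as a `Conv1d` whose kernel size and stride both equal `patch_size`, so each group of `patch_size` latent frames becomes one token. The channel counts and patch size below are assumptions for illustration, not the model's actual config.

```python
import torch
import torch.nn as nn

# Illustrative 1D patchify: kernel_size == stride == patch_size, so the
# sequence length shrinks by patch_size and channels map to hidden_size.
patch_size, in_channels, hidden_size = 2, 64, 1536  # assumed values

patchify = nn.Conv1d(in_channels, hidden_size,
                     kernel_size=patch_size, stride=patch_size)

latents = torch.randn(1, in_channels, 250)  # (batch, channels, seq_len)
tokens = patchify(latents)                  # (batch, hidden_size, seq_len // patch_size)
tokens = tokens.transpose(1, 2)             # (batch, num_patches, hidden_size)
print(tokens.shape)  # torch.Size([1, 125, 1536])
```

The transformer blocks then attend over these patch tokens, alternating sliding-window and full self-attention across the stack.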

forward[[diffusers.AceStepTransformer1DModel.forward]]

forward(hidden_states: Tensor, timestep: Tensor, timestep_r: Tensor, encoder_hidden_states: Tensor, context_latents: Tensor, return_dict: bool = True)

Source: https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/transformers/ace_step_transformer.py#L531

Parameters:

hidden_states (torch.Tensor of shape (batch_size, seq_len, channels)) : Noisy latent input for the diffusion process.

timestep (torch.Tensor of shape (batch_size,)) : Current diffusion timestep t.

timestep_r (torch.Tensor of shape (batch_size,)) : Reference timestep r (set equal to t for standard inference).

encoder_hidden_states (torch.Tensor of shape (batch_size, encoder_seq_len, hidden_size)) : Conditioning embeddings from the condition encoder (text + lyrics + timbre).

context_latents (torch.Tensor of shape (batch_size, seq_len, context_dim)) : Context latents (source latents concatenated with chunk masks) — fed to the patchify conv alongside hidden_states.

return_dict (bool, defaults to True) : Whether to return a Transformer2DModelOutput or a plain tuple.

Returns:

`Transformer2DModelOutput` or `tuple`

The predicted velocity field.
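To make the documented interface concrete, here is a toy stand-in module with the same `forward` signature and data flow: `context_latents` are concatenated with `hidden_states` before the patchify conv, and the output velocity has the input's shape. The block stack (grouped-query attention, RoPE, AdaLN-Zero, cross-attention) is replaced by simple projections; every dimension below is an assumption for illustration, not the real config.

```python
import torch
import torch.nn as nn

class ToyAceStepTransformer1D(nn.Module):
    """Toy stand-in mirroring the documented forward interface."""

    def __init__(self, channels=64, context_dim=65, hidden=128, patch_size=2):
        super().__init__()
        self.patch_size = patch_size
        # Patchify consumes hidden_states concatenated with context_latents.
        self.proj_in = nn.Conv1d(channels + context_dim, hidden,
                                 kernel_size=patch_size, stride=patch_size)
        # Unpatchify: each token predicts patch_size frames of velocity.
        self.proj_out = nn.Linear(hidden, channels * patch_size)

    def forward(self, hidden_states, timestep, timestep_r,
                encoder_hidden_states, context_latents, return_dict=True):
        # The toy ignores the timesteps and encoder states; the real model
        # uses them for AdaLN-Zero conditioning and cross-attention, and
        # wraps the output in a Transformer2DModelOutput when return_dict=True.
        b, seq_len, c = hidden_states.shape
        x = torch.cat([hidden_states, context_latents], dim=-1).transpose(1, 2)
        x = self.proj_in(x).transpose(1, 2)          # (b, num_patches, hidden)
        v = self.proj_out(x).reshape(b, seq_len, c)  # back to the input shape
        return v

model = ToyAceStepTransformer1D()
hs = torch.randn(1, 250, 64)    # noisy latents (batch, seq_len, channels)
ctx = torch.randn(1, 250, 65)   # source latents + chunk masks (assumed dim)
enc = torch.randn(1, 77, 128)   # packed conditioning embeddings (assumed dims)
t = torch.tensor([0.5])
v = model(hs, t, t, enc, ctx)   # timestep_r == timestep for standard inference
```

Note that, as documented, `timestep_r` is simply set equal to `timestep` for standard inference.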
