AceStepTransformer1DModel
A 1D Diffusion Transformer for music generation from ACE-Step 1.5. The model uses flow matching over the 25 Hz stereo latents produced by AutoencoderOobleck, with a Qwen3-derived backbone (grouped-query attention, rotary position embeddings, RMSNorm, AdaLN-Zero timestep conditioning) plus cross-attention to the text / lyric / timbre conditions built by AceStepConditionEncoder.
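To make the flow-matching setup concrete, here is an illustrative sketch (not from the diffusers source; shapes and names are hypothetical): the transformer is trained to predict the velocity of a straight-line path between noise and the clean audio latent.

```python
import numpy as np

# Hypothetical shapes: flow matching interpolates between Gaussian noise x0
# and a clean latent x1, and the model learns to predict the velocity x1 - x0.
rng = np.random.default_rng(0)
batch, seq_len, channels = 2, 100, 64   # e.g. 4 s of audio at 25 Hz latents

x0 = rng.standard_normal((batch, seq_len, channels))  # Gaussian noise
x1 = rng.standard_normal((batch, seq_len, channels))  # clean latents (stand-in)
t = rng.uniform(size=(batch, 1, 1))                   # diffusion time in [0, 1]

x_t = (1.0 - t) * x0 + t * x1   # noisy input fed to the model as hidden_states
v_target = x1 - x0              # velocity the model is trained to predict
```

Note that `x_t + (1 - t) * v_target` recovers `x1` exactly, which is why integrating the predicted velocity from `t = 0` to `t = 1` yields the clean latent at inference time.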
AceStepTransformer1DModel[[diffusers.AceStepTransformer1DModel]]
Diffusion Transformer for ACE-Step 1.5 music generation.
Generates audio latents conditioned on text, lyrics, and timbre. Uses 1D patch embedding (Conv1d with stride
patch_size) followed by a stack of AceStepTransformerBlocks with alternating sliding-window / full attention on
the self-attention branch. Cross-attention consumes the packed encoder_hidden_states produced by
AceStepConditionEncoder.
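The 1D patch embedding described above can be sketched without the model itself: a Conv1d whose kernel size equals its stride is equivalent to reshaping the sequence into non-overlapping patches and applying one linear projection per patch. All sizes below are hypothetical examples, not the model's real configuration.

```python
import numpy as np

# Equivalence sketch: Conv1d(kernel_size=patch_size, stride=patch_size)
# == reshape into patches + a single linear projection per patch.
patch_size, channels, hidden_size = 2, 8, 16
seq_len = 10  # assumed divisible by patch_size

rng = np.random.default_rng(0)
latents = rng.standard_normal((seq_len, channels))
weight = rng.standard_normal((patch_size * channels, hidden_size))

# (seq_len, channels) -> (seq_len // patch_size, patch_size * channels)
patches = latents.reshape(seq_len // patch_size, patch_size * channels)
tokens = patches @ weight  # (num_patches, hidden_size) transformer tokens
```

The sequence the transformer blocks actually see is therefore `seq_len // patch_size` tokens long, which is what makes alternating sliding-window / full attention affordable on long audio.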
forward[[diffusers.AceStepTransformer1DModel.forward]]
Source: https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/transformers/ace_step_transformer.py#L531
forward(hidden_states: Tensor, timestep: Tensor, timestep_r: Tensor, encoder_hidden_states: Tensor, context_latents: Tensor, return_dict: bool = True)
The AceStepTransformer1DModel forward method.
Parameters:
hidden_states (torch.Tensor of shape (batch_size, seq_len, channels)) : Noisy latent input for the diffusion process.
timestep (torch.Tensor of shape (batch_size,)) : Current diffusion timestep t.
timestep_r (torch.Tensor of shape (batch_size,)) : Reference timestep r (set equal to t for standard inference).
encoder_hidden_states (torch.Tensor of shape (batch_size, encoder_seq_len, hidden_size)) : Conditioning embeddings from the condition encoder (text + lyrics + timbre).
context_latents (torch.Tensor of shape (batch_size, seq_len, context_dim)) : Context latents (source latents concatenated with chunk masks) — fed to the patchify conv alongside hidden_states.
return_dict (bool, defaults to True) : Whether to return a Transformer2DModelOutput or a plain tuple.
Returns:
`Transformer2DModelOutput` or `tuple`
The predicted velocity field.
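To show how the returned velocity field is typically consumed at inference, here is a minimal Euler-integration sketch of the flow-matching ODE. The model call is replaced by a dummy velocity function; all names and shapes are illustrative, not the pipeline's real API.

```python
import numpy as np

def dummy_velocity(x_t, t):
    # Stand-in for model(hidden_states=x_t, timestep=t, ...).sample,
    # which would return the predicted velocity field.
    return -x_t

x = np.ones((1, 100, 64))     # initial noise latents (hypothetical shape)
num_steps = 10
dt = 1.0 / num_steps
t = 0.0
for _ in range(num_steps):
    v = dummy_velocity(x, t)
    x = x + v * dt            # one Euler step along the predicted velocity
    t += dt
```

A real scheduler would also set `timestep_r` (equal to `timestep` for standard inference, per the parameter list above) and pass the packed conditioning through `encoder_hidden_states`.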