	# AnyFlowTransformer3DModel

	The bidirectional 3D Transformer used by [`AnyFlowPipeline`](../pipelines/anyflow#anyflowpipeline). It is the
	v0.35.1 Wan2.1 backbone with one structural change: the timestep embedder is replaced by
	``AnyFlowDualTimestepTextImageEmbedding``, so every forward call conditions on both the source timestep
	``t`` and the target timestep ``r``. This is the embedding required to learn the flow map
	$\Phi_{r\leftarrow t}$ introduced in
	[AnyFlow](https://huggingface.co/papers/2605.13724). See the [`AnyFlowPipeline`](../pipelines/anyflow) page
	for the paper, authors, and released checkpoints.

	For chunk-wise autoregressive (FAR causal) generation, use
	[`AnyFlowFARTransformer3DModel`](anyflow_far_transformer3d) instead.

	```python
	from diffusers import AnyFlowTransformer3DModel

	# Bidirectional AnyFlow checkpoint (T2V):
	transformer = AnyFlowTransformer3DModel.from_pretrained(
	    "nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers", subfolder="transformer"
	)
	```

	## AnyFlowTransformer3DModel[[diffusers.AnyFlowTransformer3DModel]]

	#### diffusers.AnyFlowTransformer3DModel[[diffusers.AnyFlowTransformer3DModel]]

	[Source](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/transformers/transformer_anyflow.py#L507)

	Bidirectional 3D Transformer for AnyFlow flow-map sampling.

	The architecture is the v0.35.1 Wan2.1 3D DiT backbone with one structural change: the timestep embedder is
	replaced by `AnyFlowDualTimestepTextImageEmbedding` so that every forward call conditions on both the source
	timestep `t` and the target timestep `r`. This is the embedding required to learn the flow map
	$\Phi_{r\leftarrow t}$ introduced in [AnyFlow](https://huggingface.co/papers/2605.13724).

	For chunk-wise autoregressive (FAR causal) generation, use `AnyFlowFARTransformer3DModel` instead; that variant
	adds the FAR causal block-mask and a compressed-frame patch embedding on top of the same backbone.

	#### forward[[diffusers.AnyFlowTransformer3DModel.forward]]

	[Source](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/transformers/transformer_anyflow.py#L626)

	Signature: `forward(hidden_states: Tensor, timestep: Tensor, r_timestep: Tensor, encoder_hidden_states: Tensor, encoder_hidden_states_image: typing.Optional[torch.Tensor] = None, attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None, return_dict: bool = True)`

	- hidden_states (`torch.Tensor` of shape `(batch_size, num_frames, num_channels, height, width)`) -- Input video latents.
	- timestep (`torch.Tensor`) -- Source (noisier) flow-map timestep `t`.
	- r_timestep (`torch.Tensor`) -- Target (cleaner) flow-map timestep `r`; defines the destination of the flow-map step.
	- encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`) -- Text-conditioning embeddings.
	- encoder_hidden_states_image (`torch.Tensor`, optional) -- Image-conditioning embeddings; concatenated before the text tokens when provided.
	- attention_kwargs (`dict`, optional) -- Kwargs forwarded to the `AttentionProcessor` as defined under `self.processor` in
	[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
	- return_dict (`bool`, optional, defaults to `True`) -- Whether to return a `Transformer2DModelOutput` instead of a plain tuple.

	Returns:

	`Transformer2DModelOutput` if `return_dict` is `True`, otherwise a tuple whose
	first element is the predicted velocity tensor.
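
Because every forward call takes both a source timestep and a target timestep, a multi-step sampler has to supply a `(timestep, r_timestep)` pair per step. The sketch below shows one way such pairs could be laid out. It is illustrative only: the helper name `flow_map_pairs`, the unit-interval timestep scale (1.0 as the noisiest point, 0.0 as the cleanest), and the uniform spacing are all assumptions, not something this page specifies. In practice the pipeline and its scheduler decide the actual values.

```python
def flow_map_pairs(num_steps, t_start=1.0, t_end=0.0):
    """Split [t_start, t_end] into `num_steps` (source, target) timestep pairs.

    Hypothetical helper: assumes timesteps live on a unit interval and are
    uniformly spaced, which may not match the real scheduler.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    # num_steps + 1 evenly spaced edges from the noisiest to the cleanest time.
    edges = [t_start + (t_end - t_start) * i / num_steps for i in range(num_steps + 1)]
    # Each consecutive pair of edges is one flow-map step: source t, target r.
    return list(zip(edges[:-1], edges[1:]))


pairs = flow_map_pairs(4)
for t, r in pairs:
    print(f"timestep={t:.2f} -> r_timestep={r:.2f}")
```

Under these assumptions, the target `r` of one step becomes the source `t` of the next, so the steps chain from pure noise down to the clean sample.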

	Bidirectional flow-map forward pass. `hidden_states` is laid out as `(B, F, C, H, W)` (per-frame latents).
	The input is patchified with the standard `patch_embedding` (kernel = stride = `patch_size`) and denoised
	with global bidirectional self-attention over the resulting flat token sequence.
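
Since the patch embedding uses kernel = stride = `patch_size`, the patches do not overlap, and the length of the flat token sequence follows directly from the latent shape. The sketch below works through that arithmetic. The helper name `token_sequence_length` and the example latent size are assumptions chosen for illustration, and the sketch rejects sizes that are not divisible by the patch size rather than reproducing whatever rounding the real layer applies.

```python
def token_sequence_length(num_frames, height, width, patch_size=(1, 2, 2)):
    """Count tokens produced by non-overlapping 3D patchification.

    Hypothetical helper: `patch_size` is (t_patch, h_patch, w_patch), matching
    the model's documented default of (1, 2, 2).
    """
    p_t, p_h, p_w = patch_size
    if num_frames % p_t or height % p_h or width % p_w:
        raise ValueError("latent dimensions must be divisible by patch_size")
    # One token per non-overlapping patch along each axis.
    return (num_frames // p_t) * (height // p_h) * (width // p_w)


# Example latent grid (illustrative values): 21 frames of 60 x 104 latents.
n_tokens = token_sequence_length(num_frames=21, height=60, width=104)
print(n_tokens)
```

Global self-attention compares every token with every other token, so its cost grows with the square of this count. That is why the latent frame count and resolution dominate memory use.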

	Parameters:

	patch_size (Tuple[int], defaults to (1, 2, 2)) : 3D patch dimensions for video embedding (t_patch, h_patch, w_patch).

	num_attention_heads (int, defaults to 40) : Number of attention heads.

	attention_head_dim (int, defaults to 128) : The number of channels in each head.

	in_channels (int, defaults to 16) : The number of channels in the input latent.

	out_channels (int, defaults to 16) : The number of channels in the output latent.

	text_dim (int, defaults to 4096) : Input dimension for text embeddings (UMT5).

	freq_dim (int, defaults to 256) : Dimension for sinusoidal time embeddings.

	ffn_dim (int, defaults to 13824) : Intermediate dimension in feed-forward network.

	num_layers (int, defaults to 40) : Number of transformer blocks.

	cross_attn_norm (bool, defaults to True) : Enable cross-attention normalization.

	eps (float, defaults to 1e-6) : Epsilon for normalization layers.

	image_dim (Optional[int], optional, defaults to None) : Image embedding dimension for I2V conditioning (1280 for the original Wan2.1-I2V model).

	rope_max_seq_len (int, defaults to 1024) : Maximum sequence length used to precompute rotary position frequencies.

	gate_value (float, defaults to 0.25) : Mixing gate between source-timestep and delta-timestep embeddings (the AnyFlow paper's $g$ parameter, fixed at 0.25 in stage-1 distillation).

	deltatime_type (str, defaults to 'r') : Either `"r"` (delta is the target timestep) or `"t-r"` (delta is the absolute interval).
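
The last two parameters interact: `deltatime_type` selects which quantity is embedded as the "delta" timestep, and `gate_value` mixes that embedding with the source-timestep embedding. The sketch below illustrates the idea on plain Python lists. Both helper names are hypothetical. The convex-combination form of the mix is an assumption, because this page does not state the exact formula, and `t - r` is used for the interval on the assumption that `t` is the larger (noisier) value. Refer to the linked source for the actual computation.

```python
def delta_timestep(t, r, deltatime_type="r"):
    """Pick the quantity fed to the delta-timestep embedder (hypothetical helper)."""
    if deltatime_type == "r":
        return r  # delta is the target timestep itself
    if deltatime_type == "t-r":
        return t - r  # delta is the interval between source and target
    raise ValueError(f"unknown deltatime_type: {deltatime_type!r}")


def mix_embeddings(emb_t, emb_delta, gate_value=0.25):
    """Blend two embedding vectors with a fixed gate.

    ASSUMPTION: a convex combination (1 - g) * emb_t + g * emb_delta. The real
    model may combine the embeddings differently.
    """
    return [(1.0 - gate_value) * a + gate_value * b for a, b in zip(emb_t, emb_delta)]


print(delta_timestep(1.0, 0.25))                 # target timestep
print(delta_timestep(1.0, 0.25, "t-r"))          # interval length
print(mix_embeddings([1.0, 0.0], [0.0, 1.0]))    # toy 2-dimensional embeddings
```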

