AnyFlowFARTransformer3DModel

The causal (FAR) 3D Transformer used by AnyFlowFARPipeline — the FAR variant of AnyFlow (Yuchao Gu, Guian Fang et al., NUS ShowLab × NVIDIA). It extends the v0.35.1 Wan2.1 backbone with three additions:

  1. FAR causal block-mask via torch.nn.attention.flex_attention, supporting frame-level autoregressive generation as introduced in FAR (Gu et al., 2025).
  2. Compressed-frame patch embedding (far_patch_embedding) for context (already-generated) frames, warm-started from the full-resolution patch_embedding at construction time via trilinear interpolation.
  3. Dual-timestep flow-map embedding (same as AnyFlowTransformer3DModel) — every forward call conditions on both the source timestep t and the target timestep r.
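
To make the first addition concrete, here is a minimal sketch of a chunk-level causal mask in plain Python. It assumes tokens are laid out chunk by chunk and that a query may attend to every token in its own chunk and in all earlier chunks. The mask the model actually builds, especially for the training rollout with clean context frames, may differ. `chunk_ids_for_tokens` and `block_causal_mask` are hypothetical helper names, not part of diffusers.

```python
from typing import List


def chunk_ids_for_tokens(chunk_partition: List[int], tokens_per_frame: int) -> List[int]:
    """Map every token position to the index of the chunk that owns it."""
    ids: List[int] = []
    for chunk_idx, num_frames in enumerate(chunk_partition):
        ids.extend([chunk_idx] * (num_frames * tokens_per_frame))
    return ids


def block_causal_mask(chunk_partition: List[int], tokens_per_frame: int) -> List[List[bool]]:
    """mask[q][kv] is True when query token q may attend to key token kv."""
    ids = chunk_ids_for_tokens(chunk_partition, tokens_per_frame)
    # Assumed rule: attend within the same chunk and to any earlier chunk.
    return [[ids[kv] <= ids[q] for kv in range(len(ids))] for q in range(len(ids))]


# Two chunks (1 frame, then 2 frames) with 2 tokens per frame: 6 tokens in total.
mask = block_causal_mask(chunk_partition=[1, 2], tokens_per_frame=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In PyTorch the same rule would typically be written as a `mask_mod(b, h, q_idx, kv_idx)` callable and passed to `torch.nn.attention.flex_attention.create_block_mask`. How this model constructs its mask internally is not shown on this page.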

The chunk schedule (chunk_partition) is not baked into the model config. It is a per-call argument to forward, so the same checkpoint handles different num_frames configurations without retraining.
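
As an illustration of why a per-call schedule is convenient, the sketch below builds a `chunk_partition` for two different latent frame counts. `even_chunk_partition` is a hypothetical helper, and the equal-chunks-with-shorter-tail layout is an assumption made for this example. The schedules the released pipeline actually uses are not described here.

```python
from typing import List


def even_chunk_partition(num_latent_frames: int, frames_per_chunk: int) -> List[int]:
    """Split latent frames into equal chunks, with a shorter final chunk if needed."""
    if num_latent_frames <= 0 or frames_per_chunk <= 0:
        raise ValueError("both arguments must be positive")
    full, rest = divmod(num_latent_frames, frames_per_chunk)
    return [frames_per_chunk] * full + ([rest] if rest else [])


for num_latent_frames in (9, 10):
    partition = even_chunk_partition(num_latent_frames, frames_per_chunk=3)
    # The documented constraint: the counts must add up to the latent frame count.
    assert sum(partition) == num_latent_frames
    print(num_latent_frames, partition)
```

Either list could be passed as `chunk_partition` to the same loaded model, which is the point of keeping the schedule out of the config.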

from diffusers import AnyFlowFARTransformer3DModel

# Causal AnyFlow checkpoint (FAR):
transformer = AnyFlowFARTransformer3DModel.from_pretrained(
    "nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers", subfolder="transformer"
)

AnyFlowFARTransformer3DModel

class diffusers.AnyFlowFARTransformer3DModel

Causal (FAR) 3D Transformer for AnyFlow flow-map sampling with frame-level autoregressive generation.

Extends the v0.35.1 Wan2.1 backbone with:

  • FAR causal block-mask via torch.nn.attention.flex_attention, supporting frame-level autoregressive generation (FAR; Gu et al., 2025).
  • Compressed-frame patch embedding far_patch_embedding for context (already-generated) frames, initialized from patch_embedding via trilinear interpolation so a freshly constructed model is already at a reasonable starting point even before LoRA fine-tuning.
  • Dual-timestep flow-map embedding for any-step sampling (same as AnyFlowTransformer3DModel).

Use AnyFlowTransformer3DModel instead for plain bidirectional T2V — that variant skips the FAR causal masking and far_patch_embedding and is ~5–10% smaller.

chunk_partition is not a model config field — it is a per-call argument passed to forward. Different inference setups (varying num_frames or full-vs-compressed schedules) therefore do not require separate checkpoints.

forward

diffusers.AnyFlowFARTransformer3DModel.forward(hidden_states: Tensor, timestep: Tensor, r_timestep: Tensor, encoder_hidden_states: Tensor, chunk_partition: typing.List[int], encoder_hidden_states_image: typing.Optional[torch.Tensor] = None, clean_hidden_states: typing.Optional[torch.Tensor] = None, clean_timestep: typing.Optional[torch.Tensor] = None, kv_cache: typing.Optional[typing.List[typing.Dict[str, torch.Tensor]]] = None, kv_cache_flag: typing.Optional[typing.Dict[str, typing.Any]] = None, attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None, return_dict: bool = True)

Source: https://github.com/huggingface/diffusers/blob/vr_13745/src/diffusers/models/transformers/transformer_anyflow_far.py#L913

Parameters:

  • hidden_states (torch.Tensor) -- Latent input of shape (B, F, C, H, W).
  • timestep (torch.Tensor) -- Source (noisier) flow-map timestep t.
  • r_timestep (torch.Tensor) -- Target (cleaner) flow-map timestep r.
  • encoder_hidden_states (torch.Tensor) -- UMT5 text embeddings.
  • chunk_partition (List[int]) -- Per-chunk frame counts; total must match the number of latent frames in hidden_states.
  • encoder_hidden_states_image (torch.Tensor, optional) -- I2V image embedding; concatenated before text tokens when provided.
  • clean_hidden_states (torch.Tensor, optional) -- Clean (noise-free) conditioning frames used by the training rollout.
  • clean_timestep (torch.Tensor, optional) -- Timesteps for the clean conditioning frames in the training rollout.
  • kv_cache (List[Dict[str, torch.Tensor]], optional) -- Per-block KV cache for autoregressive inference. None selects the training path.
  • kv_cache_flag (Dict[str, Any], optional) -- KV-cache metadata (e.g. is_cache_step flag and token counts).
  • attention_kwargs (dict, optional) -- Forwarded to the attention processors.
  • return_dict (bool, optional, defaults to True) -- If False, returns positional tuples instead of an output dataclass.

FAR causal forward pass. Dispatches to one of three internal paths:

  • kv_cache is None → causal training rollout (returns Transformer2DModelOutput).
  • kv_cache is not None and kv_cache_flag["is_cache_step"] → cache-prefill (returns AnyFlowFARTransformerOutput with sample=None).
  • Otherwise → autoregressive inference step (returns AnyFlowFARTransformerOutput).
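
The three rules above can be sketched as a small selector function. This is an illustration of the documented dispatch conditions only, not the model's code. `select_forward_path` and its return labels are hypothetical, and treating a missing `kv_cache_flag` as "not a cache step" is an assumption.

```python
from typing import Any, Dict, List, Optional


def select_forward_path(
    kv_cache: Optional[List[Dict[str, Any]]],
    kv_cache_flag: Optional[Dict[str, Any]],
) -> str:
    """Return a label for the path the documented rules would select."""
    if kv_cache is None:
        return "training_rollout"
    # ASSUMPTION: an absent flag dict or key is treated as is_cache_step=False.
    if kv_cache_flag is not None and kv_cache_flag.get("is_cache_step", False):
        return "cache_prefill"
    return "autoregressive_step"


empty_cache: List[Dict[str, Any]] = [{}]  # stand-in for a per-block cache
print(select_forward_path(None, None))
print(select_forward_path(empty_cache, {"is_cache_step": True}))
print(select_forward_path(empty_cache, {"is_cache_step": False}))
```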

Parameters (AnyFlowFARTransformer3DModel constructor):

patch_size (Tuple[int], defaults to (1, 2, 2)) : 3D patch dimensions for full-resolution chunks.

compressed_patch_size (Tuple[int], defaults to (1, 4, 4)) : Larger patch dimensions for the FAR-compressed (context) chunks.
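
A quick calculation shows what the larger patch buys. The sketch below counts tokens for one latent frame under both default patch sizes. The 60 × 104 latent size is only an example value, `tokens_per_chunk` is a hypothetical helper, and rejecting sizes that do not divide evenly is a simplification (the model may handle such sizes differently).

```python
from typing import Tuple


def tokens_per_chunk(num_frames: int, height: int, width: int, patch: Tuple[int, int, int]) -> int:
    """Number of patch tokens produced for a latent chunk of the given size."""
    pt, ph, pw = patch
    if num_frames % pt or height % ph or width % pw:
        raise ValueError("latent size must be divisible by the patch size")
    return (num_frames // pt) * (height // ph) * (width // pw)


full = tokens_per_chunk(1, 60, 104, patch=(1, 2, 2))        # patch_size default
compressed = tokens_per_chunk(1, 60, 104, patch=(1, 4, 4))  # compressed_patch_size default
print(full, compressed, full // compressed)
```

With the defaults, a compressed context frame contributes a quarter of the tokens of a full-resolution frame, since each spatial patch side doubles.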

full_chunk_limit (int, defaults to 3) : Maximum number of full-resolution chunks before earlier chunks are demoted to compressed FAR context. The released checkpoints use 3.
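
One way to read this limit is sketched below: once the number of chunks exceeds `full_chunk_limit`, the oldest chunks switch to the compressed representation. This is an interpretation of the description above, and keeping the most recent chunks at full resolution is an assumption. `chunk_resolutions` is a hypothetical helper.

```python
from typing import List


def chunk_resolutions(num_chunks: int, full_chunk_limit: int = 3) -> List[str]:
    """Label each chunk, oldest first, as 'compressed' or 'full'."""
    # ASSUMPTION: the most recent full_chunk_limit chunks stay at full resolution.
    num_compressed = max(0, num_chunks - full_chunk_limit)
    return ["compressed"] * num_compressed + ["full"] * (num_chunks - num_compressed)


for n in (2, 3, 5):
    print(n, chunk_resolutions(n))
```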

num_attention_heads (int, defaults to 40) : Number of attention heads.

attention_head_dim (int, defaults to 128) : The number of channels in each head.

in_channels (int, defaults to 16) : The number of channels in the input latent.

out_channels (int, defaults to 16) : The number of channels in the output latent.

text_dim (int, defaults to 4096) : Input dimension for text embeddings (UMT5).

freq_dim (int, defaults to 256) : Dimension for sinusoidal time embeddings.
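
For readers unfamiliar with sinusoidal time embeddings, here is a minimal sketch of the standard construction at `freq_dim=256`. The cosine-then-sine ordering, the `max_period` of 10000, and the absence of any timestep scaling are assumptions. The model's own embedding layer may use different conventions.

```python
import math
from typing import List


def sinusoidal_embedding(timestep: float, freq_dim: int = 256, max_period: float = 10000.0) -> List[float]:
    """Embed a scalar timestep as freq_dim/2 cosines followed by freq_dim/2 sines."""
    half = freq_dim // 2
    # Geometrically spaced frequencies from 1 down towards 1/max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    args = [timestep * f for f in freqs]
    return [math.cos(a) for a in args] + [math.sin(a) for a in args]


emb = sinusoidal_embedding(0.5)
print(len(emb))
```

Both the source timestep t and the second timestep signal would pass through an embedding of this kind before being projected to the model width.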

ffn_dim (int, defaults to 13824) : Intermediate dimension in feed-forward network.

num_layers (int, defaults to 40) : Number of transformer blocks.

cross_attn_norm (bool, defaults to True) : Enable cross-attention normalization.

eps (float, defaults to 1e-6) : Epsilon for normalization layers.

image_dim (Optional[int], optional, defaults to None) : Image embedding dimension for I2V conditioning.

rope_max_seq_len (int, defaults to 1024) : Maximum sequence length used to precompute rotary position frequencies.

gate_value (float, defaults to 0.25) : Mixing gate between source-timestep and delta-timestep embeddings.

deltatime_type (str, defaults to 'r') : Either "r" (delta is the target timestep) or "t-r" (delta is the absolute interval).
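
The last two parameters can be illustrated together. The sketch below picks the second timestep signal according to `deltatime_type`, then blends two embedding vectors with `gate_value`. The choice between `r` and `t - r` follows the description above. The blending rule (a convex combination) is purely an assumption, since this page does not state how the gate is applied. Both function names are hypothetical.

```python
from typing import List


def delta_timestep(t: float, r: float, deltatime_type: str = "r") -> float:
    """Pick the second timestep signal according to deltatime_type."""
    if deltatime_type == "r":
        return r  # delta is the target timestep itself
    if deltatime_type == "t-r":
        return t - r  # interval between source and target (non-negative when t >= r)
    raise ValueError(f"unknown deltatime_type: {deltatime_type!r}")


def mix_embeddings(source_emb: List[float], delta_emb: List[float], gate_value: float = 0.25) -> List[float]:
    """ASSUMPTION: convex combination weighted by gate_value; the real rule may differ."""
    return [(1.0 - gate_value) * s + gate_value * d for s, d in zip(source_emb, delta_emb)]


t, r = 0.8, 0.2
print(delta_timestep(t, r, "r"), delta_timestep(t, r, "t-r"))
print(mix_embeddings([1.0, 0.0], [0.0, 1.0]))
```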

AnyFlowFARTransformerOutput

class diffusers.models.transformers.transformer_anyflow_far.AnyFlowFARTransformerOutput

Output dataclass for AnyFlowFARTransformer3DModel's causal forward paths.

Parameters:

sample (torch.Tensor or None) : Predicted denoising target for the autoregressive chunk. None for the cache-prefill path, which only writes the KV cache and produces no usable sample.

kv_cache (list[dict[str, torch.Tensor]], optional) : Per-block KV cache state used by subsequent autoregressive steps.
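
Because `sample` may be None, calling code has to check it before use. The sketch below uses a stand-in dataclass with the two documented fields to show that check. `FAROutputSketch` and `collect_sample` are hypothetical names for illustration, and the field defaults are assumptions. The real class lives in diffusers and holds tensors.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class FAROutputSketch:
    """Stand-in mirroring the two documented fields (not the diffusers class)."""
    sample: Optional[Any] = None
    kv_cache: Optional[List[Dict[str, Any]]] = None


def collect_sample(output: FAROutputSketch, collected: List[Any]) -> None:
    """Keep the sample only when the step produced one (prefill steps do not)."""
    if output.sample is not None:
        collected.append(output.sample)


collected: List[Any] = []
collect_sample(FAROutputSketch(sample=None, kv_cache=[{}]), collected)       # cache-prefill
collect_sample(FAROutputSketch(sample="chunk-0", kv_cache=[{}]), collected)  # inference step
print(collected)
```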
