AnyFlowFARTransformer3DModel
The causal (FAR) 3D Transformer used by AnyFlowFARPipeline —
the FAR variant of AnyFlow (Yuchao Gu, Guian Fang et al., NUS
ShowLab × NVIDIA). It extends the v0.35.1 Wan2.1 backbone with three additions:
- FAR causal block-mask via torch.nn.attention.flex_attention, supporting frame-level autoregressive generation as introduced in FAR (Gu et al., 2025).
- Compressed-frame patch embedding (far_patch_embedding) for context (already-generated) frames, warm-started from the full-resolution patch_embedding at construction time via trilinear interpolation.
- Dual-timestep flow-map embedding (same as AnyFlowTransformer3DModel): every forward call conditions on both the source timestep t and the target timestep r.
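As a rough illustration of what a chunk-level causal mask looks like, the sketch below builds a predicate with the same (b, h, q_idx, kv_idx) argument order that flex_attention expects from a mask_mod. It is plain Python over integer indices, not the model's own mask: the helper name make_chunk_causal_mask_mod is hypothetical, and it assumes every frame contributes the same number of tokens and that a chunk attends to itself and to all earlier chunks. The mask actually used by the model, which also has to account for compressed context tokens, may differ.

```python
from bisect import bisect_right
from itertools import accumulate


def make_chunk_causal_mask_mod(chunk_partition, tokens_per_frame):
    """Return a mask_mod-style predicate for a chunk-level causal mask.

    Hypothetical helper: assumes a fixed token count per frame and that a
    chunk may attend to itself and to every earlier chunk.
    """
    # Cumulative token offsets marking where each chunk ends.
    chunk_ends = list(accumulate(f * tokens_per_frame for f in chunk_partition))

    def chunk_of(token_idx):
        # Index of the chunk that contains this token.
        return bisect_right(chunk_ends, token_idx)

    def mask_mod(b, h, q_idx, kv_idx):
        # True means "this query token may attend to this key token".
        return chunk_of(kv_idx) <= chunk_of(q_idx)

    return mask_mod


# Two chunks (1 frame, then 2 frames) with 2 tokens per frame: 6 tokens total.
mask_mod = make_chunk_causal_mask_mod(chunk_partition=[1, 2], tokens_per_frame=2)
total_tokens = (1 + 2) * 2
dense_mask = [
    [mask_mod(0, 0, q, k) for k in range(total_tokens)]
    for q in range(total_tokens)
]
for row in dense_mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Tokens of the first chunk see only themselves, while tokens of the second chunk see both chunks. With real flex_attention the indices arrive as tensors, so the chunk lookup would need tensor operations instead of bisect.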
The chunk schedule (chunk_partition) is not baked into the model config. It is a per-call argument to
forward, so the same checkpoint handles different num_frames configurations without retraining.
from diffusers import AnyFlowFARTransformer3DModel
# Causal AnyFlow checkpoint (FAR):
transformer = AnyFlowFARTransformer3DModel.from_pretrained(
"nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers", subfolder="transformer"
)
AnyFlowFARTransformer3DModel
diffusers.AnyFlowFARTransformer3DModel
Causal (FAR) 3D Transformer for AnyFlow flow-map sampling with frame-level autoregressive generation.
Extends the v0.35.1 Wan2.1 backbone with:
- FAR causal block-mask via torch.nn.attention.flex_attention, supporting frame-level autoregressive generation (FAR; Gu et al., 2025).
- Compressed-frame patch embedding far_patch_embedding for context (already-generated) frames, initialized from patch_embedding via trilinear interpolation so a freshly constructed model is already at a reasonable starting point even before LoRA fine-tuning.
- Dual-timestep flow-map embedding for any-step sampling (same as AnyFlowTransformer3DModel).
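To make the effect of the compressed patch embedding concrete, the following sketch counts tokens per latent frame for the two default patch sizes listed under Parameters, (1, 2, 2) and (1, 4, 4). The helper tokens_per_frame and the 60 x 104 latent size are illustrative assumptions, not values read from a released config.

```python
def tokens_per_frame(latent_height, latent_width, patch_size):
    """Number of tokens one latent frame yields for a (t, h, w) patch size."""
    _, patch_h, patch_w = patch_size
    if latent_height % patch_h or latent_width % patch_w:
        raise ValueError("latent size must be divisible by the patch size")
    return (latent_height // patch_h) * (latent_width // patch_w)


# Illustrative latent size (assumed for this example only).
latent_height, latent_width = 60, 104

full_tokens = tokens_per_frame(latent_height, latent_width, (1, 2, 2))
compressed_tokens = tokens_per_frame(latent_height, latent_width, (1, 4, 4))

print(full_tokens, compressed_tokens, full_tokens // compressed_tokens)
# 1560 390 4
```

Doubling the spatial patch size in both directions cuts the per-frame token count by a factor of four, which is what makes long compressed context affordable in attention.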
Use AnyFlowTransformer3DModel instead for plain bidirectional T2V — that variant skips the FAR causal masking
and far_patch_embedding and is ~5–10% smaller.
chunk_partition is not a model config field — it is a per-call argument passed to forward.
Different inference setups (varying num_frames or full-vs-compressed schedules) therefore do not require
separate checkpoints.
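Because chunk_partition is supplied on every call, the caller is responsible for keeping it consistent with the latent frame count. A minimal sketch of how that could be done is shown below. Both helpers (build_chunk_partition and check_chunk_partition) are hypothetical and not part of diffusers, and the equal-sized chunking is only one possible schedule.

```python
def build_chunk_partition(num_latent_frames, frames_per_chunk):
    """Split latent frames into consecutive chunks (hypothetical helper).

    Any remainder ends up in a shorter final chunk.
    """
    if num_latent_frames <= 0 or frames_per_chunk <= 0:
        raise ValueError("frame counts must be positive")
    full_chunks, remainder = divmod(num_latent_frames, frames_per_chunk)
    partition = [frames_per_chunk] * full_chunks
    if remainder:
        partition.append(remainder)
    return partition


def check_chunk_partition(chunk_partition, num_latent_frames):
    """Raise if the partition does not cover exactly the latent frames."""
    if any(n <= 0 for n in chunk_partition):
        raise ValueError("every chunk must contain at least one frame")
    if sum(chunk_partition) != num_latent_frames:
        raise ValueError(
            f"chunk_partition sums to {sum(chunk_partition)}, "
            f"expected {num_latent_frames} latent frames"
        )


# The same helper serves different clip lengths without touching the model.
short_clip = build_chunk_partition(num_latent_frames=9, frames_per_chunk=3)
long_clip = build_chunk_partition(num_latent_frames=10, frames_per_chunk=4)
check_chunk_partition(short_clip, 9)
check_chunk_partition(long_clip, 10)
print(short_clip, long_clip)
```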
forward
diffusers.AnyFlowFARTransformer3DModel.forward
Source: https://github.com/huggingface/diffusers/blob/vr_13745/src/diffusers/models/transformers/transformer_anyflow_far.py#L913
forward(hidden_states: Tensor, timestep: Tensor, r_timestep: Tensor, encoder_hidden_states: Tensor, chunk_partition: typing.List[int], encoder_hidden_states_image: typing.Optional[torch.Tensor] = None, clean_hidden_states: typing.Optional[torch.Tensor] = None, clean_timestep: typing.Optional[torch.Tensor] = None, kv_cache: typing.Optional[typing.List[typing.Dict[str, torch.Tensor]]] = None, kv_cache_flag: typing.Optional[typing.Dict[str, typing.Any]] = None, attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None, return_dict: bool = True)
- hidden_states (torch.Tensor) -- Latent input of shape (B, F, C, H, W).
- timestep (torch.Tensor) -- Source (noisier) flow-map timestep t.
- r_timestep (torch.Tensor) -- Target (cleaner) flow-map timestep r.
- encoder_hidden_states (torch.Tensor) -- UMT5 text embeddings.
- chunk_partition (List[int]) -- Per-chunk frame counts; total must match the number of latent frames in hidden_states.
- encoder_hidden_states_image (torch.Tensor, optional) -- I2V image embedding; concatenated before text tokens when provided.
- clean_hidden_states (torch.Tensor, optional) -- Clean (noise-free) conditioning frames used by the training rollout.
- clean_timestep (torch.Tensor, optional) -- Timesteps for the clean conditioning frames in the training rollout.
- kv_cache (List[Dict[str, torch.Tensor]], optional) -- Per-block KV cache for autoregressive inference. None selects the training path.
- kv_cache_flag (Dict[str, Any], optional) -- KV-cache metadata (e.g. is_cache_step flag and token counts).
- attention_kwargs (dict, optional) -- Forwarded to the attention processors.
- return_dict (bool, optional, defaults to True) -- If False, returns positional tuples instead of an output dataclass.
FAR causal forward pass. Dispatches to one of three internal paths:
- kv_cache is None → causal training rollout (returns Transformer2DModelOutput).
- kv_cache is not None and kv_cache_flag["is_cache_step"] → cache-prefill (returns AnyFlowFARTransformerOutput with sample=None).
- Otherwise → autoregressive inference step (returns AnyFlowFARTransformerOutput).
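The three-way dispatch described above can be sketched as a small standalone function. This is an illustration of the selection rule only, not the model's code: select_forward_path and the returned labels are hypothetical, and treating a missing kv_cache_flag or a missing is_cache_step key as "not a cache step" is an assumption.

```python
def select_forward_path(kv_cache, kv_cache_flag=None):
    """Return a label for the path forward() would take (hypothetical helper)."""
    if kv_cache is None:
        # No cache supplied: causal training rollout.
        return "training_rollout"
    # Assumption: an absent flag dict or key means this is not a cache step.
    if kv_cache_flag is not None and kv_cache_flag.get("is_cache_step", False):
        # Cache supplied and flagged: only the KV cache is written.
        return "cache_prefill"
    # Cache supplied, not a cache step: regular autoregressive inference.
    return "autoregressive_step"


empty_cache = [{} for _ in range(2)]  # stand-in for a per-block KV cache
print(select_forward_path(None))
print(select_forward_path(empty_cache, {"is_cache_step": True}))
print(select_forward_path(empty_cache, {"is_cache_step": False}))
```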
Parameters:
patch_size (Tuple[int], defaults to (1, 2, 2)) : 3D patch dimensions for full-resolution chunks.
compressed_patch_size (Tuple[int], defaults to (1, 4, 4)) : Larger patch dimensions for the FAR-compressed (context) chunks.
full_chunk_limit (int, defaults to 3) : Maximum number of full-resolution chunks before earlier chunks are demoted to compressed FAR context. The released checkpoints use 3.
num_attention_heads (int, defaults to 40) : Number of attention heads.
attention_head_dim (int, defaults to 128) : The number of channels in each head.
in_channels (int, defaults to 16) : The number of channels in the input latent.
out_channels (int, defaults to 16) : The number of channels in the output latent.
text_dim (int, defaults to 4096) : Input dimension for text embeddings (UMT5).
freq_dim (int, defaults to 256) : Dimension for sinusoidal time embeddings.
ffn_dim (int, defaults to 13824) : Intermediate dimension in feed-forward network.
num_layers (int, defaults to 40) : Number of transformer blocks.
cross_attn_norm (bool, defaults to True) : Enable cross-attention normalization.
eps (float, defaults to 1e-6) : Epsilon for normalization layers.
image_dim (Optional[int], optional, defaults to None) : Image embedding dimension for I2V conditioning.
rope_max_seq_len (int, defaults to 1024) : Maximum sequence length used to precompute rotary position frequencies.
gate_value (float, defaults to 0.25) : Mixing gate between source-timestep and delta-timestep embeddings.
deltatime_type (str, defaults to 'r') : Either "r" (delta is the target timestep) or "t-r" (delta is the absolute interval).
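The interaction of deltatime_type and gate_value can be illustrated with a small sketch. The choice of delta follows the parameter description above; the mixing rule is an assumption. The documentation only says that gate_value mixes the source-timestep and delta-timestep embeddings, so the convex combination used here is a guess at one plausible form, and both function names are hypothetical.

```python
def delta_timestep(t, r, deltatime_type="r"):
    """Pick the second conditioning value from source t and target r."""
    if deltatime_type == "r":
        # Delta is the target timestep itself.
        return r
    if deltatime_type == "t-r":
        # Delta is the interval; t is the noisier source timestep,
        # so this is assumed to be non-negative.
        return t - r
    raise ValueError(f"unknown deltatime_type: {deltatime_type!r}")


def mix_embeddings(source_emb, delta_emb, gate_value=0.25):
    """Blend two embedding vectors.

    ASSUMPTION: a convex combination weighted by gate_value. The model's
    real mixing rule is not specified in this documentation.
    """
    if len(source_emb) != len(delta_emb):
        raise ValueError("embeddings must have the same length")
    return [
        (1.0 - gate_value) * s + gate_value * d
        for s, d in zip(source_emb, delta_emb)
    ]


print(delta_timestep(0.8, 0.2, "r"))
print(mix_embeddings([1.0, 0.0], [0.0, 1.0], gate_value=0.25))
```

Under this assumed rule, the default gate_value of 0.25 keeps the source-timestep embedding dominant while still letting the target timestep influence the conditioning.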
AnyFlowFARTransformerOutput
diffusers.models.transformers.transformer_anyflow_far.AnyFlowFARTransformerOutput
Output dataclass for AnyFlowFARTransformer3DModel's causal forward paths.
Parameters:
sample (torch.Tensor or None) : Predicted denoising target for the autoregressive chunk. None for the cache-prefill path, which only writes the KV cache and produces no usable sample.
kv_cache (list[dict[str, torch.Tensor]], optional) : Per-block KV cache state used by subsequent autoregressive steps.