AnyFlow

AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation by Yuchao Gu, Guian Fang and collaborators at NUS ShowLab in collaboration with NVIDIA.

Few-step video generation has been significantly advanced by consistency models. However, their performance often degrades in any-step video diffusion models due to the fixed-point formulation. To address this limitation, we present AnyFlow, the first any-step video diffusion distillation framework built on flow maps. Instead of learning only the mapping z_t → z_0, AnyFlow learns transitions z_t → z_r over arbitrary time intervals, enabling a single model to adapt to different inference budgets. We design an improved forward flow map training recipe that fine-tunes pretrained video diffusion models into flow map models, and introduce Flow Map Backward Simulation to enable on-policy distillation for flow map models. Extensive experiments across both bidirectional and causal architectures, at scales ranging from 1.3B to 14B, on text-to-video and image-to-video tasks demonstrate that AnyFlow outperforms consistency-based baselines while preserving high fidelity and flexible sampling under varying step budgets.

The original training code is at NVlabs/AnyFlow. The project page is at nvlabs.github.io/AnyFlow.

The following AnyFlow checkpoints are supported:

| Checkpoint | Backbone | Description |
|---|---|---|
| nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers | Wan2.1 1.3B | Bidirectional T2V, lightweight |
| nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers | Wan2.1 14B | Bidirectional T2V, full quality |
| nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers | FAR + Wan2.1 1.3B | Causal T2V / I2V / V2V |
| nvidia/AnyFlow-FAR-Wan2.1-14B-Diffusers | FAR + Wan2.1 14B | Causal T2V / I2V / V2V |

All four are grouped under the nvidia/anyflow Hugging Face collection.

Choose AnyFlowPipeline for traditional bidirectional text-to-video generation. Choose AnyFlowFARPipeline for streaming I2V, video continuation (V2V), or any setup that benefits from frame-by-frame autoregressive sampling.

AnyFlow supports any-step sampling: a single distilled checkpoint can be evaluated at 1, 2, 4, 8, 16, ... NFE (number of function evaluations) without retraining. In our benchmarks, quality improves monotonically with the number of steps.

Optimizing Memory and Inference Speed

# Memory: offload transformer weights at leaf level and decode with VAE slicing and tiling.
import torch
from diffusers import AnyFlowPipeline
from diffusers.hooks import apply_group_offloading

pipe = AnyFlowPipeline.from_pretrained(
    "nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers", torch_dtype=torch.bfloat16
)
apply_group_offloading(pipe.transformer, onload_device="cuda", offload_type="leaf_level")
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

# Inference speed: keep the pipeline on the GPU and compile the transformer.
import torch
from diffusers import AnyFlowPipeline

pipe = AnyFlowPipeline.from_pretrained(
    "nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune-no-cudagraphs")

Generation with AnyFlow (Bidirectional T2V)

import torch
from diffusers import AnyFlowPipeline
from diffusers.utils import export_to_video

pipe = AnyFlowPipeline.from_pretrained(
    "nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")

prompt = "A red panda eating bamboo in a forest, cinematic lighting"
video = pipe(prompt, num_inference_steps=4, num_frames=33).frames[0]
export_to_video(video, "out.mp4", fps=16)

Generation with AnyFlow (FAR Causal)

The causal pipeline selects between T2V / I2V / V2V via the video (or video_latents) argument: omit both for plain text-to-video, or pass video=<tensor> of shape (B, T, C, H, W) in [0, 1] with T = 4n + 1 to condition on existing frames. Use a single conditioning frame for I2V and a longer clip for V2V continuation. If you already have pre-encoded latents in the model layout, pass them via video_latents=<tensor> to skip VAE encoding. video and video_latents are mutually exclusive.
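The input contract described above can be checked before the pipeline is called. The following is a minimal sketch and not part of diffusers: `frames_to_video_array` is a hypothetical helper that stacks (H, W, 3) uint8 frames into a (1, T, C, H, W) float32 array in [0, 1] and rejects clips whose length is not of the form 4n + 1. The small frame size is only there to keep the example light.

```python
import numpy as np


def frames_to_video_array(frames):
    """Hypothetical helper: stack (H, W, 3) uint8 frames into (1, T, 3, H, W) float32 in [0, 1]."""
    num_frames = len(frames)
    if num_frames == 0 or (num_frames - 1) % 4 != 0:
        raise ValueError(f"expected T = 4n + 1 frames, got {num_frames}")
    # Stack to (T, H, W, C) and rescale from [0, 255] to [0, 1].
    arr = np.stack([np.asarray(f) for f in frames]).astype("float32") / 255.0
    arr = arr.transpose(0, 3, 1, 2)  # (T, C, H, W)
    return arr[None]  # add the batch dimension -> (1, T, C, H, W)


# Dummy 9-frame clip (9 = 4 * 2 + 1); real frames would come from load_video or load_image.
clip = [np.full((48, 80, 3), 255, dtype=np.uint8) for _ in range(9)]
video_array = frames_to_video_array(clip)
print(video_array.shape, video_array.dtype)
```

The resulting array can be wrapped with torch.from_numpy(...) and moved to the target device before being passed as video.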

AnyFlowFARPipeline.default_chunk_partition = [1, 3, 3, 3, 3, 3, 3, 2] (sum 21) is matched to the released checkpoints' canonical 81 raw frames (21 latent frames at the VAE temporal stride of 4). When you change num_frames, you must also pass a matching chunk_partition summing to (num_frames - 1) // 4 + 1, otherwise the pipeline raises an AssertionError.
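The arithmetic behind this constraint can be sketched in a few lines. The helper names below are hypothetical and not part of diffusers, and the 33-frame partition is only an illustrative choice that satisfies the sum rule, not a recommended setting.

```python
DEFAULT_CHUNK_PARTITION = [1, 3, 3, 3, 3, 3, 3, 2]  # value quoted in the text above


def num_latent_frames(num_frames, temporal_stride=4):
    """Latent frame count for a raw clip, using the formula from the text."""
    return (num_frames - 1) // temporal_stride + 1


def check_chunk_partition(num_frames, chunk_partition):
    """Hypothetical pre-flight check: the partition must sum to the latent frame count."""
    expected = num_latent_frames(num_frames)
    if sum(chunk_partition) != expected:
        raise ValueError(
            f"chunk_partition sums to {sum(chunk_partition)}, "
            f"but num_frames={num_frames} needs {expected} latent frames"
        )
    return expected


print(check_chunk_partition(81, DEFAULT_CHUNK_PARTITION))  # 21
print(check_chunk_partition(33, [1, 3, 3, 2]))  # 9
```

Running such a check before calling the pipeline turns the AssertionError into an earlier, more descriptive failure.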

# Text-to-video: omit both video and video_latents.
import torch
from diffusers import AnyFlowFARPipeline
from diffusers.utils import export_to_video

pipe = AnyFlowFARPipeline.from_pretrained(
    "nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")

video = pipe(
    prompt="A cat surfing a wave, sunset",
    num_inference_steps=4,
    num_frames=81,
).frames[0]
export_to_video(video, "out.mp4", fps=16)

# Image-to-video: condition on a single frame.
import numpy as np
import torch
from diffusers import AnyFlowFARPipeline
from diffusers.utils import export_to_video, load_image

pipe = AnyFlowFARPipeline.from_pretrained(
    "nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")

# Wrap the conditioning image as a one-frame video tensor: (1, 1, 3, H, W) in [0, 1].
first_frame = load_image("path/to/first_frame.png").resize((832, 480))
arr = np.asarray(first_frame).astype("float32") / 255.0  # (480, 832, 3)
context_tensor = torch.from_numpy(arr).permute(2, 0, 1).unsqueeze(0).unsqueeze(1).to("cuda")

video = pipe(
    prompt="a cat walks across a sunlit lawn",
    video=context_tensor,
    num_inference_steps=4,
    num_frames=81,
).frames[0]
export_to_video(video, "out.mp4", fps=16)

# Video-to-video continuation: condition on a short context clip.
import numpy as np
import torch
from diffusers import AnyFlowFARPipeline
from diffusers.utils import export_to_video, load_video

pipe = AnyFlowFARPipeline.from_pretrained(
    "nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")

# Context clip — 9 raw frames map to 3 latent frames (9 = 4·2 + 1, 3 = 2 + 1).
context_frames = load_video("path/to/context.mp4")[:9]
arr = np.stack([np.asarray(f.resize((832, 480))) for f in context_frames]).astype("float32") / 255.0
# np.stack gives (T, H, W, C) = (9, 480, 832, 3) → permute to (T, C, H, W) then add batch.
context_tensor = torch.from_numpy(arr).permute(0, 3, 1, 2).unsqueeze(0).to("cuda")  # (1, 9, 3, 480, 832)

video = pipe(
    prompt="continue the story",
    video=context_tensor,
    num_inference_steps=4,
    num_frames=81,
    # Override chunk_partition so the first chunk covers exactly the 3 latent context frames.
    chunk_partition=[3, 3, 3, 3, 3, 3, 3],
).frames[0]
export_to_video(video, "out.mp4", fps=16)

Notes

  • Classifier-free guidance is fused into the released checkpoints, so inference does not run a second guided forward pass. Keep the default guidance_scale=1.0 unless your own checkpoint requires otherwise.
  • FlowMapEulerDiscreteScheduler is general-purpose. You can attach it to any flow-map-distilled checkpoint via from_pretrained(..., scheduler=FlowMapEulerDiscreteScheduler.from_config(...)).
  • AnyFlowPipeline uses AnyFlowTransformer3DModel (bidirectional). AnyFlowFARPipeline uses AnyFlowFARTransformer3DModel, which adds a compressed-frame patch embedding and the FAR causal block-mask.
  • LoRA loading is supported via WanLoraLoaderMixin, the same mixin used by the upstream Wan pipelines.
  • For training recipes (forward flow-map training and on-policy distillation), refer to the original AnyFlow training framework at NVlabs/AnyFlow; training is out of scope for diffusers.

AnyFlowPipeline

class diffusers.AnyFlowPipeline

Bidirectional text-to-video generation pipeline for AnyFlow flow-map-distilled checkpoints, introduced in AnyFlow by Yuchao Gu, Guian Fang et al.

AnyFlow learns arbitrary-interval transitions z_t → z_r rather than the fixed z_t → z_0 mapping of consistency models, so a single distilled checkpoint can be evaluated at 1, 2, 4, 8, 16, ... NFE without retraining. This pipeline operates over the full video tensor in one bidirectional pass; for frame-level autoregressive (causal) generation use AnyFlowFARPipeline.

Sampling is plain Euler in mean-velocity form (z_r = z_t - (t - r) * u) with no re-noising. The released NVIDIA checkpoints fold classifier-free guidance into the model weights, so the default guidance_scale=1.0 is the recommended setting.
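The update rule can be illustrated with a toy sampler. This is a sketch under stated assumptions, not the scheduler's implementation: time runs from t = 1 (noise) to t = 0 (data), the grid is uniform, and `oracle_mean_velocity` is a hypothetical stand-in for the transformer that returns the exact mean velocity of a straight-line path between a known sample and a known noise vector. With an exact mean velocity, every step budget lands on the same endpoint, which is the any-step property the text describes.

```python
import numpy as np


def flow_map_euler_step(z_t, u, t, r):
    """One mean-velocity Euler step from time t to time r: z_r = z_t - (t - r) * u."""
    return z_t - (t - r) * u


def sample(z_start, mean_velocity_fn, num_steps):
    """Walk a uniform time grid from t = 1 down to t = 0 with no re-noising."""
    times = np.linspace(1.0, 0.0, num_steps + 1)
    z = z_start
    for t, r in zip(times[:-1], times[1:]):
        u = mean_velocity_fn(z, t, r)
        z = flow_map_euler_step(z, u, t, r)
    return z


rng = np.random.default_rng(0)
x0 = rng.normal(size=4)     # toy "clean" sample
noise = rng.normal(size=4)  # toy starting noise


def oracle_mean_velocity(z, t, r):
    # Assumption: straight path z_t = (1 - t) * x0 + t * noise, whose velocity is constant.
    return noise - x0


results = {n: sample(noise, oracle_mean_velocity, n) for n in (1, 2, 4, 8)}
for n, z in results.items():
    print(n, np.allclose(z, x0))
```

A trained model only approximates the mean velocity, so in practice the step count still affects quality.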

This model inherits from [DiffusionPipeline]. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

__call__

diffusers.AnyFlowPipeline.__call__(
    prompt: Union[str, List[str]] = None,
    video: Optional[torch.Tensor] = None,
    video_latents: Optional[torch.Tensor] = None,
    negative_prompt: Union[str, List[str]] = None,
    height: int = 480,
    width: int = 832,
    num_frames: int = 81,
    num_inference_steps: int = 50,
    sigmas: Optional[List[float]] = None,
    timesteps: Optional[List[float]] = None,
    guidance_scale: float = 1.0,
    num_videos_per_prompt: Optional[int] = 1,
    generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
    latents: Optional[torch.Tensor] = None,
    prompt_embeds: Optional[torch.Tensor] = None,
    negative_prompt_embeds: Optional[torch.Tensor] = None,
    output_type: Optional[str] = 'np',
    return_dict: bool = True,
    attention_kwargs: Optional[Dict[str, Any]] = None,
    callback_on_step_end: Optional[Union[Callable[[int, int, Dict], None], PipelineCallback, MultiPipelineCallbacks]] = None,
    callback_on_step_end_tensor_inputs: List[str] = ['latents'],
    max_sequence_length: int = 512,
    use_mean_velocity: bool = True,
)

Source: https://github.com/huggingface/diffusers/blob/vr_13745/src/diffusers/pipelines/anyflow/pipeline_anyflow.py#L379

  • prompt (str or List[str], optional) -- The prompt or prompts to guide the video generation. If not defined, pass prompt_embeds instead.
  • video (torch.Tensor, optional) -- Pre-VAE conditioning frames of shape (B, T, C, H, W) in [0, 1]. When provided, the pipeline VAE-encodes them and keeps the corresponding latent prefix fixed during sampling. Mutually exclusive with video_latents.
  • video_latents (torch.Tensor, optional) -- Pre-encoded VAE latents in the AnyFlow layout (B, T_latent, C, H_latent, W_latent). Skips VAE encoding on the pipeline side. Mutually exclusive with video.
  • negative_prompt (str or List[str], optional) -- The prompt or prompts to avoid during video generation. Ignored when not using guidance.

The call function to the pipeline for generation.

Examples:

>>> import torch
>>> from diffusers import AnyFlowPipeline
>>> from diffusers.utils import export_to_video

>>> pipe = AnyFlowPipeline.from_pretrained(
...     "nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers", torch_dtype=torch.bfloat16
... ).to("cuda")

>>> prompt = "A red panda eating bamboo in a forest, cinematic lighting"
>>> video = pipe(prompt, num_inference_steps=4, num_frames=33).frames[0]
>>> export_to_video(video, "anyflow_t2v.mp4", fps=16)

Parameters:

tokenizer ([AutoTokenizer]) : Tokenizer from google/umt5-xxl.

text_encoder ([UMT5EncoderModel]) : google/umt5-xxl text encoder.

transformer ([AnyFlowTransformer3DModel]) : Bidirectional flow-map 3D Transformer.

vae ([AutoencoderKLWan]) : VAE that encodes/decodes videos to and from latent representations.

scheduler ([FlowMapEulerDiscreteScheduler]) : Flow-map sampler. The pipeline drives scheduler.step(..., timestep, sample, r_timestep) per inference step.

Returns:

AnyFlowPipelineOutput or tuple

If return_dict is True, AnyFlowPipelineOutput is returned, otherwise a tuple whose first element is the generated video.

encode_prompt

Encodes the prompt into text encoder hidden states.

Parameters:

prompt (str or list[str], optional) : prompt to be encoded

negative_prompt (str or list[str], optional) : The prompt or prompts not to guide the video generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).

do_classifier_free_guidance (bool, optional, defaults to True) : Whether to use classifier free guidance or not.

num_videos_per_prompt (int, optional, defaults to 1) : Number of videos that should be generated per prompt.

prompt_embeds (torch.Tensor, optional) : Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.

negative_prompt_embeds (torch.Tensor, optional) : Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.

device (torch.device, optional) : torch device to place the resulting embeddings on.

dtype (torch.dtype, optional) : torch dtype of the resulting embeddings.

encode_video

Encode a pixel-space video into AnyFlow's latent layout.

Mirrors the single-helper convention of other diffusers pipelines (cf. WanImageToVideoPipeline.encode_image): wraps preprocessing, VAE encoding, and latent normalization into one call. Output layout is (B, T_latent, C, H, W), which is what the AnyFlow transformer expects for conditioning frames.

AnyFlowFARPipeline

class diffusers.AnyFlowFARPipeline

Causal (FAR-based) text-to-video / image-to-video / video-to-video pipeline for AnyFlow checkpoints, introduced in AnyFlow by Yuchao Gu, Guian Fang et al.

The pipeline drives a frame-level autoregressive sampling loop over chunks: each chunk is denoised with flow-map steps while attending only to past chunks via block-sparse causal attention, and intermediate KV cache is reused across chunks.
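The chunk-level causality can be pictured as a frame-by-frame visibility mask. The sketch below is an illustration only: `block_causal_mask` is a hypothetical helper, it works at latent-frame granularity rather than on attention tokens, and it assumes frames inside the same chunk attend to each other freely. The actual model uses block-sparse attention kernels and a KV cache instead of a dense boolean matrix.

```python
import numpy as np


def block_causal_mask(chunk_partition):
    """Hypothetical helper: mask[i, j] is True when latent frame i may attend to latent frame j."""
    # Label every latent frame with the index of the chunk it belongs to.
    chunk_ids = np.repeat(np.arange(len(chunk_partition)), chunk_partition)
    # A frame sees its own chunk (assumed bidirectional inside a chunk) and all earlier chunks.
    return chunk_ids[:, None] >= chunk_ids[None, :]


mask = block_causal_mask([1, 3, 3, 2])  # illustrative 9-latent-frame partition
print(mask.astype(int))
```

Each row of the printed matrix is one query frame; the ones extend to the end of that frame's own chunk and stop there, so later chunks never leak into earlier ones.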

The task mode (T2V / I2V / V2V) is selected by which conditioning argument is passed to __call__:

  • both video=None and video_latents=None: pure text-to-video.
  • video=<tensor of shape (B, T, C, H, W) in [0, 1] with T = 4n + 1>: pre-VAE conditioning frames; the pipeline VAE-encodes them. Pass a single-frame video for I2V or a multi-frame clip for V2V.
  • video_latents=<latent tensor of shape (B, T_latent, C, H_latent, W_latent)>: already-encoded latents in the FAR layout (skips the VAE encode step).
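The selection rule in the list above can be written as a small decision function. This is a sketch of the described behavior, not the pipeline's code: `resolve_task_mode` is a hypothetical helper that looks only at tensor shapes, and the shapes in the usage lines are illustrative values.

```python
def resolve_task_mode(video_shape=None, video_latents_shape=None):
    """Hypothetical helper: map the conditioning inputs to 't2v', 'i2v', or 'v2v'."""
    if video_shape is not None and video_latents_shape is not None:
        raise ValueError("video and video_latents are mutually exclusive")
    if video_shape is None and video_latents_shape is None:
        return "t2v"  # no conditioning frames at all
    shape = video_shape if video_shape is not None else video_latents_shape
    num_context_frames = shape[1]  # T (or T_latent) is the second axis in both layouts
    return "i2v" if num_context_frames == 1 else "v2v"


print(resolve_task_mode())                                     # t2v
print(resolve_task_mode(video_shape=(1, 1, 3, 480, 832)))      # i2v
print(resolve_task_mode(video_shape=(1, 9, 3, 480, 832)))      # v2v
print(resolve_task_mode(video_latents_shape=(1, 3, 16, 60, 104)))  # v2v
```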

The FAR backbone is the causal Wan2.1 variant introduced by FAR (Gu et al., 2025; arXiv:2503.19325). Inference is plain Euler in mean-velocity form per chunk with no re-noising. Joint T2V / I2V / V2V is supported by a single distilled model.

This model inherits from [DiffusionPipeline]. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

__call__

diffusers.AnyFlowFARPipeline.__call__(
    prompt: Union[str, List[str]] = None,
    video: Optional[torch.Tensor] = None,
    video_latents: Optional[torch.Tensor] = None,
    negative_prompt: Union[str, List[str]] = None,
    height: int = 480,
    width: int = 832,
    num_frames: int = 81,
    num_inference_steps: int = 50,
    sigmas: Optional[List[float]] = None,
    timesteps: Optional[List[float]] = None,
    guidance_scale: float = 1.0,
    num_videos_per_prompt: Optional[int] = 1,
    generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
    latents: Optional[torch.Tensor] = None,
    prompt_embeds: Optional[torch.Tensor] = None,
    negative_prompt_embeds: Optional[torch.Tensor] = None,
    output_type: Optional[str] = 'np',
    return_dict: bool = True,
    attention_kwargs: Optional[Dict[str, Any]] = None,
    callback_on_step_end: Optional[Union[Callable[[int, int, Dict], None], PipelineCallback, MultiPipelineCallbacks]] = None,
    callback_on_step_end_tensor_inputs: List[str] = ['latents'],
    max_sequence_length: int = 512,
    use_mean_velocity: bool = True,
    use_kv_cache: bool = True,
    chunk_partition: Optional[List[int]] = None,
)

Source: https://github.com/huggingface/diffusers/blob/vr_13745/src/diffusers/pipelines/anyflow/pipeline_anyflow_far.py#L444

  • prompt (str or List[str], optional) -- The prompt or prompts to guide the video generation. If not defined, pass prompt_embeds instead.
  • video (torch.Tensor, optional) -- Pre-VAE conditioning frames of shape (B, T, C, H, W) in [0, 1] (T = 4n + 1). When provided, the pipeline VAE-encodes them and keeps the corresponding latent prefix fixed during sampling. Mutually exclusive with video_latents.
  • video_latents (torch.Tensor, optional) -- Pre-encoded VAE latents in the FAR layout (B, T_latent, C, H_latent, W_latent). Skips VAE encoding on the pipeline side. Mutually exclusive with video.
  • negative_prompt (str or List[str], optional) -- The prompt or prompts to avoid during video generation. Ignored when not using guidance.

The call function to the pipeline for generation.

Examples:

>>> import numpy as np
>>> import torch
>>> from diffusers import AnyFlowFARPipeline
>>> from diffusers.utils import export_to_video, load_image

>>> pipe = AnyFlowFARPipeline.from_pretrained(
...     "nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers", torch_dtype=torch.bfloat16
... ).to("cuda")

>>> # Single-frame I2V: wrap the conditioning image as a (1, 1, 3, H, W) tensor in [0, 1].
>>> first_frame = load_image("path/to/first_frame.png").resize((832, 480))
>>> arr = np.asarray(first_frame).astype("float32") / 255.0
>>> context = torch.from_numpy(arr).permute(2, 0, 1).unsqueeze(0).unsqueeze(1).to("cuda")

>>> video = pipe(
...     prompt="a cat walks across a sunlit lawn",
...     video=context,
...     num_inference_steps=4,
...     num_frames=81,
... ).frames[0]
>>> export_to_video(video, "anyflow_far.mp4", fps=16)

Parameters:

tokenizer ([AutoTokenizer]) : Tokenizer from google/umt5-xxl.

text_encoder ([UMT5EncoderModel]) : google/umt5-xxl text encoder.

transformer ([AnyFlowFARTransformer3DModel]) : FAR causal flow-map 3D Transformer.

vae ([AutoencoderKLWan]) : VAE that encodes/decodes videos to and from latent representations.

scheduler ([FlowMapEulerDiscreteScheduler]) : Flow-map sampler.

Returns:

AnyFlowPipelineOutput or tuple

If return_dict is True, an AnyFlowPipelineOutput is returned, otherwise a tuple whose first element is the generated video.

encode_prompt

Encodes the prompt into text encoder hidden states.

Parameters:

prompt (str or list[str], optional) : prompt to be encoded

negative_prompt (str or list[str], optional) : The prompt or prompts not to guide the video generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).

do_classifier_free_guidance (bool, optional, defaults to True) : Whether to use classifier free guidance or not.

num_videos_per_prompt (int, optional, defaults to 1) : Number of videos that should be generated per prompt.

prompt_embeds (torch.Tensor, optional) : Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.

negative_prompt_embeds (torch.Tensor, optional) : Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.

device (torch.device, optional) : torch device to place the resulting embeddings on.

dtype (torch.dtype, optional) : torch dtype of the resulting embeddings.

encode_video

Encode a pixel-space video into AnyFlow's latent layout.

Mirrors the single-helper convention of other diffusers pipelines (cf. WanImageToVideoPipeline.encode_image): wraps preprocessing, VAE encoding, and latent normalization into one call. Output layout is (B, T_latent, C, H, W), which is what the AnyFlow transformer expects for conditioning frames.

AnyFlowPipelineOutput

class diffusers.pipelines.anyflow.pipeline_output.AnyFlowPipelineOutput

Output class for AnyFlow pipelines.

Parameters:

frames (torch.Tensor, np.ndarray, or list[list[PIL.Image.Image]]) : List of video outputs. This can be a nested list of length batch_size, with each sub-list containing the denoised PIL image sequence of length num_frames, or a NumPy array or Torch tensor of shape (batch_size, num_frames, channels, height, width).
