# AnyFlow
[AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation](https://huggingface.co/papers/2605.13724) by Yuchao Gu, Guian Fang and collaborators at [NUS ShowLab](https://sites.google.com/view/showlab) in collaboration with NVIDIA.
*Few-step video generation has been significantly advanced by consistency models. However, their performance often degrades in any-step video diffusion models due to the fixed-point formulation. To address this limitation, we present AnyFlow, the first any-step video diffusion distillation framework built on flow maps. Instead of learning only the mapping z_t → z_0, AnyFlow learns transitions z_t → z_r over arbitrary time intervals, enabling a single model to adapt to different inference budgets. We design an improved forward flow map training recipe that fine-tunes pretrained video diffusion models into flow map models, and introduce Flow Map Backward Simulation to enable on-policy distillation for flow map models. Extensive experiments across both bidirectional and causal architectures, at scales ranging from 1.3B to 14B, on text-to-video and image-to-video tasks demonstrate that AnyFlow outperforms consistency-based baselines while preserving high fidelity and flexible sampling under varying step budgets.*
The original training code is at [`NVlabs/AnyFlow`](https://github.com/NVlabs/AnyFlow). The project page is at [nvlabs.github.io/AnyFlow](https://nvlabs.github.io/AnyFlow).
The following AnyFlow checkpoints are supported:
| Checkpoint | Backbone | Description |
|------------|----------|-------------|
| [`nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers`](https://huggingface.co/nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers) | Wan2.1 1.3B | Bidirectional T2V, lightweight |
| [`nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers`](https://huggingface.co/nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers) | Wan2.1 14B | Bidirectional T2V, full quality |
| [`nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers`](https://huggingface.co/nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers) | FAR + Wan2.1 1.3B | Causal T2V / I2V / V2V |
| [`nvidia/AnyFlow-FAR-Wan2.1-14B-Diffusers`](https://huggingface.co/nvidia/AnyFlow-FAR-Wan2.1-14B-Diffusers) | FAR + Wan2.1 14B | Causal T2V / I2V / V2V |
All four are grouped under the [`nvidia/anyflow`](https://huggingface.co/collections/nvidia/anyflow) Hugging Face collection.
> [!TIP]
> Choose `AnyFlowPipeline` for traditional bidirectional text-to-video generation. Choose `AnyFlowFARPipeline` for streaming I2V, video continuation (V2V), or any setup that benefits from frame-by-frame autoregressive sampling.
> [!TIP]
> AnyFlow supports any-step sampling: a single distilled checkpoint can be evaluated at 1, 2, 4, 8, 16... NFE without retraining. Quality scales monotonically with steps in our benchmarks.
### Optimizing Memory and Inference Speed
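```py
import torch
from diffusers import AnyFlowPipeline
from diffusers.hooks import apply_group_offloading

# Lower peak VRAM: group offloading onloads transformer layers to the GPU only while they run.
pipe = AnyFlowPipeline.from_pretrained(
    "nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers", torch_dtype=torch.bfloat16
)
apply_group_offloading(pipe.transformer, onload_device="cuda", offload_type="leaf_level")
# Decode latents in slices and tiles so the VAE does not process the whole video at once.
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
```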
```py
import torch
from diffusers import AnyFlowPipeline

pipe = AnyFlowPipeline.from_pretrained(
    "nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")
# Faster inference: compile the transformer. The first call is slow while kernels are autotuned.
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune-no-cudagraphs")
```
### Generation with AnyFlow (Bidirectional T2V)
```py
import torch
from diffusers import AnyFlowPipeline
from diffusers.utils import export_to_video

pipe = AnyFlowPipeline.from_pretrained(
    "nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")
prompt = "A red panda eating bamboo in a forest, cinematic lighting"
video = pipe(prompt, num_inference_steps=4, num_frames=33).frames[0]
export_to_video(video, "out.mp4", fps=16)
```
### Generation with AnyFlow (FAR Causal)
The causal pipeline selects between T2V / I2V / V2V via the ``video`` (or ``video_latents``) argument:
omit both for plain text-to-video, or pass ``video=<tensor>`` of shape ``(B, T, C, H, W)`` in ``[0, 1]``
with ``T = 4n + 1`` to condition on existing frames. Use a single conditioning frame for I2V and a longer
clip for V2V continuation. If you already have pre-encoded latents in the model layout, pass them via
``video_latents=<tensor>`` to skip VAE encoding. ``video`` and ``video_latents`` are mutually exclusive.
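The selection rules above can be summarized in a few lines of plain Python. This is only a sketch of the documented behavior: `conditioning_mode` is a hypothetical helper that works on shape tuples, and the pipeline's own validation may differ in detail.

```python
def conditioning_mode(video_shape=None, video_latents_shape=None):
    """Return 't2v', 'i2v', or 'v2v' for a planned call (hypothetical helper)."""
    if video_shape is not None and video_latents_shape is not None:
        raise ValueError("`video` and `video_latents` are mutually exclusive")
    if video_shape is None and video_latents_shape is None:
        return "t2v"  # no conditioning frames at all
    if video_shape is not None:
        if len(video_shape) != 5:
            raise ValueError("`video` must have shape (B, T, C, H, W)")
        num_cond = video_shape[1]
        if (num_cond - 1) % 4 != 0:
            raise ValueError("`video` needs T = 4n + 1 raw frames")
    else:
        if len(video_latents_shape) != 5:
            raise ValueError("`video_latents` must have 5 dimensions")
        num_cond = video_latents_shape[1]
    # A single conditioning frame is I2V; a longer clip is V2V continuation.
    return "i2v" if num_cond == 1 else "v2v"


print(conditioning_mode())                                 # text only
print(conditioning_mode(video_shape=(1, 1, 3, 480, 832)))  # one frame
print(conditioning_mode(video_shape=(1, 9, 3, 480, 832)))  # nine-frame clip
```

Running a check like this before loading the model can surface a wrong frame count early.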
> [!IMPORTANT]
> `AnyFlowFARPipeline.default_chunk_partition = [1, 3, 3, 3, 3, 3, 3, 2]` (sum 21) is matched to the
> released checkpoints' canonical 81 raw frames (21 latent frames at the VAE temporal stride of 4). When
> you change `num_frames`, you must also pass a matching `chunk_partition` summing to
> `(num_frames - 1) // 4 + 1`, otherwise the pipeline raises an `AssertionError`.
```py
import torch
from diffusers import AnyFlowFARPipeline
from diffusers.utils import export_to_video

pipe = AnyFlowFARPipeline.from_pretrained(
    "nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")
# Text-to-video: neither `video` nor `video_latents` is passed.
video = pipe(
    prompt="A cat surfing a wave, sunset",
    num_inference_steps=4,
    num_frames=81,
).frames[0]
export_to_video(video, "out.mp4", fps=16)
```
```py
import numpy as np
import torch
from diffusers import AnyFlowFARPipeline
from diffusers.utils import export_to_video, load_image

pipe = AnyFlowFARPipeline.from_pretrained(
    "nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")
# Wrap the conditioning image as a one-frame video tensor: (1, 1, 3, H, W) in [0, 1].
first_frame = load_image("path/to/first_frame.png").resize((832, 480))
arr = np.asarray(first_frame).astype("float32") / 255.0  # (480, 832, 3)
context_tensor = torch.from_numpy(arr).permute(2, 0, 1).unsqueeze(0).unsqueeze(1).to("cuda")
video = pipe(
    prompt="a cat walks across a sunlit lawn",
    video=context_tensor,
    num_inference_steps=4,
    num_frames=81,
).frames[0]
export_to_video(video, "out.mp4", fps=16)
```
```py
import numpy as np
import torch
from diffusers import AnyFlowFARPipeline
from diffusers.utils import export_to_video, load_video

pipe = AnyFlowFARPipeline.from_pretrained(
    "nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")
# Context clip: 9 raw frames map to 3 latent frames (9 = 4·2 + 1, 3 = 2 + 1).
context_frames = load_video("path/to/context.mp4")[:9]
arr = np.stack([np.asarray(f.resize((832, 480))) for f in context_frames]).astype("float32") / 255.0
# np.stack gives (T, H, W, C) = (9, 480, 832, 3); permute to (T, C, H, W), then add a batch dimension.
context_tensor = torch.from_numpy(arr).permute(0, 3, 1, 2).unsqueeze(0).to("cuda")  # (1, 9, 3, 480, 832)
video = pipe(
    prompt="continue the story",
    video=context_tensor,
    num_inference_steps=4,
    num_frames=81,
    # Override chunk_partition so the first chunk covers exactly the 3 latent context frames.
    chunk_partition=[3, 3, 3, 3, 3, 3, 3],
).frames[0]
export_to_video(video, "out.mp4", fps=16)
```
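The `chunk_partition` values used on this page all follow the rule that the entries sum to `(num_frames - 1) // 4 + 1`. The sketch below shows one way to build such a list. Both function names are hypothetical, and the "first chunk, then chunks of three, then the remainder" layout is simply an assumption that reproduces the two partitions shown above; it says nothing about how well the checkpoints behave with other partitions.

```python
def num_latent_frames(num_frames, temporal_stride=4):
    """Latent frame count for a raw frame count, per the formula in this page."""
    return (num_frames - 1) // temporal_stride + 1


def make_chunk_partition(num_frames, first_chunk=1, chunk_size=3):
    """Build a partition whose entries sum to the latent frame count (hypothetical helper)."""
    total = num_latent_frames(num_frames)
    if not 1 <= first_chunk <= total:
        raise ValueError("first_chunk must lie between 1 and the latent frame count")
    partition = [first_chunk]
    remaining = total - first_chunk
    while remaining > 0:
        size = min(chunk_size, remaining)  # last chunk absorbs the remainder
        partition.append(size)
        remaining -= size
    return partition


print(make_chunk_partition(81))                 # first chunk of 1 latent frame
print(make_chunk_partition(81, first_chunk=3))  # first chunk covers 3 context latents
```

With `num_frames=81` the first call reproduces `default_chunk_partition` and the second reproduces the V2V override used above.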
## Notes
- Classifier-free guidance is fused into the released checkpoints, so inference does not run a second guided forward pass. Keep the default `guidance_scale=1.0` unless your own checkpoint requires otherwise.
- `FlowMapEulerDiscreteScheduler` is general-purpose. You can attach it to any flow-map-distilled checkpoint via `from_pretrained(..., scheduler=FlowMapEulerDiscreteScheduler.from_config(...))`.
- `AnyFlowPipeline` uses [`AnyFlowTransformer3DModel`](../models/anyflow_transformer3d) (bidirectional). `AnyFlowFARPipeline` uses [`AnyFlowFARTransformer3DModel`](../models/anyflow_far_transformer3d), which adds a compressed-frame patch embedding and the FAR causal block-mask.
- LoRA loading is supported via `WanLoraLoaderMixin`, the same mixin used by the upstream Wan pipelines.
- For training recipes (forward flow-map training and on-policy distillation), refer to the original AnyFlow training framework at [`NVlabs/AnyFlow`](https://github.com/NVlabs/AnyFlow); training is out of scope for diffusers.
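To see why `guidance_scale=1.0` needs no second forward pass, consider the standard classifier-free guidance combination. The sketch below assumes that standard formula; this page does not show how the pipeline handles guidance internally, so treat `cfg_combine` as a hypothetical illustration rather than the pipeline's code.

```python
def cfg_combine(v_uncond, v_cond, guidance_scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + guidance_scale * (c - u) for u, c in zip(v_uncond, v_cond)]


v_uncond = [0.25, -0.5, 1.0]
v_cond = [1.5, 0.5, -2.0]

# At scale 1.0 the unconditional term cancels and only the conditional
# prediction remains, so the unconditional pass contributes nothing.
print(cfg_combine(v_uncond, v_cond, 1.0))
# Larger scales extrapolate away from the unconditional prediction.
print(cfg_combine(v_uncond, v_cond, 5.0))
```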
## AnyFlowPipeline[[diffusers.AnyFlowPipeline]]
#### diffusers.AnyFlowPipeline[[diffusers.AnyFlowPipeline]]
[Source](https://github.com/huggingface/diffusers/blob/vr_13745/src/diffusers/pipelines/anyflow/pipeline_anyflow.py#L80)
Bidirectional text-to-video generation pipeline for AnyFlow flow-map-distilled checkpoints, introduced in
[AnyFlow](https://huggingface.co/papers/2605.13724) by Yuchao Gu, Guian Fang et al.
AnyFlow learns arbitrary-interval transitions \\(z_t \to z_r\\) rather than the fixed \\(z_t \to z_0\\) mapping
of consistency models, so a single distilled checkpoint can be evaluated at 1, 2, 4, 8, 16... NFE without
retraining. This pipeline operates over the full video tensor in one bidirectional pass; for frame-level
autoregressive (causal) generation use `AnyFlowFARPipeline`.
Sampling is plain Euler in mean-velocity form (`z_r = z_t - (t - r) * u`) with no re-noising. The released NVIDIA
checkpoints fold classifier-free guidance into the model weights, so the default `guidance_scale=1.0` is the
recommended setting.
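The update rule `z_r = z_t - (t - r) * u` can be tried on a toy problem. The sketch below is not the pipeline's sampler: it replaces the transformer with a hypothetical oracle, `mean_velocity`, for a straight-line path `z_t = (1 - t) * x0 + t * noise`, where the mean velocity over any interval is exactly `noise - x0`. The uniform time grid is also an assumption.

```python
def mean_velocity(z_t, t, r, x0):
    """Toy oracle: on a straight-line path the mean velocity over [r, t] is (z_t - x0) / t."""
    return [(z - x) / t for z, x in zip(z_t, x0)]


def flow_map_sample(noise, x0, num_steps):
    """Euler sampling in mean-velocity form from t=1 (noise) to t=0 (data), no re-noising."""
    times = [1.0 - i / num_steps for i in range(num_steps + 1)]
    z = list(noise)
    for t, r in zip(times[:-1], times[1:]):
        u = mean_velocity(z, t, r, x0)
        z = [zi - (t - r) * ui for zi, ui in zip(z, u)]  # z_r = z_t - (t - r) * u
    return z


x0 = [0.5, -1.0, 2.0]
noise = [1.25, 0.75, -0.5]
for steps in (1, 2, 4, 8):
    print(steps, flow_map_sample(noise, x0, steps))
```

Because the oracle is exact, every step count lands on `x0` up to floating-point error. A distilled model only approximates the mean velocity, which is where the choice of step budget starts to matter.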
This model inherits from [*DiffusionPipeline*]. Check the superclass documentation for the generic methods
implemented for all pipelines (downloading, saving, running on a particular device, etc.).
#### __call__[[diffusers.AnyFlowPipeline.__call__]]
[Source](https://github.com/huggingface/diffusers/blob/vr_13745/src/diffusers/pipelines/anyflow/pipeline_anyflow.py#L379)
| Parameter | Type | Default |
|-----------|------|---------|
| `prompt` | `Union[str, List[str]]` | `None` |
| `video` | `Optional[torch.Tensor]` | `None` |
| `video_latents` | `Optional[torch.Tensor]` | `None` |
| `negative_prompt` | `Union[str, List[str]]` | `None` |
| `height` | `int` | `480` |
| `width` | `int` | `832` |
| `num_frames` | `int` | `81` |
| `num_inference_steps` | `int` | `50` |
| `sigmas` | `Optional[List[float]]` | `None` |
| `timesteps` | `Optional[List[float]]` | `None` |
| `guidance_scale` | `float` | `1.0` |
| `num_videos_per_prompt` | `Optional[int]` | `1` |
| `generator` | `Union[torch.Generator, List[torch.Generator], None]` | `None` |
| `latents` | `Optional[torch.Tensor]` | `None` |
| `prompt_embeds` | `Optional[torch.Tensor]` | `None` |
| `negative_prompt_embeds` | `Optional[torch.Tensor]` | `None` |
| `output_type` | `Optional[str]` | `'np'` |
| `return_dict` | `bool` | `True` |
| `attention_kwargs` | `Optional[Dict[str, Any]]` | `None` |
| `callback_on_step_end` | `Union[Callable[[int, int, Dict], None], PipelineCallback, MultiPipelineCallbacks, None]` | `None` |
| `callback_on_step_end_tensor_inputs` | `List[str]` | `['latents']` |
| `max_sequence_length` | `int` | `512` |
| `use_mean_velocity` | `bool` | `True` |

- **prompt** (`str` or `List[str]`, *optional*) --
The prompt or prompts to guide the video generation. If not defined, pass `prompt_embeds` instead.
- **video** (`torch.Tensor`, *optional*) --
Pre-VAE conditioning frames of shape `(B, T, C, H, W)` in `[0, 1]`. When provided, the pipeline
VAE-encodes them and keeps the corresponding latent prefix fixed during sampling. Mutually exclusive
with `video_latents`.
- **video_latents** (`torch.Tensor`, *optional*) --
Pre-encoded VAE latents in the AnyFlow layout `(B, T_latent, C, H_latent, W_latent)`. Skips VAE
encoding on the pipeline side. Mutually exclusive with `video`.
- **negative_prompt** (`str` or `List[str]`, *optional*) --
The prompt or prompts to avoid during video generation. Ignored when not using guidance.
The call function to the pipeline for generation.
Examples:
```python
>>> import torch
>>> from diffusers import AnyFlowPipeline
>>> from diffusers.utils import export_to_video
>>> pipe = AnyFlowPipeline.from_pretrained(
...     "nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers", torch_dtype=torch.bfloat16
... ).to("cuda")
>>> prompt = "A red panda eating bamboo in a forest, cinematic lighting"
>>> video = pipe(prompt, num_inference_steps=4, num_frames=33).frames[0]
>>> export_to_video(video, "anyflow_t2v.mp4", fps=16)
```
**Parameters:**
tokenizer ([*AutoTokenizer*]) : Tokenizer from [google/umt5-xxl](https://huggingface.co/google/umt5-xxl).
text_encoder ([*UMT5EncoderModel*]) : [google/umt5-xxl](https://huggingface.co/google/umt5-xxl) text encoder.
transformer ([*AnyFlowTransformer3DModel*]) : Bidirectional flow-map 3D Transformer.
vae ([*AutoencoderKLWan*]) : VAE that encodes/decodes videos to and from latent representations.
scheduler ([*FlowMapEulerDiscreteScheduler*]) : Flow-map sampler. The pipeline drives `scheduler.step(..., timestep, sample, r_timestep)` per inference step.
**Returns:**
`AnyFlowPipelineOutput` or `tuple`
If `return_dict` is `True`, `AnyFlowPipelineOutput` is returned, otherwise a `tuple` whose first
element is the generated video.
#### encode_prompt[[diffusers.AnyFlowPipeline.encode_prompt]]
[Source](https://github.com/huggingface/diffusers/blob/vr_13745/src/diffusers/pipelines/anyflow/pipeline_anyflow.py#L179)
Encodes the prompt into text encoder hidden states.
**Parameters:**
prompt (`str` or `list[str]`, *optional*) : prompt to be encoded
negative_prompt (`str` or `list[str]`, *optional*) : The prompt or prompts not to guide the video generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
do_classifier_free_guidance (`bool`, *optional*, defaults to `True`) : Whether to use classifier-free guidance or not.
num_videos_per_prompt (`int`, *optional*, defaults to 1) : Number of videos that should be generated per prompt.
prompt_embeds (`torch.Tensor`, *optional*) : Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not provided, text embeddings will be generated from the `prompt` input argument.
negative_prompt_embeds (`torch.Tensor`, *optional*) : Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not provided, `negative_prompt_embeds` will be generated from the `negative_prompt` input argument.
device (`torch.device`, *optional*) : torch device to place the resulting embeddings on
dtype (`torch.dtype`, *optional*) : torch dtype
#### encode_video[[diffusers.AnyFlowPipeline.encode_video]]
[Source](https://github.com/huggingface/diffusers/blob/vr_13745/src/diffusers/pipelines/anyflow/pipeline_anyflow.py#L359)
Encode a pixel-space video into AnyFlow's latent layout.
Mirrors the single-helper convention of other diffusers pipelines (cf.
`WanImageToVideoPipeline.encode_image`): wraps preprocessing, VAE encoding, and latent normalization into one
call. Output layout is `(B, T_latent, C, H, W)`, which is what the AnyFlow transformer expects for
conditioning frames.
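As a rough guide to the `(B, T_latent, C, H, W)` layout, the expected latent shape can be estimated from the input size. The temporal stride of 4 is stated on this page; the spatial stride of 8 and the 16 latent channels are assumptions about the Wan2.1 VAE that should be checked against the loaded `vae.config`. `anyflow_latent_shape` is a hypothetical helper, not part of diffusers.

```python
def anyflow_latent_shape(batch, num_frames, height, width,
                         temporal_stride=4, spatial_stride=8, latent_channels=16):
    """Estimate the (B, T_latent, C, H_latent, W_latent) shape of encoded video.

    spatial_stride and latent_channels are assumed values, not read from a checkpoint.
    """
    t_latent = (num_frames - 1) // temporal_stride + 1
    return (batch, t_latent, latent_channels,
            height // spatial_stride, width // spatial_stride)


# Default call size from this page: 81 frames at 480 x 832.
print(anyflow_latent_shape(1, 81, 480, 832))
# A 9-frame context clip, as in the V2V example.
print(anyflow_latent_shape(1, 9, 480, 832))
```

Note that the time axis comes before the channel axis in this layout, so tensors prepared for `video_latents` need the same ordering.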
## AnyFlowFARPipeline[[diffusers.AnyFlowFARPipeline]]
#### diffusers.AnyFlowFARPipeline[[diffusers.AnyFlowFARPipeline]]
[Source](https://github.com/huggingface/diffusers/blob/vr_13745/src/diffusers/pipelines/anyflow/pipeline_anyflow_far.py#L92)
Causal (FAR-based) text-to-video / image-to-video / video-to-video pipeline for AnyFlow checkpoints, introduced in
[AnyFlow](https://huggingface.co/papers/2605.13724) by Yuchao Gu, Guian Fang et al.
The pipeline drives a frame-level autoregressive sampling loop over chunks: each chunk is denoised with flow-map
steps while attending only to past chunks via block-sparse causal attention, and intermediate KV cache is reused
across chunks.
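The chunk-wise causal pattern can be pictured as a block mask over latent frames. The sketch below is an illustration under simplifying assumptions, not the FAR implementation: it works at frame granularity (ignoring the spatial tokens inside each frame) and assumes frames within a chunk attend to each other fully. `block_causal_mask` is a hypothetical name.

```python
def block_causal_mask(chunk_partition):
    """mask[q][k] is True when query frame q may attend to key frame k."""
    # Label every latent frame with the index of the chunk it belongs to.
    chunk_ids = [idx for idx, size in enumerate(chunk_partition) for _ in range(size)]
    n = len(chunk_ids)
    # A frame sees its own chunk and all earlier chunks, never later ones.
    return [[chunk_ids[k] <= chunk_ids[q] for k in range(n)] for q in range(n)]


mask = block_causal_mask([1, 2, 2])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because earlier chunks never see later ones, their keys and values do not change once computed, which is what makes reusing a KV cache across chunks possible.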
The task mode (T2V / I2V / V2V) is selected by which conditioning argument is passed to `__call__`:
- both `video=None` and `video_latents=None`: pure text-to-video.
- `video=<tensor of shape (B, T, C, H, W) in [0, 1] with T = 4n + 1>`: pre-VAE conditioning frames; the pipeline
VAE-encodes them. Pass a single-frame video for I2V or a multi-frame clip for V2V.
- `video_latents=<latent tensor of shape (B, T_latent, C, H_latent, W_latent)>`: already-encoded latents in the
FAR layout (skips the VAE encode step).
The FAR backbone is the causal Wan2.1 variant introduced by FAR (Gu et al., 2025; arXiv:2503.19325). Inference is
plain Euler in mean-velocity form per chunk with no re-noising. Joint T2V / I2V / V2V is supported by a single
distilled model.
This model inherits from [*DiffusionPipeline*]. Check the superclass documentation for the generic methods
implemented for all pipelines (downloading, saving, running on a particular device, etc.).
#### __call__[[diffusers.AnyFlowFARPipeline.__call__]]
[Source](https://github.com/huggingface/diffusers/blob/vr_13745/src/diffusers/pipelines/anyflow/pipeline_anyflow_far.py#L444)
| Parameter | Type | Default |
|-----------|------|---------|
| `prompt` | `Union[str, List[str]]` | `None` |
| `video` | `Optional[torch.Tensor]` | `None` |
| `video_latents` | `Optional[torch.Tensor]` | `None` |
| `negative_prompt` | `Union[str, List[str]]` | `None` |
| `height` | `int` | `480` |
| `width` | `int` | `832` |
| `num_frames` | `int` | `81` |
| `num_inference_steps` | `int` | `50` |
| `sigmas` | `Optional[List[float]]` | `None` |
| `timesteps` | `Optional[List[float]]` | `None` |
| `guidance_scale` | `float` | `1.0` |
| `num_videos_per_prompt` | `Optional[int]` | `1` |
| `generator` | `Union[torch.Generator, List[torch.Generator], None]` | `None` |
| `latents` | `Optional[torch.Tensor]` | `None` |
| `prompt_embeds` | `Optional[torch.Tensor]` | `None` |
| `negative_prompt_embeds` | `Optional[torch.Tensor]` | `None` |
| `output_type` | `Optional[str]` | `'np'` |
| `return_dict` | `bool` | `True` |
| `attention_kwargs` | `Optional[Dict[str, Any]]` | `None` |
| `callback_on_step_end` | `Union[Callable[[int, int, Dict], None], PipelineCallback, MultiPipelineCallbacks, None]` | `None` |
| `callback_on_step_end_tensor_inputs` | `List[str]` | `['latents']` |
| `max_sequence_length` | `int` | `512` |
| `use_mean_velocity` | `bool` | `True` |
| `use_kv_cache` | `bool` | `True` |
| `chunk_partition` | `Optional[List[int]]` | `None` |

- **prompt** (`str` or `List[str]`, *optional*) --
The prompt or prompts to guide the video generation. If not defined, pass `prompt_embeds` instead.
- **video** (`torch.Tensor`, *optional*) --
Pre-VAE conditioning frames of shape `(B, T, C, H, W)` in `[0, 1]` (`T = 4n + 1`). When provided, the
pipeline VAE-encodes them and keeps the corresponding latent prefix fixed during sampling. Mutually
exclusive with `video_latents`.
- **video_latents** (`torch.Tensor`, *optional*) --
Pre-encoded VAE latents in the FAR layout `(B, T_latent, C, H_latent, W_latent)`. Skips VAE encoding on
the pipeline side. Mutually exclusive with `video`.
- **negative_prompt** (`str` or `List[str]`, *optional*) --
The prompt or prompts to avoid during video generation. Ignored when not using guidance.
The call function to the pipeline for generation.
Examples:
```python
>>> import numpy as np
>>> import torch
>>> from diffusers import AnyFlowFARPipeline
>>> from diffusers.utils import export_to_video, load_image
>>> pipe = AnyFlowFARPipeline.from_pretrained(
...     "nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers", torch_dtype=torch.bfloat16
... ).to("cuda")
>>> # Single-frame I2V: wrap the conditioning image as a (1, 1, 3, H, W) tensor in [0, 1].
>>> first_frame = load_image("path/to/first_frame.png").resize((832, 480))
>>> arr = np.asarray(first_frame).astype("float32") / 255.0
>>> context = torch.from_numpy(arr).permute(2, 0, 1).unsqueeze(0).unsqueeze(1).to("cuda")
>>> video = pipe(
...     prompt="a cat walks across a sunlit lawn",
...     video=context,
...     num_inference_steps=4,
...     num_frames=81,
... ).frames[0]
>>> export_to_video(video, "anyflow_far.mp4", fps=16)
```
**Parameters:**
tokenizer ([*AutoTokenizer*]) : Tokenizer from [google/umt5-xxl](https://huggingface.co/google/umt5-xxl).
text_encoder ([*UMT5EncoderModel*]) : [google/umt5-xxl](https://huggingface.co/google/umt5-xxl) text encoder.
transformer ([*AnyFlowFARTransformer3DModel*]) : FAR causal flow-map 3D Transformer.
vae ([*AutoencoderKLWan*]) : VAE that encodes/decodes videos to and from latent representations.
scheduler ([*FlowMapEulerDiscreteScheduler*]) : Flow-map sampler.
**Returns:**
`AnyFlowPipelineOutput` or `tuple`
If `return_dict` is `True`, an `AnyFlowPipelineOutput` is returned, otherwise a `tuple` whose first
element is the generated video.
#### encode_prompt[[diffusers.AnyFlowFARPipeline.encode_prompt]]
[Source](https://github.com/huggingface/diffusers/blob/vr_13745/src/diffusers/pipelines/anyflow/pipeline_anyflow_far.py#L202)
Encodes the prompt into text encoder hidden states.
**Parameters:**
prompt (`str` or `list[str]`, *optional*) : prompt to be encoded
negative_prompt (`str` or `list[str]`, *optional*) : The prompt or prompts not to guide the video generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
do_classifier_free_guidance (`bool`, *optional*, defaults to `True`) : Whether to use classifier-free guidance or not.
num_videos_per_prompt (`int`, *optional*, defaults to 1) : Number of videos that should be generated per prompt.
prompt_embeds (`torch.Tensor`, *optional*) : Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not provided, text embeddings will be generated from the `prompt` input argument.
negative_prompt_embeds (`torch.Tensor`, *optional*) : Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not provided, `negative_prompt_embeds` will be generated from the `negative_prompt` input argument.
device (`torch.device`, *optional*) : torch device to place the resulting embeddings on
dtype (`torch.dtype`, *optional*) : torch dtype
#### encode_video[[diffusers.AnyFlowFARPipeline.encode_video]]
[Source](https://github.com/huggingface/diffusers/blob/vr_13745/src/diffusers/pipelines/anyflow/pipeline_anyflow_far.py#L385)
Encode a pixel-space video into AnyFlow's latent layout.
Mirrors the single-helper convention of other diffusers pipelines (cf.
`WanImageToVideoPipeline.encode_image`): wraps preprocessing, VAE encoding, and latent normalization into one
call. Output layout is `(B, T_latent, C, H, W)`, which is what the AnyFlow transformer expects for
conditioning frames.
## AnyFlowPipelineOutput[[diffusers.pipelines.anyflow.pipeline_output.AnyFlowPipelineOutput]]
#### diffusers.pipelines.anyflow.pipeline_output.AnyFlowPipelineOutput[[diffusers.pipelines.anyflow.pipeline_output.AnyFlowPipelineOutput]]
[Source](https://github.com/huggingface/diffusers/blob/vr_13745/src/diffusers/pipelines/anyflow/pipeline_output.py#L23)
Output class for AnyFlow pipelines.
**Parameters:**
frames (`torch.Tensor`, `np.ndarray`, or `list[list[PIL.Image.Image]]`) : List of video outputs. It can be a nested list of length `batch_size`, with each sub-list containing denoised PIL image sequences of length `num_frames`. It can also be a NumPy array or Torch tensor of shape `(batch_size, num_frames, channels, height, width)`.
