	---
	license: other
	language:
	- en
	base_model:
	- Wan-AI/Wan2.1-T2V-14B-Diffusers
	pipeline_tag: text-to-video
	tags:
	- Any-Step
	- Text-to-Video
	- Image-to-Video
	- Video-to-Video
	---

	# AnyFlow

	<p align="center">
	🖥️ <a href="https://github.com/NVlabs/AnyFlow">GitHub</a>    ｜    🤗 <a href="https://huggingface.co/collections/nvidia/anyflow">Hugging Face</a>    ｜    📑 <a href="https://arxiv.org/">Paper</a>    ｜    🌐 <a href="https://nvlabs.github.io/AnyFlow">Website</a>
	<br>
	</p>

	-----

	AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation

	In this repository, we present AnyFlow, the first any-step video diffusion framework built on flow maps. AnyFlow offers these key features:

	- ⚡ Any-Step Generation: Unlike traditional distilled models tied to fixed step budgets, AnyFlow enables a single model to adapt to arbitrary inference budgets. It achieves high-quality few-step generation while providing stable improvements as more sampling steps are added.

	- 🔀 Multiple Architectures: AnyFlow supports any-step distillation for both causal and bidirectional video diffusion models.

	- 🎬 Multiple Tasks: AnyFlow supports Text-to-Video, Image-to-Video, and Video-to-Video generation within one causal video diffusion model.

	- 📈 Scalable Performance: AnyFlow is validated from 1.3B up to 14B parameters.

	This directory contains AnyFlow-FAR-Wan2.1-14B-Diffusers (a 14B causal video diffusion model) in Hugging Face Diffusers format, derived from the [Wan2.1-T2V-14B-Diffusers](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B-Diffusers) text-to-video backbone.

	## Video Demos

	<div align="center">
	<video width="80%" autoplay loop muted playsinline controls>
	<source src="https://nvlabs.github.io/AnyFlow/assets/videos/demo_video.m4v" type="video/mp4">
	Your browser does not support the video tag.
	</video>
	</div>

	## 🔥 Latest News!!

	* May 4, 2026: 👋 We've released the codebase and weights of AnyFlow.

	## Quickstart

	### Setup Environment

	1️⃣ Create Conda Environment

	```bash
	conda create -n far python=3.10
	conda activate far
	```

	2️⃣ Install PyTorch and Dependencies

	```bash
	pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
	pip install -r requirements.txt --no-build-isolation
	```

	### Model Download

	| Model | Tasks | Resolution | Download Link |
	| ----- | ----- | ---------- | ------------- |
	| `AnyFlow-FAR-Wan2.1-1.3B-Diffusers` | T2V, I2V, V2V | 480P | 🤗 [Hugging Face](https://huggingface.co/nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers) |
	| `AnyFlow-FAR-Wan2.1-14B-Diffusers` | T2V, I2V, V2V | 480P | 🤗 [Hugging Face](https://huggingface.co/nvidia/AnyFlow-FAR-Wan2.1-14B-Diffusers) |
	| `AnyFlow-Wan2.1-T2V-14B-Diffusers` | T2V | 480P | 🤗 [Hugging Face](https://huggingface.co/nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers) |
	| `AnyFlow-Wan2.1-T2V-1.3B-Diffusers` | T2V | 480P | 🤗 [Hugging Face](https://huggingface.co/nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers) |

	Download models with the 🤗 `hf download` CLI:

	```bash
	pip install "huggingface_hub[cli]"

	hf download nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers --repo-type model --local-dir experiments/pretrained_models/AnyFlow-FAR-Wan2.1-1.3B-Diffusers
	```
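The same download can also be scripted from Python. The sketch below relies on `snapshot_download` from `huggingface_hub`; the helper names `local_dir_for` and `download_model` are hypothetical, and the default root simply mirrors the `--local-dir` path used in the command above.

```python
from pathlib import Path


def local_dir_for(repo_id: str, root: str = "experiments/pretrained_models") -> Path:
    """Map 'nvidia/<name>' to '<root>/<name>', mirroring the CLI example."""
    return Path(root) / repo_id.split("/")[-1]


def download_model(repo_id: str, root: str = "experiments/pretrained_models") -> str:
    """Download every file of a model repo and return the local directory."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = local_dir_for(repo_id, root)
    return snapshot_download(repo_id=repo_id, repo_type="model", local_dir=str(target))
```

For example, `download_model("nvidia/AnyFlow-FAR-Wan2.1-14B-Diffusers")` would place the 14B checkpoint under `experiments/pretrained_models/AnyFlow-FAR-Wan2.1-14B-Diffusers`.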

	### Run Text-to-Video Generation with Diffusers

	```python
	import torch
	from diffusers.utils import export_to_video

	from far.pipelines.pipeline_far_wan_anyflow import FARWanAnyFlowPipeline

	model_id = "nvidia/AnyFlow-FAR-Wan2.1-14B-Diffusers"
	pipeline = FARWanAnyFlowPipeline.from_pretrained(model_id).to('cuda', dtype=torch.bfloat16)

	prompt = "CG game concept digital art, a majestic elephant with a vibrant tusk and sleek fur running swiftly towards a herd of its kind."

	video = pipeline(
	    prompt=prompt,
	    height=480,
	    width=832,
	    num_frames=81,
	    num_inference_steps=4,
	    generator=torch.Generator('cuda').manual_seed(0),
	).frames[0]
	export_to_video(video, "output.mp4", fps=16)
	```

	### Run Image-to-Video Generation with Diffusers

	```python
	import torch
	from diffusers.utils import export_to_video
	from PIL import Image
	from torchvision import transforms

	from far.pipelines.pipeline_far_wan_anyflow import FARWanAnyFlowPipeline

	model_id = "nvidia/AnyFlow-FAR-Wan2.1-14B-Diffusers"
	pipeline = FARWanAnyFlowPipeline.from_pretrained(model_id).to('cuda', dtype=torch.bfloat16)

	# load image
	image_path = 'assets/example_image.jpg'
	prompt = 'A towering, battle-scarred humanoid robot walking through the skeletal remains of a city ruin.'

	image = Image.open(image_path).convert('RGB')
	image = transforms.ToTensor()(transforms.Resize([480, 832])(image)).unsqueeze(0).unsqueeze(0)

	video = pipeline(
	    prompt=prompt,
	    context_sequence={'raw': image},
	    height=480,
	    width=832,
	    num_frames=81,
	    num_inference_steps=4,
	    generator=torch.Generator('cuda').manual_seed(0),
	).frames[0]
	export_to_video(video, "output.mp4", fps=16)
	```

	### Run Video-to-Video Generation with Diffusers

	```python
	import torch
	from diffusers.utils import export_to_video
	import decord
	from torchvision import transforms

	from far.pipelines.pipeline_far_wan_anyflow import FARWanAnyFlowPipeline

	decord.bridge.set_bridge('torch')

	model_id = "nvidia/AnyFlow-FAR-Wan2.1-14B-Diffusers"
	pipeline = FARWanAnyFlowPipeline.from_pretrained(model_id).to('cuda', dtype=torch.bfloat16)

	# load video
	video_path = 'assets/example_video.mp4'
	prompt = "A focused trail runner's powerful strides through a dense, sun-dappled forest."

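	# NOTE: `select_frame_indices` and `num_cond_frames` are not defined in this snippet;
	# define or import them before running.
	video_reader = decord.VideoReader(video_path)
	frame_idxs = select_frame_indices(len(video_reader), video_reader.get_avg_fps(), target_fps=16)[:num_cond_frames]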
	frames = video_reader.get_batch(frame_idxs)
	frames = (frames / 255.0).float().permute(0, 3, 1, 2).contiguous()
	frames = transforms.Resize([480, 832])(frames).unsqueeze(0)

	video = pipeline(
	    prompt=prompt,
	    context_sequence={'raw': frames},
	    height=480,
	    width=832,
	    num_frames=81,
	    num_inference_steps=4,
	    generator=torch.Generator('cuda').manual_seed(0),
	).frames[0]
	export_to_video(video, "output.mp4", fps=16)
	```
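The video-to-video snippet calls `select_frame_indices` without defining it. Below is a hypothetical stand-in, assuming its job is to resample the source video to `target_fps` by picking evenly spaced source-frame indices. The actual helper in the AnyFlow codebase may behave differently, and the choice to never duplicate frames when the source is slower than the target is an assumption of this sketch.

```python
def select_frame_indices(total_frames: int, source_fps: float, target_fps: float = 16) -> list[int]:
    """Pick source-frame indices that approximate playback at `target_fps`."""
    if total_frames <= 0 or source_fps <= 0 or target_fps <= 0:
        return []
    # Source frames to advance per output frame; clamped to 1 so frames are
    # never duplicated when the source is slower than the target.
    step = max(source_fps / target_fps, 1.0)
    indices = []
    k = 0
    while True:
        idx = int(k * step)  # floor to the nearest earlier source frame
        if idx >= total_frames:
            break
        indices.append(idx)
        k += 1
    return indices
```

With a helper like this in scope, `num_cond_frames` still has to be set to the number of conditioning frames to keep before the slice is applied.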

	## License

	This model is released under the NVIDIA One-Way Noncommercial License ([NSCLv1](LICENSE.md)).

	Under the NVIDIA One-Way Noncommercial License (NSCLv1), NVIDIA confirms:

	* Models are not for commercial use.
	* NVIDIA does not claim ownership to any outputs generated using the Models or Derivative Models.

	## Citation

	If you find our work helpful, please cite us.

	```bibtex
	@article{gu2026anyflow,
	title={AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation},
	author={Gu, Yuchao and Fang, Guian and Jiang, Yuxin and Mao, Weijia and Han, Song and Cai, Han and Shou, Mike Zheng},
	journal={arXiv preprint arXiv:2605.13724},
	year={2026}
	}

	@article{gu2025long,
	title={Long-Context Autoregressive Video Modeling with Next-Frame Prediction},
	author={Gu, Yuchao and Mao, Weijia and Shou, Mike Zheng},
	journal={arXiv preprint arXiv:2503.19325},
	year={2025}
	}
	```

	## Acknowledgements

	This codebase is built on [Diffusers](https://github.com/huggingface/diffusers). We also refer to implementations from [FAR](https://github.com/showlab/FAR), [Self-Forcing](https://github.com/guandeh17/Self-Forcing), and [TiM](https://github.com/WZDTHU/TiM). We thank the authors for open-sourcing their work.