| # Memory-efficient Inference |
|
|
| > See the main [README](../README.md) for `FlowDPMSolver` and `guider` setup. |
|
|
| By default, `pipe.to("cuda")` loads all components onto the GPU simultaneously, requiring **~30 GB VRAM**. |
|
|
| For GPUs with 24 GB or less (e.g. RTX 4090, RTX 3090), use `enable_model_cpu_offload()` with the `expandable_segments` allocator setting: |
|
|
| ```bash |
| export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True |
| ``` |
|
|
```python
import torch
from diffusers.utils import export_to_video

# MotifVideoPipeline, FlowDPMSolver, and the guider are set up as in the
# main README (see the note above).
pipe = MotifVideoPipeline.from_pretrained(
    "Motif-Technologies/Motif-Video-2B",
    revision="diffusers-integration",
    torch_dtype=torch.bfloat16,
    guider=guider,  # see T2V example above
)
pipe.scheduler = FlowDPMSolver(
    num_train_timesteps=pipe.scheduler.config.get("num_train_timesteps", 1000),
    algorithm_type="dpmsolver++",
    solver_order=2,
    prediction_type="flow_prediction",
    use_flow_sigmas=True,
    flow_shift=15.0,
)
pipe.enable_model_cpu_offload()  # replaces pipe.to("cuda")

output = pipe(
    prompt="...",
    negative_prompt="...",
    height=736, width=1280, num_frames=121, num_inference_steps=50,
    frame_rate=24, use_linear_quadratic_schedule=False,
)
export_to_video(output.frames[0], "output.mp4", fps=24)
```
|
|
This moves each component (text encoder → transformer → VAE) onto the GPU only while it is needed, keeping the rest in CPU memory. The `expandable_segments` setting lets the CUDA caching allocator reuse memory released by earlier components rather than leaving it fragmented, avoiding fragmentation-related OOM errors.
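

To check the peak usage on your own hardware, PyTorch's memory statistics can be reset before the call and read afterwards; a minimal sketch reusing the `pipe` call from above (note that `max_memory_allocated()` tracks tensor allocations, which can differ slightly from the total shown by `nvidia-smi`):


```python
import torch

torch.cuda.reset_peak_memory_stats()
output = pipe(
    prompt="...",
    negative_prompt="...",
    height=736, width=1280, num_frames=121, num_inference_steps=50,
    frame_rate=24, use_linear_quadratic_schedule=False,
)
print(f"Peak VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.1f} GB")
```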
|
|
| | Mode | Peak VRAM | Speed | Recommended GPU | |
| |------|-----------|-------|-----------------| |
| | `pipe.to("cuda")` | ~30 GB | Fastest | A100, H100, H200 | |
| `enable_model_cpu_offload()` | ~19 GB | Nearly as fast | RTX 4090, RTX 3090 |
|
|
| ## FP8 Weight Quantization (Optional) |
|
|
| For further VRAM reduction, you can quantize the transformer weights to FP8 using [torchao](https://github.com/pytorch/ao): |
|
|
| ```bash |
| pip install torchao |
| ``` |
|
|
```python
import torch
from diffusers.utils import export_to_video
from torchao.quantization import quantize_, Float8WeightOnlyConfig

# MotifVideoPipeline, FlowDPMSolver, and the guider are set up as in the
# main README (see the note above).
pipe = MotifVideoPipeline.from_pretrained(
    "Motif-Technologies/Motif-Video-2B",
    revision="diffusers-integration",
    torch_dtype=torch.bfloat16,
    guider=guider,  # see T2V example above
)
pipe.scheduler = FlowDPMSolver(
    num_train_timesteps=pipe.scheduler.config.get("num_train_timesteps", 1000),
    algorithm_type="dpmsolver++",
    solver_order=2,
    prediction_type="flow_prediction",
    use_flow_sigmas=True,
    flow_shift=15.0,
)
quantize_(pipe.transformer, Float8WeightOnlyConfig())  # FP8 weights, in place
pipe.enable_model_cpu_offload()

output = pipe(
    prompt="...",
    negative_prompt="...",
    height=736, width=1280, num_frames=121, num_inference_steps=50,
    frame_rate=24, use_linear_quadratic_schedule=False,
)
export_to_video(output.frames[0], "output.mp4", fps=24)
```
|
|
This stores the transformer weights in FP8 (8-bit) instead of BF16 (16-bit), reducing peak VRAM from ~19 GB to ~15 GB. Because the quantization is weight-only, each weight is dequantized back to BF16 on the fly, so all computation still runs in BF16 precision.
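

The weight-only scheme is easy to see in isolation. The sketch below (a standalone toy example, not part of the pipeline above) quantizes a single linear layer and shows that activations stay in BF16; it assumes a torchao/PyTorch build with float8 support:


```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_, Float8WeightOnlyConfig

# A toy BF16 model standing in for the transformer's linear layers.
model = nn.Sequential(nn.Linear(4096, 4096)).to("cuda", dtype=torch.bfloat16)
quantize_(model, Float8WeightOnlyConfig())  # weights -> FP8, in place

x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)
y = model(x)    # weights are dequantized to BF16 for the matmul
print(y.dtype)  # torch.bfloat16 (compute stays in BF16)
```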
|
|
| | Mode | Peak VRAM | Notes | |
| |------|-----------|-------| |
| | `enable_model_cpu_offload()` | ~19 GB | BF16 baseline | |
| | `+ Float8WeightOnlyConfig` | ~15 GB | FP8 weights, BF16 compute | |
|
|