| # Memory-efficient Inference |
|
|
| > See the main [README](../README.md) for `FlowDPMSolver` and `guider` setup. |
|
|
| By default, `pipe.to("cuda")` loads all components onto the GPU simultaneously, requiring **~30 GB VRAM**. |
|
|
| For GPUs with 24 GB or less (e.g. RTX 4090, RTX 3090), use `enable_model_cpu_offload()` with the `expandable_segments` allocator setting: |
|
|
| ```bash |
| export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True |
| ``` |
|
|
```python
import torch
from diffusers.utils import export_to_video

# MotifVideoPipeline, FlowDPMSolver, and the guider are set up as in the
# main README (see the note above).
pipe = MotifVideoPipeline.from_pretrained(
    "Motif-Technologies/Motif-Video-2B",
    revision="diffusers-integration",
    torch_dtype=torch.bfloat16,
    guider=guider,  # see T2V example above
)
pipe.scheduler = FlowDPMSolver(
    num_train_timesteps=pipe.scheduler.config.get("num_train_timesteps", 1000),
    algorithm_type="dpmsolver++",
    solver_order=2,
    prediction_type="flow_prediction",
    use_flow_sigmas=True,
    flow_shift=15.0,
)
pipe.enable_model_cpu_offload()  # replaces pipe.to("cuda")

output = pipe(
    prompt="...",
    negative_prompt="...",
    height=736, width=1280, num_frames=121, num_inference_steps=50,
    frame_rate=24, use_linear_quadratic_schedule=False,
)
export_to_video(output.frames[0], "output.mp4", fps=24)
```
|
|
This moves each component (text encoder → transformer → VAE) onto the GPU only while it is needed, keeping the rest in CPU memory. The `expandable_segments` setting lets the CUDA caching allocator reuse memory released by earlier components rather than leaving it fragmented, avoiding fragmentation-related OOM errors.
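

To check the peak usage on your own hardware, PyTorch's memory statistics can be reset before the call and read afterwards; a minimal sketch reusing the `pipe` call from above (note that `max_memory_allocated()` tracks tensor allocations, which can differ slightly from the total shown by `nvidia-smi`):


```python
import torch

torch.cuda.reset_peak_memory_stats()
output = pipe(
    prompt="...",
    negative_prompt="...",
    height=736, width=1280, num_frames=121, num_inference_steps=50,
    frame_rate=24, use_linear_quadratic_schedule=False,
)
print(f"Peak VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.1f} GB")
```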
|
|
| | Mode | Peak VRAM | Speed | Recommended GPU | |
| |------|-----------|-------|-----------------| |
| | `pipe.to("cuda")` | ~30 GB | Fastest | A100, H100, H200 | |
| `enable_model_cpu_offload()` | ~19 GB | Nearly as fast | RTX 4090, RTX 3090 |
|
|
| ## FP8 Weight Quantization (Optional) |
|
|
| For further VRAM reduction, you can quantize the transformer weights to FP8 using [torchao](https://github.com/pytorch/ao): |
|
|
| ```bash |
| pip install torchao |
| ``` |
|
|
```python
import torch
from diffusers.utils import export_to_video
from torchao.quantization import quantize_, Float8WeightOnlyConfig

# MotifVideoPipeline, FlowDPMSolver, and the guider are set up as in the
# main README (see the note above).
pipe = MotifVideoPipeline.from_pretrained(
    "Motif-Technologies/Motif-Video-2B",
    revision="diffusers-integration",
    torch_dtype=torch.bfloat16,
    guider=guider,  # see T2V example above
)
pipe.scheduler = FlowDPMSolver(
    num_train_timesteps=pipe.scheduler.config.get("num_train_timesteps", 1000),
    algorithm_type="dpmsolver++",
    solver_order=2,
    prediction_type="flow_prediction",
    use_flow_sigmas=True,
    flow_shift=15.0,
)
quantize_(pipe.transformer, Float8WeightOnlyConfig())  # FP8 weights, in place
pipe.enable_model_cpu_offload()

output = pipe(
    prompt="...",
    negative_prompt="...",
    height=736, width=1280, num_frames=121, num_inference_steps=50,
    frame_rate=24, use_linear_quadratic_schedule=False,
)
export_to_video(output.frames[0], "output.mp4", fps=24)
```
|
|
This stores the transformer weights in FP8 (8-bit) instead of BF16 (16-bit), reducing peak VRAM from ~19 GB to ~15 GB. Because the quantization is weight-only, each weight is dequantized back to BF16 on the fly, so all computation still runs in BF16 precision.
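

The weight-only scheme is easy to see in isolation. The sketch below (a standalone toy example, not part of the pipeline above) quantizes a single linear layer and shows that activations stay in BF16; it assumes a torchao/PyTorch build with float8 support:


```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_, Float8WeightOnlyConfig

# A toy BF16 model standing in for the transformer's linear layers.
model = nn.Sequential(nn.Linear(4096, 4096)).to("cuda", dtype=torch.bfloat16)
quantize_(model, Float8WeightOnlyConfig())  # weights -> FP8, in place

x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)
y = model(x)    # weights are dequantized to BF16 for the matmul
print(y.dtype)  # torch.bfloat16 (compute stays in BF16)
```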
|
|
| | Mode | Peak VRAM | Notes | |
| |------|-----------|-------| |
| | `enable_model_cpu_offload()` | ~19 GB | BF16 baseline | |
| | `+ Float8WeightOnlyConfig` | ~15 GB | FP8 weights, BF16 compute | |
|
|