# Memory-efficient Inference
> See the main [README](../README.md) for `FlowDPMSolver` and `guider` setup.
By default, `pipe.to("cuda")` loads all components onto the GPU simultaneously, requiring **~30 GB VRAM**.
For GPUs with 24 GB or less (e.g. RTX 4090, RTX 3090), use `enable_model_cpu_offload()` with the `expandable_segments` allocator setting:
```bash
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```
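The variable must be set before PyTorch initializes CUDA. If you would rather configure it from the script than the shell, setting it through `os.environ` at the top of the file works too; a minimal sketch:
```python
import os

# Must run before the first CUDA allocation; the caching allocator reads
# PYTORCH_CUDA_ALLOC_CONF when CUDA is first initialized.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
```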
```python
import torch
from diffusers.utils import export_to_video
# MotifVideoPipeline, FlowDPMSolver, and guider come from the setup in the main README

pipe = MotifVideoPipeline.from_pretrained(
    "Motif-Technologies/Motif-Video-2B",
    revision="diffusers-integration",
    torch_dtype=torch.bfloat16,
    guider=guider,  # see T2V example above
)
pipe.scheduler = FlowDPMSolver(
    num_train_timesteps=pipe.scheduler.config.get("num_train_timesteps", 1000),
    algorithm_type="dpmsolver++",
    solver_order=2,
    prediction_type="flow_prediction",
    use_flow_sigmas=True,
    flow_shift=15.0,
)
pipe.enable_model_cpu_offload()  # replaces pipe.to("cuda")

output = pipe(
    prompt="...",
    negative_prompt="...",
    height=736, width=1280, num_frames=121, num_inference_steps=50,
    frame_rate=24, use_linear_quadratic_schedule=False,
)
export_to_video(output.frames[0], "output.mp4", fps=24)
```
This moves each component (text encoder → transformer → VAE) to the GPU only while it is needed, offloading it back to CPU afterwards. The `expandable_segments` setting lets the CUDA caching allocator reuse memory released by earlier components, avoiding fragmentation-related OOM errors.
| Mode | Peak VRAM | Speed | Recommended GPU |
|------|-----------|-------|-----------------|
| `pipe.to("cuda")` | ~30 GB | Fastest | A100, H100, H200 |
| `enable_model_cpu_offload()` | ~19 GB | Similar | RTX 4090, RTX 3090 |
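To verify the peak-VRAM figures on your own hardware, you can bracket the generation call with PyTorch's standard memory counters; a minimal sketch using the same `pipe` call as above:
```python
import torch

torch.cuda.reset_peak_memory_stats()

output = pipe(
    prompt="...",
    negative_prompt="...",
    height=736, width=1280, num_frames=121, num_inference_steps=50,
    frame_rate=24, use_linear_quadratic_schedule=False,
)

# Peak memory held by tensors vs. total reserved by the caching allocator.
print(f"allocated: {torch.cuda.max_memory_allocated() / 1024**3:.1f} GB")
print(f"reserved:  {torch.cuda.max_memory_reserved() / 1024**3:.1f} GB")
```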
## FP8 Weight Quantization (Optional)
For further VRAM reduction, you can quantize the transformer weights to FP8 using [torchao](https://github.com/pytorch/ao):
```bash
pip install torchao
```
```python
import torch
from diffusers.utils import export_to_video
from torchao.quantization import quantize_, Float8WeightOnlyConfig
# MotifVideoPipeline, FlowDPMSolver, and guider come from the setup in the main README

pipe = MotifVideoPipeline.from_pretrained(
    "Motif-Technologies/Motif-Video-2B",
    revision="diffusers-integration",
    torch_dtype=torch.bfloat16,
    guider=guider,  # see T2V example above
)
pipe.scheduler = FlowDPMSolver(
    num_train_timesteps=pipe.scheduler.config.get("num_train_timesteps", 1000),
    algorithm_type="dpmsolver++",
    solver_order=2,
    prediction_type="flow_prediction",
    use_flow_sigmas=True,
    flow_shift=15.0,
)

# Quantize transformer weights to FP8 before enabling offload.
quantize_(pipe.transformer, Float8WeightOnlyConfig())
pipe.enable_model_cpu_offload()

output = pipe(
    prompt="...",
    negative_prompt="...",
    height=736, width=1280, num_frames=121, num_inference_steps=50,
    frame_rate=24, use_linear_quadratic_schedule=False,
)
export_to_video(output.frames[0], "output.mp4", fps=24)
```
This stores the transformer weights in FP8 (8-bit) instead of BF16 (16-bit), reducing peak VRAM from ~19 GB to ~15 GB while keeping all computation in BF16 precision.
| Mode | Peak VRAM | Notes |
|------|-----------|-------|
| `enable_model_cpu_offload()` | ~19 GB | BF16 baseline |
| `+ Float8WeightOnlyConfig` | ~15 GB | FP8 weights, BF16 compute |
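If full FP8 quantization degrades output quality, `quantize_` accepts a `filter_fn` predicate so you can keep selected layers in BF16. The sketch below is a hypothetical example: the `proj_out` name check is an assumption about layer naming, not a tested recipe for this model:
```python
import torch.nn as nn
from torchao.quantization import quantize_, Float8WeightOnlyConfig

def fp8_filter(module: nn.Module, fqn: str) -> bool:
    # Quantize only nn.Linear layers, and (hypothetically) keep any
    # "proj_out" projection in BF16; adjust the name check to your model.
    return isinstance(module, nn.Linear) and "proj_out" not in fqn

quantize_(pipe.transformer, Float8WeightOnlyConfig(), filter_fn=fp8_filter)
```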