
# Memory-efficient Inference

See the main README for the `FlowDPMSolver` and guider setup.

By default, `pipe.to("cuda")` loads all components onto the GPU simultaneously, requiring roughly 30 GB of VRAM.
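For reference, the baseline full-GPU setup is just the standard loading path (a minimal sketch; `MotifVideoPipeline` and the guider come from the main README's T2V example):

```python
import torch

# Baseline: all components (text encoder, transformer, VAE) live on the GPU
# at once, which needs roughly 30 GB of VRAM.
pipe = MotifVideoPipeline.from_pretrained(
    "Motif-Technologies/Motif-Video-2B",
    revision="diffusers-integration",
    torch_dtype=torch.bfloat16,
    guider=guider,
)
pipe.to("cuda")
```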

For GPUs with 24 GB or less (e.g. the RTX 4090 or RTX 3090), use `enable_model_cpu_offload()` together with the `expandable_segments` allocator setting:

```bash
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```
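Alternatively, the allocator option can be set from Python. PyTorch reads `PYTORCH_CUDA_ALLOC_CONF` when the CUDA caching allocator initializes, so it must be set before the first CUDA allocation (a minimal sketch):

```python
import os

# Must happen before the first CUDA allocation, or the setting is ignored.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
```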
```python
import torch

from diffusers.utils import export_to_video

# MotifVideoPipeline, FlowDPMSolver, and the guider setup come from the
# main README (see the T2V example there).
pipe = MotifVideoPipeline.from_pretrained(
    "Motif-Technologies/Motif-Video-2B",
    revision="diffusers-integration",
    torch_dtype=torch.bfloat16,
    guider=guider,
)
pipe.scheduler = FlowDPMSolver(
    num_train_timesteps=pipe.scheduler.config.get("num_train_timesteps", 1000),
    algorithm_type="dpmsolver++",
    solver_order=2,
    prediction_type="flow_prediction",
    use_flow_sigmas=True,
    flow_shift=15.0,
)
pipe.enable_model_cpu_offload()  # replaces pipe.to("cuda")

output = pipe(
    prompt="...",
    negative_prompt="...",
    height=736, width=1280, num_frames=121, num_inference_steps=50,
    frame_rate=24, use_linear_quadratic_schedule=False,
)
export_to_video(output.frames[0], "output.mp4", fps=24)
```

This moves each component (text encoder → transformer → VAE) to the GPU only when it is needed. The `expandable_segments` setting lets the CUDA memory allocator efficiently reuse memory released by earlier components, avoiding fragmentation-related OOM errors.
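To check the peak figures in the table below on your own hardware, you can query PyTorch's allocator statistics around a run (a minimal sketch using the standard `torch.cuda` memory APIs):

```python
import torch

torch.cuda.reset_peak_memory_stats()

# ... run the pipeline call from the example above ...

peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak VRAM: {peak_gib:.1f} GiB")
```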

| Mode | Peak VRAM | Speed | Recommended GPU |
|------|-----------|-------|-----------------|
| `pipe.to("cuda")` | ~30 GB | Fastest | A100, H100, H200 |
| `enable_model_cpu_offload()` | ~19 GB | Similar | RTX 4090, RTX 3090 |

## FP8 Weight Quantization (Optional)

For further VRAM reduction, you can quantize the transformer weights to FP8 using `torchao`:

```bash
pip install torchao
```
```python
import torch

from diffusers.utils import export_to_video
from torchao.quantization import quantize_, Float8WeightOnlyConfig

pipe = MotifVideoPipeline.from_pretrained(
    "Motif-Technologies/Motif-Video-2B",
    revision="diffusers-integration",
    torch_dtype=torch.bfloat16,
    guider=guider,  # see the T2V example in the main README
)
pipe.scheduler = FlowDPMSolver(
    num_train_timesteps=pipe.scheduler.config.get("num_train_timesteps", 1000),
    algorithm_type="dpmsolver++",
    solver_order=2,
    prediction_type="flow_prediction",
    use_flow_sigmas=True,
    flow_shift=15.0,
)
# Quantize the transformer weights to FP8 in place before enabling offload.
quantize_(pipe.transformer, Float8WeightOnlyConfig())
pipe.enable_model_cpu_offload()

output = pipe(
    prompt="...",
    negative_prompt="...",
    height=736, width=1280, num_frames=121, num_inference_steps=50,
    frame_rate=24, use_linear_quadratic_schedule=False,
)
export_to_video(output.frames[0], "output.mp4", fps=24)
```

This stores the transformer weights in FP8 (8-bit) instead of BF16 (16-bit), reducing peak VRAM from ~19 GB to ~15 GB while keeping all computation in BF16 precision.
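As a quick sanity check that quantization took effect, you can inspect one of the transformer's `torch.nn.Linear` layers, which is what `quantize_` targets by default (a sketch; the exact tensor subclass shown in the output depends on your torchao version):

```python
import torch

# The first Linear layer's weight should now be a torchao FP8 tensor
# subclass rather than a plain BF16 tensor.
linear = next(m for m in pipe.transformer.modules() if isinstance(m, torch.nn.Linear))
print(linear.weight)  # the repr should show the quantized tensor type
```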

| Mode | Peak VRAM | Notes |
|------|-----------|-------|
| `enable_model_cpu_offload()` | ~19 GB | BF16 baseline |
| + `Float8WeightOnlyConfig` | ~15 GB | FP8 weights, BF16 compute |