Loading the model in 16-bit precision (float16 or bfloat16)

by zeeshan73 - opened Jul 30, 2025

Hi,

Thanks for the support and for sharing this repo. I want to load the models in float16 or bfloat16, but even with 46 GB of GPU memory I am still hitting out-of-memory (OOM) errors while loading the models.

Below is what I tried on an AWS g6e.2xlarge (https://instances.vantage.sh/aws/ec2/g6e.2xlarge):

Tried loading in 16-bit precision --> OOM error.
Tried loading in bfloat16 --> dtype mismatch error in the prepare_latents method, which uses float32.
Tried loading in bfloat16 and updating the dtype in the prepare_latents method --> OOM error during generation.
Tried CPU offload --> still an OOM error.
When loaded in bfloat16, the model occupied 44221MiB/46068MiB on the NVIDIA L40S.
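
To put those numbers in perspective, here is a rough sketch of the memory arithmetic. It is only an estimate under stated assumptions: it counts model weights alone and ignores activations, latents, and CUDA overhead. The helper names (`weight_gib`, `headroom_mib`, `max_params_upper_bound`) are made up for illustration and are not part of this repo or any library.

```python
# Back-of-the-envelope GPU memory arithmetic (weights only, no activations).
# All helper names here are hypothetical, for illustration only.

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}
MIB = 1024 ** 2
GIB = 1024 ** 3


def weight_gib(num_params: int, dtype: str) -> float:
    """Memory needed just to hold num_params weights in the given dtype, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / GIB


def headroom_mib(used_mib: int, total_mib: int) -> int:
    """Free GPU memory left after loading, in MiB."""
    return total_mib - used_mib


def max_params_upper_bound(used_mib: int, dtype: str) -> int:
    """Upper bound on parameter count if ALL used memory were weights."""
    return used_mib * MIB // BYTES_PER_PARAM[dtype]


# Figures reported above for the L40S after loading in bfloat16.
free = headroom_mib(used_mib=44221, total_mib=46068)
print(f"Free after load: {free} MiB")
print(f"Weights-only upper bound: {max_params_upper_bound(44221, 'bfloat16'):,} params")
# The same weights in float32 would need twice the memory of bfloat16.
print(f"1e9 params: {weight_gib(10**9, 'bfloat16'):.2f} GiB (bf16), "
      f"{weight_gib(10**9, 'float32'):.2f} GiB (fp32)")
```

By this estimate only 1847 MiB remain free after loading, which has to cover latents and activations during generation, so an OOM error at that stage would not be surprising.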

Can you help me with how to proceed from here, or do I need more compute power? Please share the required details.

Thanks in advance,
Zeeshan
