OOM on 4 GPUs

#3
by SpiridonSunRotator - opened

Hi, I am trying to produce an image on 4 GPUs with an auto device_map. However, the code crashes due to CUDA OOM even for small inputs, say 64x64:

cot_text, samples = model.generate_image(
    prompt=prompt,
    image=imgs_input,
    seed=42,
    image_size=(64, 64),
    use_system_prompt="en_unified",
    bot_task="think_recaption",  # Use "think_recaption" for reasoning and enhancement
    infer_align_image_size=True,  # Align output image size to input image size
    diff_infer_steps=8,
    verbose=2,
)

According to the error message, a large tensor is being allocated somewhere:

File /usr/local/lib/python3.12/dist-packages/torch/nn/modules/conv.py:712, in Conv3d._conv_forward(self, input, weight, bias)
    700 if self.padding_mode != "zeros":
    701     return F.conv3d(
    702         F.pad(
    703             input, self._reversed_padding_repeated_twice, mode=self.padding_mode
   (...)    710         self.groups,
    711     )
--> 712 return F.conv3d(
    713     input, weight, bias, self.stride, self.padding, self.dilation, self.groups
    714 )

OutOfMemoryError: CUDA out of memory. Tried to allocate 27.00 GiB. GPU 0 has a total capacity of 79.25 GiB of which 11.26 GiB is free. Process 334550 has 67.98 GiB memory in use. Of the allocated memory 45.48 GiB is allocated by PyTorch, and 22.01 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

How can one fix this?

The image_size argument seems to act more as an aspect-ratio hint (1:1) than as a literal pixel resolution (not 100% sure, but that's how it appears to behave), so the model may still run at its default internal size, e.g. 1024x1024.
Have you tried enabling moe_drop_tokens=True when calling AutoModelForCausalLM.from_pretrained? Without token dropping, the MoE layers can require significantly more VRAM; turning it on usually reduces memory usage with some quality trade-offs.
For reference, I was able to run it on a single H200 using 8-bit quantization plus a bit of code tweaking.
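A rough sanity check supports the default-resolution hypothesis above. The channel and frame counts below are made up for illustration (not the model's real dimensions), but they show how steeply activation size scales with spatial resolution, and why a 27 GiB allocation is implausible for a true 64x64 run:

```python
def conv3d_activation_gib(channels, frames, height, width, bytes_per_elem=2):
    """Rough size of a single bf16 activation tensor (batch size 1), in GiB."""
    return channels * frames * height * width * bytes_per_elem / 2**30

# Hypothetical shapes, for illustration only.
small = conv3d_activation_gib(channels=512, frames=16, height=64, width=64)
large = conv3d_activation_gib(channels=512, frames=16, height=1024, width=1024)
print(f"64x64: {small:.2f} GiB, 1024x1024: {large:.2f} GiB")
```

Going from 64x64 to 1024x1024 multiplies the activation footprint by 256x, which is consistent with a single huge allocation inside Conv3d.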

Hi @LWZ19, I am loading with moe_drop_tokens=True, and I passed a tiny image (64, 64) (the input image size matches the image_size argument).

Where exactly is the OOM happening? Are you able to load the model, or does it crash during image/recaption generation? Can you get any recaption text at all?
It might just be too many tokens. You can try disabling thinking and using a minimal setup:

bot_task="image",
use_system_prompt="en_vanilla"

If it is a token/memory issue, another workaround is quantizing the model to reduce VRAM usage.

@LWZ19 I am able to load the model; memory usage is around 40 GB per GPU. But generation (the generate_image call) crashes while allocating some large tensor.

@LWZ19 which quantization config are you using? I tried load_in_8bit from bitsandbytes, but it fails due to a missing GeLU kernel:

NotImplementedError: "GeluCUDAKernelImpl" not implemented for 'Char'

I'm not able to reproduce the OOM on my side. The example generate_image code from the model card runs fine for me on 4×80 GB GPUs without crashes.

For quantization, you could try using QuantoConfig from Transformers instead of bitsandbytes:

from transformers import QuantoConfig

quanto_config = QuantoConfig(weights="int8")
kwargs = dict(
    quantization_config=quanto_config,
    ...
)

With this setup you might still see a CUDA/CPU device mismatch error depending on your environment, but it avoids the GeLU kernel error for me.

@LWZ19 thank you very much! With Quanto I was able to run inference.

SpiridonSunRotator changed discussion status to closed
