OOM error for 450-second video at 1 frame per second.

#9
by hchanuk - opened

Hi,

I am getting an OOM error when running Molmo2 8B on a 450-second video at only 1 FPS. I've run other 8B VLMs at this frame rate without any issues. Molmo2 4B hits the same error.

Are there any suggestions on how to solve this without lowering the FPS?

For example:
Are there any input arguments for the model to set a limit to the spatial resolution/pixels/tokens per frame?

Molmo2 resizes each video frame to a fixed resolution (378x378) before encoding it with the ViT.
Since the model was not trained with other resolutions, it's hard for us to recommend using a different input resolution.
It also applies fairly aggressive pooling (3x3) after encoding to reduce the number of visual tokens.

As a result, there isn't much room to further reduce the per-frame token count.
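To make the per-frame budget concrete, here is a rough back-of-the-envelope sketch. The 378x378 resolution and 3x3 pooling come from the explanation above; the 14-pixel ViT patch size is an assumption on my part (it is not stated in this thread), and any extra per-frame special tokens are ignored, so treat the result as an estimate rather than the exact count the processor produces.

```python
import math


def tokens_per_frame(resolution: int = 378, patch_size: int = 14, pool: int = 3) -> int:
    """Estimate visual tokens per frame after ViT patching and pooling.

    patch_size=14 is an ASSUMPTION (not confirmed in this thread).
    Special tokens the processor may add per frame are not counted.
    """
    patches_per_side = resolution // patch_size           # 378 // 14 = 27
    pooled_per_side = math.ceil(patches_per_side / pool)  # ceil(27 / 3) = 9
    return pooled_per_side * pooled_per_side


print(tokens_per_frame())  # estimate under the assumed patch size
```

Under that assumption each frame already costs fewer than a hundred tokens, which is why the frame count, not the per-frame resolution, is the practical knob.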
The most practical recommendation to reduce memory usage is to lower the maximum number of sampled frames, num_frames.

For reference, Molmo2 samples video frames using a fixed max_fps, regardless of the input video's original frame rate.
You can find more details in the following code:
https://huggingface.co/allenai/Molmo2-8B/blob/main/video_processing_molmo2.py#L637-L651

With the default configuration (num_frames=384), your 450-second video will be uniformly sampled into 384 frames.
Reducing num_frames will proportionally reduce the number of visual tokens and therefore lower memory usage.
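As a simplified illustration of the capping behaviour described above (not the actual code from video_processing_molmo2.py), the sketch below samples at max_fps until the frame count would exceed num_frames, then falls back to uniform sampling. The function names are hypothetical, max_fps is passed explicitly because its default is not stated here, and the 81 tokens-per-frame figure is an estimate that assumes a 14-pixel ViT patch size.

```python
def sample_timestamps(duration_s: float, max_fps: float, num_frames: int) -> list[float]:
    """Return timestamps (seconds) for sampled frames.

    Hypothetical helper: samples at max_fps, but never more than
    num_frames frames; longer videos are sampled uniformly instead.
    """
    if duration_s <= 0 or max_fps <= 0 or num_frames <= 0:
        raise ValueError("duration_s, max_fps and num_frames must be positive")
    frames_at_fps = int(duration_s * max_fps)
    n = max(1, min(frames_at_fps, num_frames))
    step = duration_s / n
    # Take the midpoint of each of the n equal-length segments.
    return [(i + 0.5) * step for i in range(n)]


def visual_tokens(n_frames: int, tokens_per_frame: int = 81) -> int:
    """Total visual tokens; 81 per frame is an ASSUMED estimate."""
    return n_frames * tokens_per_frame


for cap in (384, 256, 128):
    n = len(sample_timestamps(450, max_fps=1.0, num_frames=cap))
    print(cap, n, visual_tokens(n))
```

For a 450-second video the cap is what determines the frame count, so halving num_frames roughly halves the visual tokens fed to the language model.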
