Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training
Abstract
A data-efficient training framework for unified multimodal models that uses image-only data for pre-training followed by fine-tuning with mixed data types achieves state-of-the-art performance with reduced computational requirements.
Unified Multimodal Models (UMMs) are often constrained by the pre-training of their visual generation components, which typically relies on inefficient paradigms and scarce, high-quality text-image paired data. In this paper, we systematically analyze pre-training recipes for UMM visual generation and identify these two issues as the major bottlenecks. To address them, we propose Image-Only Training for UMMs (IOMM), a data-efficient two-stage training framework. The first stage pre-trains the visual generative component exclusively on abundant unlabeled image-only data, removing the dependency on paired data for this costly phase. The second stage fine-tunes the model on a mixture of unlabeled images and a small curated set of text-image pairs, improving instruction alignment and generative quality. Extensive experiments show that IOMM not only improves training efficiency but also achieves state-of-the-art (SOTA) performance. For example, our IOMM-B (3.6B) model was trained from scratch using only ~1,050 H800 GPU hours (with the vast majority, 1,000 hours, dedicated to the efficient image-only pre-training stage). It achieves 0.89 on GenEval and 0.55 on WISE, surpassing strong baselines such as BAGEL-7B (0.82 & 0.55) and BLIP3-o-4B (0.84 & 0.50). Code is available at https://github.com/LINs-lab/IOMM.
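As a rough illustration of the two-stage recipe described in the abstract, the sketch below shows a masked-modeling pre-training step on unlabeled images (stage 1) followed by a fine-tuning step that optionally conditions on text (stage 2). All component names (tokenizer, generator, text_encoder), the mask ratio, and the loss formulation are assumptions made for illustration and are not taken from the paper's implementation.

```python
import torch
import torch.nn.functional as F

# Placeholder modules (not from the paper): a visual tokenizer producing discrete
# tokens, a generative transformer over those tokens, and a text encoder.

def mask_tokens(tokens, mask_ratio, mask_id):
    """Randomly replace a fraction of visual tokens with a [MASK] id."""
    mask = torch.rand_like(tokens, dtype=torch.float) < mask_ratio
    return tokens.masked_fill(mask, mask_id), mask

def stage1_image_only_step(generator, tokenizer, images, optimizer, mask_id, mask_ratio=0.5):
    """Stage 1: masked-modeling pre-training on unlabeled images only (no text)."""
    with torch.no_grad():
        tokens = tokenizer(images)                      # (B, N) discrete visual tokens
    masked, mask = mask_tokens(tokens, mask_ratio, mask_id)
    logits = generator(masked)                          # (B, N, V): predict the original tokens
    loss = F.cross_entropy(logits[mask], tokens[mask])  # loss only on masked positions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def stage2_mixed_step(generator, tokenizer, text_encoder, images, captions,
                      optimizer, mask_id, mask_ratio=0.5):
    """Stage 2: fine-tune on a mixture of unlabeled images and curated text-image pairs.

    `captions` is None for unlabeled images, so the same step covers both data types.
    """
    with torch.no_grad():
        tokens = tokenizer(images)
    masked, mask = mask_tokens(tokens, mask_ratio, mask_id)
    cond = text_encoder(captions) if captions is not None else None
    logits = generator(masked, condition=cond)          # text-conditioned when captions exist
    loss = F.cross_entropy(logits[mask], tokens[mask])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this reading, stage 1 consumes the bulk of the compute on cheap unlabeled images, while stage 2 needs only a small paired set to align generation with text instructions.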
Community
IOMM (Image-Only Training for UMMs) introduces a data-efficient two-stage framework that achieves state-of-the-art multimodal generation by replacing the costly reliance on paired text-image data with a high-performance "image-only" pre-training stage.