
ConsistCompose: Unified Multimodal Layout Control for Image Composition


Overview

Despite significant advancements in unified multimodal models that integrate visual understanding and image generation, precise layout control for multi-instance image synthesis remains an underexplored challenge. Existing approaches often rely on task-specific modules or structured spatial encoders, limiting their compatibility with broader multimodal capabilities like visual reasoning and identity preservation.

In this work, we propose ConsistCompose, a unified multimodal framework designed to enable layout-controllable multi-instance generation through a language-driven paradigm. Built upon the MoT architecture of Bagel, ConsistCompose introduces Linguistic-Embedded Layout-Grounded Generation (LELG), which embeds layout coordinates directly into language prompts as textual tokens—eliminating the need for specialized spatial encoders or geometric branches. To support large-scale training, we construct ConsistCompose3M, a 3.4M-sample dataset with layout and identity annotations, including 2.6M text-guided and 0.8M image-guided pairs that provide structured spatial and semantic supervision.
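To make the LELG idea concrete, here is a minimal sketch of how layout coordinates can be serialized into an ordinary language prompt as plain text tokens, so that no spatial encoder or geometric branch is needed. The `<box>` token format and the `build_layout_prompt` helper are illustrative assumptions, not the exact serialization used by ConsistCompose.

```python
# Hypothetical sketch of linguistic-embedded layout grounding:
# bounding boxes become ordinary text inside the prompt.

def build_layout_prompt(caption, instances):
    """instances: list of (phrase, (x1, y1, x2, y2)) with coords in [0, 1]."""
    parts = [caption]
    for phrase, (x1, y1, x2, y2) in instances:
        # Coordinates are emitted as plain text, e.g. "<box>0.05,0.40,0.45,0.95</box>",
        # so the language model consumes them like any other tokens.
        parts.append(f"{phrase} <box>{x1:.2f},{y1:.2f},{x2:.2f},{y2:.2f}</box>")
    return "; ".join(parts)

prompt = build_layout_prompt(
    "a cat and a dog in a park",
    [("a cat", (0.05, 0.40, 0.45, 0.95)),
     ("a dog", (0.55, 0.35, 0.95, 0.95))],
)
print(prompt)
```

Because the layout lives entirely in the prompt, the same interface also carries visual reasoning and identity-preservation instructions without architectural changes.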

ConsistCompose achieves state-of-the-art performance on layout-controlled generation benchmarks: it delivers a 7.2% gain in layout IoU and a 13.7% AP improvement on COCO-Position (with 92.6% Instance Success Ratio and 76.1% Image Success Ratio), while leading on core metrics (DINO, mIoU, AP) across both MS-Bench and MS-Bench-Random. Notably, the framework preserves strong general multimodal capabilities—maintaining performance on MMBench, MMMU, and GenEval comparable to its Bagel backbone—while enhancing identity preservation for multi-reference generation. A key component, Coordinate-CFG, further strengthens spatial fidelity through hierarchical classifier-free guidance, balancing strict layout adherence with visual realism.
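The Coordinate-CFG idea of layering an extra guidance term on top of standard text guidance can be sketched as follows. This is a generic hierarchical classifier-free guidance formulation under assumed conditioning branches (unconditional, text-only, text+coordinates); the function name and the guidance scales `w_text` and `w_coord` are hypothetical, not the paper's values.

```python
import numpy as np

def hierarchical_cfg(eps_uncond, eps_text, eps_coord, w_text=5.0, w_coord=2.0):
    """Combine three denoiser outputs: unconditional, text-conditioned,
    and text+coordinate-conditioned predictions."""
    return (eps_uncond
            + w_text * (eps_text - eps_uncond)    # standard text CFG term
            + w_coord * (eps_coord - eps_text))   # extra layout-adherence term

# Toy example with constant "predictions":
eps_u = np.zeros(4)
eps_t = np.ones(4)
eps_c = np.full(4, 1.5)
print(hierarchical_cfg(eps_u, eps_t, eps_c))  # [6. 6. 6. 6.]
```

Raising `w_coord` trades visual realism for stricter box adherence, which is the balance the paper's guidance-scaling analysis explores.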

More importantly, ConsistCompose establishes a unified paradigm that consolidates layout-grounded text-to-image synthesis, multi-reference identity-preserving composition, and general multimodal understanding within a single generative interface. We analyze the impact of coordinate guidance scaling, validate the flexibility of the LELG paradigm, and demonstrate its ability to handle complex, cluttered layouts without sacrificing semantic coherence. ConsistCompose provides a principled solution to layout-controllable multimodal generation, pointing toward future directions in finer-grained spatial reasoning and interactive scene composition.

All technical details, training recipes, and supplementary results are publicly available to facilitate further research in unified layout-aware multimodal systems.

Release Information

We introduce ConsistCompose-BAGEL-7B-MoT, built upon a unified understanding and generation base model.

Specific details are reported in the ConsistCompose technical report. ConsistCompose-BAGEL-7B-MoT preserves the full general image generation and understanding capabilities of the original BAGEL while adding enhanced layout control for precise multi-instance composition.

1. Results on COCO-Position

ISR = Instance Success Ratio (%) ↑; ImSR = Image Success Ratio (%) ↑; mIoU, AP, AP50, AP75 = Position Accuracy (%) ↑.

| Methods | ISR-L2 | ISR-L3 | ISR-L4 | ISR-L5 | ISR-L6 | ISR-Avg | ImSR-L2 | ImSR-L3 | ImSR-L4 | ImSR-L5 | ImSR-L6 | ImSR-Avg | mIoU | AP | AP50 | AP75 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GLIGEN | 89.1 | 86.3 | 82.0 | 79.6 | 81.6 | 82.6 | 78.8 | 63.8 | 48.1 | 35.0 | 35.0 | 52.1 | 69.0 | 40.5 | 75.9 | 39.1 |
| InstanceDiffusion | 94.1 | 94.4 | 89.5 | 84.6 | 83.8 | 87.8 | 89.4 | 84.4 | 67.5 | 46.9 | 39.4 | 65.5 | 78.1 | 57.2 | 83.6 | 65.5 |
| MIGC++ | 94.1 | 92.1 | 87.3 | 84.1 | 83.4 | 86.8 | 89.4 | 78.1 | 62.5 | 48.1 | 38.8 | 63.4 | 74.9 | 48.3 | 79.2 | 52.6 |
| CreatiLayout | 81.9 | 76.3 | 73.4 | 73.5 | 71.2 | 74.0 | 69.4 | 48.1 | 36.9 | 31.9 | 26.3 | 42.5 | 64.9 | 32.4 | 61.1 | 31.6 |
| PlanGen | 85.3 | 84.2 | 83.8 | 80.9 | 81.2 | 82.5 | 72.5 | 63.1 | 51.3 | 33.1 | 31.3 | 50.3 | 66.2 | 31.9 | 74.0 | 21.5 |
| Ours | 95.6 | 94.2 | 92.7 | 90.6 | 92.4 | 92.6 | 91.9 | 83.1 | 73.1 | 63.7 | 68.8 | 76.1 | 85.3 | 70.9 | 89.1 | 76.9 |
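For reference, the mIoU column measures how well each generated instance overlaps its target box. A standard axis-aligned box intersection-over-union, as commonly used for layout metrics like these (the helper name is ours, not from the paper):

```python
def box_iou(a, b):
    """a, b: boxes as (x1, y1, x2, y2). Returns IoU in [0, 1]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    # Union = sum of areas minus the overlap counted twice.
    return inter / (area_a + area_b - inter) if inter else 0.0

print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 0.14285714285714285 (= 1/7)
```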

2. Results on MS-Bench & MS-Bench-Random

MS-Bench:

| Methods | CLIP-T | DINO | mIoU | AP |
|---|---|---|---|---|
| GLIGEN | 0.309 | 0.454 | 0.868 | 0.751 |
| MS-Diffusion | 0.336 | 0.555 | 0.466 | 0.108 |
| MUSE | 0.320 | 0.619 | 0.698 | 0.352 |
| Ours | 0.333 | 0.660 | 0.889 | 0.789 |

MS-Bench-Random:

| Methods | CLIP-T | DINO | mIoU | AP |
|---|---|---|---|---|
| GLIGEN | 0.312 | 0.431 | 0.858 | 0.722 |
| MS-Diffusion | 0.334 | 0.544 | 0.464 | 0.105 |
| MUSE | 0.321 | 0.607 | 0.673 | 0.303 |
| Ours | 0.334 | 0.630 | 0.878 | 0.756 |

3. Results on MMBench, MMMU, GenEval, GEdit

(a) General capability

| Model | MMBench ↑ | MMMU ↑ | GenEval ↑ | GEdit ↑ |
|---|---|---|---|---|
| Bagel Base | 81.4 | 46.4 | 0.86 | 6.68 |
| Ours (w/o Coord) | 81.5 | 39.4 | 0.88 | 6.23 |
| Ours (w/ Coord) | 81.4 | 42.3 | 0.88 | 6.31 |

4. Results on DreamBench (Single / Multi)

| Method | DINO (Single) | CLIP-I (Single) | CLIP-T (Single) | DINO (Multi) | CLIP-I (Multi) | CLIP-T (Multi) |
|---|---|---|---|---|---|---|
| UNO | 0.661 | 0.796 | 0.304 | 0.491 | 0.715 | 0.323 |
| OmniGen | 0.554 | 0.746 | 0.322 | 0.441 | 0.692 | 0.341 |
| OmniGen2 | 0.671 | 0.791 | 0.312 | 0.459 | 0.698 | 0.333 |
| Ours | 0.677 | 0.792 | 0.314 | 0.506 | 0.703 | 0.335 |

🖊️ Citation

@article{shi2025consistcompose,
  title={ConsistCompose: Unified Multimodal Layout Control for Image Composition},
  author={Shi, Xuanke and Li, Boxuan and Han, Xiaoyang and Cai, Zhongang and Yang, Lei and Lin, Dahua and Wang, Quan},
  journal={arXiv preprint arXiv:2511.18333},
  year={2025}
}

Base model: Qwen/Qwen2.5-7B