
ConsistCompose: Unified Multimodal Layout Control for Image Composition


Overview

Despite significant advancements in unified multimodal models that integrate visual understanding and image generation, precise layout control for multi-instance image synthesis remains an underexplored challenge. Existing approaches often rely on task-specific modules or structured spatial encoders, limiting their compatibility with broader multimodal capabilities like visual reasoning and identity preservation.

In this work, we propose ConsistCompose, a unified multimodal framework designed to enable layout-controllable multi-instance generation through a language-driven paradigm. Built upon the MoT architecture of Bagel, ConsistCompose introduces Linguistic-Embedded Layout-Grounded Generation (LELG), which embeds layout coordinates directly into language prompts as textual tokens—eliminating the need for specialized spatial encoders or geometric branches. To support large-scale training, we construct ConsistCompose3M, a 3.4M-sample dataset with layout and identity annotations, including 2.6M text-guided and 0.8M image-guided pairs that provide structured spatial and semantic supervision.
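To make the LELG idea concrete, here is a minimal sketch of how layout coordinates can be serialized into an ordinary language prompt as plain text tokens, so that no spatial encoder or geometric branch is needed. The `<box>` token format and the `build_layout_prompt` helper are illustrative assumptions, not the exact serialization used by ConsistCompose.

```python
# Hypothetical sketch of linguistic-embedded layout grounding:
# bounding boxes become ordinary text inside the prompt.

def build_layout_prompt(caption, instances):
    """instances: list of (phrase, (x1, y1, x2, y2)) with coords in [0, 1]."""
    parts = [caption]
    for phrase, (x1, y1, x2, y2) in instances:
        # Coordinates are emitted as plain text, e.g. "<box>0.05,0.40,0.45,0.95</box>",
        # so the language model consumes them like any other tokens.
        parts.append(f"{phrase} <box>{x1:.2f},{y1:.2f},{x2:.2f},{y2:.2f}</box>")
    return "; ".join(parts)

prompt = build_layout_prompt(
    "a cat and a dog in a park",
    [("a cat", (0.05, 0.40, 0.45, 0.95)),
     ("a dog", (0.55, 0.35, 0.95, 0.95))],
)
print(prompt)
```

Because the layout lives entirely in the prompt, the same interface also carries visual reasoning and identity-preservation instructions without architectural changes.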

ConsistCompose achieves state-of-the-art performance on layout-controlled generation benchmarks: it delivers a 7.2% gain in layout IoU and a 13.7% AP improvement on COCO-Position (with 92.6% Instance Success Ratio and 76.1% Image Success Ratio), while leading on core metrics (DINO, mIoU, AP) across both MS-Bench and MS-Bench-Random. Notably, the framework preserves strong general multimodal capabilities—maintaining performance on MMBench, MMMU, and GenEval comparable to its Bagel backbone—while enhancing identity preservation for multi-reference generation. A key component, Coordinate-CFG, further strengthens spatial fidelity through hierarchical classifier-free guidance, balancing strict layout adherence with visual realism.
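The Coordinate-CFG idea of layering an extra guidance term on top of standard text guidance can be sketched as follows. This is a generic hierarchical classifier-free guidance formulation under assumed conditioning branches (unconditional, text-only, text+coordinates); the function name and the guidance scales `w_text` and `w_coord` are hypothetical, not the paper's values.

```python
import numpy as np

def hierarchical_cfg(eps_uncond, eps_text, eps_coord, w_text=5.0, w_coord=2.0):
    """Combine three denoiser outputs: unconditional, text-conditioned,
    and text+coordinate-conditioned predictions."""
    return (eps_uncond
            + w_text * (eps_text - eps_uncond)    # standard text CFG term
            + w_coord * (eps_coord - eps_text))   # extra layout-adherence term

# Toy example with constant "predictions":
eps_u = np.zeros(4)
eps_t = np.ones(4)
eps_c = np.full(4, 1.5)
print(hierarchical_cfg(eps_u, eps_t, eps_c))  # [6. 6. 6. 6.]
```

Raising `w_coord` trades visual realism for stricter box adherence, which is the balance the paper's guidance-scaling analysis explores.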

More importantly, ConsistCompose establishes a unified paradigm that consolidates layout-grounded text-to-image synthesis, multi-reference identity-preserving composition, and general multimodal understanding within a single generative interface. We analyze the impact of coordinate guidance scaling, validate the flexibility of the LELG paradigm, and demonstrate its ability to handle complex, cluttered layouts without sacrificing semantic coherence. ConsistCompose provides a principled solution to layout-controllable multimodal generation, pointing toward future directions in finer-grained spatial reasoning and interactive scene composition.

All technical details, training recipes, and supplementary results are publicly available to facilitate further research in unified layout-aware multimodal systems.

Release Information

We introduce ConsistCompose-BAGEL-7B-MoT, built upon a unified understanding and generation base model.

Specific details are reported in the ConsistCompose technical report. ConsistCompose-BAGEL-7B-MoT preserves the full general image generation and understanding capabilities of the original BAGEL while adding enhanced layout control for precise multi-instance composition.

1. Results on COCO-Position

ISR = Instance Success Ratio (%) ↑; ImSR = Image Success Ratio (%) ↑; mIoU, AP, AP50, AP75 = Position Accuracy (%) ↑.

| Methods | ISR-L2 | ISR-L3 | ISR-L4 | ISR-L5 | ISR-L6 | ISR-Avg | ImSR-L2 | ImSR-L3 | ImSR-L4 | ImSR-L5 | ImSR-L6 | ImSR-Avg | mIoU | AP | AP50 | AP75 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GLIGEN | 89.1 | 86.3 | 82.0 | 79.6 | 81.6 | 82.6 | 78.8 | 63.8 | 48.1 | 35.0 | 35.0 | 52.1 | 69.0 | 40.5 | 75.9 | 39.1 |
| InstanceDiffusion | 94.1 | 94.4 | 89.5 | 84.6 | 83.8 | 87.8 | 89.4 | 84.4 | 67.5 | 46.9 | 39.4 | 65.5 | 78.1 | 57.2 | 83.6 | 65.5 |
| MIGC++ | 94.1 | 92.1 | 87.3 | 84.1 | 83.4 | 86.8 | 89.4 | 78.1 | 62.5 | 48.1 | 38.8 | 63.4 | 74.9 | 48.3 | 79.2 | 52.6 |
| CreatiLayout | 81.9 | 76.3 | 73.4 | 73.5 | 71.2 | 74.0 | 69.4 | 48.1 | 36.9 | 31.9 | 26.3 | 42.5 | 64.9 | 32.4 | 61.1 | 31.6 |
| PlanGen | 85.3 | 84.2 | 83.8 | 80.9 | 81.2 | 82.5 | 72.5 | 63.1 | 51.3 | 33.1 | 31.3 | 50.3 | 66.2 | 31.9 | 74.0 | 21.5 |
| Ours | 95.6 | 94.2 | 92.7 | 90.6 | 92.4 | 92.6 | 91.9 | 83.1 | 73.1 | 63.7 | 68.8 | 76.1 | 85.3 | 70.9 | 89.1 | 76.9 |
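For reference, the mIoU column measures how well each generated instance overlaps its target box. A standard axis-aligned box intersection-over-union, as commonly used for layout metrics like these (the helper name is ours, not from the paper):

```python
def box_iou(a, b):
    """a, b: boxes as (x1, y1, x2, y2). Returns IoU in [0, 1]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    # Union = sum of areas minus the overlap counted twice.
    return inter / (area_a + area_b - inter) if inter else 0.0

print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 0.14285714285714285 (= 1/7)
```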

2. Results on MS-Bench & MS-Bench-Random

MS-Bench:

| Methods | CLIP-T | DINO | mIoU | AP |
|---|---|---|---|---|
| GLIGEN | 0.309 | 0.454 | 0.868 | 0.751 |
| MS-Diffusion | 0.336 | 0.555 | 0.466 | 0.108 |
| MUSE | 0.320 | 0.619 | 0.698 | 0.352 |
| Ours | 0.333 | 0.660 | 0.889 | 0.789 |

MS-Bench-Random:

| Methods | CLIP-T | DINO | mIoU | AP |
|---|---|---|---|---|
| GLIGEN | 0.312 | 0.431 | 0.858 | 0.722 |
| MS-Diffusion | 0.334 | 0.544 | 0.464 | 0.105 |
| MUSE | 0.321 | 0.607 | 0.673 | 0.303 |
| Ours | 0.334 | 0.630 | 0.878 | 0.756 |

3. Results on MMBench, MMMU, GenEval, GEdit

(a) General capability

| Model | MMBench ↑ | MMMU ↑ | GenEval ↑ | GEdit ↑ |
|---|---|---|---|---|
| Bagel Base | 81.4 | 46.4 | 0.86 | 6.68 |
| Ours (w/o Coord) | 81.5 | 39.4 | 0.88 | 6.23 |
| Ours (w/ Coord) | 81.4 | 42.3 | 0.88 | 6.31 |

4. Results on DreamBench (Single / Multi)

| Method | DINO (Single) | CLIP-I (Single) | CLIP-T (Single) | DINO (Multi) | CLIP-I (Multi) | CLIP-T (Multi) |
|---|---|---|---|---|---|---|
| UNO | 0.661 | 0.796 | 0.304 | 0.491 | 0.715 | 0.323 |
| OmniGen | 0.554 | 0.746 | 0.322 | 0.441 | 0.692 | 0.341 |
| OmniGen2 | 0.671 | 0.791 | 0.312 | 0.459 | 0.698 | 0.333 |
| Ours | 0.677 | 0.792 | 0.314 | 0.506 | 0.703 | 0.335 |

🖊️ Citation

@article{shi2025consistcompose,
  title={ConsistCompose: Unified Multimodal Layout Control for Image Composition},
  author={Shi, Xuanke and Li, Boxuan and Han, Xiaoyang and Cai, Zhongang and Yang, Lei and Lin, Dahua and Wang, Quan},
  journal={arXiv preprint arXiv:2511.18333},
  year={2025}
}

Base model: Qwen/Qwen2.5-7B