---
tags:
- text-to-image
- diffusers
widget:
- text: a photo of a laptop above a dog
  output:
    url: images/laptop-above-dog.jpg
- text: a photo of a potted plant to the right of a motorcycle
  output:
    url: images/potted_plant-right-motorcycle.jpg
- text: a photo of a sheep below a sink
  output:
    url: images/sheep-below-sink.jpg
base_model: stabilityai/stable-diffusion-2-1
license: apache-2.0
---
# CoMPaSS-SD2.1

<Gallery />

## Model description 

\[[Project Page]\]
\[[code]\]
\[[arXiv]\]

A UNet that improves the spatial understanding of the StableDiffusion 2.1 text-to-image
diffusion model. Compared with the base model, it is substantially better at generating images
that respect the spatial relationship stated in the prompt, such as one object being to the left
of, to the right of, above, or below another.

## Model Details

- **Base Model**: StableDiffusion 2.1
- **Training Data**: SCOP dataset (curated from COCO)
- **Framework**: Diffusers
- **License**: Apache-2.0 (see [./LICENSE])

## Intended Use

- Generating images with accurate spatial relationships between objects
- Creating compositions that require specific spatial arrangements
- Enhancing the base model's spatial understanding while maintaining its other capabilities

## Performance 

### Key Improvements

- VISOR benchmark: +105.2% relative improvement
- T2I-CompBench Spatial: +146.2% relative improvement
- GenEval Position: +628.6% relative improvement
- Maintains or improves the base model's image fidelity (lower FID and CMMD scores than the base model)

## Using the Model

See our [GitHub repository][code] to get started.

### Effective Prompting

The model works well with:
- Clear spatial relationship descriptors (left, right, above, below)
- Pairs of distinct objects
- Explicit spatial relationships (e.g., "a photo of A to the right of B")
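
The prompt pattern above can be sketched as a small helper. This is only an illustration: `spatial_prompt`, `RELATIONS`, and `article` are hypothetical names that are not part of the CoMPaSS codebase, and the a/an rule is a rough assumption.

```python
# Hypothetical helper for building prompts in the "a photo of A <relation> B" form.
# None of these names come from the CoMPaSS repository.
RELATIONS = {
    "left": "to the left of",
    "right": "to the right of",
    "above": "above",
    "below": "below",
}


def article(noun: str) -> str:
    """Pick "a" or "an" with a simple first-letter rule (an assumption, not exhaustive)."""
    return "an" if noun[0].lower() in "aeiou" else "a"


def spatial_prompt(subject: str, relation: str, reference: str) -> str:
    """Build a prompt placing `subject` in `relation` to `reference`."""
    if relation not in RELATIONS:
        raise ValueError(f"relation must be one of {sorted(RELATIONS)}")
    if subject == reference:
        # The model card recommends pairs of distinct objects.
        raise ValueError("subject and reference should be distinct objects")
    return (
        f"a photo of {article(subject)} {subject} "
        f"{RELATIONS[relation]} {article(reference)} {reference}"
    )


print(spatial_prompt("potted plant", "right", "motorcycle"))
```

With these inputs the helper reproduces one of the example prompts shown in the gallery at the top of this card.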

## Training Details

### Training Data

- Built using the SCOP (Spatial Constraints-Oriented Pairing) data engine
- ~28,000 curated object pairs from COCO
- Enforces criteria for:
  - Visual significance
  - Semantic distinction
  - Spatial clarity
  - Object relationships
  - Visual balance

### Training Process

- Trained for 80,000 steps
- Effective batch size of 4
- Learning rate: 5e-6
- Optimizer: AdamW with β₁=0.9, β₂=0.999
- Weight decay: 1e-2
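
For reference, the hyperparameters above can be collected into a small configuration object. This is a minimal sketch: `TrainConfig` and its field names are hypothetical and do not reflect how the CoMPaSS training code is organized. The `adamw_kwargs` method only maps the values onto the keyword names that `torch.optim.AdamW` accepts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the hyperparameters listed in this card."""

    max_steps: int = 80_000
    effective_batch_size: int = 4
    learning_rate: float = 5e-6
    adam_beta1: float = 0.9
    adam_beta2: float = 0.999
    weight_decay: float = 1e-2

    @property
    def samples_seen(self) -> int:
        # Total training samples processed: steps times effective batch size.
        return self.max_steps * self.effective_batch_size

    def adamw_kwargs(self) -> dict:
        # Keyword arguments in the form expected by torch.optim.AdamW.
        return {
            "lr": self.learning_rate,
            "betas": (self.adam_beta1, self.adam_beta2),
            "weight_decay": self.weight_decay,
        }


cfg = TrainConfig()
print(cfg.samples_seen)  # 80,000 steps x 4 samples per step = 320000
print(cfg.adamw_kwargs())
```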

## Evaluation Results

| Metric | StableDiffusion 2.1 | +CoMPaSS |
|--------|-------------|-----------|
| VISOR uncond (⬆️) | 30.25% | **62.06%** |
| T2I-CompBench Spatial (⬆️) | 0.13 | **0.32** |
| GenEval Position (⬆️) | 0.07 | **0.51** |
| FID (⬇️) | 21.65 | **16.96** |
| CMMD (⬇️) | 0.6472 | **0.4083** |
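
The relative improvements listed under Key Improvements follow from the base and +CoMPaSS columns of this table. A short check, where `relative_improvement` is an illustrative helper name:

```python
def relative_improvement(base: float, new: float) -> float:
    """Relative change from `base` to `new`, in percent."""
    return (new - base) / base * 100.0


# (base model score, +CoMPaSS score), copied from the table above.
scores = {
    "VISOR uncond": (30.25, 62.06),
    "T2I-CompBench Spatial": (0.13, 0.32),
    "GenEval Position": (0.07, 0.51),
}

for name, (base, new) in scores.items():
    print(f"{name}: {relative_improvement(base, new):+.1f}%")
# VISOR uncond: +105.2%
# T2I-CompBench Spatial: +146.2%
# GenEval Position: +628.6%
```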

## Citation

If you use this model in your research, please cite:
```bibtex
@inproceedings{zhang2025compass,
  title={CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models},
  author={Zhang, Gaoyang and Fu, Bingtao and Fan, Qingnan and Zhang, Qi and Liu, Runxing and Gu, Hong and Zhang, Huaqi and Liu, Xinguo},
  booktitle={ICCV},
  year={2025}
}
```

## Contact

For questions about the model, please contact <blurgy@zju.edu.cn>.

## Download model

Weights for this model are available in Safetensors format.

[./LICENSE]: <./LICENSE>
[code]: <https://github.com/blurgyy/CoMPaSS>
[Project page]: <https://compass.blurgy.xyz>
[arXiv]: <https://arxiv.org/abs/2412.13195>