---
tags:
- text-to-image
- diffusers
widget:
- text: a photo of a laptop above a dog
output:
url: images/laptop-above-dog.jpg
- text: a photo of a potted plant to the right of a motorcycle
output:
url: images/potted_plant-right-motorcycle.jpg
- text: a photo of a sheep below a sink
output:
url: images/sheep-below-sink.jpg
base_model: runwayml/stable-diffusion-v1-5
license: apache-2.0
---
# CoMPaSS-SD1.5
<Gallery />
## Model description
\[[Project Page]\]
\[[code]\]
\[[arXiv]\]
A UNet that enhances the spatial understanding of the StableDiffusion 1.5 text-to-image
diffusion model. Compared with the base model, it is substantially more accurate at generating
images that follow a specified spatial relationship between objects (see Evaluation Results).
## Model Details
- **Base Model**: StableDiffusion 1.5
- **Training Data**: SCOP dataset (curated from COCO)
- **Framework**: Diffusers
- **License**: Apache-2.0 (see [./LICENSE])
## Intended Use
- Generating images with accurate spatial relationships between objects
- Creating compositions that require specific spatial arrangements
- Enhancing the base model's spatial understanding while maintaining its other capabilities
## Performance
### Key Improvements
- VISOR benchmark: +249.6% relative improvement
- T2I-CompBench Spatial: +337.5% relative improvement
- GenEval Position: +1250.0% relative improvement
- Maintains or improves base model's image fidelity (lower FID and CMMD scores than base model)
## Using the Model
See our [GitHub repository][code] to get started.
### Effective Prompting
The model works well with:
- Clear spatial relationship descriptors (left, right, above, below)
- Pairs of distinct objects
- Explicit spatial relationships (e.g., "a photo of A to the right of B")
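
The prompt pattern above can be sketched as a small helper. This is only an illustration, not part of the CoMPaSS codebase: the names `RELATIONS`, `with_article`, and `spatial_prompt` are hypothetical, and the article rule is a simple first-letter heuristic.

```python
# Illustrative helper (hypothetical, not from the CoMPaSS repository): builds
# prompts of the form "a photo of A <relation> B" with two distinct objects.

# Map short relation keywords to the phrasing used in the prompt.
RELATIONS = {
    "left": "to the left of",
    "right": "to the right of",
    "above": "above",
    "below": "below",
}


def with_article(noun: str) -> str:
    """Prefix a noun phrase with 'a' or 'an' using a first-letter heuristic."""
    article = "an" if noun[0].lower() in "aeiou" else "a"
    return f"{article} {noun}"


def spatial_prompt(subject: str, relation: str, reference: str) -> str:
    """Return a prompt stating that `subject` is `relation` the `reference`."""
    if relation not in RELATIONS:
        raise ValueError(f"unknown relation {relation!r}; use one of {sorted(RELATIONS)}")
    if subject == reference:
        raise ValueError("subject and reference should be distinct objects")
    return f"a photo of {with_article(subject)} {RELATIONS[relation]} {with_article(reference)}"


if __name__ == "__main__":
    print(spatial_prompt("laptop", "above", "dog"))
    print(spatial_prompt("potted plant", "right", "motorcycle"))
```

The two calls reproduce widget prompts from this card, such as "a photo of a laptop above a dog".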
## Training Details
### Training Data
- Built using the SCOP (Spatial Constraints-Oriented Pairing) data engine
- ~28,000 curated object pairs from COCO
- Enforces criteria for:
  - Visual significance
  - Semantic distinction
  - Spatial clarity
  - Object relationships
  - Visual balance
### Training Process
- Trained for 24,000 steps
- Effective batch size of 4
- Learning rate: 5e-6
- Optimizer: AdamW with β₁=0.9, β₂=0.999
- Weight decay: 1e-2
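
For reference, the hyperparameters listed above can be collected in a small configuration object. This is a minimal sketch, not the authors' training script: the `TrainingConfig` class and its field names are hypothetical, and `samples_seen` assumes each step consumes exactly one effective batch. The keys returned by `adamw_kwargs` match the keyword arguments of `torch.optim.AdamW`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hypothetical container for the hyperparameters listed in this card."""

    max_steps: int = 24_000
    effective_batch_size: int = 4
    learning_rate: float = 5e-6
    adam_beta1: float = 0.9
    adam_beta2: float = 0.999
    weight_decay: float = 1e-2

    def adamw_kwargs(self) -> dict:
        """Keyword arguments in the form accepted by torch.optim.AdamW."""
        return {
            "lr": self.learning_rate,
            "betas": (self.adam_beta1, self.adam_beta2),
            "weight_decay": self.weight_decay,
        }

    def samples_seen(self) -> int:
        """Training samples processed, assuming one effective batch per step."""
        return self.max_steps * self.effective_batch_size


if __name__ == "__main__":
    config = TrainingConfig()
    print(config.adamw_kwargs())
    print(config.samples_seen())
```

With PyTorch installed, the optimizer could then be built as `torch.optim.AdamW(model.parameters(), **config.adamw_kwargs())`, where `model` stands for whatever module is being fine-tuned.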
## Evaluation Results
| Metric | StableDiffusion 1.5 | +CoMPaSS |
|--------|-------------|-----------|
| VISOR uncond (⬆️) | 17.58% | **61.46%** |
| T2I-CompBench Spatial (⬆️) | 0.08 | **0.35** |
| GenEval Position (⬆️) | 0.04 | **0.54** |
| FID (⬇️) | 12.82 | **10.89** |
| CMMD (⬇️) | 0.5548 | **0.3235** |
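
The relative improvements listed under Key Improvements follow from the baseline and +CoMPaSS columns of this table as (improved − baseline) / baseline. A short sketch of that arithmetic, using a hypothetical helper `relative_improvement`:

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# (baseline, +CoMPaSS) pairs copied from the table above.
RESULTS = {
    "VISOR uncond": (17.58, 61.46),
    "T2I-CompBench Spatial": (0.08, 0.35),
    "GenEval Position": (0.04, 0.54),
}

if __name__ == "__main__":
    for metric, (baseline, improved) in RESULTS.items():
        gain = relative_improvement(baseline, improved)
        print(f"{metric}: +{gain:.1f}% relative improvement")
```

FID and CMMD are lower-is-better metrics, so they are left out of this calculation.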
## Citation
If you use this model in your research, please cite:
```bibtex
@inproceedings{zhang2025compass,
title={CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models},
author={Zhang, Gaoyang and Fu, Bingtao and Fan, Qingnan and Zhang, Qi and Liu, Runxing and Gu, Hong and Zhang, Huaqi and Liu, Xinguo},
booktitle={ICCV},
year={2025}
}
```
## Contact
For questions about the model, please contact <blurgy@zju.edu.cn>.
## Download model
Weights for this model are available in Safetensors format.
[./LICENSE]: <./LICENSE>
[Project page]: <https://compass.blurgy.xyz>
[code]: <https://github.com/blurgyy/CoMPaSS>
[arXiv]: <https://arxiv.org/abs/2412.13195>