Add model card

Signed-off-by: Gaoyang Zhang <gy@blurgy.xyz>

Files changed (4)

README.md +121 -0
images/laptop-above-dog.jpg +3 -0
images/potted_plant-right-motorcycle.jpg +3 -0
images/sheep-below-sink.jpg +3 -0

README.md ADDED

	@@ -0,0 +1,121 @@

+---
+tags:
+- text-to-image
+- diffusers
+widget:
+- text: a photo of a laptop above a dog
+  output:
+    url: images/laptop-above-dog.jpg
+- text: a photo of a potted plant to the right of a motorcycle
+  output:
+    url: images/potted_plant-right-motorcycle.jpg
+- text: a photo of a sheep below a sink
+  output:
+    url: images/sheep-below-sink.jpg
+base_model: stabilityai/stable-diffusion-2-1
+license: apache-2.0
+---
+# CoMPaSS-SD2.1
+<Gallery />
+## Model description
+\[[Project Page]\]
+\[[code]\]
+\[[arXiv]\]
+A UNet that enhances the spatial understanding capabilities of the StableDiffusion 2.1 text-to-image
+diffusion model. It shows significant improvements when generating images that depict specific
+spatial relationships between objects.
+## Model Details
+- **Base Model**: StableDiffusion 2.1
+- **Training Data**: SCOP dataset (curated from COCO)
+- **Framework**: Diffusers
+- **License**: Apache-2.0 (see [./LICENSE])
+## Intended Use
+- Generating images with accurate spatial relationships between objects
+- Creating compositions that require specific spatial arrangements
+- Enhancing the base model's spatial understanding while maintaining its other capabilities
+## Performance
+### Key Improvements
+- VISOR benchmark: +105.2% relative improvement
+- T2I-CompBench Spatial: +146.2% relative improvement
+- GenEval Position: +628.6% relative improvement
+- Maintains or improves the base model's image fidelity (lower FID and CMMD scores than the base model)
+## Using the Model
+See our [GitHub repository][code] to get started.
+### Effective Prompting
+The model works well with:
+- Clear spatial relationship descriptors (left, right, above, below)
+- Pairs of distinct objects
+- Explicit spatial relationships (e.g., "a photo of A to the right of B")
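The prompt pattern above can be sketched as a small helper. This is only an illustration: `build_spatial_prompt` and `RELATION_PHRASES` are hypothetical names that are not part of the CoMPaSS code base, and the phrase for each relation is an assumption modeled on the example prompts in this card.

```python
# Hypothetical helper for composing prompts in the "a photo of A <relation> B" form.
# The relation-to-phrase mapping is an assumption based on this card's example prompts.
RELATION_PHRASES = {
    "left": "to the left of",
    "right": "to the right of",
    "above": "above",
    "below": "below",
}


def with_article(noun: str) -> str:
    """Prefix a noun with 'a' or 'an' using a simple first-letter heuristic."""
    article = "an" if noun[0].lower() in "aeiou" else "a"
    return f"{article} {noun}"


def build_spatial_prompt(obj_a: str, relation: str, obj_b: str) -> str:
    """Build a prompt that places obj_a in the given relation to obj_b."""
    if relation not in RELATION_PHRASES:
        raise ValueError(f"unsupported relation: {relation!r}")
    if obj_a == obj_b:
        # The card recommends pairs of distinct objects.
        raise ValueError("obj_a and obj_b should be distinct objects")
    phrase = RELATION_PHRASES[relation]
    return f"a photo of {with_article(obj_a)} {phrase} {with_article(obj_b)}"


prompt = build_spatial_prompt("potted plant", "right", "motorcycle")
```

With these inputs, `prompt` matches the second example prompt listed in this card's metadata.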
+## Training Details
+### Training Data
+- Built using the SCOP (Spatial Constraints-Oriented Pairing) data engine
+- ~28,000 curated object pairs from COCO
+- Enforces criteria for:
+  - Visual significance
+  - Semantic distinction
+  - Spatial clarity
+  - Object relationships
+  - Visual balance
+### Training Process
+- Trained for 80,000 steps
+- Effective batch size of 4
+- Learning rate: 5e-6
+- Optimizer: AdamW with β₁=0.9, β₂=0.999
+- Weight decay: 1e-2
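For reference, the hyperparameters listed above can be collected in a small configuration object. This is a minimal sketch, not the authors' training script: `TrainingConfig` and its field names are hypothetical, and the estimate of passes over the data assumes that each of the roughly 28,000 curated pairs is a single training sample, which this card does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hypothetical container for the hyperparameters listed in this card."""

    max_steps: int = 80_000
    effective_batch_size: int = 4
    learning_rate: float = 5e-6
    adam_beta1: float = 0.9
    adam_beta2: float = 0.999
    weight_decay: float = 1e-2

    @property
    def samples_seen(self) -> int:
        """Total number of samples processed over the whole run."""
        return self.max_steps * self.effective_batch_size


config = TrainingConfig()

# Assumption: one training sample per curated object pair (~28,000 pairs).
approx_dataset_size = 28_000
approx_passes = config.samples_seen / approx_dataset_size
```

Under that assumption, the run processes 320,000 samples, or roughly 11 passes over the curated pairs.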
+## Evaluation Results
+| Metric | StableDiffusion 2.1 | +CoMPaSS |
+|--------|-------------|-----------|
+| VISOR uncond (⬆️) | 30.25% | **62.06%** |
+| T2I-CompBench Spatial (⬆️) | 0.13 | **0.32** |
+| GenEval Position (⬆️) | 0.07 | **0.51** |
+| FID (⬇️) | 21.65 | **16.96** |
+| CMMD (⬇️) | 0.6472 | **0.4083** |
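The relative improvements under Key Improvements follow from the base model and +CoMPaSS columns of the table above. A minimal sketch of the arithmetic (`relative_improvement` is a hypothetical helper name):

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the relative improvement over the baseline, in percent."""
    return (improved - baseline) / baseline * 100.0


# (baseline, +CoMPaSS) pairs copied from the evaluation table.
results = {
    "VISOR uncond": (30.25, 62.06),
    "T2I-CompBench Spatial": (0.13, 0.32),
    "GenEval Position": (0.07, 0.51),
}

for metric, (baseline, improved) in results.items():
    gain = relative_improvement(baseline, improved)
    print(f"{metric}: +{gain:.1f}%")
```

Rounded to one decimal place, the three values come out to 105.2%, 146.2%, and 628.6%, matching the figures listed under Key Improvements.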
+## Citation
+If you use this model in your research, please cite:
+```bibtex
+@article{zhang2024compass,
+  title={CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models},
+  author={Zhang, Gaoyang and Fu, Bingtao and Fan, Qingnan and Zhang, Qi and Liu, Runxing and Gu, Hong and Zhang, Huaqi and Liu, Xinguo},
+  journal={arXiv preprint arXiv:2412.13195},
+  year={2024}
+}
+```
+## Contact
+For questions about the model, please contact <blurgy@zju.edu.cn>.
+## Download model
+Weights for this model are available in Safetensors format.
+[./LICENSE]: <./LICENSE>
+[code]: <https://github.com/blurgyy/CoMPaSS>
+[Project page]: <https://compass.blurgy.xyz>
+[arXiv]: <https://arxiv.org/abs/2412.13195>

images/laptop-above-dog.jpg ADDED

Git LFS Details

SHA256: 06b8ac5c9f327eaa40d49462c7cc8216baeff068e864e3dca827477e3fc2a9a9
Pointer size: 130 Bytes
Size of remote file: 36.4 kB

images/potted_plant-right-motorcycle.jpg ADDED

Git LFS Details

SHA256: db03d30dc92401497307bbc726cc7cbdb741d0190d705827b50fb6b1b378f740
Pointer size: 130 Bytes
Size of remote file: 51.6 kB

images/sheep-below-sink.jpg ADDED

Git LFS Details

SHA256: 4bc5a64cb305e7e444198d4d7c4b24230c12e5fc9a107df43cea918d6540bc23
Pointer size: 130 Bytes
Size of remote file: 31.7 kB