+---
+license: apache-2.0
+tags:
+  - controlnet
+  - stable-diffusion
+  - satellite-imagery
+  - osm
+  - image-to-image
+  - diffusers
+base_model: stabilityai/stable-diffusion-2-1-base
+pipeline_tag: image-to-image
+library_name: diffusers
+---
+# VectorSynth-COSA
+**VectorSynth-COSA** is a ControlNet model that generates satellite imagery from OpenStreetMap (OSM) vector data embeddings. It conditions [Stable Diffusion 2.1 Base](https://huggingface.co/stabilityai/stable-diffusion-2-1-base) on rendered OSM text using the COSA (Contrastive OSM-Satellite Alignment) embedding space.
+## Model Description
+VectorSynth-COSA uses a two-stage pipeline:
+1. **RenderEncoder**: Projects 768-dim COSA embeddings to 3-channel control images
+2. **ControlNet**: Conditions Stable Diffusion 2.1 Base on the rendered control images
+This model uses COSA embeddings for improved semantic alignment between OSM text and satellite imagery. For the standard CLIP embedding variant, see [VectorSynth](https://huggingface.co/MVRL/VectorSynth).
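To make the first stage concrete, the sketch below shows the tensor shapes involved in turning per-pixel 768-dim embeddings into a 3-channel control image. `ToyRenderEncoder` is a hypothetical stand-in that uses a single 1x1 convolution; the actual RenderEncoder architecture is defined in `render.py` and may look quite different.

```python
import torch
import torch.nn as nn


class ToyRenderEncoder(nn.Module):
    """Hypothetical stand-in: a per-pixel linear map from 768 to 3 channels.

    This is NOT the released RenderEncoder; it only illustrates the shapes.
    """

    def __init__(self, embed_dim: int = 768, out_channels: int = 3):
        super().__init__()
        # A 1x1 convolution applies the same linear projection at every pixel.
        self.proj = nn.Conv2d(embed_dim, out_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)


# Random values stand in for real per-pixel embeddings, laid out as (N, 768, H, W).
embeddings = torch.randn(1, 768, 64, 64)
encoder = ToyRenderEncoder().eval()

with torch.no_grad():
    # The sigmoid squashes the projection into [0, 1], like an RGB control image.
    control_image = encoder(embeddings).sigmoid()

print(control_image.shape)  # torch.Size([1, 3, 64, 64])
```

The spatial size is unchanged by the projection, so the control image lines up pixel for pixel with the embedding map.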
+## Usage
+```python
+import torch
+from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, DDIMScheduler
+from huggingface_hub import hf_hub_download
+device = "cuda"
+# Load ControlNet
+controlnet = ControlNetModel.from_pretrained("MVRL/VectorSynth-COSA", torch_dtype=torch.float16)
+# Load pipeline
+pipe = StableDiffusionControlNetPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-2-1-base",
+    controlnet=controlnet,
+    torch_dtype=torch.float16
+)
+pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
+pipe = pipe.to(device)
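+# Load RenderEncoder. The checkpoint stores the pickled module itself, hence
+# weights_only=False; this likely requires the RenderEncoder class from
+# render.py to be importable when torch.load runs.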
+render_path = hf_hub_download("MVRL/VectorSynth-COSA", "render_encoder/cosa-render_encoder.pth")
+checkpoint = torch.load(render_path, map_location=device, weights_only=False)
+render_encoder = checkpoint['model'].to(device).eval()
+# The hint tensor must have shape (H, W, 768): one COSA embedding per pixel
+# hint = torch.load("your_hint.pt").to(device)
+# hint = hint.unsqueeze(0).permute(0, 3, 1, 2)  # (1, 768, H, W)
+# with torch.no_grad():
+#     control_image = render_encoder(hint).sigmoid()
+# Generate
+# output = pipe(
+#     prompt="Satellite image of a city neighborhood",
+#     image=control_image,
+#     num_inference_steps=40,
+#     guidance_scale=7.5
+# ).images[0]
+```
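Before running the encoder, it can help to check the hint layout explicitly. The helper below is a minimal sketch of that step: `prepare_hint` is a hypothetical name, not part of this repository, and the divisible-by-8 check is an assumption based on Stable Diffusion's VAE downsampling images by a factor of 8.

```python
import torch

EMBED_DIM = 768  # embedding width stated in this model card


def prepare_hint(hint: torch.Tensor, embed_dim: int = EMBED_DIM) -> torch.Tensor:
    """Convert a (H, W, C) hint into the (1, C, H, W) layout used above.

    Hypothetical helper for illustration only.
    """
    if hint.ndim != 3 or hint.shape[-1] != embed_dim:
        raise ValueError(
            f"expected shape (H, W, {embed_dim}), got {tuple(hint.shape)}"
        )
    height, width = hint.shape[:2]
    # Assumption: sides should be multiples of 8 to match the latent grid.
    if height % 8 != 0 or width % 8 != 0:
        raise ValueError(f"H and W should be multiples of 8, got {height}x{width}")
    # Add a batch axis, then move channels first: (1, C, H, W).
    return hint.unsqueeze(0).permute(0, 3, 1, 2).contiguous()


# Zeros stand in for real embeddings; a small map keeps the example light.
batched = prepare_hint(torch.zeros(64, 64, 768))
print(batched.shape)  # torch.Size([1, 768, 64, 64])
```

The returned tensor can be passed to `render_encoder` in place of the manual `unsqueeze` and `permute` calls in the usage snippet.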
+## Files
+- `config.json` - ControlNet configuration
+- `diffusion_pytorch_model.safetensors` - ControlNet weights
+- `render_encoder/cosa-render_encoder.pth` - RenderEncoder weights
+- `render.py` - RenderEncoder class definition
+## Citation
+```bibtex
+@misc{cher2025vectorsynth,
+  title={VectorSynth: Fine-Grained Satellite Image Synthesis with Structured Semantics},
+  author={Cher, Daniel and Wei, Brian and Sastry, Srikumar and Jacobs, Nathan},
+  year={2025},
+  eprint={2511.07744},
+  archivePrefix={arXiv},
+  note={arXiv preprint}
+}
+```
+## Related Models
+- [VectorSynth](https://huggingface.co/MVRL/VectorSynth) - Standard CLIP embedding variant
+- [GeoSynth](https://huggingface.co/MVRL/GeoSynth) - Text-to-satellite image generation