---
license: apache-2.0
tags:
  - controlnet
  - stable-diffusion
  - satellite-imagery
  - osm
  - image-to-image
  - diffusers
base_model: stabilityai/stable-diffusion-2-1-base
pipeline_tag: image-to-image
library_name: diffusers
---

# VectorSynth-COSA

VectorSynth-COSA is a ControlNet model that generates satellite imagery from OpenStreetMap (OSM) vector data embeddings. It conditions Stable Diffusion 2.1 Base on control images rendered from OSM text embeddings in the COSA (Contrastive OSM-Satellite Alignment) embedding space.

## Model Description

VectorSynth-COSA uses a two-stage pipeline:

1. **RenderEncoder**: projects 768-dim COSA embeddings to 3-channel control images
2. **ControlNet**: conditions Stable Diffusion 2.1 Base on the rendered control images

This model uses COSA embeddings for improved semantic alignment between OSM text and satellite imagery. For the standard CLIP embedding variant, see VectorSynth.
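The two-stage flow can be sketched shape-wise with a minimal NumPy stand-in. Note this is illustrative only: the real RenderEncoder is a learned network shipped in this repo, while `render_encoder_sketch` and its random projection matrix below are hypothetical placeholders that only reproduce the input/output shapes.

```python
import numpy as np

def render_encoder_sketch(hint, weight):
    """Project per-pixel 768-dim COSA embeddings to a 3-channel control image.

    hint:   (H, W, 768) array of per-pixel embeddings
    weight: (768, 3) projection matrix (learned in the real model; random here)
    """
    logits = hint @ weight                    # (H, W, 3)
    control = 1.0 / (1.0 + np.exp(-logits))   # sigmoid, matching the usage snippet below
    return control

rng = np.random.default_rng(0)
hint = rng.standard_normal((64, 64, 768)).astype(np.float32)      # dummy embeddings
weight = rng.standard_normal((768, 3)).astype(np.float32) * 0.01  # dummy projection
control_image = render_encoder_sketch(hint, weight)
print(control_image.shape)  # (64, 64, 3), values in (0, 1)
```

The resulting 3-channel image in [0, 1] is what the ControlNet branch consumes, which is why it can plug into a standard `StableDiffusionControlNetPipeline` unchanged.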

## Usage

```python
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, DDIMScheduler
from huggingface_hub import hf_hub_download

device = "cuda"

# Load ControlNet
controlnet = ControlNetModel.from_pretrained("MVRL/VectorSynth-COSA", torch_dtype=torch.float16)

# Load pipeline
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base",
    controlnet=controlnet,
    torch_dtype=torch.float16
)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to(device)

# Load RenderEncoder
render_path = hf_hub_download("MVRL/VectorSynth-COSA", "render_encoder/cosa-render_encoder.pth")
checkpoint = torch.load(render_path, map_location=device, weights_only=False)
render_encoder = checkpoint['model'].to(device).eval()

# Your hint tensor should be (H, W, 768) - per-pixel COSA embeddings
# hint = torch.load("your_hint.pt").to(device)
# hint = hint.unsqueeze(0).permute(0, 3, 1, 2)  # (1, 768, H, W)

# with torch.no_grad():
#     control_image = render_encoder(hint).sigmoid()

# Generate
# output = pipe(
#     prompt="Satellite image of a city neighborhood",
#     image=control_image,
#     num_inference_steps=40,
#     guidance_scale=7.5
# ).images[0]
```
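The commented-out preprocessing step above converts the hint from channels-last to channels-first before it enters the RenderEncoder. A torch-free NumPy sketch of just that reshape (shapes only; the zero-filled `hint` is a placeholder for real per-pixel embeddings):

```python
import numpy as np

H, W = 64, 64
hint = np.zeros((H, W, 768), dtype=np.float32)  # placeholder per-pixel COSA embeddings

# (H, W, 768) -> (1, 768, H, W): add a batch dim, then move channels first
batched = hint[None, ...]                  # (1, H, W, 768), like tensor.unsqueeze(0)
chw = np.transpose(batched, (0, 3, 1, 2))  # (1, 768, H, W), like .permute(0, 3, 1, 2)
print(chw.shape)  # (1, 768, 64, 64)
```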

## Files

- `config.json` - ControlNet configuration
- `diffusion_pytorch_model.safetensors` - ControlNet weights
- `render_encoder/cosa-render_encoder.pth` - RenderEncoder weights
- `render.py` - RenderEncoder class definition

## Citation

```bibtex
@inproceedings{cher2025vectorsynth,
  title={VectorSynth: Fine-Grained Satellite Image Synthesis with Structured Semantics},
  author={Cher, Daniel and Wei, Brian and Sastry, Srikumar and Jacobs, Nathan},
  year={2025},
  eprint={arXiv:2511.07744},
  note={arXiv preprint}
}
```

## Related Models