---
license: apache-2.0
tags:
- controlnet
- stable-diffusion
- satellite-imagery
- osm
- image-to-image
- diffusers
base_model: stabilityai/stable-diffusion-2-1-base
pipeline_tag: image-to-image
library_name: diffusers
---

# VectorSynth-GiT10M

**VectorSynth-GiT10M** is a ControlNet-based pipeline that generates satellite imagery from OpenStreetMap (OSM) vector data, fine-tuned on the GiT10M dataset of paired OSM + satellite tiles. Like [VectorSynth-COSA](https://huggingface.co/MVRL/VectorSynth-COSA), it conditions [Stable Diffusion 2.1 Base](https://huggingface.co/stabilityai/stable-diffusion-2-1-base) on rendered OSM text using the COSA (Contrastive OSM-Satellite Alignment) embedding space.

## Model Description

VectorSynth-GiT10M uses a two-stage pipeline:

1. **RenderEncoder**: Projects 768-dim COSA embeddings to 3-channel control images.
2. **ControlNet + UNet**: Both fine-tuned on the GiT10M dataset to condition Stable Diffusion 2.1 on the rendered control images.

Unlike `VectorSynth-COSA`, which ships only a fine-tuned ControlNet on top of the stock SD 2.1 UNet, this model additionally fine-tunes the UNet on GiT10M, so users should load the full pipeline from this repo rather than from `stable-diffusion-2-1-base`.

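The first stage maps each pixel's embedding to three control channels. As a rough, dependency-free illustration of that shape contract, the sketch below applies one shared linear map to every pixel of a tiny toy hint. It is not the actual `RenderEncoder` (that class lives in `render.py` and may well be a deeper, nonlinear network); `project_pixels`, `weights`, and `bias` are hypothetical names, and the toy embedding size of 4 stands in for the real 768.

```python
from typing import List

def project_pixels(
    hint: List[List[List[float]]],
    weights: List[List[float]],
    bias: List[float],
) -> List[List[List[float]]]:
    """Apply the same linear map to every pixel embedding.

    hint:    H x W x D nested lists (per-pixel embeddings, channels-last)
    weights: C x D nested lists (one row per output channel)
    bias:    length-C list
    returns: C x H x W nested lists (channels-first control image)
    """
    height, width = len(hint), len(hint[0])
    channels = len(weights)
    out = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for y in range(height):
        for x in range(width):
            embedding = hint[y][x]
            for c in range(channels):
                dot = sum(w * e for w, e in zip(weights[c], embedding))
                out[c][y][x] = dot + bias[c]
    return out

# Toy example: a 1 x 2 "image" with 4-dim embeddings (the real model uses 768).
toy_hint = [[[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]]
toy_weights = [
    [1.0, 0.0, 0.0, 0.0],  # channel 0 reads embedding dim 0
    [0.0, 2.0, 0.0, 0.0],  # channel 1 reads embedding dim 1, scaled
    [0.0, 0.0, 3.0, 0.0],  # channel 2 reads embedding dim 2, scaled
]
toy_bias = [0.0, 0.0, 0.5]
control = project_pixels(toy_hint, toy_weights, toy_bias)
print(len(control), len(control[0]), len(control[0][0]))  # channels, height, width
```

The point is only the layout change: a channels-last `(H, W, D)` hint becomes a channels-first `(3, H, W)` control image, which is the form a ControlNet consumes.
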
## Usage

```python
import sys
import torch
from diffusers import StableDiffusionControlNetPipeline, DDIMScheduler
from huggingface_hub import snapshot_download

device = "cuda"

# Load pipeline (GiT10M-finetuned UNet + ControlNet, plus base SD 2.1 VAE/text encoder)
local_dir = snapshot_download("MVRL/VectorSynth-GiT10M")
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    local_dir, torch_dtype=torch.float16
)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to(device)

# Load RenderEncoder
sys.path.insert(0, local_dir)
from render import RenderEncoder

checkpoint = torch.load(
    f"{local_dir}/render_encoder/cosa-render_encoder.pth",
    map_location=device,
    weights_only=False,
)
render_encoder = RenderEncoder(**checkpoint['config']).to(device).eval()
render_encoder.load_state_dict(checkpoint['state_dict'])

# Your hint tensor should be (H, W, 768) - per-pixel COSA embeddings
# hint = torch.load("your_hint.pt").to(device)
# hint = hint.unsqueeze(0).permute(0, 3, 1, 2)  # (1, 768, H, W)

# with torch.no_grad():
#     control_image = render_encoder(hint)

# Generate
# output = pipe(
#     prompt="An aerial image of a residential neighborhood",
#     image=control_image,
#     num_inference_steps=40,
#     guidance_scale=7.5
# ).images[0]
```

## Files

- `unet/`: GiT10M-fine-tuned UNet (`diffusion_pytorch_model.safetensors`)
- `controlnet/`: GiT10M-fine-tuned ControlNet
- `render_encoder/cosa-render_encoder.pth`: RenderEncoder weights (COSA 768→3)
- `render.py`: RenderEncoder class definition
- `vae/`, `text_encoder/`, `tokenizer/`, `scheduler/`, `feature_extractor/`: copied from SD 2.1 Base (unmodified)

## Training Data

Fine-tuned on **GiT10M**, a curated collection of paired OpenStreetMap vector data and Google satellite tiles (zoom 17, ~1m/pix). The dataset is split into a training set and two held-out test splits (random and spatial) for evaluation.

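The "~1m/pix" figure for zoom 17 is consistent with standard Web Mercator tiling arithmetic. The sketch below works that out, assuming 256-pixel tiles (the tile size is not stated here, so treat it as an assumption); `ground_resolution` is an illustrative helper, not part of this repo.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS84 equatorial radius used by Web Mercator
TILE_SIZE_PX = 256          # assumption: standard 256-pixel tiles

def ground_resolution(zoom: int, latitude_deg: float = 0.0) -> float:
    """Approximate metres per pixel for a Web Mercator tile pyramid."""
    equator_circumference = 2 * math.pi * EARTH_RADIUS_M
    metres_per_pixel_at_zoom0 = equator_circumference / TILE_SIZE_PX
    # Each zoom level halves the pixel size; Mercator scale shrinks with cos(latitude).
    return metres_per_pixel_at_zoom0 * math.cos(math.radians(latitude_deg)) / (2 ** zoom)

for lat in (0, 40, 60):
    print(f"zoom 17, lat {lat:>2}: {ground_resolution(17, lat):.3f} m/pixel")
```

Under these assumptions the resolution is about 1.19 m/pixel at the equator and about 0.60 m/pixel at 60° latitude, so "~1m/pix" is a reasonable mid-latitude summary.
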
See [GeoDiT: Point Conditioned Diffusion Transformer for Satellite Image Synthesis](https://arxiv.org/html/2603.02172v1) for more details on the data.

## Citation

```bibtex
@inproceedings{cher2025vectorsynth,
  title={VectorSynth: Fine-Grained Satellite Image Synthesis with Structured Semantics},
  author={Cher, Daniel and Wei, Brian and Sastry, Srikumar and Jacobs, Nathan},
  year={2025},
  eprint={arXiv:2511.07744},
  note={arXiv preprint}
}
```

## Related Models

- [VectorSynth-COSA](https://huggingface.co/MVRL/VectorSynth-COSA): trained on smaller cities dataset
- [VectorSynth](https://huggingface.co/MVRL/VectorSynth): standard CLIP embedding variant
- [GeoSynth](https://huggingface.co/MVRL/GeoSynth): text-to-satellite image generation