| license: apache-2.0 | |
| tags: | |
| - controlnet | |
| - stable-diffusion | |
| - satellite-imagery | |
| - osm | |
| - image-to-image | |
| - diffusers | |
| base_model: stabilityai/stable-diffusion-2-1-base | |
| pipeline_tag: image-to-image | |
| library_name: diffusers | |
| # VectorSynth-GiT10M | |
| **VectorSynth-GiT10M** is a ControlNet-based pipeline that generates satellite imagery from OpenStreetMap (OSM) vector data, fine-tuned on the GiT10M dataset of paired OSM + satellite tiles. Like [VectorSynth-COSA](https://huggingface.co/MVRL/VectorSynth-COSA), it conditions [Stable Diffusion 2.1 Base](https://huggingface.co/stabilityai/stable-diffusion-2-1-base) on rendered OSM text using the COSA (Contrastive OSM-Satellite Alignment) embedding space. | |
| ## Model Description | |
| VectorSynth-GiT10M uses a two-stage pipeline: | |
| 1. **RenderEncoder**: Projects 768-dim COSA embeddings to 3-channel control images. | |
| 2. **ControlNet + UNet**: Both fine-tuned on the GiT10M dataset to condition Stable Diffusion 2.1 on the rendered control images. | |
| Unlike `VectorSynth-COSA`, which ships only a fine-tuned ControlNet on top of the stock SD 2.1 UNet, this model additionally fine-tunes the UNet on GiT10M, so load the full pipeline from this repo rather than from `stable-diffusion-2-1-base`. | |
| ## Usage | |
| ```python | |
| import sys | |
| import torch | |
| from diffusers import StableDiffusionControlNetPipeline, DDIMScheduler | |
| from huggingface_hub import snapshot_download | |
| device = "cuda" | |
| # Load pipeline (GiT10M-finetuned UNet + ControlNet, plus base SD 2.1 VAE/text encoder) | |
| local_dir = snapshot_download("MVRL/VectorSynth-GiT10M") | |
| pipe = StableDiffusionControlNetPipeline.from_pretrained( | |
|     local_dir, | |
|     torch_dtype=torch.float16, | |
| ) | |
| pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config) | |
| pipe = pipe.to(device) | |
| # Load RenderEncoder | |
| sys.path.insert(0, local_dir) | |
| from render import RenderEncoder | |
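| # weights_only=False unpickles arbitrary Python objects, so only load checkpoints you trust | |
| checkpoint = torch.load( | |
|     f"{local_dir}/render_encoder/cosa-render_encoder.pth", | |
|     map_location=device, | |
|     weights_only=False, | |
| ) | |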
| render_encoder = RenderEncoder(**checkpoint['config']).to(device).eval() | |
| render_encoder.load_state_dict(checkpoint['state_dict']) | |
| # Your hint tensor should be (H, W, 768): per-pixel COSA embeddings. | |
| # Uncomment the lines below once you have a hint tensor saved to disk. | |
| # hint = torch.load("your_hint.pt").to(device) | |
| # hint = hint.unsqueeze(0).permute(0, 3, 1, 2)  # (1, 768, H, W) | |
| # with torch.no_grad(): | |
| #     control_image = render_encoder(hint) | |
| # Generate | |
| # output = pipe( | |
| #     prompt="An aerial image of a residential neighborhood", | |
| #     image=control_image, | |
| #     num_inference_steps=40, | |
| #     guidance_scale=7.5, | |
| # ).images[0] | |
| ``` | |
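The hint layout described in the usage snippet is easy to get wrong (for example, passing a channels-first tensor). The following is a minimal sketch of a shape check that mirrors the `unsqueeze(0).permute(0, 3, 1, 2)` step. The helper names `hint_to_nchw_shape` and `is_vae_aligned` are hypothetical and not part of this repo. The divisibility check assumes the usual Stable Diffusion VAE downsampling factor of 8 and that the RenderEncoder preserves spatial size, which this card does not state.

```python
from typing import Tuple

COSA_DIM = 768   # embedding width stated in this card
VAE_FACTOR = 8   # Stable Diffusion's VAE downsamples each spatial side by 8


def hint_to_nchw_shape(hint_shape: Tuple[int, ...]) -> Tuple[int, int, int, int]:
    """Return the (1, 768, H, W) shape that unsqueeze(0).permute(0, 3, 1, 2) yields.

    Hypothetical helper: raises ValueError unless the layout is (H, W, 768).
    """
    if len(hint_shape) != 3 or hint_shape[2] != COSA_DIM:
        raise ValueError(f"expected (H, W, {COSA_DIM}), got {tuple(hint_shape)}")
    height, width = hint_shape[0], hint_shape[1]
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    return (1, COSA_DIM, height, width)


def is_vae_aligned(height: int, width: int) -> bool:
    """True when both sides are multiples of 8, so no rounding or resizing is needed."""
    return height % VAE_FACTOR == 0 and width % VAE_FACTOR == 0
```

With a real tensor, this could be called as `hint_to_nchw_shape(tuple(hint.shape))` before the permute, so a wrong layout fails early with a readable message instead of deep inside the encoder.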
| ## Files | |
| - `unet/`: GiT10M-fine-tuned UNet (`diffusion_pytorch_model.safetensors`) | |
| - `controlnet/`: GiT10M-fine-tuned ControlNet | |
| - `render_encoder/cosa-render_encoder.pth`: RenderEncoder weights (COSA 768→3) | |
| - `render.py`: RenderEncoder class definition | |
| - `vae/`, `text_encoder/`, `tokenizer/`, `scheduler/`, `feature_extractor/`: copied from SD 2.1 Base (unmodified) | |
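Because the usage code imports `render.py` and loads the RenderEncoder checkpoint straight from the snapshot directory, it can help to confirm the download is complete before loading anything. The sketch below checks only the entries listed above. `missing_entries` is a hypothetical helper, and it does not validate file contents.

```python
from pathlib import Path
from typing import List

# Entries taken from the file list in this card.
EXPECTED_DIRS = (
    "unet", "controlnet", "vae", "text_encoder",
    "tokenizer", "scheduler", "feature_extractor",
)
EXPECTED_FILES = ("render.py", "render_encoder/cosa-render_encoder.pth")


def missing_entries(local_dir: str) -> List[str]:
    """Return the expected directories and files that are absent from local_dir."""
    root = Path(local_dir)
    missing = [f"{name}/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    missing += [name for name in EXPECTED_FILES if not (root / name).is_file()]
    return missing
```

An empty list from `missing_entries(local_dir)` means every listed entry is present; anything else names what to re-download.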
| ## Training Data | |
| Fine-tuned on **GiT10M**, a curated collection of paired OpenStreetMap vector data and Google satellite tiles (zoom 17, ~1 m/pixel). The dataset is split into a training set and two held-out test splits (random and spatial) for evaluation. See [GeoDiT: Point Conditioned Diffusion Transformer for Satellite Image Synthesis](https://arxiv.org/html/2603.02172v1) for more details on the data. | |
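The "~1 m/pixel" figure can be sanity-checked with the standard Web Mercator ground-resolution formula. This is a sketch that assumes the tiles follow the common 256-pixel Web Mercator tiling scheme; the card itself only states the zoom level and the approximate resolution. `ground_resolution_m_per_px` is a hypothetical helper name.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS 84 equatorial radius used by Web Mercator
TILE_SIZE_PX = 256          # assumed tile size (not stated in this card)


def ground_resolution_m_per_px(lat_deg: float, zoom: int,
                               tile_size: int = TILE_SIZE_PX) -> float:
    """Approximate metres per pixel of a Web Mercator tile at a given latitude."""
    circumference = 2.0 * math.pi * EARTH_RADIUS_M
    return circumference * math.cos(math.radians(lat_deg)) / (tile_size * 2 ** zoom)
```

Under these assumptions, zoom 17 gives roughly 1.19 m/pixel at the equator and roughly 0.91 m/pixel at 40° latitude, which is consistent with the stated ~1 m/pixel.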
| ## Citation | |
| ```bibtex | |
| @inproceedings{cher2025vectorsynth, | |
| title={VectorSynth: Fine-Grained Satellite Image Synthesis with Structured Semantics}, | |
| author={Cher, Daniel and Wei, Brian and Sastry, Srikumar and Jacobs, Nathan}, | |
| year={2025}, | |
| eprint={arXiv:2511.07744}, | |
| note={arXiv preprint} | |
| } | |
| ``` | |
| ## Related Models | |
| - [VectorSynth-COSA](https://huggingface.co/MVRL/VectorSynth-COSA): trained on smaller cities dataset | |
| - [VectorSynth](https://huggingface.co/MVRL/VectorSynth): standard CLIP embedding variant | |
| - [GeoSynth](https://huggingface.co/MVRL/GeoSynth): text-to-satellite image generation | |