	---
	license: apache-2.0
	tags:
	- controlnet
	- stable-diffusion
	- satellite-imagery
	- osm
	- image-to-image
	- diffusers
	base_model: stabilityai/stable-diffusion-2-1-base
	pipeline_tag: image-to-image
	library_name: diffusers
	---

	# VectorSynth-GiT10M

	VectorSynth-GiT10M is a ControlNet-based pipeline that generates satellite imagery from OpenStreetMap (OSM) vector data, fine-tuned on the GiT10M dataset of paired OSM + satellite tiles. Like [VectorSynth-COSA](https://huggingface.co/MVRL/VectorSynth-COSA), it conditions [Stable Diffusion 2.1 Base](https://huggingface.co/stabilityai/stable-diffusion-2-1-base) on rendered OSM text using the COSA (Contrastive OSM-Satellite Alignment) embedding space.

	## Model Description

	VectorSynth-GiT10M uses a two-stage pipeline:
	1. RenderEncoder: Projects 768-dim COSA embeddings to 3-channel control images.
	2. ControlNet + UNet: Both fine-tuned on the GiT10M dataset to condition Stable Diffusion 2.1 on the rendered control images.
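
To make the first stage concrete, here is a minimal pure-Python sketch of a per-pixel projection from 768-dim embeddings to 3 channels. It is only an illustration of the shape transformation: the linear map, the random weights, and the helper names (`make_projection`, `project_pixel`, `render_control_image`) are assumptions, and the actual `RenderEncoder` architecture is defined in `render.py` and may differ.

```python
import random

EMBED_DIM = 768   # COSA embedding size stated in this card
OUT_CHANNELS = 3  # control image channels stated in this card


def make_projection(embed_dim, out_channels, seed=0):
    """Random stand-in weights (hypothetical; the real weights are learned)."""
    rng = random.Random(seed)
    scale = embed_dim ** -0.5
    return [[rng.uniform(-scale, scale) for _ in range(embed_dim)]
            for _ in range(out_channels)]


def project_pixel(weights, embedding):
    """Apply one linear map: a D-dim embedding becomes len(weights) values."""
    return [sum(w * x for w, x in zip(row, embedding)) for row in weights]


def render_control_image(weights, hint):
    """Map an H x W x D nested list to a channels-first C x H x W nested list."""
    height, width = len(hint), len(hint[0])
    projected = [[project_pixel(weights, pixel) for pixel in row] for row in hint]
    return [[[projected[y][x][c] for x in range(width)] for y in range(height)]
            for c in range(len(weights))]


weights = make_projection(EMBED_DIM, OUT_CHANNELS)
rng = random.Random(1)
# A tiny 2 x 4 "tile" of random embeddings standing in for real COSA features.
hint = [[[rng.gauss(0, 1) for _ in range(EMBED_DIM)] for _ in range(4)]
        for _ in range(2)]
control = render_control_image(weights, hint)
print(len(control), len(control[0]), len(control[0][0]))  # channels, height, width
```

The point of the sketch is the shape flow: every pixel keeps its position, and only the feature dimension shrinks from 768 to 3 so the result can be passed to ControlNet as an image.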

	Unlike `VectorSynth-COSA` — which ships only a fine-tuned ControlNet on top of the stock SD 2.1 UNet — this model additionally fine-tunes the UNet on GiT10M, so users should load the full pipeline from this repo rather than from `stable-diffusion-2-1-base`.

	## Usage

	```python
	import sys
	import torch
	from diffusers import StableDiffusionControlNetPipeline, DDIMScheduler
	from huggingface_hub import snapshot_download

	device = "cuda"

	# Load pipeline (GiT10M-finetuned UNet + ControlNet, plus base SD 2.1 VAE/text encoder)
	local_dir = snapshot_download("MVRL/VectorSynth-GiT10M")
	pipe = StableDiffusionControlNetPipeline.from_pretrained(
	    local_dir,
	    torch_dtype=torch.float16,
	)
	pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
	pipe = pipe.to(device)

	# Load RenderEncoder
	sys.path.insert(0, local_dir)
	from render import RenderEncoder
	checkpoint = torch.load(
	    f"{local_dir}/render_encoder/cosa-render_encoder.pth",
	    map_location=device,
	    weights_only=False,
	)
	render_encoder = RenderEncoder(**checkpoint['config']).to(device).eval()
	render_encoder.load_state_dict(checkpoint['state_dict'])

	# Your hint tensor should be (H, W, 768) - per-pixel COSA embeddings
	# hint = torch.load("your_hint.pt").to(device)
	# hint = hint.unsqueeze(0).permute(0, 3, 1, 2) # (1, 768, H, W)

	# with torch.no_grad():
	#     control_image = render_encoder(hint)

	# Generate
	# output = pipe(
	#     prompt="An aerial image of a residential neighborhood",
	#     image=control_image,
	#     num_inference_steps=40,
	#     guidance_scale=7.5,
	# ).images[0]
	```
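
Before running the commented-out steps above, it can help to check the hint tensor's shape. The sketch below works on a plain shape tuple, so it can be called with `tuple(hint.shape)`. The helper name `control_input_shape` is hypothetical, and the multiple-of-8 check is an assumption based on Stable Diffusion's VAE downsampling by a factor of 8, not a requirement stated by this repo.

```python
COSA_DIM = 768  # per-pixel embedding size stated in this card


def control_input_shape(hint_shape, multiple=8):
    """Validate an (H, W, 768) hint shape and return the (1, 768, H, W) shape
    produced by unsqueeze(0).permute(0, 3, 1, 2)."""
    if len(hint_shape) != 3:
        raise ValueError(f"expected (H, W, {COSA_DIM}), got {hint_shape}")
    height, width, channels = hint_shape
    if channels != COSA_DIM:
        raise ValueError(f"last dimension must be {COSA_DIM}, got {channels}")
    # Assumption: keep spatial sizes divisible by 8 to match the VAE stride.
    if height % multiple or width % multiple:
        raise ValueError(f"H and W should be multiples of {multiple}")
    return (1, channels, height, width)


print(control_input_shape((512, 512, 768)))
```

A failed check here is usually cheaper to debug than a shape error raised inside the pipeline call.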

	## Files

	- `unet/` — GiT10M-fine-tuned UNet (`diffusion_pytorch_model.safetensors`)
	- `controlnet/` — GiT10M-fine-tuned ControlNet
	- `render_encoder/cosa-render_encoder.pth` — RenderEncoder weights (COSA 768→3)
	- `render.py` — RenderEncoder class definition
	- `vae/`, `text_encoder/`, `tokenizer/`, `scheduler/`, `feature_extractor/` — copied from SD 2.1 Base (unmodified)
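
As a quick sanity check after `snapshot_download`, the layout listed above can be verified with a few lines of standard-library Python. This is a sketch: the helper name `missing_entries` is hypothetical, and it only checks that the listed folders and files exist, not that their contents are valid.

```python
import tempfile
from pathlib import Path

# Entries taken from the file list in this card.
EXPECTED_DIRS = ["unet", "controlnet", "vae", "text_encoder",
                 "tokenizer", "scheduler", "feature_extractor"]
EXPECTED_FILES = ["render.py", "render_encoder/cosa-render_encoder.pth"]


def missing_entries(local_dir):
    """Return the expected folders/files that are absent under local_dir."""
    root = Path(local_dir)
    missing = [f"{name}/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    missing += [name for name in EXPECTED_FILES if not (root / name).is_file()]
    return missing


# Demonstration on a throwaway directory that contains only two of the entries.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "unet").mkdir()
    (Path(tmp) / "render.py").write_text("# placeholder\n")
    print(missing_entries(tmp))
```

With a real download, `missing_entries(local_dir)` returning an empty list means every entry named above is present.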

	## Training Data

	Fine-tuned on GiT10M, a curated collection of paired OpenStreetMap vector data and Google satellite tiles (zoom 17, ~1 m/pixel). The dataset is split into a training set and two held-out test splits (random and spatial) for evaluation. See [GeoDiT: Point Conditioned Diffusion Transformer for Satellite Image Synthesis](https://arxiv.org/html/2603.02172v1) for more details on the data.
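
The "zoom 17, ~1 m per pixel" figure follows from the standard Web Mercator tiling formula, as the sketch below shows. It assumes 256-pixel tiles, which this card does not state, and the function name `ground_resolution` is illustrative.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS84 equatorial radius used by Web Mercator
TILE_SIZE_PX = 256          # assumption: standard 256-pixel tiles


def ground_resolution(latitude_deg, zoom, tile_size=TILE_SIZE_PX):
    """Approximate metres per pixel for a Web Mercator tile at a latitude."""
    circumference = 2 * math.pi * EARTH_RADIUS_M
    return circumference * math.cos(math.radians(latitude_deg)) / (tile_size * 2 ** zoom)


for lat in (0, 40, 60):
    print(f"lat {lat:>2} deg: {ground_resolution(lat, 17):.2f} m/pixel")
```

Under these assumptions the resolution at zoom 17 is about 1.19 m/pixel at the equator and about 0.60 m/pixel at 60 degrees latitude, so "~1 m" is a reasonable summary for mid-latitude tiles.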

	## Citation

	```bibtex
	@misc{cher2025vectorsynth,
	  title={VectorSynth: Fine-Grained Satellite Image Synthesis with Structured Semantics},
	  author={Cher, Daniel and Wei, Brian and Sastry, Srikumar and Jacobs, Nathan},
	  year={2025},
	  eprint={2511.07744},
	  archivePrefix={arXiv}
	}
	```

	## Related Models

	- [VectorSynth-COSA](https://huggingface.co/MVRL/VectorSynth-COSA) — trained on a smaller cities dataset
	- [VectorSynth](https://huggingface.co/MVRL/VectorSynth) — standard CLIP embedding variant
	- [GeoSynth](https://huggingface.co/MVRL/GeoSynth) — text-to-satellite image generation