dcher95 committed
Commit 9917430 · verified · 1 Parent(s): e1d60d3

Initial release: VectorSynth-GiT10M

README.md ADDED
@@ -0,0 +1,100 @@
+ ---
+ license: apache-2.0
+ tags:
+ - controlnet
+ - stable-diffusion
+ - satellite-imagery
+ - osm
+ - image-to-image
+ - diffusers
+ base_model: stabilityai/stable-diffusion-2-1-base
+ pipeline_tag: image-to-image
+ library_name: diffusers
+ ---
+
+ # VectorSynth-GiT10M
+
+ **VectorSynth-GiT10M** is a ControlNet-based pipeline that generates satellite imagery from OpenStreetMap (OSM) vector data, fine-tuned on the GiT10M dataset of paired OSM and satellite tiles. Like [VectorSynth-COSA](https://huggingface.co/MVRL/VectorSynth-COSA), it conditions [Stable Diffusion 2.1 Base](https://huggingface.co/stabilityai/stable-diffusion-2-1-base) on rendered OSM text using the COSA (Contrastive OSM-Satellite Alignment) embedding space.
+
+ ## Model Description
+
+ VectorSynth-GiT10M uses a two-stage pipeline:
+ 1. **RenderEncoder**: Projects 768-dim COSA embeddings to 3-channel control images.
+ 2. **ControlNet + UNet**: Both fine-tuned on the GiT10M dataset to condition Stable Diffusion 2.1 on the rendered control images.
+
+ Unlike `VectorSynth-COSA`, which ships only a fine-tuned ControlNet on top of the stock SD 2.1 UNet, this model also fine-tunes the UNet on GiT10M, so load the full pipeline from this repo rather than from `stable-diffusion-2-1-base`.
+
+ ## Usage
+
+ ```python
+ import sys
+ import torch
+ from diffusers import StableDiffusionControlNetPipeline, DDIMScheduler
+ from huggingface_hub import snapshot_download
+
+ device = "cuda"
+
+ # Load pipeline (GiT10M-fine-tuned UNet + ControlNet, plus base SD 2.1 VAE/text encoder)
+ local_dir = snapshot_download("MVRL/VectorSynth-GiT10M")
+ pipe = StableDiffusionControlNetPipeline.from_pretrained(
+     local_dir,
+     torch_dtype=torch.float16
+ )
+ pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
+ pipe = pipe.to(device)
+
+ # Load RenderEncoder
+ sys.path.insert(0, local_dir)
+ from render import RenderEncoder
+ checkpoint = torch.load(
+     f"{local_dir}/render_encoder/cosa-render_encoder.pth",
+     map_location=device, weights_only=False,
+ )
+ render_encoder = RenderEncoder(**checkpoint['config']).to(device).eval()
+ render_encoder.load_state_dict(checkpoint['state_dict'])
+
+ # Your hint tensor should be (H, W, 768): per-pixel COSA embeddings
+ # hint = torch.load("your_hint.pt").to(device)
+ # hint = hint.unsqueeze(0).permute(0, 3, 1, 2)  # (1, 768, H, W)
+
+ # with torch.no_grad():
+ #     control_image = render_encoder(hint)
+
+ # Generate
+ # output = pipe(
+ #     prompt="An aerial image of a residential neighborhood",
+ #     image=control_image,
+ #     num_inference_steps=40,
+ #     guidance_scale=7.5
+ # ).images[0]
+ ```
+
+ ## Files
+
+ - `unet/`: GiT10M-fine-tuned UNet (`diffusion_pytorch_model.safetensors`)
+ - `controlnet/`: GiT10M-fine-tuned ControlNet
+ - `render_encoder/cosa-render_encoder.pth`: RenderEncoder weights (COSA 768→3)
+ - `render.py`: RenderEncoder class definition
+ - `vae/`, `text_encoder/`, `tokenizer/`, `scheduler/`, `feature_extractor/`: copied from SD 2.1 Base (unmodified)
+
+ ## Training Data
+
+ Fine-tuned on **GiT10M**, a curated collection of paired OpenStreetMap vector data and Google satellite tiles (zoom 17, ~1 m/pixel). The dataset is split into a training set and two held-out test splits (random and spatial) for evaluation. See [GeoDiT: Point Conditioned Diffusion Transformer for Satellite Image Synthesis](https://arxiv.org/html/2603.02172v1) for more details on the data.
+
+ ## Citation
+
+ ```bibtex
+ @inproceedings{cher2025vectorsynth,
+   title={VectorSynth: Fine-Grained Satellite Image Synthesis with Structured Semantics},
+   author={Cher, Daniel and Wei, Brian and Sastry, Srikumar and Jacobs, Nathan},
+   year={2025},
+   eprint={arXiv:2511.07744},
+   note={arXiv preprint}
+ }
+ ```
+
+ ## Related Models
+
+ - [VectorSynth-COSA](https://huggingface.co/MVRL/VectorSynth-COSA): trained on a smaller cities dataset
+ - [VectorSynth](https://huggingface.co/MVRL/VectorSynth): standard CLIP embedding variant
+ - [GeoSynth](https://huggingface.co/MVRL/GeoSynth): text-to-satellite image generation
__pycache__/render.cpython-38.pyc ADDED
Binary file (2.63 kB).
 
controlnet/config.json ADDED
@@ -0,0 +1,56 @@
+ {
+   "_class_name": "ControlNetModel",
+   "_diffusers_version": "0.34.0",
+   "act_fn": "silu",
+   "addition_embed_type": null,
+   "addition_embed_type_num_heads": 64,
+   "addition_time_embed_dim": null,
+   "attention_head_dim": [
+     5,
+     10,
+     20,
+     20
+   ],
+   "block_out_channels": [
+     320,
+     640,
+     1280,
+     1280
+   ],
+   "class_embed_type": null,
+   "conditioning_channels": 3,
+   "conditioning_embedding_out_channels": [
+     16,
+     32,
+     96,
+     256
+   ],
+   "controlnet_conditioning_channel_order": "rgb",
+   "cross_attention_dim": 1024,
+   "down_block_types": [
+     "CrossAttnDownBlock2D",
+     "CrossAttnDownBlock2D",
+     "CrossAttnDownBlock2D",
+     "DownBlock2D"
+   ],
+   "downsample_padding": 1,
+   "encoder_hid_dim": null,
+   "encoder_hid_dim_type": null,
+   "flip_sin_to_cos": true,
+   "freq_shift": 0,
+   "global_pool_conditions": false,
+   "in_channels": 4,
+   "layers_per_block": 2,
+   "mid_block_scale_factor": 1,
+   "mid_block_type": "UNetMidBlock2DCrossAttn",
+   "norm_eps": 1e-05,
+   "norm_num_groups": 32,
+   "num_attention_heads": null,
+   "num_class_embeds": null,
+   "only_cross_attention": false,
+   "projection_class_embeddings_input_dim": null,
+   "resnet_time_scale_shift": "default",
+   "transformer_layers_per_block": 1,
+   "upcast_attention": false,
+   "use_linear_projection": true
+ }
controlnet/diffusion_pytorch_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:755bf692b7367e416649ee673a9f79a458eaa050454fc0797e11ac3f9f0feb96
+ size 1456953560
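The three lines above are a Git LFS pointer, not the weights themselves: they record the hash algorithm, the digest, and the byte size of the real file. A minimal sketch of checking a downloaded file against such a pointer follows. The helper names (`parse_lfs_pointer`, `digest_and_size`, `matches_pointer`) and the local path in the commented usage are hypothetical and not part of this repository.

```python
import hashlib

def parse_lfs_pointer(text):
    """Split a Git LFS pointer into its version, hash algorithm, digest, and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }

def digest_and_size(stream, chunk_size=1 << 20):
    """Hash a binary stream in chunks; return (sha256 hex digest, byte count)."""
    hasher = hashlib.sha256()
    total = 0
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        hasher.update(chunk)
        total += len(chunk)
    return hasher.hexdigest(), total

def matches_pointer(stream, pointer):
    """True when the stream's digest and size equal the values in the pointer."""
    return digest_and_size(stream) == (pointer["digest"], pointer["size"])

# Pointer text copied from controlnet/diffusion_pytorch_model.safetensors in this commit.
CONTROLNET_POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:755bf692b7367e416649ee673a9f79a458eaa050454fc0797e11ac3f9f0feb96
size 1456953560
"""
pointer = parse_lfs_pointer(CONTROLNET_POINTER)

# Hypothetical usage against a local download (the path is an assumption):
# with open("controlnet/diffusion_pytorch_model.safetensors", "rb") as f:
#     print(matches_pointer(f, pointer))
```

Reading in chunks keeps memory use flat, which matters for a file of roughly 1.46 GB.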
feature_extractor/preprocessor_config.json ADDED
@@ -0,0 +1,20 @@
+ {
+   "crop_size": 224,
+   "do_center_crop": true,
+   "do_convert_rgb": true,
+   "do_normalize": true,
+   "do_resize": true,
+   "feature_extractor_type": "CLIPFeatureExtractor",
+   "image_mean": [
+     0.48145466,
+     0.4578275,
+     0.40821073
+   ],
+   "image_std": [
+     0.26862954,
+     0.26130258,
+     0.27577711
+   ],
+   "resample": 3,
+   "size": 224
+ }
model_index.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "_class_name": "StableDiffusionControlNetPipeline",
+   "_diffusers_version": "0.27.2",
+   "controlnet": [
+     "diffusers",
+     "ControlNetModel"
+   ],
+   "feature_extractor": [
+     "transformers",
+     "CLIPImageProcessor"
+   ],
+   "requires_safety_checker": false,
+   "safety_checker": [
+     null,
+     null
+   ],
+   "scheduler": [
+     "diffusers",
+     "DDIMScheduler"
+   ],
+   "text_encoder": [
+     "transformers",
+     "CLIPTextModel"
+   ],
+   "tokenizer": [
+     "transformers",
+     "CLIPTokenizer"
+   ],
+   "unet": [
+     "diffusers",
+     "UNet2DConditionModel"
+   ],
+   "vae": [
+     "diffusers",
+     "AutoencoderKL"
+   ]
+ }
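Each two-element entry in `model_index.json` pairs a component name with a library and a class name. A short sketch of reading that index with the standard library, to list the components and spot the one left empty, is below. The function name `pipeline_components` is hypothetical, and treating `[null, null]` as "not shipped" is an interpretation of this file rather than a documented guarantee.

```python
import json

# Contents of model_index.json as added in this commit.
MODEL_INDEX = json.loads("""
{
  "_class_name": "StableDiffusionControlNetPipeline",
  "_diffusers_version": "0.27.2",
  "controlnet": ["diffusers", "ControlNetModel"],
  "feature_extractor": ["transformers", "CLIPImageProcessor"],
  "requires_safety_checker": false,
  "safety_checker": [null, null],
  "scheduler": ["diffusers", "DDIMScheduler"],
  "text_encoder": ["transformers", "CLIPTextModel"],
  "tokenizer": ["transformers", "CLIPTokenizer"],
  "unet": ["diffusers", "UNet2DConditionModel"],
  "vae": ["diffusers", "AutoencoderKL"]
}
""")

def pipeline_components(index):
    """Map component name -> (library, class name), skipping metadata keys."""
    return {
        name: tuple(value)
        for name, value in index.items()
        if isinstance(value, list) and len(value) == 2
    }

components = pipeline_components(MODEL_INDEX)

# Components whose library and class are both null in the index.
disabled = sorted(
    name for name, (library, cls) in components.items()
    if library is None and cls is None
)
print(disabled)
```

Here the only empty entry is `safety_checker`, which agrees with `"requires_safety_checker": false` in the same file.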
render.py ADDED
@@ -0,0 +1,73 @@
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+
+ class ResidualRenderBlock(nn.Module):
+     def __init__(self, dim):
+         super().__init__()
+         self.block = nn.Sequential(
+             nn.Conv2d(dim, dim, kernel_size=3, padding=1),
+             nn.GroupNorm(8, dim),
+             nn.SiLU(),
+             nn.Conv2d(dim, dim, kernel_size=3, padding=1),
+             nn.GroupNorm(8, dim)
+         )
+
+     def forward(self, x):
+         return x + self.block(x)
+
+ class RenderEncoder(nn.Module):
+     def __init__(self, encoder_type="1d", in_channels=768, out_channels=3):
+         super().__init__()
+         self.encoder_type = encoder_type
+
+         if encoder_type == "1d":
+             self.model = nn.Sequential(
+                 nn.Conv2d(in_channels, out_channels, kernel_size=1),
+                 nn.Sigmoid()
+             )
+
+         elif encoder_type == "residual":
+             self.model = ResidualBlockRender(in_channels, out_channels)
+
+         elif encoder_type == "expressive":
+             mid_channels = 256
+             self.model = nn.Sequential(
+                 nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1),
+                 nn.GroupNorm(8, mid_channels),
+                 nn.SiLU(),
+                 ResidualRenderBlock(mid_channels),
+                 ResidualRenderBlock(mid_channels),
+                 ResidualRenderBlock(mid_channels),
+                 nn.Conv2d(mid_channels, out_channels, kernel_size=1),
+                 nn.Sigmoid()
+             )
+
+         else:
+             raise ValueError(f"Unknown encoder_type '{encoder_type}'. Use '1d', 'residual', or 'expressive'.")
+
+     def forward(self, x):
+         return self.model(x)
+
+ class ResidualBlockRender(nn.Module):
+     def __init__(self, in_channels=768, out_channels=3):
+         super().__init__()
+         self.conv1 = nn.Conv2d(in_channels, 256, kernel_size=3, padding=1)
+         self.relu1 = nn.ReLU()
+         self.conv2 = nn.Conv2d(256, 256, kernel_size=3, padding=1)
+         self.relu2 = nn.ReLU()
+         self.conv3 = nn.Conv2d(256, out_channels, kernel_size=1)
+         self.out = nn.Sigmoid()
+
+         if in_channels != out_channels:
+             self.residual_proj = nn.Conv2d(in_channels, out_channels, kernel_size=1)
+         else:
+             self.residual_proj = nn.Identity()
+
+     def forward(self, x):
+         residual = self.residual_proj(x)
+         h = self.relu1(self.conv1(x))
+         h = self.relu2(self.conv2(h))
+         h = self.conv3(h)
+         h = h + residual
+         return self.out(h)
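The default `"1d"` branch of `RenderEncoder` is a `kernel_size=1` convolution followed by a sigmoid, which amounts to the same linear projection applied independently at every pixel, squashed into (0, 1). The sketch below restates that in plain Python on a toy 4-channel embedding instead of 768. The function names and the toy weights are made up for illustration; this is a stand-in for the idea, not the repository's implementation.

```python
import math

def render_pixel(embedding, weight, bias):
    """Project one pixel's embedding to control values in (0, 1).

    weight has one row per output channel; each row has one entry per
    input channel, like the weight of a 1x1 convolution.
    """
    return [
        1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(row, embedding)) + b)))
        for row, b in zip(weight, bias)
    ]

def render_hint(hint, weight, bias):
    """Apply the same per-pixel projection to an (H, W, C) nested list."""
    return [[render_pixel(pixel, weight, bias) for pixel in row] for row in hint]

# Toy setup: 4 input channels (the real model uses 768), 3 output channels.
toy_weight = [
    [0.5, -0.5, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, -1.0],
]
toy_bias = [0.0, 0.0, 0.0]

# A 2x2 hint whose pixels carry different embeddings.
toy_hint = [
    [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]],
    [[0.0, 0.0, 0.0, 2.0], [0.0, 0.0, 0.0, 0.0]],
]
control = render_hint(toy_hint, toy_weight, toy_bias)
```

Because no pixel sees its neighbours, two pixels with the same embedding always receive the same colour. The `"residual"` and `"expressive"` branches use 3x3 convolutions and give up that property.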
render_encoder/cosa-render_encoder.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:abb4d405f0fb319363943275d57870b4a5318b173d16ff8d6a1373929d6ea5ac
+ size 10976
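The pointer reports a checkpoint of only 10,976 bytes, which can be compared with the parameter counts of the three `encoder_type` variants in `render.py`. The arithmetic below is a rough sketch: it assumes float32 storage and ignores serialization overhead, and the resulting guess about which variant was shipped is an inference, not something the commit states.

```python
def conv2d_params(c_in, c_out, kernel):
    """Weights plus biases of a Conv2d layer with a square kernel."""
    return c_in * c_out * kernel * kernel + c_out

def groupnorm_params(channels):
    """One scale and one shift per channel."""
    return 2 * channels

def render_encoder_params(encoder_type, in_channels=768, out_channels=3, mid=256):
    """Parameter count for each RenderEncoder variant defined in render.py."""
    if encoder_type == "1d":
        return conv2d_params(in_channels, out_channels, 1)
    if encoder_type == "residual":
        proj = conv2d_params(in_channels, out_channels, 1) if in_channels != out_channels else 0
        return (conv2d_params(in_channels, mid, 3) + conv2d_params(mid, mid, 3)
                + conv2d_params(mid, out_channels, 1) + proj)
    if encoder_type == "expressive":
        block = 2 * conv2d_params(mid, mid, 3) + 2 * groupnorm_params(mid)
        return (conv2d_params(in_channels, mid, 3) + groupnorm_params(mid)
                + 3 * block + conv2d_params(mid, out_channels, 1))
    raise ValueError(encoder_type)

CHECKPOINT_BYTES = 10976   # size field of the LFS pointer in this commit
BYTES_PER_PARAM = 4        # assumption: weights stored as float32

# Variants whose raw weights would fit inside the reported file size.
plausible = [
    t for t in ("1d", "residual", "expressive")
    if render_encoder_params(t) * BYTES_PER_PARAM <= CHECKPOINT_BYTES
]
print(plausible)
```

Under those assumptions only the `"1d"` variant fits: 768 × 3 + 3 = 2,307 parameters, or about 9.2 kB, while the other two need megabytes.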
scheduler/scheduler_config.json ADDED
@@ -0,0 +1,14 @@
+ {
+   "_class_name": "PNDMScheduler",
+   "_diffusers_version": "0.10.0.dev0",
+   "beta_end": 0.012,
+   "beta_schedule": "scaled_linear",
+   "beta_start": 0.00085,
+   "clip_sample": false,
+   "num_train_timesteps": 1000,
+   "prediction_type": "epsilon",
+   "set_alpha_to_one": false,
+   "skip_prk_steps": true,
+   "steps_offset": 1,
+   "trained_betas": null
+ }
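This scheduler config names `PNDMScheduler`, while `model_index.json` in the same commit declares `DDIMScheduler` for the `scheduler` component. The sketch below only detects that disagreement from excerpts of the two files. The function name is hypothetical, and the snippet makes no claim about how the loading library resolves the difference.

```python
import json

# Excerpts of the two files as added in this commit.
MODEL_INDEX_EXCERPT = json.loads('{"scheduler": ["diffusers", "DDIMScheduler"]}')
SCHEDULER_CONFIG_EXCERPT = json.loads(
    '{"_class_name": "PNDMScheduler", "prediction_type": "epsilon", "steps_offset": 1}'
)

def scheduler_class_mismatch(model_index, scheduler_config):
    """Return (declared, saved) when the two files name different classes, else None."""
    declared = model_index["scheduler"][1]
    saved = scheduler_config["_class_name"]
    return None if declared == saved else (declared, saved)

mismatch = scheduler_class_mismatch(MODEL_INDEX_EXCERPT, SCHEDULER_CONFIG_EXCERPT)
print(mismatch)
```

The README's usage snippet rebuilds the scheduler with `DDIMScheduler.from_config(pipe.scheduler.config)`, which makes the sampler explicit regardless of which class name the saved config carries.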
text_encoder/config.json ADDED
@@ -0,0 +1,25 @@
+ {
+   "_name_or_path": "stabilityai/stable-diffusion-2",
+   "architectures": [
+     "CLIPTextModel"
+   ],
+   "attention_dropout": 0.0,
+   "bos_token_id": 0,
+   "dropout": 0.0,
+   "eos_token_id": 2,
+   "hidden_act": "gelu",
+   "hidden_size": 1024,
+   "initializer_factor": 1.0,
+   "initializer_range": 0.02,
+   "intermediate_size": 4096,
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 77,
+   "model_type": "clip_text_model",
+   "num_attention_heads": 16,
+   "num_hidden_layers": 23,
+   "pad_token_id": 1,
+   "projection_dim": 512,
+   "torch_dtype": "float32",
+   "transformers_version": "4.25.0.dev0",
+   "vocab_size": 49408
+ }
text_encoder/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cce6febb0b6d876ee5eb24af35e27e764eb4f9b1d0b7c026c8c3333d4cfc916c
+ size 1361597018
tokenizer/merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer/special_tokens_map.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "bos_token": {
+     "content": "<|startoftext|>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": "!",
+   "unk_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer/tokenizer_config.json ADDED
@@ -0,0 +1,34 @@
+ {
+   "add_prefix_space": false,
+   "bos_token": {
+     "__type": "AddedToken",
+     "content": "<|startoftext|>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "do_lower_case": true,
+   "eos_token": {
+     "__type": "AddedToken",
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "errors": "replace",
+   "model_max_length": 77,
+   "name_or_path": "stabilityai/stable-diffusion-2",
+   "pad_token": "<|endoftext|>",
+   "special_tokens_map_file": "./special_tokens_map.json",
+   "tokenizer_class": "CLIPTokenizer",
+   "unk_token": {
+     "__type": "AddedToken",
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer/vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
unet/config.json ADDED
@@ -0,0 +1,45 @@
+ {
+   "_class_name": "UNet2DConditionModel",
+   "_diffusers_version": "0.10.0.dev0",
+   "act_fn": "silu",
+   "attention_head_dim": [
+     5,
+     10,
+     20,
+     20
+   ],
+   "block_out_channels": [
+     320,
+     640,
+     1280,
+     1280
+   ],
+   "center_input_sample": false,
+   "cross_attention_dim": 1024,
+   "down_block_types": [
+     "CrossAttnDownBlock2D",
+     "CrossAttnDownBlock2D",
+     "CrossAttnDownBlock2D",
+     "DownBlock2D"
+   ],
+   "downsample_padding": 1,
+   "dual_cross_attention": false,
+   "flip_sin_to_cos": true,
+   "freq_shift": 0,
+   "in_channels": 4,
+   "layers_per_block": 2,
+   "mid_block_scale_factor": 1,
+   "norm_eps": 1e-05,
+   "norm_num_groups": 32,
+   "num_class_embeds": null,
+   "only_cross_attention": false,
+   "out_channels": 4,
+   "sample_size": 64,
+   "up_block_types": [
+     "UpBlock2D",
+     "CrossAttnUpBlock2D",
+     "CrossAttnUpBlock2D",
+     "CrossAttnUpBlock2D"
+   ],
+   "use_linear_projection": true
+ }
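A ControlNet mirrors the encoder half of the UNet it steers and adds its feature maps to the UNet's own, so the two configs need matching shapes, and the text encoder's hidden size needs to equal the cross-attention width. A small sketch of that sanity check over values copied from the three config files in this commit is below. The key list and the function name are choices made for this example, not an official compatibility test.

```python
# Values copied from controlnet/config.json, unet/config.json and
# text_encoder/config.json in this commit.
DOWN_BLOCKS = [
    "CrossAttnDownBlock2D", "CrossAttnDownBlock2D",
    "CrossAttnDownBlock2D", "DownBlock2D",
]
CONTROLNET = {
    "attention_head_dim": [5, 10, 20, 20],
    "block_out_channels": [320, 640, 1280, 1280],
    "cross_attention_dim": 1024,
    "down_block_types": DOWN_BLOCKS,
    "in_channels": 4,
    "layers_per_block": 2,
    "use_linear_projection": True,
}
UNET = {
    "attention_head_dim": [5, 10, 20, 20],
    "block_out_channels": [320, 640, 1280, 1280],
    "cross_attention_dim": 1024,
    "down_block_types": DOWN_BLOCKS,
    "in_channels": 4,
    "layers_per_block": 2,
    "use_linear_projection": True,
}
TEXT_ENCODER = {"hidden_size": 1024}

def mismatched_keys(left, right, keys):
    """Return the keys whose values differ between two config dicts."""
    return [key for key in keys if left.get(key) != right.get(key)]

mismatches = mismatched_keys(CONTROLNET, UNET, sorted(CONTROLNET))
text_width_ok = UNET["cross_attention_dim"] == TEXT_ENCODER["hidden_size"]
print(mismatches, text_width_ok)
```

For the values in this commit the encoder-side fields agree and the 1024-wide text embeddings match `cross_attention_dim`.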
unet/diffusion_pytorch_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6dfae3e5f7d459b50f4b0850ead945972c75bb0e1897628933e169eb43974214
+ size 3463726498
vae/config.json ADDED
@@ -0,0 +1,29 @@
+ {
+   "_class_name": "AutoencoderKL",
+   "_diffusers_version": "0.10.0.dev0",
+   "act_fn": "silu",
+   "block_out_channels": [
+     128,
+     256,
+     512,
+     512
+   ],
+   "down_block_types": [
+     "DownEncoderBlock2D",
+     "DownEncoderBlock2D",
+     "DownEncoderBlock2D",
+     "DownEncoderBlock2D"
+   ],
+   "in_channels": 3,
+   "latent_channels": 4,
+   "layers_per_block": 2,
+   "norm_num_groups": 32,
+   "out_channels": 3,
+   "sample_size": 768,
+   "up_block_types": [
+     "UpDecoderBlock2D",
+     "UpDecoderBlock2D",
+     "UpDecoderBlock2D",
+     "UpDecoderBlock2D"
+   ]
+ }
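The VAE and UNet configs together fix the default output resolution. Stable Diffusion pipelines in `diffusers` derive the VAE's spatial downscale factor as `2 ** (len(block_out_channels) - 1)` and multiply it by the UNet's `sample_size`. The sketch below works that out for the values in this commit; the function names are hypothetical helpers for the arithmetic.

```python
def vae_scale_factor(block_out_channels):
    """Each VAE block after the first halves the resolution."""
    return 2 ** (len(block_out_channels) - 1)

def default_image_size(unet_sample_size, block_out_channels):
    """Pixel size of a square image whose latent matches the UNet sample size."""
    return unet_sample_size * vae_scale_factor(block_out_channels)

def latent_shape(height, width, block_out_channels, latent_channels):
    """(channels, height, width) of the latent for an image of the given size."""
    factor = vae_scale_factor(block_out_channels)
    return (latent_channels, height // factor, width // factor)

# Values from vae/config.json and unet/config.json in this commit.
VAE_BLOCK_OUT_CHANNELS = [128, 256, 512, 512]
VAE_LATENT_CHANNELS = 4
UNET_SAMPLE_SIZE = 64

scale = vae_scale_factor(VAE_BLOCK_OUT_CHANNELS)
size = default_image_size(UNET_SAMPLE_SIZE, VAE_BLOCK_OUT_CHANNELS)
shape = latent_shape(size, size, VAE_BLOCK_OUT_CHANNELS, VAE_LATENT_CHANNELS)
print(scale, size, shape)
```

With four VAE blocks the factor is 8, so a 64 × 64 latent corresponds to a 512 × 512 image, and the 4 latent channels match the UNet's `in_channels`.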
vae/diffusion_pytorch_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a1d993488569e928462932c8c38a0760b874d166399b14414135bd9c42df5815
+ size 334643276