---
license: mit
language:
- en
metrics:
- mse
base_model:
- lllyasviel/sd-controlnet-seg
pipeline_tag: image-to-image
tags:
- controlnet
- stable-diffusion
- conditional-generation
- segmentation
model-index:
  - name: Facades-ControlNet-SD15
    results:
      - task:
          type: image-to-image
          name: Conditional Image Generation
        dataset:
          name: CMP Facades Dataset
          type: facades
          url: https://www.kaggle.com/datasets/balraj98/facades-dataset
        metrics:
          - name: Mean Squared Error
            type: mse
            value: 0.0178
        source:
          name: Custom Evaluation
          url: https://www.kaggle.com/datasets/balraj98/facades-dataset
---

# Model Card for Facades ControlNet with Stable Diffusion v1.5

![Cover](https://cdn-uploads.huggingface.co/production/uploads/67e303fff01ee3e3ab5505a2/DpqNC41GG2ngcNIeJoByU.png)

This model is a fine-tuned version of ControlNet built on top of **Stable Diffusion v1.5**, specifically conditioned on **semantic segmentation maps** from the **Facades dataset**. It enables structure-aware image generation by combining natural language prompts with pixel-level guidance in the form of building façade segmentation masks. The result is highly controllable generation of realistic architectural scenes that reflect both structural layout and textual context.

## Model Description

-   **Base Model**: [stable-diffusion-v1-5/stable-diffusion-v1-5](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5)
    
-   **Control Type**: Semantic segmentation maps (Facades-style RGB masks)
    
-   **Architecture**: U-Net + ControlNet adapter + Variational Autoencoder (VAE) + CLIP Text Encoder (ViT-L/14)
    
-   **Training Epochs**: 30 full passes over the training data
    
-   **Training Dataset**: [Facades dataset](https://www.kaggle.com/datasets/balraj98/facades-dataset)
    
-   **Resolution**: Trained at 512×512 resolution
    
-   **Hardware**: NVIDIA A100 40GB GPU; total training time was approximately 1 hour
    
-   **Loss Function**: Mean Squared Error (MSE) between predicted and true noise vectors (used in DDPM training)
    

The ControlNet branches were trained while freezing the base Stable Diffusion weights. This retains the generative capabilities of the original model while specializing it to generate façade-aligned structures.
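
The noise-prediction objective listed above can be sketched in a few lines. This is a minimal NumPy illustration, not the training script used for this model: the linear beta schedule, the toy tensor shapes, and the `predict_noise` stand-in (which would be the frozen U-Net plus the trainable ControlNet in practice) are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed linear beta schedule with 1000 steps (a common DDPM default,
# not confirmed as this model's setting).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)


def add_noise(x0, noise, t):
    """Forward diffusion: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise."""
    a_bar = alphas_cumprod[t]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise


def predict_noise(x_t, t, control):
    """Hypothetical stand-in for the U-Net + ControlNet noise predictor."""
    return np.zeros_like(x_t)


def noise_mse_loss(x0, control, t):
    """MSE between the true noise and the predicted noise at timestep t."""
    noise = rng.standard_normal(x0.shape)
    x_t = add_noise(x0, noise, t)
    pred = predict_noise(x_t, t, control)
    return float(np.mean((pred - noise) ** 2))


# Toy latent (4x64x64, the SD v1.5 latent shape for a 512x512 image)
# and a toy control map.
x0 = rng.standard_normal((4, 64, 64))
control = np.zeros((3, 512, 512))
loss = noise_mse_loss(x0, control, t=500)
print(f"loss = {loss:.4f}")
```

Because the stand-in predictor always returns zeros, the loss here is roughly the variance of the sampled noise (close to 1). A trained predictor would drive this value down, and only the ControlNet parameters would receive gradient updates while the base weights stay frozen.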

## Usage

This model is available via the `diffusers` library. Here's how to load and use it:

```python
import torch
from diffusers import StableDiffusionControlNetPipeline
from PIL import Image

# Load the pipeline from the Hub
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "doguilmak/facade-controlnet-sd15",
    torch_dtype=torch.float32,
    safety_checker=None,
)
pipe.to("cuda")

# Load your segmentation map (RGB format expected)
control = Image.open("facades_segmentation_map.png").convert("RGB")

# Run generation; in this pipeline, `image` is the ControlNet conditioning input
result = pipe(
    prompt="a modern building with large glass windows",
    negative_prompt="blurry, distorted",
    image=control,
    num_inference_steps=50,
    guidance_scale=9,
    output_type="pil",
).images[0]

result.save("facade_result.png")
```

## Example Outputs

These examples illustrate the model's ability to generate photorealistic urban scenes guided by semantic segmentation maps. The outputs show strong spatial alignment between the input masks and the synthesized content.

![inference](https://cdn-uploads.huggingface.co/production/uploads/67e303fff01ee3e3ab5505a2/Dphjxf34_5ysSTTMrCaEi.png)

## Limitations

-   The model was trained on **512×512** resolution; using higher resolutions without resizing may cause artifacts.
    
-   It performs best on scenes resembling architectural façades.
    
-   The control image should resemble **Facades-style segmentation formats** for optimal results.
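
To work within the resolution limitation above, a control image can be resized to the training resolution before it is passed to the pipeline. The following is a minimal sketch using Pillow; the helper name `prepare_control_image` and the synthetic two-colour mask are hypothetical, and the choice of nearest-neighbour resampling is an assumption intended to keep label colours from being blended. Note that a direct resize stretches non-square inputs rather than cropping them.

```python
from PIL import Image

TARGET_SIZE = (512, 512)  # training resolution stated in this model card


def prepare_control_image(mask: Image.Image, size=TARGET_SIZE) -> Image.Image:
    """Convert a segmentation mask to RGB and resize it to the training resolution.

    Nearest-neighbour resampling avoids blending neighbouring class colours
    into new colours that do not correspond to any label.
    """
    return mask.convert("RGB").resize(size, Image.NEAREST)


# Synthetic two-colour mask standing in for a real Facades label map.
mask = Image.new("RGB", (600, 400), (0, 0, 255))
mask.paste((255, 0, 0), (100, 100, 300, 300))
control = prepare_control_image(mask)
print(control.size, control.mode)
```

The returned image can then be used as the `image` argument in the usage example. Smoother filters such as bilinear would introduce intermediate colours along class boundaries, which is why they are avoided in this sketch.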
    

## License

The Stable Diffusion base model is distributed under the [CreativeML Open RAIL-M license](https://huggingface.co/spaces/CompVis/stable-diffusion-license).

Our model is distributed under the [MIT license](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/mit.md).

## References

-   **ControlNet Segmentation Model**: [lllyasviel/sd-controlnet-seg @ Hugging Face](https://huggingface.co/lllyasviel/sd-controlnet-seg)
    
-   **ControlNet Paper**: L. Zhang, A. Rao, and M. Agrawala, “Adding Conditional Control to Text-to-Image Diffusion Models,” _arXiv preprint_ arXiv:2302.05543, 2023.
    
-   **Facades Dataset**: [Kaggle: Facades Dataset](https://www.kaggle.com/datasets/balraj98/facades-dataset)