---
license: creativeml-openrail-m
language:
- en
library_name: diffusers
pipeline_tag: text-to-image
inference:
  parameters:
    width: 384
    height: 640
    clip_skip: 2
    guidance_scale: 7.5
    num_inference_steps: 20
widget:
- text: >-
    red Toolbar Upper Top containing Text Left Login inside and white Input
    Upper Top and white Input Lower Top and red Button login Upper Middle,
    Android UI, Medical, white background
datasets:
- AzhureRaven/rico-ui-component-caption
base_model:
- stable-diffusion-v1-5/stable-diffusion-v1-5
tags:
- mobile-ui
---

# Rico Diffusion Model Card

![Rico Diffusion (with GLIGEN)](./docs/Example.jpg)

I fine-tuned a Stable Diffusion 1.5 model to generate mobile UI mockups at 384x640, using GLIGEN (https://gligen.github.io) to control UI component positions. Some designs, primarily modal dialogs, are generated at 448x576 instead.

I used EveryDream2 (https://github.com/victorchall/EveryDream2trainer) to fine-tune the model on the Rico dataset (http://www.interactionmining.org/rico.html) of UI screenshots. I wrote a Python notebook that parses the dataset's Semantic Hierarchies to create a caption for each screenshot, and that uses the Play Store and UI metadata to add the app categories as extra tags. I also cropped each UI component of a given screenshot (with exceptions) and labeled them accordingly, so that I could train the model on individual UI components before training on whole screenshots. Finally, I used BLIP-2 (https://huggingface.co/Salesforce/blip2-opt-2.7b-coco) to add color names to the UI components in the captions, as well as for general labelling of certain components.

I ran the notebook on a Colab local runtime to process the Rico dataset. For the training process, I split the individual components into two groups based on a total pixel count threshold of 512x512 = 262,144: components smaller than the threshold go into the small component group, while bigger ones go into the big component group. The model is trained on those groups separately before finally training on the full UIs. You can read the paper listed in Citation at the end for more details.
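The grouping rule above can be sketched as a small helper. This is my own illustration (the actual preprocessing notebook is not published here), and how components exactly at the threshold were assigned is an assumption:

```python
# Threshold: total pixel count of a 512x512 image.
PIXEL_THRESHOLD = 512 * 512  # 262,144 pixels

def group_component(width: int, height: int) -> str:
    """Assign a cropped UI component to a training group by pixel count.

    Components below the 512x512 pixel-count threshold go into the
    "small" group; the rest go into the "big" group. (Placing components
    exactly at the threshold in "big" is an assumption.)
    """
    return "small" if width * height < PIXEL_THRESHOLD else "big"
```

For example, a full-width 384x640 crop (245,760 pixels) still lands in the small group, while a 448x640 crop (286,720 pixels) lands in the big group.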

# Training

I did 7 fine-tuning sessions totaling 20 epochs on an NVIDIA GeForce RTX 4070 Ti SUPER with 16GB VRAM, taking approximately 57.89 hours to complete 254.1k training steps. I split the small components into 5 parts due to the large amount of data (666.7k samples) and trained them at batch size 11 and resolution 384 over 5 sessions, the first for 3 epochs and the rest for 2 epochs each. The model was then trained on the big components (46.7k samples) at batch size 9 and resolution 448 for 4 epochs, and finally on the full UIs (65.5k samples) at batch size 7 and resolution 512 for 5 epochs. I first ran the final session with validation, then backed the model up and redid the session without validation so that the model was trained on all of the UIs.

I have uploaded all of the training configurations used in EveryDream2 in "ed2-config". The main configuration files were used in the following order: "rico_diffusion_v2_comp.json", "rico_diffusion_v2_comp_image.json", "rico_diffusion_v2_comp_icon.json", "rico_diffusion_v2_comp_button.json", "rico_diffusion_v2_comp_list_item.json", "rico_diffusion_v2_comp_big.json", and finally "rico_diffusion.json". "rico_diffusion_v2_full.json" is the validation version of the final session, used before backing up and redoing it with "rico_diffusion.json".

The "v2" suffix exists because I also tried fine-tuning the model in a single session on just the UI screenshots using "rico_diffusion_v1.json", with no prior fine-tuning on individual UI components, calling that model Rico Diffusion V1. I compared it with the result of "rico_diffusion_v2_full.json", called Rico Diffusion V2. The model you can access in this repository is simply called Rico Diffusion.

The final model turned out decently well at creating UI mockups. It's still not optimal, especially with many UI components (10 or more), but it is far better than the base model, given that I had to limit the number of training epochs and the batch size due to my limited hardware.

All images produced during testing can be found in the "results" folder. I tried to reproduce 10 UIs from the dataset and 5 UIs outside of it (by screenshotting various apps on my phone and manually writing captions as close to the script's output as possible), saving the results in sub-folders named after the data ID or app name respectively. For each sub-folder, I generated 4 images with and without GLIGEN, using both Rico Diffusion and the base Stable Diffusion 1.5 model for comparison. The folder also contains the prompts, GLIGEN inputs, and other files related to the testing.

# Prompt

This model is fine-tuned to work with component-based prompts, giving you better control over which UI components are included in the image and where they are placed. You specify every component in the prompt instead of writing a vague outline of the UI, such as "An Android UI of a login page".

In general, based on the training captions, the prompt should be formatted like this:

<center>[Component] and [Component] and..., Android UI, [Category], [Background Color]</center>

The prompt should describe each UI component as a [Component], separated by "and", and end with the "Android UI" tag to inform the model that you are trying to produce UI images.

The [Category] tag is optional and describes the app category the UI belongs to, such as "Medical", "Video Players & Editors", etc. You can look up all possible values in "docs/categories.txt".

[Background Color] should be written as "[Color] background", where [Color] is a color name from Material Design 2 (https://m2.material.io/design/color/the-color-system.html). You can look up the values in "docs/colors.txt".

Each [Component] can be broken down as follows:

<center>[Color] [Main Component] [Context] [Position] [Internal Components]</center>

[Main Component] is the name of the UI component such as "Button", "Toolbar", etc. You can look up all possible values in "docs/components.txt".

[Context] describes the [Main Component], and its possible values depend on which [Main Component] is used. They are based on the values found in the Semantic Annotations accompanying each component. You can use more than one type of [Context] value on one component, or none at all. The [Context] types and the components they can be used on are as follows:

- Text Button Concepts: Used on "Button". They describe what the button is used for, such as "retry", "undo", "logout", etc., which are self-explanatory. You can look up all possible values in "docs/buttons.txt".
- Icon Classes: Used on "Icon". Like Text Button Concepts, they describe what kind of icon is on the screen, such as "avatar", "arrow backward", etc. You can look up all possible values in "docs/icons.txt". I also used BLIP-2 to fill in any icons with empty classes, so feel free to experiment.
- Text: Used on "Text" (the component, not this [Context] type), "Button", "Input", and "Radio Button". This is the text seen on those components, which can be anything. I didn't put it in quotation marks in the training data, and I limited it to 1-2 words to minimize caption length. "Button" can use Text instead of Text Button Concepts, or both, in which case the latter comes first followed by the former. "Input" uses this value for text already entered in the field.
- Resource Id: Used on "Number Stepper". For example, a "Number Stepper" can have a resource id such as "year" or "month", which I used accordingly.
- For "Background Image", "Image", "Video", and "Web View", I used BLIP-2 to describe what they show. Feel free to write anything for them.

[Position] describes where the component is located on the screen such as "Top Left", "Bottom Right", etc. Refer to "docs/positioning.pdf" for more information.

[Internal Components] are the components inside a container component ("Bottom Navigation", "Button Bar", "Card", "Date Picker", "Drawer", "List Item", "Modal", "Multi-Tab", or "Toolbar"). There can be multiple components inside a parent component; write them in the same "[Component] and [Component] and..." format, wrapped between "containing" and "inside". Their [Position] is relative to the parent component, not the screen.

As a limitation, "Text" is not trained with [Color], and neither are "Advertisement", "Background Image", "Image", "Map View", "Video", and "Web View", so don't add a color to them. This still turned out fine, as Rico Diffusion generates text with sufficient contrast against its background, even if the text itself may be unreadable.

Check the "prompt.txt" files in the sub-folders of the "results" folder for complete example prompts. You may need to use GLIGEN to improve the layout.
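To make the format concrete, here is a hypothetical helper that assembles a prompt from the pieces described above. The function and its signature are my own illustration, not part of the released tooling; the example reproduces the prompt shown in this card's widget:

```python
def build_prompt(components, category=None, background=None):
    """Assemble a component-based prompt in the format:
    [Component] and [Component] and ..., Android UI, [Category], [Background Color]

    `components` are pre-formatted "[Color] [Main Component] [Context] [Position]"
    strings; container components embed their children via "containing ... inside".
    """
    parts = [" and ".join(components), "Android UI"]
    if category:                       # optional app-category tag
        parts.append(category)
    if background:                     # "[Color] background" suffix
        parts.append(f"{background} background")
    return ", ".join(parts)

# Rebuilding the login-page prompt from this card's widget example:
prompt = build_prompt(
    ["red Toolbar Upper Top containing Text Left Login inside",
     "white Input Upper Top",
     "white Input Lower Top",
     "red Button login Upper Middle"],
    category="Medical",
    background="white",
)
```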

# GLIGEN

You can try using the model without GLIGEN first to see if that's enough. If not, you can use GLIGEN to control the positions of the components on the screen by inputting bounding boxes and grounded texts. I generated the images in A1111 with GLIGEN using this extension (https://github.com/AzhureRaven/sd_webui_gligen), which I forked from https://github.com/ashen-sensored/sd_webui_gligen and modified slightly to work on local machines, as the original seemed to be designed for Colab.

You can learn how GLIGEN works and how to use it on the page mentioned earlier. In short, to use GLIGEN with this model, input a bounding box and grounded text for each UI component. Draw each bounding box at the position where you intend to place the component, matching the [Position] you give as closely as possible. You should keep parent and internal components in separate bounding boxes, with the internal components' boxes inside the parent's, but you are free to experiment. For the grounded text of each component, I found it best to write the full "[Color] [Main Component] [Context] [Position]" format, with [Internal Components] in separate bounding boxes without "containing" and "inside".

Note that GLIGEN has a limit of 30 bounding box-grounded text pairs per input, which means a maximum of 30 components. If you want more, you may need to combine parent and internal components.
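As a sketch of how one might sanity-check a set of GLIGEN inputs against this limit, here is a small validator. The pair data structure and coordinate convention are my own assumptions for illustration; the extension's actual input format may differ:

```python
MAX_GLIGEN_PAIRS = 30  # GLIGEN's limit on bounding box-grounded text pairs

def validate_gligen_input(pairs):
    """Check a list of (grounded_text, (x0, y0, x1, y1)) pairs.

    Raises ValueError if the 30-pair limit is exceeded or a box is
    malformed. Coordinates are assumed normalized to [0, 1].
    """
    if len(pairs) > MAX_GLIGEN_PAIRS:
        raise ValueError(
            f"GLIGEN supports at most {MAX_GLIGEN_PAIRS} pairs, got {len(pairs)}"
        )
    for text, (x0, y0, x1, y1) in pairs:
        if not (0 <= x0 < x1 <= 1 and 0 <= y0 < y1 <= 1):
            raise ValueError(f"Malformed box for {text!r}: {(x0, y0, x1, y1)}")
    return True
```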

During testing, I found the bounding boxes to be very sensitive, particularly for "Button" and "Input". You may need to re-enter them slightly bigger or smaller, and/or shift them slightly, to get the layout right.

Check the "gligen_input.txt" and "gligen.png" files in the sub-folders of the "results" folder for the grounded texts and bounding boxes I used, to get an idea of what to input.

# A1111 Hyperparameters

The images in the "results" folder were produced with these A1111 hyperparameters, which I found to give the best results:
- Sampling method: DPM++ SDE
- Schedule type: Karras
- Sampling steps: 20
- Width: 384
- Height: 640
- Batch count: 4
- Batch size: 1
- CFG Scale: 7.5
- Seed: 555
- Clip Skip: 2

Width and Height can also be 448x576 for modal dialogs such as date pickers. Clip Skip 2 is essential; the model will fail completely without it. When using this model with GLIGEN in A1111 through my version of the extension, don't use Batch size to produce multiple images, as GLIGEN will only apply to the first image; use Batch count instead.

For the GLIGEN parameters, they are as follows:
- Strength: 1
- Stage one: 0.2
- Stage two: 0.5
- Canvas width: 384
- Canvas height: 640

In general, Canvas width and height should be the same as image Width and Height.

## Limitations

- The model struggles to generate UIs with a large number of components (10 or more), often resulting in missing or poorly structured elements.
- Text generation is not reliable; generated text may be unreadable or semantically incorrect.
- Color predictions may be inaccurate due to limitations in BLIP-2-based color labeling.
- When using GLIGEN, the bounding boxes are highly sensitive and may require multiple adjustments to achieve the desired layout.
- The model is highly dependent on structured prompts; performance degrades significantly with vague or unstructured descriptions.

## Dataset

This model is trained on a processed version of the Rico dataset.
Due to licensing restrictions, the dataset is not publicly redistributed.

The preprocessing pipeline is available upon request.

## Citation
```bibtex
@INPROCEEDINGS{11296000,
  author={Fendy, Abraham Arthur and Kristian, Yosi and P. C. S. W, Lukman Zaman},
  booktitle={2025 7th International Conference on Cybernetics and Intelligent System (ICORIS)}, 
  title={Generative AI for Mobile App UI Mockups Using Stable Diffusion with the EveryDream2 Fine-Tuner and GLIGEN}, 
  year={2025},
  volume={},
  number={},
  pages={1-6},
  keywords={Training;Grounding;Navigation;Text to image;Color;User interfaces;Hardware;Mobile handsets;Mobile applications;Software development management;Mobile UI Mockup Generator;Stable Diffusion;EveryDream2;GLIGEN},
  doi={10.1109/ICORIS67789.2025.11296000}}
```