AzhureRaven committed · verified
Commit 6f068de · 1 Parent(s): a460672

Update README.md

Files changed (1): README.md (+109 -3)

README.md CHANGED
@@ -1,3 +1,109 @@
- ---
- license: creativeml-openrail-m
- ---

---
license: creativeml-openrail-m
language:
- en
library_name: diffusers
pipeline_tag: text-to-image
inference:
  parameters:
    width: 384
    height: 640
    clip_skip: 2
    guidance_scale: 7.5
    num_inference_steps: 20
widget:
- text: "red Toolbar Upper Top containing Text Left Login inside and white Input Upper Top and white Input Lower Top and red Button login Upper Middle, Android UI, Medical, white background"
---

# Rico Diffusion V0.5 Model Card

This is my final project, for which I fine-tuned a Stable Diffusion 1.5 model to create Android UI mockups at 384x640, using GLIGEN (https://gligen.github.io) to control UI component positions. Some designs, primarily modal dialogs, are generated at 448x576 instead.

I used EveryDream2 (https://github.com/victorchall/EveryDream2trainer) to fine-tune the model on the Rico Dataset (http://www.interactionmining.org/rico.html) of UI screenshots. I wrote a Python script that parses the Semantic Annotations part of the dataset to create a caption for each screenshot, and uses the Play Store and UI Metadata to add the app categories as extra tags. I also cropped each UI component of a given screenshot (with exceptions) and labeled it accordingly, so the model could be trained on individual components first before moving on to whole screenshots.

In other words, a Python script run in Colab processes the Rico dataset into a new dataset containing UI screenshots and their captions alongside individual UI components with their captions. For training, I split the individual components into two groups based on a total pixel count threshold of 512x512 = 262,144: components smaller than the threshold go into the small component group, while larger ones go into the big component group. The model is trained on those groups separately before finally training on the full UIs. A minimal sketch of this grouping step is shown below.
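
The preprocessing script itself is not reproduced here, but the grouping step boils down to comparing each cropped component's width x height against that 262,144-pixel threshold. A minimal sketch (the folder layout and file naming are assumptions for illustration, not the real dataset structure):

```python
from pathlib import Path
from PIL import Image

THRESHOLD = 512 * 512  # 262,144 pixels: the small/big component cutoff

def split_components(component_dir):
    """Sort cropped UI component images into 'small' and 'big' groups by pixel count."""
    small, big = [], []
    # Hypothetical layout: one PNG per cropped component in a single folder.
    for path in sorted(Path(component_dir).glob("*.png")):
        with Image.open(path) as img:
            width, height = img.size
        (small if width * height < THRESHOLD else big).append(path)
    return small, big

small_group, big_group = split_components("rico_components")
print(f"{len(small_group)} small components, {len(big_group)} big components")
```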

# Training

I did 6 training sessions totaling 20 epochs, which took approximately 138.5 hours on an NVIDIA GeForce RTX 3060 12GB. I had to split the small components into 4 parts because their size (571.5k samples) froze the computer; they were trained at batch size 7 and resolution 384 over 4 sessions, the first for 3 epochs and the rest for 2 epochs each. The model was then trained on the big components (34.5k samples) at batch size 5 and resolution 448 for 6 epochs, and finally on the full UIs (65.5k samples) at batch size 4 and resolution 512 for 5 epochs. I first ran the final session with validation, then backed up and redid it without validation so the model was trained on all of the UIs.

I have uploaded all of the related training configurations used in EveryDream2. The order of the main configuration files used is as follows: "rico_diffusion_v2_comp.json", "rico_diffusion_v2_comp_image.json", "rico_diffusion_v2_comp_icon.json", "rico_diffusion_v2_comp_text_button.json", "rico_diffusion_v2_comp_big", and finally "rico_diffusion.json". "rico_diffusion_v2_full" is the validation version of the final session, before I backed up and redid it with "rico_diffusion.json".
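
For orientation, each of those JSON files mostly differs in the dataset it points at, the batch size, the resolution, and the epoch count. The sketch below writes out a config for the final full-UI session; the key names are my best recollection of EveryDream2's train.json schema and the paths are placeholders, so treat the uploaded configuration files as the authoritative versions.

```python
import json

# Rough shape of the final session (full UIs: batch size 4, resolution 512, 5 epochs).
# Key names follow EveryDream2's train.json as I recall it and may not match exactly;
# the "data_root" and "resume_ckpt" values are placeholders, not the real paths.
final_session = {
    "data_root": "datasets/rico_full_uis",
    "resume_ckpt": "rico_diffusion_v2_comp_big",
    "batch_size": 4,
    "resolution": 512,
    "max_epochs": 5,
}

with open("rico_diffusion.json", "w") as f:
    json.dump(final_session, f, indent=2)
```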

The final model turned out decently well at creating Android UI mockups. It is still not optimal, especially with many UI components (10 or more), but it is far better than the base model, given that I had to limit the number of training epochs and the batch size due to the limited hardware I have access to.

All images produced in testing can be found in the "results" folder. I tried to reproduce 10 UIs from the dataset and 5 UIs from outside of it (by screenshotting various apps on my phone and manually writing captions as close to the script's output as possible), saving the results in sub-folders named after the data id and the app name, respectively. For each sub-folder, I generated 4 images with and without GLIGEN, using both Rico Diffusion and the base Stable Diffusion 1.5 model for comparison. Each sub-folder also contains the prompts and GLIGEN inputs used, among other things related to the testing.

# Prompt

This model is fine-tuned to work with component-based prompts, so that you have better control over which UI components are included in the image and where they are placed by specifying every component in the prompt, instead of writing a vague outline of what the UI is supposed to be, like "An Android UI of a login page".

In general, based on the training captions, the prompt should be formatted like this:

<center>[Component] and [Component] and..., Android UI, [Category]</center>

The prompt should describe each UI component in [Component], separated with "and", and end with the "Android UI" tag to inform the model that you are trying to produce UI images.

The [Category] tag is optional and describes the app category the UI belongs to, such as "Medical", "Video Players & Editors", etc. You can look up all possible values in "docs/categories.txt".

Each [Component] can be broken down like this:

<center>[Main Component] [Context] [Position] [Internal Components]</center>

[Main Component] is the name of the UI component, such as "Text Button", "Toolbar", etc. You can look up all possible values in "docs/components.txt".

[Context] describes the [Main Component], and which values are valid depends on the [Main Component] used. They are based on the various values found in the Semantic Annotations accompanying each component. You can use more than one 'type' of [Context] value on one component, or none at all. The types of [Context] values and the components they can be used on are as follows:

- Text Button Concepts: Used on "Text Button"; they describe what the button is used for, such as "retry", "undo", "logout", etc., which are self-explanatory. You can look up all possible values in "docs/text buttons.txt".
- Icon Classes: Used on "Icon"; like Text Button Concepts, they describe what kind of icon is on the screen, such as "avatar", "arrow backward", etc. You can look up all possible values in "docs/icon classes.txt".
- Text: Used on "Text" (the component, not the [Context] type), "Text Button", "Input", and "Radio Button"; this is the text seen on those components, which can be anything. I did not put it in quotation marks in the training data and limited it to 1-2 words to minimize caption length. "Text Button" can use Text instead of Text Button Concepts, or both, in which case the latter comes first followed by the former. "Input" uses this value for text already entered into it.
- Play Store Name: Used on "Web View" and "Video"; this is the app's name in the Play Store and is the closest thing I could use to describe these components in the training data. Download the Play Store Metadata CSV file on the Rico Dataset page to see which values were used in the training data.
- Resource Id: Used on "Image", "Background Image", and "Number Stepper"; I used the resource id of those components found in the Semantic Annotations to describe them, unless it was generic like "img", "imgView", etc. For example, a "Number Stepper" can have a resource id such as "year" or "month", which I used accordingly. For "Image" and "Background Image", try to use values like "dog", "cat", etc. to describe what image you want, or leave them empty and let the model decide.
- Class Name: Used on "Input"; not every "Input" is a textbox, so it can either be a textbox or a different kind of input according to the Semantic Annotations. In the training data, if the Class Name has to do with text boxes, "Input" is given "Textbox", forming "Input Textbox" for simplification. Otherwise, it is whatever is in the Class Name key. So in your prompt, just write "Input Textbox" if you want to draw text boxes; please browse the Semantic Annotations if you want something else.

[Position] describes where the component is located on the screen, such as "Top Left", "Bottom Right", etc. Refer to "docs/positioning.pdf" for more information.

[Internal Components] are the components inside a container component, which are "Bottom Navigation", "Button Bar", "Card", "Date Picker", "Drawer", "List Item", "Modal", "Multi-Tab", and "Toolbar". There can be multiple components inside the parent component; you write them in the same "[Component] and [Component] and..." format but wrapped between "containing" and "inside", and their [Position] is relative to the parent component, not the screen. You can check the prepared prompt in the widget (the "Toolbar" portion) to see how it is done.

Putting it all together, a prompt should look something like the one prepared in the widget, which will generate a simple username-password login page with a back button on the top left inside the toolbar. You may need to use GLIGEN to make it look better. A toy prompt-builder sketch is shown below.
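
To make the format concrete, here is a toy helper (not part of this repo) that assembles prompts in the structure described above; the component names and positions are taken from the widget example.

```python
def component(main, context="", position="", internal=None):
    """Build one "[Main Component] [Context] [Position] [Internal Components]" phrase."""
    phrase = " ".join(part for part in (main, context, position) if part)
    if internal:  # child components go between "containing" and "inside"
        phrase += " containing " + " and ".join(internal) + " inside"
    return phrase

def build_prompt(components, category=""):
    """Join component phrases with "and", then append the "Android UI" tag and optional [Category]."""
    prompt = " and ".join(components) + ", Android UI"
    if category:
        prompt += ", " + category
    return prompt

prompt = build_prompt([
    component("Toolbar", position="Upper Top",
              internal=[component("Icon", "arrow backward", "Left")]),
    component("Input Textbox", "Username", "Lower Top"),
    component("Input Textbox", "Password", "Upper Middle"),
    component("Text Button", "Login", "Lower Center"),
], category="Medical")
print(prompt)
# Toolbar Upper Top containing Icon arrow backward Left inside and Input Textbox Username Lower Top
# and Input Textbox Password Upper Middle and Text Button Login Lower Center, Android UI, Medical
```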

Check the "prompt.txt" files in the sub-folders of the "results" folder for the prompts I used.

# GLIGEN

You can try to use the model without GLIGEN first to see if that is enough. If not, you can use GLIGEN to control the positions of the components on the screen by inputting bounding boxes and grounded texts. I generated the images in A1111 with GLIGEN using this extension (https://github.com/AzhureRaven/sd_webui_gligen), which I forked from ashen-sensored and modified a bit to make it work on local computers, as the original seemed to be designed for Colab.

You can learn how GLIGEN works and how to use it on the page mentioned earlier. In short, to use GLIGEN with this model, input a bounding box and grounded text for each UI component. Draw each bounding box at the position you intend to put the component, matching the [Position] you give as closely as possible. You should keep the parent and internal components in separate bounding boxes, with the internal components' boxes inside the parent's, but you are free to experiment. For the grounded text of each component, I found it best to write the full "[Main Component] [Context] [Position]" format, with [Internal Components] on separate bounding boxes without "containing" and "inside".

For the prompt in the widget, the grounded text should look like this: "Toolbar Upper Top; Icon arrow backward Left; Input Textbox Username Lower Top; Input Textbox Password Upper Middle; Text Button Login Lower Center", and the bounding boxes should look like the image below.

![Gligen Input](gligen_example.png)
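
Expressed as data, those grounded text/bounding box pairs look roughly like the list below. The coordinates are made-up placeholder values in normalized [x1, y1, x2, y2] form for illustration only; they are not the exact boxes shown in gligen_example.png, so draw your own to match your layout.

```python
# One (grounded text, bounding box) pair per UI component for the widget prompt.
# Boxes are normalized [x1, y1, x2, y2]; the numbers are illustrative placeholders.
gligen_pairs = [
    ("Toolbar Upper Top",                   [0.00, 0.00, 1.00, 0.10]),
    ("Icon arrow backward Left",            [0.02, 0.02, 0.12, 0.08]),  # kept inside the Toolbar's box
    ("Input Textbox Username Lower Top",    [0.10, 0.30, 0.90, 0.38]),
    ("Input Textbox Password Upper Middle", [0.10, 0.42, 0.90, 0.50]),
    ("Text Button Login Lower Center",      [0.30, 0.60, 0.70, 0.68]),
]
```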

Note that GLIGEN has a limit of 30 bounding box-grounded text pairs in one input, which means a maximum of 30 components. If you want more, you may need to combine parent and internal components.

During testing, I found the bounding boxes to be very sensitive, particularly for "Text Button" and "Input". You may need to redraw them slightly bigger or smaller and/or shift them a little to get it right. For example, I found that making the bounding box for "Input Textbox Password Upper Middle" bigger vertically could cause the model to generate two text boxes instead of one. If a bounding box is too small, or too close to another bounding box of the same component type when it is not supposed to be, the component may disappear entirely or merge with the other component.

Check the "gligen_input.txt" and "gligen.png" files in the sub-folders of the "results" folder for the grounded texts and bounding boxes I used.

# Parameters

The images produced in the "results" folder use these parameters in A1111, which I found to give the best results:
- Sampling method: DPM++ SDE
- Sampling steps: 15
- Width: 384
- Height: 640
- Batch count: 4
- Batch size: 1
- CFG Scale: 7
- Seed: 555
- Clip Skip: 2

Width and Height can also be 448x576 for modal dialogs such as date pickers. Clip Skip 2 is important, as the model will utterly fail otherwise. When using this model with GLIGEN in A1111 through my version of the extension, don't use Batch size to produce multiple images; GLIGEN will only apply to the first image. Use Batch count instead.

The GLIGEN parameters are as follows:
- Strength: 1
- Stage one: 0.2
- Stage two: 0.5
- Canvas width: 384
- Canvas height: 640

In general, Canvas width and height should be the same as the image Width and Height.
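
If you want to try the model outside A1111, a minimal diffusers sketch (matching the inference parameters in the front matter) might look like the following. The repository id below is a placeholder for this repo's actual id, the scheduler swap assumes diffusers' DPMSolverSDEScheduler as the counterpart of A1111's DPM++ SDE, and A1111's and diffusers' clip-skip numbering may differ by one, so adjust if results look off.

```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverSDEScheduler

# Placeholder repo id; replace with this repository's actual id.
pipe = StableDiffusionPipeline.from_pretrained("AzhureRaven/rico-diffusion", torch_dtype=torch.float16)
pipe.scheduler = DPMSolverSDEScheduler.from_config(pipe.scheduler.config)  # rough DPM++ SDE equivalent
pipe = pipe.to("cuda")

prompt = ("Toolbar Upper Top containing Icon arrow backward Left inside and "
          "Input Textbox Username Lower Top and Input Textbox Password Upper Middle and "
          "Text Button Login Lower Center, Android UI, Medical")

image = pipe(
    prompt,
    width=384, height=640,     # 448x576 for modal dialogs
    num_inference_steps=20,
    guidance_scale=7.5,
    clip_skip=2,               # per the front matter; diffusers may count clip skip differently than A1111
).images[0]
image.save("login_mockup.png")
```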