AzhureRaven committed
Commit 0c8970c · verified · Parent(s): 9f510af

Update README.md

Files changed (1): README.md (+11 -12)

  # Rico Diffusion Model Card

I fine-tuned a Stable Diffusion 1.5 model to generate mobile UI mockups at 384x640, using GLIGEN (https://gligen.github.io) to control UI component positions. Some designs, primarily modal dialogs, are generated at 448x576 instead.

I used EveryDream2 (https://github.com/victorchall/EveryDream2trainer) to fine-tune the model on the Rico Dataset (http://www.interactionmining.org/rico.html) of UI screenshots. I wrote a Python notebook that parses the Semantic Hierarchies portion of the dataset to create a caption for each screenshot, and that uses the Play Store and UI Metadata to add the app categories as extra tags. I also cropped each UI component of a given screenshot (with exceptions) and labeled it accordingly, so the model could be trained on individual UI components before moving on to whole screenshots. In addition, I used BLIP-2 (https://huggingface.co/Salesforce/blip2-opt-2.7b-coco) to add color names to the UI components in the captions, as well as general labels for certain components.

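To illustrate the captioning step, here is a minimal sketch of how per-component phrases and app-category tags might be combined into one screenshot caption. The function name, the tag separator, and the sample values are my own simplifications, not the exact logic of the notebook.

```python
def build_caption(component_phrases, app_categories):
    """Join per-component phrases with " and ", then append app categories as tags.

    Follows the "[Component] and [Component] and ..." caption format described
    in this card; the comma separator for the category tags is an assumption.
    """
    caption = " and ".join(component_phrases)
    if app_categories:
        caption += ", " + ", ".join(app_categories)
    return caption

# Illustrative phrases and category; real captions come from the Rico annotations.
print(build_caption(["Toolbar Upper Top", "Icon arrow backward Left"], ["Finance"]))
```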
I ran the notebook in a Colab local runtime to process the Rico dataset. For training, I split the individual components into two groups based on a total pixel-count threshold of 512x512 = 262,144: components smaller than the threshold go into the small-component group, while larger components go into the big-component group. The model is trained on those groups separately before finally training on the full UIs.

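The grouping rule above can be sketched as a tiny helper. The names are illustrative, and how a crop that lands exactly on the threshold is handled is my assumption; the card only specifies "smaller" and "bigger".

```python
THRESHOLD = 512 * 512  # 262,144 pixels, the split point described above

def component_group(width, height):
    """Assign a cropped UI component to the small or big training group.

    Crops exactly at the threshold go to "big" here; that boundary choice
    is an assumption, not stated in the card.
    """
    return "small" if width * height < THRESHOLD else "big"

print(component_group(384, 640))  # a 384x640 crop has 245,760 pixels
```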
  The prepared Rico dataset used for this training can be accessed here (https://huggingface.co/datasets/AzhureRaven/rico-ui-component-caption).

# Training

I ran 7 fine-tuning sessions totaling 20 epochs on an NVIDIA GeForce RTX 4070 Ti SUPER with 16GB of VRAM. I split the small components into 5 parts due to the large amount of data (666.7k items); they were trained at batch size 11 and resolution 384 over 5 sessions, the first for 3 epochs and the rest for 2 epochs each. The model was then trained on the big components (46.7k items) at batch size 9 and resolution 448 for 4 epochs, and finally on the full UIs (65.5k items) at batch size 7 and resolution 512 for 5 epochs. I first ran the final session with validation, then backed the model up and reran the session without validation so that the model was trained on all of the UIs.

I have uploaded all of the related EveryDream2 training configurations in "ed2-config". The main configuration files were used in the following order: "rico_diffusion_v2_comp.json", "rico_diffusion_v2_comp_image.json", "rico_diffusion_v2_comp_icon.json", "rico_diffusion_v2_comp_button.json", "rico_diffusion_v2_comp_list_item.json", "rico_diffusion_v2_comp_big", and finally "rico_diffusion.json". "rico_diffusion_v2_full" is the validation version of the final session, run before backing up and redoing it with "rico_diffusion.json".

The final model turned out decently well at creating UI mockups. It is still not optimal, especially with many UI components (10 or more), but it is far better than the base model, given that I had to limit the number of training epochs and the batch size due to the limited hardware I have access to.

All images produced in testing can be found in the "results" folder. I tried to reproduce 10 UIs from the dataset and 5 UIs outside of it by screenshotting various apps on my phone, manually writing captions as close to the script's format as possible, and saving the results in sub-folders named after the data ID and the app name respectively. For each sub-folder, I generated 4 images with and without GLIGEN, using both Rico Diffusion and the base Stable Diffusion 1.5 model for comparison. Each sub-folder also contains the prompts and GLIGEN inputs used, among other things related to the testing.

Each [Component] can be divided into this:
- Icon Classes: Used on "Icon". Like Text Button Concepts, they describe what kind of icon is on the screen, such as "avatar", "arrow backward", etc. You can look up all possible values in docs/icon classes.txt.
- Text: Used on "Text" (the component, not the [Context] type), "Text Button", "Input", and "Radio Button". This is the text seen on those components, which can be anything. I didn't put it in quotation marks in the training data, and I limited it to 1-2 words to minimize caption length. For "Button", Text can be used instead of Text Button Concepts, or both can be used, in which case the latter comes first followed by the former. "Input" uses this value for text already entered into the field.
- Class Name: Used on "Input". Not every "Input" is a textbox, so it can be either that or a different kind of input in the Semantic Annotations. In the training data, if the Class Name has to do with text boxes, "Input" is kept as is; otherwise, it's whatever is in the Class Name key. So in your prompt, just write "Input" if you want to draw text boxes; please browse the Semantic Annotations if you want something else.
- For "Background Image", "Icon", "Image", "Video", and "Web View", I used BLIP-2 to describe what they are. Feel free to write anything for them.
 
  [Position] describes where the component is located on the screen such as "Top Left", "Bottom Right", etc. Refer to "docs/positioning.pdf" for more information.
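
As a rough illustration only: the real position vocabulary (which includes finer labels such as "Upper Top" and "Lower Center") is defined in "docs/positioning.pdf". This hypothetical helper just maps a normalized component center onto a simple 3x3 grid.

```python
def position_label(cx, cy):
    """Map a normalized (0-1) component center to a coarse 3x3 position label.

    Hypothetical simplification; see docs/positioning.pdf for the real scheme.
    """
    cols = ["Left", "Center", "Right"]
    rows = ["Top", "Middle", "Bottom"]
    col = cols[min(int(cx * 3), 2)]  # clamp 1.0 into the last cell
    row = rows[min(int(cy * 3), 2)]
    return f"{row} {col}"

print(position_label(0.15, 0.1))  # a component near the top-left corner
```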
[Internal Components] are the components inside a container component, which are "Bottom Navigation", "Button Bar", "Card", "Date Picker", "Drawer", "List Item", "Modal", "Multi-Tab", and "Toolbar". There can be multiple components inside the parent component; you write them in the same "[Component] and [Component] and..." format, but wrapped between "containing" and "inside", and [Position] is relative to the parent component, not the screen.

As a limitation, "Text" is not trained with [Color], so don't add it to them. This still turned out fine, as Rico Diffusion is able to generate text with sufficient contrast to its background, even if it may be unreadable.

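Putting the container format above into code, a minimal sketch (the function name and the sample component phrases are mine, chosen to match the format described here):

```python
def container_phrase(parent, internals):
    """Build a container caption: internal components are joined with " and "
    and wrapped between "containing" and "inside", per the format above."""
    inner = " and ".join(internals)
    return f"{parent} containing {inner} inside"

# Positions of the internals are relative to the parent, not the screen.
print(container_phrase("Modal Center", ["Text Button Login Lower Center", "Icon arrow backward Left"]))
```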
Check the "prompt.txt" files in the sub-folders of the "results" folder for the prompts I used, which put all of this together. You may need to use GLIGEN to make the results look better.

  # GLIGEN

You can try to use the model without it to see if it's enough. If not, you can use it.

You can learn how GLIGEN works and how to use it on the page I mentioned earlier. In short, to use GLIGEN with this model, input a bounding box and grounded text for each UI component. Draw the bounding boxes at the position you intend to put each component, matching the [Position] you give as closely as possible. You should keep the parent and internal components in separate bounding boxes, with the internal components' boxes inside the parent's, but you are free to experiment. For the grounded text of each component, I found it best to write the full "[Color] [Main Component] [Context] [Position]" format, with [Internal Components] on separate bounding boxes without "containing" and "inside".

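For instance, a set of grounded-text/bounding-box pairs for a simple login screen might look like the sketch below. The phrases follow the format above; the boxes are (x0, y0, x1, y1) in normalized coordinates, and their exact values are made up for illustration. If you drive GLIGEN through diffusers' GLIGEN pipeline, these should map to its `gligen_phrases` and `gligen_boxes` arguments.

```python
# One grounded-text phrase per UI component, in "[Color] [Main Component] [Context] [Position]" form.
gligen_phrases = [
    "Toolbar Upper Top",
    "Icon arrow backward Left",
    "Input Textbox Username Lower Top",
    "Input Textbox Password Upper Middle",
    "Text Button Login Lower Center",
]

# Matching bounding boxes as (x0, y0, x1, y1), normalized to [0, 1]; values are illustrative.
gligen_boxes = [
    (0.00, 0.00, 1.00, 0.10),
    (0.02, 0.01, 0.10, 0.09),
    (0.10, 0.25, 0.90, 0.33),
    (0.10, 0.38, 0.90, 0.46),
    (0.30, 0.55, 0.70, 0.63),
]

# One box per phrase, and GLIGEN accepts at most 30 pairs per input.
assert len(gligen_phrases) == len(gligen_boxes) <= 30
```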
  Note that GLIGEN has a limit of 30 bounding box-grounded text pairs in one input which means a maximum of 30 components. If you want more you may need to combine parent and internal components.
  During testing, I found the bounding boxes to be very sensitive, particularly on "Button" and "Input". You may need to reinput them slightly bigger/smaller and/or shift them slightly to get it right.
Check the "gligen_input.txt" and "gligen.png" files in the sub-folders of the "results" folder for the grounded text and bounding boxes I used, to give you an idea of what you should input.

  # Parameters
The images in the "results" folder were produced with these A1111 parameters, which I found to give the best results:
- Sampling method: DPM++ SDE
- Schedule type: Karras
- Sampling steps: 20
- Width: 384
- Height: 640
 