AzhureRaven committed (verified)
Commit ebfbef0 · Parent(s): e9ed169

Update README.md

Files changed (1):
  1. README.md (+15 -15)

README.md CHANGED
@@ -32,6 +32,8 @@ I used EveryDream2 (https://github.com/victorchall/EveryDream2trainer) to fine-t
 
 In other words, I use a Python script run in Colab to process the Rico dataset into a new dataset containing UI screenshots and their captions, alongside individual UI components with their captions. I split the individual components into two groups for the training process based on a total pixel count threshold of 512x512 = 262,144: components smaller than the threshold go into the small component group, while components bigger than that go into the big component group. The model is trained on those groups separately before finally training on the full UIs.
 
+ The prepared Rico dataset used for this training can be accessed here (https://huggingface.co/datasets/AzhureRaven/rico-ui-component-caption).
+ 
 # Training
 
 I did 6 training sessions for 20 epochs, taking approximately 138.5 hours to train on an NVIDIA GeForce RTX 3060 12GB. I had to split the small components (571.5k items) into 4 parts because their size froze the computer; they were trained at batch size 7 and resolution 384 over 4 sessions, the first for 3 epochs and the rest for 2 epochs each. The model was then trained on the big components (34.5k items) at batch size 5 and resolution 448 for 6 epochs, and finally on the full UIs (65.5k items) at batch size 4 and resolution 512 for 5 epochs. I first ran the final session with validation, then backed it up and redid it without validation so that the model was trained on all of the UIs.
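The small/big grouping rule above reduces to a single area comparison; here is a minimal sketch (the helper name is mine, not from the author's script):

```python
# Threshold from the README: components are grouped by total pixel count,
# with 512 x 512 = 262,144 as the cutoff between "small" and "big".
SMALL_BIG_THRESHOLD = 512 * 512  # 262,144 pixels

def component_group(width: int, height: int) -> str:
    """Return which training group a component crop belongs to.

    Hypothetical helper illustrating the grouping rule described above;
    components at exactly the threshold are treated as 'big' here, which
    the README leaves unspecified.
    """
    return "small" if width * height < SMALL_BIG_THRESHOLD else "big"
```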
@@ -48,40 +50,38 @@ This model is fine-tuned to work with component-based prompts so that you have b
 
 In general, based on the training captions, the prompt should be formatted like this:
 
- <center>[Component] and [Component] and..., Android UI, [Category]</center>
+ <center>[Component] and [Component] and..., Android UI, [Category], [Background Color]</center>
 
 The prompt should describe each UI component in [Component], separated with "and", and end with the "Android UI" tag to inform the model that you are trying to produce UI images.
 
 The [Category] tag is optional; it describes the app category the UI belongs to, such as "Medical", "Video Players & Editors", etc. You can look up all possible values in "docs/categories.txt".
 
+ [Background Color] should be written as "[Color] background", where [Color] is a color name from Material Design 2 (https://m2.material.io/design/color/the-color-system.html). You can look up the values in "docs/colors.txt".
+ 
 Each [Component] can be broken down like this:
 
- <center>[Main Component] [Context] [Position] [Internal Components]</center>
+ <center>[Color] [Main Component] [Context] [Position] [Internal Components]</center>
 
- [Main Component] is the name of the UI component, such as "Text Button", "Toolbar", etc. You can look up all possible values in "docs/components.txt".
+ [Main Component] is the name of the UI component, such as "Button", "Toolbar", etc. You can look up all possible values in "docs/components.txt".
 
 [Context] describes the [Main Component] and depends on which [Main Component] is used. The values are based on the Semantic Annotations accompanying each component. You can use more than one 'type' of [Context] value on one component, or none at all. The types of [Context] values and the components they apply to are as follows:
 
- - Text Button Concepts: Used on "Text Button"; they describe what the button is used for, such as "retry", "undo", "logout", etc. You can look up all possible values in "docs/text buttons.txt".
+ - Text Button Concepts: Used on "Button"; they describe what the button is used for, such as "retry", "undo", "logout", etc. You can look up all possible values in "docs/text buttons.txt".
 - Icon Classes: Used on "Icon"; like Text Button Concepts, they describe what kind of icon appears on the screen, such as "avatar", "arrow backward", etc. You can look up all possible values in "docs/icon classes.txt".
- - Text: Used on "Text" (the component, not the [Context] type), "Text Button", "Input", and "Radio Button"; it is the text seen on those components, which can be anything. I didn't put it in quotation marks in the training data, and I limited it to 1-2 words to minimize caption length. "Text Button" can use Text instead of Text Button Concepts, or both, in which case the latter comes first, followed by the former. "Input" uses this value for text already entered into it.
+ - Text: Used on "Text" (the component, not the [Context] type), "Text Button", "Input", and "Radio Button"; it is the text seen on those components, which can be anything. I didn't put it in quotation marks in the training data, and I limited it to 1-2 words to minimize caption length. "Button" can use Text instead of Text Button Concepts, or both, in which case the latter comes first, followed by the former. "Input" uses this value for text already entered into it.
- - Play Store Name: Used on "Web View" and "Video"; it is the app's name in the Play Store and was the closest thing I could use to describe these components in the training data. Download the Play Store metadata CSV file on the Rico dataset page to see what values were used in the training data.
- - Resource Id: Used on "Image", "Background Image", and "Number Stepper"; I used the resource id of those components found in the Semantic Annotations to describe them, unless it was generic like "img", "imgView", etc. For example, a "Number Stepper" can have resource ids such as "year" and "month", which I used accordingly. For "Image" and "Background Image", try values like "dog", "cat", etc. to describe the image you want, or leave them empty and let the model decide.
- - Class Name: Used on "Input"; not every "Input" is a textbox, so it can be that or a different kind of input per the Semantic Annotation. In the training data, if the Class Name has to do with text boxes, "Input" is given "Textbox", forming "Input Textbox" for simplification; otherwise, it is whatever is in the Class Name key. So in your prompt, just write "Input Textbox" if you want to draw text boxes; browse the Semantic Annotations if you want something else.
+ - Class Name: Used on "Input"; not every "Input" is a textbox, so it can be that or a different kind of input per the Semantic Annotation. In the training data, if the Class Name has to do with text boxes, "Input" is kept as is; otherwise, it is whatever is in the Class Name key. So in your prompt, just write "Input" if you want to draw text boxes; browse the Semantic Annotations if you want something else.
 
 [Position] describes where the component is located on the screen, such as "Top Left", "Bottom Right", etc. Refer to "docs/positioning.pdf" for more information.
 
- [Internal Components] are the components inside a container component; the containers are "Bottom Navigation", "Button Bar", "Card", "Date Picker", "Drawer", "List Item", "Modal", "Multi-Tab", and "Toolbar". There can be multiple components inside the parent component; write them in the same "[Component] and [Component] and..." format, wrapped between "containing" and "inside", with [Position] relative to the parent component, not the screen. You can check the prepared prompt, the "Toolbar" portion, inside the widget to see how it is done.
+ [Internal Components] are the components inside a container component; the containers are "Bottom Navigation", "Button Bar", "Card", "Date Picker", "Drawer", "List Item", "Modal", "Multi-Tab", and "Toolbar". There can be multiple components inside the parent component; write them in the same "[Component] and [Component] and..." format, wrapped between "containing" and "inside", with [Position] relative to the parent component, not the screen.
- 
- Putting them all together, it should look something like the prompt prepared in the widget, which will generate a simple username-password login page with a back button on the top left inside the toolbar. You may need to use GLIGEN to make it look better.
 
- Check the "prompt.txt" files in the sub-folders of the "results" folder for the prompts I used.
+ Check the "prompt.txt" files in the sub-folders of the "results" folder for the prompts I used putting them all together. You may need to use GLIGEN to make it look better.
 
 # GLIGEN
 
- You can try to use the model without it to see if the results are good enough. If not, you can use GLIGEN to control the positions of the components on the screen by inputting bounding boxes and grounded texts. I generated the images in A1111 with GLIGEN using this extension (https://github.com/AzhureRaven/sd_webui_gligen), which I forked from ashen-sensored after modifying it a bit to make it work on local computers, as the original seemed to be designed for Colab.
+ You can try to use the model without it to see if the results are good enough. If not, you can use GLIGEN to control the positions of the components on the screen by inputting bounding boxes and grounded texts. I generated the images in A1111 with GLIGEN using this extension (https://github.com/AzhureRaven/sd_webui_gligen), which I forked from (https://github.com/ashen-sensored/sd_webui_gligen) after modifying it a bit to make it work on local computers, as the original seemed to be designed for Colab.
 
- You can learn how GLIGEN works and how to use it on the page I mentioned earlier. In short, to use GLIGEN with this model, input a bounding box and grounded text for each UI component. Draw each bounding box at the position where you intend to put the component, matching the [Position] you give as closely as possible. You should keep the parent and internal components in separate bounding boxes, with the internal components' boxes inside the parent's, but you are free to experiment. For the grounded text of each component, I found it best to write the full "[Main Component] [Context] [Position]" format, with [Internal Components] in separate bounding boxes without "containing" and "inside".
+ You can learn how GLIGEN works and how to use it on the page I mentioned earlier. In short, to use GLIGEN with this model, input a bounding box and grounded text for each UI component. Draw each bounding box at the position where you intend to put the component, matching the [Position] you give as closely as possible. You should keep the parent and internal components in separate bounding boxes, with the internal components' boxes inside the parent's, but you are free to experiment. For the grounded text of each component, I found it best to write the full "[Color] [Main Component] [Context] [Position]" format, with [Internal Components] in separate bounding boxes without "containing" and "inside".
 
 For the prompt in the widget, the grounded text should look like this: "Toolbar Upper Top; Icon arrow backward Left; Input Textbox Username Lower Top; Input Textbox Password Upper Middle; Text Button Login Lower Center" and the bounding boxes should look like the image below.
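Mechanically, the prompt grammar in the hunk above amounts to joining the component descriptions with " and " and then appending the tags in order; a minimal sketch under that reading (the function and the component strings are illustrative, not part of the repository):

```python
def build_prompt(components, category=None, background_color=None):
    """Assemble a prompt in the "[Component] and [Component] and...,
    Android UI, [Category], [Background Color]" shape described above.
    Illustrative helper; the actual captions were produced by the
    author's dataset script, not this function."""
    parts = [" and ".join(components), "Android UI"]
    if category:
        parts.append(category)
    if background_color:
        # [Background Color] is written as "[Color] background" per the README
        parts.append(f"{background_color} background")
    return ", ".join(parts)

# Example: a minimal login screen prompt (component strings are illustrative)
prompt = build_prompt(
    ["Toolbar Upper Top containing Icon arrow backward Left inside",
     "Input Textbox Username Lower Top",
     "Text Button Login Lower Center"],
    category="Finance",
    background_color="White",
)
```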
 
@@ -89,7 +89,7 @@ For the prompt in the widget, the grounded text should look like this: "Toolbar
 
 Note that GLIGEN has a limit of 30 bounding box-grounded text pairs in one input, which means a maximum of 30 components. If you want more, you may need to combine parent and internal components.
 
- During testing, I found the bounding boxes to be very sensitive, particularly on "Text Button" and "Input". You may need to re-input them slightly bigger/smaller and/or shift them slightly to get it right. For example, I found that making the bounding box for "Input Textbox Password Upper Middle" bigger vertically could cause the model to generate two text boxes instead of one. If a bounding box is smaller than intended, or too close to another bounding box of the same component when it's not supposed to be, the component may disappear entirely or merge with the other one.
+ During testing, I found the bounding boxes to be very sensitive, particularly on "Button" and "Input". You may need to re-input them slightly bigger/smaller and/or shift them slightly to get it right.
 
 Check the "gligen_input.txt" and "gligen.png" files in the sub-folders of the "results" folder for the grounded text and bounding boxes I used.
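The grounded-text format and the 30-pair limit from the GLIGEN section above can be sketched as data; here boxes are assumed to be normalized (x1, y1, x2, y2) tuples, and both the helper and the coordinates are illustrative, not values from the repository:

```python
MAX_GLIGEN_PAIRS = 30  # GLIGEN accepts at most 30 bounding box-grounded text pairs

def grounded_text(pairs):
    """Join per-component grounded texts with '; ', as in the widget example,
    refusing inputs over GLIGEN's 30-pair limit. The boxes themselves would be
    passed to the extension separately; illustrative helper only."""
    if len(pairs) > MAX_GLIGEN_PAIRS:
        raise ValueError("GLIGEN accepts at most 30 box-text pairs")
    return "; ".join(text for _box, text in pairs)

# Made-up boxes for part of the widget example's layout
pairs = [
    ((0.00, 0.00, 1.00, 0.12), "Toolbar Upper Top"),
    ((0.02, 0.02, 0.12, 0.10), "Icon arrow backward Left"),
    ((0.10, 0.25, 0.90, 0.33), "Input Textbox Username Lower Top"),
]
```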
 
 
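The advice above that internal components' boxes should sit inside the parent's box is a simple containment check; a sketch assuming normalized (x1, y1, x2, y2) boxes (the helper name and coordinates are mine, not from the repository):

```python
def box_contains(parent, child):
    """True if the child bounding box lies fully inside the parent box.
    Boxes are normalized (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
    Illustrative helper for checking that an internal component's box
    stays inside its container's box before handing pairs to GLIGEN."""
    px1, py1, px2, py2 = parent
    cx1, cy1, cx2, cy2 = child
    return px1 <= cx1 and py1 <= cy1 and cx2 <= px2 and cy2 <= py2

toolbar = (0.0, 0.0, 1.0, 0.12)       # parent container box (made-up values)
back_icon = (0.02, 0.02, 0.12, 0.10)  # internal component box (made-up values)
```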
95