FreightDDW committed on
Commit f45dcc8 · verified · 1 Parent(s): 7db8e6c

Update README.md

{}
---

----
# Diffusion Design Works
<span style="font-size: 70%;">This page is for historical and developmental progression purposes.</span>
Dataset was created by running Blu-ray rips of the listed content through ffmpeg, with the aid of the Anime-Screenshot-Pipeline on GitHub, back in February of 2023. I quickly became dissatisfied with the quality of the dataset it originally produced, and parts of its pipeline were not useful for what I needed, so I took some of its tools, repurposed and/or modified settings, and redid the entire dataset after several months of initial training tests before committing to an overall overhaul.
----

## Modifying the Pipeline and handling unexpected complications

Things to note before continuing:
- A manually cleaned-up episode will come down to about 50 GB in size and a movie will drop to around 600 GB
- Lots of hard drive space required

## Frame Extraction
The first change I made to the pipeline was to the provided ffmpeg script. The Pipeline used a script that ran the mpdecimate filter to remove duplicate or “frozen” frames (moments in the animation where there is no movement) as ffmpeg extracted the frames, condensing the final output to around a tenth of the original size.
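The extract-everything replacement for that step can be sketched as below; the filenames, the output pattern, and the helper names are illustrative, and running it assumes ffmpeg is on PATH:

```python
# A minimal sketch of full-frame extraction with mpdecimate dropped,
# so no frame is ever discarded. Names here are illustrative.
import subprocess
from pathlib import Path

def extract_cmd(video: str, out_dir: str) -> list[str]:
    """Build the ffmpeg call that dumps every frame as a numbered PNG."""
    return ["ffmpeg", "-i", video, "-vsync", "0", f"{out_dir}/frame_%06d.png"]

def extract_all_frames(video: str, out_dir: str) -> None:
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(extract_cmd(video, out_dir), check=True)
```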
However, I found out much later that there was a large spread of false positive frame removals for unknown reasons, as well as false negatives of frames that were kept, caused by a combination of mpdecimate not considering jump cuts that repeat scenes after several cuts, and the Blu-ray encodings creating enough pixelation change between frames to defeat mpdecimate’s deletion threshold.
Because the loss of data was too significant to ignore, I stopped using mpdecimate, let ffmpeg extract every single frame, and began using a deduping script that sorts duplicates based on each image’s file hash and only moves the flagged frames to a different folder instead of deleting them, allowing for manual review in the future. The downside to this method is that I am now using a significant amount of hard drive space, which would eventually force me to start buying hard drives and looking into NAS solutions to hold all this data. It did also present a way to schedule when to start manually reviewing the datasets, as each review would free almost enough space to frame-extract another episode.
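A minimal sketch of that dedupe-by-hash idea follows; the function and folder names are illustrative, and SHA-256 stands in for whichever hash the actual script uses:

```python
# Flagged duplicates are moved aside for later manual review, never deleted.
import hashlib
import shutil
from pathlib import Path

def dedupe_frames(frames_dir: str, dupes_dir: str) -> int:
    """Move frames whose file hash was already seen into dupes_dir."""
    seen = set()
    moved = 0
    Path(dupes_dir).mkdir(parents=True, exist_ok=True)
    for frame in sorted(Path(frames_dir).glob("*.png")):
        digest = hashlib.sha256(frame.read_bytes()).hexdigest()
        if digest in seen:
            shutil.move(str(frame), str(Path(dupes_dir) / frame.name))
            moved += 1
        else:
            seen.add(digest)
    return moved
```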
The solution I ended up with was repurposing a face detection tool, used in the original GitHub repo to generate 1:1 cropped images for a dataset. After tagging my dataset, I run a modified version of the script that duplicates all images passing the face detection threshold into a different folder. When I load the dataset into my image organizer Hydrus, I delete all the subject tags from those images and then import the copied images with the accurate subject tags.
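The copy step can be sketched as below; `copy_face_hits` and the 0.8 threshold are hypothetical, and any detector returning a confidence score can stand in for the repurposed face detection tool:

```python
# Images passing the face-detect threshold are copied aside so they can
# re-enter Hydrus with corrected subject tags.
import shutil
from pathlib import Path

def copy_face_hits(frames_dir, out_dir, detect, threshold=0.8):
    """detect(path) -> confidence that the frame contains a known face."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    hits = []
    for frame in sorted(Path(frames_dir).glob("*.png")):
        if detect(frame) >= threshold:
            shutil.copy2(str(frame), str(Path(out_dir) / frame.name))
            hits.append(frame.name)
    return hits
```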
My first training after using this method instantly fixed the prompting of subjects. A quick skim of the images that passed the face detection threshold showed no signs of false positives, and false negatives are not an issue, as manual review will catch anything that was missed. Other miscellaneous improvements I used for this phase include a classifier aggregate that tags based on the average of 3 or more classifiers running concurrently, with the tagging of copyright names removed so that I reduce false positive tags of characters that don’t belong in the content from being erroneously introduced.
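The aggregate idea can be sketched like this; the function name, threshold, and stand-in tag dictionaries are illustrative rather than the actual tagger API:

```python
# Average the confidence each classifier assigns to a tag, keep tags above
# a threshold, and drop copyright tags entirely.
def aggregate_tags(predictions, threshold=0.5, copyright_tags=frozenset()):
    """predictions: list of {tag: confidence} dicts, one per classifier."""
    totals = {}
    for pred in predictions:
        for tag, conf in pred.items():
            totals[tag] = totals.get(tag, 0.0) + conf
    n = len(predictions)
    return sorted(
        tag for tag, total in totals.items()
        if total / n >= threshold and tag not in copyright_tags
    )
```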

## Dataset Organization and Preparation
While dedicated image organizers for Stable Diffusion datasets are available, I elected to use an obscure desktop application called Hydrus Network. It was created in the early 2010s for the purpose of organizing large media collections (of internet memes and other s#@tposts) under a single location, with various customizable categories modeled after the format of imageboards such as danbooru. The media is tabulated based on file hash rather than whatever the file happens to be named. This aspect synergizes not only with how the datasets need to be tagged if following the Booru/NovelAI format, but also with how my deduping script operates on file hash values when sorting out unique frames, as well as how I incorporate face-detected copies of images with correct subject tagging.

Once my sorted and tagged images are completed, I import the dataset batch into Hydrus, which associates the tag sidecar txt files generated by the taggers with the images and automatically populates the tag data. I also attach Hydrus-specific metadata for the series, episode, and scene for later manual review once the import is finished.
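For illustration, a sidecar is just a text file next to each frame; the frame name and tags below are made up, and the comma-separated format matches the common tagger output:

```python
# Illustrative only: write image.txt next to image.png, holding its tags.
from pathlib import Path

def write_sidecar(image_path: str, tags: list[str]) -> Path:
    """Write the comma-separated tag sidecar for one image."""
    sidecar = Path(image_path).with_suffix(".txt")
    sidecar.write_text(", ".join(tags))
    return sidecar
```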
From here I can proactively check any tags I previously had issues with, search for all images by that tag, remove tags if incorrect, maybe even delete images that could be seen as bad data, and overall just confirm that the frames I did get are satisfactory.
I will repeat this process with all new dataset batches I make until the model is ready to train. From there I can use the Hydrus metadata I included to select only a specific range of data that is ready to go, and Hydrus will export a new set of copies with the sidecar txt files that the trainer needs to associate the tags and images.

## Kohya Trainer
Things to note before continuing:
- Card used for training is a water-cooled RTX 4090 MSI Suprim Liquid X on my personal machine.
- Training 200k images at 512x512 resolution at batch size 16 takes 48 hours.
Japanese developer Kohya-SS’s SD Script package is an old but very consistent training package. While it mostly supports LoRA and other network-based trainings, it still offers full finetune support for SD1 models. It only needs to be pointed at a training directory; it will check the main training folder (and the regularization folder if enabled) and then simply follow the training parameters set in the PowerShell script, outputting the checkpoint when done. It borrows from the NAI training settings and incorporates aspect ratio bucketing, so the resolution sizes are not restricted to 1:1 aspect images.
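As a hedged sketch, the invocation might be built like this; the flag names follow sd-scripts' DreamBooth-style trainer (`train_db.py`), but the paths and values are placeholders to adapt, not a verified recipe for this model:

```python
# Build a plausible sd-scripts launch command; check flags against your
# installed version before use.
def kohya_train_cmd(model, train_dir, reg_dir, out_dir):
    return [
        "accelerate", "launch", "train_db.py",
        f"--pretrained_model_name_or_path={model}",
        f"--train_data_dir={train_dir}",
        f"--reg_data_dir={reg_dir}",  # regularization folder, if enabled
        f"--output_dir={out_dir}",
        "--resolution=512,512",       # matches the 512x512 notes above
        "--train_batch_size=16",
        "--enable_bucket",            # NAI-style aspect ratio bucketing
    ]
```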

## Post Training Review

After the model is trained, I run several “templates” to test changes between a previous version of the model and the current one. I note whether there are any visual improvements, changes in the look of backgrounds, character feature consistencies, detail in non-upscaled generations, and whether any errors I caught in the previous cycle were fixed. I also take prompts from other Stable Diffusion images I see in the wild to test how others would prompt on this model, in hopes of finding tag prompts that aren’t producing intended results, which I then note down to check in Hydrus.

----

# Dataset post processing

Things to note before continuing:
- SD1.x is trained on 512x512 resolution images, a size of 262,144 pixels
- Most enthusiasts’ LoRAs are still trained on 512x512 due to their hardware constraints

There are 3 additional processes I apply to my dataset that combat certain flaws in the dataset, clean up data that has production mistakes or bothersome editing choices, and maximize the amount of detail of specific shots or scenes: Layering, Stitching, and Cropping.

## Layering
Say I have a shot of a storefront backdrop with a lot of moving foreground pieces, but not a single frame has a clean shot of that backdrop. If I piece together enough separate frames on top of each other and mask out the unwanted bits, I can create a clean frame edit of the backdrop that I can then include in my dataset. Then I can reintroduce some separate frames with the foreground pieces to train the concept of being able to draw an empty room or location, and that same spot with people or other moving subjects. This is not something I prioritize often, but I do it when a particular scene with too many moving pieces shows up that I feel is worth the work.

## Stitching

In any sort of motion picture, you will get scenes where the camera pans from one direction to another to show off a landscape shot, a character reveal, or a sizing-up of several characters in the prelude to a confrontation. The frames by themselves may not be very useful, but they would be if you had a completed picture. I take a scene and either use a Photoshop or Lightroom tool to stitch the pieces together, or manually layer the frames together if it’s a complex shot, to make a single complete picture. This also provides additional resolution sizes that the model can train on, fighting the potential overfit that can be caused by training almost exclusively on 16:9 resolution images. I will still include some of the separate frames in full if there is detail in them I want the model to train on.

## Cropping

While it was originally a requirement to make 1:1 crop cutouts of images for your dataset in the early days of Stable Diffusion Dreambooth training, the introduction and source code release of Aspect Ratio Bucketing by NovelAI allowed any resolution size in datasets, with each image resized to fit a resolution bucket for training. This cropping and resizing down of full images, however, causes a loss of detail, as each bucket still needs to comply with the 1:1-equivalent pixel budget, and those buckets will always have a lower pixel count than the source, so a large image like a 1920x1080 Blu-ray frame or any other 2MP image from an imageboard getting crunched down will cause the model to lose out on detail. To combat this, I also include a sizable amount of 1:1 aspect images. These focus on facial expressions, certain detail-focused shots of objects and items, characters’ extremities to improve hand and finger accuracy, or close-ups of clothing and/or patterns. I will also use this on the same frames of stitches I created, if possible, to preserve those details of panning shots. The 1:1 aspect images also help in reducing overfit from the main resolution.
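The detail loss is easy to quantify with back-of-the-envelope math; the 64-pixel rounding below mirrors common bucketing setups, though exact bucket sizes vary by implementation:

```python
# How much a frame shrinks to fit a 512x512-equivalent pixel budget
# while keeping its aspect ratio.
def fit_to_budget(w, h, budget=512 * 512, step=64):
    scale = (budget / (w * h)) ** 0.5
    return (round(w * scale) // step * step, round(h * scale) // step * step)

print(fit_to_budget(1920, 1080))  # → (640, 384): ~88% of the pixels are gone
```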

# Flaws and Challenges to Overcome

Despite everything I picked up and what I managed to accomplish on my own, there are still shortcomings due to limited time, the need for more experience, and the inability to scale up my resources to continue gaining experience.

## Model Quality
While I have impressed myself and friends with the quality of AI generation I was able to create that mimics the style I want, I have encountered some degree of overfit and general stiffness in my generations. Most of the animations used in the current version are adapted from the works of a single main character designer, whose art style is being adapted by 3 different Art Directors across two studios (the content being used in regularization). So it’s hard to tell where art style ends and overfit begins, but when you are familiar with the material and its source, you start to notice when certain features creep in where they are not wanted, hence why the next set of content that will be pushed for training will feature works and franchises handled by this studio that are mostly unrelated to content already in the model.
The existing dataset has been slowly getting replaced with my redone, cleaned up, and meticulously reviewed version, with more of the data preserved, which will hopefully improve concepts that were missing in the original attempt. The dataset will continue to get larger and larger, and also be laser-focused on improving certain tag tokens or concepts that struggle to get generated. I will also be pushing to train the models at higher and higher resolutions, up to 1080x1080 or as close as possible without breaking my machine.

## Hardware
And to make sure I do not break my machine, I am working towards building an AI-focused home lab with enough VRAM to run my Stable Diffusion training and comfortably run and train LLMs as well. I did not realize early on that a 4090 was not enough for many advanced-level AI activities. Even if I went to a double 3090 setup, it requires a specific motherboard configuration that supports SLI/NVLink, because there is no reliable way of doing sharding for SD model trainings without it (not an issue when training LLMs). And while cloud-based training does exist, the amount of trial and error required to get things off the ground adds up very fast, such that it’s basically cheaper to pay all those costs upfront on hardware.
Hard drive space also became a big issue with the size of the datasets I was beginning to hoard. The Blu-ray rips, including many versions of said Blu-rays due to differences in fan encodes or the quality of the official discs pressed in different regions, also started eating into my hard drive space, as they were the source data for my datasets. I ended up purchasing two 22TB hard drives and tried to find space to shove them into my machine while I continue looking into RAID solutions.

## Tools
The challenge of building a pipeline is that you don’t know what you need until you run into that problem. You can pseudocode and chart out what you need at every step, but until you start running that roadmap, you won’t know whether you took everything into account correctly. And when there was something I wished could be automated, I often found that no such tool was readily at hand. Many times I turned to ChatGPT to help me modify or even create these automation scripts quickly to reach my end goal.
The Anime Screenshot Pipeline did give me insight into how I needed to have everything set up, but not all of its tools were at the level I needed for my specific model, or they simply didn’t work for my workflow. Mpdecimate wasn’t needed, so I just used the standard ffmpeg command as is. There was a secondary computer vision application for similar-image removal using FiftyOne, but because I had already committed to the need for manual reviewing, I forwent its use and stuck with a file-hash-based duplicate organizer. The Face Detect Auto Crop script in its original form was not needed, so I removed the auto cropper and kept only face detection for subject tagging accuracy with a quick modification. I customized the use of Hydrus based on the new tools’ synergy with the image organizer instead of using a more mainstream dedicated application.
And despite all of that, there are still some gaps in the pipeline that could probably use another automation tool, but it either doesn’t exist or requires a bit more technical knowledge beyond cheating with GPT to produce. It becomes a balance between taking the time to learn something that could save me time in the future, or spending that time doing it the way I already know, because there is no guarantee that by the time I finish doing the process the manual way, I would have found a solution. A shortcoming of working on this passion project on my own.
While there are people that would like to help me, there is also the time I would need to take to set up a working network server that multiple people can sign into concurrently to do pieces of the work. Once again, I can take time away from training to get a server set up so that once in a while I might get a helping hand, or just continue working solo.

## Current State of Tech

AI is not perfect yet. Even when the big entities like StabilityAI produce a new version such as Stable Diffusion 3, or Anlatan’s NovelAI v3, the changes and upgrades can result in an unexpected characteristic that turns people away, and those people will wait for the field to continue evolving until a version comes about that feels like a direct improvement rather than change for the sake of change.
From what I have seen, I am the only one doing this kind of source material collecting, and fewer still at the level of “autism” at which I am doing it. My idea is not original; the method to my madness to reach the quality that others with that idea strive for, however, is unique because it’s made from passion.
I’m working within the limits of the tech that is accessible to me. And the few crumbs of improvements and developments from other trainers doing similar levels of home-grown local finetuning, but for other themes or styles or end goals, I take in and get ready to apply to the next training.
My dataset is not restricted to Stable Diffusion 1, or even Stable Diffusion altogether, so whether the tech becomes easier to access on aging hardware with every generation leap, or I come into access of enterprise-level resources to push the boundary, my work will never be wasted; it will be ready to adapt to the next best thing. And with that, I would like to segue into not necessarily proof, but a potential example of how a dataset meant for generative AI artwork can be applied to a different medium.

# Motion Models – AnimateDiff + UDW

As Text to Video has taken its first steps into the mainstream with products such as Luma, interest in making this kind of AI content has grown, and some groups have made attempts at video models for local applications in the open source space that integrate with Stable Diffusion.

AnimateDiff is a plug-and-play module that takes T2I prompts and uses a motion model to create short animated video clips with Stable Diffusion. It was created back in June 2023.
Sometime this year, the developers disclosed their training method, and there has been at least one big finetune fork where the training resolution was doubled, but no new data was added due to a lack of video tagging tools. With the video dataset now visible, the path is open for those that have the hardware to finetune the motion model. The dataset, WebVid10M, is comprised of low-resolution, watermarked previews of Shutterstock footage. With the understanding of the model now clear, I can take my existing pipeline for SD and modify the roadmap to incorporate new video clips.

## Theoretical Pipeline

Since specific tuning is not required, we are open to using more types of movies, recordings, and other animations not necessarily related to the official films and shows, or even the ones we are already using. Say we have access to a collection of videos. We can use a script that generates timestamps, like a chapter select on a Blu-ray disc, marking every time there is a jump cut to a different camera shot or scene, output into a sidecar txt file. Then, with ffmpeg, we use the clip command in conjunction with the sidecar txt to generate clips based on the timestamps created. These clips will then be tagged and captioned, and can be organized within Hydrus (Hydrus can also organize video files), using the same image organization tools to manually review tags and captions every time we need to train the model.
An addition that could enhance the way our finetune is trained is to introduce the Danbooru/NovelAI tag format alongside the natural language captioning, to better prompt what we want out of the motion model without clashing with the limited caption instructions trained for prompting actions. A method to do this would be to take a single frame from the peak of each clip, run the SD tagger on it, and then apply those tags onto its corresponding video clip in Hydrus. This would need trial and error before becoming an automated process.
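The timestamp-and-clip steps above can be sketched as follows; the 0.4 scene threshold, filenames, and helper names are my illustrative choices, and detecting cuts assumes ffmpeg is on PATH:

```python
# Detect scene-change timestamps with ffmpeg's scene filter, then build
# the commands that cut stream-copied clips between those timestamps.
import re
import subprocess

def detect_cuts(video: str, threshold: float = 0.4) -> list[float]:
    """Return timestamps (seconds) where ffmpeg flags a scene change."""
    result = subprocess.run(
        ["ffmpeg", "-i", video, "-vf",
         f"select='gt(scene,{threshold})',showinfo", "-f", "null", "-"],
        capture_output=True, text=True,
    )
    # showinfo logs lines like "pts_time:12.345" to stderr for kept frames
    return [float(m) for m in re.findall(r"pts_time:([\d.]+)", result.stderr)]

def cut_cmd(video: str, start: float, end: float, out: str) -> list[str]:
    """Build the ffmpeg command that copies one clip between two cuts."""
    return ["ffmpeg", "-ss", str(start), "-to", str(end),
            "-i", video, "-c", "copy", out]
```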
In short:
Create scene jump-cut timestamp sidecar txt file from video -> Create segmented clips with ffmpeg -> Run a video-based caption classifier on the clips -> Semi-automate additional Danbooru/NovelAI format tags -> Manually review the dataset in Hydrus -> Export to trainer

# What I have learned from undertaking this project
All I wanted to do was make AI art in the style of my favorite shows. Just wanted to cook something up in between my odd jobs and video game sessions. What I got instead was a journey through several different fields of work, disciplines within art, photography, and animation, and the roughness of the math and science behind what we all call AI.
I learned a lot of Python and CUDA, sometimes had to cheat with ChatGPT, but was able to recognize how to get things working.
I learned a lot about Blu-ray content. The encoding process of fan rips, streaming services, and the physical delivery of the media all play a part in the quality of the viewing experience, and that was affecting the tools I used when curating my dataset from these sources.
 
2
  {}
3
  ---
4
 
 
5
  ----
6
+ # Diffusion Design Works
7
  <span style="font-size: 70%;">This page is for historical and developmental progression purposes.
8
  </span>
9
 
 
34
 
35
  Dataset was created by running Blu-ray rips of the listed content through ffmpeg as part of the aid of Anime-Screenshot-Pipeline on Github back in February of 2023, but I had quickly become dissatisfied with the quality of the dataset it originally produced, and parts of its pipeline were not useful for what I needed, thus I took some tools, repurposed and/or modified settings, and redid the entire dataset after several months of doing initial training tests before committing to an overall.
36
 
37
+ ----
38
+
39
  ## Modifying the Pipeline and handling unexpected complications
40
 
41
  Things to note before continuing:
 
47
  -A manually cleaned up episode will come down to about 50GBs in size and a movie will drop to around 600GB
48
  -Lots of Hard Drive space required
49
 
50
+ ## Frame Extraction
51
  The first change I made to the pipeline was the ffmpeg script provided. The Pipeline used a script that would run the mpdecimate command to remove duplicate or “frozen frames”, moments in the animation where there is no movement, as ffmpeg is “extracting” the frames to condense down the frames to a final output around a 10th of the original size.
52
  However, I came to find out much later there was a large spread of false positive frame removals for unknown reasons and false negatives of frames that were kept from a combination of mpdecimate not considering jump cuts that repeat scenes after several cuts, and due to the Blu-ray encodings creating enough pixelation changes between frames that it would defeat mpdecimate’s deletion threshold.
53
  Because the loss of data was too significant to ignore, I stopped using mpdecimate and let ffmpeg extract every single frame and began using a deduping script that would sort duplicates based on the image’s file hash value and would only move all the flagged frames to a different folder instead of deleting them, allowing for manual review in the future. The downside to this method is that I am now using significant amount of hard drive space which would eventually force me to start buying hard drives and to looking into NAS solutions to hold all this data. It did also present a way to schedule when to start manual reviewing the datasets, as this would free almost enough space to frame extract another episode.
 
59
  The solution I ended up coming up with was repurposing a face detection tool used in the original github to generate 1:1 cropped images for a dataset. After tagging my dataset, I would run a modified version of the script that will duplicate all images that pass the face detection threshold into a different folder. and when I would run the dataset into my image organizer Hydrus, I would delete all the subject tags from those images and then import the copied images with the accurate subject tags.
60
  My first training after using this method instantly fixed the prompting of subjects. A quick skim of the images that passed the face detection threshold show no signs of false positives, and there is no issue if there are cases of false negatives, as manual review will take care of anything that were missed. Other miscellaneous improves I used for this phase include a classifier aggregate that will tag based on the average of 3 or more classifiers running concurrently and has the tagging of copyright names removed just so that I reduce false positive tags of characters that don’t belong in the content from being erroneously introduced.
61
 
62
+ ## Dataset Organization and Preparation
63
  While dedicated image organizers for stable diffusion datasets are available, I elected with using an obscure desktop application called Hydrus Network. Created in the early 2010s for the purpose of organizing large media collections (of internet memes and other s#@tposts) under a single location with various customizable categories modeled after the format of imageboards such as danbooru. The media is tabulated based on file hash rather than whatever the file is named as. This aspect synergizes not only with how the datasets need to be tagged if following the Booru/NovelAI format, but with how my deduping script operates on file hash values when sorting unique frames out, as well as how I incorporate face detection copies of images with correct subject tagging.
64
 
65
  Once my sorted and tagged images are completed, I will import the dataset batch into Hydrus and it will associate the tag sidecar txt files generated by the taggers to the images and will automatically populate the datapoints hits. I also include Hydrus specific metadata of the series, episode, and scene for later manual review once the import is finished.
 
67
  From here I can proactively check any tags I may have previously had issues with and search for all images by that tag, remove tags if incorrect, maybe even delete images that could be seen as bad data, and overall just skim that the frames I did get were satisfactory.
68
  I will repeat this process with all new dataset batches I make until the model is ready to train. From here I can use the Hydrus metadata I included to only select a specific range of data that is ready to go and it will export a new set of copies with sidecar txt files that the trainer will need to associate the tags and images.
69
 
70
+ ## Kohya Trainer
71
  Things to note before continuing:
72
  -Card used for training is a water cooled RTX4090 MSI Suprim Liquid X on my personal machine.
73
  -Training 200k Images at 512x512 resolution at batch size 16 takes 48 hours.
 
80
  Japanese developer Kohya-SS’s SD Script package is an old but very consistent way training package. While it mostly supports LoRA and other network-based checkpoint trainings, it still supports full finetune support for SD1 models. It only needs to be pointed to a training directory, will check the main training folder and the regulation folder if enabled, and will then just follow the training parameters set in the powershell script and output the checkpoint when done. It borrows from the NAI training settings and incorporates the aspect ratio bucketing so the resolution sizes are not restricted to 1:1 aspect image.
81
 
82
  ## Post Training Review
83
+
84
  After the model is trained, I will run several “templates” to test changes between a previous version of the model compared to the current one. I will note if there are any visual improvements, changes in the look of backgrounds, character feature consistencies, detail in non-upscaled generations, and if any errors I caught in the previous cycle were fixed. I will also take prompts of other stable diffusion images I see in the wild to test how others would prompt on this model in hopes of finding tags prompts that don’t aren’t producing intended results so I can then note down to check on Hydrus.
85
+
86
+ ----
87
+
88
+ # Dataset post processing
89
 
90
  Things to note before continuing:
91
  -SD1.x is trained on 512x512 resolution images, a size of 262,144 Pixels
 
94
  -Most enthusiast’s LoRAs are still trained on 512x512 due to their hardware constraints
95
  There are 3 additional processes I do to my dataset that combat either certain flaws in the dataset, cleaning up data that could have production mistakes or bothersome editing choices, and maximize the amount of detail of specific shots or scenes: Layering, Cropping, and Stitching.
96
 
97
+ ## Layering
98
  Say I have a shot of store front backdrop with a lot of moving foreground pieces, but not a single frame has a clean shot of that backdrop, but if I piece together enough separate frames together on top of each other and mask out the unwanted bits, I can create a clean frame edit of the backdrop that then I can include in my dataset. Then I can reintroduce some separate frames with the foreground pieces to train concept of being able to draw an empty room or location, and that same spot with people or other moving subjects. This is not something I prioritize often, but when a particular scene with too many movie pieces shows up that I feel is worth doing the work to get
99
 
100
+ ## Stitching
101
 
102
  In any sort of motion picture, you will get scenes where the camera is panning from one direction another to either show off a landscape shot, or a character’s reveal a sizing up of several characters in the prelude to a confrontation. The frames by themselves may not be quite useful, but they would be if you had a completed picture. I will take a scene and either using a Photoshop or Lightroom tool to stich the pieces together, or I will manually layer the frames together if it’s a complex shot, to have a single complete picture. This also gives provides an additional resolution sizes that the model can train on, fighting the potential overfit that can be caused by training almost exclusively on 16:9 resolution images. I will still include some of the separate frames in full if there is detail in them I want the model to train on.
103
 
104
+ ## Cropping
105
 
106
  While originally a requirement to make 1:1 crop cutouts of images for your dataset in the early days of Stable Diffusion Dreambooth training, the introduction and source code release for Aspect Ratio Bucketing by NovelAI allowed the use of any resolution size for datasets and would be sized to fit a resolution bucket for training. This crop and resizing down of the full images however will cause a loss in detail of the image as the bucket still needs to comply with the 1:1 aspect pixel size, and those buckets will always have a lower pixel count than the default training resolution, so a large image like a 1920x1080 blu-ray frame or any other 2MP image from an imageboard getting crunched down will cause the model to lose out on information detail. To combat this, I also include sizable amount of 1:1 aspect images. These focus on facial expressions, certain detail focus shots of objects and items, character’s extremities to improve hand and finger accuracy, or close ups of clothing and/or patterns. I will also use this on the same frames of stitches I created, if possible, to also preserve those details of panning shots. The 1:1 aspect images also help in reducing overfit from the main resolution.
 
+ # Flaws and Challenges to Overcome
 
  Despite everything I picked up and what I managed to accomplish on my own, there are still shortcomings due to time, the need for more experience, and the inability to scale up my resources to continue gaining that experience.
 
+ ## Model Quality
 While I have impressed myself and friends with the quality of AI generation I was able to create in the style I want, I have encountered some degree of overfit and general stiffness in my generations. Most of the animations used in the current version are adapted from the works of a single main character designer, whose art style is interpreted by 3 different Art Directors across two studios (counting the content used in regularization). So it’s hard to tell where art style ends and overfit begins, but when you are familiar with the material and its source, you start to notice certain features creeping in where they are not wanted. Hence, the next set of content pushed for training will feature works and franchises handled by this studio that are mostly unrelated to content already in the model.
 The existing dataset has been slowly getting replaced with my redone, cleaned up, and meticulously reviewed version, with more of the data preserved, which will hopefully improve concepts that were missing in the original attempt. The dataset will continue to get larger, and will also be laser-focused on improving certain tag tokens or concepts that struggle to get generated. I will also be pushing to train the models at higher and higher resolutions, up to 1080x1080 or as close as possible without breaking my machine.
 
+ ## Hardware
  And to make sure I do not break my machine, I am working towards building an AI-focused home lab with enough VRAM to run my Stable Diffusion training and comfortably run and train LLMs as well. I did not realize early on that a 4090 was not enough for many advanced-level AI activities. Even if I went to a double 3090 setup, it requires a specific motherboard configuration that supports SLI/NVLink, because there is no reliable way of sharding SD model training without it (not an issue when training LLMs). And while cloud-based training does exist, the amount of trial and error required to get things off the ground adds up so fast that it’s basically cheaper to pay those costs upfront in hardware.
  Hard drive space also became a big issue with the size of the datasets I was beginning to hoard. The Blu-ray rips, including many versions of the same discs due to differences between fan encodes or the quality of the official pressings across regions, also started eating into my storage, as they are the source data for my datasets. I ended up purchasing two 22TB hard drives and finding space to shove them into my machine while I continue looking into RAID solutions.
 
+ ## Tools
 The challenge of building a pipeline is that you don’t know what you need until you run into that problem. You can pseudocode and chart out what you need at every step, but until you start running that roadmap, you won’t know whether you took everything into account correctly. And when there was something I wished could be automated, I often found that no such tool was readily at hand. Many times I turned to ChatGPT to help me quickly modify or even create these automation scripts to reach my end goal.
 The Anime Screenshot Pipeline did give me insight into how I needed to set everything up, but not all of its tools were at the level I needed for my specific model, or they simply didn’t work for my workflow. Mpdecimate wasn’t needed, so I just used the standard ffmpeg command as-is. There was a secondary computer vision application for similar-image removal using FiftyOne, but because I had already committed to manual reviewing, I forwent its use and stuck with an image-hash-based duplicate organizer. The Face Detect Auto Crop script was not needed in its original form, so I removed the auto cropper and kept only face detection for subject-tagging accuracy, with a quick modification. I customized my use of Hydrus around the new tools’ synergy with the image organizer instead of using a more mainstream dedicated application.
 And despite all of that, there are still some gaps in the pipeline that could probably use another automation tool, but either it doesn’t exist or it requires technical knowledge beyond cheating with GPT to produce. It becomes a balance between taking the time to learn something that could save me time in the future, or spending that time doing it the way I already know, because there is no guarantee that by the time I finish the process I would have found a solution. A shortcoming of working on this passion project alone.
 While there are people who would like to help me, there is also the time I would need to take to set up a working network server that multiple people can sign into concurrently to do pieces of the work. Once again: take time away from training to get a server set up so that once in a while I might get a helping hand, or just continue working solo.
 
+ ## Current State of Tech
 
  AI is not perfect yet. Even when big entities like StabilityAI produce a new version such as Stable Diffusion 3, or Anlatan releases NovelAI v3, the changes and upgrades can introduce an unexpected characteristic that turns people away, and those people will wait for the field to keep evolving until a version arrives that feels like a direct improvement rather than change for the sake of change.
  From what I have seen, I am the only one doing this kind of source material collecting, and fewer still at the level of “autism” at which I am doing it. My idea is not original; the method to my madness for reaching the quality that others with that idea strive for, however, is unique because it’s made from passion.
  I’m working within the limits of the tech that is accessible to me. And the few crumbs of improvements and developments from other trainers doing similar levels of home-grown local finetuning, but for other themes, styles, or end goals, I take in and get ready to apply to the next training.
  My dataset is not restricted to Stable Diffusion 1, or even to Stable Diffusion altogether, so whether the tech becomes easier to access on aging hardware with every generation leap, or I come into access to enterprise-level resources to push the boundary, my work will never be wasted; it will be ready to adapt to the next best thing. And with that, I would like to segue into not necessarily proof, but a potential example of how a dataset meant for generative AI artwork can be applied to a different medium.
 
+ ----
+
+ # Motion Models – AnimateDiff + UDW
  As Text-to-Video has taken its first steps into the mainstream with products such as Luma, interest in making this kind of AI content has grown, and some groups have made attempts at video models for local applications in the open source space that integrate with Stable Diffusion.
 
  AnimateDiff is a plug-and-play module that will take T2I prompts and use the motion model to create short animation video clips with Stable Diffusion. This was created back in June 2023.
 
 
  Sometime this year, the developers disclosed their training method, and there has been at least one big finetune fork where the training resolution was doubled, but no new data was added due to a lack of video tagging tools. The original dataset, WebVid10M, is comprised of low-resolution, watermarked previews of Shutterstock footage. With the training method and dataset now clear, the path is open for anyone with the hardware to finetune the motion model, and I can take my existing pipeline for SD and modify the roadmap to incorporate new video clips.
 
+ ## Theoretical Pipeline
 
  Since specific tuning is not required, we are open to using more types of movies, recordings, and other animations, not necessarily related to the official films and shows we are already using. Say we have access to a collection of videos. We can use a script that generates timestamps, like a chapter select on a Blu-ray disc, marking every time there is a jump cut to a different camera shot or scene, and outputs them into a sidecar txt file. Then, with ffmpeg, we use the clip command in conjunction with the sidecar txt to generate clips based on those timestamps. These clips will then be tagged and captioned, and can be organized within Hydrus (Hydrus can also organize video files), using the same image organization tools to manually review tags and captions every time we need to train the model.
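As a rough sketch of the clip step, the following assumes a sidecar file with one cut timestamp (in seconds) per line and builds one stream-copy ffmpeg command per scene between cuts; the sidecar format, output names, and container here are illustrative assumptions:

```python
def read_timestamps(sidecar_text):
    """Parse one jump-cut timestamp (seconds) per line from the sidecar txt."""
    return [float(line) for line in sidecar_text.splitlines() if line.strip()]


def clip_commands(video_path, cuts, duration):
    """Build one ffmpeg stream-copy command per scene between consecutive cuts."""
    bounds = [0.0] + list(cuts) + [duration]
    commands = []
    for i, (start, end) in enumerate(zip(bounds, bounds[1:])):
        commands.append([
            "ffmpeg",
            "-ss", f"{start:.3f}",        # seek to the scene start
            "-i", video_path,
            "-t", f"{end - start:.3f}",   # scene length, not absolute end time
            "-c", "copy",                 # no re-encode, keep source quality
            f"clip_{i:04d}.mkv",
        ])
    return commands
```

Note that `-ss` before `-i` with `-c copy` snaps to the nearest keyframe, so clip boundaries may drift by a few frames; re-encoding instead of stream copy would make the cuts exact at the cost of quality and time.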
  An addition that could enhance the way our finetune is trained is to introduce the Danbooru/NovelAI tag format while keeping the natural-language captioning, to better prompt what we want out of the motion model without clashing with the limited caption instructions it was trained on. One method would be to take a single frame from the peak of each clip, run the SD tagger on it, and then apply those tags to the corresponding video clip in Hydrus. This would need trial and error before becoming an automated process.
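One way that tag transfer could look, assuming clip lengths are known; the midpoint stands in for the "peak" frame, the tag set would come from the SD tagger, and both function names below are hypothetical:

```python
def peak_frame_command(clip_path, clip_length, out_image):
    """Grab one frame from the midpoint of a clip to feed the image tagger."""
    midpoint = clip_length / 2.0
    return ["ffmpeg", "-ss", f"{midpoint:.3f}", "-i", clip_path,
            "-frames:v", "1", out_image]


def attach_tags(clip_tag_index, clip_path, tagger_output):
    """Map the tagger's output for the extracted frame back onto its clip."""
    clip_tag_index.setdefault(clip_path, set()).update(tagger_output)
    return clip_tag_index
```

The resulting clip-to-tags index could then be imported into Hydrus alongside the natural-language caption for manual review.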
 
  In short:
  Create a scene jump-cut timestamp sidecar txt file from the video -> create segmented clips with ffmpeg -> run a video-based caption classifier on the clips -> semi-automate additional Danbooru/NovelAI-format tags -> manually review the dataset in Hydrus -> export to the trainer
 
+ # What I have learned from undertaking this project
  All I wanted to do was just make AI art in the style of my favorite shows. Just wanted to cook something up in-between my odd jobs and video game sessions. What I got instead was a journey through several different fields of work, disciplines within art photography and animation, and the roughness of the math and science behind what we all call AI.
  I learned a lot of Python and CUDA; sometimes I had to cheat with ChatGPT, but I was able to recognize how to get things working.
  I learned a lot about Blu-ray content: the encoding process of fan rips, streaming services, and the physical delivery of the media all play a part in the quality of the viewing experience, and all affected the tools I used when curating my dataset from these sources.