# Workflow for fine-tuning ModelScope in an anime style
Here is a brief description of my process for fine-tuning ModelScope in an anime style with [Text-To-Video-Finetuning](https://github.com/ExponentialML/Text-To-Video-Finetuning).
Most of it may be basic, but I hope it is useful.
There is no guarantee that what is written here is correct or will lead to good results!
## Selection of training data
The goal of my training was to shift the model to an overall anime style.
Since only the art style needed to override the ModelScope content, I did not need a huge dataset.
The total number of videos and images was only a few thousand.
Most of the videos were taken from [Tenor](https://tenor.com/).
Many of them were posted as GIFs and MP4s of a single short scene.
It seems possible to automate collection using the Tenor API, as sketched below.
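For reference, here is a minimal sketch of such automation against Tenor's v2 search API. The endpoint, parameters, and response fields follow Tenor's public v2 API as I know it, and the API key is a placeholder, so treat this as a starting point rather than finished tooling.

```python
import requests

API_KEY = "YOUR_TENOR_KEY"  # placeholder: requires your own Tenor API key

def search_tenor_mp4s(query, limit=50):
    """Return direct MP4 URLs for a Tenor search query."""
    resp = requests.get(
        "https://tenor.googleapis.com/v2/search",
        params={"q": query, "key": API_KEY, "limit": limit, "media_filter": "mp4"},
        timeout=30,
    )
    resp.raise_for_status()
    return [r["media_formats"]["mp4"]["url"] for r in resp.json()["results"]]

# Download the results for one query into numbered files.
for i, url in enumerate(search_tenor_mp4s("anime waving")):
    with open(f"tenor_{i:04d}.mp4", "wb") as f:
        f.write(requests.get(url, timeout=60).content)
```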
I also used some videos with smooth, stable motion and videos of 3D models with toon shading.
Short videos of a few seconds are sufficient, as we are not yet able to train on long data.
### Notes on data collection
Blurring and noise are also learned by the model. This is especially noticeable when training at high resolution.
Frame rate also has an effect. If you want to train smooth motion, you need data with smooth motion.
Scene cuts also have an effect. If they are not addressed, a character may suddenly transform mid-clip.
For anime training it is difficult to capture fine detail from video sources alone, so images are used for training as well.
Such images can be generated with Stable Diffusion, as sketched below.
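For example, a minimal sketch using the `diffusers` library; the checkpoint name is just one anime-style option, so substitute whichever Stable Diffusion model you prefer:

```python
import torch
from diffusers import StableDiffusionPipeline

# Example model id: substitute any anime-style SD checkpoint you like.
pipe = StableDiffusionPipeline.from_pretrained(
    "hakurei/waifu-diffusion", torch_dtype=torch.float16
).to("cuda")

# Generate supplementary still images for concepts the videos cover poorly.
image = pipe("1girl, smiling, upper body, anime style",
             num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("train_image_0001.png")
```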
The smaller the differences between frames, the less likely the training results are to be corrupted.
I avoided animations with too much dynamic motion.
It may be better to avoid scenes with multiple contexts and choose scenes with simple actions.
While collecting data, I checked that common emotions and actions were covered.
## Correcting data before training
### Fixing resolution, blurring, and noise
It is safest to use source material with a resolution at least equal to the training resolution.
The aspect ratio should also match the training settings.
Cropping is possible with ffmpeg, for example as sketched below.
Incidentally, I tried padding to the target ratio with a single color instead of cropping, but it seemed to slow down training.
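As a concrete example, here is a small Python wrapper around ffmpeg that center-crops to the training aspect ratio and then scales to the training resolution; 512x512 is just a sample value, not a recommendation.

```python
import subprocess

def crop_and_scale(src, dst, width=512, height=512):
    """Center-crop to the target aspect ratio, then scale to the training size."""
    # ffmpeg's crop filter defaults to a centered crop; the expressions pick
    # the largest region of the input matching the target width:height ratio.
    vf = (
        f"crop='min(iw,ih*{width}/{height})':'min(ih,iw*{height}/{width})',"
        f"scale={width}:{height}"
    )
    subprocess.run(["ffmpeg", "-y", "-i", src, "-vf", vf, dst], check=True)

crop_and_scale("input.mp4", "cropped.mp4")
```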
### Converting small videos to larger sizes
I used this tool: https://github.com/k4yt3x/video2x
The driver I recommend is waifu2x-caffe. It is well suited to animation, giving clear, sharp results, and it also reduces noise a little.
If you cannot improve the image quality along with the resolution, it may be better not to force a higher resolution.
### Number of frames
Since many anime sources have a low frame rate, training results are prone to collapse.
Besides body collapse, the character's appearance stops being consistent across frames. Less variation between frames seems to improve consistency.
The following tool may be useful for frame interpolation:
https://github.com/google-research/frame-interpolation
If the variation between frames is too large, you will not get a clean result. A quick way to screen for this is sketched below.
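One simple screen is to measure the average pixel difference between consecutive frames with OpenCV; the threshold below is an arbitrary value you would tune against your own data.

```python
import cv2
import numpy as np

def mean_frame_difference(path, max_frames=64):
    """Average absolute grayscale difference between consecutive frames (0-255)."""
    cap = cv2.VideoCapture(path)
    prev, diffs = None, []
    while len(diffs) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev is not None:
            diffs.append(float(np.mean(cv2.absdiff(prev, gray))))
        prev = gray
    cap.release()
    return float(np.mean(diffs)) if diffs else 0.0

# Arbitrary threshold: tune it by inspecting clips near the boundary.
if mean_frame_difference("clip.mp4") > 20.0:
    print("large inter-frame variation; interpolate frames or drop this clip")
```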
## Tagging
For anime, WaifuTagger can extract content with good accuracy, so I created [a slightly modified script](https://github.com/bruefire/WaifuTaggerForVideo) for video and used it for animov512x.
Nevertheless, [BLIP2-Preprocessor](https://github.com/ExponentialML/Video-BLIP2-Preprocessor) can also extract general scene content well enough. It may be even better to use the two together, for example as sketched below.
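One possible way to combine them is to merge the two outputs into a single prompt per video. The file names and the `{video_path: text}` schema below are hypothetical, not the actual output formats of either repository, so adapt this to what the tools really emit.

```python
import json

# Hypothetical inputs: each file maps a video path to the text that tool produced.
with open("waifu_tags.json") as f:
    tags = json.load(f)
with open("blip2_captions.json") as f:
    captions = json.load(f)

# BLIP2 caption first for the overall scene, WaifuTagger tags appended for detail.
merged = {
    path: ", ".join(t for t in (captions.get(path, "").strip(),
                                tags.get(path, "").strip()) if t)
    for path in set(captions) | set(tags)
}

with open("merged_prompts.json", "w") as f:
    json.dump(merged, f, indent=2, ensure_ascii=False)
```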
## config.yaml settings
I'm still not quite sure what is appropriate for this.
[config.yaml for animov512x](https://huggingface.co/strangeman3107/animov-512x/blob/main/config.yaml)
## Evaluating training results
If any of the sample videos generated during training look poor, search the training JSON for that sample's prompt. With a training dataset of a few thousand items, you can usually find the source videos, which may help show where the problem lies. A sketch of this search follows.
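A minimal sketch of that search, assuming the metadata is a JSON list of entries with `video_path` and `prompt` keys; adjust the key names to whatever your preprocessor actually emits.

```python
import json

def find_source_videos(metadata_path, keyword):
    """List training videos whose prompt mentions a keyword."""
    with open(metadata_path) as f:
        entries = json.load(f)
    return [e["video_path"] for e in entries
            if keyword.lower() in e["prompt"].lower()]

# e.g. look up which clips taught the model a badly generated concept.
for path in find_source_videos("train_data.json", "waving"):
    print(path)
```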
I deliberately trained all videos with an 'anime' tag.
After training, comparing videos generated with the anime tag in the positive prompt against ones with it in the negative prompt (i.e. comparing the fine-tuned result with output close to the original ModelScope) may help improve training.
It is difficult to train specific concepts in afterwards, even if they are tagged, so I avoided relying on that.
Note that anime has a low frame rate to begin with, so overtraining tends to freeze the characters.
Perhaps because ModelScope itself was not trained at such a large resolution, training seems to be easier at lower resolutions.
In fact, when training Animov-0.1, I did not need to pay much attention to what is written here to get good results.
If you are fine-tuning ModelScope at larger resolutions, you may need to train incrementally with more data to avoid collapsed results.
That's all.