---
license: mit
datasets:
- nyanko7/danbooru2023
---

## Check out my [blog](https://huggingface.co/blog/Fgdfgfthgr/typical-anime-image-style-dim)!

# Update 17/10/2025

V4 released! This time, instead of training a vision model from scratch, it uses a simple MLP that takes the CLS token from a [DINOv3](https://huggingface.co/collections/facebook/dinov3-68924841bd6b561778e31009) model to produce the embedding.

Far more accurate than the previous V3! You do need access to the gated DINOv3 weights with your HuggingFace token, though.
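
For reference, here is a minimal sketch of the V4 idea (not the exact code shipped in this repo; see minimal_script.py for that): pull the CLS token out of a DINOv3 checkpoint via transformers, then map it to a style vector with a small MLP head. The checkpoint name and the head's width and depth here are illustrative assumptions.

```python
import torch
import torch.nn as nn
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Gated repo: accept the license on the model page and authenticate with your
# HuggingFace token (e.g. via `huggingface-cli login`) before loading.
dino_name = "facebook/dinov3-vitb16-pretrain-lvd1689m"  # example checkpoint
processor = AutoImageProcessor.from_pretrained(dino_name)
backbone = AutoModel.from_pretrained(dino_name).eval()

# Hypothetical MLP head; the real head's sizes live in this repo's weights.
head = nn.Sequential(
    nn.Linear(backbone.config.hidden_size, 512),
    nn.GELU(),
    nn.Linear(512, 7),  # 7-dimensional style vector
)

image = Image.open("example.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    cls_token = backbone(**inputs).last_hidden_state[:, 0]  # CLS token
    style_vec = head(cls_token)
print(style_vec.shape)  # torch.Size([1, 7])
```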

# You can use 6 or 7 numbers to fully describe the style of an (anime) image!

## What is it and what can it do?

Many diffusion models choose to use artist tags to control the style of output images. I am really not a fan of that, for three reasons:

1. Many artists share very similar styles, making many artist tags redundant.
2. Some artists have more than one distinct art style in their works; a basic example is sketches vs. finished images.
3. Artist tags are prone to content bleeding: if the artist you choose draws lots of repeating content, it is very likely that content will bleed into your output even though you never prompted for it.

One way to overcome this is to use a style embedding model: a model which takes in images of arbitrary sizes and outputs a style vector for each image. The style vector lives in an N-dimensional space, so it is essentially just a list of N numbers, each corresponding to a specific style element of the input image.

Images with similar styles should have similar embeddings (low distance), while images with different styles will have embeddings that are far apart (high distance).
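
As a quick illustration (the numbers below are made up; in practice the vectors come from the model):

```python
import torch
import torch.nn.functional as F

# Hypothetical style vectors for two images.
a = torch.tensor([0.9, -0.2, 1.3, 0.0, -1.1, 0.4, 0.7])
b = torch.tensor([1.0, -0.1, 1.2, 0.1, -1.0, 0.5, 0.6])

dist = torch.dist(a, b)                 # Euclidean distance: low -> similar style
cos = F.cosine_similarity(a, b, dim=0)  # cosine similarity: high -> similar style
print(dist.item(), cos.item())
```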

The included Python files give minimal usage examples: minimal_script.py provides the minimal code for running an image through the network and obtaining an output, while gallery_review.py contains the code I used to generate the visualisations and clustering.

## Training data is [here](https://huggingface.co/datasets/Fgdfgfthgr/Style_Embedder_Dataset).

## Training Hyperparameters

With the current version (v4), training was done using [PyTorch Lightning](https://lightning.ai/):

- lr = 0.0005
- weight_decay = 0.01
- AdamW optimizer
- ExponentialLR scheduler, with a gamma of 0.99, applied every epoch
- Batch size of 9999 (so all data goes through the network at once)
- For every anchor image, 4 positive images and 16 negative images are used
- Trained for 150 epochs on a single RTX 3080 GPU, for a total of 150 optimizer updates (one full-batch step per epoch)
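
The list above maps fairly directly onto a LightningModule. The sketch below is an assumption-laden illustration, not the actual training script: the MLP head and the margin-based contrastive loss are stand-ins, while the optimizer and scheduler settings are the ones listed above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import lightning as L


class StyleEmbedder(L.LightningModule):
    def __init__(self, in_dim=768, out_dim=7):
        super().__init__()
        # Hypothetical MLP head over precomputed DINOv3 CLS tokens.
        self.head = nn.Sequential(
            nn.Linear(in_dim, 512), nn.GELU(), nn.Linear(512, out_dim)
        )

    def training_step(self, batch, batch_idx):
        # Per anchor: 4 positive and 16 negative CLS tokens.
        anchor, pos, neg = batch              # (B, D), (B, 4, D), (B, 16, D)
        za = self.head(anchor)                # (B, E)
        zp = self.head(pos)                   # (B, 4, E)
        zn = self.head(neg)                   # (B, 16, E)
        d_pos = (za.unsqueeze(1) - zp).norm(dim=-1)  # distances to positives
        d_neg = (za.unsqueeze(1) - zn).norm(dim=-1)  # distances to negatives
        # Illustrative margin loss: pull positives close, push negatives apart.
        loss = d_pos.mean() + F.relu(1.0 - d_neg).mean()
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        opt = torch.optim.AdamW(self.parameters(), lr=5e-4, weight_decay=0.01)
        sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.99)
        # Stepped once per epoch; with a single full batch per epoch,
        # 150 epochs means 150 optimizer updates.
        return {"optimizer": opt,
                "lr_scheduler": {"scheduler": sched, "interval": "epoch"}}
```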