---
license: mit
datasets:
- nyanko7/danbooru2023
---

## Check out my [blog](https://huggingface.co/blog/Fgdfgfthgr/typical-anime-image-style-dim)!

# Update 17/10/2025

V4 released! This time, instead of training a vision model from scratch, it uses a simple MLP that takes the CLS token from a [DINOv3](https://huggingface.co/collections/facebook/dinov3-68924841bd6b561778e31009) model to produce the embedding.

Far more accurate than the previous V3! You do need access to the gated DINOv3 weights with your HuggingFace token, though.
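
For reference, here is a minimal sketch of the V4 idea (not the exact code shipped in this repo; see minimal_script.py for that): pull the CLS token out of a DINOv3 checkpoint via transformers, then map it to a style vector with a small MLP head. The checkpoint name and the head's width and depth here are illustrative assumptions.

```python
import torch
import torch.nn as nn
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Gated repo: accept the license on the model page and authenticate with your
# HuggingFace token (e.g. via `huggingface-cli login`) before loading.
dino_name = "facebook/dinov3-vitb16-pretrain-lvd1689m"  # example checkpoint
processor = AutoImageProcessor.from_pretrained(dino_name)
backbone = AutoModel.from_pretrained(dino_name).eval()

# Hypothetical MLP head; the real head's sizes live in this repo's weights.
head = nn.Sequential(
    nn.Linear(backbone.config.hidden_size, 512),
    nn.GELU(),
    nn.Linear(512, 7),  # 7-dimensional style vector
)

image = Image.open("example.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    cls_token = backbone(**inputs).last_hidden_state[:, 0]  # CLS token
    style_vec = head(cls_token)
print(style_vec.shape)  # torch.Size([1, 7])
```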

# You can use 6 or 7 numbers to fully describe the style of an (anime) image!

## What is it and what can it do?

Many diffusion models choose to use artist tags to control the style of output images. I am really not a fan of that, for three reasons:

1. Many artists share very similar styles, making many artist tags redundant.
2. Some artists have more than one distinct art style in their works; a basic example is sketches vs. finished images.
3. Artist tags are prone to content bleeding: if the artist you choose draws lots of repeating content, it is very likely that content will bleed into your output even though you never prompted for it.

One way to overcome this is to use a style embedding model: a model which takes in images of arbitrary sizes and outputs a style vector for each image. The style vector lives in an N-dimensional space, so it is essentially just a list of N numbers, each corresponding to a specific style element of the input image.

Images with similar styles should have similar embeddings (low distance), while images with different styles will have embeddings that are far apart (high distance).
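
As a quick illustration (the numbers below are made up; in practice the vectors come from the model):

```python
import torch
import torch.nn.functional as F

# Hypothetical style vectors for two images.
a = torch.tensor([0.9, -0.2, 1.3, 0.0, -1.1, 0.4, 0.7])
b = torch.tensor([1.0, -0.1, 1.2, 0.1, -1.0, 0.5, 0.6])

dist = torch.dist(a, b)                 # Euclidean distance: low -> similar style
cos = F.cosine_similarity(a, b, dim=0)  # cosine similarity: high -> similar style
print(dist.item(), cos.item())
```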

The included Python files give minimal usage examples: minimal_script.py provides the minimal code for running an image through the network and obtaining an output, while gallery_review.py contains the code I used to generate the visualisations and clustering.

## Training data is [here](https://huggingface.co/datasets/Fgdfgfthgr/Style_Embedder_Dataset).

## Training Hyperparameters

With the current version (v4), training was done using [PyTorch Lightning](https://lightning.ai/):

- lr = 0.0005
- weight_decay = 0.01
- AdamW optimizer
- ExponentialLR scheduler, with a gamma of 0.99, applied every epoch
- Batch size of 9999 (so all data goes through the network at once)
- For every anchor image, 4 positive images and 16 negative images are used
- Trained for 150 epochs on a single RTX 3080 GPU, for a total of 150 optimizer updates (one full-batch step per epoch)
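
The list above maps fairly directly onto a LightningModule. The sketch below is an assumption-laden illustration, not the actual training script: the MLP head and the margin-based contrastive loss are stand-ins, while the optimizer and scheduler settings are the ones listed above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import lightning as L


class StyleEmbedder(L.LightningModule):
    def __init__(self, in_dim=768, out_dim=7):
        super().__init__()
        # Hypothetical MLP head over precomputed DINOv3 CLS tokens.
        self.head = nn.Sequential(
            nn.Linear(in_dim, 512), nn.GELU(), nn.Linear(512, out_dim)
        )

    def training_step(self, batch, batch_idx):
        # Per anchor: 4 positive and 16 negative CLS tokens.
        anchor, pos, neg = batch              # (B, D), (B, 4, D), (B, 16, D)
        za = self.head(anchor)                # (B, E)
        zp = self.head(pos)                   # (B, 4, E)
        zn = self.head(neg)                   # (B, 16, E)
        d_pos = (za.unsqueeze(1) - zp).norm(dim=-1)  # distances to positives
        d_neg = (za.unsqueeze(1) - zn).norm(dim=-1)  # distances to negatives
        # Illustrative margin loss: pull positives close, push negatives apart.
        loss = d_pos.mean() + F.relu(1.0 - d_neg).mean()
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        opt = torch.optim.AdamW(self.parameters(), lr=5e-4, weight_decay=0.01)
        sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.99)
        # Stepped once per epoch; with a single full batch per epoch,
        # 150 epochs means 150 optimizer updates.
        return {"optimizer": opt,
                "lr_scheduler": {"scheduler": sched, "interval": "epoch"}}
```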