# Evaluating Anime Models Systematically - Basics

While trying to refine my character models, I realized that the way I've been making models is really inefficient.
It typically goes like this: tweak some configs or data, try some random prompts, and see if the outputs look okay.
It would be helpful to establish a well-defined procedure.
And to evaluate fine-tuned models, it's essential to know and quantify how the base models perform as a baseline.
So here I am, trying to evaluate base models.

I collected 1000 random prompts from Danbooru posts from 2021-2022 with the query `chartags:0 -is:child -rating:e,q order:random score:>=10 filetype:jpg,png,webp ratio:0.45..2.1`
and generated 1000 640x640 images with them for each of 3 widely-used anime models:
[animefull-latest](https://huggingface.co/deepghs/animefull-latest),
[Counterfeit-V3.0](https://civitai.com/models/4468?modelVersionId=57618), [MeinaMix_V11](https://huggingface.co/Meina/MeinaMix_V11).
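
As a rough sketch, the generation loop looks something like the following with diffusers. This is illustrative rather than my exact script: the prompt file name is a placeholder, and I'm showing the pipeline's default sampler settings.

```python
# Minimal generation-loop sketch (illustrative, not the exact script used).
# "prompts.txt" is a hypothetical file with one tag prompt per line.
import os

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "Meina/MeinaMix_V11",  # swap in each model under test
    torch_dtype=torch.float16,
    safety_checker=None,
).to("cuda")

prompts = open("prompts.txt").read().splitlines()
os.makedirs("out", exist_ok=True)
for i, prompt in enumerate(prompts):
    generator = torch.Generator("cuda").manual_seed(i)  # one fixed seed per prompt
    image = pipe(prompt, width=640, height=640, generator=generator).images[0]
    image.save(f"out/{i:04d}.png")
```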

A model can be evaluated on a number of aspects: fidelity, text-image alignment, aesthetics, and diversity. Let's go through them one by one.

## Fidelity

Generated images should be indistinguishable from real ones: they should make sense and not contain obvious errors such as extra limbs, mutated fingers, glitches, or random blobs.
In the literature, it's common to use metrics based on distribution distance, such as FID and IS. I calculated the KID score of the 3 sets of generated images against the 1000 real images.
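
For reference, KID can be computed with torchmetrics roughly as below. Image loading is elided, and the subset size is a free parameter rather than necessarily the value I used.

```python
# KID sketch using torchmetrics; `real_images` and `generated_images` are
# assumed to be uint8 tensors of shape (N, 3, H, W) loaded elsewhere.
from torchmetrics.image.kid import KernelInceptionDistance

kid = KernelInceptionDistance(subset_size=100)  # subset_size must not exceed N
kid.update(real_images, real=True)        # the 1000 real Danbooru images
kid.update(generated_images, real=False)  # the model's outputs
kid_mean, kid_std = kid.compute()
print(kid_mean.item())
```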

| model | KID (lower is better) |
|---|---|
| animefull-latest | 0.01192 |
| Counterfeit-V3.0 | 0.01807 |
| MeinaMix_V11 | 0.01345 |

It seems that KID does not align with human evaluation, which would generally rate animefull-latest as the worst of the three.
This is somewhat expected, since models with a strong style have an image feature distribution that differs from that of random real images.

I also tried multimodal LLMs, including GPT-4V and LLaVA, and unfortunately found them quite useless for this. GPT-4V is supposedly SOTA, but it clearly struggles to spot generation errors.

![](https://huggingface.co/datasets/gustproof/sd-data/resolve/main/imgs/gpt4v-bad-1.png)

![](https://huggingface.co/datasets/gustproof/sd-data/resolve/main/imgs/gpt4v-bad-2.png)

So for now, I can't find a process that computes a fidelity score for anime models; this will have to wait until someone trains a specialized model.

## Text-Image Alignment

Generated images should not contradict the text prompts. A popular metric is the CLIP score, the cosine similarity between the projected CLIP embeddings of the image and the prompt.
There's also [PickScore_v1](https://huggingface.co/yuvalkirstain/PickScore_v1), which is fine-tuned on human preference data.
Neither is well-suited to anime models, due to how different Booru tagging is from natural-language captions.
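
For completeness, here's roughly how the CLIP score is computed with torchmetrics, even though, as noted, it fits Booru-tag prompts poorly.

```python
# CLIP score sketch; `images` is assumed to be a uint8 tensor (N, 3, H, W)
# and `prompts` a list of N strings.
from torchmetrics.multimodal.clip_score import CLIPScore

metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
score = metric(images, prompts)  # mean cosine similarity, scaled by 100
print(score.item())
```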

Models prompted with Booru tags can instead be evaluated with a tagger. Specifically, I used [wd-v1-4-moat-tagger-v2](https://huggingface.co/SmilingWolf/wd-v1-4-moat-tagger-v2) with a threshold of 0.35.
A tag accuracy score can be defined as `#{prompted tags correctly reproduced}/#{prompted tags}`, macro-averaged over all images. Here are the scores:

| model | tag accuracy (higher is better) |
|---|---|
| animefull-latest | 0.464328 |
| Counterfeit-V3.0 | 0.434574 |
| MeinaMix_V11 | 0.375389 |
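
In code, the metric is just set arithmetic over tag sets. Running the tagger itself is elided here, so `predicted_tags` stands in for the thresholded tagger outputs.

```python
# Tag accuracy sketch: fraction of prompted tags the tagger recovers from the
# generated image, macro-averaged over images.
def tag_accuracy(prompted_tags: list[set[str]], predicted_tags: list[set[str]]) -> float:
    per_image = [
        len(prompted & predicted) / len(prompted)
        for prompted, predicted in zip(prompted_tags, predicted_tags)
        if prompted  # skip empty prompts to avoid division by zero
    ]
    return sum(per_image) / len(per_image)
```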

It can be seen that fine-tunes or merges may produce nicer images, but at the cost of controllability.

## Aesthetics

Images should be pretty. While this is generally subjective, there are models that produce an aesthetic score, either averaged from many people's preferences or personalized.
There are CLIP-based models ([aesthetic-predictor](https://github.com/LAION-AI/aesthetic-predictor), [improved-aesthetic-predictor](https://github.com/christophschuhmann/improved-aesthetic-predictor))
and some custom models ([anime-aesthetic](https://huggingface.co/spaces/skytnt/anime-aesthetic-predict), [cafe_aesthetic](https://huggingface.co/cafeai/cafe_aesthetic)).
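
The CLIP-based predictors are essentially a small regression head on top of a normalized CLIP ViT-L/14 image embedding. Below is a sketch following the improved-aesthetic-predictor repo; the head layout and checkpoint name are taken from that repo, and loading its state dict may need key remapping, so treat the details as assumptions.

```python
# Sketch of a CLIP-based aesthetic predictor: a regression head over a
# normalized ViT-L/14 image embedding. Head layout follows the
# improved-aesthetic-predictor repo; without loading its checkpoint
# (commented out below, keys may need remapping) the score is meaningless.
import clip  # pip install git+https://github.com/openai/CLIP.git
import torch
from PIL import Image

model, preprocess = clip.load("ViT-L/14")
head = torch.nn.Sequential(
    torch.nn.Linear(768, 1024), torch.nn.Dropout(0.2),
    torch.nn.Linear(1024, 128), torch.nn.Dropout(0.2),
    torch.nn.Linear(128, 64), torch.nn.Dropout(0.1),
    torch.nn.Linear(64, 16),
    torch.nn.Linear(16, 1),
)
# head.load_state_dict(torch.load("sac+logos+ava1-l14-linearMSE.pth"))

image = preprocess(Image.open("sample.png")).unsqueeze(0)
with torch.no_grad():
    embedding = model.encode_image(image).float()
    embedding = embedding / embedding.norm(dim=-1, keepdim=True)
    print(head(embedding).item())  # roughly a 1-10 scale when trained
```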

I tested improved-aesthetic-predictor and anime-aesthetic, averaging the scores over each set of images:

| model | improved-aesthetic-predictor (higher is better) | anime-aesthetic (higher is better) |
|---|---|---|
| animefull-latest | 6.124954 | 0.639767 |
| Counterfeit-V3.0 | 6.359464 | 0.789190 |
| MeinaMix_V11 | 6.474662 | 0.829989 |

The two scores appear to agree.

Interestingly, GPT-4V does a reasonable job at this.
![](https://huggingface.co/datasets/gustproof/sd-data/resolve/main/imgs/gpt4v-aes.png)

## Diversity

Even with the same prompt, generated images should not be repetitive across different random seeds.
There's the DIV score defined in the [Dreambooth paper](https://arxiv.org/pdf/2208.12242.pdf), which measures image similarity with LPIPS.
Since this set contains only one image per prompt, the metric is not applicable here, and I will leave it to a future update.
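
When same-prompt batches are available, a DIV-style score is straightforward with the `lpips` package: average the pairwise LPIPS distance among images generated from the same prompt. The exact pairing protocol in the paper may differ, so this is a sketch.

```python
# DIV-style diversity sketch: mean pairwise LPIPS over same-prompt images.
# Images are assumed to be float tensors of shape (3, H, W) scaled to [-1, 1].
import itertools

import lpips
import torch

loss_fn = lpips.LPIPS(net="alex")

def div_score(images: list[torch.Tensor]) -> float:
    distances = [
        loss_fn(a.unsqueeze(0), b.unsqueeze(0)).item()
        for a, b in itertools.combinations(images, 2)
    ]
    return sum(distances) / len(distances)
```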

## Conclusions

It's possible to programmatically generate some numbers for a given base model, and we can use those numbers as a proxy for the model's overall performance.

## Miscellaneous notes

I used diffusers, and 13 images from animefull-latest came out as solid black for unknown reasons, even with the safety checker disabled and a single-precision VAE.
These images and their same-prompt counterparts from the other models were excluded from the metric calculations.

The images and prompts can be found [here](https://huggingface.co/datasets/gustproof/sd-data/tree/main/db1k).

It's possible that some models perform better with special configs, but for simplicity I kept the settings identical across models.

The code for image generation and metrics is quite messy, so I will not upload it right now, but feel free to ask questions or give suggestions.

I will probably create a fidelity model eventually if no one else does, but it will take a while.

Prompts with more tags have lower tag accuracy:
![](https://huggingface.co/datasets/gustproof/sd-data/resolve/main/imgs/tagacc-taglen.png)

The effect of tag position is measurable, albeit less pronounced. The trend at positions 20-25 may be due to wraparound at the 77-token limit.
![](https://huggingface.co/datasets/gustproof/sd-data/resolve/main/imgs/tagacc-tagpos.png)

The next post will be about evaluating character models.