Anime Aesthetic Classifier
This repository contains an aesthetic scoring model trained exclusively on anime-style illustrations from the Danbooru dataset. The model is designed to evaluate the visual quality of an image.
The model predicts the following 4 classes:
```python
idx2label = {0: "worst", 1: "worse", 2: "better", 3: "best"}
```
It was trained with the Aesthetic repository.
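Downstream code sometimes wants a single scalar score rather than one of the four labels. One option (a sketch, not part of the released model) is to take the expected class index under the predicted probabilities and normalize it to [0, 1]:

```python
idx2label = {0: "worst", 1: "worse", 2: "better", 3: "best"}

def scalar_score(probs):
    """Collapse 4 class probabilities into one aesthetic score in [0, 1].

    `probs` is a list of probabilities over the classes above, summing to 1.
    The score is the expected class index divided by the maximum index (3),
    so mass on "best" pushes the score toward 1.0.
    """
    expected_idx = sum(i * p for i, p in enumerate(probs))
    return expected_idx / (len(probs) - 1)

# An image predicted mostly "best" scores close to 1.0:
print(scalar_score([0.0, 0.1, 0.2, 0.7]))  # ≈ 0.867
```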
Usage
The backbone is SmilingWolf/wd-swinv2-tagger-v3; the classifier consumes the pooled_features output of the backbone.
First, preprocess the image:
```python
import numpy as np
import torch
from PIL import Image
from transformers.image_transforms import normalize, rescale
from transformers.image_utils import ChannelDimension, PILImageResampling, is_scaled_image

from aesthetic.aesthetic import resize_with_padding

def preprocess_image_for_batch(image_path: str):
    """
    Preprocessing function that returns a (C, H, W) tensor without the
    batch dimension, so that multiple images can be batched together.
    """
    processor_config = {
        "size": {"height": 448, "width": 448},
        "color": [255, 255, 255],
        "image_mean": [0.5, 0.5, 0.5],
        "image_std": [0.5, 0.5, 0.5],
        "rescale_factor": 1 / 255.0,
        "resample": PILImageResampling.BILINEAR,
    }
    size_dict = processor_config["size"]
    color_tuple = tuple(processor_config["color"])
    image_mean = processor_config["image_mean"]
    image_std = processor_config["image_std"]
    rescale_factor = processor_config["rescale_factor"]
    resample_filter = processor_config["resample"]

    try:
        image = Image.open(image_path).convert("RGB")
        image_array = np.array(image)
    except Exception as e:
        print(f"Error opening image {image_path}: {e}")
        return None

    # Preprocessing pipeline: pad to square and resize, rescale to [0, 1], normalize
    processed_image = resize_with_padding(
        image=image_array,
        size=(size_dict["height"], size_dict["width"]),
        color=color_tuple,
        resample=resample_filter,
        data_format=ChannelDimension.FIRST,
    )
    if not is_scaled_image(processed_image):
        processed_image = rescale(
            processed_image,
            scale=rescale_factor,
            data_format=ChannelDimension.FIRST,
        )
    processed_image = normalize(
        processed_image,
        mean=image_mean,
        std=image_std,
        data_format=ChannelDimension.FIRST,
    )

    # Return tensor without batch dimension for batching
    return torch.tensor(processed_image).float()
```
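Because the function returns an unbatched (C, H, W) tensor, several images can be stacked into one batch before the forward pass. A minimal sketch of the stacking step (the dummy tensors below stand in for preprocessed images):

```python
import torch

def build_batch(tensors):
    """Stack per-image (C, H, W) tensors into a (N, C, H, W) batch.

    `tensors` may contain None entries for images that failed to open
    (preprocess_image_for_batch returns None on error); these are skipped.
    """
    valid = [t for t in tensors if t is not None]
    if not valid:
        return None
    return torch.stack(valid, dim=0)

# Example with dummy tensors standing in for preprocessed 448x448 images:
batch = build_batch([torch.zeros(3, 448, 448), None, torch.zeros(3, 448, 448)])
print(batch.shape)  # torch.Size([2, 3, 448, 448])
```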
The feature extractor is loaded with the timm library; the loading code is included in the aesthetic repository.
```python
from aesthetic.aesthetic import load_feature_extractor, resize_with_padding, AestheticClassifier

feature_extractor, _, _ = load_feature_extractor(feature_extractor_path)
feature_extractor.eval()

checkpoint = torch.load(classifier_path, map_location="cpu")
classifier = AestheticClassifier(
    feature_dim=checkpoint['feature_dim'],
    num_classes=checkpoint['num_classes'],
    hidden_dims=checkpoint.get('hidden_dims', 512),
    dropout_rate=checkpoint.get('dropout_rate', 0.1),
)
# Load the trained weights (the state-dict key is assumed; adjust to your checkpoint)
classifier.load_state_dict(checkpoint['model_state_dict'])
classifier.eval()

# Forward pass through feature extractor (add the batch dimension back)
tensor = preprocess_image_for_batch(img_path).unsqueeze(0)
with torch.no_grad():
    features = feature_extractor.forward_features(tensor)
    pooled_features = feature_extractor.head.global_pool(features)
    # Forward pass through classifier
    logits = classifier(pooled_features)
predictions = torch.argmax(logits, dim=1)

# Convert predictions to labels
batch_labels = [idx2label[pred.item()] for pred in predictions]
```
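If per-class confidences are wanted rather than a hard argmax, the logits can be passed through a softmax. A small sketch with dummy logits (the real logits come from the classifier above):

```python
import torch
import torch.nn.functional as F

idx2label = {0: "worst", 1: "worse", 2: "better", 3: "best"}

def label_probabilities(logits):
    """Return a list of {label: probability} dicts, one per image in the batch."""
    probs = F.softmax(logits, dim=1)
    return [
        {idx2label[i]: p.item() for i, p in enumerate(row)}
        for row in probs
    ]

# Dummy 1x4 logits favouring class 3 ("best"):
out = label_probabilities(torch.tensor([[0.1, 0.2, 0.5, 2.0]]))
print(max(out[0], key=out[0].get))  # best
```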
Another inference example can be found in the classification.py file of danbooru-dataset-toolkit.
⚠️ Important Disclaimers
Aesthetics vs. Content
This model evaluates artistic technique, not subject matter.
The classifier is trained to recognize visual attributes such as composition, lighting, coloring, and linework. It does not evaluate the moral, ethical, or legal nature of the content depicted.
Because the training dataset (Danbooru) contains a wide variety of tags and subjects—including explicit, NSFW, or controversial content—the model may assign high aesthetic scores to images containing such content.
Please understand that:
- A high score indicates only that the image is estimated to be visually/technically high quality (e.g., high resolution, detailed shading).
- A high score is NOT an endorsement of the subject matter, tags, or content depicted in the image.
- The authors of this model do not condone or agree with controversial or illegal content that may be found within the training distribution or analyzed by this model.
Limitation of Liability
This model is provided "as-is" for research and experimental purposes only.
- The authors are not responsible for the outputs generated by this model.
- The authors are not responsible for how users choose to utilize these outputs.
- Users are solely responsible for ensuring their use of the model and the data they process complies with local laws and regulations.
Model Details
- Training Data: Danbooru (Anime-style illustrations), around 20K samples.
- Objective: To predict an aesthetic score based on visual fidelity and artistic style.