Anime Style Classifier v5

An EfficientNet-B2 model that classifies anime frames into 4 production style categories and doubles as a style embedding model, capturing the visual drawing style of an image independently of scene content.

Model Details

| Property | Value |
|---|---|
| Architecture | EfficientNet-B2 (timm) |
| Parameters | 7.7M |
| Input | 512 × 512 RGB |
| Embedding dim | 1408 |
| Classes | 4 (flat, modern, painterly, retro) |
| Format | SafeTensors |
| File size | 30 MB |
| Training data | 19,611 images across 4 classes |
| Training method | Supervised CE + progressive resolution fine-tuning (440→512) + weight merge |

Classes

| Style | Description | Example series |
|---|---|---|
| flat | Minimal shading, solid color blocks, low gradients | Ping Pong The Animation, The Heike Story, Kaiba |
| modern | High-detail digital pipeline, clean gradients, contemporary AAA | Demon Slayer, Violet Evergarden, Attack on Titan |
| painterly | Visible brush texture, watercolor/oil feel, textured transitions | Mushi-Shi, Your Lie in April, Land of the Lustrous |
| retro | Cel-era aesthetic, grain, simpler compositing, limited dynamic range | Cowboy Bebop, Neon Genesis Evangelion, Robotech |

Style Examples

(Example frames for each class: flat, modern, painterly, retro.)

Performance

OOD Evaluation (439 held-out images, never seen during training)

Overall accuracy: 98.41% (432/439)

| Class | Recall | Count | Errors |
|---|---|---|---|
| flat | 100.0% | 83/83 | 0 |
| modern | 98.5% | 203/206 | 3 |
| painterly | 94.0% | 47/50 | 3 |
| retro | 99.0% | 99/100 | 1 |

Confusion Matrix

| True \ Predicted | flat | modern | painterly | retro |
|---|---|---|---|---|
| flat | 83 | 0 | 0 | 0 |
| modern | 0 | 203 | 1 | 2 |
| painterly | 0 | 3 | 47 | 0 |
| retro | 0 | 1 | 0 | 99 |
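The per-class recalls and the overall accuracy follow directly from this matrix; a quick NumPy check (values copied from the table above):

```python
import numpy as np

# Rows = true class, columns = predicted class (from the confusion matrix above)
classes = ["flat", "modern", "painterly", "retro"]
cm = np.array([
    [83,   0,  0,  0],
    [ 0, 203,  1,  2],
    [ 0,   3, 47,  0],
    [ 0,   1,  0, 99],
])

recall = np.diag(cm) / cm.sum(axis=1)       # per-class recall
accuracy = np.trace(cm) / cm.sum()          # overall accuracy

for name, r in zip(classes, recall):
    print(f"{name:10s} recall = {r:.1%}")
print(f"overall accuracy = {accuracy:.2%}")  # 98.41% (432/439)
```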

Error Analysis (7 total misclassifications)

All errors are plausible borderline cases.

| True | Predicted | Confidence | Notes |
|---|---|---|---|
| modern | painterly | 95.8% | Soft-lit scene with textured background |
| modern | retro | 98.5% | Retro-styled modern production |
| modern | retro | 93.3% | Muted palette resembling cel-era |
| painterly | modern | 96.0% | AI-generated with clean digital finish |
| painterly | modern | 79.5% | Borderline painterly/modern |
| painterly | modern | 91.1% | Synthetic painterly with modern elements |
| retro | modern | 66.3% | Low confidence, genuinely ambiguous era |

Use as Style Embedding Model

Beyond classification, this model produces 1408-dimensional style embeddings from the penultimate layer that capture the visual drawing style of anime frames (line weight, shading technique, color palette, compositing approach) independently of scene content (characters, backgrounds, objects).

Embedding Performance

Evaluated on 1,800 frames from 18 anime series across all 4 style groups (100 frames per series):

| Model | Intra-Series Cohesion ↑ | Same-Style Similarity ↑ | Cross-Style Similarity ↓ | Separation Gap ↑ |
|---|---|---|---|---|
| anime-style-classifier-v5 (this model) | 0.347 | 0.800 | 0.464 | 0.336 |
| Base EfficientNet-B2 (ImageNet) | 0.228 | 0.877 | 0.826 | 0.051 |
| ViT-B/16 (ImageNet) | 0.214 | 0.844 | 0.795 | 0.049 |

6.5× better style separation than both ImageNet baselines. The fine-tuned model clusters same-style series together while pushing different styles apart; the ImageNet models treat everything as nearly identical (cross-style similarity ≈ 0.8), unable to distinguish drawing style from scene content.
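The separation metrics can be expressed as mean pairwise cosine similarities within and across style groups; a minimal sketch of how they might be computed (the exact metric definitions are my reading of the column names, not taken from the evaluation code):

```python
import numpy as np

def separation_metrics(embeddings, style_labels):
    """Mean pairwise cosine similarity within vs. across style groups.

    embeddings: (N, D) array, assumed L2-normalized (dot product = cosine).
    style_labels: length-N sequence of style names, e.g. 'flat', 'retro'.
    Returns (same_style, cross_style, separation_gap).
    """
    labels = np.asarray(style_labels)
    sims = embeddings @ embeddings.T                 # all pairwise cosines
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)      # drop self-similarity

    same_style = sims[same & off_diag].mean()
    cross_style = sims[~same].mean()
    return same_style, cross_style, same_style - cross_style
```

A larger gap means the embedding space keeps style families apart while holding each family together.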

| Style Group | Centroid Cohesion | Notes |
|---|---|---|
| flat | 0.835 | Yuasa / Science SARU / rotoscoped; tight cluster |
| modern | 0.906 | Digital AAA productions; very tight |
| painterly | 0.885 | Watercolor / textured; cohesive |
| retro | 0.624 | More varied (1985-2004 span); intentional |

Embedding Usage

```python
import torch
import timm
from PIL import Image
from torchvision import transforms
from safetensors.torch import load_file

# Load model
model = timm.create_model('efficientnet_b2', pretrained=False, num_classes=4)
state_dict = load_file('model.safetensors')
model.load_state_dict(state_dict)
model.eval()

transform = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def get_style_embedding(image_path):
    """Extract a 1408-dim style embedding from the penultimate layer."""
    img = Image.open(image_path).convert('RGB')
    x = transform(img).unsqueeze(0)
    with torch.no_grad():
        features = model.forward_features(x)           # [1, 1408, 16, 16]
        embedding = features.mean(dim=[-2, -1])        # [1, 1408] global avg pool
        embedding = embedding / embedding.norm(dim=1, keepdim=True)  # L2 normalize
    return embedding.squeeze(0).numpy()

# Compare two frames
emb_a = get_style_embedding('frame_a.jpg')
emb_b = get_style_embedding('frame_b.jpg')
similarity = emb_a @ emb_b  # cosine similarity (embeddings are L2-normalized)
print(f'Style similarity: {similarity:.4f}')
# > 0.8   = near-identical style (likely same series/studio)
# 0.5-0.8 = same style family (e.g., both modern)
# < 0.5   = different style families
```

Interpreting Embedding Distances

The global average pooled features from forward_features() (before the classifier head) encode style properties:

| Cosine Similarity | Interpretation |
|---|---|
| > 0.8 | Near-identical style (same series, same studio) |
| 0.5 - 0.8 | Same style family (e.g., two modern AAA shows) |
| 0.0 - 0.5 | Different style families |
| < 0.0 | Opposite ends of the style spectrum (e.g., 1980s cel vs 2020s digital) |

The embedding captures how something is drawn (line weight, shading technique, color palette, compositing), not what is drawn. Scenes with completely different content (landscapes vs close-ups vs action sequences) from the same series will still cluster together.
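One practical consequence is style-based retrieval: rank a library of frames by similarity to a query frame. A minimal sketch, assuming embeddings were precomputed with `get_style_embedding` above (the file names in the usage note are placeholders):

```python
import numpy as np

def most_similar_styles(query_emb, library_embs, library_names, top_k=3):
    """Rank library frames by style similarity to a query frame.

    All embeddings are assumed L2-normalized (as produced by
    get_style_embedding), so a dot product is a cosine similarity.
    """
    sims = library_embs @ query_emb                # cosine vs. every library frame
    order = np.argsort(sims)[::-1][:top_k]         # highest similarity first
    return [(library_names[i], float(sims[i])) for i in order]
```

Usage would look like `most_similar_styles(get_style_embedding('query.jpg'), library_embs, library_names)`, where `library_embs` stacks one embedding per reference frame.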

Classification Usage

```python
import torch
import timm
from PIL import Image
from torchvision import transforms
from safetensors.torch import load_file

# Load model
model = timm.create_model('efficientnet_b2', pretrained=False, num_classes=4)
state_dict = load_file('model.safetensors')
model.load_state_dict(state_dict)
model.eval()

# Preprocessing: resize to 512 × 512
transform = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

image = Image.open('anime_frame.jpg').convert('RGB')
x = transform(image).unsqueeze(0)

with torch.no_grad():
    logits = model(x)
    probs = torch.softmax(logits, dim=1)
    pred_idx = logits.argmax(-1).item()
    confidence = probs[0, pred_idx].item()

classes = ['flat', 'modern', 'painterly', 'retro']
print(f'{classes[pred_idx]}: {confidence:.1%}')
```

Training Details

Architecture and Training Pipeline

  1. Base training: EfficientNet-B2 pretrained on ImageNet, fine-tuned at 440px on the style dataset with focal loss and class-balanced sampling
  2. Progressive resolution: fine-tuned from 440→512px with frozen batch norm, discriminative learning rates (backbone at 1/10th of the head), and a 1-epoch warmup
  3. Weight merging: linear interpolation of the base (440px) and progressive (512px) checkpoints at the optimal α = 0.54, lifting accuracy beyond either parent
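Step 3 is plain linear interpolation of the two checkpoints' parameters. A sketch, assuming both state dicts share identical keys and shapes (the actual merge script is not included in this card):

```python
import torch

def merge_checkpoints(sd_base, sd_progressive, alpha=0.54):
    """Linearly interpolate two state dicts: (1 - alpha)*base + alpha*progressive.

    alpha=0.54 is the interpolation weight reported above; presumably found
    by sweeping alpha against held-out accuracy.
    """
    merged = {}
    for key, base_param in sd_base.items():
        prog_param = sd_progressive[key]
        if torch.is_floating_point(base_param):
            merged[key] = (1.0 - alpha) * base_param + alpha * prog_param
        else:
            # Integer buffers (e.g. BN num_batches_tracked) can't be averaged
            merged[key] = prog_param
    return merged
```

The merged dict can then be loaded with `model.load_state_dict(merged)` and re-evaluated.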

Training Data

| Class | Images | Description |
|---|---|---|
| flat | 2,138 | Real anime frames + synthetic consensus images |
| modern | 3,954 | Extracted from 50+ modern anime series |
| painterly | 5,130 | Real frames + Safebooru/DeviantArt + validated synthetic |
| retro | 8,389 | Extracted from 30+ pre-digital-era series |
| Total | 19,611 | |

Split: 17,649 train / 1,962 validation (90/10) + 439 separate OOD test images.

Key Training Decisions

Focal loss handles class imbalance (retro has roughly 4× more samples than flat). The best single fine-tuned checkpoint reached 96.36% OOD accuracy; linear weight merging of two complementary checkpoints pushed this to 98.41%. Batch norm freezing during progressive resolution training was critical: without it, running-stat drift degraded OOD performance.
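One common way to implement the batch norm freezing described above is to hold every BN layer in eval mode during fine-tuning so its running statistics stop updating; the exact mechanism used here is not specified in this card, so treat this as a sketch:

```python
import torch.nn as nn

def freeze_batch_norm(model):
    """Keep all BatchNorm layers in eval mode so running stats stop updating."""
    for module in model.modules():
        if isinstance(module, nn.modules.batchnorm._BatchNorm):
            module.eval()                         # use stored running stats
            module.weight.requires_grad_(False)   # optionally freeze affine params
            module.bias.requires_grad_(False)
    return model
```

Note that a later `model.train()` call flips BN layers back to train mode, so the freeze is typically reapplied at the start of each epoch.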

What This Model is NOT

  • ❌ Not a character design classifier β€” won't detect moe, chibi, bishounen, etc.
  • ❌ Not a content classifier β€” sees how it's drawn, not what
  • ❌ Not an era detector β€” a modern show using cel technique β†’ retro (correctly)
  • ❌ Not an AI detector β€” AI art that nails a style gets classified by that style
  • ❌ Not a quality scorer β€” good and bad modern frames both classify as modern

Limitations

  • Trained primarily on Japanese anime (TV series, films). May not generalize to Western animation, donghua, or manhwa.
  • "Painterly" class is hardest β€” the boundary with modern is subjective when digital tools simulate traditional media.
  • AI-generated anime art (e.g., ComfyUI, Stable Diffusion) often confuses the model between painterly and modern.
  • Flat style class has the fewest real-world examples β€” supplemented with curated synthetic images.

Image Disclaimer

Example and evaluation images shown in this model card are either generated via diffusion models or sourced from web search results for illustrative purposes only. Web-sourced images remain the property of their respective copyright holders and are used here solely as examples to demonstrate model behavior.

Citation

```bibtex
@misc{anime-style-classifier-v5,
  title={Anime Style Classifier V5},
  year={2026},
  publisher={HuggingFace},
  note={EfficientNet-B2 for anime production style classification and embedding}
}
```