Anime Style Classifier v5
An EfficientNet-B2 model that classifies anime frames into 4 production style categories and doubles as a style embedding model, capturing the visual drawing style of an image independently of scene content.
Model Details
| Property | Value |
|---|---|
| Architecture | EfficientNet-B2 (timm) |
| Parameters | 7.7M |
| Input | 512 × 512 RGB |
| Embedding dim | 1408 |
| Classes | 4 (flat, modern, painterly, retro) |
| Format | SafeTensors |
| File size | 30 MB |
| Training data | 19,611 images across 4 classes |
| Training method | Supervised CE + progressive resolution fine-tuning (440→512) + weight merge |
Classes
| Style | Description | Example series |
|---|---|---|
| flat | Minimal shading, solid color blocks, low gradients | Ping Pong The Animation, The Heike Story, Kaiba |
| modern | High-detail digital pipeline, clean gradients, contemporary AAA | Demon Slayer, Violet Evergarden, Attack on Titan |
| painterly | Visible brush texture, watercolor/oil feel, textured transitions | Mushi-Shi, Your Lie in April, Land of the Lustrous |
| retro | Cel-era aesthetic, grain, simpler compositing, limited dynamic range | Cowboy Bebop, Neon Genesis Evangelion, Robotech |
Style Examples
*(Example frame grid for each style: flat, modern, painterly, retro. Images omitted from this copy.)*
Performance
OOD Evaluation (439 held-out images, never seen during training)
Overall accuracy: 98.41% (432/439)
| Class | Recall | Count | Errors |
|---|---|---|---|
| flat | 100.0% | 83/83 | 0 |
| modern | 98.5% | 203/206 | 3 |
| painterly | 94.0% | 47/50 | 3 |
| retro | 99.0% | 99/100 | 1 |
Confusion Matrix
| | → flat | → modern | → painterly | → retro |
|---|---|---|---|---|
| flat | 83 | 0 | 0 | 0 |
| modern | 0 | 203 | 1 | 2 |
| painterly | 0 | 3 | 47 | 0 |
| retro | 0 | 1 | 0 | 99 |
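The per-class recall and overall accuracy reported above can be recomputed directly from the confusion matrix as a sanity check (matrix values taken from the table; rows are true classes, columns are predictions):

```python
import numpy as np

# Rows = true class, columns = predicted class (flat, modern, painterly, retro).
cm = np.array([
    [83,   0,  0,  0],  # flat
    [ 0, 203,  1,  2],  # modern
    [ 0,   3, 47,  0],  # painterly
    [ 0,   1,  0, 99],  # retro
])

recall = cm.diagonal() / cm.sum(axis=1)    # per-class recall
accuracy = cm.diagonal().sum() / cm.sum()  # overall accuracy

for cls, r in zip(['flat', 'modern', 'painterly', 'retro'], recall):
    print(f'{cls}: {r:.1%}')
print(f'accuracy: {accuracy:.2%}')  # 98.41% (432/439)
```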
Error Analysis (7 total misclassifications)
All errors are plausible borderline cases.
Use as Style Embedding Model
Beyond classification, this model produces 1408-dimensional style embeddings from the penultimate layer that capture the visual drawing style of anime frames (line weight, shading technique, color palette, compositing approach) independently of scene content (characters, backgrounds, objects).
Embedding Performance
Evaluated on 1,800 frames from 18 anime series across all 4 style groups (100 frames per series):
| Model | Intra-Series Cohesion ↑ | Same-Style Similarity ↑ | Cross-Style Similarity ↓ | Separation Gap ↑ |
|---|---|---|---|---|
| anime-style-classifier-v5 (this model) | 0.347 | 0.800 | 0.464 | 0.336 |
| Base EfficientNet-B2 (ImageNet) | 0.228 | 0.877 | 0.826 | 0.051 |
| ViT-B/16 (ImageNet) | 0.214 | 0.844 | 0.795 | 0.049 |
6.5× better style separation than both ImageNet baselines. The fine-tuned model clusters same-style series together while pushing different styles apart. ImageNet models treat everything as nearly identical (0.82 cross-style similarity), unable to distinguish drawing style from scene content.
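The separation-gap figures follow directly from the table: the gap is same-style minus cross-style similarity, and dividing the fine-tuned gap by a baseline's gives the roughly 6.5× improvement quoted above:

```python
# Similarity numbers taken from the table above.
models = {
    'v5 (fine-tuned)': {'same': 0.800, 'cross': 0.464},
    'EfficientNet-B2': {'same': 0.877, 'cross': 0.826},
    'ViT-B/16':        {'same': 0.844, 'cross': 0.795},
}

# Separation gap = same-style similarity - cross-style similarity.
gaps = {name: m['same'] - m['cross'] for name, m in models.items()}
for name, gap in gaps.items():
    print(f'{name}: gap = {gap:.3f}')
print(f"improvement vs B2 baseline: {gaps['v5 (fine-tuned)'] / gaps['EfficientNet-B2']:.1f}x")
```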
| Style Group | Centroid Cohesion | Notes |
|---|---|---|
| flat | 0.835 | Yuasa / Science SARU / rotoscoped; tight cluster |
| modern | 0.906 | Digital AAA productions; very tight |
| painterly | 0.885 | Watercolor / textured; cohesive |
| retro | 0.624 | More varied (1985–2004 span); intentional |
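Centroid cohesion as reported here can be computed as the mean cosine similarity between each series-level embedding and its style group's centroid. A minimal sketch with synthetic vectors (the grouping and metric are an interpretation of the evaluation, not the exact script used):

```python
import numpy as np

def centroid_cohesion(series_embeddings):
    """Mean cosine similarity of each series embedding to the group centroid.

    series_embeddings: (n_series, dim) array of L2-normalized embeddings,
    one row per series in the style group.
    """
    centroid = series_embeddings.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    return float((series_embeddings @ centroid).mean())

# Toy example: three tightly clustered unit vectors -> cohesion near 1.0.
rng = np.random.default_rng(0)
base = rng.normal(size=128)
group = np.stack([base + 0.05 * rng.normal(size=128) for _ in range(3)])
group /= np.linalg.norm(group, axis=1, keepdims=True)
print(f'cohesion: {centroid_cohesion(group):.3f}')
```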
Embedding Usage
```python
import torch
import timm
from PIL import Image
from torchvision import transforms
from safetensors.torch import load_file

# Load model
model = timm.create_model('efficientnet_b2', pretrained=False, num_classes=4)
state_dict = load_file('model.safetensors')
model.load_state_dict(state_dict)
model.eval()

transform = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def get_style_embedding(image_path):
    """Extract a 1408-dim style embedding from the penultimate layer."""
    img = Image.open(image_path).convert('RGB')
    x = transform(img).unsqueeze(0)
    with torch.no_grad():
        features = model.forward_features(x)     # [1, 1408, 16, 16]
        embedding = features.mean(dim=[-2, -1])  # [1, 1408] global avg pool
        embedding = embedding / embedding.norm(dim=1, keepdim=True)  # L2 normalize
    return embedding.squeeze(0).numpy()

# Compare two frames
emb_a = get_style_embedding('frame_a.jpg')
emb_b = get_style_embedding('frame_b.jpg')
similarity = emb_a @ emb_b  # cosine similarity (embeddings are L2-normalized)
print(f'Style similarity: {similarity:.4f}')
# > 0.8   = very similar style (likely same series/studio)
# 0.4-0.8 = same style family (e.g., both modern)
# < 0.2   = different style axis (e.g., retro vs modern)
```
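With embeddings in hand, a simple style search index is just a matrix of L2-normalized vectors, and cosine similarity reduces to a matrix-vector product. A minimal sketch (random vectors stand in for real frame embeddings; `build_index` and `most_similar` are illustrative helpers, not part of the model):

```python
import numpy as np

def build_index(embeddings):
    """Stack L2-normalized embeddings into an (n, dim) search matrix."""
    mat = np.stack(embeddings)
    return mat / np.linalg.norm(mat, axis=1, keepdims=True)

def most_similar(index, query, k=3):
    """Return (row, cosine similarity) pairs for the k closest frames."""
    sims = index @ (query / np.linalg.norm(query))
    top = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in top]

# Toy demo with random 1408-dim vectors standing in for frame embeddings.
rng = np.random.default_rng(42)
frames = [rng.normal(size=1408) for _ in range(5)]
index = build_index(frames)
print(most_similar(index, frames[2], k=2))  # frame 2 itself ranks first (sim 1.0)
```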
Interpreting Embedding Distances
The global average pooled features from forward_features() (before the classifier head) encode style properties:
| Cosine Similarity | Interpretation |
|---|---|
| > 0.8 | Near-identical style (same series, same studio) |
| 0.5 – 0.8 | Same style family (e.g., two modern AAA shows) |
| 0.0 – 0.5 | Different style families |
| < 0.0 | Opposite ends of the style spectrum (e.g., 1980s cel vs 2020s digital) |
The embedding captures how something is drawn (line weight, shading technique, color palette, compositing), not what is drawn. Scenes with completely different content (landscapes vs close-ups vs action sequences) from the same series will still cluster together.
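The bands in the table map naturally onto a small helper for labeling pairwise comparisons (the band names are this card's interpretation; the thresholds are not calibrated cut-offs):

```python
def interpret_similarity(sim: float) -> str:
    """Map a cosine similarity to the style-distance bands described above."""
    if sim > 0.8:
        return 'near-identical style'
    if sim >= 0.5:
        return 'same style family'
    if sim >= 0.0:
        return 'different style families'
    return 'opposite ends of the style spectrum'

print(interpret_similarity(0.91))   # near-identical style
print(interpret_similarity(0.63))   # same style family
print(interpret_similarity(-0.12))  # opposite ends of the style spectrum
```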
Classification Usage
```python
import torch
import timm
from PIL import Image
from torchvision import transforms
from safetensors.torch import load_file

# Load model
model = timm.create_model('efficientnet_b2', pretrained=False, num_classes=4)
state_dict = load_file('model.safetensors')
model.load_state_dict(state_dict)
model.eval()

# Preprocessing: resize to 512×512
transform = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

image = Image.open('anime_frame.jpg').convert('RGB')
x = transform(image).unsqueeze(0)

with torch.no_grad():
    logits = model(x)
    probs = torch.softmax(logits, dim=1)

pred_idx = logits.argmax(-1).item()
confidence = probs[0, pred_idx].item()

classes = ['flat', 'modern', 'painterly', 'retro']
print(f'{classes[pred_idx]}: {confidence:.1%}')
```
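For larger frame sets the same pipeline runs batched: stack transformed frames into one tensor and take a single forward pass. A sketch, with a stand-in linear model so it runs without the checkpoint (`classify_batch` and `dummy` are illustrative, not part of the release):

```python
import torch
import torch.nn as nn

classes = ['flat', 'modern', 'painterly', 'retro']

def classify_batch(model, batch, classes=classes):
    """Run one batched forward pass and return a (label, confidence) per frame.

    batch: float tensor of shape [N, 3, 512, 512], already transformed.
    """
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(batch), dim=1)
    conf, idx = probs.max(dim=1)
    return [(classes[i], float(c)) for i, c in zip(idx.tolist(), conf)]

# Stand-in model; in practice load the EfficientNet-B2 checkpoint as above.
dummy = nn.Sequential(nn.Flatten(), nn.Linear(3 * 512 * 512, 4))
batch = torch.randn(2, 3, 512, 512)
print(classify_batch(dummy, batch))
```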
Training Details
Architecture and Training Pipeline
- Base training: EfficientNet-B2 pretrained on ImageNet, fine-tuned at 440 px on the style dataset with focal loss and class-balanced sampling
- Progressive resolution: Fine-tuned from 440→512 px with frozen batch norm, discriminative learning rates (backbone at 1/10th of the head), and a 1-epoch warmup
- Weight merging: Linear interpolation of the base (440 px) and progressive (512 px) checkpoints at optimal α=0.54, lifting accuracy beyond either parent
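The weight merge in the last step is plain linear interpolation of the two checkpoints' parameter tensors; a minimal sketch (the convention of weighting α toward the base checkpoint is an assumption, and matching keys/float dtypes are assumed):

```python
import torch

def merge_state_dicts(sd_a, sd_b, alpha=0.54):
    """Linearly interpolate two state dicts: alpha * A + (1 - alpha) * B.

    Assumes matching keys, shapes, and float dtypes (same architecture,
    two different checkpoints).
    """
    return {k: alpha * sd_a[k] + (1 - alpha) * sd_b[k] for k in sd_a}

# Toy demo with scalar "weights" to show the interpolation.
a = {'w': torch.tensor([1.0, 0.0])}
b = {'w': torch.tensor([0.0, 1.0])}
merged = merge_state_dicts(a, b)
print(merged['w'])  # tensor([0.5400, 0.4600])
```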
Training Data
| Class | Images | Description |
|---|---|---|
| flat | 2,138 | Real anime frames + synthetic consensus images |
| modern | 3,954 | Extracted from 50+ modern anime series |
| painterly | 5,130 | Real frames + Safebooru/DeviantArt + validated synthetic |
| retro | 8,389 | Extracted from 30+ pre-digital-era series |
| Total | 19,611 | |
Split: 17,649 train / 1,962 validation (90/10) + 439 separate OOD test images.
Key Training Decisions
Focal loss handles class imbalance (retro has 4× more samples than flat). The best single fine-tuned checkpoint reached 96.36% OOD accuracy; linear weight merging of two complementary checkpoints pushed this to 98.41%. Batch norm freezing during progressive resolution training was critical: without it, running-stat drift degraded OOD performance.
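Focal loss down-weights easy examples by a factor (1 − p_t)^γ on top of cross-entropy, so abundant, easily classified retro frames contribute less gradient. A minimal PyTorch sketch (γ=2.0 is the common default; the value used for this model is not stated):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Cross-entropy scaled by (1 - p_t)^gamma to focus on hard examples.

    gamma=2.0 is the common default; the value used here is an assumption.
    """
    ce = F.cross_entropy(logits, targets, reduction='none')  # -log(p_t)
    pt = torch.exp(-ce)                                      # p_t
    return ((1 - pt) ** gamma * ce).mean()

# With gamma=0 this reduces exactly to standard cross-entropy.
logits = torch.tensor([[2.0, 0.5, 0.1, -1.0]])
target = torch.tensor([0])
print(focal_loss(logits, target, gamma=0.0) == F.cross_entropy(logits, target))  # tensor(True)
```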
What This Model is NOT
- Not a character design classifier: won't detect moe, chibi, bishounen, etc.
- Not a content classifier: it sees how it's drawn, not what
- Not an era detector: a modern show using cel technique → retro (correctly)
- Not an AI detector: AI art that nails a style gets classified by that style
- Not a quality scorer: good and bad modern frames both classify as modern
Limitations
- Trained primarily on Japanese anime (TV series, films). May not generalize to Western animation, donghua, or manhwa.
- "Painterly" class is hardest β the boundary with modern is subjective when digital tools simulate traditional media.
- AI-generated anime art (e.g., ComfyUI, Stable Diffusion) often confuses the model between painterly and modern.
- Flat style class has the fewest real-world examples β supplemented with curated synthetic images.
Image Disclaimer
Example and evaluation images shown in this model card are either generated via diffusion models or sourced from web search results for illustrative purposes only. Web-sourced images remain the property of their respective copyright holders and are used here solely as examples to demonstrate model behavior.
Citation
```bibtex
@misc{anime-style-classifier-v5,
  title={Anime Style Classifier V5},
  year={2026},
  publisher={HuggingFace},
  note={EfficientNet-B2 for anime production style classification and embedding}
}
```