Anime Style Classifier v5
An EfficientNet-B2 model that classifies anime frames into 4 production style categories and doubles as a style embedding model, capturing the visual drawing style of an image independently of scene content.
Model Details
| Property | Value |
|---|---|
| Architecture | EfficientNet-B2 (timm) |
| Parameters | 7.7M |
| Input | 512 × 512 RGB |
| Embedding dim | 1408 |
| Classes | 4 (flat, modern, painterly, retro) |
| Format | SafeTensors |
| File size | 30 MB |
| Training data | 19,611 images across 4 classes |
| Training method | Supervised CE + progressive resolution fine-tuning (440→512) + weight merge |
Classes
| Style | Description | Example series |
|---|---|---|
| flat | Minimal shading, solid color blocks, low gradients | Ping Pong The Animation, The Heike Story, Kaiba |
| modern | High-detail digital pipeline, clean gradients, contemporary AAA | Demon Slayer, Violet Evergarden, Attack on Titan |
| painterly | Visible brush texture, watercolor/oil feel, textured transitions | Mushi-Shi, Your Lie in April, Land of the Lustrous |
| retro | Cel-era aesthetic, grain, simpler compositing, limited dynamic range | Cowboy Bebop, Neon Genesis Evangelion, Robotech |
Style Examples
*(Example frame grid for each style: flat, modern, painterly, retro. Images omitted from this copy.)*
Performance
OOD Evaluation (439 held-out images, never seen during training)
Overall accuracy: 98.41% (432/439)
| Class | Recall | Count | Errors |
|---|---|---|---|
| flat | 100.0% | 83/83 | 0 |
| modern | 98.5% | 203/206 | 3 |
| painterly | 94.0% | 47/50 | 3 |
| retro | 99.0% | 99/100 | 1 |
Confusion Matrix
| | → flat | → modern | → painterly | → retro |
|---|---|---|---|---|
| flat | 83 | 0 | 0 | 0 |
| modern | 0 | 203 | 1 | 2 |
| painterly | 0 | 3 | 47 | 0 |
| retro | 0 | 1 | 0 | 99 |
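The per-class recall and overall accuracy reported above can be recomputed directly from the confusion matrix as a sanity check (matrix values taken from the table; rows are true classes, columns are predictions):

```python
import numpy as np

# Rows = true class, columns = predicted class (flat, modern, painterly, retro).
cm = np.array([
    [83,   0,  0,  0],  # flat
    [ 0, 203,  1,  2],  # modern
    [ 0,   3, 47,  0],  # painterly
    [ 0,   1,  0, 99],  # retro
])

recall = cm.diagonal() / cm.sum(axis=1)    # per-class recall
accuracy = cm.diagonal().sum() / cm.sum()  # overall accuracy

for cls, r in zip(['flat', 'modern', 'painterly', 'retro'], recall):
    print(f'{cls}: {r:.1%}')
print(f'accuracy: {accuracy:.2%}')  # 98.41% (432/439)
```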
Error Analysis (7 total misclassifications)
All errors are plausible borderline cases.
Use as Style Embedding Model
Beyond classification, this model produces 1408-dimensional style embeddings from the penultimate layer that capture the visual drawing style of anime frames (line weight, shading technique, color palette, compositing approach) independently of scene content (characters, backgrounds, objects).
Embedding Performance
Evaluated on 1,800 frames from 18 anime series across all 4 style groups (100 frames per series):
| Model | Intra-Series Cohesion ↑ | Same-Style Similarity ↑ | Cross-Style Similarity ↓ | Separation Gap ↑ |
|---|---|---|---|---|
| anime-style-classifier-v5 (this model) | 0.347 | 0.800 | 0.464 | 0.336 |
| Base EfficientNet-B2 (ImageNet) | 0.228 | 0.877 | 0.826 | 0.051 |
| ViT-B/16 (ImageNet) | 0.214 | 0.844 | 0.795 | 0.049 |
6.5× better style separation than both ImageNet baselines. The fine-tuned model clusters same-style series together while pushing different styles apart. ImageNet models treat everything as nearly identical (0.82 cross-style similarity), unable to distinguish drawing style from scene content.
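The separation-gap figures follow directly from the table: the gap is same-style minus cross-style similarity, and dividing the fine-tuned gap by a baseline's gives the roughly 6.5× improvement quoted above:

```python
# Similarity numbers taken from the table above.
models = {
    'v5 (fine-tuned)': {'same': 0.800, 'cross': 0.464},
    'EfficientNet-B2': {'same': 0.877, 'cross': 0.826},
    'ViT-B/16':        {'same': 0.844, 'cross': 0.795},
}

# Separation gap = same-style similarity - cross-style similarity.
gaps = {name: m['same'] - m['cross'] for name, m in models.items()}
for name, gap in gaps.items():
    print(f'{name}: gap = {gap:.3f}')
print(f"improvement vs B2 baseline: {gaps['v5 (fine-tuned)'] / gaps['EfficientNet-B2']:.1f}x")
```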
| Style Group | Centroid Cohesion | Notes |
|---|---|---|
| flat | 0.835 | Yuasa / Science SARU / rotoscoped; tight cluster |
| modern | 0.906 | Digital AAA productions; very tight |
| painterly | 0.885 | Watercolor / textured; cohesive |
| retro | 0.624 | More varied (1985–2004 span); intentional |
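Centroid cohesion as reported here can be computed as the mean cosine similarity between each series-level embedding and its style group's centroid. A minimal sketch with synthetic vectors (the grouping and metric are an interpretation of the evaluation, not the exact script used):

```python
import numpy as np

def centroid_cohesion(series_embeddings):
    """Mean cosine similarity of each series embedding to the group centroid.

    series_embeddings: (n_series, dim) array of L2-normalized embeddings,
    one row per series in the style group.
    """
    centroid = series_embeddings.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    return float((series_embeddings @ centroid).mean())

# Toy example: three tightly clustered unit vectors -> cohesion near 1.0.
rng = np.random.default_rng(0)
base = rng.normal(size=128)
group = np.stack([base + 0.05 * rng.normal(size=128) for _ in range(3)])
group /= np.linalg.norm(group, axis=1, keepdims=True)
print(f'cohesion: {centroid_cohesion(group):.3f}')
```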
Embedding Usage
```python
import torch
import timm
from PIL import Image
from torchvision import transforms
from safetensors.torch import load_file

# Load model
model = timm.create_model('efficientnet_b2', pretrained=False, num_classes=4)
state_dict = load_file('model.safetensors')
model.load_state_dict(state_dict)
model.eval()

transform = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def get_style_embedding(image_path):
    """Extract a 1408-dim style embedding from the penultimate layer."""
    img = Image.open(image_path).convert('RGB')
    x = transform(img).unsqueeze(0)
    with torch.no_grad():
        features = model.forward_features(x)     # [1, 1408, 16, 16]
        embedding = features.mean(dim=[-2, -1])  # [1, 1408] global avg pool
        embedding = embedding / embedding.norm(dim=1, keepdim=True)  # L2 normalize
    return embedding.squeeze(0).numpy()

# Compare two frames
emb_a = get_style_embedding('frame_a.jpg')
emb_b = get_style_embedding('frame_b.jpg')
similarity = emb_a @ emb_b  # cosine similarity (embeddings are L2-normalized)
print(f'Style similarity: {similarity:.4f}')
# > 0.8   = very similar style (likely same series/studio)
# 0.4-0.8 = same style family (e.g., both modern)
# < 0.2   = different style axis (e.g., retro vs modern)
```
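With embeddings in hand, a simple style search index is just a matrix of L2-normalized vectors, and cosine similarity reduces to a matrix-vector product. A minimal sketch (random vectors stand in for real frame embeddings; `build_index` and `most_similar` are illustrative helpers, not part of the model):

```python
import numpy as np

def build_index(embeddings):
    """Stack L2-normalized embeddings into an (n, dim) search matrix."""
    mat = np.stack(embeddings)
    return mat / np.linalg.norm(mat, axis=1, keepdims=True)

def most_similar(index, query, k=3):
    """Return (row, cosine similarity) pairs for the k closest frames."""
    sims = index @ (query / np.linalg.norm(query))
    top = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in top]

# Toy demo with random 1408-dim vectors standing in for frame embeddings.
rng = np.random.default_rng(42)
frames = [rng.normal(size=1408) for _ in range(5)]
index = build_index(frames)
print(most_similar(index, frames[2], k=2))  # frame 2 itself ranks first (sim 1.0)
```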
Interpreting Embedding Distances
The global average pooled features from forward_features() (before the classifier head) encode style properties:
| Cosine Similarity | Interpretation |
|---|---|
| > 0.8 | Near-identical style (same series, same studio) |
| 0.5 – 0.8 | Same style family (e.g., two modern AAA shows) |
| 0.0 – 0.5 | Different style families |
| < 0.0 | Opposite ends of the style spectrum (e.g., 1980s cel vs 2020s digital) |
The embedding captures how something is drawn (line weight, shading technique, color palette, compositing), not what is drawn. Scenes with completely different content (landscapes vs close-ups vs action sequences) from the same series will still cluster together.
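The bands in the table map naturally onto a small helper for labeling pairwise comparisons (the band names are this card's interpretation; the thresholds are not calibrated cut-offs):

```python
def interpret_similarity(sim: float) -> str:
    """Map a cosine similarity to the style-distance bands described above."""
    if sim > 0.8:
        return 'near-identical style'
    if sim >= 0.5:
        return 'same style family'
    if sim >= 0.0:
        return 'different style families'
    return 'opposite ends of the style spectrum'

print(interpret_similarity(0.91))   # near-identical style
print(interpret_similarity(0.63))   # same style family
print(interpret_similarity(-0.12))  # opposite ends of the style spectrum
```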
Classification Usage
```python
import torch
import timm
from PIL import Image
from torchvision import transforms
from safetensors.torch import load_file

# Load model
model = timm.create_model('efficientnet_b2', pretrained=False, num_classes=4)
state_dict = load_file('model.safetensors')
model.load_state_dict(state_dict)
model.eval()

# Preprocessing: resize to 512×512
transform = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

image = Image.open('anime_frame.jpg').convert('RGB')
x = transform(image).unsqueeze(0)

with torch.no_grad():
    logits = model(x)
    probs = torch.softmax(logits, dim=1)

pred_idx = logits.argmax(-1).item()
confidence = probs[0, pred_idx].item()

classes = ['flat', 'modern', 'painterly', 'retro']
print(f'{classes[pred_idx]}: {confidence:.1%}')
```
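For larger frame sets the same pipeline runs batched: stack transformed frames into one tensor and take a single forward pass. A sketch, with a stand-in linear model so it runs without the checkpoint (`classify_batch` and `dummy` are illustrative, not part of the release):

```python
import torch
import torch.nn as nn

classes = ['flat', 'modern', 'painterly', 'retro']

def classify_batch(model, batch, classes=classes):
    """Run one batched forward pass and return a (label, confidence) per frame.

    batch: float tensor of shape [N, 3, 512, 512], already transformed.
    """
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(batch), dim=1)
    conf, idx = probs.max(dim=1)
    return [(classes[i], float(c)) for i, c in zip(idx.tolist(), conf)]

# Stand-in model; in practice load the EfficientNet-B2 checkpoint as above.
dummy = nn.Sequential(nn.Flatten(), nn.Linear(3 * 512 * 512, 4))
batch = torch.randn(2, 3, 512, 512)
print(classify_batch(dummy, batch))
```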
Training Details
Architecture and Training Pipeline
- Base training: EfficientNet-B2 pretrained on ImageNet, fine-tuned at 440 px on the style dataset with focal loss and class-balanced sampling
- Progressive resolution: Fine-tuned from 440→512 px with frozen batch norm, discriminative learning rates (backbone at 1/10th of the head), and a 1-epoch warmup
- Weight merging: Linear interpolation of the base (440 px) and progressive (512 px) checkpoints at optimal α=0.54, lifting accuracy beyond either parent
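The weight merge in the last step is plain linear interpolation of the two checkpoints' parameter tensors; a minimal sketch (the convention of weighting α toward the base checkpoint is an assumption, and matching keys/float dtypes are assumed):

```python
import torch

def merge_state_dicts(sd_a, sd_b, alpha=0.54):
    """Linearly interpolate two state dicts: alpha * A + (1 - alpha) * B.

    Assumes matching keys, shapes, and float dtypes (same architecture,
    two different checkpoints).
    """
    return {k: alpha * sd_a[k] + (1 - alpha) * sd_b[k] for k in sd_a}

# Toy demo with scalar "weights" to show the interpolation.
a = {'w': torch.tensor([1.0, 0.0])}
b = {'w': torch.tensor([0.0, 1.0])}
merged = merge_state_dicts(a, b)
print(merged['w'])  # tensor([0.5400, 0.4600])
```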
Training Data
| Class | Images | Description |
|---|---|---|
| flat | 2,138 | Real anime frames + synthetic consensus images |
| modern | 3,954 | Extracted from 50+ modern anime series |
| painterly | 5,130 | Real frames + Safebooru/DeviantArt + validated synthetic |
| retro | 8,389 | Extracted from 30+ pre-digital-era series |
| Total | 19,611 | |
Split: 17,649 train / 1,962 validation (90/10) + 439 separate OOD test images.
Key Training Decisions
Focal loss handles class imbalance (retro has 4× more samples than flat). The best single fine-tuned checkpoint reached 96.36% OOD accuracy; linear weight merging of two complementary checkpoints pushed this to 98.41%. Batch norm freezing during progressive resolution training was critical: without it, running-stat drift degraded OOD performance.
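Focal loss down-weights easy examples by a factor (1 − p_t)^γ on top of cross-entropy, so abundant, easily classified retro frames contribute less gradient. A minimal PyTorch sketch (γ=2.0 is the common default; the value used for this model is not stated):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Cross-entropy scaled by (1 - p_t)^gamma to focus on hard examples.

    gamma=2.0 is the common default; the value used here is an assumption.
    """
    ce = F.cross_entropy(logits, targets, reduction='none')  # -log(p_t)
    pt = torch.exp(-ce)                                      # p_t
    return ((1 - pt) ** gamma * ce).mean()

# With gamma=0 this reduces exactly to standard cross-entropy.
logits = torch.tensor([[2.0, 0.5, 0.1, -1.0]])
target = torch.tensor([0])
print(focal_loss(logits, target, gamma=0.0) == F.cross_entropy(logits, target))  # tensor(True)
```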
What This Model is NOT
- Not a character design classifier: won't detect moe, chibi, bishounen, etc.
- Not a content classifier: it sees how it's drawn, not what
- Not an era detector: a modern show using cel technique → retro (correctly)
- Not an AI detector: AI art that nails a style gets classified by that style
- Not a quality scorer: good and bad modern frames both classify as modern
Limitations
- Trained primarily on Japanese anime (TV series, films). May not generalize to Western animation, donghua, or manhwa.
- "Painterly" class is hardest β the boundary with modern is subjective when digital tools simulate traditional media.
- AI-generated anime art (e.g., ComfyUI, Stable Diffusion) often confuses the model between painterly and modern.
- Flat style class has the fewest real-world examples β supplemented with curated synthetic images.
Image Disclaimer
Example and evaluation images shown in this model card are either generated via diffusion models or sourced from web search results for illustrative purposes only. Web-sourced images remain the property of their respective copyright holders and are used here solely as examples to demonstrate model behavior.
Citation
```bibtex
@misc{anime-style-classifier-v5,
  title={Anime Style Classifier V5},
  year={2026},
  publisher={HuggingFace},
  note={EfficientNet-B2 for anime production style classification and embedding}
}
```