metadata
library_name: timm
pipeline_tag: image-classification
base_model:
  - timm/efficientnet_b0
tags:
  - anime-classification
  - real-photos
  - rendered-graphics
  - pytorch
  - efficientnet
  - vision
license: openrail
model_type: efficientnet_b0
inference: true

Anime/Real/Rendered Image Classifier (EfficientNet-B0)

Fast, lightweight classifier that labels an image as anime, a real photograph, or a 3D render.

Model Details

  • Architecture: EfficientNet-B0 (timm)
  • Input Size: 224×224 RGB
  • Classes: anime, real, rendered
  • Parameters: 5.3M
  • Validation Accuracy: 97.44%
  • Training Speed: ~1 min/epoch (GPU)
  • Inference Speed: ~20ms per image (RTX 3060)
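The inference-speed figure depends on hardware, batch size, and whether preprocessing is included. A minimal sketch of how one might time a single-image prediction, assuming a hypothetical zero-argument `predict` callable that wraps the model's forward pass; `measure_latency_ms` and its parameters are illustrative names, not part of this repository:

```python
import statistics
import time


def measure_latency_ms(predict, warmup=5, runs=50):
    """Return the median wall-clock time of predict() in milliseconds.

    `predict` is a hypothetical zero-argument callable, e.g. a function
    that runs model(x) inside torch.no_grad().
    """
    # Warm-up calls let lazy initialisation and caches settle before timing.
    for _ in range(warmup):
        predict()

    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict()
        timings.append((time.perf_counter() - start) * 1000.0)

    # The median is less sensitive to occasional slow runs than the mean.
    return statistics.median(timings)
```

When timing on a GPU, `predict` should call `torch.cuda.synchronize()` after the forward pass, because CUDA kernels launch asynchronously and the timer would otherwise stop too early.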

Performance

| Class     | Precision | Recall | F1-Score |
|-----------|-----------|--------|----------|
| anime     | 0.98      | 0.99   | 0.99     |
| real      | 0.98      | 0.98   | 0.98     |
| rendered  | 0.96      | 0.93   | 0.94     |
| macro avg | 0.97      | 0.97   | 0.97     |
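
The macro average row is the unweighted mean of the per-class scores, so each class counts equally regardless of how many images it has. A small sketch that recomputes it from the rounded per-class values reported above (the `per_class` dictionary and `macro_average` helper are illustrative names):

```python
per_class = {
    "anime":    {"precision": 0.98, "recall": 0.99, "f1": 0.99},
    "real":     {"precision": 0.98, "recall": 0.98, "f1": 0.98},
    "rendered": {"precision": 0.96, "recall": 0.93, "f1": 0.94},
}


def macro_average(scores, metric):
    """Unweighted mean of one metric across all classes."""
    values = [row[metric] for row in scores.values()]
    return sum(values) / len(values)


macro = {m: round(macro_average(per_class, m), 2)
         for m in ("precision", "recall", "f1")}
print(macro)
```

Each metric rounds to 0.97, which agrees with the reported macro average.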

Usage

from PIL import Image
import torch
from torchvision import transforms
import timm
from safetensors.torch import load_file

# Load model
model = timm.create_model('efficientnet_b0', num_classes=3, pretrained=False)
state_dict = load_file('model.safetensors')
model.load_state_dict(state_dict)
model.eval()

# Prepare image
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

image = Image.open('image.jpg').convert('RGB')
x = transform(image).unsqueeze(0)

# Predict
with torch.no_grad():
    logits = model(x)
    probs = torch.softmax(logits, dim=1)
    pred_class = probs.argmax(dim=1).item()

labels = ['anime', 'real', 'rendered']
print(f"{labels[pred_class]}: {probs[0, pred_class]:.2%}")

Dataset

  • Real: 5,000 COCO 2017 validation images (diverse real-world scenarios)
  • Anime: 2,357 curated anime/animation frames
  • Rendered: 1,610 AAA game screenshots + 61 Pixar movie stills
  • Total: 8,967 images (8,070 train / 897 val)
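
The 8,070 / 897 split is roughly 90/10. The card does not say how the split was drawn, so the following is only a sketch of one common approach: a per-class (stratified) split with a fixed seed. `stratified_split`, `val_fraction`, and `seed` are hypothetical names:

```python
import random
from collections import defaultdict


def stratified_split(samples, val_fraction=0.1, seed=0):
    """Split (path, label) pairs so each class keeps about the same val share."""
    by_label = defaultdict(list)
    for path, label in samples:
        by_label[label].append((path, label))

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, val = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        # Hold out at least one image per class for validation.
        n_val = max(1, round(len(items) * val_fraction))
        val.extend(items[:n_val])
        train.extend(items[n_val:])
    return train, val
```

Splitting per class keeps the smaller rendered class represented in validation in proportion to its size, which matters when per-class recall is reported.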

Training Details

  • Augmentation: None (raw resize to 224×224)
  • Optimizer: AdamW (lr=0.001)
  • Loss: CrossEntropyLoss with class weighting
  • Epochs: 20
  • Batch Size: 80
  • Hardware: NVIDIA RTX 3060 (12GB)
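
The dataset is imbalanced (real images outnumber rendered ones roughly three to one), which is presumably why the loss is class-weighted. The card does not state how the weights were derived; a minimal sketch, assuming the common inverse-frequency ("balanced") scheme and the per-class counts listed in the Dataset section:

```python
def inverse_frequency_weights(counts):
    """Weight each class by N / (K * n_c): rarer classes get larger weights."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


# Counts as listed in the Dataset section (rendered = 1,610 + 61).
# Assumption: the weights actually used in training are not published.
counts = {"anime": 2357, "real": 5000, "rendered": 1610 + 61}
weights = inverse_frequency_weights(counts)

# Order the weights to match the label indices used in the Usage snippet.
labels = ["anime", "real", "rendered"]
weight_list = [weights[label] for label in labels]
print([round(w, 2) for w in weight_list])
```

In PyTorch, such a list would be passed as `torch.nn.CrossEntropyLoss(weight=torch.tensor(weight_list))`, with the optimizer created as `torch.optim.AdamW(model.parameters(), lr=1e-3)`.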

Known Limitations

  • Real vs Rendered: Photorealistic game screenshots are sometimes misclassified as real
  • Stylized Games: Cel-shaded games (e.g., Fate/Extella) may score as anime
  • Pixar: Stylized rendered images may show mixed confidence

Recommendations

  • Ensemble with tf_efficientnetv2_s for critical applications
  • Apply a confidence threshold: only trust predictions with confidence above 85%
  • For edge cases, use the full confusion matrix to understand failure modes
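
The ensemble and threshold recommendations can be combined: average the class probabilities from each model, then accept the top class only if it clears the threshold. A rough sketch operating on plain probability lists (for example `probs[0].tolist()` from the Usage snippet). `ensemble_predict` is a hypothetical helper, and simple averaging is an assumption, since the card does not specify how the ensemble should be formed:

```python
LABELS = ["anime", "real", "rendered"]


def ensemble_predict(prob_lists, threshold=0.85, labels=LABELS):
    """Average per-model probabilities and apply a confidence threshold.

    Returns (label, confidence); label is None when the averaged top
    probability does not exceed the threshold.
    """
    if not prob_lists:
        raise ValueError("need at least one probability vector")

    n_models = len(prob_lists)
    # Element-wise mean across models for each class.
    avg = [sum(p[i] for p in prob_lists) / n_models for i in range(len(labels))]

    best = max(range(len(labels)), key=lambda i: avg[i])
    confidence = avg[best]
    label = labels[best] if confidence > threshold else None
    return label, confidence
```

A `None` label signals that the image should go to manual review or another fallback. With a single model, pass a one-element list.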

License

OpenRAIL - Free for research and commercial use with proper attribution