library_name: timm
pipeline_tag: image-classification
base_model:
  - timm/efficientnet_b0
tags:
  - anime-classification
  - real-photos
  - rendered-graphics
  - pytorch
  - efficientnet
  - vision
license: openrail
model_type: efficientnet_b0
inference: true

Anime/Real/Rendered Image Classifier (EfficientNet-B0)

Fast, lightweight classifier that sorts images into three classes: anime, real photographs, and 3D-rendered graphics.

Model Details

Architecture: EfficientNet-B0 (timm)
Input Size: 224×224 RGB
Classes: anime, real, rendered
Parameters: 5.3M (standard 1000-class EfficientNet-B0; roughly 4.0M with the 3-class head)
Validation Accuracy: 97.44%
Training Speed: ~1 min/epoch (GPU)
Inference Speed: ~20ms per image (RTX 3060)
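
The latency figure above could be reproduced on other hardware with a small timing helper. The sketch below is an illustration only: `time_per_call_ms` is a hypothetical helper, not part of this repository, and the workload is a stand-in so the snippet runs without a GPU or model weights.

```python
import time
from statistics import median


def time_per_call_ms(fn, warmup=5, runs=50):
    """Return the median wall-clock time of fn() in milliseconds."""
    # Warm-up calls are discarded so one-time setup costs do not skew results.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return median(samples)


# Stand-in workload (assumption). For the real model, pass a callable that
# runs model(x) under torch.no_grad(); on a GPU it should also call
# torch.cuda.synchronize() before returning so queued kernels are included.
latency = time_per_call_ms(lambda: sum(range(10_000)))
print(f"{latency:.3f} ms per call")
```

The median is used rather than the mean so that an occasional slow call has less influence on the reported number.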

Performance

Class	Precision	Recall	F1-Score
anime	0.98	0.99	0.99
real	0.98	0.98	0.98
rendered	0.96	0.93	0.94
macro avg	0.97	0.97	0.97
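
The macro avg row is the unweighted mean of the three per-class scores, so each class counts equally regardless of how many images it has. A minimal check using the rounded values from the table (the helper name `macro_average` is illustrative):

```python
# Per-class scores copied from the table above: (precision, recall, f1).
per_class = {
    "anime": (0.98, 0.99, 0.99),
    "real": (0.98, 0.98, 0.98),
    "rendered": (0.96, 0.93, 0.94),
}


def macro_average(scores):
    """Unweighted mean of each metric across classes, rounded to 2 places."""
    n = len(scores)
    columns = zip(*scores.values())  # one tuple per metric
    return tuple(round(sum(col) / n, 2) for col in columns)


macro_precision, macro_recall, macro_f1 = macro_average(per_class)
print(macro_precision, macro_recall, macro_f1)  # 0.97 0.97 0.97
```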

Usage

from PIL import Image
import torch
from torchvision import transforms
import timm
from safetensors.torch import load_file

# Load model
model = timm.create_model('efficientnet_b0', num_classes=3, pretrained=False)
state_dict = load_file('model.safetensors')
model.load_state_dict(state_dict)
model.eval()

# Prepare image
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

image = Image.open('image.jpg').convert('RGB')
x = transform(image).unsqueeze(0)

# Predict
with torch.no_grad():
    logits = model(x)
    probs = torch.softmax(logits, dim=1)
    pred_class = probs.argmax(dim=1).item()

labels = ['anime', 'real', 'rendered']
print(f"{labels[pred_class]}: {probs[0, pred_class]:.2%}")

Dataset

Real: 5,000 COCO 2017 validation images (diverse real-world scenarios)
Anime: 2,357 curated anime/animation frames
Rendered: 1,610 AAA game screenshots + 61 Pixar movie stills
Total: 8,967 images (8,070 train / 897 val)
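
The 8,070 / 897 split corresponds to holding out roughly 10% of the images for validation. How the split was actually drawn is not stated here; the sketch below assumes a plain seeded random shuffle over image indices, and `split_indices` is a hypothetical helper.

```python
import random


def split_indices(n_total, n_val, seed=0):
    """Shuffle indices with a fixed seed and hold out n_val for validation."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    indices = list(range(n_total))
    rng.shuffle(indices)
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = split_indices(n_total=8967, n_val=897)
print(len(train_idx), len(val_idx))  # 8070 897
```

A stratified split (sampling each class separately) would keep class proportions identical in both subsets, which may matter given the smaller rendered class.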

Training Details

Augmentation: None (raw resize to 224×224)
Optimizer: AdamW (lr=0.001)
Loss: CrossEntropyLoss with class weighting
Epochs: 20
Batch Size: 80
Hardware: NVIDIA RTX 3060 (12GB)
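
The exact class weights used with the loss are not listed. One common choice is inverse-frequency weighting, sketched below as an assumption rather than the recipe used for this model. The counts are the per-class dataset sizes, taking the rendered class as 1,610 images so that the three classes add up to the stated total of 8,967.

```python
def inverse_frequency_weights(counts):
    """Weight each class by total / (num_classes * class_count)."""
    total = sum(counts)
    num_classes = len(counts)
    return [total / (num_classes * c) for c in counts]


# Order matches the label list: anime, real, rendered.
# Assumption: counts come from the Dataset section, rendered taken as 1,610.
counts = [2357, 5000, 1610]
weights = inverse_frequency_weights(counts)

for name, w in zip(["anime", "real", "rendered"], weights):
    print(f"{name}: {w:.3f}")
```

With this scheme the smallest class (rendered) receives the largest weight and the largest class (real) the smallest. The resulting list could be wrapped in `torch.tensor(weights)` and passed as the `weight` argument of `torch.nn.CrossEntropyLoss`.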

Known Limitations

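Real vs Rendered: Photorealistic game screenshots are sometimes misclassified as real; rendered has the lowest recall of the three classes (0.93)
Stylized Games: Cel-shaded games (e.g., Fate/Extella) may score as anime
Pixar: Stylized renders may receive low or split confidence; the dataset contains only 61 Pixar stills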

Recommendations

Use ensemble with tf_efficientnetv2_s for critical applications
Apply a confidence threshold: only trust predictions whose softmax probability exceeds 85%
For edge cases, use the full confusion matrix to understand failure modes
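
The first two recommendations can be combined in a few lines. The sketch below assumes the ensemble averages softmax probabilities across models, which is one common approach and not necessarily the one intended here. The logits are made-up values standing in for the outputs of this model and a second model such as tf_efficientnetv2_s, and `ensemble_predict` is a hypothetical helper.

```python
import math

LABELS = ["anime", "real", "rendered"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(logits_per_model, threshold=0.85):
    """Average per-model probabilities; return (label or None, confidence)."""
    probs = [softmax(logits) for logits in logits_per_model]
    n_models = len(probs)
    avg = [sum(p[i] for p in probs) / n_models for i in range(len(LABELS))]
    best = max(range(len(avg)), key=avg.__getitem__)
    # Below the threshold the prediction is withheld instead of trusted.
    label = LABELS[best] if avg[best] > threshold else None
    return label, avg[best]


# Hypothetical logits from two models for two different images.
confident = ensemble_predict([[4.0, 0.5, 0.2], [3.5, 0.1, 0.4]])
uncertain = ensemble_predict([[1.0, 0.2, 1.1], [0.3, 0.4, 1.2]])
print(confident[0], uncertain[0])
```

Returning `None` for low-confidence inputs lets the caller route those images to manual review or a fallback rule rather than silently accepting a weak prediction.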

License

OpenRAIL - Free for research and commercial use with proper attribution