Anime/Real/Rendered Image Classifier (TF-EfficientNetV2-S)

Higher-capacity classifier with improved generalization for anime, photo, and 3D detection.

Model Details

  • Architecture: TF-EfficientNetV2-S (timm)
  • Input Size: 224×224 RGB
  • Classes: anime, real, rendered
  • Parameters: 21.5M (4× larger than B0)
  • Validation Accuracy: 97.55% (+0.11% vs B0)
  • Training Speed: ~3 min/epoch (GPU)
  • Inference Speed: ~60ms per image (RTX 3060)

Performance

| Class     | Precision | Recall | F1-Score |
|-----------|-----------|--------|----------|
| anime     | 1.00      | 0.97   | 0.98     |
| real      | 0.98      | 0.99   | 0.98     |
| rendered  | 0.93      | 0.90   | 0.91     |
| macro avg | 0.97      | 0.95   | 0.96     |
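The macro-average row can be reproduced directly from the per-class rows; a quick sanity check in plain Python, using only the values from the table above:

```python
# Per-class metrics from the table above: (precision, recall, f1)
metrics = {
    "anime":    (1.00, 0.97, 0.98),
    "real":     (0.98, 0.99, 0.98),
    "rendered": (0.93, 0.90, 0.91),
}

# Macro average = unweighted mean over classes
macro_precision = sum(p for p, _, _ in metrics.values()) / len(metrics)
macro_recall    = sum(r for _, r, _ in metrics.values()) / len(metrics)
macro_f1        = sum(f for _, _, f in metrics.values()) / len(metrics)

print(round(macro_precision, 2), round(macro_recall, 2), round(macro_f1, 2))
# Matches the macro avg row: 0.97 0.95 0.96
```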

Comparison to EfficientNet-B0

| Metric         | B0          | V2-S        | Winner         |
|----------------|-------------|-------------|----------------|
| Final Accuracy | 97.44%      | 97.55%      | V2-S (+0.11%)  |
| Best Accuracy  | 97.99%      | 97.99%      | Tied           |
| Params         | 5.3M        | 21.5M       | B0 (lighter)   |
| Speed          | 1 min/epoch | 3 min/epoch | B0 (faster)    |
| Convergence    | Epoch 4     | Epoch 13    | B0 (faster)    |

Verdict: V2-S fits the training data more closely and generalizes marginally better. Use B0 when speed matters, V2-S when accuracy matters.

Usage

from PIL import Image
import torch
from torchvision import transforms
import timm
from safetensors.torch import load_file

# Load model
model = timm.create_model('tf_efficientnetv2_s', num_classes=3, pretrained=False)
state_dict = load_file('model.safetensors')
model.load_state_dict(state_dict)
model.eval()

# Prepare image
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

image = Image.open('image.jpg').convert('RGB')
x = transform(image).unsqueeze(0)

# Predict
with torch.no_grad():
    logits = model(x)
    probs = torch.softmax(logits, dim=1)
    pred_class = probs.argmax(dim=1).item()

labels = ['anime', 'real', 'rendered']
print(f"{labels[pred_class]}: {probs[0, pred_class]:.2%}")

Dataset

  • Real: 5,000 COCO 2017 validation images
  • Anime: 2,357 curated animation frames
  • Rendered: 1,610 AAA games + 61 Pixar stills
  • Total: 8,967 images (8,070 train / 897 val, split deduplicated via perceptual hashing)
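The split sizes above are consistent with the stated total, and a quick check shows the validation set is roughly a 10% holdout:

```python
# Split sizes as stated in the Dataset section
train, val = 8070, 897

total = train + val          # should equal the stated 8,967 images
val_frac = val / total       # fraction of data held out for validation

print(total, round(val_frac, 2))
# 8967, with the validation set being ~10% of the data
```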

Training Details

  • Augmentation: None (raw resize to 224×224)
  • Optimizer: AdamW (lr=0.001)
  • Loss: CrossEntropyLoss with class weighting
  • Epochs: 20
  • Batch Size: 40 (GPU memory constrained)
  • Hardware: NVIDIA RTX 3060 (12GB)
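The card does not document the exact class-weighting scheme. A common choice for imbalanced datasets like this one is inverse-frequency weighting, sketched below with the per-class counts from the Dataset section; the formula and its use are assumptions, not the authors' confirmed recipe:

```python
# Hypothetical inverse-frequency class weights (the exact scheme used
# during training is not documented in this card).
# Counts from the Dataset section: anime=2357, real=5000, rendered=1671 (1610 + 61)
counts = {"anime": 2357, "real": 5000, "rendered": 1671}

total = sum(counts.values())
n_classes = len(counts)

# weight_c = total / (n_classes * count_c): rarer classes get larger weights
weights = {c: total / (n_classes * n) for c, n in counts.items()}

# These would typically be passed in class-index order, e.g.
# CrossEntropyLoss(weight=torch.tensor([weights[c] for c in ['anime', 'real', 'rendered']]))
print({c: round(w, 3) for c, w in weights.items()})
```

With this scheme the minority "rendered" class gets the largest weight and the majority "real" class the smallest, counteracting the ~3:1 imbalance.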

Known Behavior

  • Better Anime Detection: Perfect precision (1.00), though recall sits at 97%
  • Stronger Real Recognition: 99% recall on real images
  • Rendered Uncertainty: 90% recall suggests photorealistic game frames remain challenging
  • Slower Inference: ~3× slower than B0 due to the larger model

Recommendations

  • Production: Ensemble both models for maximum confidence
  • Real-time: Use B0 for speed-critical applications
  • Accuracy-critical: Use V2-S as primary model
  • Confidence Thresholding: Only trust predictions >80% confidence
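The ensemble and confidence-threshold recommendations above can be combined; a minimal sketch in plain Python (averaging the two models' softmax outputs is one common ensembling choice, not necessarily the authors'; the 0.80 cutoff mirrors the threshold above):

```python
LABELS = ["anime", "real", "rendered"]

def ensemble_predict(probs_b0, probs_v2s, threshold=0.80):
    """Average two models' softmax outputs and apply a confidence cutoff.

    probs_b0, probs_v2s: per-class probability lists in LABELS order.
    Returns (label, confidence), or ("uncertain", confidence) when the
    averaged confidence falls below the threshold.
    """
    avg = [(a + b) / 2 for a, b in zip(probs_b0, probs_v2s)]
    conf = max(avg)
    label = LABELS[avg.index(conf)]
    return (label, conf) if conf >= threshold else ("uncertain", conf)

# Both models confident and agreeing -> prediction accepted
print(ensemble_predict([0.90, 0.05, 0.05], [0.94, 0.03, 0.03]))

# Models disagreeing -> averaged confidence drops below 0.80 -> flagged
print(ensemble_predict([0.70, 0.10, 0.20], [0.20, 0.10, 0.70]))
```

Flagged "uncertain" images can then be routed to manual review or to whichever single model is stronger for the suspected class.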

License

OpenRAIL - Free for research and educational purposes
