---
library_name: timm
pipeline_tag: image-classification
base_model:
- timm/tf_efficientnetv2_s.in21k_ft_in1k
tags:
- anime-classification
- real-photos
- rendered-graphics
- pytorch
- efficientnetv2
- vision
license: openrail
model_type: efficientnetv2_s
inference: true
---

# Anime/Real/Rendered Image Classifier (TF-EfficientNetV2-S)

**Higher-capacity classifier with improved generalization for anime, photo, and 3D detection.**

## Model Details

- **Architecture:** TF-EfficientNetV2-S (timm)
- **Input Size:** 224×224 RGB
- **Classes:** anime, real, rendered
- **Parameters:** 21.5M (4× larger than B0)
- **Validation Accuracy:** 97.55% (+0.11% vs B0)
- **Training Speed:** ~3 min/epoch (GPU)
- **Inference Speed:** ~60ms per image (RTX 3060)

## Performance

| Class | Precision | Recall | F1-Score |
|-------|-----------|--------|----------|
| anime | 1.00 | 0.97 | 0.98 |
| real | 0.98 | 0.99 | 0.98 |
| rendered | 0.93 | 0.90 | 0.91 |
| **macro avg** | **0.97** | **0.95** | **0.96** |

## Comparison to EfficientNet-B0

| Metric | B0 | V2-S | Winner |
|--------|-----|------|--------|
| Final Accuracy | 97.44% | **97.55%** | V2-S +0.11% |
| Best Accuracy | 97.99% | 97.99% | Tied |
| Params | 5.3M | 21.5M | B0 (lighter) |
| Training Speed | 1 min/epoch | 3 min/epoch | B0 (faster) |
| Convergence | Epoch 4 | Epoch 13 | B0 (faster) |

**Verdict:** V2-S learns training data better with marginally improved generalization. Use B0 for speed, V2-S for accuracy.
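
The summary figures in the two tables above can be reproduced with a few lines of arithmetic. The sketch below assumes the macro average is the unweighted mean of the per-class scores as printed (already rounded to two decimals), so it is a sanity check rather than the original evaluation script. The variable and function names are illustrative.

```python
# Per-class scores copied from the Performance table (anime, real, rendered).
per_class = {
    'precision': [1.00, 0.98, 0.93],
    'recall': [0.97, 0.99, 0.90],
    'f1': [0.98, 0.98, 0.91],
}


def macro_avg(scores):
    """Unweighted mean over classes, rounded to two decimals like the table."""
    return round(sum(scores) / len(scores), 2)


for metric, scores in per_class.items():
    print(f"macro {metric}: {macro_avg(scores):.2f}")

# Headline numbers from the comparison table.
param_ratio = 21.5 / 5.3   # V2-S parameters relative to B0
acc_gain = 97.55 - 97.44   # final-accuracy difference in percentage points
print(f"param ratio: {param_ratio:.1f}x, accuracy gain: {acc_gain:.2f} points")
```

The accuracy difference is expressed in percentage points, which is how the "+0.11%" in the table is read here.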

## Usage

```python
from PIL import Image
import torch
from torchvision import transforms
import timm
from safetensors.torch import load_file

# Load model
model = timm.create_model('tf_efficientnetv2_s', num_classes=3, pretrained=False)
state_dict = load_file('model.safetensors')
model.load_state_dict(state_dict)
model.eval()

# Prepare image
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

image = Image.open('image.jpg').convert('RGB')
x = transform(image).unsqueeze(0)  # add batch dimension

# Predict
with torch.no_grad():
    logits = model(x)
    probs = torch.softmax(logits, dim=1)
    pred_class = probs.argmax(dim=1).item()

labels = ['anime', 'real', 'rendered']
print(f"{labels[pred_class]}: {probs[0, pred_class]:.2%}")
```

## Dataset

- **Real:** 5,000 COCO 2017 validation images
- **Anime:** 2,357 curated animation frames
- **Rendered:** 1,610 AAA games + 61 Pixar stills
- **Total:** 8,967 images (8,070 train / 897 perceptually-hashed val)

## Training Details

- **Augmentation:** None (raw resize to 224×224)
- **Optimizer:** AdamW (lr=0.001)
- **Loss:** CrossEntropyLoss with class weighting
- **Epochs:** 20
- **Batch Size:** 40 (GPU memory constrained)
- **Hardware:** NVIDIA RTX 3060 (12GB)

## Known Behavior

- **Better Anime Detection:** Perfect precision (1.00) but 97% recall
- **Stronger Real Recognition:** 99% recall on real images
- **Rendered Uncertainty:** 90% recall suggests photorealistic games are still challenging
- **Slower Inference:** ~3× slower than B0 due to model size

## Recommendations

- **Production:** Ensemble both models for maximum confidence
- **Real-time:** Use B0 for speed-critical applications
- **Accuracy-critical:** Use V2-S as primary model
- **Confidence Thresholding:** Only trust predictions with >80% confidence

## License

OpenRAIL - Free for research and educational purposes
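
## Example: Confidence Thresholding and Ensembling

The thresholding and ensembling recommendations can be sketched in plain Python. This is a minimal illustration, not the author's pipeline: it assumes the ensemble is a simple average of the two models' softmax probabilities (the card does not say how to combine them), and the helper names `average_probs` and `decide` are hypothetical. The probability values are made up for demonstration.

```python
LABELS = ['anime', 'real', 'rendered']  # same class order as the Usage snippet


def average_probs(per_model_probs):
    """Average class probabilities across models (assumed soft-voting scheme)."""
    n_models = len(per_model_probs)
    n_classes = len(per_model_probs[0])
    return [sum(p[c] for p in per_model_probs) / n_models for c in range(n_classes)]


def decide(probs, threshold=0.80):
    """Return (label, confidence); label is None unless confidence exceeds the threshold."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    confidence = probs[best]
    label = LABELS[best] if confidence > threshold else None
    return label, confidence


# Illustrative probabilities only; in practice use probs[0].tolist() from each model.
v2s_probs = [0.05, 0.10, 0.85]
b0_probs = [0.10, 0.30, 0.60]

print(decide(v2s_probs))                  # single-model decision
avg = average_probs([v2s_probs, b0_probs])
print(decide(avg))                        # ensemble decision, may abstain
```

Returning `None` for low-confidence inputs lets the caller route those images to manual review or a fallback rule instead of accepting an uncertain label.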