---
license: openrail
language: en
library_name: timm
tags:
- image-classification
- anime
- real
- rendered
- 3d-graphics
datasets:
- coco
- custom-anime
- steam-screenshots
---

# TF-EfficientNetV2-S - Anime/Real/Rendered Classifier

A higher-capacity classifier with improved generalization for distinguishing photographs from anime and 3D-rendered images.

## Model Summary

- **Model Name:** `tf_efficientnetv2_s`
- **Framework:** PyTorch + TIMM
- **Input:** 224×224 RGB images
- **Output:** 3 classes (anime, real, rendered)
- **Parameters:** 21.5M (about 4× the 5.3M of B0)
- **Size:** 81.4 MB

## Intended Use

Same use case as the EfficientNet-B0 classifier, with slightly higher accuracy and better generalization. Each image is assigned to one of three classes:
- **anime**: Drawn 2D or cel-shaded animation
- **real**: Photographs and real-world footage
- **rendered**: 3D graphics (games, CGI, Pixar, etc.)

## Performance

**Validation Accuracy:** 97.55% (+0.11 percentage points vs. B0)

| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| anime | 1.00 | 0.97 | 0.98 | 236 |
| real | 0.98 | 0.99 | 0.98 | 500 |
| rendered | 0.93 | 0.90 | 0.91 | 161 |
| **weighted avg** | **0.97** | **0.95** | **0.96** | **897** |
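
The F1 column is the harmonic mean of precision and recall. As a quick sanity check, the sketch below recomputes the per-class F1 scores from the rounded precision and recall values in the table. The function name `f1_score` and the dictionary layout are illustrative and not part of the original evaluation code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) per class, copied from the per-class rows of the table
per_class = {
    "anime": (1.00, 0.97),
    "real": (0.98, 0.99),
    "rendered": (0.93, 0.90),
}

f1_by_class = {name: f1_score(p, r) for name, (p, r) in per_class.items()}
for name, f1 in f1_by_class.items():
    print(f"{name}: F1 = {f1:.2f}")
```

Rounded to two decimals, this reproduces the per-class F1 values in the table (0.98, 0.98, 0.91).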

## Training Data

Identical to EfficientNet-B0:
- **Real images:** 5,000 images from the COCO 2017 validation set
- **Anime images:** 2,357 curated frames
- **Rendered images:** 1,549 screenshots from AAA games + 61 Pixar stills
- **Total:** 8,967 images (8,070 train / 897 in a diverse validation set)

## Training Details

- **Framework:** PyTorch
- **Augmentation:** None beyond resizing to 224×224
- **Loss Function:** CrossEntropyLoss with inverse frequency class weighting
- **Optimizer:** AdamW (lr=0.001)
- **Batch Size:** 40 (limited by GPU memory)
- **Epochs:** 20
- **Hardware:** NVIDIA RTX 3060 (12 GB VRAM)
- **Training Time:** ~60 minutes
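
Inverse frequency weighting gives rarer classes a larger loss weight, so that "rendered" is not drowned out by "real". The card does not state the exact formula, so the sketch below assumes one common convention, `w_c = N / (K * n_c)`. The per-class training counts are also an assumption: they are derived by subtracting the validation support (Performance table) from the totals listed under Training Data.

```python
# Per-class image totals (Training Data) minus validation support (Performance table).
# Assumption: the 897 validation images were held out of these totals.
totals = {"anime": 2357, "real": 5000, "rendered": 1549 + 61}
val_support = {"anime": 236, "real": 500, "rendered": 161}
train_counts = {name: totals[name] - val_support[name] for name in totals}

def inverse_frequency_weights(counts: dict[str, int]) -> dict[str, float]:
    """w_c = N / (K * n_c): a class with the average count gets weight 1.0."""
    n_total = sum(counts.values())
    n_classes = len(counts)
    return {name: n_total / (n_classes * n) for name, n in counts.items()}

class_weights = inverse_frequency_weights(train_counts)
for name, weight in class_weights.items():
    print(f"{name}: n={train_counts[name]}, weight={weight:.3f}")
```

In PyTorch, such weights would typically be passed in label order as a tensor through the `weight` argument of `torch.nn.CrossEntropyLoss`.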

## Comparison to EfficientNet-B0

| Metric | B0 | V2-S | Delta |
|--------|-----|------|-------|
| Final Accuracy | 97.44% | 97.55% | +0.11 pp |
| Best Accuracy | 97.99% | 97.99% | Tied |
| Params | 5.3M | 21.5M | ~4× more |
| Inference Time (single image) | ~20 ms | ~60 ms | 3× slower |
| Convergence | Epoch 4 | Epoch 13 | 9 epochs later |
| Train Loss | 0.1022 | 0.0003 | Lower |
| Val Loss | 0.5519 | 0.1134 | Lower |

**Verdict:** V2-S fits the training distribution more thoroughly, but the gain in final validation accuracy is marginal: +0.11 percentage points, or about one additional correct image out of 897. Use B0 for speed and V2-S for maximum accuracy.

## Limitations

1. Slower inference (~60 ms vs. ~20 ms for B0)
2. Larger model (81.4 MB vs. 16.2 MB for B0)
3. Same fundamental challenges as B0: photorealistic games, cel-shading, artistic renders
4. Performance degrades on images smaller than 224×224

## Recommendations

- **Real-time/Mobile:** Use EfficientNet-B0 instead
- **Accuracy-Critical:** Prefer this model
- **Ensemble:** Combine both models for the highest confidence
- **Confidence Threshold:** Treat predictions as reliable only at ≥80% confidence
- **Edge Cases:** Manually inspect images on which the two models disagree
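
The ensemble, threshold, and edge-case recommendations can be combined into a single decision rule. The following is a minimal sketch, not the author's pipeline. The function name `decide`, the `needs_review` flag, and the plain-list inputs are hypothetical. Applying the 80% threshold to the averaged probability is an assumption. The label order is taken from the usage example in this card.

```python
LABELS = ["anime", "real", "rendered"]  # label order from this card's usage example

def decide(probs_b0, probs_v2s, threshold=0.80):
    """Average two softmax vectors and flag cases that deserve manual review.

    Returns (label, confidence, needs_review). A prediction is flagged when
    the two models pick different classes or the averaged confidence is
    below the threshold.
    """
    if len(probs_b0) != len(LABELS) or len(probs_v2s) != len(LABELS):
        raise ValueError("expected one probability per class")
    top_b0 = max(range(len(LABELS)), key=lambda i: probs_b0[i])
    top_v2s = max(range(len(LABELS)), key=lambda i: probs_v2s[i])
    averaged = [(a + b) / 2 for a, b in zip(probs_b0, probs_v2s)]
    top = max(range(len(LABELS)), key=lambda i: averaged[i])
    confidence = averaged[top]
    needs_review = (top_b0 != top_v2s) or (confidence < threshold)
    return LABELS[top], confidence, needs_review

# Both models agree with high confidence: accept the prediction
agree = decide([0.02, 0.95, 0.03], [0.01, 0.97, 0.02])
# The models disagree (real vs. rendered): route to manual inspection
disagree = decide([0.05, 0.55, 0.40], [0.05, 0.30, 0.65])
print(agree)
print(disagree)
```

The inputs are ordinary Python lists, so softmax tensors from either model can be converted with `.tolist()` before calling the function.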

## How to Use

```python
from PIL import Image
import torch
from torchvision import transforms
import timm
from safetensors.torch import load_file

# Load the architecture with a 3-class head, then the fine-tuned weights
model = timm.create_model('tf_efficientnetv2_s', num_classes=3, pretrained=False)
state_dict = load_file('model.safetensors')
model.load_state_dict(state_dict)
model.eval()

# Prepare image: resize to 224×224 and normalize with ImageNet mean/std
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
img = Image.open('image.jpg').convert('RGB')
x = transform(img).unsqueeze(0)  # add a batch dimension: 1×3×224×224

# Infer
with torch.no_grad():
    logits = model(x)
    probs = torch.softmax(logits, dim=1)
    pred = probs.argmax(dim=1).item()

labels = ['anime', 'real', 'rendered']
print(f"{labels[pred]}: {probs[0, pred]:.1%}")
```
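
## Ensemble Strategy

For maximum accuracy, average the softmax outputs of both models. In the example, `load_model` is a small helper and the weight paths are placeholders:

```python
import timm
import torch
from safetensors.torch import load_file

def load_model(arch, weights_path):
    # Build a 3-class timm model and load fine-tuned weights
    model = timm.create_model(arch, num_classes=3, pretrained=False)
    model.load_state_dict(load_file(weights_path))
    return model.eval()

# Load both (placeholder paths; point them at your downloaded checkpoints)
b0 = load_model('efficientnet_b0', 'path/to/b0/model.safetensors')
v2s = load_model('tf_efficientnetv2_s', 'path/to/v2s/model.safetensors')

# Infer (x is the preprocessed 1×3×224×224 tensor from "How to Use")
with torch.no_grad():
    probs_b0 = torch.softmax(b0(x), dim=1)
    probs_v2s = torch.softmax(v2s(x), dim=1)

# Average predictions
ensemble_probs = (probs_b0 + probs_v2s) / 2
pred = ensemble_probs.argmax(dim=1).item()
```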

## Benchmarks

**Inference Speed (RTX 3060)**
- Single image: ~60 ms
- Batch of 16: ~200 ms
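
The batch figure implies a much lower per-image cost than the single-image figure, since fixed overhead is spread across the batch. The sketch below converts the two reported latencies into per-image latency and throughput. `throughput` is an illustrative helper, and the inputs are the approximate values reported above, not new measurements.

```python
def throughput(batch_size: int, latency_ms: float) -> tuple[float, float]:
    """Return (milliseconds per image, images per second) for one batch."""
    per_image_ms = latency_ms / batch_size
    images_per_second = 1000.0 * batch_size / latency_ms
    return per_image_ms, images_per_second

# Approximate latencies reported for the RTX 3060
single = throughput(batch_size=1, latency_ms=60.0)
batched = throughput(batch_size=16, latency_ms=200.0)

print(f"batch 1:  {single[0]:.1f} ms/image, {single[1]:.1f} images/s")
print(f"batch 16: {batched[0]:.1f} ms/image, {batched[1]:.1f} images/s")
```

With these figures, a batch of 16 works out to 12.5 ms per image and 80 images per second, roughly 4.8× the single-image throughput.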

## Ethical Considerations

The same considerations apply as for EfficientNet-B0. This model:
- Is NOT designed for deepfake detection
- May carry cultural bias in how anime and rendered content is represented
- Should be used with human review for content moderation

## Contact

For questions: [GitHub repo]

## License

OpenRAIL: free for research and commercial use with proper attribution.