---
library_name: timm
pipeline_tag: image-classification
base_model:
  - timm/tf_efficientnetv2_s.in21k_ft_in1k
tags:
  - anime-classification
  - real-photos
  - rendered-graphics
  - pytorch
  - efficientnetv2
  - vision
license: openrail
model_type: efficientnetv2_s
inference: true
---

# Anime/Real/Rendered Image Classifier (TF-EfficientNetV2-S)

**Higher-capacity classifier with improved generalization for anime, photo, and 3D detection.**

## Model Details

- **Architecture:** TF-EfficientNetV2-S (timm)
- **Input Size:** 224×224 RGB
- **Classes:** anime, real, rendered
- **Parameters:** 21.5M (4× larger than B0)
- **Validation Accuracy:** 97.55% (+0.11 percentage points vs B0)
- **Training Speed:** ~3 min/epoch (GPU)
- **Inference Speed:** ~60ms per image (RTX 3060)

## Performance

| Class | Precision | Recall | F1-Score |
|-------|-----------|--------|----------|
| anime | 1.00 | 0.97 | 0.98 |
| real | 0.98 | 0.99 | 0.98 |
| rendered | 0.93 | 0.90 | 0.91 |
| **macro avg** | **0.97** | **0.95** | **0.96** |
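
The macro average row is the unweighted mean of the three per-class rows, so the small rendered class counts as much as the large real class. A minimal sketch that recomputes it from the table values (the helper name `macro_average` is illustrative, not part of any library):

```python
# Per-class metrics copied from the table above: (precision, recall, f1).
per_class = {
    "anime": (1.00, 0.97, 0.98),
    "real": (0.98, 0.99, 0.98),
    "rendered": (0.93, 0.90, 0.91),
}


def macro_average(metrics):
    """Return the unweighted mean of each metric column across classes."""
    columns = list(zip(*metrics.values()))
    return tuple(sum(col) / len(col) for col in columns)


macro_precision, macro_recall, macro_f1 = macro_average(per_class)
print(f"macro avg: P={macro_precision:.2f} R={macro_recall:.2f} F1={macro_f1:.2f}")
```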

## Comparison to EfficientNet-B0

| Metric | B0 | V2-S | Winner |
|--------|-----|------|--------|
| Final Accuracy | 97.44% | **97.55%** | V2-S (+0.11 pp) |
| Best Accuracy | 97.99% | 97.99% | Tied |
| Params | 5.3M | 21.5M | B0 (lighter) |
| Speed | 1 min/epoch | 3 min/epoch | B0 (faster) |
| Convergence | Epoch 4 | Epoch 13 | B0 (faster) |

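**Verdict:** V2-S fits the training data better, but the generalization gain is marginal: 0.11 percentage points of final accuracy, which is one image on the 897-image validation set, and the best-epoch accuracy is tied. Use B0 when speed matters and V2-S when that small accuracy gain justifies roughly 3× the training time.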

## Usage

```python
from PIL import Image
import torch
from torchvision import transforms
import timm
from safetensors.torch import load_file

# Load model
model = timm.create_model('tf_efficientnetv2_s', num_classes=3, pretrained=False)
state_dict = load_file('model.safetensors')
model.load_state_dict(state_dict)
model.eval()

# Prepare image
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

image = Image.open('image.jpg').convert('RGB')
x = transform(image).unsqueeze(0)

# Predict
with torch.no_grad():
    logits = model(x)
    probs = torch.softmax(logits, dim=1)
    pred_class = probs.argmax(dim=1).item()

labels = ['anime', 'real', 'rendered']
print(f"{labels[pred_class]}: {probs[0, pred_class]:.2%}")
```

## Dataset

- **Real:** 5,000 COCO 2017 validation images
- **Anime:** 2,357 curated animation frames
- **Rendered:** 1,610 images: AAA game frames and 61 Pixar stills
- **Total:** 8,967 images (8,070 train / 897 perceptually-hashed val)

## Training Details

- **Augmentation:** None (raw resize to 224×224)
- **Optimizer:** AdamW (lr=0.001)
- **Loss:** CrossEntropyLoss with class weighting
- **Epochs:** 20
- **Batch Size:** 40 (GPU memory constrained)
- **Hardware:** NVIDIA RTX 3060 (12GB)
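
The exact class-weighting scheme is not stated here. One common choice is "balanced" inverse-frequency weighting, where each class weight is `N / (K * n_c)`. The sketch below assumes that scheme, assumes per-class counts of 2,357 / 5,000 / 1,610 (which sum to the stated total of 8,967), and uses the label order from the Usage section; the names `class_counts` and `balanced_weights` are illustrative.

```python
# Assumed per-class image counts, in the label order used for the logits.
# These come from the Dataset section; the weighting formula is an assumption.
class_counts = {"anime": 2357, "real": 5000, "rendered": 1610}


def balanced_weights(counts):
    """Inverse-frequency weights: total / (num_classes * class_count)."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {name: total / (num_classes * n) for name, n in counts.items()}


weights = balanced_weights(class_counts)
for name, weight in weights.items():
    print(f"{name}: {weight:.3f}")
```

With PyTorch, weights like these could be passed as `torch.nn.CrossEntropyLoss(weight=torch.tensor(list(weights.values())))`, so that errors on the smallest class (rendered) cost more than errors on the largest (real).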

## Known Behavior

- **Better Anime Detection:** Perfect precision (1.00) but 97% recall
- **Stronger Real Recognition:** 99% recall on real images
- **Rendered Uncertainty:** 90% recall suggests photorealistic games remain challenging
- **Slower Inference:** ~3× slower than B0 due to model size

## Recommendations

- **Production:** Ensemble both models for maximum confidence
- **Real-time:** Use B0 for speed-critical applications
- **Accuracy-critical:** Use V2-S as primary model
- **Confidence Thresholding:** Only trust predictions with confidence above 80%
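
The ensembling and thresholding recommendations can be combined in a few lines. This is a sketch under assumptions: it averages the two models' softmax probabilities (one of several possible ensembling methods, not necessarily the one intended here), and `ensemble_predict` is a hypothetical helper that takes plain lists of probabilities in the order anime, real, rendered.

```python
LABELS = ["anime", "real", "rendered"]


def ensemble_predict(probs_b0, probs_v2s, threshold=0.80):
    """Average two probability vectors and apply a confidence threshold.

    Returns (label, confidence); label is None when the averaged
    confidence does not exceed the threshold.
    """
    if len(probs_b0) != len(LABELS) or len(probs_v2s) != len(LABELS):
        raise ValueError("each probability vector needs one entry per label")
    averaged = [(a + b) / 2 for a, b in zip(probs_b0, probs_v2s)]
    best = max(range(len(averaged)), key=averaged.__getitem__)
    confidence = averaged[best]
    label = LABELS[best] if confidence > threshold else None
    return label, confidence


# Both models agree strongly: the prediction is accepted.
print(ensemble_predict([0.90, 0.05, 0.05], [0.95, 0.03, 0.02]))
# The models disagree: the averaged confidence is too low, so no label is returned.
print(ensemble_predict([0.50, 0.10, 0.40], [0.30, 0.10, 0.60]))
```

Inputs that come back with `None` can be routed to manual review or a fallback rule instead of being trusted.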

## License

OpenRAIL - Free for research and educational purposes