	---
	license: openrail
	language: en
	library_name: timm
	tags:
	- image-classification
	- anime
	- real
	- rendered
	- 3d-graphics
	datasets:
	- coco
	- custom-anime
	- steam-screenshots
	---

	# TF-EfficientNetV2-S - Anime/Real/Rendered Classifier

	Higher-capacity classifier with improved generalization for distinguishing photographs from anime and 3D rendered images.

	## Model Summary

	- Model Name: tf_efficientnetv2_s
	- Framework: PyTorch + TIMM
	- Input: 224×224 RGB images
	- Output: 3 classes (anime, real, rendered)
	- Parameters: 21.5M (4× larger than B0)
	- Size: 81.4 MB

	## Intended Use

	Same use cases as the EfficientNet-B0 classifier, with slightly higher accuracy and better generalization. Each image is assigned to one of three classes:
	- anime: Drawn 2D or cel-shaded animation
	- real: Photographs and real-world footage
	- rendered: 3D graphics (games, CGI, Pixar, etc.)

	## Performance

	Validation Accuracy: 97.55% (+0.11 percentage points vs B0)

	| Class | Precision | Recall | F1-Score | Support |
	|-------|-----------|--------|----------|---------|
	| anime | 1.00 | 0.97 | 0.98 | 236 |
	| real | 0.98 | 0.99 | 0.98 | 500 |
	| rendered | 0.93 | 0.90 | 0.91 | 161 |
	| weighted avg | 0.97 | 0.95 | 0.96 | 897 |
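
The `weighted avg` row is a support-weighted mean of the per-class scores. Below is a minimal sketch of that arithmetic, using the rounded per-class values from the table above. The `rows` dictionary and the `weighted_average` helper are illustrative names, not part of the released code. Because the inputs are already rounded to two decimals, the recomputed values may not match the reported row exactly.

```python
# Per-class (precision, recall, f1, support), copied from the table above.
rows = {
    "anime": (1.00, 0.97, 0.98, 236),
    "real": (0.98, 0.99, 0.98, 500),
    "rendered": (0.93, 0.90, 0.91, 161),
}


def weighted_average(rows):
    """Average each metric across classes, weighting by class support."""
    total = sum(support for *_, support in rows.values())
    names = ("precision", "recall", "f1")
    return {
        name: sum(r[i] * r[3] for r in rows.values()) / total
        for i, name in enumerate(names)
    }


averages = weighted_average(rows)
for name, value in averages.items():
    print(f"{name}: {value:.4f}")
```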

	## Training Data

	Identical to EfficientNet-B0:
	- Real images: 5,000 from the COCO 2017 validation set
	- Anime images: 2,357 curated frames
	- Rendered images: 1,549 AAA games + 61 Pixar stills
	- Total: 8,967 images (8,070 train / 897 diverse val)

	## Training Details

	- Framework: PyTorch
	- Augmentation: Resize only (224×224)
	- Loss Function: CrossEntropyLoss with inverse frequency weighting
	- Optimizer: AdamW (lr=0.001)
	- Batch Size: 40 (GPU memory constrained)
	- Epochs: 20
	- Hardware: NVIDIA RTX 3060 (12GB VRAM)
	- Training Time: ~60 minutes
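
The card does not state the exact formula behind "inverse frequency weighting". One common choice is `total / (num_classes * class_count)`, sketched below as an assumption. The counts are the full per-class totals from the Training Data section (per-class training-split counts are not given), and `inverse_frequency_weights` is a hypothetical helper name.

```python
def inverse_frequency_weights(class_counts):
    """Weight each class by total / (num_classes * count).

    This is an assumed formula; the card only says "inverse frequency".
    """
    total = sum(class_counts.values())
    num_classes = len(class_counts)
    return {
        name: total / (num_classes * count)
        for name, count in class_counts.items()
    }


# Full-dataset counts from the Training Data section (train + val combined).
counts = {"anime": 2357, "real": 5000, "rendered": 1549 + 61}
weights = inverse_frequency_weights(counts)

for name, weight in weights.items():
    print(f"{name}: {weight:.4f}")
```

Rarer classes receive larger weights, so `rendered` counts for more per sample than `real`. In PyTorch, the weights would be passed in label order as a tensor to the `weight` argument of `torch.nn.CrossEntropyLoss`.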

	## Comparison to EfficientNet-B0

	| Metric | B0 | V2-S | Delta |
	|--------|-----|------|-------|
	| Final Accuracy | 97.44% | 97.55% | +0.11 pts |
	| Best Accuracy | 97.99% | 97.99% | Tied |
	| Params | 5.3M | 21.5M | ~4× more |
	| Inference Time (per image) | ~20ms | ~60ms | ~3× slower |
	| Convergence | Epoch 4 | Epoch 13 | 9 epochs later |
	| Train Loss | 0.1022 | 0.0003 | Lower |
	| Val Loss | 0.5519 | 0.1134 | Lower |

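	Verdict: V2-S fits the training distribution far more closely (train loss 0.0003 vs 0.1022) and reaches a lower validation loss, but the gain in final validation accuracy is marginal (+0.11 points). Use B0 when speed matters and V2-S when maximum accuracy matters.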

	## Limitations

	1. Slower inference (60ms vs B0's 20ms)
	2. Larger model (81.4MB vs B0's 16.2MB)
	3. Same fundamental challenges: photorealistic games, cel-shading, artistic renders
	4. Performance degrades on images smaller than 224×224

	## Recommendations

	- Real-time/Mobile: Use EfficientNet-B0 instead
	- Accuracy-Critical: Prefer this model
	- Ensemble: Use both models for highest confidence
	- Confidence Threshold: Treat predictions as reliable only when the top-class probability is ≥80%
	- Edge Cases: Manually inspect images on which the two models disagree
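
The threshold and disagreement rules above can be combined into one routing step. The sketch below is illustrative only: `decide` is a hypothetical helper, it takes plain per-class probability lists (for example `probs[0].tolist()` from each model), and it assumes both models use the label order `anime, real, rendered`.

```python
LABELS = ["anime", "real", "rendered"]


def decide(probs_b0, probs_v2s, threshold=0.80):
    """Return (label, confidence, needs_review) for one image.

    An image is flagged for manual review when the two models pick
    different classes or the averaged confidence is below the threshold.
    """
    indices = range(len(LABELS))
    pred_b0 = max(indices, key=lambda i: probs_b0[i])
    pred_v2s = max(indices, key=lambda i: probs_v2s[i])

    # Average the two probability vectors, as in the ensemble strategy.
    avg = [(a + b) / 2 for a, b in zip(probs_b0, probs_v2s)]
    pred = max(indices, key=lambda i: avg[i])
    confidence = avg[pred]

    needs_review = pred_b0 != pred_v2s or confidence < threshold
    return LABELS[pred], confidence, needs_review


# Both models agree with high confidence: no review needed.
print(decide([0.90, 0.05, 0.05], [0.95, 0.03, 0.02]))
# Models disagree (rendered vs real): flagged for review.
print(decide([0.10, 0.20, 0.70], [0.10, 0.60, 0.30]))
```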

	## How to Use

	```python
	from PIL import Image
	import torch
	from torchvision import transforms
	import timm
	from safetensors.torch import load_file

	# Load
	model = timm.create_model('tf_efficientnetv2_s', num_classes=3, pretrained=False)
	state_dict = load_file('model.safetensors')
	model.load_state_dict(state_dict)
	model.eval()

	# Prepare image
	transform = transforms.Compose([
	    transforms.Resize((224, 224)),
	    transforms.ToTensor(),
	    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
	])
	img = Image.open('image.jpg').convert('RGB')
	x = transform(img).unsqueeze(0)

	# Infer
	with torch.no_grad():
	    logits = model(x)
	    probs = torch.softmax(logits, dim=1)
	    pred = probs.argmax().item()

	labels = ['anime', 'real', 'rendered']
	print(f"{labels[pred]}: {probs[0, pred]:.1%}")
	```

	## Ensemble Strategy

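	For maximum accuracy, average the softmax outputs of both models. In the snippet below, `load_model` is an illustrative helper and the weight paths are placeholders; `x` is the preprocessed tensor from "How to Use", and both models are assumed to share the same label order:

	```python
	import timm
	import torch
	from safetensors.torch import load_file

	def load_model(arch, weights_path):
	    # Hypothetical helper: build the architecture and load its fine-tuned weights
	    model = timm.create_model(arch, num_classes=3, pretrained=False)
	    model.load_state_dict(load_file(weights_path))
	    return model.eval()

	# Load both (placeholder paths; point them at each model's checkpoint)
	b0 = load_model('efficientnet_b0', 'b0/model.safetensors')
	v2s = load_model('tf_efficientnetv2_s', 'v2s/model.safetensors')

	# Infer
	with torch.no_grad():
	    probs_b0 = torch.softmax(b0(x), dim=1)
	    probs_v2s = torch.softmax(v2s(x), dim=1)

	# Average predictions
	ensemble_probs = (probs_b0 + probs_v2s) / 2
	pred = ensemble_probs.argmax().item()
	```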

	## Benchmarks

	Inference Speed (RTX 3060)
	- Single image: ~60ms
	- Batch of 16: ~200ms
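
Batching amortizes the per-call overhead: at ~200ms for 16 images, the per-image cost is about 12.5ms, compared with ~60ms for a single image. The card does not describe how these latencies were measured. The helper below is a generic sketch for reproducing such timings: `time_call` and `per_image_ms` are hypothetical names, and `fn` stands for any zero-argument callable that runs one forward pass.

```python
import statistics
import time


def time_call(fn, warmup=3, runs=20):
    """Return the median wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def per_image_ms(batch_latency_ms, batch_size):
    """Convert a batch latency into an average per-image latency."""
    return batch_latency_ms / batch_size


# Per-image cost implied by the batch figure reported above.
print(per_image_ms(200, 16))
```

When timing on a GPU, the callable should wait for the device to finish (for example by calling `torch.cuda.synchronize()` after the forward pass), since CUDA kernels run asynchronously.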

	## Ethical Considerations

	Same as EfficientNet-B0. This model:
	- Is not designed for deepfake detection
	- May carry cultural bias in how anime and rendered imagery are represented
	- Should be paired with human review when used for content moderation

	## Contact

	For questions: [GitHub repo]

	## License

	OpenRAIL - Free for research and commercial use with proper attribution