Mitchins/image-medium-classifier-efficientnetv2-s-v1


	---
	library_name: timm
	pipeline_tag: image-classification
	base_model:
	- timm/tf_efficientnetv2_s.in21k_ft_in1k
	tags:
	- anime-classification
	- real-photos
	- rendered-graphics
	- pytorch
	- efficientnetv2
	- vision
	license: openrail
	model_type: efficientnetv2_s
	inference: true
	---

	# Anime/Real/Rendered Image Classifier (TF-EfficientNetV2-S)

	A higher-capacity alternative to the EfficientNet-B0 classifier, with marginally better generalization when separating anime, real photos, and 3D-rendered images.

	## Model Details

	- Architecture: TF-EfficientNetV2-S (timm)
	- Input Size: 224×224 RGB
	- Classes: anime, real, rendered
	- Parameters: 21.5M (4× larger than B0)
	- Validation Accuracy: 97.55% (+0.11 percentage points vs B0)
	- Training Speed: ~3 min/epoch (GPU)
	- Inference Speed: ~60ms per image (RTX 3060)

	## Performance

	| Class | Precision | Recall | F1-Score |
	|-------|-----------|--------|----------|
	| anime | 1.00 | 0.97 | 0.98 |
	| real | 0.98 | 0.99 | 0.98 |
	| rendered | 0.93 | 0.90 | 0.91 |
	| macro avg | 0.97 | 0.95 | 0.96 |
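
The F1 and macro-average columns can be rechecked from the reported precision and recall. The sketch below assumes the macro average is the unweighted mean over the three classes; because the inputs are already rounded to two decimals, the recomputed values are approximate.

```python
# Per-class (precision, recall) as reported in the Performance table (rounded to 2 dp).
reported = {
    "anime": (1.00, 0.97),
    "real": (0.98, 0.99),
    "rendered": (0.93, 0.90),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


per_class_f1 = {name: f1_score(p, r) for name, (p, r) in reported.items()}

# Macro average: unweighted mean over classes, so the small "rendered"
# class counts as much as the much larger "real" class.
macro_precision = sum(p for p, _ in reported.values()) / len(reported)
macro_recall = sum(r for _, r in reported.values()) / len(reported)
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

for name, f1 in per_class_f1.items():
    print(f"{name}: F1 = {f1:.2f}")
print(f"macro avg: P = {macro_precision:.2f}, R = {macro_recall:.2f}, F1 = {macro_f1:.2f}")
```

The rendered class pulls the macro averages below the per-class scores for anime and real, which is why macro recall sits at 0.95 even though two of the three classes reach 0.97 or higher.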

	## Comparison to EfficientNet-B0

	| Metric | B0 | V2-S | Winner |
	|--------|-----|------|--------|
	| Final Accuracy | 97.44% | 97.55% | V2-S (+0.11 points) |
	| Best Accuracy | 97.99% | 97.99% | Tied |
	| Params | 5.3M | 21.5M | B0 (lighter) |
	| Training Speed | 1 min/epoch | 3 min/epoch | B0 (faster) |
	| Convergence | Epoch 4 | Epoch 13 | B0 (faster) |

	Verdict: V2-S fits the training data better, but its generalization gain is marginal (+0.11 percentage points final accuracy, identical best accuracy). Use B0 when speed matters and V2-S when accuracy matters.

	## Usage

	```python
	from PIL import Image
	import torch
	from torchvision import transforms
	import timm
	from safetensors.torch import load_file

	# Load model
	model = timm.create_model('tf_efficientnetv2_s', num_classes=3, pretrained=False)
	state_dict = load_file('model.safetensors')
	model.load_state_dict(state_dict)
	model.eval()

	# Prepare image
	transform = transforms.Compose([
	    transforms.Resize((224, 224)),
	    transforms.ToTensor(),
	    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
	])

	image = Image.open('image.jpg').convert('RGB')
	x = transform(image).unsqueeze(0)

	# Predict
	with torch.no_grad():
	    logits = model(x)
	    probs = torch.softmax(logits, dim=1)
	    pred_class = probs.argmax(dim=1).item()

	labels = ['anime', 'real', 'rendered']
	print(f"{labels[pred_class]}: {probs[0, pred_class]:.2%}")
	```

	## Dataset

	- Real: 5,000 COCO 2017 validation images
	- Anime: 2,357 curated animation frames
	- Rendered: 1,610 AAA games + 61 Pixar stills
	- Total: 8,967 images (8,070 train / 897 perceptually-hashed val)
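
The card does not say how perceptual hashing was applied to the validation split. One common use is to detect near-duplicate images so that they do not end up on both sides of a train/val split. The sketch below illustrates that idea with a simple average hash; the function names, the thumbnail size, and the `max_distance` tolerance are all hypothetical and are not taken from the author's pipeline.

```python
def average_hash(gray_grid):
    """Return an integer hash with one bit per pixel, set when the pixel exceeds the mean.

    `gray_grid` is a small grayscale thumbnail (for example 8x8) given as
    rows of 0-255 values; downscaling the source image is assumed to happen elsewhere.
    """
    pixels = [value for row in gray_grid for value in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for value in pixels:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Number of bit positions in which the two hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


def is_near_duplicate(grid_a, grid_b, max_distance=5):
    """Treat two thumbnails as near-duplicates when their hashes differ in few bits.

    max_distance is a hypothetical tolerance, not a value from the model card.
    """
    return hamming_distance(average_hash(grid_a), average_hash(grid_b)) <= max_distance


# Toy 4x4 thumbnails: `brighter` is `original` with every pixel raised slightly,
# `mirrored` flips the dark and bright halves.
original = [
    [10, 20, 200, 210],
    [15, 25, 205, 215],
    [10, 20, 200, 210],
    [15, 25, 205, 215],
]
brighter = [[value + 3 for value in row] for row in original]
mirrored = [list(reversed(row)) for row in original]

print(is_near_duplicate(original, brighter))
print(is_near_duplicate(original, mirrored))
```

An average hash is insensitive to small brightness shifts and recompression, which makes it a reasonable first filter for video frames that differ only slightly.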

	## Training Details

	- Augmentation: None (raw resize to 224×224)
	- Optimizer: AdamW (lr=0.001)
	- Loss: CrossEntropyLoss with class weighting
	- Epochs: 20
	- Batch Size: 40 (GPU memory constrained)
	- Hardware: NVIDIA RTX 3060 (12GB)
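
The card states that the loss uses class weighting but not how the weights were chosen. A minimal sketch, assuming "balanced" inverse-frequency weights and using the per-class totals from the Dataset section (the actual per-class training counts and the author's weighting scheme are not given):

```python
# Image counts per class as listed in the Dataset section
# (rendered = 1,610 game frames + 61 Pixar stills). These are totals before
# the train/val split; the exact per-class training counts are not stated.
class_counts = {"anime": 2357, "real": 5000, "rendered": 1610 + 61}


def inverse_frequency_weights(counts: dict) -> dict:
    """Weight each class by total / (num_classes * class_count).

    ASSUMPTION: this "balanced" scheme is illustrative only; the model card
    does not say which weighting was used in training.
    """
    total = sum(counts.values())
    num_classes = len(counts)
    return {name: total / (num_classes * n) for name, n in counts.items()}


weights = inverse_frequency_weights(class_counts)
for name, weight in weights.items():
    print(f"{name}: {weight:.3f}")

# With PyTorch, the values (in label order anime, real, rendered) could be
# passed as torch.nn.CrossEntropyLoss(weight=torch.tensor(list(weights.values()))).
```

Under this scheme the rendered class receives the largest weight and the real class the smallest, so errors on the minority class cost more during training.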

	## Known Behavior

	- Better Anime Detection: Precision of 1.00 with 97% recall, so anime predictions are rarely wrong but about 3% of anime images are missed
	- Stronger Real Recognition: 99% recall on real images
	- Rendered Uncertainty: 90% recall suggests photorealistic games remain challenging
	- Slower Inference: ~3× slower than B0 due to model size

	## Recommendations

	- Production: Ensemble both models for maximum confidence
	- Real-time: Use B0 for speed-critical applications
	- Accuracy-critical: Use V2-S as primary model
	- Confidence Thresholding: Trust only predictions with confidence above 80%
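
The ensemble and thresholding recommendations can be combined as sketched below. This assumes the ensemble is a plain average of the two models' softmax probabilities, which the card does not specify; `ensemble_predict` and the example logits are hypothetical. It uses plain Python so that it runs without the model weights.

```python
import math

LABELS = ["anime", "real", "rendered"]  # label order from the Usage section
CONFIDENCE_THRESHOLD = 0.80  # the 80% cutoff from the recommendations


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(logits_b0, logits_v2s, threshold=CONFIDENCE_THRESHOLD):
    """Average both models' probabilities and apply the confidence cutoff.

    Returns (label, confidence); label is None when the averaged
    confidence does not exceed the threshold.
    ASSUMPTION: equal-weight probability averaging, chosen for illustration.
    """
    probs_b0 = softmax(logits_b0)
    probs_v2s = softmax(logits_v2s)
    averaged = [(a + b) / 2 for a, b in zip(probs_b0, probs_v2s)]
    best = max(range(len(averaged)), key=averaged.__getitem__)
    confidence = averaged[best]
    label = LABELS[best] if confidence > threshold else None
    return label, confidence


# Hypothetical logits: both models agree strongly in the first case,
# and are unsure in the second.
confident_label, confident_score = ensemble_predict([4.0, 0.5, 0.2], [5.0, 0.1, 0.3])
unsure_label, unsure_score = ensemble_predict([1.0, 0.2, 1.1], [0.4, 0.3, 1.5])
print(confident_label, f"{confident_score:.2%}")
print(unsure_label, f"{unsure_score:.2%}")
```

Returning `None` below the cutoff lets the caller route uncertain images to manual review instead of forcing a label.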

	## License

	OpenRAIL - Free for research and educational purposes