---
library_name: timm
pipeline_tag: image-classification
base_model:
- timm/efficientnet_b0
tags:
- anime-classification
- real-photos
- rendered-graphics
- pytorch
- efficientnet
- vision
license: openrail
model_type: efficientnet_b0
inference: true
---

# Anime/Real/Rendered Image Classifier (EfficientNet-B0)

**Fast, lightweight classifier for distinguishing photographs from anime and 3D rendered images.**

## Model Details

- **Architecture:** EfficientNet-B0 (timm)
- **Input Size:** 224×224 RGB
- **Classes:** anime, real, rendered
- **Parameters:** 5.3M
- **Validation Accuracy:** 97.44%
- **Training Speed:** ~1 min/epoch (GPU)
- **Inference Speed:** ~20 ms per image (RTX 3060)

## Performance

| Class | Precision | Recall | F1-Score |
|-------|-----------|--------|----------|
| anime | 0.98 | 0.99 | 0.99 |
| real | 0.98 | 0.98 | 0.98 |
| rendered | 0.96 | 0.93 | 0.94 |
| **macro avg** | **0.97** | **0.97** | **0.97** |

## Usage

```python
from PIL import Image
import torch
from torchvision import transforms
import timm
from safetensors.torch import load_file

# Load model
model = timm.create_model('efficientnet_b0', num_classes=3, pretrained=False)
state_dict = load_file('model.safetensors')
model.load_state_dict(state_dict)
model.eval()

# Prepare image (ImageNet normalization)
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

image = Image.open('image.jpg').convert('RGB')
x = transform(image).unsqueeze(0)

# Predict
with torch.no_grad():
    logits = model(x)
    probs = torch.softmax(logits, dim=1)
    pred_class = probs.argmax(dim=1).item()

labels = ['anime', 'real', 'rendered']
print(f"{labels[pred_class]}: {probs[0, pred_class]:.2%}")
```

## Dataset

- **Real:** 5,000 COCO 2017 validation images (diverse real-world scenes)
- **Anime:** 2,357 curated anime/animation frames
- **Rendered:** 1,610 AAA game screenshots + 61 Pixar movie stills
- **Total:** 8,967 images (8,070 train / 897 val)

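The card does not document the split procedure; assuming a simple random 90/10 split with a fixed seed (the seed and use of `random_split` are assumptions), it could be reproduced like this:

```python
import torch
from torch.utils.data import random_split

# Counts from the dataset section above
n_total = 8967
n_val = 897                   # ~10% held out for validation
n_train = n_total - n_val     # 8,070 training images

# random_split only needs a sized, indexable collection; range() stands in
# for the real dataset here
generator = torch.Generator().manual_seed(42)
train_set, val_set = random_split(range(n_total), [n_train, n_val], generator=generator)
print(len(train_set), len(val_set))  # 8070 897
```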
## Training Details

- **Augmentation:** None (plain resize to 224×224)
- **Optimizer:** AdamW (lr=0.001)
- **Loss:** CrossEntropyLoss with class weighting
- **Epochs:** 20
- **Batch Size:** 80
- **Hardware:** NVIDIA RTX 3060 (12GB)

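Class weighting counteracts the real-photo majority in the dataset. The exact scheme isn't specified; a common choice is inverse-frequency weighting, sketched here using the per-class counts from the dataset section:

```python
import torch
import torch.nn as nn

# Per-class image counts: anime, real, rendered (from the dataset section)
counts = torch.tensor([2357.0, 5000.0, 1610.0])

# Inverse-frequency weights: weight_c = N / (K * n_c), so minority classes
# contribute proportionally more to the loss
weights = counts.sum() / (len(counts) * counts)

criterion = nn.CrossEntropyLoss(weight=weights)
print(weights)  # real (index 1) weighted lowest, rendered (index 2) highest
```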
## Known Limitations

- **Real vs. Rendered:** Some confusion; photorealistic game screenshots can be misclassified as real
- **Stylized Games:** Cel-shaded games (e.g., Fate/Extella) may be scored as anime
- **Pixar-style CGI:** Heavily stylized rendered images may produce mixed confidence

## Recommendations

- Ensemble with a second model (e.g., tf_efficientnetv2_s) for critical applications
- Apply a confidence threshold: only trust predictions above 85% confidence
- For edge cases, consult the full confusion matrix to understand failure modes

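The 85% threshold can be applied directly to the softmax output from the usage example; a minimal sketch (the `classify_with_threshold` helper is hypothetical, not part of the model's API):

```python
import torch

LABELS = ['anime', 'real', 'rendered']

def classify_with_threshold(logits: torch.Tensor, threshold: float = 0.85):
    """Return (label, confidence), or ('uncertain', confidence) below the threshold."""
    probs = torch.softmax(logits, dim=1)
    conf, idx = probs.max(dim=1)
    label = LABELS[idx.item()] if conf.item() >= threshold else 'uncertain'
    return label, conf.item()

# A confident prediction vs. an ambiguous one
print(classify_with_threshold(torch.tensor([[5.0, 0.0, 0.0]])))  # confident: anime
print(classify_with_threshold(torch.tensor([[1.0, 0.9, 0.8]])))  # flat logits: uncertain
```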
## License

OpenRAIL: free for research and commercial use with proper attribution.