Upload folder using huggingface_hub

Files changed (5)

README.md +103 -0
config.json +10 -0
labels.txt +3 -0
model.safetensors +3 -0
model_card.md +168 -0

README.md ADDED

	@@ -0,0 +1,103 @@

+# Anime/Real/Rendered Image Classifier (TF-EfficientNetV2-S)
+**Higher-capacity classifier for separating anime, photographs, and 3D-rendered images, with marginally better validation accuracy than EfficientNet-B0.**
+## Model Details
+- **Architecture:** TF-EfficientNetV2-S (timm)
+- **Input Size:** 224×224 RGB
+- **Classes:** anime, real, rendered
+- **Parameters:** 21.5M (4× larger than B0)
+- **Validation Accuracy:** 97.55% (+0.11 percentage points vs. B0)
+- **Training Speed:** ~3 min/epoch (GPU)
+- **Inference Speed:** ~60ms per image (RTX 3060)
+## Performance
+| Class | Precision | Recall | F1-Score |
+|-------|-----------|--------|----------|
+| anime | 1.00 | 0.97 | 0.98 |
+| real | 0.98 | 0.99 | 0.98 |
+| rendered | 0.93 | 0.90 | 0.91 |
+| **macro avg** | **0.97** | **0.95** | **0.96** |
+## Comparison to EfficientNet-B0
+| Metric | B0 | V2-S | Winner |
+|--------|-----|------|--------|
+| Final Accuracy | 97.44% | **97.55%** | V2-S (+0.11 pp) |
+| Best Accuracy | 97.99% | 97.99% | Tied |
+| Params | 5.3M | 21.5M | B0 (lighter) |
+| Training Speed | 1 min/epoch | 3 min/epoch | B0 (faster) |
+| Convergence | Epoch 4 | Epoch 13 | B0 (faster) |
+**Verdict:** V2-S fits the training data more closely but generalizes only marginally better (+0.11 percentage points in final accuracy). Use B0 when speed matters and V2-S when accuracy matters most.
+## Usage
+```python
+from PIL import Image
+import torch
+from torchvision import transforms
+import timm
+from safetensors.torch import load_file
+# Load model
+model = timm.create_model('tf_efficientnetv2_s', num_classes=3, pretrained=False)
+state_dict = load_file('model.safetensors')
+model.load_state_dict(state_dict)
+model.eval()
+# Prepare image
+transform = transforms.Compose([
+    transforms.Resize((224, 224)),
+    transforms.ToTensor(),
+    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
+])
+image = Image.open('image.jpg').convert('RGB')
+x = transform(image).unsqueeze(0)
+# Predict
+with torch.no_grad():
+    logits = model(x)
+    probs = torch.softmax(logits, dim=1)
+    pred_class = probs.argmax(dim=1).item()
+labels = ['anime', 'real', 'rendered']
+print(f"{labels[pred_class]}: {probs[0, pred_class]:.2%}")
+```
+## Dataset
+- **Real:** 5,000 COCO 2017 validation images
+- **Anime:** 2,357 curated animation frames
+- **Rendered:** 1,610 images (1,549 AAA games + 61 Pixar stills)
+- **Total:** 8,967 images (8,070 train / 897 perceptually-hashed val)
+## Training Details
+- **Augmentation:** None (resize to 224×224 only)
+- **Optimizer:** AdamW (lr=0.001)
+- **Loss:** CrossEntropyLoss with class weighting
+- **Epochs:** 20
+- **Batch Size:** 40 (GPU memory constrained)
+- **Hardware:** NVIDIA RTX 3060 (12GB)
+## Known Behavior
+- **Better Anime Detection:** Precision of 1.00 on the validation set, with 97% recall
+- **Stronger Real Recognition:** 99% recall on real images
+- **Rendered Uncertainty:** 90% recall suggests that photorealistic games are still challenging
+- **Slower Inference:** ~3× slower than B0 due to model size
+## Recommendations
+- **Production:** Ensemble both models for maximum confidence
+- **Real-time:** Use B0 for speed-critical applications
+- **Accuracy-critical:** Use V2-S as primary model
+- **Confidence Thresholding:** Only trust predictions with ≥80% confidence
+## License
+This model is provided as-is for research and educational purposes.
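A note on the confidence-thresholding recommendation in the README above: the following is a minimal sketch, assuming the softmax probabilities have already been converted to a plain Python list (for example with `probs[0].tolist()`). The helper name `predict_with_threshold` and its `None` return for rejected predictions are illustrative choices, not part of the repository. It treats exactly 80% as acceptable, matching the ≥80% wording in model_card.md.

```python
LABELS = ["anime", "real", "rendered"]  # class order from labels.txt

def predict_with_threshold(probs, labels=LABELS, threshold=0.80):
    """Return (label, confidence), or (None, confidence) when the top
    probability falls below the threshold and the image needs review."""
    if len(probs) != len(labels):
        raise ValueError("expected one probability per label")
    best = max(range(len(probs)), key=lambda i: probs[i])
    confidence = probs[best]
    if confidence < threshold:
        return None, confidence
    return labels[best], confidence

print(predict_with_threshold([0.05, 0.10, 0.85]))  # confident prediction
print(predict_with_threshold([0.10, 0.45, 0.45]))  # rejected: below threshold
```

Returning the confidence even for rejected predictions makes it easy to log borderline cases for manual inspection.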

config.json ADDED

	@@ -0,0 +1,10 @@

+{
+  "model_type": "tf_efficientnetv2_s",
+  "num_classes": 3,
+  "input_size": 224,
+  "labels": [
+    "anime",
+    "real",
+    "rendered"
+  ]
+}
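The config above carries enough information to rebuild the classifier head and decode predictions. The sketch below embeds the JSON as a string so it runs standalone, checks that `num_classes` matches the label list, and builds an index-to-label map. The helper name `load_config` is hypothetical; passing `cfg["model_type"]` and `cfg["num_classes"]` to `timm.create_model` would mirror the README usage snippet.

```python
import json

CONFIG_JSON = """
{
  "model_type": "tf_efficientnetv2_s",
  "num_classes": 3,
  "input_size": 224,
  "labels": ["anime", "real", "rendered"]
}
"""

def load_config(text):
    """Parse the config and sanity-check it before building a model."""
    cfg = json.loads(text)
    if cfg["num_classes"] != len(cfg["labels"]):
        raise ValueError("num_classes does not match the number of labels")
    return cfg

cfg = load_config(CONFIG_JSON)
id2label = dict(enumerate(cfg["labels"]))  # class index -> label name
print(cfg["model_type"], cfg["input_size"], id2label)
```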

labels.txt ADDED

	@@ -0,0 +1,3 @@

+0: anime
+1: real
+2: rendered
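labels.txt stores one `index: name` pair per line. A minimal parsing sketch, assuming that format holds for every line; `parse_labels` is an illustrative helper, not a function shipped with the model.

```python
def parse_labels(text):
    """Turn 'index: name' lines into a list ordered by class index."""
    pairs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        index, name = line.split(":", 1)
        pairs[int(index)] = name.strip()
    # Require contiguous indices 0..n-1 so list positions match logit positions
    if sorted(pairs) != list(range(len(pairs))):
        raise ValueError("label indices must be contiguous from 0")
    return [pairs[i] for i in range(len(pairs))]

labels = parse_labels("0: anime\n1: real\n2: rendered\n")
print(labels)
```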

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:935da24ebbb0bfbe744f7f75d7a018dadd0bfb1f9dcad0d7c34488acb0d546cf
+size 81414628
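The three lines above are a Git LFS pointer, not the weights themselves: `oid` is the SHA-256 of the real file and `size` is its length in bytes. After downloading `model.safetensors`, the file can be checked against the pointer. This is a sketch using only the standard library; `parse_lfs_pointer` and `verify_file` are hypothetical helper names.

```python
import hashlib
import os

POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:935da24ebbb0bfbe744f7f75d7a018dadd0bfb1f9dcad0d7c34488acb0d546cf
size 81414628
"""

def parse_lfs_pointer(text):
    """Extract the expected SHA-256 digest and byte size from a pointer."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    if algo != "sha256":
        raise ValueError(f"unsupported hash algorithm: {algo}")
    return digest, int(fields["size"])

def verify_file(path, digest, size, chunk_size=1 << 20):
    """Return True if the file at `path` matches the digest and size."""
    if os.path.getsize(path) != size:
        return False  # cheap check first; avoids hashing a truncated file
    sha = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest() == digest

expected_digest, expected_size = parse_lfs_pointer(POINTER)
print(expected_size / 1e6)  # size in decimal megabytes
```

With the real file on disk, `verify_file("model.safetensors", expected_digest, expected_size)` would confirm that the download is complete and unmodified.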

model_card.md ADDED

	@@ -0,0 +1,168 @@

+---
+license: openrail
+language: en
+library_name: timm
+tags:
+  - image-classification
+  - anime
+  - real
+  - rendered
+  - 3d-graphics
+datasets:
+  - coco
+  - custom-anime
+  - steam-screenshots
+---
+# TF-EfficientNetV2-S - Anime/Real/Rendered Classifier
+Higher-capacity classifier with improved generalization for distinguishing photographs from anime and 3D rendered images.
+## Model Summary
+- **Model Name:** tf_efficientnetv2_s
+- **Framework:** PyTorch + TIMM
+- **Input:** 224×224 RGB images
+- **Output:** 3 classes (anime, real, rendered)
+- **Parameters:** 21.5M (4× larger than B0)
+- **Size:** 81.4 MB
+## Intended Use
+Same as the EfficientNet-B0 variant, with marginally higher validation accuracy. The model assigns each image to one of three classes:
+- **anime**: Drawn 2D or cel-shaded animation
+- **real**: Photographs and real-world footage
+- **rendered**: 3D graphics (games, CGI, Pixar, etc.)
+## Performance
+**Validation Accuracy:** 97.55% (+0.11 percentage points vs. B0)
+| Class | Precision | Recall | F1-Score | Support |
+|-------|-----------|--------|----------|---------|
+| anime | 1.00 | 0.97 | 0.98 | 236 |
+| real | 0.98 | 0.99 | 0.98 | 500 |
+| rendered | 0.93 | 0.90 | 0.91 | 161 |
+| **macro avg** | **0.97** | **0.95** | **0.96** | **897** |
+## Training Data
+Identical to EfficientNet-B0:
+- **Real images:** 5,000 images from the COCO 2017 validation set
+- **Anime images:** 2,357 curated frames
+- **Rendered images:** 1,549 AAA games + 61 Pixar stills
+- **Total:** 8,967 images (8,070 train / 897 diverse val)
+## Training Details
+- **Framework:** PyTorch
+- **Augmentation:** Resize only (224×224)
+- **Loss Function:** CrossEntropyLoss with inverse frequency weighting
+- **Optimizer:** AdamW (lr=0.001)
+- **Batch Size:** 40 (GPU memory constrained)
+- **Epochs:** 20
+- **Hardware:** NVIDIA RTX 3060 (12GB VRAM)
+- **Training Time:** ~60 minutes
+## Comparison to EfficientNet-B0
+| Metric | B0 | V2-S | Delta |
+|--------|-----|------|-------|
+| Final Accuracy | 97.44% | 97.55% | +0.11 pp |
+| Best Accuracy | 97.99% | 97.99% | Tied |
+| Params | 5.3M | 21.5M | ~4× larger |
+| Inference Time | ~20ms | ~60ms | ~3× slower |
+| Convergence | Epoch 4 | Epoch 13 | 9 epochs later |
+| Train Loss | 0.1022 | 0.0003 | V2-S lower |
+| Val Loss | 0.5519 | 0.1134 | V2-S lower |
+**Verdict:** V2-S fits the training distribution more thoroughly, but the real-world improvement is marginal. Use B0 for speed and V2-S for maximum accuracy.
+## Limitations
+1. Slower inference (60ms vs B0's 20ms)
+2. Larger model (81.4MB vs B0's 16.2MB)
+3. Same fundamental challenges: photorealistic games, cel-shading, artistic renders
+4. Performance degrades on images <224×224
+## Recommendations
+- **Real-time/Mobile:** Use EfficientNet-B0 instead
+- **Accuracy-Critical:** Prefer this model
+- **Ensemble:** Use both models for highest confidence
+- **Confidence Threshold:** ≥80% for reliable predictions
+- **Edge Cases:** Manually inspect when models disagree
+## How to Use
+```python
+from PIL import Image
+import torch
+from torchvision import transforms
+import timm
+from safetensors.torch import load_file
+# Load
+model = timm.create_model('tf_efficientnetv2_s', num_classes=3, pretrained=False)
+state_dict = load_file('model.safetensors')
+model.load_state_dict(state_dict)
+model.eval()
+# Prepare image
+transform = transforms.Compose([
+    transforms.Resize((224, 224)),
+    transforms.ToTensor(),
+    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
+])
+img = Image.open('image.jpg').convert('RGB')
+x = transform(img).unsqueeze(0)
+# Infer
+with torch.no_grad():
+    logits = model(x)
+    probs = torch.softmax(logits, dim=1)
+    pred = probs.argmax(dim=1).item()
+labels = ['anime', 'real', 'rendered']
+print(f"{labels[pred]}: {probs[0, pred]:.1%}")
+```
+## Ensemble Strategy
+For maximum accuracy, use both models:
+```python
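+import timm
+import torch
+from safetensors.torch import load_file
+def load_model(arch, weights_path):
+    # Build the architecture with a 3-class head, then load the fine-tuned weights
+    model = timm.create_model(arch, num_classes=3, pretrained=False)
+    model.load_state_dict(load_file(weights_path))
+    return model.eval()
+# Load both (weight paths are placeholders; point them at your local files)
+b0 = load_model('efficientnet_b0', 'efficientnet_b0/model.safetensors')
+v2s = load_model('tf_efficientnetv2_s', 'model.safetensors')
+# Infer (x is the preprocessed batch from "How to Use")
+with torch.no_grad():
+    probs_b0 = torch.softmax(b0(x), dim=1)
+    probs_v2s = torch.softmax(v2s(x), dim=1)
+    # Average predictions
+    ensemble_probs = (probs_b0 + probs_v2s) / 2
+    pred = ensemble_probs.argmax(dim=1).item()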
+```
+## Benchmarks
+**Inference Speed (RTX 3060)**
+- Single image: ~60ms
+- Batch of 16: ~200ms
+## Ethical Considerations
+Same as EfficientNet-B0. This model:
+- Is not designed for deepfake detection
+- May have cultural bias in anime/rendered representation
+- Should be used with human review for content moderation
+## Contact
+For questions: [GitHub repo]
+## License
+OpenRAIL - Free for research and commercial use with proper attribution
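The Training Details section of model_card.md mentions CrossEntropyLoss with inverse frequency weighting but does not give the formula. One common convention, assumed here, is `weight_c = N / (K * n_c)` for `N` images, `K` classes, and `n_c` images in class `c`. The sketch applies it to the full-dataset counts from the Training Data section; the per-class counts of the training split are not listed, so these are stand-ins. The resulting values, in label order, could be passed as the `weight` tensor of `torch.nn.CrossEntropyLoss`.

```python
# Class counts from the Training Data section (full dataset, not the train split)
counts = {"anime": 2357, "real": 5000, "rendered": 1549 + 61}

def inverse_frequency_weights(counts):
    """weight_c = N / (K * n_c): rarer classes get larger weights.
    Assumed convention; the card does not state the exact formula."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {name: total / (num_classes * n) for name, n in counts.items()}

weights = inverse_frequency_weights(counts)
for name, w in weights.items():
    print(f"{name}: {w:.3f}")
```

Under this convention the weighted class counts still sum to the dataset size, so the overall loss scale stays comparable to the unweighted case.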