---
library_name: transformers
license: apache-2.0
tags:
- image-classification
- vision-transformer
- vit
- celeba
- research
- fine-tuning
---

# FaceGuard – ViT (20 CelebA IDs)

A Vision Transformer (ViT-Base) fine-tuned for identity classification on a **20-identity subset of the CelebA dataset**. The model predicts **anonymized `celeb_id` integers** (not celebrity names). It powers the demo Space: https://huggingface.co/spaces/hudaakram/FaceGuard-demo

---

## Model Details

### Model Description

- **Architecture:** `google/vit-base-patch16-224` (pretrained on ImageNet-1k)
- **Fine-tuned for:** 20-class identity classification (CelebA `celeb_id`s)
- **Input:** RGB image (face crop), resized and normalized to 224×224
- **Output:** probability distribution over 20 anonymized IDs
- **Parameters:** ~86M

### Sources

- **Base model:** https://huggingface.co/google/vit-base-patch16-224
- **Demo Space:** https://huggingface.co/spaces/hudaakram/FaceGuard-demo
- **Dataset:** CelebA (community mirror on the Hub)

---

## Uses

### Direct Use

- Research demo for identity classification with anonymized CelebA IDs
- Educational example of fine-tuning ViT for image classification

### Downstream Use

- As a starting point for transfer learning to other **small identity classification tasks**
- As an educational reference for hackathons, workshops, or courses

### Out-of-Scope Use

- ❌ Production face recognition / surveillance
- ❌ Identifying real celebrity names (the dataset only provides integer IDs)
- ❌ Any high-stakes application involving privacy or personal data

---

## Bias, Risks, and Limitations

- **Bias:** CelebA contains celebrity faces, which are not representative of all demographics.
- **Limitations:** Trained on **only 20 identities (~600 images total)**, so generalization beyond them is limited.
- **Privacy:** CelebA IDs are anonymized integers, not real names. The model cannot return actual identities.

**Recommendation:** Use strictly for **research/educational purposes**.

---

## How to Get Started

Use the code below to get started with the model.

```python
from transformers import ViTForImageClassification, AutoImageProcessor
from PIL import Image
import torch

model_id = "hudaakram/FaceGuard-20ID-ViT"
processor = AutoImageProcessor.from_pretrained(model_id)
model = ViTForImageClassification.from_pretrained(model_id)
model.eval()

# Load a face crop and apply the ViT preprocessing (resize + normalize to 224x224)
img = Image.open("face.jpg").convert("RGB")
inputs = processor(images=img, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)[0]

# id2label maps each class index to its anonymized CelebA celeb_id
id2label = {int(k): v for k, v in model.config.id2label.items()}
top5 = probs.topk(5)
for score, idx in zip(top5.values, top5.indices):
    print(f"Label {idx.item()} (celeb_id {id2label[idx.item()]}): {score.item():.3f}")
```

## Training Details

### Training Data

- **Dataset:** CelebA (top 20 identities by frequency)
- **Splits:** Stratified 80% train / 10% validation / 10% test (a split sketch follows below)
- **Sizes:** Train 501, Val 60, Test 77
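
A minimal sketch of how such a stratified split can be produced with scikit-learn. The `samples`/`labels` arrays and the helper name are hypothetical, not the exact training code:

```python
# Hypothetical sketch of the stratified 80/10/10 split described above;
# `samples` and `labels` are assumed parallel arrays of images and identities.
from sklearn.model_selection import train_test_split

def stratified_split(samples, labels, seed=42):
    # Carve off 80% for training, stratifying on identity
    X_train, X_tmp, y_train, y_tmp = train_test_split(
        samples, labels, train_size=0.8, stratify=labels, random_state=seed
    )
    # Split the remaining 20% evenly into validation and test
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=seed
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```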

### Training Procedure

- **Seed:** 42
- **Epochs:** 4
- **Batch size:** 16
- **Learning rate:** 5e-5
- **Optimizer:** AdamW
- **Weight decay:** 0.01
- **Precision:** FP16 on GPU (Colab)
- **Head resized:** from 1000 ImageNet classes → 20 classes (see the sketch below)
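
A hedged sketch of how this configuration maps onto the `transformers` `Trainer` API. The dataset variables (`train_ds`, `val_ds`) and the output directory are assumptions; the actual training script may differ in detail:

```python
# Sketch of the fine-tuning setup under the hyperparameters listed above;
# train_ds / val_ds are assumed preprocessed datasets (not in this repo).
from transformers import (ViTForImageClassification, Trainer,
                          TrainingArguments, set_seed)

set_seed(42)

# num_labels=20 replaces the 1000-class ImageNet head with a fresh 20-class
# head; ignore_mismatched_sizes avoids the resulting shape-mismatch error.
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=20,
    ignore_mismatched_sizes=True,
)

args = TrainingArguments(
    output_dir="faceguard-vit",     # assumed output path
    num_train_epochs=4,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    weight_decay=0.01,              # Trainer uses AdamW by default
    fp16=True,                      # mixed precision on GPU
    seed=42,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=val_ds)
trainer.train()
```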

### Preprocessing

- Images resized + center-cropped to 224×224
- Normalized to ImageNet mean/std
- Labels mapped from CelebA `celeb_id` → contiguous 0–19 (see the sketch below)
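
A minimal sketch of that label remapping and image transform. `raw_dataset` and its `image`/`celeb_id` fields are assumed names for illustration:

```python
# Sketch of the preprocessing described above; raw_dataset and its columns
# ("image", "celeb_id") are hypothetical placeholders.
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")

# Map the 20 raw CelebA celeb_ids to contiguous labels 0-19
unique_ids = sorted({ex["celeb_id"] for ex in raw_dataset})
celeb2label = {cid: i for i, cid in enumerate(unique_ids)}

def preprocess(example):
    # Resize/center-crop to 224x224 and normalize with ImageNet mean/std
    pixel_values = processor(images=example["image"],
                             return_tensors="pt")["pixel_values"][0]
    return {"pixel_values": pixel_values,
            "label": celeb2label[example["celeb_id"]]}
```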

### Training Hyperparameters

- **Training regime:** fp16 mixed precision on GPU
- **Total epochs:** 4 (~3 minutes each on a Colab T4)

### Speeds, Sizes, Times

- **Checkpoint size:** ~343 MB
- **Throughput:** ~10 samples/sec (Colab T4)

---

## Evaluation

- Validation accuracy: ~0.93
- Test accuracy: ~0.83
- Macro AUC: see the ROC curves below (a computation sketch follows this list)
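
A sketch of how the macro one-vs-rest AUC behind the ROC figure can be computed with scikit-learn. `y_true` and `probs` are assumed to hold the test labels and predicted class probabilities:

```python
# Sketch of the macro one-vs-rest AUC computation; y_true (shape [N]) and
# probs (shape [N, 20]) are assumed test labels and predicted probabilities.
from sklearn.metrics import roc_auc_score

macro_auc = roc_auc_score(y_true, probs, multi_class="ovr", average="macro")
print(f"Macro OvR AUC: {macro_auc:.3f}")
```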

### Split Summary

| Split | #Images | #Classes | Min/Class | Median/Class | Max/Class |
|-------|---------|----------|-----------|--------------|-----------|
| Train | 501 | 20 | 24 | 24 | 28 |
| Val | 60 | 20 | 3 | 3 | 3 |
| Test | 77 | 20 | 3 | 4 | 4 |

### Results

**Confusion Matrix (normalized):**

![Confusion Matrix](confusion_matrix.png)

**ROC Curves (one-vs-rest):**

![ROC Curves](roc_curves.png)

---

## Environmental Impact

- **Hardware:** Google Colab T4 GPU
- **Training time:** ~12 minutes total (4 epochs)
- **Carbon emissions:** negligible (short fine-tuning run)

---

## Technical Specifications

### Model Architecture and Objective

- Vision Transformer (ViT-Base, patch16, 224×224)
- Objective: cross-entropy classification across 20 labels

### Compute Infrastructure

- **Hardware:** Google Colab T4 GPU
- **Framework:** PyTorch + Hugging Face Transformers

---

## Citation

**CelebA Dataset:**
Z. Liu, P. Luo, X. Wang, and X. Tang. *Deep Learning Face Attributes in the Wild.* ICCV 2015.

**ViT:**
A. Dosovitskiy et al. *An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.* ICLR 2021.

---

## Model Card Authors

Hackathon submission by **Huda Akram**

## Contact

- Hugging Face profile: https://huggingface.co/hudaakram