---
library_name: transformers
license: apache-2.0
tags:
- image-classification
- vision-transformer
- vit
- celeba
- research
- fine-tuning
---
# FaceGuard – ViT (20 CelebA IDs)
A Vision Transformer (ViT-Base) fine-tuned for identity classification on a **20-identity subset of the CelebA dataset**.
This model predicts **anonymized `celeb_id` integers** (not celebrity names).
It powers the demo Space: https://huggingface.co/spaces/hudaakram/FaceGuard-demo
---
## Model Details
### Model Description
- **Architecture:** `google/vit-base-patch16-224` (pretrained on ImageNet-1k)
- **Fine-tuned for:** 20-class identity classification (CelebA `celeb_id`s)
- **Input:** RGB image (face crop), resized and normalized to 224×224
- **Output:** Probability distribution over 20 anonymized IDs
- **Parameters:** ~86M
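The label space and input size listed above can be verified directly from the published checkpoint; a minimal sketch (the values in the comments are the expected ones, not guaranteed outputs):

```python
from transformers import AutoImageProcessor, ViTForImageClassification

model_id = "hudaakram/FaceGuard-20ID-ViT"
processor = AutoImageProcessor.from_pretrained(model_id)
model = ViTForImageClassification.from_pretrained(model_id)

print(model.config.num_labels)                    # expected: 20 anonymized celeb_id classes
print(sorted(model.config.id2label.items())[:3])  # first few (class index, celeb_id) pairs
print(processor.size)                             # expected: 224x224 input resolution
```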
### Sources
- **Base model:** https://huggingface.co/google/vit-base-patch16-224
- **Demo Space:** https://huggingface.co/spaces/hudaakram/FaceGuard-demo
- **Dataset:** CelebA (community mirror on the Hub)
---
## Uses
### Direct Use
- Research demo for identity classification with anonymized CelebA IDs
- Educational example of fine-tuning ViT for image classification
### Downstream Use
- As a starting point for transfer learning to other **small identity classification tasks**
- As an educational reference for hackathons, workshops, or courses
### Out-of-Scope Use
- ❌ Production face recognition / surveillance
- ❌ Identifying real celebrity names (dataset only provides integer IDs)
- ❌ Any high-stakes application involving privacy or personal data
---
## Bias, Risks, and Limitations
- **Bias:** CelebA contains celebrity faces, which are not representative of all demographics.
- **Limitations:** Trained on **only 20 identities (~640 images total)**, so generalization beyond this small subset is limited.
- **Privacy:** CelebA IDs are anonymized integers, not real names. The model is not capable of returning actual identities.
**Recommendation:** Use strictly for **research/educational purposes**.
---
## How to Get Started
Use the code below to get started with the model.
```python
from transformers import ViTForImageClassification, AutoImageProcessor
from PIL import Image
import torch

model_id = "hudaakram/FaceGuard-20ID-ViT"
processor = AutoImageProcessor.from_pretrained(model_id)
model = ViTForImageClassification.from_pretrained(model_id)
model.eval()

# Load a face crop and preprocess it (resize + normalize to 224x224)
img = Image.open("face.jpg").convert("RGB")
inputs = processor(images=img, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)[0]

# id2label maps each class index to its anonymized CelebA celeb_id
id2label = {int(k): v for k, v in model.config.id2label.items()}

# Report the five most likely identities
top5 = probs.topk(5)
for score, idx in zip(top5.values, top5.indices):
    print(f"Label {idx.item()} (celeb_id {id2label[idx.item()]}): {score.item():.3f}")
```
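Each printed line shows the model's internal class index, the anonymized CelebA `celeb_id` it maps to through `id2label`, and the softmax probability for that identity.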
## Training Details
### Training Data
- **Dataset:** CelebA (top 20 identities by frequency)
- **Splits:** Stratified 80% train / 10% validation / 10% test
- **Sizes:** Train 501, Val 60, Test 77
### Training Procedure
- **Seed:** 42
- **Epochs:** 4
- **Batch size:** 16
- **Learning rate:** 5e-5
- **Optimizer:** AdamW
- **Weight decay:** 0.01
- **Precision:** FP16 on GPU (Colab)
- **Head resized:** from 1000 ImageNet classes → 20 identity classes (see the sketch below)
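The settings above correspond to a standard `Trainer` run; the following is a minimal, non-authoritative sketch of such a setup, where `train_ds` is a placeholder for the preprocessed 20-identity CelebA subset (see the preprocessing sketch further down), not the exact training script:

```python
from transformers import ViTForImageClassification, TrainingArguments, Trainer

# Resize the pretrained 1000-class ImageNet head to 20 identity classes
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=20,
    ignore_mismatched_sizes=True,  # drop the old classifier weights, init a fresh 20-way head
)

args = TrainingArguments(
    output_dir="faceguard-vit",
    seed=42,
    num_train_epochs=4,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    weight_decay=0.01,   # AdamW is the Trainer's default optimizer
    fp16=True,           # mixed precision on the Colab GPU
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,  # placeholder: examples with "pixel_values" and "labels"
)
trainer.train()
```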
### Preprocessing
- Images resized + center-cropped to 224×224
- Normalized to ImageNet mean/std
- Labels mapped from CelebA `celeb_id` → contiguous 0–19 (see the sketch after this list)
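A sketch of the label mapping and per-image transform described above; `kept_celeb_ids` (the 20 retained identity IDs), `image`, and `celeb_id` are placeholders, and the exact crop settings may differ slightly from the original script:

```python
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")

# Map the raw CelebA celeb_id values of the 20 kept identities to contiguous labels 0-19
celeb_id2label = {cid: i for i, cid in enumerate(sorted(kept_celeb_ids))}
id2label = {i: cid for cid, i in celeb_id2label.items()}  # stored in the model config

def preprocess(image, celeb_id):
    # Resize to 224x224 and normalize with ImageNet mean/std via the ViT processor
    pixel_values = processor(images=image, return_tensors="pt")["pixel_values"][0]
    return {"pixel_values": pixel_values, "labels": celeb_id2label[celeb_id]}
```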
### Training Hyperparameters
- **Training regime:** fp16 mixed precision on GPU
- **Total epochs:** 4 (~3 minutes each on Colab T4)
### Speeds, Sizes, Times
- **Checkpoint size:** ~343 MB
- **Throughput:** ~10 samples/sec (Colab T4)
---
## Evaluation
- Validation Accuracy: ~0.93
- Test Accuracy: ~0.83
- Macro AUC: see the one-vs-rest ROC curves and the metric sketch below
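For reference, the accuracy and macro one-vs-rest AUC can be computed from predictions on the held-out test split with scikit-learn; `y_true` (true labels 0–19) and `y_score` (softmax probabilities, shape `[n_samples, 20]`) are placeholders:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# y_true: true labels (0-19); y_score: softmax probabilities, shape [n_samples, 20]
y_pred = np.argmax(y_score, axis=1)
accuracy = accuracy_score(y_true, y_pred)

# Macro AUC, one-vs-rest, matching the ROC curves shown below
macro_auc = roc_auc_score(y_true, y_score, multi_class="ovr", average="macro")
print(f"accuracy={accuracy:.3f}  macro_auc={macro_auc:.3f}")
```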
### Data Split Summary
| Split | #Images | #Classes | Min/Class | Median/Class | Max/Class |
|-------|---------|----------|-----------|--------------|-----------|
| Train | 501 | 20 | 24 | 24 | 28 |
| Val | 60 | 20 | 3 | 3 | 3 |
| Test | 77 | 20 | 3 | 4 | 4 |
### Results
**Confusion Matrix (normalized):**
![Confusion Matrix](./cm.png)
**ROC Curves (one-vs-rest):**
![ROC Curves](./roc.png)
---
## Environmental Impact
- **Hardware:** Google Colab T4 GPU
- **Training time:** ~12 minutes total (4 epochs)
- **Carbon emissions:** negligible (short fine-tuning run)
---
## Technical Specifications
### Model Architecture and Objective
- Vision Transformer (ViT-Base, patch16, 224×224)
- Objective: Cross-entropy classification across 20 labels
### Compute Infrastructure
- **Hardware:** Google Colab T4 GPU
- **Framework:** PyTorch + Hugging Face Transformers
---
## Citation
**CelebA Dataset:**
Z. Liu, P. Luo, X. Wang, and X. Tang. *Deep Learning Face Attributes in the Wild.* ICCV 2015.
**ViT:**
A. Dosovitskiy et al. *An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.* ICLR 2021.
---
## Model Card Authors
Hackathon submission by **Huda Akram**
## Contact
- Hugging Face profile: https://huggingface.co/hudaakram