---
library_name: transformers
license: apache-2.0
tags:
- image-classification
- vision-transformer
- vit
- celeba
- research
- fine-tuning
---

# FaceGuard – ViT (20 CelebA IDs)

A Vision Transformer (ViT-Base) fine-tuned for identity classification on a **20-identity subset of the CelebA dataset**. This model predicts **anonymized `celeb_id` integers** (not celebrity names).

It powers the demo Space: https://huggingface.co/spaces/hudaakram/FaceGuard-demo

---

## Model Details

### Model Description

- **Architecture:** `google/vit-base-patch16-224` (pretrained on ImageNet-1k)
- **Fine-tuned for:** 20-class identity classification (CelebA `celeb_id`s)
- **Input:** RGB image (face crop), resized and normalized to 224×224
- **Output:** Probability distribution over 20 anonymized IDs
- **Parameters:** ~86M

### Sources

- **Base model:** https://huggingface.co/google/vit-base-patch16-224
- **Demo Space:** https://huggingface.co/spaces/hudaakram/FaceGuard-demo
- **Dataset:** CelebA (community mirror on the Hub)

---

## Uses

### Direct Use

- Research demo for identity classification with anonymized CelebA IDs
- Educational example of fine-tuning ViT for image classification

### Downstream Use

- As a starting point for transfer learning to other **small identity classification tasks**
- As an educational reference for hackathons, workshops, or courses

### Out-of-Scope Use

- ❌ Production face recognition / surveillance
- ❌ Identifying real celebrity names (the dataset only provides integer IDs)
- ❌ Any high-stakes application involving privacy or personal data

---

## Bias, Risks, and Limitations

- **Bias:** CelebA contains celebrity faces, which are not representative of all demographics.
- **Limitations:** Trained on **only 20 identities (~600 images total)**, so generalization is limited.
- **Privacy:** CelebA IDs are anonymized integers, not real names. The model is not capable of returning actual identities.

**Recommendation:** Use strictly for **research/educational purposes**.

---

## How to Get Started

Use the code below to get started with the model.
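The model expects a cropped face as input. If you are starting from a full photo, any face detector can produce the crop first; the sketch below uses OpenCV's bundled Haar cascade purely as an illustration (OpenCV is not a dependency of this model, and `photo.jpg` is a hypothetical input file):

```python
# Optional: crop a face out of a full photo before classification.
# Illustrative only: uses OpenCV's bundled Haar cascade face detector;
# any face detector works here.
import cv2

img = cv2.imread("photo.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
if len(faces) == 0:
    raise RuntimeError("No face detected in photo.jpg")

x, y, w, h = faces[0]  # take the first detection
cv2.imwrite("face.jpg", img[y:y + h, x:x + w])
```

With the crop saved as `face.jpg`, run inference: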
```python
from transformers import ViTForImageClassification, AutoImageProcessor
from PIL import Image
import torch

model_id = "hudaakram/FaceGuard-20ID-ViT"
processor = AutoImageProcessor.from_pretrained(model_id)
model = ViTForImageClassification.from_pretrained(model_id)

# Preprocess a face crop: the processor resizes to 224x224 and
# normalizes with ImageNet mean/std, matching training
img = Image.open("face.jpg").convert("RGB")
inputs = processor(images=img, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)[0]

# id2label maps contiguous class indices (0-19) back to CelebA celeb_ids
id2label = {int(k): v for k, v in model.config.id2label.items()}
top5 = probs.topk(5)
for score, idx in zip(top5.values, top5.indices):
    print(f"Class {idx.item()} (celeb_id {id2label[idx.item()]}): {score.item():.3f}")
```

---

## Training Details

### Training Data

- **Dataset:** CelebA (top 20 identities by frequency)
- **Splits:** stratified, roughly 80% train / 10% validation / 10% test
- **Sizes:** train 501, validation 60, test 77

### Training Procedure

- **Seed:** 42
- **Epochs:** 4
- **Batch size:** 16
- **Learning rate:** 5e-5
- **Optimizer:** AdamW
- **Weight decay:** 0.01
- **Precision:** FP16 on GPU (Colab)
- **Head:** resized from 1000 ImageNet classes to 20 classes

### Preprocessing

- Images resized and center-cropped to 224×224
- Normalized with ImageNet mean/std
- Labels mapped from CelebA `celeb_id` to contiguous 0–19

### Training Hyperparameters

- **Training regime:** fp16 mixed precision on GPU
- **Total epochs:** 4 (~3 minutes each on a Colab T4)

### Speeds, Sizes, Times

- **Checkpoint size:** ~343 MB
- **Throughput:** ~10 samples/sec (Colab T4)

---

## Evaluation

- Validation accuracy: ~0.93
- Test accuracy: ~0.83
- Macro AUC: see the ROC curves below

### Split Summary

| Split | #Images | #Classes | Min/Class | Median/Class | Max/Class |
|-------|---------|----------|-----------|--------------|-----------|
| Train | 501     | 20       | 24        | 24           | 28        |
| Val   | 60      | 20       | 3         | 3            | 3         |
| Test  | 77      | 20       | 3         | 4            | 4         |

### Results

**Confusion Matrix (normalized):**

![Confusion Matrix](./cm.png)

**ROC Curves (one-vs-rest):**

![ROC Curves](./roc.png)

---

## Environmental Impact

- **Hardware:** Google Colab T4 GPU
- **Training time:** ~12 minutes total (4 epochs)
- **Carbon emissions:** negligible (short fine-tuning run)

---

## Technical Specifications

### Model Architecture and Objective

- Vision Transformer (ViT-Base, patch16, 224×224)
- Objective: cross-entropy classification across 20 labels

### Compute Infrastructure

- **Hardware:** Google Colab T4 GPU
- **Framework:** PyTorch + Hugging Face Transformers

---

## Citation

**CelebA Dataset:** Z. Liu, P. Luo, X. Wang, and X. Tang. *Deep Learning Face Attributes in the Wild.* ICCV 2015.

**ViT:** A. Dosovitskiy et al. *An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.* ICLR 2021.

---

## Model Card Authors

Hackathon submission by **Huda Akram**

## Contact

- Hugging Face profile: https://huggingface.co/hudaakram
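---

## Appendix: Fine-Tuning Sketch

The hyperparameters listed under Training Procedure map directly onto the `Trainer` API. The sketch below is illustrative, not the exact training script: the dataset is stubbed with random tensors because the specific CelebA mirror is not pinned in this card, so swap in the real 20-identity subset (with labels mapped to contiguous 0–19) before training.

```python
# Illustrative reproduction of the fine-tuning setup described above.
# Assumption: the dataset class is a random-tensor stand-in; replace it
# with the 20-identity CelebA subset used for this model.
import torch
from torch.utils.data import Dataset
from transformers import ViTForImageClassification, TrainingArguments, Trainer

class DummyFaces(Dataset):
    """Placeholder for the 20-ID CelebA subset (random data, illustrative)."""
    def __len__(self):
        return 32

    def __getitem__(self, idx):
        return {
            "pixel_values": torch.randn(3, 224, 224),  # preprocessed face crop
            "labels": torch.tensor(idx % 20),          # contiguous 0-19 labels
        }

# Load the ImageNet-pretrained backbone and resize the classification head
# from 1000 classes to 20, as described under Training Procedure.
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=20,
    ignore_mismatched_sizes=True,
)

args = TrainingArguments(
    output_dir="faceguard-vit",
    num_train_epochs=4,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    weight_decay=0.01,                    # AdamW is the Trainer default optimizer
    fp16=torch.cuda.is_available(),       # fp16 mixed precision on GPU
    seed=42,
    report_to="none",
)

Trainer(model=model, args=args, train_dataset=DummyFaces()).train()
```

Here `ignore_mismatched_sizes=True` is what allows the 1000-class ImageNet head to be dropped and replaced with a freshly initialized 20-class head.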