---
library_name: transformers
license: apache-2.0
tags:
- image-classification
- vision-transformer
- vit
- celeba
- research
- fine-tuning
---

# FaceGuard – ViT (20 CelebA IDs)

A Vision Transformer (ViT-Base) fine-tuned for identity classification on a **20-identity subset of the CelebA dataset**. This model predicts **anonymized `celeb_id` integers** (not celebrity names).

It powers the demo Space: https://huggingface.co/spaces/hudaakram/FaceGuard-demo

---

## Model Details

### Model Description

- **Architecture:** `google/vit-base-patch16-224` (pretrained on ImageNet-1k)
- **Fine-tuned for:** 20-class identity classification (CelebA `celeb_id`s)
- **Input:** RGB image (face crop), resized and normalized to 224×224
- **Output:** Probability distribution over 20 anonymized IDs
- **Parameters:** ~86M

### Sources

- **Base model:** https://huggingface.co/google/vit-base-patch16-224
- **Demo Space:** https://huggingface.co/spaces/hudaakram/FaceGuard-demo
- **Dataset:** CelebA (community mirror on the Hub)

---

## Uses

### Direct Use

- Research demo for identity classification with anonymized CelebA IDs
- Educational example of fine-tuning ViT for image classification

### Downstream Use

- As a starting point for transfer learning to other **small identity classification tasks**
- As an educational reference for hackathons, workshops, or courses

### Out-of-Scope Use

- ❌ Production face recognition / surveillance
- ❌ Identifying real celebrity names (the dataset only provides integer IDs)
- ❌ Any high-stakes application involving privacy or personal data

---

## Bias, Risks, and Limitations

- **Bias:** CelebA contains celebrity faces, which are not representative of all demographics.
- **Limitations:** Trained on **only 20 identities (~600 images total)**, so generalization is limited.
- **Privacy:** CelebA IDs are anonymized integers, not real names. The model is not capable of returning actual identities.

**Recommendation:** Use strictly for **research/educational purposes**.

---

## How to Get Started

Use the code below to get started with the model.
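The model expects a cropped face as input. If you are starting from a full photo, any face detector can produce the crop first; the sketch below uses OpenCV's bundled Haar cascade purely as an illustration (OpenCV is not a dependency of this model, and `photo.jpg` is a hypothetical input file):

```python
# Optional: crop a face out of a full photo before classification.
# Illustrative only: uses OpenCV's bundled Haar cascade face detector;
# any face detector works here.
import cv2

img = cv2.imread("photo.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
if len(faces) == 0:
    raise RuntimeError("No face detected in photo.jpg")

x, y, w, h = faces[0]  # take the first detection
cv2.imwrite("face.jpg", img[y:y + h, x:x + w])
```

With the crop saved as `face.jpg`, run inference: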
```python
from transformers import ViTForImageClassification, AutoImageProcessor
from PIL import Image
import torch

model_id = "hudaakram/FaceGuard-20ID-ViT"
processor = AutoImageProcessor.from_pretrained(model_id)
model = ViTForImageClassification.from_pretrained(model_id)

# Preprocess a face crop: the processor resizes to 224x224 and
# normalizes with ImageNet mean/std, matching training
img = Image.open("face.jpg").convert("RGB")
inputs = processor(images=img, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)[0]

# id2label maps contiguous class indices (0-19) back to CelebA celeb_ids
id2label = {int(k): v for k, v in model.config.id2label.items()}
top5 = probs.topk(5)
for score, idx in zip(top5.values, top5.indices):
    print(f"Class {idx.item()} (celeb_id {id2label[idx.item()]}): {score.item():.3f}")
```

---

## Training Details

### Training Data

- **Dataset:** CelebA (top 20 identities by frequency)
- **Splits:** stratified, roughly 80% train / 10% validation / 10% test
- **Sizes:** train 501, validation 60, test 77

### Training Procedure

- **Seed:** 42
- **Epochs:** 4
- **Batch size:** 16
- **Learning rate:** 5e-5
- **Optimizer:** AdamW
- **Weight decay:** 0.01
- **Precision:** FP16 on GPU (Colab)
- **Head:** resized from 1000 ImageNet classes to 20 classes

### Preprocessing

- Images resized and center-cropped to 224×224
- Normalized with ImageNet mean/std
- Labels mapped from CelebA `celeb_id` to contiguous 0–19

### Training Hyperparameters

- **Training regime:** fp16 mixed precision on GPU
- **Total epochs:** 4 (~3 minutes each on a Colab T4)

### Speeds, Sizes, Times

- **Checkpoint size:** ~343 MB
- **Throughput:** ~10 samples/sec (Colab T4)

---

## Evaluation

- Validation accuracy: ~0.93
- Test accuracy: ~0.83
- Macro AUC: see the ROC curves below

### Split Summary

| Split | #Images | #Classes | Min/Class | Median/Class | Max/Class |
|-------|---------|----------|-----------|--------------|-----------|
| Train | 501     | 20       | 24        | 24           | 28        |
| Val   | 60      | 20       | 3         | 3            | 3         |
| Test  | 77      | 20       | 3         | 4            | 4         |

### Results

**Confusion Matrix (normalized):**

![Confusion Matrix](./cm.png)

**ROC Curves (one-vs-rest):**

![ROC Curves](./roc.png)

---

## Environmental Impact

- **Hardware:** Google Colab T4 GPU
- **Training time:** ~12 minutes total (4 epochs)
- **Carbon emissions:** negligible (short fine-tuning run)

---

## Technical Specifications

### Model Architecture and Objective

- Vision Transformer (ViT-Base, patch16, 224×224)
- Objective: cross-entropy classification across 20 labels

### Compute Infrastructure

- **Hardware:** Google Colab T4 GPU
- **Framework:** PyTorch + Hugging Face Transformers

---

## Citation

**CelebA Dataset:** Z. Liu, P. Luo, X. Wang, and X. Tang. *Deep Learning Face Attributes in the Wild.* ICCV 2015.

**ViT:** A. Dosovitskiy et al. *An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.* ICLR 2021.

---

## Model Card Authors

Hackathon submission by **Huda Akram**

## Contact

- Hugging Face profile: https://huggingface.co/hudaakram
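---

## Appendix: Fine-Tuning Sketch

The hyperparameters listed under Training Procedure map directly onto the `Trainer` API. The sketch below is illustrative, not the exact training script: the dataset is stubbed with random tensors because the specific CelebA mirror is not pinned in this card, so swap in the real 20-identity subset (with labels mapped to contiguous 0–19) before training.

```python
# Illustrative reproduction of the fine-tuning setup described above.
# Assumption: the dataset class is a random-tensor stand-in; replace it
# with the 20-identity CelebA subset used for this model.
import torch
from torch.utils.data import Dataset
from transformers import ViTForImageClassification, TrainingArguments, Trainer

class DummyFaces(Dataset):
    """Placeholder for the 20-ID CelebA subset (random data, illustrative)."""
    def __len__(self):
        return 32

    def __getitem__(self, idx):
        return {
            "pixel_values": torch.randn(3, 224, 224),  # preprocessed face crop
            "labels": torch.tensor(idx % 20),          # contiguous 0-19 labels
        }

# Load the ImageNet-pretrained backbone and resize the classification head
# from 1000 classes to 20, as described under Training Procedure.
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=20,
    ignore_mismatched_sizes=True,
)

args = TrainingArguments(
    output_dir="faceguard-vit",
    num_train_epochs=4,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    weight_decay=0.01,                    # AdamW is the Trainer default optimizer
    fp16=torch.cuda.is_available(),       # fp16 mixed precision on GPU
    seed=42,
    report_to="none",
)

Trainer(model=model, args=args, train_dataset=DummyFaces()).train()
```

Here `ignore_mismatched_sizes=True` is what allows the 1000-class ImageNet head to be dropped and replaced with a freshly initialized 20-class head.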