---
library_name: transformers
license: apache-2.0
tags:
- image-classification
- vision-transformer
- vit
- celeba
- research
- fine-tuning
---
# FaceGuard – ViT (20 CelebA IDs)
A Vision Transformer (ViT-Base) fine-tuned for identity classification on a **20-identity subset of the CelebA dataset**.
This model predicts **anonymized `celeb_id` integers** (not celebrity names).
It powers the demo Space: https://huggingface.co/spaces/hudaakram/FaceGuard-demo
---
## Model Details
### Model Description
- **Architecture:** `google/vit-base-patch16-224` (pretrained on ImageNet-1k)
- **Fine-tuned for:** 20-class identity classification (CelebA `celeb_id`s)
- **Input:** RGB image (face crop), resized and normalized to 224×224
- **Output:** Probability distribution over 20 anonymized IDs
- **Parameters:** ~86M
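The label space and input size listed above can be verified directly from the published checkpoint; a minimal sketch (the values in the comments are the expected ones, not guaranteed outputs):

```python
from transformers import AutoImageProcessor, ViTForImageClassification

model_id = "hudaakram/FaceGuard-20ID-ViT"
processor = AutoImageProcessor.from_pretrained(model_id)
model = ViTForImageClassification.from_pretrained(model_id)

print(model.config.num_labels)                    # expected: 20 anonymized celeb_id classes
print(sorted(model.config.id2label.items())[:3])  # first few (class index, celeb_id) pairs
print(processor.size)                             # expected: 224x224 input resolution
```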
### Sources
- **Base model:** https://huggingface.co/google/vit-base-patch16-224
- **Demo Space:** https://huggingface.co/spaces/hudaakram/FaceGuard-demo
- **Dataset:** CelebA (community mirror on the Hub)
---
## Uses
### Direct Use
- Research demo for identity classification with anonymized CelebA IDs
- Educational example of fine-tuning ViT for image classification
### Downstream Use
- As a starting point for transfer learning to other **small identity classification tasks**
- As an educational reference for hackathons, workshops, or courses
### Out-of-Scope Use
- ❌ Production face recognition / surveillance
- ❌ Identifying real celebrity names (dataset only provides integer IDs)
- ❌ Any high-stakes application involving privacy or personal data
---
## Bias, Risks, and Limitations
- **Bias:** CelebA contains celebrity faces, which are not representative of all demographics.
- **Limitations:** Trained on **only 20 identities (~640 images total)**, so generalization beyond this small subset is limited.
- **Privacy:** CelebA IDs are anonymized integers, not real names. The model is not capable of returning actual identities.
**Recommendation:** Use strictly for **research/educational purposes**.
---
## How to Get Started
Use the code below to get started with the model.
```python
from transformers import ViTForImageClassification, AutoImageProcessor
from PIL import Image
import torch

model_id = "hudaakram/FaceGuard-20ID-ViT"
processor = AutoImageProcessor.from_pretrained(model_id)
model = ViTForImageClassification.from_pretrained(model_id)
model.eval()

# Load a face crop and preprocess it (resize + normalize to 224x224)
img = Image.open("face.jpg").convert("RGB")
inputs = processor(images=img, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)[0]

# id2label maps each class index to its anonymized CelebA celeb_id
id2label = {int(k): v for k, v in model.config.id2label.items()}

# Report the five most likely identities
top5 = probs.topk(5)
for score, idx in zip(top5.values, top5.indices):
    print(f"Label {idx.item()} (celeb_id {id2label[idx.item()]}): {score.item():.3f}")
```
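Each printed line shows the model's internal class index, the anonymized CelebA `celeb_id` it maps to through `id2label`, and the softmax probability for that identity.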
## Training Details
### Training Data
- **Dataset:** CelebA (top 20 identities by frequency)
- **Splits:** Stratified 80% train / 10% validation / 10% test
- **Sizes:** Train 501, Val 60, Test 77
### Training Procedure
- **Seed:** 42
- **Epochs:** 4
- **Batch size:** 16
- **Learning rate:** 5e-5
- **Optimizer:** AdamW
- **Weight decay:** 0.01
- **Precision:** FP16 on GPU (Colab)
- **Head resized:** from 1000 ImageNet classes → 20 identity classes (see the sketch below)
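The settings above correspond to a standard `Trainer` run; the following is a minimal, non-authoritative sketch of such a setup, where `train_ds` is a placeholder for the preprocessed 20-identity CelebA subset (see the preprocessing sketch further down), not the exact training script:

```python
from transformers import ViTForImageClassification, TrainingArguments, Trainer

# Resize the pretrained 1000-class ImageNet head to 20 identity classes
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=20,
    ignore_mismatched_sizes=True,  # drop the old classifier weights, init a fresh 20-way head
)

args = TrainingArguments(
    output_dir="faceguard-vit",
    seed=42,
    num_train_epochs=4,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    weight_decay=0.01,   # AdamW is the Trainer's default optimizer
    fp16=True,           # mixed precision on the Colab GPU
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,  # placeholder: examples with "pixel_values" and "labels"
)
trainer.train()
```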
### Preprocessing
- Images resized + center-cropped to 224×224
- Normalized to ImageNet mean/std
- Labels mapped from CelebA `celeb_id` → contiguous 0–19 (see the sketch after this list)
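A sketch of the label mapping and per-image transform described above; `kept_celeb_ids` (the 20 retained identity IDs), `image`, and `celeb_id` are placeholders, and the exact crop settings may differ slightly from the original script:

```python
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")

# Map the raw CelebA celeb_id values of the 20 kept identities to contiguous labels 0-19
celeb_id2label = {cid: i for i, cid in enumerate(sorted(kept_celeb_ids))}
id2label = {i: cid for cid, i in celeb_id2label.items()}  # stored in the model config

def preprocess(image, celeb_id):
    # Resize to 224x224 and normalize with ImageNet mean/std via the ViT processor
    pixel_values = processor(images=image, return_tensors="pt")["pixel_values"][0]
    return {"pixel_values": pixel_values, "labels": celeb_id2label[celeb_id]}
```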
### Training Hyperparameters
- **Training regime:** fp16 mixed precision on GPU
- **Total epochs:** 4 (~3 minutes each on Colab T4)
### Speeds, Sizes, Times
- **Checkpoint size:** ~343 MB
- **Throughput:** ~10 samples/sec (Colab T4)
---
## Evaluation
- Validation Accuracy: ~0.93
- Test Accuracy: ~0.83
- Macro AUC: see the one-vs-rest ROC curves and the metric sketch below
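For reference, the accuracy and macro one-vs-rest AUC can be computed from predictions on the held-out test split with scikit-learn; `y_true` (true labels 0–19) and `y_score` (softmax probabilities, shape `[n_samples, 20]`) are placeholders:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# y_true: true labels (0-19); y_score: softmax probabilities, shape [n_samples, 20]
y_pred = np.argmax(y_score, axis=1)
accuracy = accuracy_score(y_true, y_pred)

# Macro AUC, one-vs-rest, matching the ROC curves shown below
macro_auc = roc_auc_score(y_true, y_score, multi_class="ovr", average="macro")
print(f"accuracy={accuracy:.3f}  macro_auc={macro_auc:.3f}")
```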
### Data Split Summary
| Split | #Images | #Classes | Min/Class | Median/Class | Max/Class |
|-------|---------|----------|-----------|--------------|-----------|
| Train | 501 | 20 | 24 | 24 | 28 |
| Val | 60 | 20 | 3 | 3 | 3 |
| Test | 77 | 20 | 3 | 4 | 4 |
### Results
**Confusion Matrix (normalized):**
![Confusion Matrix](./cm.png)
**ROC Curves (one-vs-rest):**
![ROC Curves](./roc.png)
---
## Environmental Impact
- **Hardware:** Google Colab T4 GPU
- **Training time:** ~12 minutes total (4 epochs)
- **Carbon emissions:** negligible (short fine-tuning run)
---
## Technical Specifications
### Model Architecture and Objective
- Vision Transformer (ViT-Base, patch16, 224×224)
- Objective: Cross-entropy classification across 20 labels
### Compute Infrastructure
- **Hardware:** Google Colab T4 GPU
- **Framework:** PyTorch + Hugging Face Transformers
---
## Citation
**CelebA Dataset:**
Z. Liu, P. Luo, X. Wang, and X. Tang. *Deep Learning Face Attributes in the Wild.* ICCV 2015.
**ViT:**
A. Dosovitskiy et al. *An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.* ICLR 2021.
---
## Model Card Authors
Hackathon submission by **Huda Akram**
## Contact
- Hugging Face profile: https://huggingface.co/hudaakram