---
license: mit
tags:
- face-generation
- computer-vision
- vision-transformer
- deepfake
- image-generation
- pytorch
- research-only
- vit
- cross-attention
language:
- en
library_name: pytorch
pipeline_tag: image-to-image
---

# FaceForge Generator: Vision Transformer-based Face Manipulation

[DOI: 10.5281/zenodo.18530439](https://doi.org/10.5281/zenodo.18530439)
[GitHub: Huzaifanasir95/FaceForge](https://github.com/Huzaifanasir95/FaceForge)
[License: MIT](https://opensource.org/licenses/MIT)

**252M Parameters | ViT-Based | Baseline Training Complete**

⚠️ **RESEARCH USE ONLY** - This model is intended for academic research and for developing deepfake detection systems.

## Model Description

FaceForge Generator is a Vision Transformer-based facial manipulation system that learns to synthesize realistic face swaps. The model combines dual ViT encoders, a cross-attention fusion module, a transformer decoder, and a CNN upsampler to generate high-quality facial manipulations.

**Key Features:**

- 252 million trainable parameters
- Dual encoder architecture for source and target faces
- Cross-attention fusion mechanism
- Generates 224×224 RGB face images
- ~300 ms inference time per image
- Reached 0.204 validation loss after 3 epochs

## Model Architecture

```
FaceForge Generator (252.5M parameters)
│
├── ViT Encoders (172M params)
│   ├── Source Encoder: ViT-B/16 (86M)
│   │   └── 12 layers, 768-dim, 12 heads
│   └── Target Encoder: ViT-B/16 (86M)
│       └── 12 layers, 768-dim, 12 heads
│
├── Cross-Attention Module (14M params)
│   ├── 2 layers, 8 heads
│   ├── FFN: 768 → 3072 → 768
│   └── Dropout: 0.1
│
├── Transformer Decoder (58M params)
│   ├── 256 learnable queries (16×16)
│   ├── 6 decoder layers, 8 heads
│   └── 2D positional embeddings
│
└── CNN Upsampler (9M params)
    ├── TransposeConv: 768→512→256→128→64
    ├── 4 upsampling stages (16×16 → 224×224)
    └── Conv: 64→32→3 + Tanh
```
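
A quick sanity check on the advertised parameter count, assuming the `FaceForgeGenerator` class from the Usage section below has been instantiated as `model`:

```python
# Count trainable parameters; should land near 252.5M for the full model
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{n_params / 1e6:.1f}M trainable parameters")
```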

## Training Progress

### Baseline Training (3 Epochs)

| Epoch | Train Loss | Val Loss | Time (min) |
|-------|------------|----------|------------|
| 1     | 0.2873     | 0.2804   | 227.5      |
| 2     | 0.2432     | 0.2304   | 231.2      |
| 3     | 0.2143     | 0.2043   | 228.8      |

**Total Training Time:** 11.5 hours (687.5 minutes)

### Loss Reduction

- Training loss: 0.287 → 0.214 (25.4% reduction)
- Validation loss: 0.280 → 0.204 (27.1% reduction)
- Minimal overfitting (train-val gap: 0.010)

## Usage

### Installation

```bash
pip install torch torchvision timm pillow numpy
```

### Loading the Model

```python
import torch
import torch.nn as nn
import timm
from PIL import Image
from torchvision import transforms


class FaceForgeGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        # Source and target ViT encoders (ViT-B/16, classification head removed)
        self.source_encoder = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=0)
        self.target_encoder = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=0)

        # Cross-attention module, transformer decoder, and CNN upsampler
        # go here (see the full architecture in the paper)

    def forward(self, source_face, target_face):
        # Encode both faces into token sequences
        source_features = self.source_encoder.forward_features(source_face)
        target_features = self.target_encoder.forward_features(target_face)

        # Cross-attention fusion: source tokens attend to target tokens
        fused_features = self.cross_attention(source_features, target_features)

        # Decode fused tokens into a 16x16 spatial feature map
        spatial_features = self.transformer_decoder(fused_features)

        # Upsample to a 224x224 RGB image in [-1, 1] (Tanh output)
        generated_face = self.cnn_upsampler(spatial_features)

        return generated_face


# Load checkpoint
model = FaceForgeGenerator()
checkpoint = torch.load('generator_best.pth', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Preprocessing: resize, convert to tensor, normalize to [-1, 1]
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])


# Generate a face swap from two image paths
def generate_face_swap(source_path, target_path):
    source = transform(Image.open(source_path).convert('RGB')).unsqueeze(0)
    target = transform(Image.open(target_path).convert('RGB')).unsqueeze(0)

    with torch.no_grad():
        generated = model(source, target)

    # Denormalize from [-1, 1] to [0, 1] and convert to PIL
    generated = (generated[0] * 0.5 + 0.5).clamp(0, 1)
    return transforms.ToPILImage()(generated)


# Example
result = generate_face_swap("source.jpg", "target.jpg")
result.save("generated.jpg")
```
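
The cross-attention, decoder, and upsampler modules are elided in the skeleton above. Below is a minimal sketch of what they might look like, based only on the dimensions listed under Architecture Details; the class names (`CrossAttentionFusion`, `SpatialDecoder`, `CNNUpsampler`) are illustrative, the 2D positional embeddings are omitted for brevity, and the final resize from 256×256 to 224×224 is an assumption, so treat this as a sketch rather than the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossAttentionFusion(nn.Module):
    """Source tokens (queries) attend to target tokens (keys/values): 2 layers, 8 heads."""
    def __init__(self, dim=768, heads=8, depth=2, dropout=0.1):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.ModuleDict({
                'attn': nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True),
                'norm1': nn.LayerNorm(dim),
                'ffn': nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(),
                                     nn.Dropout(dropout), nn.Linear(dim * 4, dim)),
                'norm2': nn.LayerNorm(dim),
            }) for _ in range(depth)
        ])

    def forward(self, source_tokens, target_tokens):
        x = source_tokens
        for blk in self.blocks:
            attn_out, _ = blk['attn'](x, target_tokens, target_tokens)
            x = blk['norm1'](x + attn_out)       # residual + norm after attention
            x = blk['norm2'](x + blk['ffn'](x))  # residual + norm after FFN
        return x


class SpatialDecoder(nn.Module):
    """256 learnable queries decoded against the fused tokens into a 16x16 feature map."""
    def __init__(self, dim=768, heads=8, depth=6):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, 256, dim) * 0.02)  # 16x16 grid of queries
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=dim * 4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)

    def forward(self, fused_tokens):
        b = fused_tokens.size(0)
        out = self.decoder(self.queries.expand(b, -1, -1), fused_tokens)  # (B, 256, 768)
        return out.transpose(1, 2).reshape(b, -1, 16, 16)                 # (B, 768, 16, 16)


class CNNUpsampler(nn.Module):
    """Four stride-2 transpose convs (768->512->256->128->64), then a 3-channel Tanh head."""
    def __init__(self):
        super().__init__()
        chans = [768, 512, 256, 128, 64]
        layers = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            layers += [nn.ConvTranspose2d(cin, cout, kernel_size=4, stride=2, padding=1),
                       nn.ReLU(inplace=True)]
        self.up = nn.Sequential(*layers)
        self.head = nn.Sequential(nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(inplace=True),
                                  nn.Conv2d(32, 3, 3, padding=1), nn.Tanh())

    def forward(self, x):
        x = self.up(x)  # four doublings: 16x16 -> 256x256
        # ASSUMPTION: the card lists a 224x224 output, but four stride-2 stages from
        # 16x16 land at 256x256, so a resize is applied here; the released model may
        # handle this differently (e.g. a crop).
        x = F.interpolate(x, size=224, mode='bilinear', align_corners=False)
        return self.head(x)
```

Wiring these in as `self.cross_attention`, `self.transformer_decoder`, and `self.cnn_upsampler` in `FaceForgeGenerator.__init__` makes the skeleton above runnable end to end, though the resulting weights will not match the released checkpoint unless the layer wiring matches exactly.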

## Training Details

### Dataset

- **Source:** FaceForensics++ (c40 compression)
- **Training:** 7,000 face images (triplets: source, target, ground truth)
- **Validation:** 1,500 face images
- **Resolution:** 224×224 RGB

### Hyperparameters

```yaml
optimizer: AdamW
learning_rate: 1e-4
betas: [0.9, 0.999]
weight_decay: 1e-4
batch_size: 16
epochs: 3 (baseline)
loss_function: L1 (Mean Absolute Error)
lr_schedule: Cosine Annealing (1e-4 → 1e-6)
```
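
For reference, a minimal training setup matching these hyperparameters might look like the following; `model` and `train_loader` (yielding source/target/ground-truth triplets) are assumed to exist, and annealing per step rather than per epoch is an assumption, since the card does not state the schedule granularity.

```python
import torch

num_epochs = 3
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4,
                              betas=(0.9, 0.999), weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=num_epochs * len(train_loader), eta_min=1e-6)
criterion = torch.nn.L1Loss()  # mean absolute error on pixels

for epoch in range(num_epochs):
    for source, target, ground_truth in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(source, target), ground_truth)
        loss.backward()
        optimizer.step()
        scheduler.step()  # anneal 1e-4 -> 1e-6 over all steps
```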

### Training Configuration

- **Hardware:** CPU
- **Throughput:** ~32 samples/minute
- **Batch Processing:** 219 train batches, 47 val batches per epoch
- **Best Model:** Saved at epoch 3

## Current Status

⚠️ **Baseline Training:** This model has completed 3 epochs of baseline training. For production-quality face generation, extended training (15-20 epochs) is recommended.

**Current Capabilities:**

- ✅ Learns pose transfer
- ✅ Captures facial structures
- ✅ Shows a clear convergence trend
- ⏳ Some blur in generated images (expected at the baseline stage)
- ⏳ Benefits from extended training

## Use Cases

### Research Applications

1. **Detector Training:** Generate challenging samples for deepfake detection
2. **Adversarial Training:** Train generator and detector in a min-max game
3. **Understanding Manipulation:** Study how synthetic faces are created
4. **Benchmark Creation:** Generate test sets for evaluation

### Educational Uses

- Demonstrate face generation techniques
- Teach computer vision concepts
- Illustrate transformer architectures
- Visualize attention mechanisms

## Limitations

1. **Training Duration:** Only 3 epochs completed; extended training is needed for photo-realism
2. **Blur:** Generated faces show some blur at the baseline stage
3. **Dataset Scale:** Trained on only 8,500 images (7,000 train + 1,500 validation); larger datasets would improve quality
4. **Single Frame:** Does not model temporal consistency for video
5. **Compute:** Large model (252M params) requires significant memory

## Ethical Guidelines

⚠️ **Responsible Use Required**

This model is intended for:

- ✅ Academic research
- ✅ Deepfake detection development
- ✅ Educational demonstrations
- ✅ Ethical AI studies

**Prohibited uses:**

- ❌ Creating misinformation
- ❌ Identity theft or impersonation
- ❌ Non-consensual face manipulation
- ❌ Malicious content creation

**Recommendations:**

- Watermark generated content
- Maintain audit logs
- Require user consent
- Implement content filters

## Future Improvements

Planned enhancements:

- [ ] Extended training (15-20 epochs)
- [ ] Perceptual loss functions (VGG, LPIPS)
- [ ] GAN-based adversarial training
- [ ] Multi-scale architecture
- [ ] Attention visualization
- [ ] Video temporal consistency

## Citation

```bibtex
@techreport{nasir2026faceforge,
  title={FaceForge: A Deep Learning Framework for Facial Manipulation Generation and Detection},
  author={Nasir, Huzaifa},
  institution={National University of Computer and Emerging Sciences},
  year={2026},
  doi={10.5281/zenodo.18530439}
}
```

## Links

- **Paper:** https://doi.org/10.5281/zenodo.18530439
- **Code:** https://github.com/Huzaifanasir95/FaceForge
- **Detector Model:** https://huggingface.co/Huzaifanasir95/faceforge-detector
- **Notebooks:** See the repository for training and inference notebooks

## Architecture Details

### Vision Transformer Encoder

- **Patch Size:** 16×16
- **Patches:** 196 + 1 CLS token
- **Embedding Dim:** 768
- **Layers:** 12
- **Attention Heads:** 12
- **MLP Ratio:** 4.0
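
These token counts are easy to verify with timm, assuming the same model id used in the Usage section:

```python
import timm
import torch

encoder = timm.create_model('vit_base_patch16_224', pretrained=False, num_classes=0)
tokens = encoder.forward_features(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 197, 768]): 196 patch tokens (14x14 grid) + 1 CLS token
```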

### Cross-Attention Mechanism

- **Query:** Source features
- **Key/Value:** Target features
- **Attention:** Multi-head (8 heads)
- **FFN Expansion:** 4× (768 → 3072 → 768)

### CNN Upsampler

- **Input:** 768×16×16
- **Output:** 3×224×224
- **Stages:** 4 transpose convolutions
- **Kernel:** 4×4, Stride: 2, Padding: 1
- **Activation:** ReLU → Tanh (output)
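
The spatial size of each transpose-conv stage follows out = (in − 1) · stride − 2 · padding + kernel, so every 4×4/stride-2/padding-1 stage exactly doubles the input. A quick trace shows four stages take 16×16 to 256×256, which implies a final resize or crop to reach the stated 224×224 output; the card does not specify that step.

```python
def up_out(size, kernel=4, stride=2, padding=1):
    # Standard transpose-convolution output-size formula
    return (size - 1) * stride - 2 * padding + kernel

sizes = [16]
for _ in range(4):
    sizes.append(up_out(sizes[-1]))
print(sizes)  # [16, 32, 64, 128, 256]
```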

## License

This model is released under the MIT license. Use responsibly and ethically.

## Author

**Huzaifa Nasir**
National University of Computer and Emerging Sciences (NUCES)
Islamabad, Pakistan
📧 nasirhuzaifa95@gmail.com

## Acknowledgments

- Vision Transformer (Dosovitskiy et al.)
- FaceForensics++ dataset
- PyTorch and timm libraries
- Open-source AI community