---
license: mit
tags:
- face-generation
- computer-vision
- vision-transformer
- deepfake
- image-generation
- pytorch
- research-only
- vit
- cross-attention
language:
- en
library_name: pytorch
pipeline_tag: image-to-image
---

# FaceForge Generator: Vision Transformer-based Face Manipulation

[![Paper](https://img.shields.io/badge/Paper-Zenodo-blue)](https://doi.org/10.5281/zenodo.18530439)
[![GitHub](https://img.shields.io/badge/GitHub-Repository-black)](https://github.com/Huzaifanasir95/FaceForge)
[![License](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)

🎨 **252M Parameters | ViT-Based | Baseline Training Complete**

⚠️ **RESEARCH USE ONLY** - This model is intended for academic research and for developing deepfake detection systems.

## Model Description

FaceForge Generator is a Vision Transformer-based facial manipulation system that learns to synthesize realistic face swaps. The model combines dual ViT encoders, a cross-attention fusion module, a transformer decoder, and a CNN upsampler to generate facial manipulations at 224×224 resolution.

**Key Features:**
- 🏗️ 252 million trainable parameters
- 🔄 Dual encoder architecture for source and target faces
- 🎯 Cross-attention fusion mechanism
- 🖼️ Generates 224×224 RGB face images
- ⚡ ~300 ms inference time per image
- 📉 0.204 validation loss after 3 epochs

## Model Architecture

```
FaceForge Generator (252.5M parameters)
│
├── ViT Encoders (172M params)
│   ├── Source Encoder: ViT-B/16 (86M)
│   │   └── 12 layers, 768-dim, 12 heads
│   └── Target Encoder: ViT-B/16 (86M)
│       └── 12 layers, 768-dim, 12 heads
│
├── Cross-Attention Module (14M params)
│   ├── 2 layers, 8 heads
│   ├── FFN: 768 → 3072 → 768
│   └── Dropout: 0.1
│
├── Transformer Decoder (58M params)
│   ├── 256 learnable queries (16×16)
│   ├── 6 decoder layers, 8 heads
│   └── 2D positional embeddings
│
└── CNN Upsampler (9M params)
    ├── TransposeConv: 768→512→256→128→64
    ├── 4 upsampling stages (16×16 → 224×224)
    └── Conv: 64→32→3 + Tanh
```

## Training Progress

### Baseline Training (3 Epochs)

| Epoch | Train Loss | Val Loss | Time (min) |
|-------|------------|----------|------------|
| 1     | 0.2873     | 0.2804   | 227.5      |
| 2     | 0.2432     | 0.2304   | 231.2      |
| 3     | 0.2143     | 0.2043   | 228.8      |

**Total Training Time:** 11.5 hours (687.5 minutes)

### Loss Reduction
- Training loss: 0.287 → 0.214 (25.3% reduction)
- Validation loss: 0.280 → 0.204 (27.1% reduction)
- Minimal overfitting (train-val gap: 0.010)

## Usage

### Installation

```bash
pip install torch torchvision timm pillow numpy
```

### Loading the Model

```python
import torch
import torch.nn as nn
import timm
from PIL import Image
from torchvision import transforms

class FaceForgeGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        # Source and target ViT encoders
        self.source_encoder = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=0)
        self.target_encoder = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=0)
        # Cross-attention module
        # Transformer decoder
        # CNN upsampler
        # ... (see the full architecture in the paper, and the sketch below)

    def forward(self, source_face, target_face):
        # Encode both faces
        source_features = self.source_encoder.forward_features(source_face)
        target_features = self.target_encoder.forward_features(target_face)
        # Cross-attention fusion
        fused_features = self.cross_attention(source_features, target_features)
        # Decode to a spatial feature map
        spatial_features = self.transformer_decoder(fused_features)
        # Upsample to 224×224
        generated_face = self.cnn_upsampler(spatial_features)
        return generated_face

# Load checkpoint
model = FaceForgeGenerator()
checkpoint = torch.load('generator_best.pth', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Preprocessing
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])

# Generate a face swap
def generate_face_swap(source_path, target_path):
    source = transform(Image.open(source_path).convert('RGB')).unsqueeze(0)
    target = transform(Image.open(target_path).convert('RGB')).unsqueeze(0)

    with torch.no_grad():
        generated = model(source, target)

    # Denormalize and convert to PIL
    generated = (generated[0] * 0.5 + 0.5).clamp(0, 1)
    generated = transforms.ToPILImage()(generated)
    return generated

# Example
result = generate_face_swap("source.jpg", "target.jpg")
result.save("generated.jpg")
```
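### Implementing the Custom Modules

The class above leaves the fusion, decoding, and upsampling components as placeholders. Below is a minimal sketch of how they could be filled in from the dimensions listed under Model Architecture. The class names (`CrossAttentionFusion`, `SpatialDecoder`, `CNNUpsampler`), the GELU activation in the fusion FFN, and the final bilinear resize from 256×256 down to 224×224 are assumptions for illustration, not the released implementation; the exact code lives in the repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionFusion(nn.Module):
    """Source tokens attend to target tokens: 2 layers, 8 heads, FFN 768 -> 3072 -> 768."""
    def __init__(self, dim=768, heads=8, ffn_dim=3072, num_layers=2, dropout=0.1):
        super().__init__()
        self.layers = nn.ModuleList()
        for _ in range(num_layers):
            self.layers.append(nn.ModuleDict({
                'attn': nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True),
                'norm1': nn.LayerNorm(dim),
                'norm2': nn.LayerNorm(dim),
                'ffn': nn.Sequential(
                    nn.Linear(dim, ffn_dim), nn.GELU(),
                    nn.Dropout(dropout), nn.Linear(ffn_dim, dim),
                ),
            }))

    def forward(self, source_tokens, target_tokens):
        x = source_tokens
        for layer in self.layers:
            # Query = source features, Key/Value = target features
            attn_out, _ = layer['attn'](x, target_tokens, target_tokens)
            x = layer['norm1'](x + attn_out)
            x = layer['norm2'](x + layer['ffn'](x))
        return x

class SpatialDecoder(nn.Module):
    """256 learnable queries (a flattened 16x16 grid) cross-attend to the fused tokens."""
    def __init__(self, dim=768, heads=8, num_layers=6, num_queries=256):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        # 2D positional embeddings, stored flattened over the 16x16 grid
        self.pos_embed = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=3072, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, fused_tokens):
        b = fused_tokens.size(0)
        q = (self.queries + self.pos_embed).expand(b, -1, -1)
        out = self.decoder(q, fused_tokens)                  # (B, 256, 768)
        return out.transpose(1, 2).reshape(b, 768, 16, 16)   # to a spatial map

class CNNUpsampler(nn.Module):
    """Four stride-2 transpose convolutions (768->512->256->128->64), then 64->32->3 + Tanh.
    Four doublings take 16x16 to 256x256; the resize to 224x224 here is an assumption."""
    def __init__(self):
        super().__init__()
        chans = [768, 512, 256, 128, 64]
        blocks = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            blocks += [nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                       nn.ReLU(inplace=True)]
        self.up = nn.Sequential(*blocks)
        self.head = nn.Sequential(
            nn.Conv2d(64, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 3, kernel_size=3, padding=1), nn.Tanh(),
        )

    def forward(self, x):
        x = self.up(x)                                       # (B, 64, 256, 256)
        x = F.interpolate(x, size=224, mode='bilinear', align_corners=False)
        return self.head(x)                                  # (B, 3, 224, 224)
```

Assigning `self.cross_attention = CrossAttentionFusion()`, `self.transformer_decoder = SpatialDecoder()`, and `self.cnn_upsampler = CNNUpsampler()` in `__init__` makes the forward pass above run end to end, and these dimensions roughly reproduce the stated 14M/58M/9M parameter counts. The checkpoint's state-dict keys will only match if the repository's actual definitions are used.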
## Training Details

### Dataset
- **Source:** FaceForensics++ (c40 compression)
- **Training:** 7,000 face images (triplets: source, target, ground truth)
- **Validation:** 1,500 face images
- **Resolution:** 224×224 RGB

### Hyperparameters

```yaml
optimizer: AdamW
learning_rate: 1e-4
betas: [0.9, 0.999]
weight_decay: 1e-4
batch_size: 16
epochs: 3 (baseline)
loss_function: L1 (Mean Absolute Error)
lr_schedule: Cosine Annealing (1e-4 → 1e-6)
```

### Training Configuration
- **Hardware:** CPU
- **Throughput:** ~32 samples/minute
- **Batch Processing:** 219 train batches, 47 val batches per epoch
- **Best Model:** Saved at epoch 3
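### Baseline Training Loop

A minimal sketch of the training loop implied by these hyperparameters: AdamW with the listed betas and weight decay, an L1 reconstruction loss against the ground-truth swap, cosine annealing from 1e-4 to 1e-6, and checkpointing on best validation loss. The dataloaders (`train_loader`, `val_loader`) and the triplet batch format are placeholders; the actual training script is in the repository.

```python
import torch
import torch.nn as nn

model = FaceForgeGenerator()  # as defined under Usage above
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4,
                              betas=(0.9, 0.999), weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=3, eta_min=1e-6)
criterion = nn.L1Loss()  # mean absolute error

best_val = float('inf')
for epoch in range(3):
    # Training pass over (source, target, ground_truth) triplets
    model.train()
    for source, target, ground_truth in train_loader:  # placeholder dataloader
        optimizer.zero_grad()
        generated = model(source, target)
        loss = criterion(generated, ground_truth)
        loss.backward()
        optimizer.step()
    scheduler.step()

    # Validation pass; keep the checkpoint with the lowest validation loss
    model.eval()
    val_loss, n_batches = 0.0, 0
    with torch.no_grad():
        for source, target, ground_truth in val_loader:  # placeholder dataloader
            val_loss += criterion(model(source, target), ground_truth).item()
            n_batches += 1
    val_loss /= max(n_batches, 1)
    if val_loss < best_val:
        best_val = val_loss
        torch.save({'model_state_dict': model.state_dict()}, 'generator_best.pth')
```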
## Current Status

⚠️ **Baseline Training:** This model has completed 3 epochs of baseline training. For production-quality face generation, extended training (15-20 epochs) is recommended.

**Current Capabilities:**
- ✅ Learns pose transfer
- ✅ Captures facial structure
- ✅ Shows a converging loss trend
- ⏳ Some blur in generated images (expected at the baseline stage)
- ⏳ Would benefit from extended training

## Use Cases

### Research Applications
1. **Detector Training:** Generate challenging samples for deepfake detection
2. **Adversarial Training:** Min-max training against a detector
3. **Understanding Manipulation:** Study how synthetic faces are created
4. **Benchmark Creation:** Generate test sets for evaluation

### Educational Uses
- Demonstrate face generation techniques
- Teach computer vision concepts
- Illustrate transformer architectures
- Visualize attention mechanisms

## Limitations

1. **Training Duration:** Only 3 epochs completed; extended training is needed for photo-realism
2. **Blur:** Generated faces show some blur at the baseline stage
3. **Dataset Scale:** Trained on roughly 10K images; larger datasets would improve quality
4. **Single Frame:** Does not model temporal consistency for video
5. **Compute:** The 252M-parameter model requires significant memory

## Ethical Guidelines

⚠️ **Responsible Use Required**

This model is intended for:
- ✅ Academic research
- ✅ Deepfake detection development
- ✅ Educational demonstrations
- ✅ Ethical AI studies

**Prohibited uses:**
- ❌ Creating misinformation
- ❌ Identity theft or impersonation
- ❌ Non-consensual face manipulation
- ❌ Malicious content creation

**Recommendations:**
- Watermark generated content
- Maintain audit logs
- Require user consent
- Implement content filters

## Future Improvements

Planned enhancements:
- [ ] Extended training (15-20 epochs)
- [ ] Perceptual loss functions (VGG, LPIPS)
- [ ] GAN-based adversarial training
- [ ] Multi-scale architecture
- [ ] Attention visualization
- [ ] Video temporal consistency

## Citation

```bibtex
@techreport{nasir2026faceforge,
  title={FaceForge: A Deep Learning Framework for Facial Manipulation Generation and Detection},
  author={Nasir, Huzaifa},
  institution={National University of Computer and Emerging Sciences},
  year={2026},
  doi={10.5281/zenodo.18530439}
}
```

## Links

- 📄 **Paper:** https://doi.org/10.5281/zenodo.18530439
- 💻 **Code:** https://github.com/Huzaifanasir95/FaceForge
- 🔍 **Detector Model:** https://huggingface.co/Huzaifanasir95/faceforge-detector
- 📓 **Notebooks:** See the repository for training/inference notebooks

## Architecture Details

### Vision Transformer Encoder
- **Patch Size:** 16×16
- **Tokens:** 196 patches + 1 CLS token
- **Embedding Dim:** 768
- **Layers:** 12
- **Attention Heads:** 12
- **MLP Ratio:** 4.0

### Cross-Attention Mechanism
- **Query:** Source features
- **Key/Value:** Target features
- **Attention:** Multi-head (8 heads)
- **FFN Expansion:** 4× (768 → 3072 → 768)

### CNN Upsampler
- **Input:** 768×16×16
- **Output:** 3×224×224
- **Stages:** 4 transpose convolutions
- **Kernel:** 4×4, Stride: 2, Padding: 1
- **Activations:** ReLU (hidden layers), Tanh (output)

## License

This model is released under the MIT license. Use responsibly and ethically.

## Author

**Huzaifa Nasir**
National University of Computer and Emerging Sciences (NUCES)
Islamabad, Pakistan
📧 nasirhuzaifa95@gmail.com

## Acknowledgments

- Vision Transformer (Dosovitskiy et al.)
- FaceForensics++ dataset
- PyTorch and timm libraries
- Open-source AI community