---
license: mit
tags:
- face-generation
- computer-vision
- vision-transformer
- deepfake
- image-generation
- pytorch
- research-only
- vit
- cross-attention
language:
- en
library_name: pytorch
pipeline_tag: image-to-image
---
# FaceForge Generator: Vision Transformer-based Face Manipulation
[![Paper](https://img.shields.io/badge/Paper-Zenodo-blue)](https://doi.org/10.5281/zenodo.18530439)
[![GitHub](https://img.shields.io/badge/GitHub-Repository-black)](https://github.com/Huzaifanasir95/FaceForge)
[![License](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)
🎨 **252M Parameters | ViT-Based | Baseline Training Complete**
⚠️ **RESEARCH USE ONLY** - This model is for academic research and developing detection systems.
## Model Description
FaceForge Generator is a Vision Transformer-based facial manipulation system that learns to synthesize realistic face swaps. The model combines dual ViT encoders, a cross-attention fusion module, a transformer decoder, and a CNN upsampler to generate face swaps at 224×224 resolution.
**Key Features:**
- 🏗️ 252 million trainable parameters
- 🔄 Dual encoder architecture for source and target faces
- 🎯 Cross-attention fusion mechanism
- 🖼️ Generates 224×224 RGB face images
- ⚡ ~300 ms inference time per image
- 📉 Reached 0.204 validation loss after 3 epochs
## Model Architecture
```
FaceForge Generator (252.5M parameters)
│
├── ViT Encoders (172M params)
│   ├── Source Encoder: ViT-B/16 (86M)
│   │   └── 12 layers, 768-dim, 12 heads
│   └── Target Encoder: ViT-B/16 (86M)
│       └── 12 layers, 768-dim, 12 heads
│
├── Cross-Attention Module (14M params)
│   ├── 2 layers, 8 heads
│   ├── FFN: 768 → 3072 → 768
│   └── Dropout: 0.1
│
├── Transformer Decoder (58M params)
│   ├── 256 learnable queries (16×16)
│   ├── 6 decoder layers, 8 heads
│   └── 2D positional embeddings
│
└── CNN Upsampler (9M params)
    ├── TransposeConv: 768→512→256→128→64
    ├── 4 upsampling stages (16×16 → 224×224)
    └── Conv: 64→32→3 + Tanh
```
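As a concrete illustration, here is a minimal PyTorch sketch of the cross-attention fusion stage. It matches the numbers in the diagram (2 layers, 8 heads, 768 → 3072 → 768 FFN, dropout 0.1), but the layer layout (e.g., normalization placement) and the name `CrossAttentionFusion` are assumptions, not the exact released implementation:
```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuses source and target ViT token sequences (sketch only)."""

    def __init__(self, dim=768, heads=8, ffn_dim=3072, num_layers=2, dropout=0.1):
        super().__init__()
        self.layers = nn.ModuleList()
        for _ in range(num_layers):
            self.layers.append(nn.ModuleDict({
                "attn": nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True),
                "norm1": nn.LayerNorm(dim),
                "norm2": nn.LayerNorm(dim),
                "ffn": nn.Sequential(
                    nn.Linear(dim, ffn_dim),
                    nn.GELU(),
                    nn.Dropout(dropout),
                    nn.Linear(ffn_dim, dim),
                ),
            }))

    def forward(self, source_tokens, target_tokens):
        # Query: source features; Key/Value: target features
        x = source_tokens
        for layer in self.layers:
            attn_out, _ = layer["attn"](x, target_tokens, target_tokens)
            x = layer["norm1"](x + attn_out)
            x = layer["norm2"](x + layer["ffn"](x))
        return x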
## Training Progress
### Baseline Training (3 Epochs)
| Epoch | Train Loss | Val Loss | Time (min) |
|-------|-----------|----------|------------|
| 1 | 0.2873 | 0.2804 | 227.5 |
| 2 | 0.2432 | 0.2304 | 231.2 |
| 3 | 0.2143 | 0.2043 | 228.8 |
**Total Training Time:** 11.5 hours (687.5 minutes)
### Loss Reduction
- Training loss: 0.287 → 0.214 (25.4% reduction)
- Validation loss: 0.280 → 0.204 (27.1% reduction)
- Minimal overfitting (train-val gap: 0.010)
## Usage
### Installation
```bash
pip install torch torchvision timm pillow numpy
```
### Loading the Model
```python
import torch
import torch.nn as nn
import timm
from PIL import Image
from torchvision import transforms


class FaceForgeGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        # Source and target ViT encoders (feature tokens only, no classifier head)
        self.source_encoder = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=0)
        self.target_encoder = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=0)
        # Cross-attention, transformer decoder, and CNN upsampler must match
        # the checkpoint (see the full architecture in the paper):
        # self.cross_attention = ...
        # self.transformer_decoder = ...
        # self.cnn_upsampler = ...

    def forward(self, source_face, target_face):
        # Encode both faces into token sequences
        source_features = self.source_encoder.forward_features(source_face)
        target_features = self.target_encoder.forward_features(target_face)
        # Cross-attention fusion (source queries attend to target keys/values)
        fused_features = self.cross_attention(source_features, target_features)
        # Decode fused tokens to a spatial feature map
        spatial_features = self.transformer_decoder(fused_features)
        # Upsample to a 224x224 RGB image in [-1, 1]
        generated_face = self.cnn_upsampler(spatial_features)
        return generated_face


# Load checkpoint
model = FaceForgeGenerator()
checkpoint = torch.load('generator_best.pth', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Preprocessing: resize, tensorize, normalize to [-1, 1]
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])


# Generate a face swap from two image files
def generate_face_swap(source_path, target_path):
    source = transform(Image.open(source_path).convert('RGB')).unsqueeze(0)
    target = transform(Image.open(target_path).convert('RGB')).unsqueeze(0)
    with torch.no_grad():
        generated = model(source, target)
    # Denormalize from [-1, 1] to [0, 1] and convert to PIL
    generated = (generated[0] * 0.5 + 0.5).clamp(0, 1)
    return transforms.ToPILImage()(generated)


# Example
result = generate_face_swap("source.jpg", "target.jpg")
result.save("generated.jpg")
```
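To sanity-check the ~300 ms/image figure quoted above on your own hardware, a simple timing harness (illustrative only, reusing `model` from the snippet above) might look like:
```python
import time

# Rough latency measurement for a single forward pass on dummy inputs
source = torch.randn(1, 3, 224, 224)
target = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    model(source, target)                  # warm-up pass
    start = time.perf_counter()
    for _ in range(10):
        model(source, target)
    avg = (time.perf_counter() - start) / 10

print(f"average latency: {avg * 1000:.1f} ms/image")
```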
## Training Details
### Dataset
- **Source:** FaceForensics++ (c40 compression)
- **Training:** 7,000 face images (triplets: source, target, ground truth)
- **Validation:** 1,500 face images
- **Resolution:** 224×224 RGB
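A dataloader for such triplets could be sketched as follows; the directory layout (`source/`, `target/`, `gt/` subfolders with matching filenames) is an assumption for illustration, not the published dataset format:
```python
import os
from PIL import Image
from torch.utils.data import Dataset

class FaceTripletDataset(Dataset):
    """Yields (source, target, ground_truth) image triplets (sketch only)."""

    def __init__(self, root, transform):
        self.root = root
        self.transform = transform
        # Assumes source/, target/, gt/ subfolders share filenames
        self.names = sorted(os.listdir(os.path.join(root, "source")))

    def __len__(self):
        return len(self.names)

    def _load(self, subdir, name):
        path = os.path.join(self.root, subdir, name)
        return self.transform(Image.open(path).convert("RGB"))

    def __getitem__(self, idx):
        name = self.names[idx]
        return (self._load("source", name),
                self._load("target", name),
                self._load("gt", name))
```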
### Hyperparameters
```yaml
optimizer: AdamW
learning_rate: 1e-4
betas: [0.9, 0.999]
weight_decay: 1e-4
batch_size: 16
epochs: 3  # baseline
loss_function: L1 (Mean Absolute Error)
lr_schedule: Cosine Annealing (1e-4 → 1e-6)
```
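Put together, a minimal training loop consistent with these hyperparameters might look like the sketch below. `model` is the generator from above, and `train_loader` yields the triplets described earlier; this is not the exact released training script:
```python
import torch
import torch.nn as nn

num_epochs = 3
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4,
                              betas=(0.9, 0.999), weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=num_epochs, eta_min=1e-6)
criterion = nn.L1Loss()  # mean absolute error

for epoch in range(num_epochs):
    model.train()
    for source, target, ground_truth in train_loader:
        optimizer.zero_grad()
        generated = model(source, target)
        loss = criterion(generated, ground_truth)
        loss.backward()
        optimizer.step()
    scheduler.step()  # cosine decay toward 1e-6
```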
### Training Configuration
- **Hardware:** CPU
- **Throughput:** ~32 samples/minute
- **Batch Processing:** 219 train batches, 47 val batches per epoch
- **Best Model:** Saved at epoch 3
## Current Status
⚠️ **Baseline Training:** This model has completed 3 epochs of baseline training. For production-quality face generation, extended training (15-20 epochs) is recommended.
**Current Capabilities:**
- ✅ Learns pose transfer
- ✅ Captures facial structures
- ✅ Shows convergence trend
- ⏳ Some blur in generated images (expected at baseline)
- ⏳ Benefits from extended training
## Use Cases
### Research Applications
1. **Detector Training:** Generate challenging samples for deepfake detection
2. **Adversarial Training:** Min-max game with detector
3. **Understanding Manipulation:** Study how synthetic faces are created
4. **Benchmark Creation:** Generate test sets for evaluation
### Educational Uses
- Demonstrate face generation techniques
- Teach computer vision concepts
- Illustrate transformer architectures
- Visualize attention mechanisms
## Limitations
1. **Training Duration:** Only 3 epochs completed; extended training needed for photo-realism
2. **Blur:** Generated faces show some blur at baseline stage
3. **Dataset Scale:** Trained on 8,500 images (7,000 train / 1,500 validation); larger datasets would improve quality
4. **Single Frame:** Doesn't consider temporal consistency for video
5. **Compute:** Large model (252M params) requires significant memory
## Ethical Guidelines
⚠️ **Responsible Use Required**
This model is intended for:
- ✅ Academic research
- ✅ Deepfake detection development
- ✅ Educational demonstrations
- ✅ Ethical AI studies
**Prohibited uses:**
- ❌ Creating misinformation
- ❌ Identity theft or impersonation
- ❌ Non-consensual face manipulation
- ❌ Malicious content creation
**Recommendations:**
- Watermark generated content (see the sketch after this list)
- Maintain audit logs
- Require user consent
- Implement content filters
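For the watermarking recommendation, a simple visible stamp can be added with PIL; this is illustrative only, and production systems would also want robust or invisible marks:
```python
from PIL import Image, ImageDraw

def watermark(img: Image.Image, text: str = "AI-GENERATED") -> Image.Image:
    """Stamp a visible label onto a generated image."""
    out = img.copy()
    ImageDraw.Draw(out).text((8, out.height - 16), text, fill=(255, 0, 0))
    return out

watermark(generate_face_swap("source.jpg", "target.jpg")).save("generated_wm.jpg")
```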
## Future Improvements
Planned enhancements:
- [ ] Extended training (15-20 epochs)
- [ ] Perceptual loss functions (VGG, LPIPS; see the sketch after this list)
- [ ] GAN-based adversarial training
- [ ] Multi-scale architecture
- [ ] Attention visualization
- [ ] Video temporal consistency
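For the perceptual-loss item, one possible direction (not implemented in the released code; the 0.9/0.1 weighting is purely illustrative) is to mix the existing L1 objective with an LPIPS term:
```python
import lpips  # pip install lpips
import torch.nn.functional as F

perceptual = lpips.LPIPS(net='vgg')  # expects inputs scaled to [-1, 1]

def combined_loss(generated, ground_truth):
    l1 = F.l1_loss(generated, ground_truth)
    lp = perceptual(generated, ground_truth).mean()
    return 0.9 * l1 + 0.1 * lp
```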
## Citation
```bibtex
@techreport{nasir2026faceforge,
  title       = {FaceForge: A Deep Learning Framework for Facial Manipulation Generation and Detection},
  author      = {Nasir, Huzaifa},
  institution = {National University of Computer and Emerging Sciences},
  year        = {2026},
  doi         = {10.5281/zenodo.18530439}
}
```
## Links
- 📄 **Paper:** https://doi.org/10.5281/zenodo.18530439
- 💻 **Code:** https://github.com/Huzaifanasir95/FaceForge
- 🔍 **Detector Model:** https://huggingface.co/Huzaifanasir95/faceforge-detector
- 📓 **Notebooks:** See repository for training/inference notebooks
## Architecture Details
### Vision Transformer Encoder
- **Patch Size:** 16×16
- **Patches:** 196 + 1 CLS token
- **Embedding Dim:** 768
- **Layers:** 12
- **Attention Heads:** 12
- **MLP Ratio:** 4.0
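These numbers can be checked directly with timm; `forward_features` returns the full token sequence (196 patch tokens plus one CLS token, each 768-dimensional):
```python
import timm
import torch

encoder = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=0)
tokens = encoder.forward_features(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 197, 768])
```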
### Cross-Attention Mechanism
- **Query:** Source features
- **Key/Value:** Target features
- **Attention:** Multi-head (8 heads)
- **FFN Expansion:** 4× (768 → 3072 → 768)
### CNN Upsampler
- **Input:** 768×16×16
- **Output:** 3×224×224
- **Stages:** 4 transpose convolutions
- **Kernel:** 4×4, Stride: 2, Padding: 1
- **Activation:** ReLU → Tanh (output)
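A shape-faithful sketch of this head is below. Note that four stride-2 transpose convolutions take a 16×16 grid to 256×256, so this sketch assumes a final bilinear resize to 224×224; the released model may reconcile the sizes differently:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNNUpsampler(nn.Module):
    """Sketch of the decoder head described above (assumptions noted in comments)."""

    def __init__(self):
        super().__init__()
        chans = [768, 512, 256, 128, 64]
        stages = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            # Each stage doubles spatial resolution: 16 -> 32 -> 64 -> 128 -> 256
            stages += [nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                       nn.ReLU(inplace=True)]
        self.stages = nn.Sequential(*stages)
        self.head = nn.Sequential(
            nn.Conv2d(64, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 3, kernel_size=3, padding=1),
            nn.Tanh(),
        )

    def forward(self, x):                   # x: (B, 768, 16, 16)
        x = self.stages(x)                  # (B, 64, 256, 256)
        # Assumed resize to match the documented 224x224 output
        x = F.interpolate(x, size=224, mode="bilinear", align_corners=False)
        return self.head(x)                 # (B, 3, 224, 224), values in [-1, 1]
```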
## License
This model is released under the MIT License. Use responsibly and ethically.
## Author
**Huzaifa Nasir**
National University of Computer and Emerging Sciences (NUCES)
Islamabad, Pakistan
📧 nasirhuzaifa95@gmail.com
## Acknowledgments
- Vision Transformer (Dosovitskiy et al.)
- FaceForensics++ dataset
- PyTorch and timm libraries
- Open-source AI community