---
license: mit
tags:
- face-generation
- computer-vision
- vision-transformer
- deepfake
- image-generation
- pytorch
- research-only
- vit
- cross-attention
language:
- en
library_name: pytorch
pipeline_tag: image-to-image
---

# FaceForge Generator: Vision Transformer-based Face Manipulation

[DOI: 10.5281/zenodo.18530439](https://doi.org/10.5281/zenodo.18530439)
[GitHub: Huzaifanasir95/FaceForge](https://github.com/Huzaifanasir95/FaceForge)
[License: MIT](https://opensource.org/licenses/MIT)

**252M Parameters | ViT-Based | Baseline Training Complete**

⚠️ **RESEARCH USE ONLY** - This model is intended for academic research and for developing deepfake detection systems.

## Model Description

FaceForge Generator is a Vision Transformer-based facial manipulation system that learns to synthesize realistic face swaps. The model combines dual ViT encoders, a cross-attention fusion module, a transformer decoder, and a CNN upsampler to generate high-quality facial manipulations.

**Key Features:**

- 252 million trainable parameters
- Dual encoder architecture for source and target faces
- Cross-attention fusion mechanism
- Generates 224×224 RGB face images
- ~300 ms inference time per image
- Reached 0.204 validation loss after 3 epochs

## Model Architecture

```
FaceForge Generator (252.5M parameters)
│
├── ViT Encoders (172M params)
│   ├── Source Encoder: ViT-B/16 (86M)
│   │   └── 12 layers, 768-dim, 12 heads
│   └── Target Encoder: ViT-B/16 (86M)
│       └── 12 layers, 768-dim, 12 heads
│
├── Cross-Attention Module (14M params)
│   ├── 2 layers, 8 heads
│   ├── FFN: 768 → 3072 → 768
│   └── Dropout: 0.1
│
├── Transformer Decoder (58M params)
│   ├── 256 learnable queries (16×16)
│   ├── 6 decoder layers, 8 heads
│   └── 2D positional embeddings
│
└── CNN Upsampler (9M params)
    ├── TransposeConv: 768→512→256→128→64
    ├── 4 upsampling stages (16×16 → 224×224)
    └── Conv: 64→32→3 + Tanh
```
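
A quick sanity check on the advertised parameter count, assuming the `FaceForgeGenerator` class from the Usage section below has been instantiated as `model`:

```python
# Count trainable parameters; should land near 252.5M for the full model
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{n_params / 1e6:.1f}M trainable parameters")
```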

## Training Progress

### Baseline Training (3 Epochs)

| Epoch | Train Loss | Val Loss | Time (min) |
|-------|------------|----------|------------|
| 1     | 0.2873     | 0.2804   | 227.5      |
| 2     | 0.2432     | 0.2304   | 231.2      |
| 3     | 0.2143     | 0.2043   | 228.8      |

**Total Training Time:** 11.5 hours (687.5 minutes)

### Loss Reduction

- Training loss: 0.287 → 0.214 (25.4% reduction)
- Validation loss: 0.280 → 0.204 (27.1% reduction)
- Minimal overfitting (train-val gap: 0.010)

## Usage

### Installation

```bash
pip install torch torchvision timm pillow numpy
```

### Loading the Model

```python
import torch
import torch.nn as nn
import timm
from PIL import Image
from torchvision import transforms


class FaceForgeGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        # Source and target ViT encoders (ViT-B/16, classification head removed)
        self.source_encoder = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=0)
        self.target_encoder = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=0)

        # Cross-attention module, transformer decoder, and CNN upsampler
        # go here (see the full architecture in the paper)

    def forward(self, source_face, target_face):
        # Encode both faces into token sequences
        source_features = self.source_encoder.forward_features(source_face)
        target_features = self.target_encoder.forward_features(target_face)

        # Cross-attention fusion: source tokens attend to target tokens
        fused_features = self.cross_attention(source_features, target_features)

        # Decode fused tokens into a 16x16 spatial feature map
        spatial_features = self.transformer_decoder(fused_features)

        # Upsample to a 224x224 RGB image in [-1, 1] (Tanh output)
        generated_face = self.cnn_upsampler(spatial_features)

        return generated_face


# Load checkpoint
model = FaceForgeGenerator()
checkpoint = torch.load('generator_best.pth', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Preprocessing: resize, convert to tensor, normalize to [-1, 1]
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])


# Generate a face swap from two image paths
def generate_face_swap(source_path, target_path):
    source = transform(Image.open(source_path).convert('RGB')).unsqueeze(0)
    target = transform(Image.open(target_path).convert('RGB')).unsqueeze(0)

    with torch.no_grad():
        generated = model(source, target)

    # Denormalize from [-1, 1] to [0, 1] and convert to PIL
    generated = (generated[0] * 0.5 + 0.5).clamp(0, 1)
    return transforms.ToPILImage()(generated)


# Example
result = generate_face_swap("source.jpg", "target.jpg")
result.save("generated.jpg")
```
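
The cross-attention, decoder, and upsampler modules are elided in the skeleton above. Below is a minimal sketch of what they might look like, based only on the dimensions listed under Architecture Details; the class names (`CrossAttentionFusion`, `SpatialDecoder`, `CNNUpsampler`) are illustrative, the 2D positional embeddings are omitted for brevity, and the final resize from 256×256 to 224×224 is an assumption, so treat this as a sketch rather than the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossAttentionFusion(nn.Module):
    """Source tokens (queries) attend to target tokens (keys/values): 2 layers, 8 heads."""
    def __init__(self, dim=768, heads=8, depth=2, dropout=0.1):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.ModuleDict({
                'attn': nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True),
                'norm1': nn.LayerNorm(dim),
                'ffn': nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(),
                                     nn.Dropout(dropout), nn.Linear(dim * 4, dim)),
                'norm2': nn.LayerNorm(dim),
            }) for _ in range(depth)
        ])

    def forward(self, source_tokens, target_tokens):
        x = source_tokens
        for blk in self.blocks:
            attn_out, _ = blk['attn'](x, target_tokens, target_tokens)
            x = blk['norm1'](x + attn_out)       # residual + norm after attention
            x = blk['norm2'](x + blk['ffn'](x))  # residual + norm after FFN
        return x


class SpatialDecoder(nn.Module):
    """256 learnable queries decoded against the fused tokens into a 16x16 feature map."""
    def __init__(self, dim=768, heads=8, depth=6):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, 256, dim) * 0.02)  # 16x16 grid of queries
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=dim * 4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)

    def forward(self, fused_tokens):
        b = fused_tokens.size(0)
        out = self.decoder(self.queries.expand(b, -1, -1), fused_tokens)  # (B, 256, 768)
        return out.transpose(1, 2).reshape(b, -1, 16, 16)                 # (B, 768, 16, 16)


class CNNUpsampler(nn.Module):
    """Four stride-2 transpose convs (768->512->256->128->64), then a 3-channel Tanh head."""
    def __init__(self):
        super().__init__()
        chans = [768, 512, 256, 128, 64]
        layers = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            layers += [nn.ConvTranspose2d(cin, cout, kernel_size=4, stride=2, padding=1),
                       nn.ReLU(inplace=True)]
        self.up = nn.Sequential(*layers)
        self.head = nn.Sequential(nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(inplace=True),
                                  nn.Conv2d(32, 3, 3, padding=1), nn.Tanh())

    def forward(self, x):
        x = self.up(x)  # four doublings: 16x16 -> 256x256
        # ASSUMPTION: the card lists a 224x224 output, but four stride-2 stages from
        # 16x16 land at 256x256, so a resize is applied here; the released model may
        # handle this differently (e.g. a crop).
        x = F.interpolate(x, size=224, mode='bilinear', align_corners=False)
        return self.head(x)
```

Wiring these in as `self.cross_attention`, `self.transformer_decoder`, and `self.cnn_upsampler` in `FaceForgeGenerator.__init__` makes the skeleton above runnable end to end, though the resulting weights will not match the released checkpoint unless the layer wiring matches exactly.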

## Training Details

### Dataset

- **Source:** FaceForensics++ (c40 compression)
- **Training:** 7,000 face images (triplets: source, target, ground truth)
- **Validation:** 1,500 face images
- **Resolution:** 224×224 RGB

### Hyperparameters

```yaml
optimizer: AdamW
learning_rate: 1e-4
betas: [0.9, 0.999]
weight_decay: 1e-4
batch_size: 16
epochs: 3 (baseline)
loss_function: L1 (Mean Absolute Error)
lr_schedule: Cosine Annealing (1e-4 → 1e-6)
```
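
For reference, a minimal training setup matching these hyperparameters might look like the following; `model` and `train_loader` (yielding source/target/ground-truth triplets) are assumed to exist, and annealing per step rather than per epoch is an assumption, since the card does not state the schedule granularity.

```python
import torch

num_epochs = 3
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4,
                              betas=(0.9, 0.999), weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=num_epochs * len(train_loader), eta_min=1e-6)
criterion = torch.nn.L1Loss()  # mean absolute error on pixels

for epoch in range(num_epochs):
    for source, target, ground_truth in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(source, target), ground_truth)
        loss.backward()
        optimizer.step()
        scheduler.step()  # anneal 1e-4 -> 1e-6 over all steps
```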

### Training Configuration

- **Hardware:** CPU
- **Throughput:** ~32 samples/minute
- **Batch Processing:** 219 train batches, 47 val batches per epoch
- **Best Model:** Saved at epoch 3

## Current Status

⚠️ **Baseline Training:** This model has completed 3 epochs of baseline training. For production-quality face generation, extended training (15-20 epochs) is recommended.

**Current Capabilities:**

- ✅ Learns pose transfer
- ✅ Captures facial structures
- ✅ Shows a clear convergence trend
- ⏳ Some blur in generated images (expected at the baseline stage)
- ⏳ Benefits from extended training

## Use Cases

### Research Applications

1. **Detector Training:** Generate challenging samples for deepfake detection
2. **Adversarial Training:** Train generator and detector in a min-max game
3. **Understanding Manipulation:** Study how synthetic faces are created
4. **Benchmark Creation:** Generate test sets for evaluation

### Educational Uses

- Demonstrate face generation techniques
- Teach computer vision concepts
- Illustrate transformer architectures
- Visualize attention mechanisms

## Limitations

1. **Training Duration:** Only 3 epochs completed; extended training is needed for photo-realism
2. **Blur:** Generated faces show some blur at the baseline stage
3. **Dataset Scale:** Trained on only 8,500 images (7,000 train + 1,500 validation); larger datasets would improve quality
4. **Single Frame:** Does not model temporal consistency for video
5. **Compute:** Large model (252M params) requires significant memory

## Ethical Guidelines

⚠️ **Responsible Use Required**

This model is intended for:

- ✅ Academic research
- ✅ Deepfake detection development
- ✅ Educational demonstrations
- ✅ Ethical AI studies

**Prohibited uses:**

- ❌ Creating misinformation
- ❌ Identity theft or impersonation
- ❌ Non-consensual face manipulation
- ❌ Malicious content creation

**Recommendations:**

- Watermark generated content
- Maintain audit logs
- Require user consent
- Implement content filters

## Future Improvements

Planned enhancements:

- [ ] Extended training (15-20 epochs)
- [ ] Perceptual loss functions (VGG, LPIPS)
- [ ] GAN-based adversarial training
- [ ] Multi-scale architecture
- [ ] Attention visualization
- [ ] Video temporal consistency

## Citation

```bibtex
@techreport{nasir2026faceforge,
  title={FaceForge: A Deep Learning Framework for Facial Manipulation Generation and Detection},
  author={Nasir, Huzaifa},
  institution={National University of Computer and Emerging Sciences},
  year={2026},
  doi={10.5281/zenodo.18530439}
}
```

## Links

- **Paper:** https://doi.org/10.5281/zenodo.18530439
- **Code:** https://github.com/Huzaifanasir95/FaceForge
- **Detector Model:** https://huggingface.co/Huzaifanasir95/faceforge-detector
- **Notebooks:** See the repository for training and inference notebooks

## Architecture Details

### Vision Transformer Encoder

- **Patch Size:** 16×16
- **Patches:** 196 + 1 CLS token
- **Embedding Dim:** 768
- **Layers:** 12
- **Attention Heads:** 12
- **MLP Ratio:** 4.0
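
These token counts are easy to verify with timm, assuming the same model id used in the Usage section:

```python
import timm
import torch

encoder = timm.create_model('vit_base_patch16_224', pretrained=False, num_classes=0)
tokens = encoder.forward_features(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 197, 768]): 196 patch tokens (14x14 grid) + 1 CLS token
```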

### Cross-Attention Mechanism

- **Query:** Source features
- **Key/Value:** Target features
- **Attention:** Multi-head (8 heads)
- **FFN Expansion:** 4× (768 → 3072 → 768)

### CNN Upsampler

- **Input:** 768×16×16
- **Output:** 3×224×224
- **Stages:** 4 transpose convolutions
- **Kernel:** 4×4, Stride: 2, Padding: 1
- **Activation:** ReLU → Tanh (output)
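
The spatial size of each transpose-conv stage follows out = (in − 1) · stride − 2 · padding + kernel, so every 4×4/stride-2/padding-1 stage exactly doubles the input. A quick trace shows four stages take 16×16 to 256×256, which implies a final resize or crop to reach the stated 224×224 output; the card does not specify that step.

```python
def up_out(size, kernel=4, stride=2, padding=1):
    # Standard transpose-convolution output-size formula
    return (size - 1) * stride - 2 * padding + kernel

sizes = [16]
for _ in range(4):
    sizes.append(up_out(sizes[-1]))
print(sizes)  # [16, 32, 64, 128, 256]
```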

## License

This model is released under the MIT license. Use responsibly and ethically.

## Author

**Huzaifa Nasir**
National University of Computer and Emerging Sciences (NUCES)
Islamabad, Pakistan
📧 nasirhuzaifa95@gmail.com

## Acknowledgments

- Vision Transformer (Dosovitskiy et al.)
- FaceForensics++ dataset
- PyTorch and timm libraries
- Open-source AI community