---
license: mit
tags:
- face-generation
- computer-vision
- vision-transformer
- deepfake
- image-generation
- pytorch
- research-only
- vit
- cross-attention
language:
- en
library_name: pytorch
pipeline_tag: image-to-image
---

# FaceForge Generator: Vision Transformer-based Face Manipulation

[![Paper](https://img.shields.io/badge/Paper-Zenodo-blue)](https://doi.org/10.5281/zenodo.18530439)
[![GitHub](https://img.shields.io/badge/GitHub-Repository-black)](https://github.com/Huzaifanasir95/FaceForge)
[![License](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)

🎨 **252M Parameters | ViT-Based | Baseline Training Complete**

⚠️ **RESEARCH USE ONLY** - This model is intended for academic research and for developing deepfake detection systems.

## Model Description

FaceForge Generator is a Vision Transformer-based facial manipulation system that learns to synthesize realistic face swaps. The model combines dual ViT encoders, a cross-attention fusion module, a transformer decoder, and a CNN upsampler to generate facial manipulations at 224×224 resolution.

**Key Features:**
- 🏗️ 252 million trainable parameters
- 🔄 Dual encoder architecture for source and target faces
- 🎯 Cross-attention fusion mechanism
- 🖼️ Generates 224×224 RGB face images
- ⚡ ~300 ms inference time per image
- 📉 0.204 validation loss after 3 epochs

## Model Architecture

```
FaceForge Generator (252.5M parameters)
│
├── ViT Encoders (172M params)
│   ├── Source Encoder: ViT-B/16 (86M)
│   │   └── 12 layers, 768-dim, 12 heads
│   └── Target Encoder: ViT-B/16 (86M)
│       └── 12 layers, 768-dim, 12 heads
│
├── Cross-Attention Module (14M params)
│   ├── 2 layers, 8 heads
│   ├── FFN: 768 → 3072 → 768
│   └── Dropout: 0.1
│
├── Transformer Decoder (58M params)
│   ├── 256 learnable queries (16×16)
│   ├── 6 decoder layers, 8 heads
│   └── 2D positional embeddings
│
└── CNN Upsampler (9M params)
    ├── TransposeConv: 768→512→256→128→64
    ├── 4 upsampling stages (16×16 → 224×224)
    └── Conv: 64→32→3 + Tanh
```

## Training Progress

### Baseline Training (3 Epochs)

| Epoch | Train Loss | Val Loss | Time (min) |
|-------|------------|----------|------------|
| 1     | 0.2873     | 0.2804   | 227.5      |
| 2     | 0.2432     | 0.2304   | 231.2      |
| 3     | 0.2143     | 0.2043   | 228.8      |

**Total Training Time:** 11.5 hours (687.5 minutes)

### Loss Reduction
- Training loss: 0.287 → 0.214 (25.3% reduction)
- Validation loss: 0.280 → 0.204 (27.1% reduction)
- Minimal overfitting (train-val gap: 0.010)

## Usage

### Installation

```bash
pip install torch torchvision timm pillow numpy
```

### Loading the Model

```python
import torch
import torch.nn as nn
import timm
from PIL import Image
from torchvision import transforms

class FaceForgeGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        # Source and target ViT encoders
        self.source_encoder = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=0)
        self.target_encoder = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=0)
        # Cross-attention module
        # Transformer decoder
        # CNN upsampler
        # ... (see the full architecture in the paper, and the sketch below)

    def forward(self, source_face, target_face):
        # Encode both faces
        source_features = self.source_encoder.forward_features(source_face)
        target_features = self.target_encoder.forward_features(target_face)
        # Cross-attention fusion
        fused_features = self.cross_attention(source_features, target_features)
        # Decode to a spatial feature map
        spatial_features = self.transformer_decoder(fused_features)
        # Upsample to 224×224
        generated_face = self.cnn_upsampler(spatial_features)
        return generated_face

# Load checkpoint
model = FaceForgeGenerator()
checkpoint = torch.load('generator_best.pth', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Preprocessing
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])

# Generate a face swap
def generate_face_swap(source_path, target_path):
    source = transform(Image.open(source_path).convert('RGB')).unsqueeze(0)
    target = transform(Image.open(target_path).convert('RGB')).unsqueeze(0)

    with torch.no_grad():
        generated = model(source, target)

    # Denormalize and convert to PIL
    generated = (generated[0] * 0.5 + 0.5).clamp(0, 1)
    generated = transforms.ToPILImage()(generated)
    return generated

# Example
result = generate_face_swap("source.jpg", "target.jpg")
result.save("generated.jpg")
```
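### Implementing the Custom Modules

The class above leaves the fusion, decoding, and upsampling components as placeholders. Below is a minimal sketch of how they could be filled in from the dimensions listed under Model Architecture. The class names (`CrossAttentionFusion`, `SpatialDecoder`, `CNNUpsampler`), the GELU activation in the fusion FFN, and the final bilinear resize from 256×256 down to 224×224 are assumptions for illustration, not the released implementation; the exact code lives in the repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionFusion(nn.Module):
    """Source tokens attend to target tokens: 2 layers, 8 heads, FFN 768 -> 3072 -> 768."""
    def __init__(self, dim=768, heads=8, ffn_dim=3072, num_layers=2, dropout=0.1):
        super().__init__()
        self.layers = nn.ModuleList()
        for _ in range(num_layers):
            self.layers.append(nn.ModuleDict({
                'attn': nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True),
                'norm1': nn.LayerNorm(dim),
                'norm2': nn.LayerNorm(dim),
                'ffn': nn.Sequential(
                    nn.Linear(dim, ffn_dim), nn.GELU(),
                    nn.Dropout(dropout), nn.Linear(ffn_dim, dim),
                ),
            }))

    def forward(self, source_tokens, target_tokens):
        x = source_tokens
        for layer in self.layers:
            # Query = source features, Key/Value = target features
            attn_out, _ = layer['attn'](x, target_tokens, target_tokens)
            x = layer['norm1'](x + attn_out)
            x = layer['norm2'](x + layer['ffn'](x))
        return x

class SpatialDecoder(nn.Module):
    """256 learnable queries (a flattened 16x16 grid) cross-attend to the fused tokens."""
    def __init__(self, dim=768, heads=8, num_layers=6, num_queries=256):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        # 2D positional embeddings, stored flattened over the 16x16 grid
        self.pos_embed = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=3072, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, fused_tokens):
        b = fused_tokens.size(0)
        q = (self.queries + self.pos_embed).expand(b, -1, -1)
        out = self.decoder(q, fused_tokens)                  # (B, 256, 768)
        return out.transpose(1, 2).reshape(b, 768, 16, 16)   # to a spatial map

class CNNUpsampler(nn.Module):
    """Four stride-2 transpose convolutions (768->512->256->128->64), then 64->32->3 + Tanh.
    Four doublings take 16x16 to 256x256; the resize to 224x224 here is an assumption."""
    def __init__(self):
        super().__init__()
        chans = [768, 512, 256, 128, 64]
        blocks = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            blocks += [nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                       nn.ReLU(inplace=True)]
        self.up = nn.Sequential(*blocks)
        self.head = nn.Sequential(
            nn.Conv2d(64, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 3, kernel_size=3, padding=1), nn.Tanh(),
        )

    def forward(self, x):
        x = self.up(x)                                       # (B, 64, 256, 256)
        x = F.interpolate(x, size=224, mode='bilinear', align_corners=False)
        return self.head(x)                                  # (B, 3, 224, 224)
```

Assigning `self.cross_attention = CrossAttentionFusion()`, `self.transformer_decoder = SpatialDecoder()`, and `self.cnn_upsampler = CNNUpsampler()` in `__init__` makes the forward pass above run end to end, and these dimensions roughly reproduce the stated 14M/58M/9M parameter counts. The checkpoint's state-dict keys will only match if the repository's actual definitions are used.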
## Training Details

### Dataset
- **Source:** FaceForensics++ (c40 compression)
- **Training:** 7,000 face images (triplets: source, target, ground truth)
- **Validation:** 1,500 face images
- **Resolution:** 224×224 RGB

### Hyperparameters

```yaml
optimizer: AdamW
learning_rate: 1e-4
betas: [0.9, 0.999]
weight_decay: 1e-4
batch_size: 16
epochs: 3 (baseline)
loss_function: L1 (Mean Absolute Error)
lr_schedule: Cosine Annealing (1e-4 → 1e-6)
```

### Training Configuration
- **Hardware:** CPU
- **Throughput:** ~32 samples/minute
- **Batch Processing:** 219 train batches, 47 val batches per epoch
- **Best Model:** Saved at epoch 3
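### Baseline Training Loop

A minimal sketch of the training loop implied by these hyperparameters: AdamW with the listed betas and weight decay, an L1 reconstruction loss against the ground-truth swap, cosine annealing from 1e-4 to 1e-6, and checkpointing on best validation loss. The dataloaders (`train_loader`, `val_loader`) and the triplet batch format are placeholders; the actual training script is in the repository.

```python
import torch
import torch.nn as nn

model = FaceForgeGenerator()  # as defined under Usage above
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4,
                              betas=(0.9, 0.999), weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=3, eta_min=1e-6)
criterion = nn.L1Loss()  # mean absolute error

best_val = float('inf')
for epoch in range(3):
    # Training pass over (source, target, ground_truth) triplets
    model.train()
    for source, target, ground_truth in train_loader:  # placeholder dataloader
        optimizer.zero_grad()
        generated = model(source, target)
        loss = criterion(generated, ground_truth)
        loss.backward()
        optimizer.step()
    scheduler.step()

    # Validation pass; keep the checkpoint with the lowest validation loss
    model.eval()
    val_loss, n_batches = 0.0, 0
    with torch.no_grad():
        for source, target, ground_truth in val_loader:  # placeholder dataloader
            val_loss += criterion(model(source, target), ground_truth).item()
            n_batches += 1
    val_loss /= max(n_batches, 1)
    if val_loss < best_val:
        best_val = val_loss
        torch.save({'model_state_dict': model.state_dict()}, 'generator_best.pth')
```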
## Current Status

⚠️ **Baseline Training:** This model has completed 3 epochs of baseline training. For production-quality face generation, extended training (15-20 epochs) is recommended.

**Current Capabilities:**
- ✅ Learns pose transfer
- ✅ Captures facial structure
- ✅ Shows a converging loss trend
- ⏳ Some blur in generated images (expected at the baseline stage)
- ⏳ Would benefit from extended training

## Use Cases

### Research Applications
1. **Detector Training:** Generate challenging samples for deepfake detection
2. **Adversarial Training:** Min-max training against a detector
3. **Understanding Manipulation:** Study how synthetic faces are created
4. **Benchmark Creation:** Generate test sets for evaluation

### Educational Uses
- Demonstrate face generation techniques
- Teach computer vision concepts
- Illustrate transformer architectures
- Visualize attention mechanisms

## Limitations

1. **Training Duration:** Only 3 epochs completed; extended training is needed for photo-realism
2. **Blur:** Generated faces show some blur at the baseline stage
3. **Dataset Scale:** Trained on roughly 10K images; larger datasets would improve quality
4. **Single Frame:** Does not model temporal consistency for video
5. **Compute:** The 252M-parameter model requires significant memory

## Ethical Guidelines

⚠️ **Responsible Use Required**

This model is intended for:
- ✅ Academic research
- ✅ Deepfake detection development
- ✅ Educational demonstrations
- ✅ Ethical AI studies

**Prohibited uses:**
- ❌ Creating misinformation
- ❌ Identity theft or impersonation
- ❌ Non-consensual face manipulation
- ❌ Malicious content creation

**Recommendations:**
- Watermark generated content
- Maintain audit logs
- Require user consent
- Implement content filters

## Future Improvements

Planned enhancements:
- [ ] Extended training (15-20 epochs)
- [ ] Perceptual loss functions (VGG, LPIPS)
- [ ] GAN-based adversarial training
- [ ] Multi-scale architecture
- [ ] Attention visualization
- [ ] Video temporal consistency

## Citation

```bibtex
@techreport{nasir2026faceforge,
  title={FaceForge: A Deep Learning Framework for Facial Manipulation Generation and Detection},
  author={Nasir, Huzaifa},
  institution={National University of Computer and Emerging Sciences},
  year={2026},
  doi={10.5281/zenodo.18530439}
}
```

## Links

- 📄 **Paper:** https://doi.org/10.5281/zenodo.18530439
- 💻 **Code:** https://github.com/Huzaifanasir95/FaceForge
- 🔍 **Detector Model:** https://huggingface.co/Huzaifanasir95/faceforge-detector
- 📓 **Notebooks:** See the repository for training/inference notebooks

## Architecture Details

### Vision Transformer Encoder
- **Patch Size:** 16×16
- **Tokens:** 196 patches + 1 CLS token
- **Embedding Dim:** 768
- **Layers:** 12
- **Attention Heads:** 12
- **MLP Ratio:** 4.0

### Cross-Attention Mechanism
- **Query:** Source features
- **Key/Value:** Target features
- **Attention:** Multi-head (8 heads)
- **FFN Expansion:** 4× (768 → 3072 → 768)

### CNN Upsampler
- **Input:** 768×16×16
- **Output:** 3×224×224
- **Stages:** 4 transpose convolutions
- **Kernel:** 4×4, Stride: 2, Padding: 1
- **Activations:** ReLU (hidden layers), Tanh (output)

## License

This model is released under the MIT license. Use responsibly and ethically.

## Author

**Huzaifa Nasir**
National University of Computer and Emerging Sciences (NUCES)
Islamabad, Pakistan
📧 nasirhuzaifa95@gmail.com

## Acknowledgments

- Vision Transformer (Dosovitskiy et al.)
- FaceForensics++ dataset
- PyTorch and timm libraries
- Open-source AI community