---
license: mit
tags:
- face-generation
- computer-vision
- vision-transformer
- deepfake
- image-generation
- pytorch
- research-only
- vit
- cross-attention
language:
- en
library_name: pytorch
pipeline_tag: image-to-image
---
# FaceForge Generator: Vision Transformer-based Face Manipulation
[DOI: 10.5281/zenodo.18530439](https://doi.org/10.5281/zenodo.18530439)
[GitHub](https://github.com/Huzaifanasir95/FaceForge)
[License: MIT](https://opensource.org/licenses/MIT)
🚨 **252M Parameters | ViT-Based | Baseline Training Complete**

⚠️ **RESEARCH USE ONLY** - This model is for academic research and developing detection systems.
## Model Description
FaceForge Generator is a sophisticated Vision Transformer-based facial manipulation system that learns to synthesize realistic face swaps. The model combines dual ViT encoders, cross-attention mechanisms, transformer decoders, and CNN upsamplers to generate high-quality facial manipulations.
**Key Features:**
- 252 million trainable parameters
- Dual encoder architecture for source and target faces
- Cross-attention fusion mechanism
- Generates 224×224 RGB face images
- ~300ms inference time per image
- Achieved 0.204 validation loss after 3 epochs
## Model Architecture
```
FaceForge Generator (252.5M parameters)
│
├── ViT Encoders (172M params)
│   ├── Source Encoder: ViT-B/16 (86M)
│   │   └── 12 layers, 768-dim, 12 heads
│   └── Target Encoder: ViT-B/16 (86M)
│       └── 12 layers, 768-dim, 12 heads
│
├── Cross-Attention Module (14M params)
│   ├── 2 layers, 8 heads
│   ├── FFN: 768 → 3072 → 768
│   └── Dropout: 0.1
│
├── Transformer Decoder (58M params)
│   ├── 256 learnable queries (16×16)
│   ├── 6 decoder layers, 8 heads
│   └── 2D positional embeddings
│
└── CNN Upsampler (9M params)
    ├── TransposeConv: 768→512→256→128→64
    ├── 4 upsampling stages (16×16 → 224×224)
    └── Conv: 64→32→3 + Tanh
```
## Training Progress
### Baseline Training (3 Epochs)
| Epoch | Train Loss | Val Loss | Time (min) |
|-------|-----------|----------|------------|
| 1 | 0.2873 | 0.2804 | 227.5 |
| 2 | 0.2432 | 0.2304 | 231.2 |
| 3 | 0.2143 | 0.2043 | 228.8 |
**Total Training Time:** 11.5 hours (687.5 minutes)
### Loss Reduction
- Training loss: 0.287 → 0.214 (25.4% reduction)
- Validation loss: 0.280 → 0.204 (27.1% reduction)
- Minimal overfitting (train-val gap: 0.010)
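As a quick sanity check, the quoted reductions follow directly from the epoch-1 and epoch-3 values in the table above:

```python
# Loss values from the baseline training table (epochs 1 and 3)
train_start, train_end = 0.2873, 0.2143
val_start, val_end = 0.2804, 0.2043

train_reduction = 100 * (train_start - train_end) / train_start
val_reduction = 100 * (val_start - val_end) / val_start
gap = train_end - val_end  # train-val gap at epoch 3

print(round(train_reduction, 1))  # 25.4
print(round(val_reduction, 1))    # 27.1
print(round(gap, 3))              # 0.01
```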
## Usage
### Installation
```bash
pip install torch torchvision timm pillow numpy
```
### Loading the Model
```python
import torch
import torch.nn as nn
import timm
from torchvision import transforms
class FaceForgeGenerator(nn.Module):
def __init__(self):
super().__init__()
# Source and Target ViT Encoders
self.source_encoder = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=0)
self.target_encoder = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=0)
# Cross-attention (implement your architecture)
# Transformer decoder
# CNN upsampler
# ... (see full architecture in paper)
def forward(self, source_face, target_face):
# Encode both faces
source_features = self.source_encoder.forward_features(source_face)
target_features = self.target_encoder.forward_features(target_face)
# Cross-attention fusion
fused_features = self.cross_attention(source_features, target_features)
# Decode to spatial map
spatial_features = self.transformer_decoder(fused_features)
# Upsample to 224Γ224
generated_face = self.cnn_upsampler(spatial_features)
return generated_face
# Load checkpoint
model = FaceForgeGenerator()
checkpoint = torch.load('generator_best.pth', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()
# Preprocessing
transform = transforms.Compose([
transforms.Resize((224, 224)),
transforms.ToTensor(),
transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])
# Generate face swap
def generate_face_swap(source_path, target_path):
source = transform(Image.open(source_path).convert('RGB')).unsqueeze(0)
target = transform(Image.open(target_path).convert('RGB')).unsqueeze(0)
with torch.no_grad():
generated = model(source, target)
# Denormalize and convert to PIL
generated = (generated[0] * 0.5 + 0.5).clamp(0, 1)
generated = transforms.ToPILImage()(generated)
return generated
# Example
result = generate_face_swap("source.jpg", "target.jpg")
result.save("generated.jpg")
```
## Training Details
### Dataset
- **Source:** FaceForensics++ (c40 compression)
- **Training:** 7,000 face images (triplets: source, target, ground truth)
- **Validation:** 1,500 face images
- **Resolution:** 224×224 RGB
### Hyperparameters
```yaml
optimizer: AdamW
learning_rate: 1e-4
betas: [0.9, 0.999]
weight_decay: 1e-4
batch_size: 16
epochs: 3 (baseline)
loss_function: L1 (Mean Absolute Error)
lr_schedule: Cosine Annealing (1e-4 → 1e-6)
```
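The cosine-annealing schedule above corresponds to PyTorch's `CosineAnnealingLR`, whose per-epoch learning rate follows a simple closed form. A minimal sketch in plain Python (with `total_epochs` standing in for the scheduler's `T_max`):

```python
import math

def cosine_annealing_lr(epoch, total_epochs, lr_max=1e-4, lr_min=1e-6):
    """Closed form of cosine annealing: lr_max at epoch 0, lr_min at total_epochs."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))

for epoch in range(4):
    print(f"epoch {epoch}: lr = {cosine_annealing_lr(epoch, 3):.2e}")
# epoch 0: lr = 1.00e-04
# epoch 3: lr = 1.00e-06
```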
### Training Configuration
- **Hardware:** CPU
- **Throughput:** ~32 samples/minute
- **Batch Processing:** 219 train batches, 47 val batches per epoch
- **Best Model:** Saved at epoch 3
## Current Status
⚠️ **Baseline Training:** This model has completed 3 epochs of baseline training. For production-quality face generation, extended training (15-20 epochs) is recommended.

**Current Capabilities:**
- ✅ Learns pose transfer
- ✅ Captures facial structures
- ✅ Shows convergence trend
- ⏳ Some blur in generated images (expected at baseline)
- ⏳ Benefits from extended training
## Use Cases
### Research Applications
1. **Detector Training:** Generate challenging samples for deepfake detection
2. **Adversarial Training:** Min-max game with detector
3. **Understanding Manipulation:** Study how synthetic faces are created
4. **Benchmark Creation:** Generate test sets for evaluation
### Educational Uses
- Demonstrate face generation techniques
- Teach computer vision concepts
- Illustrate transformer architectures
- Show attention mechanism visualization
## Limitations
1. **Training Duration:** Only 3 epochs completed; extended training needed for photo-realism
2. **Blur:** Generated faces show some blur at baseline stage
3. **Dataset Scale:** Trained on ~8,500 images; larger datasets would improve quality
4. **Single Frame:** Doesn't consider temporal consistency for video
5. **Compute:** Large model (252M params) requires significant memory
## Ethical Guidelines
⚠️ **Responsible Use Required**

This model is intended for:
- ✅ Academic research
- ✅ Deepfake detection development
- ✅ Educational demonstrations
- ✅ Ethical AI studies

**Prohibited uses:**
- ❌ Creating misinformation
- ❌ Identity theft or impersonation
- ❌ Non-consensual face manipulation
- ❌ Malicious content creation
**Recommendations:**
- Watermark generated content
- Maintain audit logs
- Require user consent
- Implement content filters
## Future Improvements
Planned enhancements:
- [ ] Extended training (15-20 epochs)
- [ ] Perceptual loss functions (VGG, LPIPS)
- [ ] GAN-based adversarial training
- [ ] Multi-scale architecture
- [ ] Attention visualization
- [ ] Video temporal consistency
## Citation
```bibtex
@techreport{nasir2026faceforge,
title={FaceForge: A Deep Learning Framework for Facial Manipulation Generation and Detection},
author={Nasir, Huzaifa},
institution={National University of Computer and Emerging Sciences},
year={2026},
doi={10.5281/zenodo.18530439}
}
```
## Links
- **Paper:** https://doi.org/10.5281/zenodo.18530439
- **Code:** https://github.com/Huzaifanasir95/FaceForge
- **Detector Model:** https://huggingface.co/Huzaifanasir95/faceforge-detector
- **Notebooks:** See repository for training/inference notebooks
## Architecture Details
### Vision Transformer Encoder
- **Patch Size:** 16×16
- **Patches:** 196 + 1 CLS token
- **Embedding Dim:** 768
- **Layers:** 12
- **Attention Heads:** 12
- **MLP Ratio:** 4.0
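The 196-patch figure follows directly from the image and patch sizes, as a quick check shows:

```python
image_size, patch_size = 224, 16

num_patches = (image_size // patch_size) ** 2  # 14 x 14 grid of patches
num_tokens = num_patches + 1                   # plus the CLS token

print(num_patches, num_tokens)  # 196 197
```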
### Cross-Attention Mechanism
- **Query:** Source features
- **Key/Value:** Target features
- **Attention:** Multi-head (8 heads)
- **FFN Expansion:** 4× (768 → 3072 → 768)
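To illustrate the fusion step (this is a sketch, not the repository's implementation), here is single-head scaled dot-product cross-attention in NumPy, with source tokens as queries and target tokens as keys/values; the real module uses 8 heads plus the FFN and dropout:

```python
import numpy as np

def cross_attention(query, key_value):
    """Single-head scaled dot-product attention: query attends to key_value."""
    d_k = query.shape[-1]
    scores = query @ key_value.T / np.sqrt(d_k)
    # Softmax over the key axis (numerically stabilized)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ key_value

rng = np.random.default_rng(0)
source_tokens = rng.normal(size=(197, 768))  # queries: source-face features
target_tokens = rng.normal(size=(197, 768))  # keys/values: target-face features

fused = cross_attention(source_tokens, target_tokens)
print(fused.shape)  # (197, 768)
```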
### CNN Upsampler
- **Input:** 768×16×16
- **Output:** 3×224×224
- **Stages:** 4 transpose convolutions
- **Kernel:** 4×4, Stride: 2, Padding: 1
- **Activation:** ReLU → Tanh (output)
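With kernel 4, stride 2, and padding 1, each transpose convolution exactly doubles the spatial size, per the standard formula out = (in − 1)·stride − 2·padding + kernel. Note that four such stages map 16×16 to 256×256, so reaching the stated 224×224 output would additionally involve a resize or crop step not detailed here. A quick check of the doubling behaviour:

```python
def deconv_out(size, kernel=4, stride=2, padding=1):
    """Output size of a transpose convolution (no output_padding)."""
    return (size - 1) * stride - 2 * padding + kernel

size = 16
for stage in range(4):
    size = deconv_out(size)
print(size)  # 256: each 4/2/1 stage doubles the spatial resolution
```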
## License
This model is released under the MIT license. Use responsibly and ethically.
## Author
**Huzaifa Nasir**
National University of Computer and Emerging Sciences (NUCES)
Islamabad, Pakistan
📧 nasirhuzaifa95@gmail.com
## Acknowledgments
- Vision Transformer (Dosovitskiy et al.)
- FaceForensics++ dataset
- PyTorch and timm libraries
- Open-source AI community