---
license: mit
tags:
- face-generation
- computer-vision
- vision-transformer
- deepfake
- image-generation
- pytorch
- research-only
- vit
- cross-attention
language:
- en
library_name: pytorch
pipeline_tag: image-to-image
---
# FaceForge Generator: Vision Transformer-based Face Manipulation
[DOI: 10.5281/zenodo.18530439](https://doi.org/10.5281/zenodo.18530439)
[GitHub](https://github.com/Huzaifanasir95/FaceForge)
[License: MIT](https://opensource.org/licenses/MIT)
🚨 **252M Parameters | ViT-Based | Baseline Training Complete**

⚠️ **RESEARCH USE ONLY** - This model is for academic research and developing detection systems.
## Model Description
FaceForge Generator is a sophisticated Vision Transformer-based facial manipulation system that learns to synthesize realistic face swaps. The model combines dual ViT encoders, cross-attention mechanisms, transformer decoders, and CNN upsamplers to generate high-quality facial manipulations.
**Key Features:**
- 252 million trainable parameters
- Dual encoder architecture for source and target faces
- Cross-attention fusion mechanism
- Generates 224×224 RGB face images
- ~300ms inference time per image
- Achieved 0.204 validation loss after 3 epochs
## Model Architecture
```
FaceForge Generator (252.5M parameters)
│
├── ViT Encoders (172M params)
│   ├── Source Encoder: ViT-B/16 (86M)
│   │   └── 12 layers, 768-dim, 12 heads
│   └── Target Encoder: ViT-B/16 (86M)
│       └── 12 layers, 768-dim, 12 heads
│
├── Cross-Attention Module (14M params)
│   ├── 2 layers, 8 heads
│   ├── FFN: 768 → 3072 → 768
│   └── Dropout: 0.1
│
├── Transformer Decoder (58M params)
│   ├── 256 learnable queries (16×16)
│   ├── 6 decoder layers, 8 heads
│   └── 2D positional embeddings
│
└── CNN Upsampler (9M params)
    ├── TransposeConv: 768→512→256→128→64
    ├── 4 upsampling stages (16×16 → 224×224)
    └── Conv: 64→32→3 + Tanh
```
## Training Progress
### Baseline Training (3 Epochs)
| Epoch | Train Loss | Val Loss | Time (min) |
|-------|-----------|----------|------------|
| 1 | 0.2873 | 0.2804 | 227.5 |
| 2 | 0.2432 | 0.2304 | 231.2 |
| 3 | 0.2143 | 0.2043 | 228.8 |
**Total Training Time:** 11.5 hours (687.5 minutes)
### Loss Reduction
- Training loss: 0.287 → 0.214 (25.4% reduction)
- Validation loss: 0.280 → 0.204 (27.1% reduction)
- Minimal overfitting (train-val gap: 0.010)
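As a quick sanity check, the quoted reductions follow directly from the epoch-1 and epoch-3 values in the table above:

```python
# Loss values from the baseline training table (epochs 1 and 3)
train_start, train_end = 0.2873, 0.2143
val_start, val_end = 0.2804, 0.2043

train_reduction = 100 * (train_start - train_end) / train_start
val_reduction = 100 * (val_start - val_end) / val_start
gap = train_end - val_end  # train-val gap at epoch 3

print(round(train_reduction, 1))  # 25.4
print(round(val_reduction, 1))    # 27.1
print(round(gap, 3))              # 0.01
```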
## Usage
### Installation
```bash
pip install torch torchvision timm pillow numpy
```
### Loading the Model
```python
import torch
import torch.nn as nn
import timm
from torchvision import transforms
class FaceForgeGenerator(nn.Module):
def __init__(self):
super().__init__()
# Source and Target ViT Encoders
self.source_encoder = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=0)
self.target_encoder = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=0)
# Cross-attention (implement your architecture)
# Transformer decoder
# CNN upsampler
# ... (see full architecture in paper)
def forward(self, source_face, target_face):
# Encode both faces
source_features = self.source_encoder.forward_features(source_face)
target_features = self.target_encoder.forward_features(target_face)
# Cross-attention fusion
fused_features = self.cross_attention(source_features, target_features)
# Decode to spatial map
spatial_features = self.transformer_decoder(fused_features)
# Upsample to 224Γ224
generated_face = self.cnn_upsampler(spatial_features)
return generated_face
# Load checkpoint
model = FaceForgeGenerator()
checkpoint = torch.load('generator_best.pth', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()
# Preprocessing
transform = transforms.Compose([
transforms.Resize((224, 224)),
transforms.ToTensor(),
transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])
# Generate face swap
def generate_face_swap(source_path, target_path):
source = transform(Image.open(source_path).convert('RGB')).unsqueeze(0)
target = transform(Image.open(target_path).convert('RGB')).unsqueeze(0)
with torch.no_grad():
generated = model(source, target)
# Denormalize and convert to PIL
generated = (generated[0] * 0.5 + 0.5).clamp(0, 1)
generated = transforms.ToPILImage()(generated)
return generated
# Example
result = generate_face_swap("source.jpg", "target.jpg")
result.save("generated.jpg")
```
## Training Details
### Dataset
- **Source:** FaceForensics++ (c40 compression)
- **Training:** 7,000 face images (triplets: source, target, ground truth)
- **Validation:** 1,500 face images
- **Resolution:** 224×224 RGB
### Hyperparameters
```yaml
optimizer: AdamW
learning_rate: 1e-4
betas: [0.9, 0.999]
weight_decay: 1e-4
batch_size: 16
epochs: 3 (baseline)
loss_function: L1 (Mean Absolute Error)
lr_schedule: Cosine Annealing (1e-4 → 1e-6)
```
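The cosine-annealing schedule above corresponds to PyTorch's `CosineAnnealingLR`, whose per-epoch learning rate follows a simple closed form. A minimal sketch in plain Python (with `total_epochs` standing in for the scheduler's `T_max`):

```python
import math

def cosine_annealing_lr(epoch, total_epochs, lr_max=1e-4, lr_min=1e-6):
    """Closed form of cosine annealing: lr_max at epoch 0, lr_min at total_epochs."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))

for epoch in range(4):
    print(f"epoch {epoch}: lr = {cosine_annealing_lr(epoch, 3):.2e}")
# epoch 0: lr = 1.00e-04
# epoch 3: lr = 1.00e-06
```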
### Training Configuration
- **Hardware:** CPU
- **Throughput:** ~32 samples/minute
- **Batch Processing:** 219 train batches, 47 val batches per epoch
- **Best Model:** Saved at epoch 3
## Current Status
⚠️ **Baseline Training:** This model has completed 3 epochs of baseline training. For production-quality face generation, extended training (15-20 epochs) is recommended.

**Current Capabilities:**
- ✅ Learns pose transfer
- ✅ Captures facial structures
- ✅ Shows convergence trend
- ⏳ Some blur in generated images (expected at baseline)
- ⏳ Benefits from extended training
## Use Cases
### Research Applications
1. **Detector Training:** Generate challenging samples for deepfake detection
2. **Adversarial Training:** Min-max game with detector
3. **Understanding Manipulation:** Study how synthetic faces are created
4. **Benchmark Creation:** Generate test sets for evaluation
### Educational Uses
- Demonstrate face generation techniques
- Teach computer vision concepts
- Illustrate transformer architectures
- Show attention mechanism visualization
## Limitations
1. **Training Duration:** Only 3 epochs completed; extended training needed for photo-realism
2. **Blur:** Generated faces show some blur at baseline stage
3. **Dataset Scale:** Trained on ~8,500 images; larger datasets would improve quality
4. **Single Frame:** Doesn't consider temporal consistency for video
5. **Compute:** Large model (252M params) requires significant memory
## Ethical Guidelines
⚠️ **Responsible Use Required**

This model is intended for:
- ✅ Academic research
- ✅ Deepfake detection development
- ✅ Educational demonstrations
- ✅ Ethical AI studies

**Prohibited uses:**
- ❌ Creating misinformation
- ❌ Identity theft or impersonation
- ❌ Non-consensual face manipulation
- ❌ Malicious content creation
**Recommendations:**
- Watermark generated content
- Maintain audit logs
- Require user consent
- Implement content filters
## Future Improvements
Planned enhancements:
- [ ] Extended training (15-20 epochs)
- [ ] Perceptual loss functions (VGG, LPIPS)
- [ ] GAN-based adversarial training
- [ ] Multi-scale architecture
- [ ] Attention visualization
- [ ] Video temporal consistency
## Citation
```bibtex
@techreport{nasir2026faceforge,
title={FaceForge: A Deep Learning Framework for Facial Manipulation Generation and Detection},
author={Nasir, Huzaifa},
institution={National University of Computer and Emerging Sciences},
year={2026},
doi={10.5281/zenodo.18530439}
}
```
## Links
- **Paper:** https://doi.org/10.5281/zenodo.18530439
- **Code:** https://github.com/Huzaifanasir95/FaceForge
- **Detector Model:** https://huggingface.co/Huzaifanasir95/faceforge-detector
- **Notebooks:** See repository for training/inference notebooks
## Architecture Details
### Vision Transformer Encoder
- **Patch Size:** 16×16
- **Patches:** 196 + 1 CLS token
- **Embedding Dim:** 768
- **Layers:** 12
- **Attention Heads:** 12
- **MLP Ratio:** 4.0
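The 196-patch figure follows directly from the image and patch sizes, as a quick check shows:

```python
image_size, patch_size = 224, 16

num_patches = (image_size // patch_size) ** 2  # 14 x 14 grid of patches
num_tokens = num_patches + 1                   # plus the CLS token

print(num_patches, num_tokens)  # 196 197
```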
### Cross-Attention Mechanism
- **Query:** Source features
- **Key/Value:** Target features
- **Attention:** Multi-head (8 heads)
- **FFN Expansion:** 4× (768 → 3072 → 768)
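To illustrate the fusion step (this is a sketch, not the repository's implementation), here is single-head scaled dot-product cross-attention in NumPy, with source tokens as queries and target tokens as keys/values; the real module uses 8 heads plus the FFN and dropout:

```python
import numpy as np

def cross_attention(query, key_value):
    """Single-head scaled dot-product attention: query attends to key_value."""
    d_k = query.shape[-1]
    scores = query @ key_value.T / np.sqrt(d_k)
    # Softmax over the key axis (numerically stabilized)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ key_value

rng = np.random.default_rng(0)
source_tokens = rng.normal(size=(197, 768))  # queries: source-face features
target_tokens = rng.normal(size=(197, 768))  # keys/values: target-face features

fused = cross_attention(source_tokens, target_tokens)
print(fused.shape)  # (197, 768)
```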
### CNN Upsampler
- **Input:** 768×16×16
- **Output:** 3×224×224
- **Stages:** 4 transpose convolutions
- **Kernel:** 4×4, Stride: 2, Padding: 1
- **Activation:** ReLU → Tanh (output)
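With kernel 4, stride 2, and padding 1, each transpose convolution exactly doubles the spatial size, per the standard formula out = (in − 1)·stride − 2·padding + kernel. Note that four such stages map 16×16 to 256×256, so reaching the stated 224×224 output would additionally involve a resize or crop step not detailed here. A quick check of the doubling behaviour:

```python
def deconv_out(size, kernel=4, stride=2, padding=1):
    """Output size of a transpose convolution (no output_padding)."""
    return (size - 1) * stride - 2 * padding + kernel

size = 16
for stage in range(4):
    size = deconv_out(size)
print(size)  # 256: each 4/2/1 stage doubles the spatial resolution
```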
## License
This model is released under the MIT license. Use responsibly and ethically.
## Author
**Huzaifa Nasir**
National University of Computer and Emerging Sciences (NUCES)
Islamabad, Pakistan
📧 nasirhuzaifa95@gmail.com
## Acknowledgments
- Vision Transformer (Dosovitskiy et al.)
- FaceForensics++ dataset
- PyTorch and timm libraries
- Open-source AI community