---

license: mit
tags:
- face-generation
- computer-vision
- vision-transformer
- deepfake
- image-generation
- pytorch
- research-only
- vit
- cross-attention
language:
- en
library_name: pytorch
pipeline_tag: image-to-image
---


# FaceForge Generator: Vision Transformer-based Face Manipulation

[![Paper](https://img.shields.io/badge/Paper-Zenodo-blue)](https://doi.org/10.5281/zenodo.18530439)
[![GitHub](https://img.shields.io/badge/GitHub-Repository-black)](https://github.com/Huzaifanasir95/FaceForge)
[![License](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)

🎨 **252M Parameters | ViT-Based | Baseline Training Complete**

⚠️ **RESEARCH USE ONLY** - This model is for academic research and developing detection systems.

## Model Description

FaceForge Generator is a sophisticated Vision Transformer-based facial manipulation system that learns to synthesize realistic face swaps. The model combines dual ViT encoders, cross-attention mechanisms, transformer decoders, and CNN upsamplers to generate high-quality facial manipulations.

**Key Features:**
- 🏗️ 252 million trainable parameters
- 🔄 Dual encoder architecture for source and target faces
- 🎯 Cross-attention fusion mechanism
- 🖼️ Generates 224×224 RGB face images
- ⚡ ~300ms inference time per image
- 📉 Achieved 0.204 validation loss after 3 epochs

## Model Architecture

```
FaceForge Generator (252.5M parameters)
│
├── ViT Encoders (172M params)
│   ├── Source Encoder: ViT-B/16 (86M)
│   │   └── 12 layers, 768-dim, 12 heads
│   └── Target Encoder: ViT-B/16 (86M)
│       └── 12 layers, 768-dim, 12 heads
│
├── Cross-Attention Module (14M params)
│   ├── 2 layers, 8 heads
│   ├── FFN: 768 → 3072 → 768
│   └── Dropout: 0.1
│
├── Transformer Decoder (58M params)
│   ├── 256 learnable queries (16×16)
│   ├── 6 decoder layers, 8 heads
│   └── 2D positional embeddings
│
└── CNN Upsampler (9M params)
    ├── TransposeConv: 768→512→256→128→64
    ├── 4 upsampling stages (16×16 → 224×224)
    └── Conv: 64→32→3 + Tanh
```

## Training Progress

### Baseline Training (3 Epochs)

| Epoch | Train Loss | Val Loss | Time (min) |
|-------|-----------|----------|------------|
| 1 | 0.2873 | 0.2804 | 227.5 |
| 2 | 0.2432 | 0.2304 | 231.2 |
| 3 | 0.2143 | 0.2043 | 228.8 |

**Total Training Time:** 11.5 hours (687.5 minutes)

### Loss Reduction
- Training loss: 0.287 → 0.214 (25.4% reduction)
- Validation loss: 0.280 → 0.204 (27.1% reduction)
- Minimal overfitting (train-val gap: 0.010)

## Usage

### Installation

```bash
pip install torch torchvision timm pillow numpy
```

### Loading the Model

```python
import torch
import torch.nn as nn
import timm
from PIL import Image
from torchvision import transforms


class FaceForgeGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        # Source and target ViT encoders
        self.source_encoder = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=0)
        self.target_encoder = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=0)

        # The cross-attention module, transformer decoder, and CNN upsampler
        # must also be defined to match the checkpoint
        # (see the full architecture in the paper)

    def forward(self, source_face, target_face):
        # Encode both faces
        source_features = self.source_encoder.forward_features(source_face)
        target_features = self.target_encoder.forward_features(target_face)

        # Cross-attention fusion
        fused_features = self.cross_attention(source_features, target_features)

        # Decode to a spatial feature map
        spatial_features = self.transformer_decoder(fused_features)

        # Upsample to 224×224
        generated_face = self.cnn_upsampler(spatial_features)

        return generated_face


# Load checkpoint
model = FaceForgeGenerator()
checkpoint = torch.load('generator_best.pth', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Preprocessing
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])


# Generate a face swap
def generate_face_swap(source_path, target_path):
    source = transform(Image.open(source_path).convert('RGB')).unsqueeze(0)
    target = transform(Image.open(target_path).convert('RGB')).unsqueeze(0)

    with torch.no_grad():
        generated = model(source, target)

    # Denormalize and convert to PIL
    generated = (generated[0] * 0.5 + 0.5).clamp(0, 1)
    return transforms.ToPILImage()(generated)


# Example
result = generate_face_swap("source.jpg", "target.jpg")
result.save("generated.jpg")
```

## Training Details

### Dataset
- **Source:** FaceForensics++ (c40 compression)
- **Training:** 7,000 face images (triplets: source, target, ground truth)
- **Validation:** 1,500 face images
- **Resolution:** 224Γ—224 RGB

### Hyperparameters
```yaml
optimizer: AdamW
learning_rate: 1e-4
betas: [0.9, 0.999]
weight_decay: 1e-4
batch_size: 16
epochs: 3 (baseline)
loss_function: L1 (mean absolute error)
lr_schedule: cosine annealing (1e-4 → 1e-6)
```
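The listed hyperparameters map directly onto PyTorch's optimizer and scheduler APIs. The sketch below shows one way to wire them up; the `nn.Conv2d` stand-in and the loop skeleton are placeholders, not the actual training script, and `FaceForgeGenerator` would be substituted for the stand-in in practice.

```python
import torch
import torch.nn as nn

# Placeholder model; in practice this would be the FaceForgeGenerator.
model = nn.Conv2d(3, 3, kernel_size=3, padding=1)

# Optimizer and schedule matching the hyperparameters above.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=1e-4
)
epochs = 3
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=epochs, eta_min=1e-6
)
criterion = nn.L1Loss()  # L1 / mean absolute error reconstruction loss

for epoch in range(epochs):
    # ... iterate over (source, target, ground_truth) batches,
    #     compute criterion(model_output, ground_truth), backprop ...
    optimizer.step()
    scheduler.step()  # anneals lr from 1e-4 toward 1e-6 over the run
```

With `T_max` equal to the epoch count, the learning rate follows a half cosine from `1e-4` down to `eta_min` by the final epoch.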

### Training Configuration
- **Hardware:** CPU
- **Throughput:** ~32 samples/minute
- **Batch Processing:** 219 train batches, 47 val batches per epoch
- **Best Model:** Saved at epoch 3

## Current Status

⚠️ **Baseline Training:** This model has completed 3 epochs of baseline training. For production-quality face generation, extended training (15-20 epochs) is recommended.

**Current Capabilities:**
- βœ… Learns pose transfer
- βœ… Captures facial structures
- βœ… Shows convergence trend
- ⏳ Some blur in generated images (expected at baseline)
- ⏳ Benefits from extended training

## Use Cases

### Research Applications
1. **Detector Training:** Generate challenging samples for deepfake detection
2. **Adversarial Training:** Min-max game with detector
3. **Understanding Manipulation:** Study how synthetic faces are created
4. **Benchmark Creation:** Generate test sets for evaluation

### Educational Uses
- Demonstrate face generation techniques
- Teach computer vision concepts
- Illustrate transformer architectures
- Show attention mechanism visualization

## Limitations

1. **Training Duration:** Only 3 epochs completed; extended training needed for photo-realism
2. **Blur:** Generated faces show some blur at baseline stage
3. **Dataset Scale:** Trained on ~8.5K images (7,000 train + 1,500 validation); larger datasets would improve quality
4. **Single Frame:** Doesn't consider temporal consistency for video
5. **Compute:** Large model (252M params) requires significant memory

## Ethical Guidelines

⚠️ **Responsible Use Required**

This model is intended for:
- ✅ Academic research
- ✅ Deepfake detection development
- ✅ Educational demonstrations
- ✅ Ethical AI studies

**Prohibited uses:**
- ❌ Creating misinformation
- ❌ Identity theft or impersonation
- ❌ Non-consensual face manipulation
- ❌ Malicious content creation

**Recommendations:**
- Watermark generated content
- Maintain audit logs
- Require user consent
- Implement content filters
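The watermarking recommendation can be as simple as stamping a visible label on every generated image before it leaves the pipeline. The helper below is an illustrative sketch using Pillow (already a dependency); the function name and label text are arbitrary choices, not part of the released tooling.

```python
from PIL import Image, ImageDraw

def watermark(img: Image.Image, text: str = "AI-GENERATED") -> Image.Image:
    """Return a copy of a generated face with a visible text label.

    Minimal sketch: draws the label in white near the bottom-left corner
    using Pillow's built-in default font.
    """
    out = img.convert("RGB").copy()
    draw = ImageDraw.Draw(out)
    draw.text((8, out.height - 20), text, fill=(255, 255, 255))
    return out
```

In a real deployment this would typically be paired with an invisible (steganographic) watermark and an audit-log entry per generation.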

## Future Improvements

Planned enhancements:
- [ ] Extended training (15-20 epochs)
- [ ] Perceptual loss functions (VGG, LPIPS)
- [ ] GAN-based adversarial training
- [ ] Multi-scale architecture
- [ ] Attention visualization
- [ ] Video temporal consistency

## Citation

```bibtex
@techreport{nasir2026faceforge,
  title={FaceForge: A Deep Learning Framework for Facial Manipulation Generation and Detection},
  author={Nasir, Huzaifa},
  institution={National University of Computer and Emerging Sciences},
  year={2026},
  doi={10.5281/zenodo.18530439}
}
```

## Links

- 📄 **Paper:** https://doi.org/10.5281/zenodo.18530439
- 💻 **Code:** https://github.com/Huzaifanasir95/FaceForge
- 🔍 **Detector Model:** https://huggingface.co/Huzaifanasir95/faceforge-detector
- 📓 **Notebooks:** See repository for training/inference notebooks

## Architecture Details

### Vision Transformer Encoder
- **Patch Size:** 16×16
- **Patches:** 196 + 1 CLS token
- **Embedding Dim:** 768
- **Layers:** 12
- **Attention Heads:** 12
- **MLP Ratio:** 4.0
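The token count above follows directly from the patch geometry: a 224×224 image divided into 16×16 patches yields a 14×14 grid, i.e. 196 patch tokens, plus one prepended CLS token. The snippet below reproduces just the patch-embedding step with plain PyTorch to make the shapes concrete (the weights here are random, purely for shape illustration):

```python
import torch
import torch.nn as nn

# ViT-B/16 patch embedding: a 16x16, stride-16 convolution turns a
# 224x224 image into a 14x14 grid of 768-dim patch tokens.
patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)

img = torch.randn(1, 3, 224, 224)
patches = patch_embed(img).flatten(2).transpose(1, 2)  # (1, 196, 768)

# Prepend a CLS token to get the 197-token sequence the encoder consumes.
cls_token = torch.zeros(1, 1, 768)
tokens = torch.cat([cls_token, patches], dim=1)        # (1, 197, 768)
```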

### Cross-Attention Mechanism
- **Query:** Source features
- **Key/Value:** Target features
- **Attention:** Multi-head (8 heads)
- **FFN Expansion:** 4× (768 → 3072 → 768)
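A single layer of this mechanism can be sketched with `nn.MultiheadAttention`: source tokens form the queries, target tokens the keys and values, followed by the 4× feed-forward expansion. This is an illustrative implementation of the description above, not the checkpoint's actual module (layer names and the pre-norm placement are assumptions):

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """One cross-attention layer: source tokens attend over target tokens."""
    def __init__(self, dim=768, heads=8, ffn_mult=4, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout,
                                          batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * ffn_mult),  # 768 -> 3072
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(dim * ffn_mult, dim),  # 3072 -> 768
        )

    def forward(self, source, target):
        # Query = source features; Key/Value = target features
        attended, _ = self.attn(self.norm1(source), target, target)
        x = source + attended           # residual around attention
        x = x + self.ffn(self.norm2(x)) # residual around FFN
        return x

# 197 tokens (196 patches + CLS), 768-dim, batch of 2
src = torch.randn(2, 197, 768)
tgt = torch.randn(2, 197, 768)
fused = CrossAttentionBlock()(src, tgt)  # same (2, 197, 768) shape out
```

Stacking two such layers gives the 2-layer cross-attention module from the parameter breakdown.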

### CNN Upsampler
- **Input:** 768×16×16
- **Output:** 3×224×224
- **Stages:** 4 transpose convolutions
- **Kernel:** 4×4, Stride: 2, Padding: 1
- **Activation:** ReLU → Tanh (output)
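Note that four stride-2 transpose convolutions take 16×16 to 256×256, so reaching exactly 224×224 requires an extra resize or crop that the summary above does not specify. The sketch below uses a final bilinear resize as one plausible choice; treat the whole module as an illustration of the listed channel progression, not the checkpoint's exact head:

```python
import torch
import torch.nn as nn

class CNNUpsampler(nn.Module):
    """Sketch of the upsampler head: 768->512->256->128->64 transpose convs,
    then a 64->32->3 conv head with Tanh output."""
    def __init__(self):
        super().__init__()
        chans = [768, 512, 256, 128, 64]
        layers = []
        for cin, cout in zip(chans, chans[1:]):
            # kernel 4x4, stride 2, padding 1 doubles spatial size each stage
            layers += [nn.ConvTranspose2d(cin, cout, kernel_size=4,
                                          stride=2, padding=1),
                       nn.ReLU(inplace=True)]
        self.up = nn.Sequential(*layers)  # 16x16 -> 256x256
        self.head = nn.Sequential(
            nn.Conv2d(64, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 3, kernel_size=3, padding=1),
            nn.Tanh(),  # outputs in [-1, 1], matching the [-0.5, 0.5]-normalized data
        )

    def forward(self, x):  # x: (B, 768, 16, 16)
        x = self.up(x)
        # Assumed resize step to hit the 224x224 target resolution.
        x = nn.functional.interpolate(x, size=(224, 224), mode="bilinear",
                                      align_corners=False)
        return self.head(x)

out = CNNUpsampler()(torch.randn(1, 768, 16, 16))  # (1, 3, 224, 224)
```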

## License

This model is released under the MIT license (see the metadata above). Use responsibly and ethically.

## Author

**Huzaifa Nasir**  
National University of Computer and Emerging Sciences (NUCES)  
Islamabad, Pakistan  
📧 nasirhuzaifa95@gmail.com

## Acknowledgments

- Vision Transformer (Dosovitskiy et al.)
- FaceForensics++ dataset
- PyTorch and timm libraries
- Open-source AI community