Deepfake Detector: Xception CNN
A binary image classifier (real vs fake face) built from scratch in PyTorch using the Xception architecture. Trained on 100k images, achieving 99.36% validation accuracy. This model is the CNN backbone of a larger RAG-powered forensic analysis system that pairs predictions with explanations grounded in peer-reviewed deepfake detection research.
Model Architecture
The full Xception architecture was reimplemented from scratch in PyTorch, with no pretrained weights and no transfer learning. The architecture follows Chollet (2017) and consists of three flows:
- Entry flow: Two standard convolutions followed by three residual blocks using depthwise separable convolutions with max pooling, progressively downsampling from 299x299 to 19x19 while increasing depth from 3 to 728 channels.
- Middle flow: Eight repeated residual blocks at 728 channels with no spatial downsampling. This is the bulk of the model's representational capacity.
- Exit flow: One residual block expanding from 728 to 1024 channels, two additional separable convolutions expanding to 1536 and 2048 channels, global average pooling, and a fully connected classification head.
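The downsampling path described for the entry flow can be checked with a few lines of arithmetic. This is a minimal sketch: the kernel sizes, strides, and paddings below are assumptions taken from common PyTorch ports of Xception and are not stated in this README, so `models/xception.py` may differ in detail.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a convolution or pooling layer along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


# Assumed layer settings (not taken from this README).
size = 299
trace = [size]

size = conv_out(size, kernel=3, stride=2, padding=0)  # first standard convolution
trace.append(size)
size = conv_out(size, kernel=3, stride=1, padding=0)  # second standard convolution
trace.append(size)

# Three entry-flow residual blocks, each assumed to end in a 3x3 max pool with stride 2.
for _ in range(3):
    size = conv_out(size, kernel=3, stride=2, padding=1)
    trace.append(size)

print(trace)  # [299, 149, 147, 74, 37, 19]
```

Under these assumptions the middle flow operates on 19x19 feature maps, matching the description above.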
Depthwise separable convolutions factorize standard convolutions into a depthwise spatial filter per channel followed by a pointwise 1x1 convolution for channel mixing, significantly reducing parameter count while maintaining representational power.
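To make the parameter saving concrete, the sketch below counts weights for one 3x3 layer at 728 channels, the width of the middle flow. It assumes bias-free convolutions (a common choice when batch normalization follows each convolution); the function names are illustrative and not part of the project code.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution without bias."""
    return c_in * c_out * k * k


def separable_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a depthwise k x k filter per channel plus a pointwise 1x1 convolution."""
    depthwise = c_in * k * k
    pointwise = c_in * c_out
    return depthwise + pointwise


standard = standard_conv_params(728, 728, 3)
separable = separable_conv_params(728, 728, 3)

print(standard)                        # 4769856
print(separable)                       # 536536
print(round(standard / separable, 1))  # 8.9
```

For this layer shape the separable version needs roughly one ninth of the weights of a standard convolution.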
- Total parameters: ~20M
- Input size: 299x299x3
- Output: 2-class softmax, with fake at index 0 and real at index 1
Training
| Parameter | Value |
|---|---|
| Dataset | 140k Real and Fake Faces |
| Train set | 100,000 images (50k real, 50k fake) |
| Validation set | 20,000 images (10k real, 10k fake) |
| Test set | 20,000 images (10k real, 10k fake) |
| Epochs | 10 (resumed from epoch 7) |
| Batch size | 32 |
| Optimizer | Adam |
| Loss | CrossEntropyLoss |
| Hardware | Kaggle T4 GPU with mixed precision |
| Normalization | mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5] |
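The normalization row has a simple interpretation: with a mean and standard deviation of 0.5 per channel, pixel values scaled to [0, 1] are mapped to [-1, 1]. The sketch below shows the per-pixel arithmetic in plain Python; `normalize` and `denormalize` are illustrative helper names, not functions from the project.

```python
def normalize(pixel: float, mean: float = 0.5, std: float = 0.5) -> float:
    """Apply (x - mean) / std to a pixel value already scaled to [0, 1]."""
    return (pixel - mean) / std


def denormalize(value: float, mean: float = 0.5, std: float = 0.5) -> float:
    """Invert the normalization, e.g. before displaying a tensor as an image."""
    return value * std + mean


print([normalize(p) for p in (0.0, 0.5, 1.0)])  # [-1.0, 0.0, 1.0]
```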
Results
| Split | Loss | Accuracy |
|---|---|---|
| Train | 0.0098 | 99.63% |
| Validation | 0.0172 | 99.36% |
Training Data
140k Real and Fake Faces by xhlulu on Kaggle.
- Real faces: 70,000 images from the Flickr-Faces-HQ (FFHQ) dataset collected by Nvidia.
- Fake faces: 70,000 images sampled from the 1 Million Fake Faces dataset, generated using StyleGAN.
- All images in the dataset were resized to 256x256 px; they are resized again to 299x299 during training preprocessing.
Usage
import torch
from torchvision import transforms
from PIL import Image
from huggingface_hub import hf_hub_download

# The Xception class must be available: copy models/xception.py from the project repo
from models.xception import Xception

# Download weights
weights_path = hf_hub_download(
    repo_id="RamadhanZome/deepfake-xception",
    filename="best_xception.pth"
)

# Load model
model = Xception(num_classes=2)
model.load_state_dict(torch.load(weights_path, map_location="cpu"))
model.eval()

# Preprocessing: must match training exactly
transform = transforms.Compose([
    transforms.Resize((299, 299)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])

# Predict
image = Image.open("face.jpg").convert("RGB")
input_tensor = transform(image).unsqueeze(0)  # add a batch dimension

with torch.no_grad():
    output = model(input_tensor)
    probs = torch.softmax(output, dim=1)
    confidence, idx = torch.max(probs, dim=1)

labels = ["fake", "real"]
print(f"{labels[idx.item()]} ({confidence.item() * 100:.2f}%)")
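The prediction step above turns two raw model outputs into a label and a confidence score. The following dependency-free sketch shows that decoding logic on its own; the logits are made-up values for illustration, and `softmax` and `decode` are hypothetical helpers rather than part of the project.

```python
import math

LABELS = ["fake", "real"]  # index 0 = fake, index 1 = real, as in the model card


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Return the predicted label and its softmax probability."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[idx], probs[idx]


label, confidence = decode([2.0, -1.0])  # hypothetical logits, not real model output
print(f"{label} ({confidence * 100:.2f}%)")  # fake (95.26%)
```

Note that the confidence is only the softmax probability of the winning class; it is not a calibrated estimate of correctness.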
RAG-Powered Explanation System
This model is deployed as part of a larger forensic analysis pipeline combining CNN predictions with Retrieval-Augmented Generation (RAG). When an image is classified, the system:
- Retrieves the most relevant chunks from a FAISS index built over 10 peer-reviewed deepfake detection papers
- Constructs a prompt combining the prediction, confidence score, and retrieved research context
- Sends the prompt to Llama 3.3 70B via Groq to generate a human-readable forensic explanation grounded in the literature
Knowledge base includes: FaceForensics++, Xception (Chollet 2017), RAG (Lewis et al. 2020), FreqNet, Deepfakes and Beyond survey, Deepfake Detection Reliability Survey, and others.
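The retrieval and prompt-construction steps can be sketched without any external services. The example below is an assumption-laden illustration, not the deployed pipeline: a brute-force cosine search stands in for the FAISS index, the three-dimensional embeddings and chunk texts are toy placeholders, the prompt wording is invented, and the call to the language model is omitted.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, index, k=2):
    """Brute-force top-k search; a FAISS index would replace this in practice."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]


def build_prompt(label, confidence, chunks):
    """Combine the prediction, confidence score, and retrieved context into one prompt."""
    context = "\n".join(f"- {chunk}" for chunk in chunks)
    return (
        f"Prediction: {label} (confidence {confidence:.2%})\n"
        f"Research context:\n{context}\n"
        "Explain this prediction using only the context above."
    )


# Toy embeddings and placeholder chunk texts (all hypothetical).
index = [
    ([1.0, 0.0, 0.0], "placeholder chunk A"),
    ([0.0, 1.0, 0.0], "placeholder chunk B"),
    ([0.9, 0.1, 0.0], "placeholder chunk C"),
]

chunks = retrieve([1.0, 0.0, 0.0], index, k=2)
prompt = build_prompt("fake", 0.9526, chunks)
print(prompt)
```

In a real deployment the query vector and chunk embeddings would come from the same embedding model, and the resulting prompt would be sent to the language model.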
Limitations
- Trained exclusively on StyleGAN-generated fakes, so generalization to other generation methods (FaceSwap, diffusion-based) is not guaranteed.
- Performance may degrade on images that have been heavily compressed or resized before inference.
- Designed for still face images; not evaluated on video frames or non-face content.
- May be vulnerable to adversarial attacks as noted in Carlini & Farid (2020).
Citation
@inproceedings{chollet2017xception,
title={Xception: Deep Learning with Depthwise Separable Convolutions},
author={Chollet, François},
booktitle={CVPR},
year={2017}
}
@misc{140kfaces,
author={xhlulu},
title={140k Real and Fake Faces},
year={2020},
publisher={Kaggle},
howpublished={https://www.kaggle.com/datasets/xhlulu/140k-real-and-fake-faces}
}
Author
Ramadhan Zome
GitHub: RamadhanAdam | HuggingFace: RamadhanZome

