Deepfake Detector — Xception CNN

A binary image classifier (real vs fake face) built from scratch in PyTorch using the Xception architecture. Trained on 100k images, achieving 99.36% validation accuracy. This model is the CNN backbone of a larger RAG-powered forensic analysis system that pairs predictions with explanations grounded in peer-reviewed deepfake detection research.

Model Architecture

The full Xception architecture was reimplemented from scratch in PyTorch — no pretrained weights, no transfer learning. The architecture follows Chollet (2017) and consists of three flows:

Entry flow: Two standard convolutions followed by three residual blocks using depthwise separable convolutions with max pooling, progressively downsampling from 299x299 to 19x19 while increasing depth from 3 to 728 channels.
Middle flow: Eight repeated residual blocks at 728 channels with no spatial downsampling — the bulk of the model's representational capacity.
Exit flow: One residual block expanding from 728 to 1024 channels, two additional separable convolutions expanding to 1536 and 2048 channels, global average pooling, and a fully connected classification head.
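The 299x299 to 19x19 reduction in the entry flow can be checked with the usual output-size formula. This is a small sketch that assumes the layout of the reference Xception design (two unpadded 3x3 convolutions, the first with stride 2, then a 3x3 stride-2 max pool with padding 1 in each of the three residual blocks); the helper name `conv_out` is illustrative, and the project's own implementation may differ in padding details.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output width of a conv or pooling layer: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# (name, kernel, stride, padding) for each layer that changes spatial size.
# These settings are assumed from the reference Xception layout.
stages = [
    ("conv1 3x3, stride 2", 3, 2, 0),
    ("conv2 3x3, stride 1", 3, 1, 0),
    ("block1 max pool", 3, 2, 1),
    ("block2 max pool", 3, 2, 1),
    ("block3 max pool", 3, 2, 1),
]

size = 299
sizes = [size]
for name, kernel, stride, padding in stages:
    size = conv_out(size, kernel, stride, padding)
    sizes.append(size)

print(sizes)  # [299, 149, 147, 74, 37, 19]
```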

Depthwise separable convolutions factorize standard convolutions into a depthwise spatial filter per channel followed by a pointwise 1x1 convolution for channel mixing, significantly reducing parameter count while maintaining representational power.
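The parameter saving can be measured directly. The sketch below is a minimal PyTorch module written for illustration, not the project's `models/xception.py`; the class name `SeparableConv2d` and the helper `count_params` are assumptions. It compares a standard 3x3 convolution with its separable counterpart at the 728-channel width used in the middle flow.

```python
import torch
from torch import nn

class SeparableConv2d(nn.Module):
    """Depthwise 3x3 filter per channel, then a pointwise 1x1 convolution."""

    def __init__(self, in_channels, out_channels, kernel_size=3, padding=1):
        super().__init__()
        # groups=in_channels gives one spatial filter per input channel.
        self.depthwise = nn.Conv2d(
            in_channels, in_channels, kernel_size,
            padding=padding, groups=in_channels, bias=False,
        )
        # The 1x1 convolution mixes information across channels.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

def count_params(module):
    return sum(p.numel() for p in module.parameters())

standard = nn.Conv2d(728, 728, kernel_size=3, padding=1, bias=False)
separable = SeparableConv2d(728, 728)

print(count_params(standard))   # 4769856 = 728 * 728 * 9
print(count_params(separable))  # 536536  = 728 * 9 + 728 * 728
```

At this width the separable layer uses roughly 9x fewer weights while producing an output of the same shape.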

Total parameters: ~20M
Input size: 299x299x3
Output: 2-class softmax over logits; fake is index 0, real is index 1

Training

Parameter	Value
Dataset	140k Real and Fake Faces
Train set	100,000 images (50k real, 50k fake)
Validation set	20,000 images (10k real, 10k fake)
Test set	20,000 images (10k real, 10k fake)
Epochs	10 (resumed from epoch 7)
Batch size	32
Optimizer	Adam
Loss	CrossEntropyLoss
Hardware	Kaggle T4 GPU with mixed precision
Normalization	mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]
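A training epoch under these settings might look like the sketch below. It is an assumption-laden outline rather than the project's training script: the function name `train_one_epoch` and its arguments are illustrative, and the learning rate and any scheduler are not stated in this card, so the optimizer is passed in by the caller. Mixed precision is enabled only when a CUDA device is used.

```python
import torch
from torch import nn

def train_one_epoch(model, loader, optimizer, device, scaler):
    """One pass over `loader` with CrossEntropyLoss and automatic mixed precision."""
    criterion = nn.CrossEntropyLoss()
    use_amp = device.type == "cuda"  # autocast is switched off on CPU
    model.train()
    total_loss, correct, seen = 0.0, 0, 0

    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad(set_to_none=True)

        # Forward pass runs in reduced precision where it is safe to do so.
        with torch.autocast(device_type=device.type, enabled=use_amp):
            logits = model(images)
            loss = criterion(logits, labels)

        # The scaler guards against float16 gradient underflow;
        # when disabled it simply calls backward() and optimizer.step().
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

        total_loss += loss.item() * labels.size(0)
        correct += (logits.argmax(dim=1) == labels).sum().item()
        seen += labels.size(0)

    return total_loss / seen, correct / seen
```

A caller would build `torch.optim.Adam(model.parameters())` and `torch.cuda.amp.GradScaler(enabled=device.type == "cuda")` once, then call the function once per epoch with a `DataLoader` of batch size 32.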

Results

Split	Loss	Accuracy
Train	0.0098	99.63%
Validation	0.0172	99.36%

Training Data

140k Real and Fake Faces by xhlulu on Kaggle.

Real faces: 70,000 images from the Flickr-Faces-HQ (FFHQ) dataset collected by Nvidia.
Fake faces: 70,000 images sampled from the 1 Million Fake Faces dataset, generated using StyleGAN.
All images in the dataset are 256x256 px and are upsampled to 299x299 during training preprocessing.

Usage

import torch
from torchvision import transforms
from PIL import Image
from huggingface_hub import hf_hub_download

# The Xception class must be available — copy models/xception.py from the project repo
from models.xception import Xception

# Download weights
weights_path = hf_hub_download(
    repo_id="RamadhanZome/deepfake-xception",
    filename="best_xception.pth"
)

# Load model
model = Xception(num_classes=2)
model.load_state_dict(torch.load(weights_path, map_location="cpu"))
model.eval()

# Preprocessing — must match training exactly
transform = transforms.Compose([
    transforms.Resize((299, 299)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])

# Predict
image = Image.open("face.jpg").convert("RGB")
input_tensor = transform(image).unsqueeze(0)

with torch.no_grad():
    output = model(input_tensor)
    probs = torch.softmax(output, dim=1)
    confidence, idx = torch.max(probs, dim=1)

labels = ["fake", "real"]
print(f"{labels[idx.item()]} ({confidence.item() * 100:.2f}%)")

RAG-Powered Explanation System

This model is deployed as part of a larger forensic analysis pipeline combining CNN predictions with Retrieval-Augmented Generation (RAG). When an image is classified, the system:

Retrieves the most relevant chunks from a FAISS index built over 10 peer-reviewed deepfake detection papers
Constructs a prompt combining the prediction, confidence score, and retrieved research context
Sends the prompt to Llama 3.3 70B via Groq to generate a human-readable forensic explanation grounded in the literature

Knowledge base includes: FaceForensics++, Xception (Chollet 2017), RAG (Lewis et al. 2020), FreqNet, Deepfakes and Beyond survey, Deepfake Detection Reliability Survey, and others.
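The retrieval and prompt-construction steps described in this section can be sketched in plain Python. Everything here is hypothetical: `top_k` is a brute-force inner-product search standing in for the FAISS index, the vectors and chunk texts are placeholders, `build_prompt` is an illustrative template rather than the deployed prompt, and the call to the language model is omitted.

```python
def top_k(query, vectors, k):
    """Indices of the k vectors with the highest inner product with `query`.

    A stand-in for a similarity-search index; a real system would query FAISS.
    """
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    order = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return order[:k]

def build_prompt(label, confidence, chunks):
    """Combine the prediction, its confidence, and retrieved context."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        f"The classifier labeled this image as {label} "
        f"with {confidence:.2%} confidence.\n\n"
        f"Research context:\n{context}\n\n"
        "Explain the prediction using only the context above."
    )

# Placeholder knowledge base: toy 2-D embeddings and dummy chunk text.
chunk_texts = ["placeholder chunk A", "placeholder chunk B", "placeholder chunk C"]
chunk_vectors = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]

query_vector = [1.0, 0.0]  # would come from embedding the query text
hits = top_k(query_vector, chunk_vectors, k=2)
prompt = build_prompt("fake", 0.975, [chunk_texts[i] for i in hits])
print(prompt)
```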

Limitations

Trained exclusively on StyleGAN-generated fakes — generalization to other generation methods (FaceSwap, diffusion-based) is not guaranteed.
Performance may degrade on images that have been heavily compressed or resized before inference.
Designed for still face images — not evaluated on video frames or non-face content.
May be vulnerable to adversarial attacks as noted in Carlini & Farid (2020).
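One way to probe the compression limitation is to re-encode an input at a lower JPEG quality and compare the model's predictions before and after. The helper below is a hypothetical sketch using Pillow; the name `jpeg_recompress` is an assumption, and this card reports no measurements from such a test.

```python
import io
from PIL import Image

def jpeg_recompress(image, quality):
    """Return a copy of `image` after an in-memory JPEG round trip."""
    buffer = io.BytesIO()
    image.convert("RGB").save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    recompressed = Image.open(buffer)
    recompressed.load()  # decode now so the buffer is no longer needed
    return recompressed

# Example: a synthetic image stands in for a real face photo.
original = Image.new("RGB", (299, 299), (120, 80, 60))
degraded = jpeg_recompress(original, quality=20)
print(degraded.size, degraded.mode)
```

Passing both `original` and `degraded` through the same preprocessing and model shows how far the predicted probabilities shift under compression.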

Citation

@inproceedings{chollet2017xception,
  title={Xception: Deep Learning with Depthwise Separable Convolutions},
  author={Chollet, François},
  booktitle={CVPR},
  year={2017}
}

@misc{140kfaces,
  author={xhlulu},
  title={140k Real and Fake Faces},
  year={2020},
  publisher={Kaggle},
  howpublished={https://www.kaggle.com/datasets/xhlulu/140k-real-and-fake-faces}
}

Author

Ramadhan Zome
GitHub: RamadhanAdam | HuggingFace: RamadhanZome
