# Audidex: Audio Deepfake Detection Model
This repository hosts the pre-trained Audidex model and scaler for audio deepfake detection — classifying speech as Real (genuine human) or Fake (synthetic or manipulated). The model is part of the Audidex system and uses hybrid Mel-spectrogram + glottal features with a fully connected neural network (FCNN).
## Overview
- Task: Binary classification of audio (Real vs Fake).
- Input: Processed audio at 16 kHz; the model expects a feature vector (Mel-spectrogram flattened + glottal features), not raw audio.
- Output: Class probabilities and label (Real/Fake).
- Framework: TensorFlow/Keras (saved as `.h5`).
- Use case: Integration into the Audidex backend for real-time or batch deepfake detection.
## What Was Done (Project Summary)
The project addresses the rise of deepfake audio and its misuse for identity theft, privacy violations, deception, and fraud. The goal was a detection model trained on a diverse dataset, capable of real-time operation, that discriminates between real and synthesized speech. The approach combines hand-crafted temporal features (glottal) with frequency-domain features (Mel-spectrogram). An FCNN was chosen for its efficiency in processing combined feature vectors and its ability to learn complex relationships between spectral and glottal attributes. The trained model and scaler are deployed in the Audidex backend (FastAPI) with the same feature-extraction pipeline for consistent inference.
## Data
A unified and customized dataset was used to improve model accuracy and robustness:
- Sources:
- ASVspoof: Benchmark resource in anti-spoofing research; genuine and spoofed audio covering a variety of audio synthesis and conversion techniques.
- Fake or Real (FoR): Extensive dataset of authentic and manipulated audio; covers multiple deepfake techniques, strengthening the model against diverse spoofing methods.
- Additional: Real and fake audio from open-source deepfake repositories and self-recordings to cover state-of-the-art synthesis and voice conversion.
- Size: 17,940 utterances in total — 8,921 real and 9,019 fake. Real and fake samples are kept as balanced as possible to enhance training and generalization.
- Splits: 80% training, 10% validation, 10% testing.
- Preprocessing:
- Sampling rate standardized at 16 kHz.
- Audios trimmed or padded to a fixed duration of 3 seconds.
- Amplitude normalization and noise reduction (e.g. high-pass filtering) to improve audio quality.
- Data augmentation: Used to diversify data and simulate real-life conditions — e.g. adding noise, pitch shifting, and time stretching.
- Ethics: All datasets were used in accordance with their data-usage policies. No raw dataset files are stored in this Hugging Face repo; only the trained model and scaler are provided.
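The duration and normalization steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the repository's actual preprocessing code; it assumes the waveform has already been loaded as mono audio at 16 kHz (e.g. via `librosa.load(path, sr=16000)`):

```python
import numpy as np

SR = 16_000                        # standardized sampling rate (Hz)
DURATION_S = 3.0                   # fixed clip length used in training
N_SAMPLES = int(SR * DURATION_S)   # 48 000 samples per clip

def preprocess(y: np.ndarray) -> np.ndarray:
    """Trim or zero-pad a mono waveform to 3 s, then peak-normalize it."""
    y = y[:N_SAMPLES]                          # truncate clips longer than 3 s
    if len(y) < N_SAMPLES:                     # zero-pad clips shorter than 3 s
        y = np.pad(y, (0, N_SAMPLES - len(y)))
    peak = np.max(np.abs(y))
    return y / peak if peak > 0 else y         # amplitude normalization

# Example: a 1 s, 220 Hz tone gets padded out to the fixed 48 000-sample length
short = np.sin(2 * np.pi * 220 * np.arange(SR) / SR).astype(np.float32)
fixed = preprocess(short)
print(fixed.shape)  # (48000,)
```

Noise reduction (e.g. a high-pass filter) and the augmentations (added noise, pitch shift, time stretch) would sit alongside this step in the real pipeline.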
## Methodology
### 1. Feature Extraction (must match inference code)
Feature extraction is the most critical part of the detection pipeline, enabling the model to capture minute variations between authentic and fake audio.
- Glottal features (temporal; vocal cord vibration):
- Shimmer: Amplitude variations of the glottal waveform. Real human voice typically has natural amplitude variations; synthetic speech often has regular or over-emphasized amplitude — shimmer helps capture this.
- Jitter: Frequency variations from one glottal cycle to the next. Natural speech is usually slightly irregular; deepfake audio may show very uniform or unnaturally varied jitter — an important indicator of authenticity.
- HNR: Harmonic-to-noise ratio (dB), the ratio of periodic (harmonic) energy to noise energy in the voice signal.
- Formants (F1, F2, F3): Resonant frequencies of the vocal tract, extracted using LPC (Linear Predictive Coding). The first three formants are essential for vowel structure and human speech; synthesized audio often lacks natural resonance and can show anomalies in formant patterns.
- Mel-spectrogram (frequency–time):
- Audio waveform → spectrogram via STFT → frequencies mapped to the Mel scale (approximating human auditory perception).
- Real audio tends to have smooth transitions in frequency and energy; deepfake audios can show sudden changes or sharp turns in spectral characteristics.
- In this implementation: Librosa with `n_mels=128` and `fmax=8000` Hz, amplitude converted to dB; time axis fixed to 128 frames; flattened to a vector and normalized (e.g. by max absolute value) before concatenation with glottal features.
- Combined vector: `[flattened_mel_spectrogram, jitter, shimmer, hnr, f1, f2, f3]`. Length is fixed to the model's input dimension via padding or truncation. Features are standardized using a pre-fitted scaler (`scaler.pkl`) at both training and inference time.
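The following NumPy sketch shows the standard "local" jitter/shimmer formulas and how the combined vector is assembled, under stated assumptions: the cycle periods and peak amplitudes are taken as given (the real pipeline estimates them from the glottal waveform), the Mel-spectrogram is a stand-in array in place of the real `librosa.feature.melspectrogram` output, and `input_dim = 128*128 + 6` is a hypothetical value, not necessarily the trained model's actual input size:

```python
import numpy as np

def local_jitter(periods: np.ndarray) -> float:
    """Relative cycle-to-cycle variation of glottal period lengths."""
    return float(np.mean(np.abs(np.diff(periods))) / np.mean(periods))

def local_shimmer(amplitudes: np.ndarray) -> float:
    """Relative cycle-to-cycle variation of glottal peak amplitudes."""
    return float(np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes))

def build_feature_vector(mel_db, jitter, shimmer, hnr, f1, f2, f3, input_dim):
    """Flatten + normalize the Mel-spectrogram, append glottal scalars, fix length."""
    mel = mel_db.flatten()
    mel = mel / np.max(np.abs(mel))                 # normalize by max absolute value
    combined = np.concatenate([mel, [jitter, shimmer, hnr, f1, f2, f3]])
    combined = combined[:input_dim]                 # truncate if too long ...
    return np.pad(combined, (0, input_dim - len(combined)))  # ... or zero-pad

# Toy cycle measurements: a ~125 Hz voice with slight natural irregularity
periods = np.array([8.00, 8.05, 7.95, 8.02, 7.98]) / 1000   # seconds per cycle
amps    = np.array([0.80, 0.82, 0.79, 0.81, 0.80])          # peak amplitudes
mel_db  = np.random.default_rng(0).normal(size=(128, 128))  # stand-in spectrogram

vec = build_feature_vector(mel_db, local_jitter(periods), local_shimmer(amps),
                           hnr=20.0, f1=700.0, f2=1200.0, f3=2600.0,
                           input_dim=128 * 128 + 6)
print(vec.shape)  # (16390,)
```

A perfectly regular synthetic voice would yield jitter and shimmer near zero, which is exactly the cue these features give the classifier.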
### 2. Model Architecture (FCNN)
The detection model is based on a deep learning approach. Several architectures (e.g. CNN, RNN) were investigated; an FCNN was chosen because it handles tabular-style combined feature vectors effectively and is well-matched for learning complex relationships between Mel-spectrogram and glottal features via densely connected layers.
| Layer / setting | Details |
|---|---|
| Input | 1D flattened feature vector combining spectral (Mel-spectrogram) and glottal (shimmer, jitter, HNR, formants) features. |
| First Dense | 512 neurons, ReLU. L2 regularization (λ=0.001), Batch Normalization, Dropout 40%. Output: 512-dimensional representation. |
| Second Dense | 256 neurons, ReLU. L2 regularization, Batch Normalization, Dropout 30%. Output: 256-dimensional representation. |
| Third Dense | 128 neurons, ReLU. L2 regularization, Batch Normalization. No Dropout at this stage to retain information for classification. Output: 128-dimensional representation. |
| Output | 2 neurons, Softmax → [P(Real), P(Fake)]. |
| Compilation | Optimizer: Adam (initial learning rate 0.001). Loss: Categorical cross-entropy. Metric: Accuracy. |
| Training | Dataset split 80% train / 10% validation / 10% test. Features standardized with pre-fitted scaler (scaler.pkl). 10 epochs, batch size 32, with a learning rate scheduler to reduce the learning rate over time for smoother convergence. |
- Output index: `0` = Real, `1` = Fake (argmax over the two class probabilities).
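To make the layer widths in the table concrete, here is a pure-NumPy sketch of the forward pass with randomly initialized weights. It is illustrative only, not the trained model (use the `.h5` file in this repo for that); `INPUT_DIM` is a hypothetical value, Dropout acts only at training time, and BatchNorm folds into the affine layers at inference, so both are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
INPUT_DIM = 16_390  # hypothetical combined-feature length; the real value comes from the repo

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=-1, keepdims=True)

# Random weights with the documented layer widths: INPUT_DIM -> 512 -> 256 -> 128 -> 2
sizes = [INPUT_DIM, 512, 256, 128, 2]
params = [(rng.normal(scale=0.01, size=(m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x):
    for w, b in params[:-1]:
        x = relu(x @ w + b)        # Dense + ReLU hidden layers
    w, b = params[-1]
    return softmax(x @ w + b)      # 2-way softmax -> [P(Real), P(Fake)]

probs = forward(rng.normal(size=(1, INPUT_DIM)))
print(probs.shape)  # (1, 2)
```

In the actual system these layers carry L2 regularization (λ=0.001) and are trained with Adam and categorical cross-entropy as described in the table.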
## Reported Performance
- Training accuracy: 90.42%
- Validation accuracy: 87.88%
- Test accuracy: 87.50%
Convergence of loss during training indicated effective learning with low overfitting. Strong precision, recall, and F1-score on test data further validate the reliability of the model. The introduction of glottal features (shimmer, jitter, formants) greatly improved the detection capability for synthesized speech; Mel-spectrogram features captured spectral and temporal patterns well. Audio chunking in the backend supports real-time responsiveness without sacrificing classification performance. The model had some difficulty with state-of-the-art voice conversion methods that generate voices very close to real ones — a direction for future improvement.
## Files in This Repo

| File | Description |
|---|---|
| `optimized_audio_deepfake_detector.h5` | Keras/TensorFlow FCNN model (load with `tf.keras.models.load_model`). |
| `scaler.pkl` | Pre-fitted scaler (load with `joblib.load`) for standardizing the combined feature vector before inference. |
## How to Use
### 1. Download from Hugging Face

```bash
pip install huggingface_hub
```

```python
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(repo_id="soorajsatheesan/audidex",
                             filename="optimized_audio_deepfake_detector.h5")
scaler_path = hf_hub_download(repo_id="soorajsatheesan/audidex",
                              filename="scaler.pkl")
```
### 2. Load and Run Inference

Inference must use the same feature extraction as the Audidex backend (`audio_processing.py`): Mel-spectrogram (128 mels, 128 time steps) plus glottal features (jitter, shimmer, HNR, 3 formants), concatenated, padded/truncated to the model input length, and passed through the scaler.

```python
import numpy as np
from joblib import load
from tensorflow.keras.models import load_model

model = load_model(model_path)
scaler = load(scaler_path)

# Build the feature vector with Audidex's audio_processing (same as the backend):
# combined = [flattened_mel, jitter, shimmer, hnr, f1, f2, f3], then pad/truncate
# combined = scaler.transform(combined.reshape(1, -1))
# pred = model.predict(combined)
# label = "Real" if np.argmax(pred) == 0 else "Fake"
```
For full inference (including feature extraction), use the Audidex repository: clone it, place `optimized_audio_deepfake_detector.h5` and `scaler.pkl` in `backend/`, and run the FastAPI app.
## Model Card Summary
- Intended use: Research and deployment of audio deepfake detection (Real vs Fake); areas of application include social media, forensics, content authentication, and cybersecurity.
- Limitations: Performance depends on the match between evaluation data and the training distribution; state-of-the-art voice conversion can remain challenging.
- License: MIT (see main Audidex project).
## Links
- Main project (code + app): github.com/soorajsatheesan/Audidex
- Hugging Face model repo: huggingface.co/soorajsatheesan/audidex