---
title: Deepfake Audio
emoji: 🎙️
colorFrom: indigo
colorTo: yellow
sdk: gradio
sdk_version: 6.5.1
python_version: 3.11
app_file: app.py
pinned: false
license: mit
short_description: A neural voice cloning studio powered by SV2TTS technology
---
# Deepfake Audio
An advanced neural voice synthesis platform implementing Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis (SV2TTS) for high-fidelity zero-shot voice cloning.
Source Code · Technical Specification · Video Demo · Live Demo

Authors · Overview · Features · Structure · Results · Quick Start · Usage Guidelines · License · About · Acknowledgments
## Special Acknowledgement
Special thanks to Mega Satish for her meaningful contributions, guidance, and support that helped shape this work.
## Overview

Deepfake Audio is a multi-stage neural voice synthesis architecture designed to clone speaker identities and generate high-fidelity speech from textual input. Implementing the SV2TTS framework, the project distills a speaker's vocal characteristics into a latent embedding, which then conditions a generative model to produce new vocalizations with strikingly natural prosody and timbre.
### Attribution
This project builds upon the foundational research and implementation of the Real-Time-Voice-Cloning repository by Corentin Jemine.
## 🎙️ Defining Audio Deepfakes

An audio deepfake is synthetic audio produced with a "cloned" voice that is potentially indistinguishable from the real person's. The process uses advanced neural architectures, such as the SV2TTS framework, to distill high-dimensional vocal identities into latent embeddings. These embeddings then condition a generative model to synthesize new speech that mirrors the original speaker's prosody, timbre, and acoustic nuances with striking fidelity.
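The key property of that embedding space is that utterances from the same speaker land close together while different speakers land far apart. As an illustrative sketch (not code from this repository, and using toy low-dimensional vectors where real SV2TTS embeddings are 256-dimensional), same-speaker similarity can be measured with the cosine of the angle between embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two speaker embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" for illustration only.
same_speaker_a = [0.9, 0.1, 0.0, 0.4]
same_speaker_b = [0.8, 0.2, 0.1, 0.5]
different_speaker = [-0.7, 0.6, -0.2, 0.1]

print(cosine_similarity(same_speaker_a, same_speaker_b))    # high, near 1.0
print(cosine_similarity(same_speaker_a, different_speaker)) # low, negative here
```

A verification system accepts or rejects an identity claim by thresholding this score, which is the training objective that shapes the encoder's embedding space.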
The repository serves as a digital study into the mechanics of neural cloning and signal processing, brought into a modern context via a Progressive Web App (PWA) interface, enabling high-performance voice synthesis through a decoupled engine architecture.
### Synthesis Heuristics

The synthesis engine is governed by strict computational design patterns ensuring fidelity and responsiveness:

- Speaker Normalization: An LSTM-based speaker-verification encoder distills frames of reference audio into a fixed-dimensional speaker embedding that captures the voice's global characteristics.
- Zero-Shot Inference: A Tacotron 2-based synthesizer conditions on that embedding to generate mel-spectrograms for voices it has never seen, from a single reference utterance and without retraining.
- Real-Time Vocoding: Audio reconstruction supports both streaming and batch generation, ensuring the high-fidelity, low-latency waveform response critical for interactive neural study.
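The three stages above form one data flow: reference audio is pooled into an embedding, the embedding conditions mel-spectrogram generation, and the vocoder expands mel frames into waveform samples. The sketch below mimics only the shapes and flow with toy stand-ins; the function names and arithmetic are illustrative, not the repository's API:

```python
def encode_speaker(reference_frames):
    """Toy encoder: average the reference frames into one fixed-size
    embedding, mimicking how the speaker encoder pools an utterance."""
    dims = len(reference_frames[0])
    n = len(reference_frames)
    return [sum(frame[d] for frame in reference_frames) / n for d in range(dims)]

def synthesize_mel(text, embedding, n_mels=4):
    """Toy synthesizer: emit one 'mel frame' per character, conditioned
    on the speaker embedding (a real Tacotron predicts these frames)."""
    return [[(ord(ch) % 7) * e for e in embedding][:n_mels] for ch in text]

def vocode(mel_frames, hop=8):
    """Toy vocoder: expand each mel frame into `hop` waveform samples."""
    wav = []
    for frame in mel_frames:
        level = sum(frame) / len(frame)
        wav.extend([level] * hop)
    return wav

# Full pipeline: reference audio -> embedding -> mel -> waveform.
reference = [[0.2, 0.4, 0.1, 0.3], [0.4, 0.2, 0.3, 0.1]]
embed = encode_speaker(reference)
mel = synthesize_mel("hello", embed)
wav = vocode(mel)
print(len(embed), len(mel), len(wav))  # 4 5 40
```

The real models replace each toy function with a trained network, but the interfaces between stages (embedding vector, mel frames, waveform) are the same.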
### Acoustic Precision Integration

To maximize cloning clarity, the engine employs a multi-stage neural pipeline: latent filters refine the embedding stream, and probabilistic weights visualize the voice's confidence vector, coupling acoustic feedback tightly to state changes. This keeps the user's mental model synchronized with the underlying neural simulation.
## Features
| Feature | Description |
|---|---|
| SV2TTS Core | Combines LSTM Speaker Encoders with Tacotron Synthesizers for comprehensive voice cloning. |
| PWA Architecture | Implements a robust standalone installable interface for immediate neural vocalization study. |
| Academic Clarity | In-depth and detailed comments integrated throughout the codebase for transparent logic study. |
| Neural Topology | Efficient Decoupled Engine execution via Gradio and Torch for native high-performance access. |
| Inference Pipeline | Asynchronous architecture ensuring stability and responsiveness on local clients. |
| Visual Feedback | Interactive Status Monitors that trigger on synthesis events for sensory reward. |
| State Feedback | Embedding-Based Indicators and waveform effects for high-impact acoustic feel. |
| Social Persistence | Interactive Footer Integration bridging the analysis to the source repository. |
### Interactive Polish: The Acoustic Singularity

We engineered a logic-driven state manager that calibrates vocal scores across multiple vectors to simulate human-like identity transfer. The visual language follows a minimalist "Neon Mic" aesthetic, keeping focus on the interactive neural trajectory.
## Tech Stack
- Languages: Python 3.9+
- Logic: Neural Pipelines (SV2TTS & Signal Processing)
- Frameworks: PyTorch & TensorFlow (Inference)
- UI System: Modern Design (Gradio & Custom CSS)
- Deployment: Local execution / Hugging Face Spaces
- Architecture: Progressive Web App (PWA)
## Project Structure

```
DEEPFAKE-AUDIO/
│
├── Dataset/                  # Neural Assets
│   ├── samples/              # Voice Reference Audio
│   ├── encoder.pt            # Speaker Verification Model
│   ├── synthesizer.pt        # TTS Synthesis Model
│   └── vocoder.pt            # Waveform Reconstruction Model
│
├── docs/                     # Academic Documentation
│   └── SPECIFICATION.md      # Technical Architecture
│
├── Mega/                     # Attribution Assets
│   ├── Filly.jpg             # Companion (Filly)
│   └── Mega.png              # Profile Image (Mega Satish)
│
├── screenshots/              # Visual Gallery
│   ├── 01_landing_page.png
│   ├── 02_landing_page_footer.png
│   ├── 03_example_run_config.png
│   ├── 04_example_run_processing.png
│   ├── 05_example_run_results.png
│   ├── 06_example_run_results_footer.png
│   ├── 07_download_option.png
│   ├── Audio.wav             # Sample Output
│   └── favicon.png           # Project Icon
│
├── Source Code/              # Primary Application Layer
│   ├── app.py                # Gradio Studio Interface
│   ├── app_ui_demo.py        # UI-Only Verification Mode
│   ├── Dockerfile            # Containerization Config
│   ├── requirements.txt      # Dependency Manifest
│   ├── favicon.png           # Application Icon
│   └── intro_message.wav     # Audio Branding
│
├── .gitattributes            # Signal Normalization
├── .gitignore                # Deployment Exclusions
├── DEEPFAKE-AUDIO.ipynb      # Research Notebook
├── DEEPFAKE-AUDIO.py         # Research Script (Standalone CLI)
├── SECURITY.md               # Security Protocols
├── CITATION.cff              # Academic Citation Manifest
├── codemeta.json             # Metadata Standard
├── LICENSE                   # MIT License (Verbatim)
└── README.md                 # Project Entrance
```
## Results

Initial system state with clean aesthetics and synchronized brand identity.

💡 Interactive Element: Engage the title header to activate the system's auditory introduction.

### Interactive Polish: Footer Integration

Seamlessly integrated authorship and social persistence.

### Synthesis Setup: Adaptive Config

Configuring target text and reference identity for neural cloning.

### Neural Processing: Real-Time Inference

System Distillery extracting acoustic embeddings and synthesizing mel-spectrograms.

### Quantified Output: Generated Results

Successful high-fidelity audio synthesis with precise identity fidelity.

### Complete User Flow: Result & Footer

Comprehensive view of the post-synthesis state.

### System Options: Audio Export

Exporting synthesized waveforms for downstream academic reference.

### Generated Result Output: Audio Signal

Interactive verified output from the neural synthesis pipeline.

Listen to Generated Sample
## Quick Start

### 1. Prerequisites

- Python 3.9+: Required for runtime execution. Download Python
- Git: For version control and cloning. Download Git
#### Neural Model Acquisition

The synthesis engine relies on pre-trained neural models. Ensure the weights (`encoder.pt`, `synthesizer.pt`, `vocoder.pt`) are placed in the `Dataset/` directory. Failure to synchronize these assets will result in initialization errors.
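To make the missing-weights failure mode explicit before launch, a small pre-flight check can list exactly which files are absent. The helper below is a hypothetical sketch, not part of this repository:

```python
from pathlib import Path

# Hypothetical pre-flight check: verify the pre-trained weights are
# present in Dataset/ before launching the studio.
REQUIRED_MODELS = ("encoder.pt", "synthesizer.pt", "vocoder.pt")

def missing_models(dataset_dir="Dataset"):
    """Return the names of any required model files absent from dataset_dir."""
    root = Path(dataset_dir)
    return [name for name in REQUIRED_MODELS if not (root / name).is_file()]

# Usage: missing_models() returns [] when all three weights are in place;
# otherwise it lists the files that still need to be downloaded.
```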
### 2. Installation & Setup

#### Step 1: Clone the Repository

Open your terminal and clone the repository:

```bash
git clone https://github.com/Amey-Thakur/DEEPFAKE-AUDIO.git
cd DEEPFAKE-AUDIO
```
#### Step 2: Configure Virtual Environment

Prepare an isolated environment to manage dependencies.

Windows (Command Prompt / PowerShell):

```bash
python -m venv venv
venv\Scripts\activate
```

macOS / Linux (Terminal):

```bash
python3 -m venv venv
source venv/bin/activate
```

#### Step 3: Install Core Dependencies

Ensure your environment is active, then install the required libraries:

```bash
pip install -r "Source Code/requirements.txt"
```
### 3. Execution

#### A. Interactive Web Studio (PWA)

Launch the primary Gradio-based studio engine:

```bash
python "Source Code/app.py"
```

PWA Installation: Once the studio is running, click the "Install" icon in your browser's address bar to add the Deepfake Audio Studio to your desktop as a standalone application.
#### B. Research & Automation Script

For automated synthesis or command-line research workflows:

```bash
# Example: Using a preset identity
python DEEPFAKE-AUDIO.py --preset "Steve Jobs.wav" --text "Neural cloning active."

# Example: Using a custom voice file
python DEEPFAKE-AUDIO.py --input "my_voice.wav" --text "Synthesizing new speech."
```
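For readers studying how such a command-line entry point is typically wired, a minimal `argparse` skeleton mirroring the flags above might look like this. It is an illustrative sketch, not the repository's actual implementation:

```python
import argparse

def build_parser():
    """Illustrative parser: --preset and --input are mutually exclusive
    voice sources, and --text is always required."""
    parser = argparse.ArgumentParser(
        description="Deepfake Audio research CLI (illustrative sketch)")
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--preset",
                        help="bundled reference voice, e.g. a .wav under Dataset/samples/")
    source.add_argument("--input",
                        help="path to a custom reference recording")
    parser.add_argument("--text", required=True,
                        help="text to synthesize in the cloned voice")
    return parser

# Parsing an explicit argument list, as the second example command would:
args = build_parser().parse_args(
    ["--input", "my_voice.wav", "--text", "Synthesizing new speech."])
print(args.input, args.text)  # my_voice.wav Synthesizing new speech.
```

Making the two voice sources mutually exclusive lets `argparse` reject ambiguous invocations (both `--preset` and `--input`) with a clear usage error.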
## Usage Guidelines

This repository is openly shared to support learning and knowledge exchange across the academic community.

### For Students

Use this project as reference material for understanding Neural Voice Synthesis, Transfer Learning (SV2TTS), and real-time audio inference. The source code is available for study to facilitate self-paced learning and exploration of Python-based deep learning pipelines and PWA integration.

### For Educators

This project may serve as a practical lab example or supplementary teaching resource for Deep Learning, Acoustic Science, and Interactive System Architecture courses. Attribution is appreciated when utilizing content.

### For Researchers

The documentation and architectural approach may provide insights into academic project structuring, neural identity representation, and hybrid multi-stage synthesis pipelines.
## License

This repository and all its creative and technical assets are made available under the MIT License. See the LICENSE file for complete terms.

Summary: You are free to share and adapt this content for any purpose, even commercially, as long as you provide appropriate attribution to the original authors.

Copyright © 2021 Amey Thakur & Mega Satish
## About This Repository

Created & Maintained by: Amey Thakur & Mega Satish

This project features Deepfake Audio, a three-stage neural voice synthesis system. It represents a personal exploration into Deep Learning-based identity transfer and high-performance interactive application architecture via Gradio.

Connect: GitHub · LinkedIn · ORCID
## Acknowledgments
Grateful acknowledgment to Mega Satish for her exceptional collaboration and scholarly partnership on this neural voice cloning research. Her constant support, technical clarity, and dedication to software quality were instrumental in achieving the system's functional objectives. Learning alongside her was a transformative experience; her thoughtful approach to problem-solving and steady encouragement turned complex requirements into meaningful learning moments. This work reflects the growth and insights gained from our side-by-side academic journey. Thank you, Mega, for everything you shared and taught along the way.
Special thanks to Corentin Jemine for the foundational research and open-source implementation of the Real-Time-Voice-Cloning repository, which served as the cornerstone for this project's technical architecture.
🎙️ Deepfake Audio

Computer Engineering Repository
Computer Engineering (B.E.) - University of Mumbai
Semester-wise curriculum, laboratories, projects, and academic notes.

