Vocos for Deepfake Audio Enhancement
Model Description
This model is a custom-trained checkpoint of the Vocos vocoder architecture, specifically fine-tuned to enhance the phase coherence and acoustic quality of synthetic speech (deepfakes).
It was trained on the 4TTSDeepfakeDataset, which contains 1:1 parallel bona fide and zero-shot deepfake utterances generated by F5-TTS, MaskGCT, CosyVoice2, and Fish-S1mini.
- Architecture: Vocos (Mel-spectrogram to 24kHz waveform)
- Sampling Rate: 24,000 Hz
- Primary Task: Audio enhancement / Deepfake artifact reduction
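As a rough guide to the mel-to-waveform mapping listed above, the sketch below estimates how many mel frames an utterance produces at 24,000 Hz. The hop length of 256 samples and the centered framing are assumptions taken from the base charactr/vocos-mel-24khz configuration, not values stated in this card, and `mel_frame_count` is a hypothetical helper rather than part of the vocos library.

```python
SAMPLE_RATE = 24_000  # stated in this model card
HOP_LENGTH = 256      # assumption: hop size of the base vocos-mel-24khz config


def mel_frame_count(num_samples: int, hop_length: int = HOP_LENGTH) -> int:
    """Estimate frames from a centered STFT: one per full hop, plus the first frame."""
    if num_samples < 0:
        raise ValueError("num_samples must be non-negative")
    return num_samples // hop_length + 1


def duration_seconds(num_samples: int, sample_rate: int = SAMPLE_RATE) -> float:
    """Convert a sample count to seconds at the given sampling rate."""
    return num_samples / sample_rate


# One second of audio at 24,000 Hz under the assumptions above.
one_second_frames = mel_frame_count(SAMPLE_RATE)  # 24000 // 256 + 1 = 94
```

If the checkpoint was trained with a different hop length, pass it explicitly through the `hop_length` argument.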
Intended Use & Limitations
This model is released strictly for academic research, educational, and personal purposes. It is intended to be used in audio forensics, deepfake detection research, and synthetic audio analysis pipelines.
Prohibited Uses: Under the CC BY-NC-SA 4.0 license, these weights may not be used for any commercial purpose or productization. Use in malicious voice cloning applications is also prohibited.
How to Use
You can load these weights with the original vocos Python library. Because this checkpoint is a fine-tune of the original 24 kHz mel-spectrogram model, instantiate the base model first, then load the downloaded weights into it.
pip install vocos
import torch
import torchaudio
from vocos import Vocos
# 1. Initialize the base Vocos model
vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz")
# 2. Load the custom deepfake enhancement weights
# (Replace 'model.pt' with the actual name of the checkpoint file downloaded from this repository)
state_dict = torch.load("model.pt", map_location="cpu")
# Handle nested state dictionaries if necessary
if 'vocos_state_dict' in state_dict:
    vocos.load_state_dict(state_dict['vocos_state_dict'])
else:
    vocos.load_state_dict(state_dict)
vocos.eval()
# 3. Inference example
wav, sr = torchaudio.load("path/to/your/deepfake.wav")
# Resample to the 24,000 Hz rate the model expects
if sr != 24000:
    wav = torchaudio.transforms.Resample(sr, 24000)(wav)
with torch.no_grad():
    # Waveform -> mel-spectrogram -> backbone features -> enhanced waveform
    mel = vocos.feature_extractor(wav)
    x = vocos.backbone(mel)
    enhanced_wav = vocos.head(x)
torchaudio.save("enhanced_output.wav", enhanced_wav.cpu(), 24000)
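The nested-checkpoint handling in step 2 can be factored into a small helper that also reports key mismatches before loading. This is a minimal sketch, not part of the vocos library: `extract_state_dict` and `diff_keys` are hypothetical names, and the `vocos_state_dict` key is the only nesting convention this card mentions.

```python
from collections.abc import Iterable, Mapping
from typing import Any


def extract_state_dict(
    checkpoint: Mapping[str, Any], key: str = "vocos_state_dict"
) -> Mapping[str, Any]:
    """Return the state dict nested under `key`, or the checkpoint itself if absent."""
    nested = checkpoint.get(key)
    if isinstance(nested, Mapping):
        return nested
    return checkpoint


def diff_keys(
    expected: Iterable[str], actual: Iterable[str]
) -> tuple[list[str], list[str]]:
    """Return (missing, unexpected) parameter names of `actual` relative to `expected`."""
    expected_set, actual_set = set(expected), set(actual)
    return sorted(expected_set - actual_set), sorted(actual_set - expected_set)
```

With the objects from the example above, `diff_keys(vocos.state_dict().keys(), extract_state_dict(state_dict).keys())` should return two empty lists when the checkpoint matches the base architecture. Non-empty lists suggest the file was saved from a different model configuration.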
Acknowledgements and Attribution
- Architecture: The original Vocos model was created by Charactr (Licensed under MIT).
- Training Data: Trained on the 4TTSDeepfakeDataset, which is derived from the Hi-Fi TTS Dataset (CC BY 4.0) and utilizes outputs from F5-TTS, MaskGCT, Fish-S1mini, and CosyVoice2.