Vocos for Deepfake Audio Enhancement

Model Description

This model is a custom-trained checkpoint of the Vocos vocoder architecture, specifically fine-tuned to enhance the phase coherence and acoustic quality of synthetic speech (deepfakes).

It was trained on the 4TTSDeepfakeDataset, which contains 1:1 parallel bona fide and zero-shot deepfake utterances generated by F5-TTS, MaskGCT, CosyVoice2, and Fish-S1mini.

Architecture: Vocos (Mel-spectrogram to 24kHz waveform)
Sampling Rate: 24,000 Hz
Primary Task: Audio enhancement / Deepfake artifact reduction

Intended Use & Limitations

This model is released strictly for academic research, educational, and personal purposes. It is intended to be used in audio forensics, deepfake detection research, and synthetic audio analysis pipelines.

Prohibited Uses: The CC BY-NC-SA 4.0 license forbids commercial use of these weights, including productization. Malicious voice cloning applications are also prohibited.

How to Use

You can load these weights using the original vocos Python library. Because this is a fine-tune of the original 24kHz mel-spectrogram model, you instantiate the base model first and then load your downloaded weights.

pip install vocos

import torch
import torchaudio
from vocos import Vocos

# 1. Initialize the base Vocos model
vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz")

# 2. Load the custom deepfake enhancement weights
# (Replace 'model.pt' with the name of the checkpoint file from this repository)
state_dict = torch.load("model.pt", map_location="cpu")

# Handle nested state dictionaries if necessary
if 'vocos_state_dict' in state_dict:
    vocos.load_state_dict(state_dict['vocos_state_dict'])
else:
    vocos.load_state_dict(state_dict)

vocos.eval()

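# 3. Inference example
# wav has shape (channels, samples); the first dimension is treated as the
# batch, so each channel of a multi-channel file is processed independently.
wav, sr = torchaudio.load("path/to/your/deepfake.wav")
if sr != 24000:
    # The model expects 24 kHz input
    wav = torchaudio.transforms.Resample(sr, 24000)(wav)

with torch.no_grad():
    mel = vocos.feature_extractor(wav)  # waveform -> mel-spectrogram
    enhanced_wav = vocos.decode(mel)    # runs the backbone, then the head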

torchaudio.save("enhanced_output.wav", enhanced_wav.cpu(), 24000)
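
If loading fails with a key mismatch, it can help to compare the checkpoint's keys against the model's before calling load_state_dict. The following is a minimal sketch, not part of the vocos library: the helper names unwrap_state_dict and diff_keys are hypothetical, and the 'vocos_state_dict' wrapper key is taken from the loading snippet above. It operates on plain mappings, so it should apply to whatever torch.load returns.

```python
from typing import Any, Iterable, List, Mapping, Tuple


def unwrap_state_dict(
    checkpoint: Mapping[str, Any], key: str = "vocos_state_dict"
) -> Mapping[str, Any]:
    """Return the nested state dict stored under `key`, else the checkpoint itself.

    Hypothetical helper: `key` defaults to the wrapper name used in the
    loading snippet; other training scripts may use a different name.
    """
    nested = checkpoint.get(key)
    if isinstance(nested, Mapping):
        return nested
    return checkpoint


def diff_keys(
    expected: Iterable[str], actual: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (missing, unexpected) parameter names, each sorted.

    `missing` are names the model expects but the checkpoint lacks;
    `unexpected` are names in the checkpoint the model does not have.
    """
    expected_set, actual_set = set(expected), set(actual)
    missing = sorted(expected_set - actual_set)
    unexpected = sorted(actual_set - expected_set)
    return missing, unexpected
```

With the objects from the snippet above, diff_keys(vocos.state_dict().keys(), unwrap_state_dict(state_dict).keys()) would list any mismatched parameter names. Two empty lists suggest the checkpoint matches the base architecture.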

Acknowledgements and Attribution

Architecture: The original Vocos model was created by Charactr (Licensed under MIT).
Training Data: Trained on the 4TTSDeepfakeDataset, which is derived from the Hi-Fi TTS Dataset (CC BY 4.0) and utilizes outputs from F5-TTS, MaskGCT, Fish-S1mini, and CosyVoice2.
