Vocos for Deepfake Audio Enhancement

Model Description

This model is a custom-trained checkpoint of the Vocos vocoder architecture, specifically fine-tuned to enhance the phase coherence and acoustic quality of synthetic speech (deepfakes).

It was trained on the 4TTSDeepfakeDataset, which contains 1:1 parallel bona fide and zero-shot deepfake utterances generated by F5-TTS, MaskGCT, CosyVoice2, and Fish-S1mini.

  • Architecture: Vocos (Mel-spectrogram to 24kHz waveform)
  • Sampling Rate: 24,000 Hz
  • Primary Task: Audio enhancement / Deepfake artifact reduction
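
To give a feel for the sizes involved when converting between a mel-spectrogram and a 24 kHz waveform, the following sketch computes how many samples and mel frames a clip of a given duration produces. The sampling rate comes from this card. The hop length of 256 samples and the centered framing are assumptions for illustration and should be checked against the checkpoint's configuration. The helper names are hypothetical.

```python
SAMPLE_RATE = 24_000  # stated in this model card
HOP_LENGTH = 256      # ASSUMPTION: not stated in this card; verify against the model config


def duration_to_samples(seconds: float, sample_rate: int = SAMPLE_RATE) -> int:
    """Number of audio samples in a clip of the given duration."""
    return int(round(seconds * sample_rate))


def num_mel_frames(num_samples: int, hop_length: int = HOP_LENGTH) -> int:
    """Frame count for a centered STFT: one frame per full hop, plus one."""
    return num_samples // hop_length + 1


samples = duration_to_samples(2.0)
frames = num_mel_frames(samples)
print(f"{samples} samples -> {frames} mel frames")
```

Under these assumptions, each mel frame covers roughly 10.7 ms of audio (256 / 24,000 s).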

Intended Use & Limitations

This model is released strictly for academic research, education, and personal use. It is intended for audio forensics, deepfake detection research, and synthetic audio analysis pipelines.

Prohibited Uses: These weights are released under the CC BY-NC-SA 4.0 license, so commercial use and productization are not permitted. Use in malicious voice cloning applications is also prohibited.

How to Use

You can load these weights with the vocos Python library. Because this checkpoint is a fine-tune of the original 24 kHz mel-spectrogram model, instantiate the base model first and then load the downloaded weights into it.

pip install vocos

import torch
import torchaudio
from vocos import Vocos

# 1. Initialize the base Vocos model
vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz")

# 2. Load the custom deepfake enhancement weights
# (Replace 'model.pt' with the name of the checkpoint file downloaded from this repository)
state_dict = torch.load("model.pt", map_location="cpu")

# Handle nested state dictionaries if necessary
if 'vocos_state_dict' in state_dict:
    vocos.load_state_dict(state_dict['vocos_state_dict'])
else:
    vocos.load_state_dict(state_dict)

vocos.eval()

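# 3. Inference example
wav, sr = torchaudio.load("path/to/your/deepfake.wav")
# Downmix to mono; otherwise each channel is processed as a separate batch item
if wav.size(0) > 1:
    wav = wav.mean(dim=0, keepdim=True)
# The model expects 24 kHz input
if sr != 24000:
    wav = torchaudio.transforms.Resample(sr, 24000)(wav)

with torch.no_grad():
    mel = vocos.feature_extractor(wav)  # waveform -> mel-spectrogram features
    x = vocos.backbone(mel)
    enhanced_wav = vocos.head(x)        # backbone features -> 24 kHz waveform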

torchaudio.save("enhanced_output.wav", enhanced_wav.cpu(), 24000)
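
The nested-dictionary check in step 2 can be generalized into a small helper, sketched below. The function name `unwrap_state_dict` is hypothetical and not part of the vocos library. The key `vocos_state_dict` comes from the snippet above, while the extra `state_dict` key and the `module.` prefix (as left behind by `torch.nn.DataParallel`) are assumptions about how a checkpoint might have been saved, not facts about this one.

```python
from collections.abc import Mapping
from typing import Any


def unwrap_state_dict(
    checkpoint: Mapping[str, Any],
    candidate_keys: tuple[str, ...] = ("vocos_state_dict", "state_dict"),
    strip_prefix: str = "module.",
) -> dict[str, Any]:
    """Return a flat parameter dict from a possibly nested checkpoint.

    Hypothetical helper: the candidate keys and prefix are illustrative
    assumptions and may need adjusting for a given checkpoint file.
    """
    state: Mapping[str, Any] = checkpoint
    # Use the first nested dictionary found under a known wrapper key.
    for key in candidate_keys:
        if key in checkpoint and isinstance(checkpoint[key], Mapping):
            state = checkpoint[key]
            break

    # Drop a leading wrapper prefix from parameter names, if present.
    cleaned: dict[str, Any] = {}
    for name, value in state.items():
        if strip_prefix and name.startswith(strip_prefix):
            name = name[len(strip_prefix):]
        cleaned[name] = value
    return cleaned
```

With this in place, the loading step could be written as `vocos.load_state_dict(unwrap_state_dict(torch.load("model.pt", map_location="cpu")))`.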

Acknowledgements and Attribution

  • Architecture: The original Vocos model was created by Charactr (Licensed under MIT).
  • Training Data: Trained on the 4TTSDeepfakeDataset, which is derived from the Hi-Fi TTS Dataset (CC BY 4.0) and utilizes outputs from F5-TTS, MaskGCT, Fish-S1mini, and CosyVoice2.