---
license: cc-by-nc-sa-4.0
tags:
- audio
- vocoder
- deepfake-enhancement
- synthetic-audio
- vocos
datasets:
- grotsoV/4TTSDeepfakeData
language:
- en
base_model:
- charactr/vocos-mel-24khz
---

# Vocos for Deepfake Audio Enhancement

## Model Description
This model is a fine-tuned checkpoint of the [Vocos](https://github.com/charactr-platform/vocos) vocoder, trained to improve the phase coherence and acoustic quality of synthetic speech (deepfakes).

It was trained on the [4TTSDeepfakeDataset](https://huggingface.co/datasets/grotsoV/4TTSDeepfakeData), which contains 1:1 parallel bona fide and zero-shot deepfake utterances generated by F5-TTS, MaskGCT, CosyVoice2, and Fish-S1mini.

* **Architecture:** Vocos (mel-spectrogram to 24 kHz waveform)
* **Sampling Rate:** 24,000 Hz
* **Primary Task:** Audio enhancement / Deepfake artifact reduction

## Intended Use & Limitations
This model is released strictly for **academic research, educational, and personal use**. It is intended for audio forensics, deepfake detection research, and synthetic audio analysis pipelines.

**Prohibited Uses:** The CC BY-NC-SA 4.0 license does **not** permit commercial use or productization of these weights. They must also not be used for malicious voice cloning.

## How to Use
You can load these weights using the original `vocos` Python library. Because this is a fine-tune of the original 24kHz mel-spectrogram model, you instantiate the base model first and then load your downloaded weights.

```bash
pip install vocos
```

```python
import torch
import torchaudio
from vocos import Vocos

# 1. Initialize the base Vocos model
vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz")

# 2. Load the custom deepfake enhancement weights
# (Replace 'model.pt' with the name of the checkpoint file you downloaded from this repository)
state_dict = torch.load("model.pt", map_location="cpu")

# Handle nested state dictionaries if necessary
if 'vocos_state_dict' in state_dict:
    vocos.load_state_dict(state_dict['vocos_state_dict'])
else:
    vocos.load_state_dict(state_dict)

vocos.eval()

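# 3. Inference example
wav, sr = torchaudio.load("path/to/your/deepfake.wav")  # wav: (channels, samples)
if wav.size(0) > 1:
    # Downmix to mono; otherwise each channel is processed as a separate batch item
    wav = wav.mean(dim=0, keepdim=True)
if sr != 24000:
    # The model expects 24 kHz input
    wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=24000)

with torch.no_grad():
    mel = vocos.feature_extractor(wav)  # waveform -> mel-spectrogram
    x = vocos.backbone(mel)             # mel-spectrogram -> hidden features
    enhanced_wav = vocos.head(x)        # hidden features -> waveform

torchaudio.save("enhanced_output.wav", enhanced_wav.cpu(), 24000)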
```
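
The nested-checkpoint check in step 2 can also be factored into a small helper, which may be convenient when loading several checkpoints. This is a minimal sketch, not part of the `vocos` library: the function name `extract_state_dict` is hypothetical, and the only key it looks for is the `vocos_state_dict` key already used above.

```python
from typing import Any, Mapping


def extract_state_dict(
    checkpoint: Mapping[str, Any], key: str = "vocos_state_dict"
) -> Mapping[str, Any]:
    """Return the weights stored under `key` if present, else the checkpoint itself.

    Hypothetical helper: it mirrors the if/else in the loading example and
    makes no other assumptions about how the checkpoint was saved.
    """
    nested = checkpoint.get(key)
    if isinstance(nested, Mapping):
        return nested
    return checkpoint


# Usage with the loading example (requires torch and a downloaded checkpoint):
# checkpoint = torch.load("model.pt", map_location="cpu")
# vocos.load_state_dict(extract_state_dict(checkpoint))
```

Because the helper returns the checkpoint unchanged when the key is absent, the same call works for both flat and nested checkpoint files.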

## Acknowledgements and Attribution
* **Architecture:** The original Vocos model was created by [Charactr](https://huggingface.co/charactr/vocos-mel-24khz) (Licensed under MIT).
* **Training Data:** Trained on the [4TTSDeepfakeDataset](https://huggingface.co/datasets/grotsoV/4TTSDeepfakeData), which is derived from the Hi-Fi TTS Dataset (CC BY 4.0) and utilizes outputs from F5-TTS, MaskGCT, Fish-S1mini, and CosyVoice2.