---
license: cc-by-nc-sa-4.0
tags:
- audio
- vocoder
- deepfake-enhancement
- synthetic-audio
- vocos
datasets:
- grotsoV/4TTSDeepfakeData
language:
- en
base_model:
- charactr/vocos-mel-24khz
---
# Vocos for Deepfake Audio Enhancement
## Model Description
This model is a custom-trained checkpoint of the [Vocos](https://github.com/charactr-platform/vocos) vocoder architecture, specifically fine-tuned to enhance the phase coherence and acoustic quality of synthetic speech (deepfakes).
It was trained on the [4TTSDeepfakeDataset](https://huggingface.co/datasets/grotsoV/4TTSDeepfakeData), which contains 1:1 parallel bona fide and zero-shot deepfake utterances generated by F5-TTS, MaskGCT, CosyVoice2, and Fish-S1mini.
* **Architecture:** Vocos (Mel-spectrogram to 24kHz waveform)
* **Sampling Rate:** 24,000 Hz
* **Primary Task:** Audio enhancement / Deepfake artifact reduction
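For pipelines that need to align enhanced audio with the original, it may help to estimate how many mel frames and output samples a clip produces. The sketch below is only a rough guide: it assumes a hop length of 256 samples and centered STFT padding, which are believed to match the base `charactr/vocos-mel-24khz` configuration but are not stated in this card, so verify them against the base model's config before relying on the numbers. The helper names are hypothetical.

```python
SAMPLE_RATE = 24_000  # stated in this model card
HOP_LENGTH = 256      # ASSUMPTION: hop length of the base vocos-mel-24khz config


def num_mel_frames(num_samples: int, hop_length: int = HOP_LENGTH) -> int:
    """Frame count of a centered STFT (ASSUMPTION: center padding is used)."""
    return 1 + num_samples // hop_length


def output_samples(num_samples: int, hop_length: int = HOP_LENGTH) -> int:
    """Waveform length after a centered inverse STFT over those frames."""
    return (num_mel_frames(num_samples, hop_length) - 1) * hop_length


if __name__ == "__main__":
    one_second = SAMPLE_RATE
    print(num_mel_frames(one_second), output_samples(one_second))
```

Under these assumptions the output can be up to `hop_length - 1` samples shorter than the input, so trim or pad before computing sample-aligned metrics.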
## Intended Use & Limitations
This model is released strictly for **academic research, educational, and personal purposes**.
It is intended to be used in audio forensics, deepfake detection research, and synthetic audio analysis pipelines.
**Prohibited Uses:** Under the CC BY-NC-SA 4.0 license, you may **not** use these weights for commercial purposes or productization. Use in malicious voice cloning applications is likewise prohibited.
## How to Use
You can load these weights with the original `vocos` Python library. Because this checkpoint is a fine-tune of the original 24 kHz mel-spectrogram model, instantiate the base model first, then load the downloaded weights into it.
```bash
pip install vocos
```
```python
import torch
import torchaudio
from vocos import Vocos

# 1. Initialize the base Vocos model (supplies the architecture and config)
vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz")

# 2. Load the custom deepfake enhancement weights
# (Replace 'model.pt' with the actual checkpoint filename from this repository)
state_dict = torch.load("model.pt", map_location="cpu")
# Handle nested state dictionaries if necessary
if "vocos_state_dict" in state_dict:
    vocos.load_state_dict(state_dict["vocos_state_dict"])
else:
    vocos.load_state_dict(state_dict)
vocos.eval()

# 3. Inference example
wav, sr = torchaudio.load("path/to/your/deepfake.wav")  # shape: (channels, samples)
if sr != 24000:
    # The model expects 24 kHz input
    wav = torchaudio.transforms.Resample(sr, 24000)(wav)
with torch.no_grad():
    mel = vocos.feature_extractor(wav)  # waveform -> mel-spectrogram
    x = vocos.backbone(mel)
    enhanced_wav = vocos.head(x)  # backbone features -> waveform
torchaudio.save("enhanced_output.wav", enhanced_wav.cpu(), 24000)
```
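The nested-checkpoint check in step 2 can be factored into a small helper, which makes it easy to test without loading any weights. This is a minimal sketch: `extract_state_dict` is a hypothetical name, and the `vocos_state_dict` key is simply the one used in the example above. Plain dictionaries stand in for real checkpoints here.

```python
from typing import Any, Mapping


def extract_state_dict(
    checkpoint: Mapping[str, Any], key: str = "vocos_state_dict"
) -> Mapping[str, Any]:
    """Return the weights nested under `key` if present, else the checkpoint itself."""
    if key in checkpoint:
        return checkpoint[key]
    return checkpoint


if __name__ == "__main__":
    # Toy stand-ins for a training checkpoint and a bare state dict.
    nested = {"vocos_state_dict": {"head.weight": 0.1}, "step": 1000}
    flat = {"head.weight": 0.1}
    print(extract_state_dict(nested) == extract_state_dict(flat))
```

With a real file, the result would be passed straight to `vocos.load_state_dict(...)` in place of the `if`/`else` branch.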
## Acknowledgements and Attribution
* **Architecture:** The original Vocos model was created by [Charactr](https://huggingface.co/charactr/vocos-mel-24khz) (Licensed under MIT).
* **Training Data:** Trained on the [4TTSDeepfakeDataset](https://huggingface.co/datasets/grotsoV/4TTSDeepfakeData), which is derived from the Hi-Fi TTS Dataset (CC BY 4.0) and utilizes outputs from F5-TTS, MaskGCT, Fish-S1mini, and CosyVoice2.