---
license: cc-by-nc-sa-4.0
tags:
- audio
- vocoder
- deepfake-enhancement
- synthetic-audio
- vocos
datasets:
- grotsoV/4TTSDeepfakeData
language:
- en
base_model:
- charactr/vocos-mel-24khz
---

# Vocos for Deepfake Audio Enhancement

## Model Description
This model is a custom-trained checkpoint of the [Vocos](https://github.com/charactr-platform/vocos) vocoder architecture, fine-tuned to improve the phase coherence and acoustic quality of synthetic speech (deepfakes).

It was trained on the [4TTSDeepfakeDataset](https://huggingface.co/datasets/grotsoV/4TTSDeepfakeData), which contains 1:1 parallel bona fide and zero-shot deepfake utterances generated by F5-TTS, MaskGCT, CosyVoice2, and Fish-S1mini.

* **Architecture:** Vocos (mel-spectrogram to 24 kHz waveform)
* **Sampling Rate:** 24,000 Hz
* **Primary Task:** Audio enhancement / deepfake artifact reduction

## Intended Use & Limitations
This model is released strictly for **academic research, educational, and personal purposes**.
It is intended for use in audio forensics, deepfake detection research, and synthetic audio analysis pipelines.

**Prohibited Uses:** Under the CC BY-NC-SA 4.0 license, you may **not** use these weights for commercial purposes or productization. Malicious voice cloning applications are likewise prohibited.

## How to Use
You can load these weights with the original `vocos` Python library. Because this is a fine-tune of the original 24 kHz mel-spectrogram model, you instantiate the base model first and then load the downloaded weights into it.

```bash
pip install vocos
```

```python
import torch
import torchaudio
from vocos import Vocos

# 1. Initialize the base Vocos model
vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz")

# 2. Load the custom deepfake enhancement weights
# (Replace 'model.pt' with the name of the checkpoint file downloaded from this repository)
state_dict = torch.load("model.pt", map_location="cpu")

# Handle nested state dictionaries if necessary
if "vocos_state_dict" in state_dict:
    vocos.load_state_dict(state_dict["vocos_state_dict"])
else:
    vocos.load_state_dict(state_dict)

vocos.eval()

# 3. Inference example
wav, sr = torchaudio.load("path/to/your/deepfake.wav")
if sr != 24000:
    # The model expects 24 kHz input
    wav = torchaudio.transforms.Resample(sr, 24000)(wav)

with torch.no_grad():
    mel = vocos.feature_extractor(wav)  # waveform -> mel-spectrogram
    x = vocos.backbone(mel)
    enhanced_wav = vocos.head(x)  # backbone features -> 24 kHz waveform

torchaudio.save("enhanced_output.wav", enhanced_wav.cpu(), 24000)
```
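
The nested-checkpoint handling in the loading snippet can also be factored into a small helper, which makes it easy to check for key mismatches before calling `load_state_dict`. This is a minimal sketch and not part of the `vocos` library: `extract_state_dict` and `compare_keys` are hypothetical names, and only the `vocos_state_dict` wrapper key comes from the snippet above. The extra `state_dict` key is an assumption about how some training frameworks save checkpoints.

```python
from collections.abc import Mapping


def extract_state_dict(checkpoint, wrapper_keys=("vocos_state_dict", "state_dict")):
    """Return the flat parameter mapping from a possibly nested checkpoint.

    `wrapper_keys` lists the outer keys to look under, in order of preference.
    Only "vocos_state_dict" is taken from the loading snippet; "state_dict"
    is an assumed extra and may not apply to this checkpoint.
    """
    if not isinstance(checkpoint, Mapping):
        raise TypeError(f"expected a mapping, got {type(checkpoint).__name__}")
    for key in wrapper_keys:
        inner = checkpoint.get(key)
        if isinstance(inner, Mapping):
            return inner
    # No known wrapper key found: assume the checkpoint already is the state dict.
    return checkpoint


def compare_keys(expected, provided):
    """Return (missing, unexpected) parameter names as two sorted lists."""
    expected, provided = set(expected), set(provided)
    return sorted(expected - provided), sorted(provided - expected)
```

In the loading snippet, this would be used by passing the result of `torch.load(...)` to `extract_state_dict`, then calling `compare_keys(vocos.state_dict().keys(), state_dict.keys())`. Two empty lists suggest the checkpoint matches the base architecture.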

## Acknowledgements and Attribution
* **Architecture:** The original Vocos model was created by [Charactr](https://huggingface.co/charactr/vocos-mel-24khz) (licensed under MIT).
* **Training Data:** Trained on the [4TTSDeepfakeDataset](https://huggingface.co/datasets/grotsoV/4TTSDeepfakeData), which is derived from the Hi-Fi TTS Dataset (CC BY 4.0) and uses outputs from F5-TTS, MaskGCT, Fish-S1mini, and CosyVoice2.