---
license: cc-by-nc-sa-4.0
tags:
- audio
- vocoder
- deepfake-enhancement
- synthetic-audio
- vocos
datasets:
- grotsoV/4TTSDeepfakeData
language:
- en
base_model:
- charactr/vocos-mel-24khz
---

# Vocos for Deepfake Audio Enhancement

## Model Description

This model is a custom-trained checkpoint of the [Vocos](https://github.com/charactr-platform/vocos) vocoder architecture, fine-tuned to enhance the phase coherence and acoustic quality of synthetic speech (deepfakes).

It was trained on the [4TTSDeepfakeDataset](https://huggingface.co/datasets/grotsoV/4TTSDeepfakeData), which contains 1:1 parallel bona fide and zero-shot deepfake utterances generated by F5-TTS, MaskGCT, CosyVoice2, and Fish-S1mini.

* **Architecture:** Vocos (mel-spectrogram to 24 kHz waveform)
* **Sampling Rate:** 24,000 Hz
* **Primary Task:** Audio enhancement / deepfake artifact reduction

## Intended Use & Limitations

This model is released strictly for **academic research, educational, and personal purposes**. It is intended for use in audio forensics, deepfake detection research, and synthetic audio analysis pipelines.

**Prohibited Uses:** Due to the CC BY-NC-SA 4.0 license, you may **not** use these weights for commercial purposes, productization, or malicious voice cloning applications.

## How to Use

You can load these weights with the original `vocos` Python library. Because this is a fine-tune of the original 24 kHz mel-spectrogram model, instantiate the base model first, then load the downloaded fine-tuned weights.

```bash
pip install vocos
```

```python
import torch
import torchaudio
from vocos import Vocos

# 1. Initialize the base Vocos model
vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz")

# 2. Load the custom deepfake enhancement weights
# (Replace 'model.pt' with the name of the checkpoint file downloaded from this repository)
state_dict = torch.load("model.pt", map_location="cpu")

# Handle nested state dictionaries if necessary
if 'vocos_state_dict' in state_dict:
    vocos.load_state_dict(state_dict['vocos_state_dict'])
else:
    vocos.load_state_dict(state_dict)
vocos.eval()

# 3. Inference example
# torchaudio.load returns a (channels, samples) tensor and the file's sample rate
wav, sr = torchaudio.load("path/to/your/deepfake.wav")
if sr != 24000:
    # The model expects 24 kHz input
    wav = torchaudio.transforms.Resample(sr, 24000)(wav)

with torch.no_grad():
    mel = vocos.feature_extractor(wav)  # waveform -> mel-spectrogram
    x = vocos.backbone(mel)             # mel-spectrogram -> hidden features
    enhanced_wav = vocos.head(x)        # hidden features -> waveform

torchaudio.save("enhanced_output.wav", enhanced_wav.cpu(), 24000)
```

## Acknowledgements and Attribution

* **Architecture:** The original Vocos model was created by [Charactr](https://huggingface.co/charactr/vocos-mel-24khz) (licensed under MIT).
* **Training Data:** Trained on the [4TTSDeepfakeDataset](https://huggingface.co/datasets/grotsoV/4TTSDeepfakeData), which is derived from the Hi-Fi TTS Dataset (CC BY 4.0) and uses outputs from F5-TTS, MaskGCT, Fish-S1mini, and CosyVoice2.
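
## Appendix: Checkpoint Unwrapping Sketch

The nested-checkpoint handling in the How to Use section can be factored into a small helper. The following is a minimal sketch, not part of the `vocos` library: the function names `unwrap_state_dict` and `missing_and_unexpected` are hypothetical, and the only checkpoint layout assumed is the one the loading snippet already checks for (weights stored either at the top level or under the `vocos_state_dict` key). It operates on plain mappings, so it can be tried without loading a model.

```python
from collections.abc import Mapping
from typing import Any, Iterable


def unwrap_state_dict(checkpoint: Mapping, key: str = "vocos_state_dict") -> Mapping:
    """Return the weights mapping from a checkpoint that may nest it under `key`.

    Hypothetical helper: mirrors the if/else in the loading snippet.
    """
    nested = checkpoint.get(key)
    if isinstance(nested, Mapping):
        return nested
    # No nested entry found: assume the checkpoint is already a flat state dict.
    return checkpoint


def missing_and_unexpected(expected_keys: Iterable[str], state_dict: Mapping[str, Any]):
    """Compare parameter names so mismatches are visible before loading.

    Returns (missing, unexpected) as sorted lists of names.
    """
    expected, found = set(expected_keys), set(state_dict)
    return sorted(expected - found), sorted(found - expected)
```

Under these assumptions, a call such as `unwrap_state_dict(torch.load("model.pt", map_location="cpu"))` would yield the mapping to pass to `vocos.load_state_dict`, and `missing_and_unexpected(vocos.state_dict().keys(), weights)` would list any parameter-name mismatches beforehand.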