grotsoV/4TTSDeepfakeDataVocosTrained

	---
	license: cc-by-nc-sa-4.0
	tags:
	- audio
	- vocoder
	- deepfake-enhancement
	- synthetic-audio
	- vocos
	datasets:
	- grotsoV/4TTSDeepfakeData
	language:
	- en
	base_model:
	- charactr/vocos-mel-24khz
	---

	# Vocos for Deepfake Audio Enhancement

	## Model Description
	This model is a custom-trained checkpoint of the [Vocos](https://github.com/charactr-platform/vocos) vocoder architecture, specifically fine-tuned to enhance the phase coherence and acoustic quality of synthetic speech (deepfakes).

	It was trained on the [4TTSDeepfakeDataset](https://huggingface.co/datasets/grotsoV/4TTSDeepfakeData), which contains 1:1 parallel bona fide and zero-shot deepfake utterances generated by F5-TTS, MaskGCT, CosyVoice2, and Fish-S1mini.

	* Architecture: Vocos (Mel-spectrogram to 24kHz waveform)
	* Sampling Rate: 24,000 Hz
	* Primary Task: Audio enhancement / Deepfake artifact reduction
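
To make the figures above concrete, here is a minimal sketch of the arithmetic relating clip duration, sample count, and mel frame count at 24 kHz. The hop length of 256 samples and the 100 mel bins are assumptions about the base model's configuration, not values stated in this card, so check them against the base model's config before relying on them. The helper names are hypothetical.

```python
SAMPLE_RATE = 24_000  # stated in this model card
HOP_LENGTH = 256      # assumption: hop size of the base mel front end
N_MELS = 100          # assumption: number of mel bins of the base model


def num_samples(duration_s: float, sample_rate: int = SAMPLE_RATE) -> int:
    """Number of audio samples in a clip of the given duration."""
    return int(round(duration_s * sample_rate))


def num_mel_frames(n_samples: int, hop_length: int = HOP_LENGTH) -> int:
    """Frame count of a center-padded STFT: one frame per hop, plus one."""
    return n_samples // hop_length + 1


if __name__ == "__main__":
    samples = num_samples(2.0)
    frames = num_mel_frames(samples)
    # Expected mel shape for one mono clip under the assumptions above
    print(samples, (N_MELS, frames))
```

Under these assumptions, each mel frame covers 256 / 24,000 of a second, roughly 10.7 ms of audio.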

	## Intended Use & Limitations
	This model is released strictly for academic research, education, and personal use.
	It is intended for use in audio forensics, deepfake detection research, and synthetic audio analysis pipelines.

	Prohibited Uses: Under the CC BY-NC-SA 4.0 license, you may not use these weights for commercial purposes or productization. Use in malicious voice cloning applications is also prohibited.

	## How to Use
	You can load these weights using the original `vocos` Python library. Because this is a fine-tune of the original 24kHz mel-spectrogram model, you instantiate the base model first and then load your downloaded weights.

	```bash
	pip install vocos
	```

	```python
	import torch
	import torchaudio
	from vocos import Vocos

	# 1. Initialize the base Vocos model
	vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz")

	# 2. Load the custom deepfake enhancement weights
	# (Replace 'model.pt' with the name of the checkpoint file downloaded from this repository)
	state_dict = torch.load("model.pt", map_location="cpu")

	# Handle nested state dictionaries if necessary
	if "vocos_state_dict" in state_dict:
	    vocos.load_state_dict(state_dict["vocos_state_dict"])
	else:
	    vocos.load_state_dict(state_dict)

	vocos.eval()  # switch to inference mode

	# 3. Inference example
	wav, sr = torchaudio.load("path/to/your/deepfake.wav")  # wav: (channels, samples)
	if sr != 24000:
	    # The model expects 24 kHz input
	    wav = torchaudio.transforms.Resample(sr, 24000)(wav)

	with torch.no_grad():
	    mel = vocos.feature_extractor(wav)  # waveform -> mel-spectrogram
	    x = vocos.backbone(mel)             # mel-spectrogram -> hidden features
	    enhanced_wav = vocos.head(x)        # hidden features -> waveform

	torchaudio.save("enhanced_output.wav", enhanced_wav.cpu(), 24000)
	```
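
The nested-dictionary check in the snippet above can be factored into a small helper. This is a sketch, not part of the `vocos` library: `unwrap_state_dict` and its `candidate_keys` parameter are hypothetical names, and `vocos_state_dict` is the only wrapper key this card mentions. The helper works on plain mappings, so the demo uses stand-in dictionaries instead of real tensors.

```python
from collections.abc import Mapping


def unwrap_state_dict(checkpoint, candidate_keys=("vocos_state_dict",)):
    """Return the weight mapping from a checkpoint that may be nested.

    If the checkpoint holds a mapping under one of `candidate_keys`,
    that inner mapping is returned; otherwise the checkpoint itself is
    assumed to already be the state dict.
    """
    if not isinstance(checkpoint, Mapping):
        raise TypeError(f"expected a mapping, got {type(checkpoint).__name__}")
    for key in candidate_keys:
        nested = checkpoint.get(key)
        if isinstance(nested, Mapping):
            return nested
    return checkpoint


if __name__ == "__main__":
    # Stand-in values; a real checkpoint would map names to tensors.
    flat = {"backbone.embed.weight": 0.1}
    nested = {"vocos_state_dict": flat, "step": 1000}
    print(unwrap_state_dict(flat) is flat)
    print(unwrap_state_dict(nested) is flat)
```

With a real checkpoint, the returned mapping would be passed to `vocos.load_state_dict(...)` as in the snippet above.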

	## Acknowledgements and Attribution
	* Architecture: The original Vocos model was created by [Charactr](https://huggingface.co/charactr/vocos-mel-24khz) (Licensed under MIT).
	* Training Data: Trained on the [4TTSDeepfakeDataset](https://huggingface.co/datasets/grotsoV/4TTSDeepfakeData), which is derived from the Hi-Fi TTS Dataset (CC BY 4.0) and utilizes outputs from F5-TTS, MaskGCT, Fish-S1mini, and CosyVoice2.