diff --git a/.gitattributes b/.gitattributes index a6344aac8c09253b3b630fb776ae94478aa0275b..2ab266cf06a968421744d9eb88f1d599dfb7680b 100644 --- a/.gitattributes +++ b/.gitattributes @@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text *.zip filter=lfs diff=lfs merge=lfs -text *.zst filter=lfs diff=lfs merge=lfs -text *tfevents* filter=lfs diff=lfs merge=lfs -text +Vocos.[[:space:]]Closing[[:space:]]the[[:space:]]gap[[:space:]]between[[:space:]]time-domain[[:space:]]and[[:space:]]Fourier-based[[:space:]]neural[[:space:]]vocoders[[:space:]]for[[:space:]]high-quality[[:space:]]audio[[:space:]]synthesis.pdf filter=lfs diff=lfs merge=lfs -text diff --git a/Vocos. Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis.pdf b/Vocos. Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis.pdf new file mode 100644 index 0000000000000000000000000000000000000000..dd8b0e9b413d07167028a2fa9e8f0764d266d9d4 --- /dev/null +++ b/Vocos. 
Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis.pdf @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:a5a8dcb0a6b18a77b7f0fabeeadb8d51149246337a3ab9035ca42ffe910b7eb3 +size 6612764 diff --git a/alvocat-vocos-22khz/.gitattributes b/alvocat-vocos-22khz/.gitattributes new file mode 100644 index 0000000000000000000000000000000000000000..a6344aac8c09253b3b630fb776ae94478aa0275b --- /dev/null +++ b/alvocat-vocos-22khz/.gitattributes @@ -0,0 +1,35 @@ +*.7z filter=lfs diff=lfs merge=lfs -text +*.arrow filter=lfs diff=lfs merge=lfs -text +*.bin filter=lfs diff=lfs merge=lfs -text +*.bz2 filter=lfs diff=lfs merge=lfs -text +*.ckpt filter=lfs diff=lfs merge=lfs -text +*.ftz filter=lfs diff=lfs merge=lfs -text +*.gz filter=lfs diff=lfs merge=lfs -text +*.h5 filter=lfs diff=lfs merge=lfs -text +*.joblib filter=lfs diff=lfs merge=lfs -text +*.lfs.* filter=lfs diff=lfs merge=lfs -text +*.mlmodel filter=lfs diff=lfs merge=lfs -text +*.model filter=lfs diff=lfs merge=lfs -text +*.msgpack filter=lfs diff=lfs merge=lfs -text +*.npy filter=lfs diff=lfs merge=lfs -text +*.npz filter=lfs diff=lfs merge=lfs -text +*.onnx filter=lfs diff=lfs merge=lfs -text +*.ot filter=lfs diff=lfs merge=lfs -text +*.parquet filter=lfs diff=lfs merge=lfs -text +*.pb filter=lfs diff=lfs merge=lfs -text +*.pickle filter=lfs diff=lfs merge=lfs -text +*.pkl filter=lfs diff=lfs merge=lfs -text +*.pt filter=lfs diff=lfs merge=lfs -text +*.pth filter=lfs diff=lfs merge=lfs -text +*.rar filter=lfs diff=lfs merge=lfs -text +*.safetensors filter=lfs diff=lfs merge=lfs -text +saved_model/**/* filter=lfs diff=lfs merge=lfs -text +*.tar.* filter=lfs diff=lfs merge=lfs -text +*.tar filter=lfs diff=lfs merge=lfs -text +*.tflite filter=lfs diff=lfs merge=lfs -text +*.tgz filter=lfs diff=lfs merge=lfs -text +*.wasm filter=lfs diff=lfs merge=lfs -text +*.xz filter=lfs diff=lfs merge=lfs -text +*.zip filter=lfs diff=lfs merge=lfs -text +*.zst 
filter=lfs diff=lfs merge=lfs -text +*tfevents* filter=lfs diff=lfs merge=lfs -text diff --git a/alvocat-vocos-22khz/README.md b/alvocat-vocos-22khz/README.md new file mode 100644 index 0000000000000000000000000000000000000000..0f4f8458c8f61177f6f871714339ec00eaaabbef --- /dev/null +++ b/alvocat-vocos-22khz/README.md @@ -0,0 +1,174 @@ +--- +license: apache-2.0 +datasets: +- projecte-aina/festcat_trimmed_denoised +- projecte-aina/openslr-slr69-ca-trimmed-denoised +tags: +- vocoder +- vocos +- tts +--- + +# 🥑 alVoCat + + +🥑 alVoCat is a vocoder for Catalan TTS based on the Vocos architecture. It is highly performant, +produces high-quality audio, and works together with [🍵 Matxa](https://huggingface.co/BSC-LT/matcha-tts-cat-multiaccent); +you can find our fork [here](https://github.com/langtech-bsc/vocos/tree/matcha) and a demo [here](https://huggingface.co/spaces/BSC-LT/matchatts-vocos-onnx-ca). + +## Model Details + +### Model Description + + + +**Vocos** is a fast neural vocoder designed to synthesize audio waveforms from acoustic features. +Unlike other typical GAN-based vocoders, Vocos does not model audio samples in the time domain. +Instead, it generates spectral coefficients, facilitating rapid audio reconstruction through +inverse Fourier transform. + +This version of **Vocos** uses 80-bin mel spectrograms as acoustic features, which have been widespread +in the TTS domain since the introduction of [hifi-gan](https://github.com/jik876/hifi-gan/blob/master/meldataset.py). +The goal of this model is to provide an alternative to hifi-gan that is faster and compatible with the +acoustic output of several TTS models. This version is tailored to the Catalan language, +as it was trained only on Catalan speech datasets. + +We are grateful to the authors for open-sourcing the code, allowing us to modify and train this version. + +## Intended Uses and Limitations + + +The model is intended to serve as a vocoder that synthesizes audio waveforms from mel spectrograms.
It is trained to generate speech; if it is used in other audio +domains, the model may not produce high-quality samples. + +## How to Get Started with the Model + +Use the code below to get started with the model. + +### Installation + +To use Vocos only in inference mode, install it using: + +```bash +pip install git+https://github.com/langtech-bsc/vocos.git@matcha +``` + +### Reconstruct audio from mel-spectrogram + +```python +import torch + +from vocos import Vocos + +vocos = Vocos.from_pretrained("projecte-aina/alvocat-vocos-22khz") + +mel = torch.randn(1, 80, 256) # B, C, T +audio = vocos.decode(mel) +``` + +### Copy-synthesis from a file: + +```python +import torchaudio + +y, sr = torchaudio.load(YOUR_AUDIO_FILE) +if y.size(0) > 1: # mix to mono + y = y.mean(dim=0, keepdim=True) +y = torchaudio.functional.resample(y, orig_freq=sr, new_freq=22050) +y_hat = vocos(y) +``` + +### Onnx + +We also release an ONNX version of the model; you can try it in Colab: + + + Open In Colab + + +## Training Details + +### Training Data + + + +The model was trained on 3 Catalan speech datasets: + +| Dataset | Language | Hours | +|---------------------|----------|---------| +| Festcat | ca | 22 | +| OpenSLR69 | ca | 5 | +| LaFrescat | ca | 3.5 | + + + +### Training Procedure + + +The model was trained for 1.5M steps and 1.3k epochs with a batch size of 16 for stability. We used a cosine scheduler with an initial learning rate of 5e-4. +We also modified the mel spectrogram loss to use 128 bins and an fmax of 11025 Hz, instead of the parameters of the input mel spectrogram.
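The modified loss above uses 128 mel bins with an fmax of 11025 Hz (the Nyquist frequency at 22.05 kHz). As a rough illustration, the center frequencies of such a filterbank can be computed in a few lines; this is a hedged sketch that assumes the HTK mel formula, whereas the repo's feature extractor uses Slaney-style mels, so exact values differ slightly:

```python
import math

# Hedged sketch: approximate center frequencies of the 128 mel bins used by the
# modified mel-spectrogram loss (f_min = 0, f_max = 11025 Hz). HTK mel formula
# assumed here for simplicity; the Slaney scale used in the config differs slightly.

def hz_to_mel(f_hz: float) -> float:
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel: float) -> float:
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

n_mels, f_min, f_max = 128, 0.0, 11025.0
# n_mels + 2 equally spaced points on the mel axis; the inner points are the bin centers
mel_points = [hz_to_mel(f_min) + i * (hz_to_mel(f_max) - hz_to_mel(f_min)) / (n_mels + 1)
              for i in range(n_mels + 2)]
centers_hz = [mel_to_hz(m) for m in mel_points[1:-1]]
```

This shows why the loss resolution is finer than the 80-bin, 8 kHz input features: the bins extend all the way to Nyquist.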
+ + +#### Training Hyperparameters + + +* initial_learning_rate: 5e-4 +* scheduler: cosine without warmup or restarts +* mel_loss_coeff: 45 +* mrd_loss_coeff: 0.1 +* batch_size: 16 +* num_samples: 16384 + +## Evaluation + + + +Evaluation was done using the metrics from the [original repo](https://github.com/gemelo-ai/vocos); after ~1000 epochs we achieve: + +* val_loss: 3.57 +* f1_score: 0.95 +* mel_loss: 0.22 +* periodicity_loss: 0.113 +* pesq_score: 3.31 +* pitch_loss: 31.61 +* utmos_score: 3.33 + + +## Citation + + + +If this code contributes to your research, please cite the work: + +``` +@article{siuzdak2023vocos, + title={Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis}, + author={Siuzdak, Hubert}, + journal={arXiv preprint arXiv:2306.00814}, + year={2023} +} +``` + +## Additional information + +### Author +The Language Technologies Unit from Barcelona Supercomputing Center. + +### Contact +For further information, please send an email to . + +### Copyright +Copyright(c) 2024 by Language Technologies Unit, Barcelona Supercomputing Center. + +### License +[Creative Commons Attribution Non-commercial 4.0](https://www.creativecommons.org/licenses/by-nc/4.0/) + +These models are free to use for non-commercial and research purposes. Commercial use is only possible through licensing by +the voice artists. For further information, contact and . + +### Funding + +This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/).
+ +Part of the training of this model was made possible by compute time provided by the Galician Supercomputing Center CESGA +([Centro de Supercomputación de Galicia](https://www.cesga.es/)) \ No newline at end of file diff --git a/alvocat-vocos-22khz/config.yaml b/alvocat-vocos-22khz/config.yaml new file mode 100644 index 0000000000000000000000000000000000000000..6a59096467954fbdc569b55addbafdf25fa188fa --- /dev/null +++ b/alvocat-vocos-22khz/config.yaml @@ -0,0 +1,33 @@ +# pytorch_lightning==1.8.6 + +feature_extractor: + class_path: vocos.feature_extractors.MelSpectrogramFeatures + init_args: + sample_rate: 22050 + n_fft: 1024 + hop_length: 256 + n_mels: 80 + padding: same + f_min: 0 + f_max: 8000 + norm: "slaney" + mel_scale: "slaney" + + +backbone: + class_path: vocos.models.VocosBackbone + init_args: + input_channels: 80 + dim: 512 + intermediate_dim: 1536 + num_layers: 8 + +head: + class_path: vocos.heads.ISTFTHead + init_args: + dim: 512 + n_fft: 1024 + hop_length: 256 + padding: same + + diff --git a/alvocat-vocos-22khz/mel_spec_22khz_cat.onnx b/alvocat-vocos-22khz/mel_spec_22khz_cat.onnx new file mode 100644 index 0000000000000000000000000000000000000000..c1538cec7858dbdc947297172403a4d87b7bb6d7 --- /dev/null +++ b/alvocat-vocos-22khz/mel_spec_22khz_cat.onnx @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:8ab0744d7d49601ed8ad9be2927fcc99fb359cc90fe28bc9535c0484b3621de3 +size 53883652 diff --git a/alvocat-vocos-22khz/pytorch_model.bin b/alvocat-vocos-22khz/pytorch_model.bin new file mode 100644 index 0000000000000000000000000000000000000000..07b53354015b223bced6b70b707b1af654990c33 --- /dev/null +++ b/alvocat-vocos-22khz/pytorch_model.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:0af7b6f4b153819ada44a917135acf33944cdbb70cde0701eda3d100153799c7 +size 54051047 diff --git a/alvocat-vocos-22khz/source.txt b/alvocat-vocos-22khz/source.txt new file mode 100644 index
0000000000000000000000000000000000000000..559e4ffadf508cccce1c60163b69d32265df6808 --- /dev/null +++ b/alvocat-vocos-22khz/source.txt @@ -0,0 +1 @@ +https://huggingface.co/projecte-aina/alvocat-vocos-22khz \ No newline at end of file diff --git a/vocos-audioset-32khz/.gitattributes b/vocos-audioset-32khz/.gitattributes new file mode 100644 index 0000000000000000000000000000000000000000..a6344aac8c09253b3b630fb776ae94478aa0275b --- /dev/null +++ b/vocos-audioset-32khz/.gitattributes @@ -0,0 +1,35 @@ +*.7z filter=lfs diff=lfs merge=lfs -text +*.arrow filter=lfs diff=lfs merge=lfs -text +*.bin filter=lfs diff=lfs merge=lfs -text +*.bz2 filter=lfs diff=lfs merge=lfs -text +*.ckpt filter=lfs diff=lfs merge=lfs -text +*.ftz filter=lfs diff=lfs merge=lfs -text +*.gz filter=lfs diff=lfs merge=lfs -text +*.h5 filter=lfs diff=lfs merge=lfs -text +*.joblib filter=lfs diff=lfs merge=lfs -text +*.lfs.* filter=lfs diff=lfs merge=lfs -text +*.mlmodel filter=lfs diff=lfs merge=lfs -text +*.model filter=lfs diff=lfs merge=lfs -text +*.msgpack filter=lfs diff=lfs merge=lfs -text +*.npy filter=lfs diff=lfs merge=lfs -text +*.npz filter=lfs diff=lfs merge=lfs -text +*.onnx filter=lfs diff=lfs merge=lfs -text +*.ot filter=lfs diff=lfs merge=lfs -text +*.parquet filter=lfs diff=lfs merge=lfs -text +*.pb filter=lfs diff=lfs merge=lfs -text +*.pickle filter=lfs diff=lfs merge=lfs -text +*.pkl filter=lfs diff=lfs merge=lfs -text +*.pt filter=lfs diff=lfs merge=lfs -text +*.pth filter=lfs diff=lfs merge=lfs -text +*.rar filter=lfs diff=lfs merge=lfs -text +*.safetensors filter=lfs diff=lfs merge=lfs -text +saved_model/**/* filter=lfs diff=lfs merge=lfs -text +*.tar.* filter=lfs diff=lfs merge=lfs -text +*.tar filter=lfs diff=lfs merge=lfs -text +*.tflite filter=lfs diff=lfs merge=lfs -text +*.tgz filter=lfs diff=lfs merge=lfs -text +*.wasm filter=lfs diff=lfs merge=lfs -text +*.xz filter=lfs diff=lfs merge=lfs -text +*.zip filter=lfs diff=lfs merge=lfs -text +*.zst filter=lfs diff=lfs 
merge=lfs -text +*tfevents* filter=lfs diff=lfs merge=lfs -text diff --git a/vocos-audioset-32khz/README.md b/vocos-audioset-32khz/README.md new file mode 100644 index 0000000000000000000000000000000000000000..9b9753d796dc6bddb9bb53afafcb3121f5455e60 --- /dev/null +++ b/vocos-audioset-32khz/README.md @@ -0,0 +1,34 @@ +--- +license: apache-2.0 +--- + +This model was trained on Google's AudioSet (28 GB of data) for 1 million steps. (2 million steps were originally planned, but I'm exploring a better training schedule.) + +You can regard it as a pretrained base model, which is common for language models but not for vocoders. + +How to load and use this model: + +```python +import torch +import torchaudio +from scipy.io.wavfile import write + +from vocos import Vocos + +def safe_log(x: torch.Tensor, clip_val: float = 1e-7): + return torch.log(torch.clip(x, min=clip_val)) + +# Build the model from the config and load the checkpoint weights +checkpoint = torch.load("./vocos_checkpoint_epoch=464_step=1001610_val_loss=7.1732.ckpt", map_location="cpu") +vocos = Vocos.from_hparams("./config.yaml") +vocos.load_state_dict(checkpoint["state_dict"], strict=False) +vocos.eval() + +voice, sr = torchaudio.load("example.wav") +if sr != 32000: # the model expects audio sampled at 32000 Hz + raise ValueError("input must be sampled at 32000 Hz") + +with torch.no_grad(): + mel = torchaudio.transforms.MelSpectrogram( + sample_rate=32000, n_fft=2048, hop_length=1024, n_mels=128, center=True, power=1, + )(voice) + mel = safe_log(mel) + audio = vocos.decode(mel) + +write("out.wav", 32000, audio.flatten().numpy()) +``` + + diff --git a/vocos-audioset-32khz/config.yaml b/vocos-audioset-32khz/config.yaml new file mode 100644 index 0000000000000000000000000000000000000000..7098f57ff1ae3b9907f7b7307bb113abb9475f30 --- /dev/null +++ b/vocos-audioset-32khz/config.yaml @@ -0,0 +1,24 @@ +feature_extractor: + class_path: vocos.feature_extractors.MelSpectrogramFeatures + init_args: + sample_rate: 32000 + n_fft: 2048 + hop_length: 1024 + n_mels: 128 + padding: center + +backbone: + class_path: vocos.models.VocosBackbone + init_args: + input_channels: 128 + dim: 512 + 
intermediate_dim: 1536 + num_layers: 8 + +head: + class_path: vocos.heads.ISTFTHead + init_args: + dim: 512 + n_fft: 2048 + hop_length: 1024 + padding: center \ No newline at end of file diff --git a/vocos-audioset-32khz/source.txt b/vocos-audioset-32khz/source.txt new file mode 100644 index 0000000000000000000000000000000000000000..d3acd1fc60677187ad0fba48ddc4b190b698ee2e --- /dev/null +++ b/vocos-audioset-32khz/source.txt @@ -0,0 +1 @@ +https://huggingface.co/ZhangRC/vocos-audioset-32khz \ No newline at end of file diff --git a/vocos-audioset-32khz/vocos_checkpoint_epoch=464_step=1001610_val_loss=7.1732.ckpt b/vocos-audioset-32khz/vocos_checkpoint_epoch=464_step=1001610_val_loss=7.1732.ckpt new file mode 100644 index 0000000000000000000000000000000000000000..9ffc624b55686da60230465aab2696d11ecffb48 --- /dev/null +++ b/vocos-audioset-32khz/vocos_checkpoint_epoch=464_step=1001610_val_loss=7.1732.ckpt @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:7043569c8f810cede62d02bf5480438d29e3dcff21cdfb6b0dce5a96e39e730a +size 681397231 diff --git a/vocos-encodec-24khz/.gitattributes b/vocos-encodec-24khz/.gitattributes new file mode 100644 index 0000000000000000000000000000000000000000..c7d9f3332a950355d5a77d85000f05e6f45435ea --- /dev/null +++ b/vocos-encodec-24khz/.gitattributes @@ -0,0 +1,34 @@ +*.7z filter=lfs diff=lfs merge=lfs -text +*.arrow filter=lfs diff=lfs merge=lfs -text +*.bin filter=lfs diff=lfs merge=lfs -text +*.bz2 filter=lfs diff=lfs merge=lfs -text +*.ckpt filter=lfs diff=lfs merge=lfs -text +*.ftz filter=lfs diff=lfs merge=lfs -text +*.gz filter=lfs diff=lfs merge=lfs -text +*.h5 filter=lfs diff=lfs merge=lfs -text +*.joblib filter=lfs diff=lfs merge=lfs -text +*.lfs.* filter=lfs diff=lfs merge=lfs -text +*.mlmodel filter=lfs diff=lfs merge=lfs -text +*.model filter=lfs diff=lfs merge=lfs -text +*.msgpack filter=lfs diff=lfs merge=lfs -text +*.npy filter=lfs diff=lfs merge=lfs -text +*.npz filter=lfs diff=lfs merge=lfs -text 
+*.onnx filter=lfs diff=lfs merge=lfs -text +*.ot filter=lfs diff=lfs merge=lfs -text +*.parquet filter=lfs diff=lfs merge=lfs -text +*.pb filter=lfs diff=lfs merge=lfs -text +*.pickle filter=lfs diff=lfs merge=lfs -text +*.pkl filter=lfs diff=lfs merge=lfs -text +*.pt filter=lfs diff=lfs merge=lfs -text +*.pth filter=lfs diff=lfs merge=lfs -text +*.rar filter=lfs diff=lfs merge=lfs -text +*.safetensors filter=lfs diff=lfs merge=lfs -text +saved_model/**/* filter=lfs diff=lfs merge=lfs -text +*.tar.* filter=lfs diff=lfs merge=lfs -text +*.tflite filter=lfs diff=lfs merge=lfs -text +*.tgz filter=lfs diff=lfs merge=lfs -text +*.wasm filter=lfs diff=lfs merge=lfs -text +*.xz filter=lfs diff=lfs merge=lfs -text +*.zip filter=lfs diff=lfs merge=lfs -text +*.zst filter=lfs diff=lfs merge=lfs -text +*tfevents* filter=lfs diff=lfs merge=lfs -text diff --git a/vocos-encodec-24khz/README.md b/vocos-encodec-24khz/README.md new file mode 100644 index 0000000000000000000000000000000000000000..ee1550910386edecd77cb660c3a8339471ff891e --- /dev/null +++ b/vocos-encodec-24khz/README.md @@ -0,0 +1,73 @@ +--- +license: mit +--- + +# Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis + +[Audio samples](https://charactr-platform.github.io/vocos/) | +Paper [[abs]](https://arxiv.org/abs/2306.00814) [[pdf]](https://arxiv.org/pdf/2306.00814.pdf) + +Vocos is a fast neural vocoder designed to synthesize audio waveforms from acoustic features. Trained using a Generative +Adversarial Network (GAN) objective, Vocos can generate waveforms in a single forward pass. Unlike other typical +GAN-based vocoders, Vocos does not model audio samples in the time domain. Instead, it generates spectral +coefficients, facilitating rapid audio reconstruction through inverse Fourier transform. 
+ +## Installation + +To use Vocos only in inference mode, install it using: + +```bash +pip install vocos +``` + +If you wish to train the model, install it with additional dependencies: + +```bash +pip install vocos[train] +``` + +## Usage + +### Reconstruct audio from EnCodec tokens + +You also need to provide a `bandwidth_id`, which corresponds to the bandwidth embedding from the +list `[1.5, 3.0, 6.0, 12.0]` (kbps). + +```python +import torch + +from vocos import Vocos + +vocos = Vocos.from_pretrained("charactr/vocos-encodec-24khz") + +audio_tokens = torch.randint(low=0, high=1024, size=(8, 200)) # 8 codebooks, 200 frames +features = vocos.codes_to_features(audio_tokens) +bandwidth_id = torch.tensor([2]) # 6 kbps + +audio = vocos.decode(features, bandwidth_id=bandwidth_id) +``` + +Copy-synthesis from a file: it extracts and quantizes features with EnCodec, then reconstructs them with Vocos in a +single forward pass. + +```python +import torchaudio + +y, sr = torchaudio.load(YOUR_AUDIO_FILE) +if y.size(0) > 1: # mix to mono + y = y.mean(dim=0, keepdim=True) +y = torchaudio.functional.resample(y, orig_freq=sr, new_freq=24000) + +y_hat = vocos(y, bandwidth_id=bandwidth_id) +``` + +## Citation + +If this code contributes to your research, please cite our work: + +``` +@article{siuzdak2023vocos, + title={Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis}, + author={Siuzdak, Hubert}, + journal={arXiv preprint arXiv:2306.00814}, + year={2023} +} +``` + +## License + +The code in this repository is released under the MIT license.
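The mapping between an EnCodec bandwidth in kbps and the `bandwidth_id` index used above is easy to get wrong. A minimal sketch of a helper that performs the lookup; `bandwidth_to_id` is a hypothetical convenience function, not part of the vocos API, and the list of supported bandwidths is taken from the README above:

```python
# Hedged sketch: map a target EnCodec bandwidth (kbps) to the index passed as
# `bandwidth_id`. SUPPORTED_BANDWIDTHS mirrors the list stated in the README;
# bandwidth_to_id is a hypothetical helper for illustration only.
SUPPORTED_BANDWIDTHS = [1.5, 3.0, 6.0, 12.0]  # kbps

def bandwidth_to_id(kbps: float) -> int:
    if kbps not in SUPPORTED_BANDWIDTHS:
        raise ValueError(f"unsupported bandwidth {kbps}, expected one of {SUPPORTED_BANDWIDTHS}")
    return SUPPORTED_BANDWIDTHS.index(kbps)

# e.g. the 6 kbps setting in the snippet above corresponds to index 2
```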
\ No newline at end of file diff --git a/vocos-encodec-24khz/config.yaml b/vocos-encodec-24khz/config.yaml new file mode 100644 index 0000000000000000000000000000000000000000..943725ef0c2ce567f75551dfed0cf5c4e01c8908 --- /dev/null +++ b/vocos-encodec-24khz/config.yaml @@ -0,0 +1,23 @@ +feature_extractor: + class_path: vocos.feature_extractors.EncodecFeatures + init_args: + encodec_model: encodec_24khz + bandwidths: [1.5, 3.0, 6.0, 12.0] + train_codebooks: false + +backbone: + class_path: vocos.models.VocosBackbone + init_args: + input_channels: 128 + dim: 384 + intermediate_dim: 1152 + num_layers: 8 + adanorm_num_embeddings: 4 # len(bandwidths) + +head: + class_path: vocos.heads.ISTFTHead + init_args: + dim: 384 + n_fft: 1280 + hop_length: 320 + padding: same \ No newline at end of file diff --git a/vocos-encodec-24khz/pytorch_model.bin b/vocos-encodec-24khz/pytorch_model.bin new file mode 100644 index 0000000000000000000000000000000000000000..6a4a479418137b34b4c9f467466234ffd62b0821 --- /dev/null +++ b/vocos-encodec-24khz/pytorch_model.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:7e95bb260b74a1bfc43c52d355831c951acb81c8960e9c62b79bd2b3ab1e3a90 +size 40356708 diff --git a/vocos-encodec-24khz/source.txt b/vocos-encodec-24khz/source.txt new file mode 100644 index 0000000000000000000000000000000000000000..643b37e2899837da266839ee9d02724265a4954e --- /dev/null +++ b/vocos-encodec-24khz/source.txt @@ -0,0 +1 @@ +https://huggingface.co/charactr/vocos-encodec-24khz \ No newline at end of file diff --git a/vocos-mel-10ms-24khz/.gitattributes b/vocos-mel-10ms-24khz/.gitattributes new file mode 100644 index 0000000000000000000000000000000000000000..a6344aac8c09253b3b630fb776ae94478aa0275b --- /dev/null +++ b/vocos-mel-10ms-24khz/.gitattributes @@ -0,0 +1,35 @@ +*.7z filter=lfs diff=lfs merge=lfs -text +*.arrow filter=lfs diff=lfs merge=lfs -text +*.bin filter=lfs diff=lfs merge=lfs -text +*.bz2 filter=lfs diff=lfs merge=lfs -text +*.ckpt 
filter=lfs diff=lfs merge=lfs -text +*.ftz filter=lfs diff=lfs merge=lfs -text +*.gz filter=lfs diff=lfs merge=lfs -text +*.h5 filter=lfs diff=lfs merge=lfs -text +*.joblib filter=lfs diff=lfs merge=lfs -text +*.lfs.* filter=lfs diff=lfs merge=lfs -text +*.mlmodel filter=lfs diff=lfs merge=lfs -text +*.model filter=lfs diff=lfs merge=lfs -text +*.msgpack filter=lfs diff=lfs merge=lfs -text +*.npy filter=lfs diff=lfs merge=lfs -text +*.npz filter=lfs diff=lfs merge=lfs -text +*.onnx filter=lfs diff=lfs merge=lfs -text +*.ot filter=lfs diff=lfs merge=lfs -text +*.parquet filter=lfs diff=lfs merge=lfs -text +*.pb filter=lfs diff=lfs merge=lfs -text +*.pickle filter=lfs diff=lfs merge=lfs -text +*.pkl filter=lfs diff=lfs merge=lfs -text +*.pt filter=lfs diff=lfs merge=lfs -text +*.pth filter=lfs diff=lfs merge=lfs -text +*.rar filter=lfs diff=lfs merge=lfs -text +*.safetensors filter=lfs diff=lfs merge=lfs -text +saved_model/**/* filter=lfs diff=lfs merge=lfs -text +*.tar.* filter=lfs diff=lfs merge=lfs -text +*.tar filter=lfs diff=lfs merge=lfs -text +*.tflite filter=lfs diff=lfs merge=lfs -text +*.tgz filter=lfs diff=lfs merge=lfs -text +*.wasm filter=lfs diff=lfs merge=lfs -text +*.xz filter=lfs diff=lfs merge=lfs -text +*.zip filter=lfs diff=lfs merge=lfs -text +*.zst filter=lfs diff=lfs merge=lfs -text +*tfevents* filter=lfs diff=lfs merge=lfs -text diff --git a/vocos-mel-10ms-24khz/README.md b/vocos-mel-10ms-24khz/README.md new file mode 100644 index 0000000000000000000000000000000000000000..ed6a70c99c0f1f74c766a81c995993a8d56a209e --- /dev/null +++ b/vocos-mel-10ms-24khz/README.md @@ -0,0 +1,33 @@ +--- +license: mit +--- + +## Reconstruct audio from mel-spectrogram with 10 ms frame shift + +To use Vocos only in inference mode, install it using: + +```bash +pip install vocos +``` + +Load the model and run inference: + +```python +import torch + +from vocos import Vocos + +vocos = Vocos.from_pretrained("meaningteam/vocos-mel-10ms-24khz") + +audio = torch.randn(1, 
24000) +mel = vocos.feature_extractor(audio) +prediction = vocos.decode(mel) +``` + +## Model details + +This model was trained on the DNS Challenge dataset for 1M steps. Unlike `charactr/vocos-mel-24khz`, it uses a 10 ms frame shift. + +## License + +The code in this repository is released under the MIT license. \ No newline at end of file diff --git a/vocos-mel-10ms-24khz/config.yaml b/vocos-mel-10ms-24khz/config.yaml new file mode 100644 index 0000000000000000000000000000000000000000..53216f56bceb1b816acbd4922bf6e7be7a10c430 --- /dev/null +++ b/vocos-mel-10ms-24khz/config.yaml @@ -0,0 +1,31 @@ +backbone: + class_path: vocos.models.VocosBackbone + init_args: + dim: 512 + input_channels: 100 + intermediate_dim: 1536 + num_layers: 8 +evaluate_periodicty: false +evaluate_pesq: true +evaluate_utmos: false +feature_extractor: + class_path: vocos.feature_extractors.MelSpectrogramFeatures + init_args: + hop_length: 240 + n_fft: 960 + n_mels: 100 + padding: center + sample_rate: 24000 +head: + class_path: vocos.heads.ISTFTHead + init_args: + dim: 512 + hop_length: 240 + n_fft: 960 + padding: center +initial_learning_rate: 5e-4 +mel_loss_coeff: 45 +mrd_loss_coeff: 0.1 +num_warmup_steps: 0 +pretrain_mel_steps: 0 +sample_rate: 24000 diff --git a/vocos-mel-10ms-24khz/pytorch_model.bin b/vocos-mel-10ms-24khz/pytorch_model.bin new file mode 100644 index 0000000000000000000000000000000000000000..643179d06288055bd005457527941cd6d3f9bade --- /dev/null +++ b/vocos-mel-10ms-24khz/pytorch_model.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:c3885f32d463665bcff9df6381a5f73e5ca12dbe77c960f02965f7fe85a4f275 +size 54221351 diff --git a/vocos-mel-10ms-24khz/source.txt b/vocos-mel-10ms-24khz/source.txt new file mode 100644 index 0000000000000000000000000000000000000000..87c5326d5acf098df8d27a0a7f163485b46271bd --- /dev/null +++ b/vocos-mel-10ms-24khz/source.txt @@ -0,0 +1 @@ +https://huggingface.co/meaningteam/vocos-mel-10ms-24khz \ No newline at end of
file diff --git a/vocos-mel-22khz/.gitattributes b/vocos-mel-22khz/.gitattributes new file mode 100644 index 0000000000000000000000000000000000000000..a6344aac8c09253b3b630fb776ae94478aa0275b --- /dev/null +++ b/vocos-mel-22khz/.gitattributes @@ -0,0 +1,35 @@ +*.7z filter=lfs diff=lfs merge=lfs -text +*.arrow filter=lfs diff=lfs merge=lfs -text +*.bin filter=lfs diff=lfs merge=lfs -text +*.bz2 filter=lfs diff=lfs merge=lfs -text +*.ckpt filter=lfs diff=lfs merge=lfs -text +*.ftz filter=lfs diff=lfs merge=lfs -text +*.gz filter=lfs diff=lfs merge=lfs -text +*.h5 filter=lfs diff=lfs merge=lfs -text +*.joblib filter=lfs diff=lfs merge=lfs -text +*.lfs.* filter=lfs diff=lfs merge=lfs -text +*.mlmodel filter=lfs diff=lfs merge=lfs -text +*.model filter=lfs diff=lfs merge=lfs -text +*.msgpack filter=lfs diff=lfs merge=lfs -text +*.npy filter=lfs diff=lfs merge=lfs -text +*.npz filter=lfs diff=lfs merge=lfs -text +*.onnx filter=lfs diff=lfs merge=lfs -text +*.ot filter=lfs diff=lfs merge=lfs -text +*.parquet filter=lfs diff=lfs merge=lfs -text +*.pb filter=lfs diff=lfs merge=lfs -text +*.pickle filter=lfs diff=lfs merge=lfs -text +*.pkl filter=lfs diff=lfs merge=lfs -text +*.pt filter=lfs diff=lfs merge=lfs -text +*.pth filter=lfs diff=lfs merge=lfs -text +*.rar filter=lfs diff=lfs merge=lfs -text +*.safetensors filter=lfs diff=lfs merge=lfs -text +saved_model/**/* filter=lfs diff=lfs merge=lfs -text +*.tar.* filter=lfs diff=lfs merge=lfs -text +*.tar filter=lfs diff=lfs merge=lfs -text +*.tflite filter=lfs diff=lfs merge=lfs -text +*.tgz filter=lfs diff=lfs merge=lfs -text +*.wasm filter=lfs diff=lfs merge=lfs -text +*.xz filter=lfs diff=lfs merge=lfs -text +*.zip filter=lfs diff=lfs merge=lfs -text +*.zst filter=lfs diff=lfs merge=lfs -text +*tfevents* filter=lfs diff=lfs merge=lfs -text diff --git a/vocos-mel-22khz/README.md b/vocos-mel-22khz/README.md new file mode 100644 index 0000000000000000000000000000000000000000..9cf11598f9b3e83843e07899423b07e2a01f0a8a --- 
/dev/null +++ b/vocos-mel-22khz/README.md @@ -0,0 +1,182 @@ +--- +license: apache-2.0 +datasets: +- projecte-aina/festcat_trimmed_denoised +- projecte-aina/openslr-slr69-ca-trimmed-denoised +- lj_speech +- blabble-io/libritts_r +tags: +- vocoder +- mel +- vocos +- hifigan +- tts +--- + +# Vocos-mel-22khz + + + + + +## Model Details + +### Model Description + + + +**Vocos** is a fast neural vocoder designed to synthesize audio waveforms from acoustic features. +Unlike other typical GAN-based vocoders, Vocos does not model audio samples in the time domain. +Instead, it generates spectral coefficients, facilitating rapid audio reconstruction through +inverse Fourier transform. + +This version of Vocos uses 80-bin mel spectrograms as acoustic features, which have been widespread +in the TTS domain since the introduction of [hifi-gan](https://github.com/jik876/hifi-gan/blob/master/meldataset.py). +The goal of this model is to provide an alternative to hifi-gan that is faster and compatible with the +acoustic output of several TTS models. + +We are grateful to the authors for open-sourcing the code, allowing us to modify and train this version. + +## Intended Uses and Limitations + + +The model is intended to serve as a vocoder to synthesize audio waveforms from mel spectrograms. It is trained to generate speech; if it is used in other audio +domains, the model may not produce high-quality samples. + +## How to Get Started with the Model + +Use the code below to get started with the model.
+ +### Installation + +To use Vocos only in inference mode, install it using: + +```bash +pip install git+https://github.com/langtech-bsc/vocos.git@matcha +``` + +### Reconstruct audio from mel-spectrogram + +```python +import torch + +from vocos import Vocos + +vocos = Vocos.from_pretrained("BSC-LT/vocos-mel-22khz") + +mel = torch.randn(1, 80, 256) # B, C, T +audio = vocos.decode(mel) +``` +### Integrate with existing TTS models: + +* Matcha-TTS + + Open In Colab + + +* Fastpitch + + Open In Colab + + +### Copy-synthesis from a file: + +```python +import torchaudio + +y, sr = torchaudio.load(YOUR_AUDIO_FILE) +if y.size(0) > 1: # mix to mono + y = y.mean(dim=0, keepdim=True) +y = torchaudio.functional.resample(y, orig_freq=sr, new_freq=22050) +y_hat = vocos(y) +``` + + +### Onnx + +We also release an ONNX version of the model; you can try it in Colab: + + + Open In Colab + + +## Training Details + +### Training Data + + + +The model was trained on 4 speech datasets: + +| Dataset | Language | Hours | +|---------------------|----------|---------| +| LibriTTS-r | en | 585 | +| LJSpeech | en | 24 | +| Festcat | ca | 22 | +| OpenSLR69 | ca | 5 | + + +### Training Procedure + + +The model was trained for 1.8M steps and 183 epochs with a batch size of 16 for stability. We used a cosine scheduler with an initial learning rate of 5e-4. +We also modified the mel spectrogram loss to use 128 bins and an fmax of 11025 Hz, instead of the parameters of the input mel spectrogram.
+ + +#### Training Hyperparameters + + +* initial_learning_rate: 5e-4 +* scheduler: cosine without warmup or restarts +* mel_loss_coeff: 45 +* mrd_loss_coeff: 0.1 +* batch_size: 16 +* num_samples: 16384 + +## Evaluation + + + +Evaluation was done using the metrics from the original repo; after 183 epochs we achieve: + +* val_loss: 3.81 +* f1_score: 0.94 +* mel_loss: 0.25 +* periodicity_loss: 0.132 +* pesq_score: 3.16 +* pitch_loss: 38.11 +* utmos_score: 3.27 + + +## Citation + + + +If this code contributes to your research, please cite the work: + +``` +@article{siuzdak2023vocos, + title={Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis}, + author={Siuzdak, Hubert}, + journal={arXiv preprint arXiv:2306.00814}, + year={2023} +} +``` + +## Additional information + +### Author +The Language Technologies Unit from Barcelona Supercomputing Center. + +### Contact +For further information, please send an email to . + +### Copyright +Copyright(c) 2024 by Language Technologies Unit, Barcelona Supercomputing Center. + +### License +[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) + +### Funding + +This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/).
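The scheduler listed in the hyperparameters above (cosine, without warmup or restarts) can be sketched as a closed-form function of the step count. This is a hedged illustration: `total_steps` and `min_lr` are assumed parameters, not values recorded from the training run.

```python
import math

# Hedged sketch: cosine learning-rate decay without warmup or restarts,
# starting from the initial_learning_rate listed above. total_steps and
# min_lr are illustrative parameters only.
def cosine_lr(step: int, total_steps: int, base_lr: float = 5e-4, min_lr: float = 0.0) -> float:
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The rate starts at `base_lr`, decays smoothly, and reaches `min_lr` at `total_steps`, after which it stays flat.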
\ No newline at end of file diff --git a/vocos-mel-22khz/config.yaml b/vocos-mel-22khz/config.yaml new file mode 100644 index 0000000000000000000000000000000000000000..6a59096467954fbdc569b55addbafdf25fa188fa --- /dev/null +++ b/vocos-mel-22khz/config.yaml @@ -0,0 +1,33 @@ +# pytorch_lightning==1.8.6 + +feature_extractor: + class_path: vocos.feature_extractors.MelSpectrogramFeatures + init_args: + sample_rate: 22050 + n_fft: 1024 + hop_length: 256 + n_mels: 80 + padding: same + f_min: 0 + f_max: 8000 + norm: "slaney" + mel_scale: "slaney" + + +backbone: + class_path: vocos.models.VocosBackbone + init_args: + input_channels: 80 + dim: 512 + intermediate_dim: 1536 + num_layers: 8 + +head: + class_path: vocos.heads.ISTFTHead + init_args: + dim: 512 + n_fft: 1024 + hop_length: 256 + padding: same + + diff --git a/vocos-mel-22khz/mel_spec_22khz_univ.onnx b/vocos-mel-22khz/mel_spec_22khz_univ.onnx new file mode 100644 index 0000000000000000000000000000000000000000..c1538cec7858dbdc947297172403a4d87b7bb6d7 --- /dev/null +++ b/vocos-mel-22khz/mel_spec_22khz_univ.onnx @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:8ab0744d7d49601ed8ad9be2927fcc99fb359cc90fe28bc9535c0484b3621de3 +size 53883652 diff --git a/vocos-mel-22khz/pytorch_model.bin b/vocos-mel-22khz/pytorch_model.bin new file mode 100644 index 0000000000000000000000000000000000000000..07b53354015b223bced6b70b707b1af654990c33 --- /dev/null +++ b/vocos-mel-22khz/pytorch_model.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:0af7b6f4b153819ada44a917135acf33944cdbb70cde0701eda3d100153799c7 +size 54051047 diff --git a/vocos-mel-22khz/source.txt b/vocos-mel-22khz/source.txt new file mode 100644 index 0000000000000000000000000000000000000000..fc8a42162dad05a27c369f74416866f79dac8d25 --- /dev/null +++ b/vocos-mel-22khz/source.txt @@ -0,0 +1 @@ +https://huggingface.co/BSC-LT/vocos-mel-22khz \ No newline at end of file diff --git 
a/vocos-mel-22khz/vocos_checkpoint_epoch=183_step=3690672_val_loss=3.8142.ckpt b/vocos-mel-22khz/vocos_checkpoint_epoch=183_step=3690672_val_loss=3.8142.ckpt new file mode 100644 index 0000000000000000000000000000000000000000..f1e5523e6752f829a508f455585f9a30b95f296e --- /dev/null +++ b/vocos-mel-22khz/vocos_checkpoint_epoch=183_step=3690672_val_loss=3.8142.ckpt @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:ec7cc19942235286d3243c6e47798af4510b1feda50901ea46d41073403f40c9 +size 672720367 diff --git a/vocos-mel-24khz-onnx/.gitattributes b/vocos-mel-24khz-onnx/.gitattributes new file mode 100644 index 0000000000000000000000000000000000000000..a6344aac8c09253b3b630fb776ae94478aa0275b --- /dev/null +++ b/vocos-mel-24khz-onnx/.gitattributes @@ -0,0 +1,35 @@ +*.7z filter=lfs diff=lfs merge=lfs -text +*.arrow filter=lfs diff=lfs merge=lfs -text +*.bin filter=lfs diff=lfs merge=lfs -text +*.bz2 filter=lfs diff=lfs merge=lfs -text +*.ckpt filter=lfs diff=lfs merge=lfs -text +*.ftz filter=lfs diff=lfs merge=lfs -text +*.gz filter=lfs diff=lfs merge=lfs -text +*.h5 filter=lfs diff=lfs merge=lfs -text +*.joblib filter=lfs diff=lfs merge=lfs -text +*.lfs.* filter=lfs diff=lfs merge=lfs -text +*.mlmodel filter=lfs diff=lfs merge=lfs -text +*.model filter=lfs diff=lfs merge=lfs -text +*.msgpack filter=lfs diff=lfs merge=lfs -text +*.npy filter=lfs diff=lfs merge=lfs -text +*.npz filter=lfs diff=lfs merge=lfs -text +*.onnx filter=lfs diff=lfs merge=lfs -text +*.ot filter=lfs diff=lfs merge=lfs -text +*.parquet filter=lfs diff=lfs merge=lfs -text +*.pb filter=lfs diff=lfs merge=lfs -text +*.pickle filter=lfs diff=lfs merge=lfs -text +*.pkl filter=lfs diff=lfs merge=lfs -text +*.pt filter=lfs diff=lfs merge=lfs -text +*.pth filter=lfs diff=lfs merge=lfs -text +*.rar filter=lfs diff=lfs merge=lfs -text +*.safetensors filter=lfs diff=lfs merge=lfs -text +saved_model/**/* filter=lfs diff=lfs merge=lfs -text +*.tar.* filter=lfs diff=lfs merge=lfs -text +*.tar 
filter=lfs diff=lfs merge=lfs -text +*.tflite filter=lfs diff=lfs merge=lfs -text +*.tgz filter=lfs diff=lfs merge=lfs -text +*.wasm filter=lfs diff=lfs merge=lfs -text +*.xz filter=lfs diff=lfs merge=lfs -text +*.zip filter=lfs diff=lfs merge=lfs -text +*.zst filter=lfs diff=lfs merge=lfs -text +*tfevents* filter=lfs diff=lfs merge=lfs -text diff --git a/vocos-mel-24khz-onnx/README.md b/vocos-mel-24khz-onnx/README.md new file mode 100644 index 0000000000000000000000000000000000000000..c8619fb520bd3be235e8d751c19416d2906a18e5 --- /dev/null +++ b/vocos-mel-24khz-onnx/README.md @@ -0,0 +1,33 @@ +--- +license: mit +library: ONNX +base_model: charactr/vocos-mel-24khz +--- + +**Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis** + +**Audio samples | Paper [abs] [pdf]** + +Vocos is a fast neural vocoder designed to synthesize audio waveforms from acoustic features. Trained using a Generative Adversarial Network (GAN) objective, Vocos can generate waveforms in a single forward pass. Unlike other typical GAN-based vocoders, Vocos does not model audio samples in the time domain. Instead, it generates spectral coefficients, facilitating rapid audio reconstruction through inverse Fourier transform. + +This is an ONNX version of the original 24kHz mel spectrogram [model](https://huggingface.co/charactr/vocos-mel-24khz). The model predicts spectrograms, and the ISTFT is performed outside the ONNX graph, since ISTFT is not yet implemented as an ONNX operator. 
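Because the ISTFT step lives outside the exported graph, callers must supply their own reconstruction. Below is a minimal NumPy sketch of windowed overlap-add ISTFT using this model's STFT parameters (n_fft=1024, hop_length=256; the periodic Hann window is an assumption matching the usual Vocos head), verified by round-tripping a test tone:

```python
import numpy as np

n_fft, hop = 1024, 256
win = np.hanning(n_fft + 1)[:-1]  # periodic Hann window

def stft(x):
    # frame the signal, window each frame, take the one-sided FFT
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.stack(frames), axis=-1)

def istft(spec, length):
    out = np.zeros(length)
    norm = np.zeros(length)  # window-envelope normalization (sum of win**2)
    for k, frame in enumerate(np.fft.irfft(spec, n=n_fft, axis=-1)):
        i = k * hop
        out[i:i + n_fft] += frame * win
        norm[i:i + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

x = np.sin(2 * np.pi * 440 * np.arange(24000) / 24000)
y = istft(stft(x), len(x))
# edges lack full window overlap, so compare the interior only
err = np.max(np.abs(x[n_fft:-n_fft] - y[n_fft:-n_fft]))
print(err)  # near machine precision: reconstruction is exact up to float error
```

In practice you would feed the real/imaginary (or magnitude/phase) outputs of the ONNX session into such a routine instead of `stft(x)`.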
+ +## Usage + +Try out in colab: + + + Open In Colab + + +## Citation + +``` +@article{siuzdak2023vocos, + title={Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis}, + author={Siuzdak, Hubert}, + journal={arXiv preprint arXiv:2306.00814}, + year={2023} +} + +``` \ No newline at end of file diff --git a/vocos-mel-24khz-onnx/config.yaml b/vocos-mel-24khz-onnx/config.yaml new file mode 100644 index 0000000000000000000000000000000000000000..538262138a8b43863802f279909f26ec31c766b3 --- /dev/null +++ b/vocos-mel-24khz-onnx/config.yaml @@ -0,0 +1,24 @@ +feature_extractor: + class_path: vocos.feature_extractors.MelSpectrogramFeatures + init_args: + sample_rate: 24000 + n_fft: 1024 + hop_length: 256 + n_mels: 100 + padding: center + +backbone: + class_path: vocos.models.VocosBackbone + init_args: + input_channels: 100 + dim: 512 + intermediate_dim: 1536 + num_layers: 8 + +head: + class_path: vocos.heads.ISTFTHead + init_args: + dim: 512 + n_fft: 1024 + hop_length: 256 + padding: center diff --git a/vocos-mel-24khz-onnx/mel_spec_24khz.onnx b/vocos-mel-24khz-onnx/mel_spec_24khz.onnx new file mode 100644 index 0000000000000000000000000000000000000000..189bc5718cfa8486d8482d6fbcea7faa35e30a19 --- /dev/null +++ b/vocos-mel-24khz-onnx/mel_spec_24khz.onnx @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:a84c58728a769e8a98eeca75bb89102987db0028e5e3d44b45af2ae3ef0104e2 +size 54156978 diff --git a/vocos-mel-24khz-onnx/source.txt b/vocos-mel-24khz-onnx/source.txt new file mode 100644 index 0000000000000000000000000000000000000000..694d5d27ac991cbb1671b3c60116569b3ff7f187 --- /dev/null +++ b/vocos-mel-24khz-onnx/source.txt @@ -0,0 +1 @@ +https://huggingface.co/wetdog/vocos-mel-24khz-onnx \ No newline at end of file diff --git a/vocos-mel-24khz/.gitattributes b/vocos-mel-24khz/.gitattributes new file mode 100644 index 0000000000000000000000000000000000000000..c7d9f3332a950355d5a77d85000f05e6f45435ea --- 
/dev/null +++ b/vocos-mel-24khz/.gitattributes @@ -0,0 +1,34 @@ +*.7z filter=lfs diff=lfs merge=lfs -text +*.arrow filter=lfs diff=lfs merge=lfs -text +*.bin filter=lfs diff=lfs merge=lfs -text +*.bz2 filter=lfs diff=lfs merge=lfs -text +*.ckpt filter=lfs diff=lfs merge=lfs -text +*.ftz filter=lfs diff=lfs merge=lfs -text +*.gz filter=lfs diff=lfs merge=lfs -text +*.h5 filter=lfs diff=lfs merge=lfs -text +*.joblib filter=lfs diff=lfs merge=lfs -text +*.lfs.* filter=lfs diff=lfs merge=lfs -text +*.mlmodel filter=lfs diff=lfs merge=lfs -text +*.model filter=lfs diff=lfs merge=lfs -text +*.msgpack filter=lfs diff=lfs merge=lfs -text +*.npy filter=lfs diff=lfs merge=lfs -text +*.npz filter=lfs diff=lfs merge=lfs -text +*.onnx filter=lfs diff=lfs merge=lfs -text +*.ot filter=lfs diff=lfs merge=lfs -text +*.parquet filter=lfs diff=lfs merge=lfs -text +*.pb filter=lfs diff=lfs merge=lfs -text +*.pickle filter=lfs diff=lfs merge=lfs -text +*.pkl filter=lfs diff=lfs merge=lfs -text +*.pt filter=lfs diff=lfs merge=lfs -text +*.pth filter=lfs diff=lfs merge=lfs -text +*.rar filter=lfs diff=lfs merge=lfs -text +*.safetensors filter=lfs diff=lfs merge=lfs -text +saved_model/**/* filter=lfs diff=lfs merge=lfs -text +*.tar.* filter=lfs diff=lfs merge=lfs -text +*.tflite filter=lfs diff=lfs merge=lfs -text +*.tgz filter=lfs diff=lfs merge=lfs -text +*.wasm filter=lfs diff=lfs merge=lfs -text +*.xz filter=lfs diff=lfs merge=lfs -text +*.zip filter=lfs diff=lfs merge=lfs -text +*.zst filter=lfs diff=lfs merge=lfs -text +*tfevents* filter=lfs diff=lfs merge=lfs -text diff --git a/vocos-mel-24khz/README.md b/vocos-mel-24khz/README.md new file mode 100644 index 0000000000000000000000000000000000000000..226bd9452b8561a28a5633a4e1576e883984cc5f --- /dev/null +++ b/vocos-mel-24khz/README.md @@ -0,0 +1,71 @@ +--- +license: mit +--- + +# Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis + +[Audio 
samples](https://charactr-platform.github.io/vocos/) | +Paper [[abs]](https://arxiv.org/abs/2306.00814) [[pdf]](https://arxiv.org/pdf/2306.00814.pdf) + +Vocos is a fast neural vocoder designed to synthesize audio waveforms from acoustic features. Trained using a Generative +Adversarial Network (GAN) objective, Vocos can generate waveforms in a single forward pass. Unlike other typical +GAN-based vocoders, Vocos does not model audio samples in the time domain. Instead, it generates spectral +coefficients, facilitating rapid audio reconstruction through inverse Fourier transform. + +## Installation + +To use Vocos only in inference mode, install it using: + +```bash +pip install vocos +``` + +If you wish to train the model, install it with additional dependencies: + +```bash +pip install vocos[train] +``` + +## Usage + +### Reconstruct audio from mel-spectrogram + +```python +import torch + +from vocos import Vocos + +vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz") + +mel = torch.randn(1, 100, 256) # B, C, T +audio = vocos.decode(mel) +``` + +Copy-synthesis from a file: + +```python +import torchaudio + +y, sr = torchaudio.load(YOUR_AUDIO_FILE) +if y.size(0) > 1: # mix to mono + y = y.mean(dim=0, keepdim=True) +y = torchaudio.functional.resample(y, orig_freq=sr, new_freq=24000) +y_hat = vocos(y) +``` + +## Citation + +If this code contributes to your research, please cite our work: + +``` +@article{siuzdak2023vocos, + title={Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis}, + author={Siuzdak, Hubert}, + journal={arXiv preprint arXiv:2306.00814}, + year={2023} +} +``` + +## License + +The code in this repository is released under the MIT license. 
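As a quick sanity check on shapes, the `(1, 100, 256)` mel tensor in the usage snippet corresponds to 256 frames at a 256-sample hop. A hypothetical helper for this arithmetic (the exact decoded length can differ by up to one hop depending on the padding mode):

```python
SR, HOP = 24000, 256  # sample rate and hop length from the model config

def mel_to_audio_len(n_frames: int) -> int:
    # each mel frame advances the waveform by one hop
    return n_frames * HOP

n_frames = 256
n_samples = mel_to_audio_len(n_frames)
print(n_samples, n_samples / SR)  # 65536 samples, about 2.73 s of audio
```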
\ No newline at end of file diff --git a/vocos-mel-24khz/config.yaml b/vocos-mel-24khz/config.yaml new file mode 100644 index 0000000000000000000000000000000000000000..538262138a8b43863802f279909f26ec31c766b3 --- /dev/null +++ b/vocos-mel-24khz/config.yaml @@ -0,0 +1,24 @@ +feature_extractor: + class_path: vocos.feature_extractors.MelSpectrogramFeatures + init_args: + sample_rate: 24000 + n_fft: 1024 + hop_length: 256 + n_mels: 100 + padding: center + +backbone: + class_path: vocos.models.VocosBackbone + init_args: + input_channels: 100 + dim: 512 + intermediate_dim: 1536 + num_layers: 8 + +head: + class_path: vocos.heads.ISTFTHead + init_args: + dim: 512 + n_fft: 1024 + hop_length: 256 + padding: center diff --git a/vocos-mel-24khz/pytorch_model.bin b/vocos-mel-24khz/pytorch_model.bin new file mode 100644 index 0000000000000000000000000000000000000000..1e837026c9a3bb538af2c093041587f3d565f12a --- /dev/null +++ b/vocos-mel-24khz/pytorch_model.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:97ec976ad1fd67a33ab2682d29c0ac7df85234fae875aefcc5fb215681a91b2a +size 54365991 diff --git a/vocos-mel-24khz/source.txt b/vocos-mel-24khz/source.txt new file mode 100644 index 0000000000000000000000000000000000000000..aa49ae7248d28ea1615dc3e3b7e7a864e5129e5d --- /dev/null +++ b/vocos-mel-24khz/source.txt @@ -0,0 +1 @@ +https://huggingface.co/charactr/vocos-mel-24khz \ No newline at end of file diff --git a/vocos-mel-48khz-alpha1/.gitattributes b/vocos-mel-48khz-alpha1/.gitattributes new file mode 100644 index 0000000000000000000000000000000000000000..a6344aac8c09253b3b630fb776ae94478aa0275b --- /dev/null +++ b/vocos-mel-48khz-alpha1/.gitattributes @@ -0,0 +1,35 @@ +*.7z filter=lfs diff=lfs merge=lfs -text +*.arrow filter=lfs diff=lfs merge=lfs -text +*.bin filter=lfs diff=lfs merge=lfs -text +*.bz2 filter=lfs diff=lfs merge=lfs -text +*.ckpt filter=lfs diff=lfs merge=lfs -text +*.ftz filter=lfs diff=lfs merge=lfs -text +*.gz filter=lfs diff=lfs 
merge=lfs -text +*.h5 filter=lfs diff=lfs merge=lfs -text +*.joblib filter=lfs diff=lfs merge=lfs -text +*.lfs.* filter=lfs diff=lfs merge=lfs -text +*.mlmodel filter=lfs diff=lfs merge=lfs -text +*.model filter=lfs diff=lfs merge=lfs -text +*.msgpack filter=lfs diff=lfs merge=lfs -text +*.npy filter=lfs diff=lfs merge=lfs -text +*.npz filter=lfs diff=lfs merge=lfs -text +*.onnx filter=lfs diff=lfs merge=lfs -text +*.ot filter=lfs diff=lfs merge=lfs -text +*.parquet filter=lfs diff=lfs merge=lfs -text +*.pb filter=lfs diff=lfs merge=lfs -text +*.pickle filter=lfs diff=lfs merge=lfs -text +*.pkl filter=lfs diff=lfs merge=lfs -text +*.pt filter=lfs diff=lfs merge=lfs -text +*.pth filter=lfs diff=lfs merge=lfs -text +*.rar filter=lfs diff=lfs merge=lfs -text +*.safetensors filter=lfs diff=lfs merge=lfs -text +saved_model/**/* filter=lfs diff=lfs merge=lfs -text +*.tar.* filter=lfs diff=lfs merge=lfs -text +*.tar filter=lfs diff=lfs merge=lfs -text +*.tflite filter=lfs diff=lfs merge=lfs -text +*.tgz filter=lfs diff=lfs merge=lfs -text +*.wasm filter=lfs diff=lfs merge=lfs -text +*.xz filter=lfs diff=lfs merge=lfs -text +*.zip filter=lfs diff=lfs merge=lfs -text +*.zst filter=lfs diff=lfs merge=lfs -text +*tfevents* filter=lfs diff=lfs merge=lfs -text diff --git a/vocos-mel-48khz-alpha1/README.md b/vocos-mel-48khz-alpha1/README.md new file mode 100644 index 0000000000000000000000000000000000000000..9cd350b002674a00efff3a6d23ce1f3b4a6b1071 --- /dev/null +++ b/vocos-mel-48khz-alpha1/README.md @@ -0,0 +1,75 @@ +--- +license: mit +tags: +- audio +library_name: pytorch +--- + +# Vocos + +#### Note: This repo has no affiliation with the author of Vocos. + +Pretrained Vocos model with a 48kHz sampling rate, as opposed to the 24kHz of the official model. 
+ +## Usage +Make sure the Vocos library is installed: + +```bash +pip install vocos +``` + +then, load the model as usual: + +```python +from vocos import Vocos +vocos = Vocos.from_pretrained("kittn/vocos-mel-48khz-alpha1") +``` + +For more detailed examples, see [github.com/charactr-platform/vocos#usage](https://github.com/charactr-platform/vocos#usage) + +## Evals +TODO + +## Training details +TODO + +## What is Vocos? + +Here's a summary from the official repo [[link](https://github.com/charactr-platform/vocos)]: + +> Vocos is a fast neural vocoder designed to synthesize audio waveforms from acoustic features. Trained using a Generative Adversarial Network (GAN) objective, Vocos can generate waveforms in a single forward pass. Unlike other typical GAN-based vocoders, Vocos does not model audio samples in the time domain. Instead, it generates spectral coefficients, facilitating rapid audio reconstruction through inverse Fourier transform. + +For more details and other variants, check out the repo link above. 
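For this 48kHz variant (dim=1024, intermediate_dim=2048, 8 layers, 128 mels, n_fft=2048), the total parameter count can be reproduced by hand. This sketch assumes the standard Vocos ConvNeXt block layout (depthwise 7-tap conv, LayerNorm, two pointwise linears, plus a learnable layer-scale vector) and an ISTFT head that projects to magnitude-plus-phase bins:

```python
def convnext_block_params(dim: int, inter: int) -> int:
    dwconv = dim * 7 + dim       # depthwise 7-tap conv + bias
    norm = 2 * dim               # LayerNorm weight + bias
    pw1 = dim * inter + inter    # pointwise Linear: dim -> inter
    pw2 = inter * dim + dim      # pointwise Linear: inter -> dim
    gamma = dim                  # learnable layer-scale vector
    return dwconv + norm + pw1 + pw2 + gamma

dim, inter, n_mels, n_fft, layers = 1024, 2048, 128, 2048, 8
embed = n_mels * dim * 7 + dim           # input Conv1d, kernel size 7
norms = 2 * (2 * dim)                    # pre/post-backbone LayerNorms
head = dim * (n_fft + 2) + (n_fft + 2)   # ISTFTHead Linear -> n_fft + 2 outputs
total = embed + norms + layers * convnext_block_params(dim, inter) + head
print(total)  # 36692994
```

Each ConvNeXt block works out to 4,208,640 parameters, and the grand total to 36,692,994.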
+ +## Model summary +```bash +================================================================= +Layer (type:depth-idx) Param # +================================================================= +Vocos -- +├─MelSpectrogramFeatures: 1-1 -- +│ └─MelSpectrogram: 2-1 -- +│ │ └─Spectrogram: 3-1 -- +│ │ └─MelScale: 3-2 -- +├─VocosBackbone: 1-2 -- +│ └─Conv1d: 2-2 918,528 +│ └─LayerNorm: 2-3 2,048 +│ └─ModuleList: 2-4 -- +│ │ └─ConvNeXtBlock: 3-3 4,208,640 +│ │ └─ConvNeXtBlock: 3-4 4,208,640 +│ │ └─ConvNeXtBlock: 3-5 4,208,640 +│ │ └─ConvNeXtBlock: 3-6 4,208,640 +│ │ └─ConvNeXtBlock: 3-7 4,208,640 +│ │ └─ConvNeXtBlock: 3-8 4,208,640 +│ │ └─ConvNeXtBlock: 3-9 4,208,640 +│ │ └─ConvNeXtBlock: 3-10 4,208,640 +│ └─LayerNorm: 2-5 2,048 +├─ISTFTHead: 1-3 -- +│ └─Linear: 2-6 2,101,250 +│ └─ISTFT: 2-7 -- +================================================================= +Total params: 36,692,994 +Trainable params: 36,692,994 +Non-trainable params: 0 +================================================================= +``` \ No newline at end of file diff --git a/vocos-mel-48khz-alpha1/config.yaml b/vocos-mel-48khz-alpha1/config.yaml new file mode 100644 index 0000000000000000000000000000000000000000..df2dda5add387b0a3ea91f968bcc835d3d814cf1 --- /dev/null +++ b/vocos-mel-48khz-alpha1/config.yaml @@ -0,0 +1,40 @@ +backbone: + class_path: vocos.models.VocosBackbone + init_args: + adanorm_num_embeddings: null + dim: 1024 + input_channels: 128 + intermediate_dim: 2048 + layer_scale_init_value: null + num_layers: 8 +decay_mel_coeff: false +enable_discriminator: true +evaluate_periodicty: true +evaluate_pesq: true +evaluate_utmos: true +feature_extractor: + class_path: vocos.feature_extractors.MelSpectrogramFeatures + init_args: + hop_length: 256 + n_fft: 2048 + n_mels: 128 + padding: center + sample_rate: 48000 +generator_period: 3 +grad_acc: 1 +head: + class_path: vocos.heads.ISTFTHead + init_args: + dim: 1024 + hop_length: 256 + n_fft: 2048 + padding: center +initial_learning_rate: 
0.0003 +mel_loss_coeff: 15.0 +mrd_loss_coeff: 0.1 +num_warmup_steps: 500 +pretrain_decoupled_steps: 0 +pretrain_disc_steps: 500 +pretrain_mel_steps: 0 +pretrained_ckpt: null +sample_rate: 48000 diff --git a/vocos-mel-48khz-alpha1/pytorch_model.bin b/vocos-mel-48khz-alpha1/pytorch_model.bin new file mode 100644 index 0000000000000000000000000000000000000000..d195d2136dbd35e17b2bd7a7aa89d8d913596d74 --- /dev/null +++ b/vocos-mel-48khz-alpha1/pytorch_model.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:3315c87d130922dff1c4c0cfd153ac3ef037950ac0eba13f355bb38cbda46fc2 +size 147342055 diff --git a/vocos-mel-48khz-alpha1/source.txt b/vocos-mel-48khz-alpha1/source.txt new file mode 100644 index 0000000000000000000000000000000000000000..f8737ee51dd27cd3a03bfa063beaa7268ec29ecd --- /dev/null +++ b/vocos-mel-48khz-alpha1/source.txt @@ -0,0 +1 @@ +https://huggingface.co/kittn/vocos-mel-48khz-alpha1 \ No newline at end of file diff --git a/vocos-mel-hifigan-compat-44100khz/.gitattributes b/vocos-mel-hifigan-compat-44100khz/.gitattributes new file mode 100644 index 0000000000000000000000000000000000000000..780f21cd6ec1210cf8a764a1f75a08ce4e74bc4f --- /dev/null +++ b/vocos-mel-hifigan-compat-44100khz/.gitattributes @@ -0,0 +1,37 @@ +*.7z filter=lfs diff=lfs merge=lfs -text +*.arrow filter=lfs diff=lfs merge=lfs -text +*.bin filter=lfs diff=lfs merge=lfs -text +*.bz2 filter=lfs diff=lfs merge=lfs -text +*.ckpt filter=lfs diff=lfs merge=lfs -text +*.ftz filter=lfs diff=lfs merge=lfs -text +*.gz filter=lfs diff=lfs merge=lfs -text +*.h5 filter=lfs diff=lfs merge=lfs -text +*.joblib filter=lfs diff=lfs merge=lfs -text +*.lfs.* filter=lfs diff=lfs merge=lfs -text +*.mlmodel filter=lfs diff=lfs merge=lfs -text +*.model filter=lfs diff=lfs merge=lfs -text +*.msgpack filter=lfs diff=lfs merge=lfs -text +*.npy filter=lfs diff=lfs merge=lfs -text +*.npz filter=lfs diff=lfs merge=lfs -text +*.onnx filter=lfs diff=lfs merge=lfs -text +*.ot filter=lfs diff=lfs 
merge=lfs -text +*.parquet filter=lfs diff=lfs merge=lfs -text +*.pb filter=lfs diff=lfs merge=lfs -text +*.pickle filter=lfs diff=lfs merge=lfs -text +*.pkl filter=lfs diff=lfs merge=lfs -text +*.pt filter=lfs diff=lfs merge=lfs -text +*.pth filter=lfs diff=lfs merge=lfs -text +*.rar filter=lfs diff=lfs merge=lfs -text +*.safetensors filter=lfs diff=lfs merge=lfs -text +saved_model/**/* filter=lfs diff=lfs merge=lfs -text +*.tar.* filter=lfs diff=lfs merge=lfs -text +*.tar filter=lfs diff=lfs merge=lfs -text +*.tflite filter=lfs diff=lfs merge=lfs -text +*.tgz filter=lfs diff=lfs merge=lfs -text +*.wasm filter=lfs diff=lfs merge=lfs -text +*.xz filter=lfs diff=lfs merge=lfs -text +*.zip filter=lfs diff=lfs merge=lfs -text +*.zst filter=lfs diff=lfs merge=lfs -text +*tfevents* filter=lfs diff=lfs merge=lfs -text +vocos_checkpoint_epoch=209_step=3924480_val_loss=3.7036_44100_11.ckpt filter=lfs diff=lfs merge=lfs -text +pytorch_model.bin filter=lfs diff=lfs merge=lfs -text diff --git a/vocos-mel-hifigan-compat-44100khz/README.md b/vocos-mel-hifigan-compat-44100khz/README.md new file mode 100644 index 0000000000000000000000000000000000000000..f3a5fade53672b90ce838330010bf304fb6d1e0b --- /dev/null +++ b/vocos-mel-hifigan-compat-44100khz/README.md @@ -0,0 +1,97 @@ +--- +license: mit +pipeline_tag: audio-to-audio +tags: +- vocos +- hifigan +- tts +- melspectrogram +- vocoder +- mel +--- + +### Model Description + + + +**Vocos** is a fast neural vocoder designed to synthesize audio waveforms from acoustic features. +Unlike other typical GAN-based vocoders, Vocos does not model audio samples in the time domain. +Instead, it generates spectral coefficients, facilitating rapid audio reconstruction through +inverse Fourier transform. 
+ +This version of Vocos uses 80-bin mel spectrograms as acoustic features, which have been widespread +in the TTS domain since the introduction of [hifi-gan](https://github.com/jik876/hifi-gan/blob/master/meldataset.py). +The goal of this model is to provide an alternative to hifi-gan that is faster and compatible with the +acoustic output of several TTS models. + +## Intended uses and limitations + +The model is intended to serve as a vocoder that synthesizes audio waveforms from mel spectrograms. It is trained to generate speech; if used in other audio +domains, it may not produce high-quality samples. + +### Installation + +To use Vocos only in inference mode, install it using: + +```bash +pip install git+https://github.com/langtech-bsc/vocos.git@matcha +``` + +### Reconstruct audio from mel-spectrogram + +```python +import torch + +from vocos import Vocos + +vocos = Vocos.from_pretrained("patriotyk/vocos-mel-hifigan-compat-44100khz") + +mel = torch.randn(1, 80, 256) # B, C, T +audio = vocos.decode(mel) +``` + +### Training Data + +The model was trained on a private dataset of 800+ hours of Ukrainian audiobooks, prepared with the [narizaka](https://github.com/patriotyk/narizaka) tool. + +### Training Procedure + +The model was trained for 2.0M steps (210 epochs) with a batch size of 20, using a cosine scheduler with an initial learning rate of 3e-4. +Training ran on two RTX 3090 GPUs and took about one month of continuous training. 
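Using the hop length and clip size from the training config (hop_length=512, num_samples=32768, 44.1kHz audio), the per-step numbers above can be sanity-checked with a little arithmetic:

```python
sr, hop, num_samples, batch = 44100, 512, 32768, 20

clip_sec = num_samples / sr        # length of one training clip in seconds
frames = num_samples // hop        # mel frames per training clip
audio_per_step = batch * clip_sec  # seconds of audio per optimizer step
print(clip_sec, frames, audio_per_step)  # ~0.743 s, 64 frames, ~14.86 s/step
```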
+ +#### Training Hyperparameters + +* initial_learning_rate: 3e-4 +* scheduler: cosine without warmup or restarts +* mel_loss_coeff: 45 +* mrd_loss_coeff: 1.0 +* batch_size: 20 +* num_samples: 32768 + +## Evaluation + + +Evaluation was done using the metrics from the original repo; after 210 epochs we achieve: + +* val_loss: 3.703 +* f1_score: 0.950 +* mel_loss: 0.248 +* periodicity_loss: 0.127 +* pesq_score: 3.399 +* pitch_loss: 38.26 +* utmos_score: 3.146 + + +## Citation + + +If this code contributes to your research, please cite the work: + +``` +@article{siuzdak2023vocos, + title={Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis}, + author={Siuzdak, Hubert}, + journal={arXiv preprint arXiv:2306.00814}, + year={2023} +} +``` \ No newline at end of file diff --git a/vocos-mel-hifigan-compat-44100khz/config.yaml b/vocos-mel-hifigan-compat-44100khz/config.yaml new file mode 100644 index 0000000000000000000000000000000000000000..3f520571001bb13cb14deebc8a3bcdc46ad23807 --- /dev/null +++ b/vocos-mel-hifigan-compat-44100khz/config.yaml @@ -0,0 +1,28 @@ +feature_extractor: + class_path: vocos.feature_extractors.MelSpectrogramFeatures + init_args: + sample_rate: 44100 + n_fft: 2048 + hop_length: 512 + n_mels: 80 + padding: same + f_min: 0 + f_max: 8000 + norm: "slaney" + mel_scale: "slaney" + +backbone: + class_path: vocos.models.VocosBackbone + init_args: + input_channels: 80 + dim: 512 + intermediate_dim: 1536 + num_layers: 8 + +head: + class_path: vocos.heads.ISTFTHead + init_args: + dim: 512 + n_fft: 2048 + hop_length: 512 + padding: same \ No newline at end of file diff --git a/vocos-mel-hifigan-compat-44100khz/logs/version_0/config.yaml b/vocos-mel-hifigan-compat-44100khz/logs/version_0/config.yaml new file mode 100644 index 0000000000000000000000000000000000000000..a5bd8b84b39ceb5fc500a945376f500271e76bd9 --- /dev/null +++ b/vocos-mel-hifigan-compat-44100khz/logs/version_0/config.yaml @@ -0,0 +1,151 @@ +#
pytorch_lightning==1.8.6 +seed_everything: 4444 +trainer: + logger: + class_path: pytorch_lightning.loggers.TensorBoardLogger + init_args: + save_dir: /home/patriotyk/vocos/logs + name: lightning_logs + version: null + log_graph: false + default_hp_metric: true + prefix: '' + sub_dir: null + logdir: null + comment: '' + purge_step: null + max_queue: 10 + flush_secs: 120 + filename_suffix: '' + write_to_disk: true + comet_config: + disabled: true + enable_checkpointing: true + callbacks: + - class_path: pytorch_lightning.callbacks.LearningRateMonitor + init_args: + logging_interval: null + log_momentum: false + - class_path: pytorch_lightning.callbacks.ModelSummary + init_args: + max_depth: 2 + - class_path: pytorch_lightning.callbacks.ModelCheckpoint + init_args: + dirpath: null + filename: vocos_checkpoint_{epoch}_{step}_{val_loss:.4f} + monitor: val_loss + verbose: false + save_last: true + save_top_k: 3 + save_weights_only: false + mode: min + auto_insert_metric_name: true + every_n_train_steps: null + train_time_interval: null + every_n_epochs: null + save_on_train_epoch_end: null + - class_path: vocos.helpers.GradNormCallback + default_root_dir: null + gradient_clip_val: null + gradient_clip_algorithm: null + num_nodes: 1 + num_processes: null + devices: -1 + gpus: null + auto_select_gpus: false + tpu_cores: null + ipus: null + enable_progress_bar: true + overfit_batches: 0.0 + track_grad_norm: -1 + check_val_every_n_epoch: 1 + fast_dev_run: false + accumulate_grad_batches: null + max_epochs: null + min_epochs: null + max_steps: -1 + min_steps: null + max_time: null + limit_train_batches: null + limit_val_batches: 100 + limit_test_batches: null + limit_predict_batches: null + val_check_interval: null + log_every_n_steps: 100 + accelerator: gpu + strategy: ddp + sync_batchnorm: false + precision: 32 + enable_model_summary: true + num_sanity_val_steps: 2 + resume_from_checkpoint: null + profiler: null + benchmark: null + deterministic: null + 
reload_dataloaders_every_n_epochs: 0 + auto_lr_find: false + replace_sampler_ddp: true + detect_anomaly: false + auto_scale_batch_size: false + plugins: null + amp_backend: native + amp_level: null + move_metrics_to_cpu: false + multiple_trainloader_mode: max_size_cycle + inference_mode: true +data: + class_path: vocos.dataset.VocosDataModule + init_args: + train_params: + filelist_path: /home/patriotyk/tts_corpus_44100/train_vocos.txt + sampling_rate: 44100 + num_samples: 32768 + batch_size: 20 + num_workers: 24 + val_params: + filelist_path: /home/patriotyk/tts_corpus_44100/val_vocos.txt + sampling_rate: 44100 + num_samples: 96768 + batch_size: 20 + num_workers: 24 +model: + class_path: vocos.experiment.VocosExp + init_args: + feature_extractor: + class_path: vocos.feature_extractors.MelSpectrogramFeatures + init_args: + sample_rate: 44100 + n_fft: 2048 + hop_length: 512 + n_mels: 80 + padding: same + f_min: 0 + f_max: 8000 + norm: slaney + mel_scale: slaney + backbone: + class_path: vocos.models.VocosBackbone + init_args: + input_channels: 80 + dim: 512 + intermediate_dim: 1536 + num_layers: 8 + layer_scale_init_value: null + adanorm_num_embeddings: null + head: + class_path: vocos.heads.ISTFTHead + init_args: + dim: 512 + n_fft: 2048 + hop_length: 512 + padding: same + sample_rate: 44100 + initial_learning_rate: 0.0003 + num_warmup_steps: 0 + mel_loss_coeff: 45.0 + mrd_loss_coeff: 1.0 + pretrain_mel_steps: 0 + decay_mel_coeff: false + evaluate_utmos: true + evaluate_pesq: true + evaluate_periodicty: true diff --git a/vocos-mel-hifigan-compat-44100khz/logs/version_0/events.out.tfevents.1713993466.gpuserver b/vocos-mel-hifigan-compat-44100khz/logs/version_0/events.out.tfevents.1713993466.gpuserver new file mode 100644 index 0000000000000000000000000000000000000000..0ddf4586b3d59120cd3a06313118aba42d014767 --- /dev/null +++ b/vocos-mel-hifigan-compat-44100khz/logs/version_0/events.out.tfevents.1713993466.gpuserver @@ -0,0 +1,3 @@ +version 
https://git-lfs.github.com/spec/v1 +oid sha256:776d06fc99b9d864dedb323dda38c32b09759b7cb04e488437269fe68a9919db +size 303299046 diff --git a/vocos-mel-hifigan-compat-44100khz/logs/version_0/hparams.yaml b/vocos-mel-hifigan-compat-44100khz/logs/version_0/hparams.yaml new file mode 100644 index 0000000000000000000000000000000000000000..b85d336116890c4ff12fc76ccc565a135f1ff645 --- /dev/null +++ b/vocos-mel-hifigan-compat-44100khz/logs/version_0/hparams.yaml @@ -0,0 +1,10 @@ +sample_rate: 44100 +initial_learning_rate: 0.0003 +num_warmup_steps: 0 +mel_loss_coeff: 45.0 +mrd_loss_coeff: 1.0 +pretrain_mel_steps: 0 +decay_mel_coeff: false +evaluate_utmos: true +evaluate_pesq: true +evaluate_periodicty: true diff --git a/vocos-mel-hifigan-compat-44100khz/logs/version_1/config.yaml b/vocos-mel-hifigan-compat-44100khz/logs/version_1/config.yaml new file mode 100644 index 0000000000000000000000000000000000000000..388e7e4a4a7ea66647e6151e6074d494e69b58bb --- /dev/null +++ b/vocos-mel-hifigan-compat-44100khz/logs/version_1/config.yaml @@ -0,0 +1,151 @@ +# pytorch_lightning==1.8.6 +seed_everything: 4444 +trainer: + logger: + class_path: pytorch_lightning.loggers.TensorBoardLogger + init_args: + save_dir: /home/patriotyk/vocos/logs + name: lightning_logs + version: null + log_graph: false + default_hp_metric: true + prefix: '' + sub_dir: null + logdir: null + comment: '' + purge_step: null + max_queue: 10 + flush_secs: 120 + filename_suffix: '' + write_to_disk: true + comet_config: + disabled: true + enable_checkpointing: true + callbacks: + - class_path: pytorch_lightning.callbacks.LearningRateMonitor + init_args: + logging_interval: null + log_momentum: false + - class_path: pytorch_lightning.callbacks.ModelSummary + init_args: + max_depth: 2 + - class_path: pytorch_lightning.callbacks.ModelCheckpoint + init_args: + dirpath: null + filename: vocos_checkpoint_{epoch}_{step}_{val_loss:.4f} + monitor: val_loss + verbose: false + save_last: true + save_top_k: 3 + save_weights_only: 
false + mode: min + auto_insert_metric_name: true + every_n_train_steps: null + train_time_interval: null + every_n_epochs: null + save_on_train_epoch_end: null + - class_path: vocos.helpers.GradNormCallback + default_root_dir: null + gradient_clip_val: null + gradient_clip_algorithm: null + num_nodes: 1 + num_processes: null + devices: -1 + gpus: null + auto_select_gpus: false + tpu_cores: null + ipus: null + enable_progress_bar: true + overfit_batches: 0.0 + track_grad_norm: -1 + check_val_every_n_epoch: 1 + fast_dev_run: false + accumulate_grad_batches: null + max_epochs: null + min_epochs: null + max_steps: 4000000 + min_steps: null + max_time: null + limit_train_batches: null + limit_val_batches: 100 + limit_test_batches: null + limit_predict_batches: null + val_check_interval: null + log_every_n_steps: 100 + accelerator: gpu + strategy: ddp + sync_batchnorm: false + precision: 32 + enable_model_summary: true + num_sanity_val_steps: 2 + resume_from_checkpoint: ../vocos/logs/lightning_logs/version_10/checkpoints/last.ckpt + profiler: null + benchmark: null + deterministic: null + reload_dataloaders_every_n_epochs: 0 + auto_lr_find: false + replace_sampler_ddp: true + detect_anomaly: false + auto_scale_batch_size: false + plugins: null + amp_backend: native + amp_level: null + move_metrics_to_cpu: false + multiple_trainloader_mode: max_size_cycle + inference_mode: true +data: + class_path: vocos.dataset.VocosDataModule + init_args: + train_params: + filelist_path: /home/patriotyk/tts_corpus_44100/train_vocos.txt + sampling_rate: 44100 + num_samples: 32768 + batch_size: 20 + num_workers: 24 + val_params: + filelist_path: /home/patriotyk/tts_corpus_44100/val_vocos.txt + sampling_rate: 44100 + num_samples: 96768 + batch_size: 20 + num_workers: 24 +model: + class_path: vocos.experiment.VocosExp + init_args: + feature_extractor: + class_path: vocos.feature_extractors.MelSpectrogramFeatures + init_args: + sample_rate: 44100 + n_fft: 2048 + hop_length: 512 + n_mels: 80 
+ padding: same + f_min: 0 + f_max: 8000 + norm: slaney + mel_scale: slaney + backbone: + class_path: vocos.models.VocosBackbone + init_args: + input_channels: 80 + dim: 512 + intermediate_dim: 1536 + num_layers: 8 + layer_scale_init_value: null + adanorm_num_embeddings: null + head: + class_path: vocos.heads.ISTFTHead + init_args: + dim: 512 + n_fft: 2048 + hop_length: 512 + padding: same + sample_rate: 44100 + initial_learning_rate: 0.0003 + num_warmup_steps: 0 + mel_loss_coeff: 45.0 + mrd_loss_coeff: 1.0 + pretrain_mel_steps: 0 + decay_mel_coeff: false + evaluate_utmos: true + evaluate_pesq: true + evaluate_periodicty: true diff --git a/vocos-mel-hifigan-compat-44100khz/logs/version_1/events.out.tfevents.1714716087.gpuserver b/vocos-mel-hifigan-compat-44100khz/logs/version_1/events.out.tfevents.1714716087.gpuserver new file mode 100644 index 0000000000000000000000000000000000000000..b7940bacc0f560a0f011e165f979df7f879b1f0b --- /dev/null +++ b/vocos-mel-hifigan-compat-44100khz/logs/version_1/events.out.tfevents.1714716087.gpuserver @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:bcea4380f7467f3dfc0ddc86f9898f91cc3eacc223c836b57f06ec3e7797469c +size 285867873 diff --git a/vocos-mel-hifigan-compat-44100khz/logs/version_1/hparams.yaml b/vocos-mel-hifigan-compat-44100khz/logs/version_1/hparams.yaml new file mode 100644 index 0000000000000000000000000000000000000000..b85d336116890c4ff12fc76ccc565a135f1ff645 --- /dev/null +++ b/vocos-mel-hifigan-compat-44100khz/logs/version_1/hparams.yaml @@ -0,0 +1,10 @@ +sample_rate: 44100 +initial_learning_rate: 0.0003 +num_warmup_steps: 0 +mel_loss_coeff: 45.0 +mrd_loss_coeff: 1.0 +pretrain_mel_steps: 0 +decay_mel_coeff: false +evaluate_utmos: true +evaluate_pesq: true +evaluate_periodicty: true diff --git a/vocos-mel-hifigan-compat-44100khz/pytorch_model.bin b/vocos-mel-hifigan-compat-44100khz/pytorch_model.bin new file mode 100644 index 
0000000000000000000000000000000000000000..12bd833cfe67d1a28d218cc08b082ad3b820268c --- /dev/null +++ b/vocos-mel-hifigan-compat-44100khz/pytorch_model.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:310df58f31cd77e56c494df8389f7ebd57bf7b594c585d9221eb9fe888785572 +size 56324327 diff --git a/vocos-mel-hifigan-compat-44100khz/source.txt b/vocos-mel-hifigan-compat-44100khz/source.txt new file mode 100644 index 0000000000000000000000000000000000000000..5ab461935c976500bce436bc4f55125340295dd5 --- /dev/null +++ b/vocos-mel-hifigan-compat-44100khz/source.txt @@ -0,0 +1 @@ +https://huggingface.co/patriotyk/vocos-mel-hifigan-compat-44100khz \ No newline at end of file diff --git a/vocos-mel-hifigan-compat-44100khz/vocos_checkpoint_epoch=209_step=3924480_val_loss=3.7036_44100_11.ckpt b/vocos-mel-hifigan-compat-44100khz/vocos_checkpoint_epoch=209_step=3924480_val_loss=3.7036_44100_11.ckpt new file mode 100644 index 0000000000000000000000000000000000000000..666e312efddea369de6be57b0071aa0511c0982f --- /dev/null +++ b/vocos-mel-hifigan-compat-44100khz/vocos_checkpoint_epoch=209_step=3924480_val_loss=3.7036_44100_11.ckpt @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:40b79831e894e4da55c435be9edbbfe9e1630b90b315abc3d1fb05d1fd74c0ed +size 679109423 diff --git a/vocos-music-mel-24khz-beta/.gitattributes b/vocos-music-mel-24khz-beta/.gitattributes new file mode 100644 index 0000000000000000000000000000000000000000..a6344aac8c09253b3b630fb776ae94478aa0275b --- /dev/null +++ b/vocos-music-mel-24khz-beta/.gitattributes @@ -0,0 +1,35 @@ +*.7z filter=lfs diff=lfs merge=lfs -text +*.arrow filter=lfs diff=lfs merge=lfs -text +*.bin filter=lfs diff=lfs merge=lfs -text +*.bz2 filter=lfs diff=lfs merge=lfs -text +*.ckpt filter=lfs diff=lfs merge=lfs -text +*.ftz filter=lfs diff=lfs merge=lfs -text +*.gz filter=lfs diff=lfs merge=lfs -text +*.h5 filter=lfs diff=lfs merge=lfs -text +*.joblib filter=lfs diff=lfs merge=lfs -text +*.lfs.* 
filter=lfs diff=lfs merge=lfs -text +*.mlmodel filter=lfs diff=lfs merge=lfs -text +*.model filter=lfs diff=lfs merge=lfs -text +*.msgpack filter=lfs diff=lfs merge=lfs -text +*.npy filter=lfs diff=lfs merge=lfs -text +*.npz filter=lfs diff=lfs merge=lfs -text +*.onnx filter=lfs diff=lfs merge=lfs -text +*.ot filter=lfs diff=lfs merge=lfs -text +*.parquet filter=lfs diff=lfs merge=lfs -text +*.pb filter=lfs diff=lfs merge=lfs -text +*.pickle filter=lfs diff=lfs merge=lfs -text +*.pkl filter=lfs diff=lfs merge=lfs -text +*.pt filter=lfs diff=lfs merge=lfs -text +*.pth filter=lfs diff=lfs merge=lfs -text +*.rar filter=lfs diff=lfs merge=lfs -text +*.safetensors filter=lfs diff=lfs merge=lfs -text +saved_model/**/* filter=lfs diff=lfs merge=lfs -text +*.tar.* filter=lfs diff=lfs merge=lfs -text +*.tar filter=lfs diff=lfs merge=lfs -text +*.tflite filter=lfs diff=lfs merge=lfs -text +*.tgz filter=lfs diff=lfs merge=lfs -text +*.wasm filter=lfs diff=lfs merge=lfs -text +*.xz filter=lfs diff=lfs merge=lfs -text +*.zip filter=lfs diff=lfs merge=lfs -text +*.zst filter=lfs diff=lfs merge=lfs -text +*tfevents* filter=lfs diff=lfs merge=lfs -text diff --git a/vocos-music-mel-24khz-beta/README.md b/vocos-music-mel-24khz-beta/README.md new file mode 100644 index 0000000000000000000000000000000000000000..071624c14b72c249f0b15e5a01db97c0ace4ec8f --- /dev/null +++ b/vocos-music-mel-24khz-beta/README.md @@ -0,0 +1,30 @@ +--- +license: cc-by-4.0 +language: +- en +library_name: vocos +tags: +- music +--- + +> Warning: This model is in beta and not ready for production use. It is undertrained and sounds terrible. A new version will be released soon! + +# Vocos Music 24kHz (beta) + +A model similar to [Vocos Mel 24kHz](https://huggingface.co/charactr/vocos-mel-24khz) but trained on music. + +Because this model was trained on music, it may not perform well on speech. + +It may be used as a drop-in replacement for Vocos in tasks that require music.
+ +## Limitations + +- This model is trained on music, so it may not perform well on speech or other audio tasks. + +## License + +This model is licensed under the [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/). + +--- + +[OpenMusic](https://huggingface.co/openmusic) \ No newline at end of file diff --git a/vocos-music-mel-24khz-beta/config.yaml b/vocos-music-mel-24khz-beta/config.yaml new file mode 100644 index 0000000000000000000000000000000000000000..12d59e7d0af9ccfd5deb4ec01b4db3855f3d7314 --- /dev/null +++ b/vocos-music-mel-24khz-beta/config.yaml @@ -0,0 +1,24 @@ +feature_extractor: + class_path: vocos.feature_extractors.MelSpectrogramFeatures + init_args: + sample_rate: 24000 + n_fft: 1024 + hop_length: 256 + n_mels: 100 + padding: center + +backbone: + class_path: vocos.models.VocosBackbone + init_args: + input_channels: 100 + dim: 512 + intermediate_dim: 1536 + num_layers: 8 + +head: + class_path: vocos.heads.ISTFTHead + init_args: + dim: 512 + n_fft: 1024 + hop_length: 256 + padding: center \ No newline at end of file diff --git a/vocos-music-mel-24khz-beta/pytorch_model.bin b/vocos-music-mel-24khz-beta/pytorch_model.bin new file mode 100644 index 0000000000000000000000000000000000000000..11e2db464f5a6b6fbceca197004007e45d73bf48 --- /dev/null +++ b/vocos-music-mel-24khz-beta/pytorch_model.bin @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:dfef95e8fbce9d56a282e5321205934d3dc1d9f2980f35e7fff19c59951c50bd +size 54368871 diff --git a/vocos-music-mel-24khz-beta/source.txt b/vocos-music-mel-24khz-beta/source.txt new file mode 100644 index 0000000000000000000000000000000000000000..32048e60d7687c0a41ab57aad7f72d455d930c6b --- /dev/null +++ b/vocos-music-mel-24khz-beta/source.txt @@ -0,0 +1 @@ +https://huggingface.co/openmusic/vocos-music-mel-24khz-beta \ No newline at end of file
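As a sanity check on the 44.1 kHz trainer config in this diff, the STFT settings (`n_fft: 2048`, `hop_length: 512`) together with the training segment length (`num_samples: 32768`) fix the number of mel frames per segment and the time resolution of the vocoder. This is a minimal pure-Python sketch of that arithmetic; the constants are copied from the config above, and the ceil-division frame count assumes `padding: same` yields one frame per hop.

```python
# Constants copied from the 44.1 kHz config in the diff above.
SAMPLE_RATE = 44100
N_FFT = 2048
HOP_LENGTH = 512
TRAIN_NUM_SAMPLES = 32768  # num_samples in train_params

def frames_for(num_samples: int, hop_length: int) -> int:
    """Frame count under 'same'-style padding: one frame per hop (ceil division)."""
    return -(-num_samples // hop_length)

frames = frames_for(TRAIN_NUM_SAMPLES, HOP_LENGTH)
hop_ms = 1000 * HOP_LENGTH / SAMPLE_RATE       # time advance per mel frame
window_ms = 1000 * N_FFT / SAMPLE_RATE         # analysis window span

print(frames)               # 64 frames per 32768-sample training segment
print(round(hop_ms, 1))     # 11.6 ms hop
print(round(window_ms, 1))  # 46.4 ms window
```

The same arithmetic applied to the 24 kHz music config (`n_fft: 1024`, `hop_length: 256`) gives a finer ~10.7 ms hop at the lower sampling rate, which is why the two checkpoints are not interchangeable despite sharing a backbone architecture.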
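Most binary payloads in this diff (`*.bin`, `*.ckpt`, the PDF) are stored as Git LFS pointer files — three `key value` lines giving the spec version, a sha256 oid, and the blob size — rather than the blobs themselves, per the `.gitattributes` rules added here. A small stdlib-only sketch of parsing such a pointer; `parse_lfs_pointer` is a hypothetical helper, and the sample text is copied from the `vocos-mel-hifigan-compat-44100khz/pytorch_model.bin` entry above.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Split the 'key value' lines of a Git LFS pointer; 'size' is cast to int."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = int(value) if key == "size" else value
    return fields

# Pointer content taken verbatim from the diff above.
pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:310df58f31cd77e56c494df8389f7ebd57bf7b594c585d9221eb9fe888785572
size 56324327
"""

info = parse_lfs_pointer(pointer)
print(info["size"])                 # 56324327 (bytes, ~54 MiB checkpoint)
print(info["oid"].split(":")[0])    # sha256
```

This is why the repository stays small to clone without LFS: only these pointer stubs are versioned in git, and `git lfs pull` resolves the oids to the actual model weights.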