Update README.md
README.md
CHANGED
|
@@ -1,3 +1,41 @@
|
|
| 1 |
-
---
|
| 2 |
-
license: mit
|
| 3 |
-
|
| 1 |
+
---
|
| 2 |
+
license: mit
|
| 3 |
+
license_link: https://huggingface.co/nvidia/BigVGAN/blob/main/LICENSE
|
| 4 |
+
tags:
|
| 5 |
+
- neural-vocoder
|
| 6 |
+
- audio-generation
|
| 7 |
+
library_name: PyTorch
|
| 8 |
+
pipeline_tag: audio-to-audio
|
| 9 |
+
---
|
| 10 |
+
|
| 11 |
+
## BigVGAN with different mel spectrogram input
|
| 12 |
+
These BigVGAN checkpoints come from continued training of https://huggingface.co/nvidia/bigvgan_v2_24khz_100band_256x, with the input mel spectrogram computed by the following code from [vocos](https://github.com/gemelo-ai/vocos/blob/c859e3b7b534f3776a357983029d34170ddd6fc3/vocos/feature_extractors.py#L28C1-L49C24):
|
| 13 |
+
|
| 14 |
+
```py
import torch
import torchaudio

# FeatureExtractor and safe_log are defined in the vocos package.
from vocos.feature_extractors import FeatureExtractor
from vocos.modules import safe_log


class MelSpectrogramFeatures(FeatureExtractor):
    def __init__(self, sample_rate=24000, n_fft=1024, hop_length=256, n_mels=100, padding="center"):
        super().__init__()
        if padding not in ["center", "same"]:
            raise ValueError("Padding must be 'center' or 'same'.")
        self.padding = padding
        self.mel_spec = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate,
            n_fft=n_fft,
            hop_length=hop_length,
            n_mels=n_mels,
            center=padding == "center",
            power=1,  # magnitude (not power) spectrogram
        )

    def forward(self, audio, **kwargs):
        if self.padding == "same":
            # Reflect-pad manually because the STFT is not centered in this mode.
            pad = self.mel_spec.win_length - self.mel_spec.hop_length
            audio = torch.nn.functional.pad(audio, (pad // 2, pad // 2), mode="reflect")
        mel = self.mel_spec(audio)
        features = safe_log(mel)  # log of the clipped mel magnitudes
        return features
```
|
| 38 |
+
|
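As a rough sanity check on the feature extractor above, the sketch below estimates how many mel frames each padding mode yields for a given number of audio samples. It assumes the usual STFT framing rules and that `win_length` equals `n_fft` (the torchaudio default). `num_mel_frames` is a hypothetical helper written for illustration; it is not part of vocos or BigVGAN.

```python
def num_mel_frames(num_samples, n_fft=1024, hop_length=256, padding="center"):
    """Estimate the number of STFT frames for the two padding modes.

    Hypothetical helper: assumes win_length == n_fft and standard
    STFT framing, i.e. frames = 1 + (padded_length - n_fft) // hop_length.
    """
    if padding not in ("center", "same"):
        raise ValueError("Padding must be 'center' or 'same'.")
    if padding == "center":
        # A centered STFT pads n_fft // 2 samples on each side.
        padded = num_samples + 2 * (n_fft // 2)
    else:
        # "same" mode pads (win_length - hop_length) // 2 on each side
        # and then runs a non-centered STFT.
        pad = n_fft - hop_length
        padded = num_samples + 2 * (pad // 2)
    return 1 + (padded - n_fft) // hop_length


if __name__ == "__main__":
    for mode in ("center", "same"):
        print(mode, num_mel_frames(65536, padding=mode))
```

With the defaults shown in the class above and a 65536-sample segment, "center" gives one more frame than "same", so the choice of padding mode affects whether the frame count is an exact multiple of the hop-based length.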
| 39 |
+
Training used segment_size=65536 (unchanged) and batch_size=24 (versus 32 used by the NVIDIA team). The final evaluation PESQ is 4.340, compared with 4.362 for the NVIDIA checkpoint evaluated with its own mel spectrogram code.
|
| 40 |
+
|
| 41 |
+
<center><img src="https://huggingface.co/cckm/bigvgan_melspec/resolve/main/assets/bigvgan_pesq.png" width="800"></center>
|