Update README.md
README.md
CHANGED
|
@@ -18,14 +18,12 @@ padding: 0;
|
|
| 18 |
| [](#model-architecture)
|
| 19 |
| [](#datasets)
|
| 20 |
|
| 21 |
-
The NeMo Audio Codec is a neural audio codec which compresses audio into a quantized representation.
|
| 22 |
-
|
| 23 |
|
| 24 |
## Model Architecture
|
| 25 |
-
The NeMo Audio Codec model uses symmetric convolutional encoder-decoder architecture based on [HiFi-GAN](https://arxiv.org/abs/2010.05646).
|
| 26 |
-
For the vector quantization, we use [Finite Scalar Quantization (FSQ)](https://arxiv.org/abs/2309.15505), with eight codebooks, and 1000 entries per codebook.
|
| 27 |
|
| 28 |
-
For more details please
|
| 29 |
|
| 30 |
### Input
|
| 31 |
- **Input Type:** Audio
|
|
@@ -80,8 +78,8 @@ sf.write(path_to_output_audio, output_audio, nemo_codec_model.sample_rate)
|
|
| 80 |
```
|
| 81 |
|
| 82 |
### Training
|
| 83 |
-
For fine-tuning on another dataset please follow the steps available at our [Audio Codec Training Tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/Audio_Codec_Training.ipynb). Note that you will need to set the ```CONFIG_FILENAME``` parameter to the "audio_codec_22050.yaml" config. You also will need to set ```pretrained_model_name``` to "audio_codec_22khz".
|
| 84 |
|
|
|
|
| 85 |
|
| 86 |
## Training, Testing, and Evaluation Datasets:
|
| 87 |
|
|
|
|
| 18 |
| [](#model-architecture)
|
| 19 |
| [](#datasets)
|
| 20 |
|
| 21 |
+
The NeMo Audio Codec is a neural audio codec which compresses audio into a quantized representation. The model can be used as a vocoder for speech synthesis. The model works with full-bandwidth 22.05kHz speech. It might have lower performance with low-bandwidth speech (e.g. 16kHz speech upsampled to 22.05kHz) or with non-speech audio.
|
|
|
|
| 22 |
|
| 23 |
## Model Architecture
|
| 24 |
+
The NeMo Audio Codec model uses symmetric convolutional encoder-decoder architecture based on [HiFi-GAN](https://arxiv.org/abs/2010.05646). We use [Finite Scalar Quantization (FSQ)](https://arxiv.org/abs/2309.15505), with eight codebooks, 1000 entries per codebook, 86.1 frames per second, and a 6.9kbps bitrate.
|
|
|
|
| 25 |
|
| 26 |
+
For more details please refer to [our paper](https://arxiv.org/abs/2406.05298).
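
As a quick consistency check on the figures quoted above, the bitrate can be reproduced from the frame rate and codebook configuration. This is a minimal sketch, assuming each codebook index costs `log2(1000)` bits per frame with no further entropy coding; the function name `fsq_bitrate_bps` is illustrative and is not part of NeMo.

```python
import math


def fsq_bitrate_bps(frames_per_second: float, num_codebooks: int, codebook_size: int) -> float:
    """Bits per second needed to transmit one index per codebook per frame.

    Assumption: indices are sent as plain fixed-rate symbols, so each
    index costs log2(codebook_size) bits.
    """
    bits_per_frame = num_codebooks * math.log2(codebook_size)
    return frames_per_second * bits_per_frame


# Figures quoted in the README text: 86.1 frames/s, 8 codebooks, 1000 entries each.
bitrate = fsq_bitrate_bps(frames_per_second=86.1, num_codebooks=8, codebook_size=1000)

# 22.05 kHz audio divided by the frame rate gives the samples covered by one frame.
samples_per_frame = 22050 / 86.1

print(f"bitrate: {bitrate / 1000:.2f} kbps")
print(f"samples per frame: {samples_per_frame:.1f}")
```

The result rounds to the 6.9kbps stated above, and the frame rate corresponds to roughly 256 audio samples per frame at 22.05kHz.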
|
| 27 |
|
| 28 |
### Input
|
| 29 |
- **Input Type:** Audio
|
|
|
|
| 78 |
```
|
| 79 |
|
| 80 |
### Training
|
|
|
|
| 81 |
|
| 82 |
+
For fine-tuning on another dataset please follow the steps available at our [Audio Codec Training Tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/Audio_Codec_Training.ipynb). Note that you will need to set the ```CONFIG_FILENAME``` parameter to the "audio_codec_22050.yaml" config. You also will need to set ```pretrained_model_name``` to "audio-codec-22khz".
|
| 83 |
|
| 84 |
## Training, Testing, and Evaluation Datasets:
|
| 85 |
|