nvidia/audio-codec-22khz

Feature Extraction

rlangman committed on Dec 4, 2024

Commit 0eaf726 (verified) · 1 Parent(s): c40f87f

Update README.md

Files changed (1)

README.md +4 -6

@@ -18,14 +18,12 @@ padding: 0;
 | [![Model size](https://img.shields.io/badge/Params-61.8M-lightgrey#model-badge)](#model-architecture)
 | [![Language](https://img.shields.io/badge/Language-multilingual-lightgrey#model-badge)](#datasets)
-The NeMo Audio Codec is a neural audio codec which compresses audio into a quantized representation. This model can be used as a vocoder for speech synthesis.
 ## Model Architecture
-The NeMo Audio Codec model uses symmetric convolutional encoder-decoder architecture based on [HiFi-GAN](https://arxiv.org/abs/2010.05646).
-For the vector quantization, we use [Finite Scalar Quantization (FSQ)](https://arxiv.org/abs/2309.15505), with eight codebooks, and 1000 entries per codebook.
-For more details please check [our paper](https://arxiv.org/abs/2406.05298).
 ### Input
   - **Input Type:** Audio
@@ -80,8 +78,8 @@ sf.write(path_to_output_audio, output_audio, nemo_codec_model.sample_rate)
 ```
 ### Training
-For fine-tuning on another dataset please follow the steps available at our [Audio Codec Training Tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/Audio_Codec_Training.ipynb). Note that you will need to set the ```CONFIG_FILENAME``` parameter to the "audio_codec_22050.yaml" config. You also will need to set ```pretrained_model_name``` to "audio_codec_22khz".
 ## Training, Testing, and Evaluation Datasets:

 | [![Model size](https://img.shields.io/badge/Params-61.8M-lightgrey#model-badge)](#model-architecture)
 | [![Language](https://img.shields.io/badge/Language-multilingual-lightgrey#model-badge)](#datasets)
+The NeMo Audio Codec is a neural audio codec which compresses audio into a quantized representation. The model can be used as a vocoder for speech synthesis. The model works with full-bandwidth 22.05kHz speech. It might have lower performance with low-bandwidth speech (e.g. 16kHz speech upsampled to 22.05kHz) or with non-speech audio.
 ## Model Architecture
+The NeMo Audio Codec model uses symmetric convolutional encoder-decoder architecture based on [HiFi-GAN](https://arxiv.org/abs/2010.05646). We use [Finite Scalar Quantization (FSQ)](https://arxiv.org/abs/2309.15505), with eight codebooks, 1000 entries per codebook, 86.1 frames per second, and a 6.9kbps bitrate.
+For more details please refer to [our paper](https://arxiv.org/abs/2406.05298).
 ### Input
   - **Input Type:** Audio
 ```
 ### Training
+For fine-tuning on another dataset please follow the steps available at our [Audio Codec Training Tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/Audio_Codec_Training.ipynb). Note that you will need to set the ```CONFIG_FILENAME``` parameter to the "audio_codec_22050.yaml" config. You also will need to set ```pretrained_model_name``` to "audio-codec-22khz".
 ## Training, Testing, and Evaluation Datasets:
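
The figures added to the Model Architecture text (eight codebooks, 1000 entries per codebook, 86.1 frames per second, 6.9kbps) can be cross-checked with a short calculation. The sketch below is not taken from the README or the paper; it assumes each codebook index costs `log2(1000)` bits with no further entropy coding, and the function and constant names are hypothetical.

```python
import math

# Values quoted in the updated README text.
NUM_CODEBOOKS = 8
ENTRIES_PER_CODEBOOK = 1000
FRAMES_PER_SECOND = 86.1


def codec_bitrate_bps(num_codebooks, entries_per_codebook, frames_per_second):
    """Estimate the bitrate in bits per second.

    Assumption: every codebook index is stored with log2(entries) bits
    and no additional entropy coding is applied.
    """
    bits_per_frame = num_codebooks * math.log2(entries_per_codebook)
    return bits_per_frame * frames_per_second


bitrate_bps = codec_bitrate_bps(
    NUM_CODEBOOKS, ENTRIES_PER_CODEBOOK, FRAMES_PER_SECOND
)
print(f"{bitrate_bps / 1000:.1f} kbps")  # prints "6.9 kbps"
```

Under that assumption, each frame carries roughly 79.7 bits, which at 86.1 frames per second gives about 6.86 kbps, consistent with the 6.9kbps stated in the commit.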