nvidia/audio-codec-22khz

rlangman committed on Dec 4, 2024

Commit c40f87f (verified) · 1 parent: ef3dbab

Update README.md

README.md +5 -5

@@ -5,7 +5,7 @@ license_link: https://developer.nvidia.com/downloads/license/nsclv1
 ---
-# NVIDIA NeMo Audio Codec
+# NVIDIA NeMo Audio Codec 22khz
 <style>
 img {
 display: inline-table;
@@ -23,7 +23,7 @@ The NeMo Audio Codec is a neural audio codec which compresses audio into a quant
 ## Model Architecture
 The NeMo Audio Codec model uses symmetric convolutional encoder-decoder architecture based on [HiFi-GAN](https://arxiv.org/abs/2010.05646).
-For the vector quantization, we use [Finite Scalar Quantization (FSQ)](https://arxiv.org/abs/2309.15505) with eight codebooks and 1000 codes per codebook.
+For the vector quantization, we use [Finite Scalar Quantization (FSQ)](https://arxiv.org/abs/2309.15505), with eight codebooks, and 1000 entries per codebook.
 For more details please check [our paper](https://arxiv.org/abs/2406.05298).
@@ -100,12 +100,12 @@ The NeMo Audio Codec is trained on a total of 28.7k hrs of speech data from 105
 ## Performance
-We evaluate our codec using several objective audio quality metrics. We evaluate [ViSQO](https://github.com/google/visqol) and [PESQ](https://lightning.ai/docs/torchmetrics/stable/audio/perceptual_evaluation_speech_quality.html) for perception quality, [ESTOI](https://ieeexplore.ieee.org/document/7539284) for intelligbility, mel spectrogram and STFT distances for spectral reconstruction accuracy, and SI-SDR [7] for phase reconstruction accuracy. Metrics are reported on the test set for both the MLS English and CommonVoice data. The model has not been trained or evaluated on non-speech audio.
+We evaluate our codec using several objective audio quality metrics. We evaluate [ViSQOL](https://github.com/google/visqol) and [PESQ](https://lightning.ai/docs/torchmetrics/stable/audio/perceptual_evaluation_speech_quality.html) for perception quality, [ESTOI](https://ieeexplore.ieee.org/document/7539284) for intelligbility, mel spectrogram and STFT distances for spectral reconstruction accuracy, and SI-SDR [7] for phase reconstruction accuracy. Metrics are reported on the test set for both the MLS English and CommonVoice data. The model has not been trained or evaluated on non-speech audio.
 | Dataset     | ViSQOL     |PESQ        |ESTOI       |Mel Distance |STFT Distance|SI-SDR|
 |:-----------:|:----------:|:----------:|:----------:|:-----------:|:-----------:|:-----------:|
-| MLS English | 4.50       | 3.69       | 0.94       | 0.066       | 0.033       | 8.33       |
-| CommonVoice | 4.53       | 3.55       | 0.93       | 0.100       | 0.057       | 7.63       |
+| MLS English | 4.50       | 3.69       | 0.94       | 0.066       | 0.033       | 8.33        |
+| CommonVoice | 4.53       | 3.55       | 0.93       | 0.100       | 0.057       | 7.63        |
 ## Software Integration
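
Note on the FSQ line touched in this diff: the README states eight codebooks with 1000 entries each. A minimal sketch of how a 1000-entry FSQ codebook can arise is shown here, assuming four scalar dimensions with levels `[8, 5, 5, 5]` (the FSQ paper's suggestion for a codebook of roughly 2^10 entries; the README itself does not give the per-dimension levels). The `tanh` bounding is a simplified stand-in for the paper's bounding function, and all function names are hypothetical, not NeMo API.

```python
import math

# Assumption: per-dimension level counts; the README only says "1000 entries".
LEVELS = [8, 5, 5, 5]
NUM_CODEBOOKS = 8  # stated in the README


def codebook_size(levels):
    """Number of distinct codes: the product of the per-dimension levels."""
    return math.prod(levels)


def quantize(z, levels):
    """Bound each scalar with tanh, then round it to an integer in [0, L-1]."""
    digits = []
    for value, num_levels in zip(z, levels):
        unit = (math.tanh(value) + 1.0) / 2.0  # squashed into [0, 1]
        digits.append(round(unit * (num_levels - 1)))
    return digits


def digits_to_index(digits, levels):
    """Pack per-dimension digits into one code index (mixed-radix number)."""
    index, basis = 0, 1
    for digit, num_levels in zip(digits, levels):
        index += digit * basis
        basis *= num_levels
    return index


def index_to_digits(index, levels):
    """Inverse of digits_to_index."""
    digits = []
    for num_levels in levels:
        digits.append(index % num_levels)
        index //= num_levels
    return digits


def bits_per_frame(levels, num_codebooks):
    """Bits needed to store one code index from every codebook."""
    return num_codebooks * math.log2(codebook_size(levels))
```

Under these assumptions each codebook index lies in 0..999, and one frame of eight indices carries a little under 80 bits. The frame rate, and therefore the bitrate, is not stated in this diff.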