Update README.md
README.md
CHANGED
|
@@ -5,7 +5,7 @@ license_link: https://developer.nvidia.com/downloads/license/nsclv1
|
|
| 5 |
---
|
| 6 |
|
| 7 |
|
| 8 |
-
# NVIDIA NeMo Audio Codec
|
| 9 |
<style>
|
| 10 |
img {
|
| 11 |
display: inline-table;
|
|
@@ -23,7 +23,7 @@ The NeMo Audio Codec is a neural audio codec which compresses audio into a quant
|
|
| 23 |
|
| 24 |
## Model Architecture
|
| 25 |
The NeMo Audio Codec model uses symmetric convolutional encoder-decoder architecture based on [HiFi-GAN](https://arxiv.org/abs/2010.05646).
|
| 26 |
-
For the vector quantization, we use [Finite Scalar Quantization (FSQ)](https://arxiv.org/abs/2309.15505) with eight codebooks and 1000
|
| 27 |
|
| 28 |
For more details please check [our paper](https://arxiv.org/abs/2406.05298).
|
| 29 |
|
|
@@ -100,12 +100,12 @@ The NeMo Audio Codec is trained on a total of 28.7k hrs of speech data from 105
|
|
| 100 |
|
| 101 |
## Performance
|
| 102 |
|
| 103 |
-
We evaluate our codec using several objective audio quality metrics. We evaluate [
|
| 104 |
|
| 105 |
| Dataset | ViSQOL |PESQ |ESTOI |Mel Distance |STFT Distance|SI-SDR|
|
| 106 |
|:-----------:|:----------:|:----------:|:----------:|:-----------:|:-----------:|:-----------:|
|
| 107 |
-
| MLS English | 4.50 | 3.69 | 0.94 | 0.066 | 0.033 | 8.33
|
| 108 |
-
| CommonVoice | 4.53 | 3.55 | 0.93 | 0.100 | 0.057 | 7.63
|
| 109 |
|
| 110 |
## Software Integration
|
| 111 |
|
|
|
|
| 5 |
---
|
| 6 |
|
| 7 |
|
| 8 |
+
# NVIDIA NeMo Audio Codec 22khz
|
| 9 |
<style>
|
| 10 |
img {
|
| 11 |
display: inline-table;
|
|
|
|
| 23 |
|
| 24 |
## Model Architecture
|
| 25 |
The NeMo Audio Codec model uses symmetric convolutional encoder-decoder architecture based on [HiFi-GAN](https://arxiv.org/abs/2010.05646).
|
| 26 |
+
For the vector quantization, we use [Finite Scalar Quantization (FSQ)](https://arxiv.org/abs/2309.15505) with eight codebooks and 1000 entries per codebook.
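
As a rough illustration of what eight codebooks with 1000 entries each imply, the sketch below computes how many bits one frame of codes carries. The per-dimension levels `[8, 5, 5, 5]` are an assumption borrowed from the configuration the FSQ paper suggests for codebooks of roughly 2^10 entries; this README does not state them. The frame rate is left as a parameter because it is not given in this section.

```python
import math

# Assumed FSQ levels per dimension (hypothetical for this model; the FSQ paper
# suggests [8, 5, 5, 5] for codebooks of roughly 2**10 entries).
ASSUMED_LEVELS = [8, 5, 5, 5]
NUM_CODEBOOKS = 8            # stated in the README
ENTRIES_PER_CODEBOOK = 1000  # stated in the README


def codebook_size(levels):
    """Number of distinct codes an FSQ codebook with these levels can represent."""
    return math.prod(levels)


def bits_per_frame(num_codebooks, entries_per_codebook):
    """Information carried by one frame of codes, in bits."""
    return num_codebooks * math.log2(entries_per_codebook)


def bitrate_kbps(frames_per_second,
                 num_codebooks=NUM_CODEBOOKS,
                 entries_per_codebook=ENTRIES_PER_CODEBOOK):
    """Bitrate in kbps for a given (caller-supplied) frame rate."""
    return frames_per_second * bits_per_frame(num_codebooks, entries_per_codebook) / 1000.0


if __name__ == "__main__":
    print(codebook_size(ASSUMED_LEVELS))
    print(round(bits_per_frame(NUM_CODEBOOKS, ENTRIES_PER_CODEBOOK), 2))
```

Under these assumptions each frame carries just under 80 bits, so the bitrate scales linearly with whatever frame rate the encoder produces.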
|
| 27 |
|
| 28 |
For more details please check [our paper](https://arxiv.org/abs/2406.05298).
|
| 29 |
|
|
|
|
| 100 |
|
| 101 |
## Performance
|
| 102 |
|
| 103 |
+
We evaluate our codec using several objective audio quality metrics: [ViSQOL](https://github.com/google/visqol) and [PESQ](https://lightning.ai/docs/torchmetrics/stable/audio/perceptual_evaluation_speech_quality.html) for perceptual quality, [ESTOI](https://ieeexplore.ieee.org/document/7539284) for intelligibility, mel spectrogram and STFT distances for spectral reconstruction accuracy, and SI-SDR [7] for phase reconstruction accuracy. Metrics are reported on the test set for both the MLS English and CommonVoice data. The model has not been trained or evaluated on non-speech audio.
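
To make the SI-SDR column concrete, here is a minimal sketch of the standard scale-invariant SDR definition in plain Python. It is not the evaluation code behind the table below. The zero-mean step and the `eps` stabilizer are common conventions assumed here, and `si_sdr` is a hypothetical helper name.

```python
import math


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB for two equal-length sequences."""
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")
    n = len(reference)
    # Remove the mean so a DC offset does not affect the score (assumed convention).
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]
    # Project the estimate onto the reference to find the optimal scaling factor.
    alpha = sum(e * r for e, r in zip(est, ref)) / (sum(r * r for r in ref) + eps)
    target = [alpha * r for r in ref]
    # Whatever the scaled reference cannot explain counts as distortion.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(x * x for x in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


if __name__ == "__main__":
    reference = [math.sin(0.1 * i) for i in range(1000)]
    noisy = [r + 0.1 * math.cos(0.37 * i) for i, r in enumerate(reference)]
    print(si_sdr(noisy, reference))
```

Because the reference is rescaled before the comparison, multiplying the estimate by a constant gain leaves the score unchanged. The metric therefore penalizes waveform misalignment, including phase errors, rather than loudness differences.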
|
| 104 |
|
| 105 |
| Dataset | ViSQOL |PESQ |ESTOI |Mel Distance |STFT Distance|SI-SDR|
|
| 106 |
|:-----------:|:----------:|:----------:|:----------:|:-----------:|:-----------:|:-----------:|
|
| 107 |
+
| MLS English | 4.50 | 3.69 | 0.94 | 0.066 | 0.033 | 8.33 |
|
| 108 |
+
| CommonVoice | 4.53 | 3.55 | 0.93 | 0.100 | 0.057 | 7.63 |
|
| 109 |
|
| 110 |
## Software Integration
|
| 111 |
|