## Model Architecture

**Architecture Type:** Random Forest Classifier

**Network Architecture:** Random Forest Classifier (scikit-learn)

**Number of Model Parameters:** Not Applicable

## Input(s):

**Input Type(s):** Audio

**Input Format(s):** PCM F32

**Input Parameters:** One-Dimensional (1D)

**Other Properties Related to Input:** Pulse Code Modulation (PCM) audio samples with no encoding or pre-processing; a 16 kHz or 48 kHz sampling rate is required.
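Raw PCM F32 carries no header or container, so an input clip is simply a one-dimensional array of float32 samples. A minimal sketch of producing and decoding such input (the tone frequency and 16 kHz rate here are illustrative choices, not requirements beyond the constraint above):

```python
import numpy as np

# One second of a 440 Hz tone at 16 kHz, as float32 samples in [-1, 1].
sr = 16_000
t = np.arange(sr, dtype=np.float32) / sr
samples = (0.5 * np.sin(2 * np.pi * 440.0 * t)).astype(np.float32)

# Raw PCM F32 is just the samples' bytes -- no header or container.
raw = samples.tobytes()
pcm = np.frombuffer(raw, dtype=np.float32)  # decode: 1-D float32 array

print(pcm.ndim, str(pcm.dtype), pcm.shape[0])  # 1 float32 16000
```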
## Output(s):

**Output Type(s):** Integer

**Output Format:** Integer (1 or 0)

**Output Parameters:** One-Dimensional (1D)

**Other Properties Related to Output:** Integer label where 1 indicates full-band (high-fidelity) and 0 indicates narrow-band (low-fidelity) audio.

This model uses scikit-learn's Random Forest Classifier and runs entirely on the CPU. It does not require GPU hardware or CUDA libraries for training or inference.
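As a concrete illustration of the input/output contract above, the sketch below trains a scikit-learn Random Forest on a single hypothetical feature (the fraction of spectral energy above 4 kHz) computed from synthetic full-band and low-passed clips, then emits the 1/0 label. The feature, synthetic data, and hyperparameters are illustrative assumptions, not the actual Curator training pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

SR = 16_000  # assumed sampling rate (the card allows 16 kHz or 48 kHz)

def high_band_energy_ratio(pcm: np.ndarray, sr: int = SR) -> float:
    """Fraction of magnitude-spectrum energy above 4 kHz (toy feature)."""
    spectrum = np.abs(np.fft.rfft(pcm.astype(np.float32)))
    freqs = np.fft.rfftfreq(len(pcm), d=1.0 / sr)
    return float(spectrum[freqs > 4000].sum() / (spectrum.sum() + 1e-12))

rng = np.random.default_rng(0)

def make_clip(full_band: bool, n: int = SR) -> np.ndarray:
    noise = rng.standard_normal(n)
    if full_band:
        return noise  # white noise: energy across the whole band
    # Crude low-pass via moving average -> a "narrow-band" clip.
    return np.convolve(noise, np.ones(16) / 16, mode="same")

labels = [1] * 50 + [0] * 50                 # 1 = full-band, 0 = narrow-band
X = np.array([[high_band_energy_ratio(make_clip(lbl == 1))] for lbl in labels])
y = np.array(labels)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
pred = clf.predict([[high_band_energy_ratio(make_clip(True))]])
print(int(pred[0]))  # 1 -> classified as full-band
```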
* Less than 10,000 Hours

**Dataset partition:** Training [80%], Testing [10%], Validation [10%]
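One way to realize the 80/10/10 partition above with scikit-learn, sketched on placeholder data (the real features and labels are not part of this card): carve off 20% of the data, then split that holdout evenly into testing and validation sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 4))   # placeholder feature matrix
y = rng.integers(0, 2, size=1000)    # placeholder 0/1 bandwidth labels

# 80% train, 20% holdout; then split the holdout 50/50 into test and validation.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_test, X_val, y_test, y_val = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=0)

print(len(X_train), len(X_test), len(X_val))  # 800 100 100
```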
NVIDIA models are trained on a diverse set of public and proprietary datasets. The NeMo Curator Speech Bandwidth Filter model is tested on a diverse collection of speech datasets.

**Data Collection Method by dataset:** [Hybrid: Human, Synthetic]

**Labeling Method by dataset:** [Hybrid: Human, Synthetic]

**Link:** [DAPS](https://ccrma.stanford.edu/~gautham/Site/daps.html)

**Properties:** The DAPS dataset has 15 versions of audio (3 professional versions and 12 consumer device/real-world environment combinations). Each version consists of about 4.5 hours of data (about 14 minutes from each of 20 speakers).

**Link:** [LibriTTS](https://www.openslr.org/60/)

**Properties:** LibriTTS is a multi-speaker English corpus of approximately 585 hours of read English speech, resampled at 16 kHz.

**Link:** [VCTK](https://datashare.ed.ac.uk/handle/10283/3443)

**Properties:** The CSTR VCTK Corpus includes speech data uttered by 110 English speakers with various accents. Each speaker reads about 400 sentences selected from a newspaper, the rainbow passage, and an elicitation paragraph used for the speech accent archive.

**Link:** [HiFi-TTS](https://www.openslr.org/109/)

**Properties:** A multi-speaker English dataset for training text-to-speech models. The HiFi-TTS dataset contains about 291.6 hours of speech from 10 speakers, with at least 17 hours per speaker, sampled at 44.1 kHz.

**Link:** [DNS Challenge 5](https://github.com/microsoft/DNS-Challenge/tree/2db96d5f75257df764a6ef66513b4b97bc707f30)

**Properties:** A collated dataset of clean speech, noise, and impulse responses provided by Microsoft for the ICASSP 2023 Deep Noise Suppression Challenge.

**Link:** [OpenSLR 32 - High quality TTS data for four South African languages](https://www.openslr.org/32)

**Properties:** Multi-speaker TTS data for four South African languages: Afrikaans, Sesotho, Setswana, and isiXhosa.

# Testing Datasets

**Data Collection Method by dataset:** [Hybrid: Human, Synthetic]

**Labeling Method by dataset:** [Hybrid: Human, Synthetic]

**Properties:** The NeMo Curator Speech Bandwidth Filter model is tested on a diverse collection of speech datasets. Test data is taken by sampling 10% of the training datasets listed above; the modality and data type are the same as those of the training data.

# Evaluation Datasets

**Data Collection Method by dataset:** [Hybrid: Human, Synthetic]

**Labeling Method by dataset:** [Hybrid: Human, Synthetic]

**Properties:** The NeMo Curator Speech Bandwidth Filter model is evaluated on a diverse collection of speech datasets. Evaluation data is taken by sampling 10% of the training datasets listed above; the modality and data type are the same as those of the training data.

# Inference