sarahyurick committed on
Commit e195b69 · verified · 1 Parent(s): 1d282fe

Update README.md

Files changed (1): README.md (+20 −4)
README.md CHANGED
@@ -25,21 +25,29 @@ NeMo Curator Speech Bandwidth Filter (NeMo Curator SBF) is a speech filtering mo

## Model Architecture

**Architecture Type:** Random Forest Classifier

**Network Architecture:** Random Forest Classifier (scikit-learn)

**Number of Model Parameters:** Not Applicable

## Input(s):

**Input Type(s):** Audio

**Input Format(s):** PCM F32

**Input Parameters:** One-Dimensional (1D)

**Other Properties Related to Input:** Pulse Code Modulation (PCM) audio samples with no encoding or pre-processing; 16 kHz or 48 kHz sampling rate required.
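As a concrete illustration of this input contract, the sketch below constructs a one-second, 16 kHz mono signal as a one-dimensional float32 (PCM F32) array. The synthetic tone is an illustrative assumption; real recordings would normally be decoded to this form by an audio I/O library before reaching the filter.

```python
import numpy as np

# Hypothetical example of the expected input: raw PCM F32 samples, 1-D,
# at one of the supported sampling rates (16 kHz shown here; 48 kHz also works).
sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate                   # 1 second of time stamps
audio = np.sin(2 * np.pi * 440.0 * t).astype(np.float32)   # 440 Hz test tone

print(audio.dtype, audio.ndim, audio.shape)  # float32 1 (16000,)
```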
 
## Output(s):

**Output Type(s):** Integer

**Output Format:** Integer (1 or 0)

**Output Parameters:** One-Dimensional (1D)

**Other Properties Related to Output:** Integer label where 1 indicates full-band (high fidelity) and 0 indicates narrow-band (low fidelity).

This model uses scikit-learn's Random Forest Classifier and runs entirely on CPU. It does not require GPU hardware or CUDA libraries for training or inference.
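To make the pipeline concrete, here is a minimal sketch, assuming a single hand-crafted spectral feature (the fraction of energy above 4 kHz) and synthetic noise examples. The shipped model's actual features, training data, and hyperparameters are not documented here, so the feature and data below are illustrative assumptions, not the real model.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
SR = 16_000  # sampling rate assumed for this sketch

def high_band_ratio(audio: np.ndarray, sr: int = SR) -> float:
    """Fraction of spectral energy at or above 4 kHz (an illustrative feature,
    not the shipped model's documented feature set)."""
    spectrum = np.abs(np.fft.rfft(audio)) ** 2
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sr)
    return float(spectrum[freqs >= 4_000].sum() / (spectrum.sum() + 1e-12))

def make_example(full_band: bool) -> np.ndarray:
    """White noise, optionally low-passed to mimic narrow-band audio."""
    audio = rng.standard_normal(SR).astype(np.float32)
    if not full_band:
        # Crude low-pass: zero out FFT bins at or above 4 kHz.
        spec = np.fft.rfft(audio)
        freqs = np.fft.rfftfreq(len(audio), d=1.0 / SR)
        spec[freqs >= 4_000] = 0.0
        audio = np.fft.irfft(spec, n=len(audio)).astype(np.float32)
    return audio

# Label 1 = full-band (high fidelity), 0 = narrow-band (low fidelity).
labels = rng.integers(0, 2, size=200)
features = np.array([[high_band_ratio(make_example(bool(y)))] for y in labels])

clf = RandomForestClassifier(n_estimators=50, random_state=0)  # CPU-only
clf.fit(features, labels)

pred = clf.predict([[high_band_ratio(make_example(True))]])
print(int(pred[0]))  # 1 (full-band)
```

Because a full-band signal keeps roughly half its energy above 4 kHz while a narrow-band one keeps almost none, even this one-feature forest separates the two classes; a production filter would use a richer feature set.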
@@ -73,46 +81,54 @@ Curator v26.04

* Less than 10,000 Hours

**Dataset partition:**

Training [80%], Testing [10%], Validation [10%]

NVIDIA models are trained on a diverse set of public and proprietary datasets. The NeMo Curator Speech Bandwidth Filter model is tested on a diverse collection of speech datasets.
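The 80/10/10 partition above can be sketched as a shuffled index split; the utterance IDs below are hypothetical placeholders, not the actual training manifest.

```python
import numpy as np

# Illustrative 80/10/10 split over a list of hypothetical utterance IDs.
rng = np.random.default_rng(42)
utterances = np.array([f"utt_{i:04d}" for i in range(1000)])

shuffled = rng.permutation(utterances)
n = len(shuffled)
train = shuffled[: int(0.8 * n)]                 # 80% training
test = shuffled[int(0.8 * n): int(0.9 * n)]      # 10% testing
val = shuffled[int(0.9 * n):]                    # 10% validation

print(len(train), len(test), len(val))  # 800 100 100
```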
 
**Data Collection Method by dataset:** [Hybrid: Human, Synthetic]

**Labeling Method by dataset:** [Hybrid: Human, Synthetic]

**Link:** [DAPS](https://ccrma.stanford.edu/~gautham/Site/daps.html)

**Properties:** The DAPS dataset has 15 versions of audio (3 professional versions and 12 consumer device/real-world environment combinations). Each version consists of about 4.5 hours of data (about 14 minutes from each of 20 speakers).

**Link:** [LibriTTS](https://www.openslr.org/60/)

**Properties:** LibriTTS is a multi-speaker English corpus of approximately 585 hours of read English speech, resampled at 16 kHz.

**Link:** [VCTK](https://datashare.ed.ac.uk/handle/10283/3443)

**Properties:** The CSTR VCTK Corpus includes speech data uttered by 110 English speakers with various accents. Each speaker reads about 400 sentences selected from a newspaper, the rainbow passage, and an elicitation paragraph used for the speech accent archive.

**Link:** [HiFi-TTS](https://www.openslr.org/109/)

**Properties:** A multi-speaker English dataset for training text-to-speech models. The HiFi-TTS dataset contains about 291.6 hours of speech from 10 speakers, with at least 17 hours per speaker, sampled at 44.1 kHz.

**Link:** [DNS Challenge 5](https://github.com/microsoft/DNS-Challenge/tree/2db96d5f75257df764a6ef66513b4b97bc707f30)

**Properties:** A collated dataset of clean speech, noise, and impulse responses provided by Microsoft for the ICASSP 2023 Deep Noise Suppression Challenge.

**Link:** [OpenSLR 32 - High quality TTS data for four South African languages](https://www.openslr.org/32)

**Properties:** Multi-speaker TTS data for four South African languages: Afrikaans, Sesotho, Setswana, and isiXhosa.
# Testing Datasets

**Data Collection Method by dataset:** [Hybrid: Human, Synthetic]

**Labeling Method by dataset:** [Hybrid: Human, Synthetic]

**Properties:** The NeMo Curator Speech Bandwidth Filter model is tested on a diverse collection of speech datasets. Test data is drawn by sampling 10% of the training datasets listed above; the modality and data type are the same as those of the training data.

# Evaluation Datasets

**Data Collection Method by dataset:** [Hybrid: Human, Synthetic]

**Labeling Method by dataset:** [Hybrid: Human, Synthetic]

**Properties:** The NeMo Curator Speech Bandwidth Filter model is evaluated on a diverse collection of speech datasets. Evaluation data is drawn by sampling 10% of the training datasets listed above; the modality and data type are the same as those of the training data.

# Inference
 