## Model Architecture

**Architecture Type:** Random Forest Classifier

**Network Architecture:** Random Forest Classifier (scikit-learn)

**Number of Model Parameters:** Not Applicable

## Input(s):

**Input Type(s):** Audio

**Input Format(s):** PCM F32

**Input Parameters:** One-Dimensional (1D)

**Other Properties Related to Input:** Pulse Code Modulation (PCM) audio samples with no encoding or pre-processing; a 16 kHz or 48 kHz sampling rate is required.
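Raw PCM F32 carries no header or container, so an input clip is simply a one-dimensional array of float32 samples. A minimal sketch of producing and decoding such input (the tone frequency and 16 kHz rate here are illustrative choices, not requirements beyond the constraint above):

```python
import numpy as np

# One second of a 440 Hz tone at 16 kHz, as float32 samples in [-1, 1].
sr = 16_000
t = np.arange(sr, dtype=np.float32) / sr
samples = (0.5 * np.sin(2 * np.pi * 440.0 * t)).astype(np.float32)

# Raw PCM F32 is just the samples' bytes -- no header or container.
raw = samples.tobytes()
pcm = np.frombuffer(raw, dtype=np.float32)  # decode: 1-D float32 array

print(pcm.ndim, str(pcm.dtype), pcm.shape[0])  # 1 float32 16000
```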
## Output(s):

**Output Type(s):** Integer

**Output Format:** Integer (1 or 0)

**Output Parameters:** One-Dimensional (1D)

**Other Properties Related to Output:** Integer label where 1 indicates full-band (high-fidelity) and 0 indicates narrow-band (low-fidelity) audio.

This model uses scikit-learn's Random Forest Classifier and runs entirely on the CPU. It does not require GPU hardware or CUDA libraries for training or inference.
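As a concrete illustration of the input/output contract above, the sketch below trains a scikit-learn Random Forest on a single hypothetical feature (the fraction of spectral energy above 4 kHz) computed from synthetic full-band and low-passed clips, then emits the 1/0 label. The feature, synthetic data, and hyperparameters are illustrative assumptions, not the actual Curator training pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

SR = 16_000  # assumed sampling rate (the card allows 16 kHz or 48 kHz)

def high_band_energy_ratio(pcm: np.ndarray, sr: int = SR) -> float:
    """Fraction of magnitude-spectrum energy above 4 kHz (toy feature)."""
    spectrum = np.abs(np.fft.rfft(pcm.astype(np.float32)))
    freqs = np.fft.rfftfreq(len(pcm), d=1.0 / sr)
    return float(spectrum[freqs > 4000].sum() / (spectrum.sum() + 1e-12))

rng = np.random.default_rng(0)

def make_clip(full_band: bool, n: int = SR) -> np.ndarray:
    noise = rng.standard_normal(n)
    if full_band:
        return noise  # white noise: energy across the whole band
    # Crude low-pass via moving average -> a "narrow-band" clip.
    return np.convolve(noise, np.ones(16) / 16, mode="same")

labels = [1] * 50 + [0] * 50                 # 1 = full-band, 0 = narrow-band
X = np.array([[high_band_energy_ratio(make_clip(lbl == 1))] for lbl in labels])
y = np.array(labels)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
pred = clf.predict([[high_band_energy_ratio(make_clip(True))]])
print(int(pred[0]))  # 1 -> classified as full-band
```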
* Less than 10,000 Hours

**Dataset partition:** Training [80%], Testing [10%], Validation [10%]
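One way to realize the 80/10/10 partition above with scikit-learn, sketched on placeholder data (the real features and labels are not part of this card): carve off 20% of the data, then split that holdout evenly into testing and validation sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 4))   # placeholder feature matrix
y = rng.integers(0, 2, size=1000)    # placeholder 0/1 bandwidth labels

# 80% train, 20% holdout; then split the holdout 50/50 into test and validation.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_test, X_val, y_test, y_val = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=0)

print(len(X_train), len(X_test), len(X_val))  # 800 100 100
```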
NVIDIA models are trained on a diverse set of public and proprietary datasets. The NeMo Curator Speech Bandwidth Filter model is tested on a diverse collection of speech datasets.

**Data Collection Method by dataset:** [Hybrid: Human, Synthetic]

**Labeling Method by dataset:** [Hybrid: Human, Synthetic]

**Link:** [DAPS](https://ccrma.stanford.edu/~gautham/Site/daps.html)

**Properties:** The DAPS dataset has 15 versions of audio (3 professional versions and 12 consumer device/real-world environment combinations). Each version consists of about 4.5 hours of data (about 14 minutes from each of 20 speakers).

**Link:** [LibriTTS](https://www.openslr.org/60/)

**Properties:** LibriTTS is a multi-speaker English corpus of approximately 585 hours of read English speech, resampled at 16 kHz.

**Link:** [VCTK](https://datashare.ed.ac.uk/handle/10283/3443)

**Properties:** The CSTR VCTK Corpus includes speech data uttered by 110 English speakers with various accents. Each speaker reads about 400 sentences selected from a newspaper, the rainbow passage, and an elicitation paragraph used for the speech accent archive.

**Link:** [HiFi-TTS](https://www.openslr.org/109/)

**Properties:** A multi-speaker English dataset for training text-to-speech models. The HiFi-TTS dataset contains about 291.6 hours of speech from 10 speakers, with at least 17 hours per speaker, sampled at 44.1 kHz.

**Link:** [DNS Challenge 5](https://github.com/microsoft/DNS-Challenge/tree/2db96d5f75257df764a6ef66513b4b97bc707f30)

**Properties:** A collated dataset of clean speech, noise, and impulse responses provided by Microsoft for the ICASSP 2023 Deep Noise Suppression Challenge.

**Link:** [OpenSLR 32 - High quality TTS data for four South African languages](https://www.openslr.org/32)

**Properties:** Multi-speaker TTS data for four South African languages: Afrikaans, Sesotho, Setswana, and isiXhosa.

# Testing Datasets

**Data Collection Method by dataset:** [Hybrid: Human, Synthetic]

**Labeling Method by dataset:** [Hybrid: Human, Synthetic]

**Properties:** The NeMo Curator Speech Bandwidth Filter model is tested on a diverse collection of speech datasets. Test data is taken by sampling 10% of the training datasets listed above; the modality and data type are the same as those of the training data.

# Evaluation Datasets

**Data Collection Method by dataset:** [Hybrid: Human, Synthetic]

**Labeling Method by dataset:** [Hybrid: Human, Synthetic]

**Properties:** The NeMo Curator Speech Bandwidth Filter model is evaluated on a diverse collection of speech datasets. Evaluation data is taken by sampling 10% of the training datasets listed above; the modality and data type are the same as those of the training data.

# Inference