nvidia/low-frame-rate-speech-codec-22khz

CasanovaE committed on Nov 28, 2024

Commit c2caf64 · verified · 1 Parent(s): a9f3251

Update README.md

Files changed (1)

README.md +2 -4

@@ -61,10 +61,8 @@ The model is available for use in the NeMo toolkit [4], and can be used as a pre
 ## Training, Testing, and Evaluation Datasets:
-The Low Frame-rate Speech Codec is trained on a total of 28.7k hrs of speech data from 105 languages.
-For training our model we have used [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0)  and an English subset of MLS dataset.  The Common Voice derived training set comprises 105 languages, totaling 2.7 million utterances, and 3.2k hours
-of audio from about one-hundred thousand speakers. The [MLS English](https://www.openslr.org/94/) training dataset consists of 6.2 million utterances and 25.5k hours of audio from 4329 speakers. =
 ### Training Datasets
@@ -106,7 +104,7 @@ The Low Frame-rate Speech Codec is trained on a total of 28.7k hrs of speech dat
     - Labeling Method: Automated
-    - Properties: We randomly selected 200 samples from each of the eight languages in the 44kHz MLS dataset. For more details, please refer to [our paper](https://arxiv.org/abs/2409.12117).
   - [DAPS](https://zenodo.org/records/4660670)

 ## Training, Testing, and Evaluation Datasets:
+The Low Frame-rate Speech Codec was trained on 28.7k hours of speech data spanning 105 languages. The model was evaluated using multilingual audiobook-style data and high-quality English recordings. For further details, refer to  [our paper](https://arxiv.org/abs/2409.12117).
 ### Training Datasets
     - Labeling Method: Automated
+    - Properties: We randomly selected 200 samples from each of the eight languages in the 44kHz MLS dataset.
   - [DAPS](https://zenodo.org/records/4660670)