---
library_name: nemo
---

# CHiME8 DASR NeMo Baseline Models

## 1. Voice Activity Detection (VAD) Model
### MarbleNet_frame_VAD_chime7_Acrobat.nemo
- This model is based on the [NeMo MarbleNet VAD model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speech_classification/models.html#marblenet-vad).
- For validation, we use a dataset comprising the CHiME-6 development subset and 50 hours of simulated audio data.
- The simulated data is generated using the [NeMo multi-speaker data simulator](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tools/Multispeaker_Simulator.ipynb) on the [VoxCeleb1&2 datasets](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html).
- The multi-speaker data simulation results in a total of 2,000 hours of audio, of which approximately 30% is silence.
- Model training incorporates [SpecAugment](https://arxiv.org/abs/1904.08779) and noise augmentation through the [MUSAN noise dataset](https://arxiv.org/abs/1510.08484).
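A frame-level VAD model emits one speech probability per frame, which downstream diarization consumes as speech segments. As a minimal sketch of one common post-processing step (the helper name `frames_to_segments` and the threshold value are illustrative, not this checkpoint's tuned configuration):

```python
# Minimal sketch (not the baseline's exact post-processing): convert a
# frame-VAD model's per-frame speech probabilities into speech segments
# by thresholding. The threshold value is an illustrative placeholder.

def frames_to_segments(probs, threshold=0.5):
    """Return [start_frame, end_frame) index pairs where prob >= threshold."""
    segments = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i                      # speech onset
        elif p < threshold and start is not None:
            segments.append((start, i))    # speech offset
            start = None
    if start is not None:                  # speech runs to the last frame
        segments.append((start, len(probs)))
    return segments

# Frame indices convert to seconds by multiplying with the frame shift.
print(frames_to_segments([0.1, 0.2, 0.9, 0.95, 0.8, 0.3, 0.1, 0.7, 0.9, 0.2]))
# → [(2, 5), (7, 9)]
```

Real systems usually add onset/offset hysteresis and minimum-duration smoothing on top of the plain threshold shown here.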

## 2. Speaker Diarization Model: Multi-scale Diarization Decoder (MSDD-v2)
### MSDD_v2_PALO_100ms_intrpl_3scales.nemo
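As the checkpoint name suggests, this decoder operates on three segmentation scales. To illustrate the multi-scale idea behind MSDD — the same audio is windowed at several (window, hop) scales and each finest-scale segment is matched to the closest segment at every coarser scale — here is a sketch; the helper names and the scale values are examples, not this checkpoint's actual configuration:

```python
# Illustrative sketch of multi-scale segmentation: window the audio at
# several (window, hop) scales, then map each finest-scale segment to the
# center-closest segment at every coarser scale. Scale values are examples.

def segment(duration, window, hop):
    """Overlapping (start, end) windows covering [0, duration)."""
    segs, start = [], 0.0
    while start + window <= duration:
        segs.append((start, start + window))
        start += hop
    return segs

def map_scales(scales, duration):
    """For each finest-scale segment, the index of the center-closest
    segment at every coarser scale."""
    all_segs = [segment(duration, w, h) for w, h in scales]
    finest = all_segs[-1]
    mapping = []
    for s, e in finest:
        center = (s + e) / 2
        mapping.append([
            min(range(len(segs)),
                key=lambda i: abs((segs[i][0] + segs[i][1]) / 2 - center))
            for segs in all_segs[:-1]
        ])
    return finest, mapping

# Three hypothetical scales, coarsest to finest, on 4 s of audio:
scales = [(1.5, 0.75), (1.0, 0.5), (0.5, 0.25)]
finest, mapping = map_scales(scales, duration=4.0)
```

The per-scale groupings let the decoder weigh long windows (stable speaker embeddings) against short windows (fine temporal resolution) when assigning speaker labels.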

## 3. Automatic Speech Recognition (ASR) Model
### FastConformerXL-RNNT-chime7-GSS-finetuned.nemo

## 4. Language Model for ASR Decoding: KenLM Model
### ASR_LM_chime7_only.kenlm
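During beam-search decoding, an n-gram LM score is typically combined with the ASR model's score via shallow fusion. A sketch of the scoring arithmetic (the helper name `fused_score` and the `alpha`/`beta` weights are illustrative placeholders, not the baseline's tuned values):

```python
# Sketch of shallow fusion: each beam hypothesis' ASR log-probability is
# combined with the n-gram LM log-probability plus a word-insertion term.
# alpha (LM weight) and beta (word-insertion weight) are illustrative.

def fused_score(asr_logprob, lm_logprob, num_words, alpha=0.5, beta=1.0):
    """Combined hypothesis score used to rank beams."""
    return asr_logprob + alpha * lm_logprob + beta * num_words

# A hypothesis the LM favors can overtake one with a better acoustic score:
score_a = fused_score(asr_logprob=-4.0, lm_logprob=-2.0, num_words=3)  # -2.0
score_b = fused_score(asr_logprob=-3.5, lm_logprob=-6.0, num_words=3)  # -3.5
assert score_a > score_b
```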