Update README.md adding more description for each model
README.md
CHANGED
|
@@ -3,6 +3,9 @@ library_name: nemo
|
|
| 3 |
---
|
| 4 |
# CHiME8 DASR NeMo Baseline Models
|
| 5 |
|
| 6 |
## 1. Voice Activity Detection (VAD) Model:
|
| 7 |
### **MarbleNet_frame_VAD_chime7_Acrobat.nemo**
|
| 8 |
- This model is based on [NeMo MarbleNet VAD model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speech_classification/models.html#marblenet-vad).
|
|
@@ -14,11 +17,43 @@ on [VoxCeleb1&2 datasets](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.ht
|
|
| 14 |
|
| 15 |
|
| 16 |
## 2. Speaker Diarization Model: Multi-scale Diarization Decoder (MSDD-v2)
|
| 17 |
-
### MSDD_v2_PALO_100ms_intrpl_3scales.nemo
|
| 18 |
|
| 19 |
## 3. Automatic Speech Recognition (ASR) model
|
| 20 |
-
### FastConformerXL-RNNT-chime7-GSS-finetuned.nemo
|
| 21 |
|
| 22 |
|
| 23 |
## 4. Language Model for ASR Decoding: KenLM Model
|
| 24 |
-
### ASR_LM_chime7_only.kenlm
|
| 3 |
---
|
| 4 |
# CHiME8 DASR NeMo Baseline Models
|
| 5 |
|
| 6 |
+
The model files in this repository are those used in the paper [The CHiME-7 Challenge: System Description and Performance of
|
| 7 |
+
NeMo Team’s DASR System](https://arxiv.org/pdf/2310.12378.pdf).
|
| 8 |
+
|
| 9 |
## 1. Voice Activity Detection (VAD) Model:
|
| 10 |
### **MarbleNet_frame_VAD_chime7_Acrobat.nemo**
|
| 11 |
- This model is based on [NeMo MarbleNet VAD model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speech_classification/models.html#marblenet-vad).
|
|
|
|
| 17 |
|
| 18 |
|
| 19 |
## 2. Speaker Diarization Model: Multi-scale Diarization Decoder (MSDD-v2)
|
| 20 |
+
### **MSDD_v2_PALO_100ms_intrpl_3scales.nemo**
|
| 21 |
+
|
| 22 |
+
Our DASR system builds on a speaker diarization system that uses the multi-scale diarization decoder (MSDD).
|
| 23 |
+
- MSDD Reference: [Park et al. (2022)](https://arxiv.org/pdf/2203.15974.pdf)
|
| 24 |
+
- The MSDD-v2 speaker diarization system employs a multi-scale embedding approach and uses the TitaNet speaker embedding extractor.
|
| 25 |
+
- TitaNet Reference: [Koluguri et al. (2022)](https://arxiv.org/abs/2110.04410)
|
| 26 |
+
- The TitaNet model is included in this .nemo checkpoint file.
|
| 27 |
+
- Unlike the earlier MSDD system, which uses a multi-layer LSTM architecture, we employ a four-layer Transformer architecture with a hidden size of 384.
|
| 28 |
+
- This neural model generates logit values indicating speaker existence.
|
| 29 |
+
- Our diarization model is trained on approximately 3,000 hours of simulated audio mixture data generated by the same multi-speaker data simulator used in VAD model training, drawing from the VoxCeleb1&2 and LibriSpeech datasets.
|
| 30 |
+
- LibriSpeech Reference: [OpenSLR Download](https://www.openslr.org/12), [LibriSpeech, Panayotov et al. (2015)](https://ieeexplore.ieee.org/document/7178964)
|
| 31 |
+
- MUSAN noise is also used as additive background noise, focusing on music and broadband noise.
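
The note above that the neural model "generates logit values indicating speaker existence" can be illustrated with a minimal sketch. It assumes the logits are turned into per-speaker activity by a sigmoid followed by a fixed threshold; the function name `speaker_activity`, the 0.5 threshold, and the toy logits are hypothetical and are not taken from the NeMo implementation.

```python
import math


def speaker_activity(logits, threshold=0.5):
    """Map per-frame, per-speaker logits to binary activity decisions.

    Assumption: a sigmoid converts each logit to a probability and a fixed
    threshold decides whether the speaker is active in that frame.
    """
    decisions = []
    for frame in logits:
        probs = [1.0 / (1.0 + math.exp(-z)) for z in frame]
        decisions.append([p >= threshold for p in probs])
    return decisions


# Toy logits: three frames, two speakers (made-up values).
frame_logits = [
    [2.0, -3.0],   # only speaker 0 is likely active
    [1.5, 0.8],    # both speakers overlap
    [-2.2, -1.0],  # silence
]
activity = speaker_activity(frame_logits)
```

Because each speaker gets an independent decision, overlapping speech is represented naturally as more than one active speaker in the same frame.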
|
| 32 |
+
|
| 33 |
|
| 34 |
## 3. Automatic Speech Recognition (ASR) model
|
| 35 |
+
### **FastConformerXL-RNNT-chime7-GSS-finetuned.nemo**
|
| 36 |
+
- This ASR model is based on [NeMo FastConformer XL model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer).
|
| 37 |
+
- Single-channel audio produced by a multi-channel front-end (Guided Source Separation, GSS) is transcribed using a 0.6B-parameter Conformer-based transducer (RNNT) model.
|
| 38 |
+
- Model Reference: [Gulati et al. (2020)](https://arxiv.org/abs/2005.08100)
|
| 39 |
+
- The model was initialized using a publicly available NeMo checkpoint.
|
| 40 |
+
- NeMo Checkpoint: [NGC Model Card: Conformer Transducer XL](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_conformer_transducer_xlarge)
|
| 41 |
+
- This model was then fine-tuned on the CHiME-7 train and dev sets, which include the CHiME-6 and Mixer6 training subsets, after the data was processed through the multi-channel ASR front-end using ground-truth diarization.
|
| 42 |
+
- Fine-Tuning Details:
|
| 43 |
+
- Fine-tuning Duration: 35,000 updates
|
| 44 |
+
- Batch Size: 128
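
As a rough sense of scale, the two fine-tuning figures above can be multiplied to estimate how many training examples the model saw. This is only a sketch: it assumes the batch size of 128 is the total number of utterances per update, which the README does not state.

```python
# Figures quoted in the fine-tuning details above.
updates = 35_000
batch_size = 128  # assumption: utterances per update across all devices

# Total utterances processed during fine-tuning (with repetition across epochs).
samples_seen = updates * batch_size
print(f"{samples_seen:,} utterances processed")
```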
|
| 45 |
|
| 46 |
|
| 47 |
## 4. Language Model for ASR Decoding: KenLM Model
|
| 48 |
+
### **[ASR_LM_chime7_only.kenlm](https://huggingface.co/chime-dasr/nemo_baseline_models/blob/main/ASR_LM_chime7_only.kenlm)**
|
| 49 |
+
|
| 50 |
+
- We apply a word-piece-level N-gram language model built on byte-pair-encoding (BPE) tokens.
|
| 51 |
+
- This approach uses the SentencePiece and KenLM toolkits, with the language model built from the transcriptions of the CHiME-7 train and dev sets.
|
| 52 |
+
- SentencePiece: [Kudo and Richardson (2018)](https://arxiv.org/abs/1808.06226)
|
| 53 |
+
- KenLM: [KenLM GitHub repository](https://github.com/kpu/kenlm)
|
| 54 |
+
- The token sets of our ASR and LM models were matched to ensure consistency.
|
| 55 |
+
- To combine several N-gram models with equal weights, we used the OpenGrm library.
|
| 56 |
+
- OpenGrm: [Roark et al. (2012)](https://aclanthology.org/P12-3011/)
|
| 57 |
+
- Modified adaptive expansion search (MAES) decoding was employed for the transducer, which accelerates the decoding process.
|
| 58 |
+
- MAES Decoding: [Kim et al. (2020)](https://ieeexplore.ieee.org/document/9250505)
|
| 59 |
+
- As expected, integrating the language model into the beam-search decoder significantly improves the performance of the end-to-end model compared to decoding without a language model.
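
The two ideas in the list above, combining several N-gram models with equal weights and letting the language model influence beam-search scores, can be sketched on toy data. This is an illustration under stated assumptions, not the OpenGrm or NeMo implementation: the equal-weight combination is shown as linear interpolation of probabilities, the fusion is a simple weighted sum of log scores, and the names `interpolate_equal`, `fused_score`, the `lm_weight` value, and all probabilities are hypothetical.

```python
import math


def interpolate_equal(models, history, token):
    """Equal-weight linear interpolation of several toy N-gram models.

    Each model maps a history tuple to a dict of next-token probabilities.
    """
    weight = 1.0 / len(models)
    return sum(weight * m.get(history, {}).get(token, 0.0) for m in models)


def fused_score(asr_logprob, lm_prob, lm_weight=0.3):
    """Add a weighted LM log-probability to the ASR log-probability."""
    return asr_logprob + lm_weight * math.log(lm_prob)


# Two toy bigram models over the same token set (made-up probabilities).
lm_a = {("the",): {"meeting": 0.6, "eating": 0.4}}
lm_b = {("the",): {"meeting": 0.8, "eating": 0.2}}
models = [lm_a, lm_b]

# Hypothetical ASR log-probabilities: the acoustic evidence slightly
# prefers "eating" on its own.
asr_logprobs = {"meeting": -1.2, "eating": -1.0}

p_meeting = interpolate_equal(models, ("the",), "meeting")
scores = {
    tok: fused_score(lp, interpolate_equal(models, ("the",), tok))
    for tok, lp in asr_logprobs.items()
}
best = max(scores, key=scores.get)
```

In this toy case the language model overturns the ASR-only preference, which is the kind of effect the last bullet describes. The sketch also relies on the ASR and LM sharing one token set, as noted in the list.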
|