# nest-ja-0.6b
This repository provides a Japanese NEST model trained by SB Intuitions.
NEST [1] is a framework for speech self-supervised learning (SSL); the resulting model can be used as a frozen speech feature extractor or as weight initialization for downstream speech processing tasks.
The nest-ja-0.6b model has about 0.6B parameters and was trained on roughly 35K hours of Japanese speech.
## Model Summary
This model follows the NEST framework by NVIDIA. For architecture details and usage, see the original model card.
## How to Use nest-ja-0.6b
Our nest-ja-0.6b model is available for use in the NVIDIA NeMo Framework, and can be used as weight initialization for downstream tasks or as a frozen feature extractor.
### Using NEST as Weight Initialization for Downstream Tasks
When using nest-ja-0.6b as a pretrained encoder, follow the usage described in the original model card. Any decoder architecture (CTC, RNN-T, TDT, etc.) can be used, but the encoder configuration must match the nest-ja-0.6b encoder architecture. Below is an example of training a character-based ASR model with a CTC decoder, a common setup for Japanese ASR.
```shell
# Example: character-based ASR with a CTC decoder.
# The pretrained-model and encoder settings must match nest-ja-0.6b;
# the dataset, trainer, optimizer, and logger settings may be adjusted as needed.
python <NeMo Root>/examples/asr/asr_ctc/speech_to_text_ctc.py \
    --config-path=<NeMo Root>/examples/asr/conf/conformer/ \
    --config-name=conformer_ctc_char \
    ++init_from_pretrained_model.ssl.name="sbintuitions/nest-ja-0.6b" \
    ++init_from_pretrained_model.ssl.include=["encoder"] \
    model.encoder.n_layers=24 \
    model.encoder.d_model=1024 \
    model.encoder.subsampling="dw_striding" \
    model.encoder.subsampling_factor=8 \
    model.encoder.subsampling_conv_channels=256 \
    model.encoder.conv_kernel_size=9 \
    model.encoder.conv_norm_type="layer_norm" \
    model.encoder.xscaling="false" \
    model.train_ds.manifest_filepath=<path to train manifest> \
    model.validation_ds.manifest_filepath=<path to val/test manifest> \
    "model.labels=<List of characters>" \
    trainer.devices=-1 \
    trainer.accelerator="gpu" \
    trainer.strategy="ddp" \
    trainer.max_epochs=100 \
    model.optim.name="adamw" \
    model.optim.lr=0.001 \
    model.optim.betas=[0.9,0.999] \
    model.optim.weight_decay=0.0001 \
    model.optim.sched.warmup_steps=2000 \
    exp_manager.create_wandb_logger=True \
    exp_manager.wandb_logger_kwargs.name="<Name of experiment>" \
    exp_manager.wandb_logger_kwargs.project="<Name of project>"
```
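The `manifest_filepath` options expect NeMo-style manifests, where each line is a standalone JSON object with `audio_filepath`, `duration`, and `text` fields. A minimal sketch of building one (the paths, durations, and transcripts below are placeholders for illustration):

```python
import json

# Each line of a NeMo manifest is one JSON object describing one utterance.
samples = [
    {"audio_filepath": "/data/audio/utt001.wav", "duration": 3.2, "text": "こんにちは"},
    {"audio_filepath": "/data/audio/utt002.wav", "duration": 5.7, "text": "ありがとうございます"},
]

with open("train_manifest.json", "w", encoding="utf-8") as f:
    for sample in samples:
        # ensure_ascii=False keeps Japanese text readable in the manifest.
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```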
### Extracting and Saving Audio Features from NEST
NEST supports extracting audio features from multiple layers of its encoder:
```shell
python <NeMo Root>/scripts/ssl/extract_features.py \
    --model_path="sbintuitions/nest-ja-0.6b" \
    --input=<path to input manifest, or a dir containing audios, or path to audio> \
    --output=<output directory to store features and manifest> \
    --layers="all" \
    --batch_size=8 \
    --workers=8
```
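With `subsampling_factor=8` in the encoder configuration above, the encoder emits one output frame for every eight input feature frames. Assuming NeMo's default 10 ms feature hop (an assumption; check your preprocessor config), the expected feature length can be estimated as:

```python
def encoder_frames(duration_sec, hop_ms=10, subsampling_factor=8):
    """Rough estimate of encoder output length for an utterance.

    Assumes a 10 ms input feature hop; with 8x subsampling each
    output frame then covers about 80 ms of audio.
    """
    input_frames = int(duration_sec * 1000 / hop_ms)
    return input_frames // subsampling_factor

print(encoder_frames(10.0))  # 10 s of audio -> 125 output frames
```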
## Training
We follow the same training procedure as the original NEST.
### Training Datasets
## Performance
We evaluated nest-ja-0.6b on Japanese speech recognition, English speaker-related tasks, and a paralinguistic task. For comparison, we also fine-tuned existing Japanese SSL models under the same conditions.
### Japanese Automatic Speech Recognition Evaluation
We initialized weights from existing Japanese SSL models and from our nest-ja-0.6b model, then fine-tuned them with CTC loss on multiple Japanese corpora using a character-based vocabulary. Character Error Rate (CER, %) is the evaluation metric (lower is better).
Corpora:
- CSJ (Corpus of Spontaneous Japanese)
  - A large-scale corpus of spontaneous Japanese speech. The eval1 / eval2 / eval3 sets are used as test sets.
- COJADS (Corpus of Japanese Dialects)
  - A corpus centered on Japanese dialect speech. In this evaluation, we use the katakana transcriptions.
- EARS (Elderly Adults Read Speech Corpus)
  - A corpus recorded from elderly speakers aged 70+ (mean age 83.3).
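CER divides the character-level edit distance (insertions, deletions, and substitutions) by the reference length. A minimal sketch of the metric:

```python
def edit_distance(ref, hyp):
    # Standard Levenshtein distance over characters, single-row DP.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution / match
    return dp[-1]

def cer(ref, hyp):
    # Character Error Rate: edits divided by reference length, in percent.
    return 100.0 * edit_distance(ref, hyp) / len(ref)

print(cer("こんにちは", "こんにちわ"))  # one substitution in five characters -> 20.0
```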
| Model | Pretraining dataset | Data size | Params | eval1 | eval2 | eval3 | CSJ average | COJADS | EARS |
|---|---|---|---|---|---|---|---|---|---|
| yky-h/japanese-hubert-large | ReazonSpeech v1 | 19k hours | 0.3B | 4.09 | 2.80 | 3.11 | 3.33 | 44.9 | 36.1 |
| imprt/kushinada-hubert-large | In-house | 62k hours | 0.3B | 4.14 | 3.10 | 3.31 | 3.51 | 43.8 | 36.5 |
| sbintuitions/nest-ja-0.6b | ReazonSpeech v2 | 35k hours | 0.6B | 4.01 | 2.98 | 3.31 | 3.43 | 29.7 | 34.1 |
### English Speaker and Paralinguistic Task Evaluation
We used existing Japanese SSL models and our nest-ja-0.6b model as frozen feature extractors and evaluated them on downstream tasks from the SUPERB benchmark: three speaker-related tasks and one paralinguistic task.
Tasks:
- SID (Speaker Identification) - ACC (Accuracy, %) ↑
  - Multi-class speaker classification with a fixed, predefined speaker set shared between training and testing; evaluated on VoxCeleb1 (higher is better).
- ASV (Automatic Speaker Verification) - EER (Equal Error Rate, %) ↓
  - Binary verification of whether two utterances come from the same speaker, where test speakers may be unseen during training; evaluated on VoxCeleb1 (without VoxCeleb2 training data or noise augmentation) (lower is better).
- SD (Speaker Diarization) - DER (Diarization Error Rate, %) ↓
  - Predicts who speaks when (including overlap) at each timestamp; evaluated on two-speaker LibriMix, using time-aligned speaker labels from Kaldi alignments (lower is better).
- ER (Emotion Recognition) - ACC (Accuracy, %) ↑
  - Utterance-level emotion classification; evaluated on IEMOCAP with four balanced classes (neutral, happy, sad, angry) using 5-fold cross-validation on the standard splits (higher is better).
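The EER reported for ASV is the operating point where the false-accept and false-reject rates are equal. An illustrative sketch that approximates it with a threshold sweep (official SUPERB scoring may interpolate between ROC points, so exact values can differ slightly):

```python
def eer(scores, labels):
    """Approximate Equal Error Rate (%) for verification scores.

    labels: 1 for same-speaker (target) pairs, 0 for impostor pairs.
    Sweeps the decision threshold from above the top score downward,
    tracking the point where FAR and FRR are closest.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    fa, fr = 0, n_pos  # threshold above max score: reject everything
    best = 100.0
    for i in order:
        if labels[i] == 1:
            fr -= 1  # a target pair is now accepted
        else:
            fa += 1  # an impostor pair is now accepted
        far, frr = 100.0 * fa / n_neg, 100.0 * fr / n_pos
        best = min(best, max(far, frr))
    return best
```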
| Model | Pretraining dataset | Data size | Params | SID↑ | ASV↓ | SD↓ | ER↑ |
|---|---|---|---|---|---|---|---|
| yky-h/japanese-hubert-large | ReazonSpeech v1 | 19k hours | 0.3B | 84.97 | 7.33 | 3.32 | 70.69 |
| imprt/kushinada-hubert-large | In-house | 62k hours | 0.3B | 88.01 | 7.21 | 3.41 | 67.18 |
| sbintuitions/nest-ja-0.6b | ReazonSpeech v2 | 35k hours | 0.6B | 90.20 | 7.26 | 4.80 | 64.24 |
## Reference
[1] NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks
## License