# nest-ja-0.6b
This repository provides a Japanese NEST model trained by SB Intuitions.
NEST [1] is a framework for speech self-supervised learning (SSL); the resulting model can be used as a frozen speech feature extractor or as weight initialization for downstream speech processing tasks.
The nest-ja-0.6b model has about 0.6B parameters and was trained on roughly 35K hours of Japanese speech.
## Model Summary
This model follows the NEST framework by NVIDIA. For architecture details and usage, see the original model card.
## How to Use nest-ja-0.6b
Our nest-ja-0.6b model is available for use in the NVIDIA NeMo Framework, and can be used as weight initialization for downstream tasks or as a frozen feature extractor.
### Using NEST as Weight Initialization for Downstream Tasks
When using nest-ja-0.6b as a pretrained encoder, follow the usage described in the original model card. Any decoder architecture (CTC, RNN-T, TDT, etc.) can be used, but the encoder configuration must match the nest-ja-0.6b encoder architecture. Below is an example of training a character-based ASR model with a CTC decoder, a common setup for Japanese ASR.
```shell
# Example: character-based ASR with a CTC decoder.
# The pretrained-model and encoder settings must match nest-ja-0.6b;
# the dataset, trainer, optimizer, and logger settings may be adjusted as needed.
python <NeMo Root>/examples/asr/asr_ctc/speech_to_text_ctc.py \
    --config-path=<NeMo Root>/examples/asr/conf/conformer/ \
    --config-name=conformer_ctc_char \
    ++init_from_pretrained_model.ssl.name="sbintuitions/nest-ja-0.6b" \
    ++init_from_pretrained_model.ssl.include=["encoder"] \
    model.encoder.n_layers=24 \
    model.encoder.d_model=1024 \
    model.encoder.subsampling="dw_striding" \
    model.encoder.subsampling_factor=8 \
    model.encoder.subsampling_conv_channels=256 \
    model.encoder.conv_kernel_size=9 \
    model.encoder.conv_norm_type="layer_norm" \
    model.encoder.xscaling="false" \
    model.train_ds.manifest_filepath=<path to train manifest> \
    model.validation_ds.manifest_filepath=<path to val/test manifest> \
    "model.labels=<List of characters>" \
    trainer.devices=-1 \
    trainer.accelerator="gpu" \
    trainer.strategy="ddp" \
    trainer.max_epochs=100 \
    model.optim.name="adamw" \
    model.optim.lr=0.001 \
    model.optim.betas=[0.9,0.999] \
    model.optim.weight_decay=0.0001 \
    model.optim.sched.warmup_steps=2000 \
    exp_manager.create_wandb_logger=True \
    exp_manager.wandb_logger_kwargs.name="<Name of experiment>" \
    exp_manager.wandb_logger_kwargs.project="<Name of project>"
```
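The `manifest_filepath` options expect NeMo-style manifests, where each line is a standalone JSON object with `audio_filepath`, `duration`, and `text` fields. A minimal sketch of building one (the paths, durations, and transcripts below are placeholders for illustration):

```python
import json

# Each line of a NeMo manifest is one JSON object describing one utterance.
samples = [
    {"audio_filepath": "/data/audio/utt001.wav", "duration": 3.2, "text": "こんにちは"},
    {"audio_filepath": "/data/audio/utt002.wav", "duration": 5.7, "text": "ありがとうございます"},
]

with open("train_manifest.json", "w", encoding="utf-8") as f:
    for sample in samples:
        # ensure_ascii=False keeps Japanese text readable in the manifest.
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```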
### Extracting and Saving Audio Features from NEST
NEST supports extracting audio features from multiple layers of its encoder:
```shell
python <NeMo Root>/scripts/ssl/extract_features.py \
    --model_path="sbintuitions/nest-ja-0.6b" \
    --input=<path to input manifest, or a dir containing audios, or path to audio> \
    --output=<output directory to store features and manifest> \
    --layers="all" \
    --batch_size=8 \
    --workers=8
```
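With `subsampling_factor=8` in the encoder configuration above, the encoder emits one output frame for every eight input feature frames. Assuming NeMo's default 10 ms feature hop (an assumption; check your preprocessor config), the expected feature length can be estimated as:

```python
def encoder_frames(duration_sec, hop_ms=10, subsampling_factor=8):
    """Rough estimate of encoder output length for an utterance.

    Assumes a 10 ms input feature hop; with 8x subsampling each
    output frame then covers about 80 ms of audio.
    """
    input_frames = int(duration_sec * 1000 / hop_ms)
    return input_frames // subsampling_factor

print(encoder_frames(10.0))  # 10 s of audio -> 125 output frames
```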
## Training
We follow the same training procedure as the original NEST.
### Training Datasets
## Performance
We evaluated nest-ja-0.6b on Japanese speech recognition, English speaker-related tasks, and a paralinguistic task. For comparison, we also fine-tuned existing Japanese SSL models under the same conditions.
### Japanese Automatic Speech Recognition Evaluation
We initialized weights from existing Japanese SSL models and from our nest-ja-0.6b model, then fine-tuned them with CTC loss on multiple Japanese corpora using a character-based vocabulary. Character Error Rate (CER, %) is the evaluation metric (lower is better).
Corpora:
- CSJ (Corpus of Spontaneous Japanese)
  - A large-scale corpus of spontaneous Japanese speech. The eval1 / eval2 / eval3 sets are used as test sets.
- COJADS (Corpus of Japanese Dialects)
  - A corpus centered on Japanese dialect speech. In this evaluation, we use the katakana transcriptions.
- EARS (Elderly Adults Read Speech Corpus)
  - A corpus recorded from elderly speakers aged 70+ (mean age 83.3).
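CER divides the character-level edit distance (insertions, deletions, and substitutions) by the reference length. A minimal sketch of the metric:

```python
def edit_distance(ref, hyp):
    # Standard Levenshtein distance over characters, single-row DP.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution / match
    return dp[-1]

def cer(ref, hyp):
    # Character Error Rate: edits divided by reference length, in percent.
    return 100.0 * edit_distance(ref, hyp) / len(ref)

print(cer("こんにちは", "こんにちわ"))  # one substitution in five characters -> 20.0
```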
| Model | Pretraining dataset | Data size | Params | eval1 | eval2 | eval3 | CSJ average | COJADS | EARS |
|---|---|---|---|---|---|---|---|---|---|
| yky-h/japanese-hubert-large | ReazonSpeech v1 | 19k hours | 0.3B | 4.09 | 2.80 | 3.11 | 3.33 | 44.9 | 36.1 |
| imprt/kushinada-hubert-large | In-house | 62k hours | 0.3B | 4.14 | 3.10 | 3.31 | 3.51 | 43.8 | 36.5 |
| sbintuitions/nest-ja-0.6b | ReazonSpeech v2 | 35k hours | 0.6B | 4.01 | 2.98 | 3.31 | 3.43 | 29.7 | 34.1 |
### English Speaker and Paralinguistic Task Evaluation
We used existing Japanese SSL models and our nest-ja-0.6b model as frozen feature extractors and evaluated them on downstream tasks from the SUPERB benchmark: three speaker-related tasks and one paralinguistic task.
Tasks:
- SID (Speaker Identification) - ACC (Accuracy, %) ↑
  - Multi-class speaker classification with a fixed, predefined speaker set shared between training and testing; evaluated on VoxCeleb1 (higher is better).
- ASV (Automatic Speaker Verification) - EER (Equal Error Rate, %) ↓
  - Binary verification of whether two utterances come from the same speaker, where test speakers may be unseen during training; evaluated on VoxCeleb1 (without VoxCeleb2 training data or noise augmentation) (lower is better).
- SD (Speaker Diarization) - DER (Diarization Error Rate, %) ↓
  - Predicts who speaks when (including overlap) at each timestamp; evaluated on two-speaker LibriMix, using time-aligned speaker labels from Kaldi alignments (lower is better).
- ER (Emotion Recognition) - ACC (Accuracy, %) ↑
  - Utterance-level emotion classification; evaluated on IEMOCAP with four balanced classes (neutral, happy, sad, angry) using 5-fold cross-validation on the standard splits (higher is better).
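The EER reported for ASV is the operating point where the false-accept and false-reject rates are equal. An illustrative sketch that approximates it with a threshold sweep (official SUPERB scoring may interpolate between ROC points, so exact values can differ slightly):

```python
def eer(scores, labels):
    """Approximate Equal Error Rate (%) for verification scores.

    labels: 1 for same-speaker (target) pairs, 0 for impostor pairs.
    Sweeps the decision threshold from above the top score downward,
    tracking the point where FAR and FRR are closest.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    fa, fr = 0, n_pos  # threshold above max score: reject everything
    best = 100.0
    for i in order:
        if labels[i] == 1:
            fr -= 1  # a target pair is now accepted
        else:
            fa += 1  # an impostor pair is now accepted
        far, frr = 100.0 * fa / n_neg, 100.0 * fr / n_pos
        best = min(best, max(far, frr))
    return best
```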
| Model | Pretraining dataset | Data size | Params | SID↑ | ASV↓ | SD↓ | ER↑ |
|---|---|---|---|---|---|---|---|
| yky-h/japanese-hubert-large | ReazonSpeech v1 | 19k hours | 0.3B | 84.97 | 7.33 | 3.32 | 70.69 |
| imprt/kushinada-hubert-large | In-house | 62k hours | 0.3B | 88.01 | 7.21 | 3.41 | 67.18 |
| sbintuitions/nest-ja-0.6b | ReazonSpeech v2 | 35k hours | 0.6B | 90.20 | 7.26 | 4.80 | 64.24 |
## Reference
[1] NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks
## License