HuBERT Home-Domain SSL — HindiBabyNet

Self-supervised continued pretraining of facebook/hubert-base-ls960 on naturalistic Hindi parent–infant home interaction recordings from the HindiBabyNet corpus.

Model Description

This model adapts the HuBERT-base speech representation to the home-recording domain — noisy, reverberant, multi-speaker environments with infant vocalisations and Hindi child-directed speech (CDS). The goal is to learn robust latent audio representations that better capture the acoustic characteristics of naturalistic home environments, which differ substantially from the read English speech (LibriSpeech) used to train the original HuBERT-base.

The model was pretrained using the HuBERT masked pseudo-label prediction objective:

  1. MFCC features are extracted from audio crops and clustered with k-means to produce pseudo-labels (discrete frame-level targets).
  2. Time steps in the Transformer input are randomly masked.
  3. The model predicts the pseudo-label of each masked frame via cross-entropy loss.

This is distinct from wav2vec 2.0's contrastive + diversity loss — HuBERT's objective explicitly predicts discrete cluster assignments, encouraging the model to learn a clustering-consistent representation of speech.
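The three pretraining steps above can be sketched with stand-in arrays. Everything here is illustrative (random logits in place of Transformer outputs, a simple Bernoulli mask instead of the span masking with probability 0.05 and length 10 used in training), just to show where the cross-entropy is applied:

```python
import numpy as np

rng = np.random.default_rng(0)

T, K = 50, 100                           # frames per crop, k-means clusters
labels = rng.integers(0, K, size=T)      # step 1: frame-level pseudo-labels
logits = rng.normal(size=(T, K))         # stand-in for Transformer outputs

# Step 2: randomly mask time steps (simplified Bernoulli mask, not span masking).
mask = rng.random(T) < 0.5

# Step 3: cross-entropy on masked frames only.
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -log_probs[mask, labels[mask]].mean()
print(float(loss))
```

Unmasked frames contribute nothing to the loss, so the model is forced to infer the cluster identity of masked speech from surrounding context.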

Base model: facebook/hubert-base-ls960
Architecture: HubertModel
Parameters: ~94.7 M
Hidden size: 768
Attention heads: 12
Transformer layers: 12
Feature extractor: 7-layer CNN (total stride 320)
Model size: 361 MB

Training Data

The model was trained on ~308 hours of naturalistic home recordings (99 files, train split) from the HindiBabyNet corpus — a collection of day-long audio recordings of Hindi-speaking parent–infant dyads in their home environment.

Train files: 99
Train duration: 308.45 hours
Dev files: 10
Dev duration: 37.25 hours
Split strategy: by participant ID (no speaker leakage)
Audio format: 16 kHz, mono
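The participant-level split strategy can be sketched as follows. Filenames and participant IDs here are hypothetical; the point is that whole participants, not individual files, are assigned to a split, so no speaker appears in both train and dev:

```python
from collections import defaultdict

# Hypothetical recording metadata: (filename, participant_id).
recordings = [
    ("rec_001.wav", "P01"), ("rec_002.wav", "P01"),
    ("rec_003.wav", "P02"), ("rec_004.wav", "P03"),
]

# Group files by participant, then assign whole participants to splits.
by_participant = defaultdict(list)
for fname, pid in recordings:
    by_participant[pid].append(fname)

dev_participants = {"P03"}
train = [f for pid, fs in by_participant.items()
         if pid not in dev_participants for f in fs]
dev = [f for pid, fs in by_participant.items()
       if pid in dev_participants for f in fs]
print(train, dev)
```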

The recordings contain a rich mix of:

  • Infant vocalisations (babbling, crying, cooing)
  • Hindi child-directed speech (CDS) from caregivers
  • Adult-directed speech
  • Household background noise (TV, kitchen sounds, etc.)

Training Procedure

K-Means Pseudo-Labels (Iteration 0)

Feature type: MFCC (13 coefficients + delta + delta-delta = 39-dim)
Clustering: MiniBatchKMeans
Number of clusters: 100
Training crops: 990 random 8 s crops from the train set
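The pseudo-labeling step can be sketched with a minimal NumPy mini-batch k-means (the actual run used scikit-learn's MiniBatchKMeans on real MFCC features; the random 39-dim frames, cluster count, and update loop below are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for 39-dim MFCC + delta + delta-delta frames from random 8 s crops.
frames = rng.normal(size=(2000, 39))

K = 10  # the card uses 100 clusters; kept small here
centroids = frames[rng.choice(len(frames), K, replace=False)].copy()
counts = np.zeros(K)

# Mini-batch k-means: update centroids from small random batches,
# with a per-centroid learning rate that decays as counts grow.
for _ in range(50):
    batch = frames[rng.choice(len(frames), 256, replace=False)]
    dists = ((batch[:, None, :] - centroids[None]) ** 2).sum(-1)
    assign = dists.argmin(1)
    for j in range(K):
        sel = batch[assign == j]
        if len(sel):
            counts[j] += len(sel)
            lr = len(sel) / counts[j]
            centroids[j] = (1 - lr) * centroids[j] + lr * sel.mean(0)

# Pseudo-labels: nearest centroid per frame (the HuBERT training targets).
dists_all = ((frames[:, None, :] - centroids[None]) ** 2).sum(-1)
pseudo_labels = dists_all.argmin(1)
print(pseudo_labels[:5])
```

In the real pipeline these frame-level cluster IDs become the discrete targets that the masked-prediction objective is trained against.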

HuBERT Pretraining

Total training steps: 50,000
Effective batch size: 64 (4 GPUs × 2 per-GPU × 8 grad accum)
Crop duration: 8.0 seconds
Epoch multiplier: 10 (random re-cropping per epoch)
Learning rate: 5e-5
LR schedule: linear warmup + linear decay
Warmup steps: 5,000
Weight decay: 0.01
Max gradient norm: 1.0
Precision: fp16 mixed precision
Gradient checkpointing: enabled (non-reentrant)
Mask time probability: 0.05
Mask time length: 10 frames
Projection dim: 256
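The LR schedule from the table (linear warmup to 5e-5 over 5,000 steps, then linear decay to zero by step 50,000) can be written out as a small sketch:

```python
def lr_at(step, peak=5e-5, warmup=5_000, total=50_000):
    """Linear warmup to `peak`, then linear decay to zero (values from the card)."""
    if step < warmup:
        return peak * step / warmup
    return peak * max(0.0, (total - step) / (total - warmup))

# LR at start, end of warmup, midpoint of decay, and final step.
print(lr_at(0), lr_at(5_000), lr_at(27_500), lr_at(50_000))
```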

Training Dynamics

  • Total audio seen: ~7,111 hours (3.2M crops × 8s, sampled with replacement from 308h source)
  • Final loss: 76.60 (masked pseudo-label cross-entropy)
  • Training time: ~9.5 hours on 4× GPU (DDP)
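The "total audio seen" figure follows directly from the hyperparameters above:

```python
steps = 50_000
batch = 64          # 4 GPUs x 2 per-GPU x 8 grad accum
crop_s = 8.0        # seconds per crop

crops = steps * batch           # 3.2M crops, sampled with replacement
hours = crops * crop_s / 3600   # total audio seen
print(crops, round(hours))      # ~7,111 hours from a 308 h source
```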

Hardware

  • GPUs: 4× NVIDIA GPU (DDP via torchrun)
  • Framework: PyTorch + Hugging Face Transformers

Usage

Feature Extraction

```python
from transformers import HubertModel, Wav2Vec2FeatureExtractor
import torch
import torchaudio

# Load model
model = HubertModel.from_pretrained("arunps/hubert-home-hindibabynet-ssl")
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("arunps/hubert-home-hindibabynet-ssl")

# Load audio
waveform, sr = torchaudio.load("audio.wav")
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
waveform = waveform.squeeze()

# Extract features
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Hidden states from the last Transformer layer
hidden_states = outputs.last_hidden_state  # (1, num_frames, 768)
```
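Because the CNN feature extractor has a total stride of 320, the Transformer emits roughly one 768-dim frame per 20 ms of 16 kHz audio. A small sketch of mapping frame indices back to time (useful when aligning features with annotations):

```python
SAMPLE_RATE = 16_000
STRIDE = 320  # total downsampling of the 7-layer CNN feature extractor

def frame_to_seconds(idx: int) -> float:
    """Approximate start time of Transformer frame `idx` (20 ms hop)."""
    return idx * STRIDE / SAMPLE_RATE

print(frame_to_seconds(0), frame_to_seconds(50))
```

Note this ignores the small edge effects of the CNN's receptive field, so treat it as an approximation rather than an exact alignment.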

Fine-Tuning for Downstream Tasks

This model provides a pretrained encoder suitable for fine-tuning on:

  • Speaker diarisation (who is speaking when)
  • Speaker type classification (adult vs. infant vs. other)
  • Automatic speech recognition (Hindi CDS transcription)
  • Infant vocalisation detection and classification
  • Emotion/affect recognition in parent–infant interaction
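For utterance-level tasks such as speaker-type classification, a common pattern is to pool the frame features and attach a small classifier. The head below is a hypothetical sketch (class count, pooling choice, and the `SpeakerTypeHead` name are all assumptions, not part of this release); it consumes tensors shaped like `last_hidden_state` from the usage example above:

```python
import torch
import torch.nn as nn

class SpeakerTypeHead(nn.Module):
    """Mean-pool HuBERT frame features, then classify (hypothetical head)."""
    def __init__(self, hidden_size: int = 768, num_classes: int = 3):
        super().__init__()
        # e.g. 3 classes: adult vs. infant vs. other
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        pooled = hidden_states.mean(dim=1)   # (batch, 768)
        return self.classifier(pooled)       # (batch, num_classes)

head = SpeakerTypeHead()
fake_features = torch.randn(2, 100, 768)     # stand-in for last_hidden_state
logits = head(fake_features)
print(logits.shape)
```

In practice the encoder would be fine-tuned jointly with the head (or frozen for a lightweight probe), with the trade-off depending on the amount of labelled data available.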

Intended Use

This model is designed for research on infant language development, parent–child interaction, and home-environment speech processing. It is particularly suited for tasks involving:

  • Naturalistic, noisy home audio
  • Hindi child-directed speech
  • Infant and child vocalisations
  • Multi-speaker household environments

Limitations

  • Trained on a single corpus (HindiBabyNet); may not generalise to other languages or recording setups without further adaptation.
  • Pseudo-labels are based on MFCC k-means (iteration 0); a second iteration using HuBERT-derived features could improve representations.
  • No evaluation on standard benchmarks (SUPERB, etc.) — designed for domain-specific downstream tasks.
  • The training data contains naturalistic noise which may affect performance on clean speech tasks.

Data Availability

The HindiBabyNet corpus contains naturalistic home recordings of infants and caregivers and is not publicly available due to GDPR and ethical restrictions on sensitive data involving minors. Access may be requested through the project's institutional review process.

Citation

If you use this model, please cite both the model and the original HuBERT paper:

@misc{arunps2026hubert-hindibabynet,
  title={HuBERT Home-Domain SSL: Self-Supervised Speech Representation Learning for Hindi Infant-Caregiver Interactions},
  author={Arun P S},
  year={2026},
  url={https://huggingface.co/arunps/hubert-home-hindibabynet-ssl},
  note={HuBERT-base adapted to naturalistic Hindi child-directed speech via continued SSL pretraining}
}

@article{hsu2021hubert,
  title={HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units},
  author={Hsu, Wei-Ning and Bolte, Benjamin and Tsai, Yao-Hung Hubert and Lakhotia, Kushal and Salakhutdinov, Ruslan and Mohamed, Abdelrahman},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  volume={29},
  pages={3451--3460},
  year={2021}
}

Model Card Contact

For questions about this model, please contact Arun P S or open an issue on the HindiBabyNet-Wav2Vec2-SSL repository.
