HuBERT Home-Domain SSL — HindiBabyNet

Self-supervised continued pretraining of facebook/hubert-base-ls960 on naturalistic Hindi parent–infant home interaction recordings from the HindiBabyNet corpus.

Model Description

This model adapts the HuBERT-base speech representation to the home-recording domain — noisy, reverberant, multi-speaker environments with infant vocalisations and Hindi child-directed speech (CDS). The goal is to learn robust latent audio representations that better capture the acoustic characteristics of naturalistic home environments, which differ substantially from the read English speech (LibriSpeech) used to train the original HuBERT-base.

The model was pretrained using the HuBERT masked pseudo-label prediction objective:

  1. MFCC features are extracted from audio crops and clustered with k-means to produce pseudo-labels (discrete frame-level targets).
  2. Time steps in the Transformer input are randomly masked.
  3. The model predicts the pseudo-label of each masked frame via cross-entropy loss.

This is distinct from wav2vec 2.0's contrastive + diversity loss — HuBERT's objective explicitly predicts discrete cluster assignments, encouraging the model to learn a clustering-consistent representation of speech.
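The three pretraining steps above can be sketched with stand-in arrays. Everything here is illustrative (random logits in place of Transformer outputs, a simple Bernoulli mask instead of the span masking with probability 0.05 and length 10 used in training), just to show where the cross-entropy is applied:

```python
import numpy as np

rng = np.random.default_rng(0)

T, K = 50, 100                           # frames per crop, k-means clusters
labels = rng.integers(0, K, size=T)      # step 1: frame-level pseudo-labels
logits = rng.normal(size=(T, K))         # stand-in for Transformer outputs

# Step 2: randomly mask time steps (simplified Bernoulli mask, not span masking).
mask = rng.random(T) < 0.5

# Step 3: cross-entropy on masked frames only.
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -log_probs[mask, labels[mask]].mean()
print(float(loss))
```

Unmasked frames contribute nothing to the loss, so the model is forced to infer the cluster identity of masked speech from surrounding context.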

Base model: facebook/hubert-base-ls960
Architecture: HubertModel
Parameters: ~94.7 M
Hidden size: 768
Attention heads: 12
Transformer layers: 12
Feature extractor: 7-layer CNN (total stride 320)
Model size: 361 MB

Training Data

The model was trained on ~308 hours of naturalistic home recordings (99 files, train split) from the HindiBabyNet corpus — a collection of day-long audio recordings of Hindi-speaking parent–infant dyads in their home environment.

Train files: 99
Train duration: 308.45 hours
Dev files: 10
Dev duration: 37.25 hours
Split strategy: by participant ID (no speaker leakage)
Audio format: 16 kHz, mono
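The participant-level split strategy can be sketched as follows. Filenames and participant IDs here are hypothetical; the point is that whole participants, not individual files, are assigned to a split, so no speaker appears in both train and dev:

```python
from collections import defaultdict

# Hypothetical recording metadata: (filename, participant_id).
recordings = [
    ("rec_001.wav", "P01"), ("rec_002.wav", "P01"),
    ("rec_003.wav", "P02"), ("rec_004.wav", "P03"),
]

# Group files by participant, then assign whole participants to splits.
by_participant = defaultdict(list)
for fname, pid in recordings:
    by_participant[pid].append(fname)

dev_participants = {"P03"}
train = [f for pid, fs in by_participant.items()
         if pid not in dev_participants for f in fs]
dev = [f for pid, fs in by_participant.items()
       if pid in dev_participants for f in fs]
print(train, dev)
```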

The recordings contain a rich mix of:

  • Infant vocalisations (babbling, crying, cooing)
  • Hindi child-directed speech (CDS) from caregivers
  • Adult-directed speech
  • Household background noise (TV, kitchen sounds, etc.)

Training Procedure

K-Means Pseudo-Labels (Iteration 0)

Feature type: MFCC (13 coefficients + delta + delta-delta = 39-dim)
Clustering: MiniBatchKMeans
Number of clusters: 100
Training crops: 990 random 8 s crops from the train set
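The pseudo-labeling step can be sketched with a minimal NumPy mini-batch k-means (the actual run used scikit-learn's MiniBatchKMeans on real MFCC features; the random 39-dim frames, cluster count, and update loop below are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for 39-dim MFCC + delta + delta-delta frames from random 8 s crops.
frames = rng.normal(size=(2000, 39))

K = 10  # the card uses 100 clusters; kept small here
centroids = frames[rng.choice(len(frames), K, replace=False)].copy()
counts = np.zeros(K)

# Mini-batch k-means: update centroids from small random batches,
# with a per-centroid learning rate that decays as counts grow.
for _ in range(50):
    batch = frames[rng.choice(len(frames), 256, replace=False)]
    dists = ((batch[:, None, :] - centroids[None]) ** 2).sum(-1)
    assign = dists.argmin(1)
    for j in range(K):
        sel = batch[assign == j]
        if len(sel):
            counts[j] += len(sel)
            lr = len(sel) / counts[j]
            centroids[j] = (1 - lr) * centroids[j] + lr * sel.mean(0)

# Pseudo-labels: nearest centroid per frame (the HuBERT training targets).
dists_all = ((frames[:, None, :] - centroids[None]) ** 2).sum(-1)
pseudo_labels = dists_all.argmin(1)
print(pseudo_labels[:5])
```

In the real pipeline these frame-level cluster IDs become the discrete targets that the masked-prediction objective is trained against.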

HuBERT Pretraining

Total training steps: 50,000
Effective batch size: 64 (4 GPUs × 2 per-GPU × 8 grad accum)
Crop duration: 8.0 seconds
Epoch multiplier: 10 (random re-cropping per epoch)
Learning rate: 5e-5
LR schedule: linear warmup + linear decay
Warmup steps: 5,000
Weight decay: 0.01
Max gradient norm: 1.0
Precision: fp16 mixed precision
Gradient checkpointing: enabled (non-reentrant)
Mask time probability: 0.05
Mask time length: 10 frames
Projection dim: 256
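The LR schedule from the table (linear warmup to 5e-5 over 5,000 steps, then linear decay to zero by step 50,000) can be written out as a small sketch:

```python
def lr_at(step, peak=5e-5, warmup=5_000, total=50_000):
    """Linear warmup to `peak`, then linear decay to zero (values from the card)."""
    if step < warmup:
        return peak * step / warmup
    return peak * max(0.0, (total - step) / (total - warmup))

# LR at start, end of warmup, midpoint of decay, and final step.
print(lr_at(0), lr_at(5_000), lr_at(27_500), lr_at(50_000))
```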

Training Dynamics

  • Total audio seen: ~7,111 hours (3.2M crops × 8s, sampled with replacement from 308h source)
  • Final loss: 76.60 (masked pseudo-label cross-entropy)
  • Training time: ~9.5 hours on 4× GPU (DDP)
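The "total audio seen" figure follows directly from the hyperparameters above:

```python
steps = 50_000
batch = 64          # 4 GPUs x 2 per-GPU x 8 grad accum
crop_s = 8.0        # seconds per crop

crops = steps * batch           # 3.2M crops, sampled with replacement
hours = crops * crop_s / 3600   # total audio seen
print(crops, round(hours))      # ~7,111 hours from a 308 h source
```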

Hardware

  • GPUs: 4× NVIDIA GPU (DDP via torchrun)
  • Framework: PyTorch + Hugging Face Transformers

Usage

Feature Extraction

```python
from transformers import HubertModel, Wav2Vec2FeatureExtractor
import torch
import torchaudio

# Load model
model = HubertModel.from_pretrained("arunps/hubert-home-hindibabynet-ssl")
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("arunps/hubert-home-hindibabynet-ssl")

# Load audio
waveform, sr = torchaudio.load("audio.wav")
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
waveform = waveform.squeeze()

# Extract features
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Hidden states from the last Transformer layer
hidden_states = outputs.last_hidden_state  # (1, num_frames, 768)
```
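Because the CNN feature extractor has a total stride of 320, the Transformer emits roughly one 768-dim frame per 20 ms of 16 kHz audio. A small sketch of mapping frame indices back to time (useful when aligning features with annotations):

```python
SAMPLE_RATE = 16_000
STRIDE = 320  # total downsampling of the 7-layer CNN feature extractor

def frame_to_seconds(idx: int) -> float:
    """Approximate start time of Transformer frame `idx` (20 ms hop)."""
    return idx * STRIDE / SAMPLE_RATE

print(frame_to_seconds(0), frame_to_seconds(50))
```

Note this ignores the small edge effects of the CNN's receptive field, so treat it as an approximation rather than an exact alignment.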

Fine-Tuning for Downstream Tasks

This model provides a pretrained encoder suitable for fine-tuning on:

  • Speaker diarisation (who is speaking when)
  • Speaker type classification (adult vs. infant vs. other)
  • Automatic speech recognition (Hindi CDS transcription)
  • Infant vocalisation detection and classification
  • Emotion/affect recognition in parent–infant interaction
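For utterance-level tasks such as speaker-type classification, a common pattern is to pool the frame features and attach a small classifier. The head below is a hypothetical sketch (class count, pooling choice, and the `SpeakerTypeHead` name are all assumptions, not part of this release); it consumes tensors shaped like `last_hidden_state` from the usage example above:

```python
import torch
import torch.nn as nn

class SpeakerTypeHead(nn.Module):
    """Mean-pool HuBERT frame features, then classify (hypothetical head)."""
    def __init__(self, hidden_size: int = 768, num_classes: int = 3):
        super().__init__()
        # e.g. 3 classes: adult vs. infant vs. other
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        pooled = hidden_states.mean(dim=1)   # (batch, 768)
        return self.classifier(pooled)       # (batch, num_classes)

head = SpeakerTypeHead()
fake_features = torch.randn(2, 100, 768)     # stand-in for last_hidden_state
logits = head(fake_features)
print(logits.shape)
```

In practice the encoder would be fine-tuned jointly with the head (or frozen for a lightweight probe), with the trade-off depending on the amount of labelled data available.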

Intended Use

This model is designed for research on infant language development, parent–child interaction, and home-environment speech processing. It is particularly suited for tasks involving:

  • Naturalistic, noisy home audio
  • Hindi child-directed speech
  • Infant and child vocalisations
  • Multi-speaker household environments

Limitations

  • Trained on a single corpus (HindiBabyNet); may not generalise to other languages or recording setups without further adaptation.
  • Pseudo-labels are based on MFCC k-means (iteration 0); a second iteration using HuBERT-derived features could improve representations.
  • No evaluation on standard benchmarks (SUPERB, etc.) — designed for domain-specific downstream tasks.
  • The training data contains naturalistic noise which may affect performance on clean speech tasks.

Data Availability

The HindiBabyNet corpus contains naturalistic home recordings of infants and caregivers and is not publicly available due to GDPR and ethical restrictions on sensitive data involving minors. Access may be requested through the project's institutional review process.

Citation

If you use this model, please cite both the model and the original HuBERT paper:

@misc{arunps2026hubert-hindibabynet,
  title={HuBERT Home-Domain SSL: Self-Supervised Speech Representation Learning for Hindi Infant-Caregiver Interactions},
  author={Arun P S},
  year={2026},
  url={https://huggingface.co/arunps/hubert-home-hindibabynet-ssl},
  note={HuBERT-base adapted to naturalistic Hindi child-directed speech via continued SSL pretraining}
}

@article{hsu2021hubert,
  title={HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units},
  author={Hsu, Wei-Ning and Bolte, Benjamin and Tsai, Yao-Hung Hubert and Lakhotia, Kushal and Salakhutdinov, Ruslan and Mohamed, Abdelrahman},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  volume={29},
  pages={3451--3460},
  year={2021}
}

Model Card Contact

For questions about this model, please contact Arun P S or open an issue on the HindiBabyNet-Wav2Vec2-SSL repository.
