# HuBERT Home-Domain SSL — HindiBabyNet
Self-supervised continued pretraining of facebook/hubert-base-ls960 on naturalistic Hindi parent–infant home interaction recordings from the HindiBabyNet corpus.
## Model Description
This model adapts the HuBERT-base speech representation to the home-recording domain — noisy, reverberant, multi-speaker environments with infant vocalisations and Hindi child-directed speech (CDS). The goal is to learn robust latent audio representations that better capture the acoustic characteristics of naturalistic home environments, which differ substantially from the read English speech (LibriSpeech) used to train the original HuBERT-base.
The model was pretrained using the HuBERT masked pseudo-label prediction objective:
- MFCC features are extracted from audio crops and clustered with k-means to produce pseudo-labels (discrete frame-level targets).
- Time steps in the Transformer input are randomly masked.
- The model predicts the pseudo-label of each masked frame via cross-entropy loss.
This is distinct from wav2vec2's contrastive + diversity loss — HuBERT's objective explicitly predicts discrete cluster assignments, encouraging the model to learn a clustering-consistent representation of speech.
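The masked prediction step can be sketched in a few lines. This is a minimal illustration with random tensors and made-up shapes, not the actual fairseq/Transformers implementation: hidden states from the Transformer are projected to cluster logits, and cross-entropy is computed only over masked frames.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, frames, hidden, n_clusters = 2, 50, 768, 100

hidden_states = torch.randn(batch, frames, hidden)             # Transformer outputs
pseudo_labels = torch.randint(0, n_clusters, (batch, frames))  # k-means cluster ids
mask = torch.rand(batch, frames) < 0.5                         # masked time steps

# Project hidden states to cluster logits; score only the masked frames.
proj = torch.nn.Linear(hidden, n_clusters)
logits = proj(hidden_states)
loss = F.cross_entropy(logits[mask], pseudo_labels[mask])
```

With untrained weights the loss sits near `log(100) ≈ 4.6`, the entropy of a uniform guess over 100 clusters.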
| Property | Value |
|---|---|
| Base model | facebook/hubert-base-ls960 |
| Architecture | HubertModel |
| Parameters | ~94.7 M |
| Hidden size | 768 |
| Attention heads | 12 |
| Transformer layers | 12 |
| Feature extractor | 7-layer CNN (total stride 320, ~20 ms frames) |
| Model size | 361 MB |
## Training Data
The model was trained on ~308 hours of naturalistic home recordings (99 files, train split) from the HindiBabyNet corpus — a collection of day-long audio recordings of Hindi-speaking parent–infant dyads in their home environment.
| Property | Value |
|---|---|
| Train files | 99 |
| Train duration | 308.45 hours |
| Dev files | 10 |
| Dev duration | 37.25 hours |
| Split strategy | By participant ID (no speaker leakage) |
| Audio format | 16 kHz, mono |
The recordings contain a rich mix of:
- Infant vocalisations (babbling, crying, cooing)
- Hindi child-directed speech (CDS) from caregivers
- Adult-directed speech
- Household background noise (TV, kitchen sounds, etc.)
## Training Procedure

### K-Means Pseudo-Labels (Iteration 0)
| Parameter | Value |
|---|---|
| Feature type | MFCC (13 coefficients + delta + delta-delta = 39-dim) |
| Clustering | MiniBatchKMeans |
| Number of clusters | 100 |
| Training crops | 990 random 8s crops from train set |
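The pseudo-label pipeline above can be sketched with scikit-learn. This is an illustrative stand-in: the MFCC extraction step (e.g. via torchaudio or librosa) is assumed, and random 39-dim vectors take the place of real features.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
# Stand-in for frame-level features: 13 MFCCs + delta + delta-delta = 39 dims
features = rng.standard_normal((5000, 39))

# Fit 100 clusters, as in the table above
kmeans = MiniBatchKMeans(n_clusters=100, batch_size=1024, n_init=3,
                         random_state=0).fit(features)

# Frame-level pseudo-labels for a new utterance's features
labels = kmeans.predict(rng.standard_normal((200, 39)))
```

Each frame's cluster id then serves as the discrete prediction target during masked pretraining.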
### HuBERT Pretraining
| Hyperparameter | Value |
|---|---|
| Total training steps | 50,000 |
| Effective batch size | 64 (4 GPUs × 2 per-GPU × 8 grad accum) |
| Crop duration | 8.0 seconds |
| Epoch multiplier | 10 (random re-cropping per epoch) |
| Learning rate | 5e-5 |
| LR schedule | Linear warmup + linear decay |
| Warmup steps | 5,000 |
| Weight decay | 0.01 |
| Max gradient norm | 1.0 |
| Precision | fp16 mixed precision |
| Gradient checkpointing | Enabled (non-reentrant) |
| Mask time probability | 0.05 |
| Mask time length | 10 frames |
| Projection dim | 256 |
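The LR schedule in the table (linear warmup to a 5e-5 peak over 5,000 steps, then linear decay to zero at 50,000) has a simple closed form; the sketch below is a plain-Python rendering of that shape, equivalent to what `transformers`' linear schedule with warmup produces.

```python
def lr_at(step, peak=5e-5, warmup=5_000, total=50_000):
    """Linear warmup to `peak`, then linear decay to zero at `total`."""
    if step < warmup:
        return peak * step / warmup
    return peak * max(0.0, (total - step) / (total - warmup))

halfway_warmup = lr_at(2_500)   # 2.5e-5, halfway up the ramp
peak_lr = lr_at(5_000)          # 5e-5, end of warmup
final_lr = lr_at(50_000)        # 0.0, fully decayed
```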
### Training Dynamics
- Total audio seen: ~7,111 hours (3.2M crops × 8s, sampled with replacement from 308h source)
- Final loss: 76.60 (masked pseudo-label cross-entropy)
- Training time: ~9.5 hours on 4× GPU (DDP)
### Hardware

- GPUs: 4× NVIDIA GPU (DDP via `torchrun`)
- Framework: PyTorch + Hugging Face Transformers
## Usage

### Feature Extraction
```python
from transformers import HubertModel, Wav2Vec2FeatureExtractor
import torch
import torchaudio

# Load model and feature extractor
model = HubertModel.from_pretrained("arunps/hubert-home-hindibabynet-ssl")
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("arunps/hubert-home-hindibabynet-ssl")

# Load audio and resample to 16 kHz if needed
waveform, sr = torchaudio.load("audio.wav")
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
waveform = waveform.squeeze()

# Extract features
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Hidden states from the last Transformer layer
hidden_states = outputs.last_hidden_state  # (1, num_frames, 768)
```
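As a rough rule of thumb, the CNN front-end downsamples raw 16 kHz audio by a total factor of 320, so the encoder emits about one 768-dim frame per 20 ms; the exact count differs by a frame or two due to convolution edge effects. A quick back-of-the-envelope check for an 8 s crop:

```python
sample_rate = 16_000
seconds = 8.0

num_samples = int(seconds * sample_rate)   # 128,000 samples
num_frames = num_samples // 320            # ~400 frames (~20 ms each)
```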
### Fine-Tuning for Downstream Tasks
This model provides a pretrained encoder suitable for fine-tuning on:
- Speaker diarisation (who is speaking when)
- Speaker type classification (adult vs. infant vs. other)
- Automatic speech recognition (Hindi CDS transcription)
- Infant vocalisation detection and classification
- Emotion/affect recognition in parent–infant interaction
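For utterance-level tasks such as speaker-type classification, a common pattern is to mean-pool the encoder's frame representations and attach a linear head. The sketch below is a hypothetical head only: the encoder (`HubertModel` above) is assumed, and random hidden states stand in for its output.

```python
import torch
import torch.nn as nn

class SpeakerTypeHead(nn.Module):
    """Hypothetical classification head over HuBERT hidden states."""
    def __init__(self, hidden_size=768, num_classes=3):  # e.g. adult/infant/other
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, hidden_states):        # (batch, frames, hidden)
        pooled = hidden_states.mean(dim=1)   # mean-pool over time
        return self.classifier(pooled)       # (batch, num_classes)

head = SpeakerTypeHead()
logits = head(torch.randn(2, 400, 768))      # stand-in for encoder output
```

In practice the head would be trained jointly with (or on top of a frozen copy of) the pretrained encoder.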
## Intended Use
This model is designed for research on infant language development, parent–child interaction, and home-environment speech processing. It is particularly suited for tasks involving:
- Naturalistic, noisy home audio
- Hindi child-directed speech
- Infant and child vocalisations
- Multi-speaker household environments
## Limitations
- Trained on a single corpus (HindiBabyNet); may not generalise to other languages or recording setups without further adaptation.
- Pseudo-labels are based on MFCC k-means (iteration 0); a second iteration using HuBERT-derived features could improve representations.
- No evaluation on standard benchmarks (SUPERB, etc.) — designed for domain-specific downstream tasks.
- The training data contains naturalistic noise, which may degrade performance on clean-speech tasks.
## Data Availability
The HindiBabyNet corpus contains naturalistic home recordings of infants and caregivers and is not publicly available due to GDPR and ethical restrictions on sensitive data involving minors. Access may be requested through the project's institutional review process.
## Citation
If you use this model, please cite both the model and the original HuBERT paper:
```bibtex
@misc{arunps2026hubert-hindibabynet,
  title={HuBERT Home-Domain SSL: Self-Supervised Speech Representation Learning for Hindi Infant-Caregiver Interactions},
  author={Arun P S},
  year={2026},
  url={https://huggingface.co/arunps/hubert-home-hindibabynet-ssl},
  note={HuBERT-base adapted to naturalistic Hindi child-directed speech via continued SSL pretraining}
}

@article{hsu2021hubert,
  title={HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units},
  author={Hsu, Wei-Ning and Bolte, Benjamin and Tsai, Yao-Hung Hubert and Lakhotia, Kushal and Salakhutdinov, Ruslan and Mohamed, Abdelrahman},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  volume={29},
  pages={3451--3460},
  year={2021}
}
```
## Model Card Contact
For questions about this model, please contact Arun P S or open an issue on the HindiBabyNet-Wav2Vec2-SSL repository.