Hindi BabyLM: Data-Efficient Language Modeling for Hindi
📖 Model Overview
A collection of data-efficient Hindi language models trained as part of the BabyLM Challenge adaptation for morphologically rich languages. This repository contains 13 model variants spanning 2 architectures (GPT-2, DeBERTa), 5 tokenizer types (BPE, WordPiece, SentencePiece Unigram, Character-Bigram, Character-Level), and 2 data scales (10M and 100M words).
Research Question: How do tokenization strategies and model architectures affect linguistic competence when training with limited data on a morphologically rich language like Hindi?
Key Findings:
- BPE tokenization with 32K vocabulary achieves the best overall balance of task performance and syntactic competence
- GPT-2 significantly outperforms DeBERTa in the low-data regime (10M words)
- Character-level tokenization catastrophically fails on syntactic evaluation (18.3% MultiBLiMP) despite reasonable task accuracy
- Scaling from 10M to 100M words yields +1.6% IndicGLUE and +4.9% MultiBLiMP improvement
📊 Results
GPT-2 Models (10M words, 110M parameters)
| Model | Tokenizer | Vocab | IndicGLUE (Avg) | MultiBLiMP (Avg) | Perplexity |
|---|---|---|---|---|---|
| gpt_10M_bpe_32k | BPE | 32K | 60.45% | 87.69% | 129.25 |
| gpt_10M_character_bigram | Char-Bigram | 2K | 59.94% | 88.32% | 6.03 |
| gpt_10M_sentencepiece_unigram | SP Unigram | 32K | 59.82% | 87.29% | 124.56 |
| gpt_10M_bpe_8k | BPE | 8K | 58.53% | 86.51% | 84.15 |
| gpt_10M_wordpiece | WordPiece | 32K | 58.20% | 86.14% | 124.70 |
| gpt_10M_bpe_16k | BPE | 16K | 57.62% | 87.96% | 113.17 |
| gpt_10M_character_level | Char-Level | 200 | 53.84% | 18.32% | 3.93 |
GPT-2 Model (100M words) — Best Overall
| Model | Tokenizer | Vocab | IndicGLUE (Avg) | MultiBLiMP (Avg) | Perplexity |
|---|---|---|---|---|---|
| gpt_100M_large | BPE | 32K | 62.09% | 92.54% | 83.50 |
DeBERTa Models (10M words, 86M parameters)
| Model | Tokenizer | Vocab | IndicGLUE (Avg) | MultiBLiMP (Avg) | Perplexity |
|---|---|---|---|---|---|
| deberta_10M_bpe_32K | BPE | 32K | 47.70% | 69.74% | 616.69 |
| deberta_10M_wordpiece_32K | WordPiece | 32K | 40.92% | 69.69% | 525.93 |
| deberta_10M_sentencepiece_unigram_32K | SP Unigram | 32K | 38.93% | 68.51% | 644.42 |
Note: Perplexity values are not directly comparable across tokenizers with different vocabulary sizes, as the prediction space differs. Character-level models have naturally lower perplexity due to smaller vocabularies.
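One common way to put models with different tokenizers on a common footing is to normalize perplexity to bits per character. A minimal sketch (the token and character counts below are illustrative, not taken from this repository's evaluations):

```python
import math

def bits_per_character(perplexity: float, n_tokens: int, n_chars: int) -> float:
    """Convert token-level perplexity to bits per character.

    Total cross-entropy in bits is n_tokens * log2(perplexity);
    dividing by the character count of the same text yields a
    tokenizer-independent measure.
    """
    return n_tokens * math.log2(perplexity) / n_chars

# Illustrative: a subword tokenizer (few tokens, high perplexity) and a
# character-level one (many tokens, low perplexity) compared on one scale.
print(bits_per_character(128.0, 100, 500))  # 7 bits/token spread over 500 chars
print(bits_per_character(4.0, 500, 500))    # 2 bits/token, one token per char
```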
🏆 Best Model: IndicGLUE Per-Task Breakdown (gpt_100M_large)
| Task | Accuracy |
|---|---|
| BBC Articles Classification | 78.06% |
| Product Review Sentiment | 73.42% |
| Discourse Mode | 73.32% |
| Choice of Plausible Alternatives | 63.64% |
| Movie Review Sentiment | 61.94% |
| Wikipedia Section Title Prediction | 43.70% |
| Cloze-style Multiple-Choice QA | 40.70% |
| Average (7 tasks) | 62.09% |
WinogradNLI was skipped as the dataset contains only the entailment class in the train/validation splits.
🧠 MultiBLiMP Syntactic Evaluation (gpt_100M_large)
| Phenomenon | Accuracy | Correct / Total |
|---|---|---|
| Subject-Verb Agreement: Person (SV-P) | 96.60% | 398 / 412 |
| Subject-Predicate Agreement: Number (SP-#) | 95.00% | 95 / 100 |
| Subject-Predicate Agreement: Gender (SP-G) | 92.66% | 101 / 109 |
| Subject-Verb Agreement: Gender (SV-G) | 90.21% | 378 / 419 |
| Subject-Verb Agreement: Number (SV-#) | 88.21% | 359 / 407 |
| Overall (1,447 pairs) | 92.54% | — |
🏗️ Architecture Details
GPT-2 Small (Primary Model)
- Type: Causal Language Model (autoregressive)
- Parameters: 110M
- Layers: 12 | Hidden Size: 768 | Attention Heads: 12
- Activation: GELU
- Context Length: 512 tokens
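These dimensions match a standard GPT-2 Small; a sketch of how such a model might be instantiated with transformers (vocabulary size shown for the 32K BPE variant; the exact GELU variant is an assumption, as it is not specified above):

```python
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=32_000,               # 32K BPE tokenizer variant
    n_positions=512,                 # context length
    n_embd=768,                      # hidden size
    n_layer=12,
    n_head=12,
    activation_function="gelu_new",  # assumption: GELU variant not specified
)
model = GPT2LMHeadModel(config)      # ~110M parameters at this vocab size
```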
DeBERTa Small
- Type: Masked Language Model (bidirectional)
- Parameters: 86M
- Layers: 12 | Hidden Size: 768
- Key Feature: Disentangled attention mechanism
💻 Usage
```python
import torch
from transformers import GPT2LMHeadModel, PreTrainedTokenizerFast

# Load model and tokenizer
model = GPT2LMHeadModel.from_pretrained("Ayush-Talreja/hindi-babylm")
tokenizer = PreTrainedTokenizerFast.from_pretrained("Ayush-Talreja/hindi-babylm")

# Generate text
input_text = "भारत एक"
input_ids = tokenizer.encode(input_text, return_tensors="pt")

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=50,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        temperature=0.8,
    )

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
```
📦 Dataset
Hindi BabyLM Corpus — A curated, multi-source dataset for data-efficient Hindi language modeling.
Data Splits
| Split | Words | Documents |
|---|---|---|
| Training | 100M | 113,266 |
| Validation | 10M | 180,259 |
| Test | 10M | 180,399 |
Sources
| Source | Proportion | Description |
|---|---|---|
| IndicCorp V2 | ~50% | Curated news and general web text |
| Hindi Wikipedia | ~30% | Encyclopedia and reference material |
| IndicDialogue | ~15% | Movie and TV show subtitles |
| Children's Literature | ~5% | Stories and educational content |
Data Quality Pipeline
- Unicode normalization (NFC)
- Length filtering (30–2,000 characters)
- Language detection (≥80% Devanagari script)
- Word count filtering (2–10,000 words per document)
- Near-duplicate removal (MinHash LSH, 256 permutations, threshold: 0.8)
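The length, word-count, and script filters above can be sketched as a single predicate (thresholds taken from the list; the function names are illustrative):

```python
def devanagari_ratio(text: str) -> float:
    """Fraction of alphabetic characters in the Devanagari block (U+0900-U+097F)."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum("\u0900" <= c <= "\u097f" for c in letters) / len(letters)

def keep_document(text: str) -> bool:
    """Apply the length, word-count, and script filters from the pipeline."""
    n_words = len(text.split())
    return (30 <= len(text) <= 2_000
            and 2 <= n_words <= 10_000
            and devanagari_ratio(text) >= 0.80)
```

Unicode normalization (NFC) and MinHash-LSH deduplication would run before and after these checks, respectively.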
⚙️ Training Configuration
| Parameter | Value |
|---|---|
| Optimizer | AdamW (β₁=0.9, β₂=0.999) |
| Learning Rate | 3e-4 |
| LR Schedule | Cosine with warmup |
| Batch Size | 32 × 8 gradient accumulation = 256 effective |
| Epochs | 10 |
| Weight Decay | 0.01 |
| Gradient Clipping | max_norm = 1.0 |
| Mixed Precision | BF16 |
| Hardware | NVIDIA GPU (LRZ HPC cluster) |
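These hyperparameters map directly onto transformers' `TrainingArguments`; a sketch under the assumption that the warmup fraction (not reported above) is 10%, with an illustrative output directory name:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="hindi-babylm-gpt2",   # illustrative name
    learning_rate=3e-4,
    adam_beta1=0.9,
    adam_beta2=0.999,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                 # assumption: warmup fraction not reported
    per_device_train_batch_size=32,
    gradient_accumulation_steps=8,    # 32 × 8 = 256 effective batch size
    num_train_epochs=10,
    weight_decay=0.01,
    max_grad_norm=1.0,
    bf16=True,
)
```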
⚠️ Limitations and Biases
- Limited training data: Models trained on 10M words are not competitive with full-scale language models. They are designed for research on data-efficient learning, not production use.
- Hindi only: No multilingual capability. Performance on code-mixed Hindi-English text is untested.
- Source biases: The training corpus over-represents formal written Hindi (news, Wikipedia) and movie dialogue. Spoken Hindi, regional dialects, and informal text are underrepresented.
- Character-level tokenization failure: The character-level tokenizer achieves only 18.3% on MultiBLiMP (below chance), indicating that character-level representations alone are insufficient for capturing Hindi syntactic patterns at this data scale.
- Evaluation scope: IndicGLUE covers 7 tasks (WinogradNLI skipped due to data issues). Results may not generalize to all Hindi NLP applications.
📚 Citation
If you use these models or the dataset, please cite:
@misc{talreja2025hindibabylm,
  title={Hindi BabyLM: Data-Efficient Language Modeling for Hindi},
  author={Talreja, Ayush},
  year={2025},
  howpublished={\url{https://huggingface.co/Ayush-Talreja/hindi-babylm}},
  note={BabyLM Challenge adaptation for morphologically rich languages}
}
🙏 Acknowledgments
- AI4Bharat for IndicCorp, IndicGLUE, and Hindi NLP resources
- BabyLM Challenge organizers for the research motivation
- Hugging Face for model hosting and community support
- LRZ (Leibniz-Rechenzentrum) for HPC compute resources
⚖️ License
This project is licensed under the MIT License — see the LICENSE file for details.
📧 Contact
- Email: ayushtalreja1@gmail.com
- LinkedIn: Ayush Kumar Talreja