Hindi BabyLM: Data-Efficient Language Modeling for Hindi
📖 Model Overview
A collection of data-efficient Hindi language models trained as part of the BabyLM Challenge adaptation for morphologically rich languages. This repository contains 13 model variants spanning 2 architectures (GPT-2, DeBERTa), 5 tokenizer types (BPE, WordPiece, SentencePiece Unigram, Character-Bigram, Character-Level), and 2 data scales (10M and 100M words).
Research Question: How do tokenization strategies and model architectures affect linguistic competence when training with limited data on a morphologically rich language like Hindi?
Key Findings:
- BPE tokenization with 32K vocabulary achieves the best overall balance of task performance and syntactic competence
- GPT-2 significantly outperforms DeBERTa in the low-data regime (10M words)
- Character-level tokenization catastrophically fails on syntactic evaluation (18.3% MultiBLiMP) despite reasonable task accuracy
- Scaling from 10M to 100M words yields +1.6% IndicGLUE and +4.9% MultiBLiMP improvement
📊 Results
GPT-2 Models (10M words, 110M parameters)
| Model | Tokenizer | Vocab | IndicGLUE (Avg) | MultiBLiMP (Avg) | Perplexity |
|---|---|---|---|---|---|
| gpt_10M_bpe_32k | BPE | 32K | 60.45% | 87.69% | 129.25 |
| gpt_10M_character_bigram | Char-Bigram | 2K | 59.94% | 88.32% | 6.03 |
| gpt_10M_sentencepiece_unigram | SP Unigram | 32K | 59.82% | 87.29% | 124.56 |
| gpt_10M_bpe_8k | BPE | 8K | 58.53% | 86.51% | 84.15 |
| gpt_10M_wordpiece | WordPiece | 32K | 58.20% | 86.14% | 124.70 |
| gpt_10M_bpe_16k | BPE | 16K | 57.62% | 87.96% | 113.17 |
| gpt_10M_character_level | Char-Level | 200 | 53.84% | 18.32% | 3.93 |
GPT-2 Model (100M words) — Best Overall
| Model | Tokenizer | Vocab | IndicGLUE (Avg) | MultiBLiMP (Avg) | Perplexity |
|---|---|---|---|---|---|
| gpt_100M_large | BPE | 32K | 62.09% | 92.54% | 83.50 |
DeBERTa Models (10M words, 86M parameters)
| Model | Tokenizer | Vocab | IndicGLUE (Avg) | MultiBLiMP (Avg) | Perplexity |
|---|---|---|---|---|---|
| deberta_10M_bpe_32K | BPE | 32K | 47.70% | 69.74% | 616.69 |
| deberta_10M_wordpiece_32K | WordPiece | 32K | 40.92% | 69.69% | 525.93 |
| deberta_10M_sentencepiece_unigram_32K | SP Unigram | 32K | 38.93% | 68.51% | 644.42 |
Note: Perplexity values are not directly comparable across tokenizers with different vocabulary sizes, as the prediction space differs. Character-level models have naturally lower perplexity due to smaller vocabularies.
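One common way to put models with different tokenizers on a common footing is to normalize perplexity to bits per character. A minimal sketch (the token and character counts below are illustrative, not taken from this repository's evaluations):

```python
import math

def bits_per_character(perplexity: float, n_tokens: int, n_chars: int) -> float:
    """Convert token-level perplexity to bits per character.

    Total cross-entropy in bits is n_tokens * log2(perplexity);
    dividing by the character count of the same text yields a
    tokenizer-independent measure.
    """
    return n_tokens * math.log2(perplexity) / n_chars

# Illustrative: a subword tokenizer (few tokens, high perplexity) and a
# character-level one (many tokens, low perplexity) compared on one scale.
print(bits_per_character(128.0, 100, 500))  # 7 bits/token spread over 500 chars
print(bits_per_character(4.0, 500, 500))    # 2 bits/token, one token per char
```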
🏆 Best Model: IndicGLUE Per-Task Breakdown (gpt_100M_large)
| Task | Accuracy |
|---|---|
| BBC Articles Classification | 78.06% |
| Product Review Sentiment | 73.42% |
| Discourse Mode | 73.32% |
| Choice of Plausible Alternatives | 63.64% |
| Movie Review Sentiment | 61.94% |
| Wikipedia Section Title Prediction | 43.70% |
| Cloze-style Multiple-Choice QA | 40.70% |
| Average (7 tasks) | 62.09% |
WinogradNLI was skipped as the dataset contains only the entailment class in the train/validation splits.
🧠 MultiBLiMP Syntactic Evaluation (gpt_100M_large)
| Phenomenon | Accuracy | Correct / Total |
|---|---|---|
| Subject-Verb Agreement: Person (SV-P) | 96.60% | 398 / 412 |
| Subject-Predicate Agreement: Number (SP-#) | 95.00% | 95 / 100 |
| Subject-Predicate Agreement: Gender (SP-G) | 92.66% | 101 / 109 |
| Subject-Verb Agreement: Gender (SV-G) | 90.21% | 378 / 419 |
| Subject-Verb Agreement: Number (SV-#) | 88.21% | 359 / 407 |
| Overall (1,447 pairs) | 92.54% | — |
🏗️ Architecture Details
GPT-2 Small (Primary Model)
- Type: Causal Language Model (autoregressive)
- Parameters: 110M
- Layers: 12 | Hidden Size: 768 | Attention Heads: 12
- Activation: GELU
- Context Length: 512 tokens
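These dimensions match a standard GPT-2 Small; a sketch of how such a model might be instantiated with transformers (vocabulary size shown for the 32K BPE variant; the exact GELU variant is an assumption, as it is not specified above):

```python
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=32_000,               # 32K BPE tokenizer variant
    n_positions=512,                 # context length
    n_embd=768,                      # hidden size
    n_layer=12,
    n_head=12,
    activation_function="gelu_new",  # assumption: GELU variant not specified
)
model = GPT2LMHeadModel(config)      # ~110M parameters at this vocab size
```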
DeBERTa Small
- Type: Masked Language Model (bidirectional)
- Parameters: 86M
- Layers: 12 | Hidden Size: 768
- Key Feature: Disentangled attention mechanism
💻 Usage
```python
import torch
from transformers import GPT2LMHeadModel, PreTrainedTokenizerFast

# Load model and tokenizer
model = GPT2LMHeadModel.from_pretrained("Ayush-Talreja/hindi-babylm")
tokenizer = PreTrainedTokenizerFast.from_pretrained("Ayush-Talreja/hindi-babylm")

# Generate text
input_text = "भारत एक"
input_ids = tokenizer.encode(input_text, return_tensors="pt")

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=50,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        temperature=0.8,
    )

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
```
📦 Dataset
Hindi BabyLM Corpus — A curated, multi-source dataset for data-efficient Hindi language modeling.
Data Splits
| Split | Words | Documents |
|---|---|---|
| Training | 100M | 113,266 |
| Validation | 10M | 180,259 |
| Test | 10M | 180,399 |
Sources
| Source | Proportion | Description |
|---|---|---|
| IndicCorp V2 | ~50% | Curated news and general web text |
| Hindi Wikipedia | ~30% | Encyclopedia and reference material |
| IndicDialogue | ~15% | Movie and TV show subtitles |
| Children's Literature | ~5% | Stories and educational content |
Data Quality Pipeline
- Unicode normalization (NFC)
- Length filtering (30–2,000 characters)
- Language detection (≥80% Devanagari script)
- Word count filtering (2–10,000 words per document)
- Near-duplicate removal (MinHash LSH, 256 permutations, threshold: 0.8)
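The length, word-count, and script filters above can be sketched as a single predicate (thresholds taken from the list; the function names are illustrative):

```python
def devanagari_ratio(text: str) -> float:
    """Fraction of alphabetic characters in the Devanagari block (U+0900-U+097F)."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum("\u0900" <= c <= "\u097f" for c in letters) / len(letters)

def keep_document(text: str) -> bool:
    """Apply the length, word-count, and script filters from the pipeline."""
    n_words = len(text.split())
    return (30 <= len(text) <= 2_000
            and 2 <= n_words <= 10_000
            and devanagari_ratio(text) >= 0.80)
```

Unicode normalization (NFC) and MinHash-LSH deduplication would run before and after these checks, respectively.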
⚙️ Training Configuration
| Parameter | Value |
|---|---|
| Optimizer | AdamW (β₁=0.9, β₂=0.999) |
| Learning Rate | 3e-4 |
| LR Schedule | Cosine with warmup |
| Batch Size | 32 × 8 gradient accumulation = 256 effective |
| Epochs | 10 |
| Weight Decay | 0.01 |
| Gradient Clipping | max_norm = 1.0 |
| Mixed Precision | BF16 |
| Hardware | NVIDIA GPU (LRZ HPC cluster) |
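These hyperparameters map directly onto transformers' `TrainingArguments`; a sketch under the assumption that the warmup fraction (not reported above) is 10%, with an illustrative output directory name:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="hindi-babylm-gpt2",   # illustrative name
    learning_rate=3e-4,
    adam_beta1=0.9,
    adam_beta2=0.999,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                 # assumption: warmup fraction not reported
    per_device_train_batch_size=32,
    gradient_accumulation_steps=8,    # 32 × 8 = 256 effective batch size
    num_train_epochs=10,
    weight_decay=0.01,
    max_grad_norm=1.0,
    bf16=True,
)
```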
⚠️ Limitations and Biases
- Limited training data: Models trained on 10M words are not competitive with full-scale language models. They are designed for research on data-efficient learning, not production use.
- Hindi only: No multilingual capability. Performance on code-mixed Hindi-English text is untested.
- Source biases: The training corpus over-represents formal written Hindi (news, Wikipedia) and movie dialogue. Spoken Hindi, regional dialects, and informal text are underrepresented.
- Character-level tokenization failure: The character-level tokenizer achieves only 18.3% on MultiBLiMP (below chance), indicating that character-level representations alone are insufficient for capturing Hindi syntactic patterns at this data scale.
- Evaluation scope: IndicGLUE covers 7 tasks (WinogradNLI skipped due to data issues). Results may not generalize to all Hindi NLP applications.
📚 Citation
If you use these models or the dataset, please cite:
@misc{talreja2025hindibabylm,
  title={Hindi BabyLM: Data-Efficient Language Modeling for Hindi},
  author={Talreja, Ayush},
  year={2025},
  howpublished={\url{https://huggingface.co/Ayush-Talreja/hindi-babylm}},
  note={BabyLM Challenge adaptation for morphologically rich languages}
}
🙏 Acknowledgments
- AI4Bharat for IndicCorp, IndicGLUE, and Hindi NLP resources
- BabyLM Challenge organizers for the research motivation
- Hugging Face for model hosting and community support
- LRZ (Leibniz-Rechenzentrum) for HPC compute resources
⚖️ License
This project is licensed under the MIT License — see the LICENSE file for details.
📧 Contact
- Email: ayushtalreja1@gmail.com
- LinkedIn: Ayush Kumar Talreja