HinDiffusionLM: A Diffusion Language Model for Hindi
This project turns a BERT-based model into an instruction-tuned, LLaDA-style diffusion LLM by fine-tuning it on Hindi instruction data with a masked language modeling objective and diffusion-style generation. The model learns to iteratively denoise masked tokens to generate coherent responses in Hindi. Training ran on Kaggle with two T4 GPUs.
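The iterative denoising idea can be illustrated with a minimal, dependency-free sketch. It is not the project's training or inference code: `forward_mask`, `denoise`, and `toy_predict` are hypothetical names, `toy_predict` stands in for the fine-tuned masked LM, and the schedule (commit the most confident predictions, spread evenly over the remaining steps) is an assumed simplification of LLaDA-style sampling.

```python
import random

MASK = "[MASK]"

def forward_mask(tokens, t, rng):
    """Noising step: mask each response token independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]

def denoise(prompt, length, predict, steps):
    """Start from a fully masked response and fill it in over several steps."""
    response = [MASK] * length
    for step in range(steps):
        masked = [i for i, tok in enumerate(response) if tok == MASK]
        if not masked:
            break
        # predict maps each masked position to a (token, confidence) pair.
        preds = predict(prompt, response)
        # Commit an even share of the remaining masked positions (ceiling division).
        k = -(-len(masked) // (steps - step))
        # Keep the k most confident predictions; the rest stay masked for later steps.
        for i in sorted(masked, key=lambda i: preds[i][1], reverse=True)[:k]:
            response[i] = preds[i][0]
    return response

def toy_predict(prompt, response):
    """Hypothetical stand-in for the model: fixed answer, earlier positions more confident."""
    target = ["नमस्ते", "दुनिया", "!"]
    return {i: (target[i], 1.0 / (i + 1))
            for i, tok in enumerate(response) if tok == MASK}

if __name__ == "__main__":
    noisy = forward_mask(["नमस्ते", "दुनिया", "!"], t=0.5, rng=random.Random(0))
    print("noised:  ", " ".join(noisy))
    print("denoised:", " ".join(denoise(["hello"], 3, toy_predict, steps=3)))
```

In a real setup, `predict` would run the masked LM over the prompt plus the partially masked response and return the top token and its probability at each masked position.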
Experiments
Models Evaluated
| Model | Performance |
|---|---|
| google/muril-base-cased | Best |
| google/muril-large-cased | Poor |
| ai4bharat/indic-bert | Moderate |
Datasets Tested
| Dataset | Subset | Status | Notes |
|---|---|---|---|
| ai4bharat/indic-instruct-data-v0.1 | anudesh | Used | Primary dataset for demonstration |
| ai4bharat/indic-instruct-data-v0.1 | lm_sys | Skipped | Too time-intensive for training & GPU constraints |