# Telugu Diffusion Language Model

## Overview
The first diffusion-based language model for Telugu. This model explores masked diffusion language modeling for Telugu text generation, adapting the MDLM (Masked Diffusion Language Model) architecture to an Indic language.

Unlike traditional autoregressive models (GPT-style), which generate text left to right one token at a time, this model generates text through iterative denoising: it starts from a completely masked sequence and progressively reveals tokens over a series of steps.
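As a toy sketch of that idea (pure Python, no model): every position starts as a mask token and a schedule reveals more positions at each step. In the real sampler, the revealed values come from the denoiser's predictions rather than a known target; the values and schedule below are illustrative only.

```python
MASK = -1                            # stand-in for the tokenizer's mask token id
target = [5, 9, 2, 7]                # stand-in for the token ids of a finished answer
x = [MASK] * len(target)             # t = 1: everything masked
schedule = [[1], [0, 3], [2]]        # positions revealed at each denoising step
for positions in schedule:
    for p in positions:
        x[p] = target[p]             # the real model *predicts* these tokens
print(x)  # [5, 9, 2, 7]
```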
## Key Features

- **First Telugu Diffusion LM**: Novel application of diffusion modeling to Telugu
- **Based on IndicBERTv2**: Leverages strong Telugu language understanding
- **Question Answering**: Fine-tuned on 93K Telugu Q&A pairs
- **Bidirectional Context**: Unlike autoregressive models, can condition on the full sequence
- **Research Preview**: See the limitations section below
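The bidirectional-context point can be made concrete with attention masks (a minimal sketch, not this model's actual code): a masked-diffusion denoiser lets every position attend to the whole sequence, while a causal (GPT-style) model restricts each position to earlier ones.

```python
n = 4
# Bidirectional: position i may attend to every position j
full = [[True] * n for _ in range(n)]
# Causal/autoregressive: position i may attend only to j <= i
causal = [[j <= i for j in range(n)] for i in range(n)]
print(full[0][3], causal[0][3])  # True False
```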
## Model Details
| Attribute | Value |
|---|---|
| Architecture | Masked Diffusion Language Model (MDLM) |
| Base Model | IndicBERTv2-MLM-only |
| Parameters | ~110M |
| Max Context | 1024 tokens (extended from 512) |
| Training Data | IndicVault Telugu Q&A (93K pairs) |
| Languages | Telugu (primary), English (limited) |
| Task | Conditional text generation (Q&A) |
## Training Details

### Phase 1: Pretraining (Completed)
- Base model: IndicBERTv2-MLM-only
- Extended position embeddings (512 → 1024)
- Continued pretraining on Telugu text
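Extending from 512 to 1024 positions amounts to enlarging the learned position-embedding table. A minimal sketch, tiling the original table into the new slots (the tiling choice and hidden size here are illustrative, not necessarily what was done for this checkpoint):

```python
import torch

hidden = 768                      # IndicBERTv2-scale hidden size (illustrative)
old = torch.randn(512, hidden)    # stand-in for the pretrained position embeddings
new = torch.empty(1024, hidden)
new[:512] = old                   # positions 0-511 keep their pretrained vectors
new[512:] = old                   # positions 512-1023 reuse the old table (one simple choice)
```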
### Phase 2: SFT (Current)
- Dataset: maya-research/IndicVault (Telugu subset, ~50K filtered pairs)
- Objective: Diffusion-based instruction following
- Special tokens: `<BOS>`, `<EOS>`, `<START_ID>`, `<END_ID>`, `<EOT_ID>`
- Best validation loss: 2.60

**Key innovation**: The query loss mask is padded with 1s (instead of 0s), so the model learns to predict `<EOS>` tokens after the answer, preventing repetitive generation.
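A minimal sketch of that trick (the function name and shapes are illustrative, not the repository's actual code): the loss mask marks which positions the diffusion objective trains on, and the padding region after the answer is left at 1 rather than zeroed out.

```python
import torch

def build_loss_mask(prompt_len: int, seq_len: int) -> torch.Tensor:
    """1 = position contributes to the diffusion loss, 0 = prompt (kept fixed)."""
    mask = torch.ones(seq_len, dtype=torch.long)
    mask[:prompt_len] = 0
    # A conventional setup would also zero out padding after the answer:
    #     mask[prompt_len + answer_len:] = 0
    # Leaving those positions at 1 forces the model to predict <EOS>/padding
    # after the answer, which is what curbs runaway repetitive generation.
    return mask

print(build_loss_mask(3, 8).tolist())  # [0, 0, 0, 1, 1, 1, 1, 1]
```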
## Usage

### Installation

```bash
pip install torch transformers safetensors streamlit
```

### Basic Inference
```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load model and tokenizer
model_name = "Prahaladha/telugu-diffusion-lm"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Prepare input (example question: "What is the capital of Telangana?")
question = "తెలంగాణ రాజధాని ఏమిటి?"
chat = [{"role": "user", "content": question}]

# Apply chat template
result = tokenizer.apply_chat_template(
    chat,
    tokenize=True,
    add_generation_prompt=True,
)
prompt_ids = result if isinstance(result, list) else result["input_ids"]

# Prepare a fully masked sequence with the prompt filled in at the front
seq_len = 128
prompt_len = len(prompt_ids)
x = torch.full((1, seq_len), tokenizer.mask_token_id, dtype=torch.long, device=device)
x[0, :prompt_len] = torch.tensor(prompt_ids, device=device)
mask = torch.ones((1, seq_len), dtype=torch.bool, device=device)
mask[0, :prompt_len] = False  # prompt positions are never resampled
attn_mask = torch.ones((1, seq_len), dtype=torch.long, device=device)

# Diffusion generation: walk the schedule from t = 1 (all masked) to t = 0
num_steps = 64
temperature = 0.7
times = torch.linspace(1.0, 0.0, num_steps + 1, device=device)
for t, s in zip(times[:-1], times[1:]):
    with torch.no_grad():
        logits = model(input_ids=x, attention_mask=attn_mask).logits

    # Sample all currently masked positions
    if mask.any():
        masked_logits = logits[mask] / temperature
        probs = torch.softmax(masked_logits, dim=-1)
        sampled = torch.multinomial(probs, num_samples=1).squeeze(-1)
        x[mask] = sampled

    # Remask a random s/t fraction of those positions (random remasking
    # strategy); at the final step (s == 0) everything stays revealed
    if s > 0:
        mask = mask & (torch.rand(mask.shape, device=device) < s / t)
        x[mask] = tokenizer.mask_token_id

# Decode the answer, truncating at the first <EOS>
answer_ids = x[0, prompt_len:].tolist()
if tokenizer.eos_token_id in answer_ids:
    answer_ids = answer_ids[:answer_ids.index(tokenizer.eos_token_id)]
answer = tokenizer.decode(answer_ids, skip_special_tokens=True)
print(f"Q: {question}")
print(f"A: {answer}")
```
### Streamlit Demo

A full interactive demo is available in the repository:

```bash
streamlit run streamlit_app.py
```
## Performance

### Strengths

- **In-domain performance**: Good on questions similar to the training data (IndicVault)
- **Telugu script**: Handles Telugu characters properly
- **Structured output**: Follows the Q&A format consistently
### Limitations

- **Out-of-domain generalization**: Poor performance on simple prompts outside the training distribution
- **Limited coverage**: Training on ~50K pairs may not cover diverse topics
- **Repetition**: Can generate repetitive text without early stopping
- **Morphological complexity**: Telugu's agglutinative nature poses challenges
- **Inference speed**: Slower than autoregressive models (requires multiple diffusion steps)
### Comparison with Autoregressive Models
| Aspect | This Model (Diffusion) | Autoregressive (GPT-style) |
|---|---|---|
| Generation | Bidirectional, iterative | Left-to-right, sequential |
| Speed | Slower (64+ steps) | Faster (1 step per token) |
| Context | Full sequence | Previous tokens only |
| Training | Complex (diffusion objective) | Simpler (next-token prediction) |
| Telugu Performance | Experimental | More mature |
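A back-of-envelope way to see the speed gap (illustrative numbers matching the table; this assumes the diffusion sampler recomputes the full sequence every step, while an autoregressive model with a KV cache processes one new position per step):

```python
seq_len = 128           # generated sequence length
diffusion_steps = 64    # denoising steps
diffusion_cost = diffusion_steps * seq_len  # token positions processed per answer
ar_cost = seq_len                           # one cached step per generated token
print(diffusion_cost, ar_cost)  # 8192 128
```

Wall-clock times depend on batch size and hardware, but the position-count gap is why multi-step diffusion sampling lags autoregressive decoding at these lengths.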
## Intended Use

### Primary Use Cases

- **Research**: Exploring diffusion models for morphologically rich languages
- **Experimentation**: Understanding Telugu NLP with non-autoregressive approaches
- **Benchmarking**: Comparing diffusion vs. autoregressive generation for Indic languages
### Out of Scope

- Production deployments (research preview only)
- Safety-critical applications
- General-purpose Telugu generation (limited to the Q&A domain)
## Training Hyperparameters

```yaml
# SFT phase
base_model: ai4bharat/IndicBERTv2-MLM-only
max_length: 1024
batch_size: 32
learning_rate: 5e-5
warmup_steps: 500
training_samples: ~50000
validation_loss: 2.60
optimizer: AdamW
scheduler: linear with warmup
```
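The "linear with warmup" schedule can be sketched as follows (`total_steps` is a hypothetical value for illustration; the actual training length is not stated above):

```python
def lr_at(step, base_lr=5e-5, warmup_steps=500, total_steps=10_000):
    # Linear warmup from 0 to base_lr, then linear decay back to 0.
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```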
## Datasets
- Training: maya-research/IndicVault (Telugu Q&A pairs, filtered)
- Pretraining: CC-100 Telugu, IndicCorp (via IndicBERTv2)
## Acknowledgments
- AI4Bharat for IndicBERTv2-MLM-only base model
- Maya Research for IndicVault dataset
- MDLM paper (Sahoo et al., NeurIPS 2024) for the diffusion framework
## Citation
If you use this model in your research, please cite:
The base model and framework:
```bibtex
@inproceedings{kakwani2020indicnlpsuite,
  title={{IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages}},
  author={Divyanshu Kakwani and Anoop Kunchukuttan and Satish Golla and Gokul N.C. and Avik Bhattacharyya and Mitesh M. Khapra and Pratyush Kumar},
  booktitle={Findings of EMNLP},
  year={2020},
}

@article{sahoo2024simple,
  title={Simple and Effective Masked Diffusion Language Models},
  author={Sahoo, Subham Sekhar and Arriola, Marianne and Schiff, Yair and Gokaslan, Aaron and Marroquin, Edgar and Chiang, Justin T and Rush, Alexander and Kuleshov, Volodymyr},
  journal={arXiv preprint arXiv:2406.07524},
  year={2024}
}
```
## Contributing
This is a research project. Contributions, suggestions, and feedback are welcome!
Issues we're working on:
- Improving out-of-domain generalization
- Expanding training data coverage
- Optimizing inference speed
- Better handling of Telugu morphology
## Contact
- Creator: Prahaladha
- Model: Prahaladha/telugu-diffusion-lm
- Issues: Please report on the model repository
*Exploring new frontiers in Indic language generation*