Telugu Diffusion Language Model

Telugu Diffusion LM Research Preview

🌟 Overview

The first diffusion-based language model for Telugu! This model explores masked diffusion language modeling for Telugu text generation, adapting the MDLM (Masked Diffusion Language Models) architecture for an Indic language.

Unlike traditional autoregressive models (GPT-style), this model generates text through iterative denoising, starting from completely masked sequences and progressively revealing tokens.

Key Features

  • ✅ First Telugu Diffusion LM: Novel application of diffusion modeling to Telugu
  • ✅ Based on IndicBERTv2: Leverages strong Telugu language understanding
  • ✅ Question-Answering: Fine-tuned on 93K Telugu Q&A pairs
  • ✅ Bidirectional Context: Unlike autoregressive models, can consider full context
  • ⚠️ Research Preview: See limitations section below

📊 Model Details

| Attribute | Value |
|---|---|
| Architecture | Masked Diffusion Language Model (MDLM) |
| Base Model | IndicBERTv2-MLM-only |
| Parameters | ~278M |
| Max Context | 1024 tokens (extended from 512) |
| Training Data | IndicVault Telugu Q&A (93K pairs) |
| Languages | Telugu (primary), English (limited) |
| Task | Conditional text generation (Q&A) |

Training Details

Phase 1: Pretraining (Completed)

  • Base model: IndicBERTv2-MLM-only
  • Extended position embeddings (512 → 1024)
  • Continued pretraining on Telugu text

Phase 2: SFT (Current)

  • Dataset: maya-research/IndicVault (Telugu subset, ~50K filtered pairs)
  • Objective: Diffusion-based instruction following
  • Special tokens: <BOS>, <EOS>, <START_ID>, <END_ID>, <EOT_ID>
  • Best validation loss: 2.60

Key Innovation: The query mask over padding is filled with 1s (instead of 0s), so the positions after the answer remain in training and the model learns to predict EOS tokens there, preventing repetitive generation.
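A minimal sketch of how we read this trick (the helper name and mask layout are our assumptions, not the repository's actual training code):

```python
import torch

def build_sft_masks(prompt_len: int, seq_len: int):
    """Sketch of the EOS-padding trick: instead of zeroing out the region
    after the answer, keep its mask set to 1 so those positions stay in the
    diffusion loss and the model learns to emit EOS there."""
    # The diffusion loss is computed only on non-prompt positions
    # (answer tokens plus the trailing region after the answer).
    loss_mask = torch.zeros(seq_len, dtype=torch.bool)
    loss_mask[prompt_len:] = True
    # The "pad with 1s" trick: every position stays visible/trainable.
    attn_mask = torch.ones(seq_len, dtype=torch.long)
    return loss_mask, attn_mask
```

With an all-ones attention mask, the post-answer region is not silently ignored, so the only consistent target for the model there is EOS.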

🚀 Usage

Installation

pip install torch transformers safetensors streamlit

Basic Inference

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer
from safetensors.torch import load_file

# Load model and tokenizer
model_name = "Prahaladha/telugu-diffusion-lm"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Prepare input ("How do I include health care expenses in my personal budget?")
question = "హెల్త్ కేర్ ఎక్స్పెన్సెస్ ని నా పర్సనల్ బడ్జెట్ లో ఎలా ఇంక్లూడ్ చేసుకోవాలి?"
chat = [{"role": "user", "content": question}]

# Apply chat template
result = tokenizer.apply_chat_template(
    chat,
    tokenize=True,
    add_generation_prompt=True,
)
prompt_ids = result if isinstance(result, list) else result["input_ids"]

# Prepare masked sequence
seq_len = 128
prompt_len = len(prompt_ids)
x = torch.full((1, seq_len), tokenizer.mask_token_id, dtype=torch.long, device=device)
x[0, :prompt_len] = torch.tensor(prompt_ids, device=device)

mask = torch.ones((1, seq_len), dtype=torch.bool, device=device)
mask[0, :prompt_len] = False

attn_mask = torch.ones((1, seq_len), dtype=torch.long, device=device)

# Diffusion generation
num_steps = 64
temperature = 0.7

times = torch.linspace(1.0, 0.0, num_steps + 1, device=device)

for t, s in zip(times[:-1], times[1:]):
    with torch.no_grad():
        logits = model(input_ids=x, attention_mask=attn_mask).logits
    
    # Sample masked positions
    if mask.any():
        masked_logits = logits[mask] / temperature
        probs = torch.softmax(masked_logits, dim=-1)
        sampled = torch.multinomial(probs, num_samples=1).squeeze(-1)
        x[mask] = sampled
    
    # Remask (random strategy)
    if s > 0:
        mask = mask & (torch.rand_like(mask, dtype=torch.float) < s / t)
        x[mask] = tokenizer.mask_token_id

# Decode answer
answer_ids = x[0, prompt_len:].tolist()
answer_tokens = [tid for tid in answer_ids 
                 if tid not in [tokenizer.mask_token_id, tokenizer.pad_token_id, tokenizer.eos_token_id]]

answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)
print(f"Q: {question}")
print(f"A: {answer}")
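The loop above remasks randomly; MDLM-style samplers often remask by confidence instead, re-hiding the tokens the model was least sure about. A sketch of that variant (`remask_low_confidence` is a hypothetical helper, not part of this repo):

```python
import torch

def remask_low_confidence(x, mask, logits, keep_frac, mask_token_id):
    """Re-mask the lowest-confidence sampled tokens instead of a random
    subset. keep_frac is the fraction of currently masked positions to
    leave revealed at this step."""
    probs = torch.softmax(logits, dim=-1)
    # Confidence = probability the model assigned to the sampled token.
    conf = probs.gather(-1, x.unsqueeze(-1)).squeeze(-1)
    conf = conf.masked_fill(~mask, float("inf"))  # never re-mask fixed tokens
    num_masked = int(mask.sum())
    num_remask = num_masked - int(keep_frac * num_masked)
    if num_remask <= 0:
        return x, torch.zeros_like(mask)
    # Flat indices of the least confident positions.
    idx = conf.flatten().topk(num_remask, largest=False).indices
    new_mask = torch.zeros_like(mask).flatten()
    new_mask[idx] = True
    new_mask = new_mask.view_as(mask)
    return x.masked_fill(new_mask, mask_token_id), new_mask
```

In the loop above this would replace the random remasking branch, with `keep_frac = 1 - s / t`.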

Streamlit Demo

A full interactive demo is available in the repository:

streamlit run streamlit_app.py

📈 Performance

Strengths

  • ✅ In-domain Performance: Good on questions similar to training data (IndicVault)
  • ✅ Telugu Script: Handles Telugu characters properly
  • ✅ Structured Output: Follows Q&A format consistently

Limitations

  • ⚠️ Out-of-domain Generalization: Poor performance on simple prompts outside the training distribution
  • ⚠️ Limited Coverage: Training on 50K pairs may not cover diverse topics
  • ⚠️ Repetition: Can generate repetitive text without early stopping
  • ⚠️ Morphological Complexity: Telugu's agglutinative nature poses challenges
  • ⚠️ Inference Speed: Slower than autoregressive models (requires multiple diffusion steps)
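The repetition issue can be mitigated at decode time by truncating the answer at the first EOS token, since the model is trained to emit EOS after the answer. A small sketch (`truncate_at_eos` is an illustrative helper, not repository code):

```python
def truncate_at_eos(token_ids, eos_token_id):
    """Cut a decoded answer at the first EOS token, dropping any
    repetitive tail the sampler may have produced after it."""
    if eos_token_id in token_ids:
        return token_ids[: token_ids.index(eos_token_id)]
    return token_ids
```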

Comparison with Autoregressive Models

| Aspect | This Model (Diffusion) | Autoregressive (GPT-style) |
|---|---|---|
| Generation | Bidirectional, iterative | Left-to-right, sequential |
| Speed | Slower (64+ steps) | Faster (1 step per token) |
| Context | Full sequence | Previous tokens only |
| Training | Complex (diffusion objective) | Simpler (next-token prediction) |
| Telugu Performance | Experimental | More mature |

🎯 Intended Use

Primary Use Cases

  • 🔬 Research: Exploring diffusion models for morphologically rich languages
  • 📚 Experimentation: Understanding Telugu NLP with non-autoregressive approaches
  • 🧪 Benchmarking: Comparing diffusion vs. autoregressive generation for Indic languages

Out of Scope

  • ❌ Production deployments (research preview only)
  • ❌ Safety-critical applications
  • ❌ General-purpose Telugu generation (limited to the Q&A domain)

🔧 Training Hyperparameters

# SFT Phase
base_model: ai4bharat/IndicBERTv2-MLM-only
max_length: 1024
batch_size: 32
learning_rate: 5e-5
warmup_steps: 500
training_samples: ~50000
validation_loss: 2.60
optimizer: AdamW
scheduler: linear with warmup
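The `linear with warmup` schedule can be reproduced with a plain `LambdaLR`; a sketch matching the config above, assuming a hypothetical total step count of 5000 (not listed in the config):

```python
import torch

def linear_warmup_schedule(optimizer, warmup_steps, total_steps):
    """Linear warmup to the peak learning rate, then linear decay to zero."""
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# AdamW at 5e-5 with 500 warmup steps, as in the config above.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.AdamW(params, lr=5e-5)
scheduler = linear_warmup_schedule(optimizer, warmup_steps=500, total_steps=5000)
```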

📚 Datasets

  • Training: maya-research/IndicVault (Telugu Q&A pairs, filtered)
  • Pretraining: CC-100 Telugu, IndicCorp (via IndicBERTv2)

πŸ™ Acknowledgments

  • AI4Bharat for IndicBERTv2-MLM-only base model
  • Maya Research for IndicVault dataset
  • MDLM paper (Sahoo et al., NeurIPS 2024) for the diffusion framework

📄 Citation

If you use this model in your research, please cite:

The base model and framework:

@inproceedings{kakwani2020indicnlpsuite,
    title={{IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages}},
    author={Divyanshu Kakwani and Anoop Kunchukuttan and Satish Golla and Gokul N.C. and Avik Bhattacharyya and Mitesh M. Khapra and Pratyush Kumar},
    year={2020},
    booktitle={Findings of EMNLP},
}

@article{sahoo2024simple,
  title={Simple and Effective Masked Diffusion Language Models},
  author={Sahoo, Subham Sekhar and Arriola, Marianne and Schiff, Yair and Gokaslan, Aaron and Marroquin, Edgar and Chiang, Justin T and Rush, Alexander and Kuleshov, Volodymyr},
  journal={arXiv preprint arXiv:2406.07524},
  year={2024}
}

🤝 Contributing

This is a research project. Contributions, suggestions, and feedback are welcome!

Issues we're working on:

  • Improving out-of-domain generalization
  • Expanding training data coverage
  • Optimizing inference speed
  • Better handling of Telugu morphology

📧 Contact


First Telugu Diffusion Language Model 🌸
Exploring new frontiers in Indic language generation