Telugu Diffusion Language Model

Telugu Diffusion LM Research Preview

🌟 Overview

The first diffusion-based language model for Telugu! This model explores masked diffusion language modeling for Telugu text generation, adapting the MDLM (Masked Diffusion Language Models) architecture for an Indic language.

Unlike traditional autoregressive models (GPT-style), this model generates text through iterative denoising, starting from completely masked sequences and progressively revealing tokens.

Key Features

  • ✅ First Telugu Diffusion LM: Novel application of diffusion modeling to Telugu
  • ✅ Based on IndicBERTv2: Leverages strong Telugu language understanding
  • ✅ Question-Answering: Fine-tuned on 93K Telugu Q&A pairs
  • ✅ Bidirectional Context: Unlike autoregressive models, can consider full context
  • ⚠️ Research Preview: See limitations section below

📊 Model Details

| Attribute | Value |
|---|---|
| Architecture | Masked Diffusion Language Model (MDLM) |
| Base Model | IndicBERTv2-MLM-only |
| Parameters | ~278M |
| Max Context | 1024 tokens (extended from 512) |
| Training Data | IndicVault Telugu Q&A (93K pairs) |
| Languages | Telugu (primary), English (limited) |
| Task | Conditional text generation (Q&A) |

Training Details

Phase 1: Pretraining (Completed)

  • Base model: IndicBERTv2-MLM-only
  • Extended position embeddings (512 → 1024)
  • Continued pretraining on Telugu text

Phase 2: SFT (Current)

  • Dataset: maya-research/IndicVault (Telugu subset, ~50K filtered pairs)
  • Objective: Diffusion-based instruction following
  • Special tokens: <BOS>, <EOS>, <START_ID>, <END_ID>, <EOT_ID>
  • Best validation loss: 2.60

Key Innovation: The query mask over padding is filled with 1s (instead of 0s), so the positions after the answer remain in training and the model learns to predict EOS tokens there, preventing repetitive generation.
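A minimal sketch of how we read this trick (the helper name and mask layout are our assumptions, not the repository's actual training code):

```python
import torch

def build_sft_masks(prompt_len: int, seq_len: int):
    """Sketch of the EOS-padding trick: instead of zeroing out the region
    after the answer, keep its mask set to 1 so those positions stay in the
    diffusion loss and the model learns to emit EOS there."""
    # The diffusion loss is computed only on non-prompt positions
    # (answer tokens plus the trailing region after the answer).
    loss_mask = torch.zeros(seq_len, dtype=torch.bool)
    loss_mask[prompt_len:] = True
    # The "pad with 1s" trick: every position stays visible/trainable.
    attn_mask = torch.ones(seq_len, dtype=torch.long)
    return loss_mask, attn_mask
```

With an all-ones attention mask, the post-answer region is not silently ignored, so the only consistent target for the model there is EOS.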

🚀 Usage

Installation

pip install torch transformers safetensors streamlit

Basic Inference

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer
from safetensors.torch import load_file

# Load model and tokenizer
model_name = "Prahaladha/telugu-diffusion-lm"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Prepare input ("How do I include health care expenses in my personal budget?")
question = "హెల్త్ కేర్ ఎక్స్పెన్సెస్ ని నా పర్సనల్ బడ్జెట్ లో ఎలా ఇంక్లూడ్ చేసుకోవాలి?"
chat = [{"role": "user", "content": question}]

# Apply chat template
result = tokenizer.apply_chat_template(
    chat,
    tokenize=True,
    add_generation_prompt=True,
)
prompt_ids = result if isinstance(result, list) else result["input_ids"]

# Prepare masked sequence
seq_len = 128
prompt_len = len(prompt_ids)
x = torch.full((1, seq_len), tokenizer.mask_token_id, dtype=torch.long, device=device)
x[0, :prompt_len] = torch.tensor(prompt_ids, device=device)

mask = torch.ones((1, seq_len), dtype=torch.bool, device=device)
mask[0, :prompt_len] = False

attn_mask = torch.ones((1, seq_len), dtype=torch.long, device=device)

# Diffusion generation
num_steps = 64
temperature = 0.7

times = torch.linspace(1.0, 0.0, num_steps + 1, device=device)

for t, s in zip(times[:-1], times[1:]):
    with torch.no_grad():
        logits = model(input_ids=x, attention_mask=attn_mask).logits
    
    # Sample masked positions
    if mask.any():
        masked_logits = logits[mask] / temperature
        probs = torch.softmax(masked_logits, dim=-1)
        sampled = torch.multinomial(probs, num_samples=1).squeeze(-1)
        x[mask] = sampled
    
    # Remask (random strategy)
    if s > 0:
        mask = mask & (torch.rand_like(mask, dtype=torch.float) < s / t)
        x[mask] = tokenizer.mask_token_id

# Decode answer
answer_ids = x[0, prompt_len:].tolist()
answer_tokens = [tid for tid in answer_ids 
                 if tid not in [tokenizer.mask_token_id, tokenizer.pad_token_id, tokenizer.eos_token_id]]

answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)
print(f"Q: {question}")
print(f"A: {answer}")
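The loop above remasks randomly; MDLM-style samplers often remask by confidence instead, re-hiding the tokens the model was least sure about. A sketch of that variant (`remask_low_confidence` is a hypothetical helper, not part of this repo):

```python
import torch

def remask_low_confidence(x, mask, logits, keep_frac, mask_token_id):
    """Re-mask the lowest-confidence sampled tokens instead of a random
    subset. keep_frac is the fraction of currently masked positions to
    leave revealed at this step."""
    probs = torch.softmax(logits, dim=-1)
    # Confidence = probability the model assigned to the sampled token.
    conf = probs.gather(-1, x.unsqueeze(-1)).squeeze(-1)
    conf = conf.masked_fill(~mask, float("inf"))  # never re-mask fixed tokens
    num_masked = int(mask.sum())
    num_remask = num_masked - int(keep_frac * num_masked)
    if num_remask <= 0:
        return x, torch.zeros_like(mask)
    # Flat indices of the least confident positions.
    idx = conf.flatten().topk(num_remask, largest=False).indices
    new_mask = torch.zeros_like(mask).flatten()
    new_mask[idx] = True
    new_mask = new_mask.view_as(mask)
    return x.masked_fill(new_mask, mask_token_id), new_mask
```

In the loop above this would replace the random remasking branch, with `keep_frac = 1 - s / t`.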

Streamlit Demo

A full interactive demo is available in the repository:

streamlit run streamlit_app.py

📈 Performance

Strengths

  • ✅ In-domain Performance: Good on questions similar to training data (IndicVault)
  • ✅ Telugu Script: Handles Telugu characters properly
  • ✅ Structured Output: Follows Q&A format consistently

Limitations

  • ⚠️ Out-of-domain Generalization: Poor performance on simple prompts outside the training distribution
  • ⚠️ Limited Coverage: Training on 50K pairs may not cover diverse topics
  • ⚠️ Repetition: Can generate repetitive text without early stopping
  • ⚠️ Morphological Complexity: Telugu's agglutinative nature poses challenges
  • ⚠️ Inference Speed: Slower than autoregressive models (requires multiple diffusion steps)
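The repetition issue can be mitigated at decode time by truncating the answer at the first EOS token, since the model is trained to emit EOS after the answer. A small sketch (`truncate_at_eos` is an illustrative helper, not repository code):

```python
def truncate_at_eos(token_ids, eos_token_id):
    """Cut a decoded answer at the first EOS token, dropping any
    repetitive tail the sampler may have produced after it."""
    if eos_token_id in token_ids:
        return token_ids[: token_ids.index(eos_token_id)]
    return token_ids
```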

Comparison with Autoregressive Models

| Aspect | This Model (Diffusion) | Autoregressive (GPT-style) |
|---|---|---|
| Generation | Bidirectional, iterative | Left-to-right, sequential |
| Speed | Slower (64+ steps) | Faster (1 step per token) |
| Context | Full sequence | Previous tokens only |
| Training | Complex (diffusion objective) | Simpler (next-token prediction) |
| Telugu Performance | Experimental | More mature |

🎯 Intended Use

Primary Use Cases

  • 🔬 Research: Exploring diffusion models for morphologically rich languages
  • 📚 Experimentation: Understanding Telugu NLP with non-autoregressive approaches
  • 🧪 Benchmarking: Comparing diffusion vs. autoregressive generation for Indic languages

Out of Scope

  • ❌ Production deployments (research preview only)
  • ❌ Safety-critical applications
  • ❌ General-purpose Telugu generation (limited to the Q&A domain)

🔧 Training Hyperparameters

# SFT Phase
base_model: ai4bharat/IndicBERTv2-MLM-only
max_length: 1024
batch_size: 32
learning_rate: 5e-5
warmup_steps: 500
training_samples: ~50000
validation_loss: 2.60
optimizer: AdamW
scheduler: linear with warmup
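The `linear with warmup` schedule can be reproduced with a plain `LambdaLR`; a sketch matching the config above, assuming a hypothetical total step count of 5000 (not listed in the config):

```python
import torch

def linear_warmup_schedule(optimizer, warmup_steps, total_steps):
    """Linear warmup to the peak learning rate, then linear decay to zero."""
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# AdamW at 5e-5 with 500 warmup steps, as in the config above.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.AdamW(params, lr=5e-5)
scheduler = linear_warmup_schedule(optimizer, warmup_steps=500, total_steps=5000)
```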

📚 Datasets

  • Training: maya-research/IndicVault (Telugu Q&A pairs, filtered)
  • Pretraining: CC-100 Telugu, IndicCorp (via IndicBERTv2)

πŸ™ Acknowledgments

  • AI4Bharat for IndicBERTv2-MLM-only base model
  • Maya Research for IndicVault dataset
  • MDLM paper (Sahoo et al., NeurIPS 2024) for the diffusion framework

📄 Citation

If you use this model in your research, please cite:

The base model and framework:

@inproceedings{kakwani2020indicnlpsuite,
    title={{IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages}},
    author={Divyanshu Kakwani and Anoop Kunchukuttan and Satish Golla and Gokul N.C. and Avik Bhattacharyya and Mitesh M. Khapra and Pratyush Kumar},
    year={2020},
    booktitle={Findings of EMNLP},
}

@article{sahoo2024simple,
  title={Simple and Effective Masked Diffusion Language Models},
  author={Sahoo, Subham Sekhar and Arriola, Marianne and Schiff, Yair and Gokaslan, Aaron and Marroquin, Edgar and Chiang, Justin T and Rush, Alexander and Kuleshov, Volodymyr},
  journal={arXiv preprint arXiv:2406.07524},
  year={2024}
}

🤝 Contributing

This is a research project. Contributions, suggestions, and feedback are welcome!

Issues we're working on:

  • Improving out-of-domain generalization
  • Expanding training data coverage
  • Optimizing inference speed
  • Better handling of Telugu morphology

📧 Contact


First Telugu Diffusion Language Model 🌸
Exploring new frontiers in Indic language generation