Indian SLM — Hindi Foundational Model

A LLaMA-style decoder-only language model trained from scratch on Hindi text.

Model Details

Property          Value
----------------  --------------------------------------------------
Architecture      LLaMA-style (RMSNorm, RoPE, GQA, SwiGLU)
Parameters        ~31M
Layers            6
Hidden dim        512
Attention heads   8 Q / 4 KV (GQA)
FFN hidden dim    1024
Vocab size        32,000
Max seq length    512
Training steps    400
Dataset           Hindi Wikipedia (wikimedia/wikipedia, 20231101.hi)
Tokenizer         SentencePiece BPE (32k vocab)
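
The ~31M figure is consistent with a quick count from the dimensions above. The sketch below is a back-of-envelope estimate under stated assumptions (head dim 64, tied LM head, norm parameters ignored), not the actual training code:

# Back-of-envelope parameter count from the table above.
# Assumptions: head_dim = 64, tied LM head, norm params ignored.
vocab, d_model, n_layers, d_ffn = 32_000, 512, 6, 1024
n_q_heads, n_kv_heads, head_dim = 8, 4, 64

embedding = vocab * d_model                       # shared with LM head (weight tying)
attn = d_model * (n_q_heads * head_dim)           # W_q
attn += 2 * d_model * (n_kv_heads * head_dim)     # W_k, W_v (GQA: fewer KV heads)
attn += (n_q_heads * head_dim) * d_model          # W_o
ffn = 3 * d_model * d_ffn                         # SwiGLU: gate, up, down projections
total = embedding + n_layers * (attn + ffn)
print(f"{total / 1e6:.1f}M parameters")           # ~30.5M, i.e. ~31M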

Architecture

  • RMSNorm instead of LayerNorm
  • RoPE (Rotary Position Embeddings) for position encoding
  • GQA (Grouped Query Attention) — 8 Q heads, 4 KV heads (see the sketch after this list)
  • SwiGLU feed-forward activation
  • Weight tying between embedding and LM head
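
As a concrete illustration of the RMSNorm and GQA items above, here is a minimal PyTorch sketch; it is illustrative only, and the function names and tensor shapes are assumptions, not this repository's source:

import torch

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: rescale by the root mean square of the features.
    # Unlike LayerNorm there is no mean-centering and no bias term.
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * weight

def expand_kv_for_gqa(k, v, n_q_heads=8, n_kv_heads=4):
    # GQA: 8 query heads share 4 key/value heads, so each KV head is
    # repeated twice to line up with the queries. Shapes are assumed
    # to be (batch, n_kv_heads, seq_len, head_dim).
    repeats = n_q_heads // n_kv_heads
    return k.repeat_interleave(repeats, dim=1), v.repeat_interleave(repeats, dim=1)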

Evaluation (base model, no fine-tuning)

Metric              Value
------------------  -------
Overall Perplexity  391
Top-1 Accuracy      20.75%
Top-5 Accuracy      32.37%
Vocab Coverage      100%
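
Perplexity here is the exponential of the average next-token cross-entropy. The card does not publish the exact evaluation script or data split, so the following is only a sketch of how such a number can be computed with transformers:

import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("way2hemanthkumar/indian-slm-hindi-30m")
model = AutoModelForCausalLM.from_pretrained("way2hemanthkumar/indian-slm-hindi-30m")
model.eval()

text = "भारत दक्षिण एशिया में स्थित एक देश है।"  # "India is a country in South Asia." — any held-out Hindi text
enc = tok(text, return_tensors="pt")
with torch.no_grad():
    # Passing labels=input_ids makes the model return the mean
    # next-token cross-entropy over the sequence.
    loss = model(**enc, labels=enc["input_ids"]).loss
print(f"perplexity: {math.exp(loss.item()):.1f}")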

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("way2hemanthkumar/indian-slm-hindi-30m")
model = AutoModelForCausalLM.from_pretrained("way2hemanthkumar/indian-slm-hindi-30m")

# Greedy generation from a short Hindi prompt ("India is a ...")
inputs = tokenizer("भारत एक", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
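
Small base models often loop under greedy decoding, so sampling may read better. The parameters below are illustrative defaults, not values tuned for this model:

output = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,    # sample instead of greedy decoding
    temperature=0.8,   # illustrative values, not tuned for this model
    top_p=0.95,
)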

Limitations

This is a base model trained for validation purposes on a small dataset for only 400 steps (see Model Details). It is not instruction-tuned, outputs may be incoherent, and fine-tuning is required for practical use.

Training Details

  • Optimizer: AdamW (β1=0.9, β2=0.95)
  • LR schedule: linear warmup + cosine decay (see the sketch after this list)
  • Gradient clipping: 1.0
  • Gradient accumulation: 4 steps
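
A minimal sketch of that LR schedule follows; the warmup length, peak LR, and minimum LR are placeholders, since the card does not publish the actual values:

import math

def lr_at(step, max_steps=400, warmup=40, peak_lr=3e-4, min_lr=3e-5):
    # Linear warmup to peak_lr, then cosine decay down to min_lr.
    # warmup, peak_lr, and min_lr are illustrative placeholders,
    # not the values used to train this model.
    if step < warmup:
        return peak_lr * step / warmup
    progress = (step - warmup) / (max_steps - warmup)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))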