Gurmukh — 370M Punjabi Language Model

Gurmukh is a 370-million-parameter causal language model trained from scratch on Punjabi text. It is the first openly released GPT-2-scale base model dedicated to the Punjabi language, supporting both Gurmukhi script and Romanized Punjabi.

Model Details

| Property | Value |
| --- | --- |
| Model name | Gurmukh |
| Architecture | GPT-2 (GPT2LMHeadModel) |
| Parameters | ~370M |
| Layers | 24 |
| Hidden size | 1024 |
| Attention heads | 16 |
| Context length | 2048 tokens |
| Vocabulary | 64,000 (SentencePiece) |
| Language | Punjabi (Gurmukhi + Romanized) |
| License | Apache 2.0 |
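
As a rough sanity check, the ~370M figure can be reproduced from the numbers in the table. The sketch below assumes the stock GPT-2 layout (learned position embeddings, a 4× MLP expansion, biases on every linear layer, and an output head tied to the token embeddings); the function name is illustrative and not part of the release.

```python
def gpt2_param_count(vocab_size: int, n_positions: int, n_embd: int, n_layer: int) -> int:
    """Count parameters of a standard GPT-2 stack with a tied LM head (assumed layout)."""
    d = n_embd
    token_emb = vocab_size * d                    # token embedding matrix
    pos_emb = n_positions * d                     # learned position embeddings
    attn = (d * 3 * d + 3 * d) + (d * d + d)      # fused QKV projection + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)   # up-projection + down-projection
    layer_norms = 2 * (2 * d)                     # two LayerNorms per block (weight + bias)
    per_layer = attn + mlp + layer_norms
    final_ln = 2 * d                              # final LayerNorm
    return token_emb + pos_emb + n_layer * per_layer + final_ln


total = gpt2_param_count(vocab_size=64_000, n_positions=2048, n_embd=1024, n_layer=24)
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
# prints: 369,944,576 parameters (~370M)
```

Under these assumptions the token embeddings account for about 65.5M of the total, a larger share than in the original GPT-2 because of the 64,000-entry vocabulary.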

Tokenizer

Gurmukh uses a custom SentencePiece BPE tokenizer (punjabi_spm_64k.model) with a 64,000-token vocabulary trained on the same Punjabi corpus. The tokenizer is highly efficient for Gurmukhi script:

| Script | Mean fertility (tokens/word) |
| --- | --- |
| Gurmukhi | 1.105 |
| Mixed (Gurmukhi + English) | 1.030 |
| Romanized Punjabi | 1.333 |

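Fertility is the average number of tokens produced per word. A value near 1.0 means almost every word maps to a single token, which indicates that the vocabulary is well suited to the language. Romanized Punjabi, at 1.333, is split more often: roughly one extra token for every three words.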
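
A minimal sketch of how a fertility figure like these can be computed. It assumes words are whitespace-delimited, which may differ from the procedure used for the table, and it uses a toy stand-in tokenizer so the example runs without the model file. With the real tokenizer, `sp.EncodeAsPieces` from a loaded `SentencePieceProcessor` could be passed as `tokenize` instead.

```python
from typing import Callable, Iterable, List


def mean_fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word."""
    n_words = 0
    n_tokens = 0
    for text in texts:
        n_words += len(text.split())
        n_tokens += len(tokenize(text))
    if n_words == 0:
        raise ValueError("no words in input")
    return n_tokens / n_words


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: splits any word longer than 4 characters in two."""
    pieces: List[str] = []
    for word in text.split():
        if len(word) > 4:
            pieces.extend([word[:4], word[4:]])
        else:
            pieces.append(word)
    return pieces


# Illustrative sample only; the numbers in the table come from the real tokenizer.
sample = ["punjab di dharti", "machine learning di varton"]
print(round(mean_fertility(sample, toy_tokenize), 3))
```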

Training

Data

Gurmukh was trained on two splits from the Sangraha dataset:

| Split | Size | Script |
| --- | --- | --- |
| sangraha_gurmukhi | ~12 GB | Gurmukhi |
| sangraha_romanized | ~1.8 GB | Romanized Punjabi |
| Total | ~13.8 GB | |

Data was deduplicated and cleaned before training. The combined corpus contains approximately 2.5 billion tokens.

Training Configuration

| Setting | Value |
| --- | --- |
| Hardware | 4× NVIDIA Tesla T4 (16 GB VRAM each) |
| Precision | FP16 |
| Optimizer | AdamW (cosine decay, warmup 500 steps) |
| Batch size (effective) | 8 sequences × 2048 tokens |
| Training steps | 200,000 |
| Epochs | ~2.25 |
| Peak learning rate | 3×10⁻⁴ |
| DeepSpeed | ZeRO Stage 1 |
| Gradient checkpointing | Yes |
| Framework | PyTorch 2.5.1 + HuggingFace Transformers 4.46.0 |
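
The learning-rate schedule named in the table (500 warmup steps, cosine decay, peak 3×10⁻⁴ over 200,000 steps) can be sketched as follows. The linear warmup shape and the zero floor are assumptions, since the card does not state them, and `lr_at` is an illustrative helper rather than the training code.

```python
import math

PEAK_LR = 3e-4
WARMUP_STEPS = 500
TOTAL_STEPS = 200_000
MIN_LR = 0.0  # assumption: the card does not state a learning-rate floor


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to MIN_LR (assumed shape)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


# Print the rate at the start, mid-warmup, peak, mid-decay, and final step.
for step in (0, 250, 500, 100_250, 200_000):
    print(step, f"{lr_at(step):.2e}")
```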

Training ran for approximately 25 days. Final checkpoint evaluation loss: 2.8120.
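
A back-of-envelope sketch relating the figures above. It takes the effective batch size, step count, corpus size, and duration at face value; how each was measured is not stated in this card, so the result is an estimate under those assumptions.

```python
SEQ_LEN = 2048
SEQS_PER_STEP = 8        # "effective" batch size from the table, taken at face value
STEPS = 200_000
CORPUS_TOKENS = 2.5e9    # approximate corpus size from the Data section
TRAINING_DAYS = 25

tokens_per_step = SEQ_LEN * SEQS_PER_STEP
tokens_seen = tokens_per_step * STEPS
passes = tokens_seen / CORPUS_TOKENS
seconds_per_step = TRAINING_DAYS * 24 * 3600 / STEPS

print(f"tokens per step:  {tokens_per_step:,}")
print(f"tokens seen:      {tokens_seen / 1e9:.2f}B")
print(f"corpus passes:    {passes:.2f}")
print(f"seconds per step: {seconds_per_step:.1f}")
```

Taken at face value, these inputs give about 1.3 passes over the corpus, lower than the ~2.25 epochs listed in the table. The gap may come from how the effective batch or the corpus token count was accounted for (for example, a per-device batch size or a different token count), which the card does not specify.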

Evaluation

Perplexity was measured on held-out Punjabi text across three domains:

| Domain | Perplexity |
| --- | --- |
| News | 12.65 |
| Technical | 22.82 |
| Conversational | 53.18 |

A news perplexity of 12.65 is strong for a 370M Punjabi base model. The higher conversational perplexity is expected: the training corpus is predominantly formal and news text, and the model has not seen conversational or instruction-style data.
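
Perplexity and cross-entropy loss are two views of the same quantity: perplexity is the exponential of the mean negative log-likelihood. The sketch below assumes the reported loss and perplexities are per token and use the natural logarithm (the usual convention for a PyTorch cross-entropy loss); the card does not confirm this, and the helper names are illustrative.

```python
import math


def perplexity_from_loss(mean_nll: float) -> float:
    """Perplexity is exp of the mean negative log-likelihood (in nats)."""
    return math.exp(mean_nll)


def loss_from_perplexity(ppl: float) -> float:
    """Inverse mapping: mean negative log-likelihood in nats."""
    return math.log(ppl)


print(f"eval loss 2.8120 -> perplexity {perplexity_from_loss(2.8120):.2f}")
for domain, ppl in {"News": 12.65, "Technical": 22.82, "Conversational": 53.18}.items():
    print(f"{domain}: perplexity {ppl} -> loss {loss_from_perplexity(ppl):.3f} nats/token")
```

Under that assumption, the final evaluation loss of 2.8120 corresponds to a perplexity of about 16.6, which sits between the News and Technical figures.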

Generation Examples

All examples below use temperature=0.8, top_p=0.9, repetition_penalty=1.1.

Prompt: ਪੰਜਾਬ ਸਰਕਾਰ ਨੇ ਅੱਜ ਐਲਾਨ ਕੀਤਾ ਕਿ (The Punjab government today announced that)

ਪੰਜਾਬ ਸਰਕਾਰ ਨੇ ਅੱਜ ਐਲਾਨ ਕੀਤਾ ਕਿ ਉਨ੍ਹਾਂ ਦੀ ਸਰਕਾਰ ਨੇ ਸੂਬੇ 'ਚ 100 ਮੁਹੱਲਾ ਕਲੀਨਿਕ ਸ਼ੁਰੂ ਕਰਨ ਦੀ ਮਨਜ਼ੂਰੀ ਦੇ ਦਿੱਤੀ ਹੈ। ਇਸ ਦੇ ਨਾਲ ਹੀ ਮੁੱਖ ਮੰਤਰੀ ਭਗਵੰਤ ਮਾਨ ਨੇ ਅੱਜ ਵਿਧਾਨ ਸਭਾ ਸੈਸ਼ਨ ਦੀ ਕਾਰਵਾਈ ਵੀ ਮੁਲਤਵੀ ਕਰ ਦਿੱਤੀ ਹੈ...

Prompt: machine learning ਦੀ ਵਰਤੋਂ ਕਰਕੇ ਅਸੀਂ (Using machine learning we can)

machine learning ਦੀ ਵਰਤੋਂ ਕਰਕੇ ਅਸੀਂ ਉਨ੍ਹਾਂ ਦੇ ਹੁਨਰ ਨੂੰ ਨਿਖਾਰ ਸਕਦੇ ਹਾਂ। ਹਰ ਸਾਲ ਭਾਰਤ ਦੇ ਨੌਜਵਾਨਾਂ ਨੂੰ ਸਕਿੱਲ ਸਕਿੱਲਜ਼ ਜ਼ਰੀਏ ਆਪਣੇ ਹੁਨਰ ਦਾ ਵਿਕਾਸ ਕਰਨ ਦਾ ਮੌਕਾ ਮਿਲਦਾ ਹੈ...

The model handles code-mixed Punjabi (Gurmukhi + English terms) naturally.
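
For readers unfamiliar with the sampling settings used in these examples, the sketch below illustrates what temperature and top_p do to a next-token distribution. It is a simplified pure-Python illustration with made-up logits, not the Transformers implementation, and it leaves out repetition_penalty.

```python
import math
from typing import Dict


def nucleus_filter(logits: Dict[str, float], temperature: float = 0.8,
                   top_p: float = 0.9) -> Dict[str, float]:
    """Apply temperature, then keep the smallest set of tokens whose mass reaches top_p."""
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    # Softmax, subtracting the maximum for numerical stability.
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    norm = sum(exps.values())
    probs = {tok: e / norm for tok, e in exps.items()}
    # Walk tokens from most to least likely until the cumulative mass reaches top_p.
    kept: Dict[str, float] = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


# Hypothetical logits for three candidate tokens.
print(nucleus_filter({"a": 2.0, "b": 1.0, "c": -3.0}))
```

The very unlikely token "c" is dropped before sampling, which is how top_p trims the low-probability tail while still allowing some variety among the likely candidates.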

Intended Use

Gurmukh is a base language model — a foundation for further fine-tuning. Intended uses include:

  • Punjabi NLP research — text generation, language understanding, probing studies
  • Foundation for supervised fine-tuning (SFT) — instruction following, chat, question answering
  • Downstream tasks — sentiment analysis, summarisation, NER (with task-specific fine-tuning)
  • Voice pipeline — combined with an ASR front-end (e.g. Whisper fine-tuned on Punjabi) and a TTS back-end for spoken Punjabi interfaces

Limitations and Risks

  • Base model only. Gurmukh has not been instruction-tuned or safety-aligned. It will not follow instructions reliably and may produce harmful, biased, or factually incorrect text. Do not deploy as a chat assistant without SFT + RLHF/DPO alignment.
  • No conversational data. The training corpus is predominantly news and web text. The model has poor zero-shot performance on conversational or QA-style prompts.
  • Romanized Punjabi is weaker. The corpus is ~87% Gurmukhi by volume. Romanized generation quality is noticeably lower — the model may fall back to Gurmukhi mid-generation.
  • Knowledge cutoff. Training data is a static snapshot from the Sangraha dataset; the model has no awareness of events after that cutoff.
  • Hallucination. Like all autoregressive LMs, Gurmukh fabricates facts. Named entities, dates, and statistics in generated text must be verified independently.

How to Use

import sentencepiece as spm
import torch
from transformers import GPT2LMHeadModel

# Load the SentencePiece tokenizer (punjabi_spm_64k.model)
sp = spm.SentencePieceProcessor()
sp.Load("punjabi_spm_64k.model")

# Load the model weights. The tokenizer is used directly through SentencePiece here;
# transformers AutoTokenizer is an alternative if the repo is uploaded with a tokenizer_config.
model = GPT2LMHeadModel.from_pretrained("path/to/gurmukh-370m")
model.eval()

# Encode prompt
prompt = "ਪੰਜਾਬ ਦੀ ਧਰਤੀ"
ids = sp.EncodeAsIds(prompt)
input_ids = torch.tensor([ids])

# Generate
with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=200,
        temperature=0.8,
        top_p=0.9,
        repetition_penalty=1.1,
        do_sample=True,
    )

print(sp.Decode(output[0].tolist()))

Citation

If you use Gurmukh in your research, please cite:

@misc{gurmukh2026,
  title        = {Gurmukh: A 370M Parameter Punjabi Language Model},
  author       = {Singh, Balgeet},
  year         = {2026},
  note         = {Trained on Sangraha Gurmukhi and Romanized Punjabi datasets.
                  Model available at https://huggingface.co/balgeet/Gurmukh-370M-base},
}

Acknowledgements

  • Training data: Sangraha by AI4Bharat
  • Compute: Azure NC64as_T4_v3 VM (4× Tesla T4), Cloudeesy infrastructure
  • Framework: HuggingFace Transformers, DeepSpeed, SentencePiece