Gurmukh — 370M Punjabi Language Model
Gurmukh is a 370-million-parameter causal language model trained from scratch on Punjabi text. It is the first openly released GPT-2-scale base model dedicated to the Punjabi language, supporting both Gurmukhi script and Romanized Punjabi.
Model Details
| Property | Value |
|---|---|
| Model name | Gurmukh |
| Architecture | GPT-2 (GPT2LMHeadModel) |
| Parameters | ~370M |
| Layers | 24 |
| Hidden size | 1024 |
| Attention heads | 16 |
| Context length | 2048 tokens |
| Vocabulary | 64,000 (SentencePiece) |
| Language | Punjabi (Gurmukhi + Romanized) |
| License | Apache 2.0 |
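As a sanity check, the ~370M figure can be approximately reproduced from the table. The sketch below assumes the standard GPT-2 block layout (fused QKV projection, a 4× wide MLP, biases on every linear layer, two LayerNorms per block) and tied input/output embeddings. Neither assumption is stated in this card, and `gpt2_param_count` is a hypothetical helper name.

```python
def gpt2_param_count(n_layer: int, d_model: int, vocab_size: int, n_ctx: int) -> int:
    """Count parameters of a standard GPT-2 stack with tied embeddings (assumed layout)."""
    attn = d_model * 3 * d_model + 3 * d_model   # fused QKV projection + bias
    attn += d_model * d_model + d_model          # attention output projection + bias
    mlp = d_model * 4 * d_model + 4 * d_model    # MLP up-projection (assumed 4x width) + bias
    mlp += 4 * d_model * d_model + d_model       # MLP down-projection + bias
    norms = 2 * 2 * d_model                      # two LayerNorms, weight + bias each
    per_block = attn + mlp + norms

    embeddings = vocab_size * d_model + n_ctx * d_model  # token + position tables
    final_norm = 2 * d_model                             # final LayerNorm, weight + bias
    return n_layer * per_block + embeddings + final_norm


# Values taken from the Model Details table.
total = gpt2_param_count(n_layer=24, d_model=1024, vocab_size=64_000, n_ctx=2048)
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the count comes to 369,944,576, which agrees with the ~370M in the table. Roughly 65.5M of that is the 64,000-entry token embedding table.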
Tokenizer
Gurmukh uses a custom SentencePiece BPE tokenizer (punjabi_spm_64k.model) with a 64,000-token vocabulary trained on the same Punjabi corpus. The tokenizer is highly efficient for Gurmukhi script:
| Script | Mean Fertility (tokens/word) |
|---|---|
| Gurmukhi | 1.105 |
| Mixed (Gurmukhi + English) | 1.030 |
| Romanized Punjabi | 1.333 |
Fertility is the average number of tokens the tokenizer produces per word. A value near 1.0 means almost every Punjabi word maps to a single token, so the vocabulary is well suited to the language.
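A minimal sketch of how a fertility figure like those above could be computed. It assumes words are whitespace-delimited and that fertility is the corpus-level ratio of tokens to words; the card does not say which exact definition or evaluation text was used. `mean_fertility` and `toy_tokenize` are hypothetical names, and the toy tokenizer is only a stand-in for the real SentencePiece model.

```python
from typing import Callable, Iterable, List


def mean_fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Total tokens divided by total whitespace-delimited words."""
    n_words = 0
    n_tokens = 0
    for text in texts:
        n_words += len(text.split())
        n_tokens += len(tokenize(text))
    if n_words == 0:
        raise ValueError("no words in input")
    return n_tokens / n_words


def toy_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (NOT the real one): splits words longer than 4 characters in two."""
    pieces: List[str] = []
    for word in text.split():
        if len(word) > 4:
            pieces.extend([word[:4], word[4:]])
        else:
            pieces.append(word)
    return pieces


sample = ["punjab di dharti"]
print(f"toy fertility: {mean_fertility(sample, toy_tokenize):.3f}")
```

To measure the real tokenizer, `tokenize` could be replaced with the `EncodeAsPieces` method of a `SentencePieceProcessor` loaded from `punjabi_spm_64k.model`.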
Training
Data
Gurmukh was trained on two splits from the Sangraha dataset:
| Split | Size | Script |
|---|---|---|
| sangraha_gurmukhi | ~12 GB | Gurmukhi |
| sangraha_romanized | ~1.8 GB | Romanized Punjabi |
| Total | ~13.8 GB | |
Data was deduplicated and cleaned before training. The combined corpus contains approximately 2.5 billion tokens.
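The script balance of the corpus follows directly from the split sizes. The short calculation below uses the approximate sizes from the table and reproduces the ~87% Gurmukhi share quoted under Limitations and Risks; it measures share by file size, not by token count.

```python
# Approximate split sizes in GB, copied from the table above.
split_sizes_gb = {"sangraha_gurmukhi": 12.0, "sangraha_romanized": 1.8}

total_gb = sum(split_sizes_gb.values())
shares = {name: size / total_gb for name, size in split_sizes_gb.items()}

print(f"total: ~{total_gb:.1f} GB")
for name, share in shares.items():
    print(f"{name}: {share:.0%} of the corpus by size")
```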
Training Configuration
| Setting | Value |
|---|---|
| Hardware | 4× NVIDIA Tesla T4 (16 GB VRAM each) |
| Precision | FP16 |
| Optimizer | AdamW (cosine decay, warmup 500 steps) |
| Batch size (effective) | 8 sequences × 2048 tokens |
| Training steps | 200,000 |
| Epochs | ~2.25 |
| Peak learning rate | 3×10⁻⁴ |
| DeepSpeed | ZeRO Stage 1 |
| Gradient checkpointing | Yes |
| Framework | PyTorch 2.5.1 + HuggingFace Transformers 4.46.0 |
Training ran for approximately 25 days. Final checkpoint evaluation loss: 2.8120.
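The learning-rate schedule named in the table (cosine decay, 500 warmup steps, peak 3×10⁻⁴, 200,000 steps) can be sketched as follows. The warmup shape and the final learning rate are not given in this card, so the code assumes a linear warmup from zero and a decay to zero at the last step; `lr_at` is a hypothetical helper, not the training script's own function.

```python
import math

PEAK_LR = 3e-4          # from the table
WARMUP_STEPS = 500      # from the table
TOTAL_STEPS = 200_000   # from the table


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (assumed: linear warmup, cosine decay to 0)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))


for step in (0, 250, 500, 100_250, 200_000):
    print(f"step {step:>7}: lr = {lr_at(step):.2e}")
```

With these assumptions the rate reaches its peak at step 500, falls to half the peak midway through the decay phase (step 100,250), and ends at zero.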
Evaluation
Perplexity was measured on held-out Punjabi text across three domains:
| Domain | Perplexity |
|---|---|
| News | 12.65 |
| Technical | 22.82 |
| Conversational | 53.18 |
News text is the model's strongest domain, with a perplexity of 12.65. The higher conversational perplexity is expected: the training corpus is predominantly formal/news text, and the model has not seen conversational or instruction-style data.
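Perplexity and cross-entropy loss are two views of the same quantity: perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below assumes the reported evaluation loss is a mean per-token cross-entropy in nats (the usual convention for PyTorch training loops, but not stated in this card). The function names are illustrative.

```python
import math


def perplexity(mean_nll_nats: float) -> float:
    """Perplexity from a mean per-token negative log-likelihood in nats."""
    return math.exp(mean_nll_nats)


def loss_from_perplexity(ppl: float) -> float:
    """Inverse mapping: mean per-token loss in nats from a perplexity."""
    return math.log(ppl)


final_eval_loss = 2.8120  # final checkpoint evaluation loss reported in the Training section
print(f"eval loss {final_eval_loss} -> perplexity {perplexity(final_eval_loss):.2f}")

domain_ppl = {"News": 12.65, "Technical": 22.82, "Conversational": 53.18}
for domain, ppl in domain_ppl.items():
    print(f"{domain}: perplexity {ppl} -> loss {loss_from_perplexity(ppl):.3f} nats/token")
```

Under that assumption the final evaluation loss of 2.8120 corresponds to a perplexity of about 16.6, which sits between the news and technical figures in the table. The card does not say which held-out set the evaluation loss was measured on.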
Generation Examples
All examples below use temperature=0.8, top_p=0.9, repetition_penalty=1.1.
Prompt: ਪੰਜਾਬ ਸਰਕਾਰ ਨੇ ਅੱਜ ਐਲਾਨ ਕੀਤਾ ਕਿ (The Punjab government today announced that)
ਪੰਜਾਬ ਸਰਕਾਰ ਨੇ ਅੱਜ ਐਲਾਨ ਕੀਤਾ ਕਿ ਉਨ੍ਹਾਂ ਦੀ ਸਰਕਾਰ ਨੇ ਸੂਬੇ 'ਚ 100 ਮੁਹੱਲਾ ਕਲੀਨਿਕ ਸ਼ੁਰੂ ਕਰਨ ਦੀ ਮਨਜ਼ੂਰੀ ਦੇ ਦਿੱਤੀ ਹੈ। ਇਸ ਦੇ ਨਾਲ ਹੀ ਮੁੱਖ ਮੰਤਰੀ ਭਗਵੰਤ ਮਾਨ ਨੇ ਅੱਜ ਵਿਧਾਨ ਸਭਾ ਸੈਸ਼ਨ ਦੀ ਕਾਰਵਾਈ ਵੀ ਮੁਲਤਵੀ ਕਰ ਦਿੱਤੀ ਹੈ...
Prompt: machine learning ਦੀ ਵਰਤੋਂ ਕਰਕੇ ਅਸੀਂ (Using machine learning we can)
machine learning ਦੀ ਵਰਤੋਂ ਕਰਕੇ ਅਸੀਂ ਉਨ੍ਹਾਂ ਦੇ ਹੁਨਰ ਨੂੰ ਨਿਖਾਰ ਸਕਦੇ ਹਾਂ। ਹਰ ਸਾਲ ਭਾਰਤ ਦੇ ਨੌਜਵਾਨਾਂ ਨੂੰ ਸਕਿੱਲ ਸਕਿੱਲਜ਼ ਜ਼ਰੀਏ ਆਪਣੇ ਹੁਨਰ ਦਾ ਵਿਕਾਸ ਕਰਨ ਦਾ ਮੌਕਾ ਮਿਲਦਾ ਹੈ...
The model handles code-mixed Punjabi (Gurmukhi + English terms) naturally.
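For readers unfamiliar with the sampling settings used for these examples, the sketch below shows in plain Python what `temperature`, `top_p`, and `repetition_penalty` do to a vector of next-token logits. It follows the commonly used formulation (penalty divides positive logits and multiplies negative ones, then temperature scaling, then nucleus filtering); it is an illustration on made-up logits, not the code path of any particular library.

```python
import math
import random
from typing import Dict, List


def apply_repetition_penalty(logits: List[float], seen_ids: List[int], penalty: float) -> List[float]:
    """Make already-generated token ids less likely."""
    out = list(logits)
    for i in set(seen_ids):
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out


def top_p_distribution(logits: List[float], temperature: float, top_p: float) -> Dict[int, float]:
    """Temperature-scaled softmax restricted to the smallest set with mass >= top_p."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for numerical stability
    norm = sum(exps)
    probs = [e / norm for e in exps]

    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: List[int] = []
    cumulative = 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalise over the kept tokens


logits = [2.0, 1.0, 0.5, -1.0]  # made-up logits for a 4-token vocabulary
penalised = apply_repetition_penalty(logits, seen_ids=[0], penalty=1.1)
dist = top_p_distribution(penalised, temperature=0.8, top_p=0.9)

random.seed(0)
next_id = random.choices(list(dist), weights=list(dist.values()))[0]
print("candidate ids:", sorted(dist), "sampled id:", next_id)
```

In this toy case the least likely token falls outside the 0.9 nucleus and can never be sampled, while the penalty lowers the score of token 0 because it has already appeared.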
Intended Use
Gurmukh is a base language model — a foundation for further fine-tuning. Intended uses include:
- Punjabi NLP research — text generation, language understanding, probing studies
- Foundation for supervised fine-tuning (SFT) — instruction following, chat, question answering
- Downstream tasks — sentiment analysis, summarisation, NER (with task-specific fine-tuning)
- Voice pipeline — combined with an ASR front-end (e.g. Whisper fine-tuned on Punjabi) and a TTS back-end for spoken Punjabi interfaces
Limitations and Risks
- Base model only. Gurmukh has not been instruction-tuned or safety-aligned. It will not follow instructions reliably and may produce harmful, biased, or factually incorrect text. Do not deploy as a chat assistant without SFT + RLHF/DPO alignment.
- No conversational data. The training corpus is predominantly news and web text. The model has poor zero-shot performance on conversational or QA-style prompts.
- Romanized Punjabi is weaker. The corpus is ~87% Gurmukhi by volume. Romanized generation quality is noticeably lower — the model may fall back to Gurmukhi mid-generation.
- Knowledge cutoff. Training data is a static snapshot from the Sangraha dataset; the model has no awareness of events after that cutoff.
- Hallucination. Like all autoregressive LMs, Gurmukh fabricates facts. Named entities, dates, and statistics in generated text must be verified independently.
How to Use
```python
import sentencepiece as spm
import torch
from transformers import GPT2LMHeadModel

# Load the SentencePiece tokenizer directly from its model file.
# (AutoTokenizer can be used instead if the repository is uploaded with a tokenizer_config.)
sp = spm.SentencePieceProcessor()
sp.Load("punjabi_spm_64k.model")

# Load the model weights and switch to inference mode
model = GPT2LMHeadModel.from_pretrained("path/to/gurmukh-370m")
model.eval()

# Encode the prompt into token IDs and add a batch dimension
prompt = "ਪੰਜਾਬ ਦੀ ਧਰਤੀ"
ids = sp.EncodeAsIds(prompt)
input_ids = torch.tensor([ids])

# Sample a continuation with the same settings as the generation examples
with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=200,
        temperature=0.8,
        top_p=0.9,
        repetition_penalty=1.1,
        do_sample=True,
    )

# Decode the generated IDs (prompt included) back to text
print(sp.Decode(output[0].tolist()))
```
Citation
If you use Gurmukh in your research, please cite:
```bibtex
@misc{gurmukh2026,
  title  = {Gurmukh: A 370M Parameter Punjabi Language Model},
  author = {Singh, Balgeet},
  year   = {2026},
  note   = {Trained on Sangraha Gurmukhi and Romanized Punjabi datasets.
            Model available at https://huggingface.co/balgeet/Gurmukh-370M-base},
}
```
Acknowledgements
- Training data: Sangraha by AI4Bharat
- Compute: Azure NC64as_T4_v3 VM (4× Tesla T4), Cloudeesy infrastructure
- Framework: HuggingFace Transformers, DeepSpeed, SentencePiece