SanskritBERT (Light)
SanskritBERT is a lightweight Transformer model trained specifically for the Sanskrit language. It is based on the BERT architecture and trained using the Masked Language Modeling (MLM) objective.
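The MLM objective can be illustrated with a minimal, self-contained sketch of BERT's standard corruption rule (select ~15% of tokens; of those, replace 80% with `[MASK]`, 10% with a random vocabulary token, and leave 10% unchanged). This is generic BERT behavior for illustration, not code from this repository:

```python
import random

def mlm_mask(tokens, vocab, mask_token="[MASK]", mask_prob=0.15, rng=None):
    """BERT-style MLM corruption (illustrative sketch).

    Each token is selected with probability `mask_prob`; a selected token
    becomes `mask_token` 80% of the time, a random vocab token 10%, or is
    kept unchanged 10%. Returns (corrupted tokens, [(index, original)])
    so a model can be trained to recover the originals.
    """
    rng = rng or random.Random(0)
    out, labels = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels.append((i, tok))  # the model must predict `tok` here
            r = rng.random()
            if r < 0.8:
                out.append(mask_token)
            elif r < 0.9:
                out.append(rng.choice(vocab))
            else:
                out.append(tok)
        else:
            out.append(tok)
    return out, labels
```

During pre-training only the positions recorded in `labels` contribute to the cross-entropy loss; all other positions are ignored.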
Model Description
- Shared by: Tanuj Saxena and Soumya Sharma
- Model type: Transformer encoder (BERT-like)
- Language: Sanskrit
- License: Apache 2.0
- Finetuned from model: None (Trained from scratch)
Model Architecture
- Layers: 6
- Hidden Size: 256
- Attention Heads: 4
- Feedforward Size: 1024
- Max Sequence Length: 512
- Vocab Size: 64,000
- Parameters: ~15M
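Using the standard Hugging Face hyperparameter names, this architecture could be expressed as the following config sketch; the checkpoint's published `config.json` is the authoritative source:

```python
from transformers import BertConfig

# Hypothetical reconstruction of the hyperparameters listed above;
# defer to the model's own config.json for the exact values.
config = BertConfig(
    vocab_size=64_000,
    hidden_size=256,
    num_hidden_layers=6,
    num_attention_heads=4,
    intermediate_size=1024,
    max_position_embeddings=512,
)
```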
Intended Uses & Limitations
Intended Uses
- Masked Word Prediction
- Fine-tuning for Sanskrit NLP tasks (POS tagging, NER, text classification)
- Research into low-resource language modeling
Limitations
- The model is "Light", so it may not capture as much nuance as a `bert-base` or `bert-large` model.
- Performance depends heavily on the domain of the downstream task relative to the pre-training corpus.
Training Data
Trained on a corpus of Sanskrit text drawn from general literature, wikis, and classical works.
Training Procedure
- Optimizer: AdamW
- Precision: Mixed Precision (bf16)
- Batch Size: 16
- Epochs: 6
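The stated recipe could be expressed with `transformers.TrainingArguments` roughly as below. This is a hedged reconstruction: the output directory is a placeholder, and the learning rate, scheduler, and other settings are not published here:

```python
from transformers import TrainingArguments

# Sketch of the training setup listed above; the authors' exact
# script and remaining hyperparameters are not documented here.
training_args = TrainingArguments(
    output_dir="sanskritbert-mlm",   # placeholder path
    per_device_train_batch_size=16,  # Batch Size: 16
    num_train_epochs=6,              # Epochs: 6
    bf16=True,                       # Mixed Precision (bf16)
    optim="adamw_torch",             # AdamW optimizer
)
```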
How to Get Started
You can use the model directly with the Hugging Face transformers library:
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("tanuj437/SanskritBERT")
model = AutoModelForMaskedLM.from_pretrained("tanuj437/SanskritBERT")

text = "सत्यमेव जयते [MASK]"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# Decode the highest-scoring token at the [MASK] position
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
print(tokenizer.decode(outputs.logits[0, mask_index].argmax()))
```
Citation
@misc{sanskritbert2024,
  title={SanskritBERT: A Light Transformer Model for Sanskrit},
  author={Saxena, Tanuj and Sharma, Soumya and Lata, Kusum},
  year={2024},
  publisher={Hugging Face}
}