Sanskrit POS Tagger

This is a BERT/ALBERT-style model fine-tuned for Sanskrit part-of-speech (POS) tagging.

Model Description

Fine-tuning Data: Sanskrit POS dataset (Universal Dependencies format)
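
Universal Dependencies treebanks are normally distributed as CoNLL-U files, so the sketch below assumes the fine-tuning data uses that layout (the card does not say so explicitly). It reads the FORM and UPOS columns into (word, tag) pairs per sentence. The name `read_conllu` and the sample text are illustrative and are not part of this model's code.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # [(word form, UPOS tag), ...]


def read_conllu(lines: Iterable[str]) -> List[Sentence]:
    """Collect (FORM, UPOS) pairs per sentence from CoNLL-U lines."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed 10-column token line
        token_id, form, upos = cols[0], cols[1], cols[3]
        if "-" in token_id or "." in token_id:
            continue  # skip multiword-token ranges and empty nodes
        current.append((form, upos))
    if current:
        sentences.append(current)
    return sentences


# Hand-written sample for illustration, not taken from the actual dataset.
SAMPLE = (
    "# text = रामः वनम् गच्छति\n"
    "1\tरामः\t_\tPROPN\t_\t_\t3\tnsubj\t_\t_\n"
    "2\tवनम्\t_\tNOUN\t_\t_\t3\tobj\t_\t_\n"
    "3\tगच्छति\t_\tVERB\t_\t_\t0\troot\t_\t_\n"
)

parsed = read_conllu(SAMPLE.splitlines())
print(parsed)
```

For a real treebank, the same function can be called on an open file handle, since it only needs an iterable of lines.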

Intended Use

This model is intended for linguistic analysis of Sanskrit text, specifically for identifying the grammatical category of each word (noun, verb, and so on).

Performance

Accuracy: ~89.7%
F1 Score: ~89.6%
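
The card does not state how these figures were computed, for example whether F1 is macro- or weighted-averaged. As an illustration only, the sketch below computes token-level accuracy and macro-averaged F1 from gold and predicted tag sequences. The function names and the sample tags are hypothetical and do not reproduce the reported numbers.

```python
from collections import Counter
from typing import Sequence


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-tag F1 over every tag seen in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was the right tag but was missed
    tags = set(gold) | set(pred)
    if not tags:
        return 0.0
    scores = []
    for tag in tags:
        denom = 2 * tp[tag] + fp[tag] + fn[tag]
        scores.append(2 * tp[tag] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up tags for illustration only.
gold_tags = ["PROPN", "NOUN", "VERB", "NOUN"]
pred_tags = ["PROPN", "NOUN", "VERB", "VERB"]
print(accuracy(gold_tags, pred_tags), macro_f1(gold_tags, pred_tags))
```

Macro averaging weights every tag equally, so rare tags influence the score as much as frequent ones; a weighted average would give a different figure on the same predictions.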

Usage

from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification

# Load the tokenizer and the fine-tuned token-classification model from the Hub.
tokenizer = AutoTokenizer.from_pretrained("tanuj437/sanskrit-pos-bert")
model = AutoModelForTokenClassification.from_pretrained("tanuj437/sanskrit-pos-bert")

# aggregation_strategy="simple" groups subword pieces into merged predictions.
nlp = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
result = nlp("रामः वनम् गच्छति")  # "Rama goes to the forest"
print(result)
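
With `aggregation_strategy="simple"`, the token-classification pipeline returns a list of dictionaries that include `word`, `entity_group`, and `score` keys. A small helper, sketched here under the hypothetical name `to_word_tag_pairs`, turns that output into readable word/tag rows. The sample input is hand-written to mirror the output shape; its tags and scores are made up and are not real model output.

```python
from typing import Dict, List, Tuple


def to_word_tag_pairs(results: List[Dict]) -> List[Tuple[str, str, float]]:
    """Reduce pipeline output to (word, tag, rounded score) tuples."""
    return [
        (item["word"], item["entity_group"], round(float(item["score"]), 3))
        for item in results
    ]


# Hand-written stand-in for pipeline output; tags and scores are made up.
sample_output = [
    {"entity_group": "PROPN", "score": 0.98, "word": "रामः", "start": 0, "end": 4},
    {"entity_group": "NOUN", "score": 0.95, "word": "वनम्", "start": 5, "end": 9},
    {"entity_group": "VERB", "score": 0.99, "word": "गच्छति", "start": 10, "end": 16},
]

for word, tag, score in to_word_tag_pairs(sample_output):
    print(f"{word}\t{tag}\t{score}")
```

Because "simple" aggregation groups adjacent tokens that share a label, two neighbouring words with the same tag may come back as a single entry, so the `start` and `end` offsets are worth checking when word boundaries matter.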

Limitations

The model may be less accurate on verse (shlokas) and other text with complex sandhi unless the input is segmented into words beforehand. The tokenizer splits words into subword units, which only partly compensates for unsegmented input.
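
When the text has already been split into words, for example by a separate sandhi splitter, the words can be passed to a fast tokenizer with `is_split_into_words=True`, and subword predictions can be mapped back to words through `word_ids()`. The helper below is a sketch of that alignment step, assuming each word takes the label predicted for its first subword. `first_subword_labels` is a hypothetical name, and the inputs are hand-written so the block runs without downloading the model.

```python
from typing import List, Optional


def first_subword_labels(
    word_ids: List[Optional[int]], token_labels: List[str]
) -> List[str]:
    """Keep one label per word: the label predicted for its first subword.

    word_ids has the shape returned by a fast tokenizer's word_ids():
    None for special tokens, otherwise the index of the source word.
    """
    if len(word_ids) != len(token_labels):
        raise ValueError("word_ids and token_labels must have the same length")
    labels: List[str] = []
    previous: Optional[int] = None
    for word_id, label in zip(word_ids, token_labels):
        # A new word starts whenever the word index changes to a non-None value.
        if word_id is not None and word_id != previous:
            labels.append(label)
        previous = word_id
    return labels


# Hand-written example: special token, two subwords of word 0, word 1, word 2, special token.
example_word_ids = [None, 0, 0, 1, 2, None]
example_token_labels = ["X", "PROPN", "PROPN", "NOUN", "VERB", "X"]
print(first_subword_labels(example_word_ids, example_token_labels))
```

With the real model, `word_ids` would come from `tokenizer(words, is_split_into_words=True, return_tensors="pt").word_ids()` and `token_labels` from the argmax of the model's logits mapped through `model.config.id2label`.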

Citation

@misc{sanskrit-pos-bert,
  author = {Tanuj Saxena and Soumya Sharma},
  title = {Sanskrit POS Tagger using BERT},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/tanuj437/sanskrit-pos-bert}}
}

Safetensors

Model size: 21.3M params
Tensor type: F32

Model tree for tanuj437/sanskrit-bert-pos

Base model: tanuj437/SanskritBERT
Fine-tuned: this model
