Sanskrit POS Tagger

This model is a BERT/ALBERT model fine-tuned for Sanskrit part-of-speech (POS) tagging.

Model Description

  • Fine-tuning Data: Sanskrit POS dataset (Universal Dependencies format)

Intended Use

This model is intended for linguistic analysis of Sanskrit text, specifically for identifying the grammatical category of each word (noun, verb, etc.).

Performance

  • Accuracy: ~89.7%
  • F1 Score: ~89.6%

Usage

from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification

# Load the fine-tuned tokenizer and model from the Hub
tokenizer = AutoTokenizer.from_pretrained("tanuj437/sanskrit-pos-bert")
model = AutoModelForTokenClassification.from_pretrained("tanuj437/sanskrit-pos-bert")

# aggregation_strategy="simple" merges subword pieces back into whole words
nlp = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
result = nlp("रामः वनम् गच्छति")
print(result)
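The pipeline returns a list of dictionaries, one per aggregated word, each carrying the predicted tag, a confidence score, and character offsets. A minimal sketch of reducing that output to (word, tag) pairs is shown below; the sample result is hand-written for illustration, and the specific tags and keys assume the standard `token-classification` pipeline output with a UD-style label set:

```python
# Illustrative output shape of the token-classification pipeline.
# This sample is hand-written for demonstration; actual tags and scores
# depend on the model's label set (UD UPOS tags are assumed here).
sample_result = [
    {"entity_group": "NOUN", "score": 0.98, "word": "रामः", "start": 0, "end": 5},
    {"entity_group": "NOUN", "score": 0.97, "word": "वनम्", "start": 6, "end": 10},
    {"entity_group": "VERB", "score": 0.99, "word": "गच्छति", "start": 11, "end": 17},
]

def to_tag_pairs(result):
    """Reduce pipeline output to simple (word, tag) tuples."""
    return [(entry["word"], entry["entity_group"]) for entry in result]

print(to_tag_pairs(sample_result))
```

This keeps downstream code independent of the pipeline's full output schema, which is convenient when feeding the tags into further analysis.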

Limitations

The model may be less accurate on poetry (shlokas) with complex sandhi if the text is not pre-segmented, although the subword tokenizer handles some of this implicitly.

Citation

@misc{sanskrit-pos-bert,
  author = {Tanuj Saxena and Soumya Sharma},
  title = {Sanskrit POS Tagger using BERT},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/tanuj437/sanskrit-pos-bert}}
}