tanuj437/sanskrit-pos-tagged-corpus
This model is a fine-tuned BERT/ALBERT model for Sanskrit part-of-speech (POS) tagging.
It is intended for linguistic analysis of Sanskrit text, specifically for identifying the grammatical category of each word (noun, verb, etc.).
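from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification
# Load the tokenizer and the fine-tuned token-classification model from the Hub.
tokenizer = AutoTokenizer.from_pretrained("tanuj437/sanskrit-pos-bert")
model = AutoModelForTokenClassification.from_pretrained("tanuj437/sanskrit-pos-bert")
# aggregation_strategy="simple" merges subword pieces back into labeled spans.
nlp = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
# Tag a short sentence ("Rama goes to the forest") and print the labeled spans.
result = nlp("रामः वनम् गच्छति")
print(result)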
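The `result` printed above is a list of dictionaries, each carrying an `entity_group` label and `start`/`end` character offsets. With `aggregation_strategy="simple"`, adjacent tokens that share a label may be merged into a single span, which is inconvenient for POS tagging where one tag per word is wanted. The following is a minimal sketch of mapping spans back to whitespace-separated words. The helper `tag_words` and the sample labels `NOUN` and `VERB` are illustrative assumptions, not the model's documented label set, and the sample list stands in for real pipeline output.

```python
import re

def tag_words(text, entities):
    """Give each whitespace-delimited word the label of the first span overlapping it."""
    tagged = []
    for match in re.finditer(r"\S+", text):
        label = None  # stays None if no span overlaps the word
        for ent in entities:
            # Two character ranges overlap if each starts before the other ends.
            if ent["start"] < match.end() and ent["end"] > match.start():
                label = ent["entity_group"]
                break
        tagged.append((match.group(), label))
    return tagged

text = "रामः वनम् गच्छति"
# Hypothetical pipeline output: the two nouns have been merged into one span.
sample = [
    {"entity_group": "NOUN", "score": 0.99, "word": "रामः वनम्", "start": 0, "end": 9},
    {"entity_group": "VERB", "score": 0.98, "word": "गच्छति", "start": 10, "end": 16},
]
pairs = tag_words(text, sample)
print(pairs)
```

In practice, `sample` would be replaced by the `result` returned from `nlp(text)`, assuming the pipeline reports character offsets for the input.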
The model may be less accurate on poetry (shlokas) with complex sandhi unless the text is pre-segmented; the tokenizer's subword splitting attempts to compensate but does not replace sandhi splitting.
@misc{sanskrit-pos-bert,
author = {Tanuj Saxena and Soumya Sharma},
title = {Sanskrit POS Tagger using BERT},
year = {2025},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/tanuj437/sanskrit-pos-bert}}
}
Base model: tanuj437/SanskritBERT