# sanskrit-bert-from-scratch
This is a BERT model pre-trained from scratch on a large corpus of transliterated Sanskrit text. Unlike multilingual models, its tokenizer vocabulary and weights are tailored specifically to Sanskrit as represented in the training data.
This model was trained as part of the Intelexsus project. A companion model, continually pre-trained from `bert-base-multilingual-cased`, is available at OMRIDRORI/mbert-sanskrit-continual.
## Model Details
- Model type: BERT (`bert-base-uncased` architecture)
- Language: Sanskrit (sa)
- Training Corpus: A custom corpus of transliterated Sanskrit text collected for the Intelexsus project.
- Training objective: Masked Language Modeling (MLM).
- Architecture: 12 layers, 768 hidden size, 12 attention heads (verifiable from the model config; see the sketch below).
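These values can be confirmed without downloading the weights, since `AutoConfig.from_pretrained` fetches only the hosted `config.json`. A quick sketch:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("OMRIDRORI/sanskrit-bert-from-scratch")
# Expected for this architecture: 12 768 12
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)
```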
## How to Use
You can use this model directly with the `transformers` library for the `fill-mask` task.
```python
from transformers import pipeline

model_name = "OMRIDRORI/sanskrit-bert-from-scratch"
unmasker = pipeline("fill-mask", model=model_name)

# Example sentence in IAST transliteration:
# "The great sage spoke the following words: ___"
result = unmasker("sa maharṣir uvāca anena [MASK] vacanena")
print(result)
```
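The `fill-mask` pipeline returns a list of candidates, each a dict with `score`, `token`, `token_str`, and `sequence` keys, sorted with the highest score first. Continuing from the snippet above:

```python
# Print each candidate token and its probability
for candidate in result:
    print(f"{candidate['token_str']}\t{candidate['score']:.4f}")
```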
You can also load the model and tokenizer directly for more control:
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "OMRIDRORI/sanskrit-bert-from-scratch"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# The model and tokenizer are now ready for fine-tuning or inference.
```
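Continuing from the snippet above, here is a minimal sketch of manual masked-token prediction (assuming a PyTorch backend):

```python
import torch

model.eval()

# Tokenize a sentence containing the tokenizer's mask token
text = "sa maharṣir uvāca anena [MASK] vacanena"
inputs = tokenizer(text, return_tensors="pt")

# Forward pass without gradient tracking
with torch.no_grad():
    logits = model(**inputs).logits

# Find the [MASK] position and take the five highest-scoring vocabulary ids
mask_position = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1][0]
top_ids = logits[0, mask_position].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```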