# mbert-sanskrit-continual
This is a bert-base-multilingual-cased model that has been continually pre-trained on a corpus of transliterated Sanskrit text. The goal was to adapt the multilingual model to better understand the nuances and vocabulary of the Sanskrit language.
This model was trained as part of the Intelexsus project. A companion model, trained from scratch rather than continually pre-trained, is available at OMRIDRORI/sanskrit-bert-from-scratch.
## Model Details
- Base Model: bert-base-multilingual-cased
- Language: Sanskrit (sa)
- Training Corpus: A custom corpus of transliterated Sanskrit text collected for the Intelexsus project.
- Training Objective: Masked Language Modeling (MLM)
## How to Use
You can use this model directly with the transformers library for the fill-mask task.
```python
from transformers import pipeline

model_name = "OMRIDRORI/mbert-sanskrit-continual"
unmasker = pipeline('fill-mask', model=model_name)

# Example sentence in IAST transliteration:
# "The great sage spoke the following words: ___"
result = unmasker("sa maharṣir uvāca anena [MASK] vacanena")
print(result)
```
You can also load the model and tokenizer directly for more control:
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "OMRIDRORI/mbert-sanskrit-continual"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# The model and tokenizer are now ready for fine-tuning or inference.
```
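With the model and tokenizer loaded directly, mask filling can also be done by hand with plain PyTorch, which is useful when you need the raw logits rather than the pipeline's formatted output. The sketch below reuses the IAST example sentence from above:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "OMRIDRORI/mbert-sanskrit-continual"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

# Reuse the IAST example sentence, inserting the tokenizer's mask token
text = f"sa maharṣir uvāca anena {tokenizer.mask_token} vacanena"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the mask position and take the five highest-scoring tokens
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_ids = logits[0, mask_pos].topk(5, dim=-1).indices[0]
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```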
## Training Procedure
This model was initialized from the weights of bert-base-multilingual-cased and then continually pre-trained on a large corpus of transliterated Sanskrit text using the transformers library.
## Intended Use & Limitations
This model is intended for research and development in Sanskrit NLP. It can be used as a base model for fine-tuning on downstream tasks like text classification, named entity recognition, or question answering.
The model's knowledge is limited to the data it was trained on. It may not perform well on Sanskrit text that is stylistically very different from the training corpus or uses a different transliteration scheme.