# sanskrit-bert-from-scratch
This is a BERT model pre-trained from scratch on a large corpus of transliterated Sanskrit text. Unlike multilingual models, its tokenizer vocabulary and weights are tailored specifically to Sanskrit as represented in the training data.
This model was trained as part of the Intelexsus project. A companion model, continually pre-trained from `bert-base-multilingual-cased`, is available at OMRIDRORI/mbert-sanskrit-continual.
## Model Details
- Model type: BERT (`bert-base-uncased` architecture)
- Language: Sanskrit (sa)
- Training Corpus: A custom corpus of transliterated Sanskrit text collected for the Intelexsus project.
- Training objective: Masked Language Modeling (MLM).
- Architecture: 12 layers, 768 hidden size, 12 attention heads (verifiable from the model config; see the sketch below).
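These values can be confirmed without downloading the weights, since `AutoConfig.from_pretrained` fetches only the hosted `config.json`. A quick sketch:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("OMRIDRORI/sanskrit-bert-from-scratch")
# Expected for this architecture: 12 768 12
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)
```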
## How to Use
You can use this model directly with the `transformers` library for the `fill-mask` task.
```python
from transformers import pipeline

model_name = "OMRIDRORI/sanskrit-bert-from-scratch"
unmasker = pipeline("fill-mask", model=model_name)

# Example sentence in IAST transliteration:
# "The great sage spoke the following words: ___"
result = unmasker("sa maharṣir uvāca anena [MASK] vacanena")
print(result)
```
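The `fill-mask` pipeline returns a list of candidates, each a dict with `score`, `token`, `token_str`, and `sequence` keys, sorted with the highest score first. Continuing from the snippet above:

```python
# Print each candidate token and its probability
for candidate in result:
    print(f"{candidate['token_str']}\t{candidate['score']:.4f}")
```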
You can also load the model and tokenizer directly for more control:
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "OMRIDRORI/sanskrit-bert-from-scratch"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# The model and tokenizer are now ready for fine-tuning or inference.
```
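Continuing from the snippet above, here is a minimal sketch of manual masked-token prediction (assuming a PyTorch backend):

```python
import torch

model.eval()

# Tokenize a sentence containing the tokenizer's mask token
text = "sa maharṣir uvāca anena [MASK] vacanena"
inputs = tokenizer(text, return_tensors="pt")

# Forward pass without gradient tracking
with torch.no_grad():
    logits = model(**inputs).logits

# Find the [MASK] position and take the five highest-scoring vocabulary ids
mask_position = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1][0]
top_ids = logits[0, mask_position].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```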