---
language: sa
license: mit
library_name: transformers
tags:
- bert
- sanskrit
- fill-mask
---

# sanskrit-bert-from-scratch

This is a BERT model that has been pre-trained **from scratch** on a large corpus of transliterated Sanskrit text. Unlike multilingual models, its vocabulary and weights are tailored specifically to the Sanskrit language as represented in the training data.

This model was trained as part of the Intelexsus project. The companion model, which was continually trained from `bert-base-multilingual-cased`, can be found here: [OMRIDRORI/mbert-sanskrit-continual](https://huggingface.co/OMRIDRORI/mbert-sanskrit-continual).

## Model Details

- **Model type:** BERT (bert-base-uncased architecture)
- **Language:** Sanskrit (sa)
- **Training corpus:** A custom corpus of transliterated Sanskrit text collected for the Intelexsus project.
- **Training objective:** Masked Language Modeling (MLM).
- **Architecture:** 12 layers, 768 hidden size, 12 attention heads (see the config check below).
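
If you want to verify these hyperparameters against the hosted checkpoint, here is a minimal sketch (assuming the uploaded `config.json` matches the details above):

```python
from transformers import AutoConfig

# Load only the configuration, not the weights.
config = AutoConfig.from_pretrained("OMRIDRORI/sanskrit-bert-from-scratch")

# Expected, per the details above: 12 layers, hidden size 768, 12 attention heads.
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)
```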

## How to Use

You can use this model directly with the `transformers` library for the `fill-mask` task.

```python
from transformers import pipeline

# Model identifier on the Hugging Face Hub
model_name = "OMRIDRORI/sanskrit-bert-from-scratch"
unmasker = pipeline('fill-mask', model=model_name)

# Example sentence in IAST transliteration:
# "The great sage spoke the following words: ___"
result = unmasker("sa maharṣir uvāca anena [MASK] vacanena")

print(result)
```
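
The pipeline returns a list of candidate completions; each entry is a dict with `score`, `token`, `token_str`, and `sequence` keys, so you can inspect the top predictions with a loop like this (continuing the example above):

```python
for prediction in result:
    print(f"{prediction['token_str']}: {prediction['score']:.4f}")
```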

You can also load the model and tokenizer directly for more control:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Model identifier on the Hugging Face Hub
model_name = "OMRIDRORI/sanskrit-bert-from-scratch"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# The model and tokenizer are now ready for fine-tuning or inference.
```
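
For inference without the pipeline, the sketch below scores the same masked example sentence directly with the loaded model and tokenizer (assuming PyTorch is installed; the actual top tokens depend on the trained checkpoint):

```python
import torch

text = "sa maharṣir uvāca anena [MASK] vacanena"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and take the five highest-scoring token ids.
mask_position = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1][0]
top_ids = logits[0, mask_position].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```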