mbert-tibetan-continual-unicode-240k

This repository is public.

Overview

This is a BERT model obtained by continued pre-training of bert-base-multilingual-cased (mBERT) on Tibetan data.

It was trained as part of the Intelexsus project on a mixed Tibetan corpus that includes:

  • Tibetan text written in the original Tibetan script (Unicode)
  • Data originally in Wylie transliteration that was converted into Tibetan script

The aim is to improve Tibetan representations for downstream tasks while preserving compatibility with multilingual BERT.

Model Details

  • Base model: bert-base-multilingual-cased
  • Language: Tibetan (bo)
  • Training objective: Masked Language Modeling (MLM)
  • Architecture: 12 layers, hidden size 768, 12 attention heads
  • Tokenizer: WordPiece tokenizer compatible with mBERT (includes Tibetan Unicode support)

How to Use

You can use this model directly with the transformers library for the fill-mask task.

from transformers import pipeline

model_name = "OMRIDRORI/mbert-tibetan-continual-unicode-240k"
unmasker = pipeline("fill-mask", model=model_name)

# Example sentence in Tibetan (for illustration only)
result = unmasker("བོད་ཡིག་ [MASK] ཡིན་པ་རེད།")
print(result)
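
The pipeline returns a list of candidate fillings, each a dictionary with `score`, `token`, `token_str` and `sequence` keys. A minimal sketch for printing them as a ranked list follows. `format_predictions` is a hypothetical helper, not part of transformers, and the sample values are placeholders rather than real model output.

```python
def format_predictions(predictions, top_n=3):
    """Render fill-mask pipeline output as ranked 'token (score)' lines.

    `predictions` is the list of dicts returned by the pipeline for a
    single-mask input; each dict carries `token_str` and `score`.
    """
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [
        f"{rank}. {p['token_str']} ({p['score']:.3f})"
        for rank, p in enumerate(ranked[:top_n], start=1)
    ]


# Placeholder data in the pipeline's output shape (not real model output).
sample = [
    {"score": 0.12, "token": 2, "token_str": "b", "sequence": "x b"},
    {"score": 0.61, "token": 1, "token_str": "a", "sequence": "x a"},
]
for line in format_predictions(sample):
    print(line)
```

Passing the `result` list from the snippet above to `format_predictions` would list the model's top candidates for the masked position.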

You can also load the model and tokenizer directly for more control:

from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "OMRIDRORI/mbert-tibetan-continual-unicode-240k"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
# You can now use the model for your own fine-tuning and inference tasks.
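
With the model and tokenizer loaded this way, the masked position can be scored by hand instead of through the pipeline. The sketch below assumes PyTorch is installed and that the input contains one `[MASK]` token. `top_k_at_mask` is a hypothetical helper name, and the tensors in the demo are dummy values standing in for real model output.

```python
import torch


def top_k_at_mask(logits, input_ids, mask_token_id, k=5):
    """Return (token_id, probability) pairs for the first masked position.

    logits:    tensor of shape (1, seq_len, vocab_size) from the MLM head
    input_ids: tensor of shape (1, seq_len) produced by the tokenizer
    """
    mask_positions = (input_ids[0] == mask_token_id).nonzero(as_tuple=True)[0]
    if mask_positions.numel() == 0:
        raise ValueError("input contains no mask token")
    probs = logits[0, mask_positions[0]].softmax(dim=-1)
    top = torch.topk(probs, k)
    return list(zip(top.indices.tolist(), top.values.tolist()))


# Dummy demo: vocabulary of 10 ids, mask id 4 at position 1.
demo_ids = torch.tensor([[1, 4, 2]])
demo_logits = torch.zeros(1, 3, 10)
demo_logits[0, 1, 7] = 10.0  # make id 7 the clear favourite
demo_top = top_k_at_mask(demo_logits, demo_ids, mask_token_id=4, k=2)
print(demo_top)
```

For real input, the arguments would come from `inputs = tokenizer(text, return_tensors="pt")`, `model(**inputs).logits` (ideally under `torch.no_grad()`), `inputs["input_ids"]` and `tokenizer.mask_token_id`. `tokenizer.convert_ids_to_tokens` maps the returned ids back to WordPiece tokens.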

Training Data

The continual training used a Tibetan corpus consisting of:

  • Native Tibetan text in Unicode (U+0F00–U+0FFF block)
  • Wylie transliterated data converted into Tibetan script prior to training

This combination aims to cover both native-script Tibetan and content originally prepared in transliteration that has been normalized to Unicode Tibetan.
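
Since the corpus was normalized to Unicode Tibetan, inputs to the model are presumably best given in Tibetan script rather than in Wylie. A rough sketch for checking how much of a string falls in the Tibetan block is shown below. `tibetan_ratio` is a hypothetical helper for illustration and is not part of the project's training pipeline.

```python
TIBETAN_BLOCK = range(0x0F00, 0x1000)  # U+0F00 through U+0FFF


def tibetan_ratio(text):
    """Fraction of non-whitespace characters that fall in the Tibetan block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    in_block = sum(1 for ch in chars if ord(ch) in TIBETAN_BLOCK)
    return in_block / len(chars)


print(tibetan_ratio("བོད་ཡིག"))  # Tibetan script only -> 1.0
print(tibetan_ratio("bod yig"))  # Wylie transliteration (Latin letters) -> 0.0
```

A low ratio on text that is expected to be Tibetan suggests it is still in transliteration and would need converting before being passed to the model.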

Intended Use and Limitations

This model is intended for research and downstream tasks involving Tibetan. It may reproduce biases present in the training data and may not perform well on non-Tibetan text.

Citation

If you use this model, please cite the Intelexsus project or link to the model page: https://huggingface.co/OMRIDRORI/mbert-tibetan-continual-unicode-240k
