---
license: mit
datasets:
- billingsmoore/84000-bo-en
- billingsmoore/LotsawaHouse-bo-en
- openpecha/cleaned_MT_v1.0.3
language:
- bo
- en
tags:
- Tibetan
- nlp
- dharma
- Buddhism
---
# Getok: Custom Tokenizer for Tibetan Buddhist Texts
This is a custom Byte Pair Encoding (BPE) tokenizer built specifically for Tibetan Buddhist texts. It was trained with the SentencePiece/Hugging Face Tokenizers libraries and is designed to tokenize text efficiently for downstream NLP tasks. The tokenizer supports Unicode text in both Tibetan and English and was trained on a domain-specific corpus of Tibetan Buddhist texts and their English translations.
This model was developed as part of the MLotsawa project. [More information can be found here.](https://github.com/billingsmoore/MLotsawa)
*Special thanks to Andres Montano for suggesting the name of this tokenizer.*
## Use Cases
- Preprocessing for text classification, translation, summarization, or language modeling
- Training or fine-tuning language models for tasks related to Tibetan Buddhism
## Details
**Tokenizer Type**: BPE (Byte Pair Encoding)
**Vocabulary Size**: 32,000
**Normalization**: None
**Special Tokens**: `[PAD]`, `[BOS]`, `[EOS]`, `<unk>`
**Tokenization Level**: Subword
**Languages**: Tibetan, English
## Training Data
The tokenizer was trained on a corpus consisting of:
- 879,132 sentences of Tibetan from Buddhist texts
- 879,132 sentences of English from translations of the Tibetan texts
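The original training script is not published here, but a comparable BPE tokenizer can be trained with the Hugging Face `tokenizers` library. In the sketch below, the vocabulary size and special tokens mirror the Details section; the tiny corpus and all other trainer settings are illustrative assumptions, not the original configuration:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Tiny stand-in corpus; the real training data was ~879k paired
# Tibetan and English sentences, as described above.
corpus = [
    "འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཚལ་ལོ༔",
    "Homage to Manjushri, the youthful one.",
]

# BPE model using the special tokens listed in the Details section
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32000,  # matches the published vocabulary size
    special_tokens=["[PAD]", "[BOS]", "[EOS]", "<unk>"],
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Encode a phrase seen in training into subword tokens
encoding = tokenizer.encode("ཕྱག་འཚལ་ལོ༔")
```

With a corpus this small the learned vocabulary stays far below 32,000 entries; the trainer simply stops when no more merges are possible.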
## Usage
```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained('billingsmoore/getok-v0')
# Encode a Tibetan sentence into subword token ids
ids = tokenizer.encode('འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཚལ་ལོ༔')
# Decode the ids back into text
text = tokenizer.decode(ids)
```
## Limitations
The tokenizer currently supports Unicode text in Tibetan or Latin script. However, it was trained only on Tibetan and English texts and should not be expected to perform well on other languages that use those scripts (e.g. Dzongkha, French).
This tokenizer is not suitable for languages written in other scripts (e.g. Greek, Russian).
Fine-tuning a pretrained model with this tokenizer should be expected to take longer than fine-tuning with the model's original tokenizer, because the model must adapt to the new encodings.
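When swapping in this tokenizer, the pretrained model's embedding matrix must also be resized to Getok's 32,000-entry vocabulary before training. A minimal sketch using a small, randomly initialized T5 as a stand-in (the model type and its dimensions are illustrative assumptions; in practice you would load a real checkpoint with `from_pretrained`):

```python
from transformers import T5Config, T5ForConditionalGeneration

# Toy T5 with a mismatched vocabulary, standing in for a real pretrained model
config = T5Config(vocab_size=1000, d_model=64, d_ff=128,
                  num_layers=2, num_decoder_layers=2, num_heads=2)
model = T5ForConditionalGeneration(config)

# Resize the input/output embeddings to Getok's vocabulary size;
# the new rows are randomly initialized and learned during fine-tuning
model.resize_token_embeddings(32000)
```

The freshly initialized embedding rows are one reason fine-tuning with a replacement tokenizer converges more slowly than fine-tuning with the model's own.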
## Author & Contact
**Author**: billingsmoore
**Contact**: billingsmoore [at] gmail [dot] com |