---
license: mit
datasets:
- billingsmoore/84000-bo-en
- billingsmoore/LotsawaHouse-bo-en
- openpecha/cleaned_MT_v1.0.3
language:
- bo
- en
tags:
- Tibetan
- nlp
- dharma
- Buddhism
---
# Getok: Custom Tokenizer for Tibetan Buddhist Texts
This is a custom Byte Pair Encoding (BPE) tokenizer built specifically for Tibetan Buddhist texts. It was trained with the SentencePiece/Hugging Face Tokenizers libraries and is designed to tokenize text efficiently for downstream NLP tasks. The tokenizer supports Unicode text in both Tibetan and English and was trained on a domain-specific corpus of Tibetan Buddhist texts and their English translations.
This model was developed as part of the MLotsawa project. [More information can be found here.](https://github.com/billingsmoore/MLotsawa)
*Special thanks to Andres Montano for suggesting the name of this tokenizer.*
## Use Cases
- Preprocessing for text classification, translation, summarization, or language modeling
- Training or fine-tuning language models for tasks related to Tibetan Buddhism
## Details
**Tokenizer Type**: BPE (Byte Pair Encoding)
**Vocabulary Size**: 32,000
**Normalization**: None
**Special Tokens**: `[PAD]`, `[BOS]`, `[EOS]`, `<unk>`
**Tokenization Level**: Subword
**Languages**: Tibetan, English
## Training Data
The tokenizer was trained on a corpus consisting of:
- 879,132 sentences of Tibetan from Buddhist texts
- 879,132 sentences of English from translations of the Tibetan texts
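The original training script is not published here, but a comparable BPE tokenizer can be trained with the Hugging Face `tokenizers` library. In the sketch below, the vocabulary size and special tokens mirror the Details section; the tiny corpus and all other trainer settings are illustrative assumptions, not the original configuration:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Tiny stand-in corpus; the real training data was ~879k paired
# Tibetan and English sentences, as described above.
corpus = [
    "འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཚལ་ལོ༔",
    "Homage to Manjushri, the youthful one.",
]

# BPE model using the special tokens listed in the Details section
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32000,  # matches the published vocabulary size
    special_tokens=["[PAD]", "[BOS]", "[EOS]", "<unk>"],
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Encode a phrase seen in training into subword tokens
encoding = tokenizer.encode("ཕྱག་འཚལ་ལོ༔")
```

With a corpus this small the learned vocabulary stays far below 32,000 entries; the trainer simply stops when no more merges are possible.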
## Usage
```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained('billingsmoore/getok-v0')
# Encode a Tibetan sentence into subword token ids
ids = tokenizer.encode('འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཚལ་ལོ༔')
# Decode the ids back into text
text = tokenizer.decode(ids)
```
## Limitations
The tokenizer currently supports Unicode text in Tibetan or Latin script. However, it was trained only on Tibetan and English texts and should not be expected to perform well on other languages that use those scripts (e.g. Dzongkha, French).
This tokenizer is not suitable for languages written in other scripts (e.g. Greek, Russian).
Fine-tuning a pretrained model with this tokenizer should be expected to take longer than fine-tuning with the model's original tokenizer, because the model must adapt to the new encodings.
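When swapping in this tokenizer, the pretrained model's embedding matrix must also be resized to Getok's 32,000-entry vocabulary before training. A minimal sketch using a small, randomly initialized T5 as a stand-in (the model type and its dimensions are illustrative assumptions; in practice you would load a real checkpoint with `from_pretrained`):

```python
from transformers import T5Config, T5ForConditionalGeneration

# Toy T5 with a mismatched vocabulary, standing in for a real pretrained model
config = T5Config(vocab_size=1000, d_model=64, d_ff=128,
                  num_layers=2, num_decoder_layers=2, num_heads=2)
model = T5ForConditionalGeneration(config)

# Resize the input/output embeddings to Getok's vocabulary size;
# the new rows are randomly initialized and learned during fine-tuning
model.resize_token_embeddings(32000)
```

The freshly initialized embedding rows are one reason fine-tuning with a replacement tokenizer converges more slowly than fine-tuning with the model's own.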
## Author & Contact
**Author**: billingsmoore
**Contact**: billingsmoore [at] gmail [dot] com |